Sharding Library for SQL Azure Data Federation

The ability to write scale out applications for SQL Azure will soon become much easier, thanks to the upcoming SQL Azure Data Federation capabilities. In an earlier post, I outlined specific steps developers can take to prepare for this key enhancement.  Some of the key capabilities of SQL Azure Data Federation is to distribute large data sets across multiple databases, hence providing a mechanism to avoid the current database size limitations of SQL Azure.

While it may be tempting to look at this new feature as a workaround to an existing size limitation, the real benefit of federating data is to give each data set its own set of independent processing power; indeed each federation member is assigned its own CPU, storage, memory, TempDB and network bandwidth. In other words, Data Federation provides the basic foundation for a scale out database model, typically needed by SaaS applications and historical data sets.

Distributed Needs

Since data stored in federation members are actually stored across multiple databases, certain capabilities taken for granted in typical development environments are not available just yet.  And with this type of new features, new needs are surfacing.  In a nutshell, some of the capabilities currently missing include:

-          The ability to fan out queries when running a query across federation members

-          The ability to perform basic joins across federation members to exclude data sets

-          The ability to execute simple aggregated functions such as MIN, MAX... across federated members

Since these capabilities are not yet supported by the Data Federation feature, it is necessary to leverage parallel processing on the client (or middle-tier) application in order to efficiently query across federation members (i.e. databases).

Open-Source Enzo Library

In order to provide support the above capabilities across federation members, I updated the Enzo Shard Library open-source project (http://enzosqlshard.codeplex.com/releases/view/72791), which is currently in Version 2.0 Beta. This updated library offers new important features, including support for Data Federation. A new class (called the Distributed Query) allows developers to execute a simple SELECT query across federation members. For example, a Distributed Query that fetches customer data across federation members (hence databases) to return the Maximum CustomerID would look like this:

SELECT MAX(CustomerId) USING (select customerid from customers) FEDERATED ON (customerfederation)

The above statement is interpreted by the Enzo library; the inner statement (select customerid from customers) is the T-SQL part that is actually executed across each federation member in parallel against SQL Azure. Then the library aggregates the results and returns the maximum customer id from all the databases involved in the federation (see the details on CodePlex for a complete specification of the Distributed Query).

There are many new features available in this library, which will be the subject of my talk at the PASS Conference 2011 in October (more information).

 

Twitter