As a developer you are familiar with web servers and database servers. Both serve data, in different ways, and this creates an interesting challenge. Let’s face it: accessing data is hard. Data lives in databases, in XML documents, in PDF documents, in flat files, on FTP servers in proprietary formats, in cubes, in NoSQL stores… you name it. In addition to those storage formats, data is made available in a large variety of file formats, through a multitude of protocols, and serviced by an ever-increasing set of providers, each with its own authentication and authorization implementation. But the storage of data and the way to access it isn’t the only challenge. Who wants to access all this data? Developers, DBAs, report writers, systems integrators, consultants, managers, business users, office workers, applications…
What’s the real problem?
Accessing data is hard because of three forces: storage formats, protocols, and consumers. Different consumers need the same data delivered in different ways because they consume it differently. When the data is needed by an application (such as a mobile app), it is best accessed through REST requests returning JSON documents, which is usually best accomplished by servicing the data through a web server. But if the data is needed by a report (such as Excel, or a cube for analytical purposes), it is usually best accessed using SQL commands, which is usually best accomplished using a database server.
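To make the contrast concrete, here is a minimal Python sketch of the two consumption styles; the sample record, table, and function names are hypothetical, and an in-memory SQLite database stands in for a real database server:

```python
import json
import sqlite3

# Hypothetical example: the same customer record serviced two ways.
record = {"id": 1, "name": "Acme Corp", "balance": 1250.50}

# A REST-style consumer (e.g. a mobile app) wants a JSON document.
def as_json(rec):
    return json.dumps(rec)

# A reporting consumer wants the same data as an SQL result set.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, name TEXT, balance REAL)")
conn.execute("INSERT INTO customers VALUES (?, ?, ?)",
             (record["id"], record["name"], record["balance"]))

row = conn.execute("SELECT name, balance FROM customers WHERE id = 1").fetchone()

print(as_json(record))   # {"id": 1, "name": "Acme Corp", "balance": 1250.5}
print(row)               # ('Acme Corp', 1250.5)
```

Today, keeping those two views in sync typically means copying the data; the point of the rest of this article is that it shouldn't have to.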
This creates an interesting challenge: who are the consumers for a given set of data, and through which protocol should it be accessed? This simple question is hard to answer because, for a given set of data, the consumers change over time while the data stores and protocols do not. The challenge is usually resolved by hiring consultants (or spending a lot of time with internal resources) to build bridges that move or copy data from one storage format to another, so it can be used by the consumers that need the data in a specific format; that’s why integration tools are so popular: let’s copy the data from point A to points B and C, so it can be used by consumers (business partners, reports, executives…). All this takes time and money. Copying NoSQL data into a relational database or a cube, so it can be reported on, takes effort. Extracting flat files from an FTP site daily and loading them into a database, so they can be used by a web application, requires the creation of a complex set of programs to orchestrate this work.
As a result, the difficulty in making data access ubiquitous inhibits certain companies from making timely decisions, because the necessary data is not immediately available in the right format at the right time. And as mentioned previously, the need for data in various formats is clear. How many deployments of SSIS (SQL Server Integration Services), Joomla, BizTalk, Informatica, Scheduled jobs and ETL processes that move files around are you aware of? Some are needed because complex transformations are necessary, but a vast number of those implementations are in place because the source data is simply in the wrong format. [note: I am not saying these tools are not necessary; I am naming these tools to outline the need for data movement in general]
Introducing the Data Server
With this challenge growing with every new platform, API, consumer, and storage type, I suggest that a new technology be built, which I will simply call a Data Server, so that data can be serviced in real time through virtually any protocol regardless of where the data comes from. In other words, it shouldn’t matter who needs the data or through which protocol. A mobile application needs data through REST/JSON? No problem. A report needs the exact same data through SQL? No problem. The same data, coming from the same source, should be accessible in real time regardless of consumer preferences. If data were available in such a ubiquitous manner, regardless of the protocol being used, a large number of integration routines would become obsolete or be greatly simplified.
So what are the attributes of a Data Server? It should hide the complexities of the underlying data sources and present data in a uniform way, through multiple protocols. For example, a Data Server would present tweets through a REST/JSON interface and through an SQL interface (the same data in real time; not a copy of the data). SharePoint lists could be made available through the same REST/JSON interface, or through SQL as well. An FTP server could be accessed through REST/JSON and SQL too, as could WMI, NoSQL stores, CICS screens, flat files, SOAP endpoints… Regardless of the origin of the data, it would be serviced through a uniform response in the desired protocol.
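One plausible way to structure this uniformity is an adapter per backing source, each yielding rows in one canonical shape, with every protocol front end reading from the same adapters. The sketch below assumes this design; all class and function names are hypothetical, and the adapters fake their back ends rather than calling real services:

```python
import json

class FlatFileAdapter:
    """Pretends to parse a flat file; yields rows in a canonical shape."""
    def rows(self):
        raw = "1|alpha\n2|beta"              # stand-in for file contents
        for line in raw.splitlines():
            ident, value = line.split("|")
            yield {"id": int(ident), "value": value}

class TweetAdapter:
    """Pretends to call a tweet API; yields the same canonical shape."""
    def rows(self):
        yield {"id": 99, "value": "hello world"}

def serve_json(adapter):
    """What the REST/JSON front end would return for any adapter."""
    return json.dumps(list(adapter.rows()))

def serve_tabular(adapter):
    """What the SQL front end would materialize as a result set."""
    return [(r["id"], r["value"]) for r in adapter.rows()]

print(serve_json(FlatFileAdapter()))
print(serve_tabular(TweetAdapter()))
```

The key property is that adding a new source means writing one adapter, not one bridge per consumer protocol.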
The Data Server could also abstract security by shielding the authentication and authorization details of the underlying source of data. This makes a lot of sense too, because most sources of data use different security protocols, or variations of published standards. For example, authenticating to Microsoft Bing, Azure Blobs, Google Maps, FTP, SharePoint or Twilio is difficult because they all use different implementations. Abstracting authentication behind a single REST layer or SQL interface, and adding a layer of authorization on top of these data endpoints, makes things much easier and more secure. It also becomes possible to monitor data consumption across private and public sources of data, which can be important in certain organizations.
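In outline, the server would hold per-source credentials and an authorization map, so consumers authenticate once to the Data Server and never touch the underlying secrets. A minimal sketch, with entirely hypothetical source names, credential shapes, and consumer identities:

```python
# Stored server-side only; consumers never see these secrets.
SOURCE_CREDENTIALS = {
    "ftp-reports": {"kind": "basic", "user": "svc", "password": "***"},
    "twitter":     {"kind": "oauth", "token": "***"},
}

# Which consumers are authorized to read which sources.
AUTHORIZATION = {
    "mobile-app": {"twitter"},
    "reporting":  {"twitter", "ftp-reports"},
}

def fetch(consumer, source):
    """Authorize the consumer, then connect using server-held credentials."""
    if source not in AUTHORIZATION.get(consumer, set()):
        raise PermissionError(f"{consumer} may not read {source}")
    creds = SOURCE_CREDENTIALS[source]
    # ...use 'creds' to open the real connection; elided in this sketch...
    return f"data from {source} (auth kind: {creds['kind']})"

print(fetch("reporting", "ftp-reports"))
```

Because every request funnels through `fetch`, this is also the natural place to log and meter data consumption per consumer.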
Data Cross-Cutting Concerns
A Data Server would also help in implementing cross-cutting data concerns that are not usually easy to configure (or use), such as caching, asynchronous processing, scheduling, logging and more. For example, caching becomes much more interesting through a Data Server because it doesn’t matter which interface is used to access the data; cached data sets can be made available to both the REST/JSON and SQL interfaces at the same time, which means the data needs to be cached only once and remains consistent no matter which consumer reads it.
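A sketch of that cache-once idea, assuming a simple TTL cache keyed by source name (the source name and loader are hypothetical stand-ins for real back ends):

```python
import time

_cache = {}   # source name -> (timestamp, rows); shared by all front ends
TTL = 60.0    # seconds before a cached entry is considered stale

def _load_source(source):
    # Stand-in for an expensive call to the real backing store.
    return [{"id": 1, "source": source}]

def get_rows(source):
    """Return cached rows for a source, reloading only when stale."""
    now = time.monotonic()
    entry = _cache.get(source)
    if entry is None or now - entry[0] > TTL:
        _cache[source] = (now, _load_source(source))
    return _cache[source][1]

# Both interfaces hit the very same cache entry:
json_view = get_rows("sharepoint-list")   # e.g. serialized for REST/JSON
sql_view = get_rows("sharepoint-list")    # e.g. materialized for SQL
assert json_view is sql_view              # one copy, consistent for both
```

Because the cache sits below the protocol layer, invalidating or refreshing an entry updates every consumer at once.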
Asynchronous processing is another interesting cross-cutting concern; consumers can start a request without waiting for its completion, through REST or SQL equally. For example, a REST command could initiate an asynchronous request, and an SQL command could check the completion of the request and fetch the results. Since the protocol used by consumers becomes an implementation choice, the data takes center stage in a Data Server. Accessing, managing, recycling and updating data across a vast array of data sources becomes protocol agnostic.
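One way to sketch that start-here, check-there pattern is a ticket that any front end can poll; the job store, function names, and the simulated slow query below are all illustrative assumptions:

```python
import threading
import time
import uuid

_jobs = {}   # ticket -> {"done": bool, "result": ...}; shared by all front ends

def start_request():
    """e.g. invoked by the REST front end: begin work, return a ticket."""
    ticket = str(uuid.uuid4())
    _jobs[ticket] = {"done": False, "result": None}

    def work():
        time.sleep(0.1)              # stand-in for a slow backing query
        _jobs[ticket] = {"done": True, "result": [("row", 1)]}

    threading.Thread(target=work).start()
    return ticket

def check_request(ticket):
    """e.g. invoked through the SQL interface: poll status, fetch results."""
    return _jobs[ticket]

ticket = start_request()             # started via one protocol...
while not check_request(ticket)["done"]:
    time.sleep(0.02)
print(check_request(ticket)["result"])   # ...collected via another
```

The ticket, not the protocol, identifies the work, which is exactly what makes the request protocol agnostic.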
To accelerate data-intensive projects and to help organizations consume the data they need efficiently, data should be made available in a uniform way regardless of the client protocol, no matter where it comes from, so that it no longer matters who needs to consume that data. By creating this level of abstraction, the authentication and authorization mechanisms can be streamlined too, so that there is only one way to secure the information. And since a Data Server channels requests to a multitude of data endpoints, it becomes a hub where common data-related concerns can be implemented, including caching, scheduling and asynchronous requests.
About Herve Roggero
Herve Roggero, Microsoft Azure MVP, @hroggero, is the founder of Blue Syntax Consulting (http://www.bluesyntaxconsulting.com/). Herve's experience includes software development, architecture, database administration and senior management with both global corporations and startup companies. Herve holds multiple certifications, including MCDBA, MCSE and MCSD, as well as a Master's degree in Business Administration from Indiana University. He is the co-author of "Pro SQL Azure" and "Pro SQL Server 2012 Practices" from Apress, a Pluralsight author, and runs the Azure Florida Association.