About a month ago-ish I read some very sad news: Microsoft announced that it was killing off the DryadLinq (or LINQ-to-HPC) project in favor of Hadoop.
I was one of the first users of DryadLinq outside of Microsoft, back when it was a pre-alpha project inside Microsoft Research. My company had a running HPC cluster, and my boss convinced me to install DryadLinq on it to see what I could make it do. I worked with it for a year, and being a big LINQ and PLINQ fan, I really enjoyed how easy it was to write non-cryptic code and get it to run in parallel across a cluster of machines. Fast forward two years, and after spending a year working with Hadoop, in my opinion DryadLinq beats Hadoop hands down. The key to DryadLinq's goodness was that you wrote the same algorithm code you would normally write, no matter whether it was executing on one core, multiple cores, or multiple machines; it's just LINQ (as long as you keep in mind to pass all data into your lambda functions and share nothing externally). You don't have to retrain your mind to think about algorithms in a different paradigm, like grouping and sorting key/value pairs. DryadLinq dramatically sped up the development iteration cycle of designing algorithms compared to Hadoop, because you write your algorithm in the same code style to run across many machines as you would if it were just running on one machine.
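That "same code, share nothing" property is easy to show outside of .Net too. Here's a minimal Python sketch of the idea (Python standing in for LINQ; the function and names are mine, not DryadLinq's): a pure function that takes all its data through its argument runs unchanged under plain `map` and under a pool executor.

```python
from concurrent.futures import ThreadPoolExecutor

def tokenize(line):
    # Pure function: all data comes in through the argument and nothing
    # external is shared -- the property DryadLinq relied on to move the
    # same code from one core to many machines.
    return len(line.split())

lines = ["the quick brown fox", "jumped over the", "lazy dog"]

serial = list(map(tokenize, lines))              # one core
with ThreadPoolExecutor(max_workers=2) as pool:  # a pool executor standing
    parallel = list(pool.map(tokenize, lines))   # in for a cluster

assert serial == parallel  # same code, same result, different execution
```

The algorithm itself never changes; only the executor does, which is exactly what made the development iteration cycle so fast.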
The main thing DryadLinq lacked was any kind of DFS (distributed file system). DryadLinq sucked in that respect, and let's face it, without a good DFS a distributed processing framework isn't all that functional. You basically had to fake a DFS by manually partitioning your data across all the machines and generating an INI file that described how the data was partitioned. The DryadLinq runtime would read the INI file and use that as the basis for a DFS. And you had to write a fair amount of code if you wanted to automate the process of distributing your data.
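To make the idea concrete, a partition manifest might look something like the following. This is a purely hypothetical sketch, not the actual DryadLinq INI schema, just to illustrate the kind of file that tells the runtime where each machine-local chunk of the data set lives:

```ini
; Hypothetical partition manifest -- illustrative only, not the real
; DryadLinq INI format. One entry per machine-local partition.
[partition0]
machine = node01
path = D:\data\input.part0

[partition1]
machine = node02
path = D:\data\input.part1
```

With a real DFS, all of this bookkeeping (and the code to keep it in sync with the data) simply disappears.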
But even though I'm bummed that Microsoft killed off DryadLinq, I do have a glimmer of hope, and here's why. First, last month Microsoft announced its support for coming up with a Windows-compatible distribution of Hadoop, so Microsoft is committing to getting Hadoop running on Windows. Second, one of the things done in Hadoop 0.23 was a full rewrite of the distributed execution model. In Next Gen Hadoop (YARN), MapReduce has been rewritten as one implementation on top of an abstract distributed job execution framework. Another implementation besides MapReduce that it'll support is a DAG (directed acyclic graph, http://en.wikipedia.org/wiki/Directed_acyclic_graph) of jobs: basically, a graph of nodes (jobs) in which, starting from any node V and following the edges, you can never loop back to V again.
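The DAG-of-jobs idea can be sketched in a few lines. Below is a toy Python illustration (my own example, not YARN code) that takes a job graph and produces a valid execution order using Kahn's algorithm, returning None when the graph contains a cycle and therefore isn't a DAG at all:

```python
from collections import deque

def topological_order(edges):
    """Kahn's algorithm: return an execution order for a DAG of jobs,
    or None if the graph has a cycle (i.e., it is not a valid DAG)."""
    indegree = {node: 0 for node in edges}
    for targets in edges.values():
        for t in targets:
            indegree[t] += 1
    ready = deque(n for n, d in indegree.items() if d == 0)
    order = []
    while ready:
        node = ready.popleft()
        order.append(node)
        for t in edges[node]:       # this job is done; unblock downstream jobs
            indegree[t] -= 1
            if indegree[t] == 0:
                ready.append(t)
    # If a cycle exists, some nodes never reach indegree 0.
    return order if len(order) == len(edges) else None

# A small job graph: two map stages feeding a join, then an aggregate.
dag = {"mapA": ["join"], "mapB": ["join"], "join": ["agg"], "agg": []}
print(topological_order(dag))  # e.g. ['mapA', 'mapB', 'join', 'agg']
```

MapReduce is just the degenerate two-node case of such a graph; the win of a general DAG executor is that multi-stage pipelines don't have to be shoehorned into chained map/reduce jobs.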
Now put all that together, and here is that glimmer of hope: the core coolness of DryadLinq was that it was a framework for analyzing a .Net LINQ expression tree and generating a DAG structure from it that executed on Windows HPC. At runtime, LINQ code is represented as an expression tree that gets interpreted. DryadLinq analyzed that expression tree and figured out how to segment the sequence of lambda functions into a DAG structure. It then generated .dlls containing all the functions needed to execute the DAG, shipped them over to HPC, and started the DAG execution.
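As a rough illustration of that analysis step, here's a toy Python sketch (the operator names and tree shape are hypothetical, not DryadLinq's actual API) that walks a nested LINQ-like expression tree and flattens it into an ordered list of stages, the raw material for laying the operators out as a DAG of jobs:

```python
# A toy "expression tree" for a LINQ-like pipeline, loosely analogous to
# what DryadLinq inspected at runtime. Node shape: (operator, child, argument).
query = ("GroupBy",
            ("Select",
                ("Where", "input", "len(word) > 3"),
             "word.lower()"),
         "word[0]")

def stages(node):
    """Flatten the nested expression tree into an ordered list of stages."""
    if isinstance(node, str):
        return []  # a data source, not an operator
    op, child, arg = node
    return stages(child) + [(op, arg)]

print(stages(query))
# [('Where', 'len(word) > 3'), ('Select', 'word.lower()'), ('GroupBy', 'word[0]')]
```

The real system did far more (operator fusion, partitioning decisions, code generation into .dlls), but the starting point is the same: the query exists as a data structure you can walk and rearrange, which is what made the whole compile-to-DAG trick possible.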
So basically Microsoft already has a kick-butt model for distributed processing of LINQ code. Add on top that you can use .Net with Hadoop streaming, and that Hadoop comes with an industry-tested DFS (the thing DryadLinq sorely missed), and I can potentially see where Microsoft is going with this.
Now the real question is: is Microsoft that forward-thinking? Or are they doing what they do best, making a knee-jerk reaction to the market?