Geeks With Blogs
Josh Reuben

I recently read the Big Data Glossary; here is a quick rundown of the landscape.
Big Data is essentially a MapReduce stack for scatter-gather-aggregate scale-out of compute jobs.
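To make the scatter-gather-aggregate idea concrete, here is a minimal single-process sketch of the MapReduce pattern in Python (an illustration of the pattern only, not Hadoop's actual API): the map phase scatters work over inputs, the shuffle phase gathers intermediate pairs by key, and the reduce phase aggregates each group.

```python
from collections import defaultdict

def map_phase(documents):
    """Scatter: emit (word, 1) pairs from each input document."""
    for doc in documents:
        for word in doc.split():
            yield (word.lower(), 1)

def shuffle_phase(pairs):
    """Gather: group all intermediate values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Aggregate: combine each key's values into a single result."""
    return {key: sum(values) for key, values in groups.items()}

docs = ["big data is big", "map reduce on big data"]
counts = reduce_phase(shuffle_phase(map_phase(docs)))
print(counts["big"])  # 3
```

On a real cluster, Hadoop runs the map and reduce functions on many nodes and performs the shuffle over the network; the programming model, however, is exactly this decomposition.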

The core tools are:

  • Apache Hadoop – a MapReduce scale-out infrastructure
  • Hive – a SQL-like query language (HiveQL) for Hadoop
  • Pig – a procedural dataflow language (Pig Latin) for Hadoop
  • Cascading – a Java API for orchestrating chains of jobs on Hadoop
  • Datameer – BI on Hadoop
  • Mahout – distributed machine learning library on Hadoop
  • ZooKeeper – a distributed coordination and configuration service

On top of these are various tools & extensions, as well as ports (e.g. HDInsight).

You also need to be aware of the elastic cloud platforms these stacks run on, as well as the various NoSQL databases commonly leveraged in this space.

Additionally, MapReduce is just an infrastructure pattern for distributed processing. You will not get much value out of it without knowledge of the appropriate algorithms to run on the nodes of your compute grid – that is the whole point of Big Data.

Posted on Tuesday, December 25, 2012 11:39 AM | Back to top

Comments on this post: Big Data–Where to Start

Copyright © JoshReuben