Big Data – Where to Start

I recently read the Big Data Glossary – here are my notes on where to start.
Big Data is essentially a MapReduce stack for scatter-gather-aggregate scale-out of compute jobs.
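The scatter-gather-aggregate pattern can be sketched in plain Python as the canonical word-count job. This is a single-process simulation of the three phases (map, shuffle, reduce), not actual Hadoop API code:

```python
from collections import defaultdict

def map_phase(documents):
    # Scatter: each "mapper" emits (word, 1) pairs from its input split
    for doc in documents:
        for word in doc.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    # Gather: group all emitted values by key, as the framework would
    # do between the map and reduce phases
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Aggregate: each "reducer" folds the values for its keys
    return {key: sum(values) for key, values in groups.items()}

docs = ["big data is big", "data wants to scale"]
counts = reduce_phase(shuffle(map_phase(docs)))
print(counts["big"])   # 2
print(counts["data"])  # 2
```

In real Hadoop, the map and reduce functions run in parallel across the cluster and the shuffle is handled by the framework; the logical structure is the same.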

The core tools are:

  • Apache Hadoop – the core MapReduce scale-out infrastructure (HDFS storage plus distributed job execution)
  • Hive – SQL-like query language (HiveQL) for Hadoop
  • Pig – procedural dataflow language (Pig Latin) for Hadoop
  • Cascading – Java API for orchestrating job workflows on Hadoop
  • Datameer – BI / analytics tooling on Hadoop
  • Mahout – distributed machine-learning library on Hadoop
  • ZooKeeper – distributed coordination service for cluster work coordination and monitoring

On top of these sit various tools and extensions, as well as ports to other platforms (e.g. HDInsight, Hadoop on Windows Azure).

You also need to be aware of the elastic cloud platforms to run on, and of the various NoSQL DBs that tend to be leveraged in this space as well.

Additionally, MapReduce is just an infrastructure pattern for distributed processing – you will not get much use out of it without knowledge of the appropriate algorithms to run on the nodes of your compute grid, which is the whole point of Big Data.

posted on Tuesday, December 25, 2012 11:39 AM



Copyright © JoshReuben

Design by Bartosz Brzezinski
