Key Links
Hadoop on delicious
Distributed Computing on delicious
Why Hadoop?
Hadoop is a general-purpose framework for distributed processing of large amounts of data. It includes an implementation of the MapReduce paradigm and a distributed file system, HDFS, and has found much use in the web community. Over the years, the Hadoop ecosystem has also evolved rapidly. My personal interest is in applying Hadoop and associated applications to scientific problems, especially in the life sciences. More specifically, I have two key areas of interest: large-scale genomic data, and data related to protein structure and sequence.
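To make the MapReduce idea concrete, here is a minimal sketch in plain Python, with no Hadoop involved. It counts k-mers in a couple of made-up reads, the kind of job I have in mind for genomic data. The function names (`map_kmers`, `shuffle`, `reduce_counts`, `count_kmers`) and the sample reads are my own illustrations, not part of any Hadoop API. On a real cluster the framework does the grouping step itself and spreads it across machines.

```python
from collections import defaultdict


def map_kmers(read, k=3):
    """Map step: emit a (k-mer, 1) pair for every window of length k."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k], 1


def shuffle(pairs):
    """Shuffle step: group values by key (done by the framework on a real cluster)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(key, values):
    """Reduce step: sum the counts collected for one key."""
    return key, sum(values)


def count_kmers(reads, k=3):
    """Run map, shuffle, and reduce in a single process."""
    pairs = (pair for read in reads for pair in map_kmers(read, k))
    grouped = shuffle(pairs)
    return dict(reduce_counts(key, values) for key, values in grouped.items())


# Hypothetical sample reads, for illustration only.
counts = count_kmers(["GATTACA", "ATTAC"], k=3)
```

The map and reduce functions only ever see one record or one key at a time, which is what lets a framework like Hadoop run many copies of them in parallel.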
Key ecosystem components
- Cascading - A feature-rich API for defining and executing complex, scale-free, and fault-tolerant data processing workflows on a Hadoop cluster
- Hive - A data warehouse infrastructure that provides data summarization and ad hoc querying
- Pig - A high-level data-flow language and execution framework for parallel computation
Of these, I am most interested in Cascading at this point. Cascading lets you write rich data flows, and you can drive it from scripting languages that run on the JVM.
Hadoop Configuration
On OSX, I used this tutorial
For remote deployment, I intend to primarily use Amazon Elastic MapReduce
You can always start with the excellent VM and EC2 AMI from Cloudera
I am also playing with Hadoop Studio
Since I like Ruby, the Wukong library from Flip Kromer at Infochimps is really cool and may well end up being something I use a lot. There is a great tutorial on how to use it with Elastic MapReduce.
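As far as I understand, Wukong scripts run through Hadoop Streaming, where the mapper and reducer are ordinary programs that read lines and write tab-separated key/value lines. The sketch below shows that line protocol in Python. It is not Wukong's API (Wukong is Ruby), and the names `mapper`, `reducer`, and `run_local` are hypothetical helpers of mine. The local `sorted` call stands in for the sort that Hadoop performs between the two phases.

```python
from itertools import groupby


def mapper(lines):
    """Emit one 'token<TAB>1' line per whitespace-separated token."""
    for line in lines:
        for token in line.split():
            yield f"{token}\t1"


def reducer(sorted_lines):
    """Sum the counts per key; assumes the input lines are already sorted by key."""
    parsed = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for key, group in groupby(parsed, key=lambda kv: kv[0]):
        yield f"{key}\t{sum(int(value) for _, value in group)}"


def run_local(lines):
    """Simulate map -> sort -> reduce in one process for quick testing."""
    return list(reducer(sorted(mapper(lines))))


result = run_local(["a b a", "b c"])
```

Because each phase only deals with plain text lines, the same logic can be tested locally like this before it is pointed at a cluster.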
