Hadoop & Data Processing

Page last modified 01:49, 26 Oct 2009 by mndoci

Key Links

Hadoop on delicious

Distributed Computing on delicious

Why Hadoop?

Hadoop is a general-purpose framework for distributed processing of large amounts of data. It includes an implementation of the MapReduce paradigm and a distributed file system, HDFS, and has found much use in the web community. Over the years, the Hadoop ecosystem has evolved rapidly as well. My personal interest is in applying Hadoop and associated applications to scientific problems, especially life science problems. More specifically, I have two key areas of interest: large-scale genomic data, and data related to protein structure and sequence.
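To make the MapReduce paradigm concrete, here is a minimal in-memory sketch in plain Python. It does not use Hadoop at all; it only mimics the three phases (map, shuffle, reduce) on a toy genomic task, counting k-mers in short reads. The function names (`map_kmers`, `shuffle`, `reduce_counts`, `run_job`) and the sample reads are hypothetical, chosen for illustration only.

```python
from collections import defaultdict
from itertools import chain


def map_kmers(sequence, k=3):
    """Map step: emit a (k-mer, 1) pair for every window of length k."""
    for i in range(len(sequence) - k + 1):
        yield sequence[i:i + k], 1


def shuffle(pairs):
    """Shuffle step: group values by key, as the framework would do
    between the map and reduce phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(key, values):
    """Reduce step: collapse all values for one key into a total."""
    return key, sum(values)


def run_job(sequences, k=3):
    """Run map, shuffle, and reduce in a single process (illustration only)."""
    mapped = chain.from_iterable(map_kmers(s, k) for s in sequences)
    grouped = shuffle(mapped)
    return dict(reduce_counts(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    reads = ["GATTACA", "TTACAGA"]  # made-up reads, not real data
    for kmer, count in sorted(run_job(reads).items()):
        print(kmer, count)
```

On a real cluster the map and reduce functions would run on many machines in parallel, and the grouping done by `shuffle` here would be handled by the framework; the sketch only shows how the work is divided.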

Key ecosystem components

  • Cascading - A feature-rich API for defining and executing complex, scale-free, and fault-tolerant data processing workflows on a Hadoop cluster
  • Hive - A data warehouse infrastructure that provides data summarization and ad hoc querying
  • Pig - A high-level data-flow language and execution framework for parallel computation

Of these, I am most interested in Cascading at this point. Cascading allows you to write rich data flows, and you can use any scripting language.

Hadoop Configuration

On OSX, I used this tutorial

For remote deployment, I intend to primarily use Amazon Elastic MapReduce

You can always start with the excellent VM and EC2 AMI from Cloudera

I am also playing with Hadoop Studio

Since I like Ruby, the Wukong library from Flip Kromer at the infochimps is really cool and might just end up being something I use a lot. There is a great tutorial on how to use it with Elastic MapReduce.
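Wukong is built around Hadoop Streaming, where the mapper and reducer are ordinary scripts that read lines and write tab-separated key/value lines. The sketch below shows that contract in Python rather than Ruby; it is not Wukong's API. The `mapper` and `reducer` names and the sample FASTA record are hypothetical, and `sorted()` stands in for the sort that Hadoop performs between the two stages.

```python
from itertools import groupby


def mapper(lines):
    """Emit 'residue<TAB>1' for each character of every non-header FASTA line."""
    for line in lines:
        line = line.strip()
        if not line or line.startswith(">"):
            continue  # skip blank lines and FASTA header lines
        for residue in line.upper():
            yield f"{residue}\t1"


def reducer(sorted_lines):
    """Sum the counts per key; the input must already be sorted by key."""
    parsed = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for key, group in groupby(parsed, key=lambda kv: kv[0]):
        total = sum(int(value) for _, value in group)
        yield f"{key}\t{total}"


if __name__ == "__main__":
    # Local dry run: sorted() plays the role of Hadoop's shuffle-and-sort.
    fasta = [">hypothetical_protein", "MKVLA", "AKLM"]  # made-up record
    for line in reducer(sorted(mapper(fasta))):
        print(line)
```

To run under a streaming job, each function would be wrapped in its own small script that iterates over `sys.stdin` and prints the results, so the same logic can be tested locally before it is sent to a cluster.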

Hadoop Tutorials
