AWS Official Blog

Taking Massive Distributed Computing to the Common Man – Hadoop on Amazon EC2/S3

by Jeff Barr | in Thought Pieces

Not so long ago, it was both difficult and expensive to perform massive distributed processing using a large cluster of machines, mainly because:

  1. It was difficult to get the funding to acquire this ‘large cluster of machines’. Once acquired, the cluster was difficult to manage (power, cooling, maintenance), and there was always the fear that the experiment would fail, with no way to recover the investment already made.
  2. After the cluster was acquired and managed, there were technical problems. It was difficult to run massively distributed tasks on the machines and to store and access large datasets; parallelization was not easy, and job scheduling was error-prone. Moreover, if nodes failed, detecting the failure was difficult and recovery was very expensive. Tracking jobs and their status was often neglected because it quickly became complicated as the number of machines in the cluster increased.

Hence it was difficult to innovate and/or solve real-world problems like these:

  • Web Company: Analyze large datasets of user behavior and clickstream logs
  • Social Networking Company: Analyze social, demographic and market data
  • Phone Company: Locate all customers who have called in a given area
  • Large Retail Chain: Find out what items a particular customer bought last month, or recall a certain product and inform the customers who bought it
  • Surveillance Company: Transcode video accumulated over several years
  • Pharma Company: Locate people who were prescribed a certain drug

Just a few years ago, it was difficult. But now, it is easy.

The Open Source Hadoop framework has given developers the power to do some pretty extraordinary things.

Hadoop gives developers an opportunity to focus on their idea and its implementation and not worry about the software-level “muck” associated with distributed processing (#2 above). It handles job scheduling, automatic parallelization, and job/status tracking all by itself while developers focus on the Map and Reduce implementation. It processes large datasets by splitting them into manageable chunks, spreading the chunks across a fleet of machines, and managing the overall process: launching jobs, processing each chunk no matter where the data is physically located and, at the end, aggregating the job output into a final result.
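To make the split, map and aggregate flow concrete, here is a minimal single-process sketch in Python. It is not Hadoop's API: the function names (`split_into_chunks`, `map_phase`, `shuffle`, `reduce_phase`, `run_job`) and the sample log lines are hypothetical, chosen only to illustrate the idea, and everything runs in one process rather than across a cluster.

```python
from collections import defaultdict
from itertools import islice


def split_into_chunks(records, chunk_size):
    """Split the input into fixed-size chunks, one per (simulated) task."""
    it = iter(records)
    while True:
        chunk = list(islice(it, chunk_size))
        if not chunk:
            return
        yield chunk


def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in the chunk."""
    for line in chunk:
        for word in line.lower().split():
            yield word, 1


def shuffle(mapped_pairs):
    """Group emitted values by key, the step a framework does between phases."""
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse all values for one key into a single result."""
    return key, sum(values)


def run_job(records, chunk_size=2):
    """Run the whole job; on a real cluster each chunk would be mapped in parallel."""
    mapped = (
        pair
        for chunk in split_into_chunks(records, chunk_size)
        for pair in map_phase(chunk)
    )
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    # Hypothetical clickstream-style input: one page sequence per line.
    logs = ["home cart checkout", "home search", "home cart", "search cart"]
    print(run_job(logs))
```

In this sketch the developer only really writes `map_phase` and `reduce_phase`; the chunking, grouping and final aggregation stand in for the parts a framework like Hadoop takes off your hands.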

Large companies can afford to acquire 10,000-node clusters and run their experiments on massive distributed processing platforms that process 20,000 TB/day.

But if I am a startup, or a university with minimal funding, or a self-employed individual who would like to test distributed processing over a large cluster with 1000+ nodes, can I afford it? Or even if I am a well-funded company (think “enterprise”) with a lot of free cash flow, will management approve the budget for my experiment? Every organization has a person who says “no”. Will I be able to fight the battle with those people? Should I even fight the battle (of logistics)? Will I be able to get an environment to experiment with large datasets (think “weather data simulation” or “genome comparisons”)?

Cloud Computing makes this a reality (solving #1 above). Click a button and get a server. Flick a switch and store terabytes of geographically distributed data. Click a button and dispose of temporary resources.

Posts like this and this inspired me to write this post. Amazon Web Services is leveling the playing field for experimentation, innovation and competition. Users are able to iterate on their ideas quickly. If your idea works, bingo! If it does not, shut down your “droplet” in the cloud, move on to the next idea, and start a new “droplet” whenever you are ready.

I would say:

The Open Source Hadoop framework on Amazon EC2/S3 has given every developer the power to do some pretty extraordinary things.

Every day, I hear new stories about running Hadoop on EC2. For example, The New York Times used 100 Amazon EC2 instances and a Hadoop application to process 4TB of raw TIFF image data (stored in S3) into 1.1 million finished PDFs in the space of 24 hours at a computation cost of just $240. Hadoop on EC2 not only makes massive distributed processing easy but also makes it headache-free.
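The numbers in that story are worth a quick back-of-the-envelope check. The sketch below assumes, purely for illustration, that all 100 instances were billed for the full 24 hours; the post does not say so, and the helper names are hypothetical. Under that assumption, $240 works out to about $0.10 per instance-hour.

```python
def implied_hourly_rate(total_cost_usd, instances, hours):
    """Cost per instance-hour, assuming every instance ran for all the hours."""
    return total_cost_usd / (instances * hours)


def cost_per_item(total_cost_usd, items):
    """Average compute cost per finished item (here, per PDF)."""
    return total_cost_usd / items


if __name__ == "__main__":
    # Figures quoted in the post: $240, 100 instances, 24 hours, 1.1 million PDFs.
    rate = implied_hourly_rate(240, 100, 24)
    per_pdf = cost_per_item(240, 1_100_000)
    print(f"Implied rate: ${rate:.2f} per instance-hour")
    print(f"Compute cost per PDF: ${per_pdf:.6f}")
```

Either way you slice it, each finished PDF cost a small fraction of a cent in compute.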

Whether it is startup companies, university classrooms at UCSB, BYU and Stanford, or even enterprise companies, it's just amazing to see each new story of Hadoop being used on Amazon EC2/S3 in innovative ways.

That's what I love about Amazon Web Services: a common man with just a credit card can afford to think about massive distributed computing, compete with the rest, and rise to the top.

–Jinesh

P.S. The real power and potential of Hadoop on Amazon EC2 will show when I see Hadoop-on-demand with Condor automatically spawning EC2 instances on the fly when I need them (or when the situation demands them) and shutting them down when I don't. Has anybody tried that yet?