Hadoop Filesystem Using S3
I blogged about Hadoop on EC2 late last year. In a nutshell, Hadoop is an open source implementation of Google's MapReduce. MapReduce is a simple and efficient programming model for processing large data sets using a whole bunch of processors (you are supposed to start thinking of EC2 at this point).
Tom White sent me a note this week to inform me that he had implemented a Hadoop file system on top of S3. This file system can be used as a full or partial replacement for HDFS, the Hadoop Distributed File System.
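To give a feel for how this works, here is a minimal configuration sketch. Hadoop picks up its filesystem settings from hadoop-site.xml, and the S3 filesystem is selected simply by using an s3:// URI as the default filesystem. The bucket name and credential values below are placeholders, and property names may vary between Hadoop releases, so check the Hadoop documentation for your version:

```xml
<!-- hadoop-site.xml: a sketch of pointing Hadoop at S3 instead of HDFS.
     Replace the bucket name and AWS credentials with your own values. -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>s3://your-bucket</value>
  </property>
  <property>
    <name>fs.s3.awsAccessKeyId</name>
    <value>YOUR_ACCESS_KEY_ID</value>
  </property>
  <property>
    <name>fs.s3.awsSecretAccessKey</name>
    <value>YOUR_SECRET_ACCESS_KEY</value>
  </property>
</configuration>
```

With something like this in place, the usual Hadoop filesystem commands and MapReduce job inputs and outputs operate against the S3 bucket rather than a local HDFS cluster.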
Because bandwidth between EC2 instances and data stored in S3 is not metered or billed, this is a very cost-effective way to process large amounts of data.
If you aren’t already running Hadoop on EC2, you can read all about how to do it here.
I would be overjoyed to hear from someone who’s used Hadoop on EC2 to do something really cool. Drop me an email.