Hadoop Filesystem Using S3
I blogged about Hadoop on EC2 late last year. In a nutshell, Hadoop is an open source implementation of Google's MapReduce. MapReduce is a simple and efficient programming model for processing large data sets using a whole bunch of processors (you are supposed to start thinking of EC2 at this point).
Tom White sent me a note this week to inform me that he had implemented a Hadoop file system on top of S3. This file system can be used as a full or partial replacement for HDFS, the Hadoop Distributed File System.
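To give a feel for how this works, here is a minimal configuration sketch. Hadoop picks up its filesystem settings from hadoop-site.xml, and the S3 filesystem is selected simply by using an s3:// URI as the default filesystem. The bucket name and credential values below are placeholders, and property names may vary between Hadoop releases, so check the Hadoop documentation for your version:

```xml
<!-- hadoop-site.xml: a sketch of pointing Hadoop at S3 instead of HDFS.
     Replace the bucket name and AWS credentials with your own values. -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>s3://your-bucket</value>
  </property>
  <property>
    <name>fs.s3.awsAccessKeyId</name>
    <value>YOUR_ACCESS_KEY_ID</value>
  </property>
  <property>
    <name>fs.s3.awsSecretAccessKey</name>
    <value>YOUR_SECRET_ACCESS_KEY</value>
  </property>
</configuration>
```

With something like this in place, the usual Hadoop filesystem commands and MapReduce job inputs and outputs operate against the S3 bucket rather than a local HDFS cluster.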
Because bandwidth between EC2 instances and data stored in S3 is not metered or billed, this is a very cost-effective way to process large amounts of data.
If you aren’t already running Hadoop on EC2, you can read all about how to do it here.
I would be overjoyed to hear from someone who’s used Hadoop on EC2 to do something really cool. Drop me an email.