Word Count Example

This example shows how to use Hadoop Streaming to count the number of times that words occur within a text collection.

Details

Submitted By: Jai@AWS
Created On: March 31, 2009 4:12 AM GMT
Last Updated: April 2, 2009 8:53 PM GMT

Provided by Richard@AWS

This example shows how to use Hadoop Streaming to count the number of times that words occur within a text collection. Hadoop Streaming lets you run MapReduce programs written in languages such as Python, Ruby, and PHP.

Source Location on Amazon S3: s3://elasticmapreduce/samples/wordcount/wordSplitter.py
Source License: Apache License, Version 2.0
How to Run this Application:

You can run this application using the AWS Management Console or the command line tools.

To count the occurrences of words, we need a map function that iterates over its input, emitting (word, count) pairs. We can implement this in Python as follows:

   #!/usr/bin/python

   import re
   import sys

   def main():
     # Match words: a letter followed by any number of letters or digits.
     pattern = re.compile("[a-zA-Z][a-zA-Z0-9]*")
     for line in sys.stdin:
       for word in pattern.findall(line):
         # Emit "LongValueSum:word<TAB>1"; the LongValueSum: prefix tells
         # the aggregate reducer to sum the values as Long integers.
         print "LongValueSum:" + word.lower() + "\t" + "1"

   if __name__ == "__main__":
     main()

In order to run a Hadoop Streaming job with Amazon Elastic MapReduce, this program must be uploaded to Amazon S3. You can do this with tools such as s3cmd or the Firefox plugin S3 Organizer. Luckily, this word count example has already been uploaded to Amazon S3 at the following location:

   s3://elasticmapreduce/samples/wordcount/wordSplitter.py
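If you were running your own copy of the script, you would first need to copy it into one of your own buckets. With s3cmd, the upload might look like this (my-bucket is a placeholder for a bucket you own):

   s3cmd put wordSplitter.py s3://my-bucket/wordSplitter.py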

This can be run on Amazon Elastic MapReduce using the AWS Management Console (https://console.aws.amazon.com). Choose the Amazon Elastic MapReduce tab, click the "Create New Job Flow" button, and then choose the word count example.

You'll notice that the word count example uses the built-in reducer called aggregate. This reducer adds up the counts of words emitted by the wordSplitter map function. The LongValueSum: prefix on each key tells it to treat the values as Long integers and sum them.
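
The aggregate reducer is part of Hadoop itself, so there is nothing to write on the reduce side. To make its behavior concrete, here is a rough Python sketch of what it does for LongValueSum keys. The script name sumReducer.py is just for illustration, and the sketch relies on the fact that Hadoop delivers input to reducers sorted by key:

   #!/usr/bin/python
   # sumReducer.py -- illustrative sketch only; the actual job uses
   # Hadoop's built-in aggregate reducer, not this script.

   import sys

   def main():
     current_word = None
     current_sum = 0
     for line in sys.stdin:
       # Input lines look like "LongValueSum:word<TAB>1".
       key, value = line.rstrip("\n").split("\t", 1)
       word = key[len("LongValueSum:"):]  # strip the type prefix
       if word != current_word:
         # Keys arrive sorted, so a change of key means the
         # previous word's count is complete.
         if current_word is not None:
           print current_word + "\t" + str(current_sum)
         current_word = word
         current_sum = 0
       current_sum += int(value)
     if current_word is not None:
       print current_word + "\t" + str(current_sum)

   if __name__ == "__main__":
     main()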

It is also possible to run this example using the Elastic MapReduce Command Line Ruby Client with the following command (make sure you replace my-bucket in the --output parameter with the name of one of your own Amazon S3 buckets):

  elastic-mapreduce --create --stream \
     --mapper  s3://elasticmapreduce/samples/wordcount/wordSplitter.py \
     --input   s3://elasticmapreduce/samples/wordcount/input \
     --output  s3://my-bucket/output \
     --reducer aggregate
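
Before launching a job flow, you can sanity-check the mapper locally by piping a line of text through it:

   $ echo "the cat saw the dog" | python wordSplitter.py
   LongValueSum:the	1
   LongValueSum:cat	1
   LongValueSum:saw	1
   LongValueSum:the	1
   LongValueSum:dog	1

Adding a sort and the illustrative sumReducer.py sketch from above (standing in for Hadoop's aggregate reducer) simulates the whole job on one machine:

   $ echo "the cat saw the dog" | python wordSplitter.py | sort | python sumReducer.py
   cat	1
   dog	1
   saw	1
   the	2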

Comments

Need more information
A lot of people, including me, are new to Hadoop Streaming. The word count example clarifies a lot of things, but the input itself is still not available. Although the job flow is created and runs correctly, producing output that makes sense, it is not clear what the input looks like. Shivani
raoshivani on October 25, 2010 4:15 PM GMT