Provided by Richard@AWS
This example shows how to use Hadoop Streaming to count the number of times that words occur within a text collection. Hadoop Streaming lets you run MapReduce programs written in languages such as Python, Ruby, and PHP.
Source Location on Amazon S3: s3://elasticmapreduce/samples/wordcount/wordSplitter.py
Source License: Apache License, Version 2.0

How to Run this Application:
To count the occurrences of words, we need a map function that iterates through its input, emitting (word, count) pairs. We can implement this in Python as follows:
```python
#!/usr/bin/python
import sys
import re

def main(argv):
    # Match words: a letter followed by any run of letters or digits.
    pattern = re.compile("[a-zA-Z][a-zA-Z0-9]*")
    # Iterating over sys.stdin reads line by line until end of file.
    for line in sys.stdin:
        for word in pattern.findall(line):
            # The LongValueSum: prefix tells the aggregate reducer to
            # sum these values as Long integers.
            print "LongValueSum:" + word.lower() + "\t" + "1"

if __name__ == "__main__":
    main(sys.argv)
```
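To sanity-check the mapper's output format without a cluster, the per-line splitting logic can be factored into a small helper and exercised directly. The `split_words` function below is a hypothetical stand-in mirroring the script above, not part of the original wordSplitter.py:

```python
import re

# Hypothetical helper mirroring wordSplitter.py's per-line logic,
# written so the emitted records can be inspected in isolation.
WORD_PATTERN = re.compile("[a-zA-Z][a-zA-Z0-9]*")

def split_words(line):
    """Return the streaming records the mapper would emit for one line."""
    return ["LongValueSum:%s\t1" % word.lower()
            for word in WORD_PATTERN.findall(line)]

# Each matched word becomes one tab-separated key/value record,
# with the key already lower-cased.
records = split_words("Hello, hello world!")
```

Feeding `"Hello, hello world!"` through the helper yields three records, one per matched word; punctuation is skipped because it never matches the pattern.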
In order to run a Hadoop Streaming job on Amazon Elastic MapReduce, this program must be uploaded to Amazon S3. This can be done using tools such as s3cmd or the Firefox extension S3 Organizer. Luckily, this word count example has already been uploaded to Amazon S3 at s3://elasticmapreduce/samples/wordcount/wordSplitter.py.
This can be run on Amazon Elastic MapReduce using the AWS Management Console (https://console.aws.amazon.com). Choose the Amazon Elastic MapReduce tab, click the "Create New Job Flow" button, and then select the word count example.
You'll notice that the word count example uses the built-in reducer called aggregate. This reducer adds up the counts of the words emitted by the wordSplitter map function; the LongValueSum: prefix on each key tells it to treat the values as Long integers and sum them.
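Conceptually, the aggregate reducer's handling of LongValueSum keys amounts to grouping records by key and summing their values. A minimal Python sketch of that behavior (an illustrative stand-in, not Hadoop's actual implementation):

```python
from collections import defaultdict

def sum_long_values(records):
    """Group mapper records by word and sum their counts, mimicking how
    the aggregate reducer treats LongValueSum-prefixed keys."""
    totals = defaultdict(int)
    for record in records:
        key, value = record.split("\t")
        word = key[len("LongValueSum:"):]  # strip the type prefix
        totals[word] += int(value)
    return dict(totals)

# Three mapper records collapse into per-word totals.
counts = sum_long_values([
    "LongValueSum:hello\t1",
    "LongValueSum:hello\t1",
    "LongValueSum:world\t1",
])
```

In the real job, the shuffle phase delivers all records for a given key to the same reducer, so the grouping happens before the reducer code ever runs.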
It is also possible to run this example using the Elastic MapReduce Command Line Ruby Client with the following command (make sure you replace my-bucket in the --output parameter with the name of one of your Amazon S3 buckets):
```shell
elastic-mapreduce --create --stream \
  --mapper s3://elasticmapreduce/samples/wordcount/wordSplitter.py \
  --input s3://elasticmapreduce/samples/wordcount/input \
  --output s3://my-bucket/output \
  --reducer aggregate
```
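Before launching a job flow, the same map-shuffle-reduce pipeline can be approximated locally with standard Unix tools. This is a rough sketch, not the real job: tr and the first awk stand in for the mapper, sort plays the role of the shuffle, and the final awk mimics the aggregate reducer:

```shell
# Mapper stand-in: one word per line, lower-cased, tagged with the
# LongValueSum prefix. sort groups identical keys like the shuffle.
# Reducer stand-in: strip the prefix and sum the counts per word.
echo "apple Apple banana" \
  | tr ' ' '\n' \
  | tr '[:upper:]' '[:lower:]' \
  | awk '{printf "LongValueSum:%s\t1\n", $0}' \
  | sort \
  | awk -F'\t' '{sub(/^LongValueSum:/, "", $1); sum[$1] += $2}
                END {for (k in sum) print k, sum[k]}' \
  | sort
```

Running this prints one line per distinct word with its total count, which is the same shape of output the streaming job writes to the S3 output location.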