Apache Log Analysis using Pig

Analyze your Apache logs using Pig and Amazon Elastic MapReduce.


Submitted By: Ian@AWS
Created On: August 5, 2009 3:50 PM GMT
Last Updated: March 20, 2014 3:27 PM GMT

The sample Pig script analyzes Apache logs and generates the following reports:
  • Total bytes transferred per hour
  • A list of the top 50 IP addresses by traffic per hour
  • A list of the top 50 external referrers
  • The top 50 search terms in referrals from Bing and Google

You can modify the Pig script to generate additional information.
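
To sanity-check the first report (total bytes transferred per hour) against a small local log sample, a rough shell equivalent can be sketched as follows. This is not how do-reports.pig is implemented. It assumes the Apache common or combined log format, where whitespace-separated field 4 holds the timestamp and field 10 holds the response size, and `bytes_per_hour` is a hypothetical helper name.

```shell
# Hypothetical helper: sum response bytes per hour from Apache logs.
# Assumes common/combined log format, e.g.
#   127.0.0.1 - - [10/Oct/2000:13:55:36 -0700] "GET /a.gif HTTP/1.0" 200 2326
# Reads the files given as arguments, or standard input if none are given.
bytes_per_hour() {
  awk '{
         hour = substr($4, 2, 14)   # "[10/Oct/2000:13:55:36" -> "10/Oct/2000:13"
         total[hour] += $10 + 0     # a size of "-" counts as 0 bytes
       }
       END { for (h in total) print h "\t" total[h] }' "$@" | sort
}
```

For example, `bytes_per_hour access.log` prints one tab-separated line per hour, where access.log stands for any local log file in the assumed format.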

Location of Pig script: s3://elasticmapreduce/samples/pig-apache/do-reports.pig
Sample data set: s3://elasticmapreduce/samples/pig-apache/input
Source license: Apache License, Version 2.0

Running the Pig Sample Using the AWS Management Console

To run the application using the AWS Management Console, see the documentation: http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-pig.html.

Running the Pig Sample Using EMR's Ruby Command Line Client

If you have the Amazon Elastic MapReduce Command Line Client installed, you can generate the reports using the following commands. Replace mybucket in the output path with the name of a bucket you own, and make sure the output path does not already exist; if it does, the script will fail.

  $ INPUT_PATH=s3://elasticmapreduce/samples/pig-apache/input
  $ OUTPUT_PATH=s3://mybucket/pig-apache/output
  $ PIG_SCRIPT=s3://elasticmapreduce/samples/pig-apache/do-reports.pig
  $ ./elastic-mapreduce --create --pig-script --args "-p,INPUT=$INPUT_PATH,-p,OUTPUT=$OUTPUT_PATH,$PIG_SCRIPT"
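
Before launching, it may help to validate the values locally. The following is a minimal sketch and not part of the Elastic MapReduce client: `check_output_path` and `pig_args` are hypothetical helper names. The check only catches the unchanged mybucket placeholder and malformed paths; it does not query Amazon S3 to see whether the output path already exists.

```shell
# Hypothetical helper: reject the placeholder bucket and malformed output paths.
# It does NOT check whether the path already exists in Amazon S3.
check_output_path() {
  case "$1" in
    s3://mybucket/*) echo "replace 'mybucket' with a bucket you own" >&2; return 1 ;;
    s3://?*/?*)      return 0 ;;
    *)               echo "expected a path of the form s3://bucket/prefix" >&2; return 1 ;;
  esac
}

# Hypothetical helper: build the comma-separated value passed to --args.
# $1 = input path, $2 = output path, $3 = Pig script location
pig_args() {
  printf '%s' "-p,INPUT=$1,-p,OUTPUT=$2,$3"
}
```

With these helpers defined, the launch command above could be written as `check_output_path "$OUTPUT_PATH" && ./elastic-mapreduce --create --pig-script --args "$(pig_args "$INPUT_PATH" "$OUTPUT_PATH" "$PIG_SCRIPT")"`.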

Alternatively, you could start a development job flow and then add steps to the job flow to execute Pig scripts.

  $ INPUT_PATH=s3://elasticmapreduce/samples/pig-apache/input
  $ OUTPUT_PATH=s3://mybucket/pig-apache/output
  $ PIG_SCRIPT=s3://elasticmapreduce/samples/pig-apache/do-reports.pig
  $ ./elastic-mapreduce --create --alive
  Created jobflow j-A1212121212
  $ ./elastic-mapreduce --jobflow j-A1212121212 --pig-script --args "-p,INPUT=$INPUT_PATH,-p,OUTPUT=$OUTPUT_PATH,$PIG_SCRIPT"

In this way, you can execute more than a single Pig script in your job flow. If you start a job flow, you must terminate it when you're finished.

  $ ./elastic-mapreduce --jobflow j-A1212121212 --terminate
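
When these steps are scripted, the job flow ID has to be carried from the --create command to the later commands. A minimal sketch of capturing it, assuming the client prints the "Created jobflow ..." line shown above to standard output (`jobflow_id` is a hypothetical helper name):

```shell
# Hypothetical helper: extract the ID from a "Created jobflow j-..." line.
jobflow_id() {
  awk '/Created jobflow/ { print $3; exit }'
}

# Demonstrated here on the sample line from the listing above. With the real
# client this would be (assumption: the line is written to standard output):
#   JOBFLOW=$(./elastic-mapreduce --create --alive | jobflow_id)
JOBFLOW=$(echo "Created jobflow j-A1212121212" | jobflow_id)
echo "$JOBFLOW"
```

The captured value can then replace the literal ID in the later commands, for example `./elastic-mapreduce --jobflow "$JOBFLOW" --terminate`.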

Customizing this Pig Script

To customize the Pig script:

  1. Download do-reports.pig from the Pig script location listed above (s3://elasticmapreduce/samples/pig-apache/do-reports.pig).
  2. Modify it and save a copy in your Amazon S3 bucket.
  3. Run the script following one of the sets of instructions above, but replace the Pig script location with the location of your modified script in Amazon S3.
