Apache Log Analysis using Pig

Analyze your Apache logs using Pig and Amazon Elastic MapReduce.

Details

Submitted By: Ian@AWS
Created On: August 5, 2009 3:50 PM GMT
Last Updated: August 11, 2009 12:09 AM GMT

The sample Pig script generates the following reports:

  • Total bytes transferred per hour
  • A list of the top 50 IP addresses by traffic per hour
  • A list of the top 50 external referrers
  • The top 50 search terms in referrals from Bing and Google

You can modify the Pig script to generate additional information.
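
To make the first report concrete, here is a minimal local sketch of what "total bytes transferred per hour" means for Apache combined-format log lines. It is not the sample Pig script and does not run on Amazon Elastic MapReduce; it uses awk on standard input instead. The function name bytes_per_hour is a hypothetical helper, and the sketch assumes the default whitespace field layout, where field 4 holds the timestamp and field 10 the response size.

```shell
# Hypothetical helper, not part of the sample: sums response bytes per hour
# for Apache combined-format log lines read from standard input.
bytes_per_hour() {
  awk '
    {
      # Field 4 looks like "[10/Oct/2000:13:55:36"; keep "10/Oct/2000:13".
      hour = substr($4, 2, 14)
      # Field 10 is the response size; Apache logs "-" when no body was sent.
      bytes = ($10 ~ /^[0-9]+$/) ? $10 : 0
      total[hour] += bytes
    }
    END {
      # One tab-separated line per hour: hour, then total bytes.
      for (h in total) printf "%s\t%d\n", h, total[h]
    }
  ' | LC_ALL=C sort
}
```

For example, `bytes_per_hour < access.log` prints one tab-separated line per hour, where access.log is a placeholder file name.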

Location of Pig script: s3://elasticmapreduce/samples/pig-apache/do-reports.pig
Sample data set: s3://elasticmapreduce/samples/pig-apache/input
Source license: Apache License, Version 2.0

Running the Pig Sample Using AWS Management Console

To run the application using the AWS Management Console:

  1. Navigate to the AWS Management Console and sign in.
  2. Click Create New JobFlow, select Sample Applications, and choose Apache Log Reports (Pig Script).
  3. Click Continue.
  4. In the Output Locations field, replace <yourbucket> with the name of the Amazon S3 bucket into which you would like to place the generated reports.
    If you don't have a bucket, use a tool such as Firefox S3 Organizer to create one. Make sure the output path doesn't already exist in your Amazon S3 bucket; if it does, the job will fail.
  5. On the next page, choose the number of Amazon EC2 instances to use.
  6. Review the parameters and click Create Job Flow to launch the application.
    When the job flow finishes, your reports should be available in the Amazon S3 bucket you specified.

Running the Pig Sample Using the Command Line Client

If you have the Amazon Elastic MapReduce Command Line Client installed, you can generate the reports using the following commands. Make sure to change mybucket in the output path to be the name of a bucket you own. Also, make sure the output path doesn't already exist. If it does, the script will fail.

  $ INPUT_PATH=s3://elasticmapreduce/samples/pig-apache/input
  $ OUTPUT_PATH=s3://mybucket/pig-apache/output
  $ PIG_SCRIPT=s3://elasticmapreduce/samples/pig-apache/do-reports.pig
  $ ./elastic-mapreduce --create --pig-script --args "-p,INPUT=$INPUT_PATH,-p,OUTPUT=$OUTPUT_PATH,$PIG_SCRIPT"
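
Because a leftover placeholder bucket name is an easy mistake to make, a small pre-flight check can catch it before the job flow is launched. The following is a sketch only: check_output_path is a hypothetical helper, not part of the Command Line Client, and it does not contact Amazon S3, so it cannot tell whether the output path already exists.

```shell
# Hypothetical pre-flight check, not part of the Command Line Client.
# Succeeds only if the output path is an s3:// URL whose bucket name is
# no longer one of the placeholders used in this article.
check_output_path() {
  case "$1" in
    "s3://mybucket/"*|"s3://<yourbucket>/"*)
      echo "replace the placeholder bucket in: $1" >&2
      return 1 ;;
    s3://?*/?*)
      return 0 ;;
    *)
      echo "not an s3://bucket/path location: $1" >&2
      return 1 ;;
  esac
}

# Usage (commented out so the sketch has no side effects):
# check_output_path "$OUTPUT_PATH" && ./elastic-mapreduce --create --pig-script \
#   --args "-p,INPUT=$INPUT_PATH,-p,OUTPUT=$OUTPUT_PATH,$PIG_SCRIPT"
```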

Alternatively, you could start a development job flow and then add steps to the job flow to execute Pig scripts.

 
  $ INPUT_PATH=s3://elasticmapreduce/samples/pig-apache/input
  $ OUTPUT_PATH=s3://mybucket/pig-apache/output
  $ PIG_SCRIPT=s3://elasticmapreduce/samples/pig-apache/do-reports.pig
  $ ./elastic-mapreduce --create --alive
  Created jobflow j-A1212121212
  $ ./elastic-mapreduce --jobflow j-A1212121212 --pig-script --args "-p,INPUT=$INPUT_PATH,-p,OUTPUT=$OUTPUT_PATH,$PIG_SCRIPT"

In this way, you can execute more than one Pig script in the same job flow. A job flow started this way keeps running until you terminate it, so terminate it when you're finished:

 
  $ ./elastic-mapreduce --jobflow j-A1212121212 --terminate
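
When scripting these steps, copying the job flow ID by hand can be avoided by capturing it from the client's output. The sketch below assumes the output contains a line of the form "Created jobflow j-A1212121212", as in the sample transcript in this article; jobflow_id is a hypothetical helper name.

```shell
# Hypothetical helper: reads Command Line Client output on standard input and
# prints the job flow ID from the first "Created jobflow j-..." line.
# The line format is assumed from the sample output shown in this article.
jobflow_id() {
  sed -n 's/^[[:space:]]*Created jobflow \(j-[A-Za-z0-9]*\).*$/\1/p' | head -n 1
}

# Usage (commented out so the sketch does not start a job flow):
# JOBFLOW=$(./elastic-mapreduce --create --alive | jobflow_id)
# ./elastic-mapreduce --jobflow "$JOBFLOW" --terminate
```

If no matching line is found, the helper prints nothing, so a script should check that the captured value is non-empty before using it.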

Customizing this Pig Script

To customize the Pig script:

  1. Download the script do-reports.pig from s3://elasticmapreduce/samples/pig-apache/do-reports.pig.
  2. Modify the script and save a copy in your Amazon S3 bucket.
  3. Run the script by following either set of instructions above, but replace the Pig script location with the location of your modified script in Amazon S3.
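
As a sketch of step 3, only the last element of the --args value changes when you run a modified script. The helper pig_step_args and the script name my-reports.pig are illustrative assumptions, and mybucket stands for a bucket you own.

```shell
# Hypothetical helper: joins the Pig parameters and script location into the
# comma-separated value passed to --args in the commands in this article.
pig_step_args() {
  # $1 = input path, $2 = output path, $3 = Pig script location
  printf '%s\n' "-p,INPUT=$1,-p,OUTPUT=$2,$3"
}

# Example with a modified script; my-reports.pig is an illustrative name.
ARGS=$(pig_step_args \
  s3://elasticmapreduce/samples/pig-apache/input \
  s3://mybucket/pig-apache/output \
  s3://mybucket/pig-apache/my-reports.pig)

# Print the command instead of running it, so it can be reviewed first.
echo "./elastic-mapreduce --create --pig-script --args \"$ARGS\""
```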
