Note: This tutorial assumes you are using Hadoop 1.0.3.
Amazon CloudFront is a web service that delivers your content using a global network of edge locations. Amazon CloudFront can be configured to collect access logs by updating the distribution configuration (http://docs.amazonwebservices.com/AmazonCloudFront/latest/DeveloperGuide/HowToUpdateDistribution.html).
Using Amazon Elastic MapReduce and the LogAnalyzer application, you can generate usage reports containing total traffic volume, object popularity, and a breakdown of traffic by client IP and edge location. Reports are formatted as tab-delimited text files and delivered to the Amazon S3 bucket that you specify.
Amazon CloudFront's access logs provide detailed information about requests made for your content delivered through Amazon CloudFront, AWS's content delivery service. The LogAnalyzer for Amazon CloudFront analyzes the service's raw log files to produce a series of reports that answer business questions commonly asked by content owners.
|Source Location on Amazon S3||elasticmapreduce/samples/cloudfront/|
|Compiled JAR Location||elasticmapreduce/samples/cloudfront/logprocessor.jar|
|Sample Dataset Location||elasticmapreduce/samples/cloudfront/input|
|Source License||Apache License, Version 2.0 and GPL Version 2.0|
Running the Analyzer
To run the application using the console, click the "Create New Job Flow" button, select Sample Applications, and choose CloudFront LogAnalyzer (Custom Jar). Click "Continue". In the Jar Arguments textbox, replace <yourbucket> with the name of the Amazon S3 bucket in which you would like the generated reports to be placed. Make sure that the output path doesn't already exist in your S3 bucket; otherwise your job will fail. Click "Continue". Choose the number of instances to use and then click "Continue". Review your parameters and click "Create Job Flow" to launch the application. After the job flow has finished, your reports should be available in the Amazon S3 bucket that you provided.
If you already have the Ruby client installed, you can generate reports by running:
./elastic-mapreduce --create --jar s3n://elasticmapreduce/samples/cloudfront/logprocessor.jar --args "-input,s3n://elasticmapreduce/samples/cloudfront/input,-output,s3n://<yourbucket>/cloudfront/log-reports"
In this command, replace <yourbucket> with the name of the Amazon S3 bucket in which you would like the generated reports to be placed. Make sure that the output path doesn't already exist in your S3 bucket; otherwise your job will fail.
This sample application produces four sets of reports based on Amazon CloudFront access logs. The Overall Volume Report displays the total amount of traffic delivered by CloudFront over the period you specify. The Object Popularity Report shows how many times each of your objects was requested. The Client IP Report shows the traffic from each distinct client IP that made a request for your content. The Edge Location Report shows the total amount of traffic delivered through each edge location. Each report measures traffic in three ways: the total number of requests, the total number of bytes transferred, and the number of requests broken down by HTTP response code.
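The counting behind these reports can be sketched in a few lines. The snippet below is a minimal Python illustration, not the LogAnalyzer's actual implementation (which is written with Cascading); the column positions assume the standard tab-delimited CloudFront access log field order (date, time, edge location, bytes sent, client IP, method, host, URI stem, status, ...).

```python
from collections import Counter

# Assumed column positions in a tab-delimited CloudFront access log line:
# date, time, x-edge-location, sc-bytes, c-ip, cs-method, cs(Host),
# cs-uri-stem, sc-status, ...
EDGE, BYTES, CLIENT_IP, URI, STATUS = 2, 3, 4, 7, 8

def aggregate(log_lines):
    """Tally requests per object, bytes per edge location, requests per
    client IP, and requests per HTTP status code."""
    by_object = Counter()
    bytes_by_edge = Counter()
    by_client_ip = Counter()
    by_status = Counter()
    for line in log_lines:
        if line.startswith('#'):        # skip W3C header lines
            continue
        fields = line.rstrip('\n').split('\t')
        by_object[fields[URI]] += 1
        bytes_by_edge[fields[EDGE]] += int(fields[BYTES])
        by_client_ip[fields[CLIENT_IP]] += 1
        by_status[fields[STATUS]] += 1
    return by_object, bytes_by_edge, by_client_ip, by_status

# Tiny hand-made sample, not real log data.
sample = [
    "#Version: 1.0",
    "2012-01-01\t00:00:01\tSEA4\t3512\t10.0.0.1\tGET\td1.cloudfront.net\t/index.html\t200",
    "2012-01-01\t00:00:02\tSEA4\t1024\t10.0.0.2\tGET\td1.cloudfront.net\t/index.html\t200",
    "2012-01-01\t00:00:03\tLHR3\t4096\t10.0.0.1\tGET\td1.cloudfront.net\t/logo.png\t404",
]
by_object, bytes_by_edge, by_client_ip, by_status = aggregate(sample)
print(by_object['/index.html'])   # 2
print(bytes_by_edge['SEA4'])      # 4536
```

The real job performs the same grouping and counting as distributed MapReduce steps; this in-memory version only shows the shape of the computation.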
Customizing the Application
The LogAnalyzer is implemented using Cascading (http://www.cascading.org) and is an example of how to construct an Amazon Elastic MapReduce application. To customize the reports generated by the LogAnalyzer, download the source code from this page. Follow the instructions in the README for building and uploading to Amazon S3 for use with Amazon Elastic MapReduce.
|How to Run this Application||You can run this application using the AWS Management Console or Command Line Tools|
|Sample Input Parameters||-input s3n://elasticmapreduce/samples/cloudfront/input
-output s3n://<yourbucket>/<output prefix>
-end <Current date in YYYY-MM-dd-HH format>|