AWS Big Data Blog

Using Amazon EMR and Hunk for Rapid Response Log Analysis and Review

Patrick Shumate is a Solutions Architect for AWS.

Introduction

It is fairly common to collect access and application logs but never interactively review them. Monitoring dashboards, coupled with well-instrumented applications, allow operators to manage day-to-day operations without ever digging into the flood of logs silently stored in Amazon S3. That works until the monitoring dashboard lights up with errors and a detailed analysis is required.

I want to share a quickly deployed, “break glass” solution to review large volumes of logs with Amazon EMR and Splunk’s Hunk. I use S3 access logs, but this pattern works with Amazon CloudFront, Elastic Load Balancing (ELB), or just about any access log stored in S3.

Hunk is a big data analytics platform that lets you rapidly explore, analyze, and visualize data in Hadoop and NoSQL data stores with interactive streamed searches. EMR uses Hadoop, an open source framework, to distribute your data and processing across a resizable cluster of Amazon EC2 instances. EMR can run other frameworks, such as Spark, and additional applications as add-ons to the Hadoop cluster, such as HBase, Hunk, and Ganglia. Pairing the Hunk AMI with EMR creates a solution that can be deployed quickly to review logs on demand.

Launch an EMR Cluster

First, set up an EMR cluster to access your log data (Splunk also has a good guide with cluster instructions). I use the AWS command line interface (AWS CLI) to set this one up. The prerequisites are an AWS account, an installed and configured CLI, and a user with the rights to launch an EMR cluster. The following command returns the cluster ID; hold on to that value for later.

aws emr create-cluster --applications Name=Hunk --ami-version 3.2.1 --instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m3.xlarge InstanceGroupType=CORE,InstanceCount=2,InstanceType=m3.xlarge InstanceGroupType=TASK,InstanceCount=10,InstanceType=m3.xlarge --no-auto-terminate --region us-east-1 --use-default-roles --ec2-attributes KeyName=<put your key here>

I built my cluster with 13 m3.xlarge instances: 1 master, 2 core, and 10 task nodes. How large a cluster you need, and which instance types, depends on your workload. For more information about EMR cluster sizing, see Choose the Number and Type of Virtual Servers.

A quick word on cost before we proceed: for an hour of testing, this cluster costs around $5.00 USD as of publication.  Hunk is $0.75 USD per instance hour, so one hour of Hunk on this cluster is under $10.00 USD. The Simple Monthly Calculator can help model what a workload would cost.
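As a rough check on the Hunk figure, the arithmetic can be sketched as follows. This assumes the $0.75 hourly Hunk charge applies to each of the 13 instances in the cluster, which is consistent with the "under $10.00" figure above. The names are illustrative, and current pricing may differ.

```python
# Rough hourly Hunk cost for the cluster launched above.
# Assumption: the Hunk charge applies to every instance in the cluster.
from decimal import Decimal

# Instance counts taken from the create-cluster command.
INSTANCE_COUNTS = {"MASTER": 1, "CORE": 2, "TASK": 10}
# USD per instance hour, as quoted at the time of publication.
HUNK_RATE_PER_INSTANCE_HOUR = Decimal("0.75")


def hunk_cost(hours: int) -> Decimal:
    """Return the estimated Hunk charge in USD for the given number of hours."""
    total_instances = sum(INSTANCE_COUNTS.values())
    return total_instances * HUNK_RATE_PER_INSTANCE_HOUR * hours


print(hunk_cost(1))  # 13 instances x $0.75 = 9.75
```

The EMR and EC2 charges for the cluster come on top of this amount.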

Launch Hunk

Splunk provides an AWS CloudFormation template in its documentation, Install Hunk on Amazon Web Services with hourly pricing; skip down to Step 3: Provision a Hunk Instance for links to the CloudFormation templates. Select the region that the EMR cluster was deployed into and launch a Hunk instance.

Connect to Hunk and set up the EMR Cluster

Point your browser at the host IP address of the Hunk search head. You can find that IP address in the EC2 console, on the Hunk instance that you launched earlier.

By default, the web interface listens on port 8000, so in this case the URL would be http://203.0.113.24:8000. The CloudFormation template uses the instance ID as the default password, so you need that too.

After the demo and walkthrough, click Settings in the upper right corner of the screen and then Virtual indexes. Hunk needs to be configured with a provider and a virtual index.  A provider is the information required to work with your EMR directory. A virtual index defines the data store accessed through the provider.

Hunk index

The Hunk AMI should automatically configure the EMR cluster as a provider named “development-cluster” as shown in the screenshot below. If the “development-cluster” provider is there, you can skip the next steps and scroll down to the instructions for creating a virtual index.

Note: If the provider isn’t listed, you need to set up a new provider. This might happen if the Hunk instance started before the EMR cluster. To create a new provider, on the Provider tab, click New Provider:

Hunk new provider

You need to know the S3 bucket path for the logs and the internal IP for the master node on the EMR cluster. To get that IP address, head over to the EMR console, expand the Hardware section to the MASTER instance line, and select View EC2 instances.  To skip using the console, head back to the CLI and use the following command with the cluster ID you saved when creating the EMR cluster.  With that information and the table below, fill in the new provider page and click Save.

aws emr list-instances --cluster-id clusterID --instance-group-types MASTER --query 'Instances[*].PrivateIpAddress' --output table
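The same lookup can also be scripted. The following is a minimal sketch using boto3, the AWS SDK for Python, which is not part of the original walkthrough. The function name is illustrative, and the cluster ID in the usage comment is a placeholder.

```python
# Sketch: look up the master node's private IP with boto3 instead of the CLI.
# Assumes boto3 is installed and credentials are configured as for the AWS CLI.


def master_private_ips(emr_client, cluster_id):
    """Return the private IP addresses of the cluster's MASTER instances."""
    response = emr_client.list_instances(
        ClusterId=cluster_id,
        InstanceGroupTypes=["MASTER"],
    )
    return [
        instance["PrivateIpAddress"]
        for instance in response["Instances"]
        if "PrivateIpAddress" in instance
    ]


# Usage (requires boto3 and configured credentials):
#   import boto3
#   emr = boto3.client("emr", region_name="us-east-1")
#   print(master_private_ips(emr, "<your cluster ID>"))
```

Passing the client in as an argument keeps the function easy to exercise with a stub before pointing it at a real cluster.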

With the provider in place, you need a new virtual index. On the Virtual Indexes tab, click New Virtual Index.

Add a unique name, a meaningful description, the full S3 path to the logs, and an optional whitelist if there are many log types in that path, and then click Save.

Now you can start digging in the logs. On the Apps menu, choose Search & Reporting.

I used s3logger-index as the name of the virtual index. To start a search in that index, enter index=s3logger-index in the search bar.

From here, there are quite a few possible quick searches to start exploring the data.  Here are a few:

How many times did a remote IP address request a bucket?

index=s3logger-index 
| rex field=_raw "^(?<bucketOwner>.+?)\s(?<bucket>.+?)\s\[(?<requestTime>.+?)\]\s(?<remoteIP>.+?)\s(?<requester>.+?)\s"
| stats count by remoteIP bucket 
| head 10 |sort count

Total request time by remote IP address

index=s3logger-index 
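| rex field=_raw "^(?<bucketOwner>.+?)\s(?<bucket>.+?)\s\[(?<requestTime>.+?)\]\s(?<remoteIP>.+?)\s(?<requester>.+?)\s(?<requestID>.+?)\s(?<operation>.+?)\s(?<key>.+?)\s\"(?<requestURI>.+?)\"\s(?<httpStatus>.+?)\s(?<errorCode>.+?)\s(?<bytesSent>.+?)\s(?<objectSize>.+?)\s(?<totalTime>.+?)\s"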
| stats sum(totalTime) as Duration by remoteIP  
| head 10 |sort Duration

Total request time and remote IP address count by operation

index=s3logger-index 
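| rex field=_raw "^(?<bucketOwner>.+?)\s(?<bucket>.+?)\s\[(?<requestTime>.+?)\]\s(?<remoteIP>.+?)\s(?<requester>.+?)\s(?<requestID>.+?)\s(?<operation>.+?)\s(?<key>.+?)\s\"(?<requestURI>.+?)\"\s(?<httpStatus>.+?)\s(?<errorCode>.+?)\s(?<bytesSent>.+?)\s(?<objectSize>.+?)\s(?<totalTime>.+?)\s"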
| stats count sum(totalTime) as Duration by operation  
| head 10 |sort Duration

A visualization of time by operation

index=s3logger-index 
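| rex field=_raw "^(?<bucketOwner>.+?)\s(?<bucket>.+?)\s\[(?<requestTime>.+?)\]\s(?<remoteIP>.+?)\s(?<requester>.+?)\s(?<requestID>.+?)\s(?<operation>.+?)\s(?<key>.+?)\s\"(?<requestURI>.+?)\"\s(?<httpStatus>.+?)\s(?<errorCode>.+?)\s(?<bytesSent>.+?)\s(?<objectSize>.+?)\s(?<totalTime>.+?)\s"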
| timechart sum(totalTime) as Duration by operation  
| head 10 |sort Duration
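To see what the rex stanza in these searches is doing, here is a rough Python sketch of the same field extraction followed by two of the aggregations. It is an illustration outside Hunk, not how Hunk runs the search. The group names other than remoteIP, bucket, operation, and totalTime are my own labels for the fields of the S3 server access log format, and the sample records are made up.

```python
# Sketch: field extraction and aggregation similar to the searches above.
# Python spells named groups (?P<name>...); Splunk's rex uses (?<name>...).
import re
from collections import Counter

# Field order follows the S3 server access log format, up to total time.
LOG_RECORD = re.compile(
    r'^(?P<bucketOwner>\S+)\s(?P<bucket>\S+)\s\[(?P<requestTime>.+?)\]\s'
    r'(?P<remoteIP>\S+)\s(?P<requester>\S+)\s(?P<requestID>\S+)\s'
    r'(?P<operation>\S+)\s(?P<key>\S+)\s"(?P<requestURI>.+?)"\s'
    r'(?P<httpStatus>\S+)\s(?P<errorCode>\S+)\s(?P<bytesSent>\S+)\s'
    r'(?P<objectSize>\S+)\s(?P<totalTime>\S+)\s'
)


def sample_record(remote_ip, operation, total_time):
    """Build a made-up log record that follows the documented field order."""
    return (
        f"ownerid examplebucket [06/Feb/2015:00:00:38 +0000] {remote_ip} "
        f"requesterid 3E57427F3EXAMPLE {operation} photo.jpg "
        '"GET /examplebucket/photo.jpg HTTP/1.1" 200 - 2662992 3462992 '
        f'{total_time} 10 "-" "S3Console/0.4" -'
    )


def count_by_ip_and_bucket(lines):
    """Equivalent of: stats count by remoteIP bucket."""
    counts = Counter()
    for line in lines:
        match = LOG_RECORD.match(line)
        if match:
            counts[(match["remoteIP"], match["bucket"])] += 1
    return counts


def total_time_by_ip(lines):
    """Equivalent of: stats sum(totalTime) as Duration by remoteIP."""
    totals = Counter()
    for line in lines:
        match = LOG_RECORD.match(line)
        # Total time is logged as "-" when it is not available; skip those.
        if match and match["totalTime"].isdigit():
            totals[match["remoteIP"]] += int(match["totalTime"])
    return totals


lines = [
    sample_record("192.0.2.3", "REST.GET.OBJECT", 70),
    sample_record("192.0.2.3", "REST.GET.OBJECT", 30),
    sample_record("198.51.100.7", "REST.PUT.OBJECT", 15),
]
print(count_by_ip_and_bucket(lines))
print(total_time_by_ip(lines))
```

Testing a pattern against a handful of real records this way can be a quick sanity check before pasting it into a search.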

The regular expressions in the searches are shortcuts to breaking the fields out. Hunk provides several methods to do this automatically so that the entire stanza can be removed. For the S3 logging format, see Server Access Log Format. For the CloudFront logging format, see Log File Format.  After iterating through the data and finding what you are looking for, it’s a few clicks to save that search to a dashboard. Click Save As and then Dashboard Panel.

In the Save as Dashboard Panel dialog box, enter names for the new dashboard and panel, and then click Save.

Navigate to the new dashboard and review the results. Creating dashboards in Splunk or Hunk, like most things, is very easy to get started with but takes time, training, and experience to master.  Splunk’s Dashboards and Visualizations documentation will get you started.

Clean Up

After you finish exploring the logs, be sure to delete the Hunk CloudFormation stack and terminate the EMR cluster using the AWS Management Console.
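If you would rather script the teardown, a minimal boto3 sketch might look like the following. It is an alternative to the console steps rather than part of the original walkthrough. The function name is illustrative, and the stack name is whatever you entered when launching the Hunk template.

```python
# Sketch: tear down both resources from a script instead of the console.
# Assumes boto3 clients for EMR and CloudFormation in the cluster's region.


def clean_up(emr_client, cfn_client, cluster_id, hunk_stack_name):
    """Terminate the EMR cluster and delete the Hunk CloudFormation stack."""
    emr_client.terminate_job_flows(JobFlowIds=[cluster_id])
    cfn_client.delete_stack(StackName=hunk_stack_name)


# Usage (requires boto3 and configured credentials):
#   import boto3
#   clean_up(
#       boto3.client("emr", region_name="us-east-1"),
#       boto3.client("cloudformation", region_name="us-east-1"),
#       "<your cluster ID>",
#       "<your Hunk stack name>",
#   )
```

Both calls only start the teardown, so it is worth confirming in the console that the cluster and the stack are actually gone.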

Conclusion

Getting access to undiscovered data in your logs does not require “always on” big data systems and advanced skills working with Hadoop. Access to that data with Hunk and EMR can be set up very quickly, run for as long as required, and then torn down without disrupting the logs at rest. Also, you don’t need a deep understanding of the logs and their formats to dig into the data and make fast discoveries and connections. That speed allows you to iterate and rapidly track down the data you’re looking for.

Using Hunk and EMR together creates a tool for exploring data quickly. Adding Splunk Enterprise extends that exploration by collecting and indexing the extracted metrics and data so that you can dive further into them. Familiar Splunk features are also extended back into EMR, such as archiving to S3 while maintaining full archive searchability.

If you have questions or suggestions, please leave a comment below.
