AWS Big Data Blog

Using Amazon Redshift to Analyze Your Elastic Load Balancer Traffic Logs

Biff Gaut is a Solutions Architect with AWS

Introduction

With the introduction of Elastic Load Balancing (ELB) access logs, administrators have a tremendous amount of data describing all traffic through their ELB. While Amazon Elastic MapReduce (Amazon EMR) and some partner tools are excellent solutions for ongoing, extensive analysis of this traffic, they can require advanced data and analytics skills. Often the need to analyze your ELB logs is in response to an incident, so the ability to analyze many terabytes of logs quickly with skills already available to the team is critical.

Fortunately, it’s a relatively straightforward process to set up an Amazon Redshift cluster and load your ELB access logs for analysis via SQL queries.  This post explains how to do this whether your goal is ad hoc, time-sensitive analysis in response to an incident or periodic, straightforward log analysis.  With Amazon Redshift’s ability to quickly provision a data warehouse cluster from terabytes to petabytes in size; ingest massive amounts of data in parallel; and expose that data via an ODBC/JDBC PostgreSQL interface; it is an excellent solution for SQL-based analysis of your ELB logs.  If you have experience with Amazon EMR and would perform MapReduce-style analysis on your log data, AWS has also created a tutorial to help you load ELB log data into Amazon EMR.

Examining Your Log Files

The first step is to determine the logs you wish to load and the space they require in your Amazon Redshift cluster.  Decide on the time period you want to analyze and follow the steps below to find the corresponding log files. Inside the Amazon Simple Storage Service (Amazon S3) bucket where ELB is writing the logs, you will find the following folder structure:

s3://yourbucketname/AWSLogs/youraccount#/elasticloadbalancing/region/year/month/day

The resulting folder structure in your Amazon S3 bucket will look something like this:

So the log files for March 5, 2014 in the us-east-1 region would be found in:

s3://corporation-com-elb-logs/ AWSLogs/515290123456/elasticloadbalancing/us-east-1/2014/03/05

Similarly, to specify the logs for all of March you would use:

s3://corporation-com-elb-logs/ AWSLogs/515290123456/elasticloadbalancing/us-east-1/2014/03

In each day’s folder you should find several objects.  To get the size of a single folder, you can either use the console and manually add up the size of all the objects, or you can use this CLI command (replacing the yourbucket below with your bucket name and the youraccount# with your account number):

aws s3 ls s3://yourbucket/AWSLogs/youraccount#/elasticloadbalancing/us-east-1/2014/03/16 –recursive | grep -v -E “(Bucket: |Prefix: |LastWriteTime|^$|–)” | awk ‘BEGIN {total=0}{total+=$3}END{print total/1024/1024″ MB”}’

To find the size of all the logs for March, just leave off the date folder like this:

aws s3 ls s3://yourbucket/AWSLogs/youraccount#/elasticloadbalancing/us-east-1/2014/03 –recursive | grep -v -E “(Bucket: |Prefix: |LastWriteTime|^$|–)” | awk ‘BEGIN {total=0}{total+=$3}END{print total/1024/1024″ MB”}’

Whichever method you choose, calculate the size of the data you will load.

Launch an Amazon Redshift Cluster

Next, you’ll set up an Amazon Redshift cluster to hold your Elastic Load Balancer access log data. To begin, log in to the AWS Console and select Redshift from the Services menu.

As part of launching your cluster, you must create a security group that allows you to import and access your data from your SQL client while blocking any other traffic.  The security group can open access to your Amazon Redshift cluster from a specific CIDR block or an Amazon Elastic Cloud Compute (Amazon EC2) security group.  While you will install the SQL client in a subsequent step, identify the host now so you can create the proper security group rules.  If you plan to run the SQL client on an Amazon EC2 instance, know the security group protecting that instance.  If you will run from an on-premises workstation, you need the IP address (or address range) that represents that workstation on the Internet.

After identifying your SQL client host, click Security in the left menu of the Amazon Redshift screen.  On the Security Groups tab, click the blue Create Cluster Security Group button.

Note: Cluster Security Groups are only necessary in EC2 Classic accounts when you are running outside of a VPC.  For VPC by Default accounts you will want to create a VPC Security Group with the same Inbound Rules through port 5439.

On the Cluster Security Group dialog box, fill out the fields as follows:

Field Description
Cluster Security Group Name Any name you like that refers to this group.
Description Description for this group.

After completing the fields, click Create.

The group is initially created blocking all traffic. You must create a rule to allow the traffic necessary for you to use your cluster.  Select the box next to your new group and then select CIDR/IP in the Connection Type dropdown box in the details window.

On the next screen, fill out the fields as follows:

Field Description
Connection Type Leave as CIDR/IP

CIDR/IP to Authorize

If you’ll run your SQL Client on this machine, fill in the box with a CIDR/IP range containing your IP address (which is shown on the screen).  If your SQL Client will be on a different machine, enter a CIDR block that includes that machine’s IP address.

After filling out the fields, click Authorize.

Note: The previous instructions identify your SQL Client by IP address (either on-premises or an Amazon EC2 instance with a public IP address).  If you are using an Amazon EC2 instance, you could alternatively identify it using a security group assigned to the instance by choosing EC2 Security Group in the Connection Type dropdown box.

Your security group is now created and it is time to create your cluster.  Click the Clusters link in the left column menu.  On the next page, click the blue Launch Cluster button.

On the cluster details page, fill out the fields as follows:

Field Description
Cluster Identifier Any lowercase name you want to apply to your cluster.
Database Name Name for your database within your cluster.  You don’t need to fill in this field. If you leave it blank the default name dev is used.
Database Port Port for connecting to the database.  You can leave this on the default value unless your corporate network firewall blocks port 5439, in which case you should choose a port not blocked by your network firewall.
Master User Name This account will be created for you to access the database.
Master User Password / Confirm Password Password for the Master User Name account.  Note the policy requirements.

After completing the required fields, click Continue.

On the Node Configuration page you specify the size of your cluster.  Allocate space equal to the size of the logs you calculated earlier; the data compression in Amazon Redshift will more than allow for built-in redundancy and system use.  Fill out the fields using the following guidance:

Field Description
Node Type Determines the CPU and storage space of each node.  The cheapest possible cluster is a single dc1.large node, which is currently 25 cents/hour in US-East-1 and provides 160 GB of SSD storage.  If you are new to Amazon Redshift, you may be eligible for a free trial that offers 750 hours per month of dc1.large usage for free for two months. Look at the storage of the different node types until you find a combination of Node Type and number of nodes that gives you enough space.  Small node types may not store as much data, but they allow you to scale up in smaller increments.
Cluster Type Decide between a cluster consisting of one compute node (single node cluster) or multiple compute nodes (multi-node cluster).  For very small data sets a single node will work fine, but once you get to several TB you should go with a multi-node cluster.
Number of Compute Nodes Choose how many compute nodes you want to deploy in your Amazon Redshift cluster.  Because Amazon Redshift uses the instance storage on the compute nodes for storage, the total storage of your cluster is the (total storage for your Node Type) x (the number of Nodes in the cluster).

After completing the fields, click Continue.

On the Additional Configuration page, fill out the fields using the following guidance:

Field

Description

Cluster Parameter Group Use the default value here.
Encrypt Database By choosing KMS or HSM in this box you instruct AWS to encrypt the data on your Amazon Redshift cluster with the Key Management System or CloudHSM.  You probably don’t need to choose this for your access logs and can choose None.
Choose a VPC

Choose Default VPC here.

If you have an older account that is EC2 Classic (without a default VPC) you will not see that option and you do not need to run in a VPC.  You can choose Not in VPC here.

Cluster Subnet Group (VPC Only) Use the default value.
Publicly Accessible (VPC Only) This determines whether the cluster can be accessed over the Internet as opposed to through a VPN or through a bastion host.  Choose Yes for this example.
Choose a Public IP Address (VPC Only) This determines whether the publicly facing IP address will be an EIP that you assign (Yes) or automatically assigned by AWS when the instance is launched (No).  Choose No for this exercise.
Availability Zone Leave this at No Preference for this use case.
VPC/Cluster Security Groups

Choose the Security Group you created at the beginning of this section.

If you are running outside a VPC, this will say Cluster Security Groups

Create Cloudwatch Alarm In this section, you set up an alarm that notifies you if your Amazon Redshift cluster is in danger of running out of space.  Since this use case is to create a cluster for a very short use case, choose No and ignore the rest of the fields in this section.

After completing the fields, click Continue.

On the Review page, check your settings. You can also see the hourly cost for your cluster before it launches.  If everything on this page looks correct and you are OK with the cost, click Launch Cluster.

Your cluster will take several minutes to launch. While it launches, go to the next section to set up your PostgreSQL client.

Set up PostgreSQL Client

To manipulate your Amazon Redshift cluster (for example, to set up tables, import data, and make queries), you must set up a PostgreSQL client.  There are several options, such as:

SQL Workbench/J: A free PostgreSQL client.  AWS provides instructions for installing the PostgreSQL client (scroll down to Install SQL Client and Drivers). Be sure to follow the link to Configure a JDBC Connection to download the Redshift JDBC driver.

Load ELB Logs into Amazon Redshift

Go back to the Amazon Redshift home page (under Services click Redshift) and click Clusters in the menu on the left.  When your cluster is ready, “available” appears in the Cluster Status column:

To see all the details for your cluster, click the Info icon next to the cluster name to display the screen below.

Key values on this screen are the JDBC and ODBC URLs for accessing your database (JDBC highlighted above).

Open your Amazon Redshift client and navigate to a screen that allows you to create a new connection.  This example demonstrates SQL Workbench/J, but should be easy to translate to your favorite SQL client.  In the diagram below, note the information from the Cluster Details screen.  The URL field contains the JDBC URL from the Cluster Info screen above.  Use these details to open a session with your Amazon Redshift cluster.

After connecting to your to your Amazon Redshift cluster, you can begin to manipulate it using PostgreSQL and Amazon Redshift commands.  First, create the table that will hold the access log data.  Use the CREATE TABLE command below to create the table:

CREATE TABLE elb_logs (
RequestTime DateTime encode lzo,
ELBName varchar(100) encode lzo,
RequestIP_Port varchar(22) encode lzo,
BackendIP_Port varchar(22) encode lzo,
RequestProcessingTime FLOAT encode bytedict,
BackendProcessingTime FLOAT encode bytedict,
ClientResponseTime FLOAT encode bytedict,
ELBResponseCode varchar(3) encode lzo,
BackendResponseCode varchar(3) encode lzo,
ReceivedBytes BIGINT encode lzo,
SentBytes BIGINT encode lzo,
HttpRequest varchar(5083) encode lzo,
UserAgent varchar(500) encode lzo,
SSL_Cipher varchar(40) encode lzo,
SSL_Protocol varchar(40) encode lzo
)

sortkey(RequestTime);

Note : The encode terms above define the specific compression used for each data column.  When designing your own tables in the future, Amazon Redshift’s COPY command can, on first load, automatically choose appropriate compression algorithms for each column based on sampling your input. An explanation of the contents of each of these fields can be found in the documentation.

Now that the table is created, copy your ELB access log data into it.  This use case uses the COPY command to take advantage of its parallel processing features for faster data load.  The COPY command loads multiple files into Amazon Redshift depending on the filespec you specify.  Here are some examples:

Include all the logs for March 16, 2014:

s3://yourbucket/AWSLogs/youraccount#/elasticloadbalancing/us-east-1/2014/03/16

Include all the logs for March 2014:

s3://yourbucket/AWSLogs/youraccount#/elasticloadbalancing/us-east-1/2014/03

To load two specific days of data, you must run the COPY command twice with a filespec for each day.  Once you determine the filespec(s) to load the correct data, use it in this COPY command:

copy
elb_logs
from
‘s3://yourfilespec
COMPUPDATE OFF
CREDENTIALS
‘aws_access_key_id=yourkey;aws_secret_access_key=yoursecretkey
delimiter ‘ ‘
TIMEFORMAT as ‘auto’
ACCEPTINVCHARS
REMOVEQUOTES
FILLRECORD
MAXERROR as 100000

The COPY command pumps data into Amazon Redshift at the fastest possible speed. Once it is complete, there are a couple additional steps to run that format your data for the best possible query performance.  Run the two commands below:

VACUUM elb_logs;

ANALYZE elb_logs;

You don’t need to run the VACUUM command after the initial load, but it is a good habit.  Your data is now loaded into your Amazon Redshift data warehouse and is ready for analysis.

Analyze Your Data

You can now use any PostgreSQL query to examine the data in your table.  To help you get started, here are a few common examples.

To get all the access requests between two times:

select
RequestTime,
HttpRequest,
RequestIP_Port,
to_char(BackendProcessingTime, ‘FM90D99999999’) BackendTime
from
elb_logs
where
requesttime >= cast(‘2014-03-18 03:00:00’ as DateTime) and
requesttime < cast(‘2014-03-18 04:00:00’ as DateTime)

To get the 10 slowest access requests:

select top 10
RequestTime,
ELBName,
RequestIP_Port,
BackendIP_Port,
to_char(RequestProcessingTime, ‘FM90D99999999′) RequestTime,
to_char(BackendProcessingTime,’FM90D99999999’) BackendTime,
to_char(ClientResponseTime, ‘FM90D99999999’) ClientResponseTime,
ELBResponseCode,
BackendResponseCode,
ReceivedBytes,
SentBytes,
HttpRequest,
SSL_Cipher,
SSL_Protocol
from elb_logs
order by BackendTime desc

Conclusion

With a few quick steps, you set up and populated a data warehouse containing a huge log of access requests and ran queries against that warehouse to gain insight into your ELB traffic.  After analyzing your data, remember to delete your cluster so you’re not charged for it.  As you have seen, it’s easy to recreate the cluster if you’d like to revisit your analysis.

If you have questions or suggestions, please add a comment below.

Update 12/3/14: This post was updated to accommodate log entries for non-HTTP calls which have a slightly different format. The NonHttpField column was added to the Amazon Redshift table and the FILLRECORD option was added to the COPY table.

Update 8/3/2015 Changed the table format and the copy command to keep quoted log entries as a single data value rather than parsing them.  This allows us to successfully do all ELB formats from 2014 and 2015.  Also highlighted prompts and options that will be different in VPC by Default accounts.

 

——————————————-

Related:

A Zero-Administration Amazon Redshift Database Loader