AWS Big Data Blog

Analyze and visualize your VPC network traffic using Amazon Kinesis and Amazon Athena

Network log analysis is a common practice in many organizations.  By capturing and analyzing network logs, you can learn how devices on your network are communicating with each other and with the internet.  There are many reasons for performing log analysis, such as audit and compliance, system troubleshooting, or security forensics.  Within an Amazon Virtual Private Cloud (VPC), you can capture network flows with VPC Flow Logs.  You can create a flow log for a VPC, a subnet, or a network interface.  If you create a flow log for a subnet or VPC, each network interface in the VPC or subnet is monitored. Flow log data is published to a log group in Amazon CloudWatch Logs, and each network interface has a unique log stream.

CloudWatch Logs provides some great tools to get insights into this log data.  However, in most cases, you want to efficiently archive the log data to S3 and query it using SQL.  This provides more flexibility and control over log retention and the analysis you want to perform.  You often also want near real-time insights into that log data, obtained by analyzing it automatically soon after it has been generated.  And you want to visualize certain network characteristics on a dashboard so you can more clearly understand the network traffic within your VPC.  So how can you accomplish efficient log archival to S3, near real-time network analysis, and data visualization together?  You can do this by combining several capabilities of CloudWatch, Amazon Kinesis, AWS Glue, and Amazon Athena, but setting up this solution and configuring all the services can be daunting.

In this blog post, we describe the complete solution for collecting, analyzing, and visualizing VPC flow log data.  In addition, we created a single AWS CloudFormation template that lets you efficiently deploy this solution into your own account.

Solution overview

This section describes the overall architecture and each step of this solution.

We want the ability to query the flow log data in a one-time, or ad hoc, fashion. We also want to analyze it in near real time. So our flow log data takes two paths through the solution.  For ad hoc queries, we use Amazon Athena.  By using Athena, you can use standard SQL to query data that has been written to S3.  An Athena best practice to improve query performance and reduce cost is to store data in a columnar format such as Apache Parquet.  This solution uses Kinesis Data Firehose’s record format conversion feature to convert the flow log data to Parquet before it writes the files to S3. Converting the data into a compressed, columnar format lowers your cost and improves query performance by enabling Athena to scan less data from S3 when executing your queries.

By streaming the data to Kinesis Data Firehose from CloudWatch Logs, we have enabled our second path for near real-time analysis on the flow log data.  Kinesis Data Analytics is used to analyze the log data as soon as it is delivered to Kinesis Data Firehose.  The Analytics application aggregates key data from the flow logs and creates custom CloudWatch metrics that are used to drive a near real-time CloudWatch dashboard.

Let’s review each step in detail.

1.  VPC Flow Logs

The VPC Flow Logs feature captures the network flows in a VPC.  In this solution, it is assumed that you want to capture all network traffic within a single VPC.  By using the CloudFormation template, you can define the VPC you want to capture.  Each line in the flow log contains space-delimited information about the packets traversing the network between two entities, the source and the destination.  The log line contains details including the source and destination IP addresses and ports, the number of packets, and the action taken on that data, such as whether it was accepted or rejected.  Here’s an example of a typical flow log:

2 123456789010 eni-abc123de 172.31.16.139 172.31.16.21 20641 22 6 20 4249 1418530010 1418530070 ACCEPT OK

For more information about each item in the line, see Flow Log Records.  Note that VPC flow logs buffer for about 10 minutes before they’re delivered to CloudWatch Logs.

2.  Stream to Kinesis Data Firehose

By creating a CloudWatch Logs subscription, our flow logs can automatically be streamed when they arrive in CloudWatch Logs.  This solution’s subscription filter uses Kinesis Data Firehose as its destination.  Kinesis Data Firehose is the most effective way to load streaming data into data stores, such as Amazon S3.  The CloudWatch Logs subscription filter has also been configured to parse the space-delimited log lines and create a structured JSON object for each line in the log.  The attribute names in the object follow the names that VPC Flow Logs defines for each element.  Therefore, the example log line referenced earlier streams as the following JSON record:

{
    "version": 2,
    "account-id": "123456789010",
    "interface-id": "eni-abc123de",
    "srcaddr": "172.31.16.139",
    "dstaddr": "172.31.16.21",
    "srcport": 20641,
    "dstport": 22,
    "protocol": 6,
    "packets": 20,
    "bytes": 4249,
    "start": 1418530010,
    "end": 1418530070,
    "action": "ACCEPT",
    "log-status": "OK"
}
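The mapping from the space-delimited line to the JSON record above can be sketched in a few lines of Python.  This is only an illustration of the field order and types, inferred from the example record; in the solution itself the subscription filter performs this parsing, and the function name here is hypothetical.

```python
# Field names and order follow the JSON record shown above. Which fields are
# numeric is inferred from that example, not taken from the solution's source.
FIELDS = [
    ("version", int), ("account-id", str), ("interface-id", str),
    ("srcaddr", str), ("dstaddr", str), ("srcport", int), ("dstport", int),
    ("protocol", int), ("packets", int), ("bytes", int),
    ("start", int), ("end", int), ("action", str), ("log-status", str),
]


def parse_flow_log_line(line):
    """Turn one space-delimited flow log line into a dict keyed by field name."""
    parts = line.split()
    if len(parts) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(parts)}")
    # Records without traffic data carry "-" in place of a value; map it to None.
    return {
        name: (None if value == "-" else cast(value))
        for (name, cast), value in zip(FIELDS, parts)
    }


record = parse_flow_log_line(
    "2 123456789010 eni-abc123de 172.31.16.139 172.31.16.21 "
    "20641 22 6 20 4249 1418530010 1418530070 ACCEPT OK"
)
```

Here `record["dstport"]` is the integer 22 and `record["account-id"]` stays a string, matching the JSON record.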

CloudWatch Logs subscriptions send data to the configured destination as a gzipped collection of records.  Before we can analyze the data, we must first decompress it.

3.  Decompress records with AWS Lambda

There may be situations where you want to transform or enrich streaming data before writing it to its final destination.  In this solution, we must decompress the data that is streamed from CloudWatch Logs.  With the Amazon Kinesis Data Firehose Data Transformation feature, we can decompress the data with an AWS Lambda function.  Kinesis Data Firehose manages the invocation of the function.  Inside the function, the data is decompressed and returned to Kinesis Data Firehose.  The complete source code for the Lambda function can be found here.
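As a rough idea of what such a transformation function does, here is a minimal Python sketch.  It is not the solution's actual Lambda function.  It assumes the standard Firehose transformation contract (base64-encoded records in, records with a recordId, result, and data out) and the standard CloudWatch Logs subscription payload (a gzipped JSON document with messageType and logEvents); emitting one JSON line per log event is a choice made for this example.

```python
import base64
import gzip
import json


def handler(event, context):
    """Firehose data-transformation handler: gunzip CloudWatch Logs payloads."""
    output = []
    for record in event["records"]:
        payload = json.loads(gzip.decompress(base64.b64decode(record["data"])))

        if payload.get("messageType") != "DATA_MESSAGE":
            # Control messages carry no log events, so drop them.
            output.append({"recordId": record["recordId"], "result": "Dropped"})
            continue

        lines = []
        for log_event in payload["logEvents"]:
            # extractedFields is present when the subscription filter pattern
            # names the fields; otherwise fall back to the raw message.
            fields = log_event.get("extractedFields") or {"message": log_event["message"]}
            lines.append(json.dumps(fields))

        data = ("\n".join(lines) + "\n").encode("utf-8")
        output.append({
            "recordId": record["recordId"],
            "result": "Ok",
            "data": base64.b64encode(data).decode("utf-8"),
        })
    return {"records": output}
```

Note that values in extractedFields arrive as strings, so any numeric typing would have to happen downstream, for example through the table schema.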

4.  Convert data to Apache Parquet

To take advantage of the performance capabilities in Amazon Athena, we convert the streaming data to Apache Parquet before persisting it to S3.  We use the record format conversion capabilities of Kinesis Data Firehose to perform this conversion.  When converting from JSON to Parquet, Kinesis Data Firehose must know the schema.  To accomplish this, we configure a table in the Glue Data Catalog.  In this table, we map the attributes of our incoming JSON records to fields in the table.

5.  Persist data to Amazon S3

When using the data format conversion feature in Kinesis Data Firehose, the only supported destination is S3.  Kinesis Data Firehose buffers data for a period of time, or until a data size threshold is met, before it creates the Parquet files in S3.  In general, converting to Parquet results in effective file compression.  If the file size is too small, it isn’t optimal for Athena queries.  To maximize the file sizes created in S3, the solution has been configured to buffer for 15 minutes, or 128 MB.  However, you can adjust this configuration to meet your needs by using the Kinesis Data Firehose console.
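The "whichever comes first" buffering rule can be illustrated with a small Python sketch.  The key names mirror the Firehose BufferingHints setting; the function itself is hypothetical and only models the rule, and treating 1 MB as 1024 × 1024 bytes is an assumption.

```python
# Thresholds from the text: 15 minutes or 128 MB, whichever is reached first.
BUFFERING_HINTS = {"IntervalInSeconds": 15 * 60, "SizeInMBs": 128}


def should_flush(buffered_bytes, seconds_since_last_flush, hints=BUFFERING_HINTS):
    """Return True when either the size or the time threshold has been met."""
    size_limit = hints["SizeInMBs"] * 1024 * 1024  # assumption: binary megabytes
    return (
        buffered_bytes >= size_limit
        or seconds_since_last_flush >= hints["IntervalInSeconds"]
    )
```

On a quiet VPC the time threshold is usually the one that triggers, so lowering the interval gives fresher data in Athena at the cost of smaller files.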

6.  Query flow logs with SQL in Athena

In this solution, Athena uses the database and table created in the Glue Data Catalog to make your flow log data queryable.  There are sample queries to review later in this article.

7.  Analyze the network flows in near real-time with Kinesis Data Analytics

Following the data through the first six steps, the solution enables you to query flow log data using SQL in Athena.  This is great for ad hoc queries, or querying data that was generated over a long period of time.  However, to get the most out of the data, you should analyze it as soon as possible after it is generated.  To accomplish this, the solution uses Kinesis Data Analytics (KDA), which enables you to query streaming data using SQL, to analyze the flow logs and extract some immediate insights.  In this solution, the KDA application uses a Lambda function to decompress the gzipped records from Kinesis Data Firehose, and then analyzes the flow log data to create some aggregations of interest.  The KDA application creates the following aggregations:

  • A count of rejected TCP packets, every 15 minutes.
  • A count of rejected packets by protocol, every 15 minutes.

These metrics are aggregated over a 15-minute window.  At the end of the window, KDA invokes a Lambda function, passing the aggregated values as input to the function.
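To make the windowing concrete, here is a sketch of the per-protocol aggregation in plain Python rather than KDA's streaming SQL.  It assumes records shaped like the JSON example earlier, and it assigns each record to a window by its start field, which is an assumption; the KDA application may window on arrival time instead.

```python
from collections import defaultdict

WINDOW_SECONDS = 15 * 60


def rejected_packets_by_window(records):
    """Sum rejected packets per (window start, protocol) over 15-minute tumbling windows."""
    totals = defaultdict(int)
    for r in records:
        if r["action"] != "REJECT":
            continue
        # Align the record to the start of its non-overlapping 15-minute window.
        window_start = r["start"] - r["start"] % WINDOW_SECONDS
        totals[(window_start, r["protocol"])] += r["packets"]
    return dict(totals)
```

Because the windows do not overlap, each record contributes to exactly one total, which is what makes the results safe to publish as one metric value per window.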

8.  Write the aggregations as custom CloudWatch metrics

At the end of the 15-minute window, KDA invokes a Lambda function, passing the aggregated values.  The Lambda function writes these values to CloudWatch as custom metrics. This enables the solution to support alarms on those metrics using CloudWatch alarms, and it enables custom dashboards to be created from the metrics.
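A function of this kind might look like the following Python sketch.  The namespace, metric name, dimension name, and input row shape are all hypothetical; only put_metric_data and the layout of its MetricData entries come from the CloudWatch API as exposed by boto3.

```python
from datetime import datetime, timezone

NAMESPACE = "VPCFlowLogs/Custom"  # hypothetical namespace


def build_metric_data(rows):
    """rows: dicts with 'protocol', 'rejected_packets', and epoch-seconds 'window_end'."""
    return [
        {
            "MetricName": "RejectedPackets",  # hypothetical metric name
            "Dimensions": [{"Name": "Protocol", "Value": str(row["protocol"])}],
            "Timestamp": datetime.fromtimestamp(row["window_end"], tz=timezone.utc),
            "Value": float(row["rejected_packets"]),
            "Unit": "Count",
        }
        for row in rows
    ]


def publish(cloudwatch_client, rows):
    """Write the aggregated rows as custom metrics using the given CloudWatch client."""
    cloudwatch_client.put_metric_data(
        Namespace=NAMESPACE,
        MetricData=build_metric_data(rows),
    )
```

In a Lambda function the client would typically be created once with boto3.client("cloudwatch") and passed to publish; taking it as a parameter keeps the payload-building logic easy to test.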

9.  View the aggregated data in CloudWatch dashboards

CloudWatch dashboards are customizable home pages in the CloudWatch console for monitoring your resources in a single view.  You can use CloudWatch dashboards to create customized views of the metrics and alarms for your AWS resources. In this solution, we create a dashboard that monitors the custom aggregations created in our KDA application. The solution creates a sample dashboard to get you started, but you should review the metrics and create a dashboard and alarms to meet your needs.

Deploying the solution

To deploy this solution into your own account, you use the CloudFormation template to create the stack. You can deploy the solution stack into the following AWS Regions: US East (N. Virginia), US West (Oregon), and EU (Ireland).  To deploy, choose the link for the Region where you want to deploy.  The CloudFormation console for that Region opens, and the template URL is pre-populated:

Deploy the solution in:

US East (N. Virginia)

The Create Stack wizard for CloudFormation opens with the template location pre-populated.  Choose Next, and you are prompted to provide values for several template parameters.

Let’s review what each parameter represents:

  • Stack name — The name for this CloudFormation stack.  You can rename it from the default, but choose a short (up to 16 characters) name, and ensure your name uses only lower-case letters.  The value you use here will be used as a prefix in the name of many of the resources created by this stack.  By providing a short name with lower-case letters, the names for those resources will pass the resource naming rules.
  • S3BucketName — The name of the S3 bucket into which the Parquet files are delivered. This name must be globally unique.
  • VPCId — The ID of the existing VPC for which flow logs are captured.

Choose Next, and accept any defaults for the remainder of the CloudFormation wizard. The stack is created in a few minutes.

Analyze the flow log data

After the stack has been deployed, it may take up to 15 minutes before data can be queried in Athena, or viewed in the CloudWatch dashboard.  Let’s look at a few sample queries you can run in Athena to learn more about the network traffic within your VPC.

Navigate to the Athena console in the Region where you deployed the stack.  In the console, choose the database named “vpc_flow_logs”.  Notice that this database contains one table, named “flow_logs.”  Run the following query to see which protocol is being rejected the most within your VPC:

select protocol, sum(packets) as rejected_packets
from flow_logs
where action = 'REJECT'
group by protocol
order by rejected_packets desc

Your results should look similar to the following example:

This example shows that the value for the protocol field follows the standard defined by the Internet Assigned Numbers Authority (IANA).  So in the previous example, the top two rejected protocols are TCP and ICMP.
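If you post-process query results, a small lookup makes the protocol numbers easier to read.  The sketch below covers only three common IANA protocol numbers (1, 6, and 17); the helper name and the fallback label are made up for this example.

```python
# A few well-known IANA protocol numbers; extend as needed for your traffic.
PROTOCOL_NAMES = {1: "ICMP", 6: "TCP", 17: "UDP"}


def protocol_name(number):
    """Return a readable name for an IANA protocol number, or a fallback label."""
    return PROTOCOL_NAMES.get(int(number), f"protocol-{number}")
```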

Here are a few additional queries to help you understand the network traffic in your VPC:

Identify the top 10 IP addresses from which packets were rejected in the past 2 weeks:

SELECT
	srcaddr,
	SUM(packets) AS rejected_packets
FROM flow_logs
WHERE action = 'REJECT'
	AND start >= current_timestamp - interval '14' day
GROUP BY srcaddr
ORDER BY rejected_packets DESC
LIMIT 10;

Identify the top 10 servers that are receiving the highest number of HTTPS requests:

SELECT
	dstaddr,
	SUM(packets) AS packet_count
FROM flow_logs
WHERE dstport = 443
GROUP BY dstaddr
ORDER BY packet_count DESC
LIMIT 10;

Now let’s look at the analysis we’re performing in real time with Kinesis Data Analytics.  By default, the solution creates a dashboard named “VPC-Flow-Log-Analysis.”  On this dashboard, we’ve created a few default widgets.  The aggregate data being generated by KDA is plotted in a few charts, as shown in the following example:

This example shows that the Rejected Packets per Protocol chart has been created to plot only a subset of all possible protocols.  Modify this widget to show the protocols that are relevant for your environment.

Next steps

The solution outlined in this blog post provides an efficient way to get started with analyzing VPC Flow Logs.  To get the most out of this solution, consider these next steps:

  • Create partitions in the Glue table to help optimize Athena query performance. The current solution creates data in S3 partitioned by Y/M/D/H; however, these S3 prefixes are not automatically mapped to Glue partitions.  This means that Athena queries scan all Parquet files, so query performance degrades as the volume of data grows.  For more information about partitioning and Athena tuning, see Top 10 Performance Tuning Tips for Amazon Athena.
  • Apply the solution to additional VPCs, or in different Regions. If your account contains multiple VPCs, or if your infrastructure is deployed in multiple Regions, you must create the stack in those Regions.  If you have multiple VPCs within the same Region, you can create a new flow log for each additional VPC by using the VPC console.  Configure the flow log to deliver to the same destination log group that was created when the stack was initially created (the CWLogGroupName parameter value in the CloudFormation template).
  • Modify the default widgets in the CloudWatch dashboard. The stack created a few default widgets on the CloudWatch dashboard; however, you can create more to meet your needs, based on the insights you’d like to get from the flow logs in your environment.
  • Create additional queries in Athena to learn more about your network behavior.
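For the first of these next steps, one possible approach is to register each hourly S3 prefix as a partition with Athena DDL.  The Python sketch below only generates the statement.  It assumes that the table has been redefined with partition keys named year, month, day, and hour (hypothetical names), and that the files sit directly under a Y/M/D/H prefix at the root of the bucket.

```python
from datetime import datetime, timezone


def add_partition_statement(bucket, when, table="flow_logs"):
    """Build Athena DDL that registers one hourly S3 prefix as a partition."""
    year, month, day, hour = when.strftime("%Y %m %d %H").split()
    # Assumed layout: s3://<bucket>/YYYY/MM/DD/HH/
    location = f"s3://{bucket}/{year}/{month}/{day}/{hour}/"
    return (
        f"ALTER TABLE {table} ADD IF NOT EXISTS "
        f"PARTITION (year='{year}', month='{month}', day='{day}', hour='{hour}') "
        f"LOCATION '{location}'"
    )


statement = add_partition_statement(
    "my-flow-log-bucket",  # hypothetical bucket name
    datetime(2018, 9, 1, 13, tzinfo=timezone.utc),
)
```

Running a statement like this on a schedule, once per hour, would keep the partitions current so that queries filtering on those keys scan only the matching prefixes.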

Conclusion

Using the solution provided in this blog post, you can quickly analyze the network traffic in your VPC.  It provides both a near real-time analysis path and the capability to query historical data.  You can get the most out of this solution by extending it with queries and visualizations of your own to meet the needs of your system.

 


Additional Reading

If you found this post useful, be sure to check out Analyze Apache Parquet optimized data using Amazon Kinesis Data Firehose, Amazon Athena, and Amazon Redshift and Preprocessing Data in Amazon Kinesis Analytics with AWS Lambda.


About the Author

Allan MacInnis is a Solutions Architect at Amazon Web Services. He works with our customers to help them build streaming data solutions using Amazon Kinesis. In his spare time, he enjoys mountain biking and spending time with his family.