Analyze VPC Flow Logs with point-and-click Amazon Athena integration
Analyzing your VPC Flow Logs using Athena is now easier than ever! The recently introduced VPC Flow Logs integration with Amazon Athena helps you get started with extracting meaningful insights from VPC Flow Logs in just a few clicks. In this blog post, we will walk you through how you can use this recently announced integration.
VPC Flow Logs provide you with rich traffic telemetry data of how your workloads and resources on AWS are communicating. Analyzing VPC Flow Logs can help you understand how your applications are communicating over your VPC network, with log records containing the Instance ID, Source and Destination IP addresses, Subnet ID, VPC ID, and the type and volume of traffic, to name a few. They can also help you understand and optimize your security posture with information about flows that are permitted or dropped by your Network ACLs and Security Groups. With the recently introduced enriched metadata fields of AWS service, traffic path and traffic direction in VPC Flow Logs, you can now more readily derive insights about your AWS environment from VPC Flow Log data.
Some of the questions about your environment that you can answer from VPC Flow Logs include:
- Top Talkers: Which of my instances and applications are generating most traffic?
- Application Traffic Patterns: Which destinations does my application connect to, and how much traffic volume does it generate?
- Risky Traffic: Which resources receive high-risk traffic, such as administrative or SSH/RDP traffic?
- Internet Traffic: What is my traffic volume to the Internet? Which sources generate it?
- Network Troubleshooting: Is there traffic flowing between my source and destination as I would expect?
- Dropped Traffic: Which traffic is getting dropped due to Security Groups or NACLs?
These are just a few ways in which VPC Flow Logs can help you understand your AWS environment, and provide you with the data to help you operate and optimize your network on AWS. While the raw VPC Flow Logs by themselves provide detailed information about every single network traffic flow, you still need to filter and aggregate them to derive the necessary insights illustrated above. This is where you need an analytical tool such as Athena to query the raw VPC Flow Logs and get you to the required insights.
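To make the filter-and-aggregate step concrete, here is a minimal Python sketch of the kind of aggregation a Top Talkers question requires. It assumes records in the default (version 2) flow log format; the sample records, the function names and the underscore spelling of the field names are ours for illustration only, and this is not how Athena executes a query.

```python
from collections import Counter

# Field order of the default (version 2) flow log record format.
DEFAULT_FIELDS = [
    "version", "account_id", "interface_id", "srcaddr", "dstaddr",
    "srcport", "dstport", "protocol", "packets", "bytes",
    "start", "end", "action", "log_status",
]

# Made-up sample records, used only to exercise the functions below.
SAMPLE = [
    "2 123456789012 eni-0a1b2c3d 10.0.1.5 10.0.2.9 49152 443 6 10 8400 1620000000 1620000060 ACCEPT OK",
    "2 123456789012 eni-0a1b2c3d 10.0.1.5 10.0.2.9 49153 22 6 4 240 1620000000 1620000060 REJECT OK",
    "2 123456789012 eni-0e5f6a7b 10.0.3.7 10.0.1.5 443 49152 6 8 1200 1620000000 1620000060 ACCEPT OK",
    "2 123456789012 eni-0e5f6a7b - - - - - - - 1620000000 1620000060 - NODATA",
]


def parse_record(line):
    """Split one space-separated record into a dict keyed by field name."""
    values = line.split()
    if len(values) != len(DEFAULT_FIELDS):
        raise ValueError(f"expected {len(DEFAULT_FIELDS)} fields, got {len(values)}")
    return dict(zip(DEFAULT_FIELDS, values))


def top_talkers(lines, limit=50):
    """Return (srcaddr, total_bytes) pairs, largest total first."""
    totals = Counter()
    for line in lines:
        record = parse_record(line)
        # NODATA and SKIPDATA records carry "-" instead of a byte count.
        if record["bytes"].isdigit():
            totals[record["srcaddr"]] += int(record["bytes"])
    return totals.most_common(limit)


if __name__ == "__main__":
    for address, total in top_talkers(SAMPLE):
        print(address, total)
```

Doing this by hand over every log file in a bucket quickly becomes impractical, which is the gap a SQL engine over S3 fills.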
Several customers have been using Amazon Athena to analyze VPC Flow Logs exported to Amazon Simple Storage Service (Amazon S3). Amazon Athena allows you to easily and interactively query VPC Flow Log data stored in Amazon S3 using standard SQL. Athena provides a ready way for you to analyze your VPC Flow Logs since it is serverless and hence there is no infrastructure for you to manage.
Until now, you needed to manually perform certain preparatory steps in Athena to make your VPC Flow Log data in S3 queryable from Athena. These included 1) creating a table in Athena to define the VPC Flow Log data format, 2) creating the date-based partitions for efficiently querying the Flow Log data, and 3) writing your own automation to periodically create new partitions and load data so that you can use Athena on an ongoing basis to analyze the log data. If you do not usually use Data and Analytics services from AWS, you may be unfamiliar with these steps and hence not be able to readily analyze your VPC Flow Log data stored in S3.
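As an illustration of step 2, the sketch below generates one Athena `ALTER TABLE ... ADD PARTITION` statement per day for a date range. It is a hand-rolled example under stated assumptions: the table name `vpc_flow_logs`, the partition column named `date`, and the bucket and account ID in the usage example are hypothetical, and the generated template may partition the table differently.

```python
from datetime import date, timedelta


def daily_partition_statements(table, logs_location, start, end):
    """Build one ADD PARTITION statement per day, from start to end inclusive.

    logs_location is the S3 prefix under which flow logs are written in
    year/month/day folders (an assumption about the bucket layout).
    """
    base = logs_location.rstrip("/")
    statements = []
    day = start
    while day <= end:
        location = f"{base}/{day:%Y/%m/%d}/"
        statements.append(
            f"ALTER TABLE {table} ADD IF NOT EXISTS "
            f"PARTITION (`date`='{day:%Y-%m-%d}') LOCATION '{location}'"
        )
        day += timedelta(days=1)
    return statements


if __name__ == "__main__":
    # Hypothetical bucket, account ID and region.
    prefix = "s3://example-bucket/AWSLogs/123456789012/vpcflowlogs/us-east-1/"
    for statement in daily_partition_statements(
        "vpc_flow_logs", prefix, date(2021, 4, 30), date(2021, 5, 1)
    ):
        print(statement)
```

Each statement would still have to be run in Athena, and rerun on a schedule as new days of data arrive, which is the ongoing automation described in step 3.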
The recently announced Athena integration for VPC Flow Logs makes it really easy for you to get started with using Athena for analyzing VPC Flow Logs. It generates a CloudFormation template that performs all the preparatory steps required for analyzing VPC Flow Logs using Athena, right from the AWS Management Console. It automatically deploys this CloudFormation template to get you started. You can also download this CloudFormation template, customize it as needed and then deploy it. Lastly, it creates a few pre-defined queries in Athena, so that you can start analyzing VPC Flow Log data without having to write any SQL.
A few of the queries provided through the template are:
- VpcFlowLogsTrafficFrmSrcAddr – All traffic from a particular source IP address
- VpcFlowLogsTrafficToDstAddr – All traffic to a particular destination IP address
- VpcFlowLogsTopTalkers – The top 50 IP traffic sources by bytes
- VpcFlowLogsTotalBytesTransferred – Top 50 source-destination IP address pairs by bytes
- VpcFlowLogsRejectedTraffic – Top 25 source-destination IP address pairs for which traffic was rejected
For the complete list of pre-defined Athena queries that the template provides, take a look at the VPC documentation.
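To give a feel for what such a query looks like, here is a small Python helper that builds SQL in the spirit of the rejected-traffic query described above. This is our own approximation and not the text of the query that the template creates; the table name, the `rejected_flows` alias and the helper itself are hypothetical, and only the `srcaddr`, `dstaddr` and `action` flow log fields are taken as given.

```python
def rejected_traffic_query(table, limit=25):
    """Return SQL listing the source-destination pairs with the most rejected flows."""
    # Only accept simple identifiers, because the name is placed directly in the SQL.
    if not table.replace("_", "").isalnum():
        raise ValueError(f"unexpected table name: {table!r}")
    if not isinstance(limit, int) or limit <= 0:
        raise ValueError("limit must be a positive integer")
    return (
        "SELECT srcaddr, dstaddr, count(*) AS rejected_flows\n"
        f"FROM {table}\n"
        "WHERE action = 'REJECT'\n"
        "GROUP BY srcaddr, dstaddr\n"
        "ORDER BY rejected_flows DESC\n"
        f"LIMIT {limit}"
    )


if __name__ == "__main__":
    print(rejected_traffic_query("vpc_flow_logs"))
```

The predefined queries spare you from writing and maintaining statements like this yourself, but you can still edit them in the Athena query editor.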
In the rest of this blog post, we briefly describe the CloudFormation template created as a part of this integration, and we will walk you through how to use this integration in a step-wise manner.
VPC Flow Logs integration with Amazon Athena
The figure below shows a high-level overview of the CloudFormation template that is created for this integration. You can also download and customize the CloudFormation templates to modify the infrastructure setup or queries per your requirements.
You still need to create the VPC Flow Log subscription with S3 as a destination as illustrated in the VPC documentation. Once you have created a VPC Flow Log subscription, the CloudFormation template creates the rest of the elements necessary to analyze the VPC Flow Logs in S3 using Athena.
The CloudFormation template creates the following:
- A partitioned table in AWS Glue corresponding to the VPC Flow Logs records
- A database in AWS Glue to store the AWS Glue tables
- A Lambda function that loads new partitions to the table on the specified schedule (daily, weekly, or monthly)
- An IAM role that grants permission to run the Lambda function
- A workgroup in Athena to store the named queries, along with a set of named queries in the workgroup
Using VPC Flow Logs Athena integration
The following steps provide detailed information on how to enable the feature and analyze VPC Flow Logs using Athena.
Step 1 – Generate CloudFormation template
After you have created your VPC Flow Logs subscription with S3 as the destination, you can generate the CloudFormation template to load the logs into Athena. Navigate to the VPC Flow Logs console, select a flow log subscription that publishes to Amazon S3, and then choose Actions, Generate Athena integration.
Now, you need to provide some information to help populate the CloudFormation template that will get generated for you.
In the template settings form, you specify:
- Partition Load Frequency: You can simply specify a Daily, Weekly, or Monthly schedule to periodically load your newly generated VPC Flow Log data into Athena. Note that this will only apply for the future VPC Flow Log data. If you want to load existing data, you need to choose None as your Partition Load Frequency, and specify the partition start and end dates.
If you want a template that both loads partitions on a periodic schedule and loads partitions for existing data, you can use the CLI to generate it:
aws ec2 get-flow-logs-integration-template
- Select or create an S3 bucket for the generated template, and an S3 bucket for the query results.
- Click Generate Athena integration.
This will generate a fully populated CloudFormation template with all the automation necessary to set up Amazon Athena for your VPC Flow Logs analysis.
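For readers who prefer scripting the CLI route mentioned above, the following Python sketch assembles the request for the same API operation. The parameter names reflect our understanding of the EC2 `GetFlowLogsIntegrationTemplate` API and should be checked against the current API reference before use; the flow log ID and bucket ARNs are placeholders, and both function names are our own. The `boto3` call is kept in a separate function so that the request-building logic runs without AWS credentials.

```python
from datetime import datetime, timezone

ALLOWED_FREQUENCIES = {"none", "daily", "weekly", "monthly"}


def integration_template_request(flow_log_id, template_bucket_arn,
                                 results_bucket_arn, frequency="daily",
                                 start=None, end=None):
    """Build the keyword arguments for the template-generation request."""
    if frequency not in ALLOWED_FREQUENCIES:
        raise ValueError(f"frequency must be one of {sorted(ALLOWED_FREQUENCIES)}")
    if frequency == "none" and (start is None or end is None):
        # With no schedule, the partition start and end dates define what is loaded.
        raise ValueError("start and end are required when frequency is 'none'")
    athena = {
        "IntegrationResultS3DestinationArn": results_bucket_arn,
        "PartitionLoadFrequency": frequency,
    }
    if start is not None:
        athena["PartitionStartDate"] = start
    if end is not None:
        athena["PartitionEndDate"] = end
    return {
        "FlowLogId": flow_log_id,
        "ConfigDeliveryS3DestinationArn": template_bucket_arn,
        "IntegrateServices": {"AthenaIntegrations": [athena]},
    }


def fetch_template_location(request):
    """Send the request; needs boto3 and AWS credentials (not exercised here)."""
    import boto3  # imported lazily so the builder above has no dependencies

    response = boto3.client("ec2").get_flow_logs_integration_template(**request)
    return response["Result"]


if __name__ == "__main__":
    # Placeholder identifiers; a schedule plus a date range for existing data.
    request = integration_template_request(
        "fl-0123456789abcdef0",
        "arn:aws:s3:::example-template-bucket",
        "arn:aws:s3:::example-results-bucket",
        frequency="daily",
        start=datetime(2021, 4, 1, tzinfo=timezone.utc),
        end=datetime(2021, 4, 30, tzinfo=timezone.utc),
    )
    print(request)
```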
Step 2 – Create CloudFormation stack
In the success message, choose Create CloudFormation stack to open the Create Stack wizard in the AWS CloudFormation console. The URL for the generated CloudFormation template is specified in the Template section. Complete the wizard to create the resources that are specified in the template.
Step 3 – Analyze network traffic using predefined queries
Once your CloudFormation stack has been created, you can use Athena to analyze your VPC Flow Log data.
To do so, navigate to the Athena console. First, make sure that you select the correct Data Source and Database to query as shown in the figures below.
Figure 6: Athena Workgroup
Running a predefined query
The CloudFormation template provides a set of predefined queries that you can run to quickly get some insights about the traffic in your AWS network. To access these pre-defined queries, navigate to the Workgroup panel, and set your Workgroup to the one created for the VPC Flow Logs. Then, navigate to the Saved queries panel to see the list of pre-defined queries.
In this example, we have selected the “VpcFlowLogsRejectedTraffic” query to see all the traffic that has been blocked.
Click on the query, and you will be taken to the Query editor, where you can see the query and modify it as needed. Click “Run query” to see the results of your query in Athena. The results of the query are also saved in the S3 bucket you specified. As seen in this figure, we can now see all the traffic that is being blocked by our Security Groups or Network ACLs.
Note that the queries created for you by the generated CloudFormation template depend on the VPC Flow Log fields that are enabled in your Flow Log subscription. To get the most flexibility, ensure you have all relevant fields enabled when you create your Flow Log subscription.
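One way to reason about this dependency is to check a custom log format string for the fields a query needs. The sketch below does that in Python. The field requirements listed in it are our own inference from the query descriptions earlier in this post, not something published in the template, and the function names are hypothetical.

```python
import re

# Assumed requirements, inferred from the query descriptions (not authoritative).
REQUIRED_FIELDS = {
    "VpcFlowLogsTopTalkers": {"srcaddr", "bytes"},
    "VpcFlowLogsTotalBytesTransferred": {"srcaddr", "dstaddr", "bytes"},
    "VpcFlowLogsRejectedTraffic": {"srcaddr", "dstaddr", "action"},
}


def fields_in_format(log_format):
    """Extract field names from a custom format such as '${srcaddr} ${bytes}'."""
    return set(re.findall(r"\$\{([a-z0-9-]+)\}", log_format))


def missing_fields(log_format):
    """Map each query name to the assumed fields the format lacks, if any."""
    present = fields_in_format(log_format)
    return {
        name: sorted(needed - present)
        for name, needed in REQUIRED_FIELDS.items()
        if needed - present
    }


if __name__ == "__main__":
    print(missing_fields("${srcaddr} ${dstaddr} ${bytes}"))
```

A format that omits `action`, for example, could not support a query that filters on rejected traffic.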
To avoid ongoing charges for the resources you created, delete the CloudFormation stack you deployed (referenced in Step 2 – Figure 4 above) from the CloudFormation console. Also delete the S3 buckets that were used to store the CloudFormation template and query results (referenced in Step 1 – Figure 3 above).
You can now easily get started with using Amazon Athena to analyze your VPC Flow Logs stored in Amazon S3. You no longer have to manually create an Athena table, partition it, and load data into it. The CloudFormation template provided as a part of the VPC Flow Logs Athena integration automates these initial steps. You can deploy this CloudFormation template to perform the setup automatically, and you also get a set of named queries in Athena to help you analyze VPC Flow Log data and get insights about your AWS environment based on network traffic data.