AWS Cloud Operations & Migrations Blog

Analyzing Amazon VPC Flow Log data with support for Amazon S3 as a destination

In a world of highly distributed applications and increasingly bespoke architectures, data monitoring tools help DevOps engineers stay abreast of ongoing system problems. This post focuses on one such feature: Amazon VPC Flow Logs.
In this post, I explain how you can deliver flow log data to Amazon S3 and then use Amazon Athena to run SQL queries on the data. I also show you how to visualize the logs in near real-time using Amazon QuickSight. Together, these steps condense terabytes of flow log data into a single, approachable view.
Before I start explaining the solution in detail, I review some basic concepts about flow logs and Amazon CloudWatch Logs.

What are flow logs, and why are they important?
Flow logs enable you to capture and analyze the IP traffic going to and from network interfaces in your VPC. For example, if you run a content delivery platform, flow logs can help you profile, analyze, and predict customer content-access patterns, and track down top talkers and malicious traffic.
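
For reference, a flow log record in the default format is a space-separated line containing the version, account-id, interface-id, srcaddr, dstaddr, srcport, dstport, protocol, packets, bytes, start, end, action, and log-status fields. A record for accepted SSH traffic looks similar to the following (the values shown are illustrative):

2 123456789010 eni-0a1b2c3d 172.31.16.139 172.31.16.21 20641 22 6 20 4249 1418530010 1418530070 ACCEPT OK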

Some of the benefits of flow logs include:

  • You can publish flow log data to CloudWatch Logs and S3, and query or analyze it from either platform.
  • You can troubleshoot why specific traffic is not reaching an instance, which helps you diagnose overly restrictive security group rules.
  • You can use flow logs as an input to security tools to monitor the traffic reaching your instance.
  • For applications that run in multiple AWS Regions or use a multi-account architecture, you can analyze and identify the accounts and Regions that receive the most traffic.
  • You can predict seasonal peaks based on historical data of incoming traffic.

Using CloudWatch to analyze flow logs
AWS originally introduced VPC Flow Logs to publish data to CloudWatch Logs, a monitoring and observability service for developers, system operators, site reliability engineers, and IT managers. CloudWatch integrates with more than 70 AWS services that generate logs, such as Amazon VPC, AWS Lambda, and Amazon Route 53, giving you a single place to monitor all your AWS resources, applications, and services that run on AWS and on on-premises servers.

CloudWatch Logs publishes your flow log data to a log group, with each network interface generating a unique log stream in the log group. Log streams contain flow log records. You can create multiple flow logs that publish data to the same log group. For example, you can use cross-account log data sharing with subscriptions to send multiple flow logs from different accounts in your organization to the same log group. This lets you audit accounts for real-time intrusion detection.

You can also use CloudWatch Logs subscriptions to access a real-time feed of flow log events and deliver that feed to other services, such as Amazon Kinesis Data Streams, Amazon Kinesis Data Firehose, or AWS Lambda, for custom processing, transformation, analysis, or loading into other systems.
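
For example, a CloudWatch Logs subscription filter can stream flow log events from a log group to a Kinesis data stream for downstream processing. The following is a minimal sketch; the log group name, stream ARN, and role ARN are placeholders for resources you have already created:

# Stream flow log events from a CloudWatch Logs log group to an existing
# Kinesis data stream (log group name, stream ARN, and role ARN are placeholders).
aws logs put-subscription-filter \
  --log-group-name my-vpc-flow-logs \
  --filter-name vpc-flow-logs-to-kinesis \
  --filter-pattern "" \
  --destination-arn arn:aws:kinesis:us-west-2:123456789012:stream/my-flow-log-stream \
  --role-arn arn:aws:iam::123456789012:role/my-cwl-to-kinesis-role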

Publishing to S3 as a new destination
With the recent launch of this feature, flow logs can now be delivered directly to S3 using the AWS CLI or through the Amazon EC2 or Amazon VPC consoles. You can now deliver flow logs to both S3 and CloudWatch Logs.

CloudWatch is a good tool for system operators and SREs to capture and monitor flow log data. But you might also want to store copies of your flow logs for compliance and audit purposes, which require less frequent access and viewing. By storing your flow log data directly in S3, you can build a data lake for all your logs.

From this data lake, you can integrate the flow log data with other stored data, for example, joining flow logs with Apache web logs for analytics. You can also take advantage of the different storage classes of S3, such as Amazon S3 Standard-Infrequent Access, or write custom data processing applications.
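
For example, a lifecycle rule can transition older flow log objects to S3 Standard-IA. The following is a minimal sketch using the example bucket from this post and a 30-day transition; adjust both to your requirements:

# Transition flow log objects under the AWSLogs/ prefix to Standard-IA after 30 days
# (the bucket name and timing are placeholders).
aws s3api put-bucket-lifecycle-configuration \
  --bucket aws-flowlog-s3 \
  --lifecycle-configuration '{
    "Rules": [
      {
        "ID": "flowlogs-to-standard-ia",
        "Filter": { "Prefix": "AWSLogs/" },
        "Status": "Enabled",
        "Transitions": [ { "Days": 30, "StorageClass": "STANDARD_IA" } ]
      }
    ]
  }'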

Solution overview
The following diagram shows a simple architecture that sends flow log data directly to an S3 bucket, creates tables in Athena for ad hoc queries, and finally connects the Athena tables to Amazon QuickSight to build an interactive dashboard for easy visualization.

The following steps show how to deploy this architecture in minutes using AWS services, move flow log data to S3, and analyze it with Amazon QuickSight.

1. Create IAM policies to generate and store flow logs in an S3 bucket.
2. Enable the new flow log feature to send the data to S3.
3. Create an Athena table and add a date-based partition.
4. Create an interactive dashboard with Amazon QuickSight.

Step 1: Create IAM policies to generate and store flow logs in an S3 bucket
Create and attach the appropriate IAM policies. The IAM role associated with your flow log must have permissions to publish flow logs to the S3 bucket. For more information about the required IAM policies and the bucket policy used for log delivery, see the documentation on Publishing Flow Logs to Amazon S3.
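
As a reference, the destination bucket needs a bucket policy that allows the log delivery service to write objects. The following is a minimal sketch based on the documented log delivery policy; the bucket name and account ID are placeholders, and the linked documentation remains the authoritative source:

# Attach a bucket policy that allows the log delivery service to write flow logs
# (the bucket name and account ID are placeholders).
aws s3api put-bucket-policy --bucket aws-flowlog-s3 --policy '{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AWSLogDeliveryWrite",
      "Effect": "Allow",
      "Principal": { "Service": "delivery.logs.amazonaws.com" },
      "Action": "s3:PutObject",
      "Resource": "arn:aws:s3:::aws-flowlog-s3/AWSLogs/123456789012/*",
      "Condition": { "StringEquals": { "s3:x-amz-acl": "bucket-owner-full-control" } }
    },
    {
      "Sid": "AWSLogDeliveryAclCheck",
      "Effect": "Allow",
      "Principal": { "Service": "delivery.logs.amazonaws.com" },
      "Action": "s3:GetBucketAcl",
      "Resource": "arn:aws:s3:::aws-flowlog-s3"
    }
  ]
}'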

Step 2: Enable the new flow log feature to send the data to S3
You can create the flow log from the AWS Management Console or by using the AWS CLI.

To create the flow log from the Console:

1. In the VPC console, select the specific VPC for which to generate flow logs.

2. Choose Flow Logs, Create flow log.

3. For Filter, choose the option based on your needs. For Destination, select Send to an S3 bucket. For S3 bucket ARN*, provide the ARN of your destination bucket.

To create the flow log from the CLI:

1. Use the following example command to create the flow log; the response is returned in JSON format:

186590dfd865:~ avijitg$ aws ec2 create-flow-logs --resource-type VPC --resource-ids <your VPC id> --traffic-type <ACCEPT/REJECT/ALL>  --log-destination-type s3 --log-destination <Your S3 ARN> --deliver-logs-permission-arn <ARN of the IAM Role>
{
    "ClientToken": "gUk0TEGdf2tFF4ddadVjWoOozDzxxxxxxxxxxxxxxxxx=",
    "FlowLogIds": [
        "fl-xxxxxxx"
    ],
    "Unsuccessful": []
}

2. Check the status and description of the flow log by running the following command with a filter on the flow log ID that you received during creation:

186590dfd865:~ avijitg$ aws ec2 describe-flow-logs --filter "Name=flow-log-id,Values=fl-xxxxxxx"
{
    "FlowLogs": [
        {
            "CreationTime": "2018-08-15T05:30:15.922Z",
            "DeliverLogsPermissionArn": "arn:aws:iam::acctid:role/rolename",
            "DeliverLogsStatus": "SUCCESS",
            "FlowLogId": "fl-xxxxxxx",
            "FlowLogStatus": "ACTIVE",
            "ResourceId": "vpc-xxxxxxxx",
            "TrafficType": "REJECT",
            "LogDestinationType": "s3",
            "LogDestination": "arn:aws:s3:::aws-flowlog-s3"
        }
    ]
}

3. Check the S3 bucket to confirm that your flow logs are delivered with the following key structure:

AWSLogs/<account-id>/vpcflowlogs/<region>/2018/08/15/<account-id>_vpcflowlogs_us-west-2_fl-xxxxxxx_20180815T0555Z_14dc5cfd.log.gz
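
For example, you can list the delivered objects from the CLI (the bucket name, account ID, and Region are placeholders):

# List the flow log objects delivered to the bucket
# (the bucket name, account ID, and Region are placeholders).
aws s3 ls s3://aws-flowlog-s3/AWSLogs/123456789012/vpcflowlogs/us-west-2/ --recursive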

Step 3: Create an Athena table and add a date-based partition
In the Athena console, create a table on your flow log data.

Use the following DDL to create a table in Athena:

CREATE EXTERNAL TABLE IF NOT EXISTS vpc_flow_logs (
  version int,
  account string,
  interfaceid string,
  sourceaddress string,
  destinationaddress string,
  sourceport int,
  destinationport int,
  protocol int,
  numpackets int,
  numbytes bigint,
  starttime int,
  endtime int,
  action string,
  logstatus string
)  
PARTITIONED BY (dt string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ' '
LOCATION 's3://<your bucket location with object keys>/'
TBLPROPERTIES ("skip.header.line.count"="1");

After creating the table, partition it based on the ingestion date. Doing this speeds up queries of the flow log data for specific dates.

Be aware that the folder structure created by a flow log differs from the Hive partitioning format, so Athena does not discover the partitions automatically. Instead, manually add partitions and map them to portions of the keyspace using ALTER TABLE ADD PARTITION, creating one partition per ingestion date.

Here is an example with a partition for ingestion date 2019-05-01:

ALTER TABLE vpc_flow_logs  ADD PARTITION (dt = '2019-05-01') location 's3://aws-flowlog-s3/AWSLogs/<account id>/vpcflowlogs/<aws region>/2019/05/01';
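
With the table and partition in place, you can run ad hoc SQL from the Athena console, or start a query from the AWS CLI. The following sketch looks for the top rejected source addresses on a single day; the database name and query results location are placeholders:

# Run an ad hoc query against the partition for 2019-05-01
# (the database name and query results location are placeholders).
aws athena start-query-execution \
  --query-string "SELECT sourceaddress, destinationport, count(*) AS rejected_flows FROM vpc_flow_logs WHERE dt = '2019-05-01' AND action = 'REJECT' GROUP BY sourceaddress, destinationport ORDER BY rejected_flows DESC LIMIT 10" \
  --query-execution-context Database=default \
  --result-configuration OutputLocation=s3://<your query results bucket>/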

Step 4: Create an interactive dashboard with Amazon QuickSight
Now that your data is available in Athena, you can quickly create an Amazon QuickSight dashboard to visualize the logs in near real-time.

First, go to Amazon QuickSight and choose New Analysis, New datasets, Athena. For Data Source Name, enter a name for your new data source.

Next, for Database: contain sets of tables, choose the database that contains your new table. Under Tables: contain the data you can visualize, select the flow log table that you created.

You can start creating dashboards based on the metrics to monitor.

Conclusion
In the past, storing flow log data cost-effectively required a solution involving Lambda, Kinesis Data Firehose, or other sophisticated processes to deliver the logs to S3. In this post, I demonstrated how quickly and easily you can deliver flow logs to S3 using the recent VPC update and query them with Athena to satisfy your analytics needs. For more information about controlling and monitoring your flow logs, see the documentation on working with flow logs.

If you have comments or feedback, please leave them below, or reach out on Twitter!

References:

Amazon CloudWatch
Documentation on CloudWatch Logs
Amazon VPC Flow Logs can now be delivered to S3

About the Author


Avijit Goswami is a Sr. Solutions Architect who helps AWS customers build their infrastructure and applications in the cloud in line with the AWS Well-Architected pillars of operational excellence, security, reliability, performance efficiency, and cost optimization. When not at work, Avijit likes to travel, watch sports, and listen to music.