How do I analyze the Amazon VPC flow logs using Amazon Athena?

Last updated: 2021-04-16

I want to analyze my Amazon Virtual Private Cloud (Amazon VPC) flow logs using Amazon Athena.

Short description

Amazon VPC flow logs allow you to capture information about the IP traffic going to and from network interfaces in your VPC. The logs can be used to investigate network traffic patterns and identify threats and risks across your Amazon VPC network.

Resolution

Analyze VPC logs using Athena

To analyze the access logs using Amazon Athena, do the following:

1.    On the Amazon Athena console query editor tab, create a database test_db_vpclogs by running a command similar to the following:
Important: It's a best practice to create the database in the same AWS Region as that of the Amazon S3 bucket where you want to save the flow logs.

CREATE DATABASE test_db_vpclogs;

Note: Be sure to replace test_db_vpclogs with the name of the database that you want to create.

2.    In the database that you created, create a table for the VPC flow logs by running a command similar to the following. With this command, you can create a table, partition the table, and populate the partitions automatically according to your use case using partition projection.

CREATE EXTERNAL TABLE IF NOT EXISTS test_table_vpclogs (
version int,
account string,
interfaceid string,
sourceaddress string,
destinationaddress string,
sourceport int,
destinationport int,
protocol int,
numpackets int,
numbytes bigint,
starttime int,
endtime int,
action string,
logstatus string,
vpcid string,
subnetid string,
instanceid string,
tcpflags int,
type string,
pktsrcaddr string,
pktdstaddr string,
region string,
azid string,
sublocationtype string,
sublocationid string,
pktsrcawsservice string,
pktdstawsservice string,
flowdirection string,
trafficpath string
)
PARTITIONED BY (region string, day string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ' '
LOCATION 's3://awsexamplebucket/awsexampleprefix/awsexamplelogs/1111222233334444/vpcflowlogs/test_region_code/'
TBLPROPERTIES
(
"skip.header.line.count"="1",
"projection.enabled" = "true",
"projection.region.type" = "enum",
"projection.region.values" = "us-east-1,us-west-2,ap-south-1,eu-west-1",
"projection.day.type" = "date",
"projection.day.range" = "2021/01/01,NOW",
"projection.day.format" = "yyyy/mm/dd",
"storage.location.template" = "s3://awsexamplebucket/awsexampleprefix/awsexamplelogs/1111222233334444/vpcflowlogs/test_region_code/${region}/${day}"
)

Be sure to do the following:

  • Replace test_table_vpclogs in the query with the name of your table.
  • Modify the LOCATION parameter in the query to point to the Amazon S3 bucket that contains your log data.

Note: If a projected partition doesn't exist in Amazon S3, Athena still projects the partition. It's a best practice to use partitioned attributes in your queries.

3.    Use the Query editor in the console to run SQL statements on the table. You can save the queries, view previous queries, or download query results in CSV format.

Example queries

Note: Replace test_table_vpclogs in the queries with the name of the table that you created. Modify the column values and other variables to fit your query.

1.    To view the first 100 access log entries in chronological order for a certain period of time, run a query similar to the following:

SELECT * FROM test_table_vpclogs WHERE day >= '2021/02/01' AND day < '2021/02/28' ORDER BY time ASC LIMIT 100;

2.    To view which server receives the top ten number of HTTP packets for a certain period of time, run a query similar to the following:

SELECT SUM(numpackets) AS
  packetcount,
  destinationaddress
FROM test_table_vpclogs
WHERE destinationport = 443 AND day >= '2021/03/01' AND day < '2021/03/31'
GROUP BY destinationaddress
ORDER BY packetcount DESC
LIMIT 10;

This query counts the number of packets received on HTTPS port 443, groups them by destination IP address, and returns the top 10 entries from the previous week.

3.    To check the logs that were created during a certain period of time, run a query similar to the following:

SELECT interfaceid, sourceaddress, action, protocol, to_iso8601(from_unixtime(starttime))
AS start_time, to_iso8601(from_unixtime(endtime))
AS end_time
FROM test_table_vpclogs
WHERE day >= '2021/04/01' AND day < '2021/04/30';

This query returns the logs created between 2020-12-04 11:28:19.000 and 2020-12-04 11:28:33.000.

4.    To view access logs for a specific source IP address between certain time periods, run a query similar to the following:

SELECT * FROM test_table_vpclogs WHERE sourceaddress= '10.117.1.22' AND day >= '2021/02/01' AND day < '2021/02/28';

5.    To list the rejected TCP connections, run a query similar to the following:

SELECT day_of_week(date) AS
day,
date,
interfaceid,
sourceaddress,
action,
protocol
FROM test_table_vpclogs
WHERE action = 'REJECT' AND protocol = 6 AND day >= '2021/02/01' AND day < '2021/02/28' LIMIT 10;

6.    To view the access logs for the IP address range that starts with '10.117', run a query similar to the following:

SELECT * FROM test_table_vpclogs WHERE split_part(sourceaddress,’.’, 1)=’10’ AND split_part(sourceaddress,’.’, 2) =‘117’;

7.    To view the access logs for a specific destination IP address between a certain time range, run a query similar to the following:

SELECT * FROM test_table_vpclogs WHERE destinationaddress= '10.0.1.14' AND day >= '2021/01/01' AND day < '2021/01/31';