How to use Amazon Athena queries to analyze AWS WAF logs and provide the visibility needed for threat detection

Web application security is an ongoing process. AWS WAF enables real-time monitoring and blocking of potentially harmful web requests. Bot Control and Fraud Control use machine learning (ML) to detect and prevent sophisticated threats. Bot traffic can make up anywhere from 30% to 50% or even more of total web traffic. After enabling AWS WAF, you want to evaluate the web traffic for false positives or false negatives, identify traffic patterns, identify new attack signatures, and better understand the web traffic incoming to your application. To guarantee optimal security, it’s crucial to regularly assess and improve your AWS WAF configuration by using traffic insights. By doing so, you can enhance the security posture of your application and efficiently mitigate malicious traffic.

This post shows how to use Amazon Athena to analyze AWS WAF logs published to an Amazon Simple Storage Service (Amazon S3) bucket and gather insights into potential attacks. If you are publishing AWS WAF logs to Amazon CloudWatch Logs, refer to Analyzing AWS WAF Logs in Amazon CloudWatch Logs. With Athena, you can perform ongoing analysis of AWS WAF traffic by surfacing outliers such as the top 10 IP addresses accessing your application, the top 10 URIs accessed, and the top IP addresses with token rejections, or by tracking a client session, among many other use cases.

AWS WAF dashboards

AWS WAF provides web ACL traffic overview dashboards that summarize your web traffic. The dashboards are divided into four categories: All traffic, Bot Control, Account takeover protection, and Account creation fraud prevention. For example, if you would like to understand requests per top 10 countries, token status, or client session thresholds, you can use the default dashboards. However, for deeper web traffic analysis and to generate data for your custom use case, you need to analyze AWS WAF logs. Athena provides the integration you need to run those queries.

AWS WAF logging

With AWS WAF logging, you can view metadata (JSON format) about the traffic accessing your protected resources, including client IP addresses, requested resources, and more. See Log fields for a full list of available data. This information allows you to analyze traffic across various dimensions, such as client IPs, URIs, hosts, client IP country, headers, and AWS WAF rules, providing valuable insights into your web application’s security. AWS WAF generates a unique JSON log entry for each HTTP request it handles, which can be stored and analyzed. You can send AWS WAF logs to various destinations, including S3 buckets, CloudWatch, or by using Amazon Kinesis Data Firehose to send logs to third-party solutions like Datadog or Splunk. Once a Kinesis Data Firehose delivery stream is created, you can associate it with an AWS WAF web access control list (ACL) through the AWS Command Line Interface (AWS CLI) or the AWS WAF console, which enables AWS WAF to send real-time logs to a required destination.

Examples of threat detection analysis using AWS WAF logs and Athena

This post provides seven examples: top talkers by different criteria, counts of bot traffic for a given set of days, counts of labels by IP address, top talkers with additional details, website scraping and attacks, AWS WAF token analysis by IP address, and session tracking. These scenarios help you dive deep into AWS WAF traffic across one or more AWS resources that have a web ACL attached to them.

As mentioned earlier in this post, AWS WAF logs contain the metadata required to analyze the details of a user request. AWS WAF rule actions are either terminating, which halt processing of the request, or non-terminating, which let processing continue down the configured rule priority order. When analyzing bot traffic, it is important to understand the AWS WAF token characteristics and the labels generated by AWS WAF intelligent threat detection. The AWS WAF token is saved as a cookie named aws-waf-token. For CAPTCHA and challenge requests, AWS WAF inspects the token status. If the request has a valid token, it's treated as a non-terminating match. If the token is invalid (because it is absent, expired, or rejected), AWS WAF presents a CAPTCHA or challenge interstitial to the client.

Prerequisites

Turn on AWS WAF logging: Follow the directions in AWS re:Post to publish logs to an S3 bucket, either directly from AWS WAF or through Kinesis Data Firehose. The preferred method is Kinesis Data Firehose, which gives you better control of log delivery.
Configure Athena: AWS WAF logs have a known structure, so you can specify the partition scheme in advance. Follow the directions in Querying AWS WAF logs to create a table in Athena referencing the AWS WAF schema. Points to note when creating Athena tables:
- We recommend creating the Athena table with a schema partitioned by date. If you use direct Amazon S3 log delivery instead of Kinesis Data Firehose, consider partitioning by Region, too.
- Be sure to specify the correct S3 bucket location for storing AWS WAF logs, as configured in the previous step.
- AWS WAF regularly updates log fields when new features are launched, so you’ll need to update your query schema to get the latest data using Athena.

Example 1: Top talkers by different criteria

Top talkers are the devices, bots, or users that generate the most network traffic or pose the greatest potential threat to your applications. For example, you start by looking at your CloudWatch metrics and identify a spike in the number of requests hitting your application in the last few days. Then, you dive deep into the requests to understand their source by categories such as IP address, URI, or HTTP headers. The top10ip.sql query returns the top 10 source IP addresses. The top10uri.sql query returns the top 10 URIs accessed. You can use additional attributes like httprequest.uri or httprequest.httpmethod to get top talkers for each HTTP attribute. For more examples, refer to Querying AWS WAF logs.
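To make the shape of a top-talkers query concrete, here is a minimal Python sketch that builds the Athena SQL text for one grouping attribute. It is not the repository's top10ip.sql or top10uri.sql. The table name waf_logs, the "date" partition column in yyyy/MM/dd format, and the helper name build_top_talkers_query are assumptions for illustration, so adjust them to match your own table definition.

```python
# Attributes that may be used for grouping; values are Athena column paths.
TOP_TALKER_COLUMNS = {
    "ip": "httprequest.clientip",
    "uri": "httprequest.uri",
    "method": "httprequest.httpmethod",
    "country": "httprequest.country",
}


def build_top_talkers_query(dimension: str, days: int = 7, limit: int = 10,
                            table: str = "waf_logs") -> str:
    """Return Athena SQL that ranks one request attribute by request count.

    `table` and the "date" partition column are assumptions about your schema.
    """
    if dimension not in TOP_TALKER_COLUMNS:
        raise ValueError(f"unsupported dimension: {dimension}")
    if days < 1 or limit < 1:
        raise ValueError("days and limit must be positive")
    column = TOP_TALKER_COLUMNS[dimension]
    return (
        f"SELECT {column} AS talker, COUNT(*) AS request_count\n"
        f"FROM {table}\n"
        # Partition filter: only scan the most recent `days` days of logs.
        f"WHERE \"date\" >= date_format(current_date - interval '{int(days)}' day, '%Y/%m/%d')\n"
        f"GROUP BY {column}\n"
        f"ORDER BY request_count DESC\n"
        f"LIMIT {int(limit)}"
    )


print(build_top_talkers_query("ip"))
```

Restricting the grouping attribute to a fixed allowlist keeps arbitrary text out of the generated SQL, and the date filter keeps the scan limited to recent partitions.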

Figure 1: Athena query output of top 10 IP sources of web requests

Figure 2: Athena query output of top 10 URIs accessed

If you want to see all the traffic from a client before and after token acquisition, the alltraffic_by_ip_includingtoken.sql query gives you information about all traffic from a given client IP, including the traffic before the token was acquired, the traffic that caused the token to be acquired, and the traffic after the token acquisition.

Figure 3: Athena query output of all traffic from a client IP address

Example 2: Get counts of various bot traffic for a given set of days

Suppose you have configured the Bot Control rule group and would like statistics on the requests that match its rules. The bot.sql query returns the count of requests for each category of bot label matched on a specific date.

Figure 4: Athena query output of different categories of bot traffic by date

To get similar information for a range of IPs, replace the “group by” criteria of “date” with “httprequest.clientip” to get metrics for each IP. The bot_byip.sql query provides the different categories of bot labels matched for each client IP over a specific date range.
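The switch between grouping by date and grouping by client IP can be sketched as a small Python helper that generates the query text. This is not the repository's bot.sql or bot_byip.sql. It assumes a table named waf_logs, a labels column of type array of struct with a name field, a "date" partition column, and that UNNEST yields one label struct per row. It has not been run against a live Athena table, so treat it as a starting point.

```python
# Label namespace used by the Bot Control managed rule group.
BOT_CONTROL_PREFIX = "awswaf:managed:aws:bot-control:"

# Supported grouping keys and the Athena column each one maps to.
GROUP_COLUMNS = {"date": '"date"', "ip": "httprequest.clientip"}


def build_bot_label_query(group_by: str = "date", days: int = 7,
                          table: str = "waf_logs") -> str:
    """Return Athena SQL counting Bot Control label matches per date or per client IP.

    `table`, the "date" partition column, and the labels layout are assumptions.
    """
    if group_by not in GROUP_COLUMNS:
        raise ValueError(f"unsupported group_by: {group_by}")
    if days < 1:
        raise ValueError("days must be positive")
    column = GROUP_COLUMNS[group_by]
    return (
        f"SELECT {column} AS group_key, label.name AS label_name, COUNT(*) AS request_count\n"
        f"FROM {table}\n"
        # A single UNNEST expands the labels array into one row per label.
        "CROSS JOIN UNNEST(labels) AS t(label)\n"
        f"WHERE \"date\" >= date_format(current_date - interval '{int(days)}' day, '%Y/%m/%d')\n"
        f"  AND label.name LIKE '{BOT_CONTROL_PREFIX}%'\n"
        f"GROUP BY {column}, label.name\n"
        "ORDER BY request_count DESC"
    )


print(build_bot_label_query("ip", days=3))
```

The output is in long form, one row per group and label, which is convenient for charting but differs from a layout with one column per bot category.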

Figure 5: Athena query output of different categories of bot traffic by IP address

Example 3: Get counts of labels per IP address

You have configured multiple AWS WAF rulesets per web ACL. You would like to get statistics on the requests that are matched with each rule and build a chart from the .csv output. The alllabels_byip.sql query returns the count of each label match per IP address or for a specific date. At the time of writing, AWS WAF provides more than 400 labels, and you can customize the query to add or remove labels according to your use case.

Figure 6: Athena query output of count of each AWS WAF label by IP address

Example 4: Top talker with additional details

You have configured multiple AWS WAF rulesets per web ACL. You would like to know which IP addresses generate the most requests, along with details of the rules that those requests match. The toptraffic.sql query provides the count of traffic by IP address over a range of dates and the corresponding terminatingruleid values.

Figure 7: Athena query output of count of traffic by IP addresses and matching AWS WAF terminating rules

Example 5: Website scraping and attacks

Website scraping is the process of extracting data from websites and involves software and algorithms to navigate a website to try to extract specific data. Malicious bots attack a website with the intention to degrade website performance and breach security. From the previous example, you get the output with the IP address and total requests in descending order. To understand if the particular IP address is attacking your website, use the httprequest.clientip as a filter and run the alltraffic_byip.sql query to identify the URLs the IP is trying to access. Make sure to replace INSERT_IP_ADDRESS with a valid IP value in the query.
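When you substitute an IP address into a query programmatically, it helps to validate the value first. The sketch below shows one way to do that in Python. It is not the repository's alltraffic_byip.sql, and the table name waf_logs, the "date" partition column, the selected columns, and the helper name are assumptions for illustration.

```python
import ipaddress


def build_traffic_by_ip_query(client_ip: str, days: int = 7,
                              table: str = "waf_logs") -> str:
    """Return Athena SQL listing the requests sent by a single client IP.

    `table` and the "date" partition column are assumptions about your schema.
    """
    # ip_address() raises ValueError for anything that is not a literal
    # IPv4 or IPv6 address, which keeps quotes and SQL fragments out of the query.
    ip = ipaddress.ip_address(client_ip)
    if days < 1:
        raise ValueError("days must be positive")
    return (
        'SELECT "timestamp", httprequest.uri, httprequest.httpmethod,\n'
        "       action, terminatingruleid, labels\n"
        f"FROM {table}\n"
        f"WHERE \"date\" >= date_format(current_date - interval '{int(days)}' day, '%Y/%m/%d')\n"
        # The normalized form is used here; IPv6 values may need to match
        # the exact text format that appears in your logs.
        f"  AND httprequest.clientip = '{ip}'\n"
        'ORDER BY "timestamp"'
    )


print(build_traffic_by_ip_query("192.0.2.10"))
```

The address 192.0.2.10 comes from a range reserved for documentation, so replace it with the IP address you identified in the previous example.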

Figure 8: Athena query output of different URIs accessed by an IP with matching terminating rule and labels

Example 6: AWS WAF tokens analysis (activity by IP and token misuse)

AWS WAF issues a unique token whose validity depends on the configured immunity time (from a minimum of 60 seconds to a maximum of 3 days). AWS WAF presents the user with a CAPTCHA or challenge after the immunity time expires. A malicious user can generate a token and reuse it through their scripts to generate a high load of requests or to spread the requests across multiple IP addresses. Use the waftoken_byip.sql query to find out whether you have token misuse. Because a token is expected to be unique to each client IP, the query outputs each token ID and the number of IPs that sent that token. Expect no more than a small number of IPs for each AWS WAF token. A legitimate exception to having one IP per AWS WAF token is a user who is traveling and acquires a different IP address before the token immunity period expires.
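The counting logic behind this check can be illustrated locally in Python over parsed AWS WAF log records. This is a sketch of the idea, not the waftoken_byip.sql query. It assumes the token is visible as the aws-waf-token cookie in the logged request headers (it will not be if that header is redacted), it uses the whole cookie value as the grouping key, and the sample records and the max_ips threshold are made up for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Optional, Set

TOKEN_COOKIE = "aws-waf-token"


def extract_token(record: dict) -> Optional[str]:
    """Return the aws-waf-token cookie value from one parsed log record, if present."""
    for header in record.get("httpRequest", {}).get("headers", []):
        if header.get("name", "").lower() != "cookie":
            continue
        for part in header.get("value", "").split(";"):
            name, _, value = part.strip().partition("=")
            if name == TOKEN_COOKIE and value:
                return value
    return None


def ips_per_token(records: Iterable[dict]) -> Dict[str, Set[str]]:
    """Map each token to the set of client IPs that presented it."""
    seen = defaultdict(set)
    for record in records:
        token = extract_token(record)
        if token:
            seen[token].add(record["httpRequest"]["clientIp"])
    return dict(seen)


def suspicious_tokens(records: Iterable[dict], max_ips: int = 2) -> Dict[str, Set[str]]:
    """Return tokens presented by more than `max_ips` distinct client IPs."""
    return {t: ips for t, ips in ips_per_token(records).items() if len(ips) > max_ips}


def _record(ip: str, cookie: str) -> dict:
    # Hypothetical, heavily trimmed log record used only for this example.
    return {"httpRequest": {"clientIp": ip,
                            "headers": [{"name": "Cookie", "value": cookie}]}}


sample = [
    _record("192.0.2.10", "session=abc; aws-waf-token=token-A"),
    _record("192.0.2.11", "aws-waf-token=token-A"),
    _record("198.51.100.7", "aws-waf-token=token-A; theme=dark"),
    _record("198.51.100.8", "aws-waf-token=token-B"),
]

for token, ips in suspicious_tokens(sample, max_ips=2).items():
    print(token, sorted(ips))
```

The threshold is a judgment call: a value slightly above one tolerates clients whose IP address changes within the immunity period, such as the traveling user described above.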

Figure 9: Athena query output of token ID and number of unique IP addresses

If you have enabled a challenge, CAPTCHA, or Bot Control, it helps to understand whether the IPs generating the most traffic have a valid AWS WAF token. The waftoken_analysis.sql query shows whether the AWS WAF token is missing or expired, or whether the domain is invalid.

Figure 10: Athena query output of client IP addresses and matched token status

Example 7: Session tracking – Lifecycle of a client request (client session activity by token)

With the query in Example 6, you have a list of tokens used by different IPs. Once an AWS WAF token is issued, use the alltraffic_bywaftoken.sql query to find out which requests were made with a specific token. Before running the query, replace INSERT_THE_TOKEN_ID_HERE with the AWS WAF token in the query.

Figure 11: Athena query output of all traffic from a client using a specific AWS WAF token ID

Tips to make Athena queries faster

To improve query performance, refer to Performance tuning in Athena. It is important to reduce the data being queried. Here are some tips to help improve your Athena queries.

- Use DATE in the Athena partition criteria for AWS WAF logs.
- Use DATE in the WHERE clause and restrict it to a few days with the filter date >= date_format(current_date - interval '7' day, '%Y/%m/%d'). You can decrease or increase the number of days being queried to meet your performance service level agreement (SLA).
- Avoid using more than one UNNEST clause in a single query.
- Avoid joining with the entire dataset without a DATE-based WHERE clause.
- If you have a single bucket for logging across web ACLs from multiple AWS account IDs, try to partition the log data with the account ID as the partition key. This lets you use the account ID in the WHERE clause, reducing the data queried.
- If you are sending logs from multiple accounts into a single Kinesis Data Firehose stream, you can also use dynamic partitioning based on the account ID. This allows you to use the account ID in the partition and reduce the number of files being queried at any given time. Here is additional information on dynamic partitioning.
- If multiple web ACLs log into the same bucket, filter by web ACL using webaclid = 'arn of webACL'.
- If multiple AWS resources are attached to a single web ACL, filter by httpsourceid = 'id of the resource'. Refer to Log fields for more information.
- Remember that all times are recorded in Coordinated Universal Time (UTC). Factor that in when converting to the local time zone. When querying historical data, you can filter on absolute dates, such as date = '2024/03/22', for better performance.
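Several of these tips come down to building a narrow WHERE clause. The Python sketch below combines the date, web ACL, and resource filters into one clause. The helper names are hypothetical, the "date" partition column is assumed to use the yyyy/MM/dd format shown in the tips, and the ARN in the usage line is a placeholder rather than a real resource.

```python
from datetime import datetime
from typing import Optional


def quote_literal(value: str) -> str:
    """Wrap a value in single quotes, doubling any embedded single quote."""
    return "'" + value.replace("'", "''") + "'"


def build_where_clause(days: int = 7,
                       exact_date: Optional[str] = None,
                       web_acl_arn: Optional[str] = None,
                       http_source_id: Optional[str] = None) -> str:
    """Return a WHERE clause that limits the partitions and rows Athena scans."""
    conditions = []
    if exact_date is not None:
        # Absolute dates suit historical queries; strptime rejects malformed
        # input and strftime normalizes it to the zero-padded partition format.
        day = datetime.strptime(exact_date, "%Y/%m/%d").strftime("%Y/%m/%d")
        conditions.append('"date" = ' + quote_literal(day))
    else:
        if days < 1:
            raise ValueError("days must be positive")
        conditions.append(
            f"\"date\" >= date_format(current_date - interval '{int(days)}' day, '%Y/%m/%d')"
        )
    if web_acl_arn:
        # Useful when several web ACLs log to the same bucket.
        conditions.append("webaclid = " + quote_literal(web_acl_arn))
    if http_source_id:
        # Useful when several resources share one web ACL.
        conditions.append("httpsourceid = " + quote_literal(http_source_id))
    return "WHERE " + "\n  AND ".join(conditions)


# Placeholder ARN for illustration only.
print(build_where_clause(
    exact_date="2024/03/22",
    web_acl_arn="arn:aws:wafv2:us-east-1:111122223333:regional/webacl/example/abc",
))
```

Because the partition values are in UTC, convert any local-time boundary to UTC before passing it as exact_date.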

Pricing

You will be charged for publishing AWS WAF logs and querying using Athena. For information on cost-effective ways to configure AWS WAF, refer to Cost-effective ways for securing your web applications using AWS WAF.

Conclusion

This post discusses how to use AWS WAF logs and Amazon Athena to gain insights into your application traffic. You can also build Amazon QuickSight dashboards using specific Athena queries by following the directions in Enabling serverless security analytics using AWS WAF full logs, Amazon Athena, and Amazon QuickSight.

The provided example queries will help you get started with querying AWS WAF logs using Athena. For queries related to other use cases, refer to the waf-log-sample-athena-queries GitHub repository. Our team will keep adding new queries to this repository. Please use the discussions forum to provide feedback or request queries for additional use cases.

To stay up to date with AWS WAF, refer to the AWS Security Blog and What’s New with Security, Identity, & Compliance? If you have feedback about this post, submit comments in the comments section. If you have questions about this post, start a new thread on AWS WAF re:Post or contact AWS Support.

About the Authors

Kartik Bheemisetty

Kartik Bheemisetty is a Sr Technical Account Manager in the US-ISV segment, where he helps customers achieve their business goals with AWS cloud services. He holds subject matter expertise in AWS Network and Content Delivery services. He offers expert guidance on best practices, facilitates access to subject matter experts, and delivers actionable insights on optimizing AWS spend, workloads, and events. You can connect with him on LinkedIn.

Vishal Lakhotia

Vishal Lakhotia is a Senior Solutions Architect at Amazon Web Services focused on accelerating cloud adoption and ensuring customer success by using the AWS Cloud for business outcomes. He is a subject matter expert on Edge Services and End User Computing services. You can connect with him on LinkedIn.

Jess Izen

Jess Izen is a Senior Software Development Engineer with AWS WAF, building products like Bot Control and Fraud Control. She likes to work on performance-sensitive, highly concurrent Rust services as well as distributed systems involving technologies like Kafka and Redis/Valkey. In her free time, she rides bikes and competes in amateur Muay Thai. You can connect with her on LinkedIn.
