AWS Storage Blog

Understand Amazon S3 data transfer costs by classifying requests with Amazon Athena

Cost is top of mind for many enterprises, and building awareness of the different cost contributors is the first step toward managing costs and improving efficiency. Data transfer costs often split into two groups: transfers that are common but low cost, and transfers that are less frequent but higher cost. Records for both groups are mixed together, and separating them enables better decision making about where to improve.

Amazon Simple Storage Service (Amazon S3) is a cost-effective choice for storage where you pay only for what you use. One part of that usage is data transfer. Many types of data transfer incur no charge, and those may represent the majority of your Amazon S3 usage. To identify the sources of data transfer that do contribute to costs, it helps to first filter out the requests that don't create data transfer charges, so that you can focus on the requests that do.

In this post, we walk you through how to create a dataset composed entirely of Amazon S3 requests that contribute to data transfer costs. First, we show how to enable Amazon S3 server access logs and configure Amazon Athena. Next, we show how to create Athena queries that return only the relevant requests. Finally, we discuss some options for analyzing the remaining records to create insights that you can use to make cost optimizations.

Solution overview

The Amazon S3 pricing page provides details on data transfer categories for which you are not charged. In this post, we filter out the following categories first to help you analyze request charges that specifically contribute to your data transfer out costs:

  • Data transferred in from the internet.
  • Data transferred between S3 buckets in the same AWS Region.
  • Data transferred from an S3 bucket to any AWS service(s) within the same AWS Region as the S3 bucket (such as to a different account in the same AWS Region).
  • Data transferred out to Amazon CloudFront.

The preceding categories do not incur costs, but they represent a large percentage of overall requests for a typical user.

You can find the usage amount and costs for each type of data transfer in your AWS usage report, or through AWS Cost Explorer. The Amazon S3 usage types with names containing In-Bytes or Out-Bytes summarize this usage. For a full list of all Amazon S3 usage types, see the documentation on understanding your AWS billing and usage reports for Amazon S3.

In the AWS usage report, Amazon S3 data transfer usage types are summarized at the bucket level and by hour. Because this is a summary, usage reports cannot directly tell you who is making requests, when they are made, or which objects are being requested, so they cannot provide insight into the attributes that drive data transfer costs.

Amazon S3 server access logs provide detailed records for the requests that are made to an S3 bucket. They contain one record for each request. Each record contains fields such as the key of the requested object, the time of the request, the remote IP of the requester, and the amount of data sent. Therefore, they have the detail necessary to determine the source of data transfer costs. However, there might be millions or billions of records per month depending on usage patterns, so you must filter out the requests that don't create charges.

The following filters reduce these logs to the data that is needed to distinguish between charged and no-charge requests:

  • The primary field we use to do this is the Remote IP field as follows:
    • Filter internal access by AWS services, which often shows up with a Remote IP of -. These requests should be filtered out because they fall under “Data transferred between S3 buckets in the same AWS Region.”
    • Filter access from your VPCs through Amazon S3 VPC endpoints. This traffic uses an IP from one of the private address spaces: 10.0.0.0/8, 172.16.0.0/12, or 192.168.0.0/16. We filter this out because it falls under “Data transferred from an Amazon S3 bucket to any AWS service(s) within the same AWS Region as the S3 bucket.”
    • Filter access from Amazon Elastic Compute Cloud (Amazon EC2) or any other AWS Service in the same AWS Region that reaches an Amazon S3 public endpoint. This traffic uses an AWS Public IP from the same AWS Region. Create a list of all applicable IPs from the ip-ranges.json file where AWS publishes its current IP address ranges. We filter this out because it falls under “Data transferred from an Amazon S3 bucket to any AWS service(s) within the same AWS Region as the S3 bucket.”
    • Filter access from the CloudFront fleet or origin-facing servers. These ranges are also available from ip-ranges.json, under the grouping CLOUDFRONT_ORIGIN_FACING. We filter this out because it falls under “Data transferred out to Amazon CloudFront.”
  • We also use the Operation field to focus only on retrievals, because uploads count as data transfer in, which does not incur charges. We filter out operations other than REST.GET.OBJECT, as other requests are either small or classified as data transferred in.

In this solution, to query our S3 server access logs, we use Amazon Athena. You can use similar techniques in other tools, although you likely need to rewrite query statements to properly match CIDR ranges.
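For reference, the query generated in Step 1 has the following general shape (a minimal sketch showing only the three private address ranges; the full query also includes the same-Region and CloudFront origin-facing CIDR lists, and the table name s3_access_logs_db.mybucket_log assumes the Athena setup from the prerequisites):

SELECT * FROM s3_access_logs_db.mybucket_log
WHERE operation = 'REST.GET.OBJECT'   -- retrievals only
AND remoteip != '-'                   -- drops internal AWS service access
AND none_match(
    ARRAY['10.0.0.0/8', '172.16.0.0/12', '192.168.0.0/16'],
    x -> contains(x, CAST("remoteip" AS IPADDRESS)))   -- keeps rows whose IP is in none of the listed CIDRs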

Prerequisites

  1. Enable S3 server access logging for the S3 bucket you want to analyze.
  2. For this post, we use Athena SQL within the AWS Management Console. To set up Athena to work with S3 server access logs, follow the steps to create a database and table schema to query.
  3. To create the query, you run a Python script. Learn more about setting up Python on your computer.

Solution walkthrough

For this walkthrough, we go through the following steps:

  1. Generate Amazon Athena query
  2. Enable Amazon Athena version 3
  3. Execute Amazon Athena query
  4. Perform further analysis

Step 1: Generate Amazon Athena query

In the first step, we use a Python script to generate an SQL query that we can use with Athena to find the requests of interest and filter out those that don't create data transfer costs.

  1. Create a file named generate-athena-query.py. Copy and paste the following code, and save the file:
import sys
import json
from urllib.request import urlopen

def region_tester(region):
    def is_in_region(prefix):
        return prefix["service"] == "AMAZON" \
            and prefix["region"] == region
    return is_in_region


def is_cloud_front_origin_facing(prefix):
    return prefix["service"] == "CLOUDFRONT_ORIGIN_FACING"

def chunks(lst, n):
    """Yield successive n-sized chunks from lst."""
    for i in range(0, len(lst), n):
        yield lst[i:i + n]

if len(sys.argv) < 2:
    sys.exit("No region specified.")
region = sys.argv[1]

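# Download AWS's published IP address ranges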
with urlopen("https://ip-ranges.amazonaws.com/ip-ranges.json") as url:
    ip_ranges = json.load(url)

all_prefixes = ip_ranges["prefixes"]
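# RFC 1918 private address ranges, used by traffic arriving through S3 VPC endpoints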
internal_prefixes = ["10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16"]

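# All AMAZON service CIDR ranges in the requested Region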
region_prefixes = [
    prefix["ip_prefix"]
    for prefix
    in filter(region_tester(region), all_prefixes)
]

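# CloudFront origin-facing CIDR ranges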
cloud_front_prefixes = [
    prefix["ip_prefix"]
    for prefix
    in filter(is_cloud_front_origin_facing, all_prefixes)
]

all_cidr = internal_prefixes + region_prefixes + cloud_front_prefixes
quoted_cidrs = [
    f"'{cidr}'"
    for cidr in all_cidr
]

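# Build none_match() filters in chunks of 100 CIDRs so each expression stays a manageable size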
queries = ["none_match(ARRAY[" + (',\n ').join(chunk) + "],\nx -> contains(x, CAST(\"remoteip\" as IPADDRESS)))" 
           for chunk in chunks(quoted_cidrs, 100)]

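# Assemble the final query; adjust the database and table name to match your Athena setup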
query = "SELECT * FROM s3_access_logs_db.mybucket_log \n\
WHERE operation = 'REST.GET.OBJECT' \n\
AND remoteip != '-'\n\
AND " + ('\nAND ').join(queries)

print(query)

2. Execute the Python script, specifying your source AWS Region:

python generate-athena-query.py us-east-1

Note: If you installed Python on your Mac with Homebrew, then you may run into issues making the HTTPS request to download the JSON file. We suggest you install Python from Python.org.

Step 2: Enable Amazon Athena version 3

The SQL generated by the Python script includes Trino functions that are supported by Athena engine version 3. Use the following steps to enable Athena engine version 3 if you haven't already.

1. Go to Athena in the AWS Management Console. In the left navigation panel, select Workgroups under Administration. For the workgroup you plan to use, look at the Analytics engine column. If this says Athena engine version 2, select the workgroup name.

Figure 1: The Workgroups section in the AWS Management Console. A table shows the Name, Description, and the Analytics engine used for the workgroup you created. In this example, we show the primary workgroup and it is using the Athena engine version 3.

2. Select Edit.

3. Under Upgrade query engine, select Manual, and then select Athena engine version 3.

Figure 2: The Upgrade query engine section where you can choose to have Athena automatically upgrade or do this manually. In this example, we show the Query engine version being used is Athena engine version 3 (recommended).

4. Select Save changes.

Step 3: Execute Amazon Athena query

In this step, we execute our Amazon Athena query.

  1. Go to Amazon Athena in the AWS Management Console. In the left navigation panel, select Query editor.
  2. Paste the output of the Python script into the query editor and select Run. Depending on the volume of logs and the number of requests, this may take some time. If your dataset is large, consider adding a LIMIT clause to get a smaller sample (see the example after this list).
  3. If you encounter an SQL syntax error, then double-check Step 2: Enable Amazon Athena version 3.
  4. You can download the results via the Download results button and analyze offline, or use the guidance in step 4 to customize the query and perform further analysis.
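For example, appending a LIMIT clause to the end of the generated query returns a smaller sample while you experiment (a sketch; 1000 is an arbitrary sample size, and the placeholder stands for the none_match filters that the script produced):

SELECT * FROM s3_access_logs_db.mybucket_log
WHERE operation = 'REST.GET.OBJECT'
AND remoteip != '-'
AND {generated none_match filters here}
LIMIT 1000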

Step 4: Perform further analysis

Filtering the requests to only include those that cause data transfer costs has made it easier to spot patterns in the data. The following steps are some common methods of diving deeper with the remaining records.

1. To find the most common remaining IPs, try a query like the following:

SELECT count(*) AS Requests, remoteip
FROM s3_access_logs_db.mybucket_log
WHERE {add where clause here}
GROUP BY remoteip
ORDER BY count(*) DESC

Figure 3: Results table with two columns, Requests and remoteip. Requests is the total number of requests from a given IP address. For privacy, the IP addresses have been blacked out.

2. You can also try grouping by /24 blocks of IPs, as these are frequently contiguous sources of traffic; one approach is shown in the sketch below. If this doesn't reveal patterns, then try larger or smaller block sizes to look for patterns that are hard to see one IP at a time.
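One way to group by /24 blocks in Athena is to truncate the last octet of the address (a sketch that assumes IPv4 addresses and the table name from the generated query; replace the placeholder with the WHERE clause from Step 1):

SELECT count(*) AS Requests,
  split_part(remoteip, '.', 1) || '.' ||
  split_part(remoteip, '.', 2) || '.' ||
  split_part(remoteip, '.', 3) || '.0/24' AS block   -- drop the last octet to form the /24
FROM s3_access_logs_db.mybucket_log
WHERE {add where clause here}
GROUP BY 2
ORDER BY count(*) DESC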

3. Look up common IPs to find their owner. Often you find common sources, such as CDN services. Remember that requests made to S3 objects from CloudFront don't create data transfer charges, but requests from other CDN providers do. If you are using a multi-CDN architecture, consider setting CloudFront as the origin of your secondary CDN.

4. Alternatively, look for heavily used prefixes in your object names, such as dates or the naming conventions you use (for example, user names or bucket names), as in the following sketch.
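For example, the following sketch groups requests by the first element of the object key (it assumes keys use / as a delimiter and the same table and WHERE clause placeholder as the earlier queries):

SELECT count(*) AS Requests,
  split_part("key", '/', 1) AS key_prefix   -- first path element of the object key
FROM s3_access_logs_db.mybucket_log
WHERE {add where clause here}
GROUP BY 2
ORDER BY count(*) DESC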

5. If traffic isn’t anonymous, then look for a common requester by beginning your query with the following:

SELECT count(*) AS Requests, requester
FROM s3_access_logs_db.mybucket_log
WHERE {add where clause here}
GROUP BY requester
ORDER BY count(*) DESC

A common requester can help you identify a workload that uses a specific role, a specific user, or an external partner to whom you've assigned a role or granted cross-account access.

Cleaning up

Disable server access logging from the Amazon S3 page within the AWS Management Console. Select the bucket where you enabled logging, select the Properties tab, scroll down to Server access logging, select Edit, select Disable, and then select Save changes. Athena does not need to be turned off, as you only incur charges when running queries.

Conclusion

In this post, we walked you through how to use Amazon S3 server access logs and Amazon Athena to separate requests that don't create data transfer charges from those that do. A small Python script helped us build a query to filter out CIDR blocks that represent internal, same-Region, and Amazon CloudFront traffic, leaving only internet and cross-Region requests.

This solution enables you to identify the sources of costs for further analysis. You can use the information gleaned to optimize costs for these sources, or attribute them properly. You might choose to use a CDN such as Amazon CloudFront to reduce direct costs, or to work with users to optimize request patterns.

See the blogs “Amazon S3 + Amazon CloudFront: A Match Made in the Cloud” and “Cost-Optimizing your AWS architectures by utilizing Amazon CloudFront features” for more information.

Ryan Baker

Ryan Baker is a Principal Technical Account Manager with AWS. He is a member of the resilience, storage, and operational best practices communities, with additional experience in the domains of security and networking. He is an author for the Amazon S3 Well-Architected Lens.

Jack Marchetti

Jack Marchetti is a Senior Solutions Architect at AWS focused on helping customers modernize and implement serverless, event-driven architectures. Jack is legally blind and resides in Chicago with his wife Erin and cat Minou. He is also a screenwriter and director, with a primary focus on Christmas movies and horror.