AWS Storage Blog

How to optimize querying your data in Amazon S3

Querying data efficiently has become a critical capability for customers across industries. By querying data, customers can filter its contents and retrieve only the subset relevant to their current needs or objectives. By extracting information from data, organizations can gain valuable insights, make data-driven decisions, and ultimately drive business success in an increasingly competitive and data-driven landscape.

With data in Amazon S3, you have a few options to query data, including client-side filtering, Amazon Athena, and Amazon S3 Object Lambda. First, you can download an S3 object to your AWS compute instance and then query it. This option provides high performance for querying a single S3 object or an object restored from the Amazon S3 Glacier storage classes. Second, for data lake use cases where you are looking to query multiple S3 objects at once, we recommend using Athena. Athena is serverless, so there is no infrastructure to set up or manage. Third, you can use S3 Object Lambda to add your own code to S3 GET, HEAD, and LIST APIs to transform data as it is returned to your applications. You can benefit from S3 Object Lambda for use cases when you want to add transformations while querying data.

In this post, we discuss how to choose the best option for querying your data in S3 and walk through the steps for using each of these options.

Choosing the right option for your use case

Depending on your use case, one of the three options might be the most efficient. Let’s dive into these solutions and the unique benefits of each one.

1. Client-side filtering

You can download an S3 object to your AWS compute instance and query the contents using data analysis libraries on your client application. For example, in a Python application that queries Parquet objects in S3, you can use the S3 GET API to download an object and query its contents using the Pandas library.

This option usually provides the highest performance of the three, depending on the query’s filtering ratio and the object’s properties (size and compression ratio). This is because downloading a compressed object with the S3 GET API and querying it in your client application is usually faster than transferring the larger, uncompressed query results over the network. To learn more about the S3 GET API, refer to the S3 GetObject API documentation.

2. Amazon Athena

Amazon Athena is an interactive query service that makes it easy to analyze data directly in S3 using standard SQL. You should use Athena if you want to run interactive, ad hoc SQL queries against data in S3 without having to manage any infrastructure.

Athena uses a distributed SQL engine, Trino, to run queries on objects stored in S3, represented as Hive tables. After setting up a table, you can use Athena to query your S3 objects. You can process multiple S3 objects in a single query or even use join operations and window functions to query your S3 objects. Additionally, Athena supports data formats such as ORC and Apache Avro, and also objects compressed with the Zstandard algorithm. To learn more, refer to the Amazon Athena user guide, Analyzing data in S3 using Amazon Athena blog post, and Querying restored data from Glacier using Athena user guide.

3. Amazon S3 Object Lambda

With S3 Object Lambda, you can add your own code to S3 GET requests to process data as it is returned to your application. In that code, you can include querying logic that filters the data down to only the subset your application needs. S3 Object Lambda is a good fit when you also want to transform your query results; for example, you can compress the results with an algorithm of your choice as they are returned to your application.

When you send a request through S3 Object Lambda, S3 automatically calls an AWS Lambda function in your AWS account to process the data retrieved, returning a transformed result back to the application. AWS Lambda is a serverless compute service that runs user-defined code without requiring the management of underlying compute resources. By using Lambda, you can use a data analysis library in any supported Lambda runtime and use any data querying techniques without being confined to the SQL format. To learn more, please refer to the S3 Object Lambda user guide.

Walkthrough

In this section, we describe how to use client-side filtering, Athena, and S3 Object Lambda to query data from Amazon S3 and Amazon S3 Glacier. For example, you might have a log file stored in S3 containing information about each request from your application, and want to query it for the entries specific to a particular request that took longer to finish than expected. In this example, we use the following log file, structured as a CSV file with two columns: request_id and log_text.

R1001, this is the first log line
R1002, this is the second log line

For objects stored in the S3 Glacier Flexible Retrieval or S3 Glacier Deep Archive storage classes, you first need to restore the relevant objects using the RestoreObject API.

import boto3

s3 = boto3.client('s3')

# Restore objects from S3 Glacier (if applicable)
s3.restore_object(
    Bucket='examplebucket',
    Key='archivedobjectkey',
    RestoreRequest={
        'Days': 1,
        'GlacierJobParameters': {
            'Tier': 'Standard',
        },
    },
)
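
Restores from the S3 Glacier storage classes are asynchronous. One way to check whether a restore has completed is to poll the HeadObject API and inspect the Restore field, as in this minimal sketch (reusing the s3 client from above):

# Poll the restore status: 'ongoing-request="false"' means the restore has completed.
response = s3.head_object(Bucket='examplebucket', Key='archivedobjectkey')
print(response.get('Restore'))  # for example: 'ongoing-request="false", expiry-date="..."'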

Option 1: Client-side filtering
Download the object using the S3 GET API and then use any data analysis library to query the object in your client application, keeping only the required subset of the data. In this example, we use the AWS SDK for Python (Boto3) with the AWS Common Runtime (CRT) to download the S3 object, and then query it using the Pandas library with the PyArrow engine.

import pandas as pd
import boto3
from boto3.s3.transfer import TransferConfig

s3 = boto3.client("s3")

# Download the S3 object, setting concurrency based on your instance's processing capacity.
bucket = 'my-s3-bucket'        # your bucket name
key = 'path/to/logs.parquet'   # your object key
s3.download_file(bucket, key, 'output.parquet', Config=TransferConfig(max_concurrency=10))

# Filter the S3 object, keeping only the rows for request R1001.
args = {'filters': [('request_id', '=', 'R1001')]}
df = pd.read_parquet('output.parquet', engine='pyarrow', **args)

Option 2: Amazon Athena
Use Amazon Athena to query your S3 objects using standard SQL queries. To do this, create a table in Athena using an S3 bucket as the source location for data and run your desired queries on the Athena table. The results are stored in the query result location that you set up, and they can be downloaded upon completion of the query. To query restored objects from the S3 Glacier storage classes, you also need to enable S3 Glacier access for queries in Athena SQL workgroups. Please see this blog post for more information.

1. Open the Amazon Athena console. If this is your first time using the Athena console in your current AWS Region, then choose the Explore the query editor option to open the query editor. Choose Edit Settings to set up a query result location in S3. For each query that you run, Athena automatically stores the query results and metadata information in this query result location.

2. Create a table using Hive syntax.
CREATE EXTERNAL TABLE IF NOT EXISTS `my_db`.`my_app_logs` (`request_id` string, `log_text` string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' 
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION 's3://my-s3-bucket/path/to/files'
TBLPROPERTIES ('classification' = 'parquet');

3. If you are using restored objects in Athena, then you can use the ALTER TABLE SET TBLPROPERTIES command to set the table property, as in the following example.

ALTER TABLE table_name SET TBLPROPERTIES ('read_restored_glacier_objects' = 'true')

4. Start a query using Athena.

import boto3

athena_client = boto3.client('athena')

# Start a query for a specific request_id
request_id = 'R1001'
start_query_response = athena_client.start_query_execution(
    QueryString=f"select log_text from my_app_logs where request_id='{request_id}'",
    QueryExecutionContext={'Database': 'my_db'},
    ResultConfiguration={'OutputLocation': 's3://my-s3-bucket/path/to/results'}
)

query_execution_id = start_query_response['QueryExecutionId']

5. Monitor the progress periodically to check if the query has finished executing.

import time

# Poll until the query reaches a terminal state.
while True:
    finish_state = athena_client.get_query_execution(QueryExecutionId=query_execution_id)['QueryExecution']['Status']['State']
    if finish_state in ('SUCCEEDED', 'FAILED', 'CANCELLED'):
        break  # throw an error or start a new query if FAILED/CANCELLED
    time.sleep(1)  # wait and poll again while RUNNING/QUEUED

6. Get the query results if your query ran successfully.

if finish_state == 'SUCCEEDED':
    athena_query_response = s3.get_object(Bucket='my-s3-bucket',
                                          Key=f'path/to/results/{query_execution_id}.csv')
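
Because Athena stores query results as a CSV file, you can load the response body directly into a DataFrame for further analysis. Here is a minimal sketch, assuming the athena_query_response object from the previous step and the Pandas library from Option 1:

import pandas as pd

# The response body is a file-like stream, so it can be read directly into a DataFrame.
results_df = pd.read_csv(athena_query_response['Body'])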

Option 3: Amazon S3 Object Lambda
You can also use S3 Object Lambda to query your objects in S3 (or objects restored from S3 Glacier) by creating a Lambda function with your desired logic and creating an S3 Object Lambda Access Point to use in your application. We briefly describe below how to create and use an S3 Object Lambda Access Point. Please see this blog post for detailed instructions.

1. Create a Lambda function to retrieve the required subset of your object after getting the original object from S3. Refer to the Writing Lambda functions page for detailed instructions.

Based on your use case, you can provide additional context from your application by using custom headers or query parameters in the S3 GetObject request. In our log filtering example, we send the log's request_id (such as R1001) as a header. These values are passed to the Lambda function in its input event object (event.userRequest.url and event.userRequest.headers). See the full structure in this Amazon S3 user guide.

import boto3
import requests

def handler(event, context):
    """
    Retrieve the operation context object from the event. This object indicates where the WriteGetObjectResponse request
    should be delivered and has a presigned URL in 'inputS3Url' where we can download the requested object from.
    The 'userRequest' object has information related to the user who made this 'GetObject' request to 
    S3 Object Lambda.
    """
    get_context = event["getObjectContext"]
    route = get_context["outputRoute"]
    token = get_context["outputToken"]
    s3_url = get_context["inputS3Url"]
    
    """
    Fetch the request_id sent by the client application as a user-defined metadata header. All metadata headers start with the prefix 'x-amz-meta'.
    """
    request_id = event["userRequest"]["headers"]["x-amz-meta-request-id"]

    # Download the original object from S3 using the presigned URL.
    original_object = requests.get(s3_url).content

    # Filter the log file using request_id; implement your filtering logic on
    # original_object here.
    filtered_records = ...

    # Send the filtered records back to the client.
    s3 = boto3.client('s3')
    s3.write_get_object_response(Body=filtered_records, RequestRoute=route, RequestToken=token)

    # Gracefully exit the Lambda function.
    return { 'status_code': 200 }
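
As mentioned earlier, you can also transform the results before returning them. The following is a minimal sketch of a hypothetical helper, send_compressed_response, that gzip-compresses the filtered records before sending them back; it reuses the route and token values from the handler above:

import gzip
import boto3

def send_compressed_response(filtered_records, route, token):
    # Compress the filtered records and return them through S3 Object Lambda.
    # The ContentEncoding response header tells the client to decompress the body.
    s3 = boto3.client('s3')
    s3.write_get_object_response(
        Body=gzip.compress(filtered_records.encode('utf-8')),
        RequestRoute=route,
        RequestToken=token,
        ContentEncoding='gzip',
    )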

2. Create an S3 Object Lambda Access Point for the S3 bucket that contains your log files and give it a name of your choice. In this example, we have our log files in the S3 bucket my-s3-bucket and we are naming our S3 Object Lambda Access Point my-log-filter-olap.

aws s3control create-access-point-for-object-lambda --account-id 111122223333 --name my-log-filter-olap --configuration file://my-olap-configuration.json
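
The configuration file referenced in the command connects the S3 Object Lambda Access Point to your Lambda function through a standard access point on the bucket. The following is a sketch of what my-olap-configuration.json might look like; the Region, the supporting access point my-access-point, and the function name my-log-filter-function are placeholders for resources in your own account:

{
    "SupportingAccessPoint": "arn:aws:s3:us-east-1:111122223333:accesspoint/my-access-point",
    "TransformationConfigurations": [
        {
            "Actions": ["GetObject"],
            "ContentTransformation": {
                "AwsLambda": {
                    "FunctionArn": "arn:aws:lambda:us-east-1:111122223333:function:my-log-filter-function"
                }
            }
        }
    ]
}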

3. Update your client application to use the S3 Object Lambda Access Point that you created previously.

import boto3

s3 = boto3.client('s3')

# Add the request_id as a metadata header to every outgoing request.
def _add_header(request, **kwargs):
    request.headers.add_header('x-amz-meta-request-id', 'R1001')

event_system = s3.meta.events
event_system.register_first('before-sign.*.*', _add_header)

# Use the S3 Object Lambda Access Point
response = s3.get_object(Bucket='arn:aws:s3-object-lambda:region:accountid:accesspoint/my-log-filter-olap',
                         Key='logfile001.csv')
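
The response body now contains only the filtered log records produced by your Lambda function, which you can read as usual:

# Read the filtered log records returned by the Lambda function.
filtered_logs = response['Body'].read().decode('utf-8')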

Conclusion

In this post, we discussed three options to query data in Amazon S3, each with its own benefits. We talked about how you can choose the right option for your use case and walked through the steps you can take to start using each one of these options.

In summary, you can download your objects onto your AWS compute instance and use a data analysis library for client-side filtering, which offers high performance for querying single S3 objects or restored S3 Glacier objects. You can use Athena to run SQL queries against single or multiple S3 objects, or restored S3 Glacier objects, and benefit from its support for additional file formats and compression types; this option is particularly suited for serverless workloads because there is no infrastructure to manage. Alternatively, if you want to add data transformations using your own code, you can use S3 Object Lambda to query and transform your data as it is returned to your application.

Thank you for reading this post. Feel free to leave any comments in the comments section below.

Arushi Garg

Arushi is a Product Manager with Amazon S3. She enjoys understanding customer problems and innovating to solve them. Outside of work, she loves singing while playing her ukulele and is on a constant lookout for travel opportunities.

Preethi Raajaratnam

Preethi Raajaratnam is a Senior Software Engineer for Amazon S3. She enjoys building large-scale systems and inventing on behalf of customers. Outside of work, she loves to read (huge Harry Potter fan) and travel.