AWS Storage Blog

Querying data without servers or databases using Amazon S3 Select

In our time as Solutions Architects at AWS, we have supported all types of customers and use cases. We regularly hear the same feedback: rather than deploying and managing systems and infrastructure, customers want to reduce complexity and management overhead so they can focus on their core business. One common source of complexity and overhead is efficiently searching through large amounts of data for exactly what you need.

In this post, we cover using structured query language (SQL) queries to search through data loaded to Amazon Simple Storage Service (Amazon S3) as a comma-separated value (CSV) file. Previously, this data would have had to be loaded into a database before it could be queried, and in addition to deploying a database, the customer would have needed to deploy an application and website to enable the search. Instead of deploying a database and its associated resources, we leverage an S3 feature called S3 Select to build a phone book search tool that is completely serverless.

We first show the basics of executing SQL queries to return results from a simple phone book .csv file. To explore this solution a bit further, we also provide a sample phone book search tool in the AWS Samples GitHub repository that includes all the components needed to create a completely serverless phone book search application.

Customers leverage Amazon S3 to store and protect any amount of data without provisioning storage or managing infrastructure. Amazon S3 Select and Amazon S3 Glacier Select enable customers to run SQL queries directly against data stored in Amazon S3 and Amazon S3 Glacier. With S3 Select, you simply store your data in S3 and use SQL statements to filter the contents of S3 objects, retrieving only the data you need. By retrieving only a subset of the data, customers reduce the amount of data that Amazon S3 transfers, which reduces the cost and latency of retrieving it. Reducing cost and complexity enables AWS customers to move faster and shorten the time required to deliver value to their businesses and their customers.

S3 Select works on objects stored in CSV, JSON, or Apache Parquet format. S3 Select also supports compression on CSV and JSON objects with GZIP or BZIP2, and server-side encrypted objects. You can perform SQL queries using AWS SDKs, the SELECT Object Content REST API, the AWS Command Line Interface (AWS CLI), or the AWS Management Console.

Featured AWS services

Our simple phone book application leverages the following AWS services:

  • Amazon S3 is an object storage service that offers industry-leading scalability, data availability, security, and performance.
  • S3 Select enables applications to retrieve only a subset of data from an S3 object by using simple SQL expressions.

In addition to S3 and S3 Select, the Amazon S3 Select – Phonebook Search GitHub sample project includes the following services:

  • Amazon API Gateway is a fully managed service that makes it easy for developers to create, publish, maintain, monitor, and secure APIs at any scale. Amazon API Gateway is a common component of serverless applications and will be used to interact with AWS Lambda.
  • AWS Lambda lets you run code without provisioning or managing servers. The S3 Select query in our sample project is executed with AWS Lambda; a rough sketch of this pattern follows this list.
  • AWS CloudFormation provides a common language for you to model and provision AWS and third-party application resources in your cloud environment. CloudFormation is used for our sample app to orchestrate the deployment of the resources required for the sample project with minimal effort.
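
The GitHub project wires API Gateway, Lambda, and S3 Select together for you. Purely as an illustration of the pattern (not the sample's actual code), a Lambda handler behind an API Gateway proxy integration might look like the sketch below; the bucket, key, and query-string parameter names are assumptions for this example.

import os

import boto3

s3 = boto3.client('s3')

# Hypothetical configuration for this sketch; the GitHub sample defines its own.
BUCKET = os.environ.get('BUCKET_NAME', 's3select-demo')
KEY = os.environ.get('OBJECT_KEY', 'sample_data.csv')

def handler(event, context):
    # With a Lambda proxy integration, API Gateway passes query-string parameters in the event.
    name = (event.get('queryStringParameters') or {}).get('name', '')

    # Escape single quotes by doubling them so the value is safe inside the SQL string literal.
    safe_name = name.replace("'", "''")

    resp = s3.select_object_content(
        Bucket=BUCKET,
        Key=KEY,
        ExpressionType='SQL',
        Expression=f"SELECT * FROM s3object s WHERE s.\"Name\" = '{safe_name}'",
        InputSerialization={'CSV': {'FileHeaderInfo': 'Use'}, 'CompressionType': 'NONE'},
        OutputSerialization={'CSV': {}},
    )

    # Concatenate the streamed Records events into a single CSV string.
    rows = ''.join(
        item['Records']['Payload'].decode('utf-8')
        for item in resp['Payload']
        if 'Records' in item
    )

    return {'statusCode': 200, 'headers': {'Content-Type': 'text/plain'}, 'body': rows}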

Sample code

Sample code for this project exists in the AWS-Samples GitHub repository, and the high-level details are outlined below.

Getting started

Since S3 Select runs directly on S3 with data stored in your S3 bucket, all you need to get started is an AWS account and an S3 bucket.

Sign in to your existing AWS account, or create a new AWS account. Once you sign in, create an S3 bucket to be used for testing with S3 Select.

The data we use for testing is a simple CSV file containing the Name, Phone Number, City, and Occupation of our users. The raw data of the CSV file is below, and available in GitHub, so feel free to download the file and edit it as well!

Name,PhoneNumber,City,Occupation
Sam,(949) 555-6701,Irvine,Solutions Architect
Vinod,(949) 555-6702,Los Angeles,Solutions Architect
Jeff,(949) 555-6703,Seattle,AWS Evangelist
Jane,(949) 555-6704,Chicago,Developer
Sean,(949) 555-6705,Chicago,Developer
Mary,(949) 555-6706,Chicago,Developer
Kate,(949) 555-6707,Chicago,Developer

Upload the sample_data.csv file to your new S3 bucket.
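
If you prefer to script the upload instead of using the console, a minimal boto3 sketch (assuming the bucket is named s3select-demo, the name used in the examples below) would be:

import boto3

# Upload the sample phone book CSV to the test bucket.
boto3.client('s3').upload_file('sample_data.csv', 's3select-demo', 'sample_data.csv')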

To quickly test, we run the following in Python, which queries the “sample_data.csv” object in our S3 bucket named “s3select-demo.” Please note the bucket name must be changed to reflect the name of the bucket you created.

To get set up for this quick test, we deploy a t3.micro EC2 instance running Amazon Linux 2 and install boto3 using pip. Be sure to attach an IAM role with appropriate permissions (at a minimum, s3:GetObject on the test object, which is the permission S3 Select uses).

Configuring an EC2 instance to run S3 Select queries

Once the instance is running, log in as ec2-user and run the following commands to set up your environment:

sudo yum update -y
sudo yum install python3 -y
python3 -m venv ~/s3select_example/env
source ~/s3select_example/env/bin/activate
pip install pip --upgrade
pip install boto3
wget https://raw.githubusercontent.com/aws-samples/s3-select-phonebook-search/master/src/samples/jane.py
wget https://raw.githubusercontent.com/aws-samples/s3-select-phonebook-search/master/src/samples/jane-gzip.py

The steps above create a Python 3 environment and download two Python files. The first, jane.py, has the following contents and allows us to search for users with the first name Jane. Note that you must replace the S3 bucket name in the file with the name of your S3 bucket.

import boto3

s3 = boto3.client('s3')

resp = s3.select_object_content(
    Bucket='s3select-demo',
    Key='sample_data.csv',
    ExpressionType='SQL',
    Expression="SELECT * FROM s3object s where s.\"Name\" = 'Jane'",
    InputSerialization = {'CSV': {"FileHeaderInfo": "Use"}, 'CompressionType': 'NONE'},
    OutputSerialization = {'CSV': {}},
)

for event in resp['Payload']:
    if 'Records' in event:
        records = event['Records']['Payload'].decode('utf-8')
        print(records)
    elif 'Stats' in event:
        statsDetails = event['Stats']['Details']
        print("Stats details bytesScanned: ")
        print(statsDetails['BytesScanned'])
        print("Stats details bytesProcessed: ")
        print(statsDetails['BytesProcessed'])
        print("Stats details bytesReturned: ")
        print(statsDetails['BytesReturned'])

The OutputSerialization field is set to CSV, so this prints the results matching “Jane” as CSV. This could be set to JSON if preferred for your use case.
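
For example, to have each matching record returned as a JSON object instead of a CSV row, you could swap the OutputSerialization line in the script for the following (the rest of the call stays the same):

# Return each matching record as a JSON object, one per line.
OutputSerialization = {'JSON': {'RecordDelimiter': '\n'}},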

Executing an S3 Select query

After changing the S3 bucket name in the jane.py file to match the S3 bucket name you created, run the query using the following command:

python jane.py

This results in the output below:

Jane,(949) 555-6704,Chicago,Developer

Stats details bytesScanned:
326
Stats details bytesProcessed:
326
Stats details bytesReturned:
38

The match for the user Jane shows up, along with some optional details we added to show the data scanned, processed, and returned by S3 Select. In this case, the sample_data.csv is 326 bytes. S3 Select scans the entire file and returns only 38 bytes.
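
Only the SQL expression needs to change for other searches. As one illustration, swapping the Expression line in jane.py for the aggregate query below returns just a count of the developers based in Chicago (4 for the sample phone book) instead of their full records:

# Aggregate functions such as COUNT are supported; this query returns a single number.
Expression="SELECT count(*) FROM s3object s WHERE s.\"City\" = 'Chicago' AND s.\"Occupation\" = 'Developer'",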

S3 Select with compressed data

Let’s run the same test again, this time after compressing the phone book with GZIP and uploading it as sample_data.csv.gz. This file is also available for download from GitHub.
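
If you would rather create the compressed object yourself, a short sketch using Python's gzip module (and the same file and bucket names used in this post) looks like this:

import gzip
import shutil

import boto3

# Compress the sample CSV locally...
with open('sample_data.csv', 'rb') as src, gzip.open('sample_data.csv.gz', 'wb') as dst:
    shutil.copyfileobj(src, dst)

# ...and upload the compressed copy alongside the original.
boto3.client('s3').upload_file('sample_data.csv.gz', 's3select-demo', 'sample_data.csv.gz')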

The only modifications to the Python code are changing the Key to point to the gzipped object and changing the InputSerialization CompressionType from NONE to GZIP. The new version of the Python script is saved as jane-gzip.py and is available on the AWS-Samples GitHub page as well.

We changed the object key name to specify the gzip file:

Key='sample_data.csv.gz',

And we changed the InputSerialization line to set the CompressionType to GZIP:

InputSerialization = {'CSV': {"FileHeaderInfo": "Use"}, 'CompressionType': 'GZIP'},

The full jane-gzip.py file is below. Note that you must replace the S3 bucket name with the name of your S3 bucket.

import boto3

s3 = boto3.client('s3')

resp = s3.select_object_content(
    Bucket='s3select-demo',
    Key='sample_data.csv.gz',
    ExpressionType='SQL',
    Expression="SELECT * FROM s3object s where s.\"Name\" = 'Jane'",
    InputSerialization = {'CSV': {"FileHeaderInfo": "Use"}, 'CompressionType': 'GZIP'},
    OutputSerialization = {'CSV': {}},
)

for event in resp['Payload']:
    if 'Records' in event:
        records = event['Records']['Payload'].decode('utf-8')
        print(records)
    elif 'Stats' in event:
        statsDetails = event['Stats']['Details']
        print("Stats details bytesScanned: ")
        print(statsDetails['BytesScanned'])
        print("Stats details bytesProcessed: ")
        print(statsDetails['BytesProcessed'])
        print("Stats details bytesReturned: ")
        print(statsDetails['BytesReturned'])

The following command executes an S3 Select query on the gzip file:

python jane-gzip.py

This results in the output below:

Jane,(949) 555-6704,Chicago,Developer

Stats details bytesScanned:
199
Stats details bytesProcessed:
326
Stats details bytesReturned:
38

Comparing results of compressed and uncompressed data

Using GZIP compression saves space in S3 and reduces the amount of data S3 Select scans. For the small CSV file in this test, compression results in roughly a 39% space savings.

The following table shows the difference executing S3 Select between the two files, sample_data.csv and sample_data.csv.gz.

File           Size (Bytes)   Bytes scanned   Bytes processed   Bytes returned   Difference
Uncompressed   326            326             326               38               N/A
Compressed     199            199             326               38               ~39% smaller

Taking advantage of data compression gets more interesting with larger files. For example, a 133,975,755-byte CSV file (~128 MB) consisting of ~1,000,000 lines was reduced by ~60% to 50,308,104 bytes (~50.3 MB) with GZIP compression.

File           Size (Bytes)    Bytes scanned    Bytes processed    Bytes returned   Difference
Uncompressed   133,975,755     133,975,755      133,975,755        6                N/A
Compressed     50,308,104      50,308,104       133,975,755        6                ~60% smaller

Querying archives with S3 Glacier Select

When you provide an SQL query for an S3 Glacier archive object, S3 Glacier Select runs the query in place and writes the output results to Amazon S3. With S3 Glacier Select, you can run queries and custom analytics on data stored in S3 Glacier without having to restore the data to a hotter tier like S3 Standard.

When you perform select queries, S3 Glacier provides three data access tiers: expedited, standard, and bulk. These tiers offer different data access times and costs, and you can choose any one of them depending on how quickly you want your data to be available. For all but the largest archives (250 MB+), data accessed using the expedited tier is typically made available within 1–5 minutes. The standard tier finishes within 3–5 hours, and bulk retrievals finish within 5–12 hours.
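
Unlike select_object_content, a Glacier Select query is started as an asynchronous job through the Glacier InitiateJob API, and the results land in the S3 bucket you specify. The boto3 sketch below is illustrative only: the vault name, archive ID, and results bucket are placeholders you would replace with your own.

import boto3

glacier = boto3.client('glacier')

# Placeholder values for illustration; substitute your own vault, archive ID, and bucket.
response = glacier.initiate_job(
    accountId='-',  # '-' refers to the account that owns the credentials in use
    vaultName='example-vault',
    jobParameters={
        'Type': 'select',
        'ArchiveId': 'EXAMPLE-ARCHIVE-ID',
        'Tier': 'Standard',  # or 'Expedited' / 'Bulk'
        'SelectParameters': {
            'ExpressionType': 'SQL',
            'Expression': "SELECT * FROM archive WHERE Name = 'Jane'",
            'InputSerialization': {'csv': {'FileHeaderInfo': 'USE'}},
            'OutputSerialization': {'csv': {}},
        },
        'OutputLocation': {
            'S3': {
                'BucketName': 's3select-demo',
                'Prefix': 'glacier-select-results/',
                'StorageClass': 'STANDARD',
            }
        },
    },
)
print(response['jobId'])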

Conclusion

In this post, we showed how S3 Select provides a simple way to execute SQL queries directly against data stored in Amazon S3 or Amazon S3 Glacier. Customers can use this feature to process data uploaded to S3, either programmatically or from services such as AWS Transfer for SFTP (AWS SFTP). For example, customers could upload data directly to S3 using AWS SFTP and then query it with S3 Select. This work can even be automated by using S3 Event Notifications to invoke an AWS Lambda function whenever a new CSV object is uploaded. Searching through your data with S3 Select can save you the time and money you might otherwise spend combing through data in other ways.

To get more hands-on with S3 Select, we encourage you to head over to the AWS Samples GitHub repo, where you can find the sample code for the simple phone book application. The sample offloads the S3 Select query to AWS Lambda, which is invoked through Amazon API Gateway.

Cleaning up

In our sample, we created an S3 bucket, uploaded a .csv file (sample_data.csv), and queried the data using a t3.micro EC2 instance. To clean up the environment, terminate the EC2 instance and delete the sample_data.csv file (and, if you uploaded it, sample_data.csv.gz) from your S3 bucket. You can also choose to delete the S3 bucket you used for testing. These steps ensure that there are no ongoing costs to your account from this sample.

Additional Resources