Introducing the Amazon Timestream UNLOAD statement: Export time-series data for additional insights
Amazon Timestream is a fully managed, scalable, and serverless time series database service that makes it easy to store and analyze trillions of events per day. Customers across a broad range of industry verticals have adopted Timestream to derive real-time insights, monitor critical business applications, and analyze millions of real-time events across websites and applications. Timestream makes it easy to build these solutions because it can automatically scale depending on the workload without operational overhead to manage the underlying infrastructure.
Many Timestream customers want to derive additional value from their time series data by using it in other contexts, such as adding it to their data lake, training machine learning (ML) models for forecasting, or enriching it with data using other AWS or third-party services. However, doing so is time-consuming because it requires complex custom solutions. To address these needs, we are excited to introduce support for the UNLOAD statement, a secure and cost-effective way to build AI/ML pipelines and simplify extract, transform, and load (ETL) processes.
In this post, we demonstrate how to export time series data from Timestream to Amazon Simple Storage Service (Amazon S3) using the UNLOAD statement.
Whether you’re a climate researcher predicting weather trends, a healthcare provider monitoring patient health, a manufacturing engineer overseeing production, a supply chain manager optimizing operations, or an ecommerce manager tracking sales, you can use the new UNLOAD statement to derive additional value from your time series data.
The following are example scenarios where the UNLOAD statement can help:
- Healthcare analytics – Healthcare organizations monitor the health metrics of their patients over a period of time, generating massive amounts of time series data. They can use Timestream to track and monitor patient health in real time. They can now export the data to their data lake where they can enrich and further analyze it to predict outcomes and improve patient care.
- Supply chain analytics – Supply chain analysts can use Timestream to track metrics across the supply chain, such as inventory levels, delivery times, and delays, to optimize their supply chains. They can now use the UNLOAD statement to export gigabytes of the data into Amazon S3 where they can use other AWS services or third-party services for predictive modeling.
- Ecommerce analytics – Ecommerce managers can use Timestream to track ecommerce store and website metrics, such as source of traffic, clickthrough rate, and quantity sold. They can now use the UNLOAD statement to export the data to Amazon S3 and analyze it with other relevant non-time series data such as customer demographics to optimize marketing investments.
UNLOAD statement overview
The UNLOAD statement in Timestream enables you to export your query results into Amazon S3 in a secure and cost-effective manner. With UNLOAD, you can export gigabytes of time series data to select S3 buckets in either Apache Parquet or comma-separated values (CSV) format, providing you the flexibility to store, combine, and analyze time series data using other services such as Amazon Athena, Amazon EMR, and Amazon SageMaker. The UNLOAD statement also allows you to encrypt your exported data using Amazon S3 managed keys (SSE-S3) or AWS Key Management Service (AWS KMS) managed keys (SSE-KMS) and compress it to prevent unauthorized data access and reduce storage costs. In addition, you have the flexibility to choose one or more columns to partition the exported data, enabling downstream services to scan only the data relevant to a query, thereby minimizing the processing time and cost.
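The UNLOAD statement wraps a SELECT query, names the destination S3 location, and accepts a list of options that control the format, compression, encryption, and partitioning of the exported data.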
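The general shape of an UNLOAD statement can be sketched with a small Python helper that assembles the statement text. This is an illustration under stated assumptions, not the official grammar: the helper names `sql_literal` and `render_unload`, the database, table, and bucket names, and the option values shown (`PARQUET`, `SSE_S3`) are illustrative and should be checked against the UNLOAD documentation.

```python
from typing import Mapping, Sequence, Union

OptionValue = Union[str, Sequence[str]]


def sql_literal(value: str) -> str:
    """Quote a string as a SQL literal, doubling any embedded single quotes."""
    return "'" + value.replace("'", "''") + "'"


def render_unload(select_sql: str, s3_uri: str, options: Mapping[str, OptionValue]) -> str:
    """Render UNLOAD (<select>) TO '<s3 uri>' WITH (<option> = <value>, ...)."""
    rendered = []
    for name, value in options.items():
        if isinstance(value, str):
            rendered.append(f"{name} = {sql_literal(value)}")
        else:
            # Sequences (for example, the partition columns) become ARRAY[...] values.
            items = ", ".join(sql_literal(item) for item in value)
            rendered.append(f"{name} = ARRAY[{items}]")
    statement = f"UNLOAD ({select_sql})\nTO {sql_literal(s3_uri)}"
    if rendered:
        statement += "\nWITH (" + ", ".join(rendered) + ")"
    return statement


# Hypothetical database, table, and bucket names, used only for illustration.
example = render_unload(
    'SELECT user_id, quantity, channel FROM "sample_db"."sample_table"',
    "s3://example-bucket/unload-demo",
    {"partitioned_by": ["channel"], "format": "PARQUET", "encryption": "SSE_S3"},
)
print(example)
```

Building the statement from a dictionary keeps the options in one place, which makes it easier to switch between Parquet and CSV or to change the encryption setting without editing the query text by hand.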
In this post, we walk through the steps and best practices for exporting data from Timestream to Amazon S3 and deriving additional insights. The high-level steps are as follows:
- Ingest sample data into Timestream.
- Perform data analysis.
- Use the UNLOAD statement to export the query result set to Amazon S3.
- Create an AWS Glue Data Catalog table.
- Derive additional business insights using Athena.
The following diagram illustrates the solution architecture.
Note that you will incur the cost of AWS resources used at public pricing if you choose to reproduce the solution in your environment.
We demonstrate the solution through a sample use case where we use Timestream for tracking metrics from an ecommerce website. Every time a product is sold, the sales data—including product ID, quantity sold, the channel that drove the customer to the website (such as social media or organic search), timestamp of the transaction, and other relevant details—is recorded and ingested into Timestream. We’ve created sample data that you can ingest into Timestream. This data has been generated using Faker and cleaned for the purposes of this demonstration.
The data includes the channel, user ID, user group, transaction timestamp, product ID, and quantity, among other fields. Whenever a search results in a purchase, the product ID and quantity are recorded. When ingesting the data into Timestream, we used the following data model:
- Dimensions – The dimensions include channel, user_id, and user_group. For more information about dimensions, refer to Amazon Timestream concepts.
- Time – We used current_time. Note that the timestamps in the sample data might be outdated. The sample code provided in this post changes them to recent timestamps while ingesting.
- Multi-measure records – The measures include quantity. For more information, refer to Multi-measure records.
To follow along with this post, you must meet the following prerequisites:
- To create databases and tables, you need these permissions to allow CRUD operations
- To insert records, you need these permissions to allow insert operations
- To run an UNLOAD query, you need these prerequisites for writing data to Amazon S3
- Before you run the code blocks, export the appropriate AWS account credentials as environment variables
Ingest data into Timestream
You can use the sample code in this post to create a database and table, and then ingest ecommerce website sales data into Timestream. Complete the following steps:
- Set up a Jupyter notebook or integrated development environment (IDE) of your choice. The following code is split into multiple parts for illustrative purposes and uses Python version 3.9. If you intend to use the same code, combine the code blocks into a single program or use a Jupyter notebook to follow the sample.
- Initialize your Timestream clients:
- Create a database:
After you create the database, you can view it on the Timestream console.
- Create a table:
The table is now viewable on the Timestream console.
- Ingest the sample data into the table:
After ingesting, you can preview the contents of the table using the query editor on the Timestream console.
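The ingestion steps above can be sketched with boto3 as follows. This is a minimal sketch rather than the post's original sample code: the measure name `sales`, the retention values, the default Region, and the reduced set of dimensions and measures are assumptions made for illustration. The helpers take the client as a parameter so they can be tried against a stub before running against an AWS account.

```python
import time
from typing import Any, Dict, Iterable

DIMENSIONS = ("channel", "user_id", "user_group")  # assumed subset of the dimensions
MEASURES = {"quantity": "BIGINT"}                  # assumed subset of the measures


def make_clients(region: str = "us-east-1"):
    """Create the Timestream write and query clients (needs boto3 and credentials)."""
    import boto3  # imported lazily so the helpers below also work without boto3

    return (
        boto3.client("timestream-write", region_name=region),
        boto3.client("timestream-query", region_name=region),
    )


def create_database_and_table(write_client, database: str, table: str) -> None:
    """Create the database and a table with illustrative retention settings."""
    write_client.create_database(DatabaseName=database)
    write_client.create_table(
        DatabaseName=database,
        TableName=table,
        RetentionProperties={
            "MemoryStoreRetentionPeriodInHours": 24,  # assumed value
            "MagneticStoreRetentionPeriodInDays": 7,  # assumed value
        },
    )


def build_record(row: Dict[str, Any], time_ms: int) -> Dict[str, Any]:
    """Turn one sample row into a multi-measure Timestream record."""
    return {
        "Dimensions": [{"Name": d, "Value": str(row[d])} for d in DIMENSIONS],
        "MeasureName": "sales",  # hypothetical measure name
        "MeasureValueType": "MULTI",
        "MeasureValues": [
            {"Name": m, "Value": str(row[m]), "Type": t} for m, t in MEASURES.items()
        ],
        "Time": str(time_ms),
        "TimeUnit": "MILLISECONDS",
    }


def write_in_batches(write_client, database: str, table: str,
                     rows: Iterable[Dict[str, Any]], batch_size: int = 100) -> int:
    """Write rows with recent timestamps, at most 100 records per request."""
    now_ms = int(time.time() * 1000)
    # Offset each record by 1 ms so rows sharing dimensions do not collide on time.
    records = [build_record(row, now_ms - i) for i, row in enumerate(rows)]
    for start in range(0, len(records), batch_size):
        write_client.write_records(
            DatabaseName=database,
            TableName=table,
            Records=records[start:start + batch_size],
        )
    return len(records)
```

A typical run would call `make_clients()`, then `create_database_and_table(...)`, and finally `write_in_batches(...)` with rows read from the sample file.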
Perform data analysis
In Timestream, you can perform real-time analytics on the ingested data. For example, you can query the number of units sold per product in a day, the number of customers landing on the store from social media advertising in the past week, trends in sales, patterns in purchases for the last hour, and so on.
To find the number of units sold per product in the last 24 hours, use the following query:
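A minimal sketch of such a query, run through the boto3 query client, is shown here. The database and table names passed in and the column names `product_id` and `quantity` are assumptions based on the data model described earlier. The helper `rows_as_dicts` is a hypothetical convenience that flattens the paginated response into dictionaries.

```python
from typing import Any, Dict, Iterable, Iterator, List


def units_sold_query(database: str, table: str) -> str:
    """Build a query for units sold per product over the last 24 hours."""
    return (
        "SELECT product_id, SUM(quantity) AS units_sold "
        f'FROM "{database}"."{table}" '
        "WHERE time BETWEEN ago(24h) AND now() "
        "GROUP BY product_id "
        "ORDER BY units_sold DESC"
    )


def rows_as_dicts(pages: Iterable[Dict[str, Any]]) -> Iterator[Dict[str, Any]]:
    """Flatten Timestream query result pages into {column name: value} rows."""
    for page in pages:
        names = [column["Name"] for column in page["ColumnInfo"]]
        for row in page["Rows"]:
            # NULL cells carry no ScalarValue, so they come back as None.
            values = [datum.get("ScalarValue") for datum in row["Data"]]
            yield dict(zip(names, values))


def run_query(query_client, sql: str) -> List[Dict[str, Any]]:
    """Run a query with the boto3 paginator and collect every row."""
    paginator = query_client.get_paginator("query")
    return list(rows_as_dicts(paginator.paginate(QueryString=sql)))
```

Scalar values are returned as strings, so numeric columns such as `units_sold` need to be converted before further arithmetic.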
Export the data to Amazon S3
You can use the UNLOAD statement to export the time series data to Amazon S3 for additional analysis. In this example, we analyze customers based on the channel by which they arrived on the website. You can partition the data using the partitioned_by clause to export channel-specific data into a folder. In this example, we use Parquet format to export the data:
When you use the partitioned_by clause, the columns listed in the partitioned_by field must be the last columns in the SELECT statement, and they must appear in the ARRAY value in the same order as they appear in the SELECT statement.
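The ordering rule for partition columns is easy to get wrong, so it can help to check it before submitting the query. The following sketch encodes the rule as stated above. The function name `check_partition_columns` and the column lists are hypothetical.

```python
from typing import Sequence


def check_partition_columns(select_columns: Sequence[str],
                            partitioned_by: Sequence[str]) -> None:
    """Raise ValueError unless the partition columns are the trailing SELECT columns.

    The partition columns must be the last columns of the SELECT list and must be
    listed in the same order as they appear there.
    """
    if not partitioned_by:
        raise ValueError("partitioned_by needs at least one column")
    tail = list(select_columns[-len(partitioned_by):])
    if tail != list(partitioned_by):
        raise ValueError(
            f"partition columns {list(partitioned_by)} must match the last "
            f"SELECT columns {tail} in the same order"
        )


# Passes: channel is the last selected column.
check_partition_columns(["user_id", "quantity", "channel"], ["channel"])

# Would raise: channel is selected first, not last.
# check_partition_columns(["channel", "user_id", "quantity"], ["channel"])
```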
After you run the preceding query containing the UNLOAD statement, you can review the details in the Export to Amazon S3 summary section on the Query results tab.
When you view the results folder in Amazon S3, you can see that data is partitioned by the channel name.
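The partition folders can also be inspected programmatically. The following sketch assumes that the exported objects use `name=value` folder names (for example, `channel=Social media`) under the results folder, and it uses made-up object keys. Verify the actual key layout in your bucket before relying on it.

```python
from collections import Counter
from typing import Dict, Iterable


def partition_values(key: str) -> Dict[str, str]:
    """Extract name=value partition folders from an S3 object key."""
    values: Dict[str, str] = {}
    for segment in key.split("/")[:-1]:  # the last segment is the file name
        name, sep, value = segment.partition("=")
        if sep and name:
            values[name] = value
    return values


def files_per_channel(keys: Iterable[str]) -> Counter:
    """Count exported files for each channel partition."""
    counts: Counter = Counter()
    for key in keys:
        channel = partition_values(key).get("channel")
        if channel is not None:
            counts[channel] += 1
    return counts


# Hypothetical object keys, used only for illustration.
sample_keys = [
    "unload-demo/results/channel=Social media/part-0001.parquet",
    "unload-demo/results/channel=Social media/part-0002.parquet",
    "unload-demo/results/channel=Organic search/part-0001.parquet",
]
print(files_per_channel(sample_keys))
```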
Create an AWS Glue Data Catalog table
You create an AWS Glue crawler to scan the data in the S3 bucket, infer the schema, and create a metadata table in the AWS Glue Data Catalog for the data exported out of Timestream. Assuming you have the required permissions in AWS Glue, in this section, we present two options: create a metadata file for each channel separately, or crawl the entire results folder and automatically detect partitions.
Option 1: Create an AWS Glue metadata file for each channel separately
If you need to perform different analyses for each channel and you used the partitioned_by clause to separate out time series data by channel, you can generate an AWS Glue Data Catalog for a particular channel. For this example, we create a Data Catalog for the Social media channel. Complete the following steps:
- On the AWS Glue console, choose Crawlers in the navigation pane.
- Choose Create crawler.
- Add a new S3 data source whose location is the Social media channel folder inside the results folder of your export. This location contains all Social media channel-related time series data.
- Create a new AWS Glue database or use an existing one as per your needs.
- Usually, AWS Glue infers the table name from the provided S3 folder structure, but if needed, you can add an optional table prefix. In this case, because we kept the table prefix empty, the final table name will be channel_social_media.
- Keep the schedule set to On demand because we’re just going to crawl it one time to create a Data Catalog.
- Fill in the other required fields and choose Create crawler.
- After the crawler is created, select it on the Crawlers page and choose Run crawler to create a Data Catalog based on the exported Timestream data.
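The console steps above can also be scripted. The following sketch uses the boto3 Glue client's `create_crawler` and `start_crawler` calls. The crawler name, IAM role ARN, database name, and S3 path in the usage note are placeholders, and leaving out a schedule is assumed here to correspond to the On demand setting.

```python
from typing import Any, Dict


def crawler_request(name: str, role_arn: str, database: str,
                    s3_path: str, table_prefix: str = "") -> Dict[str, Any]:
    """Build the keyword arguments for glue.create_crawler."""
    request: Dict[str, Any] = {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,
        "Targets": {"S3Targets": [{"Path": s3_path}]},
    }
    if table_prefix:
        request["TablePrefix"] = table_prefix
    # No "Schedule" key is set, so the crawler only runs when started explicitly.
    return request


def create_and_run_crawler(glue_client, **kwargs: Any) -> Dict[str, Any]:
    """Create the crawler, start one run, and return the request that was sent."""
    request = crawler_request(**kwargs)
    glue_client.create_crawler(**request)
    glue_client.start_crawler(Name=request["Name"])
    return request
```

For example, `create_and_run_crawler(boto3.client("glue"), name="timestream-unload-crawler", role_arn="arn:aws:iam::123456789012:role/ExampleGlueRole", database="example_db", s3_path="s3://example-bucket/unload-demo/results/")` would crawl a hypothetical results folder once.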
When the run is complete, you will see the run history, where it says “1 table change,” which indicates that one table was added to the Data Catalog.
If you navigate to the Tables page on the AWS Glue console, you should see the new table channel_social_media with a schema that has been auto-inferred by the crawler.
You can now use Athena to view the data in this table:
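One way to run such a preview from Python is through the boto3 Athena client. This is a sketch under assumptions: the preview query, the Glue database name, and the S3 output location are placeholders you would replace, and `run_athena_query` is a hypothetical helper that polls until the query finishes and returns only the first page of results.

```python
import time
from typing import Any, Dict, List

PREVIEW_SQL = 'SELECT * FROM "channel_social_media" LIMIT 10'


def run_athena_query(athena_client, sql: str, database: str, output_location: str,
                     poll_seconds: float = 1.0, max_polls: int = 60) -> List[Dict[str, Any]]:
    """Start an Athena query, wait for it to finish, and return the first result page."""
    started = athena_client.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    execution_id = started["QueryExecutionId"]
    for _ in range(max_polls):
        execution = athena_client.get_query_execution(QueryExecutionId=execution_id)
        status = execution["QueryExecution"]["Status"]
        if status["State"] == "SUCCEEDED":
            results = athena_client.get_query_results(QueryExecutionId=execution_id)
            # The first row of a SELECT result holds the column headers.
            return results["ResultSet"]["Rows"]
        if status["State"] in ("FAILED", "CANCELLED"):
            reason = status.get("StateChangeReason", "no reason given")
            raise RuntimeError(f"Athena query {status['State']}: {reason}")
        time.sleep(poll_seconds)
    raise TimeoutError(f"Athena query {execution_id} did not finish in time")
```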
Option 2: Crawl the results folder with auto-detected partitions by the AWS Glue metastore
Crawler creation for this option follows the same procedure as before. The only change is that the S3 location selected is the parent results folder.
This time, when the crawler runs successfully, you will see that the table changes indicate one table was created with five partitions.
In the table schema, you can see that channel is auto-inferred as a partition key.
Creating a partitioned AWS Glue table makes it straightforward to query across channels without joining tables, as well as query on a per-channel basis. For more information, refer to Work with partitioned data in AWS Glue.
Derive insights with Amazon Athena
You can combine the time series data that you track and analyze in Timestream with non-time series data that you have outside of Timestream to derive insightful trends using services such as Athena.
For this example, we use a publicly available user dataset. This dataset contains per-user details, including user_id, which is also a dimension in our time series data. We use user_id to join the time series data with non-time series data to derive insights about user behavior for customers landing on the page from the Social media channel.
Upload the file to Amazon S3 and create a table in Athena for this data:
We also use a free zipcode dataset, which has fields such as TotalWages. We use this dataset to derive demographic insights.
Upload this file to Amazon S3 and create a table in Athena for the data. Note that because the file has quoted all the fields, we import all the fields as a string for simplicity and later cast them to appropriate types when needed.
Perform a join of all three S3 datasets:
Use the following code to find sales by age group for the channel Social media:
We get the following results:
Use the following query to view sales by state:
We get the following results:
To avoid future costs, remove all the resources you created for this post:
- On the Timestream console, choose Tables in the navigation pane and delete the table you created.
- Choose Databases in the navigation pane and delete the database you created. A database can be deleted only after its tables have been deleted.
- On the AWS Glue console, choose Crawlers in the navigation pane.
- Select the crawler you created and on the Action menu, choose Delete crawler.
- Choose Tables on the console and delete the Data Catalog tables created for this post.
- On the Amazon S3 console, choose Buckets in the navigation pane.
- Empty and delete the bucket created for this post.
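The Timestream and AWS Glue parts of this cleanup can also be scripted with boto3. The following is a minimal sketch in which every resource name passed in is a placeholder. It deletes the Timestream table before the database because Timestream rejects the deletion of a database that still contains tables. Emptying and deleting the S3 bucket is left to the console steps above.

```python
from typing import Iterable


def clean_up(write_client, glue_client, ts_database: str, ts_table: str,
             crawler_name: str, glue_database: str, glue_tables: Iterable[str]) -> None:
    """Delete the Timestream and AWS Glue resources created for this walkthrough."""
    # The table must go first: a database with tables in it cannot be deleted.
    write_client.delete_table(DatabaseName=ts_database, TableName=ts_table)
    write_client.delete_database(DatabaseName=ts_database)

    glue_client.delete_crawler(Name=crawler_name)
    for table_name in glue_tables:
        glue_client.delete_table(DatabaseName=glue_database, Name=table_name)
```

Here `write_client` is a `timestream-write` client and `glue_client` is a `glue` client, both created with `boto3.client(...)`.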
The UNLOAD statement in Timestream enables you to unload your time series data into Amazon S3 in a secure and cost-effective manner. The statement allows flexibility through a range of options, such as partitioning by one or more columns or choosing the format, compression, and encryption. Whether you plan to add time series data to a data warehouse, build an AI/ML pipeline, or simplify ETL processes for time series data, the UNLOAD statement will make the process more straightforward.
To learn more about the UNLOAD statement, refer to UNLOAD Concepts.
About the Authors
Shravanthi Rajagopal is a Senior Software Engineer at Amazon Timestream. With over 8 years of experience in building scalable distributed systems, she demonstrates excellence in ensuring seamless customer experiences. Shravanthi has been part of the Timestream team since its inception and is currently a tech lead working on exciting customer features. Away from screens, she finds joy in singing and embarking on culinary adventures in search of delicious cuisine.
Praneeth Kavuri is a Senior Product Manager in AWS working on Amazon Timestream. He enjoys building scalable solutions and working with customers to help deploy and optimize database workloads on AWS.