AWS Database Blog
Improve query performance and reduce cost using scheduled queries in Amazon Timestream
In this post, we show you how to use scheduled queries in Amazon Timestream to improve your query performance and reduce cost. Scheduled queries make real-time analytics more performant and cost-effective, so you can derive additional insights from your data and continue to make better business decisions.
Timestream is a serverless time series database that customers across a broad range of industry verticals have adopted to derive real-time insights, monitor critical business applications, and analyze millions of real-time events across websites and applications. By analyzing these diverse workloads, in conjunction with the query access patterns and query concurrency requirements, we made several optimizations to Timestream that resulted in improvements in query latency up to three times faster across different workloads.
During AWS re:Invent 2021, Timestream announced new updates that make it easier and cheaper to implement many use cases:
- Multi-measure records – You can now include more than one measure in a Timestream record. This makes it easy to load your data and reduces the number of records to write and query for many use cases. For example, you can use multi-measure records when you’re tracking more than one metric emitted from the same source at the same time. Multi-measure records also simplify migrating data from relational databases to Timestream because the database schema needs fewer or no changes.
- Magnetic storage writes – Timestream manages the lifecycle of your data with storage tiers: a memory store for recent data and a magnetic store for historical data. The transfer of data from the memory store to the magnetic store is automated and based on configurable policies. Previously, data could only be loaded to the memory store. Now, you can optimize storage costs by sending late-arriving data to the magnetic store instead of keeping a large memory store.
- Scheduled queries – Scheduled queries in Timestream offer a fully managed, serverless, and scalable solution for calculating and storing aggregates, rollups, and other real-time analytics used to power frequently accessed operational dashboards, business reports, applications, and device-monitoring systems.
In the following sections, we show how to create a scheduled query in Timestream and compare performance against a direct query.
Solution overview
Repeating a query on a frequently computed dataset is expensive because the compute resources take time to process the aggregation and produce the result. This mechanism involves several steps.
With scheduled queries, you simply define the queries that calculate aggregates, rollups, and other real-time analytics on your incoming data. Timestream periodically and automatically runs these queries and reliably writes the results into a configurable destination table. You can then point your dashboards, reports, applications, and monitoring systems to query the destination tables instead of querying the considerably larger source tables containing the incoming time series data. This leads to increased performance while reducing cost by an order of magnitude. The destination tables contain much less data than the source tables, thereby offering faster and less expensive data access and storage.
Given that destination tables contain much less data than source tables, you can store data in the destination tables for a much longer duration at a fraction of the storage cost of the source table. You can also reduce the data retention period of your source tables and further optimize your spend. Scheduled queries can therefore make time series analytics faster, more cost-effective, and more accessible to many more customers, so you can continue to make better data-driven business decisions.
To demonstrate the post with a working example, we’ve created a sample dataset that collects weather data, and we’ll create an aggregated query using a scheduled query.
Schema and dataset
The following screenshot shows the schema of our Internet of Things (IoT) sensor data table.
We loaded a sample IoT sensor dataset that collects the humidity and temperature every 15 seconds. We loaded the table with 3,754,312 records. The following table shows an example of the data.
country | city | state | gpio | time | temperature | humidity |
US | Jersey City | NJ | 17 | 2023-03-07 00:30:00.000000000 | 77.67058823529413 | 37.372549019607845 |
US | Jersey City | NJ | 22 | 2023-03-07 00:30:00.000000000 | 77.0 | 35.254901960784316 |
US | Jersey City | NJ | 17 | 2023-03-07 00:15:00.000000000 | 78.421052631579 | 38.50877192982456 |
US | Jersey City | NJ | 22 | 2023-03-07 00:15:00.000000000 | 77.0 | 35.59649122807018 |
US | Jersey City | NJ | 22 | 2023-03-07 00:00:00.000000000 | 77.0 | 36.206896551724135 |
US | Jersey City | NJ | 17 | 2023-03-07 00:00:00.000000000 | 78.80000000000008 | 39.0 |
US | Jersey City | NJ | 17 | 2023-03-06 23:45:00.000000000 | 78.80000000000008 | 39.29824561403509 |
US | Jersey City | NJ | 22 | 2023-03-06 23:45:00.000000000 | 77.0 | 36.91228070175438 |
US | Jersey City | NJ | 22 | 2023-03-06 23:30:00.000000000 | 77.0 | 37.91228070175438 |
US | Jersey City | NJ | 17 | 2023-03-06 23:30:00.000000000 | 78.80000000000008 | 39.85964912280702 |
Prerequisites
You need the following prerequisites:
- Create an empty table to store the output of the scheduled query. You’ll reference this table when creating your scheduled query.
- An Amazon Simple Storage Service (Amazon S3) bucket to store the query run error logs.
- An Amazon Simple Notification Service (Amazon SNS) topic to send notification updates about the query run status.
- An AWS Identity and Access Management (IAM) role that has permission to the source and destination tables and SNS topic to send notifications.
- An AWS Key Management Service (AWS KMS) key to encrypt the entire data.
* The additional costs you will incur are the query execution cost and the additional storage cost (aggregated result data).
Create an aggregated query
To demonstrate using a scheduled query, we first create a simple aggregated query that calculates the average temperature and humidity by city, month and year:
SELECT city
, AVG (temperature) AS avg_temp
, AVG (humidity) AS avg_humidity
, cast(month(time) as VARCHAR) AS month
, cast(year(time) as VARCHAR) AS year
, from_iso8601_timestamp('2022-01-01T00:00:00') AS time
FROM "IoTHome"."sensordata"
WHERE measure_name ='weather'
GROUP BY city, month(time), year(time)
ORDER BY city, year, month
The following screenshot shows our output.
The following section shows the steps to create and run a scheduled query in Timestream.
Create a scheduled query
To create your scheduled query, complete the following steps:
- On the Timestream console, choose Scheduled queries in the navigation pane (under Management Tools).
- Choose Create scheduled query.
- Under Destination Table, select the database name from the dropdown and select table name (the query output table you created as a prerequisite).
- Under Query Name, enter a name for your query.
- Choose Next.
- Under Query statement, enter the aggregated query and choose Validate.
The destination table schema will be populated after the query has been successfully validated. Customers can either choose to proceed with suggested schema or make changes based on their use case.
- Choose Next.
- Under Run schedule, specify the scheduled interval you want the query to run on. For this post, we want to run the query every 6 hours.
- Under Security settings, provide your IAM role and KMS key.
- Under SNS notifications, provide your SNS topic.
- For Error report logging, enter your S3 bucket.
- Choose Next.
- Review your settings and choose Create.
You can now see your query listed on the Scheduled queries page. After it runs on its schedule, the information in the Last run time column will be updated.
The following screenshots show the metrics details of your queries like the bytes scanned, duration, total rows returned.
Query performance metrics
The following table shows the performance benefit you get by using scheduled queries in Timestream. The scheduled query shows performance improvement in duration as well as the total bytes scanned compared to running the same query directly.
Query Type | Total Bytes Scanned | Duration (Cold Start) | Duration |
Direct Query | 177.57 MB | 3.622 seconds | 2.264 seconds |
Scheduled Query | 627.00 B | 0.2510 seconds | 0.1520 seconds |
The performance of the scheduled query increased by 14 times. This is a small example, but as your data scale increases you can realize significant cost savings because significantly less data is scanned. The actual amounts will vary based on the data scale and the queries.
We also need to consider the number of users or reports that query against the same dataset, which will significantly increase the performance and cost savings as well.
Conclusion
In this post, we showed how scheduled queries in Timestream can help you improve your query performance and reduce cost. For more detail about Timestream Scheduled queries and more example please refer to our documentation.
About the Authors
Sreenath Gotur is leading the Specialist Partner Solutions Architecture team at AWS based out of Charlotte, NC. Prior to joining AWS, he was heading enterprise data management, enterprise data services and data innovation portfolio with a large financial firm. Sreenath has a special interest in Data & Analytics, Document DB, Timestream and Graph DB. In his spare time, he enjoys spending quality time with his family.
Norbert Funke is a Sr. Timestream Specialist Solutions Architect at AWS based out of New York. Prior to joining AWS, he was working for a data consulting company owned by PwC on data architecture and data analytics.