AWS Database Blog
Effective queries for common query patterns in Amazon Timestream
You can use SQL to query time series data in Amazon Timestream, and with SQL you can rewrite a query in multiple ways to get the same results. Although several queries may produce the same result, their performance, cost profile, and complexity can differ, so it’s important to identify the right query. Because Timestream is a serverless, purpose-built time series database with usage-based pricing, well-written queries are both performant and cost optimized. Furthermore, the built-in time series functions in Amazon Timestream reduce query complexity because time series data is treated as a first-class concept.
In this post, we review some of the most common query patterns in time series workloads and show how to use SQL in Timestream to write effective queries for those patterns. We first review the sample IoT dataset that we use to demonstrate the access patterns, and then compare the flat and time series views of data in Timestream. Understanding the flat vs. time series data model helps us make better use of the built-in time series functions that we rely on in this post to reduce query complexity. Finally, we review seven common access patterns that we see among customers; for each pattern, we discuss the business question and the corresponding query that answers it. Each query pattern is independent, so you can read only the ones you’re interested in.
Sample data
We run all the queries presented in this post on a small, crafted dataset, so you can understand how each query works. Although we show how the queries work on a couple of records, we have tested those queries at scale, on a dataset thousands of times bigger. The dataset is based on an IoT workload and includes different scenarios—like missing or invalid measures—that you might see in a real production system.
The script in the following section generates the dataset and inserts it into Timestream. The dataset starts at <current_year>-01-01 09:00:00 and contains data from 14 devices. Each device has multiple sensors that create readings approximately every 10 minutes, until around 10:00 AM. Each reading consists of multiple measures, split between two tables: SensorReadings and DeviceMetadata. Measures in the SensorReadings table include temperature, humidity, and air_quality_no2. Measures in DeviceMetadata include battery_level, state, and uc_temperature. All measures are modeled in a single Timestream row with the MULTI MeasureValueType in their respective tables.
We have a time filter in the WHERE clause for all the queries. This predicate, along with any other predicates on measure names and dimensions, enables Timestream to prune the data and limit the data scanned to calculate the results of a query. This results in better query performance and lower query cost. For more query best practices in Timestream, see Queries.
Prerequisites
Before you begin, you must have an AWS account with read and write privileges to Amazon Timestream, and access to AWS CloudShell.
Ingest sample data to Timestream
Complete the following steps to generate the sample data used in this post:
- Create a database in Timestream named Blog with two tables, SensorReadings and DeviceMetadata, with 1-year memory retention.
- With CloudShell, run the following Python script with the command python3 script_name.py.
Flat model vs. time series view
Timestream supports two data models for queries: a flat model and a time series model.
The flat model is a tabular format and is the default format for queries. In the flat model, each measurement is represented as an individual row with the corresponding time, measure name, measure value(s), and dimensions as columns. We use the flat model with standard SQL operations (including standard SQL functions, operators, aggregations, and filtering) and when we’re not utilizing any of the built-in time series functions. In the sample data, the SensorReadings table models temperature, humidity, air_quality_no2, and request_number as a single MULTI measure in Timestream (with dimensions device_id and city), so you can access all those fields in a SQL statement in a single row. For example, you can use the following SELECT statement to get a list of temperature and humidity readings and their corresponding timestamps and request numbers:
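A minimal sketch of such a statement, assuming the Blog database and the multi-measure columns described above (the original query may differ slightly):

```sql
-- Sketch: flat-model query over the multi-measure SensorReadings table
SELECT time, device_id, city, temperature, humidity, request_number
FROM "Blog"."SensorReadings"
WHERE time > ago(365d)                 -- time predicate to limit the data scanned
  AND device_id IN ('real1', 'real2')  -- the two devices shown in the sample output
ORDER BY device_id, time
```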
This returns the data represented in the flat model.
time | device_id | city | temperature | humidity | request_number |
2021-01-01 09:01:00.000000000 | real1 | Zion | 102 | 13.2 | 30493 |
2021-01-01 09:13:00.000000000 | real1 | Zion | 103 | 13.1 | 30494 |
2021-01-01 09:31:00.000000000 | real1 | Zion | 107 | 13.1 | 30496 |
2021-01-01 09:44:00.000000000 | real1 | Zion | 106 | 13.0 | 30497 |
2021-01-01 09:58:00.000000000 | real1 | Zion | 105 | 12.9 | 30499 |
2021-01-01 09:00:00.000000000 | real2 | Zion | 104 | 14.2 | 43231 |
2021-01-01 09:12:00.000000000 | real2 | Zion | 105 | 14.1 | 43232 |
2021-01-01 09:22:00.000000000 | real2 | Zion | 109 | 14.1 | 43232 |
2021-01-01 09:33:00.000000000 | real2 | Zion | 109 | 14.1 | 43234 |
The time series model is a query-time construct that represents data as an ordered sequence of time and measure value pairs for a given set of dimensions. This format is required by the built-in time series functions in Timestream, such as interpolation, derivatives, integrals, and correlations, which you can use to fill gaps caused by missing values (interpolation), find the rate of change (derivatives), measure the correlation between two time series, or address other common time series use cases.
To transform data from the flat model to the time series model, Timestream includes the built-in function CREATE_TIME_SERIES(time, value), which takes two input columns, time and value, and returns a time series data type as a sequence of timestamps and corresponding values. We can then use this data type as an input to time series functions in Timestream. The following query illustrates the CREATE_TIME_SERIES function:
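A sketch of that query, under the same schema assumptions:

```sql
-- Sketch: CREATE_TIME_SERIES is an aggregate function, so we group by the dimension(s)
SELECT device_id,
       CREATE_TIME_SERIES(time, temperature) AS temp_series,
       CREATE_TIME_SERIES(time, humidity)    AS humidity_series
FROM "Blog"."SensorReadings"
WHERE time > ago(365d)
  AND device_id IN ('real1', 'real2')
  AND temperature IS NOT NULL           -- time series values cannot be null
GROUP BY device_id
```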
The query returns the following result.
device_id | temp_series | humidity_series |
real1 | [ { time: 2021-01-01 09:01:00.000000000, value: 102 }, { time: 2021-01-01 09:13:00.000000000, value: 103 }, { time: 2021-01-01 09:31:00.000000000, value: 107 }, { time: 2021-01-01 09:44:00.000000000, value: 106 }, { time: 2021-01-01 09:58:00.000000000, value: 105 } ] | [ { time: 2021-01-01 09:01:00.000000000, value: 13.2 }, { time: 2021-01-01 09:13:00.000000000, value: 13.1 }, { time: 2021-01-01 09:31:00.000000000, value: 13.1 }, { time: 2021-01-01 09:44:00.000000000, value: 13.0 }, { time: 2021-01-01 09:58:00.000000000, value: 12.9 } ] |
real2 | [ { time: 2021-01-01 09:00:00.000000000, value: 104 }, { time: 2021-01-01 09:12:00.000000000, value: 105 }, { time: 2021-01-01 09:22:00.000000000, value: 109 }, { time: 2021-01-01 09:33:00.000000000, value: 109 } ] | [ { time: 2021-01-01 09:00:00.000000000, value: 14.2 }, { time: 2021-01-01 09:12:00.000000000, value: 14.1 }, { time: 2021-01-01 09:22:00.000000000, value: 14.1 }, { time: 2021-01-01 09:33:00.000000000, value: 14.1 } ] |
UNNEST
In some cases, you might need to transform the data from the time series model back to the flat model in order to join it with other tables or datasets, or to perform aggregations using standard SQL aggregate or window functions. UNNEST(timeseries) does the reverse of CREATE_TIME_SERIES: it takes an input of the time series data type and produces a table with two columns, one for time and one for the measure values. UNNEST supports giving aliases to the table, the time column, and the value column it produces when used as UNNEST(timeseries) AS <alias_name> (time_alias, value_alias):
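A sketch that builds a temperature time series and then flattens it again with UNNEST (the CTE name and aliases are illustrative):

```sql
WITH series AS (
  SELECT device_id,
         CREATE_TIME_SERIES(time, temperature) AS temp_series
  FROM "Blog"."SensorReadings"
  WHERE time > ago(365d)
    AND device_id IN ('real1', 'real2')
    AND temperature IS NOT NULL
  GROUP BY device_id
)
-- UNNEST turns the time series back into (time, value) rows
SELECT device_id, t.time, t.value
FROM series
CROSS JOIN UNNEST(temp_series) AS t (time, value)
ORDER BY device_id, t.time
```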
This returns the following result.
device_id | time | value |
real1 | 2021-01-01 09:01:00.000000000 | 102 |
real1 | 2021-01-01 09:13:00.000000000 | 103 |
real1 | 2021-01-01 09:31:00.000000000 | 107 |
real1 | 2021-01-01 09:44:00.000000000 | 106 |
real1 | 2021-01-01 09:58:00.000000000 | 105 |
real2 | 2021-01-01 09:00:00.000000000 | 104 |
real2 | 2021-01-01 09:12:00.000000000 | 105 |
real2 | 2021-01-01 09:22:00.000000000 | 109 |
real2 | 2021-01-01 09:33:00.000000000 | 109 |
See the Interpolation section for a practical example of the CREATE_TIME_SERIES and UNNEST functions.
Common query patterns
This section contains seven access patterns. For each pattern, we discuss the business question and walk through the steps to build the corresponding query that answers it. Each query pattern is independent, so you can read only the patterns that interest you:
- Last values queries (asset tracking, latest location, latest sensor reading)
- Interpolation (filling in missing data points)
- N-event correlation (looking for patterns in events)
- Running aggregates
- Derivatives and rate of change
- Show top N series by a high cardinality dimension
- Build histograms
1. Last values queries
A common query pattern in time series workloads is finding the latest record or latest value of a measure, such as the latest reading from a sensor, the latest state, or the latest location of an asset. In our example, we’re interested in finding the devices that are currently not in service. In our dataset, the state of each device is stored in a measure called state (within the multi-measure record named multi), so our query refers to this column. For this query pattern, we compare two options: one using a common table expression (CTE), and the other without a CTE but utilizing the max_by() function available in Timestream. First, let’s look at getting the latest value using a CTE:
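The following sketch assumes the state measure is stored in the DeviceMetadata table as part of a multi-measure record named multi:

```sql
-- Sketch: latest state per device using a CTE
WITH latest_ts AS (
  SELECT device_id, max(time) AS latest_timestamp
  FROM "Blog"."DeviceMetadata"
  WHERE time > ago(365d)          -- only consider the past 365 days
    AND measure_name = 'multi'    -- multi-measure record name (assumption)
    AND state IS NOT NULL
  GROUP BY device_id
)
SELECT d.device_id, l.latest_timestamp, d.state
FROM "Blog"."DeviceMetadata" d
INNER JOIN latest_ts l
  ON d.device_id = l.device_id
 AND d.time = l.latest_timestamp
WHERE d.time > ago(365d)
  AND d.state IS NOT NULL
  AND d.state != 'IN_SERVICE'     -- keep only devices whose latest state is not IN_SERVICE
```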
In the preceding query, we first find the latest event timestamp related to the state of each device in the past 365 days using the common table expression (CTE) latest_ts. We have 14 devices and we expect all of them to have reported their latest state in the past 365 days, so the CTE should return 14 rows, one for each device, as shown in the following table.
device_id | latest_timestamp |
smarty9 | 2021-01-01 10:00:00.000000000 |
coldy2 | 2021-01-01 10:20:00.000000000 |
smarty7 | 2021-01-01 10:00:00.000000000 |
coldy1 | 2021-01-01 10:00:00.000000000 |
smarty2 | 2021-01-01 10:00:00.000000000 |
smarty0 | 2021-01-01 10:00:00.000000000 |
smarty6 | 2021-01-01 10:00:00.000000000 |
smarty4 | 2021-01-01 10:00:00.000000000 |
smarty8 | 2021-01-01 10:00:00.000000000 |
smarty5 | 2021-01-01 10:00:00.000000000 |
real2 | 2021-01-01 10:00:00.000000000 |
smarty3 | 2021-01-01 10:00:00.000000000 |
smarty1 | 2021-01-01 10:00:00.000000000 |
real1 | 2021-01-01 09:58:00.000000000 |
Now that we have the latest timestamp for the state of each device, we can query the main table to see the actual state of each device by joining the CTE latest_ts to the main table on device_id and the resulting timestamps. Furthermore, we filter the results to only the devices that didn’t report IN_SERVICE as their latest state. Based on our dataset, this results in two devices being returned by the query. If we simply needed the latest state of each device, we could remove the state filter from the WHERE clause and would expect 14 rows (one for each device) to be returned.
Running the preceding query returns the following result.
device_id | latest_timestamp | state |
coldy2 | 2021-01-01 10:20:00.000000000 | AUTO_SHUTDOWN |
real2 | 2021-01-01 10:00:00.000000000 | REPLACE_BATTERY |
If we look at the output tab on the Timestream console, we see this query has scanned 8.17 KB. Now let’s look at the alternative way of finding the latest value without a CTE, utilizing the max_by() function:
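A sketch of the max_by() variant, under the same schema assumptions:

```sql
SELECT device_id,
       max(time)           AS latest_timestamp,
       max_by(state, time) AS state            -- state value at the maximum time
FROM "Blog"."DeviceMetadata"
WHERE time > ago(365d)
  AND measure_name = 'multi'                   -- multi-measure record name (assumption)
  AND state IS NOT NULL
GROUP BY device_id
HAVING max_by(state, time) != 'IN_SERVICE'     -- filter on the aggregated latest state
```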
The preceding query produces the same end result as the previous query using a CTE. Here we utilize the max_by(x, y) function, which returns the value of x associated with the maximum value of y; in our case, the state value associated with the maximum value of time. Furthermore, we filter the results using the SQL HAVING clause because we’re only interested in devices whose last reported state is not IN_SERVICE.
When you compare the two queries, you can observe that the second query is much simpler. In addition, if you look at the number of bytes scanned, you notice that it’s smaller than the bytes scanned by the first query. This means the second query can be cheaper than the first in cases where the bytes scanned exceed 10 MB (note that, in Timestream, the minimum metered bytes per query is 10 MB).
When you use the max_by() function, you can also find the latest N values of a measure. For instance, with max_by(temperature, time, 5), we can find the latest five temperature measures.
2. Interpolation
It’s common in an IoT use case to see that data from devices might sometimes be missing, reported at varying intervals, or contain unrealistic values. Let’s look at temperature sensor readings from the real1 and real2 devices:
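A sketch of that query (the temp alias is illustrative):

```sql
SELECT time, device_id, temperature AS temp
FROM "Blog"."SensorReadings"
WHERE time > ago(365d)
  AND device_id IN ('real1', 'real2')
  AND temperature IS NOT NULL
ORDER BY device_id, time
```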
We can observe the missing temperature measure around 9:20 AM from sensor real1, measure timestamps that are unevenly spaced, and an unrealistic measure (999) at 9:44 AM from sensor real2.
time | device_id | temp |
2021-01-01 09:01:00.000000000 | real1 | 102 |
2021-01-01 09:13:00.000000000 | real1 | 103 |
2021-01-01 09:31:00.000000000 | real1 | 107 |
2021-01-01 09:44:00.000000000 | real1 | 106 |
2021-01-01 09:58:00.000000000 | real1 | 105 |
2021-01-01 09:00:00.000000000 | real2 | 104 |
2021-01-01 09:12:00.000000000 | real2 | 105 |
2021-01-01 09:22:00.000000000 | real2 | 109 |
2021-01-01 09:33:00.000000000 | real2 | 109 |
2021-01-01 09:44:00.000000000 | real2 | 999 |
In this case, we can filter out the unrealistic data by adding a temperature < 999 condition:
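For example (a sketch of the same query with the extra predicate):

```sql
SELECT time, device_id, temperature AS temp
FROM "Blog"."SensorReadings"
WHERE time > ago(365d)
  AND device_id IN ('real1', 'real2')
  AND temperature IS NOT NULL
  AND temperature < 999      -- drop unrealistic readings
ORDER BY device_id, time
```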
However, how can we fix the missing data point and align the measures to 10-minute intervals, so we can create a nice-looking graph with points every 10 minutes? We can do this using the INTERPOLATE_LINEAR function, which fills in missing data using linear interpolation:
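A sketch using INTERPOLATE_LINEAR with a SEQUENCE of timestamps spaced 10 minutes apart (the exact sequence bounds are an assumption):

```sql
SELECT device_id,
       INTERPOLATE_LINEAR(
         CREATE_TIME_SERIES(time, temperature),
         SEQUENCE(min(time), max(time), 10m)) AS interpolated_temp
FROM "Blog"."SensorReadings"
WHERE time > ago(365d)
  AND device_id IN ('real1', 'real2')
  AND temperature IS NOT NULL
  AND temperature < 999
GROUP BY device_id
```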
The INTERPOLATE_LINEAR function works on a time series (first argument) and an array of timestamps (second argument). The function fills in missing data in the time series and aligns the data points to the timestamps specified in the array. The INTERPOLATE_LINEAR function returns a time series object. Therefore, the result of the preceding query contains multiple rows, where each row has a device_id and a time series.
device_id | interpolated_temp |
real1 | [ { time: 2021-01-01 09:01:00.000000000, value: 102 }, { time: 2021-01-01 09:11:00.000000000, value: 102 }, … ] |
real2 | [ { time: 2021-01-01 09:00:00.000000000, value: 104 }, { time: 2021-01-01 09:10:00.000000000, value: 104 }, … ] |
The time series for device real2 can be visualized as follows.
time | value |
2021-01-01 09:00:00.000000000 | 104 |
2021-01-01 09:10:00.000000000 | 104 |
2021-01-01 09:20:00.000000000 | 108 |
2021-01-01 09:30:00.000000000 | 109 |
Note that the SEQUENCE function has a limit on the size of the output array. To make sure you don’t exceed that limit, you can run the interpolation on aggregated (in this case, averaged) temperatures per time bucket, as follows:
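One possible sketch, which averages temperatures into 10-minute bins before interpolating:

```sql
WITH binned AS (
  SELECT device_id,
         bin(time, 10m)   AS binned_time,
         avg(temperature) AS avg_temp
  FROM "Blog"."SensorReadings"
  WHERE time > ago(365d)
    AND device_id IN ('real1', 'real2')
    AND temperature IS NOT NULL
    AND temperature < 999
  GROUP BY device_id, bin(time, 10m)
)
SELECT device_id,
       INTERPOLATE_LINEAR(
         CREATE_TIME_SERIES(binned_time, avg_temp),
         SEQUENCE(min(binned_time), max(binned_time), 10m)) AS interpolated_temp
FROM binned
GROUP BY device_id
```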
If you prefer to return the data in the original form, use the UNNEST function to convert the result back to a flat table and round the interpolated values:
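A sketch (rounding to one decimal place is an assumption based on the results shown below):

```sql
WITH interpolated AS (
  SELECT device_id,
         INTERPOLATE_LINEAR(
           CREATE_TIME_SERIES(time, temperature),
           SEQUENCE(min(time), max(time), 10m)) AS interpolated_temp
  FROM "Blog"."SensorReadings"
  WHERE time > ago(365d)
    AND device_id IN ('real1', 'real2')
    AND temperature IS NOT NULL
    AND temperature < 999
  GROUP BY device_id
)
SELECT device_id, t.time, round(t.value, 1) AS interpolated_rounded_temp
FROM interpolated
CROSS JOIN UNNEST(interpolated_temp) AS t (time, value)
ORDER BY device_id, t.time
```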
The result contains the missing data point and all data points are evenly spaced 10 minutes apart.
device_id | time | interpolated_rounded_temp |
real1 | 2021-01-01 09:01:00.000000000 | 102.0 |
real1 | 2021-01-01 09:11:00.000000000 | 102.8 |
real1 | 2021-01-01 09:21:00.000000000 | 104.8 |
real1 | 2021-01-01 09:31:00.000000000 | 107.0 |
real1 | 2021-01-01 09:41:00.000000000 | 106.2 |
real1 | 2021-01-01 09:51:00.000000000 | 105.5 |
real2 | 2021-01-01 09:00:00.000000000 | 104.0 |
real2 | 2021-01-01 09:10:00.000000000 | 104.8 |
real2 | 2021-01-01 09:20:00.000000000 | 108.2 |
real2 | 2021-01-01 09:30:00.000000000 | 109.0 |
3. N-event correlation
In this example, we measure how long a device was cooling off after overheating. The cool-off time is defined as the time to transition from an OVERHEATING state to an IN_SERVICE state. Let’s focus on the coldy2 sensor:
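A sketch of the query:

```sql
SELECT time, device_id, state
FROM "Blog"."DeviceMetadata"
WHERE time > ago(365d)
  AND device_id = 'coldy2'
  AND state IS NOT NULL
ORDER BY time
```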
We get the following result.
time | device_id | state |
2021-01-01 09:00:00.000000000 | coldy2 | OVERHEATING |
2021-01-01 09:10:00.000000000 | coldy2 | IN_SERVICE |
2021-01-01 09:20:00.000000000 | coldy2 | OVERHEATING |
2021-01-01 09:30:00.000000000 | coldy2 | IN_SERVICE |
2021-01-01 09:40:00.000000000 | coldy2 | OVERHEATING |
2021-01-01 09:50:00.000000000 | coldy2 | OVERHEATING |
2021-01-01 10:00:00.000000000 | coldy2 | IN_SERVICE |
2021-01-01 10:10:00.000000000 | coldy2 | OVERHEATING |
2021-01-01 10:20:00.000000000 | coldy2 | AUTO_SHUTDOWN |
Notice that at 9:40 and 9:50, the device was overheating for more than 10 minutes and reported the OVERHEATING state twice.
Because we’re interested in state changes, let’s first remove the consecutive duplicate states. For this purpose, we use the LAG window function, which returns the value from the row before the current row in the window:
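A sketch (the binnedTime column in the results suggests the timestamps are binned to 10 minutes):

```sql
SELECT bin(time, 10m) AS binnedTime,
       device_id,
       state,
       LAG(state) OVER (PARTITION BY device_id ORDER BY time) AS prevState
FROM "Blog"."DeviceMetadata"
WHERE time > ago(365d)
  AND device_id = 'coldy2'
  AND state IS NOT NULL
ORDER BY binnedTime
```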
We get the following result.
binnedTime | device_id | state | prevState |
2021-01-01 09:00:00.000000000 | coldy2 | OVERHEATING | – |
2021-01-01 09:10:00.000000000 | coldy2 | IN_SERVICE | OVERHEATING |
2021-01-01 09:20:00.000000000 | coldy2 | OVERHEATING | IN_SERVICE |
2021-01-01 09:30:00.000000000 | coldy2 | IN_SERVICE | OVERHEATING |
2021-01-01 09:40:00.000000000 | coldy2 | OVERHEATING | IN_SERVICE |
2021-01-01 09:50:00.000000000 | coldy2 | OVERHEATING | OVERHEATING |
2021-01-01 10:00:00.000000000 | coldy2 | IN_SERVICE | OVERHEATING |
2021-01-01 10:10:00.000000000 | coldy2 | OVERHEATING | IN_SERVICE |
2021-01-01 10:20:00.000000000 | coldy2 | AUTO_SHUTDOWN | OVERHEATING |
Now that we have the previous state for each state reported by the devices, we can remove duplicates by adding predicates on the state and prevState columns:
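A sketch that wraps the previous query in a CTE and keeps only the rows where the state actually changed:

```sql
WITH states AS (
  SELECT bin(time, 10m) AS binnedTime, device_id, state,
         LAG(state) OVER (PARTITION BY device_id ORDER BY time) AS prevState
  FROM "Blog"."DeviceMetadata"
  WHERE time > ago(365d)
    AND device_id = 'coldy2'
    AND state IS NOT NULL
)
SELECT binnedTime, device_id, state, prevState
FROM states
WHERE prevState IS NULL OR state != prevState   -- drop consecutive duplicates
ORDER BY binnedTime
```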
We get the following result.
binnedTime | device_id | state | prevState |
2021-01-01 09:00:00.000000000 | coldy2 | OVERHEATING | – |
2021-01-01 09:10:00.000000000 | coldy2 | IN_SERVICE | OVERHEATING |
2021-01-01 09:20:00.000000000 | coldy2 | OVERHEATING | IN_SERVICE |
2021-01-01 09:30:00.000000000 | coldy2 | IN_SERVICE | OVERHEATING |
2021-01-01 09:40:00.000000000 | coldy2 | OVERHEATING | IN_SERVICE |
2021-01-01 10:00:00.000000000 | coldy2 | IN_SERVICE | OVERHEATING |
2021-01-01 10:10:00.000000000 | coldy2 | OVERHEATING | IN_SERVICE |
2021-01-01 10:20:00.000000000 | coldy2 | AUTO_SHUTDOWN | OVERHEATING |
To calculate how long each transition lasted, we need the time of each previous state. We use the LAG function again, this time on the binnedTime column:
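A sketch building on the previous CTE:

```sql
WITH states AS (
  SELECT bin(time, 10m) AS binnedTime, device_id, state,
         LAG(state) OVER (PARTITION BY device_id ORDER BY time) AS prevState
  FROM "Blog"."DeviceMetadata"
  WHERE time > ago(365d)
    AND device_id = 'coldy2'
    AND state IS NOT NULL
), state_changes AS (
  SELECT binnedTime, device_id, state, prevState
  FROM states
  WHERE prevState IS NULL OR state != prevState
)
SELECT binnedTime, device_id, state, prevState,
       LAG(binnedTime) OVER (PARTITION BY device_id ORDER BY binnedTime) AS prevBinnedTime
FROM state_changes
ORDER BY binnedTime
```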
This gives us the previous state and previous state time in the same row.
binnedTime | device_id | state | prevState | prevBinnedTime |
2021-01-01 09:00:00.000000000 | coldy2 | OVERHEATING | – | – |
2021-01-01 09:10:00.000000000 | coldy2 | IN_SERVICE | OVERHEATING | 2021-01-01 09:00:00.000000000 |
2021-01-01 09:20:00.000000000 | coldy2 | OVERHEATING | IN_SERVICE | 2021-01-01 09:10:00.000000000 |
2021-01-01 09:30:00.000000000 | coldy2 | IN_SERVICE | OVERHEATING | 2021-01-01 09:20:00.000000000 |
2021-01-01 09:40:00.000000000 | coldy2 | OVERHEATING | IN_SERVICE | 2021-01-01 09:30:00.000000000 |
2021-01-01 10:00:00.000000000 | coldy2 | IN_SERVICE | OVERHEATING | 2021-01-01 09:40:00.000000000 |
2021-01-01 10:10:00.000000000 | coldy2 | OVERHEATING | IN_SERVICE | 2021-01-01 10:00:00.000000000 |
2021-01-01 10:20:00.000000000 | coldy2 | AUTO_SHUTDOWN | OVERHEATING | 2021-01-01 10:10:00.000000000 |
With this data, we can calculate how long the device needed to cool off (moving from an OVERHEATING state to IN_SERVICE):
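A sketch (the recovery time is simply the difference between the two binned timestamps):

```sql
WITH states AS (
  SELECT bin(time, 10m) AS binnedTime, device_id, state,
         LAG(state) OVER (PARTITION BY device_id ORDER BY time) AS prevState
  FROM "Blog"."DeviceMetadata"
  WHERE time > ago(365d)
    AND device_id = 'coldy2'
    AND state IS NOT NULL
), state_changes AS (
  SELECT binnedTime, device_id, state, prevState
  FROM states
  WHERE prevState IS NULL OR state != prevState
), transitions AS (
  SELECT binnedTime, device_id, state, prevState,
         LAG(binnedTime) OVER (PARTITION BY device_id ORDER BY binnedTime) AS prevBinnedTime
  FROM state_changes
)
SELECT binnedTime,
       binnedTime - prevBinnedTime AS recovery_time,
       device_id
FROM transitions
WHERE prevState = 'OVERHEATING'
  AND state = 'IN_SERVICE'
ORDER BY binnedTime
```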
We get the following result.
binnedTime | recovery_time | device_id |
2021-01-01 09:10:00.000000000 | 0 00:10:00.000000000 | coldy2 |
2021-01-01 09:30:00.000000000 | 0 00:10:00.000000000 | coldy2 |
2021-01-01 10:00:00.000000000 | 0 00:20:00.000000000 | coldy2 |
Moreover, we can use the same principle to match event patterns. For example, the following query finds a pattern of moving from IN_SERVICE to OVERHEATING to AUTO_SHUTDOWN:
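A sketch using LAG with offsets 1 and 2 over the deduplicated state changes (run here across all devices):

```sql
WITH states AS (
  SELECT bin(time, 10m) AS binnedTime, device_id, state,
         LAG(state) OVER (PARTITION BY device_id ORDER BY time) AS prevState
  FROM "Blog"."DeviceMetadata"
  WHERE time > ago(365d)
    AND state IS NOT NULL
), state_changes AS (
  SELECT binnedTime, device_id, state
  FROM states
  WHERE prevState IS NULL OR state != prevState
), patterns AS (
  SELECT binnedTime, device_id, state,
         LAG(state, 1) OVER (PARTITION BY device_id ORDER BY binnedTime) AS prevState,
         LAG(state, 2) OVER (PARTITION BY device_id ORDER BY binnedTime) AS prevPrevState
  FROM state_changes
)
SELECT binnedTime, device_id, state, prevState, prevPrevState
FROM patterns
WHERE prevPrevState = 'IN_SERVICE'
  AND prevState = 'OVERHEATING'
  AND state = 'AUTO_SHUTDOWN'
```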
We get the following result.
binnedTime | device_id | state | prevState | prevPrevState |
2021-01-01 10:20:00.000000000 | coldy2 | AUTO_SHUTDOWN | OVERHEATING | IN_SERVICE |
4. Running aggregates
In this example, we use a running total (moving total) to better monitor and understand the state of our devices. In our case, we want to know the total number of times a device reported a state other than IN_SERVICE at any point in time, using 30-minute buckets. We use the following query to do this analysis:
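A sketch of such a query, assuming the state measure in DeviceMetadata and a 365-day time filter:

```sql
WITH binned_time AS (
  SELECT device_id,
         bin(time, 30m) AS time_bucket,
         count(*)       AS event_count_in_the_bucket
  FROM "Blog"."DeviceMetadata"
  WHERE time > ago(365d)
    AND state IS NOT NULL
    AND state != 'IN_SERVICE'
  GROUP BY device_id, bin(time, 30m)
)
SELECT device_id,
       time_bucket,
       event_count_in_the_bucket,
       SUM(event_count_in_the_bucket) OVER (
         PARTITION BY device_id
         ORDER BY time_bucket
         RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS cumulative_event_count
FROM binned_time
ORDER BY device_id, time_bucket
```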
Here we create a common table expression called binned_time and filter the records to our specific state measure where its value doesn’t equal IN_SERVICE. Then we count the number of such events or readings in each 30-minute time bucket using bin(time, 30m) and count(), grouped by device_id and the 30-minute time buckets. With the CTE in place, we use it to calculate the running total of the events by utilizing the SUM() OVER() window function. Window functions require an OVER() clause, including a PARTITION BY and an ORDER BY clause. The PARTITION BY clause resets the calculations for each partition, and the ORDER BY clause determines the order of the records when calculating the results. We use device_id in our PARTITION BY clause because we want the running total for each device, and we order our records by the time bins because we’re interested in the total at every 30-minute point in time. This ordering also allows us to use the RANGE clause to instruct the window function to process (sum) all the preceding rows up to the current row (for every point in time, the running total is the sum of event counts in all previous buckets plus the event count in the current time bucket). The query returns the following results.
device_id | time_bucket | event_count_in_the_bucket | cumulative_event_count |
coldy2 | 2021-01-01 17:00:00.000000000 | 2 | 2 |
coldy2 | 2021-01-01 17:30:00.000000000 | 2 | 4 |
coldy2 | 2021-01-01 18:00:00.000000000 | 2 | 6 |
coldy1 | 2021-01-01 17:00:00.000000000 | 1 | 1 |
coldy1 | 2021-01-01 17:30:00.000000000 | 2 | 3 |
real2 | 2021-01-01 17:30:00.000000000 | 2 | 2 |
real2 | 2021-01-01 18:00:00.000000000 | 1 | 3 |
With running totals that increase at a steady rate, it can be easier to look at the rate of change than at the ever-growing value itself. With the rate of change, you can identify anomalies or sudden increases or decreases in a running aggregate (such as a running total or running average). This is discussed in the next section.
5. Derivatives and rate of change
In this example, we show temperature change in Fahrenheit per 10-minute bucket. The query accounts for missing measures and returns temperature changes per 10-minute interval, even if the sensor was offline.
Let’s look at the data:
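A sketch of the query for device real1:

```sql
SELECT time, device_id, temperature
FROM "Blog"."SensorReadings"
WHERE time > ago(365d)
  AND device_id = 'real1'
  AND temperature IS NOT NULL
ORDER BY time
```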
Notice missing data around 9:20 AM in the following results.
time | device_id | temperature |
2021-01-01 09:01:00.000000000 | real1 | 102 |
2021-01-01 09:13:00.000000000 | real1 | 103 |
2021-01-01 09:31:00.000000000 | real1 | 107 |
2021-01-01 09:44:00.000000000 | real1 | 106 |
2021-01-01 09:58:00.000000000 | real1 | 105 |
To calculate the incremental change between measures, we could use the LAG function as in our N-event correlation query. However, there is a simpler way to calculate the rate of change, thanks to the built-in DERIVATIVE_LINEAR function in Timestream. DERIVATIVE_LINEAR takes two arguments: a time series and an interval. It returns the derivative of each point in the time series for the specified interval. A derivative is a mathematical tool from calculus that calculates the rate of change of a value. Because DERIVATIVE_LINEAR returns a time series, the following SQL returns a row for device real1, and this row contains a time series in the rateOfTempChange column:
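A sketch that first averages the temperatures into 10-minute bins (an assumption consistent with the 10-minute timestamps in the result) and then applies DERIVATIVE_LINEAR:

```sql
WITH binned AS (
  SELECT device_id,
         bin(time, 10m)   AS binned_time,
         avg(temperature) AS avg_temp
  FROM "Blog"."SensorReadings"
  WHERE time > ago(365d)
    AND device_id = 'real1'
    AND temperature IS NOT NULL
    AND temperature < 999
  GROUP BY device_id, bin(time, 10m)
)
SELECT device_id,
       DERIVATIVE_LINEAR(
         CREATE_TIME_SERIES(binned_time, avg_temp), 10m) AS rateOfTempChange
FROM binned
GROUP BY device_id
```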
We get the following result.
device_id | rateOfTempChange |
real1 | [ { time: 2021-01-01 09:10:00.000000000, value: 1.0 }, { time: 2021-01-01 09:30:00.000000000, value: 2.0 }, … ] |
The time series for device real1 can be visualized as the following table.
time | value |
2021-01-01 09:10:00.000000000 | 1.0 |
2021-01-01 09:30:00.000000000 | 2.0 |
2021-01-01 09:40:00.000000000 | -1.0 |
2021-01-01 09:50:00.000000000 | -1.0 |
At 9:30 AM, the rate of change was calculated as 2F, which is correct because the temperature changed from 103F to 107F over 20 minutes, so it changed 2F per 10-minute interval.
If you prefer to return the data in the original form, use the UNNEST function to convert the result back to a flat table:
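A sketch that reuses the previous query as a CTE:

```sql
WITH binned AS (
  SELECT device_id, bin(time, 10m) AS binned_time, avg(temperature) AS avg_temp
  FROM "Blog"."SensorReadings"
  WHERE time > ago(365d)
    AND device_id = 'real1'
    AND temperature IS NOT NULL
    AND temperature < 999
  GROUP BY device_id, bin(time, 10m)
), derivatives AS (
  SELECT device_id,
         DERIVATIVE_LINEAR(
           CREATE_TIME_SERIES(binned_time, avg_temp), 10m) AS rateOfTempChange
  FROM binned
  GROUP BY device_id
)
SELECT device_id, t.time, t.value
FROM derivatives
CROSS JOIN UNNEST(rateOfTempChange) AS t (time, value)
ORDER BY device_id, t.time
```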
We get the following result.
device_id | time | value |
real1 | 2021-01-01 09:10:00.000000000 | 1.0 |
real1 | 2021-01-01 09:30:00.000000000 | 2.0 |
real1 | 2021-01-01 09:40:00.000000000 | -1.0 |
real1 | 2021-01-01 09:50:00.000000000 | -1.0 |
6. Show top N series by a high cardinality dimension
Let’s assume you want to display a graph of the average temperature in cities over time. You could use the following query to show the average temperature in each city based on the temperatures reported by devices, per 10-minute bucket:
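A sketch of such a query (rounding to one decimal place is an assumption based on the results shown):

```sql
SELECT bin(time, 10m)             AS binnedTime,
       round(avg(temperature), 1) AS avg_temp,
       city
FROM "Blog"."SensorReadings"
WHERE time > ago(365d)
  AND temperature IS NOT NULL
GROUP BY city, bin(time, 10m)
ORDER BY city, binnedTime
```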
This query returns the following result.
binnedTime | avg_temp | city |
2021-01-01 09:00:00.000000000 | 70.3 | Basin City |
2021-01-01 09:10:00.000000000 | 69.8 | Basin City |
2021-01-01 09:20:00.000000000 | 67.0 | Basin City |
2021-01-01 09:30:00.000000000 | 69.5 | Basin City |
2021-01-01 09:40:00.000000000 | 69.6 | Basin City |
2021-01-01 09:50:00.000000000 | 68.5 | Basin City |
2021-01-01 10:00:00.000000000 | 69.1 | Basin City |
2021-01-01 09:00:00.000000000 | 22.5 | Coruscant |
2021-01-01 09:10:00.000000000 | 23.5 | Coruscant |
2021-01-01 09:20:00.000000000 | 24.0 | Coruscant |
Our dataset contains three cities, but in a production system, you could have tens of thousands of cities, and the query would then return tens of thousands of data points for each requested timestamp. Showing that many series on a single graph would be challenging and hardly readable for users. What if you’re interested in showing the graph for only the top N hottest cities, sorted by the average temperature for a given time period?
Because we have three cities in the dataset, let’s limit the result to the top two hottest cities. We use the following code to identify the top two hottest cities by average temperature:
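A sketch:

```sql
SELECT city, avg(temperature) AS avg_temp
FROM "Blog"."SensorReadings"
WHERE time > ago(365d)
  AND temperature IS NOT NULL
GROUP BY city
ORDER BY avg_temp DESC
LIMIT 2
```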
We get the following result.
city | avg_temp |
Zion | 194.9 |
Basin City | 69.11428571428571 |
Now that we have the top two hottest cities, we can limit the initial query to get the average temperature only for those two cities by using a subquery:
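A sketch that restricts the cities with an IN subquery:

```sql
SELECT bin(time, 10m)             AS binnedTime,
       round(avg(temperature), 1) AS avg_temp,
       city
FROM "Blog"."SensorReadings"
WHERE time > ago(365d)
  AND temperature IS NOT NULL
  AND city IN (
    -- top two hottest cities by average temperature
    SELECT city
    FROM "Blog"."SensorReadings"
    WHERE time > ago(365d)
      AND temperature IS NOT NULL
    GROUP BY city
    ORDER BY avg(temperature) DESC
    LIMIT 2)
GROUP BY city, bin(time, 10m)
ORDER BY city, binnedTime
```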
As a result, we get the average temperature for the hottest cities, binned into 10-minute intervals. The third city’s temperature isn’t returned.
binnedTime | avg_temp | city |
2021-01-01 09:00:00.000000000 | 70.3 | Basin City |
2021-01-01 09:10:00.000000000 | 69.8 | Basin City |
2021-01-01 09:20:00.000000000 | 67.0 | Basin City |
2021-01-01 09:30:00.000000000 | 69.5 | Basin City |
2021-01-01 09:40:00.000000000 | 69.6 | Basin City |
2021-01-01 09:50:00.000000000 | 68.5 | Basin City |
2021-01-01 10:00:00.000000000 | 69.1 | Basin City |
2021-01-01 09:00:00.000000000 | 103.0 | Zion |
2021-01-01 09:10:00.000000000 | 104.0 | Zion |
2021-01-01 09:20:00.000000000 | 109.0 | Zion |
2021-01-01 09:30:00.000000000 | 108.0 | Zion |
2021-01-01 09:40:00.000000000 | 552.5 | Zion |
2021-01-01 09:50:00.000000000 | 105.0 | Zion |
7. Build histograms
Another common query pattern is bucketing measure or dimension values and analyzing the data points in these buckets. In our use case, we want to understand the distribution of temperature readings for devices in each city and create a histogram. To do so, we count the number of temperature readings in 5-degree buckets (10–15, 15–20, 20–25, and so on) for every city:
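The following sketch assumes WIDTH_BUCKET bounds of 0–200 with 40 buckets (5 degrees each) and excludes the unrealistic 999 readings; the original query may differ:

```sql
WITH buckets AS (
  SELECT city,
         temperature,
         WIDTH_BUCKET(temperature, 0, 200, 40) AS temp_bucket_id   -- 40 buckets of 5 degrees
  FROM "Blog"."SensorReadings"
  WHERE time > ago(365d)
    AND temperature IS NOT NULL
    AND temperature < 999
)
SELECT city,
       temp_bucket_id,
       cast((temp_bucket_id - 1) * 5 AS varchar) || '-' ||
       cast(temp_bucket_id * 5 AS varchar)  AS temp_bucket,   -- text label, e.g. 65-70
       count(*)                             AS temperature_readings,
       round(avg(temperature), 2)           AS avg_temperature,
       approx_percentile(temperature, 0.5)  AS p50_temperature,
       approx_percentile(temperature, 0.9)  AS p90_temperature
FROM buckets
GROUP BY city, temp_bucket_id
ORDER BY city, temp_bucket_id
```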
Here we use the WIDTH_BUCKET() function, which has two variants: WIDTH_BUCKET(x, bound1, bound2, n) and WIDTH_BUCKET(x, bins). The first variant returns the bucket number of x in an equi-width histogram with n buckets within the bounds bound1 and bound2. The second variant, WIDTH_BUCKET(x, bins), returns the bin number of x with the specific bins defined by the array bins. Here we use the first variant because we want a histogram with 5-degree bins.
We first create a CTE called buckets to find the bin number for each temperature reading. Then, in the main section, we query the buckets CTE to find the number of temperature readings, the average temperature, and the approximate 50th and 90th percentiles for each city and bucket, based on the distribution of temperatures in each bucket. We also create a text label for the calculated bucket numbers for easier interpretation.
city | temp_bucket_id | temp_bucket | temperature_readings | avg_temperature | p50_temperature | p90_temperature |
Basin City | 14 | 65-70 | 38 | 66.76 | 67 | 69 |
Basin City | 15 | 70-75 | 31 | 71.81 | 71 | 74 |
Basin City | 16 | 75-80 | 1 | 75.0 | 75 | 75 |
Coruscant | 5 | 20-25 | 14 | 22.93 | 23 | 24 |
Coruscant | 6 | 25-30 | 2 | 25.0 | 25 | 25 |
Zion | 21 | 100-105 | 3 | 103.0 | 103 | 104 |
Zion | 22 | 105-110 | 6 | 106.83 | 107 | 109 |
Looking at the results, we can see how the temperature values are distributed for each city and how the temperature ranges differ between cities (65–80 for Basin City, 20–30 for Coruscant, and 100–110 for Zion).
Clean up
To avoid incurring future charges, delete the Timestream tables (SensorReadings and DeviceMetadata) that you created.
Conclusion
Timestream is a serverless, purpose-built time series database with usage-based pricing, which gives you the opportunity to run performant and cost-optimized queries. In addition, because time series data is treated as a first-class concept in Timestream, the built-in time series functions allow you to reduce the complexity of your queries. In this post, you learned how to effectively construct queries for some of the most common access patterns in time series workloads in Timestream. We walked you through the queries and explained how each part of a query behaves and processes the data. We also introduced some of the built-in time series functions available in Timestream, which help make your queries more performant, cost-effective, and simpler to maintain.
Let us know in the comments what other query patterns you have in your workload!
About the Authors
Saleh Ghasemi is a Database Specialist Solutions Architect at AWS. Prior to joining AWS, he founded and led an AWS consulting partner firm helping customers develop, migrate and modernize database and analytics solutions.
Piotr Westfalewicz is a NoSQL data architect at AWS. He is a passionate software and architecture craftsman, striving to always build the right solution for customers. Before working as a data architect, he was a software development engineer at both Amazon and AWS.