AWS Database Blog

Effective queries for common query patterns in Amazon Timestream

You can use SQL to query time series data in Amazon Timestream, and with SQL you can rewrite a query in multiple ways to get the same results. Although several queries may produce the same result, their performance, cost profile, and complexity can differ, so it’s important to identify the right query. Because Timestream is a serverless, purpose-built time series database with usage-based pricing, well-written queries can be both performant and cost optimized. Furthermore, because Timestream treats time series data as a first-class concept, its built-in time series functions let you reduce the complexity of your queries.

In this post, we review some of the most common query patterns in time series workloads and how to use SQL in Timestream to write effective queries for those patterns. We first review the sample IoT dataset that we use to demonstrate the access patterns, and then compare the flat and time series views of data in Timestream. Understanding the flat vs. time series data model makes it easier to follow the built-in time series functions that we use throughout this post to reduce query complexity. Finally, we review seven common access patterns that we see among customers; for each pattern, we discuss the business question and the corresponding query that answers it. Each query pattern is independent, so you can read only the ones you’re interested in.

Sample data

We run all the queries presented in this post on a small, crafted dataset, so you can understand how each query works. Although we show how the queries work on a couple of records, we have tested those queries at scale, on a dataset thousands of times bigger. The dataset is based on an IoT workload and includes different scenarios—like missing or invalid measures—that you might see in a real production system.

The script in the following section generates the dataset and inserts it into Timestream. The dataset starts at <current_year>-01-01 09:00:00 (the sample output in this post shows 2021) and contains data from 14 devices. Each device has multiple sensors that create readings approximately every 10 minutes, until around 10:00 AM. Each reading consists of multiple measures, split between two tables: SensorReadings and DeviceMetadata. Measures in the SensorReadings table include temperature, humidity, and air_quality_no2. Measures in DeviceMetadata include battery_level, state, and uc_temperature. All measures are modeled as a single Timestream row with the MULTI MeasureValueType in their respective tables.

We include a time filter in the WHERE clause of every query. This predicate, along with any other predicates on measure name and dimensions, enables Timestream to prune the data and limit what is scanned to calculate the results of a query. This improves query performance and reduces the cost of a query. For more query best practices in Timestream, see Queries.

Prerequisites

Before you begin, you must have an AWS account with read and write privileges to Amazon Timestream, and access to AWS CloudShell.

Ingest sample data to Timestream

Complete the following steps to generate the sample data used in this post:

  1. Create a database in Timestream named Blog with two tables, SensorReadings and DeviceMetadata, each with 1-year memory retention (a CLI sketch follows these steps).
  2. In CloudShell, run the following Python script with the command python3 script_name.py.
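
For step 1, you can create the database and tables with the AWS CLI instead of the console. The following is a minimal sketch; the retention values are assumptions, with 8766 hours approximating 1 year of memory store retention and an arbitrarily long magnetic store retention:

aws timestream-write create-database --database-name Blog
aws timestream-write create-table --database-name Blog --table-name SensorReadings \
	--retention-properties "MemoryStoreRetentionPeriodInHours=8766,MagneticStoreRetentionPeriodInDays=73000"
aws timestream-write create-table --database-name Blog --table-name DeviceMetadata \
	--retention-properties "MemoryStoreRetentionPeriodInHours=8766,MagneticStoreRetentionPeriodInDays=73000"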

Flat model vs. time series view

Timestream supports two data models for queries: a flat model and a time series model.

The flat model is a tabular format and is the default format for queries. In the flat model, each measurement is represented as an individual row, with the corresponding time, measure name, measure value(s), and dimensions as columns. We use the flat model with standard SQL operations (including standard SQL functions, operators, aggregations, and filtering) when we’re not utilizing any of the built-in time series functions. In the sample data in the SensorReadings table, we modeled temperature, humidity, air_quality_no2, and request_number as a single MULTI measure in Timestream (with the dimensions device_id and city), so you can access all those fields in a single row of a SQL statement. For example, you can use the following SELECT statement to get a list of temperature and humidity readings with their corresponding timestamps and request numbers:

SELECT time, device_id, city, temperature, humidity, request_number
FROM "Blog"."SensorReadings"
WHERE time BETWEEN ago(365d) AND now()
	AND city = 'Zion'
	AND device_id IN ('real1', 'real2')
	AND measure_name = 'multi'
	AND temperature < 999
ORDER BY device_id
	,time ASC

This returns the data represented in the flat model.

time device_id city temperature humidity request_number
2021-01-01 09:01:00.000000000 real1 Zion 102 13.2 30493
2021-01-01 09:13:00.000000000 real1 Zion 103 13.1 30494
2021-01-01 09:31:00.000000000 real1 Zion 107 13.1 30496
2021-01-01 09:44:00.000000000 real1 Zion 106 13.0 30497
2021-01-01 09:58:00.000000000 real1 Zion 105 12.9 30499
2021-01-01 09:00:00.000000000 real2 Zion 104 14.2 43231
2021-01-01 09:12:00.000000000 real2 Zion 105 14.1 43232
2021-01-01 09:22:00.000000000 real2 Zion 109 14.1 43232
2021-01-01 09:33:00.000000000 real2 Zion 109 14.1 43234

The time series model is a query-time construct that represents the data as an ordered sequence of time and measure value pairs for a set of dimensions. This format is required by the built-in time series functions in Timestream, such as interpolation, derivatives, integrals, and correlations, which we can use for filling gaps left by missing values (interpolation), finding the rate of change (derivatives), measuring the correlation between two time series, and other use cases common in time series workloads.

To transform data from a flat model to a time series model, Timestream includes the built-in function CREATE_TIME_SERIES(time, value), which takes two input columns, time and value, and returns a time series data type as a sequence of timestamps and corresponding values. We can then use this data type as an input to time series functions in Timestream. The following query illustrates using the CREATE_TIME_SERIES function:

SELECT device_id
	,CREATE_TIME_SERIES(time, temperature) AS temp_series
	,CREATE_TIME_SERIES(time, humidity) AS humidity_series
FROM "Blog"."SensorReadings"
WHERE time BETWEEN ago(365d) AND now() 
	AND city = 'Zion'
	AND device_id IN ('real1', 'real2')
	AND measure_name = 'multi'
	AND temperature < 999
GROUP BY device_id
ORDER BY device_id

The query returns the following result.

device_id temp_series humidity_series
real1 [
{ time: 2021-01-01 09:01:00.000000000, value: 102 },
{ time: 2021-01-01 09:13:00.000000000, value: 103 },
{ time: 2021-01-01 09:31:00.000000000, value: 107 },
{ time: 2021-01-01 09:44:00.000000000, value: 106 },
{ time: 2021-01-01 09:58:00.000000000, value: 105 }
]
[
{ time: 2021-01-01 09:01:00.000000000, value: 13.2 },
{ time: 2021-01-01 09:13:00.000000000, value: 13.1 },
{ time: 2021-01-01 09:31:00.000000000, value: 13.1 },
{ time: 2021-01-01 09:44:00.000000000, value: 13.0 },
{ time: 2021-01-01 09:58:00.000000000, value: 12.9 }
]
real2 [
{ time: 2021-01-01 09:00:00.000000000, value: 104 },
{ time: 2021-01-01 09:12:00.000000000, value: 105 },
{ time: 2021-01-01 09:22:00.000000000, value: 109 },
{ time: 2021-01-01 09:33:00.000000000, value: 109 }
]
[
{ time: 2021-01-01 09:00:00.000000000, value: 14.2 },
{ time: 2021-01-01 09:12:00.000000000, value: 14.1 },
{ time: 2021-01-01 09:22:00.000000000, value: 14.1 },
{ time: 2021-01-01 09:33:00.000000000, value: 14.1 }
]

UNNEST

In some cases, you might need to transform the data from the time series model back to the flat model in order to join it with other tables or datasets, or to perform aggregations using standard SQL aggregate or window functions. UNNEST(timeseries) does the reverse of CREATE_TIME_SERIES: it takes an input of the time series data type and produces a table with two columns, one for time and one for measure values. UNNEST supports aliasing the produced table, time column, and value column when used as UNNEST(timeseries) AS <alias_name> (time_alias, value_alias):

WITH device_temp_series
AS (
	SELECT CREATE_TIME_SERIES(time, temperature) AS temp_series
		,device_id
	FROM "Blog"."SensorReadings"
	WHERE time BETWEEN ago(365d) AND now() 
		AND city = 'Zion'
		AND device_id IN ('real1', 'real2')
		AND measure_name = 'multi'
		AND temperature < 999
	GROUP BY device_id
	ORDER BY device_id
	)
SELECT device_id
	,ts.time
	,ts.value
FROM device_temp_series
CROSS JOIN UNNEST(temp_series) AS ts
ORDER BY device_id
	,ts.time ASC

This returns the following result.

device_id time value
real1 2021-01-01 09:01:00.000000000 102
real1 2021-01-01 09:13:00.000000000 103
real1 2021-01-01 09:31:00.000000000 107
real1 2021-01-01 09:44:00.000000000 106
real1 2021-01-01 09:58:00.000000000 105
real2 2021-01-01 09:00:00.000000000 104
real2 2021-01-01 09:12:00.000000000 105
real2 2021-01-01 09:22:00.000000000 109
real2 2021-01-01 09:33:00.000000000 109
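
The preceding query keeps the default column names, time and value. To rename them as well, you can use the full alias form described earlier. The following is a minimal sketch of the final SELECT, with hypothetical aliases measure_time and temp:

SELECT device_id
	,ts.measure_time
	,ts.temp
FROM device_temp_series
CROSS JOIN UNNEST(temp_series) AS ts (measure_time, temp)
ORDER BY device_id
	,ts.measure_time ASC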

See the Interpolation section for a practical example of the CREATE_TIME_SERIES and UNNEST functions.

Common query patterns

This section contains seven access patterns. For each pattern, we discuss the business question and walk through the steps to build the query that answers it. Each query pattern is independent, so you can read only the patterns that interest you:

  1. Last values queries (asset tracking, latest location, latest sensor reading)
  2. Interpolation (filling in missing data points)
  3. N-event correlation (looking for patterns in events)
  4. Running aggregates
  5. Derivatives and rate of change
  6. Show top N series by a high cardinality dimension
  7. Build histograms

1. Last values queries

A common query pattern in time series workloads is finding the latest record or the latest value of a measure, such as the latest reading from a sensor, the latest state, or the latest location of an asset. In our example, we’re interested in finding the devices that are currently not in service. In our dataset, the state of each device is stored in a measure called state (within the multi-measure record named multi), so our query refers to this column. For this query pattern, we compare two options: one using a common table expression (CTE), and one without a CTE that utilizes the max_by() function available in Timestream. First, let’s look at getting the latest value using a CTE:

WITH latest_ts
AS (
	SELECT device_id
		,max(time) AS latest_timestamp
	FROM "Blog"."DeviceMetadata"
	WHERE measure_name = 'multi'
		AND time > ago(365d)
	GROUP BY device_id
		,measure_name
	)
SELECT devmd.device_id, lts.latest_timestamp, devmd.state
FROM "Blog"."DeviceMetadata" devmd
JOIN latest_ts AS lts ON lts.device_id = devmd.device_id
	AND lts.latest_timestamp = devmd.time
WHERE devmd.measure_name = 'multi'
	AND devmd.state <> 'IN_SERVICE'
	AND devmd.time > ago(365d)

In the preceding query, we first find the latest event timestamp for the state of each device in the past 365 days using the CTE latest_ts. We have 14 devices and we expect all of them to have reported their latest state in the past 365 days, so the CTE should return 14 rows, one for each device, as shown in the following table.

device_id latest_timestamp
smarty9 2021-01-01 10:00:00.000000000
coldy2 2021-01-01 10:20:00.000000000
smarty7 2021-01-01 10:00:00.000000000
coldy1 2021-01-01 10:00:00.000000000
smarty2 2021-01-01 10:00:00.000000000
smarty0 2021-01-01 10:00:00.000000000
smarty6 2021-01-01 10:00:00.000000000
smarty4 2021-01-01 10:00:00.000000000
smarty8 2021-01-01 10:00:00.000000000
smarty5 2021-01-01 10:00:00.000000000
real2 2021-01-01 10:00:00.000000000
smarty3 2021-01-01 10:00:00.000000000
smarty1 2021-01-01 10:00:00.000000000
real1 2021-01-01 09:58:00.000000000

Now that we have the latest timestamp for the state of each device, we can query the main table to see the actual state of each device by joining the CTE latest_ts to the main table on device_id and the resulting timestamps. Furthermore, we filter the results to only the devices that didn’t report IN_SERVICE as their latest state. In our dataset, this leaves two devices. If we simply needed the latest state of every device, we could remove the state predicate from the WHERE clause and expect 14 rows (one for each device) to be returned.

Running the preceding query returns the following result.

device_id latest_timestamp state
coldy2 2021-01-01 10:20:00.000000000 AUTO_SHUTDOWN
real2 2021-01-01 10:00:00.000000000 REPLACE_BATTERY

If we look at the output tab on the Timestream console, we see this query scanned 8.17 KB. Now let’s look at the alternative way of finding the latest value, without a CTE, utilizing the max_by() function:

SELECT device_id
	,max_by(time, time) AS latest_timestamp
	,max_by(state, time) AS state
FROM "Blog"."DeviceMetadata"
WHERE time BETWEEN ago(365d) AND now()
	AND measure_name = 'multi'
GROUP BY device_id
HAVING max_by(state, time) <> 'IN_SERVICE'

The preceding query produces the same end result as the previous query using a CTE. Here we utilize the max_by(x, y) function, which returns the value of x associated with the maximum value of y; in our case, the state (and the timestamp itself) associated with the maximum value of time. Furthermore, we filter the results using the SQL HAVING clause, because we’re only interested in devices whose last reported state is not IN_SERVICE.

When you compare the two queries, you can see that the second query is much simpler. In addition, if you look at the number of bytes scanned, you notice that it’s smaller than for the first query. This means the second query can be cheaper than the first in cases where the bytes scanned exceed 10 MB (in Timestream, the minimum metered bytes for a query is 10 MB).

When you use the max_by() function, you can also find the latest N values of a measure. For instance, with max_by(temperature, time, 5), we can find the latest five temperature measures.
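
For example, the following sketch returns, for each device, an array of its five most recent temperature readings (latest first) from our sample table:

SELECT device_id
	,max_by(temperature, time, 5) AS latest_five_temps
FROM "Blog"."SensorReadings"
WHERE time BETWEEN ago(365d) AND now()
	AND measure_name = 'multi'
	AND temperature IS NOT NULL
GROUP BY device_id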

2. Interpolation

In IoT use cases, it’s common to see data from devices that is missing, arrives at varying intervals, or contains unrealistic values. Let’s look at the temperature sensor readings from the real1 and real2 devices:

SELECT time
	,device_id
	,temperature AS temp
FROM "Blog"."SensorReadings"
WHERE time BETWEEN ago(365d) AND now()
	AND city = 'Zion'
	AND device_id IN ('real1', 'real2')
	AND measure_name = 'multi'
	AND temperature IS NOT NULL
ORDER BY device_id
	,time ASC

We can observe a missing temperature measure around 9:20 AM from sensor real1, measure timestamps that are unevenly spaced, and an unrealistic measure at 9:44 AM from sensor real2.

time device_id temp
2021-01-01 09:01:00.000000000 real1 102
2021-01-01 09:13:00.000000000 real1 103
2021-01-01 09:31:00.000000000 real1 107
2021-01-01 09:44:00.000000000 real1 106
2021-01-01 09:58:00.000000000 real1 105
2021-01-01 09:00:00.000000000 real2 104
2021-01-01 09:12:00.000000000 real2 105
2021-01-01 09:22:00.000000000 real2 109
2021-01-01 09:33:00.000000000 real2 109
2021-01-01 09:44:00.000000000 real2 999

In this case, we can filter out unrealistic data by adding a temperature < 999 condition:

SELECT time
	,device_id
	,temperature AS temp
FROM "Blog"."SensorReadings"
WHERE time BETWEEN ago(365d) AND now()
	AND city = 'Zion'
	AND device_id IN ('real1', 'real2')
	AND measure_name = 'multi'
	AND temperature IS NOT NULL
	AND temperature < 999
ORDER BY device_id
	,time ASC

However, how can we fill in the missing data point and align the measures to 10-minute intervals, so we can create a nice-looking graph with points every 10 minutes? We can do this using the INTERPOLATE_LINEAR function, which fills in missing data using linear interpolation:

SELECT device_id
	,INTERPOLATE_LINEAR(
		CREATE_TIME_SERIES(time, temperature), 
		SEQUENCE(min(time), max(time), 10m)
	) AS interpolated_temp
FROM "Blog"."SensorReadings"
WHERE time BETWEEN ago(365d) AND now()
	AND city = 'Zion'
	AND device_id IN ('real1', 'real2')
	AND measure_name = 'multi'
	AND temperature IS NOT NULL
	AND temperature < 999
GROUP BY device_id
ORDER BY device_id

The INTERPOLATE_LINEAR function works on a time series (first argument) and an array of timestamps (second argument). The function fills in missing data in the time series and aligns the data points to the timestamps specified in the array. Because INTERPOLATE_LINEAR returns a time series object, the result of the preceding query contains multiple rows, where each row has a device_id and a time series.

device_id interpolated_temp
real1 [
{ time: 2021-01-01 09:01:00.000000000, value: 102 },
{ time: 2021-01-01 09:11:00.000000000, value: 102 },

]
real2 [
{ time: 2021-01-01 09:00:00.000000000, value: 104 },
{ time: 2021-01-01 09:10:00.000000000, value: 104 },

]

The time series for device real2 can be visualized as follows.

time value
2021-01-01 09:00:00.000000000 104
2021-01-01 09:10:00.000000000 104
2021-01-01 09:20:00.000000000 108
2021-01-01 09:30:00.000000000 109
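
Linear interpolation isn’t the only strategy available. Timestream also provides INTERPOLATE_LOCF (last observation carried forward), INTERPOLATE_FILL (which fills gaps with a constant you supply), and INTERPOLATE_SPLINE_CUBIC, which take the same time series and timestamp array arguments. As a minimal sketch, the following carries the last observed temperature forward instead of interpolating linearly:

SELECT device_id
	,INTERPOLATE_LOCF(
		CREATE_TIME_SERIES(time, temperature), 
		SEQUENCE(min(time), max(time), 10m)
	) AS locf_temp
FROM "Blog"."SensorReadings"
WHERE time BETWEEN ago(365d) AND now()
	AND city = 'Zion'
	AND device_id IN ('real1', 'real2')
	AND measure_name = 'multi'
	AND temperature IS NOT NULL
	AND temperature < 999
GROUP BY device_id
ORDER BY device_id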

Note that the SEQUENCE function has a limit on the size of the output array. To make sure you don’t cross that limit, you can compute the interpolation on aggregated (in this case, averaged) temperatures per time bucket as follows:

WITH data
AS (
	SELECT BIN(time, 15s) as t
		,device_id
		,avg(temperature) as avg_temp
	FROM "Blog"."SensorReadings"
	WHERE time BETWEEN ago(365d) AND now()
		AND city = 'Zion'
		AND device_id IN ('real1', 'real2')
		AND measure_name = 'multi'
		AND temperature IS NOT NULL
		AND temperature < 999
	GROUP BY BIN(time, 15s), device_id
	)
SELECT device_id
	,INTERPOLATE_LINEAR(
		CREATE_TIME_SERIES(t, avg_temp), 
		SEQUENCE(min(t), max(t), 10m)
	) AS interpolated_temp
FROM data
GROUP BY device_id
ORDER BY device_id

If you prefer to return the data in the original form, use the UNNEST function to convert the result back to a flat table and round the interpolated values:

WITH data
AS (
	SELECT BIN(time, 15s) as t
		,device_id
		,avg(temperature) as avg_temp
	FROM "Blog"."SensorReadings"
	WHERE time BETWEEN ago(365d) AND now()
		AND city = 'Zion'
		AND device_id IN ('real1', 'real2')
		AND measure_name = 'multi'
		AND temperature IS NOT NULL
		AND temperature < 999
	GROUP BY BIN(time, 15s), device_id
	)
	,interpolated_data
AS (
	SELECT device_id
		,INTERPOLATE_LINEAR(
			CREATE_TIME_SERIES(t, avg_temp), 
			SEQUENCE(min(t), max(t), 10m)
		) AS interpolated_temp
	FROM data
	GROUP BY device_id
	)
SELECT device_id
	,time
	,round(value, 1) as interpolated_rounded_temp
FROM interpolated_data
CROSS JOIN UNNEST(interpolated_temp)

The result contains the previously missing data point, and all data points are evenly spaced 10 minutes apart.

device_id time interpolated_rounded_temp
real1 2021-01-01 09:01:00.000000000 102.0
real1 2021-01-01 09:11:00.000000000 102.8
real1 2021-01-01 09:21:00.000000000 104.8
real1 2021-01-01 09:31:00.000000000 107.0
real1 2021-01-01 09:41:00.000000000 106.2
real1 2021-01-01 09:51:00.000000000 105.5
real2 2021-01-01 09:00:00.000000000 104.0
real2 2021-01-01 09:10:00.000000000 104.8
real2 2021-01-01 09:20:00.000000000 108.2
real2 2021-01-01 09:30:00.000000000 109.0

3. N-event correlation

In this example, we measure how long a device takes to cool off after overheating. The cool-off time is defined as the time to transition from an OVERHEATING state to an IN_SERVICE state. Let’s focus on the coldy2 sensor:

SELECT time
	,device_id
	,state
FROM "Blog"."DeviceMetadata"
WHERE time BETWEEN ago(365d) AND now()
	AND device_id = 'coldy2'
	AND measure_name = 'multi'
ORDER BY time ASC

We get the following result.

time device_id state
2021-01-01 09:00:00.000000000 coldy2 OVERHEATING
2021-01-01 09:10:00.000000000 coldy2 IN_SERVICE
2021-01-01 09:20:00.000000000 coldy2 OVERHEATING
2021-01-01 09:30:00.000000000 coldy2 IN_SERVICE
2021-01-01 09:40:00.000000000 coldy2 OVERHEATING
2021-01-01 09:50:00.000000000 coldy2 OVERHEATING
2021-01-01 10:00:00.000000000 coldy2 IN_SERVICE
2021-01-01 10:10:00.000000000 coldy2 OVERHEATING
2021-01-01 10:20:00.000000000 coldy2 AUTO_SHUTDOWN

Notice that at 9:40 and 9:50, the device was overheating for more than 10 minutes and therefore reported the OVERHEATING state twice.

Because we’re interested in state changes, let’s first remove the consecutive duplicate states. For this purpose, we use the LAG window function, which returns the value from the row before the current row in the window:

SELECT BIN(time, 10m) AS binnedTime
	,device_id
	,MIN(state) AS state
	,LAG(MIN(state), 1) OVER (
		PARTITION BY device_id ORDER BY BIN(time, 10m)
		) AS prevState
FROM "Blog"."DeviceMetadata"
WHERE time BETWEEN ago(365d) AND now()
	AND device_id = 'coldy2'
	AND measure_name = 'multi'
GROUP BY device_id
	,BIN(time, 10m)

We get the following result.

binnedTime device_id state prevState
2021-01-01 09:00:00.000000000 coldy2 OVERHEATING
2021-01-01 09:10:00.000000000 coldy2 IN_SERVICE OVERHEATING
2021-01-01 09:20:00.000000000 coldy2 OVERHEATING IN_SERVICE
2021-01-01 09:30:00.000000000 coldy2 IN_SERVICE OVERHEATING
2021-01-01 09:40:00.000000000 coldy2 OVERHEATING IN_SERVICE
2021-01-01 09:50:00.000000000 coldy2 OVERHEATING OVERHEATING
2021-01-01 10:00:00.000000000 coldy2 IN_SERVICE OVERHEATING
2021-01-01 10:10:00.000000000 coldy2 OVERHEATING IN_SERVICE
2021-01-01 10:20:00.000000000 coldy2 AUTO_SHUTDOWN OVERHEATING

Now that we have the previous state alongside each state reported by the devices, we can remove duplicates by adding predicates on the state and prevState columns:

WITH binnedData
AS (
	SELECT BIN(time, 10m) AS binnedTime
		,device_id
		,MIN(state) AS state
		,LAG(MIN(state), 1) OVER (
			PARTITION BY device_id ORDER BY BIN(time, 10m)
			) AS prevState
	FROM "Blog"."DeviceMetadata"
	WHERE time BETWEEN ago(365d) AND now()
		AND device_id = 'coldy2'
		AND measure_name = 'multi'
	GROUP BY device_id
		,BIN(time, 10m)
)
SELECT *
FROM binnedData
WHERE (
		state != prevState
		OR prevState IS NULL
		)
ORDER BY binnedTime

We get the following result.

binnedTime device_id state prevState
2021-01-01 09:00:00.000000000 coldy2 OVERHEATING
2021-01-01 09:10:00.000000000 coldy2 IN_SERVICE OVERHEATING
2021-01-01 09:20:00.000000000 coldy2 OVERHEATING IN_SERVICE
2021-01-01 09:30:00.000000000 coldy2 IN_SERVICE OVERHEATING
2021-01-01 09:40:00.000000000 coldy2 OVERHEATING IN_SERVICE
2021-01-01 10:00:00.000000000 coldy2 IN_SERVICE OVERHEATING
2021-01-01 10:10:00.000000000 coldy2 OVERHEATING IN_SERVICE
2021-01-01 10:20:00.000000000 coldy2 AUTO_SHUTDOWN OVERHEATING

To calculate how long each transition lasted, we need the time of each previous state. We use the LAG function again, this time on the binnedTime column:

WITH binnedData
AS (
	SELECT BIN(time, 10m) AS binnedTime
		,device_id
		,MIN(state) AS state
		,LAG(MIN(state), 1) OVER (
			PARTITION BY device_id ORDER BY BIN(time, 10m)
			) AS prevState
	FROM "Blog"."DeviceMetadata"
	WHERE time BETWEEN ago(365d) AND now()
		AND device_id = 'coldy2'
		AND measure_name = 'multi'
	GROUP BY device_id
		,BIN(time, 10m)
	)
	,cleaned_data
AS (
	SELECT binnedTime
		,device_id
		,state
		,prevState
	FROM binnedData
	WHERE (
			state != prevState
			OR prevState IS NULL
			)
	ORDER BY binnedTime
	)
SELECT binnedTime
	,device_id
	,state
	,prevState
	,LAG(binnedTime, 1) OVER (
		PARTITION BY device_id ORDER BY binnedTime
		) AS prevBinnedTime
FROM cleaned_data

This gives us the previous state and previous state time in the same row.

binnedTime device_id state prevState prevBinnedTime
2021-01-01 09:00:00.000000000 coldy2 OVERHEATING
2021-01-01 09:10:00.000000000 coldy2 IN_SERVICE OVERHEATING 2021-01-01 09:00:00.000000000
2021-01-01 09:20:00.000000000 coldy2 OVERHEATING IN_SERVICE 2021-01-01 09:10:00.000000000
2021-01-01 09:30:00.000000000 coldy2 IN_SERVICE OVERHEATING 2021-01-01 09:20:00.000000000
2021-01-01 09:40:00.000000000 coldy2 OVERHEATING IN_SERVICE 2021-01-01 09:30:00.000000000
2021-01-01 10:00:00.000000000 coldy2 IN_SERVICE OVERHEATING 2021-01-01 09:40:00.000000000
2021-01-01 10:10:00.000000000 coldy2 OVERHEATING IN_SERVICE 2021-01-01 10:00:00.000000000
2021-01-01 10:20:00.000000000 coldy2 AUTO_SHUTDOWN OVERHEATING 2021-01-01 10:10:00.000000000

With this data, we can calculate how long the device needed to cool off (moving from an OVERHEATING state to IN_SERVICE):

WITH binnedData
AS (
	SELECT BIN(time, 10m) AS binnedTime
		,device_id
		,MIN(state) AS state
		,LAG(MIN(state), 1) OVER (
			PARTITION BY device_id ORDER BY BIN(time, 10m)
			) AS prevState
	FROM "Blog"."DeviceMetadata"
	WHERE time BETWEEN ago(365d) AND now()
		AND device_id = 'coldy2'
		AND measure_name = 'multi'
	GROUP BY device_id
		,BIN(time, 10m)
	)
	,cleaned_data
AS (
	SELECT binnedTime
		,device_id
		,state
		,prevState
		,
	--window functions execute after WHERE, so LAG here sees only the deduplicated rows
		LAG(binnedTime, 1) OVER (
			PARTITION BY device_id ORDER BY binnedTime
			) AS prevBinnedTime
	FROM binnedData
	WHERE (
			state != prevState
			OR prevState IS NULL
			)
	ORDER BY binnedTime
	)
SELECT binnedTime
	,binnedTime - prevBinnedTime AS recovery_time
	,device_id
FROM cleaned_data
WHERE state = 'IN_SERVICE'
	AND prevState = 'OVERHEATING'

We get the following result.

binnedTime recovery_time device_id
2021-01-01 09:10:00.000000000 0 00:10:00.000000000 coldy2
2021-01-01 09:30:00.000000000 0 00:10:00.000000000 coldy2
2021-01-01 10:00:00.000000000 0 00:20:00.000000000 coldy2

Moreover, we can use the same principle to match event patterns. For example, the following query finds a pattern of moving from IN_SERVICE to OVERHEATING to AUTO_SHUTDOWN:

WITH binnedData
AS (
	SELECT BIN(time, 10m) AS binnedTime
		,device_id
		,MIN(state) AS state
		,LAG(MIN(state), 1) OVER (
			PARTITION BY device_id ORDER BY BIN(time, 10m)
			) AS prevState
	FROM "Blog"."DeviceMetadata"
	WHERE time BETWEEN ago(365d) AND now()
		AND device_id = 'coldy2'
		AND measure_name = 'multi'
	GROUP BY device_id
		,BIN(time, 10m)
	)
	,cleaned_data
AS (
	SELECT binnedTime
		,device_id
		,state
		,prevState
		,
	--window functions execute after WHERE, so LAG here sees only the deduplicated rows
		LAG(binnedTime, 1) OVER (
			PARTITION BY device_id ORDER BY binnedTime
			) AS prevBinnedTime
		,LAG(state, 2) OVER (
			PARTITION BY device_id ORDER BY binnedTime
			) AS prevPrevState
	FROM binnedData
	WHERE (
			state != prevState
			OR prevState IS NULL
			)
	ORDER BY binnedTime
	)
SELECT binnedTime
	,device_id
	,state
	,prevState
	,prevPrevState
FROM cleaned_data
WHERE prevPrevState = 'IN_SERVICE'
	AND prevState = 'OVERHEATING'
	AND state = 'AUTO_SHUTDOWN'

We get the following result.

binnedTime device_id state prevState prevPrevState
2021-01-01 10:20:00.000000000 coldy2 AUTO_SHUTDOWN OVERHEATING IN_SERVICE

4. Running aggregates

In this example, we use a running total (moving total) to better monitor and understand the state of our devices. In our case, we want to know, at any point in time and in 30-minute buckets, the total number of times a device reported a state other than IN_SERVICE. We use the following query to do this analysis:

WITH binned_time
AS (
	SELECT device_id
		,bin(time, 30m) AS time_bucket
		,count(*) AS event_count_in_the_bucket
	FROM "Blog"."DeviceMetadata"
	WHERE time BETWEEN ago(365d) AND now()
		AND measure_name = 'multi'
		AND state IS NOT NULL
		AND state <> 'IN_SERVICE'
	GROUP BY device_id
		,bin(time, 30m)
	)
SELECT device_id
	,time_bucket
	,event_count_in_the_bucket
	,sum(event_count_in_the_bucket) OVER (
		PARTITION BY device_id ORDER BY time_bucket range BETWEEN unbounded preceding AND CURRENT row
		) AS cumulative_event_count
FROM binned_time

Here we create a CTE called binned_time and filter the records to readings where the state measure doesn’t equal IN_SERVICE. Then we count the number of such events in each 30-minute time bucket using bin(time, 30m) and count(*), grouped by device_id and the 30-minute time buckets.

With the CTE in place, we calculate the running total of the events by utilizing the SUM() OVER() window function. The OVER() clause here includes a PARTITION BY and an ORDER BY clause. The PARTITION BY clause resets the calculation for each partition, and the ORDER BY clause sets the ordering of the records when calculating the results. We use device_id in our PARTITION BY clause because we want the running total for each device, and we order the records by the time bins because we’re interested in the total at every 30-minute point in time. This ordering also allows us to use the RANGE clause to instruct the window function to sum all the preceding rows up to and including the current row: for every point in time, the running total is the sum of the event counts in all previous buckets plus the event count in the current bucket. The query returns the following results.

device_id time_bucket event_count_in_the_bucket cumulative_event_count
coldy2 2021-01-01 17:00:00.000000000 2 2
coldy2 2021-01-01 17:30:00.000000000 2 4
coldy2 2021-01-01 18:00:00.000000000 2 6
coldy1 2021-01-01 17:00:00.000000000 1 1
coldy1 2021-01-01 17:30:00.000000000 2 3
real2 2021-01-01 17:30:00.000000000 2 2
real2 2021-01-01 18:00:00.000000000 1 3
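
A small variation on the same window function turns the running total into a moving aggregate. The following sketch reuses the binned_time CTE from the preceding query and computes a three-bucket moving average of the event counts by switching the frame to ROWS, which counts rows instead of a time range:

WITH binned_time
AS (
	SELECT device_id
		,bin(time, 30m) AS time_bucket
		,count(*) AS event_count_in_the_bucket
	FROM "Blog"."DeviceMetadata"
	WHERE time BETWEEN ago(365d) AND now()
		AND measure_name = 'multi'
		AND state IS NOT NULL
		AND state <> 'IN_SERVICE'
	GROUP BY device_id
		,bin(time, 30m)
	)
SELECT device_id
	,time_bucket
	,event_count_in_the_bucket
	,avg(event_count_in_the_bucket) OVER (
		PARTITION BY device_id ORDER BY time_bucket rows BETWEEN 2 preceding AND CURRENT row
		) AS moving_avg_event_count
FROM binned_time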

With a running total that increases at a steady rate, changes can be hard to spot; it’s often easier to look at the rate of change instead. With the rate of change, you can identify anomalies or sudden increases and decreases in a running aggregate (such as a running total or running average). This is discussed in the next section.

5. Derivatives and rate of change

In this example, we show temperature change in Fahrenheit per 10-minute bucket. The query accounts for missing measures and returns temperature changes per 10-minute interval, even if the sensor was offline.

Let’s look at the data:

SELECT time
	,device_id
	,temperature
FROM "Blog"."SensorReadings"
WHERE time BETWEEN ago(365d) AND now()
	AND device_id = 'real1'
	AND measure_name = 'multi'
	AND temperature IS NOT NULL
ORDER BY time ASC

Notice missing data around 9:20 AM in the following results.

time device_id temperature
2021-01-01 09:01:00.000000000 real1 102
2021-01-01 09:13:00.000000000 real1 103
2021-01-01 09:31:00.000000000 real1 107
2021-01-01 09:44:00.000000000 real1 106
2021-01-01 09:58:00.000000000 real1 105

To calculate the incremental change between measures, we could use the LAG function as in our N-event correlation query. However, there is a simpler way to calculate the rate of change, thanks to the built-in DERIVATIVE_LINEAR function in Timestream.

DERIVATIVE_LINEAR takes two arguments, a time series and an interval, and returns the derivative of each point in the time series over the specified interval. A derivative is the calculus tool for measuring the rate of change of a value. Because DERIVATIVE_LINEAR returns a time series, the following SQL returns one row for device real1, and this row contains a time series in the rateOfTempChange column:

WITH binned_view
AS (
	SELECT BIN(time, 10m) AS binnedTime
		,MIN(temperature) AS minTempPerBucket
		,device_id
	FROM "Blog"."SensorReadings"
	WHERE time BETWEEN ago(365d) AND now()
		AND device_id = 'real1'
		AND measure_name = 'multi'
		AND temperature IS NOT NULL
	GROUP BY device_id
		,BIN(time, 10m)
	)
SELECT device_id
	,derivative_linear(CREATE_TIME_SERIES(binnedTime, minTempPerBucket), 10m) AS rateOfTempChange
FROM binned_view
GROUP BY device_id

We get the following result.

device_id rateOfTempChange
real1 [
{ time: 2021-01-01 09:10:00.000000000, value: 1.0 },
{ time: 2021-01-01 09:30:00.000000000, value: 2.0 },

]

The time series for device real1 can be visualized as the following table.

time value
2021-01-01 09:10:00.000000000 1.0
2021-01-01 09:30:00.000000000 2.0
2021-01-01 09:40:00.000000000 -1.0
2021-01-01 09:50:00.000000000 -1.0

At 9:30 AM, the rate of change was calculated as 2F, which is correct because the temperature changed from 103F to 107F over 20 minutes, so it changed 2F per 10-minute interval.

If you prefer to return the data in the original form, use the UNNEST function to convert the result back to a flat table:

WITH binned_view
AS (
	SELECT BIN(time, 10m) AS binnedTime
		,MIN(temperature) AS minTempPerBucket
		,device_id
	FROM "Blog"."SensorReadings"
	WHERE time BETWEEN ago(365d) AND now()
		AND device_id = 'real1'
		AND measure_name = 'multi'
		AND temperature IS NOT NULL
	GROUP BY device_id
		,BIN(time, 10m)
	)
	,interpolated
AS (
	SELECT device_id
		,derivative_linear(CREATE_TIME_SERIES(binnedTime, minTempPerBucket), 10m) AS rateOfTempChange
	FROM binned_view
	GROUP BY device_id
	)
SELECT device_id
	,time
	,value
FROM interpolated
CROSS JOIN UNNEST(rateOfTempChange)

We get the following result.

device_id time value
real1 2021-01-01 09:10:00.000000000 1.0
real1 2021-01-01 09:30:00.000000000 2.0
real1 2021-01-01 09:40:00.000000000 -1.0
real1 2021-01-01 09:50:00.000000000 -1.0

6. Show top N series by a high cardinality dimension

Let’s assume you want to display a graph of the average temperature in cities over time. You could use the following query to show the average temperature per city, based on the temperature reported by devices, per 10-minute bucket:

SELECT BIN(time, 10m) AS binnedTime
	,avg(temperature) AS avg_temp
	,city
FROM "Blog"."SensorReadings"
WHERE time BETWEEN ago(365d) AND now()
	AND measure_name = 'multi'
	AND temperature IS NOT NULL
GROUP BY BIN(time, 10m)
	,city
ORDER BY city
	,BIN(time, 10m) ASC LIMIT 10

This query returns the following result.

binnedTime avg_temp city
2021-01-01 09:00:00.000000000 70.3 Basin City
2021-01-01 09:10:00.000000000 69.8 Basin City
2021-01-01 09:20:00.000000000 67.0 Basin City
2021-01-01 09:30:00.000000000 69.5 Basin City
2021-01-01 09:40:00.000000000 69.6 Basin City
2021-01-01 09:50:00.000000000 68.5 Basin City
2021-01-01 10:00:00.000000000 69.1 Basin City
2021-01-01 09:00:00.000000000 22.5 Coruscant
2021-01-01 09:10:00.000000000 23.5 Coruscant
2021-01-01 09:20:00.000000000 24.0 Coruscant

Our dataset contains three cities, but in a production system you could have tens of thousands of cities, and the query would then return tens of thousands of data points for each requested timestamp. Showing that many series on a single graph would be challenging and hardly readable. What if you’re interested in showing the graph for only the top N hottest cities, sorted by the average temperature over a given time period?

Because we have three cities in the dataset, let’s limit our result to the top two hottest cities. We use the following query to identify them by average temperature:

SELECT city
	,avg(temperature) AS avg_temp
FROM "Blog"."SensorReadings"
WHERE time BETWEEN ago(365d) AND now()
	AND measure_name = 'multi'
	AND temperature IS NOT NULL
GROUP BY city
ORDER BY avg(temperature) DESC
	,city LIMIT 2

We get the following result.

city avg_temp
Zion 194.9
Basin City 69.11428571428571

Now that we have the top two hottest cities, we can limit the initial query to get the average temperature only for those two cities by using a subquery:

SELECT BIN(time, 10m) AS binnedTime
	,avg(temperature) AS avg_temp
	,city
FROM "Blog"."SensorReadings"
WHERE time BETWEEN ago(365d) AND now()
	AND measure_name = 'multi'
	AND temperature IS NOT NULL
	AND city IN (
		SELECT city
		FROM "Blog"."SensorReadings"
		WHERE time BETWEEN ago(365d) AND now()
			AND measure_name = 'multi'
			AND temperature IS NOT NULL
		GROUP BY city
		ORDER BY avg(temperature) DESC
			,city LIMIT 2
		)
GROUP BY BIN(time, 10m)
	,city
ORDER BY city
	,BIN(time, 10m) ASC

As a result, we get the average temperature for the two hottest cities, binned into 10-minute intervals. The third city’s temperatures aren’t returned.

binnedTime avg_temp city
2021-01-01 09:00:00.000000000 70.3 Basin City
2021-01-01 09:10:00.000000000 69.8 Basin City
2021-01-01 09:20:00.000000000 67.0 Basin City
2021-01-01 09:30:00.000000000 69.5 Basin City
2021-01-01 09:40:00.000000000 69.6 Basin City
2021-01-01 09:50:00.000000000 68.5 Basin City
2021-01-01 10:00:00.000000000 69.1 Basin City
2021-01-01 09:00:00.000000000 103.0 Zion
2021-01-01 09:10:00.000000000 104.0 Zion
2021-01-01 09:20:00.000000000 109.0 Zion
2021-01-01 09:30:00.000000000 108.0 Zion
2021-01-01 09:40:00.000000000 552.5 Zion
2021-01-01 09:50:00.000000000 105.0 Zion

7. Build histograms

Another common query pattern is bucketing measure or dimension values and analyzing the data points in those buckets. In our use case, we want to understand the distribution of temperature readings for the devices in each city and create a histogram. To do so, we count the number of temperature readings in 5-degree buckets (10–15, 15–20, 20–25, and so on) for every city:

WITH buckets
AS (
	SELECT time
		,city
		,device_id
		,temperature
		,width_bucket(temperature, 0, 110, 22) bucket_id
	FROM "Blog"."SensorReadings"
	WHERE time BETWEEN ago(365d) AND now()
		AND measure_name = 'multi'
		AND temperature BETWEEN 0 AND 110
	)
SELECT city
	,bucket_id AS temp_bucket_id
	,cast((bucket_id - 1) * 5 AS VARCHAR) || '-' || cast(bucket_id * 5 AS VARCHAR) AS temp_bucket
	,count(*) AS temperature_readings
	,round(avg(temperature), 2) AS avg_temperature
	,round(approx_percentile(temperature, 0.5), 2) AS p50_temperature
	,round(approx_percentile(temperature, 0.9), 2) AS p90_temperature
FROM buckets
GROUP BY city
	,bucket_id
	,cast((bucket_id - 1) * 5 AS VARCHAR) || '-' || cast(bucket_id * 5 AS VARCHAR)
ORDER BY city
	,bucket_id

Here we use the WIDTH_BUCKET() function, which has two variants: WIDTH_BUCKET(x, bound1, bound2, n) and WIDTH_BUCKET(x, bins). The first variant returns the bucket number of x in an equi-width histogram with n buckets between the bounds bound1 and bound2. The second variant, WIDTH_BUCKET(x, bins), returns the bin number of x according to the explicit bins specified by the array bins. Here we use the first variant because we want a histogram with 5-degree bins.

We first create a CTE called buckets to find the bin number for each temperature reading. Then, in the main section, we query the buckets CTE to find the number of temperature readings, the average temperature, and the approximate 50th and 90th percentiles for each city and bucket. We also create a text label for the calculated bucket numbers for easier interpretation. The query returns the following result.

city temp_bucket_id temp_bucket temperature_readings avg_temperature p50_temperature p90_temperature
Basin City 14 65-70 38 66.76 67 69
Basin City 15 70-75 31 71.81 71 74
Basin City 16 75-80 1 75.0 75 75
Coruscant 5 20-25 14 22.93 23 24
Coruscant 6 25-30 2 25.0 25 25
Zion 21 100-105 3 103.0 103 104
Zion 22 105-110 6 106.83 107 109

Looking at the results, we can see how the temperature values are distributed within each city and how the temperature ranges differ between cities (65–80 for Basin City, 20–30 for Coruscant, and 100–110 for Zion).
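
If you need non-uniform bins instead, the second WIDTH_BUCKET variant accepts explicit bin edges. The following is a minimal sketch with hypothetical edges at 0, 50, and 100 degrees; each reading is assigned the number of the bin it falls into:

SELECT device_id
	,temperature
	,width_bucket(temperature, ARRAY[0.0, 50.0, 100.0]) AS bucket_id
FROM "Blog"."SensorReadings"
WHERE time BETWEEN ago(365d) AND now()
	AND measure_name = 'multi'
	AND temperature IS NOT NULL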

Clean up

To avoid incurring future charges, delete the Timestream tables (SensorReadings and DeviceMetadata) that you created.
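
A minimal CLI sketch for this cleanup follows; the last command also removes the Blog database, if you no longer need it:

aws timestream-write delete-table --database-name Blog --table-name SensorReadings
aws timestream-write delete-table --database-name Blog --table-name DeviceMetadata
aws timestream-write delete-database --database-name Blog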

Conclusion

Timestream is a serverless, purpose-built time series database with usage-based pricing, which gives you the opportunity to run performant and cost-optimized queries. In addition, because Timestream treats time series data as a first-class concept, its built-in time series functions allow you to reduce the complexity of your queries. In this post, you learned how to construct effective queries for some of the most common access patterns in time series workloads. We walked through the queries and explained how each part of a query behaves and processes the data, and we introduced some of the built-in time series functions that help make your queries more performant, cost-effective, and simpler to maintain.

Let us know in the comments what other query patterns you have in your workload!


About the Authors

Saleh Ghasemi is a Database Specialist Solutions Architect at AWS. Prior to joining AWS, he founded and led an AWS consulting partner firm helping customers develop, migrate and modernize database and analytics solutions.

Piotr Westfalewicz is a NoSQL data architect at AWS. He is a passionate software and architecture craftsman, striving to always build the right solution for customers. Before working as a data architect, he was a software development engineer at both Amazon and AWS.