AWS Database Blog
Design patterns for high-volume, time-series data in Amazon DynamoDB
Time-series data shows a pattern of change over time. For example, you might have a fleet of Internet of Things (IoT) devices that record environmental data through their sensors, as shown in the following example graph. This data could include temperature, pressure, humidity, and other environmental variables. Because each IoT device records these values at regular intervals, your backend needs to ingest up to hundreds, thousands, or even millions of events every second.
In this blog post, I explain how to optimize Amazon DynamoDB for high-volume, time-series data scenarios. I do this by using a design pattern powered by automation and serverless computing.
Designing in DynamoDB for a high volume of events
Typically, a data ingestion system requires:
- High write throughput to ingest new records related to the current time.
- Low read throughput for recent records.
- Very low read throughput for older records.
Historically, storing all such events in a single DynamoDB table could have resulted in hot partitions (or “diluted” capacity), but this is no longer a concern thanks to adaptive capacity. On the other hand, you often want to design for periods that are well suited for your analytical needs. For example, you might want to analyze the last few days or months of data. By tuning the lengths of these time periods, you can optimize for both analysis performance and cost.
General design principles for DynamoDB recommend using the smallest number of tables possible. When it comes to time-series data, though, you break from this guidance and create a separate table for each time period. In this post, I show you how to apply this anti-pattern, which turns out to be a great fit for time-series data.
Unless you opt for on-demand capacity mode, every DynamoDB access pattern requires a different allocation of read capacity units and write capacity units. For this post, I classify records in three distinct groups based on how often you read and write them:
- New records written every second
- Recent records read frequently
- Older records read infrequently
I want to maximize write throughput for newly ingested records. Therefore, I create a new table for each period and allocate the maximum number of write capacity units and the minimum number of read capacity units to it. I also prebuild the next period’s table before the end of each period so that I can move all the traffic to it when the current period ends. When I stop writing new records to the old table, I can reduce its write capacity units to 1. I also provision the appropriate read capacity units based on my short-term read requirements. After the next period ends, I reduce the allocated read capacity units as well because I don’t want to overprovision read or write capacity.
I also must consider my analytical needs when estimating how often I should switch to a new period. For example, I might want to analyze what’s happened in the last year. In this case, I could use quarterly tables so that I can retrieve the data more efficiently with four parallel queries and then aggregate the four result sets.
In other use cases, I might want to analyze only last quarter’s data and I could decide to use monthly tables. This would allow me to perform my analysis by running three parallel queries (one for each month in the quarter). On the other hand, if my analysis requires specific daily insights, I might opt for daily tables.
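As an illustration of this kind of analysis, the following sketch queries three monthly tables in parallel and merges the result sets. The table names, the device_id partition key, and the sensor-42 value are assumptions, and pagination is omitted for brevity.

```python
# Sketch: analyze one quarter by querying three monthly tables in parallel.
# Table names and the device_id key are assumptions; pagination is omitted for brevity.
from concurrent.futures import ThreadPoolExecutor

import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource('dynamodb')
monthly_tables = ['timeseries_2018-07', 'timeseries_2018-08', 'timeseries_2018-09']


def query_month(table_name, device_id):
    """Return one device's records from a single monthly table."""
    response = dynamodb.Table(table_name).query(
        KeyConditionExpression=Key('device_id').eq(device_id)
    )
    return response['Items']


# Run the three queries in parallel, then aggregate the result sets.
with ThreadPoolExecutor(max_workers=len(monthly_tables)) as executor:
    result_sets = executor.map(lambda name: query_month(name, 'sensor-42'), monthly_tables)

quarter_records = [item for items in result_sets for item in items]
```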
In the remainder of this post, I focus on the latter scenario with daily tables. I assume that yesterday’s data is relevant for analysis purposes, and older data is not accessed often. I set up a scheduled job to create a new table just before midnight and to reduce the write capacity units of the old table just after midnight.
This way I always have:
- Today’s table with 1,000 write capacity units and 300 read capacity units (maximum writes and some reads).
- Yesterday’s table with 1 write capacity unit and 100 read capacity units (minimum writes and some reads).
- Older tables with 1 write capacity unit and 1 read capacity unit (access is unlikely).
I also can set up DynamoDB automatic scaling to make sure that I do not overprovision or underprovision capacity at any time. This way, I can let read capacity scale as needed and cap the current table's write capacity at a maximum of 1,000 write capacity units.
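For illustration, the following sketch shows how auto scaling could be configured for the current table through the Application Auto Scaling API with Boto 3. The table name and the capacity bounds are example values, and read capacity can be handled the same way.

```python
# Sketch: enable write auto scaling for the current day's table.
# The table name and the capacity bounds are example values.
import boto3

autoscaling = boto3.client('application-autoscaling')
table_name = 'timeseries_2018-10-25'  # hypothetical current table
resource_id = 'table/{}'.format(table_name)

# Let write capacity scale between 100 and 1,000 units.
autoscaling.register_scalable_target(
    ServiceNamespace='dynamodb',
    ResourceId=resource_id,
    ScalableDimension='dynamodb:table:WriteCapacityUnits',
    MinCapacity=100,
    MaxCapacity=1000,
)

# Keep consumed write capacity around 70% of what is provisioned.
autoscaling.put_scaling_policy(
    PolicyName='{}-write-scaling'.format(table_name),
    ServiceNamespace='dynamodb',
    ResourceId=resource_id,
    ScalableDimension='dynamodb:table:WriteCapacityUnits',
    PolicyType='TargetTrackingScaling',
    TargetTrackingScalingPolicyConfiguration={
        'TargetValue': 70.0,
        'PredefinedMetricSpecification': {
            'PredefinedMetricType': 'DynamoDBWriteCapacityUtilization'
        },
    },
)
```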
In the scenario where I can’t predict which day will be more relevant for my analysis, I could enable on-demand capacity mode on older tables. This ensures that I can read any day’s data without worrying about capacity. For the remainder of this post, I continue to assume that my analytical needs are predictable and that I want to have full control over read capacity.
How to automatically prebuild tables and decrease write capacity with AWS Lambda
I am going to automate the creation and resizing of tables every day at midnight by using AWS Lambda and Amazon CloudWatch Events. I create a serverless application, which is a set of AWS resources related to a Lambda function and its triggers, permissions, and environment variables.
My simple serverless application does not need a data store, RESTful API, message queue, or complex orchestration. I want to schedule my Lambda function invocations every day a few minutes before and after midnight. CloudWatch Events allows me to do this by using a cron-like syntax.
First, I implement the table creation and resizing logic in a supported language (such as Node.js, Java, Python, Go, or C#). I implemented everything in Python using the official AWS SDK for Python (Boto 3).
I use and deploy the following code later in this post, but I'll explain it briefly first.
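Here is a minimal sketch of that logic. The device_id/timestamp key schema is an assumption for illustration, and the provisioned capacity values are the ones discussed earlier; adapt both to your own records.

```python
# resizer.py -- sketch of the daily table creation and resizing logic.
# The device_id/timestamp key schema is an assumption for illustration.
from datetime import date, timedelta

import boto3


class DailyResize:
    """Creates tomorrow's table and scales down the tables that no longer receive writes."""

    def __init__(self, table_prefix):
        self.table_prefix = table_prefix
        self.dynamodb = boto3.client('dynamodb')

    def _table_name(self, day):
        # For example, timeseries_2018-10-25
        return '{}_{}'.format(self.table_prefix, day.isoformat())

    def create_new(self):
        """Prebuild tomorrow's table with maximum write capacity and some read capacity."""
        table_name = self._table_name(date.today() + timedelta(days=1))
        self.dynamodb.create_table(
            TableName=table_name,
            AttributeDefinitions=[
                {'AttributeName': 'device_id', 'AttributeType': 'S'},
                {'AttributeName': 'timestamp', 'AttributeType': 'N'},
            ],
            KeySchema=[
                {'AttributeName': 'device_id', 'KeyType': 'HASH'},
                {'AttributeName': 'timestamp', 'KeyType': 'RANGE'},
            ],
            ProvisionedThroughput={'ReadCapacityUnits': 300, 'WriteCapacityUnits': 1000},
        )
        return table_name

    def resize_old(self):
        """Scale down yesterday's table, and the one before it, after the switch at midnight."""
        today = date.today()
        # Yesterday's table just stopped receiving writes: minimum writes, some reads.
        self.dynamodb.update_table(
            TableName=self._table_name(today - timedelta(days=1)),
            ProvisionedThroughput={'ReadCapacityUnits': 100, 'WriteCapacityUnits': 1},
        )
        # The table from two days ago is unlikely to be read from now on.
        self.dynamodb.update_table(
            TableName=self._table_name(today - timedelta(days=2)),
            ProvisionedThroughput={'ReadCapacityUnits': 1, 'WriteCapacityUnits': 1},
        )
```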
As shown in the preceding code example, this implementation is platform agnostic. Everything is encapsulated in a Python class called DailyResize that has two public methods: create_new and resize_old. I save this code in a resizer.py file that I import from my Lambda function handler. Writing the code this way decouples the main logic from Lambda itself, so I can reuse it on different compute platforms, such as Amazon Elastic Container Service or Amazon EC2.
I now implement the Lambda function handler. I use and deploy the following code later in the post.
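The handler below is a minimal sketch; the TABLE_PREFIX environment variable and the event's Operation field are assumptions that match the SAM template and scheduled events described later.

```python
# handler.py -- sketch of the Lambda function handler.
# TABLE_PREFIX is assumed to be set as an environment variable by the SAM template.
import os

from resizer import DailyResize


def handler(event, context):
    """Dispatch the incoming scheduled event to the right DailyResize operation."""
    operation = event['Operation']
    resizer = DailyResize(table_prefix=os.environ['TABLE_PREFIX'])

    if operation == 'create_new':
        return resizer.create_new()
    elif operation == 'resize_old':
        return resizer.resize_old()
    else:
        raise ValueError('Unsupported operation: {}'.format(operation))
```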
As you can see in the preceding Python code, my Lambda handler is a Python function that takes an event and the invocation context as input. The code extracts the Operation parameter from the incoming event and uses it to decide which operation to perform. Because I decoupled my main logic from the Lambda details, my Lambda handler is minimal and only acts as an adapter of my DailyResize class.
Now that I have implemented the main logic and the Lambda handler, I can use the AWS Serverless Application Model (AWS SAM) to define all the AWS resources that compose my serverless application. As a best practice, I follow the guidelines of infrastructure as code, so I define my serverless application in a YAML file (let's call this file template.yml) that I use to deploy the application later.
I define an AWS::Serverless::Function resource and two daily scheduled invocations as its Events. CloudWatch Events allows me to define the input event of each invocation so that my Lambda handler receives a different Operation to perform based on the time of day:
- CreateNewTableEveryDay: Runs every day at 11:45 PM and performs the create_new operation.
- ResizeYesterdaysTablesEveryDay: Runs every day at 12:15 AM and performs the resize_old operation.
Two more important details about the serverless application definition:
- The template requires an input parameter named TablePrefix, to which the period suffix is appended to build each table name (for example, timeseries_2018-10-25).
- I grant fine-grained permissions to the Lambda function so that it can invoke two DynamoDB APIs (dynamodb:CreateTable and dynamodb:UpdateTable) only on tables in the same account and AWS Region whose names start with the given TablePrefix.
The following code shows the AWS SAM template.
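This is a minimal sketch of template.yml. The handler file name, code location, runtime version, and timeout are assumptions, while the parameter, the fine-grained permissions, and the two scheduled events follow the description above (CloudWatch Events schedules are expressed in UTC).

```yaml
AWSTemplateFormatVersion: '2010-09-09'
Transform: AWS::Serverless-2016-10-31
Description: Prebuild and resize daily DynamoDB tables for time-series data

Parameters:
  TablePrefix:
    Type: String
    Description: Common prefix of the daily tables (for example, timeseries)

Resources:
  DailyResizeFunction:
    Type: AWS::Serverless::Function
    Properties:
      Handler: handler.handler      # assumes the handler.py sketch shown earlier
      Runtime: python3.6            # assumption
      CodeUri: ./src                # assumption
      Timeout: 30
      Environment:
        Variables:
          TABLE_PREFIX: !Ref TablePrefix
      Policies:
        - Version: '2012-10-17'
          Statement:
            - Effect: Allow
              Action:
                - dynamodb:CreateTable
                - dynamodb:UpdateTable
              Resource: !Sub 'arn:aws:dynamodb:${AWS::Region}:${AWS::AccountId}:table/${TablePrefix}*'
      Events:
        CreateNewTableEveryDay:
          Type: Schedule
          Properties:
            Schedule: cron(45 23 * * ? *)   # 11:45 PM UTC
            Input: '{"Operation": "create_new"}'
        ResizeYesterdaysTablesEveryDay:
          Type: Schedule
          Properties:
            Schedule: cron(15 0 * * ? *)    # 12:15 AM UTC
            Input: '{"Operation": "resize_old"}'
```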
To deploy the serverless application defined in the AWS SAM template, I use the AWS SAM CLI, a tool for local development and testing of serverless applications. First, I run sam package with the template to upload my code to Amazon S3 and generate a compiled version of my template. Then, I run sam deploy with the compiled template to submit it to AWS CloudFormation, which applies the AWS::Serverless transform and provisions all resources and triggers.
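The two commands look roughly like the following; the S3 bucket and stack names are placeholders.

```bash
# Package the code: upload it to Amazon S3 and produce a compiled template.
sam package \
    --template-file template.yml \
    --s3-bucket <your-artifacts-bucket> \
    --output-template-file packaged.yml

# Deploy the compiled template through AWS CloudFormation.
sam deploy \
    --template-file packaged.yml \
    --stack-name timeseries-daily-resize \
    --capabilities CAPABILITY_IAM \
    --parameter-overrides TablePrefix=timeseries
```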
That’s it! As soon as my function runs, the new table is ready to accept new records. The application that writes new records into DynamoDB starts writing to the new table exactly at midnight, so tomorrow's records end up in my new daily table. If you don't feel like waiting until midnight, you can change the cron-like schedule, or you can invoke the Lambda function from the web console or through the API to trigger the first table creation.
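For example, with the AWS CLI, the first invocation could look like this; the function name is a placeholder, and the payload matches the handler sketch above.

```bash
# Manually trigger the first table creation (function name is a placeholder).
aws lambda invoke \
    --function-name DailyResizeFunction \
    --payload '{"Operation": "create_new"}' \
    output.json
```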
The following screenshot shows the result in the DynamoDB console after a few days. I have one table for each day. Today’s table has 1,000 write capacity units and 300 read capacity units. Yesterday’s table has 1 write capacity unit and 100 read capacity units, and older tables have 1 write capacity unit and 1 read capacity unit.
I can further extend the design to archive older tables’ data to Amazon S3 by using DynamoDB Streams to cut costs and enable big-data analytics scenarios.
Summary
Time-series data requires optimization techniques that are generally considered anti-patterns for DynamoDB. One of these techniques is using a separate table for each time period. This technique maximizes write throughput and optimizes costs, both for data that is accessed infrequently and for analytical queries.
In this post, I showed how to automate table prebuilding and the scaling down of write capacity with Lambda, CloudWatch Events, and AWS SAM. The architecture I implemented in this post is fully automated and serverless: it does not require human intervention, server patching, or infrastructure maintenance. Remember that on-demand capacity mode might help you simplify the proposed solution further if you can't easily predict your analytical patterns.
It’s also worth remembering that Amazon Timestream—the new, fast, scalable, fully managed time series database—is currently in preview. Despite the availability of Amazon Timestream, the design pattern and considerations in this post are still valid options for time-series use cases in DynamoDB.
If you have any comments about this post, submit them in the comments section below or start a new thread in the DynamoDB forum.
About the Author
Alex Casalboni is an AWS technical evangelist. He enjoys playing the saxophone, jogging, and traveling.