Design patterns for high-volume, time-series data in Amazon DynamoDB

Time-series data shows a pattern of change over time. For example, you might have a fleet of Internet of Things (IoT) devices that record environmental data through their sensors, as shown in the following example graph. This data could include temperature, pressure, humidity, and other environmental variables. Because each IoT device tracks these values over regular periods, your backend needs to ingest up to hundreds, thousands, or millions of events every second.

Graph of sensor data

In this blog post, I explain how to optimize Amazon DynamoDB for high-volume, time-series data scenarios. I do this by using a design pattern powered by automation and serverless computing.

Designing in DynamoDB for a high volume of events

Typically, a data ingestion system requires:

High write throughput to ingest new records related to the current time.
Low read throughput for recent records.
Very low read throughput for older records.

Historically, storing all such events in a single DynamoDB table could have resulted in hot partitions (or “diluted” capacity), but this is no longer a concern thanks to adaptive capacity. On the other hand, you often want to design for periods that are well suited for your analytical needs. For example, you might want to analyze the last few days or months of data. By tuning the lengths of these time periods, you can optimize for both analysis performance and cost.

General design principles of DynamoDB recommend using the smallest number of tables possible. For time-series data, though, it pays to break from this principle and create a separate table for each period. Using multiple tables is generally considered an anti-pattern for DynamoDB, but it is a great fit for time-series data, and in this post I show you how to apply it.

Unless you opt for on-demand capacity mode, every DynamoDB access pattern requires a different allocation of read capacity units and write capacity units. For this post, I classify records in three distinct groups based on how often you read and write them:

New records written every second
Recent records read frequently
Older records read infrequently

I want to maximize write throughput for newly ingested records. Therefore, I create a new table for each period and allocate the maximum number of write capacity units and the minimum number of read capacity units to it. I also prebuild the next period’s table before the end of each period so that I can move all the traffic to it when the current period ends. When I stop writing new records to the old table, I can reduce its write capacity units to 1. I also provision the appropriate read capacity units based on my short-term read requirements. After the next period ends, I reduce the allocated read capacity units as well because I don’t want to overprovision read or write capacity.

I also must consider my analytical needs when estimating how often I should switch to a new period. For example, I might want to analyze what’s happened in the last year. In this case, I could use quarterly tables so that I can retrieve the data more efficiently with four parallel queries and then aggregate the four result sets.

In other use cases, I might want to analyze only last quarter’s data and I could decide to use monthly tables. This would allow me to perform my analysis by running three parallel queries (one for each month in the quarter). On the other hand, if my analysis requires specific daily insights, I might opt for daily tables.
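The fan-out described above can be sketched as follows. This is a minimal illustration, not part of the deployed application: the monthly naming scheme (prefix_YYYY-MM) and the helper names monthly_table_names and query_all are assumptions made for this example, and query_one stands in for whatever DynamoDB query you run against a single table.

```python
import datetime
from concurrent.futures import ThreadPoolExecutor


def monthly_table_names(prefix, start, end):
    """Return one table name per calendar month touched by [start, end]."""
    names = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        # hypothetical naming scheme: <prefix>_YYYY-MM
        names.append("%s_%04d-%02d" % (prefix, year, month))
        # advance to the next month, rolling over the year after December
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)
    return names


def query_all(table_names, query_one):
    """Run query_one(table_name) for every table in parallel and merge the results."""
    with ThreadPoolExecutor(max_workers=max(1, len(table_names))) as pool:
        # pool.map returns the result sets in the same order as table_names
        result_sets = list(pool.map(query_one, table_names))
    # flatten the per-table result sets into a single list
    return [item for result in result_sets for item in result]
```

For last quarter's data, monthly_table_names returns three names, so query_all issues three parallel queries and aggregates the three result sets.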

In the remainder of this post, I focus on the latter scenario with daily tables. I assume that yesterday’s data is relevant for analysis purposes, and older data is not accessed often. I set up a scheduled job to create a new table just before midnight and to reduce the write capacity units of the old table just after midnight.

This way I always have:

Today’s table with 1,000 write capacity units and 300 read capacity units (maximum writes and some reads).
Yesterday’s table with 1 write capacity unit and 100 read capacity units (minimum writes and some reads).
Older tables with 1 write capacity unit and 1 read capacity unit (access is unlikely).
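The three tiers above amount to a simple mapping from a table's age to its provisioned capacity. The following sketch shows that mapping on its own; the function name capacity_for and the constant names are assumptions for illustration, while the capacity values are the ones listed above.

```python
import datetime

# (read capacity units, write capacity units) for each tier listed above
TODAY = (300, 1000)
YESTERDAY = (100, 1)
OLDER = (1, 1)


def capacity_for(table_date, today):
    """Return the (RCU, WCU) pair for a daily table, given the current date."""
    age_in_days = (today - table_date).days
    if age_in_days <= 0:
        return TODAY  # the current table, or a prebuilt table for tomorrow
    if age_in_days == 1:
        return YESTERDAY
    return OLDER  # access is unlikely
```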

I can also set up DynamoDB automatic scaling to make sure that I do not overprovision or underprovision capacity at any time. This way, I can let read capacity units scale as needed and cap write capacity units at 1,000 for the current table.
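As a rough sketch of what automatic scaling for read capacity could look like with Application Auto Scaling, consider the following function. It is not part of the application deployed later in this post. The function name, the policy name, and the 70 percent target utilization are assumptions for illustration, and the autoscaling argument is expected to be a boto3.client("application-autoscaling") instance.

```python
def enable_read_autoscaling(autoscaling, table_name, min_rcu, max_rcu,
                            target_utilization=70.0):
    """Let Application Auto Scaling manage a table's read capacity.

    `autoscaling` is expected to be boto3.client("application-autoscaling").
    """
    resource_id = "table/%s" % table_name
    dimension = "dynamodb:table:ReadCapacityUnits"

    # declare the capacity range that scaling is allowed to move within
    autoscaling.register_scalable_target(
        ServiceNamespace="dynamodb",
        ResourceId=resource_id,
        ScalableDimension=dimension,
        MinCapacity=min_rcu,
        MaxCapacity=max_rcu,
    )

    # track a target utilization of the consumed read capacity
    autoscaling.put_scaling_policy(
        PolicyName="%s-read-scaling" % table_name,  # hypothetical policy name
        ServiceNamespace="dynamodb",
        ResourceId=resource_id,
        ScalableDimension=dimension,
        PolicyType="TargetTrackingScaling",
        TargetTrackingScalingPolicyConfiguration={
            "TargetValue": target_utilization,
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "DynamoDBReadCapacityUtilization",
            },
        },
    )
```

The same two calls with the dynamodb:table:WriteCapacityUnits dimension and the DynamoDBWriteCapacityUtilization metric would cover write capacity, with the maximum set to 1,000 for the current table.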

In the scenario where I can’t predict which day will be more relevant for my analysis, I could enable on-demand capacity mode on older tables. This ensures that I can read any day’s data without worrying about capacity. For the remainder of this post, I continue to assume that my analytical needs are predictable and that I want to have full control over read capacity.
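For reference, switching an older table to on-demand capacity mode is a single UpdateTable call. The following is a minimal sketch; the function name is an assumption for illustration, and the dynamodb argument is expected to be a boto3.client("dynamodb") instance.

```python
def switch_to_on_demand(dynamodb, table_name):
    """Switch a table from provisioned capacity to on-demand capacity mode.

    `dynamodb` is expected to be boto3.client("dynamodb").
    """
    dynamodb.update_table(
        TableName=table_name,
        BillingMode="PAY_PER_REQUEST",  # on-demand; no RCU/WCU to provision
    )
```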

How to automatically prebuild tables and decrease write capacity with AWS Lambda

I am going to automate the creation and resizing of tables every day at midnight by using AWS Lambda and Amazon CloudWatch Events. I create a serverless application, which is a set of AWS resources related to a Lambda function and its triggers, permissions, and environment variables.

My simple serverless application does not need a data store, RESTful API, message queue, or complex orchestration. I want to schedule my Lambda function invocations every day a few minutes before and after midnight. CloudWatch Events allows me to do this by using a cron-like syntax.

First, I implement the table creation and resizing logic in a language that Lambda supports (such as Node.js, Java, Python, Go, or C#). I have implemented everything in Python using the official AWS SDK for Python (Boto3).

I use and deploy the following code later in this post, so I show it here first and then explain it briefly.

import os
import boto3
import datetime

region = os.environ.get('AWS_DEFAULT_REGION', 'us-west-2')
dynamodb = boto3.client('dynamodb', region_name=region)

class DailyResize(object):
    
    FIRST_DAY_RCU, FIRST_DAY_WCU = 300, 1000
    SECOND_DAY_RCU, SECOND_DAY_WCU = 100, 1
    THIRD_DAY_RCU, THIRD_DAY_WCU = 1, 1

    def __init__(self, table_prefix):
        self.table_prefix = table_prefix
    
    def create_new(self):
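        # prebuild tomorrow's table (300 RCU, 1000 WCU); this runs shortly
        # before midnight, and Lambda evaluates date.today() in UTC
        tomorrow = datetime.date.today() + datetime.timedelta(days=1)
        new_table_name = "%s_%s" % (self.table_prefix, self._format_date(tomorrow))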
        dynamodb.create_table(
            TableName=new_table_name,
            KeySchema=[       
                { 'AttributeName': "pk", 'KeyType': "HASH"},  # Partition key
                { 'AttributeName': "sk", 'KeyType': "RANGE" } # Sort key
            ],
            AttributeDefinitions=[       
                { 'AttributeName': "pk", 'AttributeType': "N" },
                { 'AttributeName': "sk", 'AttributeType': "N" }
            ],
            ProvisionedThroughput={       
                'ReadCapacityUnits': self.FIRST_DAY_RCU, 
                'WriteCapacityUnits': self.FIRST_DAY_WCU,
            },
        )
    
        print("Table created with name '%s'" % new_table_name)
        return new_table_name
    
    
    def resize_old(self):
        # update yesterday's table (100 RCU, 1 WCU)
        yesterday = datetime.date.today() - datetime.timedelta(1)
        old_table_name = "%s_%s" % (self.table_prefix, self._format_date(yesterday))
        self._update_table(old_table_name, self.SECOND_DAY_RCU, self.SECOND_DAY_WCU)
    
        # update the day before yesterday's table (1 RCU, 1 WCU)
        the_day_before_yesterday = datetime.date.today() - datetime.timedelta(2)
        very_old_table_name = "%s_%s" % (self.table_prefix, self._format_date(the_day_before_yesterday))
        self._update_table(very_old_table_name, self.THIRD_DAY_RCU, self.THIRD_DAY_WCU)
        
        return "OK"
        
    
    def _update_table(self, table_name, RCU, WCU):
        """ Update RCU/WCU of the given table (if exists) """
        print("Updating table with name '%s'" % table_name)
        try:
            dynamodb.update_table(
                TableName=table_name,
                ProvisionedThroughput={
                    'ReadCapacityUnits': RCU,
                    'WriteCapacityUnits': WCU,
                },
            )
        except dynamodb.exceptions.ResourceNotFoundException as ex:
            print("DynamoDB Table %s not found" % table_name)
    
    
    @staticmethod
    def _format_date(d):
        return d.strftime("%Y-%m-%d")

As shown in the preceding code example, this implementation is platform agnostic. Everything is encapsulated in a Python class called DailyResize that has two public methods: create_new and resize_old. I save this code in a resizer.py file that I include in my Lambda function handler. Writing code this way allows me to decouple the main logic from Lambda itself. This way I can reuse this code on different compute platforms, such as Amazon Elastic Container Service or Amazon EC2.

I now implement the Lambda function handler. I use and deploy the following code later in the post.

import os
from resizer import DailyResize

def daily_resize(event, context):
    operation = event['Operation']
    resizer = DailyResize(table_prefix=os.environ['TABLE_NAME'])
    if operation == 'create_new':
        resizer.create_new()
    elif operation == 'resize_old':
        resizer.resize_old()
    else:
        raise ValueError("Invalid operation")

As you can see in the preceding Python code, my Lambda handler is a Python function that takes an event and the invocation context as input. The code extracts the Operation parameter from the incoming event and uses it to decide which operation to perform. Because I decoupled my main logic from the Lambda details, my Lambda handler is minimal and only acts as an adapter of my DailyResize class.

Now that I have implemented the main logic and the Lambda handler, I can use the AWS Serverless Application Model (AWS SAM) to define all the AWS resources that compose my serverless application. As a best practice, I follow the guidelines of infrastructure as code, so I define my serverless application in a YAML file (let’s call this file template.yml) that I use to deploy the application later.

I define an AWS::Serverless::Function resource and two daily scheduled invocations as its Events. CloudWatch Events allows me to define what the input event of each invocation is so that my Lambda handler receives a different Operation to perform based on the time of the day:

CreateNewTableEveryDay: Runs every day at 11:45 PM (UTC) and performs the create_new operation.
ResizeYesterdaysTableEveryDay: Runs every day at 12:15 AM (UTC) and performs the resize_old operation.

Two more important details about the serverless application definition:

The template requires an input parameter named TablePrefix. The period suffix is appended to this prefix to build each table name (for example, timeseries_2018-10-25).
I grant fine-grained permissions to the Lambda function so that it can invoke two DynamoDB APIs (dynamodb:CreateTable and dynamodb:UpdateTable) on tables in the same account and AWS Region, and whose names start with the given TablePrefix.

The following code shows the entire AWS SAM template.

AWSTemplateFormatVersion: '2010-09-09'
Transform: AWS::Serverless-2016-10-31
Parameters: 
  TablePrefix:
    Type: String
Resources:
  TableDailyResize:
    Type: AWS::Serverless::Function
    Properties:
      Handler: handler.daily_resize
      Policies:
        - AWSLambdaExecute # Managed Policy
        - Version: '2012-10-17'
          Statement:
            - Effect: Allow
              Action:
                - dynamodb:CreateTable
                - dynamodb:UpdateTable
              Resource: !Sub 'arn:aws:dynamodb:${AWS::Region}:${AWS::AccountId}:table/${TablePrefix}*'      
      Runtime: python3.12
      Timeout: 30
      MemorySize: 256
      Environment:
        Variables:
          TABLE_NAME: !Ref TablePrefix
      Events:
        CreateNewTableEveryDay:
          Type: Schedule
          Properties:
            Input: '{"Operation": "create_new"}'
            Schedule: cron(45 23 * * ? *)  # every day at 11:45 PM (UTC)
        ResizeYesterdaysTableEveryDay:
          Type: Schedule
          Properties:
            Input: '{"Operation": "resize_old"}'
            Schedule: cron(15 0 * * ? *)  # every day at 12:15 AM (UTC)

To deploy the serverless application defined in the AWS SAM template, I use the AWS SAM CLI, which is a tool for local development and testing of serverless applications. First, I run sam package with the template to upload my code to Amazon S3 and generate a compiled version of my template. Then, I run sam deploy with the compiled template to submit it to AWS CloudFormation, which applies the AWS::Serverless transform and provisions all resources and triggers.

# install AWS SAM
pip install --user aws-sam-cli

# configure your AWS credentials
aws configure

# package your raw YAML template
sam package \
    --template-file template.yml \
    --s3-bucket YOUR_BUCKET \
    --output-template-file compiled.yml

# deploy your compiled YAML template
sam deploy \
    --template-file compiled.yml \
    --stack-name YOUR_STACK_NAME \
    --capabilities CAPABILITY_IAM \
    --parameter-overrides TablePrefix=YOUR_PREFIX_

That’s it! As soon as my function runs, the new table is ready to accept new records. The application that is writing new records into DynamoDB starts writing to the new table exactly at midnight so that tomorrow’s records will be written to my new daily table. If you do not feel like waiting until midnight, you can change the cron-like schedule. You can invoke the Lambda function from the web console or through the API to trigger the first table creation.
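On the writer side, one way to make the switch happen exactly at midnight is to derive the table name from each event's timestamp rather than from a stored setting. The following is a minimal sketch under that assumption; the function name table_for_timestamp is hypothetical, and it uses UTC because that is the clock the Lambda function and the cron schedules run on.

```python
import datetime


def table_for_timestamp(prefix, epoch_seconds):
    """Return the daily table name for an event timestamp (epoch seconds, UTC)."""
    day = datetime.datetime.fromtimestamp(
        epoch_seconds, datetime.timezone.utc
    ).date()
    # same naming scheme as DailyResize: <prefix>_YYYY-MM-DD
    return "%s_%s" % (prefix, day.strftime("%Y-%m-%d"))
```

With this approach, an event recorded one second before midnight UTC goes to the current day's table, and an event recorded at midnight goes to the table that was prebuilt for the new day.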

The following screenshot shows the result in the DynamoDB console after a few days. I have one table for each day. Today’s table has 1,000 write capacity units and 300 read capacity units. Yesterday’s table has 1 write capacity unit and 100 read capacity units, and older tables have 1 write capacity unit and 1 read capacity unit.

The result in the DynamoDB console after a few days.

I can further extend the design to archive older tables’ data to Amazon S3 by using DynamoDB Streams to cut costs and enable big-data analytics scenarios.

Summary

Time-series data requires optimization techniques generally considered to be anti-patterns for DynamoDB. One of these techniques is using a separate table for each time period. This technique maximizes write throughput and optimizes costs for both data that is not accessed frequently and analytical queries.

In this post, I showed how to automate table prebuilding and scaling down of write capacity with Lambda, CloudWatch Events, and AWS SAM. The architecture I implement in this post is fully automated and serverless because it does not require human intervention, server patching, or infrastructure maintenance. Remember that on-demand capacity mode might help you simplify the proposed solution further in case you can’t easily predict your analytical patterns.

It’s also worth remembering that Amazon Timestream—the new, fast, scalable, fully managed time series database—is currently in preview. Despite the availability of Amazon Timestream, the design pattern and considerations in this post are still valid options for time-series use cases in DynamoDB.

If you have any comments about this post, submit them in the comments section below or start a new thread in the DynamoDB forum.

About the Author

Alex Casalboni is an AWS technical evangelist. He enjoys playing the saxophone, jogging, and traveling.

AWS Database Blog
