Automatically Archive Items to S3 Using DynamoDB Time to Live (TTL) with AWS Lambda and Amazon Kinesis Firehose
Adam Wagner is a solutions architect at Amazon Web Services.
Earlier this year, Amazon DynamoDB released Time to Live (TTL) functionality, which automatically deletes expired items from your tables, at no additional cost. TTL eliminates the complexity and cost of scanning tables and deleting items that you don’t want to retain, saving you money on provisioned throughput and storage. One AWS customer, TUNE, purged 85 terabytes of stale data and reduced their costs by over $200K per year.
Today, DynamoDB made TTL better with the release of a new CloudWatch metric for tracking the number of items deleted by TTL, which is also viewable for no additional charge. This new metric helps you monitor the rate of TTL deletions to validate that TTL is working as expected. For example, you could set a CloudWatch alarm to fire if too many or too few automated deletes occur, which might indicate an issue in how you set expiration time stamps for your items.
In this post, I’ll walk through an example of a serverless application using TTL to automate a common database management task: moving old data from your database into archival storage automatically. Archiving old data helps reduce costs and meet regulatory requirements governing data retention or deletion policies. I’ll show how TTL—combined with DynamoDB Streams, AWS Lambda, and Amazon Kinesis Firehose—facilitates archiving data to a low-cost storage service like Amazon S3, a data warehouse like Amazon Redshift, or to Amazon OpenSearch Service.
TTL and archiving data
Customers often use DynamoDB to store time series data, such as webpage clickstream data or IoT data from sensors and connected devices. Rather than delete older, less frequently accessed items, many customers want to archive them instead. One way to accomplish this is via “rolling tables,” that is, pre-creating tables to store data for particular months, weeks, or days. However, this requires custom application logic to handle creating and deleting tables, and switching reads and writes to new tables. Another approach is to run periodic jobs to delete old items, but this consumes write throughput and also requires custom application logic.
DynamoDB TTL simplifies archiving by automatically deleting items based on the time stamp attribute of your choice. Items deleted by TTL can be identified in DynamoDB Streams, which captures a time-ordered sequence of item-level modifications on a DynamoDB table and stores them in a log for up to 24 hours. You can then archive this stream data using applications like the one referenced later in this blog post. If you choose S3 as the archive, you can optimize your costs even further using S3 lifecycle configuration rules, which automatically transition older data to the Infrequent Access storage class in S3 or to Amazon Glacier for long-term backup.
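As a rough sketch of such a lifecycle rule, the following Python builds the configuration that the boto3 put_bucket_lifecycle_configuration call accepts. The rule ID and the 30-day and 365-day thresholds are illustrative assumptions, not values from this walkthrough.

```python
def build_archive_lifecycle(ia_after_days=30, glacier_after_days=365):
    """Build a lifecycle configuration that ages archived objects into cheaper storage."""
    if glacier_after_days <= ia_after_days:
        raise ValueError("The Glacier transition must come after the Infrequent Access one")
    return {
        "Rules": [
            {
                "ID": "age-out-dynamodb-archive",  # hypothetical rule name
                "Filter": {"Prefix": ""},  # an empty prefix applies the rule to every object
                "Status": "Enabled",
                "Transitions": [
                    {"Days": ia_after_days, "StorageClass": "STANDARD_IA"},
                    {"Days": glacier_after_days, "StorageClass": "GLACIER"},
                ],
            }
        ]
    }


def apply_lifecycle(bucket_name, configuration):
    """Apply the configuration to a bucket; needs boto3 and AWS credentials."""
    import boto3  # imported lazily so the builder above works without the SDK

    boto3.client("s3").put_bucket_lifecycle_configuration(
        Bucket=bucket_name, LifecycleConfiguration=configuration
    )
```

Calling apply_lifecycle with your archive bucket name and the result of build_archive_lifecycle() would attach the rule; adjust the day counts to match how often you expect to read archived data.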
Sample application overview
This post shows how to build a solution to remove older items from a DynamoDB table and archive them to S3 without having to manage a fleet of servers (see the following simplified workflow diagram). You use TTL to automatically delete old items and DynamoDB Streams to capture the TTL-expired items. You then connect DynamoDB Streams to Lambda, which lets you run code without provisioning or managing any servers. When new items are added to the DynamoDB stream (that is, as TTL deletes older items), the Lambda function triggers, writing the data to an Amazon Kinesis Firehose delivery stream. Firehose provides a simple, fully managed solution to load the data into S3, which is the archive.
At a high level, this post takes you through the following steps:
- Activate TTL and DynamoDB Streams on your DynamoDB table.
- Create a Firehose delivery stream to load the data into S3.
- Create a Lambda function to poll the DynamoDB stream and deliver batches of stream records to Firehose.
- Validate that the application works.
Note: This post assumes that you already have a DynamoDB table created, and that you have an attribute on that table that you want to use as the time stamp for TTL. To get started with a simple DynamoDB table, see Getting Started with Amazon DynamoDB. For more details on setting up TTL, see Manage DynamoDB Items Using Time to Live on the AWS Blog.
Step 1: Enable DynamoDB TTL and DynamoDB Streams
Start by signing in to the DynamoDB console and navigating to the table that contains the items that you want to archive. Go to the Overview tab for the table. Under Table details, choose Manage TTL.
Enter the table attribute containing the time stamp that will flag items for automated TTL deletion. Next, select the check box to enable DynamoDB streams with view type New and old images. This gives you both the new and old versions of items that are updated. Selecting this view type is required for your stream to contain items removed by TTL.
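If you prefer to script this step, the same two settings can be applied with the AWS SDK. The following is a minimal sketch using boto3; the table and attribute names are whatever you pass in, and the helper names are my own. It also shows how to compute a TTL value, which DynamoDB expects as a Number holding Unix epoch time in seconds.

```python
import time

SECONDS_PER_DAY = 24 * 60 * 60


def expiry_epoch_seconds(days_to_keep, now=None):
    """Return a TTL value: Unix epoch time in whole seconds."""
    current = time.time() if now is None else now
    return int(current) + days_to_keep * SECONDS_PER_DAY


def ttl_request(table_name, ttl_attribute):
    """Parameters for the UpdateTimeToLive API."""
    return {
        "TableName": table_name,
        "TimeToLiveSpecification": {"Enabled": True, "AttributeName": ttl_attribute},
    }


def stream_request(table_name):
    """Parameters for UpdateTable that turn on a stream with new and old images."""
    return {
        "TableName": table_name,
        "StreamSpecification": {
            "StreamEnabled": True,
            "StreamViewType": "NEW_AND_OLD_IMAGES",
        },
    }


def enable_ttl_and_stream(table_name, ttl_attribute):
    """Apply both settings; needs boto3 and AWS credentials."""
    import boto3  # lazy import keeps the helpers above usable without the SDK

    client = boto3.client("dynamodb")
    client.update_time_to_live(**ttl_request(table_name, ttl_attribute))
    client.update_table(**stream_request(table_name))
```

When you write items, store the result of expiry_epoch_seconds in the TTL attribute so that each item carries its own expiration time.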
This stream will contain records for new items that are created, updated items, items that are deleted by you, and items that are deleted by TTL. The records for items that are deleted by TTL contain an additional metadata attribute to distinguish them from items that are deleted by you. The userIdentity field for TTL deletions (shown in the following example) indicates that the DynamoDB service performed the delete.
In this example, you archive all item changes to S3. If you wanted to archive only the items deleted by TTL, you could archive only the records whose userIdentity field identifies the DynamoDB service as the principal that performed the delete.
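To make that filter concrete, here is a small Python sketch that checks a single stream record. The values it compares against (a type of Service and a principalId of dynamodb.amazonaws.com) come from the DynamoDB documentation for TTL deletes rather than from this walkthrough, and the function name is my own.

```python
TTL_PRINCIPAL_ID = "dynamodb.amazonaws.com"  # documented principal for TTL deletes


def is_ttl_delete(record):
    """Return True if a DynamoDB stream record describes a delete made by TTL."""
    identity = record.get("userIdentity") or {}
    return (
        record.get("eventName") == "REMOVE"
        and identity.get("type") == "Service"
        and identity.get("principalId") == TTL_PRINCIPAL_ID
    )


def only_ttl_deletes(records):
    """Keep just the records that TTL produced, preserving their order."""
    return [record for record in records if is_ttl_delete(record)]
```

Deletes that you issue yourself carry no userIdentity field, so they fall through the filter along with inserts and updates.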
Now that TTL and your DynamoDB stream are activated, you can move on to configuring Amazon Kinesis Firehose. (Configuring Lambda later on is more straightforward if you set up Firehose first.)
Step 2: Set up the Amazon Kinesis Firehose delivery stream
First, create the Amazon Kinesis Firehose delivery stream using the same name as your DynamoDB table. Specify the S3 bucket where Firehose will send the TTL-deleted items. (At this point, you might want to create a new S3 bucket to be your DynamoDB archive.)
Next, you configure details about the delivery stream.
You can set both size-based and time-based limits on how much data Firehose buffers before writing to S3. There is a tradeoff between the number of requests per second to S3 and the delay for writing data to S3, depending on whether you favor few large objects or many small objects. Whichever condition is met first (time or size) triggers the writing of data to S3. If you’re planning to analyze the data with Amazon EMR or Athena, it’s best to aim for a larger file size, so set the buffer size to the maximum of 128 MB and use a longer buffer interval. For this example, I use the maximum buffer size of 128 MB and the maximum buffer interval of 900 seconds.
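For readers who script their setup, the buffering limits map onto the BufferingHints structure of the Firehose CreateDeliveryStream API. The sketch below builds an S3 destination configuration with the values used in this example. The ARNs are supplied by you, the GZIP compression choice is an assumption, and the helper names are hypothetical.

```python
MAX_BUFFER_MB = 128          # maximum buffer size cited in this post
MAX_BUFFER_SECONDS = 900     # maximum buffer interval cited in this post


def build_s3_destination(bucket_arn, role_arn, size_mb=128, interval_seconds=900):
    """Build an ExtendedS3DestinationConfiguration with explicit buffering hints."""
    if not 0 < size_mb <= MAX_BUFFER_MB:
        raise ValueError("Buffer size must be between 1 and 128 MB")
    if not 0 < interval_seconds <= MAX_BUFFER_SECONDS:
        raise ValueError("Buffer interval must be at most 900 seconds")
    return {
        "RoleARN": role_arn,
        "BucketARN": bucket_arn,
        "BufferingHints": {
            "SizeInMBs": size_mb,
            "IntervalInSeconds": interval_seconds,
        },
        "CompressionFormat": "GZIP",  # assumption: compress objects before writing
    }


def create_stream(name, destination):
    """Create the delivery stream; needs boto3 and AWS credentials."""
    import boto3  # lazy import so the builder works without the SDK

    boto3.client("firehose").create_delivery_stream(
        DeliveryStreamName=name,
        DeliveryStreamType="DirectPut",
        ExtendedS3DestinationConfiguration=destination,
    )
```

Firehose flushes to S3 as soon as either hint is reached, so on a quiet table you should expect objects roughly every 15 minutes rather than 128 MB files.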
You can compress or encrypt your data before writing to S3. You can also enable or disable error logging to CloudWatch Logs.
Note: you can optionally transform your data using a Lambda function. This is beyond the scope of this post, but for more details on this option, see Amazon Kinesis Firehose Data Transformation.
You must also configure an IAM role that Firehose uses to write to your S3 bucket, access KMS keys, write to CloudWatch Logs, and invoke Lambda functions as needed. You can use an existing IAM role if you have one with the necessary permissions. Or you can create a new role (shown in the following screenshot), which is auto-populated with a policy with the necessary permissions:
Choose Create Delivery Stream.
After a few minutes, the status of the delivery stream changes from CREATING to ACTIVE.
Now you can move on to configuring the Lambda function that will take items from the DynamoDB stream and write them to the Firehose delivery stream.
Step 3: Configure Lambda
Now that you have a DynamoDB table with TTL, a DynamoDB stream, and a Firehose delivery stream, you can set up the Lambda function that listens to the DynamoDB stream and writes the items to Firehose. This example uses the lambda-streams-to-firehose project, written by my colleague Ian Meyers, available in this GitHub repository. It contains several advanced features. However, this example uses the base functionality: it takes all messages in the DynamoDB stream and forwards them to the Firehose delivery stream.
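To illustrate the base functionality, here is a simplified Python sketch of the forwarding idea. It is not the code from the lambda-streams-to-firehose project (which is written in Node.js); the function names are my own, and the 500-record cap reflects the documented limit of the Firehose PutRecordBatch API.

```python
import json

MAX_RECORDS_PER_CALL = 500  # PutRecordBatch accepts at most 500 records per call


def to_firehose_records(stream_records):
    """Serialize each stream record as one newline-terminated JSON document."""
    return [
        {"Data": (json.dumps(record, default=str) + "\n").encode("utf-8")}
        for record in stream_records
    ]


def chunked(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def forward(stream_records, firehose_client, delivery_stream_name):
    """Send all records to Firehose and return how many were sent."""
    sent = 0
    for batch in chunked(to_firehose_records(stream_records), MAX_RECORDS_PER_CALL):
        response = firehose_client.put_record_batch(
            DeliveryStreamName=delivery_stream_name, Records=batch
        )
        if response.get("FailedPutCount", 0):
            # Raising makes Lambda retry the whole batch from the stream.
            raise RuntimeError("Firehose rejected some records")
        sent += len(batch)
    return sent
```

In a Lambda handler, you would call forward with event["Records"] and a boto3 Firehose client. Retrying the whole batch on a partial failure can produce duplicates in the archive, which is usually an acceptable trade for not losing records.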
Start by cloning the repository:
Next, edit index.js to specify the DynamoDB stream and Firehose delivery stream. You can use this Lambda function to handle multiple DynamoDB streams and/or Amazon Kinesis streams. But in this case, you configure it for this one stream.
In index.js, locate line 86, as shown here:
Edit this line so that it contains your DynamoDB Stream Name ARN : Kinesis Firehose Delivery Stream Name. For example:
By default, this Lambda function also supports the option of having a default delivery stream. For this use case, you can disable this by editing line 78:
Change the contents of this line to the following:
You’re now ready to package up the Lambda function. Running the included ./build.sh zips up the function into a deployment package that you upload later in this step.
Next, create the IAM role that will be used by the Lambda function. Navigate to the Roles page in the IAM console and choose Create New Role. Choose AWS Lambda as the service role type.
Next, select the policies to attach to the role used by the Lambda function. For this example, choose the following two policies: AWSLambdaDynamoDBExecutionRole and AmazonKinesisFirehoseFullAccess.
This gives the Lambda function access to all CloudWatch Logs for logging, full access to all DynamoDB streams, and full access to Amazon Kinesis Firehose. You could also start with these policies as a baseline but limit them to the specific DynamoDB stream and Firehose delivery stream.
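As one way to narrow those permissions, the following Python sketch generates a policy document scoped to a single stream and a single delivery stream. The action list is my assumption of a reasonable minimum for reading a stream and writing to Firehose; the project used in this post may need additional actions, so test before relying on it.

```python
import json


def build_lambda_policy(stream_arn, delivery_stream_arn):
    """Build an IAM policy limited to one DynamoDB stream and one delivery stream."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "dynamodb:DescribeStream",
                    "dynamodb:GetRecords",
                    "dynamodb:GetShardIterator",
                ],
                "Resource": stream_arn,
            },
            # ListStreams cannot be scoped to a single stream.
            {"Effect": "Allow", "Action": "dynamodb:ListStreams", "Resource": "*"},
            {
                "Effect": "Allow",
                "Action": ["firehose:PutRecord", "firehose:PutRecordBatch"],
                "Resource": delivery_stream_arn,
            },
            {
                "Effect": "Allow",
                "Action": [
                    "logs:CreateLogGroup",
                    "logs:CreateLogStream",
                    "logs:PutLogEvents",
                ],
                "Resource": "*",
            },
        ],
    }


def policy_json(stream_arn, delivery_stream_arn):
    """Render the policy as the JSON text that the IAM console expects."""
    return json.dumps(build_lambda_policy(stream_arn, delivery_stream_arn), indent=2)
```

You could paste the output of policy_json into an inline policy on the role instead of attaching the two managed policies.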
Give the role a meaningful name, such as lambda_streams_to_firehose, and provide a description for the role. The Review role screen should look similar to the following screenshot. Create the role.
Now you create the Lambda function in the console. Go to the Lambda console and choose Create a Lambda function. Choose the blank blueprint in the upper-left corner:
That brings you to the Configure triggers screen. Choose the dotted gray box to open the dropdown list for triggers, and then choose DynamoDB.
This brings up the detailed configuration for using DynamoDB Streams as a trigger. In the dropdown list, choose the table you’re using.
You now set the batch size, which is the maximum number of records that will be sent to your Lambda function in one batch. Keep in mind that this only sets a ceiling; your Lambda function might be invoked with smaller batches. To decide on a batch size, think about the number of records that your Lambda function can process within its configured timeout, which has a maximum value of five minutes.
In this example you’re not doing heavy processing of the records in Lambda, and you should be able to handle a batch size larger than the default of 100, so raise it to 250. When your function is configured, you can monitor the average duration of your Lambda functions to make sure you’re not close to your timeout value. If your Lambda functions are running for a duration that’s close to the timeout value, you can either raise the timeout value or decrease the batch size.
Next, configure the starting position. This can either be Trim Horizon (the oldest records in the stream) or Latest (the newest records added to the stream). In this example, use Trim Horizon. Finally, decide if you want to enable the trigger right away, which you will in this example.
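The trigger settings chosen in the console correspond to the parameters of the Lambda CreateEventSourceMapping API. As a sketch, the following Python builds those parameters with the batch size and starting position used in this example; the stream ARN and function name are yours to supply, and the helper names are hypothetical.

```python
VALID_STARTING_POSITIONS = ("TRIM_HORIZON", "LATEST")


def build_trigger(stream_arn, function_name, batch_size=250,
                  starting_position="TRIM_HORIZON", enabled=True):
    """Build parameters that connect a DynamoDB stream to a Lambda function."""
    if starting_position not in VALID_STARTING_POSITIONS:
        raise ValueError("Starting position must be TRIM_HORIZON or LATEST")
    if batch_size < 1:
        raise ValueError("Batch size must be at least 1")
    return {
        "EventSourceArn": stream_arn,
        "FunctionName": function_name,
        "BatchSize": batch_size,  # a ceiling; invocations may receive fewer records
        "StartingPosition": starting_position,
        "Enabled": enabled,
    }


def create_trigger(params):
    """Create the mapping; needs boto3 and AWS credentials."""
    import boto3  # lazy import so the builder works without the SDK

    return boto3.client("lambda").create_event_source_mapping(**params)
```

With TRIM_HORIZON, the function starts from the oldest records still held in the stream, so expect an initial burst of invocations if the stream has been enabled for a while.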
Next, give your function a name and a description. Choose the Node.js 4.3 runtime.
Upload the function you zipped up earlier.
You can optionally add environment variables for your Lambda functions.
Now configure the function handler and the IAM role used by your Lambda function. The function handler can stay at the default value of index.handler. In the role dropdown list, choose the role you created earlier.
In the Advanced settings screen, leave the memory setting at the default 128 MB, and set the timeout at 1 minute. You can tune these settings later based on the average size of your records and how large you plan on making the batches.
Accept the default options for the additional advanced settings. Choose Next, and review your configuration, which should look similar to this:
Finally, create your Lambda function.
Step 4: Validate that it’s working
Now that TTL is enabled on this table, you can look at the newly launched TimeToLiveDeletedItemCount CloudWatch metric. You can find it in the DynamoDB console under the Metrics tab after selecting a table (see the following screenshot). This metric is updated every minute with the number of items removed by DynamoDB TTL. You can also access this metric via the GetMetricStatistics API or the AWS CLI. As with other CloudWatch metrics, you can set alarms to notify you or trigger actions based on the metric breaching a specific threshold (e.g., if too many or not enough deletes are happening).
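As an example of such an alarm, this Python sketch builds parameters for the CloudWatch PutMetricAlarm API that fire when too few TTL deletes occur in an hour. The alarm name, the one-hour period, and the threshold are assumptions to adapt to your own delete rate.

```python
def build_ttl_alarm(table_name, min_deletes_per_hour):
    """Build an alarm that fires when hourly TTL deletes drop below a floor."""
    return {
        "AlarmName": table_name + "-ttl-deletes-too-low",  # hypothetical name
        "Namespace": "AWS/DynamoDB",
        "MetricName": "TimeToLiveDeletedItemCount",
        "Dimensions": [{"Name": "TableName", "Value": table_name}],
        "Statistic": "Sum",
        "Period": 3600,  # evaluate one hour of deletes at a time
        "EvaluationPeriods": 1,
        "Threshold": float(min_deletes_per_hour),
        "ComparisonOperator": "LessThanThreshold",
        # Treat an hour with no data points as a breach as well.
        "TreatMissingData": "breaching",
    }


def create_alarm(params):
    """Create or update the alarm; needs boto3 and AWS credentials."""
    import boto3  # lazy import so the builder works without the SDK

    boto3.client("cloudwatch").put_metric_alarm(**params)
```

A second alarm with GreaterThanThreshold would cover the opposite case, where a bad expiration time stamp causes far more deletes than expected.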
Now that you’re seeing TTL deletes in the TimeToLiveDeletedItemCount metric, you can check that the workflow is archiving items as expected: that is, that TTL items are put on the DynamoDB stream, processed in batches by the Lambda function, forwarded to the Firehose delivery stream, batched into objects, and loaded into S3.
Go to the Lambda console and look at the Triggers tab of your Lambda function, which shows the configuration of the trigger along with the result of the last batch. This should say OK. If it doesn’t, look at the Lambda CloudWatch Logs to troubleshoot the configuration.
Next, go to the Amazon Kinesis console and navigate to your Firehose delivery stream. Then look at the Monitoring tab. You should start to see bytes and records flowing in. In this example, there were a large number of records built up in the stream, so there is a large spike at first.
Finally, go to the Amazon S3 console and look in your destination S3 bucket. After 15 minutes or so, you should see objects being written into folders in the format Year/Month/Day/Hour. The following screenshot shows objects in the example bucket.
If you look at one of those files, you see one JSON-formatted stream item per line. The example data in this case is parsed web logs.
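Because each line is a standalone JSON document, an archived object is straightforward to read back. The following Python sketch parses the contents of one object after you have downloaded it; the gzipped flag is an assumption for the case where you enabled GZIP compression on the delivery stream, and the function name is my own.

```python
import gzip
import json


def parse_archive_object(raw_bytes, gzipped=False):
    """Parse an archived S3 object holding one JSON stream item per line."""
    data = gzip.decompress(raw_bytes) if gzipped else raw_bytes
    items = []
    for line in data.decode("utf-8").splitlines():
        if line.strip():  # skip blank lines, such as a trailing newline
            items.append(json.loads(line))
    return items
```

The function returns a list of dictionaries, one per archived stream item, which you can then filter or load into another tool.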
Now you’ve built a scalable solution to archive older items out of DynamoDB into S3, and validated that it works!
If you are new to processing data in S3, I would recommend checking out Amazon Athena, which enables you to query S3 with standard SQL without having to build or manage clusters or other infrastructure. For a walkthrough to help you get started with Athena, see Amazon Athena – Interactive SQL Queries for Data in Amazon S3 on the AWS Blog.