Introducing incremental export from Amazon DynamoDB to Amazon S3
Amazon DynamoDB is a key-value and document database that delivers single-digit millisecond performance at any scale. It’s a fully managed, multi-Region, multi-active, durable database with built-in security, backup and restore, and in-memory caching for internet-scale applications.
In 2020, DynamoDB introduced a feature to export DynamoDB table data to Amazon Simple Storage Service (Amazon S3) without writing any code. It works without having to manage servers or clusters and lets you export data from any point in time in the last 35 days at per-second granularity. Plus, it doesn’t affect the read capacity or the availability of your production tables. After your data is exported to Amazon S3 (in DynamoDB JSON or Amazon Ion format), you can query or reshape it with your favorite tools such as Amazon Athena, Amazon SageMaker, AWS Lake Formation, and Amazon Redshift.
Today we are launching a new capability: incremental export to Amazon S3.
In this post, we show how to use incremental exports to update your downstream systems regularly using only the changed data. You no longer need to do a full export each time you need fresh data. The incremental export feature exports only the data items that have been inserted, updated, or deleted between two specified points in time. You can now build change data capture (CDC) pipelines more efficiently and cost-effectively.
To get started, ensure your table has point-in-time-recovery (PITR) enabled, which is needed for any full or incremental export. You can then use the AWS Management Console, AWS Command Line Interface (AWS CLI), or SDK to initiate the incremental export. Pick an export period for which you want to export changed data, from 15 minutes up to 24 hours. The following screenshot shows the options when initiating an incremental export on the DynamoDB console.
The export is compacted such that each item that was changed during the selected period is exported at most once, providing the final view of the item. Each item’s output includes a timestamp that represents when that item was modified, followed by a data structure that indicates if the modification was an insert, update, or delete. In the output, you can select to see only the new image or both the new and old images. For updates, this can be helpful if you want to see how the item was changed. It can also be useful if you need the old image to find the item in the downstream system.
If you are using the AWS CLI, you can perform an incremental export by providing the new --incremental-export-specification option, as shown in the following code. Substitute your own values in the placeholders. Times are specified as seconds since epoch.
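For readers who start exports from the SDK rather than the CLI, the same request can be sketched in Python. This is a minimal sketch, not the post's CLI listing: the helper names `build_incremental_export_request` and `start_incremental_export` are hypothetical, the ARN and bucket in the usage note are placeholders, and it assumes boto3 is installed with credentials configured. The 15-minute and 24-hour bounds come from the export period limits described earlier.

```python
from datetime import datetime, timedelta, timezone

# Export period limits stated in this post.
MIN_WINDOW = timedelta(minutes=15)
MAX_WINDOW = timedelta(hours=24)


def build_incremental_export_request(table_arn, bucket, start, end,
                                     view_type="NEW_AND_OLD_IMAGES",
                                     prefix=None):
    """Build keyword arguments for an incremental export request.

    Hypothetical helper: it only assembles and validates the parameters.
    """
    if start.tzinfo is None or end.tzinfo is None:
        raise ValueError("start and end must be timezone-aware datetimes")
    if not MIN_WINDOW <= end - start <= MAX_WINDOW:
        raise ValueError("export period must be between 15 minutes and 24 hours")
    if view_type not in ("NEW_IMAGE", "NEW_AND_OLD_IMAGES"):
        raise ValueError("unknown view type: %r" % view_type)
    request = {
        "TableArn": table_arn,
        "S3Bucket": bucket,
        "ExportFormat": "DYNAMODB_JSON",
        "ExportType": "INCREMENTAL_EXPORT",
        "IncrementalExportSpecification": {
            "ExportFromTime": start,  # inclusive
            "ExportToTime": end,      # exclusive
            "ExportViewType": view_type,
        },
    }
    if prefix:
        request["S3Prefix"] = prefix
    return request


def start_incremental_export(request):
    """Submit the request and return the export ARN (requires boto3)."""
    import boto3  # third-party; imported lazily so the builder works without it

    client = boto3.client("dynamodb")
    response = client.export_table_to_point_in_time(**request)
    return response["ExportDescription"]["ExportArn"]
```

A call might look like `start_incremental_export(build_incremental_export_request(table_arn, "example-bucket", start, start + timedelta(hours=1)))`, where `table_arn` is your table's ARN and `start` is a timezone-aware datetime such as `datetime(2023, 9, 20, 6, 0, tzinfo=timezone.utc)`.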
You can check the status of your export using the DynamoDB console or the describe-export AWS CLI command. The following screenshot shows a list of recent exports on the console.
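The status check can also be automated. The following is a minimal polling sketch, assuming you pass in a callable such as the boto3 DynamoDB client's `describe_export` method. The function name `wait_for_export`, the polling interval, and the timeout are illustrative choices, not values from this post.

```python
import time


def wait_for_export(export_arn, describe, poll_seconds=30,
                    timeout_seconds=3600, sleep=time.sleep):
    """Poll until the export completes, fails, or the timeout passes.

    `describe` is any callable that accepts ExportArn=... and returns a
    response shaped like boto3's describe_export, for example
    boto3.client("dynamodb").describe_export.
    """
    deadline = time.monotonic() + timeout_seconds
    while True:
        description = describe(ExportArn=export_arn)["ExportDescription"]
        status = description["ExportStatus"]
        if status == "COMPLETED":
            return description
        if status == "FAILED":
            raise RuntimeError(
                "export failed: %s" % description.get("FailureMessage", "unknown"))
        if time.monotonic() >= deadline:
            raise TimeoutError("export still %s after timeout" % status)
        sleep(poll_seconds)
```

Passing the describe call in as a parameter keeps the loop easy to exercise with a stub before pointing it at a real table.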
The following screenshot shows a detailed description of a single export in the console.
The following is an example S3 folder holding one full export and two subsequent incremental exports:
The original full export folder includes its own metadata and data subfolder. The incremental exports include their own metadata and a shared data subfolder across all incremental exports. Within the metadata folder is a manifest of what was written for that export.
Exports don’t depend on each other. You can request an incremental export without having already done a full export, for example, and you can request more than one incremental export covering the same time period, so long as the start and end times are within the PITR window (which goes back up to 35 days, but no earlier than when you enabled PITR). Exports using the same time parameters will always include the same data in the output. To produce a contiguous view, you will generally want your exports to start with a full export, followed by a series of incremental exports that all share time boundaries.
The incremental export data file format differs from the full export format and includes more metadata. The following code is a portion of an incremental export data file. Each item provides the timestamp (in microseconds) of when the item was last changed during the incremental export window, the primary key of the item, the new image (for inserts and updates), and the old image (for updates and deletes, if you’ve selected to include old images as well). If an item underwent multiple modifications during the incremental time window, the old image is the item image before the start of the export, the new image is the newest, and the timestamp is when the item was made to match the new image.
Whether the change was an insert, update, or delete can be inferred from the structure. If you’ve selected an incremental export with new images only, then an insert and update will look the same with the timestamp, primary key, and new image, and a delete will include no new image. If you’ve selected new and old images, then an insert will have a new image, an update will have both old and new images, and a delete will just have an old image. The following table summarizes the output structure of each operation.
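| Operation | New Images Only | New and Old Images |
|---|---|---|
| Insert | Keys + new image | Keys + new image |
| Update | Keys + new image | Keys + old image + new image |
| Delete | Keys | Keys + old image |
| Insert + Delete | No output | No output |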
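The inference rules in the table above can be sketched as a small function. This is an illustration only: it assumes each exported record has been parsed into a dictionary whose images appear under the keys `NewImage` and `OldImage`, and the function name `classify_change` is hypothetical.

```python
def classify_change(record, view_type):
    """Infer the operation from which images are present in a record.

    Assumes the images are stored under "NewImage" and "OldImage".
    view_type is "NEW_IMAGE" or "NEW_AND_OLD_IMAGES".
    """
    has_new = "NewImage" in record
    has_old = "OldImage" in record
    if view_type == "NEW_IMAGE":
        # Inserts and updates look the same when only new images are exported.
        return "insert_or_update" if has_new else "delete"
    if view_type != "NEW_AND_OLD_IMAGES":
        raise ValueError("unknown view type: %r" % view_type)
    if has_new and has_old:
        return "update"
    if has_new:
        return "insert"
    if has_old:
        return "delete"
    raise ValueError("record has neither a new nor an old image")
```

With new images only, the function deliberately returns a combined `insert_or_update` label, because the output alone cannot distinguish the two.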
The following is a sample output for new and old images:
Time periods for exports are inclusive on the start time and exclusive on the end time. That means that to produce hourly incremental exports, you can do a one-time bootstrapping full export at some arbitrary clean time, such as 6:00:00 on some day, and initiate ongoing incremental exports for each future hour from 6:00:00 to 7:00:00, 7:00:00 to 8:00:00, and so on, and have no overlap. You can use Amazon EventBridge to schedule these incremental actions.
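Because each window includes its start and excludes its end, consecutive windows can share a boundary without exporting any change twice. The following sketch generates such windows from a bootstrap time. The function names are hypothetical, and the one-hour step mirrors the example above.

```python
from datetime import datetime, timedelta, timezone


def export_windows(bootstrap, now, step=timedelta(hours=1)):
    """Yield (start, end) pairs for every complete window since bootstrap.

    Each window is [start, end): the end of one window is the start of
    the next, so the windows neither overlap nor leave gaps.
    """
    start = bootstrap
    while start + step <= now:
        yield start, start + step
        start += step


def to_epoch_seconds(moment):
    """Convert a timezone-aware datetime to whole seconds since epoch."""
    return int(moment.timestamp())
```

For a bootstrap at 6:00:00 and a current time of 9:30:00, this yields the windows 6:00 to 7:00, 7:00 to 8:00, and 8:00 to 9:00. The 9:00 to 10:00 window is not produced until it has fully elapsed.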
Pricing is based on the size of data processed to create each incremental export. This size is based on the volume of change logs a table generates. If you have a very active table with many writes, it will have more change logs, and the incremental export cost will be proportional to that activity. For pricing, see Amazon DynamoDB pricing.
Note that any tooling built to read the full export format will not naturally be able to read the combination of a full export plus the series of incremental exports. To facilitate downstream access, you can convert the series of exports into a single destination format such as an Apache Iceberg table. You can use Amazon EMR with Apache Spark to bulk process the exports and keep the Iceberg table up to date.
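To make the merge logic concrete, the following is a small in-memory sketch of applying one incremental export to an existing view. It is not the Spark or Iceberg pipeline mentioned above. It assumes the data file has already been decompressed into lines of DynamoDB JSON, one record per line, with the primary key under `Keys` and the new image under `NewImage`. Those field names and the function name `apply_incremental` are assumptions for illustration.

```python
import json


def apply_incremental(view, lines):
    """Apply one incremental export to `view`, a dict keyed by primary key.

    Each line is assumed to be a JSON object with "Keys" and, for inserts
    and updates, "NewImage". A record without "NewImage" is a delete.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        # Serialize the key attributes deterministically for use as a dict key.
        key = json.dumps(record["Keys"], sort_keys=True)
        if "NewImage" in record:
            view[key] = record["NewImage"]  # insert or update
        else:
            view.pop(key, None)             # delete
    return view
```

Because each export is compacted to at most one record per item, the order of records within a single export does not matter. When applying several exports, process them in chronological order so that later changes win.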
The DynamoDB incremental export to Amazon S3 feature enables you to update your downstream systems regularly using only the incrementally changed data. This new feature is available in all commercial AWS Regions and the AWS GovCloud (US) Regions.
About the Authors
Jason Hunter is a California-based Principal Solutions Architect specializing in Amazon DynamoDB. He’s been working with NoSQL databases since 2003. He’s known for his contributions to Java, open source, and XML.
Shahzeb Farrukh is a Seattle-based Senior Product Manager at AWS DynamoDB. He works on DynamoDB’s data protection features like backups and restores, and data movement capabilities that help customers integrate their data with other services. He has been working with databases and analytics since 2010.