Efficiently compare items across two Amazon DynamoDB tables

In this post, we show an algorithm to efficiently compare two Amazon DynamoDB tables and find the differences between their items. You can find an implementation of this algorithm within the Bulk Executor for DynamoDB, an open source command line utility for performing bulk tasks against DynamoDB.

We provide an example where two tables, each containing approximately half a billion items, are compared in less than 7 minutes, for less than $10.

The need for comparing tables

The need to compare (or “diff”) table contents can arise when you want to double-check your table actions are happening as expected. For example:

You’re migrating a table across AWS accounts and want to verify that no data was lost. With cross-account access enabled, you can compare the original and migrated tables directly.
You perform a point-in-time recovery of a table and want to compare the restored version to the original. Because they are at different timestamps, some differences might be expected and need to be efficiently identified. You can also use incremental export to identify changes to a table during the time period, but you want to compare the live tables.
You performed an “export, modify, import cycle” where you exported a table to Amazon Simple Storage Service (Amazon S3), possibly modified the content in Amazon S3, and imported the data into a new table. You then want to cross-check the new table with the old table to have a definitive record of all changes made during offline processing.
You regularly propagate data from one table to another using a custom process (such as stream processing) and want anti-entropy to check that you don’t have issues causing missing or malformed data.

You can perform this comparison by using a scan against each table and identifying when the item sequence diverges. On finding a divergence, walk forward to see which table has extra items. This algorithm runs in linear time, is highly parallelizable using segmented scans, and uses only a small amount of memory. We will dive deeper into this algorithm in the next sections.

Overview of solution

To understand the algorithm, you need to first understand how partitions and data distribution work in DynamoDB:

Partition keys are consistently hashed – DynamoDB uses the same hash function for all tables—the same partition key value will land in the same part of the keyspace regardless of table, AWS Region, or account.
Scans are ordered – The results from a scan operation are returned in keyspace order—sorted by hashed partition key and then by sort key. Two tables with the same contents will produce the same scan sequence.
Parallel scans divide the keyspace predictably – The keyspace will be divided into your choice of total segments. Items appearing within a given segment in one table will always appear in the same segment on the other table if the same value for total segments is used for both tables.

This means you can compare two tables efficiently by comparing their scan sequences. You can use parallel scan to compare scan sequences of segments independently, and in parallel.

The comparison algorithm: Comparing scan sequences

DynamoDB scan is a paginated operation. We create an abstraction over this to make a scan appear as a stream (a continuous set of items from beginning to end). We can perform the following operations on this stream:

Cursor tracking – Remembers where we are in the stream processing and can retrieve the item at this point (sometimes called the head of the stream)
Advancing – Moves the cursor forward to the next item in the stream (transparently loading the next scan page if necessary)
Peeking – Looks ahead in the stream without advancing the cursor

We will call this abstraction a scan stream.

The following steps illustrate the exact logic of the comparison algorithm:

Compare the schemas of the two tables. Only proceed if they’re identical.
Initialize two scan streams, one for each table.
Compare the head of each stream. If they are fully equal (partition key, sort key, and attributes), then advance both cursors. This is the common case for mostly identical tables.
If the items have identical partition keys and sort keys but differ in attributes, mark the item as changed, and advance both cursors.
If the items have the same partition key but different sort keys, that means we are processing an item collection with differences.
1. Whichever item has the lower sort key value is missing from the other table. Mark it as missing. Advance only that stream. Repeat this logic back and forth until the sort keys match up again.
2. If the partition key changes after advancing, that stream’s item collection has ended. Any unprocessed items in the other table’s item collection are missing. Mark them as missing and advance through them. Leave both cursors at the first item after the item collection. Start the comparison loops again.
If the items have different partition keys, it’s ambiguous which records are missing. Peek ahead in both streams to find the first shared partition key. The items preceding the shared partition key in each stream are known to be missing from the opposite table. Mark these as missing. Resume comparison at the first shared partition key items. (Note that when peeking ahead, only the distinct partition key values need to be retained, so this is memory efficient even for large peeks, although a peek-ahead limit could also be imposed. Using a large number of segments also limits the possible peek-ahead size per segment.)
When one stream reaches the end, mark all remaining items in the other stream as missing.

This algorithm works the same if the source is a table or a segment of a table.

Running the comparison code

This algorithm implemented within the Bulk Executor for DynamoDB uses AWS Glue to run hundreds of segmented scans in parallel against the two tables and collect the output to the command line or to Amazon S3.

After you’ve bootstrapped the Bulk Executor according to its documentation, you can perform a comparison with a single command. The following command sample compares two nearly identical tables. Each table is about 180 GB with about half a billion items. The run completes in 6.5 minutes:

./bulk diff --table before-table --table2 after-table

Starting DynamoDB Bulk Executor Glue Job...
DynamoDB Bulk Executor Glue Job started with job run ID: jr_480094f9a0b6cfed78d42f1b5b81f1eb5f61de7de930839d508a6160c7e19f78
Press Ctrl+C to cancel DynamoDB Bulk Executor Glue Job.
Table: 'before-table'
Billing mode: On-demand

Table Item Count (approx): 558,298,362
Table Size (approx): 179,194,713,322 bytes

DynamoDB read costs depend on the table size.
DynamoDB read units required for a full scan (approx): 22,133,735
Approx DynamoDB cost for on-demand scan consuming 22,133,735 RRUs (using us-east-1 prices): $2.77


Table: 'after-table'
Billing mode: On-demand

Table Item Count (approx): 558,298,362
Table Size (approx): 179,194,713,322 bytes

DynamoDB read costs depend on the table size.
DynamoDB read units required for a full scan (approx): 22,133,735
Approx DynamoDB cost for on-demand scan consuming 22,133,735 RRUs (using us-east-1 prices): $2.77

TOTAL DynamoDB cost for scanning both tables (approx): $5.53


* {'pk': {'S': 'xu3u2g3bqmqpn'}, 'sk': {'S': 'abc'}}
- {'pk': {'S': 'l7uqwzb7yym9j'}, 'sk': {'S': 'abc'}}
+ {'pk': {'S': 'NewPK'}, 'sk': {'S': 'NewSK'}}

Job completed successfully. Job duration: 0:06:28

This output indicates one changed item (*), one removed item (-), one added item (+). By default, the output contains only the keys of the added, removed, and changed items. You can also ask for the full difference:

./bulk diff --table before-table --table2 after-table \
            --format full

- {'pk': {'S': 'xu3u2g3bqmqpn'}, 'sk': {'S': 'abc'}, 'payload': {'S': '9pen'}}
+ {'pk': {'S': 'xu3u2g3bqmqpn'}, 'sk': {'S': 'abc'}, 'payload': {'S': '9pen!'}}
- {'pk': {'S': 'l7uqwzb7yym9j'}, 'sk': {'S': 'xyz'}, 'payload': {'S': 'hu64'}}
+ {'pk': {'S': 'NewPK'}, 'sk': {'S': 'NewSK'}, 'payload': {'S': 'NewPayload'}}

The first two lines show the same item as it was before and after (with the added !). The other two lines show the complete added and removed items.

When results get large, you can specify to store the results in Amazon S3.

Cost considerations

Performing this algorithm requires a full scan of each table. The Bulk Executor provides an estimate of the read costs on startup based on the table metrics. To confirm the pricing yourself, an eventually consistent scan of 1 GB of table data requires about 125,000 read request units. Assuming on-demand mode in us-east-1, the cost is $0.125 per million read request units. The cost per GB scanned then is around $0.016. For the preceding tables, it was $2.77 for each table. For long-running bulk processing, consider using provisioned mode and possibly timing the activity to lower costs.

The Bulk Executor tool also consumes AWS Glue resources in the form of Data Processing Units (DPUs) based on runtime and the number of workers allocated to the job. You can add --XWaitForDPU to the bulk command to gather the DPU costs at the end. That changes the final line of output:

Job completed successfully. Job duration: 0:06:28 (8.98 DPU hours)

With pricing at $0.44 per DPU hour in us-east-1, the AWS Glue cost was about $3.95 for this comparison.

Conclusion

In this post, we detailed an algorithm to efficiently compare two DynamoDB tables and find the differences between their items. This algorithm runs in linear time, is highly parallelizable, and uses minimal memory.The algorithm is implemented in the Bulk Executor for DynamoDB tool. If you need to compare two tables, you have a fast and scalable way to do this. In this post, we showed that two tables, each containing around half a billion items, were compared in less than 7 minutes with a cost of less than $10 using this tool.

To get started with the Bulk Executor tool, visit the installation documentation. To learn more about DynamoDB, see the Amazon DynamoDB Developer Guide.

AWS Database Blog

Efficiently compare items across two Amazon DynamoDB tables

The need for comparing tables

Overview of solution

The comparison algorithm: Comparing scan sequences

Running the comparison code

Cost considerations

Conclusion

About the authors

Resources

Blog Topics

Follow

Learn

Resources

Developers

Help