AWS News Blog

Amazon DynamoDB – Parallel Scans, 4x Cheaper Reads, Other Good News

UPDATE (May 5, 2018)

The capacity management capabilities of Amazon DynamoDB were enhanced after this blog post was published. As a result, the post references information that may no longer be the most accurate or a best practice. Please read the DynamoDB documentation on Best Practices for Designing and Using Partition Keys Effectively to learn more.

We continue to make improvements, large and small, to Amazon DynamoDB. In addition to a new parallel scan feature, you can now change your provisioned throughput more quickly. We are also changing the way that we measure read capacity in a way that will reduce your costs by up to 4x for certain types of queries and scans.

Parallel Scans
As you may know, DynamoDB stores your data across multiple physical storage partitions for rapid access. The throughput of a DynamoDB Scan operation is constrained by the maximum throughput of a single partition. In some cases, this means that a Scan cannot take advantage of the table’s full provisioned read capacity.

In order to give you the ability to retrieve data from your DynamoDB tables more rapidly, we are introducing a new parallel scan model today. To make use of this feature, you will need to run multiple worker threads or processes in parallel. Each worker will be able to scan a separate segment of a table concurrently with the other workers. DynamoDB’s Scan function now accepts two additional parameters:

  • TotalSegments denotes the number of workers that will access the table concurrently.
  • Segment denotes the segment of the table to be accessed by the calling worker.

Let’s say you have 4 workers. You would issue the following calls simultaneously to initiate a parallel scan:

  • Scan(TotalSegments=4, Segment=0, …)
  • Scan(TotalSegments=4, Segment=1, …)
  • Scan(TotalSegments=4, Segment=2, …)
  • Scan(TotalSegments=4, Segment=3, …)

The two parameters, when used together, limit the scan to a particular block of items in the table. You can also use the existing Limit parameter to control how much data is returned by an individual Scan request.
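
Here's what one of those workers might look like with the low-level Scan API in the AWS SDK for Java (a minimal sketch; the table name, page size, and client setup are placeholders):

    import java.util.Map;
    import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClient;
    import com.amazonaws.services.dynamodbv2.model.AttributeValue;
    import com.amazonaws.services.dynamodbv2.model.ScanRequest;
    import com.amazonaws.services.dynamodbv2.model.ScanResult;

    public class SegmentWorker {
        // Scan one segment of the table. A real application would run one of
        // these per worker thread, with segment values 0 through 3.
        public static void scanSegment(AmazonDynamoDBClient client, int segment) {
            Map<String, AttributeValue> lastKey = null;
            do {
                ScanRequest request = new ScanRequest()
                    .withTableName("MyTable")        // placeholder table name
                    .withTotalSegments(4)            // total number of workers
                    .withSegment(segment)            // this worker's segment
                    .withLimit(100)                  // optional page size
                    .withExclusiveStartKey(lastKey); // resume after the last page
                ScanResult result = client.scan(request);
                // process result.getItems() here
                lastKey = result.getLastEvaluatedKey();
            } while (lastKey != null);               // null means the segment is done
        }
    }

Each worker simply keeps issuing Scan requests for its own segment until no LastEvaluatedKey is returned.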

The AWS SDK for Java comes with high-level support for parallel scan: DynamoDBMapper implements a new parallelScan method that handles the threading and pagination of the individual segments, making it even easier to try out this new feature.
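
For example, a four-segment parallel scan through the mapper might look like this (a sketch; MyItem is a hypothetical mapper class and the client configuration is omitted):

    import java.util.List;
    import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClient;
    import com.amazonaws.services.dynamodbv2.datamodeling.DynamoDBMapper;
    import com.amazonaws.services.dynamodbv2.datamodeling.DynamoDBScanExpression;

    public class ParallelScanExample {
        public static void main(String[] args) {
            // MyItem is a hypothetical class annotated with
            // @DynamoDBTable(tableName = "MyTable") and a @DynamoDBHashKey getter.
            DynamoDBMapper mapper = new DynamoDBMapper(new AmazonDynamoDBClient());

            // Scan with 4 segments; the mapper starts the worker threads and
            // handles pagination within each segment for you.
            List<MyItem> items = mapper.parallelScan(MyItem.class,
                                                     new DynamoDBScanExpression(), 4);
            System.out.println("Retrieved " + items.size() + " items");
        }
    }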

To learn more about the parallel scan model, read the conceptual introduction and the best practices guide.

Provisioned Throughput Changes
You can now change the provisioned throughput of a particular DynamoDB table up to four times per day (the previous limit was twice per day). This will allow you to react more quickly to changes in load.
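
As a reminder, a throughput change is just an UpdateTable call. A minimal sketch with the AWS SDK for Java (the table name and capacity values are placeholders):

    import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClient;
    import com.amazonaws.services.dynamodbv2.model.ProvisionedThroughput;
    import com.amazonaws.services.dynamodbv2.model.UpdateTableRequest;

    public class UpdateThroughputExample {
        public static void main(String[] args) {
            AmazonDynamoDBClient client = new AmazonDynamoDBClient();

            // Raise the table's provisioned read and write capacity (placeholder values).
            client.updateTable(new UpdateTableRequest()
                .withTableName("MyTable")
                .withProvisionedThroughput(new ProvisionedThroughput()
                    .withReadCapacityUnits(200L)
                    .withWriteCapacityUnits(100L)));
        }
    }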

Read Capacity Metering
We are changing the way that we measure read capacity. With this change, a single read capacity unit will allow you to do 1 read per second for an item up to 4 KB (formerly 1 KB). In other words, larger reads cost one-fourth as much as they did before.
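
To put some rough numbers on that (assuming strongly consistent reads): retrieving 40 KB of data per second previously required 40 read capacity units under the old 1 KB metering, but needs only 10 units under the new 4 KB metering.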

This change is being rolled out across all AWS Regions over the next week. Don’t be alarmed if you see that your consumed capacity graph shows a lot less capacity than before.

With this change, scanning your DynamoDB table, running queries against your tables, copying data to Redshift using the DynamoDB/Redshift integration, and using Elastic MapReduce to query or export your tables are all more cost-effective than ever before.

I hope that you can make good use of the new parallel scan model, and that the other two changes are of value to you as well.

— Jeff;

Jeff Barr

Jeff Barr is Chief Evangelist for AWS. He started this blog in 2004 and has been writing posts just about non-stop ever since.