Optimize Amazon DynamoDB scan latency through schema design
In this post, we demonstrate how Amazon DynamoDB table structure can affect scan performance and offer techniques for optimizing table scan times.
Amazon DynamoDB is a NoSQL database that allows for a flexible schema. This means that items in the same table may differ from each other in terms of what attributes are present for each item.
Most DynamoDB schemas and access patterns are oriented and optimized around the GetItem and Query operations, which provide consistent, single-digit millisecond response times when accessing single items from a table or index. However, some use cases and access patterns require scanning a table or index.
In a database with a flexible schema, every item returned from a scan carries not only the data but also metadata, including the attribute name and data type of each attribute.
Each additional attribute also adds client-side overhead, because every attribute in the network response must be unmarshaled into the appropriate client data structure (a Python dictionary, a Node.js map, a Java object, and so on).
Because attribute metadata consumes space, fewer items fit into the 1-MB limit of a DynamoDB response. Consequently, scanning data requires more round trips.
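Because DynamoDB paginates responses at 1 MB, a full scan must follow `LastEvaluatedKey` across pages, and each page is one round trip. Here is a minimal sketch using the boto3 Table resource; the helper name and table handle are illustrative, not from the original benchmark:

```python
def scan_all(table, **scan_kwargs):
    """Yield every item in the table, following LastEvaluatedKey
    across successive 1-MB response pages (one round trip each)."""
    start_key = None
    while True:
        if start_key is not None:
            scan_kwargs["ExclusiveStartKey"] = start_key
        page = table.scan(**scan_kwargs)
        yield from page.get("Items", [])
        start_key = page.get("LastEvaluatedKey")
        if start_key is None:
            break  # no more pages

# Usage (assumes boto3 credentials and an existing table):
# import boto3
# table = boto3.resource("dynamodb").Table("my-table")
# items = list(scan_all(table))
```

The more attribute-name metadata each item carries, the fewer items fit per 1-MB page, and the more iterations of this loop (round trips) a full scan requires.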
We created one table with a simple structure: a primary key consisting of a partition key and a sort key (both are strings). There is also a third string attribute named field1 containing a string of 144 random characters.
We also created other tables with different combinations of 7-character attribute names (field01… field24) with both 3- and 6-character attribute values. These tables have the same primary key structure as the first.
NoSQL databases with flexible schemas must store the attribute name with each item. As items have more attributes, or attribute names get longer, they consume more space for storing attribute names.
Finally, we created another table with 24 attributes, each of which has the same 7-character attribute names and 100-character attribute values.
We take the following measurements for each table structure:
- How long it takes to insert 10,000 items.
- How long it takes to scan 10,000 items.
- How many items fit in the 1-MB limit. DynamoDB caps the amount of data retrieved by a single request at 1 MB, so we ran scans that retrieved the full 1 MB.
- How long it takes to retrieve and unmarshal those items on the client.
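The measurement loop can be sketched roughly as follows. Timing is taken on the client with `time.perf_counter`; the table handle is any boto3 Table resource, and the function names are ours, not from the original benchmark code:

```python
import time

def time_write(table, items):
    """Return milliseconds taken to batch-write the given items."""
    start = time.perf_counter()
    with table.batch_writer() as batch:
        for item in items:
            batch.put_item(Item=item)
    return (time.perf_counter() - start) * 1000.0

def time_scan(table):
    """Return (milliseconds, item_count) for a full paginated scan,
    including network transfer and client-side unmarshaling."""
    start = time.perf_counter()
    count, start_key = 0, None
    while True:
        kwargs = {"ExclusiveStartKey": start_key} if start_key else {}
        page = table.scan(**kwargs)
        count += len(page.get("Items", []))
        start_key = page.get("LastEvaluatedKey")
        if start_key is None:
            break
    return (time.perf_counter() - start) * 1000.0, count
```

Because the clock wraps the whole loop, the scan measurement captures pagination round trips and unmarshaling cost, which is exactly the overhead this post examines.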
The following table summarizes the results.
|Table Structure|Total Size (MB)|Time to Write 10,000 Items (ms)|Time to Scan 10,000 Items (ms)|Single-Threaded Throughput (MB/s)|Time to Scan 1 MB (ms)|# Items in 1-MB Scan|
|---|---|---|---|---|---|---|
|1 144-character data attribute|2.1|5,057|569|3.7|238|4,969|
|24 3-character data attributes|3.0|9,359|2,392|1.2|797|3,496|
|24 6-character data attributes|3.7|9,928|2,391|1.5|682|2,819|
|24 100-character data attributes|26.3|27,553|2,819|9.4|110|400|
These particular numbers are from a Python client written for this benchmarking test. Other programming languages like Java and Node.js show similar performance characteristics from having many item attributes.
We did this timing from the client side of a Python process. Latency shown in Amazon CloudWatch metrics will differ, because those metrics don't account for network transfer or client-side item unmarshaling.
Also, the throughput numbers in the fourth data column are for single-threaded scans. Using parallel scans would enable us to drive much higher throughput.
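A parallel scan splits the table into logical segments using the `Segment` and `TotalSegments` parameters of the Scan API, with one worker per segment. A sketch with a thread pool (the helper name and segment count are illustrative):

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_scan(table, total_segments=4):
    """Scan all segments of a table concurrently and
    return the combined list of items."""
    def scan_segment(segment):
        items, start_key = [], None
        while True:
            kwargs = {"Segment": segment, "TotalSegments": total_segments}
            if start_key is not None:
                kwargs["ExclusiveStartKey"] = start_key
            page = table.scan(**kwargs)
            items.extend(page.get("Items", []))
            start_key = page.get("LastEvaluatedKey")
            if start_key is None:
                break
        return items

    with ThreadPoolExecutor(max_workers=total_segments) as pool:
        results = pool.map(scan_segment, range(total_segments))
    return [item for segment_items in results for item in segment_items]
```

Each worker pages through only its own segment, so aggregate throughput scales roughly with the segment count, subject to provisioned (or on-demand) read capacity.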
In the first two rows in the table in the Empirical results section, you can see that writing items with more attributes to the table takes nearly twice as long. Scanning those items back takes nearly four times as long. This is because each attribute in every item needs to be marshaled on the server and unmarshaled on the client.
From the last two rows in the table, you can see how the total size of the 10,000 objects affects the scan time, because both tests have the same 24 attributes with 7-character attribute names. Row 3 has attribute values that are six characters long, while row 4 has attribute values that are 100 characters long. It does take longer (three times as long) to write the bigger data, but only marginally longer (18%) to scan back the items that are significantly larger.
This supports our conclusion that the number of attributes, and subsequent marshaling and unmarshaling, is the biggest driver in the longer scan times. However, it’s not always realistic to have only three attributes in your DynamoDB table, as you may have other attributes on which to index, filter, atomically increment, and so on.
The key takeaway here is to have only as many attributes as you actually need for operations at the database level. To reduce the amount of attribute-name metadata, combine the rest of your data into a single attribute (perhaps as a JSON blob).
A corollary to this is that because attribute names consume space both on disk and in every network response, you should strive for shorter attribute names.
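These two recommendations, keep only the attributes you operate on at the database level and fold the rest into one short-named attribute, can be sketched as follows. The attribute names and helper functions here are our own illustration, not a prescribed layout:

```python
import json

def pack_item(pk, sk, indexed, payload):
    """Keep only attributes needed at the database level as
    top-level attributes; fold everything else into one JSON
    blob stored under a deliberately short attribute name."""
    item = {"pk": pk, "sk": sk}
    item.update(indexed)             # attributes you index, filter, or increment on
    item["d"] = json.dumps(payload)  # everything else, as a single blob
    return item

def unpack_item(item):
    """Restore the combined attributes on the client side."""
    out = {k: v for k, v in item.items() if k != "d"}
    out.update(json.loads(item["d"]))
    return out
```

With this layout, DynamoDB stores and transmits the name metadata for a handful of attributes per item instead of dozens, which is the effect the benchmark above measured.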
About the Authors
Chad Tindel is a DynamoDB Specialist Solutions Architect based out of New York City. He works with large enterprises to evaluate, design, and deploy DynamoDB-based solutions. Prior to joining Amazon he held similar roles at Red Hat, Cloudera, MongoDB, and Elastic.
Mat Werber is a Solutions Architect on the AWS Community SA Team and is responsible for providing architectural guidance across the full AWS stack with a deep focus on Serverless, Redshift, DynamoDB, and RDS. He also has a background in IT and financial audit and public accounting.
Daniel Yoder is an LA-based senior NoSQL Specialist Solutions Architect at Amazon Web Services.