Optimizing Provisioned Throughput in Amazon DynamoDB
David Yanacek of the Amazon DynamoDB team is back with another guest post, this one on the topic of optimizing the use of DynamoDB’s unique provisioned throughput feature.
While I’m on the topic of DynamoDB, I should mention that we will be running an 8-hour DynamoDB bootcamp session at re:Invent. You’ll need to have some development experience with a relational or non-relational database system in order to benefit.
DynamoDB offers scalable throughput and storage by horizontally partitioning your data across a sufficient number of servers to meet your needs. When you need more throughput for your table, you simply use the AWS Management Console or call an API. When your table grows in size, DynamoDB automatically adds more partitions for storing your growing dataset.
With traditional databases, you may be accustomed to buying larger and larger databases as your demands grow. This is known as a “scale-up” solution. However, when your dataset grows too large for even the largest database server, you must “scale-out” your database, implementing logic in your application to route each query to the correct database server. DynamoDB offers you this “scale-out” architecture, while handling all the complex components that are needed to run a secure, durable, scalable, and highly available data store.
While DynamoDB allows you to specify your level of throughput, your application needs to be designed with the DynamoDB architecture in mind in order to make full use of it. The Amazon DynamoDB Developer Guide describes some best practices for achieving your full provisioned throughput.
One of those recommendations concerns how to efficiently store time series data in DynamoDB. When you store time series data, you often access recent “hot” data more frequently than older, “cold” data. When storing time series data in DynamoDB, it is recommended that you spread your data across multiple tables – one per time period (month, day, etc.). This article describes the reasons behind that advice, and the benefits of designing your application in this way.
To understand why hot and cold data separation is important, consider the advice about Uniform Workloads in the developer guide:
When storing data, Amazon DynamoDB divides a table’s items into multiple partitions, and distributes the data primarily based on the hash key element. The provisioned throughput associated with a table is also divided evenly among the partitions, with no sharing of provisioned throughput across partitions. Consequently, to achieve the full amount of request throughput you have provisioned for a table, keep your workload spread evenly across the hash key values. Distributing requests across hash key values distributes the requests across partitions.
For example, if a table has a very small number of heavily accessed hash key elements, possibly even a single very heavily used hash key element, traffic is concentrated on a small number of partitions – potentially only one partition. If the workload is heavily unbalanced, meaning disproportionately focused on one or a few partitions, the operations will not achieve the overall provisioned throughput level. To get the most out of Amazon DynamoDB throughput, build tables where the hash key element has a large number of distinct values, and values are requested fairly uniformly, as randomly as possible.
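The effect described above can be sketched with a toy model. The partition count, table-level capacity, and hash function below are hypothetical stand-ins for DynamoDB internals (which you do not control directly); the point is only that an even split of throughput across partitions leaves a single hot hash key badly over its partition's share:

```python
import hashlib

# Hypothetical numbers for illustration: partition count and the hash
# function are internal DynamoDB details, not values you choose directly.
TABLE_READ_CAPACITY = 1000   # provisioned reads/sec for the whole table
NUM_PARTITIONS = 10

# Each partition receives an equal share of the table's throughput,
# with no sharing of capacity across partitions.
per_partition_capacity = TABLE_READ_CAPACITY / NUM_PARTITIONS  # 100 reads/sec

def partition_for(hash_key: str) -> int:
    """Map a hash key to a partition (a stand-in for DynamoDB's internal hashing)."""
    digest = hashlib.md5(hash_key.encode()).hexdigest()
    return int(digest, 16) % NUM_PARTITIONS

# A skewed workload: 90% of one second's requests hit a single hot key.
requests = ["hot-thread"] * 900 + [f"thread-{i}" for i in range(100)]

load = [0] * NUM_PARTITIONS
for key in requests:
    load[partition_for(key)] += 1

# The hot key's partition receives at least 900 requests against its
# 100 reads/sec share, so those requests are throttled even though the
# table as a whole has 1000 reads/sec provisioned.
print(max(load), per_partition_capacity)
```

With a uniform spread of hash keys, the same 1,000 requests would land close to 100 per partition and none would be throttled.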
Another example of a non-uniform workload is where an individual request consumes a large amount of throughput. Expensive requests are generally caused by Scan or Query operations, or even single item operations when items are large. Even if these expensive requests are spread out across a table, each request creates a temporary hot spot that can cause subsequent requests to be throttled.
For instance, consider the Forum, Thread, and Reply tables from the Getting Started section of the developer guide, which demonstrate a forums web application on top of DynamoDB. The Reply table stores the messages users post within each thread, sorted by time.
If you Query the Reply table for all messages in a very popular thread, that query could consume lots of throughput all at once from a single partition. In the worst case, this expensive query could consume so much of the partition’s throughput that it causes subsequent requests to be throttled for a few seconds, even if other partitions have throughput to spare.
Tip: Use Pagination
To spread that workload out over time, it is recommended to take advantage of the pagination features of the Query operation, and limit the number of items retrieved per call. Since a forums web application displays a fixed number of replies to a thread at a time, pagination lends itself well to this use case.
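The paginated Query loop can be sketched as follows. To keep the example self-contained, the in-memory `replies` list and the `query_replies` function are hypothetical stand-ins that mimic the real operation's `Limit`/`ExclusiveStartKey`/`LastEvaluatedKey` semantics:

```python
# Stand-in data: 25 replies in one thread, in time order.
replies = [{"Id": i, "Body": f"reply {i}"} for i in range(25)]

def query_replies(limit, exclusive_start_key=None):
    """Mimic a paginated Query: return at most `limit` items plus a
    LastEvaluatedKey to resume from, or None when the results are exhausted."""
    start = 0 if exclusive_start_key is None else exclusive_start_key + 1
    page = replies[start:start + limit]
    last_key = page[-1]["Id"] if start + limit < len(replies) else None
    return {"Items": page, "LastEvaluatedKey": last_key}

# Fetch one small page at a time instead of the entire thread in a single
# expensive request; each call consumes only a small slice of throughput.
items, key = [], None
while True:
    resp = query_replies(limit=10, exclusive_start_key=key)
    items.extend(resp["Items"])
    key = resp["LastEvaluatedKey"]
    if key is None:
        break

print(len(items))  # all 25 replies, fetched in three small pages
```

In the real forums application you would typically stop after the first page and fetch the next one only when the user asks for it, spreading the cost out even further.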
Impact of Non-Uniform Workloads on Throughput for Large Tables
As your table grows in size, DynamoDB adds more partitions behind the scenes to handle your storage needs. As the number of partitions in your table increases, each partition is given a smaller portion of your overall throughput.
In the case of a non-uniform workload, some of your requests could be throttled as your dataset grows, even though you saw no throttling when the table was smaller, and even if you are not utilizing your table’s full provisioned throughput. When you first create your table, you may have hot spots that go unnoticed because each partition is allotted a larger share of your overall table throughput. However, when your application adds large amounts of data to your table, DynamoDB automatically adds more partitions, which decreases your per-partition throughput and can lead to increased throttling for non-uniform workloads.
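The arithmetic behind this is simple to illustrate. The partition counts below are hypothetical (DynamoDB manages them internally), but they show how a fixed table-level throughput translates into a shrinking per-partition share as the table grows:

```python
# Illustration of per-partition throughput shrinking as a table grows.
# Partition counts are internal to DynamoDB; these numbers are hypothetical.
provisioned_reads = 1000  # table-level provisioned reads/sec (unchanged)

shares = {}
for partitions in (4, 8, 16):
    shares[partitions] = provisioned_reads / partitions
    print(f"{partitions} partitions -> {shares[partitions]:.1f} reads/sec each")

# A hot key consuming 150 reads/sec fits comfortably at 4 partitions
# (250 reads/sec per partition) but is throttled at 16 partitions
# (62.5 reads/sec), even though the table's total provisioned
# throughput never changed.
```

This is why a hot spot that was harmless early on can surface as throttling months later, purely as a side effect of dataset growth.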
However, if your request workload is uniform across your table, even as the number of partitions grows, your application will continue to run smoothly.
Tip: Separate Hot and Cold Data
Some types of applications store a mix of hot and cold data together in the same table. Hot data is accessed frequently, like recent replies in the example forums application. Cold data is accessed infrequently or never, like forum replies from several months ago.
Applications that store time series data, often with a range key involving a timestamp, fall into this category. The developer guide describes the best practices for storing time series data in DynamoDB, which involves creating a new table for each time period. This approach offers several benefits, including:
- Cost: You can provision higher throughput for tables that contain hot data, and lower throughput for the tables containing cold data. This keeps your per-partition throughput higher on your hot tables, helping them better tolerate non-uniformity in your workloads.
- Simplified Analytics: When analyzing your data for periodic reports, you can use the built-in integration with Amazon Elastic MapReduce to run complex data analysis queries that are not supported natively in DynamoDB. Since analytics jobs tend to recur on a schedule, separating tables into time periods means each job only needs to access the new data.
- Easier Archival: If older data is no longer relevant to your application, you can simply archive it to a cheaper storage system like Amazon S3 and delete the old table, without having to delete items one at a time, which would otherwise consume a great deal of provisioned throughput.
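The table-per-time-period pattern can be sketched as a simple naming scheme. The table name format below is a hypothetical example, not a convention from the developer guide; the idea is that writes always target the current period's table, while older tables can be dialed down, archived, and dropped whole:

```python
from datetime import date

# Hypothetical naming scheme for the table-per-month pattern.
def table_for_month(day: date) -> str:
    return f"Replies_{day.year:04d}_{day.month:02d}"

# Writes for today's replies go to the current "hot" table, which you
# provision with high throughput...
current = table_for_month(date(2012, 9, 25))
print(current)  # Replies_2012_09

# ...while last month's table can be reduced to minimal throughput,
# archived to Amazon S3, and eventually removed with a single DeleteTable
# call instead of costly item-by-item deletes.
previous = table_for_month(date(2012, 8, 25))
print(previous)  # Replies_2012_08
```

Queries for recent replies hit only the current table, so the hot data enjoys the full per-partition throughput of a small, highly provisioned table.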
Unlike traditional databases, DynamoDB lets you scale up your throughput requirements with the push of a button. DynamoDB also automatically manages your storage as your table grows in size. As your table grows, its provisioned throughput is spread out across your additional partitions. In order to fully utilize your provisioned throughput in DynamoDB, you have to take this partitioning into consideration when you design your application. You can find additional best practices and suggestions for application designs that work well on DynamoDB in the Amazon DynamoDB Developer Guide.
— David Yanacek