AWS Database Blog

Choosing the Right DynamoDB Partition Key

Gowri Balasubramanian is a solutions architect at Amazon Web Services

This blog post covers important considerations and strategies for choosing the right partition key when migrating from a relational database to DynamoDB. This is an important step in designing and building scalable and reliable applications on top of DynamoDB.

What is a partition key?

DynamoDB supports two types of primary keys:

  • Partition key: Also known as a hash key, the partition key is composed of a single attribute. Attributes in DynamoDB are similar in many ways to fields or columns in other database systems.
  • Partition key and sort key: Referred to as a composite primary key or hash-range key, this type of key is composed of two attributes. The first attribute is the partition key, and the second attribute is the sort key. Here is an example:

Figure 1: A DynamoDB table with a composite primary key
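
To make the composite primary key concrete, here is a minimal sketch using the AWS SDK for Python (boto3) that creates a table with a partition key and a sort key. The table name, attribute names, and capacity values are assumptions for illustration, not part of the original example.

import boto3

# Minimal sketch (assumed names and values): create a table with a composite
# primary key, i.e., a partition key (HASH) plus a sort key (RANGE).
dynamodb = boto3.client("dynamodb")

dynamodb.create_table(
    TableName="InvoiceTransactions",
    AttributeDefinitions=[
        {"AttributeName": "InvoiceNumber", "AttributeType": "S"},
        {"AttributeName": "ClientTransactionid", "AttributeType": "S"},
    ],
    KeySchema=[
        {"AttributeName": "InvoiceNumber", "KeyType": "HASH"},         # partition key
        {"AttributeName": "ClientTransactionid", "KeyType": "RANGE"},  # sort key
    ],
    ProvisionedThroughput={"ReadCapacityUnits": 5, "WriteCapacityUnits": 5},
)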

Why do I need a partition key?

DynamoDB stores data as groups of attributes, known as items. Items are similar to rows or records in other database systems. DynamoDB stores and retrieves each item based on the primary key value, which must be unique. Items are distributed across 10 GB storage units, called partitions (physical storage internal to DynamoDB). Each table has one or more partitions, as shown in Figure 2. For more information, see Understand Partition Behavior in the DynamoDB Developer Guide.

DynamoDB uses the partition key’s value as an input to an internal hash function. The output from the hash function determines the partition in which the item will be stored. Each item’s location is determined by the hash value of its partition key.

All items with the same partition key are stored together and, for composite primary keys, are ordered by the sort key value. DynamoDB splits partitions by sort key if the collection size grows larger than 10 GB.

Figure 2: DynamoDB tables and partitions

Partition keys and request throttling

DynamoDB evenly distributes provisioned throughput (read capacity units and write capacity units) among partitions, so the throughput available to any single partition is limited by your provisioned capacity. If your read or write throughput exceeds this level for a single partition, your requests may be throttled with a ProvisionedThroughputExceededException error.

Reading/writing above the limit can be caused by:

  • Uneven distribution of data due to the wrong choice of partition key.
  • Frequent access of the same key in a partition (the most popular item, also known as a hot key).
  • A request rate greater than the provisioned throughput.

To avoid request throttling, design your DynamoDB table with the right partition key to meet your access requirements and provide even distribution of data.

Recommendations for partition keys

Use high-cardinality attributes. These are attributes that have distinct values for each item, such as emailid, employee_no, customerid, sessionid, orderid, and so on.

Use composite attributes. Try to combine more than one attribute to form a unique key, if that meets your access pattern. For example, consider an orders table with customerid+productid+countrycode as the partition key and order_date as the sort key.
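
As a rough illustration of this pattern, the following sketch (boto3) writes an item whose partition key concatenates customerid, productid, and countrycode. The table name, delimiter, and attribute names are assumptions.

import boto3

# Sketch of a composite-attribute partition key (assumed table/attribute names).
# The three source attributes are concatenated into a single key value.
table = boto3.resource("dynamodb").Table("Orders")

customerid, productid, countrycode = "cust42", "prod9", "US"
table.put_item(
    Item={
        "CustomerProductCountry": f"{customerid}#{productid}#{countrycode}",  # partition key
        "order_date": "2016-05-17",                                           # sort key
        "quantity": 2,
    }
)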

Cache the popular items when there is a high volume of read traffic. The cache acts as a low-pass filter, preventing reads of unusually popular items from swamping partitions. For example, consider a table that has deals information for products. Some deals are expected to be more popular than others during major sale events like Black Friday or Cyber Monday.
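
One possible shape for such a cache is a simple read-through layer in front of GetItem. This is only a sketch with an in-process dictionary and assumed table and key names; a production system would more likely use a shared cache such as Amazon ElastiCache.

import time
import boto3

# Read-through cache sketch for hot items (assumed table/key names).
# Popular deals are served from the in-process cache, so repeated reads of
# the same hot key do not all hit the same DynamoDB partition.
table = boto3.resource("dynamodb").Table("ProductDeals")
_cache = {}
CACHE_TTL_SECONDS = 60

def get_deal(product_id):
    entry = _cache.get(product_id)
    if entry and time.time() - entry[0] < CACHE_TTL_SECONDS:
        return entry[1]                       # cache hit: no DynamoDB read
    item = table.get_item(Key={"product_id": product_id}).get("Item")
    _cache[product_id] = (time.time(), item)  # cache the (possibly hot) item
    return item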

Add random numbers or digits from a predetermined range for write-heavy use cases. If you expect a large volume of writes for a partition key, use an additional prefix or suffix (a fixed number from a predetermined range, say 1-10) and add it to the partition key. For example, consider a table of invoice transactions. A single invoice can contain thousands of transactions per client. How do we enforce uniqueness and retain the ability to query and update the invoice details for high-volume clients?

Here is the recommended table layout for this scenario:

  • Partition key: Append a random suffix (1-10 or 1-100) to the InvoiceNumber, depending on the number of items per InvoiceNumber.
  • Sort key: ClientTransactionid.
    Partition Key              Sort Key             Attribute1
    InvoiceNumber+Randomsuffix ClientTransactionid  Invoice_Date
    121212-1                   Client1_trans1       2016-05-17 01.36.45
    121212-1                   Client1_trans2       2016-05-18 01.36.30
    121212-2                   Client2_trans1       2016-06-15 01.36.20
    121212-2                   Client2_trans2       2016-07-01 01.36.15
  • This combination gives us a good spread across the partitions. The sort key can be used to filter for a specific client (for example, where InvoiceNumber=121212-1 and ClientTransactionid begins with Client1).
  • Because we have a random number (1-10) appended to our partition key, we need to query the table 10 times for a given InvoiceNumber. For example, we query where the partition key is 121212-1 and ClientTransactionid begins with Client1, repeat for 121212-2 through 121212-10, and then merge the results. A minimal query fan-out sketch follows this list.
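
Here is a minimal sketch (boto3) of the pattern described in this list: writes pick a random suffix from the range 1-10, and reads fan out one Query per suffix and merge the results. The table and attribute names follow the layout above; the helper function names are assumptions.

import random
import boto3
from boto3.dynamodb.conditions import Key

# Random-suffix sketch (assumed helper names; suffix range 1-10 as above).
table = boto3.resource("dynamodb").Table("InvoiceTransactions")
SUFFIXES = range(1, 11)

def write_transaction(invoice_number, client_transaction_id, invoice_date):
    suffix = random.choice(list(SUFFIXES))  # spread writes across partitions
    table.put_item(Item={
        "InvoiceNumber": f"{invoice_number}-{suffix}",
        "ClientTransactionid": client_transaction_id,
        "Invoice_Date": invoice_date,
    })

def query_invoice_for_client(invoice_number, client_prefix):
    # Fan out: one Query per suffix, then merge the results client-side.
    items = []
    for suffix in SUFFIXES:
        resp = table.query(
            KeyConditionExpression=Key("InvoiceNumber").eq(f"{invoice_number}-{suffix}")
            & Key("ClientTransactionid").begins_with(client_prefix)
        )
        items.extend(resp["Items"])
    return items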

Here's an alternative to the random number (1-10) option: If you will always use Invoice_Date to filter the query results, you can append a portion of Invoice_Date (such as the day, month, and year, for example 01SEP2016) to the InvoiceNumber to spread the writes. If you need to identify the list of invoices for Client1 for September 2016, you need to issue the query 30 times, where InvoiceNumber ranges from 121212-01SEP2016 to 121212-30SEP2016 and ClientTransactionid begins with Client1. On the other hand, if your access pattern is typically to fetch invoices for a particular date, you can use the partition key to query directly.
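
When the access pattern does include the date, a single direct Query is enough. A sketch, assuming the same table and the date-suffix key format from the example above:

import boto3
from boto3.dynamodb.conditions import Key

# Direct query for one date partition (assumed names; key format from the
# example above: InvoiceNumber plus a date suffix such as 01SEP2016).
table = boto3.resource("dynamodb").Table("InvoiceTransactions")

resp = table.query(
    KeyConditionExpression=Key("InvoiceNumber").eq("121212-01SEP2016")
    & Key("ClientTransactionid").begins_with("Client1")
)
print(resp["Items"])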

Note:

After the prefix range is decided, there is no easy way to further spread the data because prefix modifications also require application-level changes. Therefore, consider how hot each partition key could get and add enough random prefixes (with buffer) to accommodate anticipated future growth.

This option induces additional read latency because each query requires X read requests (one per suffix).

Segregate the tables for large vs. small use cases. Using a random number with a partition key works well if there is a large number of items per partition key. What if you have a mix of small and large items for a partition key? For example, what if you have one customer/product with tens of thousands of items per partition key and another customer with only a few items per partition key? In this scenario, a random prefix adds query overhead for the small customers without providing any benefit.

Consider the same InvoiceTransactions table. Assume that you know your client profile in advance and that your clients can be classified as small or big. Small invoices have about 10 transactions, and big invoices have roughly 10,000 transactions.

Here is the recommended table layout for this scenario:

Table Name                 Partition Key               Sort Key
InvoiceTransactions_big    InvoiceNumber+Randomsuffix  ClientTransactionid
InvoiceTransactions_small  InvoiceNumber               ClientTransactionid

In this case, the random prefix logic is only applied to the big table where it is expected to store tens of thousands of invoice transactions per client. The queries against the small table do not require multiple API calls.
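
A small sketch of how the write path might route between the two tables, assuming the application already knows whether a client is big or small; the flag and function names are assumptions.

import random
import boto3

# Routing sketch between the big and small tables (assumed flag/function names).
dynamodb = boto3.resource("dynamodb")
big_table = dynamodb.Table("InvoiceTransactions_big")
small_table = dynamodb.Table("InvoiceTransactions_small")

def write_transaction(invoice_number, client_transaction_id, is_big_client):
    if is_big_client:
        # Big clients: apply the random-suffix pattern to spread the writes.
        pk = f"{invoice_number}-{random.randint(1, 10)}"
        big_table.put_item(Item={"InvoiceNumber": pk,
                                 "ClientTransactionid": client_transaction_id})
    else:
        # Small clients: a plain InvoiceNumber is enough, so reads need one query.
        small_table.put_item(Item={"InvoiceNumber": invoice_number,
                                   "ClientTransactionid": client_transaction_id})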

Anti-patterns for partition keys

Use of sequences or unique IDs generated by the DB engine as the partition key. It's common to use sequences (schema.sequence.NEXTVAL) as the primary key to enforce uniqueness in Oracle tables. Sequences are not usually used for accessing the data.

The following is an example schema layout for an order table that has been migrated from Oracle to DynamoDB. The main table partition key (TransactionID) is populated by a UID. A global secondary index (GSI) is created on OrderID and Order_Date for query purposes.

Partition Key  Attribute1   Attribute2
TransactionID  OrderID      Order_Date
1111111        Customer1-1  2016-05-17 01.36.45
1111112        Customer1-2  2016-05-18 01.36.30
1111113        Customer2-1  2016-05-18 01.36.30

Here are the potential issues with this approach:

  • You can't use TransactionID for any query purposes, so you lose the ability to use the partition key to perform a fast lookup of data.
  • GSIs support eventual consistency only, with additional costs for reads and writes.

Note: You can use the conditional writes feature instead of sequences to enforce uniqueness and prevent the overwriting of an item.
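
For example, a conditional put can reject a write if an item with the same key already exists. This is a sketch with assumed table and attribute names:

import boto3
from botocore.exceptions import ClientError

# Conditional-write sketch (assumed table/attribute names): the put succeeds
# only if no item with this partition key exists yet.
table = boto3.resource("dynamodb").Table("Orders")

try:
    table.put_item(
        Item={"OrderID": "Customer1-1", "Order_Date": "2016-05-17 01.36.45"},
        ConditionExpression="attribute_not_exists(OrderID)",
    )
except ClientError as err:
    if err.response["Error"]["Code"] == "ConditionalCheckFailedException":
        print("OrderID already exists; the item was not overwritten")
    else:
        raise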

Low-cardinality attributes like product_code as partition key and order_date as sort key.

  • This design greatly increases the likelihood of hot partition issues. For example, if one product is more popular, then the reads and writes for that key will be high, resulting in throttling issues.
  • Except for Scan, DynamoDB API operations require an equality operator (EQ) on the partition key for tables and GSIs. As a result, the partition key must be something that your application can easily query with a simple lookup (for example, using key=value, which returns either a unique item or a small set of items). There is also a 1 MB limit on the data a single Query operation can return, so you will need to paginate using LastEvaluatedKey, which is not optimal; see the pagination sketch after this list.
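
A pagination sketch under those constraints, with assumed table and attribute names:

import boto3
from boto3.dynamodb.conditions import Key

# Pagination sketch (assumed names): keep issuing Query calls until
# LastEvaluatedKey is absent, i.e., until the last page has been read.
table = boto3.resource("dynamodb").Table("Orders")

def query_all(product_code):
    items, start_key = [], None
    while True:
        kwargs = {"KeyConditionExpression": Key("product_code").eq(product_code)}
        if start_key:
            kwargs["ExclusiveStartKey"] = start_key
        resp = table.query(**kwargs)
        items.extend(resp["Items"])
        start_key = resp.get("LastEvaluatedKey")
        if not start_key:
            break
    return items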

Do not build your own hash on top of a key such as customer_id before storing it in DynamoDB. DynamoDB already hashes your partition key and distributes the data across partitions; double hashing only adds complexity to your design.

Do not lift and shift primary keys from the source database without analyzing the data model and access patterns of the target DynamoDB table.

Conclusion

When it comes to DynamoDB partition key strategies, no single solution fits all use cases. Evaluate the various approaches based on your data ingestion and access patterns, then choose the most appropriate key, the one least likely to run into throttling issues.

Reference

Design for Uniform Data Access Across Items in Your Tables in the DynamoDB Developer Guide.