AWS Database Blog

Using Sort Keys to Organize Data in Amazon DynamoDB

It’s easy to start filling an Amazon DynamoDB table with data. However, without forethought about organizing your data, you can limit your data-retrieval options later.

Data organization and planning for data retrieval are critical steps when designing a table. Without proper data organization, the only options for retrieving data are retrieval by partition key or a full scan of the table. These retrieval methods are limited or expensive in terms of price and performance in a table.

In this post, I explore real-life examples of optimizing tables for data retrieval with sort keys. You use sort keys to not only group and organize data, but also to provide additional means for querying against items in a table.

A brief introduction to partition keys

To better understand how to organize your data, it’s important to understand how data is stored in DynamoDB. Each item in a DynamoDB table requires that you create a primary key for the table, as described in the DynamoDB documentation. A primary key can be a partition key or a combination of a partition key and sort key. The primary key must be unique across the table.

When using only a partition key as the primary key, a suboptimal choice for a partition key can affect the overall performance of the table. For example, frequent retrieval of the same primary key can cause throughput-exceeded exceptions and throttle requests to your table.

For more information about partition keys and their best practices, see the post Choosing the Right DynamoDB Partition Key on the AWS Database Blog.

Using sort keys to extend data retrieval options

In some cases, you might supply only a partition key as the primary key when creating a table. In these cases, you’re limited to data retrieval by the partition key or by returning all items in a table with the scan operation. Creating a table this way is simple, and in some cases, simple is all you need.

However, as your dataset grows, table scans can become a costly burden in terms of price and performance. Table scans can quickly exhaust your reading capacity and thus increase your bill. For more information about read and write capacity units and how these units affect your DynamoDB bill, see Amazon DynamoDB Pricing and Throughput Capacity for Reads and Writes in the DynamoDB documentation. Additionally, you can’t always retrieve items solely by the partition key. Consider the examples of retrieving customer orders before a specific date and retrieving all products that match a category. Adding a sort key to a table opens more data retrieval functionality beyond scans and partition keys.

Keep in mind that sort keys are not only for determining a sort order for items returned by a query. As this post shows, sort keys aren’t tied directly to sorting items for retrieval, but instead extend data retrieval options.

When you combine a partition key and sort key, they create a composite key, and that composite key is the primary key for individual items in a table. With a composite key, you gain the ability to use queries with a KeyConditionExpression against the sort key. In a query, you can use KeyConditionExpression to write conditional statements by using comparison operators that evaluate against a key and limit the items returned. In other words, you can use special operators to include, exclude, and match items by their sort key values.

Let’s examine a quick example of querying with a sort key and querying by time. Suppose that the sort key of a particular table is a Unix timestamp stored as a number. In this case, you can issue a query with a key condition that returns all items before a specific time using the < comparison operator. I cover this example of time-bound retrieval in more depth later.

Working with ranges

You use ranges with sort keys to create queries. The following operators are available with the KeyConditionExpressions query:

  • =
  • <
  • >
  • <=
  • >=
  • between
  • begins_with

The between operator accepts two values to determine a range, and begins_with checks if a string starts with a substring. You can use most of these operators on numeric values, but the begins_with operator enables some interesting ways of querying data.

For example, consider how you might group locations. In the United States, locations are usually grouped by city, state, and country. Let’s say we create a sort key that stores an item’s location in a string with the template of city-state-country.

The order of city-state-country starts with the most specific part of the location data. As a result, your queries using the begins_with operator are limited to the city. If I switch the order of the location string to be country-state-city, the begins_with operator instead provides three levels of querying:

  • begins_with(USA) – Returns all items located in the United States.
  • begins_with(USA-TX) – Returns only items located in Texas.
  • begins_with(USA-TX-Houston) – Returns only items located in Houston.

When grouping locations this way, you’re limited only by your ingenuity.

How to use sort keys for data organization and retrieval

Applying the knowledge of sort keys to real-life examples highlights repeatable techniques and patterns that you can use for data organization and retrieval in tables you create in the future.

Let’s consider log data first, such as event monitoring, pub/sub events, service logs, and the like. Some examples of data sources might be clickstreams, systemd logs, event logs, or streaming APIs from business productivity and software-as-a-service (SaaS) platforms such as Salesforce.

Example 1: Working with log data

I start with a simple example and implement the concept of time-bound retrieval that I mentioned earlier when discussing Unix timestamps. Let’s assume you’re collecting telemetry data from multiple devices. Each device has a device ID, an event ID for each unique event, and a timestamp. Logs often have time data associated with the events captured in the log. Commonly, you retrieve logs by querying a specific timespan.

Navigate to the DynamoDB console and create a new table. To organize the log data, you provide a partition key unique to the service creating the log data, as shown in the following screenshot. For example, the partition key might be a hostname or deviceID value. For the sort key, provide the timestamp value of the individual event.

To organize the log data, you provide a partition key unique to the service creating the log data

You can then issue queries using the between operator and two timestamps, >, or <. As shown in the following screenshot, three records are stored for a device with the deviceID of 123. The dates stored as a Unix timestamp equate to August 29, 2018, at 12:00 p.m. UTC; September 4, 2018, at 1:00 a.m. UTC; and July 9, 2011, at 1:00 p.m. UTC. To convert timestamps to human-readable formats, see the Epoch Unix Time Stamp Converter website.

Three records are stored for a device with the deviceID of 123

I issue a query for any logs with time stamps before September 4, 2018, at 12:00 a.m. UTC by using the < operator and the Unix timestamp for that time: 1536019200.

Results of the query for any logs with time stamps before September 4, 2018, at 12:00 a.m. UTC

Example 2: Working with chat messages

This example builds on the concept of using a template string for the sort key. Remember the location data example I previously discussed. For this new case, I consider data coming from a chat application. For reference, assume that the users have a user name and the chat rooms have unique IDs, and you want to store each chat message. I want to be able to retrieve messages by user and also by user and time frame.

The chatroomID can serve as the partition key. If you have a busy chat room, you can add an increasing numeric suffix to the chatroomID such as seattle-1 or seattle-2 to distribute data access across partitions. This practice can help you avoid issues related to creating a hot partition. For more information about hot partitions and picking the best partition key for a table, see Choosing the Right DynamoDB Partition Key on the AWS Database Blog.

The compound sort key should follow this format: user_name#datetime. For the datetime value, using a string representation such as yyyy-MM-dd:HH:mm:ss provides the most querying options.

To organize the log data, you provide a partition key unique to the service creating the log data

In this example, messages are stored for two users at various times, as shown in the following screenshot.

Messages are stored for two users at various times

Constructing the compound sort key gives a few options for querying with the begins_with operator. You can query for messages limited by the room and specific user.

The compound sort key gives a few options for querying with the begins_with operator

The date portion of the chatMessageIdentifier template string is arranged from least specific to most specific. You can query messages in a chat room by user and return all messages from a year (amsg#2018), or from a year and specific month (amsg#2018-08), for example.

The date portion of the chatMessageIdentifier template string is arranged from least specific to most specific

Querying does come with tradeoffs. For example, you can’t query data based on time here because we chose to start the template string with user_name. If time is a more important retrieval factor for you, you can order your template string differently or add a secondary index, with an additional sort key that reverses the original template string: datetime#user_name.

Example 3: Tracking documents

Finally, let’s consider a more advanced use case where the sort key is an arbitrary value and can change to help retrieve different sets of data about the same object. I use the real-life example of a document tracking system. Document tracking systems often keep data about who has permissions to a document, which version the document is on, where the document is stored, and metadata about the document. I set the partition key to be a unique ID tied to the document, and I use the sort key to represent each of these variables in the document tracking system.

To organize the log data, you provide a partition key unique to the service creating the log data

Using this generic but context-sensitive sort key, I can allow arbitrary metadata to be stored and retrieved by adding a sort key value of metadata. Or I can use the field to store access identifiers to stipulate who can access this document with the sort key value of permissions. Instead of predefining database schema, I instead can allow my application to provide the context of what kind of data I’m storing about the documents. With this flexibility, my application can supply the definition of the data instead of the database.

The structure of the data differs for each item and stores the information relevant to each arbitrary sort key value

As you can see in the preceding screenshot, the structure of the data differs for each item and stores the information relevant to each arbitrary sort key value. To retrieve one of these items, I issue a query using the = operator with the type of document information I need. For instance, to retrieve the document metadata, I use the = operator and the value metadata for my query.

Issue a query using the = operator with the type of document information I need

Based on this document tracking example, let’s cover an advanced use case for the sort key. I can use the same documentInfo sort key to implement document versioning. Once again, using the begins_with operator with a template string, I create items in the table for each version of a document. The sort key value adheres to the following template: v_# where # is the version ID or document version number. The sort key value v_0 is reserved to store the most recent version of the document and always is a duplicate row of whatever document version was last added. In this example, v_0 stores the same data as v_2, because v_2 is the latest document.

With this pattern, you can query to retrieve a full document history with begins_with(v_), as shown in the following screenshot.

With this pattern, you can query to retrieve a full document history with begins_with(v_)

Summary

Sort keys can add more power to the querying capabilities of DynamoDB. However, you must use some forethought. Remember to employ string templates that take advantage of the begins_with operator and to order the string by least specific data to most specific. The sort key also can store unique, arbitrary string values such as metadata and permissions to aid in retrieval.

Don’t be afraid to use DynamoDB data with other services as well when you have more intensive data querying, searching, and analysis needs. For example, you can use Amazon DynamoDB Streams to load streaming data into Amazon Elasticsearch Service. Some of the strengths of DynamoDB lie in its auto scaling feature and its simple interface for storing data. Others include its speedy response times for demanding applications with features such as Amazon DynamoDB Accelerator (DAX), and its pairing in serverless architectures. Querying and retrieving data doesn’t have to be challenging if you consider and plan accordingly.

 


About the Author

Photo of AM GrobelnyAM Grobelny is a senior technical evangelist based in Seattle, Washington, and is focused on enabling all kinds of software developers to be successful on AWS. AM enjoys using most languages, but C# and TypeScript have a special place in his heart.