Querying on Multiple Attributes in Amazon DynamoDB
Amazon DynamoDB is a non-relational key-value store that provides single-digit millisecond response times for reads and writes and scales with virtually no limits. But as in any key-value store, it can be tricky to store data in a way that allows you to retrieve it efficiently. The most efficient method is to fetch an item by its exact key. But if you need to search for data based on some attribute that isn’t part of the primary key, you might wonder how to get the exact search result in the most cost-effective and efficient way.
DynamoDB provides filter expressions as one potential solution that you can use to refine the results of a Query operation. Although filtering is done on the server side before results are sent back, the read costs are calculated on the Query operation before the filter is applied. In some cases, the cost may be too high.
In this post, we discuss an alternative approach to querying on multiple attributes in DynamoDB, which is to concatenate attributes on a key. This approach allows you to get exactly the item you want, and it works great when you have a read pattern that requires all attribute values that would exist in the key to be present when you fetch an item from your DynamoDB table.
To query an item on multiple attributes without filter expressions, you must create a new global secondary index (GSI) with a new attribute as the partition key that contains the concatenation of the attributes for your query. The issue with this method is that the partition key for the GSI is a new attribute and is empty for all items in the table. If a partition key for a GSI has no value, it doesn’t exist in the index, and so the new GSI that you create is empty.
The solution is to retrofit the table with the concatenated attribute values for the new attribute so that your GSI is populated.
Suppose that you have the following item:
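For example, an audit item might look like the following. The attribute names are the ones used throughout this post; the specific values for id, resourceName, and timestamp are illustrative:

```json
{
  "id": "10001",
  "resourceId": "123bbb",
  "resourceName": "Project Files",
  "action": "VIEW",
  "accessedBy": "joe1",
  "timestamp": "2017-05-12T21:25:06Z"
}
```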
This item exists in the following table:
And the Java model is defined as follows:
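A minimal sketch of such a model, assuming the AWS SDK for Java v1 DynamoDBMapper annotations; the table name Audit is an assumption:

```java
import com.amazonaws.services.dynamodbv2.datamodeling.DynamoDBAttribute;
import com.amazonaws.services.dynamodbv2.datamodeling.DynamoDBHashKey;
import com.amazonaws.services.dynamodbv2.datamodeling.DynamoDBTable;

@DynamoDBTable(tableName = "Audit") // table name is an assumption
public class AuditRecord {
    private String id;
    private String resourceId;
    private String resourceName;
    private String action;
    private String accessedBy;
    private String timestamp;

    @DynamoDBHashKey(attributeName = "id")
    public String getId() { return id; }
    public void setId(String id) { this.id = id; }

    @DynamoDBAttribute(attributeName = "resourceId")
    public String getResourceId() { return resourceId; }
    public void setResourceId(String resourceId) { this.resourceId = resourceId; }

    @DynamoDBAttribute(attributeName = "resourceName")
    public String getResourceName() { return resourceName; }
    public void setResourceName(String resourceName) { this.resourceName = resourceName; }

    @DynamoDBAttribute(attributeName = "action")
    public String getAction() { return action; }
    public void setAction(String action) { this.action = action; }

    @DynamoDBAttribute(attributeName = "accessedBy")
    public String getAccessedBy() { return accessedBy; }
    public void setAccessedBy(String accessedBy) { this.accessedBy = accessedBy; }

    @DynamoDBAttribute(attributeName = "timestamp")
    public String getTimestamp() { return timestamp; }
    public void setTimestamp(String timestamp) { this.timestamp = timestamp; }
}
```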
This table stores audit information about when users access a resource. It records the method of access and a timestamp of when the resource was accessed. From the table, you need to determine what times a user has accessed a resource in a certain way. For example, what times has joe1 viewed resource 123bbb?
As mentioned before, to perform the query without scanning or using filter expressions, you must create a GSI that has a partition key based on the attributes that you want to look up. But you can’t create a partition key for an index that spans multiple attributes.
In this example, the solution is to create a new field that consists of multiple attributes for your key. You can concatenate the string values of resourceId, action, and accessedBy as the partition key and select timestamp as the sort key.
You need to create a new attribute named resourceId-Action-AccessedBy to contain these values and reference it as your partition key. The long attribute name in this example is used for readability. In a real use case with large datasets, you should use shorter attribute names because attribute names count toward the total data size used.
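Building the value for this attribute is a simple string concatenation. A minimal sketch in Java follows; the "-" separator matches the attribute name used in this post, while the class name, method name, and the "VIEW" action value are assumptions:

```java
public class CompositeKey {
    // Builds the concatenated partition-key value from its three parts.
    public static String of(String resourceId, String action, String accessedBy) {
        return resourceId + "-" + action + "-" + accessedBy;
    }

    public static void main(String[] args) {
        // "VIEW" is an assumed action value.
        System.out.println(CompositeKey.of("123bbb", "VIEW", "joe1")); // prints 123bbb-VIEW-joe1
    }
}
```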
The item mentioned earlier now becomes the following:
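An illustrative retrofitted item might look like the following (the specific values are assumptions; note the new concatenated attribute):

```json
{
  "id": "10001",
  "resourceId-Action-AccessedBy": "123bbb-VIEW-joe1",
  "resourceId": "123bbb",
  "resourceName": "Project Files",
  "action": "VIEW",
  "accessedBy": "joe1",
  "timestamp": "2017-05-12T21:25:06Z"
}
```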
The table now becomes the following:
|id (primary key)|resourceId-Action-AccessedBy (GSI partition key)|resourceId|resourceName|action|accessedBy|timestamp (GSI sort key)|
And the Java model is now as follows:
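A sketch of the changed portion of the model, assuming DynamoDBMapper annotations; the index name resourceId-Action-AccessedBy-index and the getter names are assumptions:

```java
@DynamoDBIndexHashKey(
        globalSecondaryIndexName = "resourceId-Action-AccessedBy-index",
        attributeName = "resourceId-Action-AccessedBy")
public String getResourceIdActionAccessedBy() { return resourceIdActionAccessedBy; }
public void setResourceIdActionAccessedBy(String value) { this.resourceIdActionAccessedBy = value; }

@DynamoDBIndexRangeKey(
        globalSecondaryIndexName = "resourceId-Action-AccessedBy-index",
        attributeName = "timestamp")
public String getTimestamp() { return timestamp; }
public void setTimestamp(String timestamp) { this.timestamp = timestamp; }
```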
Now you can get items where a resource ID, user, and action are given. You can also filter on the range key for the year, month, day, and so on, or answer the original question: What times has joe1 viewed resource 123bbb? In Java, the query looks like the following example:
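A sketch of such a query, assuming a DynamoDBMapper model class named AuditRecord with a setter for the concatenated attribute and a GSI named resourceId-Action-AccessedBy-index (both names, and the "VIEW" action value, are assumptions):

```java
// Build a key object holding only the GSI partition-key value.
AuditRecord gsiKey = new AuditRecord();
gsiKey.setResourceIdActionAccessedBy("123bbb-VIEW-joe1");

DynamoDBQueryExpression<AuditRecord> queryExpression =
        new DynamoDBQueryExpression<AuditRecord>()
                .withIndexName("resourceId-Action-AccessedBy-index")
                .withConsistentRead(false) // GSIs support eventually consistent reads only
                .withHashKeyValues(gsiKey);

List<AuditRecord> results = mapper.query(AuditRecord.class, queryExpression);
```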
But because this is a new attribute, it doesn’t exist for pre-existing items, which causes many items to not be included in the index. You must create a retrofitting script to generate the data for this new attribute on existing items so that they are included in the index. If this isn’t done, the query that you created the GSI for returns incomplete results. For example, if a query is performed for all the times that jane has edited 123ccc, no results are returned. The previous table shows how the data in the table looks without retrofitting the existing items with data for the new attribute.
Retrofitting the table
There are two approaches for retrofitting the new attribute with data: an in-place update or updating to a new table.
The in-place approach is generally simpler to write. It just updates the items in place without regard to existing reads and writes to the table, and it requires no new services or products. This is a good option for a lightly accessed table and a small number of items. When you write the item back to the table with the DynamoDBMapper, the new attribute is written automatically, and the existence of the new attribute makes the item available in the new GSI.
The other approach for retrofitting is to update to a new table. You do this by writing items to a new table and writing a replicator that listens to the DynamoDB stream for the table. It listens for writes and updates in the original table and performs the write and update if the items exist in the new table. After the new table is created and replicating and you verify that the new items are as expected, you can switch your code to use the new table.
This is a good approach if you have a large number of items that take too long to update in place, or if you have heavy read/write volume on the table and don’t want to increase your throughput for an in-place update. This approach is available as a package and is documented in the DynamoDB Developer Guide.
You can also create your own replicator by using AWS Lambda, and writing one Lambda function to process writes from the table’s stream and write them to a new table. Then, create a second Lambda function to scan and write items that don’t yet exist in the new table. The benefit with this approach is that there is no infrastructure to manage because Lambda functions are serverless.
Before deciding which option to use, you should consider the Read/Write Unit costs and your table’s configured throughput.
For this example, we use the simpler approach of updating in place with a script. The following shows the Java solution.
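A sketch of such a retrofitting script, assuming a DynamoDBMapper model class named AuditRecord with getters for the source attributes and a setter for the new concatenated attribute (all names are assumptions):

```java
AmazonDynamoDB client = AmazonDynamoDBClientBuilder.defaultClient();
DynamoDBMapper mapper = new DynamoDBMapper(client);

// The list returned by scan() pages through the table lazily.
for (AuditRecord record : mapper.scan(AuditRecord.class, new DynamoDBScanExpression())) {
    // Derive the concatenated value from the existing attributes.
    record.setResourceIdActionAccessedBy(
            record.getResourceId() + "-" + record.getAction() + "-" + record.getAccessedBy());
    mapper.save(record); // writing the item back populates the new GSI
}
```

Because this scans and rewrites every item, it consumes read and write units proportional to the table size; for large or heavily used tables, the new-table approach described earlier is a better fit.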
We started with a DynamoDB table that had the correct keys selected for our initial use case. But a new use case was added in which we needed to query data in a way that was inefficient for our current schema.
We considered using filter expressions, but we decided against them because of the additional read units that would have been consumed for items that matched the query but were ultimately filtered out. Instead, we decided to introduce a new GSI that fit our use case better.
In designing the schema for the new GSI, we decided to make the key contain a concatenation of the required values in our new read pattern. This enabled us to get exact matches in our query without any filtering and provided a more efficient query. In doing so, we needed to rewrite the data for the GSI’s new keys that we introduced because an empty key excludes an item from a GSI.
After creating and running a script to retrofit the existing data into the new attribute, we can query for the exact item that we’re requesting, and we have some flexibility with the range key without needing to scan a table or filter results.
About the Author
Scott Todd is a software development engineer at Amazon.