AWS Database Blog
Z-Order Indexing for Multifaceted Queries in Amazon DynamoDB: Part 1
Update from May 29, 2018: Today, we published Z-order indexing for multifaceted queries in Amazon DynamoDB: Part 2.
TL;DR
Using Z-order indexing, you can efficiently run range queries on any combination of fields in your schema. Although Amazon DynamoDB doesn’t natively support Z-order indexing, you can implement the functionality entirely from the client side. A single Z-order index can outperform and even replace entire collections of secondary indexes, saving you money and improving your throughput.
Background
When DynamoDB launched in 2012, its indexing options were limited. There were no secondary indexes. You could only order your data on disk by specifying a sort key during table setup, and a sort key could only be a single attribute. The following year, DynamoDB introduced the ability to declare a variety of indexes for an individual table, greatly increasing the database’s flexibility. However, the core design in which each index orders data by a single attribute is still in effect. This post explores an alternative approach to indexing that lets you run efficient queries across any number of attributes.
A practical example
Imagine that you run a website that aggregates weather data from a variety of universities and government organizations. For each of these sources, you periodically collect and persist updates that include the temperature that was recorded by a sensor at a particular time and location. You create a table in DynamoDB using the following schema:
| sourceId | timestamp | latitude | longitude | celsius |
|---|---|---|---|---|
| 1 | 1469566800 | 40.749207 | -73.985174 | 30 |
| 1 | 1469566800 | 47.617326 | -122.330787 | 23 |
| 1 | 1469567100 | 47.617326 | -122.330787 | 23 |
| 1 | 1469567400 | 40.749207 | -73.985174 | 30 |
For illustrative purposes, we have randomly generated a set of 300,000 such temperature reports. Each report has been attributed to a single fictional weather organization called Under the Weather (UTW), which has a source ID of 1. The data spans three months and covers the continental United States. We will use this dataset to demonstrate the performance of various queries and indexes.
Traditional indexing in DynamoDB
In DynamoDB, composite indexes can be made up of at most two attributes: a partition key and a sort key. In our example, the sourceId field of each entry is our partition key; all data with the same sourceId value will reside in the same partition of our table. Because many of our queries for weather data will be limited to a given time frame (for example, a specified day, week, or month), you might be tempted to simply use the timestamp field as our sort key. However, in doing so you quickly hit a roadblock, as we discuss next.
Choosing an effective sort key
To select our sort key, we must give careful consideration to the full range of queries that we hope to run against our data. We must also work within the following constraints:
- DynamoDB doesn’t permit two rows to share the same combination of partition key and sort key values.
- When querying for a range of items within a partition, DynamoDB can only constrain its search by the value of the sort key. We can specify constraints on other fields, but the server will only use them to filter the set of items that has already been retrieved from the table using the sort key. Constraints imposed on fields other than the sort key will reduce neither the amount of read capacity consumed by the query nor the amount of time that it takes the query to run.
The uniqueness constraint
Given our use case, the uniqueness constraint complicates things quite a bit. We cannot use timestamp alone as our sort key because we expect to receive multiple updates from sensors around the world at more or less the same time. If no two items in a DynamoDB table can have the same sourceId and timestamp combination, we’ll end up occasionally losing data. In the preceding example table, the first and second rows cannot coexist.
We can address this problem by appending some additional information to the sort key: the latitude and longitude of each item. We do this by setting the type of our table’s sort key to String instead of Number and then concatenating the other values onto the end of the timestamp. Now our table looks like this:
| sourceId | timestamp_lat_long | latitude | longitude | celsius |
|---|---|---|---|---|
| 1 | 1469566800_40.749207_-73.985174 | 40.749207 | -73.985174 | 30 |
| 1 | 1469566800_47.617326_-122.330787 | 47.617326 | -122.330787 | 23 |
| 1 | 1469567100_47.617326_-122.330787 | 47.617326 | -122.330787 | 23 |
| 1 | 1469567400_40.749207_-73.985174 | 40.749207 | -73.985174 | 30 |
Because the newly generated timestamp_lat_long values are all prefixed with the timestamp value, we can still query for all items within a given timeframe as we could before. But because no single source of weather data will have two sensors sending reports from the same place at the same time, we no longer risk having primary key collisions in our data. Thus, we’ve satisfied the uniqueness constraint. However, another constraint is still looming.
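As a quick illustration, here is a minimal sketch of assembling that composite sort key. The underscore separator and coordinate formatting simply mirror the table above; any convention works, as long as the timestamp prefix comes first so that time-range queries still behave as expected.

```python
# Minimal sketch: build the composite sort key shown in the table above.
# The separator and number formatting are assumptions, not a requirement.
def make_sort_key(timestamp, latitude, longitude):
    return f"{timestamp}_{latitude}_{longitude}"

assert make_sort_key(1469566800, 40.749207, -73.985174) == "1469566800_40.749207_-73.985174"
```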
Multifaceted query constraints
Let’s take a moment to imagine some of the queries that we might want to run on our temperature data:
- How warm did it get in Atlanta, Georgia, in the last week of March?
- How often was it cold enough to snow in New York City in the first quarter of 2016?
- In what places did we record exactly 0°C (32° F) in the last hour?
- Has it ever dipped below 5°C (41° F) within 30 degrees’ latitude of the equator?
- When the sun rose this morning, what was the warmest city in the US Eastern Time Zone?
Examining query 1, we can see that it provides helpful query bounds on most of the data that we collect. We’re only looking for items created “in the last week of March,” so we have a relatively small timestamp search space. Better still, we only care about results “in Atlanta,” so we can define a narrow range of latitude and longitude values for our query.
Let’s see what this looks like using our Under the Weather data:
Query 1:
How warm did it get in Atlanta, Georgia, in the last week of March?
Index:
timestamp_lat_long
Key condition:
sourceId = 1 AND
timestamp_lat_long BETWEEN "1471996800" AND "1472601601"
Filter expression:
latitude BETWEEN 33.7 AND 33.9
AND longitude BETWEEN -84.5 AND -84.3
AND celsius BETWEEN -20 AND 40
AND timestamp BETWEEN 1471996800 AND 1472601600
Results:
Items retrieved: 1
Items scanned: 23,102
Read capacity units consumed: 447.5
As you can see, our randomized dataset only contained a single item relevant to our query, but we had to scan over 23,000 items to find it!
Unfortunately, the nature of DynamoDB’s composite indexes prevents us from deriving any benefit from the query’s very specific bounds on latitude and longitude. Our data from UTW spans three months, so a single week represents about 1/12 of it, which is roughly the fraction of the table’s 300,000 items that were scanned. DynamoDB narrowed its search to items from the last week in March, but it was then forced to scan items from the entire country in that timeframe to find those reported from Atlanta. The non-Atlanta items that DynamoDB found were filtered out of our query results, but only after the scan had taken place. This inefficiency causes the query to consume a surprisingly large amount of read capacity. Because that metric is central to DynamoDB’s billing calculations, we’re spending more money per query than we’d like to.
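For reference, here is roughly what query 1 looks like when issued through the AWS SDK for Python (Boto3). This is a sketch, not code from the post: the table name WeatherReports is hypothetical, and the key and attribute names follow the schema above.

```python
import boto3
from boto3.dynamodb.conditions import Key, Attr
from decimal import Decimal

table = boto3.resource("dynamodb").Table("WeatherReports")  # hypothetical table name

response = table.query(
    KeyConditionExpression=(
        Key("sourceId").eq(1)
        & Key("timestamp_lat_long").between("1471996800", "1472601601")
    ),
    FilterExpression=(
        Attr("latitude").between(Decimal("33.7"), Decimal("33.9"))
        & Attr("longitude").between(Decimal("-84.5"), Decimal("-84.3"))
        & Attr("celsius").between(-20, 40)
        & Attr("timestamp").between(1471996800, 1472601600)
    ),
    ReturnConsumedCapacity="TOTAL",
)
print(len(response["Items"]), response["ConsumedCapacity"])
```

In practice, a range this wide spans multiple pages of results, so you would also follow LastEvaluatedKey to page through the full range.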
The situation gets even worse when we move on to the second query in our list:
Query 2:
How often was it cold enough to snow in New York City in the first quarter of 2016?
Key condition:
sourceId = 1 AND
timestamp_lat_long BETWEEN "1451606400" AND "1459468801"
Filter expression:
latitude BETWEEN 40.6 AND 40.8
AND longitude BETWEEN -74.1 AND -73.9
AND celsius BETWEEN -20 AND 0
AND timestamp BETWEEN 1451606400 AND 1459468800
Results:
Items retrieved: 1
Items scanned: 300,000
Read capacity units consumed: 5,812.5
In this query, we’re trying to identify a set of points in time. As a result, the bounds on the timestamp field are very loose; they cover the entire three-month range of the data. Because of this, our index is effectively useless. DynamoDB must scan the entire table in order to find a single item, consuming a whopping 5,812.5 read capacity units. Depending on the read capacity that we have provisioned for our table, requests for later pages of this query might be throttled. If throttling occurs, other queries might also be prevented from executing.
As the data shows, the index we’ve created is only helpful for queries with a very specific time window. Let’s revisit our list of sample queries to see which attributes are going to help us narrow the search space the most in each case:
| Query | Very Helpful Fields | Helpful Fields | Unhelpful Fields |
|---|---|---|---|
| How warm did it get in Atlanta, Georgia, in the last week of March? | latitude, longitude | timestamp | celsius |
| How often was it cold enough to snow in New York City in the first quarter of 2016? | latitude, longitude | celsius | timestamp |
| In what places did we record exactly 0°C (32° F) in the last hour? | celsius, timestamp | | latitude, longitude |
| Has it ever dipped below 5°C (41° F) within 30 degrees’ latitude of the equator? | celsius, latitude | | timestamp, longitude |
| When the sun rose this morning, what was the warmest city in the US Eastern Time Zone? | timestamp | latitude, longitude | celsius |
According to our analysis, the third query should get a lot more help from its timestamp constraints than either of the first two did. Let’s try it out:
Query 3:
In what places did we record exactly 0°C (32° F) in the last hour?
Key condition:
sourceId = 1 AND
timestamp_lat_long BETWEEN "1455710400" AND "1455714001"
Filter expression:
latitude BETWEEN 18 AND 48
AND longitude BETWEEN -124 AND -62
AND celsius BETWEEN 0 AND 0
AND timestamp BETWEEN 1455710400 AND 1455714000
Results:
Items retrieved: 1
Items scanned: 149
Read capacity units consumed: 3
Ah, there we go! As expected, this query found the data while scanning only a relatively small number of irrelevant items.
That’s all well and good for query 3, but we want all of our queries to be reasonably efficient. With that in mind, what actions can we take?
Local secondary indexes
One option is to take advantage of DynamoDB’s local secondary indexes (LSIs). Using this feature, we can create a set of alternative indexes that each order our data using a different field. This approach allows us to use the field that’s most helpful for each query.
Unfortunately, if we want to cover a broad assortment of queries, we would need to create quite a few LSIs—as many as one per field. Each LSI that we maintain consumes additional write capacity when we perform PutItem, UpdateItem, and DeleteItem operations on our main table. LSIs also maintain their own copies of data from the main table, causing the amount of storage being consumed to balloon. Both of these characteristics translate to additional expense.
Perhaps worse, having a separate index for each field still doesn’t provide any performance improvement for queries with multiple helpful bounds. For example, a query with bounds on both latitude and longitude (which would be typical for location-based searches) can’t use a combination of those fields’ respective indexes. In cases like this, you’re forced to use whichever of the two indexes you believe narrows the search space the most on its own.
Intuitively, this situation feels ripe for optimization. Let’s try something different.
Introducing Z-order
Z-ordering is a technique that allows you to map multidimensional data to a single dimension. For example, imagine that you have a collection of (X, Y) coordinate pairs laid out on a 2-dimensional plane. Using Z-ordering, you could arrange those 2D pairs on a 1-dimensional line. Importantly, values that were close together in the 2D plane would still be close to each other on the line. Let’s look at a diagram:
(Diagram: a Z-order curve traversing a 2D grid. Credit: David Eppstein, via Wikipedia.)
As you can see, the line that traverses the plane follows a “Z” pattern from which this algorithm derives its name. We call this line a Z-order curve. As it happens, Z-ordering is not limited to 2-dimensional space—it can be abstracted to work in any number of dimensions.
So?
In a database context, each field that we want to search against can be considered a dimension. In our weather aggregation example, we have four dimensions: timestamp, latitude, longitude, and celsius. If we treat each record as a 4-dimensional point in a hypercube (a 4D space), we can use Z-ordering to map each record onto a 1D line. Thanks to Z-ordering’s locality property, records that were adjacent in the 4D space will end up close to each other on the line. This behavior is important because, although classic databases have some difficulty effectively indexing multidimensional data, they’re terrific at indexing linear data. Using Z-ordering, we can use a common B-tree to organize our multidimensional records in the same manner that one would organize data sorted by a conventional number or string.
Calculating Z-addresses
The core mechanism underlying Z-ordering is the ability to translate a tuple of several field values into a single number: that record’s Z-address. To achieve this, we simply interleave the bits of each field in the record.
For example, imagine that we’re trying to calculate the Z-address of a 2D point [x=97, y=214]. Having reviewed our data, we decide to represent each dimension as a single byte. Starting with the leftmost bit of Y, we interleave the bits of each dimension, resulting in a 2-byte Z-address that comes out to 46,633 in decimal:
X value: 01100001
Y value: 11010110
Z address: 1011011000101001
You don’t have to use the same number of bits in each dimension. When a particular dimension’s bits are exhausted, you can simply omit it from the round-robin rotation. In this modified example, our data is very different: Although X values are always small, Y values can grow relatively large. We decide to represent Y values using two bytes instead of one. We are trying to calculate the Z-address of [x=97, y=54,813]. Even though X and Y have different numbers of bits, we can still interleave the two dimensions to create a three-byte Z-address of 11,938,077:
X value: 01100001
Y value: 1101011000011101
Z address: 101101100010100100011101
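To make the interleaving concrete, here is a small Python sketch; the helper name z_address is ours, not a library function. It follows the scheme described above: dimensions are visited round-robin, most significant bit first, and a dimension drops out of the rotation once its bits are exhausted. The assertions reproduce the two examples just shown.

```python
def z_address(dims):
    """Interleave bits MSB-first to form a Z-address.

    dims is a list of (value, bit_width) pairs in the order the dimensions
    should be visited; a dimension is skipped once its bits run out.
    """
    z = 0
    for i in range(max(width for _, width in dims)):
        for value, width in dims:
            if i < width:
                bit = (value >> (width - 1 - i)) & 1  # i-th bit from the MSB
                z = (z << 1) | bit
    return z

# The two examples from the text, visiting Y first and then X:
assert z_address([(214, 8), (97, 8)]) == 46633
assert z_address([(54813, 16), (97, 8)]) == 11938077
```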
Now that we have a Z-address for our record, what do we do with it? Much like our weather data’s index ordered its records by timestamp, we will order our records according to their z_address value.
Note: As we learned at the beginning, DynamoDB does not allow two items in a table to have the same partition key value and the same sort key value. Because we are sorting on the z_address field of our items, only those items with the same value for all of the Z-address’s constituent dimensions will be considered duplicates. In the preceding examples, two records would have to share the same X and Y values in order to produce the same Z-address and cause a collision.
Creating queries
Now that we can calculate the Z-address for any individual item, we can start building queries for our data. As with conventional sort key indexes, our queries consist of a minimum and a maximum value that define a range of items that we want to read. To find the Z-addresses that we should use as our minimum and maximum, we need to synthesize a record whose field values represent the lower bound on our range and another record whose fields represent the upper bound.
For example, imagine that we are searching for records with an x value between 2 and 3 and a y value between 4 and 5. Our data is 2-dimensional and we are interested in items that fall within a specified rectangular query space. We create item [x=2, y=4] to represent our lower bound (the upper left of our rectangle) and item [x=3, y=5] to represent the upper bound (the lower right of our rectangle). If we represent x and y with one byte each, we can translate these items into the Z-addresses 36 (100100) and 39 (100111), respectively.
Note: If one or more dimensions in a query are unbounded (for example, no X range is specified), simply use the minimum and maximum possible values for that dimension.
Having done this, we now have a linear range that we can iterate across that contains all of the items in our rectangular query space. It’s important to understand that this range represents the fraction of the total Z-address domain that we’ll need to search. Our range covers four Z-addresses, but that does not mean we will get back four items. Rather, we have calculated the four possible locations that might house data matching our query. Every record in the query space falls on the line between the minimum and maximum Z-addresses of our query, so if we craft an API request for this range of Z-addresses, we’ll get all of the data that we need.
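Using the z_address sketch from earlier (one byte per dimension, Y visited first), the corners of this query rectangle translate to those same bounds:

```python
# Corners of the query rectangle x in [2, 3], y in [4, 5].
z_min = z_address([(4, 8), (2, 8)])   # lower-bound corner [x=2, y=4]
z_max = z_address([(5, 8), (3, 8)])   # upper-bound corner [x=3, y=5]
assert (z_min, z_max) == (36, 39)
```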
The preceding example is slightly contrived. As you’ve likely noticed, there are “seams” in the Z-order curve where the linear traversal is required to jump to a different region before continuing its path. If your query range intersects with one of these seams, our queries get slightly more complicated.
Consider the query rectangle from a minimum of [x=1, y=3] to a maximum of [x=3, y=4]. Our minimum and maximum Z-addresses are 11 (001011) and 37 (100101), respectively.
Now we have a problem: The line’s journey from 11 to 37 also traverses large areas of the search space that are outside of the query space. This detour from our query’s rectangle means that in addition to our desired results, we might also scan some amount of garbage data along the way. Although we can use DynamoDB’s FilterExpression feature to weed out these irrelevant entries before they’re returned to us, scanning garbage data is obviously something that we want to avoid wherever possible.
We can avoid these irrelevant regions of our search space using two functions: isRelevant and nextJumpIn.
- isRelevant—Provided a minimum Z-address, a maximum Z-address, and a test Z-address, the isRelevant function tells us whether the test address falls within the query rectangle created by the minimum and maximum Z-addresses.
- nextJumpIn—Given a minimum Z-address, a maximum Z-address, and a point on the Z-order curve that lies between them but outside the query rectangle, the nextJumpIn function calculates the next smallest Z-address that is back inside the query rectangle. For example, calculating nextJumpIn(11, 37, 16) gives us a Z-address of 33. This result allows us to jump from [x=4, y=0] all the way to [x=1, y=4], eliminating a huge, irrelevant portion of the search space.
Starting with our query’s minimum Z-address of 11, we can iterate through each Z-address and test its relevance by calling isRelevant(11, 37, zAddress). If it returns false, we know that we have stepped outside of the query rectangle and can use nextJumpIn(11, 37, zAddress) to skip to the next Z-address that we care about. Following this procedure, we can break the range of 11 to 37 into a set of very narrow subranges: 11 to 11, 14 to 15, 33 to 33, and 36 to 37. Those are the only Z-addresses we need to check—we have narrowed the number of possible Z-addresses for our query from 27 to 6.
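To make these two functions concrete, here is a simple Python sketch; the names mirror the text and are not a library API. For clarity, nextJumpIn is written as a linear walk over candidate addresses. A production implementation would compute the jump target directly (the calculation often referred to as BIGMIN), but the brute-force version behaves the same way on small examples.

```python
def decode_z_address(z, widths):
    """Split a Z-address back into its per-dimension values (inverse of z_address)."""
    values = [0] * len(widths)
    bit_pos = sum(widths)
    for i in range(max(widths)):
        for d, width in enumerate(widths):
            if i < width:
                bit_pos -= 1
                values[d] = (values[d] << 1) | ((z >> bit_pos) & 1)
    return values

def is_relevant(z_min, z_max, z_test, widths):
    """Does z_test fall inside the query rectangle whose corners are z_min and z_max?"""
    lows = decode_z_address(z_min, widths)
    highs = decode_z_address(z_max, widths)
    tests = decode_z_address(z_test, widths)
    return all(lo <= t <= hi for lo, t, hi in zip(lows, tests, highs))

def next_jump_in(z_min, z_max, z_test, widths):
    """Smallest Z-address after z_test that is back inside the query rectangle,
    or None if there isn't one.  (Linear walk for illustration only.)"""
    candidate = z_test + 1
    while candidate <= z_max:
        if is_relevant(z_min, z_max, candidate, widths):
            return candidate
        candidate += 1
    return None

# The example from the text, using 3 bits per dimension (Y first, then X):
widths = [3, 3]
assert is_relevant(11, 37, 14, widths)          # [x=2, y=3] is inside the rectangle
assert not is_relevant(11, 37, 16, widths)      # [x=4, y=0] is outside it
assert next_jump_in(11, 37, 16, widths) == 33   # jump back in at [x=1, y=4]
```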
To summarize: identifying a minimum and maximum Z-address for our query gives us a single linear query range to explore for relevant results. Using nextJumpIn allows us to skip regions of the query range that are irrelevant, breaking that range down into a list of subranges that contain no garbage values. Computing these subranges can be done locally, without any interaction with the database. This property is very powerful—we can get flawless results from the database if we’re willing to do a little bit of math beforehand.
Although this approach sounds very promising, it’s not without drawbacks. Breaking our single search range into more precise subranges reduces the number of garbage items being scanned, thereby reducing read capacity consumption. However, because DynamoDB doesn’t currently support batched range queries, each separate subrange represents an additional API call to make. Every API call requires execution time and consumes a small amount of read capacity even if it doesn’t find or scan any items.
Search space sparseness
The optimal approach to querying a Z-order index varies on a case-by-case basis. What works best depends heavily on how much data is actually stored in the table relative to the number of distinct Z-addresses that the index’s bits can represent.
Let’s revisit our earlier example in which we stored items that represented 2D points on a plane. Each entry had an X and a Y field that were represented by a single byte apiece. As a result, the Z-addresses in that table are 2 bytes long, giving a search space of 65,536 distinct values. We’ll consider three approaches to querying this table.
Naive querying
Imagine that our table contains a single 2D point. In this case, our data’s Z-addresses are incredibly sparse—only one out of the potential 65,536 Z-addresses is in use. Knowing this, we can build a query that requests a huge Z-address range with the confidence that the range contains a relatively low number of items, garbage or not.
In such a situation, it makes sense to use naive querying—we simply calculate the minimum and maximum Z-addresses for our query as described previously and perform an API request for that single, broad range.
This approach minimizes both the amount of up-front query calculation required and the number of API requests being made. Some garbage items might be scanned, but we know that they’ll be few and far between and our FilterExpression will prevent them from being returned.
Precise querying
Now let’s consider the opposite case. Imagine that our table contains 65,000 items. Nearly all of our 65,536 potential Z-addresses are in use. Now there’s a very high probability that a naive query will scan a substantial number of garbage items. We might even go through several pages of data that return few or no results, wasting read capacity.
In this situation, it is worthwhile to spend some time calculating all of the subranges that are free of garbage items before we begin running our query. Because our data is very dense in relationship to our search space, very few of our subrange queries will have been wasted effort. We might perform many more individual queries than we would using the naive technique, but almost all of them will return relevant items and none of them will scan garbage data.
Note that while the data in the database is dynamic, with items being added and deleted over time, our Z-address search space is completely fixed. It is always the same 65,536 potential values. Because of this, the subranges that we calculate for a query only need to be computed once. This makes these subranges ideal candidates for caching.
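As a rough illustration of precise querying, the following sketch builds on the is_relevant and next_jump_in helpers from earlier to enumerate every garbage-free subrange between the naive minimum and maximum. For the earlier rectangle (x between 1 and 3, y between 3 and 4), it yields exactly the four subranges listed previously.

```python
def garbage_free_subranges(z_min, z_max, widths):
    """List the (start, end) Z-address subranges that lie entirely inside
    the query rectangle defined by z_min and z_max."""
    subranges = []
    start = (z_min if is_relevant(z_min, z_max, z_min, widths)
             else next_jump_in(z_min, z_max, z_min, widths))
    while start is not None:
        end = start
        while end < z_max and is_relevant(z_min, z_max, end + 1, widths):
            end += 1
        subranges.append((start, end))
        start = next_jump_in(z_min, z_max, end, widths)
    return subranges

assert garbage_free_subranges(11, 37, [3, 3]) == [(11, 11), (14, 15), (33, 33), (36, 37)]
```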
Page jump querying
The two scenarios described previously represent the extremes of a continuum. Realistically, most tables fall somewhere in the middle, with data that is neither exceptionally sparse nor exceptionally dense relative to the Z-address domain. Because naive and precise querying each perform best at their respective extremes, we need a technique that offers a compromise: a good balance between the computation required, the number of garbage items scanned, and the number of API requests made. Critically, we also need a procedure that we can implement using DynamoDB’s existing client APIs. Fortunately, such a procedure exists.
Rather than deciding in advance whether we should calculate a single (naive) subrange or painstakingly calculate all of them, we can intelligently probe the search space as we go (a code sketch of this loop follows the list):
1. Decide on a range for each dimension of our data (for example, x between 4 and 7, y between 6 and 9).
2. Using these ranges, calculate the naive Z-address query range.
3. Use isRelevant and nextJumpIn to locate the start of the first relevant query subrange.
4. Create a DynamoDB API request that:
   - Uses the range from the start of the first relevant query subrange to the naive Z-address query maximum as its key condition.
   - Uses the dimensional ranges as its filter expression.
   - Uses DynamoDB’s maxPageSize feature (also known as Limit) to ask that no more than N items be scanned. (The number of items scanned is not the same as the number of items returned.) We might get back a page that contains zero items, but we will not have examined more than N items. N is a tunable value, but it should be relatively small (for example, 16).
5. Submit your request.
6. If the response from DynamoDB doesn’t include a LastEvaluatedKey field, our search is complete. If LastEvaluatedKey does appear in the response, it represents the last key that DynamoDB scanned, regardless of whether that item was returned. Because we’re using a Z-order index, LastEvaluatedKey is a Z-address. We can calculate a new value, nextKeyToEvaluate (the next Z-address after LastEvaluatedKey), by simply incrementing LastEvaluatedKey. Once we have computed nextKeyToEvaluate, there are two possibilities:
   - isRelevant(naiveMinimum, naiveMaximum, nextKeyToEvaluate) is true: we’re still inside a relevant subrange of our search space, a part of the Z-order curve within our query rectangle. In this case, we repeat our original query with one small modification: we replace the lower Z-address bound with nextKeyToEvaluate, keeping the naive maximum as our upper bound.
   - isRelevant(naiveMinimum, naiveMaximum, nextKeyToEvaluate) is false: we’ve stepped outside of the query rectangle. We use the nextJumpIn function to determine the start of the next relevant subrange. If nextJumpIn(naiveMinimum, naiveMaximum, nextKeyToEvaluate) does not find a next value, our search is complete. Otherwise, we submit our query using nextJumpIn(naiveMinimum, naiveMaximum, nextKeyToEvaluate) as the new lower bound, retaining the naive maximum as the upper bound.
7. Repeat step 6 as necessary until the response from DynamoDB doesn’t include a LastEvaluatedKey field, indicating that your search is complete.
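Here is a rough Boto3 sketch of that loop. It assumes a Z-order index whose partition key is sourceId and whose sort key is a numeric z_address attribute (how the index itself is built and maintained is covered in Part 2), and it reuses the is_relevant and next_jump_in helpers sketched earlier.

```python
from boto3.dynamodb.conditions import Key

def page_jump_query(table, source_id, z_min, z_max, widths,
                    filter_expression, page_scan_limit=16):
    """Sketch of the page-jump procedure described above (steps 3 through 7)."""
    items = []
    lower = (z_min if is_relevant(z_min, z_max, z_min, widths)
             else next_jump_in(z_min, z_max, z_min, widths))
    while lower is not None:
        response = table.query(
            KeyConditionExpression=(
                Key("sourceId").eq(source_id)
                & Key("z_address").between(lower, z_max)
            ),
            FilterExpression=filter_expression,   # the dimensional ranges
            Limit=page_scan_limit,                # scan at most N items per request
        )
        items.extend(response["Items"])
        last_key = response.get("LastEvaluatedKey")
        if last_key is None:
            break                                 # the search is complete
        next_key = int(last_key["z_address"]) + 1
        if is_relevant(z_min, z_max, next_key, widths):
            lower = next_key                      # still inside a relevant subrange
        else:
            lower = next_jump_in(z_min, z_max, next_key, widths)
    return items
```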
This strategy ensures that each page that we request from the server has a minimum bound that is within a relevant subrange. The database will scan N items and return those that pass the filter expression as a page of data. In scanning N items, it might step outside of the relevant subrange that it started in or it might not, depending on the data’s density. If it does step outside of the range, it is very possible that it will step into the next range, covering more than a single range per request. (Recall our earlier example—a subrange can be as narrow as a single value.)
If our data is sparse, each request that we perform might traverse vast regions of our Z-address search space by scanning N items. We will discover all of the relevant documents in relatively few API requests.
If our data is dense, we will only scan up to (N – 1) irrelevant items per request. If several small subranges fit within a range of N items, we reduce the number of API requests that we have to make.
Benchmarks
Let’s look at the queries that we ran earlier using conventional indexes. This time, we’ll compare their performance and capacity costs to those of the same queries being run against a Z-order index built using the same data.
For the purposes of our comparison, we will calculate the cost of provisioning enough capacity to allow us to run each query up to once per second for a 30-day period. At the time of this writing, the cost of provisioning a single read capacity unit is $0.00013 per hour. We’ll look at the cost of provisioning UNITS_NEEDED read capacity units every hour for 30 days using the following formula:
UNITS_NEEDED * 0.00013 * 24 * 30
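For example, the 447.5 read capacity units consumed by query 1 against the timestamp index work out to 447.5 * 0.00013 * 24 * 30 ≈ $41.89 over 30 days, which is the figure shown in the first table following.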
DynamoDB’s costs are calculated using a combination of factors. See the official documentation for details.
Query 1: How warm did it get in Atlanta, Georgia, in the last week of March?
| | timestamp index | z_address index, N=16 |
|---|---|---|
| Number of Items Scanned | 23,102 | 630 |
| Read Capacity Consumed | 447.5 | 20 |
| Capacity Cost Per 30 Days | $41.89 | $1.87 |
Query 2: How often was it cold enough to snow in New York City in the first quarter of 2016?
| | timestamp index | z_address index, N=16 |
|---|---|---|
| Number of Items Scanned | 300,000 | 560 |
| Read Capacity Consumed | 5,812.5 | 18 |
| Capacity Cost Per 30 Days | $544.05 | $1.68 |
Query 3: In what places did we record exactly 0°C (32° F) in the last hour?
| | timestamp index | z_address index, N=16 |
|---|---|---|
| Number of Items Scanned | 149 | 3,569 |
| Read Capacity Consumed | 3 | 149 |
| Capacity Cost Per 30 Days | $0.28 | $13.95 |
As you can see, a single Z-order index offers great performance across a variety of queries. What’s more, its ability to minimize read operations can equate to real savings.
Unbound dimensions
You might wonder why the Z-order index performed noticeably worse in the third query than it did in the first two. As our earlier analysis indicated, this query has two unhelpful fields: latitude and longitude. Z-order indexes derive value from each of their constituent dimensions. In this example query, we are only offering meaningful bounds on two of the four fields that were used to build our index; 50 percent of our query is open-ended. By contrast, the timestamp index is tailor-made for queries exactly like this, in which the window of time being searched is very narrow indeed. With that in mind, it’s pretty impressive that the Z-order index still managed to provide such reasonable performance. However, it’s also an important design consideration: The more dimensions a query leaves unbound, the more that query’s performance degrades. It is critical to consider what queries you’ll be running when selecting the dimensions that you plan to include in your Z-addresses.
Summary
Z-order indexes allow developers to blend a set of attributes from a table’s items together to form a single, scalar, representative value. This value acts as the item’s sort key, allowing us to easily work within DynamoDB’s composite key uniqueness constraint. It also allows us to run queries against our data that benefit from multiple dimensional constraints, making Z-order indexes particularly suited to geospatial data and AWS IoT use cases.
Z-order indexes can outperform traditional indexes by one or more orders of magnitude in the best case, allowing you to save money on provisioned throughput. Depending on your query patterns, they might be able to replace entire sets of LSIs, reducing write overhead and cutting storage costs.
In Part 2 of this series we’ll dive even deeper into the details of using Z-order indexes in DynamoDB. We’ll explore:
- Concrete steps to create and maintain a Z-order index
- How to work with a variety of data types in your index dimensions
- How to parallelize your queries to cut execution time without increasing read capacity consumption
- Guidelines for keeping your search space reasonable
About the Author
Zack Slayton is a software development engineer at Amazon. He works on the open source Ion data serialization format, which was built to address rapid development, decoupling, and efficiency challenges associated with large-scale, service-oriented architectures.