AWS Database Blog

How to design Amazon DynamoDB global secondary indexes

Back in college, I created entity-relationship diagrams to model the system requirements of a relational database. The process involved finding all of the entities of the software system and defining relationships among them. I then modeled the relationships and entities into database tables before deciding which queries the database had to support. This method of designing database schemas worked well until I started using nonrelational databases in order to take advantage of their scalability and more consistent performance.

With nonrelational databases, the approach for designing a schema proceeds in reverse. You use a “query first” approach to identify the queries the applications need before designing the database schema. Data is therefore explicitly stored the way that the application needs to use it, increasing query efficiency.

If you also want to add flexibility to your queries, you can use global secondary indexes with Amazon DynamoDB. When you use global secondary indexes on a DynamoDB table, you can query data flexibly in other dimensions, using nonkey attributes.

However, in order to maintain efficient query performance, you have to design the schema of your global secondary indexes carefully, in the same way that you designed the schema for the DynamoDB table. In this blog post, I show an approach for designing the schema of a global secondary index, explain how to avoid common pitfalls in the design process, and provide tips for reducing costs.

The schema design process for global secondary indexes

The following diagram summarizes this post’s approach for how to design the schema of global secondary indexes.

Diagram summarizing this post’s approach for how to design the schema of global secondary indexes

Identify the query patterns

Application-specific query patterns (the types of queries your table supports) drive the design of global secondary indexes. The main question driving your design should be “What questions do I need the global secondary index to answer?”

When you’ve determined the questions that have to be answered, map the questions to queries of your table data. Use data filters that are based on range queries such as “greater than,” “less than,” “between,” and “starts with.” You should also think about other data that must be accessed but that does not require filtering or sorting. For example, to display product information on an online shopping website, you filter your data by the ProductId of the product. However, other data that would need to be accessed in the query could include product description, price, weight, product color, and so on. Identify as many queries as possible in advance. Accounting for them in the schema design helps to optimize the cost and performance of your global secondary indexes.

Let’s look at an example to see how an application-specific query translates to table queries. Let’s say an online shopping website stores all of the orders of a customer in an Orders table with OrderId as the partition key. The table also stores other data about the order such as OrderDate, CustomerId, and Status. The following table shows some of the common application-specific questions and their corresponding table queries.

Application specific question Table query
Find all orders for a customer sorted by order date Filter all orders in the Orders table by CustomerId and then sort by OrderDate
Get orders for a given customer within a given date range Filter all orders in the Orders table by CustomerId and then filter by range query on OrderDate
Find all pending orders for a customer Filter all orders in the Orders table by CustomerId and then filter by Status as Pending
Find all pending orders for a customer that are more than five days old Filter all orders in the Orders table by CustomerId and then filter by Status as Pending and range query of OrderDate < CurrentDate-5
Get OrderId, OrderDate, and Status for all orders of a customer Filter all orders in the Orders table by CustomerId and get their OrderId, OrderDate, and Status

Identify the candidate fields

After you identify the query patterns, you identify the global secondary index candidate fields that are required to query data for these query patterns. First, let’s see how candidate fields are formed and how they should be used.

Candidate fields of a global secondary index include the candidate partition key, candidate sort key, and attribute projections. Global secondary indexes use the value of the partition key as input to an internal hash function, which is similar to how a partition key is used in a DynamoDB table. The output value from the hash function determines the partition in which the item is stored. A partition key value is always required to query a global secondary index.

A sort key is an optional key that can be used to store all of the items with the same partition key value so that they are physically close together, ordered by sort key value. When you query a global secondary index, you can apply a condition to the sort key so that it returns only items within a certain range of values.

Attribute projections include the set of attributes that are copied from a table into the global secondary index. The partition key and sort key of the table are always projected into the global secondary index. You can access the projected attributes by querying the global secondary index.

The following table shows which candidate fields to choose for some of the more common query patterns.

Query Candidate partition key Candidate sort key Attribute projections
Filter table records on A A
Filter table records on A and sort on B A B
Filter table records on A and B A B
Filter table records on A and B, and sort on C A:B (composite partition key) C
Filter table records on A, and sort on B and C A B:C (composite sort key)
Filter table records on A and then filter on B by a range query such as GREATER THAN, LESS THAN, STARTS WITH, or BETWEEN A B
Filter table records on A and access a subset of attributes B, C, and D, but do not filter or sort by these attributes A B, C, D

Global secondary indexes do not allow you to query data in fields other than primary keys. This means that all fields that require data to be filtered on them should be part of either the candidate partition key or the candidate sort key.

Answering the following questions can help you determine if the field should be a candidate partition key or a candidate sort key:

  • Is data sorting required on this field? If so, the field should be part of candidate sort key.
  • Is a range-based query needed on this field? If so, the field should be part of candidate sort key.
  • Is data sorting on more than one field required? If so, combine these fields into a single composite field and use it as the candidate sort key.
  • Is data filtering required on more than one field? If so, combine these fields into a single composite field and use it as the candidate partition key.
  • Is data sorting required on a field along with a range query on a different field? If so, combine these fields into a single composite attribute and use it as the candidate sort key.
  • Is filtering on an attribute and then subsequent filtering on another attribute required? If so, the field for initial filtering should be part of candidate partition key and the field for subsequent filtering should be part of candidate sort key.

All other fields in the query that must be accessed but don’t need data to be filtered on should be included in attribute projections.

The following table shows an example, using the online shopping website stores data used earlier.

Table query Candidate partition key Candidate sort key Attribute projections
Filter all orders in the Orders table by CustomerId and then sort by OrderDate CustomerId OrderDate
Filter all orders in the Orders table by CustomerId and then filter by range query on OrderDate CustomerId OrderDate
Filter all orders in the Orders table by CustomerId and then filter by Status as Pending CustomerId Status
Filter all orders in the Orders table by CustomerId and then filter by Status as Pending and range query of OrderDate < CurrentDate-5 CustomerId Status:OrderDate (composite sort key)
Filter all orders in the Orders table by CustomerId and get their OrderId, OrderDate, and Status CustomerId OrderId, OrderDate, Status

Optimize candidate fields for cost and performance

You should optimize the candidate fields you identify for each query for cost and performance, before you use them in a global secondary index schema.

Optimize candidate fields for cost

Every global secondary index is provisioned independently and maintains its own copy of the data separately from the base table. As a result, sharing indexes to answer more than one query helps reduce the cost of maintaining an index.

Answering the following questions can help you design the best schema and optimize for cost.

  1. Can I use a single global secondary index to answer multiple queries?
    Sometimes multiple queries can be answered by using a single global secondary index. Some of the common use cases in which global secondary indexes can be reused are:

    1. Using global secondary indexes with a composite primary key (partition key and sort key): A composite primary key gives you additional flexibility when querying data. When you use a composite primary key, you can query your data by either specifying the partition key or by specifying both the partition key and sort key. Let’s go back to the online shopping website example used earlier to understand how we can reuse a global secondary index to answer multiple queries using this technique. The following table shows queries that can be answered by using a single global secondary index.
      Query Table query Candidate partition key Candidate sort key Attribute projections
      1 Filter all orders in the Orders table by CustomerId and then sort by OrderDate CustomerId OrderDate OrderId, OrderDate, Status
      2 Filter all orders in the Orders table by CustomerId and then filter by range query on OrderDate
      3 Filter all orders in the Orders table by CustomerId and get their OrderId, OrderDate, and Status

      For query 1 and 2 in the preceding table, you can query the global secondary index with both the partition key and sort key. For query 3, the same global secondary index can be queried with just the partition key. As a result, a single global secondary index with CustomerId as the partition key and OrderDate as the sort key can be reused to answer all three queries – there’s no need to create a separate global secondary index for each query.

    2. Using composite sort keys: Composite sort keys are the sort keys that are created by combining more than one attribute. When you use composite sort keys as sort keys of a global secondary index, they enable powerful querying capabilities by using various key conditions such as BeginsWith, Greater Than, Less Than, and Between. For example, let’s say a local search engine application allows location-aware search queries of local businesses. Each business listing is stored in a DynamoDB table called Business with BusinessId as the partition key. The table does not have a sort key. Each business listing has BusinessType, Country, State, City, BusinessName, Business Address, and PhoneNumber attributes associated with it.The application requires support for the following query patterns:
      • Filter businesses of Business Type A and Country B.
      • Filter businesses of Business Type A, Country B, and State C.
      • Filter businesses of Business Type A, Country B, State C, and City D.

      The following table shows the candidate fields of the global secondary index for each of these queries.

      Query Table query Candidate partition key Candidate sort key Attribute projections
      1 Filter businesses of BusinessType A and Country B BusinessType Country
      2 Filter businesses of BusinessType A, Country B, and State C BusinessType Country:State
      3 Filter businesses of BusinessType A, Country B, State C, and City D BusinessType Country:State:
      City

      A single composite global secondary index, with BusinessType as the partition key and Country:State:City as the sort key can support all three queries. There is no need to create a separate global secondary index for each query. The idea here is to use the BeginsWith condition for the sort key condition while querying the data from the global secondary index.

      The following table shows some of the records in this global secondary index.

      Business type (partition key) Country:State:City (sort key) Business name Business address Phone number
      FitnessCenter USA:Washington:Seattle Business1 Address1 111-111-1111
      CoffeeShop USA:California:PaloAlto Business2 Address2 222-222-2222
      Restaurant India:Maharashtra:Mumbai Business3 Address3 123-456-7890

      For this example, let’s look at some application queries that use a composite global secondary index to fetch the required data.

      Query Partition key condition Sort key condition
      Find all coffee shops in the United States BusinessType equals CoffeeShop Country:State:City begins with USA
      Find all restaurants in the State of Washington BusinessType equals Restaurant Country:State:City begins with USA:Washington
      Find all fitness centers in Seattle BusinessType equals FitnessCenter Country:State:City equals USA:Washington:Seattle
  2. Am I projecting the right data?You can reduce storage and read costs by creating global secondary indexes on attributes that store filtered records, rather than attributes that require you to filter the output of your query. Don’t project attributes that you know will never be queried. It’s also best to avoid projecting attributes that you think will rarely be needed in queries. Every time you update an attribute that is projected in an index, you incur the extra cost of also updating the index. Remember that you can still retrieve nonprojected attributes by querying your table.

Optimize for provisioned throughput use

You can achieve a better use of provisioned throughput for a global secondary index by distributing read and write requests across its partitions. This means you should design your global secondary index schema so that the partition key is accessed uniformly. You don’t have to access all partition key values to achieve an efficient throughput level, and the percentage of accessed partition key values doesn’t have to be high. However, your goal should be to maximize the number of distinct partition key values that your workload accesses, because that will increase the number of requests that are spread across the partitioned space. In general, you use your provisioned throughput more efficiently as the ratio of partition key values accessed to the total number of partition key values increases.

Choose global secondary index partition key attributes that have a large number of distinct values. Some common ways to achieve such well-distributed partition keys include:

  • Unique attributes: Use attributes such as emailId, phoneNumber, orderId, customerId, and sessionId that have distinct values for each item (high cardinality attributes).
  • Composite attributes: Try to combine more than one attribute to form a unique key, if that works with your access pattern. For example, an Orders table might have customerid+productid+countrycode as the partition key and order_date as the sort key.
  • Add a random number: From a predetermined range (a fixed number from, say, 1 to 10), choose a random number and add it to the partition This applies if you need a global secondary index on a field with few values, such as a Boolean flag or an enum value.

For example, an Orders table might have OrderId as the partition key and Status as one of the attributes. The Status field of an order can be Ordered, In Transit, or Delivered. The application requires querying on all of the orders in a specific status. A global secondary index on the Status field results in low cardinality of the global secondary index’s partition key. Because a low cardinality partition key results in a skewed key distribution, we recommend instead that you append a randomly generated suffix from 1 to 10 to each partition key. The application can then query the global secondary index with each of the 10 partition keys in parallel and merge the results to get all of the orders in a specific status.

Best practices for global secondary indexes

Consider the following recommended practices when using global secondary indexes.

  • Provision sufficient capacity: Provisioning sufficient capacity for your global secondary indexes is crucial. Failure to do so can result in provisioned throughput exceptions to your base table writes. Updates to the table require corresponding global secondary index updates, and they consume the provisioned write capacity of the global secondary index and not the table. Under-provisioning your global secondary index not only slows down the updates to your index, but it also ultimately results in failed writes to your table.The simplest way to prevent under-provisioning is to select on-demand capacity mode for the base table (which is then applied to all global secondary indexes associated with the base table).
  • Handle eventual consistency: Global secondary indexes are eventually consistent because attributes are projected asynchronously. Updates are usually propagated to global secondary indexes within a fraction of a second. Most of the applications do not really need strong consistency guarantees for their use cases, as long as the propagation to your index is fast. Take a moment to consider whether your use case actually demands a strong consistency guarantee. If it does, consider using DynamoDB transactions instead as a way to enforce strong consistency across tables.

Additional tips for reducing costs when using global secondary indexes

The following additional tips can be helpful in reducing costs when using global secondary indexes:

  • Choose attribute names wisely for your global secondary indexes
    The size of an item is calculated by adding the lengths of its attribute names and values. Having shorter attribute names helps reduce the storage and write costs. For example, instead of using Customer_Age as an attribute name, consider using Age.
  • Use sparse indexes
    DynamoDB writes a corresponding global secondary index item only if the global secondary index key value is present in the item. If the key doesn’t appear in every table item, the global secondary index is said to be sparse. Avoid storing null and empty values for the global secondary index key attributes. If the value of a global secondary index key attribute is null or empty, it is better to just skip the attribute when writing it. Because global secondary indexes are stored separately, if you skip writing null or empty attributes they are not projected to the global secondary index, saving storage and write cost. Use sparse indexes whenever possible.
  • Consider item size
    The total number of write capacity units required to write an item to DynamoDB depends on the item size. Therefore, it’s cost effective to keep the size of item small, regardless of whether your table uses provisioned capacity mode or on-demand capacity mode. When storing large attributes in your DynamoDB table, consider:

    • Compressing the attributes before storing them in DynamoDB: Compressing large attribute values can help reduce the size of DynamoDB items and reduce your storage costs.
    • Storing the attributes in Amazon S3 and having the Amazon S3 mapping in your DynamoDB table: You can store attributes as an object in Amazon S3 and then store the object identifier in your DynamoDB item. Anytime you need to read the full item, you can use the object identifier stored in the item and read it from Amazon S3.

For more information about reducing item size, see Best Practices for Storing Large Items and Attributes.

Summary

In this post, I walked through the process of designing a schema for global secondary indexes. A good schema design helps in maximizing the performance and minimizing the cost of querying your data from global secondary indexes. Good schema design includes reusing your global secondary indexes to support multiple queries, and optimizing provisioned throughput usage. I also described best practices that can help you avoid pitfalls when using global secondary indexes.


About the author

Photo of ShubhamShubham Sethi is an AWS software development engineer.