AWS Database Blog

How to index on Amazon DocumentDB (with MongoDB compatibility)

Amazon DocumentDB (with MongoDB compatibility) is a fast, scalable, highly available, and fully managed document database service that supports MongoDB workloads. You can use the same MongoDB 3.6 and 4.0 application code, drivers, and tools to run, manage, and scale workloads on Amazon DocumentDB without having to worry about managing the underlying infrastructure. As a document database, Amazon DocumentDB makes it easy to store, query, and index JSON data.

In this post, I go over the different index types and how to best work with them in Amazon DocumentDB. Indexing is not unique to document databases; it's a concept used in nearly all database engines. In Amazon DocumentDB, indexes help improve the performance of queries. Without an index, Amazon DocumentDB has to scan every document in a collection to select those that match the query. Indexes are references to your documents, efficiently ordered by keys; because they're stored in sorted order, they allow fast lookups.

Indexing overview

Amazon DocumentDB uses a B-tree data structure for its indexes. A B-tree index uses a hierarchical structure that stores data in its nodes in sorted order. Keeping the data sorted this way allows reading, inserting, and deleting data in a memory-efficient way. As new data is written, in addition to modifying the documents in the collection, the indexes need to be updated, which can impact write latencies and increase IOPS. By choosing fields that have high cardinality, such as an identity field, you increase the index's selectivity. Conversely, choosing a low-cardinality field, such as a Boolean with only two unique values, can retrieve more data than is needed, which increases memory usage and the risk of having to reach into the distributed storage volume. The goal with index selectivity is to let the query filter out as much data as possible up front, so the query processor does less work.
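
To gauge a field's cardinality before creating an index, you can count its distinct values with the aggregation pipeline. The following is a minimal sketch using the sample food collection introduced later in this post (the cuisine field comes from those documents); a distinct count that is tiny relative to the collection size suggests the field is a poor index candidate:

rs0:PRIMARY> db.food.aggregate([
        { $group: { _id: "$cuisine" } },
        { $count: "distinctCuisines" }
])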

When you run a query, Amazon DocumentDB filters and retrieves documents in one of two ways: a collection scan or an index scan. A collection scan inspects the entire collection, document by document, to find a match, and is compute and I/O intensive. You can make a query more efficient by creating an index that limits the number of documents retrieved. The index stores a small portion of the dataset, holding the value of a specific field or set of fields, ordered by the value of those fields. This ordering follows the key order in the index definition, starting with the leading key. Consider a phone book that is ordered alphabetically; it has entries organized by last name (the leading key), then by first name. When looking for a phone number, you can skip directly to the section containing the first letter of the last name you're looking for and continue filtering down using the first name until you retrieve your result; this is equivalent to an index scan. If, instead of being ordered by last name, the phone book simply added new people as they moved into town, you could no longer filter by last name. You would have to go through the entire phone book, page by page, looking for your result. This is similar to the collection scan Amazon DocumentDB performs when there is no index.

Index types

If you create an index on your collection of documents, you can avoid high compute and I/O usage. You can choose from several types of indexes:

  • Single field index
  • Compound index
  • Multi-key index
  • Compound multi-key index
  • Sparse index
  • Time to Live (TTL) index

The following is a real-world example of how to apply and use these indexes. This collection contains approximately 1 million documents with the following key-value pairs:

rs0:PRIMARY> db.food.findOne()
{
        "_id" : ObjectId("6010bc96f2612d11c720ce01"),
        "name" : "Chefs Table at the Edgewater",
        "cuisine" : "Italian",
        "stars" : 1.6,
        "address" : {
                "street" : "371 Gule Loop",
                "city" : "Ujukamew",
                "state" : "ME",
                "zipcode" : "62421"
        },
        "keywords" : [
                "Italian",
                "Delivery",
                "New"
        ],
        "meals" : [
                "lunch",
                "dinner"
        ],
	 "special_diets" : [
                "vegetarian friendly",
                "vegan options"
        ],
        "patrons" : {
                "name" : "Ted Bracken",
                "visits" : 3,
                "preferences" : {
                        "waiter" : "Pat Reah",
                        "drink" : "iced tea",
                        "allergies" : {
                                "food" : eggs
                        }
                }
        }
}
rs0:PRIMARY> db.food.count()
1000021

You can run the getIndexes() operator to view all the indexes on the collection:

rs0:PRIMARY> db.food.getIndexes()
[
        {
                "v" : 4,
                "key" : {
                        "id" : 1
                },
                "name" : "_id",
                "ns" : "business.food"
        }
]

You can find the following information from this output:

 "v" : 4,

This is the version of the index. Version 4 is the default index version.

When you create a collection, a default index is created on the _id field:

                "key" : {
                        "id" : 1
                },

This shows you the _id field is the key for your index and it’s in ascending (1) order. If you have an index on multiple fields, they show in this key field in the order they were passed to the createIndex() operator.

In the following code, the name of the index is _id_, and ns is the index namespace, written in the format {database name}.{collection name}:

               },
                "name" : "_id",
                "ns" : "business.food"
        }

In all the examples in this post, you call on the createIndex() operator to create the indexes, passing along the optional name parameter. This allows you to give your indexes names you choose. If you don’t include the name parameter, the index names concatenate the names of the indexed fields and the sort order. For example, the following code creates an ascending index on the cuisine field and names it cuisine_asc:

rs0:PRIMARY> db.food.createIndex({"cuisine":1},{name:"cuisine_asc"})
{
        "createdCollectionAutomatically" : false,
        "numIndexesBefore" : 1,
        "numIndexesAfter" : 2,
        "ok" : 1,
        "operationTime" : Timestamp(1611136273, 1)
}
rs0:PRIMARY> db.food.getIndexes()
[
        {
                "v" : 4,
                "key" : {
                        "_id" : 1
                },
                "name" : "_id_",
                "ns" : "business.food"
        },
        {
                "v" : 4,
                "key" : {
                        "cuisine" : 1
                },
                "name" : "cuisine_asc",
                "ns" : "business.food"
        }
]

If you don’t add the name parameter, the name of the index is cuisine_1:

rs0:PRIMARY> db.food.createIndex({"cuisine":1})
{
        "createdCollectionAutomatically" : false,
        "numIndexesBefore" : 1,
        "numIndexesAfter" : 2,
        "ok" : 1,
        "operationTime" : Timestamp(1611136477, 1)
}
rs0:PRIMARY> db.food.getIndexes()
[
        {
                "v" : 4,
                "key" : {
                        "_id" : 1
                },
                "name" : "_id_",
                "ns" : "business.food"
        },
        {
                "v" : 4,
                "key" : {
                        "cuisine" : 1
                },
                "name" : "cuisine_1",
                "ns" : "business.food"
        }
]

Single field index

Without any additional indexes, if you query the name field and view the output of explain("executionStats"), it shows that the query performed a collection scan with a run time ("executionTimeMillis") of 770.505 milliseconds. From the earlier count, you know the collection contains over 1 million documents, so the query has to look at every one of them to find matches on the name field. See the following code:

rs0:PRIMARY> db.food.find({"name":"Chefs Table at the Edgewater"}).explain("executionStats")
{
        "queryPlanner" : {
                "plannerVersion" : 1,
                "namespace" : "business.food",
                "winningPlan" : {
                        "stage" : "COLLSCAN"
                }
        },
        "executionStats" : {
                "executionSuccess" : true,
                "executionTimeMillis" : "770.505",
                "planningTimeMillis" : "0.044",
                "executionStages" : {
                        "stage" : "COLLSCAN",
                        "nReturned" : "9903",
                        "executionTimeMillisEstimate" : "770.064"
                }
        },
        "serverInfo" : {
                "host" : "docdbinstance01",
                "port" : 27017,
                "version" : "4.0.0"
        },
        "ok" : 1,
        "operationTime" : Timestamp(1611136872, 1)
}

To be more efficient and query the collection as quickly as possible, you can create a single field index on the name field. With a single field index, you can index on any field in a collection of documents. See the following code:

rs0:PRIMARY> db.food.createIndex({"name":1},{name: "name_asc"})
{
        "createdCollectionAutomatically" : false,
        "numIndexesBefore" : 1,
        "numIndexesAfter" : 2,
        "ok" : 1,
        "operationTime" : Timestamp(1611137232, 1)
}

Walking through the preceding command, you begin by passing along the details of your index:

rs0:PRIMARY> db.food.createIndex({"name":1},{name: "name_asc"})

This command instructs Amazon DocumentDB to create an index on the name field in ascending order (:1). You have two options for an index’s key pattern: 1 (ascending) or -1 (descending). Direction doesn’t matter for single field indexes, but if you’re doing sorts or range queries on compound indexes, ensure your direction is correct. The code also passes the optional name parameter to name the index name_asc.
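
To illustrate why direction matters on compound indexes, the following is a hedged sketch (the cuisine_stars_mixed index is illustrative and isn't used elsewhere in this post). An index ascending on cuisine and descending on stars can satisfy a query that filters on cuisine and sorts by stars from highest to lowest (or the exact reverse) without an extra sort step:

rs0:PRIMARY> db.food.createIndex({"cuisine":1, "stars":-1},{name: "cuisine_stars_mixed"})
rs0:PRIMARY> db.food.find({"cuisine":"Italian"}).sort({"stars":-1})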

Returning to the name_asc index, the output of its createIndex() operation shows the following details:

        "createdCollectionAutomatically" : false,

If you create an index on a collection that doesn’t exist, Amazon DocumentDB creates the collection as part of the index creation (the result is true). In our example, this is false because the food collection already exists.

The following code shows how many indexes existed on the collection before you added the new index and after the new index is created:

        "numIndexesBefore" : 1,
        "numIndexesAfter" : 2,

When you create a collection, the _id field is automatically indexed, so your numIndexesBefore is 1.

The following code is the status of the createIndex() operation, indicating it’s successful:

        "ok" : 1

If an error occurs when creating the index, you see an ok value of 0 and an errmsg you can use to troubleshoot the issue:

rs0:PRIMARY> db.food.createIndex({"name":2},{name: "name_asc"})
{
    "ok" : 0,
    "code" : 67,
    "errmsg" : "Values in the index key pattern can only be 1 or -1",
    "operationTime" : Timestamp(1611137352, 1)

}

If you attempt to create an index that already exists, you receive a message similar to the following:

rs0:PRIMARY> db.food.createIndex({"name":1},{name: "name_asc"})
{
        "numIndexesBefore" : 2,
        "numIndexesAfter" : 2,
        "note" : "all indexes already exist",
        "ok" : 1,
    	"operationTime" : Timestamp(1611137652, 1)
}

When you run a query against the name field, you can see that the newly created index was used for an index scan (“stage” : “IXSCAN”), improving the executionTimeMillis to 9.961:

rs0:PRIMARY> db.food.find({"name":"Chefs Table at the Edgewater"}).explain("executionStats")
{
        "queryPlanner" : {
                "plannerVersion" : 1,
                "namespace" : "business.food",
                "winningPlan" : {
                        "stage" : "IXSCAN",
                        "indexName" : "name_1",
                        "direction" : "forward"
                }
        },
        "executionStats" : {
                "executionSuccess" : true,
                "executionTimeMillis" : "9.961",
                "planningTimeMillis" : "0.282",
                "executionStages" : {
                        "stage" : "IXSCAN",
                        "nReturned" : "9903",
                        "executionTimeMillisEstimate" : "9.375",
                        "indexName" : "name_1",
                        "direction" : "forward"
                }
        },
        "serverInfo" : {
                "host" : "docdbinstance01",
                "port" : 27017,
                "version" : "4.0.0"
        },
        "ok" : 1,
    	 "operationTime" : Timestamp(1611137832, 1)
}

By adding an index, you have decreased the query run time from 770 milliseconds to 9.9 milliseconds.

You can also create indexes on nested fields within a document. For example, you may want to query quickly on the restaurant’s state field, embedded within the address field. See the following code:

rs0:PRIMARY> db.food.createIndex({"address.state":1},{name: "address_state_asc"})
{
        "createdCollectionAutomatically" : false,
        "numIndexesBefore" : 2,
        "numIndexesAfter" : 3,
        "ok" : 1,
    	 "operationTime" : Timestamp(1611138312, 1)
}
rs0:PRIMARY> db.food.find({"address.state":"ME"}).explain("executionStats")
{
        "queryPlanner" : {
                "plannerVersion" : 1,
                "namespace" : "business.food",
                "winningPlan" : {
                        "stage" : "IXSCAN",
                        "indexName" : "address.state_asc",
                        "direction" : "forward"
                }
        },
        "executionStats" : {
                "executionSuccess" : true,
                "executionTimeMillis" : "18.161",
                "planningTimeMillis" : "0.322",
                "executionStages" : {
                        "stage" : "IXSCAN",
                        "nReturned" : "19612",
                        "executionTimeMillisEstimate" : "17.207",
                        "indexName" : "address.state_asc",
                        "direction" : "forward"
                }
        },
        "serverInfo" : {
                "host" : "docdbinstance01",
                "port" : 27017,
                "version" : "4.0.0"
        },
        "ok" : 1,
    	 "operationTime" : Timestamp(1611139212, 1)
}

Compound index

A compound index holds references to multiple fields within a single index structure. You can optionally combine nested fields with top-level fields to create compound indexes. You can go as deep as you want with these indexes (for example, Country.State.County.City.Street.House), but there is a performance penalty for going too deep; it's recommended not to go more than three levels deep. Individual fields can share values across documents, but the combination of values across all the indexed fields gives the index higher selectivity than any single field.

For our collection, we want to query by both the cuisine and state fields. Create the index as follows:

rs0:PRIMARY> db.food.createIndex({"cuisine":1, "address.state":1},{name: "cuisine_state_asc"})
{
        "createdCollectionAutomatically" : false,
        "numIndexesBefore" : 3,
        "numIndexesAfter" : 4,
        "ok" : 1,
    	 "operationTime" : Timestamp(1611139872, 1)
}

The query does an index scan on the new index to find this information:

rs0:PRIMARY> db.food.find({"cuisine":"Italian","address.state":"ME"}).explain("executionStats")
{
        "queryPlanner" : {
                "plannerVersion" : 1,
                "namespace" : "business.food",
                "winningPlan" : {
                        "stage" : "IXSCAN",
                        "indexName" : "cuisine_state_asc",
                        "direction" : "forward"
                }
        },
        "executionStats" : {
                "executionSuccess" : true,
                "executionTimeMillis" : "1.242",
                "planningTimeMillis" : "0.545",
                "executionStages" : {
                        "stage" : "IXSCAN",
                        "nReturned" : "487",
                        "executionTimeMillisEstimate" : "0.663",
                        "indexName" : "cuisine_state_asc",
                        "direction" : "forward"
                }
        },
        "serverInfo" : {
                "host" : "docdbinstance01",
                "port" : 27017,
                "version" : "4.0.0"
        },
        "ok" : 1,
    	 "operationTime" : Timestamp(1611140049, 1)
}

Compound indexes work well when fields appear together in queries, because the combined key is more selective than either field alone. However, if the fields are usually queried individually, multiple single field indexes may be a better choice. A compound index can serve queries on a prefix of the index key (the first, or leading, fields of the index), but queries that filter only on non-prefix fields generally don't use the index. Generally, one compound index is smaller than multiple individual indexes.

For example, within this document, we have an embedded subdocument called patrons that includes food allergies:

        "patrons" : {
                "name" : "Mike Catcher",
                "visits" : 14,
                "preferences" : {
                        "waiter" : "Glen Key",
                        "drink" : "Diet Coke",
                        "allergies" : {
                                "food" : null
                        }
                }
        }

You create an index on cuisine and food (patrons.preferences.allergies.food) as follows:

rs0:PRIMARY> db.food.createIndex({"cuisine":1,"patrons.preferences.allergies.food":1},{name:"cuisine_food_allergies_asc"})
{
        "createdCollectionAutomatically" : false,
        "numIndexesBefore" : 2,
        "numIndexesAfter" : 3,
        "ok" : 1,
    	 "operationTime" : Timestamp(1611140127, 1)
}

When you query the cuisine field, you can see the index scan using the new index:

rs0:PRIMARY> db.food.find({"cuisine":"Nuveau Italian"}).explain("executionStats")
{
        "queryPlanner" : {
                "plannerVersion" : 1,
                "namespace" : "business.food",
                "winningPlan" : {
                        "stage" : "IXSCAN",
                        "indexName" : "cuisine_food_allergies_asc",
                        "direction" : "forward"
                }
        },
        "executionStats" : {
                "executionSuccess" : true,
                "executionTimeMillis" : "0.164",
                "planningTimeMillis" : "0.116",
                "executionStages" : {
                        "stage" : "IXSCAN",
                        "nReturned" : "1",
                        "executionTimeMillisEstimate" : "0.026",
                        "indexName" : "cuisine_food_allergies_asc",
                        "direction" : "forward"
                }
        },
        "serverInfo" : {
                "host" : "docdbinstance01",
                "port" : 27017,
                "version" : "4.0.0"
        },
        "ok" : 1,
    	 "operationTime" : Timestamp(1611140333, 1)
}

However, when you query only on patrons.preferences.allergies.food, which isn't a prefix of the index, Amazon DocumentDB performs a collection scan:

rs0:PRIMARY> db.food.find({"patrons.preferences.allergies.food":null}).explain("executionStats")
{
        "queryPlanner" : {
                "plannerVersion" : 1,
                "namespace" : "business.food",
                "winningPlan" : {
                        "stage" : "COLLSCAN"
                }
        },
        "executionStats" : {
                "executionSuccess" : true,
                "executionTimeMillis" : "937.123",
                "planningTimeMillis" : "0.379",
                "executionStages" : {
                        "stage" : "COLLSCAN",
                        "nReturned" : "1000020",
                        "executionTimeMillisEstimate" : "903.702"
                }
        },
        "serverInfo" : {
                "host" : "docdbinstance01",
                "port" : 27017,
                "version" : "4.0.0"
        },
        "ok" : 1,
    	 "operationTime" : Timestamp(1611141068, 1)
}

Multi-key index

For fields that have an array value, a multi-key index creates an index key for each element in the array. The array can contain scalar values (such as strings or numbers) or nested documents. After an index is set as multi-key, it can never be "un-multi-keyed," even if all the documents containing arrays have been removed; you must drop the index and recreate it. Multiple entries in a multi-key index may point to a single document, which can cause a performance penalty: there is a slight delay while the results are de-duplicated.

Indexing an array creates an index entry for each element of the array, not a single entry for the array itself. For example, if an array has 50 items, it has 50 index entries, which means a single create, update, or delete might have to update every one of those entries.

Our documents contain an array field, keywords:

        "keywords" : [
                "Lebanese",
                "Eccentric",
                "New"
        ]

You can create a multi-key index on keywords:

rs0:PRIMARY> db.food.createIndex({"keywords":1},{name: "keywords_asc"})
{
        "createdCollectionAutomatically" : false,
        "numIndexesBefore" : 4,
        "numIndexesAfter" : 5,
        "ok" : 1,
	 "operationTime" : Timestamp(1611168757, 1)
}

You can then query for elements of your keywords array using this new index:

rs0:PRIMARY> db.food.find({"keywords":"New"}).explain("executionStats")
{
        "queryPlanner" : {
                "plannerVersion" : 1,
                "namespace" : "business.food",
                "winningPlan" : {
                        "stage" : "IXSCAN",
                        "indexName" : "keywords_1"
                }
        },
        "executionStats" : {
                "executionSuccess" : true,
                "executionTimeMillis" : "0.125",
                "planningTimeMillis" : "0.088",
                "executionStages" : {
                        "stage" : "IXSCAN",
                        "nReturned" : "1",
                        "executionTimeMillisEstimate" : "0.022",
                        "indexName" : "keywords_asc"
                }
        },
        "serverInfo" : {
                "host" : "docdbinstance01",
                "port" : 27017,
                "version" : "4.0.0"
        },
        "ok" : 1,
	 "operationTime" : Timestamp(1611169140, 1)
}

Compound multi-key index

Extending multi-key indexes, you can combine a multi-key index with another, non-multi-key index to create a compound multi-key index. With compound multi-key indexes, each indexed document can have at most one indexed field whose value is an array.

When creating a compound multi-key index, only one of the indexed fields can be an array. For example, our document contains an array field for both keywords and meals:

        "keywords" : [
                "Lebanese",
                "Eccentric",
                "New"
        ],
        "meals" : [
                "lunch",
                "dinner"
        ]

You can create a new index on the cuisine string and keywords array:

rs0:PRIMARY> db.food.createIndex({"cuisine":1,"keywords":1},{name: "cuisine_keyword_array_asc"})
{
        "createdCollectionAutomatically" : false,
        "numIndexesBefore" : 5,
        "numIndexesAfter" : 6,
        "ok" : 1,
    	 "operationTime" : Timestamp(1611169931, 1)
}

When you query on the indexed fields, the query uses this new index:

rs0:PRIMARY> db.food.find({"cuisine":"Lebanese","keywords":"New"}).explain("executionStats")
{
        "queryPlanner" : {
                "plannerVersion" : 1,
                "namespace" : "business.food",
                "winningPlan" : {
                        "stage" : "FETCH",
                        "inputStage" : {
                                "stage" : "IXSCAN",
                                "indexName" : "cuisine_keyword_array_asc"
                        }
                }
        },
        "executionStats" : {
                "executionSuccess" : true,
                "executionTimeMillis" : "0.480",
                "planningTimeMillis" : "0.426",
                "executionStages" : {
                        "stage" : "FETCH",
                        "nReturned" : "1",
                        "executionTimeMillisEstimate" : "0.038",
                        "inputStage" : {
                                "stage" : "IXSCAN",
                                "nReturned" : "1",
                                "executionTimeMillisEstimate" : "0.024",
                                "indexName" : "cuisine_keyword_array_asc"
                        }
                }
        },
        "serverInfo" : {
                "host" : "docdbinstance01",
                "port" : 27017,
                "version" : "4.0.0"
        },
        "ok" : 1,
    	 "operationTime" : Timestamp(1611170152, 1)
}

However, you can’t create an index on both the keywords and meals arrays:

rs0:PRIMARY> db.food.createIndex({"keywords":1, "meals":1},{name: "keyword_meals_asc"})
{
	"ok" : 0,
	"operationTime" : Timestamp(1611170550, 1),
	"code" : 171,
	"errmsg" : "multiple fields of compound index cannot be arrays"
}

This restriction avoids the explosion of index entries that would result from indexing multiple arrays in one index; every possible pair of array elements would need its own entry.

If you have a query that needs to filter on two arrays, you can create a multi-key index on each array, and the query plan can use both indexes. Because we already created an index on the keywords array, you can first run the ensureIndex() operator against your collection to validate that it exists:

rs0:PRIMARY> db.food.ensureIndex({"keywords":1})
{
        "numIndexesBefore" : 6,
        "numIndexesAfter" : 6,
        "note" : "all indexes already exist",
        "ok" : 1,
        "operationTime" : Timestamp(1611170701, 1) 
}

You can now create a new multi-key index on the meals array:

rs0:PRIMARY> db.food.createIndex({"meals":1},{name:"meals_asc"})
{
        "createdCollectionAutomatically" : false,
        "numIndexesBefore" : 6,
        "numIndexesAfter" : 7,
        "ok" : 1,
	    "operationTime" : Timestamp(1611170957, 1)
}

When you query on both the keywords array and the meals array, you can see the query plan uses both indexes (output truncated):

rs0:PRIMARY> db.food.find({"keywords":"New","meals":"dinner"}).explain("executionStats")
…
"executionStats" : {
                "executionSuccess" : true,
                "executionTimeMillis" : "0.720",
                "planningTimeMillis" : "0.551",
                "executionStages" : {
                        "stage" : "FETCH",
                        "nReturned" : "4",
                        "executionTimeMillisEstimate" : "0.106",
                        "inputStage" : {
                                "stage" : "IXAND",
                                "nReturned" : "0",
                                "executionTimeMillisEstimate" : "0.104",
                                "inputStages" : [
                                        {
                                                "stage" : "IXSCAN",
                                                "nReturned" : "116",
                                                "executionTimeMillisEstimate" : "0.054",
												   "indexName" : "meals_asc"
                                        },
                                        {
                                                "stage" : "IXSCAN",
                                                "nReturned" : "164",
                                                "executionTimeMillisEstimate" : "0.044",
												   "indexName" : "keywords_asc"
                                        }
                                ]
                        }
                }
        },

Sparse index

A sparse index is useful when you have a field that is only present in some documents. When creating a document database model, it's a best practice to omit keys that contain a null value; having a key with a null value is different than omitting the key altogether. Unique indexes count null as a value, so you can't have a unique index with more than one document missing the indexed key. A sparse index skips documents that are missing the field, reducing the index size and saving space in memory. Because the index is smaller, queries that use it are more efficient. For a query to use a sparse index, it must include the $exists: true operator on the indexed field (this is functionally different from MongoDB, where the $exists operator is supported but not required).
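
For example, combining the sparse and unique options sidesteps the null-counting problem described above. The following is a minimal sketch assuming a hypothetical license_no field that only some documents contain; documents missing the field are skipped by the index, so they don't violate the unique constraint:

rs0:PRIMARY> db.food.createIndex({"license_no":1},{unique: true, sparse: true, name: "license_no_unique_sparse"})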

In our collection, the special_diets field is only present in approximately 1% of the documents:

rs0:PRIMARY> db.food.find({"special_diets":{$exists:true}}).count()
9257
rs0:PRIMARY> db.food.count()
1000021

To create an index only on the documents with the special_diets field, run the following:

rs0:PRIMARY> db.food.createIndex({"special_diets":1},{sparse: true,name: "special_diets_sparse_asc"})
{
        "createdCollectionAutomatically" : false,
        "numIndexesBefore" : 7,
        "numIndexesAfter" : 8,
        "ok" : 1,
		"operationTime" : Timestamp(1611172102, 1)
}

To compare the sizes of the indexes, you can run the stats() operator against the collection. This shows you how large the new special_diets_sparse_asc sparse index is versus the keywords_asc multi-key index from the preceding example (output truncated):

rs0:PRIMARY> db.food.stats(1024*1024)
{
…
        "indexSizes" : {
                "_id_" : 64.046875,
                "name_asc" : 26.5859375,
                "address_state_asc" : 40.484375,
                "cuisine_state_asc" : 114.859375,
				"keywords_asc" : 158.609375,
                "cuisine_keyword_array_asc" : 224.7265625,
                "meals_asc" : 114.859375,
				"special_diets_sparse_asc" : 1.2578125
        },

The default output of stats() is in bytes. In our example, you pass along a scale factor of 1024*1024, which shows the output in megabytes. You can also pass a scale factor of 1024 to return the size values in kilobytes. From this output, you can see the new sparse index is only about 1.26 MB, two orders of magnitude smaller than the 158.61 MB multi-key index.

If you run your query with the $exists: true operator, you can see the index scan:

rs0:PRIMARY> db.food.find({"special_diets":{ $exists: true, $eq: "vegan options" } }).explain("executionStats")
{
        "queryPlanner" : {
                "plannerVersion" : 1,
                "namespace" : "business.food",
                "winningPlan" : {
                        "stage" : "IXSCAN",
                        "indexName" : "special_diets_sparse_asc"
                }
        },
        "executionStats" : {
                "executionSuccess" : true,
                "executionTimeMillis" : "0.218",
                "planningTimeMillis" : "0.145",
                "executionStages" : {
				        "stage" : "IXSCAN",
                        "nReturned" : "1",
                        "executionTimeMillisEstimate" : "0.016",
						"indexName" : "special_diets_sparse_asc"
                }
        },
        "serverInfo" : {
                "host" : "docdbinstance01",
                "port" : 27017,
                "version" : "4.0.0"
        },
        "ok" : 1,
	 "operationTime" : Timestamp(1611172799, 1)

}

If you omit the $exists: true operator, the query performs a collection scan:

rs0:PRIMARY> db.food.find({"special_diets":{ $eq: "vegan options" } }).explain("executionStats")
{
        "queryPlanner" : {
                "plannerVersion" : 1,
                "namespace" : "business.food",
                "winningPlan" : {
                        "stage" : "COLLSCAN"
                }
        },
        "executionStats" : {
                "executionSuccess" : true,
                "executionTimeMillis" : "709.328",
                "planningTimeMillis" : "0.060",
                "executionStages" : {
				        "stage" : "COLLSCAN",
                        "nReturned" : "1",
                        "executionTimeMillisEstimate" : "709.251"
                }
        },
        "serverInfo" : {
                "host" : "docdbinstance01",
                "port" : 27017,
                "version" : "4.0.0"
        },
        "ok" : 1,
	 "operationTime" : Timestamp(1611178233, 1)

}

TTL index

In some cases, you need a flexible way to remove documents based on their age. Time to Live (TTL) indexes allow you to set a timeout on each document. When a document reaches the TTL age set, it's deleted from the collection. TTL works as a best-effort process; documents aren't guaranteed to be deleted within a specific period. Factors like collection size, resource utilization, document size, overall throughput, the number of indexes, and whether the indexes and the working set fit in memory can all affect deletion timing.

An important difference between Amazon DocumentDB TTL indexes and MongoDB TTL indexes is the indexed field's format: Amazon DocumentDB doesn't expire a document unless the indexed field's value is a date in ISO 8601 format, or an array containing a value in that format.

You can have multiple TTL indexes on a collection, but they can't be compound indexes or used for sorting and query optimization. Multiple TTL indexes can be useful for removing data from IoT devices based on a combination of payload date, log date, or event date. When a collection has multiple TTL indexes, the first field to meet its TTL criteria triggers deletion of the document.
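
Creating a TTL index uses the same createIndex() operator with the expireAfterSeconds option. The following is a minimal sketch assuming a hypothetical lastAccessed field stored as an ISO 8601 date; documents are deleted on a best-effort basis roughly 7 days (604,800 seconds) after that timestamp:

rs0:PRIMARY> db.food.createIndex({"lastAccessed":1},{expireAfterSeconds: 604800, name: "ttl_last_accessed"})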

With TTL indexes, each deletion incurs I/O costs; if throughput and delete rates increase, expect a higher bill due to increased I/O usage. However, if instead of a TTL index you segment documents into collections based on time and simply drop those collections when they're no longer needed, you don't incur per-document delete I/O. For time-series workloads, consider rolling collections instead of a TTL index; they can be a more performant and less I/O-intensive way to delete data. For more information about these best practices, see TTL and Timeseries Workloads.
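
A rolling-collection pattern can be as simple as the following sketch, with one collection per time window (the sensor_data_* collection names are hypothetical):

rs0:PRIMARY> db.createCollection("sensor_data_2021_w04")   // write new data to the current window
rs0:PRIMARY> db.sensor_data_2021_w01.drop()                // drop the oldest window without per-document I/O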

Index best practices

Although indexes improve query performance by decreasing the number of documents the database has to search through to satisfy a query, they also increase the amount of storage and IOPS needed. Each insert, update, and delete in a collection must update the indexes as well, which slows down writes and increases storage, memory, and IOPS usage. Because of this overhead, it's recommended to have no more than five indexes on an Amazon DocumentDB collection.

For this post, we use the getIndexes() operation to see the indexes we’ve created on our collection. To get additional information from the index, you can run the following query using the $indexStats aggregation stage (output truncated):

rs0:PRIMARY> db.food.aggregate([{$indexStats:{}}]).pretty()
...
{
        "name" : "special_diets_sparse_asc",
        "key" : {
                "special_diets" : 1
        },
        "host" : "docdbinstance01.xxxxxxxxxxxx.xx-xxxx-x.docdb.amazonaws.com:27017",
        "accesses" : {
                "ops" : NumberLong(0),
                "since" : ISODate("2021-01-20T05:13:51.166Z")
        }
}
{
        "name" : "cuisine_keyword_array_asc",
        "key" : {
                "cuisine" : 1,
                "keywords" : 1
        },
        "host" : "docdbinstance01.xxxxxxxxxxxx.xx-xxxx-x.docdb.amazonaws.com:27017",
        "accesses" : {
                "ops" : NumberLong(0),
                "since" : ISODate("2020-12-13T21:51:48.191Z")
        }
}
{
        "name" : "keywords_asc",
        "key" : {
                "keywords" : 1
        },
        "host" : "docdbinstance01.xxxxxxxxxxxx.xx-xxxx-x.docdb.amazonaws.com:27017",
        "accesses" : {
                "ops" : NumberLong(15),
                "since" : ISODate("2021-02-01T15:26:32.993Z")
        }
}

You can use this output to identify unused indexes. The name and key fields provide information similar to the getIndexes() output we discussed earlier. The host field gives you the host name, which includes the Region (such as us-east-1). The ops field tells you the number of operations that used the index, and the since field is the time Amazon DocumentDB started gathering statistics for the index (generally the index creation date).

From the preceding output, you see that although the keywords_asc index was created most recently, it has been used the greatest number of times (15). However, both the special_diets_sparse_asc and cuisine_keyword_array_asc indexes have never been used. You can drop these unused indexes with the dropIndex() operator, passing along the index name:

rs0:PRIMARY> db.food.dropIndex("special_diets_sparse_asc")
{ "nIndexesWas" : 8, "ok" : 1 , "operationTime" : Timestamp(1611179040, 1) }
rs0:PRIMARY> db.food.dropIndex("cuisine_keyword_array_asc")
{ "nIndexesWas" : 7, "ok" : 1, "operationTime" : Timestamp(1611179092, 1) }

If you're migrating a workload from MongoDB to Amazon DocumentDB, it's recommended that you remove all unnecessary or unused indexes prior to migration to decrease the amount of storage used and shorten the migration time. For more information about migration approaches, see Migrating to Amazon DocumentDB.

When we create indexes, we want to limit their creation to fields where the number of duplicate values is less than 1% of the total number of documents in the collection. For example, if our collection contains 1 million documents, we want to create indexes on fields where the same value occurs 10,000 times or fewer. An example of a poor field choice would be a status flag or gender, where there is a very small subset of unique values. Examples of a better choice are a field containing identification numbers, email addresses, or user names. Choosing an index with a high number of unique values (that is, a high cardinality) ensures that filter operations return a small number of documents, thereby yielding good performance during index scans.
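
One way to sanity-check a candidate field against this guideline is to look at how often its most frequent value occurs. The following is a minimal sketch using the address.state field from our sample documents; if the returned count is more than roughly 1% of db.food.count(), the field is likely a poor index candidate:

rs0:PRIMARY> db.food.aggregate([
        { $group: { _id: "$address.state", count: { $sum: 1 } } },
        { $sort: { count: -1 } },
        { $limit: 1 }
])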

It’s also recommended to ensure that all indexes fit in memory. This eliminates the need for the query to reach into the storage layer, which increases I/O and query time. To get details of your existing indexes’ sizes, run the stats() operator against the collection:

rs0:PRIMARY> db.food.stats(1024*1024)
{
        "ns" : "business.food",
        "count" : 1000021,
        "size" : 232.70141983032227,
        "avgObjSize" : 244.8747,
        "storageSize" : 269.234375,
        "capped" : false,
        "nindexes" : 6,
        "totalIndexSize" : 296.296875,
        "indexSizes" : {
                "_id_" : 64.046875,
                "name_asc" : 26.5859375,
                "address_state_asc" : 40.484375,
                "cuisine_state_asc" : 114.859375,
                "keywords_asc" : 158.609375,
                "meals_asc" : 114.859375
        },
        "ok" : 1,
	 "operationTime" : Timestamp(1611180942, 1)
}

This output provides useful information about the collection, such as the total number of documents (count). For this post, we focus on the index information:

        "nindexes" : 6,
        "totalIndexSize" : 296.296875,
        "indexSizes" : {
                "_id_" : 64.046875,
                "name_asc" : 26.5859375,
                "address_state_asc" : 40.484375,
                "cuisine_state_asc" : 114.859375,
                "keywords_asc" : 158.609375,
                "meals_asc" : 114.859375
        },

The collection has six total indexes, occupying approximately 300 MB of space. You can also see the size information specific to the individual indexes, such as name_asc and meals_asc. You can use the value of totalIndexSize along with the size of the working set to help you choose an instance size with the appropriate amount of memory for this workload. It’s also important to define the purge requirements of your collections. For example, if you have a large collection holding multiple weeks’ worth of data, but you only read data for the last 2 days, the index size still grows as you add more data and retain the old data, which increases the memory requirements and index scan time. You need to either purge your older, unused data or move that data to another collection.
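
Such a purge can be a periodic deleteMany() call. The following is a hedged sketch assuming a hypothetical last_order_date field stored as a date; keep in mind that these deletes incur I/O, just as TTL deletes do:

rs0:PRIMARY> var cutoff = new Date(Date.now() - 1000 * 60 * 60 * 24 * 14)   // 14 days ago
rs0:PRIMARY> db.food.deleteMany({"last_order_date": {$lt: cutoff}})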

Additionally, you can use the IndexBufferCacheHitRatio metric in Amazon CloudWatch to see what percentage of index requests are served from the buffer cache. If this number drops, consider scaling your Amazon DocumentDB instances vertically to an instance class with more RAM. Keep the IndexBufferCacheHitRatio above 95%, and create a CloudWatch alarm that fires when it drops below that value. For more information about Amazon CloudWatch metrics for Amazon DocumentDB, see Monitoring metrics and setting up alarms on your Amazon DocumentDB (with MongoDB compatibility) clusters.
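
If you use the AWS CLI, an alarm along these lines is one way to set that up (a sketch; the instance identifier and SNS topic ARN are placeholders you would replace with your own):

aws cloudwatch put-metric-alarm \
    --alarm-name docdb-index-cache-hit-low \
    --namespace AWS/DocDB \
    --metric-name IndexBufferCacheHitRatio \
    --dimensions Name=DBInstanceIdentifier,Value=docdbinstance01 \
    --statistic Average \
    --period 60 \
    --evaluation-periods 5 \
    --threshold 95 \
    --comparison-operator LessThanThreshold \
    --alarm-actions arn:aws:sns:us-east-1:123456789012:index-cache-alerts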

Summary

In this post, I reviewed the different index types and how to best work with them in Amazon DocumentDB. I also showed several useful operations, such as creating indexes and viewing index size and usage information, that you can use to help optimize your Amazon DocumentDB cluster. For more information about working with indexes on Amazon DocumentDB, see Working with Indexes.


About the author

Cody Allen is a Senior DocumentDB Specialist Solutions Architect based out of Texas. He is passionate about working side by side with customers to solve complex problems, and about supporting teammates through mentorship and knowledge transfer. He has spent his career deploying and managing systems, software, and infrastructure for B2B SaaS providers, materiel and logistics suppliers, the U.S. Air Force, and other government agencies, domestic and international.