AWS Database Blog

New Amazon DocumentDB features for aggregations, arrays, and indexing

Amazon DocumentDB (with MongoDB compatibility) is a fast, scalable, highly available, and fully managed document database service that supports MongoDB workloads. You can use the same MongoDB application code, drivers, and tools as you do today to run, manage, and scale workloads on Amazon DocumentDB. This way, you can enjoy improved performance, scalability, and availability without having to worry about managing the underlying infrastructure.

Today, Amazon DocumentDB added support for new aggregation pipeline operators and stages that allow you to compose powerful aggregations over your documents. The new capabilities include aggregation string operators ($concat, $substr, $substrBytes, $substrCP, $strcasecmp), an array aggregation operator ($size), an aggregation group accumulator operator ($push), and aggregation stages ($redact and $indexStats). Additionally, Amazon DocumentDB now supports positional array operators ($[] and $[<identifier>]) for updating elements in an array and hint() for selecting an index.

In this blog post, I introduce some of these new capabilities by showing common use cases. This way you can start using the capabilities to build and manage applications at-scale with Amazon DocumentDB.

Get started with Amazon DocumentDB

To get started with Amazon DocumentDB, you can use either the Amazon DocumentDB getting started guide, the quick start with AWS CloudFormation, or watch the following video. When you’re ready to start developing on Amazon DocumentDB, use the same application code, drivers, and tools that you’re using today with MongoDB.

New capabilities

After you create and connect to your Amazon DocumentDB cluster, you can better understand and extend the following examples.

Aggregation operators and stages

It’s common to store strings in documents such as first name, last name, or location. To make data processing easier and faster in your application, you might push manipulations such as joining two strings together, down to the database level. The new string aggregation operators that were added to Amazon DocumentDB enable you to manipulate strings for your specific use cases. Imagine you are building a human resources application and have a collection in Amazon DocumentDB that contains a number of documents for people in your organization.

Input

Each document in this example is for a single person and contains their first name, last name, date of birth, and a company specific identifier for where their desk resides.

db.people.insertMany([
{ "_id":1, "first_name":"Jane", "last_name":"Doe", "DOB":"2/1/1999", "Desk": "MSP102-MN"},
{ "_id":2, "first_name":"John", "last_name":"Doe", "DOB":"12/21/1992", "Desk": "DSM301-IA"},
{ "_id":3, "first_name":"Steve", "last_name":"Smith", "DOB":"3/21/1981", "Desk":"MKE233-WI"}])

$concat

The $concat aggregation operator concatenates (or combines) multiple strings in a document to produce a single string that can be returned to the application. This reduces the work done in the application. In the following example, I concatenate the first and last names of users to produce each person’s full name.

Query:

db.people.aggregate(
[
 { $project: { full_name: { $concat: [ "$first_name", " ", "$last_name"] } } }
])

Result:

{ "_id" : 1, "full_name" : "Jane Doe" }
{ "_id" : 2, "full_name" : "John Doe" }
{ "_id" : 3, "full_name" : "Steve Smith" }

With the use of $concat aggregation operator, you can return the person’s full name back to the application so that it can be used without requiring application-level logic.

$substrCP

The $substrCP operator returns a portion of a string (substring), where the desired substring is specified as a range of UTF-8 code points (CP). While the $substr operator is useful for defining substrings as range of characters, and the $substrBytes operator defines a substring as a range of bytes, $substrCP is useful for defining substrings when you are not certain how many characters or bytes a given string element uses. This is common in Unicode strings. Let’s use the string “caffè latte” as an example as the use of the “è” highlights this point.

Let’s consider the following use of $substrCP: { $substrCP: [ "caffè latte", 0, 5] }. The output would yield “caffè” as “caffè” is five code points from the beginning of the string (i.e., 0). If I used $subtrBytes for a similar query { $substrBytes: [ "caffè latte", 0, 5] }, I would get an error as it takes two bytes to store “è” and only returning the first byte of “è” won’t do anyone any good. Thus, to get the same result, I use the following query { $substrBytes: [ "caffè latte", 0, 6] }. Using $substrCP versus $subtrBytes is preferred and simplifies your development. You don’t have to think about how many bytes it takes to store the characters in your strings.

Following from the previous example and using the collection of people in the organization, if I want to use $substrCP to extract the state in which the employees reside, I use the following query.

Query:

db.people.aggregate(
  [
    {
      $project: {
          "state": { $substrCP: [ "$Desk", 7, 2] }
      }
    }
  ]
)

Result:

{ "_id" : 1, "state" : "MN" }
{ "_id" : 2, "state" : "IA" }
{ "_id" : 3, "state" : "WI" }

From the encoding of each employee’s desk location, I know that the two character abbreviation for the state starts seven code points from the beginning of the string and is two code points long, i.e., { $substrCP: [ "$Desk", 7, 2] }. Thus the results returned are the last two code points in the string. If I were using this type of encoding for the employees desk locations, I would ensure that I adhere to fixed lengths for each of the components. If not, it breaks the query that I just wrote.

$strcasecmp

Similarly, I can use the $strcasecmp operator to perform a case-insensitive comparison between two strings, i.e., { $strcasecmp: [ string1, string2 ] }. The output of the comparison tells us if the string2 is greater (1) than string1, equal (0) to string1, or less than (-1) string1.

Query:

db.people.aggregate(
   [
     {
       $project:
          {
            item: 1,
            compare: { $strcasecmp: [ "$Desk", "mke233-wi" ] }
          }
      }
   ]
)

Result:

{ "_id" : 1, "compare" : 1 }
{ "_id" : 2, "compare" : -1 }
{ "_id" : 3, "compare" : 0 }

From the people collection, if I compare “mke233-wi” with the desk locations of each of the employees, with the case insensitive $strcasecmp operator, I should expect that the last entry (i.e., “_id”:3) to be equal to “MKE233-WI”.

For the array expression aggregation operator $size and the aggregation group accumulator operator $push, let’s consider a scenario where I build a user profile for a fantasy hockey league. Each user is able to select the teams they want to follow.

Input:

db.profiles.insertMany([
{ "_id":1, "user":"hockeyFan01", "teams": ["sharks", "panthers", "stars"]},
{ "_id":2, "user":"puck22", "teams": ["ducks", "rangers"]},
{ "_id":3, "user":"pondHockey12", "teams": ["sharks", "panthers", "stars"]}
])

$size

I can use the $size operator to return the count of items within an array field. In this scenario, let’s use $size to return the count the number of teams each user follows.

Query:

db.profiles.aggregate([
   {
      $project: {
         item: 1,
         "numberOfTeams": { $size: "$teams" }
      }
   }
])

Result:

{ "_id" : 1, "numberOfTeams" : 3 }
{ "_id" : 2, "numberOfTeams" : 2 }
{ "_id" : 3, "numberOfTeams" : 3 }

$push

I can use the $push aggregation group accumulator operator with the $unwind and $group aggregation stages to recategorize or pivot the data by teams and fans. Recategorizing the data can be useful for scenarios in which I want to target the fans of specific teams with a promotional email. In this case, the application logic would be straightforward. I would send the promotional email for the Sharks to “hockeyFan01” and “pondHockey12” versus having to either process every document and result email one-by-one, or do a collections scan for each team.

Query:

db.profiles.aggregate( 
[ { $unwind: "$teams" },
 { $group : { _id : "$teams", "fans": { $push: "$user" } } }] 
)

Result:

{ "_id" : "rangers", "fans" : [ "puck22" ] }
{ "_id" : "ducks", "fans" : [ "puck22" ] }
{ "_id" : "sharks", "fans" : [ "hockeyFan01", "pondHockey12" ] }
{ "_id" : "stars", "fans" : [ "hockeyFan01", "pondHockey12" ] }
{ "_id" : "panthers", "fans" : [ "hockeyFan01", "pondHockey12" ] }

$redact

The $redact aggregation stage is useful for scenarios of including and excluding content in documents based on the selected fields in the document. To better understand how to utilize this aggregation stage, let’s say I build a patient record system and I only want doctors with permissions to specific codes to be able to see patient records. If a doctor does not have access to a code, the patient records are pruned from the documents returned from the query.

Input:

db.patient.insertMany([
{ "_id":1,
  "code": "ICU", 
  "patient":"John Doe", 
  "DOB": "9/30/1987",
  "Hospital": "First Hill"
},
{ "_id":2,
  "code": "Reg", 
  "patient":"Jane Doe", 
  "DOB": "3/27/1989",
  "Hospital": "Cherry Hill"
},
{ "_id":3,
  "code": "Spec", 
  "patient":"Steve Smith", 
  "DOB": "1/8/1997",
  "Hospital": "Pill Hill"
}
])

Query:

db.patient.aggregate(
   [
     { $redact: {
        $cond: {
           if: { $eq: ["Reg", "$code"]},
           then: "$$DESCEND",
           else: "$$PRUNE"
         }
       }
     }
   ]
);

Result:

{
	"_id" : 2,
	"code" : "Reg",
	"patient" : "Jane Doe",
	"DOB" : "3/27/1989",
	"Hospital" : "Cherry Hill"
}

Positional array operators

To understand the newly added positional array operators, let’s consider a use case where I use Amazon DocumentDB to store flight miles as part of an airlines rewards program.

Input:

db.miles.insertMany([
{ "_id" : 1, "member_since" : ISODate("1987-01-01T00:00:00Z"), "credit_card" : false, "flight_miles" : [ 1205, 2560, 880 ]},
{ "_id" : 2, "member_since" : ISODate("1982-01-01T00:00:00Z"), "credit_card" : true, "flight_miles" : [ 1205, 2560, 890, 2780]},
{ "_id" : 3, "member_since" : ISODate("1999-01-01T00:00:00Z"), "credit_card" : true, "flight_miles" : [ 1205, 880]}])

Note: In Amazon DocumentDB, insertMany() is an atomic operation, so either the entire command succeeds or it fails. Having atomicity here is useful as you do not have to worry about a failure occurring hallway through your insert and then trying to reconcile which documents were written to your database.

$[]

The positional array operator $[] can be used to perform an operation on each of the elements in an array field. For example, an airline is recording, in an array, their customers’ flight miles from a given year. If a customer registers for a credit card with the airline, I give them a 100 percent bonus for the miles that they’ve already earned that year.

Query:

db.miles.update(
   { credit_card: {$eq: true}},
   { $mul: { "flight_miles.$[]": NumberInt(2) } },
   { multi: true }
)

db.miles.find()

Result:

{ "_id" : 1, "member_since" : ISODate("1987-01-01T00:00:00Z"), "credit_card" : false, "flight_miles" : [ 1205, 2560, 880 ] }
{ "_id" : 2, "member_since" : ISODate("1982-01-01T00:00:00Z"), "credit_card" : true, "flight_miles" : [ 2410, 5120, 1780, 5560 ] }
{ "_id" : 3, "member_since" : ISODate("1999-01-01T00:00:00Z"), "credit_card" : true, "flight_miles" : [ 2410, 1760 ] }

For customers that are credit card members, I can see that I doubled each of their flight miles. 

$[<identifier>]

For the best customers that have been with the airline since 1980, I can run an additional promotion where I round up all flight miles less than 2,000 miles to 2,000 miles. To run that query in Amazon DocumentDB, I can use the $[<identifier>] positional array operator.

Query:

db.miles.update(
	{member_since: {$lte: ISODate("1990-01-01T00:00:00Z")}},    
	{$set: {"flight_miles.$[element]": 2000}},    
	{multi: true, arrayFilters: [{"element": {$lt:2000}}]}
)

db.miles.find()

 Result:

{ "_id" : 3, "member_since" : ISODate("1999-01-01T00:00:00Z"), "credit_card" : true, "flight_miles" : [ 2410, 1760 ] }
{ "_id" : 1, "member_since" : ISODate("1987-01-01T00:00:00Z"), "credit_card" : false, "flight_miles" : [ 2000, 2560, 2000 ] }
{ "_id" : 2, "member_since" : ISODate("1982-01-01T00:00:00Z"), "credit_card" : true, "flight_miles" : [ 2410, 5120, 2000, 5560 ] }

For the two customers that had been with the airline since 1980s, I see from the result that because flight_miles “1205,” “880,” and “1780” were below the 2000-mile threshold that they were set to 2000 miles as part of the additional promotion.

Indexing

You can use the newly added indexing capabilities to control and manage your index selection and usage. If you have indexes that are not being used, either using them or removing them can go a long way to improving the performance of your database and ultimately your application.

hint()

In Amazon DocumentDB, the query processor evaluates and picks a query plan that maximizes for performance. At times, you might want to force the query processor to use a specific index. To do that, you can use the hint() operator.

Input:

Let’s consider a use case where I store the inventory for a grocery store in Amazon DocumentDB. The grocery collection below contains the name of the product, category, pricing, and the quantity.

db.grocery.insertMany([
{ "_id" : 1, "product":"orange", "category":"fruit", "price":0.89, "qty":354 },
{ "_id" : 2, "product" : "apple", "category": "fruit", "price":1.29, "qty":876},
{ "_id" : 3, "product" : "tomato", "category": "apparel", "price":0.29, "qty":39}
])	

db.grocery.createIndex( { product: 1 } )

As a result of the input queries, I create a grocery collection and an index on the product field. I can use hint() to ensure that the query processor uses the index that I created on the product field.

Query:

db.grocery.find({"product":"orange"}).hint( { product: 1 } )

Result:

{
	"_id" : 1,
	"product" : "orange",
	"category" : "fruit",
	"price" : 0.89,
	"qty" : 354
}

I can validate that the index on the product field was indeed used with the explain() method, which provides us information about the query plan or how the query processor executed the query on the grocery collection.

Query:

db.grocery.find({"product":"orange"}).hint({"product":1}).explain()

Result:

db.grocery.find({"product":"orange"}).hint({"product":1}).explain()
{
        "queryPlanner" : {
                "plannerVersion" : 1,
                "namespace" : "test.grocery",
                "winningPlan" : {
                        "stage" : "SUBSCAN",
                        "inputStage" : {
                                "stage" : "IXSCAN",
                                "indexName" : "product_1",
                                "direction" : "forward"
                        }
                }
        },
        "serverInfo" : {
                "host" : "blog-host",
                "port" : 27017,
                "version" : "3.6.0"
        },
        "ok" : 1
}

From the output of explain(), I can verify that query plan (i.e., stage) that produced the query results was an index scan (i.e., “IXSCAN”) that used “indexName”:”product_1” and thus the index was used query the documents in the grocery collection.

$indexStats

Following from the preceding grocery example, if I run a few queries on the indexes, I can then understand the operational use of these indexes using the $indexStats aggregation operator. While indexes help improve query performance, they also create additional work for the database when inserting data. Understanding how often your indexes are being used when querying the database helps you make more informed decisions on whether indexes are needed.

Query:

db.grocery.aggregate( [ { $indexStats: { } } ] )

Result:

{ "name" : "_id_", "key" : { "_id" : 1 }, "host" : "blog-host.com:27017", "accesses" : { "ops" : NumberLong(3), "since" : ISODate("2019-02-19T23:27:10.214Z") } }
{ "name" : "product_1", "key" : { "product" : 1 }, "host" : "blog-host.com:27017", "accesses" : { "ops" : NumberLong(7), "since" : ISODate("2019-02-19T23:27:18.042Z") } }

In this result,  you can see that index “_id_” has been queried three times since it was created on 2019-02-19 or since the server last rebooted. Similarly, index “product_1” was queries seven times since it was created on the same date. For a production database, you can use $indexStats to baseline and monitor your index usage over time.

Summary

With this Amazon DocumentDB release, we added new capabilities that enable you to build and manage applications at scale. For more information about Amazon DocumentDB, see the Amazon DocumentDB home page and getting started guide.

If you have any questions or comments about this blog post, please use the comments section on this page.

 


About the Author

Joseph Idziorek is a Principal Product Manager at Amazon Web Services.