Searching DynamoDB Data with Amazon CloudSearch

This guide describes how to use Amazon CloudSearch to search DynamoDB data.

Details

Submitted By: Deborah Adair
AWS Products Used: Amazon CloudSearch, Amazon DynamoDB
Created On: November 16, 2012 12:41 AM GMT
Last Updated: December 20, 2012 8:32 PM GMT
By Jon Handler and Siva Raghupathy

Introduction

Amazon DynamoDB is a fully managed NoSQL database service that provides fast and predictable performance with seamless scalability. While DynamoDB supports GET and query operations, it lacks a rich query capability: there is no simple way to retrieve items based on matches within item attributes. Amazon CloudSearch is a fully managed search service in the cloud that allows you to easily integrate fast and highly scalable search functionality into your applications. By using DynamoDB and Amazon CloudSearch together, you get both the throughput and durability of DynamoDB and the rich and powerful search capabilities of Amazon CloudSearch: simple text matches, complex Boolean queries, faceting, and integer range searching.

To use Amazon CloudSearch to search DynamoDB data, you must export data from a DynamoDB table to a CloudSearch domain and implement a solution to keep DynamoDB and CloudSearch in sync. This document describes how to:

  1. Create an Amazon CloudSearch domain for a DynamoDB table.
  2. Configure your search domain to map item attributes to CloudSearch index fields.
  3. Upload data from the DynamoDB table to your search domain.
  4. Keep your CloudSearch domain in sync with your DynamoDB table.

To search your DynamoDB data, you must map each DynamoDB item to a single CloudSearch document, each DynamoDB attribute to a CloudSearch index field, and the DynamoDB primary key (or other unique identifier) to the CloudSearch document ID. To feed updates to your CloudSearch domain, you can record updates in a separate DynamoDB table as they come in, and then submit them to CloudSearch as batches of add and delete requests.

Creating an Amazon CloudSearch Domain for Your DynamoDB Table

To search the data in your DynamoDB table using Amazon CloudSearch, the first thing you do is create a CloudSearch domain. A search domain encapsulates the collection of data that you want to search (the data from your DynamoDB table) and the search instances that process your search requests.

For example, you might have a DynamoDB table called movies where each item represents a unique movie and the item attributes contain information about the movie such as its title, genre, director, and actors. Creating a CloudSearch domain will enable you to easily search the movie attributes.

To create a CloudSearch domain for the DynamoDB table, go to the Amazon CloudSearch console at https://console.aws.amazon.com/cloudsearch/home and click Create Your First Search Domain to launch the Create Domain wizard. (If you have already created a CloudSearch domain, click the Create a New Domain button on the CloudSearch dashboard to launch the wizard.) When prompted, specify a name for your new domain, such as movies, and step through the wizard. The next section describes how to configure the domain to map the item attributes to index fields, so you can choose the Manual Configuration option and configure the index fields later. For more information about creating search domains, see Creating a Search Domain in the Amazon CloudSearch Developer Guide.

Configuring Your Amazon CloudSearch Domain

A search domain's configuration controls how the data is indexed and searched. In Amazon CloudSearch, each item that can be returned as a search result is represented as a document. The individual attributes that contain the data you want to search, return in the search results, and use to rank and filter the results are represented as index fields. To use CloudSearch to search the data in a DynamoDB table, you must configure index fields that correspond to the item attributes in your DynamoDB table. CloudSearch supports three types of index fields:

  • text—a text field contains arbitrary alphanumeric data. A text field is always searchable. The value of a text field can either be returned in search results or the field can be used as a facet. By default, text fields are not result-enabled or facet-enabled.
  • literal—a literal field contains an identifier or other data that you want to be able to match exactly. The value of a literal field can be returned in search results or the field can be used as a facet, but not both. By default, literal fields are not search-enabled, result-enabled, or facet-enabled.
  • uint—a uint field contains an unsigned integer value. Uint fields are always searchable, the value of a uint field can always be returned in results, and faceting is always enabled. Uint fields can also be used in rank expressions.

For more information about indexing options, faceting, and ranking, see the Amazon CloudSearch Developer Guide.

For example, the data in your DynamoDB movies table might look like this:

Primary Key         Attributes

id = "tt0076759"    docid = "tt0076759",
                    version = 1,
                    title = "Star Wars",
                    director = "Lucas, George",
                    genre = {"Action","Adventure","Fantasy","Sci-Fi"},
                    actor = {"Ford, Harrison","Fisher, Carrie","Hamill, Mark",
                             "Jones, James Earl","Guinness, Alec","Johnston, Joe",
                             "Mayhew, Peter","Cushing, Peter","Prowse, David",
                             "Daniels, Anthony"}

id = "tt1411664"    docid = "tt1411664",
                    version = 1,
                    title = "Born to Be a Star",
                    director = "Brady, Tom",
                    genre = {"Comedy"},
                    actor = {"Ricci, Christina","Swardson, Nick","Dorff, Stephen",
                             "Johnson, Don","Bain, Robin","Herrmann, Edward",
                             "Goodman, Dana","Giangrande, Meredith","Dawn, Nadia",
                             "Locke, Tembi","Herschman, Adam"}

To configure your movies domain to search this DynamoDB table, you must define index fields for each of the item attributes. Configuring title, director, and actor fields as text fields enables you to perform free-text searches on those fields. When you search a text field, Amazon CloudSearch finds all documents that contain the search terms anywhere within the specified field, in any order. For example, if you search the title field for "star", you will find all of the movies that contain star anywhere in the title field, such as Star Wars and Born to Be a Star. This differs from searching a literal field, where the field value must be identical to the search string to be considered a match. Since the genre field contains a limited set of possible values, it makes sense to configure it as a literal field rather than a text field. By making the genre field facet-enabled, you can retrieve a count of how many movies in a particular genre matched the search parameters.
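The difference between text and literal matching can be illustrated with a toy model in plain Python. This is only a sketch of the behavior described above, not how CloudSearch is implemented: the function names are hypothetical, tokenization is a simple lowercase word split, and any additional text processing performed by the real service is ignored.

```python
import re

def text_match(field_value, query):
    """Toy text-field match: every query term appears somewhere in the field."""
    tokens = set(re.findall(r"[a-z0-9]+", field_value.lower()))
    terms = re.findall(r"[a-z0-9]+", query.lower())
    return bool(terms) and all(term in tokens for term in terms)

def literal_match(field_value, query):
    """Toy literal-field match: the whole value must equal the search string."""
    return field_value == query

titles = ["Star Wars", "Born to Be a Star"]
# Both titles match "star" as text; neither is identical to "star" as a literal.
print([t for t in titles if text_match(t, "star")])
print([t for t in titles if literal_match(t, "star")])
```

In this model, a literal genre value such as "Comedy" matches only the exact string "Comedy", which is why a small, fixed vocabulary suits a literal field.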

You can configure index fields interactively by going to the Amazon CloudSearch console at https://console.aws.amazon.com/cloudsearch/home, selecting your domain from the CloudSearch dashboard, and clicking the domain's Indexing Options link in the navigation panel. You can also configure index fields for a domain with the command line tools or the DefineIndexField configuration action. A convenient way to manage your domain's configuration is to collect the configuration commands in a script that you can easily modify and re-run whenever you need to update your configuration.

When you are finished configuring index fields for your domain, you must call IndexDocuments to deploy the updated configuration to your search index. You can run IndexDocuments through the console, with the cs-index-documents command line tool, or by invoking the IndexDocuments configuration action. The status of the new fields will be shown as PROCESSING on the CloudSearch dashboard until the IndexDocuments operation completes. During this time, you can send data to your domain, but you won't be able to search it using the new configuration until the status of the fields changes to ACTIVE.
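The configure-then-index sequence described above lends itself to a small, re-runnable script. The following is a minimal sketch, assuming that the domain object passed in is a Boto CloudSearch domain exposing create_index_field() and index_documents(); verify both against your Boto version. The MOVIE_FIELDS plan and the configure_index_fields name are hypothetical, and making title result-enabled and genre search-enabled are illustrative choices, not requirements.

```python
# Hypothetical field plan for the sample movies table.
# Each entry is (field name, field type, keyword options).
MOVIE_FIELDS = [
    ("title",    "text",    {"result": True}),
    ("director", "text",    {}),
    ("actor",    "text",    {}),
    ("genre",    "literal", {"searchable": True, "facet": True}),
]

def configure_index_fields(domain, field_plan=MOVIE_FIELDS):
    """Define each index field, then request a rebuild of the search index.

    `domain` is assumed to be a Boto CloudSearch domain object that provides
    create_index_field() and index_documents().
    """
    for name, field_type, options in field_plan:
        domain.create_index_field(name, field_type, **options)
    # Deploy the updated configuration, as the text above requires.
    domain.index_documents()
```

Keeping the plan in one list makes it easy to add or change a field and re-run the script; the new fields still become searchable only after their status changes to ACTIVE.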

Managing Data in DynamoDB and Amazon CloudSearch

To make the data in your DynamoDB table searchable, you must retrieve each item, represent it as a document in the Amazon CloudSearch SDF format, and upload the SDF documents to your search domain. Every document must have a unique document ID (docid), a version number, and at least one data field. You must map the item attributes that you want to search and return in results to individual data fields. For more information about formatting data in SDF, see Preparing Your Data in the Amazon CloudSearch Developer Guide.

Document IDs must be unique across your data set. Search results always include the document IDs of the matching documents, so basing your document IDs on your table's primary key enables you to easily retrieve items from the table using the DynamoDB GetItem API. However, keep in mind that CloudSearch document IDs must start with a letter or numeral, and can only contain the following characters: a-z (lower-case letters), 0-9, and _ (underscore).

Note that CloudSearch document IDs must also be 128 characters or fewer, while DynamoDB hash keys can be up to 2,048 bytes and range keys can be up to 1,024 bytes. If your DynamoDB keys are longer than 128 characters, or contain characters that are not allowed in document IDs, you will need to create a mapping between your primary keys and document IDs. One way to do that is to create an MD5 hash of your DynamoDB primary key and use that as the CloudSearch document ID. By also adding the DynamoDB primary key to your CloudSearch document as a result-enabled literal field, you can retrieve that field in your search results and use it to map back to your DynamoDB items. (To keep things simple in the example code shown below, we have included docid as an attribute of each table item.)
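The document ID rules and the MD5 fallback can be sketched as follows. The function names are hypothetical, and the pattern simply encodes the constraints stated above (lower-case letters, digits, and underscores, starting with a letter or numeral, at most 128 characters).

```python
import hashlib
import re

# Starts with a letter or numeral, then only a-z, 0-9, and underscore,
# with a total length of at most 128 characters.
_DOCID_RE = re.compile(r"[a-z0-9][a-z0-9_]{0,127}")

def is_valid_docid(candidate):
    """Return True if `candidate` satisfies the document ID rules above."""
    return _DOCID_RE.fullmatch(candidate) is not None

def to_docid(primary_key):
    """Use the key as-is when it is a valid document ID, else its MD5 hash."""
    if is_valid_docid(primary_key):
        return primary_key
    # An MD5 hex digest is 32 characters drawn from 0-9 and a-f, so it
    # always satisfies the document ID rules.
    return hashlib.md5(primary_key.encode("utf-8")).hexdigest()
```

Because the hash cannot be reversed, this approach only works together with the result-enabled literal field that stores the original primary key.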

A document's version is specified as an unsigned integer. The version is used to guarantee that obsolete updates are not applied if update requests are received out of order. If the version number is higher than the last applied update, it will be applied. Updates that have a lower version number are ignored. If the version number specified for a document is the same as in a previously-received update, the results are undefined - one of the uploaded documents will be added to the search index and the other will be discarded. However, there's no way to predict which one will take precedence. (For more information about versioning, see Document Versions in the Amazon CloudSearch Developer Guide.)
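The effect of version numbers on out-of-order updates can be modeled in a few lines. This is a toy model of the rule described above, not the service's implementation; apply_update and the in-memory index dictionary are hypothetical.

```python
def apply_update(index, docid, version, fields):
    """Apply an add to the toy `index` only if `version` is newer.

    `index` maps docid -> (version, fields). Returns True if applied.
    """
    current = index.get(docid)
    if current is not None and version <= current[0]:
        # Lower versions are ignored. Equal versions are undefined in the
        # real service; this model simply keeps the existing document.
        return False
    index[docid] = (version, fields)
    return True

index = {}
apply_update(index, "tt0076759", 2, {"title": "Star Wars"})
# An older update that arrives late is discarded.
apply_update(index, "tt0076759", 1, {"title": "stale title"})
print(index["tt0076759"])
```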

If your DynamoDB items do not already contain a version number, you must add one. This can be used for optimistic concurrency control (OCC) using DynamoDB conditional updates. To easily keep track of versions, you can use the current system time represented as a Unix Epoch time with second resolution.
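A version based on Unix epoch seconds can be generated as shown below. Because second resolution means two writes to the same item within one second would receive the same version, this sketch also bumps the value past the previous version; that safeguard, the next_version name, and the now parameter (included for testing) are assumptions, not part of the original guidance.

```python
import time

def next_version(previous_version=0, now=None):
    """Return a version number that is strictly greater than the previous one.

    Uses the current Unix epoch time in whole seconds, or `now` if given.
    """
    current = int(time.time() if now is None else now)
    return max(current, previous_version + 1)
```

The previous version is the value you would also supply as the expected value in a DynamoDB conditional update.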

Bootstrapping Your Amazon CloudSearch Domain

When you first create a CloudSearch domain for searching your DynamoDB data, you must export all of the data from your DynamoDB table and upload it to your search domain as SDF data.

For example, the items in the sample movies table can be represented in SDF as follows:

[{ "type":   "add",
  "id":      "tt0076759",
  "version": 1,
  "lang":    "en",
  "fields":  {
    "title": "Star Wars",
    "director": "Lucas, George",
    "genre": ["Action","Adventure","Fantasy","Sci-Fi"],
    "actor": ["Ford, Harrison","Fisher, Carrie","Hamill, Mark",
              "Jones, James Earl","Guinness, Alec","Johnston, Joe",
              "Mayhew, Peter","Cushing, Peter","Prowse, David",
              "Daniels, Anthony","Baker, Kenny","Rimmer, Shane",
              "De Aragon, Maria","McCallum, Rick","Tippett, Phil",
              "Goffe, Rusty","Lyons, Derek","Ward, Larry",
              "Tierney, Malcolm","Diamond, Peter"]  }
},
{ "type":    "add",
  "id":      "tt1411664",
  "version": 1,
  "lang":    "en",
  "fields":  {
    "title": "Born to Be a Star",
    "director": "Brady, Tom",
    "genre": ["Comedy"],
    "actor": ["Ricci, Christina","Swardson, Nick","Dorff, Stephen",
              "Johnson, Don","Bain, Robin","Herrmann, Edward",
              "Goodman, Dana","Giangrande, Meredith","Dawn, Nadia",
              "Locke, Tembi","Herschman, Adam","Flynn, Miriam",
              "Taylor, Tabitha","Armani, Angelina","Jaymes, Jayden",
              "Lychnikoff, Pasha D.","Bjorge, Jaimarie","Wolov, Julia Lea",
              "Grunberg, Brad","Joyner, Mario"]  }
}]
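The mapping from a table item to an SDF add operation like those above can be sketched in plain Python, independent of any AWS library. The item is assumed to be a dictionary in which multi-valued attributes are Python sets; item_to_sdf_add and its skip parameter are hypothetical names, and the primary key attribute is assumed to be called id as in the sample table.

```python
import json

def item_to_sdf_add(item, lang="en", skip=("id", "docid", "version")):
    """Build an SDF add operation (as a dict) from a table item dict."""
    fields = {}
    for name, value in item.items():
        if name in skip:
            continue
        # Sets become sorted lists (multi-valued fields); scalars pass through.
        if isinstance(value, (set, frozenset)):
            fields[name] = sorted(value)
        else:
            fields[name] = value
    return {
        "type": "add",
        "id": item["docid"],
        "version": int(item["version"]),
        "lang": lang,
        "fields": fields,
    }

item = {
    "id": "tt0076759", "docid": "tt0076759", "version": 1,
    "title": "Star Wars", "director": "Lucas, George",
    "genre": {"Action", "Adventure", "Fantasy", "Sci-Fi"},
}
print(json.dumps([item_to_sdf_add(item)], indent=2))
```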

To bootstrap the domain, you must specify an SDF add operation for each item in the movies table. You can easily retrieve all of the items in your DynamoDB table by performing a table scan. Boto (2.4.1 or later) includes interfaces to the DynamoDB and Amazon CloudSearch services and provides a simple way to perform the table scan, create the SDF batches, and submit these batches to your CloudSearch domain using Python.

You can bootstrap your search domain by incorporating the following code snippets into your source.

  1. Import the Boto library and use the following code to create the connections to DynamoDB and CloudSearch:

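        import boto

        # Connect to DynamoDB and look up the table that holds the data.
        ddb_conn = boto.connect_dynamodb(access_key, secret_key)
        table = ddb_conn.get_table(movies_table)

        # Connect to CloudSearch and look up the search domain.
        cs_conn = boto.connect_cloudsearch(access_key, secret_key)
        domain = cs_conn.lookup(domain_name)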
    

    You must replace access_key and secret_key with your AWS account credentials. Set the movies_table variable to the name of the DynamoDB table that contains your data. Set the domain_name variable to the name of the CloudSearch domain you created to search your data.

  2. After you have created connections to DynamoDB and CloudSearch, scan the DynamoDB table and add a document to the search domain for each item. For example:

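        def _dynamo_to_cloudsearch(table, domain):
            doc_service = domain.get_document_service()
            pending = 0

            for table_item in table.scan():
                # Work on a local copy so the item in DynamoDB is left untouched.
                attrs = dict(table_item)
                docid = attrs.pop('docid')
                version = int(attrs.pop('version'))

                doc_service.add(docid, version, _get_fields(attrs))
                pending += 1

                # Upload the batch before it approaches the 5 MB limit,
                # then clear it so the same documents are not sent twice.
                if len(doc_service.get_sdf()) >= 4000000:
                    doc_service.commit()
                    doc_service.clear_sdf()
                    pending = 0

            # Upload the final, partially filled batch.
            if pending:
                doc_service.commit()
                doc_service.clear_sdf()

        def _get_fields(attrs):
            fields = {}
            for key, value in attrs.items():
                # Sets become lists (multi-valued fields); scalars pass through.
                if isinstance(value, (set, frozenset)):
                    fields[key] = list(value)
                else:
                    fields[key] = value
            return fields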
    

This sample code simply loops through all of the table items, collects the docid, version, and other attributes for each item, and adds a document to the search domain for each item.

Once we collect the docid and version, they are removed from our local copy of the item before we create fields for the remaining attributes. This is necessary for the docid attribute because docid is a reserved keyword in CloudSearch and you cannot create a field with that name. You could opt to create a document field for the version attribute, but it's not necessary. Note that these changes are not committed back to the DynamoDB table; the docid and version attributes are removed only from the local copy of the item.

It's important to note that the documents are sent to the CloudSearch domain in batches. When you add documents to the doc_service through Boto, they are automatically collected in a batch. To upload the batch to your search domain, you must explicitly call the document service object's commit() method. In this sample, we commit the batch once it reaches 4 million characters to stay within the maximum batch size of 5 MB. For optimal performance, you should batch as many document updates together as possible while staying within the 5 MB limit.
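The size-based batching described above can also be expressed without any AWS library, which makes the logic easy to test. This is a minimal sketch: batch_sdf is a hypothetical helper, and it measures batch length in characters, which equals bytes here because json.dumps escapes non-ASCII characters by default.

```python
import json

MAX_BATCH_CHARS = 4000000  # same threshold as the sample; the limit is 5 MB

def batch_sdf(documents, max_chars=MAX_BATCH_CHARS):
    """Yield JSON batch strings whose length stays at or below max_chars.

    A single document larger than max_chars is still emitted on its own.
    """
    batch, size = [], 2  # 2 accounts for the enclosing "[" and "]"
    for doc in documents:
        encoded = json.dumps(doc)
        extra = len(encoded) + (2 if batch else 0)  # ", " separator
        if batch and size + extra > max_chars:
            yield "[" + ", ".join(batch) + "]"
            batch, size = [], 2
            extra = len(encoded)
        batch.append(encoded)
        size += extra
    # Emit the final, partially filled batch.
    if batch:
        yield "[" + ", ".join(batch) + "]"
```

Each yielded string is a complete SDF batch; how it is uploaded to the domain's document service endpoint is left to the caller.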

Synchronizing Updates in DynamoDB and Amazon CloudSearch

After you bootstrap your CloudSearch domain with the initial data from your DynamoDB table, you must track the changes and periodically submit updates to keep your CloudSearch domain in sync with your DynamoDB data.

You can use a DynamoDB table to track the changes. Whenever you write an update to your primary table, write the same update to your update table along with an attribute that specifies whether the update type is an SDF add or delete operation. PutItem and UpdateItem are add operations; DeleteItem is a delete operation.
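One way to shape the records written to the update table is sketched below. The update_type attribute name and the make_update_record function are hypothetical; the only part taken from the text is the mapping of write operations to SDF operation types.

```python
# Map each DynamoDB write operation to the SDF operation it implies.
SDF_OPERATION = {"PutItem": "add", "UpdateItem": "add", "DeleteItem": "delete"}

def make_update_record(dynamodb_operation, docid, version, attributes=None):
    """Build the record to store in the update table for one write."""
    op = SDF_OPERATION[dynamodb_operation]
    record = {"docid": docid, "version": int(version), "update_type": op}
    if op == "add":
        # Adds carry the item's attributes; deletes need only ID and version.
        record.update(attributes or {})
    return record
```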

Important: To make sure that you don't lose any changes that are made while you are bootstrapping your domain, you must begin collecting updates in your update table before you initiate the full scan to export your DynamoDB data to Amazon CloudSearch. While this means that you might update some CloudSearch documents with identical data, it will ensure that no updates are lost and your CloudSearch domain contains an up-to-date version of every document.

The document version attribute is key to keeping items properly synchronized. The update table must contain a version number that you can use for each document that you are adding or deleting. For a document update to be applied, the version number must be greater than the last version number CloudSearch received, even for delete operations. This guarantees that obsolete updates are not applied if document updates are received out of order. A convenient way to manage versioning is to use the current system time represented as a Unix Epoch time with second resolution. (For more information about versioning, see Document Versions in the Amazon CloudSearch Developer Guide.)

How often you process your update table depends on the volume of changes and your update latency tolerance. One approach is to accumulate changes over a fixed time period, creating a new update table at the start of each time period, then processing and deleting the table at the end of the time period. For example, to send updates once per day, at the beginning of each day you could create a table called updates_YYYY_MM_DD to collect updates for that day. At the end of the day, you would then scan the day's table and send the changes to CloudSearch. Once the table has been processed, you can clean up by simply deleting that day's table.
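The daily naming scheme and the conversion of collected updates into SDF operations might look like the following sketch. It assumes each update record is a dictionary with docid, version, and a hypothetical update_type attribute whose value is "add" or "delete"; both function names are also hypothetical.

```python
import datetime

def update_table_name(day):
    """Return the update table name for a date, e.g. updates_2012_11_16."""
    return day.strftime("updates_%Y_%m_%d")

def updates_to_sdf(records, lang="en"):
    """Convert update records into a list of SDF add and delete operations."""
    bookkeeping = ("docid", "version", "update_type")
    batch = []
    for record in records:
        if record["update_type"] == "delete":
            batch.append({"type": "delete", "id": record["docid"],
                          "version": record["version"]})
        else:
            fields = {k: v for k, v in record.items() if k not in bookkeeping}
            batch.append({"type": "add", "id": record["docid"],
                          "version": record["version"], "lang": lang,
                          "fields": fields})
    return batch

print(update_table_name(datetime.date(2012, 11, 16)))
```

Delete operations still carry a version, since a delete is applied only if its version is greater than the last one CloudSearch received for that document.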

Conclusion

Amazon CloudSearch provides a convenient, powerful search solution for data stored in DynamoDB. With just a few lines of code, you can create, populate, and synchronize a search domain with a DynamoDB table. You get the best of both worlds: CloudSearch's rich query capabilities and DynamoDB's high-throughput, low-latency, durable storage.

©2014, Amazon Web Services, Inc. or its affiliates. All rights reserved.