Perform fuzzy full-text search and semantic search on Amazon DocumentDB using Amazon OpenSearch Service
In this post, we show you how to integrate Amazon DocumentDB (with MongoDB compatibility) with Amazon OpenSearch Service using AWS Lambda integration and run full-text search, fuzzy search, and synonym search on an artificially generated reviews dataset.
Amazon DocumentDB is a fast, scalable, highly durable, and fully managed database service for operating mission-critical MongoDB API-compatible JSON-based workloads without having to worry about managing the underlying infrastructure. As a document database, Amazon DocumentDB makes it easy to store, query, and index JSON data.
As your business evolves, new opportunities arise, requiring you to delve deeper into your data for better insights. For example, consider that you are a large ecommerce platform using Amazon DocumentDB to store product reviews as JSON documents. To enhance your customer experience, you can develop functionality to help them find relevant product reviews based on their interests, which could involve finding reviews not only based on the exact keywords of their interests but also considering synonyms and semantics.
OpenSearch Service is a managed service that makes it easy to deploy, operate, and scale OpenSearch clusters in the AWS Cloud. With OpenSearch Service, you can perform real-time search, full-text search, semantic search, fuzzy search, and other analyses on your data for use cases like recommendation engines, ecommerce sites, and much more.
Amazon DocumentDB change streams provide a time-ordered sequence of change events that occur within your Amazon DocumentDB cluster’s collections. Lambda recently launched an integration with Amazon DocumentDB change streams. With this launch, you can use Lambda functions to stream your Amazon DocumentDB data changes to the OpenSearch Service index and run fuzzy full-text search and semantic search queries. For more information, see Using Lambda with Amazon DocumentDB.
This solution involves the following high-level steps:
- Deploy an AWS CloudFormation template to create the following resources:
- A VPC and the required networking components
- An Amazon DocumentDB cluster to store the JSON data
- An OpenSearch Service domain for running fuzzy, full-text queries
- An AWS Cloud9 environment to connect to Amazon DocumentDB and OpenSearch Service
- A secret in AWS Secrets Manager to store Amazon DocumentDB credentials
- A Lambda function to stream Amazon DocumentDB data to the OpenSearch Service index
- Set up the AWS Cloud9 environment.
- Enable Amazon DocumentDB change streams.
- Configure the Amazon DocumentDB change stream as a source for the Lambda function.
- Load the reviews dataset into Amazon DocumentDB.
- Run fuzzy full-text queries on Amazon DocumentDB data in OpenSearch Service.
The following architecture diagram illustrates the solution.
The CloudFormation template deploys resources in your AWS account, which incur costs. For more information on pricing for the resources, see AWS Pricing.
Deploy the CloudFormation template
Complete the following tasks to deploy the CloudFormation template:
- Download the template or quick launch the CloudFormation stack by choosing Launch stack:
- For Stack name, enter the name for your CloudFormation stack.
- For DocDBIdentifier, enter the name of your Amazon DocumentDB cluster.
- For DocDBPassword, enter the administrator password for your Amazon DocumentDB cluster (minimum 8 characters).
- For DocDBUsername, enter the name of your administrator user in the Amazon DocumentDB cluster.
- For ExistingCloud9Role, choose True if you have the AWS Identity and Access Management (IAM) role AWSCloud9SSMAccessRole created in your account. If you have used AWS Cloud9 before, you should already have an existing role. You can verify by going to the IAM console and searching for it on the Roles page. Stack creation will fail if the role exists and you choose False.
- Choose Next.
- Select the check box in the Capabilities section to allow the stack to create an IAM role, then choose Submit.
Set up an AWS Cloud9 environment
To set up your AWS Cloud9 environment, complete the following steps:
- On the AWS Cloud9 console, launch the environment that you created in the previous step (
- From your environment, launch a new terminal window by choosing Window and New Terminal.
- Install the required packages by running the following script to connect to Amazon DocumentDB using a terminal and load the reviews dataset using a Python script:
Enable Amazon DocumentDB change streams
Amazon DocumentDB change stream events comprise a time-ordered sequence of data changes due to inserts, updates, and deletes on your data. We use these change stream events to transmit data changes from the Amazon DocumentDB cluster to the OpenSearch Service domain.
Change streams are disabled by default; you can enable them at an individual collection level, at the database level, or at the cluster level.
To enable change streams on your cluster, complete the following steps:
- Navigate to your AWS Cloud9 terminal and run the following code, replacing the values with those of your cluster:
You can find the Amazon DocumentDB endpoint on your CloudFormation stack’s Outputs tab or on the Amazon DocumentDB console, and the Amazon DocumentDB user name and password are the values you provided during the creation of the CloudFormation stack.
- Connect to Amazon DocumentDB:
- Enable change streams on all databases and collections:
For more information on change streams, see Using change streams with Amazon DocumentDB.
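The same change stream setting can also be applied from Python instead of the mongo shell. The following is a minimal sketch, assuming the pymongo driver is installed and that your connection string carries the TLS options your cluster requires. The function names and the connection string in the comment are hypothetical and are not part of the original walkthrough. It relies on the Amazon DocumentDB modifyChangeStreams admin command, where empty strings for the database and collection target every database and collection in the cluster.

```python
def build_modify_change_streams_command(database="", collection="", enable=True):
    """Build the Amazon DocumentDB modifyChangeStreams admin command.

    Empty strings for both database and collection target every database
    and collection in the cluster.
    """
    # The command name must be the first key; dicts preserve insertion order.
    return {
        "modifyChangeStreams": 1,
        "database": database,
        "collection": collection,
        "enable": enable,
    }


def enable_change_streams(uri, database="", collection=""):
    """Connect with pymongo and run the command against the admin database."""
    # Imported lazily so the command builder works without the driver installed.
    from pymongo import MongoClient

    client = MongoClient(uri)
    try:
        command = build_modify_change_streams_command(database, collection)
        return client.admin.command(command)
    finally:
        client.close()


# Example call (hypothetical endpoint, credentials, and CA bundle path):
# enable_change_streams(
#     "mongodb://<user>:<password>@<docdb-endpoint>:27017/"
#     "?tls=true&tlsCAFile=global-bundle.pem&retryWrites=false"
# )
```

Passing a specific database and collection name instead of empty strings narrows the setting to that collection.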
Configure the Amazon DocumentDB change stream as a source for the Lambda function
To accomplish this task, complete the following steps:
- On the Lambda console, navigate to the Lambda function named
- On the Configuration tab, choose Triggers and choose Add trigger.
- Select the source as Amazon DocumentDB for the trigger configuration.
- For DocumentDB cluster, choose the cluster created by the CloudFormation stack.
- For Database name, enter
- For Collection name, enter
- For Secrets Manager key, choose the Secrets Manager key created by the CloudFormation stack. You can find it in the CloudFormation stack outputs as the value for the key
- For Batch window, enter the maximum amount of time in seconds to gather records before invoking your function. We set this to a low value (5 seconds) so that invocations happen sooner.
- For all other parameters, leave them at their defaults.
- Choose Add.
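The console steps above can also be expressed as an API call. The following is a rough sketch, assuming boto3 is installed and that the Lambda create_event_source_mapping operation is used with its DocumentDBEventSourceConfig and SourceAccessConfigurations parameters. The helper names are hypothetical, the starting position of LATEST is an assumption, and every argument must be replaced with the values from your own stack.

```python
def build_docdb_trigger_config(function_name, cluster_arn, secret_arn,
                               database, collection, batch_window_seconds=5):
    """Build keyword arguments for Lambda's create_event_source_mapping."""
    return {
        "FunctionName": function_name,
        # ARN of the Amazon DocumentDB cluster created by the stack.
        "EventSourceArn": cluster_arn,
        # Assumption: only process changes made after the trigger is created.
        "StartingPosition": "LATEST",
        # Mirrors the 5-second batch window used in the console steps.
        "MaximumBatchingWindowInSeconds": batch_window_seconds,
        # The Secrets Manager secret that stores the cluster credentials.
        "SourceAccessConfigurations": [
            {"Type": "BASIC_AUTH", "URI": secret_arn},
        ],
        "DocumentDBEventSourceConfig": {
            "DatabaseName": database,
            "CollectionName": collection,
        },
    }


def create_docdb_trigger(**kwargs):
    """Create the trigger; requires boto3 and valid AWS credentials."""
    import boto3  # imported lazily so the builder can be used on its own

    config = build_docdb_trigger_config(**kwargs)
    return boto3.client("lambda").create_event_source_mapping(**config)
```

Keeping the configuration builder separate from the API call makes it easy to inspect or log the request before anything is created.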
Load the reviews dataset into Amazon DocumentDB
Navigate to AWS Cloud9, and in a new terminal, run the loader script to start inserting the review dataset into Amazon DocumentDB (the script will run for a few minutes; do not close the terminal):
As the loader script loads the data into Amazon DocumentDB, the Lambda function streams the data into OpenSearch Service. You can monitor this process through the Lambda function metrics to make sure the function is invoked successfully, and view the function logs to make sure there are no issues with the indexing process, such as incorrect permissions.
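To make the streaming step more concrete, the following sketch shows one way a function could translate a batch of change events into a request body for the OpenSearch bulk API. It is not the function deployed by the CloudFormation template. It assumes the event shape described in the Lambda documentation for Amazon DocumentDB (an events list whose items wrap a change event with operationType, documentKey, and fullDocument), and the function and index names are hypothetical.

```python
import json


def _doc_id(document_key):
    """Return the document ID as a string, unwrapping {"$oid": ...} if present."""
    raw = document_key["_id"]
    if isinstance(raw, dict) and "$oid" in raw:
        return raw["$oid"]
    return str(raw)


def to_bulk_body(lambda_event, index_name):
    """Convert DocumentDB change events into newline-delimited bulk actions."""
    lines = []
    for record in lambda_event.get("events", []):
        change = record["event"]
        doc_id = _doc_id(change["documentKey"])
        operation = change["operationType"]
        if operation == "delete":
            action = {"delete": {"_index": index_name, "_id": doc_id}}
            lines.append(json.dumps(action))
        elif operation in ("insert", "update", "replace") and "fullDocument" in change:
            # _id is a metadata field in OpenSearch, so keep it out of the body.
            source = {k: v for k, v in change["fullDocument"].items() if k != "_id"}
            action = {"index": {"_index": index_name, "_id": doc_id}}
            lines.append(json.dumps(action))
            lines.append(json.dumps(source))
        # Events without a full document (or other operation types) are skipped.
    # The bulk API expects the body to end with a newline.
    return "\n".join(lines) + "\n" if lines else ""
```

Reusing the Amazon DocumentDB document ID as the OpenSearch document ID keeps repeated deliveries of the same change idempotent, because a replayed insert overwrites the same indexed document.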
To verify data is streaming to the OpenSearch Service index, open a new terminal in your AWS Cloud9 environment and run the following commands (replace the OpenSearch Service endpoint with the value from your CloudFormation template outputs):
The following sample output contains document count (docs.count) information:
In addition to verifying the index document count, OpenSearch Service provides several Amazon CloudWatch metrics to monitor, including IndexingRate, which indicates that indexing operations are running.
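If you prefer to check the document count programmatically, the following sketch parses the JSON form of the index listing, which you can request from the _cat/indices API with the format=json parameter. How you authenticate to your domain depends on its access policy, so the sketch only covers parsing a response you have already retrieved. The function name and the sample index name are hypothetical.

```python
import json


def doc_count(cat_indices_json, index_name):
    """Return docs.count for one index from a _cat/indices?format=json response.

    Returns None if the index is not listed or has no count reported.
    """
    for row in json.loads(cat_indices_json):
        if row.get("index") == index_name:
            count = row.get("docs.count")
            # The API reports counts as strings, so convert before comparing.
            return int(count) if count is not None else None
    return None


# Trimmed, made-up response used only to show the expected shape.
sample_response = json.dumps([
    {"health": "green", "index": "example-index", "docs.count": "1200"},
])
print(doc_count(sample_response, "example-index"))
```

Calling the function repeatedly while the loader script runs should show the count increasing as the Lambda function indexes new documents.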
If you’re working with an existing Amazon DocumentDB collection, you can perform a one-time full migration using an AWS Database Migration Service (AWS DMS) full load task before enabling the change stream and processing future changes. AWS DMS supports Amazon DocumentDB as a source and OpenSearch Service as a target. For more information, refer to Source endpoints for data migration and Target endpoints for data migration.
Run queries on Amazon DocumentDB data in OpenSearch Service
As the data is being replicated to the OpenSearch Service domain, you can run full-text search, fuzzy search, and synonym search queries in OpenSearch Service. The following are some example queries that you can run in your AWS Cloud9 terminal.
Full-text search queries
The following query finds the reviews that have a rating greater than or equal to 4 out of 5 and contain the phrase "easy to use". The query also highlights the matched phrase in the query response:
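As an illustration of how such a request can be structured, the following sketch builds a search body that combines a match_phrase clause, a range filter, and highlighting. The review_body field appears elsewhere in this post, but the star_rating field name is an assumption about the dataset schema, so adjust it to match your index mapping. The function name is hypothetical.

```python
import json


def phrase_query_with_min_rating(phrase, min_rating,
                                 text_field="review_body",
                                 rating_field="star_rating"):
    """Build a search body: exact phrase match, rating filter, and highlights."""
    return {
        "query": {
            "bool": {
                # The phrase must appear with its words adjacent and in order.
                "must": [{"match_phrase": {text_field: phrase}}],
                # Filters restrict results without affecting the relevance score.
                "filter": [{"range": {rating_field: {"gte": min_rating}}}],
            }
        },
        # Ask OpenSearch to return the matched fragments of the text field.
        "highlight": {"fields": {text_field: {}}},
    }


body = phrase_query_with_min_rating("easy to use", 4)
print(json.dumps(body, indent=2))
```

You can send the printed JSON as the request body of a _search call against your index.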
Full-text search boost queries:
Using the boost feature, you can improve search relevancy by “boosting” certain fields. Boosts are multipliers that weigh matches in one field more heavily than matches in other fields.
In the following example, a match for game in the review_body field influences _score twice as much as a match in the product_title field. You can add the boosting factor to the query through the caret (^) operator.
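One way to express this boost is a multi_match query whose field list carries the caret notation. The following is a minimal sketch rather than the exact query used in this post, and the helper name is hypothetical.

```python
import json


def boosted_multi_match(term, field_boosts):
    """Build a multi_match query, appending ^boost to fields with a boost != 1."""
    fields = [
        name if boost == 1 else f"{name}^{boost}"
        for name, boost in field_boosts.items()
    ]
    return {"query": {"multi_match": {"query": term, "fields": fields}}}


# review_body matches count twice as much as product_title matches.
body = boosted_multi_match("game", {"review_body": 2, "product_title": 1})
print(json.dumps(body, indent=2))
```

Keeping the boosts in a dictionary makes it straightforward to tune the weights while comparing result rankings.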
Fuzzy queries return documents that contain terms similar to the search term. For example, if the search term is "easy", documents containing terms such as "eays", "ease", and "easi" are matched.
Here is a query to find all reviews with a review body that has a fuzzy match for “easi”:
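To show why these terms count as similar, the following sketch computes an edit distance that allows insertions, deletions, substitutions, and swaps of adjacent characters, and then builds a fuzzy query body. It assumes the fuzzy query's default behavior of treating a transposition as a single edit. The helper names are hypothetical, and the distance function is an illustration rather than the implementation OpenSearch uses internally.

```python
def edit_distance(a, b):
    """Optimal string alignment distance (adjacent swaps count as one edit)."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete every character of a
    for j in range(cols):
        d[0][j] = j  # insert every character of b
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Swap of two adjacent characters.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


def fuzzy_query(field, value, fuzziness="AUTO"):
    """Build a fuzzy query body for a single field."""
    return {"query": {"fuzzy": {field: {"value": value, "fuzziness": fuzziness}}}}


for candidate in ("eays", "ease", "easi"):
    print(candidate, edit_distance("easy", candidate))

body = fuzzy_query("review_body", "easi")
```

Each candidate is within one edit of "easy", which is why a fuzzy query with a small fuzziness value can match all of them.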
Search with synonyms:
You can upload custom dictionary files, such as stop word and synonym lists, to your Amazon OpenSearch Service domain. These files are referred to as packages. They tell OpenSearch to ignore certain high-frequency words or to treat terms like “brinjal”, “aubergine”, and “eggplant” as equivalent, resulting in better search results.
To implement search with synonyms, you need to perform additional configuration on your Amazon OpenSearch Service cluster. For steps to implement synonym search, see custom packages for Amazon OpenSearch. The following is an example for synonym search that considers
"program" equivalent in the
To stop incurring costs, clean up the resources created in this post by deleting the CloudFormation stack you created. For instructions, refer to Deleting a stack on the AWS CloudFormation console.
In this post, we showed you how to integrate Amazon DocumentDB, OpenSearch Service, and Lambda to perform full-text search and fuzzy search queries over JSON data. Specifically, we used the latest Lambda integration to replicate change events from an Amazon DocumentDB change stream to an OpenSearch Service index.
Visit Get Started with Amazon DocumentDB to begin using Amazon DocumentDB.
About the Authors
Kaarthiik Thota is a Senior Amazon DocumentDB Specialist Solutions Architect at AWS based out of London. He is passionate about database technologies and enjoys helping customers solve problems and modernize applications using NoSQL databases. Before joining AWS, he worked extensively with relational databases, NoSQL databases, and business intelligence technologies for more than 15 years.
Hendy Wijaya is a Senior OpenSearch Specialist Solutions Architect at Amazon Web Services. Hendy enables customers to leverage AWS services to achieve their business objectives and gain competitive advantages. He is passionate about collaborating with customers in getting the best out of OpenSearch and Amazon OpenSearch