AWS Big Data Blog
Amazon DocumentDB zero-ETL integration with Amazon OpenSearch Service is now available
Today, we are announcing the general availability of Amazon DocumentDB (with MongoDB compatibility) zero-ETL integration with Amazon OpenSearch Service.
Amazon DocumentDB provides native text search and vector search capabilities. With Amazon OpenSearch Service, you can perform advanced search analytics, such as fuzzy search, synonym search, cross-collection search, and multilingual search, on Amazon DocumentDB data.
Zero-ETL integration simplifies your architecture for advanced search analytics. It frees you from performing undifferentiated heavy lifting tasks and the costs associated with building and managing data pipeline architecture and data synchronization between the two services.
In this post, we show you how to configure zero-ETL integration of Amazon DocumentDB with OpenSearch Service using Amazon OpenSearch Ingestion. It involves performing a full load of Amazon DocumentDB data and continuously streaming the latest data to Amazon OpenSearch Service using change streams. For other ingestion methods, see documentation.
Solution overview
At a high level, this solution involves the following steps:
- Enable change streams on the Amazon DocumentDB collections.
- Create the OpenSearch Ingestion pipeline.
- Load sample data on the Amazon DocumentDB cluster.
- Verify the data in OpenSearch Service.
Prerequisites
To implement this solution, you need the following prerequisites:
- An Amazon DocumentDB instance-based cluster. You can use an existing cluster or create a new one.
- An active OpenSearch Service domain. You can use an existing domain or create a new domain.
- A secret for the Amazon DocumentDB cluster stored in AWS Secrets Manager.
- An Amazon Simple Storage Service (Amazon S3) bucket.
Zero-ETL performs an initial full load of your collection by running a collection scan on the primary instance of your Amazon DocumentDB cluster. Depending on the size of the data, the scan may take several minutes to complete, and you may notice elevated resource consumption on your cluster while it runs.
Enable change streams on the Amazon DocumentDB collections
Amazon DocumentDB change stream events comprise a time-ordered sequence of data changes due to inserts, updates, and deletes on your data. We use these change stream events to transmit data changes from the Amazon DocumentDB cluster to the OpenSearch Service domain.
Change streams are disabled by default; you can enable them at the individual collection level, database level, or cluster level. To enable change streams on your collections, complete the following steps:
- Connect to Amazon DocumentDB using mongo shell.
- Enable change streams on your collection with the following code. For this post, we use the Amazon DocumentDB database inventory and collection product:
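As a minimal sketch, the enabling command can be expressed in Python as the `modifyChangeStreams` admin command document for the inventory database and product collection. The helper name `build_change_stream_command` is hypothetical, and the commented-out lines assume the `pymongo` driver and a placeholder connection string. In the mongo shell, the same document would be passed to `db.adminCommand(...)`.

```python
def build_change_stream_command(database, collection, enable=True):
    """Return the admin command document that toggles change streams.

    An empty string for `collection` targets every collection in the
    database; empty strings for both targets the whole cluster.
    """
    return {
        "modifyChangeStreams": 1,  # the command name must be the first key
        "database": database,
        "collection": collection,
        "enable": enable,
    }


command = build_change_stream_command("inventory", "product")
print(command)

# Assumed usage with pymongo (not run here; the URI is a placeholder):
# from pymongo import MongoClient
# client = MongoClient("mongodb://<user>:<password>@<cluster-endpoint>:27017/?tls=true")
# client.admin.command(command)
```

Calling the helper once per collection keeps change streams limited to the collections you actually want to stream.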
If you have more than one collection for which you want to stream data into OpenSearch Service, enable change streams for each collection. If you want to enable it at the database or cluster level, see Enabling Change Streams.
We recommend enabling change streams only for the collections that require them.
Create an OpenSearch Ingestion pipeline
OpenSearch Ingestion is a fully managed data collector that delivers real-time log and trace data to OpenSearch Service domains. OpenSearch Ingestion is powered by the open source data collector Data Prepper. Data Prepper is part of the open source OpenSearch project.
With OpenSearch Ingestion, you can filter, enrich, transform, and deliver your data for downstream analysis and visualization. OpenSearch Ingestion is serverless, so you don’t need to worry about scaling your infrastructure, operating your ingestion fleet, or patching and updating the software.
For a comprehensive overview of OpenSearch Ingestion, visit Amazon OpenSearch Ingestion, and for more information about the Data Prepper open source project, visit Data Prepper.
To create an OpenSearch Ingestion pipeline, complete the following steps:
- On the OpenSearch Service console, choose Pipelines in the navigation pane.
- Choose Create pipeline.
- For Pipeline name, enter a name (for example, zeroetl-docdb-to-opensearch).
- Set up pipeline capacity for compute resources to automatically scale your pipeline based on the current ingestion workload.
- Input the minimum and maximum Ingestion OpenSearch Compute Units (OCUs). In this example, we use the default pipeline capacity settings of minimum 1 Ingestion OCU and maximum 4 Ingestion OCUs.
Each OCU is a combination of approximately 8 GB of memory and 2 vCPUs that can handle an estimated 8 GiB per hour. OpenSearch Ingestion supports up to 96 OCUs, and it automatically scales up and down based on your ingest workload demand.
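As a rough sizing aid, the figures above can be turned into arithmetic. This sketch assumes throughput scales linearly with the OCU count, which is an approximation; the 8 GiB per hour rate and the 96 OCU limit are the estimates quoted above, and the function name is illustrative.

```python
GIB_PER_HOUR_PER_OCU = 8  # estimate quoted in this post
MAX_OCUS = 96             # upper limit quoted in this post


def estimated_gib_per_hour(ocus):
    """Estimate hourly ingest capacity, assuming linear scaling per OCU."""
    if not 1 <= ocus <= MAX_OCUS:
        raise ValueError(f"OCU count must be between 1 and {MAX_OCUS}")
    return ocus * GIB_PER_HOUR_PER_OCU


for ocus in (1, 4, MAX_OCUS):
    print(f"{ocus:>2} OCU(s): about {estimated_gib_per_hour(ocus)} GiB per hour")
```

Under that assumption, the default range of 1 to 4 OCUs corresponds to roughly 8 to 32 GiB per hour.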
- Choose the configuration blueprint and under Use case in the navigation pane, choose ZeroETL.
- Select Zero-ETL with DocumentDB to build the pipeline configuration.
This pipeline is a combination of a source part from the Amazon DocumentDB settings and a sink part for OpenSearch Service.
You must set the AWS Identity and Access Management (IAM) role for each sts_role_arn entry in the configuration, with the necessary permissions to read data from the Amazon DocumentDB database and collection and write to an OpenSearch Service domain. OpenSearch Ingestion pipelines then assume this role so that the right security posture is maintained when moving the data from source to destination. To learn more, see Setting up roles and users in Amazon OpenSearch Ingestion.
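For the pipeline to assume the role, the role's trust policy has to allow the OpenSearch Ingestion service principal. The following sketch builds such a trust policy as a Python dictionary and prints it as JSON. The permissions policy attached to the role (access to the Amazon DocumentDB cluster, the S3 bucket, and the domain) depends on your resources and is not shown.

```python
import json

# Trust policy that lets OpenSearch Ingestion pipelines assume the role.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "osis-pipelines.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }
    ],
}

print(json.dumps(trust_policy, indent=2))
```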
You need one OpenSearch Ingestion pipeline per Amazon DocumentDB collection.
Provide the following parameters from the blueprint:
- Amazon DocumentDB endpoint – Provide your Amazon DocumentDB cluster endpoint.
- Amazon DocumentDB collection – Provide your Amazon DocumentDB database name and collection name in the format dbname.collection within the collections section. For example, inventory.product.
- s3_bucket – Provide your S3 bucket name along with the AWS Region and S3 prefix. This will be used temporarily to hold the data from Amazon DocumentDB for data synchronization.
- OpenSearch hosts – Provide the OpenSearch Service domain endpoint for the host and provide the preferred index name to store the data.
- secret_id – Provide the ARN for the secret for the Amazon DocumentDB cluster along with its Region.
- sts_role_arn – Provide the ARN for the IAM role that has permissions for the Amazon DocumentDB cluster, S3 bucket, and OpenSearch Service domain.
To learn more, see Creating Amazon OpenSearch Ingestion pipelines.
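To illustrate where these parameters land, the following sketch fills a simplified pipeline configuration from a dictionary of values. The YAML layout is an assumption modeled on the Data Prepper documentdb source and opensearch sink options, not a copy of the console blueprint, which contains additional sink settings omitted here. All endpoint names, ARNs, the bucket, and the index name are placeholders, a single Region is assumed for every resource, and `render_pipeline` is a hypothetical helper. Treat the blueprint generated in the console as the authority.

```python
PIPELINE_TEMPLATE = """\
version: "2"
documentdb-pipeline:
  source:
    documentdb:
      host: "<<docdb_endpoint>>"
      port: 27017
      authentication:
        username: ${{aws_secrets:secret:username}}
        password: ${{aws_secrets:secret:password}}
      aws:
        sts_role_arn: "<<sts_role_arn>>"
      s3_bucket: "<<s3_bucket>>"
      s3_region: "<<region>>"
      s3_prefix: "<<s3_prefix>>"
      collections:
        - collection: "<<collection>>"
          export: true   # initial full load
          stream: true   # ongoing change stream events
  sink:
    - opensearch:
        hosts: ["<<opensearch_endpoint>>"]
        index: "<<index>>"
        aws:
          sts_role_arn: "<<sts_role_arn>>"
          region: "<<region>>"
extension:
  aws:
    secrets:
      secret:
        secret_id: "<<secret_id>>"
        region: "<<region>>"
        sts_role_arn: "<<sts_role_arn>>"
"""


def render_pipeline(values):
    """Replace every <<name>> placeholder and fail if any is left unfilled."""
    text = PIPELINE_TEMPLATE
    for key, value in values.items():
        text = text.replace("<<" + key + ">>", value)
    if "<<" in text:
        raise ValueError("unfilled placeholder in pipeline configuration")
    return text


# All values below are placeholders for illustration only.
config = render_pipeline({
    "docdb_endpoint": "my-cluster.cluster-example.us-east-1.docdb.amazonaws.com",
    "collection": "inventory.product",
    "s3_bucket": "my-zeroetl-bucket",
    "s3_prefix": "docdb-sync/",
    "region": "us-east-1",
    "opensearch_endpoint": "https://search-mydomain.us-east-1.es.amazonaws.com",
    "index": "product-index",
    "secret_id": "arn:aws:secretsmanager:us-east-1:123456789012:secret:my-docdb-secret",
    "sts_role_arn": "arn:aws:iam::123456789012:role/Example-Role",
})
print(config)
```

Because one pipeline serves one collection, calling the helper with a different `collection` and `index` value gives a starting point for each additional pipeline.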
- After entering all the required values, validate the pipeline configuration for any errors.
- When designing a production workload, deploy your pipeline within a VPC. Choose your VPC, subnets, and security groups. Also select Attach to VPC and choose the corresponding VPC CIDR range.
The security group's inbound rule should allow access to the Amazon DocumentDB port. For more information, refer to Securing Amazon OpenSearch Ingestion pipelines within a VPC.
Load sample data on the Amazon DocumentDB cluster
Complete the following steps to load the sample data:
- Connect to your Amazon DocumentDB cluster.
- Insert some documents into the collection product in the inventory database by running the following commands. For creating and updating documents on Amazon DocumentDB, refer to Working with Documents.
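As an illustration, sample documents for the product collection might look like the following. The document shape is an assumption: the `Item` field and the nesting of `OnHand` and `MinOnHand` under `Inventory` are inferred from the item and fields referenced later in this post, and the values and the second document are made up. The commented-out lines assume the `pymongo` driver and a placeholder connection string.

```python
# Illustrative sample documents; field names and values are assumptions.
sample_products = [
    {
        "Item": "Ultra GelPen",
        "Colors": ["Violet"],
        "Inventory": {"OnHand": 100, "MinOnHand": 35},
        "UnitPrice": 0.99,
    },
    {
        "Item": "Poster Paint",
        "Colors": ["Red", "Green", "Blue"],
        "Inventory": {"OnHand": 47, "MinOnHand": 50},
    },
]

for doc in sample_products:
    print(doc["Item"], doc["Inventory"])

# Assumed usage with pymongo (not run here; the URI is a placeholder):
# from pymongo import MongoClient
# client = MongoClient("mongodb://<user>:<password>@<cluster-endpoint>:27017/?tls=true")
# client["inventory"]["product"].insert_many(sample_products)
```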
Verify the data in OpenSearch Service
Within a few seconds, the synchronized items become searchable. You can use the OpenSearch Dashboards dev console to search for them. For more information, see Creating and searching for documents in Amazon OpenSearch Service.
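A search request for the dev console could be sketched as follows. The code builds two request bodies, an exact `match` query and a fuzzy variant, and prints them in the console's `GET <index>/_search` form. The index name `product-index` and the `Item` field are assumptions, and `build_match_query` is a hypothetical helper.

```python
import json


def build_match_query(field, text, fuzziness=None):
    """Build a search body with a `match` query, optionally fuzzy."""
    if fuzziness is None:
        clause = text
    else:
        clause = {"query": text, "fuzziness": fuzziness}
    return {"query": {"match": {field: clause}}}


index_name = "product-index"  # hypothetical; use the index set in your pipeline

exact = build_match_query("Item", "Ultra GelPen")
fuzzy = build_match_query("Item", "Ultra GelPn", fuzziness="AUTO")  # misspelled on purpose

for body in (exact, fuzzy):
    print(f"GET {index_name}/_search")
    print(json.dumps(body, indent=2))
```

The fuzzy variant is one example of the advanced search features that motivate sending Amazon DocumentDB data to OpenSearch Service.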
To verify the change data capture (CDC), run the following command to update the OnHand and MinOnHand fields for the existing document item Ultra GelPen in the product collection:
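The update can be sketched as a filter plus a `$set` document. The nesting of both fields under `Inventory`, the `Item` field name, and the new values are assumptions for illustration. The hypothetical `apply_set` helper only previews the result locally so you know what to look for in the index; the commented-out lines show the assumed `pymongo` call.

```python
import copy

update_filter = {"Item": "Ultra GelPen"}
update_spec = {"$set": {"Inventory.OnHand": 300, "Inventory.MinOnHand": 100}}


def apply_set(document, update):
    """Locally preview a `$set` update that uses dotted field paths."""
    result = copy.deepcopy(document)
    for path, value in update["$set"].items():
        *parents, leaf = path.split(".")
        target = result
        for key in parents:
            target = target.setdefault(key, {})
        target[leaf] = value
    return result


# Assumed state of the document before the update.
before = {"Item": "Ultra GelPen", "Inventory": {"OnHand": 100, "MinOnHand": 35}}
after = apply_set(before, update_spec)
print(after)

# Assumed usage with pymongo (not run here):
# client["inventory"]["product"].update_one(update_filter, update_spec)
```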
Verify the CDC for the update to the document for the item Ultra GelPen on the OpenSearch Service index.
Monitor the CDC pipeline
You can monitor the state of the pipelines by checking their status on the OpenSearch Service console. Additionally, you can use Amazon CloudWatch for real-time metrics and logs, and set up alerts that fire when user-defined thresholds are breached.
Clean up
Make sure you clean up any unwanted AWS resources created while following this post to prevent additional billing for them. Follow these steps to clean up your AWS account:
- On the OpenSearch Service console, choose Domains under Managed clusters in the navigation pane.
- Select the domain you want to delete and choose Delete.
- Choose Pipelines under Ingestion in the navigation pane.
- Select the pipeline you want to delete and on the Actions menu, choose Delete.
- On the Amazon S3 console, select the S3 bucket and choose Delete.
Conclusion
In this post, you learned how to enable zero-ETL integration between Amazon DocumentDB change data streams and OpenSearch Service. To learn more about zero-ETL integrations available with other data sources, see Working with Amazon OpenSearch Ingestion pipeline integrations.
About the Authors
Praveen Kadipikonda is a Senior Analytics Specialist Solutions Architect at AWS based out of Dallas. He helps customers build efficient, performant, and scalable analytic solutions. He has worked on building database and data warehouse solutions for over 15 years.
Kaarthiik Thota is a Senior Amazon DocumentDB Specialist Solutions Architect at AWS based out of London. He is passionate about database technologies and enjoys helping customers solve problems and modernize applications using NoSQL databases. Before joining AWS, he worked extensively with relational databases, NoSQL databases, and business intelligence technologies for over 15 years.
Muthu Pitchaimani is a Search Specialist with Amazon OpenSearch Service. He builds large-scale search applications and solutions. Muthu is interested in the topics of networking and security, and is based out of Austin, Texas.