Amazon DocumentDB zero-ETL integration with Amazon OpenSearch Service is now available

Today, we are announcing the general availability of Amazon DocumentDB (with MongoDB compatibility) zero-ETL integration with Amazon OpenSearch Service.

Amazon DocumentDB provides native text search and vector search capabilities. With Amazon OpenSearch Service, you can perform advanced search analytics, such as fuzzy search, synonym search, cross-collection search, and multilingual search, on Amazon DocumentDB data.

Zero-ETL integration simplifies your architecture for advanced search analytics. It frees you from performing undifferentiated heavy lifting tasks and the costs associated with building and managing data pipeline architecture and data synchronization between the two services.

In this post, we show you how to configure zero-ETL integration of Amazon DocumentDB with OpenSearch Service using Amazon OpenSearch Ingestion. It involves performing a full load of Amazon DocumentDB data and continuously streaming the latest data to Amazon OpenSearch Service using change streams. For other ingestion methods, see documentation.

Solution overview

At a high level, this solution involves the following steps:

Enable change streams on the Amazon DocumentDB collections.
Create the OpenSearch Ingestion pipeline.
Load sample data on the Amazon DocumentDB cluster.
Verify the data in OpenSearch Service.

Prerequisites

To implement this solution, you need the following prerequisites:

An Amazon DocumentDB instance-based cluster. You can use an existing cluster or create a new one.
An active OpenSearch Service domain. You can use an existing domain or create a new domain.
A secret for the Amazon DocumentDB cluster stored in AWS Secrets Manager.
An Amazon Simple Storage Service (Amazon S3) bucket.

Zero-ETL performs an initial full load of your collection by running a collection scan on the primary instance of your Amazon DocumentDB cluster. Depending on the size of the data, the scan may take several minutes to complete, and you may notice elevated resource consumption on your cluster while it runs.

Enable change streams on the Amazon DocumentDB collections

Amazon DocumentDB change stream events comprise a time-ordered sequence of data changes due to inserts, updates, and deletes on your data. We use these change stream events to transmit data changes from the Amazon DocumentDB cluster to the OpenSearch Service domain.
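
To make the idea of a time-ordered event sequence concrete, the following sketch maps change stream events to the kind of action a search sink might apply. It is an illustration only and not how OpenSearch Ingestion is implemented. The function name toSinkAction and the sample events are hypothetical, and the sketch assumes update events carry the full document, which requires the change stream to be opened with a full document lookup.

```javascript
// Illustrative only: maps a change stream event to a sink action.
// This is NOT the OpenSearch Ingestion implementation.
function toSinkAction(event) {
  const id = String(event.documentKey._id);
  switch (event.operationType) {
    case "insert":
    case "update":
    case "replace":
      // Writes carry the document body so the indexed copy can be overwritten.
      return { action: "index", id, body: event.fullDocument };
    case "delete":
      return { action: "delete", id };
    default:
      // Other event types (for example, drop or invalidate) are ignored here.
      return null;
  }
}

// Hypothetical events in time order, shaped like change stream events.
const events = [
  { operationType: "insert", documentKey: { _id: 1 }, fullDocument: { Item: "Ultra GelPen" } },
  { operationType: "update", documentKey: { _id: 1 }, fullDocument: { Item: "Ultra GelPen", UnitPrice: 1.25 } },
  { operationType: "delete", documentKey: { _id: 1 } },
];

const actions = events.map(toSinkAction);
console.log(actions.map((a) => a.action).join(", ")); // index, index, delete
```

Because the events are applied in order, the final state in the index matches the final state in the collection: here, the document ends up deleted.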

Change streams are disabled by default; you can enable them at the individual collection level, database level, or cluster level. To enable change streams on your collections, complete the following steps:

Connect to Amazon DocumentDB using the mongo shell.
Enable change streams on your collection with the following code. For this post, we use the Amazon DocumentDB database inventory and collection product:
```
db.adminCommand({modifyChangeStreams: 1,
    database: "inventory",
    collection: "product", 
    enable: true});
```

If you have more than one collection for which you want to stream data into OpenSearch Service, enable change streams for each collection. If you want to enable it at the database or cluster level, see Enabling Change Streams.

We recommend enabling change streams only for the collections that need them.
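
If you need to enable change streams on several collections, a small helper can generate one command per collection. The following is a minimal sketch; the function name buildEnableCommands and the second collection name orders are hypothetical and used only for illustration.

```javascript
// Builds one modifyChangeStreams command document per collection.
// The command name must stay the first key in each document.
function buildEnableCommands(database, collections) {
  return collections.map((collection) => ({
    modifyChangeStreams: 1,
    database,
    collection,
    enable: true,
  }));
}

// "product" comes from this post; "orders" is a hypothetical second collection.
const commands = buildEnableCommands("inventory", ["product", "orders"]);
console.log(JSON.stringify(commands, null, 2));

// In the mongo shell, each command would be passed to db.adminCommand:
//   commands.forEach((cmd) => printjson(db.adminCommand(cmd)));
```

Listing the collections explicitly keeps change streams limited to the collections you actually intend to synchronize.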

Create an OpenSearch Ingestion pipeline

OpenSearch Ingestion is a fully managed data collector that delivers real-time log and trace data to OpenSearch Service domains. OpenSearch Ingestion is powered by the open source data collector Data Prepper. Data Prepper is part of the open source OpenSearch project.

With OpenSearch Ingestion, you can filter, enrich, transform, and deliver your data for downstream analysis and visualization. OpenSearch Ingestion is serverless, so you don’t need to worry about scaling your infrastructure, operating your ingestion fleet, and patching or updating the software.

For a comprehensive overview of OpenSearch Ingestion, visit Amazon OpenSearch Ingestion, and for more information about the Data Prepper open source project, visit Data Prepper.

To create an OpenSearch Ingestion pipeline, complete the following steps:

On the OpenSearch Service console, choose Pipelines in the navigation pane.
Choose Create pipeline.
For Pipeline name, enter a name (for example, zeroetl-docdb-to-opensearch).
Set up pipeline capacity for compute resources to automatically scale your pipeline based on the current ingestion workload.
Input the minimum and maximum Ingestion OpenSearch Compute Units (OCUs). In this example, we use the default pipeline capacity settings of minimum 1 Ingestion OCU and maximum 4 Ingestion OCUs.

Each OCU is a combination of approximately 8 GB of memory and 2 vCPUs that can handle an estimated 8 GiB per hour. OpenSearch Ingestion supports up to 96 OCUs, and it automatically scales up and down based on your ingest workload demand.
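
As a rough planning aid, the per-OCU estimate above can be turned into a back-of-the-envelope calculation for the initial load. The following sketch assumes the 8 GiB per hour figure holds and that throughput scales linearly with OCUs, which real workloads may not do. The dataset size and the function name estimateLoadHours are hypothetical.

```javascript
// Estimate quoted in this post: roughly 8 GiB per hour per Ingestion OCU.
const GIB_PER_OCU_HOUR = 8;

// Returns the estimated hours to move dataGiB with a fixed number of OCUs,
// assuming linear scaling (an assumption, not a guarantee).
function estimateLoadHours(dataGiB, ocus) {
  if (dataGiB < 0 || ocus <= 0) {
    throw new RangeError("dataGiB must be >= 0 and ocus must be > 0");
  }
  return dataGiB / (ocus * GIB_PER_OCU_HOUR);
}

const datasetGiB = 64; // hypothetical collection size
const atMin = estimateLoadHours(datasetGiB, 1); // minimum capacity in this post
const atMax = estimateLoadHours(datasetGiB, 4); // maximum capacity in this post
console.log(`1 OCU: ${atMin} h, 4 OCUs: ${atMax} h`); // 1 OCU: 8 h, 4 OCUs: 2 h
```

Treat the result as an order-of-magnitude guide for choosing the minimum and maximum OCU settings, not as a performance commitment.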

Choose the configuration blueprint and under Use case in the navigation pane, choose ZeroETL.
Select Zero-ETL with DocumentDB to build the pipeline configuration.

This pipeline is a combination of a source part from the Amazon DocumentDB settings and a sink part for OpenSearch Service.

You must set multiple AWS Identity and Access Management (IAM) roles (sts_role_arn) with the necessary permissions to read data from the Amazon DocumentDB database and collection and write to an OpenSearch Service domain. OpenSearch Ingestion pipelines then assume these roles, which keeps the right security posture in place while data moves from source to destination. To learn more, see Setting up roles and users in Amazon OpenSearch Ingestion.
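
For a pipeline to assume a role, the role needs a trust relationship with the OpenSearch Ingestion service principal, osis-pipelines.amazonaws.com (the blueprint comments mention the same requirement). The following sketch builds such a trust policy document. The function name buildPipelineTrustPolicy is hypothetical, and the sketch covers only the trust policy; the permissions policy for Amazon DocumentDB, Amazon S3, and OpenSearch Service is not shown.

```javascript
// Builds an IAM trust policy that lets OpenSearch Ingestion pipelines
// assume the role. Permissions policies are out of scope for this sketch.
function buildPipelineTrustPolicy() {
  return {
    Version: "2012-10-17",
    Statement: [
      {
        Effect: "Allow",
        Principal: { Service: "osis-pipelines.amazonaws.com" },
        Action: "sts:AssumeRole",
      },
    ],
  };
}

// Print the JSON so it can be pasted into the role's trust relationship.
console.log(JSON.stringify(buildPipelineTrustPolicy(), null, 2));
```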

You need one OpenSearch Ingestion pipeline per Amazon DocumentDB collection.

```
version: "2"
documentdb-pipeline:
  source:
    documentdb:
      acknowledgments: true
      host: "<<docdb-2024-01-03-20-31-17.cluster-abcdef.us-east-1.docdb.amazonaws.com>>"
      port: 27017
      authentication:
        username: ${{aws_secrets:secret:username}}
        password: ${{aws_secrets:secret:password}}
      aws:
        sts_role_arn: "<<arn:aws:iam::123456789012:role/Example-Role>>"

      s3_bucket: "<<bucket-name>>"
      s3_region: "<<bucket-region>>"
      # optional s3_prefix for Opensearch ingestion to write the records
      # s3_prefix: "<<path_prefix>>"
      collections:
        # collection format: <databaseName>.<collectionName>
        - collection: "<<databaseName.collectionName>>"
          export: true
          stream: true
  sink:
    - opensearch:
        # REQUIRED: Provide an AWS OpenSearch endpoint
        hosts: [ "<<https://search-mydomain-1a2a3a4a5a6a7a8a9a0a9a8a7a.us-east-1.es.amazonaws.com>>" ]
        index: "<<index_name>>"
        index_type: custom
        document_id: "${getMetadata(\"primary_key\")}"
        action: "${getMetadata(\"opensearch_action\")}"
        # DocumentDB record creation or event timestamp
        document_version: "${getMetadata(\"document_version\")}"
        document_version_type: "external"
        aws:
          # REQUIRED: Provide a Role ARN with access to the domain. This role should have a trust relationship with osis-pipelines.amazonaws.com
          sts_role_arn: "<<arn:aws:iam::123456789012:role/Example-Role>>"
          # Provide the region of the domain.
          region: "<<us-east-1>>"
          # Enable the 'serverless' flag if the sink is an Amazon OpenSearch Serverless collection
          # serverless: true
          # serverless_options:
            # Specify a name here to create or update network policy for the serverless collection
            # network_policy_name: "network-policy-name"

extension:
  aws:
    secrets:
      secret:
        # Secret name or secret ARN
        secret_id: "<<my-docdb-secret>>"
        region: "<<us-east-1>>"
        sts_role_arn: "<<arn:aws:iam::123456789012:role/Example-Role>>"
        refresh_interval: PT1H
```

Provide the following parameters from the blueprint:

Amazon DocumentDB endpoint – Provide your Amazon DocumentDB cluster endpoint.
Amazon DocumentDB collection – Provide your Amazon DocumentDB database name and collection name in the format dbname.collection within the collections section. For example, inventory.product.
s3_bucket – Provide your S3 bucket name along with the AWS Region and S3 prefix. This will be used temporarily to hold the data from Amazon DocumentDB for data synchronization.
OpenSearch hosts – Provide the OpenSearch Service domain endpoint for the host and provide the preferred index name to store the data.
secret_id – Provide the ARN for the secret for the Amazon DocumentDB cluster along with its Region.
sts_role_arn – Provide the ARN for the IAM role that has permissions for the Amazon DocumentDB cluster, S3 bucket, and OpenSearch Service domain.

To learn more, see Creating Amazon OpenSearch Ingestion pipelines.
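
Before you paste the configuration into the console, a quick local check can catch blueprint placeholders that were left unfilled and collection names that don't follow the dbname.collection format. The following is a minimal sketch and does not replace the console's own validation. The function names and the sample configuration lines (including the bucket name) are hypothetical.

```javascript
// Returns the distinct <<...>> placeholders still present in the config text.
function findUnfilledPlaceholders(configText) {
  const matches = configText.match(/<<[^<>]*>>/g);
  return matches === null ? [] : [...new Set(matches)];
}

// Splits "dbname.collection" at the first dot, because collection names
// may themselves contain dots while database names may not.
function parseCollectionName(name) {
  const dot = name.indexOf(".");
  if (dot <= 0 || dot === name.length - 1) {
    throw new Error(`Expected dbname.collection, got "${name}"`);
  }
  return { database: name.slice(0, dot), collection: name.slice(dot + 1) };
}

// Hypothetical excerpt of a partially filled-in configuration.
const sample = [
  's3_bucket: "my-zeroetl-bucket"',
  's3_region: "<<bucket-region>>"',
  '- collection: "inventory.product"',
].join("\n");

console.log(findUnfilledPlaceholders(sample)); // [ '<<bucket-region>>' ]
console.log(parseCollectionName("inventory.product"));
```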

After entering all the required values, validate the pipeline configuration for any errors.
When designing a production workload, deploy your pipeline within a VPC. Choose your VPC, subnets, and security groups. Also select Attach to VPC and choose the corresponding VPC CIDR range.

The security group's inbound rules must allow access to the Amazon DocumentDB port. For more information, refer to Securing Amazon OpenSearch Ingestion pipelines within a VPC.

Load sample data on the Amazon DocumentDB cluster

Complete the following steps to load the sample data:

Connect to your Amazon DocumentDB cluster.

Insert some documents into the collection product in the inventory database by running the following commands. For creating and updating documents on Amazon DocumentDB, refer to Working with Documents.

```
use inventory;

db.product.insertMany([
   {
      "Item":"Ultra GelPen",
      "Colors":[
         "Violet"
      ],
      "Inventory":{
         "OnHand":100,
         "MinOnHand":35
      },
      "UnitPrice":0.99
   },
   {
      "Item":"Poster Paint",
      "Colors":[
         "Red",
         "Green",
         "Blue",
         "Black",
         "White"
      ],
      "Inventory":{
         "OnHand":47,
         "MinOnHand":50
      }
   },
   {
      "Item":"Spray Paint",
      "Colors":[
         "Black",
         "Red",
         "Green",
         "Blue"
      ],
      "Inventory":{
         "OnHand":47,
         "MinOnHand":50,
         "OrderQnty":36
      }
   }
])
```

Verify the data in OpenSearch Service

You can use the OpenSearch Dashboards dev console to search for the synchronized items within a few seconds. For more information, see Creating and searching for documents in Amazon OpenSearch Service.
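
As one way to check the synchronized items, the following sketch builds a search request body that you could paste into the dev console. The function name buildItemSearch is hypothetical, and the index name is left as the same placeholder used in the pipeline configuration. The optional fuzzy variant uses the standard fuzziness option of the match query.

```javascript
// Builds a match query on the Item field; with fuzzy = true, small typos
// in the search text are tolerated through the match query's fuzziness option.
function buildItemSearch(itemName, fuzzy = false) {
  const match = fuzzy ? { query: itemName, fuzziness: "AUTO" } : itemName;
  return { query: { match: { Item: match } } };
}

// Replace the placeholder with the index name from your pipeline sink.
const indexName = "<<index_name>>";

// Print a request in the form the dev console accepts.
console.log(`GET ${indexName}/_search`);
console.log(JSON.stringify(buildItemSearch("Ultra GelPen"), null, 2));
```

Calling buildItemSearch("Ultra GelPn", true) produces the fuzzy form of the same query, which is a simple way to try the fuzzy search capability mentioned earlier in this post.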

To verify the change data capture (CDC), run the following command to update the OnHand and MinOnHand fields for the existing document item Ultra GelPen in the product collection:

```
db.product.updateOne({
   "Item":"Ultra GelPen"
},
{
   "$set":{
      "Inventory":{
         "OnHand":300,
         "MinOnHand":100
      }
   }
});
```

Verify the CDC for the update to the document for the item Ultra GelPen on the OpenSearch Service index.
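
To know what to look for in the index, it helps to work out the expected document first. The following sketch mimics the effect of $set on top-level fields in plain JavaScript; it is an illustration with a hypothetical function name, not the database's implementation. Note that setting the whole Inventory field replaces the embedded document, whereas a dotted path such as "Inventory.OnHand" would change a single nested field.

```javascript
// Mimics $set for top-level fields only: each named field is replaced wholesale.
function applyTopLevelSet(doc, setSpec) {
  return { ...doc, ...setSpec };
}

// The Ultra GelPen document as inserted earlier in this post (without _id).
const before = {
  Item: "Ultra GelPen",
  Colors: ["Violet"],
  Inventory: { OnHand: 100, MinOnHand: 35 },
  UnitPrice: 0.99,
};

const after = applyTopLevelSet(before, {
  Inventory: { OnHand: 300, MinOnHand: 100 },
});

console.log(JSON.stringify(after.Inventory)); // {"OnHand":300,"MinOnHand":100}
```

After the pipeline processes the change event, the Ultra GelPen document in the OpenSearch Service index should show the new Inventory values while Item, Colors, and UnitPrice stay unchanged.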

Monitor the CDC pipeline

You can monitor the state of the pipelines by checking the status of the pipeline on the OpenSearch Service console. Additionally, you can use Amazon CloudWatch to provide real-time metrics and logs, which lets you set up alerts in case of a breach of user-defined thresholds.

Clean up

To avoid additional charges, clean up any AWS resources you created while following this post and no longer need. Complete the following steps:

On the OpenSearch Service console, choose Domains under Managed clusters in the navigation pane.
Select the domain you want to delete and choose Delete.
Choose Pipelines under Ingestion in the navigation pane.
Select the pipeline you want to delete and on the Actions menu, choose Delete.
On the Amazon S3 console, select the S3 bucket and choose Delete.

Conclusion

In this post, you learned how to enable zero-ETL integration between Amazon DocumentDB change data streams and OpenSearch Service. To learn more about zero-ETL integrations available with other data sources, see Working with Amazon OpenSearch Ingestion pipeline integrations.

About the Authors

Praveen Kadipikonda is a Senior Analytics Specialist Solutions Architect at AWS based out of Dallas. He helps customers build efficient, performant, and scalable analytic solutions. He has worked with building databases and data warehouse solutions for over 15 years.

Kaarthiik Thota is a Senior Amazon DocumentDB Specialist Solutions Architect at AWS based out of London. He is passionate about database technologies and enjoys helping customers solve problems and modernize applications using NoSQL databases. Before joining AWS, he worked extensively with relational databases, NoSQL databases, and business intelligence technologies for over 15 years.

Muthu Pitchaimani is a Search Specialist with Amazon OpenSearch Service. He builds large-scale search applications and solutions. Muthu is interested in the topics of networking and security, and is based out of Austin, Texas.

AWS Big Data Blog