Perform reindexing in Amazon OpenSearch Serverless using Amazon OpenSearch Ingestion
Amazon OpenSearch Serverless is a serverless deployment option for Amazon OpenSearch Service that makes it straightforward to run search and analytics workloads without managing infrastructure. Customers using OpenSearch Serverless often need to copy documents between two indexes within the same collection or across different collections. This need primarily arises from two scenarios:
- Reindexing – You frequently need to update or modify index mapping due to evolving data needs or schema changes
- Disaster recovery – Although OpenSearch Serverless data is inherently durable, you may want to copy data across AWS Regions for added redundancy and resiliency
Amazon OpenSearch Ingestion recently introduced a feature that supports OpenSearch as a source. OpenSearch Ingestion, a fully managed, serverless data collector, facilitates real-time ingestion of log, metric, and trace data into OpenSearch Service domains and OpenSearch Serverless collections. You can use this feature to address both scenarios by reading the data from an OpenSearch Serverless collection. This capability allows you to effortlessly copy data between indexes, making data management tasks more streamlined and eliminating the need for custom code.
In this post, we outline the steps to copy data between two indexes in the same OpenSearch Serverless collection using the new OpenSearch source feature of OpenSearch Ingestion. This is particularly useful for reindexing operations where you want to change your data schema. OpenSearch Serverless and OpenSearch Ingestion are both serverless services that enable you to seamlessly handle your data workflows, providing optimal performance and scalability.
Solution overview
The following diagram shows the flow of copying documents from the source index to the destination index using an OpenSearch Ingestion pipeline.
Implementing the solution consists of the following steps:
- Create an AWS Identity and Access Management (IAM) role to use as an OpenSearch Ingestion pipeline role.
- Update the data access policy attached to the OpenSearch Serverless collection.
- Create an OpenSearch Ingestion pipeline that copies data from one index to another. Alternatively, you can create an index template using the OpenSearch Ingestion pipeline to define explicit mapping, and then copy the data from the source index to the destination index with the defined mapping applied.
Prerequisites
To get started, you must have an active OpenSearch Serverless collection with an index that you want to reindex (copy). Refer to Creating collections to learn more about creating a collection.
When the collection is ready, note the following details:
- The endpoint of the OpenSearch Serverless collection
- The name of the index from which the documents need to be copied
- If the collection is defined as a VPC collection, note down the name of the network policy attached to the collection
You use these details in the ingestion pipeline configuration.
Create an IAM role to use as a pipeline role
An OpenSearch Ingestion pipeline needs certain permissions to pull data from the source and write to its sink. For this walkthrough, both the source and sink are the same, but if the source and sink collections are different, modify the policy accordingly.
Complete the following steps:
- Create an IAM policy (opensearch-ingestion-pipeline-policy) that provides permission to read and send data to the OpenSearch Serverless collection with least privileges. A sample policy is shown after this list (modify {account-id}, {region}, {collection-id}, and {collection-name} accordingly).
- Create an IAM role (opensearch-ingestion-pipeline-role) that the OpenSearch Ingestion pipeline will assume. While creating the role, use the policy you created (opensearch-ingestion-pipeline-policy). The role should have the trust relationship shown after this list (modify {account-id} and {region} accordingly).
- Record the ARN of the newly created IAM role (arn:aws:iam::111122223333:role/opensearch-ingestion-pipeline-role).
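The sample policy itself is not reproduced here, so the following is a minimal sketch assuming the OpenSearch Serverless actions an OpenSearch Ingestion pipeline role commonly needs (aoss:APIAccessAll for data plane access, aoss:BatchGetCollection to resolve the collection, and the security policy actions the service uses to manage network access); adjust it to your environment:

```json
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowDataPlaneAccess",
            "Effect": "Allow",
            "Action": "aoss:APIAccessAll",
            "Resource": "arn:aws:aoss:{region}:{account-id}:collection/{collection-id}"
        },
        {
            "Sid": "AllowDescribeCollection",
            "Effect": "Allow",
            "Action": "aoss:BatchGetCollection",
            "Resource": "arn:aws:aoss:{region}:{account-id}:collection/{collection-id}"
        },
        {
            "Sid": "AllowSecurityPolicyManagement",
            "Effect": "Allow",
            "Action": [
                "aoss:CreateSecurityPolicy",
                "aoss:GetSecurityPolicy",
                "aoss:UpdateSecurityPolicy"
            ],
            "Resource": "*",
            "Condition": {
                "StringEquals": {
                    "aoss:collection": "{collection-name}"
                }
            }
        }
    ]
}
```

The trust relationship is likewise not reproduced; it should allow the OpenSearch Ingestion service principal (osis-pipelines.amazonaws.com) to assume the role, scoped to your account, along these lines:

```json
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Service": "osis-pipelines.amazonaws.com"
            },
            "Action": "sts:AssumeRole",
            "Condition": {
                "StringEquals": {
                    "aws:SourceAccount": "{account-id}"
                },
                "ArnLike": {
                    "aws:SourceArn": "arn:aws:osis:{region}:{account-id}:pipeline/*"
                }
            }
        }
    ]
}
```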
Update the data access policy attached to the OpenSearch Serverless collection
After you create the IAM role, you need to update the data access policy attached to the OpenSearch Serverless collection. Data access policies control access to the OpenSearch operations that OpenSearch Serverless supports, such as PUT <index> or GET _cat/indices. To perform the update, complete the following steps:
- On the OpenSearch Service console, under Serverless in the navigation pane, choose Collections.
- From the list of the collections, choose your OpenSearch Serverless collection.
- On the Overview tab, in the Data access section, choose the associated policy.
- Choose Edit.
- Edit the policy in the JSON editor to add a JSON rule block for the pipeline role to the existing JSON (modify {account-id} and {collection-name} accordingly); a sample rule is shown after these steps.
You can also use the Visual editor method to choose Add another rule and add the same permissions for arn:aws:iam::{account-id}:role/opensearch-ingestion-pipeline-role.
- Choose Save.
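The rule block itself is not reproduced above; a minimal sketch, assuming the pipeline only needs index-level read and write access on the collection, could look like the following (add it alongside your existing rules):

```json
{
    "Rules": [
        {
            "Resource": ["index/{collection-name}/*"],
            "Permission": [
                "aoss:CreateIndex",
                "aoss:UpdateIndex",
                "aoss:DescribeIndex",
                "aoss:ReadDocument",
                "aoss:WriteDocument"
            ],
            "ResourceType": "index"
        }
    ],
    "Principal": ["arn:aws:iam::{account-id}:role/opensearch-ingestion-pipeline-role"]
}
```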
Now you have successfully allowed the OpenSearch Ingestion role to perform OpenSearch operations against the OpenSearch Serverless collection.
Create and configure the OpenSearch Ingestion pipeline to copy the data from one index to another
Complete the following steps:
- On the OpenSearch Service console, choose Pipelines under Ingestion in the navigation pane.
- Choose Create a pipeline.
- In Choose Blueprint, select OpenSearchDataMigrationPipeline.
- For Pipeline name, enter a name (for example, sample-ingestion-pipeline).
- For Pipeline capacity, you can define the minimum and maximum capacity used to scale the pipeline resources. For this walkthrough, use the default values of 2 Ingestion OCUs for Min capacity and 4 Ingestion OCUs for Max capacity. You can also choose different values; OpenSearch Ingestion automatically scales your pipeline capacity according to your estimated workload, based on the minimum and maximum Ingestion OpenSearch Compute Units (Ingestion OCUs) that you specify.
- Update the following information for the source:
  - Uncomment hosts and specify the endpoint of the existing OpenSearch Serverless collection that you noted as part of the prerequisites.
  - Uncomment include and index_name_regex, and specify the name of the index that will act as the source (in this demo, we're using logs-2024.03.01).
  - Uncomment region under aws and specify the AWS Region where your OpenSearch Serverless collection is (for example, us-east-1).
  - Uncomment sts_role_arn under aws and specify the role that has permission to read data from the OpenSearch Serverless collection (for example, arn:aws:iam::111122223333:role/opensearch-ingestion-pipeline-role). This is the same role that was added in the data access policy of the collection.
  - Update the serverless flag to true.
  - If the OpenSearch Serverless collection has VPC access, uncomment serverless_options and network_policy_name and specify the name of the network policy used for the collection.
  - Uncomment scheduling, interval, index_read_count, and start_time, and modify these parameters accordingly.
Using these parameters makes sure the OpenSearch Ingestion pipeline processes the indexes multiple times (to pick up new documents).
Note – If the collection specified in the sink is of the Time series or Vector search type, you can keep the scheduling, interval, index_read_count, and start_time parameters commented.
- Update the following information for the sink:
  - Uncomment hosts and specify the endpoint of the existing OpenSearch Serverless collection.
  - Uncomment sts_role_arn under aws and specify the role that has permission to write data into the OpenSearch Serverless collection (for example, arn:aws:iam::111122223333:role/opensearch-ingestion-pipeline-role). This is the same role that was added in the data access policy of the collection.
  - Update the serverless flag to true.
  - If the OpenSearch Serverless collection has VPC access, uncomment serverless_options and network_policy_name and specify the name of the network policy used for the collection.
  - Update the value for index and provide the index name to which you want to transfer the documents (for example, new-logs-2024.03.01).
  - For document_id, you can get the ID from the document metadata in the source and use the same in the target. However, custom document IDs are only supported for the Search type of collection. If your collection is of the Time series or Vector search type, comment out the document_id line.
  - (Optional) You can modify the values for the bucket, region, and sts_role_arn keys within the dlq section to capture any failed requests in an S3 bucket.
Note – If you configure a DLQ, you need to grant additional permissions to opensearch-ingestion-pipeline-role; refer to Writing to a dead-letter queue for the required changes. For this walkthrough, you won't set up a DLQ, so you can remove the entire dlq block.
With these changes in place, the complete pipeline configuration should look similar to the sketch shown after this list.
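The full blueprint is not reproduced in this post, so the following is an illustrative sketch of what the configuration might look like after these edits. The pipeline name, endpoints, index names, schedule values, and role ARN are placeholders, and the exact keys and defaults come from the OpenSearchDataMigrationPipeline blueprint you selected (including the document_id expression, which follows the pattern used by the migration blueprint):

```yaml
version: "2"
opensearch-migration-pipeline:
  source:
    opensearch:
      # Endpoint of the source OpenSearch Serverless collection (placeholder)
      hosts: [ "https://{collection-id}.{region}.aoss.amazonaws.com" ]
      indices:
        include:
          # Index to copy documents from
          - index_name_regex: "logs-2024.03.01"
      # Re-read the source index on a schedule to pick up new documents;
      # keep these commented for Time series or Vector search sink collections
      scheduling:
        interval: "PT2H"
        index_read_count: 3
        start_time: "2024-03-01T00:00:00"
      aws:
        region: "us-east-1"
        sts_role_arn: "arn:aws:iam::111122223333:role/opensearch-ingestion-pipeline-role"
        serverless: true
        # Uncomment if the collection has VPC access
        # serverless_options:
        #   network_policy_name: "{network-policy-name}"
  sink:
    - opensearch:
        # Endpoint of the destination OpenSearch Serverless collection (placeholder)
        hosts: [ "https://{collection-id}.{region}.aoss.amazonaws.com" ]
        # Index to copy the documents into
        index: "new-logs-2024.03.01"
        # Reuse the source document ID (Search collections only)
        document_id: '${getMetadata("opensearch-document_id")}'
        aws:
          region: "us-east-1"
          sts_role_arn: "arn:aws:iam::111122223333:role/opensearch-ingestion-pipeline-role"
          serverless: true
```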
- Choose Validate pipeline to validate the pipeline configuration.
- For Network settings, choose your preferred setting:
- Choose VPC access and select your VPC, subnet, and security group to set up the access privately. Choose this option if the OpenSearch Serverless collection has VPC access. AWS recommends using a VPC endpoint for all production workloads.
- Choose Public to use public access. For this walkthrough, we select Public because the collection is also accessible from the public network.
- For Log Publishing Option, you can either create a new Amazon CloudWatch log group or use an existing CloudWatch log group to write the ingestion logs. This provides access to information about errors and warnings raised during the operation, which can help during troubleshooting. For this walkthrough, choose Create new group.
- Choose Next, and verify the details you specified for your pipeline settings.
- Choose Create pipeline.
It will take a couple of minutes to create the ingestion pipeline. After the pipeline is created, you will see the documents in the destination index specified in the sink (for example, new-logs-2024.03.01). After all the documents are copied, you can validate the number of documents by using the count API, as shown in the example that follows.
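As an illustration, the following requests (run, for example, from the Dev Tools console in OpenSearch Dashboards for the collection) return the document counts of the source and destination indexes used in this walkthrough so you can compare them:

```
GET logs-2024.03.01/_count
GET new-logs-2024.03.01/_count
```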
When the process is complete, you have the option to stop or delete the pipeline. If you choose to keep the pipeline running, it will continue to copy new documents from the source index according to the defined schedule, if specified.
In this walkthrough, the endpoint defined in the hosts parameter under the source and sink of the pipeline configuration belonged to the same collection, which was of the Search type. If the collections are different, you need to modify the permissions for the IAM role (opensearch-ingestion-pipeline-role) to allow access to both collections. Additionally, make sure you update the data access policy for both collections to grant access to the OpenSearch Ingestion pipeline.
Create an index template using the OpenSearch Ingestion pipeline to define mapping
In OpenSearch, you can define how documents and their fields are stored and indexed by creating a mapping. The mapping specifies the list of fields for a document. Every field in the document has a field type, which defines the type of data the field contains. OpenSearch Service dynamically maps data types in each incoming document if an explicit mapping is not defined. However, you can use the template_type parameter with the index-template value and the template_content parameter with the JSON content of the index template in the pipeline configuration to define explicit mapping rules. You also need to set the index_type parameter to custom.
The following code shows an example of the sink portion of the pipeline and the usage of index_type, template_type, and template_content:
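The example itself is not reproduced above, so the following is a minimal sketch of such a sink; the endpoint and role ARN are placeholders, and the field names in template_content (message, status_code, timestamp) are hypothetical and only illustrate where your own mapping goes:

```yaml
  sink:
    - opensearch:
        hosts: [ "https://{collection-id}.{region}.aoss.amazonaws.com" ]
        index: "new-logs-2024.03.01"
        # Required when supplying your own index template
        index_type: custom
        template_type: index-template
        # Hypothetical mapping; replace with the fields of your documents
        template_content: |
          {
            "template": {
              "mappings": {
                "properties": {
                  "message": { "type": "text" },
                  "status_code": { "type": "integer" },
                  "timestamp": { "type": "date" }
                }
              }
            }
          }
        aws:
          region: "us-east-1"
          sts_role_arn: "arn:aws:iam::111122223333:role/opensearch-ingestion-pipeline-role"
          serverless: true
```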
Alternatively, you can create the index with the mapping in the collection before you start the pipeline.
If you want to create a template using an OpenSearch Ingestion pipeline, you need to provide the aoss:UpdateCollectionItems and aoss:DescribeCollectionItems permissions for the collection in the data access policy for the pipeline role (opensearch-ingestion-pipeline-role). The updated JSON block for the rule would look like the following:
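The updated block is not reproduced above; a sketch, extending the earlier index-level rule with collection-level permissions for template operations, might look like this:

```json
{
    "Rules": [
        {
            "Resource": ["collection/{collection-name}"],
            "Permission": [
                "aoss:UpdateCollectionItems",
                "aoss:DescribeCollectionItems"
            ],
            "ResourceType": "collection"
        },
        {
            "Resource": ["index/{collection-name}/*"],
            "Permission": [
                "aoss:CreateIndex",
                "aoss:UpdateIndex",
                "aoss:DescribeIndex",
                "aoss:ReadDocument",
                "aoss:WriteDocument"
            ],
            "ResourceType": "index"
        }
    ],
    "Principal": ["arn:aws:iam::{account-id}:role/opensearch-ingestion-pipeline-role"]
}
```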
Conclusion
In this post, we showed how to use an OpenSearch Ingestion pipeline to copy data from one index to another in an OpenSearch Serverless collection. OpenSearch Ingestion also allows you to perform transformation of data using various processors. AWS offers various resources to help you quickly start building pipelines using OpenSearch Ingestion. You can use built-in pipeline integrations to quickly ingest data from Amazon DynamoDB, Amazon Managed Streaming for Apache Kafka (Amazon MSK), Amazon Security Lake, Fluent Bit, and many more. You can also use OpenSearch Ingestion blueprints to build data pipelines with minimal configuration changes.
About the Authors
Utkarsh Agarwal is a Cloud Support Engineer in the Support Engineering team at Amazon Web Services. He specializes in Amazon OpenSearch Service. He provides guidance and technical assistance to customers thus enabling them to build scalable, highly available, and secure solutions in the AWS Cloud. In his free time, he enjoys watching movies, TV series, and of course, cricket. Lately, he has also been attempting to master the art of cooking in his free time – the taste buds are excited, but the kitchen might disagree.
Prashant Agrawal is a Sr. Search Specialist Solutions Architect with Amazon OpenSearch Service. He works closely with customers to help them migrate their workloads to the cloud and helps existing customers fine-tune their clusters to achieve better performance and save on cost. Before joining AWS, he helped various customers use OpenSearch and Elasticsearch for their search and log analytics use cases. When not working, you can find him traveling and exploring new places. In short, he likes doing Eat → Travel → Repeat.