AWS Big Data Blog

Migrate data into Amazon ES using remote reindex

September 8, 2021: Amazon Elasticsearch Service has been renamed to Amazon OpenSearch Service. See details.


Amazon OpenSearch Service recently launched support for remote reindexing. This feature adds the ability to copy data to an Amazon OpenSearch Service domain from self-managed Elasticsearch running on-premises, self-managed on Amazon Elastic Compute Cloud (Amazon EC2) on AWS, or another Amazon OpenSearch Service domain.

Remote reindex supports Elasticsearch 1.5 and higher for the remote Elasticsearch cluster and Amazon OpenSearch Service 6.7 and higher for the local domain.

The remote reindex feature migrates data from the remote cluster using the Elasticsearch scroll API and reindexes each document into the local Amazon OpenSearch Service domain.

In this post, we cover the following common use cases for using remote reindex to migrate data into an Amazon OpenSearch Service domain:

Use case 1: Copying from a self-managed Elasticsearch cluster using ELB

Our first use case has the following configuration:

  • Remote – Elasticsearch 1.5 or higher, self-hosted in AWS on Amazon EC2
  • Local – Amazon OpenSearch Service domain 6.7 or higher

The following diagram illustrates our architecture.

Before getting started, make sure you have the following prerequisites:

  • An Amazon EC2 server running in a public subnet with access to Amazon OpenSearch Service running in a private subnet within the same VPC
  • ELB running in a public subnet of the same VPC as the remote Elasticsearch cluster with a security group configured to allow inbound traffic on port 443 and listeners configured on port 443 to an Elasticsearch DNS endpoint

To copy your data, complete the following steps:

  1. Open the Kibana dashboard on the Amazon EC2 server to connect to the local Amazon OpenSearch Service domain (for example, https://vpc-abc123.us-east-1.es.amazonaws.com/_plugin/kibana).
  2. The connection to the local Amazon OpenSearch Service domain must be authorized to perform reindex operations. If the local domain is secured with basic authentication, it only needs a username and password. However, if it’s using fine-grained access control, the user performing reindex operations needs reindex privileges on the local Amazon OpenSearch Service domain and read privileges on the index in the remote Elasticsearch cluster.
  3. Run the POST reindex operation on the local Amazon OpenSearch Service domain using Kibana Dev Tools to reindex data from the remote Elasticsearch cluster. See the following code:
    POST _reindex/?pretty=true&scroll=10h&wait_for_completion=true
    {
      "source": {
        "remote": {
          "host": "https://<remote endpoint>:443",
          "username": "<username>",
          "password": "<password>",
          "socket_timeout": "30m"
        },
        "size": 1000,
        "index": "movies"
      },
      "dest": {
        "index": "movies"
      }
    }

You can perform the same reindex operation using curl commands:

curl -XPOST -u <username>:<password> "https://<local-domain-endpoint>/_reindex/?pretty=true&scroll=10h&wait_for_completion=false" -H 'Content-Type: application/json' \
-d '{"source": {"remote": {"host": "https://<remote endpoint>:443", "socket_timeout": "60m", "external": true, "username": "<username>", "password": "<password>"}, "size": 1000, "index": "movies"}, "dest": {"index": "movies"}}'
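The same request can also be assembled programmatically. The following is a minimal Python sketch using only the standard library, assuming basic authentication on the local domain; the function names `build_reindex_body` and `build_reindex_request` and the example host names are illustrative and not part of any AWS tooling. It builds the request without sending it, so you can inspect the body first.

```python
import json
import urllib.request
from base64 import b64encode


def build_reindex_body(remote_host, index, username, password,
                       size=1000, socket_timeout="30m", external=True):
    """Return the _reindex request body as a dict, mirroring the JSON shown earlier."""
    return {
        "source": {
            "remote": {
                "host": remote_host,
                "username": username,
                "password": password,
                "socket_timeout": socket_timeout,
                "external": external,
            },
            "size": size,
            "index": index,
        },
        "dest": {"index": index},
    }


def build_reindex_request(local_endpoint, body, username, password, scroll="10h"):
    """Build (but do not send) the POST request to the local domain."""
    url = (f"https://{local_endpoint}/_reindex/"
           f"?pretty=true&scroll={scroll}&wait_for_completion=false")
    # Basic authentication header, equivalent to curl's -u <username>:<password>.
    token = b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json",
                 "Authorization": f"Basic {token}"},
        method="POST",
    )


if __name__ == "__main__":
    # Hypothetical endpoints and credentials, for illustration only.
    body = build_reindex_body("https://remote.example.com:443", "movies", "user", "secret")
    req = build_reindex_request("vpc-abc123.us-east-1.es.amazonaws.com", body, "user", "secret")
    print(req.full_url)
    print(json.dumps(body, indent=2))
    # To actually send the request: urllib.request.urlopen(req)
```

Keeping the body construction separate from the HTTP call makes it easy to adjust `size`, `socket_timeout`, or `scroll` per index without retyping the JSON.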

Check the progress of index migration on the local Amazon OpenSearch Service domain using the following command:

GET <local-domain-endpoint>/movies/_search

In the preceding code, you copy the movies index from the remote Elasticsearch cluster to the local Amazon OpenSearch Service domain. The remote reindex operation sends a scroll request to the remote domain with the following default values:

  • Search context of 5 minutes
  • Socket timeout of 30 seconds
  • Batch size of 1,000

Refer to the Performance improvements section later in this post for information about tuning these values.

We use the external flag to indicate that the index is hosted outside of Amazon OpenSearch Service.
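In addition to searching the destination index, a reindex started with wait_for_completion=false returns a task ID that can be looked up through the Elasticsearch Tasks API (GET _tasks/<task_id>). The sketch below assumes the task's status object carries the usual total, created, updated, and deleted counters and turns them into a completion fraction; the function name `reindex_progress` is illustrative, and whether the Tasks API is reachable on your domain version is an assumption to verify.

```python
def reindex_progress(task_status):
    """Return completion as a fraction in [0, 1] from a reindex task status dict.

    `task_status` is assumed to be the "status" object nested under "task"
    in the Tasks API response.
    """
    total = task_status.get("total", 0)
    if total <= 0:
        # Total is unknown until the first scroll response arrives.
        return 0.0
    done = sum(task_status.get(key, 0) for key in ("created", "updated", "deleted"))
    return min(done / total, 1.0)


if __name__ == "__main__":
    sample = {"total": 1000, "created": 250, "updated": 0, "deleted": 0}
    print(f"{reindex_progress(sample):.0%}")  # prints "25%"
```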

Use case 2: Copying from self-managed Elasticsearch using an NGINX proxy server

Our second use case has the following configuration:

  • Remote – Self-hosted Elasticsearch on premises
  • Local – Amazon OpenSearch Service domain version 6.7 or higher

The following diagram illustrates our architecture.

Make sure you have the following prerequisites:

  • Amazon OpenSearch Service version 6.7 or higher with software release version R2020117, running in an Amazon VPC
  • An Amazon EC2 server running in a public subnet with network connectivity to the local Elasticsearch cluster
  • Public internet connectivity to an NGINX reverse proxy server that can connect to the remote Elasticsearch cluster

Connectivity to the remote cluster is secured using TLS encryption, so you need a certificate signed by a public certificate authority. For instructions on configuring security credentials for NGINX, see Update: Using Free Let’s Encrypt SSL/TLS Certificates with NGINX. If you’re generating certificates for external domains, see Manual for additional options.

To properly route the reindex requests, you need to modify the NGINX reverse proxy default configuration file, default.conf, in the /etc/nginx/conf.d directory. Update the following key variables:

  • /etc/nginx/cert.crt
  • /etc/nginx/cert.key
  • $ES_endpoint

For more information, see Using a Proxy to Access Amazon OpenSearch Service from Kibana.

After you establish connectivity with the remote Elasticsearch cluster, you can run the reindex operation as outlined in the first use case. Be sure to change the host argument to the DNS name of the publicly accessible NGINX reverse proxy.
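Before starting the reindex, it can help to confirm that the proxy presents a certificate that chains to a public certificate authority, because the local domain needs to trust it. The following Python sketch uses the standard `ssl` module with the system's default CA bundle; the function names are illustrative, and the system trust store is only an approximation of what the service itself trusts.

```python
import socket
import ssl


def public_ca_context():
    """Return a TLS context that verifies the chain and host name against the system CA bundle."""
    return ssl.create_default_context()


def check_proxy_certificate(host, port=443, timeout=10.0):
    """Open a TLS connection and return the peer certificate's subject fields.

    Raises ssl.SSLCertVerificationError if the chain is not trusted or the
    certificate does not match `host`.
    """
    context = public_ca_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            cert = tls.getpeercert()
            # "subject" is a tuple of RDNs, each a tuple of (name, value) pairs.
            return dict(rdn[0] for rdn in cert.get("subject", ()))


# Example (requires network access; the host name is hypothetical):
# print(check_proxy_certificate("proxy.example.com"))
```

A self-signed certificate fails this check with a verification error, which is a reasonable signal that the reindex request is likely to fail at the TLS handshake as well.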

Use case 3: Copying from a public Amazon OpenSearch Service domain using IAM credentials

Our next use case has the following configuration:

  • Remote – Publicly accessible Amazon OpenSearch Service domain version 1.5 or higher
  • Local – Amazon OpenSearch Service domain version 6.7 or higher

The following diagram illustrates our architecture.

To copy the data, complete the following steps:

  1. Create an IAM user that has been granted access to both the local and remote Amazon OpenSearch Service domains. The following code is an example access policy:
    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Effect": "Allow",
          "Principal": {
            "AWS": "<arn of user>"
          },
          "Action": "es:*",
          "Resource": "<arn of Amazon OpenSearch Service domain>/<index name>/*"
        }
      ]
    }
  2. On the IAM console, create an access key for the specified user.
  3. Record the access key ID and secret in a secure location.
  4. Make the call to the _reindex request using IAM credentials to sign the request using the Signature Version 4 (SigV4) signing process.
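To make the signing step concrete, the following is a minimal Python sketch of the SigV4 computation using only the standard library. It signs just the host and x-amz-date headers and assumes long-term IAM user credentials (no session token); the function name `sigv4_headers` and the example values are illustrative. In practice, an AWS SDK signer is the more robust choice; this sketch only shows what the signer does.

```python
import datetime
import hashlib
import hmac
from urllib.parse import parse_qsl, quote, urlsplit


def _hmac_sha256(key, message):
    return hmac.new(key, message.encode("utf-8"), hashlib.sha256).digest()


def sigv4_headers(method, url, body, access_key, secret_key, region,
                  service="es", now=None):
    """Return the headers needed to send a SigV4-signed request."""
    now = now or datetime.datetime.now(datetime.timezone.utc)
    amz_date = now.strftime("%Y%m%dT%H%M%SZ")
    date_stamp = now.strftime("%Y%m%d")

    # Step 1: canonical request (URI, sorted query string, headers, payload hash).
    parts = urlsplit(url)
    host = parts.netloc.lower()
    canonical_uri = quote(parts.path or "/", safe="/-_.~")
    params = sorted(parse_qsl(parts.query, keep_blank_values=True))
    canonical_query = "&".join(
        f"{quote(k, safe='-_.~')}={quote(v, safe='-_.~')}" for k, v in params
    )
    payload_hash = hashlib.sha256(body.encode("utf-8")).hexdigest()
    canonical_headers = f"host:{host}\nx-amz-date:{amz_date}\n"
    signed_headers = "host;x-amz-date"
    canonical_request = "\n".join([
        method, canonical_uri, canonical_query,
        canonical_headers, signed_headers, payload_hash,
    ])

    # Step 2: string to sign, scoped to date, Region, and service.
    scope = f"{date_stamp}/{region}/{service}/aws4_request"
    string_to_sign = "\n".join([
        "AWS4-HMAC-SHA256", amz_date, scope,
        hashlib.sha256(canonical_request.encode("utf-8")).hexdigest(),
    ])

    # Step 3: derive the signing key and compute the signature.
    key = _hmac_sha256(("AWS4" + secret_key).encode("utf-8"), date_stamp)
    for part in (region, service, "aws4_request"):
        key = _hmac_sha256(key, part)
    signature = hmac.new(key, string_to_sign.encode("utf-8"), hashlib.sha256).hexdigest()

    authorization = (
        f"AWS4-HMAC-SHA256 Credential={access_key}/{scope}, "
        f"SignedHeaders={signed_headers}, Signature={signature}"
    )
    return {"Host": host, "X-Amz-Date": amz_date, "Authorization": authorization}
```

The returned headers are added to the POST request against the local domain's _reindex endpoint. The body passed to the function must be byte-for-byte identical to the body that is sent, because its hash is part of the signature.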

To simplify the signing process, you can use the Postman application and the AWS Signature authorization type.

  1. Launch Postman.
  2. Enter the endpoint URL of the local Amazon OpenSearch Service domain in the address bar, followed by
    /_reindex/?pretty=true&scroll=10h&wait_for_completion=true.
  3. On the HTTP method drop-down menu, choose POST.
  4. On the Authorization tab, for the authorization type, choose AWS Signature.
  5. For AccessKey, enter your IAM user’s access key ID.
  6. For SecretKey, enter your IAM user’s secret key.
  7. Specify the AWS Region that matches the Region of your Amazon OpenSearch Service domain.
  8. For Service Name, enter es.
  9. On the Body tab, select raw.
  10. Enter the source and destination JSON as shown in use case 1, but set "external": false.
  11. Choose Send.

You can check the progress using Kibana on the local Amazon OpenSearch Service domain through Dev Tools by issuing a search on the copied index, similar to use case 1.

Use case 4: Copying from an Amazon OpenSearch Service domain in the same VPC using IAM credentials

Our final use case has the following configuration:

  • Remote – Amazon OpenSearch Service domain with VPC access version 1.5 or higher
  • Local – Amazon OpenSearch Service domain with VPC access version 6.7 or higher

The following diagram illustrates our architecture.

Every Amazon OpenSearch Service domain is made up of its own internal VPC infrastructure. When you create a new Amazon OpenSearch Service domain in an existing VPC, an Elastic Network Interface (ENI) is created for each data node in the Amazon OpenSearch Service VPC. Because the remote reindex operation is performed from the local Amazon OpenSearch Service domain, and therefore within its own private VPC, it can’t directly access the remote Amazon OpenSearch Service domain’s VPC. Instead, you need a publicly accessible reverse proxy.

To copy the data, complete the following steps:

  1. Create an IAM user that has been granted access to both the local and remote Amazon OpenSearch Service domains. The following code is an example access policy:
    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Effect": "Allow",
          "Principal": {
            "AWS": "<arn of user>"
          },
          "Action": "es:*",
          "Resource": "<arn of Amazon OpenSearch Service domain>/<index name>/*"
        }
      ]
    }
  2. On the IAM console, create an access key for the specified user.
  3. Record the access key ID and secret in a secure location.
  4. Set up an EC2 instance with an NGINX reverse proxy for the remote Amazon OpenSearch Service VPC endpoint as outlined in use case 2.

This EC2 instance must be within the same VPC as the Amazon OpenSearch Service domain. Because you’re signing your requests, make sure that the NGINX configuration contains the following:

proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
  5. Use a machine in the same VPC as the Amazon OpenSearch Service domain (either a running EC2 instance, or a local machine connected via VPN) to make the call to the _reindex request using IAM credentials to sign the request using the Signature Version 4 (SigV4) signing process.

To simplify the signing process, you can use the Postman application and the AWS Signature authorization type.

  1. Launch Postman.
  2. Enter the endpoint URL of the local Amazon OpenSearch Service domain in the address bar, followed by
    /_reindex/?pretty=true&scroll=10h&wait_for_completion=true.
  3. On the HTTP method drop-down menu, choose POST.
  4. On the Authorization tab, for the authorization type, choose AWS Signature.
  5. For AccessKey, enter your IAM user’s access key ID.
  6. For SecretKey, enter your IAM user’s secret key.
  7. Specify the AWS Region that matches the Region of your Amazon OpenSearch Service domain.
  8. For Service Name, enter es.
  9. On the Body tab, select raw.
  10. Enter the source and destination JSON as shown in use case 1, but set "external": false.
  11. For the host argument in "source", use the externally accessible URL for the NGINX reverse proxy.
  12. Choose Send.

You can check the progress using Kibana on the local Amazon OpenSearch Service domain through Dev Tools by issuing a search on the copied index, similar to use case 1.

Performance improvements

The remote reindex operation allows you to modify local index settings before data copy, for example adjusting the number of primary shards. A best practice is to create an index with the required settings on your local domain before starting the reindexing operation.

To speed up the reindex operation, disable the refresh interval and set the replica count to 0 using the following settings:

PUT movies/_settings
{
  "refresh_interval" : "-1",
  "number_of_replicas" : 0
}

When the reindex operation is complete, adjust the replicas count and refresh interval to your desired settings.
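The disable-then-restore pattern can be captured in a small helper so that the restore step isn't forgotten. The following Python sketch only builds the two settings payloads; the function names are illustrative, and the restore defaults of "1s" and one replica are the usual Elasticsearch defaults, assumed here rather than taken from your index.

```python
import json


def bulk_load_settings():
    """Settings to apply before reindexing: no periodic refresh, no replicas."""
    return {"refresh_interval": "-1", "number_of_replicas": 0}


def restore_settings(refresh_interval="1s", number_of_replicas=1):
    """Settings to apply after reindexing (defaults are assumptions)."""
    return {"refresh_interval": refresh_interval,
            "number_of_replicas": number_of_replicas}


def settings_request(index, settings):
    """Return (method, path, json_body) for a PUT <index>/_settings call."""
    return "PUT", f"/{index}/_settings", json.dumps(settings)


if __name__ == "__main__":
    print(settings_request("movies", bulk_load_settings()))
    print(settings_request("movies", restore_settings(number_of_replicas=2)))
```

Recording the original values before changing them, and passing them to `restore_settings`, avoids silently resetting an index to defaults it never used.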

The reindex operation additionally provides options such as copying only a subset of documents, copying unique documents, or even combining one or more indexes. In the following example code, the remote reindex operation copies only the documents in the kibana_sample_data_ecommerce index whose currency field has the value EUR:

POST _reindex/?pretty=true&scroll=10h&wait_for_completion=true
{
  "source": {
    "remote": {
      "host": "https://<remote endpoint>:443",
      "username": "<username>",
      "password": "<password>",
      "socket_timeout": "30m"
    },
    "query": {
      "bool": {
        "filter": {
          "term": {
            "currency": "EUR"
          }
        }
      }
    },
    "size": 10000,
    "index": "kibana_sample_data_ecommerce"
  },
  "dest": {
    "index": "kibana_sample_data_ecommerce"
  }
}

For more information about the available reindexing options, see Reindex data.

The local cluster pulls the data from the remote cluster using scroll queries. Depending on the dataset, you need to set the duration for which the scroll context remains valid on the remote cluster. To make sure the remote reindex operation doesn’t time out while dealing with large datasets, set the scroll value higher (10–36 hours).

size determines the batch size for every single scroll call, and its value is dependent on the nature of the data and the cluster configuration. Initially set it to a lower value (such as 100), and increase it only if it improves performance.

socket_timeout is the maximum period of inactivity supported on the HTTP connection between the local and remote cluster. Basically, the local cluster gets the data in batches using the scroll query and triggers a bulk call. If there are too many pending bulk requests, it waits before fetching the next batch of documents. If the wait is higher than the configured socket timeout, the reindex fails. We recommend setting a higher timeout value (1–2 hours) to prevent failures.
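To illustrate how size relates to the number of round trips, the following Python sketch models the pull-and-bulk loop in memory. It is a conceptual model only, not the service's implementation: each batch stands in for one scroll page followed by one bulk request, and the function names are illustrative.

```python
def scroll_batches(documents, size):
    """Yield successive batches of at most `size` documents, like scroll pages."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(documents), size):
        yield documents[start:start + size]


def simulate_reindex(documents, size):
    """Copy documents batch by batch; return (destination, number_of_bulk_calls)."""
    destination = []
    bulk_calls = 0
    for batch in scroll_batches(documents, size):
        destination.extend(batch)  # stands in for one bulk request
        bulk_calls += 1
    return destination, bulk_calls


if __name__ == "__main__":
    docs = list(range(2500))
    for size in (100, 1000):
        _, calls = simulate_reindex(docs, size)
        print(size, calls)  # prints "100 25" then "1000 3"
```

A larger size means fewer round trips but larger bulk requests on the local domain, which is why the value is worth tuning empirically. In the real operation, the time between two consecutive fetches is what has to stay below socket_timeout.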

Limitations

Keep in mind the following limitations when using remote reindex:

  • As of this writing, the remote reindex operation doesn’t support scroll slicing, which allows multiple scroll operations for the same request to run in parallel. The operation is only as fast as an index operation with a single client connection.
  • You can’t restart the task if a failure occurs. If the node performing the operation dies, you have to re-trigger the reindex operation.
  • The remote reindex operation simply copies a snapshot of the index at that particular time. In situations where the indexes are continuously being updated on the remote cluster, repeat the reindex operation to sync data between the two clusters.

Conclusion

In this post, we covered how to use the remote reindex operation in Amazon OpenSearch Service to copy index data from your remote cluster into an Amazon OpenSearch Service domain. We also looked at several performance tuning options available with the reindex operation.

If you have questions or suggestions, please leave a comment.


About the authors

Ryan Peterson is a Senior Solutions Architect at Amazon Web Services based in Irvine, CA. Ryan works closely with the Amazon CloudSearch and Amazon Elasticsearch Service teams, providing help and guidance to a broad range of customers that have search workloads they want to move to the AWS Cloud.

Viral Shah is a Senior Solutions Architect with the AWS Data Lab team based out of New York, NY. He has over 20 years of experience working with enterprise customers and startups primarily in the data and database space. He loves to travel and spend quality time with his family.