How can I improve the indexing performance on my Amazon Elasticsearch Service cluster?

Last updated: 2020-08-10

I want to optimize indexing operations in Amazon Elasticsearch Service (Amazon ES) for maximum ingestion throughput. How can I do this?

Resolution

Be sure that the shards for the index that you're ingesting into are evenly distributed across the data nodes

Use the following formula to confirm that the shards are evenly distributed:

Number of shards for index = k * (number of data nodes), where k is the number of shards per node

For example, if there are 24 shards in the index, and there are eight data nodes, Amazon ES assigns three shards to each node. For more information, see Get started with Amazon Elasticsearch Service: How many shards do I need?
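
The formula above can be checked with a few lines of shell arithmetic. This is a minimal sketch: the function name shards_per_node is hypothetical, and it only tests whether the shard count divides evenly across the data nodes. It does not query the cluster.

```shell
# Hypothetical helper: prints k (shards per node) when the shard count
# divides evenly across the data nodes, otherwise prints "uneven".
shards_per_node() {
  shards=$1
  nodes=$2
  if [ "$nodes" -le 0 ]; then
    echo "invalid node count"
    return 1
  fi
  if [ $((shards % nodes)) -eq 0 ]; then
    echo $((shards / nodes))
  else
    echo "uneven"
    return 1
  fi
}

# The example above: 24 shards on eight data nodes.
shards_per_node 24 8
```

To see where the shards of an index are actually placed, you can call the _cat/shards API operation (for example, curl 'es-endpoint/_cat/shards/index-name?v') and compare the node column against the expected count.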

Increase the refresh_interval to 60 seconds or more

Refresh your Amazon ES index to make your documents available for search. Note that refreshing your index requires the same resources that are used by indexing threads.

The default refresh interval is one second. When you increase the refresh interval, the data node makes fewer refresh calls, which leaves more resources for indexing threads. The trade-off is that newly indexed documents take longer to become searchable. To prevent 429 errors, it's a best practice to increase the refresh interval.

Note: The default refresh interval is one second for indices that received one or more search requests in the last 30 seconds. For more information about the updated default interval, see _refresh API version 7.x on the Elasticsearch website.
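
As an illustration, the refresh interval can be changed through the index _settings API operation. The sketch below assumes the same es-endpoint and index-name placeholders that are used elsewhere in this article. The function names refresh_body and set_refresh_interval are hypothetical helpers, not part of any tool.

```shell
# Hypothetical helper: builds the _settings body for a refresh interval.
refresh_body() {
  printf '{"index":{"refresh_interval":"%s"}}\n' "$1"
}

# Hypothetical wrapper around the _settings API operation.
# $1 = domain endpoint, $2 = index name, $3 = interval such as 60s
set_refresh_interval() {
  curl -s -XPUT "$1/$2/_settings" \
    -H 'Content-Type: application/json' \
    -d "$(refresh_body "$3")"
}

# Example call (not run here; replace the placeholders first):
#   set_refresh_interval "es-endpoint" "index-name" "60s"

# Print the request body that would be sent for a 60-second interval.
refresh_body 60s
```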

Change the replica count to zero

If you're anticipating heavy indexing, consider setting the index.number_of_replicas value to "0". Each replica duplicates the indexing process. As a result, disabling the replicas will improve your cluster performance. After the heavy indexing is complete, re-enable the replicas by restoring index.number_of_replicas to its previous value.

Important: If a node fails while replicas are disabled, you might lose data. Disable the replicas only if you can tolerate data loss for a short duration.
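
One possible way to script the replica change is shown below. This is a sketch under assumptions: es-endpoint and index-name are placeholders, replicas_body and set_replicas are hypothetical helper names, and the restored replica count of 1 is only an example value. Use the count that your index had before the change.

```shell
# Hypothetical helper: builds the _settings body for a replica count.
replicas_body() {
  printf '{"index":{"number_of_replicas":%d}}\n' "$1"
}

# Hypothetical wrapper around the _settings API operation.
# $1 = domain endpoint, $2 = index name, $3 = replica count
set_replicas() {
  curl -s -XPUT "$1/$2/_settings" \
    -H 'Content-Type: application/json' \
    -d "$(replicas_body "$3")"
}

# Before the heavy indexing job (not run here):
#   set_replicas "es-endpoint" "index-name" 0
# After the job completes, restore the previous count, for example 1:
#   set_replicas "es-endpoint" "index-name" 1

# Print the request body that disables replicas.
replicas_body 0
```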

Experiment to find the optimal bulk request size

Start with the bulk request size of 5 MiB to 15 MiB. Then, slowly increase the request size until the indexing performance stops improving. For more information, see Using and sizing bulk requests on the Elasticsearch website.

Note: Some instance types limit bulk requests to 10 MiB. For more information, see Network limits.
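
To experiment with different request sizes, it can help to split a bulk file into parts that stay under a chosen size. The following sketch uses awk and makes several assumptions: split_bulk is a hypothetical helper name, the input file contains only action and source line pairs (for example, index operations with no delete actions and no blank lines), and sizes are measured with awk's length function, which counts characters rather than bytes in some locales.

```shell
# Hypothetical helper: split a bulk (NDJSON) file into parts of at most
# MAX characters each, keeping every action line with its source line.
# Usage: split_bulk FILE MAX PREFIX
split_bulk() {
  awk -v max="$2" -v prefix="$3" '
    NR % 2 == 1 { action = $0; next }   # odd lines hold the action metadata
    {
      pair = action "\n" $0 "\n"
      len = length(pair)
      # Start a new part for the first pair, or when the limit would be exceeded.
      if (n == 0 || size + len > max) {
        if (out != "") close(out)
        n++
        size = 0
        out = sprintf("%s%03d.ndjson", prefix, n)
      }
      printf "%s", pair > out
      size += len
    }
  ' "$1"
}

# Example (not run here): write parts of about 5 MiB, then send one part.
#   split_bulk bulk.ndjson 5242880 part-
#   curl -s -XPOST "es-endpoint/_bulk" -H 'Content-Type: application/x-ndjson' \
#     --data-binary @part-001.ndjson
```

Rerun the split with a larger limit and compare the indexing rate for each run to find the point where throughput stops improving.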

Use an instance type that has SSD instance store volumes (such as I3)

I3 instances provide fast, local non-volatile memory express (NVMe) storage. I3 instances deliver better ingestion performance than instances that use General Purpose SSD (gp2) Amazon Elastic Block Store (Amazon EBS) volumes. For more information, see Run petabyte-scale clusters on Amazon Elasticsearch Service using I3 instances.

Reduce response size

To reduce the size of the Amazon ES response, use the filter_path parameter to exclude unnecessary fields. Be sure that you don't filter out any fields that are required to identify or retry failed requests. These fields can vary by client.

In the following example, the took field and the _index and _type fields of each index item are excluded from the response:

curl -X POST "es-endpoint/index-name/type-name/_bulk?pretty&filter_path=-took,-items.index._index,-items.index._type" -H 'Content-Type: application/json' -d'
{ "index" : { "_index" : "test2", "_id" : "1" } }
{ "user" : "testuser" }
{ "update" : {"_id" : "1", "_index" : "test2"} }
{ "doc" : {"user" : "example"} }
'

For more information, see Reducing response size.

Increase the value of index.translog.flush_threshold_size

By default, index.translog.flush_threshold_size is set to 512 MB. This means that the translog is flushed when it reaches 512 MB. The weight of the indexing load determines how often the translog is flushed. When you increase index.translog.flush_threshold_size, the node flushes the translog less frequently. Because Amazon ES flushes are resource-intensive operations, reducing the flush frequency improves indexing performance. By increasing the flush threshold size, the Elasticsearch cluster also creates a few large segments (instead of multiple small segments). Large segments merge less often, and more threads are used for indexing instead of merging.

Note: An increase in index.translog.flush_threshold_size can also increase the time that it takes for a translog flush to complete. If a shard fails, recovery will take longer, because the translog is larger.

Before increasing index.translog.flush_threshold_size, call the following API operation to get current flush operation statistics:

$ curl 'es-endpoint/index-name/_stats/flush?pretty'

Replace es-endpoint and index-name with your domain endpoint and index name.

In the output, note the number of flushes and the total time. The following example output shows that there are 124 flushes, which took 17,690 milliseconds:

"flush" : { "total" : 124, "total_time_in_millis" : 17690 }

To increase the flush threshold size, call the following API operation:

$ curl -XPUT 'es-endpoint/index-name/_settings?pretty' -H 'Content-Type: application/json' -d '{"index":{"translog.flush_threshold_size" : "1024MB"}}'

In this example, the flush threshold size is set to 1024 MB, which is ideal for instances with more than 32 GB of memory.

Note: Choose the appropriate threshold size for your Amazon ES domain.

Run the _stats API operation again to see whether the flush activity changed:

$ curl 'es-endpoint/index-name/_stats/flush?pretty' 
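
One way to compare the two readings is to compute the average time per flush from the total and total_time_in_millis fields. The helper below is a hypothetical sketch (avg_flush_ms is not part of any tool), and it expects you to copy the two numbers from the _stats output by hand.

```shell
# Hypothetical helper: average milliseconds per flush.
# $1 = flush.total, $2 = flush.total_time_in_millis
avg_flush_ms() {
  awk -v total="$1" -v ms="$2" 'BEGIN {
    if (total <= 0) { print "n/a"; exit }
    printf "%.1f\n", ms / total
  }'
}

# Values from the example output earlier in this section.
avg_flush_ms 124 17690
```

Because these counters accumulate over time, it's more informative to compare the difference between two readings taken over similar indexing windows than to compare the raw totals.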

Note: It's a best practice to increase the index.translog.flush_threshold_size only for the current index. After you confirm the outcome, apply the changes to the index template.

