Managing Amazon EBS volume throughput limits in Amazon OpenSearch Service domains
In this blog post, we discuss how Amazon Elastic Block Store (Amazon EBS) volume IOPS and throughput limits affect an Amazon OpenSearch Service domain, and how to prevent or mitigate throughput throttling.
Amazon OpenSearch Service is a managed service that makes it easy for you to perform website searches, interactive log analytics, real-time application monitoring, and more. Based on the open source OpenSearch suite, Amazon OpenSearch Service allows you to search, visualize, and analyze up to petabytes of text and unstructured data.
An OpenSearch Service domain primarily contains nodes with the following set of roles.
- Cluster manager (dedicated master): Responsible for managing the cluster and checking the health of the data nodes in the cluster.
- Data: Responsible for serving search and indexing requests and storing the indexed data.
- UltraWarm: Nodes that use Amazon S3 as a backing store to provide lower-cost storage.
If the OpenSearch Service data node storage is backed by Amazon EBS volumes, EBS throughput can heavily influence the performance of the OpenSearch Service domain, depending on your workload. EBS volume performance is defined by the following two key parameters.
- IOPS defines the number of IO operations performed per second.
- Throughput is a measure of how much data can be transferred in a given amount of time. It is usually measured in bytes per second.
Whenever the IOPS or throughput of a data node exceeds the maximum allowed by its EBS volume or by its EC2 instance, the OpenSearch Service domain experiences IOPS or throughput throttling. This can result in high search and indexing latency and, in the worst case, a node crash.
Maximum allowed IOPS and throughput for the data node
The maximum allowed IOPS or throughput for a data node in an OpenSearch Service domain is the minimum of the following two values.
- The maximum allowed IOPS or throughput of the Amazon EBS volume used by the data node.
- The maximum allowed IOPS or throughput of the EBS-optimized instance type of the data node.
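The rule above amounts to taking a minimum for each dimension. The following is a minimal Python sketch of that calculation. The names `Limits` and `effective_limits` are hypothetical, and the figures are placeholders for illustration, not published limits for any particular volume or instance type.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Limits:
    """IOPS and throughput ceiling for one component (hypothetical helper type)."""
    iops: int
    throughput_mibps: float


def effective_limits(volume: Limits, instance: Limits) -> Limits:
    """Return the ceiling the data node can actually reach.

    Each dimension is capped independently by whichever of the EBS volume
    or the EBS-optimized instance has the lower limit.
    """
    return Limits(
        iops=min(volume.iops, instance.iops),
        throughput_mibps=min(volume.throughput_mibps, instance.throughput_mibps),
    )


# Placeholder numbers for illustration only; look up the real limits for
# your volume type and instance type.
volume = Limits(iops=3000, throughput_mibps=250.0)
instance = Limits(iops=6000, throughput_mibps=125.0)
node = effective_limits(volume, instance)
print(node)
```

In this made-up case the volume caps IOPS while the instance caps throughput, which is why both limits need to be checked.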
Throughput throttling and its impact on an Amazon OpenSearch Service domain
Throughput throttling happens when the total EBS throughput on a data node exceeds the maximum allowed throughput value of that data node in the OpenSearch Service domain.
The ThroughputThrottle metric for the domain or node can be seen in the Amazon CloudWatch console at the following locations.
- Domain: “ES/OpenSearchService > Per-Domain, Per-Client Metrics”
- Node: “ES/OpenSearchService > ClientId, DomainName, NodeId”
A value of 1 in the ThroughputThrottle metric signifies a throttling event for the domain or node.
If a data node in the domain experiences throughput throttling for a sustained period, it can result in the following performance degradation for the data node.
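As a small illustration of reading this metric, the sketch below scans a series of datapoints and reports the periods in which throttling occurred. The function name `throttled_periods` and the `(timestamp, value)` tuple shape are assumptions made for this example; they are not a CloudWatch response format.

```python
def throttled_periods(datapoints):
    """Return the timestamps at which ThroughputThrottle reported 1.

    `datapoints` is assumed to be an iterable of (timestamp, value) pairs.
    """
    return [timestamp for timestamp, value in datapoints if value >= 1]


# Made-up sample: one datapoint per minute.
sample = [("00:00", 0.0), ("00:01", 1.0), ("00:02", 1.0), ("00:03", 0.0)]
hits = throttled_periods(sample)
throttled_fraction = len(hits) / len(sample)
print(hits, throttled_fraction)
```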
- Slower EBS volume performance.
- High read/write latency.
This can affect the health checks performed by the cluster manager or the data node, and can result in:
- Failure of the file system (FS) health check performed by the data node.
- Failure of the follower check performed by the cluster manager, due to high request latency.
As a result, the cluster manager marks such data nodes as unhealthy and removes them from the cluster. This can lead to a yellow or red cluster status.
Throughput value calculation
Total throughput for the data node is the total number of bytes read from and written to the EBS volume per second. The following metrics provide the read and write throughput for the data node in the Amazon OpenSearch Service domain.
- ReadThroughputMicroBursting: The throughput, in bytes per second, for read operations on EBS volumes when micro-bursting is taken into consideration.
- WriteThroughputMicroBursting: The throughput, in bytes per second, for write operations on EBS volumes when micro-bursting is taken into consideration.
Total throughput for the data node in the OpenSearch Service domain is calculated as follows.
Throughput = ReadThroughputMicroBursting + WriteThroughputMicroBursting
To get total throughput for the data node, follow these steps.
- Go to Amazon CloudWatch metrics.
- Go to ES/OpenSearchService > ClientId, DomainName, NodeId.
- Select the ReadThroughputMicroBursting and WriteThroughputMicroBursting metrics.
- Go to Graphed metrics.
- Use Add math and create formulas to sum ReadThroughputMicroBursting and WriteThroughputMicroBursting values.
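The same sum can also be requested programmatically with CloudWatch metric math. The following is a minimal sketch that only builds the query list for the `GetMetricData` API. It assumes the `AWS/ES` namespace and the ClientId, DomainName, and NodeId dimensions; the helper name `total_throughput_queries`, the query IDs, the `Maximum` statistic, and the 60-second period are illustrative choices, not recommendations from this post.

```python
def total_throughput_queries(client_id, domain_name, node_id, period=60):
    """Build GetMetricData queries that sum read and write throughput for one node."""
    dimensions = [
        {"Name": "ClientId", "Value": client_id},
        {"Name": "DomainName", "Value": domain_name},
        {"Name": "NodeId", "Value": node_id},
    ]

    def metric_query(query_id, metric_name):
        # Input metric for the math expression; not returned on its own.
        return {
            "Id": query_id,
            "MetricStat": {
                "Metric": {
                    "Namespace": "AWS/ES",
                    "MetricName": metric_name,
                    "Dimensions": dimensions,
                },
                "Period": period,
                "Stat": "Maximum",  # assumption: pick the statistic that suits you
            },
            "ReturnData": False,
        }

    return [
        metric_query("m_read", "ReadThroughputMicroBursting"),
        metric_query("m_write", "WriteThroughputMicroBursting"),
        {
            "Id": "total",
            "Expression": "m_read + m_write",
            "Label": "TotalThroughput",
            "ReturnData": True,
        },
    ]


# Placeholder identifiers for illustration.
queries = total_throughput_queries("123456789012", "my-domain", "node-1")

# With boto3 installed and credentials configured, the queries could be sent as:
#   import boto3
#   cloudwatch = boto3.client("cloudwatch")
#   cloudwatch.get_metric_data(MetricDataQueries=queries, StartTime=start, EndTime=end)
```

Only the `total` query returns data, so the response contains a single series that corresponds to the Throughput formula above.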
Handling throughput throttle
When the maximum allowed throughput limit is breached on a data node in an OpenSearch Service domain, a disk throughput throttle notification is sent to the AWS console. Throughput throttling on a data node can happen for various reasons, such as the following.
- A sudden increase in the index rate or search rate to the data node of the OpenSearch Service domain.
- A blue/green event happening on the OpenSearch Service domain during peak hours.
- The OpenSearch Service domain is under-scaled.
We suggest the following measures to prevent throughput throttling for the OpenSearch Service domain.
- Monitor the traffic to the OpenSearch Service domain and create alarms on the search and index traffic sent to the OpenSearch Service domain.
- Set up off-peak hours for the OpenSearch Service domain so that updates that lead to blue/green deployments are executed when there is less demand.
- Monitor the ThroughputThrottle cluster metrics for the OpenSearch Service domain.
- Monitor shard skewness for the OpenSearch Service domain. Shard skewness can lead to uneven load distribution of traffic to data nodes and can lead to hot nodes in the cluster, which can experience high index and search traffic that results in throttling.
- If you are hitting the EBS volume or EC2 instance throughput limits for the data node, you will need to scale up the OpenSearch Service domain to avoid throughput throttling. Check the limits provided by the EBS volumes and the Amazon EBS-optimized instances used by the data node, and scale up the OpenSearch cluster accordingly.
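One way to monitor the ThroughputThrottle metric is a CloudWatch alarm. The sketch below builds the parameters for the `PutMetricAlarm` API at the domain level. It assumes the `AWS/ES` namespace with the ClientId and DomainName dimensions; the helper name `throughput_throttle_alarm`, the alarm name, and the choice of five consecutive one-minute periods are illustrative and should be tuned to your workload.

```python
def throughput_throttle_alarm(client_id, domain_name, minutes=5):
    """Build PutMetricAlarm parameters that fire on sustained throttling."""
    return {
        "AlarmName": f"{domain_name}-throughput-throttle",  # hypothetical name
        "Namespace": "AWS/ES",
        "MetricName": "ThroughputThrottle",
        "Dimensions": [
            {"Name": "ClientId", "Value": client_id},
            {"Name": "DomainName", "Value": domain_name},
        ],
        "Statistic": "Maximum",
        "Period": 60,
        # Require throttling in every one of `minutes` consecutive periods
        # so that a single short burst does not trigger the alarm.
        "EvaluationPeriods": minutes,
        "Threshold": 1.0,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "TreatMissingData": "notBreaching",
    }


# Placeholder identifiers for illustration.
alarm = throughput_throttle_alarm("123456789012", "my-domain")

# With boto3 installed and credentials configured, the alarm could be created as:
#   import boto3
#   boto3.client("cloudwatch").put_metric_alarm(**alarm)
```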
Every scenario requires specific investigation and the appropriate measures to resolve it. Still, we suggest the following guidelines as part of a broader approach to handling throughput throttle.
- If high throughput is seen on a specific set of data nodes most of the time, shard skewness may be causing hot nodes. In such cases, resolving shard skewness will help the situation.
- If the OpenSearch Service domain is experiencing uneven traffic patterns, check for sudden bursts that result in throttling. In such scenarios, smoothing the traffic pattern can help.
- If throughput throttling is seen on most of the nodes on the cluster with consistent traffic patterns, scaling up of the OpenSearch Service domain should be considered.
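These guidelines can be read as a rough decision procedure. The sketch below encodes them in Python under stated assumptions: the input is the fraction of time each node spent throttled over some window, and the function name `triage` and both 0.5 thresholds are arbitrary values chosen for illustration. It is a starting point for investigation, not a substitute for it.

```python
def triage(throttle_fraction_by_node, hot=0.5, widespread=0.5):
    """Suggest a first step from per-node throttling fractions (0.0 to 1.0)."""
    if not throttle_fraction_by_node:
        return "no data"

    # Nodes that are throttled most of the time.
    hot_nodes = [
        node for node, fraction in throttle_fraction_by_node.items()
        if fraction >= hot
    ]

    if not hot_nodes:
        # Occasional throttling without persistently hot nodes points to bursts.
        if any(fraction > 0 for fraction in throttle_fraction_by_node.values()):
            return "check for traffic bursts"
        return "no action"

    if len(hot_nodes) / len(throttle_fraction_by_node) >= widespread:
        # Most nodes are throttled most of the time.
        return "consider scaling up"

    # Only a specific subset of nodes is persistently throttled.
    return "check shard skew"


print(triage({"n1": 0.9, "n2": 0.0, "n3": 0.0, "n4": 0.05}))
print(triage({"n1": 0.8, "n2": 0.7, "n3": 0.9}))
print(triage({"n1": 0.1, "n2": 0.05}))
```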
In this post, we covered Amazon EBS throughput throttling in an OpenSearch Service domain, its impact, and ways to monitor and handle it. We also provided suggestions that can be used to handle such throttling situations.