Configure Amazon OpenSearch Service for high availability
Amazon OpenSearch Service is a fully open-source search and analytics engine that securely unlocks real-time search, monitoring, and analysis of business and operational data for use cases like recommendation engines, ecommerce sites, and catalog search. To be successful in your business, you need your systems to be highly available and performant, minimizing downtime and avoiding failure. When you use OpenSearch Service as your primary means of monitoring your infrastructure, you need to ensure its availability as well. Downtime for OpenSearch Service can have a significant effect on your business outcomes, such as loss of revenue, loss in productivity, loss in brand value, and more.
The industry standard for measuring availability is the class of nines. When you follow best practices, OpenSearch Service provides three 9s (99.9%) of availability, which means it guarantees less than 43.83 minutes of downtime a month. In this post, you learn how to configure your OpenSearch Service domain for high availability and performance by following best practices and recommendations while setting up your domain.
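The 43.83-minute figure can be reproduced with a short calculation. The following is a minimal sketch that assumes an average month of 365.25 / 12 days; the function name is illustrative and not part of any AWS tooling.

```python
# Average month length in minutes, assuming a 365.25-day year.
MINUTES_PER_MONTH = 365.25 * 24 * 60 / 12


def max_downtime_minutes(availability_pct: float) -> float:
    """Return the monthly downtime budget implied by an availability percentage."""
    return MINUTES_PER_MONTH * (1 - availability_pct / 100)


if __name__ == "__main__":
    # Compare two, three, and four 9s of availability.
    for pct in (99.0, 99.9, 99.99):
        print(f"{pct}% availability -> {max_downtime_minutes(pct):.2f} minutes/month")
```

Each additional 9 cuts the downtime budget by a factor of ten.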
There are two essential elements that influence your domain’s availability: the resource utilization of your domain, which is mostly driven by your workload, and external events such as infrastructure failures. Although the former can be controlled through continuous monitoring of the domain’s performance and health and scaling the domain accordingly, the latter cannot. To mitigate the impact of external events such as an Availability Zone outage, instance or disk failure, or networking issues on your domain, you must provision additional capacity, distributed over multiple Availability Zones, and keep multiple copies of data. Failure to do so may result in degraded performance, unavailability, and, in the worst-case situation, data loss.
Let’s look at the options available to you to ensure that your domain is available and performant.
This section covers the configuration options for setting up your cluster properly: specifying the number of Availability Zones for the deployment, setting up the cluster manager (master) and data nodes, and setting up indexes and shards.
Data nodes are responsible for processing indexing and search requests in your domain. Deploying your data nodes across multiple Availability Zones improves the availability of your domain by adding redundant, per-zone data storage and processing. With a Multi-AZ deployment, your domain can remain available even when a full Availability Zone becomes unavailable. For production workloads, AWS recommends using three Availability Zones for your domain. In Regions that support only two Availability Zones, use both for improved availability. This ensures that your domain remains available if a single Availability Zone fails.
Dedicated cluster manager (master) nodes
AWS recommends using three dedicated cluster manager (CM) nodes for all production workloads. CM nodes track the cluster’s health, the state and location of its indexes and shards, the mappings for all the indexes, and the availability of its data nodes, and they maintain a list of cluster-level tasks in process. Without dedicated CM nodes, the cluster uses data nodes for these duties, which makes the cluster vulnerable to workload demands. You should size CM nodes based on the size of the task: primarily the data node count, the index count, and the shard count. OpenSearch Service always deploys CM nodes across three Availability Zones when the Region supports it. In Regions with only two Availability Zones, it places two CM nodes in one Availability Zone and one in the other. For a running domain, only one of the three CM nodes works as the elected leader. The other two CM nodes participate in an election if the elected CM node fails.
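The Multi-AZ and dedicated CM recommendations above can be captured in a single cluster configuration. The following sketch builds a dictionary in the shape that, to our understanding, the boto3 `opensearch` client accepts as the `ClusterConfig` argument of `create_domain`. Verify the field names against the current API reference before relying on them. The instance types and node count are placeholders, not recommendations.

```python
def build_cluster_config(
    data_node_type: str,
    data_node_count: int,
    az_count: int = 3,
    cm_node_type: str = "m5.large.search",
) -> dict:
    """Build a Multi-AZ cluster configuration with three dedicated CM nodes."""
    if az_count not in (2, 3):
        raise ValueError("A Multi-AZ deployment uses two or three Availability Zones")
    if data_node_count < az_count:
        raise ValueError("Need at least one data node per Availability Zone")
    return {
        "InstanceType": data_node_type,
        "InstanceCount": data_node_count,
        # Spread data nodes across Availability Zones.
        "ZoneAwarenessEnabled": True,
        "ZoneAwarenessConfig": {"AvailabilityZoneCount": az_count},
        # Three dedicated cluster manager nodes, as recommended for production.
        "DedicatedMasterEnabled": True,
        "DedicatedMasterType": cm_node_type,
        "DedicatedMasterCount": 3,
    }


if __name__ == "__main__":
    # Placeholder values for illustration only.
    config = build_cluster_config("r6g.large.search", data_node_count=6)
    print(config)
    # A real call would look roughly like this (requires boto3 and AWS credentials):
    # boto3.client("opensearch").create_domain(DomainName="my-domain", ClusterConfig=config)
```

The function only assembles the configuration, so you can review or unit test it before any call reaches AWS.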
The following table shows AWS’s recommendations for CM sizing. CM nodes do work based on the number of nodes, indexes, shards, and mapping. The more work, the more compute and memory you need to hold and work with the cluster state.
| Instance Count | Cluster Manager Node RAM Size | Maximum Supported Shard Count | Recommended Minimum Dedicated Cluster Manager Instance Type |
|---|---|---|---|
| 1–10 | 8 GiB | 10,000 | m5.large.search or m6g.large.search |
| 11–30 | 16 GiB | 30,000 | c5.2xlarge.search or c6g.2xlarge.search |
| 31–75 | 32 GiB | 40,000 | c5.4xlarge.search or c6g.4xlarge.search |
| 76–125 | 64 GiB | 75,000 | r5.2xlarge.search or r6g.2xlarge.search |
| 126–200 | 128 GiB | 75,000 | r5.4xlarge.search or r6g.4xlarge.search |
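The sizing table can also be expressed as a lookup. The following sketch picks the first row whose limits cover both your instance count and your shard count. The function and field names are illustrative, and the data is copied from the table above.

```python
# (max instance count, CM RAM, max shard count, recommended minimum instance types)
CM_SIZING = [
    (10, "8 GiB", 10_000, ("m5.large.search", "m6g.large.search")),
    (30, "16 GiB", 30_000, ("c5.2xlarge.search", "c6g.2xlarge.search")),
    (75, "32 GiB", 40_000, ("c5.4xlarge.search", "c6g.4xlarge.search")),
    (125, "64 GiB", 75_000, ("r5.2xlarge.search", "r6g.2xlarge.search")),
    (200, "128 GiB", 75_000, ("r5.4xlarge.search", "r6g.4xlarge.search")),
]


def recommend_cm(instance_count: int, shard_count: int = 0) -> dict:
    """Return the smallest table row that covers the instance and shard counts."""
    if instance_count < 1 or shard_count < 0:
        raise ValueError("Counts must be positive")
    for max_instances, ram, max_shards, types in CM_SIZING:
        if instance_count <= max_instances and shard_count <= max_shards:
            return {"ram": ram, "max_shards": max_shards, "instance_types": types}
    raise ValueError("Workload is beyond the range covered by the sizing table")


if __name__ == "__main__":
    print(recommend_cm(5))
    # A small cluster with many shards still needs a larger CM node.
    print(recommend_cm(20, shard_count=35_000))
```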
Indexes and shards
An index is a logical construct that houses a collection of documents. You partition your index for parallel processing by specifying a primary shard count, where shards represent a physical unit for storing and processing data. In OpenSearch Service, a shard can be either a primary shard or a replica shard. You use replicas for durability (if the primary shard is lost, OpenSearch Service promotes one of the replicas to primary) and for improving search throughput. OpenSearch Service ensures that the primary and replica shards are placed on different nodes and, if the domain is deployed in more than one Availability Zone, in different Availability Zones. For high availability, AWS recommends configuring at least two replicas for each index in a three-zone setup to avoid disruption in performance and availability. In a Multi-AZ setup, if a node fails or, in the rare worst case, an Availability Zone fails, you will still have a copy of the data.
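As an illustration of the replica recommendation, the following sketch builds the settings body for an index with two replicas, using the standard `number_of_shards` and `number_of_replicas` index settings. The primary shard count and helper names are hypothetical.

```python
import json


def index_settings(primary_shards: int, replicas: int = 2) -> dict:
    """Build an index-creation body with the given primary and replica counts."""
    if primary_shards < 1 or replicas < 0:
        raise ValueError("Need at least one primary shard and zero or more replicas")
    return {
        "settings": {
            "index": {
                "number_of_shards": primary_shards,
                # Two replicas give one copy per zone in a three-zone domain.
                "number_of_replicas": replicas,
            }
        }
    }


def total_shard_copies(primary_shards: int, replicas: int) -> int:
    """Each primary shard plus all of its replicas counts toward cluster shard totals."""
    return primary_shards * (1 + replicas)


if __name__ == "__main__":
    body = index_settings(primary_shards=6)
    print(json.dumps(body, indent=2))
    print("Total shards to place:", total_shard_copies(6, 2))
```

You would send this body in a `PUT` request to the index name of your choice. Note that every replica multiplies the number of shards the domain has to hold.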
Cluster monitoring and management
As discussed earlier, selecting your configuration based on best practices is only half the job. We also need to continuously monitor the resource utilization and performance to determine if the domain needs to be scaled. An under-provisioned or over-utilized domain can result in performance degradation and eventually unavailability.
You use the CPU in your domain to run your workload. As a general rule, you should target 60% average CPU utilization for any data node, with peaks at 80%, and tolerate small spikes to 100%. When you consider availability, and especially the unavailability of a full zone, there are two scenarios. If you have two Availability Zones, each zone handles 50% of the traffic. If a zone becomes unavailable, the other zone takes all of that traffic, doubling its CPU utilization. In that case, you need to be at around 30–40% average CPU utilization in each zone to maintain availability. If you are running three Availability Zones, each zone takes about 33% of the traffic. If a zone becomes unavailable, each remaining zone picks up half of the failed zone’s share, which is approximately 17% of total traffic, so its load grows by about half. In this case, you should target 50–60% average CPU utilization.
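The CPU targets above follow from simple arithmetic. The following sketch assumes that traffic is spread evenly across zones and that CPU utilization scales linearly with traffic. The 80% post-failure ceiling is taken from the peak target mentioned in the text. All of these are simplifying assumptions.

```python
def utilization_after_zone_loss(avg_cpu_pct: float, az_count: int) -> float:
    """Estimate per-zone CPU after one zone fails and the rest absorb its traffic."""
    if az_count < 2:
        raise ValueError("Need at least two Availability Zones to survive a zone loss")
    return avg_cpu_pct * az_count / (az_count - 1)


def max_safe_utilization(az_count: int, post_failure_ceiling_pct: float = 80.0) -> float:
    """Highest steady-state CPU that stays under the ceiling after losing one zone."""
    if az_count < 2:
        raise ValueError("Need at least two Availability Zones to survive a zone loss")
    return post_failure_ceiling_pct * (az_count - 1) / az_count


if __name__ == "__main__":
    for zones in (2, 3):
        target = max_safe_utilization(zones)
        print(f"{zones} zones: keep average CPU at or below {target:.1f}%")
```

Under these assumptions, the result is 40% for two zones and about 53% for three zones, which is consistent with the ranges given above.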
OpenSearch Service supports two types of garbage collection. The first is G1 garbage collection (G1GC), which is used by OpenSearch Service nodes powered by AWS Graviton2. The second is Concurrent Mark Sweep (CMS), which is used by all nodes powered by other processors. Out of all the memory allocated to a node, half of the memory (up to 32 GB) is assigned to the Java heap, and the rest is used by other operating system tasks, the file system cache, and so on. To maintain availability for a domain, we recommend keeping the maximum JVM utilization at around 80% with CMS and 95% with G1GC. Anything beyond that would impact the availability of your domain and make your cluster unhealthy. We also recommend enabling Auto-Tune, which actively monitors the memory utilization and triggers the garbage collector.
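The heap rule and the two utilization thresholds can be sketched as follows. The function names and the dictionary are illustrative, and the numbers come directly from the paragraph above.

```python
# Recommended maximum JVM utilization (percent) per garbage collector.
JVM_UTILIZATION_LIMIT = {"CMS": 80.0, "G1GC": 95.0}


def heap_gb(ram_gb: float) -> float:
    """Java heap is half of the node's memory, capped at 32 GB."""
    return min(ram_gb / 2, 32.0)


def jvm_utilization_ok(collector: str, max_jvm_pct: float) -> bool:
    """Check an observed maximum JVM utilization against the recommended limit."""
    return max_jvm_pct <= JVM_UTILIZATION_LIMIT[collector]


if __name__ == "__main__":
    for ram in (16, 64, 128):
        print(f"{ram} GB RAM -> {heap_gb(ram):.0f} GB heap")
    print("CMS at 85%:", jvm_utilization_ok("CMS", 85.0))
    print("G1GC at 85%:", jvm_utilization_ok("G1GC", 85.0))
```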
OpenSearch Service publishes several guidelines for sizing of domains. We provide an empirical formula so that you can determine the right amount of storage required for your requirements. However, it’s important to keep an eye out for the depletion of storage with time and changes in workload characteristics. To ensure the domain doesn’t run out of storage and can continue to index data, you should configure Amazon CloudWatch alarms and monitor your free storage space.
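A free storage alarm might be defined along the following lines. This sketch builds the keyword arguments for the CloudWatch `put_metric_alarm` call in boto3. It assumes the `FreeStorageSpace` metric in the `AWS/ES` namespace, reported in MiB and keyed by the `DomainName` and `ClientId` dimensions, so confirm these details in the OpenSearch Service metrics documentation. The 25% threshold, the alarm name, the domain name, and the account ID are placeholders chosen for illustration.

```python
def free_storage_alarm(
    domain_name: str,
    account_id: str,
    storage_per_node_gib: float,
    threshold_fraction: float = 0.25,
) -> dict:
    """Build put_metric_alarm arguments that fire when a node runs low on storage."""
    if not 0 < threshold_fraction < 1:
        raise ValueError("threshold_fraction must be between 0 and 1")
    # Assumption: the metric is reported in MiB, so convert GiB to MiB.
    threshold_mib = storage_per_node_gib * 1024 * threshold_fraction
    return {
        "AlarmName": f"{domain_name}-free-storage-low",  # hypothetical naming scheme
        "Namespace": "AWS/ES",
        "MetricName": "FreeStorageSpace",
        "Dimensions": [
            {"Name": "DomainName", "Value": domain_name},
            {"Name": "ClientId", "Value": account_id},
        ],
        # Minimum tracks the node with the least free space.
        "Statistic": "Minimum",
        "Period": 60,
        "EvaluationPeriods": 1,
        "Threshold": threshold_mib,
        "ComparisonOperator": "LessThanOrEqualToThreshold",
    }


if __name__ == "__main__":
    kwargs = free_storage_alarm("my-domain", "123456789012", storage_per_node_gib=100)
    print(kwargs)
    # A real call would look roughly like this (requires boto3 and AWS credentials):
    # boto3.client("cloudwatch").put_metric_alarm(**kwargs)
```

In practice, you would also add an alarm action, such as a notification topic, so that someone is alerted before indexing is blocked.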
AWS also recommends choosing a primary shard count so that each shard is within an optimal size band. You can determine the optimal shard size through proof-of-concept testing with your data and traffic. As a guideline, we use 10–30 GB primary shard sizes for search use cases and 45–50 GB primary shard sizes for log analytics use cases. Because shards are the workers in your domain, they are directly responsible for the distribution of the workload across the data nodes. If your shards are too large, you may see stress in your Java heap from large aggregations, worse query performance, and worse performance on cluster-level tasks like shard rebalancing, snapshots, and hot-to-warm migrations. If your shards are too small, their sheer number can overwhelm the domain’s Java heap space, worsen query performance through excessive internal networking, and make cluster-level tasks slow. We also recommend keeping the number of shards per node proportional to the heap available (half of the instance’s RAM, up to 32 GB): no more than 25 shards per GB of Java heap. With the heap capped at 32 GB, this guideline works out to at most 800 shards per data node, which keeps you under the practical limit of 1,000 shards on any data node in your domain.
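Both shard guidelines reduce to short formulas. The following sketch estimates a primary shard count from a target shard size and applies the 25-shards-per-GB-of-heap rule. The index sizes in the usage example are made up, and the function names are illustrative.

```python
import math


def primary_shard_count(index_size_gb: float, target_shard_gb: float) -> int:
    """Number of primary shards needed to keep each shard near the target size."""
    if index_size_gb <= 0 or target_shard_gb <= 0:
        raise ValueError("Sizes must be positive")
    return max(1, math.ceil(index_size_gb / target_shard_gb))


def max_shards_per_node(ram_gb: float, shards_per_heap_gb: int = 25) -> int:
    """Shard ceiling for a data node: 25 shards per GB of heap (half of RAM, max 32 GB)."""
    heap = min(ram_gb / 2, 32.0)
    return int(shards_per_heap_gb * heap)


if __name__ == "__main__":
    # Hypothetical workloads: a 300 GB search index and a 1,000 GB log index.
    print("Search index shards:", primary_shard_count(300, target_shard_gb=30))
    print("Log index shards:", primary_shard_count(1000, target_shard_gb=50))
    print("Shard ceiling on a 64 GB node:", max_shards_per_node(64))
```

Treat the result as a starting point for proof-of-concept testing rather than a final answer.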
In this post, you learned various tips and tricks to set up a highly available domain using OpenSearch Service, which helps you to keep OpenSearch Service performant and available by running it across three Availability Zones.
Stay tuned for a series of posts focusing on the various features and functionalities with OpenSearch Service. If you have feedback about this post, submit it in the comments section. If you have questions about this post, start a new thread on the OpenSearch Service forum or contact AWS Support.
About the authors
Rohin Bhargava is a Sr. Product Manager with the Amazon OpenSearch Service team. His passion at AWS is to help customers find the correct mix of AWS services to achieve success for their business goals.
Prashant Agrawal is a Sr. Search Specialist Solutions Architect with Amazon OpenSearch Service. He works closely with customers to help them migrate their workloads to the cloud and helps existing customers fine-tune their clusters to achieve better performance and save on cost. Before joining AWS, he helped various customers use OpenSearch and Elasticsearch for their search and log analytics use cases. When not working, you can find him traveling and exploring new places. In short, he likes doing Eat → Travel → Repeat.