Why did my Amazon OpenSearch Service node crash?
Last updated: 2021-07-30
One of the nodes in my Amazon OpenSearch Service (successor to Amazon Elasticsearch Service) cluster is down. Why did the node fail and how can I prevent this from happening again?
Each OpenSearch Service node runs on a separate Amazon Elastic Compute Cloud (Amazon EC2) instance. A failed node is an instance that isn't responding to heartbeat signals from the other nodes. Heartbeat signals are periodic signals that monitor the availability of the data nodes in the cluster.
Here are common causes of failed cluster nodes:
- High Java Virtual Machine (JVM) memory pressure
- Hardware failure
Check for failed nodes
1. Open the OpenSearch Service console.
2. Choose the name of your OpenSearch Service domain.
3. Choose the Cluster health tab, and then choose the Nodes metric. If the number of nodes is fewer than the number that you configured for your cluster, this indicates that a node is down.
Note: The Nodes metric can be inaccurate during changes to your cluster configuration or any routine maintenance for the service. This behavior is expected.
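The console check above can also be scripted against CloudWatch. Here is a minimal boto3 sketch, assuming the standard AWS/ES CloudWatch namespace for OpenSearch Service domains; the domain name and account ID shown are hypothetical placeholders:

```python
from datetime import datetime, timedelta, timezone

def nodes_metric_params(domain_name, account_id, hours=1):
    """Build get_metric_statistics parameters for the Nodes metric.

    OpenSearch Service publishes cluster metrics to CloudWatch under the
    AWS/ES namespace, keyed by DomainName and ClientId (the AWS account ID).
    """
    now = datetime.now(timezone.utc)
    return {
        "Namespace": "AWS/ES",
        "MetricName": "Nodes",
        "Dimensions": [
            {"Name": "DomainName", "Value": domain_name},
            {"Name": "ClientId", "Value": account_id},
        ],
        "StartTime": now - timedelta(hours=hours),
        "EndTime": now,
        "Period": 60,
        "Statistics": ["Minimum"],
    }

if __name__ == "__main__":
    import boto3  # requires AWS credentials to actually query

    # Hypothetical domain and account ID -- replace with your own.
    params = nodes_metric_params("my-domain", "123456789012")
    cloudwatch = boto3.client("cloudwatch")
    datapoints = cloudwatch.get_metric_statistics(**params)["Datapoints"]
    # If the minimum node count dips below the number you configured,
    # a node was down during that period.
    print(sorted(dp["Minimum"] for dp in datapoints))
```

Checking the Minimum statistic over a window catches brief dips that a single point-in-time console check can miss.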
Identify and troubleshoot high JVM memory pressure
JVM memory pressure is the percentage of the Java heap in use across the data nodes in an OpenSearch Service cluster. High JVM memory pressure can cause high CPU usage and other cluster performance issues.
JVM memory pressure is determined by the following factors:
- The amount of data on the cluster in proportion to the amount of resources.
- The query load on the cluster.
Here's what happens as JVM memory pressure increases:
- At 75%: OpenSearch Service triggers the Concurrent Mark Sweep (CMS) garbage collector. The CMS collector runs alongside other processes to keep pauses and disruptions to a minimum.
Note: OpenSearch Service publishes several garbage collection metrics to Amazon CloudWatch. These metrics can help you monitor JVM memory usage. For more information, see Monitoring cluster metrics with Amazon CloudWatch.
- Above 75%: If the CMS collector fails to reclaim enough memory and usage remains above 75%, the JVM tries to free up memory and to prevent a JVM OutOfMemoryError (OOM) exception by slowing or stopping processes.
- If heap usage continues to grow and space isn't reclaimed, the JVM stops the processes that try to allocate memory. If a critical process is stopped, one or more cluster nodes might fail. It's a best practice to keep JVM memory pressure below 80%.
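The thresholds above can be summarized in a small helper. This is a sketch: the 75% and 80% values come from this article, while the function and status names are our own:

```python
def jvm_pressure_status(pressure_percent):
    """Map JVM memory pressure (percent of heap in use) to the behavior
    described above. Thresholds follow this article's description."""
    if pressure_percent < 75:
        return "healthy"   # normal operation
    if pressure_percent < 80:
        return "cms-gc"    # CMS garbage collector triggered at 75%
    # Sustained pressure above the 80% best-practice ceiling: OOM
    # protection may slow or stop processes, and nodes can fail.
    return "at-risk"

# Example: compute pressure from heap statistics and classify it.
heap_used_bytes = 6_500_000_000
heap_max_bytes = 8_000_000_000
pressure = 100 * heap_used_bytes / heap_max_bytes
print(round(pressure, 1), jvm_pressure_status(pressure))
```

In practice you would feed this from the JVMMemoryPressure CloudWatch metric rather than computing it yourself.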
To prevent high JVM memory pressure, follow these best practices:
- Avoid queries on wide ranges, such as wildcard queries.
- Avoid sending a large number of requests at the same time.
- Be sure that you have the appropriate number of shards. For more information about indexing strategy, see Choosing the number of shards.
- Be sure that your shards are distributed evenly between nodes.
- Avoid aggregating on text fields. This helps prevent increases in field data. The more field data that you have, the more heap space is consumed. Use the GET _cluster/stats API operation to check field data. For more information, see field data on the Elasticsearch website.
- If you must aggregate on text fields, change the mapping type to keyword. If JVM memory pressure gets too high, use the following API operations to clear the field data cache: POST /index_name/_cache/clear (index-level cache) and POST */_cache/clear (cluster-level cache).
Note: Clearing the cache can disrupt queries that are in progress.
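The cache-clear calls above can be scripted. The following sketch only builds the request URLs; the endpoint shown is a hypothetical placeholder, and actually sending the POST requires your domain's own authentication:

```python
def cache_clear_url(endpoint, index=None):
    """Build the _cache/clear URL described above.

    index=None targets every index (the cluster-level POST */_cache/clear);
    otherwise only the named index's cache is cleared.
    """
    target = index if index is not None else "*"
    return f"{endpoint}/{target}/_cache/clear"

# Hypothetical endpoint -- substitute your OpenSearch Service domain endpoint.
ENDPOINT = "https://search-my-domain.us-east-1.es.amazonaws.com"
print(cache_clear_url(ENDPOINT, "my-index"))  # index-level cache
print(cache_clear_url(ENDPOINT))              # cluster-level cache

# To actually clear the cache (remember: this disrupts in-flight queries):
#   import requests
#   requests.post(cache_clear_url(ENDPOINT, "my-index"), auth=...)
```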
Identify and troubleshoot hardware failure issues
Sometimes hardware failures can impact cluster node availability. To limit the impact of potential hardware failures, check the following:
- Be sure that you have more than one node in your cluster. A single-node cluster is a single point of failure, and you can't use replica shards to back up your data, because primary and replica shards can't be assigned to the same node. If the node goes down, you can restore data from a snapshot, but you can't recover any data that wasn't captured in the last snapshot. For more information, see Working with OpenSearch Service index snapshots, Sizing OpenSearch Service domains, and Creating and managing OpenSearch Service domains.
- Be sure that you have at least one replica. A multi-node cluster can still experience data loss if there aren't any replica shards.
- Enable zone awareness. When zone awareness is enabled, OpenSearch Service launches data nodes in multiple Availability Zones and tries to distribute primary shards and their corresponding replica shards to different zones. If one node or Availability Zone fails, your data is still available. For more information, see Configuring a Multi-AZ domain.
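Zone awareness can be enabled on an existing domain through the UpdateDomainConfig API. Here is a boto3 sketch; the domain name is a hypothetical placeholder, and the three-zone count is one reasonable choice, not a requirement:

```python
def zone_awareness_config(domain_name, az_count=3):
    """Build update_domain_config arguments that enable zone awareness
    across the given number of Availability Zones."""
    return {
        "DomainName": domain_name,
        "ClusterConfig": {
            "ZoneAwarenessEnabled": True,
            "ZoneAwarenessConfig": {"AvailabilityZoneCount": az_count},
        },
    }

if __name__ == "__main__":
    import boto3  # requires AWS credentials to actually apply the change

    client = boto3.client("opensearch")
    # Hypothetical domain name -- replace with your own.
    client.update_domain_config(**zone_awareness_config("my-domain"))
```

Make sure the data node count is a multiple of the zone count so that shards can be distributed evenly across zones.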