Why did my OpenSearch Service node crash?

One of the nodes in my Amazon OpenSearch Service cluster is down, and I want to prevent this from happening.

Short description

Each OpenSearch Service node runs on a separate Amazon Elastic Compute Cloud (Amazon EC2) instance. A failed node is an instance that isn't responding to heartbeat signals from the other nodes. Heartbeat signals are periodic signals that monitor the availability of the data nodes in the cluster.

Common causes of failed cluster nodes include:

  • High Java Virtual Machine (JVM) memory pressure
  • Hardware failure

Resolution

Check for failed nodes

1.    Sign in to the OpenSearch Service console.

2.    In the navigation pane, under Managed clusters, choose Domains.

3.    Choose the name of your OpenSearch Service domain.

4.    Choose the Cluster health tab, and then choose the Nodes metric. If the number of nodes is fewer than the number that you configured for your cluster, then a node is down.

Note: The Nodes metric can be inaccurate during changes to your cluster configuration or any routine maintenance for the service. This behavior is expected.
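
You can also watch the Nodes metric outside of the console. The following AWS CLI sketch queries CloudWatch for the minimum node count over a recent window; the domain name, account ID (the ClientId dimension), and time range are placeholders that you must replace.

    # Minimum node count in 5-minute periods (placeholders: domain name, account ID, times)
    aws cloudwatch get-metric-statistics \
      --namespace AWS/ES \
      --metric-name Nodes \
      --dimensions Name=DomainName,Value=my-domain Name=ClientId,Value=123456789012 \
      --start-time 2024-06-01T00:00:00Z \
      --end-time 2024-06-01T01:00:00Z \
      --period 300 \
      --statistics Minimum

If the Minimum statistic dips below the node count that you configured, then a node was down during that period.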

Identify and troubleshoot high JVM memory pressure

JVM memory pressure refers to the percentage of the Java heap in use on the data nodes in an OpenSearch Service cluster. High JVM memory pressure can cause high CPU usage and other cluster performance issues.
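
To check heap usage directly, query the nodes stats API. This curl sketch assumes a domain endpoint and credentials of your own (both placeholders); the filter_path parameter trims the response to each node's heap_used_percent, which closely corresponds to JVM memory pressure.

    # Per-node heap usage (placeholders: endpoint, credentials)
    curl -s -u 'master-user:password' \
      "https://your-domain-endpoint/_nodes/stats/jvm?filter_path=nodes.*.jvm.mem.heap_used_percent&pretty"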

The following factors determine JVM memory pressure:

  • The amount of data on the cluster relative to the cluster's resources.
  • The query load on the cluster.

As JVM memory pressure increases, the following happens:

  • At 75%: OpenSearch Service initiates the Concurrent Mark Sweep (CMS) garbage collector. The CMS collector runs concurrently with application threads to keep pauses and disruptions to a minimum.
    Note: OpenSearch Service publishes several garbage collection metrics to Amazon CloudWatch. These metrics can help you monitor JVM memory usage. For more information, see Monitoring OpenSearch cluster metrics with Amazon CloudWatch.
  • Above 75%: If the CMS collector fails to reclaim enough memory and usage stays above 75%, then the JVM tries to free up memory and to prevent a JVM OutOfMemoryError (OOM) exception by slowing or stopping processes.
  • If heap usage continues to grow and space isn't reclaimed, then the JVM stops the processes that try to allocate memory. If a critical process is stopped, then one or more cluster nodes might fail. It's a best practice to keep JVM memory pressure below 80%; the sketch after this list shows one way to monitor it.
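
A minimal AWS CLI sketch for tracking the JVMMemoryPressure CloudWatch metric follows; the domain name, account ID, and time range are placeholders.

    # Peak JVM memory pressure in 5-minute periods (placeholders: domain name, account ID, times)
    aws cloudwatch get-metric-statistics \
      --namespace AWS/ES \
      --metric-name JVMMemoryPressure \
      --dimensions Name=DomainName,Value=my-domain Name=ClientId,Value=123456789012 \
      --start-time 2024-06-01T00:00:00Z \
      --end-time 2024-06-01T03:00:00Z \
      --period 300 \
      --statistics Maximum

Sustained Maximum values above 80% are the signal to apply the best practices that follow.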

To prevent high JVM memory pressure, follow these best practices:

  • Avoid queries on wide ranges, such as wildcard queries.
  • Avoid sending a large number of requests at the same time.
  • Be sure that you have the appropriate number of shards. For more information about indexing strategy, see Choosing the number of shards.
  • Be sure that your shards are distributed evenly between nodes.
  • Avoid aggregating on text fields. Aggregations on text fields build field data, and field data is held on the Java heap: the more field data you have, the more heap space is consumed. Use the GET _cluster/stats API operation to check field data usage. For more information, see the Elasticsearch documentation for fielddata.
  • If you must aggregate on text fields, then change the mapping type to keyword. If JVM memory pressure gets too high, then use the following API operations to clear the field data cache: POST /index_name/_cache/clear (index-level cache) and POST */_cache/clear (cluster-level cache). For an example, see the sketch after this list.
    Note: Clearing the cache can disrupt queries that are in progress.
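
The following curl sketch checks field data memory per node and then clears only the field data cache for a single index. The endpoint, credentials, and index name are placeholders; the fielddata=true parameter limits the clear operation to field data so that other caches stay warm.

    # Show field data memory per node (placeholders: endpoint, credentials)
    curl -s -u 'master-user:password' \
      "https://your-domain-endpoint/_cat/fielddata?v"

    # Clear only the field data cache for one index (placeholder: index_name)
    curl -s -XPOST -u 'master-user:password' \
      "https://your-domain-endpoint/index_name/_cache/clear?fielddata=true"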

Identify and troubleshoot hardware failure issues

Hardware failures can also affect cluster node availability. To limit the impact of a potential hardware failure, take these precautions:

  • Be sure that you have more than one node in your cluster. A single-node cluster is a single point of failure: you can't use replica shards to back up your data, because primary and replica shards can't be assigned to the same node. If the node goes down, then you can restore data from a snapshot, but any data that wasn't captured in the last snapshot is lost. For more information, see Creating index snapshots in OpenSearch Service, Sizing OpenSearch Service domains, and Creating and managing OpenSearch Service domains.
  • Be sure that you have at least one replica. A multi-node cluster can still lose data if there aren't any replica shards.
  • Turn on zone awareness. When zone awareness is turned on, OpenSearch Service launches data nodes in multiple Availability Zones and tries to distribute primary shards and their corresponding replica shards across those zones. If one node or Availability Zone fails, your data is still available. For more information, see Configuring a Multi-AZ domain in OpenSearch Service. The sketch after this list shows both settings.
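
As a sketch of the last two precautions, the following commands set one replica per primary shard on an existing index and turn on three-zone awareness for a domain. The endpoint, credentials, index name, and domain name are placeholders, and the zone awareness call assumes that the rest of your cluster configuration (for example, the instance count) supports three Availability Zones.

    # Give every primary shard in index_name one replica (placeholders: endpoint, credentials, index)
    curl -s -XPUT -u 'master-user:password' \
      "https://your-domain-endpoint/index_name/_settings" \
      -H 'Content-Type: application/json' \
      -d '{"index": {"number_of_replicas": 1}}'

    # Turn on three-zone awareness (placeholder: domain name)
    aws opensearch update-domain-config \
      --domain-name my-domain \
      --cluster-config "ZoneAwarenessEnabled=true,ZoneAwarenessConfig={AvailabilityZoneCount=3}"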

Related information

Operational best practices for Amazon OpenSearch Service

How do I make my Amazon OpenSearch Service domain more fault tolerant?

How can I scale up or scale out an Amazon OpenSearch Service domain?

Why is my Amazon OpenSearch Service domain stuck in the "Processing" state?
