Why did my Amazon Elasticsearch Service node crash?

Last updated: 2020-04-16

One of the nodes in my Amazon Elasticsearch Service (Amazon ES) cluster is down. Why did the node fail and how can I prevent this from happening again?

Short Description

Each Amazon ES node runs on a separate Amazon Elastic Compute Cloud (Amazon EC2) instance. A failed node is an instance that isn't responding to heartbeat signals from the other nodes. Heartbeat signals are periodic signals that monitor the availability of the data nodes in the cluster.

Here are common causes of failed cluster nodes:

  • High Java Virtual Machine (JVM) memory pressure
  • Hardware failure

Resolution

Check for failed nodes

1.    Open the Amazon ES console.

2.    Choose the name of your Elasticsearch domain.

3.    Choose the Cluster health tab, and then choose the Nodes metric. If the number of nodes is fewer than the number that you configured for your cluster, this indicates that a node is down.

Note: The Nodes metric can be inaccurate during changes to your cluster configuration or any routine maintenance for the service. This behavior is expected. For more information, see About Configuration Changes in the Amazon Elasticsearch Service Developer Guide.
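
You can also track the same Nodes metric programmatically through Amazon CloudWatch. The following is a minimal sketch using boto3; the domain name and account ID are placeholders, and it assumes that your AWS credentials and Region are already configured. If the minimum value over the period is lower than the node count that you configured, at least one node was down during that window.

    # Minimal sketch: read the Nodes metric for an Amazon ES domain from CloudWatch.
    # "my-domain" and "123456789012" are placeholders for your domain name and account ID.
    from datetime import datetime, timedelta

    import boto3

    cloudwatch = boto3.client("cloudwatch")

    response = cloudwatch.get_metric_statistics(
        Namespace="AWS/ES",
        MetricName="Nodes",
        Dimensions=[
            {"Name": "DomainName", "Value": "my-domain"},
            {"Name": "ClientId", "Value": "123456789012"},
        ],
        StartTime=datetime.utcnow() - timedelta(hours=1),
        EndTime=datetime.utcnow(),
        Period=300,
        Statistics=["Minimum"],
    )

    # A minimum below the configured node count means a node was missing
    # at some point during that five-minute period.
    for point in sorted(response["Datapoints"], key=lambda p: p["Timestamp"]):
        print(point["Timestamp"], point["Minimum"])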

Identify and troubleshoot high JVM memory pressure

JVM memory pressure refers to the percentage of Java heap that is used for all data nodes in an Elasticsearch cluster. High JVM memory pressure can cause high CPU usage and other cluster performance issues.

JVM memory pressure is determined by the following factors:

  • The amount of data on the cluster in proportion to the amount of resources.
  • The query load on the cluster.

Here's what happens as JVM memory pressure increases:

  • At 75%: Amazon ES triggers the Concurrent Mark Sweep (CMS) garbage collector. The CMS collector runs alongside other processes to keep pauses and disruptions to a minimum. Note: Amazon ES publishes several garbage collection metrics to Amazon CloudWatch. These metrics can help you monitor JVM memory usage. For more information, see Instance Metrics.
  • Above 75%: If the CMS collector fails to reclaim enough memory and usage remains above 75%, the JVM tries to free memory and also tries to prevent a JVM OutOfMemoryError (OOM) exception by slowing or stopping processes.
  • If JVM memory usage continues to grow and memory is not reclaimed, the JVM kills processes that try to allocate memory. If a critical process is killed, one or more cluster nodes might fail. It's a best practice to keep JVM memory pressure below 80%. You can check per-node heap usage with the sketch after this list.
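
To see which node is under pressure, you can read per-node heap usage from the _nodes/stats API. The following is a minimal sketch using Python and the requests library; the endpoint is a placeholder, and it assumes that your domain's access policy allows unsigned requests from your IP address (otherwise, sign the requests or use fine-grained access control credentials).

    # Minimal sketch: report heap usage for each node in the cluster.
    # The endpoint below is a placeholder for your domain endpoint.
    import requests

    ENDPOINT = "https://my-domain.us-east-1.es.amazonaws.com"

    stats = requests.get(f"{ENDPOINT}/_nodes/stats/jvm").json()

    for node in stats["nodes"].values():
        heap_used = node["jvm"]["mem"]["heap_used_percent"]
        # Sustained values at or above 75% mean the CMS collector is already
        # running and the node is at risk if pressure keeps climbing.
        print(f"{node['name']}: heap_used_percent={heap_used}")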

To prevent high JVM memory pressure, follow these best practices:

  • Avoid queries on wide ranges, such as wildcard queries.
  • Avoid sending a large number of requests at the same time.
  • Be sure that you have the appropriate number of shards. For more information about indexing strategy, see Choosing the Number of Shards.
  • Be sure that your shards are distributed evenly between nodes.
  • Avoid aggregating on text fields. This helps prevent increases in field data. The more field data you have, the more heap space is consumed. Use the GET _cluster/stats API operation to check field data. For more information about field data, see the Elastic website.
  • If you must aggregate on text fields, change the mapping type to keyword. If JVM memory pressure gets too high, use the following API operations to clear the field data cache: POST /index_name/_cache/clear (index-level cache) and POST */_cache/clear (cluster-level cache). Both calls appear in the sketch after this list.
    Note: Clearing the cache can disrupt queries that are in progress.
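
The following is a minimal sketch of the field data check and the cache-clear calls mentioned above, using the same placeholder endpoint and access policy assumptions as the previous example. The index name is a placeholder, and the fielddata=true parameter limits the clear operation to the field data cache.

    # Minimal sketch: check field data usage, then clear the field data cache.
    import requests

    ENDPOINT = "https://my-domain.us-east-1.es.amazonaws.com"

    # How much heap is held by field data across the cluster.
    cluster_stats = requests.get(f"{ENDPOINT}/_cluster/stats").json()
    fielddata_bytes = cluster_stats["indices"]["fielddata"]["memory_size_in_bytes"]
    print(f"Field data in heap: {fielddata_bytes} bytes")

    # Clear the field data cache for one index ("my-index" is a placeholder) ...
    requests.post(f"{ENDPOINT}/my-index/_cache/clear?fielddata=true")

    # ... or for every index in the cluster. Remember that clearing the cache
    # can disrupt queries that are in progress.
    requests.post(f"{ENDPOINT}/*/_cache/clear?fielddata=true")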

Identify and troubleshoot hardware failure issues

Hardware failures can occasionally affect node availability in your Elasticsearch cluster. To limit the impact of potential hardware failures, check the following:

  • Be sure that you have more than one node in your cluster. A single-node cluster is a single point of failure. You can't use replica shards to back up your data, because primary and replica shards can't be assigned to the same node. If the node goes down, you can restore data from a snapshot. For more information about snapshots, see Working with Amazon Elasticsearch Service Index Snapshots. Note that you can't recover any data that was not already captured in the last snapshot. For more information, see Sizing Amazon ES Domains and Configuring Amazon ES Domains.
  • Be sure that you have at least one replica. A multi-node cluster can still experience data loss if there aren't any replica shards. The sketch after this list shows how to set the replica count for an index.
  • Enable zone awareness. When zone awareness is enabled, Amazon ES launches data nodes in multiple Availability Zones and tries to distribute primary shards and their corresponding replica shards to different zones. If a node or an Availability Zone fails, your data is still available. For more information, see Configuring a Multi-AZ Domain.
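
The following is a minimal sketch of the replica and zone awareness settings described above. The endpoint, domain name, and index name are placeholders; the HTTP call assumes the same access policy as the earlier examples, and the boto3 call assumes that your AWS credentials are configured.

    # Minimal sketch: keep at least one replica per primary shard, and
    # enable zone awareness for the domain.
    import boto3
    import requests

    ENDPOINT = "https://my-domain.us-east-1.es.amazonaws.com"

    # Set the replica count for an index ("my-index" is a placeholder).
    requests.put(
        f"{ENDPOINT}/my-index/_settings",
        json={"index": {"number_of_replicas": 1}},
    )

    # Enable zone awareness so data nodes span multiple Availability Zones.
    es = boto3.client("es")
    es.update_elasticsearch_domain_config(
        DomainName="my-domain",
        ElasticsearchClusterConfig={"ZoneAwarenessEnabled": True},
    )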