Why did my Elasticsearch node crash?

Last updated: 2019-07-09

One of the nodes in my Amazon Elasticsearch Service cluster is down. Why did the node fail and how can I prevent this from happening again?

Short Description

Each Amazon ES node runs on a separate Amazon Elastic Compute Cloud (Amazon EC2) instance. A failed node is an instance that isn't responding to heartbeat signals from the other nodes. Heartbeats are periodic signals that the nodes exchange to monitor the availability of the data nodes in the cluster.
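
You can see which data nodes are currently responding with the Elasticsearch _cat/nodes API. The following Python sketch is a minimal example; the endpoint is a placeholder for your own domain endpoint, and it assumes that your domain's access policy allows unsigned requests from the caller (otherwise, sign the request).

    import requests

    # Placeholder: replace with your Amazon ES domain endpoint.
    ENDPOINT = "https://search-my-domain.us-east-1.es.amazonaws.com"

    # List the nodes that are currently part of the cluster. A failed
    # node drops out of this output.
    print(requests.get(ENDPOINT + "/_cat/nodes?v").text)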

Here are common causes of failed cluster nodes:

  • High JVM memory pressure
  • Hardware failure

Resolution

Check for failed nodes

1.    Open the Amazon ES console.

2.    Choose the name of your Elasticsearch domain.

3.    Choose the Cluster health tab, and then choose the Nodes metric. If the number of nodes is less than the number that you configured for your cluster, a node is down.

Note: The Nodes metric is not accurate during changes to your cluster configuration and during routine maintenance for the service. This behavior is expected. For more information, see About Configuration Changes.
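
To check the same metric programmatically, you can query Amazon CloudWatch with boto3, as in the sketch below. The Region, domain name, and account ID (the ClientId dimension) are placeholders to replace with your own values; compare the reported minimum against the number of nodes that you configured.

    import datetime
    import boto3

    cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")  # placeholder Region
    now = datetime.datetime.utcnow()

    # Retrieve the minimum node count over the last hour in 5-minute periods.
    response = cloudwatch.get_metric_statistics(
        Namespace="AWS/ES",
        MetricName="Nodes",
        Dimensions=[
            {"Name": "DomainName", "Value": "my-domain"},   # placeholder domain name
            {"Name": "ClientId", "Value": "123456789012"},  # placeholder account ID
        ],
        StartTime=now - datetime.timedelta(hours=1),
        EndTime=now,
        Period=300,
        Statistics=["Minimum"],
    )

    for point in sorted(response["Datapoints"], key=lambda p: p["Timestamp"]):
        print(point["Timestamp"], point["Minimum"])

If any data point is lower than your configured node count, investigate that time window.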

Identify and troubleshoot high JVM memory pressure

JVM memory pressure refers to the percentage of the Java heap that is used for all data nodes in an Elasticsearch cluster. High JVM memory pressure can cause high CPU usage and other performance issues on an Elasticsearch cluster.

JVM memory pressure is determined by the following factors:

  • The amount of data on the cluster in proportion to the cluster's resources, such as heap memory.
  • The query load on the cluster.

Here's what happens as JVM memory pressure increases:

  • At 75%: Amazon ES triggers the Concurrent Mark Sweep (CMS) garbage collector. The CMS collector runs alongside other processes to keep pauses and disruptions to a minimum.
    Note: Amazon ES publishes several garbage collection metrics to Amazon CloudWatch. These metrics can help you monitor JVM memory usage, and you can set alarms on them (see the sketch after this list). For more information, see Instance Metrics.
  • Above 75%: If the CMS collector fails to reclaim enough memory and usage remains above 75%, Amazon ES triggers a different garbage collection algorithm. This algorithm tries to free up memory and prevent a JVM OutOfMemoryError (OOM) exception by slowing or stopping processes.
  • Around 95%: Amazon ES kills processes that try to allocate memory. If a critical process is killed, one or more cluster nodes might fail.
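
Because problems begin at the 75% threshold, it helps to be alerted before the cluster reaches it. The following boto3 sketch creates a CloudWatch alarm on the JVMMemoryPressure metric. The 80% threshold, alarm name, domain name, account ID, and SNS topic ARN are all assumptions to adapt to your environment.

    import boto3

    cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")  # placeholder Region

    # Alarm when maximum JVM memory pressure stays at or above 80% (an
    # assumed threshold) for three consecutive 5-minute periods.
    cloudwatch.put_metric_alarm(
        AlarmName="es-jvm-memory-pressure-high",  # hypothetical alarm name
        Namespace="AWS/ES",
        MetricName="JVMMemoryPressure",
        Dimensions=[
            {"Name": "DomainName", "Value": "my-domain"},   # placeholder domain name
            {"Name": "ClientId", "Value": "123456789012"},  # placeholder account ID
        ],
        Statistic="Maximum",
        Period=300,
        EvaluationPeriods=3,
        Threshold=80.0,
        ComparisonOperator="GreaterThanOrEqualToThreshold",
        AlarmActions=["arn:aws:sns:us-east-1:123456789012:es-alerts"],  # hypothetical SNS topic
    )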

To prevent high JVM memory pressure:

  • Avoid queries on wide ranges, such as wildcard queries.
  • Avoid sending a large number of requests at the same time.
  • Be sure that you have the appropriate number of shards. For more information, see Choosing the Number of Shards.
  • Be sure that your shards are distributed evenly between nodes.
  • When possible, avoid aggregating on text fields. This helps prevent increases in field data. The more field data you have, the more heap space it consumes. Use the GET _cluster/stats API operation to check field data usage.
  • If you must aggregate on text fields, change the field's mapping type to keyword. If JVM memory pressure gets too high, use the following API operations to clear the field data cache: POST /index_name/_cache/clear (index-level cache) and POST /_cache/clear (cluster-level cache). A Python sketch of the check-and-clear sequence follows this list.
    Note: Clearing the cache can disrupt queries that are in progress.
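
Here is a minimal Python sketch of the check-and-clear sequence described above, using the requests library. The endpoint and index name are placeholders, and the calls assume that the domain's access policy allows unsigned requests (otherwise, sign them).

    import requests

    # Placeholders: replace with your domain endpoint and index name.
    ENDPOINT = "https://search-my-domain.us-east-1.es.amazonaws.com"
    INDEX = "my-index"

    # Check how much heap space field data is consuming across the cluster.
    stats = requests.get(ENDPOINT + "/_cluster/stats").json()
    fielddata_bytes = stats["indices"]["fielddata"]["memory_size_in_bytes"]
    print("Field data in heap: %.1f MiB" % (fielddata_bytes / 1024.0 / 1024.0))

    # If field data is consuming too much heap, clear the cache.
    # Remember that clearing the cache can disrupt in-progress queries.
    requests.post(ENDPOINT + "/" + INDEX + "/_cache/clear")  # index-level cache
    requests.post(ENDPOINT + "/_cache/clear")                # cluster-level cache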

Identify and troubleshoot hardware failure issues

Although rare, hardware failures can occur that affect the availability of nodes in your Elasticsearch cluster. To limit the impact of potential hardware failures:

  • Be sure that you have more than one node in your cluster: A single-node cluster is a single point of failure. You can't use replica shards to back up your data, because primary and replica shards can't be assigned to the same node. If the node goes down, you can restore your data from a snapshot, but you can't recover any data that wasn't captured in the last snapshot. For more information, see Sizing Amazon ES Domains and Configuring Amazon ES Domains.
  • Be sure that you have at least one replica: A multi-node cluster can still experience data loss if there are no replica shards. The sketch after this list shows how to add a replica to an index.
  • Enable zone awareness: When zone awareness is enabled, Amazon ES launches data nodes in multiple Availability Zones. Each primary shard and its replica are assigned to nodes in different Availability Zones, so if one node or Availability Zone fails, your data is still available. For more information, see Configuring a Multi-AZ Domain.
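
As a concrete example of the replica recommendation, the following sketch sets one replica per primary shard on an index through the Elasticsearch _settings API. The endpoint and index name are placeholders, and the call assumes unsigned access is allowed.

    import requests

    # Placeholders: replace with your domain endpoint and index name.
    ENDPOINT = "https://search-my-domain.us-east-1.es.amazonaws.com"
    INDEX = "my-index"

    # Give every primary shard in the index one replica. With zone
    # awareness enabled, each replica is placed in a different
    # Availability Zone than its primary.
    response = requests.put(
        ENDPOINT + "/" + INDEX + "/_settings",
        json={"index": {"number_of_replicas": 1}},
    )
    print(response.json())  # {"acknowledged": true} on success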