How do I monitor my Amazon Elasticsearch Service cluster using CloudWatch alarms?

Last updated: 2020-08-21

I want to monitor my Amazon Elasticsearch Service (Amazon ES) cluster for stability issues. How can I effectively monitor my Elasticsearch cluster?

Resolution

Important: Different versions of Elasticsearch use different thread pools to process calls to the _index API. Elasticsearch 1.5 and 2.3 use the index thread pool. Elasticsearch 5.x, 6.0, and 6.2 use the bulk thread pool. Elasticsearch versions 6.3 and later use the write thread pool. Currently, the Amazon ES console doesn't include a graph for the bulk thread pool.
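As a minimal sketch, the version rule above can be written as a small lookup. The function names are hypothetical, and the `Threadpool<Pool>Queue` metric naming pattern is an assumption extrapolated from the ThreadpoolIndexQueue metric used later in this article, so confirm the exact metric name for your domain in CloudWatch.

```python
def write_thread_pool(version: str) -> str:
    """Return the thread pool that processes _index calls for a version such as "6.2"."""
    major, minor = (int(part) for part in version.split(".")[:2])
    if major < 5:
        return "index"  # Elasticsearch 1.5 and 2.3
    if (major, minor) < (6, 3):
        return "bulk"   # Elasticsearch 5.x, 6.0, and 6.2
    return "write"      # Elasticsearch 6.3 and later


def queue_metric(version: str) -> str:
    """Build the queue metric name (assumed pattern: Threadpool<Pool>Queue)."""
    return f"Threadpool{write_thread_pool(version).capitalize()}Queue"
```

For example, `queue_metric("6.8")` yields `ThreadpoolWriteQueue` under these assumptions.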

To monitor the health of your Elasticsearch cluster, set the recommended Amazon CloudWatch alarms and the following alarms:

MasterReachableFromNode:
Statistic = Maximum
Value = '=0'
Frequency = 1 period
Period = 1 minute
Issue: The leader (master) node is down or unreachable

KibanaHealthyNodes:
Statistic = Average
Value = '=0'
Frequency = 1 period
Period = 1 minute
Issue: Kibana is unhealthy

DiskQueueDepth:
Statistic = Average
Value = '>=100'
Frequency = 1 period
Period = 5 minutes
Issue: Disk queue depth is the number of I/O requests queued against the storage at a given time. A high value can indicate a surge in requests or Amazon EBS throttling, either of which increases latency.

ThreadpoolIndexQueue and ThreadpoolSearchQueue:
Statistic = Maximum
Value = '>=20'
Frequency = 1 period
Period = 1 minute
Issue: Requests are queuing up and might be rejected. To verify the request status, check the CPUUtilization metric and the ThreadpoolIndexRejected or ThreadpoolSearchRejected metric.
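The alarm settings above can also be kept as data and translated into the field names that the CloudWatch `PutMetricAlarm` API expects. The following is a sketch under stated assumptions, not an official template: the helper names are hypothetical, "Frequency" is mapped to both `EvaluationPeriods` and `DatapointsToAlarm`, and because CloudWatch has no "equal to" comparison operator, `=0` is expressed as "less than or equal to 0" (these metrics are never negative).

```python
# metric name -> (statistic, value, frequency in periods, period in seconds)
# Note: CloudWatch publishes the master-reachability metric as MasterReachableFromNode.
ALARMS = {
    "MasterReachableFromNode": ("Maximum", "=0", 1, 60),
    "KibanaHealthyNodes": ("Average", "=0", 1, 60),
    "DiskQueueDepth": ("Average", ">=100", 1, 300),
    "ThreadpoolIndexQueue": ("Maximum", ">=20", 1, 60),
    "ThreadpoolSearchQueue": ("Maximum", ">=20", 1, 60),
}

OPERATORS = {
    ">=": "GreaterThanOrEqualToThreshold",
    "<=": "LessThanOrEqualToThreshold",
    ">": "GreaterThanThreshold",
    "<": "LessThanThreshold",
    "=": "LessThanOrEqualToThreshold",  # assumption: only valid for "=0"
}


def parse_condition(value: str):
    """Split a value such as '>=100' into (ComparisonOperator, threshold)."""
    for symbol in (">=", "<=", ">", "<", "="):
        if value.startswith(symbol):
            threshold = float(value[len(symbol):])
            if symbol == "=" and threshold != 0:
                raise ValueError("equality is only supported for a threshold of 0")
            return OPERATORS[symbol], threshold
    raise ValueError(f"unrecognized condition: {value!r}")


def alarm_settings(metric_name: str) -> dict:
    """Return the PutMetricAlarm fields for one of the alarms listed above."""
    statistic, value, frequency, period = ALARMS[metric_name]
    operator, threshold = parse_condition(value)
    return {
        "MetricName": metric_name,
        "Statistic": statistic,
        "Period": period,
        "EvaluationPeriods": frequency,
        "DatapointsToAlarm": frequency,
        "ComparisonOperator": operator,
        "Threshold": threshold,
    }
```

Keeping the thresholds in one table makes it easier to review them against this list when you tune an alarm.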

To set up an Amazon CloudWatch alarm for your Elasticsearch cluster, perform the following steps:

1.    Open the CloudWatch console.

2.    Go to the Alarm tab.

3.    Choose Create Alarm.

4.    Choose Select Metric.

5.    Choose ES for your metric.

6.    Select Per-Domain and Per-Client Metrics.

7.    Select a metric and choose Next.

8.    Configure the settings for your CloudWatch alarm, using the values listed above for your metric. For example:

Statistic = Maximum
Period = 1 minute
Threshold type = Static
Alarm condition = Greater than or equal to
Threshold value = 1

9.    Choose the Additional configuration tab.

10.    Update the following configuration settings:

Datapoints to alarm = Frequency stated above
Missing data treatment = Treat missing data as ignore (maintain the alarm state)

11.    Choose Next.

12.    Choose the action you want your alarm to take, and choose Next.

13.    Set a name for your alarm, and then choose Next.

14.    Choose Create Alarm.
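If you prefer to script these steps, the same alarm can be created through the CloudWatch `PutMetricAlarm` API. The sketch below uses the example settings from steps 8 and 10 and assumes that Amazon ES metrics are published in the `AWS/ES` namespace with `DomainName` and `ClientId` (your AWS account ID) dimensions. The function names and the alarm naming scheme are hypothetical. The `cloudwatch` argument is expected to be a client object that exposes `put_metric_alarm`, such as `boto3.client("cloudwatch")`.

```python
def build_alarm_request(domain_name, client_id, metric_name,
                        datapoints_to_alarm=1, alarm_actions=()):
    """Translate the console settings from steps 8 and 10 into API fields."""
    return {
        "AlarmName": f"{domain_name}-{metric_name}",  # hypothetical naming scheme
        "Namespace": "AWS/ES",
        "MetricName": metric_name,
        "Dimensions": [
            {"Name": "DomainName", "Value": domain_name},
            {"Name": "ClientId", "Value": client_id},
        ],
        "Statistic": "Maximum",
        "Period": 60,  # 1 minute, in seconds
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "Threshold": 1.0,
        "EvaluationPeriods": datapoints_to_alarm,
        "DatapointsToAlarm": datapoints_to_alarm,
        "TreatMissingData": "ignore",  # maintain the alarm state
        "AlarmActions": list(alarm_actions),  # for example, an SNS topic ARN
    }


def create_alarm(cloudwatch, **settings):
    """Send the request through a CloudWatch client and return what was sent."""
    request = build_alarm_request(**settings)
    cloudwatch.put_metric_alarm(**request)
    return request
```

With boto3 installed and credentials configured, a call might look like `create_alarm(boto3.client("cloudwatch"), domain_name="my-domain", client_id="123456789012", metric_name="ThreadpoolSearchQueue")`, where the domain name and account ID are placeholders.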

Note: If the alarm is triggered for CPUUtilization or JVMMemoryPressure, check the following metrics to see whether a spike in incoming requests coincides with the alarm:

IndexingRate
SearchRate
ElasticsearchRequests
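To compare these metrics around the time an alarm fired, you can request them for a window centered on the alarm timestamp. The following sketch only builds the arguments for the CloudWatch `GetMetricStatistics` API. The function name, the 30-minute window, and the choice of statistic for each metric are assumptions made for illustration.

```python
from datetime import datetime, timedelta, timezone

# Assumed statistic for each request metric.
REQUEST_METRICS = {
    "IndexingRate": "Average",
    "SearchRate": "Average",
    "ElasticsearchRequests": "Sum",
}


def request_metric_queries(domain_name, client_id, alarm_time, window_minutes=30):
    """Build one GetMetricStatistics request per metric around alarm_time."""
    start = alarm_time - timedelta(minutes=window_minutes)
    end = alarm_time + timedelta(minutes=window_minutes)
    return [
        {
            "Namespace": "AWS/ES",
            "MetricName": metric,
            "Dimensions": [
                {"Name": "DomainName", "Value": domain_name},
                {"Name": "ClientId", "Value": client_id},
            ],
            "StartTime": start,
            "EndTime": end,
            "Period": 60,  # one datapoint per minute
            "Statistics": [statistic],
        }
        for metric, statistic in REQUEST_METRICS.items()
    ]
```

Each dictionary can be passed to a CloudWatch client, for example `cloudwatch.get_metric_statistics(**query)`, and the returned datapoints compared with the CPUUtilization or JVMMemoryPressure graph for the same window.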