How do I use CloudWatch alarms to monitor my Amazon Elasticsearch Service cluster?

Last updated: August 21, 2020

I want to monitor my Amazon Elasticsearch Service (Amazon ES) cluster for stability issues. How can I monitor my Elasticsearch cluster effectively?

Resolution

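Important: Different versions of Elasticsearch use different thread pools to process calls to the _index API. Elasticsearch 1.5 and 2.3 use the index thread pool. Elasticsearch 5.x, 6.0, and 6.2 use the bulk thread pool. Elasticsearch 6.3 and later use the write thread pool. Currently, the Amazon ES console doesn't include a graph for the bulk thread pool.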
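The version-to-thread-pool mapping in the note can be captured in a small helper when deciding which queue to watch. This is a minimal sketch: the function name thread_pool_for_version is hypothetical, and versions that the note doesn't list explicitly are assigned by range, which is an assumption.

```python
def thread_pool_for_version(version: str) -> str:
    """Return the thread pool that handles _index API calls for a version.

    Assumption: versions not listed in the note are assigned by range
    (below 5.0 -> index, 5.0 up to 6.2 -> bulk, 6.3 and later -> write).
    """
    parts = version.split(".")
    major = int(parts[0])
    # A missing or wildcard minor version ("7", "5.x") is treated as 0,
    # so an ambiguous "6.x" resolves to 6.0 and therefore to "bulk".
    minor = int(parts[1]) if len(parts) > 1 and parts[1].isdigit() else 0

    if (major, minor) >= (6, 3):
        return "write"
    if major >= 5:
        return "bulk"
    return "index"
```

For example, thread_pool_for_version("6.2") returns "bulk", and thread_pool_for_version("6.3") returns "write".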

To monitor the health of your Elasticsearch cluster, set the recommended Amazon CloudWatch alarms along with the following alarms:

LeaderReachableFromNode:
Statistic = Maximum
Value = ‘=0’
Frequency = 1 period
Period = 1 minute
Issue: Leader node is down

KibanaHealthyNodes:
Statistic = Average
Value = ‘=0’
Frequency = 1 period
Period = 1 minute
Issue: Kibana is unhealthy

DiskQueueDepth:
Statistic = Average
Value = ‘>=100’
Frequency = 1 period
Period = 5 minutes
Issue: Disk Queue Depth is the number of I/O requests that are queued at a time against the storage. This could indicate a surge in requests or Amazon EBS throttling, resulting in increased latency.

ThreadpoolIndexQueue and ThreadpoolSearchQueue:
Statistic = Maximum
Value = ‘>=20’
Frequency = 1 period
Period = 1 minute
Issue: Requests are queuing up and can be rejected once the queue is full. To verify the request status, check CPU utilization and the thread pool index or search rejections.
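The alarm settings listed above can also be written down as data and turned into parameters for the CloudWatch PutMetricAlarm API. This is a sketch under stated assumptions, not a definitive implementation: AlarmSpec, ALARMS, and to_alarm_params are hypothetical names, the metric names are copied from this article and should be checked against the metrics that your domain publishes, and the alarm naming scheme is invented for illustration. CloudWatch static thresholds have no "equal to" operator, so ‘=0’ is expressed as "less than or equal to 0", which is equivalent for metrics that are never negative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlarmSpec:
    metric: str
    statistic: str
    operator: str        # a CloudWatch ComparisonOperator value
    threshold: float
    period_seconds: int
    periods: int = 1     # "Frequency = 1 period" in the list above


# Values taken from the alarm list in this article.
ALARMS = [
    AlarmSpec("LeaderReachableFromNode", "Maximum", "LessThanOrEqualToThreshold", 0, 60),
    AlarmSpec("KibanaHealthyNodes", "Average", "LessThanOrEqualToThreshold", 0, 60),
    AlarmSpec("DiskQueueDepth", "Average", "GreaterThanOrEqualToThreshold", 100, 300),
    AlarmSpec("ThreadpoolIndexQueue", "Maximum", "GreaterThanOrEqualToThreshold", 20, 60),
    AlarmSpec("ThreadpoolSearchQueue", "Maximum", "GreaterThanOrEqualToThreshold", 20, 60),
]


def to_alarm_params(spec: AlarmSpec, domain_name: str, client_id: str) -> dict:
    """Build keyword arguments for put_metric_alarm from one AlarmSpec."""
    return {
        # Hypothetical naming scheme: "<domain>-<metric>".
        "AlarmName": f"{domain_name}-{spec.metric}",
        "Namespace": "AWS/ES",
        "MetricName": spec.metric,
        # Amazon ES metrics are published per domain and per client (account ID).
        "Dimensions": [
            {"Name": "DomainName", "Value": domain_name},
            {"Name": "ClientId", "Value": client_id},
        ],
        "Statistic": spec.statistic,
        "Period": spec.period_seconds,
        "EvaluationPeriods": spec.periods,
        "DatapointsToAlarm": spec.periods,
        "Threshold": float(spec.threshold),
        "ComparisonOperator": spec.operator,
    }
```

Keeping the thresholds in one list makes it easier to review them against the values above before any alarm is created.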

To set up an Amazon CloudWatch alarm for your Elasticsearch cluster, perform the following steps:

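1.    Open the CloudWatch console.

2.    Go to the Alarms tab.

3.    Choose Create Alarm.

4.    Choose Select metric.

5.    Choose ES for your metric.

6.    Choose Per-Domain, Per-Client Metrics.

7.    Select a metric, and then choose Next.

8.    Configure the following settings for your CloudWatch alarm: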

Statistic = Maximum
Period = 1 minute
Threshold type = Static
Alarm condition = Greater than or equal to
Threshold value = 1

9.    Choose the Additional configuration tab.

10.    Update the following configuration settings:

Datapoints to alarm = the Frequency value listed for the metric above
Missing data treatment = Treat missing data as ignore (maintain the alarm state)

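11.    Choose Next.

12.    Choose an action that you want your alarm to take, and then choose Next.

13.    Set a name for your alarm, and then choose Next.

14.    Choose Create Alarm.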
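The same alarm can be created programmatically instead of through the console. The following is a minimal sketch, assuming the AWS SDK for Python (boto3) and its CloudWatch put_metric_alarm call. The function create_alarm, the alarm name format, and the example domain name and account ID are hypothetical. The client is passed in as an argument so that the sketch doesn't depend on credentials being configured.

```python
def create_alarm(cloudwatch, domain_name, client_id, metric_name,
                 datapoints_to_alarm=1, alarm_actions=()):
    """Create one alarm with the settings from steps 8 and 10.

    `cloudwatch` is expected to be a boto3 CloudWatch client, for example
    boto3.client("cloudwatch"). Returns the parameters that were sent.
    """
    params = {
        # Hypothetical naming scheme: "<domain>-<metric>".
        "AlarmName": f"{domain_name}-{metric_name}",
        "Namespace": "AWS/ES",
        "MetricName": metric_name,
        "Dimensions": [
            {"Name": "DomainName", "Value": domain_name},
            {"Name": "ClientId", "Value": client_id},
        ],
        # Step 8: Maximum over 1 minute, static threshold, >= 1.
        "Statistic": "Maximum",
        "Period": 60,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "Threshold": 1.0,
        # Step 10: datapoints to alarm and missing data treatment.
        "EvaluationPeriods": datapoints_to_alarm,
        "DatapointsToAlarm": datapoints_to_alarm,
        "TreatMissingData": "ignore",
        # Step 12: actions such as an Amazon SNS topic ARN (may be empty).
        "AlarmActions": list(alarm_actions),
    }
    cloudwatch.put_metric_alarm(**params)
    return params


# Example call (hypothetical values, requires boto3 and AWS credentials):
#   import boto3
#   create_alarm(boto3.client("cloudwatch"), "my-domain",
#                "123456789012", "ThreadpoolSearchQueue")
```

The threshold and comparison operator here mirror step 8. For the metrics listed earlier, substitute the statistic, value, and period given for each metric.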

Note: If a CPUUtilization or JVMMemoryPressure alarm is triggered, check the following metrics to determine whether there is a spike in incoming requests:

IndexingRate

SearchRate

ElasticsearchRequests
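One way to check these metrics for a spike is to pull recent datapoints and compare the latest value with the earlier ones. This is a rough sketch, assuming the boto3 CloudWatch get_metric_statistics call. The helper names recent_values and has_spike are hypothetical, and the spike rule (latest value more than twice the mean of the earlier values) is an illustrative assumption, not a threshold from this article.

```python
from datetime import datetime, timedelta, timezone
from statistics import mean

REQUEST_METRICS = ("IndexingRate", "SearchRate", "ElasticsearchRequests")


def recent_values(cloudwatch, domain_name, client_id, metric_name,
                  minutes=60, statistic="Average"):
    """Return per-minute values for one metric, oldest first.

    `cloudwatch` is expected to be a boto3 CloudWatch client.
    """
    end = datetime.now(timezone.utc)
    response = cloudwatch.get_metric_statistics(
        Namespace="AWS/ES",
        MetricName=metric_name,
        Dimensions=[
            {"Name": "DomainName", "Value": domain_name},
            {"Name": "ClientId", "Value": client_id},
        ],
        StartTime=end - timedelta(minutes=minutes),
        EndTime=end,
        Period=60,
        Statistics=[statistic],
    )
    # Datapoints are not guaranteed to arrive in time order, so sort them.
    points = sorted(response["Datapoints"], key=lambda p: p["Timestamp"])
    return [p[statistic] for p in points]


def has_spike(values, factor=2.0):
    """True if the latest value exceeds `factor` times the mean of earlier values."""
    if len(values) < 2:
        return False
    return values[-1] > factor * mean(values[:-1])
```

Calling recent_values for each name in REQUEST_METRICS and passing the result to has_spike gives a quick indication of whether request volume rose around the time the alarm fired.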

