How do I make my Amazon Elasticsearch Service domain more fault tolerant?
Last updated: 2020-01-09
How can I protect Amazon Elasticsearch Service (Amazon ES) resources against accidental deletion, application or hardware failures, or outages?
To improve the fault tolerance of an Amazon ES domain:
- Take regular index snapshots.
- Use Amazon CloudWatch metrics to monitor Amazon ES resources.
- Understand Amazon ES service limits.
- Use dedicated master nodes.
- Use at least three nodes.
- Enable zone awareness.
- Don't use T2 instances in production environments.
Take regular index snapshots
All Amazon ES domains take automated snapshots. Take manual index snapshots to create point-in-time backups of the data in an Amazon ES domain. Store the snapshots in an Amazon Simple Storage Service (Amazon S3) bucket. You can also use manual index snapshots to migrate data between Amazon ES domains and to restore data to another Amazon ES domain.
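As a rough illustration of the manual snapshot flow, the sketch below builds the two Elasticsearch snapshot API requests involved: one to register an S3 repository and one to take a snapshot. The repository name, snapshot name, bucket, Region, and IAM role ARN are hypothetical placeholders. The sketch only constructs the requests; actually sending them to an Amazon ES endpoint requires signed requests and suitable IAM permissions, which are not shown here.

```python
import json


def snapshot_requests(repository, snapshot, bucket, region, role_arn):
    """Return (method, path, body) tuples for a manual snapshot.

    The first request registers an S3 snapshot repository; the second
    takes a snapshot into that repository.
    """
    register_body = {
        "type": "s3",
        "settings": {"bucket": bucket, "region": region, "role_arn": role_arn},
    }
    return [
        ("PUT", f"/_snapshot/{repository}", json.dumps(register_body)),
        ("PUT", f"/_snapshot/{repository}/{snapshot}", None),
    ]


# All names below are placeholders (assumptions), not values from this article.
for method, path, body in snapshot_requests(
    "my-manual-repo",
    "snapshot-2020-01-09",
    "my-snapshot-bucket",
    "us-west-2",
    "arn:aws:iam::123456789012:role/MySnapshotRole",
):
    print(method, path, body or "")
```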
Monitor CloudWatch metrics
- Use the Cluster health and Instance health tabs in the Amazon ES console to monitor CloudWatch metrics about your Elasticsearch clusters.
- Create CloudWatch alarms for important Amazon ES metrics. For example, monitor the AutomatedSnapshotFailure metric to confirm that automated snapshots are happening at regular intervals. For a tutorial, see Get Started with Amazon Elasticsearch Service: Set CloudWatch Alarms on Key Metrics.
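One possible way to define the AutomatedSnapshotFailure alarm mentioned above is sketched here as a set of parameters for the CloudWatch PutMetricAlarm API. The threshold, period, alarm name, and SNS topic are assumptions chosen for illustration, not values prescribed by this article. The boto3 call is isolated in a separate function so that the parameters can be inspected without AWS credentials.

```python
def snapshot_failure_alarm(domain_name, account_id, sns_topic_arn):
    """Build PutMetricAlarm parameters for failed automated snapshots."""
    return {
        "AlarmName": f"{domain_name}-AutomatedSnapshotFailure",
        "Namespace": "AWS/ES",
        "MetricName": "AutomatedSnapshotFailure",
        "Dimensions": [
            {"Name": "DomainName", "Value": domain_name},
            {"Name": "ClientId", "Value": account_id},
        ],
        # Assumed policy: alarm when any failure is reported in a 1-minute period.
        "Statistic": "Maximum",
        "Period": 60,
        "EvaluationPeriods": 1,
        "Threshold": 1.0,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "AlarmActions": [sns_topic_arn],
    }


def create_alarm(params):
    """Send the alarm definition to CloudWatch (requires boto3 and credentials)."""
    import boto3  # imported lazily so the builder above works without boto3

    boto3.client("cloudwatch").put_metric_alarm(**params)


# Placeholder domain, account ID, and topic ARN (assumptions).
params = snapshot_failure_alarm(
    "my-domain", "123456789012", "arn:aws:sns:us-west-2:123456789012:my-topic"
)
print(params["AlarmName"])
```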
Use dedicated master nodes
Dedicated master nodes help prevent problems that are caused by overloaded nodes. Use dedicated master nodes when:
- Your domain is used in production environments.
- Your domain has five or more nodes.
- Your index mapping is complex, with many fields defined across types and indices.
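The criteria above can be expressed as a small check. This sketch assumes that meeting any one of the three conditions is enough to recommend dedicated master nodes, and that whether a mapping counts as "complex" is a judgment the caller makes. The function name is hypothetical.

```python
def should_use_dedicated_masters(is_production, node_count, has_complex_mapping):
    """Return True if any of the listed conditions applies (any-of is an assumption)."""
    return is_production or node_count >= 5 or has_complex_mapping


# A small test domain with a simple mapping does not meet any condition.
print(should_use_dedicated_masters(False, 2, False))
# A five-node domain meets the node-count condition.
print(should_use_dedicated_masters(False, 5, False))
```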
Use at least three nodes
To reduce the risk of split brain, where a network partition leaves the cluster with more than one master node, use at least three nodes. To avoid potential data loss, make sure that each index has at least one replica. (By default, each index has one replica.)
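The three-node recommendation can be illustrated with majority-quorum arithmetic. The sketch assumes the common rule that a master can be elected only when more than half of the master-eligible nodes can see each other; how Amazon ES applies this internally is not described in this article.

```python
def quorum(master_eligible_nodes):
    """Smallest number of nodes that forms a strict majority."""
    return master_eligible_nodes // 2 + 1


def tolerated_failures(master_eligible_nodes):
    """Number of nodes that can be lost while a majority is still reachable."""
    return master_eligible_nodes - quorum(master_eligible_nodes)


for n in (1, 2, 3, 5):
    print(f"nodes={n} quorum={quorum(n)} tolerated_failures={tolerated_failures(n)}")
```

Under this rule, a two-node cluster needs both nodes to form a majority, so it tolerates no failures, while a three-node cluster tolerates the loss of one node.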
Enable zone awareness
Zone awareness helps prevent downtime and data loss. When zone awareness is enabled, Amazon ES allocates the nodes and replica index shards that belong to a cluster across two Availability Zones in the same Region.
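Zone awareness can also be turned on programmatically. The following is a minimal sketch, assuming the boto3 `es` client and its `update_elasticsearch_domain_config` operation; the domain name, instance type, and instance count are placeholders. The cluster configuration is built separately from the API call so that it can be reviewed before anything is changed.

```python
def zone_aware_cluster_config(instance_type, instance_count):
    """Build an ElasticsearchClusterConfig fragment with zone awareness enabled."""
    return {
        "InstanceType": instance_type,
        "InstanceCount": instance_count,
        "ZoneAwarenessEnabled": True,
    }


def apply_config(domain_name, cluster_config):
    """Apply the configuration to a domain (requires boto3 and credentials)."""
    import boto3  # imported lazily so the builder above works without boto3

    boto3.client("es").update_elasticsearch_domain_config(
        DomainName=domain_name,
        ElasticsearchClusterConfig=cluster_config,
    )


# Placeholder instance type and count (assumptions).
config = zone_aware_cluster_config("m5.large.elasticsearch", 4)
print(config)
```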
Don't use T2 instances in production environments
For production environments, use M-class or larger Amazon Elastic Compute Cloud (Amazon EC2) instances. If you decide to use T2 instance types, be sure to closely monitor the CPU credits, CPU usage, memory usage, and stability of your instances. Scale up or out when necessary.