How do I make my Amazon OpenSearch Service domain more fault tolerant?


I want to protect Amazon OpenSearch Service resources against accidental deletion, application or hardware failures, or outages. What are some best practices for improving fault tolerance or restoring snapshots?

Short description

To improve the fault tolerance of an OpenSearch Service domain, consider the following best practices:

Take regular index snapshots.
Use Amazon CloudWatch metrics to monitor OpenSearch Service resources.
Understand OpenSearch Service limits.
Use dedicated master nodes.
Use at least three nodes.
Enable zone awareness.
Don't use T2 instances in production environments.

Resolution

Take regular index snapshots

All OpenSearch Service domains take automated snapshots. Take manual index snapshots to create point-in-time backups of the data in an OpenSearch Service domain. Store the snapshots in an Amazon Simple Storage Service (Amazon S3) bucket. You can also use manual index snapshots to migrate data between OpenSearch Service domains or to restore data to another OpenSearch Service domain.
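As a rough illustration of the manual snapshot workflow, the following Python sketch builds the request paths and the repository registration body for the `_snapshot` REST API. The bucket name, Region, role ARN, repository name, and snapshot name are hypothetical placeholders. The sketch only constructs the requests; sending them to the domain endpoint (typically as signed requests) is left out.

```python
import json


def snapshot_repository_body(bucket, region, role_arn):
    """Build the JSON body that registers an S3 bucket as a snapshot repository."""
    return {
        "type": "s3",
        "settings": {"bucket": bucket, "region": region, "role_arn": role_arn},
    }


def snapshot_paths(repository, snapshot):
    """Return the REST paths used to register, take, and restore a manual snapshot."""
    return {
        "register": f"/_snapshot/{repository}",                    # PUT with the body above
        "take": f"/_snapshot/{repository}/{snapshot}",             # PUT
        "restore": f"/_snapshot/{repository}/{snapshot}/_restore",  # POST
    }


# Hypothetical values, for illustration only.
body = snapshot_repository_body(
    "my-snapshot-bucket",
    "us-east-1",
    "arn:aws:iam::123456789012:role/SnapshotRole",
)
paths = snapshot_paths("my-manual-repo", "snapshot-2024-01-01")

print("PUT", paths["register"])
print(json.dumps(body, indent=2))
print("PUT", paths["take"])
print("POST", paths["restore"])
```

The restore path can be called against a different domain, provided that the same repository is registered there.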

Monitor Amazon CloudWatch metrics

Use the Cluster health and Instance health tabs in the OpenSearch Service console to monitor the Amazon CloudWatch metrics for your clusters.
Create Amazon CloudWatch alarms for important OpenSearch Service metrics. For example, monitor the AutomatedSnapshotFailure metric to confirm that automated snapshots are happening at regular intervals. For a tutorial, see Get started with OpenSearch Service: Set CloudWatch alarms on key metrics.
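The alarm on AutomatedSnapshotFailure might look like the following Python sketch, which only builds the parameters for the CloudWatch `put_metric_alarm` call. The domain name, account ID, and Amazon SNS topic ARN are hypothetical, and the threshold and period are assumptions (alarm when the metric reaches 1 in any one-minute period) rather than values stated in this article.

```python
def automated_snapshot_failure_alarm(domain_name, account_id, sns_topic_arn):
    """Build put_metric_alarm parameters for the AutomatedSnapshotFailure metric."""
    return {
        "AlarmName": f"{domain_name}-automated-snapshot-failure",
        "Namespace": "AWS/ES",  # namespace used by OpenSearch Service metrics
        "MetricName": "AutomatedSnapshotFailure",
        "Dimensions": [
            {"Name": "DomainName", "Value": domain_name},
            {"Name": "ClientId", "Value": account_id},
        ],
        "Statistic": "Maximum",
        "Period": 60,            # seconds (assumed)
        "EvaluationPeriods": 1,  # assumed
        "Threshold": 1.0,        # a value of 1 indicates a failed snapshot
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "AlarmActions": [sns_topic_arn],
    }


# Hypothetical identifiers, for illustration only.
params = automated_snapshot_failure_alarm(
    "my-domain",
    "123456789012",
    "arn:aws:sns:us-east-1:123456789012:opensearch-alerts",
)
print(params["AlarmName"])

# To create the alarm (requires boto3 and AWS credentials):
#   import boto3
#   boto3.client("cloudwatch").put_metric_alarm(**params)
```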

Use dedicated master nodes

Dedicated master nodes help prevent problems that are caused by overloaded nodes. Use dedicated master nodes when:

Your domain is used in production environments.
Your domain has five or more nodes.
Your index mapping is complex, with many fields defined across types and indices.
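As a minimal sketch, dedicated master nodes are turned on through the domain's cluster configuration. The helper below adds the relevant `ClusterConfig` keys to an existing configuration dictionary. The instance types and the count of three dedicated master nodes are assumptions chosen for illustration, not values taken from this article.

```python
def with_dedicated_masters(cluster_config, master_type="m5.large.search", master_count=3):
    """Return a copy of cluster_config with dedicated master nodes enabled."""
    updated = dict(cluster_config)  # leave the caller's dictionary untouched
    updated.update(
        {
            "DedicatedMasterEnabled": True,
            "DedicatedMasterType": master_type,    # assumed instance type
            "DedicatedMasterCount": master_count,  # assumed count
        }
    )
    return updated


# Hypothetical data-node configuration, for illustration only.
base_config = {"InstanceType": "r5.large.search", "InstanceCount": 6}
cluster_config = with_dedicated_masters(base_config)
print(cluster_config)

# To apply the change (requires boto3 and AWS credentials):
#   import boto3
#   boto3.client("opensearch").update_domain_config(
#       DomainName="my-domain", ClusterConfig=cluster_config
#   )
```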

Use at least three nodes

To avoid an unintentionally partitioned network (split brain), use at least three nodes. To avoid potential data loss, be sure that you have at least one replica for each index. (By default, each index has one replica.)
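To confirm or change the replica count for an index, you can send an index settings update. The following Python sketch builds the request path and body. The index name is hypothetical, and the check that rejects a replica count below one is this sketch's own guard, reflecting the guidance above.

```python
import json


def replica_settings(number_of_replicas=1):
    """Build the body for an index settings update that sets the replica count."""
    if number_of_replicas < 1:
        # Zero replicas would leave a single copy of each shard.
        raise ValueError("use at least one replica to avoid potential data loss")
    return {"index": {"number_of_replicas": number_of_replicas}}


index_name = "my-index"  # hypothetical index name
path = f"/{index_name}/_settings"
body = replica_settings(1)

print("PUT", path)
print(json.dumps(body))
```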

Enable zone awareness

Zone awareness helps prevent downtime and data loss. When zone awareness is enabled, OpenSearch Service allocates the nodes and replica index shards across two or three Availability Zones in the same AWS Region.

Note: For a setup of three Availability Zones, use two replicas of your index. If there is a single zone failure, the two replicas afford 100% data redundancy.
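A zone-aware configuration could be sketched as follows. The helper builds the `ClusterConfig` keys for zone awareness, and a second helper picks a replica count. The instance type is hypothetical, and the rule "replicas = Availability Zones minus one" is an assumption that generalizes the note above (two replicas for three Availability Zones).

```python
def zone_aware_cluster_config(instance_type, instance_count, az_count):
    """Build ClusterConfig keys that spread data nodes across Availability Zones."""
    if az_count not in (2, 3):
        raise ValueError("zone awareness uses two or three Availability Zones")
    if instance_count < az_count:
        raise ValueError("need at least one data node per Availability Zone")
    return {
        "InstanceType": instance_type,
        "InstanceCount": instance_count,
        "ZoneAwarenessEnabled": True,
        "ZoneAwarenessConfig": {"AvailabilityZoneCount": az_count},
    }


def replicas_for_zones(az_count):
    """Assumed rule: one primary plus one replica per additional zone."""
    return az_count - 1


config = zone_aware_cluster_config("m5.large.search", 3, 3)
settings = {"index": {"number_of_replicas": replicas_for_zones(3)}}
print(config)
print(settings)
```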

Don't use T2 instances in production environments

For production environments, use M-class or larger Amazon Elastic Compute Cloud (Amazon EC2) instances. If you use T2 instance types, be sure to monitor the CPU credits, CPU usage, memory usage, and stability of your instances. Scale up or out when necessary.

Additionally, note the following limitations for T2 instances:

T2 instances are assigned CPU credits. If there is a spike in network traffic, your OpenSearch Service cluster could use up the CPU credits available to your T2 instances, after which performance drops to the baseline level. For more information, see CPU credits and baseline utilization for burstable performance instances.
T2 instances have an EBS volume limit of 35 GB.
T2 instances have a payload limit of 10 MB. Make sure that your request payload doesn't exceed the payload limit. For more information about OpenSearch Service network limits, see Network limits.
T2 instance types can be used only if your OpenSearch Service instance count is ten or fewer. For more information about the supported OpenSearch Service instance types, see Supported instance types.
It's a best practice not to use T2 instance types as data nodes or dedicated master nodes in production, because T2 instance types can become unstable under sustained heavy load. For more information, see OpenSearch Service best practices.
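The limits listed above can be expressed as a simple pre-flight check. The following Python sketch is illustrative only: the function name and inputs are hypothetical, the limits are the ones stated in this article, and 10 MB is interpreted here as 10 * 1024 * 1024 bytes, which is an assumption.

```python
T2_MAX_EBS_GB = 35
T2_MAX_PAYLOAD_BYTES = 10 * 1024 * 1024  # assumed interpretation of 10 MB
T2_MAX_INSTANCE_COUNT = 10


def t2_limit_violations(instance_type, instance_count, ebs_volume_gb, payload_bytes):
    """Return a list of the T2 limits that a planned setup would exceed."""
    if not instance_type.startswith("t2."):
        return []  # the limits below apply only to T2 instance types
    violations = []
    if instance_count > T2_MAX_INSTANCE_COUNT:
        violations.append(f"instance count {instance_count} exceeds {T2_MAX_INSTANCE_COUNT}")
    if ebs_volume_gb > T2_MAX_EBS_GB:
        violations.append(f"EBS volume of {ebs_volume_gb} GB exceeds {T2_MAX_EBS_GB} GB")
    if payload_bytes > T2_MAX_PAYLOAD_BYTES:
        violations.append("request payload exceeds 10 MB")
    return violations


# Hypothetical setup: 12 nodes, 50 GB volumes, 11 MB bulk requests.
problems = t2_limit_violations("t2.medium.search", 12, 50, 11 * 1024 * 1024)
for problem in problems:
    print(problem)
```

A check like this does not cover CPU credit usage, which still has to be monitored at run time.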

Related information

Get started with Amazon OpenSearch Service: How many data instances do I need?

Creating and managing Amazon OpenSearch Service domains
