Reducing cost for small Amazon Elasticsearch Service domains

September 8, 2021: Amazon Elasticsearch Service has been renamed to Amazon OpenSearch Service. See details.

When you deploy your Amazon Elasticsearch Service (Amazon ES) domain to support a production workload, you must choose the type and number of data instances to use, the number of Availability Zones, and whether to use dedicated master instances or not. To follow all the best practice recommendations, you must configure the following:

Three dedicated master instances, M5.large
Three-zone replication, with three M5.large data nodes
Using two replicas for your primaries
Storage as needed, maximum 512 GB, GP2 Amazon Elastic Block Store (EBS) volumes for data nodes

This configuration supports up to ~400 GB of source data and hundreds to thousands of requests per second at an on-demand cost of ~$800 per month (US East, Northern Virginia pricing). If you want to reduce this cost, you can reduce to a minimum possibly viable deployment. The minimum possibly viable deployment for production workloads is:

No dedicated master instances
Two-zone replication, with M5.large nodes
Using one replica for your primaries
Storage as needed, maximum 512 GB, GP2 EBS volumes for data nodes

This deployment supports the same 400 GB of source data and the same hundreds to thousands of requests per second at a much lower on-demand cost of $350/month. That’s a 56% reduction in cost. If you deploy EBS volumes smaller than the 512 GB maximum, you proportionally reduce the Amazon EBS cost component of $207 per month.
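As a quick check on these figures, the following minimal sketch computes the percentage saved when moving from the ~$800 deployment to the $350 deployment, and scales the EBS cost component by volume size. The function names and the 256 GB example volume are illustrative assumptions; the dollar amounts are the rounded figures quoted in this post, not live pricing.

```python
def percent_reduction(baseline: float, reduced: float) -> float:
    """Return the percentage saved when cost drops from baseline to reduced."""
    return (baseline - reduced) / baseline * 100


def scaled_ebs_cost(volume_gb: float, max_volume_gb: float = 512,
                    max_ebs_cost: float = 207.0) -> float:
    """Scale the EBS cost component linearly with the chosen volume size.

    Assumes the $207/month figure from this post corresponds to the
    maximum 512 GB volumes and that EBS cost is proportional to size.
    """
    return max_ebs_cost * volume_gb / max_volume_gb


on_demand_reduction = percent_reduction(800, 350)   # 56.25
half_size_ebs = scaled_ebs_cost(256)                # 103.5
```

The result is roughly 56%, using the rounded monthly costs above.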

AWS does not recommend T2 instances for any Amazon ES production workload.

This post discusses how you should think about ROI and how to mitigate the potentially lower availability of your domain.

Reserved instances

The best choice for cost reduction, while following all best practices, is to purchase reserved instances for your Amazon ES domain. With a 1-year commitment and no up-front cost, you reduce the cost of the best practices domain to $630/month, a 21% reduction in cost. With a 3-year, up-front commitment, you pay $10,746 up front and $207/month for an effective monthly rate of $505/month, or a 37% cost reduction.
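The effective monthly rate for an up-front commitment is the up-front payment spread over the term, plus the recurring monthly charge. The sketch below reproduces the numbers in this section under that assumption; the helper names are hypothetical, and the ~$800 on-demand baseline is the rounded figure used earlier in this post.

```python
ON_DEMAND_MONTHLY = 800.0  # rounded on-demand cost of the best practices domain


def effective_monthly_rate(upfront: float, monthly: float, term_months: int) -> float:
    """Amortize the up-front payment over the term and add the monthly charge."""
    return upfront / term_months + monthly


def savings_percent(monthly_rate: float, baseline: float = ON_DEMAND_MONTHLY) -> float:
    """Percentage saved relative to the on-demand baseline."""
    return (baseline - monthly_rate) / baseline * 100


one_year_rate = 630.0                                        # no up-front payment
three_year_rate = effective_monthly_rate(10_746, 207, 36)    # 505.5

one_year_savings = savings_percent(one_year_rate)        # about 21%
three_year_savings = savings_percent(three_year_rate)    # about 37%
```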

For more information about pricing scenarios, see the AWS Pricing Calculator.

Best practices and availability

The following are best practices on sizing, dedicated master instances, and Multi-AZ deployments:

Set your shard count so that primary shards are under 50 GB for log analytics workloads or under 30 GB for search workloads (always test to determine actual best shard sizes for maximum throughput and minimum errors)
Choose an instance type so that vCPUs = 1.5 * active shards
Deploy to three Availability Zones
Use two replicas for your index
Use a minimum of three data nodes and use three dedicated master nodes
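The first two sizing rules above can be sketched as simple arithmetic. This is a rough starting point rather than a sizing tool: the function names are illustrative, the index size and active shard count are inputs you supply, and you should still test to find the shard sizes that work best for your workload.

```python
import math


def primary_shard_count(index_size_gb: float, max_shard_gb: float) -> int:
    """Smallest number of primaries so that no shard exceeds max_shard_gb."""
    return max(1, math.ceil(index_size_gb / max_shard_gb))


def vcpus_needed(active_shards: int) -> int:
    """Apply the vCPUs = 1.5 * active shards guideline, rounded up."""
    return math.ceil(1.5 * active_shards)


log_shards = primary_shard_count(400, 50)      # log analytics target: 50 GB
search_shards = primary_shard_count(400, 30)   # search target: 30 GB
log_vcpus = vcpus_needed(log_shards)
```

For a 400 GB index, this gives 8 primary shards at the log analytics target and 14 at the search target.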

Using these best practices gives you the most available deployment. For more information, see Get started with Amazon Elasticsearch Service: T-shirt-size your domain, Get Started with Amazon Elasticsearch Service: Use Dedicated Master Instances to Improve Cluster Stability, and Increase availability for Amazon Elasticsearch Service by deploying in three Availability Zones.

The recommendations in the following sections expose you to the additional potential risk of data loss and cluster unavailability. You should weigh the ROI against the cost of downtime and recovery.

Reducing replicas

Each replica of your index adds storage equal to the storage in the primary shards. If your primary shards for an index hold 1 TB of data, the first replica doubles the storage need to 2 TB. The second replica triples it to 3 TB. You can reduce the replica count to 1 to drop your minimum data node and storage needs, and see a 33% reduction in storage cost.
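The storage arithmetic in this paragraph follows from one relationship: total storage is the primary storage multiplied by one plus the replica count. A minimal sketch, with illustrative function names:

```python
def total_storage_tb(primary_tb: float, replicas: int) -> float:
    """Each replica adds a full copy of the primary shards."""
    return primary_tb * (1 + replicas)


def storage_reduction_percent(primary_tb: float, from_replicas: int,
                              to_replicas: int) -> float:
    """Percentage of storage saved by lowering the replica count."""
    before = total_storage_tb(primary_tb, from_replicas)
    after = total_storage_tb(primary_tb, to_replicas)
    return (before - after) / before * 100


two_replicas = total_storage_tb(1, 2)             # 3 TB
one_replica = total_storage_tb(1, 1)              # 2 TB
saved = storage_reduction_percent(1, 2, 1)        # about 33%
```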

If you reduce to one replica and a single data node becomes unavailable, you retain a copy of the data in the cluster, and Elasticsearch can recover. Elasticsearch deploys the replica shards to data nodes different from their primaries. If a data node becomes unavailable, the replicas make sure that the data is not lost from the cluster. Reducing to one replica means you also reduce the minimum data nodes needed to two (one for the primary, one for the replica).

If you reduce to one replica and more than one data node in your cluster becomes unavailable, you risk losing both the primary and replica copies of shards that you deploy to those instances. If that happens, your cluster state is red, and you must reload that data to recover the indexes.

Reducing Availability Zones

With a three-zone deployment, three data nodes, and two replicas, you are protecting your data if one or two Availability Zones become unavailable. When you reduce your deployment to two Availability Zones, you can reduce the minimum data node count to 2 and replicas to 1. In this setup, you risk data loss if more than one Availability Zone becomes unavailable.

This recommendation is applicable for small domains, where you can reduce the node count below three. For larger workloads, it can make sense to run three zones with one replica. You benefit when your minimum instance count is already three and your total shard count is higher than three. Amazon ES distributes nodes and shards across the three zones. Each zone has 1/3 the nodes and 1/3 the shards.

At the extreme low end, you could use a single Availability Zone, single data node, and no replicas. However, you risk losing the entire domain if that Availability Zone becomes unavailable, which is not production worthy.

Removing dedicated master nodes

Dedicated master nodes provide additional stability and availability for your Amazon ES domain. They hold the cluster state and broadcast that state to all the nodes in the cluster. Dedicated master nodes do not process requests themselves.

Dedicated masters are single-threaded; one instance from the eligible instances is elected the single cluster master. For master election to take place, the cluster must have a quorum—half of the eligible instances, rounded down, plus one. If your cluster has three eligible instances, the quorum is two. If your cluster has two eligible instances, the quorum is also two.
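The quorum rule described above (half of the eligible instances, rounded down, plus one) translates directly into integer division. The function name below is an illustrative assumption:

```python
def quorum(eligible_masters: int) -> int:
    """Half of the master-eligible instances, rounded down, plus one."""
    return eligible_masters // 2 + 1


quorum_for_three = quorum(3)   # 2
quorum_for_two = quorum(2)     # 2
```

Because two and three eligible instances both need a quorum of two, a two-instance cluster cannot lose any eligible instance and still hold an election.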

If you forgo dedicated master nodes, your data nodes are the eligible master nodes. If you lose a single node from a two-data-node cluster, that cluster stops accepting writes because there isn’t quorum (two) to elect a master. You are more likely to survive node loss when there are more than two master-eligible nodes.

Because your data nodes are usually loaded to handle indexing and search requests, they are not the best choice as master nodes. Their availability is highly subject to the demands of your workload. If you do not use dedicated master nodes, you increase the risk that you cannot write to your cluster or that your cluster otherwise degrades.

Dedicated master loss in Elasticsearch 6.x versus 7.x

Elasticsearch master nodes behave differently in Elasticsearch versions 7 and above. In Elasticsearch versions 6 and lower, you need a quorum for the cluster to continue functioning. With two dedicated master nodes in versions 6 and lower, losing either one breaks the quorum, so the cluster is write-blocked 100% of the time. In versions 7 and above, the elected cluster master does not require a quorum to continue, though you still need a quorum to elect a new master. With two dedicated master nodes in versions 7 and above, if you lose the cluster master, the cluster is write-blocked; if you lose the other master, it is not. This is a 50% chance of unavailability.
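The 100% and 50% figures can be reproduced with a toy model of the behavior described in this section. It is a simplification for illustration only, not how Elasticsearch is implemented: it assumes exactly one master-eligible node is lost, that each node is equally likely to be the one lost, and that quorum is measured against the originally configured count. All names are hypothetical.

```python
def quorum(eligible_masters: int) -> int:
    """Half of the master-eligible instances, rounded down, plus one."""
    return eligible_masters // 2 + 1


def write_blocked(major_version: int, eligible_masters: int,
                  lost_elected_master: bool) -> bool:
    """Model whether writes are blocked after losing one master-eligible node."""
    remaining = eligible_masters - 1
    has_quorum = remaining >= quorum(eligible_masters)
    if major_version <= 6:
        # Versions 6 and lower need a quorum to keep functioning.
        return not has_quorum
    # Versions 7 and above need a quorum only to elect a new master.
    return lost_elected_master and not has_quorum


def block_probability(major_version: int, eligible_masters: int) -> float:
    """Fraction of single-node-loss scenarios that block writes."""
    outcomes = [
        write_blocked(major_version, eligible_masters, lost_elected_master=(i == 0))
        for i in range(eligible_masters)
    ]
    return sum(outcomes) / eligible_masters


v6_two_masters = block_probability(6, 2)   # 1.0
v7_two_masters = block_probability(7, 2)   # 0.5
```

With three master-eligible nodes, the same model gives 0.0 for both version ranges, because two remaining nodes still form a quorum.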

Mitigating risks

You can reduce costs at the increased risk of data loss or cluster unavailability. You should carefully consider these risks and take steps to minimize recovery time where you can.

Before adopting any of these measures, evaluate what happens if your cluster is not available. If you cannot tolerate being unable to query Amazon ES for an hour or more, don’t adopt these cost reduction suggestions.

On the ingest side, you should have a plan to recreate a new cluster from scratch as quickly as possible. Because this use case concerns smaller datasets and clusters, it is feasible to reload your source data fairly rapidly. However, to reload source data, you have to store your source data somewhere other than Amazon ES. You can maintain all source data in an Amazon S3 bucket as Elasticsearch _bulk requests. To reload, you can have a script prepared to walk that bucket and load in your data.
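A reload script along these lines might look like the following sketch. It assumes each object in the bucket holds one complete _bulk request body, that s3_client is a boto3 S3 client (for example, boto3.client("s3")), and that send_bulk is a hypothetical callable you provide that POSTs a body to your domain's _bulk endpoint with whatever authentication your domain requires. Error handling, retries, and checking the _bulk response for per-item failures are omitted.

```python
from typing import Callable, Iterator


def iter_bulk_bodies(s3_client, bucket: str, prefix: str = "") -> Iterator[bytes]:
    """Yield the raw contents of every object under the given prefix."""
    paginator = s3_client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # Pages for an empty listing have no "Contents" key.
        for obj in page.get("Contents", []):
            response = s3_client.get_object(Bucket=bucket, Key=obj["Key"])
            yield response["Body"].read()


def reload_domain(s3_client, send_bulk: Callable[[bytes], None],
                  bucket: str, prefix: str = "") -> int:
    """Replay every stored _bulk body and return how many were sent."""
    sent = 0
    for body in iter_bulk_bodies(s3_client, bucket, prefix):
        if not body.endswith(b"\n"):
            body += b"\n"  # a _bulk body must end with a newline
        send_bulk(body)
        sent += 1
    return sent
```

Passing the sender in as a callable keeps the bucket walk independent of how you sign or authenticate requests, and lets you test the walk without a live domain.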

If you don’t have a primary store like Amazon S3, you can rely on self-managed snapshots you take with the _snapshot API. For more information, see Use Amazon S3 to Store a Single Amazon Elasticsearch Service Index. The drawback of snapshots is that you lose data that was loaded after your last viable snapshot. If you don’t want data loss across that period of time, you must plan a way to store or hold updates while you recreate your domain. If your ingest infrastructure is sending batches to Amazon S3, you can load them again. If your ingest infrastructure includes Amazon Kinesis Data Streams or Amazon Managed Streaming for Apache Kafka (Amazon MSK), your updates are available to re-read and reload.

Automatic, service-managed snapshots are not suitable because you can only load them into the same domain. If your domain is unavailable, you can’t use automatic snapshots.

Conclusion

You can reduce your cost for a small deployment of Amazon ES by taking steps to reduce shard count, reduce instance count, and remove dedicated master instances. Each step expands your risk of domain unavailability. This post covered the tradeoffs and some of the risk mitigation strategies that can yield up to a 56% reduction in cost.

About the Author

Jon Handler (@_searchgeek) is a Principal Solutions Architect at Amazon Web Services based in Palo Alto, CA. Jon works closely with the CloudSearch and Elasticsearch teams, providing help and guidance to a broad range of customers who have search workloads that they want to move to the AWS Cloud. Prior to joining AWS, Jon’s career as a software developer included four years of coding a large-scale, eCommerce search engine.