Run a petabyte scale cluster in Amazon OpenSearch Service
When you use Amazon OpenSearch Service for log data, you’re drinking from a firehose that only grows more forceful. As your OpenSearch and Kibana knowledge deepens, you find many compelling uses for your data. And as your customer base scales up and you scale your infrastructure to handle it, you generate even more log data.
You can take a number of design paths to store and analyze your ever-growing log data. For example, you can deploy multiple Amazon OpenSearch Service domains and split your workloads across them. However, that strategy doesn’t support analysis from a single Kibana dashboard. You can also take a hybrid approach, with co-tenancy for some of your log streams, or you can put everything in a single domain.
Eventually, even usage of a single workload domain can scale to and beyond 1 PB of storage.
We recently doubled our support limit for large clusters in Amazon OpenSearch Service. Today, we support a default limit of 20 data instances, and up to 200 data instances in a single cluster with a limit raise for your domain. If you choose I3.16xlarge.search instances for your data instances, that gives you up to 3 PB of storage in your cluster. What are the best practices for successfully running these XL workloads?
Run the latest version of OpenSearch
Every new version of OpenSearch brings improvements in stability and performance. At the time this post was written, Amazon OpenSearch Service supported Elasticsearch version 6.4. For large-scale clusters, we recommend deploying or upgrading (by using our in-place upgrades) to the latest supported version of OpenSearch.
Use I3 instances
You need to use I3s to get to petabyte scale. I3 instances have NVMe SSD ephemeral storage. At the large end, I3.16xlarge.search instances have 15.2 TB of storage per node, for a total of 3.04 PB in a 200-node domain. If you choose one of our EBS-backed instance types, you can deploy up to 1.5 TB of EBS volume per data instance in Amazon OpenSearch Service, giving you a maximum cluster size of 300 TB. Note that we are always evaluating and adding new instance types, and evaluating and changing our service limits. Future service improvements might change the recommendations in this post.
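The storage ceilings above come from straightforward multiplication. A quick sketch of the arithmetic, using the per-node figures from this post expressed in GB:

```python
# Per-node storage figures from this post, in GB for exact integer math.
i3_16xl_storage_gb = 15_200   # 15.2 TB NVMe per I3.16xlarge.search instance
ebs_max_per_node_gb = 1_500   # 1.5 TB EBS volume per data instance
max_instances = 200           # maximum data instances with a limit raise

# I3-based cluster ceiling:
print(i3_16xl_storage_gb * max_instances)   # 3040000 GB, i.e. 3.04 PB

# EBS-backed cluster ceiling:
print(ebs_max_per_node_gb * max_instances)  # 300000 GB, i.e. 300 TB
```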
Use the _rollover API to manage index lifecycle
When you use Amazon OpenSearch Service for log analytics, you create indexes to partition your data and to manage its lifecycle in your cluster. You accumulate logs for a specified time period, and then move to a new index for the next time period. At the end of your retention window, you DELETE the oldest index.
How often should you change indexes? There’s no hard rule, but there are two kinds of patterns for managing indexes:
- By rolling over at a time interval (usually daily, but for sufficiently large workloads you will need to roll over more than once per day). This change happens outside of the cluster, in your ingest pipeline, which determines the destination index.
- By rolling over based on the index size. This change happens inside the cluster. You use an alias to send data to the index and call the _rollover API, specifying the conditions. OpenSearch creates a new index behind the alias when the conditions are met. Note that rolling over based on the size of the index is available only in open source Elasticsearch version 6.4 and later.
For earlier versions of open source Elasticsearch that don’t support rolling over by index size, we recommend a similar pattern of calling the _rollover API, but using document count as the trigger, because the alternative—age—results in more variation in shard size.
Using the _rollover API, either rolling over by index size or by document count, has the advantage of ensuring that all of your shards are approximately the same size. This gives the best chance at an even division of storage and, as a consequence, an even division of compute across the instances in the cluster.
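As a concrete sketch, here is what a _rollover request body might look like, built with Python’s json module. The alias name and threshold values are illustrative assumptions, not recommendations; max_size requires Elasticsearch 6.4 or later, while max_docs works as the document-count trigger on earlier versions.

```python
import json

# Hypothetical write alias; your ingest pipeline sends data to this alias.
alias = "logs-write"

# Illustrative thresholds: roll over at 50 GB of primary storage or
# 100 million documents, whichever comes first.
rollover_request = {
    "conditions": {
        "max_size": "50gb",       # Elasticsearch 6.4+ only
        "max_docs": 100_000_000,  # document-count trigger for earlier versions
    }
}

# Send as: POST /logs-write/_rollover
body = json.dumps(rollover_request)
print(body)
```

When a condition is met, OpenSearch creates the next index behind the alias, so your ingest pipeline keeps writing to the same alias without any changes.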
Set your shard count to be divisible by your instance count
A main cause of cluster instability in Amazon OpenSearch Service domains is skew in the usage of the shards, and the resulting hot nodes that the skew creates. OpenSearch maps shards to instances based on count; each instance receives, as nearly as possible, the same number of shards. When you set your shard count to be divisible by your instance count, you ensure that every node in your cluster has the same number of shards.
This is where the _rollover API helps manage storage skew. If your shards are all the same size, the storage will distribute evenly based on shard count. If your storage is not evenly distributed, some of the nodes can run out of space before the others. If that happens, OpenSearch stops deploying shards to those nodes, and the shard counts can become skewed as well, forcing uneven load on the instances.
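A small helper makes the divisibility rule concrete. This is a sketch under the assumptions in this post (50 GB target shards); the function name and defaults are my own:

```python
import math

def balanced_shard_count(index_size_gb, instance_count, max_shard_gb=50):
    """Smallest shard count that keeps each shard at or under max_shard_gb
    and is divisible by instance_count, so every node holds the same
    number of shards."""
    minimum = math.ceil(index_size_gb / max_shard_gb)
    # Round up to the next multiple of instance_count.
    return math.ceil(minimum / instance_count) * instance_count

# A 3 TB (3,000 GB) index on a 40-node cluster:
print(balanced_shard_count(3_000, 40))  # 80 shards -> 2 per node, 37.5 GB each
```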
Optimize your indexes after rolling them over
For log analytics workloads, your indexes are write-once and then read-only. You can take advantage of this pattern, calling the _forcemerge API when you have rolled over to a new index. _forcemerge reduces the Lucene segment count by merging segments in your index. This has a beneficial effect for processing queries. Using the _shrink API to reduce the shard count of your indexes can be a good idea as well, provided you make sure to evenly distribute work across your cluster through managing shard count.
Be aware that both of these APIs generate significant load on your cluster. Don’t call _forcemerge or _shrink on a regular basis; schedule these calls for a low-traffic time of day, and limit the number of concurrent calls to one.
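As a sketch of the two calls, here are the request shapes per the Elasticsearch 6.x REST API. The index names are hypothetical, and the shrink target of five shards is an assumed example:

```python
import json

old_index = "logs-000042"  # hypothetical rolled-over, now read-only index

# Merge the index's Lucene segments down to one:
#   POST /logs-000042/_forcemerge?max_num_segments=1
forcemerge_path = f"/{old_index}/_forcemerge?max_num_segments=1"

# Shrink to fewer primary shards. The target count must be a factor of the
# source count (e.g. 40 -> 5), and should stay divisible by your instance
# count so the cluster remains balanced.
#   POST /logs-000042/_shrink/logs-000042-small
shrink_body = json.dumps({
    "settings": {
        "index.number_of_shards": 5,
    }
})

print(forcemerge_path)
print(shrink_body)
```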
Use dedicated master instances
For PB-scale clusters, you must deploy dedicated master instances. Dedicated master instances provide necessary stability for clusters with high node counts. Primary drivers of dedicated master usage include shard count, node count, mapping size, and mapping changes. The following table uses shard count as a proxy for these various factors to recommend instance types for your dedicated master instances at different scales.
Recommended dedicated master instance configuration:

| Cluster size | Maximum shard count | Instance type |
| --- | --- | --- |
| < 10 data instances | 2,500 shards | C4.large.search |
| 10 to 30 data instances | 5,000 shards | C4.xlarge.search |
| 30 to 75 data instances | 10,000 shards | C4.2xlarge.search |
| > 75 data instances | < 30,000 shards | R4.4xlarge.search |
We recommend keeping the total number of shards in any Amazon OpenSearch Service cluster below 30,000. However, not all shards consume resources equally. In a prior post on sizing your Amazon OpenSearch Service domain, we drew a distinction between active and inactive shards. Shards are active when they receive read and write traffic. The number of active shards that your domain can support is driven by the number of CPUs in the cluster (see the sizing post for details). For a 200-node, I3.16xlarge.search cluster, you should keep active shards to fewer than 5,000 (leaving some room for other cluster tasks).
The preceding table assumes a ratio of 1:50 for JVM heap size in bytes to data stored on the instance in bytes. We also use 50 GB as the best-practice maximum shard size. Depending on your workload, you can be successful with larger shards, up to 100 GB.
If you use shards larger than 50 GB, keep a close eye on the cluster’s metrics to ensure that it functions correctly, and make sure to set alarms on your CloudWatch metrics. For high-shard-count clusters, it’s especially important to monitor the master instances’ CPU and JVM usage. If your dedicated master CPUs run above 80% on average, or your dedicated master JVM memory pressure is consistently above 80%, reduce your shard count.
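The 1:50 heap-to-storage ratio works out roughly as follows. The heap size here is an assumed figure for a large data instance, not a published limit:

```python
jvm_heap_gb = 32                      # assumed heap on a large data instance
data_per_node_gb = jvm_heap_gb * 50   # 1:50 heap:storage ratio
max_shard_gb = 50                     # best-practice maximum shard size
shards_per_node = data_per_node_gb // max_shard_gb

print(data_per_node_gb)   # 1600 GB of data per node at this ratio
print(shards_per_node)    # 32 shards of 50 GB per node
```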
Limit _bulk request payloads
When initially deploying your domain, you should spend time tuning your _bulk request payload sizes. Your OpenSearch throughput is critically dependent on the size of the write request and the subsequent processing time and JVM memory use for each request. Empirically, we see approximately 5 MB as being a good starting point for this tuning.
For PB-scale clusters, we recommend keeping bulk payloads below about 20 MB, even when you are getting better throughput at larger bulk sizes. Larger bulks require more JVM memory to process. You can run out of JVM heap when processing too many large bulk requests.
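A sketch of capping bulk payload sizes in an ingest pipeline. The batching helper and names are my own, and the 6.x bulk action line (with `_type`) is shown for illustration:

```python
import json

def bulk_batches(docs, index, max_bytes=5 * 1024 * 1024):
    """Group documents into _bulk payload strings capped at max_bytes.
    5 MB is a starting point for tuning; stay well under ~20 MB."""
    # Elasticsearch 6.x bulk action metadata line.
    action = json.dumps({"index": {"_index": index, "_type": "_doc"}})
    batch, size = [], 0
    for doc in docs:
        lines = action + "\n" + json.dumps(doc) + "\n"
        if batch and size + len(lines) > max_bytes:
            yield "".join(batch)
            batch, size = [], 0
        batch.append(lines)
        size += len(lines)
    if batch:
        yield "".join(batch)

# Ten ~1 KB documents with a 4 KB cap produce several small payloads,
# each ready to send as: POST /_bulk
docs = [{"msg": "x" * 1024} for _ in range(10)]
payloads = list(bulk_batches(docs, "logs-000001", max_bytes=4096))
print(len(payloads), max(len(p) for p in payloads))
```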
Conclusion and caveat emptor
Whether you use a single domain or spread your workload across multiple domains for extra-large log analytics workloads, you should employ the best practices outlined in this post. However, we want to end with a strong caution: every OpenSearch workload is its own beast. The recommendations in this post are guidelines, and your mileage will vary! You’ll need to actively deploy, monitor, and adjust your sizing, shard strategy, and index lifecycle management.
Jon Handler is an AWS solutions architect specializing in search technologies. You can reach him @_searchgeek.