Choose the right storage tier for your needs in Amazon OpenSearch Service
Amazon OpenSearch Service (successor to Amazon Elasticsearch Service) enables organizations to perform interactive log analytics, real-time application monitoring, website search, and more. OpenSearch is an open-source, distributed search and analytics suite derived from Elasticsearch. Amazon OpenSearch Service offers the latest versions of OpenSearch, support for 19 versions of Elasticsearch (1.5 to 7.10 versions), and visualization capabilities powered by OpenSearch Dashboards and Kibana (1.5 to 7.10 versions).
In this post, we present three storage tiers of Amazon OpenSearch Service—hot, UltraWarm, and cold storage—and discuss how to effectively choose the right storage tier for your needs. This post can help you understand how these storage tiers integrate together and what the trade-off is for each storage tier. To choose a storage tier of Amazon OpenSearch Service for your use case, you need to consider the performance, latency, and cost of these storage tiers in order to make the right decision.
Amazon OpenSearch Service storage tiers overview
There are three different storage tiers for Amazon OpenSearch Service: hot, UltraWarm, and cold. The following diagram illustrates these three storage tiers.
Hot storage for Amazon OpenSearch Service is used for indexing and updating, while providing fast access to data. Standard data nodes use hot storage, which takes the form of instance store or Amazon Elastic Block Store (Amazon EBS) volumes attached to each node. Hot storage provides the fastest possible performance for indexing and searching new data.
You get the lowest latency for reading data in the hot tier, so you should use the hot tier to store frequently accessed data driving real-time analysis and dashboards. As your data ages, you access it less frequently and can tolerate higher latency, so keeping data in the hot tier is no longer cost-efficient.
If you want to have low latency and fast access to the data, hot storage is a good choice for you.
UltraWarm nodes use Amazon Simple Storage Service (Amazon S3) with related caching solutions to improve performance. UltraWarm offers significantly lower costs per GiB for read-only data that you query less frequently and don’t need the same performance as hot storage. Although you can’t modify the data while in UltraWarm, you can move the data to the hot storage tier for edits before moving it back.
When calculating UltraWarm storage requirements, you consider only the size of the primary shards. When you query for the list of shards in UltraWarm, you still see the primary and replicas listed. Both shards are stubs for the same, single copy of the data, which is in Amazon S3. The durability of data in Amazon S3 removes the need for replicas, and Amazon S3 abstracts away any operating system or service considerations. In the hot tier, accounting for one replica, 20 GB of index uses 40 GB of storage. In the UltraWarm tier, it’s billed at 20 GB.
The UltraWarm tier acts like a caching layer on top of the data in Amazon S3. UltraWarm moves data from Amazon S3 onto the UltraWarm nodes on demand, which speeds up access for subsequent queries on that data. For that reason, UltraWarm works best for use cases that access the same, small slice of data multiple times. You can add or remove UltraWarm nodes to increase or decrease the amount of cache against your data in Amazon S3 to optimize your cost per GB. To dial in your cost, be sure to test using a representative dataset. To monitor performance, use the WarmCPUUtilization and WarmJVMMemoryPressure metrics. See UltraWarm metrics for a complete list of metrics.
The combined CPU cores and RAM allocated to UltraWarm nodes affects performance for simultaneous searches across shards. We recommend deploying enough UltraWarm instances so that you store no more than 400 shards per ultrawarm1.medium.search node and 1,000 shards per ultrawarm1.large.search node (including both primaries and replicas). We recommend a maximum shard size of 50 GB for both hot and warm tiers. When you query UltraWarm, each shard uses a CPU and moves data from Amazon S3 to local storage. Running single or concurrent queries that access many indexes can overwhelm the CPU and local disk resources. This can cause longer latencies through inefficient use of local storage, and even cause cluster failures.
UltraWarm storage requires OpenSearch 1.0 or later, or Elasticsearch version 6.8 or later.
If you have large amounts of read-only data and want to balance the cost and performance, use UltraWarm for your infrequently accessed, older data.
Cold storage is optimized to store infrequently accessed or historical data at $0.024 per GB per month. When you use cold storage, you detach your indexes from the UltraWarm tier, making them inaccessible. You can reattach these indexes in a few seconds when you need to query that data. Cold storage is a great fit for scenarios in which a low ROI necessitates an archive or delete action on historical data, or if you need to conduct research or perform forensic analysis on older data with Amazon OpenSearch Service.
Cold storage doesn’t have specific instance types because it doesn’t have any compute capacity attached to it. You can store any amount of data in cold storage.
Cold storage requires OpenSearch 1.0 or later, or Elasticsearch version 7.9 or later and UltraWarm.
Manage storage tiers in OpenSearch Dashboards
OpenSearch Dashboards installed on your Amazon OpenSearch Service domain provides a useful UI for managing indexes in different storage tiers on your domain. From the OpenSearch Dashboards main menu, you can view all indexes in hot, UltraWarm, and cold storage. You can also see the indexes managed by Index State Management (ISM) policies. OpenSearch Dashboards enables you to migrate indexes between UltraWarm and cold storage, and monitor index migration status, without using the AWS Command Line Interface (AWS CLI) or configuration API. For more information on OpenSearch Dashboards, see Using OpenSearch Dashboards with Amazon OpenSearch Service.
The hot tier requires you to pay for what is provisioned, which includes the hourly rate for the instance type. Storage is either Amazon EBS or a local SSD instance store. For Amazon EBS-only instance types, additional EBS volume pricing applies. You pay for the amount of storage you deploy.
UltraWarm nodes charge per hour just like other node types, but you only pay for the storage actually stored in Amazon S3. For example, although the instance type ultrawarm1.large.elasticsearch provides up to 20 TiB addressable storage on Amazon S3, if you only store 2 TiB of data, you’re only billed for 2 TiB. Like the standard data node types, you also pay an hourly rate for each UltraWarm node. For more information, see Pricing for Amazon OpenSearch Service.
Cold storage doesn’t incur compute costs, and like UltraWarm, you’re only billed for the amount of data stored in Amazon S3. There are no additional transfer charges when moving data between cold and UltraWarm storage.
Example use case
Let’s look at an example with 1 TB of source data per day, 7 days hot, 83 days warm, 365 days cold. For more information on sizing the cluster, see Sizing Amazon OpenSearch Service domains.
For hot storage, you can go through a baseline estimation with the calculation as: storage needed = (daily source data in bytes * 1.25) * (number_of_replicas + 1) * number of days retention. With the best practice for two replicas, we should use two replicas here. The minimum storage requirement to retain 7 TB of data on the hot tier is (7TB*1.25)*(2+1)= 26.25 TB. For this amount of storage, we need 6x R6g.4xlarge.search instances given the Amazon EBS size limit.
We also need to verify from the CPU side, we need 25 primary shards (1TB*1.25/50GB) =25. We have two replicas. With that, we have total 75 active shards. With that, the total vCPU needed is 75*1.5=112.5 vCPU. This means 8x R6g.4xlarge.search instances. This also requires three dedicated c6g.xlarge.search leader nodes.
When calculating UltraWarm storage requirements, you consider only the size of the primary shards, because that’s the amount of data stored in Amazon S3. For this example, the total primary shard size for warm storage is 83*1.25=103.75 TB. Each ultrawarm1.large.search instance has 16 CPU cores and can address up to 20 TiB of storage on Amazon S3. A minimum of six ultrawarm1.large.search nodes is recommended. You’re charged for the actual storage, which is 103.75 TB.
For cold storage, you only pay for the cost of storing 365*1.25=456.25 TB on Amazon S3. The following table contains a breakdown of the monthly costs (USD) you’re likely to incur. This assumes a 1-year reserved instance for the cluster instances with no upfront payment in the US East (N. Virgina) Region.
|Cost Type||Pricing||Usage||Cost per month|
|Instance Usage||R6g.4xlarge.search = $0.924 per hour||8 instances * 730 hours in a month = 5,840 hours||5,840 hours * $0.924 = $5,396.16|
|c6g.xlarge.search = $0.156 per hour||3 instances (leader nodes) * 730 hours in a month = 2,190 hours||2,190 hours * $0.156 = $341.64|
|ultrawarm1.large.search = $2.68 per hour||6 instances * 730 hours = 4,380 hours||4,380 hours * $2.68 = $11,738.40|
|Storage Cost||Hot storage cost (Amazon EBS) EBS general purpose SSD (gp3) = $0.08 per GB per month||7 days host = 26.25TB||26,880 GB * $0.08 = $2,150.40|
|UltraWarm managed storage cost = $0.024 per GB per month||83 days warm = 103.75 TB per month||106,240 GB * $0.024 = $2,549.76|
|Cold storage cost on Amazon S3 = $0.022 per GB per month||365 days cold = 456.25 TB per month||467,200 GB * $0.022 = $10,278.40|
The total monthly cost is $32,454.76. The hot tier costs $7,888.20, UltraWarm costs $14,288.16, and cold storage is $10,278.40. UltraWarm allows 83 days of additional retention for slightly more cost than the hot tier, which only provides 7 days. For nearly the same cost as the hot tier, the cold tier stores the primary shards for up to 1 year.
Amazon OpenSearch Service supports three integrated storage tiers: hot, UltraWarm, and cold storage. Based on your data retention, query latency, and budgeting requirements, you can choose the best strategy to balance cost and performance. You can also migrate data between different storage tiers. To start using these storage tiers, sign in to the AWS Management Console, use the AWS SDK, or AWS CLI, and enable the corresponding storage tier.
About the Author
Changbin Gong is a Senior Solutions Architect at Amazon Web Services (AWS). He engages with customers to create innovative solutions that address customer business problems and accelerate the adoption of AWS services. In his spare time, Changbin enjoys reading, running, and traveling.
Rich Giuli is a Principal Solutions Architect at Amazon Web Service (AWS). He works within a specialized group helping ISVs accelerate adoption of cloud services. Outside of work Rich enjoys running and playing guitar.