AWS Storage Blog

Best practices for archiving large datasets with AWS

As companies grow, they often find themselves managing an ever-increasing amount of data. Customers often need to retain backups for business continuity or disaster recovery, as well as records for compliance and audits. Some customers also retain data to build a centralized repository of heterogeneous information with varying access patterns and lifetimes. As the volume of data grows, so does the need to optimize the costs of retaining and working with that data.

In this blog, we discuss best practices for using Amazon S3 Glacier storage classes for your archive workloads, along with the key considerations when planning your cold data storage patterns at petabyte scale. For more information, you can also check out the “Best practices for archiving large datasets with AWS” session from re:Invent.

Archival storage classes

When you’re considering where to store your data in Amazon S3, you have a range of storage classes all designed for specific use cases and data access patterns. As the amount of data in your organization grows and ages, you should evaluate optimizing your costs by selecting the right storage classes based on your data usage.

Within Amazon S3, there are four storage classes designed to lower costs for your archival data:

  • S3 Glacier Instant Retrieval: With data retrieval in milliseconds, the S3 Glacier Instant Retrieval storage class is the lowest cost S3 storage class for rarely accessed long-lived data that requires milliseconds retrieval. With S3 Glacier Instant Retrieval, you can save up to 68% on storage costs compared to the S3 Standard-Infrequent Access (S3 Standard-IA) storage class.
  • S3 Glacier Flexible Retrieval: With access times ranging from 3 hours to 12 hours – and the option to retrieve a small number of critical objects in a few minutes – the S3 Glacier Flexible Retrieval storage class is attractive for use cases where applications can tolerate multi-hour access times while achieving up to 10% lower storage costs than S3 Glacier Instant Retrieval.
  • S3 Glacier Deep Archive: With access time ranging from 12 to 48 hours, the S3 Glacier Deep Archive storage class is the lowest cost Amazon S3 storage class (~$1 per terabyte per month). Ideal for archiving data that is rarely accessed, S3 Glacier Deep Archive offers a cost-effective option for long-term compliance and digital archive workloads.
  • S3 Intelligent-Tiering: The only cloud storage class that delivers automatic storage cost savings when data access patterns change, without performance impact or operational overhead. With the addition of the Archive Instant Access tier, data stored in the S3 Intelligent-Tiering storage class can realize automated cost savings of up to 68% while retaining milliseconds data access. If you opt in to one or both of the asynchronous archive tiers, S3 Intelligent-Tiering automatically moves objects that haven’t been accessed for at least 90 days into the Archive Access tier, which has the same retrieval performance and price as S3 Glacier Flexible Retrieval. Objects that haven’t been accessed for at least 180 days are automatically moved into the Deep Archive Access tier, which has the same retrieval performance and price as S3 Glacier Deep Archive. When data is transitioned between S3 Intelligent-Tiering access tiers, you do not pay the S3 Lifecycle fees normally associated with moving data into the S3 Glacier storage classes, nor do you pay for Standard or Bulk restores. Instead, you pay a predictable monthly monitoring and automation fee of a quarter of a cent ($0.0025) per 1,000 objects.
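
As a minimal sketch of what opting in to the asynchronous tiers looks like, the following Python (boto3) call attaches an S3 Intelligent-Tiering archive configuration to a bucket. The bucket name, configuration ID, and prefix are placeholders for illustration.

```python
import boto3

s3 = boto3.client("s3")

# Opt in to both asynchronous tiers for an example bucket and prefix.
# The bucket name, configuration ID, and prefix below are placeholders.
s3.put_bucket_intelligent_tiering_configuration(
    Bucket="example-archive-bucket",
    Id="archive-opt-in",
    IntelligentTieringConfiguration={
        "Id": "archive-opt-in",
        "Status": "Enabled",
        "Filter": {"Prefix": "logs/"},
        "Tierings": [
            # Objects not accessed for 90 days move to the Archive Access tier.
            {"Days": 90, "AccessTier": "ARCHIVE_ACCESS"},
            # Objects not accessed for 180 days move to the Deep Archive Access tier.
            {"Days": 180, "AccessTier": "DEEP_ARCHIVE_ACCESS"},
        ],
    },
)
```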

Key considerations

There are five key factors to consider when planning your archival storage for large datasets.

1. Map your data access patterns

Your access needs will determine the best storage class options for your data:

  • For unknown or changing access patterns, S3 Intelligent-Tiering manages tiering so you don’t have to. You can opt in to one or both of the asynchronous Archive Access tiers if your applications and workflows can support the asynchronous data retrieval characteristics of those options.
  • If you know your data access patterns and would like to immediately gain the cost savings associated with archival storage by managing the lifecycle of your data yourself, use S3 Glacier Instant Retrieval, S3 Glacier Flexible Retrieval, or S3 Glacier Deep Archive.
  • If you need milliseconds access to your archive data, use S3 Glacier Instant Retrieval. S3 Glacier Instant Retrieval is great for rarely accessed data that still needs immediate access. S3 Glacier Instant Retrieval offers the high durability, high throughput, and low latency of S3 Standard-IA.
  • If you need to retrieve small amounts of critical data in minutes or hours, use the S3 Glacier Flexible Retrieval storage class with expedited retrieval. S3 Glacier Flexible Retrieval is great for certain workloads because it provides a wider set of restore time options. Free bulk retrievals are also available when your use case can tolerate longer restore times.
  • The S3 Glacier Deep Archive storage class is ideal for long-term compliance, digital preservation, and regulatory workloads. It’s a great option when you need to retain data for years or even decades, and it suits workloads with flexible recovery time objectives that can accommodate S3 Glacier Deep Archive’s 12-48 hour restore process.
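
If you choose to manage the lifecycle of your data yourself, an S3 Lifecycle configuration is the usual mechanism. The following Python (boto3) sketch transitions aging objects through the archive storage classes; the bucket name, prefix, rule ID, and day thresholds are placeholder assumptions and should be tuned to your own access patterns.

```python
import boto3

s3 = boto3.client("s3")

# Transition aging objects through progressively colder storage classes.
# The bucket name, rule ID, and prefix are placeholders; the day thresholds
# are examples only and should reflect your own access patterns.
s3.put_bucket_lifecycle_configuration(
    Bucket="example-archive-bucket",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-aging-data",
                "Status": "Enabled",
                "Filter": {"Prefix": "backups/"},
                "Transitions": [
                    {"Days": 90, "StorageClass": "GLACIER_IR"},     # S3 Glacier Instant Retrieval
                    {"Days": 180, "StorageClass": "GLACIER"},       # S3 Glacier Flexible Retrieval
                    {"Days": 365, "StorageClass": "DEEP_ARCHIVE"},  # S3 Glacier Deep Archive
                ],
            }
        ]
    },
)
```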

2. Plan for asynchronous data access

If your data is stored in one of S3’s storage classes such as S3 Standard, S3 Standard-Infrequent Access, or S3 Glacier Instant Retrieval, you can access your data using the GetObject API, which allows synchronous access to your data. However, with the S3 Glacier Flexible Retrieval, S3 Glacier Deep Archive, and the opt-in Archival Tiers of S3 Intelligent-Tiering, objects must first be restored using the RestoreObject API before a temporary copy of the object can be accessed with the GetObject API. The RestoreObject API is asynchronous, and retrieval times range from minutes to 48 hours, depending on the storage class and selected retrieval option.

When moving data from the synchronous access data storage classes to asynchronous archival storage classes, applications and workflows need to accommodate asynchronous RestoreObject API operations.
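As a minimal sketch of that workflow, the following Python (boto3) snippet starts an asynchronous restore and later checks whether the temporary copy is available before reading it. The bucket name, key, retrieval tier, and retention period are placeholder assumptions.

```python
import boto3

s3 = boto3.client("s3")

bucket, key = "example-archive-bucket", "backups/2023/archive-0001.tar"  # placeholders

# Start an asynchronous restore; the temporary copy stays available for 7 days.
# Tier can be "Expedited", "Standard", or "Bulk" for S3 Glacier Flexible Retrieval,
# and "Standard" or "Bulk" for S3 Glacier Deep Archive.
s3.restore_object(
    Bucket=bucket,
    Key=key,
    RestoreRequest={"Days": 7, "GlacierJobParameters": {"Tier": "Bulk"}},
)

# Later, check whether the restore has completed before calling GetObject.
head = s3.head_object(Bucket=bucket, Key=key)
restore_status = head.get("Restore", "")
if 'ongoing-request="false"' in restore_status:
    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
```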

You can learn more about the S3 Glacier restore API here.

3. Build a comprehensive cost plan

When evaluating archival storage options, the focus is often on significantly lowering storage costs. However, the number of objects you store and the way you access them also carry costs that should be factored into your storage cost optimization plan.

S3 Glacier Flexible Retrieval and S3 Glacier Deep Archive, as well as the opt-in Archive Access tiers of S3 Intelligent-Tiering, require an additional 40 KB of data storage per object to support index and metadata activities (32 KB billed at the archive rate and 8 KB billed at the S3 Standard rate). For larger objects, these costs are a small fraction of the overall cost of storing the object. For smaller objects, this additional overhead will have a much larger impact on your overall storage costs. S3 Glacier Instant Retrieval does not require this additional 40 KB of overhead but does impose a minimum billable object size of 128 KB. As you define your data lifecycle management approach, take into account the lifecycle fees, early deletion charges, and restore charges associated with S3 Glacier Instant Retrieval, S3 Glacier Flexible Retrieval, and S3 Glacier Deep Archive.

If you prefer not to worry about archival lifecycle charges or restore charges, S3 Intelligent-Tiering may be an attractive option. S3 Intelligent-Tiering monitors the access patterns of your data and moves objects automatically from one access tier to another with no transition or retrieval fees. Instead, S3 Intelligent-Tiering applies a small monthly object monitoring fee of $0.0025 per 1,000 objects (objects smaller than 128 KB stored in S3 Intelligent-Tiering remain in the Frequent Access tier and do not incur the monthly monitoring charge). You can learn more about S3 pricing here.

To demonstrate the effect that object size (and therefore object count) has on storage and retrieval costs, consider a large dataset consuming 1 pebibyte (PiB) of storage. The following table shows the monthly costs you would expect to pay to store these objects in the S3 Glacier Flexible Retrieval and S3 Glacier Deep Archive storage classes. Notice that the amount of storage is the same in all three examples (1 PiB), but the total storage cost in the last column varies. This is due to the 40 KB of overhead added to each stored object. The impact this overhead has on your storage costs can be significant when storing large numbers of very small objects, because the overhead becomes a larger percentage of the per-object storage cost. With this in mind, it is often a best practice to avoid lifecycling very small objects. To help with this, consider implementing object size filters on your lifecycle policies, as illustrated below the table.

[Table: monthly cost to store 1 PiB in S3 Glacier Flexible Retrieval and S3 Glacier Deep Archive at varying average object sizes, including the 40 KB per-object overhead]
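
As a minimal illustration of the object size filters mentioned above, the following rule fragment, which would slot into the Rules list of a lifecycle configuration like the one sketched earlier, only transitions objects of at least 1 MiB. The prefix, size threshold, and transition age are placeholder values.

```python
# A lifecycle rule fragment (used with put_bucket_lifecycle_configuration, as above)
# that only transitions objects of at least 1 MiB, leaving very small objects
# in their current storage class. The threshold and prefix are illustrative values.
rule = {
    "ID": "archive-large-objects-only",
    "Status": "Enabled",
    "Filter": {
        "And": {
            "Prefix": "backups/",
            "ObjectSizeGreaterThan": 1048576,  # bytes (1 MiB)
        }
    },
    "Transitions": [{"Days": 180, "StorageClass": "DEEP_ARCHIVE"}],
}
```
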
Next, consider how these object sizes affect lifecycle transitions into the archive storage classes. Notice in the following table the lifecycle transition fees associated with tiering the 1 PiB dataset into S3 Glacier Deep Archive. If the average object size is 1 MiB, the lifecycle transition fees come to over $53,000, but transitioning that same 1 PiB of data costs just over $10 if the average object size is 5 GiB. Also notice that the data retrieval charges are the same in each scenario; however, the retrieval request charges, which are billed per 1,000 objects, vary significantly. In our example, these charges are over $26,000 for the 1 MiB object size, and less than $6 to retrieve the same amount of data if the average object size is 5 GiB.

[Table: lifecycle transition, data retrieval, and retrieval request charges for the 1 PiB dataset in S3 Glacier Deep Archive at varying average object sizes]
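
If you want to see where these figures come from, the following back-of-the-envelope sketch reproduces the request-related charges for the 1 PiB example. The per-1,000-request rates used here are illustrative assumptions (roughly the published US East (N. Virginia) rates at the time of writing); check the S3 pricing page for your Region. Per-GB retrieval charges are omitted because they are the same in both scenarios.

```python
# Back-of-the-envelope math behind the tables above. The per-request rates are
# illustrative assumptions; always check current S3 pricing for your Region.
PIB = 2**50                      # dataset size in bytes (1 PiB)
TRANSITION_PER_1000 = 0.05       # USD per 1,000 lifecycle transitions into S3 Glacier Deep Archive (assumed rate)
BULK_RETRIEVAL_PER_1000 = 0.025  # USD per 1,000 Bulk restore requests (assumed rate)

for label, avg_size in [("1 MiB", 2**20), ("5 GiB", 5 * 2**30)]:
    objects = PIB // avg_size
    transition_cost = objects / 1000 * TRANSITION_PER_1000
    retrieval_requests_cost = objects / 1000 * BULK_RETRIEVAL_PER_1000
    overhead_bytes = objects * 40 * 1024  # 40 KB per-object index/metadata overhead
    print(f"{label}: {objects:,} objects, "
          f"transition ~${transition_cost:,.2f}, "
          f"bulk restore requests ~${retrieval_requests_cost:,.2f}, "
          f"overhead ~{overhead_bytes / 2**30:,.0f} GiB")
```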

It’s important to note that although objects stored in S3 Intelligent-Tiering will incur a monthly monitoring fee, the absence of lifecycle transition and data restore fees can offer significant cost saving opportunities for some workloads.

4. Optimize your objects for archival

As detailed previously, increasing the average size of your objects saves costs when storing, transitioning, and retrieving them. However, many customers have workloads that generate small files they would like to archive in the S3 Glacier storage classes. One technique for optimizing your objects for archival storage is to bundle them together using tar. There are several benefits to using an Amazon EC2 instance or an AWS Lambda function to read objects from S3 and tar them into a single object:

  • It dramatically lowers the cost to transition these objects into S3 Glacier Flexible Retrieval or S3 Glacier Deep Archive, and lowers the cost to store them (overhead costs are per object).
  • Bundling the objects together makes restores cheaper and faster, since restoring one aggregated object avoids issuing a separate restore request for every small object.
  • Simple mechanisms like tar preserve file-based metadata, which is critical for many workloads.

Object aggregation also has trade-offs that need to be considered. You incur additional workflow complexity, as well as costs for the compute and database services used to bundle and index the objects. If you need to restore individual objects, restore activities will also require an index of the objects stored in each aggregated archive.
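
As a rough sketch of this aggregation pattern, the following Python (boto3) snippet reads small objects under a prefix, bundles them into a single tar archive in memory, records a simple manifest for later object-level restores, and writes the result directly to S3 Glacier Deep Archive. The bucket, prefix, and archive key are placeholders, and a production workflow would stream to disk or use multipart uploads rather than buffering everything in memory.

```python
import io
import tarfile
import boto3

s3 = boto3.client("s3")
bucket = "example-archive-bucket"  # placeholder

# Bundle small objects under a prefix into one tar archive held in memory.
buffer = io.BytesIO()
manifest = []
with tarfile.open(fileobj=buffer, mode="w") as tar:
    for page in s3.get_paginator("list_objects_v2").paginate(Bucket=bucket, Prefix="small-files/"):
        for obj in page.get("Contents", []):
            body = s3.get_object(Bucket=bucket, Key=obj["Key"])["Body"].read()
            info = tarfile.TarInfo(name=obj["Key"])
            info.size = len(body)
            tar.addfile(info, io.BytesIO(body))
            manifest.append(obj["Key"])  # keep an index for object-level restores later

# Store the aggregated archive directly in S3 Glacier Deep Archive.
buffer.seek(0)
s3.put_object(
    Bucket=bucket,
    Key="archives/small-files-0001.tar",
    Body=buffer.getvalue(),
    StorageClass="DEEP_ARCHIVE",
)
```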

5. Restore sequentially

When restoring a large dataset, there are additional considerations for how restores should be issued to Amazon S3 to maximize restore throughput. The Amazon S3 Glacier documentation on restoring objects states the following:

“When required, you can restore large segments of the data stored in S3 Glacier Flexible Retrieval or S3 Glacier Deep Archive. For example, you might want to restore data for a secondary copy. However, if you need to restore a large amount of data, keep in mind that S3 Glacier Flexible Retrieval and S3 Glacier Deep Archive are designed for 35 random restore requests per pebibyte (PiB) stored per day.”

‘Randomness’ specifically refers to the order in which objects within a bucket were transitioned to the S3 Glacier Flexible Retrieval or S3 Glacier Deep Archive storage class. When initiating restores, create restore requests in order, starting with the object that was transitioned earliest. Often, this transition time corresponds to the object creation time, which you can obtain from an S3 Inventory report. Restoring in this order minimizes randomness and results in optimal restore performance.

S3 Glacier Flexible Retrieval and S3 Glacier Deep Archive are capable of petabyte-scale restores in a single day. However, to harness that capability, restores must be issued sequentially. As an example, consider a case where you need to restore 1 PiB of data stored in the S3 Glacier Deep Archive storage class using Bulk restores, where the average object size is 1 TiB. At that object size, it would take 1,024 RestoreObject requests to restore all of the data. If these restores were issued randomly, you could be limited to restoring only 35 objects per day, which could stretch the restore of the entire 1 PiB of data to roughly 30 days and is clearly not optimal. By issuing the requests in the order in which the objects were moved into S3 Glacier Deep Archive, which often matches the object creation order, all objects can be restored within the S3 Glacier Deep Archive Bulk restore time of 48 hours.
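
As a minimal sketch of sequential restores, the following Python (boto3) snippet sorts an already-downloaded S3 Inventory CSV by last-modified date and issues Bulk restore requests in that order. The bucket name, inventory file name, and column layout are placeholder assumptions; adjust them to match the fields you configured for your inventory report.

```python
import csv
import boto3

s3 = boto3.client("s3")
bucket = "example-archive-bucket"  # placeholder

# Read a previously downloaded S3 Inventory report (CSV). The column layout
# below is a placeholder; match it to the fields configured for your inventory.
with open("inventory.csv", newline="") as f:
    rows = list(csv.DictReader(f, fieldnames=["Bucket", "Key", "Size", "LastModifiedDate"]))

# Issue Bulk restores in last-modified order, oldest first, to minimize randomness.
for row in sorted(rows, key=lambda r: r["LastModifiedDate"]):
    s3.restore_object(
        Bucket=bucket,
        Key=row["Key"],
        RestoreRequest={"Days": 7, "GlacierJobParameters": {"Tier": "Bulk"}},
    )
```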

Conclusion

When you’re considering where to store long-term, rarely accessed data in Amazon S3, you have a variety of storage classes optimized for specific use cases and data access patterns: S3 Intelligent-Tiering with automatic archival, S3 Glacier Instant Retrieval, S3 Glacier Flexible Retrieval, and S3 Glacier Deep Archive. Each of these storage classes provides unique benefits that map to specific use cases, giving you plenty of flexibility to tailor your rarely accessed or archive storage usage and optimize for both cost and performance.

If you have any comments or questions about this blog post, please don’t hesitate to reply in the comments section. Thanks for reading!