AWS Storage Blog

Identify cold objects for archiving to Amazon S3 Glacier storage classes

Update (02/13/2024): Consider Amazon S3 Lifecycle transition fees, which are charged based on the total number of objects being transitioned and the destination storage class (listed on the Amazon S3 pricing page), as well as the additional metadata charges that apply. You can use the S3 pricing calculator to estimate the total upfront and monthly costs by inputting the total storage and average object size, and indicating how the data is moved to the destination storage class.


Many organizations move cold data to archive storage in the cloud to optimize storage costs for data they want to preserve over a number of years. Archiving data at a very low cost also gives organizations the ability to quickly restore that data and put it to work for their business, such as for historical analytics and machine learning (ML) model training. Some key considerations when moving data to archive storage include the expected lifetime of data, restore frequency, and the cost of transitioning data between storage classes. Since the transition pricing is based on the number of transition requests, the cost to archive data varies by the size of the objects in your dataset.

You can cost-effectively store Amazon Simple Storage Service (Amazon S3) objects throughout their lifecycle with S3 Lifecycle. Amazon S3 has three archive storage classes designed for different access patterns. For archive data that needs immediate access, you can use S3 Lifecycle to transition objects to the Amazon S3 Glacier Instant Retrieval storage class. Additionally, you can transition objects that don’t require real-time access to the Amazon S3 Glacier Flexible Retrieval or Amazon S3 Glacier Deep Archive storage classes. Understanding what objects to transition to archive storage classes is an important consideration for minimizing transition costs and maximizing savings.

In this post, we walk through the steps to identify objects that are good candidates for S3 Lifecycle transitions to S3 Glacier Flexible Retrieval or S3 Glacier Deep Archive. These transition recommendations apply for both current and non-current object versions. This post covers understanding S3 Glacier data access requirements and reviews four elements of cost optimization related to your Amazon S3 storage using S3 Lifecycle:

  1. How to use the free dashboard in Amazon S3 Storage Lens to determine object count and average object size in the bucket.
  2. Classifying object size distribution using an S3 Inventory report, Amazon Athena, and Amazon QuickSight to determine an appropriate object size filter.
  3. How to determine the savings from aggregating small objects and how to aggregate them.
  4. How to calculate costs to help determine an optimal S3 Lifecycle rule to transition the objects to S3 Glacier Flexible Retrieval or S3 Glacier Deep Archive.

Understanding minimum charges and data access requirements

Storage classes like Amazon S3 Standard-Infrequent Access, Amazon S3 Intelligent-Tiering, Amazon S3 One Zone-Infrequent Access, and Amazon S3 Glacier Instant Retrieval have a minimum billable object size of 128 KB, regardless of the actual object size, and S3 Lifecycle does not transition objects smaller than 128 KB to those storage classes. In contrast, S3 Glacier Flexible Retrieval and S3 Glacier Deep Archive have no minimum billable object size, but each archived object carries metadata overhead: 32 KB billed in the archive storage class of choice plus 8 KB billed in S3 Standard. For these two storage classes, you can move objects of any size using S3 Lifecycle. However, in most cases we recommend using S3 Lifecycle’s object size filter feature to limit the transition of smaller objects, starting with 128 KB (base 2 value = 131,072 bytes) as the minimum object size filter. Then, depending on how long you expect the data to live and the storage class from which you’re transitioning the data, you can adjust the filter size to ensure the lowest overall cost for your archival data over its lifetime.
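As a concrete illustration, the following boto3 sketch creates an S3 Lifecycle rule that transitions only objects larger than 128 KB to S3 Glacier Flexible Retrieval after 90 days. The bucket name, prefix, and transition period are hypothetical examples, not recommendations for your workload.

import boto3

s3 = boto3.client("s3")

# Hypothetical bucket name and prefix; 131072 bytes = 128 KB (base 2).
s3.put_bucket_lifecycle_configuration(
    Bucket="my-example-bucket",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-objects-larger-than-128KB",
                "Status": "Enabled",
                "Filter": {
                    "And": {
                        "Prefix": "logs/",
                        "ObjectSizeGreaterThan": 131072,
                    }
                },
                "Transitions": [
                    # GLACIER = S3 Glacier Flexible Retrieval;
                    # use DEEP_ARCHIVE for S3 Glacier Deep Archive.
                    {"Days": 90, "StorageClass": "GLACIER"}
                ],
            }
        ]
    },
)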

S3 Glacier Flexible Retrieval and S3 Glacier Deep Archive require asynchronous access using the RestoreObject API before a GET can be performed, so it’s important to know your data access requirements before using S3 Lifecycle to transition data into these storage classes. S3 Glacier Flexible Retrieval is suitable for archiving data that is accessed one or two times per year asynchronously, or more often if you use the free bulk retrieval option. S3 Glacier Flexible Retrieval delivers the most flexible retrieval options: expedited retrievals that typically complete in 1–5 minutes (using provisioned capacity units, each of which provides at least three expedited retrievals every five minutes and up to 150 MB/s of retrieval throughput), standard retrievals that typically complete in 3–5 hours and typically start within minutes when initiated through S3 Batch Operations, and free bulk retrievals that return large amounts of data typically within 5–12 hours. S3 Glacier Deep Archive is suitable for archiving data that is accessed less than once per year and is retrieved asynchronously, with data retrieval in 9–48 hours. Regulatory and compliance archives, media asset workflows, scientific data storage, and long-term backup retention are a few use cases for these storage classes.
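For reference, the snippet below sketches that asynchronous retrieval flow with boto3 (bucket and key names are hypothetical): initiate a restore with the RestoreObject API, then check the object’s restore status before issuing a GET.

import boto3

s3 = boto3.client("s3")

# Hypothetical bucket/key. Tier can be 'Expedited', 'Standard', or 'Bulk'
# (Expedited is not available for S3 Glacier Deep Archive).
s3.restore_object(
    Bucket="my-example-bucket",
    Key="archive/2019/report.parquet",
    RestoreRequest={
        "Days": 7,  # how long the temporary restored copy stays available
        "GlacierJobParameters": {"Tier": "Bulk"},
    },
)

# Poll the restore status; the 'Restore' field reports ongoing-request="true"
# until the temporary copy is available for GET requests.
status = s3.head_object(Bucket="my-example-bucket", Key="archive/2019/report.parquet")
print(status.get("Restore", "no restore in progress"))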

For use cases where access patterns are unknown, you can configure Storage Class Analysis to observe data access patterns and gather information on the amount of storage in each storage class and the data retrieval rate by object age. An access pattern with a reasonably flat retrieval rate over a year is a good candidate for S3 Glacier Flexible Retrieval or S3 Glacier Deep Archive. To learn more about Storage Class Analysis and identifying data access patterns, refer to the post, “How Canva saves over $3 million annually in Amazon S3 costs.”
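If you prefer to enable Storage Class Analysis programmatically rather than in the console, a minimal boto3 sketch might look like the following. The bucket names, configuration ID, and export prefix are hypothetical.

import boto3

s3 = boto3.client("s3")

# Hypothetical source bucket, export bucket, and configuration ID.
s3.put_bucket_analytics_configuration(
    Bucket="my-example-bucket",
    Id="whole-bucket-access-analysis",
    AnalyticsConfiguration={
        "Id": "whole-bucket-access-analysis",
        "StorageClassAnalysis": {
            "DataExport": {
                "OutputSchemaVersion": "V_1",
                "Destination": {
                    "S3BucketDestination": {
                        "Format": "CSV",
                        "Bucket": "arn:aws:s3:::my-analysis-export-bucket",
                        "Prefix": "storage-class-analysis/",
                    }
                },
            }
        },
    },
)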

1. Identifying object count and average object size in the bucket

After you’ve identified archival data that requires access in minutes to hours through an asynchronous process, the next step is getting more information about object sizes within your dataset to maximize the savings of your S3 Lifecycle rule. Customers can use S3 Storage Lens to gain insights into S3 buckets to identify total storage, object count, average object size, and much more. S3 Storage Lens delivers more than 60 metrics (free metrics and advanced metrics) on object storage usage and activity to an interactive dashboard in the Amazon S3 console. With the default dashboard, you can determine the number of objects and average object size of your S3 buckets.

From the S3 Storage Lens dashboard, you can get the number of objects and the average object size for an individual S3 bucket. You can use the Filters option in S3 Storage Lens, as shown in Figure 1.
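If you want to pull these two numbers programmatically as a complement to the Storage Lens dashboard, the free daily storage metrics that Amazon S3 publishes to Amazon CloudWatch (BucketSizeBytes and NumberOfObjects) can approximate the same values. The sketch below assumes a hypothetical bucket name and that the bucket’s data is in S3 Standard.

from datetime import datetime, timedelta, timezone

import boto3

cloudwatch = boto3.client("cloudwatch")
BUCKET = "my-example-bucket"  # hypothetical bucket name


def latest_daily_metric(metric_name: str, storage_type: str) -> float:
    """Return the most recent daily S3 storage metric value for the bucket."""
    now = datetime.now(timezone.utc)
    response = cloudwatch.get_metric_statistics(
        Namespace="AWS/S3",
        MetricName=metric_name,
        Dimensions=[
            {"Name": "BucketName", "Value": BUCKET},
            {"Name": "StorageType", "Value": storage_type},
        ],
        StartTime=now - timedelta(days=3),
        EndTime=now,
        Period=86400,
        Statistics=["Average"],
    )
    datapoints = sorted(response["Datapoints"], key=lambda p: p["Timestamp"])
    return datapoints[-1]["Average"] if datapoints else 0.0


size_bytes = latest_daily_metric("BucketSizeBytes", "StandardStorage")
object_count = latest_daily_metric("NumberOfObjects", "AllStorageTypes")
if object_count:
    print(f"Objects: {object_count:,.0f}")
    print(f"Average object size: {size_bytes / object_count / 1024:,.1f} KB")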


Figure 1: Snapshot of S3 bucket

For each object stored in S3 Glacier Flexible Retrieval or S3 Glacier Deep Archive, Amazon S3 adds 40 KB of chargeable overhead for metadata. These metadata charges affect the storage cost of smaller objects in these storage classes. This means that objects under 128 KB should generally be excluded from the transition using the S3 Lifecycle object size filter, while medium-sized objects from 128 KB to 1 MB are candidates for aggregation before transitioning. Additionally, because transitions are charged per object, the transition cost is directly proportional to the total number of objects being transitioned. For example, system log files are often smaller than 8 KB and number in the billions; for such datasets, the cost of aggregating the objects is very likely to overwhelm any savings from transitioning. Instead, objects under 128 KB should generally be left in their original storage class.

In Figure 1, the average object size in the bucket is 2.4 MB with approximately 2.9 million objects. Based on a price of $0.03 per 1,000 transitions, the cost of moving all of these objects to S3 Glacier Flexible Retrieval is estimated at $87 in the US East (Ohio) Region. Because the large average object size (2.4 MB, well above 1 MB) indicates overall savings from transitioning to S3 Glacier Flexible Retrieval or S3 Glacier Deep Archive, you can confidently create an S3 Lifecycle rule without needing to set up S3 Inventory and Amazon Athena. You can still apply an S3 Lifecycle object size filter to stop the transition of objects smaller than 128 KB.
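The arithmetic behind that $87 estimate is straightforward. The snippet below uses the illustrative $0.03 per 1,000 transition requests figure quoted above; always confirm current rates on the Amazon S3 pricing page.

# Illustrative one-time transition cost for the bucket in Figure 1.
object_count = 2_900_000              # ~2.9 million objects
price_per_1000_transitions = 0.03     # S3 Glacier Flexible Retrieval, US East (Ohio), example rate

one_time_cost = object_count / 1000 * price_per_1000_transitions
print(f"Estimated one-time transition cost: ${one_time_cost:,.2f}")  # ~$87.00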

Now we show another situation with an example bucket (Figure 2), where the average object size in the bucket is 6.11 KB (well under 1 MB). In this case, you should use an Amazon S3 Inventory report and run an Athena query to identify how many objects, and how much storage, you would actually transition when setting an object size filter (for example, 128 KB). This is important because a very small average object size (say, ~6 KB) could suggest that the data is not worth moving to S3 Glacier Flexible Retrieval or S3 Glacier Deep Archive, when in fact more than 50% of the storage may still be made up of large objects, an important consideration for cost optimization.

To understand this setup better, let’s assume our example bucket has three billion objects, with a total storage of 17.08 TB and an average object size of 6.11 KB. Here, 83.73% of the storage is consumed by the largest 0.5% of objects (average object size 1 MB), while the remaining 99.5% of objects (average object size 1 KB) consume only 16.27% of the storage. Identifying the object size distribution is therefore critical to determining the cost benefit, because it tells you the average object size and the amount of storage that would actually be transitioned. In this example, the customer can move 83.73% of the storage to S3 Glacier Flexible Retrieval or S3 Glacier Deep Archive with a 128 KB filter, saving money on the vast majority of the storage.


Figure 2: Object size distribution in large bucket (3 billion objects)

2. Identifying object size distribution using S3 Inventory report, Athena, and QuickSight

Customers with billions of small objects and a very small average object size (in the single-digit KBs), such as those with long-term log storage, often find that there is still a substantial amount of data they can transition to S3 Glacier Flexible Retrieval or S3 Glacier Deep Archive after filtering out objects <128 KB. Identifying the object size distribution of your data helps determine how much you can save, even before you decide to aggregate or transition the objects. For these cases, you should determine the object size distribution using an S3 Inventory report in conjunction with Athena and QuickSight.

The S3 Inventory report contains a list of objects in the bucket, or in a specific prefix, along with each object’s size. We recommend generating the S3 Inventory report in ORC or Parquet format for better query performance. To determine the object size distribution in bands of 128 KB using Athena, create an Athena table over the inventory and run the following query, replacing the database name, table name, and dt partition value with your own.

SELECT CAST(FLOOR(CAST(CASE WHEN "size" - 0 < 0 THEN 0 ELSE "size" - 0 END AS DOUBLE) / 131072) AS BIGINT) * 131072/1024 AS "Distribution Band in KB",
       COUNT(*) AS "Number of Objects in the band",
       CAST(MIN("size") AS DECIMAL (38,2))/1024 AS "Smallest Object size in the band in KB",
       CAST(MAX("size") AS DECIMAL (38,2))/1024 AS "Largest Object size in the band in KB",
       CAST(AVG("size") AS DECIMAL (38,2))/1024 AS "Average Object size in the band in KB",
       CAST(SUM("size") AS DECIMAL (38,2))/1024/1024/1024 AS "Total Band size in GB"
FROM "database.table_name"
WHERE "size" IS NOT NULL and "dt" IN ('2023-05-30-01-00')
GROUP BY CAST(FLOOR(CAST(CASE WHEN "size" - 0 < 0 THEN 0 ELSE "size" - 0 END AS DOUBLE) / 131072) AS BIGINT) * 131072/1024
ORDER BY CAST(FLOOR(CAST(CASE WHEN "size" - 0 < 0 THEN 0 ELSE "size" - 0 END AS DOUBLE) / 131072) AS BIGINT) * 131072/1024 NULLS FIRST

Note that Amazon S3 storage usage is calculated in binary gigabytes (GB), where 1 GB is 2^30 bytes. Therefore, 128 KB is 131,072 bytes.
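If you would rather run the distribution query programmatically than in the Athena console, the sketch below submits a simplified version of the query through the Athena API and prints the result rows. The database, table, and result-location names are hypothetical placeholders.

import time

import boto3

athena = boto3.client("athena")

# Hypothetical names: your Glue database, S3 Inventory table, and an S3
# location where Athena can write query results.
DATABASE = "your_database"
RESULT_LOCATION = "s3://my-athena-results-bucket/object-size-distribution/"

QUERY = """
SELECT CAST(FLOOR(CAST("size" AS DOUBLE) / 131072) AS BIGINT) * 131072/1024 AS band_kb,
       COUNT(*) AS object_count,
       CAST(SUM("size") AS DECIMAL(38,2))/1024/1024/1024 AS band_size_gb
FROM "your_database"."your_inventory_table"
WHERE "size" IS NOT NULL AND "dt" = '2023-05-30-01-00'
GROUP BY 1
ORDER BY 1
"""

execution = athena.start_query_execution(
    QueryString=QUERY,
    QueryExecutionContext={"Database": DATABASE},
    ResultConfiguration={"OutputLocation": RESULT_LOCATION},
)
query_id = execution["QueryExecutionId"]

# Poll until the query reaches a terminal state.
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(2)

if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
    for row in rows[1:]:  # skip the header row
        print([col.get("VarCharValue") for col in row["Data"]])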

The sample Athena query output is shown in Figure 3:


Figure 3: Athena output showing object size distribution

The preceding sample Athena output shows the object size distribution in bands of 128 KB for a large bucket. There are 179,849,716 objects between zero bytes and 128 KB, with the smallest being a zero-byte object and the largest being 127.96 KB within that band. In the next band, between 128 KB and 256 KB, there are 62,538 objects, with the smallest being 128.07 KB and the largest being 255.85 KB. You can increase or decrease the distribution band (128 KB = 131,072 bytes) in the query to identify the optimal filter size, the corresponding number of objects, and the amount of storage that would be transitioned.

Optionally, if you’re interested in visualizing the object size distribution, you can create a QuickSight dataset from the Athena table and plot a default histogram chart of the S3 Inventory data. To identify the number of objects in a specific band, such as 128 KB, you can customize the visual by formatting the histogram, as shown in the following figure.


Figure 4: Distribution of object size (formatted view)

3. Aggregating smaller objects

For objects over 128 KB, you can decide to either keep the objects in their current storage class, such as S3 Standard, or aggregate multiple smaller objects into fewer large objects to save on S3 Glacier metadata costs. The Amazon S3 TAR tool can be used to create a tarball of existing objects in Amazon S3. The S3 TAR tool allows customers to group existing S3 objects into TAR files without having to download them, and it has an option to create the TAR files directly in your preferred storage class. Querying the S3 Inventory report can help identify the objects within the 128 KB to 1 MB band that could be aggregated to save additional costs; the size of this band helps determine whether S3 TAR is the right solution. In our example (see Figure 3), there are 75,638 objects in the 128 KB to 1 MB range with a total storage of 16.38 GB. Here the metadata and transition cost benefit from aggregating before the transition is negligible (approximately $3.14; see Figure 5). Therefore, you can transition these objects as-is, without aggregating them with the S3 TAR tool.


Figure 5: Cost calculation in smaller band (< 1 TB)

However, if the 128 KB to 1 MB band is large (>1 TB) with millions of objects, you can consider aggregating the objects before transitioning them to S3 Glacier Flexible Retrieval or S3 Glacier Deep Archive; an example of the cost savings is shown in Figure 6, and a minimal aggregation sketch follows it. Learn more about S3 TAR and its pricing in the tool’s GitHub repository.


Figure 6: An example showing cost benefits in large band (>1 TB)
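To make the aggregation idea concrete, here is a minimal sketch that packs a batch of small objects into a single tarball and writes it directly to an archive storage class. Note that this is not the S3 TAR tool itself: unlike s3tar, this simplified illustration downloads each object into memory, and the bucket, key, and batch names are hypothetical.

import io
import tarfile

import boto3

s3 = boto3.client("s3")


def aggregate_to_tar(bucket, keys, tar_key, storage_class="DEEP_ARCHIVE"):
    """Pack a small batch of S3 objects into one tarball and upload it
    directly into the target storage class."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as tar:
        for key in keys:
            body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
            member = tarfile.TarInfo(name=key)
            member.size = len(body)
            tar.addfile(member, io.BytesIO(body))
    buffer.seek(0)
    s3.put_object(Bucket=bucket, Key=tar_key, Body=buffer.getvalue(),
                  StorageClass=storage_class)


# Hypothetical usage: aggregate three medium-sized log objects into one archive object.
aggregate_to_tar(
    "my-example-bucket",
    ["logs/2023/05/app-01.log", "logs/2023/05/app-02.log", "logs/2023/05/app-03.log"],
    "archive/logs-2023-05-batch-0001.tar",
)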

4. Object size and S3 Lifecycle transition cost considerations

The size of your objects determines whether using S3 Lifecycle is an effective tool to optimize your costs. This is because of the S3 Lifecycle transition fees that are charged based on the total number of objects being transitioned, the destination storage class (listed on the Amazon S3 pricing page), as well as the additional metadata charges applied.

To illustrate those costs, we show the approximate charges for storing 100 TB of data in the Amazon S3 Standard storage class as opposed to the S3 Glacier Deep Archive storage class. Additionally, the table shows the one-time S3 Lifecycle transition costs to move the objects to S3 Glacier Deep Archive based on the number of objects in the bucket. This table uses illustrative S3 Lifecycle transition costs based on the Amazon S3 pricing page as of August 1st, 2023, in the US East (Ohio) Region; always refer to the Amazon S3 pricing page for the most up-to-date information. In general, using S3 Lifecycle to transition objects to lower-cost storage is most cost-effective for larger objects.

In the following table, for 1.67 billion objects at 64 KB size each, the S3 Lifecycle rule to transition all objects to S3 Glacier Deep Archive would cost approximately $83,500 in a one-time S3 Lifecycle transition cost. After transitioning, you save $1,857.54 per month, which requires 45 months of storing the data in S3 Glacier Deep Archive to break even. Therefore, transitioning smaller objects to S3 Glacier Deep Archive is not ideal if you plan to store the data for less than 45 months in S3 Glacier Deep Archive.

However, for 104.85 million objects of size 1 MB each, the S3 Lifecycle rule to transition all objects to S3 Glacier Deep Archive would cost approximately $5,242 in a one-time S3 Lifecycle transition cost. After the transition, you would realize monthly savings of $2,181.06 and your breakeven period would be approximately three months.
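The two scenarios above can be reproduced with a simple breakeven calculation. The sketch below uses illustrative, flat (non-tiered) US East (Ohio) rates and the 32 KB + 8 KB per-object metadata overhead described earlier, so its results land close to, but not exactly on, the figures in the table; always confirm current prices on the Amazon S3 pricing page.

KB_PER_GB = 1024 * 1024  # binary gigabytes, matching how Amazon S3 meters storage


def breakeven(object_count, avg_object_size_kb,
              src_price_per_gb=0.023,          # S3 Standard (illustrative, flat rate)
              dst_price_per_gb=0.00099,        # S3 Glacier Deep Archive (illustrative)
              transition_price_per_1000=0.05,  # Deep Archive transition requests (illustrative)
              dst_overhead_kb=32, src_overhead_kb=8):
    """Rough one-time cost, monthly savings, and breakeven months for a transition."""
    data_gb = object_count * avg_object_size_kb / KB_PER_GB
    monthly_before = data_gb * src_price_per_gb
    # After the transition: data plus 32 KB/object billed at the archive rate,
    # plus 8 KB/object of metadata still billed at the S3 Standard rate.
    monthly_after = (data_gb + object_count * dst_overhead_kb / KB_PER_GB) * dst_price_per_gb \
        + (object_count * src_overhead_kb / KB_PER_GB) * src_price_per_gb
    one_time = object_count / 1000 * transition_price_per_1000
    savings = monthly_before - monthly_after
    return one_time, savings, one_time / savings


for count, size_kb in [(1_670_000_000, 64), (104_850_000, 1024)]:
    upfront, savings, months = breakeven(count, size_kb)
    print(f"{count:>13,} objects @ {size_kb:>4} KB: upfront ${upfront:,.0f}, "
          f"saves ${savings:,.2f}/month, breakeven ~{months:.1f} months")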


Figure 7: S3 Lifecycle transition and storage cost

You can use the Amazon S3 pricing calculator to estimate the total upfront and monthly costs by inputting the total storage, average object size, and indicating how the data is moved to the destination storage class.

Let’s look at another example (Figure 8) to estimate how much it would cost if you transition only the objects > 128 KB by creating an S3 Lifecycle rule that filters out objects < 128 KB. In this case, only the largest 0.5% of the objects (15 million), with a total storage of 14,648.4375 GB, are eligible for the transition.


Figure 8: Object size distribution in a large bucket (3 billion objects)

In the pricing calculator, as shown in Figure 9, you enter the total storage you would transition to the destination storage class (for example, S3 Glacier Deep Archive), input the average object size, and select that the data is moved to S3 Glacier Deep Archive using S3 Lifecycle requests. In this case, the total upfront cost is $750, which is the one-time S3 Lifecycle transition cost, and the monthly storage cost is $17.59. Here, transitioning 0.5% of the objects from Amazon S3 Standard to S3 Glacier Deep Archive results in a 94.77% reduction in monthly storage cost.


Figure 9: Amazon S3 pricing calculator

Cleaning up

If you created a new Amazon S3 Inventory report to calculate the object size distribution and no longer want to be billed for it, remove the S3 Inventory report files from your S3 bucket, delete the S3 Inventory configuration, and drop the Athena table created for the S3 Inventory report. We also recommend canceling your QuickSight subscription if you used the QuickSight dashboard to visualize the object size distribution and no longer need it.

Conclusion

In this post, we provided guidance on identifying objects in a bucket that are suitable for transitioning to the Amazon S3 Glacier Flexible Retrieval or Amazon S3 Glacier Deep Archive storage classes. Using Amazon Athena, Amazon QuickSight, and the S3 Inventory report, we demonstrated how to calculate the object size distribution, including the number of objects and the amount of storage in each band. We also explained how to use the Amazon S3 pricing calculator to estimate S3 Lifecycle transition and storage costs.

By using the approaches discussed, you can implement optimized S3 Lifecycle rules with an object size filter (excluding objects < 128 KB) to transition suitable objects to S3 Glacier Flexible Retrieval or S3 Glacier Deep Archive in a cost-efficient way. This can help you decrease storage costs and optimize spending on S3 Lifecycle transitions.

Thank you for reading this post. We’re here to help, and if you need further assistance with S3 Lifecycle strategy, reach out to AWS Support and your AWS account team.

Archana Srinivasan

Archana Srinivasan is a Technical Account Manager within Enterprise Support at Amazon Web Services. Archana helps AWS customers leverage Enterprise Support entitlements to solve complex operational challenges and accelerate their cloud adoption.

Arun Kumar SR

Arun Kumar SR is a Cloud Support Engineer at AWS based out of Dallas. Arun focuses on supporting customers in using storage technologies and enjoys helping them take advantage of the cost savings that AWS provides. He likes to travel and enjoys spending time with his family.