AWS Storage Blog

How Pinterest uses Amazon S3 Glacier Deep Archive to manage storage for its visual discovery engine

Pinterest is the visual discovery engine with a mission to bring everyone the inspiration to create a life they love. It’s one of the biggest datasets of ideas ever assembled online, with over 300 billion Pins with ideas around home, food, style, beauty, travel, and more. More than 440 million people around the world use Pinterest to dream about, plan, and prepare for things they want to do in life. This all translates to a lot of data being generated, ingested, and analyzed. Having this data is important to help people discover things that inspire them. Amazon S3 is one of our main data storage solutions and an important pillar of Pinterest’s storage strategy. As large-scale users of S3, Pinterest stores billions of objects and nearly an exabyte of data across multiple AWS Regions.

My name is Yi Yin and I am one of the software engineers on the Storage Governance team in Data Engineering at Pinterest. Our team was formed to help minimize storage cost while enabling a culture of data-driven decision making. We work closely with all engineering teams at Pinterest to help them make data retention decisions. We also work closely with AWS with respect to S3 performance and optimal data storage. To meet our large-scale S3 cost goals, we use Amazon S3 Lifecycle to optimize our data’s S3 storage class assignments. In particular, this blog post discusses how Pinterest uses the Amazon S3 Glacier Deep Archive storage class, the lowest-cost storage in the cloud, to keep our overall storage goals on track. The Amazon S3 Glacier storage classes provide a world-class solution that is important for Pinterest’s long-term archival storage needs. We hope this blog provides useful insights to other S3 users to help optimize your storage efficiency.

Evaluating Amazon S3 Glacier Deep Archive with Pinterest’s storage insights

When Amazon S3 Glacier Deep Archive was first announced in 2019, we knew it would satisfy most of our storage archival needs (cost, durability, restore time), while providing significant storage savings when compared to S3 Glacier Flexible Retrieval (formerly S3 Glacier) and the other storage classes. At the time of the S3 Glacier Deep Archive launch, Pinterest had data that was already in S3 Glacier Flexible Retrieval and we wanted to identify additional datasets that could bypass S3 Glacier Flexible Retrieval and be transitioned directly into S3 Glacier Deep Archive to take advantage of the lower storage cost.

Enabling S3 Glacier Deep Archive was easy through S3 Lifecycle transition rules (a sample rule follows the list below). But there were questions we needed to answer to ensure we were correctly adopting an archival storage class into our existing storage environment, such as:

  • How to identify datasets best-suited for S3 Glacier Deep Archive, and estimate potential savings
  • How to roll out a new archival storage class to large amounts of datasets
  • How to handle data restore from S3 Glacier Deep Archive while maintaining storage efficiency
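
To give a concrete picture of the enablement step mentioned above, here is a minimal boto3 sketch of a transition rule; the bucket name, prefix filter, and 180-day threshold are placeholders rather than our production values:

import boto3

s3 = boto3.client('s3')

# Placeholder bucket, prefix, and age threshold; adjust to your retention policy.
s3.put_bucket_lifecycle_configuration(
    Bucket='my-bucket',
    LifecycleConfiguration={
        'Rules': [{
            'ID': 'archive-to-deep-archive',
            'Status': 'Enabled',
            'Filter': {'Prefix': 'my_prefix/'},
            'Transitions': [{'Days': 180, 'StorageClass': 'DEEP_ARCHIVE'}],
        }]
    })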

We utilized our in-house “storage insights” pipeline to help us answer the above questions. This pipeline is a part of our toolset that provides visibility into our current storage footprint, access patterns, and ownership information for all datasets. This information is partially powered by inputs from Amazon S3, namely S3 Server Access Logging and S3 Inventory reports. These inputs are aggregated in Pinterest’s Hadoop/Spark clusters, and exposed as tables to allow for analysis that powers storage savings identification and ownership attribution, and helps answer other questions regarding our S3 access patterns.

Figure 1: S3 storage insights pipeline flow chart

Ingesting S3 server access logs allows us to aggregate S3 access information. We also ingest tags that identify active owners (writers) and users (readers) to build out ownership information.

Ingesting S3 Inventory reports helps build a detailed snapshot of the size, storage class, and object count of all our datasets. When joined with the access information above, we can build out access trends (frequency) of a given dataset over time. Further aggregation creates a helpful access ratio summary.
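
As an illustration of the kind of aggregation this enables, the Spark SQL sketch below joins a hypothetical inventory table with a hypothetical access-log table to compute per-dataset size and access ratio; the table and column names (s3_inventory, s3_access_log, dataset_prefix, and so on) are made up for this example and differ from our production schema:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('storage-insights-example').getOrCreate()

# Hypothetical table and column names, for illustration only.
access_trends = spark.sql("""
    SELECT inv.dataset_prefix,
           inv.storage_class,
           SUM(inv.size)                                     AS total_bytes,
           COUNT(DISTINCT inv.key)                           AS total_objects,
           COUNT(DISTINCT acc.key) / COUNT(DISTINCT inv.key) AS access_ratio
    FROM s3_inventory inv
    LEFT JOIN s3_access_log acc
      ON inv.bucket = acc.bucket AND inv.key = acc.key
    GROUP BY inv.dataset_prefix, inv.storage_class
""")
access_trends.show()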

If you are interested in increasing your S3 usage visibility, we invite you to use our pipeline as a reference. In addition, utilizing AWS features such as S3 Storage Lens, along with proper S3 object tagging, can also help increase visibility into your S3 usage. S3 Storage Lens provides access information at a user-defined prefix depth (delimited by ‘/’), broken down by storage class, allowing you to identify portions of a bucket with low-to-zero access activity. This insight is important for identifying candidate datasets for S3 Glacier Deep Archive.

Identifying candidate datasets and rolling out Amazon S3 Glacier Deep Archive

Equipped with these helpful storage insights, we were able to identify which datasets with existing S3 Glacier Flexible Retrieval data should be migrated to S3 Glacier Deep Archive, as well as which datasets should start transitioning directly into S3 Glacier Deep Archive. Here is a usage example of our insights data:


Figure 2: Storage insights datasets highlighting the storage class and size of different datasets

In Figure 2, we can see that the dataset s3://my-bucket/my_prefix/of/arbitrary/01/length/ has data split between the STANDARD and GLACIER storage classes. It’s likely we’ll want to move the S3 Glacier Flexible Retrieval data to S3 Glacier Deep Archive, since Glacier data is rarely accessed, making it a natural fit for S3 Glacier Deep Archive.

We also wanted to identify datasets with low access (read) ratios that were not taking advantage of any storage class savings. This information allowed us to pick the correct type of data to archive, with a low chance of that data needing to be restored later. Figure 3 shows an example of such a query:


Figure 3: Storage insights access data, identifying datasets with low access ratio

In Figure 3, we can see that for datasetA and datasetB, reads cover 100% of the data (access = 1.0), which tells us it’s not ideal to place their older data into S3 Glacier Deep Archive. Meanwhile, datasetD only reads 0.009% of the available data, and the largest delta between an object’s creation date and its access (read) date is only 6 days. Since the oldest object’s timestamp is about two years old, the older objects in this dataset are likely not accessed, making it ideal for S3 Glacier Deep Archive. Data owners can also use this information to decide whether they want to simply delete the dormant portion of their dataset.


Figure 4: S3 Glacier Flexible Retrieval (formerly S3 Glacier) to S3 Glacier Deep Archive usage over time

Our insights data allowed for a speedy migration of existing S3 Glacier Flexible Retrieval data into S3 Glacier Deep Archive because it allowed users to understand their data access patterns. Some data owners were not aware of their long-term data already in S3 Glacier Flexible Retrieval, while others did not realize they accessed only a low percentage of their data outside of S3 Glacier Flexible Retrieval. Having this access pattern information helped owners understand their usage, and in some cases led them to delete their data instead of migrating it to S3 Glacier Deep Archive, skipping the migration altogether. We still use this valuable analysis today when helping data owners identify dormant data and decide whether to archive it with S3 Glacier Deep Archive or shorten its retention.

S3 Intelligent-Tiering is also a storage class we want to consider for identifying dormant data going forward. Placing data in S3 Intelligent-Tiering can help verify that data is not being accessed: if it continuously stays in the Infrequent Access tier, it is an ideal candidate for S3 Glacier Flexible Retrieval or S3 Glacier Deep Archive.

Standardizing S3 Glacier Deep Archive usage and handling restores

The importance of identifying datasets that fit S3 Glacier Deep Archive’s infrequent-access profile is highlighted by the restore time and cost. If we had not gone through the above exercise and had aggressively placed older objects into S3 Glacier Deep Archive, we would have increased the likelihood of restore requests. Frequent restores defeat the immense savings one would otherwise realize with S3 Glacier Deep Archive, both in engineering time (users waiting 12 to 48 hours for data to be restored) and in the increased costs of restore requests and storage of restored objects.

Having all this analysis also allows us to standardize what should be restored. We already know the data is highly unlikely to be needed again, allowing us to strictly limit restores to absolutely critical use cases. The storage insights pipeline also lets us estimate the cost of a restore from the storage class, size, and object count information built from the S3 Inventory report.
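
As an example of that estimate, a rough calculation can be parameterized with current prices from the AWS price list; the formula below is a simplification for illustration (retrieval fee, request fee, and the temporary restored copy), not an exact billing model:

def estimate_bulk_restore_cost(total_gb, object_count,
                               retrieval_price_per_gb,
                               request_price_per_1000,
                               standard_price_per_gb_month,
                               restore_days=7):
    # Rough estimate: bulk retrieval fee, plus restore request fee, plus the
    # cost of keeping a temporary restored copy around for restore_days.
    retrieval = total_gb * retrieval_price_per_gb
    requests = (object_count / 1000.0) * request_price_per_1000
    temp_copy = total_gb * standard_price_per_gb_month * (restore_days / 30.0)
    return retrieval + requests + temp_copy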

To complete the integration of S3 Glacier Deep Archive and reach our efficiency goals, the last piece is issuing the restore. We do this by using the following command:

aws s3api restore-object --bucket my-bucket --key prefix/to/my/object --restore-request '{"Days":7,"GlacierJobParameters":{"Tier":"Bulk"}}'

Often the restore needs to be issued on millions of objects, which is time-consuming, and rate limits must be considered. Once restore requests are deemed critical enough, the orchestration is:

  1. Generate a list of objects to be restored
  2. Issue restore-object on every object in the list

Generally, a restore request for us covers the objects under a list of prefixes, which needs to be translated into an absolute list of objects. Pinterest has an internal job (utilizing boto3) that issues list-object commands with some parallelism to generate an up-to-date view of the prefixes. Here is the code snippet:

import itertools
from functools import partial
from multiprocessing.pool import ThreadPool

import boto3


def _list_in_one_prefix(bucket, client, prefix):
    # Page through every object under a single prefix and return "bucket,key" rows.
    paginator = client.get_paginator('list_objects')
    operation_parameters = {'Bucket': bucket, 'Prefix': prefix, 'MaxKeys': 1000}
    page_iterator = paginator.paginate(**operation_parameters)
    objects = []
    for page in page_iterator:
        if 'Contents' not in page:
            continue
        objects.extend(bucket + ',' + c['Key'] for c in page['Contents'])
    return objects


def _list_prefixes(bucket, prefix_list):
    # boto3 clients are thread-safe, so one client is shared across the pool.
    client = boto3.client('s3')
    pool = ThreadPool(10)  # check up to 10 prefixes at a time
    objects = pool.map(partial(_list_in_one_prefix, bucket, client), prefix_list)
    pool.close()
    pool.join()
    return itertools.chain.from_iterable(objects)

We recommend adding retries in your retrieval process for stability.
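
One simple way to do that is through boto3’s built-in retry configuration; the snippet below is a minimal sketch, and the retry mode and attempt count are illustrative rather than the exact values we use:

import boto3
from botocore.config import Config

# Illustrative retry settings; tune max_attempts and mode for your workload.
retry_config = Config(retries={'max_attempts': 10, 'mode': 'adaptive'})
client = boto3.client('s3', config=retry_config)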

We chose the above scripting solution because we have a lot of insight into our data and tooling built around our data workflow. If you don’t have these insights, you can achieve object list generation at scale by configuring an S3 Inventory report scoped to the prefix(es) to be restored. S3 Inventory reports can be generated at either daily or weekly intervals.
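
If you go that route, the inventory configuration can be created programmatically; in the boto3 sketch below, the bucket names, configuration ID, and prefix filter are placeholders:

import boto3

s3 = boto3.client('s3')

# Placeholder bucket names, configuration ID, and prefix filter.
s3.put_bucket_inventory_configuration(
    Bucket='my-bucket',
    Id='restore-candidates',
    InventoryConfiguration={
        'Id': 'restore-candidates',
        'IsEnabled': True,
        'IncludedObjectVersions': 'Current',
        'Filter': {'Prefix': 'prefix/to/restore/'},
        'Destination': {'S3BucketDestination': {
            'Bucket': 'arn:aws:s3:::my-inventory-bucket',
            'Format': 'CSV',
            'Prefix': 'inventory'}},
        'OptionalFields': ['Size', 'StorageClass'],
        'Schedule': {'Frequency': 'Daily'},
    })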

After the list of objects is generated, there are multiple ways to issue restore commands. In most scenarios, we use an in-house job, with parallelism similar to the list-object job above, that issues restore-object commands. We find that this is generally sufficient for our restore needs (millions of objects).
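
A minimal sketch of the per-object call such a job issues is shown below; the "bucket,key" row format matches the output of the listing job above, while the handling of RestoreAlreadyInProgress is an assumption we add here for idempotent re-runs:

import boto3
from botocore.exceptions import ClientError


def _restore_one_object(client, days, bucket_and_key):
    # bucket_and_key is a "bucket,key" row produced by the listing job above.
    bucket, key = bucket_and_key.split(',', 1)
    try:
        client.restore_object(
            Bucket=bucket,
            Key=key,
            RestoreRequest={'Days': days,
                            'GlacierJobParameters': {'Tier': 'Bulk'}})
    except ClientError as e:
        # A restore that is already in flight for this object can be skipped.
        if e.response['Error']['Code'] != 'RestoreAlreadyInProgress':
            raise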

For larger numbers of objects, we use S3 Batch Operations to issue retrievals instead. The object list above is fed in as input, and AWS issues and monitors the restore jobs. S3 Batch Operations is more user-friendly than the scripting solution, as it scales well to large numbers of objects and includes job progress monitoring. It also generates a completion report of successes and failures, making it easy to retry the job with just the failed objects.
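
For reference, a minimal sketch of creating such a job with boto3 follows; the account ID, IAM role, bucket ARNs, and manifest ETag are all placeholders:

import boto3

s3control = boto3.client('s3control')

# All identifiers below (account ID, role, bucket ARNs, ETag) are placeholders.
s3control.create_job(
    AccountId='111122223333',
    ConfirmationRequired=True,
    Priority=10,
    RoleArn='arn:aws:iam::111122223333:role/batch-restore-role',
    Operation={'S3InitiateRestoreObject': {'ExpirationInDays': 7,
                                           'GlacierJobTier': 'BULK'}},
    Manifest={
        'Spec': {'Format': 'S3BatchOperations_CSV_20180820',
                 'Fields': ['Bucket', 'Key']},
        'Location': {'ObjectArn': 'arn:aws:s3:::my-bucket/manifests/restore.csv',
                     'ETag': 'example-manifest-etag'}},
    Report={'Bucket': 'arn:aws:s3:::my-bucket',
            'Prefix': 'batch-reports',
            'Format': 'Report_CSV_20180820',
            'Enabled': True,
            'ReportScope': 'FailedTasksOnly'})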

Users can also reuse the same object list to periodically issue head-object commands to check on the status of the restore (a small status-check sketch follows the list below). The status of a restore looks like this in the head-object response:

  • Object is being restored: "Restore": "ongoing-request=\"true\"",
  • Object is restored: "Restore": "ongoing-request=\"false\", expiry-date=\"Sun, 13 Aug 2017 00:00:00 GMT\"",
  • Otherwise, this object is not a part of a restore
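
A minimal boto3 sketch of that check, which only assumes the Restore field shown above:

import boto3

client = boto3.client('s3')


def restore_status(bucket, key):
    # Returns 'in-progress', 'restored', or 'not-restored' based on the
    # Restore field of the head-object response.
    head = client.head_object(Bucket=bucket, Key=key)
    restore = head.get('Restore')
    if restore is None:
        return 'not-restored'
    if 'ongoing-request="true"' in restore:
        return 'in-progress'
    return 'restored'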

Conclusion

The Amazon S3 Glacier Deep Archive storage class is a great option for long-lived, but seldom accessed datasets. Using S3 Glacier Deep Archive correctly can introduce significant savings to your business.

We walked through the initial evaluation, rollout, and usage of S3 Glacier Deep Archive as part of Pinterest’s storage efficiency efforts. We discussed the importance of having visibility into S3 usage, which allowed us to correctly identify the data best suited for S3 Glacier Deep Archive. Adopting S3 Glacier Deep Archive into our storage strategy introduced considerable savings, in the millions of dollars per year. Our methodology also reduced the likelihood of S3 Glacier Deep Archive restore scenarios, allowing us to maximize our savings and implement clear guidelines on how archived data should be handled.

Pinterest and AWS are constant collaborators, and having insight into S3 usage information allows for quick and correct evaluation of new features. We hope that our experience has inspired you to consider using S3 Glacier Deep Archive for your long-term, infrequently accessed data, as well as to increase your organization’s storage usage visibility to help with technology decision-making and strategy.

We’d like to thank current and former teammates who contributed to the storage efficiency impact at Pinterest: Shawn Nguyen, William Tom, Zirui Li, Chunyan Wang, and Bryant Xiao.

Thank you for reading this blog post! To learn more about how Pinterest uses storage and compute solutions on AWS to provide the scale, speed, and security its platform requires, check out the Pinterest on AWS page. We look forward to reading your feedback and questions in the comments section.

The content and opinions in this post are those of the third-party author and AWS is not responsible for the content or accuracy of this post.

Yi Yin

Yi is a Software Engineer in Data Engineering at Pinterest. Yi has 5 years of experience in data pipeline infrastructure, with the last two years as a part of the storage governance efforts at Pinterest, focusing on storage efficiency. Yi has led company-wide initiatives to help teams ensure their data is properly stored. Yi is based in San Francisco, CA and loves to cheer on her hometown Raptors.

Bin Yang

Bin is a data engineering tech lead at Pinterest with a broad set of experience engineering and architecting efficient and scalable data solutions. Bin is currently building various big data tools and platforms for Pinterest to enhance our data-based decision-making capabilities. Bin is based in San Francisco, CA and enjoys spending time with his family, traveling, and cooking.

Xiaoning Kou

Xiaoning is a Software Engineer in Data Engineering at Pinterest. Xiaoning brings expertise on big data analytics to data platforms at Pinterest. Recently she's been driving the storage efficiency effort. Xiaoning is based in San Francisco, CA.