AWS Storage Blog

Automating retrievals from the Amazon S3 Glacier storage classes

Faced with increasing amounts of data and a tightening economic climate, enterprises are looking to reduce their storage costs by moving rarely needed data to archival storage options. The least costly options require your internal systems to support receiving data back in hours or days, often called asynchronous retrievals. With this time delay, objects are not immediately accessible: applications receive an error indicating that the object is inaccessible, potentially breaking downstream dependencies in your internal applications.

Amazon S3 customers have the flexibility to choose the archive storage that meets their needs, with the S3 Glacier storage classes providing options for the fastest access to your archive data and the lowest-cost archive storage in the cloud. Customers who need their data synchronously, or near-instantly, can use the S3 Glacier Instant Retrieval storage class. Customers who want the most savings can choose the asynchronous S3 Glacier Flexible Retrieval or S3 Glacier Deep Archive storage classes.

In this post, we show you a simple way to integrate your applications with the two asynchronous S3 Glacier storage classes, S3 Glacier Flexible Retrieval and S3 Glacier Deep Archive, by using a GetObject API wrapper to automate the retrieval process for single objects. We also provide a reference architecture for end user applications to process and fetch restored objects. With this, customers can optimize their storage costs by using the S3 Glacier storage classes for rarely accessed data while avoiding the manual effort and overhead of the object retrieval process. You can incorporate this easy-to-use solution into your existing application workflow for calling and retrieving objects through the GetObject API wrapper, providing a seamless experience for end users without breaking downstream dependencies.

Solution overview

The solution detailed in this post provides a wrapper that customers can integrate with their GetObject API calling application. Whenever your application attempts to access an unavailable object in Amazon S3 or the S3 Glacier storage classes, the wrapper invokes a Python script in response to the GetObject call, which verifies that your AWS account owns the target S3 bucket. The solution also notifies you when the object restore has been initiated as well as when it’s complete. You can opt to feed these notifications into Amazon Simple Queue Service (Amazon SQS) or AWS Lambda, using the following reference architecture, to automatically perform a second GetObject call and complete your retrieval:

Figure: Python script getting the S3 object by first checking its storage class and then restoring it.

Steps 1-4 in the architecture demonstrate our solution workflow. Steps 5a and 5b show two options for downstream processing.

  1. The user application executes the GetObject call to access an object in an S3 storage class.
  2. In response, the GetObject call invokes the Python script.
  3. The script validates your ownership of the S3 bucket through Amazon S3.
  4. After successful validation, Amazon Simple Notification Service (Amazon SNS) sends an email notification whenever the Amazon S3 Event Notifications state updates, including when the object restoration is initiated and when it completes.
  5. The architecture within the dotted lines is optional. You can reference it to develop an automated end-to-end solution. The two options for downstream processing in the bottom reference architecture are:

5a. Amazon SQS maintains the object state in a queue, and your application automatically invokes the GetObject call when the object is restored and ready to be accessed.
5b. Upon receiving the Amazon SNS notification that the object has been restored, Lambda invokes the GetObject call integrated within the user application, as shown in the sketch after this list.
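
As an illustration of option 5b, the following is a minimal sketch of a Lambda handler subscribed to the SNS topic. It is not part of the published solution: the parsing assumes the standard shape of Amazon S3 Event Notifications delivered through SNS, and the downstream handling is a placeholder.

import json
from urllib.parse import unquote_plus

import boto3

s3 = boto3.client("s3")

def lambda_handler(event, context):
    # Each SNS record wraps an S3 event notification as a JSON string.
    for sns_record in event["Records"]:
        s3_event = json.loads(sns_record["Sns"]["Message"])
        for record in s3_event.get("Records", []):
            # Act only when the restore has completed; ignore restore-initiated events.
            if record.get("eventName") != "ObjectRestore:Completed":
                continue
            bucket = record["s3"]["bucket"]["name"]
            key = unquote_plus(record["s3"]["object"]["key"])
            # The restored copy is now readable with a normal GetObject call.
            response = s3.get_object(Bucket=bucket, Key=key)
            # Hand response["Body"] to your application's processing logic here.
            print(f"Retrieved {key} ({response['ContentLength']} bytes) from {bucket}")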

This framework automates data restores in response to a GetObject call for individual objects in S3 Glacier Flexible Retrieval and S3 Glacier Deep Archive, creating a more consistent GetObject call experience across all Amazon S3 storage classes. As an example, companies store data, such as build images and logs, in Amazon S3 Standard for later analysis or for security compliance. Customers can choose to enable an Amazon S3 Lifecycle transition policy to archive objects that have not been accessed for a certain number of days to the S3 Glacier storage classes, which may result in significant savings as compared to Amazon S3 Standard. However, on the occasion that customers want to access this data, they often start by running a GetObject call and receive an error message. Customers typically only realize that the object they are attempting to access is archived by auditing the errors after their application fails to run; they then run a HeadObject call to find the object's storage class, restore the object, and manually re-run the GetObject call.
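
For example, this is the failure mode the wrapper is designed to absorb; the bucket and key names below are hypothetical:

import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")

try:
    s3.get_object(Bucket="example-archive-bucket", Key="logs/build-2023.log")
except ClientError as error:
    # GetObject on an archived, unrestored object fails with InvalidObjectState
    # rather than returning data.
    if error.response["Error"]["Code"] == "InvalidObjectState":
        print("Object is archived; restore it before reading")
    else:
        raise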

Automating retrievals from Amazon S3 Glacier storage classes

In this section, we’ll cover what you need to get started and the implementation details.

Prerequisites

  1. Access to the AWS Management Console, AWS Command Line Interface (AWS CLI), or AWS SDK.
  2. AWS Identity and Access Management (IAM) permissions for the IAM user running the solution, as provided in the GitHub repository.
  3. An Amazon SNS access policy with the permissions provided in the GitHub repository. Make sure that you specify the correct SNS topic ARN and account ID.

Implementation details

Whenever your application attempts to access an object stored in an Amazon S3 storage class using the GetObject call, the script checks the object’s archive state using the HeadObject call and accordingly retrieves the object or initiates a restore. The script also notifies users about the status of the object using Amazon S3 Event Notifications.
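
The following is a minimal boto3 sketch of that check-and-restore flow. The function name, the one-day default, and the Standard retrieval tier are illustrative; the script's actual implementation is in the GitHub repository.

import boto3

s3 = boto3.client("s3")

def get_or_restore(bucket, key, days=1):
    head = s3.head_object(Bucket=bucket, Key=key)
    storage_class = head.get("StorageClass", "STANDARD")
    restore_status = head.get("Restore", "")

    # Objects outside the two asynchronous classes (including S3 Glacier
    # Instant Retrieval) are readable right away.
    if storage_class not in ("GLACIER", "DEEP_ARCHIVE"):
        return s3.get_object(Bucket=bucket, Key=key)

    # A completed restore leaves a temporary copy that GetObject can read.
    if 'ongoing-request="false"' in restore_status:
        return s3.get_object(Bucket=bucket, Key=key)

    # No restore in progress yet: initiate one, keeping the copy for `days` days.
    if 'ongoing-request="true"' not in restore_status:
        s3.restore_object(
            Bucket=bucket,
            Key=key,
            RestoreRequest={"Days": days, "GlacierJobParameters": {"Tier": "Standard"}},
        )

    # Restore is pending; S3 Event Notifications will report when it completes.
    return None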

  1. Navigate to the GitHub repository and clone it to your local machine.
  2. The script requires several Python libraries to run, which are listed in the requirements.txt file. From your cloned repository, run the following command to install them:
pip install -r requirements.txt
  3. To restore an object, run the restoreS3AsyncObjectRestore.py script, providing the following arguments:
restoreS3AsyncObjectRestore.py [-h] -b -k -s [-e]

Command arguments

The following are the details for command arguments:

-h, --help            show this help message and exit
-b, --BucketName      Bucket Name
-k, --Key             prefix/key or key
-s, --SNSArn          SNS Arn
-e, --expirationDays  Expiration in Days (optional)
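
For reference, a minimal argparse definition matching the usage line above might look like the following; the flag spellings follow the argument list, and the actual parser lives in the script itself.

import argparse

parser = argparse.ArgumentParser(
    description="Restore an object from the asynchronous S3 Glacier storage classes"
)
parser.add_argument("-b", "--BucketName", required=True, help="Bucket Name")
parser.add_argument("-k", "--Key", required=True, help="prefix/key or key")
parser.add_argument("-s", "--SNSArn", required=True, help="SNS Arn")
parser.add_argument("-e", "--expirationDays", type=int, default=1,
                    help="Expiration in Days (optional)")
args = parser.parse_args()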

Example
restoreS3AsyncObjectRestore.py -b 's3.outbound.data' -k 'nothing_to_see_here_todo/nothing.csv0110_part_00' -s 'arn:aws:sns:us-east-2:82257177xxxx:snsTest:a616ee10-de3a-4d65-bc54-5c0d5b8b072c' -e 2
Note

If you don’t provide expiration days (-e) and the object resides in S3 Glacier Deep Archive or S3 Glacier Flexible Retrieval, then the script restores the object for one day, and the restored copy is charged at S3 Standard storage class pricing for that day.
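
You can confirm the restored copy's expiration with a HeadObject call: once the restore completes, the Restore response header reports the expiry date. The bucket, key, and timestamp below are illustrative.

import boto3

s3 = boto3.client("s3")
head = s3.head_object(Bucket="example-archive-bucket", Key="logs/build-2023.log")
# Prints something like:
# ongoing-request="false", expiry-date="Fri, 21 Jul 2023 00:00:00 GMT"
print(head.get("Restore"))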

To learn more about how the solution works, including the implementation algorithm and pseudocode, see the README.md in the GitHub repository.

Conclusion

In this post, we laid out a solution to automate the retrieval process for single objects in the Amazon S3 Glacier Flexible Retrieval and S3 Glacier Deep Archive storage classes. The solution provides a wrapper that you can integrate into your application to call the GetObject API. You can now call a single API from your applications to retrieve an S3 object regardless of its storage class. Your software developers get a consistent API while you optimize your storage costs.

For more information on storage classes, you can refer to the Amazon S3 documentation.

Thank you for reading. Feel free to provide comments or feedback in the comments section.

Ballu Singh

Ballu Singh is a Principal Solutions Architect at AWS. He lives in the San Francisco Bay area and helps customers architect and optimize applications on AWS. In his spare time, he enjoys reading and spending time with his family.

Ananta Khanal

Ananta Khanal is a Solutions Architect focused on Cloud storage solutions at AWS. He has worked in IT for over 15 years, and held various roles in different companies. He is passionate about cloud technology, infrastructure management, IT strategy, and data management.

Pranjal Gururani

Pranjal Gururani is a Solutions Architect at AWS based out of Seattle. Pranjal works with various customers to architect cloud solutions that address their business challenges. He enjoys hiking, kayaking, skydiving, and spending time with family during his spare time.

Neha Joshi

Neha Joshi is a Senior Solutions Architect at AWS. She loves to brainstorm and develop solutions to help customers be successful on AWS. Outside of work, she enjoys hiking and audio books.