AWS Storage Blog

Restoring archived objects at scale from the Amazon S3 Glacier storage classes

Every organization around the world has archival data. There is a data archiving need not only for companies that have been around for a while, but also for digital native businesses. Workloads such as medical records, news media content, and manufacturing datasets often store petabytes of data, or billions of objects, indefinitely. The vast majority of data in the world is cold and rarely accessed, and millions of customers globally choose to archive this vital data in Amazon S3.

Within Amazon S3, you can choose from three archive storage classes optimized for different access patterns and storage duration. For archive data that needs immediate access, such as medical images, news media assets, or genomics data, the Amazon S3 Glacier Instant Retrieval storage class delivers the lowest cost storage with milliseconds retrieval. For archive data that does not require immediate access, such as backup or disaster recovery use cases, Amazon S3 Glacier Flexible Retrieval provides three retrieval options: expedited retrievals in 1-5 minutes, standard retrievals in 3-5 hours, and free bulk retrievals in 5-12 hours. To save even more on long-lived archive storage such as compliance archives and digital media preservation, Amazon S3 Glacier Deep Archive is the lowest cost storage in the cloud with data retrieval within 12 hours using the standard retrieval option, or 48 hours using the bulk retrieval option.

Customers use the S3 Glacier Flexible Retrieval and S3 Glacier Deep Archive storage classes to archive large amounts of data at a very low cost. Customers use these storage classes to store their backups, data lakes, media assets, and other archives. These customers often need to retrieve millions, or even billions, of objects quickly when restoring backups, responding to audit requests, retraining machine learning models, or performing analytics on historical data.

Now, for such workloads, you can restore archived data faster. The S3 Glacier Flexible Retrieval and S3 Glacier Deep Archive storage classes now deliver up to a 10x improvement in restore request rate. When bringing data back from these storage classes, you can issue object restore requests at up to 1,000 transactions per second (TPS) per account per AWS Region. The improved restore rate allows your applications to initiate restore requests at a much faster rate, significantly reducing the restore completion time for datasets composed of small objects. The benefit of the improved restore request rate increases as the number of restore requests increases.

In this blog, we discuss best practices to optimize, simplify, and streamline restoring large datasets from the S3 Glacier Flexible Retrieval and S3 Glacier Deep Archive storage classes, leveraging Amazon S3 Batch Operations, Amazon S3 Inventory, and Amazon Athena.

Restoring a large number of objects using S3 Batch Operations

S3 Batch Operations is a managed solution for performing batch actions across billions of objects and petabytes of data with a single request. S3 Batch Operations automatically uses up to 1,000 TPS when restoring objects from S3 Glacier Flexible Retrieval or S3 Glacier Deep Archive using the standard and bulk retrieval options. S3 Batch Operations also manages retries, tracks progress, generates reports, and delivers events to AWS CloudTrail, providing a fully managed, auditable, and serverless experience.

In this blog, we use the following naming convention for the AWS resources:

  • 111122223333 as the AWS account number
  • archive-bucket as the S3 bucket where we have the archived dataset
  • inventory-bucket as the S3 bucket where we store the inventory reports
  • athena-bucket as the S3 bucket where we store Amazon Athena query results
  • reports-bucket as the S3 bucket where we store S3 Batch Operations completion reports

Configuring the Amazon S3 Inventory

Using Amazon S3 Inventory, you can create a list of objects for the restore job. S3 Inventory reports provide you with a list of objects and their corresponding metadata on a daily or weekly basis for an S3 bucket or a shared prefix. Before setting up the inventory report, we create a bucket policy on the S3 bucket inventory-bucket to allow delivery of the S3 Inventory reports, by executing the following AWS CLI command.

You should replace the bucket names and the AWS account number with your correct values.

aws s3api put-bucket-policy \
    --bucket inventory-bucket \
    --policy file://policy.json

policy.json:

{
    "Version": "2012-10-17",
    "Id": "S3-Console-Auto-Gen-Policy-1656541301560",
    "Statement": [{
        "Sid": "InventoryPolicy",
        "Effect": "Allow",
        "Principal": {
            "Service": "s3.amazonaws.com"
        },
        "Action": "s3:PutObject",
        "Resource": "arn:aws:s3:::inventory-bucket/*",
        "Condition": {
            "StringEquals": {
                "s3:x-amz-acl": "bucket-owner-full-control",
                "aws:SourceAccount": "111122223333"
            },
            "ArnLike": {
                "aws:SourceArn": [
                    "arn:aws:s3:::archive-bucket"
                ]
            }
        }
    }]
}

Then, we configure the inventory of the bucket archive-bucket containing the dataset we want to restore, by executing the following CLI command. Remember to replace the bucket names and the AWS account number with your correct values.

aws s3api put-bucket-inventory-configuration \
    --bucket archive-bucket \
    --id inventory_for_restore \
    --inventory-configuration file://inventory-configuration.json

inventory-configuration.json:

{
	"Destination": {
		"S3BucketDestination": {
			"AccountId": "111122223333",
			"Bucket": "arn:aws:s3:::inventory-bucket",
			"Format": "CSV"
		}
	},
	"IsEnabled": true,
	"Id": "inventory_for_restore",
	"IncludedObjectVersions": "Current",
	"Schedule": {
		"Frequency": "Daily"
	},
	"OptionalFields": ["Size", "LastModifiedDate", "StorageClass"]
}

Optimizing the S3 Inventory report using Amazon Athena

When we restore large datasets composed of millions, or even billions, of archived objects, for best performance it is best to issue the restore requests in the same order in which the objects were archived in the S3 Glacier storage classes. Usually, objects are archived to the S3 Glacier storage classes using Amazon S3 Lifecycle transition rules based on the object creation date. Therefore, we can order the restore requests by the objects' creation date using the LastModifiedDate metadata included in the inventory report.

In this step, I explain how to order the inventory report by LastModifiedDate and generate the manifest file that we will use to create the batch restore job with S3 Batch Operations.

First, we import the inventory report into Amazon Athena and execute a few simple SQL queries to include only objects in the S3 Glacier Flexible Retrieval storage class. (Note that we are focusing this walkthrough on S3 Glacier Flexible Retrieval, but you can follow the same process for S3 Glacier Deep Archive.) We then order the inventory by LastModifiedDate to optimize the efficiency of the batch restore operation.

The S3 Inventory report with the symlink.txt file is published to the following location:

s3://<destination-bucket>/<source-bucket>/<id>/hive/dt=yyyy-mm-dd-hh-mm/

In my case the location is:

s3://inventory-bucket/archive-bucket/inventory_for_restore/hive/dt=2022-11-05-00-00/

To create the table in Athena, we run the following query in the Amazon Athena console.

[Screenshot: Amazon Athena console showing the table creation query]
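
The exact statement in the screenshot depends on your inventory configuration; with the CSV inventory configured earlier (bucket, key, size, last_modified_date, storage_class), a table definition along the following lines should work. All columns are declared as strings because that is what OpenCSVSerde returns, and the ISO-8601 last_modified_date string still sorts chronologically. Adjust the table name and the LOCATION to match your own inventory delivery path.

CREATE EXTERNAL TABLE inventory_for_restore(
    `bucket` string,
    `key` string,
    size string,
    last_modified_date string,
    storage_class string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
LOCATION 's3://inventory-bucket/archive-bucket/inventory_for_restore/hive/dt=2022-11-05-00-00/';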

To verify that the table is populated correctly, we run the following query:

[Screenshot: Amazon Athena table verification query]
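
For example, a quick spot check such as the following (using the table name from the sketch above) is enough to confirm that the columns line up as expected:

SELECT * FROM inventory_for_restore LIMIT 10;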

When we specify a manifest in CSV format for an S3 Batch Operations job, each row in the file must include only the bucket name, the object key, and the VersionId (if versioning is enabled). We run the following query, selecting only the bucket and key columns (versioning is not enabled on our bucket), filtering the inventory to objects in the S3 Glacier Flexible Retrieval storage class, and ordering the list by last_modified_date.

[Screenshot: Amazon Athena query that filters and orders the inventory]
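
With the table defined as in the sketch above, a query along these lines produces the manifest rows. GLACIER is the storage class value that S3 Inventory reports for objects in S3 Glacier Flexible Retrieval (S3 Glacier Deep Archive objects are reported as DEEP_ARCHIVE), and ordering by the ISO-8601 last_modified_date string keeps the list in chronological order:

SELECT bucket, key
FROM inventory_for_restore
WHERE storage_class = 'GLACIER'
ORDER BY last_modified_date;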

We then select the Recent queries tab.

[Screenshot: Amazon Athena Recent queries tab]

In the Recent queries tab, we take note of the query Execution ID.

[Screenshot: Amazon Athena query editor showing the query Execution ID]
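
If you prefer the CLI to the console for this step, you can also list recent query execution IDs; for example, the following command returns the most recently submitted queries in the default workgroup, with the query we just ran typically listed first:

aws athena list-query-executions --max-results 5 --region us-east-1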

We then execute the following CLI command to find the output location of the previous query.

aws athena get-query-execution --query-execution-id <ExecutionID> --region us-east-1 --query 'QueryExecution'.'ResultConfiguration'.'OutputLocation'

Command output: "s3://athena-bucket/<ExecutionID>.csv"

We download the output of the query from the S3 bucket athena-bucket where we store the result of the queries we previously ran in Athena, to the local file inventory_for_restore.csv. Remember to replace athena-bucket with your correct bucket name.

aws s3 cp s3://athena-bucket/<ExecutionID>.csv ./inventory_for_restore.csv

We remove the first line containing the headers from the manifest file.

sed -i '1d' inventory_for_restore.csv

We then upload the manifest file to the inventory-bucket bucket.

aws s3 cp inventory_for_restore.csv s3://inventory-bucket/

Finally, we get the ETag of the manifest file and we take note of it as we will need it when we create the S3 Batch Operations job.

aws s3api head-object --bucket inventory-bucket \
--key inventory_for_restore.csv \
--query 'ETag'

Command output: "\"<ETag>\""

Creating the S3 Batch Operations restore job

Now, we need to create the IAM policy and the IAM role necessary to run the S3 Batch Operations job.

To create the IAM policy, we execute the following CLI command, where reports-bucket is the S3 bucket where we store the S3 Batch Operations job's completion reports. Remember to replace archive-bucket, inventory-bucket, and reports-bucket with your correct bucket names.

aws iam create-policy --policy-name s3batch-policy --policy-document file://policy.json

policy.json:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:RestoreObject"
            ],
            "Resource": "arn:aws:s3:::archive-bucket/*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:GetObjectVersion"
            ],
            "Resource": [
                "arn:aws:s3:::inventory-bucket/inventory_for_restore.csv"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "s3:PutObject"
            ],
            "Resource": [
                "arn:aws:s3:::reports-bucket/*"
            ]
        }
    ]
}

To create the IAM role, we execute the following CLI command:

aws iam create-role --role-name s3batch-role --assume-role-policy-document file://trust-policy.json

trust-policy.json:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Service": "batchoperations.s3.amazonaws.com"
            },
            "Action": "sts:AssumeRole"
        }
    ]
}

Finally, we attach the IAM policy to the IAM role executing the following CLI command:

aws iam attach-role-policy --policy-arn arn:aws:iam::111122223333:policy/s3batch-policy --role-name s3batch-role

Now, we create the S3 Batch Operations job to restore the objects listed in the manifest file we previously created. For this blog, we configure the job to keep the temporary restored copy for only one day, to avoid unnecessary storage costs. The number of days you specify should be based on the use case you are considering.

aws s3control create-job \
--account-id 111122223333 \
--operation '{"S3InitiateRestoreObject": { "ExpirationInDays": 1, "GlacierJobTier":"STANDARD"} }' \
--report '{"Bucket":"arn:aws:s3:::reports-bucket ","Prefix":"batch-op-restore-job", "Format":"Report_CSV_20180820","Enabled":true,"ReportScope":"FailedTasksOnly"}' \
--manifest '{"Spec":{"Format":"S3BatchOperations_CSV_20180820", "Fields":["Bucket","Key"]},"Location":{"ObjectArn":"arn:aws:s3::: inventory-bucket/inventory_for_restore.csv", "ETag":"<ETag>"}}' \
--priority 10 \
--role-arn arn:aws:iam::111122223333:role/s3batch-role \
--region us-east-1

The command outputs the job ID, which we then use to start the job by executing the following command:

aws s3control update-job-status \
--account-id 111122223333 \
--job-id <JobID> \
--requested-job-status Ready

We can then monitor the status of the job by running the following CLI command:

aws s3control describe-job \
--account-id 111122223333 \
--job-id <JobID> \
--query 'Job'.'ProgressSummary'

Output of the command:

{
    "TotalNumberOfTasks": 1000000,
    "NumberOfTasksSucceeded": 15686,
    "NumberOfTasksFailed": 0,
    "Timers": {
        "ElapsedTimeInActiveSeconds": 263
    }
}

Understanding the S3 Batch Operations job’s completion report

When restoring millions or even billions of objects, we can expect a few operations to fail. For the job we just ran, five restore operations failed.

To see the summary of our S3 Batch Operations job, we run the following command:

aws s3control describe-job \
--account-id 111122223333 \
--job-id <JobID> \
--query 'Job'.'ProgressSummary'

Output of the command:

{
    "TotalNumberOfTasks": 1000000,
    "NumberOfTasksSucceeded": 999995,
    "NumberOfTasksFailed": 5,
    "Timers": {
        "ElapsedTimeInActiveSeconds": 1154
    }
}

We can find additional information about the failed tasks by downloading and opening the S3 Batch Operations job's completion report. First, we execute the following command to find the name of the report file in our S3 bucket:

aws s3 ls s3://reports-bucket/batch-op-restore-job/job-<JobID>/results/

Output of the command:

2022-11-06 23:50:13       1670 79d4ee72d460fdf927b4071002ad9191f025c228.csv

Then we execute the following command to download the report file:

aws s3 cp s3://reports-bucket/batch-op-restore-job/job-<JobID>/results/79d4ee72d460fdf927b4071002ad9191f025c228.csv ./restore-job-report.csv

Completion reports for failed tasks are in the following format:

Bucket,Key,VersionId,TaskStatus,HTTPStatusCode,ErrorCode,ResultMessage

You can find additional information about S3 Batch Operations completion reports in the Amazon S3 User Guide.

To better understand the failures, we output only the Key, HTTPStatusCode, and ErrorCode columns by executing the following command:

awk -F "\"*,\"*" '{print $2,$5,$6}' restore-job-report.csv

Output of the command:

archives/file_b17179 409 RestoreAlreadyInProgress
archives/file_l17223 409 RestoreAlreadyInProgress
archives/file_d29577 409 RestoreAlreadyInProgress
archives/file_g44242 409 RestoreAlreadyInProgress
archives/file_d30756 409 RestoreAlreadyInProgress

We had five restore operations fail because a restore operation for those objects was already in progress. S3 Batch Operations is an at-least-once execution engine, which means it performs at least one invocation per key in the provided manifest. In rare cases, there might be more than one invocation per key, as happened here.

We now confirm that the objects have been restored correctly, and we don’t need to retry the restore operations.

aws s3api head-object --bucket archive-bucket --key archives/file_b17179 --query 'Restore'

Output of the command:

"ongoing-request=\"false\", expiry-date=\"Wed, 09 Nov 2022 00:00:00 GMT\""
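
Rather than checking each failed key one at a time, you can loop over the failure report with a few lines of shell. This is a minimal sketch, assuming the report format shown above, that versioning is disabled on archive-bucket, and that the keys do not need URL-decoding:

# Print the Restore status for every key listed in the failure report
awk -F "\"*,\"*" '{print $2}' restore-job-report.csv | while read -r key; do
    aws s3api head-object --bucket archive-bucket --key "$key" --query 'Restore' --output text
done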

Conclusion

In this blog, we described how you can restore a large number of archived objects stored in the Amazon S3 Glacier Flexible Retrieval and Amazon S3 Glacier Deep Archive storage classes at a faster rate. The increased throughput automatically applies to all standard and bulk retrievals from the S3 Glacier Flexible Retrieval and S3 Glacier Deep Archive storage classes, and it’s available at no additional cost. Using S3 Batch Operations, you can automatically take advantage of the increased restore request rate. As a result, you can now reduce data retrieval times for millions, or billions, of small objects by up to 90%.

To summarize our walkthrough results, we restored a 10 TB dataset composed of 1,000,000 objects stored in the S3 Glacier Flexible Retrieval storage class using the standard retrieval option. It took 19 minutes to initiate the 1,000,000 object restore requests, close to the theoretical minimum of roughly 17 minutes at 1,000 TPS (1,000,000 requests at 1,000 requests per second is about 1,000 seconds). The entire 10 TB dataset was completely restored in 4 hours, 53 minutes, and 11 seconds.

All S3 Glacier storage classes are designed to be the lowest-cost storage for specific access patterns, allowing you to archive large amounts of data at a very low cost. And when you need access to that archive data, you can retrieve objects even faster with the improved restore request rate we discussed throughout the blog. I look forward to hearing from you on how this improvement helps your organization.

If you have any comments or questions about this blog post, please don’t hesitate to reply in the comments section. Thanks for reading!

Fabio Lattanzi

Fabio is a Sr. Solution Architect focused on AWS Transfer Family and Amazon S3. He enjoys helping customers build the most durable, scalable, performant, and cost-effective storage solutions for their use case. He is based in Utah, loves traveling with his wife, cuddling with his two dogs, and playing drums.

Nitish Pandey

Nitish Pandey is a Technical Product Manager at Amazon S3, specializing in data integrity and S3 Glacier archive storage. With a passion for solving complex problems, he has experience in building enterprise, AI/ML, and consumer products that delight customers. Outside of work, he enjoys exploring nature, delving into stoic philosophy, and staying informed on tech and geopolitics.