AWS Storage Blog

Restore data from Amazon S3 Glacier storage classes starting with partial object keys

When managing data storage, it is important to optimize for cost by storing data in the most cost-effective manner based on how often it is used or accessed. For many enterprises, this means archiving less frequently accessed data in some form of cold storage while keeping frequently used data in standard storage. When archived data is needed, it often must be restored to standard storage before it can be used, and the restoration process can require that data’s metadata or other identifying information. Sometimes, however, those details are incomplete, missing, or otherwise difficult to track or manage, and restoration at scale becomes difficult without the right tools and a well-organized, complete storage infrastructure.

Amazon S3 customers use the Amazon S3 Glacier storage classes for data that is infrequently accessed and for other archiving needs. The S3 Glacier storage classes offer the lowest cost for storing archival data while still providing high performance and retrieval flexibility. If you need to restore data from the S3 Glacier storage classes, you can use Amazon S3 Batch Operations, which lets you manage billions of objects at scale with just a few clicks in the Amazon S3 console or a single API request. One requirement for such an at-scale restoration is having the full object key for each object that you are restoring. However, as mentioned, you might sometimes need to restore data without all of the details required to do so, in this case, the full object key.

In this post, we demonstrate a simple method of using available, partial object-key information to locate the full object keys that you need for object restoration with S3 Batch Operations. We do so by using Amazon Athena, a serverless query service, to match the partial information available against the full object keys present in an S3 Inventory report. The output from Athena can then be used as a manifest for S3 Batch Operations to drive your restoration process. In this way, you can seamlessly restore archived data at scale even when you don’t have all of the necessary information to start, enabling you to use the data for what you need without excessive overhead or added costs.

Note: If you already have the full object key(s) necessary for restoration, refer to the blog post “Manage and analyze your data at scale using Amazon S3 Inventory and Amazon Athena.”

Scenario

A “key” in Amazon S3 is a unique identifier for an object and is assigned to every object in an S3 bucket. The inventory report uses these keys to point to where the objects are located in Amazon S3. For example, we have objects in S3 Glacier Flexible Retrieval with keys that follow this structure:

AWSLogs/XXXXXXXXXXXX/CloudTrail/us-east-1/2019/05/03/038810374562_CloudTrail_us-east-1_20190503T0155Z_BWb0X9sFteG4dRKo.json.gz

This specific object contains data about API calls made within an AWS account. AWS CloudTrail writes these objects to Amazon S3 periodically, and they can be used for auditing, security monitoring, and operational troubleshooting. Your data may look different, but object keys are generally composed of one or more prefixes followed by the object name itself. In our scenario, S3 Lifecycle policies move the CloudTrail data into S3 Glacier Flexible Retrieval after a month to save on cost. The inventory report is used for management and tracking of your objects across all S3 storage classes.

In this scenario, we simulate an audit where we need to restore certain records from Amazon S3 Glacier Flexible Retrieval but are only given the timestamp portion of the object key (20190503T0155Z in the example above). We are given 100 unique timestamps and need to restore every object whose key matches one of those timestamps.

Prerequisites

Before creating your manifest report, ensure you have the following:

  1. An AWS account that has access to Amazon S3 and Amazon Athena.
  2. Amazon Athena, configured for querying.
  3. An inventory report and the Athena table to query it. Refer to the blog “Manage and analyze your data at scale using Amazon S3 Inventory and Amazon Athena” if you need help creating your inventory report.

Solution overview

In this solution, we walk you through how to match a partial Amazon S3 object key to a full S3 object key by using Athena and an existing inventory report. After matching the keys, we show you how to create an Athena query output that can be used for restoring the objects out of S3 Glacier Flexible Retrieval using S3 Batch Operations.

This solution walks through the following steps:

  1. Uploading partial key data to an Amazon S3 bucket and creating a table in Amazon Athena.
  2. Analyzing objects in your S3 Inventory report with Athena.
  3. Joining both tables and Athena query tuning.
  4. Creating an output for a manifest file to be used with S3 Batch Operations.

Step 1: Load partial key information into S3

To match partial keys to the full keys in the inventory report, we upload a .csv file to Amazon S3 containing the necessary information, with a bucket column and an id column.

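For illustration, the first few rows of such a .csv file might look like the following (the bucket name and timestamp values here are placeholders; yours will differ):

bucket,id
your_bucket,20190503T0155Z
your_bucket,20190503T0210Z
your_bucket,20190504T0030Z

The header row is skipped at query time by the skip.header.line.count property in the table definition created in the next step.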

Step 2: Create an Amazon Athena table for the partial keys

In the query editor of the Amazon Athena console, you can use SQL to create the table or use the Create Table GUI that Amazon Athena offers. Here is the Create Table statement that we use, which contains the bucket and id fields. The “bucket” column holds the name of the bucket where the objects are located, and the “id” column holds the partial object key values.

CREATE EXTERNAL TABLE `partial_key_lookup`(
  `bucket` string, 
  `id` string)
ROW FORMAT DELIMITED 
  FIELDS TERMINATED BY ',' 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.mapred.TextInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  's3://your_bucket/glacier_blog_csv'
TBLPROPERTIES (
  'has_encrypted_data'='false', 
  'skip.header.line.count'='1')

Choose Run to create your table.

Verify you are able to query the new table:

SELECT * FROM "default"."partial_key_lookup" limit 10;


You can also run a count of the rows in the table to make sure you restore the correct number of objects. In our case, the .csv file we uploaded contained 100 rows. Each row has a unique ID that will be matched to the objects in Amazon S3 Glacier Flexible Retrieval via the inventory report.

SELECT count(*) as row_count FROM partial_key_lookup;


Step 3: Query tuning

It would be fairly easy to accomplish this task if your lookup table contained the matching key to the inventory report, as you could write SQL to join on those keys. The difficult part in our scenario is figuring out how to match the partial key to the full key of the inventory report without writing a SQL query in Athena that would time out while performing a filter over 20 million rows.

By using the split function, you can fine-tune Athena to efficiently query millions of records. At the time of this post, Athena engine version 2 supports Presto version 0.217, which includes the split function and many others. The split function will help us break up the key by using a delimiter to target the part of the key that matches the lookup table.

Analyze the inventory report key

We take the inventory report key mentioned previously and go through an example of using the split function:

AWSLogs/XXXXXXXXXXXX/CloudTrail/us-east-1/2019/05/03/038810374562_CloudTrail_us-east-1_20190503T0155Z_BWb0X9sFteG4dRKo.json.gz

Let’s walk through how to extract “20190503T0155Z” from the object above. The “key” column of the inventory report is what we can query with Amazon Athena to see object keys.

SELECT key,
       try(split(key,'_',5)) AS keys_split,
       try(split(key,'_',5) [ 4 ]) AS id
FROM s3_inventory --Inventory Table
LIMIT 1;

Note: The “try” in the preceding SQL ensures that if there are other keys that are out of the split criteria, Amazon Athena will skip over those keys instead of throwing an error. For more information on using “try” when writing SQL in Athena, refer to the Presto documentation.


The split function takes a column (string), a delimiter, and a limit. In the query above, we split the key column into an array, creating a new element at every underscore in the object key. You could just as easily do this for other delimiters, such as a dash or forward slash. The array is limited to 5 elements (indexed starting at 1).

Looking at the newly created “id” column, we are left with “20190503T0155Z” and have successfully extracted the ID required to match against the lookup table. The query simply selects index [ 4 ] of the array as the “id” column.
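If you want to see the mechanics in isolation, you can run the split function against the sample key as a string literal (this is the same example key from earlier in this post):

SELECT split('AWSLogs/XXXXXXXXXXXX/CloudTrail/us-east-1/2019/05/03/038810374562_CloudTrail_us-east-1_20190503T0155Z_BWb0X9sFteG4dRKo.json.gz', '_', 5) [ 4 ] AS id;
-- Returns: 20190503T0155Z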

Build the query to join tables

One potential challenge when using Amazon Athena to query millions or even billions of rows in Amazon S3 is writing a query that is efficient. Using LIKE in a WHERE clause or OR across multiple JOINs can add significant processing time to an Athena query. In this scenario, you can maximize Athena’s efficiency by building a query that first extracts the partial key from the S3 Inventory report to create an “id” column, and then joins both tables on their “id” columns.
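For comparison, here is a sketch of the pattern-matching approach we want to avoid. Joining on a LIKE condition forces Athena to evaluate the pattern against every combination of inventory row and lookup row, which does not scale well to tens of millions of rows:

SELECT inv.bucket,
       inv.key
FROM s3_inventory inv --inventory report table
    INNER JOIN partial_key_lookup lut ON inv.key LIKE '%' || lut.id || '%'

The query below avoids this by extracting the id first and joining on equality, which Athena can execute as an efficient hash join.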

Replace s3_inventory with the name of your S3 Inventory Report table.

WITH inv_table AS (
    SELECT key,
           try(split(key,'_',5)) AS keys_split,
           try(split(key,'_',5) [ 4 ]) AS id,
           bucket
    FROM s3_inventory --inventory report table
),
lookup_table AS (
    SELECT id, 
           bucket
    FROM partial_key_lookup --lookup table
)
SELECT inv.bucket,
       inv.key, 
       inv.id AS inventory_key,
       lut.id AS lookup_key
FROM inv_table inv
    INNER JOIN lookup_table lut ON lut.id = inv.id

Step 4: Create output for manifest file

The query successfully matched both keys, so it is time to alter it to produce the format required by the bulk restore option of S3 Batch Operations. We use bulk restore here because bulk retrievals out of S3 Glacier Flexible Retrieval are free, and the restore job should finish within 5–12 hours. You can visit the Archive retrieval options documentation if you’d like to learn more about Amazon S3 Glacier restore options. For the .csv output, you need the bucket and key in order to run the restore:

WITH inv_table AS (
    SELECT key,
           try(split(key,'_',5) [ 4 ]) AS id,
           bucket
    FROM s3_inventory --Inventory Table
),
lookup_table AS (
    SELECT id, 
           bucket
    FROM partial_key_lookup --Lookup Table
)
SELECT inv.bucket,
       inv.key
FROM inv_table inv
    INNER JOIN lookup_table lut ON lut.id = inv.id


We can now see the bucket and full key from the inventory report, which are the required columns for an Amazon S3 Batch Operations manifest. In this scenario, we queried over 20 million rows and the output contained 1,300 rows. If your output returns billions of rows, you should consider splitting it into smaller manifests.
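Saved as a .csv file, the output provides the two-column bucket,key layout that an S3 Batch Operations CSV manifest expects (note that the manifest itself must not include a header row). A couple of rows, with placeholder values, might look like this:

your_bucket,AWSLogs/XXXXXXXXXXXX/CloudTrail/us-east-1/2019/05/03/038810374562_CloudTrail_us-east-1_20190503T0155Z_BWb0X9sFteG4dRKo.json.gz
your_bucket,AWSLogs/XXXXXXXXXXXX/CloudTrail/us-east-1/2019/05/03/038810374562_CloudTrail_us-east-1_20190503T0210Z_XXXXXXXXXXXXXXXX.json.gz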

Cleaning up

To avoid extra charges after following the steps in this post, delete the following resources if they are no longer required:

  1. The partial key .csv file uploaded to Amazon S3 (see How to empty an S3 bucket).
  2. The partial key lookup table in Amazon Athena (see How to drop a table in Athena).

Conclusion

In this blog post, we walked through matching a partial Amazon S3 key to an inventory report key by creating a lookup table and joining it to the inventory report using Amazon Athena. We then optimized the query to handle use cases that require querying millions or even billions of rows with Athena. Finally, we shaped the query output so that it can be exported as a .csv file and used as an S3 Batch Operations manifest for retrieving objects from S3 Glacier Flexible Retrieval.

We hope this post helped you understand how to use Athena to match a partial object key with a full key from an S3 Inventory report. The ability to restore archived data for use is just as important as archiving it in the first place for cost savings, and the solution demonstrated here should help you do that even with incomplete starting information, giving you added flexibility with your archived data.

Thanks for reading this blog! Leave any comments you have in the comments section below. Additionally, you can reference our blog on S3 Batch Operations if you’d like to export your .csv and use it to restore objects from S3 Glacier Flexible Retrieval.

Michael Hamilton

Michael is an Analytics Specialist Solutions Architect who enjoys working with customers to solve their complex needs when it comes to data on AWS. He enjoys spending time with his wife and kids outside of work and has recently taken up mountain biking!

Samrat Lamichhane

Samrat is a Sr. Solutions Architect and has been at AWS for 4+ years. He has worked with customers across the Enterprise, SMB, and Startup segments, helping them architect their solutions and migrate to AWS. He specializes in Serverless, Storage, and Configuration, Compliance, and Auditing in AWS.

Ramkumar Nottath

Ramkumar Nottath is a Senior Solutions Architect at AWS focusing on Analytics services. He enjoys working with various customers to help them build scalable, reliable big data and analytics solutions. His interests extend to various technologies such as analytics, data warehousing, streaming, and machine learning. He loves spending time with his family and friends.