AWS Storage Blog
Simplify querying your archive data in Amazon S3 with Amazon Athena
Today, customers increasingly choose to store data for longer because they recognize its potential future value. Storing data longer, coupled with exponential data growth, has led customers to place greater emphasis on storage cost optimization and cost-effective storage classes. However, a modern data archiving strategy calls not only for optimizing storage costs, but also for mechanisms that let you put that data to work as quickly as your business requirements demand.
Amazon S3 offers Amazon S3 Glacier storage classes, where you can store any amount of long-term archive data in a secure, durable, and cost-optimized manner. Amazon Athena is an interactive query service that makes it simple to analyze data directly in Amazon S3 at petabyte scale using standard SQL.
In this post, we discuss how you can use Amazon Athena to automatically query data in the S3 Glacier storage classes. This solution makes querying archived data more accessible and quickens the query process. With a more streamlined query process, you can get to insights faster and ensure that archive data is being put to work for your organization rather than sitting idle.
Solution overview
Amazon S3 offers three archive storage classes for infrequently accessed objects: S3 Glacier Instant Retrieval, S3 Glacier Flexible Retrieval, and S3 Glacier Deep Archive. These storage classes offer the lowest cost archive storage in the cloud, the most retrieval flexibility, and 11 9s of durability. While historical data is less frequently accessed, this archived data drives business decisions and insights into long-term trends. This makes the ability to query historical data critical to customers for use cases such as long-term trend analysis for business intelligence (BI), pattern recognition using artificial intelligence/machine learning (AI/ML) models, compliance auditing, and security investigations. Amazon Athena is a serverless service and is well suited to these types of as-needed interactive queries.
On June 29, 2023, Amazon Athena added the ability to query data restored from S3 Glacier Flexible Retrieval and S3 Glacier Deep Archive storage classes. You can now use Athena to query your data stored in Amazon S3 across all storage classes. This helps you to archive more data in the S3 Glacier Flexible Retrieval and S3 Glacier Deep Archive storage classes to save on storage costs, while also retaining the ability to directly query restored data quickly and easily.
For example, imagine you need to query 400,000 files totaling 10 TB of data stored in S3 Glacier Flexible Retrieval to do a cybersecurity threat analysis. With this new capability, once you have restored the 10 TB of data, you can query these files directly using Amazon Athena, without first having to copy the restored data into Amazon S3 Standard or another instant retrieval storage class. This means there is no infrastructure to manage, and you pay only for the amount of data read from Amazon S3 for the query to complete, just as with your Athena queries on data stored in other S3 storage classes. Next, let’s talk about the Athena configuration required to take advantage of this capability.
Use Amazon Athena to read directly from restored archive objects
When you use Amazon Athena to query objects in S3, by default, Athena only queries the objects in S3 Standard and other non-archive storage classes. Now, with a simple table-level opt-in, you can run queries on restored data in the S3 Glacier storage classes, just as you do on objects stored in other storage classes.
To read restored objects in S3 Glacier Flexible Retrieval and S3 Glacier Deep Archive storage classes from your Hive tables in Athena, set the new table property read_restored_glacier_objects to true by running the following ALTER TABLE statement in Athena’s Query Editor:
ALTER TABLE table_name SET TBLPROPERTIES ('read_restored_glacier_objects' = 'true')
Review the Amazon Athena User Guide for additional configuration details. Alternatively, you can update this table property in the AWS Glue console. Once the property is set to true for the table, all subsequent Athena queries on that table read data from S3 objects stored in any storage class, including restored S3 Glacier Flexible Retrieval and S3 Glacier Deep Archive objects. Note that this feature is only supported for Apache Hive tables on Athena engine version 3.
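If you prefer to apply the opt-in programmatically rather than through the Query Editor, the following is a minimal sketch of one way to do it. It assumes Python with the boto3 SDK installed and AWS credentials configured. The function names, the table and database names, and the query results location are hypothetical placeholders, not part of the original walkthrough.

```python
import re


def build_opt_in_statement(table_name: str) -> str:
    """Build the ALTER TABLE statement that opts a table in to restored objects."""
    # Assumption for this sketch: accept only simple identifiers, so the
    # table name cannot smuggle extra SQL into the statement.
    if not re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", table_name):
        raise ValueError(f"unexpected table name: {table_name!r}")
    return (
        f"ALTER TABLE {table_name} "
        "SET TBLPROPERTIES ('read_restored_glacier_objects' = 'true')"
    )


def submit_opt_in(table_name: str, database: str, output_location: str) -> str:
    """Submit the statement to Athena and return the query execution ID."""
    import boto3  # imported lazily so the builder above works without the SDK

    athena = boto3.client("athena")
    response = athena.start_query_execution(
        QueryString=build_opt_in_statement(table_name),
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    return response["QueryExecutionId"]


if __name__ == "__main__":
    # "archive_logs" is a hypothetical table name used only for illustration.
    print(build_opt_in_statement("archive_logs"))
```

Calling submit_opt_in("archive_logs", "my_database", "s3://my-athena-results/") would start the query. Both the database name and the S3 location here are illustrative and need to be replaced with your own values.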
Example architecture showing direct query access to archive objects
In Figure 1, we present a step-by-step example of an architecture to query objects stored in the S3 Glacier storage classes using Athena, after you have updated your table property to read restored S3 Glacier objects as described in the previous section. To start, you can use S3 Batch Operations to initiate the restore process for your S3 Glacier Flexible Retrieval and S3 Glacier Deep Archive objects, based on the objects listed in an S3 Inventory report. You can also use a CSV file containing the list of objects to be restored as your manifest for S3 Batch Operations. When you run the S3 Batch Operations job, it initiates the request to restore the objects listed in your manifest. Once this job is complete, you must wait for the objects to be restored. Expected restore times depend on the storage class and the chosen restore tier (expedited, standard, bulk). You can find these expected restore times in the “Storage classes for archiving objects” section of the S3 documentation. You can check the restore status of objects using either the S3 HeadObject API or the S3 ListObjectsV2 API. Once the object restores have completed, you can use Athena to query these objects directly.
Figure 1: Example architecture to query archive objects using Amazon Athena
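As an illustration of the status check in this flow, the sketch below interprets the Restore value that HeadObject returns for an object. It assumes Python with boto3, and it assumes the value has the form ongoing-request="false", expiry-date="..." as described in the S3 API reference. The function names are hypothetical.

```python
import re
from email.utils import parsedate_to_datetime
from typing import Optional


def parse_restore_header(restore: Optional[str]) -> dict:
    """Interpret the Restore value returned by HeadObject.

    Returns a dict with 'requested', 'in_progress', and 'expiry' keys.
    """
    if not restore:
        # No Restore value means no restore is recorded for this object.
        return {"requested": False, "in_progress": False, "expiry": None}
    ongoing = re.search(r'ongoing-request="(true|false)"', restore)
    expiry = re.search(r'expiry-date="([^"]+)"', restore)
    return {
        "requested": True,
        "in_progress": bool(ongoing) and ongoing.group(1) == "true",
        "expiry": parsedate_to_datetime(expiry.group(1)) if expiry else None,
    }


def is_restored(bucket: str, key: str) -> bool:
    """Return True if a restore was requested for the object and has finished."""
    import boto3  # imported lazily so the parser above works without the SDK

    head = boto3.client("s3").head_object(Bucket=bucket, Key=key)
    status = parse_restore_header(head.get("Restore"))
    return status["requested"] and not status["in_progress"]
```

Checking objects one at a time like this is practical for spot checks. For large restores, the list-based status check or event notifications are likely a better fit.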
In Figure 2, we build upon the architecture in Figure 1 to automatically notify you when the data is ready to be queried by Athena. As in the previous example, you use S3 Batch Operations to initiate the restore process. As each object restore completes, an Amazon S3 Event Notification is triggered that invokes an AWS Lambda function. This Lambda function updates an Amazon DynamoDB table with the count of completed restores and checks whether the counter equals the total number of objects you submitted in the S3 Batch Operations job. If so, a notification is sent through Amazon Simple Notification Service (Amazon SNS) to let you know that your archive objects are now available to be queried directly using Athena alongside your other objects in the S3 bucket.
Figure 2: Example architecture to query archive objects using Amazon Athena with notification
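The counting logic in this architecture could be sketched as follows. This is not the function used in the original solution. It is a minimal illustration in Python that assumes boto3 is available in the Lambda runtime and that the DynamoDB table name, job identifier, SNS topic ARN, and expected object count are supplied through environment variables. Those variable names, the job_id and restored_count attribute names, and the helper names are all hypothetical.

```python
import os
from typing import Callable, Optional


def count_completed_restores(event: dict) -> int:
    """Count ObjectRestore:Completed records in an S3 event notification."""
    return sum(
        1
        for record in event.get("Records", [])
        if record.get("eventName") == "ObjectRestore:Completed"
    )


def process_event(
    event: dict,
    total_objects: int,
    add_to_counter: Callable[[int], int],
    notify: Callable[[str], None],
) -> Optional[int]:
    """Add completed restores to the counter; notify when it equals the total."""
    completed = count_completed_restores(event)
    if completed == 0:
        return None  # nothing to record for this invocation
    restored_so_far = add_to_counter(completed)
    if restored_so_far == total_objects:
        notify(f"All {total_objects} objects are restored and ready to query.")
    return restored_so_far


def lambda_handler(event, context):
    import boto3  # imported lazily so the logic above is testable without the SDK

    table = boto3.resource("dynamodb").Table(os.environ["COUNTER_TABLE"])
    sns = boto3.client("sns")

    def add_to_counter(n: int) -> int:
        # ADD is applied atomically, so concurrent invocations do not lose updates.
        result = table.update_item(
            Key={"job_id": os.environ["JOB_ID"]},
            UpdateExpression="ADD restored_count :n",
            ExpressionAttributeValues={":n": n},
            ReturnValues="UPDATED_NEW",
        )
        return int(result["Attributes"]["restored_count"])

    def notify(message: str) -> None:
        sns.publish(TopicArn=os.environ["TOPIC_ARN"], Message=message)

    total = int(os.environ["TOTAL_OBJECTS"])
    return process_event(event, total, add_to_counter, notify)
```

The sketch follows the equality check described above and does not deduplicate events. If the same notification were delivered more than once, the counter could overshoot the total, so a production version would likely track restored keys rather than a bare count.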
Reduce operational complexity and costs
With this new capability, there is no need to copy data at scale to another storage class to query your restored archive data. Because you can now query restored data in the S3 Glacier Flexible Retrieval and S3 Glacier Deep Archive storage classes directly from Athena, operational overhead and costs are significantly reduced, in most cases by up to 50%.
How can your native applications directly query objects stored in the Amazon S3 Glacier storage classes?
The Amazon S3 LIST API now provides customers with the restore status of S3 Glacier objects. Athena uses this restore status information in the S3 LIST API response to identify and access restored objects, and then runs queries against your restored data in the S3 Glacier storage classes. Similarly, you can integrate the restore status information from the Amazon S3 LIST response into your native applications to directly query restored objects in the S3 Glacier storage classes.
To get the restore status of objects in S3 Glacier archive storage classes, use the S3 LIST API with the new optional request header x-amz-optional-object-attributes: RestoreStatus. This feature is available in the ListObjects (V1), ListObjectsV2, and ListObjectVersions APIs. Refer to our documentation for more information on RestoreStatus. Figure 3 is an example S3 LIST API response when object restoration is complete, with the RestoreStatus element highlighted. Your application can use this RestoreStatus element to identify, access, and directly query restored objects from S3 Glacier Flexible Retrieval and S3 Glacier Deep Archive.
Figure 3: Example S3 LIST API response with RestoreStatus when object restore is complete
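To show how an application might consume this information, here is a minimal sketch in Python. It assumes a boto3 version recent enough to accept the OptionalObjectAttributes parameter on list_objects_v2, and it treats only the GLACIER and DEEP_ARCHIVE storage class values as requiring a restore. The function names are hypothetical, and other archive tiers (for example, in S3 Intelligent-Tiering) are deliberately out of scope.

```python
from typing import Iterable, List

# Storage class values that need a completed restore before the object is readable.
ARCHIVE_CLASSES = {"GLACIER", "DEEP_ARCHIVE"}


def is_readable(obj: dict) -> bool:
    """Return True if a listed object can be read right now."""
    if obj.get("StorageClass", "STANDARD") not in ARCHIVE_CLASSES:
        return True
    status = obj.get("RestoreStatus") or {}
    # Readable only when a restore has finished and a restored copy exists.
    return not status.get("IsRestoreInProgress", True) and "RestoreExpiryDate" in status


def readable_keys(objects: Iterable[dict]) -> List[str]:
    """Filter a sequence of listed objects down to the keys that can be read."""
    return [obj["Key"] for obj in objects if is_readable(obj)]


def list_readable_keys(bucket: str, prefix: str = "") -> List[str]:
    """List a prefix and return keys that are readable, including restored ones."""
    import boto3  # imported lazily so the filters above work without the SDK

    paginator = boto3.client("s3").get_paginator("list_objects_v2")
    keys: List[str] = []
    for page in paginator.paginate(
        Bucket=bucket,
        Prefix=prefix,
        OptionalObjectAttributes=["RestoreStatus"],
    ):
        keys.extend(readable_keys(page.get("Contents", [])))
    return keys
```

A query engine could pass the returned keys to its readers and skip archived objects that have not been restored, instead of failing when it tries to read them.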
Conclusion
In this post, we covered a cost-effective and efficient way to query restored data in the S3 Glacier Flexible Retrieval and S3 Glacier Deep Archive storage classes directly using Amazon Athena. With this new functionality, Athena can now read data across all S3 storage classes in a single query.
Access to restored data in Athena greatly simplifies your query process by removing the operational overhead and cost of copying restored data and storing it in an instant access storage tier for querying. This means you have more direct, easier access to your archive data, allowing you to archive more data now without worrying about the complexity of accessing this data later. To learn more about querying data in the S3 Glacier storage classes with Amazon Athena, visit the documentation on querying restored objects.
Although this post focuses on Amazon S3 and Amazon Athena, you can integrate the restore status information of your S3 objects in the S3 LIST API responses into any query engine to replicate these efficiencies.