Manage and analyze your data at scale using Amazon S3 Inventory and Amazon Athena
Object storage gives you virtually unlimited scale, which helps you grow your business without being concerned with managing the infrastructure to support your data. Managing millions to billions of objects in Amazon S3 can be difficult, inefficient, and time consuming if you don’t take steps to automate the management of this data at scale. Data migrations are often especially difficult when working with billions of objects when you want to take intelligent action to move data or reduce costs. AWS has great tools to help you manage your data at scale. With Amazon S3 Inventory and Amazon Athena, you can generate and curate lists of your objects and then use automation tools to take action on that data such as targeted migrations or cost optimization.
In this blog post, I walk you through a few examples of applying data science to object management and automation with AWS. I also show you how to use Athena to manage your object data and derive actionable information using simple queries to make data placement, protection, and replication decisions to optimize your objects for your business needs. From a high level, I cover:
- Configuring Amazon S3 Inventory.
- Importing your inventory into Amazon Athena.
- Sample SQL queries for your inventory.
- Automation with your CSV outputs.
In a follow-on blog post, I cover using S3 Batch Operations to act on this data to perform restores and migrations to gain immediate benefit using AWS tools to scale.
Data science tools and inventory
AWS has 35+ machine learning and analytics services that you can leverage to gain insights from your data. A common data management request with object storage is to create and curate an inventory list of objects in S3 to perform automation tasks against. Tasks such as object copy, deletion, or replication can all be enhanced by using data science tools to curate your list of objects to act on. Amazon Athena brings the power of Data Science to object inventory management. Tools like AWS Glue and Amazon Athena are great ways to manipulate and derive insights from the Amazon S3 Inventory files. You can use structured query language (SQL) to generate lists of objects and export your results to CSV files for automation purposes. In the next sections, we will review a number of samples. With our examples and some SQL skills, you can come up with a wide range of ways to generate actionable lists of files for your file management and analytics tasks.
Amazon S3 Inventory
Amazon S3 Inventory allows you to generate a list of objects and metadata that you can use to query and manage your objects. You can use this inventory report to generate granular data such as object size, last modified date, encryption status and other fields. Those reports are available daily or weekly to automatically give the latest list. Pricing for Amazon S3 Inventory is based on number of objects at $.0025 (in AWS US East (N. Virginia)) per million objects listed, and you can review the S3 pricing page for details for your specific use case based on the AWS Region of your bucket. In large environments, these files can be extremely large in size and go beyond the ability of traditional spreadsheets or local python tools to manage. When working with billions of objects, having access to data science caliber tools can be a big help.
Configuring Amazon S3 Inventory
Here are the steps to generate the AWS Inventory CSV, ORC, or Parquet file for your S3 destination bucket. You can select the existing bucket or a dedicated log bucket to store your inventory in. Many companies use aggregated inventory buckets to better group for analysis. Make sure you have a policy with permissions on your destination bucket, or the inventory file will fail to create.
Steps to set up Amazon S3 Inventory on your bucket – documentation
1. Select the bucket you are interested in collecting an inventory on.
2. Select the Management Tab.
3. Select Inventory and then create inventory configuration.
4. Set the name and scope by choosing the name, prefix, and depth of inventory.
5. Set Destination bucket and apply policy for access: Sample policy to put on your inventory destination bucket.
6. Select Frequency, Format, and Enable the inventory report. Choose from Daily or Weekly for frequency. For output, you can choose from CSV, Apache ORC, and Apache Parquet. Finally, you can choose to enable or disable inventory reports.
7 . Select Fields to report on, including Size, Last modified, Storage class, Encryption, and Object Lock.
8. The inventory report will run automatically based on settings.
After 24-48 hours you should have an inventory report in your specified report bucket. In this example it is Amazon S3, the report Bucket, the bucket the report is of, and the report name format.
Under that report name, folders are organized in the report destination bucket by date and time of the last report run, with this format (date and time): 2021-05-05T00-00Z/.
The folder 2021-05-05T00-00Z/ will contain a manifest.json file that tracks the location of the files in the report for that date.
The folder data/2021-05-05T00-00Z/ contain a gz of the inventory file you can download and unzip to manipulate manually. Note these files can be quite large.
The folder hive/2021-05-05T00-00Z/ contains a symlink to the gz file that you use for Athena in the rest of this blog. Athena can import the file directly into Athena for analysis.
Note: While you can work with your downloaded CSV for reports they can be prohibitively large for local data management tools, using Amazon Athena you get the ability to scale your analysis of the data to billions of objects.
Importing your inventory into Amazon Athena
Amazon Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL such as our log files. Athena is serverless, so there is no infrastructure to manage, and you pay only for the queries that you run. Athena is easy to use. Point to your data in Amazon S3, define the schema, and start querying using standard SQL. Most results are delivered within seconds. With Athena, there’s no need for complex ETL jobs to prepare your data for analysis. This makes it easy for anyone with SQL skills to quickly analyze large-scale datasets like our S3 Inventory files.
Athena pricing is $5.00 per TB scanned, but check your Region.
Here are the steps and screenshots to show you how to import, apply a schema, and query your data for analysis. You can run many queries and export the results as a CSV at any time to capture a list of objects that meet those criteria. You can then apply automation to those files.
1. Open Athena in your preferred Region and select Get started.
2. Select Settings and set destination directory Athena is the same Region.
3. Import your inventory into a table in Athena.
Find the inventory location for the date you are interested in and collect the S3://URL from your inventory directory containing symlink.txt file (see the following sample URL). Then adjust the following import SQL to match the schema of your inventory files. This will import your S3 Inventory with the right fields and headers for reporting. Edit the following sample SQL query for your inventory file format and replace the URL example with the one for your inventory file.
URL Example: s3://awg-ab3-code-20/awg-ab3-code-20/Athena_Test/hive/dt=2021-05-06-00-00/
List of all possible fields in your inventory file, yours may be a subset:
4. Here is the SQL query used to import your inventory file. Prepare the import query for your inventory, edit the fields to match and URL to your file:
CREATE EXTERNAL TABLE S3_Bucket_Inventory(
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat'
Helpful tips for the inventory file import:
- Make sure you have no “hidden characters” in your SQL.
- Make sure you use
https:for the URL input SQL.
- Make sure your S3 Inventory file and your import schema fields match.
5. Once your SQL import is ready, select the Create table drop-down menu, then select CREATE TABLE under SQL templates.
6. Input your sample query you prepared above and select run query.
7. Test query to see if your table populated:
SELECT * FROM "default"."s3_bucket_inventory" limit 10
8. You can export to CSV the output at any time from upper right hand corner of the console.
Sample SQL queries for your inventory
Here are a series of sample queries you can use to analyze your inventory files. These are a few examples, but there are many others you can create.
Sort your objects by size and report
SELECT DISTINCT size FROM S3_bucket_inventory ORDER BY 1 DESC limit 10;
SELECT size, count(*) FROM s3_bucket_inventory GROUP BY size;
Export 100 items to CSV to create “sample” data and export to CSV
SELECT * FROM "sampledb"."s3_bucket_inventory" limit 100;
Check for encryption status to identify security exposure to enable encryption
SELECT encryption_status, count(*) FROM s3_bucket_inventory GROUP BY encryption_status;
Count by last modified date to see how active the data is
SELECT last_modified_date, count(*) FROM s3_bucket_inventory GROUP BY last_modified_date;
Count or list objects smaller than 128 K from a specific inventory date
SELECT COUNT(*) FROM s3_bucket_inventory WHERE size < 128000;
SELECT bucket,key,size FROM s3_bucket_inventory WHERE size< 128000;
Count or list objects greater than or equal to 5 GB from a specific inventory date
SELECT COUNT(*) FROM S3_bucket_inventory WHERE size >= 5000000000;
SELECT bucket,key,size FROM s3_bucket_inventory WHERE size>5000000000;
Report infrequently accessed by tier if for data spread
SELECT intelligent_tiering_tier,count (*) FROM s3_bucket_inventory GROUP BY intelligent_tiering_tier;
Automation and using your CSV outputs
Once you have a filtered list of files organized in the way you want, you can export the results to a CSV file and then use some form of automation or Amazon S3 Batch Operations to take action on your data. This can include migrate, restore, copy, delete, tag, archive, encrypt or act on your data to return value. This can be done at the billions of objects scale with intelligent object selection. In the companion blog in this series, I will cover using Amazon S3 Batch Operations to act on your curated CSV files to perform migrations or perform other data management tasks that allow you to manage your object data.
In addition to Amazon S3 Batch Operations, there are other automation tools you can use to act on your objects such as AWS CLI, AWS Boto 3, and the Cloud SDK kits to write your own automation. Github has some great examples of automation you can use to act on your curated CSV lists.
When your work is complete in Athena, make sure you delete the table to ensure you do not continue to incur costs.
In this blog post, I walked through a few examples of applying data science to object management and automation with AWS. Amazon S3 gives you the ability to store virtually unlimited objects in a single bucket for use cases like archive, analytics and, data lakes. While this makes it a powerful tool for storing data, there are situations where you want to manage individual files intelligently for reporting and automation to achieve business goals. Typical business goals are to understand usage, verify security, increase efficiency, drive down costs, or migrate data to other environments selectively. By bringing in AWS data science services to help analyze your object inventory, you can manage your data and achieve your business objectives.
Thanks for taking the time to read this blog. If you have any feedback or questions, please don’t hesitate to leave them in the comments section.