Consolidate and query Amazon S3 Inventory reports for Region-wide object-level visibility

Organizations around the world store billions of objects and files representing terabytes to petabytes of data. Data is often owned by different teams, departments, or business units, spanning multiple locations. As the number of datastores, locations, and owners grows, you need a way to cost-effectively maintain visibility into important characteristics of your data by location, such as the amount and size of your data, how much of your data is encrypted, and more. This can help ensure that you are efficiently and properly storing and maintaining data based on your needs, while allowing you to compare storage characteristics across Regions for greater insights.

Amazon Simple Storage Service (S3) is a cloud storage service for virtually any kind of data, in any format. Amazon S3 Inventory lists and tracks S3 objects and metadata for auditing. S3 Inventory generates a list file, which can be queried and analyzed using Amazon Athena. This configuration requires one Athena table and one associated query per bucket, as outlined in the blog here. Customers that store data in multiple buckets across different accounts and AWS Regions must complete the following tasks for each bucket to query S3 Inventory from a central location:

Enable Amazon S3 Inventory configuration at the bucket or prefix level across accounts and AWS Regions. This must be completed for all existing buckets, and all subsequently created buckets.
Move and structure the data from multiple accounts and AWS Regions into one bucket.

In this post, we present a solution that leverages AWS CloudFormation, AWS Lambda, and Python scripts to automate these two tasks. It consolidates Amazon S3 Inventory reports into one S3 bucket per Region across desired accounts. This solution extends the visibility of objects by organizing S3 Inventory reports in a partitioned Athena table for performance improvement and cost reduction on multiple accounts across AWS Regions. If you are looking to build a central asset register and want to consolidate regional inventory reports into a single Region, check out this solution.

Solution overview

This solution automates the aggregation of S3 Inventory reports regionally across your AWS Organization in a destination account. For example, S3 Inventory data in the us-west-2 Region across all accounts is aggregated in the us-west-2 Region in the destination account. You then use Amazon Athena to define a schema for the raw S3 Inventory data and run queries on the centralized S3 Inventory through it.

This solution relies on two main components:

A Python script that does most of the heavy lifting: it creates the centralized, Regional destination buckets for reports and the bucket policy in the destination account. The script also configures S3 Inventory for all buckets in the organization to publish reports to their respective centralized report collection buckets in the destination account.

An AWS Lambda function that is invoked every time an object is added to the centralized, Regional destination bucket where all the reports are being sent. The function copies the hive structure of the S3 Inventory report to the same bucket under the “centralize” prefix with two partitions: “bucketname” and “dt.”

Solution flow showing Regional S3 Inventory configuration being mapped to a single Regional S3 bucket, which is then queried by Athena

Prerequisites

The following prerequisites are required before continuing:

The destination account must be a delegated administrator account for AWS Organizations to run the solution. The solution does not work if the destination account is the management account.
The AWS delegated administrator account requires administrator permissions to create and manage stack sets with service-managed permissions for the organization. For instructions to complete this, see this AWS CloudFormation user guide.
An environment running Python 3.7+.

Walkthrough

The solution is deployed in three steps:

Configure and centralize Amazon S3 Inventory across AWS Regions and across AWS accounts.
Automate query setup across AWS Regions in the destination account.
Query Regionally centralized Amazon S3 Inventory reports using Athena.

Step 1: Configure and centralize Amazon S3 Inventory across AWS Regions and AWS accounts

We have created a GitHub repo that you can clone to automate the enablement of Amazon S3 Inventory reports on multiple buckets, the creation of necessary AWS Identity and Access Management (IAM) roles for each account, the creation of centralized, Regional destination buckets for reports, and the bucket policy in the destination account to store Amazon S3 Inventory reports. For example, Amazon S3 Inventory data in the us-west-2 Region across all accounts is aggregated in the us-west-2 Region in the destination account. To get started, navigate to the GitHub repo and clone it to your local machine.

1. If the destination account’s IAM user already has administrator access, you can skip attaching the following IAM policy. Otherwise, attach this policy to your delegated admin IAM user.

{
	"Version": "2012-10-17",
	"Statement": [{
			"Effect": "Allow",
			"Action": [
				"iam:ListPolicies",
				"sts:GetCallerIdentity",
				"athena:ListWorkGroups", 
				"athena:CreateWorkGroup"
			],
			"Resource": "*"
		},
		{
			"Effect": "Allow",
			"Action": "iam:CreatePolicy",
			"Resource": "arn:aws:iam::${Account}:policy/S3InvDestAccountPolicy"
		},
		{
			"Effect": "Allow",
			"Action": "iam:AttachUserPolicy",
			"Resource": "arn:aws:iam::${Account}:user/${UserName}"
		}
	]
}

Note that you must replace ${Account} and ${UserName} in the preceding policy with the destination account ID and IAM user name, respectively.
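The placeholders in the preceding policy use the same ${name} syntax as Python's string.Template, so the substitution can be scripted rather than done by hand. The following is a minimal sketch, not part of the repo: the render_policy helper, the abbreviated policy text, and the account ID and user name are all made up for illustration.

```python
import json
from string import Template

# Abbreviated copy of the preceding policy; only the statements that
# contain placeholders are shown here.
POLICY_TEMPLATE = Template("""
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "iam:CreatePolicy",
      "Resource": "arn:aws:iam::${Account}:policy/S3InvDestAccountPolicy"
    },
    {
      "Effect": "Allow",
      "Action": "iam:AttachUserPolicy",
      "Resource": "arn:aws:iam::${Account}:user/${UserName}"
    }
  ]
}
""")


def render_policy(account_id: str, user_name: str) -> dict:
    """Fill in ${Account} and ${UserName} and return the policy as a dict."""
    if not (account_id.isdigit() and len(account_id) == 12):
        raise ValueError("AWS account IDs are 12-digit numbers")
    rendered = POLICY_TEMPLATE.substitute(Account=account_id, UserName=user_name)
    # Parsing the result also confirms that it is well-formed JSON.
    return json.loads(rendered)


if __name__ == "__main__":
    # Hypothetical values for illustration only.
    policy = render_policy("111122223333", "s3-inventory-admin")
    print(json.dumps(policy, indent=2))
```

Template.substitute raises a KeyError if a placeholder is left unfilled, which helps catch a policy that still contains ${Account} or ${UserName}.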

2. Navigate to the folder where you have cloned the GitHub repo and run the following command. The script requires several Python libraries to run, which are listed in the requirements.txt file.

pip install -r requirements.txt

3. To make sure that the IAM user can perform Amazon S3 service operational activities, such as creating a bucket and putting a bucket policy, run the following command from the destination account:

python createAttachPolicyToDestAcct.py

The preceding command attaches the IAM policy “S3InvDestAccountPolicy” to the IAM user’s profile to perform Amazon S3 service operational activities.

4. Next, deploy the IAM role “OrgS3role” across the source accounts from the destination account, with a trust relationship to the destination account, so that the destination account’s IAM user can assume the role and perform Amazon S3 operations on those accounts. Create a StackSet using the “OrgS3InvSourceAccountPolicy” file (either JSON or YAML) from the GitHub repo you cloned. To do this, navigate to the AWS CloudFormation console, and from the left-hand panel, choose StackSets. From the right-hand panel, choose Create StackSet. Under Permissions, choose Self-service permissions. For the IAM role name, select “AWSCloudFormationStackSetAdministrationRole”, and for the IAM execution role name, enter “AWSCloudFormationStackSetExecutionRole” as per the prerequisites.

Enter IAM admin role and IAM Execution role name for your StackSet

5. In Specify Template, under Template Source, select Upload a template file, and choose the “OrgS3InvSourceAccountPolicy” file (either JSON or YAML) from the folder that you downloaded. Select Next. On the Specify StackSet details page, enter a StackSet name and StackSet description. Under Parameters, enter S3InventoryUser (your delegated admin user) and DestinationAccountID:

Example StackSet name, description, destination account id and S3InventoryUser

6. On the Configure StackSet options page, keep the defaults and select Next. On the Set deployment options page, under Accounts, provide the account numbers separated by commas. Under Specify regions, select US East (N. Virginia):

Enter the accounts you want to deploy the StackSet in and provide the Region as US East (N. Virginia)

7. On the Review page, review the configuration and check the box I acknowledge that AWS CloudFormation might create IAM resources with custom names.
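If you prefer to script the console steps above, the same StackSet can in principle be created with the AWS SDK for Python (boto3). The sketch below only builds the request arguments for the CloudFormation create_stack_set and create_stack_instances calls. It rests on several assumptions: that the console's Self-service permissions option corresponds to the SELF_MANAGED permission model, that the administration role lives in the destination account, and that the default StackSet name (which is hypothetical) suits you. The deploy function is shown for completeness and needs boto3 and valid credentials.

```python
from typing import Dict, List


def build_stack_set_requests(
    template_body: str,
    inventory_user: str,
    destination_account_id: str,
    source_account_ids: List[str],
    stack_set_name: str = "OrgS3InvSourceAccountPolicy",  # hypothetical name
) -> Dict[str, dict]:
    """Build keyword arguments for create_stack_set and create_stack_instances."""
    create_stack_set = {
        "StackSetName": stack_set_name,
        "TemplateBody": template_body,
        "Parameters": [
            {"ParameterKey": "S3InventoryUser", "ParameterValue": inventory_user},
            {"ParameterKey": "DestinationAccountID", "ParameterValue": destination_account_id},
        ],
        # The template creates IAM resources with custom names.
        "Capabilities": ["CAPABILITY_NAMED_IAM"],
        # Assumption: "Self-service permissions" in the console maps to SELF_MANAGED.
        "PermissionModel": "SELF_MANAGED",
        # Assumption: the administration role is in the destination account.
        "AdministrationRoleARN": (
            f"arn:aws:iam::{destination_account_id}:role/"
            "AWSCloudFormationStackSetAdministrationRole"
        ),
        "ExecutionRoleName": "AWSCloudFormationStackSetExecutionRole",
    }
    create_stack_instances = {
        "StackSetName": stack_set_name,
        "Accounts": list(source_account_ids),
        "Regions": ["us-east-1"],  # US East (N. Virginia), as in the console steps
    }
    return {
        "create_stack_set": create_stack_set,
        "create_stack_instances": create_stack_instances,
    }


def deploy(requests: Dict[str, dict]) -> None:
    """Send both requests to CloudFormation (requires boto3 and credentials)."""
    import boto3  # imported here so the builder above works without boto3

    cfn = boto3.client("cloudformation", region_name="us-east-1")
    cfn.create_stack_set(**requests["create_stack_set"])
    cfn.create_stack_instances(**requests["create_stack_instances"])
```

Keeping the request builder separate from the API calls lets you review the exact parameters before anything is deployed.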

Upon successful completion of deployment, you should see the status of the StackSet operation set as Succeeded under the Operations tab. You should see “OrgS3readonlyRole” and “OrgS3readonlyRole_policy” in each AWS account you provided in Step 6. To configure S3 Inventory for new and existing S3 buckets, navigate to your IDE and run the following script:

python orgS3Inventory.py

The following image shows the ongoing execution of the orgS3Inventory.py script. You can see that the destination account is the same but the source accounts are different, as well as the completion percentage:

Ongoing execution of orgS3Inventory.py script

You should see a centralized Amazon S3 Inventory bucket per AWS Region in the destination account to record inventories from all source accounts and buckets within that AWS Region. The script creates the following:

A destination S3 bucket per AWS Region, named s3inventory-<Region>-<accountid>.
An Athena workgroup bucket per AWS Region to store the query results, named s3inv-athena-wgp-<Region>-<accountid>.
An Athena workgroup per AWS Region, named s3inv-athena-wgp-<Region>-<accountid>-wg.
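The naming formats above are easy to reproduce when you need to look up a Region's resources from another script. The following is a small sketch of that mapping; the regional_resource_names helper and the account ID are hypothetical and not part of the repo.

```python
from typing import NamedTuple


class RegionalResources(NamedTuple):
    inventory_bucket: str
    athena_results_bucket: str
    athena_workgroup: str


def regional_resource_names(region: str, account_id: str) -> RegionalResources:
    """Derive the per-Region resource names from the formats listed above."""
    athena_bucket = f"s3inv-athena-wgp-{region}-{account_id}"
    return RegionalResources(
        inventory_bucket=f"s3inventory-{region}-{account_id}",
        athena_results_bucket=athena_bucket,
        athena_workgroup=f"{athena_bucket}-wg",
    )


if __name__ == "__main__":
    # Hypothetical destination account ID for illustration only.
    for name in regional_resource_names("us-west-2", "111122223333"):
        print(name)
```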

Navigate to the Amazon S3 console to find the buckets created:

Centralized Amazon S3 Inventory bucket per AWS Region in the destination account

Note the following:

It might take up to 48 hours for the Amazon S3 Inventory to appear in your destination account, and subsequent reports are delivered on Sundays.
orgS3Inventory.py configures Amazon S3 Inventory for all existing buckets owned by the accounts listed in the StackSet (Step 6). You must run orgS3Inventory.py again whenever new buckets are created after its initial execution. You can automate this with a cron job that runs orgS3Inventory.py on a schedule per your requirements.
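To give a sense of what configuring S3 Inventory on a single source bucket involves, here is a rough sketch built on the boto3 put_bucket_inventory_configuration call. It is not the code in orgS3Inventory.py: the function names, the configuration ID, the choice of optional fields, and the default CSV format are assumptions made for illustration. The weekly frequency matches the weekly delivery noted above.

```python
from typing import Any, Dict

VALID_FORMATS = {"CSV", "ORC", "Parquet"}
VALID_FREQUENCIES = {"Daily", "Weekly"}


def build_inventory_configuration(
    config_id: str,
    destination_bucket: str,
    destination_account_id: str,
    report_format: str = "CSV",  # assumption: the repo may use another format
    frequency: str = "Weekly",
) -> Dict[str, Any]:
    """Build an InventoryConfiguration that publishes to a central bucket."""
    if report_format not in VALID_FORMATS:
        raise ValueError(f"report_format must be one of {sorted(VALID_FORMATS)}")
    if frequency not in VALID_FREQUENCIES:
        raise ValueError(f"frequency must be one of {sorted(VALID_FREQUENCIES)}")
    return {
        "Id": config_id,
        "IsEnabled": True,
        "IncludedObjectVersions": "Current",
        "Schedule": {"Frequency": frequency},
        # Assumption: these optional fields cover the sample queries in this post.
        "OptionalFields": [
            "Size",
            "LastModifiedDate",
            "StorageClass",
            "ETag",
            "EncryptionStatus",
            "ReplicationStatus",
        ],
        "Destination": {
            "S3BucketDestination": {
                "AccountId": destination_account_id,
                "Bucket": f"arn:aws:s3:::{destination_bucket}",
                "Format": report_format,
            }
        },
    }


def apply_inventory_configuration(s3_client, source_bucket: str, configuration: Dict[str, Any]) -> None:
    """Attach the configuration to one source bucket using a boto3 S3 client."""
    s3_client.put_bucket_inventory_configuration(
        Bucket=source_bucket,
        Id=configuration["Id"],
        InventoryConfiguration=configuration,
    )
```

A scheduled job could call apply_inventory_configuration for each bucket that does not yet have the configuration, which is one way to pick up newly created buckets.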

Step 2: Automate query setup across AWS Regions in the destination account

In this step, you automate the creation of a Regional query mechanism by deploying an Athena table using CloudFormation Stacks. Every time an Amazon S3 Inventory report is added to the Regional S3 bucket, it invokes a Lambda function to copy the hive structure of the Amazon S3 Inventory report into the same bucket under the “centralize” prefix with two partitions: “bucketname” and “dt” to optimize query performance on the Amazon S3 Inventory.
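The copy step can be pictured as a key-mapping problem. S3 Inventory writes Hive-compatible symlink files under <source-bucket>/<config-id>/hive/dt=<date>/symlink.txt in the destination bucket, and the Lambda function re-homes them under the “centralize” prefix. The sketch below is an illustration rather than the deployed function: the exact target key layout and the helper names are assumptions.

```python
import re
from typing import List, Optional, Tuple
from urllib.parse import unquote_plus

# Matches [<prefix>/]<source-bucket>/<config-id>/hive/dt=<date>/symlink.txt
_HIVE_KEY = re.compile(
    r"^(?:.*/)?(?P<bucket>[^/]+)/(?P<config>[^/]+)/hive/"
    r"(?P<dt>dt=[^/]+)/(?P<file>symlink\.txt)$"
)


def centralized_key(report_key: str) -> Optional[str]:
    """Map an inventory hive key to its (assumed) location under centralize/."""
    if report_key.startswith("centralize/"):
        return None  # never re-process objects this function wrote itself
    match = _HIVE_KEY.match(report_key)
    if match is None:
        return None  # manifests, data files, and unrelated objects are skipped
    return f"centralize/bucketname={match['bucket']}/{match['dt']}/{match['file']}"


def keys_to_copy(event: dict) -> List[Tuple[str, str]]:
    """Extract (source_key, target_key) pairs from an S3 event notification."""
    pairs = []
    for record in event.get("Records", []):
        # Object keys in S3 event notifications are URL-encoded.
        source_key = unquote_plus(record["s3"]["object"]["key"])
        target_key = centralized_key(source_key)
        if target_key is not None:
            pairs.append((source_key, target_key))
    return pairs
```

Because the function is invoked for every object created in the bucket, including the copies it writes, skipping keys that already start with centralize/ guards against an invocation loop.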

Once Amazon S3 metadata is updated, you can choose to automate the query setup for your active AWS Regions. To do so, create a stack in the CloudFormation console using the “S3Inventory.yaml” script in each AWS Region where you want to set up the query mechanism.

1. Navigate to the CloudFormation console, and select Create Stack. On the Create Stack page, under Prerequisite – Prepare template, choose Template is ready. Under Specify Template, choose Upload a template file, then select Choose file and upload the “S3Inventory.yaml” script.

2. On the Specify stack details page, enter a stack name. Under Parameters, you must provide the following:

i. For InventoryBucketName, enter the name of the S3 bucket that is the destination for all of the S3 Inventory reports in the Region. The bucket should be in the same Region as the CloudFormation stack.

ii. For WorkGroupName (optional), enter an Athena workgroup name from the Region where you are running the CloudFormation stack. If you don’t provide WorkGroupName, then the stack uses the default workgroup, “primary,” and InventoryBucketName for storing Athena query results. Select Next.

Enter Stack name and InventoryBucketName while Workgroup name defaults to primary

3. Keep the default settings on the Configure stack options page, and on the Review page, check the box I acknowledge that AWS CloudFormation might create IAM resources with custom names.

4. Upon successful completion of deployment, you should see the status of the stack operation set as CREATE_COMPLETE.

5. This stack creates two AWS Lambda functions:

i. “lambda-function-inventory”: Triggers when a new object is added to the destination S3 bucket. This function creates an Athena table “Inventory,” if it does not exist, along with two partitions: “bucketname” and “dt.”

ii. “<stackname>-<region-name>-CustomResourceLambdaFunction-<uuid>”: Sets event types of “s3:ObjectCreated:*” for the destination bucket.
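For readers curious about what the custom resource function does, the following sketch shows one way to register “s3:ObjectCreated:*” events on a bucket with the boto3 put_bucket_notification_configuration call. It is an approximation under assumptions, not the template's code: the helper names and the notification Id are hypothetical, and the Lambda function must already allow Amazon S3 to invoke it.

```python
from typing import Any, Dict


def build_notification_configuration(lambda_function_arn: str) -> Dict[str, Any]:
    """Build a notification configuration that invokes a Lambda function
    whenever an object is created in the bucket."""
    return {
        "LambdaFunctionConfigurations": [
            {
                "Id": "s3-inventory-object-created",  # hypothetical identifier
                "LambdaFunctionArn": lambda_function_arn,
                "Events": ["s3:ObjectCreated:*"],
            }
        ]
    }


def apply_notification(s3_client, bucket: str, lambda_function_arn: str) -> None:
    """Apply the configuration to the destination bucket using a boto3 S3 client.

    This call replaces the bucket's existing notification configuration,
    so merge in any existing entries first if the bucket already has some.
    """
    s3_client.put_bucket_notification_configuration(
        Bucket=bucket,
        NotificationConfiguration=build_notification_configuration(lambda_function_arn),
    )
```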

Step 3: Querying centralized Amazon S3 Inventory using Athena

Amazon Athena makes it easy to query your Amazon S3 Inventory data. Athena is already pointing at the destination bucket to run a query against the Amazon S3 Inventory files quickly. This allows you to gain insights from your data in a fraction of the time and cost-effectively. To get started, do the following:

Navigate to the Amazon Athena console.
From the left-hand panel, under Database, select default. Here is what the table looks like on the Athena console, which contains all the fields mentioned under the Amazon S3 Inventory list section of the S3 User Guide:

Amazon S3 table structure with field names

The table metadata lets the Athena query engine know how to find, read, and process the data you want to query.

Here are some sample queries to run. You can create your own as per your needs:

-- Find objects that are larger than 128 KB
SELECT * FROM "default"."inventory" WHERE size > 128000 ORDER BY size DESC;

-- Group objects by S3 storage class
SELECT COUNT(*), storage_class FROM "default"."inventory" GROUP BY storage_class;

-- Find objects that are not encrypted
SELECT * FROM "default"."inventory" WHERE encryption_status = 'NOT-SSE';

-- Find objects for a particular date and bucket
SELECT * FROM "default"."inventory" WHERE dt = 'YYYY-MM-DD-00-00' AND bucketname = '<bucketname>';

-- Find duplicate objects within a Region for a specific date
SELECT e_tag, COUNT(*) FROM "default"."inventory" WHERE dt = 'YYYY-MM-DD-00-00' GROUP BY e_tag HAVING COUNT(*) > 1;
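You can also submit these queries programmatically. The sketch below builds the date-and-bucket query with basic input validation and starts it through the boto3 Athena start_query_execution call. The helper names are hypothetical, and the workgroup default assumes you kept “primary”.

```python
import re

_DT_PATTERN = re.compile(r"\d{4}-\d{2}-\d{2}-\d{2}-\d{2}")
_BUCKET_PATTERN = re.compile(r"[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]")


def partition_query(bucketname: str, dt: str) -> str:
    """Build a query restricted to one bucketname/dt partition.

    Both values are validated before being placed in the SQL text, which
    keeps malformed or malicious input out of the query string.
    """
    if not _DT_PATTERN.fullmatch(dt):
        raise ValueError("dt must look like YYYY-MM-DD-HH-MM")
    if not _BUCKET_PATTERN.fullmatch(bucketname):
        raise ValueError("bucketname is not a valid S3 bucket name")
    return (
        'SELECT * FROM "default"."inventory" '
        f"WHERE dt = '{dt}' AND bucketname = '{bucketname}'"
    )


def start_query(athena_client, query: str, workgroup: str = "primary") -> str:
    """Start the query with a boto3 Athena client and return its execution ID."""
    response = athena_client.start_query_execution(
        QueryString=query,
        QueryExecutionContext={"Database": "default"},
        WorkGroup=workgroup,
    )
    return response["QueryExecutionId"]
```

Filtering on both partition columns lets Athena skip data for other buckets and dates, which is where the performance and cost benefit of the partitioned table comes from.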

Cleaning up

To clean up resources created through this blog, complete the following steps:

Delete S3 buckets created in Step 1.
Delete the CloudFormation StackSet created in Step 1.
Delete the CloudFormation Stack you created in Step 2.
Delete the Athena table created in Step 3.

You will need to individually delete Amazon S3 inventory configurations for your S3 buckets across accounts and Regions.

Conclusion

In this post, we showed you how to aggregate Amazon S3 Inventory data at the Region level across your AWS accounts, and provided sample queries using Amazon Athena. You can now track S3 object metadata such as encryption status, replication status, storage class, size, and more at a Regional level and use it to audit your S3 objects for business, compliance, and regulatory needs across your AWS accounts. For more information on Amazon S3 Inventory, refer to the Amazon S3 Inventory documentation.

If you have feedback or questions about this post, don’t hesitate to submit them in the comments section. If you’d like to learn more about this solution, check out this video of us breaking it down: