Building a central asset register with Amazon S3 Inventory

UPDATE 7/12/2022: Amazon SQS policy updated to support every AWS Region (step 3 in the architecture diagram) in the central.yml template.

Many AWS customers store millions of objects in their Amazon S3 buckets, due to the scalability, durability, and performance that S3 provides. Customers compelled to build an information asset register for compliance reasons or wanting to identify thousands of objects to prepare a batch operation can rely on Amazon S3 Inventory. S3 Inventory is one of the features S3 provides to help manage your storage. You can use it to audit and report on the replication and encryption status of all or some of your objects. S3 Inventory is useful for business, compliance, and regulatory needs, and you can even include optional fields in your reports, such as the storage class, size, and replication status of your objects.

However, customers face two challenges when using Amazon S3 Inventory at scale. First, the S3 Inventory report must be located in the same AWS Region as the Amazon S3 bucket. Second, the prefix structure makes it impossible to query all reports at once if you have multiple buckets. With these limitations, customers who store objects in multiple S3 buckets, AWS Regions, and AWS accounts are not directly able to build a central inventory.

In this blog post, I present a solution that leverages Amazon SNS, Amazon SQS, and AWS Lambda to consolidate Amazon S3 Inventory reports into a single bucket. The solution also organizes inventory reports in a partitioned AWS Glue table for performance improvement and cost reduction. Using these reports, you can, for example, prepare an Amazon S3 Batch Operation to change the storage class of old items, or create an Amazon QuickSight dashboard to check the compliance of your data assets.

Solution overview

The following illustration shows the architecture of the solution:

[Figure: architecture of the central asset register built with Amazon S3 Inventory]

The complete steps in this workflow are as follows:

1. Amazon S3 Inventory publishes the report into a regional Amazon S3 bucket.
2. The Amazon S3 bucket notifies Amazon SNS that it has received a new report.
3. The Amazon SNS topic forwards the event generated in step 2 to a central Amazon SQS queue.
4. An AWS Lambda function reads the events from the queue and copies the Amazon S3 Inventory reports from the remote AWS Region to the central one. It also updates the AWS Glue table definition with new partitions.
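To make step 4 more concrete, here is a minimal sketch of how a Lambda function could unwrap the events it reads from the queue. It is not the code shipped in the CloudFormation template. The function name extract_s3_objects is hypothetical, and the sketch assumes the default SNS-to-SQS delivery (raw message delivery disabled), so each SQS body is an SNS envelope whose Message field contains the S3 event notification as a JSON string.

```python
import json
from urllib.parse import unquote_plus


def extract_s3_objects(sqs_event):
    """Yield (region, bucket, key) for each S3 object referenced in an SQS batch.

    Assumption: raw message delivery is disabled on the SNS subscription, so
    every SQS body is an SNS envelope wrapping the S3 event notification.
    """
    for sqs_record in sqs_event.get('Records', []):
        sns_envelope = json.loads(sqs_record['body'])
        s3_event = json.loads(sns_envelope['Message'])
        # The s3:TestEvent message sent when notifications are first
        # configured has no "Records" key, so default to an empty list.
        for s3_record in s3_event.get('Records', []):
            yield (
                s3_record['awsRegion'],
                s3_record['s3']['bucket']['name'],
                # Object keys are URL-encoded in S3 event notifications.
                unquote_plus(s3_record['s3']['object']['key']),
            )
```

A handler would loop over the tuples this generator yields, copy each object to the central bucket, and then register the matching partition in AWS Glue.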

Amazon S3 Inventory reports consolidation

The solution proposed relies on two main components:

A regional collector, which collects all S3 Inventory reports from your Amazon S3 buckets. It is an S3 Inventory requirement that you host your destination bucket in the same AWS Region as the source bucket, but not necessarily in the same AWS account.
A central application, which copies the S3 Inventory reports centrally as soon as they arrive in the regional collector.

This tutorial is a four-step process. It includes deploying and configuring AWS resources in a central AWS Region and in each AWS Region hosting your Amazon S3 buckets:

Configuration of the central application.
Configuration of the regional collector.
Configuration of Amazon S3 Inventory.
Automation of the Amazon S3 Inventory configuration on existing buckets.

Step 1: Configure the central Region

The central Region is the AWS Region where you want to centrally store and query the reports. In this blog post, we assume it is Ireland (eu-west-1).

The following stack deploys the SQS queues, the Lambda function, an Amazon S3 bucket, and an AWS Glue table to centrally store and query Amazon S3 Inventory reports:

If you want to deploy it in any other AWS Region, use this link to download the AWS CloudFormation template.

Step 2: Configure the regional collector

The regional collector is a stack built with Amazon S3 and Amazon SNS. You configure an Amazon S3 bucket with the correct bucket policy to receive the inventory reports for every Amazon S3 bucket you host in this AWS Region. The Amazon SNS topic handles Amazon S3 object-creation events and sends them to the central Amazon SQS queue, ready to be processed.

The following stack deploys the regional collector in Ireland (eu-west-1):

By default, the stack deploys an Amazon S3 bucket with a bucket policy that only allows S3 Inventory reports from this AWS account. If you want to receive inventory reports from additional AWS accounts, update the stack parameter AllowList with the list of additional AWS account IDs.

If you want to deploy it in any other AWS Region, use this link to download the AWS CloudFormation template.

You must deploy the regional stack in every AWS Region where you have Amazon S3 buckets. You can leverage AWS CloudFormation StackSets to quickly deploy the stack in multiple Regions or download the template and use the AWS Command Line Interface.

The following bash script loops over every standard AWS Region (the AWS GovCloud and China Regions are filtered out) and deploys the regional stack with the AWS CLI.

for region in $(aws ssm get-parameters-by-path --path /aws/service/global-infrastructure/regions --output text --query "Parameters[?!((contains(Value, 'gov')) || (contains(Value,'cn-')))].Value"); do
  aws cloudformation create-stack --stack-name "inventory-$region" --template-body file://regional.yml --region "$region"
done

Step 3: Configure S3 Inventory feature on buckets

Now that you have set up the infrastructure to collect the inventory reports, you have to enable S3 Inventory reports on existing buckets.

Perform the configuration by completing the following steps:

Open the Amazon S3 console.
Click on a bucket for which you want to have inventory reports.
Choose the Management tab.
Under Inventory configurations, click Create inventory configuration.
For Inventory configuration name, enter inventory.
For Inventory scope, leave Prefix empty and choose Current version only.
Under Reports details, select the following options:
- For Destination bucket, if the bucket is in the same AWS account as the regional bucket, choose This account, else choose A different account and enter the AWS account ID of the regional bucket in Account ID.
- For Destination, enter s3://<inventory_bucket>/<accountid> where <inventory_bucket> is the regional bucket and <accountid> is this bucket’s AWS account ID.
- For Frequency, select Daily.
- For Output format, choose Apache Parquet.
- For Status, choose Enable.
Under Server-side encryption:
- For Server-side encryption, choose Enable.
- For Encryption key type, choose Amazon S3 key (SSE-S3).
For Additional fields, select all fields.
Choose Create.

[Figure: Create your S3 Inventory configuration]

The configuration is now complete. Amazon S3 Inventory produces daily reports and publishes them into the regional Amazon S3 bucket. The AWS Lambda function then copies the reports to the central Amazon S3 bucket under distinct prefixes, following partitioning best practices. AWS Glue sees this as a table with four partition keys, which improves query performance and reduces query costs.

The partition prefixes are:

dt: the date of the inventory report
account: the AWS account ID of the Amazon S3 bucket
region: the AWS Region of the Amazon S3 bucket
bucket: the Amazon S3 bucket name

Once copied, an example inventory report has the following name structure:

/dt=2020-04-20/account=123456789012/region=eu-west-1/bucket=some-bucket/data.parquet
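The mapping from a regional report key to this partitioned layout can be sketched as follows. This is an illustration, not the Lambda function from the template. The helper names central_key and partition_values are hypothetical, and the sketch assumes that S3 Inventory writes its Parquet data files under <destination-prefix>/<source-bucket>/<config-id>/data/, with the destination prefix set to the account ID as configured in step 3. The report date is passed in by the caller because the data file key itself does not carry it.

```python
def central_key(source_key, region, report_date):
    """Map a regional report key to a partitioned key for the central bucket.

    Assumed source layout (destination prefix = account ID, as in step 3):
    <accountid>/<source-bucket>/<config-id>/data/<file>.parquet
    """
    parts = source_key.split('/')
    if len(parts) < 5 or parts[-2] != 'data':
        raise ValueError(f'unexpected inventory key layout: {source_key}')
    account, bucket, filename = parts[0], parts[1], parts[-1]
    return (
        f'dt={report_date}/account={account}/region={region}/'
        f'bucket={bucket}/{filename}'
    )


def partition_values(key):
    """Return the partition columns encoded in a central key as a dict."""
    segments = key.strip('/').split('/')[:-1]  # drop the file name
    return dict(segment.split('=', 1) for segment in segments)
```

For example, partition_values applied to the report name above returns the dt, account, region, and bucket values that a new Glue partition would need.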

Step 4: Automating S3 Inventory feature configuration on existing buckets

You have now built a pipeline that you can deploy in every AWS Region to collect all Amazon S3 Inventory reports from any bucket. Once Amazon S3 Inventory delivers an inventory report, the Amazon S3 bucket sends an event to Amazon SNS. Amazon SNS forwards it to Amazon SQS. Amazon SQS triggers an AWS Lambda function. The Lambda function copies the file centrally and updates the Glue table so that you can query the reports from Amazon Athena (see this blog or this documentation) as soon as they arrive.
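As an example of querying the consolidated reports, the sketch below counts the inventoried objects per bucket for one report date, filtering on the dt partition so that Athena only scans that day's files. The function names, the table name you pass in, the database name, and the output location are all placeholders to replace with the values from your own deployment. The sketch also assumes that the dt partition is stored as a string.

```python
import re
from datetime import date


def objects_per_bucket_query(table, report_date):
    """Build an Athena query that counts inventoried objects per bucket."""
    if not re.fullmatch(r'[A-Za-z0-9_.]+', table):
        raise ValueError(f'unexpected table name: {table}')
    # Parsing the date rejects malformed values before they reach the SQL text.
    day = date.fromisoformat(report_date).isoformat()
    return (
        'SELECT account, region, bucket, COUNT(*) AS objects '
        f'FROM {table} '
        f"WHERE dt = '{day}' "
        'GROUP BY account, region, bucket '
        'ORDER BY objects DESC'
    )


def start_query(query, database, output_location):
    """Submit the query to Athena and return its execution ID."""
    import boto3  # imported lazily so the query builder works without AWS access

    athena = boto3.client('athena')
    response = athena.start_query_execution(
        QueryString=query,
        QueryExecutionContext={'Database': database},
        ResultConfiguration={'OutputLocation': output_location},
    )
    return response['QueryExecutionId']
```

Only the partition columns appear in the query, so it does not depend on the exact set of optional fields included in the reports.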

Once the pipeline is deployed in every Region, the following Python scripts can help you configure S3 Inventory reports for every bucket of an AWS account.

Script to create a map of each Region and its inventory bucket

# Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
# SPDX-License-Identifier: MIT-0
import json
import sys

import boto3

session = boto3.Session()
client = session.client('s3')


def build_inventory_map():
    """Map each AWS Region to the inventory bucket hosted there."""
    inventory_buckets = {}
    for b in client.list_buckets()['Buckets']:
        if b['Name'].startswith('inventory'):
            region = client.get_bucket_location(Bucket=b['Name'])['LocationConstraint']
            # get_bucket_location returns None for buckets in us-east-1
            region = 'us-east-1' if not region else region
            # If several buckets in a Region match, the last one listed wins
            inventory_buckets[region] = b['Name']
    with open('aws_inventory_buckets.json', 'w') as f:
        json.dump(inventory_buckets, f)


if __name__ == '__main__':
    try:
        build_inventory_map()
        print('inventory map built and stored locally')
        sys.exit(0)
    except Exception as e:
        print('An error occurred')
        print(e)
        sys.exit(1)
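The script above handles the None value that get_bucket_location returns for us-east-1. The API can also return the legacy value EU for older buckets in eu-west-1, and the map silently keeps only one bucket per Region. The sketch below shows one way to handle both cases. The function names are hypothetical, and the caller is assumed to supply the bucket names and LocationConstraint values.

```python
def region_from_location(location_constraint):
    """Translate a LocationConstraint value into an AWS Region name."""
    if not location_constraint:
        return 'us-east-1'  # us-east-1 buckets report no location constraint
    if location_constraint == 'EU':
        return 'eu-west-1'  # legacy value for buckets in Ireland
    return location_constraint


def build_region_map(bucket_locations, prefix='inventory'):
    """Build {region: bucket} from (bucket_name, LocationConstraint) pairs.

    Raises ValueError instead of silently overwriting when two candidate
    inventory buckets live in the same Region.
    """
    region_map = {}
    for name, location in bucket_locations:
        if not name.startswith(prefix):
            continue
        region = region_from_location(location)
        if region in region_map:
            raise ValueError(
                f'{region} has several candidate buckets: '
                f'{region_map[region]}, {name}'
            )
        region_map[region] = name
    return region_map
```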

Script to configure S3 inventory feature

Disclaimer: This script overwrites any existing inventory configuration named inventory on each bucket it processes.

# Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
# SPDX-License-Identifier: MIT-0
import boto3
import json

session = boto3.Session()
s3 = session.resource('s3')
client = session.client('s3')
local_account_id = session.client('sts').get_caller_identity()['Account']
inventory_account_id = '123456789012'  # replace with AWS account ID hosting the inventory buckets
inventory_buckets = {}


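def _get_inventory_bucket(bucket_name):
    """Return the inventory bucket hosted in the same Region as bucket_name."""
    region = client.get_bucket_location(Bucket=bucket_name)['LocationConstraint']
    # get_bucket_location returns None for buckets in us-east-1
    region = 'us-east-1' if not region else region
    # Raises KeyError when the Region has no entry in aws_inventory_buckets.json
    return inventory_buckets[region]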


def load_inventory_buckets():
    global inventory_buckets
    with open('aws_inventory_buckets.json', 'r') as f:
        inventory_buckets = json.load(f)


def configure_inventory():
    for bucket in s3.buckets.all():
        if bucket.name.startswith('inventory'):
            print(f'skipping inventory bucket {bucket.name}')
            continue
        try:
            client.put_bucket_inventory_configuration(
                Bucket=bucket.name,
                Id='inventory',
                InventoryConfiguration={
                    'Id': 'inventory',
                    'IsEnabled': True,
                    # 'All' also lists noncurrent versions; use 'Current' to
                    # match the "Current version only" console setting in step 3
                    'IncludedObjectVersions': 'All',
                    'Destination': {
                        'S3BucketDestination': {
                            'AccountId': inventory_account_id,
                            'Bucket': f'arn:aws:s3:::{_get_inventory_bucket(bucket.name)}',
                            'Format': 'Parquet',
                            'Prefix': local_account_id,
                            'Encryption': {'SSES3': {}},
                        }
                    },
                    'OptionalFields': [
                        'Size',
                        'LastModifiedDate',
                        'StorageClass',
                        'ETag',
                        'IsMultipartUploaded',
                        'ReplicationStatus',
                        'EncryptionStatus',
                        'ObjectLockRetainUntilDate',
                        'ObjectLockMode',
                        'ObjectLockLegalHoldStatus',
                        'IntelligentTieringAccessTier',
                    ],
                    'Schedule': {'Frequency': 'Daily'},
                }
            )
        except Exception as e:
            print(f'An error occurred processing {bucket.name}')
            print(e)


if __name__ == '__main__':
    try:
        load_inventory_buckets()
        configure_inventory()
    except Exception as e:
        print(f'An error occurred while configuring buckets for account {local_account_id}')
        print(e)
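Because the script replaces any configuration whose ID is inventory, you may want to check what already exists on a bucket before running it. The following sketch lists the inventory configuration IDs of a bucket with the list_bucket_inventory_configurations API and follows the continuation token when the result is truncated. The helper names inventory_ids and would_overwrite are hypothetical, and the client argument is expected to be a boto3 S3 client such as the one created in the script above.

```python
def inventory_ids(client, bucket_name):
    """Return the IDs of all inventory configurations defined on a bucket."""
    ids = []
    kwargs = {'Bucket': bucket_name}
    while True:
        response = client.list_bucket_inventory_configurations(**kwargs)
        ids.extend(
            config['Id']
            for config in response.get('InventoryConfigurationList', [])
        )
        if not response.get('IsTruncated'):
            return ids
        # Fetch the next page of configurations.
        kwargs['ContinuationToken'] = response['NextContinuationToken']


def would_overwrite(client, bucket_name, config_id='inventory'):
    """Tell whether configuring config_id would replace an existing entry."""
    return config_id in inventory_ids(client, bucket_name)
```

Calling would_overwrite(client, bucket.name) inside the loop of configure_inventory would let you skip or log the buckets that already have a configuration with this ID.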

Cleaning up

In order to avoid extra charges after testing this solution, don’t forget to clean up the resources created:

Empty all Amazon S3 buckets created by the AWS CloudFormation stacks. They start with inventory-.
Delete the AWS CloudFormation stack deployed for the consolidation.
Delete the AWS CloudFormation stacks deployed in each AWS Region for inventory collection.

Conclusion

In this blog post, I covered how to collect and consolidate all Amazon S3 Inventory reports from multiple AWS Regions and multiple AWS accounts, allowing you to query them centrally. To do so, I set up a distributed architecture based on Amazon S3 event notifications, Amazon SNS, Amazon SQS, and AWS Lambda. Now, instead of having to search for each report individually in every AWS Region, you can find them in one central location. Moreover, with the new partitioning scheme, you can now query them all at once.

Using this solution, you can have a global view of the S3 objects within your organization. It saves you time when you must audit your resources. It can also help ensure that all your objects comply with business requirements around localization or encryption. Finally, you can also simplify and speed up business workflows and big data jobs by using Amazon S3 Inventory, which provides a scheduled alternative to the Amazon S3 synchronous List API operation.

If you only need visibility into your Amazon S3 usage and activity at the bucket or prefix level, I recommend that you use our newly launched feature, Amazon S3 Storage Lens.

Thanks for reading this blog post! If you have any comments or questions, please don’t hesitate to leave them in the comments section.