AWS Architecture Blog

How to Efficiently Extract and Query Tagged Resources Using the AWS Resource Tagging API and S3 Select (SQL)

AWS customers can use tags to assign metadata to their AWS resources. Each tag is a simple label consisting of a customer-defined key and an optional value that makes it easier to manage, search for, and filter resources. Although there are no inherent tag types, tags enable customers to categorize resources by multiple criteria such as purpose, owner, and environment.

Once a tagging strategy is defined and enforced, customers can use the AWS Tag Editor to view and manage tags on their AWS resources, regardless of service or region. They can use the Tag Editor to search for resources by resource type, region, or tag, and then manage the tags applied to those resources.

However, customers have asked for guidance on how to build custom automation mechanisms to extract and query tagged resources so that they can extend the built-in functionality of the Tag Editor. For instance, customers can build automation to generate custom CSV files of tagged resources and then use SQL to query those resources. Automation also allows customers to add validation checks to their CI/CD deployment pipelines, for example, to verify that resources have been properly tagged.

In this blog post, we introduce a simple yet efficient AWS architecture for extracting and querying tagged resources based on AWS cloud-native features such as the Resource Tagging API and S3 Select. We provide sample code for the architecture discussed that can help customers to customize and/or extend the architecture for their own purpose. By relying on AWS cloud-native features, customers can save time and reduce costs while still being able to do customizations.

For customers unfamiliar with the Resource Tagging API and S3 Select, here is a brief introduction to each feature.

Resource Tagging API
AWS customers can use the Resource Tagging API to programmatically access, via the AWS SDKs or the AWS Command Line Interface (CLI), the same resource group operations that had previously been accessible only from the AWS Management Console. This allows customers to build automation that fits their needs, e.g., code that extracts, exports, and queries tagged resources.
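
As a quick taste of the API, the short snippet below (a minimal sketch using boto3 and the default credentials and region) fetches the first page of tagged resources and prints each resource ARN:

import boto3

# Create a Resource Groups Tagging API client
restag = boto3.client('resourcegroupstaggingapi')

# Fetch the first page of tagged resources (up to 50) and print their ARNs
response = restag.get_resources(ResourcesPerPage=50)
for resource in response['ResourceTagMappingList']:
    print(resource['ResourceARN'])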

For further details, please read the Resource Groups Tagging API Reference.

S3 Select
S3 Select enables applications to retrieve only a subset of data from an object by using simple SQL expressions. By using S3 Select to retrieve only the data needed by the application, customers can achieve drastic performance increases, in many cases as much as a 400% improvement.
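
For example, a typical S3 Select expression run against a CSV object that has a header row might look like the illustrative query below (the TagKey and TagValue column names are the ones used later in this post; the Environment tag key is just an example):

select s.TagValue from s3object s where s.TagKey = 'Environment'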

For further details, please read the Amazon S3 Select documentation.

The Overall Solution Architecture

The figure above depicts the overall architecture discussed in this post. It is a simple yet efficient architecture for extracting and querying tagged resources based on AWS cloud-native features. The Resource Tagging API is used to extract tagged resources from one or more AWS accounts via the Python AWS SDK, then a custom CSV file is generated and pushed to S3. Once in S3, the tagged resources file can be efficiently queried via S3 Select, also using the Python AWS SDK. By leveraging S3 Select, we can use SQL to query tagged resources and save on S3 data transfer costs, since only the filtered results are returned directly from S3. Pretty neat, eh?

The Extract Process
The extract process was built using Python 3 and relies on the Resource Tagging API to fetch pages of tagged resources and export them to CSV using the csv Python library.

We start by importing the required libraries (boto3 is the AWS SDK for Python, argparse helps manage input parameters, and csv supports building valid CSV files):

import boto3
import argparse
import csv

Then, we define the header columns to use when generating the CSV file of tagged resources, as well as the writeToCsv helper function:

field_names = ['ResourceArn', 'TagKey', 'TagValue']

def writeToCsv(writer, args, tag_list):
    for resource in tag_list:
        print("Extracting tags for resource: " +
              resource['ResourceARN'] + "...")
        for tag in resource['Tags']:
            row = dict(
                ResourceArn=resource['ResourceARN'], TagKey=tag['Key'], TagValue=tag['Value'])
            writer.writerow(row)
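
For reference, each entry in tag_list mirrors an element of the ResourceTagMappingList structure returned by the Resource Tagging API; a simplified entry (with made-up values) looks like this:

# Illustrative ResourceTagMappingList entry (ARN and tag values are fabricated)
{
    'ResourceARN': 'arn:aws:ec2:us-east-1:123456789012:route-table/rtb-0a1b2c3d',
    'Tags': [
        {'Key': 'aws:cloudformation:stack-name', 'Value': 'my-stack'},
        {'Key': 'Environment', 'Value': 'QA'}
    ]
}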

We take the CSV output file path as a required parameter so that users can specify the desired output file name using the argparse library:

def input_args():
    parser = argparse.ArgumentParser()
    parser.add_argument("--output", required=True,
                        help="Output CSV file (eg, /tmp/tagged-resources.csv)")
    return parser.parse_args()

Then, we implement the main extract logic that uses the Resource Tagging API (see boto3.client('resourcegroupstaggingapi') in the code below). Note that we fetch 50 resources at a time and write them to the CSV output file until no more resources are found.

def main():
    args = input_args()
    restag = boto3.client('resourcegroupstaggingapi')
    with open(args.output, 'w') as csvfile:
        writer = csv.DictWriter(csvfile, quoting=csv.QUOTE_ALL,
                                delimiter=',', dialect='excel', fieldnames=field_names)
        writer.writeheader()
        response = restag.get_resources(ResourcesPerPage=50)
        writeToCsv(writer, args, response['ResourceTagMappingList'])
        while 'PaginationToken' in response and response['PaginationToken']:
            token = response['PaginationToken']
            response = restag.get_resources(
                ResourcesPerPage=50, PaginationToken=token)
            writeToCsv(writer, args, response['ResourceTagMappingList'])

if __name__ == '__main__':
    main()

The extract procedure is pretty simple and illustrates well how to use the Resource Tagging API to customize the output. It uses the default AWS credentials configured in your environment.
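
As a side note, boto3 also provides a built-in paginator for this operation; the manual pagination loop above could be replaced by something like the sketch below (functionally equivalent, shown only as an alternative):

def main():
    args = input_args()
    restag = boto3.client('resourcegroupstaggingapi')
    with open(args.output, 'w') as csvfile:
        writer = csv.DictWriter(csvfile, quoting=csv.QUOTE_ALL,
                                delimiter=',', dialect='excel', fieldnames=field_names)
        writer.writeheader()
        # Let boto3 handle the PaginationToken bookkeeping
        paginator = restag.get_paginator('get_resources')
        for page in paginator.paginate(ResourcesPerPage=50):
            writeToCsv(writer, args, page['ResourceTagMappingList'])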

Here is how the extract process can be triggered for the QA account (assuming the Python source file is named aws-tagged-resources-extractor.py and that there is a QA_AWS_ACCOUNT AWS profile defined in your ~/.aws/credentials file).

export AWS_PROFILE=QA_AWS_ACCOUNT
python aws-tagged-resources-extractor.py --output /tmp/qa-tagged-resources.csv
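
The resulting CSV file contains one row per resource/tag pair; a fabricated excerpt (for illustration only) would look like this:

"ResourceArn","TagKey","TagValue"
"arn:aws:ec2:us-east-1:123456789012:route-table/rtb-0a1b2c3d","aws:cloudformation:stack-name","my-stack"
"arn:aws:s3:::my-qa-bucket","Environment","QA"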

The extract procedure can be applied to other AWS accounts by updating the AWS_PROFILE environment variable accordingly.

The ‘Upload to S3’ Process
Once the file /tmp/qa-tagged-resources.csv is generated, it can be uploaded to an S3 bucket using the AWS CLI (or one could extend the extract sample code above to do so; a sketch follows the command below):

aws s3 cp /tmp/qa-tagged-resources.csv s3://[REPLACE-WITH-YOUR-S3-BUCKET]
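
Alternatively, as mentioned above, the extract script itself could push the file to S3; a minimal sketch using boto3 is shown below (bucket name and key are placeholders):

import boto3

# Upload the generated CSV to the central S3 bucket (replace bucket and key as needed)
s3 = boto3.client('s3')
s3.upload_file('/tmp/qa-tagged-resources.csv',
               '[REPLACE-WITH-YOUR-S3-BUCKET]',
               'qa-tagged-resources.csv')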

The Query Process
Once the CSV files containing tagged resources for different AWS accounts are uploaded to S3, we can use S3 Select to perform familiar SQL queries against these files. Another advantage of using S3 Select is that it reduces the amount of data transferred from S3, which is especially relevant when accounts have a very large number of tagged resources.

We again use the boto3 and argparse libraries (Python 3). Required input parameters include the S3 bucket (--bucket) and the S3 key (--key). The SQL query parameter (--query) is optional and will return all results if not provided.

import boto3
import argparse

def input_args():
    parser = argparse.ArgumentParser()
    parser.add_argument("--bucket", required=True, help="SQL query to filter tagged resources output")
    parser.add_argument("--key", required=True, help="SQL query to filter tagged resources output")
    parser.add_argument("--query", default="select * from s3object", help="SQL query to filter tagged resources output")
    return parser.parse_args()

The main query logic is shown below. It uses boto3.client('s3') to initialize an S3 client that is later used to query the tagged resources CSV file in S3 via the select_object_content() function. This function takes the S3 bucket name, S3 key, and query as parameters. Check the Boto3 API reference (http://boto3.readthedocs.io/en/latest/reference/services/s3.html) for details on this function and its inputs and outputs.

def main():
    args = input_args()
    s3 = boto3.client('s3')
    response = s3.select_object_content(
        Bucket=args.bucket,
        Key=args.key,
        ExpressionType='SQL',
        Expression=args.query,
        InputSerialization={'CSV': {"FileHeaderInfo": "Use"}},
        OutputSerialization={'CSV': {}},
    )

    for event in response['Payload']:
        if 'Records' in event:
            records = event['Records']['Payload'].decode('utf-8')
            print(records)
            
if __name__ == '__main__':
    main()
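
To see how much data S3 Select actually scanned versus returned (the savings mentioned earlier), the event loop in main() above can optionally also inspect the Stats event; the sketch below would replace that loop:

    # Sketch: replace the event loop above to also report S3 Select statistics
    for event in response['Payload']:
        if 'Records' in event:
            print(event['Records']['Payload'].decode('utf-8'))
        elif 'Stats' in event:
            details = event['Stats']['Details']
            print("Bytes scanned: {}, processed: {}, returned: {}".format(
                details['BytesScanned'], details['BytesProcessed'], details['BytesReturned']))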

Here are a few examples of how to trigger the query procedure against the CSV files stored in S3 (assuming the Python source file for the query procedure is called aws-tagged-resources-querier.py). We assume that the S3 bucket is located in a single account referenced by profile CENTRAL_AWS_ACCOUNT.

Return the resource ARNs of all route tables containing a tag named 'aws:cloudformation:stack-name' in the QA AWS account:

export AWS_PROFILE=CENTRAL_AWS_ACCOUNT
python aws-tagged-resources-querier.py \
     --bucket [REPLACE-WITH-YOUR-S3-BUCKET] \
     --key qa-tagged-resources.csv \
     --query "select ResourceArn from s3object s \
              where s.ResourceArn like 'arn:aws:ec2%route-table%' \
                and s.TagKey='aws:cloudformation:stack-name'"

We invite readers to build more sophisticated SQL queries.
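
As a starting point, the illustrative query below (assuming resources carry an Environment tag, which is just an example key) returns the ARNs of all resources tagged Environment=QA; S3 Select also supports aggregates such as count(*):

python aws-tagged-resources-querier.py \
     --bucket [REPLACE-WITH-YOUR-S3-BUCKET] \
     --key qa-tagged-resources.csv \
     --query "select s.ResourceArn from s3object s \
              where s.TagKey='Environment' and s.TagValue='QA'"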

Summary
In this blog post, we introduced a simple yet efficient AWS architecture for extracting and querying tagged resources based on AWS cloud-native features such as the Resource Tagging API and S3 Select. We provided sample code that can help customers to customize and/or extend the architecture for their own purpose. By relying on AWS cloud-native features, customers can save time and reduce costs while still being able to do customizations.

The “extract” process discussed above is available in the AWS Serverless Application Repository under an application called aws-tag-explorer. Check it out!

Happy Resource Tagging!

Marcilio Mendonca

Marcilio Mendonca is a Sr. Solutions Developer in the Solution Prototyping Team at Amazon Web Services. Over the past years, he has been helping AWS customers design, build, and deploy modern applications on AWS, leveraging VMs, containers, and serverless architectures. Prior to joining AWS, Marcilio was a Software Development Engineer with Amazon. He also holds a PhD in Computer Science. You can find him on Twitter at @marciliomendonc.