AWS Big Data Blog
Analyze your Amazon S3 spend using AWS Glue and Amazon Redshift
The AWS Cost & Usage Report (CUR) tracks your AWS usage and provides estimated charges associated with that usage. You can configure this report to present the data at hourly or daily intervals, and it is updated at least once per day until it is finalized at the end of the billing period. The Cost & Usage Report is delivered automatically to an Amazon S3 bucket that you specify, and you can download it from there directly. You can also integrate the report into Amazon Redshift, query it with Amazon Athena, or upload it to Amazon QuickSight. For more information, see Query and Visualize AWS Cost and Usage Data Using Amazon Athena and Amazon QuickSight.
This post presents a solution that uses the AWS Glue Data Catalog and Amazon Redshift to analyze S3 usage and spend by combining the AWS CUR, S3 inventory reports, and S3 server access logs.
Prerequisites
Before you begin, complete the following prerequisites:
- You need an S3 bucket for your S3 inventory and server access log data files. For more information, see Create a Bucket and What is Amazon S3?
- You must have the appropriate IAM permissions for Amazon Redshift to be able to access the S3 buckets. For this post, the IAM role uses two non-restrictive managed policies (AmazonS3FullAccess and AWSGlueConsoleFullAccess), but you should restrict access appropriately for your own scenarios.
Amazon S3 inventory
Amazon S3 inventory is one of the tools S3 provides to help manage your storage. You can use it to audit and report on the replication and encryption status of your objects for business, compliance, and regulatory needs. Amazon S3 inventory provides comma-separated values (CSV), Apache optimized row columnar (ORC), or Apache Parquet output files that list your objects and their corresponding metadata on a daily or weekly basis for a given S3 bucket.
Amazon S3 server access logs
Server access logging provides detailed records for the requests that are made to a bucket. Server access logs are useful for many applications, such as security and access audits. They can also help you learn about your customer base and understand your S3 bill.
AWS Glue
AWS Glue is a fully managed extract, transform, and load (ETL) service that makes it simple and cost-effective to categorize your data, clean it, enrich it, and move it reliably between various data stores. AWS Glue consists of a central metadata repository known as the Data Catalog, a crawler to populate the Data Catalog with tables, an ETL engine that automatically generates Python or Scala code, and a flexible scheduler that handles dependency resolution, job monitoring, and retries. AWS Glue is serverless, so there’s no infrastructure to set up or manage. This post uses AWS Glue to catalog S3 inventory data and server access logs, which makes it available for you to query with Amazon Redshift Spectrum.
Amazon Redshift
Amazon Redshift is a fully managed, petabyte-scale data warehouse service in the cloud. You can use Amazon Redshift to efficiently query and retrieve structured and semi-structured data from files in S3 without having to load the data into Amazon Redshift native tables. You can create Amazon Redshift external tables by defining the structure for files and registering them as tables in the AWS Glue Data Catalog.
Setting up S3 inventory reports for analysis
This post uses the Parquet file format for its inventory reports and delivers the files daily to S3 buckets. You can select both the frequency of delivery and output file formats under Advanced settings as shown in the screenshot below:
For more information about configuring your S3 inventory, see How Do I Configure Amazon S3 Inventory?
The following diagram shows the data flow for this solution:
The following steps summarize the data flow in the preceding diagram:
- S3 Inventory Reports are delivered to an S3 bucket that you configure.
- The AWS Glue crawler then crawls this S3 bucket and populates the metadata in the AWS Glue Data Catalog.
- The AWS Glue Data Catalog is then accessible through an external schema in Amazon Redshift.
- The S3 Inventory Reports (available in the AWS Glue Data Catalog) and the Cost and Usage Reports (available in another S3 bucket) are now ready to be joined and queried for analysis.
The inventory reports are delivered to an S3 bucket. The following screenshot shows the S3 bucket structure for the S3 inventory reports:
There is a data folder in this bucket. This folder contains the Parquet data you want to analyze. The following screenshot shows the content of the folder.
Because the inventory is delivered daily, there is one file per day.
Configuring an AWS Glue crawler
You can use an AWS Glue crawler to discover this dataset in your S3 bucket and create the table schemas in the Data Catalog. After you create these tables, you can query them directly from Amazon Redshift.
To configure your crawler to read S3 inventory files from your S3 bucket, complete the following steps:
- Choose a crawler name.
- Choose S3 as the data store and specify the S3 path up to the data folder.
- Choose an IAM role that can read data from S3. This post uses a role with the AmazonS3FullAccess and AWSGlueConsoleFullAccess policies attached.
- Set a frequency schedule for the crawler to run.
- Configure the crawler’s output by selecting a database and adding a prefix (if any).
This post uses the database s3spendanalysis.
The following screenshot shows the completed crawler configuration.
Run this crawler to add tables to your AWS Glue Data Catalog. After the crawler completes successfully, go to the Tables section on the AWS Glue console to verify the table details and table metadata. The following screenshot shows the table details and table metadata:
Creating an external schema
Before you can query the S3 inventory reports, you need to create an external schema (and subsequently, external tables) in Amazon Redshift. An Amazon Redshift external schema references an external database in an external data catalog. Because you are using an AWS Glue Data Catalog as your external catalog, after you create an external schema in Amazon Redshift, you can see all the external tables in your Data Catalog in Amazon Redshift. To create the external schema, enter the following code:
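The following is a minimal sketch of that statement, assuming the Data Catalog database is named s3spendanalysis (as created by the crawler); the schema name s3_external_schema and the IAM role ARN are placeholders to replace with your own values:

```sql
-- Create an external schema that points to the AWS Glue Data Catalog
-- database populated by the crawler. Replace the IAM role ARN with a
-- role that is attached to your Amazon Redshift cluster.
CREATE EXTERNAL SCHEMA s3_external_schema
FROM DATA CATALOG
DATABASE 's3spendanalysis'
IAM_ROLE 'arn:aws:iam::<account-id>:role/<redshift-spectrum-role>'
CREATE EXTERNAL DATABASE IF NOT EXISTS;
```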
Querying the table
On the Amazon Redshift dashboard, under Query editor, you can see the data table. You can also query the svv_external_schemas system table to verify that your external schema has been created successfully. See the following screenshot.
You can now query the S3 inventory reports directly from Amazon Redshift without having to move the data into Amazon Redshift first. The following screenshot shows how to do this using the Query Editor in the Amazon Redshift console:
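For example, a query along the following lines returns a sample of inventory rows. The table and column names below are illustrative; check the Data Catalog for the names your crawler actually generated:

```sql
-- Sample the S3 inventory external table (names are illustrative).
SELECT bucket,
       key,
       size,
       storage_class,
       last_modified_date
FROM s3_external_schema.data
LIMIT 10;
```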
Setting up S3 server access logs for analysis
The following diagram shows the data flow for this solution.
The following steps summarize the data flow in the preceding diagram:
- S3 Server Access Logs are delivered to an S3 bucket that you configure.
- These server access logs are then directly accessible for querying from Amazon Redshift (note that this post uses CREATE EXTERNAL TABLE in Redshift Spectrum for this purpose, as explained below).
- The S3 Server Access Logs and the Cost and Usage Reports (available in another S3 bucket) are now ready to be joined and queried for analysis.
The S3 server access logs are delivered to an S3 bucket. For more information about setting up server access logging, see Amazon S3 Server Access Logging.
The following screenshot shows the S3 bucket structure for the server access logs.
The server access log files consist of a sequence of new-line delimited log records. Each log record represents one request and consists of space-delimited fields. The following code is an example log record:
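The record below is illustrative (the account, bucket, object, and request values are placeholders), but it follows the space-delimited field order of the access log format:

```
79a59df900b949e55d96a1e698fbacedEXAMPLECANONICALID awsexamplebucket [06/Feb/2019:00:00:38 +0000] 192.0.2.3 arn:aws:iam::123456789012:user/example-user 3E57427F3EXAMPLE REST.GET.OBJECT example-object.parquet "GET /awsexamplebucket/example-object.parquet HTTP/1.1" 200 - 113 113 7 6 "-" "S3Console/0.4" -
```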
Creating an external table
You can define the S3 server access logs as an external table. Because you already have an external schema, create the external table in that schema. This post uses the RegEx SerDe to create a table that correctly parses all the fields present in the S3 server access logs. See the following code:
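The statement below is a sketch under stated assumptions: the external schema and table names are the placeholders used earlier in this post, the S3 LOCATION is a placeholder for your log bucket, and the regular expression is a simplified pattern that covers the leading fields of the log format (extend it if you need the trailing fields such as host ID or TLS version):

```sql
-- External table over the raw access logs, parsed with the Hive RegexSerDe.
CREATE EXTERNAL TABLE s3_external_schema.s3_access_logs (
  bucket_owner     varchar(256),
  bucket           varchar(256),
  request_datetime varchar(256),
  remote_ip        varchar(256),
  requester        varchar(256),
  request_id       varchar(256),
  operation        varchar(256),
  key              varchar(2048),
  request_uri      varchar(2048),
  http_status      varchar(8),
  error_code       varchar(256),
  bytes_sent       varchar(64),
  object_size      varchar(64),
  total_time       varchar(64),
  turnaround_time  varchar(64),
  referrer         varchar(2048),
  user_agent       varchar(2048),
  version_id       varchar(256)
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  'input.regex' = '([^ ]*) ([^ ]*) \\[(.*?)\\] ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) (\"[^\"]*\"|-) (-|[0-9]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) (\"[^\"]*\"|-) (\"[^\"]*\"|-) ([^ ]*).*$'
)
STORED AS TEXTFILE
LOCATION 's3://<your-access-log-bucket>/<prefix>/';
```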
Validating the data
You can validate the external table data in Amazon Redshift. The following screenshot shows how to do this using the Query Editor in the Amazon Redshift console:
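As a quick sanity check, a query such as the following (using the table name from the sketch above) confirms that the fields parse as expected:

```sql
-- Group the parsed log records by operation and HTTP status.
SELECT operation,
       http_status,
       count(*) AS request_count
FROM s3_external_schema.s3_access_logs
GROUP BY operation, http_status
ORDER BY request_count DESC
LIMIT 20;
```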
You are now ready to analyze the data.
Analyzing the data using Amazon Redshift
In this post, you have a CUR file per day in your S3 bucket. The files themselves are organized in a monthly hierarchy. See the following screenshot.
Each day's delivery consists of the following files for CUR data:
- myCURReport-1.csv.gz – A gzip-compressed CSV file containing the data itself
- myCURReport-Manifest.json – A JSON file that contains the metadata for the file
- myCURReport-RedshiftCommands.sql – Amazon Redshift table creation scripts and a COPY command to create the CUR table from a Redshift manifest file
- myCURReport-RedshiftManifest.json – The Amazon Redshift manifest file to create the CUR table
Using Amazon Redshift is one of the many ways to carry out this analysis. Amazon Redshift is a fast, fully managed cloud data warehouse that makes it simple and cost-effective to analyze all your data using standard SQL and your existing Business Intelligence (BI) tools. Amazon Redshift gives you fast querying capabilities over structured data using familiar SQL-based clients and BI tools using standard ODBC and JDBC connections. Queries are distributed and parallelized across multiple physical resources.
You are now ready to run SQL queries with the Amazon Redshift SQL Query Editor. This post also uses the psql client tool, a terminal-based front end for PostgreSQL, to query the data in the cluster.
To query the data, complete the following steps:
- Create a custom schema to contain your tables for analysis. See the following code:
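For example, using the schema name referenced later in this post:

```sql
-- Dedicated schema for the cost and usage analysis tables.
CREATE SCHEMA redshift_schema;
```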
You should create your table in a schema other than public to control user access to database objects.
- Create a CUR table for the latest month in Amazon Redshift using the CUR SQL file in S3. See the following code:
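The complete statement is generated for you in the myCURReport-RedshiftCommands.sql file and depends on the columns configured for your report. The abbreviated sketch below only illustrates its shape; the column names are assumptions and may differ from your generated file:

```sql
-- Abbreviated, illustrative CUR table definition. Use the exact CREATE
-- TABLE statement from myCURReport-RedshiftCommands.sql for your report.
CREATE TABLE redshift_schema.AWSBilling201910 (
  identity_lineitemid     varchar(256),
  identity_timeinterval   varchar(100),
  lineitem_usagestartdate timestamp,
  lineitem_usageenddate   timestamp,
  lineitem_productcode    varchar(100),
  lineitem_usagetype      varchar(100),
  lineitem_operation      varchar(100),
  lineitem_resourceid     varchar(512),
  lineitem_usageamount    numeric(38,8),
  lineitem_unblendedcost  numeric(38,8),
  lineitem_blendedcost    numeric(38,8),
  product_region          varchar(100)
  -- remaining CUR columns as generated in the SQL file
);
```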
- Load the data into Amazon Redshift for the latest month, using the provided CUR Manifest file. See the following code:
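A sketch of the COPY command follows; the S3 path and IAM role ARN are placeholders, and the manifest file name comes from the report delivery described above:

```sql
-- Load the latest month of CUR data using the Redshift manifest file.
COPY redshift_schema.AWSBilling201910
FROM 's3://<your-cur-bucket>/<report-path>/myCURReport-RedshiftManifest.json'
IAM_ROLE 'arn:aws:iam::<account-id>:role/<redshift-load-role>'
MANIFEST
GZIP
CSV
IGNOREHEADER 1;
```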
- Validate the data loaded in the Amazon Redshift table. See the following code:
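For example, a simple row count confirms that the load succeeded:

```sql
-- Confirm that rows were loaded for the month.
SELECT count(*) AS row_count
FROM redshift_schema.AWSBilling201910;
```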
The following screenshot shows that data has been loaded correctly in the Amazon Redshift table:
Managing database security
You can manage database security in Amazon Redshift by controlling which users have access to which database objects. To make sure your objects are secure, create two groups, FINANCE and ADMIN, with two users in FINANCE and one user in ADMIN. Complete the following steps:
- Create the groups where the user accounts are assigned. The following code creates two different user groups:
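A minimal sketch:

```sql
-- Create the two user groups.
CREATE GROUP finance;
CREATE GROUP admin;
```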
To view all user groups, query the PG_GROUP system catalog table (you should see finance and admin here):
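For example:

```sql
-- List all user groups.
SELECT groname FROM pg_group;
```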
- Create three database users with different privileges and add them to the groups. See the following code:
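A sketch of the user creation; finance2 and admin1 are illustrative user names, and the passwords are placeholders to replace with values that meet your cluster's password policy:

```sql
-- Create users and add them to their groups.
CREATE USER finance1 PASSWORD '<YourPassword1>' IN GROUP finance;
CREATE USER finance2 PASSWORD '<YourPassword2>' IN GROUP finance;
CREATE USER admin1   PASSWORD '<YourPassword3>' IN GROUP admin;
```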
Validate that the users have been successfully created. To view a list of users, query the PG_USER catalog table:
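For example:

```sql
-- List all database users.
SELECT usename FROM pg_user;
```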
- Grant SELECT privileges to the FINANCE group and ALL privileges to the ADMIN group for your table AWSBilling201910 in redshift_schema. See the following code:
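A sketch of the grants; the schema USAGE grant is included so that group members can reference objects in the schema:

```sql
-- Allow both groups to reference objects in the schema.
GRANT USAGE ON SCHEMA redshift_schema TO GROUP finance, GROUP admin;

-- FINANCE gets read-only access; ADMIN gets full access to the CUR table.
GRANT SELECT ON TABLE redshift_schema.AWSBilling201910 TO GROUP finance;
GRANT ALL ON TABLE redshift_schema.AWSBilling201910 TO GROUP admin;
```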
You can verify whether database security is enforced correctly. The user finance1 tried to rename the table AWSBilling201910 in redshift_schema, but received a permission denied error message (due to restricted access). The following screenshot shows this scenario and the subsequent error message:
Example S3 inventory analysis
S3 charges are split per bucket. The following query identifies the data storage and transfer costs for each separate S3 bucket:
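A sketch of such a query, using the column names from the abbreviated CUR table definition earlier in this post (adjust them to match your generated table):

```sql
-- Storage and transfer costs per S3 bucket.
SELECT lineitem_resourceid         AS bucket_name,
       lineitem_usagetype,
       sum(lineitem_usageamount)   AS usage_amount,
       sum(lineitem_unblendedcost) AS unblended_cost
FROM redshift_schema.AWSBilling201910
WHERE lineitem_productcode = 'AmazonS3'
GROUP BY lineitem_resourceid, lineitem_usagetype
ORDER BY unblended_cost DESC;
```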
The following screenshot shows the results of executing the above query:
Costs are split by type of storage (for example, Glacier versus standard storage).
The following query identifies S3 data transfer costs (intra-region and inter-region) by S3 storage class (usage amount, unblended cost, blended cost):
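An illustrative version follows, again using the sketched column names. It groups by usage type, which encodes the transfer direction; the LIKE patterns are assumptions and may need adjusting for the usage types present in your report:

```sql
-- Intra-region and inter-region data transfer usage and cost.
SELECT lineitem_usagetype,
       sum(lineitem_usageamount)   AS usage_amount,
       sum(lineitem_unblendedcost) AS unblended_cost,
       sum(lineitem_blendedcost)   AS blended_cost
FROM redshift_schema.AWSBilling201910
WHERE lineitem_productcode = 'AmazonS3'
  AND (lineitem_usagetype LIKE '%DataTransfer%'
       OR lineitem_usagetype LIKE '%-In-Bytes%'
       OR lineitem_usagetype LIKE '%-Out-Bytes%')
GROUP BY lineitem_usagetype
ORDER BY unblended_cost DESC;
```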
The following screenshot shows the result of executing the above query:
The following query identifies S3 fee, API request, and storage charges:
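A sketch of the fee, request, and storage breakdown with the same assumed column names and illustrative usage-type filters:

```sql
-- Fee, API request, and storage charges by operation and usage type.
SELECT lineitem_operation,
       lineitem_usagetype,
       sum(lineitem_usageamount)   AS usage_amount,
       sum(lineitem_unblendedcost) AS unblended_cost
FROM redshift_schema.AWSBilling201910
WHERE lineitem_productcode = 'AmazonS3'
  AND (lineitem_usagetype LIKE '%Requests%'
       OR lineitem_usagetype LIKE '%TimedStorage%'
       OR lineitem_usagetype LIKE '%Fee%')
GROUP BY lineitem_operation, lineitem_usagetype
ORDER BY unblended_cost DESC;
```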
The following screenshot shows the result of executing the above query:
Server access logs sample analysis queries
With the server access logs, you can also split S3 charges per operation type. The following query identifies the data storage and transfer costs for each separate HTTP operation:
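A simplified sketch over the access-log external table alone (using the table and column names assumed earlier in this post) summarizes request counts and bytes transferred per HTTP operation, which you can then relate to the per-bucket transfer charges from the CUR table:

```sql
-- Requests and bytes transferred per HTTP operation, from the access logs.
SELECT operation,
       count(*)                             AS request_count,
       sum(NULLIF(bytes_sent, '-')::bigint) AS total_bytes_sent
FROM s3_external_schema.s3_access_logs
GROUP BY operation
ORDER BY total_bytes_sent DESC;
```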
The following screenshot shows the result of executing the above query:
The following query identifies S3 data transfer costs (intra-region and inter-region) by S3 operation and HTTP status (usage amount, unblended cost, blended cost):
The following screenshot shows the result of executing the above query:
The following query identifies S3 fee, API request, and storage charges:
The following screenshot shows the result of executing the above query:
Overall data flow diagram
The following diagram shows the complete data flow for this solution.
Conclusion
AWS Glue provides an easy and convenient way to automatically discover data stored in your S3 buckets in a cloud-native, secure, and efficient manner. This post demonstrated how to use AWS Glue and Amazon Redshift to analyze your S3 spend using Cost and Usage Reports. You also learned best practices for managing database security in Amazon Redshift through users and groups. Using this framework, you can start analyzing your S3 bucket spend with a few clicks in a matter of minutes on the AWS Management Console!
If you have questions or suggestions, please leave your thoughts in the comments section below.
About the Author
Shayon Sanyal is a Data Architect, Data Lake for Global Financial Services at AWS.