How to attribute Amazon Redshift costs to your end-users

Amazon Redshift is a fast, fully managed cloud data warehouse that makes it simple and cost-effective to analyze all your data using standard SQL and your existing business intelligence (BI) tools. In this two-part series, we discuss how to attribute Amazon Redshift costs at the user and group level.

When using Amazon Redshift, you can choose either of the two pricing models available: On-Demand pricing and Reserved Instance pricing.

On-Demand pricing allows you to pay for capacity by the hour with no commitments and no upfront costs; you simply pay an hourly rate based on the type and number of nodes in your cluster. Partial hours are billed in 1-second increments following a billable status change such as creating, deleting, pausing, or resuming the cluster.

Reserved Instances are appropriate for steady-state production workloads, and offer significant discounts over On-Demand pricing.

At the end of billing cycle, you see an itemized billing of your usage of AWS Services, as in the following screenshot.

itemized billing sample

In addition, you can check AWS Cost Explorer for further details on the cost. For example, the following screenshot shows usage cost for Amazon Redshift per day.

AWS Cost Explorer

These views give you the overall cost of using Amazon Redshift. However, you may need to attribute the cost at the user or group level. For example, what’s the usage cost in Amazon Redshift by the finance business unit?

Use case

Amazon Redshift allows you administer controls to individual objects by users and groups. You can use schemas to group database objects under a common name, which provides a convenient way to manage access, rather than by individual objects. Organizations typically organize related objects in schemas. For example, a finance_schema contains all the related objects related to the finance dataset, and granting access to the finance schema to the finance_group allows only users who are members of the finance_group to access this dataset.

The following diagram illustrates this schema-based setup.

Typically, you can also grant a schema access to multiple groups (teams) or individual users. For example, a finance user might want access to sales data to perform the annual budgeting, or you may have common datasets like the customer information that can be shared by different groups. The following diagram illustrates this setup.

Now, the goal of the cost attribution involves proportional assignment of the overall cost to the individual groups or users.

Cost attribution

At its simplest form, cost attribution can be determined using the amount of the storage assigned to the individual objects using the ownership of the objects to the groups. But the downside of this approach is it doesn’t provide a true translation of the resource usage. For example, let’s say Team 1 has total object size of 1 TB, whereas Team 2 has 100 GB in total size. Team 1 member runs 10 queries daily, and Team 2 runs 1,000 queries per day. Of course, Team 2 uses more resources than Team 1.

The Amazon Redshift RA3 architecture allows you to pay for the compute and data warehouse storage capacity separately, therefore storage doesn’t reflect the resources used by the teams for the cost attribution.

Cost attribution model

The methodology for the cost attribution model has to be translated to the resource used by the user or team. The SQL queries used to create and manipulate database objects, run queries, load tables, and modify the data provide an ideal mechanism to associate the resource of the data warehouse. The following table shows a matrix of possible different query metrics that you can use to associate cost attribution.

Metric	Resource deterministic?		Remarks
Metric	Queries Using Amazon Redshift Local Table	Queries using Amazon Redshift Spectrum	Remarks
Data Scanned	Yes	Yes	Amount of data scanned by the query
CPU Time	Yes	No	CPU time consumed by the query
Storage Used	Yes	No	Storage footprint of the objects used in the query
Number of Runs	Yes	Yes	Number of innovations of a query
Runtime	No	No	Runtime may differ based on the available resources

You can now derive a costing model using these deterministic metrics as follows:

Overall query cost = (query data scan cost * data scan weighted score) + (Query CPU cost * CPU weighted score) + (query run cost * run weighted score) + Redshift Spectrum cost

With the preceding model, you can now associate the query cost per user, which can be rolled up to individual teams (or groups) for cost attribution.

Use Amazon Redshift system tables for cost attribution

Amazon Redshift system tables contain information about how the system is functioning and logs user activities. You can use the following system tables to capture deterministic metrics:

svl_s3query_summary – Shows a summary of all Redshift Spectrum queries (Amazon Simple Storage Service queries) that have been run on the system
stl_wlm_query – Shows the attempted run of a query in a service class handled by WLM
stl_query – Shows run information about a database query
svl_qlog – Shows a log of all queries run against the database
stl_alert_event_log – Shows an alert when the query optimizer identifies conditions that might indicate performance issues

We used these system tables to create the following views, which are available in the GitHub repo:

redshift_spectrum_scan_summary_vw
redshift_query_summary_vw
redshift_query_attribution_vw

We used the following representative query to obtain the metrics that can be used for the cost attribution:

SELECT trim(to_char(rqs.event_date_utc, 'Day')) AS metric_day_of_week
      ,rqs.event_date_utc
      ,rqs.database_name
      ,rqs.queue_name
      ,rqs.db_username
      ,COUNT(1)                                                                                      AS query_count
      ,SUM(rqs.query_cpu_time_secs)                                                                  AS cpu_time_secs
      ,SUM(case when rqs.used_concurrency_scaling = 0 then rqs.query_execution_time_secs else 0 end) AS execution_time_secs
      ,SUM(rqs.query_blocks_read_mb) + SUM(rqs.query_temp_blocks_to_disk_mb)                         AS disk_io_mb
      ,SUM(rqs.rated_spectrum_scan_size_mb)                                                          AS rated_spectrum_scan_size_mb
FROM redshift_query_summary_vw rqs
GROUP BY trim(to_char(rqs.event_date_utc, 'Day'))
        ,rqs.event_date_utc
        ,rqs.database_name
        ,rqs.queue_name
        ,rqs.db_username;

The following table is our sample output (not all columns are shown).

event_date_utc	db_username	query_count	cpu_secs	execution_time_secs	disk_io_mb	rated_spectrum_scan_size_mb
2021-06-23	mia	16	3883	919	26500	0
2021-06-23	ava	3	1757	768	55600	0
2021-06-23	emma	3	3	23	0	0
2021-06-23	steve	21	6449	2167	50000	0
2021-06-25	etl_app_user	16	3943	832	43300	0

Let’s assume that the total Amazon Redshift incurred cost is $100 per day. If we use a simple data scanned model (total_disk_io_mb), we can attribute cost to individual users, as shown in the following table.

event_date_utc	db_username	disk_io_mb	Cost attribution factor = disk_io_mb/total_disk_io_mb	Attribution cost (cost attribution factor * $100)
2021-06-23	mia	26500	0.20	$20
2021-06-23	ava	55600	0.42	$42
2021-06-23	emma	0	0	$0
2021-06-23	steve	50000	0.38	$38
	total_disk_io_mb	132100	Daily Redshift Compute Cost	$100
2021-06-25	etl_app_user	43300	1	$100
	total_disk_io_mb	43300	Daily Redshift Compute Cost	$100

The following query automatically calculates the attribution cost of a query based on the defined cost attribution model, which also includes the spectrum cost (if any):

SELECT TRIM(TO_CHAR(rqa.event_date_utc,'yyyy-mm')) AS metric_month,
       TRIM(TO_CHAR(rqa.event_date_utc,'Day')) AS metric_day_of_week,
       rqa.event_date_utc,
       rqa.database_name,
       rqa.queue_name,
       rqa.db_username,
       MIN(rqa.daily_redshift_compute_cost) AS daily_redshift_compute_cost,
       SUM(rqa.redshift_query_cost) AS total_redshift_query_cost
FROM redshift_query_attribution_vw rqa
GROUP BY TRIM(TO_CHAR(rqa.event_date_utc,'yyyy-mm')),
         TRIM(TO_CHAR(rqa.event_date_utc,'Day')),
         rqa.event_date_utc,
         rqa.database_name,
         rqa.queue_name,
         rqa.db_username;

The following table shows our output (not all columns are shown) for the cost attributed at user level.

metric_month	event_date_utc	database_name	db_username	daily_redshift_compute_cost	total_redshift_query_cost
2021-06	2021-06-23	demo_db	mia	100.00	$20
2021-06	2021-06-23	demo_db	ava	100.00	$42
2021-06	2021-06-23	demo_db	emma	100.00	$0
2021-06	2021-06-23	demo_db	steve	100.00	$38
2021-06	2021-06-25	demo_db	etl_app_user	100.00	$100

To show the compute cost of your own Amazon Redshift cluster, you need to download the redshift_query_attribution_vw view and adjust the numbers on the following columns in the redshift_cluster_node subquery of the view:

price_per_node_per_hour
daily_operation_hour
spectrum_price_per_tb
concurrency_price_per_second
cpu_rated_score
disk_io_rated_score
execution_rated_score
daily_redshift_compute_cost

The sum of the cpu_rated_score, disk_io_rated_score, and execution_rated_score should be equal to 1.

System tables retain approximately 2–5 days of log history, depending on log usage and available disk space. If you want to retain the log data, you need to periodically copy it to other tables or unload it to Amazon S3. You can use the Amazon Redshift System Object Persistence Utility for longer persistence.

Conclusion

Amazon Redshift logs deterministic metrics that you can use to associate resource usage of the cluster to a user or team. You can collect these metrics for a fine-grained cost attribution model to meet your business needs.

You can also automate the reports using the Amazon Redshift scheduling feature or through any of your BI tools. With the cost attribution model, you can easily manage the costs of your Amazon Redshift cluster in a fine-grained fashion and identify options for optimization and scaling.

About the Authors

Jason Pedreza is an Analytics Specialist Solutions Architect at AWS with over 13 years of data warehousing experience. Prior to AWS, he built data warehouse solutions at Amazon.com. He specializes in Amazon Redshift and helps customers build scalable analytic solutions.

Bhanu Pittampally is Analytics Specialist Solutions Architect based out of Dallas. He specializes in building analytical solutions. His background is in data warehouse – architecture, development and administration. He is in data and analytical field for over 13 years.

Thiyagarajan Arumugam is a Principal Solutions Architect at Amazon Web Services and designs customer architectures to process data at scale. Prior to AWS, he built data warehouse solutions at Amazon.com. In his free time, he enjoys all outdoor sports and practices the Indian classical drum mridangam.

AWS Big Data Blog