How do I calculate the query charges in Amazon Redshift Spectrum?


I want to calculate query charges in Amazon Redshift Spectrum when the data is scanned from Amazon Simple Storage Service (Amazon S3). How do I calculate the Redshift Spectrum query usage or cost, and what are some best practices to reduce the charges?

Short description

Prerequisites:

  • An Amazon Redshift cluster.
  • An SQL client that's connected to your cluster to run SQL commands.
  • Both the Amazon Redshift cluster and Amazon S3 bucket must be in the same Region.

With Redshift Spectrum, you can run SQL queries directly against the data in S3, and you are charged for the number of bytes scanned from S3. Depending on the Region, the charge is $5 to $6.25 per terabyte of data scanned. The scanned bytes are rounded up to the next megabyte, with a minimum of 10 MB per query. For more information, see Amazon Redshift pricing.
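
For example, the following expression is a minimal sketch of how the rounding rule translates bytes scanned into billed megabytes. The 4,500,000-byte figure is only an illustrative value:

-- Bytes are rounded up to the next megabyte, with a 10 MB minimum per query.
-- 4,500,000 bytes rounds up to 5 MB (ceil), which is then raised to the 10 MB minimum (greatest).
SELECT greatest(ceil(4500000 / 1048576.0), 10) AS billed_mb;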

Note: There are no charges for Data Definition Language (DDL) statements such as CREATE, ALTER, or DROP TABLE, for statements that manage partitions, or for failed queries.

Resolution

To calculate the estimated query cost (and to obtain a summary of all S3 queries that were run in Redshift Spectrum), use the SVL_S3QUERY_SUMMARY system view. The s3_scanned_bytes column returns the number of bytes scanned from S3 and sent to the Redshift Spectrum layer.

Usage

Run the following query against SVL_S3QUERY_SUMMARY to determine the number of bytes scanned from S3 for a specific query ID:

SELECT s3_scanned_bytes
FROM SVL_S3QUERY_SUMMARY
WHERE query=<queryID>;
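
To also see the text of the query, you can join SVL_S3QUERY_SUMMARY with the STL_QUERY system table. This is a sketch; the sum() aggregation simply guards against the view returning more than one row for the same query:

SELECT s.query, q.querytxt, sum(s.s3_scanned_bytes) AS s3_scanned_bytes
FROM SVL_S3QUERY_SUMMARY s
JOIN STL_QUERY q ON s.query = q.query
WHERE s.query = <queryID>
GROUP BY s.query, q.querytxt;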

To determine the sum of all bytes scanned from S3, use the following query:

SELECT sum(s3_scanned_bytes)
FROM SVL_S3QUERY_SUMMARY;

You can also determine the sum of bytes for all queries from Redshift Spectrum in a specific time interval. The following example shows you how to calculate the sum of all bytes for queries that started running since the previous day:

SELECT sum(s3_scanned_bytes)
FROM SVL_S3QUERY_SUMMARY
WHERE starttime >= current_date-1;
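
If you want a day-by-day breakdown instead of a single total, a sketch like the following groups the scanned bytes by the date on which each query started:

SELECT starttime::date AS scan_date,
       sum(s3_scanned_bytes) AS total_scanned_bytes
FROM SVL_S3QUERY_SUMMARY
GROUP BY starttime::date
ORDER BY scan_date;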

For data in the US East (N. Virginia) Region, Redshift Spectrum charges $5 per terabyte of data scanned. If the sum of s3_scanned_bytes returned by SVL_S3QUERY_SUMMARY is 621,900,000,000 bytes, that converts to 0.565614755032584 terabytes:

621,900,000,000 bytes / 1024 = 607,324,218.75 kilobytes
607,324,218.75 kilobytes / 1024 = 593,090.057373046875 megabytes
593,090.057373046875 megabytes / 1024 = 579.189509153366089 gigabytes
579.189509153366089 gigabytes / 1024 = 0.565614755032584 terabytes

In this example, your usage is approximately 0.5656 terabytes. To calculate the Redshift Spectrum usage cost, multiply that usage by the cost per terabyte:

$5 * 0.5656 = $2.83

You can also use the following SQL query to calculate the Redshift Spectrum usage charges:

SELECT
   round(1.0*sum(s3_scanned_bytes)/1024/1024/1024/1024, 4) AS s3_scanned_tb,
   round(1.0*5*sum(s3_scanned_bytes)/1024/1024/1024/1024, 2) AS cost_in_usd
FROM SVL_S3QUERY_SUMMARY;

This query sums the bytes scanned from your S3 bucket across all Redshift Spectrum queries recorded in SVL_S3QUERY_SUMMARY and multiplies the total by the $5 per terabyte rate. To limit the calculation to data scanned since the previous day, add a starttime filter, as shown below.
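
For example, the following sketch reuses the $5 per terabyte rate but limits the calculation to queries that started since the previous day:

SELECT
   round(1.0*sum(s3_scanned_bytes)/1024/1024/1024/1024, 4) AS s3_scanned_tb,
   round(1.0*5*sum(s3_scanned_bytes)/1024/1024/1024/1024, 2) AS cost_in_usd
FROM SVL_S3QUERY_SUMMARY
WHERE starttime >= current_date-1;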

Note: Queries that scan less than 10 MB are rounded up and charged for 10 MB. There are no charges for failed or aborted queries.

Additionally, system log tables (STL) retain only two to five days of log history, depending on log usage and available disk space. Therefore, it's a best practice to calculate the daily query charges and store them in another table so that you retain a record of the bytes scanned. Here's an example:

CREATE VIEW spectrum_cost AS
SELECT
  s.starttime::date AS date,
  s.xid,
  s.query,
  trim(u.usename) AS username,
  -- Apply the 10 MB minimum per query, then convert bytes to megabytes
  CASE WHEN s3_scanned_bytes < 10000000 THEN 10
       ELSE s3_scanned_bytes/1024/1024 END AS scanned_mb,
  -- Cost at $5 per terabyte ($5/1024/1024 per megabyte)
  round(CASE WHEN s3_scanned_bytes < 10000000 THEN 10*(5.0/1024/1024)
             ELSE (s3_scanned_bytes/1024/1024)*(5.0/1024/1024) END, 5) AS cost_$
FROM svl_s3query_summary s
LEFT JOIN pg_user u ON s.userid = u.usesysid
-- Exclude aborted transactions
JOIN (SELECT xid AS x_xid, max(aborted) AS x_aborted
      FROM svl_qlog GROUP BY xid) q
  ON s.xid = q.x_xid
WHERE s.userid > 1
  AND x_aborted = 0
  AND s.starttime >= current_date-1;

Note: You can also use a CREATE TABLE AS query to calculate and store the data in another table. If you don't want to limit the results to a time period, remove the "AND s.starttime >= current_date-1" condition.
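
For example, the following is a minimal sketch that materializes the view's output in a table; the table name daily_spectrum_cost is only an example:

-- Persist a snapshot of the per-query charges so they survive after the system logs rotate.
CREATE TABLE daily_spectrum_cost AS
SELECT * FROM spectrum_cost;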

To calculate the total amount of data scanned from S3 by Redshift Spectrum since the previous day, and the total estimated charges, run the following query:

SELECT current_date-1 as query_since, SUM(scanned_mb) as total_scanned_mb, SUM(cost_$) as total_cost_$
FROM spectrum_cost;

Result:

  query_since | total_scanned_mb | total_cost_$
--------------+------------------+---------------
 2020-05-15   |            5029  |      0.02515

Redshift Spectrum best practices

To reduce the query charges and to improve Redshift Spectrum's performance, consider the following best practices:

  • Use cost controls for Redshift Spectrum and concurrency scaling features to monitor and control your usage.
  • Use optimized data formats to improve performance and lower costs. Columnar formats such as Apache Parquet and ORC allow Redshift Spectrum to read only the columns that your query needs from S3 (see the sketch after this list).
  • Keep infrequently accessed data in S3 and query it with Redshift Spectrum instead of loading it into the cluster.
  • When using multiple Amazon Redshift clusters to scale concurrency, terminate those clusters as soon as the jobs are complete.
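
As an illustration of the columnar-format recommendation, the following is a sketch of an external table stored as Parquet. The schema name (spectrum_schema), table name, columns, and S3 location are example values, and the external schema must already exist:

-- Columnar storage lets Redshift Spectrum scan only the columns a query references.
CREATE EXTERNAL TABLE spectrum_schema.sales_parquet (
  sale_id   bigint,
  sale_date date,
  amount    decimal(12,2)
)
STORED AS PARQUET
LOCATION 's3://example-bucket/sales/parquet/';

With this layout, a query that selects only sale_date and amount scans just those two columns' data from S3, which directly reduces the bytes billed.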

Cost controls and concurrency scaling for Redshift Spectrum

By using the cost controls and concurrency scaling features for Redshift Spectrum, you can create daily, weekly, and monthly usage limits. When a limit is reached, Amazon Redshift automatically takes the action that you configured for that limit.

To configure the cost control from the Amazon Redshift console, perform the following steps:

1.    Sign in to the AWS Management Console.

2.    Open the Amazon Redshift console.

Note: You can also define and manage usage limits from the AWS Command Line Interface (AWS CLI) or Amazon Redshift API operations.

3.    Choose Configure usage limit.

4.    Update the following configuration settings:

  • Time period (Daily/Weekly/Monthly)
  • Usage limit (TB)
  • Action (Alert/Log to system table/Disable feature)

Note: The action that you choose determines what Amazon Redshift does when the usage limit is reached.

To configure the Concurrency Scaling usage limit, perform the following steps:

1.    Sign in to the AWS Management Console.

2.    Open the Amazon Redshift console.

3.    Choose Concurrency scaling usage limit as your usage limit.

4.    Update the following configuration settings:

  • Time period (Daily/Weekly/Monthly)
  • Usage limit (hh:mm)
  • Action (Alert/Log to system table/Disable feature)

Note: The time period is based on the UTC time zone. For the Alert and Disable feature actions, you can also attach an Amazon Simple Notification Service (Amazon SNS) subscription to the alarm. Additionally, if you activate an alert from the Amazon Redshift console, an Amazon CloudWatch alarm is created automatically for those metrics.

Additional cost control requirements and limitations

When managing your Redshift Spectrum usage and cost, be aware of the following requirements and limitations:

  • Usage limits are available with cluster version 1.0.14677 or later.
  • You can add up to 4 limits and actions per category (8 limits total).
  • Usage limits are supported only in AWS Regions where Redshift Spectrum and concurrency scaling are available.
  • Only one limit per feature can use the Disable action.
  • Usage limits persist until the usage limit definition itself or the cluster is deleted.
  • If you create a limit in the middle of a period, the limit is measured from that point to the end of the defined period.
  • If you choose the log action, review the details in the STL_USAGE_CONTROL system table (see the sketch after this list).
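
For example, the following is a minimal sketch for reviewing recent entries logged when a usage limit with the log action is reached; adjust it to the columns you need:

SELECT *
FROM stl_usage_control
LIMIT 10;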

Related information

Managing usage limits in Amazon Redshift

SVL_QLOG

System tables and views

Best practices for Amazon Redshift Spectrum
