AWS Big Data Blog

Twelve Best Practices for Amazon Redshift Spectrum

Amazon Redshift Spectrum enables you to run Amazon Redshift SQL queries on data that is stored in Amazon S3. With Redshift Spectrum, you can extend the analytic power of Amazon Redshift beyond the data that is stored natively in Amazon Redshift.

Redshift Spectrum offers several capabilities that widen your possible implementation strategies. For example, it expands the data size accessible to Amazon Redshift and enables you to separate compute from storage to enhance processing for mixed-workload use cases.

Redshift Spectrum also increases the interoperability of your data, because you can access the same S3 object from multiple compute platforms beyond Amazon Redshift. Such platforms include Amazon Athena, Amazon EMR with Apache Spark, Amazon EMR with Apache Hive, Presto, and any other compute platform that can access S3. Thus, you can query vast amounts of data in your Amazon S3 data lake without having to go through a tedious and time-consuming extract, transform, and load (ETL) process. You can also join external S3 tables with tables that reside on the cluster’s local disk. Redshift Spectrum applies sophisticated query optimization and scales processing across thousands of nodes to deliver fast performance.

In this blog post, we collect 12 important best practices for Redshift Spectrum and organize them into several functional groups. We base these guidelines on many interactions and considerable direct project work with Amazon Redshift customers.

Before you get started, there are a few setup steps, listed following. For more information about prerequisites and how to get started with Redshift Spectrum, see Getting Started with Amazon Redshift Spectrum in the Amazon Redshift documentation.

Set up the test environment

To perform tests to validate the best practices we outline in this blog post, you can use any dataset. Redshift Spectrum supports many common data formats: text, Parquet, ORC, JSON, Avro, and so on. You can query data in its original format, or convert data to a more efficient format based on your data access patterns, storage requirements, and so on. For example, if you often access only a subset of columns, a columnar format such as Parquet or ORC can greatly reduce I/O by reading only the needed columns. How to convert from one file format to another is beyond the scope of this blog post.

Create an external schema

You can create an external schema named s3_external_schema as follows:

create external schema s3_external_schema 
from data catalog 
database 'spectrumdb' 
iam_role 'arn:aws:iam::<AWS_ACCOUNT_ID>:role/aod-redshift-role'
create external database if not exists;

The Amazon Redshift cluster and the data files in Amazon S3 must be in the same AWS Region. You can create the external database in Amazon Redshift, in the AWS Glue or Athena Data Catalog, or in your own Apache Hive metastore. Your Amazon Redshift cluster needs authorization to access your external data catalog and your data files in Amazon S3. You provide that authorization by referencing an AWS Identity and Access Management (IAM) role (for example, aod-redshift-role) that is attached to your cluster. For more information, see Create an IAM Role for Amazon Redshift in the Amazon Redshift documentation.
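
To confirm that the external schema was created and is mapped to the expected external database, you can query the SVV_EXTERNAL_SCHEMAS system view. The following is a minimal sketch; the columns shown are a subset of what the view exposes.

SELECT schemaname, databasename
FROM svv_external_schemas
WHERE schemaname = 's3_external_schema';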

Define external tables

You can define a partitioned external table using Parquet files and another nonpartitioned external table using comma-separated value (CSV) files as follows:

CREATE  external table s3_external_schema.LINEITEM_PART_PARQ ( 
 L_ORDERKEY BIGINT,
 L_PARTKEY BIGINT,
 L_SUPPKEY BIGINT,
 L_LINENUMBER INT,
 L_QUANTITY DECIMAL(12,2),
 L_EXTENDEDPRICE DECIMAL(12,2),
 L_DISCOUNT DECIMAL(12,2),
 L_TAX DECIMAL(12,2),
 L_RETURNFLAG VARCHAR(128),
 L_LINESTATUS VARCHAR(128),
 L_COMMITDATE DATE,
 L_RECEIPTDATE DATE,
 L_SHIPINSTRUCT VARCHAR(128),
 L_SHIPMODE VARCHAR(128),
 L_COMMENT VARCHAR(128))
partitioned by (L_SHIPDATE DATE)
stored as PARQUET
location 's3://<your-bucket>/<xyz>/lineitem_partition/';

CREATE  external table s3_external_schema.LINEITEM_CSV ( 
 L_ORDERKEY BIGINT,
 L_PARTKEY INT,
 L_SUPPKEY INT,
 L_LINENUMBER INT,
 L_QUANTITY DECIMAL(12,2),
 L_EXTENDEDPRICE DECIMAL(12,2),
 L_DISCOUNT DECIMAL(12,2),
 L_TAX DECIMAL(12,2),
 L_RETURNFLAG VARCHAR(128),
 L_LINESTATUS VARCHAR(128),
 L_SHIPDATE DATE ,
 L_COMMITDATE DATE,
 L_RECEIPTDATE DATE,
 L_SHIPINSTRUCT VARCHAR(128),
 L_SHIPMODE VARCHAR(128),
 L_COMMENT VARCHAR(128))
row format delimited
fields terminated by '|'
stored as textfile
location 's3://<your-bucket>/<xyz>/lineitem_csv/';
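
After the external tables are defined, a quick sanity check is to list them and their registered partitions from the system views. The following is a minimal sketch; for the partitioned table, rows appear in SVV_EXTERNAL_PARTITIONS only after partitions have been added with ALTER TABLE … ADD PARTITION.

SELECT schemaname, tablename, location
FROM svv_external_tables
WHERE schemaname = 's3_external_schema';

SELECT schemaname, tablename, "values", location
FROM svv_external_partitions
WHERE tablename = 'lineitem_part_parq';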

Query data

To recap, Amazon Redshift uses Redshift Spectrum to access external tables stored in Amazon S3. You can query an external table using the same SELECT syntax that you use with other Amazon Redshift tables. Currently, all external tables are read-only.

You must reference the external table in your SELECT statements by prefixing the table name with the schema name, without needing to create and load the table into Amazon Redshift.

If you want to perform your tests using Redshift Spectrum, the following two queries are a good start.

Query 1

SELECT  l_returnflag,
        l_linestatus,
        sum(l_quantity) as sum_qty,
        sum(l_extendedprice) as sum_base_price,
        sum(l_extendedprice*(1-l_discount)) as sum_disc_price,
        sum(l_extendedprice*(1-l_discount)*(1+l_tax)) as sum_charge,
        avg(l_quantity) as avg_qty,
        avg(l_extendedprice) as avg_price
FROM s3_external_schema.LINEITEM_PART_PARQ
WHERE l_shipdate BETWEEN '1998-12-01' AND '1998-12-31'
GROUP BY l_returnflag, l_linestatus
ORDER BY l_returnflag, l_linestatus;

This query accesses only one external table and it can be used to highlight the additional processing power provided by the Redshift Spectrum layer.

Query 2

SELECT   l_orderkey, 
         Sum(l_extendedprice * (1 - l_discount)) AS revenue, 
         o_orderdate, 
         o_shippriority 
FROM     customer, orders, s3_external_schema.lineitem_part_parq 
WHERE    c_mktsegment = 'BUILDING' 
         AND      c_custkey = o_custkey 
         AND      l_orderkey = o_orderkey 
         AND      o_orderdate < date '1995-03-15' 
         AND      l_shipdate >  date '1995-03-15' 
GROUP BY l_orderkey, o_orderdate, o_shippriority 
ORDER BY revenue DESC, o_orderdate 
LIMIT 20;

This query joins three tables: the customer and orders tables are local Amazon Redshift tables, and LINEITEM_PART_PARQ is an external table.

Best practices for concurrency

These recommended practices can help you optimize your concurrent workload performance using Redshift Spectrum.

1. Use Redshift Spectrum to improve concurrent workloads

Redshift Spectrum is a sophisticated execution engine that is invoked as a serverless compute service. The native Amazon Redshift cluster invokes Redshift Spectrum when a SQL query requests data from an external table stored in S3. Redshift Spectrum runs large numbers of requests in parallel against external tables to scan, filter, aggregate, and return rows from S3 back to the Amazon Redshift cluster. All of these operations are performed outside of Amazon Redshift, which reduces the computational load on the Amazon Redshift cluster and improves concurrency. In addition, Redshift Spectrum scales intelligently. Based on the demands of your queries, Redshift Spectrum can potentially use thousands of instances to take advantage of massively parallel processing (MPP).

For some use cases of concurrent scan or aggregate intensive workloads, or both, Redshift Spectrum might perform better than native Amazon Redshift.

The most resource-intensive aspect of any MPP system is the data-load process. This is because it competes with active analytic queries not only for compute resources, but also for locking on the tables through multi-version concurrency control (MVCC). By contrast, you can add new files to an existing external table by writing to Amazon S3, with no resource impact on Amazon Redshift. You can then update the metadata to include the files as new partitions, then access them by using Redshift Spectrum. Thus, you eliminate this data-load process from the Amazon Redshift cluster. This has an immediate and direct positive impact on concurrency.
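
For example, after new Parquet files for a given ship date land in S3, you can register them as a new partition with a metadata-only command. The following is a minimal sketch based on the LINEITEM_PART_PARQ table defined earlier; the partition value and S3 prefix are placeholders for your own layout.

ALTER TABLE s3_external_schema.lineitem_part_parq
ADD IF NOT EXISTS PARTITION (l_shipdate='1998-12-01')
LOCATION 's3://<your-bucket>/<xyz>/lineitem_partition/l_shipdate=1998-12-01/';

Because this statement only updates the external catalog, it doesn’t consume Amazon Redshift cluster resources the way a data load into a local table would.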

2. Use multiple on-demand Amazon Redshift clusters to scale concurrency

Given that Redshift Spectrum operates on data stored in Amazon S3, you can have multiple Amazon Redshift clusters to access these external S3 tables and improve concurrent workload performance. A common Amazon Redshift question is what to do with seasonal spiky, highly concurrent query workloads. Before Redshift Spectrum, to handle the increased concurrency, you often had to spin up multiple read-only Amazon Redshift clusters by restoring from a snapshot. The problem with this approach is that for a large Amazon Redshift data warehouse with hundreds of terabytes of data, the restore process can take a long time. This can result in data latency issues.

With Redshift Spectrum, you can move the largest tables to Amazon S3, and each Amazon Redshift cluster needs to keep only a small amount of data on local disks. Because of the reduction of data volume, it’s much faster to spin up or restore multiple read-only Amazon Redshift clusters to handle these seasonal spiky query workloads. As an illustration, see the diagram following, where multiple read-only Amazon Redshift clusters access the shared Redshift Spectrum layer.

To reduce cost, you should terminate these “on-demand” Amazon Redshift clusters as soon as they have finished the job.

Amazon Redshift customers have been using pgbouncer-rr to simplify and control client query routing when deploying multiple Amazon Redshift clusters to scale concurrency. More details can be found in the AWS Big Data Blog post Query Routing and Rewrite: Introducing pgbouncer-rr for Amazon Redshift and PostgreSQL.

Best practices for storage

For storage optimization, think about reducing the I/O workload at every step. In practice, that means favoring a columnar file format and using compression to fit more records into each storage block. The file formats supported in Redshift Spectrum include CSV, TSV, Parquet, ORC, JSON, Amazon Ion, Avro, RegexSerDe, Grok, RCFile, and Sequence.

A further optimization is to use compression. Currently, Redshift Spectrum supports Gzip, Snappy, LZO, BZ2, and Brotli (only for Parquet).

For files in Parquet, ORC, or text format, or files compressed with the BZ2 codec, Redshift Spectrum might split the processing of large files into multiple requests, which can speed up performance. There is no restriction on file size, but we recommend that you avoid having too many kilobyte-sized files.

For file formats and compression codecs that can’t be split, such as Avro or Gzip, we recommend that you don’t use very large files (that is, files larger than 512 MB). Very large unsplittable files reduce the degree of parallelism. Using a uniform file size across all partitions also helps reduce execution skew.
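
One way to check whether your file sizes are in a reasonable range for a given query is to compare the number of files and splits with the bytes scanned in SVL_S3QUERY_SUMMARY. The following is a rough sketch, assuming the query has already run; an average file size in the kilobyte range usually means the data should be compacted into larger files.

SELECT query,
       files,
       splits,
       s3_scanned_bytes,
       s3_scanned_bytes / NULLIF(files, 0) AS avg_bytes_per_file
FROM   svl_s3query_summary
WHERE  query = <query-id>;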

3. Consider columnar format for performance and cost

Apache Parquet and ORC are columnar storage formats that are available to any project in the Apache Hadoop ecosystem. They’re available regardless of the choice of data processing framework, data model, or programming language. For more details, see the Apache Parquet and Apache ORC documentation.

Redshift Spectrum charges you by the amount of data that is scanned from S3 per query. Because Parquet and ORC store data in a columnar format, Redshift Spectrum reads only the columns needed for the query and avoids scanning the remaining columns, thereby reducing query cost.

Various tests have shown that columnar formats often perform faster and are often more cost-effective than row-based file formats.

You can compare the difference in query performance and cost between queries that process text files and columnar-format files. To do so, you can use SVL_S3QUERY_SUMMARY to gain some insight into some interesting S3 metrics.

SELECT * FROM SVL_S3QUERY_SUMMARY WHERE query = <Query-ID>;

Pay special attention to the metrics s3_scanned_rows and s3query_returned_rows, and s3_scanned_bytes and s3query_returned_bytes. For columnar files, notice the tremendous reduction, compared to CSV files, in the amount of data returned from Redshift Spectrum to native Amazon Redshift for final processing.

When you store data in Parquet and ORC format, you can also optimize by sorting the data. If your data is sorted on frequently filtered columns, the Redshift Spectrum scanner considers the minimum and maximum statistics and skips reading entire row groups. This feature is available for Parquet, and support for ORC is coming.

4. Partition files on frequently filtered columns

If data is partitioned by one or more filtered columns, Redshift Spectrum can take advantage of partition pruning and skip scanning of unneeded partitions and files. A common practice is to partition the data based on time. When you’re deciding on the optimal partition columns, consider the following:

  • Columns that are used as common filters are good candidates.
  • Multilevel partitioning is encouraged if you frequently use more than one predicate. As an example, you can partition based on both SHIPDATE and STORE (see the sketch after this list).
  • Excessively granular partitioning adds time for retrieving partition information. However, it can help in partition pruning and reduce the amount of data scanned from S3. Still, you might want to avoid using a partitioning schema that creates tens of millions of partitions. For example, using second-level granularity might be overkill.
  • Actual performance varies depending on query pattern, number of files in a partition, number of qualified partitions, and so on.
  • Measure and avoid data skew on partitioning columns.
  • Redshift Spectrum supports DATE type in Parquet now. Take advantage of this and use DATE type for fast filtering or partition pruning.
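
As an illustration of multilevel partitioning, the following sketch defines a hypothetical sales table partitioned by both ship date and store. The table name, columns, and S3 paths are illustrative placeholders and are not part of the dataset used elsewhere in this post.

CREATE EXTERNAL TABLE s3_external_schema.sales_part (
 order_id   BIGINT,
 quantity   INT,
 net_amount DECIMAL(12,2))
PARTITIONED BY (shipdate DATE, store VARCHAR(32))
STORED AS PARQUET
LOCATION 's3://<your-bucket>/<xyz>/sales_partition/';

ALTER TABLE s3_external_schema.sales_part
ADD IF NOT EXISTS PARTITION (shipdate='1998-12-01', store='store42')
LOCATION 's3://<your-bucket>/<xyz>/sales_partition/shipdate=1998-12-01/store=store42/';

A query that filters on both shipdate and store can then prune down to a single partition.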

Scanning a partitioned external table can be significantly faster and cheaper than scanning a nonpartitioned external table. To illustrate the benefits of partition pruning, consider creating two external tables: one that is not partitioned, and one that is partitioned at the day level.

How do you know if partition pruning is effective? You can use the following SQL to analyze the effectiveness of partition pruning. If the query touches only a few partitions, you can verify if everything behaves as expected.

SELECT   query, 
         segment, 
         Max(total_partitions)     AS total_partitions, 
         Max(qualified_partitions) AS qualified_partitions 
FROM     svl_s3partition 
WHERE    query=<query-id> 
GROUP BY 1, 2;

You can see that the more restrictive the S3 predicate (on the partitioning column), the more pronounced the effect of partition pruning, and the better the Redshift Spectrum query performance. Amazon Redshift employs both static and dynamic partition pruning for external tables. Query 1 employs static partition pruning—that is, the predicate is placed on the partitioning column l_shipdate. We encourage you to explore another example of a query that uses a join with a small-dimension table (say, Nation or Region) and a filter on a column from the dimension table. Doing this can help you study the effect of dynamic partition pruning.
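
The following is a sketch of such a query. The date_dim table and its columns are hypothetical placeholders for a small local dimension table, not part of the schema defined earlier; because the filter is on the dimension table rather than directly on l_shipdate, Amazon Redshift determines the qualifying partitions at run time.

SELECT d.d_month,
       SUM(l.l_extendedprice * (1 - l.l_discount)) AS revenue
FROM   s3_external_schema.lineitem_part_parq l
JOIN   date_dim d ON l.l_shipdate = d.d_date
WHERE  d.d_year = 1998
GROUP BY d.d_month;

You can rerun the SVL_S3PARTITION query above to compare total_partitions and qualified_partitions and confirm that pruning took effect.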

Best practices for cluster configuration

Here are some recommendations for configuring your Amazon Redshift clusters for optimal performance in Redshift Spectrum.

5. Optimize performance with the right Amazon Redshift cluster configuration

If your queries are bounded by scan and aggregation, request parallelism provided by Redshift Spectrum results in better overall query performance.

To see the request parallelism of a particular Redshift Spectrum query, do the following:

SELECT query, files, splits, avg_request_parallelism FROM SVL_S3QUERY_SUMMARY 
WHERE query= xxxx;

query | files | splits | avg_request_parallelism
------+-------+--------+-------------------------
64577 |   112 |    328 |    5

Here are some factors that affect Amazon S3 request parallelism:

  • The number of splits of all files being scanned (a nonsplittable file counts as one split)
  • The total number of slices across the cluster
  • How many concurrent queries are running

The simple math is as follows. When the total file splits are less than or equal to the avg_request_parallelism value (for example, 10) times total_slices, provisioning a cluster with more nodes might not increase performance.
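
To apply that math, you can count the slices in your cluster and compare the result with the number of splits reported for your query. The following is a minimal sketch using the STV_SLICES and SVL_S3QUERY_SUMMARY system views; the factor of 10 is just the example avg_request_parallelism value mentioned above.

-- Total number of slices across the cluster
SELECT COUNT(*) AS total_slices FROM stv_slices;

-- Splits and parallelism for a given Redshift Spectrum query
SELECT query, files, splits, avg_request_parallelism
FROM   svl_s3query_summary
WHERE  query = <query-id>;

If splits is already less than or equal to total_slices multiplied by the average request parallelism, adding more nodes is unlikely to make that query faster.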

The guidance is to check how many files a Redshift Spectrum table has, and then to measure performance as you scale the cluster. You generally see the following trend: after a certain cluster size (in number of slices), performance plateaus even as the cluster node count continues to increase. The optimal Amazon Redshift cluster size for a given node type is the point beyond which you achieve no further performance gain.

Best practices for when to use Redshift Spectrum

With Redshift Spectrum, you can run Amazon Redshift queries against data stored in an S3 data lake without having to load data into Amazon Redshift at all. Doing this not only reduces the time to insight, but also reduces the data staleness. Under some circumstances, Redshift Spectrum can be a higher performing option.

6. Achieve faster scan- and aggregation-intensive queries with Redshift Spectrum

Thanks to the separation of computation from storage, Redshift Spectrum can scale compute instantly to handle a huge amount of data. This makes it a good fit for heavy scan and aggregation work that doesn’t require shuffling data across nodes.

A few good use cases are these:

  • Huge volume but less frequently accessed data
  • Heavy scan- and aggregation-intensive queries
  • Selective queries that can use partition pruning and predicate pushdown, so the output is fairly small

Certain queries, like Query 1 earlier, don’t have joins. Their performance is usually dominated by physical I/O costs, that is, scan speed. For these queries, Redshift Spectrum might actually be faster than native Amazon Redshift. On the other hand, for queries like Query 2, where multiple table joins are involved, highly optimized native Amazon Redshift tables that use local storage come out the winner.

7. Simplify ETL pipelines

Before Redshift Spectrum, data ingestion to Amazon Redshift could be a multistep process. A common data pipeline includes extract, transform, and load (ETL) processes. You need to clean dirty data, do some transformation, load the data into a staging area, and then load the data into the final table. You might need to use different services for each step and coordinate among them.

Using Redshift Spectrum, you can streamline the complex data engineering process by eliminating the need to load data physically into staging tables.

To do so, create an external schema or table pointing to the raw data stored in S3, or use an AWS Glue or Athena data catalog. Put your transformation logic in a SELECT query, and ingest the result into Amazon Redshift. You can do this all in one single query, with no additional service needed.

CREATE TABLE <Final Amazon Redshift table> AS SELECT … FROM s3_external_schema.xxx WHERE …;
SELECT … INTO <Final Amazon Redshift table> FROM s3_external_schema.xxx WHERE …;
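
As a concrete sketch of this pattern, the following statement filters and lightly transforms raw external data and materializes the result as a local table in one step. The target table name and the transformation are illustrative placeholders based on the LINEITEM_CSV table defined earlier.

CREATE TABLE lineitem_recent_returns AS
SELECT l_orderkey,
       l_partkey,
       l_quantity,
       l_extendedprice * (1 - l_discount) AS net_price,
       l_shipdate
FROM   s3_external_schema.lineitem_csv
WHERE  l_returnflag = 'R'
  AND  l_shipdate >= DATE '1998-01-01';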

Best practices for query performance

You can improve query performance with the suggestions following.

8. Improve Amazon S3 query performance with predicate pushdown

The processing that is done in the Redshift Spectrum layer (the S3 scan, projection, filtering, and aggregation) is independent from any individual Amazon Redshift cluster. In general, any operations that can be pushed down to Redshift Spectrum experience a performance boost because of the powerful infrastructure that supports Redshift Spectrum. Using predicate pushdown also avoids consuming resources in the Amazon Redshift cluster.

You can push many SQL operations down to the Redshift Spectrum layer. We recommend that you take advantage of this wherever possible. The following are some examples of operations you can push down:

  • Common aggregate functions such as COUNT, SUM, AVG, MIN, MAX, GROUP BY clauses, and many others
  • String functions such as regex_replace and to_upper, date functions such as date_trunc, and many others
  • Equality predicates and pattern-matching conditions such as LIKE, as well as IS NULL and CASE WHEN expressions

Examine your query explain plan to verify if predicate pushdown is effective

Take the following two queries as an example:

EXPLAIN SELECT * FROM s3_external_schema.sales WHERE memo LIKE 'return%';
XN S3 Query Scan sales  (cost=0.00..225000000.00 rows=10000000000 width=410)
  ->  S3 Seq Scan s3_external_schema.sales location:"s3://<your-bucket>/sales" format:TEXT  (cost=0.00..125000000.00 rows=10000000000 width=410)
        Filter: ((memo)::text ~~ 'return%'::text)
EXPLAIN SELECT * FROM s3_external_schema.sales WHERE memo ILIKE 'Return%';
XN S3 Query Scan sales  (cost=0.00..225000000.00 rows=512000001 width=410)
  Filter: ((memo)::text ~~* 'Return%'::text)
  ->  S3 Seq Scan s3_external_schema.sales location:"s3://<your-bucket>/sales" format:TEXT  (cost=0.00..100000000.00 rows=10000000000 width=410)

In the first query’s explain plan, the S3 scan filter is pushed down to the Redshift Spectrum layer. Thus, only the matching results are returned to Amazon Redshift for final processing.

On the other hand, look at the second query’s explain plan. Here, there is no predicate pushdown to the Redshift Spectrum layer because of the ILIKE operator. As a result, this query is forced to bring a huge amount of data back from S3 into Amazon Redshift to filter.

We keep improving predicate pushdown, and plan to push down more and more SQL operations over time. For example, ILIKE is now pushed down to Redshift Spectrum in the current Amazon Redshift release.

9. Replace complex DISTINCT operations with GROUP BY in your queries

Amazon Redshift can automatically rewrite simple DISTINCT (single-column) queries during the execution planning step and push them down to Redshift Spectrum.

Certain SQL operations like multiple-column DISTINCT and ORDER BY must be performed in Amazon Redshift because they can’t be pushed down to Redshift Spectrum. If possible, you should rewrite these queries to minimize their use, or avoid using them.

As an example, examine the following two functionally equivalent SQL statements.

  • A query with multiple columns that uses DISTINCT:
    SELECT DISTINCT l_returnflag,l_linestatus 
    FROM 	s3_external_schema.LINEITEM_PART_PARQ 
    WHERE 	EXTRACT(YEAR from l_shipdate::DATE) BETWEEN '<Start-Year>' AND  '<End-Year>' 
    ORDER BY l_returnflag, l_linestatus;
  • An equivalent query that uses GROUP BY:
    SELECT l_returnflag,l_linestatus 
    FROM 	s3_external_schema.LINEITEM_PART_PARQ 
    WHERE EXTRACT(YEAR from l_shipdate::DATE) BETWEEN '<Start-Year>' AND  '<End-Year>' 
    GROUP BY l_returnflag, l_linestatus 
    ORDER BY l_returnflag, l_linestatus;

In the first query, you can’t push the multiple-column DISTINCT operation down to the Redshift Spectrum layer, so a large number of rows is returned to Amazon Redshift to be sorted and deduplicated. In the second query, S3 HashAggregate is pushed down to the Redshift Spectrum layer, where most of the heavy lifting and aggregation is done. You can query the SVL_S3QUERY_SUMMARY system view for these two SQL statements (check the column s3query_returned_rows). You should find a big difference in the number of rows returned from Redshift Spectrum to Amazon Redshift.

The lesson learned is that you should replace DISTINCT with GROUP BY in your SQL statements wherever possible.

Best practices for table placement and statistics

You can improve table placement and statistics with the suggestions following.

10. Determine the best place to store your tables

The following simple guidelines can help you determine the best place to store your tables for optimal performance.

You can access data stored in Amazon Redshift and S3 in the same query. By placing data in the right storage based on access pattern, you can achieve better performance with lower cost:

  • Store data in S3 and use Redshift Spectrum if the data is historical and less frequently accessed.
  • Load data into Amazon Redshift if the data is hot and frequently used. In addition to text, JSON, and Avro, Amazon Redshift supports Parquet and ORC data now.
  • Roll up complex reports on S3 data nightly to small local Amazon Redshift tables. Combine the power of Redshift Spectrum and Amazon Redshift. Use the Redshift Spectrum compute power to do the heavy lifting and materialize the result. Use Amazon Redshift as a result cache to provide faster response to many users.
  • Use a late binding view to integrate an external table and an Amazon Redshift local table if a small part of your data is hot and the rest is cold.
    CREATE VIEW total_sales AS
    SELECT … FROM sales_last_month
    UNION ALL
    SELECT … FROM s3_external_schema.sales_historical
    WITH NO SCHEMA BINDING;

11. Set the table statistics (numRows) manually for S3 external tables

The Amazon Redshift optimizer can use external table statistics to generate more robust execution plans. Without statistics, an execution plan is generated based on heuristics, with the assumption that the S3 table is a relatively large table. Let’s take a look at the explain plan for the following query.

explain select * from s3_external_schema.event_desc ed join s3_external_schema.event e
on e.event_cd = ed.event_cd and e.event_type_cd = ed.event_type;

XN Hash Join DS_BCAST_INNER  (cost=250000000.00..356425000500000000.00 rows=173286210788786592 width=2620)
  Hash Cond: ((("outer".event_cd)::text = ("inner".event_cd)::text) AND (("outer".event_type)::text = ("inner".event_type_cd)::text))
  ->  XN S3 Query Scan e  (cost=0.00..200000000.00 rows=10000000000 width=1556)
        ->  S3 Seq Scan s3_external_schema.event_desc e location:"s3://<your-bucket>/<xyz>/event_desc/" format:PARQUET  (cost=0.00..100000000.00 rows=10000000000 width=1556)
  ->  XN Hash  (cost=200000000.00..200000000.00 rows=10000000000 width=1064)
        ->  XN S3 Query Scan ed  (cost=0.00..200000000.00 rows=10000000000 width=1064)
              ->  S3 Seq Scan s3_external_schema.event ed location:"s3://<your-bucket>/<xyz>/event/" format:PARQUET  (cost=0.00..100000000.00 rows=10000000000 width=1064)

As you can see, the join order is not optimal. How do we fix it?

Although you can’t perform ANALYZE on external tables, you can set the table statistics (numRows) manually with a TABLE PROPERTIES clause in the CREATE EXTERNAL TABLE and ALTER TABLE commands.

ALTER TABLE s3_external_schema.event SET TABLE PROPERTIES ('numRows'='799');
ALTER TABLE s3_external_schema.event_desc SET TABLE PROPERTIES ('numRows'='122857504');

With this piece of information, the Amazon Redshift optimizer can generate more optimal execution plans and complete queries faster.

XN Hash Join DS_BCAST_INNER  (cost=19.98..12696028004.76 rows=170103196 width=2620)
  Hash Cond: ((("outer".event_cd)::text = ("inner".event_cd)::text) AND (("outer".event_type_cd)::text = ("inner".event_type)::text))
  ->  XN S3 Query Scan e  (cost=0.00..2457150.08 rows=122857504 width=1064)
        ->  S3 Seq Scan s3_external_schema.event e location:"s3://<your-bucket>/<xyz>/event/" format:PARQUET  (cost=0.00..1228575.04 rows=122857504 width=1064)
  ->  XN Hash  (cost=15.98..15.98 rows=799 width=1556)
        ->  XN S3 Query Scan ed  (cost=0.00..15.98 rows=799 width=1556)
              ->  S3 Seq Scan s3_external_schema.event_desc ed location:"s3://<your-bucket>/<xyz>/event_desc/" format:PARQUET  (cost=0.00..7.99 rows=799 width=1556)
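
To double-check what the optimizer sees, you can inspect the table properties stored in the external catalog. As a sketch, the parameters column of the SVV_EXTERNAL_TABLES system view typically includes table properties such as numRows.

SELECT tablename, parameters
FROM   svv_external_tables
WHERE  schemaname = 's3_external_schema'
  AND  tablename IN ('event', 'event_desc');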

Best practices for query cost control

You can help control your query costs with the suggestion following.

12. Pay attention to query cost and use query monitoring rules

For a nonselective join, a large amount of data needs to be read to perform the join. Doing this can incur high data transfer costs and network traffic, and result in poor performance and higher than necessary costs.

Are your queries scan-heavy, selective, or join-heavy? To monitor metrics and understand your query pattern, you can use the following query.

SELECT query, 
       Round(elapsed :: FLOAT / 1000 / 1000, 3) AS elapsed_sec, 
       aborted, 
       is_partitioned, 
       s3_scanned_rows, 
       s3query_returned_rows, 
       s3_scanned_bytes, 
       s3query_returned_bytes, 
       files, 
       max_retries, 
       Round(max_request_duration :: FLOAT / 1000 / 1000, 3) AS 
       max_request_dur_sec, 
       Round(avg_request_duration :: FLOAT / 1000 / 1000, 3) AS 
       avg_request_dur_sec, 
       Round(avg_request_parallelism, 2)                     AS 
       avg_request_parallelism 
FROM   svl_s3query_summary 
WHERE  query = xxxx; 

As part of this, do the following:

  • Check the ratio of scanned to returned data and the degree of parallelism
  • Check if your query can take advantage of partition pruning (see best practice 4)

When you know what’s going on, you can set up workload management (WLM) query monitoring rules to stop rogue queries and avoid unexpected costs.

If a user forgets to add a filter or the data is not partitioned properly, a query can accidentally scan a huge amount of data and cause high costs. To set query performance boundaries, use WLM query monitoring rules and take action when a query goes beyond those boundaries. For example, you might set a rule to abort a query when spectrum_scan_size_mb is greater than 20 TB or when spectrum_scan_row_count is greater than 1 billion.

With these and other query monitoring rules, you can terminate the query, hop the query to the next matching queue, or just log it when one or more rules are triggered. For more information, see WLM Query Monitoring Rules in the Amazon Redshift documentation.
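
To review which queries have actually triggered your monitoring rules, and what action was taken, you can query the STL_WLM_RULE_ACTION system table, for example:

SELECT query, service_class, rule, action, recordtime
FROM   stl_wlm_rule_action
ORDER  BY recordtime DESC;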

Conclusion

In this post, we provide some important best practices to improve the performance of Redshift Spectrum. Because each use case is unique, you should evaluate how you can apply these recommendations to your specific situations.

We want to acknowledge our fellow AWS colleagues Bob Strahan, Abhishek Sinha, Maor Kleider, Jenny Chen, Martin Grund, Tony Gibbs, and Derek Young for their comments, insights, and help.

If you have any questions or suggestions, please leave your feedback in the comment section. If you need further assistance in optimizing your Amazon Redshift cluster, contact your AWS account team.


About the authors

Po Hong, PhD, is a Big Data Consultant in the Global Big Data & Analytics Practice of AWS Professional Services.

Peter Dalton is a Principal Consultant in AWS Professional Services.

Updated 10/31/2018 by:

Matt Scaer is a Principal Data Warehousing Specialist Solutions Architect, with over 20 years of data warehousing experience, including 11+ years at AWS and Amazon.com.

Juan Yu is a Data Warehouse Specialist Solutions Architect at AWS.

Ippokratis Pandis is a Principal Software Engineer at AWS, working on Amazon Redshift and Amazon Redshift Spectrum.