AWS Big Data Blog

Top 9 Best Practices for High-Performance ETL Processing Using Amazon Redshift

August 2024: This post was reviewed and updated for accuracy.

An extract, transform, load (ETL) process enables you to load data from source systems into your data warehouse. This is typically executed as a batch or near real-time ingest process to keep the data warehouse current and provide up-to-date analytical data to end users.

Amazon Redshift is a fast, petabyte-scale data warehouse with which you can make data-driven decisions effortlessly. With Amazon Redshift, you can get insights into your big data cost-effectively using standard SQL. You can set up any type of data model, from star and snowflake schemas to simple denormalized tables for running any analytical queries.

To operate a robust ETL platform and deliver data to Amazon Redshift promptly, design your ETL processes to take the Amazon Redshift architecture into account. When migrating from a legacy data warehouse to Amazon Redshift, it is tempting to adopt a lift-and-shift approach, but this can result in performance and scale issues later. This post guides you through the following best practices for optimal, consistent runtimes for your ETL processes:

  1. Use workload management to improve ETL runtimes
  2. Maximize ETL performance with Concurrency Scaling
  3. Perform table maintenance regularly
  4. Use automatic table optimization (ATO)
  5. Maximize the benefits of materialized views in Amazon Redshift
  6. Perform multiple steps in a single transaction
  7. Load data in bulk
  8. Use UNLOAD to extract large result sets
  9. Use Amazon Redshift Spectrum for one-time ETL processing

Use workload management to improve ETL runtimes

As organizations migrate more workloads to Amazon Redshift, an improper workload management (WLM) setup can lead to inconsistent runtime performance for ETL processes. Enabling automatic WLM can help maximize throughput and resource utilization in these scenarios. You can use it to define multiple queues dedicated to different workloads, such as ETL and reporting, so you can better manage the resource allocation and prioritization for these distinct workloads.

Automatic WLM dynamically adjusts query concurrency and memory allocation based on the resource needs of the current workload. It enables the definition of multiple dedicated queues to handle different workloads, such as ETL and reporting. Each queue can be assigned a priority level (for example, HIGHEST, HIGH, NORMAL, LOW, or LOWEST) to reflect the importance of that workload. This priority is then inherited by all queries submitted to that queue, allowing fine-grained control over how system resources are allocated across the organization’s analytical tasks. For information about how to change the priority of queries in a queue, see Query priority.

When configuring automatic WLM in Amazon Redshift, we recommend the following best practices:

  1. Create dedicated queues for your different workloads, for example, an ETL queue.
  2. Assign appropriate priority to determine the order in which the queries are executed to ensure efficient resource utilization. For example, if your ETL jobs need to meet certain Service Level Agreements (SLAs), assign the high or highest priority to the queue. If the reporting queries or any other workloads need to be prioritized, you can change the ETL queue priority to normal, low, or lowest based on your workload characteristics.
  3. Create query monitoring rules (QMRs) as part of your WLM configuration and specify actions to take for queries that exceed them.
  4. Monitor query priority by viewing the query_priority column in the stv_wlm_query_state system view, as shown in the example query after this list.
  5. To set the priority of an automatic WLM queue, use the Workload management page on the Amazon Redshift console, or set it programmatically using the wlm_json_configuration parameter with the AWS Command Line Interface (AWS CLI) or the API. Refer to the documentation on Modifying the WLM configuration.
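The following query is a minimal sketch for checking the priority assigned to queries that are currently running or queued; it uses only the stv_wlm_query_state system table referenced above.

-- Inspect the priority and state of queries currently tracked by WLM
SELECT query, service_class, query_priority, state
FROM stv_wlm_query_state
ORDER BY service_class, query;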

Alternatively, for those who prefer a more hands-on approach and choose to use manual WLM, consider the following during setup:

  1. Take advantage of the dynamic memory parameters, which shift memory from your ETL queue to your reporting queue after the ETL job has completed.
  2. Configure your ETL queue with a small number of slots (five or fewer).
  3. Claim extra memory available in a queue. When executing an ETL query, you can use wlm_query_slot_count to claim the extra memory available in a particular queue. For example, a typical ETL process might involve using the COPY command to copy raw data into a staging table so that downstream ETL jobs can run transformations that calculate daily, weekly, and monthly aggregates. To speed up the COPY process (so that the downstream tasks can start in parallel sooner), increase wlm_query_slot_count for this step, as shown in the sketch after this list.
  4. Create a QMR to define metrics-based performance boundaries for WLM queues and specify what action to take.
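The following is a minimal sketch of this pattern; the slot count, staging table, S3 prefix, and IAM role are placeholders for illustration.

-- Temporarily claim three slots (and their memory) in the current queue for the load
SET wlm_query_slot_count TO 3;

COPY stage_tbl FROM 's3://<<S3 Bucket>>/batch/' iam_role '<<IAM Role ARN>>' GZIP;

-- Return the session to the default of one slot per query
RESET wlm_query_slot_count;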

Maximize ETL performance with Concurrency Scaling

You can boost the efficiency of your data processing and analytics operations by enabling the Concurrency Scaling feature in Amazon Redshift. This powerful capability automatically provisions additional computing resources to handle spikes in concurrent read and write query workloads, ensuring your ETL jobs are processed quickly and consistently.

When your workload experiences a surge in concurrent queries, Concurrency Scaling will automatically scale out the necessary computing power to handle the increased demand, then scale it back down when the demand subsides. This allows you to pay only for the extra capacity when it is actively in use, improving the performance and efficiency of your ETL processes.

To get the most out of Concurrency Scaling for your ETL needs, we recommend the following best practices:

  1. Enable Concurrency Scaling for ETL workloads – Turn on Concurrency Scaling for your high-performance ETL processing, including common write operations like COPY, INSERT, UPDATE, CTAS, and DELETE. This will allow Amazon Redshift to automatically add additional cluster capacity when you need it, streamlining your ETL pipelines and maintaining consistent performance.
  2. Integrate Concurrency Scaling with WLM – Configure your Amazon Redshift WLM queues to take full advantage of the Concurrency Scaling feature. When Concurrency Scaling is enabled for a queue, eligible write queries will be routed to the Concurrency Scaling clusters, which provides efficient resource allocation and optimal throughput for your ETL workloads.
  3. Use Concurrency Scaling credits – Amazon Redshift customers can scale their workloads with minimal cost impact because each cluster earns up to 1 hour of free Concurrency Scaling credits per day. The credit accrual applies to both read and write operations, providing flexibility to scale resources based on overall workload demands, so you can confidently handle dynamic ETL workloads.
  4. Understand Concurrency Scaling limitations – While Concurrency Scaling supports a wide range of write operations, it is important to note that it does not support data definition language (DDL) operations, such as CREATE TABLE or ALTER TABLE. Additionally, Concurrency Scaling for write operations is only available on Amazon Redshift RA3 node types (ra3.16xlarge, ra3.4xlarge, and ra3.xlplus).

When concurrency scaling is enabled for a queue, eligible write queries are routed to concurrency scaling clusters without waiting for resources to become available on the main Redshift cluster. This streamlines ETL pipelines, maintains consistent performance, and optimizes resource utilization, so organizations can handle dynamic ETL workloads with confidence.
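To see how much work is being offloaded and how quickly you are consuming your free daily credits, you can query usage from the system tables. The following is a minimal sketch that assumes the SVCS_CONCURRENCY_SCALING_USAGE system view is available on your cluster.

-- Approximate Concurrency Scaling usage per day, in hours
SELECT TRUNC(end_time) AS usage_day,
       SUM(usage_in_seconds) / 3600.0 AS usage_hours
FROM svcs_concurrency_scaling_usage
GROUP BY TRUNC(end_time)
ORDER BY usage_day;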

Perform table maintenance regularly

To fully use the capabilities of your Amazon Redshift environment and make sure you have predictable, high-performing, and optimized ETL processes, it is essential to maintain database tables through routine operations that reclaim storage space, enhance performance, and update table statistics.

Amazon Redshift simplifies this task by automatically monitoring your workload and performing VACUUM and ANALYZE operations in the background during periods of light activity, minimizing the impact on system performance. The ANALYZE operation updates the statistical metadata used by the query planner so that user queries and daily ETL processes run efficiently while the VACUUM operation re-sorts rows and reclaims space in either a specified table or all tables in the current database.

However, during typical ETL refresh processes, tables may undergo changes (such as removing outdated data using DELETE statements) that can degrade query performance. To address this, Amazon Redshift can automatically sort and perform VACUUM operations on tables in the background. Additionally, you can manually run the VACUUM command at any time or schedule it to run against the entire database or individual tables so that tables are optimized after a load or a series of incremental updates. This makes sure that user queries execute consistently after an ETL process completes.

To determine the optimal timing and tables for manual VACUUM operations, use the following guidelines:

  • Use the vacuum_sort_benefit column in the SVV_TABLE_INFO system view to identify tables that would benefit the most from a manual VACUUM SORT. This metric estimates the potential improvement in scanning and filtering data if the table was fully sorted.
  • Use the stats_off column in the SVV_TABLE_INFO system view. The value of that field indicates the percentage by which the table’s statistics are stale and is a number between 0 and 100. A larger stats_off percentage indicates that the table needs to be analyzed.
  • Combine this information with the unsorted column to pinpoint tables that have a high percentage of unsorted data or outdated statistics, indicating a need for manual VACUUM and ANALYZE. The query after this list shows one way to find these tables.
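The following query is a minimal sketch of this check, using the SVV_TABLE_INFO columns described above; the 10 percent thresholds are illustrative and should be tuned to your workload.

-- Find candidate tables for manual VACUUM and ANALYZE
SELECT "schema", "table", unsorted, stats_off, vacuum_sort_benefit
FROM svv_table_info
WHERE unsorted > 10 OR stats_off > 10
ORDER BY unsorted DESC, stats_off DESC;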

When scheduling manual VACUUM operations, consider the following best practices (a minimal maintenance example follows the list):

  • Run them during periods of minimal cluster activity to avoid impacting your critical workloads
  • Avoid scheduling them during maintenance windows or other high-traffic periods
  • Monitor the impact on your workloads to ensure optimal efficiency and performance
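Once you have identified candidate tables, a manual maintenance run can be as simple as the following sketch; the table name and sort threshold are placeholders for illustration.

-- Re-sort rows, skipping the delete phase, if the table is less than 95 percent sorted
VACUUM SORT ONLY etl_schema.daily_table TO 95 PERCENT;

-- Refresh the planner statistics for the same table
ANALYZE etl_schema.daily_table;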

Use automatic table optimization

As a columnar database, Amazon Redshift is optimized for rapid data transformations and aggregations, making it well-suited to support your mission-critical ETL processes. However, maintaining this high performance requires regular table maintenance, which can be greatly simplified by using the Amazon Redshift automatic table optimization (ATO) capabilities to ensure that your distribution and sort keys are optimal for your workload.

ATO is a self-tuning feature that automatically optimizes table design by applying sort keys, multidimensional data layout sorting, and distribution keys without manual intervention. It continuously observes your queries and uses artificial intelligence (AI)–powered methods to choose the optimal keys that maximize performance for your cluster’s specific workload.

If Amazon Redshift determines that applying a key improves cluster performance, tables are automatically altered within hours from the time the cluster was created, with minimal impact to queries. Multidimensional data layout sorting is a table sorting mechanism that improves performance for repetitive queries. Unlike traditional sorting based on table columns, multidimensional data layout sorting sorts data based on the incoming query filters, significantly accelerating table scan performance. For more information on distribution styles, refer to Working with data distribution styles.

To take full advantage of ATO for your ETL workflows, follow these recommendations:

  1. Enable ATO for new and existing tables:
    1. When creating new tables, use the AUTO option for both sort and distribution keys to enable ATO
    2. For existing tables, use the ALTER TABLE statement with the ALTER SORTKEY AUTO and ALTER DISTSTYLE AUTO options to enable ATO (see the sketch after this list)
  2. Use multidimensional data layouts:
    1. When using ATO or SORTKEY AUTO, Amazon Redshift will automatically determine if the multidimensional data layout sorting mechanism should be used for your tables
    2. Multidimensional data layout sorts data based on the incoming query filters, significantly accelerating the performance of table scans crucial for ETL transformations
  3. Monitor and understand ATO decisions:
    1. Use the EXPLAIN command to understand the impact of the chosen sort keys on query performance
    2. Query the SVV_TABLE_INFO system view to view the sort keys applied to your tables
  4. Selectively disable ATO:
    1. In rare cases, you may need to explicitly set distribution styles or sort keys using ALTER TABLE to override ATO’s decisions
    2. However, only disable ATO after carefully evaluating the impact on your ETL workloads because ATO is designed to continuously optimize table design for your specific usage patterns
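The following is a minimal sketch of both cases; the table names and columns are placeholders for illustration.

-- New table: let ATO choose the distribution style and sort key
CREATE TABLE etl_schema.orders (
    order_id   BIGINT,
    order_date DATE,
    amount     DECIMAL(12,2)
)
DISTSTYLE AUTO
SORTKEY AUTO;

-- Existing table: hand control of the keys back to ATO
ALTER TABLE etl_schema.orders_history ALTER DISTSTYLE AUTO;
ALTER TABLE etl_schema.orders_history ALTER SORTKEY AUTO;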

Maximize the benefits of materialized views in Amazon Redshift

Materialized views in Amazon Redshift store precomputed query results, eliminating the need for recalculation with each query. This significantly boosts performance for complex or frequently accessed analytical queries, including business intelligence (BI) dashboards and extract, load, and transform (ELT) workloads, and results in low latency for analytical queries. Querying these views is like querying regular tables and views but without recomputing the underlying data. Materialized views can be either user-defined or automated:

  1. User-defined – Created using CREATE MATERIALIZED VIEW, these custom precomputed views store results in a compressed columnar format, supporting automatic or manual refresh.
  2. Automated – Amazon Redshift automatically creates and maintains materialized views based on common subquery patterns, transparently rewriting recurring queries to use precomputed results. This machine-learned feature autonomously optimizes performance by caching repeated computations.

To maximize the benefits of materialized views, follow these best practices:

  1. Use the automated materialized views feature from Amazon Redshift, which automatically creates materialized views if it identifies a potential benefit to your query performance.
  2. Review reporting queries to see if they would benefit from a materialized view. If Amazon Redshift has not created a materialized view, refer to Creating materialized views in Amazon Redshift. To optimize your ETL and ELT operations, use the EXPLAIN command in Amazon Redshift to analyze the execution plans of your queries and look for queries that involve complex joins, aggregations, or filtering operations on large datasets that would benefit from a materialized view.
  3. Configure materialized views with the automatic refresh option to keep them up to date for automatic rewriting of queries. For more deterministic refresh behavior, consider using manual refresh as described in REFRESH MATERIALIZED VIEW or scheduled refresh using the Amazon Redshift scheduler API operations or the console. To view your refresh status, use the system view SVL_MV_REFRESH_STATUS. For more information, refer to Refreshing a materialized view. A minimal example follows this list.
  4. Continuously monitor your materialized view usage and performance by using the STV_MV_INFO system view in Amazon Redshift. This view provides insights into all your materialized views, both automated and user-generated, helping you identify optimization opportunities.
  5. Review the metrics in STV_MV_INFO, such as refresh status, size, and row count, to ensure your materialized views are delivering the expected performance benefits and make any necessary adjustments to your materialized view strategy.
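The following is a minimal sketch of a user-defined materialized view with automatic refresh enabled, followed by an on-demand manual refresh; the view name, base table, and aggregation are placeholders for illustration.

-- Precompute a daily aggregate and let Amazon Redshift refresh it automatically
CREATE MATERIALIZED VIEW mv_daily_sales
AUTO REFRESH YES
AS
SELECT order_date, SUM(amount) AS total_amount
FROM etl_schema.orders
GROUP BY order_date;

-- Trigger a deterministic refresh from an ETL job when needed
REFRESH MATERIALIZED VIEW mv_daily_sales;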

Key aspects of materialized views in Amazon Redshift:

  1. Monitoring – You can track the refresh status and last refresh time of a materialized view by querying SVV_MATERIALIZED_VIEW_REFRESH_STATUS (see the example query after this list).
  2. Incremental refresh – Amazon Redshift supports incremental refresh for materialized views defined using external tables referencing Apache Iceberg and AWS Glue. This capability eliminates the need for full refreshes, reducing the time and costs associated with refreshing materialized views for eligible queries.
  3. Data sharing – Amazon Redshift now supports automatic and incremental refresh of materialized views on data-sharing consumer data warehouses when the base tables used for the materialized views are shared data.
  4. Refresh options – Views can be refreshed automatically on a schedule or manually as needed. Automatic background refresh can be temporarily disabled or re-enabled for an automated materialized view. For more information, refer to Automated materialized views.
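To check recent refresh activity, you can query the SVL_MV_REFRESH_STATUS system view mentioned earlier. The following is a minimal sketch; it assumes the view exposes the refresh type, status, and start and end times under these column names.

-- Review the most recent materialized view refreshes and whether they were incremental
SELECT mv_name, refresh_type, status, starttime, endtime
FROM svl_mv_refresh_status
ORDER BY starttime DESC
LIMIT 20;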

Perform multiple steps in a single transaction

ETL transformation logic often spans multiple steps. Because commits in Amazon Redshift are expensive, if each ETL step performs a commit, multiple concurrent ETL processes can take a long time to execute.

To minimize the number of commits in a process, the steps in an ETL script should be surrounded by BEGIN and END statements so that a single commit is performed only after all the transformation logic has been executed. The following multistep ETL script performs one commit at the end:

BEGIN;

-- Temporary staging table that holds intermediate results for this session only
CREATE TEMPORARY TABLE staging_table (...);

INSERT INTO staging_table SELECT .. FROM source;        -- transformation logic

DELETE FROM daily_table WHERE dataset_date = ?;

INSERT INTO daily_table SELECT .. FROM staging_table;   -- daily aggregate

DELETE FROM weekly_table WHERE weekending_date = ?;

INSERT INTO weekly_table SELECT .. FROM staging_table;  -- weekly aggregate

COMMIT;

Load data in bulk

Amazon Redshift is designed to store and query petabyte-scale datasets. Using Amazon Simple Storage Service (Amazon S3), you can stage and accumulate data from multiple source systems before executing a bulk COPY operation. The following methods allow efficient and fast transfer of these bulk datasets into Amazon Redshift:

  • Use a manifest to specify data files when ingesting large datasets that span multiple files. The manifest file is a JSON file that lists all the files to be loaded into Amazon Redshift. When you use a manifest file, Amazon Redshift has a consistent view of the data to be loaded from Amazon S3, and duplicate files do not result in the same data being loaded more than one time.
  • Create temporary staging tables to hold the data for transformation. These tables are automatically dropped after the ETL session is complete. Temporary tables can be created using the CREATE TEMPORARY TABLE syntax, or by issuing a SELECT … INTO #TEMP_TABLE query. Explicitly specifying the CREATE TEMPORARY TABLE statement allows you to control the DISTRIBUTION KEY, SORT KEY, and compression settings to further improve performance.
  • Use ALTER TABLE APPEND to swap data from the staging tables to the target table. Data in the source table is moved to matching columns in the target table. Column order doesn’t matter. After data is successfully appended to the target table, the source table is empty. ALTER TABLE APPEND is much faster than a similar CREATE TABLE AS or INSERT INTO operation because it doesn’t involve copying or moving data. A minimal sketch follows this list.
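The following is a minimal sketch of the staging-and-append pattern; the table names, S3 prefix, and IAM role are placeholders for illustration.

-- Staging table with the same structure as the target
CREATE TABLE stage_daily_table (LIKE daily_table);

-- Bulk load the staged files from Amazon S3
COPY stage_daily_table FROM 's3://<<S3 Bucket>>/batch/2017/07/02/' iam_role '<<IAM Role ARN>>' GZIP;

-- Move the data blocks into the target table; note that this can't run inside a BEGIN ... END block
ALTER TABLE daily_table APPEND FROM stage_daily_table;

DROP TABLE stage_daily_table;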

Use UNLOAD to extract large result sets

Fetching a large number of rows using SELECT is expensive and takes a long time. When a large amount of data is fetched from the Amazon Redshift cluster, the leader node has to hold the data temporarily until the fetches are complete. Further, data is streamed out sequentially, which results in longer elapsed time. As a result, the leader node can become hot, which not only affects the SELECT that is being executed, but also throttles resources for creating execution plans and managing the overall cluster resources. The following image is an example of a large SELECT statement. You can observe that the leader node is doing most of the work to stream out the rows.

Use UNLOAD to extract large results sets directly to Amazon S3. After it’s in Amazon S3, the data can be shared with multiple downstream systems. By default, UNLOAD writes data in parallel to multiple files according to the number of slices in the cluster. All the compute nodes participate to quickly offload the data into Amazon S3.

If you are extracting data for use with Amazon Redshift Spectrum, you should use the MAXFILESIZE parameter so that you don’t have very large files (greater than 512 MB). Similar to step 1 of the example ETL process later in this post, having many evenly sized files ensures that Redshift Spectrum can do the maximum amount of work in parallel, as in the following sketch.
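The following UNLOAD is a minimal sketch; the query, S3 prefix, and IAM role are placeholders for illustration, and 256 MB is an example cap below the 512 MB guideline above.

-- Export in parallel, capping each output file at roughly 256 MB
UNLOAD ('SELECT * FROM weekly_tbl WHERE dataset_week = <<current week>>')
TO 's3://<<S3 Bucket>>/spectrum/weekly/'
iam_role '<<IAM Role ARN>>'
MAXFILESIZE 256 MB;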

Use Redshift Spectrum for one-time ETL processing

With Redshift Spectrum, you can run queries directly against data stored in Amazon S3 without the need to load or copy the data into your Redshift cluster. This capability offers significant advantages for one-time ETL processing of large datasets, especially during events that trigger substantial data volume spikes.

Events such as backfills, promotions, or calendar-driven peaks can lead to surges in data volumes, affecting the data refresh times and performance of your Redshift cluster. To handle these scenarios effectively, consider the following approaches:

  1. Stage the additional data in Amazon S3. Once the data is organized and stored in S3, you can leverage Redshift Spectrum to run queries directly against the data using standard SQL, without the need to load the data into your Redshift cluster.
  2. Leverage prefix-based partitioning to take advantage of Redshift Spectrum’s partition pruning capabilities.
  3. Optimize performance by partitioning your data on the columns that are most frequently used in filters, such as date or other time-related columns. Structuring your partitions around these key filtering criteria lets Redshift Spectrum skip the partitions a query doesn’t need, as shown in the sketch after this list.
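The following is a minimal sketch of a date-partitioned external table for Redshift Spectrum; the schema, table, columns, and S3 locations are placeholders for illustration.

-- Register an external schema backed by the AWS Glue Data Catalog
CREATE EXTERNAL SCHEMA spectrum_schema
FROM DATA CATALOG
DATABASE 'spectrum_db'
iam_role '<<IAM Role ARN>>'
CREATE EXTERNAL DATABASE IF NOT EXISTS;

-- External table partitioned on the most common filter column
CREATE EXTERNAL TABLE spectrum_schema.clickstream (
    user_id BIGINT,
    url     VARCHAR(2048)
)
PARTITIONED BY (event_date DATE)
STORED AS PARQUET
LOCATION 's3://<<S3 Bucket>>/clickstream/';

-- Add each day's partition as new data lands in Amazon S3
ALTER TABLE spectrum_schema.clickstream
ADD IF NOT EXISTS PARTITION (event_date = '2017-07-02')
LOCATION 's3://<<S3 Bucket>>/clickstream/event_date=2017-07-02/';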

Redshift Spectrum employs massive parallelism to query large datasets stored in Amazon S3 with remarkable speed and efficiency. By querying the data directly in Amazon S3, you can gain the benefits of additional processing capacity without the need to resize or scale your Redshift cluster. This approach also allows multiple Redshift clusters to concurrently query the same dataset stored in Amazon S3, eliminating the need to create redundant copies of the data for each cluster.

For tips on getting started with and optimizing the use of Redshift Spectrum, refer to the Amazon Web Services (AWS) documentation Best practices for using Amazon Redshift Spectrum and the AWS Big Data Blog post Best Practices for Amazon Redshift Spectrum.

Example ETL process

This example ETL process reinforces some of the best practices discussed in this post. Consider the following four-step daily ETL workflow:

  1. Extract data from the RDBMS source to an S3 bucket
  2. Stage data to the Amazon Redshift table for cleansing
  3. Transform data to create daily, weekly, and monthly datasets and load into target tables
  4. Unload the daily dataset to populate the S3 data lake bucket

From there, the unloaded datasets can be further processed and made available for end-user reporting using tools such as Redshift Spectrum and Amazon Athena. The following image shows these four steps:

Step 1: Extract data from the RDBMS source to an S3 bucket

In this ETL process, the data extract job fetches change data every hour and stages it in multiple hourly files. For example, the staged Amazon S3 folder looks like the following:

[ec2-user@ip-172-81-1-52 ~]$ aws s3 ls s3://<<S3 Bucket>>/batch/2017/07/02/
2017-07-02 01:59:58   81900220 20170702T01.export.gz
2017-07-02 02:59:56   84926844 20170702T02.export.gz
2017-07-02 03:59:54   78990356 20170702T03.export.gz
…
2017-07-02 22:00:03   75966745 20170702T21.export.gz
2017-07-02 23:00:02   89199874 20170702T22.export.gz
2017-07-02 00:59:59   71161715 20170702T23.export.gz

Organizing the data into multiple, evenly sized files enables the COPY command to ingest this data using all available resources in the Redshift cluster. The files are also compressed (using GZIP format) to further reduce COPY times.

Step 2: Stage data to the Amazon Redshift table for cleansing

Ingesting the data can be accomplished using a JSON-based manifest file. Using the manifest file ensures that Amazon Redshift loads exactly the intended set of files and also provides an opportunity to deduplicate files if needed. A sample manifest file, manifest20170702.json, looks like the following:

{
  "entries": [
    {"url":"s3://<<S3 Bucket>>/batch/2017/07/02/20170702T01.export.gz", "mandatory":true},
    {"url":"s3://<<S3 Bucket>>/batch/2017/07/02/20170702T02.export.gz", "mandatory":true},
    …
    {"url":"s3://<<S3 Bucket>>/batch/2017/07/02/20170702T23.export.gz", "mandatory":true}
  ]
}

The data can be ingested using the following command:

SET wlm_query_slot_count TO <<max available concurrency in the ETL queue>>;

COPY stage_tbl FROM 's3://<<S3 Bucket>>/batch/manifest20170702.json' iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole' manifest;

Because the downstream ETL processes depend on this COPY command to complete, the wlm_query_slot_count is used to claim all the memory available to the queue. This helps the COPY command complete as quickly as possible.

Step 3: Transform data to create daily, weekly, and monthly datasets and load into target tables

Data is staged in the stage_tbl from where it can be transformed into the daily, weekly, and monthly aggregates and loaded into target tables. The following job illustrates a typical weekly process. Multiple steps are combined into one transaction to perform a single commit, reducing contention on the commit queue.

BEGIN;

INSERT INTO ETL_LOG (..) VALUES (..);

DELETE FROM weekly_tbl WHERE dataset_week = <<current week>>;

INSERT INTO weekly_tbl (..)
SELECT date_trunc('week', dataset_day) AS week_begin_dataset_date,
       SUM(C1) AS C1,
       SUM(C2) AS C2
FROM stage_tbl
GROUP BY date_trunc('week', dataset_day);

INSERT INTO AUDIT_LOG VALUES (..);

COMMIT;

Step 4: Unload the daily dataset to populate the S3 data lake bucket

The transformed results are now unloaded into another S3 bucket using the UNLOAD command, where they can be further processed and made available for end user reporting using tools such as Redshift Spectrum and Amazon Athena.

UNLOAD ('SELECT * FROM weekly_tbl WHERE dataset_week = <<current week>>') TO 's3://<<S3 Bucket>>/datalake/weekly/20170526/' iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole';

Summary

Amazon Redshift makes it possible to operate petabyte-scale data warehouses in the cloud. This post outlined best practices for executing scalable ETL within Amazon Redshift, showcasing efficient data ingestion, transformation, and monitoring methods. It demonstrated these practices through a sample ETL workload, illustrating optimal data transformation and loading into Amazon Redshift. If you have questions or suggestions, please leave a comment.

About the Authors

Thiyagarajan Arumugam is a Big Data Solutions Architect at Amazon Web Services and designs customer architectures to process data at scale. Prior to working at AWS, he built data warehouse solutions at Amazon.com. In his free time, he enjoys all outdoor sports and practices the Indian classical drum mridangam.

Kiki Nwangwu is an AWS Associate Specialist Solutions Architect. She specializes in helping customers architect, build, and modernize scalable data analytics solutions.


Audit History

Last reviewed and updated in August 2024 by Kiki Nwangwu | Associate Specialist Solutions Architect