AWS Big Data Blog

Amazon Redshift Engineering’s Advanced Table Design Playbook: Preamble, Prerequisites, and Prioritization

 

Part 1: Preamble, Prerequisites, and Prioritization
Part 2: Distribution Styles and Distribution Keys
Part 3: Compound and Interleaved Sort Keys
Part 4: Compression Encodings
Part 5: Table Data Durability


Amazon Redshift is a fully managed, petabyte-scale, massively parallel data warehouse that offers simple operations and high performance. AWS customers use Amazon Redshift for everything from accelerating existing database environments that are struggling to scale, to ingesting web logs for big data analytics. Amazon Redshift provides an industry-standard JDBC/ODBC driver interface, which allows connections from existing business intelligence tools and reuse of existing analytics queries.

With Amazon Redshift, you can implement any data model that’s standard throughout the industry. Whether your data model is third normal form (3NF), star, snowflake, denormalized flat tables, or a combination of these, you can use Amazon Redshift’s unique table properties to keep your complex analytical workloads performing well over multipetabyte data sets.

In practice, I find that the best way to improve query performance by orders of magnitude is to tune Amazon Redshift tables to better meet your workload requirements. This five-part blog series will guide you through applying distribution styles, sort keys, and compression encodings, and through configuring tables for data durability and recovery purposes. I’ll offer concrete guidance on how to properly work with each property for your use case.

Prerequisites

If you’re working with an existing Amazon Redshift workload, the Amazon Redshift system tables can help you determine the optimal configurations. Querying these tables for the complete dataset requires cluster access as a privileged superuser. You can check whether your user is privileged by looking at the usesuper column in the following query’s result set:

root@redshift/dev=# SELECT usename, usesysid, usesuper FROM pg_user WHERE usename=current_user;
 usename | usesysid | usesuper
---------+----------+----------
 root    |      100 | t
(1 row)

In Amazon Redshift, changing most table or column properties requires a table rebuild. To reduce the time spent rebuilding, identify all of the necessary changes up front so that only a single rebuild is needed. Once you’ve identified the changes, you can use the v_generate_tbl_ddl view from our amazon-redshift-utils repository to generate the existing DDL, which you can then modify to implement those changes.
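For example, here’s a minimal sketch of generating the existing DDL for a single table, assuming you’ve created v_generate_tbl_ddl in an “admin” schema (the seq column preserves statement ordering):

-- Generate the existing DDL for one table (assumes the
-- v_generate_tbl_ddl view from amazon-redshift-utils was
-- created in the "admin" schema)
SELECT ddl
FROM admin.v_generate_tbl_ddl
WHERE schemaname='bi'
AND tablename='param_tbl_chriz_header'
ORDER BY seq;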

I’ve also extended the system view SVV_TABLE_INFO with a new view, named v_extended_table_info, which offers an extended output that makes schema and workload reviews much more efficient. I’ll refer to the result set returned by querying this view throughout the series, so I’d recommend that you create the view in the Amazon Redshift cluster database you’re optimizing.
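The view definition is available in the AdminViews directory of the amazon-redshift-utils repository. As a minimal setup sketch (the scenario queries later in this post assume the view lives in an “admin” schema):

-- One-time setup: create a schema to hold the administrative views,
-- then run the CREATE VIEW statement from v_extended_table_info.sql
-- in amazon-redshift-utils
CREATE SCHEMA IF NOT EXISTS admin;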

For the sake of brevity throughout these topics, I’ll refer to tables by their object ID (OID). You can get this OID in one of several ways:

root@redshift/dev=# SELECT 'bi.param_tbl_chriz_header'::regclass::oid;
  oid
--------
 108342
(1 row)

root@redshift/dev=# SELECT oid, relname FROM pg_class 
WHERE relname='param_tbl_chriz_header';
  oid   |        relname
--------+------------------------
 108342 | param_tbl_chriz_header
(1 row)

root@redshift/dev=# SELECT table_id, "table" FROM svv_table_info 
WHERE "table"='param_tbl_chriz_header';
 table_id |         table
----------+------------------------
   108342 | param_tbl_chriz_header
(1 row)

root@redshift/dev=# SELECT DISTINCT id FROM stv_tbl_perm 
WHERE name='param_tbl_chriz_header';
   id
--------
 108342
(1 row)
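Whichever method you use, the OID plugs directly into the views described earlier. For example, a quick single-table review with the extended view (output omitted here):

-- Review extended table information for one table by its OID
SELECT * FROM admin.v_extended_table_info WHERE table_id = 108342;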

Prioritization

This series walks you through a number of processes that you can implement on a table-by-table basis. It’s not unusual for clusters that serve multiple disparate workloads to have thousands of tables. Because your time is finite, you’ll want to prioritize optimizations against the tables that are most significant to the workload, to deliver a meaningful improvement to the overall cluster performance.

If you’re a direct end user of the Amazon Redshift cluster, or if you have well-established communication with end users, then it might already be obvious where you should start optimizing. Perhaps end users are reporting cluster slowness for specific reports, which would highlight tables that need optimization.

If you lack intrinsic knowledge of the environment you’re planning to optimize, the scenario might not be as clear. For example, suppose one of the following is true:

  • You’re an external consultant, engaged to optimize an unknown workload for a new client.
  • You’re an Amazon Redshift subject matter expert within your circles, and you’re often approached for guidance regarding Amazon Redshift resources that you didn’t design or implement.
  • You’ve inherited operational ownership of an existing Amazon Redshift cluster and are unfamiliar with the workloads or issues.

Regardless of your particular scenario, it’s invaluable to begin any optimization by first determining how best to spend your time.

I’ve found that scan frequency and table size are the two metrics most relevant to estimating a table’s significance. The following SQL code helps identify a list of tables relevant to each given optimization scenario, based on characteristics of the recent historical workload. Each of these result sets is ordered by scan frequency, with the most frequently scanned tables first.

Scenario: “There are no specific reports of slowness, but I want to ensure I’m getting the most out of my cluster by performing a review on all tables.”

-- Returns table information for all scanned tables
SELECT * FROM admin.v_extended_table_info 
WHERE table_id IN (
  SELECT DISTINCT tbl FROM stl_scan WHERE type=2 
)
ORDER BY SPLIT_PART("scans:rr:filt:sel:del",':',1)::int DESC, 
  size DESC; 
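A note on the ORDER BY clause used in this and the following queries: it extracts the scan count, which is the first field of the view’s colon-delimited “scans:rr:filt:sel:del” metrics column, so the most frequently scanned tables sort first. As a quick illustration with a made-up value:

-- SPLIT_PART returns the Nth colon-delimited field; here, the
-- scan count ('277' is an illustrative value, not real output)
SELECT SPLIT_PART('277:5:2:1:0', ':', 1)::int;  -- returns 277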

Scenario: “The query with ID 4941313 is slow.”

-- Returns table information for all tables scanned by query 4941313
SELECT * FROM admin.v_extended_table_info 
WHERE table_id IN (
  SELECT DISTINCT tbl FROM stl_scan WHERE type=2 AND query = 4941313
) 
ORDER BY SPLIT_PART("scans:rr:filt:sel:del",':',1)::int DESC, 
  size DESC; 

Scenario: “The queries running in transaction with XID=23200 are slow.”

-- Returns table information for all tables scanned within xid 23200
SELECT * FROM admin.v_extended_table_info 
WHERE table_id IN (
  SELECT DISTINCT tbl FROM stl_scan 
  WHERE type=2 
  AND query IN (SELECT query FROM stl_query WHERE xid=23200)
) 
ORDER BY SPLIT_PART("scans:rr:filt:sel:del",':',1)::int DESC, 
  size DESC; 

Scenario: “Our ETL workload running between 02:00 and 04:00 UTC is exceeding our SLAs.”

-- Returns table information for all tables scanned by "etl_user" 
-- between 02:00 and 04:00 on 2016-09-09
SELECT * FROM admin.v_extended_table_info 
WHERE table_id IN (
  SELECT DISTINCT tbl FROM stl_scan 
  WHERE type=2 
  AND query IN (
    SELECT q.query FROM stl_query q
    JOIN pg_user u ON u.usesysid=q.userid
    WHERE u.usename='etl_user' 
    AND starttime BETWEEN '2016-09-09 02:00' AND '2016-09-09 04:00')
) 
ORDER BY SPLIT_PART("scans:rr:filt:sel:del",':',1)::int DESC, 
  size DESC; 

Scenario: “Our reporting workload on tables in the ‘sales’ schema is slow.”

-- Returns table information for all tables scanned by queries 
-- from "reporting_user" which scanned tables in the "sales" schema 
SELECT * FROM admin.v_extended_table_info 
WHERE table_id IN (
  SELECT DISTINCT tbl FROM stl_scan 
  WHERE type=2 AND query IN (
    SELECT DISTINCT s.query FROM stl_scan s
    JOIN pg_user u ON u.usesysid = s.userid 
    WHERE s.type=2 AND u.usename='reporting_user' AND s.tbl IN (
      SELECT c.oid FROM pg_class c 
      JOIN pg_namespace n ON n.oid = c.relnamespace 
      WHERE nspname='sales'
    )
  )
)
ORDER BY SPLIT_PART("scans:rr:filt:sel:del",':',1)::int DESC, 
  size DESC; 

Scenario: “Our dashboard queries need to be optimized.”

-- Returns table information for all tables scanned by queries 
-- from "dashboard_user"
SELECT * FROM admin.v_extended_table_info 
WHERE table_id IN (
  SELECT DISTINCT s.tbl FROM stl_scan s
  JOIN pg_user u ON u.usesysid = s.userid 
  WHERE s.type=2 AND u.usename='dashboard_user'
)
ORDER BY SPLIT_PART("scans:rr:filt:sel:del",':',1)::int DESC, 
  size DESC; 

Now that we’ve identified which tables should be prioritized for optimization, we can begin. The next blog post in the series will discuss distribution styles and keys.




About the author


Zach Christopherson is a Palo Alto-based Senior Database Engineer at AWS.
He assists Amazon Redshift users from all industries in fine-tuning their workloads for optimal performance. As a member of the Amazon Redshift service team, he also influences and contributes to the development of new and existing service features. In his spare time, he enjoys trying new restaurants with his wife, Mary, and caring for his newborn daughter, Sophia.


Related

Top 10 Performance Tuning Techniques for Amazon Redshift (Updated Nov. 28, 2016)
