AWS Big Data Blog
Amazon Redshift Engineering’s Advanced Table Design Playbook: Table Data Durability
Part 1: Preamble, Prerequisites, and Prioritization
Part 2: Distribution Styles and Distribution Keys
Part 3: Compound and Interleaved Sort Keys
Part 4: Compression Encodings
Part 5: Table Data Durability
In the fifth and final installment of the Advanced Table Design Playbook, I’ll discuss how to use two simple table durability properties to squeeze even more performance out of your Amazon Redshift configuration.
Limiting Automated Backups
With automated backups, Amazon Redshift frequently pushes incremental block changes to Amazon S3. These backups are triggered when a threshold of elapsed time or changed blocks is reached. This approach makes possible recovery features such as snapshot restore and table-level restore operations.
The cost of an automated backup is generally insignificant. However, because the backup transfers data blocks to S3, it can affect your cluster's throughput to S3 for workloads that constantly perform large COPY or UNLOAD operations. To reduce this cost, we can use the BACKUP NO table property for tables that don't contain data that needs to be stored durably in S3. Examples of tables where you can use this property include:
- Tables that are populated with easily recreated data
- Tables containing data that is transient or often changing in nature
- Tables that are populated with unimportant or experimental data
We use this property to achieve two things:
- Preventing inconsequential block changes from triggering frequent automated backups
- Reducing the quantity of blocks transferred to S3 during backups, which shortens backup duration
Note: Tables defined as temporary inherit this property by default, because these tables are cleaned up at the end of the session in which they were created.
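As an illustration, here is a minimal DDL sketch that applies BACKUP NO to a staging table. The table name, columns, and distribution style (staging_clicks, click_id, and so on) are hypothetical examples, not from this post:

-- Staging table whose contents can easily be reloaded from the source system,
-- so its blocks are excluded from automated backups to Amazon S3.
CREATE TABLE staging_clicks (
    click_id   BIGINT,
    click_time TIMESTAMP,
    page_url   VARCHAR(2048)
)
BACKUP NO
DISTSTYLE EVEN;

Keep in mind that data in a table created with BACKUP NO is not included in snapshots, so it is not restored when you restore the cluster from a snapshot.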
Reducing Synchronous Data Replication
By default, multi-node Amazon Redshift configurations automatically store redundant copies of table data on other nodes in the same cluster, so the cluster can tolerate multiple disk and node failures. As with all synchronous data replication, some overhead is associated with the process. For temporary tables, this replication doesn't occur, because the system assumes the data is truly transient. Additionally, blocks associated with temporary tables don't count toward the threshold that triggers automated backups. Thus, by using temporary tables for truly transient data, we can also avoid triggering backup processes and improve the rate at which the automated backup processes complete, just as we did with the BACKUP NO option.
Because temporary table data is not mirrored, transactions that commit temporary table data are less expensive than those that commit permanent table data. This means you can significantly reduce cluster overhead by using temporary tables where they are appropriate, for example, when table data only needs to be accessible for a single session and is truly transient in nature.
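For example, here is a minimal sketch of a temporary table used for a single-session aggregation. The table and column names (session_totals, sales, user_id, amount) are hypothetical:

-- Temporary table: its data is not replicated to other nodes, its blocks don't
-- count toward the automated backup threshold, and it is dropped at session end.
CREATE TEMP TABLE session_totals AS
SELECT user_id, SUM(amount) AS total_amount
FROM sales
GROUP BY user_id;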
You can identify the tables in a cluster that are currently using either of these table properties by using this query:
SELECT DISTINCT id, name, temp, backup
FROM stv_tbl_perm
WHERE backup=0 ORDER BY 1;
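In the output, a temp value of 1 indicates a temporary table, and a backup value of 0 indicates that the table's blocks are excluded from automated backups, either because the table was created with BACKUP NO or because it is temporary.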
Next Steps
Depending on your unique data durability and retention requirements, these options might not be appropriate for you—and that’s okay. Out of the topics we’ve discussed throughout this blog series, these particular optimizations result in the least significant overall improvement to cluster performance. However, in an environment that is being optimized from top to bottom, this is the final table design concept you can use for a performance gain.
Amazon Redshift Engineering’s Advanced Table Design Playbook
Part 1: Preamble, Prerequisites, and Prioritization
Part 2: Distribution Styles and Distribution Keys
Part 3: Compound and Interleaved Sort Keys
Part 4: Compression Encodings
Part 5: Table Data Durability
About the author
Zach Christopherson is a Palo Alto based Senior Database Engineer at AWS. He assists Amazon Redshift users from all industries in fine-tuning their workloads for optimal performance. As a member of the Amazon Redshift service team, he also influences and contributes to the development of new and existing service features. In his spare time, he enjoys trying new restaurants with his wife, Mary, and caring for his newborn daughter, Sophia.
Related
Top 10 Performance Tuning Techniques for Amazon Redshift (Updated Nov. 28, 2016)