How can I troubleshoot high or full disk usage with Amazon Redshift?
Last updated: 2020-10-15
I'm experiencing high or full disk utilization on Amazon Redshift. How can I troubleshoot issues that cause high disk usage?
High disk usage errors can depend on several factors, including:
- Distribution and sort key
- Query processing
- Tables with VARCHAR(MAX) columns
- High column compression
- Maintenance operations
- Cartesian products with cross-joins
- Minimum table size
- Tombstone blocks
- Copying a large file
Distribution and sort key
Review the table's distribution style, distribution key, and sort key selection. A full disk node can be caused by tables that have distribution skew where more data is located in one node than the others. If you have tables with skewed distribution styles, change the distribution style to a more uniform distribution. Note that distribution and row skew can impact storage skew and intermediate rowset when a query is running. For more information about distribution keys and sort keys, see Amazon Redshift engineering’s advanced table design playbook: preamble, prerequisites, and prioritization.
To determine the cardinality of your distribution key, run the following query:
select distkey, count(*) from public.distribution_skew group by distkey having count(*) > 1 order by 2 desc;
Note: To avoid a sort step, use SORT KEY columns in your ORDER BY clause. A sort step can use excessive memory, causing a disk spill. For more information, see Choosing sort keys.
To see how database blocks in a distribution key are mapped to a cluster, use the Amazon Redshift table_inspector.sql utility.
Review any memory allocated to a query. While a query is processing, intermediate query results can be stored in temporary blocks. If there isn't enough free memory, then the tables cause a disk spill. Intermediate result sets aren't compressed, which affects the available disk space. For more information, see Insufficient memory allocated to the query.
Amazon Redshift defaults to a table structure with even distribution and no column encoding for temporary tables. But if you are using SELECT...INTO syntax, use a CREATE statement. For more information, see Top 10 performance tuning techniques for Amazon Redshift, and follow the instructions under Tip #6: Address the inefficient use of temporary tables.
If insufficient memory is allocated to your query, you might see a step in SVL_QUERY_SUMMARY where is_diskbased shows the value "true". To resolve this issue, increase the number of query slots to allocate more memory to the query. For more information about how to temporarily increase the slots for a query, see wlm_query_slot_count or tune your WLM to run mixed workloads. You can also use WLM query monitoring rules to counter heavy processing loads and to identify I/O intensive queries.
Tables with VARCHAR(MAX) columns
Check VARCHAR or CHARACTER VARYING columns for trailing blanks that might be omitted when data is stored on the disk. During query processing, trailing blanks can occupy the full length in memory (the maximum value for VARCHAR is 65535). It's a best practice to use the smallest possible column size.
To generate a list of tables with maximum column widths, run the following query:
SELECT database, schema || '.' || "table" AS "table", max_varchar FROM svv_table_info WHERE max_varchar > 150 ORDER BY 2;
To identify and display the true widths of the wide VARCHAR table columns, run the following query:
SELECT max(octet_length (rtrim(column_name))) FROM table_name;
For more information about table design, review the Amazon Redshift best practices for designing tables.
High column compression
Encode all columns (except sort key) using the ANALYZE COMPRESSION or Amazon Redshift column encoding utility for optimal column encoding. Amazon Redshift provides column encoding, which can increase read performance while reducing overall storage consumption. It's a best practice to use the system compression feature.
Be sure that the database tables in your Amazon Redshift Database are regularly analyzed and vacuumed. One way to maintain the health of your database is to identify any missing or outdated stats. This prevents Amazon Redshift from scanning any unnecessary table rows, and also helps to optimize your query processing.
Note: Maintenance operations such as VACUUM and DEEP COPY use temporary storage space for their sort operations, so a spike in disk usage is expected.
For example, the following query helps you identify outdated stats in Amazon Redshift:
SELECT * FROM svv_table_info WHERE stats_off > 10 ORDER BY size DESC;
For more information, see the Amazon Redshift Analyze & Vacuum schema utility.
Cartesian products with cross-joins
Use the EXPLAIN plan of the query to look for queries with Cartesian products. Cartesian products are cross-joins that are unrelated and can produce an increased number of blocks. These cross-joins can result in higher memory utilization and more tables spilled to disk. If cross-joins don't share a JOIN condition, then the joins produce a Cartesian product of two tables. Every row of one table is then joined to every row of the other table.
Cross-joins can also be run as nested loop joins, which take the longest time to process. Nested loop joins result in spikes in overall disk usage. For more information, see Identifying queries with nested loops.
Minimum table size
The same table can have different sizes in different clusters. The minimum table size is then determined by the number of columns and whether the table has a SORTKEY and number of slices populated. If you recently resized an Amazon Redshift cluster, you might see a change in your overall disk storage. This is caused by the change in number of slices. Amazon Redshift also counts the table segments that are used by each table. For more information, see Why does a table in an Amazon Redshift cluster consume more or less disk storage space than expected?
Tombstone blocks are generated when a WRITE transaction to an Amazon Redshift table occurs and there is a concurrent Read. Amazon Redshift keeps the blocks before the write operation to keep a concurrent Read operation consistent. Amazon Redshift blocks can't be changed. Every Insert, Update, or Delete action creates a new set of blocks, marking the old blocks as tombstoned.
Sometimes tombstones fail to clear at the commit stage because of long-running table transactions. Tombstones can also fail to clear when there are too many ETL loads running at the same time. Because Amazon Redshift monitors the database from the time that the transaction starts, any table written to the database also retains the tombstone blocks. If long-running table transactions occur regularly and across several loads, enough tombstones can accumulate to result in a Disk Full error.
You can also force Amazon Redshift to perform the analysis regarding tombstone blocks by performing a commit command.
If there are long-running queries that are active, then terminate the queries (and release all subsequent blocks) using the commit command:
begin; create table a (id int); insert into a values(1); commit; drop table a;
To confirm tombstone blocks, run the following query:
select trim(name) as tablename, count(case when tombstone > 0 then 1 else null end) as tombstones from svv_diskusage group by 1 having count(case when tombstone > 0 then 1 else null end) > 0 order by 2 desc;
Copying a large file
During a COPY operation, you might receive a Disk Full error even if there is enough storage available. This error occurs if the sorting operation spills to disk, creating temporary blocks.
If you encounter a Disk Full error message, then check the STL_DISK_FULL_DIAG table:
select '2000-01-01'::timestamp + (currenttime/1000000.0)* interval '1 second' as currenttime,node_num,query_id,temp_blocks from pg_catalog.stl_disk_full_diag;
For more best practices, see Amazon Redshift best practices for loading data.
Check the percentage of disk space under the Performance tab in the Amazon Redshift console. For each cluster node, Amazon Redshift provides extra disk space, which is larger than the nominal disk capacity.
If you notice a sudden spike in utilization, use the STL_QUERY to identify the activities and jobs that are running:
select * from stl_query where starttime between '2018-01-01 00:30:00' and '2018-01-01 00:40:00';
Note: Update the values with the time when the spike occurred.
To identify the top 20 disk spill queries, run the following query:
select A.userid, A.query, blocks_to_disk, trim(B.querytxt) text from stl_query_metrics A, stl_query B where A.query = B.query and segment=-1 and step = -1 and max_blocks_to_disk > 0 order by 3 desc limit 20;
To determine if your queries are properly writing to a disk, run the following query:
SELECT q.query, trim(q.cat_text) FROM ( SELECT query, replace( listagg(text,' ') WITHIN GROUP (ORDER BY sequence), '\\n', ' ') AS cat_text FROM stl_querytext WHERE userid>1 GROUP BY query) q JOIN ( SELECT distinct query FROM svl_query_summary WHERE is_diskbased='t' AND (LABEL ILIKE 'hash%' OR LABEL ILIKE 'sort%' OR LABEL ILIKE 'aggr%' OR LABEL ILIKE 'save%' OR LABEL ILIKE 'window%' OR LABEL ILIKE 'unique%') AND userid > 1) qs ON qs.query = q.query;