I'm experiencing high or full disk utilization on an Amazon Redshift instance. How can I troubleshoot issues that cause high disk usage with Amazon Redshift?

High disk usage can be caused by several factors, and to identify the cause, it's important to understand how Amazon Redshift monitors disk usage work. For each cluster node, Amazon Redshift provides extra space, which is the raw disk space that is reserved by Amazon Redshift for internal use. This extra space is larger than the nominal disk capacity, which is the amount of disk space available to the user. You might see different Amazon Redshift disk utilization in Amazon CloudWatch graphs and STV_PATRITION tables, which show the raw disk space. We recommend using the Percentage of Disk Space metric for monitoring. Disk space is calculated per node, and not for the cluster as a whole. If a node has no more free disk space, and a segment running on that node requires disk space during query execution, then Amazon Redshift cancels the query and rolls back the corresponding transaction to release resources. To see percentage of nominal disk capacity used by the cluster:

  1. Open the Amazon Redshift console, and then choose Clusters from the navigation pane.
  2. Choose the cluster, and then choose the Performance view.
  3. View the metrics listed in Percentage of Disk Space.

Review the Amazon Redshift Best Practices for Designing Tables to avoid poor table design, which can result in spikes in disk usage and poor query performance. Specifically, be sure to use a good distribution key to avoid skewed data, use sort keys based on query patterns, and use smaller column sizes to avoid consuming excessive memory and temporary disk space.

If there is a sudden spike in disk space utilization, identify the activities and jobs running during the spike by using the STL_QUERY table. Identify queries running during a specific time period by running a query similar to the following:

select * from stl_query where starttime between '2018-01-01 00:30:00' and '2018-01-01 00:40:00';

Note: Update the values for the time when the spike in utilization occurred.

Consistent high disk usage can be related to:

Distribution and sort key - Review the table's distribution style, distribution key, and sort key selection. A full disk node can be caused by tables that have distribution skew and more data is located in one node than the others. Row skew is a result of table storage that is not balanced across the nodes, which means that the number of rows varies highly among the nodes for specific tables. Distribution or row skew can impact node-level disk space (storage skew) and intermediary rowset during query execution. For example, choosing a column that has low cardinality as the DISTKEY for KEY (DISTSTYLE) distribution might populate data in only a few slices. Spikes in disk usage for single nodes can also be caused by the dominance of a single value in the distribution key column. For more information, see DISTSTYLE, DISTKEY, and SORTKEY Options.

If data distribution is suboptimal and you have tables with skewed distribution keys, consider changing the distribution style or distribution key to a column that exhibits high cardinality and uniform distribution. For more information, see Choosing a Data Distribution Style. To identify uneven data distribution, see Identifying Tables with Data Skew or Unsorted Rows. You can also use the Amazon Redshift table_inspector.sql utility to see how database blocks in a distribution key map to the slices in a cluster and to see how data is distributed across a cluster.

Determine the cardinality of your distribution key by running a query similar to the following:  

select distkey, count(*) from public.distribution_skew group by distkey having count(*) > 1 order by 2 desc;

Note: You can also use SORT KEY columns in your ORDER BY clause to avoid an expensive SORT STEP that can use excessive memory and spill to disk. For more information, see Choosing Sort Keys.

Query processing - Review the memory that is allocated to a query. While a query is processing, intermediate query results can be stored in temporary blocks. If there isn't enough free memory, tables are spilled to disk. Intermediary result sets are not compressed, which can impact the available disk space. For more information, see Insufficient Memory Allocated to the Query. Amazon Redshift defaults to a table structure with even distribution and no column encoding for temporary tables, so consider using a CREATE statement if you are using SELECT...INTO syntax. For more information, see Issue #10 – Inefficient use of Temporary Tables.

If insufficient memory is allocated to your query, you might see a step in SVL_QUERY_SUMMARY that has an is_diskbased that has the value "true." To resolve this issue, allocate more memory to the query by temporarily increasing the number of query slots, which helps prevent query spill to disk. For more information about how to temporarily increase the slots for a query, see wlm_query_slot_count or tune your WLM to run mixed workloads. You can also use WLM -QMR rules to counter loads that are using resources excessively and to identify I/O intensive queries.

Tables with VARCHAR(MAX) columns - Check VARCHAR or CHARACTER VARYING columns for trailing blanks that are omitted when data is stored on disk, but can occupy the full length in memory (for VARCHAR, the maximum is 65535) during query processing. If you have tables with large VARCHAR columns, processing data across these columns can cause less intermediary rowset in memory and more spills to disk, which result in higher disk usage. For more information, see Use the Smallest Possible Column Size.

Generate a list of tables having maximum column widths by running a query similar to the following:

SELECT database, schema || '.' || "table" AS "table", max_varchar FROM svv_table_info WHERE max_varchar > 150 ORDER BY 2;

After you have a list of tables, identify which table columns have wide VARCHAR columns, and then determine the true maximum width for each wide column by running a query similar to the following:

SELECT max(octet_length (rtrim(column_name))) FROM table_name;

High column compression - Encode all columns (except Sort key) using the Analyze Compression or Amazon Redshift Column Encoding Utility for optimal column encoding. Amazon Redshift provides column encoding, which can increase read performance and reduce overall storage consumption. It is a best practice to use system suggested compression.

Maintenance operations - Review your table for Missing or outdated stats, which can cause suboptimal execution plans. Maintenance operations like VACUUM and DEEP COPY use intermediate temporary storage space for their sort operations, so a spike in disk usage is expected. Identifying missing or outdated stats can help optimize the query execution plan, which helps avoid reading and scanning unnecessary rows.

Find outdated stats by running a query similar to the following:

SELECT * FROM svv_table_info WHERE stats_off > 10 ORDER BY size DESC;

For optimal performance from your Amazon Redshift Database, be sure that database tables are regularly analyzed and vacuumed. For more information, see the Amazon Redshift Analyze & Vacuum Schema Utility.

Cartesian joins - Look for queries that have join conditions that result in a cross-join (Cartesian product) and can produce an increased number of blocks, which can utilize more memory and spill tables to disk. If cross-joins don't have a join condition, they can result in the Cartesian product of two tables. Because cross-joins are typically executed as nested-loop joins, which are the slowest of the possible join types, large amounts of data can be spilled to disk, and you might see spikes in overall disk usage. For more information, see Identifying Queries with Nested Loops.

Minimum table size - Review the table sizes to see if you have different table sizes between two different Amazon Redshift clusters. If you recently resized an Amazon Redshift cluster, you might see a change in overall disk storage, because space allocated to two tables residing on different Amazon Redshift cluster configurations can vary significantly. Disk storage space allocation for each table is primarily a function of the number of populated slices on each Amazon Redshift cluster and the number of table segments used by each table. For more information, see Why does a table in an Amazon Redshift cluster consume more or less disk storage space than expected?

Tombstone blocks - Tombstone blocks are usually generated when a write transaction to an Amazon Redshift table occurs. Blocks are immutable—they cannot be changed—in Amazon Redshift. Insert, update, or delete actions create a new set of tombstone blocks, and this marks the old blocks as "tombstoned." Normally, tombstones are cleared at commit, but issues can occur when long-running transactions read a table and there are several ETL loads running at the same time. Because Amazon Redshift must provide long-running transactions with a consistent view of the database from the time the transaction started, as long as the transaction is running, any table that is written to the database retains tombstone blocks. If this happens regularly and across several loads, enough tombstones can accumulate to result in a "disk full" error. If there are long-running active queries, terminate the queries so that for subsequent commits, all the blocks are released. Alternatively, after you have terminated a long-running query, you can immediately run a dummy commit to release the blocks by running a command similar to the following:

create table a (id int);
insert into a values(1);
drop table a; 

Confirm tombstone blocks by running a query similar to the following:

select trim(name) as tablename, count(case when tombstone > 0 then 1 else null end) as tombstones from svv_diskusage group by 1 having count(case when tombstone > 0 then 1 else null end) > 0 order by 2 desc;

Disk full during COPY - During a COPY operation, you might receive a "disk full" error even if there is enough storage available. This error can occur if the sorting operation spills to disk, creating temporary blocks. For best practices, see Amazon Redshift Best Practices for Loading Data.

Troubleshooting queries - The following queries can help you troubleshoot the source of high disk usage:

Identify the top 20 disk spill queries by running a query similar to the following:

select A.userid, A.query, blocks_to_disk, trim(B.querytxt) text from stl_query_metrics A, stl_query B where A.query = B.query and segment=-1 and step = -1 and max_blocks_to_disk > 0 order by 3 desc limit 20;

Determine if queries are writing to disk by running a query similar to the following:

SELECT q.query, trim(q.cat_text)
SELECT query,
replace( listagg(text,' ') WITHIN GROUP (ORDER BY sequence), '\\n', ' ') AS cat_text
FROM stl_querytext
WHERE userid>1
GROUP BY query) q
SELECT distinct query
FROM svl_query_summary
WHERE is_diskbased='t'
AND userid > 1) qs
ON qs.query = q.query;

Did this page help you? Yes | No

Back to the AWS Support Knowledge Center

Need help? Visit the AWS Support Center

Published: 2018-08-31