Why does a table in an Amazon Redshift cluster consume more or less disk storage space than expected?

Issue

A table in an Amazon Redshift cluster is using disk storage space differently than I expected. How are disk storage space and table size calculated in Amazon Redshift?

Short Description

The amount of disk storage space allocated to two tables that are in different Amazon Redshift clusters can vary significantly, even if the tables are created using the same data definition language (DDL) statements and contain the same number of rows. In the following scenario, the difference in disk storage space consumed by each table is determined by:

• The number of populated slices on each Amazon Redshift cluster
• The number of table segments used by each table

The minimum disk space is the smallest data footprint that a table can have on an Amazon Redshift cluster. You can check the minimum table size when you analyze cluster storage use or when you resize an Amazon Redshift cluster. You can calculate the minimum disk space using the following formula:

• For tables created using the KEY or EVEN distribution style:
Minimum table size = block_size (1 MB) * (number_of_user_columns + 3 system columns) * number_of_populated_slices * number_of_table_segments.
• For tables created using the ALL distribution style:
Minimum table size = block_size (1 MB) * (number_of_user_columns + 3 system columns) * number_of_cluster_nodes * number_of_table_segments.
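As a quick sketch, the two formulas above can be expressed as a single function (the function and parameter names here are illustrative, not part of Amazon Redshift):

```python
BLOCK_SIZE_MB = 1    # Amazon Redshift block size
SYSTEM_COLUMNS = 3   # hidden system columns added to every table

def minimum_table_size_mb(user_columns, slices_or_nodes, table_segments):
    """Smallest possible on-disk footprint of a table, in MB.

    For the KEY or EVEN distribution style, pass the number of
    populated slices; for the ALL distribution style, pass the
    number of cluster nodes.
    """
    return (BLOCK_SIZE_MB * (user_columns + SYSTEM_COLUMNS)
            * slices_or_nodes * table_segments)
```

For example, a table with 125 user columns and a sort key (two segments) on a cluster with 16 populated slices has a minimum size of `minimum_table_size_mb(125, 16, 2)`, which is 4096 MB.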

If two Amazon Redshift tables share the following attributes:

• Created with identical DDL statements
• Contain the same number of rows
• Haven't been manually modified

Then the table disk storage space allocation can vary depending on:

• The number of cluster slices populated by the table, for the EVEN and KEY distribution styles
• The number of nodes in the cluster, for the ALL distribution style
• The number of segments in the table

If an Amazon Redshift table has a sort key, the table has two segments—one sorted segment and one unsorted segment. If an Amazon Redshift table has no sort key, all data is unsorted, and therefore the table has one unsorted segment.

When data is added to an existing table with a sort key, the new data is maintained in a separate segment that contains unsorted data—the data is not merged into the sorted segment until a VACUUM operation is performed. For more information, see Managing the Volume of Merged Rows.

Note: The VACUUM operation merges the unsorted data with the sorted data. However, the table still keeps an unsorted segment for future loads.

The variable number_of_table_segments is one of three values that represent the number of table segments to allocate for Amazon Redshift tables:

• 0: The table has never been loaded; disk space is allocated for zero table segments.
• 1: A table without a sort key has been loaded one or more times.
• 2: A table with a sort key has been loaded one or more times.
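These three cases can be summarized in a short helper (a sketch; the function name is illustrative):

```python
def number_of_table_segments(ever_loaded, has_sort_key):
    """Number of table segments Amazon Redshift allocates disk space for."""
    if not ever_loaded:
        return 0                      # never loaded: zero segments
    return 2 if has_sort_key else 1   # sorted + unsorted, or unsorted only
```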

Example minimum table size calculations:

If a table has 125 user columns and a sort key, and resides on a cluster with 16 slices, then the smallest size the table can have when it populates all 16 slices is calculated as follows:

1 MB * (125 + 3) * 16 * 2 = 4096 MB

If a table is created with an identical DDL statement and resides on a two-slice cluster that populates both slices, then the minimum table size calculation dictates that the table uses significantly less disk storage:

1 MB * (125 + 3) * 2 * 2 = 512 MB

If a table is created with an identical DDL statement and the table resides on a cluster with 64 populated slices, then the following minimum table size calculation dictates that the table uses significantly more disk storage:

1 MB * (125 + 3) * 64 * 2 = 16384 MB

As these minimum table size examples show, the same table can grow or shrink on disk based on the number of slices it populates on the cluster.
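The three calculations above can be verified with a few lines of arithmetic (a sketch; the 1 MB block size and 3 system columns come from the formula earlier in this article):

```python
block_mb = 1
cols = 125 + 3  # 125 user columns + 3 system columns

# (populated slices, expected minimum size in MB)
for slices, expected_mb in [(16, 4096), (2, 512), (64, 16384)]:
    size = block_mb * cols * slices * 2  # 2 segments: the table has a sort key
    assert size == expected_mb
print("all three scenario sizes match")
```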

Resolution

1. Use the following query to determine the amount of disk space allocated to a table in an Amazon Redshift cluster:

SELECT ti.database,
       ti.schema || '.' || ti."table" AS tablename,
       ti.size
FROM svv_table_info ti;

2. Run the following query to calculate a table’s minimum table size:

WITH tbl_ids AS (
    SELECT DISTINCT oid
    FROM pg_class c
    WHERE relowner > 1
      AND relkind = 'r'
),
stp AS (
    SELECT id,
           sum(rows) AS sum_r,
           sum(sorted_rows) AS sum_sr,
           min(rows) AS min_r,
           max(rows) AS max_r,
           nvl(count(DISTINCT slice), 0) AS pop_slices
    FROM stv_tbl_perm
    WHERE id IN (SELECT oid FROM tbl_ids)
      AND slice < 6400
    GROUP BY id
),
colenc AS (
    SELECT attrelid,
           sum(CASE WHEN a.attencodingtype = 0 THEN 0 ELSE 1 END) AS encoded_cols,
           count(*) AS cols
    FROM pg_attribute a
    WHERE a.attrelid IN (SELECT oid FROM tbl_ids)
      AND a.attnum > 0
    GROUP BY a.attrelid
),
cluster_info AS (
    SELECT count(DISTINCT node) AS node_count
    FROM stv_slices
)
SELECT ti.database,
       ti.schema || '.' || ti."table" AS tablename,
       ti.diststyle,
       ti.sortkey1,
       ti.size AS current_size,
       nvl(CASE
             WHEN stp.sum_r = stp.sum_sr OR stp.sum_sr = 0 THEN
               CASE
                 WHEN "diststyle" = 'EVEN' THEN
                   CASE WHEN ti.sortkey1 != '' THEN stp.pop_slices * (colenc.cols + 3) * 2
                        ELSE stp.pop_slices * (colenc.cols + 3)
                   END
                 WHEN substring("diststyle", 1, 3) = 'KEY' THEN
                   CASE WHEN ti.sortkey1 != '' THEN stp.pop_slices * (colenc.cols + 3) * 2
                        ELSE stp.pop_slices * (colenc.cols + 3)
                   END
                 WHEN "diststyle" = 'ALL' THEN
                   CASE WHEN ti.sortkey1 != '' THEN cluster_info.node_count * (colenc.cols + 3) * 2
                        ELSE cluster_info.node_count * (colenc.cols + 3)
                   END
               END
             ELSE
               CASE
                 WHEN "diststyle" = 'EVEN' THEN stp.pop_slices * 2 * (colenc.cols + 3)
                 WHEN substring("diststyle", 1, 3) = 'KEY' THEN stp.pop_slices * (colenc.cols + 3) * 2
                 WHEN "diststyle" = 'ALL' THEN cluster_info.node_count * (colenc.cols + 3) * 2
               END
           END, 0) AS minimum_size
FROM svv_table_info ti
LEFT JOIN stp ON stp.id = ti.table_id
LEFT JOIN colenc ON colenc.attrelid = ti.table_id
CROSS JOIN cluster_info
WHERE ti.schema NOT IN ('pg_internal')
ORDER BY ti.size DESC;

To determine the optimal cluster configuration for your usage scenario, see Clusters and Nodes in Amazon Redshift and Data Warehouse System Architecture.