The default amount of disk storage space allocated to two tables residing on different Amazon Redshift clusters can vary significantly, even when the tables are created using identical data definition language (DDL) statements and contain the same number of rows. In this scenario, differences in disk storage space consumed by allocated for each table is primarily a function of the number of populated slices on each Redshift cluster and the number of table segments used by each table.

Amazon Redshift allocates disk space storage for tables by calculating the following formula for "minimum table size" as follows:

  • For tables created using the KEY or EVEN distribution style:

Minimum Table Size = block_size (1MB) * (number_of_user_columns + 3 system columns) * number_of_populated_slices * number_of_table_segments.

  • For tables created using the ALL distribution style:

Minimum Table Size = block_size (1MB) * (number_of_user_columns + 3 system columns) * number_of_cluster_nodes * number_of_table_segments

Given two tables created in Amazon Redshift which:

  • Are created with identical DDL statements
  • Contain the same number of rows
  • Are located on different cluster nodes
  • Have not been manually modified

Differences in table disk storage space allocation are a function of:

  • The number of populated slices allocated to the each cluster node (for tables created using the KEY or EVEN distribution style).
  • The number of Redshift cluster nodes used by the table (for tables created using the ALL distribution style).
  • The number of table segments used by each table, regardless of the table’s distribution style.

Note: For the purposes of this article, a "table segment" refers to a data structure created for a table, as opposed to a query execution segment as described under step 5 of Query Planning And Execution Workflow. Amazon Redshift is a column-oriented DBMS, with tables created as segments of data columns rather than as rows of data. When data is added to a table created in a column-oriented DBMS after the table is created, the new data is maintained in a separate table segment that contains unsorted data (that is, the data is not inserted into the original sorted key segment until a VACUUM operation is performed). The variable number_of_table_segments returns one of three integer values that represent the number of table segments to allocate for Redshift tables:

0: A table has never been loaded; allocate disk space for zero table segments.

1: A table with a sortkey has been loaded only once and never vacuumed OR a table without a sortkey has been loaded one or more times; allocate disk space for a single table segment.

2: A table with a sortkey has been loaded once and never vacuumed OR has been loaded multiple times; allocate disk space for two table segments.

Example Minimum Table Size calculations:

Given a table with with 125 user columns in a single sort key segment on a cluster with 16 populated slices, Minimum Table Size is calculated as follows:

1MB * (125 + 3) * 16 * 1 = 2048MB

Now, given a table created with an identical DDL statement but that:

  • Stores data in both a sorted key segment and an unsorted key segment
  • Resides on on a cluster with 64 populated slices

The Minimum Table Size calculation dictates that the table will utilize significantly more disk storage:

1 MB * (125 + 3) * 64 * 2 = 16384MB

Note: It is possible for the Minimum Table Size calculation to return a slightly different than expected result immediately after you complete a VACUUM operation using the SORT ONLY or DELETE ONLY parameters. This might occur because the VACUUM operation adds blocks for processing data as part of sort and delete operations. For more information, see Usage Notes.

Use the following query to determine the amount of disk space allocated to a table in an Amazon Redshift cluster:

SELECT ti.database,

  ti.schema||'.'||ti."table"AS tablename,

  ti.size

FROM svv_table_info ti;

 

Run the following query to calculate a table’s Minimum Table Size:

WITH

  tbl_ids AS

    (SELECT DISTINCT oid

     FROM pg_class c

     WHERE relowner>1

       AND relkind='r'),

  stp AS

    (SELECT id,sum(ROWS)sum_r,sum(sorted_rows)sum_sr,min(ROWS)min_r,

       max(ROWS)max_r,nvl(count(DISTINCT slice),0)pop_slices

     FROM stv_tbl_perm

     WHERE id IN (SELECT oid FROM tbl_ids)

       AND slice<6400

     GROUP BY id),

  colenc AS

    (SELECT attrelid,sum(CASE WHEN a.attencodingtype=0 THEN 0 ELSE 1 END)

        AS encoded_cols,count(*)AS cols

     FROM pg_attribute a

     WHERE a.attrelid IN (SELECT oid FROM tbl_ids)

       AND a.attnum>0

     GROUP BY a.attrelid),

  cluster_info AS

    (SELECT COUNT(DISTINCT node) node_count

     FROM stv_slices)

SELECT ti.database,

       ti.schema||'.'||ti."table"AS tablename,

       ti.diststyle,

       ti.sortkey1,

       ti.size current_size,

       nvl(CASE

           WHEN stp.sum_r=stp.sum_sr

                OR stp.sum_sr=0 THEN CASE

                  WHEN"diststyle"='EVEN'THEN(stp.pop_slices*(colenc.cols+3))

                   WHEN substring("diststyle",1,3)='KEY'

                      THEN(stp.pop_slices*(colenc.cols+3))

                   WHEN"diststyle"='ALL'THEN(cluster_info.node_count*(colenc.cols+3))

           END

           ELSE CASE

                    WHEN"diststyle"='EVEN'THEN(stp.pop_slices*(colenc.cols+3)*2)

                   WHEN substring("diststyle",1,3)='KEY'

                      THEN(stp.pop_slices*(colenc.cols+3)*2)

                   WHEN"diststyle"='ALL'

                      THEN(cluster_info.node_count*(colenc.cols+3)*2)

           END

       END,0) AS minimum_size

FROM svv_table_info ti

LEFT JOIN stp ON stp.id=ti.table_id

LEFT JOIN colenc ON colenc.attrelid=ti.table_id

CROSS JOIN cluster_info

WHERE ti.schema NOT IN('pg_internal')

ORDER BY ti.size DESC;

Review the information at About Clusters and Nodes and Data Warehouse System Architecture to determine the optimal cluster configuration for your usage scenario.

Amazon Redshift, disk space, allocate, minimum table size, storage slices, cluster, compute node, DC1.L, DC1.8XL, DS2.XL, DS2.8XL, data distribution


Did this page help you? Yes | No

Back to the AWS Support Knowledge Center

Need help? Visit the AWS Support Center

Published: 2016-04-15