AWS Database Blog

Understanding statistics in PostgreSQL

July 2023: This post was reviewed for accuracy.

PostgreSQL has become the preferred open-source relational database for many enterprise developers and startups, and powers leading business and mobile applications. AWS provides two managed PostgreSQL options: Amazon Relational Database Service (Amazon RDS) for PostgreSQL and Amazon Aurora PostgreSQL-Compatible Edition.

Database statistics play a key role in improving the performance of the database. The query planner uses the statistical data to generate efficient run plans for queries. The purpose of this post is to explain the types of statistics in PostgreSQL and how to read and understand them. This applies to both Amazon RDS for PostgreSQL and Aurora PostgreSQL.

The following types of statistics are collected and available in PostgreSQL:

  • Data distribution statistics
  • Extended statistics
  • Monitoring statistics

We explain each type in more detail in this post.

Data distribution statistics

These statistics are related to the data distribution for each relation. They provide information about the most common values in each column in a relation, average width of the column, number of distinct values in the column, and more. They’re collected when we run ANALYZE or when analyze is triggered by autovacuum, and are stored in the pg_statistic system catalog (whose public readable view is pg_stats).

The following is an example of how you can generate and see these stats:

  1. Create a table and insert some dummy data:
    postgres=# CREATE TABLE test_stats(id INT, name VARCHAR);
    CREATE TABLE
    postgres=> INSERT INTO test_stats VALUES (generate_series(1,10),'test'||generate_series(1,10));
    INSERT 0 10
    postgres=> INSERT INTO test_stats VALUES (generate_series(1,10),'test'||generate_series(1,10));
    INSERT 0 10
    postgres=> INSERT INTO test_stats VALUES (generate_series(1,10),'test'||generate_series(1,10));
    INSERT 0 10
    postgres=> INSERT INTO test_stats VALUES (generate_series(11,20),'test'||generate_series(11,20));
    INSERT 0 10
  2. Analyze the table to generate the stats for the table:
    postgres=> ANALYZE VERBOSE test_stats ;
    INFO:  analyzing "public.test_stats"
    INFO:  "test_stats": scanned 1 of 1 pages, containing 40 live rows and 0 dead rows; 40 rows in sample, 40 estimated total rows
    ANALYZE
  3. Check the stats of the table in the pg_stats view:
    postgres=> SELECT * FROM pg_stats WHERE tablename ='test_stats';
    -[ RECORD 1 ]----------+------------------------------------------------
    schemaname             | public
    tablename              | test_stats
    attname                | id
    inherited              | f
    null_frac              | 0
    avg_width              | 4
    n_distinct             | -0.5
    most_common_vals       | {1,2,3,4,5,6,7,8,9,10}
    most_common_freqs      | {0.075,0.075,0.075,0.075,0.075,0.075,0.075,0.075,0.075,0.075}
    histogram_bounds       | {11,12,13,14,15,16,17,18,19,20}
    correlation            | 0.7551595
    most_common_elems      |
    most_common_elem_freqs |
    elem_count_histogram   |
    
    -[ RECORD 2 ]----------+------------------------------------------------
    schemaname             | public
    tablename              | test_stats
    attname                | name
    inherited              | f
    null_frac              | 0
    avg_width              | 6
    n_distinct             | -0.5
    most_common_vals       | {test1,test10,test2,test3,test4,test5,test6,test7,test8,test9}
    most_common_freqs      | {0.075,0.075,0.075,0.075,0.075,0.075,0.075,0.075,0.075,0.075}
    histogram_bounds       | {test11,test12,test13,test14,test15,test16,test17,test18,test19,test20}
    correlation            | -0.19043152
    most_common_elems      |
    most_common_elem_freqs |
    elem_count_histogram   |
    
    postgres=>

As shown in the output, the pg_stats view has the data distribution stats for each column of the test_stats table. For example, column id has 20 distinct values, but only the 10 values that were inserted multiple times (1–10) appear in most_common_vals; the remaining values are captured in histogram_bounds. Similarly, for the name column, test1 through test10 appear as the most common values. The planner uses the most common value list to predict the selectivity of equality expressions, such as WHERE name = 'test5' or WHERE state = 'TX'. It uses the histogram bounds to predict the selectivity of inequality or range expressions, such as WHERE id BETWEEN 5000 AND 10000.

If you look at the correlation value for the name column, it’s -0.19 (near 0). When the value is near -1 or +1, an index scan on the column is estimated to be cheaper than when it’s near 0, because of the reduced random access to disk. Because the table has only 40 rows and the correlation is near 0, an index isn’t needed for this column. There are no null values in the table, so null_frac is 0.

The query planner needs to estimate the number of rows retrieved by a query in order to choose good query plans, and it uses these data distribution statistics to make those estimates.
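
To see these statistics in action, you can ask the planner for an estimate against the table we just analyzed. The following is a minimal sketch (exact cost numbers will vary on your system); with a most_common_freqs entry of 0.075 and 40 total rows, the planner should estimate roughly 0.075 * 40 = 3 rows for an equality predicate on one of the common values:

-- Illustrative sketch; costs and timings vary by system
EXPLAIN SELECT * FROM test_stats WHERE name = 'test5';
-- Expect an estimate of about 3 rows, taken directly from
-- most_common_vals/most_common_freqs (0.075 * 40 = 3).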

The following are a few key things to note about the statistics:

  • For large tables, ANALYZE takes a random sample of the table contents, rather than examining every row. This allows even very large tables to be analyzed in a small amount of time.
  • The number of rows sampled by ANALYZE depends on the default_statistics_target parameter. Larger values increase the time needed to run ANALYZE but might improve the quality of the planner’s estimates. The default value of 100 is usually sufficient for accurate plans; however, default_statistics_target is a global setting. If a single column needs more detailed statistics, you can use ALTER TABLE table_name ALTER COLUMN column_name SET STATISTICS integer (see the sketch after this list), at the cost of more CPU, memory, and time during ANALYZE. If the value of this parameter is 100, then 300 * 100 = 30,000 rows are sampled from each table. This sample is used to determine up to 100 most common values to store in the most_common_vals array, up to 100 histogram bounds to store in histogram_bounds, plus a few other scalar statistics, like the number of distinct values.
  • There is no command or function for resetting these statistics (as in resetting to 0).
  • After you complete an engine major version upgrade, you should run the ANALYZE operation to refresh the pg_statistic table (to have the statistics updated for the planner’s use).
  • For a read replica in Amazon RDS for PostgreSQL and for a reader node in Aurora PostgreSQL, these stats are the same as on the primary or writer. This is because they are stored in a relation (pg_statistic) on disk (the physical blocks are the same on the replica in Amazon RDS for PostgreSQL, and in Aurora the reader reads from the same storage). This is also why running ANALYZE on a replica or a reader node isn’t allowed (and wouldn’t make sense): both can read from pg_statistic, but can’t update it.
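
As mentioned in the list above, you can raise the statistics target for a single column instead of changing default_statistics_target globally. The following is a minimal sketch (the column and target value are only for illustration); the per-column override is visible in pg_attribute:

-- Illustrative: raise the statistics target for one column only
ALTER TABLE test_stats ALTER COLUMN name SET STATISTICS 500;
ANALYZE test_stats;  -- re-collect stats using the new target

-- Check the per-column override
SELECT attname, attstattarget
FROM pg_attribute
WHERE attrelid = 'test_stats'::regclass AND attname = 'name';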

Extended statistics

By default, the statistics from ANALYZE are stored on a per-column, per-table basis, and therefore can’t capture any knowledge about cross-column correlation. It’s common to see slow queries with poor run plans because multiple columns used in the query clauses are correlated. However, with the CREATE STATISTICS command, you can create extended statistics for correlated columns.

You can use CREATE STATISTICS for a single expression (which is called univariate statistics), which provides benefits similar to an expression index without the overhead of index maintenance. You can also use CREATE STATISTICS on multiple columns or expressions (known as multivariate statistics).
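
The following is a minimal sketch of univariate expression statistics (this form requires PostgreSQL 14 or later; the statistics name and expression are only for illustration):

-- Single-expression statistics, PostgreSQL 14+; illustrative only
CREATE STATISTICS stats_lower_name ON (lower(name)) FROM test_stats;
ANALYZE test_stats;
-- The planner can now estimate predicates such as
--   WHERE lower(name) = 'test5'
-- without the maintenance overhead of an expression index.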

Consider a common table that relates a city to a state in the US. For example, suppose we have a query to find the rows for the city of Springfield in the state of Illinois. There is a city named Springfield in most states, and the state of Illinois has many cities. Without knowledge of the relationship between the two columns, the planner estimates the number of rows for that combination by simply multiplying the two individual probabilities, which can be far off. Extended statistics gather stats on the two columns together, so the estimate of the number of rows for Springfield, Illinois is much closer to the actual value.
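
For illustration only (the cities table, its columns, and the statistics name are hypothetical), such a statistics object could be created as follows:

-- Hypothetical table and statistics object for the city/state example
CREATE TABLE cities (city text, state text);
CREATE STATISTICS cities_city_state (dependencies, ndistinct)
  ON city, state FROM cities;
ANALYZE cities;  -- gathers the multivariate statistics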

Let’s use a table with two integer columns to demonstrate the use of extended stats. Complete the following steps:

  1. Create a table with a and b columns and insert some data:
    postgres=> CREATE TABLE ext_stats(a int, b int);
    CREATE TABLE
    postgres=> INSERT INTO ext_stats SELECT x/1000, x/10000 FROM generate_series(1,1000000) s(x);
    INSERT 0 1000000
    postgres=>
    postgres=> ANALYZE VERBOSE ext_stats;
    INFO:  analyzing "public.ext_stats"
    INFO:  "ext_stats": scanned 4425 of 4425 pages, containing 1000000 live rows and 0 dead rows; 30000 rows in sample, 1000000 estimated total rows
    ANALYZE
    postgres=>

    Given the data inserted, these columns are functionally dependent: knowledge of the value in the first column is sufficient to determine the value in the other column.

  2. Look at the pg_stats values for these columns:
    postgres=> select * from pg_stats where tablename='ext_stats';
    -[ RECORD 1 ]----------+-----------------------------------------------------------
    schemaname             | public
    tablename              | ext_stats
    attname                | a
    inherited              | f
    null_frac              | 0
    avg_width              | 4
    n_distinct             | 1000
    most_common_vals       | {698,821,983}
    most_common_freqs      | {0.0015,0.0014666667,0.0014666667}
    histogram_bounds       | {0,10,20,…,999}
    correlation            | 1
    most_common_elems      |
    most_common_elem_freqs |
    elem_count_histogram   |
    -[ RECORD 2 ]----------+-----------------------------------------------------------
    schemaname             | public
    tablename              | ext_stats
    attname                | b
    inherited              | f
    null_frac              | 0
    avg_width              | 4
    n_distinct             | 100
    most_common_vals       | {84,19,…,55,5,74,62}
    most_common_freqs      | {0.011166667,0.011033333,…,0.008733333,0.008433334}
    histogram_bounds       |
    correlation            | 1
    most_common_elems      |
    most_common_elem_freqs |
    elem_count_histogram   |

    You can see the distinct values for column a as 1000 and b as 100.

  3. Without extended statistics, a filter on a single column looks fine: the planner estimates 999 rows, which is almost the same as the actual row count of 1,000:
    postgres=> set max_parallel_workers_per_gather =0;
    SET
    postgres=> explain analyze select * from ext_stats where a=1;
                                                     QUERY PLAN
    -------------------------------------------------------------------------------------------------------------
     Seq Scan on ext_stats  (cost=0.00..16925.00 rows=999 width=8) (actual time=0.073..58.429 rows=1000 loops=1)
       Filter: (a = 1)
       Rows Removed by Filter: 999000
     Planning Time: 0.037 ms
     Execution Time: 58.485 ms
    (5 rows)
  4. However, if you add the second column to the predicate, the planner estimates rows very differently, because no dependency statistics are gathered:
    postgres=> explain analyze select * from ext_stats where a=1 and b=0;
                                                     QUERY PLAN
    -------------------------------------------------------------------------------------------------------------
     Seq Scan on ext_stats  (cost=0.00..194247.65 rows=1 width=8) (actual time=0.077..612.787 rows=1000 loops=1)
       Filter: ((a = 1) AND (b = 0))
       Rows Removed by Filter: 9999000
     Planning Time: 0.044 ms
     Execution Time: 612.844 ms
    (5 rows)

    You can see the estimated row count is 1, which is far from the actual count (1,000).

  5. Let’s gather extended statistics:
    postgres=> create statistics s_ext_depend(dependencies) on a,b from ext_stats ;
    CREATE STATISTICS
    postgres=> analyze VERBOSE ext_stats ;
    INFO:  analyzing "public.ext_stats"
    INFO:  "ext_stats": scanned 30000 of 44248 pages, containing 6779952 live rows and 0 dead rows; 30000 rows in sample, 9999977 estimated total rows
    ANALYZE
    postgres=> explain analyze select * from ext_stats where a=1 and b=0;
                                                       QUERY PLAN
    ----------------------------------------------------------------------------------------------------------------
     Seq Scan on ext_stats  (cost=0.00..194247.65 rows=1017 width=8) (actual time=0.076..615.219 rows=1000 loops=1)
       Filter: ((a = 1) AND (b = 0))
       Rows Removed by Filter: 9999000
     Planning Time: 0.126 ms
     Execution Time: 615.287 ms
    (5 rows)
    
    postgres=>

As you can see, the estimated row count (1,017) is very close to the actual number (1,000).

All the queries with filters on both columns will have much better estimates using extended statistics.
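
If you want to inspect what was gathered, the functional dependency coefficients are exposed through the pg_stats_ext view on PostgreSQL 12 and later (the actual values depend on your data):

-- Inspect the extended statistics object created above
SELECT statistics_name, attnames, dependencies
FROM pg_stats_ext
WHERE tablename = 'ext_stats';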

Monitoring statistics

In addition to data distribution and extended statistics, there are monitoring statistics as well. These stats are collected by the statistics collector (in PostgreSQL 14 and older).

These statistics track accesses to tables and indexes, in both disk-block and individual-row terms. They also track the total number of rows in each table, and information about vacuum and analyze actions for each table (when they were last run on the table).

Starting with PostgreSQL 15, these server-level statistics are stored in shared memory. Previously, they were updated via UDP packets, stored in the file system, and read by sessions; there is no longer a separate statistics collector process.

The following output shows information such as the number of live rows, total inserts, dead rows, vacuum details, and more for the monitor_stats table:

postgres=> CREATE TABLE monitor_stats(id int, name varchar);
CREATE TABLE
postgres=> INSERT INTO monitor_stats VALUES(generate_series(1,10000), 'test'||generate_series(1,10000));
INSERT 0 10000
postgres=> ANALYZE monitor_stats;
ANALYZE
postgres=> DELETE FROM monitor_stats where id < 100;
DELETE 99
postgres=> UPDATE monitor_stats SET id=1 where id < 200;
UPDATE 100
postgres=> \x
Expanded display is on.
postgres=> SELECT * FROM pg_stat_user_tables WHERE relname='monitor_stats';
-[ RECORD 1 ]-------+------------------------------
relid               | 217223
schemaname          | public
relname             | monitor_stats
seq_scan            | 2
seq_tup_read        | 19901
idx_scan            |
idx_tup_fetch       |
n_tup_ins           | 10000
n_tup_upd           | 100
n_tup_del           | 99
n_tup_hot_upd       | 86
n_live_tup          | 9901
n_dead_tup          | 199
n_mod_since_analyze | 199
last_vacuum         |
last_autovacuum     |
last_analyze        | 2022-07-19 16:38:41.48811+00
last_autoanalyze    | 2022-07-19 16:38:32.805005+00
vacuum_count        | 0
autovacuum_count    | 0
analyze_count       | 1
autoanalyze_count   | 1

postgres=>

For PostgreSQL 14 or older, the statistics collector transmits the collected information to other PostgreSQL processes through temporary files. These files are stored in the directory named by the stats_temp_directory parameter, pg_stat_tmp by default.
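
On PostgreSQL 14 and older, you can check this location directly (the parameter no longer exists in PostgreSQL 15 and later):

SHOW stats_temp_directory;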

When the server shuts down cleanly, a permanent copy of the statistics data is stored in the pg_stat subdirectory, so that statistics can be retained across server restarts. When recovery is performed at server start (such as after immediate shutdown, server crash, and point-in-time recovery), all statistics counters are reset.

The collector itself emits a new report at most once per PGSTAT_STAT_INTERVAL milliseconds (500 milliseconds unless altered while building the server).

pg_stat_activity, pg_stat_replication, pg_stat_all_tables, pg_stat_user_indexes, and pg_statio_all_tables are some examples of the views that are refreshed by the stats collector reports. For more information on statistics views, refer to Viewing Statistics.
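
For example, the following is a quick way to see what sessions are currently doing (the columns used here exist in current PostgreSQL versions):

-- Currently active (non-idle) sessions and what they are waiting on
SELECT pid, state, wait_event_type, wait_event, query
FROM pg_stat_activity
WHERE state <> 'idle';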

You can use the pg_stat_reset() function to reset all statistics counters for the current database to zero.

In the following example, seq_scan is 0 at this stage on table reset_stats:

postgres=> CREATE TABLE reset_stats(id int, name varchar);
CREATE TABLE
postgres=> INSERT INTO reset_stats VALUES(generate_series(1,1000), 'test'||generate_series(1,1000));
INSERT 0 1000
postgres=> ANALYZE reset_stats ;
ANALYZE
postgres=> SELECT * FROM pg_stat_user_tables WHERE relname='reset_stats';
-[ RECORD 1 ]-------+------------------------------
relid               | 217229
schemaname          | public
relname             | reset_stats
seq_scan            | 0
seq_tup_read        | 0
idx_scan            |
idx_tup_fetch       |
n_tup_ins           | 1000
n_tup_upd           | 0
n_tup_del           | 0
n_tup_hot_upd       | 0
n_live_tup          | 1000
n_dead_tup          | 0
n_mod_since_analyze | 0
last_vacuum         |
last_autovacuum     |
last_analyze        | 2022-07-19 16:53:07.904961+00
last_autoanalyze    | 2022-07-19 16:53:03.792875+00
vacuum_count        | 0
autovacuum_count    | 0
analyze_count       | 1
autoanalyze_count   | 1

postgres=>

An explain analyze command runs the query, so this should increment the seq_scan counter:

postgres=> explain analyze select * from reset_stats ;
                                                 QUERY PLAN
-------------------------------------------------------------------------------------------------------------
 Seq Scan on reset_stats  (cost=0.00..16.00 rows=1000 width=11) (actual time=0.008..0.084 rows=1000 loops=1)
 Planning Time: 0.090 ms
 Execution Time: 0.131 ms
(3 rows)

postgres=>

You can check the table stats after running explain analyze once on the reset_stats table for a select * query. As you can see, seq_scan is now 1:

postgres=> SELECT * FROM pg_stat_user_tables WHERE relname='reset_stats';
-[ RECORD 1 ]-------+------------------------------
relid               | 217229
schemaname          | public
relname             | reset_stats
seq_scan            | 1
seq_tup_read        | 1000
idx_scan            |
idx_tup_fetch       |
n_tup_ins           | 1000
n_tup_upd           | 0
n_tup_del           | 0
n_tup_hot_upd       | 0
n_live_tup          | 1000
n_dead_tup          | 0
n_mod_since_analyze | 0
last_vacuum         |
last_autovacuum     |
last_analyze        | 2022-07-19 16:53:07.904961+00
last_autoanalyze    | 2022-07-19 16:53:03.792875+00
vacuum_count        | 0
autovacuum_count    | 0
analyze_count       | 1
autoanalyze_count   | 1

seq_scan incremented again after running the same query:

postgres=> explain analyze select * from reset_stats ;
                                                 QUERY PLAN
-------------------------------------------------------------------------------------------------------------
 Seq Scan on reset_stats  (cost=0.00..16.00 rows=1000 width=11) (actual time=0.011..0.087 rows=1000 loops=1)
 Planning Time: 0.026 ms
 Execution Time: 0.136 ms
(3 rows)

postgres=> \x
Expanded display is on.
postgres=> SELECT * FROM pg_stat_user_tables WHERE relname='reset_stats';
-[ RECORD 1 ]-------+------------------------------
relid               | 217229
schemaname          | public
relname             | reset_stats
seq_scan            | 2
seq_tup_read        | 2000
idx_scan            |
idx_tup_fetch       |
n_tup_ins           | 1000
n_tup_upd           | 0
n_tup_del           | 0
n_tup_hot_upd       | 0
n_live_tup          | 1000
n_dead_tup          | 0
n_mod_since_analyze | 0
last_vacuum         |
last_autovacuum     |
last_analyze        | 2022-07-19 16:53:07.904961+00
last_autoanalyze    | 2022-07-19 16:53:03.792875+00
vacuum_count        | 0
autovacuum_count    | 0
analyze_count       | 1
autoanalyze_count   | 1

postgres=>

The pg_stat_reset() function is used to reset all statistics for the database to 0:

postgres=> select pg_stat_reset();
 pg_stat_reset
---------------

(1 row)

postgres=> \x
Expanded display is on.
postgres=> SELECT * FROM pg_stat_user_tables WHERE relname='reset_stats';
-[ RECORD 1 ]-------+------------
relid               | 217229
schemaname          | public
relname             | reset_stats
seq_scan            | 0
seq_tup_read        | 0
idx_scan            |
idx_tup_fetch       |
n_tup_ins           | 0
n_tup_upd           | 0
n_tup_del           | 0
n_tup_hot_upd       | 0
n_live_tup          | 0
n_dead_tup          | 0
n_mod_since_analyze | 0
last_vacuum         |
last_autovacuum     |
last_analyze        |
last_autoanalyze    |
vacuum_count        | 0
autovacuum_count    | 0
analyze_count       | 0
autoanalyze_count   | 0

postgres=>

Because collecting statistics adds some overhead to query runs, the system can be configured to collect or not collect information. This is controlled by configuration parameters that are normally set in postgresql.conf: track_activities, track_counts, track_functions, and track_io_timing.
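
To check how these parameters are set on your instance, you can query pg_settings (the output will vary with your configuration):

-- Current values of the statistics tracking parameters
SELECT name, setting
FROM pg_settings
WHERE name IN ('track_activities', 'track_counts',
               'track_functions', 'track_io_timing');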

Because these statistics are collected independently on each PostgreSQL server, and access patterns to the database relations differ from server to server, these stats are different for each instance. This means that RDS for PostgreSQL read replicas and Aurora PostgreSQL readers will have different stats than the primary or writer node.

Summary

In this post, we learned about the different types of statistics available in PostgreSQL. The data distribution statistics are collected whenever analyze is run (either manually or by autovacuum), and these stats are used by the query planner. The extended statistics are useful when there is a correlation between multiple columns, so that the planner can base its estimates on the dependencies between the columns. You can use the monitoring statistics to see the number of sequential scans, number of index scans, live and dead tuples in each relation, I/O stats for each relation, information about vacuum and analyze on each relation, and more. On PostgreSQL 14 and older, these stats are refreshed at most every 500 milliseconds by the stats collector; on PostgreSQL 15 and later, they are maintained in shared memory.

If you have any questions or suggestions about this post, leave a comment.


About the authors

Divya Sharma is a Database Specialist Solutions Architect at AWS, focusing on RDS/Aurora PostgreSQL. She has helped multiple enterprise customers move their databases to AWS, providing assistance on PostgreSQL performance and best practices.

Baji Shaik is a Sr Database Consultant with AWS ProServe, GCC AMER. His background spans a wide depth and breadth of expertise and experience in SQL/NoSQL database technologies. He is a Database Migration Expert and has developed many successful database solutions addressing challenging business requirements for moving databases from on-premises to Amazon RDS and Aurora PostgreSQL/MySQL. He is an eminent author, having written several books on PostgreSQL. A few of his recent works include “PostgreSQL Configuration“, “Beginning PostgreSQL on the Cloud”, and “PostgreSQL Development Essentials“. Furthermore, he has delivered several conference and workshop sessions.