AWS Database Blog
Understanding statistics in PostgreSQL
July 2023: This post was reviewed for accuracy.
PostgreSQL has become the preferred open-source relational database for many enterprise developers and startups, and powers leading business and mobile applications. AWS provides two managed PostgreSQL options: Amazon Relational Database Service (Amazon RDS) for PostgreSQL and Amazon Aurora PostgreSQL-Compatible Edition.
Database statistics play a key role in improving the performance of the database. The query planner uses statistical data to generate efficient execution plans for queries. The purpose of this post is to explain the types of statistics in PostgreSQL and how to read and understand them. This applies to both Amazon RDS for PostgreSQL and Aurora PostgreSQL.
The following types of statistics are collected and available in PostgreSQL:
- Data distribution statistics
- Extended statistics
- Monitoring statistics
We explain each type in more detail in this post.
Data distribution statistics
These statistics are related to the data distribution for each relation. They provide information about the most common values in each column of a relation, the average width of the column, the number of distinct values in the column, and more. They’re collected when ANALYZE is run manually or triggered by autovacuum, and are stored in the pg_statistic system catalog (whose publicly readable view is pg_stats).
The following is an example of how you can generate and see these stats:
- Create a table and insert some dummy data.
- Analyze the table to generate the stats for the table.
- Check the stats of the table in the pg_stats view.
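The following is a minimal SQL sketch of these three steps; the table name test_stats and the generated data are assumptions chosen to reproduce the numbers discussed below (30 rows, with the ids 1–10 duplicated):

```sql
-- Step 1: create a table and insert some dummy data
-- (ids 1-10 appear twice and ids 11-20 once: 30 rows, 20 distinct ids)
CREATE TABLE test_stats (id int, name varchar);
INSERT INTO test_stats SELECT g, 'test' || g FROM generate_series(1, 10) AS g;
INSERT INTO test_stats SELECT g, 'test' || g FROM generate_series(1, 10) AS g;
INSERT INTO test_stats SELECT g, 'test' || g FROM generate_series(11, 20) AS g;

-- Step 2: analyze the table to generate its statistics
ANALYZE test_stats;

-- Step 3: check the statistics in the pg_stats view
SELECT attname, null_frac, n_distinct, most_common_vals, histogram_bounds, correlation
FROM pg_stats
WHERE tablename = 'test_stats';
```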
As shown in the output, the pg_stats view has the data distribution stats for each column of the test_stats table. For example, there are 20 unique values for column id; however, most_common_vals shows only 10 of them, because those are the duplicated values. Similarly, for the name column, test1, test2, … test9 appear as the most common values. The most common value list is used to help the planner predict the selectivity of equality expressions, such as where name='test5' or where state='TX'. The histogram bounds are used to help the planner predict the selectivity of inequality or range expressions, such as where id between 5000 and 10000.
If you look at the correlation column for the name column, its value is -0.19 (near 0). When the value is near -1 or +1, an index scan on the column is estimated to be cheaper than when it’s near 0, due to the reduced random access to the disk. Because there are only 30 rows, a value near 0 indicates that an index isn’t required for this column. There are no null values in the table, so null_frac is 0.
The query planner needs to estimate the number of rows retrieved by a query in order to make good choices of query plans. These data distribution statistics are what it uses to make those estimates.
The following are a few key things to note about the statistics:
- For large tables, ANALYZE takes a random sample of the table contents, rather than examining every row. This allows even very large tables to be analyzed in a small amount of time.
- The number of rows sampled by ANALYZE depends on the default_statistics_target parameter. Larger values increase the time needed to do ANALYZE, but might improve the quality of the planner’s estimates. The default value of 100 is sufficient to get an accurate plan in most cases; however, default_statistics_target is the global default. For the case where one column needs more stats, you can use ALTER TABLE [ TABLE ] ALTER [ COLUMN ] column_name SET STATISTICS integer (see the sketch after this list), at the cost of more CPU, memory, and time during ANALYZE. If the value of this parameter is 100, then 300 * 100 = 30,000 rows are sampled from each table. This sample is used to determine up to 100 most common values to store in the most_common_vals array column, and up to 100 histogram bounds to store in the histogram_bounds array, plus a few other scalar statistics, like the number of distinct values.
- There is no command or function for resetting these statistics (as in resetting them to 0).
- After you complete an engine major version upgrade, you should run the ANALYZE operation to refresh the pg_statistic table (to have the statistics updated for the planner’s use).
- For a read replica in Amazon RDS for PostgreSQL and for a reader node in Aurora PostgreSQL, these stats are the same as for the primary or writer. This is because they are stored in a relation (pg_statistic) on disk (in Amazon RDS for PostgreSQL, the physical blocks are the same on the replica; in the case of Aurora, the reader reads from the same storage). This is also the reason why it isn’t allowed (and also not logical) to run ANALYZE on a replica or a reader node: both can read from the pg_statistic relation, but can’t update it.
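For example, the following sketch raises the statistics target for a single column of the test_stats table from earlier (the column choice and the target value of 500 are illustrative assumptions):

```sql
-- Collect more detailed statistics for the name column only
ALTER TABLE test_stats ALTER COLUMN name SET STATISTICS 500;

-- Re-analyze so the new target takes effect
ANALYZE test_stats;
```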
Extended statistics
By default, the statistics from ANALYZE are stored on a per-column, per-table basis, and therefore can’t capture any knowledge about cross-column correlation. It’s common to see slow queries running bad execution plans because multiple columns used in the query clauses are correlated. However, with the CREATE STATISTICS command, you can create extended statistics for correlated columns.
You can use CREATE STATISTICS for a single expression (which is called univariate statistics), which provides benefits similar to an expression index without the overhead of index maintenance. You can also use CREATE STATISTICS on multiple columns or expressions (known as multivariate statistics).
Consider a common table with a relationship between a city and state in the US. For example, we have a query to find the rows for the city of Springfield in the state of Illinois. There is a city named Springfield in most states, and the state of Illinois has many cities. Without knowledge of the relationship between the two columns, the planner estimates the number of rows with that combination by simply multiplying the two individual probabilities. Extended statistics gather stats on the two columns together, so the estimate of the number of rows from Springfield, Illinois is a lot closer to the actual count.
Let’s use a table with two integer columns to demonstrate the usage of extended stats; a consolidated sketch of all the following steps appears after this list. Complete the following steps:
- Create a table with columns a and b and insert some data. From the data inserted, the columns are functionally dependent: knowledge of the value in the first column is sufficient for determining the value in the other column.
- Look at the pg_stats values for these columns. You can see 1,000 distinct values for column a and 100 for column b.
- Without extended statistics, filtering on one column looks fine: the estimated row count of 999 is almost the same as the actual count of 1,000.
- However, if you add the second column to the predicate, the planner estimates rows very differently, because no dependency statistics are gathered. The estimated row count of 1 is far from the actual count of 1,000.
- Let’s gather extended statistics. Afterward, the estimated row count (1,017) is very close to the actual number.
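The following is a consolidated sketch of these steps; the table name ext_stats_demo, the statistics object name s_ext_depend, and the data generation are assumptions chosen to match the distinct counts described above:

```sql
-- Step 1: create two functionally dependent columns (b = a / 10);
-- a has 1,000 distinct values, b has 100; 1,000,000 rows in total
CREATE TABLE ext_stats_demo (a int, b int);
INSERT INTO ext_stats_demo
  SELECT i / 1000, i / 10000 FROM generate_series(0, 999999) AS s(i);
ANALYZE ext_stats_demo;

-- Step 2: check the per-column statistics
SELECT attname, n_distinct FROM pg_stats WHERE tablename = 'ext_stats_demo';

-- Step 3: a one-column predicate; the estimate is close to the actual 1,000 rows
EXPLAIN ANALYZE SELECT * FROM ext_stats_demo WHERE a = 1;

-- Step 4: a two-column predicate; without dependency statistics,
-- the planner multiplies the selectivities and estimates ~1 row
EXPLAIN ANALYZE SELECT * FROM ext_stats_demo WHERE a = 1 AND b = 0;

-- Step 5: gather extended (dependency) statistics and re-check the estimate
CREATE STATISTICS s_ext_depend (dependencies) ON a, b FROM ext_stats_demo;
ANALYZE ext_stats_demo;
EXPLAIN ANALYZE SELECT * FROM ext_stats_demo WHERE a = 1 AND b = 0;
```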
All the queries with filters on both columns will have much better estimates using extended statistics.
Monitoring statistics
In addition to data distribution and extended statistics, there are monitoring statistics as well. These types of stats are collected by the statistics collector. They track accesses to tables and indexes, in both disk-block and individual-row terms. They also track the total number of rows in each table, and information about vacuum and analyze actions for each table (when they were last run on the table).
PostgreSQL 15 introduces a new mechanism that stores server-level statistics in shared memory. Previously, these counters were updated via UDP packets, stored in the file system, and read by sessions; there is no longer a statistics collector process.
The following output shows information like the number of live rows, total inserts, total dead rows, vacuum information, and more for the monitor_stats table.
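A sketch of such a query, using the pg_stat_user_tables view (monitor_stats stands in for any user table):

```sql
SELECT relname, n_live_tup, n_dead_tup, n_tup_ins, n_tup_upd, n_tup_del,
       last_vacuum, last_autovacuum, last_analyze, last_autoanalyze
FROM pg_stat_user_tables
WHERE relname = 'monitor_stats';
```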
For PostgreSQL 14 or older, the statistics collector transmits the collected information to other PostgreSQL processes through temporary files. These files are stored in the directory named by the stats_temp_directory parameter, pg_stat_tmp by default.
When the server shuts down cleanly, a permanent copy of the statistics data is stored in the pg_stat subdirectory, so that statistics can be retained across server restarts. When recovery is performed at server start (such as after an immediate shutdown, a server crash, or point-in-time recovery), all statistics counters are reset.
The collector itself emits a new report at most once per PGSTAT_STAT_INTERVAL milliseconds (500 milliseconds unless altered while building the server). pg_stat_activity, pg_stat_replication, pg_stat_all_tables, pg_stat_user_indexes, and pg_statio_all_tables are some examples of the views that are refreshed by the stats collector reports. For more information on statistics views, refer to Viewing Statistics.
You can use the pg_stat_reset() function to reset all statistics counters for the current database to zero.
In the following example, seq_scan is 0 at this stage on the table reset_stats.
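A sketch setting up this example (the reset_stats table and its data are assumptions):

```sql
-- Create a demo table; no sequential scans have touched it yet
CREATE TABLE reset_stats (id int, name varchar);
INSERT INTO reset_stats SELECT g, 'test' || g FROM generate_series(1, 1000) AS g;

SELECT relname, seq_scan FROM pg_stat_user_tables WHERE relname = 'reset_stats';
-- seq_scan is 0 at this point
```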
An explain analyze command runs the query, so this should increment the seq_scan counter.
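In the sketch, that looks like the following:

```sql
-- EXPLAIN ANALYZE actually runs the query, performing a sequential scan
EXPLAIN ANALYZE SELECT * FROM reset_stats;
```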
You can check the table stats after running explain analyze once on the reset_stats table for a select * query. As we can see, seq_scan is 1 now.
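Checking the counter in the sketch:

```sql
SELECT relname, seq_scan FROM pg_stat_user_tables WHERE relname = 'reset_stats';
-- seq_scan is now 1
```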
After running the same query again, seq_scan is incremented again.
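In the sketch:

```sql
EXPLAIN ANALYZE SELECT * FROM reset_stats;

SELECT relname, seq_scan FROM pg_stat_user_tables WHERE relname = 'reset_stats';
-- seq_scan is now 2
```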
The pg_stat_reset() function is used to reset all statistics for the database to 0.
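Finishing the sketch:

```sql
SELECT pg_stat_reset();

SELECT relname, seq_scan FROM pg_stat_user_tables WHERE relname = 'reset_stats';
-- seq_scan is back to 0
```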
Because collecting statistics adds some overhead to query execution, the system can be configured to collect or not collect information. This is controlled by configuration parameters that are normally set in postgresql.conf, such as track_activities, track_counts, track_functions, and track_io_timing.
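For example, you can inspect these settings with SHOW; on self-managed PostgreSQL you could change them with ALTER SYSTEM (on Amazon RDS and Aurora, you would modify the DB parameter group instead). A sketch:

```sql
SHOW track_counts;

-- Self-managed PostgreSQL only; use a DB parameter group on RDS/Aurora
ALTER SYSTEM SET track_io_timing = on;
SELECT pg_reload_conf();
```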
Because the statistics collector process runs on each PostgreSQL server by default, and access to the database relations differs on each server, these stats are different for each instance. This means that RDS for PostgreSQL replicas and Aurora PostgreSQL readers have different stats than the primary and the writer node.
Summary
In this post, we learned about the different types of statistics available in PostgreSQL. The data distribution statistics are collected whenever ANALYZE is run (either manually or by autovacuum), and these stats are used by the query planner. The extended statistics are useful when there is a correlation between multiple columns, so that the planner can build statistics based on the dependencies between the columns. You can use the monitoring statistics to see the number of sequential scans, the number of index scans, live and dead tuples in each relation, I/O stats for each relation, information about vacuum and analyze on each relation, and more. These stats are refreshed every 500 milliseconds by the stats collector.
If you have any questions or suggestions about this post, leave a comment.
About the authors
Divya Sharma is a Database Specialist Solutions architect at AWS, focusing on RDS/Aurora PostgreSQL. She has helped multiple enterprise customers move their databases to AWS, providing assistance on PostgreSQL performance and best practices.
Baji Shaik is a Sr. Database Consultant with AWS ProServe, GCC AMER. His background spans a wide depth and breadth of expertise and experience in SQL and NoSQL database technologies. He is a Database Migration Expert and has developed many successful database solutions addressing challenging business requirements for moving databases from on premises to Amazon RDS and Aurora PostgreSQL/MySQL. He is an eminent author, having written several books on PostgreSQL. A few of his recent works include “PostgreSQL Configuration”, “Beginning PostgreSQL on the Cloud”, and “PostgreSQL Development Essentials”. Furthermore, he has delivered several conference and workshop sessions.