AWS Database Blog

Auto Analyze in Aurora DSQL: Managed optimizer statistics in a multi-Region database

In Amazon Aurora DSQL, as in other modern relational database systems, accurate statistics are among the most important inputs to query planning. Choosing a bad query plan over a good one can cause a 100-fold slowdown, so up-to-date statistics are key to minimizing the risk of plan regressions. In this post, we give insights into Aurora DSQL Auto Analyze, a probabilistic and de facto stateless method for automatically computing Aurora DSQL optimizer statistics. Users who are familiar with PostgreSQL will appreciate the similarity to autovacuum analyze.

Importance of statistics for query performance

To illustrate why statistics matter to query performance, let's look at a basic example where the optimizer can choose between accessing data through a full table scan or an index scan. To show the effect of missing statistics, we used an internal parameter to disable Auto Analyze. For customers, Auto Analyze is always enabled, without the option to turn it off.

First, we create a table with an int column A and a text column B, along with an index on the A column. We then insert 600,000 rows into this table. In this example, we're concerned with the A column: 300,000 rows contain distinct A values from 0 to 299,999, and the other 300,000 rows all have an A value of 42.

CREATE TABLE mytable (A int, B text);
CREATE INDEX ASYNC mytableidx ON mytable(A);

SELECT 'INSERT INTO mytable SELECT generate_series(3000 * ' || (i-1) || ', 3000 * ' || i || ' - 1), ''AWS Aurora DSQL is great'';' FROM generate_series(1, 100) i;
\gexec

SELECT 'INSERT INTO mytable SELECT 42, ''AWS Aurora DSQL is great'' FROM generate_series(1, 3000);' FROM generate_series(1, 100);
\gexec

We confirm that 300,001 rows have an A value of 42 using the following query. Therefore, rows with an A value of 42 make up slightly more than half of the table.

SELECT A = 42 AS a_is_42, count(*) FROM mytable GROUP BY A = 42;

 a_is_42 | count
---------+--------
 f       | 299999
 t       | 300001
(2 rows)

Let’s observe which plan the optimizer chooses if we select all rows with an A value of 42, by running the following command.

EXPLAIN ANALYZE SELECT * FROM mytable WHERE A = 42; 

QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------ 
 Index Scan using mytableidx on mytable  (cost=23193.32..34868.97 rows=92373 width=32) (actual time=15.926..5217.368 rows=300001 loops=1)
   Index Cond: (a = 42)
   -> Storage Scan on mytableidx (cost=23193.32..34868.97 rows=92373 width=32) (actual rows=300001 loops=1)
       -> B-Tree Scan on mytableidx (cost=23193.32..34868.97 rows=92373 width=32) (actual rows=300001 loops=1)
           Index Cond: (a = 42)
   -> Storage Lookup on mytable (cost=23193.32..34868.97 rows=92373 width=32) (actual rows=300001 loops=1)
       Projections: a, b
       -> B-Tree Lookup on mytable (cost=23193.32..34868.97 rows=92373 width=32) (actual rows=300001 loops=1)
 Planning Time: 3.367 ms
 Execution Time: 5228.314 ms
(10 rows)

We observe that the chosen plan uses an index scan. Because more than half of the rows satisfy A = 42, we would expect the optimizer to avoid the indirection of the index and choose a full table scan instead.

To help the optimizer find the best plan, we run ANALYZE on the table:

ANALYZE mytable;

Now the chosen plan uses a full table scan, and the query finishes in less than half the time.

EXPLAIN ANALYZE SELECT * FROM mytable WHERE A = 42;

QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------
 Full Scan (btree-table) on mytable  (cost=74756.80..87915.45 rows=296975 width=32) (actual time=1.179..1977.851 rows=300001 loops=1)
   -> Storage Scan on mytable (cost=74756.80..87915.45 rows=296975 width=32) (actual rows=300001 loops=1)
       Projections: a, b
       Filters: (a = 42)
       Rows Filtered: 299999
       -> B-Tree Scan on mytable (cost=74756.80..87915.45 rows=597254 width=32) (actual rows=600000 loops=1)
 Planning Time: 5.055 ms
 Execution Time: 1989.230 ms
(8 rows)

If you reproduce this example in your own Aurora DSQL cluster, you will find that even before running a manual ANALYZE, you get the fast query plan that uses a full table scan. Auto Analyze computed the statistics in the background to give you this performance improvement.
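If you want to verify that statistics are present, one option is to check the row count estimate that ANALYZE maintains. The following query is a minimal check, assuming the pg_class catalog is queryable as it is in PostgreSQL:

-- A realistic row count estimate (about 600000 in this example) indicates
-- that statistics have been collected for the table
SELECT reltuples FROM pg_class WHERE relname = 'mytable';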

Auto Analyze in Aurora DSQL

In this section, we first revisit autovacuum-based analyze in PostgreSQL. Then, we explain how Aurora DSQL mimics this PostgreSQL behavior in a multi-Region setting at virtually unlimited scale through two building blocks.

In PostgreSQL, ANALYZE is triggered automatically through the autovacuum daemon (AUTOVACUUM), which continuously monitors table modifications and updates statistics when predefined thresholds are met, typically after 10% of a table’s rows have been inserted, updated, or deleted. For details see the PostgreSQL documentation for the autovacuum daemon.
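Roughly, PostgreSQL triggers an automatic ANALYZE on a table when the number of rows modified since the last analyze exceeds autovacuum_analyze_threshold + autovacuum_analyze_scale_factor * reltuples (50 + 0.1 * reltuples with the default settings). In PostgreSQL, you can inspect the counters behind this decision:

-- PostgreSQL-only illustration: the bookkeeping that drives autoanalyze
SELECT relname,
       n_mod_since_analyze,  -- rows inserted, updated, or deleted since the last ANALYZE
       last_autoanalyze      -- when the autovacuum daemon last analyzed the table
FROM pg_stat_user_tables
WHERE relname = 'mytable';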

In Aurora DSQL, the Auto Analyze feature is the equivalent of PostgreSQL's autovacuum analyze process, automatically maintaining the table statistics essential for query planning. Unlike PostgreSQL's deterministic threshold-based approach, Aurora DSQL implements a multi-Region-friendly solution built on two key building blocks:

  • The probabilistic trigger determines when ANALYZE runs. Instead of monitoring and tracking table modifications, each transaction has a calculated probability of triggering ANALYZE based on the number of rows it modifies relative to the table size. This probabilistic approach eliminates the need for cross-session coordination while helping to ensure that statistics are updated as tables evolve.
  • The sampling-based analyze method handles the actual statistics computation. When triggered, ANALYZE uses a sampling technique to efficiently compute accurate statistics even for massive multi-terabyte tables, enabling Aurora DSQL to scale to virtually unlimited table sizes.

Probabilistic trigger

Aurora DSQL uses a probabilistic trigger for Auto Analyze to determine when to refresh table statistics. Each committing transaction has a probability of triggering ANALYZE that depends on the table size and number of modifications it makes through insert, update, or delete operations. Note that triggering ANALYZE does not significantly impact a transaction’s performance. In this section, we give insights into how the ANALYZE probability of a transaction is determined.

Aurora DSQL tracks modifications per table within each transaction. When a transaction commits, each modified table is evaluated against a 10% threshold ratio. If the transaction modifies more than 10% of a table's rows, ANALYZE is always triggered. For smaller modifications, the probability of triggering ANALYZE is proportional to the percentage of rows changed, as the following sketch of the logic illustrates:

import random

THRESHOLD_RATIO = 0.1  # trigger ANALYZE, on average, after ~10% of a table changes

def maybe_analyze_on_commit(modified_tables, reltuples):
    # modified_tables: per-table (inserts, updates, deletes) counts for this transaction
    for table, (num_inserts, num_updates, num_deletes) in modified_tables.items():
        change_count = num_inserts + num_updates + num_deletes
        # reltuples comes from pg_class; it defaults to 1 for never-analyzed tables
        threshold_count = THRESHOLD_RATIO * max(reltuples.get(table, 1), 1)
        probability = change_count / threshold_count
        if random.random() <= probability:
            submit_job(f"ANALYZE {table}")  # runs in a separate query processor

This description is currently accurate only for tables with 1 million rows or more. For smaller tables, a dampening factor accounts for the setup cost of an ANALYZE, which runs in a separate query processor in Aurora DSQL.

This probabilistic approach triggers ANALYZE, on average, after 10% of a table has been modified, without requiring coordination between database sessions. The system uses row count estimates from pg_class.reltuples (populated by previous ANALYZE runs) to calculate probabilities, defaulting to 1 row for tables that have not been analyzed. For example, a transaction that modifies 1,000 rows of a table with a reltuples estimate of 1,000,000 triggers ANALYZE with probability 1,000 / 100,000 = 1%.

The probabilistic mechanism naturally adapts to workload patterns: heavily modified tables get their statistics refreshed more often, whereas static tables avoid unnecessary ANALYZE overhead.

Sampling-based ANALYZE

When Aurora DSQL triggers an ANALYZE operation, it uses sampling to efficiently compute accurate statistics without scanning entire tables. The system calculates a sampling rate designed to collect a sample of at least 30,000 rows (and more for large tables). This sample is used to compute the table-wide statistics in pg_class. A fixed-size subset of 30,000 rows is then used to generate the column-specific statistics, as in PostgreSQL.

Our method works by randomly selecting rows from storage based on the calculated probability. This approach mirrors PostgreSQL’s sampling methodology while adapting to the distributed architecture of Aurora DSQL. The sampling rate is determined by the target row count relative to the estimated table size from previous statistics.
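As an illustration of the concept (not of Aurora DSQL internals), Bernoulli sampling in PostgreSQL's TABLESAMPLE clause expresses the same idea: every row is selected independently with a fixed probability. For a table estimated at 300 million rows and a 30,000-row target, the rate would be 30,000 / 300,000,000, or 0.01%:

-- Conceptual illustration using PostgreSQL's TABLESAMPLE clause: select
-- each row independently with probability 0.01%
SELECT * FROM mytable TABLESAMPLE BERNOULLI (0.01);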

As mentioned before, the collected sample yields two types of statistics: table-wide statistics stored in pg_class and column-specific statistics in pg_stats. The table-wide estimates are a row count and a page count. The column-specific statistics in pg_stats include null fractions, distinct value ratios, histograms, and most common values. These statistics give the query optimizer the information it needs to generate efficient execution plans.
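Assuming these catalogs are queryable as in PostgreSQL, you can inspect both kinds of statistics for the example table from earlier:

-- Table-wide statistics: row count and page count estimates
SELECT reltuples, relpages FROM pg_class WHERE relname = 'mytable';

-- Column-specific statistics for column a
SELECT null_frac, n_distinct, most_common_vals, histogram_bounds
FROM pg_stats
WHERE tablename = 'mytable' AND attname = 'a';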

The sampling-based ANALYZE method used by Aurora DSQL helps ensure efficient computation even for multi-terabyte tables by keeping sample sizes consistent regardless of table growth. In our experiments, ANALYZE completed within minutes on tables of sizes up to 240 TB.

Conclusion

In this post, you learned about Aurora DSQL's Auto Analyze feature. Auto Analyze delivers the reliability of PostgreSQL's autovacuum analyze while addressing the unique challenges of distributed, multi-Region database systems. By combining probabilistic triggering with efficient sampling-based computation, your queries consistently benefit from well-maintained statistics without manual intervention. The probabilistic approach eliminates much of the coordination overhead that traditional threshold-based systems require, making it naturally suited for distributed architectures. Meanwhile, the sampling-based analysis scales from small tables to massive 240 TB datasets.

Aurora DSQL Auto Analyze gives you the benefits of well-maintained optimizer statistics while operating transparently in the background, letting developers focus on building applications instead of managing table statistics.

Aurora DSQL Auto Analyze is available in all Regions where Aurora DSQL is available. To learn more about Aurora DSQL, visit the webpage and documentation.


About the authors

Magnus Mueller

Magnus is an Applied Scientist at AWS, specializing in cardinality estimation, query optimization, and machine learning for systems. He holds a PhD focused on cardinality estimation and has published research at major database conferences.

James Morle

James is a Principal Engineer and distributed database architect with over 20 years of experience designing and implementing large-scale transactional and analytical systems at hyperscale.

Matthys Strydom

Matthys is a Principal Engineer at AWS with over 20 years of experience on a wide range of software systems, including distributed database query processing, AWS cloud services control planes, high throughput telephonic network integrations, and desktop CAD programs.

Vishwas Karthiveerya

Vishwas is a Senior Software Development and Database Systems Engineer at AWS, specializing in query planning, cost-based optimization, and execution performance for large-scale distributed databases.

Raluca Constantin

Raluca is a Senior Database Engineer at AWS, specializing in Amazon Aurora DSQL. Her 18 years of database expertise span Oracle, MySQL, PostgreSQL and cloud-native solutions, focusing on database scalability, performance and real-time data processing.