Amazon Aurora PostgreSQL parameters, Part 1: Memory and query plan management

Organizations today have a strategy to migrate from traditional databases and as they plan their migration, they don’t want to compromise on performance, availability, and security features. Amazon Aurora is a cloud native relational database service that combines the speed and availability of high-end commercial databases with the simplicity and cost-effectiveness of open source databases. The PostgreSQL-compatible edition of Aurora delivers up to 3X the throughput of standard PostgreSQL running on the same hardware, enabling existing PostgreSQL applications and tools to run without requiring modification. The combination of PostgreSQL compatibility with Aurora enterprise database capabilities provides an ideal target for commercial database migrations.

Amazon Aurora PostgreSQL-Compatible Edition has enhancements at the engine level, which improves the performance for high concurrent OLTP workload, and also helps bridge the feature gap between commercial engines and open-source engines. While the default parameter settings for Aurora PostgreSQL are good for most of the workloads, customers who migrate their workloads from commercial engines may need to tune some of the parameters according to performance and other non-functional requirements. Even workloads which are migrated from PostgreSQL to Aurora PostgreSQL may need to relook at some of the parameter settings because of architectural differences and engine level optimizations.

In this four part series, we explain parameters specific to Aurora PostgreSQL. We also delve into certain PostgreSQL database parameters that apply to Aurora PostgreSQL, how they behave differently, and how to set these parameters to leverage or control additional features in Aurora PostgreSQL.

In this first post, we cover parameters that can be useful to tune the memory utilization of Aurora PostgreSQL. We also cover parameters that help control the behavior of the Query Plan Management (QPM) feature of Aurora PostgreSQL. In part two, we will cover parameters related to replication, security, and logging. We will cover Aurora PostgreSQL optimizer parameters in part three which can improve performance of queries. In part four, we will cover parameters which can align Aurora PostgreSQL closer to American National Standards Institute (ANSI) standards and reduce the migration effort when migrating from commercial engines.

Memory and buffer related parameters

Although Aurora PostgreSQL has a similar shared memory architecture as PostgreSQL, there are some variations on how they apply. In this section, we cover two important shared memory parameters of PostgreSQL, shared_buffers and wal_buffers, and see how their interpretation changes in Aurora. We also discuss apg_ccm_enabled, which allows you to control behavior of an important Aurora feature: cluster cache management.

shared_buffers

In PostgreSQL, reads and writes are cached in a shared memory area referred to as shared buffers, and the size of this area is controlled by the shared_buffers parameter. Just like any other relational database, if PostgreSQL needs to read a page, it first caches that page in shared buffers before returning it to the client. Subsequent queries that need to refer to the same page just get it from the shared buffer. Similarly, for modification, the page isn’t immediately flushed to the disk. The writes are cached in the shared buffers (as dirty buffers) and then flushed upon the checkpoint. shared_buffers is used to designate the size of the shared memory area reserved by the postmaster for shared buffers.

PostgreSQL heavily leverages file system caching for read and write I/O as referred to in the documentation for general recommendations for PostgreSQL:

“If you have a dedicated database server with 1 GB or more of RAM, a reasonable starting value for shared_buffers is 25% of the memory in your system. In some workloads, even larger settings for shared_buffers are effective, but because PostgreSQL also relies on the operating system cache, it’s unlikely that an allocation of more than 40% of RAM to shared_buffers works better than a smaller amount. Larger settings for shared_buffers usually require a corresponding increase in max_wal_size, in order to spread out the process of writing large quantities of new or changed data over a longer period of time.“

With Aurora PostgreSQL, I/O is handled by the Aurora storage driver. There is no file system or secondary level of caching for tables or indexes. This means that shared_buffers should be larger than what the PostgreSQL community recommends. Smaller settings may result in poor performance. Typically, the value for shared_buffers in the default parameter group is set using the formula SUM({DBInstanceClassMemory/12038},-50003), which is about 75% of the available memory. The default value is a good starting point for most workloads.

If you make any changes to shared buffers for Aurora PostgreSQL, you should thoroughly test it out. A select-only run of pgbench with shared_buffers set to 15% of memory showed a 14% reduction in performance and 14% increase in VolumeReadIOPS compared to the default setting for a db.r5.2xlarge instance. The test was performed using the default select-only script of pgbench with 500 clients against a database initialized using a scale of 1,000.

If the working set for your workload can’t fit in the shared buffers, the Aurora instance needs to fetch more pages from storage. This increase in I/O shows in VolumeReadIOPS ([Billed] Volume Read IOPS on the Amazon RDS management console) and results in a higher Aurora I/O bill. You can review BufferCacheHitRatio to judge the efficiency of the shared buffers utilization. If this metric is consistently lower, you should work on reducing your working set, for example by archiving old data, adding indexes, implementing table partitioning, and tuning queries. If you can’t tune your workload further, consider increasing the instance class to allocate more memory for shared buffers.

Another important difference is the Aurora PostgreSQL survivable buffer cache feature. In community PostgreSQL, the contents of the buffer cache aren’t kept during a restart of the database engine. Aurora PostgreSQL maintains shared buffers contents during restarts and potentially during failovers (see apg_ccm_enabled in the next section). This provides a significant performance improvement after a restart (for example, during a patching or failover process). However, changes in shared_buffers size clears the cache and causes performance issues related to cold restart.

apg_ccm_enabled

apg_ccm_enabled deals with cluster cache management, a feature in Aurora that helps improve application performance after a failover. You can enable apg_ccm_enabled in a DB cluster parameter group. This parameter needs a restart for cluster cache management to take effect, and the writer and at least one replica should be set to priority 0 as failover priority target and their instance classes should be exactly same.

Typically, an Aurora cluster has one writer instance and one or more replica instances. If the writer instance fails, one of the replica instances is promoted as the writer. The shared buffers cache on the replica may not have the same pages that the writer did or the cache may be empty. This is known as a cold cache. A cold cache degrades performance because the DB instance has to read from storage instead of taking advantage of data stored in the buffer cache. With cluster cache management, you set a specific replica DB instance as the failover target. Cluster cache management makes sure that the data in the designated replica instance’s cache is kept synchronized with the data in the writer DB instance’s cache.

apg_ccm_enabled parameter doesn’t have any affect if the instance class isn’t exactly the same for all instances in the tier-0 failover group or if the instances use different parameter groups. If all the required conditions for cluster cache management aren’t met, you see an error in the Logs & events section on the Amazon RDS management console:

This DB instance is not eligible for cluster cache management. The apg_ccm_enabled parameter has been reset to false for this DB instance.

A pgbench test shows the improvement in performance after a failover when cluster cache management is enabled. The test is performed with a read-heavy workload (20:1 read to write ratio) on a database with 160 GB cached data. A failover is performed after running pgbench with a writer instance for 10 minutes (600 seconds). The following graph shows that when apg_ccm_enabled is enabled, pgbench can get to the average 90th percentile of the transaction per second (TPS) almost immediately after the failover. In contrast, without cluster cache management, the newly promoted writer took approximately 357 seconds to scale up to the 90th percentile of the TPS.

A graph showing TPS over time for a workload with CCM Enabled and a workload with CCM disabled. After a simulated failure both TPS fall down to zero, however, CCM Enabled TPS goes back to the original levels almost instantaneously while CCM Disabled TPS slowly goes back to its original levels.

For more information about this benchmark, see Introduction to Aurora PostgreSQL cluster cache management.

wal_buffers

WAL buffers are used to hold write ahead log (WAL) records that aren’t yet written to storage. The size of the WAL buffer cache is controlled by the wal_buffers setting. Aurora uses a log-based storage engine and changes are sent to storage nodes for persistence. Given the difference in how writes are handled by the Aurora storage engine, this parameter should be left unchanged when using Aurora PostgreSQL.

Query Plan Management parameters

PostgreSQL uses a cost-based optimizer, which calculates the cost of different available plans and uses the least costly plan. The query plan is calculated based on optimizer statistics and query planner configuration parameters. Changes to optimizer statistics, query planner configuration, or bind variables can cause the optimizer to choose a different plan. This is referred as query plan instability and can lead to unpredictable database performance.

The Aurora PostgreSQL Query Plan Management (QPM) feature solves the problem of plan instability by allowing database users to maintain stable, yet optimal, performance for a set of managed SQL statements. QPM serves two main objectives:

Plan stability – QPM prevents plan regression and improves plan stability when any of the aforementioned changes occur in the system
Plan adaptability – QPM automatically detects new minimum-cost plans and controls when new plans may be used and adapts to the changes

QPM is a feature specific to Aurora PostgreSQL. In this section, we discuss the parameters that affect how it works. For more information about setting up QPM for Aurora PostgreSQL, see Introduction to Amazon Aurora PostgreSQL Query Plan Management.

rds.enable_plan_management

This parameter enables the apg_plan_mgmt extension which is needed in order to use QPM. You can enable it in the DB cluster parameter group. It requires a restart for the change to take effect.

apg_plan_mgmt.capture_plan_baselines

This parameter dictates how the run plan of SQL statements is captured:

Off – Disables the plan capture altogether
Manual – Enables plan capture for all SQL statements
Automatic – Enables automatic plan capture for SQL statements that satisfy the eligibility criteria

You can set this parameter in the DB cluster parameter group, DB parameter group, or at the session level without a restart.

manual mode can be useful when you have specific SQL statements that suffer from plan instability or when you have a list of all SQL statements in your application. You can set apg_plan_mgmt.capture_plan_baselines to manual in a session and then run the SQL statements for which you want QPM to manage the plans. Manual plan management can be useful when you want to enforce specific plans by using the pg_hint_plan extension or when you want the optimizer to choose a plan that is generated by disabling specific query optimizer configuration parameters. For example, you can force the optimizer to not use merge join for a SQL statement by disabling enable_merge_join in a session and using manual mode to capture the query plan.

You can use automatic when you have a lot of statements that suffer from plan instability in a scenario when you don’t know which specific SQL statements should be managed by QPM or when SQL statements are dynamically generated using the application framework (such as Hibernate). You can enable automatic plan capture by setting apg_plan_mgmt.capture_plan_baselines to automatic in the DB parameter group or DB cluster parameter group. After the parameter takes effect, all the subsequent query plans are captured.

You can review the query plans in the apg_plan_mgmt.dba_plans view. The view provides you access to estimated and actual statistics, which can be helpful to decide if you want to use pg_hint_plan or custom parameter values to influence the optimizer.

apg_plan_mgmt.use_plan_baselines

Setting apg_plan_mgmt.capture_plan_baselines to a value other than off enables you to capture query plans in apg_plan_mgmt.dba_plans. After the plan is captured, database administrators can choose to approve or reject a plan.

As we discussed, the PostgreSQL optimizer is a cost-based optimizer and by default, it uses the plan with minimum cost. If you want to force the optimizer to evaluate a generated plan against the managed plans , you need to enable apg_plan_mgmt.use_plan_baselines by setting it to true. You can set this parameter in the DB cluster parameter group, DB parameter group, or at session level without a restart. This parameter can be useful to flip between QPM-managed plans and default PostgreSQL cost-based plans.

apg_plan_mgmt.unapproved_plan_execution_threshold

You can set apg_plan_mgmt.unapproved_plan_execution_threshold to allow the Aurora PostgreSQL optimizer to use disabled or rejected plans. When you enable managed plans by setting apg_plan_mgmt.use_plan_baselines and accepting plans by using the apg_plan_mgmt.evolve_plan_baselines() or apg_plan_mgmt.set_plan_status() function, PostgreSQL optimizer doesn’t use any rejected or disabled plan. You can set apg_plan_mgmt.unapproved_plan_execution_threshold and if the generated plan’s cost is lower than this threshold, the optimizer runs it. You can set this parameter in the DB cluster parameter group, DB parameter group, or at session level without a restart.

Setting this threshold can be useful when you have a few complex queries where you want to leverage QPM, but most queries are simple and default cost-based plans work well.

apg_plan_mgmt.plan_retention_period

QPM uses shared memory to store query plans. If the unused plans aren’t cleaned up, Aurora PostgreSQL runs out of the shared memory that has been set aside for QPM. You can set apg_plan_mgmt.plan_retention_period to a non-zero integer to enable automated housekeeping. It defines the number of days after which unused plans are deleted. The default is 32 days. You can also manually delete a plan using the apg_plan_mgmt.delete_plan( ) function.

apg_plan_mgmt.max_plans

apg_plan_mgmt.max_plans controls the number of SQL statements that QPM can manage. This parameter sets a limit on number of SQL statements that are maintained in the apg_plan_mgmt.dba_plans view. The default value is 1,000; a larger value requires more shared memory to be allocated to QPM. You can change apg_plan_mgmt.max_plans in the DB cluster parameter group or in the DB parameter group, and the change requires a restart to take effect.

apg_plan_mgmt.max_databases

Each database for which the apg_plan_mgmt extension is set up has a separate apg_plan_mgmt.dba_plans view for the database administrator to review the managed plans. By default, QPM can manage query plans for 10 databases and it can be increased by setting apg_plan_mgmt.max_database to a larger value. Increasing the number of databases QPM manages causes more shared memory to be reserved for QPM plan management. You can change this in the DB cluster parameter group or DB parameter group, and the change requires a restart before it takes effect.

Conclusion

This series of blogs discusses Aurora PostgreSQL specific parameters and how they can be tuned to control database behavior. In this post, we looked at how Aurora memory utilization and buffer sizing can differ from community PostgreSQL and their possible effects on performance. We also looked at additional parameters provided by Aurora PostgreSQL to control performance regression caused by a change in query run plans.

In part two we discuss replication, security, and logging parameters; and in part three, and part four we will dive deep into Aurora PostgreSQL parameters that introduces additional query optimization features and ANSI compatibility options respectively.

About the authors

Sameer Kumar is a Database Specialist Technical Account Manager at Amazon Web Services. He focuses on Amazon RDS, Amazon Aurora and Amazon DocumentDB. He works with enterprise customers providing technical assistance on database operational performance and sharing database best practices.

Gopalakrishnan Subramanian is a Database Specialist solutions architect at Amazon Web Services. He works with our customers to provide guidance and technical assistance on database projects, helping them improving the value of their solutions when using AWS