How can I troubleshoot high CPU utilization for Amazon RDS or Amazon Aurora PostgreSQL?
Last updated: 2022-06-09
I want to identify and resolve the cause of high CPU use in Amazon Relational Database Service (Amazon RDS) or Amazon Aurora PostgreSQL-Compatible Edition.
If you see that your load has high CPU usage, you can use a combination of the following tools to identify the cause:
- Amazon CloudWatch metrics
- Enhanced Monitoring metrics
- Performance Insights metrics
- Native PostgreSQL view and catalogs (such as pg_stat_statements, pg_stat_activity, and pg_stat_user_tables)
- Idle connections in the database
- Analyze command
- PostgreSQL logging parameters to log long-running queries, autovacuum, lock waits, and connection and disconnection requests. For more information, see How do I enable query logging using Amazon RDS for PostgreSQL?
Amazon CloudWatch metrics
You can use CloudWatch metrics to identify CPU patterns over extended periods. Compare the graphs WriteIOPs, ReadIOPs, ReadThroughput, and WriteThroughput with the CPU utilization to find the times when the workload caused high CPU.
After identifying the timeframe, you can review the Enhanced Monitoring data associated with your DB instance. You can set Enhanced Monitoring to collect data at intervals of 1, 5, 10, 15, 30, or 60 seconds. Doing this allows you to collect data at a more granular level than CloudWatch.
Enhanced Monitoring provides a view at the operating system (OS) level. This view can help identify the cause of a high CPU load at a granular level. For example, you can review the load average, CPU distribution (system% or nice%), and OS process list.
Using Enhanced Monitoring, you can check the loadAverageMinute data in intervals of 1, 5, and 15 minutes. A load average that's greater than the number of vCPUs indicates that the instance is under a heavy load. Also, if the load average is less than the number of vCPUs for the DB instance class, CPU throttling might not be the cause for application latency. Check the load average to avoid false positives when diagnosing the cause of CPU usage.
For example, if you have a DB instance that's using a db.m5.2xlarge instance class with 3000 Provisioned IOPS that reaches the CPU limit, you can review the following example metrics to identify the root cause of the high CPU usage. In the following example, the instance class has eight vCPUs associated with it. For the same load average, exceeding 170 indicates that the machine is under heavy load during the timeframe measured:
Load Average Minute
Note: Amazon RDS gives your workload a higher priority over other tasks that are running on the DB instance. To prioritize these tasks, workload tasks have a higher Nice value. As a result, in Enhanced Monitoring, Nice% represents the amount of CPU being used by your workload against the database.
After enabling Enhanced Monitoring, you can also check the OS process list that's associated with the DB instance. Enhanced monitoring shows a maximum of 100 processes. This can help you identify which processes have the largest impact on performance based on CPU and memory use.
You can combine Enhanced Monitoring results with pg_stat_activity results to help identify the resource usage of queries.
You can use Amazon RDS Performance Insights to identify the query responsible for the database load after checking the SQL tab that corresponds to a particular timeframe.
Native PostgreSQL view and catalogs
At database-engine level, if the issue occurs in real time, you can use pg_stat_activity or pg_stat_statements. This can help you group the machines, clients, and IP addresses that send the most traffic. You can also use this data to see if there are increases over time, increases in application servers, or if an application server has stuck sessions or locking issues. For more information, see the PostgreSQL documentation for pg_stat_activity and pg_stat_statements.
To enable pg_stat_statements, modify the existing custom parameter group and set the following values:
- Add pg_stat_statements to shared_preload_libraries
- track_activity_query_size = 4096
- pg_stat_statements.track = ALL
- pg_stat_statements.max = 10000
Choose Apply Immediately, and reboot the DB instance. Then, run a command similar to the following on the database that you want to monitor:
Note: The following example installs the extension in the "demo" database.
demo=> select current_database(); current_database ------------------ demo (1 row) demo=> CREATE EXTENSION IF NOT EXISTS pg_stat_statements;
After pg_stat_statements is set up, you can monitor the output using one of the following methods:
List queries by total_time and see which query spends most time in the database for PostgreSQL versions 12 and earlier:
SELECT total_time, query FROM pg_stat_statements ORDER BY total_time DESC LIMIT 10;
For PostgresSQL versions 13 and later:
SELECT total_plan_time+total_exec_time as total_time, query FROM pg_stat_statements ORDER BY 1 DESC LIMIT 10;
List queries with less buffer cache hit ratio, run the following query for PostgreSQL versions 12 and earlier:
SELECT query, calls, total_time, rows, 100.0 * shared_blks_hit / nullif(shared_blks_hit + shared_blks_read, 0) AS hit_percent FROM pg_stat_statements ORDER BY total_time DESC LIMIT 10;
For PostgreSQL versions 13 and later:
SELECT query, calls, total_plan_time+total_exec_time as total_time, rows, 100.0 * shared_blks_hit / nullif(shared_blks_hit + shared_blks_read, 0) AS hit_percent FROM pg_stat_statements ORDER BY 3 DESC LIMIT 10;
List queries on a per-execution basis to sample queries over time for PostgreSQL versions 12 and earlier:
SELECT query, calls, total_time/calls as avg_time_ms, rows/calls as avg_rows, temp_blks_read/calls as avg_tmp_read, temp_blks_written/calls as avg_temp_written FROM pg_stat_statements WHERE calls != 0 ORDER BY total_time DESC LIMIT 10;
For PostgreSQL versions 13 and later:
SELECT query, calls, (total_plan_time+total_exec_time as total_time)/calls as avg_time_ms, rows/calls as avg_rows, temp_blks_read/calls as avg_tmp_read, temp_blks_written/calls as avg_temp_written FROM pg_stat_statements WHERE calls != 0 ORDER BY 3 DESC LIMIT 10;
Idle connections in the database
Idle connections in the database might consume compute resources, such as memory and CPU. When your instance has high CPU utilization, be sure to check for idle connections on the database. For more information, see Performance impact of idle PostgreSQL connections. You can check for idle connections by reviewing the OS process list using Enhanced Monitoring. However, this list shows a maximum of 100 processes.
You can check for idle connection by running a few queries at the database level:
Run the following queries to view current sessions that are idle and active:
SELECT pid, datname, state, current_timestamp-least(query_start,xact_start) age, application_name, usename, query FROM pg_stat_activity WHERE query != '<IDLE>' AND query NOT ILIKE '%pg_stat_activity%' AND usename!='rdsadmin' ORDER BY query_start desc; SELECT application_name,pid,wait_event_type,wait_event,current_timestamp-least(query_start,xact_start) AS runtime, query AS current_query FROM pg_stat_activity WHERE not pid=pg_backend_pid() AND query NOT ILIKE '%pg_stat_activity%' AND usename!='rdsadmin';
Run the following queries to get the connection counts per user and application name:
postgres=> SELECT application_name,count(*) FROM pg_stat_activity GROUP BY application_name; application_name | count ------------------------+------- psql | 1 PostgreSQL JDBC Driver | 1 | 5 (3 rows)
postgres=> SELECT usename,count(*) FROM pg_stat_activity GROUP BY usename; usename | count ----------+------- master | 4 user1 | 1 rdsadmin | 2 (3 rows)
After identifying the idle connections, end these connections by running either of the following queries:
psql=> SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE usename = 'example-username' AND pid <> pg_backend_pid() AND state in ('idle');
SELECT pg_terminate_backend (example-pid);
If your application as designed results in a high number of connections, consider making changes so that your memory and CPU resources aren’t spent in managing these connections. You might either change the application to limit the number of connections or use a connection pooler such as PgBouncer. You can also use Amazon RDS Proxy, a managed service that allows you to set up connection pooling with a few clicks.
The Analyze command collects statistics about the contents of tables in the database and stores the results in the pg_statistic system catalog. Subsequently, the query planner uses these statistics to help determine the most efficient execution plans for queries. If you don't run Analyze frequently on tables in your database, then the queries might consume more compute resources. This is because of the stale statistics present in the system for the relations that you access. These issues occur under the following conditions:
- Autovacuum isn't running frequently.
- Analyze wasn't run after the major version upgrade.
Autovacuum isn't running: Autovacuum is a daemon that automates the execution of VACUUM and ANALYZE commands. Autovacuum checks for bloated tables in the database and reclaims the space for reuse. The autovacuum daemon makes sure that the table statistics are updated regularly by running the Analyze operation whenever the set threshold of tuples is dead. This enables the query planner to use the most efficient query plan based on recent statistics. If autovacuum isn't running, the query planner might create sub-optimal query plans, leading to higher resource consumption by the queries. For more information on tuning autovacuum, see Understanding autovacuum in Amazon RDS for PostgreSQL environments and A case study of tuning autovacuum in Amazon RDS for PostgreSQL.
Run the following query to get information on when autovacuum and autoanalyze were last run on the tables:
SELECT relname, last_autovacuum, last_autoanalyze FROM pg_stat_user_tables;
Analyze wasn't run after the major version upgrade: PostgreSQL databases typically encounter performance issues after any major engine version upgrade. A common reason for these issues is that Analyze operation isn't performed after the upgrade to refresh the pg_statistic table. Be sure to run the Analyze operation for every database in your RDS for PostgreSQL DB instance. Optimizer statistics aren't transferred during a major version upgrade. Therefore, you must regenerate all statistics to avoid performance issues due to higher resource utilization.
Run the following command without any parameters to generate statistics for all regular tables in the current database after a major version upgrade:
PostgreSQL logging parameters
Enable query logging using Amazon RDS for PostgreSQL. Then, check the PostgreSQL error logs to confirm that your log_min_duration_statement and log_statement parameters are set to appropriate values. For more information, see the PostgreSQL documentation for Error reporting and logging.
Lower the CPU usage
After you identify the queries causing the high CPU, you can use the following methods to further lower the CPU usage:
- To find opportunities for tuning, use EXPLAIN and EXPLAIN ANALYZE to identify the caveats. For more information, see the PostgreSQL documentation for EXPLAIN.
- If there is a query that's running repeatedly, consider using prepared statements to lower the pressure on your CPU. Repeated running of prepared statements caches the query plan. Therefore, the time for planning is much less for further runs because the plan is already in cache.