How can I troubleshoot high CPU utilization for Amazon RDS or Amazon Aurora PostgreSQL?

Last updated: 2019-04-16

How can I identify and resolve the cause of high CPU utilization in Amazon Relational Database Service (Amazon RDS) or Amazon Aurora PostgreSQL?

Short Description

If your DB instance shows high CPU usage, you can use a combination of the following tools to identify the cause of the CPU load:

  • Amazon CloudWatch metrics can help identify patterns of CPU usage over a period of time.
  • Enhanced Monitoring provides a view at the operating system (OS) level, which can help identify the cause of a high CPU load at a granular level. For example, you can review the load average, CPU distribution (system% or nice%), and OS process list.
  • Enhanced Monitoring can also help identify micro bursts of CPU usage. CloudWatch can miss a spike or burst of activity that lasts less than 60 seconds, but Enhanced Monitoring can capture these spikes because you can configure it to collect data at one-second granularity.
  • Performance Insights can help identify the exact queries that are responsible for load and the host that initiated the queries.
  • Native PostgreSQL views and catalogs (such as pg_stat_statements, pg_stat_activity, and pg_stat_user_tables) allow you to review database-level details. For more information, see the PostgreSQL documentation for Monitoring Database Activity and pg_stat_statements.
  • PostgreSQL provides a variety of logging parameters to log long-running queries, autovacuum, lock waits, and connection and disconnection requests. For more information, see How do I enable query logging using Amazon RDS for PostgreSQL?

After you identify the cause, you can use the following methods to further lower the CPU usage:

  • If there are opportunities for tuning, use EXPLAIN and EXPLAIN ANALYZE to identify potential issues in the query plan. For more information, see the PostgreSQL Documentation for EXPLAIN.
  • If there is a query that is running repeatedly, use prepared statements to lower the pressure on your CPU.
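
For example, a statement that runs repeatedly can be prepared once per session and then executed with different parameter values, which avoids repeated parsing and planning. The following sketch uses the pgbench_accounts table that appears in the later examples; the statement name get_balance is illustrative:

```sql
-- Prepare the statement once per session; PostgreSQL parses and plans it a single time.
PREPARE get_balance (int) AS
  SELECT abalance FROM pgbench_accounts WHERE aid = $1;

-- Each execution reuses the prepared statement instead of re-parsing the query text.
EXECUTE get_balance(42);
EXECUTE get_balance(1001);

-- Remove the prepared statement when it is no longer needed.
DEALLOCATE get_balance;
```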

Resolution

Amazon CloudWatch metrics

You can use CloudWatch metrics to identify CPU patterns over extended periods. You can compare CloudWatch metrics to identify the time frames during which CPU usage is highest. After you identify the time frame, you can review the Enhanced Monitoring data associated with your DB instance. You can set Enhanced Monitoring to collect data at intervals of 1, 5, 10, 15, 30, or 60 seconds, which allows you to collect data at a more granular level than CloudWatch. For more information, see Differences Between CloudWatch and Enhanced Monitoring Metrics.

Enhanced Monitoring

By using Enhanced Monitoring, you can check the loadAverageMinute data in intervals of 1, 5, and 15 minutes. If the load average is greater than the number of vCPUs, this indicates that the instance is under a heavy load. Also, if the load average is less than the number of vCPUs for the DB instance class, the application latency might not be caused by CPU throttling. Check the load average to avoid false positives when diagnosing the cause of CPU usage.

For example, suppose that a DB instance using the db.m4.2xlarge instance class with 3,000 provisioned IOPS reaches its CPU limit. The instance class has eight vCPUs, so any load average above 8 indicates that the machine is under heavy load. In the following example metrics, even the fifteen-minute load average of 170.25 far exceeds the number of vCPUs, which confirms heavy load during the measured time frame:

Load Average Minute
  Fifteen: 170.25
  Five:    391.31
  One:     596.74

CPU Utilization
  User (%):   0.71
  System (%): 4.9
  Nice (%):   93.92
  Total (%):  99.97

Note: Your workload is given a higher priority than other tasks that are running on the DB instance. To enforce this priority, workload tasks are assigned a higher Nice value. As a result, in Enhanced Monitoring, Nice% represents the amount of CPU that your workload uses against the database.

After you enable Enhanced Monitoring, you can also check the OS process list that is associated with the DB instance. Enhanced Monitoring shows only the top 50 processes, but this can help you identify which processes have the largest impact on performance based on CPU and memory use.

Performance Insights

You can use Amazon RDS Performance Insights to identify the query responsible for the database load by checking the SQL tab that corresponds to a particular time frame.

Native PostgreSQL views and catalogs

At the database-engine level, if the issue occurs in real time, you can use pg_stat_activity or pg_stat_statements. These views can help you identify the machines, clients, and IP addresses that send the most traffic. You can also use this data to check for increases over time, increases in application servers, or application servers with stuck sessions or locking issues. For more information, see the PostgreSQL Documentation for pg_stat_activity and pg_stat_statements. To enable pg_stat_statements, modify the existing custom parameter group and set the following values:

  • shared_preload_libraries = pg_stat_statements
  • track_activity_query_size = 2048
  • pg_stat_statements.track = ALL
  • pg_stat_statements.max = 10000

Choose Apply Immediately, and reboot the DB instance. Then, run a command similar to the following on the database that you want to monitor:

Note: The following example installs the extension in the "demo" database.

demo=> select current_database();
current_database
------------------
demo
(1 row)

demo=> CREATE EXTENSION pg_stat_statements;
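
If the issue occurs in real time, you can also query pg_stat_activity directly to see which hosts and sessions generate the most load. The following queries use only standard pg_stat_activity columns; they are a starting point rather than a complete diagnostic:

```sql
-- Count sessions per client host and state to find the busiest application servers.
SELECT client_addr, state, count(*) AS sessions
FROM pg_stat_activity
WHERE pid <> pg_backend_pid()   -- exclude this monitoring session
GROUP BY client_addr, state
ORDER BY sessions DESC;

-- List the longest-running active queries, which are often the main CPU consumers.
SELECT pid, now() - query_start AS runtime, client_addr, query
FROM pg_stat_activity
WHERE state = 'active'
ORDER BY runtime DESC
LIMIT 5;
```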

After pg_stat_statements is set up, you can monitor the output by using one of the following methods:

  • List queries by total_time
  • List queries with total number of calls, total rows, and rows returned
  • List queries on a per execution basis

For information about the columns in the following example queries, see the PostgreSQL Documentation for pg_stat_statements.

List queries by total_time and see which query spends most time in the database:

SELECT round(total_time*1000)/1000 AS total_time,query
FROM pg_stat_statements
ORDER BY total_time DESC limit 5;

Example output:

benchmark=> SELECT round(total_time*1000)/1000 AS total_time,query 
FROM pg_stat_statements
ORDER BY total_time DESC limit 5;
-[ RECORD 1 ]-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
total_time | 50795183.089
query      | SELECT MD5((bbalance*random()*bid+?)::char) from pgbench_branches where (MD5((bid*random()*bbalance+?)::char))>substr(?,((random()*(?-?)+?)::integer),?) order by (bbalance*random()*bid+?) desc;
-[ RECORD 2 ]-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
total_time | 37131467.568
query      | UPDATE pgbench_accounts SET abalance = abalance + ? WHERE aid = ?;
-[ RECORD 3 ]-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
total_time | 27716213.981
query      | SELECT abalance FROM pgbench_accounts WHERE aid = ?;
-[ RECORD 4 ]-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
total_time | 25529446.83
query      | UPDATE pgbench_branches SET bbalance = bbalance + ? WHERE bid = ?;
-[ RECORD 5 ]-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
total_time | 5094847.077
query      | UPDATE pgbench_tellers SET tbalance = tbalance + ? WHERE tid = ?;

List queries with the total number of calls, total rows, and rows returned:

SELECT query, calls, total_time, rows, 100.0 * shared_blks_hit /
nullif(shared_blks_hit + shared_blks_read, 0) AS hit_percent
FROM pg_stat_statements ORDER BY total_time DESC LIMIT 5;

Example output:

benchmark=> SELECT query, calls, total_time, rows, 100.0 * shared_blks_hit /
nullif(shared_blks_hit + shared_blks_read, 0) AS hit_percent
FROM pg_stat_statements ORDER BY total_time DESC LIMIT 5;
-[ RECORD 1 ]--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
query       | SELECT MD5((bbalance*random()*bid+?)::char) from pgbench_branches where (MD5((bid*random()*bbalance+?)::char))>substr(?,((random()*(?-?)+?)::integer),?) order by (bbalance*random()*bid+?) desc;
calls       | 53855
total_time  | 50795183.089
rows        | 184638251
hit_percent | 99.9981448776361289
-[ RECORD 2 ]--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
query       | UPDATE pgbench_accounts SET abalance = abalance + ? WHERE aid = ?;
calls       | 100
total_time  | 37131467.568
rows        | 100
hit_percent | 99.0085070587482269
-[ RECORD 3 ]--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
query       | SELECT abalance FROM pgbench_accounts WHERE aid = ?;
calls       | 100
total_time  | 27716213.981
rows        | 100
hit_percent | 71.5744839710469907
-[ RECORD 4 ]--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
query       | UPDATE pgbench_branches SET bbalance = bbalance + ? WHERE bid = ?;
calls       | 4164050
total_time  | 25529446.8299978
rows        | 4164050
hit_percent | 99.9986231277663809
-[ RECORD 5 ]--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
query       | UPDATE pgbench_tellers SET tbalance = tbalance + ? WHERE tid = ?;
calls       | 4164070
total_time  | 5094847.07700154
rows        | 4164070
hit_percent | 99.9976317644971741

List queries on a per-execution basis to sample queries over time:

SELECT query, calls, total_time/calls as avg_time_ms, rows/calls as avg_rows,
temp_blks_read/calls as avg_tmp_read, temp_blks_written/calls as avg_temp_written
FROM pg_stat_statements
WHERE calls != 0
ORDER BY total_time DESC LIMIT 5;

Example output:

benchmark=> SELECT query, calls, total_time/calls as avg_time_ms, rows/calls as avg_rows,
temp_blks_read/calls as avg_tmp_read, temp_blks_written/calls as avg_temp_written
FROM pg_stat_statements
WHERE calls != 0
ORDER BY total_time DESC LIMIT 5;
-[ RECORD 1 ]----+--------------------------------------------------------------
query            | SELECT MD5((bbalance*random()*bid+?)::char) from pgbench_branches where (MD5((bid*random()*bbalance+?)::char))>substr(?,((random()*(?-?)+?)::integer),?) order by (bbalance*random()*bid+?) desc;
calls            | 53855
avg_time_ms      | 943.184162826107
avg_rows         | 3428
avg_tmp_read     | 0
avg_temp_written | 0
-[ RECORD 2 ]----+--------------------------------------------------------------
query            | UPDATE pgbench_accounts SET abalance = abalance + ? WHERE aid = ?;
calls            | 100
avg_time_ms      | 371314.67568
avg_rows         | 1
avg_tmp_read     | 0
avg_temp_written | 0
-[ RECORD 3 ]----+--------------------------------------------------------------
query            | SELECT abalance FROM pgbench_accounts WHERE aid = ?;
calls            | 100
avg_time_ms      | 277162.13981
avg_rows         | 1
avg_tmp_read     | 0
avg_temp_written | 0
-[ RECORD 4 ]----+--------------------------------------------------------------
query            | UPDATE pgbench_branches SET bbalance = bbalance + ? WHERE bid = ?;
calls            | 4164050
avg_time_ms      | 6.13091745536144
avg_rows         | 1
avg_tmp_read     | 0
avg_temp_written | 0
-[ RECORD 5 ]----+--------------------------------------------------------------
query            | UPDATE pgbench_tellers SET tbalance = tbalance + ? WHERE tid = ?;
calls            | 4164070
avg_time_ms      | 1.22352579975878
avg_rows         | 1
avg_tmp_read     | 0
avg_temp_written | 0

After reviewing the sample outputs from the preceding queries, the following query is identified as the source of the high CPU utilization:

SELECT MD5((bbalance*random()*bid+?)::char) from pgbench_branches where (MD5((bid*random()*bbalance+?)::char))>substr(?,((random()*(?-?)+?)::integer),?) order by (bbalance*random()*bid+?) desc;

In the output, this query had the highest total execution time. This query was executed 53,855 times, and each execution took an average of 943 milliseconds.
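
After the query is identified, you can run EXPLAIN ANALYZE on it to see where the time is spent. Note that the ? placeholders in pg_stat_statements output are normalized constants, so you must substitute representative literal values before running the statement; the values below are illustrative only:

```sql
-- EXPLAIN (ANALYZE, BUFFERS) executes the statement and reports actual timings and buffer usage.
EXPLAIN (ANALYZE, BUFFERS)
SELECT MD5((bbalance*random()*bid+1)::char)
FROM pgbench_branches
WHERE (MD5((bid*random()*bbalance+1)::char)) > substr('sample', ((random()*(4-1)+1)::integer), 2)
ORDER BY (bbalance*random()*bid+1) DESC;
```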

PostgreSQL logging parameters

Enable query logging using Amazon RDS for PostgreSQL. Then, check the PostgreSQL error logs to confirm that your log_min_duration_statement and log_statement parameters are set to appropriate values. For more information, see the PostgreSQL Documentation for Error Reporting and Logging.
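
As a quick check, you can confirm the values that are currently in effect from within the database. The following query reads the standard pg_settings view:

```sql
-- Show the logging parameters that control statement logging.
SELECT name, setting
FROM pg_settings
WHERE name IN ('log_min_duration_statement', 'log_statement');
```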

