How can I troubleshoot high CPU utilization for Amazon RDS or Amazon Aurora PostgreSQL?

Last updated: 2019-04-16

How can I identify and resolve the cause of high CPU utilization in Amazon Relational Database Service (Amazon RDS) or Amazon Aurora PostgreSQL?

Short Description

If your DB instance shows high CPU usage, you can use a combination of the following tools to identify the cause of the CPU load:

  • Amazon CloudWatch metrics can help identify patterns of CPU usage over a period of time.
  • Enhanced Monitoring provides a view at the operating system (OS) level, which can help identify the cause of a high CPU load at a granular level. For example, you can review the load average, CPU distribution (system% or nice%), and OS process list.
  • Enhanced Monitoring can also help identify micro bursts of CPU usage. CloudWatch can miss a spike or burst of activity that lasts less than 60 seconds, but Enhanced Monitoring can capture these spikes because you can configure it to collect data at one-second granularity.
  • Performance Insights can help identify the exact queries that are responsible for load and the host that initiated the queries.
  • Native PostgreSQL views and catalogs (such as pg_stat_statements, pg_stat_activity, and pg_stat_user_tables) allow you to review database-level details. For more information, see the PostgreSQL documentation for Monitoring Database Activity and pg_stat_statements.
  • PostgreSQL provides a variety of logging parameters to log long-running queries, autovacuum, lock waits, and connection and disconnection requests. For more information, see How do I enable query logging using Amazon RDS for PostgreSQL?

After you identify the cause, you can use the following methods to further lower the CPU usage:

  • If there are opportunities for tuning, use EXPLAIN and EXPLAIN ANALYZE to identify potential issues in the query plan. For more information, see the PostgreSQL Documentation for EXPLAIN.
  • If there is a query that is running repeatedly, use prepared statements to lower the pressure on your CPU.
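
For example, a statement that runs repeatedly can be prepared once per session and then executed with different parameter values, which avoids repeated parsing and planning. The following sketch uses the pgbench_accounts table that appears in the later examples; the statement name get_balance is illustrative:

```sql
-- Prepare the statement once per session; PostgreSQL parses and plans it a single time.
PREPARE get_balance (int) AS
  SELECT abalance FROM pgbench_accounts WHERE aid = $1;

-- Each execution reuses the prepared statement instead of re-parsing the query text.
EXECUTE get_balance(42);
EXECUTE get_balance(1001);

-- Remove the prepared statement when it is no longer needed.
DEALLOCATE get_balance;
```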

Resolution

Amazon CloudWatch metrics

You can use CloudWatch metrics to identify CPU patterns over extended periods. You can compare CloudWatch metrics to identify the time frames during which CPU usage is highest. After you identify the time frame, you can review the Enhanced Monitoring data associated with your DB instance. You can set Enhanced Monitoring to collect data at intervals of 1, 5, 10, 15, 30, or 60 seconds, which allows you to collect data at a more granular level than CloudWatch. For more information, see Differences Between CloudWatch and Enhanced Monitoring Metrics.

Enhanced Monitoring

By using Enhanced Monitoring, you can check the loadAverageMinute data in intervals of 1, 5, and 15 minutes. If the load average is greater than the number of vCPUs, this indicates that the instance is under a heavy load. Also, if the load average is less than the number of vCPUs for the DB instance class, the application latency might not be caused by CPU throttling. Check the load average to avoid false positives when diagnosing the cause of CPU usage.

For example, suppose that a DB instance using the db.m4.2xlarge instance class with 3,000 provisioned IOPS reaches its CPU limit. The instance class has eight vCPUs, so any load average above 8 indicates that the machine is under heavy load. In the following example metrics, even the fifteen-minute load average of 170.25 far exceeds the number of vCPUs, which confirms heavy load during the measured time frame:

Load Average Minute
  Fifteen: 170.25
  Five:    391.31
  One:     596.74

CPU Utilization
  User (%):   0.71
  System (%): 4.9
  Nice (%):   93.92
  Total (%):  99.97

Note: Your workload is given a higher priority than other tasks that are running on the DB instance. To enforce this priority, workload tasks are assigned a higher Nice value. As a result, in Enhanced Monitoring, Nice% represents the amount of CPU that your workload uses against the database.

After you enable Enhanced Monitoring, you can also check the OS process list that is associated with the DB instance. Enhanced Monitoring shows only the top 50 processes, but this can help you identify which processes have the largest impact on performance based on CPU and memory use.

Performance Insights

You can use Amazon RDS Performance Insights to identify the query responsible for the database load by checking the SQL tab that corresponds to a particular time frame.

Native PostgreSQL views and catalogs

At the database-engine level, if the issue occurs in real time, you can use pg_stat_activity or pg_stat_statements. These views can help you identify the machines, clients, and IP addresses that send the most traffic. You can also use this data to check for increases over time, increases in application servers, or application servers with stuck sessions or locking issues. For more information, see the PostgreSQL Documentation for pg_stat_activity and pg_stat_statements. To enable pg_stat_statements, modify the existing custom parameter group and set the following values:

  • shared_preload_libraries = pg_stat_statements
  • track_activity_query_size = 2048
  • pg_stat_statements.track = ALL
  • pg_stat_statements.max = 10000

Choose Apply Immediately, and reboot the DB instance. Then, run a command similar to the following on the database that you want to monitor:

Note: The following example installs the extension in the "demo" database.

demo=> select current_database();
current_database
------------------
demo
(1 row)

demo=> CREATE EXTENSION pg_stat_statements;
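
If the issue occurs in real time, you can also query pg_stat_activity directly to see which hosts and sessions generate the most load. The following queries use only standard pg_stat_activity columns; they are a starting point rather than a complete diagnostic:

```sql
-- Count sessions per client host and state to find the busiest application servers.
SELECT client_addr, state, count(*) AS sessions
FROM pg_stat_activity
WHERE pid <> pg_backend_pid()   -- exclude this monitoring session
GROUP BY client_addr, state
ORDER BY sessions DESC;

-- List the longest-running active queries, which are often the main CPU consumers.
SELECT pid, now() - query_start AS runtime, client_addr, query
FROM pg_stat_activity
WHERE state = 'active'
ORDER BY runtime DESC
LIMIT 5;
```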

After pg_stat_statements is set up, you can monitor the output by using one of the following methods:

  • List queries by total_time
  • List queries with total number of calls, total rows, and rows returned
  • List queries on a per execution basis

For information about the columns in the following example queries, see the PostgreSQL Documentation for pg_stat_statements.

List queries by total_time and see which query spends most time in the database:

SELECT round(total_time*1000)/1000 AS total_time,query
FROM pg_stat_statements
ORDER BY total_time DESC limit 5;

Example output:

benchmark=> SELECT round(total_time*1000)/1000 AS total_time,query 
FROM pg_stat_statements
ORDER BY total_time DESC limit 5;
-[ RECORD 1 ]-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
total_time | 50795183.089
query      | SELECT MD5((bbalance*random()*bid+?)::char) from pgbench_branches where (MD5((bid*random()*bbalance+?)::char))>substr(?,((random()*(?-?)+?)::integer),?) order by (bbalance*random()*bid+?) desc;
-[ RECORD 2 ]-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
total_time | 37131467.568
query      | UPDATE pgbench_accounts SET abalance = abalance + ? WHERE aid = ?;
-[ RECORD 3 ]-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
total_time | 27716213.981
query      | SELECT abalance FROM pgbench_accounts WHERE aid = ?;
-[ RECORD 4 ]-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
total_time | 25529446.83
query      | UPDATE pgbench_branches SET bbalance = bbalance + ? WHERE bid = ?;
-[ RECORD 5 ]-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
total_time | 5094847.077
query      | UPDATE pgbench_tellers SET tbalance = tbalance + ? WHERE tid = ?;

List queries with the total number of calls, total rows, and rows returned:

SELECT query, calls, total_time, rows, 100.0 * shared_blks_hit /
nullif(shared_blks_hit + shared_blks_read, 0) AS hit_percent
FROM pg_stat_statements ORDER BY total_time DESC LIMIT 5;

Example output:

benchmark=> SELECT query, calls, total_time, rows, 100.0 * shared_blks_hit /
nullif(shared_blks_hit + shared_blks_read, 0) AS hit_percent
FROM pg_stat_statements ORDER BY total_time DESC LIMIT 5;
-[ RECORD 1 ]--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
query       | SELECT MD5((bbalance*random()*bid+?)::char) from pgbench_branches where (MD5((bid*random()*bbalance+?)::char))>substr(?,((random()*(?-?)+?)::integer),?) order by (bbalance*random()*bid+?) desc;
calls       | 53855
total_time  | 50795183.089
rows        | 184638251
hit_percent | 99.9981448776361289
-[ RECORD 2 ]--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
query       | UPDATE pgbench_accounts SET abalance = abalance + ? WHERE aid = ?;
calls       | 100
total_time  | 37131467.568
rows        | 100
hit_percent | 99.0085070587482269
-[ RECORD 3 ]--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
query       | SELECT abalance FROM pgbench_accounts WHERE aid = ?;
calls       | 100
total_time  | 27716213.981
rows        | 100
hit_percent | 71.5744839710469907
-[ RECORD 4 ]--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
query       | UPDATE pgbench_branches SET bbalance = bbalance + ? WHERE bid = ?;
calls       | 4164050
total_time  | 25529446.8299978
rows        | 4164050
hit_percent | 99.9986231277663809
-[ RECORD 5 ]--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
query       | UPDATE pgbench_tellers SET tbalance = tbalance + ? WHERE tid = ?;
calls       | 4164070
total_time  | 5094847.07700154
rows        | 4164070
hit_percent | 99.9976317644971741

List queries on a per-execution basis to sample queries over time:

SELECT query, calls, total_time/calls as avg_time_ms, rows/calls as avg_rows,
temp_blks_read/calls as avg_tmp_read, temp_blks_written/calls as avg_temp_written
FROM pg_stat_statements
WHERE calls != 0
ORDER BY total_time DESC LIMIT 5;

Example output:

benchmark=> SELECT query, calls, total_time/calls as avg_time_ms, rows/calls as avg_rows,
temp_blks_read/calls as avg_tmp_read, temp_blks_written/calls as avg_temp_written
FROM pg_stat_statements
WHERE calls != 0
ORDER BY total_time DESC LIMIT 5;
-[ RECORD 1 ]----+--------------------------------------------------------------
query            | SELECT MD5((bbalance*random()*bid+?)::char) from pgbench_branches where (MD5((bid*random()*bbalance+?)::char))>substr(?,((random()*(?-?)+?)::integer),?) order by (bbalance*random()*bid+?) desc;
calls            | 53855
avg_time_ms      | 943.184162826107
avg_rows         | 3428
avg_tmp_read     | 0
avg_temp_written | 0
-[ RECORD 2 ]----+--------------------------------------------------------------
query            | UPDATE pgbench_accounts SET abalance = abalance + ? WHERE aid = ?;
calls            | 100
avg_time_ms      | 371314.67568
avg_rows         | 1
avg_tmp_read     | 0
avg_temp_written | 0
-[ RECORD 3 ]----+--------------------------------------------------------------
query            | SELECT abalance FROM pgbench_accounts WHERE aid = ?;
calls            | 100
avg_time_ms      | 277162.13981
avg_rows         | 1
avg_tmp_read     | 0
avg_temp_written | 0
-[ RECORD 4 ]----+--------------------------------------------------------------
query            | UPDATE pgbench_branches SET bbalance = bbalance + ? WHERE bid = ?;
calls            | 4164050
avg_time_ms      | 6.13091745536144
avg_rows         | 1
avg_tmp_read     | 0
avg_temp_written | 0
-[ RECORD 5 ]----+--------------------------------------------------------------
query            | UPDATE pgbench_tellers SET tbalance = tbalance + ? WHERE tid = ?;
calls            | 4164070
avg_time_ms      | 1.22352579975878
avg_rows         | 1
avg_tmp_read     | 0
avg_temp_written | 0

After reviewing the sample outputs from the preceding queries, the following query is identified as the source of the high CPU utilization:

SELECT MD5((bbalance*random()*bid+?)::char) from pgbench_branches where (MD5((bid*random()*bbalance+?)::char))>substr(?,((random()*(?-?)+?)::integer),?) order by (bbalance*random()*bid+?) desc;

In the output, this query had the highest total execution time. This query was executed 53,855 times, and each execution took an average of 943 milliseconds.
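
After the query is identified, you can run EXPLAIN ANALYZE on it to see where the time is spent. Note that the ? placeholders in pg_stat_statements output are normalized constants, so you must substitute representative literal values before running the statement; the values below are illustrative only:

```sql
-- EXPLAIN (ANALYZE, BUFFERS) executes the statement and reports actual timings and buffer usage.
EXPLAIN (ANALYZE, BUFFERS)
SELECT MD5((bbalance*random()*bid+1)::char)
FROM pgbench_branches
WHERE (MD5((bid*random()*bbalance+1)::char)) > substr('sample', ((random()*(4-1)+1)::integer), 2)
ORDER BY (bbalance*random()*bid+1) DESC;
```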

PostgreSQL logging parameters

Enable query logging using Amazon RDS for PostgreSQL. Then, check the PostgreSQL error logs to confirm that your log_min_duration_statement and log_statement parameters are set to appropriate values. For more information, see the PostgreSQL Documentation for Error Reporting and Logging.
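
As a quick check, you can confirm the values that are currently in effect from within the database. The following query reads the standard pg_settings view:

```sql
-- Show the logging parameters that control statement logging.
SELECT name, setting
FROM pg_settings
WHERE name IN ('log_min_duration_statement', 'log_statement');
```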

