How can I troubleshoot high CPU utilization for Amazon RDS or Amazon Aurora PostgreSQL?

Last updated: 2020-06-18

How can I identify and resolve the cause of high CPU utilization in Amazon Relational Database Service (Amazon RDS) or Amazon Aurora PostgreSQL?

Short description

If you see that your load has high CPU usage, you can use a combination of the following tools to identify the cause:

This article discusses how you can use each of these tools to identify the cause of high CPU usage.

Resolution

Amazon CloudWatch metrics

You can use CloudWatch metrics to identify CPU patterns over extended periods. Compare the graphs WriteIOPs, ReadIOPs, ReadThroughput, and WriteThroughput with the CPU utilization to find out the times at which the workload caused high CPU.

After you have identified the time frame, you can review the Enhanced Monitoring data associated with your DB instance. You can set Enhanced Monitoring to collect data at intervals of 1, 5, 10, 15, 30, or 60 seconds. This allows you to collect data at a more granular level than CloudWatch. For more information, see the Differences between CloudWatch and Enhanced Monitoring metrics.

Enhanced Monitoring

Enhanced Monitoring provides a view at the operating system (OS) level, which can help identify the cause of a high CPU load at a granular level. For example, you can review the load average, CPU distribution (system% or nice%), and OS process list.

By using Enhanced Monitoring, you can check the loadAverageMinute data in intervals of 1, 5, and 15 minutes. If the load average is greater than the number of vCPUs, this indicates that the instance is under a heavy load. Also, if the load average is less than the number of vCPUs for the DB instance class, CPU throttling might not be the cause of the application latency. Check the load average to avoid false positives when diagnosing the cause of CPU usage.

For example, if you have a DB instance that is using a db.m4.2xlarge instance class with 3000 provisioned IOPS that reaches the CPU limit, you can review the following example metrics to identify the root cause of the high CPU usage. In the following example, the instance class has eight vCPUs associated with it. For the same load average, exceeding 170 indicates that the machine is under heavy load during the time frame measured:

Load Average Minute
 
Fifteen 170.25
Five 391.31
One 596.74
CPU Utilization  
User (%) 0.71
System (%) 4.9
Nice (%) 93.92
Total (%) 99.97

Note: Amazon RDS gives your workload a higher priority over other tasks that are running on the DB instance. To prioritize these tasks, workload tasks have a higher Nice value. As a result, in Enhanced Monitoring, Nice% represents the amount of CPU being used by your workload against the database.

After enabling Enhanced Monitoring, you can also check the OS process list that is associated with the DB instance. Enhanced monitoring shows only the top 50 processes for both CPU and memory each. This can help you identify which processes have the largest impact on performance based on CPU and memory use.

You can club Enhanced Monitoring results with pg_stat_activity results to help identify the resource usage of queries.

Performance Insights

You can use Amazon RDS Performance Insights to identify the query responsible for the database load after checking the SQL tab that corresponds to a particular time frame.

Native PostgreSQL view and catalogs

At database-engine level, if the issue occurs in real time, you can use pg_stat_activity or pg_stat_statements. These can help you group the machines, clients, and IP addresses that send the most traffic. You can also use this data to see if there are increases over time, increases in application servers, or if an application server has stuck sessions or locking issues. For more information, see the PostgreSQL Documentation for pg_stat_activity and pg_stat_statements. To enable pg_stat_statements, modify the existing custom parameter group and set the following values:

  • shared_preload_libraries = pg_stat_statements
  • track_activity_query_size = 2048
  • pg_stat_statements.track = ALL
  • pg_stat_statements.max = 10000

Choose Apply Immediately, and reboot the DB instance. Then, run a command similar to the following on the database that you want to monitor:

Note: The following example installs the extension in the "demo" database.

demo=> select current_database();
current_database
------------------
demo
(1 row)
     
demo=> CREATE EXTENSION pg_stat_statements;

After pg_stat_statements is set up, you can monitor the output by using one of the following methods:

  • List queries by total_time and see which query spends most time in the database:
SELECT round(total_time*1000)/1000 AS total_time,query
FROM pg_stat_statements
ORDER BY total_time DESC limit 2;
  • List queries with the total number of calls, total rows, and rows returned:
SELECT query, calls, total_time, rows, 100.0 * shared_blks_hit /
nullif(shared_blks_hit + shared_blks_read, 0) AS hit_percent
FROM pg_stat_statements ORDER BY total_time DESC LIMIT 5;
  • List queries on a per-execution basis to sample queries over time:
SELECT query, calls, total_time/calls as avg_time_ms, rows/calls as avg_rows,
temp_blks_read/calls as avg_tmp_read, temp_blks_written/calls as avg_temp_written
FROM pg_stat_statements
WHERE calls != 0
ORDER BY total_time DESC LIMIT 5;

PostgreSQL logging parameters

Enable query logging using Amazon RDS for PostgreSQL. Then, check the PostgreSQL error logs to confirm that your log_min_duration_statement and log_statement parameters are set to appropriate values. For more information, see the PostgreSQL Documentation for Error Reporting and Logging.

Lower the CPU usage

After you identify the queries causing the high CPU, you can use the following methods to further lower the CPU usage:

  1. If there are opportunities for tuning, use EXPLAIN and EXPLAIN ANALYZE to identify caveats. For more information, see the PostgreSQL Documentation for EXPLAIN.
  2. If there is a query that is running repeatedly, use prepared statements to lower the pressure on your CPU.

Did this article help you?

Anything we could improve?


Need more help?