How do I troubleshoot high CPU utilization on my Amazon RDS for Oracle database?

Last updated: 2021-10-29

I'm experiencing high CPU utilization on my Amazon Relational Database Service (Amazon RDS) for Oracle DB instance.

Short description

If your RDS for Oracle database has high CPU usage, use a combination of the following tools to identify the cause:

  • Amazon CloudWatch Metrics
  • Enhanced Monitoring metrics
  • Performance Insights metrics
  • Oracle Statspack
  • Automatic Workload Repository (AWR)
  • Automatic Database Diagnostic Monitor (ADDM)
  • Active Session History (ASH)
  • Oracle SQLT

Resolution

When diagnosing issues related to high CPU utilization, identify the time period in which the issue occurred.

CloudWatch metrics

Amazon RDS sends metrics to CloudWatch every minute for each active database. Review the following CloudWatch metrics for Amazon RDS to identify CPU patterns over extended periods:

  • CPUUtilization
  • CPUCreditUsage, if you're using a T2 or T3 instance
  • CPUCreditBalance, if you're using a T2 or T3 instance

Also, review the following metrics to check if there was a change in the workload and any thresholds were breached. These factors might contribute to the spike in CPU utilization.

  • DatabaseConnections
  • DiskQueueDepth
  • FreeableMemory
  • ReadIOPS
  • ReadLatency
  • WriteIOPS
  • WriteLatency

For more information, see Amazon RDS metrics and Viewing the DB instance metrics.

Enhanced Monitoring metrics

Enhanced Monitoring provides metrics in real time for the operating system that your DB instance runs on. While CloudWatch gets the CPU utilization metrics from the hypervisor, Enhanced Monitoring gets these metrics from an agent on the DB instance. Enhanced Monitoring metrics are more granular than the CloudWatch metrics. Enhanced monitoring metrics are stored for 30 days in CloudWatch Logs.

You can define the collection interval for the metrics from 1 second to 1 minute. It's a best practice to set the granularity to 1 second or 5 seconds for business-critical applications. With this granularity, the metrics provide more accurate information about the load on the application for analyzing performance issues.

To view the time period of the spike in the CPU utilization, do the following:
  1. Open the Amazon RDS console.
  2. In the navigation pane, choose Databases.
  3. Choose the database that you want to monitor.
  4. Choose the Monitoring tab.
  5. Select Enhanced monitoring from the Monitoring dropdown list.
  6. Under the Enhanced Monitoring view, if the instance is a Multi-AZ deployment, select primary to view the OS metrics of the primary instance. Select secondary to view the metrics for the standby replica.
  7. Select the date and start time.
  8. On the right corner, select the duration. You can select 5 minutes, 15 minutes, 30 minutes, or 1 hour.

The CPU Total graph indicates the time period when the CPU utilization increased.

The Load Avg 1 min, Load Avg 5 min, and Load Avg 15 min graphs show the number of processes requesting the CPU time over the last minute, last five minutes, and the last 15 minutes, respectively. If the load average is greater than the number of vCPUs, then the instance might be experiencing a CPU bottleneck.

To view the OS processes, select OS process list from the Monitoring dropdown list. Then, sort the list by CPU% values to identify the process that has the most CPU usage.

Example:

NAME VIRT RES CPU% MEM% VMLIMIT
oracleORCL [27074]ᵗ 6.07 GiB 1,007.24 MB 44.72 12.78 unlimited
oracleORCL [27076]ᵗ 6.07 GiB 1,010.02 MB 44.64 12.82 unlimited

For more information on the columns in the preceding example, see Viewing OS metrics in the RDS console.

After identifying the process with the most CPU utilization, you can run the following query to map the process ID to a session on the database:

SET LINESIZE 120;
SET PAGES 200;
COL OSUSER FOR a20;
COL USERNAME FOR a20;
COL MACHINE FOR a20;
SELECT a.sid, a.serial#, a.osuser, a.username, a.machine, a.sql_id, c.sql_text FROM v$session a, v$process b, v$sql c 
WHERE a.paddr=b.addr AND b.spid=&spid AND a.sql_id=c.sql_id(+);

By default, all the Enhanced Monitoring graphs aren't displayed on the Enhanced Monitoring dashboard. To get a view of the workload at the time of the spike in CPU utilization, turn on additional graphs by doing the following:

  1. Open the Amazon RDS console.
  2. In the navigation pane, choose Databases.
  3. Choose the database that you want to monitor.
  4. Choose the Monitoring tab.
  5. Select Enhanced monitoring from the Monitoring dropdown list.
  6. Under the Enhanced Monitoring view, choose Manage graphs.
  7. Select the graphs that you want to view.
  8. Choose Save.

Examples of graphs you might choose to view:

Memory

  • Free
  • Cached
  • Buffered
  • Total
  • Dirty
  • Active
  • Slab

Note: Metrics related to metrics are retrieved from the /proc/meminfo file.

Swap

  • Swap
  • Free

Disk I/O and Physical Device I/O

  • Read IO/s
  • Write IO/s
  • Ave Queue Size
  • Await

CPU

  • User
  • Total
  • System
  • Wait
  • Idle
  • Nice

For the list of available metrics, see Metrics for MariaDB, MySQL, Oracle, and PostgreSQL DB instances.

For more information on Enhanced Monitoring, see Monitoring the OS by using Enhanced Monitoring.

For information on the cost of Enhanced Monitoring, see Cost of Enhanced Monitoring.

Performance Insights metrics

With the Amazon RDS Performance Insights dashboard, you can visualize the database load and filter the load by waits, SQL statements, hosts, or users.

  1. Open the Amazon RDS console.
  2. In the navigation pane, choose Performance Insights.
  3. Choose the DB instance that you want to monitor.
  4. For View past, Select the duration of your choice.
  5. In the Database Load graph, check the time when you experienced a spike in the CPU usage.
  6. Choose the Top waits tab.
    Note the top wait events during the time period of the spike.
  7. Choose the Top SQL tab.
    Review and optimize the SQL statements that contributed to the spike.

For information on the cost of Performance Insights, see Performance Insights pricing.

Oracle Statspack

Statspack is a performance reporting tool that provides the performance metrics of your database over a specific time period.

To review the CPU utilization of your instance using Statspack, do the following:

  1. Generate a statspack report for the time period when you experienced an issue.
  2. Review and optimize the queries that result in a high CPU load.
  3. Review the top wait events.

Example extract from a Statspack report:

-> Total DB CPU (s):           3,345
-> Captured SQL accounts for   91.3% of Total DB CPU
-> SQL reported below exceeded  1.0% of Total DB CPU
    CPU                  CPU per            Elapsed                     Old
  Time (s)   Executions  Exec (s)  %Total   Time (s)    Buffer Gets  Hash Value
---------- ------------ ---------- ------ ---------- --------------- ----------
   3043.36      598,100       0.01   91.0    3356.81     994,096,212  219593194

Module: JDBC Thin Client
SELECT tt.ORDER_TOTAL, tt.SALES_REP_ID, tt.ORDER_DATE, customers.CUST_FIRST_NAME, customers.CUST_LAST_NAME FROM   
(SELECT orders.ORDER_TOTAL, orders.SALES_REP_ID, orders.ORDER_DATE, orders.customer_id, rank() Over (ORDER BY orders.O

For more information, see Oracle documentation for Oracle Statspack.

AWR

AWR is an Oracle performance reporting tool that provides performance metrics over a specific time period.

Note: AWR requires a Diagnostic Pack License and is only available for the Enterprise Edition of Oracle.

To identify the cause for your CPU load using AWR, do the following:

1.    Run a query similar to the following to identify the start and end snapshot ID for the time period of high CPU load:

SELECT SNAP_ID, BEGIN_INTERVAL_TIME FROM DBA_HIST_SNAPSHOT ORDER BY 1;

2.    Generate the AWR report.

3.    Download the AWR report.

4.    Review and optimize the queries listed in the SQL ordered by CPU Time section of the AWR report.

5.    Review the top wait events.

In Oracle 12c and later versions, ADDM and ASH reports are included in the AWR report.

Note: When an AWR report is generated for more than four consecutive snapshot IDs, all the ADDM and ASH reports aren't included. To generate these additional reports, use the instructions in the following sections.

ADDM

ADDM is a diagnostic tool that analyzes the AWR data, identifies performance bottlenecks, and provides recommendations.

Note: ADDM requires a Diagnostic Pack license and is only available for the Enterprise Edition of Oracle.

1.    Run a query similar to the following to Identify the start and end snapshot ID for the time period of high CPU load:

SELECT SNAP_ID, BEGIN_INTERVAL_TIME FROM DBA_HIST_SNAPSHOT ORDER BY 1;

2.    Generate the ADDM report.

3.    Download the ADDM report.

4.    Review the recommendations in the ADDM report.

ASH

ASH is a diagnostic tool that collects active session information. To troubleshoot transient performance problems using ASH, do the following:

Note: ASH requires a Diagnostic Pack license and is only available for the Enterprise Edition of Oracle.

1.    Generate an ASH report for the time period when there was a high CPU load.

2.    Download the ASH report.

3.    Review the TOP SQL with TOP Events section.

For information on interpreting the AWR, ADDM, and ASH reports, see Oracle Support documentation for Oracle Support Doc ID FAQ: Automatic Workload Repository (AWR) Reports (Doc ID 1599440.1).

Oracle SQLT

Amazon RDS supports Oracle SQLTXPLAIN (SQLT) through the use of the SQLT option. SQLT is a tool that's used to diagnose SQL statements that don't perform well.

To produce a report for a specific SQL statement, see Oracle SQLT.

If you receive the following error when using SQLT:

Error: ORA-20106: SQLT parameter connect_identifier must be set when running SQLT from a remote client.

Run one of the following commands before running the extract:

EXEC sqltxadmin.sqlt$a.set_sess_param(‘connect_identifier’, ‘@SID’);
EXEC sqltxadmin.sqlt$a.set_param(‘connect_identifier’, ‘@example-hostname:example-port/example-sid’);