How do I identify and troubleshoot performance issues and slow-running queries in my RDS for PostgreSQL or Aurora PostgreSQL instance?

Last updated: 2022-07-14

My Amazon Relational Database Service (Amazon RDS) for PostgreSQL instance is slow. I want to identify and troubleshoot the slow-running queries.

Resolution

The performance of your Amazon RDS for PostgreSQL instance might be affected for multiple reasons, such as undersized hardware, change in workload, increased traffic, memory issues, or suboptimal query plans.

Identify the cause

Use a combination of these tools to identify the cause of slow-running queries:

  • Amazon CloudWatch metrics
  • Enhanced Monitoring metrics
  • Performance Insights metrics

CloudWatch metrics

To identify performance bottlenecks caused by insufficient resources, monitor these common CloudWatch metrics available for your Amazon RDS DB instance.

  • CPUUtilization - Percentage of computer processing capacity used
  • FreeableMemory - Available RAM on the DB instance (in megabytes)
  • SwapUsage - Swap space used by the DB instance (in megabytes)

A higher percentage of CPU utilization generally indicates an active workload on the instance and the need for more CPU resources. Higher memory utilization along with swap space consumption indicates frequent swapping due to low memory availability for the workload. This might mean that your instance isn't able to keep up with the workload. A high utilization of CPU and memory resources are typically caused by long-running queries, suddenly increased traffic, and a large number of idle connections.

Run this command to view the active queries with runtime:

SELECT pid, usename, age(now(),xact_start) query_time, query FROM pg_stat_activity WHERE state='active';

Run this command to find the number of idle connections present in the database:

SELECT count(*) FROM pg_stat_activity WHERE state='idle';

Then, run this command to terminate idle connections that consume resources.

Note: Be sure to replace example-pid with the PID of the idle connection that you got from pg_stat_activity:

SELECT pg_terminate_backend(example-pid);

Verify if the desired network throughput is achieved on the RDS Instance using these metrics:

  • NetworkReceiveThroughput
  • NetworkTransmitThroughput

These metrics indicate the incoming and outgoing network traffic in bytes/second. Undersized or non-Amazon Elastic Block Service (Amazon EBS)-optimized instance classes might affect network throughput, resulting in slow instances. A low network throughput can result in slow responses for all the application requests irrespective of the database performance.

Evaluate the I/O performance using these metrics:

  • ReadIOPS and WriteIOPS - Average number of disk read or write operations per second
  • ReadLatency and WriteLatency - Average time taken for a read or write operation in milliseconds
  • ReadThroughput and WriteThroughput - Average number of megabytes read from or written to disk per second
  • DiskQueueDepth - Number of I/O operations that are waiting to be written to or read from disk

For information on performing a detailed analysis, see How do I troubleshoot the latency of Amazon EBS volumes caused by an IOPS bottleneck in my Amazon RDS instance?

Enhanced Monitoring metrics

With Enhanced Monitoring, you can view metrics at the operating system level and the list of top 100 processes that consume high CPU and memory. You can activate Enhanced Monitoring at per-second level to identify intermittent performance issues on your RDS Instance.

You can evaluate the available operating system metrics to diagnose performance issues that might be related to CPU, workload, I/O, memory, and network.

From the process list, identify the process with high values for CPU% or Mem%. Then, find the related connection from the database.

For example:

NAME VIRT RES CPU% MEM% VMLIMIT
postgres: postgres postgres 178.2.0.44(42303) SELECT [10322] 250.66 MB 27.7 MB 85.93 2.21 unlimited

Connect to the database, and then run this query to find the connection and query-related information:

SELECT * FROM pg_stat_activity WHERE pid = 10322;

Performance Insights metrics

Performance Insights allows you to evaluate database workloads sliced by waits, SQL, host, and users. You can also capture the database and SQL level metrics.

The Top SQL tab on the Performance Insights dashboard shows the SQL statements that contribute the most to DB load. A DB load or load by waits (AAS) that's higher than the Max vCPU value indicates throttled workload on the instance class.

The average latency per call in SQL statistics provides the average running time of a query. It’s common to see a different SQL to be the top DB load contributor than the one having the highest average running time. This is because the top SQL list is based on total running time.

Tune memory settings

PostgreSQL DB server allocates a certain memory area for its entire lifetime for caching data to improve read and write accesses. This memory area is referred to as shared buffers. The amount of memory the database uses for shared memory buffers is controlled by the shared_buffers parameters. Apart from the shared memory area, each backend process consumes memory for performing operations within a DB server based on the values set for work_mem and maintenance_work_mem parameters. For more information, see PostgreSQL documentation for Server configuration.

If you continuously observe high memory pressure on the DB Instance, consider lowering the values of these parameters in the custom parameter group that's attached to your DB Instance.

Troubleshoot slow-running queries

You typically experience slow-running queries when there are infrastructure issues or the overall resource consumption is high. Slow-running queries might also be the result of suboptimal query planning by the query planner. PostgreSQL query planner uses statistics created for the table to create query plans. These plans might be affected due to schema changes and stale statistics. Bloat in the table and indexes might also result in slow-running queries.

The autovacuum daemon is responsible for creating autovacuum worker processes that remove the dead tuples from tables whenever the dead tuple threshold is met. The autovaccum daemon is also responsible for running the ANALYZE operation that refreshes the statistics stored for a particular table.

To find information on dead tuples, number of autovacuum or vacuum operations, number of autoanalyze or analyze runs, and when these operations were last run, run this query:

SELECT schemaname, relname, n_live_tup,n_dead_tup, last_autoanalyze, last_analyze, last_autovacuum, last_vacuum,
autovacuum_count+vacuum_count vacuum_count, analyze_count+autoanalyze_count analyze_count 
FROM pg_stat_user_tables
ORDER BY 5 DESC;

You can use the pg_stat_activity view to find data related to current activities. This view provides the backend pid, query, and other details. To find long-running queries, run this query:

SELECT pid, datname, query, extract(epoch from now()) - extract(epoch from xact_start) AS duration, case
WHEN wait_event IS NULL THEN 'CPU' 
ELSE wait_event_type||':'||wait_event end wait FROM pg_stat_activity
WHERE query!=current_query() AND xact_start IS NOT NULL ORDER BY 4 DESC;

Note that queries waiting for locks might be slow. Therefore, verify if the query is waiting for locks by running this query:

SELECT pid, virtualtransaction AS vxid, locktype AS lock_type, mode AS lock_mode, granted,fastpath,
CASE
WHEN virtualxid IS NOT NULL AND transactionid IS NOT NULL
THEN virtualxid || ' ' || transactionid
WHEN virtualxid::text IS NOT NULL
THEN virtualxid
ELSE transactionid::text
END AS xid_lock, relname, page, tuple, classid, objid, objsubid
FROM pg_locks LEFT OUTER JOIN pg_class ON (pg_locks.relation = pg_class.oid)
WHERE -- do not show our view’s locks
pid != pg_backend_pid() AND
virtualtransaction IS DISTINCT FROM virtualxid
ORDER BY 1, 2, 5 DESC, 6, 3, 4, 7;

Your RDS for PostgreSQL instance allows you to create the pg_stat_statements extension from within the database:

CREATE EXTENSION IF NOT EXISTS pg_stat_statements;

Using pg_stat_statements, you can view the statistics of queries. Be sure to add the pg_stat_statements entry to shared_preload_libraries before creating the extension.

Note: You can modify parameters for this module only when a custom parameter group is attached to your DB Instance.

Use this queries to identify the top SQL queries that affect the performance of your instance.

To find queries that spend more time in the database, run this query for PostgreSQL versions 12 and earlier:

SELECT query, calls, total_time, mean_time FROM pg_stat_statements ORDER BY 3 DESC;

Run this query for PostgreSQL versions 13 and later:

SELECT query, calls, total_plan_time+total_exec_time AS total_time, mean_plan_time + mean_exec_time AS mean_time FROM pg_stat_statements ORDER BY 3 DESC;

To find queries with a lower buffer cache hit ratio, run this query for PostgreSQL versions 12 and earlier:

SELECT query, calls, total_time, rows, 100.0 * shared_blks_hit / nullif(shared_blks_hit + shared_blks_read, 0) AS hit_percent
FROM pg_stat_statements ORDER BY total_time
DESC LIMIT 10;

Run this query for PostgreSQL versions 13 and later:

SELECT query, calls, total_plan_time+total_exec_time as total_time, rows, 100.0 * shared_blks_hit / nullif(shared_blks_hit +
shared_blks_read, 0) AS hit_percent
FROM pg_stat_statements ORDER BY 5 ASC
LIMIT 10;

Also, you can capture long-running queries or query plans in database error logs by appropriately setting the log_min_duration_statement parameter for your instance and using the auto_explain module. The log_min_duration_statement parameter causes the duration of each completed statement to be logged if the statement ran for at least the specified amount of time. For example, if you set this parameter to 250 ms, then all SQL statements that run for 250 ms or longer are logged. The auto_explain module allows you to capture the explain plan of queries that are run in the database.

You can also capture the plan using explain and explain analyze commands. Identify query tuning opportunities based on the auto_explain module or explain commands for the query. For more information, see the PostgreSQL Documentation for Using EXPLAIN.

If your system is well tuned and you're still facing resource throttling, consider scaling the instance class upwards so that your DB Instance is allocated with more compute and memory resources. For more information, see Hardware specifications for DB instance classes.