How do I troubleshoot cluster or query performance issues in Amazon Redshift?
Last updated: 2021-03-11
The query performance in my Amazon Redshift cluster has degraded. How do I troubleshoot and improve cluster or query performance?
If you're experiencing performance issues in your Amazon Redshift cluster, consider the following approaches:
- Monitor your cluster performance metrics.
- Check the Amazon Redshift Advisor recommendations.
- Review the query execution alerts and excessive disk usage.
- Check for locking issues and long running sessions or transactions.
- Check your workload management (WLM) configuration.
- Check your cluster node hardware maintenance and performance.
Monitor your cluster performance metrics
If you observe performance issues with your Amazon Redshift cluster, review your cluster performance metrics and graphs, which can help you narrow down the possible root cause of the degradation. You can view performance data in the Amazon Redshift console to compare cluster performance over time.
An increase in metrics such as CPUUtilization, DatabaseConnections, and PercentageDiskSpaceUsed can indicate a higher workload and resource contention on your Amazon Redshift cluster. For more information about monitoring performance metrics, see Monitoring Amazon Redshift using Amazon CloudWatch metrics.
Check the workload execution breakdown in the Amazon Redshift console to review particular queries and their execution times. For example, an increase in query planning time might indicate that a query is waiting for a lock.
Check the Amazon Redshift Advisor recommendations
The Amazon Redshift Advisor offers recommendations on how to improve and optimize your Amazon Redshift cluster performance, and is available in the Amazon Redshift console at no additional charge. Use the Advisor recommendations to learn about potential improvement areas for your cluster. The recommendations are based on common usage patterns and Amazon Redshift best practices.
Review the query execution alerts and excessive disk usage
During query execution, Amazon Redshift notes the query performance and indicates whether the query is executing efficiently. If the query is identified as inefficient, Amazon Redshift notes the query ID and provides recommendations for query performance improvement. These recommendations are logged in STL_ALERT_EVENT_LOG, an internal system table.
If you observe a slow or inefficient query, check the STL_ALERT_EVENT_LOG entries. To retrieve information from the STL_ALERT_EVENT_LOG table, use the following query:
SELECT TRIM(s.perm_table_name) AS TABLE,
       (SUM(ABS(DATEDIFF(SECONDS,
           COALESCE(b.starttime, d.starttime, s.starttime),
           CASE WHEN COALESCE(b.endtime, d.endtime, s.endtime) > COALESCE(b.starttime, d.starttime, s.starttime)
                THEN COALESCE(b.endtime, d.endtime, s.endtime)
                ELSE COALESCE(b.starttime, d.starttime, s.starttime)
           END))) / 60)::NUMERIC(24, 0) AS minutes,
       SUM(COALESCE(b.rows, d.rows, s.rows)) AS rows,
       TRIM(SPLIT_PART(l.event, ':', 1)) AS event,
       SUBSTRING(TRIM(l.solution), 1, 60) AS solution,
       MAX(l.query) AS sample_query,
       COUNT(DISTINCT l.query)
FROM stl_alert_event_log AS l
LEFT JOIN stl_scan AS s
       ON s.query = l.query AND s.slice = l.slice AND s.segment = l.segment
LEFT JOIN stl_dist AS d
       ON d.query = l.query AND d.slice = l.slice AND d.segment = l.segment
LEFT JOIN stl_bcast AS b
       ON b.query = l.query AND b.slice = l.slice AND b.segment = l.segment
WHERE l.userid > 1
  AND l.event_time >= DATEADD(DAY, -7, CURRENT_DATE)
GROUP BY 1, 4, 5
ORDER BY 2 DESC, 6 DESC;
This query lists the most common alert events raised for queries on the cluster over the last seven days, along with the number of occurrences of each issue and a sample query ID.
Here's an example output of the query and the information describing why your alert was triggered:
table | minutes | rows | event                              | solution                                                | sample_query | count
------+---------+------+------------------------------------+---------------------------------------------------------+--------------+-------
NULL  | NULL    | NULL | Nested Loop Join in the query plan | Review the join predicates to avoid Cartesian products  | 1080906      | 2
Review query performance by checking the diagnostic queries for query tuning, and make sure that your query operations are designed to run efficiently. For example, not all join types are equally efficient: a nested loop join is the least effective join type and should be avoided wherever possible, because it significantly increases query execution time.
Identify the queries performing the nested loops to help you diagnose the issue. For more information about how to diagnose common disk usage issues, see How can I troubleshoot high or full disk usage with Amazon Redshift?
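As a starting point, you can filter STL_ALERT_EVENT_LOG for nested loop alerts directly. The following is a minimal sketch; it assumes the alert text begins with "Nested Loop Join", as in the sample output above:

-- List recent nested loop alerts with the offending query IDs
-- (assumes the alert text starts with 'Nested Loop Join', per the sample output)
SELECT query,
       TRIM(event)    AS event,
       TRIM(solution) AS solution,
       event_time
FROM stl_alert_event_log
WHERE event LIKE 'Nested Loop Join%'
  AND event_time >= DATEADD(DAY, -7, CURRENT_DATE)
ORDER BY event_time DESC;

You can then retrieve the full SQL text of a flagged query from the STL_QUERYTEXT table by its query ID and review its join predicates.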
Check for locking issues and long running sessions or transactions
Before a query runs on the cluster, table-level locks might need to be obtained on the tables involved in query execution. As a result, queries can sometimes appear to "hang," or you might observe a spike in query execution time. If you observe a spike in your query execution time, a locking issue could be the cause. For more information about delayed query execution, see Why is my query planning time so high in Amazon Redshift?
If your table is currently locked by another process or query, your query can't proceed. As a result, you won't see your query appear in the STV_INFLIGHT table. Instead, your running query will appear in the STV_RECENTS table.
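To see what's currently running or waiting, you can query STV_RECENTS. The following is a minimal sketch; a query that shows as "Running" here but doesn't appear in STV_INFLIGHT might be blocked by a table-level lock:

-- Queries currently registered in STV_RECENTS; duration is in microseconds
SELECT pid,
       user_name,
       starttime,
       duration,
       TRIM(query) AS query_text
FROM stv_recents
WHERE status = 'Running'
ORDER BY starttime;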
Sometimes, a hung query is caused by a long-running transaction. To prevent any long-running transactions from impacting your query performance, consider the following tips:
- Identify the long-running sessions and terminate them immediately. You can use the STV_SESSIONS and SVV_TRANSACTIONS tables to check for long-running sessions and transactions (a query sketch follows at the end of this section).
- Design your queries so that Amazon Redshift can process them quickly and efficiently.
Note: Long-running transactions also affect VACUUM's ability to reclaim disk space, leading to a higher number of ghost rows or uncommitted rows. Ghost rows that are scanned by queries can affect query performance.
For more information about identifying long-running sessions that can cause table locks, see How do I detect and release locks in Amazon Redshift?
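Here's a rough sketch for finding the transactions that have been open the longest. Because SVV_TRANSACTIONS returns one row per lock held, the results are grouped by transaction:

-- Surface the longest-open transactions (one row per transaction)
SELECT t.xid,
       t.pid,
       t.txn_owner,
       t.txn_start,
       DATEDIFF(minute, t.txn_start, GETDATE()) AS open_minutes
FROM svv_transactions t
GROUP BY t.xid, t.pid, t.txn_owner, t.txn_start
ORDER BY open_minutes DESC;

After you confirm that a session is safe to end, you can terminate it with SELECT PG_TERMINATE_BACKEND(pid);.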
Check your workload management (WLM) configuration
Depending on your WLM configuration, a query might immediately begin to run or spend some time queuing. The goal should always be to minimize the amount of time a query is queued before execution. If you define your own queues, check how memory and concurrency slots are allocated across them.
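As a minimal sketch, you can inspect the current queue configuration through the STV_WLM_SERVICE_CLASS_CONFIG system table. Service classes greater than 4 correspond to user-accessible queues; the column names here follow the system table documentation:

-- WLM queue settings: num_query_tasks is the concurrency (slot count),
-- query_working_mem is the working memory allocated per slot
SELECT service_class,
       name,
       num_query_tasks,
       query_working_mem,
       max_execution_time
FROM stv_wlm_service_class_config
WHERE service_class > 4
ORDER BY service_class;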
To check a cluster's WLM queues over a few days, use the following query:
SELECT *,
       pct_compile_time + pct_wlm_queue_time + pct_exec_only_time + pct_commit_queue_time + pct_commit_time AS total_pcnt
FROM (
    SELECT IQ.*,
           ((IQ.total_compile_time*1.0 / IQ.wlm_start_commit_time)*100)::DECIMAL(5,2) AS pct_compile_time,
           ((IQ.wlm_queue_time*1.0 / IQ.wlm_start_commit_time)*100)::DECIMAL(5,2) AS pct_wlm_queue_time,
           ((IQ.exec_only_time*1.0 / IQ.wlm_start_commit_time)*100)::DECIMAL(5,2) AS pct_exec_only_time,
           ((IQ.commit_queue_time*1.0 / IQ.wlm_start_commit_time)*100)::DECIMAL(5,2) AS pct_commit_queue_time,
           ((IQ.commit_time*1.0 / IQ.wlm_start_commit_time)*100)::DECIMAL(5,2) AS pct_commit_time
    FROM (
        SELECT TRUNC(d.service_class_start_time) AS day,
               d.service_class,
               d.node,
               COUNT(DISTINCT d.xid) AS count_all_xid,
               COUNT(DISTINCT d.xid) - COUNT(DISTINCT c.xid) AS count_readonly_xid,
               COUNT(DISTINCT c.xid) AS count_commit_xid,
               SUM(compile_us) AS total_compile_time,
               SUM(DATEDIFF(us, CASE WHEN d.service_class_start_time > compile_start THEN compile_start ELSE d.service_class_start_time END, d.queue_end_time)) AS wlm_queue_time,
               SUM(DATEDIFF(us, d.queue_end_time, d.service_class_end_time) - compile_us) AS exec_only_time,
               NVL(SUM(DATEDIFF(us, CASE WHEN node > -1 THEN c.startwork ELSE c.startqueue END, c.startwork)), 0) AS commit_queue_time,
               NVL(SUM(DATEDIFF(us, c.startwork, c.endtime)), 0) AS commit_time,
               SUM(DATEDIFF(us, CASE WHEN d.service_class_start_time > compile_start THEN compile_start ELSE d.service_class_start_time END, d.service_class_end_time)
                   + CASE WHEN c.endtime IS NULL THEN 0
                          ELSE (DATEDIFF(us, CASE WHEN node > -1 THEN c.startwork ELSE c.startqueue END, c.endtime))
                     END) AS wlm_start_commit_time
        FROM (
            SELECT node, b.*
            FROM (SELECT -1 AS node UNION SELECT node FROM stv_slices) a,
                 stl_wlm_query b
            WHERE queue_end_time > '2005-01-01'
              AND exec_start_time > '2005-01-01'
        ) d
        LEFT JOIN stl_commit_stats c USING (xid, node)
        JOIN (
            SELECT query,
                   MIN(starttime) AS compile_start,
                   SUM(DATEDIFF(us, starttime, endtime)) AS compile_us
            FROM svl_compile
            GROUP BY 1
        ) e USING (query)
        WHERE d.xid > 0
          AND d.service_class > 4
          AND d.final_state <> 'Evicted'
        GROUP BY TRUNC(d.service_class_start_time), d.service_class, d.node
        ORDER BY TRUNC(d.service_class_start_time), d.service_class, d.node
    ) IQ
)
WHERE node < 0
ORDER BY 1, 2, 3;
This query provides the total number of transactions (xid), execution time, queued time, and commit queue details. You can check the commit queue details to see if frequent commits are impacting workload performance.
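If you suspect frequent commits, a sketch like the following summarizes commit activity by hour from STL_COMMIT_STATS (the node = -1 rows describe the overall commit, matching the convention used in the query above):

-- Commits per hour and average time spent waiting in the commit queue
SELECT DATE_TRUNC('hour', startwork) AS commit_hour,
       COUNT(DISTINCT xid) AS commits,
       AVG(DATEDIFF(ms, startqueue, startwork)) AS avg_commit_queue_ms
FROM stl_commit_stats
WHERE node = -1
GROUP BY 1
ORDER BY 1;

If many small transactions commit each hour and the commit queue time is high, consider batching writes into fewer, larger transactions.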
To check the details of queries running at a particular point in time, use the following query:
SELECT b.userid,
       b.query,
       b.service_class,
       b.slot_count,
       b.xid,
       d.pid,
       d.aborted,
       a.compile_start,
       b.service_class_start_time,
       b.queue_end_time,
       b.service_class_end_time,
       c.startqueue AS commit_startqueue,
       c.startwork AS commit_startwork,
       c.endtime AS commit_endtime,
       a.total_compile_time_s,
       DATEDIFF(s, b.service_class_start_time, b.queue_end_time) AS wlm_queue_time_s,
       DATEDIFF(s, b.queue_end_time, b.service_class_end_time) AS wlm_exec_time_s,
       DATEDIFF(s, c.startqueue, c.startwork) AS commit_queue_s,
       DATEDIFF(s, c.startwork, c.endtime) AS commit_time_s,
       undo_time_s,
       numtables_undone,
       DATEDIFF(s, a.compile_start, NVL(c.endtime, b.service_class_end_time)) AS total_query_s,
       SUBSTRING(d.querytxt, 1, 50) AS querytext
FROM (
    SELECT query,
           MIN(starttime) AS compile_start,
           MAX(endtime) AS compile_end,
           SUM(DATEDIFF(s, starttime, endtime)) AS total_compile_time_s
    FROM svl_compile
    GROUP BY query
) a
LEFT JOIN stl_wlm_query b USING (query)
LEFT JOIN (SELECT * FROM stl_commit_stats WHERE node = -1) c USING (xid)
LEFT JOIN stl_query d USING (query)
LEFT JOIN (
    SELECT xact_id_undone AS xid,
           DATEDIFF(s, MIN(undo_start_ts), MAX(undo_end_ts)) AS undo_time_s,
           COUNT(DISTINCT table_id) AS numtables_undone
    FROM stl_undone
    GROUP BY 1
) e ON b.xid = e.xid
WHERE '2011-12-20 13:45:00' BETWEEN compile_start AND service_class_end_time;
Replace "2011-12-20 13:45:00" with the specific time and date that you want to check for queued and completed queries.
Review your cluster node hardware performance
During your cluster maintenance window, maintenance tasks such as patching, internal configuration changes, and node replacement can take place. If a node was replaced during the maintenance window, the cluster might become available again shortly afterward. However, it can take some time for the data to be restored on the replaced node. This process is known as hydration. During the hydration process, your cluster performance might decline.
To identify which events (such as hydration) have impacted your cluster performance, check your Amazon Redshift cluster events. Your cluster events inform you of node replacements and any other cluster actions that are performed.
To monitor the hydration process, use the STV_UNDERREPPED_BLOCKS table. To retrieve the number of blocks that still require hydration, use the following query:
SELECT COUNT(1) FROM STV_UNDERREPPED_BLOCKS;
Note: The duration of the hydration process depends on the cluster workload. To measure the progress of your cluster's hydration, check the block count at regular intervals.
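For example, this small sketch stamps each sample with the current time; a steadily decreasing count across runs indicates that hydration is progressing:

-- Snapshot the under-replicated block count with a timestamp
SELECT GETDATE() AS sampled_at,
       COUNT(1) AS underrepped_blocks
FROM stv_underrepped_blocks;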
To check the health of a particular node, use the following query to compare its performance to other nodes:
SELECT day,
       node,
       elapsed_time_s,
       sum_rows,
       kb,
       kb_s,
       RANK() OVER (PARTITION BY day ORDER BY kb_s) AS rank
FROM (
    SELECT DATE_TRUNC('day', start_time) AS day,
           node,
           SUM(elapsed_time) / 1000000 AS elapsed_time_s,
           SUM(rows) AS sum_rows,
           SUM(bytes) / 1024 AS kb,
           (SUM(bytes) / 1024) / (SUM(elapsed_time) / 1000000) AS "kb_s"
    FROM svl_query_report r,
         stv_slices AS s
    WHERE r.slice = s.slice
      AND elapsed_time > 1000000
    GROUP BY day, node
    ORDER BY day, node
);
Here's an example query output:
day     node  elapsed_time_s  sum_rows      kb           kb_s  rank
...
4/8/20  0     3390446         686216350489  21570133592  6362  4
4/8/20  2     3842928         729467918084  23701127411  6167  3
4/8/20  3     3178239         706508591176  22022404234  6929  7
4/8/20  5     3805884         834457007483  27278553088  7167  9
4/8/20  7     6242661         433353786914  19429840046  3112  1
4/8/20  8     3376325         761021567190  23802582380  7049  8
...
This output indicates that node 7 processed 19429840046 KB of data in 6242661 seconds, a much lower throughput (3112 KB/s) than the other nodes.
Across healthy nodes, the ratio between the number of rows (the sum_rows column) and the number of processed bytes (the kb column) is about the same, so the throughput in the kb_s column should also be roughly comparable, depending on your hardware performance. If you observe that a particular node processes noticeably less data over a period of time than the others, that low throughput might indicate an underlying hardware issue. To confirm that there is an underlying hardware issue, review the node's performance graph.