How do I troubleshoot high CPU usage in Amazon Redshift?
Last updated: 2020-08-24
I'm suddenly seeing high CPU utilization on my Amazon Redshift cluster. Why is this happening, and what are some best practices to lower my CPU utilization?
Amazon Redshift is designed to utilize all available resources while performing queries. Therefore, it's expected to see spikes in CPU usage in your Amazon Redshift cluster. An increase in CPU utilization can depend on factors such as cluster workload, skewed and unsorted data, or leader node tasks.
However, if your CPU usage impacts your query time, consider the following approaches:
- Review your Amazon Redshift cluster workload.
- Maintain your data hygiene.
- Update your table design.
- Check for maintenance updates.
- Check for spikes in your leader node CPU usage.
- Use Amazon CloudWatch to monitor spikes in CPU utilization.
Review your Amazon Redshift cluster workload
The following factors can impact the CPU utilization on your Amazon Redshift cluster:
- An increased workload (because there are more queries running). The increase in workload also increases the number of database connections, causing higher query concurrency.
- The higher number of concurrent queries also impacts resource contention, lock wait time, and workload management (WLM) queue wait time.
- More database connections, which can also be a result of idle sessions present in the cluster. Idle sessions can cause additional lock contention issues.
While the queries are running, retrieve locking information. To identify long-running sessions, use the following SQL query:
select *,datediff(s,txn_start,getdate())/86400||' days '||datediff(s,txn_start,getdate())%86400/3600||' hrs '||datediff(s,txn_start,getdate())%3600/60||' mins '||datediff(s,txn_start,getdate())%60||' secs' from svv_transactions where lockable_object_type='transactionid' and pid<>pg_backend_pid() order by 3;
Then, run PG_TERMINATE_BACKEND to stop any long-running transactions. To prevent these sessions from remaining open, be sure that all transactions are closed. For example, make sure that all transactions starting with a BEGIN statement are also accompanied by an END or COMMIT statement.
Then, run the following SQL query to identify queries consuming high CPU:
select stq.userid, stq.query, trim(stq.label) as label, stq.xid, stq.pid, svq.service_class, query_cpu_usage_percent as "cpu_%",starttime, endtime, datediff(s,starttime, endtime) as duration_s, substring(stq.querytxt,1,100) as querytext from stl_query stq join svl_query_metrics svq on stq.query=svq.query where query_cpu_usage_percent is not null and starttime > sysdate - 1 order by query_cpu_usage_percent desc;
To analyze segment and slice-level execution steps for each query, run the following query:
select query, segment, step, label ,is_rrscan as rrS, is_diskbased as disk, is_delayed_scan as DelayS, min(start_time) as starttime, max(end_time) as endtime, datediff(ms, min(start_time), max(end_time)) as "elapsed_msecs", sum(rows) as row_s , sum(rows_pre_filter) as rows_pf, CASE WHEN sum(rows_pre_filter) = 0 THEN 100 ELSE sum(rows)::float/sum(rows_pre_filter)::float*100 END as pct_filter, SUM(workmem)/1024/1024 as "Memory(MB)", SUM(bytes)/1024/1024 as "MB_produced" from svl_query_report where query in (<query_ids>) group by query, segment, step, label , is_rrscan, is_diskbased , is_delayed_scan order by query, segment, step, label;
For more information about tuning these queries, see Top 10 performance tuning techniques for Amazon Redshift.
You can also use the wlm_query_trend_hourly view to review Amazon Redshift cluster workload pattern. Then, determine which of the following approaches can help you reduce queue wait time:
- Reduce query concurrency per queue to provide more memory to each query slot. This reduction helps queries that require more memory to run more efficiently.
- Enable short query acceleration (SQA) to prioritize short-running queries over long-running queries.
- Scale the Amazon Redshift cluster to accommodate the increased workload. Scaling a cluster provides more memory and computing power, which can help queries to run more quickly. For more information, see How do I resize an Amazon Redshift cluster?
Maintain your data hygiene
Data hygiene is gauged by the percentage of stale statistics and unsorted rows present in a table. A high percentage of both can cause the query optimizer to generate an execution plan where queries run inefficiently when referencing tables. Unsorted data can also cause queries to scan unnecessary data blocks, which require additional I/O operations. A poorly performing query negatively affects your cluster's CPU usage.
Use the SVV_TABLE_INFO system view to retrieve stats_off and unsorted percentage data for a table. These percentages should remain close to 0. If the percentages are high, run the Analyze & Vacuum schema utility from the AWS Labs GitHub repository to update your tables.
Update your table design
Table design is governed by the designated sort keys, distribution style, and distribution key. The distribution key and distribution style determine how data is distributed across the nodes.
An inappropriate distribution key or distribution style can induce distribution skew across the nodes. To reduce data distribution skew, choose the appropriate distribution style and sort key based on query patterns and predicates. The distribution key should support the join conditions in your queries and columns with high cardinality. A proper distribution key selection can help queries perform merge joins instead of hash or nested loop joins, which ultimately affects the amount of time that queries run.
To identify tables with skewed distribution, use the table_inspector.sql script. Then, use the Amazon Redshift table design playbook to choose the most appropriate sort keys, distributions keys, and distribution styles for your table.
Check for maintenance updates
Amazon Redshift caches compiled code, allowing queries to reuse the code for previously run segments. The cache then is erased during any maintenance updates. As a result, queries that are run for the first time after a patch update will spend some time in compilation. This compilation overhead can increase a cluster's CPU usage.
Use the following SQL query to check how many segments are being compiled each hour:
select "hour", count(query) total_queries, count(case when is_compiled='Y' then 1 else null end ) as queries_compiled_count, sum(segements_count) total_segments_count, sum(segments_compiled_count) total_segments_compiled_count from ( select q.query, date_trunc('h',q.starttime) as "hour", q.starttime, q.endtime, q.aborted, (case when compiled_queries.segments_compiled_count = 0 then 'N' ELSE 'Y' end) is_compiled, compiled_queries.segements_count, compiled_queries.segments_compiled_count from stl_query q left join (select query, count(segment) segements_count, sum(compile) segments_compiled_count from svl_compile group by query) compiled_queries on q.query = compiled_queries.query where q.userid > 1 and q.starttime > trunc(sysdate) -7 ) group by "hour" order by "hour";
Check for spikes in your leader node CPU usage
Leader node tasks such as parsing and optimizing queries, generating compiled code, and aggregating results from compute nodes consume CPU resources. This consumption leads to increased leader node CPU usage. Leader node CPU usage can also rise if queries are heavily referencing system catalog tables or performing leader node-only functions.
If the spike in CPU usage is caused by a leader node, check under Events in the Amazon Redshift console. Verify whether any maintenance has occurred on your Amazon Redshift cluster. Use the SQL query provided in Check for maintenance updates to verify whether more segments are being compiled than usual.
Use CloudWatch to monitor spikes in CPU utilization
Use CloudWatch metrics to compare the spikes between CPUutilization and Database Connections. Analyze the workload performance by checking the Workload Execution Breakdown chart. The Workload Execution Breakdown chart shows you at which stages the queries are spending the most time.
To identify the top 100 queries that consume the most CPU during a specified time, use the following query:
select qms.*, substring(q.querytxt,1,100) qtxt from svl_query_metrics_summary qms join stl_query q on q.query=qms.query where q.starttime > sysdate - 1 and q.userid>1 order by qms.query_cpu_time desc nulls last limit 100;
To retrieve a list of queries that consume the most resources when CPU reaches 100%, use the following query:
select a.userid, service_class, a.query, b.pid, b.xid, a.starttime, slices, max_rows, max_blocks_read, max_blocks_to_disk, max_query_scan_size, segment, round(max_cpu_time/(max_run_time*1.0)*100,2) as max_cpu_pcnt, round(cpu_time/(run_time*1.0)*100,2) as cpu_pcnt, max_cpu_time, max_run_time, case when segment > -1 then 'segment' else 'query' end as metric_lvl, text from pg_catalog.stv_query_metrics a left join stv_inflight b using (query) where step_type=-1 order by query, segment;
To check the amount of data that are processed by each node, run the following query:
select iq.day_d, sl.node, sum(iq.elapsed_ms) as elapsed, sum(iq.bytes) as bytes from (select start_time::date as day_d, slice,query,segment,datediff('ms',min(start_time),max(end_time)) as elapsed_ms, sum(bytes) as bytes from svl_query_report where end_time > start_time group by 1,2,3,4) iq join stv_slices as sl on (sl.slice = iq.slice) group by 1,2 order by 1 desc, 3 desc;
You can use query monitoring rules (QMR) to identify and log any poorly designed queries. For example, QMR rules can be defined to log queries that consume high CPU usage or an extended execution time.