My Amazon Redshift queries exceed the WLM timeout that I set

Last updated: 2019-11-18

I set a WLM timeout for an Amazon Redshift query but the query keeps running after this period expires. Why is this happening?

Short Description

A WLM timeout applies to queries only during the execution phase. If WLM doesn’t terminate a query when expected, it’s usually because the query has spent time in stages other than the execution stage. For example, the query might wait to be parsed or rewritten, wait on a lock, wait for a spot in the WLM queue, hit the return stage, or hop to another queue.

Resolution

When querying STV_RECENTS, starttime is the time the query entered the cluster, not the time that the query began executing. When the query is in the Running state in STV_RECENTS, it's live in the system. However, the query doesn't use compute node resources until it enters STV_INFLIGHT status. For more information about query planning, see Query Planning and Execution Workflow. To view the status of an executing query, query STV_INFLIGHT instead of STV_RECENTS:

select * from STV_INFLIGHT where query = your_query_id;

Use this query for more information about query stages:

select * from SVL_QUERY_REPORT where query = your_query_id ORDER BY segment, step, slice;

Use this query for the current execution state of the query:

select * from STV_EXEC_STATE where query = your_query_id ORDER BY segment, step, slice;

Here are some common reasons why a query might appear to run longer than the WLM timeout period:

The query is in the "return" phase

There are two "return" steps. Check STV_EXEC_STATE to see if the query has entered one of these return phases:

  • The return to the leader node from the compute nodes
  • The return to the client from the leader node

A rollback is in progress

If a Data Manipulation Language (DML) operation encounters an error and rolls back, the operation doesn't appear to be killed because it is already in the process of rolling back. You can view rollbacks by querying STV_EXEC_STATE. You can find additional information in STL_UNDONE.

The query spends time queuing prior to execution

Query STV_WLM_QUERY_STATE to see queuing time:

select * from STV_WLM_QUERY_STATE where query = your_query_id;

The query is waiting on a lock

If the query is visible in STV_RECENTS, but not in STV_WLM_QUERY_STATE, it's possible that the query is waiting on a lock and hasn't entered the queue. Use a query similar to the following to check for locks that the stuck process is waiting on:

select * from SVV_TRANSACTIONS
where granted='f'
and pid in (select pid from STV_RECENTS where query = stuck_query_id);

Use a query similar to the following to identify the process that's blocking the query:

select b.* from SVV_TRANSACTIONS b, SVV_TRANSACTIONS w
where b.granted='t'
and w.granted = 'f'
and b.txn_db = w.txn_db
and b.relation = w.relation
and w.pid in (select pid from STV_RECENTS where query = stuck_query_id);

To find more information about the blocking process (such as the query that it's currently running), run queries similar to the following:

select * from STL_QUERY where pid =  pid_of_blocking_process;
        
select * from SVL_STATEMENTTEXT where pid = pid_of_blocking_process;

To kill the stuck query, or to kill the process that is blocking it, run a query similar to the following. For more information, see PG_TERMINATE_BACKEND.

select pg_terminate_backend(process_or_query_id);

A query hopped to another queue

If a read query reaches the timeout limit for its current WLM queue, or if there's a query monitoring rule that specifies a hop action, the query is pushed to the next WLM queue for execution. To confirm whether or not the query hopped to the next queue:

To prevent queries from hopping to another queue, configure WLM query monitoring rules. For more information about query hopping, see WLM Query Queue Hopping.

A networking or firewall issue

If an Amazon Redshift server has a problem communicating with your client, the server might get stuck in the "return to client" state. Check for conflicts with networking components, such as inbound on-premises firewall settings, outbound security group rules, or outbound network ACL rules. For more information, see Connecting from Outside of Amazon EC2 —Firewall Timeout Issue.

An issue with the cluster

Issues on the cluster itself, such as hardware issues, might cause the query to hang. When this happens, the cluster is in hardware-failure status. To recover a single-node cluster, restore a snapshot. In multi-node clusters, failed nodes are automatically replaced.


Did this article help you?

Anything we could improve?


Need more help?