How do I troubleshoot an AWS DMS task that is failing with error message "ERROR: canceling statement due to statement timeout"?

Last updated: 2019-09-30

I'm migrating data to or from my on-premises PostgreSQL database using AWS Database Migration Service (AWS DMS). The AWS DMS task runs normally for a while, and then the task fails with an error. How do I troubleshoot and resolve these errors?

Short Description

If the PostgreSQL database is the source of your migration task, then AWS DMS gets the data from the tables during the full load phase. Then, during the change data capture (CDC) phase, AWS DMS reads the changes from the write-ahead log (WAL) that is retained by the replication slot.
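
For reference, you can confirm that the replication slot exists on the source and see how much WAL it retains by querying the pg_replication_slots view. This is a read-only check that uses standard PostgreSQL catalog columns:

-- List logical replication slots, whether they are in use, and the oldest WAL position they retain
SELECT slot_name, plugin, slot_type, active, restart_lsn
FROM pg_replication_slots;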

If the PostgreSQL database is the target of your migration task, then during the full load phase AWS DMS gets the data from the source, creates CSV files on the replication instance, and then runs a COPY command to insert those records into the target. During the CDC phase, in transactional apply mode, AWS DMS runs the exact DML statements from the source WAL. In batch apply mode, AWS DMS again creates CSV files during the CDC phase and runs a COPY command to insert the net changes into the target.
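
For illustration only, a COPY of a CSV file into a table looks similar to the following. The schema, table, and file path are placeholders, and this is not the exact command that AWS DMS generates:

-- Bulk load rows from a CSV file into a table (placeholder names)
COPY my_schema.my_table FROM '/tmp/batch_0001.csv' WITH (FORMAT csv);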

When AWS DMS runs the commands to either get data from the source or put data in the target, it uses a default timeout setting of 60 seconds. If the source or target is heavily loaded, or if there are locks on the tables, then AWS DMS can't finish running those commands within 60 seconds. As a result, the task fails with an error that says "canceling statement due to statement timeout," and you see an entry similar to the following in the task log:

Messages
]E:  RetCode: SQL_ERROR  SqlState: 57014 NativeError: 1 Message: ERROR: canceling statement due to statement timeout;

To troubleshoot and resolve these errors, follow these steps:

  • Identify the cause of long execution times for commands.
  • Increase the timeout value and check the slot creation timeout value.
  • Troubleshoot slot creation issues.

Resolution

Identify the cause of long execution times for commands

To find the command that failed to run during the timeout period, review the AWS DMS task log and the table statistics section of the task. You can also find this information in the PostgreSQL error log file if the parameter log_min_error_statement is set to ERROR or a lower severity. After you identify the command that failed, you can find the names of the affected tables. See the following example error message from the PostgreSQL error log:

ERROR: canceling statement due to statement timeout 
STATEMENT: <The statement executed>
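
If failing statements aren't being logged yet, the following is a minimal sketch for a self-managed PostgreSQL instance; it assumes superuser access. On Amazon RDS for PostgreSQL, set log_min_error_statement in the DB parameter group instead:

-- Log the text of any statement that fails with severity ERROR or higher
ALTER SYSTEM SET log_min_error_statement = 'ERROR';
-- Reload the configuration so the change takes effect without a restart
SELECT pg_reload_conf();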

To find locks on the associated tables, run the following query on the source or the target (depending on where the error appears):

SELECT blocked_locks.pid          AS blocked_pid,
       blocked_activity.usename   AS blocked_user,
       blocking_locks.pid         AS blocking_pid,
       blocking_activity.usename  AS blocking_user,
       blocked_activity.query     AS blocked_statement,
       blocking_activity.query    AS current_statement_in_blocking_process
FROM pg_catalog.pg_locks blocked_locks
JOIN pg_catalog.pg_stat_activity blocked_activity
    ON blocked_activity.pid = blocked_locks.pid
JOIN pg_catalog.pg_locks blocking_locks
    ON blocking_locks.locktype = blocked_locks.locktype
    AND blocking_locks.database IS NOT DISTINCT FROM blocked_locks.database
    AND blocking_locks.relation IS NOT DISTINCT FROM blocked_locks.relation
    AND blocking_locks.page IS NOT DISTINCT FROM blocked_locks.page
    AND blocking_locks.tuple IS NOT DISTINCT FROM blocked_locks.tuple
    AND blocking_locks.virtualxid IS NOT DISTINCT FROM blocked_locks.virtualxid
    AND blocking_locks.transactionid IS NOT DISTINCT FROM blocked_locks.transactionid
    AND blocking_locks.classid IS NOT DISTINCT FROM blocked_locks.classid
    AND blocking_locks.objid IS NOT DISTINCT FROM blocked_locks.objid
    AND blocking_locks.objsubid IS NOT DISTINCT FROM blocked_locks.objsubid
    AND blocking_locks.pid != blocked_locks.pid
JOIN pg_catalog.pg_stat_activity blocking_activity
    ON blocking_activity.pid = blocking_locks.pid
WHERE NOT blocked_locks.granted;

If you find any blocked PIDs, then stop or "kill" the blocking session by running the following command (use the PID from the blocking_pid column):

SELECT pg_terminate_backend(blocking_pid); 

Because dead rows or "tuples" can increase SELECT time, check for large numbers of dead rows in the source tables by running the following command:

select * from pg_stat_user_tables where relname = 'table_name';
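
In the output, compare the n_dead_tup and n_live_tup columns. If the dead row count is large, a manual vacuum can bring SELECT times back down. This is a sketch; table_name is a placeholder for the affected table:

-- Reclaim dead tuples and refresh planner statistics for the table
VACUUM (VERBOSE, ANALYZE) table_name;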

Check whether the failing target table has a primary key or a unique index. If the table has no primary key or unique index, then each UPDATE statement runs a full table scan, which can take a long time.
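
To check for existing indexes, including those that back primary key and unique constraints, you can query the pg_indexes view (table_name is a placeholder):

-- List all indexes defined on the table, with their definitions
SELECT indexname, indexdef
FROM pg_indexes
WHERE tablename = 'table_name';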

Increase the timeout value

AWS DMS uses the executeTimeout extra connection attribute in both the source and target endpoints. The default value for executeTimeout is 60 seconds, so AWS DMS times out if a query takes longer than 60 seconds to execute.

If the error appears in Source_Unload or Source_Capture, then set the timeout value for executeTimeout in the source. If the error appears in Target_Load or Target_Apply, then set the timeout value for executeTimeout in the target. To increase the timeout value, follow these steps:

1.    Open the AWS DMS console.

2.    Choose Endpoints from the navigation pane.

3.    Choose the PostgreSQL endpoint.

4.    Choose Actions, and select Modify.

5.    Expand the Endpoint-specific settings section.

6.    In the field for Extra connection attributes, enter the following value, which raises the timeout to 3600 seconds (one hour):

executeTimeout=3600;

7.    Choose Save.

8.    From the Endpoints pane, choose the name of your PostgreSQL endpoint.

9.    In the Connections section, confirm that the Status of the endpoint changes from Testing to Successful.

You can also increase the statement_timeout parameter (in milliseconds) on the PostgreSQL DB instance. The default value is 0, which disables the statement timeout for all queries. Similarly, you can increase the lock_timeout parameter; its default value of 0 disables the lock wait timeout.
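
The following is a minimal sketch of setting these parameters for the database user that AWS DMS connects as. The role name dms_user is an assumption; replace it with the user name from your endpoint. On Amazon RDS for PostgreSQL, you can change these parameters in the DB parameter group instead:

-- Allow statements issued by the AWS DMS user to run for up to one hour (dms_user is a placeholder)
ALTER ROLE dms_user SET statement_timeout = '3600s';
-- Allow the AWS DMS user to wait up to five minutes for locks
ALTER ROLE dms_user SET lock_timeout = '300s';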

Troubleshoot slot creation issues

If the timeout occurs while AWS DMS creates the replication slot in the PostgreSQL source database, then you see log entries similar to the following:

Messages
]E:  wal_slot_create(...) - Unable to create slot 'xxxxxxxxxxxxxxxx_00016391_c4a70947_84c9_4a55_8d54_ff63f2f69a52' (on execute(...) phase) [1020101]  (postgres_endpoint_wal_utils.c:3215)

If your replication instance runs version 3.1.3 or earlier, then a default timeout of 60 seconds applies to this command and overrides the value of executeTimeout. To resolve this issue, use version 3.1.4, for which the default timeout for this command is 600 seconds. You can increase this timeout by configuring the TransactionConsistencyTimeout parameter in the Task settings section.

PostgreSQL can't create the replication slot if there are active locks on the user tables in the database. Check for locks by running the following command:

select * from pg_locks;
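
To see which sessions hold or are waiting for those locks, you can join pg_locks to pg_stat_activity, as in the following sketch:

-- Show each lock together with the session and query that holds or requests it
SELECT l.pid, l.locktype, l.relation::regclass AS relation, l.mode, l.granted, a.usename, a.query
FROM pg_locks l
JOIN pg_stat_activity a ON a.pid = l.pid
ORDER BY l.granted, l.pid;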

Then, to test whether the error has been resolved, run the following command to manually create the replication slot in the source PostgreSQL database. On PostgreSQL 10 and later, the output column is named lsn instead of xlog_position:

select xlog_position from pg_create_logical_replication_slot('<Slot name as per the task log>', 'test_decoding');

If the command still fails to create the slot, then you might need to work with a PostgreSQL DBA to identify the bottleneck and configure your database. If the command is successful, delete the slot that you just created as a test:

select pg_drop_replication_slot('<slot name>');

Finally, restart your migration task.