How do I troubleshoot an AWS DMS task that is failing with error message "ERROR: canceling statement due to statement timeout"?
Last updated: 2022-10-12
I'm migrating data to or from my on-premises PostgreSQL database using AWS Database Migration Service (AWS DMS). The AWS DMS task runs normally for a while, and then the task fails with an error. How do I troubleshoot and resolve these errors?
If the PostgreSQL database is the source of your migration task, then AWS DMS gets data from the table during the full load phase. Then, AWS DMS reads from the write-ahead logs (WALs) that are kept by the replication slot during the change data capture (CDC) phase.
If the PostgreSQL database is the target, then AWS DMS gets the data from the source and creates CSV files in the replication instance. Then, AWS DMS runs a COPY command to insert those records into the target during the full load phase.
But, during the CDC phase, AWS DMS runs the exact DML statements from the source WAL logs in transactional apply mode. For batch apply mode, AWS DMS also creates CSV files during the CDC phase. Then, it runs a COPY command to insert the net changes to the target.
When AWS DMS tries to either get data from source or put data in the target, it uses the default timeout setting of 60 seconds. If the source or target is heavily loaded or there are locks in the tables, then AWS DMS can't finish running those commands within 60 seconds. So, the task fails with an error that says "canceling statement due to statement timeout," and you see one of these entries in the log:
"]E: RetCode: SQL_ERROR SqlState: 57014 NativeError: 1 Message: ERROR: canceling statement due to statement timeout; Error while executing the query  (ar_odbc_stmt.c:2738)"
"]E: test_decoding_create_replication_slot(...) - Unable to create slot 'lrcyli7hfxcwkair_00016402_8917165c_29f0_4070_97dd_26e78336e01b' (on execute(...) phase)  (postgres_test_decoding.c:392))"
To troubleshoot and resolve these errors, follow these steps:
- Identify the cause of long run times for commands.
- Increase the timeout value and check the slot creation timeout value.
- Troubleshoot slot creation issues.
Identify the cause of long run times for commands
To find the command that failed to run during the timeout period, review the AWS DMS task log and the table statistics section of the task. You can also find this information in the PostgreSQL error log file if the parameter log_min_error_statement is set to ERROR or a lower severity. After identifying the command that failed, you can find the failed table names. See this example error message from the PostgreSQL error log:
ERROR: canceling statement due to statement timeout STATEMENT: <The statement executed>"
To find locks on the associated tables, run this command in the source or target (depending where the error is appearing):
SELECT blocked_locks.pid AS blocked_pid, blocked_activity.usename AS blocked_user, blocking_locks.pid AS blocking_pid, blocking_activity.usename AS blocking_user, blocked_activity.query AS blocked_statement, blocking_activity.query AS current_statement_in_blocking_process FROM pg_catalog.pg_locks blocked_locks JOIN pg_catalog.pg_stat_activity blocked_activity ON blocked_activity.pid = blocked_locks.pid JOIN pg_catalog.pg_locks blocking_locks ON blocking_locks.locktype = blocked_locks.locktype AND blocking_locks.DATABASE IS NOT DISTINCT FROM blocked_locks.DATABASE AND blocking_locks.relation IS NOT DISTINCT FROM blocked_locks.relation AND blocking_locks.page IS NOT DISTINCT FROM blocked_locks.page AND blocking_locks.tuple IS NOT DISTINCT FROM blocked_locks.tuple AND blocking_locks.virtualxid IS NOT DISTINCT FROM blocked_locks.virtualxid AND blocking_locks.transactionid IS NOT DISTINCT FROM blocked_locks.transactionid AND blocking_locks.classid IS NOT DISTINCT FROM blocked_locks.classid AND blocking_locks.objid IS NOT DISTINCT FROM blocked_locks.objid AND blocking_locks.objsubid IS NOT DISTINCT FROM blocked_locks.objsubid AND blocking_locks.pid != blocked_locks.pid JOIN pg_catalog.pg_stat_activity blocking_activity ON blocking_activity.pid = blocking_locks.pid WHERE NOT blocked_locks.GRANTED;
If you find any PIDs that are blocked, stop or terminate the blocked PID by running this command:
Because dead rows, or "tuples," can increase SELECT time, check for large numbers of dead rows in the source tables by running this command:
select * from pg_stat_user_tables where relname= 'table_name';
Check to see if the failed target table has primary keys or unique indexes. If the table has no primary keys or unique indexes, then a full table scan is performed during the running of any UPDATE statement. This table scan can take a long time.
Increase the timeout value
AWS DMS uses the executeTimeout extra connection attribute in both the source and target endpoints. The default value for executeTimeout is 60 seconds, so AWS DMS times out if a query takes longer than 60 seconds to run.
If the error appears in Source_Unload or Source_Capture, then set the timeout value for executeTimeout in the source. If the error appears in Target_Load or Target_Apply, set the timeout value for executeTimeout in the target. Increase the timeout value setting by following these steps:
1. Open the AWS DMS console.
2. Choose Endpoints from the navigation pane.
3. Choose the PostgreSQL endpoint.
4. Choose Actions, and select Modify.
5. Expand the Endpoint-specific settings section.
6. In the field for Extra connection attributes, enter this value:
7. Choose Save.
8. From the Endpoints pane, choose the name of your PostgreSQL endpoint.
9. From the Connections section, the Status of the endpoint changes from Testing to Successful.
You can increase (in milliseconds) the statement_timeout parameter in the PostgreSQL DB instance. The default value is 0, which turns off timeouts for any query. You can also increase the lock_timeout parameter. The default value is 0, which turns off timeouts for locks.
Troubleshoot slot creation issues
If the timeout occurred when you created the replication slot in the PostgreSQL database, then you see log entries similar to the following:
"]E: test_decoding_create_replication_slot(...) - Unable to create slot 'lrcyli7hfxcwkair_00016402_8917165c_29f0_4070_97dd_26e78336e01b' (on execute(...) phase)  (postgres_test_decoding.c:392)"
You can increase this timeout by configuring the TransactionConsistencyTimeout parameter in the Task settings section. The default value is 600 seconds.
PostgreSQL can't create the replication slot if there are any active locks in the database user tables. Check for locks by running this command:
select * from pg_locks;
Then, to test whether the error has been resolved, run this command to manually create the replication slot in the source PostgreSQL database:
select xlog_position FROM pg_create_logical_replication_slot('<Slot name as per the task log>', 'test_decoding');
If the command still can't create the slot, then you might need to work with a PostgreSQL DBA to identify the bottleneck and configure your database. If the command is successful, delete the slot that you just created as a test:
select pg_drop_replication_slot(‘<slot name>');
Finally, restart your migration task.