Why is my AWS DMS task not retrying?

Last updated: 2022-08-18

I have an AWS Database Migration Service (AWS DMS) task that has stopped and is not retrying. How can I resume operation of my AWS DMS task?

Resolution

AWS DMS is a managed service that is designed to be self-healing. This means that when an issue occurs, AWS DMS attempts to fix it and then resume operation without any action from you. However, in some situations the migration stops and doesn't retry.

First, it's important to understand the two types of errors that you can encounter when using AWS DMS:

  • Fatal errors
  • Recoverable errors

Fatal errors

If AWS DMS encounters an error that stops it from proceeding with migration, then the task stops and enters a FAILED state. This is called a fatal error. Some examples include:

  • The source endpoint, which is a prerequisite for migration, isn't configured.
  • The AWS DMS replication instance can't fetch source objects from the source database.

In the task logs, you see messages similar to this:

"2022-05-28T16:07:35 [TASK_MANAGER    ]E:  Task 'K7YJOFK7GYXIK44C2KLGFNG7ZONLZGPWPD5RWHA' encountered a fatal error"

When AWS DMS encounters a fatal error, it tries to restart six times. If your task is no longer retrying, then it has likely already completed these attempts.

Recoverable errors

AWS DMS treats all environmental errors as recoverable errors. If a task or replication instance encounters an environmental error, then the task is interrupted, recovers on its own, and then retries.

Examples of recoverable errors include:

  • The connection between the AWS DMS replication instance and the source or target database is interrupted.
  • The replication instance restarts because of maintenance.

In the task logs, you see messages similar to this:

"Last Error Task error notification received from subtask 0, thread 0 [reptask/replicationtask.c:2673] [1022502] Stop Reason RECOVERABLE_ERROR Error Level RECOVERABLE"
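To tell the two error types apart when you scan task logs, a small helper like the following can be used. This is a minimal sketch, not an official AWS tool: the function name classify_dms_log_line is hypothetical, and the substrings it matches are taken only from the two sample messages in this article, so real logs might need additional patterns.

```python
from typing import Optional


def classify_dms_log_line(line: str) -> Optional[str]:
    """Return 'recoverable', 'fatal', or None for one task log line.

    Assumption: matching is based only on the wording of the sample
    messages shown in this article.
    """
    lowered = line.lower()
    if "recoverable_error" in lowered or "error level recoverable" in lowered:
        return "recoverable"
    if "encountered a fatal error" in lowered:
        return "fatal"
    return None


# The two sample messages from this article.
fatal_sample = (
    "2022-05-28T16:07:35 [TASK_MANAGER    ]E:  Task "
    "'K7YJOFK7GYXIK44C2KLGFNG7ZONLZGPWPD5RWHA' encountered a fatal error"
)
recoverable_sample = (
    "Last Error Task error notification received from subtask 0, thread 0 "
    "[reptask/replicationtask.c:2673] [1022502] Stop Reason RECOVERABLE_ERROR "
    "Error Level RECOVERABLE"
)

for sample in (fatal_sample, recoverable_sample):
    print(classify_dms_log_line(sample))
```

A line that matches neither pattern returns None, so the helper can be applied to every line of a log file without raising errors.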

By default, a task with a recoverable error retries indefinitely. The RecoverableErrorCount setting controls this behavior. This parameter sets the maximum number of attempts that AWS DMS makes to restart a task when it encounters an environmental error. After AWS DMS tries to restart the task the designated number of times, the task stops and manual intervention is needed. The default value is -1, which tells AWS DMS to restart the task indefinitely.

If a recoverable error causes a task to stop and it no longer retries, then check whether:

  • The RecoverableErrorCount parameter is set to a custom value.
  • The replication instance itself is down.
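The first check can be scripted against the task settings JSON. The following is a minimal sketch under stated assumptions: the function name recoverable_retry_limit is hypothetical, the input is the task settings document as a JSON string (for example, copied from the console or read from the ReplicationTaskSettings field of a DescribeReplicationTasks response), and a missing RecoverableErrorCount key is assumed to mean the default of -1.

```python
import json
from typing import Optional


def recoverable_retry_limit(task_settings_json: str) -> Optional[int]:
    """Return the custom retry limit, or None if retries are unlimited.

    Assumption: a missing ErrorBehavior section or RecoverableErrorCount
    key is treated as the default value of -1 (retry indefinitely).
    """
    settings = json.loads(task_settings_json)
    count = settings.get("ErrorBehavior", {}).get("RecoverableErrorCount", -1)
    return None if count == -1 else count


# Example input: a task whose retries are capped at 3 attempts.
example_settings = json.dumps({"ErrorBehavior": {"RecoverableErrorCount": 3}})

limit = recoverable_retry_limit(example_settings)
if limit is None:
    print("RecoverableErrorCount is -1: the task retries indefinitely.")
else:
    print(f"RecoverableErrorCount is {limit}: the task stops after {limit} restart attempts.")
```

If the function returns None and the task still isn't retrying, then the replication instance itself is the next thing to check.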

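Check whether other non-default settings are preventing retries

The following ErrorBehavior section of the task settings shows the default values. The settings annotated with "can be set to STOP_TASK" might prevent the AWS DMS task from retrying if they are changed from their default value to STOP_TASK: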

"ErrorBehavior": {
        "FailOnNoTablesCaptured": false,
        "ApplyErrorUpdatePolicy": "LOG_ERROR",  --- can be set to STOP_TASK
        "FailOnTransactionConsistencyBreached": false,
        "RecoverableErrorThrottlingMax": 1800,
        "DataErrorEscalationPolicy": "SUSPEND_TABLE",  --- can be set to STOP_TASK
        "ApplyErrorEscalationCount": 0,
        "RecoverableErrorStopRetryAfterThrottlingMax": false,
        "RecoverableErrorThrottling": true,
        "ApplyErrorFailOnTruncationDdl": false,
        "DataTruncationErrorPolicy": "LOG_ERROR",  --- can be set to STOP_TASK
        "ApplyErrorInsertPolicy": "LOG_ERROR",  --- can be set to STOP_TASK
        "EventErrorPolicy": "IGNORE",
        "ApplyErrorEscalationPolicy": "LOG_ERROR",  --- can be set to STOP_TASK
        "RecoverableErrorCount": -1,
        "DataErrorEscalationCount": 0,
        "TableErrorEscalationPolicy": "STOP_TASK",
        "RecoverableErrorInterval": 5,
        "ApplyErrorDeletePolicy": "IGNORE_RECORD",  --- can be set to STOP_TASK
        "TableErrorEscalationCount": 0,
        "FullLoadIgnoreConflicts": true,
        "DataErrorPolicy": "LOG_ERROR",
        "TableErrorPolicy": "SUSPEND_TABLE"
    },
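To find out which of the annotated settings have been changed to STOP_TASK on a given task, the comparison can be sketched as follows. This is an illustration only: the function name settings_that_stop_task is hypothetical, the list of setting names is copied from the annotations in the ErrorBehavior section above, and the input is assumed to be the task settings document as valid JSON (without the inline annotations).

```python
import json
from typing import List

# Settings annotated "can be set to STOP_TASK" in this article.
# TableErrorEscalationPolicy is omitted because STOP_TASK is its default.
STOP_TASK_CAPABLE = (
    "ApplyErrorUpdatePolicy",
    "DataErrorEscalationPolicy",
    "DataTruncationErrorPolicy",
    "ApplyErrorInsertPolicy",
    "ApplyErrorEscalationPolicy",
    "ApplyErrorDeletePolicy",
)


def settings_that_stop_task(task_settings_json: str) -> List[str]:
    """Return the annotated settings that are currently set to STOP_TASK."""
    error_behavior = json.loads(task_settings_json).get("ErrorBehavior", {})
    return sorted(
        name for name in STOP_TASK_CAPABLE
        if error_behavior.get(name) == "STOP_TASK"
    )


# Example input: two settings changed from their defaults to STOP_TASK.
example_settings = json.dumps({
    "ErrorBehavior": {
        "ApplyErrorInsertPolicy": "STOP_TASK",
        "DataErrorEscalationPolicy": "STOP_TASK",
        "ApplyErrorDeletePolicy": "IGNORE_RECORD",
    }
})

for name in settings_that_stop_task(example_settings):
    print(f"{name} is set to STOP_TASK and might prevent the task from retrying.")
```

An empty result means that none of the annotated settings is set to STOP_TASK, so the cause is more likely RecoverableErrorCount or the replication instance.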