How can I troubleshoot T-Log full issues on my RDS for SQL Server source when CDC is turned on on my AWS DMS task?

Last updated: 2022-08-30

I have an AWS Database Migration Service (AWS DMS) task that has change data capture (CDC) turned on. My task uses Amazon Relational Database Service (Amazon RDS) for SQL Server as a source. I am experiencing SQL server T-Llog full issues on my AWS DMS task. How can I troubleshoot these issues when CDC is turned on on my AWS DMS task?

Short description

When you use SQL Server as a source for an AWS DMS task, you can replicate on going changes by turning on MS change data capture (CDC) on your instances. You can use MS CDC for Amazon RDS for MySQL, on-premises SQL, and Amazon Elastic Compute Cloud (Amazon EC2) instances.

After CDC is turned on, it tracks and reads ongoing DML changes from the active log file using the fn_dblog() function. For every source table, CDC creates a corresponding CDC change table.

To track changes, CDC creates two jobs:

  • Capture job - Sans T-logs, then captures and records changes in the corresponding change tracking table using the sp_replcmds store procedure.
  • Cleanup job - Cleans change tracking based on retention policy.

To view the value set for these settings, run this store procedure:

EXEC sys.sp_cdc_help_jobs
  • Maxtrans - The maximum number of transactions to process in each scan cycle.
  • Maxscans - The maximum number of scan cycles to run in order to extract all rows from the log.
  • Pollinginterval - The number of seconds between log scan cycles.

On Amazon RDS, SQL Server takes log backups every 5 minutes. It pushes those backups to Amazon Simple Storage Service (Amazon S3). SQL Server truncates the logs until last min LSN. If the active log is truncated while AWS DMS is reading in a near area of the active log, then changes might be missed on the next call to fn_dblog. The task fails because logs can't be accessed by using fn_dump_dblog from Amazon S3.

To avoid issues like this, AWS DMS has the Safeguard traction and pollingInterval options:

  • Safeguard traction: AWS DMS creates a table called Awsdms_trucatcation safeguard in the source database. Awsdms_trucatcation opens two safeguard transactions, and then runs a dummy update statement alternatively every five minutes. This means that SQL doesn't truncate the logs for 10 minutes.

But, if AWS DMS delays reading the logs for more than 10 minutes, then the task fails again. To avoid this, use the pollingInterval parameter. For more information, see Using a Microsoft SQL Server database as a source for AWS DMS.

  • pollingInterval: The default value of polling interval is set to 5 seconds for capture jobs. This means that pollingInterval runs every 5 seconds, scans the T-log to read changes, and marks these logs as replicated. Then, it truncates the logs.

It's a best practice to set the pollingInterval value >=3599. This means that capture jobs aren't allowed to run so frequently, and makes sure that T-logs aren't truncated for a specified amount of time.

But, a database that is highly transactional can lead to a T-Log full state, and the task fails with an error similar to this:

E: RetCode: SQL_ERROR SqlState: 42000 NativeError: 9002 Message: [Microsoft][ODBC Driver 17 for SQL Server][SQL Server]The transaction
log for database ‘yourdatabase' is full due to 'REPLICATION'

Resolution

Troubleshoot and resolve SQL Server T-Log fill issues when CDC is turned on

Use these steps to troubleshoot and resolve SQL Server T-Log full issue when CDC is turned on.

1.    Check the size of your transaction log files:

DBCC SQLPERF(logspace)

2.    If the transaction log is full, run this command to see what is consuming the logs:

select name, log_reuse_wait_desc from sys.databases where name = ' db_name '
  • Replication: The replication has stopped working or is not reading active transactions.
  • Active_transaction: There is an open transaction.

If there's an active transaction, then check list of open transactions:

select * from sys.sysprocesses where open_tran=1

Note: It's not possible to shrink the T-Log of a database that has CDC turned on when the AWS DMS task is the running state. Instead, stop the task, and wait for the transaction logs to be truncated by log backup. Then, shrink the T-Log of the database.

Shrink the T-log files and RESTART the task

After logs are cleaned up, shrink the T- log files, and then RESTART the task:

1.    Stop the task.

2.    Check the status of CDC, and then turn off CDC.

Check status of CDC:

select b.name,a.run_date,a.run_status from msdb.dbo.sysjobhistory as a
join msdb.dbo.sysjobs as b on a.job_id=b.job_id
where b.name like '%cdc%capture%';

Turn off CDC at the database level:

exec msdb.dbo.rds_cdc_disable_db '<db_name>'

Find tables that are part of CDC:

USE db_name
GO
SELECT s.name AS Schema_Name, tb.name AS Table_Name
, tb.object_id, tb.type, tb.type_desc, tb.is_tracked_by_cdc
FROM sys.tables tb
INNER JOIN sys.schemas s on s.schema_id = tb.schema_id
WHERE tb.is_tracked_by_cdc = 1

Turn off CDC at table level:

use <db_name>
EXEC sys.sp_cdc_disable_table
@source_schema = N'<schema>',
@source_name   = N'<table>',
@capture_instance = N'<schema_table>'

3.    Confirm if CDC is turned off:

select is_cdc_enabled, * from sys.databases where name = 'dbname' Value 1- enabled   and 0 -Disabled

4.    After CDC is turned off, wait for at least 10 minutes, then shrink your T-logs:

use <dbname>
SELECT file_id, name
FROM sys.database_files;
GO

USE <dbname>
GO
DBCC SHRINKFILE (N'dbname_log' , 0, TRUNCATEONLY)
GO

5.    Confirm the size of the logs:

DBCC SQLPERF(logspace)

6.    Turn on CDC again, and then restart the task. Resuming the task might cause missing transactions. Accessing the archived T-log isn't supported for Amazon RDS for SQLServer.