Troubleshoot networking issues during database migration with the AWS DMS diagnostic support AMI

Hundreds of thousands of customers have migrated their databases to AWS using AWS Database Migration Service (AWS DMS). AWS DMS is a secure and performant managed service that you can use to migrate between 13 sources and 15 targets. A critical part of the migration process is ensuring that sufficient resources are available to perform the migration. This includes compute and memory capacity on your source and target databases, as well as enough network bandwidth to transmit your data efficiently. For example, poor connectivity can delay the propagation of data changes and even cause your migration to fail.

Previously, to diagnose networking issues, you would use OS tools such as network traces and traffic analysis. Now, we’re happy to share the AWS DMS diagnostic support AMI, an Amazon Elastic Compute Cloud (Amazon EC2) image equipped with custom scripts that are designed to run out of the box with minimal configuration. You can use the scripts to check whether network issues are present in your replication architecture and to gather additional information before reaching out to AWS Support to dive deep into the issue. The tool currently supports network testing for self-managed databases, either on premises or on Amazon EC2, and AWS managed database services, as well as endpoints such as Amazon Simple Storage Service (Amazon S3), Amazon Kinesis, and Apache Kafka.

In this post, we introduce the key functionalities, architecture, and configurations of the AWS DMS diagnostic support AMI. Then, we show you how to launch the AMI with proper networking configurations and AWS Identity and Access Management (IAM) permissions using AWS CloudFormation. Last, we demonstrate an example of how network latency results in significant replication lag and how to use the AMI to diagnose the issue.

Typical symptoms of network connectivity issues

When there is a network connectivity issue, you can expect to see the following symptoms. Please note this is not an exhaustive list of network symptoms.

Poor performance during AWS DMS full load and change data capture (CDC), especially when migrating large objects (LOBs). AWS DMS migrates LOB data in two phases, which involve more round trips during migration. CDC latency spikes as a result of high network latency and additional round trips for LOB lookup. If you enable detailed debugging, you’ll see log entries like the following when looking up the LOB value from the source database to fill in the target column:

[TARGET_APPLY]T: Going to fill lob data for columns in table x.x.

Frequent endpoint disconnections and retry loops. Typically, when a source or target endpoint connection drops, you will see a message like the following:

[SOURCE_CAPTURE]E: ORA-03113: end-of-file on communication channel Process ID: x Session ID: x Serial number: x 
[SOURCE_CAPTURE]I: Disconnected Oracle database
[SOURCE_CAPTURE]E: Oracle communication failure 
[TASK_MANAGER]E: Task 'x' encountered a recoverable error, retry attempt # n

The task gets stuck retrieving the table metadata, especially when the wildcard % is specified in the AWS DMS task table mapping. The task fails with the following error message if it can’t capture the table metadata:

[METADATA_MANAGE]W: Failed to prepare get capture list statement

Unload or load interruption during the full load phase is especially common when the table is large and involves LOBs. The following is an example error message:

[SOURCE_UNLOAD]W: Error was encountered while FETCH-ing data from table 'xxx'. 'xxx'
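As a rough aid for spotting these symptoms in a task log, the following Python sketch counts lines that match the example messages above. The patterns are copied from those sample lines only; the category names and the `count_symptoms` helper are our own illustration, not part of AWS DMS or the diagnostic AMI, and real logs may word these messages differently.

```python
import re
from collections import Counter

# Patterns derived from the sample log lines in this section.
# The dictionary keys are hypothetical labels chosen for this sketch.
SYMPTOM_PATTERNS = {
    "lob_lookup": re.compile(r"\[TARGET_APPLY\s*\].*Going to fill lob data"),
    "endpoint_disconnect": re.compile(
        r"\[SOURCE_CAPTURE\s*\].*(ORA-03113|communication failure)"
    ),
    "recoverable_retry": re.compile(r"\[TASK_MANAGER\s*\].*recoverable error"),
    "metadata_stuck": re.compile(
        r"\[METADATA_MANAGE\s*\].*Failed to prepare get capture list"
    ),
    "unload_interrupted": re.compile(
        r"\[SOURCE_UNLOAD\s*\].*Error was encountered while FETCH-ing"
    ),
}


def count_symptoms(log_lines):
    """Count how many log lines match each symptom pattern."""
    counts = Counter()
    for line in log_lines:
        for name, pattern in SYMPTOM_PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts


sample = [
    "[TARGET_APPLY]T: Going to fill lob data for columns in table x.x.",
    "[SOURCE_CAPTURE]E: ORA-03113: end-of-file on communication channel",
    "[SOURCE_CAPTURE]I: Disconnected Oracle database",
    "[SOURCE_CAPTURE]E: Oracle communication failure",
    "[TASK_MANAGER]E: Task 'x' encountered a recoverable error, retry attempt # n",
    "[METADATA_MANAGE]W: Failed to prepare get capture list statement",
    "[SOURCE_UNLOAD]W: Error was encountered while FETCH-ing data from table 'xxx'.",
]

for name, count in sorted(count_symptoms(sample).items()):
    print(f"{name}: {count}")
```

A high count in any one category is only a hint that the network deserves a closer look; it does not by itself prove a network fault.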

Solution overview

The AWS DMS diagnostic support AMI provides tools to check the following:

Network configuration of AWS DMS replication tasks in a specific AWS Region
Network packet loss
Network latency
Maximum transmission unit (MTU) size

AWS DMS is a managed service; therefore, you can’t access the underlying host for testing purposes. The following diagram shows a high-level architecture for using the AWS DMS diagnostic support AMI. To simulate the same networking environment as the AWS DMS replication instance, launch the AMI in the same Region and VPC as your replication instance. This ensures that the EC2 instance shares the same networking backbone routes and security group, which allows network tests and sample calls to be representative of actual AWS DMS network traffic. The diagnostic scripts are issued from an SSH client in the same network as the source database. During networking checks, the diagnostic scripts test the AWS DMS metadata service via port 80, the source endpoint, and the target endpoint.

In the following sections, we walk you through the steps to launch the solution using AWS CloudFormation and access the diagnostic instance.

Prerequisites

You should have the following prerequisites:

An activated AWS account
An existing AWS DMS replication instance for networking tests
An EC2 key pair to connect to the EC2 instance
IAM permissions to launch a CloudFormation stack to perform the actions in the following section

The cost of the solution comes from the EC2 instance launched for networking diagnosis. As a rough estimate, running a t2.micro EC2 instance for one day in the us-east-1 Region costs about $0.37 (1 instance x 0.0116 USD On-Demand hourly cost x 24 hours + 30 GB x (24 total EC2 hours / 730 hours in a month) x 0.10 USD ≈ 0.37 USD). See Amazon EC2 On-Demand Pricing for details about the cost.
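The daily estimate can be reproduced with a few lines of Python. This is a sketch of the arithmetic only, assuming 24 hours of usage; the rates are the figures quoted above, the function name is ours, and actual prices vary by Region and over time.

```python
HOURLY_RATE_USD = 0.0116           # t2.micro On-Demand rate quoted above (us-east-1)
STORAGE_GB = 30                    # storage size used in the estimate above
STORAGE_USD_PER_GB_MONTH = 0.10    # storage rate used in the estimate above
HOURS_PER_MONTH = 730


def estimated_cost_usd(hours=24):
    """Compute cost plus storage prorated to the hours the instance exists."""
    compute = HOURLY_RATE_USD * hours
    storage = STORAGE_GB * (hours / HOURS_PER_MONTH) * STORAGE_USD_PER_GB_MONTH
    return compute + storage


print(f"Estimated cost for one day: {estimated_cost_usd():.3f} USD")
```

With these inputs the result is roughly 0.377 USD, in line with the approximate figure quoted above. Passing a different `hours` value shows how the cost scales if you keep the instance longer.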

Launch the solution with AWS CloudFormation

To launch the resources for this solution, use the CloudFormation template provided in this post. It performs the following actions:

Checks the VPC, subnet, and security group configuration of the AWS DMS replication instance you want to test.
Creates an IAM role with read access to the settings of the AWS DMS replication instance, task, and endpoint, and retrieves the host name and TCP port information from the specific AWS Secrets Manager key used by the AWS DMS endpoint. The IAM role is assumed by the diagnostic EC2 instance with the following permissions:
- dms:DescribeEndpoints – Permissions to return information about the endpoints for your account in the current Region.
- dms:DescribeTableStatistics – Permissions to return table statistics on the database migration task, including table name, rows inserted, rows updated, and rows deleted.
- dms:DescribeReplicationInstances – Permissions to return information about replication instances for your account in the current Region.
- dms:DescribeReplicationTasks – Permissions to return information about replication tasks for your account in the current Region.
- secretsmanager:GetSecretValue – Permissions to retrieve the host name and TCP port information from a specific Secrets Manager key.
Creates a diagnostic EC2 instance using the AMIs currently supported within the same subnet and with the same security group as the AWS DMS replication instance you specified. Additionally, it creates a security group that allows SSH port 22 for the IP range you specified to access the EC2 instance for network testing.
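To make the permission list concrete, the following Python sketch assembles an IAM policy document containing the five actions listed above. It is an illustration, not the policy from the CloudFormation template: the statement IDs, the `"*"` resource scope for the AWS DMS actions, and the placeholder secret ARN are assumptions.

```python
import json

# Placeholder ARN for illustration only; substitute the secret used by your endpoint.
EXAMPLE_SECRET_ARN = (
    "arn:aws:secretsmanager:us-east-1:111122223333:secret:example-dms-endpoint-secret"
)


def build_diagnostic_policy(secret_arn):
    """Return an IAM policy document granting the read-only actions listed above."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "ReadDmsSettings",  # hypothetical statement ID
                "Effect": "Allow",
                "Action": [
                    "dms:DescribeEndpoints",
                    "dms:DescribeTableStatistics",
                    "dms:DescribeReplicationInstances",
                    "dms:DescribeReplicationTasks",
                ],
                "Resource": "*",  # assumption: the template may scope this differently
            },
            {
                "Sid": "ReadEndpointSecret",  # hypothetical statement ID
                "Effect": "Allow",
                "Action": "secretsmanager:GetSecretValue",
                "Resource": secret_arn,
            },
        ],
    }


print(json.dumps(build_diagnostic_policy(EXAMPLE_SECRET_ARN), indent=2))
```

Restricting `secretsmanager:GetSecretValue` to a single secret ARN mirrors the description above, where the role reads only the specific key used by the AWS DMS endpoint.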

You create this CloudFormation stack in the same Region as your AWS DMS instance. Complete the following steps:

Choose Launch Stack and launch the CloudFormation stack in the same Region as the AWS DMS replication instance to test. Review the Regions where the AWS DMS diagnostic support AMI is available.

For Stack name, enter the name of the CloudFormation stack (for this post, DMS-Diag).

The resource created by the stack is named <stack name>-<resource name>-<random string>.

For DMSReplicationInstance, enter the ARN of the AWS DMS instance to test.
For InstanceType, choose the EC2 instance type for the networking testing environment.
For KeyName, choose an existing EC2 key pair to enable SSH access to the instance.
For SSHLocation, enter the IP address range to use to SSH to the EC2 instances.

Leave the remaining settings at their default and create the stack.
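If you prefer to script the launch rather than use the console, the four parameters above can be collected in the key/value list shape that CloudFormation APIs expect (for example, the `Parameters` argument of `create_stack` in boto3). The sketch below only builds and validates that list and makes no AWS calls; the ARN, key pair name, and CIDR range are placeholders, and `build_stack_parameters` is a helper we made up for illustration.

```python
import ipaddress


def build_stack_parameters(replication_instance_arn, instance_type, key_name, ssh_location):
    """Build the CloudFormation parameter list for the stack described above."""
    # Raises ValueError early if SSHLocation is not a valid CIDR block.
    ipaddress.ip_network(ssh_location, strict=False)
    values = {
        "DMSReplicationInstance": replication_instance_arn,
        "InstanceType": instance_type,
        "KeyName": key_name,
        "SSHLocation": ssh_location,
    }
    return [{"ParameterKey": key, "ParameterValue": value} for key, value in values.items()]


params = build_stack_parameters(
    "arn:aws:dms:us-east-1:111122223333:rep:EXAMPLEINSTANCE",  # placeholder ARN
    "t2.micro",
    "my-key-pair",      # placeholder key pair name
    "203.0.113.0/24",   # placeholder range from the documentation address block
)
for param in params:
    print(f"{param['ParameterKey']} = {param['ParameterValue']}")
```

Validating `SSHLocation` before launch catches a malformed CIDR early, which is cheaper than waiting for the stack to fail or for SSH to time out.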

It can take about 3 minutes for the CloudFormation stack to complete. When it’s complete, you can see the resources created on the Resources tab.

You can view the diagnostic EC2 instance information on the Outputs tab.

Access the instance via SSH

To access the diagnostic EC2 instance, complete the following steps:

On the AWS CloudFormation console, on the stack’s Resources tab, select the physical ID i-xxxx associated with the resource EC2Instance.

You’ll be redirected to the instance launched on the Amazon EC2 console.

Choose Connect.
On the SSH client tab, follow the steps to access the EC2 instance.

If you get a connection timeout error when trying to SSH into the diagnostic EC2 instance, there is no network connectivity between the SSH client and the diagnostic EC2 instance. Typically, this is because the AWS DMS replication instance resides in a private subnet without an internet gateway attached to it, so you can’t access the private instance via the internet. In such a case, you can place the SSH client in the same VPC as your AWS DMS replication instance. Also confirm that your SSH client’s IP address falls within the CIDR range specified for SSHLocation when launching the CloudFormation stack; it must be included in the allow list. For more information, refer to Setting up a network for a replication instance.
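The CIDR check can be done quickly with Python’s standard `ipaddress` module. This is a minimal sketch; the addresses are placeholders from the documentation ranges, and `ssh_client_allowed` is a hypothetical helper name.

```python
import ipaddress


def ssh_client_allowed(client_ip, ssh_location_cidr):
    """Return True if client_ip falls inside the CIDR entered for SSHLocation."""
    network = ipaddress.ip_network(ssh_location_cidr, strict=False)
    return ipaddress.ip_address(client_ip) in network


# Placeholder values for illustration.
print(ssh_client_allowed("203.0.113.25", "203.0.113.0/24"))
print(ssh_client_allowed("198.51.100.7", "203.0.113.0/24"))
```

Remember to test the public IP address that the instance actually sees. If your client sits behind NAT or a proxy, that address can differ from the one configured on your workstation.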

Test the solution

In this example, we have a replication task whose CDC latency increases during spikes of write operations in the source database. The AWS DMS task replicates from a MySQL source to an Amazon S3 target. The data files replicated to Amazon S3 are used by downstream applications for analytics workloads. The MySQL tables have more than 30 columns, with data types including INTEGER, VARCHAR, TIMESTAMP, TEXT, and MEDIUMTEXT. The source database is highly transactional.

The following diagram shows that the Amazon CloudWatch metrics CDCLatencySource and CDCLatencyTarget both increase linearly.

We can also see the replication log as follows:

[SORTER ]I: Reading from source is paused. Total storage used by swap files exceeded the limit 1048576000 bytes (sorter_transaction.c:110)

As the workload on the MySQL source continues, the data in Amazon S3 lags significantly behind. Because target apply is slow, changes pile up on the replication instance and eventually exhaust the instance space for swap files (1 GB and not modifiable by customers). As a result, source capture is paused and CDC source latency increases as well.

So, why are there so many changes queued up waiting for the target to apply? When looking at the AWS DMS logs carefully, we can see a lot of LOB lookups:

[TARGET_APPLY    ]T:  Event received: source id ‘xxx’ operation ‘INSERT (1)’ event id ‘xxx’ event subid ‘0’ table id ‘1’ context ‘mysql-bin-changelog.xxx:xxx:xxx:xxx:xxx:mysql-bin-changelog.xxx:xxx timestamp ‘xxx’ commit timestamp ‘xxx’  (streamcomponent.c:2218)
…
[TARGET_APPLY    ]T:  Going to fill lob data for columns in table schema_xxx. table_xxx.  (file_apply.c:749)

The change can only be applied to the target after the LOB value is retrieved from the source database. Apparently, the target apply is waiting for the round trip of the LOB lookup. Therefore, the next step is to check the basic performance of the networking environment for the migration. We run the following command on the diagnostic EC2 instance we launched for this particular task:

$ dms-report -t arn:aws:dms:us-east-1:123455678912:task:7RFX5TIRE323CTA2YKZJQDIE6W6M3T5Z667BUVQ -n y

In the first section of the output, we can see the network information of the EC2 instance.

In the section Network Packet Check, the diagnostic script issues network pings with 10 packets to test the AWS DMS metadata service, the source endpoint, and the target endpoint. As we can see from the output, there was no packet loss during the test.

In the section Network Latency Check, we can see the round trip time (RTT) for the time frame in which the test was conducted. Take the average RTT of 186.1 milliseconds shown in the output as an estimate. If 1,000 rows change per second and each row needs one LOB lookup round trip, AWS DMS can take up to about 3 minutes (186.1 milliseconds x 1,000 = 186.1 seconds) to look up the LOBs for the rows changed in a single second, which explains why the CDC latency grows linearly. The output also recommends that we put the AWS DMS replication instance closer to the endpoint when a high RTT is detected and review the MTU test result.
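The back-of-the-envelope estimate can be written out as follows. This is a simplified model, assuming one LOB lookup round trip per changed row and lookups performed one after another; the function names are ours, and real task behavior depends on settings and batching that this sketch ignores.

```python
def lob_lookup_seconds(rtt_ms, rows_changed, round_trips_per_row=1):
    """Time spent waiting on LOB lookups if they happen one after another."""
    return rtt_ms / 1000 * rows_changed * round_trips_per_row


def sustainable_rows_per_second(rtt_ms, round_trips_per_row=1):
    """Change rate at which sequential lookups would just keep up."""
    return 1000 / (rtt_ms * round_trips_per_row)


rtt_ms = 186.1        # average RTT reported in the example output
rows_per_second = 1000

wait = lob_lookup_seconds(rtt_ms, rows_per_second)
print(f"Lookup time for 1 second of changes: {wait:.1f} s ({wait / 60:.1f} min)")
print(f"Sustainable rate at this RTT: {sustainable_rows_per_second(rtt_ms):.2f} rows/s")
```

Under these assumptions, every second of source changes adds roughly three minutes of apply work, so the backlog (and therefore CDC latency) keeps growing for as long as the write spike lasts.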

In the section Network MTU Check, we can see that the local MTU (9001) does not match the remote MTU (1500) in the output. Mismatched MTUs may result in dropped packets that interrupt the migration, especially when migrating LOBs.
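To see why the mismatch matters, the sketch below computes the largest ICMP echo payload that fits in a single unfragmented IPv4 packet for each MTU. It assumes a 20-byte IPv4 header without options and an 8-byte ICMP header; it illustrates the arithmetic only and is not how the diagnostic script performs its check, which is not described here.

```python
IPV4_HEADER_BYTES = 20   # IPv4 header without options
ICMP_HEADER_BYTES = 8    # ICMP echo header


def max_ping_payload(mtu):
    """Largest ICMP echo payload that fits in one unfragmented IPv4 packet."""
    return mtu - IPV4_HEADER_BYTES - ICMP_HEADER_BYTES


def path_mtu_upper_bound(local_mtu, remote_mtu):
    """The usable MTU cannot exceed the smaller endpoint MTU (hops may lower it further)."""
    return min(local_mtu, remote_mtu)


local_mtu, remote_mtu = 9001, 1500   # values from the example output
bound = path_mtu_upper_bound(local_mtu, remote_mtu)
print(f"Local max payload:  {max_ping_payload(local_mtu)} bytes")
print(f"Remote max payload: {max_ping_payload(remote_mtu)} bytes")
print(f"Path MTU is at most {bound} bytes")
```

On a Linux host, one way to probe this manually is a ping with fragmentation prohibited, such as `ping -M do -s 1472 <host>`; a payload that exceeds the path MTU fails instead of being silently fragmented.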

Based on this analysis, we can see that the network latency between the MySQL source and the AWS DMS diagnostic EC2 instance is high. As a result, the LOB lookup round trip during replication takes a much longer time. Because LOBs are large, the data can’t be migrated efficiently without sufficient network bandwidth and low latency. After checking, the networking team identified a hardware configuration issue in the customer’s data center that couldn’t be changed.

Given that, the customer determined that the LOB values could be excluded from the replication because the downstream analytics applications did not need them. After excluding the LOBs, the CDC latency was reduced from 7 hours to 10-20 seconds, even under heavy transaction volumes on the MySQL source.

This example shows that network slowness can cause significant replication performance issues, especially when migrating LOBs. The following are additional considerations for this example:

When the RTT between the database and the replication instance is over 100 milliseconds, put the replication instance in a Region closer to that database if possible. Also, consider improving network bandwidth to eliminate this bottleneck.
AWS DMS determines whether a column is a LOB according to its data type in the source database. For example, MySQL MEDIUMTEXT is converted to the AWS DMS data type NCLOB, whereas MySQL TEXT is converted to WSTRING. This raises the question of whether a MEDIUMTEXT column in the source database really stores data that large or whether TEXT is sufficient. Database design influences the behavior of AWS DMS during migration, which in turn affects performance.
Consider isolating tables with large LOBs in their own task, or even on their own replication instance. Choose among limited LOB mode, inline LOB mode, and full LOB mode according to the maximum size of the LOB data and the statistical distribution of LOB sizes, so that the memory of the replication instance is used effectively for better performance. Note that LOB lookup occurs during both the full load (except for limited LOB mode) and CDC phases. Refer to the best practices for migrating large LOBs for more details.
When the task gets stuck retrieving metadata, unloading from the source, or loading to the target, or when endpoints disconnect frequently, consider opening a support ticket with the AWS DMS team. They can run additional checks and adjust the MTU of the replication instance if the local and remote MTUs are found to differ.

Clean up

After you’re finished with the networking diagnosis, delete the resources by deleting the CloudFormation stack. Note that you will see an error when deleting the CloudFormation stack if any change to the stack is made outside of AWS CloudFormation. You can choose to skip the resource that fails to be deleted and delete it manually.

Conclusion

Network stability and bandwidth are essential for a successful and performant database migration. In this post, we introduced an AWS DMS networking diagnostics tool that provides custom scripts to help engineers quickly diagnose AWS DMS replication anomalies that might be related to network configuration or other issues. We encourage you to test the AWS DMS diagnostic support AMI.

If you have any questions or suggestions about this post, leave a comment.

About the Authors

Wanchen Zhao is a Senior Database Specialist Solutions Architect at AWS. Wanchen specializes in Amazon RDS and Amazon Aurora, and is a subject matter expert for AWS DMS. Wanchen works with ISV partners to design and implement database migration and modernization strategies and provides assistance to customers for building scalable, secure, performant, and robust database architectures in the AWS Cloud.

Don Tam is a Database Engineer with the AWS DMS team. He works with engineers to migrate customers’ database platforms to the AWS Cloud. He also assists developers in continuously improving the functionality of AWS Cloud services.