Debugging Your AWS DMS Migrations: What to Do When Things Go Wrong (Part 2)
A quick recap from our previous post
In our previous post, we covered the initial steps to debug AWS Database Migration Service (DMS) migrations for environmental issues. We dove deep into the following topics:
- Establishing a flowchart to troubleshoot and debug issues in DMS.
- Dissecting the Amazon CloudWatch task logs for DMS and understanding different loggers in DMS.
- Best practices for log forensics.
- Troubleshooting resource constraint issues in your DMS replication instance using basic CloudWatch graphs like freeable memory, swap usage, disk queue depth, and so on.
In some cases, DMS migrations fail for reasons others than the preceding. For example, we recently worked with a customer whose DMS migration (one of many one-time migrations every day) slowed over time as the customer ran more and more migrations. We debugged further to find that the issue was caused by an unstable network between the replication instance and target.
In this post, we continue the debugging process, discussing problems with migration tasks that aren’t in the list preceding and that aren’t due to environmental issues. We review CloudWatch graphs and task and table states to shed light on DMS migrations. Specifically, we cover the red-outlined portion of the flowchart following. We introduced this chart in our previous blog post, where you can see the complete flowchart.
We can use anomalies in CloudWatch graphs to help us find and correct issues with DMS migrations before moving on. At this point in the debugging process, we assume that all prerequisites from part 1 of this blog series are taken care of. We also assume that there are no resource constraint issues on the source DB instance, the replication instance, or the target DB instance.
At this point, we want to answer an important question: “I have taken care of all issues reported in the logs and other possible resource constraint issues. How do I know if something is still wrong with my migration?”
To efficiently answer this question, let’s look at a few important pieces of data about DMS:
- Task status
- Table states
- Table metrics
Task status is a task’s current reported condition. A status bar shows an estimate of a task’s progress. You can use task state and information about how many table rows are loaded to confirm whether a task is making progress. The following states mean problems with a task:
- FAILED: The task has failed and has stopped migrating data. This status is a good indicator that we should debug the migration task using our troubleshooting workflow. We should use the log forensics best practices to find errors or warnings in the task logs and discover the root cause of the failure.
- ERROR: A task reports this status when one or more tables encounter an issue. This status is different from FAILED because the migration continues for tables that have been successfully loaded or for which load hasn’t started yet.
Let’s cover some background information before we dive in. A full load plus change data capture (CDC) task (that is, initial load plus replication of all changes) consists of three stages:
- Full load—the initial bulk load that starts at time T1 and ends at time T2.
- Applying cached changes—changes that occurred in the table between time T1 and T2.
- Ongoing changes—replicating all changes to the table after time T2.
Let’s consider one table for the purposes of this explanation. When our table completes full load, it enters the apply cached changes phase. It then continues on to the ongoing replication phase, regardless of what stages the other tables in the migration are in. When DMS encounters issues specific to a table or set of tables, the task is put into an ERROR state. When this happens, DMS continues to migrate and replicate tables that haven’t encountered errors.
In the following sections, we discuss how to identify tables with issues and what can be done to fix them. This discussion deals with a situation where only the problematic tables are failing the migration while other tables are fine. When a task enters this state, you can identify what table or tables are failing in the migration by following the steps in the next section. We can take the required steps to fix issues and reload just the affected tables from the full load phase and move on.
To find the status of all tables in the migration, you can look at the DMS table statistics tab, shown following. You can also use the describe-table-statistics AWS CLI command.
If the task state from the section preceding is ERROR, we can look at exactly which tables failed so we can take appropriate action. Here are some task states to be aware of that indicate problems with table migration:
- Table does not exist: This nonfatal error means that DMS couldn’t find this particular table on the source. However, other tables that are part of the migration can be migrated successfully. It might be that the table marked as not existing was present when DMS read the metadata but wasn’t present at task execution. It might also mean the migration user doesn’t have privileges to read from the table.
- Table error: This nonfatal error says that this particular table failed loading for some reason. You can find out the reason from the task logs.Note: If you’re migrating many tables, you can choose to use the filter box preceding the table list to filter only for tables with errors.
When we know more about the task and table states and have debugged the issues found, we are ready to look at some of the exposed CloudWatch metrics. Here are the three important metric classes to consider when debugging migrations:
- Table metrics
- Host metrics
- Replication task metrics
In the same table statistics tab we mentioned preceding, you can see the following important information for the table:
- Number of inserts, deletes, updates, and data definition language (DDL) events read for each table
- Total number of full load rows
- Total rows after applying all changes
- Last updated timestamp of these values
The following screenshot shows these values on the AWS Management Console.
You can also use the describe-table-statistics AWS CLI command to get these values in the CLI. If you know your source database well, you can use these metrics to see if DMS is migrating all required data. If these numbers look wrong, you might have issues with your migration. If you haven’t started debugging, you should start from the beginning as shown in our previous blog post.
In our first blog post, we identified some issues with use of replication server resources by analyzing host metrics like freeable memory, disk queue depth, and so on. Let’s talk more about other important host metrics and replication task metrics in the following section.
You can find important host metrics on the replication instances page of the console, shown following.
NetworkReceiveThroughput depicts the incoming network traffic on the replication instance, including both customer database traffic and AWS DMS traffic. This important metric can help make us aware of any network-related issues between the source database instance and the replication instance. The units reported are bytes per second.
Let’s look at a real world example where one of our customers used this metric to help identify a slow VPN issue.
A Fortune 500 company was trying to migrate their on-premises Oracle database from their data center in California to an Amazon Relational Database Service (Amazon RDS) for PostgreSQL instance in the AWS Oregon Region (us-west-2). During their initial tests, they had a good rate of transfer and completed their migrations. However, during their production migration, the transfer rate was slow, with about 10 MB of data taking two hours to migrate.
Task settings can help improve migration speed. However, one of the initial recommendations when troubleshooting an issue like this is to make sure that there are no environmental issues, such as resource constraints or network problems. The following graph, measured in bits per second, is what the network receive throughput looked like for that customer.
The graph clearly indicates that the replication instance was receiving data at a slow rate, which led to the 10 MB table taking close to two hours to migrate. (In the preceding graph, 1000 bits per second equates to 0.00125 MB per second.) When we reported this issue to the customer, they knew exactly what was going on. They corrected the network issue to get better transfer rates. After working with their networking team, they later found out that a slow VPN connection caused this issue. This example shows how important it is to make sure that the environmental issues are taken care of.
NetworkTransmitThroughput depicts the outgoing (transmit) network traffic on the replication instance. This traffic includes both customer database traffic and AWS DMS traffic used for monitoring and replication.
DMS connects to both source and target databases. It reads data from the source, preprocesses the data on the replication instance, connects to the target, and writes data into the target like any other application. A slow transmit rate from the replication instance can come from a number of reasons, like the following:
- The target is in a different AWS Region than the replication instance, and writes are going over a slow network or the internet.
- The target is an on-premises instance, and the writes are happening over a slow network.
- The receive throughput is slow, so the transmit throughput is also slow.
- The target is not fast enough to accept the replication instance transmit.
Let’s look at the network transmit throughput from the customer we discussed preceding, covering the timeframe that we previously examined. The following graph is measured in bits per second.
The transmit rate is just slightly lower than the receive rate (shown preceding). This effect might mean that this customer is hitting a combination of the two issues mentioned just preceding. If so, it implies that the issue lies in the replication instance’s ability to receive data rather than transmit data.
For more details about table metrics, see Monitoring AWS Database Migration Service Tasks in the DMS User Guide.
You can choose to monitor metrics for a particular task from the console. When you choose a given task, you can see all task-based metrics. Navigate to the task monitoring tab in the console, as shown following.
Following are some important task metrics to consider.
These are task metrics for full load:
These are task metrics for CDC:
- CDCChangesMemorySource and CDCChangesDiskSource
- CDCChangesMemoryTarget and CDCChangesDiskTarget
Task metrics for full load
FullLoadThroughputRowsSource: DMS selects data from a table on the source in the same way that any other application does and bulk loads the data into the target. The FullLoadThroughputRowsSource metric gives us a detailed picture of the rate at which DMS can unload source data into the replication instance during the full load phase. The following graph shows an example.
From this graph, we can tell that DMS unloaded more than 30,000 rows per second on average from the source during the full load phase. If you know the average row size of the table, you can roughly translate the rate at which DMS is unloading data. From this, you can determine if the performance is reasonable. If it isn’t, we might be hitting one or more of the following issues:
- A slow network between source and replication instance, causing DMS to unload data slowly.
- Performance issues on the source, causing the select to be slow. One of the best ways to determine if source performance is the issue is to select data from the table from an Amazon EC2 instance in the same Availability Zone (AZ) as the replication instance. You then see the time taken to return results. Be careful with this approach, because it has potential to cause other performance issues in the source.
- Task settings or table artifacts like LOBs might slow down the unload process. For a further discussion of this topic, see our best practices whitepaper. We also talk more about these task settings in the next post in this blog series.
FullLoadThroughputRowsTarget: FullLoadThroughputRowsTarget gives us the rate at which the rows are leaving the replication instance to commit to the target. Following is the FullLoadThroughputRowsTarget graph for the full load mentioned in the FullLoadThroughputRowsSource section preceding.
From the graph, we can see that DMS is doing a good job of maintaining the rate at which it reads from the source. This graph can help you identify whether a bottleneck occurs with DMS preprocessing data before sending it out, or if there are issues with task settings or performance-based issues on the replication instance. The key to finding any bottleneck issues lies in comparing the source and target throughputs.
FullLoadThroughputBandwidthSource and FullLoadThroughputBandwidthTarget
The full load metrics that we just discussed give us details on the actual rows per second rate. In contrast, FullLoadThroughputBandwidthSource and FullLoadThroughputBandwidthTarget give us details on the actual network bandwidth used for the full load from the source and to the target respectively. Both these metrics are depicted in kilobytes per second.
These metrics help us know how much network bandwidth is used between the source and replication instances, and between the replication instance and target per task. This information helps when there are multiple tasks running on a replication instance. You can use these metrics to check if a particular task is hogging network bandwidth, causing issues with other tasks.
Following are the corresponding graphs for the preceding full load task.
The preceding and following graphs are both measured in kilobits per second.
Comparing the graph with the row numbers, we see that more network bandwidth was used when more rows per second were read from the source. If multiple tasks run on the replication instance, we can look at the source and target bandwidths to see how much network bandwidth each task is using. We can tweak task settings to tune this up or down as required.
Task metrics for CDC
CDC Latency Source: This metric describes the latency reading changes from the source. Many things can cause CDC source latency, like the following:
- A slow network between source and replication instances, causing the replication instance to read changes slowly.
- Performance issues on the source.
- Task settings causing DMS to read changes slowly from the source.
This metric is useful to determine the exact cause if the target is falling behind the source during CDC. Let’s look at an example.
A Fortune 500 customer was trying to migrate from Microsoft SQL Server to Amazon RDS for MySQL. Having started the migration, they wanted to keep the target in sync while they rearchitected their main application to work with RDS for MySQL.
Their application team rearchitected certain parts of the application every day. They ran smoke tests to check on the performance and data consistency to make sure that they were on the right track. At certain times in the day, they found that the target lagged behind the source. However, the issue wasn’t present at other times. Upon further research, they found that the following graph showed their CDC source latency.
We also found that this customer ran index rebuild and statistics update jobs starting at 12:10 UTC every day on their source database. This process caused latency reading changes from the source because the required tables were locked during this operation.
CDC Latency Target: For the issue mentioned preceding, let’s look at the corresponding target latency, as shown following.
Let’s talk about the issues we find in the graph before we generally talk about the metric.
Apart from the issues mentioned in the preceding section, this customer had enabled backups on their single AZ target RDS for MySQL instance. They had a backup window between 12:00 and 12:30 UTC. RDS guarantees that a backup starts within the backup window, and the backup moved a lot of data.
At the same time as these issues, DMS had issues applying changes to the target because a snapshot of the target RDS instance was being taken. As a result, the target latency grew until the snapshot completed. It remained in sync with the source as soon as the backup was done.
Some other issues can cause target latencies:
- No primary keys or indexes in the target—DMS writes changes into the target like any other application. Without the required indexes in place, data manipulation language (DML) changes like updates and deletes can result in full table scans. Such scans cause performance issues on the target and result in target latency.
- Performance issues in the target where the target can’t accept changes at the rate DMS is sending them.
- Network issues between replication instance and target.
- Task settings that cause changes to be written slowly to the target.
CDCChangesMemorySource, CDCChangesDiskSource, CDCChangesMemoryTarget, and CDCChangesDiskTarget:
Before we talk about these metrics, let’s talk about the CDC process in DMS. These are the CDC steps:
- DMS reads changes from the source engine’s transaction log by using logical replication API operations and captures the changes.
- By default, these changes read from the source are preprocessed in memory. Based on different task parameters, DMS might be able to process changes in memory. In some situations, DMS might need to spill some changes to disk while they await processing.
- The events in step 2 also apply to target changes. DMS needs to convert the preprocessed changes so they can be easily sent to the target. This conversion can happen in memory, or a disk spill might be required. Which happens depends on parameters such as task settings or number of tasks in the instance.
- Once the changes are ready to be applied on to the target, they are sent to the target.
Ideally, all processing eventually happens. We might need to tweak some parameters to ensure that DMS can process changes in memory whenever possible.
Following are some additional CDC task metrics.
CDCChangesMemorySource represents the number of rows accumulating in memory and waiting to be committed from the source.
CDCChangesDiskSource represents the number of rows accumulating on disk and waiting to be committed from the source. All the rows shown as part of CDCChangesDiskSource once were in memory and were spilled because they hit the threshold of time allowed to reside in memory. Our goal should be to understand the internals of the engine and minimize CDCChangesDiskSource as much as possible using task settings. Some task settings that can help achieve that are MemoryLimitTotal and MemoryKeepTime. We’ll talk in detail about task settings in the next part of this series. For further information on task settings, see Specifying Task Settings for AWS Database Migration Service Tasks in the DMS User Guide.
CDCChangesMemoryTarget represents the number of rows accumulating in a memory and waiting to be committed to the target. These rows are created after the preprocessing stage in step 2.
CDCChangesDiskTarget represents the number of rows accumulating on disk and waiting to be committed to the target. As explained preceding, we should concentrate on making sure processing occurs in memory. If CDCChangesDiskTarget is increasing, it can mean a couple of things. The memory on the replication instance might be overutilized. Or, the target DB instance might not be able to accept changes at the rate DMS is sending them.
In summary, here are the series of steps we looked through in the first two parts of this blog series on debugging database migrations:
- Looking through task logs and understanding failures.
- Debugging environmental issues.
- Understanding task and table states.
- Understanding important CloudWatch metrics for DMS to know what is going on with a given task.
In the final part of the series, we dissect DMS task settings to see how each and every one applies to the debugging steps we discussed up to now. We also talk about the pros and cons of tweaking some of these task settings.
About the Authors
Akm Raziul Islam is a consultant with a focus on Database and Analytics at Amazon Web Services. He works with our customers to provide guidance and technical assistance on various database and analytical projects, helping them improving the value of their solutions when using AWS.
Arun Thiagarajan is a database engineer with the Database Migration Service (DMS) & Schema Conversion Tool (SCT) team at Amazon Web Services. He works on DB migrations related challenges and works closely with customers to help them realize the true potential of the DMS service. He has helped migrate 100s of databases into the AWS cloud using DMS and SCT.