Debugging Your AWS DMS Migrations: What to Do When Things Go Wrong? (Part 3)
Let’s start with a quick recap from our previous posts, Debugging Your AWS DMS Migrations: What to Do When Things Go Wrong, Part 1 and Part 2.
There, we cover the initial steps to take to debug your AWS Database Migration Service (DMS) migrations for environmental issues and to use Amazon CloudWatch metrics to identify what might be wrong with them. We dive deep into the following topics:
- Establishing a flowchart to troubleshoot and debug issues in DMS.
- Dissecting the DMS CloudWatch task logs and understanding different loggers in DMS.
- Troubleshooting issues with resource constraints and tasks using basic CloudWatch graphs.
In the current blog post, we cover the next logical step in the debugging process, which answers this question: “Can we tweak some of our DMS migration task settings to work around issues like failures, slowness, and so on?”
Let’s split the discussion into two sections:
- Top reasons for task slowness
- How to dissect task settings so you can get the best performance
Top four reasons for slowness in DMS tasks
Here are the top four reasons that we’ve discovered for why DMS tasks can run slowly:
- Resource constraint and administration issues on the source or target – Although we talked about this concern in the previous parts, we should take another quick look at this point in the debugging process. It’s important to make sure that we’re not missing resource constraint and administration issues here. You can easily catch these issues using the CDCSourceLatency and CDCTargetLatency metrics we discussed in part 2. The issues can range anywhere from resource-constraint issues on the source or target to DMS not being able to apply target transactions on time during ongoing replication due to lack of indexes.
- Network slowness – This issue was also extensively covered in the previous post, where we touch upon important CloudWatch metrics exposed by DMS to determine if we are facing a network slowness issue. Along with investigating network slowness, we also recommend making sure that the network bandwidth is appropriate for the scale of your migration. Make sure no network shaping is happening, because that can hinder the progress and speed of your migration. One of the best ways to make sure that the network is working fine is to run a network test. For example, you can copy a file across the network to an Amazon EC2 instance in the same Availability Zone and subnet. Doing this can help make sure everything is fine from a speed perspective.
- LOB settings – Due to their unknown and sometimes large size, large objects (LOBs) require more processing and resources than standard objects. To help with tuning migrations of systems that contain LOBs, AWS DMS offers the limited size LOB mode and full LOB mode.
When using limited size LOB mode, any LOB greater than the max LOB size is truncated by DMS and a warning is issued to the task logs. On the other hand, full LOB mode doesn’t assume the size of the LOB and chunks it up based on the LOB chunk size. In full LOB mode, DMS potentially needs to make multiple trips to the source database to read LOB content chunk by chunk based on the chunk and LOB size. For this reason, full LOB mode is usually slower than limited size LOB mode. Limited size LOB mode can bulk-read a set number of LOB entries because it knows what to expect through the maximum LOB size setting. You can find more details about these modes are given in our AWS Database Migration Service Best Practices whitepaper.
- CSV-based targets and maxFileSize – Given that DMS is designed to support heterogeneous migrations, it uses the best way to bulk-load data into a target database engine. One of the best ways to bulk-load data into engines like MySQL, PostgreSQL, and Amazon Redshift is to read data from a supported source. DMS then copies the data to a comma-delimited value (CSV) file. Next, it uses native methods to bulk data to these targets, like “load data local infile” for MySQL, and copy for PostgreSQL and Amazon Redshift. Because bulk loading from a CSV file is the best way to load data into these targets, we call them CSV-based targets.
In most cases, the size of the CSV file being bulk-loaded into the target is directly proportional to the speed of the migration. Based on multitude tests and table migration combinations, we determined that setting this size to 32 MB by default is an optimal way of migrating tables into these CSV-based targets. However, in certain scenarios like a smaller number of tables being migrated or a large replication instance being used (with ample memory and CPU), it makes sense to increase the default value of this CV file being loaded. That increase translates to faster loads with DMS. For CSV-based targets, you can control file size by using the maxFileSize setting. You can set this parameter to the target endpoint’s extra connect attribute for CSV-based targets.
There are other reasons why a task can be slow, for example commit rate, transaction consistency timeout, and so on. We take a look at these specific settings in the section following.
Dissecting task settings
In addition to understanding how to debug environmental issues and knowing more about what is going on from task logs and CloudWatch metrics, it is also important to understand the different tasks settings in the DMS task-setting JSON. We have identified default values for many task settings based on our experience and tests. However, in many cases, we have had customers change these to nondefault values to help with their migrations. Following, we walk through some customer examples as we talk about different task settings and what they mean when a migration is going on.
Target metadata settings
Here are some target metadata settings to take a close look at:
- TargetSchema – By default, this value is set to the schema that we are trying to migrate to. This setting thus identifies the schema where DMS artifacts are created in the target database. For example, one DMS artifact is the “awsdms_apply_exceptions” table. This table contains exception information for transactions that could not be applied during the change data capture (CDC) phase resides in this schema. It is important to know which schema this information relates to. If we know the schema, entries in this table can tell us exactly why certain CDC transactions can’t be applied on the target.
Example: One of our customers ran a migration with ongoing replication from an on-premises Microsoft SQL Server database to Amazon RDS for MySQL. The customer couldn’t see drop table DDL statements being migrated as part of the migration. When we went into the logs, we saw the following statement logged: “Some changes from the source database had no impact when applied to the target database. See awsdms_apply_exceptions table for details.”
On further research, the customer identified that the drop table DDL statement isn’t supported by DMS with MySQL targets. A list of all the available DMS artifacts to help get more information about a migration is documented here. It is important to know where these artifacts reside in the target instance for users to know what is going on with their migrations.
- parallelLoadThreads – This setting applies only to MySQL targets. It specifies the number of threads to use to load data into the MySQL target database. Setting a large number of threads can have an adverse effect on database performance, because a separate connection is required for each thread. However, if we can afford to use more threads from a resource utilization perspective, we can increase this setting to enable faster bulk loads into MySQL-based target instances.
Example: One of our customers was migrating a SQL Server database to Amazon Aurora with MySQL compatibility and wanted to improve full load speed of a huge 12-billion row, 700-GB table. The target Aurora MySQL instance was not as busy, and we were able to get higher full-load phase speeds through the following changes:
- Increased maxFileSize to 1,048,576 KB – Aurora MySQL is a CSV-based target as explained preceding. We saw that increasing maxFileSize would clearly help us improve the load speed as.
- Increased parallelLoadThreads to 5 – We weren’t loading multiple tables at once on the target. As a result, we could afford to load using more threads and connections. This approach gave us three times the full load speed in this case.
- We used table filters. Using DMS, we can load a single huge table in multiple tasks by chunking the table into multiple parts using filters. In this case, we used the primary key as the filter condition and used 12 different tasks for migrating approximately 1 billion rows apiece. You can find more details about different filter conditions possible in DMS in Using Table Mapping with a Task to Select and Filter Data in the DMS documentation.
- BatchApplyEnabled – This setting is important when it comes to running ongoing replication from a source where rate of change is high.
The term high here means different things in different scenarios. DMS uses a single thread to read changes from the source and apply them to the target. This single-threaded process can handle only a certain number of transactions at a given point in time. That number depends on the number of transactions read and the rate of change.
In some scenarios, the single-threaded process can’t keep the target in sync with the source after all DMS prerequisites, networking best practices, task settings, indexes added on the target, and so on, are applied. In these scenarios, we might have a high rate of change on the source. The BatchApplyEnabled setting can help with that. During the ongoing replication phase, DMS applies changes in a transactional manner in the following way:
- Changes are read from the transaction log from the source into the replication instance memory.
- Changes are translated and passed on to an internal component (the sorter).
- The sorter component sorts transactions in commit order and sends them one by one to the target to be applied.
If the rate of change is high, this process can take time, given that DMS uses logical replication to get changes from a source database. This scenario is exactly where this setting helps. Here is what it does:
- Collects all changes from a batch that is controlled by batch apply settings (discussed in following sections).
- Creates a table with all changes from the batch, called the net changes table. This table resides in the memory of the replication instance and is also passed on to the target instance.
- Applies a net changes algorithm that nets out all changes from the net changes table.
For example, suppose that there is a new row insert, 10 updates to that, row and a delete for that row in a single batch. In this case, DMS nets out all these transactions and doesn’t carry them over. It does so because the row is eventually deleted and no longer exists. Thus, this process reduces the number of actual transactions that are applied on the target.
Given that information, let’s also take a look at another related setting, BatchApplyPreserveTransaction, which is a change-processing tuning setting. When batch apply is enabled, this setting is also turned on by default. This setting makes sure that transactional integrity is preserved and that a batch is guaranteed to contain all the changes within a transaction from the source.
Let’s look at an example where using this setting in the default mode did not work well for a customer.
One of our enterprise customers was trying to do an all-in migration of an e-commerce backend database into AWS with minimal downtime. Their rate of change was high enough for us to recommend using batch apply. However, at certain times of day when the replication was ongoing, they started having memory-pressure issues on the replication instance.
After further research, we found that during those times, the customer source database ran heavy transactions that went for about 1 hour before being committed. We also found that this result was caused by the BatchApplyPreserveTransaction setting. The setting was trying to capture the entire long-running transaction in the replication instance memory before it sent changes to the net changes table. We decided to turn the setting off and keep an eye on the target. We needed to make sure that the target didn’t shut down unexpectedly. You’ll see in a moment why avoiding target shutdown is important in this case.
When the BatchApplyPreserveTransaction setting is false, changes from a single transaction can span across multiple batches. Although BatchApplyPreserveTransaction can help in reducing memory contention on the replication instance, we can get a situation where part of the transaction from one batch makes it to the target while another part doesn’t. This effect can arise because of issues like the target shutting down, networking issues, and so on, and the result can be data loss.
Full load settings
Here are some full load settings to take a close look at:
- Create PK after full load – DMS helps create the table structure, migrates data, and also helps transfer the primary key or unique index. This option helps create the primary key or unique index after the full load is complete.
Let’s say that the full load itself took x number of days trying to migrate a huge table with billions of rows. Turning on this option can mean that the application or cached changes and subsequent replication can be delayed. When this option is on, DMS waits for these until the primary key is successfully created.
- Stop task after full load completes – There are two options to use this setting: Stop the migration task before cached changes are applied, and stop the task after cached changes are applied.
For optimal performance of the cached changes phase and subsequent replication phase, we recommend that you stop the migration task twice: Stop it before applying cached changes, so we can add all required indexes and make the cached changes phase complete as soon as possible. Then stop it after applying cached changes, so we can add all other secondary objects to the target like foreign keys, procedures, and so on.
- MaxFullLoadSubTasks – This setting controls the number of tables loading in parallel. By default, this option is set to 8. However, setting this option to a higher number can greatly improve migration speeds if we have a bunch of small tables being migrated as part of one task. However, we recommend to increase this number with caution (after thorough testing). You do so to make sure that you don’t have resource utilization issues on the replication instance, source or target, because these put more pressure on all entities in the migration. You can assume that a separate thread is used for a table read from the source for each subtask during the full load phase. You can correlate how to set this number with the number of CPUs on the instance type of the replication instance.
For example, C4.4Xlarge instance comes with 16 vCPUs. Thus, it helps to set the MaxFullLoadSubTasks in multiples of 16, based on the amount of data being read and the number of tables. In other words, the number of tables being loaded in parallel is higher as you increase this parameter. However, too much parallelism can also cause issues. Choosing this setting based on the number of available CPUs can help alleviate those issues.
Stream buffer task settings
Every piece of data that flows from the source to the replication instance and from the replication instance to the target goes through some sort of translation at the replication instance. Before and after this translation, DMS stores the data in stream buffers before the data flows through to the next state. A number of parameters control these buffers. In many cases, the default values are not enough.
Some parameters to focus on are these:
- Stream Buffer Count: This parameter controls the number of stream buffers. The default size is 3. Increasing this count can increase migration speed. The default is sufficient for most situations.
- Stream Buffer Size in MB: The default size for each stream buffer mentioned in the previous setting is 8 MB. The default is sufficient in most cases. The exception is when you’re working with very large LOBs or long running transactions and receive the message “Outgoing stream is full. Forwarding events to target is postponed.”
Use the following formula to calculate optimal size of this setting.
[Max LOB size (or LOB chunk size)]*[number of LOB columns]*[number of stream buffers]*[number of tables loading in parallel per task(MaxFullLoadSubTasks)]*3
- Control Stream Buffer: This buffer is the control buffer for all stream buffers in the task. The default value is 5 MB and can go up to 8 MB. You might need to increase this setting when working with a very large number of tables, such as tens of thousands of tables.
Note: You might not need to change these settings normally. However, we suggest that you do so if you’re migrating LOB columns in multiple tasks on the same replication. In this case, you see a clear message in the task logs that the default stream buffer settings aren’t enough for the data being migrated. We recommend tweaking these settings only in this case.
Example: Let’s look at an example where we had to change stream buffer settings and also tweak other settings to help a customer complete their migration.
We recently worked with an enterprise customer who was trying to migrate their production Oracle database to Amazon RDS with PostgreSQL compatibility. They had three tables, one of which was 1.2 TB in size. The other two tables were pretty small, around 2 GB each. The rate of changes on the database was around 7 GB per hour (approximately 4000 updates and 3000 inserts per second).
Here are the changes they made to take care of this migration, in order:
- They increased maxFileSize to 500 MB to ensure faster migration of the single table initially.
- Target apply issues arose due to the rate of change. This issue came up despite making sure all required indexes were added to the target. The customer enabled batch apply mode.
- Memory pressure issues arose because batch apply mode was enabled. They enabled batch apply preserve transaction mode.
- They disabled batch apply preserve transaction mode. They found that the table still took too long to load, resulting in ORA-01555
- They resorted to loading the big tables in chunks using two replication instances in parallel with table filters.
- At this point, they started hitting full stream buffer issues due to their rate of change and LOB updates. They had to increase the stream buffer settings on both the instances to help cross the final hurdle and complete migration.
Now, let’s walk through each and every step to try and understand why this customer had to adapt their migration as described:
- As stated preceding, increasing maxFileSize improves migration speeds for CSV-based targets. It does so best when increased in a controlled manner, to avoid affecting the performance on the source or target and creating other resource-constraint issues on the replication instance. This customer gradually increased the maxFileSize to an acceptable level for faster migration based on testing.
- At this point, the customer started seeing the migration speed up. It had already started capturing and applying changes for the smaller tables. which completed full load. The biggest table was still under full load. However, due to the rate of change on the smaller tables, they decided to go from a completely transactional apply to batch apply. This approach started giving them better results.
- After some time, the two smaller tables kept in sync for a long period (which full load for the bigger table was going on). However, then DMS read a long-running transaction for one of the two tables. Because batch apply preserve transaction was turned on, the memory utilization in the replication instance went up and memory started swapping. We then disabled batch apply preserve transaction to ensure large transactions weren’t entirely held in the same batch in memory.
- After that, we started getting into issues with loading the big table because of ORA-01555 errors. ORA-1555 errors can happen when a query is unable to access enough undo space to build a copy of the data as it was when the query started. Committed “versions” of blocks are maintained along with newer uncommitted “versions” of those blocks. This approach lets queries access data as it existed in the database at the time of the query. These are referred to as “consistent read” blocks and are maintained using Oracle undo management.
- Because of problem in step 4, we decided to split the big table up into four chunks of approximately 400 GB each. We also decided to use two different replication instances with two tasks for each replication instance.
To illustrate, let’s assume that the table being migrated was employee in the test schema in Oracle. Following is how one of the table mappings looks for a task (approximately 5 billion rows in each task).
For the final task, we gave a higher end value to make sure that it took care of recent insert events coming into the source.
- At this point, the migration seemed to be going well. Then due to the rate of change, the customer started seeing these messages: “Outgoing stream is full. Forwarding events to target is postponed”
We increased their stream buffer settings based on the LOB size and number of buffers—we had the number of buffers at the highest. We also made sure that there were no resource constraint issues before we successfully migrated the database.
This concludes the final part of our blog series about debugging DMS migrations. In this post, we looked at specific task-related settings that can help us identify what is going on with DMS migrations. This post, along with series posts part 1 and part 2, can help you form a framework to debug DMS migrations and get full value from the service.
About the Authors
Akm Raziul Islam is a consultant with a focus on Database and Analytics at Amazon Web Services. He works with our customers to provide guidance and technical assistance on various database and analytical projects, helping them improving the value of their solutions when using AWS.
Arun Thiagarajan is a database engineer with the Database Migration Service (DMS) & Schema Conversion Tool (SCT) team at Amazon Web Services. He works on DB migrations related challenges and works closely with customers to help them realize the true potential of the DMS service. He has helped migrate 100s of databases into the AWS cloud using DMS and SCT.