Rolling back from a migration with AWS DMS

When migrating a database to a new system using AWS Database Migration Service (DMS), it is prudent to have a fallback strategy if the new system doesn’t work as expected. At a high level, there are four basic strategies for rolling back from a migration: basic fallback, fall forward, dual write, and bidirectional replication. Depending on your particular situation, one or more of these strategies may work for you.

This post defines each strategy and outlines situations in which that strategy may be appropriate. In general, you should deploy the strategy or combination of strategies that requires the least effort and cost while still maintaining the integrity of your database system.

Basic fallback

The easiest fallback strategy when migrating from source A to target B is to point the application back to the original source A. This strategy is reasonable in the following situations, though they may be rare:

You are migrating a read-only system and the new target hasn’t taken any new transactions.
Your system is a “batch” type system and you haven’t applied any transactions to the new target yet.
You do not require any transactions consumed on the new system once you’ve rolled back to the old system. (Some logging systems may fit this pattern.)
You have the ability to regenerate or copy transactions consumed on the new system to the original system before or after rolling back. (Logging systems or insert only systems may fit this pattern.)
You don’t need transactions applied to the new system after rolling back to the original system.

Before cutover, the DMS task replicates data from A to B. Applications that will be “flipped” to B continue to interact with A. The following diagram illustrates this architecture.

At cutover, the replication task is stopped and applications are flipped to interact with B. The following diagram illustrates this change.

If a rollback is required, applications are flipped back to A. Any changes that occurred on B post cutover are ignored/lost. See the following diagram.
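The cutover and rollback sequence above can be modeled in a few lines. The following is a minimal Python sketch, not DMS code: the dictionaries standing in for databases A and B and the names replicate and Router are hypothetical. It only illustrates why writes made on B after cutover are absent from A after a basic fallback.

```python
# Toy model of a basic fallback. Plain dicts stand in for databases A and B;
# nothing here calls AWS DMS.

def replicate(source: dict, target: dict) -> None:
    """Copy every row from source to target (stands in for the DMS task)."""
    target.update(source)


class Router:
    """Tracks which database the application currently writes to."""

    def __init__(self, databases: dict, active: str) -> None:
        self.databases = databases
        self.active = active

    def write(self, key: str, value: str) -> None:
        self.databases[self.active][key] = value


db_a, db_b = {}, {}
app = Router({"A": db_a, "B": db_b}, active="A")

# Before cutover: the application writes to A and the task keeps B in sync.
app.write("order-1", "created")
replicate(db_a, db_b)

# Cutover: the replication task is stopped and the application flips to B.
app.active = "B"
app.write("order-2", "created")

# Rollback: the application flips back to A. Rows written after cutover
# exist only on B.
app.active = "A"
lost_on_rollback = sorted(set(db_b) - set(db_a))
print(lost_on_rollback)
```

If lost_on_rollback is empty, or you can regenerate those rows on A, the situations listed earlier for a basic fallback apply.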

Dual write

In some cases, a dual write strategy is used as part of a migration. For example, consider the situation in which you would like to migrate database A to database B. A dual write strategy consists of modifying the application code so that it writes each transaction to both databases. Employing a dual write strategy is one of the more complicated and time-consuming fallback approaches because it involves (usually non-trivial) modifications to the application. However, falling back from a dual write configuration is straightforward because you simply stop writing to the target database B. A dual write strategy may be appropriate in the following situations:

Your data is partitioned (perhaps by customer) and you are taking a phased approach to migrating your application. In this scenario, you would use DMS to hydrate your new target and keep it synchronized with your source. Cutover usually consists of enabling the dual write capability and disabling the DMS replication task. Fallback consists of disabling the dual writes (and may include resuming the DMS task).
None of the other fallback strategies will work for you.

In the following dual write configuration, DMS is used to keep A and B synchronized until the application is pointed to B. The following diagram illustrates this architecture.

Once the application is pointed to B, the DMS task is stopped and the application writes changes to B while continuing to write them to A. The following diagram illustrates the change in configuration.

If a rollback is required, the application stops writing to B, or the application itself rolls back to the non dual-write version. The following diagram illustrates the change in configuration.

Depending on the reason for the rollback, you may wish to re-instantiate the replication task to keep A and B synchronized. For example, if the issue is related to performance and you could tune B or scale up the host of B, it would be desirable to keep B in sync with A for a later migration attempt. See the following diagram.
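The application-side change that dual write requires can be sketched as follows. This is a minimal illustration under stated assumptions: DualWriter and its dual_write_enabled flag are hypothetical names, plain dicts stand in for the two databases, and the sketch ignores partial failures (a write that succeeds on one database and fails on the other), which a real implementation must handle.

```python
class DualWriter:
    """Writes each change to source A and, while enabled, to target B."""

    def __init__(self, source: dict, target: dict) -> None:
        self.source = source
        self.target = target
        self.dual_write_enabled = False  # switched on at cutover

    def write(self, key: str, value: str) -> None:
        # A is always written, so it remains a complete fallback copy.
        self.source[key] = value
        if self.dual_write_enabled:
            self.target[key] = value


db_a, db_b = {}, {}
writer = DualWriter(db_a, db_b)

writer.write("cust-1", "v1")       # before cutover: the application writes A only
db_b.update(db_a)                  # stands in for the DMS task keeping B in sync

writer.dual_write_enabled = True   # cutover: DMS task stopped, dual writes on
writer.write("cust-2", "v1")       # lands on both A and B

writer.dual_write_enabled = False  # rollback: stop writing to B
writer.write("cust-3", "v1")       # lands on A only
```

Because A receives every write in all three phases, rolling back is only a matter of turning the flag off.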

Fall forward

A fall forward approach consists of creating a third database that is a replica of your original source database. You then create replication tasks going from your original source A to your new target B, and from B to your replica of A, called A’ (for example, A → B → A’). At cutover, your applications stop writing to database A and begin writing to database B. The replication stream from database B to database A’ keeps A’ synchronized with B. If you must abandon the migration, stop writing to database B and point your applications to database A’. Database A’ has consumed any transactions written to database B as part of your migration attempt.

This fall forward approach raises the question, “Why not create a second replication task from target B back to source A as part of the original cutover?” This method is not advisable because you can’t test the replication stream from database B back to database A. Additionally, you would need some mechanism to prevent what is called the “loop back” effect, where a transaction that originates on A is replicated to B and is then looped back and again applied to A as part of the change stream from B to A.
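The loop back effect can be illustrated with a toy change log in which every change records the database it originated on. This is a conceptual sketch only: the Database class, the replicate function, and the skip_origin filter are hypothetical and do not describe how any particular replication tool tracks origins.

```python
from dataclasses import dataclass, field


@dataclass
class Database:
    name: str
    rows: dict = field(default_factory=dict)
    change_log: list = field(default_factory=list)  # (origin, key, value)

    def apply(self, origin: str, key: str, value: str) -> None:
        self.rows[key] = value
        self.change_log.append((origin, key, value))


def replicate(source: Database, target: Database, skip_origin=None) -> int:
    """Apply source's change log to target; return the number of changes applied."""
    applied = 0
    for origin, key, value in list(source.change_log):
        if origin == skip_origin:
            continue  # do not send a change back to where it originated
        target.apply(origin, key, value)
        applied += 1
    return applied


a, b = Database("A"), Database("B")
a.apply("A", "k1", "v1")  # transaction originates on A
b.apply("B", "k2", "v2")  # transaction originates on B
replicate(a, b)           # A -> B carries k1 to B, so B's log now holds k1 too

# Replaying B's log toward A without a filter sends k1 back again.
looped = replicate(b, Database("A (naive)"))
# Filtering on origin returns only the change that really started on B.
filtered = replicate(b, Database("A (filtered)"), skip_origin="A")
print(looped, filtered)
```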

A fall forward approach allows you to test the system end-to-end. This method is especially important in heterogeneous migration scenarios, in which database A and database B are using different engines, such as Oracle and PostgreSQL.
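One way to approach end-to-end testing is to compare the contents of A, B, and A’ after a test workload has run. The following is a minimal sketch, assuming each table fits in memory as a dict keyed by primary key; table_fingerprint and out_of_sync are hypothetical helpers. In a heterogeneous migration, values may need to be normalized first, because different engines can represent the same value differently.

```python
import hashlib


def table_fingerprint(rows: dict) -> str:
    """Hash rows in primary-key order so equal contents give equal digests."""
    digest = hashlib.sha256()
    for key in sorted(rows):
        digest.update(repr((key, rows[key])).encode("utf-8"))
    return digest.hexdigest()


def out_of_sync(databases: dict) -> list:
    """Return the names of databases whose contents differ from the first one."""
    names = list(databases)
    reference = table_fingerprint(databases[names[0]])
    return [n for n in names[1:] if table_fingerprint(databases[n]) != reference]


db_a = {1: ("alice", 10), 2: ("bob", 20)}
db_b = dict(db_a)
db_a_prime = {1: ("alice", 10)}  # row 2 has not reached A' yet

lagging = out_of_sync({"A": db_a, "B": db_b, "A'": db_a_prime})
print(lagging)
```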

A fall forward approach to rolling back a migration is appropriate in the following scenarios:

You must ensure the system that you roll back to remains in sync with your target system so a rollback is as swift and easy as possible.
You are performing a heterogeneous migration where the rollback mechanism doesn’t require special consideration due to architectural complexities. In other words, if the basic fallback (mentioned above) is not possible, the fall forward approach is the preferred method.
You are performing a homogeneous migration using a logical replication tool such as DMS and the rollback mechanism doesn’t require special consideration.

With a fall forward approach, in addition to the new target B, a copy of A is also created (A’ in the following diagram). Changes are then replicated from A to B, and from B to A’. This allows for thorough testing of the replication stream from B to A’ as well as the replication stream from A to B. The following diagram illustrates this configuration.

At cutover, the application points to target B and the replication stream from A to B is removed (or left dormant). The replication stream from B to A’ continues to run in case you require a rollback. The following diagram illustrates this configuration change.

If you must perform a rollback, the applications fall forward to database A’, a copy of A that includes the changes that occurred on database B. See the following diagram.
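The difference between falling back to A and falling forward to A’ can be shown with a toy model of the A → B → A’ chain. This is a sketch only: plain dicts stand in for the three databases and the replicate helper is a hypothetical stand-in for a DMS task.

```python
def replicate(source: dict, target: dict) -> None:
    """Copy every row from source to target (stands in for a DMS task)."""
    target.update(source)


db_a = {"order-1": "created"}
db_b, db_a_prime = {}, {}

# Before cutover: A -> B and B -> A' both run and can be tested end to end.
replicate(db_a, db_b)
replicate(db_b, db_a_prime)

# Cutover: A -> B is stopped, the application writes to B,
# and B -> A' keeps running.
db_b["order-2"] = "created"
replicate(db_b, db_a_prime)

# Rollback: compare what each candidate rollback database is missing.
missing_on_a = sorted(set(db_b) - set(db_a))
missing_on_a_prime = sorted(set(db_b) - set(db_a_prime))
print(missing_on_a, missing_on_a_prime)
```

A is missing the post-cutover row, while A’ is missing nothing, which is why the applications are pointed at A’ rather than A.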

More complicated architectures

In many situations, a migration consists of moving an entire database system or a database system that is architecturally isolated. In some situations, it is desirable to migrate a portion of a database system or a database system that is part of a larger, interdependent system, such as a database that is part of a multi-master replication architecture.

Migrating a portion of a system or piecemeal migrations

It is often desirable to divide the migration of a large, multi-tenant database into smaller components that can be staged over time. While this approach can reduce risk in the event something goes wrong, it can also complicate the rollback architecture. Instead of rolling back an entire database system, you must now roll back a portion of the system. One way to accomplish this is with the dual write strategy, but it requires application changes and careful coordination during cutover.

When splitting up a multi-tenant system, individual migrations are usually associated with a specific application or set of applications. You can often isolate the data associated with these applications physically or logically. If so, you can use the fall forward architecture, which would consist of source A, target B, and a replica of A (A’). As before, you would replicate data from A to B and from B to A’. However, you may not want all the data from A replicated to B, but you do want the superset of data from A and B to end up on A’. You can accomplish this by creating three replication streams: A to B to A’ and A to A’, in which the replication stream going from A to B only replicates data associated with the application migrating to B, and the replication stream going from A to A’ replicates all data not associated with the application migrating to B.

In the following architecture, the intent is to carve out applications that write data to B. Database B includes only the data associated with application B, which is split out from all other applications. Database A’ includes data from all applications. The DMS 1 replication tasks replicate data associated with application B from A to B and from B to A’ (the fall forward database). Replication task DMS 2 replicates data not associated with application B from A to database A’.
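The split between the DMS 1 and DMS 2 streams can be expressed with DMS table-mapping selection rules. The following is a minimal sketch, assuming application B’s data is logically isolated in its own schema; the schema name app_b and the selection_rule helper are hypothetical. The resulting JSON is the kind of document a task’s table mappings expect.

```python
import json

APP_B_SCHEMA = "app_b"  # hypothetical schema holding application B's data


def selection_rule(rule_id: int, schema: str, action: str) -> dict:
    """Build one selection rule covering every table in a schema."""
    return {
        "rule-type": "selection",
        "rule-id": str(rule_id),
        "rule-name": f"rule-{rule_id}",
        "object-locator": {"schema-name": schema, "table-name": "%"},
        "rule-action": action,  # "include" or "exclude"
    }


# DMS 1 (A -> B and B -> A'): only application B's data.
dms1_mappings = {"rules": [selection_rule(1, APP_B_SCHEMA, "include")]}

# DMS 2 (A -> A'): everything except application B's data.
dms2_mappings = {
    "rules": [
        selection_rule(1, "%", "include"),
        selection_rule(2, APP_B_SCHEMA, "exclude"),
    ]
}

print(json.dumps(dms2_mappings, indent=2))
```

Keeping the two rule sets complementary means A’ receives each row from exactly one stream.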

At cutover, application B is directed to database B and the DMS 1 task from A to B is removed. Applications responsible for writing to the B slice of data on the original database stop or are modified such that they no longer write data to database A. The following diagram illustrates this architecture.

In the event a rollback is required, application B is stopped and all applications (including those originally responsible for writing the B slice of data on A) are resumed and redirected to write to database A’. All DMS tasks are either removed or left dormant at this point. The following diagram illustrates this configuration change.

The above strategy assumes that you want to break up a multi-tenant system into multiple independent systems. You can also apply these concepts if you want to migrate subcomponents of a large system into a new system in stages. You can do this as a special case of the previous multi-tenant example. After the first migration, you must create rollforward systems for database B and database A in stages.

You can also migrate an entire system in stages. For example, you may want to migrate your customer management system a few customers at a time to minimize potential impacts as you learn to scale your new system. In this scenario, you can transfer the responsibility for managing a customer or set of customers from all applications to application B. If you must perform a rollback, you can fall forward to the fall forward system for all customers. If you would rather roll back only the set of customers for which the issue occurred and keep system B for successful customer migrations, you must create a fall forward instance for system B and a new fall forward A’ for each stage of the migration. The following diagram illustrates this architecture.

In this staged migration, application B has already been migrated and the system is set up to prepare for the migration of application C. You still have a rollforward database A’ and a DMS task (DMS 3), moving data not associated with C to A’. You have another set of DMS tasks (DMS 1) to move data associated with application C from A into your new database B,C, and from there into your rollforward instance A’. You have an additional rollforward database B’ and a DMS task (DMS 2) to move data not associated with application C from B to B’.

At cutover, the DMS task moving data from A to B,C is stopped. Application C is redirected to database B,C. The following diagram illustrates this architecture.

In the event a rollback is required, Application B is pointed to B’ and the other applications (including Application C) are redirected to A’. All DMS tasks are removed or left dormant. The following diagram illustrates this change in configuration.

Consolidating systems

In some situations, you may want to migrate multiple systems into a single system. For example, it is common to migrate a system of MySQL shards into a single or smaller set of Aurora MySQL databases. The rollforward architecture for the consolidation of systems is similar to that for a staged migration.

You can start the initial consolidated database B with one, two, or more shards. In this scenario, DMS tasks instantiate and synchronize the consolidated system B with two shards, 1A and 2A. Systems 1A’ and 2A’ are created as fall forward options. DMS tasks keep 1A’ and 2A’ synchronized with system B. The following diagram illustrates this architecture.

At cutover, the application uses database B for data previously stored in shards 1A and 2A. DMS tasks moving data from 1A to B and 2A to B are removed or left dormant. Databases 1A and 2A are no longer needed. The following diagram illustrates this change.

If you must perform a rollback, the application uses 1A’ and 2A’ for data associated with shards 1 and 2, respectively. All DMS tasks are removed or left dormant. The following diagram reflects this change in configuration.
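The three configurations above amount to a routing decision per shard. The following is a minimal sketch of that decision; the Phase enum and database_for function are hypothetical names, and the returned strings are just the labels used in this post.

```python
from enum import Enum


class Phase(Enum):
    BEFORE_CUTOVER = "before cutover"
    AFTER_CUTOVER = "after cutover"
    ROLLED_BACK = "rolled back"


CONSOLIDATED_SHARDS = (1, 2)  # shards folded into database B in this stage


def database_for(shard: int, phase: Phase) -> str:
    """Return the database the application should use for a shard's data."""
    if shard not in CONSOLIDATED_SHARDS:
        raise ValueError(f"shard {shard} is not part of this consolidation stage")
    if phase is Phase.BEFORE_CUTOVER:
        return f"{shard}A"   # original shard
    if phase is Phase.AFTER_CUTOVER:
        return "B"           # consolidated database
    return f"{shard}A'"      # fall forward copy kept in sync from B


routes = {
    phase.value: [database_for(shard, phase) for shard in CONSOLIDATED_SHARDS]
    for phase in Phase
}
print(routes)
```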

Assuming the consolidation of shards 1 and 2 was successful, you can continue to consolidate shards one or more at a time. The following diagram shows the addition of two more shards. A fall forward instance (1B, 2B)’ is added as a fall forward database for your consolidated system.

You can use this strategy for the staged migration of a single multi-tenant database or multiple independent or multi-tenant databases into a single database. This is a fairly common pattern when consolidating several systems into an Amazon Aurora database.

Migrating an interdependent system of databases

In some architectures, dependencies exist between separate databases or database systems. For example, some databases use multi-master replication to synchronize some (or all) data between different nodes of a system. In some cases, you can migrate the entire system. In other situations, migrating the entire system at once is impractical or too risky. Either way, it is important to have a rollback strategy in place prior to the migration.

The following diagram depicts a four-node, multi-master database configuration running on database type A. The arrows between the nodes depict two-way replication between node pairs. Multi-master replication is accomplished using third-party software or software native to the database system in question (not DMS).

The diagram below depicts an architecture that includes rollforward facilities to migrate the entire system from database type A to database type B. The new database system, consisting of nodes 1–4 of database type B, is instantiated, and replication between the corresponding nodes using DMS is established. A fall forward database of type A is established for each node (1–4 A’), and DMS replication is established from the B nodes to the A’ nodes. At this point, multi-master replication among the B and A’ nodes is instantiated and thoroughly tested.

At cutover, applications point to nodes 1–4 of database type B. DMS tasks replicating data from the A nodes to the B nodes are removed or left dormant. Databases 1A–4A are deprecated. See the following diagram of this change in configuration.

If you must perform a rollback, applications point to nodes 1A’–4A’. All DMS tasks are removed or left dormant. Databases 1B–4B are deprecated. See the following diagram.

This scenario may or may not be practical, depending on your situation. For example, you may not be comfortable with the risk of moving the entire system in one fell swoop. Or perhaps you can’t use Amazon RDS or Amazon EC2 for your rollforward instances, and you do not have the hardware available to create a complete rollforward system.

The following diagram outlines a rollforward architecture to support the migration of a multi-master system one node at a time. The first node to be migrated is node 2. Node 2 data is copied to database 2B and database 2A’. DMS replication tasks keep 2B synchronized with 2A and 2A’ synchronized with 2B.

At cutover, workloads previously sent to 2A are redirected to 2B. The fall forward node 2A’ is instantiated as a member of the multi-master group. The DMS task from 2A to 2B is removed or left dormant. In the following diagram, node 2A’ is participating in replication only with node 4A, which coordinates 2A’ changes with the rest of the group. Depending on your multi-master configuration, you may need to add 2A’ as a proper member of the group, connecting it with nodes 1A and 3A.
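When replication links are added and removed one node at a time, it can help to check that changes written on the migrated node still reach every remaining member of the group. The following is a minimal sketch that models the links as a directed graph. The links between 4A and nodes 1A and 3A are assumptions made for illustration, because the actual pairs depend on your multi-master configuration; one_way, two_way, and reachable are hypothetical helpers.

```python
from collections import deque


def one_way(links: dict, src: str, dst: str) -> None:
    """Record a one-directional replication stream, such as a DMS task."""
    links.setdefault(src, set()).add(dst)


def two_way(links: dict, a: str, b: str) -> None:
    """Record a multi-master (two-way) replication link."""
    one_way(links, a, b)
    one_way(links, b, a)


def reachable(links: dict, start: str) -> set:
    """Return every node that changes written on start eventually reach."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in links.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen - {start}


links = {}
one_way(links, "2B", "2A'")  # DMS task from the new node to its fall forward copy
two_way(links, "2A'", "4A")  # 2A' joins the group through node 4A
two_way(links, "4A", "1A")   # assumed existing multi-master link
two_way(links, "4A", "3A")   # assumed existing multi-master link

receivers = reachable(links, "2B")
print(sorted(receivers))
```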

In the event you must abandon the migration, workloads meant for node 2B are redirected to use node 2A’. You may also need or want to connect node 2A’ to nodes 1A and node 3A. The following diagram illustrates this change in configuration.

The following diagram shows that the migration of node 2A to 2B was successful, and you are setting up to migrate node 1A. You still need node 2A’ because it is transferring changes from 2B back into the initial multi-master group. (If it’s possible to include node 2B in the original multi-master group, you can do so and remove the need for node 2A’.) Nodes 1B and 1A’ are loaded with data from 1A, and DMS tasks sync 1B with 1A and 1A’ with 1B.

At cutover, workloads meant for 1A are redirected to use node 1B. A multi-master connection is instantiated between nodes 1B and 2B. The DMS task between 1A and 1B is removed or left dormant and node 1A is deprecated. A multi-master replication task is created between node 1A’ and node 4A. If required (or desired), multi-master connections are made between node 1A’ and nodes 3A and 2A’. The fall forward procedure for node 1B is similar to the procedure outlined for node 2B. See the following diagram.

When falling forward is not possible

With a bit of thought and ingenuity, you can devise a fall forward architecture for most migration projects requiring a rollback plan. However, in some situations, the creation of a fall forward architecture is too complex or not possible, for example, when migrating a subset or component of a single node participating in a multi-master configuration.

To address these situations, DMS includes bidirectional replication. Bidirectional replication can replicate data from system A to B and from B back to A with special provisions to prevent transactions originating on A from looping back to A as part of the replication stream from B to A (and vice versa).
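As a rough illustration, a bidirectional pair of tasks is typically configured so that each task carries loopback prevention settings that mirror the other’s. The sketch below only builds the settings documents. The setting names (LoopbackPreventionSettings, EnableLoopbackPrevention, SourceSchema, TargetSchema) and the example schema values are assumptions based on the DMS task settings documentation and should be verified against the current documentation, along with the engines and task types for which bidirectional replication is supported. The loopback_settings helper is hypothetical.

```python
import json


def loopback_settings(source_schema: str, target_schema: str) -> dict:
    """Build the loopback prevention fragment of a task settings document."""
    return {
        "LoopbackPreventionSettings": {
            "EnableLoopbackPrevention": True,
            "SourceSchema": source_schema,
            "TargetSchema": target_schema,
        }
    }


# The two tasks mirror each other: one task's source schema is the
# other task's target schema. The schema names are placeholders.
task_a_to_b = loopback_settings("LOOP-DATA", "loop-data")
task_b_to_a = loopback_settings("loop-data", "LOOP-DATA")

print(json.dumps(task_a_to_b, indent=2))
```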

An issue with using bidirectional replication as a rollback mechanism is that you can’t test the replication stream from B back to A until after the cutover to B. This is dangerous because it can take a significant amount of work to get logical replication working smoothly (especially in heterogeneous migrations in which A and B use different database engines). The following architecture outlines a method for best-effort testing of the B to A replication stream for a system requiring bidirectional replication as a rollback mechanism. See the following diagram.

In this scenario, you want to extract application B from multi-tenant database A1 and have it use a new database B, running a different engine than database A1. A1 is a participating node in a complex, multi-master replication configuration.

Creating a fall forward architecture for this scenario would be complicated. If application B is relatively small or less critical than other applications using database A1, bidirectional replication might introduce less risk to the overall system.

The following diagram depicts database B created and loaded with the subset of data required by application B from database A1. The intention is to migrate application B to use database B. Bidirectional replication is configured between A1 and B using DMS tasks 1 and 2. Database B resides in AWS; A1 does not. At cutover, applications begin using database B, DMS 1 pushes relevant changes from A1 into B, and DMS 2 pushes changes from B back to A1.

In the preceding diagram, you know DMS 1 works well because you have used it to keep database B synchronized with A1. However, you haven’t tested DMS 2 because database B doesn’t receive transactions from application B until after the cutover. This presents a serious risk to your fallback strategy because, as noted previously, it can take some significant effort to get logical replication working correctly, especially between databases running different engines.

To help mitigate the risk associated with using an untested DMS task, you can create a copy of database B on engine A and create a DMS task (T) to keep the two synchronized. You can use the configuration from DMS T for the task responsible for pushing data from B back to A1 after cutover. See the following diagram of this architecture.

The copy of B may require some or all of the data from A1 to make sure that DMS T works with any existing interdependencies of the data. In some scenarios, it may require a complete copy of A1.
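Reusing the configuration from DMS T can be sketched as copying the tested task’s migration type, table mappings, and task settings into the request for the new task. The following is a minimal illustration, assuming the tested task is available as a dict shaped like one entry returned by the boto3 DMS describe_replication_tasks call. The clone_task_request helper, the task identifiers, and the ARNs are hypothetical placeholders; the sketch builds the request without calling AWS.

```python
def clone_task_request(tested_task: dict, identifier: str, source_arn: str,
                       target_arn: str, instance_arn: str) -> dict:
    """Build create_replication_task arguments that reuse a tested task's configuration."""
    return {
        "ReplicationTaskIdentifier": identifier,
        "SourceEndpointArn": source_arn,
        "TargetEndpointArn": target_arn,
        "ReplicationInstanceArn": instance_arn,
        # Configuration copied unchanged from the tested task (DMS T).
        "MigrationType": tested_task["MigrationType"],
        "TableMappings": tested_task["TableMappings"],
        "ReplicationTaskSettings": tested_task["ReplicationTaskSettings"],
    }


# Placeholder values, not real resources.
dms_t = {
    "ReplicationTaskIdentifier": "dms-t",
    "MigrationType": "cdc",
    "TableMappings": '{"rules": []}',
    "ReplicationTaskSettings": '{"Logging": {"EnableLogging": true}}',
}

request = clone_task_request(
    dms_t,
    identifier="dms-2t",
    source_arn="arn:aws:dms:us-east-1:111122223333:endpoint:EXAMPLE-B",
    target_arn="arn:aws:dms:us-east-1:111122223333:endpoint:EXAMPLE-A1",
    instance_arn="arn:aws:dms:us-east-1:111122223333:rep:EXAMPLE",
)

# With credentials configured, the request could be submitted with:
#   import boto3
#   boto3.client("dms").create_replication_task(**request)
print(sorted(request))
```

Only the endpoints differ between the tested task and the new one, so any behavior verified with DMS T carries over as far as the configuration is concerned.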

Up to this point, this post has omitted the location of the systems involved in a migration because you could test the replication pipeline for all fallback configurations. With bidirectional replication, it is impossible to completely test the replication pipeline from the target (B) back to the source (A1) before cutover. In particular, you must make sure that the network configuration supports replication in both directions. Therefore, when testing DMS T, it is important that your copy of database B on engine A resides in the same location as database A1, with the same or similar network configuration. The following diagram illustrates this architecture.

Your final configuration consists of database B and a bidirectional pair of DMS tasks. Task DMS 2T is configured the same as a task that has been thoroughly tested to move data from B to a copy of A. See the following diagram.

At cutover, direct application B to use the new database B. DMS 2T pushes changes on database B back to database A1. The following diagram illustrates this change.

If you must roll back, revert to having application B use database A1. The DMS tasks are removed or left dormant and database B is deprecated. The following diagram illustrates this change.

Wrap up

There are many situations in which the use of logical replication is an integral part of a database migration strategy. With any migration, it is important to have a rollback strategy if you must abort the migration. Using logical replication as part of a migration strategy minimizes any outage associated with the migration and rollback of the migration.

Creating a migration architecture that includes a fall forward approach to aborting a migration reduces the time required to roll back from a migration. If a fall forward approach is not possible, you must test any replication pipeline used as part of the rollback strategy.

As always, AWS welcomes any feedback, so please leave your comments below.

About the Author

Ed Murray is a Principal Database Engineer with Amazon Web Services.