Middleware-assisted Zero-downtime Live Database Migration to AWS

When trying to figure out how to refactor your applications to leverage AWS Managed Services, you have some decisions to make. You may have decided to move your storage layer to AWS before the computational layer. This may help with using advanced database features, in addition to reducing costs associated with writing and reading data. AWS Professional Services recently helped a large customer with this implementation.

With more than a quarter billion daily users, this customer uses highly transactional NoSQL databases that are few hundred GBs in size. Volume of data is growing rapidly. The downtime requirements for the migration were stringently low, as their applications are used globally, 24-7. The source data layer was Cloud Datastore, which runs outside of AWS. The destination was Amazon DynamoDB. Several hundred globally distributed applications (writing to and reading from the database) had little or no room for refactoring in the initial phase. While the go-to solution for this scenario is usually AWS Database Migration Service, Cloud Datastore is not yet supported by AWS DMS as a source.

Using a configurable middleware to migrate in-use data layer

The architectural approach chosen was to develop a custom middleware that the applications would communicate with rather than directly calling the database. Common database operations such as Get, Put, Delete, Conditional Updates, Deletes, and Transactions, would be issued against this middleware that is loaded as an in-memory library. Data would be read from and written to this middleware layer. It would then issue reads and writes to multiple databases with configurable load factors. The solution was developed and tested in stages (Dual-Write Single-Read, Dual-Write Dual-Read, and Single-Write Single-Read) as shown in the diagrams following.

Architecture: routing database traffic to multiple storage targets

Figure 1 shows the initial state when the application layer communicates directly with the source database.

Figure 1. Architecture of initial state: Generic 2 or 3 tier application with application and storage layers running inside or outside AWS Cloud

Figures 2 and 3 illustrate the intermediate and final states, respectively, with database traffic moved to DynamoDB progressively.

Figure 2. Architecture of intermediate state: The middleware layer introduced to switch database traffic between source and target databases

Figure 3. Architecture of post-migration final state

A closer look into the configurable stages of the migration

Initially, the middleware should be tested with the source database alone. It can then be configured to work with DynamoDB in Dual-Write mode. Reads will still continue from the source database. The target database is synchronized by copying old data in parallel.

In the next stage, reads are expanded to the target database. Reading from two sources allows in-memory comparison of the final result set. This ensures consistency of the data being returned. Upon successful validation, the system is finally configured as Single-Write Single-Read, operating solely on the target database. This is the “Point of no Return,” where the target database surges ahead with new data. In this mode, the migration is deemed complete, and the older database is ready to be taken offline.

This multi-stage approach results in a “live migration” of the data layer to DynamoDB with zero downtime. Higher-level applications are also left intact. This increases the speed and accuracy of the overall migration.

Configurable stages of migration load balanced traffic to underlying databases

The middleware layer acts as a valve or switch regulating traffic from the applications to one or more databases. This allows support for a canary-like load balancing, where a certain percentage of traffic can be diverted in either direction. We can visualize this behavior with the analogy of a 3-stage dial, as shown in Figures 4 through 6. These stages are developed and tested in a non-production environment with production-like data. All related sets of tables should be migrated together.

Dual-Write Single-Read stage

Figure 4. Migration Stage 1: Dual-Write Single-Read mode

In this stage, shown in Figure 4, data is written to both the source database and the target database (DynamoDB). At this point, data is read only from the source, because the target is not ready to handle reads yet. While new data is being written to the target database, older data is copied and backfilled by background processes.

Avoid data corruption while copying older data. Don’t change it while you’re bringing the target database on par with the source. The middleware can implement a locking mechanism for write operations based on primary keys. One way to monitor the movement of older data can be a temporary table, which the copying process can update. The middleware can read this table to allow or deny a write operation. In most use cases, writes taper off with time, making it easier for older data to be copied without running into contention.

Dual-Write Dual-Read stage

Figure 5. Migration Stage 2: Dual-Write Dual-Read mode

The prerequisite of this stage is the parity of content between the two databases. As shown in Figure 5, both reads and writes are routed to both databases. In this stage, the middleware layer activates data validation. The records that are read from both the data layers are compared and contrasted for accuracy and consistency. This allows any discrepancies in the data to be fixed and the solution redeployed.

Single-Write Single-Read stage

Figure 6. Migration Stage 3: Single-Write Single-Read mode

In this stage shown in Figure 6, all reads and writes are directed only to the target database on AWS. This is the “Point of no Return”, as new data is written to the target database alone. The source database falls behind, and can be taken offline for eventual retirement.

Dealing with differences in database features

Apart from acting as a switch, the job of the middleware layer in this design pattern is to accept, translate, and forward the generic database call. For example, when it receives a “Put” call, it must invoke the “Put” API on the specific underlying target. After due translation, it follows the rules governing the corresponding service. The middleware layer does this twice for two different underlying databases, when operated in Dual-Write or Dual-Read modes.

You must deal with differences in databases in terms of specific features, limits, and limitations. The following is a non-exhaustive list of such areas:

Specific quantitative limits: DynamoDB imposes a size limit of 16 MB on Transactions. This limit is likely different for the source database.
Behavioral differences for features like indices: Cloud Datastore supports writing empty values to indexed fields which DynamoDB doesn’t support.
Behavioral differences for primary and secondary keys: Other databases might not treat keys the same way DynamoDB treats its hash and sort keys.
Differences in capacity, throughput, and latency: The middleware may need to throttle or even decline requests. This can happen if it starts operating in an Availability Zone where one underlying database is able to scale, but the other can’t scale.

An object-oriented approach can be an efficient way to deal with such differences. Create a base class encapsulating features that are common to different databases. Then use inheritance and polymorphism to account for the differences. This can ensure reusability, readability, and maintainability.

As the AWS Professional Services team has experienced, the resulting tool can be reused several times in a large organization to migrate different application suites. It can potentially be applied to other use cases. These include, but are not limited to:

Support for more storage configurations and databases, abstraction of application code base by making them largely agnostic of underlying database technology
Extensive database compatibility testing using granular migration stages
Modularization and containerization with computational platforms such as Amazon Elastic Kubernetes Service (EKS) or Amazon Elastic Container Service (ECS)

Conclusion

This design pattern showcases the power of abstraction in enabling live database migrations. Several optimizations are possible based on the rate of writes and pre-existing size of the database. The key benefit of this approach is the elimination of the need to make extensive changes in the application layer. This can result in significant savings in terms of effort, time, and cost, especially if different applications are managed by different teams in an organization. In addition, migration to DynamoDB alone can save AWS customers significantly. This depends on the size and access pattern of data, and whether the solution is architected for cost-savings. Refer to the Cost Optimization Pillar of the Well-Architected Framework for further best practices.

Further reading

AWS Architecture Blog

Middleware-assisted Zero-downtime Live Database Migration to AWS

Using a configurable middleware to migrate in-use data layer

Architecture: routing database traffic to multiple storage targets

A closer look into the configurable stages of the migration

Configurable stages of migration load balanced traffic to underlying databases

Dual-Write Single-Read stage

Dual-Write Dual-Read stage

Single-Write Single-Read stage

Dealing with differences in database features

Conclusion

Resources

Follow

Learn

Resources

Developers

Help