Migrating to Amazon DocumentDB with the hybrid method

This blog post was last reviewed and updated February, 2022.

Amazon DocumentDB (with MongoDB compatibility) is a fast, scalable, highly available, and fully managed document database service that supports MongoDB workloads. You can use the same MongoDB 3.6, 4.0, or 5.0 application code, drivers, and tools to run, manage, and scale workloads on Amazon DocumentDB without worrying about managing the underlying infrastructure. As a document database, Amazon DocumentDB makes it easy to store, query, and index JSON data.

There are three primary approaches for migrating from MongoDB to Amazon DocumentDB: offline, online, and hybrid. For more information, see Migration Approaches.

This post discusses how to use the hybrid approach to migrate data from MongoDB to Amazon DocumentDB. The hybrid approach combines the speed of the offline approach with the minimal downtime of the online approach. For more information, see Video: Live migration to Amazon DocumentDB.

The hybrid method is the best option if you want to minimize downtime and your source dataset is greater than 1 TB. The hybrid method takes advantage of parallelization and the speed that you can achieve with mongorestore to migrate the bulk of the data and then uses AWS Database Migration Service (DMS) to minimize downtime.

If your dataset is smaller than 1 TB, you should use the online or offline approach. For more information about migrating with the offline and online methods, see Migrate from MongoDB to Amazon DocumentDB using the offline method and Migrating to Amazon DocumentDB with the online method.

This post shows you how to use the hybrid approach to migrate data from a MongoDB replica set hosted on Amazon EC2 to an Amazon DocumentDB cluster.

Prerequisites

Before you start your migration, complete the following prerequisites:

Verify your source version and configuration
Set up and choose the size of your Amazon DocumentDB cluster
Set up an EC2 instance

Verifying your source version and configuration

If your MongoDB source uses a version of MongoDB earlier than 3.6, you should upgrade your source deployment and your application drivers. Both should be compatible with MongoDB 3.6 before you migrate to Amazon DocumentDB.

You can determine the version of your source deployment by entering the following code in the mongo shell:

mongoToDocumentDBOnlineSet1:PRIMARY> db.version()
3.4.4
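If you need to check several source deployments, the version comparison can be scripted. The following Python sketch is illustrative only: the function name needs_upgrade is hypothetical, and it assumes that you pass in the string that db.version() prints.

```python
# Minimum source version named in this post.
MIN_SOURCE_VERSION = (3, 6)


def needs_upgrade(version: str) -> bool:
    """Return True if a version string such as '3.4.4' is earlier than 3.6."""
    # Only major and minor matter here; ignore the patch level and any suffix.
    major, minor = version.strip().split(".")[:2]
    return (int(major), int(minor)) < MIN_SOURCE_VERSION


# Usage with the shell output shown above:
# needs_upgrade("3.4.4")  -> True, so upgrade before migrating
```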

Additionally, verify that the source MongoDB cluster (or instance) is configured as a replica set. You can determine if a MongoDB cluster is configured as a replica set with the following code:

db.adminCommand( { replSetGetStatus : 1 } )

If the output is an error message similar to "errmsg" : "not running with --replSet", the cluster isn’t configured as a replica set.
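When the command succeeds, the result also tells you the replica set name and which members are secondaries. The following Python sketch shows one way to read that result. It assumes that you have the replSetGetStatus document available as a dictionary with its usual set, members, name, and stateStr fields; the function name replica_set_summary is hypothetical.

```python
def replica_set_summary(status: dict) -> dict:
    """Summarize a replSetGetStatus result, or raise if it is not a replica set."""
    if status.get("ok") != 1 or "set" not in status:
        # For example: "not running with --replSet"
        raise ValueError(status.get("errmsg", "not configured as a replica set"))
    members = status.get("members", [])
    return {
        "set": status["set"],
        "primary": [m["name"] for m in members if m.get("stateStr") == "PRIMARY"],
        "secondaries": [m["name"] for m in members if m.get("stateStr") == "SECONDARY"],
    }
```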

Setting up and sizing your target Amazon DocumentDB cluster

For this post, your target Amazon DocumentDB cluster is a replica set that you create with a single db.r5.large instance. When you size your cluster, choose the instance type that is suitable for your production cluster. For more information about Amazon DocumentDB instances and costs, see Amazon DocumentDB (with MongoDB compatibility) pricing.

Setting up an EC2 instance

To connect to the Amazon DocumentDB cluster to migrate indexes and for other tasks during the migration, create an EC2 instance in the same VPC as your cluster and install the mongo shell. For instructions, see Getting Started with Amazon DocumentDB. When creating AWS resources, we recommend that you follow the AWS IAM best practices. To verify the connection to Amazon DocumentDB, enter the following CLI command:

[ec2]$ mongo --ssl --host docdb-cluster-endpoint \
--sslCAFile rds-ca-2019-root.pem --username myuser \
--password mypassword
…
rs0:PRIMARY> db.runCommand('ping')
{ "ok" : 1 }

If you have trouble connecting to either your source instance or Amazon DocumentDB cluster, check the security group configurations for both to make sure that the EC2 instance has permission to connect to each on the correct port (27017 by default). For more information about troubleshooting, see Troubleshooting Amazon DocumentDB.
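To separate a network or security group problem from an authentication problem, you can first test whether the port is reachable at all. The following Python sketch uses only the standard library; the function name can_connect is hypothetical, and the host that you pass in is your own source or cluster endpoint.

```python
import socket


def can_connect(host: str, port: int = 27017, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port opens within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False
```

A False result from the EC2 instance usually points to a security group, routing, or DNS issue rather than to credentials or TLS settings.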

Amazon DocumentDB uses Transport Layer Security (TLS) encryption by default. To connect over a TLS-encrypted connection, download the certificate authority (CA) file to use the mongo shell to connect. See the following code:

[ec2]$ curl -O https://s3.amazonaws.com/rds-downloads/rds-ca-2019-root.pem

You can also disable TLS. For more information, see Encrypting Data in Transit.

For index and data migration, a key consideration is ensuring the EC2 instance’s Amazon EBS volume is large enough to hold the exported data. You can obtain a rough estimate of a database’s size in bytes by running the db.stats() command in the mongo shell and looking at the value of storageSize. See the following code:

mongoToDocumentDBHybridSet1:PRIMARY> db.stats()
{
"db" : "zips-db",
"collections" : 1,
"views" : 0,
"objects" : 193579,
"avgObjSize" : 65.97073815367189,
"dataSize" : 9843917,
"storageSize" : 8125248,
"numExtents" : 0,
"indexes" : 1,
"indexSize" : 610304,
"scaleFactor" : 1,
"fsUsedSize" : 2396921856,
"fsTotalSize" : 8577331200,
"ok" : 1,
"$clusterTime" : {
"clusterTime" : Timestamp(1582412608, 1),
"signature" : {
"hash" : BinData(0,"AAAAAAAAAAAAAAAAAAAAAAAAAAA="),
"keyId" : NumberLong(0)
}
},
"operationTime" : Timestamp(1582412608, 1)
}
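The sizing estimate can be turned into a small calculation. The following Python sketch is a rough guide under stated assumptions, not a definitive formula: it takes the larger of dataSize and storageSize (an assumption, because an uncompressed dump can be larger than compressed on-disk storage) and multiplies it by an arbitrary headroom factor of 1.5. The function name estimate_dump_volume_gib is hypothetical.

```python
import math


def estimate_dump_volume_gib(stats: dict, headroom: float = 1.5) -> int:
    """Estimate the EBS volume size in GiB needed to hold a dump of one database."""
    # Assumption: size the volume for whichever figure is larger.
    size_bytes = max(stats["dataSize"], stats["storageSize"])
    needed_bytes = size_bytes * headroom
    # Round up to whole GiB, with a floor of 1 GiB.
    return max(1, math.ceil(needed_bytes / 1024**3))


# Usage with the db.stats() output shown above:
# estimate_dump_volume_gib({"dataSize": 9843917, "storageSize": 8125248})  -> 1
```

If you dump several databases, sum the estimates for each one.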

Hybrid migration steps

The following diagram illustrates the seven steps of the hybrid migration process. The steps are as follows:

Application continues to write to source
Dump indexes using the Amazon DocumentDB Index Tool
Dump data using mongodump
Restore indexes using the Amazon DocumentDB Index Tool
Restore data using mongorestore
Replicate data with change data capture (CDC) using AWS DMS
Change application endpoint to Amazon DocumentDB cluster

Step 1: Application continues writing to source

When you use the hybrid method to migrate to Amazon DocumentDB, your application continues to write to the source MongoDB database. Step 7 discusses ceasing writes to the source database and changing the application to point to the target Amazon DocumentDB cluster.

Step 2: Dumping indexes using the Amazon DocumentDB Index Tool

Before you begin your migration, create the same indexes on your target Amazon DocumentDB cluster that you have on your source MongoDB cluster. Although AWS DMS handles the migration of data, it doesn’t migrate indexes. To migrate the indexes, on the EC2 instance that you created as a prerequisite, use the Amazon DocumentDB Index Tool to export indexes from the MongoDB cluster. You can get the tool by creating a clone of the Amazon DocumentDB tools GitHub repo and following the instructions in README.md.

The following code dumps indexes from your source MongoDB cluster to a directory on your EC2 instance:

python migrationtools/documentdb_index_tool.py --dump-indexes \
--dir ~/index.js/ \
--host ec2-user.us-west-2.compute.amazonaws.com \
--auth-db admin \
--username user \
--password password

2020-02-11 21:46:50,432: Successfully authenticated to database: admin
2020-02-11 21:46:50,432: Successfully connected to instance ec2-user.us-west-2.compute.amazonaws.com:27017
2020-02-11 21:46:50,432: Retrieving indexes from server...
2020-02-11 21:46:50,440: Completed writing index metadata to local folder: /home/ec2-user/index.js/

After the successful export of the indexes, the next step is to export the data. You restore the indexes in your Amazon DocumentDB cluster in Step 4.

Step 3: Dumping data using mongodump

Export the data from your MongoDB replica set to the EC2 migration instance using the mongodump tool. Set the --readPreference option to secondary to force the dump to connect to a secondary replica set member, which reduces the potential impact of mongodump on the source deployment. To use the --readPreference option, connect to the replica set using the form replicaSetName/replicaSetMember. See the following code:

[ec2]$ mongodump \
--host mongoToDocumentDBHybridSet1/ec2-x-x-x-x.us-west-2.compute.amazonaws.com \
--username user \
--password password --db zips-db -o . \
--authenticationDatabase admin \
--readPreference secondary
2020-02-03T20:39:05.649+0000 writing zips-db.zips to
2020-02-03T20:39:05.683+0000 done dumping zips-db.zips (29353 documents)

The time it takes the data to export depends on the size of the source dataset, the speed of the network between the migration instance and the source, and the migration instance’s resources. Record the start time of the mongodump process; you need this information to know when to start the DMS CDC process later.
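One way to avoid losing the start time is to launch the dump from a small wrapper that records it for you. The following Python sketch is only an illustration: the function name run_and_time is hypothetical, and the command list is whatever mongodump invocation you would otherwise type by hand.

```python
import subprocess
import time
from datetime import datetime, timezone


def run_and_time(cmd: list) -> tuple:
    """Run a command and return (UTC start time as ISO 8601, elapsed seconds)."""
    started_at = datetime.now(timezone.utc)
    t0 = time.monotonic()
    # check=True raises CalledProcessError if the command exits with an error.
    subprocess.run(cmd, check=True)
    elapsed = time.monotonic() - t0
    return started_at.strftime("%Y-%m-%dT%H:%M:%SZ"), elapsed


# Usage (arguments abbreviated; supply your own host and credentials):
# start, seconds = run_and_time(["mongodump", "--db", "zips-db", "-o", "."])
# print(f"mongodump started at {start} and took {seconds:.0f} seconds")
```

The elapsed time is also useful later, when you check that the source oplog covers the whole migration.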

After the successful export of the indexes and data, the next step is to restore the data and indexes in your Amazon DocumentDB cluster.

Step 4: Restoring indexes using the Amazon DocumentDB Index Tool

To restore the indexes that you exported in Step 2 in your target cluster, use the Amazon DocumentDB Index Tool.

The following code restores the indexes in your Amazon DocumentDB cluster from your EC2 instance:

python migrationtools/documentdb_index_tool.py --restore-indexes \
--dir ~/index.js/ \
--host docdb-2x2x-02-02-19-07-xx.cluster-xxxxxxxx.us-west-2.docdb.amazonaws.com:27017 \
--tls --tls-ca-file ~/rds-ca-2019-root.pem \
--username user \
--password password

2020-02-11 21:51:23,245: Successfully authenticated to database: admin
2020-02-11 21:51:23,245: Successfully connected to instance docdb-2x2x-02-02-19-07-xx.cluster-xxxxxxxx.us-west-2.docdb.amazonaws.com:27017
2020-02-11 21:51:23,264: zips-db.zips: added index: _id

To confirm that you restored the indexes correctly, connect to your Amazon DocumentDB cluster with the mongo shell and list the indexes for a given collection. See the following code:

mongo --ssl \
--host docdb-2020.cluster-xxxxxxxx.us-west-2.docdb.amazonaws.com:27017 \
--sslCAFile rds-ca-2019-root.pem --username documentdb --password documentdb
db.zips.getIndexes()

Step 5: Restoring data using mongorestore

To restore the data that you dumped in Step 3 in your target cluster, use the mongorestore utility.

The following code restores the data in your Amazon DocumentDB cluster from your EC2 instance. To increase the speed and parallelize the restore, use the --numInsertionWorkersPerCollection option. As a rule of thumb, set the numInsertionWorkersPerCollection value to the number of vCPUs on the cluster’s primary instance. Use option --noIndexRestore to avoid creating indexes twice, because you restored the indexes in Step 4. See the following code:

[ec2]$ mongorestore --host docdb-cluster-endpoint --ssl --sslCAFile rds-combined-ca-bundle.pem --username myuser --password mypassword --numInsertionWorkersPerCollection 64 --noIndexRestore <dump_dir>

If the mongodump operation includes all the databases from the source MongoDB cluster (for example, if --db option doesn’t specify an individual database to dump), remove the admin directory from the resulting dump directory. Otherwise, an error occurs when you attempt to restore to Amazon DocumentDB.
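If you script the migration, the cleanup can be a guarded delete. The following Python sketch assumes the layout that mongodump produces, with one subdirectory per database under the dump directory; the function name drop_admin_dump is hypothetical.

```python
import shutil
from pathlib import Path


def drop_admin_dump(dump_dir: str) -> bool:
    """Delete the admin database folder from a dump directory, if present."""
    admin_dir = Path(dump_dir) / "admin"
    if admin_dir.is_dir():
        shutil.rmtree(admin_dir)
        return True
    # Nothing to remove, for example when --db selected a single database.
    return False
```

The function returns True only when it removed something, so you can log whether the cleanup was needed.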

Pay attention to the total duration of the restore. The MongoDB oplog should be large enough to hold the changes made during this time, as well as during the time it takes to complete the online migration that Step 6 covers. The AWS DMS CDC task relies on the oplog to replicate data to Amazon DocumentDB.
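As a back-of-the-envelope check, you can compare the oplog window of your source (for example, as reported by rs.printReplicationInfo() in the mongo shell) with the time the migration needs. The following Python sketch makes two assumptions that you should adjust: it counts the dump time as well, because CDC starts from the time the dump began, and it applies an arbitrary safety factor of 1.5. Both function names are hypothetical.

```python
def required_oplog_hours(dump_hours: float, restore_hours: float,
                         cdc_hours: float, safety_factor: float = 1.5) -> float:
    """Hours of oplog history needed to cover the dump, restore, and CDC catch-up."""
    return (dump_hours + restore_hours + cdc_hours) * safety_factor


def oplog_window_is_sufficient(window_hours: float, dump_hours: float,
                               restore_hours: float, cdc_hours: float,
                               safety_factor: float = 1.5) -> bool:
    """Return True if the source oplog window covers the whole migration."""
    needed = required_oplog_hours(dump_hours, restore_hours, cdc_hours, safety_factor)
    return window_hours >= needed


# Usage: a 2-hour dump, 4-hour restore, and 2 hours of CDC catch-up
# required_oplog_hours(2, 4, 2)  -> 12.0
```

If the window is too small, increase the oplog size on the source before you start the dump.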

Step 6: Replicating data with CDC using AWS DMS

AWS DMS is a managed service that helps you migrate databases to AWS services efficiently and securely. AWS DMS enables database migration using two methods: full data load and CDC. The hybrid migration approach uses CDC to replicate changes to Amazon DocumentDB. For more information about using AWS DMS, see AWS Database Migration Service Step-by-Step Walkthroughs.

To perform the hybrid migration, complete the following steps:

Create an AWS DMS replication instance. For instructions, see Working with an AWS DMS Replication Instance.
For data migration, this post uses the dms.t2.medium instance type. AWS DMS uses the replication instance to run the task that migrates data from your MongoDB source to the Amazon DocumentDB target cluster.

Create the MongoDB source and Amazon DocumentDB target endpoints. For more information, see Working with AWS DMS Endpoints.
The following screenshot shows the endpoints for this post for the MongoDB cluster and target Amazon DocumentDB cluster.

Create a replication task to migrate the data between the source and target endpoints.
a. Choose the task type Replicate data changes only.
b. Enable Start task on create.
Your replication begins immediately after task creation. The following screenshot shows the status of a database migration task that has completed the full load and is currently performing ongoing replication.
If you choose the task mongodbtodocumentbd-online-fullandongoing, you can review more specific details. In the Table statistics section, the task shows the statistics of full data load, followed by the ongoing replication between the source and destination databases. See the following screenshot.
To verify that the number of documents matches in each, run the command db.collection.count() in your source and target databases.
You can also monitor the migration’s status as an Amazon CloudWatch metric and create a dashboard to show progress. The following screenshot shows the rate of incoming CDC changes from the source database.
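If you have many collections, comparing counts by hand is error-prone. The following Python sketch compares two dictionaries of per-collection counts, which you would gather yourself by running db.collection.count() on each side. The function name count_mismatches is hypothetical.

```python
def count_mismatches(source_counts: dict, target_counts: dict) -> dict:
    """Return {collection: (source_count, target_count)} for every difference."""
    mismatches = {}
    for name in sorted(set(source_counts) | set(target_counts)):
        src = source_counts.get(name)  # None if the collection is missing
        tgt = target_counts.get(name)
        if src != tgt:
            mismatches[name] = (src, tgt)
    return mismatches


# Usage:
# count_mismatches({"zips": 29353}, {"zips": 29353})  -> {}
```

While the application is still writing to the source, small differences can appear and disappear as CDC catches up, so treat a mismatch as a prompt to recheck rather than as proof of data loss.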

Step 7: Changing the application endpoint to an Amazon DocumentDB cluster

After the data restore is complete and the CDC process is replicating continuously, you are ready to stop writes to the source database and change your application’s database connection string to use your Amazon DocumentDB cluster. For more information, see Understanding Amazon DocumentDB Endpoints and Best Practices for Amazon DocumentDB.

Summary

This post described how to migrate data from MongoDB to Amazon DocumentDB by using the hybrid method. For more information about other migration methods, see Migrate from MongoDB to Amazon DocumentDB using the offline method, Migrating to Amazon DocumentDB with the online method, and Ramping up on Amazon DocumentDB (with MongoDB compatibility).

If you have any questions or comments, please leave your thoughts in the comments section.

About the Authors

Vijay Injam is a NoSQL Data Architect at Amazon Web Services.

Jeff Duffy is a Sr NoSQL Specialist Solutions Architect at Amazon Web Services.

Joseph Idziorek is a Principal Product Manager at Amazon Web Services.

AWS Database Blog