AWS Database Blog

Migrate your Azure Cosmos DB SQL API to Amazon DocumentDB (with MongoDB compatibility) using native tools and automation

While migrating workloads from the Azure Cloud to the AWS Cloud, organizations explore optimal, managed database services to replace their Cosmos DB databases. As NoSQL databases become more ubiquitous, especially those that support the Apache 2.0 open-source MongoDB APIs, our customers often choose Amazon DocumentDB because it’s a scalable, highly durable, and fully managed database service for operating mission-critical MongoDB workloads. Then they seek out a simple, automated, and secure solution to migrate their Azure Cosmos DB databases with SQL API to Amazon DocumentDB (with MongoDB compatibility).

Common migration requirements include:

  • A simple-to-use, single-command migration process
  • Automatically provisioned resources in your AWS Cloud account
  • Enforcement of production-grade security and encryption when transferring data
  • Performing data integrity checks upon migration

In this post, we provide automation code and detailed instructions to automatically migrate an Azure Cosmos DB database to an Amazon DocumentDB database, using native Cosmos DB and MongoDB client tools, and AWS CloudFormation provisioning.

The solution discussed in this post is suitable for use cases where your database and your application can tolerate an outage during the offline migration. If you need to minimize downtime and keep your source Cosmos DB and target Amazon DocumentDB databases synchronized until the migration cutover, you can migrate from Azure Cosmos DB to Amazon DocumentDB using the online method.

Overview of solution

The following is an outline of the migration process:

  1. Create an AWS Identity and Access Management (IAM) role in your AWS account that is used to perform the migration.
  2. Provision the Amazon Elastic Compute Cloud (Amazon EC2) worker instances, IAM role, and security groups using the provided CloudFormation template.
  3. Log in to the Windows EC2 worker instance using AWS Systems Manager Session Manager and invoke the migration process.
  4. Verify that the migration of data from Cosmos DB to Amazon DocumentDB was successful.
  5. Delete temporary resources upon completion of the migration.

After the migration process is started at the Windows EC2 worker instance, it performs the following tasks:

  1. Data is securely exported (HTTPS/TLS 1.2) from Azure Cosmos DB to a JSON file using the dt.exe tool on the Windows EC2 instance. The Data Migration Tool (dt.exe) is open-source software that exports data from Azure Cosmos DB SQL API databases to JSON files.
  2. The JSON file is securely copied from the Windows EC2 instance to the Linux EC2 instance.
  3. The JSON file is imported into Amazon DocumentDB from the Linux EC2 instance, using the mongoimport tool.
  4. As part of integrity verification, imported data is exported again from Amazon DocumentDB to check for potential differences between the source data from Cosmos DB, and the imported and exported data from Amazon DocumentDB.

The following diagram illustrates this workflow.

Architecture diagram that illustrates solution

The Multi-AZ setup is not required for the migration process, but it illustrates the high availability and disaster recovery best practices from the operations guidelines.

Costs

The resources provisioned by the CloudFormation template in this exercise are intended to fit within the AWS Free Tier. Actual billing in individual accounts may differ, due to several factors, such as Free Tier credits already consumed or increasing EC2 instance size. You can estimate your actual costs with the AWS Pricing Calculator. Also consider potential costs on the Azure side. Consult Azure’s pricing documentation to estimate the actual costs for your migration scenario.

Prerequisites

The migration process assumes that the Azure Cosmos DB account, database, and collection already exist and are available. Equally, the assumption is that the target Amazon DocumentDB cluster is provisioned and available. The example collection that we use throughout the migration is called us-zip-codes and was imported into the source Cosmos DB database from https://media.mongodb.org/zips.json.

Note that Amazon DocumentDB is a MongoDB-compatible database service, and its API, operations, and data types differ from those of Cosmos DB with SQL API. Therefore, review the Supported MongoDB APIs, Operations, and Data Types page in the Amazon DocumentDB Developer Guide before the migration to verify that the features used by your application are supported after you switch to the new database.

As a best practice, we recommend that you first create indexes in Amazon DocumentDB before beginning your migration, because this can help improve the database performance and reduce the elapsed time of the migration. To learn more about the different index types and how to best work with them in Amazon DocumentDB, visit this post: How to index on Amazon DocumentDB (with MongoDB compatibility).

Other prerequisites are as follows:

  • The migration automation will be deployed in the existing AWS account, where the Amazon DocumentDB database cluster is running.
  • You need to obtain the following parameters related to your target cluster before you begin the migration:
    • Cluster endpoint (for example, docdb-demo-dev-par-cluster.cluster-1234567890abcdef0.eu-west-3.docdb.amazonaws.com).
    • Cluster port (for example, 27017).
    • Name of the database that will be imported (for example, PostalService).
    • Name of the collection that will be imported (for example, us-zip-codes).
    • Security group ID of the security group associated with Amazon DocumentDB (for example, sg-1234567890abcdef0).
  • We assume that the Amazon DocumentDB credentials are stored in AWS Secrets Manager. The Linux EC2 worker instance obtains the credentials automatically from Secrets Manager when performing the data import. It expects the target Amazon DocumentDB secret keys in Secrets Manager to be named username and password, which is the default and standard naming of keys for the user name and password of various database services in AWS. You need to provide the following parameters to the CloudFormation stack:
    • Secret ARN of the Amazon DocumentDB secret in Secrets Manager.
    • Security group ID of the security group associated with Secrets Manager’s VPC endpoint.
  • You have created the VPC and partitioned it into private and public subnets. The migration EC2 worker instances are intended to be deployed in a private subnet to help enhance security. Nevertheless, the EC2 instances in the private subnet must have internet access via NAT Gateway. Therefore, you must obtain the following parameters before you begin the migration:
    • VPC ID (for example, vpc-1234567890abcdef0).
    • Subnet ID of the subnet where Windows and Linux EC2 worker instances will be provisioned. It should be the same subnet (for example, subnet-1234567890abcdef0).
  • Use an existing EC2 key pair or create a new one, and download the associated .pem file. We need this key pair when we provision the EC2 worker instances involved in the migration. You provide the private key (.pem format) as a CloudFormation stack parameter. This private key is used to establish a secure connection between the Windows and Linux EC2 worker instances when the data is copied.
  • You must have your Azure account, the source Cosmos DB database, and collection available. These parameters need to be obtained and provided to the CloudFormation stack during the migration process:
    • AccountEndpoint, including URL and port (for example, https://cosmos-db01.documents.azure.com:443).
    • AccountKey (for example, ajh5GkvckvqXPbwlgJzA9ZpByJVJEm5oTRbCCG0YYST6DaPLbE7REiTR70JHcLlOQrsX2EeVd76iPnM7mH7vTw==).
    • Database name (for example, PostalService).
    • Collection name (for example, us-zip-codes).
  • Cosmos DB in Azure cloud must allow access from the Windows EC2 worker instance in the AWS Cloud, which will perform the data export. To accomplish this step, first you need to determine the public IP address of the associated NAT Gateway. Having previously obtained the VPC ID (vpc-1234567890abcdef0) and the private subnet ID (subnet-1234567890abcdef0) where your EC2 instance will be running, you can use the AWS Command Line Interface (AWS CLI) and run the following commands.
NOTE: replace the illustrative values for VPC ID and subnet ID used in the example commands with your own values, including the NAT Gateway ID and public IP address that you will get as outputs of these commands.
$ nat_gateway=$(aws ec2 describe-route-tables --filters "Name=vpc-id,Values=vpc-1234567890abcdef0" "Name=association.subnet-id,Values=subnet-1234567890abcdef0" --query 'RouteTables[].Routes[].NatGatewayId' --output text)

$ echo "$nat_gateway"
nat-1234567890abcdef0

$ aws ec2 describe-nat-gateways --filter "Name=nat-gateway-id,Values=$nat_gateway" --query "NatGateways[].NatGatewayAddresses[].PublicIp" --output text

13.37.48.48

Then go to the Azure Portal and change the firewall settings for your Cosmos DB account.
In the Azure Portal under Azure Cosmos DB, choose Firewall and virtual networks in the navigation pane. For Allow access from, select Selected networks and add the NAT Gateway’s public IP address retrieved in the previous step.

Create IAM resources

Following IAM best practices and the least privilege principle, you create a role that is used to provision temporary resources necessary for the migration. This way, we confirm that the user conducting the migration doesn’t have excessive privileges. Complete the following steps:

  1. Download the IAM policy, which we provide in this post, and then save it to your computer locally in the working directory.
  2. In this directory, run the following AWS CLI command:
$ aws iam create-policy --policy-name DBDocMigrationUserPolicy --policy-document file://CF-User-Policy.json

(An example output: arn:aws:iam::012345678910:policy/DBDocMigrationUserPolicy)

  3. Note the policy ARN from the output of the previous step.
  4. Proceed with the role creation commands (replace the illustrative 12-digit account number 012345678910 with your actual account number):
$ aws iam create-role --role-name DBDocMigRole --assume-role-policy-document '{"Version": "2012-10-17", "Statement": [{"Effect": "Allow", "Principal": {"AWS": "012345678910"}, "Action": "sts:AssumeRole"}]}'

$ aws iam attach-role-policy --role-name DBDocMigRole --policy-arn arn:aws:iam::012345678910:policy/DBDocMigrationUserPolicy
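
If you prefer not to edit the account number by hand, the trust policy can be generated from a variable. The following is a minimal sketch, not part of the provided automation: the trust_policy helper is a hypothetical name, and the commented usage assumes that the AWS CLI is configured for the target account.

```shell
# Hypothetical helper: print the trust policy document for `aws iam create-role`,
# so that the account ID is typed (or looked up) only once.
trust_policy() {
  local account_id="$1"
  # AWS account IDs are 12-digit strings; keep them as strings so that
  # leading zeros are preserved.
  case "$account_id" in
    *[!0-9]*) echo "error: account ID must contain only digits" >&2; return 1 ;;
  esac
  if [ "${#account_id}" -ne 12 ]; then
    echo "error: account ID must be exactly 12 digits" >&2
    return 1
  fi
  printf '{"Version": "2012-10-17", "Statement": [{"Effect": "Allow", "Principal": {"AWS": "%s"}, "Action": "sts:AssumeRole"}]}\n' "$account_id"
}

# Example usage (commented out because it calls AWS):
# account_id=$(aws sts get-caller-identity --query Account --output text)
# aws iam create-role --role-name DBDocMigRole \
#   --assume-role-policy-document "$(trust_policy "$account_id")"
```

The helper rejects anything that is not a 12-digit ID, which catches a truncated account number before the role is created.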

Create the CloudFormation stack

After you have created this role, create additional resources with AWS CloudFormation.

  1. Sign in to the AWS Management Console and then switch to the role you just created (DBDocMigRole).
  2. On the AWS CloudFormation console, choose Create stack.
  3. Select Template is ready.
  4. Select Upload a template file, click on Choose file, and upload the CosmosDB-Migration.yaml template, which contains the automation of the migration process.

  5. Give your stack a name, such as CosmosDB2DocDB-STK.
  6. Configure the parameters as instructed in the prerequisites.

Pay special attention to the private key parameter. You must copy and paste the whole private key as text in the PrivKey field. That is the .pem key which was referred to in the prerequisites. To do so, you may use the command line interface, print the content of the .pem key on the terminal and then copy the text of the whole key, including the headers: -----BEGIN RSA PRIVATE KEY----- and -----END RSA PRIVATE KEY-----.

  7. In the Configure stack options section, leave all values at default.
  8. Optionally, you can provide a name tag, such as CosmosDB2DocDB-STK.
  9. Choose Next.
  10. Select I acknowledge that AWS CloudFormation might create IAM resources.
  11. Choose Create stack.


When the provisioning process starts, you can watch the progress by refreshing the status page.


When the whole stack reaches the CREATE_COMPLETE state, you’re ready to start the migration process. It usually takes between 4 and 5 minutes for the stack to be created.

Perform the migration

To perform the migration, you access the Windows and Linux EC2 worker instances using Session Manager, so the EC2 instances don’t need to be associated with public IP addresses. This approach enhances security because the EC2 worker instances accept no incoming connections from the internet or from bastion hosts, including SSH and RDP.

  1. On the Systems Manager console, choose Instances in the navigation pane.
  2. Select the instance you want to connect to (docdb-demo-dev-par-ec2-win), and choose Connect.

  3. On the Session Manager tab, choose Connect to confirm.
  4. Repeat these steps to connect to the second instance (docdb-demo-dev-par-ec2-lin).
  5. To start the export/import process from Azure Cosmos DB to Amazon DocumentDB, run the following command from the Windows EC2 worker instance:
C:\CosmosDB2JSON\CosmosDB2JSON.bat


In parallel, you can monitor the logs on the Linux EC2 worker instance:

tail -f /var/log/JSON2DocumentDB


When the migration process starts in the Windows worker instance, it exports the collection from the Cosmos DB database into a JSON file. It uses Microsoft’s native tool dt.exe, known as the Data Migration Tool. Note the number of exported documents – in this example it’s 29,353. The script then produces an MD5 hash of the exported JSON file, and copies both the JSON and MD5 files to the Linux EC2 worker instance.


On the Linux instance side, we can observe by watching the log (/var/log/JSON2DocumentDB) that the service is automatically picking up the JSON file and performing the import and verification process.


The Linux EC2 worker instance receives the files from Windows instance (JSON and MD5) and then performs the following actions:

  • Automatically validates JSON file integrity (using the provided hash file)
  • Drops the us-zip-codes collection, in case it has already been created in the Amazon DocumentDB PostalService database
  • Imports the JSON file into the Amazon DocumentDB PostalService database, in the us-zip-codes collection
  • Exports back the newly imported collection into a JSON file
  • Compares both JSON files (the one that came from Cosmos DB and the one exported from Amazon DocumentDB) to validate that they’re identical

The log shows that 29,353 documents were imported into Amazon DocumentDB, and there is no difference in data between Cosmos DB and Amazon DocumentDB.

The migration of data from your Azure Cosmos DB database to Amazon DocumentDB database has been successfully completed at this point.

Verification tests

This section discusses aspects of verification to be performed during the migration.

First, verify the log on the Windows EC2 worker instance. It should be error-free. The critical step is running the dt.exe command. Check the output values for Transferred and Failed. The values Failed = 0 and Transferred > 0 indicate a successful export. Any other outcome indicates a problem, which must be investigated before the migration is re-attempted. Also verify that both files, us-zip-codes.json and us-zip-codes.md5, have been successfully copied to the Linux EC2 worker instance.

NOTE: us-zip-codes.json and us-zip-codes.md5 are example files, and their actual names may differ in your case. The file names are defined as ${DBCollectionName}.json and ${DBCollectionName}.md5, where ${DBCollectionName} is the destination collection name in your Amazon DocumentDB database. You provide this value as the DBCollectionName parameter when you create the CloudFormation stack.

On the Linux EC2 worker instance, open a Session Manager session beforehand and monitor the log, which is located in /var/log/JSON2DocumentDB. The following are important to verify:

  • No errors in the log. All commands should succeed without errors or warnings.
  • The number of imported documents, upon successful running of the mongoimport command, must be identical to the number of documents exported from Cosmos DB, which we previously observed in the output of the dt.exe command (29,353 documents).
  • The number of exported documents from Amazon DocumentDB, upon successful running of the mongoexport command, must be identical to the number of documents previously imported (29,353 documents).
  • There should be no output from the cmp command. The command compares the JSON file with the collection exported from Cosmos DB against the JSON file with the collection exported from Amazon DocumentDB. cmp is silent if the files are identical, which is what we expect. This step proves the integrity of the data in Amazon DocumentDB because it verifies that it’s identical to the data in Cosmos DB. If a message appears in the output of the cmp command, such as differ: char 1079, line 23, the migrated data isn’t consistent with the source. In that case, investigate the issue and repeat the migration process after the issue has been resolved.
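
Because cmp reports its result through the exit status as well as through its output, the last check is easy to script. The following sketch assumes that both files have already been normalized to the same document order. compare_exports is a hypothetical name, not a function from the template.

```shell
# compare_exports SOURCE_JSON TARGET_JSON
# Prints a one-word verdict and returns 0 (identical), 1 (different),
# or 2 (cmp could not read one of the files).
compare_exports() {
  local status=0
  # -s suppresses the "differ: ..." message; only the exit status is used.
  cmp -s "$1" "$2" 2>/dev/null || status=$?
  case "$status" in
    0) echo "identical" ;;
    1) echo "different"; return 1 ;;
    *) echo "error"; return 2 ;;
  esac
}

# Example usage:
# compare_exports /tmp/source.json /tmp/target.json || echo "do not cut over yet" >&2
```

Treating an exit status above 1 separately avoids mistaking a missing file for a data difference.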

If you encounter problems with inserting data into the Amazon DocumentDB cluster during the migration, consult the Amazon DocumentDB Developer Guide.

Implementation details

The provided CloudFormation template CosmosDB-Migration.yaml provisions and configures the necessary resources, including the EC2 worker instances. Both EC2 instances implement the migration logic and installation and configuration of necessary tools during the bootstrap phase, in the UserData section. These scripts are written in PowerShell and bash for Windows and Linux EC2 instances, respectively.

The concept that utilizes two independent EC2 worker nodes, one to export the data from the source database and another to import the data into the target Amazon DocumentDB database, is based on design patterns for parallel computing. The minimalistic migration example in this post is used for clarity and illustration, and it does not utilize the full potential of the solution, because it has one source database with only one collection. In more complex migration scenarios, the provided template and the bootstrap scripts can be modified to run migrations of multiple collections in parallel, from different sources. The custom service /usr/local/bin/JSON2DocumentDB created on the Linux EC2 worker instance continuously listens for new files in the /tmp directory:

inotifywait -q -m -e close_write,moved_to,create /tmp |
while read -r directory events filename; do
. . .

If a new file is copied to the /tmp directory and its name is ${DBCollectionName}.json, the service proceeds with the import of the input JSON file into the target Amazon DocumentDB database. This worker node is independent from the source, and JSON files can be copied to its /tmp directory from different locations and in parallel. For example, some collections can be exported from Cosmos DB using the Windows EC2 worker instance, as in our example, while in parallel we could copy static JSON collections that we have in other locations, such as S3 buckets. The code in the template can be expanded to support import of multiple different collections, to potentially different target database clusters, by altering the “if” section of the code. For example:

inotifywait -q -m -e close_write,moved_to,create /tmp |
while read -r directory events filename; do
  if [ \"\$filename\" = \"${CustomerCollection}.json\" ]; then
. . .
    mongoimport --ssl --host ${DBClusterEndpoint}:${DBClusterPort} --sslCAFile /etc/ssl/certs/rds-combined-ca-bundle.pem -u \"\$USR\" -p \"\$PASS\" -d ${DBDatabaseName} -c ${CustomerCollection} --file /tmp/${CustomerCollection}.json --jsonArray
. . .
  fi
  if [ \"\$filename\" = \"${ProductsCollection}.json\" ]; then
. . .
    mongoimport --ssl --host ${DBClusterEndpoint}:${DBClusterPort} --sslCAFile /etc/ssl/certs/rds-combined-ca-bundle.pem -u \"\$USR\" -p \"\$PASS\" -d ${DBDatabaseName} -c ${ProductsCollection} --file /tmp/${ProductsCollection}.json --jsonArray
. . .
  fi
done
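
With more than a couple of collections, one “if” block per collection becomes repetitive. One possible alternative, shown here as a sketch, is to look the file name up in a list. The COLLECTIONS variable, its values, and the collection_for_file function are hypothetical and do not exist in the provided template.

```shell
# Hypothetical list of collections to import. In the template, these names
# would come from CloudFormation parameters.
COLLECTIONS="customers products"

# collection_for_file FILENAME
# Prints the collection name if FILENAME is exactly "<collection>.json" for a
# configured collection; returns non-zero for any other file.
collection_for_file() {
  local name="${1%.json}" c
  # If stripping the suffix changed nothing, the file is not a .json file.
  [ "$name" != "$1" ] || return 1
  for c in $COLLECTIONS; do
    if [ "$c" = "$name" ]; then
      echo "$c"
      return 0
    fi
  done
  return 1
}

# Inside the watch loop, this could replace the per-collection "if" blocks:
# inotifywait -q -m -e close_write,moved_to,create /tmp |
# while read -r directory events filename; do
#   if collection=$(collection_for_file "$filename"); then
#     : # import "/tmp/$filename" into "$collection" here
#   fi
# done
```

Matching the whole file name matters because the service also writes its own verification file, ${DBCollectionName}.out.json, into /tmp. An exact match keeps that file from being picked up as new input.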

Furthermore, the CloudFormation template can be modified to provision multiple EC2 worker instances, and thus perform imports of different collections in parallel.

The Windows EC2 worker instance is required in this specific migration scenario, because Microsoft’s native Cosmos DB Data Migration Tool (dt.exe) is a Windows application. In a general case, you could modify the provided CloudFormation template to suit your needs. Possible modifications could range from fitting the whole migration process onto a single EC2 worker instance, where both export and import of data are performed in serial fashion, to having multiple export and import EC2 worker instances to perform automated parallel migrations from and to different databases and collections.

Note that the maximum supported size of the Cosmos DB collection that can be migrated within one process is limited to the storage space available in C:\ and /tmp directories on the Windows and Linux EC2 worker instances respectively. Therefore, provision sufficiently large EBS volumes that can support your migration workload.
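
A rough pre-flight check along these lines could be run on the Linux EC2 worker instance before you start an export. It is a sketch under our own assumptions: the factor of two reflects that /tmp holds both the copied JSON file and the file exported back from Amazon DocumentDB, and both function names are hypothetical.

```shell
# available_kb DIR
# Prints the free space of the file system that holds DIR, in KiB.
# -P forces the one-line-per-file-system POSIX output format.
available_kb() {
  df -Pk "$1" | awk 'NR==2 {print $4}'
}

# enough_space FILE_KB FREE_KB
# Succeeds if FREE_KB can hold two copies of a file of FILE_KB
# (the imported JSON file plus the one exported for verification).
enough_space() {
  [ "$2" -ge $(( $1 * 2 )) ]
}

# Example usage, with a hypothetical expected export size of 4 GiB:
# expected_kb=$(( 4 * 1024 * 1024 ))
# enough_space "$expected_kb" "$(available_kb /tmp)" ||
#   echo "not enough room in /tmp, provision a larger EBS volume" >&2
```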

Both Windows and Linux instances encrypt the data at rest, because their root EBS volumes are encrypted. This is achieved by specifying BlockDeviceMappings sections of the AWS::EC2::Instance resource definitions:

# Windows
BlockDeviceMappings:
  - DeviceName: /dev/sda1
    Ebs:
      VolumeType: gp2
      VolumeSize: '30'
      DeleteOnTermination: 'true'
      Encrypted: 'true'

# Linux
BlockDeviceMappings:
  - DeviceName: /dev/xvda
    Ebs:
      VolumeType: gp2
      VolumeSize: '15'
      DeleteOnTermination: 'true'
      Encrypted: 'true'

The data is always encrypted in transit while it’s being transferred from one node to another. The export of data from Azure Cosmos DB to the Windows EC2 worker instance is done over HTTPS (TLS 1.2). The files are copied from the Windows EC2 instance to the Linux EC2 instance over an encrypted SSH connection, within the private subnet. TLS encryption is then enforced during the import and export of data to and from Amazon DocumentDB. By default, encryption in transit is enabled for newly created Amazon DocumentDB clusters. You can find more information about data protection in Amazon DocumentDB in the Amazon DocumentDB Developer Guide. In this example, encryption in transit is enabled on the Amazon DocumentDB cluster, and therefore we must use the --ssl and --sslCAFile parameters when establishing connections from the native MongoDB client tools mongo, mongoimport, and mongoexport to the Amazon DocumentDB database endpoint.

The Linux EC2 worker instance custom service (JSON2DocumentDB) additionally performs an integrity check of data to and from Amazon DocumentDB. It imports the collection into the target database, and then it exports it again to a JSON file. Then it compares the two JSON files for differences, using the cmp command in conjunction with the jq command. The jq inner commands sort JSON data of both input files by _id key, to avoid potential false differences due to the different order of data within the two files.

cmp <(jq 'sort_by(.\"_id\")' -S /tmp/${DBCollectionName}.json) <(jq 'sort_by(.\"_id\")' -S /tmp/${DBCollectionName}.out.json)
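
To see why the sorting matters, the same comparison can be written as a standalone function and tried on small files. This is a sketch, not the code from the template: it assumes that jq is installed and that each file holds a single JSON array, and the names normalized and same_documents are hypothetical.

```shell
# normalized FILE
# Prints the JSON array in FILE sorted by _id, with the keys of every
# object sorted (-S), so that neither document order nor key order matters.
normalized() {
  jq -S 'sort_by(._id)' "$1"
}

# same_documents FILE_A FILE_B
# Succeeds if both files contain the same documents after normalization.
same_documents() {
  local tmpdir status=0
  tmpdir=$(mktemp -d) || return 2
  { normalized "$1" > "$tmpdir/a.json" &&
    normalized "$2" > "$tmpdir/b.json" &&
    cmp -s "$tmpdir/a.json" "$tmpdir/b.json"; } || status=$?
  rm -rf "$tmpdir"
  return "$status"
}

# Example usage:
# same_documents /tmp/us-zip-codes.json /tmp/us-zip-codes.out.json &&
#   echo "no differences"
```

Writing the normalized output to temporary files instead of using process substitution keeps the function usable in shells other than bash, at the cost of extra disk space.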

As mentioned earlier in the Prerequisites section, the service expects Amazon DocumentDB secret keys in Secrets Manager to be named username and password, which is the default and standard naming of keys for the user name and password of various database services in AWS. If these keys differ in your secret, then you can modify the bootstrap script and adjust it to your needs. You may change the UserData section of LinuxEC2Instance resource in the provided CloudFormation template and replace .username and .password with the actual keys that you use:

# Retrieve the DocumentDB credentials

USR=\$(aws secretsmanager get-secret-value --region ${AWS::Region} --secret-id ${DocumentDBSecret} --version-stage AWSCURRENT | jq --raw-output '.SecretString' | jq -r .username)

PASS=\$(aws secretsmanager get-secret-value --region ${AWS::Region} --secret-id ${DocumentDBSecret} --version-stage AWSCURRENT | jq --raw-output '.SecretString' | jq -r .password)
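
If your secret uses different key names, one way to keep the change in a single place is to read the key names from variables. The following sketch is written as a plain shell script, without the escaping that the UserData section requires. The secret_field function and the USER_KEY and PASS_KEY variables are hypothetical, and jq is assumed to be installed.

```shell
# secret_field SECRET_JSON KEY
# Prints the value stored under KEY in the JSON text SECRET_JSON.
secret_field() {
  printf '%s' "$1" | jq -r --arg key "$2" '.[$key]'
}

# Hypothetical settings: change these if your secret uses other key names.
USER_KEY="username"
PASS_KEY="password"

# Example usage (commented out because it calls AWS; DocumentDBSecret stands
# for the secret ARN that you pass to the stack):
# secret_json=$(aws secretsmanager get-secret-value \
#   --secret-id "$DocumentDBSecret" --query SecretString --output text)
# USR=$(secret_field "$secret_json" "$USER_KEY")
# PASS=$(secret_field "$secret_json" "$PASS_KEY")
```

Fetching the secret once and extracting both fields from the same JSON text also halves the number of calls to Secrets Manager.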

Clean up

To avoid incurring future charges, delete the resources upon completion of migration.

  1. Delete the CloudFormation stack on the AWS CloudFormation console.


  2. Delete the IAM role and policy. NOTE: replace the illustrative AWS account number 012345678910 with your own actual value.
$ aws iam detach-role-policy --role-name DBDocMigRole --policy-arn arn:aws:iam::012345678910:policy/DBDocMigrationUserPolicy
$ aws iam delete-role --role-name DBDocMigRole
$ aws iam delete-policy --policy-arn arn:aws:iam::012345678910:policy/DBDocMigrationUserPolicy
  3. After you complete the migration, you may visit the Azure Portal and regenerate both the primary and secondary account keys in your Cosmos DB account.

This step is optional. In case the migration was conducted by a multi-disciplined team and the account keys were exposed to team members that were outside of the need-to-know group, then regenerating the keys could help you keep your data secure until the Cosmos DB database is eventually decommissioned.

Conclusion

In this post, we presented how you can perform a migration of a document database from the Azure Cosmos DB SQL API to Amazon DocumentDB using native tools and automation. This migration process implements the offline migration strategy, which requires the application to be taken offline, or to operate in read-only mode, for the period of the migration. The process is secured and data is encrypted while being copied from Cosmos DB to Amazon DocumentDB. The whole process is automated in AWS CloudFormation and should require minimum effort from the staff operating the migration.

The source code referred to in this post is available in the GitHub repo.

If you have any questions or comments, post your thoughts in the comments section.


About the Authors

Igor Obradovic is a Senior Database Specialty Architect SME within the Professional Services team at Amazon Web Services. He helps organizations develop and conduct efficient and effective plans for their cloud adoption journey. Igor’s focus is on accelerating customers’ migrations of systems and workloads to the cloud, modernizing their applications, and maximizing the value of their investments.

Nicolas Ruiz is a Cloud Infrastructure Architect within the Professional Services team at Amazon Web Services. He assists global enterprises to adopt modern architectures and methodologies, such as large-scale transformations, mass migrations, complex architectures, artificial intelligence, data science, and big data.