Migrate Apache Cassandra databases to Amazon DynamoDB more easily

Customers tell us that migrating data between different database engines—also known as a heterogeneous migration—can be challenging and time consuming. Some customers such as Samsung had to figure out on their own how to migrate their Apache Cassandra databases to Amazon DynamoDB (see Moving a Galaxy into the Cloud: Best Practices from Samsung on Migrating to Amazon DynamoDB). These migrations between NoSQL databases can be more difficult because of the scale of the data and the rate of change. For more details, watch this Applied Live Migration to DynamoDB from Cassandra tech talk.

AWS Database Migration Service (AWS DMS) and the AWS Schema Conversion Tool (AWS SCT) have been making it easier for AWS customers to migrate their databases to the AWS Cloud. You can use AWS DMS and the AWS SCT to migrate from any supported sources (such as MongoDB) to any supported targets (such as Amazon DynamoDB). Today, we are making it easier to migrate from Cassandra to DynamoDB by using AWS DMS and the AWS SCT.

This post walks you through the step-by-step process to bulk-load data from a Cassandra database into DynamoDB tables. It also shows how to keep your DynamoDB tables in full sync with their source until you are ready to cut over by using AWS DMS and the AWS SCT.

Why should I migrate from Cassandra to DynamoDB?

Customers tell us that the Cassandra architecture requires significant operational overhead, and the expertise can be difficult and expensive for them to find. On the other hand, DynamoDB is a fully managed service, which allows software engineers to focus on business innovation rather than on managing and maintaining database infrastructure.

The serverless provisioning model of DynamoDB also eliminates the need to overprovision database infrastructure and is provided without the need for specialized resourcing or licensing. As a result, customers report that DynamoDB-backed applications run with as much as a 70 percent total cost of ownership savings when compared to Cassandra.

Popular Cassandra features and third-party tools such as Transparent Data Encryption, multiple data center replication, and backup and restore are simplified with DynamoDB. Global tables, point-in-time recovery, and encryption at rest provide developers similar functionality to what Cassandra offers. However, these capabilities have push-button implementation without overhead or downtime.

How Cassandra-to-DynamoDB migration works

To offload the migration load from your primary Cassandra cluster, and to ensure necessary data consistency for additional migration processing, you can create a new on-premises or Amazon EC2 Cassandra data center.

Follow these steps to migrate data from a Cassandra cluster to a DynamoDB target:

Roll out a new Cassandra data center using the AWS SCT Clone Data Center Wizard, or prepare and use the data center on your own.
Extract the data from the existing or newly cloned Cassandra cluster by using data extraction agents, the AWS SCT, and AWS DMS tasks.

The current version of the Cassandra data extraction agent supports the most popular versions of Apache Cassandra, which are 3.1.1 and 3.0. The agent also works with the earlier Apache Cassandra versions 2.2 and 2.1. Currently, no other versions are supported.

The following diagram illustrates this migration approach.

Diagram demonstrating the migration approach just described

Data extraction is carried out directly from binary .db files with the Cassandra driver and data extraction agents. The following are the main benefits of this approach:

You can use multiple data extraction agents as nodes to expedite the data extraction process.
Access is required to file systems only (there is no need for the Cassandra cluster to be active).

During the data extraction process, data is extracted into .csv files, and metadata is stored in table-mapping and task-setting JSON files, which AWS DMS tasks use.
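To give a sense of what a table-mapping file looks like, here is a minimal sketch that builds an AWS DMS table-mapping document with one selection rule per table. The keyspace and table names are hypothetical, and we assume that the Cassandra keyspace is used as the schema name. The files that the AWS SCT actually generates may contain additional rules and settings.

```python
import json
from typing import Sequence


def build_table_mapping(keyspace: str, tables: Sequence[str]) -> str:
    """Return a DMS table-mapping JSON document with one selection rule per table."""
    rules = []
    for index, table in enumerate(tables, start=1):
        rules.append({
            "rule-type": "selection",
            "rule-id": str(index),
            "rule-name": str(index),
            "object-locator": {
                # Assumption: the Cassandra keyspace plays the role of the schema.
                "schema-name": keyspace,
                "table-name": table,
            },
            "rule-action": "include",
        })
    return json.dumps({"rules": rules}, indent=2)


if __name__ == "__main__":
    # Hypothetical keyspace and table names, for illustration only.
    print(build_table_mapping("shop", ["orders", "customers"]))
```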

In the remainder of this post, we follow a series of steps to demonstrate how to migrate data from a Cassandra cluster to DynamoDB:

Identify the Cassandra cluster that has data you want to migrate to DynamoDB.
Optionally, switch the Cassandra cluster to a multi data center cluster configuration, and add a new data center with the replication factor set to 1 using the bulk extraction approach. Choosing this option gives you more resiliency in Cassandra, which is helpful when you add load during the migration.
Use the AWS SCT to convert Cassandra tables to DynamoDB structures.
Extract data from Cassandra tables with the help of the AWS SCT data extraction agents and write the data into .csv files.
Upload the .csv files to Amazon S3 by using the AWS SCT.
Load the .csv files into DynamoDB by using AWS DMS.
Capture data changes as part of the AWS DMS ongoing replication process.

The migration

The migration process includes two main steps:

Switch the Cassandra cluster to a multi data center cluster.
Extract data from the Cassandra cluster by using data extraction agents, the AWS SCT, and AWS DMS tasks.

Part 1: Switch the Cassandra cluster to a multi data center cluster

This part of the migration involves adding a data center to an existing cluster. This data center receives all data from the original cluster, and then the data is downloaded only from the newly added data center. Follow the steps in this section to clone the existing data center.
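The wizard automates this work, and this post does not document the exact commands it runs. For orientation only, the following sketch shows what the equivalent manual steps typically look like in Cassandra: altering the keyspace replication map to include the new data center, and then rebuilding each new node from the source data center. The keyspace and data center names are hypothetical.

```python
from typing import Dict, List


def alter_keyspace_cql(keyspace: str, replication: Dict[str, int],
                       new_dc: str, new_dc_rf: int = 1) -> str:
    """Build an ALTER KEYSPACE statement that adds new_dc to the replication map."""
    if new_dc in replication:
        raise ValueError(f"data center {new_dc!r} is already in the replication map")
    merged = dict(replication)
    merged[new_dc] = new_dc_rf
    options = ", ".join(f"'{dc}': {rf}" for dc, rf in merged.items())
    return (
        f"ALTER KEYSPACE {keyspace} WITH replication = "
        f"{{'class': 'NetworkTopologyStrategy', {options}}};"
    )


def rebuild_command(source_dc: str) -> List[str]:
    """Command to run on each node of the new data center to stream data from source_dc."""
    return ["nodetool", "rebuild", "--", source_dc]


if __name__ == "__main__":
    # Hypothetical names: keyspace "shop", source data center "dc1", clone "dc1_tgt".
    print(alter_keyspace_cql("shop", {"dc1": 3}, "dc1_tgt"))
    print(" ".join(rebuild_command("dc1")))
```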

Step 1: Create a new project with the AWS SCT

Note: If you already have a Cassandra data center from which you want to replicate, you can skip this step and go directly to part 2.

Install the AWS SCT. Then follow these steps:

Start the AWS SCT.
Choose File, New project.
Choose NoSQL database, and then choose OK.
Connect to Cassandra and use the data center’s public or private IP address. If your instance running the AWS SCT is in the same virtual private cloud (VPC) as the cloned data center that is being used as a source, you can use the private IP address. Alternatively, if private communication is not possible, you can use the public IP address.
Open the context (right-click) menu for the data center, and choose Clone datacenter for extract.

Step 2: Configure the source Cassandra data center

Enter all required details of the source Cassandra cluster that you’re trying to switch to multi data center mode.


Step 2.1: Supply the source cluster parameters

The data center name is selected by default. This setting can’t be edited because the AWS SCT automatically collects this information from the Cassandra configuration.
For Snitch mode, choose Ec2Snitch.

If the source and target data centers are in Amazon EC2 but in different AWS Regions, choose Ec2MultiRegionSnitch for the source data center. In all other cases, retain Snitch mode.
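The snitch rule above can be written as a small decision function. This is only a sketch of the rule as stated in this post. The function name and the convention of passing None for a data center that is not in Amazon EC2 are our own assumptions.

```python
from typing import Optional


def source_snitch(current_snitch: str,
                  source_region: Optional[str],
                  target_region: Optional[str]) -> str:
    """Pick the snitch for the source data center.

    Pass None as the Region for a data center that is not running in Amazon EC2.
    """
    both_in_ec2 = source_region is not None and target_region is not None
    if both_in_ec2 and source_region != target_region:
        return "Ec2MultiRegionSnitch"
    # In all other cases, keep the snitch mode that is already configured.
    return current_snitch
```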

Step 2.2: Supply the Cassandra node parameters

The cluster information that you entered in step 2 is displayed by default.
Complete all required boxes for nodes, or Import connection data from a .csv file.
For future reference, you can save all entered data from List of nodes to a .csv file by using Export.
Choose Next to check for parameter mismatches between the cassandra.yaml files of all source data center nodes.
Note: If some parameters (other than IP addresses) differ between the .yaml files, a mismatch report appears, and you must select the node whose cassandra.yaml file is used as the template for the target data center nodes. If there are no differences, the next step becomes available.
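If you want to check for mismatches yourself before running the wizard, a comparison along the following lines is one option. This is a sketch, not the check that the AWS SCT performs. It assumes that each node's cassandra.yaml has already been parsed into a dictionary, and the set of address settings to ignore is our own choice.

```python
from typing import Any, Dict

# cassandra.yaml settings that hold node addresses and are expected to differ per node.
ADDRESS_KEYS = {
    "listen_address",
    "rpc_address",
    "broadcast_address",
    "broadcast_rpc_address",
}


def find_mismatches(node_configs: Dict[str, Dict[str, Any]]) -> Dict[str, Dict[str, Any]]:
    """Return {parameter: {node: value}} for every non-address parameter that differs."""
    all_keys = set()
    for config in node_configs.values():
        all_keys.update(config)

    mismatches = {}
    for key in sorted(all_keys - ADDRESS_KEYS):
        values = {node: config.get(key) for node, config in node_configs.items()}
        # Compare by repr so that list and dict values can be compared as well.
        if len({repr(value) for value in values.values()}) > 1:
            mismatches[key] = values
    return mismatches
```

An empty result means that the node configurations agree on everything except their addresses.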

Step 3: Configure the target Cassandra data center

Note: If source data center nodes are configured on a private IP address, install Telnet on the target nodes.

Change the default directories for the target Cassandra data center, if required. If your source Cassandra cluster directory structure is different, you might want to change it here also.
Choose Snitch mode (possible snitch modes are PropertyFileSnitch, Ec2Snitch, GossipingPropertyFileSnitch, and Ec2MultiRegionSnitch).
Enter the data center suffix, or use the default, which is _tgt.
Enter the data center name.

Step 3.1: Add a new target node

Choose Add new node.
Enter the Private IP:SSH port: X.X.X:22
Enter the Public IP:SSH port: X.X.X:22
Enter the OS user: ubuntu
Enter the OS password: In this case, keep the box empty (if you set up a password in your system, enter the password in this box).
Enter the Key path location, or choose a file by browsing to it. In this case, the file is your-pem-file.pem.
Enter the Passphrase: If you set up a password, enter the password in this box.
Choose the Show Cassandra configuration link to open the Cassandra configuration YAML file. It shows the configuration parameters that were automatically collected from the source Cassandra data center and that are used in the cassandra.yaml file on the target data center nodes.
Choose Next to validate all inputs.
Choose Show Log to go through the automated validation process.

Step 4: Start data replication

Choose keyspaces:
- For Cassandra versions 3.x: Choose the keyspaces from which you want to copy data.
- For Cassandra versions 2.x: Choose one general row for all keyspaces.
Choose Start to start the data replication process to the target data center.
Start becomes Stop when replication starts, as shown in the following screenshot. You can monitor the progress of the replication process.
After the replication process has completed, choose Next.

Step 5: Review the data replication summary

The Current state of Cassandra cluster table is displayed with a list of source and target nodes.
Choose Finish. The Clone data center for extract page is closed, and the new data center is displayed in the source tree.

Part 2: Extract data from the Cassandra cluster using data extraction agents, the AWS SCT, and AWS DMS tasks

To extract data from the Cassandra cluster, you have to install and use the data extraction agents along with the AWS SCT.

To configure and prepare the data extraction agent, follow these steps in Migrating Data From Apache Cassandra to Amazon DynamoDB:

Install the prerequisites for the data extraction agent.
Install the AWS Cassandra data migration agent.
Configure the AWS Cassandra data migration agent.
Mount the Cassandra home and data directories.
Start the AWS Cassandra data migration agent.

Step 1: Switch to the AWS SCT after the data extraction agent is running

After the data extraction agent is set up and running successfully, return to the instance where the AWS SCT is installed and perform the following steps:

Add your AWS profile information to the AWS SCT (available in global settings in the AWS SCT). AWS profile information sets up the access key and secret access key to be used to communicate with AWS resources. For example, the AWS SCT uses this information to access DynamoDB tables in an account.
Choose File, and create a new Cassandra to DynamoDB project. Connect to Cassandra and DynamoDB. The Cassandra information is populated in the left panel, and the DynamoDB information is populated in the right panel, as shown in the following screenshot.
Choose the Cassandra data center for migration from the left panel, and switch to the Nodes tab.
Specify the correct IP address of the Cassandra node that you are migrating from, along with the jmxUser and jmxPassword for that node. Then choose Apply.
Choose the tables that you want to migrate in a given keyspace. Open the context (right-click) menu for the tables that you chose, and convert the tables to DynamoDB. The AWS SCT automatically converts the table structures from Cassandra to DynamoDB. After you convert these tables, you can set the required read and write capacity units for the table on the Settings tab for each table.
In the right panel in the AWS SCT, open the context (right-click) menu for all the converted tables in DynamoDB, and apply changes for the tables to be created in DynamoDB.
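The AWS SCT creates the tables for you when you apply the changes. To illustrate what a converted table definition amounts to on the DynamoDB side, the following sketch builds the parameters for a DynamoDB CreateTable request with a key schema and provisioned read and write capacity units. The table and attribute names are hypothetical, and the way the AWS SCT maps a particular Cassandra primary key to a DynamoDB key schema is not shown here.

```python
from typing import Any, Dict, Optional, Tuple

KeyAttribute = Tuple[str, str]  # (attribute name, DynamoDB type: "S", "N", or "B")


def create_table_params(table_name: str,
                        partition_key: KeyAttribute,
                        sort_key: Optional[KeyAttribute] = None,
                        read_capacity: int = 5,
                        write_capacity: int = 5) -> Dict[str, Any]:
    """Build the parameters for a DynamoDB CreateTable request."""
    key_schema = [{"AttributeName": partition_key[0], "KeyType": "HASH"}]
    attribute_definitions = [
        {"AttributeName": partition_key[0], "AttributeType": partition_key[1]}
    ]
    if sort_key is not None:
        key_schema.append({"AttributeName": sort_key[0], "KeyType": "RANGE"})
        attribute_definitions.append(
            {"AttributeName": sort_key[0], "AttributeType": sort_key[1]}
        )
    return {
        "TableName": table_name,
        "KeySchema": key_schema,
        "AttributeDefinitions": attribute_definitions,
        "ProvisionedThroughput": {
            "ReadCapacityUnits": read_capacity,
            "WriteCapacityUnits": write_capacity,
        },
    }
```

If you use the AWS SDK for Python, a dictionary like this could be passed to the DynamoDB client as create_table(**params).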

Step 2: Register the data extraction agent

In this step, you register the data extraction agent that you configured in the previous steps. Follow these steps in the AWS SCT to register the data extraction agent:

Choose View, and then choose Data Migration View.
Choose Register.
Enter the agent name, the host, and the port of the machine on which the agent is set up. You also can decide if you want to use Secure Sockets Layer (SSL) for the agent to connect with your Cassandra data center.
Choose Register. You should see the agent in Active status.

Step 3: Create the extraction task and AWS DMS task

After the data extraction agent is registered in the AWS SCT, you can create tasks to extract data from Cassandra and migrate it to DynamoDB.

Create a local extraction task that uses the data extraction agent to collect bulk-load data and ongoing changes from the Cassandra data center to Amazon S3.
A remote AWS DMS task gets the extracted data from Amazon S3 and migrates it to DynamoDB.
After you register the data extraction agent, in the left panel of the AWS SCT, open the context (right-click) menu for the Cassandra keyspace from which you want to migrate. Choose Create local & DMS task.
Enter a friendly Task name that you can remember.
Choose a DMS Replication instance in your AWS account. The AWS SCT uses the AWS profile information to look up AWS DMS resources in the account. If you don’t have one, create a replication instance in AWS DMS.
Choose the appropriate Migration type. You can do a one-time load or load all existing data with ongoing replication.
If you already used the AWS SCT to create tables in DynamoDB in the previous step, set the Target table prep mode to Do nothing. If you have not, you can use this setting to have the tables created automatically in DynamoDB.
Choose the IAM role for the extraction and migration. The privileges required for this IAM role are documented in Using Data Extraction Agents.
Choose the appropriate Logging level to follow the status of the task.
Add an optional Description.
Enable Data encryption, if required.
If you want to delete the .csv files after uploading to Amazon S3, choose Delete files from the local directory. This helps ensure that the disk does not fill up.
Name the S3 Bucket that you want to extract the data to.
Choose Create to create the extraction task and AWS DMS task.
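The AWS SCT creates the AWS DMS task for you, so you do not have to call the AWS DMS API yourself. To show roughly how the choices in this dialog correspond to AWS DMS values, the following sketch builds part of a CreateReplicationTask request. It is deliberately partial: a real request also needs source and target endpoint ARNs and table mappings. The function name and its parameters are our own.

```python
import json
from typing import Any, Dict

VALID_PREP_MODES = ("DO_NOTHING", "DROP_AND_CREATE", "TRUNCATE_BEFORE_LOAD")


def build_task_request(task_name: str,
                       replication_instance_arn: str,
                       ongoing_replication: bool,
                       prep_mode: str = "DO_NOTHING",
                       enable_logging: bool = True) -> Dict[str, Any]:
    """Build a partial CreateReplicationTask request from the dialog choices."""
    if prep_mode not in VALID_PREP_MODES:
        raise ValueError(f"unknown target table prep mode: {prep_mode!r}")
    settings = {
        "FullLoadSettings": {"TargetTablePrepMode": prep_mode},
        "Logging": {"EnableLogging": enable_logging},
    }
    return {
        "ReplicationTaskIdentifier": task_name,
        "ReplicationInstanceArn": replication_instance_arn,
        # One-time load versus load of existing data plus ongoing replication.
        "MigrationType": "full-load-and-cdc" if ongoing_replication else "full-load",
        # AWS DMS expects the task settings as a JSON string.
        "ReplicationTaskSettings": json.dumps(settings),
    }
```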

Step 4: Start the AWS DMS task

Choose Start to start the task and monitor the data flow. The AWS DMS task always has the Running status because the agent waits for ongoing replication changes. You can pause the task, and after the migration is completed, you can delete the task and unregister the agent.

Step 5: Monitor the AWS DMS task

Monitor the progress of the AWS DMS task in the table on the Tasks tab. The status shows Full load progress %, Elapsed time, Tables loaded, Tables loading, and other variables. After the AWS DMS task starts, you can choose to monitor the task by using the Amazon CloudWatch metrics that AWS DMS exposes. For more information, see Monitoring AWS DMS Tasks.
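If you prefer to check progress from a script rather than from the Tasks tab, a sketch like the following reads the same kind of figures through the AWS DMS DescribeReplicationTasks API. It assumes that you pass in an AWS DMS client, for example one created with boto3.client("dms"), and the task ARN shown in the usage comment is a placeholder.

```python
from typing import Any, Dict


def summarize_task(dms_client: Any, task_arn: str) -> Dict[str, Any]:
    """Return the status and full-load statistics for one AWS DMS task."""
    response = dms_client.describe_replication_tasks(
        Filters=[{"Name": "replication-task-arn", "Values": [task_arn]}]
    )
    task = response["ReplicationTasks"][0]
    stats = task.get("ReplicationTaskStats", {})
    return {
        "status": task["Status"],
        "full_load_progress_percent": stats.get("FullLoadProgressPercent", 0),
        "elapsed_seconds": stats.get("ElapsedTimeMillis", 0) / 1000,
        "tables_loaded": stats.get("TablesLoaded", 0),
        "tables_loading": stats.get("TablesLoading", 0),
    }


# Example usage (requires boto3 and AWS credentials; the ARN is a placeholder):
#   import boto3
#   print(summarize_task(boto3.client("dms"), "arn:aws:dms:..."))
```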

Summary

You can use the solution in this post to migrate your Apache Cassandra databases to DynamoDB, and at the same time keep your Cassandra source databases completely functional during the migration process. When you are ready, you can choose to cut over your applications to DynamoDB with minimal downtime. You also can use multiple data extraction agents to migrate tables from multiple Cassandra keyspaces at once to expedite your migration to DynamoDB.

Important notes:

The original Cassandra data centers remain unchanged when you add a new data center with a replication factor of 1. Therefore, no data is lost. We follow the DataStax user guide for data center replication, which is why we recommend using local_quorum for end-user connections. This helps you avoid changes to the target data center that could make it inconsistent, because the target data center must be guarded from changes during the initial data extraction. We also use bootstrapping options on target data centers to protect them from data changes. These precautions allow for consistent rebuilds and further data extraction. Data in the target data center has to be consistent just after you clone a data center, and it must stay unchanged during extraction. In addition, we use the auto_bootstrap: false option and the disableautocompaction command to protect the target data center from any data changes during the initial data extract.
The AWS Schema Conversion Tool starts extraction only after the new data center is created and is in a healthy state. The wizard checks for the Rebuild successful status for every target node. To ensure a successful rebuild, the source data center should have consistent data just before the start of data replication. We recommend performing flush and repair commands on the source data center to make it consistent.
The only issue that might arise is the corruption of one or more of the new data center nodes, which could force the migration to restart (no data is lost; it just takes more time to migrate). If a node in the target data center becomes corrupted, you can start the wizard from scratch and replicate the new data center again. We use change data capture (CDC) to extract new data that arrives in the source data center after the initial data extraction has started on the target data center. We automatically enable incremental backup on the target data center just before issuing the rebuild command.
After the initial data extraction process finishes and before you start CDC data extraction, you should:
- Enable bootstrapping on the target data center.
- Change the consistency level (CL) to EACH_QUORUM in the application connectors.
- Perform a repair on the target data center nodes.
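The flush and repair commands mentioned in these notes are standard nodetool operations. As a minimal sketch, the following helpers build the nodetool command lines for the source data center (before replication starts) and for the target data center (before CDC extraction starts). The host names are hypothetical, and running flush and then repair node by node is our own assumption about ordering.

```python
from typing import List, Sequence


def nodetool(host: str, *args: str) -> List[str]:
    """Build a nodetool command line for the given host."""
    return ["nodetool", "-h", host, *args]


def source_preparation_commands(source_hosts: Sequence[str]) -> List[List[str]]:
    """Flush and repair each source node before data replication starts."""
    commands = []
    for host in source_hosts:
        commands.append(nodetool(host, "flush"))
        commands.append(nodetool(host, "repair"))
    return commands


def pre_cdc_repair_commands(target_hosts: Sequence[str]) -> List[List[str]]:
    """Repair each target node before CDC data extraction starts."""
    return [nodetool(host, "repair") for host in target_hosts]


if __name__ == "__main__":
    # Hypothetical host names; this prints the commands instead of running them.
    for command in source_preparation_commands(["10.0.0.1", "10.0.0.2"]):
        print(" ".join(command))
```

Each command list can be passed to subprocess.run(command, check=True) on a machine that has nodetool installed and network access to the nodes.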

About the authors

Arun Thiagarajan is a database engineer with the AWS DMS and AWS SCT team at Amazon Web Services. He works on challenges related to database migrations and works closely with customers to help them realize the true potential of AWS DMS. He has helped migrate hundreds of databases to the AWS Cloud by using AWS DMS and the AWS SCT.

Mahesh Kansara is a database engineer at Amazon Web Services. He works with customers to provide guidance and technical assistance about database and analytical projects, helping them to improve the value of their solutions when using AWS.

AWS Database Blog