How Kount migrated a critical workload to Amazon DynamoDB from Cassandra
Database migrations can be challenging projects. With a mix of unique business requirements, specific data models and architectures, and technical risk, these projects often seem daunting. Careful planning, design, and testing can help a team navigate their project’s particular needs. In this post, Edin Zulich (AWS Principal NoSQL Specialist Solutions Architect) and Chris Galli (Kount Director of R&D) share the story of how Kount migrated a mission-critical workload from Apache Cassandra to Amazon DynamoDB.
Kount’s Identity Trust Global Network delivers real-time fraud prevention and account protection, and enables personalized customer experiences, for more than 9,000 leading brands and payment providers. Linked by Kount’s award-winning AI, the Identity Trust Global Network analyzes signals from 32 billion annual interactions to personalize user experiences across the spectrum of trust, from frictionless experiences to blocking fraud. Quick and accurate identity trust decisions deliver safe payments, account creation, and login events, while reducing digital fraud, chargebacks, false positives, and manual reviews.
Device Data Collector service
Kount’s Device Data Collector service (or Device service for short) is a mission-critical service used by Kount’s digital fraud products and backing services. The service collects valuable customer device data, such as geolocation and user-agent attributes, and provides it both to real-time workloads and to case management, reporting, and analytics workloads. The collected data informs various risk evaluation services for payment transactions and digital account events such as account creation and logins. The service handles an average of 2,500 requests per second, with bursts that can reach 5–10 times that, and makes an average of about 2,800 database writes per second. It’s expected to be highly available, and targets real-time device information response SLAs of 25 milliseconds on average and 75 milliseconds at the 98th percentile.
The Device service leverages several AWS services. As shown in the following architectural diagram, device data ingress is handled by a cluster (Collectors) running Kount’s Golang-based Data Collector microservices hosted on Amazon Elastic Compute Cloud (Amazon EC2), and fronted by a Network Load Balancer. Incoming device data is routed to Amazon Simple Storage Service (Amazon S3) for cost-effective storage of the complete device dataset, while the mission-critical data that is used downstream is placed into Amazon ElastiCache for Redis for short-term storage. The critical device data is then consumed by microservices (Processors) responsible for providing device data to real-time consumers and storing the data in long-term storage. This long-term storage was previously a Cassandra NoSQL database, and is now DynamoDB.
The rest of this post describes why and how Kount migrated this mission-critical device data store from Cassandra to DynamoDB.
The challenge (and how DynamoDB came into the picture)
Given the importance of Device data and the volume of the workload, the Device team is constantly fielding enhancement requests while managing and maintaining the infrastructure and services that constitute the Data Collector and Information APIs. In addition, the team needed to prepare for an expected 10–100 times increase in workload from existing product growth and new products. To meet these demands, the team needed to ensure they were investing their effort in the activities that provided the highest value, which can be challenging when supporting a mission-critical, high-volume service.
As mentioned earlier, the Device service used an Apache Cassandra NoSQL database for long-term data storage. Because Kount runs on AWS, Cassandra was hosted on Amazon EC2, with the total deployment spanning 40 EC2 instances and associated Amazon Elastic Block Store (Amazon EBS) volumes. Given the demands of new features, the production workload, and the planned volume increases, the team faced a major decision: continue using Cassandra and tackle a large upgrade of the Cassandra cluster with continued operational overhead, or evaluate other options that could help the team handle growth and focus on delivering value to the business. The task and cost of managing a mission-critical, complex database expected to grow significantly was not small, and was always at odds with delivering new features. Already intimately familiar with maintenance that required specialized expertise and carried the risks of upgrades and outages, the team decided to consider alternatives.
In this context, managed services presented a compelling value proposition. Among them, DynamoDB stood out as a fully managed service, designed for zero downtime and elastic scaling, with a well-known track record of supporting mission-critical workloads, and documented successful migrations from Cassandra. Companies such as Samsung, Nike, and GE have migrated workloads from Cassandra to DynamoDB. As a serverless database, DynamoDB offers a maintenance-free experience: not only do you not have to worry about patching, upgrades, or any other server-related operational tasks, you don’t even see servers, because they’re completely abstracted away.
In addition, DynamoDB provides elasticity, an important capability given the nature of the Device service workload and the planned increases in volume. The service can experience transient, spiky usage periods; the elasticity of DynamoDB and the ease of managing read and write capacity without heavy over-provisioning (and the cost savings that brings) were appealing. Another appealing feature of DynamoDB was global tables, which provide built-in support for active-active data replication between AWS Regions. The team initially thought this would be needed for durability and high availability.
However, as the team found out, DynamoDB provides levels of availability and durability in a single Region that exceed the requirements of the Device service: DynamoDB not only replicates data across multiple Availability Zones in a Region, the service itself also runs across those Availability Zones.
Global tables’ multi-Region replication will still be useful when Kount decides to expand to other global regions. When that time comes, replicating the device data stored in DynamoDB to another Region using global tables will be as simple as flipping a switch.
Preparing for migration: Doing the homework
The key migration requirements were no data loss, no downtime, and no disruption to the Device service. This meant the team needed a way to fall back to the old database at any point during the migration, in case something went wrong.
The zero-downtime requirement set the overall approach to the migration: during the migration, the application had to be able to write to and read from both databases, and to switch from one to the other without downtime. This required application code changes to add support for the new database and the ability to use both, which highlights an important aspect of database migrations: they’re not just data migrations, they’re also application migrations.

The zero-downtime requirement was also the reason AWS Database Migration Service (AWS DMS) couldn’t be used for this migration: AWS DMS requires a small interruption to writes at cutover time.
The demanding requirements meant the migration would have to be planned out and prepared meticulously. This involved a number of tasks, grouped in several stages, as shown in the following diagram.
The first task, planning, defined goals and requirements. It was followed by data analysis: gathering detailed insights about application data access patterns and about the data itself, including the structure, size, and cardinality of datasets, and the rates of reads and writes. These insights then informed the data modeling in DynamoDB. Because Cassandra and DynamoDB are both NoSQL databases, the data models can often be similar; in this case, it turned out that data access patterns could be optimized for DynamoDB and the number of tables reduced. As part of this stage, preliminary cost estimates were made based on the data sizes and the expected read and write rates in DynamoDB. After the data access layer was implemented for DynamoDB, the team started to test performance using representative data. As the diagram shows, all this was done in an iterative process, during which adjustments were made based on testing and observations.
A few key elements that emerged during this process are worth calling out:
- Utilizing a dual write strategy to enable the application to write to both databases simultaneously and toggle read and write states through various phases of the migration. This mitigated risks by enabling fallback, and facilitated a multi-step approach to database migration.
- Leveraging a load-leveling queue within the application layer to handle extreme write spikes to DynamoDB, which could cause it to throttle writes.
- Thorough performance and load testing prior to production rollout. This provided valuable insight into provisioning levels, error handling, utilizing batch writes to DynamoDB, and established strong levels of confidence in the migration process.
After all these steps were complete, the team was ready for the actual migration. The preparation work took 5 months, although during this time the team was also working on other projects and supporting the production workload. Certain prep activities required the full focus of one to three team members, while others needed about half of one person’s time. The key was having clear direction, agreeing on priorities, and being willing to adapt as the prep work uncovered new learnings, or as other demands, like production support, required the team’s attention.
The actual migration was composed of a series of phases. Each phase focused on achieving an important milestone of the overall migration plan. The following table summarizes those phases and milestones, showing the key operations taken in each phase, and the read/write state of the application.
Phase 1: Deploying application changes
The first step was to introduce application changes that would enable the remaining phases of the migration. This included the ability to modify the application’s read/write configuration at runtime using configuration management (Puppet, in this case). The application could be configured to read from either Cassandra or DynamoDB, and to write to one or both databases. An application node could be removed from the cluster, reconfigured to the desired read/write state, and placed back in the cluster to service the workload.
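As an illustration, the read/write toggling might look like the following minimal Go sketch. This is not Kount’s actual code: the `DeviceStore` interface, the state names, and the in-memory store are assumptions made for the example.

```go
package main

import "fmt"

// DeviceStore abstracts a long-term device data store (Cassandra or DynamoDB).
type DeviceStore interface {
	Write(deviceID, data string) error
	Read(deviceID string) (string, error)
}

// RWConfig mirrors the runtime read/write setting distributed by
// configuration management (Puppet, in Kount's case).
type RWConfig struct {
	ReadFrom       string // "cassandra" or "dynamodb"
	WriteCassandra bool
	WriteDynamoDB  bool
}

// Router directs reads and writes according to the current configuration.
type Router struct {
	Cfg                 RWConfig
	Cassandra, DynamoDB DeviceStore
}

// Write fans out to every store that is enabled (a dual write when both are).
func (r *Router) Write(id, data string) error {
	if r.Cfg.WriteCassandra {
		if err := r.Cassandra.Write(id, data); err != nil {
			return err
		}
	}
	if r.Cfg.WriteDynamoDB {
		if err := r.DynamoDB.Write(id, data); err != nil {
			return err
		}
	}
	return nil
}

// Read goes only to the currently authoritative store.
func (r *Router) Read(id string) (string, error) {
	if r.Cfg.ReadFrom == "dynamodb" {
		return r.DynamoDB.Read(id)
	}
	return r.Cassandra.Read(id)
}

// memStore is a stand-in store used to demonstrate the router.
type memStore struct{ m map[string]string }

func (s *memStore) Write(id, data string) error    { s.m[id] = data; return nil }
func (s *memStore) Read(id string) (string, error) { return s.m[id], nil }

func main() {
	cass := &memStore{m: map[string]string{}}
	ddb := &memStore{m: map[string]string{}}
	r := &Router{
		Cfg:       RWConfig{ReadFrom: "cassandra", WriteCassandra: true, WriteDynamoDB: true},
		Cassandra: cass,
		DynamoDB:  ddb,
	}
	r.Write("dev-1", "payload")
	v, _ := r.Read("dev-1")
	fmt.Println(v, ddb.m["dev-1"]) // payload payload
}
```

With this shape, the later cutover (phase 4) is just flipping `ReadFrom` to `"dynamodb"` without touching the write path, which is what makes the fallback plan cheap.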
New application metrics were added that provided important feedback on the performance and behavior of DynamoDB. For example, custom metrics were added to measure the latency of reads and writes to DynamoDB from the application, and read/write counters were added to help tune DynamoDB read and write capacity. Error tracking was added to identify potential configuration or application issues. These metrics were sent to an instance of InfluxDB and Grafana, which served as the primary tools for monitoring Device service performance during migration.
Finally, a load-leveling queue was introduced at the application layer to deal with potential throttling from DynamoDB caused by transient volume spikes, and to enable asynchronous writes via BatchWriteItem requests to DynamoDB. The queue leveraged Golang channels and a worker pool pattern. There was a channel per table, and each channel had goroutine workers pulling messages off the channel. Each worker would add messages to its own batches, and when the configurable batch size was reached, a BatchWriteItem request was issued to DynamoDB. If DynamoDB returned an error, any unprocessed messages were put back into the queue to be reprocessed.
Phase 2: Dual writing the live workload
With the foundations in place, the team commenced the second phase of the migration. The application nodes were configured to begin dual writing the live workload to both Cassandra and DynamoDB. Each node in the cluster was updated by the configuration management process and put back into service, allowing a seamless, incremental change to the application cluster state. Any writes that reached only Cassandra during this transition period would be picked up in the subsequent backfill phase. The nature of Kount’s data and the key requirements dictated that the migration of the live workload be done before any historical backfill occurred.
This phase enabled the team to evaluate the performance of DynamoDB and gain confidence with various configurations, such as auto scaling settings for provisioned write capacity. In addition, they tuned client retry behavior and timeout settings. During this phase, while writes were done to both Cassandra and DynamoDB, reads were directed only to Cassandra, which was still the authoritative database.
Phase 3: Backfilling and verifying
Confident that DynamoDB was handling the live workload, the team shifted focus to backfilling the historical data. The dataset contained around 3.1 billion records and was about 4 TB in size. The team ultimately built a custom backfill process to satisfy two important requirements. First, backfill records couldn’t overwrite any records already captured via the live workload. Second, because the data model was optimized for DynamoDB, some data transformation was required to switch to a different partition key.
The following diagram shows the backfill and verification implementation. The team exported Cassandra records that predated the dual write of the live workload as CSV files, selecting the records from each table into large files. Those files were copied via SCP to an EC2 instance and split into smaller files for processing, and the resulting files were then moved to Amazon S3.
As shown in the diagram, those files were then processed by a custom set of migration logic, deployed as workers on a single c4.xlarge EC2 instance (EC2 Custom Migration Logic in the diagram) that could handle the data transformation and overwrite protection requirements. These workers could be scaled up or down to balance backfill efficiency with performance on the production database.
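In DynamoDB, the overwrite-protection requirement maps naturally onto a conditional write: a PutItem with a ConditionExpression such as `attribute_not_exists(pk)` is rejected with a ConditionalCheckFailedException when the live dual-write path has already stored that key. The sketch below models that check with an in-memory table so it stands alone; the `transform` step is a placeholder, since the post doesn’t detail the new partition key scheme.

```go
package main

import (
	"fmt"
	"strings"
)

// record is a backfill row after transformation to the DynamoDB data model.
type record struct {
	PK   string // new partition key chosen for DynamoDB
	Data string
}

// conditionalPut mimics a DynamoDB PutItem with
// ConditionExpression "attribute_not_exists(pk)": it refuses to overwrite
// a record already captured by the live dual-write path.
func conditionalPut(table map[string]record, r record) (written bool) {
	if _, exists := table[r.PK]; exists {
		return false // ConditionalCheckFailedException in the real API
	}
	table[r.PK] = r
	return true
}

// transform turns one CSV line from the Cassandra export into a record.
// The real migration logic repartitioned the data on a new key; this
// placeholder simply takes the first CSV field as the key.
func transform(csvLine string) record {
	parts := strings.SplitN(csvLine, ",", 2)
	return record{PK: parts[0], Data: parts[1]}
}

func main() {
	// dev-42 was already written by the live dual-write path.
	table := map[string]record{
		"dev-42": {PK: "dev-42", Data: "from live dual write"},
	}
	lines := []string{"dev-41,old-a", "dev-42,old-b", "dev-43,old-c"}
	backfilled := 0
	for _, l := range lines {
		if conditionalPut(table, transform(l)) {
			backfilled++
		}
	}
	fmt.Println(backfilled, table["dev-42"].Data)
	// 2 from live dual write  -> the fresher live record survives the backfill
}
```

Doing this check server-side in DynamoDB means the backfill workers never need to read before writing, which keeps the overwrite protection atomic even with many workers running in parallel.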
A key configuration change was made during the backfill phase: DynamoDB capacity mode was switched from provisioned to on-demand. This enabled the migration to proceed at a very high level of throughput without any capacity planning. The majority of the data was backfilled in 60 hours, during which the migration workers sustained close to 14,000 writes per second to DynamoDB. In other words, the single c4.xlarge EC2 instance was able to process and write 14,000 records per second to DynamoDB. If the team had needed to finish the migration in half the time, for example, they could have simply added another EC2 instance of the same type to double the throughput.
Although on-demand capacity increased the costs, it greatly simplified the process by removing the need for capacity planning and tuning of the database prior to performing the backfill. The time savings and simplicity afforded by on-demand capacity outweighed the incremental cost increase.
Each CSV file was placed into an S3 bucket based on the outcome of that file’s migration: files containing records that failed to load went into a bucket to be triaged, while fully successful files went into their own bucket. Specific application metrics, such as the number of records written to DynamoDB and error counts, were also emitted to Amazon CloudWatch to help the team monitor and manage the backfill. A set of data integrity checks was then performed to verify the data was complete and correct.
Phase 4: Cutting over to DynamoDB with fallback
The next step was to make DynamoDB the authoritative database by moving reads from Cassandra to DynamoDB. Dual writes were still in place to enable a seamless fallback to Cassandra if it was deemed necessary. This phase allowed the team to validate and tune DynamoDB read capacity and measure application performance of DynamoDB reads. The application ran in this state for 2 weeks. During this time, there were no fallback events to Cassandra.
Phase 5: Removing fallback
At this point in the migration, the team had gained confidence that DynamoDB was properly configured to handle read and write workloads, configured to auto scale during traffic spikes, and the application metrics and monitoring provided necessary insight and information to manage the system. Cassandra writes were turned off, removing the last runtime dependency on that database. While this was a simple application configuration change, it was a significant operation; DynamoDB was now the only long-term Device data store in operation.
Phase 6: Decommissioning Cassandra
After another period of time, the Cassandra infrastructure was decommissioned. The EC2 clusters supporting the Cassandra production and test instances were shut down, removing 40 EC2 instances and their backing EBS volumes across the various deployment lifecycles. The final result of this stage was the removal of all Cassandra-specific services and components.
The results

Looking back at the reasons that drove Kount to migrate: the team faced the operational pain and stress of managing a mission-critical database, which could only get worse as the workload increased, and wanted to eliminate as much of that operational overhead as possible so they could focus on developing new functionality.
The result? Database maintenance activities such as security patching, upgrades, and backup management have been eliminated. The infrastructure has been simplified: 40 long-lived EC2 instances that were supporting the Cassandra database cluster were removed. That’s 40 EC2 instances that no longer have to be reserved, monitored, patched, or upgraded. This has allowed the Device team to focus time and effort on other projects and features. DynamoDB auto scaling is handling volume spikes effectively without requiring the team to manually react to scale up or down. As a fully managed service, designed for zero downtime, DynamoDB relieves the team of the stressful task of ensuring the continuous uptime of the mission-critical database.
In addition to these key benefits, Kount is also seeing cost savings in day-to-day database operations. About 4 months post-migration, at the time of this writing, Kount is trending at about 35% cost savings for the Device long-term data store. That’s a nice added bonus: not only is the team free of tedious and stressful database maintenance work, the company is also saving on the overall database cost.
Lessons learned

Migration projects are complex and can seem daunting. When initially tackling a migration, it can be incredibly helpful to do your homework in the following ways:
- Understand your data model and access patterns. It may not be a 1:1 lift-and-shift when changing database technologies, even when they’re both NoSQL databases.
- Understand workload and critical business and technical requirements. This may influence the design of the new system, or how you plan your migration.
- Learn about the technology you’re adopting so you can design for it and configure it appropriately. Managed services are wonderful, but you still need to understand the tool and how to utilize it to meet your needs.
- Clearly understand the “why” of your migration project; for example, if you believe there will be cost savings, then verify with cost estimation or other analysis. Be prepared to make your case with data, and to stand behind your reasons for the project.
- Know and use the resources available. In addition to online resources, the team had support from AWS experts as needed.
As you begin to plan and implement your migration project, here are some key takeaways from Kount’s migration that you may find valuable:
- Start with the end in mind: what are you trying to accomplish? What does the ideal end state look like? Be prepared to come back to that vision when challenges arise, because these migration projects can be complex, and you may face uncertainty and have to make changes as you learn along the way. When faced with a challenge or decision, the Kount team would often return to that desired end state to inform their choice.
- Create a flexible migration plan with fallback and roll-forward contingencies. There’s a good chance you’ll encounter unexpected issues during the migration. Having the ability to adjust course and adapt can help reduce the anxiety and stress of these projects.
- Test and validate new technology, application changes, and your migration plan in a lower environment to gain confidence. AWS makes it possible to validate and test in lower environments. The Kount team leveraged this to their advantage to flush out problems early and often.
Finally, migrations don’t succeed purely on the merits of their technical design, project plan, or implementation. Technology projects involve and depend on groups of people coming together for a common goal. It’s important to acknowledge the effort and accomplishment of Kount’s Device team on this project. The project involved a wide range of skills, with SAs, SREs, SDETs, developers, and product managers all contributing valuable knowledge, experience, and skills. The trust, respect, and collaboration exhibited by the team was the key element of our project’s success. It was truly my honor to serve as a team member on this migration project with such an amazing group of people.
About the Authors
Edin Zulich leads a team of NoSQL solutions architects at AWS. He has helped many customers in all industries design scalable and cost-effective solutions to challenging data management problems. Edin has been with AWS since 2016, and has worked on and with distributed data technologies since 2005.
Chris Galli is the Director of R&D at Kount, where he has the honor of collaborating with a talented group of people to explore new ideas and learn every day. Chris has been with Kount since 2009, and he has worked in software development since 1999.