AWS Database Blog

How Zalando migrated their shopping carts to Amazon DynamoDB from Apache Cassandra

This post is co-written with Holger Macht and Mike Matsumoto from Zalando.

Zalando is a leading European online shop for fashion and lifestyle. Founded in Berlin in 2008, Zalando brings head-to-toe fashion to over 50 million active customers, offering clothing, footwear, accessories, and beauty supplies. The assortment of international brands ranges from world-famous names to local labels. As Europe’s fashionable tech company, we at Zalando work hard to find digital solutions for every aspect of the fashion journey: for our customers, partners, and every valuable player in the Zalando story.

In this post, we walk you through how Zalando migrated to Amazon DynamoDB from Apache Cassandra for the Shopping Cart service in 2021. We guide you through the challenges we faced with self-managing Cassandra, the business and technical requirements we had, and the details of the migration process. We end with how we optimized and now operate DynamoDB and discuss the results and impact of this migration.

The Shopping Cart

The backbone of Zalando’s digital shopping cart is a microservice responsible for managing customer carts for the Zalando Fashion Store. The service enables customers to add items to and remove items from the cart, view the cart’s contents, and see their delivery estimates. It sits in a unique place in the purchase journey, merging the end of the browse-and-discover experience with the beginning of the transactional flow. Ultimately, the business logic applied by the service defines which articles are purchasable by customers and which are not, helping customers make informed purchase decisions.

In 2021, the service received more than 5,000 requests per second on average, peaking to more than 25,000 requests per second during large sales events such as Black Friday. This service used Apache Cassandra, a NoSQL distributed database, for data storage. We stored the shopping carts for more than 50 million active customers, which amounted to more than 1 TB of shopping cart objects in Cassandra. On the database alone, we typically received, at peak, more than 10,000 read and more than 3,000 write operations per second.

Why migrate from Apache Cassandra

We self-managed our Cassandra cluster on Amazon Elastic Compute Cloud (Amazon EC2) instances, and the Shopping Cart team was responsible for provisioning, patching, and managing the servers, in addition to installing, maintaining, and operating software. Capacity planning and scaling the database both vertically and horizontally, together with managing backups and having a feasible disaster recovery plan, were other operational burdens on the team. Scaling considerations were especially challenging during high-throughput events such as Cyber Week, where we had to allocate two engineers just for the manual scaling of our Cassandra cluster.

The Shopping Cart team decided to migrate to DynamoDB for the following reasons:

  • Reduced operational and maintenance overhead – A fully managed, serverless, and cloud-native database would allow the team to spend more time on improving existing business-related features and developing new ones. Several of the team’s previous responsibilities, such as provisioning, patching, managing servers and software, and manually scaling the cluster, would no longer fall on us.
  • Introducing elasticity – Scaling our Cassandra cluster required manual intervention and two engineers allocated to this task. To match Zalando’s growth and respond with low latency during large events, the team looked for an elastic database that could support the Shopping Cart’s high-traffic, low-latency requirements. DynamoDB capacity modes satisfied our elasticity requirements.
  • Similar data partitioning and fewer changes in the existing source code – Given Zalando’s requirements of supporting high traffic and large amounts of data in different countries and regions, the underlying code base had already been written to take advantage of Cassandra’s data partitioning. After evaluating DynamoDB’s partitioning capabilities, we realized that we could keep most of our code base by keeping the same partition key compositions. This would drastically reduce the needed code changes in the main application as well as minimize the code complexity in the applications responsible for migrating and validating the data from Cassandra to DynamoDB.
  • Faster disaster recovery – Considering the importance of the Shopping Cart service for the Zalando business, having continuous backups and speedy disaster recovery was required. DynamoDB supports both full database backups with on-demand backups and continuous backups with point-in-time recovery.

Migration requirements and strategy

The migration project included migrating all of the existing shopping carts for all Zalando customers. This included carts created over a considerable time frame, because shopping carts can be kept for a long time.

The Shopping Cart service should experience zero downtime, and customers should be able to continue modifying their existing carts, as well as add new carts during the migration, without experiencing data loss or disruption in their experience.

Maintaining data integrity and consistency, as well as keeping the latency low during the entire migration, was considered critical given the importance of this service for the business.

To fulfill the migration requirements, we decided to follow the dual-write migration strategy. This meant that the Shopping Cart service should be able to write to and read from both Cassandra and DynamoDB databases, and switch from one to the other without downtime during the migration.

Planning phase

We looked at the existing data model and potential changes, evaluated expected costs together with possible scaling policies and backup and disaster recovery mechanisms, considered time and resourcing constraints, and got the buy-in from our stakeholders.

We started by looking at the two capacity modes DynamoDB provides: on-demand and provisioned. Whereas on-demand capacity mode is considered more suitable for spiky and unpredictable workloads and follows a pay-per-request pricing model, the provisioned capacity mode appeared to be more cost-effective for our use case in the long run, especially when combined with auto scaling. Because DynamoDB allows changing between capacity modes without downtime, we chose to start with on-demand capacity mode for the DynamoDB table during the migration and rollout, with the intention of gathering insights about the required capacity in production and considering adjustments, especially with a focus on cost, at a later point.

Regarding backups, DynamoDB offers two backup mechanisms: on-demand backup for creating full backups of DynamoDB tables, as well as point-in-time recovery for continuous incremental backups, which would enable us to restore the table to any point in time during the last 35 days. After reviewing our business and disaster recovery requirements, we chose to enable the point-in-time recovery feature of DynamoDB so that we could have continuous backups of our table data. Whenever needed, our disaster recovery mechanism would restore the table data to the point in time prior to the occurrence of the disaster to a new table. The application would then be configured to access data from the newly created table and continue its normal operation.
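
To illustrate, enabling point-in-time recovery and restoring to a new table can be expressed with the AWS SDK for Java 2.x roughly as follows. This is a minimal sketch, not our actual tooling; the table names and timestamp are placeholders.

import java.time.Instant;
import software.amazon.awssdk.services.dynamodb.DynamoDbClient;
import software.amazon.awssdk.services.dynamodb.model.PointInTimeRecoverySpecification;
import software.amazon.awssdk.services.dynamodb.model.RestoreTableToPointInTimeRequest;
import software.amazon.awssdk.services.dynamodb.model.UpdateContinuousBackupsRequest;

public class PitrExample {
    public static void main(String[] args) {
        DynamoDbClient dynamoDb = DynamoDbClient.create();

        // Enable continuous backups (point-in-time recovery) for the table.
        dynamoDb.updateContinuousBackups(UpdateContinuousBackupsRequest.builder()
                .tableName("Cart") // placeholder table name
                .pointInTimeRecoverySpecification(PointInTimeRecoverySpecification.builder()
                        .pointInTimeRecoveryEnabled(true)
                        .build())
                .build());

        // In a disaster, restore the data as it existed just before the incident into a new table.
        dynamoDb.restoreTableToPointInTime(RestoreTableToPointInTimeRequest.builder()
                .sourceTableName("Cart")
                .targetTableName("Cart-restored") // the application is then pointed at this table
                .restoreDateTime(Instant.parse("2021-06-01T10:00:00Z")) // placeholder timestamp
                .build());
    }
}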

Data analysis phase

To choose the right approach for the Shopping Cart service, we considered the following:

  • Throughput and latency – Do we have enough metrics to assess the performance of the existing service using Cassandra and compare it with the performance of the same service using DynamoDB, so that we can comply with our SLAs?
  • Access patterns – Do we know exactly what read and write workloads the Shopping Cart service has, so that we can make a conscious choice about the design of the data model used for DynamoDB?
  • Costs – Can we estimate cost for both Cassandra and DynamoDB databases in order to help the decision-makers?

After verifying that we had all of the metrics and were aware of all our access patterns, our first consideration was that we could mimic Cassandra’s data model in DynamoDB, because both are based on a denormalized schema. By doing so, we could drastically reduce the required code changes, and we therefore anticipated completing the migration faster.

Although keeping the same data schema was a very attractive idea, when analyzing the cost estimation we realized that the estimated cost for DynamoDB would be around 2.5 times higher than the cost of Cassandra. We started investigating how to reduce the estimated cost for DynamoDB while keeping code changes to a minimum to avoid introducing bugs.

Data modeling phase

DynamoDB charges for reading, writing, and storing data in the DynamoDB tables, along with optional features that we choose to enable. Although the existing amount of data (> 1 TB) can be considered high, it was not a huge contributor to the overall costs, so we focused on the read and write access patterns.

We identified that generating cart-snapshots was the main contributor to the DynamoDB cost. A cart-snapshot is an immutable and signed representation of a customer shopping cart, which is used by many clients of our system.

Because the related data was stored in different tables and DynamoDB doesn’t support the concept of a table join, generating a cart-snapshot required multiple reads from and writes to different tables. We then joined the data in the source code, which resulted in multiple network calls for the read and write operations. This had a direct impact on latency and also increased the cost, because each individual request consumed capacity independently.

To reduce the number of individual read operations from different tables, we followed the recommendation from AWS Solutions Architects to redesign the data model based on a single-table design. We chose a composite primary key for our table and overloaded the table with different entities that were previously stored in different tables. This allowed us to perform a Query operation instead of multiple individual GetItem operations from different tables. We also followed best practices to design a composite sort key to combine different values in a sort key. Using begins_with() in a key condition expression for our queries allowed us to design targeted queries and optimize even further for cost. This helped us reduce the number of reads from multiple queries to a single network call, which was beneficial not only for the operation of creating cart-snapshots, but also for other operations.

Before, multiple queries resembled the following code:

var a = read Table_A where a_id = A#11
var b = read Table_B where b_id = … and a_id = A#11
var n = read Table_N where n_id = … and a_id = A#11

After, a single query looked like the following code, for a certain cart with ID C#1:

var (a,b,...,n) = read Cart where PK = A#11 and begins_with(SK, C#1)
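
For illustration, the same single query can be expressed with the low-level AWS SDK for Java 2.x roughly as follows; the Cart table name, key attribute names, and key values are hypothetical:

import java.util.Map;
import software.amazon.awssdk.services.dynamodb.DynamoDbClient;
import software.amazon.awssdk.services.dynamodb.model.AttributeValue;
import software.amazon.awssdk.services.dynamodb.model.QueryRequest;
import software.amazon.awssdk.services.dynamodb.model.QueryResponse;

public class CartQueryExample {
    public static void main(String[] args) {
        DynamoDbClient dynamoDb = DynamoDbClient.create();

        // Fetch all entities in partition A#11 whose sort key starts with C#1 in a single network call.
        QueryRequest request = QueryRequest.builder()
                .tableName("Cart") // hypothetical table name
                .keyConditionExpression("PK = :pk AND begins_with(SK, :skPrefix)")
                .expressionAttributeValues(Map.of(
                        ":pk", AttributeValue.builder().s("A#11").build(),
                        ":skPrefix", AttributeValue.builder().s("C#1").build()))
                .build();

        QueryResponse response = dynamoDb.query(request);
        response.items().forEach(System.out::println);
    }
}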

To summarize, the transition to a single-table model not only reduced costs, particularly for read operations, but also allowed the team to review and enhance the existing access patterns with a more adaptable schema. In other words, the team could obtain a wide range of new insights just by rearranging and querying the data in a different manner, thanks to the new schema in combination with global secondary indexes.

A phased migration approach

To fulfill the requirements we discussed, we laid out a phased migration approach. The following table outlines the customer-facing read and write behavior after each of the phases.

Phase | Cassandra Read | Cassandra Write | DynamoDB Read | DynamoDB Write
Implementing application changes | Yes | Yes | No | No
Dual-write to Cassandra and DynamoDB | Yes | Yes | No | Yes
Offline data migration | Yes | Yes | No | Yes
Read from DynamoDB with fallback to Cassandra | Fallback only | Yes | Yes | Yes
Decommission Cassandra | No | No | Yes | Yes

Let’s look at each phase in more detail.

Implementing application changes

After defining the data model, one of the key objectives was to introduce the necessary code to integrate with DynamoDB without changing too much of the existing code, as well as making it simple to clean up the existing code related to Cassandra later.

The existing source code was structured in a way that the Shopping Cart service components directly accessed the database layer backed by Cassandra. To decouple the service layer from the database layer, we introduced one level of abstraction by creating a new repository interface. The components in the service layer referred to this new interface instead of the database layer.

This approach enabled us to create a new implementation of this interface especially for DynamoDB without the need to change anything on the service layer. This also meant that further in the course of the migration, the Shopping Cart team didn’t have to change the implemented business logic, which reduced the risk of introducing bugs.
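
The following is a simplified, hypothetical sketch of that abstraction; the type and method names are illustrative, not our actual code:

import java.util.List;
import java.util.Optional;

// Hypothetical domain type, reduced to the essentials for illustration.
record Cart(String customerNumber, String cartId, List<String> itemIds) {}

// Repository interface decoupling the service layer from the database layer.
// The service layer depends only on this interface, so a DynamoDB-backed
// implementation can be added next to the Cassandra-backed one without
// touching the business logic.
interface CartRepository {
    Optional<Cart> findCart(String customerNumber, String cartId);
    void saveCart(Cart cart);
}

// class CassandraCartRepository implements CartRepository { ... }  // existing implementation
// class DynamoDbCartRepository implements CartRepository { ... }   // new implementation for DynamoDB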

We updated the existing source code to use the AWS SDK for Java 2.x, which has true support for non-blocking applications. Upgrading to the latest version improved the existing source code. For interacting with DynamoDB, we used the DynamoDB Enhanced Client API and continued using data object mapping and a repository abstraction (similar to JPA) to reduce the number of code changes and the boilerplate required to run CRUD operations on the database.
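
As a rough sketch of how the Enhanced Client maps an entity in the single-table design (the entity class, table name, and attributes are hypothetical):

import software.amazon.awssdk.enhanced.dynamodb.DynamoDbEnhancedClient;
import software.amazon.awssdk.enhanced.dynamodb.DynamoDbTable;
import software.amazon.awssdk.enhanced.dynamodb.TableSchema;
import software.amazon.awssdk.enhanced.dynamodb.mapper.annotations.DynamoDbBean;
import software.amazon.awssdk.enhanced.dynamodb.mapper.annotations.DynamoDbPartitionKey;
import software.amazon.awssdk.enhanced.dynamodb.mapper.annotations.DynamoDbSortKey;

@DynamoDbBean
public class CartEntity {
    private String pk;       // partition key, for example the customer-based key
    private String sk;       // composite sort key identifying the entity within the cart
    private Integer quantity;

    @DynamoDbPartitionKey
    public String getPk() { return pk; }
    public void setPk(String pk) { this.pk = pk; }

    @DynamoDbSortKey
    public String getSk() { return sk; }
    public void setSk(String sk) { this.sk = sk; }

    public Integer getQuantity() { return quantity; }
    public void setQuantity(Integer quantity) { this.quantity = quantity; }

    public static void main(String[] args) {
        DynamoDbEnhancedClient enhancedClient = DynamoDbEnhancedClient.create();
        // Bind the mapped entity to the single table; CRUD operations then need very little boilerplate.
        DynamoDbTable<CartEntity> table =
                enhancedClient.table("Cart", TableSchema.fromBean(CartEntity.class));

        CartEntity item = new CartEntity();
        item.setPk("A#11");
        item.setSk("C#1#ITEM#42");
        item.setQuantity(2);
        table.putItem(item);
    }
}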

For tuning the DynamoDB client, we mostly followed the guidance in Tuning AWS Java SDK HTTP request settings for latency-aware Amazon DynamoDB applications; the optimal configuration varies by use case. We could see that enabling the tcp-keep-alive attribute reduced the CPU and memory usage of our servers. If you also run a high-demand application, we suggest enabling this setting.
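
A minimal sketch of such a client configuration, assuming the asynchronous client with the Netty HTTP client; the timeout and concurrency values are illustrative, not our production settings:

import java.time.Duration;
import software.amazon.awssdk.http.nio.netty.NettyNioAsyncHttpClient;
import software.amazon.awssdk.services.dynamodb.DynamoDbAsyncClient;

public class DynamoDbClientFactory {
    public static DynamoDbAsyncClient create() {
        return DynamoDbAsyncClient.builder()
                .httpClientBuilder(NettyNioAsyncHttpClient.builder()
                        .tcpKeepAlive(true)                          // keep connections alive to reduce connection churn
                        .connectionTimeout(Duration.ofMillis(500))   // illustrative value, tune per use case
                        .maxConcurrency(200))                        // illustrative value, tune per use case
                .build();
    }
}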

We enabled SDK metrics to get insights for future optimization, and also pre-scaled our on-demand table by switching to the provisioned capacity mode temporarily to provision the estimated read and write capacity. When the table status became active, we switched the capacity mode of our table to on-demand.
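
The pre-scaling step corresponds roughly to the following sketch; the capacity numbers are placeholders, and keep in mind that DynamoDB limits how frequently a table can switch between capacity modes:

import software.amazon.awssdk.services.dynamodb.DynamoDbClient;
import software.amazon.awssdk.services.dynamodb.model.BillingMode;
import software.amazon.awssdk.services.dynamodb.model.ProvisionedThroughput;
import software.amazon.awssdk.services.dynamodb.model.UpdateTableRequest;

public class PreScaleExample {
    public static void main(String[] args) {
        DynamoDbClient dynamoDb = DynamoDbClient.create();

        // Temporarily switch to provisioned mode with the estimated capacity...
        dynamoDb.updateTable(UpdateTableRequest.builder()
                .tableName("Cart") // placeholder table name
                .billingMode(BillingMode.PROVISIONED)
                .provisionedThroughput(ProvisionedThroughput.builder()
                        .readCapacityUnits(10_000L)  // placeholder estimate
                        .writeCapacityUnits(3_000L)  // placeholder estimate
                        .build())
                .build());

        // ...and, once the table is ACTIVE again, switch back to on-demand.
        dynamoDb.updateTable(UpdateTableRequest.builder()
                .tableName("Cart")
                .billingMode(BillingMode.PAY_PER_REQUEST)
                .build());
    }
}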

Because of the importance of the Shopping Cart service to our business, we decided to gradually migrate the data during the different migration phases. At the start, 100% of customers’ shopping carts were powered by Cassandra, for both read and write operations. We then gradually increased the share of shopping carts being stored to and read from DynamoDB. This batching method enabled us to perform testing and verifications on a small share of customer data, gradually increasing it until we reached 100% being powered by DynamoDB.

Dual-write to Cassandra and DynamoDB

In this phase, whenever the Shopping Cart service received a request, the write operations went to both Cassandra and DynamoDB, and for read operations, Cassandra was the source of truth.
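
Building on the hypothetical repository abstraction sketched earlier, the dual-write behavior can be summarized as follows (error handling and the data comparison logic are omitted):

import java.util.Optional;

// Hypothetical composite repository used during the dual-write phase:
// Cassandra stays the source of truth for reads, while writes go to both databases.
class DualWriteCartRepository implements CartRepository {
    private final CartRepository cassandra;
    private final CartRepository dynamoDb;

    DualWriteCartRepository(CartRepository cassandra, CartRepository dynamoDb) {
        this.cassandra = cassandra;
        this.dynamoDb = dynamoDb;
    }

    @Override
    public Optional<Cart> findCart(String customerNumber, String cartId) {
        return cassandra.findCart(customerNumber, cartId); // Cassandra remains the source of truth
    }

    @Override
    public void saveCart(Cart cart) {
        cassandra.saveCart(cart);
        dynamoDb.saveCart(cart);
    }
}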

We added the required tests and monitoring to the existing source code to validate the data model and the correctness of data, which allowed us to compare the data stored in Cassandra and DynamoDB.

By the end of this phase, new customers coming to the website had their shopping carts successfully stored in both Cassandra and DynamoDB. Although Cassandra was the source of truth during this phase, the dual-write mechanism allowed us to quickly and confidently roll back in case of issues during later migration phases.

Offline data migration

The focus of this phase was to migrate the shopping carts and the corresponding data that belonged to the existing customers that weren’t migrated during the previous phase.

To perform this migration phase, we implemented an offline batch processing script and auxiliary endpoints:

  a. A script that dumped customer numbers and corresponding cart IDs from Cassandra to a local storage system. Dumping the data as a local copy allowed us to experiment with it without impacting the production system.
  b. A new migration endpoint added to the microservice that manages shopping carts. This endpoint received a customer number and a cart ID as input, read the corresponding data from Cassandra, and wrote it to DynamoDB.
  c. A new comparison endpoint, also added to the microservice that manages shopping carts. This endpoint accepted a customer number and a cart ID as input, read all corresponding data from Cassandra and DynamoDB, compared them for equality, and recorded any inconsistencies between the two databases for that specific customer.
  d. A new microservice created for the purpose of this migration (a sketch follows this list). It parsed the dump files created in (a) and called the migration endpoint explained in (b) for batches of customers and cart IDs to migrate the data from Cassandra to DynamoDB. The size of these batches was configurable. The service also offered a verification step for a migrated batch to compare data from Cassandra and DynamoDB using the endpoint explained in (c). Whenever data inconsistencies were detected during verification, the data was fixed and these steps were repeated for the inconsistent records.
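
The following is a simplified, hypothetical sketch of that batch driver; the endpoint paths, dump file format, and batch size are placeholders, and retries, error handling, and rate limiting are omitted:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

public class MigrationBatchDriver {
    private static final HttpClient HTTP = HttpClient.newHttpClient();
    private static final String CART_SERVICE = "http://cart-service.internal"; // placeholder base URL

    public static void main(String[] args) throws Exception {
        // Each line of the dump file holds "customerNumber,cartId" exported from Cassandra.
        List<String> lines = Files.readAllLines(Path.of("cassandra-dump.csv"));
        int batchSize = 100; // configurable; we started with roughly 0.01% of customers

        for (int start = 0; start < lines.size(); start += batchSize) {
            List<String> batch = lines.subList(start, Math.min(start + batchSize, lines.size()));
            for (String line : batch) {
                String[] parts = line.split(",");
                call("/migrate", parts[0], parts[1]);  // copy the cart from Cassandra to DynamoDB
                call("/compare", parts[0], parts[1]);  // verify both databases hold the same data
            }
        }
    }

    private static void call(String path, String customerNumber, String cartId) throws Exception {
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(CART_SERVICE + path + "?customer=" + customerNumber + "&cart=" + cartId))
                .POST(HttpRequest.BodyPublishers.noBody())
                .build();
        HTTP.send(request, HttpResponse.BodyHandlers.ofString());
    }
}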

We started off with a small batch size, about 0.01% of our customers, to verify the functionality of the used tools and mechanisms. This also allowed us to verify the impact on the performance of the production system during the migration, as we were reading from the production data store. Because the data stored in Cassandra was never modified by (a), (b), (c), or (d), the migration for each batch could be repeated as often as necessary until the data was verified to be consistent. We then steadily increased the batch size to speed up the migration process until reaching 100% of migrated data.

The monitoring and dual-write mechanism put in place during the previous phase allowed us to verify that the data stored in Cassandra and DynamoDB was equal and kept up to date in both databases.

Read from DynamoDB with fallback to Cassandra

In this phase, we switched to DynamoDB as the source of truth for the data presented to the customer. One difference from the previous phase was that reads went first to DynamoDB, and only if the record was not found there did we read from Cassandra. We recorded those data misses, because they indicated that not all of the data had been migrated. During this phase, all of the write operations were still performed on both Cassandra and DynamoDB.
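
Continuing the hypothetical repository sketch, the read path with fallback can be summarized as follows (the miss recording is reduced to a comment):

import java.util.Optional;

// Hypothetical repository used in this phase: DynamoDB is read first and is the
// source of truth; Cassandra is only consulted when the record is not found.
class DynamoDbFirstCartRepository implements CartRepository {
    private final CartRepository dynamoDb;
    private final CartRepository cassandra;

    DynamoDbFirstCartRepository(CartRepository dynamoDb, CartRepository cassandra) {
        this.dynamoDb = dynamoDb;
        this.cassandra = cassandra;
    }

    @Override
    public Optional<Cart> findCart(String customerNumber, String cartId) {
        Optional<Cart> cart = dynamoDb.findCart(customerNumber, cartId);
        if (cart.isPresent()) {
            return cart;
        }
        // Record the miss: it indicates this cart has not been migrated yet.
        return cassandra.findCart(customerNumber, cartId);
    }

    @Override
    public void saveCart(Cart cart) {
        // Writes still go to both databases during this phase.
        cassandra.saveCart(cart);
        dynamoDb.saveCart(cart);
    }
}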

We applied the same gradual batching mechanism from the previous phase, starting with a small share of customers being served by DynamoDB and slowly increasing it along with growing confidence that everything was working well. This reduced the risk of affecting all customers simultaneously and enabled us to switch back to Cassandra if necessary.

We performed final consistency checks and verified that all of the data had been successfully migrated. We trained our on-call engineers to monitor and operate DynamoDB for production workloads. We enabled point-in-time recovery for our DynamoDB table and implemented a disaster recovery strategy by creating a corresponding playbook and testing its correct functioning.

Decommission Cassandra

When we gained confidence that DynamoDB served the same data as Cassandra and we were operationally ready to run our production workloads based on DynamoDB, we stopped writing to and reading from Cassandra. We also removed all Cassandra-related code and infrastructure.

Learnings during and after migrating to DynamoDB

During the offline data migration phase, the production system experienced increased latency in customer interactions due to the heavy read workload created by reading from Cassandra and writing to DynamoDB. To mitigate this, we scaled up the Cassandra instances and rate-limited the data migration so that the impact on the production system stayed within accepted boundaries.

After running DynamoDB with on-demand capacity mode for about a month, we evaluated the established baseline and access patterns. This led us to switch our table’s capacity mode to provisioned with auto scaling enabled, which helped us optimize our costs. To handle spikes, we set the target utilization to 70%, leaving enough buffer between consumed and provisioned capacity before throttling would kick in.
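
For illustration, registering the table’s read capacity with Application Auto Scaling and attaching a 70% target-tracking policy might look roughly like the following sketch; the resource name and capacity bounds are placeholders, and write capacity would be configured analogously:

import software.amazon.awssdk.services.applicationautoscaling.ApplicationAutoScalingClient;
import software.amazon.awssdk.services.applicationautoscaling.model.MetricType;
import software.amazon.awssdk.services.applicationautoscaling.model.PolicyType;
import software.amazon.awssdk.services.applicationautoscaling.model.PredefinedMetricSpecification;
import software.amazon.awssdk.services.applicationautoscaling.model.PutScalingPolicyRequest;
import software.amazon.awssdk.services.applicationautoscaling.model.RegisterScalableTargetRequest;
import software.amazon.awssdk.services.applicationautoscaling.model.ScalableDimension;
import software.amazon.awssdk.services.applicationautoscaling.model.ServiceNamespace;
import software.amazon.awssdk.services.applicationautoscaling.model.TargetTrackingScalingPolicyConfiguration;

public class AutoScalingSetup {
    public static void main(String[] args) {
        ApplicationAutoScalingClient autoScaling = ApplicationAutoScalingClient.create();

        // Register the table's read capacity as a scalable target with lower and upper bounds.
        autoScaling.registerScalableTarget(RegisterScalableTargetRequest.builder()
                .serviceNamespace(ServiceNamespace.DYNAMODB)
                .resourceId("table/Cart") // placeholder table name
                .scalableDimension(ScalableDimension.DYNAMODB_TABLE_READ_CAPACITY_UNITS)
                .minCapacity(1_000)   // placeholder bounds
                .maxCapacity(40_000)
                .build());

        // Attach a target-tracking policy that keeps consumed/provisioned capacity around 70%.
        autoScaling.putScalingPolicy(PutScalingPolicyRequest.builder()
                .serviceNamespace(ServiceNamespace.DYNAMODB)
                .resourceId("table/Cart")
                .scalableDimension(ScalableDimension.DYNAMODB_TABLE_READ_CAPACITY_UNITS)
                .policyName("cart-read-target-tracking")
                .policyType(PolicyType.TARGET_TRACKING_SCALING)
                .targetTrackingScalingPolicyConfiguration(TargetTrackingScalingPolicyConfiguration.builder()
                        .targetValue(70.0)
                        .predefinedMetricSpecification(PredefinedMetricSpecification.builder()
                                .predefinedMetricType(MetricType.DYNAMO_DB_READ_CAPACITY_UTILIZATION)
                                .build())
                        .build())
                .build());
    }
}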

One operational learning after the migration was how to deal with traffic drops lasting more than 15 minutes, for instance due to network outages, which caused DynamoDB auto scaling to scale down. DynamoDB auto scaling uses Application Auto Scaling to dynamically adjust provisioned throughput capacity in response to actual traffic patterns. Application Auto Scaling automatically scales up the provisioned capacity only when the consumed capacity is higher than the target utilization for 2 consecutive minutes. During this period, requests that exceed the provisioned capacity of the table might be throttled. Although sudden short-duration traffic spikes might be accommodated by DynamoDB’s built-in burst capacity, we created alarms to notify us when the provisioned capacity dropped below a certain level so we could manually increase it.

To optimize cost even further, we purchased reserved capacity after establishing a well-known baseline over a longer time frame and evaluating historical data.

Conclusion

Overall, the migration was a success and went smoothly without impact on Zalando customers browsing and shopping on the website.

Besides the normal daily load, DynamoDB also served our workloads well during 2 consecutive Cyber Week sales events. Here we observed a significant reduction in operational load, reflected in the reduced number of tickets created. We also no longer needed to manually scale Cassandra before and after each load test during Cyber Week preparations, which had typically required allocating two engineers for a minimum of 2 full days per load test, over a period of several weeks.

Regarding utilization, although a 70% target was the right choice for our steady traffic, we chose a lower target utilization during large-scale events to widen the gap between consumed and provisioned capacity and better handle unpredictable traffic surges. We also pre-scaled the table with provisioned capacity based on the anticipated traffic during Cyber Week.

Backup and disaster recovery have also improved significantly with the adoption of DynamoDB. Although we had backups before, our previous disaster recovery process was quite complex, requiring detailed knowledge of Cassandra and substantial downtime.

Finally, and perhaps most importantly, this successful migration inspired the team owning the Checkout domain to also migrate from Cassandra to DynamoDB, shortly after our migration. Because the time-to-live for customer data related to this domain is only 1 week, there was no need to migrate existing data. They took advantage of the lessons we had learned during our dual-write phase, and it was sufficient to run that process for a week to have the same data in both Cassandra and DynamoDB.

So, besides the technical, cost, and operational improvements, our migration is now a success story inside of Zalando, which can serve as a model to emulate in future migrations. If you have questions about migrating to DynamoDB, please let us know in the comments.


About the authors

Holger Macht is currently working as an Engineering Manager at Zalando. Together with his team, he is evolving the Fashion Store’s shopping cart domain and ensures the reliable operation of the backend production software systems, empowering more than 50 million active customers to make informed purchase decisions about the products they love.

Mike Matsumoto is a Senior Software Engineer at Zalando who successfully led the migration of Zalando’s shopping carts to DynamoDB, ensuring a smooth transition without any downtime. In addition to his contributions to the Cart Team’s deliveries, he founded the DynamoDB Guild at Zalando, where he advocates for operational guidance and best practices throughout the organization.

Luís Rodrigues Soares is a Game Tech Solutions Architect, based in A Coruña, Spain.

Samaneh Utter is an Amazon DynamoDB Specialist Solutions Architect based in Göteborg, Sweden.