AWS Database Blog

Unlocking performance, scalability, and cost-efficiency of Zomato’s Billing Platform by switching from TiDB to DynamoDB

This post is co-authored with Neha Gupta & Kanica Mandhania from Zomato.

Zomato, an India-based restaurant aggregator, food delivery, and dining-out company, operates in over 1,000 cities and lists more than 350,000 restaurants. Since its inception in 2008, Zomato has grown tremendously, both in scope and scale—and has emerged as the leading market player in India’s food tech industry. As the demand for online food ordering continues to grow, Zomato recognizes the importance of innovation in meeting scalability requirements.

Considering the nature of our business, customer traffic is primarily concentrated during meal times, leading to notable differences in workload between peak and non-peak hours. Additionally, on special occasions like Diwali and New Year’s Eve, Zomato experiences massive traffic surges with significantly higher spikes compared to regular days. Therefore, it’s crucial for Zomato to select a database that ensures consistent low latency, regardless of scale, and possesses the capability to handle traffic spikes without the need for expensive overprovisioning during periods of lower activity.

In this post, we explain the primary factors that prompted Zomato to transition away from relational databases, and TiDB in particular, to Amazon DynamoDB, a fully managed, serverless, key-value NoSQL database. This move enabled us to manage traffic spikes effectively without costly overprovisioning during periods of reduced activity, and even lowered our total cost of ownership (TCO).

Zomato’s Billing Platform

Zomato’s Billing Platform is responsible for managing post-order processes, primarily maintaining ledgers and handling payouts for various business verticals such as Online Ordering, Dining Out, Blinkit, Hyperpure, and Feeding India. The platform handles the distribution of payments to restaurant partners and riders at a large scale.

At the time of writing this post, Zomato’s Billing Platform processes around 10 million events on a typical day. This results in approximately 1 million payments on a weekly basis. Because generating invoices and processing payments is a mission-critical function for Zomato, the availability and resiliency of the billing system is important for the success of our business.

The Zomato Billing Platform architecture follows an asynchronous messaging approach, using Kafka to integrate independent microservices in a loosely coupled manner so they can operate at scale and evolve independently and flexibly. The platform acts as both a producer and consumer of Kafka events, consuming billing orders, processing them, and producing processed ledgers for various business use cases. The following diagram illustrates the legacy architecture.

Zomato billing platform architecture

Zomato Billing Platform used TiDB, an open source distributed SQL database that supports online transaction processing (OLTP) and online analytical processing (OLAP) workloads. It combines the scalability of NoSQL databases with the ACID compliance of traditional relational databases. Its distributed architecture enables horizontal scalability, fault tolerance, and high availability. The TiDB system comprises multiple components that communicate with each other to form a complete system.

The following key components of the TiDB system need to be maintained:

  • TiDB server – The TiDB server serves as a stateless SQL layer, providing external access through the MySQL protocol. It can be scaled horizontally and offers a unified interface through load balancing components.
  • Placement Driver server – The Placement Driver (PD) server manages metadata and data scheduling commands within the cluster. It acts as the brain of the cluster by storing metadata and dynamically assigning data distribution tasks to TiKV nodes based on real-time reports.
  • TiKV server – The TiKV server is responsible for distributed data storage and functions as a transactional key-value storage engine.
  • TiFlash server – The TiFlash server is a specialized columnar storage server optimized for fast analytical processing, storing data in columns for improved performance.

Challenges with the original design

Zomato initially incorporated TiDB into the original design to manage our OLTP workloads. The decision to use TiDB was based on the Zomato engineering team’s expertise in the relational model and MySQL, as well as TiDB’s ability to scale horizontally to meet our growing needs.

Over the years, the scale at which Zomato operates has grown tremendously, and with the addition of Blinkit and Hyperpure, we also needed our billing platform to be multi-tenant. We faced several challenges while scaling and maintaining our TiDB database:

  • As our operations expanded, we faced challenges in handling schema changes and observed a decline in query performance, particularly in larger tables with billions of rows and terabytes of data. Furthermore, performance issues arose with certain workloads involving subqueries or joining numerous tables.
  • The distributed nature of TiDB introduced additional complexity compared to single-node databases. One such example is adding nodes with larger storage capacity to a balanced cluster for scalability. This resulted in the designated node becoming the primary for write operations, leading to higher CPU usage and negatively impacting the overall performance of the service. This, in turn, led to poor service level agreements (SLAs), a compromised customer experience, and delayed payments.
  • We expanded our cluster size from 5 nodes to 25 nodes in both the primary and replica clusters to accommodate the growing scale and data. Additionally, during peak days with anticipated high traffic, the infrastructure was manually scaled up and kept overprovisioned to prevent performance issues. All this led to an increase in TCO and made it a less scalable solution.
  • We opted for TiDB before TiDB Cloud was launched. Because the database was self-managed, our team took on the majority of the heavy lifting to ensure its reliable operation, including tasks like synchronizing replicas, monitoring storage, adding extra nodes, and managing backups. Migrating to newer versions also required tremendous effort and time. Over time, all these factors combined became an overhead for our teams.
  • As the size of the database grew, the backup tasks experienced longer completion times due to the larger volume of data being processed. This added delay in the overall backup process and led to a situation where we were unable to meet our agreed SLAs.

Why DynamoDB

DynamoDB is a serverless, NoSQL, fully managed database service with single-digit millisecond response times at any scale, enabling you to develop and run modern applications while only paying for what you use.

Apart from the fact that we are already using DynamoDB for multiple business-critical services at Zomato, the following functionalities highlight the key features of DynamoDB that increase scalability, streamline developer productivity, reduce time spent on repetitive tasks, and lower our overall TCO:

  • DynamoDB maintains consistent performance as your application scales, ensuring that regardless of the database size or number of concurrent queries, all operations will have a reliable response time in the order of milliseconds.
  • As a general rule for data modeling, related data is kept together. Therefore, DynamoDB doesn’t need a query planner to parse a query into a multi-step process to read, join, and aggregate data from different places on disk. This enables DynamoDB to support low-latency lookup and makes queries efficient with growing scale. It also frees developers and DBAs from the responsibility of performance tuning because you no longer need to spend time debugging query plans or figuring out why query performance decreases at scale.
  • DynamoDB offers an auto scaling capacity feature that can modify our database resources (provisioned throughput capacity) according to user traffic. This helps us scale for sudden spikes in traffic and not over-provision for peak workloads compared to TiDB—all while maintaining consistent latencies in an efficient and cost-effective manner (see the sketch following this list).
  • Unlike TiDB, DynamoDB eliminates the need for manual setup and management tasks. It’s serverless, meaning you don’t have to handle hardware provisioning, configuration, replication, software patching, database backup, or cluster scaling yourself.
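To make the auto scaling point concrete, the following is a minimal Python (boto3) sketch of configuring target tracking auto scaling for a DynamoDB table through the Application Auto Scaling API. The table name, capacity bounds, and 70% target utilization are illustrative assumptions, not our production configuration.

import boto3

# Register the table's write capacity as a scalable target and attach
# a target tracking policy. All values below are illustrative only.
autoscaling = boto3.client("application-autoscaling")

autoscaling.register_scalable_target(
    ServiceNamespace="dynamodb",
    ResourceId="table/billing-platform",  # hypothetical table name
    ScalableDimension="dynamodb:table:WriteCapacityUnits",
    MinCapacity=100,
    MaxCapacity=10000,
)

autoscaling.put_scaling_policy(
    PolicyName="billing-write-scaling",
    ServiceNamespace="dynamodb",
    ResourceId="table/billing-platform",
    ScalableDimension="dynamodb:table:WriteCapacityUnits",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 70.0,  # keep consumed/provisioned capacity near 70%
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "DynamoDBWriteCapacityUtilization"
        },
    },
)

With a policy like this in place, DynamoDB raises provisioned write capacity as meal-time traffic builds and lowers it during off-peak hours, which is what removes the need to keep the database sized for peak.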

Solution overview

To address the aforementioned problems, the Zomato team identified the need to redesign the data storage layer. After conducting a thorough evaluation of various databases, the engineering team decided to use DynamoDB. This section highlights the key points that guided our approach.

Single-table design

In the original TiDB design, we followed the traditional relational database management system (RDBMS) approach. Each table stored data for a particular entity and had a primary key that connected to foreign keys in other tables. To retrieve data from related tables and present the results in a unified view, a JOIN operation involving multiple tables was performed. For example, we had separate sets of tables for each business vertical (such as Food Delivery, Dining Out, and Hyperpure), which contained entities like payout_details, billing_ledger, payout_order_mapping, and billing_breakup, as illustrated in the following figure.

Image depicting legacy table design

Given our experience using DynamoDB for multiple microservices, we decided to streamline this relational schema into a unified DynamoDB table using the adjacency list design pattern, which encompasses all businesses and stores data for various entities. This strategy effectively consolidated related data into a single table (see the following figure), enhancing performance by eliminating the need to fetch data from different locations on disk. It also reduced costs, especially for read operations, because a single read operation now retrieves all the required data instead of multiple queries for different entities.

DynamoDB table with Single-table design
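As an illustration of this pattern, the following minimal Python (boto3) sketch uses hypothetical table, key, and attribute names (not our actual schema) to show how items for several billing entities share one partition key, so a single Query returns all related data.

import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("billing-platform")  # hypothetical table name

# One composite partition key groups every entity for a merchant's
# payout cycle; sort key prefixes distinguish the entity types.
pk = "MERCHANT#42_2023-W21_FOOD-DELIVERY"

with table.batch_writer() as batch:
    batch.put_item(Item={"PK": pk, "SK": "PAYOUT#DETAILS", "amount": 125000})
    batch.put_item(Item={"PK": pk, "SK": "LEDGER#ORDER#9001", "commission": 450})
    batch.put_item(Item={"PK": pk, "SK": "BREAKUP#ORDER#9001", "tax": 81})

# A single Query retrieves every related item, replacing the
# multi-table JOIN from the relational design.
items = table.query(KeyConditionExpression=Key("PK").eq(pk))["Items"]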

Designing partition keys

In DynamoDB, data is distributed across partitions, which are physical storage units. A table can have multiple partitions, and in most cases the partition key determines where an item is stored. For tables with composite primary keys, the sort key may be used as a partition boundary: DynamoDB splits a partition by sort key if an item collection grows beyond 10 GB. Designing the schema and identifying the right partition key is crucial for efficient data access. Imbalanced access can lead to hot partitions, where one partition experiences a disproportionate workload, causing throttling and inefficient use of I/O capacity.

We tackled this problem by constructing the partition key from composite attributes, allowing data to be distributed across multiple partitions. The partition key for the single table was built by combining the merchant ID, payout cycle, and business vertical, with each element separated by a separator. For our workload, this approach improved the cardinality of the partition key and ensured that read and write operations are distributed as evenly as possible across partitions, avoiding poor performance.
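A minimal sketch of that key construction, assuming an underscore separator and illustrative field formats:

def build_partition_key(merchant_id: str, payout_cycle: str, vertical: str) -> str:
    # Combining three attributes raises the key's cardinality so that
    # items spread evenly across partitions. The separator and field
    # order here are assumptions for illustration.
    return f"{merchant_id}_{payout_cycle}_{vertical}"

# build_partition_key("MERCHANT#42", "2023-W21", "FOOD-DELIVERY")
# -> "MERCHANT#42_2023-W21_FOOD-DELIVERY"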

In addition, we utilized the inverted index pattern to query data. This secondary index design pattern is commonly used with DynamoDB and allows querying the other side of a many-to-many relationship, which is typically not feasible in a standard key-value store approach.

Similar to the primary table’s partition key, the partition key of a global secondary index (GSI) must spread read and write operations evenly across partitions to prevent throttling. We achieved this by appending a number, referred to as the division number (derived from business logic), to the index key. As a result, the GSI partition key takes the format <index-key>_<division-number>.
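The following Python sketch illustrates the idea, assuming a hypothetical GSI named GSI1 and a deterministic modulo-based division number in place of our actual business logic.

import zlib

import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("billing-platform")  # hypothetical table name

NUM_DIVISIONS = 10  # illustrative shard count

def sharded_gsi_key(index_key: str) -> str:
    # Append a division number so writes for hot index keys spread
    # across GSI partitions instead of concentrating on one.
    division = zlib.crc32(index_key.encode()) % NUM_DIVISIONS
    return f"{index_key}_{division}"

# Writing: the item carries the sharded GSI partition key, giving the
# inverted index (order -> payout lookup) an evenly distributed keyspace.
table.put_item(Item={
    "PK": "MERCHANT#42_2023-W21_FOOD-DELIVERY",
    "SK": "LEDGER#ORDER#9001",
    "GSI1PK": sharded_gsi_key("ORDER#9001"),
})

# Reading: recompute the division to hit the single shard that holds
# the key; a non-derivable division number would instead require
# querying all NUM_DIVISIONS shards and merging the results.
resp = table.query(
    IndexName="GSI1",
    KeyConditionExpression=Key("GSI1PK").eq(sharded_gsi_key("ORDER#9001")),
)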

Migration approach

We wanted a seamless transition from TiDB to DynamoDB with zero downtime and no disruption to the ongoing operations. We employed a phased approach with strategies such as dual write, replication, data synchronization, and failover mechanisms to ensure continuous availability of our data during the migration process.
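As one example of these strategies, a dual write during the migration window might look like the following simplified Python sketch; the helper functions are hypothetical stand-ins for our actual write paths and reconciliation job.

from typing import Any, Dict

def tidb_write(entry: Dict[str, Any]) -> None:
    ...  # hypothetical: existing relational write path (source of truth)

def dynamodb_write(entry: Dict[str, Any]) -> None:
    ...  # hypothetical: new DynamoDB write path being backfilled

def log_for_reconciliation(entry: Dict[str, Any]) -> None:
    ...  # hypothetical: hand the entry to the data synchronization job

def write_ledger_entry(entry: Dict[str, Any]) -> None:
    # During the dual-write phase, TiDB remains the source of truth and
    # every write is mirrored to DynamoDB so both stores stay in step.
    tidb_write(entry)
    try:
        dynamodb_write(entry)
    except Exception:
        # A failed mirror write must not fail the business operation;
        # the synchronization job replays it later.
        log_for_reconciliation(entry)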

The following diagram illustrates the details of each phase.

Image depicting Migration approach

This approach allowed Zomato to maintain uninterrupted access to our critical information, ensuring smooth operations and minimizing any negative impact on productivity or customer experience.

Migration results

In this section, we discuss the increased performance and scalability Zomato realized with this solution.

Performance

The decision to switch from TiDB to DynamoDB was primarily driven by the goal of enhancing the application’s performance and reducing operational complexity. The performance improvement depicted in the following figure demonstrates an average decrease of 90% in microservice response time. Another noteworthy observation is that regardless of the level of traffic, the response time of the microservice consistently remains around 75 milliseconds.

graph depicting performance

Scalability

The performance of the database layer in the earlier architecture became a significant bottleneck, resulting in the microservice achieving a throughput of only 2,000 requests per minute (see the following figure). It was essential to address this limitation in the database layer to enhance the microservice’s throughput and overall performance and to meet Zomato’s scalability requirements.

graph depicting old throughput

Following the migration to DynamoDB, the performance of the database significantly improved, successfully addressing the bottleneck at the database level. Consequently, at the current scale, the throughput of the microservice increased to 8,000 requests per minute, allowing Zomato to handle four times more traffic compared to the previous design. The enhanced throughput led to reduced lag and near-real-time billing, resulting in improved SLAs.

graph depicting new throughput

Cost

In the new solution with DynamoDB, we pay only for what we use, unlike the previous approach with TiDB, which involved over-provisioning the database for sudden traffic peaks. As a result, our billing system’s monthly expenses saw a notable 50% reduction due to these changes.

Conclusion

In this post, we discussed how Zomato successfully migrated from TiDB to DynamoDB without any downtime or negative impact on customer experience. The transition was implemented seamlessly, enabling Zomato to efficiently cater to our expanding user base by handling four times more transactions, reducing latency by nearly 90%, and cutting database costs by 50%. Additionally, this solution improved our overall reporting times, enabling better day-to-day decision-making. It also improved the response time for merchants’ statements of account, enhancing the user experience and SLAs for all stakeholders.

To learn more about how to maximize performance and minimize throughput costs while using DynamoDB, see Best practices guide for designing and architecting with DynamoDB.


About the authors

Neha Gupta is an engineering manager at Zomato with 7 years of experience. She is responsible for designing and developing robust, scalable solutions that handle massive amounts of data efficiently.

Kanica Mandhania is an SDE-III at Zomato, bringing 7 years of experience to her role. She is passionate about leveraging technology to solve customer challenges and loves building products that make life simpler.

Vishal Gupta is a Senior Solutions Architect at AWS India, based in Delhi. Vishal works with large conglomerates and digital native business (DNB) customers and enables them to design, architect, and innovate highly scalable and resilient solutions on the AWS platform. Outside work, he enjoys traveling to new destinations and spending time with his family.