Kanica Mandhania | January 11, 2024 | 9 min read

Unlocking performance, scalability, and cost-efficiency of Zomato’s Billing Platform by switching from TiDB to DynamoDB

Zomato, India’s food ordering and delivery platform, that lists more than 2,00,000 restaurants, is emerging as the leading market player in India’s food tech industry. As the demand for online food ordering continues to grow, Zomato recognizes the importance of innovation in meeting scalability requirements.

Considering the nature of the business, customer traffic is primarily concentrated during meal times, leading to notable differences in workload between peak and non-peak hours. Additionally, on special occasions like Diwali and New Year’s Eve, Zomato experiences massive traffic surges with significantly higher spikes compared to regular days. Therefore, it’s crucial for Zomato to select a database that ensures consistent low latency, regardless of scale, and possesses the capability to handle traffic spikes without the need for expensive overprovisioning during periods of lower activity.

In this post, we explain the primary factors that prompted Zomato to transition away from relational databases and TiDB to Amazon DynamoDB, a fully managed, serverless, key-value NoSQL database. This enabled us to effectively manage traffic spikes without the need for costly overprovisioning during periods of reduced activity and even lower total cost of ownership (TCO).

Zomato’s Billing Platform

Zomato’s Billing Platform is accountable for managing post-order processes, primarily focusing on maintaining ledgers and handling payouts for various business verticals such as Food delivery, Dining Out, Blinkit and Hyperpure. It also maintains ledgers and handles payouts for Feeding India, a Zomato give-back.

This platform effectively handles the distribution of payments to restaurant partners and riders at a large scale and processes around 10 million events on a typical day. This results in approximately 1 million payments on a weekly basis. Because generating invoices and processing payments is a mission-critical function for Zomato, the availability and resiliency of the billing system is important for the success of its business.

The Zomato Billing Platform architecture follows an asynchronous messaging approach using Kafka for integrating independent microservices in a loosely coupled manner to operate at scale and evolve independently and flexibly. The platform acts as both producers and consumers of kafka events consuming billing order, processing it and producing processed ledger for various business use cases. The following diagram illustrates the legacy architecture.

Zomato Billing Platform used TiDB, an open source distributed SQL database that supports online transaction processing (OLTP) and online analytical processing (OLAP) workloads. It combines the scalability of NoSQL databases with the ACID compliance of traditional relational databases. Its distributed architecture enables horizontal scalability, fault tolerance, and high availability. The TiDB system comprises multiple components that communicate with each other to form a complete system.

The following key components of the TiDB system need to be maintained:

TiDB server – The TiDB server serves as a stateless SQL layer, providing external access through the MySQL protocol. It can be scaled horizontally and offers a unified interface through load balancing components.
Placement Driver server – The Placement Driver (PD) server manages metadata and data scheduling commands within the cluster. It acts as the brain of the cluster by storing metadata and dynamically assigning data distribution tasks to TiKV nodes based on real-time reports.
TiKV server – The TiKV server is responsible for distributed data storage and functions as a transactional key-value storage engine.
TiFlash server – The TiFlash server is a specialized columnar storage server optimized for fast analytical processing, storing data in columns for improved performance.

Challenges with the original design

Zomato initially incorporated TiDB into the original design to manage its OLTP workloads. The decision to use TiDB was based on the expertise of Zomato’s engineering team in the relational model and MySQL, as well as its ability to horizontally scale and meet scaling needs.

Over the years, the scale at which Zomato operates has grown tremendously, and with the addition of Blinkit and HyperPure, we also needed our billing platform to be multi-tenant. We faced several challenges while scaling and maintaining our TiDB database:

As our operations expanded, we faced challenges in handling schema changes and observed a decline in query performance, particularly in larger tables with billions of rows and terabytes of data. Furthermore, performance issues arose with certain workloads involved subqueries or joining numerous tables.
The distributed nature of TiDB introduced additional complexity compared to single-node databases. One such example is adding nodes with larger storage capacity to a balanced cluster for scalability. This resulted in the designated node becoming the primary for write operations, leading to higher CPU usage and negatively impacting the overall performance of the service. This, in turn, led to poor service level agreements (SLAs), a compromised customer experience, and delayed payments.
We expanded our cluster size from 5 nodes to 25 nodes in both the primary and replica clusters to accommodate the growing scale and data. Additionally, during peak days with anticipated high traffic, the infrastructure was manually scaled up and kept overprovisioned to prevent performance issues. All this led to an increase in TCO and made it a less scalable solution.
We opted for TiDB when their cloud solutions were not available. Because the TiDB database was self-managed, our team took on the majority of heavy lifting to ensure its reliable operation. This included tasks like synchronizing replicas, monitoring storage, adding extra nodes, and backup management. Migrating to newer versions also required tremendous effort and time, and combined all these over time this became an overhead for our teams.
As the size of the database grew, the backup tasks experienced longer completion times due to the larger volume of data being processed. This added delay in the overall backup process has led to a situation where we were unable to meet our agreed SLAs.

Why DynamoDB

DynamoDB is a serverless, NoSQL, fully managed database service with single-digit millisecond response times at any scale, enabling you to develop and run modern applications while only paying for what you use.

Apart from the fact that we are already using DynamoDB for multiple business-critical services at Zomato, the following functionalities highlight the key features of DynamoDB that increase scalability, streamline developer productivity, reduce time spent on repetitive tasks, and lower our overall TCO:
DynamoDB maintains consistent performance as your application scales, ensuring that regardless of the database size or number of concurrent queries, all operations will have a reliable response time in the order of milliseconds.
As a general rule for data modeling, related data is kept together. Therefore, DynamoDB doesn’t need a query planner to parse a query into a multi-step process to read, join, and aggregate data from different places on disk. This enables DynamoDB to support low-latency lookup and makes queries efficient with growing scale. It also frees developers and DBAs from the responsibility of performance tuning because you no longer need to spend time debugging query plans or figuring out why query performance decreases at scale.
DynamoDB offers an auto scaling capacity feature that can modify our database resources (provisioned throughput capacity) according to user traffic. This helps us scale for sudden spikes in traffic and not over-provision for peak workloads compared to TiDB—all while maintaining consistent latencies in an efficient and cost-effective manner.
Unlike TiDB, DynamoDB eliminates the need for manual setup and management tasks. It is serverless, meaning you don’t have to handle hardware provisioning, configuration, replication, software patching, database backup, or cluster scaling yourself.

Solution Overview

To address the aforementioned problems, the Zomato team identified the need to redesign the data storage layer. After conducting a thorough evaluation of various databases, the engineering team decided to use DynamoDB. This section highlights the key points that guided our approach.

Single table design

In the original TiDB design, we followed the traditional relational database management system (RDBMS) approach. Each table was dedicated to storing data for a particular entity and had a unique key that connected to a foreign key in another table. To retrieve data from related tables and present results in a unified view, a JOIN operation involving multiple tables was performed. For example, we had separate sets of tables for each business vertical, which contained entities like payout_details, billing_ledger, payout_order_mapping, and billing_breakup, as illustrated in the following figure.

Given our learnings of using DynamoDB for multiple microservices, we decided to streamline this relational schema into a unified DynamoDB table using the adjacency list design pattern that encompasses all businesses and stores data for various entities. This strategy effectively consolidated related data into a single table (see the following figure), enhancing performance by eliminating the need to fetch data from different locations on disk. It also reduced costs, especially for read operations, because a single read operation now retrieves all the required data instead of multiple queries for different entities.

Designing partition keys

In DynamoDB, data is distributed across partitions, which are physical storage units. Each table can have multiple partitions, and in most cases the partition key determines where the data is stored. For tables with composite primary keys, the sort key may be used as a partition boundary. DynamoDB splits partitions by sort key if the collection size grows bigger than 10 GB. Designing a schema and identifying the partition key is crucial for efficient data access. Imbalances in data access can lead to hot partitions, where one partition experiences a higher workload, causing throttling and inefficient use of I/O capacity.

We tackled this problem by incorporating the partition key with composite attributes, allowing for data distribution across multiple partitions. The partition key for the single table was constructed by combining the merchant ID, payout cycle, and business vertical, with each element separated by a separator. For our workload, this approach improved the cardinality of the partition key and ensured that the read and write operations are distributed as evenly as possible across tables, avoiding poor performance.

In addition, we utilized the inverted index method to query data. This secondary index design pattern is commonly used with DynamoDB. This allows for querying the other side of a many-to-many relationship, which is typically not feasible in a standard key-value store approach.

Similar to the primary table’s partition key, it is essential to guarantee an even distribution of read and write operations across partitions for the partition key of a global secondary index (GSI) to prevent throttling. This was achieved by introducing a number, referred to as the division number (based on a business logic), along with the index key. As a result, the updated GSI format is now represented as <index-key>_<division-number>.

Migration approach

We wanted a seamless transition from TiDB to DynamoDB with zero downtime and no disruption to ongoing operations. To achieve this, we undertook a phased approach with strategies such as dual write, replication, data synchronization, and failover mechanisms to ensure continuous availability of our data during the migration process.

The following diagram illustrates the details of each phase.

This approach allowed Zomato to maintain uninterrupted access to their critical information, ensuring smooth operations and minimizing any negative impact on productivity or customer experience.

Migration Results | Performance

The decision to switch from TiDB to DynamoDB was primarily driven with the objective of enhancing the application’s performance and reducing operational complexity. The performance improvement depicted in the following figure demonstrates an average decrease of 90% in microservice response time. Another noteworthy observation is that regardless of the level of traffic, the response time of the microservice consistently remains around 75 milliseconds.

Scalability

The performance of the database layer in the earlier architecture became a significant bottleneck, resulting in the microservice achieving a throughput of only 2,000 requests per minute (see the following figure). It was essential to address this limitation in the database layer to enhance the microservice’s throughput and overall performance and to meet Zomato’s scalability requirements.

Following the migration to DynamoDB, the performance of the database significantly improved, successfully addressing the bottleneck at the database level. Consequently, with the current scale, the throughput of the microservice increased to 8,000 RPM, allowing Zomato to handle four times more traffic compared to the previous design. The enhanced throughput led to reduced lag and near-real-time billing, resulting in improved SLAs.

Cost

In the new solution with DynamoDB, we pay only for what we use, unlike the previous approach with TiDB, which involved over-provisioning the database for sudden traffic peaks. As a result, our billing system’s monthly expenses saw a notable 50% reduction due to these changes.

A Better User Experience

The migration from TiDB to DynamoDB has enhanced our overall reporting times, empowering us to make better day-to-day decisions. Additionally, it has optimized the response time for merchants’ statements of accounts, improving the user experience and meeting SLAs across all stakeholders.

This blog was authored by Kanica Mandhania and Neha Gupta under the guidance of Himanshu Rathore. The migration was carried out by the Central billing team including Shailendra Kumar, Sushrut Bhatele, Shreyas Raj, Vikas Kumar under the mentorship of Gyanendra Kumar.