Zomato, India’s food ordering and delivery platform, which lists more than 200,000 restaurants, is emerging as the leading player in India’s food tech industry. As the demand for online food ordering continues to grow, Zomato recognizes the importance of innovation in meeting scalability requirements.
Considering the nature of the business, customer traffic is primarily concentrated during meal times, leading to notable differences in workload between peak and non-peak hours. Additionally, on special occasions like Diwali and New Year’s Eve, Zomato experiences massive traffic surges with significantly higher spikes compared to regular days. Therefore, it’s crucial for Zomato to select a database that ensures consistent low latency, regardless of scale, and possesses the capability to handle traffic spikes without the need for expensive overprovisioning during periods of lower activity.
In this post, we explain the primary factors that prompted Zomato to move away from relational databases, specifically TiDB, to Amazon DynamoDB, a fully managed, serverless, key-value NoSQL database. This transition enabled us to handle traffic spikes without costly overprovisioning during periods of reduced activity, while also lowering total cost of ownership (TCO).
Zomato’s Billing Platform is responsible for managing post-order processes, primarily focusing on maintaining ledgers and handling payouts for various business verticals such as food delivery, Dining Out, Blinkit, and Hyperpure. It also maintains ledgers and handles payouts for Feeding India, Zomato’s giveback initiative.
This platform handles the distribution of payments to restaurant partners and riders at a large scale, processing around 10 million events on a typical day and making approximately 1 million payments every week. Because generating invoices and processing payments is a mission-critical function for Zomato, the availability and resiliency of the billing system are essential to the success of its business.
The Zomato Billing Platform architecture follows an asynchronous messaging approach, using Kafka to integrate independent microservices in a loosely coupled manner so they can operate at scale and evolve independently and flexibly. The platform acts as both a producer and a consumer of Kafka events: it consumes billing order events, processes them, and produces processed ledger entries for various business use cases. The following diagram illustrates the legacy architecture.
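To make this event flow concrete, the following is a minimal sketch of the consume-process-produce loop; the topic names, broker address, payload fields, and processing logic are illustrative assumptions rather than Zomato’s actual implementation.

```python
# Illustrative sketch only: topic names, broker address, and payload fields
# are assumptions, not Zomato's actual implementation.
import json
from kafka import KafkaConsumer, KafkaProducer

consumer = KafkaConsumer(
    "billing-orders",                       # hypothetical input topic
    bootstrap_servers=["localhost:9092"],
    group_id="billing-platform",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
producer = KafkaProducer(
    bootstrap_servers=["localhost:9092"],
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def process(order: dict) -> dict:
    """Turn a billing order event into a processed ledger entry (placeholder logic)."""
    return {
        "merchant_id": order["merchant_id"],
        "order_id": order["order_id"],
        "payable_amount": order["amount"] - order.get("commission", 0),
    }

for message in consumer:                              # consume billing order events,
    ledger_entry = process(message.value)             # process them,
    producer.send("processed-ledger", ledger_entry)   # and publish processed ledger events
```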
Zomato Billing Platform used TiDB, an open source distributed SQL database that supports online transaction processing (OLTP) and online analytical processing (OLAP) workloads. It combines the scalability of NoSQL databases with the ACID compliance of traditional relational databases. Its distributed architecture enables horizontal scalability, fault tolerance, and high availability. The TiDB system comprises multiple components that communicate with each other to form a complete system.
The key components of the TiDB system that need to be maintained include the TiDB server (the stateless SQL layer), the Placement Driver (PD) cluster (which manages metadata and scheduling), and the storage servers (TiKV for transactional row storage and TiFlash for columnar analytical storage).
Zomato initially incorporated TiDB into the original design to manage its OLTP workloads. The decision to use TiDB was based on the expertise of Zomato’s engineering team in the relational model and MySQL, as well as TiDB’s ability to scale horizontally to meet growing needs.
Over the years, the scale at which Zomato operates has grown tremendously, and with the addition of Blinkit and Hyperpure, we also needed our billing platform to be multi-tenant. We faced several challenges while scaling and maintaining our TiDB database.
DynamoDB is a serverless, NoSQL, fully managed database service with single-digit millisecond response times at any scale, enabling you to develop and run modern applications while only paying for what you use.
To address the aforementioned problems, the Zomato team identified the need to redesign the data storage layer. After conducting a thorough evaluation of various databases, the engineering team decided to use DynamoDB. This section highlights the key points that guided our approach.
In the original TiDB design, we followed the traditional relational database management system (RDBMS) approach. Each table was dedicated to storing data for a particular entity and had a unique key that connected to a foreign key in another table. To retrieve data from related tables and present results in a unified view, a JOIN operation involving multiple tables was performed. For example, we had separate sets of tables for each business vertical, which contained entities like payout_details, billing_ledger, payout_order_mapping, and billing_breakup, as illustrated in the following figure.
Given our learnings from using DynamoDB for multiple microservices, we decided to streamline this relational schema into a unified DynamoDB table using the adjacency list design pattern that encompasses all businesses and stores data for various entities. This strategy effectively consolidated related data into a single table (see the following figure), enhancing performance by eliminating the need to fetch data from different locations on disk. It also reduced costs, especially for read operations, because a single read operation now retrieves all the required data instead of multiple queries for different entities.
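To illustrate the adjacency list pattern (not the production schema), related entities can share one partition key and be distinguished by a sort key prefix, so a single Query returns everything for a merchant’s payout cycle. All table, attribute, and key names below are hypothetical.

```python
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("billing")  # hypothetical table name

# One partition key groups every entity for a merchant's payout cycle;
# the sort key prefix identifies the entity type (payout details, ledger, breakup).
with table.batch_writer() as batch:
    batch.put_item(Item={"PK": "M123#2023-W20#food-delivery",
                         "SK": "PAYOUT_DETAILS#P001", "status": "PAID", "amount": 18250})
    batch.put_item(Item={"PK": "M123#2023-W20#food-delivery",
                         "SK": "BILLING_LEDGER#O5001", "order_total": 540, "commission": 108})
    batch.put_item(Item={"PK": "M123#2023-W20#food-delivery",
                         "SK": "BILLING_BREAKUP#O5001", "taxes": 27, "delivery_charge": 30})

# A single Query now retrieves all related entities together.
related = table.query(
    KeyConditionExpression=Key("PK").eq("M123#2023-W20#food-delivery")
)["Items"]
```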
In DynamoDB, data is distributed across partitions, which are physical storage units. Each table can have multiple partitions, and in most cases the partition key determines where the data is stored. For tables with composite primary keys, the sort key may be used as a partition boundary: DynamoDB splits a partition by sort key if an item collection grows beyond 10 GB. Designing the schema and identifying the partition key are crucial for efficient data access. Imbalances in data access can lead to hot partitions, where one partition experiences a higher workload, causing throttling and inefficient use of I/O capacity.
We tackled this problem by constructing the partition key from composite attributes, allowing data to be distributed across multiple partitions. The partition key for the single table was built by combining the merchant ID, payout cycle, and business vertical, with each element separated by a separator. For our workload, this approach improved the cardinality of the partition key and ensured that read and write operations are distributed as evenly as possible across partitions, avoiding poor performance.
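A small helper along these lines could build that composite key; the "#" separator and the exact field order are assumptions made for illustration.

```python
def build_partition_key(merchant_id: str, payout_cycle: str,
                        business_vertical: str, separator: str = "#") -> str:
    """Combine merchant ID, payout cycle, and business vertical into a single
    high-cardinality partition key, for example 'M123#2023-W20#food-delivery'."""
    return separator.join([merchant_id, payout_cycle, business_vertical])
```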
In addition, we used the inverted index pattern, a secondary index design pattern commonly used with DynamoDB, to query the other side of a many-to-many relationship, which is typically not feasible in a standard key-value store approach.
Similar to the primary table’s partition key, it is essential to ensure an even distribution of read and write operations across partitions for the partition key of a global secondary index (GSI) to prevent throttling. This was achieved by introducing a number, referred to as the division number (based on business logic), along with the index key. As a result, the updated GSI key format is represented as <index-key>_<division-number>.
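The following sketch shows how such a sharded GSI key could be written and then read back by fanning out across all divisions; the shard count, attribute names, and index name are illustrative assumptions.

```python
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("billing")  # hypothetical table name
NUM_DIVISIONS = 10  # assumed shard count; the actual division number comes from business logic

def sharded_gsi_key(index_key: str, division_number: int) -> str:
    """Append the division number so writes spread across GSI partitions, e.g. 'UTR-98765_3'."""
    return f"{index_key}_{division_number}"

def query_all_divisions(index_key: str) -> list:
    """Fan a read out across every division of the sharded GSI key."""
    items = []
    for division in range(NUM_DIVISIONS):
        response = table.query(
            IndexName="GSI1",  # hypothetical index name
            KeyConditionExpression=Key("GSI1PK").eq(sharded_gsi_key(index_key, division)),
        )
        items.extend(response["Items"])
    return items
```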
We wanted a seamless transition from TiDB to DynamoDB with zero downtime and no disruption to ongoing operations. To achieve this, we undertook a phased approach with strategies such as dual write, replication, data synchronization, and failover mechanisms to ensure continuous availability of our data during the migration process.
The following diagram illustrates the details of each phase.
This approach allowed Zomato to maintain uninterrupted access to its critical information, ensuring smooth operations and minimizing any negative impact on productivity or customer experience.
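As a rough illustration of the dual-write phase only, the sketch below mirrors each write to both stores while TiDB remains the system of record; the connection details, table and column names, and error handling are placeholders, and the replication, data synchronization, and failover steps are not shown.

```python
import json
import boto3
import pymysql  # TiDB speaks the MySQL protocol, so a MySQL driver is used here

dynamodb_table = boto3.resource("dynamodb").Table("billing")             # hypothetical table
tidb = pymysql.connect(host="tidb-host", user="app",
                       password="example-password", database="billing")  # placeholder credentials

def write_ledger_entry(entry: dict) -> None:
    """Dual-write phase: persist the entry in TiDB (system of record) and
    mirror it to DynamoDB so both stores stay in sync until cutover."""
    with tidb.cursor() as cursor:
        cursor.execute(
            "INSERT INTO billing_ledger (pk, sk, payload) VALUES (%s, %s, %s)",  # placeholder schema
            (entry["PK"], entry["SK"], json.dumps(entry)),
        )
    tidb.commit()
    dynamodb_table.put_item(Item=entry)  # mirror the same entry to DynamoDB
```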
The decision to switch from TiDB to DynamoDB was primarily driven by the objective of enhancing the application’s performance and reducing operational complexity. The performance improvement depicted in the following figure shows an average decrease of 90% in microservice response time. Another noteworthy observation is that regardless of the level of traffic, the response time of the microservice consistently remains around 75 milliseconds.
Following the migration to DynamoDB, the performance of the database significantly improved, successfully addressing the bottleneck at the database level. Consequently, with the current scale, the throughput of the microservice increased to 8,000 RPM, allowing Zomato to handle four times more traffic compared to the previous design. The enhanced throughput led to reduced lag and near-real-time billing, resulting in improved SLAs.
This blog was authored by Kanica Mandhania and Neha Gupta under the guidance of Himanshu Rathore. The migration was carried out by the Central billing team, including Shailendra Kumar, Sushrut Bhatele, Shreyas Raj, and Vikas Kumar, under the mentorship of Gyanendra Kumar.