Zomato Engineering | February 29, 2024 | 6 min read
A Tale of Scale: Behind the Scenes at Zomato Tech for NYE 2023

The Magnitude of New Year’s Eve

On this day, every second counts, and every order is a celebration. This year, we didn’t just surpass expectations; we redefined them.

On the technical front, our observability platform adeptly handled 2.2 billion active time series metrics, reaching a peak throughput of 17.8 million per second, showcasing robust performance during peak demand periods. Customers placed about 2 million orders from new restaurants they explored, highlighting adventurous palates and a vast variety of options on Zomato.

On the data front, LogStore, our in-house logging platform, processed 60 TB of application logs on that day, with a peak ingestion rate of 80 million RPM. Kafka, the backbone of our event-driven architecture, handled 450+ million messages per minute, with the clusters achieving a total peak throughput of 5.6 GBps.

Even with the unprecedented demand of New Year’s Day, 60% of the orders were successfully delivered within 30 minutes. Our chat support handled a record 19,500 peak concurrent sessions, with about 51,400 messages exchanged per minute, ensuring swift and effective customer service.

Prelude to the Big Night: The Preparation

At Zomato, preparation for NYE is a blend of tech and strategy, setting a benchmark for best practices for ourselves and the industry. We delve deep into each service, ensuring that every aspect is finely tuned for peak performance. Special focus is placed on kill switch reviews, ensuring fail-safes for every scenario. Our preparation includes:

  • Service Design Review: We revisit service design from the ground up, removing excess redundancies to reduce latency and enhance performance.
  • Database Optimization: Monitoring of database metrics helps identify bottlenecks. Slow queries are optimized, and the database load is reduced by caching frequently used queries.
  • Incident Analysis: Past incidents are meticulously reviewed for actionable insights, ensuring all loops are closed.
  • Adherence to Best Practices: Each service is aligned with organizational best practices.
  • Enhanced Monitoring: Any gaps in monitoring are filled to aid quicker debugging during incidents.
  • Alerts and Anomalies: Frequent alerts are reviewed, and root cause analyses are conducted. New alerts are added to identify early signs of potential issues.
  • Kill Switches and Degraded Modes: To prevent lower-tier dependencies from impacting critical services, kill switches are implemented, including modes for degraded operations to maintain core business functions.
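
To make the kill switch idea concrete, here is a minimal sketch, not our production implementation. The `KillSwitches` class, the `call_or_degrade` helper, and the "recommendations" dependency are hypothetical names used only for illustration, and the flags live in memory where a real setup would likely read them from shared configuration.

```python
from typing import Callable, Dict, TypeVar

T = TypeVar("T")


class KillSwitches:
    """In-memory kill-switch registry (hypothetical, for illustration only)."""

    def __init__(self) -> None:
        self._killed: Dict[str, bool] = {}

    def kill(self, dependency: str) -> None:
        """Turn a dependency off so callers go straight to degraded mode."""
        self._killed[dependency] = True

    def restore(self, dependency: str) -> None:
        """Turn a dependency back on."""
        self._killed[dependency] = False

    def is_killed(self, dependency: str) -> bool:
        return self._killed.get(dependency, False)


def call_or_degrade(
    switches: KillSwitches,
    dependency: str,
    primary: Callable[[], T],
    degraded: Callable[[], T],
) -> T:
    """Call a lower-tier dependency, falling back to a degraded result."""
    if switches.is_killed(dependency):
        # Switch is on: skip the dependency entirely.
        return degraded()
    try:
        return primary()
    except Exception:
        # Dependency failed: keep the critical flow alive.
        return degraded()
```

In this sketch, a critical service would wrap a non-essential call (say, recommendations on an order page) so that flipping the switch returns an empty list instead of blocking the order flow.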

However, the cornerstone of our preparation is chaos testing and benchmarking.

Chaos testing: An approach to testing a system’s integrity by simulating faults and injecting them into the system. We understand that despite exhaustive reviews, surprises can lurk in the system. Imagine a seemingly minor service causing major disruptions – this is where chaos testing becomes invaluable. It helps us uncover hidden dependencies and potential failure points in our complex microservices architecture. We simulate extreme conditions such as:

  • Upstream downtime by injecting unrealistically high response times via our internal service mesh layer built using Kuma
  • Activation of kill switches to trigger degraded mode, testing system sanity
  • Cache miss injections to evaluate database reliability under stress
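
The first scenario can be illustrated in miniature. The sketch below does not use Kuma or any service mesh API. It simply wraps a function with artificial delay inside one process and checks that a caller with a timeout falls back to a degraded value. `with_injected_latency` and `call_with_timeout` are hypothetical helpers introduced for this example.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout
from typing import Any, Callable


def with_injected_latency(func: Callable[[], Any], delay_s: float) -> Callable[[], Any]:
    """Return a version of func that sleeps for delay_s before responding."""

    def wrapped() -> Any:
        time.sleep(delay_s)  # simulated slow upstream
        return func()

    return wrapped


def call_with_timeout(func: Callable[[], Any], timeout_s: float, fallback: Any) -> Any:
    """Run func in a worker thread; return fallback if it exceeds timeout_s."""
    executor = ThreadPoolExecutor(max_workers=1)
    future = executor.submit(func)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        return fallback
    finally:
        # Do not block the caller while the slow call finishes in the background.
        executor.shutdown(wait=False)
```

If the caller has no timeout or no fallback, the injected delay propagates upward, which is exactly the kind of hidden dependency such a test is meant to surface.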

Benchmarking: The adage “At scale, everything breaks” is not just a cautionary statement, but a fundamental principle that guides our approach to system reliability. As we scale up to meet the massive surge in orders and customer interactions during NYE, hidden vulnerabilities and bottlenecks in our system that are not apparent during regular operations may emerge. These can range from database performance issues and network bottlenecks to unforeseen failures in microservices that behave differently under heavy load.

Finding these weaknesses early is particularly crucial when preparing for events like NYE, where the sheer volume of transactions pushes our systems to their limits. Benchmarking shows how those systems would behave under the extreme pressure of such a high-traffic event.

Benchmarking, especially across our 300+ microservices, presented a unique challenge. Our goal was to push our systems to their limits while ensuring zero negative impact on our operations. To strike this balance, our Site Reliability Engineering (SRE) team developed an in-house Benchmarking Platform (more on this in a subsequent blog) tailored to these challenges. Our approach included:

  • Maintaining data integrity by focusing on read-only traffic or ensuring safe write operations
  • Isolated benchmarking to prevent simultaneous stress on interrelated systems
  • Isolation from production workloads: the benchmarking setup is completely independent to prevent any adverse effects on production loads
  • Gradual increase in load to allow sufficient time for effective autoscaling
  • Automated and systematic benchmarking to reduce manual effort and ensure timely benchmarking of a large number of microservices
  • Safe practices such as aborting benchmarking automatically if any critical alerts are triggered, or reducing the benchmarking load if the error rate on an upstream service increases
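
The gradual ramp and the safety rules above can be sketched as a small control loop. This is a simplified simulation under stated assumptions, not the Benchmarking Platform itself: `run_ramp`, its parameters, the 1% error threshold, and the halve-on-errors rule are all hypothetical choices, and the alert and error signals are passed in as plain functions of the current load.

```python
from typing import Callable, List, Tuple


def run_ramp(
    target_rpm: int,
    step_rpm: int,
    error_rate_at: Callable[[int], float],
    critical_alert_at: Callable[[int], bool],
    max_error_rate: float = 0.01,
    max_steps: int = 50,
) -> Tuple[List[int], str]:
    """Ramp load in steps; return the load applied at each step and the outcome."""
    history: List[int] = []
    rpm = step_rpm
    for _ in range(max_steps):
        history.append(rpm)  # apply this load level for one step
        if critical_alert_at(rpm):
            return history, "aborted"  # critical alert: stop immediately
        if error_rate_at(rpm) > max_error_rate:
            # Upstream errors rising: back off instead of pushing harder.
            rpm = max(step_rpm, rpm // 2)
        elif rpm >= target_rpm:
            return history, "completed"
        else:
            # Healthy: raise load gradually so autoscaling can keep up.
            rpm = min(target_rpm, rpm + step_rpm)
    return history, "step_limit"
```

With healthy signals, `run_ramp(500, 100, lambda rpm: 0.0, lambda rpm: False)` walks through 100, 200, 300, 400 and 500 RPM and reports `"completed"`. A critical alert at any step ends the run right there.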

The results of this intensive preparation were remarkable. We clocked a total of 14,000 minutes of benchmarking across more than 100 services, achieving a peak throughput of 50 million requests per minute and firing over 30 billion requests during the benchmarking phase. This extensive exercise, storing around 400 TB of benchmarking requests, provided us with invaluable insights and actionable data, all achieved without a single service outage.
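
For a rough sense of proportion, the figures above can be combined with some back-of-the-envelope arithmetic. The sketch treats the rounded numbers ("over 30 billion", "around 400 TB") as exact and assumes decimal terabytes, so the outputs are order-of-magnitude estimates rather than measured values. `benchmark_summary` is a hypothetical helper.

```python
from typing import Dict


def benchmark_summary(
    total_requests: int,
    total_minutes: int,
    peak_rpm: int,
    stored_bytes: int,
) -> Dict[str, float]:
    """Derive rough averages from the headline benchmarking figures."""
    avg_rpm = total_requests / total_minutes
    return {
        "avg_rpm": avg_rpm,                      # average requests per benchmarking minute
        "peak_to_avg": peak_rpm / avg_rpm,       # how far the peak sits above the average
        "bytes_per_request": stored_bytes / total_requests,
    }


summary = benchmark_summary(
    total_requests=30_000_000_000,      # "over 30 billion requests"
    total_minutes=14_000,               # "14,000 minutes of benchmarking"
    peak_rpm=50_000_000,                # "50 million requests per minute"
    stored_bytes=400 * 10**12,          # "around 400 TB", decimal TB assumed
)
```

Under these assumptions the average works out to roughly 2.1 million requests per benchmarking minute, the peak is about 23 times that average, and each stored request averages around 13 KB.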

D-Day: Breaking Records, Setting Benchmarks

On NYE, Zomato’s war rooms, spanning multiple businesses, were buzzing with energy as teams from every department came together to monitor services closely. The overall atmosphere was charged, reflecting a day where every aspect comes together in a coordinated, tech-driven effort. The SRE team proactively prescaled the infrastructure to handle the expected surge. We deployed EC2 fleets spanning 11,000 EC2 instances, enough to host our ECS services, which consisted of ~40,000 ECS tasks driving our operations forward. Large screens on every floor in the office displayed performance indicators, keeping the team alert and responsive.

Here is what Raunak Kondiboyina from the Serviceability team had to say about the energy of 31st December at the Zomato office – “This was my 4th NYE at Zomato. Compared to last year, this year I was feeling pretty relaxed. Most of this confidence was because we started preparing early, and all review processes and benchmarking helped the team gain more confidence.” 

Manuj Grover from the Time-service team shared his emotions about D-Day – “It was D-Day, the excitement and anticipation was palpable. It was a mix of nervousness and thrill as we prepared for the highest order volume of the year. The sleepless nights and long hours leading up to the day were fueled by a sense of purpose and determination to ensure everything ran smoothly.”

As NYE descended, energy in Zomato’s war rooms accelerated to match the festive fervor in India. Activity on our app surged as the order rate skyrocketed to 8,400 orders per minute. Our systems absorbed a tsunami of digital traffic, peaking at 80.92 million requests per minute.

We use ElastiCache for our caching needs. On NYE, we scaled up our ElastiCache clusters to a total of 35 TB of memory across all our microservices. At peak, this fleet saw 20 Gbps of network throughput, managing 1 billion GET operations and around 100 million SET operations per minute.

We rely on a mix of DynamoDB, Aurora RDS and MongoDB as our primary databases. DynamoDB usage peaked with 12.3 million RCUs and 5.2 million WCUs consumed per minute, while our RDS clusters were pushed to their limits, handling 610,000 select queries and 16,000 write transactions per second. The MongoDB clusters’ aggregate read throughput peaked at 470,000 ops/s, scanning 3.3 million docs per second, while write throughput peaked at 180,000 ops/s, writing over 1.75 million docs per second.

Storage provisioning was no less formidable: on New Year’s Eve, we provisioned a total of 5.5 petabytes across our infrastructure.

Compared to previous years, we saw a monumental shift in scale and efficiency. While India celebrated New Year’s Eve, our order metrics climbed to unprecedented heights, mirroring the exhilarating atmosphere in our war rooms, and peaked at the moment when we hit the 3 million orders mark, a milestone that not only signified our operational success but also underscored the trust placed in us by millions of customers. 

Here’s what Angad Sharma from the Search team had to say about D-Day energy – “Zomato office during NYE was bubbling with energy, a sweet kind of chaos but a confident & prudent calm like we had everything under control, and no external power can waiver our resolve & pledge for resiliency.”

Post NYE Reflections: Zomato’s Ongoing Mission of Excellence, Innovation, and Continuous Learning

While our New Year’s Eve experience has definitely become a springboard for future growth, we also know that we have a lot more to do.

It takes curiosity to learn and unlearn, and as a young Engineering team – with an average age of just 25 years – we bring a spirit of curiosity and questioning to work every day. We believe we will hit new milestones this year, and are focused on enhancing Zomato’s ordering and delivery journey at every step of the way.

Our goal for 2024? We want to build for the happiest customer experience, and not just a solution to a problem. In this pursuit, we are bound to make some mistakes and will have to face many new challenges – but we will surely get more things right than wrong 🙂

As always, we’re just 1% done.
