The Magnitude of New Year’s Eve
On this day, every second counts, and every order is a celebration. This year, we didn’t just surpass expectations; we redefined them.
On the technical front, our observability platform adeptly handled 2.2 billion active time series metrics, reaching a peak throughput of 17.8 million per second, showcasing robust performance during peak demand periods. Customers placed about 2 million orders from new restaurants they explored, highlighting adventurous palates and a vast variety of options on Zomato.
On the data front, LogStore – our in-house logging platform – processed 60 TB of application logs that day, with a peak ingestion rate of 80 million RPM. Kafka, the backbone of our event-driven architecture, efficiently handled a remarkable 450+ million messages per minute, with the clusters reaching a combined peak throughput of 5.6 GBps.
Even with the unprecedented demand of New Year’s Day, 60% of the orders were successfully delivered within 30 minutes. Our chat support handled a record 19,500 peak concurrent sessions and about 51,400 messages were exchanged per minute, ensuring swift and effective customer service.
Prelude to the Big Night: The Preparation
At Zomato, preparation for NYE is a blend of tech and strategy, setting a benchmark in best practices for ourselves and the industry. We delve deep into each service, ensuring that every aspect is finely tuned for peak performance, with special focus on kill switch reviews to guarantee fail-safes for every scenario. Our preparation spans many fronts, but its cornerstone is chaos testing and benchmarking.
Chaos testing: an approach to test a system’s integrity by simulating and injecting faults into it. We understand that despite exhaustive reviews, surprises can lurk in the system – imagine a seemingly minor service causing major disruptions. This is where chaos testing becomes invaluable: it helps us uncover hidden dependencies and potential failure points in our complex microservices architecture by simulating extreme conditions.
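The core mechanic of chaos testing – injecting faults into an otherwise healthy call path – can be sketched as a small wrapper. This is an illustrative toy, not Zomato’s actual tooling; the class name, parameters, and injected error type are all hypothetical.

```python
import random
import time


class FaultInjector:
    """Wraps a service call and injects latency and failures at a configured
    rate. Purely illustrative; real chaos tools act at the network/infra layer."""

    def __init__(self, failure_rate=0.1, added_latency_s=0.0, seed=None):
        self.failure_rate = failure_rate          # fraction of calls that fail
        self.added_latency_s = added_latency_s    # artificial delay per call
        self.rng = random.Random(seed)            # seedable for repeatable runs

    def call(self, fn, *args, **kwargs):
        if self.added_latency_s:
            time.sleep(self.added_latency_s)      # simulate a slow dependency
        if self.rng.random() < self.failure_rate:
            raise ConnectionError("injected fault")  # simulated dependency outage
        return fn(*args, **kwargs)
```

Wrapping a downstream call this way lets a test harness verify that callers degrade gracefully – retrying, falling back, or shedding load – instead of cascading the failure.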
Benchmarking: the adage “At scale, everything breaks” is not just a cautionary statement but a fundamental principle that guides our approach to system reliability. As we scale up to meet the massive surge in orders and customer interactions during NYE, hidden vulnerabilities and bottlenecks that are not apparent during regular operations may emerge – from database performance issues and network bottlenecks to microservices that behave differently under heavy load.
This is particularly crucial when the sheer volume of transactions pushes our systems to their limits. Benchmarking gives us insight into how our systems would behave under the extreme pressure of a high-traffic event like NYE, before the event itself.
Benchmarking, especially across our 300+ microservices, presented a unique challenge: our goal was to push our systems to their limits while ensuring zero negative impact on live operations. To navigate this delicate balance, our Site Reliability Engineering (SRE) team developed an in-house Benchmarking Platform (more on this in a subsequent blog) tailored to these constraints.
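At its simplest, the load-generation side of a benchmark is a pool of workers firing requests at a target and measuring achieved throughput. A minimal closed-loop sketch – under the assumption of a synchronous callable target, with all names hypothetical – might look like:

```python
import time
from concurrent.futures import ThreadPoolExecutor


def benchmark(target, total_requests, concurrency):
    """Fire `total_requests` at callable `target` using `concurrency` workers
    and report achieved throughput. Illustrative only; a production platform
    adds ramp-up, open-loop pacing, latency percentiles, and safety cutoffs."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(pool.map(lambda _: target(), range(total_requests)))
    elapsed = time.perf_counter() - start
    failures = sum(1 for r in results if r is None)  # None signals an error
    return {
        "rpm": total_requests / elapsed * 60,  # requests per minute achieved
        "failures": failures,
        "elapsed_s": elapsed,
    }
```

A real platform would also ramp load gradually and abort automatically when error rates or latencies cross a threshold – that guardrail is what makes “zero negative impact” possible.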
The results of this intensive preparation were remarkable. We clocked a total of 14,000 minutes of benchmarking across over 100 services, achieving a peak throughput of 50 million requests per minute, with over 30 billion requests fired during the benchmarking phase. This extensive exercise generated around 400 TB of stored benchmarking requests and gave us invaluable, actionable insights – all without a single service outage.
D-Day: Breaking Records, Setting Benchmarks
On NYE, Zomato’s war rooms, spanning multiple businesses, were buzzing with energy as teams from every department came together to monitor services closely. The atmosphere was charged, reflecting a day where every aspect comes together in a coordinated, tech-driven effort. The SRE team proactively prescaled the infrastructure to handle the expected surge, deploying an impressive array of EC2 fleets spanning 11,000 EC2 instances – enough to host our massive armada of ECS services, roughly 40,000 ECS tasks driving our operations forward. Large screens on every floor in the office displayed performance indicators, keeping the team alert and responsive.
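Prescaling boils down to simple capacity arithmetic per service: forecast peak load, divide by per-task capacity, and add headroom. The helper below is a hypothetical sketch of that calculation – the function name, headroom fraction, and per-task RPM figure are illustrative assumptions, not Zomato’s actual numbers.

```python
import math


def prescale_task_count(forecast_rpm, rpm_per_task, headroom=0.4, minimum=2):
    """Desired ECS task count for a service: forecast load divided by the
    per-task capacity, inflated by a headroom buffer, never below a floor.
    Illustrative capacity-planning sketch only."""
    needed = forecast_rpm * (1 + headroom) / rpm_per_task
    return max(minimum, math.ceil(needed))
```

Computing this ahead of time for every service – rather than relying on reactive autoscaling alone – avoids scale-up lag at the exact moment the midnight spike hits.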
Here is what Raunak Kondiboyina from the Serviceability team had to say about the energy of 31st December at the Zomato office – “This was my 4th NYE at Zomato. Compared to last year, this year I was feeling pretty relaxed. Most of this confidence was because we started preparing early, and all review processes and benchmarking helped the team gain more confidence.”
Manuj Grover from the Time-service team, shared his emotions for the D-day here – “It was D-Day, the excitement and anticipation was palpable. It was a mix of nervousness and thrill as we prepared for the highest order volume of the year. The sleepless nights and long hours leading up to the day were fueled by a sense of purpose and determination to ensure everything ran smoothly.”
As night fell on NYE, the energy in Zomato’s war rooms rose to match the festive fervor across India. Activity on our app surged as the order rate skyrocketed to 8,400 orders per minute, and our systems weathered a tsunami of digital traffic peaking at 80.92 million requests per minute.
We use ElastiCache for our caching needs. On NYE we scaled our ElastiCache clusters up to a hefty 35 TB of total memory across all our microservices. At peak, this fleet saw 20 Gbps of network throughput while serving 1 billion GET and around 100 million SET operations per minute, demonstrating the sheer scale of our data handling.
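The roughly 10:1 ratio of GETs to SETs is characteristic of a cache-aside read path: try the cache first, and only on a miss load from the source and write back. Here is a minimal sketch of that pattern, with a plain dict standing in for the Redis client (a real implementation would use SET with a TTL); all names are hypothetical.

```python
class CacheAside:
    """Cache-aside read path: GET from cache, on miss load from the source
    of truth and SET the result. A dict stands in for the Redis client."""

    def __init__(self, loader):
        self.cache = {}        # stand-in for a Redis connection
        self.loader = loader   # fallback to the source of truth (e.g. a DB)
        self.hits = 0
        self.misses = 0

    def get(self, key):
        if key in self.cache:          # cache GET succeeded
            self.hits += 1
            return self.cache[key]
        self.misses += 1               # cache miss: go to the source
        value = self.loader(key)
        self.cache[key] = value        # SET (with a TTL in real Redis)
        return value
```

With a healthy hit rate, most traffic never reaches the databases – which is how GET volume can run an order of magnitude above SET volume.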
We rely on a mix of DynamoDB, Aurora RDS, and MongoDB as our primary databases. DynamoDB usage peaked at 12.3 million RCUs and 5.2 million WCUs consumed per minute, while our RDS clusters were pushed to their limits, handling 610k SELECT queries and 16,000 write transactions per second. Our MongoDB clusters’ aggregate read throughput peaked at 470,000 ops/s (scanning 3.3 million documents per second), while write throughput peaked at 180,000 ops/s (writing over 1.75 million documents per second).
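For readers less familiar with DynamoDB’s units: an RCU covers one strongly consistent read of up to 4 KB per second (eventually consistent reads cost half), and a WCU covers one write of up to 1 KB per second. A small estimator makes the capacity-planning arithmetic behind numbers like the above concrete; the helper names here are our own, not part of any AWS SDK.

```python
import math


def read_capacity_units(item_size_bytes, reads_per_second, strongly_consistent=True):
    """Estimate RCUs: 1 RCU = one strongly consistent read of up to 4 KB per
    second; eventually consistent reads cost half as much."""
    units_per_read = math.ceil(item_size_bytes / 4096)
    rcu = units_per_read * reads_per_second
    return rcu if strongly_consistent else math.ceil(rcu / 2)


def write_capacity_units(item_size_bytes, writes_per_second):
    """Estimate WCUs: 1 WCU = one write of up to 1 KB per second."""
    return math.ceil(item_size_bytes / 1024) * writes_per_second
```

Item size therefore matters as much as request rate: a 5 KB item costs two read units per strongly consistent read, not one.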
Storage provisioning was no less formidable: on New Year’s Eve we provisioned a total of 5.5 petabytes across our infrastructure.
Compared to previous years, we saw a monumental shift in scale and efficiency. While India celebrated New Year’s Eve, our order metrics climbed to unprecedented heights, mirroring the exhilarating atmosphere in our war rooms, and peaked at the moment we hit the 3 million orders mark – a milestone that not only signified our operational success but also underscored the trust placed in us by millions of customers.
Here’s what Angad Sharma from the Search team had to say about D-Day energy – “Zomato office during NYE was bubbling with energy, a sweet kind of chaos but a confident & prudent calm like we had everything under control, and no external power can waver our resolve & pledge for resiliency.”
Post NYE Reflections: Zomato’s Ongoing Mission of Excellence, Innovation, and Continuous Learning
While our New Year’s Eve experience has definitely become a springboard for future growth, we also know that we have a lot more to do.
It takes curiosity to learn and unlearn, and as a young Engineering team – with an average age of just 25 years – we bring a spirit of curiosity and questioning to work every day. We believe we will hit new milestones this year, and are focused on enhancing Zomato’s ordering and delivery journey at every step of the way.
Our goal for 2024? We want to build for the happiest customer experience, not just a solution to a problem. In this pursuit, we are bound to make some mistakes and will face many new challenges – but we will surely get more things right than wrong 🙂
As always, we’re just 1% done.