Explained: How Zomato Handles 100 Million Daily Search Queries! (Part Three)
Welcome to the third and final instalment of our three-part blog series! In parts one and two, we unveiled the secrets behind Zomato’s powerful search engine, driven by Apache Solr, and shared our tactics for optimizing Solr’s performance. In this concluding chapter, we’ll take you behind the scenes of Zomato’s self-managed Solr clusters to show how we tackled the challenges arising from this rapid growth and transformed the scalability, reliability, and performance of Solr.
Hitting the infrastructural bottlenecks
Initially, we were running our Solr Clusters in Master-Slave Configuration. As Zomato’s growth continued at a breakneck speed, the index size started growing, and we started facing a new set of challenges:
To improve performance, we had to run the slave nodes on extremely large machines (120GB RAM) with a large JVM Heap (~100GB), but this resulted in poor tail latencies due to the longer Stop The World step in Garbage Collection.
Additionally, on every sync with the master, slaves had to open Searcher on a large index, and warming up the caches took more time. This led to even worse tail latencies.
Furthermore, our business experiences daily traffic peaks during the typical lunch and dinner hours, necessitating rapid autoscaling. But it took ~10 minutes for a new replica to be ready for querying.
Lastly, manual handling was required for master failovers, which resulted in low data indexing reliability and fault tolerance.
Solving for scale: Moving to SolrCloud architecture
The logical solution to this problem was to divide the index into smaller segments and perform queries across multiple segments to address application requests. This approach led us to SolrCloud architecture, which allows managing large indexes in a more efficient and scalable manner, while still being self-managed. We observed several improvements:
We were able to run the cluster on smaller machines resulting in ~75% reduction in individual instances’ CPU and Memory requirements.
We observed significant improvement in average and tail latencies.
Smaller index per shard also improved scale-up time by ~20%.
Improving the SolrCloud Performance
Custom sharding logic
To make the migration to SolrCloud more performant, we developed a custom data-sharding strategy that aligns with Zomato’s hyperlocal business model, ensuring that data from a specific locality is stored within the same shard. As a result of this strategy, ~95% of queries were processed using a single shard, significantly reducing processing time and improving efficiency. The remaining ~5% of queries were handled by querying multiple shards.
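As a rough illustration of locality-aware sharding, the sketch below maps each locality to one shard and works out which shards a query needs. The locality names, shard names and the `LOCALITY_TO_SHARD` table are hypothetical; the actual mapping logic is not described in this post.

```python
# Hypothetical locality-to-shard table, decided when the index is split.
LOCALITY_TO_SHARD = {
    "connaught-place": "shard1",
    "hauz-khas": "shard1",
    "koramangala": "shard2",
    "indiranagar": "shard2",
    "bandra": "shard3",
}


def shards_for_query(localities):
    """Return the sorted list of shards holding data for the given localities.

    An unknown locality falls back to every shard (assumed behaviour),
    which is the cross-shard path.
    """
    if any(loc not in LOCALITY_TO_SHARD for loc in localities):
        return sorted(set(LOCALITY_TO_SHARD.values()))
    return sorted({LOCALITY_TO_SHARD[loc] for loc in localities})


def single_shard_fraction(queries):
    """Fraction of queries that can be answered by exactly one shard."""
    single = sum(1 for q in queries if len(shards_for_query(q)) == 1)
    return single / len(queries)
```

With a mapping like this, a query that only touches neighbouring localities resolves to one shard, and only queries that span distant localities fan out.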
Turning off distributed search
At the time of splitting data into the shards, we ensured that we always knew which locality’s data was present in which shard. This made it possible to turn off the distributed search by directing requests to the relevant shard and ensuring that the replica receiving the request would return the response by querying its local index only.
By combining our understanding of customer behaviour with Solr’s internal request routing mechanisms, we successfully decreased the throughput on individual nodes by ~75%. Consequently, this reduction led to lower latency, decreased network traffic, and lower costs, as the same traffic was served by a smaller number of nodes.
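Solr exposes a `distrib=false` request parameter that restricts a query to the local index of the core receiving it. The sketch below builds such a request URL for a chosen replica. Whether this exact parameter was used in our setup is an assumption, and the host name, core name and helper function are hypothetical.

```python
from urllib.parse import urlencode


def local_only_query_url(replica_base_url, core_name, params):
    """Build a select URL that asks the receiving core to search only its own index."""
    query = dict(params)
    # distrib=false stops Solr from fanning the request out to other shards.
    query["distrib"] = "false"
    return f"{replica_base_url}/solr/{core_name}/select?{urlencode(query)}"


# Hypothetical replica address and core name, for illustration only.
url = local_only_query_url(
    "http://solr-node-1:8983",
    "restaurants_shard2_replica_n1",
    {"q": "cuisine:pizza", "rows": 10},
)
```

The caller is responsible for picking a replica of the shard that holds the locality's data, since the receiving core will not consult any other shard.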
Solving for resiliency and availability
Following our successful migration from Solr Master-Slave to SolrCloud architecture, operations appeared to be running seamlessly. However, one day we started receiving alerts highlighting the absence of healthy hosts in one of the shards. This single-shard malfunction led to considerable downtime during peak business hours. At our operational scale, such downtimes result in substantial business losses and affect our customers, restaurant partners, and delivery partners alike.
Sharing a few details about our self-managed SolrCloud setup and the pre-conditions which triggered the downtime:
The Solr clusters were hosted on Spot instances, which means that running nodes can go down at any time.
We had configured the cluster to use only NRT (Near Real Time) replicas, out of the three available replica types: NRT, TLOG and PULL.
Whenever a new replica joins the cluster, a unique, monotonically increasing number (znode) is assigned by ZooKeeper. During the Leader Election process, the replica with the smallest value is given a chance to become the leader.
After a replica is given the chance to become the leader, it checks how close its index version is to the most recent version held by the other replicas in the cluster. If its version is significantly outdated, it forfeits the chance of becoming the leader and goes into recovery, giving a chance to the next replica in the sequence.
After the leader is elected, any replica nodes significantly behind the new leader enter recovery mode to sync their index with the newly elected leader.
If any replica goes into recovery, it does not serve even the read traffic.
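The election rules listed above can be modelled in a few lines. This is a deliberately simplified sketch, not Solr's actual implementation: the `Replica` class, the integer index versions and the `max_lag` threshold are all hypothetical stand-ins.

```python
from dataclasses import dataclass


@dataclass
class Replica:
    name: str
    sequence: int        # election sequence number assigned on joining
    index_version: int   # higher means a more recent index
    state: str = "active"


def elect_leader(replicas, max_lag):
    """Offer leadership in sequence order; outdated candidates go into recovery."""
    latest = max(r.index_version for r in replicas)
    leader = None
    for candidate in sorted(replicas, key=lambda r: r.sequence):
        if latest - candidate.index_version > max_lag:
            candidate.state = "recovering"  # forfeits its turn
            continue
        leader = candidate
        break
    if leader is None:
        return None
    leader.state = "leader"
    # Replicas far behind the new leader must sync before serving traffic.
    for r in replicas:
        if r is not leader and leader.index_version - r.index_version > max_lag:
            r.state = "recovering"
    return leader


def serving_replicas(replicas):
    """Replicas in recovery serve no traffic, not even reads."""
    return [r.name for r in replicas if r.state != "recovering"]
```

In this model, if the replica with the freshest index is also the last to have joined, every older replica ends up in recovery and a single node is left to serve the shard.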
Now, what actually happened at the time of the incident:
The leader instance got reclaimed, and hence the cluster had no active leader.
A few minutes before the actual leader went down, a new replica joined the shard. Therefore, it had the most recent copy of the index.
Under these special circumstances, all the nodes in that shard went into recovery as their indexes were relatively older compared to the new replica and finally, the new replica was invited to become the leader.
The entire read traffic got redirected to the new leader. At the same time, all other replicas were trying to download GBs of data from the leader node. This led to a vicious cycle where the shard was not able to become healthy due to excess load on one node.
Finally, we had to cut all the traffic (single shard and cross-shard requests both) to allow it some time to become healthy.
Switch to TLOG and Pull Replicas
At Zomato, we believe that “Only the Paranoid Survive”. After finishing the RCA of the incident, we decided that we need to solve a few things:
The Solr cluster must remain highly available for read queries, serving read traffic during the leader election or even without any leader at all.
We should be able to quickly scale up the cluster without putting too much stress on the leader.
We decided to deep dive into Solr’s replication and leader election process. Each replica type has a different way of processing the updates from the leader, which decides the series of events during the Leader Election:
Each NRT replica receives the update from the leader and writes data to the transaction log (tlog).
Updates are periodically flushed from the tlog to the index and stored on disk. This also means that the index files can be completely different (highlighted in Figure 2 as Idx1 and Idx2) in each replica.
An NRT replica is eligible to become a leader during the leader election process.
In order to avoid any data loss, these replicas can go into recovery during the leader election.
Each TLOG replica receives the update from the leader and writes data to the transaction log but does not write changes to the local index.
The index is periodically updated by fetching only the incremental changes from the leader. This also means that the index files will be the same as that of the leader.
TLOG replicas are also eligible to become a leader and just like NRT replicas, they can go into recovery during leader election to ensure data availability.
PULL replicas do not receive any updates directly from the leader and hence do not maintain a transaction log.
Similar to TLOG replicas, the index is periodically updated by fetching only the incremental changes from the leader.
PULL replicas are not eligible to become leader during the leader election.
Keeping all of this in mind, we tried out multiple combinations of replica types, recreated multiple failure scenarios, and observed the following:
NRT and PULL replicas [Not Recommended]
During leader election, only the NRT replicas go into recovery. The cluster is able to serve read requests by PULL replicas at all times (even without a leader).
There is no Near Real Time search benefit despite having NRT replicas, because PULL replicas reflect changes only after the NRT leader commits them to the index.
Since each replica has its own copy of index files, in case of Leader Election, the follower replicas need to download the entire index from the new leader and discard their old index files. This leads to a lot of disk IO and network bandwidth wastage and puts the leader under stress.
As mentioned earlier, the cluster goes into recovery for a short period of time during the Leader election process. Hence, the cluster will not be available for reads for that period.
NRT and TLOG replicas [Not Recommended]
This combination behaves the same way as the NRT and PULL combination, and hence has the same limitations and advantages.
TLOG and PULL replicas
During the leader election, only the TLOG replicas go into recovery and need to download only the incremental changes from the new leader.
Similar to the NRT and PULL combination, the cluster is able to serve read requests by PULL replicas even without a leader or during the leader election process.
The downside is that the cluster loses the capability of Near Real-Time Search.
Finally, we went ahead with the combination of TLOG and PULL replicas as we can bear some delay in reflecting the updates but we need the Solr cluster to be highly available for querying.
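For reference, Solr's Collections API lets a collection be created with a chosen mix of replica types through the `nrtReplicas`, `tlogReplicas` and `pullReplicas` parameters. The sketch below builds such a CREATE request. The host, collection name and replica counts are hypothetical and do not reflect our production values.

```python
from urllib.parse import urlencode


def create_collection_url(solr_base_url, name, num_shards, tlog_replicas, pull_replicas):
    """Build a Collections API CREATE URL using only TLOG and PULL replicas."""
    params = {
        "action": "CREATE",
        "name": name,
        "numShards": num_shards,
        "nrtReplicas": 0,               # no NRT replicas in this layout
        "tlogReplicas": tlog_replicas,  # leader-eligible replicas
        "pullReplicas": pull_replicas,  # read-only replicas
    }
    return f"{solr_base_url}/solr/admin/collections?{urlencode(params)}"


# Hypothetical values, for illustration only.
url = create_collection_url("http://solr-node-1:8983", "restaurants", 4, 2, 6)
```

Keeping more than one TLOG replica per shard leaves a leader candidate available when the current leader is lost, while PULL replicas carry the read traffic.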
Extra Miles – seeding index data for faster scaling
Whenever a new node comes up, it has to download the entire index from the leader. This process has a serious limitation as it leads to bandwidth issues in case a lot of replicas need to be created at the same time.
We tried to look for solutions and sought help from the Solr community, but we eventually decided to build a custom solution:
Periodically take a snapshot of the index and save it in AWS S3.
Whenever a new node comes up, it first downloads the latest snapshot from S3.
Configure the replica’s dataDir to the path where the index is downloaded. This makes it possible to seed the index data for any new replica.
After this, the replica downloads only the incremental changes from the leader, allowing us to scale 100s of replicas at the same time.
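The seeding steps above can be sketched as follows. This is a minimal outline under stated assumptions: the snapshot key layout (`snapshots/<shard>/<timestamp>/`), the helper names and the paths are hypothetical, the S3 download itself is left out, and using the Collections API `ADDREPLICA` action with its `dataDir` parameter is one possible way to point a new replica at the seeded directory rather than a description of our exact tooling.

```python
from urllib.parse import urlencode


def latest_snapshot(keys, shard):
    """Pick the newest snapshot key for a shard.

    Assumes keys look like 'snapshots/<shard>/<ISO-like timestamp>/',
    so lexicographic order matches chronological order.
    """
    prefix = f"snapshots/{shard}/"
    candidates = [k for k in keys if k.startswith(prefix)]
    return max(candidates) if candidates else None


def add_seeded_replica_url(solr_base_url, collection, shard, data_dir, replica_type="PULL"):
    """Build an ADDREPLICA URL whose dataDir points at the pre-downloaded index."""
    params = {
        "action": "ADDREPLICA",
        "collection": collection,
        "shard": shard,
        "type": replica_type,
        "dataDir": data_dir,  # directory already populated from the snapshot
    }
    return f"{solr_base_url}/solr/admin/collections?{urlencode(params)}"


# Hypothetical snapshot listing and paths, for illustration only.
keys = [
    "snapshots/shard1/2023-05-01T10-00-00Z/",
    "snapshots/shard2/2023-05-01T10-00-00Z/",
    "snapshots/shard2/2023-05-01T11-00-00Z/",
]
seed = latest_snapshot(keys, "shard2")
url = add_seeded_replica_url("http://solr-node-1:8983", "restaurants", "shard2", "/var/solr/seed/shard2")
```

Because the bulk of the index comes from object storage, the leader only has to serve the changes made since the snapshot was taken.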
This solved one more problem for us. Currently, Solr offers backup and restore functionality for each collection, and it creates a single backup for the index data across all shards. Our solution enabled us to restore a specific shard by simply creating a TLOG replica using the seed data stored in S3. This effectively functions as a point-in-time recovery mechanism for individual shards, adding an extra layer of flexibility and resilience to our SolrCloud infrastructure.
The spirit of being always paranoid, clubbed with a deeper understanding of our business and the underlying technology, continuously pushes us to build scalable, resilient and highly available solutions for customers. Our journey with Apache Solr has been an interesting one and full of learnings and we look forward to regularly sharing this with the community.