Zomato Engineering | March 31, 2023 | 7 min read
Explained: How Zomato Handles 100 Million Daily Search Queries! (Part One)

Whether they are craving a piping hot pizza or a scrumptious biryani, customers can find their desired dishes effortlessly with just a few clicks. But have you ever wondered what serves as the backbone of all the Search features available on the Zomato app and website?

Providing correct and specific search results to our customers is the most basic criterion for a good user experience. Able to handle a virtually unlimited variety of inputs, our search system powers more than 100 million search queries every day. On Zomato, Solr-Lucene is the base of all search functions and solves some of the use cases most important to us – restaurant and review search, cuisine search, sorting and grouping of restaurants and cuisines based on rating and search score, and much more.

So, why did we choose Solr for such use cases?

Solr is an open-source, enterprise-level search platform built on Apache Lucene, a highly popular and widely used library for search-related applications. It has gained trust and popularity among various well-known technology companies, which is a testament to its reliability, scalability, and robustness. This popularity has led to a thriving community of developers and users, making it easier to find support and resources when needed. The well-known Elasticsearch engine is also built on the Lucene library.

Scale – Performance & Cost!

Initially, everything functioned seamlessly, but as we experienced substantial growth in recent years, some of our assumptions failed at scale. A single Solr node could not manage the high traffic volume, and the servers ran into frequent Out of Memory (OOM) errors as traffic increased. The OOM issue became a bottleneck and required a larger cluster to handle peak traffic, leading to a substantial increase in server cost. This emphasizes the need for thoughtful consideration of cost implications when making technology decisions, particularly for high-traffic applications like Zomato.

OOM, Why?

The Out of Memory (OOM) issue in our Solr setup was caused by several factors. Solr runs as a Java process, whose Garbage Collector (GC) runs at a certain frequency to clean up unreferenced objects and free up memory for further use. In our case, however, the GC was struggling to reclaim memory from the Old Gen heap space, causing that space to grow until it reached an OOM state and the process was killed. We also observed that the heap shrank back to its normal size after a node synced data from the Master machine.
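For context, heap sizing and GC behaviour are driven by Solr's startup configuration (solr.in.sh). A minimal sketch of the kind of knobs such tuning involves, with illustrative values rather than our actual production settings:

# solr.in.sh (values are illustrative only)
SOLR_JAVA_MEM="-Xms8g -Xmx8g"                     # fixed heap size to avoid resize churn
GC_TUNE="-XX:+UseG1GC -XX:MaxGCPauseMillis=250"   # G1 collector with a pause-time target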

We run some of the Solr clusters in a Master-Slave setup, where the Master machine handles indexing (write) and the Slave machines sync data from the Master at a certain frequency. The Slave machines manage the Query (read) traffic behind a load balancer.
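In configuration terms, this Master-Slave wiring is Solr's legacy replication, set up through the ReplicationHandler. A minimal sketch, with the hostname, core name, and poll interval as illustrative placeholders:

<!-- solrconfig.xml on the Master: publish a new index version after each commit -->
<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="master">
    <str name="replicateAfter">commit</str>
  </lst>
</requestHandler>

<!-- solrconfig.xml on each Slave: poll the Master at a fixed frequency -->
<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="slave">
    <str name="masterUrl">http://solr-master:8983/solr/restaurants</str>
    <str name="pollInterval">00:05:00</str>
  </lst>
</requestHandler>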

Despite tweaking the JVM parameters and memory settings, the instances continued to go OOM even under constant traffic. JVM monitoring and heap analysis revealed that the size of the caches maintained by Solr was the root cause of the issue, which the Solr Admin Dashboard confirmed as well. One of these caches, the Field Cache, was growing without limits, and we discovered that the Field Cache cannot be configured with a maximum size or entry count at all.
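The caches that can be bounded are the ones declared in the <query> section of solrconfig.xml; notably, there is no equivalent entry for the Field Cache, which Lucene manages internally. A sketch with illustrative sizes:

<query>
  <filterCache class="solr.FastLRUCache" size="512" initialSize="512" autowarmCount="128"/>
  <queryResultCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="64"/>
  <documentCache class="solr.LRUCache" size="1024" initialSize="1024" autowarmCount="0"/>
  <!-- no fieldCache element exists here: its size cannot be capped from configuration -->
</query>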

What is Field Cache?

The Field Cache is a crucial component of the Solr search engine that enables fast and efficient sorting, grouping, and faceting on fields. Solr uses the Field Cache to store the field values of all the filtered documents so that it can quickly access them when needed. This is necessary because iterating through all the documents and loading each document's fields at query time is slow and resource-intensive, resulting in multiple disk seeks.

Solr gradually un-inverts the data into the Field Cache at search time, creating an in-memory, column-oriented view of documents that makes these types of queries much faster. However, the Field Cache is not configurable and does not support auto-warming, which means that whenever the cache is purged (after syncing with the Master), aggregation query performance takes a hit.

From doc-level stored info:
{
    'DocX': {'A':1, 'B':2, 'C':3},
    'DocY': {'A':2, 'B':3, 'C':4},
    'DocZ': {'A':4, 'B':3, 'C':2}
}

To field-level cached info (the Field Cache):
{
    'A': {'DocX':1, 'DocY':2, 'DocZ':4},
    'B': {'DocX':2, 'DocY':3, 'DocZ':3}
}
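For illustration, the kind of request that forces this un-inversion is any sort, group, or facet on an indexed field without DocValues; a hypothetical example (core and field names are ours for illustration, not from the actual schema):

/solr/restaurants/select?q=biryani&sort=rating+desc&group=true&group.field=cuisine&facet=true&facet.field=cuisine

On the first such query after a cache purge, Solr has to rebuild the per-field mapping shown above before it can answer.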

Another issue with the Field Cache is that if there is no limit on the number of such fields, the cache can grow indefinitely over time, making it even more resource-intensive. The Dynamic Field construct in Solr theoretically makes the overall number of fields unlimited across all documents, which can be dangerous for applications that run aggregation queries on such fields.
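Dynamic Fields are declared once in the schema with a wildcard pattern, and every distinct concrete name that matches becomes a real field in the index. A minimal, hypothetical sketch (field and type names are illustrative):

<dynamicField name="*_rating" type="float" indexed="true" stored="true"/>

Here dish_rating, ambience_rating, delivery_rating, and any other name ending in _rating each become separate fields, so the total field count is effectively unbounded.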

Elasticsearch, on the other hand, guards against this "mapping explosion" by capping the total number of fields at 1,000 in its default settings (via index.mapping.total_fields.limit). Unfortunately, Dynamic Fields pushed the field count far beyond 1,000 in some of our use cases.

Possible fixes & nuances

In our setup, Slave instances pull data from the Master instance at a specific frequency. If there is no change in the index version on the Master, a Slave pulls nothing and its caches are not purged, since there is no data update. To guarantee that there is always an update on the Master machine, we started indexing a mock document at a regular interval shorter than the sync interval. This ensures that every Slave sees an updated index version and ends up purging its caches at regular intervals.
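A minimal sketch of such a mock update, posted to the Master's /update endpoint (with a commit) on a schedule; the id and field names here are hypothetical:

<add>
  <doc>
    <field name="id">mock_sync_doc</field>
    <field name="last_updated">2023-03-31T00:00:00Z</field>
  </doc>
</add>

Each committed write bumps the index version on the Master, so the next Slave poll triggers a sync.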

While this approach was a lifesaver, it ended up hurting response times and causing higher CPU usage, due to cache auto-warming for the other caches and the complete rebuilding of the Field Cache at query time. The constant purging saved the Slaves from OOM errors, but it drastically limited the throughput of a single Slave node: too many requests drive up memory usage, causing an OOM well before the Slave can even sync data from the Master again. Moreover, syncing too frequently would not only harm response times but also consume far more CPU and memory.

The Right Fix – DocValues

The main objective was to increase throughput while keeping memory usage to a minimum and eliminating the Out of Memory (OOM) errors. The biggest challenge was the growing Field Cache caused by sorting/grouping queries and the multitude of dynamic fields. The solution was to adopt DocValues, a way of storing field values that is more efficient for sorting and faceting than traditional indexing. DocValues use a column-oriented approach with a document-to-value mapping built at index time, which reduces memory requirements and makes lookups faster. Like other index files, DocValues are loaded into memory using MMapDirectory, which relies on the operating system rather than the JVM to page the relevant data into RAM, thereby avoiding OOM issues. Enabling DocValues is straightforward; simply set docValues="true" on the field definition:

<field name="field_name" type="string" indexed="false" stored="false" docValues="true" />
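The same attribute works on Dynamic Field declarations, which is what mattered for our aggregation-heavy fields; a hypothetical sketch consistent with the earlier example:

<dynamicField name="*_rating" type="float" indexed="false" stored="false" docValues="true"/>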

DocValues – Solved Problems

DocValues create the un-inverted data at index time. With un-inverted data on disk, queries are more efficient and faster, removing the overhead of un-inversion at runtime. This also eliminated the problem of the ever-growing Field Cache, making the system more scalable and resilient to increasing traffic. The results of this optimization have been remarkable, with a whopping 10x increase in throughput per Slave node. This was rigorously load tested in a real-world production scenario and worked flawlessly. The reduced cluster size also resulted in significant cost savings across the multiple Solr clusters used for various use cases, cutting costs by approximately 80%, close to ₹30L per month.

DocValues – Issues

There were some issues with DocValues when dealing with Dynamic Fields, particularly around data sparseness and slow indexing. Dynamic fields defined as DocValues can result in sluggish segment merges during indexing and optimization if they are present in only a limited number of documents. The indexing process became time-consuming because of this inefficient handling of the Dynamic Field explosion: the DocValues producers and consumers used iterators that traveled through every single document for each DocValues field, regardless of whether the field was present in one document or all of them. This was reported in improvement ticket LUCENE-7253 and fixed in the Lucene 7.0.0 major version, where the iterator only visits documents with a non-empty value for a given field. There were further issues specific to our use cases, for which we forked the Solr-Lucene repo, added fixes, and updated the unit test cases.

Based on these improvements, we decided to upgrade our clusters from v6.x to v7.6.0, and the results were fantastic. Indexing speed was at par, latency was reduced, cache hits improved, resiliency improved drastically, and the overall cost was lower than in the previous state. Further upgrades of some of our clusters to v8.7.0 have shown the same positive results, and the improvements continue to work flawlessly.

Afterword

When it comes to optimizing Solr performance, every use case is unique and requires experimenting with various configurations, including JVM and cache settings. Choosing the right schema for indexing data can also have a significant impact, as it helps improve query speed, reduce index size, and minimize memory usage at runtime. Another crucial aspect is tuning Solr's cache parameters, which can greatly improve search latencies and reduce memory requirements. Writing efficient queries matters too, for maximizing cache hits, but that falls outside the scope of this discussion. For faceting (aggregation) and sorting queries, DocValues are the best option, as they reduce the memory footprint and minimize the risk of memory leaks. Finally, it's important to note that Solr performs best with a relatively static index, so frequent syncing with the Master node should be avoided where the use case allows.

More to come

DocValues and the other fixes addressed the immediate issue, but the underlying problem of Mapping Explosion remained. In future articles, we will explore how we tackled Dynamic Field explosion in scenarios where it was being used inappropriately, and the reasons behind our migration to the SolrCloud architecture for some clusters.

This blog was written by Saurav Singh.

