Zomato blogs | April 12, 2023 | 8 min read
Explained: How Zomato Handles 100 Million Daily Search Queries! (Part Two)

In our previous blog post, we talked about the challenges we faced due to the suboptimal design of our schema, which relied heavily on dynamic fields. This, in turn, caused poor performance and high costs during peak times as we scaled up. While we implemented some quick fixes to address these issues, such as upgrading our Solr version and utilizing docValues for fields used in aggregation and sorting, our dependency on dynamic fields remained a bottleneck when expanding the schema for new use cases.

Continuing our journey towards a more robust and scalable data indexing system, in this blog post, we will dive deeper into the changes we made to our schema and query paradigm. These changes not only helped us achieve greater resiliency, scalability, and cost-effectiveness but also enhanced our overall performance, ensuring a seamless user experience. So, let’s explore how we tackled these challenges and improved our search system.

Dynamic Fields – The Proposition

Dynamic fields offer practical flexibility by enabling the addition of arbitrary fields to a document, tailored to specific use cases. With this feature, we can create fields customized to the unique requirements of our data without having to modify the schema. For instance, we may need to store popularity values for the different delivery areas of a restaurant. To achieve this, we could define a dynamic field such as popularity_* and, during indexing, enrich each restaurant document with popularity values for its various delivery locations.

An example of a Restaurant with its popularity values at different delivery locations:

{
    'id': 'res1',
    'entity': 'restaurant',
    'name': 'Dynamic Pizza Store',
    'location': '89.998900, 28.8389300',
    // popularity score for different delivery areas
    'popularity_area1': 0.93,
    'popularity_area2': 0.25,
    'popularity_areaN': 0.71
}

In the case of Zomato, where a single restaurant can deliver to multiple locations, the number of unique popularity_{area_id} fields can quickly accumulate, especially when considering all the restaurants in India. This scenario is just one of many that can cause a substantial increase in the total number of unique fields. Furthermore, the system comprises various document types, each with its own set of dynamic fields, which exacerbates the issue and results in a field mapping explosion. This situation can lead to out-of-memory errors and make recovery difficult. The high field count also adversely affects full indexing speed. Using dynamic fields for unbounded properties, conventional as the pattern may be, is not a scalable or sustainable solution in the long run.

Even an experiment to revisit the restaurant popularity score, which needs to be A/B tested in production, would introduce another dynamic field like popularity_v2_* or extend the existing one; either approach increases the unique field count by hundreds of thousands.
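
A toy illustration of the growth (the counts below are made up, not Zomato's actual numbers): every distinct area id seen anywhere in the corpus adds a brand-new field name that the index mapping must track.

```python
# Count the distinct field names the index must track across a corpus of docs.
# Restaurant and area counts here are hypothetical, purely for illustration.
def mapping_field_count(docs: list[dict]) -> int:
    """Return the number of unique field names across all documents."""
    names: set[str] = set()
    for doc in docs:
        names.update(doc.keys())
    return len(names)

# 1,000 restaurants, each delivering to its own 10 areas: every restaurant
# contributes 10 popularity_* field names nobody else uses.
docs = [
    {"id": f"res{r}", "entity": "restaurant",
     **{f"popularity_area{r * 10 + a}": 0.5 for a in range(10)}}
    for r in range(1000)
]

print(mapping_field_count(docs))  # 10002: "id" + "entity" + 10,000 popularity fields
```

An A/B variant like popularity_v2_* would double that popularity field count again, which is exactly the explosion described above.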

Use Case & Alternative Model

Filtering, sorting, or aggregating documents such as restaurants or dishes on fields like popularity, location, rating, and cost is a common use case. While these fields are mostly global, they are not always sufficient. Zomato, being a heavily hyperlocal business, requires more granular values for entities to stay relevant to specific use cases. For example, a dish like Chole Bhature may be more popular in Delhi than in Kolkata, and the popularity of or demand for a dish may vary significantly even within a city: Sushi may be more popular in the Golf Course area of Gurugram than in the lanes of Old Gurugram. In such cases, the global properties of an entity are not enough, and more granular values are needed to find the most relevant results.

Having granular, localized information available during indexing is therefore crucial for enhancing search relevance and speed. For instance, if a restaurant has popularity scores for its various delivery locations, incorporating that information at index time would be beneficial. Although dynamic fields appear to be the logical choice for indexing such information alongside other properties, the unbounded, high-cardinality nature of these properties can lead to a mapping explosion.

An alternative model for the above example could be as follows:

{
    'id': 'res1',
    'entity': 'restaurant',
    'name': 'Dynamic Pizza Store',
    'location': '89.998900, 28.8389300',
    // popularity score as map
    'popularity': {
        'area1': 0.93,
        'area2': 0.25,
        'areaN': 0.71
    }
}

The above structure assumes that popularity can be indexed as a Map data type, with the map holding the popularity value for each area. This approach is more concise and intuitive and avoids the dynamic field popularity_*. However, Solr does not support such a field type out of the box. Although we could define a custom field type or store serialized map data, that would not let us efficiently filter, sort, or aggregate on such granular properties: we could not use data indexed in this format to filter restaurants serving in areaX and order them by their popularity in that area. This would have been possible if Solr magically supported a query pattern like filtering on popularity.areaX != 0 and sorting by popularity.areaX; unfortunately, that's not the case.

Nested Documents Rationale

Even though Solr cannot index and query a field as a Map, it does support nested documents and facilitates their efficient indexing and retrieval. From the Solr documentation:

Nested documents in Solr can be used to bind a blog post (parent document) with comments (child documents) — or as a way to model major product lines as parent documents, with multiple types of child documents representing individual SKUs (with unique sizes/colours) and supporting documentation (either directly nested under the products or under individual SKUs).

In the nested document model, the above example can be represented as follows:

{
    'id': 'res1',
    'entity': 'restaurant',
    'name': 'Dynamic Pizza Store',
    'location': '89.998900, 28.8389300',
    // popularity score as the list of nested objects
    'popularity': [
        {
            'id': 'res1_area1',
            'entity': 'popularity',
            'value': 0.93
        },
        {
            'id': 'res1_area2',
            'entity': 'popularity',
            'value': 0.25
        },
        {
            'id': 'res1_areaN',
            'entity': 'popularity',
            'value': 0.71
        }
    ]
}
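
As a sketch (illustrative helper code, not Zomato's actual pipeline), the earlier dynamic-field document can be rewritten into this nested shape:

```python
# Rewrite popularity_{area} dynamic fields into a list of child documents,
# matching the nested-document shape shown above.
def to_nested(doc: dict, prefix: str = "popularity_") -> dict:
    """Move every `popularity_{area}` field into a child document list."""
    out = {k: v for k, v in doc.items() if not k.startswith(prefix)}
    out["popularity"] = [
        {"id": f"{doc['id']}_{k[len(prefix):]}",  # e.g. res1_area1
         "entity": "popularity",
         "value": v}
        for k, v in doc.items() if k.startswith(prefix)
    ]
    return out

flat = {
    "id": "res1", "entity": "restaurant", "name": "Dynamic Pizza Store",
    "popularity_area1": 0.93, "popularity_area2": 0.25,
}
nested = to_nested(flat)
print(nested["popularity"][0])  # {'id': 'res1_area1', 'entity': 'popularity', 'value': 0.93}
```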

Further research revealed that the nested document model is NOT a construct at the Lucene level; Solr abstracts it on top of the flat document structure that Lucene recognizes. Solr uses the following fields to maintain relationships between document entities:

  • _root_ : automatically populated in all the docs of a hierarchy (root, parent, and child), contains the ID of the root document
  • _nest_path_ : automatically populated in non-root docs, contains the path of the document within the hierarchy
  • _nest_parent_ : automatically populated in docs that have a parent, contains the ID of the immediate parent

To ensure that the fields mentioned above are populated, the schema must define them in a specific manner; this is not optional:

<field name="_root_" stored="false" type="string" indexed="true"/>
<fieldType name="_nest_path_" class="solr.NestPathField" />
<field name="_nest_path_" type="_nest_path_"/>
<field name="_nest_parent_" stored="true" indexed="true" type="string"/>

By including the fields mentioned above, Solr can determine the relationships between documents, which allows a group of documents to be indexed, updated, or deleted together as a hierarchy. For instance, it can index an entire Restaurant document with all of its Dish and area Popularity documents at once. Indexing the relationships between documents typically leads to faster queries than performing an equivalent join at query time, because the relationships are already stored in the index and need not be computed. As a result, Solr can respond to queries more quickly. Since nested documents are an abstraction in Solr, all child documents are ultimately standalone documents, which has some nuances for the schema definition:

  • A field can be configured in only one way, no matter which document types use it
  • It may be infeasible to mark fields as required, since not all document types will have all the fields
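
In practice, indexing a hierarchy together means sending the parent with its children inline in a single update request. A minimal sketch, assuming a hypothetical local Solr core named search_core and the standard JSON /update endpoint (error handling omitted):

```python
import json
import urllib.request

def build_update_payload(parent_doc: dict) -> bytes:
    """Serialize a parent doc (children nested inline) for Solr's JSON /update body."""
    return json.dumps([parent_doc]).encode("utf-8")

def index_hierarchy(update_url: str, parent_doc: dict) -> None:
    """POST the whole block in one request so Solr indexes it as one hierarchy."""
    req = urllib.request.Request(
        update_url + "?commit=true",
        data=build_update_payload(parent_doc),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)

restaurant = {
    "id": "res1",
    "entity": "restaurant",
    "name": "Dynamic Pizza Store",
    "popularity": [
        {"id": "res1_area1", "entity": "popularity", "value": 0.93},
    ],
}
# index_hierarchy("http://localhost:8983/solr/search_core/update", restaurant)
```

In this model, an update to any child typically means re-sending the whole parent block, which is what keeps the index-time relationships consistent.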

What all changed?

The move to a nested structure splits a single document (like a Restaurant) into multiple documents (the Restaurant plus its Popularity and Dish children), forming a hierarchical relationship that allows the entire hierarchy to be accessed at once. Many small documents replace the dynamic fields, resolving the mapping explosion. While this approach leads to a significant increase in the number of individual Lucene-level documents, we have not encountered any problems with this tradeoff. Although there is a slight increase in index size, indexing speed has improved significantly in the absence of the mapping explosion. To query this new structure efficiently, we migrated our queries to the Block Join Query Parser (BJQ), which searches the nested structure using the indexed relationships.

Block Join Query Paradigm

To ensure efficient searching within the nested document construct, queries must incorporate the relationship between the parent and child documents. By utilizing these relationships, various access patterns can be established, of which there are two primary ones.

  • Child Query – A query that matches some parent documents and returns the children of those documents
q={!child of=<blockMask>}<someParents>
  • Parent Query – A query that matches child documents and returns their parents
q={!parent which=<blockMask>}<someChildren>
  • blockMask here identifies the set of documents that should be treated as parents

To summarize, Block Join Queries (BJQ) can be used to query nested documents by filtering either the parent documents based on the attributes of their children or the child documents based on the attributes of their parents. In the specific use case of a Restaurant with Dishes and Popularity as children, BJQ can be used to find Restaurants (parents) that serve a certain Dish (child) using the Parent Query Parser. It can also be used to find all the Dishes being served by Restaurants in a particular area by filtering Restaurants with a popularity score in that area. The possibilities for using BJQ are immense and can be leveraged for a variety of use cases. More information can be found in the official Solr BJQ documentation.
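As an illustrative sketch (the helpers and the entity:dish filter below are assumptions based on the example documents, not Zomato's production queries), the two access patterns can be assembled like this:

```python
# Assumption: root docs are the only ones carrying entity:restaurant, so that
# field:value pair can serve as the block mask in both query parsers.
BLOCK_MASK = "entity:restaurant"

def parent_query(child_filter: str, mask: str = BLOCK_MASK) -> str:
    """Match child docs, return their parents (Parent Query Parser)."""
    return f"{{!parent which={mask}}}{child_filter}"

def child_query(parent_filter: str, mask: str = BLOCK_MASK) -> str:
    """Match parent docs, return their children (Child Query Parser)."""
    return f"{{!child of={mask}}}{parent_filter}"

# Restaurants (parents) that serve a given dish (child document):
q = parent_query('entity:dish AND name:"Chole Bhature"')
print(q)  # {!parent which=entity:restaurant}entity:dish AND name:"Chole Bhature"
```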

End Note

We faced limitations in adding more dynamic fields, which led to delays in implementing some business use cases. To overcome these limitations, we changed the schema paradigm and migrated most of our use cases to the Nested Document model. We also changed our queries to efficiently query the new structure using BJQ. While the Nested Document model does increase the index size slightly, we experienced faster indexing speeds due to the reduced unique field counts. Overall, these changes were successful in addressing our challenges with dynamic fields.

Way Ahead

The migration to a nested structure effectively resolved the immediate issue and ensured future compatibility with various use cases. However, as the number of restaurants on the platform continues to grow and more delivery locations are added, there is an increasing number of child documents associated with each parent. While this is not a problem from a query perspective, as the relationship between parents and children is established during index time, the size of the index has grown over time. As a result, maintaining the entire index on a single machine in a Master-Slave setup is no longer a viable option. 

In the next blog post, the last in this series, we will delve into our decision to transition to a Solr Cloud setup and the performance and production issues associated with it.

Keep coming back for more!

This blog was written by Saurav Singh.


