Data Science Team | August 25, 2022 | 6 min read

How do we make our Search more conversational and inclusive?

Generally, domain-specific search systems are optimised for single-intent queries. For instance, on an e-commerce site selling electronics, queries typically pertain to items like mobiles, chargers, and power banks. On an apparel site, they would be about dresses, shirts, pants, and so on.

For us at Zomato, a query usually pertains to a dish, a restaurant, or a cuisine, such as ‘a burger’, ‘Restaurant XYZ’, or ‘Chinese’. However, at times, customers search for more than one item together or type more complex queries, such as ‘burger from XYZ restaurant’ or ‘pizza under INR 200’. In cases like these, a search engine has to do more than just index entities.

There is a lot of coding and decoding involved in helping our customers find the right thing in the blink of an eye. To make that possible, our techies have to take a step back and put themselves in the shoes of the customers to understand how they behave – why they type what they type, what they look for, and what they expect the results to be.

A typical search engine works in two steps –

The first step is retrieval (candidate set generation), i.e., finding relevant entities for a given query with maximum recall.
The second step is ranking the retrieved entities with high precision, based on factors like entity popularity and the affinity between the query keywords and each entity.
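The two steps above can be sketched roughly as follows. This is only an illustration and not our production code: the toy catalogue, the `popularity` values, the token-overlap retrieval, the Jaccard-based affinity, and the 0.7/0.3 weights are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass
class Entity:
    name: str
    kind: str          # "dish", "restaurant" or "cuisine"
    popularity: float  # hypothetical score in [0, 1]


# Hypothetical toy catalogue; names and scores are made up.
CATALOGUE = [
    Entity("Veg Burger", "dish", 0.9),
    Entity("Burger Point", "restaurant", 0.6),
    Entity("Chinese", "cuisine", 0.8),
    Entity("Cheese Pizza", "dish", 0.7),
]


def tokens(text: str) -> Set[str]:
    return set(text.lower().split())


def retrieve(query: str, catalogue: List[Entity]) -> List[Entity]:
    """Step 1 (recall): keep every entity that shares a token with the query."""
    q = tokens(query)
    return [e for e in catalogue if q & tokens(e.name)]


def rank(query: str, candidates: List[Entity]) -> List[Entity]:
    """Step 2 (precision): blend keyword affinity with entity popularity."""
    q = tokens(query)

    def score(e: Entity) -> float:
        name = tokens(e.name)
        affinity = len(q & name) / len(q | name)    # Jaccard overlap
        return 0.7 * affinity + 0.3 * e.popularity  # weights are assumptions

    return sorted(candidates, key=score, reverse=True)


results = rank("burger", retrieve("burger", CATALOGUE))
print([e.name for e in results])
```

Keeping the two functions separate mirrors the split in goals: retrieval can afford to be generous, because ranking decides what actually appears at the top.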

Our current retrieval system depends heavily on lexical matches, among other factors. If a customer enters a query with multiple intents, our system still ranks entities based on the lexical match. Below is an example of such a long-tail query –

With the introduction of voice search, many customers prefer speaking longer queries rather than typing them. Recently, we also introduced voice search on our platform and observed that a significant number of searches in many emerging cities (Tier-2 and Tier-3 cities in India) are being made using voice assistants.

Here are a few examples of natural language queries made using voice search –

Garlic Bread with cheese dip
XYZ Bakery chocolate cake
Anda curry and roti
Koi Achcha Sa Sabji Batao
Non-veg restaurant in adarsh nagar
Pizza outlets near me
Chai and samosa
Burger 150rs wala

Unlike single-intent queries, long natural language searches require a deeper understanding of the items in the query before we can show relevant results. This is where our Data Science team comes into the picture.

Before we dive further to understand the science behind the data, let’s understand what natural language search queries actually mean.

Understanding Natural Language Search

Natural language search allows customers to speak or type into a device using their everyday language rather than keywords and even use complete sentences or phrases in their native language. The computer then transforms these queries into something it can understand before showing results on the screen.

Optimising Natural Language Search

We start with understanding the customer’s intent and the entities involved in the query.

For example, consider the query ‘XYZ outlet near me’

Here, the intent is ‘nearest outlet’
Entity – ‘XYZ’

To better gauge the intent, we segment search queries into one of three categories, as below –

Dish + Dish search – ‘Chai and Samosa’
Restaurant + Dish Search – ‘XYZ ka Burger’
Restaurant/Dish + near me/ best/ irrelevant text – ‘Pizza outlets near me’

Making a new product flow

As our old search system supported only single-intent queries, we added a new flow to handle multiple intents, or a single intent with some specifications.

Here’s what the overall architecture looks like –

Our new process involves two steps –

For every query made, our model identifies the ‘search query’ and ‘city/ location’
Based on this information, it returns the output, which can be one of the three categories –
1. Multiple Dish Search – Queries where customers search for multiple dishes, possibly with minor spelling variations. For example,
  1. ‘Chai and samosa’
  2. ‘Thali lassi k sath’
  3. ‘Burger with cold drink’

2. Dish and Restaurant Combined Search – Queries where customers search for a dish or multiple dishes from a specific restaurant. For example,
  1. ‘XYZ ka burger’
  2. ‘Samosa chai palace’
  3. ‘Paneer pizza pizzeria place’

3. Restaurant or Dish Search with specifications or irrelevant keywords – Queries where customers search for several pieces of information at once. In this case, we show all the related entities (dish and/ or restaurant) at the top.

For instance, if a customer searches for ‘best joe’s pizza outlet near me’, we show all results concerning pizza and joe’s (restaurants with the same or similar name).
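As a rough illustration of the three categories, the sketch below routes a query once its words have already been tagged. Everything in it is hypothetical: the `DISH`/`RES`/`O` labels, the `CONNECTORS` list, and the rule order are assumptions made for the example, and in practice the tags would come from a trained model rather than being written by hand.

```python
from enum import Enum
from typing import List, Optional, Tuple


class Category(Enum):
    MULTI_DISH = "multiple dish search"
    DISH_AND_RESTAURANT = "dish and restaurant combined search"
    WITH_SPECIFICATIONS = "restaurant or dish search with specifications"


# Hypothetical joining words that link entities without adding a specification.
CONNECTORS = {"and", "with", "ka", "k", "ke", "sath"}


def categorise(spans: List[Tuple[str, str]]) -> Optional[Category]:
    """spans: (text, label) pairs, with label in {"DISH", "RES", "O"}."""
    dishes = [text for text, label in spans if label == "DISH"]
    restaurants = [text for text, label in spans if label == "RES"]
    extras = [
        text for text, label in spans
        if label == "O" and text.lower() not in CONNECTORS
    ]
    if extras and (dishes or restaurants):
        return Category.WITH_SPECIFICATIONS
    if dishes and restaurants:
        return Category.DISH_AND_RESTAURANT
    if len(dishes) > 1:
        return Category.MULTI_DISH
    return None  # assumption: left to the existing single-intent search


print(categorise([("chai", "DISH"), ("and", "O"), ("samosa", "DISH")]))
print(categorise([("XYZ", "RES"), ("ka", "O"), ("burger", "DISH")]))
print(categorise([("best", "O"), ("joe's", "RES"), ("pizza", "DISH"),
                  ("outlet", "O"), ("near me", "O")]))
```

Checking for extra words first matters here: ‘best joe’s pizza outlet near me’ contains both a restaurant and a dish, but the surrounding words are what place it in the third category.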

Understanding challenges

Unavailability of labelled data
Queries involving more than one language
- Sabse achha pizza
- Makni dal k sath naan
Usage of words with multiple meanings
- chirkutlal chole bhature – (“chirkutlal chole bhature” – Restaurant)
- chirkutlal k chole bhature – (“chirkutlal” – Restaurant, “chole bhature” – Dish)
Spelling variations/ abbreviations/ aliases
- Rajma rice or Rajma chawal, roomali roti or rumali roti (here, both are valid dishes)

Our solution – the quest to find an architecture that works the best

Embedding-based approaches are commonly used in natural language processing as they capture the semantic relationships among words. Word2vec by Google² is widely used in various downstream Natural Language Processing tasks. For text representation, we used word2vec embeddings, which we trained from scratch on our own data for domain adaptation and to reduce noise.

But the major drawback of word2vec is that it generates embeddings only for words that are in its vocabulary, and keeping a vocabulary of all possible words/ phrases is not feasible. So, along with word2vec, we used Byte Pair Encoding (BPE)³ embeddings. BPE is a subword tokenisation method: it splits words into smaller units learned from how frequently character sequences occur together, so even an unseen word can be represented through its subwords.
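To make the subword idea concrete, here is a minimal sketch of the generic BPE merge-learning procedure in plain Python. It is not our trained tokeniser: the word counts in `QUERY_WORDS`, the `</w>` end-of-word marker, and the number of merges are assumptions chosen for illustration, and the embeddings that would sit on top of these subwords are left out.

```python
from collections import Counter
from typing import Dict, List, Tuple

Pair = Tuple[str, str]


def apply_merge(symbols: List[str], pair: Pair) -> List[str]:
    """Replace every adjacent occurrence of `pair` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out


def learn_bpe(word_counts: Dict[str, int], num_merges: int) -> List[Pair]:
    """Learn merges by repeatedly fusing the most frequent adjacent pair."""
    # Each word starts as single characters plus an end-of-word marker.
    vocab = {tuple(word) + ("</w>",): n for word, n in word_counts.items()}
    merges: List[Pair] = []
    for _ in range(num_merges):
        pairs: Counter = Counter()
        for symbols, n in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += n
        if not pairs:
            break  # every word is already a single symbol
        best = max(pairs, key=pairs.get)
        vocab = {tuple(apply_merge(list(s), best)): n for s, n in vocab.items()}
        merges.append(best)
    return merges


def segment(word: str, merges: List[Pair]) -> List[str]:
    """Split a (possibly unseen) word by replaying the learned merges in order."""
    symbols = list(word) + ["</w>"]
    for pair in merges:
        symbols = apply_merge(symbols, pair)
    return symbols


# Hypothetical word frequencies from search queries.
QUERY_WORDS = {"roti": 4, "rumali": 3, "roomali": 2}
merges = learn_bpe(QUERY_WORDS, num_merges=10)
print(merges)
print(segment("rotis", merges))  # an unseen word still breaks into known pieces
```

Because segmentation always falls back to single characters, a word that never appeared in the training data still gets a representation, which is the gap a fixed word-level vocabulary leaves open.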

Once we have our text (customer query) representation, the order of words plays a vital role in named entity recognition. Both deep learning-based models such as Recurrent Neural Networks (RNN), Long Short-Term Memory (LSTM) networks, and Gated Recurrent Units (GRU), and statistical models such as Hidden Markov Models (HMM) and Conditional Random Fields (CRF), are widely used, as both families are designed for processing sequential data. We used a combination of the two, specifically a bi-directional LSTM with a CRF, inspired by the research paper Bidirectional LSTM-CRF Models for Sequence Tagging⁴.
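The sketch below illustrates what the CRF layer contributes on top of per-word scores, using Viterbi decoding in plain Python. It is a toy under stated assumptions: the BIO tag set, the emission scores (standing in for what a bi-directional LSTM would output), and the fixed penalty for invalid tag transitions (standing in for learned CRF transition scores) are all invented for the example.

```python
from typing import Dict, List, Tuple

# Hypothetical BIO tag set for dishes and restaurants.
TAGS = ["O", "B-DISH", "I-DISH", "B-RES", "I-RES"]


def build_transitions(tags: List[str], penalty: float = -10.0) -> Dict[Tuple[str, str], float]:
    """Penalise I-X unless the previous tag is B-X or I-X (stand-in for learned scores)."""
    return {
        (prev, cur): penalty
        for prev in tags for cur in tags
        if cur.startswith("I-") and prev[2:] != cur[2:]
    }


def viterbi(emissions: List[Dict[str, float]],
            transitions: Dict[Tuple[str, str], float],
            tags: List[str]) -> List[str]:
    """Return the highest-scoring tag sequence under emission + transition scores."""
    best = [{tag: emissions[0][tag] for tag in tags}]
    back = []
    for em in emissions[1:]:
        scores, pointers = {}, {}
        for cur in tags:
            prev = max(tags, key=lambda p: best[-1][p] + transitions.get((p, cur), 0.0))
            scores[cur] = best[-1][prev] + transitions.get((prev, cur), 0.0) + em[cur]
            pointers[cur] = prev
        best.append(scores)
        back.append(pointers)
    path = [max(tags, key=lambda t: best[-1][t])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


words = ["xyz", "ka", "chole", "bhature"]
# Made-up per-word scores; a real model would produce these.
emissions = [
    {"O": 0.1, "B-DISH": 0.3, "I-DISH": 0.0, "B-RES": 2.0, "I-RES": 0.0},
    {"O": 1.5, "B-DISH": 0.1, "I-DISH": 0.2, "B-RES": 0.1, "I-RES": 0.4},
    {"O": 0.1, "B-DISH": 1.8, "I-DISH": 0.3, "B-RES": 0.6, "I-RES": 0.2},
    {"O": 0.1, "B-DISH": 0.4, "I-DISH": 1.0, "B-RES": 0.2, "I-RES": 1.2},
]

greedy = [max(TAGS, key=em.get) for em in emissions]
path = viterbi(emissions, build_transitions(TAGS), TAGS)
print(list(zip(words, greedy)))  # picks each tag independently
print(list(zip(words, path)))    # picks the best whole sequence
```

In this toy, choosing the best tag for each word on its own labels ‘bhature’ as the continuation of a restaurant name even though no restaurant name precedes it; decoding the sequence as a whole keeps ‘chole bhature’ together as one dish.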

Our architecture is computationally efficient, fast (even on CPU), and easy to scale for a large customer base.

Evaluation of the new model

We took the first step in improving the customer experience and expanding our search capability beyond single-intent queries. Currently, the model is deployed for A/B testing and has shown an uplift in click-through rate.

What’s next?

Keeping in mind the limited resources and the latency constraints involved in scaling city-wise models across India, we plan to explore self-attention-based transformer models in our successive iterations.

Our priority is to optimise search queries based on customers’ needs. Currently, our Natural Language Search (NLS) model is deployed for limited purposes, and we plan to extend it to the following use cases:

Multiple sub-categories and dish attributes (ingredients/ style of cooking)
- ‘Mawa wali gujiya’
- ‘Spicy dal’
Search with a specific restaurant’s address/ location
- ‘Ma petite bakery, Galleria’
Handling price-sensitive queries
- ‘hundred ke andar pizza’
- ‘cake 200 ke range mein’
- ‘Burger 150rs wala’
- ‘Sabse sasta pizza’

That’s all, folks!

This is a Data Science article to share how we built the Natural Language Search at Zomato. If you are interested in solving similar problems which push you to solve seemingly impossible tasks, connect with Manav Gupta on LinkedIn. We’re always looking for Data Scientists (aka Chief Statistics Officers) at Zomato.

This blog was written by Sonal Garg.

–x–

Sources:

1. What is natural language search?, algolia.com
2. Word2vec, wikipedia.com
3. Byte pair encoding, wikipedia.com
4. Bidirectional LSTM-CRF Models for Sequence Tagging, arxiv.org

–x–

All content provided in this blog is for informational and educational purposes only. It is not professional advice, and should be treated as such. The writer of this blog makes no representations as to the accuracy or completeness of any content or information contained here or found by following any link on this blog.

All images/ videos are designed in-house.