Generally, domain-specific search systems are optimised for single-intent queries. For instance, at an e-commerce site selling electronics, queries typically pertain to items like mobiles, chargers, and power banks. At an apparel site, they revolve around dresses, shirts, pants, and so on.
For us at Zomato, a query usually pertains to a dish, a restaurant, or a cuisine: searches like ‘burger’, ‘Restaurant XYZ’, or ‘Chinese’. At times, however, customers search for more than one item together, or submit more complex queries such as ‘burger from XYZ restaurant’ or ‘pizza under INR 200’. In cases like these, a search engine has to do more than just index entities.
There is a lot of encoding and decoding involved in helping our customers find the right thing in the blink of an eye. To make that possible, our engineers have to take a step back and put themselves in the customers’ shoes to understand how they behave: why they type what they type, what they look for, and what they expect the results to be.
Any typical search engine works in a two-step process –
Our current retrieval system depends heavily on lexical matches, among other factors. If a customer searches a query with multiple intents, our system still ranks entities based on the lexical match. Below is an example of such a long-tail query –
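To make the limitation concrete, here is a minimal sketch of lexical (token-overlap) matching. The catalogue, entity names, and scoring function are all made up for illustration and are not Zomato's actual retrieval logic; the point is only that a multi-intent query spreads its tokens across entities, so no single entity matches well.

```python
# Illustrative sketch only: a toy lexical matcher. The catalogue and the
# scoring function are hypothetical, not the production system.

def lexical_score(query: str, entity: str) -> float:
    """Score an entity by the fraction of query tokens it contains."""
    q_tokens = set(query.lower().split())
    e_tokens = set(entity.lower().split())
    if not q_tokens:
        return 0.0
    return len(q_tokens & e_tokens) / len(q_tokens)

catalogue = ["Margherita Pizza", "Veg Burger", "Chinese Wok", "Pizza Corner"]

# A single-intent query matches cleanly...
print(sorted(catalogue, key=lambda e: -lexical_score("pizza", e))[:2])

# ...but a multi-intent query dilutes its tokens across several entities,
# so no single entity scores highly on pure lexical overlap.
for e in catalogue:
    print(e, round(lexical_score("burger from pizza corner", e), 2))
```

This is why a purely lexical ranker struggles once a query mixes a dish intent with a restaurant intent.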
With the introduction of voice search, many customers prefer speaking longer queries rather than typing them. We recently introduced voice search on our platform and observed that a significant share of searches in many emerging cities (Tier-2 and Tier-3 cities in India) are made using voice assistants.
Here are a few examples of natural language queries made using voice search –
Unlike single-intent queries, long natural language searches demand a deeper understanding of the query before relevant results can be shown. This is where our Data Science team comes into the picture.
Before we dive further to understand the science behind the data, let’s understand what natural language search queries actually mean.
Natural language search allows customers to speak or type into a device using their everyday language rather than keywords, even using complete sentences or phrases in their native language. The computer then transforms these queries into something it can understand before showing results on the screen.
We start with understanding the customer’s intent and the entities involved in the query.
For example, consider the query ‘XYZ outlet near me’
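As a purely hypothetical illustration of what such a decomposition might look like, the query can be tagged token by token in BIO format and the tags grouped into entity spans. The tag names (`RESTAURANT`, `LOCATION`) are assumptions for this sketch, not Zomato's actual label set.

```python
# Hypothetical illustration of intent/entity decomposition for the query
# 'XYZ outlet near me'. The tag names are assumptions, not the real label set.

query = "XYZ outlet near me"

# A sequence tagger would emit one BIO tag per token, e.g.:
tags = ["B-RESTAURANT", "I-RESTAURANT", "B-LOCATION", "I-LOCATION"]

def group_entities(tokens, tags):
    """Collect contiguous B-/I- spans into (entity_type, text) pairs."""
    entities, current, label = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                entities.append((label, " ".join(current)))
            current, label = [tok], tag[2:]
        elif tag.startswith("I-") and current:
            current.append(tok)
        else:  # 'O' or malformed tag: close any open span
            if current:
                entities.append((label, " ".join(current)))
            current, label = [], None
    if current:
        entities.append((label, " ".join(current)))
    return entities

print(group_entities(query.split(), tags))
# [('RESTAURANT', 'XYZ outlet'), ('LOCATION', 'near me')]
```

With the entities grouped this way, the downstream retrieval step knows it is looking for a specific restaurant near the customer, rather than treating the four tokens as one flat keyword bag.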
To better gauge the intent, we segment search queries into one of three functionalities, as below –
As our old search system supported only single-intent queries, we added a new flow to incorporate multiple intents, or a single intent with additional specifications.
Here’s what the overall architecture looks like –
Our new process involves two steps –
For instance, if a customer searches for ‘best joe’s pizza outlet near me’, we show all results concerning pizza and joe’s (restaurants with the same or similar name).
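The combination step above can be sketched as follows. The tiny catalogue and the `retrieve` helper are made up for illustration; they only show the idea of retrieving against each detected intent (a dish and a restaurant name) instead of the raw query string.

```python
# Hypothetical sketch: once a multi-intent query has been split into its
# parts, each part can be retrieved separately and the results combined.
# The catalogue below is invented for illustration.

restaurants = [
    {"name": "Joe's Pizza", "dishes": ["pizza", "garlic bread"]},
    {"name": "Joe's Diner", "dishes": ["burger", "fries"]},
    {"name": "Pizza Hub",   "dishes": ["pizza"]},
]

def retrieve(dish=None, restaurant=None):
    """Return restaurant names matching a dish intent, a name intent, or both."""
    hits = []
    for r in restaurants:
        dish_ok = dish is None or dish in r["dishes"]
        name_ok = restaurant is None or restaurant.lower() in r["name"].lower()
        if dish_ok and name_ok:
            hits.append(r["name"])
    return hits

# "best joe's pizza outlet near me" -> dish intent 'pizza' + name intent "joe's"
print(retrieve(dish="pizza", restaurant="joe's"))  # restaurants matching both
print(retrieve(dish="pizza"))                      # all pizza results
```

Results for each intent can then be merged and ranked, so the customer sees both the exact restaurant match and other relevant pizza outlets.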
Embedding-based approaches are commonly used in natural language processing because they capture semantic relationships among words. Word2vec, introduced by Google2, is widely used in various downstream NLP tasks. For text representation, we used word2vec embeddings, trained from scratch on our own data for domain adaptation and to reduce noise.
But the major drawback of word2vec is that it generates embeddings only for words present in its vocabulary, and maintaining a vocabulary of every possible word or phrase is not feasible. So, along with word2vec, we used Byte Pair Encoding (BPE)3 embeddings. BPE is a subword tokenisation method that splits words into smaller units based on their frequency of occurrence.
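The core of BPE is a simple merge loop: starting from characters, the most frequent adjacent pair of symbols is repeatedly merged into a new subword unit. Below is a minimal sketch of that loop (after the classic algorithm); the toy vocabulary is invented, and this is not the pretrained BPE embedding table itself.

```python
# Minimal sketch of the core BPE merge loop. The toy vocabulary is
# illustrative only; real tokenisers also handle word boundaries, etc.
from collections import Counter

def get_pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Replace every occurrence of the pair with its merged symbol."""
    old = " ".join(pair)
    new = "".join(pair)
    return {word.replace(old, new): freq for word, freq in vocab.items()}

# Words as space-separated characters, with corpus frequencies.
vocab = {"p i z z a": 10, "p i t a": 4, "p a s t a": 3}

for _ in range(3):
    pairs = get_pair_counts(vocab)
    best = max(pairs, key=pairs.get)
    vocab = merge_pair(best, vocab)
    print(best, "->", "".join(best))

print(list(vocab))
```

Because frequent fragments become reusable units, an unseen word can still be tokenised into known subwords, which is exactly how BPE sidesteps word2vec's out-of-vocabulary problem.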
Once we have our text (customer query) representation, sequentiality plays a vital role in named entity recognition. Both deep learning-based models like Recurrent Neural Networks (RNN), Long Short-Term Memory (LSTM), and Gated Recurrent Units (GRU), and statistical models such as Hidden Markov Models (HMM) and Conditional Random Fields (CRF), are widely used, and all of them are specialised for processing sequential data. We used a combination of the two families, more specifically a bi-directional LSTM with a CRF layer, inspired by this research paper on Bidirectional LSTM-CRF Models for Sequence Tagging4.
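At inference time, the CRF layer on top of the BiLSTM picks the highest-scoring tag sequence with the Viterbi algorithm, trading off the BiLSTM's per-token scores against learned tag-transition scores. Here is a minimal, pure-Python sketch of that decoding step; the tag set, emission scores, and transition scores are all made-up numbers for illustration.

```python
# Minimal Viterbi decoder: the inference step a CRF layer performs on top
# of the BiLSTM's per-token tag scores. All numbers below are invented.

def viterbi(emissions, transitions, tags):
    """Return the highest-scoring tag sequence.

    emissions[t][tag]      : score the BiLSTM assigns to `tag` at token t
    transitions[prev][tag] : score of moving from tag `prev` to `tag`
    """
    # best[tag] = (score of best path ending in tag, that path)
    best = {tag: (emissions[0][tag], [tag]) for tag in tags}
    for t in range(1, len(emissions)):
        new_best = {}
        for tag in tags:
            score, path = max(
                (best[prev][0] + transitions[prev][tag] + emissions[t][tag],
                 best[prev][1] + [tag])
                for prev in tags
            )
            new_best[tag] = (score, path)
        best = new_best
    return max(best.values())[1]

tags = ["DISH", "REST"]
# Per-token scores for the tokens of "pizza from joe's" (hypothetical).
emissions = [
    {"DISH": 2.0, "REST": 0.1},   # pizza
    {"DISH": 0.5, "REST": 0.4},   # from
    {"DISH": 0.2, "REST": 1.8},   # joe's
]
# Transition scores, e.g. penalising a REST -> DISH flip mid-query.
transitions = {
    "DISH": {"DISH": 0.5, "REST": 0.0},
    "REST": {"DISH": -1.0, "REST": 0.5},
}

print(viterbi(emissions, transitions, tags))
```

The transition scores are what the CRF adds over a plain softmax: they let the model penalise implausible tag sequences globally, instead of choosing each token's tag independently.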
Our architecture is computationally light, fast (even on CPU), and easy to scale to a large customer base.
We took the first step in improving the customer experience and expanding our search capability beyond single-intent queries. Currently, the model is deployed for A/B testing and has shown an uplift in click-through rate.
Keeping in mind the limited resources and the latency constraints of scaling city-wise models across India, we plan to explore self-attention-based transformer models in successive iterations.
Our priority is to optimise search queries based on customers’ needs. Currently, we deploy our NLS model for limited purposes and are planning to expand the model for the following use cases:
That’s all, folks!
This is a Data Science article on how we built Natural Language Search at Zomato. If you are interested in solving similar problems that push you to take on seemingly impossible tasks, connect with Manav Gupta on LinkedIn. We’re always looking for Data Scientists (aka Chief Statistics Officers) at Zomato.
This blog was written by Sonal Garg.
All content provided in this blog is for informational and educational purposes only. It is not professional advice and should not be treated as such. The writer of this blog makes no representations as to the accuracy or completeness of any content or information contained here or found by following any link on this blog.
All images/ videos are designed in-house.