Machine Learning Team | August 16, 2022 | 9 min read
Powering restaurant ads on Zomato via Machine Learning

There’s a popular saying that ‘great ads are those that don’t feel like ads’. It holds true everywhere, be it a television commercial, a newspaper ad, or a digital ad. Ours is the last of these: we show promoted restaurants to customers on our app.

At Zomato, promoted restaurants are part of the recommendation rails. These are either new restaurants looking to create brand awareness or existing ones wanting new people to try their food. The idea is to showcase what they make best, be it a great fusion dish they invented or an authentic recipe they’ve perfected over the years. For restaurants, the aim is to make customers happy and generate business through it. For us, it is to make the search journey easier and faster, so customers can quickly discover food and restaurants they are likely to love.

Think of what happens when you visit a restaurant and ask the waiter for a suggestion. They ask you a few questions about your dietary preference (veg/ non-veg), your spice tolerance, and your preferred taste (sour/ sweet), and then recommend dishes accordingly. We aim for the same thing: to suggest dishes and restaurants that are relevant and unique.

To feature what customers are most likely to enjoy, we use machine learning to make better, more personalised suggestions. In this blog, we discuss how we improved the relevance of promoted restaurants through an AI-powered model, delivering a better customer experience while ensuring that promoted restaurants get the right visibility.

Problem Formulation

Before adopting machine learning, we ranked promoted restaurants using heuristics and hand-crafted utility functions. This system, however, had certain limitations. It did not consider customer nuances like dish preferences, promo-code sensitivity, dietary preference (veg/ non-veg), and affordability. Important context like meal time and day of the week was also overlooked.

We want to serve customers with recommendations that have these factors baked into them, i.e., to present restaurants that are most relevant for them.

Unfortunately, experimenting with combinations of all the factors that influence a customer’s ordering behaviour is too time-consuming, and hand-crafting a utility function that captures all of them is impractical. To solve this, we formulated the task as a ranking problem in machine learning.

In a typical ranking problem, we have a dataset of past sessions in which customers were presented with a set of documents and their interactions were recorded. In our context, restaurants become the documents, and the customer and contextual features become the query. From these interactions, the model learns which types of customers interact positively (click/ order) with which kinds of restaurants. Each document (restaurant) is assigned a score directly proportional to its relevance for a given query.

Evaluation

After training, and before taking the model live, comes evaluation. To evaluate a model, we decide on a few metrics and track progress at every step in an offline setting.

We track Mean Reciprocal Rank (MRR), Mean Rank, and Accuracy@N to benchmark each iteration. These tell us how effectively we lead our customers to what they are looking for. Before we move ahead, here is a brief on each of these terms.

Mean Reciprocal Rank (MRR)

The mean of the reciprocal of the rank at which a customer clicks and places an order, across sessions. A higher MRR shows that clicks come from the top ranks of recommended restaurants.

Mean Rank

The mean of the rank at which a customer clicks and places an order, across sessions. The lower the mean rank, the better.

For example, if we show a customer 10 restaurants and they order from the third one rather than the fifth, the recommendations are better.

Accuracy@N 

The fraction of sessions in which the clicked restaurant falls in the top-N recommendations; the higher the Accuracy@N, the better. We looked at Accuracy@1, Accuracy@5, and Accuracy@8 in our evaluations. For example, if you frequently order pizza, how often does a restaurant like Pizza by Loui (fictitious name) appear in your top recommendations?
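As an illustration, these offline metrics can be computed from session logs roughly as follows (a minimal sketch on toy data; the log format and helper names are ours for this example, not Zomato’s internal code):

```python
# Each session records the 1-based rank at which the customer
# clicked and ordered within the recommended list.
clicked_ranks = [1, 3, 2, 5, 1]  # toy data: one rank per session

# Mean Reciprocal Rank: higher is better (clicks near the top).
mrr = sum(1.0 / r for r in clicked_ranks) / len(clicked_ranks)

# Mean Rank: lower is better.
mean_rank = sum(clicked_ranks) / len(clicked_ranks)

# Accuracy@N: fraction of sessions where the clicked restaurant
# appeared within the top-N recommendations.
def accuracy_at(n, ranks):
    return sum(r <= n for r in ranks) / len(ranks)

print(round(mrr, 3), mean_rank, accuracy_at(5, clicked_ranks))
# -> 0.607 2.4 1.0
```

On this toy log, Accuracy@1 would be 0.4 (two of five sessions had the order at rank 1), which shows how the three metrics surface different aspects of the same sessions.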

Out-of-time dataset

In our case, this is the test dataset: if we train a model on data from the 1st to the 10th of a month, we test its predictions on data from the 11th to the 15th of the same month. We move ahead only when the model performs adequately on these metrics over the out-of-time dataset.
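A minimal sketch of such a time-based split (the session tuple layout here is ours for illustration):

```python
from datetime import date

# Toy session log: (session_date, features, label)
sessions = [(date(2022, 7, d), {"f": d}, d % 2) for d in range(1, 16)]

# Train on days 1-10, evaluate out-of-time on days 11-15 of the
# same month, so the model is never scored on data it has seen.
train_sessions = [s for s in sessions if s[0].day <= 10]
test_sessions = [s for s in sessions if 11 <= s[0].day <= 15]
```

Unlike a random split, this respects the arrow of time: the evaluation mimics production, where the model always predicts on sessions newer than anything it was trained on.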

Apart from the metrics discussed above, we also closely monitor the qualitative evaluation of these results. 

Modelling

We have used Learning to Rank (LTR) – a technique that applies supervised machine learning to ranking, widely used for personalisation and relevance problems.

Typically, a ranking algorithm takes as input a set of query-document pairs (q, d), where q corresponds to a query and d refers to a document.

As discussed above, here the query refers to the customer and context, and the document refers to the restaurant. The model predicts a relevance score s = f(q, d) for each pair.

Once we have the relevance score of each document, we can sort, i.e., rank, the restaurants according to those scores.
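In code, the score-then-sort step amounts to the following (a sketch with a stand-in scoring function and made-up feature names such as r_aov, not the production model):

```python
# Stand-in relevance model s = f(q, d): q bundles customer and
# context features, d the restaurant features.
def f(q, d):
    # Toy score: prefer restaurants near the customer's average
    # order value and matching their veg preference.
    affordability = -abs(q["customer_aov"] - d["r_aov"])
    veg_match = 1.0 if q["veg_pref"] == d["is_veg"] else 0.0
    return affordability + 100 * veg_match

q = {"customer_aov": 300, "veg_pref": True}
restaurants = [
    {"name": "A", "r_aov": 900, "is_veg": True},
    {"name": "B", "r_aov": 280, "is_veg": True},
    {"name": "C", "r_aov": 310, "is_veg": False},
]

# Rank restaurants by descending relevance score.
ranked = sorted(restaurants, key=lambda d: f(q, d), reverse=True)
print([r["name"] for r in ranked])  # -> ['B', 'C', 'A']
```

The real model learns f from data rather than using hand-written rules; the point is only that, given scores, ranking reduces to a sort.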

Among the ranking strategies in the ML literature, LTR proved the most promising for us. Over the years, LTR has evolved considerably and can be classified into three major categories: Pointwise, Pairwise, and Listwise. These approaches mainly differ in the number of documents the loss function considers at a time while training a model.

  • Pointwise: The Pointwise approach reduces LTR to a regression problem. It scores a single document at a time, producing one numerical score per document. The drawback is that, in doing so, it ignores the relative ordering between documents (in our case, restaurants), which makes it less effective for our use case.
  • Pairwise: The Pairwise approach looks at a pair of documents at a time. Its main aim is to minimise the number of inversions in the ranking, i.e., the cases where a lower-rated document is ranked above a higher-rated one.
  • Listwise: While Pairwise looks at a pair of documents at a time, the Listwise approach computes gradients over a whole list of documents at a time. It also directly considers ranking metrics like NDCG (Normalized Discounted Cumulative Gain) or MAP (Mean Average Precision) and optimises them to arrive at the ranking scores.

Essentially, the Pairwise and Listwise approaches consider the relative ordering of restaurants: if restaurant A is preferable to restaurant B, the order (Restaurant A, Restaurant B) is correct and the order (Restaurant B, Restaurant A) is not. So, instead of predicting an absolute score for each document, they care about the relative ordering among all the documents.
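The inversions that the Pairwise approach minimises can be made concrete with a small sketch (our own illustration, not the training code):

```python
# Count pairwise inversions: pairs where a less-relevant restaurant
# is ranked above a more-relevant one. The Pairwise approach trains
# the model to minimise exactly these.
def count_inversions(ranking, relevance):
    # ranking: restaurant ids in predicted order (best first)
    # relevance: true relevance per id (higher = more relevant)
    inversions = 0
    for i in range(len(ranking)):
        for j in range(i + 1, len(ranking)):
            if relevance[ranking[i]] < relevance[ranking[j]]:
                inversions += 1
    return inversions

relevance = {"A": 2, "B": 1, "C": 0}
print(count_inversions(["A", "B", "C"], relevance))  # -> 0 (perfect)
print(count_inversions(["C", "B", "A"], relevance))  # -> 3 (fully inverted)
```

A ranking with zero inversions orders every preferable restaurant above every less-preferable one, which is precisely the behaviour the pairwise loss rewards.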

Generating clean data from historical sessions is very important. Since every model follows the ‘Garbage In, Garbage Out’ principle, we have to be very careful about missing logs, sessions with inconsistent data, and, most importantly, anomalies in feature values.

We follow the same steps to generate training and validation data; the only difference is the time window used for each. We pull all sessions from the past few weeks, where each session corresponds to an instance in our data. Say a customer searches for a dish: we show them a set of restaurants serving that dish, and when they order from one of them, an instance is completed. For an instance to count, there has to be some material interaction between the customer and the restaurant tab, such as clicking on the restaurant, browsing its menu, or ordering from it.

Here, we label the restaurant from which the customer orders as 1 (positive sample). The restaurants from which the customer does not order in that session are labelled 0 (negative samples). Since there can be many negative samples for each positive sample, we take M negative samples ranked above the positive sample and N negative samples below it. We also tried other approaches, like random sampling of negatives, but the top-M and bottom-N approach worked better for us.
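The top-M/bottom-N sampling can be sketched like this (a toy illustration; the function and variable names are ours):

```python
# Top-M / bottom-N negative sampling around the positive sample.
def sample_negatives(shown, positive, m, n):
    # shown: restaurants in the order they were displayed
    # positive: the restaurant the customer ordered from
    idx = shown.index(positive)
    above = shown[max(0, idx - m):idx]   # up to M negatives ranked above
    below = shown[idx + 1:idx + 1 + n]   # up to N negatives ranked below
    return above + below

shown = ["r1", "r2", "r3", "r4", "r5", "r6", "r7"]
print(sample_negatives(shown, "r4", m=2, n=2))  # -> ['r2', 'r3', 'r5', 'r6']
```

Negatives drawn near the positive are "hard" examples the customer actually scrolled past, which is one plausible reason this scheme beat random sampling here.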

We experimented with all the variants – Pointwise, Pairwise, and Listwise – to determine the most suitable algorithm for our purpose. Here’s what we observed –

  • The Pairwise approach provided better results than the Pointwise and Listwise approaches
  • Accuracy@5 for the Pairwise approach was 14% better than Pointwise and 8% better than Listwise

All restaurant names in the above figure are fictitious and are solely used for representation purposes. 

Here, r_aov refers to the restaurant’s average order value and customer aov to the customer’s average order value.

In the above figure, different customers have searched for the same dish – pizza. We can see the top five restaurants given by the model, which has recommended restaurants based on each customer’s spending capacity. The model also caters to the customer’s veg/ non-veg restaurant preference and gives higher weightage to restaurants the customer has ordered from before.

We tested our model against the current logic (baseline) on MRR and Order Accuracy@5, and observed a significant increase in both metrics on the home page as well as on dish-search pages.

Model Deployment

All our machine learning models run behind an in-house ML API Gateway Service (Butterfly). We have deployed the ranking model on Kubernetes behind Butterfly. We also use MLflow [1] (an open source project) as a central model registry to manage our models. To deploy any of our models, we log them on the Zomato MLflow Server. The model was developed in Python using the LambdaMART [3] implementation from the XGBoost [2] library and converted into MLflow format for deployment.

As depicted in the figure below, the candidate promoted restaurants are available with the ads-service, which calls the ML API gateway for re-ranking. The API (Application Programming Interface) gateway service fetches features from our Feature Store/ Profile Store and passes them to the model server. The model server performs the re-ranking and returns relevance scores to the ads-service via the gateway. The ads-service then does some post-processing on the scores and renders the result page as per the new ranking. Features are populated into the feature store by daily and weekly Python jobs running on Airflow.

Experimentation and Results 

To understand the impact of the new model on customers and our business, we divided the population into a Control Group and a Treatment Group (CG/ TG) for A/B testing. We ran the test pan-India with a 50/50 split of the population between the proposed model and the current algorithm. Here’s what we observed –

An increase in customers ordering from the listing pages instead of the search bar. This implies that more customers found our recommendations relevant to their needs and decided to order from these restaurants quickly, rather than spending time searching and scrolling.

Conclusion and Next steps

Our recommendation model has shown promising results and made ordering choices easier for our customers, while ensuring adequate visibility for promoted restaurants. Improvements in both Click Through Rate (CTR) and Mean Reciprocal Rank (MRR) signify that recommendation relevance for customers has increased. In simple terms, customers who earlier found a relevant restaurant within the first eight recommendations now find it within the first six or fewer.

Can our recommendations get better? The answer to this question is always a yes. 

We have been monitoring the predictions and have observed the following key areas of improvement: 

  • Personalisation sensitivity varies across different pages (delivery vs homepage). Therefore, it seems wise to approach relevance at a more granular level, e.g., look at relevance specifically for the dish ‘Dal Makhani’ instead of viewing all dish searches through one lens.
  • Different customer cohorts respond differently to personalised suggestions. Relevance for some cohorts can be made better by indexing more on discovery and exploration rather than personal preferences.

In this article, our Machine Learning team shared how we run restaurant ads on Zomato. If you are interested in solving similar problems that push you to tackle seemingly impossible tasks, connect with Ram Singla on LinkedIn. We’re always looking for ML geeks at Zomato.

This blog was written by Deepanshu Arora, Sanjeev Dubey, Sanyukta Singhal, and Siddharth Kumar under the guidance of Ram Singla and Sudheer Tumu.

-x-

Sources –

  1. MLflow, mlflow.org
  2. XGBoost Documentation, xgboost.readthedocs.io/en/stable
  3. From RankNet to LambdaRank to LambdaMART: An Overview, microsoft.com/en-us/research/wp-content/uploads/2016/02/MSR-TR-2010-82.pdf

-x-

Please note that Python, MLflow, XGBoost are the respective trademarks of Python Software Foundation, LF Projects LLC, and The XGBoost Contributors. 

All content provided in this blog is for informational and educational purposes only. It is not professional advice and should not be treated as such. The writer of this blog makes no representations as to the accuracy or completeness of any content or information contained here or found by following any link on this blog.

All images/ videos are designed in-house.
