Data Science Team | June 22, 2022 | 8 min read
Predicting your order’s Food Preparation Time

Ever wondered what all goes into calculating the ETA (Estimated Time of Arrival) of your food order? From the time it takes a restaurant partner to prepare your food order to the time it takes a delivery partner to reach your location from the restaurant – the whole picture is much bigger.  

Here is an example – 

Suppose you order a sundae from an ice cream parlour 5 kilometres away from home. You note that it reaches you within 25 minutes. 

On the other hand, when you order biryani from a restaurant merely 0.5 kilometres away, you note it takes more than 45 minutes. Confusing, right? Now imagine predicting the optimal FPT for the huge number of dishes that exist on Zomato and the lakhs of restaurants delivering them, each with its own preparation time. 

Feels like quite a challenge, right?

Well, this is one of the significant problems our Data Science team solves – to predict accurate Food Preparation Time for both customers and restaurant partners while communicating to delivery partners the right time to reach a restaurant.

Story till now (a quick recap of our previous FPT blog)

In our earlier blog – The Deep Tech Behind Estimating Food Preparation Time – we described what food preparation time means at Zomato and the factors it depends upon. We spoke at length about how the data science model makes predictions.

For people who haven’t read the last blog, we highly recommend you give it a read to understand this continued conversation better. Here, we will be sharing the improvements and learnings made on top of the previous model.

The winning solution: loss functions for food problems

Restaurants are operationally-driven systems with multiple dynamics at play. This means that orders with the exact same items and quantities can have different food preparation times, depending on the on-ground situation in the kitchens. 

The above curve closely resembles a right-skewed distribution. One can see that the variation for chole bhature is considerable – different outlets take different times, and even the same restaurant can take more or less time depending on rush hours, availability of staff, etc. We aim to build a model that can accurately predict the FPT for such a distribution.
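As a purely illustrative sketch (the lognormal distribution and its parameters here are assumptions, not Zomato's actual data), right-skewed preparation times can be simulated like this:

```python
import random
import statistics

random.seed(7)

# Illustrative only: preparation times (minutes) for a hypothetical dish,
# drawn from a lognormal distribution, a common model for right-skewed durations.
prep_times = [random.lognormvariate(3.0, 0.4) for _ in range(10_000)]

mean = statistics.mean(prep_times)
median = statistics.median(prep_times)
# Right-skew signature: a long tail of slow orders pulls the mean above the median
print(round(mean, 1), round(median, 1))
```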

Training our model with loss functions like Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE) has its limitations: 

  1. MAE penalises every unit of error equally. Hence, it gives the same per-unit weightage to extreme errors and minor errors, which sometimes results in predictions that can be significantly off from actuals (however infrequent), 
  2. RMSE penalises extreme errors heavily but under-weights the many minor errors, leaving us with a less accurate model on typical orders
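The difference shows up on two toy error sets with the same MAE (all numbers are illustrative):

```python
import math

def mae(errors):
    return sum(abs(e) for e in errors) / len(errors)

def rmse(errors):
    return math.sqrt(sum(e * e for e in errors) / len(errors))

small = [2, -2, 2, -2]  # many minor errors
spiky = [0, 0, 0, 8]    # one extreme error

print(mae(small), mae(spiky))    # 2.0 2.0 -> MAE cannot tell them apart
print(rmse(small), rmse(spiky))  # 2.0 4.0 -> RMSE flags the extreme error
```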

For our use case, we need an accurate model with fewer underestimations ensuring precise and transparent communication with customers.

Hence, we work with alternative loss functions. Here is a brief overview of the Quantile Loss Function – 

For q < 0.5, our loss function penalises the negative error (predicted value above actual) more than the positive error, pushing the model to predict lower values of Y, and vice versa for q > 0.5. In the special case where q = 0.5, our loss function becomes proportional to the MAE loss function, penalising positive and negative errors equally.
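A minimal sketch of the standard pinball formulation of this loss (the function name and the toy numbers are illustrative):

```python
def quantile_loss(actual, predicted, q):
    """Pinball loss for one prediction; e = actual - predicted."""
    e = actual - predicted
    return max(q * e, (q - 1) * e)

# q = 0.2: over-prediction costs more, so a trained model skews its predictions lower
print(quantile_loss(20, 18, q=0.2))  # under-prediction by 2 -> 0.4
print(quantile_loss(18, 20, q=0.2))  # over-prediction by 2  -> 1.6
```

At q = 0.5 the two branches weigh positive and negative errors equally, recovering an MAE-like loss up to a constant factor.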

Further, we can modify this quantile loss function in the following manner, which retains the benefits of MAE while adding appropriate penalties to the positive and negative errors.

Here, the variables m and n act as penalty factors for positive and negative errors, respectively. These can be tuned to yield a model optimised for the corresponding business metrics.
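The exact modified form is not reproduced here, so the sketch below is an assumption based on the description: an MAE-style loss with separate weights m and n on positive and negative errors (function name and numbers are illustrative).

```python
def asymmetric_mae(actual, predicted, m, n):
    """Hypothetical asymmetric MAE: weight m on positive errors
    (actual > predicted, i.e. under-prediction), n on negative errors."""
    e = actual - predicted
    return m * e if e >= 0 else -n * e

# m > n penalises under-prediction more, discouraging underestimated FPTs
print(asymmetric_mae(10, 8, m=3, n=1))  # under-prediction by 2 -> 6
print(asymmetric_mae(8, 10, m=3, n=1))  # over-prediction by 2  -> 2
```

With m = n = 1 this reduces to plain MAE.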

Multi-tasking like a boss – designing a single model to predict multiple outputs

The previous model could only predict the FPT of one order. Over time, we started integrating FPT with multiple downstream processes in our system, which, in turn, increased the importance of FPT predictions multifold. 

While predicting Food Preparation Time (FPT), we also need to predict other values, such as

  1. Wait time of the delivery partner – time a delivery partner is expected to spend waiting at the restaurant after arrival.
  2. Hand-shake time – time a delivery partner typically takes to pick up an order from a restaurant’s counter after reaching, while accounting for queues and other processes such as signing at the entrance, temperature check, etc.

Predicting targets like these requires much of the same information that the core FPT model already encodes. Hence, such targets can become a part of the FPT model itself. All of this together helps us improve our downstream algorithms so that customers get their food within the expected time range. 

Let’s elaborate on how we implemented the above use cases in our neural network(s).

Initially, we had a separate model for each use case, each requiring independent training. At a high level, our deep neural network model looked as below –

A single TensorFlow model of the given architecture consists of ~6 million parameters and takes up to 15 hours to train, not to mention the resources required to make a real-time prediction with this architecture.

Training multiple models with this architecture – each having ~6 million parameters – made it impractical for us to re-train and maintain them. We needed some optimisations to make the setup practical to use.

Since all these models are fed the same type of information, we introduced a multiple-output layer to the model. With its help, we could make multiple predictions with a single model.

The multiple-output layer consists of several fully connected layers, each dedicated to making predictions for its own loss function. 
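A toy numpy sketch of the idea – one shared trunk feeding several fully connected heads. Layer sizes, names, and random weights are illustrative, not the production architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu_layer(x, w, b):
    return np.maximum(x @ w + b, 0.0)

# Shared trunk: order/restaurant features -> a common hidden representation
n_features, hidden = 32, 16
w_trunk, b_trunk = rng.normal(size=(n_features, hidden)), np.zeros(hidden)

# One fully connected head per target; during training, each head gets its own loss
heads = {
    "fpt":            (rng.normal(size=(hidden, 1)), np.zeros(1)),
    "wait_time":      (rng.normal(size=(hidden, 1)), np.zeros(1)),
    "handshake_time": (rng.normal(size=(hidden, 1)), np.zeros(1)),
}

def predict(x):
    shared = relu_layer(x, w_trunk, b_trunk)
    return {name: (shared @ w + b).item() for name, (w, b) in heads.items()}

order_features = rng.normal(size=n_features)
predictions = predict(order_features)  # one forward pass, three predictions
```

Sharing the trunk means the expensive feature encoding is computed once per order, while each head stays cheap.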


One network to depict them all – creating order-level embeddings

In the earlier version, to create an embedding of the items present in an order, we used to take a weighted average of the item vectors (created using word2vec on item names), weighted by their quantities and cost.

Now, we have updated the order item encoding to use a bi-directional LSTM-based sub-network, which takes the word2vec features along with cost, quantity, and historic FPT statistics to create an embedding for the order.
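A simplified sketch of the idea, using a vanilla bidirectional RNN in numpy instead of an actual LSTM (all dimensions and weights are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
feat_dim, state_dim = 8, 4  # per-item features: word2vec + cost, quantity, FPT stats

# Weights of one recurrent direction (a real LSTM adds gating; omitted for brevity)
Wx = rng.normal(scale=0.1, size=(feat_dim, state_dim))
Wh = rng.normal(scale=0.1, size=(state_dim, state_dim))

def rnn_pass(items):
    h = np.zeros(state_dim)
    for x in items:  # read the item sequence step by step
        h = np.tanh(x @ Wx + h @ Wh)
    return h

def order_embedding(items):
    """Concatenate a forward and a backward pass over the item sequence."""
    return np.concatenate([rnn_pass(items), rnn_pass(items[::-1])])

small_order = rng.normal(size=(3, feat_dim))  # an order with 3 items
large_order = rng.normal(size=(7, feat_dim))  # an order with 7 items
# Orders of any length map to the same fixed-size embedding (here 8-dimensional)
```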

Shown below are the embeddings of all orders of restaurant X serving multiple cuisines, i.e., North Indian, Chaat, Chinese, desserts etc. (Red) and restaurant Y serving only ice creams and desserts (Blue) plotted on a 2D plane using Principal Component Analysis (PCA). 

We can see that in the previous methodology, the major density of orders from either of the restaurants is concentrated at one point. In the new model using the Bi-LSTM architecture, there is a clear separation between the orders from the two restaurants (embedding spread over a larger space, giving clearer information).

However, an observant reader might notice some overlap between the orders of X and Y, which is because of the similar nature of the items being ordered, such as brownies, sundaes, etc.

Smaller, better, faster and stronger! 

While working on the FPT model, we added many complexities to it – new features, deeper models, more embeddings. As the complexity grew, so did the model size, resulting in relatively slower predictions, longer training times, higher resource utilisation, etc. 

To solve these challenges, we made the below changes which controlled these issues to a large extent –

  1. Training embeddings separately: We decoupled the embedding layers used to encode categorical variables like restaurant IDs, city names, hour of the day, day of the week, etc. These embeddings constituted ~95% of the parameters in the FPT model. To fix this, we trained bigger embeddings for these sub-networks separately and plugged the frozen results back into the original FPT model. This reduced the FPT model to a tenth of its original size, making it lighter. Bonus: predictions became quicker in the production environment.
  2. Breaking the problem into smaller chunks: It’s always better to break a big problem into smaller chunks and solve them one by one. The same holds true when training a deep neural network for real-life scenarios. For our use case, we created separate models for major cities, each with its own set of parameters, to capture city-level distributions better. The idea was to make more accurate predictions.
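The first point can be sketched as a frozen lookup (the table sizes and the encode helper below are hypothetical, for illustration only):

```python
import numpy as np

rng = np.random.default_rng(2)

# Pre-trained lookup tables, learned in a separate job and frozen here.
# Sizes are illustrative; in the real model, such categorical embeddings
# accounted for ~95% of all parameters.
restaurant_emb = rng.normal(size=(100_000, 16))  # restaurant ID -> vector
hour_emb = rng.normal(size=(24, 4))              # hour of day   -> vector

def encode(restaurant_id, hour):
    """A lookup, not a trainable layer: the model concatenates frozen vectors."""
    return np.concatenate([restaurant_emb[restaurant_id], hour_emb[hour]])

frozen_params = restaurant_emb.size + hour_emb.size  # stays out of training
trainable_params = (16 + 4) * 1 + 1                  # e.g. a tiny head on top
print(frozen_params, trainable_params)               # 1600096 21
```

Moving the tables out of the trainable graph is what shrinks re-training time and the serving footprint.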

Focusing on the bigger picture – aligning our tech with the business metrics

You can train your model on any loss function – MAE, RMSE, MSE… the list goes on. All these functions are suitable for training because of their convex nature, each having its own pros and cons. However, all this maths and computation would be of no use if the function cannot train a model with real business value.

Hence, it is essential to devise business metrics that can measure the relevance of different iterations of the FPT model. We shall discuss two of these metrics in particular here –

  1. X-min accuracy – percentage of orders where the prediction lies within X minutes of the actual time
  2. X-min breach – percentage of orders where the actual time overshoots the prediction by more than X minutes
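The two metrics can be computed as below (function names and the toy numbers are illustrative):

```python
def x_min_accuracy(actuals, preds, x=3):
    """% of orders where the prediction lies within x minutes of the actual."""
    hits = sum(abs(a - p) <= x for a, p in zip(actuals, preds))
    return 100.0 * hits / len(actuals)

def x_min_breach(actuals, preds, x=3):
    """% of orders where the actual overshoots the prediction by more than x minutes."""
    breaches = sum(a - p > x for a, p in zip(actuals, preds))
    return 100.0 * breaches / len(actuals)

actuals = [20, 25, 30, 18]  # minutes
preds   = [22, 20, 29, 30]
print(x_min_accuracy(actuals, preds))  # 50.0
print(x_min_breach(actuals, preds))    # 25.0
```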

To ensure that the customer gets their food within the promised time, we want to maximise accuracy and minimise breaches. At the same time, we don’t want to overestimate the time, as that might discourage the customer from ordering food on our platform.

Enough with the theory; show me numbers

Now that we are familiar with the terms – or, as we say, now that we are on the same page – let’s talk numbers. Using the above-mentioned strategies, we saw an improvement of 4% in 3-min accuracy and a reduction of 9% in 3-min breaches over our baseline model.

Now that we’ve addressed the elephant in the room, what’s next?

FPT estimation is so crucial to Zomato that we always strive to improve the predictions. While we continue to experiment with architectural changes, which make the model larger and more complex, we also seek to develop a better understanding of restaurant processes on the ground. Some of the opportunity areas we are actively working on are –

  1. Modifying the architecture so that the model can better understand kitchen-level specifics. For example, chole bhature ordered from a process-driven kitchen vs from a home-cooked style kitchen will have different FPTs.
  2. Creating a real-time self-correcting system using techniques like reinforcement learning. Since we cannot capture all the signals impacting FPT, a reactive system works better. 
  3. Developing a confidence metric for predictions, so that the model can state when it is less confident about certain predictions.

This is a follow-up article on how we use machine learning to calculate Food Preparation Time. If you are interested in working with us on such innovative problem solving, then connect with Manav Gupta / Akshay Jain on LinkedIn. We’re always looking for Data Scientists (aka Chief Statistics Officers) at Zomato.

This blog was written by Parth Javiya, Abhilash Awasthi, and Akshay Jain.

____

All content provided in this blog is for informational and educational purposes only. It is not professional advice, and should be treated as such. The writer of this blog makes no representations as to the accuracy or completeness of any content or information contained here or found by following any link on this blog.


All images are designed in-house.

-x-

