At Zomato, identifying the correct drop location is critical, as it governs and impacts a variety of things like an order’s estimated time of arrival, location serviceability and more.
While it’s comparatively easy to capture a drop location, it’s hard to tell whether two different-looking addresses point to the same place or not. One customer might write A-4, XYZ Apartments, while their flatmate might write their address as A/4, First Floor, XYZ Apartments. A few syntactic changes, such as a permutation of words and/or characters in the address string, can make it look like a different address altogether and leave the data difficult to read.
Now imagine if members of the same building write their addresses distinctly. While navigating to a pin location is the easy part, confirming whether customers located close to each other actually belong to the same building, despite differently written addresses, is a steep climb.
Such situations can come up on a daily basis whenever we have to work with customer addresses at scale.
To solve such challenges, we deployed machine learning algorithms that help us automate address clustering, identify gibberish or fraudulent addresses, and improve existing address strings.
We soon realised that, for us, word embeddings alone would not solve the problem; developing an understanding of the sequence of words was equally critical. With string/sentence applications becoming increasingly common in the real world, plenty of pre-trained models are already available in the machine learning ecosystem.
One of the effective approaches we use is SBERT, or Sentence BERT2, which produces embeddings of address strings that we then cluster to identify similar address texts.
BERT [Bidirectional Encoder Representations from Transformers] by Google AI is an encoder-only transformer. Bidirectional language models have a deeper understanding of context.
Prior to 2017, Recurrent Neural Networks (RNNs) were used for most NLP tasks. RNNs work on an encoder-decoder mechanism: the encoder encodes the input text and converts it into a context vector, and this context vector is passed to a decoder network, which predicts the output. Every token produced by the decoder is passed through an Attention mechanism, which helps identify where attention is needed. The Attention mechanism allows the decoder to focus on the most relevant words on the encoder side.
Post-2017, a research paper titled ‘Attention Is All You Need’1 gave birth to transformers. Transformers rely on Attention alone, but here Attention is modified a bit, with three fundamental changes: self-attention, where every token attends to every other token in the input; multi-head attention, which runs several attention functions in parallel; and positional encoding, which tells the model where each word sits in the sequence.
In the example below, we see the same city appearing in two different positions within the address string.
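As a rough sketch of what the Attention operation computes, here is plain scaled dot-product attention in NumPy. The matrices Q, K and V stand in for the learned query, key and value projections; this is an illustration of the mechanism, not a fragment of any production model.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q.K^T / sqrt(d_k)) . V -- a weighted sum of values,
    where the weights say how strongly each query attends to each key."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # query-key similarity
    scores -= scores.max(axis=-1, keepdims=True)     # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                               # attention output

# Toy example: 3 tokens with embedding dimension 4. Self-attention builds
# Q, K and V from the same token representations.
tokens = np.random.rand(3, 4)
print(scaled_dot_product_attention(tokens, tokens, tokens).shape)  # (3, 4)
```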
We could use this with conventional neural networks, but it is far less effective. So with transformer models, we take the core of a model that has already been trained with a significant amount of time and computing power by the likes of Google or OpenAI, add a few layers at its end to make it specific to our use case, and then fine-tune it. One of the most popular models is BERT – a state-of-the-art language model. While generating embeddings, BERT masks the word in focus, which keeps the word from having a fixed meaning independent of its context. This method also forces BERT to identify masked words based on their context.
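To see the masking idea in action, the Hugging Face fill-mask pipeline can be run with an off-the-shelf BERT checkpoint. The model name and the sample address below are purely illustrative; they are not the models or data used in our pipeline.

```python
from transformers import pipeline

# BERT was pre-trained to predict a masked token from its surrounding context.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# The model ranks candidate words for the [MASK] position using the context around it.
for prediction in fill_mask("A-4, first floor, XYZ [MASK], near the main market."):
    print(prediction["token_str"], round(prediction["score"], 3))
```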
Since we wanted to capture semantic similarity between addresses, plain transformers did not work for us, as they focus on word/token embeddings, which were not helpful for our use case. Instead, we were interested in Sentence Transformers, as they work better for sentence-level semantic embeddings.
Before this, there were cross encoders, so we would have a BERT cross-encoder model [a core BERT model followed by a feed-forward neural network] that scores a pair of sentences together. This was helpful but not scalable.
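For context, a minimal cross-encoder sketch with the sentence-transformers library looks like this. The checkpoint name is a public example, not our model; the key point is that every address pair needs its own forward pass, which is why this approach does not scale to millions of comparisons.

```python
from sentence_transformers import CrossEncoder

# A cross encoder reads both sentences together and outputs a single similarity score.
model = CrossEncoder("cross-encoder/stsb-roberta-base")

pairs = [("A-4, XYZ Apartments", "A/4, First Floor, XYZ Apartments")]
print(model.predict(pairs))   # one score per pair; each pair is a full forward pass
```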
Today, one widely used method for getting sentence embeddings is to take the mean of all the word/token embeddings in a sentence. But the values from this mean-pooling approach alone are not very accurate.
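For reference, mean pooling is usually implemented with the attention mask so that padding tokens do not dilute the average. A minimal PyTorch sketch, assuming token embeddings shaped (batch, seq_len, hidden_dim):

```python
import torch

def mean_pool(token_embeddings: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Average the token embeddings of each sentence, ignoring padding positions."""
    mask = attention_mask.unsqueeze(-1).float()      # (batch, seq_len, 1)
    summed = (token_embeddings * mask).sum(dim=1)    # sum over real tokens only
    counts = mask.sum(dim=1).clamp(min=1e-9)         # number of real tokens per sentence
    return summed / counts                           # (batch, hidden_dim)
```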
In 2019, a solution came in the form of the Sentence Transformer, or Sentence BERT (SBERT).
SBERT fine-tunes on sentence pairs using a Siamese architecture, or Siamese network. The Siamese architecture consists of two identical BERT models whose weights are tied. Because of this, SBERT is also known as a twin network, as it allows two sentences to be processed simultaneously. While implementing it, we use a single BERT model and pass one sentence to the model as sentence A and the other as sentence B. During training, the tied weights are optimised to reduce the difference between the two sentence embeddings, called u and v, for similar pairs. We get these embeddings after mean pooling; the pooling layer gives us a fixed-size representation for input sentences of varying lengths.
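In practice, computing u and v and comparing them takes only a few lines with the sentence-transformers library. The checkpoint below is a public SBERT model used purely for illustration; it is not our in-house address model.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")   # public SBERT checkpoint, for illustration

u = model.encode("A-4, XYZ Apartments")
v = model.encode("A/4, First Floor, XYZ Apartments")

# Cosine similarity between the two sentence embeddings: values close to 1
# indicate the two address strings likely describe the same place.
print(util.cos_sim(u, v))
```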
Further, these embeddings are used to cluster addresses and find similar address strings. We use DBSCAN (Density-Based Spatial Clustering of Applications with Noise) for clustering.
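A minimal clustering sketch with scikit-learn’s DBSCAN on top of the sentence embeddings could look like the following. The eps and min_samples values are placeholders that would need tuning, and the public checkpoint and sample addresses are again only illustrative.

```python
from sentence_transformers import SentenceTransformer
from sklearn.cluster import DBSCAN

model = SentenceTransformer("all-MiniLM-L6-v2")   # illustrative public checkpoint

addresses = [
    "A-4, XYZ Apartments",
    "A/4, First Floor, XYZ Apartments",
    "B-12, PQR Residency, Sector 5",
]
embeddings = model.encode(addresses)

# eps and min_samples are placeholders. In the output, -1 marks noise points;
# equal labels mark addresses that fall into the same cluster.
labels = DBSCAN(eps=0.3, min_samples=2, metric="cosine").fit_predict(embeddings)
print(labels)
```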
One of the exciting challenges we faced during this exercise was identifying private/public/fraudulent addresses from the vast pool of all addresses. To solve this, we used the Simpson diversity index to calculate a diversity score, on the basis of which we grouped addresses.
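The Simpson diversity index itself is simple to compute: it is one minus the probability that two items drawn at random from a group belong to the same category. The sketch below applies it to the customer accounts ordering to one address cluster; that grouping signal is only an illustrative assumption, not a description of the exact features we use.

```python
from collections import Counter

def simpson_diversity(labels):
    """Simpson diversity index: 1 - sum(p_i^2), where p_i is the share of category i.
    0 means every item belongs to one category; values near 1 mean many distinct ones."""
    counts = Counter(labels)
    total = sum(counts.values())
    return 1.0 - sum((n / total) ** 2 for n in counts.values())

# Illustrative only: diversity of customer accounts ordering to one address cluster.
# A low score suggests a private address; a high score suggests a shared/public place.
print(simpson_diversity(["user_1", "user_1", "user_1"]))             # 0.0
print(simpson_diversity(["user_1", "user_2", "user_3", "user_4"]))   # 0.75
```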
That’s all folks. Stay tuned for more explorative blogs on how we improve our tech offerings for all of you, one technical tweak at a time.
This is a Machine Learning article sharing how our Data Science team tackles the problem of unique address identification. If you are interested in taking on similar, seemingly impossible problems, connect with Manav Gupta on LinkedIn. We’re always looking for Data Scientists (aka Chief Statistics Officers) at Zomato.
This blog was written by Prathyusha Ratan.
-x-
Sources –
1. Attention Is All You Need – Vaswani et al., 2017
2. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks – Reimers & Gurevych, 2019
-x-
All content provided in this blog is for informational and educational purposes only. It is not professional advice and should not be treated as such. The writer of this blog makes no representations as to the accuracy or completeness of any content or information contained here or found by following any link on this blog.
All images are designed in-house.