At Zomato, identifying the correct drop location is critical, as it governs and impacts a variety of things like an order’s estimated time of arrival, location serviceability and more.
While it’s comparatively easy to capture a drop location, it’s hard to tell whether two different-looking addresses point to the same place or not. One customer might write A-4, XYZ Apartments, while their flatmate might write their address as A/4, First Floor, XYZ Apartments. A few syntactic changes, such as a permutation of words and/or characters in the address string, can make it look like a different address altogether and leave the data difficult to read.
Now imagine if members of the same building write their addresses distinctly. While navigating to a pin location is the easy part, confirming whether customers located close to each other actually belong to the same building, despite differently written addresses, is a steep climb.
Such situations can come up on a daily basis whenever we have to work with customer addresses at scale.
To solve such challenges, we deployed machine learning algorithms that help us automate address clustering, identify gibberish or fraudulent addresses, and improve existing address strings.
We soon realised that, for us, word embeddings alone would not solve the problem; developing an understanding of the sequence of words was equally critical. With string/sentence applications becoming increasingly common in the real world, plenty of pre-trained models are already available in the machine learning ecosystem.
One of the effective approaches we use is SBERT, or Sentence BERT2, which produces embeddings of address strings that we then cluster to identify similar address texts.
BERT [Bidirectional Encoder Representations from Transformers] by Google AI is an encoder-only transformer. Bidirectional language models have a deeper understanding of context.
Prior to 2017, Recurrent Neural Networks (RNNs) were used for most NLP tasks. RNNs work on an encoder-decoder mechanism: the encoder encodes the input text and converts it into a context vector, and this context vector is passed to a decoder network, which predicts the output. Every token produced by the decoder is passed through an Attention mechanism, which helps identify where attention is needed. The Attention mechanism allows the decoder to focus on the most relevant words on the encoder side.
Post-2017, a research paper titled ‘Attention Is All You Need’1 gave birth to transformers. Transformers rely on Attention alone, but here Attention is modified a bit, with three fundamental changes: self-attention, where every token attends to every other token in the input; multi-head attention, which runs several attention functions in parallel; and positional encoding, which tells the model where each word sits in the sequence.
In the example below, we see the same city appearing in two different positions within the address string.
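As a rough sketch of what the Attention operation computes, here is plain scaled dot-product attention in NumPy. The matrices Q, K and V stand in for the learned query, key and value projections; this is an illustration of the mechanism, not a fragment of any production model.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q.K^T / sqrt(d_k)) . V -- a weighted sum of values,
    where the weights say how strongly each query attends to each key."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # query-key similarity
    scores -= scores.max(axis=-1, keepdims=True)     # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                               # attention output

# Toy example: 3 tokens with embedding dimension 4. Self-attention builds
# Q, K and V from the same token representations.
tokens = np.random.rand(3, 4)
print(scaled_dot_product_attention(tokens, tokens, tokens).shape)  # (3, 4)
```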
We could use this with conventional neural networks, but it is far less effective. So with transformer models, we take the core of a model that has already been trained with a significant amount of time and computing power by the likes of Google or OpenAI, add a few layers at its end to make it specific to our use case, and then fine-tune it. One of the most popular models is BERT – a state-of-the-art language model. While generating embeddings, BERT masks the word in focus, which keeps the word from having a fixed meaning independent of its context. This method also forces BERT to identify masked words based on their context.
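To see the masking idea in action, the Hugging Face fill-mask pipeline can be run with an off-the-shelf BERT checkpoint. The model name and the sample address below are purely illustrative; they are not the models or data used in our pipeline.

```python
from transformers import pipeline

# BERT was pre-trained to predict a masked token from its surrounding context.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# The model ranks candidate words for the [MASK] position using the context around it.
for prediction in fill_mask("A-4, first floor, XYZ [MASK], near the main market."):
    print(prediction["token_str"], round(prediction["score"], 3))
```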
Since we wanted to capture semantic similarity between addresses, plain transformers did not work for us, as they focus on word/token embeddings, which were not helpful for our use case. Instead, we were interested in Sentence Transformers, as they work better for sentence-level semantic embeddings.
Before this, there were cross encoders, so we would have a BERT cross-encoder model [a core BERT model followed by a feed-forward neural network] that scores a pair of sentences together. This was helpful but not scalable.
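For context, a minimal cross-encoder sketch with the sentence-transformers library looks like this. The checkpoint name is a public example, not our model; the key point is that every address pair needs its own forward pass, which is why this approach does not scale to millions of comparisons.

```python
from sentence_transformers import CrossEncoder

# A cross encoder reads both sentences together and outputs a single similarity score.
model = CrossEncoder("cross-encoder/stsb-roberta-base")

pairs = [("A-4, XYZ Apartments", "A/4, First Floor, XYZ Apartments")]
print(model.predict(pairs))   # one score per pair; each pair is a full forward pass
```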
Today, one widely used method for getting sentence embeddings is to take the mean of all the word/token embeddings in a sentence. But the values from this mean-pooling approach alone are not very accurate.
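For reference, mean pooling is usually implemented with the attention mask so that padding tokens do not dilute the average. A minimal PyTorch sketch, assuming token embeddings shaped (batch, seq_len, hidden_dim):

```python
import torch

def mean_pool(token_embeddings: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Average the token embeddings of each sentence, ignoring padding positions."""
    mask = attention_mask.unsqueeze(-1).float()      # (batch, seq_len, 1)
    summed = (token_embeddings * mask).sum(dim=1)    # sum over real tokens only
    counts = mask.sum(dim=1).clamp(min=1e-9)         # number of real tokens per sentence
    return summed / counts                           # (batch, hidden_dim)
```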
In 2019, a solution came in the form of the Sentence Transformer, or Sentence BERT (SBERT).
SBERT fine-tunes on sentence pairs using a Siamese architecture, or Siamese network. The Siamese architecture consists of two identical BERT models whose weights are tied. Because of this, SBERT is also known as a twin network, as it allows two sentences to be processed simultaneously. While implementing it, we use a single BERT model and pass one sentence to the model as sentence A and the other as sentence B. During training, the tied weights are optimised to reduce the difference between the two sentence embeddings, called u and v, for similar pairs. We get these embeddings after mean pooling; the pooling layer gives us a fixed-size representation for input sentences of varying lengths.
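In practice, computing u and v and comparing them takes only a few lines with the sentence-transformers library. The checkpoint below is a public SBERT model used purely for illustration; it is not our in-house address model.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")   # public SBERT checkpoint, for illustration

u = model.encode("A-4, XYZ Apartments")
v = model.encode("A/4, First Floor, XYZ Apartments")

# Cosine similarity between the two sentence embeddings: values close to 1
# indicate the two address strings likely describe the same place.
print(util.cos_sim(u, v))
```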
Further, these embeddings are used to cluster addresses and find similar address strings. We use DBSCAN (Density-Based Spatial Clustering of Applications with Noise) for clustering.
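A minimal clustering sketch with scikit-learn’s DBSCAN on top of the sentence embeddings could look like the following. The eps and min_samples values are placeholders that would need tuning, and the public checkpoint and sample addresses are again only illustrative.

```python
from sentence_transformers import SentenceTransformer
from sklearn.cluster import DBSCAN

model = SentenceTransformer("all-MiniLM-L6-v2")   # illustrative public checkpoint

addresses = [
    "A-4, XYZ Apartments",
    "A/4, First Floor, XYZ Apartments",
    "B-12, PQR Residency, Sector 5",
]
embeddings = model.encode(addresses)

# eps and min_samples are placeholders. In the output, -1 marks noise points;
# equal labels mark addresses that fall into the same cluster.
labels = DBSCAN(eps=0.3, min_samples=2, metric="cosine").fit_predict(embeddings)
print(labels)
```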
One of the exciting challenges we faced during this exercise was identifying private/public/fraudulent addresses from the vast pool of all addresses. To solve this, we used the Simpson diversity index to calculate a diversity score, on the basis of which we grouped addresses.
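The Simpson diversity index itself is simple to compute: it is one minus the probability that two items drawn at random from a group belong to the same category. The sketch below applies it to the customer accounts ordering to one address cluster; that grouping signal is only an illustrative assumption, not a description of the exact features we use.

```python
from collections import Counter

def simpson_diversity(labels):
    """Simpson diversity index: 1 - sum(p_i^2), where p_i is the share of category i.
    0 means every item belongs to one category; values near 1 mean many distinct ones."""
    counts = Counter(labels)
    total = sum(counts.values())
    return 1.0 - sum((n / total) ** 2 for n in counts.values())

# Illustrative only: diversity of customer accounts ordering to one address cluster.
# A low score suggests a private address; a high score suggests a shared/public place.
print(simpson_diversity(["user_1", "user_1", "user_1"]))             # 0.0
print(simpson_diversity(["user_1", "user_2", "user_3", "user_4"]))   # 0.75
```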
That’s all folks. Stay tuned for more explorative blogs on how we improve our tech offerings for all of you, one technical tweak at a time.
This is a Machine Learning article sharing how our Data Science team tackles the problem of unique address identification. If you are interested in taking on similar, seemingly impossible problems, connect with Manav Gupta on LinkedIn. We’re always looking for Data Scientists (aka Chief Statistics Officers) at Zomato.
This blog was written by Prathyusha Ratan.
-x-
Sources –
1. Attention Is All You Need – Vaswani et al., 2017
2. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks – Reimers & Gurevych, 2019
-x-
All content provided in this blog is for informational and educational purposes only. It is not professional advice and should not be treated as such. The writer of this blog makes no representations as to the accuracy or completeness of any content or information contained here or found by following any link on this blog.
All images are designed in-house.