Zomato Engineering | April 22, 2025 | 9 min read
Eliminating Bottlenecks in Real-Time Data Streaming: A Zomato Ads Flink Journey

Project Overview

At Zomato, our Data Platform team closely collaborated with the Ads team to overhaul our real-time data pipelines significantly. The challenge was clear yet complex: managing large state pipelines—some exceeding 150 GB—while ensuring reliability and accuracy in our critical Ads feedback system. To address this, we introduced reconciliation mechanisms and transitioned to Flink SQL, significantly enhancing our data flows’ resilience. The results speak for themselves: we achieved a 99% reduction in state size, reduced monthly infrastructure costs by over $3,000, and eliminated downtime. This case study outlines our transformation journey and highlights how integrating a robust feedback loop became key to maintaining data accuracy, performance, and business trust.

The Challenge: Managing Real-Time Pipelines with Large States

Real-time feedback is crucial for driving business performance. For many real-time systems, processing vast amounts of data quickly, accurately, and without failure is essential. However, large state pipelines come with their own set of challenges.

Challenges with Large State Pipelines:

  • State Size & Performance Degradation: Large states, often exceeding several gigabytes due to deduplication logic for user interactions, negatively impacted performance. They caused slower recovery times and increased the risk of state loss.
  • Checkpointing Challenges: Larger states resulted in bigger checkpoint files, complicating restoration efforts and occasionally causing restart failures and significant downtime.
  • Increased Memory Usage: Large state sizes demanded more memory, leading to spikes during peak periods. These spikes frequently overwhelmed the system, causing instability, especially during jobs processing high volumes of user impressions.

At Zomato, our legacy Flink system maintained extensive state data for deduplication, tracking every user interaction, including clicks and impressions. The system stored these interactions for extended periods, resulting in massive states, sometimes exceeding 150 GB. This extensive state storage frequently caused stability issues, data loss, and inaccuracies. To address these challenges, we implemented reconciliation mechanisms.

Why Reconciliation Matters

In real-time systems, reconciliation acts as a safety net that makes managing large states safer. It allows data that might have been missed due to pipeline failures or job restarts to be reprocessed or verified. The reconciliation process typically runs in the background, periodically checking that final counts are accurate and correcting any discrepancies.

When applied to our real-time system, reconciliation ensured that even if our Flink job encountered issues, it filled gaps, preventing over-delivery or missed events. This mechanism became critical for data accuracy, directly impacting financial outcomes for our restaurant partners and the business. This dual verification layer provided robustness, ensuring no single point of failure could lead to data loss.

Case Study: Migration of Ads System to a Robust Real-time Feedback Loop

How Ads Work at Zomato

Zomato’s Ads ecosystem is essential to our business, enabling restaurant partners to attract more orders and acquire new customers. Our platform offers various types of Ads designed to drive customer engagement and conversions.

Purpose of Ads:

Restaurant partners run Ads to increase their visibility on Zomato, attracting customers by showing Ads to relevant audiences likely to convert, ultimately boosting order volumes.

Variations in Ads:

We provide multiple Ad products tailored to restaurant partners’ specific business objectives, including performance-driven and awareness-driven Ads, and targeted Ads based on customer transaction frequency, dish keyword or cuisine preferences, and customer spending potential.

Key metrics for measuring ad performance include:

  • Impressions: Number of times an ad is displayed to users.
  • Clicks: Number of times an ad is clicked by users.
  • Conversions: User actions after clicking an Ad, such as placing an order.
  • ROI: Return on Ad spend provided to advertising restaurants.

Tech-driven advertising

At Zomato, our advertising efforts are deeply technology-driven. Our Ads engine leverages machine learning and artificial intelligence models using millions of data points to determine “Where, When, and Whom” to show specific Ads. Thus, data engineering is integral to the Ads ecosystem.

Challenges with the old system and the need for migration

Overview of the old system

The previous real-time feedback loop at Zomato relied on Flink version 1.8 and was implemented in Java. The system performed deduplication of clicks or impressions for a user with respect to a campaign based on specific strategies using Flink’s Managed State. After deduplication, the data was aggregated at the campaign level and sent to the Ads Billing Engine every minute.

The Flink job emitted the total updated daily count for each campaign. When a new click or impression was recorded for a campaign, the system would fetch that campaign’s existing count for the day from state, increment it by the new event count, and emit the updated total. Additionally, a separate state was maintained to store the user IDs required for deduplication. This user ID state was the major contributor to the total state size in Flink.
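
To make this concrete, here is a minimal sketch of how such a counting-and-deduplication operator could look with Flink’s Managed State. This is an illustration only, not the actual implementation: the AdEvent and CampaignCount types and all field names are hypothetical.

```java
import org.apache.flink.api.common.state.MapState;
import org.apache.flink.api.common.state.MapStateDescriptor;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

// Hypothetical event and output types, for illustration only.
record AdEvent(String campaignId, String userId, String eventType) {}
record CampaignCount(String campaignId, String eventType, long totalDailyCount) {}

// Keyed by campaignId: keeps the running daily total plus the set of users already
// counted for deduplication. The user-ID state is what made the production state so large.
public class CampaignCountFunction
        extends KeyedProcessFunction<String, AdEvent, CampaignCount> {

    private transient ValueState<Long> dailyCount;          // running total for the day
    private transient MapState<String, Boolean> seenUsers;  // user IDs used for dedupe

    @Override
    public void open(Configuration parameters) {
        dailyCount = getRuntimeContext().getState(
                new ValueStateDescriptor<>("dailyCount", Long.class));
        seenUsers = getRuntimeContext().getMapState(
                new MapStateDescriptor<>("seenUsers", String.class, Boolean.class));
    }

    @Override
    public void processElement(AdEvent event, Context ctx, Collector<CampaignCount> out)
            throws Exception {
        // Deduplicate: each user contributes at most once per campaign.
        if (seenUsers.contains(event.userId())) {
            return;
        }
        seenUsers.put(event.userId(), true);

        // Emit the updated *total* daily count. If the state is lost, this restarts
        // from zero, which is exactly the counter-reset failure mode described later.
        long updated = (dailyCount.value() == null ? 0L : dailyCount.value()) + 1;
        dailyCount.update(updated);
        out.collect(new CampaignCount(ctx.getCurrentKey(), event.eventType(), updated));
    }
}
```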

Below is a diagram representing the old feedback loop architecture:

The system also managed late events. Late events occur when a user action, like a click, is delayed in being reported. For instance, if a user clicks an Ad but then kills the app before the click event is sent, the event is recorded with the original timestamp but may not immediately be sent to the server. Any pending events are batched and sent during the next sync, which can happen via scheduled background triggers if the app is moved to the background. The event time reflects when the user actually clicked the Ad, while the ingestion time is when the event is processed by the Flink job. The system allowed these late events to be processed if they arrived within a fixed number of hours of the original event time, which covered most late events.

Issues and limitations

The old Flink Java job faced several critical issues that hindered its performance and reliability:

State Loss:

State loss was a significant issue that led to several downstream problems, including:

  • Resetting of Flink Counters: The system relied on returning the total daily updated count for each campaign. However, when state loss occurred, the Flink counters would reset to zero, effectively erasing the accumulated counts for all campaigns. This not only affected the current day’s data but also disrupted the integrity of historical data.
  • Incorrect Deduplication: State loss resulted in incorrect deduplication of clicks and impressions. This caused inaccuracies in attributing these events, producing unreliable data that had to be corrected manually.
  • Over-Delivery of Ads: Due to these inaccuracies, the system sometimes over-delivered clicks and impressions, which also led to incorrect billing and reporting.

State Size:

The excessive state size was a primary contributor to the frequent state losses:

  • Massive State Sizes: In the impressions tracking job, for example, the size of the state could reach up to 150 GB. This excessive size made the system prone to state loss, particularly during job restarts or upgrades.
  • Large State Due to Late Events: The main reason for the bloated state was the need to store user IDs for deduplication over extended periods. This was necessary to handle late events, but it resulted in maintaining a large state for much longer than ideal, putting additional strain on the system.

New feedback loop

Overview of the new system

The new feedback loop was designed with reliability and scalability in mind. By leveraging a new Flink job built on Flink SQL, together with an automated reconciliation system, we were able to address the shortcomings of the old system and implement significant improvements.

Minimizing the impact of Flink job downtime and state loss

  • The new system mitigates these issues by shifting from total daily counts to publishing incremental counts. This change ensures that if the state is lost, new clicks or impressions won’t affect previous data.
  • Additionally, we implemented a recon job that runs at a fixed time interval, sending total counts for each campaign per hour (excluding the current hour). This minimizes the impact of any Flink job downtime by acting as a fail-safe feedback loop.

Addressing State Size Reduction

  • By incorporating late events into the reconciliation job, we significantly reduced the state’s Time-to-Live (TTL) for deduplication logic from 24 hours to just 2 hours. This change greatly reduced state size, enhancing overall system performance and stability.

Additional Benefits (Migration to Flink SQL and Version Upgrade)

  • Upgrading from Flink version 1.8 to 1.17 resolved several critical issues. The outdated version missed key performance improvements and bug fixes, leading to stability and maintenance challenges.
  • Switching to Flink SQL further improved state management by handling it more efficiently internally. This change not only enhanced performance but also simplified the code, making it more maintainable.

New System Architecture

Below is a diagram representing the new system architecture:

  • Starting Point 1: Real-time Attribution Flow
    This flow starts from the app, where user interactions like clicks and impressions are sent through the Events Gateway to Kafka. The new Flink SQL job processes these events and sends the incremental counts to Redis in the Ads billing engine.
  • Starting Point 2: Recon ETL Job Flow
    This flow starts with fetching raw events from the Data Lake on S3. The recon ETL job runs at a fixed time interval, processes the data, and updates Redis in the Ads billing engine with the processed counts.

Detailed Implementation

Flink SQL Job Implementation

In our Flink SQL job, the core functionality revolves around processing ad events (clicks or impressions) in real-time. Below is the pseudo code that illustrates this implementation:
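
(The query below is a reconstructed sketch rather than the production code: the sink table, connector options, and the incrementalCount output column are illustrative.)

```sql
-- Sketch of the Flink SQL job: per-minute, deduplicated incremental counts per campaign.

-- Source of raw click/impression events, with a watermark on event time
-- (60 seconds in this illustration).
CREATE TABLE RawAdsEvents (
    aggregationId STRING,   -- campaign-level aggregation key
    eventType     STRING,   -- e.g. 'click' or 'impression'
    dedupeKey     STRING,   -- key consumed by the deduplication UDF
    eventTime     TIMESTAMP(3),
    WATERMARK FOR eventTime AS eventTime - INTERVAL '60' SECOND
) WITH (
    'connector' = 'kafka'   -- remaining connector options omitted
);

-- Aggregate each 60-second tumbling window and emit incremental counts downstream.
INSERT INTO AdsBillingSink
SELECT
    aggregationId,
    eventType,
    adsDedupeCount(dedupeKey) AS incrementalCount,  -- custom aggregate UDF (see below)
    window_end                AS emitTime           -- when the aggregated count is published
FROM TABLE(
    TUMBLE(TABLE RawAdsEvents, DESCRIPTOR(eventTime), INTERVAL '60' SECOND))
GROUP BY window_start, window_end, aggregationId, eventType;
```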

Let’s break down the key components of this query:

  • RawAdsEvents: This table contains the raw events generated by user interactions with Ads.
  • aggregationId: This is the identifier used for aggregating the count per minute. In this context, you can consider it equivalent to campaignID.
  • eventType: This represents the type of action, such as a click or an impression.
  • adsDedupeCount: This is our custom User-Defined Function (UDF) for deduplication, which is essential for ensuring accurate counting. The UDF takes in parameters like dedupeKey and other necessary deduplication logic parameters.
  • emitTime: This parameter gives the time at which the events were processed and aggregated counts were published by the Flink job.

The Flink job’s stream time characteristic is set to event time; Flink also allows other time characteristics, such as processing time.

Key Concepts Used:

  • TUMBLE Window: TUMBLE is a type of windowing table-valued function (TVF) in Flink. It groups events into fixed, non-overlapping intervals; in the example used here, the tumble window is 60 seconds, meaning all events occurring within a given 60-second window are grouped and processed together.
  • User-Defined Functions (UDFs): UDFs allow us to execute custom logic within SQL queries. In our case, we implemented an aggregate UDF named adsDedupeCount to handle deduplication (a minimal sketch follows after this list). This UDF is written in Java and uses a local in-memory cache built on Guava, a powerful, flexible caching library from Google that helps manage and store temporary data efficiently.
  • Watermarking: In a real-time processing system, events can arrive late, which requires careful handling to ensure accuracy. We use a watermark with a fixed interval to handle late events; for the purpose of this illustration, consider the interval to be 60 seconds. The watermark ensures that the Flink job waits an additional 60 seconds after a window ends before finalizing the processing of events within that window.
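
For illustration, here is a minimal sketch of what such a deduplicating aggregate UDF could look like. It is not the production implementation: the class name, cache bounds, and the 2-hour expiry (mirroring the reduced deduplication TTL discussed earlier) are assumptions.

```java
import com.google.common.cache.Cache;
import com.google.common.cache.CacheBuilder;
import org.apache.flink.table.functions.AggregateFunction;
import org.apache.flink.table.functions.FunctionContext;

import java.time.Duration;

// Aggregate UDF sketch: counts an event only if its dedupe key has not been seen recently.
public class AdsDedupeCount extends AggregateFunction<Long, AdsDedupeCount.CountAccumulator> {

    public static class CountAccumulator {
        public long count = 0L;
    }

    // Local in-memory Guava cache of recently seen dedupe keys; entries expire after
    // 2 hours, matching the reduced deduplication window (the exact TTL is an assumption).
    private transient Cache<String, Boolean> seenKeys;

    @Override
    public void open(FunctionContext context) {
        seenKeys = CacheBuilder.newBuilder()
                .expireAfterWrite(Duration.ofHours(2))
                .maximumSize(10_000_000)
                .build();
    }

    @Override
    public CountAccumulator createAccumulator() {
        return new CountAccumulator();
    }

    // Called once per row in the window; increments the count only for unseen keys.
    public void accumulate(CountAccumulator acc, String dedupeKey) {
        if (seenKeys.getIfPresent(dedupeKey) == null) {
            seenKeys.put(dedupeKey, Boolean.TRUE);
            acc.count += 1;
        }
    }

    @Override
    public Long getValue(CountAccumulator acc) {
        return acc.count;
    }
}
```

Such a function would typically be registered with the table environment (for example via createTemporarySystemFunction) under the name adsDedupeCount so the SQL query above can call it.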

Watermark Illustration:

Components of the plot:

  • Event time: It is when the user interaction (e.g., a click) actually occurs.
  • Ingestion time: It is when that event is processed and published by the Flink job.
  • Window size: In this example, the window size is 60 seconds.

Explanation of the illustration:

  • Event 1: Occurs at 00:12 and is ingested at 00:48. This event is processed normally because it falls within the same window (00:00 to 01:00).
  • Event 2: Occurs at 00:36 but is ingested at 01:30, after the window has technically ended. Thanks to the 60-second watermark, this event is still processed within the 00:00 to 01:00 window.
  • Event 3: Occurs at 00:24 but is ingested at 02:15, well after the watermark period. This event is discarded to avoid excessive delays in processing, ensuring the system remains responsive.

The watermark interval was chosen after careful analysis: it strikes a balance between allowing for slight delays in event ingestion and maintaining timely processing. Events arriving after this buffer are rare and are handled by our recon job.

Recon Job Implementation

To reduce state size, the recon job processes late events separately at regular fixed intervals, eliminating the need for prolonged storage of user IDs. This approach greatly reduces state size.

The recon job updates our billing engine with accurate counts, overwriting a stored value only when the recomputed count is greater, and it includes a rate limiter to control resource use. Because it sends total counts rather than just late events, it also mitigates the impact of any Flink job downtime, limiting over-delivery to at most the job’s fixed run interval.
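
As a rough, dialect-agnostic sketch (not the production ETL code), the recon recompute can be thought of as a query like the following over raw events in the data lake; the table, column, and date-function names are placeholders:

```sql
-- Pseudo-SQL sketch: recompute deduplicated hourly totals per campaign from raw events,
-- excluding the still-open current hour. The resulting totals are written to the billing
-- store only when they exceed the value already present there, behind a rate limiter.
SELECT
    aggregationId,
    eventType,
    DATE_TRUNC('hour', eventTime) AS eventHour,
    COUNT(DISTINCT dedupeKey)     AS totalCount
FROM raw_ads_events
WHERE DATE_TRUNC('hour', eventTime) < DATE_TRUNC('hour', CURRENT_TIMESTAMP)
GROUP BY aggregationId, eventType, DATE_TRUNC('hour', eventTime);
```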

Migration Process: The migration to the new system was executed in phases, starting with a shadow mode deployment that ran alongside the existing system. This allowed us to conduct extensive data validation and address any issues, such as a timezone mismatch. Once validated, the system was gradually scaled and fully transitioned into production, ensuring a smooth and reliable migration with minimal disruption.

Post-Migration Benefits

The migration to the new feedback loop has yielded significant improvements:

  • More than 99% Reduction in State Size: By optimizing state management and handling late events separately, we achieved a reduction in state size from 150 GB to 500 MB, leading to more efficient processing and system stability.
  • $3,000 Monthly Cost Savings: These optimizations have also resulted in substantial cost savings, reducing our infrastructure expenses by approximately $3,000 per month.
  • Zero Downtimes to Date: Since the migration, the system has operated with zero downtimes, ensuring uninterrupted ad performance tracking and reliability.
  • Upgrade from Flink Java (v1.8) to Flink SQL (v1.17): The transition from Java-based processing to Flink SQL with version 1.17 brought critical performance enhancements and improved maintainability.
  • Seamless Deployments and Flexibility: The new system has unlocked the ability to experiment with new business logic efficiently, something that was previously difficult and time-consuming. This flexibility fosters faster iteration and innovation.

Through this migration, we’ve successfully upgraded to a robust and scalable feedback system using Flink SQL, reaffirming our commitment to reliable and efficient technological solutions. Thank you for exploring our journey.

This blog was written by Mayank Aggarwal and Anshul Garg.
