At Zomato, our Data Platform team collaborated closely with the Ads team to overhaul our real-time data pipelines. The challenge was clear yet complex: managing large-state pipelines, some exceeding 150 GB, while ensuring reliability and accuracy in our critical Ads feedback system. To address this, we introduced reconciliation mechanisms and transitioned to Flink SQL, significantly enhancing the resilience of our data flows. The results speak for themselves: a 99% reduction in state size, monthly infrastructure costs cut by over $3,000, and the elimination of downtime. This case study outlines our transformation journey and highlights how a robust feedback loop became key to maintaining data accuracy, performance, and business trust.
Real-time feedback is crucial for driving business performance. For many real-time systems, processing vast amounts of data quickly, accurately, and without failure is essential. However, large state pipelines come with their own set of challenges.
Challenges with Large State Pipelines:
At Zomato, our legacy Flink system maintained extensive state data for deduplication, tracking every user interaction, including clicks and impressions. The system stored these interactions for extended periods, resulting in massive states, sometimes exceeding 150 GB. This extensive state storage frequently caused stability issues, data loss, and inaccuracies. To address these challenges, we implemented reconciliation mechanisms.
In real-time systems, reconciliation acts as a safety net, making it safer to operate pipelines with large state. It allows data that might have been missed due to pipeline failures or job restarts to be reprocessed or verified. The reconciliation process typically runs in the background at regular intervals, ensuring final counts are accurate and correcting any discrepancies.
When applied to our real-time system, reconciliation ensured that even if our Flink job encountered issues, any gaps were filled, preventing over-delivery or missed events. This mechanism became critical for data accuracy, directly impacting financial outcomes for our restaurant partners and the business. The dual verification layer provided robustness, ensuring no single point of failure could lead to data loss.
How Ads Work at Zomato
Zomato’s Ads ecosystem is essential to our business, enabling restaurant partners to attract more orders and acquire new customers. Our platform offers various types of Ads designed to drive customer engagement and conversions.
Purpose of Ads:
Restaurant partners run Ads to increase their visibility on Zomato, reaching relevant audiences that are likely to convert and ultimately boosting order volumes.
Variations in Ads:
We provide multiple Ad products tailored to restaurant partners’ specific business objectives, including performance-driven and awareness-driven Ads, and targeted Ads based on customer transaction frequency, dish keyword or cuisine preferences, and customer spending potential.
Key metrics for measuring Ad performance include impressions, clicks, and the orders and conversions they drive.
At Zomato, our advertising efforts are deeply technology-driven. Our Ads engine leverages machine learning and artificial intelligence models using millions of data points to determine “Where, When, and Whom” to show specific Ads. Thus, data engineering is integral to the Ads ecosystem.
Overview of the old system
The previous real-time feedback loop at Zomato relied on Flink version 1.8 and was implemented in Java. The system performed deduplication of clicks or impressions for a user with respect to a campaign based on specific strategies using Flink’s Managed State. After deduplication, the data was aggregated at the campaign level and sent to the Ads Billing Engine every minute.
The Flink job output the updated daily total for each campaign. When a new click or impression was recorded for a campaign, the system fetched the day’s existing count for that campaign from state, incremented it by the new event count, and emitted the updated total. A separate piece of state held the user IDs required for deduplication; this user ID state was the major contributor to the total state size in Flink.
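To make the mechanics concrete, here is a minimal sketch of what such an operator might have looked like in Flink’s Java API. This is illustrative rather than the production code: the class name, AdEvent, and CampaignCount are hypothetical, and the stream is assumed to be keyed by campaign ID.

```java
import org.apache.flink.api.common.state.MapState;
import org.apache.flink.api.common.state.MapStateDescriptor;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

// Hypothetical event and output types, defined only to keep the sketch complete.
class AdEvent {
    String userId;
    String campaignId;
}

class CampaignCount {
    String campaignId;
    long count;
    CampaignCount(String campaignId, long count) {
        this.campaignId = campaignId;
        this.count = count;
    }
}

// Keyed by campaign ID. Illustrative sketch, not the production code.
public class DedupAndCountFunction
        extends KeyedProcessFunction<String, AdEvent, CampaignCount> {

    // User IDs already counted for this campaign. Because late events could
    // arrive hours after the fact, entries had to be retained for a long time;
    // this state was the dominant share of the 150+ GB total.
    private transient MapState<String, Boolean> seenUsers;

    // Running deduped count for this campaign for the day.
    private transient ValueState<Long> dailyCount;

    @Override
    public void open(Configuration parameters) {
        seenUsers = getRuntimeContext().getMapState(
                new MapStateDescriptor<>("seenUsers", String.class, Boolean.class));
        dailyCount = getRuntimeContext().getState(
                new ValueStateDescriptor<>("dailyCount", Long.class));
    }

    @Override
    public void processElement(AdEvent event, Context ctx, Collector<CampaignCount> out)
            throws Exception {
        // Drop repeat interactions from a user already counted for this campaign.
        if (seenUsers.contains(event.userId)) {
            return;
        }
        seenUsers.put(event.userId, true);

        Long current = dailyCount.value();
        long updated = (current == null ? 0L : current) + 1;
        dailyCount.update(updated);

        // Downstream, these per-campaign totals were aggregated and flushed
        // to the Ads Billing Engine every minute.
        out.collect(new CampaignCount(event.campaignId, updated));
    }
}
```

Because seenUsers had to survive long enough to catch late events, it could not be cleared aggressively, which is exactly what pushed the state past 150 GB.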
Below is a diagram representing the old feedback loop architecture:
The system also managed late events. Late events occur when a user action, such as a click, is delayed in being reported. For instance, if a user clicks an Ad but then kills the app before the click event is sent, the event is recorded with the original timestamp but may not be sent to the server immediately. Any pending events are batched and sent during the next sync, which can happen via scheduled background triggers when the app moves to the background. The event time reflects when the user actually clicked the Ad, while the ingestion time is when the Flink job processes the event. The system processed late events that arrived within a fixed number of hours of the original event time, which covered most of them.
The old Flink Java job faced several critical issues that hindered its performance and reliability:
State Loss:
State loss was a significant issue that led to several downstream problems, including missed events, over-delivery of Ads, and inaccurate counts that directly affected billing for our restaurant partners.
State Size:
The excessive state size was the primary contributor to these frequent state losses: user IDs retained for long periods to deduplicate late events pushed the total state beyond 150 GB, making the job increasingly unstable.
Overview of the new system
The new feedback loop was designed with reliability and scalability in mind. By leveraging a new Flink job built with Flink SQL and an automated reconciliation system, we were able to address the shortcomings of the old system and implement significant improvements.
These improvements focused on three areas: minimizing the impact of Flink job downtime and state loss, reducing the state size itself, and the additional benefits that came with migrating to Flink SQL and upgrading the Flink version.
New System Architecture
Below is a diagram representing the new system architecture:
Flink SQL Job Implementation
In our Flink SQL job, the core functionality revolves around processing Ad events (clicks or impressions) in real time; the pseudo code below illustrates this implementation.
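A minimal sketch of such a query, assuming a Kafka-backed source; all table and field names, connector properties, the watermark interval, and the one-minute window size are illustrative rather than production values:

```sql
-- Raw Ad interaction events. Names and connector settings are illustrative.
CREATE TABLE ad_events (
  user_id     STRING,
  campaign_id STRING,
  event_type  STRING,        -- 'click' or 'impression'
  event_time  TIMESTAMP(3),  -- set on the device when the user interacted
  -- Event-time attribute; tolerate out-of-order events up to the watermark
  -- interval (the 5-minute value here is a placeholder).
  WATERMARK FOR event_time AS event_time - INTERVAL '5' MINUTE
) WITH (
  'connector' = 'kafka',
  'topic' = 'ad-events',
  'properties.bootstrap.servers' = 'localhost:9092',
  'scan.startup.mode' = 'latest-offset',
  'format' = 'json'
);

-- Stand-in for the billing-engine sink.
CREATE TABLE campaign_minute_counts (
  campaign_id  STRING,
  event_type   STRING,
  window_start TIMESTAMP(3),
  dedup_count  BIGINT
) WITH ('connector' = 'print');

-- One-minute tumbling windows on event time. COUNT(DISTINCT user_id)
-- deduplicates repeat interactions from the same user within the window;
-- that dedup state is dropped once the watermark passes the window end.
INSERT INTO campaign_minute_counts
SELECT
  campaign_id,
  event_type,
  window_start,
  COUNT(DISTINCT user_id) AS dedup_count
FROM TABLE(
  TUMBLE(TABLE ad_events, DESCRIPTOR(event_time), INTERVAL '1' MINUTE))
GROUP BY campaign_id, event_type, window_start, window_end;
```

In this sketch, the deduplication state lives only for the duration of a window plus the watermark buffer, rather than accumulating user IDs all day; anything arriving later than the watermark is left to the recon job.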
Let’s break down the key components of this query:
The Flink job’s stream time characteristic is set to event time; Flink also allows other time characteristics, such as processing time. In Flink SQL, both notions are declared on the source table, as sketched below.
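For illustration, here is a stand-alone example with a datagen connector (not part of the production job) showing both declarations side by side:

```sql
-- Illustrative only: a table exposing both notions of time.
CREATE TABLE example_events (
  id         STRING,
  event_time TIMESTAMP(3),
  proc_time  AS PROCTIME(),  -- processing time: when Flink handles the row
  -- Declaring a watermark makes event_time the event-time attribute.
  WATERMARK FOR event_time AS event_time - INTERVAL '5' MINUTE
) WITH ('connector' = 'datagen');
```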
Key Concepts Used:
Event time as the stream time characteristic, watermarks to bound how long the job waits for out-of-order events, one-minute tumbling windows for aggregation, and per-user deduplication within each window.
Watermark Illustration:
(Diagram: event timestamps plotted against the advancing watermark. Windows close once the watermark passes the window end; events that arrive beyond the watermark interval are excluded from their window and left to the recon job.)
The choice of watermark interval was based on careful analysis: it strikes a balance between tolerating slight delays in event ingestion and maintaining timely processing. Events arriving after this buffer are rare and are handled by our recon job.
Recon Job Implementation
To keep state small, the recon job processes late events separately at regular fixed intervals, eliminating the need for prolonged storage of user IDs in the streaming job.
The recon job updates our billing engine with accurate counts, overwriting a count only when the new value is greater, and includes a rate limiter to control resource use. Because it sends total counts rather than just the late events, it also mitigates the impact of Flink job downtime, limiting any over-delivery to at most one recon interval.
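A sketch of what one recon pass might look like, assuming the raw events are also persisted to a batch-queryable store; ad_events_archive and the column names are hypothetical:

```sql
-- Batch recon pass, run at a fixed interval: recompute the exact deduped
-- daily total per campaign, so late events that missed their streaming
-- window are still counted.
SELECT
  campaign_id,
  event_type,
  COUNT(DISTINCT user_id) AS total_count  -- full-day dedup, late events included
FROM ad_events_archive
WHERE CAST(event_time AS DATE) = CURRENT_DATE
GROUP BY campaign_id, event_type;
-- The billing engine accepts a recon total only if it exceeds the count it
-- already holds, and writes are rate-limited, so reruns are safe and a recon
-- pass can never lower a correct streaming count.
```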
Migration Process: The migration to the new system was executed in phases, starting with a shadow mode deployment that ran alongside the existing system. This allowed us to conduct extensive data validation and address any issues, such as a timezone mismatch. Once validated, the system was gradually scaled and fully transitioned into production, ensuring a smooth and reliable migration with minimal disruption.
The migration to the new feedback loop has yielded significant improvements: a 99% reduction in state size, monthly infrastructure savings of over $3,000, and the elimination of downtime.
Through this migration, we’ve successfully upgraded to a robust and scalable feedback system using Flink SQL, reaffirming our commitment to reliable and efficient technological solutions. Thank you for exploring our journey.
This blog was written by Mayank Aggarwal and Anshul Garg.