Apache Flink Journey @Zomato: From Inception to Innovation

Data Platform Team | November 18, 2024 | 10 min read

Introduction: Embracing Real-Time Data Streaming for Operational Excellence

Our journey into real-time data processing began with a clear objective: to align our data capabilities with the operational demands of a fast-paced, dynamic business. Traditional batch systems, and even micro-batch frameworks like Spark Streaming, often fall short when immediate, real-time insights are needed. Delays in data processing can hinder critical functions such as monitoring restaurant performance, tracking live orders, and identifying fraudulent activity, where timely action is essential.

We saw significant potential in real-time streaming technology, which led us to adopt Apache Flink. Flink’s powerful stream-processing framework, combined with Apache Kafka for event streaming, enabled us to build systems that were not only more responsive but also more resilient, ultimately improving both operational efficiency and customer satisfaction.

The Initial Approach: Running Flink on AWS EMR

In 2019, we began our Flink journey by deploying Flink jobs written in Java on AWS EMR, using Flink version 1.8. The decision to start on EMR was influenced by its ease of setup, managed Hadoop infrastructure, and seamless integration with AWS services like S3. EMR allowed us to quickly launch big data applications without the need to manage complex underlying infrastructure.

However, as our usage expanded, we encountered limitations with this approach, particularly in terms of scalability, resource management, and cost efficiency. These growing pains made it clear that we needed to explore alternatives that could better support our evolving needs.

Democratizing Stream Processing: The Introduction of Flink SQL

As the demand for real-time data processing grew across the organization, we introduced Flink SQL to make stream processing more accessible, particularly to non-developers. Many data analysts and business users were more comfortable with SQL than Java, so this was a critical step in democratizing the technology.

Flink SQL allowed users to write SQL queries for stream processing, without needing to delve into the complexities of Flink’s underlying architecture. We wrapped the Flink SQL deployments in CI/CD pipelines, enabling users to submit SQL queries and automatically set up real-time data pipelines with minimal friction. This approach significantly reduced the reliance on our platform team, empowering teams across the organization to build and manage their own data processing pipelines.
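To give a flavour of the self-serve flow, here is a minimal sketch of the kind of job definition a user might submit through the pipeline; the field names, table names, and SQL schema are purely illustrative and not our exact specification:

```yaml
# Illustrative self-serve Flink SQL job spec (field names are hypothetical)
job:
  name: restaurant-order-rate          # unique job identifier
  type: sql                            # sql or java
  parallelism: 2
  sql: |
    -- count orders per restaurant over 1-minute tumbling windows
    INSERT INTO order_rate_per_minute
    SELECT
      res_id,
      TUMBLE_START(event_time, INTERVAL '1' MINUTE) AS window_start,
      COUNT(*) AS orders
    FROM orders
    GROUP BY res_id, TUMBLE(event_time, INTERVAL '1' MINUTE);
```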

Expanding Adoption: Scaling Flink Across Zomato

Between 2019 and 2023, Flink’s adoption at Zomato grew significantly, with various teams migrating critical use cases to Flink, including:

  • Metrics and monitoring systems: Real-time collection and monitoring of system health metrics, enhanced by alerting mechanisms for timely interventions.
  • Event ingestion pipelines: High-throughput ingestion of diverse events for downstream processing and analytics.
  • Business-critical applications: Use cases like restaurant stress monitoring, ads delivery, and abandoned cart recovery, where real-time data directly drives decisions in production systems.
  • Ad-hoc use cases: Numerous departments adopted Flink for various real-time data needs, highlighting its versatility and robustness.

Here’s what our setup looked like:

Overcoming Challenges: The Shift to a Self-Serve Platform

Despite the initial success of Flink on EMR, we encountered several challenges that revealed the limitations of the EMR-based approach:

  • Limited expertise: Only a few team members had the technical expertise to write and manage Flink jobs, creating bottlenecks as more teams sought to adopt real-time streaming.
  • Resource allocation: Balancing CPU, memory, and parallelism for each job proved difficult, often requiring manual job re-deployment for optimal performance.
  • Job debugging: Investigating and resolving job failures required deep knowledge of Flink’s internal mechanisms, leading to inefficiencies.
  • High costs: EMR’s pricing model, combined with inefficient resource allocation, led to growing infrastructure costs as cluster sizes increased.
  • Job deployment and recovery: Job restarts and recoveries had to be handled manually, which was time-consuming and prone to human error.
  • Community support: The community and ecosystem around Flink on Kubernetes was growing much faster than that around Flink on EMR, making Kubernetes a more attractive option for future growth.

    Moreover, Flink updates on EMR lagged behind and could not keep pace with the far more active Apache Flink community.

These challenges motivated us to explore more scalable, efficient, and cost-effective alternatives, ultimately leading to our migration to Kubernetes.

Transitioning to Flink Kubernetes Operator: A Scalable and Cost-Efficient Solution

The above challenges highlighted the need for a more scalable, efficient, and user-friendly platform. This led us to explore Flink on Kubernetes with the Flink Kubernetes Operator, which benefited us in the following ways:

  • Scalability and Flexibility: It provided us with unparalleled scalability and flexibility. Unlike EMR, Kubernetes allowed us to decouple compute and memory, offering fine-grained control over resource allocation. This decoupling helped us address the issue of over/under-provisioning of resources, which was a significant pain point on EMR.
  • Cost Efficiency: Extensive development of the auto-scaling capabilities in the Flink Kubernetes Operator has helped us make effective use of spot instances, and the operator's ability to reassign tasks seamlessly when spot instances are terminated further contributed to cost savings (see the configuration sketch after this list).
  • Enhanced Community Support: The community support for Flink on Kubernetes was far superior to that for Flink on EMR. This active community provided us with better tools, documentation, and troubleshooting resources, making it easier to adopt new features and get support when needed.
  • Simplified Resource Management: Kubernetes’ native support for managing resources (CPU, memory) at a granular level allowed us to optimize our workloads better. This feature was crucial for running diverse workloads efficiently, ensuring that each job got precisely the resources it needed without waste.
  • Dev, PreProd and Prod Setup: We have set up distinct environments for development, pre-production, and production deployments on the new platform. This separation ensures that changes can be thoroughly tested in a controlled setting before being deployed to production, reducing the risk of introducing bugs or performance issues. Users can tune the configuration independently in each environment, giving them room to experiment with resources and iterate quickly at different stages of the rollout.
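The auto-scaling and spot-friendly settings referenced above are expressed on the FlinkDeployment itself. The fragment below is a rough sketch: the autoscaler keys vary with the operator version, and the values are illustrative rather than our production tuning.

```yaml
# Illustrative FlinkDeployment fragment; autoscaler keys differ across operator versions
spec:
  flinkConfiguration:
    job.autoscaler.enabled: "true"            # let the operator rescale job parallelism
    job.autoscaler.metrics.window: "5m"       # metrics window used for scaling decisions
    job.autoscaler.target.utilization: "0.7"  # aim for ~70% task utilization
    pipeline.max-parallelism: "120"           # upper bound for rescaling
  podTemplate:
    spec:
      nodeSelector:
        karpenter.sh/capacity-type: spot      # prefer spot capacity where the job tolerates restarts
```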

Here’s what our current setup looks like:

Our current setup involves three major repositories that oversee the deployment of Flink jobs:

  • Central Flink Repository: The Central Flink Repository contains the core driver code for both Flink SQL and Java jobs. This repository serves as the backbone of our Flink platform. When changes are made, a CI (Continuous Integration) pipeline is triggered, which builds a Docker image and pushes it to Amazon ECR.
  • Flink Deployments Repository: The Flink Deployments Repository is where users submit their job configurations for both SQL and Java jobs. These configurations include resource allocations, job parameters, and other necessary settings.

    Once a user submits their configuration, a CI pipeline runs to validate and commit these changes to the Kubernetes Deployments repository. This process ensures that each job’s configuration is properly reviewed and validated for deployment.

Sample Flink Deployment Configuration:
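A representative configuration, in the style of the Flink Kubernetes Operator's FlinkDeployment resource, is sketched below; the names, image, and resource values are placeholders rather than our actual settings:

```yaml
# Representative FlinkDeployment; all names, images, and sizes are placeholders
apiVersion: flink.apache.org/v1beta1
kind: FlinkDeployment
metadata:
  name: order-events-enricher
  namespace: flink-jobs
spec:
  image: <account-id>.dkr.ecr.<region>.amazonaws.com/flink-jobs:latest
  flinkVersion: v1_17
  serviceAccount: flink
  flinkConfiguration:
    taskmanager.numberOfTaskSlots: "2"
    state.checkpoints.dir: s3://example-bucket/checkpoints/order-events-enricher
  jobManager:
    resource:
      cpu: 1
      memory: 2048m
  taskManager:
    resource:
      cpu: 2
      memory: 4096m
  job:
    jarURI: local:///opt/flink/usrlib/flink-driver.jar
    parallelism: 4
    upgradeMode: savepoint
    state: running
```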

  • Kubernetes Deployments Repository: The Kubernetes Deployments Repository serves as the central repository for all Kubernetes deployment configurations. It receives validated configuration changes, ensuring that all changes are reviewed and approved before deployment. Once changes are committed, the ArgoCD GitOps pipeline continuously syncs the deployed applications, detects any difference between the desired manifests in Git and the live resources, and applies the changes to the Kubernetes cluster. This centralized approach to deployment management streamlines the process, making it easier to monitor and maintain the deployments, and the automated pipeline ensures that configurations are deployed consistently and reliably, reducing the risk of human error.

Why This Setup?

This setup was chosen to streamline and automate the deployment process, reducing manual effort as much as possible. By separating concerns across three repositories — Central Flink, Flink Deployments, and Kubernetes Deployments — we ensure clear responsibility and modularity.

It allowed us to create separate node provisioners for staging and production jobs simply by committing to the Git repository, while also enforcing role-based access control over the deployments. We use Karpenter as our node provisioning tool, with team-level provisioners that ensure jobs from one team do not starve jobs from other teams of resources.
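As a sketch of what team-level isolation can look like with Karpenter (the API group and fields vary between Karpenter releases, and the team name and limits below are hypothetical):

```yaml
# Hypothetical team-scoped Karpenter NodePool; adjust apiVersion to your Karpenter release
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: flink-team-ads
spec:
  template:
    metadata:
      labels:
        team: ads                      # jobs from this team target nodes via this label
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]
      taints:
        - key: team
          value: ads
          effect: NoSchedule           # keep other teams' pods off these nodes
      nodeClassRef:
        name: default
  limits:
    cpu: "400"                         # cap the team's total compute so it cannot starve others
```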

This structured approach not only simplifies the deployment pipeline but also enhances scalability, security, and maintainability, making it easier to manage and evolve our Flink platform.

Why ArgoCD?

ArgoCD follows a declarative GitOps approach, wherein a Git repository acts as a single source of truth. This approach aligns with our infrastructure-as-code practices, ensuring that our deployments are version-controlled, auditable, and reproducible. Its native integration with Kubernetes made it a perfect fit for our environment. It allowed us to automate the deployment of our Flink jobs on EKS clusters directly from our Git repositories, reducing manual intervention and potential errors.
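A minimal sketch of what one of these ArgoCD Applications could look like, assuming a hypothetical repository layout with one path per environment:

```yaml
# Illustrative ArgoCD Application; repository URL and paths are hypothetical
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: flink-jobs-prod
  namespace: argocd
spec:
  project: data-platform
  source:
    repoURL: https://github.example.com/data-platform/kubernetes-deployments.git
    targetRevision: main
    path: flink/prod                  # validated FlinkDeployment manifests for production
  destination:
    server: https://kubernetes.default.svc
    namespace: flink-jobs
  syncPolicy:
    automated:
      prune: true                     # remove resources deleted from Git
      selfHeal: true                  # revert manual drift back to the Git state
```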

It also provided automated synchronization capabilities, ensuring that the actual state of our clusters always matched the desired state defined in our Git repository while allowing us to implement role-based access control (RBAC), ensuring that only authorized users could make changes to specific parts of the system. This granular access control was crucial for maintaining security and accountability in a diverse environment like ours.

It offered comprehensive monitoring and logging features, providing real-time insights into the state of our deployments. This visibility helped us quickly identify and resolve issues, ensuring smooth running of our Flink jobs.

Scale of Operations

The scale of operations at Zomato is growing by the minute, and the jobs deployed on our Flink platform have comfortably handled these vast loads. To put it in perspective, we have processed a peak ingestion rate of more than 250 million events per minute within our system.

Enhancing Visibility: Exposing the Flink UI with Ease

In the past, access to the Flink Job Manager UI was restricted exclusively to the data platform team due to the need for direct access to the analytics AWS account, creating a bottleneck for stakeholders who needed critical metrics to monitor their Flink jobs.

To address this, we introduced an ingress configuration in our Kubernetes setup, enabling us to expose the Flink Job Manager UI securely via a dedicated domain. The ingress controller handles routing traffic to the correct namespace and pods within the cluster, ensuring seamless access for all our teams.

Sample configuration to add Ingress routing rules:
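The sketch below is representative rather than our exact manifest; the host, namespace, and service names are placeholders, and it assumes the JobManager's REST service is exposed on port 8081:

```yaml
# Illustrative Ingress for a single job's Flink UI; names and host are placeholders
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: order-events-enricher-ui
  namespace: flink-jobs
  annotations:
    nginx.ingress.kubernetes.io/rewrite-target: /$2
spec:
  ingressClassName: nginx
  rules:
    - host: flink.internal.example.com
      http:
        paths:
          - path: /order-events-enricher(/|$)(.*)
            pathType: ImplementationSpecific
            backend:
              service:
                name: order-events-enricher-rest   # JobManager REST service for the deployment
                port:
                  number: 8081
```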

With this change, stakeholders now have direct access to the Flink UI, empowering them to monitor essential metrics like backpressure and flame graphs. This improvement has brought several key benefits:

  • Faster Issue Resolution: Teams no longer need to fiddle with port-forwarding or request cluster access just to monitor their jobs or debug issues.
  • Improved Collaboration: With the UI easily accessible, multiple team members can view job details simultaneously, enabling better collaboration and quicker decision-making.
  • Centralized Access: Hosting the UI on a dedicated domain aligns with our vision of building a cohesive, user-centric platform where all necessary tools and dashboards are a click away.

By prioritizing visibility and ease of access, we’ve made it even simpler for our teams to leverage the power of Flink for real-time data processing. It’s a small change, but one that has had a meaningful impact on productivity and user satisfaction.

Metrics and Alerting

At Zomato, we take our data and its integrity very seriously, as any failure can disrupt the data consumed from multiple sources. This is why we have built a robust metrics and alerting system into our self-serve Flink platform, powered by the Flink Kubernetes Operator. The operator exposes a wealth of useful metrics that provide insight into the health and performance of our Flink jobs. These metrics are periodically scraped by VictoriaMetrics, our go-to solution for monitoring and alerting.

We also run cAdvisor and Node Exporter as DaemonSets on the cluster nodes to scrape detailed Kubernetes metrics, giving us comprehensive visibility into our cluster's performance. cAdvisor collects, aggregates, processes, and exports information about running containers, while Node Exporter provides node-level metrics. This combination ensures we have a holistic view of both our applications and the underlying infrastructure.

Below are the cAdvisor and Node Exporter DaemonSet configurations for enabling scraping in the cluster:
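As a representative example, a Node Exporter DaemonSet along these lines (the cAdvisor DaemonSet follows the same pattern); the namespace, image tag, and scrape annotations are placeholders and assume annotation-based discovery is configured on the VictoriaMetrics side:

```yaml
# Representative Node Exporter DaemonSet; image tag and namespace are placeholders
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-exporter
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: node-exporter
  template:
    metadata:
      labels:
        app: node-exporter
      annotations:
        prometheus.io/scrape: "true"    # picked up if annotation-based scraping is enabled
        prometheus.io/port: "9100"
    spec:
      hostNetwork: true                  # expose node-level network metrics
      hostPID: true
      containers:
        - name: node-exporter
          image: quay.io/prometheus/node-exporter:v1.8.1
          args:
            - --path.procfs=/host/proc
            - --path.sysfs=/host/sys
          ports:
            - containerPort: 9100
              name: metrics
          volumeMounts:
            - name: proc
              mountPath: /host/proc
              readOnly: true
            - name: sys
              mountPath: /host/sys
              readOnly: true
      volumes:
        - name: proc
          hostPath:
            path: /proc
        - name: sys
          hostPath:
            path: /sys
```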

We have set up a central dashboard where all the essential metrics and checks are available right out of the box. This makes it super easy for teams to get an overview of their job’s performance without having to dive deep into the details. For those who like to get their hands dirty, the dashboard is fully customizable. Users can tweak it to display the metrics that matter most to them and set up custom alerts to stay on top of any issues.

Speaking of alerts, we've made sure that no failure goes unnoticed. From basic metrics like consumer group lag (exposed from our Kafka cluster) to checkpoint failures and killed pods, we've covered them all and built alerts that notify the respective job owners so that prompt action can be taken. This gives our teams the confidence that their Flink jobs are running smoothly and that any hiccups will be promptly addressed. It's all part of our commitment to providing a reliable, self-serve Flink platform that lets our teams focus on what they do best: delivering great experiences to our users.
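As a sketch of how such alerts can be expressed, here is an illustrative VictoriaMetrics VMRule; the metric names depend on the exporters in place and the thresholds are examples, not our production values:

```yaml
# Illustrative alerting rules; metric names and thresholds depend on your exporters
apiVersion: operator.victoriametrics.com/v1beta1
kind: VMRule
metadata:
  name: flink-job-alerts
  namespace: monitoring
spec:
  groups:
    - name: flink-jobs
      rules:
        - alert: ConsumerGroupLagHigh
          expr: sum(kafka_consumergroup_lag) by (consumergroup) > 1000000
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "Consumer group {{ $labels.consumergroup }} lag above 1M records"
        - alert: CheckpointFailures
          expr: increase(flink_jobmanager_job_numberOfFailedCheckpoints[15m]) > 0
          labels:
            severity: critical
          annotations:
            summary: "Flink job {{ $labels.job_name }} is failing checkpoints"
```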

We’ve also integrated job-level PagerDuty alerts powered by our in-house alert manager, which maps consumer groups of specific jobs to their respective PagerDuty services. This ensures prompt and precise incident response.

Logging

Logging is a critical component of our self-serve Flink platform at Zomato. To capture and store all necessary logs effectively, we run a sidecar container alongside each Flink JobManager and TaskManager container in every pod. This sidecar is responsible for shipping logs to our in-house logging platform, LogStore.
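Conceptually, the pod template looks something like the sketch below; the log-shipper image is a hypothetical stand-in, since the actual LogStore agent is internal:

```yaml
# Illustrative podTemplate fragment; the logstore-agent image is a hypothetical stand-in
spec:
  podTemplate:
    spec:
      containers:
        - name: flink-main-container          # the JobManager/TaskManager container managed by the operator
          volumeMounts:
            - name: flink-logs
              mountPath: /opt/flink/log
        - name: logstore-shipper              # sidecar that tails log files and ships them to LogStore
          image: internal-registry.example.com/logstore-agent:latest
          volumeMounts:
            - name: flink-logs
              mountPath: /var/log/flink
              readOnly: true
      volumes:
        - name: flink-logs
          emptyDir: {}                        # shared between the Flink container and the sidecar
```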

LogStore provides our teams with an easy-to-use interface to view and analyze logs in real-time. This setup not only helps in monitoring the health and performance of Flink jobs but also significantly aids in debugging and troubleshooting issues. By having immediate access to detailed logs, our teams can quickly pinpoint problems and take corrective actions, ensuring the reliability and smooth operation of our data processing pipelines.

Platform Savings

AWS EMR adds a considerable infrastructure surcharge on top of the basic instance cost, whereas the extra cost for EKS (Elastic Kubernetes Service) is minimal. This alone translated into roughly 25% savings. Moreover, by provisioning compute and memory exactly as required, we have cut resource wastage and saved an additional ~10%.

That’s okay, what next?

  • Version Upgrades: There have been rapid developments in the Flink Kubernetes Operator, with major improvements to the auto-scaling feature that now allow users to set resource ranges instead of static limits. This will let us run deployments more flexibly and efficiently.
  • In-house Flink assistant: We are looking to build a hosted, in-house assistant that gives users a friendly environment for testing and development. Users would interactively consume and produce messages while implementing stateful logic on real-time streaming data. This would cut down on the corrective iterations needed to bring a new job onto the platform and give many more people better visibility into the world of Flink.

In conclusion, our self-serve Flink platform at Zomato has come a long way. By migrating from AWS EMR to EKS, we have achieved significant cost savings and gained greater flexibility in resource provisioning. The use of ArgoCD, coupled with our robust metrics and alerting system, ensures that our Flink jobs run smoothly and any issues are promptly addressed. Our in-house logging platform, LogStore, provides easy access to logs for monitoring and debugging purposes. With continuous improvements and optimizations, we are committed to delivering a reliable and user-friendly platform that empowers our teams to focus on innovation and delivering value to our customers.

This blog was written by Anmol Virmani and Anshul Garg, under the guidance of Himanshu Rathore.
