Introduction: Embracing Real-Time Data Streaming for Operational Excellence
Our journey into real-time data processing began with a clear objective: to align our data capabilities with the operational demands of a fast-paced, dynamic business. Batch and micro-batch systems such as Spark Streaming often fall short when immediate, real-time insights are needed. Delays in data processing can hinder critical functions like monitoring restaurant performance, tracking live orders, and identifying fraudulent activity, where timely action is essential.
We saw significant potential in real-time streaming technology, which led us to adopt Apache Flink. Flink’s powerful stream-processing framework, combined with Apache Kafka for event streaming, enabled us to build systems that were not only more responsive but also more resilient, ultimately improving both operational efficiency and customer satisfaction.
The Initial Approach: Running Flink on AWS EMR
In 2019, we began our Flink journey by deploying Flink jobs written in Java on AWS EMR, using Flink version 1.8. The decision to start on EMR was influenced by its ease of setup, managed Hadoop infrastructure, and seamless integration with AWS services like S3. EMR allowed us to quickly launch big data applications without the need to manage complex underlying infrastructure.
However, as our usage expanded, we encountered limitations with this approach, particularly in terms of scalability, resource management, and cost efficiency. These growing pains made it clear that we needed to explore alternatives that could better support our evolving needs.
Democratizing Stream Processing: The Introduction of Flink SQL
As the demand for real-time data processing grew across the organization, we introduced Flink SQL to make stream processing more accessible, particularly to non-developers. Many data analysts and business users were more comfortable with SQL than Java, so this was a critical step in democratizing the technology.
Flink SQL allowed users to write SQL queries for stream processing, without needing to delve into the complexities of Flink’s underlying architecture. We wrapped the Flink SQL deployments in CI/CD pipelines, enabling users to submit SQL queries and automatically set up real-time data pipelines with minimal friction. This approach significantly reduced the reliance on our platform team, empowering teams across the organization to build and manage their own data processing pipelines.
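As a rough illustration, the sketch below shows what a self-serve submission could look like, assuming a simple YAML job spec consumed by the CI/CD pipeline; the field names, source and sink tables, and the query itself are hypothetical and only meant to show a pipeline defined entirely in SQL.

```yaml
# Hypothetical self-serve job spec committed to Git and picked up by CI/CD.
# The schema shown here is illustrative, not the actual internal format.
job:
  name: orders-per-minute
  parallelism: 2
  sql: |
    -- Count orders per city in one-minute tumbling windows
    INSERT INTO orders_per_minute
    SELECT
      city,
      TUMBLE_START(order_time, INTERVAL '1' MINUTE) AS window_start,
      COUNT(*) AS order_count
    FROM orders
    GROUP BY city, TUMBLE(order_time, INTERVAL '1' MINUTE);
```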
Expanding Adoption: Scaling Flink Across Zomato
Between 2019 and 2023, Flink’s adoption at Zomato grew significantly, with various teams migrating critical use cases to Flink, including:
Here’s what our setup looked like:
Overcoming Challenges: The Shift to a Self-Serve Platform
Despite the initial success of Flink on EMR, we encountered several challenges that revealed the limitations of the EMR-based approach:
These challenges motivated us to explore more scalable, efficient, and cost-effective alternatives, ultimately leading to our migration to Kubernetes.
Transitioning to Flink Kubernetes Operator: A Scalable and Cost-Efficient Solution
The above challenges highlighted the need for a more scalable, efficient, and user-friendly platform. This led us to explore Flink on Kubernetes with the Flink Kubernetes Operator, which benefited us in several ways:
Here’s what our current setup looks like:
Our current setup involves three major repositories that oversee the deployment of Flink jobs:
Sample Flink Deployment Configuration:
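The exact manifest we use is internal, but a minimal FlinkDeployment custom resource for the Flink Kubernetes Operator looks roughly like the sketch below; the job name, namespace, image, bucket, and resource values are placeholders.

```yaml
apiVersion: flink.apache.org/v1beta1
kind: FlinkDeployment
metadata:
  name: order-events-enricher        # placeholder job name
  namespace: flink-jobs              # placeholder namespace
spec:
  image: <registry>/flink-jobs/order-events-enricher:1.0.0   # job image built by CI
  flinkVersion: v1_17
  flinkConfiguration:
    taskmanager.numberOfTaskSlots: "2"
    state.checkpoints.dir: s3://<bucket>/checkpoints          # checkpoints on S3
  serviceAccount: flink
  jobManager:
    resource:
      memory: "2048m"
      cpu: 1
  taskManager:
    resource:
      memory: "4096m"
      cpu: 2
  job:
    jarURI: local:///opt/flink/usrlib/order-events-enricher.jar
    parallelism: 4
    upgradeMode: savepoint           # restore from a savepoint on upgrades
    state: running
```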
Why This Setup?
This setup was chosen to streamline and automate the deployment process, reducing manual effort as much as possible. By separating concerns across three repositories — Central Flink, Flink Deployments, and Kubernetes Deployments — we ensure clear responsibility and modularity.
It allowed us to create separate node provisioners for staging and production jobs simply by committing to the Git repository, while also ensuring role-based access control over deployments. We use Karpenter as our node provisioning tool, with team-level provisioners ensuring that jobs from one team do not starve jobs from other teams of resources.
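As a sketch, a per-team provisioner under Karpenter's older v1alpha5 Provisioner API (newer Karpenter releases use NodePool instead) might look like the following; the team name, limits, and node template reference are hypothetical. The limits block is what enforces the per-team resource fences described above.

```yaml
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: flink-team-ads-prod          # hypothetical per-team, per-environment provisioner
spec:
  labels:
    team: ads
    env: production
  taints:
    - key: team                      # only pods tolerating this taint land on these nodes
      value: ads
      effect: NoSchedule
  requirements:
    - key: karpenter.sh/capacity-type
      operator: In
      values: ["on-demand"]
    - key: kubernetes.io/arch
      operator: In
      values: ["amd64"]
  limits:
    resources:
      cpu: "200"                     # caps this team's total compute so it cannot starve others
      memory: 800Gi
  providerRef:
    name: flink-node-template        # AWSNodeTemplate with subnets/security groups (not shown)
  ttlSecondsAfterEmpty: 60           # scale empty nodes down quickly
```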
This structured approach not only simplifies the deployment pipeline but also enhances scalability, security, and maintainability, making it easier to manage and evolve our Flink platform.
Why ArgoCD?
ArgoCD follows a declarative GitOps approach, wherein a Git repository acts as a single source of truth. This approach aligns with our infrastructure-as-code practices, ensuring that our deployments are version-controlled, auditable, and reproducible. Its native integration with Kubernetes made it a perfect fit for our environment. It allowed us to automate the deployment of our Flink jobs on EKS clusters directly from our Git repositories, reducing manual intervention and potential errors.
It also provided automated synchronization capabilities, ensuring that the actual state of our clusters always matched the desired state defined in our Git repository while allowing us to implement role-based access control (RBAC), ensuring that only authorized users could make changes to specific parts of the system. This granular access control was crucial for maintaining security and accountability in a diverse environment like ours.
It offered comprehensive monitoring and logging features, providing real-time insights into the state of our deployments. This visibility helped us quickly identify and resolve issues, ensuring smooth running of our Flink jobs.
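For context, an ArgoCD Application that continuously syncs Flink manifests from a Git path into an EKS namespace, with automated pruning and self-healing, looks roughly like the sketch below; the repository URL, path, and namespaces are placeholders.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: flink-deployments-prod            # placeholder application name
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://git.example.com/data-platform/flink-deployments.git   # placeholder repo
    targetRevision: main
    path: production                       # directory containing FlinkDeployment manifests
  destination:
    server: https://kubernetes.default.svc
    namespace: flink-jobs                  # placeholder target namespace
  syncPolicy:
    automated:
      prune: true                          # delete resources removed from Git
      selfHeal: true                       # revert manual drift back to the Git state
```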
Scale of Operations
The scale of operations at Zomato is growing by the minute, and the jobs deployed on our Flink platform have comfortably handled these loads. To put it in perspective, we have processed a peak ingestion rate of more than 250M events per minute within our system.
Enhancing Visibility: Exposing the Flink UI with Ease
In the past, access to the Flink Job Manager UI was restricted exclusively to the data platform team due to the need for direct access to the analytics AWS account, creating a bottleneck for stakeholders who needed critical metrics to monitor their Flink jobs.
To address this, we introduced an ingress configuration in our Kubernetes setup, enabling us to expose the Flink Job Manager UI securely via a dedicated domain. The ingress controller handles routing traffic to the correct namespace and pods within the cluster, ensuring seamless access for all our teams.
Sample configuration to add Ingress routing rules:
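A representative Ingress for exposing a job's JobManager UI on a dedicated host is sketched below; the hostname, namespace, and ingress class are placeholders, and the backend assumes the <job-name>-rest service that the Flink Kubernetes Operator creates for the REST/UI endpoint on port 8081.

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: order-events-enricher-ui           # placeholder name
  namespace: flink-jobs                    # placeholder namespace
spec:
  ingressClassName: nginx                  # assumes an NGINX ingress controller
  rules:
    - host: order-events-enricher.flink.internal.example.com   # placeholder dedicated domain
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: order-events-enricher-rest   # REST/UI service created by the operator
                port:
                  number: 8081
```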
With this change, stakeholders now have direct access to the Flink UI, empowering them to monitor essential metrics like backpressure and flame graphs. This change has brought several benefits:
By prioritizing visibility and ease of access, we’ve made it even simpler for our teams to leverage the power of Flink for real-time data processing. It’s a small change, but one that has had a meaningful impact on productivity and user satisfaction.
Metrics and Alerting
At Zomato, we take data integrity very seriously, since any failure can disrupt the data being consumed from multiple sources. This is why we have built a robust metrics and alerting system into our self-serve Flink platform, powered by the Flink Kubernetes Operator. The operator exposes a wealth of useful metrics that provide insight into the health and performance of our Flink jobs. These metrics are periodically scraped by VictoriaMetrics, our go-to solution for monitoring and alerting.
We also run cAdvisor and Node Exporter as DaemonSets on the cluster nodes to scrape detailed Kubernetes metrics, giving us comprehensive visibility into our cluster's performance. cAdvisor collects, aggregates, processes, and exports information about running containers, while Node Exporter provides node-level metrics. This combination ensures we have a holistic view of both our applications and the underlying infrastructure.
Below are the cAdvisor and Node Exporter DaemonSet configurations for enabling scraping in the cluster:
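The exact manifests are environment-specific, but a trimmed Node Exporter DaemonSet is sketched below; cAdvisor is deployed the same way with its own image and host mounts. The namespace, image tag, and annotation-based scrape discovery are assumptions.

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-exporter
  namespace: monitoring                    # placeholder namespace
spec:
  selector:
    matchLabels:
      app: node-exporter
  template:
    metadata:
      labels:
        app: node-exporter
      annotations:
        prometheus.io/scrape: "true"       # assumed annotation-based discovery by the scraper
        prometheus.io/port: "9100"
    spec:
      hostNetwork: true                    # expose host-level network metrics
      hostPID: true
      containers:
        - name: node-exporter
          image: quay.io/prometheus/node-exporter:v1.6.1
          args:
            - --path.procfs=/host/proc
            - --path.sysfs=/host/sys
          ports:
            - name: metrics
              containerPort: 9100
          volumeMounts:
            - name: proc
              mountPath: /host/proc
              readOnly: true
            - name: sys
              mountPath: /host/sys
              readOnly: true
      volumes:
        - name: proc
          hostPath:
            path: /proc
        - name: sys
          hostPath:
            path: /sys
```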
We have set up a central dashboard where all the essential metrics and checks are available right out of the box. This makes it super easy for teams to get an overview of their job’s performance without having to dive deep into the details. For those who like to get their hands dirty, the dashboard is fully customizable. Users can tweak it to display the metrics that matter most to them and set up custom alerts to stay on top of any issues.
Speaking of alerts, we've made sure that no failure goes unnoticed. From basic metrics like consumer group lag (exposed from our Kafka cluster) to checkpoint failures and killed pods, we've covered them all and built alerts that notify the respective job owners so that prompt action can be taken. This gives our teams the confidence that their Flink jobs are running smoothly and that any hiccups will be promptly addressed. It's all part of our commitment to providing a reliable, self-serve Flink platform that empowers our teams to focus on what they do best: delivering great experiences to our users.
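To give a flavour of these alerts, the sketch below uses Prometheus-compatible rule groups (as evaluated by vmalert in a VictoriaMetrics setup); the metric names follow common Flink Prometheus-reporter and kafka-exporter conventions, and the thresholds are placeholders rather than our production values.

```yaml
groups:
  - name: flink-job-alerts
    rules:
      - alert: FlinkCheckpointFailures
        # Fires when a job reports any failed checkpoints over the last 15 minutes
        expr: increase(flink_jobmanager_job_numberOfFailedCheckpoints[15m]) > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Checkpoint failures detected for job {{ $labels.job_name }}"
      - alert: KafkaConsumerLagHigh
        # Fires when a consumer group's lag stays above the placeholder threshold
        expr: sum by (consumergroup) (kafka_consumergroup_lag) > 100000
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Consumer group {{ $labels.consumergroup }} lag is high"
```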
We’ve also integrated job-level PagerDuty alerts powered by our in-house alert manager, which maps consumer groups of specific jobs to their respective PagerDuty services. This ensures prompt and precise incident response.
Logging
Logging is a critical component of our self-serve Flink platform at Zomato. To ensure we capture and store all necessary logs effectively, we run a sidecar container alongside the Flink JobManager and TaskManager containers in each pod. This sidecar is responsible for shipping logs to our in-house logging platform, LogStore.
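A rough sketch of the pattern, expressed as a podTemplate fragment of a FlinkDeployment, is shown below; the log-shipper image and LogStore endpoint are hypothetical placeholders, and the point is simply the shared log volume between the Flink container and the sidecar.

```yaml
podTemplate:
  spec:
    containers:
      - name: flink-main-container         # merged by the operator into the JobManager/TaskManager container
        volumeMounts:
          - name: flink-logs
            mountPath: /opt/flink/log
      - name: log-shipper                  # sidecar that tails log files and forwards them to LogStore
        image: <registry>/logstore-agent:latest   # hypothetical in-house agent image
        env:
          - name: LOGSTORE_ENDPOINT        # hypothetical endpoint of the in-house platform
            value: https://logstore.internal.example.com
        volumeMounts:
          - name: flink-logs
            mountPath: /opt/flink/log
            readOnly: true
    volumes:
      - name: flink-logs
        emptyDir: {}                       # both containers share this volume for log files
```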
LogStore provides our teams with an easy-to-use interface to view and analyze logs in real-time. This setup not only helps in monitoring the health and performance of Flink jobs but also significantly aids in debugging and troubleshooting issues. By having immediate access to detailed logs, our teams can quickly pinpoint problems and take corrective actions, ensuring the reliability and smooth operation of our data processing pipelines.
Platform Savings
AWS EMR adds a considerable infrastructure charge on top of the base instance cost, whereas the corresponding overhead for EKS (Elastic Kubernetes Service) is minimal. Moving off EMR alone yielded ~25% in savings. Moreover, by provisioning resources like compute and memory exactly as required, we have reduced resource wastage, saving an additional 10%.
That’s okay, what next?
In conclusion, our self-serve Flink platform at Zomato has come a long way. By migrating from AWS EMR to EKS, we have achieved significant cost savings and gained greater flexibility in resource provisioning. The use of ArgoCD, coupled with our robust metrics and alerting system, ensures that our Flink jobs run smoothly and any issues are promptly addressed. Our in-house logging platform, LogStore, provides easy access to logs for monitoring and debugging purposes. With continuous improvements and optimizations, we are committed to delivering a reliable and user-friendly platform that empowers our teams to focus on innovation and delivering value to our customers.
This blog was written by Anmol Virmani and Anshul Garg, under the guidance of Himanshu Rathore.