SRE Team | August 12, 2024 | 11 min read
Migrating to VictoriaMetrics: A complete overhaul for enhanced observability

At Zomato, ensuring the reliability of our systems is paramount, as any downtime can affect millions of customers using our services.

In 2018, we relied on a single Grafana server on an EC2 instance, backed by a single Prometheus server, and New Relic for monitoring system health. However, with Zomato’s growth, we recognized the need for a more robust, scalable in-house monitoring solution. Given the rapidly increasing number of micro services and their custom metrics, along with critical infrastructure metrics such as Envoy and overall system health, we couldn’t afford any gaps in our monitoring capabilities. We needed a solution that could track the health of our metrics reliably and scale with our growth.

Prometheus, the de-facto choice for monitoring tools at the time, along with Thanos for enhanced scalability, reliability, and long-term storage, emerged as the perfect fit. We adopted this solution and scaled it as Zomato continued to grow, ensuring that we could maintain high observability and system reliability.

The initial Prometheus and Thanos setup

Here’s what our setup looked like:

  • Prometheus for metric collection and short term storage, Thanos for high availability and long-term storage
  • Multi-AZ setup for high availability
  • AWS ECS Services running on 100% spot EC2 instances
  • Total Prometheus servers: 144 (1a+1b), Thanos querier, Thanos stores, Thanos frontend and Thanos compactor

Challenges at scale

We are truly grateful to the Prometheus and Thanos communities for the incredible system they’ve built and supported over the years. However, we soon realized that the current setup might not sustain our anticipated growth.

As Zomato’s business expanded and our monoliths evolved into micro services, our metrics volume exploded, revealing several issues in the Thanos and Prometheus stack:

  • Memory Issues: Our Prometheus servers started crashing with Out Of Memory (OOM) errors frequently due to resource-intensive high cardinality queries.
  • WAL corruptions: Write-Ahead Log (WAL) corruption led to OOM and slow startup times.
  • Slower Queries: High query response time when querying Prometheus and AWS S3 via Thanos store.
  • False Alerts: False no_data alerts due to metric loss in case of spot interruptions and longer startup time
  • Mounting Costs: High cardinality led to high resource consumption, requiring very large instances (r5.16xlarge) and memory allocations ranging from 40 GB to 250 GB for each Prometheus server. Thanos stores also required 460 GB memory.

The wake-up call: NYE degradation

On New Year’s Eve, we witness a massive surge in orders and customer interactions, pushing our systems to the limits. However, we encountered a brief outage in our observability platform at the peak of NYE. The cause? An influx of concurrent users accessing resource-intensive queries from dashboards, which strained our observability stack.

Exploring alternatives and discovering VictoriaMetrics

Realizing that our current setup might not meet future growth, we began exploring alternatives. Leveraging our experience with Prometheus, we defined key criteria for our next platform. We explored some paid alternatives as well, but all of them came with a hefty price tag, significantly higher than our in-house setup cost.

After evaluating various options, we discovered VictoriaMetrics, which fulfilled almost all our expectations. Our criteria included:

  • Distributed System: Does it avoid single point of failure, as is the case with Prometheus?
    Yes, VictoriaMetrics is fully distributed with each component serving a specific purpose — vmselect, vmstorage, vminsert, vmagent, vmauth, vmalert. Please go through this doc for more details on each vm component.
  • Quick Startup: Can it start quickly in case of spot interruptions?
    Yes, being a distributed system with lightweight services, each component starts up impressively fast, well within the two-minute spot interruption window.
  • Improved Query Response Time: Can it improve query response time?
    VictoriaMetrics demonstrated significant improvements, especially since all data resides on EBS with much lower latencies and higher throughput as compared with S3. During POC, we found that some Grafana panels, like those showing orders placed every 30 seconds, saw response times drop by up to 1/3rd.
  • Handling High Cardinality and Churn Rate: Can it handle high cardinality and churn rate?
    VictoriaMetrics is highly optimized and can handle high cardinality efficiently, which was crucial for our fully containerized production environment running hundreds of micro services on AWS ECS backed by 100% AWS EC2 spot instances.
  • Metric Aggregation: Is there an alternative that can aggregate metrics and reduce time series?
    To our surprise, VictoriaMetrics supports streaming aggregation to aggregate data before ingestion into the system, which helps reduce unnecessary cardinality and the number of metrics. This feature was particularly appealing to us. Let’s try to understand this feature better with an example:


    You can see that id has cardinality 2, city_id 1 and instance 4, so overall cardinality of metric order_placed is 8. With streaming aggregation, we can drop instance label and aggregate the remaining series. Removing instance label reduces cardinality to just 2, down from 8, thereby significantly cutting down storage requirements, lowering costs, and enhancing performance.

    For more details, please refer to this documentation.

  • Scalability: Can it handle our ever-growing scale?
    Scalability is a crucial criterion for us. During NYE 2024, we handled 2.2 billion active time series with an ingestion rate of 17.5 million samples per second. We sought a system that’s already battle-tested with traffic much higher than ours. VictoriaMetrics has been tested with an ingestion rate of 100 million samples per second and already has a case study with active time series much higher than ours. See this and this for more details. This gave us confidence and a go-ahead signal.
  • Visibility and Monitoring: Can it offer better visibility into what goes into the system and areas for optimization?
    VictoriaMetrics has VMUI, an incredible source for monitoring what’s there in the system. Cardinality explorer provides insights into the top time series, label=value contributors. See this for more details. Below image is from our setup:
  • Migration Effort: Can we complete migration with minimal effort and time?
    VictoriaMetrics addressed this concern as well, offering a complete drop-in replacement for the current Prometheus stack. We realized that we don’t have to put much effort if we have the correct migration strategy in place.

POC and mini load test before the migration

Before embarking on the big migration, we decided to conduct a small POC. On December 23rd 2023, we deployed VictoriaMetrics in cluster mode to handle all our application metrics, which also served as a standby for NYE 2024. We ingested eight days worth of application metrics and performed load testing on the system. The results were impressive, demonstrating significant reductions in query execution time as well as disk and memory utilization.

The big migration

Following New Year’s Eve 2024, in mid-January 2024, we began crafting the migration plan. Overhauling the entire observability stack initially seemed daunting, but with the right approach and tools, we managed to execute a seemingly smooth migration.

When deciding on the migration strategy, we deliberated on the following key questions:

  • How many dashboards and panels need to be migrated?
  • What would be the optimal setup? How many different VM clusters do we need?
  • How should we manage configurations, recording rules, and alerting rules?
  • What strategy should we employ for migrating the data?
  • How could we validate the correctness of metrics and alerts?
  • Could we optimize and reduce current usage through cleanup efforts?
  • How should benchmarking be conducted?

Here are the details which answer all the above questions:

Size of the migration

The setup

We categorized metrics into two environments: preprod, and prod, and two types:

  • Infrastructure metrics: including Envoy mesh, cAdvisor, EKS, Kafka, MQTT, and others, which contribute 80%.
  • Service metrics: consisting of application-specific business metrics from our microservices, contributing 20%.

Based on this categorization, we decided to create three VM clusters: preprod (dedicated to all preprod metrics), infra prod, and svc prod. These clusters are replicated across two Availability Zones (AZs) for high availability and resiliency. Each cluster operates with its own vmselect instance that reads from both AZs. We utilize vmauth as a proxy to route metrics to the appropriate vmselect.

Here’s an overview of how the setup is structured:

The entire VM Stack runs on AWS ECS, utilizing AWS EBS and 100% EC2 spot instances. We use the rexray plugin to mount EBS volumes onto ECS tasks. Requests from Grafana are directed to AWS ALB, which then routes them to vmauth as shown in the above architecture.

Configuration management

We adopted a GitOps-based approach to synchronize vmagent, vmalert, and vmauth configurations with AWS EFS. All alerting and recording rules are organized in folders within our common GitHub repository. Upon merging a pull request (PR), configuration files are synced to AWS EFS using AWS CodeBuild as shown above. VM components retrieve configurations from this EFS every 30 seconds. Here’s a sample outline of our GitOps structure

Migration steps

Here are the steps we took during the migration process:

  1. Data Scraping and Ingestion: We configured VMagents for each cluster to scrape and ingest data similarly to Prometheus. Since VM is fully compatible with Prometheus, this transition was smooth. As we ingested more metrics, we scaled by adding additional VMAgent shards and VMStorage nodes. This scaling also allowed us to verify that there was no disruption or impact on the plotted graphs.
  2. Dashboard Updates: Recognizing the impracticality of manually migrating thousands of dashboards, we developed a script to update Grafana dashboard panels with labels and datasource configurations.
  3. Initial Dashboard Migration: After accumulating a month’s worth of data, we prioritized migrating our internal SRE dashboards first to address early issues. Subsequently, we engaged with service owners tier by tier, migrating dashboards based on their criticality. This approach helped us identify and resolve issues proactively before affecting critical services.
  4. Alert Verification: To ensure continuity, we configured a duplicate alert manager and routed all alerts triggered by vmalert to it. Throughout the migration, we regularly compared alerts generated by Prometheus and vmalert to validate their consistency.
  5. Final Stack Transition: Once all dashboards were successfully migrated, we phased out read load from the Thanos querier and eventually decommissioned the entire old stack.

These structured steps ensured a smooth transition while maintaining operational reliability and minimizing disruptions.

Reducing metrics: Optimizing the usage

When we hit 2.2 billion active time series on NYE, we asked ourselves a question: Do we actually need this many time series to monitor our systems, or can we remove and drop certain metrics?

During the migration, we picked the cleanup track and removed unused dashboards, alerts, and metrics.

To identify unused high cardinality metrics, we used VictoriaMetrics’ cardinality explorer and the Grafana-wtf tool. These tools enabled us to pinpoint metrics and histogram buckets that contributed the most by percentage. We then verified their usage in Grafana panels using the Grafana-wtf tool. If a metric wasn’t used in any panel and we confirmed it wasn’t needed, we dropped it. It turned out that the top contributors were Envoy mesh metrics. Some of the top unused metrics that we dropped include:

  • envoy_cluster_lb_subsets_active
  • envoy_cluster_max_host_weight
  • envoy_cluster_lb_zone_no_capacity_left
  • envoy_cluster_upstream_cx_length_ms_sum
  • envoy_cluster_circuit_breakers_high_rq_retry_open
  • envoy_cluster_lb_subsets_fallback_panic
  • envoy_cluster_lb_subsets_created

We also removed a lot of unused histogram buckets from certain high cardinality metrics. Here’s a sample drop configuration

Results from this exercise

  • 40% reduction in active time series
  • 200 unused dashboards deleted

Load Test: Pushing the limits

Read load test:

We used Grafana’s k6 tool to load test read traffic, simulating 800 concurrent users and 30,000 queries per minute — 4 times our average load. The query timeout was set to 10s. Below are the load test results for VictoriaMetrics, compared to a similar load test conducted on the Thanos stack –

Write load test:

The write load was tested during our infrastructure scale-up test, which we conducted recently for Mother’s Day, another high-volume business day besides New Year’s Eve. We scaled up EC2 instances and ECS services by 2 to 2.5x. This increase in infrastructure significantly multiplied the metrics count, thanks to the cardinality, resulting in a 2x increase in the ingestion rate and facilitating the load test.

Hurdles and Hiccups: Migration challenges

  • Increase vs Rate func: Incorrect graphs appeared in many Grafana panels when using the rate function. We had to manually ask teams to fix this. Please see this GitHub issue for details.
  • MinStaleness: We encountered unexpected data staleness and gaps while plotting some metrics on Grafana. By default, Prometheus has a lookback of 5 minutes, but in VictoriaMetrics, we had to explicitly set the minStalenessInterval. See this Github issue for more details.
  • Streaming Aggregation: Unfortunately, we could not use the streaming aggregation feature due to the nature of our multi-AZ setup. We were seeing random data and spikes during spot interruptions and vmagent rotations. After a detailed discussion with the VictoriaMetrics team, they suggested a solution that we are yet to try.
  • Gaps in cAdvisor Metrics: To address this, we had to set honor_timestamp to false in the configuration. Eventually, VictoriaMetrics team fixed this by setting it to false by default. See this GitHub issue for more details.
  • Version Upgrades: During the migration, three new versions were released, with the latest version including critical bug fixes and some features we needed. This was not an issue as such. In fact, this allowed us to test the resiliency of the system during upgrades. We followed the no-downtime strategy and observed no issues during the process.

Impact of this mighty exercise

Along with the above significant optimizations due to both the VictoriaMetrics adoption and the cleanup task we picked, this migration greatly enhanced the system’s reliability and resiliency –

  • Reliability: No metric loss events, significantly better reliability, and no false alerts due to no_data.
  • Monitoring: Improved monitoring and visibility with the official VictoriaMetrics grafana dashboards and VMUI component.

Limitations: Missing features

VictoriaMetrics lacks some features (available only in enterprise version) that were available in Thanos:

  1. Downsampling: Thanos provides downsampling, but in VictoriaMetrics, this feature is available only in the enterprise version. Before starting the migration, we thought we would be able to retain metrics for a longer retention period. Unfortunately, we have to keep the same retention period to keep costs under control. There are ways to achieve downsampling, such as setting up a parallel scraper and scraping at 5-minute or 1-hour intervals, then ingesting metrics into a different cluster. However, we decided not to go with this approach.
  2. Indexdb Rotation Period: Due to high cardinality and churn rate, our indexdb size is 10 times the data size. While data cleanup happens regularly during background merges, indexdb cleanup occurs only at the end of the retention cycle in vmstorage. Thus, we have to provision double disk capacity to accommodate this large index size, incurring some extra costs.
  3. Query Caching: In-memory caching is not fully utilized due to frequent spot instance rotations. Thanos frontend comes with external caching support, which is favorable for our setup.
  4. Query Splitting: We are awaiting this feature in vmselect. In the meantime, we might use Thanos query frontend. For reference – github issue
  5. Automatic VMStorage Node Discovery: This feature is available in Thanos but, unfortunately, only in the enterprise version of VictoriaMetrics. Every time we need to add vmstorage nodes, we have to manually add vmstorage endpoints in vmselect and vminsert command lines and rotate the stack. This process is cumbersome given the large setup we have.
  6. Random Spikes in Histogram Metrics During VMStorage Rotation: Recently, we have observed random spikes and incorrect graphs in some panels using histogram metrics. Upon investigation, we found that this occurs during EC2 spot interruptions of vmstorage nodes. We are looking into this issue, and there is a related open github issue on it.

Despite all the limitations, we are happy with the migration results and grateful to the VictoriaMetrics team.

A comparative look between the setups

Note: We have a multi-az setup in which every component is duplicated, one in each az.

Okay! So what next?

Looking ahead, we plan to add the following enhancements to our system:

  • Enhance the capabilities of our VM controller to include provisioning, management, and on-demand scaling of VM resources. Right now, it syncs configuration files to AWS EFS.
  • Potentially introduce Thanos frontend for query splitting and caching, while disabling caching on vmselect, until this feature is added in vmselect itself.
  • Introduce a separate cluster for streaming aggregation, with one vmagent, instead of two, scraping and duplicating metrics to two separate VM clusters in different availability zones (AZs), as suggested by the VictoriaMetrics team.
  • Add multi retention support.

In conclusion, migrating to VictoriaMetrics has been a game-changer for our observability platform at Zomato. This overhaul addressed the challenges we faced with Prometheus and Thanos, allowing us to reduce costs, improve query response times, and enhance overall system performance. The migration process, though complex, was meticulously planned and executed, ensuring a smooth transition without disrupting our operations.

This blog was authored by Abhishek Jain in collaboration with Ayush Chauhan, Nishant Saraff and Riya Kumari under the guidance of Himanshu Rathore .

facebooklinkedintwitter

More for you to read

Technology

apache-flink-journey-zomato-from-inception-to-innovation
Data Platform Team | November 18, 2024 | 10 min read
Apache Flink Journey @Zomato: From Inception to Innovation

How we built a self-serve stream processing platform to empower real-time analytics

Technology

introducing-pos-developer-platform-simplifying-integration-with-easy-to-use-tools
Sumit Taneja | September 10, 2024 | 2 min read
Introducing POS developer platform: Simplifying integration with easy-to-use tools

Read more about how Zomato is enabling restaurants to deliver best-in-class customer experience by working with POS partners

Technology

go-beyond-building-performant-and-reliable-golang-applications
Sakib Malik | July 25, 2024 | 6 min read
Go Beyond: Building Performant and Reliable Golang Applications

Read more about how we used GOMEMLIMIT in 250+ microservices to tackle OOM issues and high CPU usage in Go applications, significantly enhancing performance and reliability.

Technology

zomatos-journey-to-seamless-ios-code-sharing-and-distribution
Inder Deep Singh | June 6, 2024 | 15 min read
Unlocking Innovation: Zomato’s journey to seamless iOS code sharing & distribution with Swift Package Manager

Read more to know about how we migrated Zomato’s 10-year-old iOS codebase to Apple’s Swift Package Manager. The blog highlights the process behind building a central syncing mechanism needed across multiple apps, the challenges encountered, and how we resolved them.