Introducing Vinifera: A guide on how it prevents and mitigates accidental data leaks at scale
Vinifera is an open-source monitoring tool that detects and responds to accidental or critical leaks of sensitive information. It monitors public contributions made by developers across their repositories, forks, and gists, and sends a notification if any sensitive information is leaked. It started as a tool to monitor GitHub activity and has since evolved to support multiple platforms such as GitLab, Slack and Asana.
Since we started using Vinifera in 2019, we have prevented many accidental leaks on public platforms. We also open-sourced major components of Vinifera in the hope that it can help other companies strengthen their security posture.
Addressing Security Vulnerabilities: Vinifera’s Solution for Data Exposures and Proactive Monitoring
Sensitive data exposures are a common yet frequently overlooked class of security vulnerability that many organizations face. This class is hard to avoid because there are many points of exposure, and it is easy to exploit, which makes it an attractive target for researchers.
From an outsider’s perspective, getting data from internal sources is hard, but it is easy to monitor public platforms and wait for someone else to make that information public. The severity of such leaks can range from minor to critical. For example, a leaked internal database schema may not pose an immediate security risk, but it could help someone connect related information and uncover potential vulnerabilities. A leak of third-party API keys or credentials is more serious: it exposes the associated data to attack and gives attackers an opportunity to infiltrate the network further. To address these vulnerabilities, we do the following:
We host an internal PrivateBin and CyberChef instance so that developers can easily share internal data and perform essential data transformations. While this approach offers some level of support, it does not entirely resolve the problem.
Vinifera provides continuous monitoring that enables swift identification of and response to leaks. By tracking and scanning registered sources such as users, repositories and gists, Vinifera promptly detects changes or anomalies. This proactive approach improves data security by allowing immediate action and risk mitigation when a leak occurs.
Vinifera’s Development Journey
Vinifera was built as a standard Rails app. The development process involved the following key components:
Sidekiq: Sidekiq is used for efficient job lifecycle management, enabling tasks such as scheduling and retries.
Gitleaks Integration: Gitleaks’ Docker image is used for scanning, with communication facilitated through the Docker API.
Data Storage and Cache: Postgres was chosen as the persistent data store, while Redis serves as the application cache, ensuring reliable performance.
Slack Integration: Vinifera provides real-time monitoring and event checking through Slack.
On-Call Team Collaboration: Pager events are triggered to notify the on-call team so that it can respond to any security incidents.
Vinifera’s Operations
Vinifera’s operations can be broadly categorized into three key phases:
Discovery: During the discovery phase, Vinifera scans and identifies all assets, actively searching for new targets to monitor. Once identified, these targets are registered for ongoing monitoring and polling.
Lifecycle: After a new employee is onboarded into our GitHub organization, we register them as a target. We then run comprehensive scans on their public repositories and gists, treating these as “assets”. To track changes efficiently, revisions of each asset are stored in a way that allows modifications to be identified quickly.
Scanning: Scanning is triggered via Sidekiq, and all scans are performed within isolated Docker containers, ensuring secure and reliable execution. The scanning process varies based on the type of scan required – whether it is a repository scan or an analysis of raw files. Vinifera further leverages Datadog to monitor relevant statistics and ensure the overall health of the service.
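The revision tracking described in the lifecycle phase can be pictured with a small sketch. It is written in Python for brevity rather than in Vinifera's Ruby, and every name in it (`AssetRegistry`, `needs_scan`, `assets_to_scan`, the asset id strings) is an assumption made for illustration, not Vinifera's actual schema. The idea is simply to remember the last revision scanned for each asset and to rescan only when the observed revision differs.

```python
from dataclasses import dataclass, field


@dataclass
class AssetRegistry:
    """Remembers the last revision scanned per asset (hypothetical in-memory store)."""

    last_scanned: dict[str, str] = field(default_factory=dict)

    def needs_scan(self, asset_id: str, revision: str) -> bool:
        # An asset needs a scan when it is new or its revision has changed.
        return self.last_scanned.get(asset_id) != revision

    def mark_scanned(self, asset_id: str, revision: str) -> None:
        self.last_scanned[asset_id] = revision


def assets_to_scan(registry: AssetRegistry, observed: dict[str, str]) -> list[str]:
    """Return the ids of assets whose observed revision differs from the stored one."""
    return sorted(
        asset_id
        for asset_id, revision in observed.items()
        if registry.needs_scan(asset_id, revision)
    )
```

In this sketch a revision could be a commit SHA for a repository or a version id for a gist; comparing one short string per asset is what keeps the change check cheap.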
An Overview of the Vinifera System
After a successful execution of the “target_monitor” Sidekiq job, it is re-enqueued for the subsequent scan, with the enqueue information stored in the database. This enables fine-tuning and control over the timing of the next scan.
In the event of failures, Vinifera analyzes the type of error encountered. If the failures are attributed to GitHub or to the Docker box, jobs are throttled accordingly, giving the system a window in which to avoid further errors.
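Throttling by error type can be sketched as follows. This is an illustrative Python model, not Vinifera's Sidekiq code: the error classes, the window lengths, and the `Throttle` helper are all assumptions. A failure attributed to a known service opens a pause window, and new jobs are not started before every open window has passed.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta


class GitHubError(Exception):
    """Hypothetical error raised when GitHub fails or rate-limits us."""


class DockerError(Exception):
    """Hypothetical error raised when the Docker box is unavailable."""


# Assumed pause windows per error type; real values would need tuning.
THROTTLE_WINDOWS = {
    GitHubError: timedelta(minutes=15),
    DockerError: timedelta(minutes=5),
}


@dataclass
class Throttle:
    paused_until: dict = field(default_factory=dict)

    def record_failure(self, error: Exception, now: datetime) -> bool:
        """Open a pause window if the error maps to a known service."""
        for error_type, window in THROTTLE_WINDOWS.items():
            if isinstance(error, error_type):
                self.paused_until[error_type] = now + window
                return True
        return False  # unknown errors are left to the normal retry policy

    def earliest_start(self, now: datetime) -> datetime:
        """Earliest time a new job may start, given all open pause windows."""
        return max([now, *self.paused_until.values()])
```

For simplicity the sketch delays every job while any window is open. A finer-grained version could pause only the jobs that depend on the failing service.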
Scalability Enhancements: Boosting Stability and Performance
To enhance the stability and scalability of Vinifera, we implemented several measures:
Auto throttling of jobs based on error types: When relying on third-party services or internal services with limited capacity, it is necessary to have checks to avoid overwhelming the service. We introduced auto throttling of jobs based on error types, dynamically adjusting limits for specific periods. This approach helps avoid disruption caused by exponential retries, ensuring smooth operation of the service.
Caching GitHub API responses to avoid rate limiting: To mitigate rate-limiting issues, we cache GitHub API responses. We use the faraday-http-cache library, which relies on response headers such as ETag, Last-Modified and Age, to cache the responses in Redis. This, however, comes with a small storage cost, since responses are now held in Redis memory. To keep storage manageable, Vinifera caches selectively: heavy responses and resources that will not be needed again are intentionally excluded.
Optimized scanning of big forks and external projects: One of the main causes of resource spikes on our Docker daemon was scanning big forks. This arose when a developer forked a large open-source project to contribute to it, and we scanned the entire repository for violations. Some forks were enormous (for example, the Linux kernel, Odoo, and the Android AOSP source code) and sometimes caused the cluster to choke. In some cases, a scan used almost 1 GB of memory and ~99% of the total CPU. To address this, we determined and scanned the commit patch directly instead of cloning the entire fork. By leveraging the GitHub user activity API, we could quickly identify commits pushed by developers to collaborative projects or forks and extract the relevant patch for direct scanning.
Running scans on a dedicated Docker daemon: Moving scanning operations to a dedicated machine greatly improved performance and prevented resource congestion. By isolating the scanning process on a separate Docker daemon, we limited the blast radius in case of any issues: discovery operations continued uninterrupted even when the dedicated Docker machine experienced errors. Additionally, we optimized resource allocation by restricting scans to a single CPU core and a fraction of CPU weight, preventing resource exhaustion.
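To make the response-caching measure above more concrete, here is a minimal Python sketch of ETag-based conditional requests with a size cut-off. It is not the faraday-http-cache API and not Vinifera's code: the `ETagCache` class, the injected `fetch` callable, and the `max_body_bytes` threshold are assumptions, and a plain dictionary stands in for Redis. A cached ETag is sent back as `If-None-Match`; when the server answers 304 Not Modified, the stored body is reused.

```python
from typing import Callable

# Hypothetical transport: fetch(url, request_headers) -> (status, response_headers, body)
Fetch = Callable[[str, dict], tuple[int, dict, bytes]]


class ETagCache:
    def __init__(self, fetch: Fetch, max_body_bytes: int = 64 * 1024):
        self.fetch = fetch
        self.max_body_bytes = max_body_bytes  # assumed cut-off for "heavy" responses
        self.store: dict[str, tuple[str, bytes]] = {}  # url -> (etag, body); stands in for Redis

    def get(self, url: str) -> bytes:
        headers = {}
        cached = self.store.get(url)
        if cached:
            # Ask the server to reply 304 if the resource is unchanged.
            headers["If-None-Match"] = cached[0]
        status, response_headers, body = self.fetch(url, headers)
        if status == 304 and cached:
            return cached[1]  # unchanged: reuse the stored body
        etag = response_headers.get("ETag")
        if etag and len(body) <= self.max_body_bytes:
            # Selective caching: skip responses that are too heavy to keep in memory.
            self.store[url] = (etag, body)
        return body
```

Injecting `fetch` keeps the sketch independent of any HTTP client. GitHub's REST documentation describes conditional requests of this kind, which is what makes the approach useful against rate limits.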
Expanding Platform Support: Vinifera’s Multi-Service Compatibility
Vinifera’s journey began as a monitoring tool for GitHub sources, but it naturally evolved to support GitLab and other services as well. Depending on the platform, Vinifera employs different monitoring approaches to ensure comprehensive coverage:
Asset Monitoring Approach: Services like GitHub and GitLab are monitored using an asset monitoring approach. This involves registering assets and continuously looking for any changes or updates.
Query-Based Monitoring: Services like Slack, Asana, and Intelx are monitored using a query-based approach. Vinifera periodically polls search results against fixed queries, detecting any changes or relevant information.
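The query-based approach amounts to diffing each poll against what has already been seen. The Python sketch below shows that idea only. The `new_hits` function, the `seen` mapping, and the notion of a stable result id are assumptions for illustration and do not reflect how Vinifera or any of the named services represent search results.

```python
def new_hits(seen: dict[str, set[str]], query: str, result_ids: list[str]) -> list[str]:
    """Return result ids not seen before for this query, and remember them.

    `seen` maps each fixed query to the set of result ids already reported.
    """
    known = seen.setdefault(query, set())
    # dict.fromkeys de-duplicates within one poll while preserving order.
    fresh = [rid for rid in dict.fromkeys(result_ids) if rid not in known]
    known.update(fresh)
    return fresh
```

Each polling cycle would call `new_hits` once per fixed query and pass only the returned ids on for scanning, so results that were already reported do not raise repeat alerts.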
The platforms named above (GitHub, GitLab, Slack, Asana, and Intelx) are among those Vinifera currently scans.
Intelx, specifically, serves as a valuable source for monitoring information exposures across various assets. It functions as a search engine that collects and indexes information from multiple public sources. Vinifera leverages Intelx to identify and neutralize any potential exposure risks proactively.
Distributed Security Alerting
We built a robust alerting and acknowledgment system, DiSeAl (Distributed Security Alerting), to distribute security alerts across the entire organization, and integrated it with Vinifera. Whenever a new resource is registered in Vinifera, our internal lookup logic identifies the corresponding owner within our organization. An alert is then sent to the owner, so that they are promptly notified of any new public assets associated with their account that Vinifera discovers. This proactive approach empowers users to stay vigilant and report any accidental exposures, helping to identify and mitigate potential risks.
Conclusion
Vinifera serves as a powerful monitoring tool, proactively detecting and mitigating accidental leaks across multiple platforms. Vinifera strengthens the security posture with continuous monitoring, robust alerting, and multi-tenant support. With Vinifera, we can take proactive measures to identify and mitigate potential risks, bolster data protection, and prevent leaks. By open-sourcing Vinifera, Zomato aims to contribute to the broader security community, inviting others to strengthen their security postures and swiftly identify and mitigate potential leaks. If you are passionate about solving security problems at scale, we would love to hear from you at firstname.lastname@example.org.