Imagine this: you’re in a war room. Tension hangs thick in the air as you and your application team scramble to debug a sudden spike in out-of-memory (OOM) issues plaguing your application’s containers. No memory leak seems to be the culprit, but further investigation reveals a shocking truth: the garbage collector (GC) wasn’t running when it should have been! This crucial cleanup process failed to free up memory before usage breached the container’s hard limit, triggering those dreaded OOM issues.
Now, let’s look at a different scenario. You’re analyzing your application’s performance profile, and a particular pattern jumps out. A significant chunk of your application’s CPU usage is being devoured by the GC. Your service doesn’t need much live heap memory, yet the constant GC activity is eating away at valuable resources. You have a hunch: what if you could make the GC run only when it is actually needed?
These two seemingly unrelated situations hold the key to unlocking a powerful tool in Go: GOMEMLIMIT.
The memory maze: Why did we need it?
Like many languages, Go uses a garbage collector (GC) for automatic memory management. Versions before 1.19 offered limited control over it, which left room for OOM issues and inefficient GC behavior in high-memory applications. The crux of the issue lay in:
GOGC and Twice the Trouble: Previously, Go relied solely on the GOGC environment variable (defaulting to 100) to trigger GC cycles. With the default value, a collection starts when the heap reaches roughly twice the live heap size measured in the previous GC cycle. (The live heap is the part of memory that your application is still using and that the garbage collector therefore cannot reclaim.) This target did not take your application’s memory hard limit into account.
The OOM Trap: The problem arose when this doubled live heap size GC target surpassed the application’s memory hard limit, inevitably resulting in OOM issues.
GC CPU wastage: For applications with a large memory allocation but a small live heap, this doubled live heap size was far below the memory hard limit, so GC cycles ran much more often than necessary, leading to high GC CPU usage.
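To make both failure modes concrete, here is a small worked example. It is a simplified sketch: the helper name heapTarget and the 1024 MiB hard limit are hypothetical, and the real runtime also counts GC roots such as stacks and globals when it computes its target.

```go
package main

import "fmt"

const mib = 1 << 20

// heapTarget approximates the heap size at which the next GC cycle starts:
// the live heap plus GOGC percent of it. GC roots are ignored for simplicity.
func heapTarget(liveHeap, gogc int64) int64 {
	return liveHeap + liveHeap*gogc/100
}

func main() {
	hardLimit := int64(1024 * mib) // hypothetical container hard limit

	// The OOM trap: a 600 MiB live heap gives a 1200 MiB target,
	// which lies above the 1024 MiB hard limit.
	big := heapTarget(600*mib, 100)
	fmt.Printf("live=600MiB target=%dMiB aboveLimit=%v\n", big/mib, big > hardLimit)

	// GC CPU wastage: a 50 MiB live heap triggers GC at 100 MiB,
	// although the container could hold about ten times that.
	small := heapTarget(50*mib, 100)
	fmt.Printf("live=50MiB target=%dMiB aboveLimit=%v\n", small/mib, small > hardLimit)
}
```

In the first case the container is killed before the GC ever starts its cycle; in the second the GC runs far more often than the available memory requires.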
The memory guardian: GOMEMLIMIT
GOMEMLIMIT is a godsend for memory-hungry Go applications like ours at Zomato. It sets a soft limit on the total memory managed by the Go runtime (the heap plus other runtime-managed memory). As usage approaches this threshold, which we set below the application’s memory hard limit, the garbage collector runs more frequently so that memory is reclaimed before the hard limit is breached. Because the limit acts as a safety net, the GC can also be tuned to run less often while usage is well below the threshold, which saves CPU cycles. This proactive approach prevents OOM crashes caused by a GC that runs too late and keeps our applications running smoothly and consistently.
It’s important to remember that GOMEMLIMIT cannot solve memory leaks. A memory leak occurs when your application holds onto some memory (live heap) that the garbage collector cannot reclaim. If the live heap itself exceeds the application’s hard memory limit, even GOMEMLIMIT won’t be able to prevent an OOM issue.
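For reference, the limit can be set either through the GOMEMLIMIT environment variable (for example GOMEMLIMIT=900MiB) or programmatically through runtime/debug. The snippet below is a minimal sketch of the programmatic route; the helper name setSoftLimit and the 900 MiB value are hypothetical and chosen only for illustration.

```go
package main

import (
	"fmt"
	"math"
	"runtime/debug"
)

// setSoftLimit applies limit as the runtime's soft memory limit and returns
// the value the runtime reports afterwards.
func setSoftLimit(limit int64) int64 {
	debug.SetMemoryLimit(limit)
	// A negative argument leaves the limit unchanged and only reports it.
	return debug.SetMemoryLimit(-1)
}

func main() {
	// When no limit has been configured, the runtime reports math.MaxInt64.
	before := debug.SetMemoryLimit(-1)
	fmt.Println("limit configured at startup:", before != math.MaxInt64)

	// Hypothetical value: 900 MiB, chosen to sit below a 1 GiB hard limit.
	applied := setSoftLimit(900 << 20)
	fmt.Println("soft limit in bytes:", applied)
}
```

Setting the limit alone makes the GC work harder near the threshold. Making it run less often below the threshold additionally requires raising GOGC (or setting GOGC=off), which this sketch leaves untouched.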
Introducing: Zomato/Go/Runtime Library
To roll out GOMEMLIMIT across our applications, we developed our in-house Zomato/go/runtime library, which simplifies the process for application developers. The library offers the following functionality:
Dynamic GOMEMLIMIT Calculation:
This feature eliminates the guesswork by dynamically calculating the optimal GOMEMLIMIT value during runtime. It considers the following factors:
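As a rough illustration of how such a calculation could work, the sketch below derives a soft limit from the container hard limit. Everything in it is an assumption rather than the library’s actual logic: it reads the cgroup v2 file /sys/fs/cgroup/memory.max, and the helper names and the 0.9 headroom ratio are hypothetical.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// parseMemoryMax parses the contents of a cgroup v2 memory.max file, which
// holds either a byte count or the literal "max" (meaning no limit).
func parseMemoryMax(contents string) (limit int64, ok bool) {
	s := strings.TrimSpace(contents)
	if s == "" || s == "max" {
		return 0, false
	}
	n, err := strconv.ParseInt(s, 10, 64)
	if err != nil || n <= 0 {
		return 0, false
	}
	return n, true
}

// softLimitFor leaves headroom below the hard limit for memory that the Go
// runtime does not manage. The ratio is an assumed tuning knob.
func softLimitFor(hardLimit int64, ratio float64) int64 {
	return int64(float64(hardLimit) * ratio)
}

func main() {
	data, err := os.ReadFile("/sys/fs/cgroup/memory.max")
	if err != nil {
		fmt.Println("no cgroup v2 memory limit found:", err)
		return
	}
	hard, ok := parseMemoryMax(string(data))
	if !ok {
		fmt.Println("container has no memory hard limit")
		return
	}
	fmt.Printf("hard limit=%d bytes, suggested GOMEMLIMIT=%d bytes\n",
		hard, softLimitFor(hard, 0.9))
}
```

The resulting value could then be passed to debug.SetMemoryLimit at startup.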
GOMAXPROCS
Setting GOMAXPROCS to an optimal value is required to avoid thrashing and to reduce unnecessary CPU throttling. Our library achieves this by setting it based on the ECS task-level CPU hard limit.
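A minimal sketch of this idea follows. It assumes the task-level CPU limit is already known in ECS CPU units, where 1024 units equal one vCPU; the helper name procsForCPUUnits and the 2048-unit example are hypothetical, and how the library actually obtains the limit is not covered here.

```go
package main

import (
	"fmt"
	"runtime"
)

// procsForCPUUnits converts an ECS CPU limit in CPU units (1024 units = 1 vCPU)
// into a GOMAXPROCS value. It rounds down so the Go scheduler never runs more
// threads in parallel than the limit allows, and never returns less than 1.
func procsForCPUUnits(units int) int {
	if units <= 0 {
		return runtime.NumCPU() // no limit known: fall back to the host CPU count
	}
	procs := units / 1024
	if procs < 1 {
		procs = 1
	}
	return procs
}

func main() {
	const taskCPUUnits = 2048 // hypothetical task-level limit: 2 vCPUs
	runtime.GOMAXPROCS(procsForCPUUnits(taskCPUUnits))

	// Calling GOMAXPROCS with 0 reports the current setting without changing it.
	fmt.Println("GOMAXPROCS is now", runtime.GOMAXPROCS(0))
}
```

Rounding down is a deliberate choice in this sketch: a fractional limit such as 1.5 vCPUs maps to 1, which trades some parallelism for less throttling.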
Enhanced Monitoring with Runtime Metrics Integration
This library goes beyond dynamic GOMEMLIMIT calculation. It leverages the Go runtime library to export crucial metrics related to:
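As a sketch of what reading such metrics can look like, the example below samples a few values through the standard runtime/metrics package. The helper name readUint64Metrics and the choice of metrics are assumptions made for illustration, not the library’s actual export list.

```go
package main

import (
	"fmt"
	"runtime/metrics"
)

// readUint64Metrics samples the named runtime metrics and returns those that
// exist in the running Go version and hold a uint64 value.
func readUint64Metrics(names ...string) map[string]uint64 {
	samples := make([]metrics.Sample, len(names))
	for i, name := range names {
		samples[i].Name = name
	}
	metrics.Read(samples)

	out := make(map[string]uint64, len(samples))
	for _, s := range samples {
		// Unknown metric names come back as KindBad and are skipped here.
		if s.Value.Kind() == metrics.KindUint64 {
			out[s.Name] = s.Value.Uint64()
		}
	}
	return out
}

func main() {
	values := readUint64Metrics(
		"/gc/cycles/total:gc-cycles",  // completed GC cycles
		"/gc/heap/goal:bytes",         // heap size target of the current cycle
		"/memory/classes/total:bytes", // all memory mapped by the Go runtime
	)
	for name, v := range values {
		fmt.Printf("%s = %d\n", name, v)
	}
}
```

A real exporter would sample these periodically and forward them to the monitoring system in use.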
Unveiling the Impact: A deep dive into our memory management success 🚀
We integrated the Zomato/go/runtime library into over 250 of our Go microservices, which yielded impressive outcomes:
Reduced GC CPU Usage by more than 95%: GOMEMLIMIT triggers garbage collection only when necessary, saving valuable CPU cycles for other critical tasks. GC CPU usage dropped significantly, from a considerable 25%-60% (~25% on average) down to under 2% for most of our applications (a decrease of around 90% on average). This resulted in a remarkable reduction of the overall CPU usage of our applications by up to 25%-50% (which in turn leads to an equivalent reduction in our EC2 compute costs 📈). This substantial improvement lets our applications operate more efficiently and handle higher loads with greater ease.
Enhanced Stability: By establishing a soft memory limit on when to run the GC, GOMEMLIMIT demonstrably reduced the risk of unexpected OOM ❌ issues and application crashes. This translates to a more reliable and stable user experience for our valued Zomato customers.
Reduced CPU throttling: By adjusting GOMAXPROCS to the optimal value using our library, we have reduced CPU throttling for our applications by up to 50%.
Enhanced connection management: With lower CPU usage, each application needs fewer containers to serve the same number of requests, so each downstream application has to maintain fewer connections to our application, reducing load on our service mesh.
Easy Debugging: The runtime library exports important metrics related to the garbage collector, heap, memory, CPU and ECS task limits, which help in debugging issues during incidents and in setting up alerts on critical metrics.
Cost Optimizations: By optimizing CPU usage and reducing the number of required AWS ECS tasks / EC2 instances, GOMEMLIMIT contributed to huge cost savings of around 30,000 USD per month.
The takeaway: Empowering developers
At Zomato, we’re constantly pushing the boundaries to enhance our platform’s stability and efficiency. Our adoption of GOMEMLIMIT is a testament to our commitment to delivering seamless experiences through optimized performance.
GOMEMLIMIT empowers our developers with unparalleled control over memory and CPU usage. By significantly reducing out-of-memory (OOM) errors and optimizing garbage collection, this powerful feature not only improves application performance but also delivers substantial cost savings. It has proven invaluable in enhancing the scalability and reliability of our applications, ensuring that Zomato remains at the forefront of technological innovation.
As part of our ongoing commitment to innovation, we are excited to announce our plans to adopt Profile-Guided Optimization (PGO) in Go. PGO, introduced as a preview in Go 1.20 and generally available since Go 1.21, uses real-world profiling data to guide the compiler in making more informed optimization decisions, promising to further enhance runtime performance by 2-14%. This initiative underscores our dedication to leveraging cutting-edge technologies to drive continuous improvement across our platform.
Join us in shaping the future of technology
At Zomato, we believe in creating an environment where innovation thrives and where every team member contributes to our success. Our commitment to excellence is reflected in initiatives like GOMEMLIMIT and our upcoming implementation of PGO. If you are passionate about technology and seek to work with a team that values innovation and impact, consider joining us on our journey to redefine the future of technology in the food industry and beyond. Reach out to us at techrecruitment@zomato.com to explore exciting career opportunities.
All content provided in this blog is for informational and educational purposes only. It is not professional advice, and should not be treated as such.
This blog was authored by Sakib Malik in collaboration with Saurabh Sabharwal and Aniket Suri under the guidance of Himanshu Rathore.