In an effort to help the company become profitable, Uber’s Maps Production Engineering department has focused on making infrastructure usage more efficient. As an outcome of this effort, they managed to develop a semi-automated Go garbage collection tuning mechanism that saved 70K CPU cores across 30 mission-critical services. The tuning library was mostly built in Go and ran on top of their cloud-native, scheduler-based infrastructure.
Drawing on their prior experience of increasing the efficiency of Java services by tuning garbage collection, the team’s exploratory profiling sessions revealed that almost 25% of their Go services’ CPU time was being spent in garbage collection activities (identified by the runtime.scanobject method).
Microservices within Uber’s application portfolio have significantly diverse memory utilization profiles. For instance, the shards of a sharded system can have quite different live sets: in one case the p99 utilization was 1GB while the p1 was 100MB, so the p1 instances were suffering a huge GC impact. Since a service is not aware of the memory limit of the container it runs in, it became obvious to the team that a fixed-value tuning approach would not be appropriate in their case.
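To see why a fixed GOGC penalizes small live sets, recall that the Go runtime (roughly) triggers the next collection once the heap reaches live set × (1 + GOGC/100). With the default GOGC of 100, an instance with a 100MB live set gets only about 100MB of allocation headroom before the next GC, while an instance with a 1GB live set gets about 1GB, even though both may run in identically sized containers.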
This led to the conception of GOGCTuner: a library that simplifies the process of tuning garbage collection for service owners and adds a layer of reliability on top. The tuner dynamically computes the correct GOGC value in accordance with the container’s memory limit (or an upper limit set by the service owner) and sets it using Go’s runtime API.
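Uber has not published the library, but the core idea can be sketched in a few lines. In the sketch below, the helper names, the fixed 10-second re-evaluation interval, and the use of HeapAlloc as a proxy for the live set are all assumptions made for illustration:

```go
package main

import (
	"runtime"
	"runtime/debug"
	"time"
)

// computeGOGC returns the GOGC value that lets the heap grow from the
// current live set up to targetBytes before the next collection.
func computeGOGC(liveBytes, targetBytes uint64) int {
	if liveBytes == 0 || targetBytes <= liveBytes {
		return 100 // fall back to the Go default
	}
	return int(float64(targetBytes-liveBytes) / float64(liveBytes) * 100)
}

// tuneLoop periodically recomputes GOGC and applies it via the runtime API.
func tuneLoop(targetBytes uint64) {
	ticker := time.NewTicker(10 * time.Second) // assumed interval
	defer ticker.Stop()
	var m runtime.MemStats
	for range ticker.C {
		runtime.ReadMemStats(&m)
		// HeapAlloc is used here as a rough proxy for the live set.
		debug.SetGCPercent(computeGOGC(m.HeapAlloc, targetBytes))
	}
}

func main() {
	go tuneLoop(700 << 20) // e.g. 70% of a hypothetical 1GiB container limit
	select {}              // stand-in for the service's real work
}
```

Calling debug.SetGCPercent from inside the application is what lets the effective target follow the live set as it changes, instead of being frozen at deploy time through the GOGC environment variable.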
The library was built with the following features:
- Simplified configuration for easier reasoning and deterministic calculations.
- Protection against Out Of Memory (OOM) kills: the library reads the memory limit from the cgroup and uses a default hard limit of 70% of it, a value the team considers safe based on its experience. Nevertheless, this protection has a limit; it can only adjust the buffer allocation, so if a service’s live objects exceed that limit, the tuner sets a default lower bound of 1.25X the live objects’ utilization (both thresholds appear in the sketch after this list).
- Allowing higher GOGC values while still handling corner cases: if the live dataset doubles at peak, the tuner enforces the same memory limit at the cost of more CPU, whereas a manually fixed value could cause an OOM.
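Continuing the earlier sketch, the two thresholds above could combine into a single heap target roughly as follows; only the 70% and 1.25X figures come from the description, the function itself is illustrative:

```go
// targetHeapBytes derives a heap target from the container's cgroup memory
// limit: by default 70% of the limit, but never less than 1.25x the current
// live set, so the tuner always leaves some allocation buffer.
func targetHeapBytes(cgroupLimitBytes, liveBytes uint64) uint64 {
	hardLimit := uint64(float64(cgroupLimitBytes) * 0.70)
	floor := uint64(float64(liveBytes) * 1.25)
	if hardLimit < floor {
		return floor
	}
	return hardLimit
}
```

Fed into computeGOGC from the earlier sketch, a 100MB live set in a 1GB container yields a 700MB target and a GOGC of roughly 600, while a live set that grows beyond 700MB falls back to the 1.25X floor and a correspondingly low GOGC of 25.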
To improve observability during this effort, the team tracked several metrics:
- Intervals between garbage collections: the Go runtime forces a GC at least once every two minutes, so if this graph shows collections pinned at that interval, GOGC tuning has reached its limit and the team needs to work on allocation optimisations.
- GC CPU impact: this enabled the team to observe CPU utilization and understand how much the services are being affected.
- Live dataset size: while total memory utilization rises and falls between collections, this metric let the team observe the steady live usage.
- GOGC value: to understand how the tuner is reacting to different values.
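Most of these signals can be approximated from the Go runtime itself. Below is a minimal sketch using runtime.ReadMemStats, where treating HeapAlloc as the live dataset and sampling on a fixed interval are simplifying assumptions rather than Uber’s actual instrumentation:

```go
package main

import (
	"log"
	"runtime"
	"time"
)

// reportGCMetrics logs a rough view of the signals listed above using only
// the standard runtime package. Sampling it periodically (and diffing NumGC
// and LastGC between samples) also yields the intervals between collections.
func reportGCMetrics() {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)

	lastGC := time.Unix(0, int64(m.LastGC))
	log.Printf("GC cycles: %d (last at %s)", m.NumGC, lastGC.Format(time.RFC3339))
	// Fraction of this program's available CPU spent in GC since startup.
	log.Printf("GC CPU fraction: %.2f%%", m.GCCPUFraction*100)
	// HeapAlloc shortly after a collection approximates the live dataset.
	log.Printf("heap allocated: %d MiB, next GC target: %d MiB",
		m.HeapAlloc>>20, m.NextGC>>20)
}

func main() {
	for range time.Tick(30 * time.Second) {
		reportGCMetrics()
	}
}
```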
This effort ultimately saved 70K CPU cores across 30 mission-critical services. Taking into account that many of the tools used by Uber’s infrastructure are built in Go as well (Kubernetes, Prometheus, and Jaeger, among others), any large-scale deployment outside Uber could benefit from similar memory tuning.
Even though the tool remains for Uber’s internal use only, other Go developers have been inspired by these efforts and have developed related open source tools.