Intuit recently discussed how it used Generative AI (GenAI) experiments to manage the complexity of monitoring and debugging Kubernetes clusters. The experiments aimed to streamline detection, debugging, and remediation.
Lili Wan, senior staff software engineer, and Anusha Ragunathan, principal software engineer at Intuit, detailed the experiment and provided a background of Intuit's Kubernetes Service platform.
With over 325 Kubernetes clusters supporting more than 7,000 applications and services, Intuit faced challenges maintaining cluster health and minimizing alert fatigue among on-call engineers.
Intuit's Kubernetes Service platform is vast and complex, which makes it difficult to observe and debug effectively. The rapid growth of applications and frequent changes in clusters add further layers of complexity. Engineers often experience alert fatigue from the overwhelming volume of data sources and alerts, which complicates the detection and remediation of issues.
The team at Intuit identified three key areas for improvement: detection, debugging, and remediation.
To enhance detection capabilities, Intuit implemented a system called "Cluster Golden Signals," which mirrors the concept of service golden signals. This system provides a consolidated view of a cluster's health by filtering out noise and focusing on critical signals for alerting.
Core components of Kubernetes clusters are monitored through dashboards that aggregate metrics into a single health indicator—Healthy, Degraded, or Critical—using Prometheus expressions. This approach allows engineers to quickly isolate problematic clusters and determine whether issues are service-related or platform-related, thus reducing the mean time to detect issues (MTTD).
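The roll-up into a single indicator can be illustrated with a minimal Python sketch. It is not Intuit's implementation, which the talk describes as Prometheus expressions. The signal names and thresholds below are hypothetical, and the "worst component wins" rule is an assumption made for illustration.

```python
from enum import IntEnum


class Health(IntEnum):
    """Ordered so that a larger value means a worse state."""
    HEALTHY = 0
    DEGRADED = 1
    CRITICAL = 2


# Hypothetical signals and thresholds: (degraded_at, critical_at).
# Real values would come from the platform team's own alerting rules.
THRESHOLDS = {
    "apiserver_error_ratio": (0.01, 0.05),
    "etcd_leader_changes_per_hour": (3, 10),
    "nodes_not_ready_ratio": (0.05, 0.20),
}


def classify(signal, value):
    """Map one metric value to a health state using its thresholds."""
    degraded_at, critical_at = THRESHOLDS[signal]
    if value >= critical_at:
        return Health.CRITICAL
    if value >= degraded_at:
        return Health.DEGRADED
    return Health.HEALTHY


def cluster_health(signals):
    """Assumed aggregation rule: the cluster is as unhealthy as its worst signal."""
    return max(
        (classify(name, value) for name, value in signals.items()),
        default=Health.HEALTHY,
    )
```

Taking the maximum keeps the indicator conservative: a single critical component marks the whole cluster Critical, which matches the goal of quickly isolating problematic clusters.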
For deeper debugging, Intuit integrated an open-source tool called K8sGPT. This tool scans Kubernetes clusters to diagnose and triage issues by leveraging knowledge codified by site reliability engineers. K8sGPT uses resource-specific analyzers to extract relevant error messages from clusters, enriching them with AI insights. By combining Prometheus metrics with Golden Signals, K8sGPT can prompt public models to search for additional details on errors.
This integration provides more context to identify potential root causes of alerts.
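To make the analyzer idea concrete, here is a rough Python sketch of a resource-specific check that extracts error messages from pod status and turns them into a prompt. It is not K8sGPT's code (the project is written in Go). The function names `analyze_pods` and `build_prompt` and the prompt wording are hypothetical. Only the pod status field names follow the Kubernetes API.

```python
# Waiting reasons that commonly indicate a failing container.
FAILURE_REASONS = {
    "CrashLoopBackOff",
    "ImagePullBackOff",
    "ErrImagePull",
    "CreateContainerConfigError",
}


def analyze_pods(pods):
    """Scan pod objects (dicts shaped like Kubernetes API JSON) for failing containers."""
    findings = []
    for pod in pods:
        meta = pod.get("metadata", {})
        pod_ref = f'{meta.get("namespace", "default")}/{meta.get("name", "unknown")}'
        for cs in pod.get("status", {}).get("containerStatuses", []):
            waiting = cs.get("state", {}).get("waiting")
            if waiting and waiting.get("reason") in FAILURE_REASONS:
                findings.append({
                    "pod": pod_ref,
                    "container": cs.get("name", "unknown"),
                    "reason": waiting["reason"],
                    "message": waiting.get("message", ""),
                })
    return findings


def build_prompt(finding, cluster_health):
    """Combine one finding with a cluster-level health signal into a model prompt."""
    return (
        f"Cluster health signal: {cluster_health}\n"
        f"Pod {finding['pod']}, container {finding['container']} "
        f"is waiting with reason {finding['reason']}.\n"
        f"Message: {finding['message']}\n"
        "Explain the likely cause and suggest next debugging steps."
    )
```

The sketch stops at the prompt text on purpose. Which model receives it, and how, depends on the configured backend.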
Source: GenAI Experiments: Monitoring and Debugging Kubernetes Cluster Health
As an aside, K8sGPT was among the top 10 CNCF projects by contributions. The first commit was in March 2023. Currently, the project has 5.6K stars and 88 contributors. Installed in a Kubernetes cluster, K8sGPT supports AI backends such as OpenAI, Azure, Cohere, Amazon Bedrock, Google Gemini and local models. K8sGPT was featured alongside other projects like kube-burner, Kuasar, KRKN, and Easegress during the KubeCon EU 2024 conference.
K8sGPT runs on Windows, macOS and Linux machines and can be installed via Homebrew or as an RPM, DEB or APK package.
Once issues are debugged, remediation is the next step. K8sGPT integrates with public Large Language Models (LLMs) from companies like OpenAI, Google, and Microsoft to suggest remediation steps for Kubernetes-specific errors. However, public LLMs lack context about Intuit's specific platform configurations.
To address this gap, Intuit has developed a proprietary GenAI operating system (GenOS), which hosts local models augmented with Intuit-specific data through retrieval-augmented generation (RAG).
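The general shape of retrieval-augmented generation can be sketched in a few lines of Python. This illustrates the technique only and says nothing about how GenOS works internally, which the talk does not detail. The document structure, the keyword-overlap scoring (a stand-in for the embedding search a production system would more likely use), and the prompt wording are all assumptions.

```python
import re


def tokenize(text):
    """Lowercase a string and split it into a set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query, documents, k=2):
    """Return up to k documents sharing the most tokens with the query.

    Keyword overlap is a deliberately simple stand-in for vector similarity.
    """
    query_tokens = tokenize(query)
    ranked = sorted(
        documents,
        key=lambda d: len(query_tokens & tokenize(d["text"])),
        reverse=True,
    )
    # Drop documents with no overlap at all.
    return [d for d in ranked[:k] if query_tokens & tokenize(d["text"])]


def augment(query, documents, k=2):
    """Build a prompt that places retrieved internal context ahead of the question."""
    context = retrieve(query, documents, k)
    context_text = "\n".join(f"[{d['id']}] {d['text']}" for d in context)
    return (
        "Use the internal context below when answering.\n"
        f"Context:\n{context_text or '(no matching documents)'}\n"
        f"Question: {query}"
    )
```

The augmented prompt is what would be sent to a locally hosted model, so the model sees platform-specific runbook text that a public LLM would not have.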
Intuit plans to continue monitoring progress in reducing MTTD and mean time to resolution (MTTR). The company also aims to explore GenAI's potential applications in other areas, such as traffic management and Java virtual machine debugging.