Intuit Engineering's Approach to Simplifying Kubernetes Management with AI

Sep 29, 2024 2 min read

Intuit recently described how it used Generative AI (GenAI) experiments to manage the complexity of monitoring and debugging its Kubernetes clusters. The experiments aimed to streamline detection, debugging, and remediation.

Lili Wan, senior staff software engineer, and Anusha Ragunathan, principal software engineer at Intuit, detailed the experiment and provided a background of Intuit's Kubernetes Service platform.

With over 325 Kubernetes clusters supporting more than 7,000 applications and services, Intuit faced challenges maintaining cluster health and minimizing alert fatigue among on-call engineers.

Intuit's Kubernetes Service platform is vast and complex, making it difficult to observe and debug effectively. The rapid growth of applications and frequent changes in clusters added further layers of complexity. The sheer volume of data sources and alerts overwhelmed on-call engineers, which complicated both the detection and the remediation of issues.

The team at Intuit identified three key areas for improvement: detection, debugging, and remediation.

To enhance detection capabilities, Intuit implemented a system called "Cluster Golden Signals," which mirrors the concept of service golden signals. This system provides a consolidated view of a cluster's health by filtering out noise and focusing on critical signals for alerting.

Core components of Kubernetes clusters are monitored through dashboards that aggregate metrics into a single health indicator—Healthy, Degraded, or Critical—using Prometheus expressions. This approach allows engineers to quickly isolate problematic clusters and determine whether issues are service-related or platform-related, thus reducing the mean time to detect issues (MTTD).
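The roll-up into a single indicator can be illustrated with a minimal Python sketch. It is not Intuit's implementation: the component names, the thresholds, and the choice to report the worst component state are all assumptions made for illustration, and the error ratios are presumed to have been evaluated already from Prometheus expressions.

```python
from enum import IntEnum


class Health(IntEnum):
    """Ordered so that a larger value means a worse state."""
    HEALTHY = 0
    DEGRADED = 1
    CRITICAL = 2


# Hypothetical thresholds per component: (degraded_at, critical_at) error ratios.
THRESHOLDS = {
    "apiserver": (0.01, 0.05),
    "etcd": (0.01, 0.05),
    "coredns": (0.02, 0.10),
}


def component_health(name: str, error_ratio: float) -> Health:
    """Map one component's error ratio onto a health state."""
    degraded_at, critical_at = THRESHOLDS[name]
    if error_ratio >= critical_at:
        return Health.CRITICAL
    if error_ratio >= degraded_at:
        return Health.DEGRADED
    return Health.HEALTHY


def cluster_health(error_ratios: dict[str, float]) -> Health:
    """Roll component states up to the worst one (assumed aggregation rule)."""
    return max(
        (component_health(name, ratio) for name, ratio in error_ratios.items()),
        default=Health.HEALTHY,
    )


if __name__ == "__main__":
    sample = {"apiserver": 0.002, "etcd": 0.03, "coredns": 0.0}
    print(cluster_health(sample).name)  # DEGRADED
```

Taking the worst component state keeps the indicator conservative: one unhealthy core component is enough to flag the whole cluster for attention.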

For deeper debugging, Intuit integrated an open-source tool called K8sGPT. This tool scans Kubernetes clusters to diagnose and triage issues by leveraging knowledge codified by site reliability engineers. K8sGPT uses resource-specific analyzers to extract relevant error messages from clusters, enriching them with AI insights. By combining Prometheus metrics with Golden Signals, K8sGPT can prompt public models to search for additional details on errors.

This integration provides more context to identify potential root causes of alerts.

Source: GenAI Experiments: Monitoring and Debugging Kubernetes Cluster Health
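The idea of resource-specific analyzers feeding a model prompt can be sketched in Python as follows. This is an illustrative outline only, not K8sGPT's code (K8sGPT itself is written in Go): the `Finding` type, the `analyze_pod` function, the `ANALYZERS` registry, and the prompt wording are hypothetical. The only real detail relied on is the shape of a Pod's `status.containerStatuses[].state.waiting.reason` field.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Finding:
    kind: str
    name: str
    error: str


def analyze_pod(pod: dict) -> list[Finding]:
    """Hypothetical Pod analyzer: flag containers stuck in a waiting state."""
    findings = []
    for status in pod.get("status", {}).get("containerStatuses", []):
        waiting = status.get("state", {}).get("waiting")
        if waiting and waiting.get("reason") in {"CrashLoopBackOff", "ImagePullBackOff"}:
            findings.append(
                Finding("Pod", pod["metadata"]["name"],
                        f"container {status['name']}: {waiting['reason']}")
            )
    return findings


# One analyzer per resource kind; more kinds would be registered here.
ANALYZERS: dict[str, Callable[[dict], list[Finding]]] = {"Pod": analyze_pod}


def build_prompt(findings: list[Finding], cluster_state: str) -> str:
    """Combine the cluster health signal with extracted errors into one prompt."""
    lines = [f"Cluster health signal: {cluster_state}", "Errors found:"]
    lines += [f"- {f.kind}/{f.name}: {f.error}" for f in findings]
    lines.append("Explain the likely root cause and suggest a fix.")
    return "\n".join(lines)


if __name__ == "__main__":
    pod = {
        "kind": "Pod",
        "metadata": {"name": "web-0"},
        "status": {"containerStatuses": [
            {"name": "app", "state": {"waiting": {"reason": "CrashLoopBackOff"}}}
        ]},
    }
    print(build_prompt(ANALYZERS[pod["kind"]](pod), "Degraded"))
```

Keeping extraction separate from prompting means only the short, relevant error text is sent to the model, rather than the full resource manifest.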

As an aside, K8sGPT was among the ten CNCF projects receiving the most contributions. The first commit was in March 2023. Currently, the project has 5.6K stars and 88 contributors. Installed in a Kubernetes cluster, K8sGPT supports AI backends such as OpenAI, Azure, Cohere, Amazon Bedrock, Google Gemini, and local models. K8sGPT was featured alongside other projects such as kube-burner, Kuasar, KRKN, and Easegress at the KubeCon EU 2024 conference.

K8sGPT runs on Windows, macOS, and Linux machines and can be installed via Homebrew or as an RPM, DEB, or APK package.

Once issues are debugged, remediation is the next step. K8sGPT integrates with public Large Language Models (LLMs) from companies like OpenAI, Google, and Microsoft to suggest remediation steps for Kubernetes-specific errors. However, public LLMs lack context about Intuit's specific platform configurations.

To address this gap, Intuit has developed a proprietary GenAI operating system (GenOS), which hosts local models augmented with Intuit-specific data through retrieval-augmented generation (RAG).
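Retrieval-augmented generation can be sketched in a few lines of Python. The example below is a toy under stated assumptions and says nothing about how GenOS works internally: the runbook snippets are invented, and retrieval uses simple word overlap where a production system would more likely use embedding similarity.

```python
import re

# Hypothetical internal runbook snippets standing in for platform-specific data.
RUNBOOK = [
    "CrashLoopBackOff on ingress pods: check the platform sidecar config map.",
    "ImagePullBackOff: verify the cluster can reach the internal registry mirror.",
    "Node NotReady: cordon the node and let the autoscaler replace it.",
]


def tokens(text: str) -> set[str]:
    """Lowercased alphanumeric words in the text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query: str, docs: list[str], k: int = 1) -> list[str]:
    """Return the k documents sharing the most words with the query."""
    query_tokens = tokens(query)
    ranked = sorted(docs, key=lambda d: len(query_tokens & tokens(d)), reverse=True)
    return ranked[:k]


def augmented_prompt(error: str, docs: list[str]) -> str:
    """Prepend retrieved context so the model sees platform-specific guidance."""
    context = "\n".join(f"- {d}" for d in retrieve(error, docs))
    return f"Context:\n{context}\n\nError: {error}\nSuggest remediation steps."


if __name__ == "__main__":
    print(augmented_prompt("pod stuck in ImagePullBackOff", RUNBOOK))
```

The model's weights stay unchanged in this approach; platform-specific knowledge reaches it only through the retrieved context placed in the prompt.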

Intuit plans to continue monitoring progress in reducing MTTD and mean time to resolution (MTTR). They also aim to explore GenAI's potential applications in other areas, such as traffic management and Java virtual machine debugging.

About the Author

Aditya Kulkarni

Aditya has worked as a tech-aware product delivery leader throughout his career. He has worked with different organizations on their journey to agility and DevOps transformation. An avid reader, he is always interested in keeping an eye on the latest in the world of software development!
