Kuberhealthy, an open source solution developed by Comcast, detects Kubernetes issues by performing synthetic tests within Kubernetes clusters. Kuberhealthy reports test results via a JSON status page as well as a Prometheus metric endpoint, providing flexible options for alerting on Kuberhealthy metrics.
By replicating real Kubernetes workflows, Kuberhealthy attempts to identify production issues that might otherwise go unnoticed. Potential issues Kuberhealthy detects include pods that get stuck in the "Terminating" state due to CNI communication failures, pods that get stuck in the "ContainerCreating" state due to disk provisioning errors, and pods that restart too frequently. To identify these types of issues, Kuberhealthy runs several checks in parallel:
- Daemonset deployment and termination: Deploys a daemonset to the Kuberhealthy namespace, waits for all pods to reach the "Ready" state, terminates the pods, and verifies that the terminations were successful.
- Component health: Checks the cluster component statuses and alerts if a status is down for more than five minutes.
- Excessive pod restarts: Checks whether any pod in the provided namespace (kube-system by default) has restarted more than five times in an hour.
- Pod status: Checks for pods that are older than ten minutes and not in a "Ready" state.
- DNS: Checks for DNS failures within and outside the cluster.
Additional tests are planned for future versions, including service provisioning, DNS resolution, and disk provisioning.
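The two threshold-based checks in the list above (excessive pod restarts and pod status) can be illustrated with a minimal sketch. This is not Kuberhealthy's implementation. It only applies the thresholds described in the list to plain Python data, and the function names and the pod dictionary keys (`name`, `created`, `ready`) are assumptions made for illustration.

```python
from datetime import datetime, timedelta, timezone

# Thresholds taken from the check descriptions in the list above.
MAX_RESTARTS_PER_HOUR = 5
MAX_NOT_READY_AGE = timedelta(minutes=10)


def excessive_restarts(restart_times, now, window=timedelta(hours=1)):
    """Return True if more than MAX_RESTARTS_PER_HOUR restarts fall
    inside the window that ends at `now`."""
    recent = [t for t in restart_times if now - window <= t <= now]
    return len(recent) > MAX_RESTARTS_PER_HOUR


def stuck_pods(pods, now):
    """Return the names of pods that are older than MAX_NOT_READY_AGE
    and not Ready.

    `pods` is a list of dicts with hypothetical keys: name, created, ready.
    """
    return [
        pod["name"]
        for pod in pods
        if not pod["ready"] and now - pod["created"] > MAX_NOT_READY_AGE
    ]
```

A real check would read restart counts and pod conditions from the Kubernetes API rather than from in-memory lists.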
If any Kuberhealthy test fails or returns an error, the details are reported on a JSON status page available at http://kuberhealthy.kuberhealthy. The status page contains a boolean OK field that indicates the overall Kuberhealthy status and a check details JSON object for each Kuberhealthy check, which includes an Errors array listing the error descriptions for that check. Additional information on each check, such as when it was last run, can also be found in the check details object.
{
  "OK": true,
  "Errors": [],
  "CheckDetails": {
    "ComponentStatusChecker": {
      "OK": true,
      "Errors": [],
      "LastRun": "2018-06-21T17:32:16.921733843Z",
      "AuthoritativePod": "kuberhealthy-7cf79bdc86-m78qr"
    },
    "DaemonSetChecker": {
      "OK": true,
      "Errors": [],
      "LastRun": "2018-06-21T17:31:33.845218901Z",
      "AuthoritativePod": "kuberhealthy-7cf79bdc86-m78qr"
    },
    "PodRestartChecker namespace kube-system": {
      "OK": true,
      "Errors": [],
      "LastRun": "2018-06-21T17:31:16.45395092Z",
      "AuthoritativePod": "kuberhealthy-7cf79bdc86-m78qr"
    },
    "PodStatusChecker namespace kube-system": {
      "OK": true,
      "Errors": [],
      "LastRun": "2018-06-21T17:32:16.453911089Z",
      "AuthoritativePod": "kuberhealthy-7cf79bdc86-m78qr"
    }
  },
  "CurrentMaster": "kuberhealthy-7cf79bdc86-m78qr"
}
Status page example from the Kuberhealthy README.md
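As a rough sketch of how a consumer might interpret this status page, the following Python snippet extracts the failing checks from a status document. It relies only on the `OK`, `Errors`, and `CheckDetails` fields shown in the example. The function names, the sample document, and its error message are illustrative assumptions, not output from a real cluster.

```python
import json

# Illustrative status document; the error text is made up for this example.
SAMPLE = """
{
  "OK": false,
  "Errors": ["illustrative error message"],
  "CheckDetails": {
    "DaemonSetChecker": {"OK": false, "Errors": ["illustrative error message"]},
    "ComponentStatusChecker": {"OK": true, "Errors": []}
  }
}
"""


def is_healthy(status_json):
    """Return the top-level OK flag, treating a missing flag as unhealthy."""
    return bool(json.loads(status_json).get("OK", False))


def failing_checks(status_json):
    """Map the name of each failing check to its list of error strings."""
    status = json.loads(status_json)
    return {
        name: details.get("Errors") or []
        for name, details in status.get("CheckDetails", {}).items()
        if not details.get("OK", False)
    }


if __name__ == "__main__":
    print(is_healthy(SAMPLE))
    print(failing_checks(SAMPLE))
```

In practice the document would be fetched over HTTP from the status page URL inside the cluster instead of being read from a string.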
Kuberhealthy can be installed with Helm or YAML spec files and is reachable only from inside the cluster. Once installed, Kuberhealthy runs two instances with a pod disruption budget and a rolling update strategy to ensure high availability. Kuberhealthy provides a Prometheus ServiceMonitor configuration to integrate with Prometheus alerts, as well as a template for installing a Grafana dashboard.
Comcast developed Kuberhealthy out of a need to monitor the health and stability of their Kubernetes clusters and integrate with existing monitoring tools, such as Prometheus. By mimicking real workloads, Kuberhealthy provided Comcast with a more robust monitoring solution for Kubernetes.
Other approaches to monitoring Kubernetes cluster health include the kubelet, which aggregates pod resource usage statistics, and cAdvisor, which collects CPU, memory, filesystem, and network usage statistics. Grafana provides a plugin to collect and visualize these metrics via Prometheus. Beyond the core Kubernetes tooling, the kube-state-metrics add-on listens to the Kubernetes API server and collects metrics on the health of various objects, such as deployments, nodes, and pods. As with Kuberhealthy, these metrics are exposed as plaintext on a metrics endpoint that can integrate with Prometheus.
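To make the plaintext metrics format concrete, here is a minimal sketch of a parser for simple lines in the Prometheus text exposition format. It is a simplification: it skips comment lines, ignores timestamps, and does not unescape label values. The metric names used with it (for example `demo_check`) are hypothetical and are not taken from Kuberhealthy or kube-state-metrics.

```python
import re

# metric_name{label="value",...} value
_SAMPLE_LINE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)'
    r'(?:\{(?P<labels>.*)\})?'
    r'\s+(?P<value>\S+)'
)
_LABEL = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text):
    """Parse exposition-format text into (name, labels, value) tuples.

    Lines starting with '#' (HELP and TYPE comments) and lines that do
    not look like samples are skipped.
    """
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = _SAMPLE_LINE.match(line)
        if match is None:
            continue
        labels = dict(_LABEL.findall(match.group("labels") or ""))
        samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples
```

For production use, an existing Prometheus client library is a safer choice than a hand-written parser.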
To get started with Kuberhealthy, follow the installation guide or learn more in the #kuberhealthy channel of the Kubernetes Slack.