Netflix's engineering team detailed how it investigated, identified, and resolved an issue with "orphaned" pods that was causing friction for engineers using Titus, tracing the journey from kernel panics to Kubernetes (k8s) and ultimately giving operators the tools to understand why nodes go away.
Titus is Netflix's container management platform, open-sourced in 2018 and designed to run containers at scale in the cloud. It is tailored to the unique requirements and challenges of Netflix's massive, dynamic, and high-traffic streaming service.
The orphaned pods, though a minority in the system, pose a significant concern for batch users, who face uncertainties without clear return codes to guide their retry decisions. Orphaned pods result from the disappearance of the underlying Kubernetes node object. When a node goes away, a garbage collection (GC) process is triggered, leading to the deletion of associated pods. To enhance user experience, Titus employs a custom controller to maintain a history of Pod and Node objects, ensuring transparency. However, the absence of a satisfying explanation for why the agent was lost prompted further investigation into the root causes.
Nodes can vanish for various reasons, particularly in cloud environments. Cloud vendors typically run Kubernetes cloud controllers that detect the loss of an underlying server and delete the corresponding Kubernetes node object. However, this doesn't answer the crucial question of why nodes go away. To address this, Netflix's engineering team introduced an annotation to capture termination reasons, providing the information needed to understand node disappearances.
```json
{
  "apiVersion": "v1",
  "kind": "Pod",
  "metadata": {
    "annotations": {
      "pod.titus.netflix.com/pod-termination-reason": "Something really bad happened!",
...
```
The addition of the `pod-termination-reason` annotation became a pivotal step. Titus incorporated it into the garbage collector controllers as well as into any process that could unexpectedly terminate a pod or node, ensuring a holistic approach. Unlike patching the status, using annotations preserves the integrity of the pod object for historical purposes. Various termination reasons, such as preempted jobs, hardware failures, user interventions, or kernel panics, are now captured as human-readable messages.
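As an illustration, the following Go sketch shows how a controller process might record the reason with a JSON merge patch before issuing the delete, leaving the rest of the pod object untouched. The client-go wiring, namespace, pod name, and function name are assumptions for the example, not Titus's actual code.

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// Annotation key from the example above.
const terminationReasonKey = "pod.titus.netflix.com/pod-termination-reason"

// annotateAndDeletePod records a human-readable termination reason on the pod
// and only then deletes it, so the reason survives in the pod's history.
func annotateAndDeletePod(ctx context.Context, client kubernetes.Interface, namespace, name, reason string) error {
	// A merge patch touches only the annotation; the rest of the pod,
	// including its status, is left intact for historical purposes.
	patch := []byte(fmt.Sprintf(`{"metadata":{"annotations":{%q:%q}}}`, terminationReasonKey, reason))
	if _, err := client.CoreV1().Pods(namespace).Patch(ctx, name, types.MergePatchType, patch, metav1.PatchOptions{}); err != nil {
		return fmt.Errorf("annotating pod %s/%s: %w", namespace, name, err)
	}
	return client.CoreV1().Pods(namespace).Delete(ctx, name, metav1.DeleteOptions{})
}

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)
	// Placeholder namespace, pod name, and reason.
	if err := annotateAndDeletePod(context.Background(), client, "default", "example-pod", "Something really bad happened!"); err != nil {
		panic(err)
	}
}
```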
Dealing with kernel panics presented a unique challenge, given the limited options available when the Linux kernel panics. Inspired by Google Spanner's "last gasp" concept, where nodes send UDP packets upon critical failures, Titus implemented a solution using the netconsole module. Configuring netconsole involves setting up the Linux kernel to send UDP packets upon kernel panic, allowing the platform to capture vital information even during catastrophic failures.
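The article does not spell out Netflix's exact netconsole setup, but one common way to configure a dynamic netconsole target is through the kernel's configfs interface. The Go sketch below creates a target and enables it; it assumes the netconsole module is loaded with dynamic reconfiguration support, and the device name, collector address, port, and MAC are placeholder assumptions.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// Placeholder values: the device, collector IP/port, and next-hop MAC are
// assumptions for illustration only.
const (
	targetDir  = "/sys/kernel/config/netconsole/panic-target"
	device     = "eth0"
	remoteIP   = "192.168.1.100" // UDP collector that receives the "last gasp"
	remotePort = "6666"
	remoteMAC  = "aa:bb:cc:dd:ee:ff" // MAC of the collector or gateway next hop
)

func write(attr, value string) error {
	return os.WriteFile(filepath.Join(targetDir, attr), []byte(value), 0o644)
}

func main() {
	// Creating a directory under the netconsole configfs tree instantiates
	// a new dynamic target (requires netconsole with CONFIG_NETCONSOLE_DYNAMIC).
	if err := os.Mkdir(targetDir, 0o755); err != nil && !os.IsExist(err) {
		panic(err)
	}

	for attr, value := range map[string]string{
		"dev_name":    device,
		"remote_ip":   remoteIP,
		"remote_port": remotePort,
		"remote_mac":  remoteMAC,
	} {
		if err := write(attr, value); err != nil {
			panic(fmt.Errorf("setting %s: %w", attr, err))
		}
	}

	// Enabling the target makes the kernel mirror console output, including
	// panic messages, to the remote UDP endpoint.
	if err := write("enabled", "1"); err != nil {
		panic(err)
	}
}
```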
The final step is connecting to Kubernetes and implementing a controller to:
- Listen for netconsole UDP packets.
- Identify kernel panics and associate them with k8s node objects.
- Annotate and delete pods associated with the panicked node.
- Annotate and delete the panicked node.
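A minimal sketch of such a controller might look like the following, assuming nodes send their netconsole "last gasp" to UDP port 6666 and can be matched to Kubernetes node objects by their InternalIP. The panic-detection heuristic, port, and helper names are illustrative rather than Titus's actual implementation.

```go
package main

import (
	"context"
	"fmt"
	"log"
	"net"
	"strings"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

const terminationReasonKey = "pod.titus.netflix.com/pod-termination-reason"

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// Listen for netconsole "last gasp" UDP packets (port 6666 is a placeholder).
	conn, err := net.ListenUDP("udp", &net.UDPAddr{Port: 6666})
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	buf := make([]byte, 4096)
	for {
		n, addr, err := conn.ReadFromUDP(buf)
		if err != nil {
			continue
		}
		msg := string(buf[:n])
		// Crude heuristic: only react to messages that look like a kernel panic.
		if !strings.Contains(msg, "Kernel panic") {
			continue
		}
		if err := handlePanic(context.Background(), client, addr.IP.String(), msg); err != nil {
			log.Printf("handling panic from %s: %v", addr.IP, err)
		}
	}
}

// handlePanic maps the sender IP to a node, annotates the node and its pods with
// the panic message, then deletes them so rescheduling starts immediately.
func handlePanic(ctx context.Context, client kubernetes.Interface, nodeIP, msg string) error {
	node, err := findNodeByIP(ctx, client, nodeIP)
	if err != nil {
		return err
	}
	reason := fmt.Sprintf("Node had a kernel panic: %s", msg)
	patch := []byte(fmt.Sprintf(`{"metadata":{"annotations":{%q:%q}}}`, terminationReasonKey, reason))

	// Annotate and delete every pod bound to the panicked node (best effort).
	pods, err := client.CoreV1().Pods("").List(ctx, metav1.ListOptions{
		FieldSelector: "spec.nodeName=" + node.Name,
	})
	if err != nil {
		return err
	}
	for _, pod := range pods.Items {
		p := client.CoreV1().Pods(pod.Namespace)
		p.Patch(ctx, pod.Name, types.MergePatchType, patch, metav1.PatchOptions{})
		p.Delete(ctx, pod.Name, metav1.DeleteOptions{})
	}

	// Annotate and delete the panicked node itself.
	client.CoreV1().Nodes().Patch(ctx, node.Name, types.MergePatchType, patch, metav1.PatchOptions{})
	return client.CoreV1().Nodes().Delete(ctx, node.Name, metav1.DeleteOptions{})
}

// findNodeByIP returns the node whose InternalIP matches the packet's source address.
func findNodeByIP(ctx context.Context, client kubernetes.Interface, ip string) (*corev1.Node, error) {
	nodes, err := client.CoreV1().Nodes().List(ctx, metav1.ListOptions{})
	if err != nil {
		return nil, err
	}
	for i := range nodes.Items {
		for _, a := range nodes.Items[i].Status.Addresses {
			if a.Type == corev1.NodeInternalIP && a.Address == ip {
				return &nodes.Items[i], nil
			}
		}
	}
	return nil, fmt.Errorf("no node with InternalIP %s", ip)
}
```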
This process ensures immediate action upon detecting a kernel panic, eliminating the need to wait for garbage collector processes. The annotations serve as documentation, providing operators with a clear understanding of what transpired with the node and associated pods.
The introduced measures not only address the immediate concern of orphaned pods but also give operators crucial observability tools. Titus users now receive detailed information on why a job failed, even in the case of a kernel panic. While a job failing because of such a critical event is never ideal, the enhanced observability makes it possible to address and rectify kernel panics proactively. Thanks to these improvements, Titus has significantly enhanced its capabilities, ensuring a smoother experience for engineers and batch users.