Grab updated its Kafka on Kubernetes setup to improve fault tolerance and eliminate the need for human intervention in case of unexpected Kafka broker terminations. To address the shortcomings of the initial design, the team integrated with AWS Node Termination Handler (NTH), used the AWS Load Balancer Controller for target group mapping, and switched to EBS volumes for storage.
Grab has been operating Apache Kafka on Kubernetes (EKS) using Strimzi in production for two years as part of its Coban real-time data platform. The team previously leveraged Strimzi, now a CNCF incubating project, to enhance Kafka cluster security by applying proven authentication, authorization, and confidentiality mechanisms to all server-server and client-server integrations.
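The post does not show Grab's Strimzi configuration, but a minimal sketch of a Strimzi Kafka custom resource applying these mechanisms could look like the following; the cluster name, replica counts, and volume sizes are hypothetical, not Grab's actual values:

```yaml
# Hypothetical Strimzi Kafka cluster with TLS encryption on the wire,
# mTLS client authentication, and ACL-based (simple) authorization.
apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
  name: coban-kafka            # placeholder cluster name
spec:
  kafka:
    replicas: 3
    listeners:
      - name: tls
        port: 9093
        type: internal
        tls: true              # confidentiality: TLS for all traffic
        authentication:
          type: tls            # client-server mTLS authentication
    authorization:
      type: simple             # ACL-based authorization
    storage:
      type: jbod
      volumes:
        - id: 0
          type: persistent-claim
          size: 100Gi          # placeholder size
          deleteClaim: false
  zookeeper:
    replicas: 3
    storage:
      type: persistent-claim
      size: 20Gi               # placeholder size
      deleteClaim: false
```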
The initial setup worked well, except when EKS nodes were unexpectedly terminated by AWS due to maintenance or infrastructure issues. In such cases, Kafka clients would suddenly face errors because the broker had not been gracefully demoted. Worse yet, the affected broker instance could not restart on the newly provisioned EKS worker node because Kubernetes was still pointing to a storage volume that no longer existed. As a result, without intervention from a Coban engineer, the Kafka cluster would keep running in a degraded state with only two out of three broker nodes available.
Developers leveraged AWS Node Termination Handler (NTH) to minimize the disruption to Kafka clients by draining the worker node, which triggers a graceful shutdown of the Kafka process with a SIGTERM signal. The team opted to use the Queue Processor mode rather than the Instance Metadata Service (IMDS) mode, as it captures a broader set of events, including those related to the availability zones (AZ) and autoscaling groups (ASG).
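For illustration, Queue Processor mode is typically enabled through the NTH Helm chart by pointing it at an SQS queue that receives the relevant EventBridge events. A minimal sketch of the Helm values, with a placeholder queue URL and region rather than Grab's actual configuration:

```yaml
# Hypothetical Helm values for aws-node-termination-handler in Queue Processor mode.
# The SQS queue receives EC2 maintenance, spot interruption, AZ, and ASG
# lifecycle events via EventBridge, which IMDS mode cannot capture.
enableSqsTerminationDraining: true   # Queue Processor mode instead of IMDS polling
queueURL: https://sqs.ap-southeast-1.amazonaws.com/123456789012/nth-queue  # placeholder
awsRegion: ap-southeast-1            # placeholder region
```

On receiving an event, NTH cordons and drains the node, which delivers the SIGTERM that lets the Kafka process shut down gracefully.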
AWS Node Termination Handler (Queue Processor) Used To Support Graceful Kafka Shutdown (Source: Grab Engineering Blog)
The team resolved the broken network connectivity caused by worker node terminations by using the AWS Load Balancer Controller (LBC) to dynamically map Network Load Balancer (NLB) target groups. Engineers also had to address the NLB taking too long to mark each new target as healthy, which they did by increasing the health check frequency and configuring the NLB with a Pod Readiness Gate.
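The blog post does not include Grab's manifests, but a minimal sketch of this pattern with the LBC could look as follows; the names, namespace, and target group ARN are placeholders:

```yaml
# Hypothetical TargetGroupBinding mapping a per-broker Kubernetes Service
# to an existing NLB target group, so the LBC re-registers the pod's IP
# automatically after the pod moves to a new worker node.
apiVersion: elbv2.k8s.aws/v1beta1
kind: TargetGroupBinding
metadata:
  name: kafka-broker-0
  namespace: kafka
spec:
  serviceRef:
    name: kafka-broker-0   # Service fronting a single broker pod
    port: 9094
  targetGroupARN: arn:aws:elasticloadbalancing:ap-southeast-1:123456789012:targetgroup/kafka-0/0123456789abcdef  # placeholder
  targetType: ip           # register pod IPs directly with the NLB
---
# Labeling the namespace injects the Pod Readiness Gate, so a restarted
# broker pod is only considered ready once the NLB reports its target healthy.
apiVersion: v1
kind: Namespace
metadata:
  name: kafka
  labels:
    elbv2.k8s.aws/pod-readiness-gate-inject: enabled
```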
The last major hurdle the team had to overcome was ensuring that the newly provisioned Kafka worker node could start correctly and have access to the data storage volume. Engineers decided to use Elastic Block Store (EBS) volumes instead of NVMe instance storage volumes. Using EBS comes with many benefits, such as lower cost, decoupling volume size from the instance specification, faster sync times, snapshot backups, and capacity increases performed without downtime. Furthermore, the switch allowed the team to change EC2 instance types from storage-optimized to general-purpose or memory-optimized.
With additional configuration of Kubernetes and Strimzi, the setup was able to automatically create EBS volumes for the new cluster and attach/detach volumes between EC2 instances whenever a Kafka pod was relocated to a different worker node.
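As a sketch of how this works, assuming the EBS CSI driver provisions the volumes, a StorageClass like the following (hypothetical name, gp3 parameters) makes Kubernetes create each broker's volume in the availability zone where the pod is scheduled:

```yaml
# Hypothetical StorageClass for dynamically provisioned EBS gp3 volumes.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: kafka-ebs-gp3                    # placeholder name
provisioner: ebs.csi.aws.com             # EBS CSI driver
parameters:
  type: gp3
allowVolumeExpansion: true               # grow broker volumes without downtime
volumeBindingMode: WaitForFirstConsumer  # provision in the pod's AZ
```

With Strimzi's persistent-claim storage type (as in the earlier sketch) referencing such a class and keeping deleteClaim set to false, a PersistentVolumeClaim is created per broker, and the CSI driver detaches the EBS volume from the terminated node and attaches it to whichever worker node the rescheduled pod lands on.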
After all the enhancements, EC2 instance retirements and any operations requiring rotating all worker nodes can be performed without human assistance, which makes these operations faster and less error-prone. The team is planning further improvements, including using NTH webhooks to proactively spin up new instances and to send Slack notifications about actions initiated by the NTH, as well as rolling out Karpenter to replace the Kubernetes Cluster Autoscaler.