Gitpod, a cloud development environment platform, recently decided to move away from Kubernetes after six years of use and experimentation. The decision emerged from their experience of managing development environments for 1.5 million users, handling tens of thousands of environments daily.
Christian Weichel, CTO and co-founder, and Alejandro de Brito Fontes, staff engineer at Gitpod, detailed the journey behind this decision in a blog post. Gitpod found that while Kubernetes is well-suited for production workloads, it presents significant challenges when used for development environments.
The nature of development environments themselves contributed to these challenges. Unlike production workloads, development environments are highly stateful and interactive, with developers deeply involved in their source code and changes. They exhibit unpredictable resource usage patterns and require elaborate permissions and capabilities, often needing root access and the ability to install packages. These factors set development environments apart from typical application workloads and informed Gitpod's infrastructure decisions.
Initially, Kubernetes seemed ideal for Gitpod's infrastructure due to its scalability, container orchestration, and rich ecosystem. However, as they scaled, they encountered numerous challenges, particularly around security and state management. Resource management proved especially difficult, with CPU and memory allocation per environment being particularly problematic. The spiky nature of CPU requirements in development environments made it hard to predict when CPU time would be needed, leading to various experiments with CPU scheduling and prioritization.
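The blog post does not prescribe a single fix here, but the general shape of such an experiment can be sketched. The following is a minimal, hypothetical Go sketch, assuming cgroup v2 and a made-up workspace cgroup path: it samples a cgroup's `cpu.stat` and lowers its `cpu.weight` when sustained usage exceeds a burst budget, so that interactive neighbors stay responsive.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strconv"
	"strings"
	"time"
)

// readUsageUsec returns cumulative CPU time (µs) consumed by a cgroup,
// parsed from the cgroup v2 cpu.stat file.
func readUsageUsec(cgroupPath string) (uint64, error) {
	data, err := os.ReadFile(filepath.Join(cgroupPath, "cpu.stat"))
	if err != nil {
		return 0, err
	}
	for _, line := range strings.Split(string(data), "\n") {
		fields := strings.Fields(line)
		if len(fields) == 2 && fields[0] == "usage_usec" {
			return strconv.ParseUint(fields[1], 10, 64)
		}
	}
	return 0, fmt.Errorf("usage_usec not found in %s", cgroupPath)
}

// demoteIfBursting lowers a workspace cgroup's cpu.weight when its recent
// usage exceeds a burst budget, so interactive neighbors stay responsive.
func demoteIfBursting(cgroupPath string, prev, cur uint64, interval time.Duration) error {
	budget := uint64(interval.Microseconds()) * 2 // allow roughly 2 CPUs of burst
	weight := "100"                               // default cgroup v2 weight
	if cur-prev > budget {
		weight = "10" // deprioritize sustained CPU hogs
	}
	return os.WriteFile(filepath.Join(cgroupPath, "cpu.weight"), []byte(weight), 0o644)
}

func main() {
	const cg = "/sys/fs/cgroup/workspaces/ws-example" // hypothetical path
	const interval = 5 * time.Second
	prev, _ := readUsageUsec(cg)
	for range time.Tick(interval) {
		cur, err := readUsageUsec(cg)
		if err != nil {
			fmt.Fprintln(os.Stderr, err)
			continue
		}
		if err := demoteIfBursting(cg, prev, cur, interval); err != nil {
			fmt.Fprintln(os.Stderr, err)
		}
		prev = cur
	}
}
```

The key design tension the sketch illustrates is that any static CPU limit is wrong for a workload that is idle most of the time and then bursts during builds, which is why Gitpod cycled through several scheduling approaches.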
Storage performance optimization was another critical area of focus. Gitpod experimented with various setups, including SSD RAID 0, block storage, and Persistent Volume Claims (PVCs). Each approach had its trade-offs in terms of performance, reliability, and flexibility. Backing up and restoring local disks proved to be an expensive operation, requiring careful balancing of I/O, network bandwidth, and CPU usage.
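One common way to keep backups from starving workspaces is to throttle the backup stream itself. Below is a hedged Go sketch of that idea; the file paths and the 64 MiB/s budget are assumptions, and it uses golang.org/x/time/rate to cap backup read throughput:

```go
package main

import (
	"context"
	"io"
	"os"

	"golang.org/x/time/rate"
)

// throttledReader wraps an io.Reader and blocks so that reads never exceed
// the configured bytes-per-second budget, keeping backup I/O from starving
// the workspaces that share the node's disk and network.
type throttledReader struct {
	r       io.Reader
	limiter *rate.Limiter
	ctx     context.Context
}

func (t *throttledReader) Read(p []byte) (int, error) {
	// Cap each read at the limiter's burst so WaitN can always succeed.
	if len(p) > t.limiter.Burst() {
		p = p[:t.limiter.Burst()]
	}
	n, err := t.r.Read(p)
	if n > 0 {
		if werr := t.limiter.WaitN(t.ctx, n); werr != nil {
			return n, werr
		}
	}
	return n, err
}

func main() {
	// Hypothetical paths: stream a workspace disk image to a backup file.
	src, err := os.Open("/var/lib/workspaces/ws-example.img")
	if err != nil {
		panic(err)
	}
	defer src.Close()
	dst, err := os.Create("/mnt/backup/ws-example.img")
	if err != nil {
		panic(err)
	}
	defer dst.Close()

	const budget = 64 << 20 // 64 MiB/s, an assumed budget
	tr := &throttledReader{
		r:       src,
		limiter: rate.NewLimiter(rate.Limit(budget), budget),
		ctx:     context.Background(),
	}
	if _, err := io.Copy(dst, tr); err != nil {
		panic(err)
	}
}
```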
Autoscaling and startup time optimization were important goals for Gitpod. They explored various approaches to scale up and ahead, including "ghost workspaces," ballast pods, and eventually, cluster-autoscaler plugins. Image pull optimization was another crucial aspect, with Gitpod trying numerous strategies to speed up image pulls, including daemonset pre-pull, layer reuse maximization, and pre-baked images.
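Ballast pods, in general, are low-priority placeholders: the cluster-autoscaler provisions nodes to host them, and the scheduler preempts them the instant a real workspace needs the capacity. Here is a sketch of such a placeholder using the Kubernetes Go API types, assuming a pre-created negative-priority PriorityClass named "ballast" and made-up resource sizes:

```go
package main

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// ballastPod builds a placeholder pod that reserves roughly one workspace's
// worth of capacity. Its low-priority class means the scheduler preempts it
// as soon as a real development environment needs the node, while the
// cluster-autoscaler keeps enough nodes around to host the ballast.
func ballastPod(name string) *corev1.Pod {
	return &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: name, Namespace: "workspace-ballast"},
		Spec: corev1.PodSpec{
			// Assumes a PriorityClass "ballast" with a negative priority
			// value exists, so any real pod preempts these placeholders.
			PriorityClassName: "ballast",
			Containers: []corev1.Container{{
				Name:  "reserve",
				Image: "registry.k8s.io/pause:3.9", // does nothing, cheaply
				Resources: corev1.ResourceRequirements{
					Requests: corev1.ResourceList{
						corev1.ResourceCPU:    resource.MustParse("4"),
						corev1.ResourceMemory: resource.MustParse("8Gi"),
					},
				},
			}},
		},
	}
}

func main() {
	_ = ballastPod("ballast-0") // a real controller would create this via client-go
}
```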
Networking in Kubernetes introduced its own set of complexities, particularly around development environment access control and network bandwidth sharing. Security and isolation posed significant challenges, as Gitpod needed to provide a secure environment while giving users the flexibility they need for development. To address this, they implemented a user namespace solution of their own, involving complex components such as filesystem UID shifts, mounting a masked proc, and custom network capabilities.
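The Linux primitive underneath that solution can be shown in isolation. This is a minimal sketch, not Gitpod's implementation: it clones a process into new user and mount namespaces and maps an unprivileged host UID (1000 here, an assumption) to root inside the namespace, giving "root inside the environment, unprivileged outside".

```go
//go:build linux

package main

import (
	"os"
	"os/exec"
	"syscall"
)

// Launch a shell inside new user and mount namespaces, mapping the
// unprivileged host user to root inside the namespace. The process gets
// root-like capabilities within its namespace without holding real root
// on the node.
func main() {
	cmd := exec.Command("/bin/bash")
	cmd.Stdin, cmd.Stdout, cmd.Stderr = os.Stdin, os.Stdout, os.Stderr
	cmd.SysProcAttr = &syscall.SysProcAttr{
		Cloneflags: syscall.CLONE_NEWUSER | syscall.CLONE_NEWNS,
		UidMappings: []syscall.SysProcIDMap{
			// Host UID 1000 appears as UID 0 (root) inside the namespace.
			{ContainerID: 0, HostID: 1000, Size: 1},
		},
		GidMappings: []syscall.SysProcIDMap{
			{ContainerID: 0, HostID: 1000, Size: 1},
		},
	}
	if err := cmd.Run(); err != nil {
		panic(err)
	}
}
```

What made Gitpod's production version complex is everything around this primitive: shifting file ownership on disk to match the mapped UIDs, masking parts of /proc, and restoring the network capabilities the namespace strips away.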
Gitpod's journey sparked an interesting conversation on Hacker News. One user, datadeft, referenced the original k8s paper in a response and said:
the only use case was a low latency and a high latency workflow combination and the resource allocation is based on that. The generic idea is that you can easily move low latency work between nodes and there are no serious repercussions when a high latency job fails.
Based on this information, it is hard to justify to even consider k8s for the problem that gitpod has.
In search of better solutions, Gitpod experimented with micro-VM technologies such as Firecracker, Cloud Hypervisor, and QEMU. While these offered promising features like enhanced resource isolation and improved security boundaries, they also introduced new challenges around overhead, image conversion, and technology-specific limitations.
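To make the micro-VM direction concrete, here is a hedged Go sketch of booting a Firecracker VM through its REST API over a unix domain socket. The socket path and the kernel and rootfs image paths are assumptions; the endpoints (/machine-config, /boot-source, /drives/{id}, /actions) are Firecracker's standard configuration API.

```go
package main

import (
	"bytes"
	"context"
	"fmt"
	"net"
	"net/http"
)

// put sends one JSON document to the Firecracker API server listening on a
// unix domain socket; Firecracker machines are configured entirely through
// this REST API before boot.
func put(client *http.Client, path, body string) error {
	req, err := http.NewRequest(http.MethodPut, "http://localhost"+path,
		bytes.NewBufferString(body))
	if err != nil {
		return err
	}
	req.Header.Set("Content-Type", "application/json")
	resp, err := client.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode >= 300 {
		return fmt.Errorf("%s: unexpected status %s", path, resp.Status)
	}
	return nil
}

func main() {
	// Assumes a firecracker process was started with --api-sock /tmp/fc.sock
	// and that the kernel and rootfs images exist at the given paths.
	client := &http.Client{
		Transport: &http.Transport{
			DialContext: func(ctx context.Context, _, _ string) (net.Conn, error) {
				return (&net.Dialer{}).DialContext(ctx, "unix", "/tmp/fc.sock")
			},
		},
	}
	steps := []struct{ path, body string }{
		{"/machine-config", `{"vcpu_count": 2, "mem_size_mib": 4096}`},
		{"/boot-source", `{"kernel_image_path": "/images/vmlinux",
			"boot_args": "console=ttyS0 reboot=k panic=1"}`},
		{"/drives/rootfs", `{"drive_id": "rootfs",
			"path_on_host": "/images/workspace.ext4",
			"is_root_device": true, "is_read_only": false}`},
		{"/actions", `{"action_type": "InstanceStart"}`},
	}
	for _, s := range steps {
		if err := put(client, s.path, s.body); err != nil {
			panic(err)
		}
	}
}
```

Note the image-conversion challenge the post mentions: the rootfs here must be a raw ext4 image, so OCI container images have to be converted before a micro-VM can boot them.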
Ultimately, Gitpod concluded that achieving their goals with Kubernetes was possible, but only at a significant cost in security and operational overhead. This realization led them to develop a new architecture, Gitpod Flex, which carries over important aspects of Kubernetes, such as control theory and declarative APIs, while simplifying the architecture and improving the security foundation.
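"Control theory" here refers to the reconcile loop at the heart of Kubernetes-style systems: declare a desired state, observe the actual state, and repeatedly apply small corrections. A toy Go sketch of that pattern, not Flex's actual API:

```go
package main

import (
	"fmt"
	"time"
)

// Phase is the observed lifecycle state of a development environment.
type Phase string

const (
	PhaseStopped Phase = "stopped"
	PhaseRunning Phase = "running"
)

// Environment pairs a declared desired state with the last observed state,
// the declarative shape that Kubernetes popularized.
type Environment struct {
	Name            string
	Desired, Actual Phase
}

// reconcile nudges one environment toward its declared state. The caller
// re-runs it on a timer: repeated small corrections against observed drift,
// which is the control-loop idea the blog post says Flex carries over.
func reconcile(env *Environment) {
	switch {
	case env.Desired == PhaseRunning && env.Actual != PhaseRunning:
		fmt.Printf("starting %s\n", env.Name)
		env.Actual = PhaseRunning // stand-in for a real start operation
	case env.Desired == PhaseStopped && env.Actual != PhaseStopped:
		fmt.Printf("stopping %s\n", env.Name)
		env.Actual = PhaseStopped
	}
}

func main() {
	env := &Environment{Name: "ws-example", Desired: PhaseRunning, Actual: PhaseStopped}
	for i := 0; i < 3; i++ { // a real controller would loop forever
		reconcile(env)
		time.Sleep(time.Second)
	}
}
```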
Gitpod Flex introduces development-environment-specific abstraction layers and eliminates much of the unnecessary infrastructure. The new architecture allows for smooth integration of devcontainers and the ability to run development environments on desktop machines. It can be self-hosted quickly, in any number of regions, providing finer control over compliance and more flexibility in modeling organizational boundaries.
In conclusion, Gitpod's journey highlights the importance of choosing a system based on its ability to improve the developer experience, lower the operational burden, and improve the bottom line, rather than simply choosing between Kubernetes and alternatives. To learn more about the Gitpod Flex architecture, interested readers can watch this deep-dive session.