Google announced GKE Autopilot, a further abstraction of managed Kubernetes that allows users to focus on their application software while GKE Autopilot manages the infrastructure and adapts to changing workloads.
The newly introduced GKE Autopilot mode is intended to be a hands-off, fully managed Kubernetes experience that allows users to focus more on their workloads and less on managing cluster infrastructure. Autopilot complements the Standard mode of creating GKE clusters and optimizes clusters based on battle-tested, hardened best practices drawn from Google SRE and engineering experience. Google SRE already handles the control plane and cluster management for GKE; with GKE Autopilot, Google SREs manage the nodes as well, including provisioning, maintenance, and lifecycle management.
InfoQ caught up with William Denniss, a product manager at Google who originally conceived the idea of GKE Autopilot. He compares GKE Autopilot and GKE Standard mode, the two primary modes of creating GKE clusters, and discusses the pros and cons of each approach.
The GKE Autopilot mode, in his words, leverages Google's vast SRE experience in adapting clusters and cluster sizes to changing workloads.
InfoQ: The Kubernetes ecosystem has exploded, but can you comment on the overall skills required to deploy an application on a Kubernetes cluster? What gaps does GKE Autopilot specifically address from this perspective?
William Denniss: If you think of GKE/Kubernetes as two APIs, you have:
- The Kubernetes API, with objects like Deployment, StatefulSet, Job, etc. Users interact with this API through kubectl (and various other ways).
- The Google Cloud / GKE API, with objects like cluster and node pool. Users interact with this API through gcloud and the Cloud Console UI (and various other ways).
Our goal with GKE Autopilot is to reduce the GKE API to a single call: “gcloud container clusters create-auto”. From that moment on, the user is primarily interacting with the cluster through the Kubernetes API, for example, creating Deployments with Pods.
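For illustration, that flow might look like the following minimal sketch; the cluster name, region, and sample image are placeholders:

```shell
# Create an Autopilot cluster; Google manages the nodes, scaling, and upgrades.
gcloud container clusters create-auto hello-cluster --region=us-central1

# Point kubectl at the new cluster.
gcloud container clusters get-credentials hello-cluster --region=us-central1

# From here on, everything goes through the Kubernetes API.
kubectl create deployment hello-web \
    --image=us-docker.pkg.dev/google-samples/containers/gke/hello-app:1.0
kubectl get pods
```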
So the main skills needed become Kubernetes skills. While Kubernetes itself is not the simplest platform to use, it is an incredibly powerful one that can represent a wide range of deployments efficiently, for example, a clustered, stateful database that would be even harder to specify without Kubernetes.
With GKE Autopilot, by focusing on the Kubernetes API and removing the need to learn and interact with a separate API for provisioning and managing the cluster (the nodes and node pools), practitioners can focus on their Kubernetes skills without needing to learn many additional skills specific to GKE.
InfoQ: Are there any best practices for application development/deployment for GKE Autopilot compared to GKE standard?
Denniss: GKE already made setting up and operating a Kubernetes cluster easier and more cost-effective than DIY or other managed offerings, but GKE Autopilot goes beyond the “fully-managed” control plane that GKE has always provided: it applies industry best practices and eliminates all node management operations, maximizing cluster efficiency and providing a stronger security posture.
GKE has always been about simplifying Kubernetes. Users who are still interested in customizing their Kubernetes cluster configurations can continue to use GKE with the current mode of operation in GKE, referred to as “Standard,” which provides the same configuration flexibility that GKE offers today.
InfoQ: What are some operational issues associated with GKE standard that are obviated with GKE Autopilot?
Denniss: GKE Standard is a fully configurable product that offers low-level access to nodes. Cluster admins have full root access to nodes, with the freedom to install their own customizations, including ones that are not officially supported. While this is a useful feature that some customers use, and we expect will continue to want, the drawback is that it limits how much visibility we have into the operational aspects of the node, and we don’t offer fully managed nodes, leaving the customer partly responsible for monitoring and operating their nodes.
InfoQ: What are some use cases that are unsuitable for GKE Autopilot?
Denniss: Autopilot removes several low-level administrative APIs, like root access to nodes. We did this in order to provide a fully managed product. Workloads that require such access cannot run on Autopilot. The exception is that partner workloads may still be granted such access (as we can validate that their solutions won’t cause us excess supportability risk).
InfoQ: Can you delve into some implementation details of GKE Autopilot?
Denniss: GKE Autopilot is built to take advantage of many of the advanced capabilities of GKE including Node Auto Provisioning, the Cluster Autoscaler, and Gatekeeper (to restrict admin workloads). These features run in a special “autopilot mode” with a few modifications to support this new mode of operation, but users can actually choose to build their own autoscaling GKE cluster using the same building blocks provided in GKE Standard.
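As a sketch of assembling those building blocks on Standard, node auto-provisioning and node pool autoscaling can be enabled at cluster creation; the cluster name, zone, and resource limits below are illustrative:

```shell
# Hypothetical Standard cluster; the CPU/memory limits cap how far
# node auto-provisioning may scale the cluster up.
gcloud container clusters create my-standard-cluster \
    --zone=us-central1-a \
    --enable-autoprovisioning \
    --min-cpu=1 --max-cpu=64 \
    --min-memory=1 --max-memory=256 \
    --enable-autoscaling --num-nodes=3 --min-nodes=1 --max-nodes=10
```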
With GKE Autopilot we aim to provide an out-of-the-box solution that fully automates the autoscaling and applies all our best practices, while adding features unique to GKE Autopilot including node management (covered by a 99.9% Pod availability SLA) and a new pod-based billing model.
Our goal from the outset was that Autopilot is GKE, and not a forked or separate product. This means that many of the improvements we make to autoscaling in GKE Autopilot will be shared back to GKE Standard and vice-versa.
InfoQ: From a cost perspective, will GKE Autopilot always be more cost effective than GKE standard? If not, what are the caveats?
Denniss: Node resources fall into four main components: the operating system overhead, system pods (logging, monitoring, DNS), the user workload, and unallocated capacity. On GKE Standard, users pay for the full node, including all of these components.
In GKE Autopilot, the only resources charged are the user’s pods, not the other three categories. While the per-CPU charge for GKE Autopilot is higher, this is balanced by the fact that the user now pays precisely for the resources of their pods and nothing else. Depending on how efficiently nodes are being used today, moving to GKE Autopilot could result in a cost decrease (which we expect to be true for many users) or a slight increase (for those users who are expert at bin-packing Kubernetes nodes today). This is before considering the total cost of the solution, of course, including how much effort is spent doing cost optimization.
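Since Autopilot bills for what pods request rather than for node capacity, resource requests become the cost lever. A minimal sketch, with a hypothetical workload name, a placeholder image, and illustrative request values:

```shell
kubectl apply -f - <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  name: billed-by-requests        # hypothetical name
spec:
  replicas: 2
  selector:
    matchLabels:
      app: billed-by-requests
  template:
    metadata:
      labels:
        app: billed-by-requests
    spec:
      containers:
      - name: app
        image: nginx              # placeholder image
        resources:
          requests:
            cpu: "500m"           # Autopilot charges per requested vCPU...
            memory: "512Mi"       # ...and per requested memory, not per node
EOF
```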
GKE Autopilot empowers customer SRE teams to achieve the same sub-linear growth that Google’s own SRE teams strive for, by shifting more of the operational burden of running nodes to Google. By taking care of cluster and infrastructure operational duties, users can spend more energy doing what they do best: building value for their businesses through the software that they run on GKE.
InfoQ: Can you comment on the roadmap for GKE Autopilot including the partner ecosystem and any plans to open source it?
Denniss: GKE Autopilot was designed to be broadly compatible with workloads that run on GKE, including integrations with partners’ solutions. Today, GKE Autopilot supports logging and monitoring from Datadog and CI/CD from GitLab. We have a program to certify partners’ workloads for deployment on GKE Autopilot and we are inviting partners to participate in this program.
The GKE docs outline how to create a GKE cluster in Autopilot mode and how the cluster adapts to the deployed workloads.