Key Takeaways
- The initial guiding principles for Kubernetes networking were heavily influenced by users who were used to deploying applications on virtual machines (VMs) and needed a similar process on Kubernetes.
- Load balancing in Kubernetes can be split into two parts: traffic from within the cluster and traffic from "elsewhere".
- At L4 (TCP, UDP, SCTP), Kubernetes has the Service API, which can be used within the cluster and also exposed externally via NodePort or LoadBalancer Services. At L7, Kubernetes has the Ingress API for HTTP, which enables richer expression of HTTP-specific semantics such as host and path matching.
- The new Kubernetes Gateway API is a unifying API that aims to fix some of the limitations of the Service and Ingress APIs.
- There are many sessions on Kubernetes networking at the upcoming KubeCon + CloudNativeCon NA 2020, and the Kubernetes SIG Network session will provide a deeper dive into the topics discussed here and more.
InfoQ caught up with Tim Hockin, chair of SIG Network, Principal Software Engineer at Google, speaker at the upcoming KubeCon + CloudNativeCon NA 2020 session, and a Kubernetes maintainer since before the project was announced, to discuss the history of Kubernetes networking and its roadmap.
InfoQ: The cloud-native and Kubernetes networking landscape seems particularly complicated, maybe even frightening. Can you go a little bit into the history and offer some insights on why it’s so complicated?
Tim Hockin: You can break this down into two parts - the stuff within Kubernetes proper and the stuff in the rest of the ecosystem. Both can be daunting. Within Kubernetes, we made some early design choices that were primarily in pursuit of simplicity. I know that sounds silly - the simplifying decisions are often the most complex - but it matters who you are talking about. If you look at the ratio of cluster users - people who deploy and run their apps on Kubernetes - vs cluster admins, it is highly skewed towards users. As it should be! The simplifying decisions seek to make life easier on those cluster users, at the sometimes unfortunate cost of the cluster admins.
One example is IP per pod. This makes Kubernetes networking easy to think about for app developers -- it just works like a VM, mostly. Underneath that, though, cluster admins have to implement this model in a way that fits their environment.
We always knew Kubernetes could not be everything to everyone, so there’s a lot it doesn’t do (or didn’t at first). We relied on the ecosystem to fill those gaps. You can see myriad projects stepping in at various places - from the lowest levels of networking plugins to policy implementations to load-balancers to service meshes. Wherever there isn’t a "one true" implementation, there’s an opportunity for complexity. We are slowly tackling at least some of this complexity, for example by expanding the APIs that Kubernetes defines to capture the best ideas from the various efforts and bringing some consistency to users.
InfoQ: Without trying to fully comprehend the landscape, can you offer a cheat sheet for application developers and architects vis-a-vis Kubernetes networking?
Tim Hockin: There are really only a few main things to think about. First, individual workloads. Pods have their own IPs, very similar to a tiny VM. Second, groups of workloads. Services act as a grouping mechanism and the definition of some access mechanisms. For example, this is where you decide whether your app wants a virtual IP for load-balancing or not. Third, bringing traffic into your services. This is the most complex topic, because the APIs here have evolved a bit weirdly. The service API has affordances for this, but there’s also the Ingress API. At a high level the difference is "L4 vs L7".
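For readers new to these APIs, the choice Hockin describes, a virtual IP or not, comes down to a single field on the Service. The manifest below is a minimal sketch with hypothetical names:

```yaml
# Minimal sketch: a Service grouping all Pods labeled app=my-app.
# Names and labels here are hypothetical.
apiVersion: v1
kind: Service
metadata:
  name: my-app
spec:
  selector:
    app: my-app          # the group: every Pod carrying this label
  ports:
    - port: 80           # port clients connect to
      targetPort: 8080   # port the Pods actually listen on
  # By default the Service gets a virtual ClusterIP that spreads traffic
  # across the Pods. Setting "clusterIP: None" makes it a headless Service,
  # and clients resolve the Pod IPs directly via DNS instead.
```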
InfoQ: The obvious question about L4/L7 layers and how load balancing is done -- can you provide a quick primer and how the advent of service mesh and proxies play a role in application development?
Tim Hockin: We can split this into two parts - traffic from within the cluster and traffic from "elsewhere". At L4 (TCP, UDP, SCTP) Kubernetes has the Service API. It has one mechanism, called ClusterIP, for exposing a Service to clients. Usually that VIP is only visible inside a cluster (though we’re working on that!). Load-balancing (really load-spreading) is done by the service proxy implementation - this is usually the kube-proxy (which is part of Kubernetes) but there is a growing number of alternative implementations available. In addition to ClusterIP, a Service can request an abstract LoadBalancer, which is generally assumed to be on-demand from cloud-providers but can be implemented in many ways.
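To make the L4 path concrete, the sketch below (hypothetical names; behavior depends on the environment) shows a Service asking for an external load balancer. ClusterIP remains the in-cluster VIP, while the LoadBalancer type layers an externally reachable address on top:

```yaml
# Sketch: the same Service API, now requesting an external L4 load balancer.
apiVersion: v1
kind: Service
metadata:
  name: my-app
spec:
  type: LoadBalancer     # builds on ClusterIP (and NodePort) behavior
  selector:
    app: my-app
  ports:
    - port: 443
      targetPort: 8443
      protocol: TCP      # L4 only: TCP, UDP, or SCTP
# A cloud provider, or an in-cluster implementation such as MetalLB,
# provisions the balancer and publishes its address under
# status.loadBalancer.ingress.
```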
At L7 we have the Ingress API for HTTP. This API allows richer expression of HTTP-specific semantics, such as host and path matching. Unlike ClusterIP, there is no "in the box" implementation of Ingress, but many options have sprung up in the ecosystem. An Ingress controller can be run in the cluster or in the cloud-provider environment, and that controller’s job is to ensure that clients destined for the configured hosts and paths are routed to the Kubernetes Services that implement them.
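As a hedged illustration of that host and path matching, a minimal Ingress might look like the following; the hostname, Service names, and ingress class are invented, and the exact schema depends on the Kubernetes version (networking.k8s.io/v1 shown here):

```yaml
# Sketch: route HTTP traffic for one host to two backend Services by path.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: my-app
spec:
  ingressClassName: example-controller   # which Ingress controller handles this
  rules:
    - host: shop.example.com
      http:
        paths:
          - path: /cart
            pathType: Prefix
            backend:
              service:
                name: cart                # hypothetical Service
                port:
                  number: 80
          - path: /
            pathType: Prefix
            backend:
              service:
                name: storefront          # hypothetical Service
                port:
                  number: 80
```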
One of the projects we are working on this year is a new API, tentatively called "Gateway", that tries to bring more consistency and extensibility to these divergent APIs. It’s sort of a "grand unified theory" which seeks to break the API along the lines of common roles (infrastructure operator vs. cluster operator vs. app operator) and provide extension-points in the right places for implementations to get creative without forking the whole API. It should be usable for both L4 and L7 paths and offers access to a lot more "advanced" traffic routing capabilities.
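At the time of this interview the Gateway API was still an alpha proposal, so the resource names and fields below should be read as an illustrative sketch of the role split rather than a stable schema: the cluster operator owns the Gateway, and application teams attach routes to it.

```yaml
# Illustrative only: the Gateway API was alpha and has evolved since.
apiVersion: gateway.networking.k8s.io/v1beta1
kind: Gateway
metadata:
  name: shared-gateway          # owned by the cluster operator
spec:
  gatewayClassName: example-lb  # selects the implementation (extension point)
  listeners:
    - name: http
      protocol: HTTP
      port: 80
---
apiVersion: gateway.networking.k8s.io/v1beta1
kind: HTTPRoute
metadata:
  name: storefront              # owned by the application team
spec:
  parentRefs:
    - name: shared-gateway      # attach this route to the shared Gateway
  hostnames:
    - shop.example.com
  rules:
    - matches:
        - path:
            type: PathPrefix
            value: /cart
      backendRefs:
        - name: cart            # hypothetical Service
          port: 80
```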
We often think of the Service API for East-West traffic and Ingress for North-South traffic (not exactly right, but close enough). Service mesh adds much more robust capabilities to East-West traffic -- things like traffic splitting, fault injection, and mutual TLS -- without apps really being aware of it. It’s incredibly powerful, but can seem overwhelming. One of the things I am concerned with is how we can lower the barriers to entry. I believe that almost everyone will eventually want some facet of service mesh capabilities, even if they don’t use those words. It should be easy and incremental to adopt.
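As one concrete, mesh-specific example of the traffic splitting Hockin mentions, Istio (one of several implementations) expresses it declaratively; the resource and names below are illustrative, and other meshes have analogous APIs:

```yaml
# Istio-specific sketch of East-West traffic splitting; names are hypothetical.
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: my-app
spec:
  hosts:
    - my-app                # the Service name clients already call
  http:
    - route:
        - destination:
            host: my-app-v1 # Service fronting the current version
          weight: 90        # keep 90% of traffic on v1
        - destination:
            host: my-app-v2 # Service fronting the canary
          weight: 10        # shift 10% to v2 without changing the apps
```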
InfoQ: Diving a little deeper what is the role of the Container Network Interface and can you go a little bit into the inner workings? How does Kubernetes networking differ amongst the major cloud providers?
Tim Hockin: CNI is the (relatively thin) abstraction between container runtime modules like dockershim (which adapts docker to be used by Kubernetes) or containerd and the on-node manager, Kubelet. First, it’s important to note that not every container runtime (as defined by the Container Runtime Interface, CRI) uses CNI. CNI is a way for cluster admins to use plugins - whether from the CNI project or from a vendor - to configure the lowest level of networking: for example, how IPs are allocated, how virtual interfaces are configured, whether to use an overlay network, and how routes are advertised.
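To make the plugin layer less abstract, a node-level CNI configuration using the reference bridge and host-local plugins looks roughly like this (the network name, subnet, and values are illustrative); the runtime invokes each listed plugin when it sets up a Pod's network namespace:

```json
{
  "cniVersion": "0.4.0",
  "name": "example-pod-network",
  "plugins": [
    {
      "type": "bridge",
      "bridge": "cni0",
      "isGateway": true,
      "ipMasq": true,
      "ipam": {
        "type": "host-local",
        "subnet": "10.244.1.0/24",
        "routes": [{ "dst": "0.0.0.0/0" }]
      }
    },
    {
      "type": "portmap",
      "capabilities": { "portMappings": true }
    }
  ]
}
```

Here the bridge plugin wires each Pod's veth pair onto a node-local bridge, and host-local hands out Pod IPs from that node's slice of the cluster's Pod CIDR; an overlay- or route-based implementation would swap in different plugins without Kubernetes itself noticing.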
Each cloud is free to implement networking at that layer however they see fit. Some providers implement Kubernetes networking natively, and others use overlays or other mechanisms. Customers should pay attention here - not to the details of the implementation but to the depth of the integration. How the network is integrated can have an impact on which other infrastructure decisions are viable (or not) and on how you operate and debug your clusters.
InfoQ: What are your words of wisdom for network operators, etc. making a leap from on-prem to the world of Kubernetes? Are the tools that they are familiar with still relevant?
Tim Hockin: Most of the same tools can be used. At the end of the day, you can still 'tcpdump' it. Depending on the implementation it may be simpler or more complicated to trace traffic across layers, but I think we’re making progress there overall. Operators new to Kubernetes should know that, depending on how you integrate it into your larger network, Kubernetes may want a lot of IP addresses. This has been a stumbling block for some users and is another area we are working to make more flexible.
InfoQ: You are scheduled to speak at the virtual KubeCon, and the session seems to cover both introductory and advanced material. Can you provide a sneak peek, and why should developers or architects attend?
Tim Hockin: We do a session almost every KubeCon where we go over some of the basics of networking in Kubernetes and then dive deep on some of the most recent work that people may not be aware of. In many ways the talk is similar each time, especially the intro section, but people keep showing up. As long as it is interesting, we’ll keep doing it. For those who can’t attend, you can watch recordings of previous sessions, or this one will be online soon enough.
InfoQ: Can you talk about the community work and the roadmap? Anything else you want to add?
Tim Hockin: I’ll touch on a few recent efforts driven by the community, including some of my Google colleagues, that I think are noteworthy.
I mentioned Gateway before. I am super excited about it as a unifying API and a way to get past some less-than-ideal API evolution in the Service API. I’m resting a lot of my hopes on it, and so far the API seems great to me. It’s obviously a bit more involved than either Service or Ingress, but I believe most users will find the new capabilities worth adopting.
A pain point many users have had is that Kubernetes happily load-balances traffic across cloud zones, which can cost money. Several releases ago we introduced a semi-manual API to avoid that, but we’re looking at a newer, more automatic way of doing topology-aware load-balancing for an upcoming release.
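For readers who want to see what that looks like today, the semi-manual mechanism appears to be the alpha topologyKeys field on the Service (the ServiceTopology feature gate); the sketch below is hedged accordingly and uses hypothetical names:

```yaml
# Hedged sketch, assuming the semi-manual API referred to is the alpha
# topologyKeys field (ServiceTopology feature gate); names are hypothetical.
apiVersion: v1
kind: Service
metadata:
  name: my-app
spec:
  selector:
    app: my-app
  ports:
    - port: 80
  topologyKeys:
    - "topology.kubernetes.io/zone"   # prefer endpoints in the caller's zone
    - "*"                             # otherwise fall back to any endpoint
```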
Scalability in a system like Kubernetes is always complicated. We recently announced Google Kubernetes Engine can now support clusters of up to 15,000 nodes. To get to that, there were a lot of things to fix, but one big thing was a new endpoint-management API called EndpointSlice. Without that, 15K would not have been achievable.
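For context, EndpointSlice splits a Service's endpoints across many smaller objects, so a single Pod change no longer forces the control plane to rewrite and redistribute one enormous Endpoints object. A controller-managed slice looks roughly like the following (addresses and names are invented, and the API was at v1beta1 in the 2020 releases):

```yaml
# Sketch of a controller-managed EndpointSlice; values are invented.
apiVersion: discovery.k8s.io/v1beta1
kind: EndpointSlice
metadata:
  name: my-app-abc12
  labels:
    kubernetes.io/service-name: my-app   # ties the slice back to its Service
addressType: IPv4
ports:
  - name: http
    protocol: TCP
    port: 8080
endpoints:
  - addresses:
      - "10.244.1.17"
    conditions:
      ready: true
    topology:
      topology.kubernetes.io/zone: us-central1-a
```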
A change that merged very recently is revamped dual-stack support. The first version has been in alpha for a while, but we found some real problems with it (which is what alpha is for, after all). We put some serious effort into it and this next version (still alpha) should be much more capable and stable and upgrade-safe. And it includes real dual-stack support in the Service API, too.
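The revamped design surfaces dual-stack directly on the Service object; the fields below reflect the alpha shape at the time of writing and may change, with names again hypothetical:

```yaml
# Sketch of the revamped dual-stack Service fields (alpha at the time).
apiVersion: v1
kind: Service
metadata:
  name: my-app
spec:
  selector:
    app: my-app
  ipFamilyPolicy: PreferDualStack   # SingleStack | PreferDualStack | RequireDualStack
  ipFamilies:                       # order picks the primary family
    - IPv6
    - IPv4
  ports:
    - port: 80
```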
About the Interviewee
Tim Hockin is a principal software engineer at Google, where he works on Kubernetes and Google Kubernetes Engine (GKE). He is a co-founder of the Kubernetes project, and he is responsible for areas such as networking, storage, node, multi-cluster, resource isolation, and cluster sharing.