Transcript
Probst: Let me start by stating something that I've seen by talking to many of our customers. What do companies care about? They care about delivering their products to their customers, ideally as quickly as possible, so velocity, and ideally with as little cost as possible. These are themes that I see over and over, and people choose tools and infrastructure that help them achieve these goals. What I'm going to be talking about here today is how Kubernetes, and in particular multi-tenancy in Kubernetes, can be one of the tools in your toolbox that you can look at in order to help you achieve these goals.
Let me introduce myself briefly. My name is Katharina Probst, I'm a Senior Engineering Manager at Google. You can find me on LinkedIn if you'd like. I will also share the slides, so you're welcome to take pictures, but you can also download them later.
Why Multitenancy
Let's start with why you might want to take a closer look at multi-tenancy. Do any of you run multi-tenant Kubernetes clusters? A couple, great, I'd love to hear your experiences too, maybe you can share with the room later. Why would you care about multi-tenancy? When you start out with Kubernetes, usually what happens at a very high level is, you have a user, and the user interacts via a command-line tool, or the API, or the UI, with a master. The master, as we just heard, runs the API server, the scheduler, and the controllers. This master is responsible for orchestrating and controlling the actual cluster. The cluster consists of multiple nodes that you schedule your pods on. Let's say these nodes are machines or virtual machines, or whatever the case may be. Usually, you have one logical master that controls one single cluster. It looks relatively straightforward. When you have one user and one cluster, that's what it is.
Now, what happens when you start having multiple users? Let's say your company decides to use Kubernetes for a variety of maybe internal applications, and so you have one developer over here, creating their Kubernetes cluster, and you have another one over here creating their Kubernetes cluster, and your poor administrators now have to manage two of them. This is starting to get a little bit more interesting. Now you have two completely separate deployments of Kubernetes with two completely separate masters and sets of nodes. Then, before you know it, you have something that looks more like this. You have a sprawl of clusters. You get more and more clusters that you now have to work with.
Some people call this kube sprawl; it's actually a pretty well-understood phenomenon at this point. What happens now? Let me ask you two questions about how this scales. Let's think a little bit about how this model scales financially. How much does it cost you to run these clusters? The first thing that might stand out is that you now have all of these masters hanging out. Now you have to run all these masters. In general, it is best practice not to run just one master node, but three or six, so that you get better high availability. If one of them fails, the other ones can take over. When you look at all these masters here, there's normally not one single node per master, there are usually three. This is starting to look a little bit more expensive. That's number one.
Then number two, one of the things that we see a lot is customers who say, "I have all of these applications, and some of them run during the day, and they take user traffic." They need a lot of resources during the day, but they lie idle at night. They don't really do anything at night, but you still have all these nodes.

Then you have some applications that are batch applications, maybe batch processing of logs or whatever the case may be, and you can run them at any time you want. You could run them at night; you could have this model where some applications run during the day and then the other applications run at night, and use the same nodes. That seems reasonable. With this model, where you have completely separate clusters on completely separate nodes, you've just made that much harder for yourself. That's one consideration.
Another consideration that people bring up a lot is operational overhead, meaning how hard it is to operate all of these clusters. If you've been in a situation like this before, maybe not even with Kubernetes, what you will have noticed is that all of these clusters look very similar at the beginning. Maybe they run very different applications, but the Kubernetes clusters, the masters, are all at the same version of Kubernetes, and so forth. Over time, though, they tend to drift. They tend to become all of these special snowflakes. The more you have these special snowflakes, the harder it is to operate them. You get alerts all the time, and you don't know, is it this specific version, and you have to do a bunch of work. Now you have tens or hundreds of sets of dashboards to look at to figure out what's going on. This becomes operationally very difficult and actually ends up slowing you down.
Now, with all of that being said, there is a model that is actually a very appropriate model under some circumstances. Lots of people choose this model, maybe not for hundreds or thousands, but lots of people choose this model of having completely separate clusters because it has some advantages, such as being easier to reason about and having very tight security boundaries. Let's say you're in this situation, and you have hundreds of clusters, and it's becoming just this huge pain. One thing you can consider is what we call multi-tenancy in Kubernetes.
There are many definitions of multi-tenancy. When you read things on the internet about multi-tenancy in Kubernetes, you have to dig a little bit deeper to understand which model we're talking about. Usually though, what people talk about is this model that you see up on the slide here. What this model is, is you have many users that interact via the command line, and the API, and the UI, with one logical master. You have one master running, and that master now controls a large cluster - because for small clusters, it doesn't make that much sense, maybe - but a large cluster and that cluster is divided up into namespaces.
There's this concept that we just heard about in Kubernetes that's called namespaces. What namespaces are, it's very important to understand that they are virtual clusters. You have one physical cluster, but then you divide that cluster up into namespaces. That does not mean that these two nodes belong to this namespace and these three nodes belong to the next namespace. The nodes are actually shared among the namespaces, but the namespace provides a boundary that creates this universe for you. Then you can run different applications in these namespaces but still share the resources.
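For illustration, here is a minimal sketch of creating such a namespace; the name "team-a" is just a placeholder for one of your tenants:

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: team-a                   # hypothetical tenant name
  labels:
    team: team-a                 # optional label, handy for policies later
```

Pods, services, and most other objects a tenant creates then live inside that namespace; for example, "kubectl get pods -n team-a" only shows that tenant's pods while the nodes underneath remain shared.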
Let's dig into this a little bit. Usually, when you run Kubernetes, you have different roles and different kinds of users of this cluster. If you have a multi-tenant cluster, what you will more than likely have is a cluster administrator. That cluster administrator essentially has broad access to the whole cluster. They're the ones that set up the namespaces, they set up the resource limits, as we will see later in the talk, and they make sure that there's consistency across the namespaces in the cluster so you don't end up with this divergence and all of these different snowflakes. Of course, oftentimes, they're responsible for operating the cluster, responding to incidents, and making sure everything runs smoothly.
Now, we have a new role that really only applies to this model of multi-tenancy, and that is the namespace administrator. The namespace administrator does not necessarily have access to or control over the entire cluster, but only one namespace, maybe multiple, but not the entire cluster, so only admin rights to specific namespaces.
Then finally, you have the cluster user, and the cluster user, just like it was before, runs their applications on the cluster. Now, in this multi-tenant model, it's a little bit different because the cluster user now has access only to certain namespaces, maybe even only to one. It is their responsibility to understand their own namespaces, to run their apps in their namespaces, make sure they understand the resource limits, and make sure they don't trample on other tenants. We'll get more into more detail about this further on in the slides.
Essentially, what you're going to have is you're going to have different roles, cluster administrator, namespace administrator, and user that you will typically see in these kinds of deployments.
Hard Multitenancy
When people talk about multi-tenancy, they often talk - if you go to, for instance, the open-source Kubernetes community, the working group for multi-tenancy - they talk about this concept of hard multi-tenancy and soft multi-tenancy. I'm going to talk about hard multi-tenancy first, but let me just give you a brief overview of what this means.
On the one end, hard multi-tenancy means that you have tenants that you don't trust and they don't trust each other, so there is zero trust. That might be random people uploading code and running it, or it might be different companies that compete with each other. It could be anything, but it's very much on the end of the spectrum where there's zero trust.
On the other side is soft multi-tenancy, and I'll talk more about this later today. When we're talking about soft multi-tenancy, there's more trust established between the tenants. One thing that's important to understand is that people often talk about hard versus soft multi-tenancy. In reality, it's really a spectrum, because how much you trust your tenants is not a binary, it's usually a spectrum. Which kinds of use cases work for you, you have to think for yourself and for your specific use case.
Let's talk a little bit more about hard multi-tenancy. Again, that is the case where there is no trust. Hard multi-tenancy, for a variety of reasons, is not yet widely used in production. Essentially, what it boils down to is the security boundaries and making sure that tenants don't step on each other. There is ongoing work in the Kubernetes community to strengthen Kubernetes and make changes so that we get closer and closer to a point where that is a very viable thing to do.
Let's talk a little bit about what it would take to have that. Think about this a little bit. You now have one cluster with a bunch of nodes, and these nodes are shared by potentially malicious tenants. What do you need to do to make sure that this actually works smoothly? You need to make sure that there is great security isolation, that's the second bullet here. Tenants cannot see or access each other's stuff, they cannot intercept network requests. They cannot get to the host kernel and escalate their privileges. All of that needs to be guaranteed so that you can have tenants that you cannot trust. The other thing is that you need to make sure that tenants essentially don't DoS each other, meaning they don't impact each other's access to resources.
We'll talk about this a little bit more later on in the talk, but think about this. You have a bunch of nodes that are now shared, and you have to make sure that everybody essentially gets their fair share. That's one thing it would take. Another thing you have to make sure of is about extensions: for instance, there's this concept of custom controllers and custom resource definitions, which is a way to extend Kubernetes. If you now have all of these different tenants, and they extend Kubernetes and add their own APIs and their own CRD controllers, you have to make sure that they don't conflict, so that one person over here doesn't create an API that conflicts with something over there. You have to make sure that they're very nicely isolated.
Then finally, much of what we talk about is about what we call the data plane, which is the cluster where the nodes are. The same questions apply to the master, which we call the control plane. We have to make sure that the control plane resources are also shared fairly. As we're on this journey towards making hard multi-tenancy more and more valuable, and more and more practical, and used in production, those are the kinds of questions that we need to answer.
We're going along this journey towards more and more hard multi-tenancy. Right now, what people do a lot is use multi-tenancy in a context where there is trust between the tenants. The use cases that are very common, or pretty common, are different teams within the same company. Within one company, you say we share one big pool of resources, and different teams share them. The different teams really have good incentive and good reason to behave nicely. They're not assumed to be malicious; they trust each other. Accidents happen, and that's what you try to protect against, but you don't assume that they're completely untrusted.
In that model, as you may by now have guessed, different teams will typically get different namespaces to share in one cluster. As I already said, this is used in production. Oftentimes, what happens is that multi-tenancy is still something that requires a little bit of setup, or maybe a lot of setup. There are a bunch of knobs that you need to turn, and we'll talk about that in a little bit. What I've often seen is that companies use multi-tenancy, but then they actually have a few people who are dedicated to making sure the policies are applied correctly, the network is set up consistently, and so forth for these shared clusters.
Multitenancy Primitives
There are a number of primitives that exist in Kubernetes that will help you get a multi-tenant cluster set up and administered properly. I already mentioned namespaces. The good thing about namespaces is that they were built in very early on in Kubernetes and they're a very fundamental concept, so they're actually implemented everywhere and pretty much all the components understand namespaces. That's good, but namespaces alone are not good enough.
There are a number of things you need to do in order to set up your multi-tenant cluster in a way that protects tenants from each other's accidents, for instance. We're going to talk about three things in a little bit more detail. One is access control, which means who can access what. One is isolation, which means how do I make sure not everybody can see each other's stuff. Then the last thing, going back to our hard multi-tenancy goals, is fair sharing: what already exists in Kubernetes that lets you ensure fair sharing among tenants.
Let's talk a little bit about access control. That's the first primitive that we're going to talk about here. We already heard a little bit about RBAC, it was mentioned in the previous talk. RBAC is role-based access control in Kubernetes. RBAC is essentially a tool that's built into Kubernetes that lets you control who can access what. Basically, the way it works is that you set up these roles, and in these roles you describe, "I'm going to have my administrator role, and this administrator can do all of these different things." There are two kinds of roles. There are ClusterRoles, which are roles that apply to the entire cluster, as the name suggests.

Then there are just Roles, and those roles are namespace-scoped, meaning they apply to specific namespaces, whichever ones you list. Kubernetes already comes with some default roles that you can use, but then you can create your own, and you probably will want to. You create your own roles that say exactly who can access what pods, what namespaces, what secrets, and all of that stuff.
You've now created those roles, and now you get a new employee. You need to make sure that this employee is assigned these roles. The way you do that is with ClusterRoleBindings or RoleBindings. A ClusterRoleBinding lets you bind groups of people, or service accounts, or individuals to ClusterRoles - again, those are cluster-wide - and RoleBindings let you bind individuals, or groups, or service accounts to namespace-scoped Roles.
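As a rough sketch of what that looks like in practice - all the names here, like the "team-a" namespace and the "team-a-developers" group, are made up for illustration - a namespace-scoped Role and a RoleBinding could be defined like this:

```yaml
# A namespace-scoped Role: read-only access to pods in the "team-a" namespace.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: team-a
  name: pod-reader
rules:
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["get", "list", "watch"]
---
# Bind that Role to a hypothetical group of developers for that namespace.
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: team-a-pod-readers
  namespace: team-a
subjects:
- kind: Group
  name: team-a-developers        # hypothetical group name
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: pod-reader
  apiGroup: rbac.authorization.k8s.io
```

A ClusterRoleBinding looks much the same, except that it has no namespace and binds subjects to a ClusterRole across the whole cluster.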
You will use this extensively to achieve the isolation that you want. A very concrete example is Secrets. Secrets are a way that Kubernetes provides to let you store things like passwords if you need them. The way it's done is, they're stored in etcd, so in the master. You have to make sure that they're encrypted, but you also have to make sure that only the correct people, those with access to the correct namespaces, can access those secrets. Even if you're in a place where you all trust each other, more or less, you should make sure that secrets are pretty well-protected.
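A Secret is itself a namespaced object, so combined with namespace-scoped Roles like the one above (granting access to "secrets" rather than "pods"), only subjects who were given access in that namespace can read it. A minimal, purely illustrative sketch:

```yaml
# A Secret lives in a specific namespace; only subjects whose Roles grant
# access to "secrets" in "team-a" can read it. Values are base64-encoded,
# not encrypted, so encryption at rest for etcd still matters.
apiVersion: v1
kind: Secret
metadata:
  name: db-credentials           # hypothetical name
  namespace: team-a
type: Opaque
data:
  password: cGFzc3dvcmQ=         # base64 of "password", illustrative only
```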
RBAC is this mechanism that you're going to use extensively to make sure that you essentially create this universe, this virtual cluster, out of this concept that's called a namespace. There are other things that Kubernetes provides that let you become more granular in your security controls. One of them is Pod Security Policy, which is similar in that it lets you set security policies. It basically lets you say, "I will only allow pods into a specific namespace if they are not running as privileged," for instance. So, you can set up a pod security policy, and then, again, apply that to the cluster and to the namespaces so that you have better security.
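A rough sketch of such a policy is below. Note that it only takes effect if the PodSecurityPolicy admission controller is enabled and tenants are granted "use" of the policy via RBAC; also, PodSecurityPolicy has since been deprecated in newer Kubernetes releases in favor of Pod Security admission, so treat this as illustrative of the idea rather than a recommendation:

```yaml
# A minimal PodSecurityPolicy sketch that rejects privileged containers
# and requires containers to run as a non-root user.
apiVersion: policy/v1beta1
kind: PodSecurityPolicy
metadata:
  name: restricted
spec:
  privileged: false              # no privileged containers
  runAsUser:
    rule: MustRunAsNonRoot
  seLinux:
    rule: RunAsAny
  supplementalGroups:
    rule: RunAsAny
  fsGroup:
    rule: RunAsAny
  volumes:                       # only allow a small set of volume types
  - configMap
  - secret
  - emptyDir
```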
One of the things to touch on here: we heard a little bit about network policies. Let me talk about that in the context of multi-tenancy. When you have a multi-tenant cluster, you have a cluster that is carved up into these namespaces. The namespaces are virtual clusters, so the nodes are shared between all of the namespaces. Now, the pods get scheduled on these nodes, so it's entirely possible and likely that you will have a node that has pods from different namespaces on it.
If you've ever worked on a large distributed system with many different components, you will have experienced that it is a very good idea to be very thoughtful about specifying who can talk to whom. I've worked on systems where we did not do that at the beginning, and then, three years in, we're like, "That was a very bad idea," because now everybody can talk to everybody, and we can't reason about anything, and we have no idea what's going on.

In a multi-tenant Kubernetes cluster, it is highly recommended that you're very thoughtful about setting network policies. What those network policies let you do is set ingress and egress rules, so for specific pods, they let you say, "This pod can talk to this pod, but not to these other pods," so you can reason better about the topology of your deployments.
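As a sketch, a common pattern per tenant namespace is a default-deny ingress policy plus an explicit allowance for traffic from within the same namespace. This assumes a network plugin that actually enforces NetworkPolicy, and the namespace name is again just illustrative:

```yaml
# Deny all ingress to pods in "team-a" by default.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: team-a
spec:
  podSelector: {}                # applies to every pod in the namespace
  policyTypes:
  - Ingress                      # no ingress rules listed, so all ingress is denied
---
# Then allow traffic that originates from pods in the same namespace.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-same-namespace
  namespace: team-a
spec:
  podSelector: {}
  ingress:
  - from:
    - podSelector: {}            # only pods in "team-a" may connect
```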
Another best practice is to make custom resource definitions namespace-scoped. They can be, and arguably they should be. There are some use cases where that might not be the right thing to do, but in general, when you have these custom resource definitions - which are extensions, so you might have different teams writing extensions to Kubernetes and their own custom controllers - in many cases, it makes sense to make them namespace-scoped so they don't conflict with whatever other people are doing, because you might not be interested in other people's extensions, and you might not like the side effects that they might have.
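A minimal sketch of such a CRD follows; the group and kind are purely illustrative. Note that the CustomResourceDefinition object itself is cluster-wide, but "scope: Namespaced" means the custom resources created from it live inside individual namespaces:

```yaml
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: widgets.example.com      # must be <plural>.<group>
spec:
  group: example.com
  scope: Namespaced              # instances are created per namespace
  names:
    plural: widgets
    singular: widget
    kind: Widget
  versions:
  - name: v1
    served: true
    storage: true
    schema:
      openAPIV3Schema:
        type: object
        x-kubernetes-preserve-unknown-fields: true
```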
The final thing I want to touch on in terms of isolation is Sandboxes. People now go around saying containers don't contain, so they're not great security boundaries. What that means in practice is if you run your containers on a node, then there are certain security considerations that you have to think about. For instance, you have to make sure that when that pod runs on the node, the pod cannot easily access the host kernel, and then hack into the host kernel and escalate privileges, and then get access to everything else that is running in the cluster.
Sandboxes put a tighter security boundary around each pod, so you can just launch all your pods in sandboxes. There are several different ones; gVisor is one that's been developed. It's actually open-source, but Google is investing very heavily in it, so I know a little bit more about it. The way it works is, it puts up the security boundary and isolates the pods more. The goal is to make sure that information is not leaked between tenants, and tenants can't break out accidentally or maliciously and mess everybody up, and stop everybody else's containers. That's something to consider.
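As a hedged sketch, assuming your nodes have a gVisor-enabled runtime installed and that its handler is registered under the common name "runsc", you would declare a RuntimeClass once and then opt individual pods into the sandbox:

```yaml
# Register the sandboxed runtime with the cluster (assumes "runsc" is the
# handler name configured in the node's container runtime).
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: gvisor
handler: runsc
---
# A pod that opts into the gVisor sandbox; names and image are illustrative.
apiVersion: v1
kind: Pod
metadata:
  name: sandboxed-app
  namespace: team-a
spec:
  runtimeClassName: gvisor       # run this pod inside the sandbox
  containers:
  - name: app
    image: nginx
```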
There are a lot of details here, but what I want you to take away from this part of the presentation is that when you have a large multi-tenant cluster, you assign namespaces, and that is really only the first step. What you then do is set up all of the security and isolation mechanisms so that, in essence, you create a more tightly controlled universe for each namespace.
Let's talk a little bit about fair sharing. What I'm going to be talking about on this slide is fair sharing in what we call the data plane, which is the cluster of nodes, so fair sharing of resources. The reason I'm talking here about the data plane is because there are different mechanisms on the data plane, and it's actually better developed than it is on the control plane on the master. I'll talk about the master in the next part of the presentation. Let's talk about the data plane a little bit.
When you have all of your different teams running your applications, I have experienced this, maybe some of you have too, even when you want to behave nicely and you're incentivized to behave nicely, what happens sometimes is that all of a sudden, you get a lot of traffic. Then your autoscaler kicks in and that's wonderful, and your application still runs, but now others cannot run. You have to make sure that you have the mechanisms in place so that tenants don't trample on each other and don't crowd each other out. The most important or maybe the most fundamental way to do this in a multi-tenant Kubernetes cluster is with something called Resource Quotas.
Resource Quotas are meant to allow you to set resource limits for each namespace, which makes sense because you have a number of nodes, and you need to make sure that you carve up the resources among the namespaces. There's also something called LimitRange, enforced by the LimitRanger admission controller, which lets you set defaults within each namespace. Essentially, what you're going to want to do is think about how many resources everybody gets for CPU, for memory, and also for things like object counts. How many persistent volume claims can I have per namespace? Because there are limits on how many volumes you can mount on each virtual machine, depending on where you run them, you have to make sure that those are shared fairly as well. Resource Quotas let you do that.
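A sketch of what this might look like for one tenant namespace; the numbers are arbitrary and only for illustration:

```yaml
# Cap the total CPU, memory, and object counts the "team-a" namespace can claim.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-a-quota
  namespace: team-a
spec:
  hard:
    requests.cpu: "20"
    requests.memory: 64Gi
    limits.cpu: "40"
    limits.memory: 128Gi
    pods: "100"
    persistentvolumeclaims: "20"
---
# Give containers sensible defaults if they don't declare requests/limits,
# so they still count against the quota in a predictable way.
apiVersion: v1
kind: LimitRange
metadata:
  name: team-a-defaults
  namespace: team-a
spec:
  limits:
  - type: Container
    default:                     # default limits
      cpu: 500m
      memory: 512Mi
    defaultRequest:              # default requests
      cpu: 250m
      memory: 256Mi
```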
Then there are things that let you put priorities and Quality of Service classes on the pods. These are related concepts, and what I want you to take away is that there are ways you can control, to some degree, which pods run at a higher priority than others. You're probably familiar with this concept of priority, even if it's not from Kubernetes, from other systems like Linux. It essentially lets you control that. Quality of Service classes are another twist on this because they let you also say, "This is how many resources I need, but I can potentially burst beyond them," or, "I need an absolute guarantee that these pods will run."
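As an illustrative sketch, a PriorityClass gives some workloads precedence at scheduling and preemption time, and a pod whose requests equal its limits lands in the "Guaranteed" QoS class; all names and numbers here are hypothetical:

```yaml
# A priority class for more important tenant workloads.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: tenant-high
value: 1000
globalDefault: false
description: "Higher-priority tenant workloads"
---
# A pod that uses that priority class; requests == limits puts it in the
# Guaranteed QoS class, so it is last in line to be evicted under pressure.
apiVersion: v1
kind: Pod
metadata:
  name: important-app
  namespace: team-a
spec:
  priorityClassName: tenant-high
  containers:
  - name: app
    image: nginx                 # illustrative image
    resources:
      requests:
        cpu: 500m
        memory: 256Mi
      limits:
        cpu: 500m
        memory: 256Mi
```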
Then finally, the last two bullet items here: node and pod affinity. Those are mechanisms that let you influence the scheduler. The scheduler is a complicated piece of technology that is not always easy to reason about. Your pods are scheduled and you're not really all that sure why they ended up where they did. It's often complicated to reason about, but there are mechanisms that let you influence the scheduler. We heard a little bit about Affinity and Pod Anti-affinity before. What that means is you can say these two pods shouldn't be scheduled together. In the context of multi-tenancy, that might mean that applications in two different namespaces should not end up on the same node. Maybe you have namespaces that run financial applications that you want to keep separate from other things.
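As a hedged sketch of that financial-workload example - every name, label, and namespace here is hypothetical - pod anti-affinity can keep a pod off any node that already runs pods from the sensitive namespace:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: batch-worker
  namespace: team-a
spec:
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            workload: financial          # label carried by the pods to avoid
        namespaces: ["team-finance"]     # namespace those pods live in
        topologyKey: kubernetes.io/hostname  # "not on the same node"
  containers:
  - name: worker
    image: busybox
    command: ["sleep", "3600"]
```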
One interesting concept that I'd like to call out here is this concept of taints and tolerations, and that's really interesting in the context of multi-tenancy. The way it works is that for a node, you give it a taint, which is essentially a label. You say, "green," and then only pods that have a toleration that matches it get scheduled on it. Only pods that have that same toleration, green, will get scheduled on that node. What that means for us in the context of multi-tenancy is, it's a way for you to control things if there is a need to have nodes that only schedule pods from a specific namespace. That's how you get it done.
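A minimal sketch, with a hypothetical node name and taint key/value:

```yaml
# First, taint the node so that pods without a matching toleration are kept off:
#   kubectl taint nodes node-1 tenant=green:NoSchedule
#
# Then give the "green" tenant's pods a toleration for that taint.
apiVersion: v1
kind: Pod
metadata:
  name: green-app
  namespace: team-green            # hypothetical tenant namespace
spec:
  tolerations:
  - key: "tenant"
    operator: "Equal"
    value: "green"
    effect: "NoSchedule"           # this pod tolerates the taint
  containers:
  - name: app
    image: nginx
```

One caveat: the taint only keeps other tenants' pods off that node; to also force the green tenant's pods onto those dedicated nodes, you would combine this with a node selector or node affinity.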
Control Plane Multitenancy
Let's talk a little bit more about the control plane. The control plane, again, is the master. The API server and the scheduler are what's most important in our context here. Much of what we talked about so far was the nodes, so let's talk a little bit about the master. One of the things you will notice, as we're going through this, is that we're sharing the cluster that's on the right, so all of these applications are sharing the cluster, but the other thing we're sharing is the master. We're still, all together, sharing that one master.
There’s one thing I should point out. Remember how I said at the beginning when people say multi-tenancy, they sometimes mean different [inaudible] not able to do that. We're definitely in this mode of one master controlling one cluster. That's fine.
All tenants share the master, and that includes things like secrets. Remember what we talked about, you need to protect your secrets with RBAC. One of the things that the master, in particular the API server, is not really great at right now is preventing tenants from DDoSing it or DoSing each other and crowding each other out. You have this master running, and the master takes in these requests from users. Imagine you have one user that all of a sudden just sends all of these requests. Then the API server is like, "I don't know what to do with all these requests." The API server gets behind and other tenants don't get a word in, their requests get rejected. It's actually worse than that, because the API server could also drop things like garbage collection and other internal work.
There is work underway that you can check out, it's going on right now in the open-source community that will enable better fair sharing on the API server. The way this will work is, it's sort of a redesign of this concept of max inflight requests. Max inflight requests is a concept in the API server that essentially says, "Here is how many requests I can handle at any one given time and the rest, I just reject."
In this proposal that is currently underway - you can read this on the slide - what it will do is generalize max-inflight request handling in the API server to make more distinctions among requests, and provide prioritization and fairness among the categories of requests. That is a mouthful; you can read it again in the slides afterwards, but let me just explain a little bit what that means for us in the context of multi-tenancy.
In this new way of doing things, when different tenants have requests coming in, there will be different priority levels, so the system requests will take the highest priority level, as you might guess, and then different tenants can perhaps be at the same priority level. Different tenants will likely have different queues, and then they will compete evenly for the API server.
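For reference, this work later shipped in Kubernetes as the API Priority and Fairness feature (flowcontrol.apiserver.k8s.io). A hedged sketch of its configuration objects, with purely illustrative names and numbers: a PriorityLevelConfiguration defines the queues and concurrency share, and a FlowSchema routes a tenant's requests to it.

```yaml
# A priority level with its own queues and a share of the API server's capacity.
apiVersion: flowcontrol.apiserver.k8s.io/v1
kind: PriorityLevelConfiguration
metadata:
  name: tenant-default
spec:
  type: Limited
  limited:
    nominalConcurrencyShares: 30
    limitResponse:
      type: Queue
      queuing:
        queues: 64
        queueLengthLimit: 50
        handSize: 6
---
# Route requests from a hypothetical tenant group to that priority level,
# with per-user fairness inside it.
apiVersion: flowcontrol.apiserver.k8s.io/v1
kind: FlowSchema
metadata:
  name: team-a-requests
spec:
  priorityLevelConfiguration:
    name: tenant-default
  distinguisherMethod:
    type: ByUser
  rules:
  - subjects:
    - kind: Group
      group:
        name: team-a-developers    # hypothetical group
    resourceRules:
    - verbs: ["*"]
      apiGroups: ["*"]
      resources: ["*"]
      namespaces: ["*"]
```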
What Companies Care About
Before I conclude, let me recap: I've given you a number of different things to look at and to think about for how you set this up. Now, let me get back to the beginning of the presentation and bring home the point of why multi-tenancy can help you with velocity and cost. You may walk out of this presentation saying, "That's really complicated," but when you really get down to it, when you have a shared cluster, you now have the ability to have policies across the cluster, to have the same network settings across the cluster, and to control things more tightly.

In my experience, that will really help you long-term with your velocity, with your speed of getting things out faster, so you don't have to look in 100 places. Then, just going back to the point about cost: cost is something that you can save by sharing the master, but also by sharing the resources of the underlying nodes between all the namespaces.
Key Takeaways
What I want you to take away from this presentation is, think about multi-tenancy as one of the tools in your toolbox if you want better resource efficiency, costs, and operations. We talked about velocity and costs. Think about it, and see if it applies to your use cases.
When you read a little bit more about multi-tenancy, you will hear people talk about hard and soft. Remember that it's a spectrum. Hard means you don't trust the tenants at all; soft means you trust them completely. The truth is usually somewhere in the middle, because even if you trust the tenants completely, they might make stupid mistakes and you still want to protect yourself against them, to some extent.
We're on this road towards making hard multi-tenancy really viable. There is ongoing work, which is really encouraging to me and it's very exciting. Right now, what we see is that quite a number of companies use soft multi-tenancy in production, and as you saw in this presentation, there's still some setup required and some knobs you have to turn to make sure that it works well for you. It is definitely something that can work out very well.
When I share the slides, I will have a few links at the end that link out to some of the open-source work that I was talking about here, so you can read a little bit more.