Transcript
Zhang: I'm going to talk about Kubernetes resource management. This talk is aimed at application developers who have used Kubernetes for some time and want to make better use of it, or at people who are considering using Kubernetes soon. First, let me introduce myself briefly. I'm a software engineer working at Google. I first learned about containers and their resource management when I was working on the Google production kernel team. There I saw how containers helped push Google production machine utilization to 80%, 90%, or even beyond 100%. With Kubernetes, I'm glad to see that the technology we built inside Google can now benefit a much larger group of people. I joined the Google Kubernetes team two years ago to be part of this effort.
Let's get started. First of all, why would you, as an application developer, want to listen to this talk? You may have heard that a big benefit of Kubernetes is separation of concerns: application developers only need to think about their applications, and cluster operators set up everything else for them. That is definitely the goal we want to reach someday, but there are different stages of user experience along the way. In the most ideal case, when you get a device, a tool, or a machine, you immediately know how to use it just by looking at it. Good examples at this stage are smartphones, tablets, or some of the home devices we have seen recently. A less ideal stage is when you need to read the user manual in order to use something, and if you don't read it carefully, you may get a surprise later or even face serious consequences.
Even worse, in terms of user experience, are the cases where you need to understand how the underlying pieces are put together and how they work in order to operate something. A good example is the automobile in its early days: people needed some mechanical background to drive a car, and sometimes to fix it. Today, the Kubernetes resource management model sits somewhere between that very early stage and the second stage. The earliest stage is obviously not a pleasant user experience, but people may still want to use such tools because they solve serious pain points and, in general, make our lives easier.
Kubernetes, I think, is similar: it provides big benefits. It can do automatic bin packing for us, and if you set correct resource requests for your containers, Kubernetes can manage things for you automatically. It allows us to make more efficient use of our physical resources, and in general you get more predictable application performance even when your application runs alongside other applications on the same node. However, there are also underlying details and best practices you may want to know so that you can make better use of the system and avoid future pitfalls, and those are what I want to cover in this talk.
Let’s Start with a Simple Web App
Let's start with a simple web application. Suppose you write an application, containerize it with Docker, and then write a .yaml file to specify how your Pod should run. You have measured your application on your local machine, and you know it takes roughly 0.3 CPU and 1.5 GB of memory to run. You also want to leave some resource buffer, so you slightly increase the CPU and memory amounts in your limits, and then you go ahead and run your application with kubectl. About half a minute later, you check back and see that your Pod is still in Pending status. Taking a closer look, it turns out that none of the nodes in your cluster has sufficient memory to run your Pod. At this stage, your Pod is stuck in scheduling.
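To make this concrete, here is a rough sketch of what such a Pod spec might look like; the Pod name, image, and exact buffer values are made up for illustration and are not from the talk:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: simple-web-app            # hypothetical name
spec:
  containers:
  - name: web
    image: registry.example.com/simple-web-app:1.0   # placeholder image
    resources:
      requests:
        cpu: "300m"               # ~0.3 CPU, as measured locally
        memory: "1.5Gi"           # ~1.5G of memory
      limits:
        cpu: "400m"               # slightly above the request as a buffer
        memory: "2Gi"
```

Applying this with `kubectl apply -f pod.yaml` and then running `kubectl describe pod simple-web-app` would show the scheduling events, which is where the insufficient-memory failure described above appears.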
At a very high level, the Kubernetes scheduler is responsible for assigning Pods to nodes based on the Pods' resource requests and how much resource each node exports. It makes sure that the total resource requests of all Pods scheduled to a node stay within the node's allocatable. What the scheduler looks at today is node allocatable, which is different from the node's total capacity, because a certain amount of resource also needs to be reserved for system components. In the early days we actually looked at capacity only, and that caused a lot of reliability issues, because system components need resources to run too, and sometimes they need a good amount. After we introduced node allocatable, we can now reserve enough resources for system components and avoid the problems that may happen when system utilization goes high.
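For reference, node allocatable is typically carved out through the kubelet configuration, which your provider or operator manages; a sketch with illustrative reservation sizes:

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
systemReserved:                 # reserved for OS daemons
  cpu: "500m"
  memory: "500Mi"
kubeReserved:                   # reserved for Kubernetes components (kubelet, runtime)
  cpu: "500m"
  memory: "1Gi"
evictionHard:
  memory.available: "100Mi"
# allocatable ≈ capacity − systemReserved − kubeReserved − hard eviction threshold
```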
You now know the problem is that your nodes don't have enough memory to run your workload, so you ask your cluster administrator to create a bigger node. After that, your Pod is assigned to a node and starts running correctly.
We have talked about how the scheduler assigns Pods to nodes based on resource requests. What about limits? Limits are only used at the node level. Roughly, a limit specifies an upper bound on the resource that a container or Pod can consume. Based on the request and limit settings, we can classify Pods into different QoS classes. If you specify requests and limits for both CPU and memory and they are equal, your Pod is in the Guaranteed QoS class. If you specify requests and limits but they are not equal, your Pod is Burstable. If you don't specify any requests or limits for CPU and memory, your Pod is considered BestEffort. This is the lowest QoS class; such Pods are more likely to be evicted or killed when the system is under resource pressure.
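As an illustration, here are container-level resources stanzas that would land a Pod in each QoS class (the values are made up):

```yaml
# Guaranteed: requests and limits set and equal for both CPU and memory
resources:
  requests: {cpu: "500m", memory: "1Gi"}
  limits:   {cpu: "500m", memory: "1Gi"}
---
# Burstable: requests set, limits higher (or only some of them set)
resources:
  requests: {cpu: "250m", memory: "512Mi"}
  limits:   {cpu: "500m", memory: "1Gi"}
---
# BestEffort: no requests or limits at all
resources: {}
```

You can check which class was assigned with `kubectl get pod <name> -o jsonpath='{.status.qosClass}'`.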
So far things seem straightforward, but you need to know a little more to use them correctly, because resource requests and limits have different implications for different kinds of resources, as the underlying enforcement mechanisms differ. We can classify resources into two types: compressible and incompressible. Compressible resources can be throttled; when the system revokes them, they should only cause slowness in your application. Examples include CPU, network bandwidth, and disk I/O. Incompressible resources, in comparison, cannot be easily throttled. They usually hold some kind of state, so if they are revoked, a container may die or a Pod may even be evicted. Examples of incompressible resources include memory, disk space, number of process IDs, and inodes.
CPU Requests and Limits
Next, I'm going to go a little deeper and look at how requests and limits are used for CPU, memory, and local ephemeral storage.
CPU requests map to the cgroup cpu.shares setting on the node. cgroups are the underlying Linux kernel mechanism that enforces resource sharing. The cpu.shares assigned to a Pod roughly equal the CPU requests in your spec. The values differ slightly because Linux divides a CPU core into 1024 shares, while Kubernetes divides a CPU core into 1000 millicores.
CPU shares define the relative CPU time assigned to a cgroup. It's relative because it is calculated against the total CPU shares active on your node. For example, if you have two cgroups, one with 200 shares and the other with 400 shares, they will get CPU time in roughly a 1:2 ratio. If you add another cgroup, the CPU time assigned to the first two drops proportionally. What this means is that the actual CPU time your Pod gets also depends on how much CPU other Pods on the same node request. In the worst case, it should roughly equal the amount of CPU requests in your spec.
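A small sketch of that arithmetic, using made-up request values:

```yaml
# Two containers on the same node, CPU requests only:
resources:
  requests:
    cpu: "200m"     # -> cpu.shares ≈ 200 × 1024 / 1000 ≈ 204
---
resources:
  requests:
    cpu: "400m"     # -> cpu.shares ≈ 400 × 1024 / 1000 ≈ 409
# Under contention these two get CPU time in roughly a 1:2 ratio.
# Adding more requesting Pods shrinks both proportionally, but each container
# is still guaranteed roughly its requested amount in the worst case.
```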
CPU limits, on the other hand, map to a different CPU scheduling mechanism: cgroup CFS quota and period. CFS stands for Completely Fair Scheduler, the default CPU scheduler in the Linux kernel today. A CFS quota is defined in terms of a period, so two parameters matter here: the CFS quota, which is the total available run time within a period, and the CFS period, which specifies how long a period is. In today's Linux kernel the default period is 100 milliseconds, and this is also the default Kubernetes uses. This seems like a very straightforward model: the CFS quota specifies the bandwidth you can get within each period.
An implication you need to be aware of is that if you don't set CPU limits correctly, they can add latency to your application. For example, suppose a container takes 30 milliseconds of CPU time to handle a request without throttling. With the default 100-millisecond period, if your CPU limit translates into a 50-millisecond quota per period, the container still finishes the request in 30 milliseconds. If you set the CPU limit too low, say a 20-millisecond quota per period, the same request takes more than 100 milliseconds to finish. This is by design, but I have seen many people get surprised when their latency goes up.
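A sketch of how a limit translates into a CFS quota in this example; the limit values here are my own reconstruction of the speaker's numbers:

```yaml
# Default CFS period: 100ms. Quota per period ≈ cpu limit × 100ms.
resources:
  limits:
    cpu: "500m"   # 50ms quota per period: a request needing 30ms of CPU finishes in ~30ms
---
resources:
  limits:
    cpu: "200m"   # 20ms quota per period: the same request runs 20ms, is throttled until
                  # the next period, runs the remaining 10ms, and finishes after ~110ms
```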
What makes things even more confusing is that the Linux kernel may not always behave as expected. CFS was introduced into the Linux kernel over 10 years ago, but people are still changing it today. It's good that kernel developers keep improving the system, but as software engineers we also know that every change brings a chance of introducing bugs. Here are a couple of example issues I have heard about that have also affected Kubernetes users.
In the first issue, people reported that their containers got throttled incorrectly even though their CPU usage was very low. In the second issue, people saw inconsistent latencies in their applications, and the latency changed with different CFS period settings. This issue started a lot of discussion in the Kubernetes community, and people wondered whether we should set CPU limits at all, whether we should change the default period, or whether we should disable CFS quota entirely.
Fortunately, I think the latest upstream kernel has both issues fixed, but I do want to take a step back and ask why you would want to use CPU limits in the first place. There are four reasons I have heard so far. The first is that people run in a Kubernetes environment with usage-based billing and want to constrain their CPU usage to limit their cost; that, I think, is a very legitimate reason to set CPU limits. The second is that people want to know how their application will behave, and what the latency will look like, under worst-case CPU access time; that is also reasonable, especially for testing purposes. The third is that people measured their application performance and saw a big difference between an exclusive core and a shared core. They want to use the static CPU manager feature that was recently introduced to Kubernetes, and that requires setting CPU limits equal to CPU requests; that is also a good reason to use CPU limits. The last reason I have heard is that some people just want to keep their Pods in the Guaranteed QoS class because they heard it reduces the chance of their Pod being evicted or OOM-killed. This one deserves a second thought. First, our current eviction mechanism no longer depends on QoS. It used to, but after we introduced Pod priority, we replaced the QoS class with priority in the eviction score calculation, so keeping your Pod in Guaranteed QoS will not reduce its chance of being evicted. For OOM-killing, we still look at QoS today, so it does affect OOM-killing behavior a little, but if you think about it, it seems a bit weird to set CPU limits to avoid a memory problem.
That probably doesn't sound right; you want to make sure you set correct memory requests and limits to avoid OOM-killing in the first place. Personally, I hope that in the future we can decouple QoS from the OOM-killing settings so that you don't need to mix these concerns when you think about CPU limits. The quick takeaway is: if you do want to use CPU limits, use them with care. Make sure your node is running a Linux kernel with the latest fixes patched in, and make sure you set a CPU limit that matches your application's behavior, to avoid unexpected latency.
Memory and Eviction
Now, let's move on to memory. Memory requests don't actually map to any cgroup setting; they are only used internally by the kubelet for memory eviction at the node level. Let me talk a little more about eviction. It is the general mechanism the kubelet uses to reclaim incompressible resources. We have eviction implemented for memory, but also for other incompressible resources like disk space, process IDs, and inodes.
Generally, the kubelet determines when to reclaim a resource based on eviction signals and eviction thresholds. An eviction signal reports the currently available capacity of a particular resource, and an eviction threshold specifies the minimum amount of that resource the kubelet should maintain. There are two kinds of thresholds, soft and hard. The difference is that when a soft threshold is hit, the kubelet tries to terminate Pods within a grace period, but when a hard threshold is hit, the kubelet kills Pods directly without considering any grace period.
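These thresholds live in the kubelet configuration, which is normally managed by the operator; a sketch with illustrative values:

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
evictionHard:                         # evict immediately, no grace period
  memory.available: "100Mi"
  nodefs.available: "10%"
evictionSoft:                         # evict only after the grace period below
  memory.available: "300Mi"
evictionSoftGracePeriod:
  memory.available: "1m30s"
evictionPressureTransitionPeriod: "5m"   # how long the node stays in a pressure condition
```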
You can configure these settings for each individual resource, and there are a lot of configurations you could care about, but you probably don't want to, because hopefully your providers and operators set them correctly for you so that you don't need to worry about them.
As an application developer, what you do want to know is that your Pod may get evicted if it uses a resource well beyond its requested amount and that resource is close to being exhausted on the node. Today the kubelet decides which Pod to evict based on an eviction score calculated from the Pod's priority and how far the Pod's actual usage is above its request. One caveat is that the second part is not implemented for process IDs today, but hopefully we can fix this in an upcoming release so that your Pod won't get unfairly killed when it hasn't created many processes but just happens to run on the same node as a badly behaved Pod.
You can reduce your Pod's risk of being evicted by setting the right requests for memory and ephemeral storage. For those two resources we have APIs to specify requests and limits. For other incompressible resources we don't have explicit request APIs, so you want to avoid using too much of them, or perhaps consider increasing the node's limit through the corresponding sysctl setting. If your Pod is very critical, you can give it a higher priority to reduce the chance of it being evicted, but keep in mind that this only reduces the chance. It does not eliminate the risk, because if your Pod is the only one using more than its requested resources, it will be the only victim the kubelet considers evicting.
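As an illustration (the class name, priority value, and resource numbers are made up), setting requests plus a priority class looks roughly like this:

```yaml
apiVersion: scheduling.k8s.io/v1      # v1beta1 on older clusters
kind: PriorityClass
metadata:
  name: critical-web                  # hypothetical name
value: 100000
globalDefault: false
description: "Latency-critical application Pods"
---
apiVersion: v1
kind: Pod
metadata:
  name: simple-web-app
spec:
  priorityClassName: critical-web
  containers:
  - name: web
    image: registry.example.com/simple-web-app:1.0   # placeholder image
    resources:
      requests:
        memory: "1.5Gi"
        ephemeral-storage: "1Gi"      # requesting what you actually use lowers eviction risk
      limits:
        memory: "2Gi"
        ephemeral-storage: "2Gi"
```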
If you have done all these things and still see unexpected behavior, you may want to check some of the underlying settings with your cluster operator. For example, if the kubelet or Docker keeps running out of resources, check the eviction signal and threshold settings to make sure enough resource is reserved. If a node keeps bouncing in and out of the memory-pressure or disk-pressure condition, consider increasing the eviction pressure transition period so that it stops flapping.
Memory limits map to the cgroup memory limit in bytes, which specifies the maximum amount of memory a container can consume. If a container exceeds its memory limit, it gets OOM-killed by the Linux kernel. Sometimes you may see your process get OOM-killed even though it hasn't exceeded its limit. This can happen when the kubelet is not able to reclaim memory quickly enough; in that case the OS kicks in to make sure it has enough memory to run its critical kernel threads. The Linux kernel then decides which process to kill based on the OOM score, and today we adjust the OOM score based on QoS class and memory requests.
Basically, we give the lowest OOM score to critical node components like the kubelet and Docker so that they are least likely to be killed. Right next to them are Guaranteed Pods. BestEffort Pods get the highest OOM score, so they are the most likely to be killed, and Burstable Pods sit somewhere between Guaranteed and BestEffort. The exact OOM score is calculated based on memory requests.
What do you need to know about OOM-killing? OOM-killing is even worse than memory eviction, because the whole system may experience performance degradation and your application doesn't get a chance to terminate gracefully. If you don't like eviction, you have even more reason to dislike OOM-killing. You can reduce the chance of your application being OOM-killed by setting correct memory limits and reserving enough memory for system components. You should also try not to accumulate too many dirty pages, to reduce the chance of your system entering the OOM state.
Ephemeral Storage
Now let's look at local ephemeral storage. This is still a beta feature, so some of the underlying mechanisms may change in the future; I'm going to describe the current implementation. Local ephemeral storage refers to the local root partition shared by user Pods and system components. For a container, this includes writable layers, image layers, and logs. At the Pod level, it also includes emptyDir volume storage. Such storage has the same lifecycle as the Pods and containers that use it, in contrast to persistent disks, whose lifecycles usually outlive the containers and Pods that use them. In general we don't consider persistent disks a node resource, so I'm not going to talk about them today; I'll focus only on local ephemeral storage.
At the container level, you can specify ephemeral storage requests and limits just as you do for CPU and memory. At the Pod level, we currently don't have APIs for Pod-level resource requirements, so we use the emptyDir sizeLimit field to express this. When the scheduler makes its scheduling decision, it only looks at the ephemeral storage requests from containers; it doesn't look at the emptyDir sizeLimit field. An implication of this is that if your Pod uses a lot of emptyDir storage, it may push your node into disk pressure more easily.
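A sketch of a Pod that sets both the container-level ephemeral storage requirements and an emptyDir sizeLimit (names and sizes are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: scratch-demo                  # hypothetical name
spec:
  containers:
  - name: worker
    image: registry.example.com/worker:1.0   # placeholder image
    resources:
      requests:
        ephemeral-storage: "2Gi"      # this is what the scheduler looks at
      limits:
        ephemeral-storage: "4Gi"
    volumeMounts:
    - name: scratch
      mountPath: /scratch
  volumes:
  - name: scratch
    emptyDir:
      sizeLimit: "1Gi"                # enforced by the kubelet at runtime, ignored by the scheduler
```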
Under disk pressure, the kubelet tries to find a Pod to evict. If the node has the local storage capacity isolation feature enabled, the kubelet first checks whether any container has exceeded its ephemeral storage limit. Then it checks whether any Pod has an emptyDir whose usage is beyond its size limit.
The third check is whether the total of the containers' ephemeral storage usage plus the emptyDir usage has exceeded the sum of the containers' ephemeral storage limits. As you may have noticed, this behavior is a bit different from the general eviction mechanism we use for other resources. If the kubelet cannot find any Pod to evict in this round, it falls back to the general eviction mechanism, based on the eviction score calculated from priority and how far the Pod's actual usage is above its requested storage.
I think this first part is still evolving. Some people have been looking into whether we can use filesystem quotas to enforce storage limits, but that is still an alpha feature, so this part may well change in future Kubernetes releases.
Beyond Basic Use Cases
Let's go a little beyond the basic use case. What if your application makes heavy use of disk I/O? We mentioned that disk I/O is a compressible resource, but currently Kubernetes has no throttling mechanism for it. If you need a lot of disk I/O, first make sure you provision enough I/O bandwidth and IOPS on your node, and consider using Pod anti-affinity to avoid running two I/O-heavy Pods on the same node; see the sketch below. This is an API you can put in your Pod spec, and the scheduler takes it into account when making scheduling decisions.
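A sketch of that anti-affinity rule, assuming you label your I/O-heavy Pods yourself (the label key and value are made up):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: io-heavy-batch                # hypothetical name
  labels:
    workload-type: io-heavy           # hypothetical label used by the rule below
spec:
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            workload-type: io-heavy
        topologyKey: kubernetes.io/hostname   # never co-locate two such Pods on one node
  containers:
  - name: batch
    image: registry.example.com/io-batch:1.0  # placeholder image
```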
You may also want to consider using dedicated disk volumes so that you don't share I/O bandwidth with other Pods on the same node. What if your application is network-latency sensitive or requires a lot of network bandwidth? Similarly, Kubernetes has no throttling mechanism for network bandwidth, so you probably also want to avoid running two network-heavy Pods on the same node; you can use the same Pod anti-affinity approach to spread them across nodes. You may also consider a high-performance NIC to get enough bandwidth, but first make sure the actual bottleneck is on the host network, because I've seen many cases where the real bottleneck is the network switches, which is even harder to control.
What if your application is sensitive to CPU cache interference? You may want to consider the static CPU manager feature we introduced recently. If you set CPU requests and limits equal, give them an integer value, and have the feature enabled, the kubelet reserves exclusive cores for your application. If you want to run your workload on GPUs today, you can request GPUs as an extended resource. Extended resources can be anything other than GPUs; I have seen people use them for high-performance NICs, FPGAs, and even licensing control. We don't support over-commitment or sharing for extended resources today, so you have to set requests and limits equal, with an integer value. GPUs are usually expensive, so you may want to use taints and tolerations to prevent ordinary Pods from occupying your expensive GPU nodes.
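A sketch of a GPU Pod, assuming the device plugin exposes the resource as nvidia.com/gpu and the operator has tainted the GPU node pool (the taint key here is an assumption):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-trainer                   # hypothetical name
spec:
  tolerations:
  - key: "gpu"                        # assumed taint key on the GPU node pool
    operator: "Exists"
    effect: "NoSchedule"
  containers:
  - name: trainer
    image: registry.example.com/trainer:1.0   # placeholder image
    resources:
      limits:
        nvidia.com/gpu: 1             # extended resources: integer only, no over-commitment,
                                      # requests default to (and must equal) the limits
```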
There are some other things that can affect how your Pods are scheduled and run. The first is priority and preemption. If your cluster has preemption enabled and your Pods have different priority classes assigned, the scheduler may preempt lower-priority Pods so that it can schedule higher-priority pending Pods. This is a knob you can use to make sure your high-priority workloads have a place to run. There are also a couple of admission policies related to resource management: resource quota and limit range. When different teams share the same cluster, administrators usually give each team its own namespace, and then assign a different resource quota to each namespace to control how much resource a particular team can consume.
Resource quota is checked at Pod creation time: if your namespace has exceeded its quota, your Pod cannot be created. Limit range is a policy control operators can enforce; they can use it to set default requests and limits for a namespace, enforce minimum or maximum resource requirements, or even require a certain ratio between requests and limits. You probably don't need to know too many details about these admission policies, but you do want to know how they are set up in your cluster, so that you don't get surprised when things change underneath you or your Pod cannot be created because it has exceeded the quota.
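For reference, these two policies look roughly like this; the namespace, names, and numbers are illustrative, and in practice your cluster administrator writes them:

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-a-quota                  # hypothetical name
  namespace: team-a
spec:
  hard:
    requests.cpu: "20"
    requests.memory: "64Gi"
    limits.cpu: "40"
    limits.memory: "128Gi"
---
apiVersion: v1
kind: LimitRange
metadata:
  name: team-a-defaults               # hypothetical name
  namespace: team-a
spec:
  limits:
  - type: Container
    defaultRequest:                   # applied when a container sets no requests
      cpu: "100m"
      memory: "256Mi"
    default:                          # applied when a container sets no limits
      cpu: "500m"
      memory: "512Mi"
    max:
      cpu: "2"
      memory: "4Gi"
```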
Things That Can Make Your Life Easier
So far we have talked about things you may want to know when setting requests and limits for CPU, memory, and ephemeral storage, some of the underlying configurations that may affect your application's behavior, and some of the underlying details worth knowing. That's a lot to think about as an application developer, and you may start to wonder whether your life will become easier or harder when you start using Kubernetes.
The good news is that Kubernetes also provides automation mechanisms that can make your life easier. The first is the Horizontal Pod Autoscaler. It can automatically scale the Pods in your ReplicaSet or Deployment up or down based on CPU utilization, or on custom metrics you define yourself, like QPS or input request queue length.
Consider using the Horizontal Pod Autoscaler if you can load-balance work across replicas and your Pod's resource usage is proportional to its input. Even better, use it together with the Cluster Autoscaler, which can automatically create more nodes to run pending Pods and scale them back down after your job finishes.
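A sketch of an HPA targeting CPU utilization (the Deployment name and numbers are made up, and the exact API version depends on your cluster):

```yaml
apiVersion: autoscaling/v2            # older clusters use autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: simple-web-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: simple-web-app              # hypothetical Deployment name
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70        # target 70% of each Pod's CPU *request*
```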
Another autoscaler you may want to consider is the Vertical Pod Autoscaler. If your application's resource requirements change over time, the Vertical Pod Autoscaler can automatically manage and set resource requests for you. Some of its features are still experimental, so it may not always work as expected, but it is something to watch because it can make your life easier in the future.
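The VPA is installed separately as a CRD; a sketch of what its object looks like (the target name is made up, and the API version may differ by release):

```yaml
apiVersion: autoscaling.k8s.io/v1     # may be v1beta2 depending on the VPA release
kind: VerticalPodAutoscaler
metadata:
  name: simple-web-app-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: simple-web-app              # hypothetical Deployment name
  updatePolicy:
    updateMode: "Auto"                # or "Off" to only surface recommendations
```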
Wrap Up
A quick wrap-up. Set CPU requests to reserve enough CPU time for your Pod. If you want to use CPU limits, be careful: make sure your node is patched with the latest kernel fixes, and set a CPU limit that matches your application's behavior. Set correct memory requests and limits to avoid being evicted or OOM-killed when the node is under memory pressure. Prevent your nodes from running out of disk by setting correct ephemeral storage requests and limits, as well as emptyDir sizeLimit. In general, avoid exhausting incompressible resources. If your Pod uses a lot of I/O or network, make sure you have reserved enough of those resources on your node, and perhaps avoid sharing them.
Understand your cluster's administrative settings to avoid surprises from resource quotas and limit ranges. If you want to use GPUs, request them as an extended resource, and try to use the autoscalers to make your life easier. In summary, the Kubernetes resource management model is still evolving and we still have a long way to go, but hopefully someday the things I talked about today will become underlying details you don't need to care about, and everything will just run automatically for you.
Questions and Answers
Participant 1: I have a related question. I have a Deployment with a number of replicas that use a PVC. The problem is that as soon as we try that, all the replicas get scheduled on the same worker node, because they share the same PVC binding. Is there any way to do that differently, so that each replica has its own storage and is ideally scheduled on a different node?
Zhang: Is your question related to the Horizontal Pod Autoscaler?
Participant 1: PVC, persistent volume claims.
Zhang: PVCs can also affect scheduling. In general, you want to make sure a PV that can satisfy the PVC is available in your cluster, and PVs also have topology constraints; for example, in some Kubernetes setups you need to make sure available PVs exist at the zonal level. I have seen people create PVs ahead of time so that there are always enough PVs to satisfy their PVC requirements. I have also seen people use dynamic volume provisioning, so that the provider automatically creates PVs to satisfy the PVC requirements; hopefully that works correctly with autoscaling.
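One way to express that zonal constraint with dynamic provisioning is a StorageClass that delays binding until the Pod is scheduled; a sketch (the class name is made up, and the provisioner is provider-specific):

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: zonal-ssd                     # hypothetical name
provisioner: kubernetes.io/aws-ebs    # provider-specific; shown only as an example
volumeBindingMode: WaitForFirstConsumer   # create/bind the PV in the zone where
                                          # the consuming Pod actually lands
```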
Participant 1: That's actually a storage class created automatically by AWS for us, so we can't control it that much.
Zhang: That part is a bit provider-specific, because it usually depends on how things can be provisioned in that environment. I have seen very different behaviors in different environments. For AWS, I actually don't know whether there is any automation people have developed. We consider it orthogonal to the Kubernetes autoscaling mechanism because it is very tied to how things get provisioned in the environment.
Participant 2: Do you suggest running Kubernetes in a VM environment, or should it be run on big physical machines?
Zhang: Kubernetes supports both, and it really depends on your requirements and what matters most to you. If you run it in a VM environment, you can generally manage the node lifecycle more easily, because there are all kinds of automation tools to create VMs for you, and sometimes the environment can even create VMs of the right size dynamically. In terms of node lifecycle, VMs are definitely easier, but if you really care about performance or have particular isolation requirements, people have also run Kubernetes on bare metal. In terms of support, people perhaps have more experience running in VM environments, but I have also heard of people running Kubernetes on bare metal.
Moderator: One thing I want to add: some of this resource management matters less when you run on VMs, especially with a cloud provider. For example, scheduling has started to matter less on a lot of cloud providers, but if you run on bare metal, it definitely matters.
Participant 3: I think my question is a bit of a continuation of the previous one. My situation is that I have two distinct types of jobs I want to run in Kubernetes. One type is memory-intensive, the other is CPU-intensive. How do I create a Kubernetes cluster so that I can run both types of jobs efficiently? Say one of my jobs requires 200 or 500 gigs of memory to run, but one or two CPUs are fine. Another job is real-time; I really want short latency, so the job should finish as fast as possible, but memory-wise one or two gigs is fine. What is the most efficient way to set up a Kubernetes cluster so that the scheduler utilizes all the nodes? If I schedule all my real-time jobs to one node, the CPU is all used up, but I have tons of memory sitting there wasted, unused.
Zhang: If one workload is very memory-intensive and the other is very CPU-intensive, you should be able to fit them on the same node; that's actually the perfect case for sharing CPU and memory on one node. If there is a tricky part, for example your memory-intensive workload interferes with your CPU-intensive workload, you may want to consider reserving exclusive cores for the CPU-intensive workload, or even creating separate node groups for the two workloads. You can use node labels to make sure each is scheduled to the node group reserved for it; a sketch follows below. In terms of best practice, in general you should be able to run them together on the same node, because we have good isolation mechanisms for CPU and memory; they have been supported in the Linux kernel for many years. If you have very strict requirements and you have seen interference, then perhaps you want to avoid sharing.
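A sketch of the node-group approach, with a made-up label and numbers:

```yaml
# Operator labels the node pool, e.g.: kubectl label node <node> workload-class=cpu-bound
apiVersion: v1
kind: Pod
metadata:
  name: realtime-job                  # hypothetical name
spec:
  nodeSelector:
    workload-class: cpu-bound         # hypothetical label on the reserved node group
  containers:
  - name: worker
    image: registry.example.com/cpu-worker:1.0   # placeholder image
    resources:
      requests: {cpu: "2", memory: "2Gi"}
      limits:   {cpu: "2", memory: "2Gi"}   # equal integer CPU also qualifies for exclusive
                                            # cores when the static CPU manager is enabled
```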