Six Tips for Running Scalable Workloads on Kubernetes

Key Takeaways

  • Kubernetes offers lots of tools to help make applications scalable and fault tolerant
  • Setting resource requests for pods is important
  • Use affinities to spread your apps across nodes and availability zones
  • Add pod disruption budgets to allow cluster administrators to maintain clusters without breaking your apps
  • Autoscaling of pods in Kubernetes means apps should always be available as demand increases

As an infrastructure engineer at Pusher, I work with Kubernetes on a daily basis. And while I’ve not been actively developing applications to run on Kubernetes, I have been configuring deployments for many applications. I’ve learned what Kubernetes expects for workloads and how to make them as tolerant as possible.

This article is all about ensuring Kubernetes knows what is happening with your deployment: where best to schedule it, when it is ready to serve requests, and how to spread its work across as many nodes as possible. I will also discuss pod disruption budgets and horizontal pod autoscalers, which I’ve found get overlooked all too often.

I work on the Infrastructure team at Pusher. I like to describe what my team does as “providing a PaaS for the SaaS”. The team of three primarily works on building a platform using Kubernetes for our developers to build, test and host new products.

In a recent project, I cut the costs of our EC2 instances by moving our Kubernetes workers onto spot instances. If you aren’t familiar with spot instances, they are the same as other AWS EC2 instances, but you pay a lower price for them on the proviso that you could get a 2-minute warning that you are losing the instance.

One of the requirements for the project was to ensure that our Kubernetes clusters were tolerant to losing nodes at such short notice. This also then extended to the monitoring, alerting and cluster components (kube-dns, metrics) managed by my team. The result of the project was a cluster (with cluster services) that was tolerant to node losses, but unfortunately this didn’t extend to our product workloads.

Shortly after moving our clusters onto spot instances, an engineer came to me to try and minimise any potential downtime for his application:

Is there any way to stop [my pod] scheduling on the spot instances? (paraphrased)

To me, this is the wrong approach. Instead of trying to avoid node failures, the engineer should be taking advantage of Kubernetes and the tools it offers to help ensure applications are scalable and fault tolerant.

How can Kubernetes help me?

Kubernetes has many features to help application developers ensure that their applications are deployed in a highly available, scalable and fault tolerant way. When deploying a horizontally scalable application to Kubernetes, you’ll want to ensure you have configured the following:

  • Resource requests and limits
  • Node and pod affinities and anti-affinities
  • Health checks
  • Deployment strategies
  • Pod disruption budgets
  • Horizontal pod autoscalers

In the following sections, I will describe the part that each of these concepts plays in making Kubernetes workloads both scalable and fault tolerant.

Resource requests and limits

Resource requests and limits tell Kubernetes how much CPU and memory you expect your application will use. Setting requests and limits on a deployment is one of the most important things you can do. Without requests, the Kubernetes scheduler cannot ensure workloads are spread across your nodes evenly and this may result in an unbalanced cluster with some nodes overcommitted and some nodes underutilised.

Having appropriate requests and limits will allow autoscalers to estimate capacity and ensure that the cluster expands (and contracts) as the demand changes.

spec:
  containers:
    - name: example
      resources:
        requests:
          cpu: 100m
          memory: 64Mi
        limits:
          cpu: 200m
          memory: 128Mi

Requests and limits are set on a per-container basis within your pod, but the scheduler takes the requests of all the containers into consideration. For example, a pod with 3 containers, each requesting 0.1 CPUs and 64Mi of memory (as in the spec example above), will have a total request of 0.3 CPUs and 192Mi of memory. So if you are running pods with multiple containers, be wary of the total requests for the pod: the higher the total, the more restricted scheduling (finding a node with available resources to match the pod’s total request) becomes.
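
To make the arithmetic concrete, here is a rough sketch of such a three-container pod. The pod name, container names and images are placeholders I have made up for illustration; only the request values come from the example above.

```yaml
# Hypothetical pod: names and images are placeholders, not real workloads.
apiVersion: v1
kind: Pod
metadata:
  name: example-multi
spec:
  containers:
    - name: app
      image: example/app:1.0
      resources:
        requests:
          cpu: 100m
          memory: 64Mi
    - name: sidecar-logs
      image: example/logs:1.0
      resources:
        requests:
          cpu: 100m
          memory: 64Mi
    - name: sidecar-metrics
      image: example/metrics:1.0
      resources:
        requests:
          cpu: 100m
          memory: 64Mi
# The scheduler sums the container requests: 3 x 100m = 300m CPU and
# 3 x 64Mi = 192Mi memory must be free on a single node for this pod to fit.
```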

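In Kubernetes, CPU is measured in fractions of a CPU core, often written in millicores (100m is 0.1 of a core). If a container has a CPU limit of 0.3 CPUs, it will be limited to using up to 30% of one core of the processor; a request of 0.3 CPUs reserves that share for scheduling purposes but does not by itself cap usage. Memory is measured in bytes, usually written with a suffix such as Mi (mebibytes). Requests should represent a reasonable guess (a lot of this is guesswork, sorry) at what you expect your container might use during normal operations.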

Load testing can be a good way to get initial values for your requests, for example, when deploying Nginx as an ingress controller in front of a service. Imagine you are expecting 30,000 requests per second under normal load, and have 3 replicas of Nginx initially. Then, load testing a single container for 10,000 requests per second and recording its resource usage might give you a reasonable starting point for your requests per pod.

Limits, on the other hand, are hard limits. If a container reaches its CPU limit, it will be throttled; if it reaches its memory limit, it will be killed.

Limits, therefore, should be set higher than the requests and should be set such that they are only reached in exceptional circumstances. For instance, you might want to kill something deliberately if there was a memory leak to stop it crashing your entire cluster.

You must be careful here though. If a single node in the cluster starts running out of memory or CPU, pods with containers over their requests might be killed or throttled before they reach their limits. In a low-memory scenario, the first pod to be killed will be the one whose containers exceed their requests by the biggest margin.

With appropriate requests on all pods within a Kubernetes cluster, the scheduler can almost always guarantee that the workloads will have the resources they need for their normal operation.

With appropriate limits set on all containers within a Kubernetes cluster, the system can ensure that no single pod starts hogging all the resources, nor can it affect the running of other workloads. Perhaps even more importantly, no single pod will be able to bring down an entire node, since memory consumption is limited.

Node and Pod Affinities and Anti-Affinities

Affinities and anti-affinities are another kind of hint that the scheduler can use when trying to find the best node to assign your pod to. They allow you to spread or confine your workload based on a set of rules or conditions.

At present, there are two kinds of affinity/anti-affinity: required and preferred. Required affinities will stop a pod from scheduling if the affinity rules cannot be met by any node. With preferred affinities, on the other hand, a pod will still be scheduled even if no node was found that matches the affinity rules.

The combination of these two types of affinity allows you to place your pods according to your specific requirements, for example:
 

    If possible, run my pod in an availability zone without other pods labelled app=nginx-ingress-controller.

or

    Only run my pods on nodes with a GPU.

Below is an example of an anti-affinity. The pod anti-affinity below ensures that pods with the label app=nginx-ingress-controller are scheduled in different availability zones. In a scenario where you have nodes in 3 different zones in a cluster and you want to run 4 of these pods, this rule would then stop the 4th pod from scheduling.

spec:
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: app
            operator: In
            values:
            - nginx-ingress-controller
        # Newer clusters label zones as topology.kubernetes.io/zone; the beta label below is deprecated.
        topologyKey: failure-domain.beta.kubernetes.io/zone

If we changed the requiredDuringSchedulingIgnoredDuringExecution rule to a preferredDuringSchedulingIgnoredDuringExecution rule (which also requires wrapping the term in a podAffinityTerm with a weight), that would tell the scheduler to spread the pods across the availability zones until that’s no longer possible. With the preferred rule, when a 4th or 5th pod is scheduled, they will still be scheduled and will be balanced across the availability zones as well.
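
As a sketch of what the preferred form could look like, here is the same anti-affinity rewritten as a soft rule. The weight of 100 is an arbitrary value I have picked for illustration (valid weights range from 1 to 100); the selector and topology key are carried over from the required rule above.

```yaml
spec:
  affinity:
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      # Preferred rules are a list of weighted terms rather than bare terms.
      - weight: 100  # illustrative value; higher weights count for more when scoring nodes
        podAffinityTerm:
          labelSelector:
            matchExpressions:
            - key: app
              operator: In
              values:
              - nginx-ingress-controller
          # Same zone label as the required example; newer clusters use
          # topology.kubernetes.io/zone instead.
          topologyKey: failure-domain.beta.kubernetes.io/zone
```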

The topologyKey field in the affinity specification is based on the labels on your nodes. Node labels can also be used with node affinities to ensure that pods are only scheduled on nodes with certain storage types, or nodes which have GPUs. You can also prefer that they are scheduled on one type of node over another or even a node with a particular hostname.
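
A minimal sketch of a node affinity along these lines is shown below. The label keys example.com/gpu and example.com/storage-type are hypothetical; your nodes would need to carry whatever labels your cluster actually uses for these properties.

```yaml
spec:
  affinity:
    nodeAffinity:
      # Hard rule: only schedule onto nodes labelled as having a GPU.
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: example.com/gpu  # hypothetical label key
            operator: In
            values:
            - "true"
      # Soft rule: among those nodes, prefer the ones with SSD storage.
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 50  # illustrative value between 1 and 100
        preference:
          matchExpressions:
          - key: example.com/storage-type  # hypothetical label key
            operator: In
            values:
            - ssd
```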

If you’re interested in a deeper dive into how the scheduler manages distributing pods across nodes, check out this post from my colleague Alexandru Topliceanu.

Health Checks

Health checks come in two flavours with Kubernetes: readiness and liveness probes. A readiness probe tells Kubernetes that the pod is ready to start receiving requests (usually HTTP). A liveness probe on a pod tells Kubernetes that the pod is still running as expected.

HTTP readiness/liveness probes are very similar to traditional load balancer health checks. Their configuration normally specifies a path and port, but can also define timeouts, success/failure thresholds and initial delays. The probe passes for any response with a status code between 200 and 399.

readinessProbe:
  httpGet:
    path: /healthz
    port: 10254
    scheme: HTTP
  initialDelaySeconds: 10
  timeoutSeconds: 5
livenessProbe:
  httpGet:
    path: /healthz
    port: 10254
    scheme: HTTP
  initialDelaySeconds: 10
  timeoutSeconds: 5

Kubernetes uses liveness probes to determine whether the container is healthy. Should a liveness probe start failing while the pod is running, Kubernetes will restart the container in accordance with the pod’s restart policy. Every pod should have a liveness probe if possible so that Kubernetes can tell whether the application is working as expected or not.

Readiness probes are for containers that expect to be serving requests; these will typically have a service receiving requests in front of them. A liveness probe and a readiness probe may well be the same thing in certain cases. However, in the case where your container may start up and have to process some data or do some calculation before serving requests, the readiness probe tells Kubernetes when the container is ready to be registered with the service and receive requests from the outside world.

While both of these probes are often HTTP callbacks, Kubernetes also supports TCP and Exec callbacks. TCP probes check that a socket is open within the container and Exec probes execute a command within the container, expecting a 0 exit code:

livenessProbe:
  exec:
    command:
    - cat
    - /tmp/healthy
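
For completeness, a TCP probe might look something like the sketch below, which also sets the period and threshold options mentioned earlier. The port and all of the timing values are assumptions chosen for illustration, not recommendations.

```yaml
readinessProbe:
  tcpSocket:
    port: 5432             # hypothetical port the container listens on
  initialDelaySeconds: 5   # wait before the first probe
  periodSeconds: 10        # probe every 10 seconds
  timeoutSeconds: 1        # each attempt must connect within 1 second
  successThreshold: 1      # one success marks the container ready
  failureThreshold: 3      # three consecutive failures mark it not ready
```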

These checks, when set up correctly, help ensure that requests to a service always hit a container that is in a state capable of processing that request. The probes are also used in other core Kubernetes functions when autoscaling and performing rolling updates. In the remainder of this article, when I refer to a pod being ready, that means its readiness probe is passing.

Deployment Strategies

Deployment strategies determine how Kubernetes replaces running pods when you want to update their configuration (changing the image tag for instance). There are currently two kinds of strategies: recreate and rolling update.

The “recreate” strategy kills all pods managed by the deployment before bringing up new ones. This might sound dangerous, and it is in most scenarios. The intended use of this strategy is for preventing two different versions running in parallel. Realistically, this means it is used for databases or applications where the running replica must shut down before the new instance is launched.

The “rolling update” strategy, on the other hand, runs both the old and the new configurations side by side. It has two configuration options: maxUnavailable and maxSurge. They define how many pods Kubernetes can remove and how many extra pods Kubernetes can add as it starts a rolling update.

They can both be set as an absolute number or a percentage. The default for both values is 25%. When maxUnavailable is set to a percentage and a rolling update is in progress, Kubernetes will calculate (rounding down) how many replicas it can terminate to allow new replicas to come up. When maxSurge is set to a percentage, Kubernetes will calculate (rounding up) how many extra replicas it can add during the update process.

For example, in a deployment with 6 replicas: when there’s an update to the image tag and the rolling update strategy has the default configuration, Kubernetes will terminate 1 instance (6 instances * 0.25 = 1.5, rounded down = 1) and then introduce a new replicaset with 3 new instances (6 instances * 0.25 = 1.5, rounded up = 2, plus 1 instance to compensate for the one terminated = 3), giving a total of 8 running replicas. Once the new pods become ready, it will then terminate 2 more instances from the old replicaset to bring the deployment back to the desired replica count, before repeating this process until the deployment is finished.

Using this method, in the case that a deployment fails (that means a liveness or readiness probe failed on the new replicaset), the rolling update will stop and your running workload remains with the old replicaset, albeit potentially slightly smaller if the rolling update had already removed some old instances.

You can configure maxSurge and maxUnavailable based on your needs. You can set either of them to zero if you so desire (although they can’t both be zero simultaneously), which would result in you never running more than your desired replica count or never running fewer than your desired replica count respectively.
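
Putting this together, a deployment with an explicit rolling update strategy could be sketched as follows. The deployment name, label and image are placeholders; the strategy values are simply the defaults written out, with the 6-replica arithmetic in the comments.

```yaml
# Hypothetical deployment: name, label and image are placeholders.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example
spec:
  replicas: 6
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 25%  # 6 * 0.25 = 1.5, rounded down to 1 pod
      maxSurge: 25%        # 6 * 0.25 = 1.5, rounded up to 2 pods
  selector:
    matchLabels:
      app: example
  template:
    metadata:
      labels:
        app: example
    spec:
      containers:
        - name: example
          image: example/app:1.0
```

Setting maxSurge to 0 here would cap the deployment at 6 pods during an update, while setting maxUnavailable to 0 would keep at least 6 pods available throughout.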

Pod Disruption Budgets

Disruptions in Kubernetes clusters are almost inevitable. The days of pet VMs that have a 2-year uptime are behind us and today’s VMs tend to be more like cattle, disappearing at a moment’s notice.

There is an almost endless list of reasons why a node in Kubernetes might disappear. Here are just a few we have seen:

  • a spot instance on AWS gets taken out of service
  • your infrastructure team wants to apply a new config and they start replacing nodes within a cluster
  • an autoscaler discovers your node is underutilised and removes it

The job of the pod disruption budget is to ensure that you always have a minimum number of ready pods for your deployment. A pod disruption budget allows you to specify either the minAvailable or maxUnavailable number of replicas within a deployment. These values can be given as a percentage of the desired replica count or as an absolute number.

By configuring a pod disruption budget for your deployment, Kubernetes can start rejecting voluntary disruptions. Voluntary disruptions are those caused by the Eviction API (for example, a cluster administrator draining nodes in the cluster).

With a pod disruption budget of maxUnavailable: 1, the first of these drain attempts will be able to evict the first of your pods. Then, while this pod is being rescheduled and waiting to become ready, all requests to evict other pods in your deployment will fail. Applications that use the Eviction API, such as kubectl drain, are expected to retry eviction attempts until they are successful or the application times out, so both your administrator and your developers can achieve their goals without affecting each other.

While the pod disruption budget cannot protect you from involuntary disruptions, it can ensure that you don’t reduce capacity further during these events, halting any attempt to drain a node or move pods until new replicas are scheduled and ready.
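
A minimal pod disruption budget matching the maxUnavailable: 1 scenario above might look like this sketch. The name and the app=example label are placeholders, and I am assuming a cluster recent enough to serve the policy/v1 API; older clusters exposed the same resource under policy/v1beta1.

```yaml
# Hypothetical budget: name and selector labels are placeholders.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: example
spec:
  maxUnavailable: 1   # could instead be minAvailable, or a percentage such as 25%
  selector:
    matchLabels:
      app: example    # must match the labels on the deployment's pods
```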

Horizontal Pod Autoscalers

Finally, I would like to touch on Horizontal Pod Autoscalers. One of the best things about the Kubernetes control plane is the ability to scale your applications based on their resource utilization. As your pods become busier, it can automatically bring up new replicas to share the load.

The controller periodically looks at metrics provided by metrics-server (or heapster for Kubernetes versions 1.8 and earlier) and, based on the load of the pods, scales up or scales down the deployment as necessary.

When configuring a horizontal pod autoscaler, you can scale based on CPU and memory usage. But with custom metrics servers you could extend the metrics available and perhaps even scale based on, for example, requests per second to a service.

Once you have chosen which metric to scale on (default or custom), you can define your targetAverageUtilization for resources or targetValue for custom metrics. This is your desired state.

The resource values are based on the requests set for the deployment. If you set the CPU targetAverageUtilization to 70%, the autoscaler will try to keep the average CPU utilization across the pods at 70% of their requested CPU value. Additionally, you must set the range of replicas you would like the deployment to have: a minimum and a maximum number of replicas that you think your deployment will need.
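
As a rough sketch, an autoscaler targeting 70% CPU utilization could be written as below. The names and replica range are placeholders. Note that I have assumed the autoscaling/v2 API here, in which the targetAverageUtilization setting described above is expressed as target.averageUtilization; clusters of the era this article describes used the older field name under a beta API version.

```yaml
# Hypothetical autoscaler: names and the replica range are placeholders.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: example
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: example
  minReplicas: 3    # illustrative lower bound
  maxReplicas: 10   # illustrative upper bound
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70  # percent of each pod's requested CPU
```

Because the target is a percentage of the requested CPU, this only works if the deployment’s containers set CPU requests.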

Conclusion

By applying the configuration options discussed above, you can leverage everything Kubernetes has to offer for making your applications as redundant and available as possible. While not everything I have discussed will apply to every application, I do strongly urge you to start setting appropriate requests and limits, as well as health checks where you can.

About the Author

Joel Speed is a DevOps engineer who has been working with Kubernetes for the last year. He has been working in software development for over 3 years and is currently helping Pusher build their internal Kubernetes Platform. Recently he has been focusing on projects to improve autoscaling, resilience, authentication and authorisation within Kubernetes as well as building a ChatOps bot, Marvin, for Pusher’s engineering team. While studying, he was heavily involved in the Warwick Student Cinema, containerizing their infrastructure as well as regularly projecting films.
