InfoQ recently spoke with Mohamed Ahmed, the co-founder and CEO of Magalix, a Kubernetes optimization company, to discuss the critical discipline of capacity management across cloud-native infrastructure and applications.
Capacity management is a critical discipline for companies that want to run reliable and efficient cloud-native infrastructure. Excellent cloud-native application design should declare any specific resources it needs to operate correctly; however, to achieve maximum performance and availability, engineers must find the balance between user workloads, the application architecture, and the underlying cloud infrastructure. Building a shared picture of effective Kubernetes and application capacity management can be hard, and it requires full team engagement to balance performance against resources and cost.
InfoQ: What typical problems do you encounter in Kubernetes capacity management?
Mohamed Ahmed: Capacity management is like securing your infrastructure and applications. The first problem is waiting until it becomes an issue and then acting in a reactive way. Poor capacity management causes bad application performance, live site incidents (LSIs), and a soaring monthly cloud bill, all of which put teams into firefighting mode. Leaders and engineers should set the right KPIs and priorities to tackle each dimension of capacity management proactively and get out of this vicious cycle.
Teams may also lack a common view about capacity management. For example, developers may look at microservices and ignore or not fully understand the limits of their infrastructure. It’s also easy to focus on one set of metrics without considering the impact on the rest of the system. On the other hand, we also see engineers causing application downtime by changing resource allocations without considering the impact of these changes on the application’s performance and usage patterns.
InfoQ: What are some indicators of poor capacity management?
Ahmed: So, how do you know if you have room to improve how your team manages the capacity of your Kubernetes clusters? I have broken it down in the table below into the three areas that any organization should keep an eye on. To accurately assess your team’s effectiveness, answer these questions:
- How frequently does your team get these triggers?
- How much of your team’s time is spent reacting to these triggers?
- Do you have a few team members always acting on these triggers, or is it distributed across the whole team?
InfoQ: How can engineers optimize their Kubernetes clusters upfront?
Ahmed: If workloads are not highly variable, or teams are fine with a relatively large CPU/memory buffer, they can create a simple resource allocation sheet as I explained in this article, but optimizing Kubernetes resources is generally a continuous effort.
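For illustration, such an allocation sheet could also be captured directly in the cluster as a per-namespace ResourceQuota and LimitRange; the namespace, names, and values below are purely hypothetical and only sketch the idea:

```yaml
# Hypothetical per-team namespace budget; names and values are illustrative only.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-a-quota
  namespace: team-a
spec:
  hard:
    requests.cpu: "8"        # total CPU the team's pods may request
    requests.memory: 16Gi    # total memory the team's pods may request
    limits.cpu: "16"
    limits.memory: 32Gi
---
apiVersion: v1
kind: LimitRange
metadata:
  name: team-a-defaults
  namespace: team-a
spec:
  limits:
  - type: Container
    defaultRequest:          # applied when a container omits requests
      cpu: 250m
      memory: 256Mi
    default:                 # applied when a container omits limits
      cpu: 500m
      memory: 512Mi
```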
Workloads change all the time, and frequent code updates change CPU/memory requirements. Updating cluster configurations and the scheduling of containers also changes the overall utilization of the infrastructure, so this is not a fire-and-forget effort. The moment SREs [site reliability engineers] or developers optimize cluster resources, their clusters start slipping into either an under- or over-utilized state.
In the case of under-utilization, teams end up spending too much on their cloud infrastructure for the value they get. In an over-utilization situation, applications are no longer reliable or performant. Think of the optimization process as a budgeting exercise: you have resources that you would like to allocate. If teams want to keep their clusters optimized more continuously, Magalix has a free offering to automate capacity management for Kubernetes clusters.
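As one vendor-neutral illustration of automating this continuous re-tuning, the Kubernetes Vertical Pod Autoscaler add-on can recompute request values from observed usage. The object below is a minimal, hypothetical sketch and assumes the VPA add-on is installed in the cluster and the workload name is `api-server`:

```yaml
# Minimal, hypothetical VPA object; requires the Vertical Pod Autoscaler add-on
# (autoscaling.k8s.io) to be installed in the cluster.
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: api-server-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server          # hypothetical workload name
  updatePolicy:
    updateMode: "Auto"        # VPA evicts and recreates pods with updated requests
  resourcePolicy:
    containerPolicies:
    - containerName: "*"
      minAllowed:             # guard rails keeping recommendations in a sane band
        cpu: 100m
        memory: 128Mi
      maxAllowed:
        cpu: "2"
        memory: 2Gi
```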
InfoQ: What are the potential pitfalls of specifying Kubernetes resource requests and limits?
Ahmed: The major pitfall is treating requests and limits like any other configuration item in pod spec files. They should be updated frequently based on the factors I mentioned earlier; this can be as often as every few hours or every one to two weeks. Setting the request value guarantees minimum resources for your containers, but if the value is set too high, your pods are effectively over-provisioning resources from your virtual machine (VM). If the value is set too low, you risk not having enough resources available for your container to run reliably.
Setting limits guarantees that pods won’t monopolize VM resources and starve others. Not setting the limit value puts your infrastructure and application at risk of pods being evicted or killed due to resource starvation. Setting the limits too low can cause stability or significant performance issues: if your container reaches its memory limit, the operating system (OS) kills it automatically, which may send the container into a crash loop. If the container’s CPU limit is too low, the OS throttles the container and your application suffers a significant slowdown, even if the VM has spare CPU cycles the container could use.
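To make the trade-off concrete, here is a minimal, hypothetical container spec showing where requests and limits live; the image name and values are illustrative and would need to be derived from observed usage:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: api-server
spec:
  containers:
  - name: api-server
    image: example/api-server:1.0   # hypothetical image
    resources:
      requests:                     # guaranteed minimum; used by the scheduler for placement
        cpu: 250m
        memory: 256Mi
      limits:                       # hard ceiling for the container
        cpu: 500m                   # exceeding this leads to CPU throttling
        memory: 512Mi               # exceeding this leads to an OOM kill
```

A common starting point is to set requests from observed steady-state usage and limits somewhat above observed peaks, then revisit both as usage patterns change.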