
How the Adidas Platform Team Reduced the Cost of Running Kubernetes Clusters

In a recent Medium post, platform engineer Iya Lang disclosed how adidas reduced the costs of running Kubernetes clusters in AWS by up to 50%.

The multi-pronged approach the adidas team took can be useful for platform engineering teams in many other organizations, as a recent CNCF report stated that Kubernetes has driven cloud spending up for 49% of respondents.

The first measure introduced by the team focused on lowering EC2 instance costs. To achieve this, they implemented Karpenter, an AWS-developed cluster autoscaler that adjusts node counts based on application demand. Karpenter:

  • Dynamically provisions compute resources (EC2 instances) based on real-time pod scheduling needs. This ensures a cluster has the right nodes at the right time to handle application workloads.
  • Optimizes cluster resource utilization by:
    • Launching only the necessary instance types to meet pod requirements.
    • Identifying opportunities to remove under-utilized nodes.
    • Replacing expensive instances with more cost-effective options when possible, including spot instances (unused AWS compute capacity available at a lower cost) chosen for low price and minimal interruption risk.
    • Consolidating workloads onto more efficient computing resources.
  • Integrates seamlessly with existing Kubernetes workflows. You can configure various aspects of its behavior, including:
    • The types of EC2 instances used for provisioning.
    • Launch template specifications for node configuration.
    • Scaling policies to tailor resource allocation to specific needs.

Karpenter currently supports only AWS, but there are plans to include other cloud providers.
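
The idea of launching only the instance types that pod requirements call for can be illustrated with a short Python sketch. This is not Karpenter's actual algorithm, which also weighs consolidation, spot interruption risk, and scheduling constraints. The instance names, sizes, and prices below are made up for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class InstanceType:
    name: str             # hypothetical name, not a real EC2 instance type
    cpu_millicores: int
    memory_mib: int
    hourly_price: float   # made-up price, for illustration only


def cheapest_fitting(cpu_millicores: int, memory_mib: int,
                     catalog: list[InstanceType]) -> Optional[InstanceType]:
    """Return the cheapest instance type that fits the requested resources."""
    candidates = [
        it for it in catalog
        if it.cpu_millicores >= cpu_millicores and it.memory_mib >= memory_mib
    ]
    # default=None covers the case where nothing in the catalog is big enough.
    return min(candidates, key=lambda it: it.hourly_price, default=None)


catalog = [
    InstanceType("small", 2000, 4096, 0.05),
    InstanceType("medium", 4000, 8192, 0.10),
    InstanceType("large", 8000, 16384, 0.20),
]

# Pending pods that could not be scheduled, as (CPU millicores, memory MiB).
pending = [(500, 1024), (1500, 2048), (1000, 2048)]
total_cpu = sum(cpu for cpu, _ in pending)
total_mem = sum(mem for _, mem in pending)

choice = cheapest_fitting(total_cpu, total_mem, catalog)
print(choice.name if choice else "no instance type fits")
```

Summing the pending requests and picking one node is a deliberate simplification. A real autoscaler bin-packs pods across several nodes and accounts for system overhead on each of them.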

The second measure introduced by the adidas team was the automatic creation of Vertical Pod Autoscalers (VPAs) to improve resource utilization. The platform team generated a default VPA for every workload in the development and staging clusters. To do this, adidas chose Kyverno, a policy tool typically used for application security.

Kyverno is a policy engine that operates as a dynamic admission controller within a Kubernetes cluster. It handles validating and mutating admission webhook HTTP callbacks from the Kubernetes API server, applying relevant policies to enforce or reject admission requests. Kyverno policies can target resources based on various criteria, including resource kind, name, label selectors, etc. Mutating policies can be specified using overlays (similar to Kustomize) or JSON Patch formats. Validating policies use an overlay syntax and support pattern matching and conditional (if-then-else) logic. Policy enforcement results are recorded as Kubernetes events. For requests that are allowed or predate implementing a Kyverno policy, Kyverno generates Policy Reports. These reports provide a running list of resources matched by the policy, their statuses, and additional details.

Figure: Kyverno architecture

The adidas team configured the Kyverno policies to:

  1. Check if the resource has a Horizontal Pod Autoscaler (HPA) or VPA.
  2. Verify if automatic VPA creation is permitted for the resource and its namespace.
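
The two checks can be expressed as a small decision function. The sketch below is plain Python rather than a Kyverno policy, and it assumes an opt-out label named `example.com/auto-vpa`, which is hypothetical because the article does not say how permission is recorded. It also assumes all objects passed in belong to the same namespace.

```python
# Hypothetical opt-out label; the article does not name the real mechanism.
AUTO_VPA_LABEL = "example.com/auto-vpa"


def autoscaler_targets(autoscalers: list[dict]) -> set[tuple[str, str]]:
    """Collect the (kind, name) pairs already targeted by an HPA or a VPA."""
    targets = set()
    for obj in autoscalers:
        spec = obj.get("spec", {})
        # HPAs reference their workload in scaleTargetRef, VPAs in targetRef.
        ref = spec.get("scaleTargetRef") or spec.get("targetRef")
        if ref:
            targets.add((ref["kind"], ref["name"]))
    return targets


def should_create_vpa(workload: dict, namespace_labels: dict,
                      autoscalers: list[dict]) -> bool:
    """Apply both checks: no existing autoscaler, and no opt-out in effect."""
    key = (workload["kind"], workload["metadata"]["name"])
    if key in autoscaler_targets(autoscalers):
        return False
    workload_labels = workload["metadata"].get("labels", {})
    # Either the namespace or the workload itself may opt out (assumption).
    for labels in (namespace_labels, workload_labels):
        if labels.get(AUTO_VPA_LABEL) == "disabled":
            return False
    return True


deployment = {"kind": "Deployment", "metadata": {"name": "shop-api"}}
hpa = {
    "kind": "HorizontalPodAutoscaler",
    "spec": {"scaleTargetRef": {"kind": "Deployment", "name": "shop-api"}},
}

print(should_create_vpa(deployment, {}, []))      # no autoscaler, no opt-out
print(should_create_vpa(deployment, {}, [hpa]))   # an HPA already targets it
```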

The third measure introduced was setting default VPA values. Configuring VPAs without prior knowledge of the applications posed a challenge. The adidas team decided to control only resource requests to prevent application disruptions during usage spikes. They set minimum allowed values to very low levels (e.g., 10 millicores for CPU and 32 megabytes for memory) and set maximum values based on original requests or limits to ensure stability. For applications with multiple containers, the team avoided maxAllowed values to prevent potential issues.
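
A minimal Python sketch of how such defaults might be derived from a Deployment follows. It builds a manifest as a dictionary and is not the team's Kyverno policy. Two details are assumptions: memory is written as `32Mi`, and each maxAllowed value is taken from the limit when one is set and from the request otherwise, since the article only says "requests or limits".

```python
def default_vpa(deployment: dict) -> dict:
    """Build a default VPA manifest (as a dict) for a Deployment."""
    name = deployment["metadata"]["name"]
    containers = deployment["spec"]["template"]["spec"]["containers"]

    policy = {
        "containerName": "*",                 # VPA wildcard for all containers
        "controlledValues": "RequestsOnly",   # leave limits untouched
        "minAllowed": {"cpu": "10m", "memory": "32Mi"},
    }

    # maxAllowed is set only for single-container workloads.
    if len(containers) == 1:
        resources = containers[0].get("resources", {})
        # Assumption: a limit, when present, wins over the request.
        ceiling = {**resources.get("requests", {}), **resources.get("limits", {})}
        if ceiling:
            policy["maxAllowed"] = ceiling

    return {
        "apiVersion": "autoscaling.k8s.io/v1",
        "kind": "VerticalPodAutoscaler",
        "metadata": {"name": name},
        "spec": {
            "targetRef": {"apiVersion": "apps/v1", "kind": "Deployment", "name": name},
            "resourcePolicy": {"containerPolicies": [policy]},
        },
    }


deployment = {
    "metadata": {"name": "shop-api"},
    "spec": {"template": {"spec": {"containers": [{
        "name": "app",
        "resources": {
            "requests": {"cpu": "500m", "memory": "256Mi"},
            "limits": {"memory": "512Mi"},
        },
    }]}}},
}

vpa = default_vpa(deployment)
print(vpa["spec"]["resourcePolicy"]["containerPolicies"][0])
```

With a ceiling in place, the recommender can shrink an over-provisioned workload but cannot grow it past what its owners originally asked for, which keeps the default conservative.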

Implementing default VPAs resulted in a 30% reduction in CPU and memory usage across development and staging clusters. However, some limitations exist:

  • VPAs cannot work with HPAs using resource metrics.
  • Older Java applications might not benefit due to fixed heap sizes.
  • Certain applications require uninterrupted operation, necessitating an opt-out option.

Figure: CPU and memory usage after the VPA creation on a big cluster

The adidas team also aimed to reduce their CO2 footprint and save money by scaling down resources during non-office hours. For this they used kube-downscaler, a tool that adjusts replica counts on a predefined schedule and can be customized for specific applications.
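
The schedule logic can be sketched in a few lines of Python. This is an illustration of the concept, not kube-downscaler's code or configuration format. The office hours (weekdays, 07:30 to 20:30), the use of UTC, and scaling to zero replicas are all assumptions, since the article does not give the schedule adidas used.

```python
from datetime import datetime, time, timezone


def desired_replicas(now: datetime, original_replicas: int,
                     start: time = time(7, 30), end: time = time(20, 30),
                     downtime_replicas: int = 0) -> int:
    """Return the replica count a workload should have at the given moment."""
    # Assumed schedule: Monday to Friday, start to end, evaluated in UTC.
    local = now.astimezone(timezone.utc)
    is_weekday = local.weekday() < 5          # Monday is 0, Sunday is 6
    in_window = start <= local.time() < end
    return original_replicas if (is_weekday and in_window) else downtime_replicas


wednesday_morning = datetime(2024, 6, 5, 10, 0, tzinfo=timezone.utc)
wednesday_night = datetime(2024, 6, 5, 23, 0, tzinfo=timezone.utc)
saturday_morning = datetime(2024, 6, 8, 10, 0, tzinfo=timezone.utc)

for moment in (wednesday_morning, wednesday_night, saturday_morning):
    print(moment.isoformat(), desired_replicas(moment, original_replicas=3))
```

The function takes the current time as a parameter instead of reading the clock, which makes the schedule easy to test.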

After implementing all of these measures, the team faced the problem of underutilized nodes. To address the issue, they implemented some Kyverno policies to prevent problematic Pod Disruption Budget (PDB) configurations that hinder node removal. A cleanup policy was also established to remove invalid PDBs periodically.

The adidas team implemented the cost optimization measures described for the non-production clusters, with PDB policies applied across all environments. This implementation led to a 50% reduction in monthly costs for development and staging clusters. They adopted an opt-in model for production clusters, allowing application teams to choose their tools and configurations.

The adidas team shared some key considerations for successful cost optimization:

  • Ensuring sufficient node capacity to handle increased pod density.
  • Setting appropriate VPA configuration values to balance cost savings and application performance.
  • Informing users about changes in advance so that the changes do not lead to incidents.
  • Maintaining comprehensive monitoring to measure impact.

The team also acknowledged that cost optimization is an ongoing process requiring continuous adjustments.

Additional examples of organizations attempting to reduce cloud costs can be found on Reddit, e.g. "Reducing Cloud Costs on Kubernetes Dev Envs by Over 95%" and "How to reduce the AWS costs?"

Application optimization can also reduce cloud costs and improve sustainability. Erik Peterson provided guidance about this at QCon SF and wrote a related article for InfoQ, "Million Dollar Lines of Code—an Engineering Perspective on Cloud Cost Optimization."
