Kubernetes Content on InfoQ
-
OpenEverest: Open Source Platform for Database Automation
Percona recently announced OpenEverest, an open-source platform for automated database provisioning and management that supports multiple database technologies. Launched initially as Percona Everest, OpenEverest can be hosted on any Kubernetes infrastructure, whether in the cloud or on-premises.
-
NVIDIA Dynamo Planner Brings SLO-Driven Automation to Multi-Node LLM Inference
Microsoft and NVIDIA have released Part 2 of their collaboration on running NVIDIA Dynamo for large language model inference on Azure Kubernetes Service (AKS). The first announcement aimed for a raw throughput of 1.2 million tokens per second on distributed GPU systems.
-
Salesforce Migrates 1,000+ EKS Clusters to Karpenter to Improve Scaling Speed and Efficiency
Salesforce has completed a phased migration of more than 1,000 Amazon Elastic Kubernetes Service (EKS) clusters from the Kubernetes Cluster Autoscaler to Karpenter, AWS’s open-source node-provisioning and autoscaling solution.
-
Pinterest's Moka: How Kubernetes Is Rewriting the Rules of Big Data Processing
Digital pinboard provider Pinterest has published an article explaining its blueprint for the future of large-scale data processing with its new platform Moka. The company is moving core workloads from ageing Hadoop infrastructure to a Kubernetes-based system on Amazon EKS, with Apache Spark as the main engine and support for other frameworks on the way.
-
Docker Kanvas Challenges Helm and Kustomize for Kubernetes Dominance
Docker has launched Kanvas, a new platform designed to bridge the gap between local development and cloud production. By automating the conversion of Docker Compose files into Kubernetes artefacts, the tool challenges established solutions like Helm and Kustomize. Developed with Layer5, it marks a shift toward Infrastructure as Code, offering visualisations to simplify cloud-native deployments.
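To make the mapping concrete, the Go sketch below hand-builds the Deployment that a simple Compose service (an nginx container exposing port 80) roughly translates to, using the standard k8s.io/api types and rendering it as YAML. It illustrates the kind of artefact such a conversion produces; it is not actual Kanvas output.

    package main

    import (
        "fmt"

        appsv1 "k8s.io/api/apps/v1"
        corev1 "k8s.io/api/core/v1"
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "sigs.k8s.io/yaml"
    )

    func main() {
        // A Compose service like `web: {image: nginx:1.27, ports: ["8080:80"]}`
        // roughly corresponds to a Deployment (plus a Service, omitted here).
        replicas := int32(1)
        deploy := appsv1.Deployment{
            TypeMeta:   metav1.TypeMeta{APIVersion: "apps/v1", Kind: "Deployment"},
            ObjectMeta: metav1.ObjectMeta{Name: "web"},
            Spec: appsv1.DeploymentSpec{
                Replicas: &replicas,
                Selector: &metav1.LabelSelector{MatchLabels: map[string]string{"app": "web"}},
                Template: corev1.PodTemplateSpec{
                    ObjectMeta: metav1.ObjectMeta{Labels: map[string]string{"app": "web"}},
                    Spec: corev1.PodSpec{
                        Containers: []corev1.Container{{
                            Name:  "web",
                            Image: "nginx:1.27",
                            Ports: []corev1.ContainerPort{{ContainerPort: 80}},
                        }},
                    },
                },
            },
        }

        // Render the object as the YAML manifest a user would commit to Git.
        out, err := yaml.Marshal(deploy)
        if err != nil {
            panic(err)
        }
        fmt.Println(string(out))
    }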
-
Kubernetes 1.35 Released with In-Place Pod Resize and AI-Optimized Scheduling
The Cloud Native Computing Foundation (CNCF) announced the release of Kubernetes 1.35, named "Timbernetes", emphasizing its focus on mutability and the optimization of high-performance AI/ML workloads.
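In-place resize lets a pod's CPU and memory requests and limits be changed without recreating the pod, via a dedicated resize subresource. The client-go sketch below assumes a cluster recent enough to expose that subresource, and uses an illustrative pod named web-0 with a container named app; it bumps the container's CPU allocation in place.

    package main

    import (
        "context"
        "fmt"

        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/apimachinery/pkg/types"
        "k8s.io/client-go/kubernetes"
        "k8s.io/client-go/tools/clientcmd"
    )

    func main() {
        // Build a client from the local kubeconfig (assumes ~/.kube/config points
        // at a cluster where in-place pod resize is available).
        cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
        if err != nil {
            panic(err)
        }
        clientset, err := kubernetes.NewForConfig(cfg)
        if err != nil {
            panic(err)
        }

        // Strategic-merge patch raising CPU for the container named "app";
        // pod name, namespace, and container name are illustrative.
        patch := []byte(`{"spec":{"containers":[{"name":"app",` +
            `"resources":{"requests":{"cpu":"750m"},"limits":{"cpu":"1"}}}]}}`)

        // Targeting the "resize" subresource asks the kubelet to apply the change
        // to the running pod instead of deleting and rescheduling it.
        pod, err := clientset.CoreV1().Pods("default").Patch(
            context.TODO(), "web-0", types.StrategicMergePatchType, patch,
            metav1.PatchOptions{}, "resize")
        if err != nil {
            panic(err)
        }
        fmt.Println("resized pod:", pod.Name)
    }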
-
AWS Announces New Amazon EKS Capabilities to Simplify Workload Orchestration
Amazon Web Services has launched Amazon EKS Capabilities, a set of fully managed, Kubernetes-native features designed to streamline workload orchestration, AWS cloud resource management, and Kubernetes resource composition and automation.
-
Open-Source Agent Sandbox Enables Secure Deployment of AI Agents on Kubernetes
The Agent Sandbox is an open-source Kubernetes controller that provides a declarative API for managing a single, stateful pod with stable identity and persistent storage. It is particularly well suited for creating isolated environments to execute untrusted, LLM-generated code, as well as for running other stateful workloads.
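A minimal sketch of how such a declarative API can be driven from Go is shown below, using client-go's dynamic client to create a Sandbox custom resource. The API group, version, and spec fields are illustrative assumptions made for this sketch; consult the project's published CRDs for the actual schema.

    package main

    import (
        "context"
        "fmt"

        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
        "k8s.io/apimachinery/pkg/runtime/schema"
        "k8s.io/client-go/dynamic"
        "k8s.io/client-go/tools/clientcmd"
    )

    func main() {
        cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
        if err != nil {
            panic(err)
        }
        dyn, err := dynamic.NewForConfig(cfg)
        if err != nil {
            panic(err)
        }

        // Hypothetical group/version/resource for the Sandbox CRD.
        gvr := schema.GroupVersionResource{
            Group: "agents.x-k8s.io", Version: "v1alpha1", Resource: "sandboxes",
        }

        // A Sandbox declaring a pod template for an interpreter that will run
        // untrusted, LLM-generated code; the spec field names are illustrative.
        sandbox := &unstructured.Unstructured{Object: map[string]interface{}{
            "apiVersion": "agents.x-k8s.io/v1alpha1",
            "kind":       "Sandbox",
            "metadata":   map[string]interface{}{"name": "code-exec-sandbox"},
            "spec": map[string]interface{}{
                "podTemplate": map[string]interface{}{
                    "spec": map[string]interface{}{
                        "containers": []interface{}{
                            map[string]interface{}{"name": "runtime", "image": "python:3.12-slim"},
                        },
                    },
                },
            },
        }}

        created, err := dyn.Resource(gvr).Namespace("default").
            Create(context.TODO(), sandbox, metav1.CreateOptions{})
        if err != nil {
            panic(err)
        }
        fmt.Println("created sandbox:", created.GetName())
    }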
-
CNCF Launches Certified Kubernetes AI Conformance Program to Standardise Workloads
The CNCF has launched the Certified Kubernetes AI Conformance program to standardise artificial intelligence workloads. By establishing a technical baseline for GPU management, networking, and gang scheduling, the initiative ensures portability across cloud providers. It aims to reduce technical debt and prevent vendor lock-in as enterprises move generative AI models into production.
-
Neptune Combines AI‑Assisted Infrastructure as Code and Cloud Deployments
Now available in beta, Neptune is a conversational AI agent designed to act like an AI platform engineer, handling the provisioning, wiring, and configuration of the cloud services needed to run a containerized app. Neptune is both language- and cloud-agnostic, with support for AWS, GCP, and Azure.
-
Lyft Rearchitects ML Platform with Hybrid AWS SageMaker-Kubernetes Approach
Lyft has rearchitected its machine learning platform LyftLearn into a hybrid system, moving offline workloads to AWS SageMaker while retaining Kubernetes for online model serving. Its decision to choose managed services where operational complexity was highest, while maintaining custom infrastructure where control mattered most, offers a pragmatic alternative to unified platform strategies.
-
Google Cloud Demonstrates Massive Kubernetes Scale with 130,000-Node GKE Cluster
The team behind Google Kubernetes Engine (GKE) revealed that they successfully built and operated a Kubernetes cluster with 130,000 nodes, making it the largest publicly disclosed Kubernetes cluster to date.
-
NVIDIA Dynamo Addresses Multi-Node LLM Inference Challenges
Serving large language models (LLMs) at scale is complex. Modern LLMs now exceed the memory and compute capacity of a single GPU, or even of a single multi-GPU node. As a result, inference workloads for models with 70B+ or 120B+ parameters, or for pipelines with large context windows, require multi-node, distributed GPU deployments.
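A back-of-the-envelope calculation makes the constraint concrete: in 16-bit precision, the weights of a 70B-parameter model alone occupy about 140 GB, already more than a single 80 GB GPU can hold before any KV cache or activations are counted. The figures in the sketch below are illustrative assumptions, not vendor-published numbers.

    package main

    import "fmt"

    func main() {
        // Rough memory estimate for serving a 70B-parameter model in FP16/BF16.
        const (
            params        = 70e9 // model parameters (illustrative)
            bytesPerParam = 2.0  // 16-bit (FP16/BF16) weights
            gpuMemoryGB   = 80.0 // memory of a single 80 GB-class accelerator
        )

        weightsGB := params * bytesPerParam / 1e9
        fmt.Printf("weights alone: %.0f GB vs %.0f GB on one GPU\n", weightsGB, gpuMemoryGB)
        // Prints "weights alone: 140 GB vs 80 GB on one GPU" -- and that is before
        // the KV cache and activations, which is why such models are sharded across
        // multiple GPUs and, at the largest scales, multiple nodes.
    }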
-
How Discord Scaled its ML Platform from Single-GPU Workflows to a Shared Ray Cluster
Discord has detailed how it rebuilt its machine learning platform after hitting the limits of single-GPU training. The changes enabled daily retrains for large models and contributed to a 200% uplift in a key ads ranking metric.
-
Helm Improves Kubernetes Package Management with Biggest Release in 6 Years
Helm, the Kubernetes application package manager, has officially reached version 4.0.0. Helm 4 is the first major upgrade in six years, and also marks Helm's 10th anniversary under the guidance of the Cloud Native Computing Foundation (CNCF). The update aims to address several challenges around scalability, security, and developer workflow.