Slack Develops Bedrock Operator for Kubernetes StatefulSets

Slack, the popular workplace communication platform, has developed a custom Kubernetes operator to address limitations in managing StatefulSet deployments. In an article on Slack's Engineering blog, Clément Labbe (Senior Software Engineer, Cloud) introduces the Bedrock Rollout Operator, written to offer improved control and features for deploying stateful applications in Kubernetes clusters.

Engineers commonly use StatefulSets to run applications which need persistent storage and unique pod identities. However, Slack's engineering teams found existing update strategies for StatefulSets to be lacking. The default RollingUpdate strategy, while automated, only updates one pod at a time, leading to slow deployments for applications with numerous pods. The OnDelete strategy allows manual control but lacks advanced features like percent-based rollouts.

Slack developed the Bedrock Rollout Operator using Kubebuilder to meet their internal teams' needs. This operator manages a custom resource called StatefulsetRollout, which encapsulates the StatefulSet specification along with additional parameters for enhanced functionality.

The Bedrock Rollout Operator solves several fundamental problems:

  • Slow deployments: It addresses the limitation of the default RollingUpdate strategy, which updates only one pod at a time, making it very slow for applications with many pods.
  • Lack of control: It provides more controlled rollouts than the native Kubernetes options, allowing for faster percent-based rollouts and the ability to pause rollouts.
  • Limited rollback capabilities: It enables quicker rollbacks when needed.
  • Integration gaps: It integrates with Slack's internal service discovery (Consul) and provides Slack notifications about rollout status, filling gaps in their existing workflow.
  • Customisation needs: It allows Slack to implement custom rollout logic that fits their specific requirements, which weren't met by standard Kubernetes features.
  • Visibility: It improves visibility into the rollout process through real-time Slack notifications and integration with their internal release management UI.
  • Large-scale management: Although it required some adjustments, the solution helps manage large StatefulSets with up to 1,000 pods.
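To make the idea of a percent-based rollout on top of OnDelete concrete, the following Go sketch shows one way a controller could work out which pods to restart for a given cumulative percentage. It is not Slack's implementation: the `Pod` type, `batchSize`, and `podsToRestart` are hypothetical names chosen for illustration, and rounding up is an assumption.

```go
package main

import "fmt"

// Pod is a hypothetical, simplified view of a StatefulSet pod.
type Pod struct {
	Name     string
	Revision string
}

// batchSize converts a cumulative rollout percentage into a pod count.
// It rounds up so that any non-zero percentage makes progress (an assumption).
func batchSize(replicas, percent int) int {
	if replicas <= 0 || percent <= 0 {
		return 0
	}
	if percent >= 100 {
		return replicas
	}
	return (replicas*percent + 99) / 100
}

// podsToRestart returns the names of pods that still run an old revision and
// would need to be deleted so that `percent` of the set runs the target revision.
func podsToRestart(pods []Pod, target string, percent int) []string {
	want := batchSize(len(pods), percent)

	// Count pods that are already on the target revision.
	updated := 0
	for _, p := range pods {
		if p.Revision == target {
			updated++
		}
	}

	// Pick old-revision pods until the target count is reached.
	var names []string
	for _, p := range pods {
		if updated+len(names) >= want {
			break
		}
		if p.Revision != target {
			names = append(names, p.Name)
		}
	}
	return names
}

func main() {
	pods := make([]Pod, 10)
	for i := range pods {
		pods[i] = Pod{Name: fmt.Sprintf("web-%d", i), Revision: "v1"}
	}
	// Ask for 25% of the set to run v2.
	fmt.Println(podsToRestart(pods, "v2", 25))
}
```

With ten pods and a 25% target, the sketch selects three pods. Because the percentage is cumulative, calling it again after those pods have restarted returns nothing until the target is raised, which is one simple way to model a paused rollout.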

The operator is deployed across Slack's extensive Kubernetes infrastructure, which comprises over 200 clusters and manages nearly 100 stateful services.

Slack rollout architecture

The rollout process begins with Slack engineers defining their application configuration in a bedrock.yaml file. When a developer initiates a deployment through Slack's internal release platform, the Bedrock API transforms this configuration into a StatefulsetRollout resource.

The Bedrock Rollout Operator continuously monitors the StatefulsetRollout resource and reconciles the desired state with the actual state of the cluster. To facilitate the rollout, it performs actions such as creating or updating StatefulSets and terminating pods. Rather than operating in an event-driven fashion, the operator uses a self-enqueuing reconciliation loop. This approach allows for sequential processing of custom resources, reducing the risk of race conditions and simplifying the overall reconciliation process.
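The self-enqueuing pattern can be sketched without any Kubernetes dependencies. The Go example below is an illustration under stated assumptions rather than the operator's code: the `Rollout` type, `reconcile`, and `run` are hypothetical, and a plain slice stands in for the controller's work queue. Each reconcile call takes one small step and then asks to be queued again until the actual state matches the desired state.

```go
package main

import "fmt"

// Rollout is a hypothetical stand-in for a StatefulsetRollout resource.
type Rollout struct {
	Name    string
	Desired int // pods that should run the new version
	Updated int // pods that already run it
}

// reconcile takes one step towards the desired state and reports whether
// the resource should be put back on the queue for another pass.
func reconcile(r *Rollout) (requeue bool) {
	if r.Updated >= r.Desired {
		return false // converged, nothing left to do
	}
	r.Updated++ // stands in for "terminate one pod and wait for its replacement"
	return true
}

// run processes rollouts strictly one at a time. A resource that asks for a
// requeue goes to the back of the queue. It returns the number of reconcile calls.
func run(rollouts ...*Rollout) int {
	queue := append([]*Rollout(nil), rollouts...)
	calls := 0
	for len(queue) > 0 {
		r := queue[0]
		queue = queue[1:]
		calls++
		if reconcile(r) {
			queue = append(queue, r)
		}
	}
	return calls
}

func main() {
	a := &Rollout{Name: "search", Desired: 2}
	b := &Rollout{Name: "cache", Desired: 1}
	calls := run(a, b)
	fmt.Println(calls, a.Updated, b.Updated)
}
```

Because only one resource is reconciled at a time and each pass re-reads the current state, there is no need to coordinate concurrent event handlers. In a Kubebuilder-based operator the same effect is typically achieved by returning a result that requests a requeue from the reconcile function; the sketch leaves that dependency out to stay self-contained.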

The operator provides real-time updates to users through rich-text Slack notifications, which include details such as version numbers and the list of pods being rolled out. Additionally, it communicates with the Bedrock API to report the success or failure of rollouts, ensuring that Slack's release management UI reflects the status.

While the custom operator has proven effective for Slack's needs, it does have some limitations. One challenge arose when dealing with extremely large StatefulSets containing up to 1,000 pods. This required modifications to the notification system to avoid rate-limiting problems. Another limitation is the "version leak" problem inherent in using the OnDelete strategy for StatefulSets. In scenarios where rollouts are paused or only partially completed, pods running the previous version that are terminated for reasons other than the rollout may be replaced by pods running the new version. This can lead to gradual, unintended convergence towards a full rollout over time. Slack mitigates this issue by encouraging teams to complete their rollouts promptly.
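The version leak follows directly from how OnDelete works: any replacement pod is created from the StatefulSet's current template, regardless of why the old pod went away. The short Go simulation below illustrates this with a heavily simplified, hypothetical model (`StatefulSet`, `replacePod`, and `countOn` are names invented for this sketch, not Kubernetes or Slack APIs).

```go
package main

import "fmt"

// StatefulSet is a hypothetical, simplified model: one template revision plus
// the revision each pod (indexed by ordinal) currently runs.
type StatefulSet struct {
	TemplateRevision string
	PodRevisions     []string
}

// replacePod models what happens after any pod deletion under OnDelete:
// the replacement is created from the current template, whatever the reason
// for the deletion was.
func (s *StatefulSet) replacePod(ordinal int) {
	s.PodRevisions[ordinal] = s.TemplateRevision
}

// countOn returns how many pods currently run the given revision.
func (s *StatefulSet) countOn(revision string) int {
	n := 0
	for _, r := range s.PodRevisions {
		if r == revision {
			n++
		}
	}
	return n
}

func main() {
	sts := &StatefulSet{
		TemplateRevision: "v2", // the template was already updated for the rollout
		PodRevisions:     []string{"v1", "v1", "v1", "v1", "v1"},
	}

	// The rollout restarts two pods and is then paused.
	sts.replacePod(0)
	sts.replacePod(1)
	fmt.Println("paused with", sts.countOn("v2"), "pods on v2")

	// An unrelated event (for example a node failure) removes pod 4.
	sts.replacePod(4)
	fmt.Println("after unrelated restart:", sts.countOn("v2"), "pods on v2")
}
```

In this simulation the paused rollout drifts from two to three updated pods without any rollout step being taken, which is the gradual convergence described above.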

Slack has achieved greater control over StatefulSet deployments by creating a custom solution that integrates seamlessly with its existing internal systems and communication channels. As Kubernetes evolves, its maintainers may incorporate some of this functionality into core Kubernetes features. However, in the meantime, the flexibility and integration capabilities offered by the operator model are likely to be valuable for organisations with complex deployment needs and custom infrastructure.

Slack plans to expand its use of the operator model for managing Kubernetes deployments. The company is exploring existing CNCF projects such as Argo Rollouts and OpenKruise for non-stateful Deployment resources. Other organisations have also developed rollout operators; for example, Grafana Labs offers an operator that provides finer-grained control over rollouts.

Other tools cover similar ground. Argo Rollouts offers blue-green and canary deployments, canary analysis, experimentation, and progressive delivery, but focuses on Deployments rather than StatefulSets. Flagger offers canary releases (with or without session affinity), blue-green deployments, and A/B testing. Bikram Kundu at jstobigdata walks through the complexity and limitations of StatefulSets in Kubernetes and summarises best practices in this area.
