In a recent blog post, Slack Engineering detailed significant improvements to its Chef infrastructure, which manages tens of thousands of EC2 instances running its services, databases, and applications. The company recently moved from a single Chef stack to a more resilient, sharded infrastructure.
In moving to the new architecture, Slack set out to address several areas that had proved difficult with the previous setup:
- Assigning a shard to a node
- Neighbourhood discovery
- Searching Chef
- Cookbook uploads
In their previous setup, Slack had a single Chef stack across three environments: Sandbox, Development, and Production. This architecture posed risks as changes were deployed simultaneously across environments, and any issues with the stack could impact their entire infrastructure. The system used a process named DishPig to handle cookbook updates, which was triggered hourly.
The engineering team made several key changes to address these limitations. First, they created multiple Chef stacks to distribute load and improve resilience. New instances are assigned to a specific shard using weighted CNAME records in AWS Route 53. The team also separated development and production Chef infrastructure into distinct stacks.
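Weighted DNS records return each target in proportion to its weight relative to the total, so shard assignment behaves like a weighted random draw at the moment an instance resolves the name. The following is a minimal simulation of that behaviour, not Slack's configuration: the shard hostnames and weights are hypothetical, and the real assignment happens inside DNS rather than in application code.

```python
import random
from collections import Counter

# Hypothetical shard endpoints and weights; the source does not list the real ones.
SHARD_WEIGHTS = {
    "chef-shard-a.example.internal": 50,
    "chef-shard-b.example.internal": 30,
    "chef-shard-c.example.internal": 20,
}


def pick_shard(weights, rng):
    """Return one shard name with probability weight / sum(weights)."""
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]


def simulate(weights, instances, seed=0):
    """Count how many of `instances` new nodes land on each shard."""
    rng = random.Random(seed)  # seeded so the simulation is repeatable
    return Counter(pick_shard(weights, rng) for _ in range(instances))


if __name__ == "__main__":
    for shard, count in simulate(SHARD_WEIGHTS, 10_000).most_common():
        print(f"{shard}: {count}")
```

Lowering a shard's weight to zero in this model stops new instances from landing on it, which is one way such a scheme could be used to drain a stack.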
To solve the challenge of node discovery in the new sharded infrastructure, the team started using Consul for service discovery. This needed careful implementation to avoid circular dependencies with their Nebula overlay network. The team developed custom Chef library functions to ease node lookups based on various criteria, effectively replacing their previous Chef search functionality.
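The lookup helpers described above are Chef library functions, and the post does not show them. As a rough illustration of the idea, the Python sketch below filters entries shaped like a Consul catalog response (`Node`, `NodeMeta`, `ServiceTags`) by several optional criteria. The function name, the `role` and `environment` metadata keys, and the sample data are all assumptions made for this example.

```python
def find_nodes(catalog, role=None, environment=None, tag=None):
    """Return sorted node names matching every criterion that is not None.

    `catalog` is a list of dicts shaped like Consul catalog entries.
    The "role" and "environment" metadata keys are hypothetical.
    """
    matches = []
    for entry in catalog:
        meta = entry.get("NodeMeta", {})
        if role is not None and meta.get("role") != role:
            continue
        if environment is not None and meta.get("environment") != environment:
            continue
        if tag is not None and tag not in entry.get("ServiceTags", []):
            continue
        matches.append(entry["Node"])
    return sorted(matches)


# Sample data standing in for a real catalog query.
CATALOG = [
    {"Node": "web-1", "NodeMeta": {"role": "web", "environment": "prod"}, "ServiceTags": ["az-a"]},
    {"Node": "web-2", "NodeMeta": {"role": "web", "environment": "dev"}, "ServiceTags": ["az-b"]},
    {"Node": "db-1", "NodeMeta": {"role": "db", "environment": "prod"}, "ServiceTags": ["az-a"]},
]

if __name__ == "__main__":
    print(find_nodes(CATALOG, role="web", environment="prod"))
    print(find_nodes(CATALOG, tag="az-a"))
```

Because the lookup no longer depends on which Chef stack a node registered with, a helper like this returns the same answer regardless of shard.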
The company also created a new service called Shearch (Sharded Chef Search) to maintain the ability to search across multiple Chef stacks. This service consolidates results from queries run across different shards. They also developed a new tool named Gnife to replace the traditional Chef Knife command, enabling operations across multiple shards.
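Consolidating a search across shards is a fan-out and merge: send the same query to every stack, then combine the results. The sketch below shows that pattern under stated assumptions. It is not Shearch's implementation, each shard is represented by a hypothetical in-memory search function rather than a real Chef server, and the `key:value` query syntax is a simplification.

```python
from concurrent.futures import ThreadPoolExecutor


def make_fake_shard(nodes):
    """Build a stand-in for one Chef stack that answers 'key:value' queries."""
    def search(query):
        key, _, value = query.partition(":")
        return [node for node in nodes if node.get(key) == value]
    return search


def search_all_shards(shards, query):
    """Run `query` on every shard in parallel and merge results by node name.

    `shards` maps a shard name to a callable taking the query string.
    If a node appears on more than one shard, the first shard listed wins.
    """
    with ThreadPoolExecutor(max_workers=max(1, len(shards))) as pool:
        futures = {name: pool.submit(fn, query) for name, fn in shards.items()}
    merged = {}
    for shard_name, future in futures.items():
        for node in future.result():
            merged.setdefault(node["name"], {**node, "shard": shard_name})
    return [merged[name] for name in sorted(merged)]


if __name__ == "__main__":
    shards = {
        "shard-a": make_fake_shard([{"name": "web-1", "role": "web"}, {"name": "db-1", "role": "db"}]),
        "shard-b": make_fake_shard([{"name": "web-2", "role": "web"}]),
    }
    for node in search_all_shards(shards, "role:web"):
        print(node)
```

Tagging each result with its shard also gives a multi-shard command-line tool enough information to route follow-up operations to the right stack.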
The team replaced the DishPig system with Chef Librarian. This service independently manages cookbook versioning and environment updates, allowing for more controlled deployments. When changes are merged, GitHub Actions builds a tarball with a complete copy of the repository, and cookbook versions are updated using a timestamp-based format (YYYYMMDD.TIMESTAMP.0).
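A version in the `YYYYMMDD.TIMESTAMP.0` format could be generated along the following lines. This is a sketch that assumes `TIMESTAMP` means Unix epoch seconds and that the date is taken in UTC; the post does not spell out either detail, and the function names are invented for the example.

```python
from datetime import datetime, timezone


def cookbook_version(build_time):
    """Format a timezone-aware datetime as YYYYMMDD.TIMESTAMP.0.

    Assumes TIMESTAMP is Unix epoch seconds and the date part is in UTC.
    """
    if build_time.tzinfo is None:
        raise ValueError("build_time must be timezone-aware")
    utc = build_time.astimezone(timezone.utc)
    return f"{utc:%Y%m%d}.{int(utc.timestamp())}.0"


def version_key(version):
    """Turn a version string into a tuple of ints so versions sort numerically."""
    return tuple(int(part) for part in version.split("."))


if __name__ == "__main__":
    print(cookbook_version(datetime.now(timezone.utc)))
```

Since both the date and the timestamp only ever increase, comparing versions numerically with `version_key` orders them by build time, so a later merge always produces a higher version.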
Chef Librarian provides API endpoints for updating environments to specific versions and matching environments to each other. Using Chef Librarian allowed Slack to test changes in sandbox and development environments before promoting them to production, reducing the risk of problematic changes affecting all environments simultaneously. The service stores artefact versions and deployment information in DynamoDB for better tracking and visibility.
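The two operations described here, pinning an environment to a version and matching one environment to another, can be modelled in a few lines. The class below is a hypothetical in-memory sketch rather than Chef Librarian's actual API: the method names are invented, and plain Python containers stand in for the DynamoDB tables.

```python
class LibrarianSketch:
    """Toy model of environment pinning; not the real Chef Librarian API."""

    def __init__(self, artefacts):
        self.artefacts = set(artefacts)  # known artefact versions
        self.environments = {}           # environment name -> pinned version
        self.history = []                # (environment, version) deployment records

    def update_environment(self, environment, version):
        """Pin `environment` to `version` if that artefact is known."""
        if version not in self.artefacts:
            raise ValueError(f"unknown artefact version: {version}")
        self.environments[environment] = version
        self.history.append((environment, version))
        return version

    def match_environment(self, target, source):
        """Pin `target` to whatever version `source` currently runs."""
        if source not in self.environments:
            raise KeyError(f"environment has no pinned version: {source}")
        return self.update_environment(target, self.environments[source])


if __name__ == "__main__":
    librarian = LibrarianSketch(["20240101.1.0", "20240102.2.0"])
    librarian.update_environment("sandbox", "20240102.2.0")
    librarian.match_environment("development", "sandbox")
    print(librarian.environments)
```

Keeping a history of every pin is what makes it possible to answer which version an environment ran at a given time.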
A Slack app notifies users when their changes are promoted to an environment, using Git commit information to identify and tag the appropriate team members. A Kubernetes CronJob handles the promotion of versions across environments, with safety checks to prevent promotions if issues are detected.
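One scheduled promotion pass might look like the sketch below. It assumes a sandbox, development, production order and moves a version forward by a single environment per run, skipping a step when the upstream environment fails its check. The health check is a hypothetical callback; the post does not describe which checks Slack actually runs or how often the job fires.

```python
PROMOTION_ORDER = ("sandbox", "development", "production")


def promote_once(pins, is_healthy, order=PROMOTION_ORDER):
    """Run one promotion pass and return (new_pins, promotions).

    `pins` maps environment -> version. Pairs are visited from the end of
    the chain so a version advances at most one environment per pass.
    `is_healthy(environment)` is a hypothetical safety check.
    """
    new_pins = dict(pins)
    promotions = []
    for upstream, downstream in reversed(list(zip(order, order[1:]))):
        if new_pins[upstream] == new_pins[downstream]:
            continue  # nothing new to promote
        if not is_healthy(upstream):
            continue  # safety check failed, hold the promotion
        new_pins[downstream] = new_pins[upstream]
        promotions.append((downstream, new_pins[upstream]))
    return new_pins, promotions


if __name__ == "__main__":
    pins = {"sandbox": "v3", "development": "v2", "production": "v1"}
    print(promote_once(pins, is_healthy=lambda env: True))
    print(promote_once(pins, is_healthy=lambda env: env != "development"))
```

Advancing one environment per pass gives each version time to run in an earlier environment before it reaches production.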
Because Chef roles cannot be versioned, Slack minimised the risk they pose by stripping them down to basic information and run lists. Roles are uploaded only to the relevant Chef stacks, when their corresponding environments are updated.
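To illustrate what a stripped-down role could look like, the sketch below takes a role in Chef's JSON shape and keeps only its name, description, and run list, dropping the attribute blocks. Which fields Slack retains is an assumption here, as are the role contents and the function name.

```python
# Assumed set of fields worth keeping; the source only says
# "basic information and run lists".
KEPT_KEYS = ("name", "description", "run_list")


def strip_role(role):
    """Return a copy of `role` containing only the keys in KEPT_KEYS."""
    return {key: role[key] for key in KEPT_KEYS if key in role}


if __name__ == "__main__":
    role = {
        "name": "web",
        "description": "Web tier",
        "default_attributes": {"nginx": {"workers": 8}},
        "override_attributes": {},
        "run_list": ["recipe[base]", "recipe[nginx]"],
    }
    print(strip_role(role))
```

With attributes removed from roles, behaviour changes have to travel through cookbooks, which are versioned and promoted per environment.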
Slack is considering further improvements to its Chef infrastructure. One possibility is segmenting production Chef environments by AWS availability zones, which would allow for more granular control over change deployment. They are also exploring the adoption of Chef PolicyFiles and PolicyGroups, though this would represent a significant change to their current setup.
Chef's popularity has declined compared to its peak in the mid-2010s, potentially due to the rise of alternative tools such as Ansible and other cloud-native solutions. An industry-wide shift towards containerisation also changed how organisations approach configuration management, with many adopting Docker and Kubernetes instead. The tool's acquisition by Progress Software in 2020 may also have influenced long-term adoption.
However, Chef maintains a solid user base, particularly among organisations with existing Chef implementations or specific use cases well-suited to Chef's approach.