Dropbox's engineering team wrote about Pirlo, their automation tool for network and server provisioning and validation. Pirlo has a pluggable architecture built on a custom MySQL-backed job queue.
Dropbox runs its own datacenters, and its NRE (Network Reliability Engineering) team builds, automates, and monitors the production network. The Pirlo set of tools consists of a TOR (top-of-rack) switch initializer and a server configurator and validator. These run as worker processes on a generic distributed job queue, built in-house on top of MySQL, with a UI to track the progress of running jobs and visualize past data. Pirlo is built as pluggable modules with extensive logging at every stage to aid debugging and analysis of automation runs. Most of Dropbox's code is in Python, although it is unclear whether Pirlo is written in Python as well.
Both the switch and the server provisioner use the job queue, with the specific logic implemented in the workers. The workflow is similar in both cases: a client request causes the Queue Manager to choose the appropriate job handler, and the handler runs the plugins registered with it, which carry out the actual checks and commands. Each plugin performs a specific task, emits a status code, and publishes status to a database log, including events that record the command that was run. This is how most job queues work, so it is natural to ask why the team did not opt for an existing one like Celery. The authors of the article explain that:
We didn't need the whole feature set, nor the complexity of a third-party tool. Leveraging in-house primitives gave us more flexibility in the design and allows us to both develop and operate the Pirlo service with a very small group of SREs.
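Pirlo's internals are not public, but a minimal sketch of such a MySQL-backed claim-and-dispatch loop, written here in Python with pymysql, may help illustrate the pattern. The jobs table, column names, and handler registry below are assumptions for illustration, not Pirlo's actual schema or API:

```python
# Hypothetical sketch of a MySQL-backed job queue: workers atomically claim
# pending jobs, and the job type selects a registered handler. Table and
# column names are illustrative only.
import pymysql

HANDLERS = {}  # job_type -> handler callable

def register(job_type):
    def wrap(fn):
        HANDLERS[job_type] = fn
        return fn
    return wrap

@register("tor_provision")
def provision_switch(payload):
    print(f"provisioning switch: {payload}")

def claim_and_run(conn, worker_id):
    with conn.cursor() as cur:
        # Claim one pending job inside a transaction so that concurrent
        # workers do not pick up the same row.
        cur.execute(
            "SELECT id, job_type, payload FROM jobs "
            "WHERE state = 'pending' ORDER BY id LIMIT 1 FOR UPDATE"
        )
        row = cur.fetchone()
        if row is None:
            conn.commit()
            return False
        job_id, job_type, payload = row
        cur.execute(
            "UPDATE jobs SET state = 'running', worker = %s WHERE id = %s",
            (worker_id, job_id),
        )
    conn.commit()
    try:
        HANDLERS[job_type](payload)  # dispatch to the registered handler
        state = "done"
    except Exception:
        state = "failed"
    with conn.cursor() as cur:
        cur.execute("UPDATE jobs SET state = %s WHERE id = %s", (state, job_id))
    conn.commit()
    return True

# Usage (requires a real MySQL instance and a jobs table):
# conn = pymysql.connect(host="db.example.net", user="queue", db="pirlo")
# while claim_and_run(conn, worker_id="worker-1"):
#     pass
```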
The switch provisioner, called the TOR starter, kicks off when a starter client sends a request. A TOR switch is part of a network design in which the server equipment in a rack is connected to a network switch in the same rack, usually at the top. The client attempts to find a healthy server using service discovery via gRPC, and the queue manager chooses a job handler for the job. Switch validation and configuration is a multi-step process: it starts with establishing basic connectivity, proceeds by executing each plugin in turn, and culminates in downloading the switch configuration and rebooting the switch.
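The article does not show Pirlo's plugin interface, but a hedged sketch of the pattern it describes, where each plugin performs one step, emits a status code, and logs the command it ran, could look like the following. The class names, the ping target, and the in-memory event log are assumptions:

```python
# Illustrative plugin pattern (not Pirlo's real interfaces): each plugin runs
# one command, returns a status code, and records the command for the log.
import subprocess
from dataclasses import dataclass

@dataclass
class Result:
    plugin: str
    status: int      # 0 = success, non-zero = failure
    command: str
    output: str

class Plugin:
    name = "base"
    command: list[str] = []

    def run(self) -> Result:
        proc = subprocess.run(self.command, capture_output=True, text=True)
        return Result(self.name, proc.returncode,
                      " ".join(self.command), proc.stdout)

class CheckConnectivity(Plugin):
    # First step: establish basic connectivity to the device.
    name = "check_connectivity"
    command = ["ping", "-c", "1", "switch.example.net"]  # placeholder target

def run_job(plugins, event_log):
    # Execute each registered plugin in order, persisting every event
    # (status plus the command run); stop at the first failure.
    for plugin in plugins:
        result = plugin.run()
        event_log.append(result)  # stand-in for the database event log
        if result.status != 0:
            return False
    return True

log: list[Result] = []
run_job([CheckConnectivity()], log)
```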
The server provisioning and validation process is similar. The validator is launched on the server machine in an OS image created with Debirf, which can build a RAM-based filesystem to run Debian systems entirely from memory. Nicknamed Hotdog, it is an Ubuntu-based image that can boot over the network and runs validation, benchmarks, and stress tests. The results are pushed to the database and analyzed later. The tests include validating various hardware and firmware components against predefined lists of configurations approved by the Hardware Engineering team. Repaired machines also go through this test suite before they are put back into production.
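As a rough illustration of that kind of check (the component names, versions, and approved list below are invented, not Dropbox's), validating discovered hardware and firmware against an approved-configuration list can be as simple as a set lookup per component:

```python
# Hypothetical approved-configuration check: compare a machine's discovered
# inventory against the versions the hardware team has signed off on.
APPROVED = {
    "nic_firmware": {"14.27.1016", "14.28.2006"},
    "bios": {"2.4.1"},
}

def validate(inventory: dict[str, str]) -> list[str]:
    """Return human-readable failures; an empty list means the box passes."""
    failures = []
    for component, version in inventory.items():
        allowed = APPROVED.get(component)
        if allowed is None:
            failures.append(f"{component}: not on the approved list")
        elif version not in allowed:
            failures.append(f"{component}: {version} not in {sorted(allowed)}")
    return failures

print(validate({"nic_firmware": "14.27.1016", "bios": "2.3.0"}))
# -> ["bios: 2.3.0 not in ['2.4.1']"]
```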
Pirlo's UI shows the progress of both currently running and completed jobs. Dropbox previously used playbooks (or runbooks) to perform provisioning and configuration. Other engineering teams that run their own datacenters have also moved from runbook-based provisioning to Zero Touch Provisioning (ZTP), albeit using different methods.