In a recent blog post, Dropbox revealed Atlas, a platform whose aim is to provide various benefits of a Service Oriented Architecture while minimizing the operational cost of owning a service.
Atlas' goal is to support small, self-contained functionality, saving product teams the overhead of managing a full-blown service, including capacity planning and alert setup. Atlas gives its users an experience similar to that of serverless systems such as AWS Fargate, while being backed by automatically provisioned services behind the scenes. The authors, Naphat Sanguansin and Utsav Shah, say they evaluated off-the-shelf solutions to run the platform. However, to de-risk the migration and keep engineering costs low, they decided to continue hosting services on the same deployment orchestration platform used by the rest of Dropbox.
The reason for building Atlas was to replace Dropbox's central Python monolith called Metaserver. Building Atlas is a multi-year journey, still taking place today. Currently, Atlas is serving more than 25% of the monolith traffic it aims to replace. The authors draw a key conclusion regarding the migration process:
"The single most important takeaway from this multi-year effort is that well-thought-out code composition, early in a project's lifetime, is essential. Otherwise, technical debt and code complexity compound very quickly. The dismantling of import cycles and the decomposition of Metaserver (...) was probably the most strategically effective part of the project because it prevented new code from contributing to the problem and made our code simpler to understand."
The authors state that many previous efforts to improve Metaserver had not succeeded due to the codebase's size and complexity. This time, they designed the execution plan for Atlas with stepping stones, not milestones, in mind. The idea was that each incremental step would provide sufficient value on its own, even if the next part of the project failed for any reason. Key examples of this strategy include improvements to the monolithic codebase that have value regardless of the Atlas implementation. The team also backported many enhancements developed for Atlas into Metaserver to increase the project's value even further.
Before and after Atlas
Source: https://dropbox.tech/infrastructure/atlas--our-journey-from-a-python-monolith-to-a-managed-platform
The Atlas design involved a few critical efforts revolving around componentization, orchestration, and operationalization. To improve componentization, Atlas introduces Atlasservlets (pronounced "atlas servlets") as a logical, atomic grouping of HTTP routes. The authors say that "In preparation for Atlas, we worked with product teams to assign Atlasservlets to every route in Metaserver, resulting in more than 200 Atlasservlets across more than 5000 routes." Each servlet is assigned an owner, and the owner is the only authority that manages it. Also, to break up the Metaserver codebase, the team had to break most of its Python import cycles. The process took several years, and the team prevented regressions and new import cycles through the use of the Bazel build system and its visibility rules.
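To make the import-cycle problem concrete, the following is a minimal sketch of how a cycle can be detected in a module dependency graph using a depth-first search. It is not Dropbox's tooling (the article only says they relied on Bazel and its visibility rules); the function name and the module names in the usage note are hypothetical and chosen purely for illustration.

```python
from typing import Dict, List, Optional


def find_import_cycle(imports: Dict[str, List[str]]) -> Optional[List[str]]:
    """Return one import cycle as a list of module names, or None if there is none.

    `imports` maps a module name to the modules it imports (an assumed,
    simplified representation of a real dependency graph).
    """
    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on the current DFS path, fully explored
    color = {module: WHITE for module in imports}
    path: List[str] = []

    def visit(module: str) -> Optional[List[str]]:
        color[module] = GRAY
        path.append(module)
        for dep in imports.get(module, []):
            state = color.get(dep, WHITE)
            if state == GRAY:
                # dep is already on the current path, so the path closes a cycle.
                return path[path.index(dep):] + [dep]
            if state == WHITE:
                cycle = visit(dep)
                if cycle:
                    return cycle
        path.pop()
        color[module] = BLACK
        return None

    for module in list(imports):
        if color[module] == WHITE:
            cycle = visit(module)
            if cycle:
                return cycle
    return None
```

For example, calling `find_import_cycle({"routes": ["auth"], "auth": ["models"], "models": ["routes"]})` reports the cycle through all three hypothetical modules, while a graph with no back-references yields `None`. A check like this only finds existing cycles; preventing new ones, as the article describes, is a job for build-level rules.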
To improve orchestration, each servlet in Atlas is its own cluster. This decision provides isolation by default, as a misbehaving route will only impact other routes in the same Atlasservlet. It also allows for independent pushes of code. In addition, Dropbox decided to standardize on gRPC. To continue to serve HTTP traffic, they used the gRPC-JSON transcoding feature provided out of the box in Envoy, which they use as a proxy and load balancer in front of the servlets.
HTTP transcoding
Source: https://dropbox.tech/infrastructure/atlas--our-journey-from-a-python-monolith-to-a-managed-platform
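The idea behind HTTP-to-gRPC transcoding can be illustrated with a toy sketch: an HTTP method and path are matched against a route table, and path variables plus the JSON body become the fields of an RPC request. This is only a conceptual model. Envoy's actual transcoder works from protobuf descriptors and HTTP annotations on the service definition, and every route, service name, and function below is a hypothetical placeholder rather than anything taken from Dropbox's setup.

```python
import json
import re
from typing import Any, Dict, List, Tuple

# Hypothetical route table: (HTTP method, path template, gRPC-style method name).
ROUTES: List[Tuple[str, str, str]] = [
    ("GET", "/v1/files/{file_id}", "files.FileService/GetFile"),
    ("POST", "/v1/files", "files.FileService/CreateFile"),
]


def _template_to_regex(template: str):
    """Compile "/v1/files/{file_id}" into a regex with one named group per variable."""
    # re.split with a capturing group alternates literal text and variable names.
    parts = re.split(r"\{(\w+)\}", template)
    pattern = "".join(
        re.escape(part) if index % 2 == 0 else f"(?P<{part}>[^/]+)"
        for index, part in enumerate(parts)
    )
    return re.compile("^" + pattern + "$")


def transcode(http_method: str, path: str, body: str = "") -> Tuple[str, Dict[str, Any]]:
    """Map an HTTP/JSON request to a (gRPC method name, request fields) pair."""
    for method, template, rpc in ROUTES:
        match = _template_to_regex(template).match(path)
        if method == http_method and match:
            fields: Dict[str, Any] = json.loads(body) if body else {}
            fields.update(match.groupdict())  # path variables become request fields
            return rpc, fields
    raise LookupError(f"no RPC bound to {http_method} {path}")
```

In this sketch, a `GET` on `/v1/files/abc` resolves to the hypothetical `GetFile` method with `file_id` set to `abc`, and a `POST` with a JSON body resolves to `CreateFile` with the body's keys as fields. A real proxy would then serialize those fields into a protobuf message and forward the call to the backing servlet.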
Regarding operationalization, the authors say that "Atlas' secret sauce is the managed experience." The main pillars of this effort are automated canary analysis, which checks each code push before it reaches production, and an autoscaling capability that removes much of the need for capacity planning.
Canary analysis
Source: https://dropbox.tech/infrastructure/atlas--our-journey-from-a-python-monolith-to-a-managed-platform
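As a rough illustration of what a canary check decides, the sketch below compares the error rate of a canary deployment with that of the current baseline and approves the push only if the canary is not meaningfully worse. The article does not describe which metrics or thresholds Dropbox uses, so the `Stats` type, the function name, and all threshold values here are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Stats:
    """Request and error counts observed over the same time window."""
    requests: int
    errors: int

    @property
    def error_rate(self) -> float:
        return self.errors / self.requests if self.requests else 0.0


def canary_passes(
    baseline: Stats,
    canary: Stats,
    max_relative_increase: float = 0.5,  # assumed tolerance: 50% above baseline
    min_requests: int = 100,             # assumed minimum sample size
    error_rate_floor: float = 0.001,     # assumed absolute floor
) -> bool:
    """Return True if the canary's error rate is acceptable relative to the baseline."""
    if canary.requests < min_requests:
        return False  # not enough traffic to judge, so hold the push
    allowed = baseline.error_rate * (1 + max_relative_increase)
    # The floor keeps a zero-error baseline from failing every canary.
    return canary.error_rate <= max(allowed, error_rate_floor)
```

With a baseline of 10 errors in 10,000 requests, a canary showing 1 error in 1,000 requests would pass under these assumed thresholds, while one showing 5 errors in 1,000 would be rejected. A production system would likely compare several metrics (latency, resource usage) and apply statistical tests rather than a single ratio.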