Dropbox, a file-sharing cloud platform, recently discussed its Messaging System Model (MSM), which supports diverse use cases and handles over 30 million tasks per minute.
Dmitry Kopytkov and Deepak Gupta, software engineers at Dropbox, summarised this journey in a blog post. By 2021, Dropbox's asynchronous systems had become fragmented: each was tailored to specific product needs, and together they lacked consistency. These systems supported various use cases such as file uploads, machine learning, and search indexing but faced several limitations.
Developers struggled with complex systems that required significant effort to learn and manage, reducing productivity. Reliability was inconsistent due to varying service-level objectives (SLOs) for availability and latency, and the lack of multihoming increased risks during data center failures. Operational complexity was high because of a mix of external queuing solutions like Kafka and Redis, which also added to costs.
Scalability issues arose as the system processed over 30 billion requests daily but struggled to meet throughput demands in important components like the delayed event scheduler. Additionally, the existing lambda infrastructure deviated from Dropbox’s service-oriented architecture (SOA), making it difficult to diagnose issues or integrate with other systems. It also lacked auto-scaling capabilities, requiring manual interventions for capacity adjustments. Furthermore, the infrastructure was inflexible, making it hard to adapt to new workflows or integrate with Dropbox’s newer file system architecture, Cypress.
To address these issues, Dropbox adopted a phased approach rather than building an entirely new system from scratch. This approach aimed to improve development speed by simplifying asynchronous interfaces and reducing operational burdens through automated release practices that could detect regressions and trigger rollbacks.
Automatic compute scaling was introduced to handle event backlogs more efficiently. The new system also aimed to create a robust foundation by unifying patterns across asynchronous systems while providing extensible components and APIs to support new use cases with minimal changes. Cost efficiency was another priority, achieved by phasing out redundant systems and transitioning the lambda infrastructure to Dropbox’s SOA stack for better compute efficiency, autoscaling, multihoming, and monitoring.
Source: Evolving our infrastructure through the messaging system model in Dropbox
The Messaging System Model (MSM) was introduced as a key part of this transformation. Inspired by the OSI model in networking, MSM organizes Dropbox’s asynchronous system into five logical layers. The frontend layer serves as the interface for engineers or other systems like databases, managing schema validation for event compliance and converting message formats into a standardized protocol buffer format while ensuring event durability.
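To make the frontend layer's role more concrete, the following is a minimal sketch of schema validation followed by normalization into a single message shape. It is an illustration only and not Dropbox's implementation: the `SCHEMAS` registry, the `Envelope` class, and the `accept` function are hypothetical names, and a plain dataclass stands in for the protocol buffer format the blog post describes.

```python
from dataclasses import dataclass
from typing import Any, Dict

# Hypothetical schema registry: topic -> required field names and types.
SCHEMAS: Dict[str, Dict[str, type]] = {
    "file_uploaded": {"file_id": str, "size_bytes": int},
}


@dataclass(frozen=True)
class Envelope:
    """Stand-in for the standardized message format (protobuf in the source)."""
    topic: str
    payload: Dict[str, Any]


def accept(topic: str, raw: Dict[str, Any]) -> Envelope:
    """Validate a raw event against its schema and wrap it in an Envelope."""
    schema = SCHEMAS.get(topic)
    if schema is None:
        raise ValueError(f"unknown topic: {topic}")
    for field, expected in schema.items():
        if field not in raw:
            raise ValueError(f"missing field: {field}")
        if not isinstance(raw[field], expected):
            raise ValueError(f"bad type for field: {field}")
    # Keep only schema fields so downstream layers see a uniform shape.
    return Envelope(topic, {name: raw[name] for name in schema})
```

In this sketch, events that fail validation are rejected at the edge, so later layers only ever handle messages in one known shape. Durability, which the frontend layer also provides, is left out.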
Separately, Dropbox was recently in the news for discontinuing Dropbox Vault. The decision surprised the tech community, with one user on Hacker News expressing frustration, saying they relied on Vault and now needed to explore other cloud storage options with similar secure PIN access for files.
Dropbox users turned to the community forum to seek clarification on the decision to discontinue the Vault feature but were left without a clear explanation. While Dropbox cited "technical risks that could compromise security" and a desire to focus on enhancing existing security features, this reasoning did little to satisfy users.
Coming back to the messaging system model, the scheduler layer manages event dispatching based on use case requirements like change data capture or delayed execution, ensuring execution order. The flow control layer handles task distribution based on subscriber availability, priorities, or throttling while tracking statuses and retrying failed tasks. The delivery layer routes events to services in public or private clouds, managing retries, filtering, and concurrency. The execution layer processes tasks via serverless functions or remote processes, leveraging autoscaling and ensuring reliability across multi-cloud operations.
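The flow control behaviour described above (priority-based distribution, status tracking, and retries) can be sketched in a few lines. This is a simplified illustration under stated assumptions, not Dropbox's code: `run_flow_control`, its `execute` callback, and the `max_attempts` limit are hypothetical, and subscriber availability and throttling are omitted.

```python
import heapq
from typing import Callable, Dict, List, Tuple


def run_flow_control(
    tasks: List[Tuple[int, str]],
    execute: Callable[[str], bool],
    max_attempts: int = 3,
) -> Dict[str, str]:
    """Dispatch tasks by priority (lower number first), retrying failures."""
    # Heap entries are (priority, sequence, task_id, attempts_so_far);
    # the sequence number keeps ordering stable among equal priorities.
    heap = [(prio, seq, task_id, 0) for seq, (prio, task_id) in enumerate(tasks)]
    heapq.heapify(heap)
    seq = len(heap)
    status: Dict[str, str] = {}
    while heap:
        prio, _, task_id, attempts = heapq.heappop(heap)
        attempts += 1
        if execute(task_id):
            status[task_id] = "succeeded"
        elif attempts < max_attempts:
            # Re-queue at the same priority and remember the attempt count.
            status[task_id] = "retrying"
            heapq.heappush(heap, (prio, seq, task_id, attempts))
            seq += 1
        else:
            status[task_id] = "failed"
    return status
```

Here `execute` plays the part of the delivery and execution layers: it returns `True` on success and `False` on failure. A production system would typically add backoff between retries and persist task status rather than keep it in memory.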
This layered architecture enabled Dropbox to rebuild its asynchronous platform incrementally without disrupting stability. As a result, Dropbox has simplified workflows, improved developer productivity with automated scaling, enhanced reliability through multihoming, and adapted more easily to new demands. Additionally, they have achieved cost efficiency by consolidating infrastructure components.