Being clever about system architecture in advance is hard. Scaling successfully is more about being clever with metrics and introspection, creating efficient build and provisioning processes, and being comfortable with radical change. These are some of the keys to scaling at Dropbox, according to Rajiv Eranki in his recent presentation at the 2013 RAMP Conference in Budapest.
Reprising his 2012 blog post, Eranki described his experiences as head of server engineering, scaling Dropbox from 2,000 users to about 40 million. When Dropbox enticed Eranki out of graduate school, their architecture was a single (physical) MySQL database and a single front-end. Starting small has its benefits, says Eranki. The simple architecture made it easy to run queries across all users, to perform backups, and to debug, and it provided "incredible flexibility and agility" in the early days of product development.
Keeping things simple is a key lesson for Eranki. Clever ideas and techniques might seem attractive, but the lesson from Dropbox is that it is easier to scale out by buying more hardware when appropriate or by sharding databases when required. "Every time we tried to be clever about architecture we failed," muses Eranki. Hopes of using clever data structures such as Bloom filters to manage distributed hash tables never panned out against simple database sharding. Plans to use clever distributed sharding schemes for their MySQL databases proved more complex than a pragmatic master-slave architecture. Coordinating transactions was another area where simplicity won out: two-phase commit introduces brittleness and performance issues, so instead they opted for a design where commit order is coordinated so that errors fail gracefully and can be recovered after the fact via compensating transactions.
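To illustrate the idea, here is a minimal sketch (not Dropbox's actual code) of coordinating commit order instead of using two-phase commit: write the row that is safe to orphan first, so a failure part-way through leaves only dangling data that a compensating clean-up can remove after the fact. The table names and the in-memory lists standing in for two separate databases are purely illustrative.

```python
shares = []          # stands in for a table on one database
user_shares = []     # stands in for a table on a second database
orphaned = []        # work queue for a background compensating job

def record_share(user_id, folder_id, fail_second_write=False):
    # First commit: the record that is harmless if it ends up orphaned.
    shares.append({"user_id": user_id, "folder_id": folder_id})
    try:
        if fail_second_write:
            raise RuntimeError("simulated failure between the two commits")
        # Second commit: the record that makes the share visible to the user.
        user_shares.append({"user_id": user_id, "folder_id": folder_id})
    except Exception:
        # Fail gracefully: remember the dangling row so a periodic task can
        # issue the compensating delete later, rather than holding locks now.
        orphaned.append({"user_id": user_id, "folder_id": folder_id})
        raise
```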
"Keeping track of stuff" is another major theme from Eranki's talk. Dropbox wrote their own application-specific metrics and created a simple logging API that was easy for developers to use anywhere in their code. Gathering metrics became automatic and they built dashboards to help with monitoring system health. "Most graphs are useless" says Eranki so it is better to build dashboards around specific application-level requirements. Create alerts for metrics that aren't continuously "watched" but you want to know when they depart from predefined limits. "Watch the biggest users" says Eranki. Monitoring users with "the most shared folders, the highest bandwidth, the most requests" provides interesting insights into user behaviour and often uncovers cases of abuse or simply bugs in the system.
Logging is the flip-side of monitoring. New Dropbox features started out with "ultra-verbose text logging", says Eranki, and the team generally regretted it whenever those log points were later cleaned out. "Having a lot of prints is the right way to do it." Eranki recommends keeping logs for later reference, for example to help debug a difficult race condition that occurs only infrequently.
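A small sketch of what "ultra-verbose" logging for a new feature might look like, assuming the Python standard library logging module rather than Dropbox's own tooling. Keeping these statements, and the log files they produce, is what makes it possible to reconstruct a rare race condition long after the fact.

```python
import logging

logging.basicConfig(
    filename="shared_folders_feature.log",   # hypothetical per-feature log file
    level=logging.DEBUG,                      # ultra-verbose: log everything
    format="%(asctime)s pid=%(process)d %(levelname)s %(message)s",
)
log = logging.getLogger("shared_folders")

def move_folder(user_id, folder_id, dest):
    log.debug("move_folder start user=%s folder=%s dest=%s", user_id, folder_id, dest)
    # ... actual work would happen here ...
    log.debug("move_folder done user=%s folder=%s", user_id, folder_id)
```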
Reinforcing some of the lessons from Netflix on coping with chaos, Eranki describes the steps Dropbox took to "mitigate Murphy's Law." They would purposely fail hosts in production, because it is difficult to replicate real failures in a test system. "You know machines are going to fail, so make it happen at 2 pm when you're in the office rather than at 5 am," says Eranki. This leads to another of Eranki's key lessons: become comfortable with major change. Making schema changes or switching master databases to clones are major operations that, once mastered and made routine, provide a high degree of flexibility when changing the system or recovering from failure.
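A minimal sketch of deliberately failing a host during office hours, in the spirit of the "make it happen at 2 pm" advice. The host list, the time window and the use of ssh to reboot a machine are assumptions for illustration; this is not Eranki's or Netflix's actual tooling.

```python
import random
import subprocess
from datetime import datetime

HOSTS = ["web01", "web02", "meta01", "block03"]   # hypothetical production fleet

def fail_one_host():
    now = datetime.now()
    # Only pull the trigger on a weekday, during working hours, when people
    # are around to watch the failover paths being exercised.
    if now.weekday() < 5 and 10 <= now.hour <= 16:
        victim = random.choice(HOSTS)
        print(f"Rebooting {victim} to exercise failover paths")
        subprocess.run(["ssh", victim, "sudo", "reboot"], check=False)

if __name__ == "__main__":
    fail_one_host()
```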
Eranki concludes his presentation by considering the scaling of human resources. Wishing Dropbox had started earlier than it did, he notes how valuable it is to have people who know the complete system. The problem is that new people take months to ramp up, but experienced people reduce the cost of technical fixes and help avoid the buildup of technical debt incurred by "band-aid" fixes.