In a recent article on the Ably Blog, Alex Diaconu reviewed the thirty-year-old "eight fallacies of distributed computing" and provided a number of hints on how to handle them. InfoQ has taken the chance to talk with Diaconu to learn more about how Ably engineers deal with the fallacies.
The eight fallacies are a set of false assumptions about distributed systems that can lead to failures in software development. The assumptions are: the network is reliable; latency is zero; bandwidth is infinite; the network is secure; topology doesn't change; there is one administrator; transport cost is zero; and the network is homogeneous.
The fallacies can be seen as architectural requirements you have to account for when designing distributed systems.
InfoQ: Almost thirty years since the fallacies of distributed computing were initially suggested, they are still highly relevant. What's their role at Ably?
Alex Diaconu: All of the fallacies are pointers to distributed system design pitfalls, and they are all still relevant today. They don't all have the same impact — some are more easily accommodated than others. The fallacies that have the most pervasive effect on how we structure our systems at Ably are:
- The network is reliable. Naturally, this is an aspect that has to be taken into account in the design and operation of all services. It's not just that the network itself is unreliable: the systems you're attempting to reach over the network are also subject to failure. Nor is network reliability binary; networks can fail in unexpected, partial ways. Anticipating failure of nodes or interconnections is intrinsic to our system design. We have written extensively on the Ably blog about the ways we deal with this, through fault tolerance, for example; a minimal failover sketch follows this list.
- Topology doesn't change. This fallacy is also particularly relevant to our architecture. The Ably platform is designed to be elastic in realtime, so topology changes are continuous. Our system must handle topology changes routinely, and dealing with this is a significant source of complexity. The core Ably system is built on a common discovery layer that updates in realtime. Other system services use the discovery layer to construct a common view of system topology, and the routing of inter-service requests is all performed with regard to that topological view; a simplified sketch of such a view also follows this list. Scalability and performance of the discovery layer pose an engineering challenge that we have addressed in the course of scaling the Ably service.
- Bandwidth is infinite & transport cost is zero. In practice, intra-system networking costs, in the context of a global system spanning multiple regions, are a significant fraction of our operating costs. We therefore need to be aware of this when designing the system, ensure that our traffic does not scale more than linearly with user load, and monitor traffic to confirm that it remains within design parameters. We have occasionally hit network usage regressions, so monitoring also needs to detect them.
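To make the fault-tolerance point concrete, here is a minimal failover sketch in TypeScript: a read is attempted against redundant replicas with a per-attempt timeout, and only fails once every replica has failed. The endpoint URLs, timeout, and function names are illustrative assumptions, not Ably's actual mechanism.

```typescript
// Hypothetical replica endpoints; a real system would discover these dynamically.
const REPLICAS = [
  "https://node-a.example.com/state",
  "https://node-b.example.com/state",
  "https://node-c.example.com/state",
];

// Wrap fetch with a per-attempt timeout, so a hung connection is treated
// the same as an outright failure rather than blocking indefinitely.
async function fetchWithTimeout(url: string, ms: number): Promise<Response> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), ms);
  try {
    return await fetch(url, { signal: controller.signal });
  } finally {
    clearTimeout(timer);
  }
}

// Try each replica in turn; the request only fails when all replicas fail.
async function readState(): Promise<Response> {
  let lastError: unknown = new Error("no replicas configured");
  for (const url of REPLICAS) {
    try {
      const res = await fetchWithTimeout(url, 2_000);
      if (res.ok) return res;
      lastError = new Error(`HTTP ${res.status} from ${url}`);
    } catch (err) {
      lastError = err; // timeout or network error: fall through to the next replica
    }
  }
  throw lastError;
}
```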
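Similarly, for the topology point, the following is a simplified sketch of a local topology view that a service might maintain from discovery events and route requests against. The event shape and the hash-based routing are assumptions made for illustration; they are not Ably's discovery protocol.

```typescript
// A node's entry in the shared view of system topology.
interface Node {
  id: string;
  address: string;
}

// Discovery events as they might arrive over a realtime channel.
type DiscoveryEvent =
  | { kind: "join"; node: Node }
  | { kind: "leave"; nodeId: string };

// Each service applies discovery events to a local view, so routing
// decisions are always made against the latest known topology.
class TopologyView {
  private nodes = new Map<string, Node>();

  apply(event: DiscoveryEvent): void {
    if (event.kind === "join") {
      this.nodes.set(event.node.id, event.node);
    } else {
      this.nodes.delete(event.nodeId);
    }
  }

  // Route a key to a live node via a simple hash over the sorted node ids.
  // A production system would use consistent hashing to limit reshuffling
  // when the topology changes; a plain modulo keeps the sketch short.
  route(key: string): Node | undefined {
    const ids = [...this.nodes.keys()].sort();
    if (ids.length === 0) return undefined;
    let h = 0;
    for (const ch of key) h = (h * 31 + ch.charCodeAt(0)) >>> 0;
    return this.nodes.get(ids[h % ids.length]);
  }
}
```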
InfoQ: Do you think the evolution of distributed systems in the last thirty years has revealed any additional fallacies that should be taken into account?
Diaconu: I believe the most significant transformation over the last 30 years is the maturity of our understanding of how to deal with them. That’s not to say that the answers are any easier, but they are better understood. We know what approaches are good, what approaches are bad, and the limits of any given approach. There is now well-established scientific theory and engineering practice around these problem spaces. Computer science students are taught the problems and what the state of the art is.
Of course, it’s important to acknowledge that the fallacies are manifestations of enduring technical challenges; they shouldn’t be thought of as easily avoided pitfalls. I suppose you could say that there is, in fact, a new fallacy — "avoiding the fallacies of distributed computing is easy."
InfoQ: Some of the fallacies have meanwhile become commonplace; for example, the idea that the Cloud is not secure is now widely accepted. Still, there may be subtleties to them that make dealing with them less than trivial.
Diaconu: As previously mentioned, the challenges of distributed systems, and the broad science around the techniques and mechanisms used to build them, are now well-researched. The thing you learn when addressing these challenges in the real world, however, is that academic understanding only gets you so far.
Building distributed systems involves engineering pragmatism and trade-offs, and the best solutions are the ones you discover by experience and experiment.
As an example, the "network is reliable" fallacy is the most basic one you have to address. The known solutions involve protocols with retries, consensus-formation protocols, or redundancy for fault tolerance, depending on the particular failure mode of concern.
However, the engineering reality is that multiple kinds of failure can, and will, occur at the same time. The ideal solution then depends on the statistical distribution of failures, on analysis of error budgets, and on the specific service impact of particular errors.
The recovery mechanisms can themselves fail due to system unreliability, and the probability of those failures might shape the solution. And of course, there are the dangers of complexity: solutions that are theoretically sound but complex can be far harder to manage or reason about during an incident than simpler mechanisms that are theoretically less complete.
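As a concrete instance of the retry-based solutions Diaconu mentions, here is a minimal sketch of retries with capped exponential backoff and full jitter, bounded by an attempt cap so that the recovery mechanism itself does not pile extra load onto an already failing system. The parameter values are illustrative assumptions.

```typescript
// Retry an operation with capped exponential backoff and full jitter.
// The attempt cap acts as a crude retry budget: past it, the failure is
// surfaced instead of adding yet more load to an unhealthy system.
async function withRetries<T>(
  op: () => Promise<T>,
  maxAttempts = 5,
  baseDelayMs = 100,
  maxDelayMs = 5_000,
): Promise<T> {
  for (let attempt = 1; ; attempt++) {
    try {
      return await op();
    } catch (err) {
      if (attempt >= maxAttempts) throw err;
      // Full jitter: a random delay up to the capped exponential bound,
      // so a burst of failing clients does not retry in lockstep.
      const bound = Math.min(maxDelayMs, baseDelayMs * 2 ** (attempt - 1));
      await new Promise((resolve) => setTimeout(resolve, Math.random() * bound));
    }
  }
}
```

A call such as `withRetries(() => readState())` would then retry the failover read from the earlier sketch with these defaults, stacking the two mechanisms in the way the answer above describes.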
InfoQ: If we look at microservices, which have become quite popular in the last few years, they seem to be at odds with the "transport cost is zero" fallacy. In fact, the smaller each microservice is, the larger the overall count of services and the ensuing transport cost. How do you explain this?
Diaconu: Maybe another fallacy is "microservices make it easier to reason about your system". Sometimes breaking things down into components with a smaller surface area makes them easier to reason about. However, sometimes creating those boundaries adds complexity; it can certainly add failure modes, and it can create new things whose behavior also needs to be reasoned about.
Much like the previous answer, the actual design choices, and when and where you deploy the known theoretical solutions, come down to engineering judgment and experience. At Ably, we operate a system with multiple roles that scale, interoperate and discover one another independently. However, splitting functionality out into a distinct role is something we rarely do, and only when there is a particular driver for that to happen. For example, if we want some specific functionality to scale independently of other functionality, that justifies the creation of an independent role, even if it brings additional complexity.
Diaconu's article not only helps you understand where the fallacies originate, but also provides useful pointers to current techniques and approaches for addressing them, so do not miss it if you are interested in the subject.