Alvaro Videla, distributed systems engineer and co-author of RabbitMQ in Action, reviewed distributed systems theory at QCon London 2017. This involved breaking distributed systems into different classifications and then discussing the trade-offs between each of them. These included timing models, failure modes and more.
Videla introduced various timing models and whether or not a distributed system knows how long its steps will take. He lists three categories, not to be confused with their alternative meanings in concurrent programming:
- Synchronous: This is a distributed system in which the timing of each step is known. Although something which would help with problems such as failure detection, it is not reflective of real systems.
- Asynchronous: A distributed system which takes steps in any order, with no guarantee of the timing of each step. This is more in line with a real system, although a real system is likely to impose some sort of timeout.
- Semi-synchronous: A distributed system with at least some timing information, so it is able to estimate each step.
Videla also elaborates on the types of interprocess communication. This is simple, and a binary choice of communicating via message passing, or by shared memory.
The final classification given by Videla is the failure mode, which determines what kind of process failures are assumed. They are as follows:
- Crash-stop: When a process crashes, it never recovers. This is not reflective of the real world; when a machine fails it would not be disposed of, and would instead be re-used.
- Crash-recovery: When a process fails it can recover, usually by making use of sort of recovery algorithm. This may involve reading from a database or communicating with other processes.
- Omission Faults: Where processes fail to send or receive messages. The example given by Videla is a cache which can receive messages from the group but not reply to them. It would still be useful to a client, as it is still up to date with the latest data.
- Arbitrary Failure Modes: When processes start to send or receive the incorrect information, the outcome of which could be the ability to invalidate the state of the system.
Videla stresses that when deciding between these various models and categories, there is no silver bullet. It’s simply a tradeoff that must be made based on the requirements of the system under development.
It’s also worth noting that while some of these options are not reflective of real-world systems, they tend to be useful in distributed systems theory. This is because they are easier to work with when trying to prove new algorithms in combination with them. Also, if something does not work with the simple model, it tends to be implicit that it will not work with the more complex models too.
Videla also explains failure detectors, which are algorithms which are used to detect whether another process has died. The main problem with these is knowing the difference between a failed process, and a process which has taken a long time. The "Eventually Perfect Failure Detector" was given as a workaround, which only suspects a process to have failed after a given period of time. This means that if later on the process is found to be alive, it can be removed from the suspect list, and its timeout extended.
The full video is available to watch online, and Videla has also written up his talk on an article. Both touch on additional concepts such as quorum and consensus, and also give advice on further reading.