Failure will happen, so you must accept that fact and be resilient. That doesn't just mean improving software systems to be more fault-tolerant. It requires redesigning business processes to be more reactive, so the software can behave accordingly. Bernd Ruecker made these points during his presentation at QCon London, Moving Beyond Request-Reply: How Smart APIs are Different. The full video and transcript are now available on InfoQ.
Ruecker gave examples of some common ways to handle failures, such as showing an error message to a user, and explained why they create unsatisfactory customer experiences. When he was unable to print an airline boarding pass, the error message told him to try again later. He then had to act as the stateful retry mechanism himself, by adding a reminder to his calendar. A much better solution would keep the stateful retry within the system, notify the user accordingly, and automatically send the boarding pass once it had been created.
This type of adjustment requires rethinking both the business processes for responding to a failure and the software that implements the new, more complicated processes. In most cases, this involves some form of durable state machine, such as a workflow engine, which can be long-running and can implement retries. He also said to make processes idempotent whenever possible, because that allows you to retry safely without unwanted side effects.
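As a rough illustration of a stateful retry kept inside the system, the sketch below tracks each boarding-pass request as a small state machine and retries it on a schedule. It is not code from the talk: `RetryingIssuer`, `Job`, `TransientError`, and the `issue` and `notify` callbacks are hypothetical names, and the in-memory dictionary stands in for the durable storage that a workflow engine would provide.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


class TransientError(Exception):
    """Hypothetical error type for failures that are worth retrying."""


@dataclass
class Job:
    booking_id: str
    attempts: int = 0
    status: str = "PENDING"  # PENDING -> DONE or FAILED
    result: Optional[str] = None


class RetryingIssuer:
    """Assumed design: one job per booking, retried until done or exhausted."""

    def __init__(self, issue: Callable[[str], str],
                 notify: Callable[[str, str], None],
                 max_attempts: int = 3) -> None:
        self.issue = issue
        self.notify = notify
        self.max_attempts = max_attempts
        self.jobs: Dict[str, Job] = {}  # stand-in for durable storage

    def request(self, booking_id: str) -> Job:
        # Idempotent: repeating a request for the same booking returns the
        # existing job instead of starting a second one.
        job = self.jobs.get(booking_id)
        if job is None:
            job = Job(booking_id)
            self.jobs[booking_id] = job
            self.notify(booking_id, "We are preparing your boarding pass.")
        return job

    def tick(self) -> None:
        """Run one retry pass; a scheduler would call this periodically."""
        for job in self.jobs.values():
            if job.status != "PENDING":
                continue
            job.attempts += 1
            try:
                job.result = self.issue(job.booking_id)
            except TransientError:
                if job.attempts >= self.max_attempts:
                    job.status = "FAILED"
                    self.notify(job.booking_id,
                                "We could not issue your boarding pass.")
                continue
            job.status = "DONE"
            self.notify(job.booking_id, "Boarding pass ready: " + job.result)
```

The user is told once that the request was accepted and once more when it finishes, rather than being asked to try again later. Because `request` is keyed by booking ID, a client that resends the same request does not trigger duplicate work.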
All distributed systems introduce complexity that must be managed, which leads to what Ruecker called "smart APIs." Being able to implement long-running services is essential for smart APIs. Because typical request/reply APIs are not stateful, Ruecker (citing Sam Newman) proposed supplementing them with "a few smart, god services that tell anemic services what to do."
Smart endpoints are long-running, which makes the ability to implement long-running services essential for business reasons, not just technical ones. That in turn requires asynchronous communication, which introduces more complexity. Ruecker pointed out some of the major weaknesses of async communication: latency creep, availability erosion, difficulty of implementation, and the continued need to think about UX.
One typical pattern is to simulate synchronicity by waiting and using callbacks or polling, but this is not a silver bullet. Because there is no simple technical solution to handling failures, companies must first redesign their business processes to be more reactive; the software design can then incorporate those reactive processes.
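The polling variant of this pattern can be sketched as follows. This is a minimal example under stated assumptions, not something shown in the talk: `wait_for_result` and its parameters are hypothetical, and `fetch` is assumed to return `None` until the long-running work has produced a result.

```python
import time
from typing import Callable, Optional, Tuple


def wait_for_result(
    fetch: Callable[[], Optional[str]],
    timeout_s: float = 2.0,
    interval_s: float = 0.1,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[str, Optional[str]]:
    """Poll for a result, giving up once the deadline has passed."""
    deadline = clock() + timeout_s
    while True:
        result = fetch()
        if result is not None:
            # The work finished in time, so the caller can answer synchronously.
            return ("completed", result)
        if clock() >= deadline:
            # Still running: report acceptance and deliver the result later.
            return ("accepted", None)
        sleep(interval_s)
```

The `"accepted"` branch is why the pattern is not a silver bullet: the caller still needs a business-level answer for the case where the result does not arrive in time.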
Ruecker wrapped up his presentation by describing how event-driven architectures (EDAs) are often used in reactive systems, but said they come with risks that need to be understood. While an EDA is a good pattern for a notification service, it causes problems when used for peer-to-peer event chains. The actual workflow can't be seen, and changing the workflow means changing multiple services, thereby defeating the purpose of having independent microservices. The right solution requires a balance between orchestration and choreography. He recommended reading Martin Fowler's writing for more advice on event-driven systems.
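The visibility problem with peer-to-peer event chains can be illustrated with a small contrast. The sketch below is an assumption-laden toy, not an example from the presentation: the `EventBus` class, the event names, and the three-step order flow are all hypothetical.

```python
from typing import Callable, Dict, List

Handler = Callable[[dict], None]


class EventBus:
    """Minimal in-process publish/subscribe bus (illustrative only)."""

    def __init__(self) -> None:
        self.handlers: Dict[str, List[Handler]] = {}

    def subscribe(self, event: str, handler: Handler) -> None:
        self.handlers.setdefault(event, []).append(handler)

    def publish(self, event: str, payload: dict) -> None:
        for handler in self.handlers.get(event, []):
            handler(payload)


def build_choreography(bus: EventBus, log: List[str]) -> None:
    # Each service knows only the event it reacts to and the one it emits,
    # so the end-to-end order of steps is not written down anywhere.
    def payment(order: dict) -> None:
        log.append("payment")
        bus.publish("PaymentReceived", order)

    def inventory(order: dict) -> None:
        log.append("inventory")
        bus.publish("GoodsFetched", order)

    def shipping(order: dict) -> None:
        log.append("shipping")

    bus.subscribe("OrderPlaced", payment)
    bus.subscribe("PaymentReceived", inventory)
    bus.subscribe("GoodsFetched", shipping)


def orchestrate_order(order: dict, log: List[str]) -> None:
    # The whole sequence is readable, and changeable, in one place.
    for step in ("payment", "inventory", "shipping"):
        log.append(step)
```

Both versions run the same three steps in the same order. Reordering them in the choreographed version means changing the subscriptions of more than one service, while the orchestrated version changes a single tuple.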