At QCon New York, Bernd Rücker presented "Complex Event Flows in Distributed Systems", and cautioned that although event-driven architectures can be extremely powerful, they can also make it easy to create complex and highly-coupled peer-to-peer event chains. He proposed that lightweight, open source workflow engines provide many advantages for the business, developers and ops.
Rücker, co-founder and developer advocate at Camunda, began the talk by stating that he wanted to discuss three common hypotheses that he encounters when working with customers (particularly those building microservice-based architectures): events decrease coupling; orchestration needs to be avoided; and workflow engines are painful.
A simple e-commerce reference application was introduced, based upon the concept of "one-click" ordering popularised by the Amazon Dash, where a customer simply presses the Dash button and a corresponding item is ordered. The processing steps for such an e-commerce transaction are: paying for the item; fetching the item; and shipping the item. Sketching out some simple bounded contexts for the example application, it was suggested that checkout, payment, inventory and shipment contexts would most likely be involved in the transaction. In a microservices-based architecture the bounded contexts often map to individual autonomous services.
For the first hypothesis -- events decrease coupling -- a discussion was presented on temporal decoupling by using events and read models. Events are facts about what has happened in the past, as opposed to commands, which are imperative and express an intent about what should happen in the future. Rather than one service directly informing another of something happening, for example via request/response, having a service emit an event that any other service can listen for provides looser temporal coupling. This is typically accomplished by moving away from Remote Procedure Call (RPC) technologies, like REST or gRPC, towards a centralised event fabric or distributed log platform such as Apache Kafka.
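As a minimal illustration of this distinction (a sketch, not code from the talk), the checkout service below publishes an "OrderPlaced" event to Kafka rather than calling the payment service directly; the topic, event name and class are illustrative assumptions.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class CheckoutService {

    private final KafkaProducer<String, String> producer;

    public CheckoutService() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        this.producer = new KafkaProducer<>(props);
    }

    public void placeOrder(String orderId) {
        // An event is a fact about the past; the checkout service does not know
        // (or care) which services react to it.
        producer.send(new ProducerRecord<>("order-events", orderId, "OrderPlaced"));
    }
}
```

Any number of downstream services (payment, inventory, shipment) can then subscribe to the "order-events" topic without the checkout service knowing they exist.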
Using events can clearly decrease temporal coupling, but can also lead to the creation of complex peer-to-peer event chains. Quoting Martin Fowler, Rücker stated:
The danger is that it's very easy to make nicely decoupled systems with event notification, without realizing that you're losing sight of that larger-scale flow, and thus set yourself up for trouble in future years.
When using peer-to-peer event chains it can be challenging to modify the sequence of the underlying business processes without changing multiple services. This is because services typically have (implicit) knowledge of the before and after steps within the chain. Rücker likened this process to a three-legged race; it's possible to go fast, but coordination is required to prevent calamity. Accordingly, there are clear benefits to extracting -- and orchestrating -- the end-to-end responsibility for the corresponding business functionality (or application feature). Using commands, alongside events, can help to avoid complex peer-to-peer event chains.
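As a hedged sketch of what this might look like (the topics, message names and handler methods are assumptions for illustration, not code from the talk), an order service can react to events but issue explicit commands, keeping knowledge of the overall sequence in one place:

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class OrderService {

    private final KafkaProducer<String, String> producer; // configured as in the earlier sketch

    public OrderService(KafkaProducer<String, String> producer) {
        this.producer = producer;
    }

    // Invoked when an "OrderPlaced" event is consumed from the event stream.
    public void onOrderPlaced(String orderId) {
        // A command expresses intent and is addressed to the payment service.
        producer.send(new ProducerRecord<>("payment-commands", orderId, "RetrievePayment"));
    }

    // Invoked when the payment service reports back with a "PaymentReceived" event.
    public void onPaymentReceived(String orderId) {
        // The next step of the flow is decided here, rather than being hard-wired
        // into the payment service itself.
        producer.send(new ProducerRecord<>("inventory-commands", orderId, "FetchItem"));
    }
}
```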
For the second hypothesis -- orchestration needs to be avoided -- the discussion began with the benefits of smart endpoints and dumb pipes, and shared some of the lessons the industry has learned from using heavyweight (and proprietary) middleware like Enterprise Service Buses (ESBs). Quoting Sam Newman, Rücker cautioned the audience that it is easy for the clients of dumb endpoints to become god services: "a few smart god services tell anemic CRUD services what to do". He also mused that a god service is only created by bad API design, and that a big challenge is presented by handling the state associated with a (potentially) long-running business flow. The conclusion to this section of the talk was that there can be many benefits realised by not avoiding orchestration.
Discussion of the third hypothesis -- workflow engines are painful -- began by exploring the history of such tooling. Many developers have had bad experiences with "no code" Business Process Management (BPM) frameworks, as the first generation of such tooling tended to be overly complicated, proprietary, centralised and developer averse. The latest generation of workflow engines, such as AWS Step Functions, Uber Cadence and Netflix Conductor, has proven to Rücker that Silicon Valley has recognised the potential benefits of workflow engines when this technology is implemented correctly. There are also lightweight open source options such as Camunda, Zeebe (also by Camunda), Activiti, and jBPM.
Expanding on his definition of "lightweight", Rücker presented several code examples using Java code written for the Camunda framework. A simple DSL was used to construct a workflow, which can also be visualised graphically using Business Process Model and Notation (BPMN), an ISO standard. The Camunda framework also allows the specification of a data store to manage state for long-running business flows. This means that a developer now has a state machine in which it is easy to manage concepts such as time (for example, checking for an updated payment card in seven days' time), and to construct "composite", long-running executions that consist of many steps in the workflow. Referencing work that both Martin Schimak and Rücker have contributed to (also covered recently on InfoQ), he suggested that interested members of the audience read more about this within the original Domain-Driven Design (DDD) book by Eric Evans (see also the InfoQ "DDD Quickly" book), and explore "Event Storming", as described by Alberto Brandolini.
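The sketch below is in the style of these examples, using Camunda's BPMN fluent model builder to define, deploy and start an order workflow in plain Java; the process key, task names and delegate classes are illustrative assumptions rather than code taken from the talk.

```java
import org.camunda.bpm.engine.ProcessEngine;
import org.camunda.bpm.engine.ProcessEngines;
import org.camunda.bpm.model.bpmn.Bpmn;
import org.camunda.bpm.model.bpmn.BpmnModelInstance;

public class OrderWorkflow {

    public static void main(String[] args) {
        // Build the workflow in plain Java; the same model can be rendered graphically as BPMN.
        BpmnModelInstance order = Bpmn.createExecutableProcess("order")
            .startEvent()
            .serviceTask().name("Retrieve payment").camundaClass("io.example.RetrievePaymentDelegate")
            .serviceTask().name("Fetch item").camundaClass("io.example.FetchItemDelegate")
            .serviceTask().name("Ship item").camundaClass("io.example.ShipItemDelegate")
            .endEvent()
            .done();

        // Deploy the model and start a long-running instance; the engine persists the state
        // of each instance, so the flow can wait (for example, days for an updated payment
        // card) and then resume.
        ProcessEngine engine = ProcessEngines.getDefaultProcessEngine();
        engine.getRepositoryService().createDeployment()
            .addModelInstance("order.bpmn", order)
            .deploy();
        engine.getRuntimeService().startProcessInstanceByKey("order");
    }
}
```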
Many developers are now (sometimes implicitly) working on distributed systems, and this introduces complexity that has to be tackled; certain failure scenarios within a distributed system are impossible to differentiate, regardless of the communication style used. It can be tempting to implement distributed transactions or patterns like two-phase commit, but Pat Helland's seminal paper "Life beyond Distributed Transactions: an Apostate's Opinion" warns of the challenges associated with this. The Saga pattern, introduced as "distributed transactions using compensation", was discussed, as were the associated tradeoffs of eventual consistency (and the lack of isolation found in ACID transactions).
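As a library-free illustration of the compensation idea (a sketch of the pattern, not Rücker's implementation; the service calls and method names are assumptions), the example below registers a compensating action after each successful step and runs the registered compensations in reverse order when a later step fails:

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class OrderSaga {

    interface Compensation { void run(); }

    private final Deque<Compensation> compensations = new ArrayDeque<>();

    public void placeOrder(String orderId) {
        try {
            chargeCard(orderId);
            compensations.push(() -> refundCard(orderId));

            reserveItem(orderId);
            compensations.push(() -> releaseItem(orderId));

            shipItem(orderId); // last step, no compensation registered
        } catch (Exception e) {
            // No ACID rollback and no isolation: other services may briefly observe
            // intermediate state; the system converges by running the compensations.
            while (!compensations.isEmpty()) {
                compensations.pop().run();
            }
        }
    }

    private void chargeCard(String orderId)  { /* call the payment service */ }
    private void refundCard(String orderId)  { /* compensating action */ }
    private void reserveItem(String orderId) { /* call the inventory service */ }
    private void releaseItem(String orderId) { /* compensating action */ }
    private void shipItem(String orderId)    { /* call the shipment service */ }
}
```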
The penultimate section of the talk examined the use cases for workflow automation. Rücker stated that as much as he agrees with the DevOps philosophy, he prefers the term "BizDevOps", as this better highlights the importance of collaboration across all teams within an organisation. Lightweight workflow engines can help with all aspects of this concept: business processes can be explicitly discussed and understood; developers can leverage the state machine and workflow engine; and the processes can be operated with visibility and context.
Rücker stressed that workflows live inside service boundaries, and must not stretch across microservices (or bounded contexts). Having a single workflow solution orchestrating all of the services within a system can lead to high coupling, and was one of the fundamental issues with the first generation of BPM software. He also cautioned against "not invented here" syndrome, and stated that developers shouldn't attempt to create their own workflow engines when there are many good open source solutions.
The final section of the talk reviewed the opening hypotheses. In regards to "events decrease coupling", the answer presented was "sometimes"; there is no denying the potential for loose coupling when using events, but it can be all too easy to create complex peer-to-peer event chains. For the second hypothesis, "orchestration needs to be avoided", the answer was also "sometimes"; instead of using complicated ESBs, the smart endpoints and dumb pipes model should be chosen, but it is worth remembering that important (cross-cutting) capabilities need a home. For the final hypothesis, "workflow engines are painful", Rücker suggested that "some of them are", but modern, lightweight, open source engines are easy to use and can be run decentralised; they solve hard developer problems, and should not be custom implemented.
The slides for Bernd Rücker's QCon New York talk, "Complex Event Flows in Distributed Systems" (PDF), can be found on the conference website, and the video of this talk (and all QCon NY talks) will be released via InfoQ over the coming months.