Are You Done Yet? Mastering Long-Running Processes in Modern Architectures

Key Takeaways

  • Long-running processes involving waiting for human actions, external responses, or intentional delays are crucial for handling various real-world scenarios within software applications.

  • Waiting introduces challenges like managing persistent states, understanding progress, handling escalations, and versioning long-running processes. Distributed systems add further complexity.

  • Workflow engines and process orchestration platforms provide effective solutions for managing long-running processes.

  • Successfully adopting process orchestration often involves a dedicated team providing technology, consulting, and support.

  • It’s crucial to embrace asynchronous communication and design patterns to build robust and scalable systems.

In the evolving software development landscape, a new generation of tools is changing how we approach long-running processes. Modern microservice orchestrators and workflow engines can handle high-throughput, low-latency scenarios at scale. Such capabilities empower software engineers to make more informed decisions about domain boundaries and overall architectural design, ultimately leading to more efficient and scalable applications.

Challenges emerge as systems become increasingly distributed, demanding solutions for remote communication and potential peer unavailability. My presentation at QCon London 2024 delved into how process orchestration can navigate these complexities. This article explores the concepts discussed in the talk.

Using Pizza to Understand Long-Running Processes

Let’s take the example of ordering pizza via phone call. This is a synchronous blocking communication: the caller is blocked until the recipient answers, allowing direct feedback but requiring both sides to be simultaneously available. An alternative is email, which is asynchronous and non-blocking: the sender is not blocked, and the "message" is queued, allowing flexibility but lacking immediate feedback. Of course, you could provide feedback through another email.

What’s more important is that the communication feedback loop ("we got your order") is not the desired result of the interaction. You are hungry, so the actual result is the pizza on your doorstep. But this will take time, as the pizza needs to be in the oven for a bit. Here, you can see that ordering pizza is always a long-running task with inherent delays, regardless of the communication method used.

It is important to note that long-running processes refer to the act of waiting, not algorithms running for extended periods. This waiting can be due to human actions (e.g., approvals, decisions), external responses (e.g., customer feedback), or intentionally allowing time to pass (our pizza in the oven). These processes can take hours, days, weeks, or even longer (hopefully not for your pizza).

In contrast, ordering coffee at a small bakery is a synchronous blocking interaction with regard to the result. The customer and the person behind the counter can’t do anything else while the coffee is being made. This might work OK for a single order, but requesting multiple coffees leads to a poor experience due to the extended wait time, and it blocks the bakery entirely for other, possibly smaller, orders. While coffee-making is relatively quick compared to some tasks, the synchronous nature of this interaction limits its scalability and degrades the user experience.

Gregor Hohpe’s article "Starbucks Does Not Use Two-Phase Commit" describes how Starbucks scales coffee making by separating the ordering process (synchronous and blocking, ends with the feedback loop) from the coffee preparation (asynchronous, ending with the final result). This allows for the independent scaling of baristas and the potential for replacing the ordering process with app-based systems, thereby improving efficiency and customer experience.

This pattern, a synchronous first step followed by an asynchronous, long-running process that involves waiting, is common in many scenarios.

The Challenges of Waiting

Waiting is challenging because we need to remember we’re waiting, which requires a persistent state. This leads to subsequent requirements like understanding what we’re waiting for, escalation for prolonged waits, versioning of long-running processes, and handling them at scale. The challenge is to address these technical issues without introducing accidental complexity.

Over the last few years, I have seen far too many homegrown workflow engines. These engines often begin simply, with a status flag in a database and a waiting mechanism. However, they quickly grow into complex systems with schedulers, escalations, and other features, resulting in an inefficient and hard-to-maintain solution. So, use existing tools instead.
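To make the anti-pattern concrete, here is a minimal sketch of how such homegrown engines typically start: a status flag per order, persisted somewhere, pushed forward by a polling scheduler. All names are illustrative; a real system would add the schedulers, escalations, and versioning that make this approach balloon.

```java
import java.util.HashMap;
import java.util.Map;

// Minimal sketch of a homegrown "workflow": a status flag per order.
// The map stands in for a database table with a status column.
public class HomegrownOnboarding {
    public enum Status { RECEIVED, SCORING, WAITING_APPROVAL, COMPLETED }

    private final Map<String, Status> statusByOrder = new HashMap<>();

    public void start(String orderId) {
        statusByOrder.put(orderId, Status.RECEIVED);
    }

    // A scheduler would poll this periodically to push waiting orders forward.
    public Status advance(String orderId) {
        Status next = switch (statusByOrder.get(orderId)) {
            case RECEIVED -> Status.SCORING;
            case SCORING -> Status.WAITING_APPROVAL;   // may wait here for days
            case WAITING_APPROVAL -> Status.COMPLETED; // needs a human decision
            case COMPLETED -> Status.COMPLETED;
        };
        statusByOrder.put(orderId, next);
        return next;
    }
}
```

The status flag works for the happy path, but as soon as timeouts, escalations, or process changes arrive, this switch statement starts absorbing all of that accidental complexity.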

But I need to be transparent here. Throughout my career, I’ve actively contributed to developing several open-source workflow engine projects and co-founded Camunda, a company focused on process orchestration. So, of course, I have a bias toward workflow engines.

I believe a workflow engine, or, as it’s increasingly referred to, an orchestration or process engine, can effectively address the challenges of long-running processes. By defining workflow models (the blueprint) and being able to execute instances that can be long-running, these engines inherently address the challenges of waiting.
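The core idea of model versus instance can be sketched in a few lines. This is an illustrative toy, not a real engine API: one process model (the blueprint) spawns many long-running instances, each remembering where it currently waits. A real engine would persist this state durably so it survives restarts.

```java
import java.util.List;

// Illustrative sketch: a process model (blueprint) and its instances.
public class Blueprint {
    private final List<String> steps;

    public Blueprint(List<String> steps) { this.steps = steps; }

    public Instance newInstance() { return new Instance(); }

    // Each instance tracks its own position independently; between calls to
    // complete() it is simply waiting, possibly for days or weeks.
    public class Instance {
        private int position = 0;
        public String currentStep() { return steps.get(position); }
        public void complete() { if (position < steps.size() - 1) position++; }
    }
}
```

Because each instance only stores its position against a shared blueprint, thousands of instances can wait at different steps of the same model at the same time.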

Let’s look at an example—a simple onboarding process. Onboarding is a widespread experience across industries; for example, you might open a new bank account, get a new insurance policy, or get a new mobile phone contract. You can visualize the process using the Business Process Model and Notation (BPMN), an ISO standard for defining processes graphically. This process involves various steps, such as customer scoring and order approval, which can be automated or handled manually.

In this example on GitHub, a Java Spring Boot application connects to a workflow engine, deploys the process, and provides a web UI for users to interact with. The UI initiates a process instance within the workflow engine by triggering a REST call in the application. This instance can be monitored and escalated, and tasks can be assigned to humans through a UI or other integration methods.

The process continues, using custom code or pre-built connectors to perform actions like creating a customer in a CRM system or sending a welcome email.

Modern workflow engines can also handle high-volume processes at scale, with thousands of instances per second in geographically distributed data centers. So, this technology is not limited to low-volume scenarios but can be used for core business processes in large-scale operations.

When Do Services Need to Wait? (Technical Reasons)

But let’s come back to distributed systems. Besides business reasons for waiting, there are also technical ones, such as asynchronous communication delays, failures in message delivery, and the unavailability of peer services in distributed systems. If not addressed, these issues can lead to cascading failures.

As an example, when attempting to check in for a flight a while back, I encountered an error preventing me from receiving my boarding pass. Repeated attempts failed, leading me to create a calendar reminder to try again later. This illustrates the concept of a stateful retry, where retries are delayed due to a temporary issue (like the airline’s barcode generation system failing). The problem highlights the challenges of distributed systems, where component failures are inevitable. A resilient design would isolate such failures, preventing them from impacting the entire system and inconveniencing users.
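My calendar reminder was a manual stateful retry; the same idea can be sketched in code. This is an illustrative sketch, not a library API: instead of retrying in a blocking loop, the service persists when the next attempt is due and returns immediately, so one failing component does not block callers.

```java
import java.time.Duration;
import java.time.Instant;

// Sketch of a stateful retry: the retry schedule is data, not a blocked thread.
public class StatefulRetry {
    private int attempts = 0;
    private Instant nextAttemptAt = Instant.EPOCH;

    // Is it time for the next attempt?
    public boolean isDue(Instant now) { return !now.isBefore(nextAttemptAt); }

    // Record a failed attempt and schedule the next one with exponential
    // backoff (2s, 4s, 8s, ...). This state would live in durable storage.
    public Duration recordFailure(Instant now) {
        attempts++;
        Duration backoff = Duration.ofSeconds((long) Math.pow(2, attempts));
        nextAttemptAt = now.plus(backoff);
        return backoff;
    }

    public int attempts() { return attempts; }
}
```

A workflow engine gives you exactly this behavior out of the box, including the durable state, so the user gets an acknowledgment immediately while retries happen in the background.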

An improved system design would acknowledge the check-in while addressing the boarding pass issue separately, either by promising timely delivery or suggesting alternative options like using the app. This approach provides a better user experience and clearly defines responsibilities within the system. However, long-running capabilities within the check-in service are required to manage the delayed delivery of boarding passes. Many teams avoid this due to the need for state management in an otherwise stateless system. While customers are accustomed to immediate responses, a delayed result with clear communication is often a more desirable experience compared to a synchronous error message requiring user intervention.

Extending the flight booking example to include payment collection introduces complexities due to the reliance on external services like credit card processors. These services might be unavailable, causing delays or failures. And exceptions in remote calls can be ambiguous, making it difficult to determine whether the payment was successful. In such cases, retrying may lead to duplicate charges. A workflow engine can help manage these scenarios by checking payment status and refunding if necessary. However, it is still crucial to carefully consider corner cases and choose the most appropriate solution for your specific use case. A powerful workflow language like BPMN and a graphical visualization for discussing scenarios with various stakeholders are essential to building good solutions.
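The "check status before retrying" step can be sketched as follows. The interfaces are hypothetical, not a real payment API: after an ambiguous exception, the process first queries the payment status and only retries when it knows the charge did not happen, avoiding duplicate charges.

```java
// Sketch (hypothetical types): disambiguate an ambiguous payment exception
// before retrying, instead of blindly charging again.
public class PaymentRecovery {
    public enum Status { CHARGED, NOT_CHARGED, UNKNOWN }

    public interface PaymentGateway {
        Status statusOf(String paymentId);
        void charge(String paymentId);
    }

    private final PaymentGateway gateway;

    public PaymentRecovery(PaymentGateway gateway) { this.gateway = gateway; }

    // Called by the process after a remote call failed ambiguously.
    public String recover(String paymentId) {
        return switch (gateway.statusOf(paymentId)) {
            case CHARGED -> "confirmed";  // first attempt actually went through
            case NOT_CHARGED -> {         // safe to retry the charge
                gateway.charge(paymentId);
                yield "retried";
            }
            case UNKNOWN -> "escalate";   // leave it to the process or a human
        };
    }
}
```

Modeled in BPMN, each of these three outcomes becomes a visible branch that business stakeholders can review and challenge.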

As soon as you start handling the complexities of distributed systems and long-running processes in payment processing, it is essential to embrace asynchronous communication in your API design. Although payments are usually quick and straightforward, situations like declined credit cards or unavailable services demand a different approach. We can build more adaptable and robust payment systems by designing APIs that handle both immediate and delayed responses and by using signals like an HTTP 202 status code to indicate that processing will continue in the background.
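Such an API shape can be sketched with hypothetical types (the processor interface and URL scheme are assumptions for illustration): a quick success returns 200 with the result, while anything that needs a long-running process returns 202 plus a status URL the client can poll.

```java
// Sketch of an API embracing asynchronous responses: 200 for immediate
// results, 202 plus a status URL when processing continues in the background.
public class PaymentApi {
    public record Response(int status, String body) {}

    public interface PaymentProcessor {
        String charge(String paymentId); // may throw on unavailability
    }

    private final PaymentProcessor processor;

    public PaymentApi(PaymentProcessor processor) { this.processor = processor; }

    public Response pay(String paymentId) {
        try {
            // Happy path: the charge succeeds synchronously.
            return new Response(200, processor.charge(paymentId));
        } catch (Exception e) {
            // Processor unavailable or outcome ambiguous: hand over to a
            // long-running process and tell the caller where to check progress.
            return new Response(202, "/payments/" + paymentId + "/status");
        }
    }
}
```

The crucial point is that the caller must be written to accept either answer; that single design decision is what makes the rest of the long-running machinery possible.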

This enables improved error handling and recovery, as well as greater flexibility in implementing complex requirements like deducting customer credit and charging a credit card asynchronously.

Importance of Long-Running Capabilities for Service Boundaries

Designing robust services requires long-running capabilities within your architecture. For example, credit card rejections might occur when a booking service instructs a payment service to collect payment. Addressing this scenario, such as allowing customers to update their credit card details, often necessitates long-running processes to manage the interaction and retry the payment. While some services prefer to remain stateless and avoid such complexities, embracing long-running capabilities provides a better solution, enabling more comprehensive error handling and recovery mechanisms.

This example also highlights the issue of leaking domain concepts. The booking service should not need to know about credit cards or specific payment methods. By isolating these concerns within the payment service, the design becomes cleaner and more adaptable to future changes. However, this requires the payment service to handle long-running processes related to payment retries and customer interactions. Providing tools and frameworks that simplify this process for development teams is essential.

Some people assume that workflow orchestration leads to monolithic systems. In fact, it is the other way around: by enabling long-running capabilities within individual microservices, you can better distribute responsibilities and avoid bloated services that centralize complex logic. This approach promotes a more modular and maintainable architecture, avoiding monolithic tendencies.

Introducing Process Orchestration Successfully

Long-running capabilities are essential for good service design and asynchronous, non-blocking operations. To make this available to your engineers across the enterprise, you need process orchestration provided centrally, ideally as a (self-)service. Many successful organizations using process orchestration have a dedicated team, often called a Center of Excellence (CoE), focused on enabling others to build solutions rather than creating them directly. This approach avoids the pitfalls of previous centralized BPM models and empowers teams with the technology and support they need to implement effective process orchestration.

Many organizations are hesitant to embrace centralized teams like Centers of Excellence due to a desire for team autonomy and to avoid excessive constraints. However, the Team Topologies model provides a compelling counterargument. This model distinguishes team types for optimal efficiency: stream-aligned teams focus on delivering business value, enabling teams provide consulting, and platform teams provide the necessary technology. A CoE focused on process orchestration and automation aligns well with this model by providing the technology and enablement, freeing stream-aligned teams to focus on delivering value. This approach avoids the pitfalls of having each team independently figure out its tech stack, thus leading to faster value delivery.

A good industry example is Spotify’s Golden Path approach, which offers predefined solutions for different types of projects, such as customer-facing web apps or long-running workflows. These Golden Paths are designed to be easy to use and appealing to development teams, but they are not mandatory. This encourages standardization and efficiency while maintaining autonomy. The goal is to move away from "rumor-driven development" and consolidate effective technologies across the organization. Spotify found that too much decentralization slowed them down, and Golden Paths helped strike a balance between autonomy and efficiency.

To quote another example, Twilio uses a concept called "paved path" to simplify and standardize the development process. This involves creating pre-made services that teams can easily adopt and implement, reducing the complexity of setting up infrastructure and allowing them to focus on business logic.

Graphical models like BPMN, an ISO standard for defining processes, offer a powerful way to visualize and implement complex workflows. They serve as living documentation, representing running code rather than just requirements. These models can be used for various purposes: test cases, operations monitoring, and stakeholder communication. Their visual nature is particularly valuable when discussing long-running behavior with business stakeholders, helping them understand the rationale behind asynchronous operations and potential changes to the customer journey. Embracing graphical models allows organizations to leverage their modern architecture effectively and improve collaboration between technical and non-technical teams.

Conclusion

In conclusion, long-running capabilities are essential for a multitude of reasons. Process orchestration platforms and workflow engines offer invaluable technology for designing better service boundaries, faster implementation with less accidental complexity, and easier adoption of asynchronous communication. This translates to a superior customer experience, increased operational efficiency, better automation, reduced risk, and improved compliance. A central enablement team is highly recommended to successfully adopt this technology across the organization.
