





Are You Done Yet? Mastering Long-running Processes in Modern Architectures


Summary

Bernd Ruecker discusses process orchestration and how tools like microservice orchestrators or workflow engines are built to implement long-running capabilities.

Bio

Bernd Ruecker has been innovating process automation deployed in highly scalable and agile environments of T-Mobile, Lufthansa, ING, and Atlassian. He contributed to various open-source workflow engines for more than 15 years and is the Co-Founder and Chief Technologist of Camunda. He is the author of "Practical Process Automation" and co-author of "Real-Life BPMN".

About the conference

Software is changing the world. QCon London empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.

Transcript

Ruecker: I talk about long running, not so much about exercise, actually. We want to start talking about food first, probably more enjoyable. If you want to order pizza, there are a couple of ways of ordering pizza. You probably have ordered a pizza in the past. If you live in a very small city like I do, if you order pizza, what you do is actually you call the pizza place. That's a phone call. If I do a phone call, that's synchronous blocking communication. Because I'm blocked, I pick up the phone, I have to wait for the other person to answer it. I'm blocked until I got my message and whatever, whatnot.

If then the person answers me, I get a direct feedback loop. Normally, that person either tells me they make my pizza or they don't. They can reject it. I get a direct feedback. I'm also temporarily coupled to the availability of the other side. If the person is currently not available to pick up the phone, if they're already talking on another line, they might not be able to take my call. Then it's on me to fix that. I have to probably call them again in 5 minutes, or really stay on the line to do that. Synchronous blocking communication. What would be an alternative? I know you could probably use the app. Again, I can't do that where I live. You could send an email. An email puts basically a queue in between. It's asynchronous non-blocking communication, and there's no temporal coupling.

I can send the email even if the peer is not available, even if they are taking other orders. How does it make you feel if you send an email to your pizza place? Exactly that, because there is no feedback loop at all. Do they even read my email? Do I have to pick up the phone and call them? It could be. It's not a technical restriction that there is no feedback loop. They could simply answer the email saying, we got your order, and you get your pizza within, whatever, 30 minutes. You can do the feedback loop again, asynchronously. That's not really the focus of today; I have another talk about why these are not the same thing. You can have those interaction patterns decoupled, basically, from the technology you're using for them: synchronous blocking communication, asynchronous non-blocking.

The most important thing is on the next slide. Even if I do that, independent of email or phone, it's important to distinguish that the feedback loop is not the result. I'm still hungry. They told me they'll send a pizza; I'm probably even more hungry than before. The result is the pizza itself. The task of pizza making is long running, so it probably goes into a queue for being baked. It goes into the oven. They hopefully take the right amount of time to do that.

They can't do that in a rush. Then the pizza is ready, and it needs to be delivered to me. It's always long running, it takes time. It's inherently there. That's actually a pattern we see in a lot of interactions, not only for pizza, but for a lot of other things. We have a first step, that synchronous blocking, but we have an asynchronous result later on.

Could you do synchronous blocking behavior for the result, in that case? Probably not such a good idea. Take the example not of pizza but of coffee. If you go to a small bakery and order coffee, what happens is that the person behind the counter takes your order, takes your money, basically turns around, goes to the coffee machine, presses a button, waits for the coffee to come out of it, goes back to you, and gives you the cup. It's synchronous blocking. They can't do anything else. I can't do anything else.

We're waiting for the coffee to get ready. If you have a queue behind you, and if you're in a good mood to make friends, you probably order 10 coffees. It takes a while. It's not a good idea. It's not a good user experience and it doesn't scale very well, even though the coffee making is relatively quick compared to the pizza making and other things. It doesn't have to be that way. There's a great article from Gregor Hohpe. He called it, "Starbucks Doesn't Use Two-Phase Commit." He talked about scalable coffee making at Starbucks, where you also separate the two things. The first step is the synchronous blocking thing: I go to the counter, order, pay. Then they basically ask for my name as a correlation identifier. Then they put me in a queue, saying to the baristas: make that coffee for Bernd. These baristas are scaled independently.

There might be more than one barista, for example, making the coffee, and then I get the coffee later on. That scales much better. Another thing you can recognize here: it also makes it easier to change the experience of the whole process. A lot of the fast-food chains have started to replace, not fully, but partially, the counters or the humans behind the counter with simply ordering via the app. That's very easy for the first step, but not so easy for the coffee making. There's robotics for that too; there are videos on the internet of how you can do it, but it's not done at a big scale. Normally, the baristas are still there for the coffee making itself. The point is that we can distribute those two steps.

With that in mind, coming back to long running: when I say long running, I don't refer to some AI algorithm that runs for ages until I get a result. No, I'm basically simply referring to waiting. Long running, for me, is waiting, because I have to wait for certain things. That could be human work, the human in the loop, like we just heard, because somebody has to approve something, somebody has to decide something, or somebody has to do something; those are typical cases. Or waiting for a response: I sent some inquiry to the customer, and they have to give me certain data.

They have to make their decision. They have to sign the document, whatever it is, so I have to wait for that. Both of those things are not done within seconds; they can take hours, days, or even weeks, sometimes even longer. Or I simply want to let some time pass. The pizza baking is one example, but I've had a lot of other examples in the past. One of my favorites was a startup. They offered a service which was completely automated, but they wanted to give the customer the impression that a human does it. They waited for a random time, between 10 and 50 minutes, for example, before processing a response. There are also more serious examples.

Why Is Waiting a Pain?

Why is waiting a pain? It basically boils down to the fact that we have to remember that we are waiting. It's important not to forget that we're waiting. That involves persistent state, because if I have to wait not just for seconds, but for minutes, hours, days, weeks, or a month, I have to persist it somewhere to still remember it when somebody comes back. Persistent state. Is that a problem? We have databases, don't we? We do. But there are a lot of subsequent requirements if you look at it.

For example, you have to have an understanding of what you're waiting for. You probably have to escalate if you're waiting for too long. You have versioning problems: if I have a process that runs for a month, and I start it a couple of times every day, I always have processes in flux. If I want to change the process, I have to think about the already running ones, and probably do something different for them than for newer ones, for example. I have to run that at scale. I want to see where I'm at, and a lot of those things.

The big question is, how do I do that? How do I solve those technical challenges without adding accidental complexity? That's what I'm seeing, actually, quite often. I wrote a blog post, I think, more than 10 years ago, where I said, I don't want to see any more homegrown workflow engines. Because people stumble into that, like we simply have to write a status flag in the database. Then we wait, that's fine. Then they start, "We have to remember that we have to have a scheduler. We have to have an escalation. People want to see that." They start adding stuff. That's not a good idea to do.

Background

I've been working on workflow engines, process engines, orchestration engines, whatever you want to call them, for almost all my life, at least my professional life. I co-founded Camunda, a process orchestration company, and have written a lot about this in the past. I've also worked on a couple of different open source workflow engines.

Workflow Engine (Customer Onboarding)

One of the components that can solve these long running issues is a workflow engine. We're currently moving more towards calling it an orchestration engine; some call it a process engine. It's all fine. The idea is that you define workflows, of which you can run instances, and then all those requirements are taken care of. I wanted to give you a 2-minute demo, not because I want to show the tool, that's a nice side effect; there are other tools doing the same thing. I want to get everybody on the same page about what a workflow engine is. If you want to play around with it yourself, there's a link.

It's all on GitHub, so you can just run it yourself. What I use as an example is an onboarding process. We see that in basically every company to some extent. If you want to open a new bank account, you go through an onboarding process at the bank. If you want a new mobile phone contract, you go through onboarding. If you want a new insurance contract, onboarding. It's always the same. This is how it could look. What I'm using here is called BPMN; it's an ISO standard for defining those processes.

You do that graphically. In the background, it's simply an XML file describing the process. It's standardized, an ISO standard; it's not a proprietary thing. Then you can do things like: I score the customer, then I approve the order. That's a manual step. I always like adding things live, at the risk of it breaking down. We could say that takes too long, and we want to escalate. Let's just say, escalate. Yes, we keep it like that. We have to say what too long is. That's an ISO 8601 duration, PT10S: 10 seconds should be enough for a person to review it. I just save that.

What I have in the background is a Java application, in this case. It doesn't have to be Java, but I'm a Java person. It's a Java Spring Boot application that connects to the workflow engine and, in this case, also deploys the process. It also provides a small web UI. I can open a new bank account. I don't even have to type in data, because it already knows everything. I submit the application. That triggers a REST call. The REST call goes into the Spring Boot application, which kicks off a process instance within the workflow engine. I'm using our SaaS service, so you have tools like Operate, where I can look into what's going on.
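To make that wiring concrete, here is a minimal sketch (not taken from the demo repository) of how such a Spring Boot endpoint could kick off a process instance with the Camunda 8 Java client. The process id, endpoint path, and variable handling are assumptions made up for illustration.

    import io.camunda.zeebe.client.ZeebeClient;
    import java.util.Map;
    import org.springframework.web.bind.annotation.PostMapping;
    import org.springframework.web.bind.annotation.RequestBody;
    import org.springframework.web.bind.annotation.RestController;

    @RestController
    public class OnboardingController {

        private final ZeebeClient zeebeClient;

        public OnboardingController(ZeebeClient zeebeClient) {
            this.zeebeClient = zeebeClient;
        }

        // One REST call from the web UI kicks off one process instance in the engine.
        @PostMapping("/onboarding")
        public void submitApplication(@RequestBody Map<String, Object> application) {
            zeebeClient.newCreateInstanceCommand()
                    .bpmnProcessId("customer-onboarding") // id of the BPMN model (assumed)
                    .latestVersion()
                    .variables(application)               // application data becomes process variables
                    .send()
                    .join();
        }
    }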

There I can see that I have processes running. You see the versioning: I have a new version, and I have that instance running. If I kick off another one, I get a second one in a second. I'm currently waiting for approval, and I've also already escalated it, at the same time. Then you have a task list, because I'm now waiting for a human, for example. I have UI stuff for that. I could also do that via a chatbot or a Teams integration, for example. Yes to automatic processing, please. Complete the task. Then this moves on. I'm seeing that here as well. I'm seeing that this moves on, and it also sends an email. I have that one.

The process instance finishes, for example. It runs through a couple of steps. Those steps connect to the last two things I want to show. "Create customer in CRM system" is, in this case, tied to a bit of Java code where you can do whatever you want. That's custom glue code; you can simply program it. "Send welcome email", as you already see, is a pre-built connector, for SendGrid in this example, which I can simply configure. That means in the background my email was also sent, which I can hopefully show you here. Proof done: "Hello, QCon" in the email. We're good.
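As an illustration of what such glue code can look like, here is a minimal sketch of a job worker, assuming the Camunda Spring integration's @JobWorker annotation style; the task type and variable name are invented, not the ones from the demo.

    import io.camunda.zeebe.spring.client.annotation.JobWorker;
    import io.camunda.zeebe.spring.client.annotation.Variable;
    import org.springframework.stereotype.Component;

    @Component
    public class CreateCustomerWorker {

        // Called whenever a process instance reaches a service task of this type;
        // the method body is plain Java glue code, free to call whatever system you need.
        @JobWorker(type = "create-customer-in-crm")
        public void createCustomerInCrm(@Variable String customerName) {
            // hypothetical call into the CRM system
            System.out.println("Creating CRM record for " + customerName);
        }
    }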

That's a workflow engine running in the background. We have a workflow model. We have instances running through it. We have code attached, or UIs attached, to either connect to systems or to the human. Technically, I was using Camunda as a SaaS service here, and I had a Spring Boot application. Sometimes I'm asked: workflow, isn't that for those "I do 10 approvals a day" things? No. We have customers running this at a huge scale. There's a link to a blog post where we go into the thousands of process instances per second.

We run that in geographically distributed data centers in the U.S. and UK, for example, and this adds latency, but it doesn't bring throughput down. We are also working to reduce the latency of certain steps. What I'm trying to say is that this is not only for "I run five workflows a day"; you can run it at a huge scale for core things.

When Do Services Need to Wait? (Technical Reasons)

So far, I looked at some business reasons why we want to wait. There are also a lot of technical reasons why you want to wait for things, why things get long running. That could be, first of all, asynchronous communication. If you send a message, you might not know when you get a message back. It might be within seconds in the happy case, or milliseconds. What if not? Then you have to do something. If you have a failure scenario and don't get a message back, you probably want to just stop where you are and then wait for it to happen.

Then you can probably also notify an operator to resolve that. Or the peer service is not available: especially if you go into microservices, or generally distributed systems, the peer might not be available, so you probably have to do something about it. You have to wait for that peer to become available. That's a problem you should solve, because otherwise you get chain reactions, basically.

The example I always like to use is this one. If you fly, you get an email invitation to check in a day before, normally 24 hours before departure. Then you click a link, and you should check in. I did that for a flight to London. I think that was pre-pandemic, 2019, or something like that. I flew to London with Eurowings. I wanted to check in, and what it said to me was, "There was an error while sending you your boarding pass." I couldn't check in. That's it. What would you do? Try it again. Yes, of course. That's what I did. Didn't work. I tried it again 5 minutes later, didn't work.

What was the next thing I did? I made a calendar entry in my Outlook to remind me to try it again in a couple of hours, because there was still time; the flight wasn't until the next day. I just wanted to make sure not to forget to check in. That's what I call a stateful retry. I want to retry, but in a long running form, like 4 hours from now, because it doesn't work right now. That doesn't matter, because I don't need the boarding pass yet.

The situation I envision is that, in the background, they had their web interface, they probably had a check-in microservice, and they probably had some components downstream required for that to work, for example barcode generation, or document output management, or whatever. One of those components failed. The barcode generation, for example, didn't work, so they couldn't check me in. The thing is, the more we distribute our systems into a lot of smaller services, the more we have to accept that certain parts are always broken, or that the network to certain parts is always broken.

That's the whole resiliency thing we're discussing. The only thing we have to make sure of, and this is really important, is that it doesn't bring down our whole system. In other words, just because the 3D barcode generation, which is probably only needed for the PDF boarding pass I print out later, is not working, that shouldn't prevent my check-in. That's a bad design. That's not resilient. Because then you get a chain reaction. The barcode generation not working is probably not a big deal. It becomes a big deal because nobody can check in anymore. They make it my problem.

They transport the error all the way up to me, for me to resolve, because I'm the last one in the chain. Everybody throws the hot potato one step further, and I'm the last part of the chain as the user. That makes me responsible for the Outlook entry. The funny part about that story: the onward flight, same trip, back from London, easyJet, "We are sorry." Same problem, I couldn't check in, but they gave you a work instruction. They're better at that. "We're having some technical difficulties, log on again, retry. If that doesn't work, please try again in 5 minutes." I like that: increase the interval. That makes a lot of sense. You could probably automate that.

The next thing, and I love that: "We do actively monitor our site. We'll be working to resolve the issue. There's no need to call." It's your problem, leave us alone. In this case, it's very obvious because it's facing the user. But it's an attitude I see in a lot of organizations, even internally towards other services: it's their problem, we throw an error, we're good.

The much better situation would be that check-in handles it. They should check me in. They could say: you're checked in, but we can't issue the boarding pass right now, we're sorry, but we'll send it in time. Or: you get it in the app anyway. I don't want to print it out; I don't need a PDF. They could handle it in a much more local scope. That's a better design. It gives you a better designed system. The responsibilities are much more cleanly defined. But the thing is, now you need long running capabilities within the check-in service. If you don't have them, that's why a lot of teams just rethrow the error.

Otherwise we have to keep state, and we want to be stateless. That's the other part I've been discussing with a lot of customers over the last 5 years: "The customer wants a synchronous response. They want to see a response on the website that says: you're checked in, here's your boarding pass, here's the PDF. We need that. People are used to that experience." I wouldn't say so. If my choice as the customer is either to get a synchronous error message and have to retry myself, or to get the result later on, I know what I'd pick. It's still the better customer experience. It needs a bit of rethinking, but I find it important.

Let's extend the example a little and add a bit more flavor to a couple of those things. Let's say you're still doing flight bookings, but maybe you also want to collect payments. That would make sense as a company. The payments might need credit card handling, so they want to take money from the credit card. Let's look at that. The same thing could happen. You want to charge the credit card. The credit card service, at least internally, but maybe also at that level, will use some SaaS service on the internet. You will probably not do credit card handling yourself unless you're very big; normally, you use some Stripe-like mechanism to do that.

You will call an external API, REST, typically, to make the credit card charge. Then you have that availability thing. That service might not be available when you want to charge a credit card. You probably also then have the same thing, you want to charge it and want to probably wait for availability of the credit card service, because you don't want to tell your customers, we can't book your flight because our credit card service is currently not available. You probably want to find other ways. That's not where it stops. It normally then goes beyond that, which is very interesting if you look into all the corner cases.

Let's say you give up after some time, which makes sense. You don't want to keep trying for the next 48 hours to book a flight that leaves tomorrow. You give up at some point in time. You probably say the payment failed, and we probably can't book your flight, or whatever it is that you do. There's one interesting thing about distributed systems: if you do a remote call and you get an exception out of it, you can't differentiate three situations. Maybe the network was broken and you never reached the service provider.

Maybe the network was fine, but the service provider's thread exploded while processing the request. It didn't process it. Did it commit its transaction or not? You have no idea. Or everything worked fine and the response got lost in the network. You can't know what just happened. That makes it hard in this scenario, because even if you get an exception, you might actually have charged the credit card.

It might be a corner case, but it's possible. Depending on what you do, you might not want to ignore it. Maybe you can; if that's a conscious decision, that's fine. Maybe you can't, and then you have to do something about it. You can do that in a workflow way. You could also run monthly reconciliation jobs, probably also a good solution. It always depends. If you want to do it in a workflow way, you might even have to check whether it was charged and, if so, refund it, so it gets more complicated. That's what I'm trying to say.
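One common way to make the "did the charge actually happen?" question answerable at all is to send your own reference (an idempotency key) with the request and query the provider by that reference after a failure. Below is a rough sketch against a hypothetical provider interface, deliberately not any specific SaaS API.

    import java.util.Optional;
    import java.util.UUID;

    public class ChargeWithCheck {

        // Hypothetical provider interface -- a real payment SaaS API will look different.
        interface PaymentProvider {
            void charge(String reference, long amountCents) throws Exception;
            Optional<Long> findChargeByReference(String reference); // empty if it was never processed
        }

        boolean chargeSafely(PaymentProvider provider, long amountCents) throws Exception {
            String reference = UUID.randomUUID().toString(); // our own correlation id / idempotency key
            try {
                provider.charge(reference, amountCents);
                return true; // a response arrived: we know it was charged
            } catch (Exception e) {
                // Exception: network broken, provider crashed mid-request, or response lost.
                // We cannot tell which, so we ask the provider what it recorded for our reference.
                return provider.findChargeByReference(reference).isPresent();
            }
        }
    }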

In order to do these kinds of things, again, embrace asynchronous thinking. Make an API that's ready to possibly not deliver a synchronous result. That says: we try our best, maybe you get something in the good case, but maybe you don't. I like to think in HTTP status codes: 202 means we got your request, that's the feedback loop, we got it, but the result will come later. Now you can make it long running, and that extends your options of what you can do. Speaking of that, one of the core thoughts there is: if you make APIs like that, make them asynchronous, make them able to handle long running.
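A minimal sketch of what such an acknowledge-first API could look like in Spring, purely for illustration; the endpoint path and the way the long-running work would be started are assumptions.

    import java.net.URI;
    import java.util.UUID;
    import org.springframework.http.ResponseEntity;
    import org.springframework.web.bind.annotation.PostMapping;
    import org.springframework.web.bind.annotation.RestController;

    @RestController
    public class PaymentController {

        @PostMapping("/payments")
        public ResponseEntity<Void> retrievePayment() {
            String paymentId = UUID.randomUUID().toString();
            // Start the potentially long-running payment here, e.g. kick off a workflow instance.
            // 202 Accepted is the feedback loop: "we got your request, the result comes later",
            // retrievable via the Location URI or delivered via a callback/message.
            return ResponseEntity.accepted()
                    .location(URI.create("/payments/" + paymentId))
                    .build();
        }
    }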

Within your services, you're then freer to implement requirements the way you want. Let's say you extend the whole payment thing, not only to credit cards, but also to customer credits on their account. Some companies allow that: if you return goods, for example, you get credits on your account, which you can use for other things. Or PayPal has that: if you get money sent via PayPal, it's on your PayPal account, and it's used first before they deduct from your bank account, for example. Then you could add that, where you say: I first deduct credit and then I charge the credit card, and you get more options of doing that, also long running. That poses interesting new problems around consistency, because now we have a situation where we talk to different services, probably one for credit handling and one for credit card charging.

All of them probably have their transactions internally, but you don't have a technical transaction spanning all of those steps, where you could say: if the credit card charging fails, the customer credit was also not deducted; I just say the payment failed. I need to think about the scenario where I deducted customer credit but the credit card charge doesn't work. I want to fail the payment, so I have to rebook the customer credit. That's, for example, also something you can do with these kinds of workflows. It's called compensation: you have compensating, undo-like activities for activities, in case something fails. The only thing I'm trying to say here is that it gets complex very quickly if you think about all the implications of distributed systems.
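To spell the compensation idea out in plain code, here is a rough sketch with hypothetical credit and credit card services, where "rebook customer credit" is the compensating undo activity. A workflow engine with BPMN compensation events expresses the same thing declaratively and adds the long-running waiting and retrying on top.

    public class PaymentWithCompensation {

        // Hypothetical services -- in practice separate components with their own local transactions.
        interface CreditService {
            long deduct(String customerId, long amountCents); // returns the amount actually deducted
            void rebook(String customerId, long amountCents);  // the compensating "undo" activity
        }

        interface CreditCardService {
            void charge(String customerId, long amountCents) throws Exception;
        }

        void retrievePayment(CreditService credits, CreditCardService cards,
                             String customerId, long amountCents) throws Exception {
            long deducted = credits.deduct(customerId, amountCents);
            long remaining = amountCents - deducted;
            if (remaining == 0) {
                return; // fully covered by customer credit
            }
            try {
                cards.charge(customerId, remaining);
            } catch (Exception chargeFailed) {
                // The credit card charge failed after the credit was already deducted:
                // compensate the earlier step, then report the payment as failed.
                credits.rebook(customerId, deducted);
                throw chargeFailed;
            }
        }
    }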

Long Running Capabilities (Example)

Going back to the long running capabilities. My view is that you need long running capabilities to design good services, good service boundaries. That's a technical capability you should have in your architecture. I made another example to make it easier to grasp. Let's say the booking service tells the payment service, via a REST call: retrieve payment. I won't discuss orchestration versus choreography here, though that might also interest you: why doesn't it just emit an event? Booking says: payment, retrieve the payment for that flight, for example. Payment charges the credit card. Now let's say the credit card is rejected. The service is available, but the credit card is rejected. That very often happens in scenarios where the credit card stored in my profile has expired, and then it gets rejected.

Now the next question is what to do with that. Typically, a requirement could be, if the credit card is rejected, the customer can provide new details. They hopefully still book their flight. We want them to do that. They need to provide new credit card details. You can also think about other scenarios. Somewhere I have the example of GitHub subscriptions, because there, it's a fully automated process that renews my subscription, uses my credit card. It doesn't work, they send you an email, "Update your credit card."

The question is where to implement that requirement. One of the typical reactions I see in a lot of scenarios is the payment team saying: we're stateless, again; we want to be simple; we can't do that, because then we'd have to send the customer an email, wait for the customer to update the credit card details, and control that whole process. It gets long running.

They understand it adds complexity, and they don't want that. So they just hot-potato it forward to booking, because booking is long running anyway, for a couple of reasons; booking already has that capability and can handle the requirement, so let's just throw it over the fence. I see that very often, actually. If you make the same example with order fulfillment, or other things where it's very clear that a component like booking or order fulfillment has a workflow mechanism, then this happens. The problem is that now you're leaking a lot of domain concepts out of payment into booking, because booking shouldn't know about credit cards at all. They want to get money. They want the payment. They shouldn't care about the way of payment, because that probably also changes over time, and you don't want to adjust booking just because there's a new payment method.

It's a better design to separate that; pushing it into booking is questionable. If you go into DDD, for example, it also leaks domain language, like "credit card rejected". As booking, I don't care; I wanted to retrieve a payment. Either you got my payment or you didn't; those are the two results I care about. You really want to put it into the payment service. That makes more sense. Then booking gets a proper response, the final result. In order to do that, you have to deal with long running requirements within payment. That's the thing. You should make it easy for the teams to do that.

I added "potentially" on the slide. In such a situation, payment might be really fast in 99% of the cases and could be synchronous. Then there are all these edge cases where it might not be, and it's good to be able to handle that. You can still design an API where, in the happy case, I get a synchronous result. The other case is not an exceptional case; it's just another case: I don't get the synchronous result, I get an HTTP 202 and an asynchronous response. Make your architecture ready for that. Then you could probably also use workflows for implementing it.
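One possible shape for that kind of API, sketched with an invented startPayment stub: try to answer synchronously within a short timeout, and fall back to a 202 with a location to pick up the result asynchronously. This is only a sketch of the idea, not a prescribed implementation.

    import java.net.URI;
    import java.util.UUID;
    import java.util.concurrent.CompletableFuture;
    import java.util.concurrent.TimeUnit;
    import java.util.concurrent.TimeoutException;
    import org.springframework.http.ResponseEntity;
    import org.springframework.web.bind.annotation.PostMapping;
    import org.springframework.web.bind.annotation.RestController;

    @RestController
    public class PaymentApi {

        // Hypothetical: starts the payment (e.g. a workflow instance) and completes the
        // future once the payment has finished.
        private CompletableFuture<String> startPayment(String paymentId) {
            return new CompletableFuture<>();
        }

        @PostMapping("/payments")
        public ResponseEntity<String> retrievePayment() throws Exception {
            String paymentId = UUID.randomUUID().toString();
            CompletableFuture<String> result = startPayment(paymentId);
            try {
                // Happy case (the vast majority): the payment is fast, answer synchronously.
                return ResponseEntity.ok(result.get(2, TimeUnit.SECONDS));
            } catch (TimeoutException becameLongRunning) {
                // The other, non-exceptional case: acknowledge with 202 and deliver the
                // result asynchronously, e.g. via the Location URI or a message.
                return ResponseEntity.accepted()
                        .location(URI.create("/payments/" + paymentId))
                        .build();
            }
        }
    }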

Just because there's workflow orchestration doesn't mean it's a monolithic thing. I would even say the other way round: if you have long running capabilities available in the different services, it gets easier to put the right parts of the process into the right microservices, for example, and it's not monolithic at all. It gets monolithic if, for example, payment doesn't have long running capabilities and you move that logic into the booking service just because the booking service has the possibility to do long running. I find that important. It's not that having orchestration or long running capabilities adds the monolithic thing. It's the other way round: because not all the services have them at their disposal, what normally happens is that all the long running stuff gets pushed towards the one service that does, and then that gets monolithic. From my perspective, having long running capabilities at everybody's disposal avoids what Sam Newman once called god services.

Successful Process Orchestration (Centers of Excellence)

Long running capabilities are essential. They make it easier to distribute all the responsibilities correctly. They also make it easier to embrace asynchronous, non-blocking communication. You need a process orchestration capability; that's what I'm convinced of, otherwise I probably wouldn't have done this all my life. That's also easy to get as a team nowadays, as a service, either internally or probably also externally, to create a good architecture. I'm really convinced of that. Looking into that: how can I do that? How can I get that into the organization? What we see being very successful: all the organizations I talk with that use process orchestration to a bigger extent, very successfully, have some Center of Excellence, organizationally. They don't always call it a Center of Excellence. Sometimes it's "digital enablers", or even "process ninjas"; it might be named very differently, depending a little on company culture and things.

It's a dedicated team within the organization that cares about long running, if you phrase it more technically, or process orchestration, process automation, these kinds of things. This is the link, https://camunda.com/process-orchestration/automation-center-of-excellence/, for a 40-page article where we collected a lot of the information about Center of Excellence: how to build them, what are best practices to design them, and so on. One of the core ideas there is, a Center of Excellence should focus on enablement, and probably providing a platform.

They should not create solutions. Sometimes people ask me: but we did that with BPM, where we had these central teams doing an ESB and very complicated technology, and it didn't work. It didn't work because, at that time, a lot of those central teams had to be involved in the solution creation. They had to build the workflows; it was not possible without them. That's a very different model nowadays. You normally have a central team that focuses on enabling others, who then build the things. Enabling means consulting, helping them, building a community, but also providing technology so they can do that.

What I've been discussing very often over the last two or three years is: "but we stopped doing central things. We want to be more autonomous. We have the teams; they should be free in their decisions. We don't want to put too many guardrails on them. Isn't a central CoE the wrong path? Why would you do that?" I discuss that with a lot of organizations. I was so happy about the Team Topologies book; that's definitely a recommendation to look into. The core ideas are very crisp. In order to be very efficient in your development, you have different types of teams. There's the stream-aligned team that implements business logic, basically. They provide value; that's very often organized along value streams. You want to make them as productive as possible, to remove as much friction as possible, so they can really provide value, provide features. In order to do that, you have other types of teams.

The two important ones are the enabling team, a consulting function hopping through the different projects, and the platform team, providing all the technology they need so they don't have to figure out everything themselves. The complicated-subsystem team is something we won't focus on too much; it can be some fraud-check AI thing somebody builds and then provides internally as a service. You can map that very well, and our customers actually do, to having a Center of Excellence around process orchestration and automation, for example.

They provide the technology; in our case, that's very often Camunda, but it could be something else. Very often, they also own adjacent tools like RPA tools, robotic process automation, and others. They provide the technology and also the enablement: project templates and whatnot. That's very efficient, actually. It frees the teams from figuring that out themselves, because that's so hard. As a team, if you have no idea how to build your stack, you can go into evaluation mode for two or three months and not deliver any business value. That's actually not new. There are a couple of recommendable blog posts out there about that. One is from Spotify. Spotify published about the Golden Path in 2020, where they basically said: we want certain defined ways of building a certain solution type. If we build a customer-facing web app, this is normally how we do it.

If we build a long running workflow, this is how we do it. They have these kinds of solution templates. The name is good, actually: they call it the Golden Path because it's golden. They make it so easy to use. They don't force teams to use it; that's the autonomy thing. They don't force it upon people; they make it desirable to use. They make it easy. If it's not working, it's not your fault. That's what makes it golden. I like the blog post; I love the quote that they found rumor-driven development simply wasn't scalable: "I heard they do it like that, so probably you should do that as well." Then you end up with quite a slew of technology that doesn't work. I find it really important that you consolidate on certain technologies and make them easy to use across the whole organization. That makes you efficient. Don't force it upon people.

They also have a tool. It's a big company; they do open source on the side. They made backstage.io. I have no idea if the tool is good; I have not used it at all. But I love the starting page of their website. The Speed Paradox: "At Spotify, we've always believed in the speed and ingenuity that comes from having autonomous development teams, but as we learned firsthand, the faster you grow, the more fragmented and complex your software ecosystem becomes, and then everything slows down again." The Standards Paradox: "By centralizing services and standardizing your tooling, Backstage streamlines your development environment.

Instead of restricting autonomy, standardization frees your engineers from infrastructure complexity." I think that's an important thought. They're not alone. If you search the internet, you find a couple of other places, for example Twilio, but also others, saying the same thing: we offer a paved path, mature services you pull off the shelf to get up and running super quickly. What you do is create the incentive structure for teams to take the paved path, because it's a lot easier. If they really have to go a different route, you make that possible. It's not restricting autonomy, it's simply helping them. That's important. I think it's also important to discuss that internally.

Graphical Models

Last thing: graphical models. That's the other thing I discuss regularly. Center of Excellence, yes, that probably makes sense. Process orchestration, yes, I understand why we have to do that. But graphical models? We're developers. We write code. The thing is, BPMN, which is what I showed, is an ISO standard, adopted worldwide, and it can do pretty complex things. I just scratched the surface. It can express a lot of complex things in a relatively simple model, so it's powerful. It's living documentation. It's not a picture of a requirement; it's running code. That's the model you put into production. It's running code. That's so powerful.

This is an example where it's used for test cases: the diagram shows what the test case tests, for example. You can leverage it as a visual. Or you can use it in operations: where is it stuck, what is the typical path through, where are the typical bottlenecks, and so on. You can use it to discuss the process with different kinds of stakeholders, not only developers, but all of them.

If you discuss a complex algorithm, like a longer process or workflow, you normally go to the whiteboard and sketch it, because we're visual as humans. Just because I'm a programmer doesn't make me less visual. I want to see it. Very powerful. It's even more important because I think a lot of the decisions about long running behavior need to be elevated to the business level.

They need to understand why we want to go asynchronous, why this might take longer, and why we also need to change the customer experience to leverage the architecture. The only way of doing that is to really make it transparent, to make it visual. I think it was a former marketing colleague of mine who phrased it like this: what you're trying to say is that in order to leverage your hipster architecture, you need to redesign the customer journey. That's exactly it. That's important to keep in mind.

Example (Customer Experience)

I want to quickly close with another flight story. The first thing that's happening is that you get everything asynchronously. They did change the customer experience a lot. Now I'm working on the train companies to do the same thing. Everything is mobile. You get automatically checked in for flights; you don't even have to do that yourself anymore. Why should I? My flight to London was delayed by an hour. Ok, that's a delay. Then it was canceled. That's not so nice. Then I got, relatively quickly, an automated email; that was the only one in German, and I don't get why. Or did I get that one in German? Maybe it wasn't.

I got a link to book a hotel at Frankfurt airport. Why? I don't want a hotel in Frankfurt, I want to get to London. But everything automated, everything pushed. Nice. Then I got, via the app, not via email, a link to a chatbot where I should chat about my flight. It said: we rebooked you for tomorrow morning. It couldn't do that completely, because it's not Lufthansa, so you have to see a human colleague. I don't want to get to London tomorrow, I want to get there today. So I basically visited a counter.

The end of the story is that they could rebook me onto a very late flight to London Heathrow. I hated that. What I still like: everything was asynchronous. I got notified of everything in the app and via email. I think there are some good things on the horizon there. The customer experience for airlines, at least, has changed quite a bit over the last 5 years. Funnily enough, last anecdote: I read an article about Lufthansa's bad NPS score, and I probably understand why.

Recap

You need long running capabilities for a lot of reasons. Process orchestration platforms and workflow engines are great technology. You should definitely use them for this, because they allow you to design better service boundaries and implement faster, with less accidental complexity. You can embrace asynchronicity better and provide a better customer experience. We haven't even talked about the other stuff, like increased operational efficiency, automation, reduced risk, better compliance, documenting the process, and so on. To do that successfully across the organization, you should organize some central enablement. I'm a big advocate of that, to really adopt it at scale.

 


 

Recorded at:

Sep 17, 2024
