Moving beyond Request-Reply: How Smart APIs Are Different

Summary

Bernd Ruecker talks about real-life experiences around typical architecture patterns and why people have to carefully think about boundaries and responsibilities of their components. Further, he discusses why balancing orchestration and choreography is essential to avoid chaos, and covers idempotency, long-running services, and event-driven services.

Bio

Bernd Ruecker is co-founder and Chief Technologist of Camunda – an open source software company reinventing workflow automation. He has been in the software development field for more than 15 years, automating highly scalable workflows at global companies including T-Mobile, Lufthansa and Zalando and contributing to various open source workflow engines.

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.

Transcript

Ruecker: I want to talk about smart APIs. Moving beyond request-reply, I think that's a common theme we currently see. How can we do remote communication a bit better than we did in the past, whatever that means? As motivation, what I always love to pull up, especially here at QCon: you probably saw that slide in the morning when Charles Humble had his talk, basically introducing InfoQ, and they have that architecture trends report graph. There's a little detail that not a lot of people normally notice at first, but I love it. I think Daniel [Bryant] pointed it out to me last year. There's microservices on the graph, at late majority. Most people are doing microservices now, but correctly built distributed systems is kind of [inaudible 00:00:59] That's the gap we're talking about. I think it's still very true to some extent. That's basically the why of this talk. Let's directly jump into that.

One thing, and that's a mind shift, a change in thinking happening nowadays: if you have a reasonable number of components, microservices, or whatever that is, you have to accept that something in your system is always broken. It might be that the network towards the system is broken, or that the component itself is broken. That's something we have to accept. That was different 10 years ago, when we tried to make everything highly available, and just assumed it's working. The thing is, we have to plan so that this doesn't take down the whole system. There has to be a plan B. If that's not working, we either do it later or we do something differently. That's a very important, basic assumption you should have in your head. We want to avoid tiny problems bringing down the whole system.

I'll make an example. I'm an example person. That happened to me roughly three years back. When I first thought about these kinds of situations, I did a talk. What happened is I took an airplane. Who took an airplane before? You're familiar with the process. It works like you book a ticket, and then normally 24 hours before you have your flight, you get an email with an invitation to check in. I checked in, I did all of that. Normally, you click another button where you say, "Hey, give me the boarding pass because I need the boarding pass at the airport." I did that for a flight to London, actually, three years back with Eurowings. It's a German website, sorry for that, but it's very easy. When I clicked on the button, it said to me, "Hey, there was an error sending you the boarding pass. We could check you in. There was no problem, but we can't send you the boarding pass now."

Current Situation

Just as a disclaimer, Eurowings is not our customer, but we have customers with, let's say, a similar business model. How it works normally internally is like, I'm in front of a web UI, typing and booking and checking in, and that uses a microservice, for example, doing the check-in. That normally doesn't do everything on its own but probably leverages other services in order to do that. For example, something to generate the 2D barcode you have on the boarding pass, or generating a PDF or whatever it is. Here's London. That's good. My favorite is actually if you check in with British Airways, you can send a fax with a boarding pass. Still, love it. Never did it. I have no idea where to send it to, but you can do it.

You have these several services. What I think that happened is that some of these components just didn't work. Because the overall website, the check-in did work. That's already one important thing here. That's the good part of that design was, they had some things in place in order to avoid that bringing down the whole system. One important pattern, just an example, there's a circuit breaker. It's like the circuit breaker in your house. If something is weird with the electricity, you don't wait for the house to burn down, but it switches off electricity. Circuit breakers do the same thing for software. If something is weird with the connection to that service, and very often it's not that the service is not available, but the service is slow. Slow is really bad. If it's slow, you're probably doing a REST call, and you don't get a response.

That means you block a thread on the service that invokes the other one. That means after a while, that thread pool gets exhausted and you can't deliver any results to your clients. That means the next services can't either. They had some kind of circuit breaking in place in their components. That's very important. That seems to be common knowledge by now, but it's not totally common knowledge. There are airlines not doing that properly. That's a tiny little detail probably, but since I gave that talk the first time, it's a different talk than today, that story, I'm collecting examples of error pages from airlines, which is quite a good hobby if you travel too much, because then you always get a happy feeling if you get an exception. I can totally recommend that. If you find a good one, send it over. I'm happy to put them in the talk.
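The circuit breaker idea described above can be sketched in a few lines. This is a minimal illustration, not the airline's implementation and not any library's API: the `CircuitBreaker` class, its thresholds, and the injectable `clock` are all hypothetical names chosen for the example.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker refuses a call without trying the remote service."""


class CircuitBreaker:
    # Hypothetical helper for illustration, not taken from any library.
    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                # Fail fast instead of blocking a thread on a slow service.
                raise CircuitOpenError("circuit is open")
            # Reset window elapsed: allow one trial call (half-open).
            self.opened_at = None
            self.failures = self.max_failures - 1
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        self.failures = 0
        return result
```

A caller would wrap each remote call, for example `breaker.call(fetch_barcode, booking_id)` with a hypothetical `fetch_barcode` that uses a request timeout, and treat `CircuitOpenError` as the signal to switch to plan B.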

Some don't. It's still not a good design. If you think about it, what they did is they have a problem somewhere in their system. What they do is they basically push it back all the way to me as a user. I see the problem which is down there in their system. That's not a good design, because in this case, what could you do about that? If you ask computer scientists, engineers, it's normally very easy. What would you do? You press F5, of course. That's what you do. You retry it, you just retry it. For us, it's totally normal that it might work then; for normal people, it's very often confusing, but you can retry it. That didn't work for me. I retried it. It didn't work. I retried it a couple of minutes later, it didn't work. I still got the same problem. So what was the next thing I did? I created a calendar entry in my personal calendar to remind me in four or five hours to send the boarding pass, because for me, it's natural, "Okay, we'll resolve it later." That did work. It did work, but it's what I call a stateful retry. I had to retry things, but stateful: I have to remember it because otherwise it's lost. If you're like me, you forget what you did four hours ago. That's a long time. I need stateful retry. The thing is, this is now up to me.

Interesting part in that story: really, it's the same trip. When I was in London, I had my onward flight from London, in this case, EasyJet. I got the same problem. That was kind of a coincidence. They're kind of nicer. They give me the work instructions, what I should do, "We're having some technical difficulties." Yeah, that's what I recognize. "Please log on again, retry. If that doesn't work, please try again in five minutes." Increase the interval. Actually, that's the best. "We do actively monitor our site and will be working to resolve the issue, so there's no need to call." It's your problem. Solve it yourself. In this case, it's very obvious that that's probably a bad design. This is how we very often treat other services. If you provide an API like this, then it might be another service consuming that and they will have exactly the same feeling.

It should be more like, "If you're having problem, we are having problems. That might happen and we're sorry about that. We take care of it, we sort it out. It's not your problem." In this case, it's a boarding pass. I don't need it synchronously. I want to have it when I get to the airport, but I don't need it early on.

What would be a much better design is something like, "We have that problem, but then we do the retry ourselves within the component that is really responsible for that." I come back to responsibility of services a couple of times throughout the talk; I find it very important to think about that. In this case, problems downstream should not lead to problems upstream. I should handle that. In this case, it has to be a stateful retry.
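A stateful retry inside the responsible component could look roughly like the following sketch. It assumes a small SQLite table as the durable store and exponential backoff between attempts; the table layout, the function names, and the `send` callback are hypothetical and only illustrate the idea of remembering the pending work instead of pushing it back to the user.

```python
import sqlite3


def open_store(path=":memory:"):
    # A file path instead of ":memory:" would make the state survive restarts.
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS pending ("
        "booking_id TEXT PRIMARY KEY, "
        "attempts INTEGER NOT NULL, "
        "next_try REAL NOT NULL)"
    )
    return db


def request_boarding_pass(db, booking_id, now):
    # Remember the intent first, so it is not lost if sending fails.
    db.execute("INSERT OR IGNORE INTO pending VALUES (?, 0, ?)", (booking_id, now))
    db.commit()


def run_due_retries(db, send, now, base_delay=60.0):
    """Try every due job once; keep failed ones with a longer delay."""
    due = db.execute(
        "SELECT booking_id, attempts FROM pending WHERE next_try <= ?", (now,)
    ).fetchall()
    for booking_id, attempts in due:
        try:
            send(booking_id)
        except Exception:
            delay = base_delay * 2 ** attempts  # exponential backoff
            db.execute(
                "UPDATE pending SET attempts = ?, next_try = ? WHERE booking_id = ?",
                (attempts + 1, now + delay, booking_id),
            )
        else:
            db.execute("DELETE FROM pending WHERE booking_id = ?", (booking_id,))
    db.commit()
```

A scheduler would call `run_due_retries` periodically. A workflow engine gives the same behavior plus visibility and timeouts, which is the direction the talk takes next.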

One quick disclaimer. Of course, this talk contains opinion, my own opinions. It's important for me to make that transparent. My name is Bernd Ruecker. I'm a co-founder at an open source workflow engine vendor, basically. We're doing an open source workflow platform. I've worked on different open source workflow engines over the last 10 to 15 years. That means, in a way, every problem I see, I think about that problem through the lens of state machines and workflow engines. That probably explains a couple of things I talk about later on. That's probably important for you to have in mind to judge whatever you hear here.

You Can Use A Workflow Engine (=Durable State Machine)!

For example, and I'm not diving into much more detail here, but I want to give you the thought. You could use, and we have customers doing that, a state machine, a workflow engine in order to do these kind of things. If you do workflows, for example, in whatever language you like. I personally like BPMN. That's a BPMN workflow. You can hook that into a service, like the check-in service, and then this gets stateful. Then you have a service, and I come back to that a couple of times as well, which might get long running. There are a couple of implications of that. It might get long running and that means it needs to have durable state, it needs to remember, "I still wanted to send that guy the boarding pass, so I should do." It can also do the retry. That's what these kind of engines can do. It's relatively simple to do that if you have the right facilities. That's one thought.

I'm not diving into that, but there's code on my GitHub repo where you can just try it out, you can just run what I talked about. If you're more of a code person like I am, that very often makes it much easier to understand.

If the client implements the retry, long running or not, very important other thing is that the service provider has to think about idempotency. Why is that? I'll give you an example again. I'm always using end user customer examples because you know them, you can really think about that. But again, that's the same situation if you have service-to-service API communication.

Do you know these pictures? If you pay something with credit card on the internet, you always get that loading screen like, "We're processing your payment, it takes a bit. Do not close this window or select the 'Back' button of your browser." If you're like me, there's always that small evil part of your brain, "Hit the 'Back' button." I never do that because I'm too afraid that something weird is going to happen to my credit card. It's a bad design. It's really bad. It really says, "If something goes wrong in that super stable internet connection you have on the train currently going 240 kilometers per hour, then this might break." That's really weird. It should be like, "You want to give us money? Awesome. We make that easy for you." It should be like: store some correlation ID in the local browser cache. Whenever you open it up, you can retry it. "If your mobile crashed or a train went over your mobile, you can use your normal computer, we send you an email, we make it easy for you." There are lots of ways of doing that. You just have to think about that.

That's a zucchini, or there was a different British word. I call it zucchini. There's a picture, you can probably follow along. There was a time like six or seven years back when I ate vegan. I tried that for a couple of weeks. I didn't have kids back then, so it was possible to do. It wouldn't be possible today. What's kind of nice, my favorite dish was that. You take the zucchini and you slice it. There's a special slicer where you could take the zucchini, slice it, and then you get noodles out of the zucchini, add some sauce. Delicious. I love that. In the cookbook, it says like, "You can eat as much as you want from that." That's literally in the book, "You can eat as much as you want." I can't explain you why, but that's the case. If you prepare a dish, like one dish of zucchini, you eat it. Then you forget if you just ate some zucchinis. You just eat another one. You forget again, just eat another one.

Requirement: Idempotency of Services!

Eating zucchini is totally idempotent. Eating pasta is not. It's the best metaphor so far I came up with for idempotency. If you know a better one, let me know. I'm happy to take it. For me, the important thing is that it sticks. I see so many products struggling with exactly that. They have an API, legacy API, which doesn't provide any support for idempotency, and then you have a big problem solving that.

Just making very easy example, I'm switching to payments now because in payment, that's more important than check-ins. If you have a payment component collecting money, you probably want to charge credit card in order to do that. Let's say it's the REST API call. This API can be idempotent or not. If you just would say something like charge credit card, give it an amount, it's not idempotent. If you call it again, you have no idea if that's the same call or not. You charge the same amount soon after, but that's only a hint. It's guesswork. If you add something like a transaction ID, it's clear. I already saw that call. Right? That's very important. That's, by the way, a typical pattern.

I talked about that United flight payment earlier on. The pattern typically is create IDs as soon as possible, probably even in the front end. You could already create a booking ID or an order ID, even if you don't yet know if the customer is really ordering later on. If you have that ID, you pass it along everywhere, you're idempotent. Super important. Because, otherwise, if you do the retrying, you probably charge twice. Not a good idea, especially not in payments.
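The difference between the two API shapes can be sketched as follows. The `PaymentService` class, its in-memory dictionary, and `new_transaction_id` are hypothetical stand-ins; a real service would persist the seen IDs and would normally compare the repeated request against the stored one.

```python
import uuid


def new_transaction_id():
    # Create the ID as early as possible, for example in the front end.
    return str(uuid.uuid4())


class PaymentService:
    # Hypothetical in-memory service; a real one would persist seen IDs.
    def __init__(self):
        self.charges = {}  # transaction_id -> amount already charged

    def charge_credit_card(self, transaction_id, card_number, amount):
        if transaction_id in self.charges:
            # Same call seen before: report the earlier result, charge nothing.
            return {"transaction_id": transaction_id, "status": "already_charged"}
        self.charges[transaction_id] = amount
        return {"transaction_id": transaction_id, "status": "charged"}

    def total_charged(self):
        return sum(self.charges.values())
```

With the transaction ID in place, a client that retries after a timeout can call `charge_credit_card` again with the same ID, and the customer is still charged only once.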

That's not all of it. I'm just mixing a bit of BPMN, because I find it quite readable. The idea here is, if I have the payment service again, and let's assume I charge the credit card, and I do the retrying there. I tried to charge it, it doesn't work for a couple of minutes. Normally what you do after some time, you give up. It's either after 10 times or 10 minutes, or whatever that is, at some point in time you give up. You basically would say something like, "I can't take the payment for you. It doesn't work. Can't do that".

Distributed Systems

There's another thing about remote communication, which is really important to have in mind. I love using that metaphor for distributed systems. This is your application. This is probably the other application. They're kind of nice to program, nice to do. In between you have that rough ocean, and that's the network. There was a talk yesterday on the fallacies of distributed computing, basically, one is the network is reliable. It isn't. It is not reliable. You will have problems.

One problem is, let's assume you do a REST call, and you get a network exception, you have no idea which of these three situations just happened. The first could be, "The network was broken. I never reached the service provider." The second could be, "I reached the service provider. It started doing something for me and then it exploded. Did it commit its transaction? I don't know." Or, "It did everything okay. The response got lost in the network." You can't know which of these situations. You can do the same thing with messaging. You send out a message, you never get a response. What happened? You don't know. The important thing is to make yourself aware that you don't know that and that you have to think about that.

For example, you might have to clean up later on, something like, "I need to check if I did the payment and probably cancel it," or whatever that is. Two things again, like with idempotency, the first, you have to think as a client, you have to think about that, and to build in the right mechanisms to clean up. Also, on the other side, if you're a service provider, your API has to provide the possibilities to do that, like check the status, cancel it. That's again, if you talk to legacy systems, the big ERP systems, very often, they don't provide the APIs in order to cancel stuff. That's a big problem.
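As a sketch of both sides of that contract, the following uses a hypothetical `FakeProvider` that can simulate two of the failure situations (request never arrived, response lost) and exposes the `status` and `cancel` calls the client needs. None of these names come from a real payment API.

```python
class FakeProvider:
    """Hypothetical provider that offers charge, status, and cancel calls."""

    def __init__(self, fail_mode=None):
        self.fail_mode = fail_mode  # "before", "after", or None
        self.payments = {}

    def charge(self, tx_id, amount):
        if self.fail_mode == "before":
            raise ConnectionError("request never arrived")
        self.payments[tx_id] = "charged"
        if self.fail_mode == "after":
            raise ConnectionError("response lost")
        return "charged"

    def status(self, tx_id):
        return self.payments.get(tx_id, "unknown")

    def cancel(self, tx_id):
        if tx_id in self.payments:
            self.payments[tx_id] = "cancelled"


def charge_or_clean_up(provider, tx_id, amount):
    try:
        return provider.charge(tx_id, amount)
    except ConnectionError:
        # We cannot tell which case happened, so ask instead of guessing.
        if provider.status(tx_id) == "charged":
            provider.cancel(tx_id)
            return "cancelled"
        return "not_charged"
```

Whether to cancel or to accept a payment that turned out to have succeeded is a business decision; the sketch cancels only to show the clean-up path. The `status` call can fail too, which is one more reason to make this step retryable.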

Being Able to Implement Long Running Services Is Essential For Smart APIs (on a Technical Level)

I would say being able to implement these long-running services is essential to get better APIs. I'll explain why in a second again. So far, I looked at a technical level. I looked at: the connection is unstable, the REST call doesn't get through. It's very technical, actually. I can make the same examples on a more business level.

Let's say you have something like a booking component, which does, for example, the flight booking, and that needs to talk to the payment service in order to collect the money. That's, again, an API in between so you have to talk. Booking says, "Retrieve the payment for me." Let's assume the same thing. Payment needs to talk to a credit card service in order to really charge your credit card. Let's assume not the happy path, something went wrong. Your credit card got rejected. That's not a technical failure like, "I can't talk to the credit card service." It's a business failure like, "We can't take money from that card for whatever reason."

The typical thing we see often is like I had in the very beginning, it's just, "A hot potato," I just throw it over the fence like, "Rejected," I reject the payment right away. By the way, you reject the payment. Hopefully, for you, you don't talk about credit cards here because, again, it's about responsibility. Only payments should know about credit cards. Booking should never know about credit cards. That's kind of how you lay out the system. Payment says, "It's rejected, can't do anything." Now assume you get something like this requirement where you say, "If the credit card was rejected, please send the customer an email to update credit card details." We're getting more asynchronous here. This is a bit like what, for example, GitHub does when you have an account there and your credit card expires. They want to renew your account, doesn't work, they send you an email, give you two weeks to update the credit card.

Let's assume you want to build that into your system. What I see happening a lot is then the hot potato thing again, like, "We are building that payment service. We don't have state," it's about state and long running again. "We don't have state, we just throw it over the fence to the booking service. The booking service should take care of all of that because they have that fancy state machine over there. Because the booking is a process anyway, they should take care of that." That's not a good idea. Again, it's about responsibilities, and it's like putting that "we can't send your boarding pass" page in front of your customer. It's not a good idea.

Quoting Sam Newman, who talked here yesterday as well. In his book he says something like this about his problem with orchestration. I come back to orchestration later on, don't worry. He says if you do things like that booking service over here, for example, because that does a lot of things, you have the risk that this becomes a so-called god service, which then just tells what he calls anemic CRUD services what to do. Basically, you degrade the payment service to something which just has a stateless API, where it can say, "Retrieve payment," and then it's rejected or fulfilled, which is not what the payment service should do. Its responsibility is to collect payment. If that takes two weeks, it takes two weeks. That's the better API, the better way to draw boundaries in your system.

If you ask, really, who is responsible? It's payment. Booking shouldn't know about expired credit cards. I hope you've got the idea. Then you ultimately just respond with a payment received or failed, but that could be two weeks later now. So the same thing, you now get a long running service, like the payment. It's long running. And that was the same thing, by the way, as we had earlier on with the technical glitches. One of my points here is that, really, you should think about making your service more long running, or at least be able to implement it more long running to make a proper API.

If you do that, you normally enter the realm of asynchronicity. Because I can't call a REST service and then block for two weeks, just doesn't work. Doesn't make any sense. I need to communicate asynchronously. That, to my experience, still scares a lot of people.

Synchronous Communication

Let's look at synchronous communication for a second. My favorite quote, actually, is this one about synchronous communication. It's from 2015, but I think it's still very accurate. It's done by Todd Montgomery and Martin Thompson. They said, "Synchronous communication is the crystal meth of distributed programming." I can totally relate to that. It looks so easy. You're doing like a REST call, if you use the right programming framework, it looks like a local method call. It's so easy, you get a response, you move on, but it's not a local method call. It's so different. It's so much more complex. Because it looks so easy, a lot of people forget about the implications it has. You're probably aware of, very often you have a tight deadline, the product doesn't have any time. It's like, "Yeah, I know. I have to get somewhere, but the last time, I promise, it's the last time." You should think about that.

Asynchronous Communication

I think asynchronous communication has a lot of advantages. With the reasoning I had earlier on, if you have the capabilities to implement your services long running, you don't have a big issue with asynchronous communication, because the biggest problem you normally have in the beginning is that you can't directly give a response. You probably have to capture the current state, you have to remember it for later on. That's not totally easy to do. I think that's a good preparation to get asynchronous. If you look at that check-in example again, thinking about messaging as a protocol, [inaudible 00:24:24], for example. It doesn't look that different. It basically means the check-in sends a message to the barcode generator, and that one normally responds. If it doesn't respond, again, you have no idea what happened. Just not doing anything is not a good strategy, because that means, like me in front of the computer saying, "I want to have my boarding pass." The website says, "We send it to you." I don't get it. Then I am monitoring that. Not a good idea. It should be down there in the system. You should really monitor that down in the system.

Workflow

Again, I'm expressing that in a BPMN because that's my language of choice. If you do that differently, that's fine with me as well. Then it looks like, "I'm sending out a message." The next thing I do is I wait for response. If that doesn't happen, this is how you can read BPMN, something else should be done, and so on and so forth. Being able to implement these kind of long-running behaviors is very crucial, I think, to make your services long running, potentially at least. That gives you a much better API, it can distribute the responsibilities much better. Yes, you can even do much more cool stuff. Whenever this takes too long, wherever I am, I do plan B. I even have a screenshot for that, by the way. I think it was Virgin Atlantic sending me a text message, "We can't check you in, please visit a counter".
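The "send a message, wait for the response, otherwise do plan B" flow can be approximated in plain code. The sketch below uses `asyncio.wait_for` as an in-process stand-in for the timer in the BPMN model; unlike a workflow engine, it keeps no durable state, so the wait is lost if the process restarts. The `check_in` function and the `request_barcode` coroutine are hypothetical.

```python
import asyncio


async def check_in(request_barcode, timeout_seconds):
    try:
        barcode = await asyncio.wait_for(request_barcode(), timeout_seconds)
        return f"boarding pass with {barcode}"
    except asyncio.TimeoutError:
        # Plan B: no answer in time, so fall back to another channel.
        return "text message: please visit a counter"


async def fast_barcode_service():
    return "barcode-123"


async def slow_barcode_service():
    await asyncio.sleep(1)
    return "barcode-too-late"


if __name__ == "__main__":
    print(asyncio.run(check_in(fast_barcode_service, 0.5)))
    print(asyncio.run(check_in(slow_barcode_service, 0.01)))
```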

Let's switch gears. Let's assume you're now better at implementing your APIs, you're now able to communicate asynchronously. You might even go reactive for some things. We'll come back to what that means in a minute. The question I see a lot is: can you really leverage that architecture on a business level? What I mean by that is, and I'll give you an example in a minute, that you really have to think about the user experience, and also the overall business process, in order to fit these new technologies. Because if you don't do that, you will get a lot of problems. I will make one example to make that clear. Most people can relate to that, but I want to make one example.

In this case, I don't use airlines, I use trains. It's kind of comparable. In this case Germany, because that's what I know best. If you book a train ticket in Germany, you basically hand in all the details, including where you want to sit and the payment. Then you hit a button, and then you get the ticket as a PDF in the browser. Still a PDF in the browser because the assumption is you print it out and take it with you. I never printed that out in the last years. But this is how it works. In the background, again, there are a couple of services at work. Let's just assume there's a front end, there's a booking component, there's the seat reservation, payment, and something to generate the PDF. Relatively easy to imagine.

Now the thing is, and that's my problem with that approach, is it's really a synchronous step from, "Here's what you want to have. If you click on book now, we do everything synchronously in the background, we keep you waiting in the browser. Then we say, here's the PDF." A lot of sites work like that. Even for flight bookings, for other bookings, for checkout, very often you have that synchronous behavior. If you look at that, and if you examine that, that's what I mean, that's not an optimal situation. Because if something goes wrong, basically what we get is an error message synchronously. I get it immediately. I don't have to wait for the error message. All the data entered is gone, at least that's for Deutsche Bahn. There might be other companies which are at least better for the user experience there. It basically means it didn't work, I have to restart. That's not a good user experience. The user is not happy.

Weaknesses

On the other hand, as the development team, it's actually pretty hard to build. Because if you have your hipster architecture underneath, then you might communicate via Kafka, via Rabbit, via whatever in the background, but you need to get a synchronous response. It's not easy to build. Then you have additional downsides of that. For example, you add up latency: every one of the calls has some latency. Then you might end up with the sum of these latencies. If you do it in parallel, it might be a bit better. The core problem is if you do a lot of calls with latency, they add up, it gets slow. That's what we can see in a lot of systems. In the browser, if you hit the book button, it takes some time. Not milliseconds, not seconds. Sometimes 10 seconds or 12 seconds. It really makes me nervous.

Second thing is the availability goes down. Even if all of your services are 99.9% available, which I doubt they are, but if they are, even then you go down in availability. Normally, because they don't agree on the time they fail. [inaudible 00:30:15]. You can't avoid that. The more services you have, and if you remember the picture early on, with a lot of the services that can break, you're in trouble, because that really gets down the availability, big time.
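The availability effect can be made concrete with a little arithmetic. The sketch assumes that failures are independent and that every service must answer for the request to succeed; both are simplifying assumptions for illustration, not figures from the talk.

```python
def combined_availability(per_service, service_count):
    # Assumes independent failures and that every service is needed per request.
    return per_service ** service_count


if __name__ == "__main__":
    for count in (1, 5, 10, 30):
        print(count, f"{combined_availability(0.999, count):.3%}")
```

Under these assumptions, ten services at 99.9% each already give roughly 99.0% for the synchronous chain as a whole.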

Last thing, it's even hard to implement. Again, I know some tricks. That's what we do with customers: how to leverage something like a workflow engine in order to wait for the response, or to deal with consistency. We don't touch consistency here at all. For example, you could succeed with a seat reservation, and then the payment fails. You probably have to cancel the seat reservation in order to be consistent. These kinds of issues, you have to deal with that. It's not easy to implement. Just to remind you, you do all of that just to get a bad user experience here in the browser. That's what I mean, that's not a good approach. I think if you go asynchronous, if you have all the technology underneath, which I think is a good thing, then you also have to adjust or think about the user experience.
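The consistency problem mentioned above (seat reserved, payment failed) is usually handled by undoing the step that already succeeded. A minimal sketch, with hypothetical callbacks standing in for the seat reservation and payment services:

```python
def book_ticket(reserve_seat, cancel_seat, collect_payment):
    seat = reserve_seat()
    try:
        collect_payment()
    except Exception:
        # Compensate the step that already succeeded, then report failure.
        cancel_seat(seat)
        return None
    return seat
```

In a real system the compensation itself is a remote call that can fail, so it needs the same retry and idempotency treatment as any other call.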

Typical Pattern

It's a typical pattern. I see that everywhere. Some component simulates synchronicity. It basically waits for a response. It polls periodically to see if some result is already there. That's the only way to simulate synchronicity if you have asynchronous stuff on the backend. What still is important, you always have to think about timeouts. I can wait for probably 20 seconds, 30 seconds, but probably I don't get a response. What then? You have to ask that question, also to business folks: what should happen if I don't get a response from payment within a minute? Do I keep the browser waiting, or what should I do? It's very important to have these kinds of discussions.
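A sketch of such a polling facade, assuming a hypothetical `poll` callable that returns `None` until the asynchronous backend has produced a result. The `clock` and `sleep` parameters exist only so the behavior can be exercised without real waiting.

```python
import time


def wait_for_result(poll, timeout_seconds, interval_seconds=0.5,
                    clock=time.monotonic, sleep=time.sleep):
    deadline = clock() + timeout_seconds
    while True:
        result = poll()
        if result is not None:
            return result
        if clock() >= deadline:
            # Timed out: the caller decides what happens next,
            # for example a status link or an email instead of an error.
            return None
        sleep(interval_seconds)
```

The `None` branch is where the business question from the talk has to be answered: what the user sees when no response arrived in time.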

If I look at the ticket booking, it's a rumor that I want to have the PDF in the browser to print it out. I don't want to have that. Even if they want to keep it, they could do sync in the happy case. I get it if everything worked. If it doesn't work, I could still get a page where it says, "We'll send you an email when we've done your booking." Or they could give you a link, "Here is where you can check the status." That sounds weird for booking, I agree, but the last time, for example, I extended my ESTA to travel to the U.S. That's what they do. They say, "We got your application. Here's the link where you can check for status." They're not processing it synchronously. They're giving me a link to poll. These are the options. Very important to keep that in mind. Nowadays, you're not printing anything anymore. You're having an app, you can communicate asynchronously with that anyway. That changes the whole business process, the whole user experience. You really have to take that into account in order to leverage the underlying infrastructure.

Your Business Processes Need To Be More Reactive!

Let's say our business processes need to be more reactive. If I say reactive, I refer to the Reactive Manifesto, which you might know. It basically says that you can build a more message-driven asynchronous application, and that makes everything more elastic, because you can scale easier. It makes everything more resilient because you scale it out, so probably, you can kill anything anytime. Overall, your whole application gets much more responsive. That's for me reactive. It's kind of hype at the moment. I think most people might have already seen this or at least stumbled over reactive.

Let's go reactive. That's actually what I see, especially for, let's say, early adopters, which QCon tries to attract. Probably, we have a lot of them here. Starting a couple of years back, "We do everything reactive." That was at QCon New York last year. That is Phil Calçado, and he talked about what they did at Meetup at that time. He said, "We did a totally event-driven reactive system. We were suffering from Pinball machine architecture." I love the term actually, Pinball machine architecture. I didn't come up with that. I think it was [inaudible 00:34:44]. I have to check back. I love the term. I made my own slide for that, though.

That's my slide. What you normally can see out there is that you have a couple of services, might be microservices, might be standard software, might be external services, some APIs. Now you want to make them work together in order to implement some kind of process. Event-driven or reactive works a bit like, "You throw in a request," and it bounces through the system. I think it visualizes the problem. You normally recognize that by the number of times people say, "What the hell just happened? That can't happen in our system." Because it's emergent behavior. It's something you didn't design for, some weird constellation of events, and something really bad happens. That's a risk. I'm not saying that reactive is bad or event-driven is bad. I'll come back to that in a second. It's a risk you should be aware of.

Let's look at that. That's the last example I'll do for today. Let's do order fulfillment briefly. Let's assume the Amazon Dash Button. I'm not sure if you know Amazon Dash, but the idea is very simple. You have that small hardware button, you put it right next to whatever, your washing powder. If you see it's empty, you just press the button once and then exactly one packet of washing powder is ordered at Amazon and delivered to your house. In the background, that means whenever I push the button, basically, an order fulfillment workflow has to go on, a business process like, "I have to pay for the stuff, it has to be fetched from a warehouse, and it has to be shipped to my door." If I want to implement that, just assume that you have a couple of microservices, like the checkout service communicating with the button, the payment service communicating with a payment, of course, inventory taking care of what's in stock, and shipment in order to create the parcel to send out. These don't have to be like that, but assume you have these microservices, and you want to implement the workflow.

If you do event-driven architectures or reactive, what happens a lot is the idea of sending events around. Just to give you one example, it could be something like the different microservices or components in your system publish events, they emit events. They're basically telling the world, "A new order was placed. The payment was received for this order. Some goods were fetched. The parcel was shipped." These are what's called domain events. Normally the idea is to just publish that to the world and you have no idea who's interested in that. That's the idea. You don't care who's interested in that, which is a good test, by the way. If you do care, it's probably not an event, but we have to discuss that later on. That's a good use case for event-driven.

As an example, you could build one component where we say, "Whenever something happens in my system, I want to send a notification to the customer," could be an email, could be a text, could be whatever. I put that into one component where I have all the [inaudible 00:38:24], how to send GDPR-compliant emails if the customer wants an email or whatever. I don't have to think about the idea of sending emails in all the other components. That's a good use case for event-driven because you just add that, and then you have all the functionality in that service.
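A minimal sketch of that notification component, assuming a tiny in-process event bus: `EventBus`, `NotificationService`, and the event names are hypothetical names for illustration, not something the talk prescribes. The point it shows is that the publisher never learns who is listening.

```python
from collections import defaultdict


class EventBus:
    """Toy in-process bus; a real system would use a broker or similar."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, event_type, handler):
        self._subscribers[event_type].append(handler)

    def publish(self, event_type, payload):
        # The publisher has no idea who, if anyone, is listening.
        for handler in list(self._subscribers[event_type]):
            handler(event_type, payload)


class NotificationService:
    """Single place that knows how to notify customers."""

    def __init__(self, bus):
        self.sent = []
        for event_type in ("OrderPlaced", "PaymentReceived",
                           "GoodsFetched", "ParcelShipped"):
            bus.subscribe(event_type, self.notify)

    def notify(self, event_type, payload):
        # Stand-in for sending an email or text message.
        self.sent.append(f"Order {payload['order_id']}: {event_type}")
```

Adding the notification component needs no change in checkout, payment, inventory, or shipment, which is why this is a good fit for events.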

What typically happens if you have a new tool, like event-driven, is that you apply it to everything. You can also implement that workflow with event notifications or event-driven, and then it works a bit like this. The checkout service says, "Somebody pressed the button. He wants to order that." He placed an order. Then payment could say, "If an order is placed, I'm interested in that." Then I collect the payment, and issue another event, "The payment was collected." Then inventory could say, "If there was an order paid, I'm interested in that." I collect all the stuff from stock, and then I'm done. Then shipment knows, "I can ship it out." This is what I call an event chain.
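The event chain can be sketched in a few lines. This is an illustration under assumptions: the handler functions, the event names, and the `SUBSCRIPTIONS` table are made up for the example. In a real system the "who listens to what" knowledge is scattered across the services; it is collected in one table here only to keep the sketch short.

```python
from collections import deque


def payment(order):
    order["paid"] = True
    return "PaymentReceived"


def inventory(order):
    order["fetched"] = True
    return "GoodsFetched"


def shipment(order):
    order["shipped"] = True
    return "ParcelShipped"


# Each service reacts to exactly one event and emits the next one.
SUBSCRIPTIONS = {
    "OrderPlaced": [payment],
    "PaymentReceived": [inventory],
    "GoodsFetched": [shipment],
}


def run(first_event, order):
    """Deliver events until nobody reacts any more; return the event trail."""
    trail, queue = [], deque([first_event])
    while queue:
        event = queue.popleft()
        trail.append(event)
        for handler in SUBSCRIPTIONS.get(event, []):
            queue.append(handler(order))
    return trail
```

Note that no function in this sketch describes the order fulfillment as a whole; the sequence only shows up in the trail after the fact.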

You basically have a couple of events implementing the overall workflow. I have a couple of issues with that, actually. I had that from the very beginning. The most important is that you have nowhere to go where you can see the overall workflow. There's not one single point in the system where you can really see the workflow, where it's implemented. It's emergent behavior. It's the pinball machine. Sometimes good, sometimes not. In this case, I don't like it, actually. I don't think it's a good idea. Very often when I brought that up, and I did that very often over the last years, it was like, "You're the workflow guy, you have that product. It's not on the slide, we can understand that you don't like it." That's not the issue. The issue is that, really, I think it's not a good idea to go that way.

Nowadays, it's easier. If you google your way around, there are some experiences with event-driven systems. A couple of people also wrote about that. One being Martin Fowler, who is kind of a credible source for that. He wrote in a blog post back in 2017, "The danger is that it's very easy to make nicely decoupled systems with event notification," that's why people want to do that, "without realizing that you're losing sight of larger-scale flow," exactly that was my gut feeling as well, "and thus set yourself up for trouble in future years." That's the risk, expressed by other people as well. This is what you should be aware of.

You can make an easy example. Anytime you want to change something in the sequence of things, let's say you want to fetch the item before you pay, then it gets nasty. Probably 5 to 10 years back, people normally said, "Yeah, but these processes, they are stable, they don't change that often." I think it's not true anymore. If you look at, for example, Amazon, there are use cases where they get your order from stock before you ordered it. Think about that. I love it. It's like if you have dog food, regularly ordering that, they know that you will order that within the next day, so they already get it to deliver faster. These kinds of changes are happening nowadays. It's not that infrequent to change the sequence of things.

If you want to do that, you now have to change inventory to listen to the order-placed event. Then you have to change payment to no longer listen to the order-placed event, but to the goods-fetched event, and so on, so forth. You have to do a couple of changes. If you look at that, you have to change all three services there. That's my biggest concern, actually: you not only have to change them, you have to deploy them in a coordinated fashion, because you have orders in flight all the time. If you're a good company, then you probably have orders all the time. That means you have to think about where an in-flight order ends up, and whether it was eventually paid or not. You get a versioning problem. That's not easy to handle in these situations. I'm not saying that event-driven is bad, it's just that this is a risk you should be totally aware of.
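The size of that change can be sketched by writing down which event each service listens to before and after swapping payment and picking. The two tables and the helper names are assumptions for illustration; event names are hypothetical.

```python
# Which event triggers each service, before and after "fetch before you pay".
BEFORE = {
    "payment": "OrderPlaced",
    "inventory": "PaymentReceived",
    "shipment": "GoodsFetched",
}
AFTER = {
    "inventory": "OrderPlaced",
    "payment": "GoodsFetched",
    "shipment": "PaymentReceived",
}


def services_to_change(before, after):
    """Services whose subscription differs between the two wirings."""
    return sorted(name for name in before if before[name] != after.get(name))


def listeners(wiring, event):
    """Services that react to the given event under a wiring."""
    return sorted(name for name, trigger in wiring.items() if trigger == event)
```

Here `services_to_change(BEFORE, AFTER)` names all three services. The versioning problem also shows up: a payment-received event emitted for an order under the old wiring would, under the new wiring, be picked up by shipment rather than inventory, so an in-flight order could be shipped without ever being fetched unless the rollout is coordinated.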

Very often people are not telling you about that so much. There was a time, and it's still sometimes out there, when people advocated for event-driven reactive architectures like this is the best thing since sliced bread. Normally they use this kind of picture, for a choreography, it's called. There is no central conductor telling everybody what to do, but they're professional dancers, they know what to do. You can add another dancer, it will be a beautiful dance. It's not exactly what I see happening out there. It's sometimes hard to manage, it starts to be chaotic. That's something you should at least be aware of.

Extract the End-to-End Responsibility

In that example, what I would do instead, there are two thoughts in one slide here. The first is I would extract the responsibility of the end-to-end order fulfillment into its own service. That's [inaudible 00:44:03] thinking that companies like Amazon or [inaudible 00:44:06], for example, a customer of ours, not having an order service, taking responsibility. Again, it's about responsibility, you need somebody who is really responsible for order fulfillment end-to-end. That makes a lot of sense. That's the first thought.

The second thought is, then you might still have event-driven communication in your system. It might make sense that the Checkout button says, "Somebody ordered, I don't care what happens." That's probably okay. Then order takes that, so that's event-driven. The next thing it does, it now wants to control the sequence of things. It says, "Hey, payment, you, collect the payment for me now." That's no longer event-driven. That's what I call a command.

Events & Commands

There are events you can send around, which are basically, "Something has happened in the past." What I said earlier on, the good test for whether that's an event, is that normally, as a sender, you don't care what people do with it. If the whole world ignores it, it should be fine for you. That's a good test if it's really an event. For a command, you have an intent, you want something to happen. If the other component ignores it, that's not okay. You want something to happen. In this case, then you can send a command, "Hey, payment, retrieve payment." Next thing, "Hey, inventory, book out the goods or consignment," shipment ships out stuff. That's a command. The important next thought is about events or commands. Very often people say, "Events, now you're talking Kafka. Command, now you're talking REST or gRPC." That's not true.
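A minimal sketch of that distinction, with an order service that reacts to one event and then sends commands in a fixed sequence. The class names, command names, and the boolean "was it handled" return value of `send_command` are all assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """A fact about the past; the sender expects nothing in return."""
    name: str
    order_id: str


@dataclass(frozen=True)
class Command:
    """An intent; the sender needs it to be carried out."""
    name: str
    order_id: str


class OrderService:
    """Owns the end-to-end sequence of the order fulfillment."""

    STEPS = ("RetrievePayment", "FetchGoods", "ShipGoods")

    def __init__(self, send_command):
        # send_command is a hypothetical callable returning True if handled.
        self._send = send_command

    def on_event(self, event):
        if event.name != "OrderPlaced":
            return []  # events we don't care about are simply ignored
        done = []
        for step in self.STEPS:
            if not self._send(Command(step, event.order_id)):
                # An ignored command is a failure, unlike an ignored event.
                raise RuntimeError(f"{step} was not handled for order {event.order_id}")
            done.append(step)
        return done
```

Changing the sequence now means reordering `STEPS` in one service, and the overall flow can be read in one place.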

It Is Not About The Protocol!

It's totally independent of the protocol. You can send events or commands with messaging. You can send events and commands via Kafka. You can send basically events and commands via gRPC or REST, it doesn't matter. Conceptually you have to think about the API you provide and how you want to use events and commands. Then you have, like in this case, what's called orchestration, because you tell somebody else to do something for you. You orchestrate it. Both are really valuable.
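One way to picture the protocol independence: the same message, marked as an event or a command, can be wrapped for a log-style transport or for an HTTP-style transport. The envelope fields and both wrapper functions are assumptions for illustration; they build plain dictionaries and do not call any real Kafka, REST, or gRPC client.

```python
import json


def envelope(kind, name, data):
    """Serialize a message; 'kind' says whether it is an event or a command."""
    if kind not in ("event", "command"):
        raise ValueError(f"unknown kind: {kind}")
    return json.dumps({"kind": kind, "name": name, "data": data}, sort_keys=True)


def as_log_record(topic, message):
    """Shape a broker-style transport might carry (hypothetical)."""
    return {"topic": topic, "value": message.encode("utf-8")}


def as_http_request(path, message):
    """Shape a REST-style transport might carry (hypothetical)."""
    return {
        "method": "POST",
        "path": path,
        "headers": {"Content-Type": "application/json"},
        "body": message,
    }
```

Whether the message is an event or a command is decided by the `kind` and by what the sender expects, not by which of the two wrappers carries it.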

For me, the important decision is that you really have to think about where you want to do the coupling. Whenever two components communicate, they're always coupled. It's not possible to totally decouple them, but you can decide how you want to couple them. It could be that the receiver knows about the sender, or about the event that they want to receive. Then the coupling is on the receiving side. Or the receiver doesn't know anything about who wants to retrieve payment, but the sender knows what command to send, and then it's on the sending side. You have to decide that. There's not a one-size-fits-all answer. I'm sorry for that, but you have to think about that for the situation at hand.

Extract Orchestration Logic

Last thought. Of course, you can express something like orchestration logic, again, with workflows. It fits nicely because very often if you're asynchronous, orchestration means you have to wait, so you have to be long running. That's where the puzzle fits again. The important thought here is that expressing this as a workflow, or as a business process, doesn't mean that it's something central. In each of your different services, you might have some orchestration going on. Between them, there's probably sending of commands, but probably also some choreography going on if they're just sending events.
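As a rough sketch of why orchestration tends to be long running: the order service has to remember where each order is while it waits for replies, so the state must survive between steps. The step names, the JSON persistence, and the function names below are assumptions for illustration; a workflow engine would handle this for you.

```python
import json

STEPS = ["RetrievePayment", "FetchGoods", "ShipGoods"]


def start(order_id):
    """New workflow instance, waiting for the first step to complete."""
    return {"order_id": order_id, "next": 0}


def save(state):
    """Persist the state while waiting (here: just a JSON string)."""
    return json.dumps(state)


def load(blob):
    return json.loads(blob)


def advance(state, completed_step):
    """Record a completed step; return the command to send next, or None when done."""
    if state["next"] >= len(STEPS):
        return None  # already finished; late replies are ignored
    expected = STEPS[state["next"]]
    if completed_step != expected:
        return expected  # duplicate or out-of-order reply: keep waiting
    state["next"] += 1
    return STEPS[state["next"]] if state["next"] < len(STEPS) else None
```

Between two calls to `advance`, the state can sit in storage for seconds or days, and a reply delivered twice does not move the order forward twice.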

Balance Choreography and Orchestration

I think about [inaudible 00:48:03] like that. If you have your IT architecture, you have a couple of services, APIs, components on top of that. It was a deliberate decision to make it round stacked on each other. These are your applications. Then I think you need two pillars for that. That's choreography and orchestration. At the moment, a lot of people are kind of either/or, "You need to orchestrate everything," or, "Orchestration is evil. It's from the past, you should choreograph everything." Surprise, both extremes don't work well. You need to balance both. If you don't do that, and that's what I see happening at the moment, you're probably ending up in the chaos bucket. I see that happening because people in the past, like 5 or 10 years back, with stuff like BPM and SOA, very often did it the other way around, and then they ended up in the monolith bucket. That's what they don't want to do. You have to balance it.

Recap

I tried to walk you through a couple of things. First, distributed systems are complex. I hope that's clear anyway. There are a lot of things you have to think about at least once, like retries and idempotency. That's super important. It's nothing that will go away quickly. You have to get your strategies in order to handle that. One important part of that strategy, I think, might be long-running services, or at least being able to do long-running services. That also makes it easier to go async, which I think is a good idea. If you do all that, if you have your hipster architecture in place, you also have to make sure you leverage it from a business perspective. Then last but not least, there's also this commands and events thing. That's pretty important.

 


Recorded at:

Jun 17, 2020
