
Event-Driven Patterns for Cloud-Native Banking - What Works, What Hurts?


Summary

Chris Tacey-Green discusses the shift from synchronous commands to asynchronous events within highly regulated environments. He explains the critical role of Inbox and Outbox patterns in preventing data loss, the nuances of event versioning, and how to maintain decoupling between domains. He shares "battle-tested" principles for implementing fault tolerance and managing eventual consistency.

Bio

Chris Tacey-Green is Head of Engineering at Investec, previously a founder, and a builder at heart. He has deep experience across cloud, ML, AI, and the engineering surrounding it all.

About the conference

The InfoQ Dev Summit Munich software development conference focuses on the critical software challenges senior dev teams face today. Gain valuable real-world technical insights from 20+ senior software developers, connect with speakers and peers, and enjoy social events.

Transcript

Chris Tacey-Green: Have you built, or just been part of building, an event-driven architecture? Was that event-driven architecture in the cloud? Was it in some kind of highly regulated industry, banking, healthcare? You can judge for yourselves what counts as regulated. Aviation, we'll put that one down as regulated. I'm going to talk through the foundations and the principles first. Anyone who hasn't worked with event-driven architectures, you'll be fine. We'll go through the foundations. We'll all get up to speed so that we can get into the detail. If you have done event-driven architectures before, that's fine. These might just be reminders. There might be some things that actually you haven't considered before.

Once we've done the foundations, we'll then go on to why do we want to do event-driven architectures in this highly regulated environment? Why would we put ourselves through that? What are the benefits? Then we'll get into what hurts. What are the challenges that you need to be considering if you're building systems like this? I'm not just going to leave you with the pain, we will go through what helps as well.

Foundations

Foundations first. If we take our title, event-driven, cloud native, banking, we'll break that down and we'll define each part. An event, essentially, is a change in state somewhere in the system. That could be caused by a user's action, an asynchronous background task, or an external entity, an external system to the platform that you're building. It may carry data, and we might call that a fat event. Or it might simply be a notification, which would be a thin event. I haven't made the dietary definition of an event up. This is something that's discussed out in the world. There's actually quite a famous paper that talks about putting your events on a diet. I would tend to aim to keep your events lean.

Essentially, all of the data that pertains to the event, put it in there. Anything else, don't do it. Yes, you do have these levels of an event. Before we get on to anything else, I am quickly going to discuss commands versus events. This is because this is a conversation I get into time after time. If you build an event-driven system and then you start pumping commands around it, you're not getting all the benefits that you would like to get from an event-driven system. Actually, you're going to screw yourself over in the future. Very simple differentiation. A command is me saying, I want something to happen. I'm explicitly asking you to do that thing. I'm going to wait because I'm expecting some result. Even if it's asynchronous, I'm expecting a result.

An event is me shouting into the world saying that something happened. I'm not expecting anything to happen off the back of that. In fact, I'm not necessarily expecting anyone to be listening to me. I could be shouting into the ether. No one's subscribed to that event, and that is fine. This differentiation comes up a lot. Get it burned into your brains if you've not worked with these architectures before. Understand what each thing is and when to use which one.
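
As a rough sketch of that differentiation, in Python, with hypothetical names: a command is imperative and addressed to someone who is expected to respond, while an event is a past-tense statement of fact, published with no expectation that anyone is listening.

```python
from dataclasses import dataclass

# A command: "I want something to happen." The sender expects a result,
# even if the processing is asynchronous. Name and fields are illustrative.
@dataclass
class InitiatePayment:
    payment_id: str
    debtor: str
    creditor: str
    amount_minor_units: int  # e.g. pence

# An event: "something happened." Past tense, fire-and-forget; it is fine
# if nobody is subscribed. Name and fields are illustrative.
@dataclass
class PaymentInitiated:
    payment_id: str
    occurred_at: str  # ISO-8601 timestamp
    debtor: str
    creditor: str
    amount_minor_units: int
```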

We know what an event is. It's a change in state. Therefore, an event-driven architecture is quite simple. It's where we combine multiple systems that are reacting to events. It tends to consist of producers, systems that publish events, and consumers, systems that receive events. Nice and simple. A quick note on event sourcing. This tends to end up in the same conversations as event-driven architectures. When you talk to people and you talk about event-driven architectures, a lot of them will think of event sourcing. These are not the same thing. Please spread this to your teams. You do not need to do event sourcing in order to do an event-driven architecture. Event sourcing is actually how the state of your application is represented. It's represented as an immutable sequence of events. If we considered a shopping cart online, if we weren't doing event sourcing, we might represent the state of that shopping cart as, I have four hats in my shopping cart.

If I went to look at my state at my database, I would see a record there that says, hats times four. Event sourcing, the state of my shopping cart is represented, probably in this case, as four events. You would have four records, and each one represents me adding a hat to the shopping cart. In order for me to know the state of an application when it's event sourced, I need to play back those events in order to know, at this point in time, there are four hats in my shopping cart. It's a complicated pattern to apply. I've seen people really struggle with understanding it. It takes people time to learn it. Understand that you do not need to do event sourcing to do event-driven architectures. The reason they come hand-in-hand is that if you have done event sourcing, adding that little extra bit of subscribing to an event is much easier. That's why they tend to come hand-in-hand, but you do not need to do this, and understand there are dragons here.
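
A minimal sketch of that difference, using the shopping-cart example from the talk with illustrative event names: under event sourcing the four-hat state is never stored directly, only derived by replaying the immutable event sequence.

```python
# The event-sourced representation: four records, one per hat added.
events = [
    {"type": "HatAddedToCart"},
    {"type": "HatAddedToCart"},
    {"type": "HatAddedToCart"},
    {"type": "HatAddedToCart"},
]

def replay(events: list) -> int:
    """Derive current state by playing the events back in order."""
    hats = 0
    for event in events:
        if event["type"] == "HatAddedToCart":
            hats += 1
    return hats

print(replay(events))  # 4 -- "at this point in time, there are four hats"
```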

Next up, cloud native. Essentially, designing, constructing, and operating workloads in the cloud. Technically, the cloud can host any operating model. We could spin up virtual machines. I could SSH into a virtual machine, copy a zip file over, and run a manual service on that VM. Actually, when we're talking cloud native, we tend to be talking about doing modern engineering practices. Highly scalable. I have put microservice-based in here, although I realize there are other approaches to problems; modular monoliths exist, and they are good patterns to use. These are systems that would be deployed using modern DevOps principles and CI/CD practices.

Hopefully, cloud native is something that you guys are already fairly comfortable with. Our title was Event-Driven, Cloud Native, Banking. We've got one more thing left to define, and that is banking. These are large, slow, highly regulated organizations that promise to keep your cash safe under their mattress so that you don't have to put it under yours. They also tend to be terrified of any of the modern principles that we've just been discussing. Lots of them will use fax machines as integration mechanisms. Fortunately, Investec, the bank that I work for, is not one of them. We are a pretty modern, agile organization, and so we've been doing a lot of more modern engineering practices like the ones that we're going to be talking about today. There's our foundation set.

Why Eventing?

Let's get into the details. Why do we want to do event-driven things? There are actually many different reasons. I've picked out a few because they apply to real situations, real use cases that we have had to solve for at the bank. You can read online about many different benefits and drawbacks of event-driven architectures. I'm just going to pick out some that make sense. Decoupling is an obvious one once you get into using events. We have a very real use case here of transaction monitoring at a bank. Transaction monitoring, essentially everything that happens on a client's account, we need to be paying attention to, monitoring, looking for anything that's strange.

If you think about times where you've traveled to a new country, we want to be able to see that and work out if that's something abnormal or if that's something that we would expect of you as a client. To solve for transaction monitoring, they need lots of data from our payment system. We've got two options here. We could couple the two things. Payments at a bank is a very important thing. Highly regulated. PSD2, if you want to go read about that regulation, it's a lot of fun. Payments is crucial and we have to build it with reliability at core. Transaction monitoring is not something that has to come in a payment flow. It's something that happens behind the scenes. You do fraud checks on a payment, but you don't necessarily have to monitor transactions actively in order for a payment to go out the door.

If we couple these two things, and we've got two ways of coupling, we could either determine that payments has to hit an API on transaction monitoring. It's going to push that data to that service. Or maybe we determine that transaction monitoring, it's its responsibility, so it's going to pull from an API on payments. Either way, we are now coupling these two systems. We're coupling two systems that actually should be independent of one another, have very different reliability expectations, very different expectations from the organization. Not ideal. By moving to an event-driven architecture, we can split these two things.

In the decoupled version, you can see that payments has no idea that transaction monitoring exists. Payments gets to focus on its flow and it pumps out, publishes events. In this case, I've called out two, that payment was initiated, which has some data about the location of the user, the channel that they were coming in on, the creditor, the debtor. We also pump out an event to say that the payment was processed. Which gateway did it go down? Transaction monitoring now gets to be independent. It gets to look at the event stream of payments and say, for my use case, to monitor transactions, I'm going to pull these two events. In the future, it could pull new events. It could go and find new data that it wants to use. It completely decouples these two things. Now transaction monitoring can go down without taking payments with it. Decoupling, very important benefit.
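
A sketch of what that decoupled consumer side might look like; the event names, topic, and fields are assumptions for illustration, not Investec's real contracts. Transaction monitoring picks only the events it needs from the payments stream, and payments never knows it exists.

```python
def check_location_and_channel(data: dict) -> None:
    ...  # hypothetical monitoring rule on location, channel, creditor, debtor

def record_gateway_outcome(data: dict) -> None:
    ...  # hypothetical monitoring rule on the gateway the payment went down

# Transaction monitoring's own choice of events; it can add more later
# without payments ever changing.
HANDLERS = {
    "PaymentInitiated": check_location_and_channel,
    "PaymentProcessed": record_gateway_outcome,
}

def on_event(event: dict) -> None:
    handler = HANDLERS.get(event["type"])
    if handler is None:
        return  # not interested in the rest of the payments stream
    handler(event["data"])
```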

The second benefit is an immutable activity log. Before we moved to an event-driven architecture for payments, we obviously had payments running through the organization, but it was hard to know where a payment was in all of the many flow points in a bank. You have lots of different things that happen. Fraud checks, sanctions. You choose gateways. Actually, payment gateways themselves tell you lots of different things as responses when you've sent a payment out. We were struggling to see that. When we moved to an event-driven model, we now had this immutable activity log of the events that were powering a payment. That's the crucial bit. It wasn't some audit log off the side. It wasn't us explicitly messaging logs to a log aggregation that we then needed to correlate.

The events we saw, we trusted, because that's how the system was running. Now actually here, I've picked out a couple of events. There's way more in the flow. We can now see, as a business, with very nicely business-oriented event names, which is something that's important to do within your domain design, we can see that a payment was initiated. We can see that a fraud check was completed. Or maybe we can see that a fraud check, actually, we're still waiting on that. It's fallen into a manual operational process where someone's needing to do additional fraud checks on something. Huge benefit that you get from using events to power your system. Again, very real thing that we have running in production.

The third one, fan-out. I haven't called it out, but there's also fan-in as the alternative. If we take another payment-related one, where off the back of a payment, we need to do two things. We need to update our payment limits. We need to say, for example, you might only be able to spend £10,000 a day. We'll have that limit assigned to you as a client, and we need to update your payment limits when you've made a payment so that we know where are you against that limit. We also want to send comms. Maybe a push notification, an SMS, an email, a pigeon to say that your payment has completed. Without an event-driven architecture, we of course can solve for this problem. We can do these things. We end up wrapping them together. We end up saying, ok, so we need to update the payment limits, and we need to send comms.

If payment limits fails, we need to handle that failure somehow. Do we wait for that until we send out our comms? Do we still send out our comms, and then we go and fix the payment limits issue? We can avoid all of that by having a simple event fan-out. One single event that says, this payment was processed. Then, two independent processes. Actually, in reality, there'll be way more than this. Two independent processes that go off and do the things that they need to do. Client comms does not care about the payment limits service. It shouldn't need to, and it can just work independently. We'll get onto fault tolerance, but it also means that each of these can handle their faults, their retries, their fallouts independently of one another.
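
A minimal in-process model of the fan-out idea; in production each subscriber would be its own service with its own retries and failure handling, and the broker, not a loop, would deliver the event to each independently.

```python
subscribers = []

def subscribe(fn):
    subscribers.append(fn)
    return fn

@subscribe
def update_payment_limits(event):
    print(f"limits: counted {event['amount_minor_units']} towards the daily limit")

@subscribe
def send_client_comms(event):
    print(f"comms: telling the client payment {event['payment_id']} completed")

def publish(event):
    # Stand-in for the broker: each subscriber gets the event independently
    # and neither knows the other exists.
    for fn in subscribers:
        fn(event)

publish({"payment_id": "p-001", "amount_minor_units": 250_00})
```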

Fault tolerance, a huge benefit of an event-driven architecture. When we're talking about highly regulated industries, we have to be tolerant to all faults. We are not talking about some IoT processing, or big data analytics, or anything like that. We are talking about vital things that must happen. In this case, and again, very real situation, I'm not going to mention specifics about the fraud engine. We have a fraud engine. It's an external vendor, who have some reliability issues. We can't fix those reliability issues. We didn't build that software. We do need to be able to handle them. With an event-driven architecture, we have three places that we can handle faults. You can customize these however you like, based on the domain and the use case that you're solving for.

The first level is that transient box. Actually, this is no different to in-process retries that you'll probably all have written in your code. If you think about Polly in .NET, where you just define: we're happy to retry five times. We'll add a bit of jitter. We'll wait a couple of seconds. Hopefully that transient issue, that network issue, is solved, and our request goes through. No different to normal. The only additional benefit you get is that because it's event-driven, what this actually means is that this is asynchronous. This is eventually consistent. You might be able to extend out those transient retries a little bit longer than you might otherwise have.
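
A sketch of that first level, with all numbers illustrative; Polly users will recognize the shape. Because the flow is asynchronous and eventually consistent, the waits can be longer than you would dare in a synchronous request path.

```python
import random
import time

class TransientError(Exception):
    """Network blips, timeouts: anything worth retrying in-process."""

def with_transient_retries(call, attempts=5, base_delay=2.0):
    """Retry a callable with linear backoff plus jitter."""
    for attempt in range(1, attempts + 1):
        try:
            return call()
        except TransientError:
            if attempt == attempts:
                raise  # give up; the eventing tech's backoff takes over next
            time.sleep(base_delay * attempt + random.uniform(0, 1))  # jitter
```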

The second level we now get, so fraud engine's still down. Our transient retries are still failing. We can now actually back off to our eventing tech. It really doesn't matter which cloud-native eventing tech you're using. It could be Kinesis. It could be Azure Event Hubs. It could be some managed Kafka instance. It doesn't matter. You can configure this thing on all of them. This is where you would say, we know things are problematic, but we're still going to retry, but we're going to back off a bit more. We can back off for whatever the organization is happy for us to back off to. Until we eventually say, things have gone really bad, we need to dead letter this thing. Dead lettering is pretty important, mainly for the problem of poisonous messages, poisonous events.

If some naughty person pumps out an event into your system, and it breaks your eventing contract, or it has bad data that just cannot be processed, you need a way for that to escape from your architecture eventually. Otherwise, it will continue retrying forever, and you'll have some fun screwing around in databases to fix that. We have to dead letter. We have our third level of fault tolerance there, where we will alert some human. We will wake someone up at 2 a.m., and they will have to go and look and replay that event, if they determine that they want to replay that event. It's not a poisonous message. Fault tolerance, huge benefit of event-driven architectures. We have really benefited from that within our highly regulated use cases.

The fifth one that I'm going to call out is plug and play. The example we're talking about here is the build-out of a new capability: rewards. We want to offer rewards to you, and we need to actually build out that capability. We don't have it. With some mature platforms, like payments, accounts, client, where they're now publishing well-defined, ideally domain-designed events out into the world, we actually have a really nice benefit here: rewards might be able to be built without bugging any of them. If the events are good, we can slot this capability in. It just needs permissions to those events, permissions to the event streams, and it now knows when a client's onboarded. It now knows when an account is created. It knows when payments are processed. Once you reach this level of maturity with your event-driven architecture, you can plug in new capabilities really nicely.

What Hurts? What Helps?

We're going to get into some of the pain now. What hurts? Yes, I will talk about what helps. The first one is not a tech problem. Event-driven architectures are hard for people, mainly people who have not yet worked on those architectures before. This is hard, and we see it. We see it in our architects and engineers who needed to learn these new concepts, needed to learn these new patterns. We've seen it with new joiners who, in one of our spaces where we had event sourcing as well as event-driven architecture, it took about six months for a new joiner to get to the point where they were delivering at the same pace as the engineers in that team already. That is a very real consideration.

I think we very easily look at the technical tradeoffs. This is a real organizational impacting thing that people are going to be slower to deliver. It's a different paradigm when you're designing those solutions. When teams just step into this world, they may almost forget that they have a different paradigm, and they'll start solving for problems they don't need to solve for. They might forget about the problems that they really do need to solve for now, that is, eventual consistency, the fault tolerance that we've talked about. Some of the other things that I'll get onto. Your people will find it hard, and that should not be ignored.

What helps? There are things that help. Hopefully, you have a developer platform. If you don't, just quickly create one of those. Hopefully, you have some concept like paved roads in your organization. Get event-driven artifacts into your developer platform; these look like service templates. As an engineer, I can now step in and go, here's a good-shaped template of an event-driven microservice. That will help. That will help people get started quicker. Application modules that take away a lot of the problems that we'll talk about coming up, so that not every single engineer is having to solve the same problem over and over again. Do that and do it early. We did, and we did find it much easier for multiple teams to start building out these architectures. Your developer platform can have all these lovely artifacts, but you do still need to train your people. In fact, it's a dangerous world if you've given them the keys to developing and smashing an event-driven system into production, but you've not focused on training them.

At 2 a.m., when that thing falls over, they're not going to have any idea what the lovely magic that you've written into that developer platform does. We need to train them. We had an enablement team, and we paired that enablement team with a delivery team who essentially came to us and said, we keep seeing people building event-driven architectures. We'd like some of that, but we've never done it before. We managed to book out an entire week with that team and our enablement team. This doesn't really scale, but it really did work. We ran through some training materials, probably some stuff that's similar to what we're going to be going through today. We also designed and built an event-driven system, a very small one, in their space, that actually ended up in production.

By the end of the five days, it wasn't quite in production, but they had a working system. I'm calling that out because I think it's a shift from how some of us think about training materials. This wasn't just a, please go and read a bunch of documentation, or go and watch a video. We sat with them. We taught them some stuff. We then got in, designed their system with them, built the system with them, found the problems, solved them. That team are now off confidently building event-driven systems.

Third callout, aligning on standards and principles across the estate. The earlier you do this, the better. It doesn't need to be that you define everything, but define things like your event contracts, the permissions model that you want on your event streams, and ideally what technology drives those event streams. All of those things, write them down, agree on them, so that when you go off and you want to consume someone else's events, it's not completely different to the other system that you've already consumed events from. If you end up in that world, you're never going to find pace.

My prompt for this image was myself and my clone spending lots of my money. Second pain, two ends of a spectrum here: duplicating events and losing events. Both are things that, in highly regulated industries, including aviation, we do not want. If you go off to pay your rent and we just happen to lose that event, your landlord never gets your rent payment. You're not going to do very well off that.

Alternatively, you go off and buy your new property. You put down your deposit, and if we pay it twice, you're going to be pretty angry. We cannot handle this. It's a callout that's important because in some event-driven architectures, again, with big data, analytics, IoT devices, this may not be a problem. You can afford to lose an event every 100,000 events. We can't. We are a bank. We cannot have that happen. This requires design and build upfront. You cannot leave this until later. You will hurt yourself. What helps? Two things: inbox patterns and outbox patterns, and building both of those into that developer platform, into the frameworks that we've just talked about. Build them in immediately so that people get this stuff for free and they don't stumble into paying two deposits.

What's an inbox pattern? What's an outbox pattern? Let's look at the outbox first. An outbox pattern protects you from losing events when you publish them. In our example here, we're looking at onboarding. We're onboarding Aurelia. When we're making that modification to the client's table, we're saving that state. We draw a little transaction around it along with an outbox. In that outbox, we put the event that a client was onboarded with our unique ID. We now know that we have updated our state and published an event at the same time within the same transactional boundary. Without that, you can very easily end up in a situation where you've updated the state, we know Aurelia exists, but then something falls over when we try to publish that event. Not going to be good. We're going to lose any of the benefits that we wanted to get from our event-driven architecture. We save that record to our outbox, and we then just need a little dispatcher pattern.

The dispatcher goes off, maybe it's just polling that outbox table, that's fine, and it's going to take that event and actually publish it onto whichever technology you've chosen: Kafka, Kinesis, Event Hubs, whatever it is that you've decided to use. Fine, using an outbox, we've now protected ourselves from losing events. Crucially, we haven't actually protected ourselves from duplicating events. That dispatcher could still publish something twice. Not ideal. Also, the eventing technology that we publish to may just do some kind of at-least-once delivery and we're going to end up getting duplicate events. We still need to handle those. That's ok, because we're going to use an inbox.
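
Before getting to the inbox, here is roughly what that outbox plus dispatcher could look like as a minimal sketch, with sqlite standing in for the service's real database and all table, event, and function names illustrative. The point is that the state change and the outbox record commit in one transaction, so neither can exist without the other.

```python
import json
import sqlite3
import uuid

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE clients (id TEXT PRIMARY KEY, name TEXT)")
db.execute("CREATE TABLE outbox (event_id TEXT PRIMARY KEY,"
           " payload TEXT, dispatched INTEGER DEFAULT 0)")

def onboard_client(name: str) -> None:
    client_id = str(uuid.uuid4())
    event = {"event_id": str(uuid.uuid4()),
             "type": "ClientOnboarded", "client_id": client_id}
    with db:  # one transactional boundary around state change + outbox write
        db.execute("INSERT INTO clients VALUES (?, ?)", (client_id, name))
        db.execute("INSERT INTO outbox (event_id, payload) VALUES (?, ?)",
                   (event["event_id"], json.dumps(event)))

def publish(payload: str) -> None:
    print("published:", payload)  # stand-in for Kafka/Kinesis/Event Hubs

def dispatch_once() -> None:
    # The dispatcher polls the outbox and publishes anything undispatched.
    # It can still crash between publish and update, so duplicates remain
    # possible -- which is exactly why the consumer needs an inbox.
    rows = db.execute("SELECT event_id, payload FROM outbox"
                      " WHERE dispatched = 0").fetchall()
    for event_id, payload in rows:
        publish(payload)
        db.execute("UPDATE outbox SET dispatched = 1 WHERE event_id = ?",
                   (event_id,))
    db.commit()

onboard_client("Aurelia")
dispatch_once()
```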

The inbox is on the consumer side, where we're now receiving that client-onboarded event. Rather than going off and doing our business logic, dealing with that event however it is that we're dealing with it, where we could fail for real business validation reasons or we could run into some transient issue that we've talked about before, no. We immediately pump that event into an inbox. The inbox just states: here's the ID of the event, here's the data, we received it. Off the back of that, you then go and do your business logic. Fine, perfect. What this avoids now is if our at-least-once delivery eventing tech pumps out the same event, that's absolutely fine. We're protected by our inbox. We're going to check the ID and say, I've seen that event before. I'm not doing it again. We're nicely protected.
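
A minimal inbox sketch under the same illustrative assumptions; the primary key on the event ID is what makes the duplicate check cheap and atomic, so the handler runs at most once per event ID.

```python
import sqlite3

inbox = sqlite3.connect(":memory:")
inbox.execute("CREATE TABLE inbox (event_id TEXT PRIMARY KEY, payload TEXT)")

def handle_client_onboarded(event: dict) -> None:
    print("doing business logic for client", event["client_id"])

def receive(event: dict) -> None:
    try:
        with inbox:  # record the ID first, inside a transaction
            inbox.execute("INSERT INTO inbox VALUES (?, ?)",
                          (event["event_id"], str(event)))
    except sqlite3.IntegrityError:
        return  # seen this event ID before: ignore the duplicate
    handle_client_onboarded(event)  # business logic runs after the inbox write

evt = {"event_id": "e-1", "client_id": "c-42"}
receive(evt)
receive(evt)  # at-least-once redelivery: silently dropped
```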

The third painful thing to deal with is breaking event contracts. We talked about coupling, and how, by using events, we can decouple systems. However, you are coupled by your events. These are a contract that you have promised to the world. Crucially, you can't take them back.

One of the things about event-driven architectures is that you'll publish your events onto an event stream, and it's an immutable event stream, and it goes back all the way to the beginning of time. Someone has the right to go back to the beginning of time and replay all of those events. Once you have published something, some data point on your events, you don't get to take it back. Consumers failing because of a change in that event data is a really painful remediation process. We talked about events being immutable. They're not immutable if you're off editing events in your datastore or in your event stream. It's a real thing that I've seen companies doing. Please don't do it. Please don't put yourself in that pain. We need to really care about our event contracts here.

What helps? I find the thought of your event being like an API contract helps people. Probably just because we're more comfortable with what API contracts are and how important it is not to bring in breaking changes. Consider your events like you would consider your APIs. Design them carefully. Be aware that any property that you put on that contract is out there in the world. If you want to remove it, that's a breaking change. Ideally, avoid those breaking changes if you can. If you can't avoid them, version them like you would an API.

On a REST API, you would bring in a v2 if you can't remove that breaking change. You can do that same thing with events. Something that you'll see on some event standards is the concept of a data version property. Just some metadata that you put on your event that says, this is actually version two of this event. Now, what this allows your consumers to do, and you can just imagine it as a really simple if-else statement in their code, they now can check that. They can see, ok, for v1, we're going to do this event handling. For v2, this property has been removed, or the data type's changed, or the whole structure of this event has changed. Ideally not, that would almost be a different event. We can go and branch off and handle that event differently. Now we can replay from the beginning of time because we've got v1, v1, v1, v1, v2. Fine, safe.
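
That if-else might look something like this sketch; the v1 and v2 shapes are invented for illustration, but the pattern is the data-version branch described above, and it keeps replay from the beginning of time safe.

```python
def handle_payment_initiated(event: dict) -> str:
    version = event.get("data_version", 1)
    if version == 1:
        return event["data"]["channel"]            # original flat shape
    if version == 2:
        return event["data"]["origin"]["channel"]  # v2 nested the field
    raise ValueError(f"unknown data_version: {version}")

# Replaying v1, v1, v1, v2 from the beginning of time now works.
print(handle_payment_initiated(
    {"data_version": 2, "data": {"origin": {"channel": "mobile"}}}))
```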

The other thing that will help is separating your domain and integration events. I'll go into what I mean by domain and integration events. I have a little picture. You will thank yourself later. Essentially, if we draw our bounded context, we draw our domain, you may well have an event-driven architecture within your domain. If we take payments, you may well have some internal events in that domain. That's cool. The key thing is to model your integration events, the events that tie multiple domains together, model those differently. Because this allows you to protect yourself from bleeding domain concepts out that you will then be tied to. You've now contractually said, I'm accidentally pumping this domain concept out, and you're now consuming it, and I want to change my domain, and I can't. We'll get into that in a little bit more detail.

The fourth thing that needs considering, though it's not necessarily an immediate pain, is event ordering. Unless you explicitly configure it, cloud-native eventing tech does not care about the order of your events. They're usually built for scale. They'll make massive statements, like, yes, we can handle a million events a second. That's because they don't care about the order of your events. Your retries don't either. We talked about fault tolerance. You could be retrying. You could be backing off. You're playing these events through independent of other events. There is no ordering within your technology, within your architecture. You can introduce it, but just know that it carries more risk. Allowing a client to make two $1 million payments because we hadn't updated our balance yet is slightly career-limiting. We should not do that. There is more risk as soon as we require ordering on our events. That's not to say it's a no-go. We have two approaches that you can follow for event ordering.

Firstly, we can bring in an order. We can stamp our events with a version property, which, for our first event of an aggregate, it would say this is version one. For our second one, version two, version three, version four. You can call it whatever you like. I like to call it version. That can enforce ordering. Within your inbox pattern then, you can add code to that lovely framework that everyone's getting and everyone's using. You can add code that checks for ordering. For our aggregate, let's say an aggregate is our shopping cart. For our shopping cart, we now have ordering, and we know that event one, event two, event three, event four for adding those hats. Maybe we care about the order that those hats were added, in which case, within our inbox, within our ordering check, we say, for this shopping cart, I haven't seen version one event yet.

Therefore, I'm not going to process event two. I'm going to back off. I'm going to go back to the event stream. Hopefully, eventually, we're living in an eventually consistent world here, event one comes in, we process it. We then get back to retrying event two, and we say, I've seen event one, I'm going to go off and do it. The key thing here is that it is less scalable if you bring in that kind of ordering, because you've just seen what's happened. We've now essentially built a queue into our event-driven architecture without using queuing technologies. It will scale less. It will still work. We have very real implementations of this within the bank where we need that kind of ordering, and it works. You just have to be aware of the impact on scale.
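
A sketch of that ordering check inside the inbox, with the per-aggregate version table kept in memory for brevity; a real implementation would persist it alongside the inbox, and the exception is the signal for the eventing tech to retry later.

```python
last_processed: dict = {}  # aggregate_id -> highest version handled

class NotReadyYet(Exception):
    """Raised so the eventing tech retries this event later."""

def handle(event: dict) -> None:
    print("processed", event["aggregate_id"], "v", event["version"])

def process_in_order(event: dict) -> None:
    agg, version = event["aggregate_id"], event["version"]
    expected = last_processed.get(agg, 0) + 1
    if version < expected:
        return  # already handled; the inbox would normally catch this
    if version > expected:
        # e.g. v2 arrived before v1: back off to the event stream and retry
        raise NotReadyYet(f"{agg}: waiting for v{expected}, got v{version}")
    handle(event)
    last_processed[agg] = version
```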

The second option here is that you introduce implicit ordering, where your domain handles the types of events that it can process without necessarily saying, I'm going to process them one after another. It may well be that, in this example, we can't pay a beneficiary until we've seen that the beneficiary was created. That makes sense. We don't know the beneficiary details yet. We've implicitly added ordering to our system by having that domain validation. We haven't had to stamp the events. This is a very real approach to the problem. We have a platform within the bank that does this, and it works absolutely fine. They've never needed to introduce ordering version stamps on their events. Two very valid options.
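
A sketch of the implicit version, with hypothetical event names; there are no version stamps at all, and the domain validation itself provides the ordering by refusing to pay a beneficiary it has not seen created yet.

```python
known_beneficiaries: set = set()

class NotReadyYet(Exception):
    """Raised so the eventing tech retries this event later."""

def on_event(event: dict) -> None:
    if event["type"] == "BeneficiaryCreated":
        known_beneficiaries.add(event["beneficiary_id"])
    elif event["type"] == "PayBeneficiaryRequested":
        if event["beneficiary_id"] not in known_beneficiaries:
            # Eventual consistency: BeneficiaryCreated has not arrived yet,
            # so back off and let the retry bring this event round again.
            raise NotReadyYet("beneficiary details not known yet")
        print("paying beneficiary", event["beneficiary_id"])
```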

Summary

Good timing. Let's bring it all together with a very scary number of boxes. I'm not expecting you to immediately understand this, but I want to try to bring all of the concepts that we've talked about, and I appreciate there was a lot, into a very real, again, banking use case where we've got payments and we've got communications. We talked about them before. We now have our domain and integration events. We have our two domains, payments and communications. Let's follow the flow through. We have our API. Someone has gone to create a payment. It flows through into our outbox. We have our outbox to avoid losing the event. We save the payment to our payments database, and we also save our domain event to our outbox. Maybe we have some internal domain event handling. I'm not going to go into any more detail on that, but maybe we do. Maybe we need to do some stuff. That's fine. We have an inbox on that event handler that avoids processing that event multiple times. Perfect.

Our event is called something really funky because we own our domain, and so we've been really verbose with our event naming. SwiftFPSPaymentProcessed, sweet. We'll name it whatever we like because we now have our integration event publisher. This is where I was talking about the difference between domain events and integration events. You build this into your service template so that people will just have this available to them, and they won't accidentally bleed this very specific event out into the world. We have our publisher.

Our publisher does three things. It filters, it aggregates, and it transforms. Not all domain events will become integration events. Fine. Sometimes you might have a fan-in where multiple domain events become just one integration event out into the world. Crucially, transformation, where we say our domain event has all of these properties, and we only want to publish these. Perfect. We've got our nice protection here. Some people might see some similarities here with ACLs, anti-corruption layers. You are protecting your boundary. We pump out our payment processed event. Look, we removed all of our silly domain language. We now just know that a payment was processed. Into our other domain now for communications, we have our integration event handler. It handles integration events, well-named. We handle our payment processed. We have another inbox. Lovely. We have some, maybe, filtering, aggregation, transformation here too, to move into our domain events.
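
Roughly what the publisher's filter and transform steps might look like in miniature; the aggregation (fan-in) step is omitted, and the field choices are invented for illustration.

```python
def to_integration_event(domain_event: dict):
    # Filter: most domain events stay home inside the payments domain.
    if domain_event["type"] != "SwiftFPSPaymentProcessed":
        return None
    # Transform: translate the verbose domain name and expose only the
    # properties we are willing to be contractually tied to.
    return {
        "type": "PaymentProcessed",
        "payment_id": domain_event["payment_id"],
        "occurred_at": domain_event["occurred_at"],
        # scheme codes, gateway internals etc. deliberately not exposed
    }
```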

For brevity, I didn't want to fill this with the same thing, but you have the same thing here as you had in the other domain. We then go off and do our work after the inbox has protected us from sending multiple SMSes to you. Again, we can follow that same flow. SMS delivered gets transformed to communication sent out into the integration events. I'm not expecting you to immediately be able to go off and build this, but these pieces are all there deliberately. This is how you can do event-driven architectures within a highly regulated industry, with all of these protections in place. As I stated, by building this stuff into our developer platform, teams haven't needed to solve these problems. They've not needed to run into these issues at 2 a.m. We've managed to build quite a few things, quite a few platforms, in the cloud, in Azure.

Questions and Answers

Participant 1: On the previous slide regarding the ordering, where you stamp events with a version, how do we define what version to assign to these events? It might be based on a database where we are storing a record version, reading it, and doing plus 1. Or maybe a timestamp-based version. Both of them, in my mind, have some pros and cons. If it is time-based, how do I know there is a previous event which is not yet processed? If it is database-based, then if I'm doing two reads simultaneously, plus 1 and plus 1, those will have the same versions.

Chris Tacey-Green: Both of those approaches are done out in the world. People have fights about this online. We have followed the version approach, the 1, 2, 3, 4, 5, 6, 7, with the understanding that you've now introduced that issue with competing writes. What that means is that you're probably going to have to have some unique index on your database that says, for this aggregate ID, we can't have duplicate versions. We can't have two events with the number 2 as their version. We know what that means. It will scale worse. You can introduce it with that.

Participant 2: I'm in a very similar situation as you, also doing event-driven architecture, also in the bank. We are actually SOX compliant. Our auditor is requiring us to prove completeness on our event stream. Which is, of course, very much a batch concept, but now being put on an infinite stream. I'm having a hard time even grasping it philosophically. Have you encountered this? Any suggestions or tips how we might prove completeness on this?

Chris Tacey-Green: Maybe we're lucky, but I haven't been asked that question before. We are audited in the same way. We've not had the question over that. We've been able to just show that we have an immutable log of our events. We can show that they haven't been tampered with. We've not needed to go any further than that.

Participant 3: In a similar industry, where, for example, privacy, compliance, and auditing are a main priority, we cannot afford duplication. At the same time, with an inbox pattern, for example, the lookup in the inbox table is very expensive for us. What should our tradeoff be?

Chris Tacey-Green: I think the tradeoff is probably, as you just stated, if you really can't afford to have duplicate events, you have two options. You either do the inbox, or, if you really can't follow an inbox pattern, you're now relying on idempotency in the actions that you're doing off the back of that event. If all of your downstreams have solid idempotency logic, you can add your X-Idempotency-Key header to every single API call that you're making. You know that they're implementing that. Then you could probably get away with not having an inbox. Without both of those, there's just risk. You're going to have to manage that risk. There's no magic bullet, unfortunately. Either idempotency on your downstream or build in the inbox. Ideally, do both.
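
A hedged sketch of that fallback, with a hypothetical endpoint; the header name follows the common Idempotency-Key convention rather than anything Investec-specific. The key is derived from the event ID, so a redelivered event cannot trigger the action twice.

```python
import requests

def send_sms(event: dict) -> None:
    requests.post(
        "https://sms.internal/send",  # hypothetical internal service
        json={"to": event["phone"], "text": "Your payment has completed."},
        headers={"Idempotency-Key": event["event_id"]},  # same event, same key
        timeout=5,
    )
```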

Participant 4: When I look at your diagram, I see two boxes, payments and communication. Probably in the real world, you have way more boxes, and way more messages, way more integration events. Do you manage who can create integration events? Because I am just thinking, when you have 300 different types of integration events, it could be a little chaotic, or you just accept the chaos.

Chris Tacey-Green: There is a third level that we tend to have at the bank. We're trying to build out platforms with varying levels of success. We're trying to build out platforms. Within a platform is where you'd probably have those integration events. There is another level where a platform itself is going to publish out events. We've been naming that public events, although they're not public to the world, they're public to the organization. Again, you have the same filtering, aggregation.

On that public level, it really doesn't get that noisy. I think if all of your teams understand the implications of publishing an event, they don't want to publish that many. They don't want to publish too much data on them. Yes, you can handle it by having those layers of filtering between this domain level of events, that's going to be really noisy. Integration events is going to be less noisy. Then your platform level events are going to be even less noisy. If you still have too much going on, that's ideally where you need to think about the topics that you're exposing from a platform, or from whichever level that you need to look at this from. Maybe now you're not just saying, here's a single event stream of everything that happened on this platform. You're saying, we have a topic specifically for this, a topic specifically for this, a topic specifically for this, to avoid people having to come and ignore 99% of the events, because they're not actually interested. There are approaches.

Participant 5: I believe you recommended using slim events, but then you'd have to then fill in the data from somewhere. Would that be a blocking integration?

Chris Tacey-Green: I would recommend using lean events. Fat events is where you're essentially carrying your entire entity state around your system in an event. Thin would be nothing at all, just a notification saying, something happened, and you're right to call that out. I didn't talk about it here. If you are pumping out notification-style events, people are going to have to go somewhere to find the data, and you're going to end up coupling on whatever that thing is. That's why I like lean events, in the middle. As long as you've designed the event carefully, you've included all the data that makes sense for this type of event, nothing more and nothing less, it's less likely that consumers will need to go off to your API to get more data. It's a really good call-out. I think it's certainly something that I could try to add into these slides.

 
