The Journey of ClearBank From Start-Up To Scale-Up

Summary

Michael Gray discusses how ClearBank has balanced autonomy and process in its engineering teams who faced increased demands due to departmental expansion and a complex regulatory environment.

Bio

Michael Gray is a Principal Engineer at ClearBank. He has a passion for Software and Systems. He's particularly interested in DDD, engineering culture and leadership, and creating the right environment for engineering teams to thrive.

About the conference

Software is changing the world. QCon London empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.

Transcript

Gray: Has anyone ever been to a powerboat race in their life before? I went to a powerboat race when I was a little kid, and I vividly remember being in awe of the speed and the agility of these boats. The way they could turn the corners. The way they could accelerate. This is actually a world championship-winning powerboat. This won the 2022 World Championships. It's got top speeds of up to 164 miles an hour. It's a pretty impressive piece of kit. This, however, is a cargo ship. This is the world's largest cargo ship. It's called the MSC LORETO. It's 400 meters long. It can carry 224,000 container crates, but it can only go at 19.5 miles an hour. You might be thinking, why am I telling you about different types of boats? It's because at ClearBank, like most startups, you start out as a powerboat.

You've got the ability to accelerate quickly, go quickly, and make change happen. If we went on a mission from the UK to the U.S., which is approximately 3300 miles, the powerboat would get you there in 20 hours. The cargo ship would get you there a lot slower. It'd get you there in about a week. In some cases, that's good. There's a metaphor here, the container ship is the big banks, the powerboat is ClearBank. What we wanted to make sure is we didn't become a container ship with rigorous processes. If you think about how you get a container ship loaded up, you've got to fuel a massive thing. You've got to have a really rigorous process for getting containers on and off the ship. Whereas a powerboat, you fuel up with a small team and just go. How ClearBank wanted to scale is not to become a massive container ship, but to have a fleet of powerboats that can change direction really quickly, deliver value quickly when needed. I'm Mike.

About ClearBank

Let's start off with a little bit about ClearBank. This is ClearBank's mission statement. At ClearBank, our purpose is to provide great technology that unlocks our partners' potential, ensuring everyone has the freedom to choose the financial services they need. What does that mean? We're playing quite a critical role in the revolutionization of the financial markets, which were previously dominated by the Big Four; those were the only places you could get products from. Since then, we've had regulation change. We've had open banking, to name one, which is all about expanding the competitiveness of the market so that people like you can all get market-leading products and more products to choose from. These are some of our customers. I don't know if any of you have got a Chip savings account, or are potentially banking with Tide? All of those companies sit on top of us. How we do this is we provide the financial fabric, which is the payments and accounts layer.

The other companies sit on top of us, they use our banking license. TrueLayer provide the open banking payments. All of you will have used TrueLayer, no doubt, when you've been checking out as customers. Cazoo use TrueLayer. It's an advantage for Cazoo: when people are buying cars, they don't have to pay the huge card fees. I also met somebody who was a referee, who built an application essentially to make sure referees were getting paid on time, who ended up using TrueLayer. We actually moved the money for TrueLayer. We're everywhere, but you probably never see our name. ClearBank started in 2014, and it took a couple of years for the bank to get a banking license with funding. Since then, we've grown from strength to strength. As you can see across there, we've got multiple awards. The interesting parts here are our headcount. Our headcount was 100, 2019, 250, now we're at 720. We're at 720 people now with 30 products and technology teams processing over a million payments daily. We have over 220 of those customers that I described in the previous slide. We're growing, and continue to grow, which is really great in the current climate.

ClearBank's Journey

Let's talk about how we got to where we are. ClearBank, in these early days, was successful really because of its culture. It had shared mission and purpose. Had open collaboration and communication, continuous learning and growth, trust, empowerment, and autonomy. Fast decision making. All of these are conditions for high-performing teams, and that's what gave ClearBank the competitive advantage. Everything that we've done since then is about maintaining these principles to make sure that we still have the competitive advantage in the market. That's as we add enough powerboats. Powerboats, as we go through this, represent development teams, essentially. ClearBank started a little bit differently to most startups, because we had to get a banking license, which means we have to show good controls, risk management, all of these things. One of the big things regulators do care about is segregation of duties.

Interestingly, ClearBank was a lot more structured than most startups. We had development teams. We had testing teams. We had security teams and operations teams. It was your classic big bank setup. Why? To please the regulator. From day one, it was about, we need that banking license. If we can't get the banking license, we can't do business. That introduces challenges. We just talked about all of these different powerboats being able to make autonomous decisions, move independently. It created a lot of bottlenecks in our system. Because of the segregation of duties, we had the development teams writing code and passing it off to QA. Then maybe QA finds a bug, and the development team has to fix it again. We had a change advisory board, which, really, as anyone who's worked in a bank knows, is an approval board.

A bunch of people approving something that they don't necessarily have the context to approve, and making decisions about whether this software can make its way through to production. This also creates bottlenecks in the system. As we add more development teams, again, security, operations, QA, they become bigger bottlenecks. It doesn't really scale, unless you add loads more people. At ClearBank, we made the decision to go from gatekeeping to enabling. That was quite a transition. Our goal is this: deliver a constant flow of value as quickly as possible to our customers, so we need to remove the constraints.

Testing as a Gatekeeper

This is how we were. I just briefly described it. Development team writes code. Test engineer writes code, finds bugs. The development team fixes it. The testing team tests the performance of the code. Engineering teams then have to fix the performance of the code. Not very efficient handoffs. People aren't too happy working that way. We're a new organization, we probably shouldn't be working that way, so we made a transition to move testing to be an enablement function. The testing team's role fundamentally changed. They were there to help the teams get good at testing. Because we're regulated, we had to put some controls in place around this to prove that we still had segregation of duties. For us, that's our PR and release process. All our development teams have to have at least one QA champion. That means that when they're approving the PRs into the main branch, we've got that end-to-end audit trail.

Development teams are now responsible for the quality of their software and performance of their software. Alongside the testing champions, the testing team also runs a community of practice, which is all about continuously sharing new ways of working, that kind of thing, so that everybody can be upskilled. What we found that did for us is it increased flow efficiency. We've removed the handoff, which was great. The quality actually increased. The development teams know which bugs are the bugs that need to be fixed. We can make sensible tradeoff decisions about that. They felt a sense of ownership. We kept the segregation of duties, importantly for the regulator, and actually the control was operating more effectively from our point of view.

Security as a Gatekeeper

You might see a pattern with this. Security, same thing. Software is created. Some cases it was a design that security would then review. Have to go to the change approval board or advisory board. Penetration testing would be after the fact. Security would say no a lot, to a lot of things. Because, when you don't have the context, saying no is the easiest thing to do. It's the safest thing to do, especially from a security perspective. That's really quite frustrating for the engineering teams. Again, we made this transition to migrate security to an enabling function. Now the teams at ClearBank, they're responsible for creating threat models for the software. Responsible for the security of the software.

Again, you'll see a pattern, there's a security champion within the team. Security teams move to providing training and coaching. Consulting with the development teams: we still in some cases need the advice. That's what they're there for, their expertise. They run the community of practice, so that anyone can attend, and security champions must attend. In the PR process, now when a change goes to production, we need a security champion. That satisfies the regulator. What we found with this was more considered conversations around security and risk: less "no", more "OK, we get that risk". We found the teams were better at explaining to security the risks and the potential problems, because they were more educated about security and the potential risks that different vulnerabilities might introduce. We've also removed another handoff and another bottleneck.
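
The champion-based PR control described in the testing and security sections could be sketched roughly as follows. This is an illustrative Python sketch, not ClearBank's actual tooling: the `PullRequest` type, the champion sets, and the rule that an author's own approval does not count are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class PullRequest:
    """Hypothetical, minimal view of a pull request into the main branch."""
    author: str
    approvers: frozenset = field(default_factory=frozenset)


def missing_champion_approvals(pr, qa_champions, security_champions):
    """Return the champion roles that have not yet approved the PR.

    Assumption: an author's own approval never counts, so the approval
    record still demonstrates segregation of duties.
    """
    independent = pr.approvers - {pr.author}
    missing = []
    if not independent & qa_champions:
        missing.append("qa")
    if not independent & security_champions:
        missing.append("security")
    return missing


def can_merge(pr, qa_champions, security_champions):
    """A PR may merge only when every champion role has signed off."""
    return not missing_champion_approvals(pr, qa_champions, security_champions)
```

Because the check only reads the approval record, the same data that gates the merge also serves as the audit trail shown to the regulator.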

Operations as a Gatekeeper

The final one, operations as a gatekeeper. This one was a little bit different. Our operations team don't become an enabling function, but we'll get to that. It was like this: we write software. We want to deploy it. Operations team: we don't have capacity. We want to use new technology. Operations team says, in six months, maybe we'll have the ability to support it. Operations team is supposed to be running things in production. Guess who gets called in the middle of the night? It's the development teams. They're still getting called. It wasn't really working. We moved from operations to DevOps: you build it, you run it. Our operations team migrated to being an infrastructure team.

A lot of ClearBank's in the cloud. Some of it's on-premise, because we have to have direct connectivity so that we can move money between banks, because that's a lot of what we do. They've owned a lot of that, running the infrastructure, patching, security updates, on-premise connectivity, that kind of thing. We moved everything left to the team. The teams are now building and running the software. They're on-call 24/7. They understand the software. They're also allowed to explore new technologies for their specific use case without that bottleneck from the operations team saying, no. Actually, what we found with this was it reduced the number of incidents we had, as point one. Two, it reduced the severity of incidents, increased flow, increased the number of deployments that were going to production. Teams were very happy, because you could start to use the power of Azure, and actually use some of the technologies that are offered there without the bottleneck of the operations team.

Mindset Shift, Influence Through Secondments

One thing to talk about that I think is pretty cool, but pretty open, secondments is something we actively promote. Because if you go and work in a specific area for a period of time, you're going to gain the knowledge, you're going to become a broader individual. When we went through this kind of transition with all of these teams, it was really encouraged that development engineers should go and join the testing teams, should go and join the operations teams, should go and join the security teams.

Why? First of all, because it was going to upskill them. Two, they understand how the development teams work and what they need. It also gave them the context, in security and testing teams, to enable development teams in the best way possible. In summary, for that, every time we allow for end-to-end ownership, which was our goal, we want these small, autonomous powerboats, we saw improvements across the board. We stopped chucking stuff over the fence and carried on growing. This is where we were before, and where we've ended up. Gandalf saying, you shall not pass. Now we've got our enablement teams getting change to production as quickly as possible.

Boundaries and Interactions

That's all a really good story, but it's not the complete reality. There are a lot of other constraints that we've had. ClearBank, as with a lot of startups, started off with a monolith, a small group of teams working together. There weren't huge constraints on it. One of the challenges, really, was that it wasn't that modular. What happened was that, as teams grew, we tried to split this monolith into microservices, but we still had some really awkward interactions. Our boundaries weren't super clear. That's also a challenge, because when one powerboat starts depending on another to get somewhere, it's going to take you longer. We focused on boundaries to enable autonomy. This is where we were when I joined. I joined in 2021. We'd done some work. We'd got versioning in our APIs to talk between applications. We've got what we called our domain events. One of the ways that ClearBank wanted to split stuff was event driven, which, as used, still introduced coupling, which we'll talk about.

The challenge with this is, yes, we've introduced these APIs with versions, but they're never ever retired. All of these teams still had these dependencies. We had powerboats and teams that just needed to maintain old stuff forever because one team in the organization was using it. With the domain events, there was no concept of internal or external, which is tough. It meant that events that really should have stayed inside a bounded context were being exposed to the outside world. Then people start depending on them, and we're introducing coupling. What we've moved to more recently is talking about boundaries of change. We've got three. We've got public, internal public, and internal. Public changes continuously.

We've introduced stuff so that our customers, if we make a breaking change, have six months to move, which still allows us to move. We have internal public, which are typically around a domain or a bounded context. They're internally versioned, and we've got strict policies now which mean people have to move off versions within six months when they need to. We introduced a new concept called integration events. I know it can be a bit contentious, everything's a domain event. We've called it integration events, same concept. It's speaking the external language of the domain, so that people can couple themselves to that, rather than the internal language.
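
The split between an internal domain event and an integration event can be illustrated with a small sketch. Everything named here is hypothetical (the event types, their fields, and the assumption that amounts are held in minor units of a two-decimal currency); the point is only that a translation step at the boundary lets the internal shape change without breaking subscribers.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class PaymentSettledInternally:
    """Hypothetical domain event: internal language, free to change."""
    ledger_entry_id: str
    scheme_message_ref: str
    debtor_account_id: str
    amount_minor_units: int
    currency: str


@dataclass(frozen=True)
class PaymentCompletedV1:
    """Hypothetical integration event: the versioned, external language."""
    payment_id: str
    account_id: str
    amount: Decimal
    currency: str


def to_integration_event(event: PaymentSettledInternally) -> PaymentCompletedV1:
    """Translate at the boundary so subscribers never see internal fields."""
    return PaymentCompletedV1(
        payment_id=event.scheme_message_ref,
        account_id=event.debtor_account_id,
        # Assumption: two-decimal currency, so 100 minor units per major unit.
        amount=Decimal(event.amount_minor_units) / Decimal(100),
        currency=event.currency,
    )
```

Other teams subscribe only to `PaymentCompletedV1`, so a field such as `ledger_entry_id` can be renamed or removed inside the bounded context without a coordinated change.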

We've made some more boundaries, but what we've still got is quite awkward interactions. This is the next thing that we needed to tackle. This is an example of our billing engine, and this worked for a period of time, and this is where I say events still create coupling. For us a lot of our products are payment rails. Payment has been completed. Billing engine listens for that and says, I need to charge the customer for this thing that's happened. Then we had another product. Now the billing team still have to know about that event, and they have to, then, therefore, in the billing engine, figure out how to charge for it. Then we had another product capability, and again, they have to subscribe for another event.

These two red arrows signify awkward interactions, and those are the things that we want to mitigate and make sure we get on top of. We moved to this model instead, where we had the product and a product billing module. We introduced a concept of a product code. The product understands what its product code is. The product code gets charged a fee. Products can create product codes, so they just call an API. We changed this billing engine to be a completely self-serve capability, and we removed those awkward interactions. That's just an example of something that we continuously do all the time. We're looking for these awkward interactions, and we're trying to mitigate them.
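
A rough sketch of what a self-serve billing capability keyed by product code might look like is given here. The class, method names, and the flat per-unit fee model are assumptions for illustration; the talk does not describe the real API.

```python
from decimal import Decimal


class BillingEngine:
    """Hypothetical self-serve billing engine keyed by product code."""

    def __init__(self):
        self._fees = {}      # product code -> fee per billable unit
        self._charges = []   # (customer_id, product_code, amount)

    def register_product_code(self, product_code, fee):
        """Called by a product team; no billing-team change is needed."""
        if product_code in self._fees:
            raise ValueError(f"product code already registered: {product_code}")
        self._fees[product_code] = Decimal(fee)

    def charge(self, customer_id, product_code, quantity=1):
        """Record a fee for something the product says is billable."""
        if product_code not in self._fees:
            raise KeyError(f"unknown product code: {product_code}")
        amount = self._fees[product_code] * quantity
        self._charges.append((customer_id, product_code, amount))
        return amount

    def total_for(self, customer_id):
        """Sum every charge recorded against one customer."""
        return sum(
            (amount for cust, _, amount in self._charges if cust == customer_id),
            Decimal("0"),
        )
```

The direction of dependency is what matters: the product calls billing with a code it owns, instead of billing subscribing to each new product's events.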

Speed and Velocity

We've removed a load of bottlenecks. We're moving really quickly. However, we've got powerboats going really quickly, maybe not in the direction we wanted to get to. We wanted to go to New York, but we've got a couple going around the world. They're going really fast. This is the difference between speed and velocity. This is where we ended up, not super chaotic, but chaotic enough for it to be tricky. Speed is the rate at which an object is moving. Velocity includes direction. We were missing the direction. We started off as a tech company, but continued as a tech company. It was time for us to add something new. This is how we wanted to be moving.

At ClearBank, this is where we introduced product. The mission at the top at this point became way too abstract for the teams to understand what that meant. There were too many development teams now for the mission to be translated. This is where we introduce product, so we now have a product function that worked really closely with the rest of the business and tech. They're the translation layer. They provide us with the product strategy, and therefore, help us understand what the product offerings are. That's how we organize our teams around them. That's how we started to gain direction. That's how we've translated our mission through to what the actual products and engineering teams have to build.

The Shift Towards Platform Services

Now we're all aligned. All rosy. Not quite. The transition to DevOps cost money, in a different way. It cost us a lot in our cloud bill. Teams now have the freedom to choose technology. Who uses Azure? Who's had the pleasure of using App Service Environments? Expensive. We're a bank, so security is pretty important. App Service Environments give you an isolated server rack in Microsoft's Azure data center. When you're a bank and it's top priority, and you're moving people's money around, that's quite important. However, they're really expensive. We've got serverless, teams start to use serverless functions, but they have to sit inside an App Service Environment. The technology is great, but the use case and the cost are not front of mind. Four grand a month to run a cron job once a month, it's not a good use of money. Those are some of the challenges that they introduced for us, and our costs increased. It's something that we're getting on top of now. I'm sure that's not an unheard-of story for the rest of you.

The other challenge: the security, data, and testing teams grew, and suddenly they have their own roadmaps. They've got their own mission now. The mission is to make the bank more secure. Here's a roadmap to do it. Same with data. We need to improve the quality of our data, here's a roadmap to do it. Testing. Our testing could be better, let's do all of this. Problem with that is all of them are asks of our product teams. It's not long before the product teams are feeling like this, "I've got all of these things that I really have to care about, but I'm here to deliver value to our customers." So we need to do something about that. This is where we shift our mindset a bit to platform services. I think we've done it in a slightly different way. I'll talk you through that, and hopefully it's interesting. This is a picture of some of the platform services we offer at the moment for ClearBank.

The whole point of platform, as I'm sure a lot of you know, is to reduce the cognitive load of the teams. To reduce the cognitive load so that we don't get that brain picture. What we started to do is create this loop. The testing team now also starts to have a testing platform. What's great about this model is the testing team have always been collaborating with the development teams. We've had secondments of the development teams into the testing teams. They've really started to understand what the problems are, not just of one team, but of multiple development teams. This allows them to start creating a testing platform at ClearBank. We've got things in there, like how we do our SLO reporting, performance testing toolkit, chaos testing toolkit. Those are all things that teams can then consume without having to re-interact and consult with the testing team, to make sure the quality of the software is great. There's another pattern here.

Similar story with the security team. The security team are constantly collaborating with the engineering teams, the development teams, as they start to build their own platform, which is all about reducing the cognitive load of the team. We've got things in here such as checks built into the pipeline, making sure our dependencies are secure. We're a Microsoft house, if you didn't know. Threat modeling toolkit, so we introduced IriusRisk, which is a toolkit which helps teams model all the threats that could be potentially there in their software. We also introduced the security application score, which is something that teams can see every day. I think we use something called Phoenix, and this not only plugs in with Snyk that we use to scan for dependencies, but also plugs in to the infrastructure. It gives you that end-to-end picture. It also gives you information on how you can mitigate them, which is, again, something that empowers the teams to take more ownership of this stuff.

The infrastructure team transitioned to our internal developer platform. This is a lot of what people talk about when they talk about platforms. It's fairly standard what we do here. We've got our API platform, which takes care of webhooks, public APIs, authentication, messaging standards, how we do observability, so that we can see our entire system at a glance: compute, storage, all of these things. This has started to bring our cost back under control. Teams at ClearBank now can't just use whatever technology they want. We have a bit more of a rigorous process in place.

Anyone can propose and start to use any technology they like; however, it does then have to, hopefully at some point, get adopted into the platform. That really helps with cost, if you want to start getting on top of that. Back to this, this is also a platform. We're finding our platforms continuously, it's not just about internal developer platforms. It's not just about testing platforms. This is a platform. Billing has become a platform. For us, our accounts is also a platform, whereas before, we considered it a product in its own right. It's an enabler for other products. This is a great way to get efficiencies in your system and get rid of those awkward interactions: finding your platforms.

Dealing with Decisions at ClearBank

What we've done here is, hopefully, reduced some of that cognitive load from the teams. We've pushed it down into the platform services, rather than shifting it left. We saved some money because now the teams aren't using App Service Environments for serverless stuff that cost 4 grand a month. The teams are a lot happier, and we're delivering stuff to the customers nice and quickly. Onto decisions. "This decision was bad. This makes no sense. Why have you done that?" Or maybe somebody actually says, "I would like to understand why they made that decision," because they know there's missing context. There are no right or wrong decisions, there are only tradeoffs. When somebody made that decision, at that point in time, we don't know what context they had. If you were in that situation, quite possibly you would have made the same decision. I'll talk about how we do deal with decisions at ClearBank.

One of the consequences for us at ClearBank was we introduced really localized decisions. We wanted these powerboats to be going really quickly, make their own decisions, have agility, all the rest of this. That's the context. By design, each product team is now an autonomous unit. They make their own localized decisions with the context that they have. I just briefly talked about some of the challenges that we have there. We want localized decisions, because this is how we scale. We made the decision to embrace localized decisions. Some of the consequences of that are, decisions will be made by people with the most context. Great. We're pushing it down to the teams who know the most about the products.

However, some decisions that get made may impact others that don't have the wider context. Then we're talking about the global system, and maybe we've made local optimizations that don't benefit the global system. Then system drift could become a problem. These teams are slowly making decisions that over time mean that the teams almost become incompatible, because they've maybe chosen different technologies or whatever, that aren't as interoperable anymore. We needed a way to manage this.

Have any of you read the article by Andrew Harmel-Law that was on Thoughtworks, about scaling architecture conversationally? I would recommend reading it. I've been fortunate enough to be asked to review some of the chapters in his book as well, which is going to be a great read. It's one of those books where you're reading and nodding. They're always going to be good. We took what he defined as the Architecture Advice Process in that Thoughtworks article and gave it a little bit of a twist to make it work for ClearBank. First part, we introduced engineering principles. The whole point of the engineering principles is to aid in day-to-day decision making. Every decision that we make should be looking back up to one of our principles. Does this fit with our principles? That's going to help us to manage a little bit of that system drift in the localized decisions, because we've got some principles for people to look to.

We've put that in our developer portal, front and center, so it's the first thing everybody sees when you log on in the morning. A huge amount of our process is actually around architecture decision records. I think they're great. How many use architecture decision records actively, because it's pretty popular now? You should explore and see what they can do for you. Not only are they a great way to store a decision, it's a great way to think, because it forces you to think about the context that you're working within, the decision that you're going to make, and the tradeoffs that you're going to have to make with the decision. All decisions have consequences: some are good, some are bad. Always there's pros and cons to all of them. This is at the center of how we communicate and make decisions at ClearBank. Everything is written in an architecture decision record.

The piece that's different to the Thoughtworks article that's unique to us, I believe, is decision scopes. We had challenges. We had some teams that were mavericks. They would make loads of really quick decisions continuously. We had other teams that weren't so confident with making decisions. They were always looking to someone more senior to make decisions for them. Neither of those is great. We needed to find some balance, so that's why we introduced decision scopes. Decision scopes are quite abstract, purposefully, and it's all about impact. It's all about, if I make this decision, who do I think it's going to impact? If it's just my team, you make that decision. If it's my domain or the teams that are working within my domain more closely, have a conversation with them, and make that decision together. If it's enterprise, we actually have a forum for that, for more wide-impacting decisions, which we'll talk about.

This is our process for making decisions. You'll see, all of them start with writing an ADR, before we've even had a discussion to frame the conversation. Is it a team decision? Write an ADR. Have a conversation with the team. Is it agreed? Yes, or no? If no, still store it. Rejected decisions are also super important. Why haven't I made a decision at this point in time, given this context? That's also really useful information to make sure you have. For the enterprise ones, you'll notice it says, bring to AAF. AAF stands for Architecture Advisory Forum. We run this once a week. We'll talk about that. Architecture Advisory Forum. Previously at ClearBank, we had an architectural control, it's called architecture review. It was a bunch of senior people taking minutes. It wasn't that productive. The whole point was we could show to the regulator that we were in control of the direction of our architecture. We've changed it now. It's the same control that we have in place, it's Architecture Advisory Forum.
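
The decision process described here (write an ADR first, route it by scope, keep it whether or not it is agreed) can be sketched in a few lines. The field names, the enum values, and the in-memory list standing in for the central repository are illustrative assumptions, not ClearBank's actual ADR template or tooling.

```python
from dataclasses import dataclass
from enum import Enum


class Scope(Enum):
    TEAM = "team"
    DOMAIN = "domain"
    ENTERPRISE = "enterprise"


class Status(Enum):
    PROPOSED = "proposed"
    ACCEPTED = "accepted"
    REJECTED = "rejected"


@dataclass
class DecisionRecord:
    """Hypothetical minimal ADR: context, decision, and tradeoffs."""
    title: str
    context: str
    decision: str
    consequences: str
    scope: Scope
    status: Status = Status.PROPOSED


def who_to_consult(adr):
    """Map the expected impact of a decision to where it is discussed."""
    if adr.scope is Scope.TEAM:
        return "own team"
    if adr.scope is Scope.DOMAIN:
        return "teams in the domain"
    return "Architecture Advisory Forum"


def record_outcome(repository, adr, agreed):
    """Store the ADR whether or not it was agreed; rejections are kept too."""
    adr.status = Status.ACCEPTED if agreed else Status.REJECTED
    repository.append(adr)
    return adr
```

Keeping rejected records in the same repository preserves the context for why something was not done at that point in time.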

This is where we make all the wide-sweeping decisions. It's all about this impact, the enterprise decisions, they all come here. They come here for discussion. They bring in architecture decision records to the forum. We have a discussion. We have quorum from all areas. We have data, security, infrastructure, senior leadership. We discuss it. We have a conversation, really fruitful. Anyone can come. It's not just restricted to senior leadership anymore. Why? People making decisions really benefit from maybe more experienced people asking questions about decisions other people are going to make. Then they learn to make better decisions as a consequence of that. It's also an advice forum. Sometimes you don't know who or what you need to speak to, which is one of the prerequisites, really, for the forum. Find out who you think it impacts. Sometimes it's, I would like advice on this topic. Is there a group of people here that could help me out with it? This forum, I think, has been really fruitful, and I think it's changed the way we think about and do architecture. I think it's changed the way we made decisions. I think it's really been a positive for us, for good.

What have we done with this decision-making process? We've managed system drift through conversation. We've got a history of all the decisions we've made. I think 300 or 400 now, maybe in our central repository of enterprise and domain decisions. It's a learning forum, not just a technical decision-making forum, so people can upskill and learn how to make better decisions. More importantly, I think it's a catalyst for change. Senior managers bring process changes to this forum now. Why? Because this is how it's going to impact you. This is what we're doing. Teams put their hand up, ask questions. This is how it's going to impact me. They get to make little tweaks to the processes that we're going to be introducing. That makes them feel really included rather than being dictated to. That's been really effective, and it really helps make change in the organization a lot easier, if people feel like they've had an input into it. Lastly, but most importantly, for our banking license is the regulator's happy because we've got another effective, working control.

Summary

Just to round off, just to show you where we're at and see how this is working in practice. These are the DORA metrics. This is where we sit at the moment. Deployment frequency, we're on demand. Change lead time, between one day and one week. That's because of some of the controls that we have in place that hopefully we'll be able to optimize in the future. Change failure rate, we're below 2% on that. We're pretty good at recovering as we shifted everything towards build-and-run teams. The biggest thing that we've managed to achieve in the last couple of years is we managed to build a new bank in nine months, from first commit to the technology being ready. We're just waiting for our European banking license now from the Dutch National Bank. Hopefully, we'll get there really quickly. All of this, and this is really important, was done over a long period of time. All of this was continuous improvement.

None of this was transformation. I've seen a lot of people try to do too much at any one point in time. There's only so much change an organization can absorb at any point in time. It's really important to have that continuous improvement mindset so that, yes, you get better every day. We are where we are now at ClearBank. We've still got that mindset. We're going to be continuously making some more changes to try and improve as an organization. ClearBank is successful because of its culture. I think that's true today. We've got a shared mission and purpose. The product strategy helps us out with that. We've got collaboration, open communication. We talked about a few of the mechanisms that we've got for that, with the Architecture Advisory Forum. Continuous learning and growth, which I haven't talked about all of: take it easy as an example. We've got trust, empowerment, autonomy. We've got really quick decision making at the right levels.

I think it's important for the decision making to be made at the right level. We've got a culture that breeds high-performing teams, that gives ClearBank the competitive advantage in the market. We've managed to avoid being one of the incumbent banks with really heavy processes. We've scaled through conversation and people and culture. We've managed to avoid just running around like headless chickens all over the globe. This is where I think we are today. It's a group of powerboats sailing from the UK to New York, hopefully nicely aligned and traveling at quite some speed.
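As an aside to the DORA figures quoted in the summary, the following is a minimal sketch of how change lead time and change failure rate could be derived from deployment records. It is not ClearBank's tooling: the record layout, the function names `change_lead_time` and `change_failure_rate`, and the sample data are all assumptions made for illustration, and the sample numbers are invented rather than taken from the talk.

```python
from datetime import datetime, timedelta
from statistics import median

# Hypothetical deployment records (not real data):
# (time of commit, time deployed to production, did the change cause a failure?)
deployments = [
    (datetime(2024, 4, 1, 9, 0), datetime(2024, 4, 2, 15, 0), False),
    (datetime(2024, 4, 2, 10, 0), datetime(2024, 4, 4, 10, 0), False),
    (datetime(2024, 4, 3, 11, 0), datetime(2024, 4, 3, 17, 0), True),
    (datetime(2024, 4, 5, 8, 0), datetime(2024, 4, 8, 8, 0), False),
]


def change_lead_time(records):
    """Median time from commit to production deployment."""
    return median(deployed - committed for committed, deployed, _ in records)


def change_failure_rate(records):
    """Fraction of deployments that caused a failure in production."""
    failures = sum(1 for _, _, failed in records if failed)
    return failures / len(records)


lead = change_lead_time(deployments)
rate = change_failure_rate(deployments)

# The talk quotes a lead time "between one day and one week";
# this checks whether the sample data falls in that band.
in_band = timedelta(days=1) <= lead <= timedelta(weeks=1)
print(f"lead time: {lead}, failure rate: {rate:.0%}, in band: {in_band}")
```

The median is used here rather than the mean so that one unusually slow change does not dominate the lead-time figure; that choice is also an assumption of this sketch.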

Questions and Answers

Participant 1: The architectural forum, is there an ultimate decision maker? Can you give an example of a really contentious issue?

Gray: Is there an ultimate decision maker? This is one of the challenges we've had with it. When there's been quite a contentious decision, we have run into trouble, where we've come to a stalemate in some cases. That's true. Most of the time, that can be resolved through more conversation that gets taken offline. Sometimes, maybe myself or one of the more senior people in the call will have to make a call. OK, we understand your objections. However, most people agree with this, so we need to move forward, because it moves ClearBank forward. Yes, we have, and they're always tough to navigate. Most of the time not. Most of the time the conversation is pretty good, and people have valid reasons for the concerns that they have, which people then address. Sometimes, yes.

This is one that's yet to be resolved, that needs to be resolved. I'll give you that example. We tried to introduce a concept called deployment units at ClearBank. We've gone from monolith to microservices. We've gone way over to the right-hand side and introduced too much complexity into the system. We've got 700, 800 deployables, it's a bit much to manage with our team. We're trying to bring this stuff back. We put out a proposal for deployment units, which is all about finding the right size for these things, and really deploying capabilities, rather than just little things that do one job. A lot of pushback from that, from teams. That's not just a ClearBank culture, that's a tech-wide culture that I think really needs a bit of work on. It is happening now. People are feeling the pain. That's ongoing. We haven't closed that one out. That's the challenge. We've had process changes where people weren't very happy, but it had to happen for regulatory reasons or whatever. Those things are a bit easier, but we try to not be too dictatorial to the engineering teams. We want to make change by influence and respect. Sometimes it's tricky to walk that line.

Participant 2: You talked about the API platform, which is the upside of it. What is it specifically that you build? Is it a custom API, a set of documentation, tools, an IDP. Any feedback on that?

Gray: All of that, really. We've got documentation on best practices, how to use the platform. An example of one of the platform services is webhook delivery. We've got one mechanism to deliver webhooks to our customers, where people just call an API and it ends up with our customers, so they understand something's happened. All of our pipelines are templated. They have to be, because we're a bank, and we have to show that segregation of concerns, when it's flowing through, approvers have approved change through to production, and that kind of thing. They're templated. They're part of our platform. How we get software to production, observability standards, some of that's documentation, some of that's packages that teams can then consume and use.

Participant 2: It's like a toolkit, or?

Gray: Some of it's toolkits, some of it's like APIs, some of it's documentation.

Participant 3: You briefly mentioned cloud costs and the regulation about that, and you already have a platform mindset in your company. Have you ever thought about getting back on-prem or getting on-prem?

Gray: Going back on-prem? No. We haven't. There's been articles recently about, which company was it? They said they've massively reduced their cloud costs by going back on-prem and buying a load of Dell servers. It's quite contentious. No, not for us. There's always concern in finance from the regulators about critical partners, and if they fail, then, does your bank fail? Yes, sometimes there's pressure to explore multiple cloud providers. We do also have an agreement with AWS. We don't actively build and deploy to both clouds at the moment, but maybe in future we will need to, to give the regulators that peace of mind. At the moment, no, no plan to go back to on-prem. It suits us the way it runs at the moment.

Participant 4: Coming from a finance background, we also have one other group of people doing the 24/7 monitoring, like mission control. How is that integrated in your DevOps teams? Is this integrated? I know that the teams are running but they are running the operations, the DevOps teams, but we have mission control. They are 24/7. There's someone sitting in a room and monitoring everything. Do you have something like this? Is this integrated? Because we always have these problems that the people in mission control, they have the control over all the services, which is quite a lot, and no one can understand whatever happens there. It's very hard to integrate these people into our normal workflows.

Gray: The teams have their own dashboards. We use PagerDuty as well. We have our own dashboards and monitoring set up for each of the individual teams that look after their own products. That's how it's monitored. They don't sit there and watch it every day. It's been quite a journey to get our monitoring and observability to where it is today. We introduced it. We had a load of noise. Then you got to get rid of your false positives. Now we're at a fairly decent spot where we've balanced that out. We get alerts when there's something going wrong with the system, rather than somebody watching it 24/7.

Participant 4: I understand that you just react to events, you don't create the events yourself, on which you react?

Gray: No. We've got rules in place which will then trigger something and notify us, and wake you up in the middle of the night.
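To illustrate the rule-based alerting described in this exchange, here is a minimal sketch of a rule that pages only after several consecutive threshold breaches, which is one common way to cut down false positives from a single noisy sample. Everything in it is hypothetical: the `Rule` class, the `evaluate` and `pages_for` functions, the metric name, and the thresholds are assumptions for illustration, and it does not call PagerDuty or reflect ClearBank's actual rules.

```python
from dataclasses import dataclass


@dataclass
class Rule:
    """A hypothetical alert rule; field names are illustrative only."""
    name: str
    metric: str
    threshold: float
    consecutive: int  # breaches in a row required before paging (assumed >= 1)


def evaluate(rule, samples):
    """Return True when the last `consecutive` samples all exceed the threshold."""
    recent = samples[-rule.consecutive:]
    return len(recent) == rule.consecutive and all(v > rule.threshold for v in recent)


def pages_for(rules, metrics):
    """Return one page message per rule that currently fires."""
    return [
        f"PAGE: {rule.name}"
        for rule in rules
        if evaluate(rule, metrics.get(rule.metric, []))
    ]


rules = [Rule("payment error rate high", "error_rate", threshold=0.05, consecutive=3)]

# A single spike does not page anyone.
quiet = pages_for(rules, {"error_rate": [0.01, 0.09, 0.02, 0.01]})

# Three breaches in a row do.
noisy = pages_for(rules, {"error_rate": [0.01, 0.06, 0.07, 0.08]})

print(quiet, noisy)
```

In a real setup the returned messages would be handed to an on-call tool rather than printed; that hand-off is deliberately left out of this sketch.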

Participant 5: What is the composition of the Architecture Advisory Forum? What is the role of the principals and staff engineers, if you have those?

Gray: We facilitate.

Participant 5: Maybe you have principals being part of that group, or maybe senior ICs.

Gray: Everyone's welcome. When we rolled it out, we invited principal engineers, staff engineers, senior engineering management, and all of the team leads, with the instruction, delegate it to other people who care about it. Since then, the invite list has grown to a lot of people. We've not taken that access away from them because we think it's valuable to them. Because they learn how to make better decisions by listening to us discussing decisions. We still see value in that. It's not got a strict composition. The only bit that is strict, we have to make sure we have quorum for regulatory reasons, so it is audit, it is security, data, infrastructure.

Participant 6: How long do your secondments last, and how do they get initiated or triggered?

Gray: Some are permanent. Some of those secondments, they become leaders of their respective areas. Security is a great example. Security engineering, we had someone called Seb, who then left the infrastructure space and ended up being our security leader. The other answer is, it depends. Typically, it's like a quarter, three months, and then you go back to your team. For us, it's always been open for negotiation. We've got security engineers who were software engineers who work in the security team now, building our platform out. It depends, but typically three months.

Participant 6: How do you guys think about staffing your platform teams? Were they fixed, or were they loosely defined groups, like project teams, for example? Could you share how many engineers per team do you put into these platform teams or project teams?

Gray: Five or six, typically, in each platform team. Between platform and infrastructure, which also operate on-prem, I think we've got seven teams working in that area at the moment.

Participant 6: How did you guys think about staffing your platform teams? Are they fixed, or are they loosely defined? Because API platform, for example, you had a really huge box, and then the rest were smaller.

Gray: We have teams that own compute and messaging. We have other teams that own APIs, which include webhooks. They actually also own authentication and our public API interface, sort of security stuff. We organize them around capabilities and the services that they offer.

How do we staff them? We've had some challenges in this area, where we staffed the platform teams with infrastructure engineers. One of the challenges we found with that is that they didn't have much empathy for the development teams. They're there to enable them and reduce the cognitive load, so they provide services to do exactly that, whereas we found that they were interested in the new technology, which actually, in some cases, would increase the cognitive load on the development teams. Because, we should be doing it like this, and these are all these complicated ways we could be better. The goal is for you to reduce the cognitive load, so that's just adding more at the moment. Is this something that we need to do? We've started to change that a bit by, again, bringing development engineers into the platform space, and then they have that empathy with the development team, so that they can make sure they've got that kind of development voice while they're building the platform services as well. We've also invested in product owners in that area, as before it was very technology led.

 


Recorded at:

Nov 12, 2024
