On today’s podcast, Wes Reisz talks with Jason Maude of Starling Bank. Starling Bank is a relatively new startup in the United Kingdom working in the banking sector. The two discuss the architecture, technology choices, and design processes used at Starling. In addition, Maude goes into some of the realities of building in the cloud, working with regulators, and proving robustness with practices like chaos testing.
Key Takeaways
- Starling Bank was created because the government lowered the barrier to entry for banking startups, in reaction to previous industry bailouts.
- The system is composed of around 19 applications hosted on AWS, written in Java, and backed by PostgreSQL databases.
- These applications are not monolithic, but are focused around common functionality (such as a Card or Payment Service).
- Java was chosen primarily because of its maturity and long-term viability/reliability in the market.
- The heart of Starling's architecture is that every action the system takes happens at least once, and at most once. To enforce these rules, everything in the system is assigned a correlation ID (a UUID), which is used to verify that both rules are met.
Show Notes
Why would you agree to deliver a talk with only 24h notice?
- 02:00 I wanted to start speaking at conferences, and this was a fantastic opportunity.
- 02:15 I like talking about our architecture, and how we’ve taken the fairly conservative industry (in banking) and shaken it up, showing that you can run at the same speed that start-ups do.
How did the talk you gave previously in London differ from the talk you’ve just given in San Francisco?
- 03:35 Although it will be very similar, I will focus on different things - the environment and the kind of attendees mean people are thinking about different concerns.
- 03:55 For London - where a lot of things are concentrated around financial services - the idea of deploying once a day is madness.
- 04:10 Finance is conservative, and likes to deploy on very slow, stable timescales.
- 04:20 If you go to San Francisco, and tell people you’re deploying once per day, they will wonder why you’re deploying so slowly.
- 04:30 Technology companies like Twitter and Google push out code faster, and have massively high volumes and users.
- 04:40 If I asked them to imagine that I had to guarantee that a tweet I posted was seen by every single one of their followers in their feed, they might not know how to guarantee it.
- 05:05 That’s the sort of thing we have to deal with: if you make a payment, that payment will reach the other side, or we’ll tell you about it quite quickly and you’ll have to try again.
- 05:15 If we accept the payment instruction, then we have to guarantee that it will reach wherever it is supposed to reach.
When did Starling start, and what were the design goals like?
- 05:45 It came out, as many things have, of the financial crash in 2008.
- 05:55 That caused the government and many players in the banking industry to think about the fact that they had to bail out the banks.
- 06:05 They were forced to bail out the big banks because they were too big to fail.
- 06:10 This sort of thing brought industry players and government together to think how to prevent this happening again in the future.
- 06:25 Two things happened: the government lowered the entry requirements for founding a new bank, and made the on-boarding process - going from a startup to a business in a licensed industry - easier.
- 06:45 On the industry side, our CEO Anne Boden left the large banking environment in which she had been an executive, and got by with far fewer engineers than she had had at one of the incumbent banks.
- 07:10 When I joined, we had a partial banking license; we weren't yet a fully regulated bank, and we hadn't launched or appeared on the App Store.
- 07:25 We looked at adding all the features that you’d associate with being a bank, and coding them at a fast rate into the app and the back-end server code that supported it.
What was the reasoning for launching as a mobile (app) only bank?
- 07:40 More and more people are interacting with life through their mobile phone.
- 07:50 If you have your mobile phone on you constantly - it is becoming more and more sensible to say, let’s interact with your banking life through your phone as well.
- 08:05 Many other banks have been offering web portals, where you can log in through your browser.
- 08:15 Those generally require you to sit down at a computer, get out a pass-key, receive an access code, remember three or four different passwords; it takes a long time, and it is disconnected from how you bank.
- 08:30 If you can get your phone out and look at your banking on your phone, then it makes it much easier - especially if you’re getting notifications when you spend and when you make a regular payment - it means you can much more easily interface with how you are using your money.
- 08:50 The reason we went mobile only, as opposed to mobile and a web portal, was for ease and speed of development - it’s much easier to run one rather than both.
What does the architecture look like?
- 09:15 We’ve got about 19 or 20 different applications, running in Java in AWS.
- 09:25 They aren’t microservices; I tend to term them ‘microliths’.
- 09:30 They’re not monolithic, but they are applications that do a particular thing - like servicing card requests, like when someone uses their debit card in a shop, the request comes into the card service.
- 09:45 If they want to make a payment, we have a payment service that sends that out so that the payment can be made.
- 09:55 We have services for updating personal details, or sending notifications out.
- 10:05 We have these microliths running in different regions in AWS, and run multiple instances of them at once: at least two for resilience, and more so we can scale up if needed.
- 10:15 They all run with PostgreSQL databases, so that’s where the storage is - also running in the cloud.
How big is Starling scaling today?
- 10:30 We’re talking hundreds of thousands of users - that’s the kind of numbers using the apps.
- 10:50 Access is very time dependent; although we’re running in a modern way, banking still happens in batches.
- 11:00 There may be very little activity, and then periods where there are millions of records coming through, presenting card transactions after a weekend, for example.
- 11:20 You can get fallow periods and periods of intense activity, so it’s variable.
Is it a transactional or event stream architecture?
- 11:45 You have what you think of as transactions, but we don’t have anything that processes transactions across the different applications.
- 12:00 There are still activities where the customer would think of it as a single transaction.
- 12:10 For example, making a payment, that gets sent off to one application.
- 12:20 Another application handles sending the payment, and a third application records in the ledger that the payment has been made and that we owe the money when we settle up at the end of the day.
- 12:30 If the customer makes a payment, we want to make sure that each one of those things happened, but we can’t have a transaction that crosses the boundary of those three applications.
- 12:40 We have code that checks things have happened, and if something gets stuck at one of those stages, it gets retried so that we are eventually consistent (see the sketch below).
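As a rough illustration of that catch-up pattern (not Starling's actual code), a scheduled check might look for work items that were started but never confirmed and retry them; the WorkItemStore and PaymentClient types below are hypothetical.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.List;
import java.util.UUID;

// Hypothetical sketch of a catch-up check: find payment steps that were
// started but never confirmed, and retry them until the whole flow has
// completed. Names (WorkItemStore, PaymentClient) are illustrative.
public class PaymentCatchUpProcessor {

    private final WorkItemStore store;         // reads/writes work items in PostgreSQL
    private final PaymentClient paymentClient; // calls the downstream payment application

    public PaymentCatchUpProcessor(WorkItemStore store, PaymentClient paymentClient) {
        this.store = store;
        this.paymentClient = paymentClient;
    }

    // Runs on a schedule (e.g. every five minutes).
    public void retryStuckPayments() {
        Instant cutoff = Instant.now().minus(Duration.ofMinutes(5));
        List<UUID> stuck = store.findStartedButNotCompletedBefore(cutoff);
        for (UUID paymentId : stuck) {
            paymentClient.send(paymentId);   // safe to repeat: the receiver is idempotent
        }
    }
}

interface WorkItemStore {
    List<UUID> findStartedButNotCompletedBefore(Instant cutoff);
}

interface PaymentClient {
    void send(UUID paymentId);
}
```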
What else is in the environment - what version of Java are you using?
- 13:25 We are on Java 8 - we intend to move off that fairly soon, moving to Java 11 via Java 10 as quickly as possible.
What plan are you making to get to Java 11?
- 14:00 We will have to do a lot of testing, and a lot of trials to get there.
- 14:05 We have a demo environment, which is a reasonably good facsimile of production in terms of volume - it has the same volumes of data as production.
- 14:20 We can test performance of things like GC, and we will almost certainly spin up an instance of the bank in Java 10 and Java 11 and see how it performs.
- 14:35 Obviously we have to get the code compiling first - but once we’re on there, we’re into a more stringent testing phase.
- 14:50 That’s going to be more of a tricky upgrade than we usually have.
- 14:55 When we deploy, we’re used to having two different versions of the applications, but we’re not used to having two different Java versions.
- 15:05 We’re in the very early stage of testing, but it’s not running in a deployed environment.
What garbage collector are you running on?
- 15:10 We’re using G1 - we’ve been on it quite a while.
- 15:30 In terms of garbage collection, we don’t have much of a problem with how it operates - we don’t need to do much tweaking.
- 15:40 The only place where we need to do any tweaking is where we have guaranteed SLAs, where we have to respond in a given timeframe.
- 15:55 There are really only two places in the bank where we have to do that: when someone is using their card in a shop (do they have enough money?), and when responding to payments being sent from one bank to another.
- 16:10 We have needed to tweak the GC in those places, but only slightly (an illustrative example follows below).
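For illustration only: targeted G1 tuning of this kind is often just a pause-time goal passed to the JVM. The flags below are standard HotSpot options, but the specific value and the card-service.jar name are assumptions, not Starling's configuration.

```
# Assumed example, not Starling's actual settings:
# run a service on G1 with a 100ms pause-time goal.
java -XX:+UseG1GC -XX:MaxGCPauseMillis=100 -jar card-service.jar
```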
How are the microliths organised?
- 16:30 We don’t have any front-end code; nothing is visualised, because it’s mobile.
- 16:45 The applications are basically the business logic layer, sitting above the database.
How do the applications talk to each other?
- 17:00 It’s all REST and JSON - both between the management apps and the back-end server apps, and between the server apps themselves.
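A minimal, Java 8-compatible sketch of what a service-to-service REST/JSON call of that shape could look like; the URL, endpoint path, and payload are invented for illustration.

```java
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Illustrative only: one back-end service POSTing a JSON command to another
// service over REST. The URL and JSON body are made up for the example.
public class NotificationCaller {

    public static int sendNotificationRequest(String customerUid) throws Exception {
        URL url = new URL("https://notification-service.internal/api/v1/notifications");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("POST");
        conn.setRequestProperty("Content-Type", "application/json");
        conn.setDoOutput(true);

        String json = "{\"customerUid\":\"" + customerUid + "\",\"type\":\"PAYMENT_SENT\"}";
        try (OutputStream out = conn.getOutputStream()) {
            out.write(json.getBytes(StandardCharsets.UTF_8));
        }
        return conn.getResponseCode(); // e.g. 200 once the request has been accepted
    }
}
```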
Why are you using vanilla Java, as opposed to any other frameworks?
- 17:25 The thought process for why we use vanilla Java is the same as for why we use Java at all (instead of Go or Rust or any other language).
- 17:45 Running a bank as a start-up is very different from any other start-up; not just from the business side (you have to have much more stringent capital requirements), but also from the technology side.
- 18:00 You need to make technology decisions on a much longer-term basis than you would normally.
- 18:10 Normally at a start-up, you’re trying to get something out as fast as possible; you’re aiming to get something out quickly, watch it succeed or fail, learn about the adoption.
- 18:20 We want to do that as well, but we have an additional requirement that the things that we try also have to be sustainable.
- 18:30 We have to be reliable in the long term; that when you deploy something, it could be around in 10 or 20 years' time - because that is what customers and the regulators are looking for.
- 18:50 Doing things in Java - which is a language that has been around a long time, is well supported, and has a lot of eyes on it - and using vanilla Java rather than any recent framework.
- 19:15 It means that what we are doing will be sustainable in the long term.
- 19:25 We want something that we can hire people for - and we want to avoid the situation that large government and banking institutions are now ending up in, where the people who maintain systems written in (for example) COBOL are retiring.
- 19:55 These institutions are finding it difficult to hire people to maintain their systems; they are getting to a state where they have a zombie system.
- 20:00 A zombie system is stuck; it’s a black-box, you can’t do anything with it - it just works, and no-one can touch it.
- 20:10 We have to keep things as simple as possible, and make sure that all the languages and frameworks are very long-term and very well supported, and that we will be able to hire people that know about them in 5, 10, 15 years' time.
You made the trade-off of iteration time for the reliability and long-term support that Java offers.
- 20:45 We think that we can achieve what traditional banks would consider super-fast cycle times with the way we operate and our architecture.
- 21:00 We can’t go to the stage that many start-ups go to where they are releasing code thousands of times per day.
What does your deployment pipeline look like?
- 21:20 We build each of the apps as a JAR, deploy them into containers, and then run multiple instances of that at once.
- 21:40 We are still deploying on a monolithic basis; we don’t deploy one app at a time.
- 21:50 We tend to deploy a version of the back-end at once; that involves spinning up new instances of each app and then killing off old ones.
What’s your availability?
- 22:00 It’s almost 24/7 - I won’t say it is completely 24/7; we have had a few planned outages where we have taken out some of our services for a few minutes to an hour.
- 22:20 I don’t think we’ve had a period where we’ve taken the whole banking application offline deliberately so that no one can access it.
How do you deploy and roll-back your code?
- 22:35 We have a home-brewed deployment system at the moment, which involves a number of instances of a particular version of an application running in production.
- 22:50 When the system detects a new version being deployed, it will build up the image that we want to deploy and then roll that out to production.
- 23:00 We will temporarily have more instances of the application running, with different versions, than we would normally have.
- 23:10 We will start up a new instance, verify that it is running, then kill one of the old ones, and repeat until all the application instances are running the new version (sketched below).
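A rough sketch of that rolling pattern, with a hypothetical Infrastructure interface (startInstance, isHealthy, stopInstance); Starling's home-grown deployment system is not public, so this only mirrors the steps described.

```java
import java.util.List;

// Hypothetical sketch of the roll described above: bring up a new-version
// instance, check it is healthy, then retire one old-version instance,
// repeating until only the new version is running.
public class RollingDeployer {

    public void roll(String newVersion, List<String> oldInstanceIds, Infrastructure infra)
            throws InterruptedException {
        for (String oldId : oldInstanceIds) {
            String newId = infra.startInstance(newVersion);
            while (!infra.isHealthy(newId)) {
                Thread.sleep(5_000);          // wait for the new instance to pass its health check
            }
            infra.stopInstance(oldId);        // only then retire one old-version instance
        }
    }

    public interface Infrastructure {
        String startInstance(String version);
        boolean isHealthy(String instanceId);
        void stopInstance(String instanceId);
    }
}
```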
Do requests that come in stick to a particular application instance?
- 23:35 We have a normal load balancer that will distribute out calls to the different instances.
- 23:50 As we are rolling between versions, we will have an instance where the load balancer will choose to direct a particular call to either the new version or the old version.
- 24:05 We can have situations where it matters whether requests go to the new version or the old version.
- 24:10 Every time we write code, we have to consider what would happen if a request joins in during the roll, and whether it matters.
- 24:25 If it does matter, how do we recover?
- 24:30 A lot of the time, it will be the case that what the old version will be doing is either wrong or the new functionality isn’t there, in which case we’ll get an exception.
- 24:40 Something will happen; the wrong version will be contacted, and an exception will be thrown.
- 24:45 Our architecture has a catch-up processor, which will try to replay the operation.
- 25:00 This reaches down to the heart of the architecture - every action happens at least once, and at most once.
- 25:15 It is important that both of those constraints are followed.
- 25:30 Our code is deliberately mistrustful of the other code in the system - it’s mistrustful not only of things coming from outside the bank, but of things inside the bank as well.
- 25:45 If we get one of these things happening during a roll, and a problem happens, then it will be tried again in five minutes' time.
How do you guarantee that your actions are replayed at most once?
- 26:20 There are a few key principles to it; the first is that everything we get is assigned a UUID - payments, cards, customers and so on.
- 26:40 As quickly as possible, the request is logged into the database.
- 26:50 When a request comes in from the mobile app, all that will happen synchronously is that we will take the request and queue it in the database, and respond 200 on the REST endpoint.
- 27:05 Now the request is logged, we have work that we can do.
- 27:15 At every stage, we will do some work with that request, and then log the fact that the work has been done.
- 27:20 The stages could involve taking the data and processing it somehow.
- 27:30 We’ll try it and if it works, then we’ll mark that piece of work as done.
- 27:45 If we try it and it doesn’t work, and we haven’t heard that it has been a success, then we’ll wait five minutes and try again.
- 28:00 This fulfils the "at least once" rule: you keep retrying every five minutes to make sure the work has been done (a condensed sketch of this flow follows below).
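A condensed sketch of the flow described above, under the assumption of a hypothetical CommandStore backed by PostgreSQL: persist the command synchronously, acknowledge with a 200, then process and mark it done in the background, retrying anything not yet done.

```java
import java.util.UUID;

// Illustrative sketch: synchronously persist the incoming command and return,
// then let background processing (and the five-minute retry) do the work.
public class PaymentCommandEndpoint {

    private final CommandStore store;   // writes commands to PostgreSQL

    public PaymentCommandEndpoint(CommandStore store) {
        this.store = store;
    }

    // Called by the REST layer; all that happens synchronously is the insert.
    public int handle(UUID paymentUid, String payload) {
        store.insertIfAbsent(paymentUid, payload);   // keyed by the correlation UUID
        return 200;                                  // acknowledge: the work is now durable
    }
}

// Runs in the background, and again every five minutes for anything not yet done.
class PaymentWorker {
    private final CommandStore store;
    private final PaymentProcessor processor;

    PaymentWorker(CommandStore store, PaymentProcessor processor) {
        this.store = store;
        this.processor = processor;
    }

    void processPending() {
        for (UUID uid : store.findNotDone()) {
            try {
                processor.process(uid);
                store.markDone(uid);       // only marked done once processing succeeded
            } catch (RuntimeException e) {
                // leave it not-done; the next pass (five minutes later) retries it
            }
        }
    }
}

interface CommandStore {
    void insertIfAbsent(UUID uid, String payload);
    Iterable<UUID> findNotDone();
    void markDone(UUID uid);
}

interface PaymentProcessor {
    void process(UUID uid);
}
```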
How do you debug and trace the logs?
- 29:05 We assign the correlation UUID, which persists across the different applications, and you can search by it in the logs to see where something failed.
- 29:25 When you’re tracking bugs in the system, you can follow the request through and find out why it failed and where.
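One common way to achieve this in Java, shown here only as a sketch, is to put the correlation UUID into SLF4J's MDC so that every log line written while handling a request carries it; the podcast does not say which logging setup Starling actually uses.

```java
import java.util.UUID;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.slf4j.MDC;

// Sketch only: put the correlation UUID into the logging context so every
// log line written while handling this request can be searched by that ID.
public class CorrelationLogging {

    private static final Logger log = LoggerFactory.getLogger(CorrelationLogging.class);

    public static void handleRequest(UUID correlationUid, Runnable work) {
        MDC.put("correlationUid", correlationUid.toString());
        try {
            log.info("Handling request");   // log pattern would include %X{correlationUid}
            work.run();
        } finally {
            MDC.remove("correlationUid");   // don't leak the ID onto the next request
        }
    }
}
```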
What happens to requests that come in when the services are starting?
- 30:00 We have the queues that the catch-up processes use; we have a command bus where a request will sit until it can be processed.
- 30:10 Sometimes we get back-ups, where we’re getting a flood of requests and we can’t handle them all.
- 30:25 Generally what we do is log the fact that we’ve received the request and return quickly.
- 30:40 Typically the problem is not logging the request in the first place, but in trying to process those requests and failing to do so because something is too busy.
- 30:55 What we can do is kill the server, if it’s backed up with far too much work.
- 31:10 That will lose all the work in the in-memory queue, but it will come back from the database, because we haven’t marked it as done since it wasn’t completed.
- 31:25 Crucially it turns out that if it did get done, and we haven’t managed to respond, then the "at most once" principle comes in.
- 31:40 When the server comes back and tries to replay the request, if it has already been done then we don’t throw an error, but return successfully (see the sketch below).
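A minimal sketch of that at-most-once check, with illustrative names: before replaying, ask whether the work is already marked done and, if so, succeed without doing it again.

```java
import java.util.UUID;

// Illustrative idempotent replay: if the work was already completed before the
// server was killed, the replay is a no-op that still reports success.
public class IdempotentReplayer {

    private final CommandStore store;
    private final PaymentProcessor processor;

    public IdempotentReplayer(CommandStore store, PaymentProcessor processor) {
        this.store = store;
        this.processor = processor;
    }

    public void replay(UUID uid) {
        if (store.isDone(uid)) {
            return;                  // already done: succeed quietly, don't do it twice
        }
        processor.process(uid);
        store.markDone(uid);
    }

    interface CommandStore {
        boolean isDone(UUID uid);
        void markDone(UUID uid);
    }

    interface PaymentProcessor {
        void process(UUID uid);
    }
}
```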
Banking has typically been risk-averse; how do you deal with chaos testing?
- 32:20 Regulators in Britain are slowly coming round to the idea that running a bank in the cloud with modern software and infrastructure management methods is not as much of a problem as they feared.
- 32:40 Our CIO tells a story about how he went to the regulator and they asked who had the keys to the server room; he had to respond that not only did we not know who had the keys, we didn't know where the server room was.
- 33:00 That shocked them a bit, but gradually over time, by presenting to them and showing them how we work, we have shown them that what we are doing is more resilient than how other banks work.
- 33:25 We don’t have to worry about server maintenance, or having a primary and fail-over server onsite - we don’t have to worry about having a team maintaining servers years into the future.
- 33:35 We have outsourced that and bought it from a recognised industry-standard provider.
- 33:45 Chaos testing is another element of that - it’s there to show robustness, and we’re not just faking something in a test environment.
- 34:00 We are actually running chaos testing in production, and showing that if an instance disappears, then we can recover with a negligible impact on our customers.
What kind of experiments do you run with your Chaos testing?
- 34:25 We are not doing a huge amount - it’s not very advanced.
- 34:30 We are mainly killing individual instances; we take one out occasionally, at a completely random point in time.
- 34:45 We’ll record when we did that, and then we’ll have a look at the logs and see whether it caused any problems or exceptions.
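A hedged sketch of that style of experiment using the AWS SDK for Java v2: pick one running instance and terminate it, recording when it was done so the logs can be checked afterwards. The tag filter and service name are assumptions, not Starling's actual tooling.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

import software.amazon.awssdk.services.ec2.Ec2Client;
import software.amazon.awssdk.services.ec2.model.DescribeInstancesRequest;
import software.amazon.awssdk.services.ec2.model.Filter;
import software.amazon.awssdk.services.ec2.model.Instance;
import software.amazon.awssdk.services.ec2.model.Reservation;
import software.amazon.awssdk.services.ec2.model.TerminateInstancesRequest;

// Illustrative only: pick one running instance (filtered by a made-up tag)
// and terminate it, recording when we did so for later comparison with the logs.
public class SimpleChaosKiller {

    public static void main(String[] args) {
        try (Ec2Client ec2 = Ec2Client.create()) {
            DescribeInstancesRequest describe = DescribeInstancesRequest.builder()
                    .filters(
                            Filter.builder().name("tag:service").values("microlith").build(),
                            Filter.builder().name("instance-state-name").values("running").build())
                    .build();

            List<Instance> candidates = new ArrayList<>();
            for (Reservation reservation : ec2.describeInstances(describe).reservations()) {
                candidates.addAll(reservation.instances());
            }
            if (candidates.isEmpty()) {
                return;   // nothing to kill
            }

            Instance victim = candidates.get(new Random().nextInt(candidates.size()));
            System.out.println("Terminating " + victim.instanceId() + " at " + java.time.Instant.now());
            ec2.terminateInstances(TerminateInstancesRequest.builder()
                    .instanceIds(victim.instanceId())
                    .build());
        }
    }
}
```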
What would you architect differently if you had to start over?
- 35:15 I would go for this microlith architecture much sooner.
- 35:20 When we started off, we put a lot of code into one place as a monolithic application, which has now grown too big.
- 35:40 We often get problems where we want to put code in a new application, but it has to tie back to this one giant monolith.