Key Takeaways
- Redundancy and isolation: two sides of the same coin
- Protect against your own bugs, not just environmental factors
- Slow-downs can cross service boundaries more easily than crashes
- Synchronous APIs provide ample opportunities for fault propagation - avoid!
- Established cloud patterns work, even without schedulers and meshes - day-one architecture should be simple to the point of simplistic
Resilience is about tolerating failure, not eliminating it.
You cannot devote all your time to avoiding failure. If you do, you will build a system that is hopelessly brittle. If you really hope to build a resilient system, you must build a system that absorbs shocks, and continues or recovers.
When we built Starling Bank from scratch, in a year, against a backdrop of highly public outages amongst incumbent banks, we knew we needed surefire approaches to resilience. We wanted the assurance that comes from chaos engineering, and the single-minded pursuit of simplicity and brutal prioritisation that sees the best startups thrive.
The following are the principles and beliefs that underlie our approach to resilience.
By the way, there are some features of our approach I don’t cover here, but which are nevertheless essential:
- Chaos engineering - because resilience must be tested
- Monitoring - because resilience requires supportability
- Incident response - because resilience applies to both the system and the organisation that operates it
Instead, let’s focus on resilient architecture from the infrastructure up.
Resilient Architecture
Having a resilient service for customers means ensuring that when a failure occurs, the part of the system it affects is small in comparison to the system as a whole.
There are two ways to ensure this. Redundancy is about ensuring the system as a whole extends out beyond the scope of failure. However much is impaired, we've simply got more in reserve. Isolation is about ensuring that the scope of failure remains confined within a small area and cannot spread out to the boundaries of our system.
However you design your system, you must have good answers for both of these. With these in mind, you must mentally test your design against every fault you can imagine.
To help, consider the following dimensions of failure:
- faults at an infrastructural level (like network failure), as well as faults at an application level (like uncaught exceptions or panics).
- faults that are intrinsic to the software we build (caused by us, i.e. bugs) as well as those that are extrinsic (caused by others, e.g. invalid messages).
It is not sufficient to assume the intrinsic faults will be shaken out by testing. It is as important to protect yourself from yourself as from the rest of the world.
The single biggest factor in how easy it is to reason about failure is simplicity of design, so always choose the simplest design that fits your need.
Redundancy
Redundant infrastructure is easy in the cloud. Infrastructure as code allows us to create several of anything as easily as we can create one.
Cloud providers, like AWS and GCP, provide IaaS primitives like load balancers and scaling groups, and well-documented patterns for how to use them. Execution environments like Kubernetes offer resources like deployments and replication controllers which continually ensure that several replicas of a service are up and available. Serverless technologies will scale your services for you without any configuration at all.
There are some provisos though.
State is the old enemy. It is much harder to scale out services when instances make assumptions about the accuracy of their local state. Either state must be shared externally via a highly available data store, or cluster coordination mechanisms must be introduced to ensure that each instance's local picture is consistent with that of its peers, in effect turning the servers into a distributed cache.
My recommendation, for startups at least, is to keep it very simple on day one. The database is for shared state. Use it for all shared state until you hit its limits. Local memory is for transient state only. This gives you a neat “immutable” compute layer and a database which has well-understood operational requirements. You know you will need to evolve this infrastructure at some point, so architect for change and watch your metrics. Until that day, benefit from simple scaling and redundancy.
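To make that split concrete, here is a minimal sketch in Java (the service and schema are invented for illustration): anything another instance might need goes straight to the database, while local memory holds only state we can afford to lose when an instance disappears.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import javax.sql.DataSource;

// Illustrative only: shared state lives in the database, so any instance can
// serve any request and losing an instance loses nothing that matters.
public class AddressBookService {
    private final DataSource dataSource;                 // shared state: the database
    private final Map<String, String> displayNameCache =
            new ConcurrentHashMap<>();                   // transient, rebuildable convenience only

    public AddressBookService(DataSource dataSource) {
        this.dataSource = dataSource;
    }

    public void saveDisplayName(String customerId, String displayName) throws Exception {
        try (Connection c = dataSource.getConnection();
             PreparedStatement ps = c.prepareStatement(
                     "UPDATE customer SET display_name = ? WHERE customer_id = ?")) {
            ps.setString(1, displayName);
            ps.setString(2, customerId);
            ps.executeUpdate();                          // durable and visible to every instance
        }
        displayNameCache.put(customerId, displayName);   // purely an optimisation; safe to lose
    }
}
```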
Serverless environments usually offer a stateless function service and separate database services for state, in effect making this simple, clean separation the default.
Shared resource constraints can be another problem. It is pointless scaling a service up to 10 instances when a back-end datastore will only accept enough connections to support 3. In fact, scaling out services can be hazardous when it exposes other components to load that they are unable to support. Increasing capacity in one part of the system can degrade the system as a whole. Realistically, you cannot understand all of these limits without a high degree of testing.
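As a concrete illustration, assuming HikariCP and a hypothetical database that accepts around 100 connections: each instance's connection pool has to be sized with the whole fleet in mind, not in isolation.

```java
import com.zaxxer.hikari.HikariConfig;
import com.zaxxer.hikari.HikariDataSource;

public class PoolSizing {
    // Hypothetical limits: the database accepts ~100 connections and we expect
    // to scale this service out to at most 10 instances.
    private static final int DB_MAX_CONNECTIONS = 100;
    private static final int MAX_INSTANCES = 10;

    public static HikariDataSource dataSource(String jdbcUrl) {
        HikariConfig config = new HikariConfig();
        config.setJdbcUrl(jdbcUrl);
        // Cap each instance so that instances * pool size stays under the database
        // limit, with headroom left for migrations, admin sessions and monitoring.
        config.setMaximumPoolSize((DB_MAX_CONNECTIONS / MAX_INSTANCES) - 2);
        // Fail fast rather than queueing callers indefinitely for a connection.
        config.setConnectionTimeout(2_000);
        return new HikariDataSource(config);
    }
}
```

Scale past ten instances without revisiting this arithmetic and you have quietly moved the failure into the database.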
Isolation
At the cloud infrastructure layer, AWS provides separate availability zones within every region, which give strong isolation between infrastructure components. Load balancers and scaling groups can span multiple availability zones.
The established patterns for building in AWS will ensure you get the basics right. Other clouds have similar capabilities.
At the application layer, microservices, self-contained systems and even traditional SOA are all design approaches that can induce strong firebreaks between services, allowing those components to fail independently without compromising the viability of the whole.
If a single instance of a service crashes, the service is not interrupted. If every instance of a service crashes, customer experience might be impaired in some way, but other services must continue to function.
Using this scheme, the more finely you split up your systems, the smaller the blast radius when one fails. But again, the devil is in the detail. There are many subtle ways in which services can depend upon one another, crossing service boundaries and allowing errors to propagate around the system. When services are very closely related it can often be simpler to treat them as, and build them as, a single service.
Some dependencies are less visible, like hidden protocols; service instances might be connected via clustering protocols that do master election, or two-phase commit protocols that vote across the network on every transaction (if you go in for that sort of thing). These are network interactions that engineers are often several layers of abstraction away from, but which tie services together in ways that can propagate errors.
As in life, cohabitation is a decision not to be taken lightly. If you run several services on the same instance, whether by deliberate, static grouping or by using a scheduler like Kubernetes, then those services are less isolated than they would be on separate machines. All are subject to many of the same failure conditions - not least hardware failure.
If instances of services X, Y and Z always run together on the same host, then if X can run amok and “poison” its host, X will always take the same neighbours with it, Y and Z, regardless of how many instances you have. By grouping them you have eroded the isolation between services X, Y, Z - even if your technology should guarantee that this sort of “poisoning” could not occur.
Schedulers can mitigate this by avoiding arrangements which would see the same set of services deployed together everywhere. So if one buggy service poisons its hosts you don't see a complete wipe-out of any other services, but instead a more diffuse, partial impact across a broader range of services which might not impact customer experience at all.
In our launch architecture, Starling Bank opted for a simple single-service-per-EC2-instance model. This was a trade-off very much aimed at simplicity rather than economy, but it also provides very strong isolation. Losing an instance is barely noticeable. Even if we were to deploy a rogue service that did something nasty to its host, it could not easily affect other services.
There are also many ways for error conditions to propagate in application-tier code.
Perhaps the most pernicious factor is time itself.
In any transactional system, time is arguably the most valuable resource you have. Even when you are not actively scheduled on a processor, you may be holding vital resources for the duration it takes you to get your work done: connections, locks, files, sockets. If you dawdle you will starve someone else of these resources. And if another thread cannot progress, it may not release resources it is holding. It may force other services into wait behaviours, propagating the contagion across the system.
While deadlock is often painted as the evil villain in concurrent programming, in reality resource starvation due to slow running transactions is probably a more common and equally crippling issue.
It's too easy to ignore this. An engineer’s first instinct is to be safe and correct: take a lock, enter a transaction, briefly consider setting a timeout and then skip it (engineers hate anything that looks arbitrary, and then there is the awkward choice of how to handle the timeout when it fires)… these simple temptations during everyday development can cause havoc in production.
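As a minimal sketch of the alternative, using the JDK's built-in HTTP client (the endpoint, time budgets and fallback are invented for illustration), every outbound call gets an explicit bound and an explicit answer to what happens when it fires:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.net.http.HttpTimeoutException;
import java.time.Duration;

public class QuoteClient {
    // Bound how long this call may hold on to our caller's time; the numbers
    // here are placeholders and should come from measuring the downstream.
    private final HttpClient client = HttpClient.newBuilder()
            .connectTimeout(Duration.ofSeconds(1))
            .build();

    public String latestQuote() throws Exception {
        HttpRequest request = HttpRequest.newBuilder(URI.create("https://quotes.internal/latest"))
                .timeout(Duration.ofSeconds(2))   // total budget for this request
                .GET()
                .build();
        try {
            return client.send(request, HttpResponse.BodyHandlers.ofString()).body();
        } catch (HttpTimeoutException e) {
            // The awkward choice, made explicitly: degrade to a known value rather
            // than letting the slowness propagate to whoever is waiting on us.
            return "QUOTE_UNAVAILABLE";
        }
    }
}
```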
Waits and delays must not be allowed to propagate across services. But unless you have some global precautions against this sort of misbehaviour, it's not hard to imagine scenarios where your system is vulnerable. Do you have fixed-size database connection pools? Or thread pools? Are you sure you never hold a database connection while making a REST call? If you do, could you exhaust a connection pool when network comms degrade? What else can eat time? What about the things you hardly even see, like logging calls? Or maybe some metrics that are repeatedly scraped and need database access start failing because the pool is maxed out, spraying garbage into your logs? Maybe you then blow the limits of your logging architecture and lose visibility of everything that's happening. It’s fun to construct nightmare scenarios that all start with a little bit of dawdling.
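One of those questions deserves a sketch. The payment table and partner client below are hypothetical, but the shape is common: the first method holds a pooled connection across a network call, the second releases it first.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import javax.sql.DataSource;

public class SettlementService {
    private final DataSource dataSource;
    private final PartnerClient partnerClient;   // stand-in for any remote call

    public SettlementService(DataSource dataSource, PartnerClient partnerClient) {
        this.dataSource = dataSource;
        this.partnerClient = partnerClient;
    }

    // Anti-pattern: the remote call runs while a pooled connection is checked out,
    // so a slow partner API can drain the pool and stall unrelated work.
    public void settleHoldingConnection(String paymentId) throws Exception {
        try (Connection c = dataSource.getConnection()) {
            updateStatus(c, paymentId, "SETTLING");
            partnerClient.notifySettlement(paymentId);   // network call with the connection held
            updateStatus(c, paymentId, "SETTLED");
        }
    }

    // Safer shape: short-lived connections for the database work only; nothing
    // from the pool is held while we wait on the network.
    public void settle(String paymentId) throws Exception {
        try (Connection c = dataSource.getConnection()) {
            updateStatus(c, paymentId, "SETTLING");
        }
        partnerClient.notifySettlement(paymentId);
        try (Connection c = dataSource.getConnection()) {
            updateStatus(c, paymentId, "SETTLED");
        }
    }

    private void updateStatus(Connection c, String paymentId, String status) throws Exception {
        try (PreparedStatement ps = c.prepareStatement(
                "UPDATE payment SET status = ? WHERE payment_id = ?")) {
            ps.setString(1, status);
            ps.setString(2, paymentId);
            ps.executeUpdate();
        }
    }

    public interface PartnerClient {
        void notifySettlement(String paymentId);
    }
}
```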
Precautions abound: circuit breakers, bulkheads, retry budgets, deadline propagation… and there are libraries that help you implement these patterns in your software (hystrix, resilience4j) and service meshes or intermediaries that can provide some of these at a lower level (istio, conduit, linkerd). But none of these are going to do the right thing without investment of time and effort. As always, the performance and failure modes of any piece of code come right back to the engineer.
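For instance, wrapping a downstream lookup in a circuit breaker might look something like this with resilience4j; the service name, thresholds and fallback are illustrative, and choosing them well is exactly the investment of time and effort referred to above.

```java
import io.github.resilience4j.circuitbreaker.CallNotPermittedException;
import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig;
import io.github.resilience4j.circuitbreaker.CircuitBreakerRegistry;
import java.time.Duration;
import java.util.function.Supplier;

public class FxRateLookup {
    private final CircuitBreaker breaker;
    private final Supplier<String> remoteLookup;   // the real call to the downstream service

    public FxRateLookup(Supplier<String> remoteLookup) {
        CircuitBreakerConfig config = CircuitBreakerConfig.custom()
                .failureRateThreshold(50)                        // open once half of recent calls fail
                .waitDurationInOpenState(Duration.ofSeconds(30)) // give the downstream time to recover
                .slidingWindowSize(20)
                .build();
        this.breaker = CircuitBreakerRegistry.of(config).circuitBreaker("fxRates");
        this.remoteLookup = remoteLookup;
    }

    public String latestRate() {
        Supplier<String> guarded = CircuitBreaker.decorateSupplier(breaker, remoteLookup);
        try {
            return guarded.get();   // failures still propagate and are recorded by the breaker
        } catch (CallNotPermittedException e) {
            // Breaker is open: fail fast with a default instead of queueing threads
            // behind a struggling dependency.
            return "RATE_UNAVAILABLE";
        }
    }
}
```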
Synchronous calls borrow somebody else's time. And, as the provider of a synchronous API, designed by sound principles to know nothing about your callers, you usually have no idea how precious that time is to them or how vulnerable they are to your dawdling or failing. You are holding them hostage. For this reason we prefer asynchronous APIs between services and only use synchronous APIs where absolutely necessary or for very simple read-only calls.
Front-ends can make “optimistic” UI updates instead of waiting for confirmation and show stale information when offline. In most cases there is some cost to making decisions on stale information, but that cost need not necessarily be prohibitive. As Pat Helland once wrote: computing consists of memories, guesses and apologies. There is always a cost when a system representation diverges from reality, and it will always diverge from reality at some point when reality changes behind your back. In distributed systems there is no real “now”. Evaluate the cost of being “wrong” before deciding you need perfect transactional semantics or synchronous responses.
For a banking system, tolerating failure means tolerating failure without losing information. So while you don't necessarily need queues to implement asynchronous systems, you probably want your system as a whole to manifest some queue-like properties - like at-least-once and at-most-once delivery even in the presence of faults. That way you can ensure that some service has always taken ownership of a piece of data or a command.
You can achieve both using idempotence and catch-up processing in your services (as in Starling's "DITTO" architecture - with lots of autonomous services continually trying to do idempotent things to each other), or use queues for at-least-once guarantees with idempotent payloads for at-most-once (like Nubank's Kafka infrastructure), or try to delegate both to an “enterprise service bus” (as in many legacy banking architectures).
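To make the first of those options concrete, here is a minimal sketch of the general shape, not Starling's actual DITTO code: commands are recorded durably, handlers are safe to re-apply, and a catch-up loop keeps retrying anything unfinished until some service has definitely taken ownership of it.

```java
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Sketch only: the store, handler and Command type are hypothetical.
public class CatchUpProcessor {

    public record Command(String id, String type, String payload) {}

    public interface CommandStore {                  // durable, e.g. a table in the database
        List<Command> findIncomplete();              // e.g. WHERE completed_at IS NULL
        void markComplete(String commandId);         // no-op if already marked complete
    }

    public interface CommandHandler {
        void apply(Command command);                 // must be idempotent: safe to call repeatedly
    }

    private final CommandStore store;
    private final CommandHandler handler;
    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

    public CatchUpProcessor(CommandStore store, CommandHandler handler) {
        this.store = store;
        this.handler = handler;
    }

    public void start() {
        // At-least-once: keep retrying until the command is recorded as complete.
        // Idempotent handlers make the inevitable repeats harmless.
        scheduler.scheduleWithFixedDelay(this::catchUp, 0, 30, TimeUnit.SECONDS);
    }

    private void catchUp() {
        for (Command command : store.findIncomplete()) {
            try {
                handler.apply(command);
                store.markComplete(command.id());
            } catch (Exception e) {
                // Leave it incomplete; a later pass, or another instance, will retry.
            }
        }
    }
}
```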
Using these techniques it is perfectly possible to create responsive applications on asynchronous inter-service underpinnings.
Summary
It is easier than ever to write resilient systems. In the first twelve months of operation, Starling Bank experienced many localised faults and degradations of parts of the system - never was the entire system unavailable. This comes mainly from emphasising redundancy and isolation at every level.
We designed an architecture that tolerates service outage and we treat seriously the insight that a slow service can be more of a threat to an end user than a dead service.
None of this would matter at all if we did not also have solid approaches for the items listed at the start: chaos engineering, monitoring and incident response - but those are stories for another day.
About the Author
Greg Hawkins is an independent consultant on tech, fintech, cloud and devops. He was CTO of Starling Bank, the UK mobile-only challenger bank, from 2016 to 2018, during which time the fintech start-up acquired a banking licence and went from zero to smashing through the 100K download mark on both mobile platforms. He remains a senior advisor to Starling Bank today. Starling built full-stack banking systems from scratch, became the first UK current account available to the general public to be deployed entirely in the cloud and met all the regulatory, security and availability expectations that apply to retail banking.