Monzo, the digital, mobile-only bank based in the UK, recently suffered an outage affecting both its current account payments and its prepaid debit card system. Oliver Beattie, Monzo's head of engineering, took to Monzo's community forum to provide a post-mortem of the outage.
Monzo designed its infrastructure from the very beginning with global scale as one of its core assumptions. This has led to hundreds of microservices being developed over time.
These microservices are packaged as Docker containers and deployed with Kubernetes on AWS. Kubernetes stores its cluster state in etcd, which it uses to track which services are deployed where and what state each service is in. Routing and load balancing between services is handled by linkerd.
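To make the roles of these components concrete, the following is a minimal, hypothetical sketch in Go of what a routing layer such as linkerd does conceptually: it keeps a list of endpoints for a service (normally obtained from the orchestrator's service discovery, not hard-coded) and spreads requests across them. The endpoint addresses and the /ping path are placeholders, not Monzo's actual setup.

```go
// Minimal sketch (not Monzo's or linkerd's actual code) of round-robin
// routing over a set of service endpoints supplied by service discovery.
package main

import (
	"fmt"
	"net/http"
	"sync"
	"time"
)

// roundRobin hands out endpoints for a service in rotation.
type roundRobin struct {
	mu        sync.Mutex
	endpoints []string // e.g. pod IP:port pairs reported by the orchestrator
	next      int
}

func (r *roundRobin) pick() string {
	r.mu.Lock()
	defer r.mu.Unlock()
	ep := r.endpoints[r.next%len(r.endpoints)]
	r.next++
	return ep
}

func main() {
	// These addresses would normally come from Kubernetes/etcd via a watch;
	// they are hard-coded placeholders here.
	router := &roundRobin{endpoints: []string{"10.0.1.12:8080", "10.0.2.7:8080"}}

	client := &http.Client{Timeout: 2 * time.Second} // fail fast on dead endpoints

	for i := 0; i < 4; i++ {
		ep := router.pick()
		resp, err := client.Get("http://" + ep + "/ping")
		if err != nil {
			fmt.Println("request to", ep, "failed:", err)
			continue
		}
		resp.Body.Close()
		fmt.Println("request to", ep, "returned", resp.Status)
	}
}
```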
The outage, which affected both prepaid card and current account holders, was caused by a combination of factors.
First of all, there was a bug in Kubernetes that caused requests to time out after a cluster reconfiguration. A reconfiguration performed a week before the actual outage started triggering these timeouts, preventing linkerd from receiving service discovery updates from Kubernetes.
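The following is a minimal sketch, under assumed names and not based on linkerd's or Monzo's code, of why missing discovery updates surface as timeouts: the routing layer keeps serving from a cached endpoint list, and if the watch that refreshes that cache stalls, traffic keeps being sent to endpoints that no longer exist. Recording when the last update arrived makes the staleness visible instead of letting it manifest only as slow failures.

```go
// Hypothetical sketch of a stale endpoint cache in a routing layer.
package main

import (
	"errors"
	"fmt"
	"time"
)

type endpointCache struct {
	endpoints   []string
	lastUpdated time.Time
}

// refresh would normally apply an update pushed by the orchestrator's watch API.
func (c *endpointCache) refresh(endpoints []string) {
	c.endpoints = endpoints
	c.lastUpdated = time.Now()
}

// pick returns an endpoint, or an error if no updates have arrived recently,
// so the caller can alert instead of silently timing out against dead endpoints.
func (c *endpointCache) pick(maxAge time.Duration) (string, error) {
	if time.Since(c.lastUpdated) > maxAge {
		return "", errors.New("endpoint cache stale: no discovery updates received")
	}
	return c.endpoints[0], nil
}

func main() {
	cache := &endpointCache{}
	cache.refresh([]string{"10.0.1.12:8080"}) // initial update from discovery

	// Simulate the watch stalling: no further refresh calls arrive.
	time.Sleep(50 * time.Millisecond)

	if _, err := cache.pick(10 * time.Millisecond); err != nil {
		fmt.Println("routing error:", err) // staleness detected explicitly
	}
}
```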
Then, when the outage happened, one of the immediate reactions was to restart all linkerd instances. This exposed an incompatibility between the versions of Kubernetes and linkerd Monzo was using, escalating the situation from a service-specific outage to a full platform outage. The thread in Monzo's community forum also provides a full timeline of events.
There are interesting lessons to learn from an outage like this one. Beyond fixing bugs and keeping the versions of different components in check for incompatibilities and other issues, Monzo has identified the need to improve its procedures for communicating outages, both internally and externally.
Another lesson is the importance of alerts, dashboards and health checks at every layer of an application, so that human and other errors are detected early. All in all, it's important to do everything we can to prevent outages, and also to resolve them quickly and communicate clearly what happened afterwards, so that we can build better safeguards for the future.
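As an illustration of the health-check side of that lesson, here is a minimal sketch in Go of liveness and readiness endpoints that an orchestrator or router could probe; the paths, port and dependency check are assumptions for illustration, not Monzo's actual configuration.

```go
// Hypothetical liveness/readiness endpoints using only the standard library.
package main

import (
	"log"
	"net/http"
	"time"
)

// dependencyHealthy stands in for a real check (database ping, downstream
// service call, etc.); here it is a placeholder that always succeeds.
func dependencyHealthy() bool {
	return true
}

func main() {
	mux := http.NewServeMux()

	// Liveness: the process is up and able to serve requests.
	mux.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	})

	// Readiness: dependencies are reachable; orchestrators and routers can use
	// this to stop sending traffic to unhealthy instances.
	mux.HandleFunc("/readyz", func(w http.ResponseWriter, r *http.Request) {
		if !dependencyHealthy() {
			http.Error(w, "dependency unavailable", http.StatusServiceUnavailable)
			return
		}
		w.WriteHeader(http.StatusOK)
	})

	server := &http.Server{
		Addr:         ":8080",
		Handler:      mux,
		ReadTimeout:  5 * time.Second,
		WriteTimeout: 5 * time.Second,
	}
	log.Fatal(server.ListenAndServe())
}
```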