Monzo developed a solution for shedding traffic in case its platform comes under intense and unexpected load that could lead to an outage. Traffic spikes can be generated by the mobile app and triggered by push notifications or other bursts in user activity. The solution can reduce the read traffic by almost 50% with 90% overall accuracy without noticeable customer impact.
The Monzo banking platform has millions of users who interact with it primarily through the mobile app. Occasionally, spikes in traffic can destabilize the platform. These spikes can be caused by periodic push notifications sent to a large number of users or by time-specific features such as Get Paid Early. The team at Monzo uses proactive scaling to ensure the platform has sufficient capacity to handle Get Paid Early events, but sudden traffic spikes still pose a significant risk.
Jacob Moxham, a staff engineer at Monzo, explains why the stampeding herd effect (similar to the thundering herd problem) is so dangerous for the stability of Monzo’s platform:
Stampeding herd [...] is a term we use when large numbers of our customers open the app within a very short time period. If we aren’t prepared for these moments, we can exhaust our buffer capacity and can’t scale our platform up quickly enough. In the worst cases of this, shared infrastructure could become overloaded, causing widespread disruption.
The problem is amplified because the Monzo app prefetches data when it is opened or when it receives a push notification, to ensure that up-to-date information is available immediately. The team suspected that most of these requests would simply return the same data. After the engineers added logging to the edge proxy for 0.1% of users, the logs showed that, over a 24-hour period, around 70% of requests returned the same data.
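As an illustration of how such a measurement could be made, the sketch below computes the share of logged requests whose response hash matches the previous response for the same user and endpoint. It is a minimal example, not Monzo's logging pipeline: the log format, the `repeat_fraction` function, and the sample entries are all assumptions made for this illustration.

```python
from typing import Iterable, Tuple

# Hypothetical log entry: (user_id, endpoint, response_hash), in time order.
LogEntry = Tuple[str, str, str]


def repeat_fraction(log: Iterable[LogEntry]) -> float:
    """Return the fraction of requests whose response hash equals the
    previous response hash for the same (user, endpoint) pair."""
    last_hash = {}
    repeats = 0
    total = 0
    for user_id, endpoint, response_hash in log:
        key = (user_id, endpoint)
        if last_hash.get(key) == response_hash:
            repeats += 1  # same data as the previous response: a "wasted" request
        last_hash[key] = response_hash
        total += 1
    return repeats / total if total else 0.0


# Made-up sample data, purely for illustration.
sample_log = [
    ("u1", "/feed", "a1"),
    ("u1", "/feed", "a1"),  # repeat
    ("u1", "/feed", "b2"),  # changed
    ("u2", "/feed", "a1"),  # first request for u2, not a repeat
]
fraction = repeat_fraction(sample_log)
```

Note that the first request for each user and endpoint is counted as non-repeated here; a different convention would shift the resulting fraction.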
To eliminate "wasted" requests, engineers first opted to create a "changes API" that would return the last-updated time for the most commonly used and most expensive endpoints. The mobile app would query the new changes API and request the data only if it had changed since the last call. However, it proved difficult to provide an accurate last-updated timestamp because of the real-time data enrichment implemented in the regular API endpoints and the complex data flows behind updates to API resources.
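The client-side logic of the abandoned "changes API" idea can be sketched roughly as follows. This is an illustrative example under stated assumptions, not Monzo's design: the endpoint names, the timestamp representation (Unix seconds), and the `endpoints_to_refetch` helper are hypothetical.

```python
from typing import Dict, List


def endpoints_to_refetch(
    last_seen: Dict[str, int], changes: Dict[str, int]
) -> List[str]:
    """Return the endpoints whose server-side last-updated time is newer
    than the copy the client already holds.

    last_seen: endpoint -> last-updated time of the client's cached copy
    changes:   endpoint -> last-updated time reported by the changes API
    """
    return sorted(
        endpoint
        for endpoint, updated_at in changes.items()
        # Endpoints the client has never fetched are always refetched.
        if updated_at > last_seen.get(endpoint, -1)
    )


# Hypothetical endpoints and timestamps, for illustration only.
cached = {"/feed": 1_700_000_000, "/balance": 1_700_000_500}
reported = {"/feed": 1_700_000_000, "/balance": 1_700_000_900, "/pots": 1_700_000_100}
stale = endpoints_to_refetch(cached, reported)
```

The sketch also makes the weakness visible: it is only as good as the timestamps in `changes`, which is exactly what was hard to compute accurately.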
The Request-shedding Logic in the Edge Proxy (Source: Monzo Technology Blog)
Instead, the team concluded that rather than implementing a perfect and permanent solution, they could create an adequate but much more cost-effective solution and activate it only when the platform is under heavy and unexpected load. They identified three features to help determine whether to shed the request: the time since the response was computed for the request, the trigger for the data prefetch, and how long the mobile app was open when the request was made.
For the first feature, the engineers repurposed the ETag HTTP header returned by the API endpoints to carry the response hash and the time the response was last computed. When prefetching data, the mobile app sends an If-None-Match HTTP header containing the value of the ETag header previously returned for the identical request, and passes the other two features in custom headers. Based on this header metadata, load-shedding policies deployed in the edge proxy determine whether to shed the request and return a 304 (Not Modified) status code, or to let it through and return the computed response. Policies for different prefetch triggers can be activated individually, allowing the team to progressively shed segments of the mobile app traffic.
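The header-based decision described above might look something like the following sketch. It is not Monzo's implementation: the ETag layout (`"<hash>.<unix-seconds>"`), the `X-Prefetch-Trigger` header name, the trigger names, and the age thresholds are all assumptions chosen for illustration. The third feature (how long the app has been open) is left out because the source does not say how it is used.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


def make_etag(body: bytes, computed_at: int) -> str:
    """Pack a response hash and the compute time (Unix seconds) into an ETag."""
    digest = hashlib.sha256(body).hexdigest()[:16]
    return f'"{digest}.{computed_at}"'


def parse_etag(etag: str) -> Optional[Tuple[str, int]]:
    """Recover (hash, computed_at) from an ETag, or None if it is malformed."""
    try:
        digest, timestamp = etag.strip('"').rsplit(".", 1)
        return digest, int(timestamp)
    except ValueError:
        return None


@dataclass(frozen=True)
class Policy:
    """Hypothetical per-trigger policy that can be switched on independently."""
    max_age_s: int  # shed only if the cached response is at most this old
    enabled: bool


def should_shed(headers: Dict[str, str], policies: Dict[str, Policy], now: int) -> bool:
    """Return True if the edge proxy should answer 304 without calling the backend."""
    parsed = parse_etag(headers.get("If-None-Match", ""))
    if parsed is None:
        return False  # no usable metadata: always serve the request
    policy = policies.get(headers.get("X-Prefetch-Trigger", ""))
    if policy is None or not policy.enabled:
        return False  # no active policy for this trigger
    _, computed_at = parsed
    return now - computed_at <= policy.max_age_s


# Illustrative configuration: only the push-notification policy is active.
policies = {
    "push": Policy(max_age_s=300, enabled=True),
    "app_open": Policy(max_age_s=60, enabled=False),
}
etag = make_etag(b'{"balance": 100}', computed_at=1000)
request = {"If-None-Match": etag, "X-Prefetch-Trigger": "push"}
status = 304 if should_shed(request, policies, now=1100) else 200
```

Keeping one policy per trigger mirrors the ability to activate policies individually, so that traffic can be shed one segment at a time.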
The Reduction in Traffic With Request Shedding Activated (Source: Monzo Technology Blog)
The team trialed the new set of policies by deploying them in shadow mode, where responses were still computed and the shedding decision derived from the request metadata was compared with the actual outcome. With all policies activated, the platform could shed almost 50% of GET requests with 90% overall accuracy. The engineer reported that customer impact wasn’t noticeable, and that a small percentage of users seeing stale data is acceptable compared to a major outage that would affect the entire platform.
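A shadow-mode evaluation of this kind can be summarized with two numbers: the share of requests a policy would have shed, and how often its decision matched reality. The sketch below shows one plausible way to compute them. The source does not define "accuracy", so the definition used here (the decision is correct when a request is shed and the response was unchanged, or served and the response had changed) is an assumption, as are the `ShadowRecord` type and the sample data.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class ShadowRecord:
    """One request observed in shadow mode (hypothetical record shape)."""
    would_shed: bool          # what the policy decided from request metadata
    response_unchanged: bool  # whether the computed response matched the cached one


def shadow_metrics(records: List[ShadowRecord]) -> Tuple[float, float]:
    """Return (shed_rate, accuracy) for a batch of shadow-mode records."""
    total = len(records)
    if total == 0:
        return 0.0, 0.0
    shed = sum(r.would_shed for r in records)
    # Assumed definition: a decision is correct when shedding coincides
    # with an unchanged response, or serving coincides with a changed one.
    correct = sum(r.would_shed == r.response_unchanged for r in records)
    return shed / total, correct / total


# Synthetic data for illustration only, not Monzo's measurements.
records = (
    [ShadowRecord(would_shed=True, response_unchanged=True)] * 4
    + [ShadowRecord(would_shed=True, response_unchanged=False)] * 1
    + [ShadowRecord(would_shed=False, response_unchanged=False)] * 5
)
shed_rate, accuracy = shadow_metrics(records)
```

Because the real response is still computed in shadow mode, an incorrect decision costs nothing during the trial; it only shows up in the accuracy figure.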