Cloudflare, the web performance and reliability company, recently suffered a 27-minute partial outage affecting internet properties and services that rely on its network. An error in the Cloudflare backbone network caused the outage, dropping traffic across the network by about 50%. In a blog post, John Graham-Cumming, CTO of Cloudflare, clarified that the outage was not caused by an attack or breach of any kind.
Describing the timeline of the issue, Graham-Cumming said that, while working on an unrelated issue with a segment of the backbone between Newark and Chicago, the network engineering team updated the configuration on a router in Atlanta to reduce congestion. An error in that configuration caused all traffic across the backbone to be sent through Atlanta, placing a massive load on the router.
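Cloudflare has not published the exact policy involved, but the failure mode it describes is a familiar one in backbone routing: a policy term meant to apply a high preference only to a site's own prefixes ends up matching every prefix once its filter is lost, and best-path selection then pulls all traffic toward that one router. The Python sketch below is a hypothetical illustration of that mechanism, not Cloudflare's configuration; the prefixes, preference values, and router names are invented.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Route:
    prefix: str
    next_hop: str    # router announcing the route
    local_pref: int  # higher wins in best-path selection

# Hypothetical prefix owned by the Atlanta site; not a real Cloudflare prefix.
ATLANTA_LOCAL_PREFIXES = {"198.51.100.0/24"}

def atlanta_export(prefix: str, filter_to_local: bool) -> Optional[Route]:
    """Announce a route from the Atlanta router.

    With the prefix filter intact, only Atlanta's own prefixes receive the
    high preference.  With the filter dropped (the misconfiguration case),
    every prefix is announced with the high preference.
    """
    if filter_to_local and prefix not in ATLANTA_LOCAL_PREFIXES:
        return None
    return Route(prefix, next_hop="atlanta", local_pref=300)

def best_path(candidates: List[Route]) -> Route:
    # Simplified best-path selection: the highest local preference wins.
    return max(candidates, key=lambda r: r.local_pref)

prefixes = ["198.51.100.0/24", "203.0.113.0/24", "192.0.2.0/24"]
for filtered in (True, False):
    chosen = {}
    for p in prefixes:
        # Each prefix also has a sensible route via the nearest backbone site.
        candidates = [Route(p, next_hop="nearest-site", local_pref=200)]
        atlanta = atlanta_export(p, filter_to_local=filtered)
        if atlanta is not None:
            candidates.append(atlanta)
        chosen[p] = best_path(candidates).next_hop
    print("filter intact" if filtered else "filter dropped", "->", chosen)
```

With the filter intact, only Atlanta's own prefix resolves to the Atlanta next hop; with it dropped, every prefix does, which mirrors the load pattern described above: one router absorbing traffic meant for the whole backbone.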
As a result, Cloudflare network locations connected to Atlanta became unavailable. The 19 affected locations were San Jose, Dallas, Seattle, Los Angeles, Chicago, Washington, DC, Richmond, Newark, Atlanta, London, Amsterdam, Frankfurt, Paris, Stockholm, Moscow, St. Petersburg, São Paulo, Curitiba, and Porto Alegre.
Twitter users started reporting that services were down, as sites and services such as League of Legends, Deliveroo, Discord, Feedly, GitLab, Medium, Patreon, Politico, and Shopify were affected.
Issuing an apology, Graham-Cumming said that a global change had been made to the backbone configuration to prevent such an outage from occurring again. Providing further detail, Matthew Prince, CEO of Cloudflare, tweeted, "the root cause was a typo in a router configuration on the private backbone. We've applied safeguards to ensure a mistake like this will not cause problems in the future."
Across internet discussion forums, reaction was a mix of sympathy and skepticism. On Reddit, one user named rotarychainsaw empathized with how easy it is to make a mistake such as a typo: "I mean... Who hasn't done this before really?" Several other commenters in the same thread questioned the review process, with hennirl asking, "I'm curious to see how this change even made it through a change review. Surely they do diffs of config changes and have at least two other sets of eyes review? [...]".
This outage comes a year after a similar incident on July 2nd, 2019, when Cloudflare sites returned 502 errors caused by a massive spike in CPU utilization across the network. Urging users to ask "hard" questions, Jerome Fleury tweeted that the outage taught the team "lots of lessons".
Interested readers can learn more about postmortems, "root causing" production issues, and overcoming obstacles to learning in a related InfoQ podcast with Ryan Kitchens, and via the Learning from Incidents blog.