BT

Facilitating the Spread of Knowledge and Innovation in Professional Software Development

Write for InfoQ

Topics

Choose your language

InfoQ Homepage News Google Cloud Incident Root-Cause Analysis and Remediation

Google Cloud Incident Root-Cause Analysis and Remediation

Google disclosed its root-cause analysis of an incident affecting a few of its Cloud services that increased error rates between 33% and 87% for about 32 minutes, along with the steps they will take to improve the platform performance and availability.

The incident affected customers of a number of Google services relying on Google HTTP(S) Load Balancer, including Google Kubernetes Engine, Google App Engine, Google Cloud Functions, Stackdriver’s web UI, Dialogflow and the Cloud Support Portal/API. Customers started to randomly receive 502 error codes or connection resets for about 32 minutes, the time it took Google engineers to deploy a fix from the moment Google monitoring system alerted them to the increased failure rates.

Google HTTP(S) Load Balancing aims to balance HTTP and HTTPS traffic across multiple backend instances and multiple regions. One of the benefits it provides is enabling the use of a single global IP address for a Cloud app, which greatly simplifies DNS setup. To achieve maximum performance during connection setup, the service utilizes a first layer of Google Front Ends (GFE) as close as possible to clients anywhere in the world, which receive requests and relays them to a second tier of GFEs. The second tier constitutes a global network of servers that actually sends the requests to the corresponding backends, regardless of the regions they are located in.

The root cause of the incident turned out to be an undetected bug in a new feature added to improve security and performance of the second GFE tier. The bug was triggered by a configuration change in the production environment enabling that feature and causing GFEs to randomly restart, with the consequent loss of service capacity while the servers where restarting.

Luckily, the feature containing the bug had not yet been put in service, so it was relatively easy for Google engineers to deploy a fix by reverting the configuration change, which led the service to return to its normal behaviour and failure rates after a few minutes of cache warm-up.

On the prevention front, besides improving the GFE test stack and adding more safeguards to prevent disabled features from being mistakenly put in service, Google Cloud team efforts will aim to improve isolation between different shards of GFE pools to reduce the scope of failures, and create a consolidated dashboard of all configuration changes for GFE pools, to make it easier for engineers to identify problematic changes to the system.

Read the full detail in Google’s official statement.

BT