Google wrote about the challenges of scaling Google Meet as usage surged during the COVID-19 pandemic. The SRE team at Google used their existing incident management framework, with modifications, to handle the increased traffic that started earlier this year.
The SRE team got early warning signals about regional capacity for Google Meet around Feb 17, but there was no ongoing outage or impact on users yet. The goal was to prevent outages and scale according to as-yet-unknown growth demands. The team's response strategy was to use their regular incident management framework, even though this did not fit the parameters of a "traditional" incident. They set up specific roles such as an incident commander, a communications lead, and an operations lead in both North America and Europe. The team set up "workstreams" to streamline the work. Each workstream dealt with a specific aspect: Capacity, Dependencies (e.g. authentication services on which Meet depends), Bottlenecks, Control Knobs, and production rollouts with new tuning parameters. They assigned a "standby" to each person in the incident response to avoid overloading and burnout.
Samantha Schaevitz, staff site reliability engineer at Google, was one of the incident commanders. She writes that her role included "collecting status information about which tactical problems lingered, who was working on what, and on the contexts that affected our response (e.g. governments' COVID-19 responses), and then dispatched work to people who were able to help". The team's technical goal was to "keep the amount of regionally available Meet service capacity ahead of user demand".
The team was able to double Meet's serving capacity. Capacity provisioning decisions had to move away from relying on historical trends to a new model for predicting usage. The second phase focused on working towards 50x growth. A steady focus on automating processes, including new ones, helped the team reduce manual operations and roll out changes faster. Each rollout went through a canary deployment.
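To illustrate why historical trends stop being useful under compounding growth, here is a minimal sketch in Python. It is not Google's forecasting model, which the blog post does not describe in detail; the function name, the constant daily growth rate, and all numbers are assumptions chosen for illustration.

```python
import math

def days_until_exhausted(current_demand, capacity, daily_growth):
    """Days until demand, compounding at daily_growth per day, reaches capacity.

    All inputs are hypothetical; a constant growth rate is a simplifying assumption.
    """
    if current_demand >= capacity:
        return 0
    if daily_growth <= 0:
        return math.inf
    # Smallest whole n such that current_demand * (1 + daily_growth) ** n >= capacity
    return math.ceil(math.log(capacity / current_demand) / math.log(1 + daily_growth))

# Hypothetical region: demand of 100 units, growing 10% per day.
runway_now = days_until_exhausted(100, 200, 0.10)      # capacity is 2x demand
runway_doubled = days_until_exhausted(100, 400, 0.10)  # capacity doubled again
print(runway_now, runway_doubled)
```

Under these assumed numbers, doubling capacity adds only about a week of runway, which is one way to see why a one-time capacity increase was followed by a second phase aimed at much larger growth.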
An interesting observation during this exercise was that giving more resources (CPU and RAM) to each of a smaller number of processes was more efficient than spreading the same resources across a larger number of processes. Each process carries unavoidable overhead (monitoring, health checks, initialization), so running fewer processes reduces the total overhead. By the time the team closed this "incident", Meet had close to 100 million daily participants.
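The overhead argument can be made concrete with a little arithmetic. The sketch below is only an illustration: the function name, the host size, and the per-process overhead are made-up values, not figures from Google's post, and it assumes the overhead per process is fixed regardless of process size.

```python
def usable_capacity(total_cpu, num_processes, overhead_per_process):
    """CPU left for serving traffic after each process pays its fixed overhead."""
    return total_cpu - num_processes * overhead_per_process

# Hypothetical host: 64 cores in total, 0.5 core of fixed overhead per process.
many_small = usable_capacity(64, 32, 0.5)  # 32 processes with 2 cores each
few_large = usable_capacity(64, 8, 0.5)    # 8 processes with 8 cores each
print(many_small, few_large)
```

With these assumed values, the same 64 cores yield 48 cores of serving capacity when split across 32 processes and 60 cores when split across 8, because the fixed cost is paid fewer times.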
Other conferencing platforms have seen similar growth in the same period. Cisco's Webex, for example, saw three times its normal volume globally, with higher increases in specific regions. It had "more than 500 million meeting participants and logged more than 25 billion meeting minutes in April". Cisco focused more on analytics and security features as part of handling the increased load. Similarly, Zoom saw a 50% increase in daily meeting participants in April. Zoom has 17 data centers globally, and also uses AWS, Oracle Cloud and Azure. The company added "5000 - 6000 servers" on AWS every night to handle the increased demand.
Microsoft Teams, which had 200 million daily meeting participants in April, runs entirely on Azure. Similar to what happened at Google, the team at Microsoft Teams realized "that our previous forecasting models were quickly becoming obsolete" with the surge in traffic. According to their blog post, they used predictive modelling techniques and sped up resourcing decisions without going overboard. Other steps included deploying critical services in more regions, code optimizations, better network traffic routing and more aggressive data compression. They also made a few changes to their internal incident management processes to avoid burnout.