Facebook Live started as a hackathon project two years ago and was launched to users eight months later. One of the main challenges has been the number of viewers of a single stream, which is unpredictable and can vary extensively, Sachin Kulkarni noted in his presentation at the recent QCon London conference, where he described the architecture and design challenges of building Facebook Live.
From a high-level view of the infrastructure, a live stream starts when a client connects to the closest PoP (Point of Presence), which forwards the connection to a full data centre where the stream is encoded. From the data centre, the stream is then distributed to other PoPs and on to playback clients. Kulkarni, director of video infrastructure at Facebook, describes a PoP as responsible for terminating incoming client connections, caching, and passing the connections on to a data centre over Facebook’s own network, which is more reliable and reduces the round-trip time.
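This flow can be sketched as two paths through a PoP: an ingest path that forwards a broadcaster's segments to a data centre for encoding, and a playback path that serves encoded segments from an edge cache, going back to the data centre only on a miss. The Python sketch below only illustrates that high-level shape; the class and method names are hypothetical and do not reflect Facebook's internal APIs.

```python
# Minimal sketch of the flow described above: broadcaster -> PoP -> data centre
# (encoding) -> edge PoPs -> viewers. All names are hypothetical illustrations.
from dataclasses import dataclass, field


@dataclass
class DataCentre:
    """Encodes incoming segments and serves them to edge PoPs."""
    encoded: dict = field(default_factory=dict)  # segment_id -> encoded payload

    def ingest(self, segment_id: str, raw: bytes) -> None:
        # Stand-in for transcoding into the delivery formats and bitrates.
        self.encoded[segment_id] = b"encoded:" + raw

    def fetch(self, segment_id: str) -> bytes:
        return self.encoded[segment_id]


@dataclass
class PoP:
    """Terminates client connections, caches segments, and talks to a data centre."""
    datacentre: DataCentre
    cache: dict = field(default_factory=dict)

    def ingest(self, segment_id: str, raw: bytes) -> None:
        # Ingest path: forward the broadcaster's segment over the internal network.
        self.datacentre.ingest(segment_id, raw)

    def serve(self, segment_id: str) -> bytes:
        # Playback path: serve from the edge cache, fetching upstream only on a miss.
        if segment_id not in self.cache:
            self.cache[segment_id] = self.datacentre.fetch(segment_id)
        return self.cache[segment_id]


dc = DataCentre()
ingest_pop, edge_pop = PoP(dc), PoP(dc)
ingest_pop.ingest("stream42/seg001", b"raw video segment")
print(edge_pop.serve("stream42/seg001"))  # first viewer at this PoP: fetched from the data centre
print(edge_pop.serve("stream42/seg001"))  # later viewers: served from the edge cache
```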
Looking at scaling challenges, Kulkarni notes that both the number of concurrent unique streams and the total number of viewers across all streams are quite predictable and therefore pose a limited challenge. The real problem is the number of viewers of a single stream, which can range from a handful to an enormous audience. Because it is unpredictable, it cannot be planned for; caching and stream distribution are the two techniques they use to solve this problem.
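One general way edge caching absorbs a sudden viewer spike on a single stream is to coalesce concurrent cache misses for the same segment into a single upstream request, so the data centre sees one fetch no matter how many viewers arrive at the same PoP at once. The sketch below is a generic illustration of that idea, not Facebook's actual mechanism, which the talk did not detail.

```python
# Generic sketch of request coalescing at an edge cache: many concurrent viewers
# requesting the same uncached segment trigger only one upstream fetch.
import threading
from concurrent.futures import Future


class CoalescingCache:
    def __init__(self, fetch_upstream):
        self._fetch_upstream = fetch_upstream   # callable: key -> value
        self._lock = threading.Lock()
        self._inflight: dict[str, Future] = {}  # key -> fetch in progress
        self._cache: dict[str, bytes] = {}

    def get(self, key: str) -> bytes:
        with self._lock:
            if key in self._cache:
                return self._cache[key]
            future = self._inflight.get(key)
            leader = future is None
            if leader:
                # First requester becomes the leader and fetches upstream.
                future = self._inflight[key] = Future()
        if leader:
            value = self._fetch_upstream(key)
            with self._lock:
                self._cache[key] = value
                del self._inflight[key]
            future.set_result(value)
            return value
        # Followers block until the leader's fetch completes.
        return future.result()


fetches = []
cache = CoalescingCache(lambda key: fetches.append(key) or b"segment-bytes")
threads = [threading.Thread(target=cache.get, args=("stream42/seg001",)) for _ in range(100)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(len(fetches))  # 1 upstream fetch despite 100 concurrent viewers
```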
Comparing live streaming with normal videos, Kulkarni notes some other challenges:
- Content is created in real time, so a cache cannot be populated ahead of time; no form of precaching is possible.
- Planning for live events and scaling resources ahead of time is problematic.
- Predicting concurrent stream/viewer spikes caused by world events is difficult.
One major reliability challenge with live streaming is network problems. To deal with them, Kulkarni describes three mitigations:
- Adaptive bitrate, which lowers video quality to match the available bandwidth, is typically used on the playback side, but they also use it on the ingestion side (see the sketch after this list).
- Temporary connectivity loss is handled by buffering on the client.
- In worst-case scenarios, when the bandwidth is not high enough, they can switch to audio-only broadcast or playback; Kulkarni notes that hearing what people say is more important than seeing them.
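As a rough illustration of the adaptation described in the first and last points above, the sketch below picks the highest rung of a bitrate ladder that fits the measured bandwidth, and falls back to audio only when no video rendition fits. The ladder values and the headroom factor are illustrative assumptions, not Facebook's encoding settings; conceptually the same selection can run on the ingestion side (choosing the upload encoding) or the playback side (choosing the rendition to fetch).

```python
# Illustrative bitrate-ladder selection with an audio-only fallback.
# The renditions and numbers below are assumptions for this sketch only.
BITRATE_LADDER = [  # (label, required bandwidth in kbps), highest quality first
    ("720p", 2500),
    ("480p", 1200),
    ("360p", 700),
    ("240p", 400),
]


def select_rendition(measured_kbps: float, headroom: float = 0.8) -> str:
    """Return the best rendition that fits within the measured bandwidth.

    `headroom` keeps a safety margin so transient dips don't immediately stall the stream.
    """
    usable = measured_kbps * headroom
    for label, required_kbps in BITRATE_LADDER:
        if required_kbps <= usable:
            return label
    # Bandwidth too low for any video rendition: keep the audio, since hearing
    # the broadcaster matters more than seeing them.
    return "audio-only"


for kbps in (4000, 900, 150):
    print(kbps, "kbps ->", select_rendition(kbps))
```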
Lessons Kulkarni has learned from the project include:
- Large services can grow from small beginnings; it’s better to write some code than to discuss the architecture forever.
- Reliability and scalability must be built into the design, including designing for planned and unplanned outages.
- Compromises have to be made to ship large projects.
- Keep the architecture flexible enough for future features; this allows a team to move faster than the infrastructure can change.