Transcript
Dincer: In this talk, we're going to talk about low latency video streaming at Cloudflare. Video streaming is a type of broadcasting, which means that data travels from wherever it's captured in the scene out to the masses. Common ways of doing this are signals over the air, like radio, or cable in the U.S. What's interesting to me about signals traveling over the air is that the data for all possible channels is in the air at the same time, traveling to the viewer. The viewer can tune into whatever they want to view and get that information. Same with cable: all the channels are always flowing to your TV, and your TV tunes into whatever it wants to view. However, this is moving to the internet. In a way, it's great, because you can get personalized, more unique content that appeals to you. You can go back in catalogs of content and watch Keeping Up with the Kardashians, Season 5 from 2010. When you're watching live content, though, how do you actually get the content from the camera to the user? You can't just put an MP4 file on a server that the user downloads and plays, because, say it's live news, you would have to wait an hour until the program ends to deliver that MP4. By that time, the news is old news. For a long time, people made up their own ways to deliver live video, and there are plenty of protocols for it. There was Flash, which is what most of us used to watch live video on the internet, and Flash uses its own protocol. Apple had QuickTime Streaming Server, but with its custom ports, firewalls getting in the way, and a general lack of infrastructure to scale, it wasn't great.
HTTP Live Streaming Protocol (HLS)
That takes us to 14 years back almost to the day to WWDC when Apple announced HTTP Live Streaming, or HLS protocol. HLS protocol is a live streaming protocol for media: audio, video. It has two parts. There are responsibilities of the server and then responsibilities of the client. The responsibilities of the server is to get whatever data is coming in and break them into chunks few megabytes in size. Make them available as small files on disk, which the server makes it available as HTTP download with a URL. The server also puts these URLs into a list that the client can download. The client downloads the list called playlist, and picks whatever chunk to request from the server. The requested chunk is then fed in to a video player, which then buffer some of it and then displays it on the user's screen. It is very likely that you're using HLS right now. HLS also do have some benefits to what existed before for both live and recorded or VOD streaming. In 2009, Apple was announcing iPhone 3GS, and mobile clients had different internet connection speeds, like today. For example, there was Edge, there was 3G, there was Wi-Fi. One of the things that's interesting is HLS made available different quality levels at the same time. The phone would pick whatever quality level it thinks it's useful to display. If it moves from Edge to 3G, for example, it would display a higher quality stream, and same way to Wi-Fi. Another benefit of HTTP Live Streaming, the HLS protocol is the users don't have to download the entire file for playback. You can seek within a 5-hour video, say you go to the fourth hour without having to watch the first three hours of the content. You don't download anything either. HLS also makes it possible to have alternative audio or subtitles that are requested on demand. Again, you don't have to download any of these extra things, you only download them if you choose to consume them.
HLS is a very simple protocol; it's plain text. Here's an example of the first thing a client would download from the server (an illustrative version is sketched below). As you can see, there are two video URLs, and they're tagged with different bandwidth values. This one also has an audio-only stream, which is interesting because it means that if there's no good bandwidth available, say your mobile connection is quite bad, you can still follow the content as audio only. If you visit one of these URLs, you see something like the second playlist below, which lists the direct links to the media files the server is offering. This is a sliding window: your client refreshes this playlist occasionally, and as new files appear, it downloads them and adds them to its buffer, until the video has ended. HTTP is a great way to distribute media through existing infrastructure such as HTTP caches, proxy servers, and firewalls. The web has been around for a long time, HTTP has been around for a long time, and the middleboxes that facilitate all this don't have to know what you're transferring. It's just HTTP to them, and what's inside doesn't matter. HTTP also means that delivering media this way is very cheap. The server is not required to keep any state; it's just serving files. Existing solutions can be used to scale the delivery of the video to many viewers. Again, this is a big deal, because when HLS launched in 2009, the number of media distribution tools was very small compared to the tools we used to distribute websites. HLS and similar protocols like DASH are how we watch video online today, whether it's YouTube, Vimeo, TikTok, or Instagram, recorded or live, because it scales pretty well.
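The slides aren't reproduced in this transcript, so here is an illustrative pair of playlists in the same spirit. The bandwidth values, resolutions, and segment names are invented for the example. First, a multivariant (master) playlist with two video renditions and an audio-only one:

```
#EXTM3U
#EXT-X-STREAM-INF:BANDWIDTH=2500000,RESOLUTION=1280x720,CODECS="avc1.640020,mp4a.40.2"
720p/playlist.m3u8
#EXT-X-STREAM-INF:BANDWIDTH=800000,RESOLUTION=640x360,CODECS="avc1.42e01e,mp4a.40.2"
360p/playlist.m3u8
#EXT-X-STREAM-INF:BANDWIDTH=64000,CODECS="mp4a.40.2"
audio/playlist.m3u8
```

And a media playlist for one of those renditions, showing the sliding window of a live stream (there is no #EXT-X-ENDLIST tag because the stream is still going):

```
#EXTM3U
#EXT-X-VERSION:3
#EXT-X-TARGETDURATION:6
#EXT-X-MEDIA-SEQUENCE:1042
#EXTINF:6.000,
segment-1042.ts
#EXTINF:6.000,
segment-1043.ts
#EXTINF:6.000,
segment-1044.ts
```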
Latency
People in the broadcast industry would say that live video over the internet is inferior to traditional broadcast, because live TV over cable has lower latency than its internet equivalent. They would say TV sports would be less fun for audiences over the internet because they could hear their neighbor cheer before they see the play. Award shows would be less fun because the winners would already be on Twitter by the time you see them. TV news, the same way. This is due to several factors. The latency difference is quite large: an average HLS stream might have 30 to 45 seconds of delay from the camera by the time it reaches your screen, while regular cable TV in America is at about 5 seconds. This is because, over the internet, the content is not in the air or in the cable all the time; there needs to be some sort of bidirectional communication. Packets on the internet are delivered on a best-effort basis. Packet loss happens. Clients have fluctuating bandwidth available to them. They use buffers to account for these things and correct for them so that you see the content without interruptions. In addition, there are many ways to set up video delivery over HTTP, and a lot of them are very complicated. For an average setup, you would use multiple encoders and storage layers. A lot of coordination is going on, and maybe you're using multiple vendors or tools that need to be glued together. It takes quite a lot of domain knowledge from a lot of areas, and it's easy to get things working, but it's hard to optimize. It's hard to optimize when you know that different viewers of yours have different connections, and they might be in different places. It's also difficult to collect feedback from your viewers, because they could be anywhere.
New Applications
From this perspective, HLS is just competing with, or trying to replace, traditional broadcast over the internet. Instead, what's more interesting to me is all the possibilities that are uniquely enabled by the internet. Over the last few years, especially since the pandemic, there have been a lot of cool web-native things. Obviously, Twitch is not new, but Twitch offers a way to watch other people play video games, and next to the player there's a chat room where viewers can chat with each other and with the streamer. Clubhouse and Twitter Spaces are hard to describe, but they're kind of like a podcast, where people talk with each other over a video or audio conference, and the result of that conference is broadcast to many people. You can bring people from the audience in Twitter Spaces and Clubhouse onto the stage to discuss something and then send them back into the audience, and this all happens remotely. HQ Trivia is an interesting one too, where a lot of people were watching a live stream in a very synchronized fashion. They were watching the same content, and they would play a game show and earn prizes. With all these applications, there's something unique that didn't exist before: participants can react to what's going on in the content and, in return, contribute back into the scene and change what's going on, as participants.
New Infra Needed
To power these use cases, I think new kinds of infrastructure are needed. You want to interact with the broadcast or become the broadcast, so you need low latency between the broadcaster and the viewer in order not to miss the context as you jump between the two roles of being in the audience and being part of the broadcast. This new infrastructure has to power multi-tenant systems: you need to support a lot of broadcasts at once. Your broadcasters and viewers could be in many different places, because it's not an antenna broadcasting; it's over the internet, so your audience could be global. You also want it to scale to a lot of viewers, because it's the internet and things can go viral. Maybe you're very popular, and a system needs to be built to solve this problem. A solution that's very similar to this, in my opinion, is telephony. It's interactive, and it's low latency. There are a few differences, but most of the time things are similar. With phones you get low latency, and a network can carry many phone conversations at once. People on the phone can be in many different locations. What's not common is a phone setup where you're on the phone with many people at once. To solve the problem, Clubhouse, Twitter Spaces, and things like this figured out a combined approach. The architecture looks like this: you have a centralized server somewhere in the world, and you would probably want to place that server close to where most of the participants are. The participants connect directly into the videoconferencing system, which then feeds into a compositor and an encoder, and the output goes out via regular HLS delivery, which is common. If you want to jump between the audience and the participants, you connect to the videoconferencing system, drop your HLS client, and continue that way. However, it's a different stream, and there can be latency gaps; there can be quite a difference. It's just a more complicated system.
There's a bunch of other use cases that require all audiences to have very low latency. Apart from the combined approach I just showed, with the HLS output, sometimes you don't want HLS at all. It's also very difficult to scale a videoconferencing system to many participants. Here are some use cases, and some of them are interesting. One thing that's interesting to me is in-stadium interactivity, where people are viewing the content with their own eyes at basically zero latency. At the same time, they can get different angles on the stadium, different angles on the game, on their phones, and this requires very low latency because you're competing with zero latency. What would an architecture to distribute very low latency media data look like? To begin with, there is nothing preventing you from reading from the network and writing to the network; this is done in many different places. Just like CDNs store and forward files over HTTP by taking bits off the disk and putting them on the network, it's possible to take bits from the network and put them onto a different network port. Unlike HTTP CDN delivery, though, this kind of delivery is stateful and requires a server to maintain awareness of client state. That's a challenge, but it's a challenge we've solved before. For example, when we handle TCP at layer 4, we already keep track of client state, and there is no reason why we can't do this further up the stack. A minimal sketch of that port-to-port idea follows below.
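Here is that sketch, not Cloudflare's actual service: a tiny Go TCP relay that accepts a broadcaster on one port and a viewer on another, then copies bytes from one to the other. The port numbers are arbitrary.

```go
package main

import (
	"io"
	"log"
	"net"
)

func main() {
	ingest, err := net.Listen("tcp", ":9001") // a broadcaster pushes bytes here
	if err != nil {
		log.Fatal(err)
	}
	egress, err := net.Listen("tcp", ":9002") // a viewer pulls bytes from here
	if err != nil {
		log.Fatal(err)
	}

	src, err := ingest.Accept() // wait for one broadcaster
	if err != nil {
		log.Fatal(err)
	}
	dst, err := egress.Accept() // wait for one viewer
	if err != nil {
		log.Fatal(err)
	}

	// Instead of taking bits off a disk, take them off one socket and put
	// them on another. The only state is the pair of open connections.
	if _, err := io.Copy(dst, src); err != nil {
		log.Println("relay ended:", err)
	}
}
```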
Cloudflare Network
Cloudflare is in the business of running many servers around the world, very close to users. Historically, these servers could do many things. You can put them in front of your website, and they act as an HTTP proxy. They can give you HTTP caching to distribute content. They can detect and block security threats. Famously, they mitigate DDoS attacks. We also run the Edge Compute platform, where you can run JavaScript at all of these locations and do computation very close to the users. An interesting thing about Cloudflare's architecture is that all the services Cloudflare offers run on every server, all at once. A machine that is serving DNS responses on 1.1.1.1 and serving the HTTP content of websites could be doing other things at the same time. It could be running a firewall or, in our case, serving video traffic. We can do this because, in this architecture, we can just open a new port, run a new service, and listen on it. Whatever the service does, it could be an HTTP application or any other protocol; we can just open a port, listen on TCP or UDP, and do whatever we want with it. To explore low latency, we came up with a simple architecture that uses this setup. We made a service that takes in media, transports it across the Cloudflare network, and plays it back to a viewer. The viewer connects to the location closest to them, the broadcaster connects to the location closest to them, and we connect these two together. As this scales, and again, there's nothing complicated in this, it's just bits flowing from one port to the other, we rely on existing protocols, which I'll talk about later, for the communication between the broadcaster and the server, and between the viewer and the server. When a new viewer joins, here, and wants to play back, the server in Paris reaches out to New Jersey, gets the bits, and they start flowing. The only latency that matters here is the speed of light. If another viewer joins, this time in California, then something interesting happens: we can deduplicate whatever is already available in California, feed it to the server that this new viewer has landed on, and send it to that viewer. As the system scales, it gets more and more likely that a viewer's request lands on a server that is already serving another viewer. That's even easier, because we can deduplicate it in the memory of that server.
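A hedged sketch of that in-memory deduplication: one upstream read loop per stream per server, fanned out to every local viewer. The hub type and its methods are invented for illustration.

```go
package stream

import "sync"

// hub fans one incoming copy of a stream out to many viewers on the same
// server. No matter how many viewers attach, there is exactly one upstream
// subscription, which is the in-memory deduplication described above.
type hub struct {
	mu      sync.Mutex
	viewers map[chan []byte]bool
}

func newHub() *hub {
	return &hub{viewers: make(map[chan []byte]bool)}
}

// subscribe registers a local viewer and returns the channel it reads from.
func (h *hub) subscribe() chan []byte {
	ch := make(chan []byte, 64)
	h.mu.Lock()
	h.viewers[ch] = true
	h.mu.Unlock()
	return ch
}

// publish is called by the single upstream reader for every chunk of media.
func (h *hub) publish(chunk []byte) {
	h.mu.Lock()
	defer h.mu.Unlock()
	for ch := range h.viewers {
		select {
		case ch <- chunk:
		default: // a slow viewer drops chunks rather than stalling everyone
		}
	}
}
```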
The flow goes like this. The server asks itself: is the stream already on this server? Then it asks: is it in the same data center? If so, it just connects within the data center. Then it asks: is it in the same region? If it's not, it goes to the source server and gets the stream from there. What's interesting to me about this flow is that for every video stream, a unique tree structure gets formed. If there are two streams at the same time, they will have very different patterns of access. This works because the service that plays back can also ingest content. Somebody asked me how the viewer or the broadcaster gets routed to a particular server and how that gets decided. There's a blog post about Cloudflare's layer 4 load balancer, and we just use that. We load balance between machines depending on several parameters, but a lot of it is CPU. Different alternatives to this often use a small set of centralized servers to take in the content and a larger number of servers to distribute it. That makes sense, because you're more likely to have a large number of viewers than a large number of broadcasters. In our setup, you can have a large number of broadcasters with zero viewers, and it will still be very cheap to operate the system. Other systems, with a more rigid structure that doesn't put playback and ingest on the same server, form predefined tiers of servers. For example, a more local server can be a lower tier and a more regional server a higher tier that distributes the content. That might be fine in a lot of cases, but if you have a requirement to keep the data in a specific country, or if you want to minimize latency, our setup is a good choice, because the content can stay super local; it can be kept in just one data center. A sketch of that lookup cascade follows below.
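Here is that cascade sketched in Go; every function name is hypothetical (the real service coordinates this through Durable Objects and internal APIs the talk doesn't show).

```go
package stream

import "errors"

// Source is wherever the next copy of the stream's bytes can be read from.
type Source interface {
	Read(p []byte) (int, error)
}

var errNotFound = errors.New("stream not found")

// Stubs standing in for real lookups; only the order of the checks matters.
var (
	localStreams       = map[string]Source{}
	lookupInDataCenter = func(id string) (Source, error) { return nil, errNotFound }
	lookupInRegion     = func(id string) (Source, error) { return nil, errNotFound }
	connectToOrigin    = func(id string) (Source, error) { return nil, errNotFound }
)

// findSource walks outward: this server's memory, then the same data center,
// then the same region, then the origin server that ingested the stream.
// Because every server can both ingest and play back, this is what makes each
// stream grow its own distribution tree.
func findSource(streamID string) (Source, error) {
	if s, ok := localStreams[streamID]; ok {
		return s, nil
	}
	if s, err := lookupInDataCenter(streamID); err == nil {
		return s, nil
	}
	if s, err := lookupInRegion(streamID); err == nil {
		return s, nil
	}
	return connectToOrigin(streamID)
}
```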
When building this system, we used a lot of Cloudflare technologies. Everything here is either open source or a Cloudflare product we sell. For example, to track which server has a particular stream, we use Durable Objects. It's a transactional datastore that you can make transactions against to keep track of things, and it moves around the world depending on what interacts with it, to minimize latency to itself. Another interesting thing we do is graceful upgrades. Again, there's a blog post about this. If you're running a service and you want to update it, in a multi-tenant environment that's quite hard to do, because you don't know how long a live stream could last. If you want to drain connections from it, you might have to wait weeks or more until a live stream's connection ends. Cloudflare's solution is that the service listening on a port execs into the new version of itself. The socket descriptors are passed into the new binary, and the state is passed along too, so after the exec the whole service is still running with the same state. This takes only a few milliseconds, so the video traffic is not disrupted.
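Here is a much-simplified sketch of the listener handoff: the old process dups its listening socket and starts the new binary with it, so the socket never closes. Cloudflare's actual mechanism, described in their blog post, execs in place and also carries over session state; neither is shown here, and the port and environment variable are invented.

```go
package main

import (
	"log"
	"net"
	"os"
	"os/exec"
)

func main() {
	if os.Getenv("INHERITED_LISTENER") == "1" {
		// New version: rebuild the listener from fd 3, passed by the old one.
		ln, err := net.FileListener(os.NewFile(3, "inherited"))
		if err != nil {
			log.Fatal(err)
		}
		log.Println("new binary serving on the inherited socket")
		serve(ln)
		return
	}

	// Old version: listen normally and serve.
	ln, err := net.Listen("tcp", ":9000")
	if err != nil {
		log.Fatal(err)
	}
	go serve(ln)

	// ...later, when an upgrade is requested:
	f, err := ln.(*net.TCPListener).File() // dup the socket's file descriptor
	if err != nil {
		log.Fatal(err)
	}
	cmd := exec.Command(os.Args[0]) // the freshly deployed binary at the same path
	cmd.Env = append(os.Environ(), "INHERITED_LISTENER=1")
	cmd.ExtraFiles = []*os.File{f} // becomes fd 3 in the child
	if err := cmd.Start(); err != nil {
		log.Fatal(err)
	}
	log.Println("handed the listener to process", cmd.Process.Pid)
	// The old process would now drain its remaining work and exit.
}

func serve(ln net.Listener) {
	for {
		conn, err := ln.Accept()
		if err != nil {
			return
		}
		conn.Close() // a real service would hand this to the media pipeline
	}
}
```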
Video Protocols
Video protocols are a hot topic. There are a lot of them, there's a lot of discussion around them, and there are a lot of tradeoffs: latency, quality, compatibility. HLS obviously prioritizes quality and compatibility, but there are other options. We support RTMP, SRT, and WebRTC. Inside the service, the activity is quite simple, and we could support any number of protocols, because right after a protocol enters the service, either as UDP packets or a TCP session, we take what's being sent inside and convert it to a common format that we can transport or store before delivery, and then we convert it to whatever format is required, on demand, as we're sending it out. For example, here you can see SRT, RTMP, and WHIP coming into the reader, and right after they're read, they're handled in the common format. We also have a segmenter listening to the reader's common format, and this is how all of Cloudflare's regular HLS delivery works: it can also convert SRT, RTMP, and WebRTC into HLS and DASH and offer a more compatible output. These can all run at the same time without adding much cost to the system, because a writer only runs, along with the rest of the pipeline behind it, when somebody comes to listen on that particular protocol. Without a writer, the system does nothing but read the inputs and loop. A sketch of this reader/writer split follows below.
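Here is that sketch; the type names are invented for illustration and are not Cloudflare's internal API.

```go
package stream

// Frame is the common in-memory format: every ingest protocol is converted
// into it right after being read, and every output protocol is produced from
// it right before being written.
type Frame struct {
	Kind      string // "audio" or "video"
	Timestamp int64  // presentation time, e.g. in milliseconds
	Keyframe  bool
	Payload   []byte
}

// Reader turns one ingest protocol (RTMP, SRT, WebRTC, ...) into Frames.
type Reader interface {
	ReadFrame() (Frame, error)
}

// Writer turns Frames back into some delivery protocol (SRT, WebRTC, or an
// HLS/DASH segmenter). A writer only runs while someone is actually watching.
type Writer interface {
	WriteFrame(Frame) error
}

// pump moves frames from a reader to whichever writers are attached. With no
// writers attached it still drains the input, so an unwatched broadcast stays
// cheap.
func pump(r Reader, writers []Writer) error {
	for {
		f, err := r.ReadFrame()
		if err != nil {
			return err
		}
		for _, w := range writers {
			_ = w.WriteFrame(f) // a real system would handle slow writers
		}
	}
}
```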
I also want to talk about WebRTC a little. WebRTC is very interesting because there's a strong effort in the IETF to standardize broadcasting over WebRTC, and there's a big misconception that WebRTC trades away quality in favor of low latency. That's not the case; WebRTC is very flexible. WebRTC is also the only one of these protocols that supports multiple quality levels at the same time, like HLS does. With WebRTC, the quality level selection happens on the server side, as opposed to the client side. By combining all these protocols in one place, and using a common format from right after the media enters until right before it leaves, we're able to maintain compatibility across formats, add new ones, and change things around very easily.
Conclusion
I think this is a cool idea, but I think it's cooler when something becomes reality, because it means that somebody has worked on solving all the problems between the idea and reality. A ton of colleagues have worked on this, and a ton of colleagues I don't interact with often have worked on components I mentioned, like our layer 4 load balancer and the Cloudflare architecture in general. You can use this today. Here are working examples of two SRT URLs that you can try pushing content into and pulling content from.