Transcript
Adepoju: I'm Ade. I'm a senior software engineer at Netflix. I'll be talking to you about how you can, in some cases, make better use of your hardware by using certain specific tools, and by learning to recognize when certain commonly used things become obstacles. This talk isn't really to tell you what to do, but more what you can do if you have certain use cases.
Case Study: Trading Stocks from Tweets
Let's take an example: you have a nice idea that we can trade in the stock market based on tweets, which is possible. It's just going to be hard to do that at scale, because for a lot of tweets, we don't know what they mean. Say we have a startup, and that's our product. We need investors, and we tell them we're constantly scouring the web, getting tweets, making trades. We need to move fast because we're a startup. We're also very poor, but we're smart. That's going to be a central theme of this talk: we're very smart, so we can do dangerous things. Basically, this just means you have a machine that takes tweets, takes a bunch of data, and then gets you money out of it by trading.
Simple Sample Logic
Let's take an example. We have Warren Buffett, a very famous guy, very popular in the financial industry. He says some negative stuff about the airline industry. He says they will tank. That's not good, so let's give it three thumbs down. Then the Fed Chairman, who also has a lot of power, says, "No, we're going to bail out the industry." Those are two conflicting things. The Fed Chairman can actually print money, so we'll give him five thumbs up. Then Ronaldo, a soccer player, says something about airlines in his tweet. We don't really care; it's neutral. Then the AP, a reputable source, says something somewhat negative. They're not really a source of financial advice, they just report what they know. Give them a small minus one. We sum that up based on this architecture: -3 + 5 + 0 - 1 leaves us with a net positive of one. That means we should still buy. This is just one way we could possibly do it.
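As a rough sketch of that weighting logic, here is a tiny C program; the sources and weights are just the made-up ones from this example, not anything the product would actually ship.

```c
#include <stdio.h>

/* Hypothetical, pre-weighted sentiment scores for the example above. */
struct scored_tweet {
    const char *source;
    int score;           /* thumbs up (+) or thumbs down (-) */
};

int main(void) {
    struct scored_tweet tweets[] = {
        { "Warren Buffett", -3 },  /* airlines will tank           */
        { "Fed Chairman",   +5 },  /* we're going to bail them out */
        { "Ronaldo",         0 },  /* neutral, we don't care       */
        { "AP",             -1 },  /* somewhat negative reporting  */
    };

    int net = 0;
    for (int i = 0; i < (int)(sizeof(tweets) / sizeof(tweets[0])); i++)
        net += tweets[i].score;

    /* Net is +1 here: still positive overall, so we'd still buy. */
    printf("net sentiment: %+d -> %s\n", net, net > 0 ? "buy" : "hold/sell");
    return 0;
}
```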
Basic Implementation: Looks Like Streaming + MapReduce + ML
This has probably given you MapReduce vibes. We have a bunch of workers computing stuff, we take that data and put it together: MapReduce. Then streaming, because a lot of data is constantly flowing; tweets are being generated all the time, and we need to make this happen fast, compute on the fly. We also need ML, because how do we know what a tweet means? How do we really know without some inference engine running in there, a natural language processor?
Analysis Engine Implementation
You're probably thinking, is this another Kafka talk? A talk that's going to tell you how you should use Kafka and streaming. Not really. This is more a talk to let you see what you could learn from those tools. They do a lot of things really well that we could borrow, and we're going to talk about those things. While we're going to build this analysis engine that takes the tweets and does magic with them, we have to understand it's not that easy to tell what a tweet means. We need to somehow understand the sentiment. An easy place to put that code is in user space. User space is nice. User space can also be slow, and it may not always give us what we want, because we have to go through the kernel. The kernel is very general purpose. It's an abstraction. It's also not very good at networking. We're trying to do crazy streaming, moving data around. We don't want to wait for the kernel to switch out and do its packet processing, which isn't always in our best interest. That's going to be a theme of this talk: realizing that what we're used to may not always be the best for our specific use case.
Abstractions Are Great
This abstraction is not new. Abstractions are everywhere. We have them in languages. We have them in a whole bunch of things. In fact, a lot of the languages people enjoy using, that all the cool kids use nowadays, are not necessarily the most optimal for interacting with hardware, or they don't translate to hardware directly, because humans don't think like machines in most cases. Abstractions are everywhere. It didn't start with millennials; they did it with UNIX, when they ported UNIX from assembly to C. There was a huge reduction in performance. It was ok, because the gain from being able to scale and grow the UNIX ecosystem faster became just immense once they moved to C. Staying in assembly would have been absurd.
Performance Issues - Abstractions in Linux
This is everywhere in the Linux kernel: many layers, tradeoffs, and things you have to deal with just to talk to some other machine. We just use it because it works. It's there. We don't have to think about it. It's not always good, though. A good example is the C10K problem, where people running the standard Linux kernel were hitting an upper limit on the number of concurrent connections they could have. It was about 10,000. They were just like, "It is what it is. It's just the machines. The machines are not fast enough."
Everyone said, "We can't do any better." It turns out it was just the kernel that wasn't handling multiplexing well. It wasn't very good at multiplexing streams of connections, because it wasn't really made for that. It was made to be general purpose. People started trying to do amazing things and then complained. All they had to do was tweak the logic, and they beat the C10K problem. This problem was around for a while.
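The talk doesn't name the fix, but the well-known answer on Linux was event-driven multiplexing with epoll: one thread watching tens of thousands of sockets instead of one thread or one select() scan per connection. Here is a minimal sketch; error handling is stripped and port 8080 is arbitrary.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>

#define MAX_EVENTS 1024

int main(void) {
    int listen_fd = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in addr = { .sin_family = AF_INET,
                                .sin_port = htons(8080),
                                .sin_addr.s_addr = INADDR_ANY };
    bind(listen_fd, (struct sockaddr *)&addr, sizeof(addr));
    listen(listen_fd, SOMAXCONN);

    /* One epoll instance watches every connection at once. */
    int epfd = epoll_create1(0);
    struct epoll_event ev = { .events = EPOLLIN, .data.fd = listen_fd };
    epoll_ctl(epfd, EPOLL_CTL_ADD, listen_fd, &ev);

    struct epoll_event events[MAX_EVENTS];
    for (;;) {
        int n = epoll_wait(epfd, events, MAX_EVENTS, -1);
        for (int i = 0; i < n; i++) {
            if (events[i].data.fd == listen_fd) {
                /* New connection: add it to the same epoll set. */
                int conn = accept(listen_fd, NULL, NULL);
                struct epoll_event cev = { .events = EPOLLIN, .data.fd = conn };
                epoll_ctl(epfd, EPOLL_CTL_ADD, conn, &cev);
            } else {
                char buf[4096];
                ssize_t len = read(events[i].data.fd, buf, sizeof(buf));
                if (len <= 0)
                    close(events[i].data.fd);   /* peer went away */
                /* ...otherwise handle the data... */
            }
        }
    }
}
```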
Should We Write Everything In C?
Based on that, you're probably saying, wait, so we should just write everything in C. Is that what I'm trying to say? Not at all. Because if you write everything in C, I might as well point out that C is still an abstraction over assembly, and assembly is still an abstraction over machine language. Why don't we just go feed all the transistors ourselves? It doesn't scale. It's all about balance. We need to know when to do one over the other, when to strip back the abstractions, or when to say the abstraction is fine. It's not an easy thing to answer. Usually, people don't find out they need to strip an abstraction until it's too late. That's usually what happens.
Kafka Optimizes Around Hardware Properties - Zero Copy and Sequential IO
Kafka does strip away some abstractions. How does it do it? Why? The designers of Kafka thought about this. They said, "We're trying to make something that's fast, that works really well. Can we take advantage of how hardware behaves to make things work faster?" One thing they looked at is sequential writes to disk. Look at the log. Everyone talks about the log. The reason the log works is that appends are fast on disk. Disks are good at sequential writes, not the random access pattern RAM is made for. It's in the name, Random Access Memory. Persistent storage doesn't work the same way. They looked at that and decided, can we build our data around it? It's faster.
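As a tiny illustration of the append-only idea, here is a sketch in C; the file name and record format are made up for this example.

```c
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Every record goes at the end of the file, so the disk only ever sees
 * sequential writes, never seeks back into the middle. */
int log_append(int log_fd, const char *record, size_t len) {
    /* O_APPEND on the fd guarantees each write lands at the current end. */
    return write(log_fd, record, len) == (ssize_t)len ? 0 : -1;
}

int main(void) {
    int fd = open("tweets.log", O_WRONLY | O_CREAT | O_APPEND, 0644);
    const char *rec = "buffett,airlines,-3\n";   /* hypothetical record */
    log_append(fd, rec, strlen(rec));
    close(fd);
    return 0;
}
```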
Then they looked at zero copy. A very common thing everyone should do. Zero copy just says, "If I'm trying to move data from point A to B, and I don't have to go through C, don't go through C." It's usually not easy to tell that you're going the long route until you look at what you're doing. In most cases, you copy from kernel space to user space, and then back to kernel space to another device. That's usually wasteful. Why not just keep everything in kernel space? That's the logic of zero copy. It's in a lot of places. It's something that Kafka uses. It's really useful in our case, too. With zero copy, we're bypassing the trip to user space.
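On Linux, the usual way to get this is sendfile(), which moves file data to a socket entirely inside the kernel instead of read()-ing it into a user buffer and write()-ing it back out; this is the mechanism Kafka leans on to serve log segments. A minimal sketch, with error handling mostly omitted:

```c
#include <fcntl.h>
#include <sys/sendfile.h>
#include <sys/stat.h>
#include <unistd.h>

/* Push a file out over an already-connected socket without the data ever
 * entering user space. */
ssize_t send_log_segment(int sock_fd, const char *path) {
    int file_fd = open(path, O_RDONLY);
    if (file_fd < 0)
        return -1;

    struct stat st;
    fstat(file_fd, &st);

    off_t offset = 0;
    ssize_t sent = sendfile(sock_fd, file_fd, &offset, st.st_size);
    close(file_fd);
    return sent;
}
```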
Can We Just Bypass Things Too? Going Lower = More Control
Can we bypass other things? Yes, we can. It turns out that if we just bypass certain places, we get a lot of freebies. One very good tool for bypassing kernel space, especially for networking, is called DPDK, the Data Plane Development Kit. Its job is to basically say, "The kernel is doing certain things for me; I can do them better, give me control." Simple. It lets you shrink the path your data and packets have to go through into a much shorter, compressed one, and you handle things yourself. DPDK is useful. There are a ton of tools like it out there right now, because people are starting to see that they're being limited by what the kernel can do.
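For a feel of what "give me control" looks like, here is a heavily trimmed sketch of a DPDK receive loop. It assumes a single NIC already bound to a DPDK-compatible driver as port 0; error handling and most configuration are omitted, so treat it as an outline rather than a working application.

```c
#include <rte_eal.h>
#include <rte_ethdev.h>
#include <rte_mbuf.h>

#define RX_RING_SIZE 1024
#define TX_RING_SIZE 1024
#define BURST_SIZE     32

int main(int argc, char **argv) {
    /* Hand the NIC over to DPDK's userspace poll-mode driver; the kernel
     * network stack never sees these packets. */
    if (rte_eal_init(argc, argv) < 0)
        return -1;

    struct rte_mempool *pool = rte_pktmbuf_pool_create(
        "mbuf_pool", 8191, 256, 0, RTE_MBUF_DEFAULT_BUF_SIZE, rte_socket_id());

    uint16_t port = 0;                      /* assume one NIC bound to DPDK */
    struct rte_eth_conf port_conf = {0};
    rte_eth_dev_configure(port, 1, 1, &port_conf);     /* 1 RX, 1 TX queue */
    rte_eth_rx_queue_setup(port, 0, RX_RING_SIZE,
                           rte_eth_dev_socket_id(port), NULL, pool);
    rte_eth_tx_queue_setup(port, 0, TX_RING_SIZE,
                           rte_eth_dev_socket_id(port), NULL);
    rte_eth_dev_start(port);

    struct rte_mbuf *bufs[BURST_SIZE];
    for (;;) {
        /* Poll the NIC directly: no interrupts, no syscalls, no copies
         * into the kernel's socket buffers. */
        uint16_t nb = rte_eth_rx_burst(port, 0, bufs, BURST_SIZE);
        for (uint16_t i = 0; i < nb; i++) {
            /* ...parse the packet and feed the payload to the analysis
             * engine here... */
            rte_pktmbuf_free(bufs[i]);
        }
    }
}
```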
Direct Memory Access - Copying With and Without CPU
Tweets are fast now. How do we do even better? This is not the end. Direct memory access is a very good tool. What it says is, the CPU is an expensive piece of hardware. The CPU is made to do a lot of smart things. We don't want to waste time having the CPU copy data. It's like hiring a world-class chef, but then your world-class chef also has to spend time doing the dishes while they're at work. It's not really efficient. Why not dedicate someone to doing the dishes? That's what DMA is. It prevents the CPU from being blocked copying data, and just says, "CPU, tell me what to do and I'll copy it." In this case, we see that the CPU is not blocked. It simply offloads the DMA request or DMA transfer. The DMA engine does the copy and then raises an interrupt to say, "I'm done." That way, we didn't waste time having the CPU copy data around.
Remote Direct Memory Access - Sushi Boats
Then these people decided to take it further with RDMA, which basically says, let's take this to the extreme. Do it over a network. Let's have CPUs be completely hands-off, and just have controllers talk to each other's memory spaces. That's just bonkers. A good example of that would be something like sushi boats, which say, "I have a world-class chef here, but I also want to reduce how much time it takes to get food to my customers. I don't want that old menu and everything." Just put the food around the chef. The food is constantly circling the chef, and people just pick what they want. You don't need waiters or people to serve the food. The food is right there. That's how RDMA works. "What you want is right in my memory space. Take it, do what you want." That's much faster. Obviously, there are some concerns with security, but that's up to you to handle. RDMA basically lets you shorten the whole path from something like this, where you have to go all the way down through the first guy's stack to talk to the other guy, and then bubble back up through the other guy's stack. It's not efficient, especially when you're designing everything yourself and you know exactly what you want to do.
RDMA Data Flow
With RDMA, it's much smaller. The network devices just talk to one another, and they do it without the CPU's involvement. That's just immense, because your CPU can actually do compute, run our machine learning, try to understand what the tweet means, and not worry about shuffling data around. We've gone from the traditional path with multiple steps down to the hardware, to two steps using DPDK, and then to going straight: the application talks directly to the network card via RDMA, and it can reach into the other guy's memory space. That's just immense. We've seen that we can shorten things down. These things are not said often, because people don't usually peel back the layers to understand what they could take advantage of.
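To make the "reach into the other guy's memory" idea concrete, here is a minimal sketch of a one-sided RDMA write with libibverbs. It assumes the queue pair is already created and connected, and that the peer has shared the address and rkey of its registered buffer out of band; completion polling and cleanup are left out.

```c
#include <infiniband/verbs.h>
#include <stdint.h>
#include <stddef.h>

/* Write a local buffer straight into the peer's registered memory.
 * Neither CPU touches the payload; the two NICs do the copy. */
int rdma_write_tweet(struct ibv_pd *pd, struct ibv_qp *qp,
                     void *local_buf, size_t len,
                     uint64_t remote_addr, uint32_t remote_rkey) {
    /* Register our local buffer so the NIC can DMA out of it. */
    struct ibv_mr *mr = ibv_reg_mr(pd, local_buf, len, IBV_ACCESS_LOCAL_WRITE);
    if (!mr)
        return -1;

    struct ibv_sge sge = {
        .addr   = (uintptr_t)local_buf,
        .length = (uint32_t)len,
        .lkey   = mr->lkey,
    };

    /* One work request: RDMA_WRITE into the peer's address space. */
    struct ibv_send_wr wr = {0}, *bad_wr = NULL;
    wr.opcode              = IBV_WR_RDMA_WRITE;
    wr.sg_list             = &sge;
    wr.num_sge             = 1;
    wr.send_flags          = IBV_SEND_SIGNALED;
    wr.wr.rdma.remote_addr = remote_addr;
    wr.wr.rdma.rkey        = remote_rkey;

    /* Caller would poll the completion queue and deregister mr afterwards. */
    return ibv_post_send(qp, &wr, &bad_wr);
}
```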
We Now Have a Separation of Concerns - Control, Compute, and Data Planes
Almost by default, we've ended up with a separation of concerns, because what we're actually doing here is saying, we know we want to move data fast, but we want to take matters into our own hands. That is separation of concerns. It's said a lot in microservices. It's said a lot in other contexts, but it's used everywhere, really, to say, this thing requires special handling, so let's handle it specially. That's what we have now. We have separation of concerns with our data plane, and it's ok, because the rest of the stuff, control and compute, still runs down the normal path. Data needs to be fast, so we handle it separately. We'll see how we can scale that even further.
How Do We Compute Trades From Tweets?
You're probably thinking, "We're going deep into the rabbit hole. We're talking about networking and all this stuff. What about the inference? How do we know what a tweet means? We still need to make money." It's all about machine learning models. Everyone knows this. If you want to do NLP, the model has to be good, which in this case also means it has to be fast, because if you spend two minutes, by the time we make a decision on a tweet, the market has already moved. We need to make sure our models are really good. Now that we've done all the work on networking, we may bottleneck, because the networking is fast but the models are slow. We have this really good path to send data through, but we're not generating that much data anyway. That's wasteful. What can we do? We've already said we like bypassing things, so why don't we bypass here too? We know we can bypass the kernel for networking. Can we try bypassing things on the compute side to move data faster? It turns out, yes, we can. Bypassing the kernel works here too, because in user space your compute keeps getting preempted. Things kick you out when you're trying to work, and you have to wait for the scheduler. There are tricks we can play. Then, when you want to talk to your peripherals, you have to make system calls; maybe your machine learning is running on your GPU or something. If you're in kernel space, you have way more control. You can limit the translations and all this stuff. You have to know what you're doing, but it's possible to hang out there and get a lot of speed-up. That's a place to be if you want to get things done, too.
Problems - Going Fast Causes Timing Issues
It all sounds too good to be true. Let's pause and ask, why doesn't everyone do this? Obviously, if you just put things in the kernel, or copy data straight from here to there, that should be better. There are some issues you hit, and one of them is timing. We're getting really fast. We're talking nanoseconds, because when we're moving through memory that quickly, we're operating at memory speeds, and the network can handle it. We're moving so fast that nanoseconds start to count. Every single tick counts. We can't just rely on some clock to synchronize. By the time this packet gets here, it may be a nanosecond off, which would normally not be a problem. In our case, everything's a problem now, because we're going so fast.
There are two solutions that are very popular: one is PTP, the other is NTP. PTP is the Precision Time Protocol. It's nanosecond precision. Very fast. It synchronizes to GPS. It's amazing. It's just more expensive. NTP, the Network Time Protocol, is more common. It's more of a microsecond-level precision. It's not bad, but it's just not as precise, so you can choose which one you want. I like PTP, just because I like having all the speed and all the precision. These are tools that are out there. Some people don't know about them, but they can really save you a ton in having to synchronize all your clocks.
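As a small illustration, here is a sketch that reads both the NTP-disciplined system clock and the NIC's PTP hardware clock. It assumes the NIC exposes a PTP clock at /dev/ptp0; the FD_TO_CLOCKID encoding is the dynamic-clock convention used by the linuxptp tools, reproduced here since it isn't in the standard headers.

```c
#include <fcntl.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

/* Wrap a /dev/ptpN file descriptor into a dynamic clock id so that
 * clock_gettime() reads the NIC's own hardware clock. */
#define FD_TO_CLOCKID(fd) ((~(clockid_t)(fd) << 3) | 3)

int main(void) {
    struct timespec sys, phc;

    /* System clock: typically NTP-disciplined, microsecond-ish accuracy. */
    clock_gettime(CLOCK_REALTIME, &sys);

    /* PTP hardware clock on the NIC (assumes /dev/ptp0 exists). */
    int fd = open("/dev/ptp0", O_RDONLY);
    if (fd >= 0 && clock_gettime(FD_TO_CLOCKID(fd), &phc) == 0)
        printf("phc: %ld.%09ld  sys: %ld.%09ld\n",
               (long)phc.tv_sec, phc.tv_nsec,
               (long)sys.tv_sec, sys.tv_nsec);

    if (fd >= 0)
        close(fd);
    return 0;
}
```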
Problems - Adding Code to the Kernel Is Risky
Timing is cool, but what if we're messing with the kernel? That's never fun, because now we've tainted it. Any small thing that goes wrong can crash the whole system, because we're messing with the kernel. We need to tread carefully and figure out how to do things in kernel space while still being able to work like we're in user space. In the kernel, we don't really get all the fancy stuff of apt-get install or restarting the service. The kernel is a service that everything else depends on. It gets tricky.
How do we do something like updating the kernel? Remember, we're still learning as we go. We're adding more models. Our model may be partially in the kernel, and if we want to make a change to it, it's going to be tricky, because you don't always have the luxury of restarting the kernel. It turns out you can do that with something called kexec. Kexec is like exec for a process in Linux, where you basically say, here's a new thing I want to run, just exec it for me. It just overlays everything and runs. The thing about kexec is you're dealing with a kernel, but you're never turning off the system. It's like you're rebooting, but you're skipping the whole bootup sequence, the hardware and BIOS and all that stuff. You're just restarting the software parts. That can be very beneficial, because you don't have to initialize your hardware. You don't have to waste time with all that stuff, because you've already done it. You're making an update to the software, not your hardware. With kexec, you skip all the yellow parts and jump straight to the purple, which is much faster. A good analogy is having brain surgery while you're awake. You're not really functional while it's happening, but you're not fully under either, so it's much quicker to get you up and going. You don't have to go under general anesthesia and then come all the way back up. All the risks of that are taken away, because they've already checked all that stuff.
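In practice, most people drive this with the kexec-tools CLI (kexec -l to stage, kexec -e to jump). For a rough idea of what it looks like at the syscall level, here is a sketch that assumes an architecture with the kexec_file_load syscall, root privileges, and a hypothetical new image at /boot/vmlinuz-new; the command line and initrd path are made up.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <string.h>
#include <sys/reboot.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(void) {
    const char *cmdline = "root=/dev/sda1 ro quiet";    /* assumed cmdline */

    int kernel_fd = open("/boot/vmlinuz-new", O_RDONLY);
    int initrd_fd = open("/boot/initrd-new.img", O_RDONLY);
    if (kernel_fd < 0 || initrd_fd < 0)
        return 1;

    /* Stage the new kernel image; the running kernel stays up. */
    if (syscall(SYS_kexec_file_load, kernel_fd, initrd_fd,
                strlen(cmdline) + 1, cmdline, 0UL) != 0)
        return 1;

    /* Jump into it: skips firmware, BIOS/UEFI, and hardware re-init. */
    sync();
    return reboot(RB_KEXEC);
}
```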
Planned Kernel Updates
With that, how do we do a planned kernel update? It's actually not that bad. We just get the new kernel from whoever is supplying it, from the main server. It says, "Computer one, update with this new kernel," and we just kexec the new kernel. That's it. The new kernel runs like it's a new process. Our old kernel is off, gone. We're good to go. It's not much trickier than that. In other cases, what if the kernel crashes because we're doing a lot of work? We're changing things. We're messing around. The kernel can crash for multiple reasons: bad programming, anything. Remember, we're mostly writing in C here, so you're going to have a lot of problems there. With a crashed kernel, it's not that bad either. We can just take some stable kernel that we already know about and make it a backup kernel. That kernel can do mitigation or whatever. If we crash, we fall back to it, and maybe have it try to figure out what went wrong, but the system is still working. We have a fallback kernel that keeps the job going while we get a new version and fix what went wrong in the old one. It's very simple. You just kexec another kernel that you know is stable when you fail, and you're good to go. It's similar to how kdump works. Kdump basically does that: it falls back to a simple kernel that saves a memory dump and then tries to understand what went wrong.
Throughput and Algorithms Are the Real Challenges
You're probably wondering now, what happened to all the trading? Nothing has happened there yet. It turns out that trading is not the hard part. People have actually done what this talk is about. The core challenge is doing it fast and at scale. The trading is easy; there are tons of tools. When you want to do it for a bunch of tweets, with a bunch of things changing, it gets really tricky. I implore you to go look at some solutions if you're curious. People have done it based on the President's tweets, because his tweets are dynamic, so they're a very easy source for understanding whether his tweets affect the market. Those are singular points where people have done it. The truth is, when you try to do it for a bunch of people, and you want to do it fast, you run into other problems. Go look at NPR's solution, and another guy's solution. His name is right there; you can find out what they did. It works. It's really fun to play with, but when you try to do it fast, you run into all kinds of problems.
DPDK + Memcached
With this, let's look at the performance differences. Looking at the numbers, one very good place to start is DPDK, which is what I talked about with respect to taking matters into your own hands, bypassing the kernel and running your own data plane yourself. As it turns out, in this example, they did it with memcached, a distributed cache, and they found more than a 2x speed-up. It scales well with more cores, because the kernel may not necessarily optimize for multi-core performance. You can say, "I know I'm going to run on a lot of cores, so I'll optimize for XYZ." These results are also very dependent on your solution, because you're taking matters into your own hands. Obviously, DPDK gives you way more performance.
Kexec vs. Normal Boot
Next is kexec. This one is very hardware dependent. If you have a very simple machine, you'll see some speed-up, but not that much. If you have a more complicated machine like a server, you skip a bunch of hardware checks (all the fans, PSUs, all those things don't have to be checked), so you get way more speed-up, because you're just restarting the software piece. The person who ran it on a high-end server saw about a 6x speed-up over restarting the whole system. With kexec you get that.
RDMA vs. TCP/IP
RDMA is also where it gets interesting. RDMA says, "I can reach into another memory space and just grab what I want." The thing about this comparison is that it's hard to pin down, because TCP/IP is non-deterministic: on a lossless channel, TCP/IP does much better, so it's really hard to gauge the actual speed-up. RDMA will always beat TCP/IP, though. In this case, it was more than a 2x speed-up. It all depends on how you run it. In any case, RDMA is faster, hands down. It's just hard to put exact numbers on it, because they depend on a lot of factors.
Hardware Offloading - When You Really Want More
Another thing: since we've talked about how this works, and we can bypass a bunch of things, why don't we bypass the CPU altogether, too? We just say, "The CPU is cool, but it's not really made for some of the stuff we're making it do. The CPU wasn't really made to do ML, so why don't we try GPUs? They're really good at parallel workloads." FPGAs too; we can make them really specific and approach ASIC speed. With the same mindset, we can offload work from our CPU. The CPU just handles the control plane, and the GPU or FPGA handles the compute plane. They're very good at that. Then we have all the bypassing stuff with RDMA and DPDK. We've separated these streams so each does what it's really good at. That's a tangible thing to look at.
Key Takeaway
The key takeaway here is: don't be wasteful. Save the planet, save the polar bears. The 80/20 rule applies here. It turns out that 80% of servers are usually running at less than 20% hardware utilization, which is usually hard to glean, because when you run top, you mostly just see the CPU running stuff, while a lot more of the machine is probably just idling. With hardware offloading and kernel bypassing, you can probably take more advantage of these resources. We should see that when we choose to scale horizontally every time, we're probably wasting a lot of capacity we could have gained if we just chose to look more at our code and scale vertically in that space. Because when we scale horizontally, we run into other problems, like network partitions and all these things.
Conclusion
Usually, it pays to take a bit more time and challenge abstractions sometimes. It's ok to peel back layers of abstraction and say, "This thing is made to do this. It seems to be doing it well, but I know I want something more specific, or more high performance. I shouldn't feel I'm limited to what this tool provides me. Maybe I should try to see what's going on and how I can improve it."
That's how these things work all the time. Be willing to go in there and dig deeper, and bypass things like we did with kexec and DPDK. There are tools out there that make you realize the kernel is great, Node.js is great, all these things are great, but they're not necessarily always the best for you, because they're made to be general purpose tools. It's ok to peel back abstraction. With this, it's definitely going to be much easier to scale our tweet infrastructure, because we actually now have more control. It also shows that there's often a lot more room we can get out of our machines.