





Performance: Adventures in Thread-per-Core Async with Redpanda and Seastar


Summary

John Spray describes an experience of building high performance systems with C++20 in an asynchronous runtime, and explores the challenges & tradeoffs in adopting a thread-per-core architecture.

Bio

John Spray works in the Core Engineering group at Redpanda, building the high-throughput heart of the Redpanda streaming platform. His background is in high scale systems, especially distributed storage systems such as Ceph and Lustre. He enjoys writing async code in Rust and C++.

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.

Transcript

Spray: How can we write software that gets the most out of modern hardware, while keeping our code robust and maintainable? Moving from traditional threaded designs to thread-per-core async designs unlocks the performance of modern hardware, but brings new challenges for robustness and developer productivity. I'm John from Redpanda. In this talk, I'll share some lessons from our adoption of C++20 and the Seastar framework. Hardware is advancing quickly. This chart shows the growth of CPUs over the past 50 years in various metrics: the number of transistors, the number of cores, and so on. There are two takeaways from this. One is that Moore's Law isn't dead. You can see that transistor graph shooting straight up and to the right. The second is that the number of cores has started increasing a lot recently. Over the last 10 years, we've gone from perhaps 10 cores being the norm to over 100 cores being mainstream on typical x86 systems. It's not just CPUs: as we moved from spinning disks to SSDs, and from SATA and SAS to NVMe, disks have become transformatively faster, to the point where it's common for a disk to be faster than the network device of the machine it's in. It's much more of a challenge now for software to keep these very high throughput devices satisfied with the number of I/Os that they can handle. Networking has got faster too, less dramatically than storage and CPU. Nevertheless, 100-gigabit networking is no longer an unusual speed to see in a modern data center.

The difference isn't just quantitative, it's qualitative. The improvements in CPUs have brought some compromises. Even modern desktop CPUs now have some amount of non-uniform memory access to contend with. The high-end Ryzen chips use a chiplet design, and when two cores are talking to each other, it matters which pair of cores it is. That's something that software developers increasingly have to bear in mind. It's also about the ratios. We used to be in a situation, perhaps 10 or 15 years ago, where storage was slow. You didn't have to worry about whether your code was fast enough to satisfy a drive; the drive was going to be the bottleneck. That's flipped around. Now that it's common to have a drive that can do multiple gigabytes per second, it's much more likely that the software becomes the bottleneck. We as software developers become the bottleneck in getting the most out of these systems. That demands some changes in how we do things. Most applications, most of the time, don't have to think about this because they run on shared machines, where you get perhaps 2 or 4 cores. You don't have to think about how to use the full power of one of these servers from one application. This talk is about what happens when you do need to do that, when you do need to take 64 or 128 cores and use them as efficiently as you can to deliver a high throughput application.

What Makes Good High Throughput Software?

What does good look like for high throughput software? Clearly, there's a number in megabytes per second that we would like to reach. There's an expectation from users that we saturate their network bandwidth. Nobody wants to buy a 25-gigabit NIC and then have software that only runs at 10 gigabits. We also have to conform to the topology of the system. We can't just expect the system to be completely uniform when we know that modern systems aren't. That throughput has to come with reliably low latency. If I achieve microsecond latency, but only on 50% of operations, that's no good to an application architect. They need to be able to know that we'll do it 99 point some number of nines percent of the time. That is also a key motivating factor for the architecture I'm going to talk about.

Redpanda in 60 seconds

This talk isn't about Redpanda itself, but to understand our motivation, and to use it as an example, I'll give you the capsule version of what Redpanda is. It's a streaming engine. It implements a Kafka compatible API that allows clients to produce and to consume messages. Those messages go on to partitions, which are persistent logs, where each of those logs is a raft group. There are typically tens of thousands of those raft groups in a cluster. The cluster usually has between 3 and 20 nodes. We're not talking about huge clusters, but that's largely because our efficiency on a single node enables doing more with fewer servers. In that way, Redpanda is both a scale-out system and a scale-up system, where we want to be able to make the most out of highly capable nodes.

What Is Thread-Per-Core?

In order to achieve those goals, we can adopt a thread-per-core architecture. This is where we commit to managing a whole core's worth of work in userspace. Rather than running several threads that will share a core and do so dynamically based on decisions made by the Linux scheduler, we say we're going to provide one thread, and everything that runs on a particular core is going to run in that thread. This aims for better data locality, more predictable scheduling, and in general, more control in userspace of what runs where, when. This type of architecture is taken to a different extent in different domains. The simplest thread-per-core architecture is something stateless, like an HTTP proxy, where there's no communication between cores and there's no state on the cores. Doing thread-per-core is pretty much just a case of setting your worker count equal to the number of cores and then calling into the kernel to pin your threads to particular cores.
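
As a rough illustration of that simplest case, here is a minimal sketch in plain C++ with pthreads (not Seastar, and not Redpanda code): spawn one worker per core and pin each thread to its core. The worker body is just a placeholder.

    #include <pthread.h>
    #include <sched.h>
    #include <thread>
    #include <vector>

    int main() {
        unsigned n = std::thread::hardware_concurrency();
        std::vector<std::thread> workers;
        for (unsigned core = 0; core < n; ++core) {
            workers.emplace_back([core] {
                // Pin this worker to its core so the kernel never migrates it.
                cpu_set_t set;
                CPU_ZERO(&set);
                CPU_SET(core, &set);
                pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
                // ... run this core's share of the work here ...
            });
        }
        for (auto& t : workers) t.join();
    }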

Then there are intermediate approaches. You might be familiar with this thing if you've worked on low latency trading systems, where you might have an ingress core that's responsible for decoding incoming data off the network. One or more cores doing the actual trade matching. Then some other core that's responsible for sending outputs onward. That can provide very good latency, but it's a bespoke solution. Your application is written in a way that maps to your ideas about which core should do what. The most general case for a thread-per-core architecture is the shared-nothing case, where we say, all state is shard local. Anytime we need to coordinate across shards or cores, we must use inter-core messaging. We're not just going to reach out and touch each other's data structures. Various pieces of housekeeping and system management will also be core-local, for example, an allocator.

Seastar

That last approach is what the Seastar framework provides. Seastar was created by the folks who wrote ScyllaDB, which is a high-performance database. It's now used not just by Scylla but also by Redpanda and the Ceph storage project, amongst others. We built Redpanda on Seastar because we love it. We think it's a great framework. What it provides is a shared-nothing architecture. In Seastar, each core is referred to as a shard. A shard is a CPU core, plus a pool of memory which is reserved for that core and allocated NUMA-local to it. It's also a collection of message queues, single-producer, single-consumer message queues that form an all-to-all mesh between the cores. It also contains I/O queues and task queues, which are local to a core, so that core has its own prioritization going on when it comes to, for example, latency critical disk writes versus non-latency critical disk writes.

If that all sounds a bit complicated, it perhaps is, but that's not the programming model that you have to deal with day-to-day. They're all abstractions on top of this message-passing, shared-nothing architecture. Most notably, the call that I'm mentioning here, submit_to, where you pass a lambda in. That lambda gets sent down one of these message queues and executed on a remote core. You don't have to write message-passing code. You can write regular C++ lambdas, and then just tell Seastar which core you would like them to run on. To share out our state across cores, there is an abstraction called sharded, or more completely a sharded_service. That enables us to say, "Here's a class. I want an instance of this class on each core." Then I would like some helper methods for when I want to call into the class, saying which core I want to call it on.
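
As a sketch of what that looks like, assuming Seastar's smp::submit_to, which takes a target shard id and a lambda; the function name and the value it returns are illustrative, not Redpanda code.

    #include <seastar/core/smp.hh>
    #include <seastar/core/future.hh>

    // Run a lambda on another shard; the result comes back to the calling
    // shard as a future. A minimal sketch.
    seastar::future<int> count_on_shard(unsigned shard) {
        return seastar::smp::submit_to(shard, [] {
            // This body executes on the target shard and should only touch
            // that shard's local state.
            return 42; // placeholder for some shard-local computation
        });
    }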

Adapting a Workload to a Sharded_Service

When we want to adapt a workload to act as a sharded_service, the mindset is to imagine that each shard is like a little server within a server. Imagine that the message queues between shards are like your network. This is similar to scaling out any service you might design, except we're scaling out across shared-nothing shards, rather than scaling out purely across physically separate servers. You define some mapping over the state that you care about to shards; in our case that's partitions, where each partition maps to a particular shard. You have some dispatch layer, where an incoming TCP connection might land on any core. Once we handle an RPC that requires data on a particular shard, we have to dispatch the operation to the right shard where the data lives. We're sending the work to the data, not the data to the work. When it comes to anything that needs to be visible globally, or housekeeping state, you sometimes have to make copies. In the same way that in a distributed application you'd often have a copy of your configuration on each node in the system, in Redpanda we do that at the core layer. We have, for example, a copy of our system configuration on each CPU core. You never have to reach over to another core to get at that. The upshot of doing that engineering work to define something as a sharded_service is that you get the speed, performance, and low latency that we're aiming for.

Sharding Example

To make this concrete, here's an example of what sharding looks like. This is a very simple C++ class, it's called the KvStore, it just has a get method and a put method. It has a hash function that we're going to use to map those keys to shards. This is just using regular C++ data structures. The data structure you use under the hood doesn't have to be specialized to Seastar. Here it's an std::map. We make a lot of use of the abseil container library. You don't have to rewrite your containers for Seastar. What you do usually have to do is build some dispatch layer on top of it. This is a sketch of MyKvServer, where I have a reference to a Seastar sharded object of MyKvStore. That sharded object handles instantiating it on each shard in the system. When I dispatch a put operation, I'm first calculating the hash of the key to decide which CPU to run it on, taking that hash modulo the number of CPU cores in the system. Then I'm using this helper method called invoke_on_shard: I'm building a lambda that knows everything it needs to know, and moving the request into the lambda. Then that lambda under the hood is getting sent down a queue to the relevant shard where it will get executed. That's how simple it looks to take something like a KvStore and make it into a Seastar sharded_service.
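
Since the slide itself isn't reproduced in the transcript, here is a hedged sketch of what that code might look like using Seastar's stock sharded<> API (invoke_on); the class and method names are illustrative, not the actual slide or Redpanda code.

    #include <seastar/core/sharded.hh>
    #include <seastar/core/smp.hh>
    #include <seastar/core/future.hh>
    #include <map>
    #include <string>

    // A plain C++ key-value store: nothing Seastar-specific inside.
    class my_kv_store {
        std::map<std::string, std::string> _data;
    public:
        void put(std::string k, std::string v) { _data[std::move(k)] = std::move(v); }
        std::string get(const std::string& k) { return _data.at(k); }
        seastar::future<> stop() { return seastar::make_ready_future<>(); }
    };

    // The dispatch layer: route each key to its owning shard.
    class my_kv_server {
        seastar::sharded<my_kv_store>& _store;
    public:
        explicit my_kv_server(seastar::sharded<my_kv_store>& s) : _store(s) {}

        seastar::future<> put(std::string k, std::string v) {
            // Hash the key and take it modulo the shard count to pick the owner.
            auto shard = std::hash<std::string>{}(k) % seastar::smp::count;
            // Ship the work to the data: run the lambda on the owning shard.
            return _store.invoke_on(shard,
                [k = std::move(k), v = std::move(v)](my_kv_store& local) mutable {
                    local.put(std::move(k), std::move(v));
                });
        }
    };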

Concurrency

Now we want to talk about the tradeoffs that come with doing that. If we imagine that KvStore example doing some real I/O, like most applications, it'll look something like this. We'll receive some bytes for a request. We'll dispatch that request. It'll go to sleep until its underlying I/O, perhaps a write to disk, completes. Then in the background, I start receiving bytes for other requests. Eventually the I/O is complete, and I send some responses. That's all fine, but what happens in Seastar, and to some extent in other asynchronous frameworks, is that because we're not running some fixed-size pool of threads, we have lots of futures in flight. There is no implicit concurrency limit. Where you run a thread pool, you might say, ok, my thread pool is 100-wide, I'm going to handle at most 100 requests in parallel. An asynchronous runtime like Seastar is not going to provide that implicit limit. You're going to have an issue that your infinite concurrency collides with your finite computer.

The example I'm using here is if you try and service 50,000 disk reads concurrently, and you're using megabyte size read-ahead buffers, that's likely to be more memory than you have per core in your system. You can end up running out of memory. In order to handle that in Seastar, explicit concurrency limits are essential. We might limit the number of requests we do in parallel. It will be some high number, but it has to be an explicit number. We might limit the number of storage log replays we do at startup. Sometimes there are more intelligent limits needed that go beyond just a semaphore. When we do data recovery, it's important we don't try and simultaneously recover every partition in the system. We have opinions about which ones we want to do first. As you deal with the power of having this unbounded concurrency, you also have to reintroduce some explicit limits to account for the fact that, left to its own devices, the system will try and execute a huge number of things in parallel.
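
A sketch of what one of those explicit limits can look like with a Seastar semaphore; the size and the handler here are illustrative assumptions, not Redpanda's actual values.

    #include <seastar/core/semaphore.hh>
    #include <seastar/core/future.hh>

    // Allow at most 64 MiB of request payload in flight on this shard.
    static thread_local seastar::semaphore memory_sem{64 << 20};

    seastar::future<> handle_request(size_t payload_bytes) {
        // with_semaphore waits until payload_bytes units are free, runs the
        // lambda, and releases the units when the returned future resolves.
        return seastar::with_semaphore(memory_sem, payload_bytes, [] {
            // ... read and process the payload here ...
            return seastar::make_ready_future<>();
        });
    }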

I've talked about Seastar being a shared-nothing architecture. I'm assuming a certain amount of familiarity with async, but I'll just briefly touch on it. Asynchronous C++ code, or asynchronous code in any language, means that when we do I/O, we don't block on a system call to do the I/O. We construct some future object that acts as a handle. We return that, and then the caller can wait for that future to complete. In Seastar, when we wait for that future to complete, we can then go and schedule other work. That's essentially how the cooperative multitasking in the system works. When something goes to do an I/O, it constructs a future and returns it to you. Then once that future is in hand, it gets pushed on to a list of things which are pending, and some other work comes and runs. That overhead of constructing a future and pushing it onto a list is much lower than the overhead of switching threads at an operating system level. This is what motivates people to use asynchronous runtimes, whether it's in C++, Python, C#, or Java. There are different names for this stuff. It's sometimes called lightweight threads. It's not exactly the same thing, but the general idea is that userspace is handling the concurrency rather than the kernel.

Examples of Concurrency Limits in Redpanda

To make this issue of concurrency limits explicit, we have, for example, a memory semaphore for incoming RPCs. When we read the first few bytes of an RPC, they say how big the request payload is. We take units from a semaphore for that number of bytes to ensure that there is a limit to how many bytes of RPC data we're going to have in flight at a moment. I already mentioned the storage layer. We can't replay every single log in parallel, because the storage buffers for that would exhaust memory. There are situations you might not first think about, like shutdown. When we flush our logs to disk, and we do a checkpoint before shutting down, we also have to manage the concurrency of that. Because we have this extremely powerful engine running beneath us, we can't just say, for each log, flush. If we do that, we will actually run out of memory, because it will go, ok, and it will run 50,000 of them in parallel. There's a helpful primitive in Seastar called max_concurrent_for_each, which is basically a parallel for loop where you batch things up and only run a certain number at a time. For example, we might do max_concurrent_for_each with 128, and make sure we only run 128 I/Os in parallel at a particular layer, rather than just throwing everything at the wall and seeing what sticks.
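
A sketch of that primitive in use; log_t and its flush() are placeholders so the snippet is self-contained, not real Redpanda types.

    #include <seastar/core/loop.hh>
    #include <seastar/core/future.hh>
    #include <vector>

    // Placeholder standing in for a storage log with an async flush.
    struct log_t {
        seastar::future<> flush() { return seastar::make_ready_future<>(); }
    };

    // Flush every log, but never have more than 128 flushes in flight at once.
    seastar::future<> flush_all(std::vector<log_t>& logs) {
        return seastar::max_concurrent_for_each(logs, 128, [](log_t& log) {
            return log.flush();
        });
    }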

Reactor Stalls

I mentioned cooperative multitasking. This is a key thing for folks writing code within Seastar to understand. If a task doesn't yield promptly (this is a configurable threshold, but for the sake of argument, let's say it's 20 milliseconds), the framework will detect that and report a log line that says reactor stall, with a backtrace to where the reactor stall happened. Clearly, something hogging a core will impact your p99 performance, because some other task that wanted to complete in a timely manner won't be able to. It's important to note they can also impact stability. If you have a 100-millisecond heartbeat interval in your raft implementation, and something stalls for 100 milliseconds, it risks prompting raft leadership elections across the system. These types of reactor stall usually come from big loops. If there's some naive loop that does "for i in all partitions in the system", and it does a little bit of work for each one, that can end up as a reactor stall. Elsewhere, things like global coordination, where we might try to balance leaderships in the system or do data balancing: anytime we implement those algorithms, we have to be very careful about scaling, because anything that goes above roughly a log(N) cost has a decent chance of hitting a reactor stall at scale.
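
One common way to keep such loops from stalling the reactor is to yield periodically inside them. A minimal sketch, assuming Seastar's coroutine maybe_yield helper; the partition type and the per-item work are placeholders.

    #include <seastar/core/coroutine.hh>
    #include <seastar/coroutine/maybe_yield.hh>
    #include <vector>

    struct partition_t {};                        // placeholder
    static void do_a_little_work(partition_t&) {} // placeholder

    // Walk a potentially huge collection without hogging the reactor.
    seastar::future<> visit_all(std::vector<partition_t>& partitions) {
        for (auto& p : partitions) {
            do_a_little_work(p);
            // Suspend only if this task has used up its time slice; otherwise
            // this is nearly free, so the loop stays cheap in the common case.
            co_await seastar::coroutine::maybe_yield();
        }
    }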

Testing is very important for this. We do empirical testing of the system as well as of course unit testing and design, where we run on dedicated nodes because the timing is important. We run at the highest scale we support and then we go look for these stalls. Sometimes we find them. Ideally, you catch them before you commit the code, but testing is super important as well. It's worth noting, this isn't really a new problem, because your existing code in a thread pool environment might also hog the CPU core. The difference is that in a traditional threaded program, the Linux kernel can preempt you and apply some limit, and say, "You've held on to that CPU core for too long, I'm going to let someone else take it up." You still have a problem in that code, but it's less acute and less visible. Whereas in Seastar, it's a very visible thing. It's reported as an error in the log. You really have to think about it whenever you're writing code.

Memory

I've talked about concurrency and how the sharding works in a system. I want to drill a little bit into memory management. A lot of what follows is about dealing with, or rather avoiding, atomics. Atomic operations on most architectures are not cheap. They are surprisingly ubiquitous, considering how expensive they are. Some programming frameworks, notably C++'s shared pointer, will use atomics for every update to a reference count on an object.

Every time you copy a smart pointer around, you're doing an atomic. Atomics are a bigger deal for the hardware than regular memory operations, because they require at least potentially some coordination between multiple CPU cores in order to satisfy the semantics of the atomic. Luckily, for us, because our memory is not shared between cores, Seastar can implement things like shared pointers without using atomics. As long as a particular smart pointer is only ever accessed from one CPU core, it doesn't need its reference count to be updated atomically. It can just be a regular old integer with regular add and subtract.

We can also avoid most traditional use of mutexes. Because any region between scheduling points in a cooperative multitasking system is essentially an atomic unit of work, there's no risk of you being interrupted. You very rarely need to do the "lock this structure, access some data, unlock this structure" type of pattern. As long as you're writing synchronous code that accesses a data structure, you can be sure that you're the only one accessing it, because it's local to your core. This means that we pretty much never use atomics. We rarely even use mutexes, and when we do, it's not an atomic mutex, it's a special Seastar mutex which doesn't use atomics. The only time we use a mutex is when we have scheduling points in the middle of something like serializing a very large structure; then we would use one of these mutexes to protect the iteration through that structure. Those are comparatively rare cases. Mostly we don't really think about locks, we just think about the scheduling of the work.
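
To make the non-atomic reference counting concrete, here is a minimal sketch using Seastar's lw_shared_ptr; the struct and values are illustrative.

    #include <seastar/core/shared_ptr.hh>
    #include <string>

    struct session_state {
        std::string user;
    };

    void example() {
        // lw_shared_ptr keeps a plain (non-atomic) reference count, which is
        // safe because the pointer never leaves its owning shard.
        auto s = seastar::make_lw_shared<session_state>(session_state{"alice"});
        auto s2 = s;       // refcount bumped with an ordinary integer increment
        s2->user = "bob";  // both handles refer to the same shard-local object
    }  // refcount drops to zero; the object is freed by this shard's allocator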

In order to enable that locality of memory to a particular shard, we need to have something other than the regular Linux allocator or the regular standard library allocator. At startup, each shard is allocated an equal share of memory. That means that your workload has to be reasonably spread out between shards. If you're running on a system with 64 cores and 256 gigs of RAM, then the most any one core will get is 4 gigabytes. You can't have any entity in your system that requires more memory than that. Your system has to spread things out. That works for us because our workload is already partitioned into individual logs. Within that pool of memory, Seastar provides a buddy allocator. This is a relatively simple type of memory allocator that more or less splits the memory in half and then keeps splitting extents in half until it finds a small enough part for the allocation you've asked for. It also provides a separate allocator optimized for small allocations on top of that, so when you allocate something like 8 bytes, you don't have to traverse the whole tree of the buddy allocator.

The upshot is that memory allocation becomes very fast, because we're using this simple algorithm that doesn't do background housekeeping and is always local to one shard. There's no synchronization involved. We're never having to go to other shards and ask them, can I have some memory, or release some memory? We're never doing compaction. We're never remapping pages in the background. If we run with a command line flag for locking our memory at startup, we're also not getting page misses. We use huge pages, the Linux feature that lets you use pages of something like 2 megabytes instead of 4 kilobytes. We're not worried about page misses. We're also not using the page cache. We're using direct I/O to disk. That means we have to do our own caching. It also means that the memory allocator has to be able to give us signals under memory pressure to release cached data. In a traditional application, you would have lots of stuff cached for you by the kernel, and the kernel would take care of freeing parts of that cache under memory pressure. In Seastar, it's on you. That's part of the downside to this approach: you have to think much more about memory. If you're going to cache something, you have to think about how you're going to limit how much memory your cache uses, and potentially how you're going to make your cache respond to these upcalls to release memory on demand.

The most noticeable downside of the approach is that the allocator can say no. Typical allocators in Linux will always say yes. Even if there's not really any memory available on the system, they can always allocate you a new virtual page. You might not notice until you try to write something into that page that there's no physical page available to back it, or that, to make a physical page available to back it, Linux has to push some stuff into swap. That results in applications that have a soft failure mode. If you try to use more memory than you have in your system, your program might not crash right away, but it will progressively get slower. Anyone who does any SRE work knows that page swapping is a classic signal for a situation where a server is probably about to be in pretty bad trouble. You will look at things like the rate of swap writes and swap reads to figure out whether the workload is too big for the server, or the software is configured to use too much memory. In Seastar, it's not a soft failure, it's a hard failure. You will actually get an std::bad_alloc exception when you try to allocate memory if there is no more memory available, or if the memory is fragmented to the point that you can't get a contiguous range big enough to satisfy your request. Anyone who's done embedded work or kernel programming will be familiar with this type of situation, where you don't have the luxury of virtual memory management to move pages around to give you the impression of contiguous memory. You have to actually have contiguous memory in order to satisfy the allocation request.

We have these bad_allocs; what should we do with them? In some cases, you can handle them. If you're entering an RPC handler, and you perhaps don't have enough memory for the buffers for the RPC, you can quite easily reason about that and say, I can't handle this RPC, and ideally send a response in a way that doesn't require allocating. The general case is much harder. Can you even log an error message without allocating? When you go up your stack from throwing an exception, it's possible that some destructors result in allocation, for example, if removing something from a container causes the container to decide to reallocate its memory. It's very hard as a userspace C++ program to reason about whether a particular path through the code will allocate or not.

The robust position to take is to terminate on bad_alloc, and in recent Redpanda releases, that's what we do. In order to avoid that actually happening outside of our testing, we have to avoid large contiguous allocations. The I/O buffers that we use for data that comes to and from a network socket are fragmented. Typically, a fragment size is something like 128 kilobytes. Anywhere in the system that we have something like a vector of partitions, we have a custom class called fragmented_vector, which avoids allocating multi-megabyte data structures, because on a system that has been running for a while, those are the allocations that might struggle to get that amount of contiguous memory. We also, more generally, just have to avoid doing unbounded-size allocations. Every time we're designing a data structure, we have to think, what is the cap on this? What size can this not grow beyond? When we instantiate many of them for parallel operations, we have to constrain that parallelism, as I was mentioning earlier for concurrency.
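
To make the fragmentation idea concrete, here is a deliberately simplified, hypothetical sketch of a chunked container that grows in fixed-size fragments rather than one large contiguous buffer. It is not Redpanda's fragmented_vector, just an illustration of the principle.

    #include <array>
    #include <memory>
    #include <vector>

    // Hypothetical illustration: grow in fixed 128 KiB chunks so no single
    // allocation ever needs a huge contiguous range from the allocator.
    template <typename T, size_t ChunkBytes = 128 * 1024>
    class chunked_vector {
        static constexpr size_t elems_per_chunk = ChunkBytes / sizeof(T);
        std::vector<std::unique_ptr<std::array<T, elems_per_chunk>>> _chunks;
        size_t _size = 0;
    public:
        void push_back(T v) {
            if (_size % elems_per_chunk == 0) {
                _chunks.push_back(std::make_unique<std::array<T, elems_per_chunk>>());
            }
            (*_chunks[_size / elems_per_chunk])[_size % elems_per_chunk] = std::move(v);
            ++_size;
        }
        T& operator[](size_t i) {
            return (*_chunks[i / elems_per_chunk])[i % elems_per_chunk];
        }
        size_t size() const { return _size; }
    };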

That might sound quite onerous, and in some ways it is. It's also an enforced discipline that comes earlier in the development cycle. Rather than writing code which has some pathological memory consumption issue and only noticing it after you've deployed it, this forces you to notice it much earlier, at the point you're testing it. That's a good thing for a piece of software that you're shipping to customers. There is always a tradeoff between finding issues in development versus finding issues in production. It's quite common for applications written in Go to have memory management issues, perhaps even running out of memory, that you don't notice until you deploy them and see a certain workload run. Sometimes that's the right tradeoff: you want to develop code faster, and tolerate finding issues in production. This is the other way around. By having this rigid approach to memory management, we are forced to think hard about memory management at the point we're writing the code, not after we've deployed it.

Async C++ with Coroutines

I've talked about concurrency. I've talked about memory. Now I'd like to talk a little bit about the practicality of writing asynchronous C++ code, because I've mentioned async as an aside, but I haven't really dug into what that means for your code. Asynchronous C++ code can be written in a few different ways. The baseline way of doing it, which I'll describe an alternative to in a moment, is known as continuation style, where we have methods that return future objects. Those future objects have a .then method that you can pass another future-returning function into, and so on. This is an example of a piece of code from an older version of the Redpanda tree; it is the shutdown method for a raft object. The object has various sub-objects that all have their own stop method. The point of the function is just to call stop on all the members. What that looks like is calling event_manager.stop, fine, then .then with a lambda capturing this, returning the next member's stop, and so on. It's this long chain of .thens, which is ugly. It's also relatively error prone, because it relies on capturing things into lambdas in a way that you have to get just right. It's inconvenient day-to-day, because these types of functions tend to get re-indented quite dramatically when you change them, if you're using clang-format. It happens to make code review hard, which is not perhaps the biggest consideration, but it's worth thinking about because we do care about developer productivity.
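
Since the slide isn't reproduced in the transcript, here is a hedged sketch of what that continuation-style stop() chain roughly looks like; the member names and the stoppable placeholder are illustrative, not the actual raft class.

    #include <seastar/core/future.hh>

    // Placeholder sub-object with an async stop(), standing in for the real members.
    struct stoppable {
        seastar::future<> stop() { return seastar::make_ready_future<>(); }
    };

    struct consensus_like {
        stoppable _event_manager, _batcher, _op_lock;

        // Continuation style: every step is wrapped in a lambda passed to .then().
        seastar::future<> stop() {
            return _event_manager.stop().then([this] {
                return _batcher.stop();
            }).then([this] {
                return _op_lock.stop();
            });
        }
    };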

C++20 gives us an alternative called coroutines. Coroutines consist of a series of new keywords in the language: co_await, co_yield, co_return. Co_await just means, at this point in the function, I'm going to take a future and, before proceeding with the rest of the function, wait for it to complete and get its result. The way that's implemented is that the compiler generates code that takes all of the state on the stack from before that co_await, and turns it into an object allocated on the heap. Then, when the code resumes, under the hood it's not really looking at stack variables anymore, it's looking at the stuff that was suspended in an object on the heap. It's magic. You don't really have to know how that stuff works most of the time. It lets you write very idiomatic code. Here's a very simple example of a coroutine that sleeps for 100 milliseconds, and then returns a string. There's no .then inside, so there's no funky wrapping of things in lambdas so that we can pass them into .then. It's a huge step up in ergonomics. For the example I started with, after switching to coroutines, which we did in our code, it's just a lot more readable. Rather than a bunch of curly braces, we're just co_awaiting one function after another. It's broadly doing the same thing as the previous function was doing in terms of generated code. Here are the two side by side so that you can see the difference. This is a fairly modest example. I had to find something that would fit on a slide. There are other functions that were very complicated before coroutines, especially when there are multiple levels of nesting, that become just a lot simpler when you move to coroutines.
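
A hedged sketch of the kind of slide code being described: a trivial coroutine that sleeps and then returns a string, plus the previous stop() chain rewritten with co_await (reusing the stoppable placeholder from the earlier sketch).

    #include <seastar/core/coroutine.hh>
    #include <seastar/core/sleep.hh>
    #include <chrono>
    #include <string>

    // Suspend for 100ms, then return a value, with no .then() in sight.
    seastar::future<std::string> delayed_hello() {
        co_await seastar::sleep(std::chrono::milliseconds(100));
        co_return std::string("hello");
    }

    // The stop() chain from the previous sketch, rewritten as a coroutine.
    seastar::future<> stop_all(stoppable& event_manager, stoppable& batcher, stoppable& op_lock) {
        co_await event_manager.stop();
        co_await batcher.stop();
        co_await op_lock.stop();
    }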

That's great. We use them everywhere. Maybe? The challenge with coroutines, and futures in general in C++, is reasoning about lifetimes. Here's a little case study. It's from a real pull request. The same week I was writing this presentation, I also wrote a bug, and I use that bug as my example. This function mark_clean is constructing a builder object, let's say an object that will generate some serialized data that eventually gets written to disk, and adding a message to it. Then it returns builder.replicate, which is a function that does I/O and therefore returns a future. This mark_clean function also returns a future, so I can just return that future. That looks ok. It compiles, but it crashes. The reason it crashes is that the builder object was allocated on the stack, and it falls out of scope at the end of the function. We're returning a future which refers to the builder object, but the builder object itself isn't being kept alive. This is a very easy mistake to make when you're writing asynchronous C++ code. To understand why that future object would reference the builder object, there's an imaginary sketch on the bottom of this slide of what replicate might be doing. It might be running some I/O method, then in the completion for that I/O it has captured this and might update some member variable. Of course, that capture of this is just a raw pointer. It doesn't keep the object alive. It's the caller's responsibility to make sure that the object stays alive until the future completes.
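
Here is a hedged sketch of that buggy pattern; builder_t and its methods are stand-ins so the snippet is self-contained, not the real Redpanda code.

    #include <seastar/core/future.hh>
    #include <seastar/core/sleep.hh>
    #include <chrono>

    // Stand-in for the builder: replicate() pretends to do async I/O and
    // captures `this` in its continuation, like the sketch on the slide.
    struct builder_t {
        void add_message(const char*) {}
        seastar::future<> replicate() {
            return seastar::sleep(std::chrono::milliseconds(1)).then([this] {
                add_message("done"); // `this` may already be destroyed by now
            });
        }
    };

    // Buggy: `builder` lives on this function's stack, but the returned
    // future still refers to it after the function has returned.
    seastar::future<> mark_clean_buggy() {
        builder_t builder;
        builder.add_message("clean");
        return builder.replicate(); // future outlives `builder`: use-after-free
    }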

Fortunately, coroutines are actually very helpful here. If I change this code to do co_return co_await on the replicate function, rather than just returning the future, it fixes the bug, because in the coroutine world, everything that was on the stack is kept alive until that co_await finishes. This is really a step change in how easy it is to reason about code that works with futures in C++. You're no longer having to individually reason about the lifetime of each case. You can more or less just apply a simple mental model that says, as long as I'm in a coroutine, my stuff on the stack is safe to refer to from futures until I return from my function. That's an example of coroutines making life easier.
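
Continuing the sketch above, the fixed version under the same assumed builder_t:

    #include <seastar/core/coroutine.hh>

    // Fixed: co_await keeps the coroutine frame, and therefore `builder`,
    // alive until replicate()'s future resolves.
    seastar::future<> mark_clean_fixed() {
        builder_t builder;
        builder.add_message("clean");
        co_return co_await builder.replicate();
    }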

Here's an example of why coroutines make life harder. Here's my delayed_print function again, my trivial example. It just waits 100 milliseconds and then prints a message. Here's a function in the bottom half of the slide that calls that delayed_print method with "hello world". This is buggy code. It compiles. It runs. What you'll see when you run it is delayed_print, and then nothing. Your hello world doesn't make it. The reason it doesn't make it is that we're taking a reference into this coroutine, and it's a reference to a temporary. From C++'s point of view, temporaries only need to last as long as the function call. It's going to construct this hello world string. It's going to call delayed_print. Then when delayed_print returns, it's going to destroy the hello world string. But when delayed_print returns, it's returning a future; it hasn't actually finished. Even though this is a coroutine, it's only within the coroutine that the nice rules about stack variables staying alive apply. Outside of a coroutine, you still have to be mindful of this.
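
A hedged sketch of that dangling-temporary bug; the printing is done with std::cout just to keep the snippet self-contained.

    #include <seastar/core/coroutine.hh>
    #include <seastar/core/sleep.hh>
    #include <chrono>
    #include <iostream>
    #include <string>

    // Buggy: the coroutine takes its argument by reference.
    seastar::future<> delayed_print(const std::string& msg) {
        co_await seastar::sleep(std::chrono::milliseconds(100));
        std::cout << msg << "\n"; // msg may be a dead temporary by now
    }

    seastar::future<> caller() {
        // The temporary string only lives until delayed_print *returns its
        // future*, not until the coroutine finishes, so msg ends up dangling.
        return delayed_print("hello world");
    }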

If you want to fix this, it's pretty simple. You just pass in a value instead of a reference. The takeaway is that passing references into coroutines is very error prone in C++, to the point where we even implemented our own clang linter that checks our code for cases where lambdas take references, or capture references, into a coroutine. We fail our CI if there's code that does that, because it's just so easy to write a bug when you do this. The good news is that pass by value is not expensive here, because modern C++ has move semantics; you don't have to copy things into a function. As long as you're doing moves, just passing everything by value is usually not terribly onerous. If it does become problematic, then maybe there are special cases where you reason more carefully about something. You can of course also wrap things in a smart pointer; using a unique_ptr makes it easier to reason about compared with just using a raw reference. We worked with coroutines for a while before we quite realized how unsafe some of the things we were doing were. Over time, we've moved to a model where we've more or less just forbidden passing references into coroutines.
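
And the pass-by-value fix, continuing the same sketch (same includes as above):

    // Fixed: take the argument by value so the coroutine frame owns its own
    // copy (cheap in practice thanks to move semantics).
    seastar::future<> delayed_print_safe(std::string msg) {
        co_await seastar::sleep(std::chrono::milliseconds(100));
        std::cout << msg << "\n"; // msg lives in the coroutine frame
    }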

Are coroutines async made easy? Not quite. The overriding issue here is that C++ is not a memory safe language. As new features are added, they're layered on top of a situation where anything can be a raw pointer, anything can be a raw reference. These extra abstractions in some cases make it easier to write bugs, although sometimes they help, as in the first example. Temporary lifetime rules are counterintuitive. Ideally, the compiler would notice that the function we're calling is a coroutine, and use a different set of rules for temporaries. That's not trivial. I'm not criticizing the compiler developers for the fact that that's not done. It is a place where the language tends to violate your intuition a little bit. I just have to call out compiler bugs here. We love LLVM. It's a fantastic piece of software. We see bugs getting fixed incredibly fast when they're reported. Coroutines for us have been an early adopter thing. We do use the latest stable release of LLVM. Sometimes even that's not enough. We do have one open bug that we're really looking forward to seeing the fix for in LLVM 16. I don't want to scare people off coroutines too much, because we use them day-to-day in our production software. They're not a toy, they are ready for use. You just have to be aware that if you're pushing the envelope of how the compiler handles these things, then you may be the one to discover a bug. Overall, they are a net win. Our decision as a software project is that we think they're worth using. They make it a lot easier to use RAII style in futures code, because of that simpler reasoning about the lifetime of objects on the stack. We also find it very interesting that this is the first time that futures are a C++ language feature, rather than just being something that's layered on by libraries. That potentially means that compilers have more opportunities to optimize our asynchronous code than they might have done historically.

Tradeoffs

I'm going to talk about some of the tradeoffs that we've made by adopting Seastar, and put it in the context of some alternatives. I'm not an expert on all these other frameworks, I should point out. Seastar is not the only option, of course, for writing fast async code. There's a framework called Asio in C++, the well-known Rust async framework Tokio, and various of the garbage collected languages have their own concept of a lightweight thread, which is somewhat async under the hood. The only reason I'm fudging that comment is that I'm not enough of an expert on these languages to speak authoritatively. The main difference between Seastar and these other options is that while these other things might be async in terms of the handling of I/O, they don't adopt the strict shared-nothing model. They don't generally manage to avoid atomics. The binding of tasks to cores can be soft. For example, in Rust's Tokio, tasks usually sit in a queue that's local to a particular thread, and the thread is typically local to a particular core, but there is work stealing. If a core finds that it runs out of work, it'll go grab some tasks from another core. That means that all of your code has to be written with the assumption that it could run on any core, which means your reference counting has to be atomic, and you inherit the overhead of all that.

That downside is also an upside, in that those other frameworks tend to find it easier to use third-party libraries. We have at times had to more or less rewrite some third-party libraries in order to make them usable in our code, because the rules about memory allocation and about what runs where are different enough from what the library was originally written for. For example, we wrote our own S3 client for when we're talking to cloud storage. As we go forward, I think there's an interesting possibility of some hybrid approaches. Biased reference counting is something that enables you to have smart pointers that can work across cores, but most of the time avoid using atomics. You could imagine a hybrid approach where you use those sorts of techniques combined with something like Rust's Tokio that binds things more tightly to a core, but maybe lets you specify at a fiber-by-fiber level whether this is a shard-local fiber or a fiber that could run anywhere. That might enable mixing very latency sensitive code with other more generic code, perhaps code from third-party libraries, in the same runtime. That's the idea. The point that I really want to make here is that this is an evolving area, and we are seeing new runtimes coming out in various languages that might see some combination of these models to get a better compromise between ergonomics and performance.

Using C++20 and Seastar is overall a clear net benefit for us. It enables us to get fantastic performance. By following certain conventions and being aware of the limitations, we can have a good day-to-day experience for our developers as well, and have people working productively. It might be right for you too, if you're building high throughput software and, like us, your work decomposes into units that can be mapped to shards. You probably also need to be looking to scale to more than a few cores for it to be worthwhile using Seastar. Or, if not a large number of cores, then perhaps a large number of concurrent I/Os from a small number of cores. If your application fits that mold, I think this framework can be very interesting for you. If it doesn't, then you are quite possibly better off with something a little bit less opinionated and a little bit more flexible.

 


 

Recorded at:

Dec 01, 2023
