Transcript
Rowell: How many of you know anything about GPUs? Before we get into that, I'd like to start with a tangent. This beautiful red vehicle here is an MG, MGB GT. It was produced roughly in the late '60s. I've always really wanted to buy one, because, apart from the fact that British cars are rare these days, it also has this particular benefit in that if it broke down, I could probably fix it. If I was on the side of the motorway and the very ancient engine inside passed out, I could probably wrangle myself together and fix it. That's something I can't really say for a car like this. This is a very modern Volvo. I don't have a clue how this works in any way. This could die, and I would be stuck there, and I would need a mechanic to come and fix it. Chances are, they would just tow it somewhere else.
The reason why I bring this up is because, actually, if you look at these two vehicles, the ways in which we interact with them really haven't changed that much. This car, of course, has lots of clever internals, so it will have sat nav and radio and things like that. Actually, from the perspective of driving, it's just, you're just changing gears, you're moving your foot pedals, you're steering. Not much has really changed. I really want you to keep that idea in your head, because that's going to be something that comes up again throughout this talk.
Unpredictable Programs
I recently started a new job around about six months ago, and you can imagine, first day, a little bit nervous. I joined a first call, and they say to me, Joe, we need you to make this program faster. These words for me are simultaneously very exciting and also terrifying. Because, you think you've joined a company where you care about the people who work there, you think they're good. You've joined and they've said to you, "We have no idea why this is slow, and we hope you might be able to help us."
What made this even more scary is I knew nothing about GPUs when I started this job. In fact, for me, the idea of CUDA reminded me of this barracuda, and I was petrified equally at both, because I knew nothing. I didn't know anything about how this works. I had no idea, much like Thomas said, on how to make this thing faster than it already was. Much like any good person, I go on YouTube. I try to find a tutorial. I end up writing a program that looks something like this.
This is essentially a variant of C called CUDA C. You can see that we've just got essentially a regular function here that is essentially memset. We take a pointer and we just overwrite it with this value, val. Down here, the first time I tried to run it, I just allocated 100 ints, passed them to the GPU, and expected it to do something. Something really weird about this program is it sometimes works and sometimes doesn't. Whether it works or not depends on both your hardware and your software.
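For reference, here is a hedged reconstruction of the kind of program being described; the names, launch configuration, and error handling are guesses rather than the actual slide code.

```cuda
// Hedged reconstruction, not the actual slide code: a memset-like kernel
// handed a plain malloc'd host pointer. Whether this works depends on whether
// your hardware and driver can transparently service GPU accesses to pageable
// host memory.
#include <cstdio>
#include <cstdlib>

__global__ void fill(int *ptr, int n, int val) {
    for (int i = 0; i < n; ++i) {
        ptr[i] = val;   // overwrite everything with val, exactly like memset
    }
}

int main() {
    int *data = static_cast<int *>(malloc(100 * sizeof(int)));  // ordinary CPU allocation
    fill<<<1, 1>>>(data, 100, 42);   // hand the raw host pointer straight to the GPU
    cudaDeviceSynchronize();         // on some hardware/software combinations this fails
    printf("data[0] = %d (%s)\n", data[0], cudaGetErrorString(cudaGetLastError()));
    free(data);
    return 0;
}
```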
I found this absolutely astounding. This was so weird, the fact that this seemingly very innocuous program would sometimes work and other times just not. The questions for this talk are two things: one, why does this sometimes work? Two, why was our otherwise performant program slow? These two programs are not really the same, but they have enough similarities to overlap.
Differences Between a CPU and GPU
With that, I just very briefly want to dive into the differences between a CPU and a GPU. When you're writing code for a CPU, you normally have many threads that run, and you have to make them concurrent in an explicit fashion. If you're writing pthreads, or you're using a C++ standard library, you have to explicitly say, I want this thing to be parallel. With GPUs, it's similar. Actually, a lot of the concurrency that you have is implicit. You're very declarative when you write your code and you say, actually, when this actually eventually runs I want it to take this portion of the thing and do some work on it.
Really, by form of analogy, I want to talk to you about the differences. A CPU is like an office. You have lots of people who are working there. They're all doing relatively independent tasks. They have their own space. They're doing their own things. Actually, they're doing roughly general stuff, whereas a GPU is a little bit more like a factory. In a factory, you have really specialized equipment for particular things. You can make tables, or cars, or whatever.
Normally factories don't make many different things. They just make one thing very well. You can see here that we have all of this storage space. I actually don't know what this factory is for, but this storage space down here contains all of the raw materials. For the purpose of an analogy, you can think about that as being our memory. We've got things here that have arisen from somewhere, and we're going to use that somehow to do some work. This analogy does stretch further. If you think about an office, you have dedicated space for each person. Here, you might have multiple different teams that do certain things that all share the same workspace, and you schedule these teams in and out based on essentially what needs to be done on that particular day. Just to capture this in a more programmatic way, this is what I mean by the implicit concurrency. Here we've got the same function as before, which is essentially memset.
What we're doing now is we're saying that each thread that runs is going to process at most 32 elements. Then we're going to split this range from the pointer into essentially subranges that are indexed by low and high. Then, with this check here, make sure that we don't overwrite anything, but fundamentally, this is how we get our concurrency. We specify a very small program that just divides up our range, and we expect that this will run in parallel.
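A hedged sketch of that chunked variant, with placeholder names and sizes:

```cuda
// Hedged sketch: each thread handles at most 32 elements, splitting the range
// into [low, high) sub-ranges, with a check so the last thread never writes
// past the end.
#define ELEMS_PER_THREAD 32

__global__ void fill(int *ptr, int n, int val) {
    int tid  = blockIdx.x * blockDim.x + threadIdx.x;
    int low  = tid * ELEMS_PER_THREAD;           // start of this thread's sub-range
    int high = min(low + ELEMS_PER_THREAD, n);   // clamp so we don't overwrite anything
    for (int i = low; i < high; ++i) {
        ptr[i] = val;
    }
}

// Launched with enough threads to cover n elements, e.g.:
//   int threads = 256;
//   int blocks  = (n + threads * ELEMS_PER_THREAD - 1) / (threads * ELEMS_PER_THREAD);
//   fill<<<blocks, threads>>>(ptr, n, val);
```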
The Notion of a Stream
You'll forgive me for going off on so many tangents, but I think this one is important. Does anyone know what this car is? I don't expect you to. It's utterly unimportant in its own way. This is the Ford Model N. I know nothing about this car other than the fact that it came before the Model T. The Model T was the first car that was built using a moving assembly line. Before the moving assembly line was invented, the way that you would build a car is you would have a fixed position, and workers would move around and work on it iteratively. Henry Ford, one day, had a dream where he was like, what if the pieces move and the workers stayed fixed, rather than the pieces staying fixed and the workers moving? This was really a massive revolution at the time, and it led to this thing here, which is called the moving assembly line, so where the people are fixed and the things keep moving.
The thing I want you to take away from this is that we're really going to prefer the moving assembly line when we can. Really, we're going to prefer a setting where we can get the hardware to do very particular tasks in a way that there's no dependency between each other. Each bit only has one particular thing that it must do. One of the ways that we're going to encapsulate that is via this notion of a stream. A stream is an ordered sequence of events. You can push tasks into it, and you can expect and guarantee that they will run in some consistent way. The reason why we do this, is because sometimes there are tasks that will be waiting for certain things.
For example, on a GPU, you might have a task that spends most of its time reading and writing from memory. That's a really inefficient use of the hardware a lot of the time, because that's slow and your computation will be very fast. You want a way of telling the computer, you can do something else here. That's fine. You do this via the notion of a stream. This will come up later.
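As a minimal, hedged sketch (using the fill kernel from the sketch above, with placeholder sizes), queueing work on a stream looks roughly like this:

```cuda
cudaStream_t stream;
cudaStreamCreate(&stream);

// The fourth launch parameter inside the angle brackets is the stream: the
// kernel is queued behind whatever is already enqueued there, and the device
// is free to do other work from other streams in the meantime.
fill<<<blocks, threads, 0, stream>>>(ptr, n, 42);

cudaStreamSynchronize(stream);  // only wait when we actually need the result
cudaStreamDestroy(stream);
```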
You can see here, this is, again, another code example. This cudaStream_t type is just this type in CUDA that says this is a logically consistent set of operations, please execute these particular tasks on that stream. You can see that with these angle brackets here. This is a sweeping generalization. The idea here is that GPUs thrive in settings where the computation to data ratio is really high. If we go back to our factory analogy from before, you can imagine a world in which the workers of the factory really are just waiting for raw resources all the time.
They're very quick. They're very good at doing their job, but they're waiting a lot for the raw resources to arrive from somewhere else. If that shipping time is too long, then maybe they will spend most of their time idle. The canonical example of this is matrix multiplication. In matrix multiplication, if you have an n by n matrix, it takes O(n³) time and O(n²) data. This is one of these situations in which you get a real speedup by using GPUs. You'll hear this a lot in performance tracks. I really don't like the big O notation here, because it lies.
There are asymptotically better algorithms for matrix multiplication that are not faster on GPUs, and that is in part because actually your constants really do matter in these settings. With all that said, we're going to write something. We're going to write a memcpy function that can handle any type of copy efficiently. By this I mean, if I pass it memory that is on the CPU, I want it to be able to transfer that memory quickly. If I pass it memory that's on a GPU, I want it to be able to copy that quickly. I don't care. I just want it to be able to do it efficiently. Conveniently, anyone who's written any code will know, memcpy exists. CUDA, in fact, has functions for this. Really, it might just seem that all we've got to do is pick, in each case, what it is we're going to do.
Just to conceptualize that slightly, there's three different types of memory that we need to worry about in a GPU, for the purpose of this talk. The first one is stuff that comes out of cudaMalloc. As the name suggests, it takes your memory on your graphics card and it gives you some buffer that you can store things in. It's like a regular malloc replacement, but just on the GPU. This memory is really special because it's constantly prioritized so it will never migrate, it will always be in that memory. It will stick there forever, and you can't change that behavior. You would have to manually decide that you want to move it somewhere else. This is obviously different from a regular allocator like malloc, which can move the memory around as it sees fit.
The second one of these here, which we also won't worry about for the purpose of this talk, is cudaMallocHost, where you essentially allocate pinned memory in system RAM, and then you make that accessible to the devices via memory mapping. The reason why you'd want to do this is because it means that your graphics card can access system memory directly, which is super cool. If I'd just used this, the first bug that we had wouldn't have been a bug. For the sake of this, again, we're going to ignore that, and I just really want to focus on this last one here, which is managed memory, or unified memory.
In this case, what happens is you're allocating memory where the physical storage may be either on a hardware device, so a GPU, or in system memory. What's funny about this is it's ethereal to an extent, because it's not always in one or the other. In fact, you can end up with a situation where, much like the stars in the sky, it's somehow in the ether. It moves between the two depending on what it is that you want or the pattern that you're expressing to the program.
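To make the three flavours concrete, here is a hedged sketch; the fill kernel and the sizes are the illustrative ones from the earlier sketches, not the talk's actual code.

```cuda
int *devOnly = nullptr, *pinnedHost = nullptr, *managed = nullptr;

cudaMalloc(&devOnly, N * sizeof(int));         // device memory: stays on the card, never migrates
cudaMallocHost(&pinnedHost, N * sizeof(int));  // pinned system RAM that the GPU can reach directly
cudaMallocManaged(&managed, N * sizeof(int));  // managed/unified: physical backing moves as needed

fill<<<blocks, threads>>>(managed, N, 42);     // first GPU touch faults the pages onto the device
cudaDeviceSynchronize();
int first = managed[0];                        // first CPU touch faults them back: same pointer

cudaFree(devOnly);
cudaFreeHost(pinnedHost);
cudaFree(managed);
```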
I just want to go over how they achieve this lovely magic. You'll hear lots of performance tricks, but one of them is, if you don't know what's going on, use strace, because it really can tell you things that are going on that you otherwise didn't know. In this case here, I wrote a simple program that just used cudaMallocManaged to give us some memory, and I ran it through strace. The very first thing that it does is it opens a file descriptor for this special device here, which is something that NVIDIA provide that essentially is an interface between your system memory and the device memory. Then here we're just going to allocate a large slab of memory via mmap. Those of you who are very familiar with binary will know that that number there is 64 megabytes exactly.
You can't customize this. It will just immediately say, please give me 64 megabytes straight away. Then after that, it deallocates some memory, it deallocates 32 megabytes. I don't know why. I've not been able to find anything written down that says why it would deallocate these 32 megabytes, but it just does it. Then, finally, after that, it takes the memory that it gave you before, adjusts it to your alignment, and then gives you back the size that you wanted. That's what this result value here is. You'll notice that the mmap is different between this line here and this line here. The reason why is because here we're specifying where the memory should exist.
The important part of this is that actually this is how we ensure that the pointers are the same across both the system and the device. We're specifying and saying, this thing should absolutely be at this address, and we should use that file descriptor that we open up here for that address. Really, we're specifying and saying, this is where this should live. Then we deallocate. It's very simple. We just remap the memory that we had before and then free what's left over. This is really what's going on under the hood, is every time you do this, you're making a series of system calls. You're mapping your memory into different places. Then it's eventually getting rid of it.
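To be clear, what follows is not NVIDIA's actual driver code: just a hedged, plain-C illustration of the syscall pattern strace shows, with assumed sizes and flags.

```cuda
#include <fcntl.h>
#include <sys/mman.h>

void sketch_of_uvm_allocation(void) {
    // Open the special UVM device that bridges system memory and device memory.
    int fd = open("/dev/nvidia-uvm", O_RDWR);

    // Reserve a large slab of address space up front (the 64 MB seen in strace).
    // PROT_NONE and MAP_ANONYMOUS are assumptions about how the reservation is made.
    void *slab = mmap(NULL, 64 << 20, PROT_NONE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    // Remap part of that slab at a fixed address, backed by the device fd, so the
    // CPU and the GPU agree on exactly where the allocation lives
    // (MAP_FIXED means "this address, and nowhere else").
    void *alloc = mmap(slab, 2 << 20, PROT_READ | PROT_WRITE,
                       MAP_SHARED | MAP_FIXED, fd, 0);

    (void)alloc;  // illustration only: real code would check errors and eventually munmap
}
```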
System Profiler (NVIDIA)
Now we know all of this, I just want to show you what happens in the profiler when you do this. This is NVIDIA system profiler. It's a super useful tool. It's really very useful. Here you can see, essentially the program that I wrote which we previously ran through strace. Down here on the left, you can see that we've got that MallocManaged call, which essentially says we're going to allocate this memory. Fill is the memset function that we had before. You'll see that when that actually runs, which is that blue line at the top there, we immediately have a page fault.
The reason why this happens is because the way that the driver actually enforces this etherealness is when you allocate the memory to begin with, it doesn't assign it physical storage. It just gives you an opaque page. Then when you actually run this, the very first time that it faults, it's going to go, I need to allocate some physical storage for this. I've had a page fault. This long bar here is it actually allocating that memory on the device, and handling the page fault. You can see from this bar, that actually most of the time of this very small function running is just handling the page fault. In some settings, this won't be true, but you can certainly imagine a world in which this ends up taking a great deal of your time.
In fact, if you allocate even more memory, you actually get a very similar pattern that has some nice quirks. First of all, you'll see that we do get this repeated page fault on writes, but actually the spacing between them is irregular. The reason why is because each time you have a page fault like this, the hardware tries to help you, and it tries to allocate more memory each time. That's why you get this somewhat regular pattern where the first time it doesn't help much, then the second time, the hardware helps more. The third time, it helps even more, and so on.
Interestingly, if you expanded this line the whole way out, you would see that this entire pattern actually repeats. The reason why is because after this last one here, we have an even larger gap between faults. I don't quite know why this happens, but once this does happen, the hardware seems to forget that this is a problem that you're running into, and so it stops helping you as much. You'll actually see that this repeating pattern will eventually continue to tile itself the entire way along.
I think the programming guide for CUDA says it best, which is that the physical location of the data is invisible to a program and may be changed at any time, whether that's obvious to you or not. At any point, no matter what you're doing, the underlying physical storage of that memory that you're accessing via that shared pointer can change. It's worth thinking about this for a second, because it's not very often that this happens with programs. We saw in Jules' talk that processors can move, but it's actually very rare that memory moves like this in a way that is entirely opaque and in a way that matters this much.
It happens sometimes with caches, sure, but that's different to this, because the difference there is that actually you're just making a copy of something. You're just storing something, and then you're dealing with it. Here your entire program's data can just move, if you do it via this pointer. This is the second half of this quote, which is, the access to the data's virtual address will remain valid and coherent from any processor, regardless of the locality. Actually, I think this is actually the only way they would be able to do this, because you've got so much going on, you've got so many processes that are running. I think that this is actually the really very interesting part of this talk, is that the accesses must remain valid, and that's going to have a lot of very interesting implications for the performance of our program.
I've told you all this, and I have told you that functions already exist for CUDA copying, so all we need to do is memcpy. If we've got a managed pointer on the left and managed pointer on the right, this should be super-fast, and it should work because this will run on the CPU. The pointers are accessible via the CPU. It's going to work great. Everything's going to run fast. No, we have fallen into the pit. This is our eponymous pitfall. You can really see this from this graph here. You'll see that as we increase the size in gigabytes, which is the x-axis, and the gigabytes per second on the y-axis, you'll see that we're jumping around a lot, a huge amount.
In fact, to begin with, when we're copying a gigabyte, we're getting just over 600 megabytes per second of throughput, and by the time we've gotten the whole way up to 70 gigabytes, we're only getting 0.85 gigabytes per second. I don't know how often you benchmark your memory bandwidth, but that's dreadful. That's really bad. I could write to and from an SSD and get faster than this.
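For reference, a hedged sketch of the setup being benchmarked here: both pointers are managed, the data has just been touched on the GPU, and the copy then runs on the CPU with a plain memcpy. The fill kernel and sizes are the illustrative ones from the earlier sketches.

```cuda
#include <cstring>

int *src = nullptr, *dst = nullptr;
cudaMallocManaged(&src, N * sizeof(int));
cudaMallocManaged(&dst, N * sizeof(int));

fill<<<blocks, threads>>>(src, N, 42);   // the pages behind src migrate onto the device
cudaDeviceSynchronize();

// Perfectly valid, coherent, and painfully slow: every access the CPU makes
// has to fault a 4 KB page back across from the device first.
memcpy(dst, src, N * sizeof(int));
```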
The reason why this happens is because of some clever stuff. I'm now going to break out the profiler for a second. Again, this is the NVIDIA system profiler. If I open this, if I zoom into this, you'll see here that we end up getting CPU page faults. I just want to recap. Previously we had the memory on the GPU, we've just done something, and now we're trying to do a memcpy on the CPU. You'll see here that we have a page fault. This page fault is where the CPU tries to do a copy and it realizes that it can't access the page. In fact, as we zoom the entire way along, you'll see that we get loads of these. We get all of them the entire way along here, over again. In fact, in this particular trace, there's something like 37,000 or so. It's tremendously awful quite how much of our time is just spent with these page faults.
There's an interesting interaction here between the device, the graphics card, and the CPU, which is that, actually we're limited here by the page size of our CPU. If we think about the anatomy of a page fault, what happens? The first thing is that the CPU is going through memory, and it goes, "I haven't got this. I'm going to need to request it." It allocates some space, it contacts the GPU, and it says, I need this much memory, and the GPU gives it back. That amount of memory, by definition, can only be the page size of your system. Here, all of these lines that we see, they're all 4 kilobyte page faults.
Something I didn't do, which I probably should have done, is I didn't check how this changes if you increase the page size. You would still get the same result as this, but it would be less horrendous. It would probably be slightly better. That's because each time we have one of these problems, we end up with a syscall because we have to deal with the page fault. It's very expensive to go through this and to get an answer out that is actually useful.
This is captured again here. You can see that we end up with all of these page faults over this span. It's not great for our performance. This is the same thing, but zoomed in because I just want to show some causality for you. As we zoom really far in here, you can see that we end up with this regular, repeating pattern of page faults.
Once again, something that's interesting is these patterns repeat, and they're unequally spaced. You can see, for these two here, we have a very small gap and a slightly larger gap and a slightly larger gap and a slightly larger gap, and that this pattern repeats. The reason for this is that you have these purple blocks here followed by the red blocks, and this is the hardware trying to help you. This is the hardware trying to give you more than you asked for. Each of these red lines is a page fault for 4k, and then the purple lines are a speculative prefetch. It realizes, I've had this page fault. That's really bad. I'm going to transfer over more memory.
As before, you can see that the widths of each of these blocks grows the more times you fault. By the time you get here, I think it's something like 2 gigabytes, it transfers back over. Right at the bottom, it's only, I think, 64 kilobytes. Each time you make this mistake, it tries to give you more memory to make things faster. This is all handled in the hardware. You don't need to do anything. It's just done by the device directly.
cudaMemcpy
"Joe, I hear you say, you really told me these functions exist. We have ccudaMemcpy to copy between things. Why are you writing your own memcpy?" Sure enough, actually, it does give you some better results if you use it. This purple line down here is if we use managed memory, but we specify that we're copying from the device back to the CPU. This is just only ever so slightly faster than what we had before. You'll see that we still end up with this relatively stepwise curve. This is just because it can handle the prefetches more efficiently. Meanwhile, if we go the other way around, so we go from the device back to the host, you can see that we end up with this very fast line.
We get up to about 10 gigabytes per second coming back, and this is just because of the pages again. It turns out that the GPU actually has more capacity for dealing with the larger pages, up to 2 megabytes, and that's really what you see here. You end up with this nice curve that gives you something that is slightly better, but it's still not great. Ten gigabytes per second of transfer is still not incredible.
Just to summarize this, if you're writing code that needs to deal with this generically, I highly recommend that you don't use standard functions and you just prefer the CUDA functions. Here you can see that even in the slowest case, which is this purple line here, we're roughly twice as fast as on this slide here, and that is just a one-line change. I didn't do anything else. The operation that's happening under the hood is still the same. In situations where you can get more of a speedup, like this green line here, you'll see that it's still worth it just to do this generically.
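For the generic case, a hedged sketch of the kind of one-line change being described. The slides presumably spell the direction out explicitly; cudaMemcpyDefault is one way to stay generic and let the runtime work out where the pointers actually live.

```cuda
// Instead of memcpy(dst, src, N * sizeof(int)) on the host:
cudaMemcpy(dst, src, N * sizeof(int), cudaMemcpyDefault);
```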
What if We Forced the Migration Ahead of Time?
Of course, all of this cost has come from the fact that we're handling page faults. We're handling the fact that our physical memory isn't mapped to the place where we're doing the work. A question that you might have is, what if we just moved them to the same place to begin with? What if, as opposed to doing the copy between GPU and CPU, we just moved the memory to the right location, and we did the copy there. Conveniently, CUDA lets you do this. I'm really not a fan of this prefetch term, because it's not really a prefetch, it's a migration. All this does is it says, I'm going to ask you please to move the memory that is backed under this pointer, to the destination device. You can put it on a stream if you want, force it to be consistent with some previous operations or to wait until some previous operations are finished.
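A hedged sketch of what that looks like in code; deviceId and stream here are placeholders for whichever device and stream you happen to be using.

```cuda
// Ask the driver to migrate the pages behind a managed pointer.
cudaMemPrefetchAsync(ptr, N * sizeof(int), deviceId, stream);        // onto the GPU
cudaMemPrefetchAsync(ptr, N * sizeof(int), cudaCpuDeviceId, stream); // or back to system memory

// Because you pass a size, you can migrate only part of an allocation, which is
// how you can end up with one array split across the CPU and the GPU.
cudaMemPrefetchAsync((char *)ptr + (N / 2) * sizeof(int),
                     (N / 2) * sizeof(int), deviceId, stream);
```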
I do just want to point out that you do need to specify the size. This actually gives you a really clever trick, which is that, you can, if you want to, only force some of the pages to be in any one given place. With this, you can actually end up with an array that is split across multiple devices, across the CPU and the GPU. You can split it however you want. That actually gives you a different sort of parallelism, in a way, because if for whatever reason you wanted to, you could do an update on an array, on the GPU and on the CPU at the same time, and it would all work fine because of the guarantees that they give you. If we do this, and we do the prefetch just on the source, so if we're copying from one pointer to another, and we just move the source pointer to the device, you can see that we get a much nicer curve still. You can see the increase here. It flatlines at about 16 gigabytes per second, which is ok. Then it drops like a stone. The reason for this actually depends on your device.
All of the numbers in this talk were gathered on an H100, and an H100 has 80 gigabytes of memory. You can see that while both our prefetched and our destination array can fit in memory, everything goes fine. Then right at the point here when they can no longer both fit, our performance tanks. This, again, just happens, because we're now in the realm of getting many page faults again. We're encountering these page faults, we're going to end up with a situation where we're trying to write to something or trying to read something that we don't already have, and so our bandwidth just completely tanks. This is even more pronounced if we try to prefetch them both. If we try to prefetch both of the pointers ahead of time, which is what you can see here.
You'll see that we get a huge amount of performance. This is 1300 gigabytes per second. It's huge. It's so fast because we're doing a copy between the memory on the GPU and the memory that is also on the GPU. This works really well right up until we get to the point where both pointers no longer fit in memory, and then our memory bandwidth just completely crashes. I think this is something like 600 times slower, this line here, because what we've done here is we've actually given up the ability to be able to proactively manage this. We've essentially said we're just going to let you deal with it, and then we're going to get the performance we had before.
This is a very visual representation of what that looks like. You can see here that we have all of the red blocks of doom, as it were, and you can see that they're mostly surrounding read requests. We're trying to read some memory, and we don't have it, and so we transfer it from the host to the device. Then this second line here is us saying, no, we're out of memory. We have to get rid of something back into system memory so that we can read more stuff in. You'll see that this continuously happens. The system's not that clever at deciding what it should get rid of and what it shouldn't. I've seen this before where it's really screwed up and really given us very poor performance because it's evicted things that we haven't even used yet. We're constantly reading things from memory that actually we really would rather not have to keep rereading. You can see that here with these continued page faults for reads and for writes.
Our second conclusion is that we need to manage these copies more carefully. We can't let the hardware deal with it for us, we've got to somehow do better by ourselves. The solution to this is we're going to want to end up with a profile that looks something a little bit more like this. This is actually a trace from a running program that we had internally at poolside. This was actually the problem that I was sent to diagnose. You can see here that we've got these memory accesses here, but by the time we get to this bit, it just completely crashes. Our bandwidth drops horrendously. We're constantly just faulting. You can see that with the spikes on the PCI Express bandwidth numbers up here, we just spike. We just end up consuming more. I know that on this chart here, it went down to 40 gigabytes.
Actually, you can see this graph under any circumstances if your device memory is full. If you use the cudaMalloc API from before and your device memory is really full, remember, it will never get evicted. You can arbitrarily move where this graph sits to the left by just allocating more card memory. I think in our case, it was something like, we were moving 512 megabytes repeatedly, and it was causing this unbelievably slow bandwidth because it was constantly trying to evict things it needed to work on. It's actually very easy to trigger this behavior.
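In case it helps, a hedged sketch of how easy this is to trigger: fill most of the card with cudaMalloc memory (which never migrates), then work through a managed buffer that can no longer fit in what's left. The sizes and launch parameters are illustrative, and fill is the sketch kernel from earlier.

```cuda
size_t freeBytes = 0, totalBytes = 0;
cudaMemGetInfo(&freeBytes, &totalBytes);

void *ballast = nullptr;
cudaMalloc(&ballast, freeBytes - (512ull << 20));   // leave only ~512 MB of headroom

size_t workingSet = 2ull << 30;                     // 2 GB that now cannot all be resident
int *managed = nullptr;
cudaMallocManaged(&managed, workingSet);

fill<<<blocks, threads>>>(managed, workingSet / sizeof(int), 42);  // constant evict/fault cycle
cudaDeviceSynchronize();
```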
What we need to do here is we need to manage these copies more carefully. I'm going to show you how to do that. The first thing we're going to do is we're going to define this PREFETCH_SIZE here. This corresponds to roughly 2 megabytes of data that we're reading, and I'm going to make two streams, which are s1 and s2. These streams are important because it's going to enable us to queue our operations properly. Then, of course, here I'm just going to calculate the number of prefetches that we need. This is the magic of the loop. What we're going to do is we're going to start doing this copy, and the very first time, we're going to prefetch the previous thing that we had, and we're going to send it back to the CPU. That's what this cudaCpuDeviceId thing means here. We've done some operation. We've taken some data. We're going to send the bit that we had before, and we're going to push that back to CPU memory.
Again, we have to do this explicitly, because otherwise our performance is just going to tank but also because the CPU needs to be involved with remapping things back into its own memory. You can't just unilaterally do it. You need to give it that information back. It needs to be involved in that. The way that we'd say this is, we go, we're going to queue those on a particular stream, and we're going to explicitly send them back to the CPU. The next thing we're going to do is we're going to prefetch the blocks that we need from system memory back onto the card. This DeviceId thing here is just a placeholder that says this is the ID of the device that we're sending the stuff to. You'll notice that we use a different stream here. You could do this on the same stream. You could use s1 for this as well. Actually, in practice, it doesn't really make it much better. It's better to be like this because your GPU is already doing things.
It's useful to have it know that it's got things that it's got to do stuff on. If you do this, and you unroll the entire thing, you end up with this line at the bottom here. You can see that, unlike our previous graph, which we had back here, where we had lots of these not very nice red lines, by the time we get to here, the red lines have all but disappeared. In fact, there's this tiny one here, but that's a bug that I've still not worked out what it is, so we're going to ignore that for the sake of this. You can see that just by combining these various clever prefetches and sending things back to where they were before and managing them, everything gets much better.
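Putting that together, here is a hedged sketch of the kind of loop being described. This is not the exact production code: the sizes are illustrative, deviceId is a placeholder, and the copy itself is stood in for by cudaMemcpyAsync.

```cuda
#define PREFETCH_SIZE (2u << 20)   // roughly 2 MB per chunk

// Hedged sketch: copy between two managed buffers chunk by chunk, evicting the
// previous chunk to the CPU on one stream while staging the next on another.
void chunkedManagedCopy(int *dst, int *src, size_t n, int deviceId) {
    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    size_t bytes       = n * sizeof(int);
    size_t numPrefetch = (bytes + PREFETCH_SIZE - 1) / PREFETCH_SIZE;

    for (size_t i = 0; i < numPrefetch; ++i) {
        size_t offset = i * PREFETCH_SIZE;
        size_t len    = (offset + PREFETCH_SIZE > bytes) ? bytes - offset : PREFETCH_SIZE;

        if (i > 0) {
            // Push the chunk we have just finished with back to system memory,
            // so the driver never has to pick a victim for us.
            size_t prev = (i - 1) * PREFETCH_SIZE;
            cudaMemPrefetchAsync((char *)src + prev, PREFETCH_SIZE, cudaCpuDeviceId, s1);
            cudaMemPrefetchAsync((char *)dst + prev, PREFETCH_SIZE, cudaCpuDeviceId, s1);
        }

        // Pull the chunk we are about to work on onto the card, on the other stream.
        cudaMemPrefetchAsync((char *)src + offset, len, deviceId, s2);
        cudaMemPrefetchAsync((char *)dst + offset, len, deviceId, s2);

        // The copy for this chunk, queued behind its prefetches on the same stream.
        cudaMemcpyAsync((char *)dst + offset, (char *)src + offset, len,
                        cudaMemcpyDefault, s2);
    }

    cudaStreamSynchronize(s1);
    cudaStreamSynchronize(s2);
    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
}
```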
In fact, I don't have the hard numbers in the slides, but at least from what I've done impromptu, this actually beats most of the standard CUDA functions for doing this. If you write your own custom code, and you're careful about dealing with this, and you can track how much memory you've allocated, this will beat all of the standard library CUDA functions, at least on the hardware that I have access to, under the drivers I have access to: all the usual caveats.
Actively Resisting Change
With all of that said, I want to return to this MGB here. You'll notice we've blitzed through a lot of stuff. We've gone through a lot of hardware stuff. I find it so astounding that originally, when we were talking about this, we were like, this single pointer is going to make our lives so much simpler. Just having one pointer that we can use across both devices and across system memory, it's going to make our lives great. Everything's going to be so much easier.
In a way, it reminds me of the upgrade from this car to the Volvo. We've designed stuff, and we've put stuff into the world under the mantra of it being simple, but actually, I don't think that's really true. We're on slide 51 and I spent all of this time explaining this stuff to you around something that should have been simple, and it's just not. It's actually very complicated.
At this point, I think I'd rather just manage the copies myself. I think it really comes down to the fact that, much like this car, we're programming for a computer that was designed 50 years ago. We're programming for a PDP-11. We're using the same stuff that we've always done. Actually, we've actively resisted change. Because, for those of you who are at least into some retro computing, we had this discussion with near and far pointers in the '90s. We had this idea of changing the language so that we could express the hardware more explicitly, and we just haven't. We've actively tried to stop ourselves from changing the tools that we use to meet the hardware that we're running on. I want to drive this point home. It is impossible to statically determine if this function works or if this code works. The compiler can't tell you.
Assuming that these functions are public, you can't know, because you have no way of expressing succinctly that this pointer here that is passed in is something on which you can operate. This will not blow up your computer or anything. Everything will work fine. There's no memory violations here. The hardware is clever enough. It strikes me as so weird that you can't just say, "This thing needs to be this," and expect it to work safely in any meaningful sense. We've resisted all of this change, and it's really not been that good.
Key Takeaways
The very first thing I want you to do is profile your code. In particular, I want you to profile your code with different tools. I want you to try clever and innovative things like strace. I want you to really try to understand why things are happening the way they're happening. Because although computers are confusing, there is almost always a reason. It might be a cosmic ray. It might be something crazy. It might be something that none of us understand, but fundamentally, there is always a reason for this.
I want you to consider where you can simplify things. I don't mean this in the sense of simplify like, I've rewritten this loop or something like that. I mean when someone reads your code for the first time, do you have to give them an hour presentation before they know why it's doing what it's doing? That's probably not that simple, even if it looks it. Above all, I really want you to choose performance. By that I mean, if in doubt, performance is good.
Questions and Answers
Participant 1: Where does that mechanism of that shared memory, of that architecture come from, if, in the end, you tell us it's easier, or it makes more sense, to do it ourselves?
Rowell: NVIDIA and AMD both sell it as an easier way to get your program started. If you've got an old legacy CPU application and you want to port it to a GPU, and you want to start early, in a way that you can be aware when things are breaking, you should use unified memory. Actually, it does make your life easier in some ways. It's just, if you're trying to use it in an environment where performance really matters, I don't think it's worth it.
Participant 1: It's for migration reasons or backwards compatibility.
Rowell: Exactly. You will see some people using it for other reasons. For example, if you need more memory than you can fit in your GPU. Let's say your working set is 100 gigs, and your device memory is 80, it is useful to sometimes have this. Actually, I don't ever think it's better than just doing it by yourself. In fact, in the end, that was exactly what we ended up doing. We ended up doing the copies by hand because it just wasn't performant enough for us and it was too hard to predict as well.
Participant 2: Was the overhead that you saw in your memcpy-ies always page fault related? If you were copying twice from device to host, would the second copy be fast?
Rowell: No. Actually, it wasn't. I will open this profile for you here, and you'll see here that our occupancy, so the amount that our device is being used, is actually really very high. It goes back to the coherency thing that I said before. When you're dealing with memory in any way, the actual device hardware needs to make sure that its view of the memory is consistent everywhere. Even if you're just requesting that same page across multiple places on the same device, it will stall.
Actually, just counting the number of page faults is not enough. There are certain tricks you can do when you're writing your code to make sure that actually, each page that you're reading is only needed in one place, but it doesn't really overcome the issues.