Transcript
Beckwith: I'm Monica, and as Richard mentioned, initially I was going to co-present with Jade. Jade and I joined Arm roughly around the same time, March last year. We soon realized that we had similar interests. From a performance and managed-runtime perspective, I was looking into concurrency, and she, in fact, was the architect for memory models, especially for Arm and Power, and for these kinds of tools. So I got to learn those tools and work with her, and then apply them to Java-based benchmarks, and I'm going to walk you through how that came to happen.
I gave a brief background about us. She, Jade Alglave, co-designed the tool suite that I'll use. If you were here earlier during the performance panel, we were talking about optimizing performance, but you still want the consistency, the barriers and everything, to be there. Yes, barriers in code have performance penalties, but when you're trying to replace them with some other semantics, you still have to maintain the program order, especially if your program expects that. So those are the tools that help us figure out whether we're changing the semantics correctly.
Here is a little bit of background on me. I initially started looking at the Java memory model back in 2004. I used to work for Advanced Micro Devices on the x86-64 architecture. I've tried to understand the model since then, and now I'm applying it to Arm, which is weakly ordered, and I'll walk you through that.
What we, as in me, will cover today is basically memory models: what they mean with respect to Java and to relaxed architectures. I'm not going into details, so that we can move along to the performance, because this is the performance track. Also, why we used litmus tests: because, like I said, when we're changing a barrier to something else, or when we're adding a barrier to guarantee some program execution order, we need to make sure that's what we're actually observing. These kinds of tools help us look at what we're actually observing, and whether in any case we could observe the things we don't want to observe - I'll walk you through an example.
And then how I brought that back to JBB via micro-benchmarking. I wrote some micro-benchmarks to make sure that we're doing [inaudible 00:03:12] stores, for example; that's one of the things in the Java memory model. Then, after those changes, how it got applied to JBB. Here it's on Arm - I ran it on an Armv8.1 architecture - and I have some scaling numbers. What we cannot cover today, of course, is any of the details, because we're trying to cover so much and a deep dive is not possible today, but please reach out if you need a deep dive on any of those things.
Memory Models
My understanding of memory models is this: what value can a load, a read, observe? If I am loading from memory, or from cache, or from whatever domain, what can I actually see? That's what a memory model guarantees. There are some guarantees from the architectural front, and there are some guarantees from the language front. So that's, in one line, my understanding of memory models.
Let's start with an ideal concurrent world of hardware and software. Basically I'm talking about processor threads, and then software threads. I had a bunch of block diagrams that I wanted to do here, and then I thought that doesn't make sense, because you have to go with the experts. And I consider Doug Lea to be an expert in this. This is from Doug Lea's Concurrent Programming in Java. I think it's an old book, and these are called timethreads. This is his explanation of timethreads. Basically, thread 1 is blocked and thread 2 acquires the lock for Object 1, and then Object 2 is a helper for Object 1, and then thread 2 proceeds. So this is a multi-threaded, concurrency-aware program. On the other side we have multi-threaded hardware with a shared memory structure.
This is, again, from "A Tutorial Introduction to the ARM and POWER Relaxed Memory Models." If you read the paper, it builds up a story on how you have sequential consistency with writes and reads that go in program order, and then how store buffers eventually add to the mystery of the reordering. So an ideal concurrent world is one where we would expect writes and reads to happen in order.
Let's first start with sequentially consistent shared memory. What does that mean? We have the hardware with the shared memory. We say that the execution timeline is a single global execution order; it is executing as if it were single-threaded hardware. That, plus what we just saw - the program order of a concurrency-aware program - will give you a sequentially consistent machine. So, a single global execution order on your hardware and a program order specified on your software front will give you a sequentially consistent machine.
There are two defining properties of a sequentially consistent machine. They're important because, as I build the story, you will understand how different architectures fall on either side of this definition. The first one is no local reordering: if I am a processing thread, and I'm writing and reading, I should not reorder my reads and writes. The second is that if I'm a processing thread, my writes become visible to every other thread simultaneously. So: no local reordering, and writes become visible simultaneously.
I mentioned buffering. So we have the ideal machine, which has a single global execution order. But now, in the real world, we have store buffers, and I'll explain a real-world system here soon. Let's keep this example in mind. Initially, you have X and Y, which are zero, and then you have foo and bar, which are local variables. If you've ever seen any sequential consistency or memory ordering talk, this is the token example, and I'm just going to build on that.
We have P0 and P1, which are your processing threads, your CPU execution threads. On the P0 side, you have X equal to 1, and then foo equal to Y, and on the P1 side you have Y equal to 1, and bar equal to X. What are the permissible values for foo and bar? Keep in mind, this is a sequentially consistent machine. So for a sequentially consistent machine, given that there's no local reordering and writes are visible to the other processing threads at the same time, we really can only have three interleavings - the ways the threads can interleave the reads and writes. We can have A, B, C, D; or, starting from P1, C, D, A, B; or we can have A, C and B, D. These are the only three interleavings that are possible on a sequentially consistent machine.
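To make the example concrete, here is a minimal Java sketch of the same four accesses - the class, the thread setup, and the A-D labels are mine, not from the talk:

```java
// A minimal sketch of the store-buffer example with plain (non-volatile) fields.
// x and y are the shared memory locations; foo and bar capture what each thread read.
class StoreBufferExample {
    static int x = 0, y = 0;   // shared memory locations
    static int foo, bar;       // results observed by each processing thread

    static void p0() {         // processing thread P0
        x = 1;                 // (A) write X
        foo = y;               // (B) read Y
    }

    static void p1() {         // processing thread P1
        y = 1;                 // (C) write Y
        bar = x;               // (D) read X
    }

    public static void main(String[] args) throws InterruptedException {
        Thread t0 = new Thread(StoreBufferExample::p0);
        Thread t1 = new Thread(StoreBufferExample::p1);
        t0.start(); t1.start();
        t0.join();  t1.join();
        // On a sequentially consistent machine only the interleavings above are
        // possible, so a run here could never end with foo == 0 && bar == 0.
        System.out.println("foo=" + foo + " bar=" + bar);
    }
}
```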
At any time, we will never see foo and bar both equal to zero, because either it's going to be X equal to 1, Y equal to 1, and then foo equal to 1, bar equal to 1, and so on and so forth. With these three interleavings, we will never see foo and bar equal to 0. Let's come back to the real world. We have multiple CPUs. The CPUs could be multi-threaded - it's called simultaneous multithreading, SMT. You can have out-of-order execution. You can have load and store buffers. And many architectures, many CPUs, do have these things today.
Then comes another layer: caching. We have the icache, the instruction cache, and the dcache, the data cache, which are on the CPU nowadays. Then we have the L2s, which are usually private to that CPU. So each CPU can be SMT, and then you have these private L2s. Then we have something called the last-level cache, which is the L3, and which is shared. And then we have memory controllers, and then DRAM, which is your memory DIMMs, and then I/O. So this is a machine that has a tiered memory structure and multiple processing threads. This is where we are today.
With that in mind, let's see how we can define these architectures with respect to sequential consistency. On the right, you have something like an x86 machine or a SPARC machine, which are called strong memory models. They're called strong memory models because they're based on something called Total Store Ordering, TSO. So sometimes you will see people write SPARC TSO, meaning an architecture that observes total store ordering.
What does that mean? TSO means a thread can see its own writes before other threads do. So I'm P0 and I can see my own writes, and my writes become visible to all other processing threads simultaneously - when a write leaves my own buffer, everybody sees that value. This is called the multiple-copy atomic model, and that's what TSO is based on. Power and Armv7 are on the other side. They're called weaker memory models, and that's because local reordering is allowed - remember, we have out-of-order execution and so on, so you're allowed to locally reorder. Also, any other thread is not guaranteed to see my write: I'm not responsible for broadcasting my write at the same time for all other threads to be able to see it. That is called the non-multiple-copy atomic model.
Now, there's Armv8, which is also a weaker memory model, but the difference is that all threads are guaranteed to see a write simultaneously. If you look at the similarity with TSO, you will find that the second point is actually multiple-copy atomicity. So even though I can reorder locally, when I broadcast, everybody sees my write at the same time. That's still weaker, but not as weak as Armv7 or Power.
Now we know that x86 is a strong memory model, so we'll go back to our store buffer example. Can we reason about our concurrent programs following sequential consistency? Given that we have total store ordering, can we say, "Hey, our program is sequentially consistent, so when we run it on an x86, for example, it should provide the same program order"? If we had a formal executable model, we could understand the guarantees that the model provides. Here is where Jade would have come in and talked about her tools. Since she's not here, I'm going to talk about them quickly, and then move on to things that I'm more comfortable with, which is the performance realm.
There's a vast family of litmus tests, which are the kinds of tests that you can execute on these models, either simulated or on actual hardware. The tool itself has three different parts, so it's a tool set. We have something called herd, which takes a cat model - an architectural model - and a litmus test, and runs it. In our store buffer example, it checks whether the behavior of foo and bar both observing 0 is allowed, and it gives you a yes-or-no kind of answer.
The litmus test itself can be run on hardware, so you don't need a simulator; you can run it on actual hardware and ask the same question. Then, of course, there's the configuration file, if you want to generate your own litmus tests based on certain configurations that you want to allow. Say you want to test how strong a memory ordering you want - you can generate your own litmus tests for that. Here is a link to the tool, and it's also in my references.
Let's go to the store buffer litmus test on our TSO hardware, which is x86 - my laptop is x86. The way the litmus test works is that first you have the hardware architecture, which is x86, and then the test name, which is our store buffer. Then you have the initial state of your variables; that's not strictly necessary, because initially everything is zero, but I've shown it here so that we can understand how the flow works for the litmus test. X and Y are shared memory locations, and P0 and P1 are your processing threads. Then you have the instructions that execute on those processing threads, displayed as columns. So P0 has the MOV of 1 into X and then the MOV of Y into EAX, and the same goes for P1. Basically, this is what I showed earlier with the foo/bar example.
This is how you write the litmus test, and then you ask a question at the end: will EAX on P0 ever see 0 as its value, and will EAX on P1 ever see 0 as its value? Can we observe this final state, given that the initial state was X equal to zero and Y equal to zero?
I mentioned TSO, I mentioned litmus tests, and I mentioned the store buffer and how sequential consistency is guaranteed. So on TSO hardware such as x86, can we observe the final state of foo equal to 0 and bar equal to 0, given that X and Y are 0? I'm going to ask you this question - what do you think? If I have time, I'll show a demo as well of how the litmus test actually works. So, yes, unfortunately, on all production hardware you will observe foo and bar equal to 0. And that's because there are no barriers. So if we do want no local reordering, and all writes observed simultaneously, then we have to do it ourselves - we have to introduce some barriers.
I'm sorry you cannot see this, but it says here: use mfences as needed. Basically, I have some demos, and I can show you how it works without mfence and how mfence helps in this case. Mfence is a barrier on x86 hardware.
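In Java terms, one way to get that fence without writing it by hand is to make the shared fields volatile - a hypothetical variant of the sketch above, not something shown in the talk:

```java
// Hypothetical variant of the store-buffer sketch: with volatile fields, the Java
// memory model forbids the foo == 0 && bar == 0 outcome. On x86, HotSpot typically
// enforces this by emitting an mfence or a lock-prefixed instruction after each
// volatile store.
class StoreBufferVolatile {
    static volatile int x = 0, y = 0;  // volatile shared locations
    static int foo, bar;

    static void p0() { x = 1; foo = y; }
    static void p1() { y = 1; bar = x; }
}
```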
Performance Methodology Using Litmus Tests and Tools
Let's look at the litmus tests and see if we can avoid barriers using the litmus test tools that we have. First, the what and the why of barriers. I just wanted to put this here because you will see these things on Arm: I'm going to run these tests on the Arm architecture, and these are the memory barriers that you will observe in those litmus tests, and when I profile the Java micro-benchmarks that I have. So: DMB SY is a full system barrier; DMB ST waits until stores are complete; and DMB LD waits for loads to complete.
Barriers are needed; especially when you place them correctly for each pair, you can restore sequential consistency. The more appropriately you add barriers, the closer you get to a sequentially consistent model. Of course, they enforce strong order and they're very expensive. So, as a performance-minded person, you want to try to avoid them, but also keep the order that they enforce.
This is a litmus test. If you go to any other talk on consistency and memory ordering, these are the token cases. This is a typical message-passing case: P0 updates a value and then updates a flag, and P1 reads the flag and then reads the value - typical message passing. What we are trying to see here is whether you'll ever see the flag set but the value not written. Can we ever see that? That's the condition here; it's asking whether I can ever read the value as not set while the flag is set. Remember, P0 is setting the value first and then the flag, and P1 is reading the flag and then the value. So will P1 ever read an inconsistent value like that?
The answer is yes. This is the output of the tool, and it generates this pretty graph if you can ever observe the condition. If you cannot observe the condition, it will not give you these graphs. These graphs basically show the program order, the ordered-before and reads-from relations - basically, under what conditions you see a 0 and a 1, meaning the flag is 1 but the value is 0. So in this case, it generates a graph saying under what conditions that is observed.
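For comparison, the same message-passing pattern can be written as a hypothetical Java sketch (the names are mine, not from the talk); with plain fields, the stale read that the litmus test found is allowed here too:

```java
// A hypothetical Java sketch of the message-passing pattern. With plain
// (non-volatile) fields, the reader may legally observe flag == 1 while value is
// still 0; declaring flag volatile would forbid that outcome.
class MessagePassing {
    static int value = 0;
    static int flag = 0;

    static void writer() {       // P0
        value = 42;              // write the data
        flag = 1;                // then set the flag
    }

    static int reader() {        // P1
        while (flag == 0) { }    // without volatile, this loop may also spin forever
        return value;            // may still read 0 without an ordering guarantee
    }
}
```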
Now I'm going to introduce a load barrier. I'm using DMB LD, which is the load barrier - wait for loads to complete. So can my stores be reordered? I'm introducing a load barrier, so I complete my first load and then I complete my second load; the DMB puts a barrier in place. Yes, unfortunately, you can still observe the condition - it happens when this ordered-before relation is observed, so if Rx is ordered before Ry, it can happen. Let's introduce another barrier, and this time it's the store barrier. Now we have our stores in order, and we have our loads in order.
What do you think - can we observe the value as 0 and the flag as 1? No. That is correct. If you run the litmus test on your Arm hardware, it'll show you the generated instructions right here, it'll show you what the test was, and it found three states - and it did not find our condition. It says there were 0 positives, lots of negatives calculated, and it can never happen. In the prior cases, this would say "sometimes," and the answer there would be yes. So it calculates all the possible cases in which the condition can exist, and in this particular case, it cannot.
But my barriers are very strict - I have barriers everywhere. I pretty much have store ordering and load ordering, and barriers are expensive. So the Arm architecture, and also Power, have these kinds of semantics, which are implicit barriers, called acquire-release semantics. They're also called one-way barriers. When you talk to people, they'll say, "Let's try to replace it with one-way barriers." Makes sense.
There are two types of one-way barriers. One is called the load acquire: basically, all loads and stores that come after a load acquire must be observed after it. The load acquire and anything after it will be observed in the order it occurred; loads and stores are not able to pass the load acquire barrier. Similarly, there is the store release: any loads and stores preceding the store release must be observed before it. So once the store release is in place, any loads and stores above it will be observed in that order. There's also the concept of the shareability domain, which I'm not going to go into much right now, but when you see some instructions you will see that they are either inner or outer - just FYI. If you're interested, you can research it by yourself at a later time.
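At the Java level, acquire/release ordering can be expressed directly with VarHandle getAcquire/setRelease (Java 9 and later). This is a hypothetical illustration, not code from the talk; on Armv8, these accesses can be implemented with LDAR and STLR:

```java
// Hypothetical illustration of acquire/release semantics in Java. On Armv8, the
// release store can be implemented with STLR and the acquire load with LDAR.
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;

class AcquireRelease {
    static final VarHandle FLAG;
    static int value;
    static int flag;

    static {
        try {
            FLAG = MethodHandles.lookup()
                    .findStaticVarHandle(AcquireRelease.class, "flag", int.class);
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    static void writer() {
        value = 42;            // plain store
        FLAG.setRelease(1);    // release store: earlier accesses cannot sink below it
    }

    static int reader() {
        if ((int) FLAG.getAcquire() == 1) {  // acquire load: later accesses cannot hoist above it
            return value;
        }
        return -1;
    }
}
```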
Performance Analysis and Measurement Using the Java Microbenchmark Harness (JMH)
Now, with this knowledge, I'm trying to create micro-benchmarks to see how much improvement we get - that's what a performance-minded person would probably do. I go back to Doug Lea's cookbook, and I use the first rule for volatile stores. Basically, what it says is: any load or store, be it normal or volatile, followed by a volatile store, cannot be reordered. So if the first operation is a normal or volatile load or store, and the second operation is a volatile store, you cannot reorder them.
Now we go to a litmus test, and basically what it means is that we have the DMB system barrier for the store, and then we have the DMB load barrier. Of course, the litmus test tells us, "No, this condition will never exist, and this is perfect," but it is also very expensive. Can we replace the barriers with store release semantics? The litmus test says, "Yes, we can, as long as we have the DMB load right there." So we can have a store release semantic.
Going to rule two for volatile stores - this is very interesting. If you have a volatile store as your first operation, then a following normal load or store can be reordered, but a volatile load or store cannot be. So: a volatile store followed by any normal load or store can be reordered; a volatile store followed by any volatile load or store cannot be reordered.
So, I wrote JMH test code for that. You have a volatile and a normal int field right there; you assign the value 32 to the volatile int, and the normal one just reads the volatile variable. Then I designed a litmus test around the same thing. And I'm just putting an STLR right now - I haven't put any barriers on the right-hand side, on P1, just to demonstrate that STLR alone does not guarantee that a subsequent volatile load or store will not be reordered. What's going on here is that the tool is identifying the various cases in which it can be reordered. So now we have to enforce some other barrier. We did the store release semantic; either we put another barrier here, or we put in an LDAR, which is a load acquire.
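A hypothetical reconstruction of the kind of JMH benchmark being described (field and method names are mine): a volatile store followed by a read of the same volatile field into a plain field, which is what produces the STLR/LDAR - or DMB - pairs in the profile:

```java
// Hypothetical JMH sketch of the rule-two benchmark described in the talk.
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.State;

@State(Scope.Thread)
public class VolatileStoreBenchmark {
    volatile int volatileField;
    int plainField;

    @Benchmark
    public void volatileStoreThenVolatileLoad() {
        volatileField = 32;          // volatile store (putfield): STLR, or a DMB without acquire/release
        plainField = volatileField;  // volatile load (getfield): LDAR, or a DMB LD
    }
}
```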
These are called pairs. When you have a store release, you will need an LDAR at some point, or you will need a DMB, of course. So to avoid the DMB, you put an LDAR on the P1 side. And now you see that, no, it can never happen - the condition cannot exist. So this is our JMH. This is profiling output from JMH. On your left, it shows the load acquire/store release pair, which is the STLR and LDAR you see, for a putfield and a getfield. Then on the right you see the data memory barriers, and here's what I was talking about: this is the inner shareability domain, so it's a DMB load, because it's a getfield, and there's an expensive barrier - a store barrier - for the putfield. Load acquire/store release, for my micro-benchmarks, was 36% faster. And this is on a max-SMT system, so as your processing threads increase, barriers get more and more expensive.
This is the last JMM rule that I'll cover. This one is simple, and it talks about the volatile load: a volatile load followed by any normal or volatile load or store cannot be reordered. For that, we'll first just put the same barriers - DMB system and DMB system, right here. Of course, the condition can never exist. But what if we try to replace that with an LDAR right there? Because it's a volatile load, it should only require the load to be ordered. We use the load acquire semantic over there, and the condition can never happen. So just an LDAR did the trick, because it was a volatile load, not a store - we don't need the load acquire/store release pair.
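As a rough sketch of this last rule, the following hypothetical method (it would live in the VolatileStoreBenchmark class sketched earlier) does a volatile load followed by a plain store; only acquire ordering is needed on the load:

```java
// Hypothetical addition to the VolatileStoreBenchmark sketch above: a volatile load
// followed by a plain store. Only acquire ordering is required, so the JIT can use a
// single LDAR on Armv8 instead of a full DMB.
@Benchmark
public int volatileLoadThenPlainStore() {
    int v = volatileField;   // volatile load (getfield): LDAR
    plainField = v + 1;      // plain store; must not be reordered above the load
    return plainField;
}
```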
Performance Study on Scaling CPU Cores and Simultaneous Multithreading (SMT)
Now, applying this to the scaling studies. The next chart is going to be a little crowded. I'm baselining it, so I'm not showing the actual numbers - this is a partner system, so the partners can definitely go and brag about their own systems, but I'd rather not share numbers here. What I did was baseline it to 1, so an improvement will be greater than 1, and anything worse will be less than 1. So bigger is better.
What we see here is that there are fences and barriers - the "with DMB" case, which is the mustard-yellow line - and there is "without LSE." One of the things that I didn't cover, which you can do with the litmus tests and with your JMM micro-benchmarks, is look at the exclusives. These are called load-link/store-conditional pairs. Some architectures may not have compare-and-swap - the read-modify-write locking guarantee that x86 provides, and has provided for a long time - but they do have load exclusives and store exclusives. The chart is showing you the advantage of using a CAS or an LDADD (which is like a LOCK ADD, and is used for atomic counters).
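As a hypothetical sketch of the kind of atomic counter this compares (not code from the talk): AtomicInteger.getAndAdd can compile to a single LDADD when LSE atomics are available, and to a load-exclusive/store-exclusive retry loop when they are not:

```java
// Hypothetical sketch of an atomic-counter micro-benchmark. On Armv8.1+ with LSE,
// getAndAdd can compile to a single LDADD; without LSE it becomes an LDXR/STXR
// (load-exclusive/store-exclusive) retry loop, which scales worse under contention.
import java.util.concurrent.atomic.AtomicInteger;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.State;

@State(Scope.Benchmark)
public class AtomicCounterBenchmark {
    AtomicInteger counter = new AtomicInteger();

    @Benchmark
    public int increment() {
        return counter.getAndAdd(1);   // contended read-modify-write
    }
}
```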
What you see here is that as your core count or SMT thread count increases, without LSE - which means without the atomics - it gets better, and with DMB - which means we're not using the load acquire/store release semantics - it gets better, too. But with DMB, we're still worse than the default, which is what's in OpenJDK right now: OpenJDK does load acquire/store release by default, and it also uses the atomics by default. As your core count increases - and this goes up to a maximum of 112 or so threads, so it's a lot of threads - the difference increases. Looking at this pattern, we saw that even at a core count of 1, we have a lot of expense with DMB. So what can we do? DMBs are necessary; not all DMBs can be replaced with load acquire/store release semantics.
This is what I was talking about during the performance panel. If you look at the Java memory model, it says that if you have a single core, you're required to have sequential consistency, because it doesn't expect you to locally reorder or do any of those things. But Arm is a weakly ordered memory model, so it does do those kinds of things. So we do have barriers, and we have to have barriers if we want the sequential consistency guarantee.
The new release that happened is called Ares, which is also called Neoverse; basically, what we need to know is that it's Armv8.2-A. So I went to measure the hardware improvements, and I wrote some Java micro-benchmarks. I measured object allocations and object array initialization - which is also something that's required by the Java memory model; you need to have the initial values - and smarter issuing and production of software barriers. So if a barrier is needed, whenever it's needed, it will be issued, but the time spent in the DMB is not going to be that expensive - that's what it means. And copy chars and all those things haven't been improved. Atomics have improved as well.
Here is JMH comparing an older Armv8.0 part to Neoverse, which is v8.2. What we see is that we spend about 3.5 cycles on a DMB on a Cortex-A72, but on Neoverse it's almost 0. This is the benchmark right here: what it's doing is allocating and storing 0s, basically, and some of the DMBs can be avoided, because you don't need back-to-back DMBs. Then there are a couple of other things. So, hardware improvements and software improvements - something that we're very proud of, and I'm going to brag about it here. Remember, we're just comparing with our own older architecture, and what we're saying is that Neoverse has improved over the Cortex-A72 by 1.7x, if you measure that software.
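A hypothetical sketch of the kind of allocation benchmark being described - allocate an array and let the VM store the required zeros, which is where the initialization ordering (and its DMBs) shows up:

```java
// Hypothetical JMH sketch: object allocation plus zero initialization. The barriers
// required for initialization are where the DMB cost appears, and back-to-back DMBs
// can sometimes be elided.
import org.openjdk.jmh.annotations.Benchmark;

public class AllocationBenchmark {
    @Benchmark
    public int[] allocateAndZero() {
        return new int[1024];   // allocate and store zeros
    }
}
```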
So it's a big deal. People are talking about JDK8 and JDK11 - when should we move, why should we move? One of the reasons to move is that JDK11 improves performance over JDK8 by 14%. That's a big benefit as well. That's all I have, and here are the resources. If anybody wants to see the demos, I have those as well, but I'll take questions first.
Questions & Answers
Participant 1: The litmus tests that you run - do they run on actual hardware?
Beckwith: Yes.
Participant 1: So how do you know when you've run enough iterations to uncover a concurrency bug? Because I've run things millions of times and they haven't exhibited this problem.
Beckwith: It will run all the combinations that it has. You run on the actual hardware and the litmus test will run. There are three tools. Let's see, here we go. Let me just open this up anyway by myself, so we don't have to click on that. What you see here are some models, and then we have some litmus tests as well. We could use litmus tool itself or the herd tool and run it on an x86 hardware right here.
So we'll do the store buffer example first. It will tell you all the combinations - how many combinations it tried. It had 136 positives, and then 999,864 negatives. These are the sequences of tests it tries, and it tells you how many cases you have. For each of those cases, it'll generate the positive event structure, so that you can correct your event structure if you wanted to, depending on what kind of ordering you want. It also generates the sequence, in case you want to use that sequence for your compiler-generated code.
Now we'll try the other one, which was the mfence example. It tells you that it can never happen. Right here, it had that many negatives, and it can never happen, because we put in those mfences. You can see the generated code and how it generated the mfence, so it actually runs those things on the actual hardware as well. And then you can replace the mfence. So that's the original litmus test - let me show you the litmus test as well. These are some semantics we specify: we say mfence - memory fence - and the write and read orderings, and then the reads-from relations. These are semantics that you can define, and you can have a whole array of semantics for Arm as well.
These are the conditions for my program and my architecture, and I want to know if my condition can exist given those constraints. So you're defining your architectural model, and you're describing the constraints that it provides, and you can go weaker or stronger. Then you give your test conditions, and you can have multiple threads as well, in case you have multiple hardware threads. So you can have 10 processing threads, or more.
Participant 2: Yes. I'm just wondering about the similarities between this tool and the concurrency stress framework, jcstress. If I understand correctly, does this emulate the models and then tell you what's going to happen, so it's not exactly the same?
Beckwith: It can do that as well - the herd tool can. Going back to this here: we have the herd tool that takes the cat model and the litmus test. The cat model is the model of the architecture that you want to run on, and you can define this model - remember the mfence and the read/write orderings - so those are the model definitions that you can have. Then you have the litmus tool that actually runs the test on the hardware, which is what I showed just now on my x86 system. And then you have the generator, which takes the configuration file along with your cat model file and will generate litmus tests as well. So there are two different execution tools, and one DIY generation tool.
Participant 2: And are these tools openly available?
Beckwith: Yes.
Participant 2: Great. Thank you.
Participant 3: I just wanted to ask as an application developer who writes applications with Spring Framework that takes 30 seconds to boot up and has reflection and annotation and everything, do I have to worry about throwing an extra volatile around when I write my program?
Beckwith: Do you have to worry?
Participant 3: Yes. Because I'm not really writing that kind of program - I'm programming for commodity hardware, and I don't really expect to squeeze the last bit of raw performance out of it - so, yes.
Beckwith: You have to worry about your own code. I mean, do you want program order guarantee?
Participant 3: Yes, absolutely.
Beckwith: And you want your code to be concurrency aware, right?
Participant 3: Yes, sure.
Beckwith: Then that is definitely your problem. When you declare volatility - whatever that means to you: the volatile or synchronized keyword, or whatever, depending on what kind of ordering guarantees you want - absolutely, that's something that you should worry about. The Java memory model is for compiler writers; then this becomes something for the hardware people. We know that we are weakly ordered. We know that we have to issue DMBs. DMBs are expensive, so can we do something else? Can we still maintain the program order that the programmer wants, but give it load acquire/store release semantics?
This is what the talk was about, and also introducing the tools and so on. But in your case, of course, if you want a program order guarantee, then you have to declare that, because I wouldn't know otherwise. If you don't have volatile, the compiler writer could reorder it; we have out-of-order execution, so we could reorder it. All those kinds of guarantees go out. That's why you need DMB barriers in the first place.
Participant 3: Yes. My question was more that I can synchronize programs in a number of ways. I can ensure a memory barrier by using a volatile somewhere, or by establishing a [inaudible 00:46:08] before relationship - that also might serve my purpose. So my question would be: should I really start weighing those options somehow? Or can I just go with whatever is most convenient - just throw a volatile on a write somewhere, and then I know the memory visibility problem has gone away?
Beckwith: You should use the least obstructive option that you have. The guarantee is upon the hardware that it executes on - which is us - and, of course, the language that you use. For example, if you're using, say, C++, that's really weak - it's weaker than Java - so it depends. The compiler writer has to worry that what you mean is actually what they're generating code for, and then we have to worry that what the generated code is asking for is actually what it's getting. So there are different levels of translating what you need into what the hardware actually gives you - because if you want a program order guarantee with volatile, then we should provide that.