Alex's full question: Hi. I'm here at QCon London 2016 with Martin Thompson of Real Logic. Martin is well known in the high performance community for his low level understanding of everything from processor caches right up through to network protocols. Martin, let's talk a little bit about what you have been working on recently with Aeron, your high throughput messaging system.
Yes. So for the last couple of years I have been doing work for a major client in Chicago. I'm helping them with their communications stack, which means we do a lot of work between machines and also within the same machine itself, so IPC and cross-machine. I'm trying to get it all to work in a way that has the same API, very low latency and very high throughput.
Yes, so a lot of the recent releases have been focusing more on usability, monitoring and debugging, that sort of stuff, as people try to put it into their stacks. One good example is that we added the ability to capture exceptions. Many times when systems go wrong, people end up with huge log files of exceptions, but they tend to be the same exception repeated over and over again. So one of the things we have added is that we only log distinct exceptions, and we store the full stack trace, the time of first observation, the time of last observation and the number of observations. That way we don't just run out of disk space because something has gone wrong in the system.
Alex: You still get an idea of what's going wrong in the system and how bad things are, but without polluting the log space.
Yes. With many things, people get carried away with logging; they just log everything. Often what we want are events, and more often than not it's important to find out how many observations of those events have happened rather than the exact detail of each one. We need to summarize. If you have seen the same exception, for example because the network goes down for some reason, do you want to keep logging every microsecond the fact that the network is down? If you are seeing it that often, you probably just want to increment a counter. You know the problem exists, you know the full stack trace where it happened, but from there you just increment a counter rather than fill up your disk space.
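As a rough illustration of that pattern (not Aeron's actual implementation), a distinct-error log can be sketched in a few lines of Java: key each exception by its stack trace, log the full trace only on first sight, and from then on just update the timestamps and a counter.

```java
import java.io.PrintWriter;
import java.io.StringWriter;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

// Minimal sketch of the idea described above, not Aeron's actual implementation:
// log each distinct exception once (keyed by its stack trace) and from then on
// only track the first/last observation time and a count.
public final class DistinctErrorLog
{
    static final class Entry
    {
        final long firstObservation = System.currentTimeMillis();
        volatile long lastObservation = firstObservation;
        final LongAdder observationCount = new LongAdder();
    }

    private final Map<String, Entry> entries = new ConcurrentHashMap<>();

    public void record(final Throwable t)
    {
        final StringWriter sw = new StringWriter();
        t.printStackTrace(new PrintWriter(sw));
        final String stackTrace = sw.toString();

        final Entry entry = entries.computeIfAbsent(stackTrace, key ->
        {
            System.err.println("first observation of: " + key); // full trace logged once
            return new Entry();
        });

        entry.lastObservation = System.currentTimeMillis();
        entry.observationCount.increment();
    }
}
```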
3. [...]What kind of things does Aeron provide to make it really easy to not generate garbage?
Alex's full question: And of course one of the things that Aeron does is really focus on low latency, and to do low latency in Java you really need to focus on garbage, or rather on not producing it. What kind of things does Aeron provide to make it really easy to not generate garbage?
Well, one of the things about Aeron's design is that it has a driver and client model, and the driver is the thing that interfaces with the media layer itself, whether that is shared memory, or down to the network with a UDP stack or InfiniBand or whatever you happen to be using. We can run that driver out of process. By running it out of process, if your main application takes a big GC event, the driver can still run. It can still answer questions, such as retransmitting data to other clients that are using it if data goes missing on the network, and you don't end up being held up on heartbeats and things like that.
Typically, a lot of clusters go wrong when GC events occur; that's a very common problem. I have seen clusters of large machines where one machine takes a large GC event, the others think it's dead and remove it from the cluster, and then the rebalancing of the cluster causes a catastrophic failure as one problem cascades into the next. By moving those things out of process, they are protected from that. The core of Aeron itself doesn't generate any garbage in steady state, so we don't end up needing GC events. You can put the driver back into process if you want to, that's fine, but you get the option: you can run it out of process or in process and communicate with it.
When it comes to the client interaction with the actual API itself, a lot of the API is designed to be very low garbage and also zero copy at times. Quite often with other stacks, you have to allocate an object or a buffer and pass that down, and it has to be garbage collected. With Aeron, you claim space in the underlying buffer, you write your data into that, and it gets transmitted; it doesn't get copied again. It's similar when data arrives: it gets appended to a buffer, those buffers are used in a circular fashion, and you read from it; only when you're done with it is it released. So you effectively have a lease on that memory rather than allocating memory for everything that happens.
The buffer is allocated as a memory-mapped file, and that's used by the client that's talking to the driver but also by the threads that write things to the network stack itself. That buffer is designed to work in an append-only manner, so that it can be reused over and over again. This is important when it comes to loss in the network: you need to retransmit the data, so the data that gets written is also kept around for a period of time in case you need to retransmit it.
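The zero-copy publication path described here can be sketched with Aeron's tryClaim API. This is only a minimal example: it assumes a media driver is running, and the channel URI, stream id and package names are illustrative.

```java
import io.aeron.Aeron;
import io.aeron.Publication;
import io.aeron.logbuffer.BufferClaim;
import org.agrona.MutableDirectBuffer;

// A sketch of the zero-copy claim pattern: claim space in the shared log
// buffer, write in place, then commit. Assumes a media driver is already
// running; channel and stream id below are made up for illustration.
public final class ZeroCopyPublish
{
    public static void main(final String[] args)
    {
        try (Aeron aeron = Aeron.connect();
             Publication publication =
                 aeron.addPublication("aeron:udp?endpoint=localhost:40123", 1001))
        {
            final BufferClaim bufferClaim = new BufferClaim();

            // Claim space directly in the underlying buffer: no intermediate copy.
            if (publication.tryClaim(Long.BYTES, bufferClaim) > 0)
            {
                try
                {
                    final MutableDirectBuffer buffer = bufferClaim.buffer();
                    buffer.putLong(bufferClaim.offset(), System.nanoTime()); // write in place
                }
                finally
                {
                    bufferClaim.commit(); // make the message visible for transmission
                }
            }
        }
    }
}
```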
Alex: So you have kind of like a double-ended queue where you put messages in. They go into the networking stack across the shared memory boundary so that you only have one copy of the data. And then, as the networking stack receives acknowledgements from the remote end, the pointer gets bumped up to free space for future messages that come in.
Kind of. We don't actually acknowledge data. Acknowledgement systems tend to limit throughput by the way they work, so we take a different approach. We send status messages which report the available buffer at any given time, and that allows us to transmit data at the maximum possible rate. Many systems get restricted by acknowledgement practices. Acknowledgements are really good on networks that have a very high degree of loss. On networks with less loss, you don't want to be acknowledging every message. You want to have a view of how much buffer is available, but also a means of detecting loss and re-requesting data if you need to. So it's effectively flow control, but with a NAK policy on top of that.
That's the NAK protocol within Aeron itself. Aeron has a wire-level protocol for working over UDP between different machines. It will detect loss and can NAK as a means of recovering that data.
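As a rough conceptual sketch of the idea, and not Aeron's actual wire protocol: the receiver periodically advertises how much buffer it has available, and instead of acknowledging every message it NAKs only the ranges it detects as missing.

```java
// Conceptual sketch only, not Aeron's wire protocol: flow control via periodic
// status messages plus NAK of detected gaps, rather than per-message ACKs.
final class ReceiverFlowControl
{
    private static final long RECEIVER_WINDOW = 128 * 1024;

    private long highestContiguousSeq = -1;   // everything up to here has arrived

    /** Sent on a timer rather than per message: tells the sender how far it may transmit. */
    long[] statusMessage()
    {
        return new long[] { highestContiguousSeq + 1, RECEIVER_WINDOW };
    }

    /** Returns the missing range to NAK, or null if the frame arrived in order. */
    long[] onDataFrame(final long seq)
    {
        if (seq > highestContiguousSeq + 1)
        {
            return new long[] { highestContiguousSeq + 1, seq - 1 };  // re-request only the gap
        }

        highestContiguousSeq = Math.max(highestContiguousSeq, seq);
        return null;
    }
}
```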
Alex: I'm presuming then you hold onto that data just in case it's going to be needed in the future.
Yes.
Alex's full question: So the out-of-process driver that does the direct talking to the network cards -- presumably this is the thing that keeps running all the time. This is the thing that must stay up and be zero garbage. How do you achieve that as a Java programmer, or does that drop down into the C layer and use C primitives?
It's all done in Java. It's a case of using patterns whereby memory can be reused, so we tend to allocate things in circular buffers and use them over and over again. It's different from idiomatic Java in many ways; it's the sort of technique you would use if you were writing a C program, but we do those sorts of things in Java. It's not what you do for mainstream programming; it's more like systems programming.
You apply the patterns, you run tests and you run a profiler. The profiler will tell you if you have allocation, and you remove the sites of those allocations. Every now and again people slip up, we make a mistake and we allocate some data. Sometimes we allocate data surprisingly: especially in the early days of using lambdas, we found that lambdas we thought were non-capturing were actually capturing in some cases, and so were allocating. You have to find those and remove them.
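A minimal sketch of the kind of reuse pattern being described (illustrative only, not Aeron code): preallocate a fixed ring of mutable events up front and recycle them, so nothing is allocated per message in steady state.

```java
// Illustrative reuse pattern: all allocation happens at construction time,
// then the same objects are recycled, so steady state produces no garbage.
final class EventRing
{
    static final class Event
    {
        long timestamp;
        int length;
        final byte[] payload = new byte[512];
    }

    private final Event[] events;
    private long counter;

    EventRing(final int capacity)   // capacity must be a power of two
    {
        events = new Event[capacity];
        for (int i = 0; i < events.length; i++)
        {
            events[i] = new Event();              // allocate everything up front
        }
    }

    Event claimNext()
    {
        // mask instead of modulo, relying on the power-of-two capacity
        return events[(int)(counter++ & (events.length - 1))];
    }
}
```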
Alex: Presumably there aren't really static checking tools that allow you to prove it doesn't allocate; at the moment it's only testing.
Yes.
Alex's full question: You mentioned lambdas there. Lambdas have obviously been out since Java 8 came out, which is about a year old at this point I think. What's your experience of using lambdas in writing Java code? Does it help? Does it hinder? Performance-wise, are they inlined appropriately by the JIT? What's your view?
It's funny that you say that; it's probably two years now since it came out, even more than just one year. We have had Java 8 for quite some time now – I've been using it from before the major release – and it's nice. It actually improves our patterns in many ways, so we can apply functional techniques. Often you want to apply a function as an object in its own right, and that's a useful thing. It can clean up code in many ways, and if it's not in the high performance path it doesn't necessarily have to be super performant. In most applications, 80% of your code doesn't need to be critically fast.
In fact, if it's just reasonably efficient, that's good enough. Usually only a small amount of code needs to be very fast for most things to work. Even in that code we have lambdas used occasionally; we just have to be careful with them. The main issue is being careful that they are not allocating: if they capture any surrounding state, they need to allocate in order to capture that state, but as long as they just apply directly to the data that has been passed to them, they are fine.
Alex's full question: We are talking about capturing state. We obviously used to have anonymous inner classes that weren't static, that would capture this and therefore potentially generate garbage as a result. With lambdas, when you're talking about capturing, are you saying it's okay if we capture final variables from the local scope, but you don't want to capture any of the instance fields of the class in case you leak references there?
If you capture from the local scope they will be allocating, because that only exists within the scope of a method. The interesting one is actually capturing a field. It's very easy to write code that captures a field and captures it on every invocation. But if, for example, you allocate that lambda in advance, it captures the field once; you store that lambda with its field reference and use it over and over again, and then it is no longer capturing. It captures once and you can use it as many times as you want.
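A small illustrative example of that distinction: a lambda stored in a field captures `this` once when it is created and can be reused without further allocation, whereas a lambda that captures locals on each call may allocate every time.

```java
import java.util.function.LongConsumer;

// Illustration of the capturing distinction described above.
final class CaptureExample
{
    private long total;

    // Allocated once: this lambda captures 'this' a single time when the field
    // is initialised, and is then reused on every invocation with no further
    // allocation.
    private final LongConsumer accumulate = (value) -> total += value;

    void onMessageReusing(final long value)
    {
        accumulate.accept(value);                 // no allocation per call
    }

    void onMessageCapturing(final long value, final long offset)
    {
        // This lambda captures the locals 'value' and 'offset' on every call,
        // so each call may allocate a new capturing instance.
        final Runnable task = () -> total += value + offset;
        task.run();
    }
}
```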
Alex: And so presumably this is why you say that once you get to steady state, there is no garbage collection within Aeron.
Yes.
Just like anything, it's a question of what the lambda itself is referencing. If it's referencing something in the outer scope that it can't prove won't change on each invocation, it has to capture it. So it's just a case of watching out for that and making sure you are not being caught out, and profilers help you out a lot.
Alex: Have you seen in profilers that lambdas get inlined perhaps more than ordinary Java methods would? I know one of the main benefits of invokedynamic was the ability to teach the JVM runtime, and the JIT compiler specifically, to do further optimizations.
Generally they can be optimized quite well, and you will see that in most code, but one of the interesting things with lambdas is that they can be optimized really well the majority of the time and then occasionally they are not, and that makes them unpredictable. So we have to profile to see whether that has been the case or not; it's about being careful with it. There are limits that kick in at times and we can lose out. There are limits on stack depth, limits on the number of lambdas at a call site and on how much is captured at a call site, and it also depends on what else is going on around the code at that call site once it has been inlined, so you can blow a number of those limits quite easily.
Mission Control and Flight Recorder I'm finding are among the best for doing that. They are very good at capturing profiling information around allocation.
Alex: And those presumably you use on your desktop with the JDK. I understand Mission Control is a commercial feature for non-developer workstations.
Yes, you have to unlock the commercial features, but it is free to use for development, not in production.
12. [...]Do you take advantage of those tricks in Aeron?
Alex's full question: In terms of dealing with the lack of garbage in processes, what other techniques are there that you use to get the most performance out of Java programs? I'm thinking of things like understanding what the Java object layout looks like, avoiding cache line sharing and so on. Do you take advantage of those tricks in Aeron?
We have to at times. False sharing can become a major issue and has been the source of a number of the performance issues we have had to hunt down. To deal with it, it would be really nice if the [sun.misc] @Contended annotation was available as an end-user feature, but it's not: it's only applied within the core JDK classes unless you provide a boot parameter [-XX:-RestrictContended], which most people won't have done. So what we do is extend classes and use the inheritance hierarchy as a means of getting the right sort of padding around the fields, and at times we have had to do that to get predictable performance.
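The inheritance-padding trick looks roughly like the sketch below; it is the same idea used by Agrona and the Disruptor, simplified here for illustration. Superclass and subclass fields pad either side of the hot field so that it gets a cache line to itself and two counters updated by different threads do not falsely share one.

```java
// Simplified sketch of padding via the inheritance hierarchy: HotSpot lays out
// superclass fields before subclass fields, so the padding longs end up on
// either side of the contended 'value' field.
class LhsPadding
{
    protected long p1, p2, p3, p4, p5, p6, p7;    // padding before the hot field
}

class Value extends LhsPadding
{
    protected volatile long value;                // the hot, contended field
}

public class PaddedCounter extends Value
{
    protected long p9, p10, p11, p12, p13, p14, p15;  // padding after the hot field

    public void set(final long v)
    {
        value = v;
    }

    public long get()
    {
        return value;
    }
}
```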
13. [...]Can you tell us a little bit more about how that works?
Alex's full question: Now we are talking about being able to write data into your network buffers. You have a binary encoding, as I understand it, as part of your messages. Can you tell us a little bit more about how that works?
Within Aeron, the protocol itself has a binary encoding for the different headers, and we write those fields directly. Everything is kept in memory-mapped files, so that becomes a mapped byte buffer: a direct byte buffer that's not within the Java heap, it's off the Java heap, mapped from the file itself. We don't deal with those directly using the ByteBuffer API; we wrap them with a buffer type that we have within Agrona. Not only does that let us deal with the memory directly, it also lets us deal with it with memory ordering semantics. For example, sometimes you need to CAS on memory, or XADD, or write in a memory-ordered fashion. You need to use those sorts of things.
For that, underneath we use Unsafe. Unsafe gives us the ability to work with that memory and apply the memory ordering semantics that are required. Once we get to Java 9 we should hopefully have a better alternative: VarHandles are going to allow us to do very similar things so we can stop using Unsafe. It's been a missing feature for some time. I know there have been a lot of complaints that people use Unsafe; well, the reason they do is because we have missing features. VarHandles will hopefully address a lot of that. There are other things in Unsafe as well that need to be addressed, so we'll have alternative APIs that are safer for doing this.
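A hedged sketch of that approach using Agrona's UnsafeBuffer (current package and method names; the file name, size and offsets here are illustrative only):

```java
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

import org.agrona.concurrent.UnsafeBuffer;

// Wrap a memory-mapped file in Agrona's UnsafeBuffer and use its
// memory-ordering operations instead of the plain ByteBuffer API.
public final class MappedCounters
{
    private static final int TAIL_COUNTER_OFFSET = 0;   // illustrative layout

    public static void main(final String[] args) throws Exception
    {
        try (RandomAccessFile file = new RandomAccessFile("counters.dat", "rw");
             FileChannel channel = file.getChannel())
        {
            final MappedByteBuffer mapped =
                channel.map(FileChannel.MapMode.READ_WRITE, 0, 4096);
            final UnsafeBuffer buffer = new UnsafeBuffer(mapped);

            // ordered (store-release) write: visible to readers without a full fence
            buffer.putLongOrdered(TAIL_COUNTER_OFFSET, 1L);

            // atomic fetch-and-add (XADD on x86) for multi-producer claims
            final long previous = buffer.getAndAddLong(TAIL_COUNTER_OFFSET, 64);

            // CAS when a conditional update is required
            buffer.compareAndSetLong(TAIL_COUNTER_OFFSET, previous + 64, previous + 128);
        }
    }
}
```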
Alex's full question: You mentioned VarHandles there. How are VarHandles going to allow you to deal with writing out to memory mapped buffers, or is that just the case for reading the contents of an object?
That would be for the memory-mapped buffers themselves. We don't read the contents of objects with any sort of unclean methods; it's about mapping the data down to a binary format for putting it on the wire. We will be able to do that with Java 9: ByteBuffers are going to be extended to have methods for writing directly into them with the right memory ordering semantics. Proposals for that have just been going forward recently.
Alex: That will be exciting when it comes up.
It will indeed. It'd be great to have that.
Yes, so where that happens we tend to follow the single writer principle as much as possible, so we don't need to use CAS; we just need to write counters in an ordered fashion. Occasionally, where we have multiple producers racing for something, we deal with that by using XADD, and we believe we have a reasonably unique algorithm for doing that which doesn't have the spinning retry loop you get with CAS. That gives us a really nice latency profile, especially under contention, and we see that a lot in real world applications: we get microbursts where multiple producers end up racing for the same data, and what we are seeing with the Aeron data structures is that we have the most predictable latency under contention of any algorithm we have measured.
Yes, so quite often you want to claim something – a lot of this goes back to Lamport's original work on the bakery algorithm, and it comes from the deli counter ticket idea. Who is going to take a turn? You claim your ticket, you take that turn and you complete your work. To claim that ticket, if you use something like CAS, you have to read the current value of a number, increment it and write it back conditionally. If two threads, or N threads, go to do that at the same time, one will succeed and all the others will fail. The ones that fail have to re-read the value, do the increment again and conditionally write it back. You get into that spinning retry loop and that causes queues to form. So queueing theory kicks in at that point, and the cache coherence traffic has to work to disseminate that information between the different threads.
With XADD, you send a single instruction to the hardware to increment the counter and give you back the value it had before the increment. So there is no retry loop. It all happens in hardware, which avoids a lot of the queueing effects that happen with spinning retry loops, because we have taken the possibility of failure out by doing it in the hardware directly. Java 8 gave us the ability to get at the XADD instruction on x86.
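A small example of the contrast using AtomicLong: the CAS version spins under contention, while getAndIncrement compiles down to a single LOCK XADD on x86 via the getAndAdd intrinsic that arrived with Java 8.

```java
import java.util.concurrent.atomic.AtomicLong;

// Illustration of the difference described above: ticket claiming with a CAS
// retry loop versus a single fetch-and-add.
public final class TicketClaim
{
    private final AtomicLong nextTicket = new AtomicLong();

    // Claim a ticket with a compare-and-swap retry loop: under contention,
    // losing threads must re-read and retry, and queueing effects build up.
    public long claimWithCas()
    {
        long current;
        do
        {
            current = nextTicket.get();
        }
        while (!nextTicket.compareAndSet(current, current + 1));

        return current;
    }

    // Claim a ticket with fetch-and-add: one instruction, no failure path.
    public long claimWithXadd()
    {
        return nextTicket.getAndIncrement();
    }
}
```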
Alex: Presumably those problems only get worse the more cores you throw at it, because the cache coherency protocols kick in and then the threads start competing over which core has got the cache line and so on.
Yes. I mean, it's not just increasing core counts; the increasing core counts actually change the coherence demands. We have multiple sockets in our servers, and now, with the latest Haswell parts, we in fact have multiple rings on the same socket, and we have switches within our sockets transferring data between the different rings. These rings work like token ring networks. This is like going back to stuff from 20 or 30 years ago; it is now happening within the same machine.
If we think we are not in the world of distributed computing even when we are on a single machine, think again. Everything is distributed computing these days and the same laws apply. We have to think about Amdahl's Law, the Universal Scalability Law, and keeping this data coherent; otherwise we just have a serialization point in our algorithms and then we can't make use of all these cores to scale up. We have to think about how to minimize that contention.
17. [...]Do you take advantage of any of that?
Alex's full question: Another way of dealing with contention, I guess, is to try to pin processes or threads to certain cores, effectively preventing the operating system from shuffling these things around. I guess with Aeron, if you have an out-of-process driver, a message server that is writing to memory, you can pin that to some cores and run the applications on other cores to try to reduce the collisions between them. Do you take advantage of any of that?
Yes. Some of our clients are doing things where, say, they are running on a two-socket server where one of the sockets is directly connected to the network card. You can run the driver on that socket, so it's writing to the network card directly without having to hop across the QPI bus on the server. The QPI bus is then used as a high performance network between the driver and the clients, and maybe those clients are even communicating via IPC across it. So you have to look at the machine as a distributed system now.
The QPI links are a very fast network, and if you start looking at them as the means of sharing – going back to the CSP mantra of "don't share memory to communicate – communicate to share memory" – it's a much better way of designing these systems and getting them to scale up. You get much less coupling, a much cleaner design, and it suits the hardware.
Alex: I guess the only way you can manage that at the moment is with processes, because Java doesn't have the means to pin these threads to those cores or this thread to that core.
Not within Java itself, but it's possible to do. For example, you can have a JNI library that lets you set thread affinity on the appropriate operating system, or you can separate the Java threads out into different control groups. That's a very common thing people do: they start some threads in one control group and some threads in another control group, and then manage the resources on the machine that way. In high frequency trading that's a very common practice.
Alex: Presumably this requires, like you said, JNI libraries and also knowledge of the exact hardware that you are running on.
Yes.
Alex: I guess when you deal with cloud computing, things like Amazon and so on, all of that goes to the wind because you don't know if you are going to be on a contended core. You don't know if you are going to get cache line problems from other programs that are running on the same VM hypervisor and so on.
Yes. There's a very simple trade-off you're always making: do you deal with something completely generically, in which case you can't take any advantage of the specifics? I see it as exactly the same analogy as talking to a database: do you treat it as just a bit bucket that you are allowed to send some very generic SQL to, or do you actually acknowledge that I have Oracle, I have Postgres, I have DB2, I have whatever. If you have spent the money on those, wouldn't you like to get value for your money by actually using the native features of those databases?
I think we make this mistake very often: we try to be too generic, and if we try to be too generic we are then very inefficient; we naturally will be. Sometimes, if you are going to purchase something, it's worthwhile making use of it. If what you are purchasing is a utility like Amazon, that's fine, you have chosen to do that, but it comes with consequences.
We are going to keep seeing more cores per socket; that's just continuing. Looking at the next generation, we are going to see up to 28 cores on a single Xeon socket. Each of those has hyper-threading, and hyper-threading is getting more and more efficient all the time, so you are effectively getting almost two real cores per hardware core. When we went from the generation before Haswell to Haswell, we went from three ALUs per core to four, and the total number of execution ports in the core went up as well.
So we've actually got 50% more. These sorts of things keep happening: we keep getting more execution units and more parallel ports, so instruction level parallelism goes up. Hyper-threading gets more efficient because it has more of the different units it can use: different buffers, different execution units; all of that keeps increasing. What's fascinating is to track, say, the memory controllers. The memory controllers are getting more and more efficient, but the rate of their increase is different from the rate of the core increase. We are effectively getting about a 15% drop in memory bandwidth per core per generation at the moment.
So the cost of being lazy with memory is starting to come back. We are not getting extra clock cycles and we are losing memory bandwidth per core. Although we are getting more and more hardware, the dynamics are changing, and most people aren't really observing what these changes are. They are going to have an impact on our programming approaches and how we deal with them.
19. [...]But what about cache sizes?[...]
Alex's full question: Yes, I noticed that the growth in memory, certainly the speed at which you can get bits into the CPU in the first place, has really tailed off in comparison to the growth in CPU speed as well. But what about cache sizes? I think cache sizes aren't growing as fast as they used to on die, things like the level three cache. You can't change much of the laws of physics about getting data from one place to another, but they have all sort of capped out at about the same size per core.
Well, have you noticed that the chip manufacturers have actually declared an end to Moore's Law? Since 2004, going back 12 years, we have seen a cap in megahertz or gigahertz for CPUs. We are now starting to see a cap in the number of transistors within any given socket, because manufacturing is getting more and more difficult as things get smaller. So lots of things are starting to change, and the dynamics of 30-plus years of our industry are now starting to fundamentally shift. We are going to need different approaches. But coming back to caches: L1 hasn't changed in a very long time, L2 hasn't changed in a very long time, and L3 has been getting bigger, but that rate of growth has slowed down now as well; I think 45 megabytes is among the largest. What's more interesting, if you look at that, is how cores impact each other.
Intel caches are inclusive, so whatever is in the L1s and L2s of all the cores must also be in the L3; it's inclusive of the other cores. So you can have a single core that's greedy. What I mean by greedy is, say, it copies a lot of memory or does a lot of allocation; it will stream a lot of data through its core and end up wiping out the L1 and L2 caches of the other cores, because the cache is inclusive: as data flushes through the L3, lines must be evicted from the other L1s and L2s. So a single core can impact the others just by being greedy with allocation. We shouldn't overly worry about the allocations in our code, but we shouldn't be wasteful, because wasteful threads will impact the other threads. Getting carried away with lots of String allocation and String parsing, as we seem to do quite a lot, has an impact.
Alex's full question: Back in the day the Amigas had a blitter that was used off-CPU for shunting memory copies around the place. I guess the more recent, up-to-date version of that would be DMA, where drivers can write directly into memory without going through the CPU or the CPU buses. Do you think over the next few years we'll see more of the device drivers communicating directly with memory as a way of moving data around, rather than, as you say, busting the caches on the CPUs?
We are already seeing it. DMA has been a common thing for network cards for a long time. Even going back to Sandy Bridge, which is quite a few years ago, we got DDIO [Data Direct I/O], which rather than DMAing into memory, DMAs directly into the L3 cache, so it doesn't go through main memory. That was the only way we could get from 40 gigabit Ethernet up to 100 gigabit Ethernet: take main memory out of the loop completely.
So you go directly from the network adapter across the PCI Express bus to the L3 cache, in both directions, in and out; an area of the L3 cache is reserved for dealing with this directly. It's quite a common feature now. The next generation of Intel Skylake Xeons is going to take that further. The QPI links that connect our sockets are going to be superseded by the new Ultra Path Interconnect – UPI comes next – and we are going to see network technology put directly on that. So you won't even go through the PCI Express bus for things like OmniPath; that becomes a necessity to reduce latency and increase throughput.
Alex: It's an exciting time to be around in the computer industry. Martin Thompson, thank you very much.
Thank you.