Using Shared Memory-Mapped Files in Java


Summary

Peter Lawrey discusses Unsafe in Java 8 and Project Panama in Java 17 and Java 19, including practical uses with code examples, a demo using Panama, and Event Sourcing using shared memory with Chronicle Queue.

Bio

Peter Lawrey has the most answers for concurrency and memory on stackoverflow.com, and the second-highest for Java. Peter is a Java Champion, the CEO of Chronicle Software, and the architect of the OpenHFT libraries, which are downloaded from 15K different IP addresses each month.

About the conference

QCon Plus is a virtual conference for senior software engineers and architects that covers the trends, best practices, and solutions leveraged by the world's most innovative software organizations.

Transcript

Lawrey: My name is Peter Lawrey. This talk is on using shared memory in Java. How do you go about doing this? Indeed, why would you even want to? We have a couple of libraries called Chronicle Queue and Chronicle Map, which make extensive use of a library called Chronicle Bytes, which is where our support for this functionality resides. As you can see, these have a significant number of downloads, over 60,000 a month. However, you can use shared memory directly yourself using plain Java. This functionality is used by a significant proportion of all tier-1 banks.

Why would you want to do this? When you have a series of microservices that need to talk to each other [inaudible 00:01:14], then they can do this via any messaging fabric, such as a message bus, HTTP, or TCP. There's a variety of means of doing this. However, if you use shared memory, then this is one of the fastest ways of passing data between processes, because it's all virtually in memory. You're accessing the same data, the same bytes, from multiple processes at the same time. An added bonus is that if you use a memory-mapped file, then that data is also persisted. It can be the size of your available disk space; it's not limited by your heap size, or even your main memory. It can actually be very large, so you've got access to a much larger area of storage. Indeed, there is no broker involved in this situation. The only agent that is working on your behalf in the background is in fact the OS. Even if every process dies, as long as the OS keeps running, all of the data gets persisted to disk. For the situation where the machine itself dies, we've got HA options available for creating redundant copies across machines. In terms of using memory-mapped files on one machine, you can do this all in open source.

What Is Avoided?

One of the key things is that not only is there no broker involved, which would add latency, but there are no system calls made on a per-message basis. It's observable, so you can see every message. It's stored in files, which you can read from any process; the reader doesn't have to be running at the same time, it can be run much later. You can get latencies between processes, including serialization, that are around a microsecond or less.

Heatmaps

One of the things to notice is that we're actually dealing at an extremely low level at this point. It's your L2 cache coherence bus, typically, that's involved in exchanging data between threads. This is done on-chip at the hardware level. You actually get a pattern of usage in terms of latency. If you've got a dual-socket Xeon, then everything on the same socket communicates fast. You do see an added delay if you go from one socket to another. The EPYC processors have core complexes, and within a core complex you get very low latencies. Actually, you get even lower latencies because it's a small complex, but between complexes, even on the same socket, you get higher latencies, and you get higher latencies again if you go between sockets. Certainly with AMD, you have to be much more aware of the internal structure of your CPU if you want to maximize performance and minimize latency.

Chronicle Queue, 256B, 100K msgs/s

In terms of what you can get with real machines, here's an example of the latencies you can achieve passing quarter-of-a-kilobyte messages at 100,000 messages a second, which is a fairly good rate. You get typical latencies that are around 400 nanoseconds: about 0.4 microseconds, or a 2,500th of a millisecond. At the three nines, for the worst 1 in 1,000, that can be between 1 and 40 microseconds, depending on use case. It's interesting to see that the outliers on a virtual machine are as high as replicating data across different machines. It is quite a high outlier if you're going to use a virtual machine. However, you can get typical latencies that can be just as good as bare metal. It really depends on how sensitive you are to jitter as to whether using a virtual machine matters to you or not. For comparison, I've done some benchmarks on Kafka. At this rate, 100,000 messages a second, you're getting typical latencies that are at least around three orders of magnitude higher, if not more.

Memory Mapped Files

Memory mapping has been available in Java since Java 1.4. It's fairly easy to do. You just create a mapping to the same file in multiple processes at the same time, and you're sharing the same data. One of the main disadvantages is that this is not actually thread safe. Another major disadvantage is that when Java 1.4 was created, a signed int as the size, which allows up to not quite 2 gigabytes, seemed like a pretty good limit for a memory mapping. These days, limiting yourself to 2 gigabytes is something of a constraint. It would be much nicer to be able to map much larger regions. Certainly, that's what our library does. You can use Unsafe, which is a built-in class that ideally you shouldn't use, but which is sometimes still the best option compared to using JNI to do the same thing. It's not any more performant than JNI, but it's probably less buggy than trying to write your own JNI.
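Here is a minimal sketch of that plain-Java approach; the file name shared.dat is illustrative. Any process that maps the same region of the same file sees the same bytes:

```java
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class PlainMapping {
    public static void main(String[] args) throws Exception {
        try (RandomAccessFile file = new RandomAccessFile("shared.dat", "rw");
             FileChannel channel = file.getChannel()) {
            // The mapping size is capped at Integer.MAX_VALUE bytes, i.e. not quite 2 GB.
            MappedByteBuffer bytes = channel.map(FileChannel.MapMode.READ_WRITE, 0, 4096);
            bytes.putLong(0, System.nanoTime()); // visible to other processes mapping this file
            System.out.println(bytes.getLong(0));
        }
    }
}
```

Note that plain ByteBuffer accesses like these are not thread safe across processes; coordinating concurrent writers needs atomic operations that this API doesn't expose.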

There's a POSIX library that we produced that has things such as memory mapping and a lot of other low-level system calls that relate to files and memory. This supports 64-bit long sizes. However, you can also use a library we have called MappedBytes. What MappedBytes does is add things like thread safety and 63-bit sizes. It can also allocate data in chunks, so you can treat it as if it's a massive file, up to 128 terabytes, but in reality it only allocates chunks of data as needed. That's particularly useful on Windows and macOS, where sparse files aren't available. On Linux, you can actually have a sparse file, where you just create a huge region of half a terabyte or more, and then only the pages you actually touch get turned into real memory or disk space. Another thing that Chronicle Bytes adds is a number of complex operations, such as reading and writing UTF-8 strings that are object pooled, as well as support for data structures, enums, and so on.
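A sketch of what that can look like with the chronicle-bytes library on the classpath, assuming its current API; the file name and 64 MB chunk size are illustrative:

```java
import java.io.File;
import net.openhft.chronicle.bytes.MappedBytes;

public class MappedBytesDemo {
    public static void main(String[] args) throws Exception {
        // Chunked mapping: the virtual size can be huge, but only the
        // chunks actually touched are mapped (64 MB at a time here).
        MappedBytes bytes = MappedBytes.mappedBytes(new File("huge.dat"), 64 << 20);
        try {
            bytes.writeUtf8("hello");   // built-in UTF-8 string support
            bytes.writeLong(42L);
            bytes.readPosition(0);
            System.out.println(bytes.readUtf8() + " " + bytes.readLong());
        } finally {
            bytes.releaseLast();        // release the underlying mapping
        }
    }
}
```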

Java 14 - Java 21

Java has increasingly improved its support for off-heap memory. One of the first steps was the Foreign-Memory Access API, which first appeared in Java 14 as an incubator implementation. Java 15 had a second incubator, and Java 16 a further iteration. Java 16 also introduced a related API called the Foreign Linker API, which allows you to directly interact with code that's written in C. Obviously, that has a lot of crossover in usage. Java 17 has a further incubator iteration, and this adds a lot of the functionality that we need. Unfortunately, it's still incubator, with all the instability that implies if you use it. Java 18 has a second incubator for it. Java 19, fortunately, has now been elevated to preview stage, which means it's in the final stage before no longer being incubator, no longer having an unstable API. Hopefully, by the time we get to the next long-term support release, which is Java 21, we will see something that will help us migrate away from JNI and Unsafe and a lot of other libraries which are used for binding to C libraries. I highly recommend you have a look at this video: https://www.youtube.com/watch?v=4xFV-A7JToY.

Here's an example of creating a memory mapping using the new API. One thing to notice is that this code doesn't actually compile in Java 19, because it's using an incubating library whose API can change over time, which is why I look forward to the day it becomes stable. Another thing, which takes a little bit of time to get your head around, is that a lot of the parameters to these methods are untyped. You only know at runtime whether a combination is going to work or not. That makes it difficult to work out what all the valid combinations for these methods are. That's partly deliberate, so that the API doesn't become enormous, as it could do if it covered every valid permutation. However, it does mean that there's a little bit of head scratching to find combinations that actually work. Like I said, those combinations are not the same between versions of Java either, so it's a little bit of a challenge. What you may find interesting is that this is actually creating a memory region off-heap on a file that's half a terabyte: 512 shifted left by 30 bits is half a terabyte. That's a massive area of virtual memory. It's only virtual; it's not actually using real, physical memory, and it's not using disk either.
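As a sketch of roughly what this looks like under the Java 19 preview API (JEP 424), with the caveat from the talk that the exact class and method names have shifted between releases; the file name is illustrative:

```java
// Requires: javac/java --release 19 --enable-preview
import java.lang.foreign.MemorySegment;
import java.lang.foreign.MemorySession;
import java.lang.foreign.ValueLayout;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import static java.nio.file.StandardOpenOption.*;

public class PanamaMapping {
    public static void main(String[] args) throws Exception {
        long halfTerabyte = 512L << 30; // 512 shifted by 30 bits
        try (FileChannel channel = FileChannel.open(Path.of("sparse.dat"), READ, WRITE, CREATE);
             MemorySession session = MemorySession.openConfined()) {
            // On Linux this can be a sparse mapping: only pages actually
            // touched consume physical memory or disk.
            MemorySegment segment =
                    channel.map(FileChannel.MapMode.READ_WRITE, 0, halfTerabyte, session);
            segment.set(ValueLayout.JAVA_LONG, 0, 42L);
            System.out.println(segment.get(ValueLayout.JAVA_LONG, 0));
        }
    }
}
```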

Distributed Unique Timestamps

Why would you do this? What use does it have? The simplest use case we have for this kind of thing is generating unique timestamps. These are unique across an entire machine by default. The way that's coordinated is that we get a high-resolution timestamp, like a nanosecond timestamp. Then we look at the most recent timestamp that anything on this machine has produced, by looking in shared memory, and we ensure that the new one is greater than the previous value. We also embed a host ID in it, so that we can actually have up to 100 different nodes producing unique timestamps between them without having to coordinate anything. You only need a coordinated host ID, and then you can guarantee that every timestamp becomes a unique ID that can be used across many machines. In the happy path, the time is greater than the previous time, and therefore it just returns it. However, if the time isn't greater, then it has to go through a loop: it finds the most recent time, it finds the next time that would be valid for that machine, and then tries that. It just goes around in that loop until eventually it's able to allocate a new timestamp.
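A simplified sketch of that loop. For illustration, the last issued timestamp lives in an in-process AtomicLong; in the real thing it lives in shared memory, so it is machine-wide and shared across processes:

```java
import java.util.concurrent.atomic.AtomicLong;

public class UniqueTimestamps {
    static final int HOST_IDS = 100;      // host IDs 0..99 occupy the last two digits
    final AtomicLong lastTimestamp = new AtomicLong(); // shared memory in the real version
    final int hostId;

    UniqueTimestamps(int hostId) { this.hostId = hostId; }

    long uniqueTimestamp() {
        long now = System.currentTimeMillis() * 1_000_000L; // stand-in for a nanosecond clock
        long candidate = now - now % HOST_IDS + hostId;     // embed the host ID
        while (true) {
            long last = lastTimestamp.get();
            if (candidate <= last)  // clock hasn't advanced: take the next valid slot for this host
                candidate = last - last % HOST_IDS + HOST_IDS + hostId;
            if (lastTimestamp.compareAndSet(last, candidate))
                return candidate;   // happy path, or a successful retry
            // lost a race with another thread/process: loop and try again
        }
    }
}
```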

The way this looks is more natural than a UUID, because it is a timestamp, and it's readable. It has the time in it, down to a tenth-of-a-microsecond resolution, and it has the host ID embedded in there as well. Just by reading a timestamp, which is reasonably natural to read, you can get a lot of information, and it's human readable. UUID is a very fast function all the same; however, it does create garbage, and it is still substantially slower: about six to seven times slower than doing everything I just mentioned. At the same time, there's a good chance that when you create a new event or a new request, you will want a unique ID to make that request unique, and you'll probably also want a timestamp in there so that you know when it was created, so generating them separately hasn't really saved you very much. Creating this unique timestamp is a two-for-one, and it's substantially faster. It's able to do this because every timestamp on a given host uses shared memory to ensure that the timestamp will be unique and monotonically increasing.

Thread Safe Shared Memory in Chronicle Queue

In a more complex example, we use shared memory for storing our data in queues. These are event stores; they're append-only. This is a dump of some of the housekeeping information that's associated with each queue. You can see that there's a string key with a longValue associated with it, down the screen. Those longs can be used for storing information like locks, and for working out what the most recent roll cycle is and what its modCount is. That is all done in shared memory. We've got tooling that will allow you to dump out this binary format as YAML. You can also do the reverse, so that it's easy to read, and to manipulate and test.

Demo - Layout

We have a working example of this, which is open source. Everything I've mentioned to do with the queue is all open source under Apache 2, so you can try it out yourself.

In this demo, we are looking at how you test microservices that are using shared memory for passing data between them. The challenge of this really low-level interaction with memory off-heap is: how do you test it? How do you debug it? When something goes wrong, how do you see what's going on? The way we deal with that is to create a much higher-level interface, which is where you would naturally be working, dealing with events coming in and out. You're not dealing with the low-level bytes unless you really need to. You can go down to that level, but for the most part, to start with, you should focus on the business logic and do behavior-driven development first for your event-driven system. Then you can go down to the bytes and try to get out every nanosecond, where it makes sense to spend your time doing that. Obviously, the more you tune it, the harder it is to maintain, so there's a tradeoff there. Certainly, at the start of a project, you usually want to focus on flexibility and ease of maintenance, so you can do rapid prototyping and iterative development cycles. Then, as the product matures, and you have full confidence that the DTOs and the events are not going to change very much, you can look at microtuning them to try to get the best possible performance for what you've established as a realistic use case.

What does it look like? In this case, we do still favor, right from the start, using primitives where possible. Things like the timestamps are typically stored as primitives; in this case, it's a microsecond timestamp. There's a symbol, which is the instrument's name: what are we trying to trade here? This is an example of a trade. You can also see that we've got some enums and a String in there. You can have other data types, but where possible, we tend to use primitives.
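A sketch of the kind of DTO described; the field names and enum values here are illustrative, not the demo's exact types:

```java
// Primitives first: a long for the microsecond timestamp, enums for fixed
// choices, and Strings only where genuinely needed.
public class Trade {
    long transactTimeMicros;  // microsecond timestamp as a primitive
    String symbol;            // the instrument name, i.e. what we are trading
    BuySell side;             // enum
    OrderType orderType;      // enum
    double price;
    double quantity;
}

enum BuySell { BUY, SELL }
enum OrderType { MARKET, LIMIT }
```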

Empty Implementation, and Testing

In this trivial example, we've got a microservice that expects one type of event in, and it produces one type of event out, which is an execution report. From the order, we build an execution report object. When you're testing this, it's all in YAML, so we're dealing at a level where you can see the data structure: we've got an event in and an event out. The main point to take away is that even though, when it's stored in memory, written, and shared between processes, it's highly optimized, very much down at the binary level, when you're testing and thinking about what business functions to perform, it's at a much higher level. That's the level at which you want to be describing the functionality of the business component you'll implement.
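A minimal sketch of the shape of such a service; the interface and type names are illustrative, not the demo's actual API:

```java
// One event type in (an order), one event type out (an execution report).
interface OrderListener { void onOrder(Trade order); }
interface ExecutionReportListener { void onExecutionReport(ExecutionReport report); }

class ExecutionReport {
    long transactTimeMicros;
    String symbol;
    OrderType orderType;
}

class OrderService implements OrderListener {
    private final ExecutionReportListener out;

    OrderService(ExecutionReportListener out) { this.out = out; }

    @Override
    public void onOrder(Trade order) {
        // Build the execution report from the order, then publish it.
        ExecutionReport report = new ExecutionReport();
        report.transactTimeMicros = order.transactTimeMicros;
        report.symbol = order.symbol;
        report.orderType = order.orderType;
        out.onExecutionReport(report);
    }
}
```

In production, the outgoing listener would be a proxy writing each event to shared memory; in tests, it can simply record the events for comparison against the expected YAML.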

What happens when the test fails? How easy is it to see when something's gone wrong? If you were dealing with it at the binary level, you would just see that one of the bytes is wrong. That could potentially be very hard to diagnose, and you could waste a lot of time going through all the things it could possibly be. At this level, because we're working in YAML, we're doing a text-based comparison: we expected a particular execution report, and we got a different execution report. In this case, it's very easy to see that the order type is not the expected one. Then you can decide what action to take. Is the code incorrect? Or should the test have been updated? Perhaps you've actually changed the input and you want the output to change as well, and that wasn't done. You can very quickly diagnose what you need to do. To fix it, if, for example, it's only the expected output that is out of date, you can just copy and paste the actual over the expected, because the expected output is in a file of its own. You've then updated the test, if that's the appropriate correction.

Lowest Level Memory Access

We use YAML for data-driven tests. It's all based around behavior-driven development, to ensure that we're specifying the requirements at a level that the business users can understand, using their language and their terminology. We can then address low-level performance considerations by using binary formats, pregenerated serialization code, and, at the lowest level, trivially copyable objects, where serialization is effectively much like a memcpy and there's very little serialization logic actually going on. That can get well under half a microsecond end-to-end: from when you want to write an event, to when that event is actually called in another process, including serialization, writing to the file, writing to memory, deserialization, decoding the event type, and making the method call.
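As a sketch of the trivially copyable idea, under the assumption (not from the talk) that the DTO is reduced to fixed-width primitives at fixed offsets, so reading and writing is close to a single block copy:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// A fixed binary layout of primitives: no per-field encoding decisions,
// so (de)serialization degenerates to a few fixed-offset reads/writes.
public class TradeCodec {
    static final int TIME_OFFSET  = 0;   // long,   8 bytes
    static final int PRICE_OFFSET = 8;   // double, 8 bytes
    static final int QTY_OFFSET   = 16;  // double, 8 bytes
    static final int SIZE         = 24;  // total record size

    static void write(ByteBuffer buffer, long timeMicros, double price, double qty) {
        buffer.order(ByteOrder.nativeOrder());
        buffer.putLong(TIME_OFFSET, timeMicros);
        buffer.putDouble(PRICE_OFFSET, price);
        buffer.putDouble(QTY_OFFSET, qty);
    }

    static long readTimeMicros(ByteBuffer buffer) {
        return buffer.order(ByteOrder.nativeOrder()).getLong(TIME_OFFSET);
    }
}
```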

Resources

You can have a look at our website, https://chronicle.software. All the code is on GitHub, under OpenHFT.

Questions and Answers

Ritter: Of course, Panama is one of those things that we've been waiting for, for a long time. I think it will be very valuable when we actually get to the point where we can use it without having to enable preview features or incubator modules. It took me a while to figure out why they changed it from being an incubator module to being a preview feature.

Lawrey: I have a feeling it's to show progression, to encourage people that it's coming soon. I don't think there are any limitations on how long it can stay in preview. I'm assuming it won't go back to incubator.

Ritter: I think the reason is that when it's an incubator module it's not in the Java SE space. When they move it to a preview, it actually falls into the Java SE space. I think it's because of whether it's in the Java or javax namespace. I think that's what they've changed.

Lawrey: It's still in the incubator package name. I'll feel a lot more confident when that disappears from the package name. Obviously, what I would like to see is some backports. Unfortunately, I have a lot of clients that are still on Java 8.

Ritter: I'll have to take that up with our engineering team, see whether we can do a backport to Zulu 8.

Lawrey: The thing is, in reality, we only need a very small subset. Actually, that's the only bit that we would need backported, whereas backporting the entire thing perhaps isn't so practical. There are a lot of features in there that are really cool, but because they weren't in Unsafe, we obviously didn't use them. I think we'd have to come up with some compatibility library, which we have done. We've talked about doing a rewrite from a newer version, so we might consider just doing a rewrite for version 21, assuming it's got it in there, and do a proper version 21 native implementation of everything, and just effectively fork our libraries for the people that want the older version or the newer version.

Ritter: I can't see why it wouldn't be a full feature by Java 21, because, what have we got in 19? Ok, it's still going to be a preview feature, but I would expect that in 20, or at the latest 21, it'll definitely be a full feature, because I know they are trying to get Panama done and dusted.

Lawrey: That'd be good. They can always extend it later. Certainly, from my own selfish needs, we only need a fairly small subset. If they end up constraining it just to get it in, I'd be more than happy with that.

 


 

Recorded at:

Nov 25, 2022
