
1BRC–Nerd Sniping the Java Community


Summary

Gunnar Morling discusses some of the tricks employed by the fastest solutions for processing a 13 GB input file within less than two seconds through parallelization and efficient memory access.

Bio

Gunnar Morling is a software engineer and open-source enthusiast at heart, currently working at Decodable on stream processing based on Apache Flink. At Red Hat he led the Debezium project. He is a Java Champion and has founded multiple open source projects such as JfrUnit, kcctl, and MapStruct. Gunnar has spoken at a wide range of conferences like QCon, JavaOne, and Devoxx.


Transcript

Morling: I would like to talk about a viral coding challenge which I did in January of this year, called The One Billion Row Challenge. I would like to tell a little bit of the story behind it, how it all went, and what I experienced during January. Then, of course, also some of the things I learned, and some of the things the community learned from doing that. This is how it started. On January 1st, I put out this tweet and a blog post introducing this challenge. It had been on my mind to do something like that for quite a while. Between Christmas and New Year's Eve, I finally found the time to do it. I thought, let me go and put out this challenge. I will explain exactly what it was and how it worked. That's how it started. Then it got picked up quite quickly. People started to work on that. They implemented this in Java, which was the initial idea, but then they also did it in other languages and other ecosystems, like .NET, Python, COBOL, and with databases: Postgres, Snowflake, Apache Pinot, all those good things.

There was an article about this on InfoQ, which was the most read article for the first six months of the year. There also was this guy, I didn't know him, Prime, a popular YouTuber. He also covered this on his stream. What did I do then? I learned how to print 162 PDF files and send them out to people with their individual certificate of achievement. I learned how to send coffee mugs and T-shirts to people all around the world, because I wanted to give out some prizes. I sent those things to Taiwan, South America, South Korea, and so on. That was what happened during January.

Who am I? I work as a software engineer at a company called Decodable. We build a managed platform for stream processing based on Apache Flink. This challenge was completely a side effort for me. It was a private thing, but then Decodable helped to sponsor it and supported me in doing it.

The Goals

What was the idea behind this? Why did I do this? I thought I would like to learn something new. There are all those new Java versions every six months. They come with new APIs and new capabilities. It's really hard to keep track of all those developments. I wanted to know: what's new in those Java versions, and what can I do with those things? I wanted to learn about it, but I also wanted to give the community an avenue to do that, so that everybody could learn something new. Then, of course, you always want to have some fun. It should be a cool thing to do. You don't want to just go and read some blogs. You would like to get some hands-on experience.

Finally, the idea was also to inspire others to do the same. This was the thing which I think was a bit specific about this challenge: you could actually go and take inspiration from other people's implementations. Nothing was secret. You could go and see what they did and be inspired by them. Obviously, you shouldn't just take somebody else's code and submit it as your own implementation. That wouldn't make much sense. You could take inspiration, and people actually did that, and they teamed up in some cases. The other motivation was, I wanted to debunk this idea, which sometimes people still have, that Java is slow. Nothing could be further from the truth if you look at modern versions and their capabilities. Still, this is on people's minds, and I wanted to help debunk it. As we will see, I think that definitely was the case.

How Did it Work?

Let's get a bit into it. How did it work? You can see it all here in this picture. The idea was, we have a file with temperature measurements, and it's essentially like a CSV file, only with a semicolon instead of a comma as the separator, and two columns: a station name, like Hamburg or Munich, and a randomized temperature measurement value associated with it. The task was: process that file, aggregate the values from it, and for each of those stations, determine the minimum, the maximum, and the average temperature value. Easy. The only caveat was, this file has 1 billion rows, as the name of the challenge gives away.
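To make the format concrete, here is an illustrative sketch of a few input lines and the expected output; the station names and values are made up, and the output lists min/mean/max per station, sorted by name:

```
Hamburg;12.0
Bulawayo;8.9
Palembang;38.8
Hamburg;34.2
...

{Bulawayo=8.9/8.9/8.9, Hamburg=12.0/23.1/34.2, Palembang=38.8/38.8/38.8}
```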

This file, if you generate it on your machine, has a size of around 13 gigabytes, so quite sizable. Then you had to print out the results, as you can see here. This already goes to show a little bit that I didn't spend that much time preparing this, because this is just the toString output of the Java HashMap implementation. It was a bit arbitrary to have this as the expected output. Then, as people were implementing this, for instance with relational databases, they actually went to great lengths to emulate that output format. Next time, I should choose a relational output format.

The Rules

A little bit more about the rules. First of all, this was focused on Java. Why? It's the platform I know best and would like to support. This is also what I wanted to spread the knowledge on. Then you could choose any version: any new versions, preview versions, all kinds of distributions, like GraalVM, or all the different JDK providers which are out there. You managed them using this tool called SDKMAN. Who has heard about SDKMAN? You should go and check out SDKMAN, and you should use it to manage Java versions. It's a very nice tool for doing that, and for switching back and forth between different versions. That's it. Java only. No dependencies.

It wouldn't make much sense to pull in some library which does the task for you. You should program this by yourself. No caching between runs. I did five runs of each implementation. Then I discarded the fastest and the slowest one, and I took the average value of the remaining three runs. It wouldn't make sense to have any caching there. Otherwise, you could just do the task once, persist the result in a file, read it back, and it would be very fast. That doesn't make much sense. You were allowed to take inspiration from others. Again, of course, not to just resubmit somebody else's implementation. You could take inspiration.

Evaluation Environment

In terms of how I ran this: my company spent €100 on this machine which I got on the Hetzner cloud. A very beefy server with 32 cores, of which I mostly used only 8, as I will explain later on. Quite a bit of RAM. Really, the file was always coming from a RAM disk. I wanted to make sure that disk I/O is not part of the equation, just because it's much less predictable and would have made life much harder for me. It's a purely CPU-bound problem here. How would we go and do this? This is my baseline implementation. I'm an average Java developer, so that's what I came up with. I use the Java Streams API. I use the Files.lines method, which gives me a stream with the lines of a file. I read that file from disk, then I map each of those lines using the split method, because I want to separate the station name from the value. Then I collect the lines into this grouping collector, grouping by station name.

Then, for each of my stations, I need to aggregate those values, which happens here in my aggregator implementation. Whenever a new value gets added to an existing aggregator object, I keep track of the min and the max, and in order to calculate the average, I keep track of the sum and the count of the values. Pretty much straightforward. That's adding a line. Then, if I run this in parallel, I would need to merge two aggregators. That's what's happening here. Again, pretty much straightforward. Finally, when I'm done, I need to reduce my processed results, and I emit a result object with the min and the max value. Then, for the average, I just divide sum by count, and I print it out. On this machine, this ran in about five minutes. Not super-fast, but also not terribly bad. Writing this code took me half an hour or maybe less. It's decent. Maybe, if you were to solve this problem in your job, you might call it a day and just go home, have a coffee and be done with it. Of course, for the purpose of this challenge, we want to go quite a bit faster and see how much we can move the needle here.
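As a rough sketch of what such a baseline looks like (a hedged reconstruction, not the exact code from the repo; names like Aggregator are illustrative):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collector;
import java.util.stream.Collectors;

public class BaselineSketch {

    // Mutable aggregate per station: min, max, plus sum and count for the average.
    static class Aggregator {
        double min = Double.POSITIVE_INFINITY;
        double max = Double.NEGATIVE_INFINITY;
        double sum;
        long count;

        void add(double value) {                 // adding one line's measurement
            min = Math.min(min, value);
            max = Math.max(max, value);
            sum += value;
            count++;
        }

        Aggregator merge(Aggregator other) {     // needed when running in parallel
            min = Math.min(min, other.min);
            max = Math.max(max, other.max);
            sum += other.sum;
            count += other.count;
            return this;
        }

        @Override
        public String toString() {               // min/mean/max, as in the expected output
            return "%.1f/%.1f/%.1f".formatted(min, sum / count, max);
        }
    }

    public static void main(String[] args) throws IOException {
        Collector<String[], Aggregator, Aggregator> toAggregator = Collector.of(
                Aggregator::new,
                (agg, parts) -> agg.add(Double.parseDouble(parts[1])),
                Aggregator::merge);

        Map<String, Aggregator> results = Files.lines(Path.of("measurements.txt"))
                .map(line -> line.split(";"))                       // [station, value]
                .collect(Collectors.groupingBy(parts -> parts[0],   // group per station
                        TreeMap::new, toAggregator));

        System.out.println(results);
    }
}
```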

The First Submission

With this challenge, the important thing is somebody has to come and participate. That was a bit of a concern for me: what happens if nobody does it? It would be a bit embarrassing. Roy van Rijn, another Java Champion, from the Netherlands, was instantly interested in this, and an hour or so after I had put out the post, he actually created his own first implementation. It wasn't very fancy or very elaborate. His idea just was: I want to be part of it. I want to put out a submission so other people can see, this is something we also could do. This was really great to see, because as soon as the first person comes along and takes part, other people will come along and take part too. Of course, he kept working on that. He was one of the most active people iterating on his implementation, but he was the very first one to submit something.

Parallelization

Let's dive a little bit into the means of making this fast. People spent the entire month of January working on this, and they went down to a really deep level, essentially counting CPU instructions. My idea is, I want to give you some ideas of what exists and what you can do. Then maybe later on, if you find yourself in a situation where you would like to optimize certain things, you might remember, I've heard about this. Then you can go and learn about it in depth. That's the idea.

Let's talk about parallelization first of all, because we have many CPU cores. On the server which I used to evaluate this, I have 32 cores, 64 with hyperthreading. We would like to make use of that. It would be a bit wasteful to just use a single core. How can we go about this? Going back to my simple baseline implementation, the first thing I could do is just add this parallel call, which is part of the Java Streams API.
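In terms of the baseline sketch from earlier, that is a one-line change:

```java
Map<String, Aggregator> results = Files.lines(Path.of("measurements.txt"))
        .parallel()                                       // process the pipeline on multiple cores
        .map(line -> line.split(";"))
        .collect(Collectors.groupingBy(parts -> parts[0],
                TreeMap::new, toAggregator));
```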

Now this will process this pipeline, or I should say, part of this streaming pipeline, in parallel. Just adding this single method call gets us down to 71 seconds: from 4 minutes 50 seconds to 71 seconds by adding a few characters for one method call. I think that's a pretty good thing. A very easy win. Of course, that's not all we can do. In particular, if you think about it, yes, it gets us down by quite a bit, but it's not eight times faster than what we had initially, even though we have 8 CPU cores here. Why is it not eight times faster? This parallel operator applies to the processing logic. All this aggregating and grouping logic happens in parallel, but the reading of the file from memory still happens sequentially.

The entire file is read sequentially, and we still have all our other CPU cores sitting idle, so we would like to also parallelize that. This comes back to this notion that I would like to go out and learn something new, because all those new Java versions come with new APIs, the JEPs, the Java Enhancement Proposals. One of them, which was added recently, is the Foreign Function and Memory API. You can see it here, taken straight from the JEP; essentially, it's a Java API which allows you to make use of native methods.

It's a replacement, much easier to use than the old JNI API. It also allows you to make use of native memory. Instead of the heap, which is managed by the JVM, you get the ability to manage your own memory section, like an off-heap memory, and you will be in charge of maintaining that, and making sure you free it, and so on. That's what we would like to use here, because we could memory map this file and then process it there in parallel. Let's see how we can go about that.

As I mentioned, there's a bit of code, but I will run you through it. That's a bit of a recurring theme: the code you will see gets more dense as we progress. Again, you don't really have to understand everything. I would like to give you a high-level intuition. What do we do here? First of all, we determine the degree of parallelism. We just ask, how many CPU cores do we have? Eight in our case, so that's our degree of parallelism. Next, we want to memory map this file. You could memory map files also in earlier Java versions, but, for instance, you had size limits. You couldn't memory map an entire 13-gig file all at once, whereas now, with the new foreign memory API, that's possible. We can do this. You map the file. We have this Arena object there. This is essentially our representation of this memory. There are different kinds of Arenas. In this case, I'm just using this global Arena, which just exists and is accessible from everywhere within my application. That's where I have that file, and now I can access that entire section of memory in parallel, using multiple threads.
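A minimal sketch of that mapping step with the FFM API (final as of Java 22, available earlier in preview; the file name is just an example):

```java
import java.io.IOException;
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class MapFile {
    public static void main(String[] args) throws IOException {
        try (FileChannel channel = FileChannel.open(Path.of("measurements.txt"), StandardOpenOption.READ)) {
            // Map the entire file into native memory at once, owned by the global arena.
            MemorySegment file = channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size(), Arena.global());
            System.out.println("mapped " + file.byteSize() + " bytes");
            byte first = file.get(ValueLayout.JAVA_BYTE, 0);   // random access to any offset
        }
    }
}
```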

In order to do so, we need to split up that file, or its memory representation. That's what happens here. First of all, roughly speaking, we divide it into eight equal chunks: we take our entire size divided by eight. That's our estimate of the chunk sizes. Now, of course, in all likelihood we will end up in the middle of a line. This is not really desirable; ideally, our worker threads should work on entire lines. So what's happening here is, we go to roughly one-eighth of the file and then just keep going to the next line-ending character. Then we say, that's the end of this chunk and the starting point of the next chunk. Then we process those chunks, essentially just using threads. We will see later on how to do this. We start our threads, we join them, and in the end we wait for them to finish. Now this parallelizes the entire thing.
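Building on the `file` segment from the previous sketch, the chunking could look roughly like this:

```java
// Split the mapped file into one chunk per core, aligned to line endings.
int parallelism = Runtime.getRuntime().availableProcessors();
long size = file.byteSize();
long[] chunkStarts = new long[parallelism + 1];
chunkStarts[parallelism] = size;                      // end of the last chunk
for (int i = 1; i < parallelism; i++) {
    long pos = i * (size / parallelism);              // roughly one i-th of the file
    while (file.get(ValueLayout.JAVA_BYTE, pos) != '\n') {
        pos++;                                        // advance to the next line ending
    }
    chunkStarts[i] = pos + 1;                         // next chunk starts on a fresh line
}
// Each worker thread then processes [chunkStarts[i], chunkStarts[i + 1]).
```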

Now we really make use of all our 8 cores the entire time, also while we do the I/O. There's one caveat. Just by the nature of things, one of those CPU cores will always be the slowest. At some point, the other seven will just be waiting for the last one to finish, because the work is a little bit unequally distributed. What people did in the end, instead of using 8 chunks, was split up this file into much smaller chunks. Essentially, they had a backlog of those chunks, and whenever one of those worker threads was done with its current chunk, it would go and pick up the next one.

That way, you make sure that all your 8 threads are utilized equally all the time. The ideal chunk size, as it turned out, was 2 megabytes. Why 2 megabytes? The specific CPU in the machine I used has a second-level cache size of 16 megabytes: 8 threads processing 2 megabytes at a time. It's just the best in terms of predictive I/O and so on. This is what people found out. This already goes to show, we really get down to the level of a specific CPU and a specific architecture to optimize for that problem. That's parallel I/O.
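The work-stealing flavor of that idea can be sketched with a shared atomic counter (a simplification of what the fast implementations did; chunk alignment is elided):

```java
import java.util.concurrent.atomic.AtomicLong;

// Workers claim the next 2 MB chunk from a shared counter until the file is exhausted.
static final long CHUNK_SIZE = 2 * 1024 * 1024;
static final AtomicLong nextChunk = new AtomicLong();

static void worker(MemorySegment file) {
    long start;
    while ((start = nextChunk.getAndAdd(CHUNK_SIZE)) < file.byteSize()) {
        long end = Math.min(start + CHUNK_SIZE, file.byteSize());
        // align start/end to line boundaries as shown before, then process [start, end)
    }
}
```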

1BRC - Mythbusters, and Trivial Task?

This challenge was going, people were participating, they had a good time. Of course, whenever something happens, there are also conspiracy theories, I fear. People said, is this actually an engineering problem? At Decodable, you had this problem and you didn't know how to do it, so you farmed it out to the community. I can tell you, this would have been the most silly thing I could have done, because I created so much work for myself by running this challenge. I didn't do much besides it during the entire month of January. It was not that. It was just a genuine thing which I felt would be interesting to me and the community. Was it an ad for GraalVM? Because many people actually used GraalVM, and we will see more about it later on. Also, no. It's just that GraalVM lends itself really well to that problem. Finally, was it an ad for this AMD EPYC processor? Also, no.

Really, no conspiracies going on here. Who is on Hacker News? I read Hacker News way too often. Of course, you always have the Hacker News guy who says, that's a trivial problem; everybody who knows how to program just a little bit will have solved this in two hours, and it's a boring thing. Then, on the other hand, you have all the people from the Java community, and also big names like Thomas Würthinger, who is the engineering lead of GraalVM, or Cliff Click, who was one of the original people behind the JVM, or Aleksey Shipilev, and all those people spent the entire month of January doing this. Of course, the Hacker News dude does it in two hours. Always interesting to see.

Parsing

Let's dive a little more into parsing. We have seen how to make use of our CPU cores, but what actually happens to process a line? Let's take a closer look at that. We want to get away from what we had initially, just splitting up the lines using a regex and so on; that's not very efficient. Let's see what we can do here. That's, again, something I would be able to come up with myself: processing those input lines character by character. What's happening here is, we have a little bit of a state machine. We read our characters, and we keep reading the line until it has no more characters. We use the semicolon character, which separates our station name from the temperature value, to switch the state. Depending on which state we are in, we either read the bytes which make up the station name, or the bytes which make up the measurement value, and we add them into some builder or buffer which accumulates those particular values.

Then, when we are at the end of a line, so we have found the line-ending character, we need to consume those two buffers for the station and for the measurement which we have built up. For the measurement, we will need to see how we convert that into an integer value, because that's also something people figured out. The problem was described in terms of doubles, floating-point arithmetic, with values like 21.7 degrees, but, somewhat arbitrarily, I always only had a single fractional digit. People realized, this data always only has a single fractional digit. Let's take advantage of that and just treat it as an integer problem, multiplying the number by 10 for the purpose of calculation, and at the end, of course, dividing by 10 again. That's something which people did a lot, and I underestimated how much they would take advantage of the particular characteristics of that dataset.

For that conversion, we can see it here, and it makes sense: we consume those characters. If we see the minus character, we negate the value. If we see the first digit, we multiply it by 100 or by 10, depending on whether the value has two integer digits or one. That's how we get our value there. Doing that gets us down to 20 seconds. This is already an order of magnitude faster than my initial baseline implementation. So far, nothing really magical has happened. One takeaway for you should also be, how much does it make sense to keep working on such a thing? Again, if this is a problem you're faced with in your everyday job, maybe stop here. It's well readable, well maintainable. It's an order of magnitude faster than the naive baseline implementation, so that's pretty good.
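A sketch of that conversion, assuming (as in the challenge data) values between -99.9 and 99.9 with exactly one fractional digit (the method name is illustrative):

```java
// Parses e.g. "21.7" -> 217 and "-3.4" -> -34; divide by 10 when printing.
static int parseTemperature(byte[] line, int pos) {
    int sign = 1;
    if (line[pos] == '-') {
        sign = -1;
        pos++;
    }
    int value;
    if (line[pos + 1] == '.') {
        // one integer digit, e.g. "3.4"
        value = (line[pos] - '0') * 10 + (line[pos + 2] - '0');
    } else {
        // two integer digits, e.g. "21.7"
        value = (line[pos] - '0') * 100 + (line[pos + 1] - '0') * 10 + (line[pos + 3] - '0');
    }
    return sign * value;
}
```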

Of course, for the purposes of this challenge, we probably need to go a bit further. What else can we do? We can, again, come back to the notion of parallelism and try to process multiple values at once, and here we have different means of parallelization. We already saw how to make the most out of all our CPU cores. That's one degree of parallelism. We could think about scaling out to multiple compute nodes, which is what we typically would do with our datastores. For this problem, that's not that relevant; we would have to split up that file and distribute it over a network. Maybe not that desirable, but that would be one end of the spectrum. We can also go in the other direction and parallelize within specific CPU instructions. This is what happens with SIMD: Single Instruction, Multiple Data.

Essentially, all these CPUs have extensions which allow you to apply the same kind of operation to multiple values at once. For instance, here, we would like to find the line-ending character. Instead of comparing byte by byte, we can use such a SIMD instruction to apply the comparison to 8, or maybe 16, or maybe even more bytes at once, and that, of course, will speed up things quite a bit. The problem is, in Java you didn't really have a good means to make use of those SIMD instructions, because it's a portable, abstract language; it just wouldn't allow you to get down to this level of CPU specificity. But there's good news.

There's this Vector API, which is still incubating, I think in its eighth incubating version or so, but this API now allows you to make use of those vectorized instructions and extensions. You have calls like this compare call with the equals operator, and this will be translated to the right SIMD instruction of the underlying architecture. It would translate to the Intel or AMD64 extensions; for Arm, it would do the same. Or it would fall back to a scalar execution if your specific machine doesn't have any vector extensions. That's parallelization at the instruction level. I did another talk about it, https://speakerdeck.com/gunnarmorling/to-the-moon-and-beyond-with-java-17-apis, which shows you how to use SIMD for solving FizzBuzz.
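As a rough illustration of the API's shape (run with --add-modules jdk.incubator.vector; the helper name is made up, and a real implementation would also handle the tail of the buffer):

```java
import jdk.incubator.vector.ByteVector;
import jdk.incubator.vector.VectorMask;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

public class FindSemicolon {
    static final VectorSpecies<Byte> SPECIES = ByteVector.SPECIES_256;   // 32 bytes per step

    // Returns the index of the first ';' in the 32 bytes at offset, or -1 if none.
    static int indexOfSemicolon(byte[] buffer, int offset) {
        ByteVector block = ByteVector.fromArray(SPECIES, buffer, offset);
        VectorMask<Byte> matches = block.compare(VectorOperators.EQ, (byte) ';');
        int lane = matches.firstTrue();          // first matching lane, or the lane count if none
        return lane == SPECIES.length() ? -1 : offset + lane;
    }
}
```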

Sticking to that pattern of applying the same operation to multiple values at once, we can also do what's called SWAR: SIMD Within A Register. Again, I realize the code gets more dense. I probably wouldn't even be able to explain each and every single line right now, or it would take me a while. The idea is, this notion of doing the same thing, like an equals comparison, on multiple values all at once, we can also apply within a single variable. Because a long, that's 64 bits, is also 8 bytes. We can apply the right level of bit-level magic to a long value and then actually apply this operation to all 8 bytes at once. It's bit-level masking and shifting, and so on. That's what's happening here. There's a very nice blog post by Richard Startin which shows you, step by step, how to use this to find the first zero byte in a string.

I have put the math up here on the right-hand side, so you can actually go and follow along, and you will see, this actually gives you the first zero byte in a long like that. That's SIMD Within A Register, SWAR. Now the interesting thing is, if you look at this code, something is missing here. Does somebody see what we don't have here? There are no ifs, no conditionals, no branching in that code. This is actually very relevant, because we need to remember how our CPUs actually work. If you look at how a CPU executes our code, it always takes this pipelined approach. Each instruction goes through phases: it's fetched from memory, it's decoded, it's executed, and finally the result is written back. Multiple of those things actually happen in parallel. While we decode one instruction, the CPU will already go and fetch the next one. It's a pipelined, parallelized approach.

Of course, in order for this to work, the CPU actually needs to know what the next instruction is, because otherwise it wouldn't know what to fetch. In order for it to know, we can't really have any ifs, because then we wouldn't know which way we will go. Will we go left or right? If you have a way of expressing this problem in a branchless way, as we have seen before, then this is very beneficial for what's called the branch predictor in the CPU, so it always knows what the next instructions are. We never have the situation that we actually need to flush the pipeline because we took a wrong path in speculative execution. Very relevant for that. I didn't really know much about those things, but people dug into them. One of the resources they employed a lot is this book, "Hacker's Delight". I recommend everybody to get it if this is interesting to you. This problem, finding the first zero byte in a string, you can see it here; all those algorithms and routines are described in this book. If this is the thing which gets you excited, definitely check out and get this book.
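A sketch of that zero-byte trick from "Hacker's Delight", adapted to find a semicolon; this assumes the 8 bytes were read little-endian, so the lowest byte of the long is the first byte in memory:

```java
// A byte of `word` is zero iff the corresponding byte of the result has its top bit set.
static long zeroBytes(long word) {
    return (word - 0x0101010101010101L) & ~word & 0x8080808080808080L;
}

// XORing with ';' repeated in every byte turns matching bytes into zero bytes.
static int indexOfSemicolon(long word) {
    long matches = zeroBytes(word ^ 0x3B3B3B3B3B3B3B3BL);   // 0x3B == ';'
    return matches == 0 ? -1 : Long.numberOfTrailingZeros(matches) >>> 3;
}
```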

Then, Disaster Struck

Again, people were working on the challenge, and it was going really well. They would send me pull requests every day; I didn't expect that many people to participate, which is why I always went to the server and executed them manually. Then one day I woke up, saw the backlog of implementations people had sent overnight, let's say 10 PRs to review and evaluate, and suddenly all the results were off. Everything was like twice as fast as before. I ran one of the implementations which I had run the day before, and suddenly it was much faster. I was wondering, what's going on? What happened is, I had this workload initially on a virtual server. I tried to be smart, so I got dedicated virtual CPU cores so I wouldn't have any noisy neighbors on that machine, this kind of thing.

What I didn't expect is that they would just go and move this workload to another machine. I don't know why. Maybe it was random. Maybe they saw there was lots of load on that machine. In any case, it just got moved to another host, which was faster than before. This, of course, was a big problem for me, because all the measurements which I had done so far were off and not comparable anymore. I was a bit desperate at that point in time. This is where the wonders of the community really became apparent. Good things happened. I realized, I need to get a dedicated server so that this cannot happen again. I need to have a box which I can use exclusively. As I was initially paying for it out of my own pocket, I thought, I don't want to go there. I don't want to spend €100. As I mentioned, Decodable, my employer, stepped up to sponsor it.

Then, of course, I also needed help with maintaining that, because I'm not the big operations guy. I know a little bit about it, but for a setup like that you would, for instance, like to turn off hyperthreading, or turn off turbo boost, to have stable results. I wasn't really well-versed in doing those things, but the community came to help. In particular, René came to help. He offered to set the thing up. We had a call. I spoke to him; I had not known him, it was the first time I ever spoke to him, but we had a great phone conversation. In the end, he just sent me his SSH key. I uploaded his key to the machine, gave him the keys to the kingdom, and then he went and configured everything the right way. There were many people like that who came and helped, because otherwise I just could not have done it.

The 1BRC Community

All this was a bit of a spontaneous thing. Of course, I put out some rules for how this should work, but I wasn't very prescriptive. Like, what is the value range? How long could station names be? What sort of UTF character planes, and whatnot? I didn't really specify it. Of course, people asked: how long can a station name be? What kinds of characters can it contain, and so on? We had to nail down the rules and the boundaries of the challenge. Then people actually built a TCK, a test kit. It was a test suite which you had to pass: not only do you want to be fast, you also want to be correct. People built this test suite, and it actually grew over time. Whenever a new submission, a new entry, came in, it first of all had to pass those tests. If it was valid, then I would go and evaluate it and take the runtime. This is how it looked. You can see it here.

It had example files with measurements, and an expected file: what should be the result for that input file? Then the test runner would process the implementation against that set of files and ensure the result is right. That's the test kit. The other thing, which was also very important: I had to run all those things on that machine. There were quite a few things related to that, like making sure the machine is configured correctly. Then, I had five runs, and I wanted to discard the fastest and the slowest, all those things. Jason came to help and scripted all that. It was actually very interesting to see how he did it. I would really recommend going to the repo and checking out the shell scripts which are used for running those evaluations. It's a bit of a master class in writing shell scripts, with very good error handling, colored output, all this good stuff to make it really easy and also safe to run those things. If you have to do shell scripting, definitely check out those scripts.

Bookkeeping

Then, let's talk about one more thing, which is also very relevant, and that is what I would call bookkeeping. If you remember the initial code I showed, I had this Java Streams implementation, and I used this collector for grouping the values into different buckets, per weather station name. People realized, that's another thing which we can optimize a lot ourselves. By intuition, you would use a HashMap for that, with the weather station name as the key. Java's HashMap is a generic structure. It works well for a range of use cases. But if we want to get the most performance for one particular use case, we may be better off implementing a bespoke, specific data structure ourselves. This is what we can see here. It might sound scary, but actually it is not. It's relatively simple. What happens here? We would like to keep track of the measurements per station name. It's like a map, but it is backed by an array of buckets.

The idea now is, we take the hash code of our station name and use it as the index into that array, and at that particular slot in the array, we manage the aggregator object for that particular station name. We take the hash code, and we want to make sure we don't overflow the array. That's why we AND it with the size of the array (minus one), so we always have a valid index into the array. Then we need to check: at that particular position in the array, is something there already? If nothing is there, that means we have the first instance of a particular station in our hands, so the first value for Hamburg or the first value for Munich. We just create the aggregator object and store it at that particular offset in the array. That makes sense. The other situation, of course, is we go to the particular index in the array and something is there already, which, if we have another value for the same station, will in all likelihood be the case.

The problem is, we don't know yet: is this actually the aggregator object for the particular key we have in our hands, or is it something else? Because multiple station names could hash to the same slot. Which means, if something exists already at this particular array slot, we need to fall back and compare the actual name. Only if the incoming name is also the name of the aggregator object in that slot can we go and add the value to it. Otherwise, we just keep iterating through the array until we either have found a free slot, where we can install the new aggregator, or we have found the slot for the key we have in our hands. That's why it's called linear probing. I think it's relatively simple. For this particular use case, this performs much better than what we could get by just using Java's HashMap.
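A hedged sketch of that open-addressing scheme (the names StationMap, Slot, and lookup are made up for illustration; real entries also size the table to fit their dataset):

```java
import java.util.Arrays;

public class StationMap {
    static final int SIZE = 1 << 17;          // power of two, comfortably above the station count
    final Slot[] slots = new Slot[SIZE];

    static class Slot {
        byte[] name;
        int min = Integer.MAX_VALUE, max = Integer.MIN_VALUE;
        long sum, count;
    }

    Slot lookup(byte[] name, int hash) {
        int index = hash & (SIZE - 1);        // logical AND instead of a modulo
        while (true) {
            Slot slot = slots[index];
            if (slot == null) {               // first value for this station: install it
                slot = new Slot();
                slot.name = name.clone();
                slots[index] = slot;
                return slot;
            }
            if (Arrays.equals(slot.name, name)) {
                return slot;                  // the slow but correct collision check
            }
            index = (index + 1) & (SIZE - 1); // linear probing: try the next slot
        }
    }
}
```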

Of course, it depends a lot on the particular hash function which we use to find that index. This goes back to how people really optimized a lot for the particular dataset: they used hash functions which would be collision-free for that dataset. This was a bit against what I had in mind. The problem was, this file, as I mentioned, has a size of 13 gigabytes, and I just didn't have a good way of distributing 13 gigabytes to people out there. That's why, instead, they had to generate it themselves. I had a data generator, and everybody could use this generator to create the file for themselves and then use it for their own testing purposes. The problem was, in this data generator I had a specific key set: around 400 different station names, with the idea being that this is just an example. But people took it very literally, and they optimized a lot for those 400 station names. They used hash functions which would not have any collisions for those 400 names. Again, people will take advantage of everything they can.

The problem with all that is, it also created lots of work for me, because you cannot really prove the absence of hash collisions. Whenever somebody sent in their implementation, I had to go and check: do they actually handle the case of two stations which produce the same hash, and do they handle those collisions accordingly? Because otherwise, if you don't do this fallback to the slow path, you would be very fast, but you would be incorrect, because you don't deal correctly with all possible names. This was a bit of a trap which I had set up for myself, and it meant I always had to check for that, and actually ask people in the pull request template: if you have a custom map implementation, where do you deal with collisions? Then we would have conversations like the one we see here. "How do you deal with hash collisions?" "I don't, that's why it's so fast." Then he would go and rework it. A big trap I had set for myself.

GraalVM: JIT and AOT Compiler

Those are the three big things: parallelization, then all this parsing with SIMD and SWAR, and custom hash maps for bookkeeping. Those were recurring themes I saw again and again. Then there were more specific tricks, and I just want to mention a few of them, to give you some inspiration of what exists. One thing which exists is the Epsilon garbage collector, which is a very interesting garbage collector because it doesn't collect any garbage. It's a no-op implementation. For your regular Java application, that would not be a good idea: you keep allocating objects, and if you don't do any GC, you will run out of heap space at some point. Here, people realized, we can actually implement this in a way that we don't do any allocations in our processing loop. We do a few allocations initially, when bootstrapping the program, but later on no more objects get created. We just have arrays which we can reuse, mutable structures which we can just update.

Then we don't need any garbage collection, and no CPU cycles get spent on garbage collection, which means we can just be a bit faster. Again, I think that's an interesting thing. If you work on some CLI tool, for instance, a short-lived thing, it could be an interesting option to just disable the garbage collector and see how that goes. The other thing you can see here is that people used GraalVM a lot. GraalVM is really two things. It's an ahead-of-time compiler, so it will take your Java program and emit a native binary out of it. This has two essential advantages. First of all, it uses less memory. Secondly, it's very fast to start, because it doesn't have to do class loading and compilation and everything; this all happens at build time. At the level of results we got here, this actually mattered.

Initially, I thought saving a few hundred milliseconds on startup wouldn't make a difference for processing 13 gigabytes of input, but actually it does make a difference. Most of the fastest implementations actually used the AOT compiler with GraalVM. There's also the possibility to use GraalVM as a replacement for the just-in-time compiler in your JVM, as a replacement for the C2 compiler. I'm not saying you should always do this. It depends a little bit on your particular workload, whether it's advantageous or not. This problem lends itself very well to that. Just by using GraalVM as the JIT compiler in the JVM, people got a nice improvement of 5% or so. It's something I can recommend you try out, because it's essentially free. You just need to make sure you use a JVM or Java distribution which has GraalVM available as the C2 compiler replacement. Then it's just a matter of saying, that's the JIT compiler I want to use, and off you go. Either it does something for you or it does not.
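For reference, these are hedged examples of the flags involved; the exact set depends on your JDK build:

```
# Run with the no-op Epsilon GC (the JVM fails with an OutOfMemoryError if you allocate too much):
java -XX:+UnlockExperimentalVMOptions -XX:+UseEpsilonGC ...

# On a JDK that ships the Graal compiler, use it as the JIT in place of C2:
java -XX:+UnlockExperimentalVMOptions -XX:+EnableJVMCI -XX:+UseJVMCICompiler ...

# GraalVM AOT: compile to a native binary.
native-image MyMainClass
```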

Further Tricks and Techniques

A few other things, like Unsafe. What I found interesting is the construct here on the right-hand side. If you look at that, this is our inner processing loop. We have a scanner object, we take the next values, we process them, and so on. What we have here is the same loop three times, written out sequentially in the program. If you look at it, you would say, those three loops run one after another. What actually happens is, as CPUs have multiple execution units, the compiler will figure out that this can actually be parallelized, because there are no data dependencies between those loops. So it can take those loops and run them concurrently. I found that very interesting. Why three times? Empirically determined.
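The shape of that construct, heavily simplified (scanner and processLine are stand-ins for the real implementation's types):

```java
// Three chunks advance in one loop body; with no data dependencies between
// them, the CPU's execution units can work on all three concurrently.
while (scanner1.hasNext() && scanner2.hasNext() && scanner3.hasNext()) {
    processLine(scanner1);   // the same code three times,
    processLine(scanner2);   // operating on three different
    processLine(scanner3);   // chunks of the file
}
// ...then finish whatever remains of each chunk sequentially.
```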

Thomas, who came up with this, tried it with two copies of the loop, and with four; three was just the fastest on that particular machine. It could be different on other machines. Of course, you already see that all those things create questions around maintainability. I can already see the junior guys joining the team: "That's duplication. It's the same code three times. Let me go and refactor it, let me clean it up," and your optimization would be out the window. You would want to put a comment there: don't go and refactor this into one loop. That's the consideration: are those things worth it? Should you do it in your particular context? That's what you need to ask. I found it super interesting that this is a thing.

The Results

You are probably curious: how fast were we in the end? This is the leaderboard with those 8 CPU cores I initially had. I had 8 CPU cores because that was what I had on the virtual server initially; when I moved to the dedicated box, I tried to be in the same ballpark. With those 8 cores, we went down to 1.5 seconds. I would not have expected that you could go that fast with Java, processing 13 gigabytes of input in 1.5 seconds. I found that pretty impressive. It gets better, because I had this beefy server with 32 cores, 64 threads with hyperthreading. Of course, I would like to see how fast we can go there: we go down to 300 milliseconds. To me, that's doubly mind-blowing. Super impressive. Also, people did this, as I mentioned, in other languages and on other platforms, and Java really is very much at the top; you wouldn't be substantially faster with other platforms.

Then there was another evaluation, because, as I mentioned, I had this data generator with those 400-something station names, and people optimized a lot for that by choosing specific hash functions and so on. Some people realized that this was not my intention; I wanted to see how fast this can be in general, and some people agreed with that view of the world. For those, we had another leaderboard, where we had 10K different station names. As you can see here, it's actually a bit slower, because you really cannot optimize that much for that dataset. Also, it's different people at the top here. If I go back, here we have Thomas Würthinger and the people who teamed up with him for the regular key set, and for the 10K key set it's other people. It's different tradeoffs, and you see how much this gets specific to that particular dataset.

It's a Long Journey

People worked on this for a long time. Again, the challenge ran through the entire month of January. I didn't do much besides running it, really. People like Thomas, who was the fastest in the end, sent 10 PRs. There were other people who sent even more. The nice thing was, it was a community effort. People teamed up. As I mentioned before, the fastest one was actually an implementation by three people who joined forces, and they took inspiration from each other. When people came up with particular tricks, very quickly the others would also adopt them into their own implementations. It was a long journey with many steps, and I would very much recommend checking this out.

This is, again, the implementation from Roy van Rijn, who was the first one, because he kept this very nice log of all the things he did. You see how he progressed over time. If you go down to the very bottom, you will see, he started to struggle a bit, because he made changes and they actually were slower than what he had before. The problem was, he was running on his Arm MacBook, which obviously has a different CPU with different characteristics than the machine I was running this on. He saw improvements locally, but it was actually slower on the evaluation machine. As you can see at the bottom, he went and tried to get an Intel MacBook, to have better odds of doing something locally which then also performs better on that machine. I found it really surprising to see this happening with Java, that we get down to this level where the particular CPU, and even its generation, would make a difference.

Should You Do Any of This?

Should you do any of this? I touched on this already. It depends. If you work on an enterprise application, you deal with database I/O most of the time. Going to that level and trying to shave off CPU cycles in your business code probably isn't the best use of your time. Whereas if you work on such a challenge, it might be an interesting thing. What I would recommend is, for instance, check out this implementation, because it is one order of magnitude faster than my baseline; it would run in 20 seconds or so. It's still very well readable. That's what I observed: improving by one order of magnitude, we still have very well readable, maintainable code.

You don't have any pitfalls in this. It just makes sense, and you are very much faster than before. Going down to the next order of magnitude, down to 1.5 seconds, is where you do all the crazy bit-level magic, and you should be very conscious about whether you want to do that or not. Maybe not in your regular enterprise application. If you participate in a challenge and want to win a coffee mug, it might be a good idea. Or if you want to be hired into the GraalVM team: I just learned the other day that the person who goes by the name of Mary Kitty in the competition actually got hired into the GraalVM compiler team at Oracle.

Wrap-Up, and Lessons Learned

This had an impact on the Java community, but also on people in other ecosystems and databases; at Snowflake, they had a One Trillion Row Challenge. This really blew up and kept people busy for quite a while. There was this show-and-tell in the GitHub repo. You can go there and take a look at all those implementations in Rust, and OCaml, and all the good things I had never heard about, to see what they did, in a very friendly, competitive way. For some stats, you can go to my blog post: you will see how many PRs there were, 1900 workflow runs, so quite a bit of work, and 187 lines of comments in Aleksey's implementation. Really interesting to see. In terms of lessons learned: if I ever want to do this again, I would have to be really prescriptive in terms of rules, automate more, and work with the community, as happened already this time. Is Java slow? I think we have debunked that. I wouldn't really say so. You can go very fast. Will I do it again next year? We will see. So far, I don't really have a good problem which would lend itself to doing that.

 


 

Recorded at:

Oct 25, 2024
