InfoQ Dev Summit Munich: How to Optimize Java for the 1BRC

The fastest Java applications completed the 1 Billion Row Challenge (1BRC) from January 2024 in 1.5 seconds. 1BRC creator Gunnar Morling, software engineer at Decodable, detailed how the participants optimized Java at the InfoQ Dev Summit Munich 2024. General optimizations applicable to all Java applications cut the runtime from 290 seconds to 20 seconds through parallel loading and processing and optimized parsing. Getting to 1.5 seconds required niche optimizations that most Java applications should forgo, with the possible exception of GraalVM Native Image compilation.

Each of the 1 billion rows has the name of a weather station and a single temperature ranging from -99.9°C to +99.9°C. Participants created that data file with a generator application that was part of the challenge code. The application had to read the entire file and calculate the minimum, maximum, and average temperatures of the 400 weather stations as quickly as possible. Using external libraries or caching was forbidden.
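
For illustration, each input line has the form <station>;<temperature> with exactly one fractional digit, for example:

Hamburg;12.0
Bulawayo;8.9
Palembang;38.8
Hamburg;34.2

The expected output lists min/mean/max per station in alphabetical order, here {Bulawayo=8.9/8.9/8.9, Hamburg=12.0/23.1/34.2, Palembang=38.8/38.8/38.8}.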

1BRC measured the official results on a 2019 AMD server processor with 128 GB RAM. The applications only used eight threads. The data file was on a RAM disk. Applications had about 113 GB RAM available, but most used much less. Each submission was tested five times, with the slowest and fastest runs discarded for consistency.

Morling wrote the baseline implementation, which took 290 seconds:

Collector<Measurement, MeasurementAggregator, ResultRow>
  collector = Collector.of(MeasurementAggregator::new,
    (a, m) -> {
      a.min = Math.min(a.min, m.value);
      a.max = Math.max(a.max, m.value);
      a.sum += m.value;
      a.count++;
    },
    (agg1, agg2) -> {
      var res = new MeasurementAggregator();
      res.min = Math.min(agg1.min, agg2.min);
      res.max = Math.max(agg1.max, agg2.max);
      res.sum = agg1.sum + agg2.sum;
      res.count = agg1.count + agg2.count;
      return res;
    },
    agg -> {
      return new ResultRow(agg.min,
        round(agg.sum) / agg.count, agg.max);
    });

Map<String, ResultRow> measurements
  = new TreeMap<>(Files.lines(Paths.get(FILE))
      .map(l -> new Measurement(l.split(";")))
      .collect(groupingBy(m -> m.station(), collector)));

System.out.println(measurements);

The 1BRC is similar to back-end processing that Java often does. That's why its general optimizations are applicable to many Java applications.

Parallel processing meant adding a single parallel() call before the map() statement near the end of the listing above. This automatically distributed the following stream operations across the eight CPU cores, cutting the runtime to 71 seconds, a 4x improvement.
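
Applied to the baseline listing, the change amounts to one added line in the stream pipeline:

Map<String, ResultRow> measurements
  = new TreeMap<>(Files.lines(Paths.get(FILE))
      .parallel()
      .map(l -> new Measurement(l.split(";")))
      .collect(groupingBy(m -> m.station(), collector)));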

Parallel loading replaces Files.lines(Paths.get(FILE)), which reads the file into a Stream<String> sequentially. Instead, eight threads load chunks of the file into custom memory areas with the help of JEP 442, Foreign Function & Memory API (Third Preview), delivered in Java 21. The Foreign Function & Memory API was later finalized with JEP 454 in Java 22.
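
A minimal sketch of that approach, assuming Java 22 or newer with the finalized API (illustrative only, not a particular contestant's code; the file name is a placeholder): the file is memory-mapped into a shared arena and split into per-thread slices, whose boundaries would still have to be moved to the next newline so that no line is cut in half.

import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// inside a method that throws IOException
try (FileChannel channel = FileChannel.open(Path.of("measurements.txt"), StandardOpenOption.READ);
     Arena arena = Arena.ofShared()) {
  long size = channel.size();
  MemorySegment file = channel.map(FileChannel.MapMode.READ_ONLY, 0, size, arena);

  int threads = 8;
  long chunkSize = size / threads;
  for (int i = 0; i < threads; i++) {
    long start = i * chunkSize;
    long end = (i == threads - 1) ? size : (i + 1) * chunkSize;
    MemorySegment chunk = file.asSlice(start, end - start);
    // each worker thread parses its chunk, reading bytes with
    // chunk.get(ValueLayout.JAVA_BYTE, offset)
  }
}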

The final step of the general optimizations was reading and parsing each line as individual bytes instead of as a String. Together with the parallelization, this achieved the 14.5x improvement from 290 seconds to 20 seconds.
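
As a simplified illustration of byte-level parsing (a common approach in the submissions was to keep temperatures as integer tenths of a degree instead of parsing doubles; the helper below is a sketch, not code from the challenge):

// Parses a temperature such as "12.3" or "-5.0" (the 1BRC format always has exactly
// one fractional digit) starting at 'pos' and ending at the newline, returning the
// value as an int number of tenths of a degree: "-12.3" -> -123.
static int parseTemperature(byte[] buf, int pos) {
  int sign = 1;
  if (buf[pos] == '-') {
    sign = -1;
    pos++;
  }
  int value = 0;
  while (buf[pos] != '\n') {
    if (buf[pos] != '.') {
      value = value * 10 + (buf[pos] - '0');
    }
    pos++;
  }
  return sign * value;
}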

The first step of the niche optimizations is "Single Instruction, Multiple Data (SIMD) Within a Register" (SWAR) for parsing the data. In SIMD, a processor applies the same instruction to multiple pieces of data, which is faster than processing them one at a time. SWAR can be even faster than SIMD because the data lives in CPU registers, which are much quicker to access than main memory.
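
A sketch of the SWAR idea as it applies here (again an illustration, not a specific contestant's code): pack eight bytes of a line into one long and check all of them for the ';' separator at once.

private static final long SEMICOLONS = 0x3B3B3B3B3B3B3B3BL; // ';' (0x3B) in all 8 byte lanes

// Classic zero-byte detection: the result is non-zero iff one of the 8 bytes in
// 'word' equals ';'; with little-endian byte order, the lowest set bit marks the
// first occurrence.
static long semicolonMatchBits(long word) {
  long diff = word ^ SEMICOLONS;                             // 0x00 in matching lanes
  return (diff - 0x0101010101010101L) & ~diff & 0x8080808080808080L;
}

// Byte index (0-7) of the first ';' in 'word', or 8 if none is present.
static int firstSemicolonIndex(long word) {
  return Long.numberOfTrailingZeros(semicolonMatchBits(word)) >>> 3;
}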

Custom map implementations for storing the weather station data provided another performance boost. Because there were only 400 stations, custom hash functions also saved time. Other techniques included using the Unsafe class, superscalar execution, perfect value hashing, the "spawn trick" (as characterized by Morling), and using multiple parsing loops.
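
To give a flavor of such a custom map (a minimal sketch assuming at most a few hundred distinct stations; the actual submissions are considerably more elaborate): a fixed-size, open-addressing table keyed directly by the station-name bytes avoids HashMap's boxing and per-entry node allocations.

import java.util.Arrays;

static final int TABLE_SIZE = 1 << 14;        // power of two, far larger than ~400 stations
static final Stat[] TABLE = new Stat[TABLE_SIZE];

static final class Stat {
  byte[] name;
  int min = Integer.MAX_VALUE;
  int max = Integer.MIN_VALUE;
  long sum;
  int count;
}

// Returns the accumulator for the station name stored in buf[off, off + len),
// inserting a new entry on first sight; collisions are resolved by linear probing.
static Stat lookup(byte[] buf, int off, int len, int hash) {
  int idx = hash & (TABLE_SIZE - 1);
  while (true) {
    Stat s = TABLE[idx];
    if (s == null) {
      s = new Stat();
      s.name = Arrays.copyOfRange(buf, off, off + len);
      TABLE[idx] = s;
      return s;
    }
    if (Arrays.equals(s.name, 0, s.name.length, buf, off, off + len)) {
      return s;
    }
    idx = (idx + 1) & (TABLE_SIZE - 1);       // probe the next slot
  }
}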

Mechanical sympathy helped: the applications picked a file chunk size so that all chunks for the eight threads fit into the processor cache. And branchless programming ensured that the branch predictor had fewer mispredicted paths whose speculative work had to be discarded.
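
As a toy illustration of the branchless idea (my sketch, simplified to a single-digit integer part): the optional minus sign is folded into arithmetic, so every line runs through the same instructions instead of taking a data-dependent if/else path.

// Parses "X.Y" or "-X.Y" into tenths of a degree without branching on the sign.
static int parseSignedBranchless(byte[] buf, int pos) {
  int negative = (buf[pos] == '-') ? 1 : 0;                 // 0 or 1; no separate code path follows
  pos += negative;                                          // skip the '-' when present
  int value = (buf[pos] - '0') * 10 + (buf[pos + 2] - '0'); // digits around the '.'
  return (1 - 2 * negative) * value;                        // negate via arithmetic, not a branch
}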

The last two optimizations targeted how the JVM works. Its JIT compiler profiles the interpreted Java bytecode and compiles often-used methods into machine code. Additionally, the JVM creates a class list at startup and initializes the JDK. All of that slows down application startup and delays reaching top speed. The GraalVM Ahead-Of-Time (AOT) compiler Native Image moves compilation and as much initialization work as possible to build time, producing a native executable that starts at top speed. GraalVM Native Image ships alongside new Java versions.
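
For a dependency-free entry like the baseline, building and running a native executable could look roughly like this (illustrative command; the class and path names are assumptions, not the contestants' exact build):

native-image -cp target/classes dev.morling.onebrc.CalculateAverage_baseline calculate_average
./calculate_average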

Using GraalVM Native Image comes at the price of some constraints and a more expensive troubleshooting process, although the constraints do not affect most Java applications. Many application frameworks, such as Helidon, Micronaut, Quarkus, and Spring Boot, support GraalVM Native Image. Some libraries, especially older ones, do not, which may be a showstopper. These constraints were not a factor in the 1BRC, as external libraries were forbidden.

Finally, the JVM garbage collector frees up unused memory. But it also uses CPU time and memory. That is why applications that did not need garbage collection used the "no-op" Epsilon garbage collector.
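				
On a regular JVM, Epsilon is enabled with two flags (the main class below is a placeholder; Epsilon allocates but never reclaims memory, so it only suits programs with a bounded allocation footprint):

java -XX:+UnlockExperimentalVMOptions -XX:+UseEpsilonGC <main class>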

The niche optimizations provided a further 13x improvement, down to 1.5 seconds. The fastest application with a JIT compiler took 2.367 seconds, while the fastest with GraalVM Native Image finished in 1.535 seconds.

Thomas Wuerthinger, GraalVM founder and project lead and co-author of the winning 1BRC entry, published his 10 steps of 1BRC optimizations. His baseline solution takes 125 seconds, compared to Morling's 290 seconds, probably because it runs on a newer 2022 Intel desktop processor. Unsurprisingly, his first step is using GraalVM Native Image. After just two steps, his solution is already down to 5.7 seconds.

Morling was positively surprised by the community that quickly formed around the 1BRC. Community members contributed a new server, a test suite, and scripts for configuring the server environment and running the applications. Morling has not yet thought of a new challenge.
