Key Takeaways
- Applications can select an appropriate JIT compiler to produce optimized machine code that approaches native performance.
- Tiered compilation consists of five levels of compilation.
- Tiered compilation provides great startup performance and guides further levels of compilation to provide high performance optimizations.
- JVM switches provide diagnostic information about the JIT compilation.
- Optimizations like Intrinsics and Vectorization further enhance performance.
The OpenJDK HotSpot Java Virtual Machine, fondly known as the Java VM or the JVM, consists of two principal components: the execution engine and the runtime. The Java VM and Java APIs comprise the Java Runtime Environment, also known as the JRE.
In this article we will explore the execution engine, particularly just-in-time (JIT) compilation, as well as runtime optimizations in the OpenJDK HotSpot VM.
Java VM’s Execution Engine and Runtime
The execution engine consists of two major components: the garbage collector (which reclaims garbage objects and provides automatic memory/heap management) and the JIT compiler (which converts bytecode to executable machine code).
In OpenJDK 8, the "tiered compiler" is the default server compiler. HotSpot users can still select the non-tiered server compiler (also known as "C2") by disabling the tiered compiler (-XX:-TieredCompilation).
We will learn more about these compilers shortly.
The Java VM’s runtime handles class loading, bytecode verification, and other important functions. One of these functions is "interpretation", and we will be talking more about it shortly. You can read more about the Java VM’s runtime in the HotSpot Runtime Overview documentation.
Adaptive JIT and Runtime Optimizations
The JVM is the engine behind Java’s "write once, run anywhere" capability. Once a Java program is compiled into bytecode, it can be executed by any JVM instance.
OpenJDK HotSpot VM converts bytecode into machine executable code by "mixed-mode" execution. With "mixed-mode", the first step is interpretation, which converts bytecode into assembly code using a description table. This pre-defined table, also known as the "template table", has assembly code for each bytecode instruction.
Interpretation begins at JVM startup and is the slowest form of bytecode execution. Java bytecode is platform independent, but interpretation and compilation into machine executable code are definitely dependent on the platform. In order to generate faster, more efficient machine code that is adapted to the underlying platform, the runtime kicks off just-in-time (JIT) compilation. JIT compilation is an adaptive optimization for methods that are proven to be performance critical. In order to determine these performance-critical methods, the JVM continually monitors the code for the following critical metrics:
- Method entry counts - assigns a call counter to every method.
- Loop back-branch (commonly known as loop back-edge) counts - assigns a counter to every loop that has executed.
A particular method is considered performance critical when its method-entry and loop back-edge counters cross a compilation threshold (-XX:CompileThreshold) set by the runtime. The runtime uses these metrics to determine whether to compile the performance-critical methods themselves or their callees. Similarly, a loop is considered performance critical if its back-branch counter exceeds a predetermined threshold (based on the compilation threshold). When the loop back-edge counter crosses its threshold, only that loop is compiled. The compiler optimization for loop-backs is called on-stack replacement (OSR), since the JVM replaces the interpreted code with compiled code while the method is still running on the stack.
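To make this concrete, here is a minimal sketch (the class, method, and loop counts are hypothetical and chosen only for illustration) of code that will cross those counters: the small square() method is invoked often enough to trip its invocation counter, while the long-running loop in main() drives up the back-edge counter and can trigger an OSR compilation. Running it with -XX:+PrintCompilation (described later in this article) lets you watch both compilations happen.

public class HotMethodDemo {

    // A small, frequently invoked method; its invocation counter
    // crosses the compile threshold and triggers a regular compilation.
    static int square(int x) {
        return x * x;
    }

    public static void main(String[] args) {
        long sum = 0;
        // The back-edge counter of this loop can trigger an on-stack
        // replacement (OSR) compilation of main() itself.
        for (int i = 0; i < 1_000_000; i++) {
            sum += square(i);
        }
        System.out.println(sum);
    }
}

A typical way to run it would be: java -XX:+PrintCompilation HotMethodDemo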
OpenJDK HotSpot VM has two different compilers, each with its own compilation thresholds:
- The client or C1 compiler has a low compilation threshold of 1,500, to help reduce startup times.
- The server or C2 compiler has a high compilation threshold of 10,000, which helps generate highly optimized code for performance critical methods that are determined to be in the critical execution path of the application.
Five Levels of Tiered Compilation
With the introduction of tiered compilation, OpenJDK HotSpot VM users can benefit from improved startup times with the server compiler.
Tiered compilation has five tiers of optimization. It starts at tier 0, the interpreter tier, where instrumentation provides information on the performance-critical methods. Soon enough, tier 1, the simple C1 (client) compiler, optimizes that code; at tier 1 there is no profiling information. Next comes tier 2, where only a few methods are compiled (again by the client compiler); at tier 2, profiling information is gathered for those methods' entry counters and loop back-branches. Tier 3 then sees all the methods getting compiled by the client compiler with full profiling information, and finally tier 4 avails itself of C2, the server compiler.
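If you want to see the effect of the tiers for yourself, one simple experiment (MyApp is a hypothetical main class) is to cap the top compilation tier with the -XX:TieredStopAtLevel option:

java -XX:TieredStopAtLevel=1 -XX:+PrintCompilation MyApp

Stopping at tier 1 keeps only the simple C1 compiler in play, which typically trades peak throughput for faster startup; with the default setting, the compilation levels printed by -XX:+PrintCompilation will progress up to tier 4.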
Tiered Compilation and Effects on Code Cache
When compiling with the client compiler (tier 2 onwards), the code is profiled by the client compiler during startup, when the critical execution paths are still warming up. This helps produce better profiling information than interpretation alone can. The compiled code resides in a cache known as the "code cache". The code cache has a fixed size, and when it is full, the Java VM will cease method compilation.
Tiered compilation has its own set of thresholds for every level, e.g. -XX:Tier3MinInvocationThreshold, -XX:Tier3CompileThreshold, and -XX:Tier3BackEdgeThreshold. The minimum invocation threshold at tier 3 is 100 invocations. Compared to the non-tiered C1 threshold of 1,500, you can see that tiered compilation occurs much more frequently, generating a lot more profiled information for client-compiled methods. The code cache for tiered compilation must therefore be a lot larger than the code cache for non-tiered compilation, and so the default code cache size for tiered compilation in OpenJDK 8 is 240MB as opposed to the non-tiered default of 48MB.
The Java VM will provide warnings when the code cache is full. Users are encouraged to increase the code cache size by using the -XX:ReservedCodeCacheSize option.
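For example, an application that runs out of code cache could reserve a larger one at startup (the size and the application jar below are purely illustrative; the right value depends on the application):

java -XX:ReservedCodeCacheSize=512m -jar myApp.jar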
Understanding Compilation
In order to visualize which methods get compiled and when, OpenJDK HotSpot VM provides a very useful command-line option called -XX:+PrintCompilation, which logs each compilation and also reports when the code cache becomes full and when compilation stops.
Let’s look at some examples:
567 693 % ! 3 org.h2.command.dml.Insert::insertRows @ 76 (513 bytes)
656 797 n 0 java.lang.Object::clone (native)
779 835 s 4 java.lang.StringBuffer::append (13 bytes)
The above output is formatted as:
timestamp compilation-id flags tiered-compilation-level class:method <@ osr_bci> code-size <deoptimization>
where,
- timestamp is the time from Java VM start
- compilation-id is an internal reference id
- flags could be one of the following:
  - % : is_osr_method (the @ sign indicates the bytecode index for OSR methods)
  - s : is_synchronized
  - ! : has_exception_handler
  - b : is_blocking
  - n : is_native
- tiered-compilation-level indicates the compilation tier when tiered compilation is enabled
- Method will have the method name, usually in the ClassName::method format
- @osr_bci is the bytecode index at which the OSR happened
- code-size is the total bytecode size
- deoptimization indicates whether a method was de-optimized and made not entrant or zombie (more on this in the section titled ‘Dynamic De-optimization’).
Based on the above key, we can tell that line 1 of our example
567 693 % ! 3 org.h2.command.dml.Insert::insertRows @ 76 (513 bytes)
had a timestamp of 567 and a compilation-id of 693. The method had an exception handler, as indicated by ‘!’. We can also tell that the tiered compilation level was 3 and that it was an OSR method (as indicated by ‘%’) with a bytecode index of 76. The total bytecode size was 513 bytes. Please note that 513 is the bytecode size and not the compiled code size.
Line 2 of our example shows that
656 797 n 0 java.lang.Object::clone (native)
the JVM facilitated a native method call and line 3 of our example
779 835 s 4 java.lang.StringBuffer::append (13 bytes)
shows that the method was compiled at tier 4 and is synchronized.
Dynamic De-optimization
We know that Java does dynamic class loading, and that the Java VM checks the inter-dependencies at every dynamic class load. When a previously optimized method is no longer relevant, OpenJDK HotSpot VM will perform a dynamic de-optimization of that method. Adaptive optimization aids dynamic de-optimization; in other words, dynamically de-optimized code reverts to, or moves on to, a previous or newly compiled level, as shown in the following example. (Note: this is the output generated when PrintCompilation is enabled on the command line):
573 704 2 org.h2.table.Table::fireAfterRow (17 bytes)
7963 2223 4 org.h2.table.Table::fireAfterRow (17 bytes)
7964 704 2 org.h2.table.Table::fireAfterRow (17 bytes) made not entrant
33547 704 2 org.h2.table.Table::fireAfterRow (17 bytes) made zombie
This output shows that at timestamp 7963, fireAfterRow is compiled at tier 4. Right after that, at timestamp 7964, the previous compilation of fireAfterRow at tier 2 is made not entrant. And after a while, that fireAfterRow compilation is made zombie; that is, the previous code is reclaimed.
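A common way to provoke this behavior is a call site that the compiler optimistically assumes is monomorphic. In the hedged sketch below (all class names are hypothetical), the area() call is first warmed up with a single receiver type; once a second implementation shows up, the optimistic assumption no longer holds and the previously compiled code can be made not entrant and later recompiled.

interface Shape {
    double area();
}

class Square implements Shape {
    public double area() { return 4.0; }
}

class Circle implements Shape {
    public double area() { return Math.PI; }
}

public class DeoptDemo {

    // While only Square is ever seen here, the JIT can devirtualize and
    // inline the area() call; Circle's later appearance invalidates that
    // assumption and can trigger a de-optimization.
    static double total(Shape s) {
        return s.area();
    }

    public static void main(String[] args) {
        double sum = 0;
        Shape square = new Square();
        for (int i = 0; i < 200_000; i++) {
            sum += total(square);   // warm up with a single receiver type
        }
        Shape circle = new Circle();
        for (int i = 0; i < 200_000; i++) {
            sum += total(circle);   // a new receiver type appears
        }
        System.out.println(sum);
    }
}

Running it with -XX:+PrintCompilation and looking for "made not entrant" entries shows the de-optimization at work.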
Understanding Inlining
One of the biggest advantages of adaptive optimization is the ability to inline performance-critical methods. This avoids the method-invocation overhead for these critical methods by replacing the invocations with the actual method bodies. There are a lot of "tuning" options for inlining, based on size and invocation thresholds, and inlining has been thoroughly studied and optimized to very near its maximum potential.
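As a rough illustration, tiny accessor-style methods like the hypothetical getValue() below are prime inlining candidates; the size-based tuning knobs include -XX:MaxInlineSize (the bytecode-size limit for ordinary call sites) and -XX:FreqInlineSize (the larger limit applied to hot call sites). The sketch assumes default thresholds.

public class InlineDemo {

    private int value = 42;

    // A tiny accessor, only a few bytes of bytecode: well under the
    // inlining size thresholds, so the JIT will typically replace calls
    // to it with a direct field load.
    int getValue() {
        return value;
    }

    public static void main(String[] args) {
        InlineDemo demo = new InlineDemo();
        long sum = 0;
        for (int i = 0; i < 1_000_000; i++) {
            sum += demo.getValue();
        }
        System.out.println(sum);
    }
}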
If you want to spend time looking at the inlining decisions, you can use a diagnostic Java VM option called -XX:+PrintInlining (as a diagnostic option, it has to be unlocked with -XX:+UnlockDiagnosticVMOptions). PrintInlining can be a very useful tool to understand the decisions, as shown in the following example:
@ 76 java.util.zip.Inflater::setInput (74 bytes) too big
@ 80 java.io.BufferedInputStream::getBufIfOpen (21 bytes) inline (hot)
@ 91 java.lang.System::arraycopy (0 bytes) (intrinsic)
@ 2 java.lang.ClassLoader::checkName (43 bytes) callee is too large
Here you can see the location of the inlining and the total bytes inlined. Sometimes you will see tags such as "too big" or "callee is too large", which indicate that inlining didn’t happen because the thresholds were exceeded. The output on line 3 above shows an "intrinsic" tag; let’s learn more about intrinsics in the next section.
Intrinsics
Usually, the OpenJDK HotSpot VM JIT compiler executes generated code for performance-critical methods, but at times some methods have a very common pattern, e.g. java.lang.System::arraycopy as shown in the PrintInlining output in the previous section. Such methods can be hand-optimized to generate more performant code, similar to having your own native methods but without the overhead. These intrinsics can be effectively inlined, just like the Java VM would inline regular methods.
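For example, a hand-written copy loop and a call to java.lang.System::arraycopy do the same logical work, but on most platforms the latter call is recognized as an intrinsic and replaced with hand-optimized (often vectorized) machine code. A minimal sketch, with a hypothetical class name:

import java.util.Arrays;

public class IntrinsicDemo {

    public static void main(String[] args) {
        int[] src = new int[1024];
        Arrays.fill(src, 7);

        // A plain Java copy loop, compiled as ordinary JIT-generated code
        // (although the JIT may still recognize and optimize this pattern).
        int[] dst1 = new int[src.length];
        for (int i = 0; i < src.length; i++) {
            dst1[i] = src[i];
        }

        // Intrinsic candidate: the JIT can replace this call with
        // specialized machine code rather than a normal method call.
        int[] dst2 = new int[src.length];
        System.arraycopy(src, 0, dst2, 0, src.length);

        System.out.println(Arrays.equals(dst1, dst2));
    }
}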
Vectorization
When talking about intrinsics, I would like to highlight a common compiler optimization called vectorization. Vectorization can be applied wherever the underlying platform (processor) can handle special parallel computation/vector instructions known as "SIMD" instructions (single instruction, multiple data). SIMD instructions and "vectorization" help with data-level parallelism by operating on larger cache-line size (64 bytes) datasets.
HotSpot VM provides two different levels of vector support -
- Assembly stubs for counted inner loops;
- SuperWord Level Parallelism (SLP) support for auto-Vectorization.
In the first case, assembly stubs provide vector support for counted inner loops: while working in the nested loop, the inner loop can be optimized and replaced by vector instructions. This is similar to intrinsics.
The SLP support in HotSpot VM is based on a paper from MIT Labs. Right now, HotSpot VM only optimizes a destination array with unrolled constants, as shown in the following example provided by Vladimir Kozlov, a senior member of the Oracle compiler team who has contributed to various compiler optimizations including auto-vectorization support:
a[j] = b + c * z[i]
So, after the above is unrolled, it can be auto-vectorized.
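To make this more tangible, here is a minimal sketch of the kind of simple counted loop over arrays that, once unrolled, is a candidate for SLP auto-vectorization; whether it is actually vectorized depends on the JDK version and the target CPU, and the class name is hypothetical.

public class VectorDemo {

    // A counted loop with independent iterations: after unrolling, the
    // superword pass can combine adjacent array operations into SIMD
    // instructions on supported processors.
    static void scale(float[] a, float[] z, float b, float c) {
        for (int i = 0; i < a.length; i++) {
            a[i] = b + c * z[i];
        }
    }

    public static void main(String[] args) {
        float[] a = new float[4096];
        float[] z = new float[4096];
        for (int i = 0; i < z.length; i++) {
            z[i] = i;
        }
        // Warm the method up so the server compiler (tier 4) kicks in.
        for (int iter = 0; iter < 10_000; iter++) {
            scale(a, z, 1.5f, 0.5f);
        }
        System.out.println(a[a.length - 1]);
    }
}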
Escape Analysis
Escape analysis is another perk of adaptive optimization. Escape analysis (EA for short) takes the entire intermediate-representation graph into consideration in order to determine whether any allocations are "escaping", that is, whether an allocated object is any of the following:
- stored to a static field or a nonstatic field of an external object;
- returned from a method;
- passed as a parameter to another method where it escapes.
If the allocated object doesn’t escape the compiled method and is not passed as a parameter, then the allocation can be removed and the field values can be stored in registers. And if the allocated object doesn’t escape the compiled method but is passed as a parameter, the Java VM can still remove locks associated with the object and use optimized compare instructions when comparing it to other objects.
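The following sketch illustrates the first case; the Point class is hypothetical. The object allocated in distance() is never stored to a field, returned, or passed on, so it does not escape, and with escape analysis the allocation can be eliminated (scalar-replaced) and its fields kept in registers.

public class EscapeDemo {

    static class Point {
        final double x;
        final double y;
        Point(double x, double y) { this.x = x; this.y = y; }
    }

    // The Point allocated here does not escape this method, so the JIT
    // can remove the allocation and keep x and y in registers.
    static double distance(double x, double y) {
        Point p = new Point(x, y);
        return Math.sqrt(p.x * p.x + p.y * p.y);
    }

    public static void main(String[] args) {
        double sum = 0;
        for (int i = 0; i < 1_000_000; i++) {
            sum += distance(i, i + 1);
        }
        System.out.println(sum);
    }
}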
Other Common Optimizations
There are other OpenJDK HotSpot VM optimizations that come with adaptive JIT compilation -
- Range check elimination - wherein the JVM doesn’t have to perform an index out-of-bounds check if it can prove that the array index never crosses the array’s bounds.
- Loop unrolling - reduces the number of back-branches (and hence loop-control overhead) by replicating the loop body, which in turn lets the JVM apply other common optimizations (such as loop vectorization) wherever needed; see the sketch after this list.
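Here is a minimal sketch (with hypothetical names) of a loop shape that benefits from both: because the index runs from 0 to a.length, the JVM can prove the accesses to a[i] stay in bounds and drop those per-access range checks, and the simple loop body is a natural candidate for unrolling and, from there, vectorization.

public class LoopOptDemo {

    // The index i provably stays within [0, a.length), so the bounds
    // check on a[i] can be eliminated (and the check on b[i] can
    // typically be hoisted out of the loop). The loop is also a
    // candidate for unrolling.
    static long sumProducts(int[] a, int[] b) {
        long sum = 0;
        for (int i = 0; i < a.length; i++) {
            sum += (long) a[i] * b[i];
        }
        return sum;
    }

    public static void main(String[] args) {
        int[] a = new int[10_000];
        int[] b = new int[10_000];
        for (int i = 0; i < a.length; i++) {
            a[i] = i;
            b[i] = i + 1;
        }
        long total = 0;
        for (int iter = 0; iter < 5_000; iter++) {
            total += sumProducts(a, b);
        }
        System.out.println(total);
    }
}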
About the Author
Monica Beckwith is a Java Performance Consultant. Her past experience includes working with Oracle/Sun and AMD, optimizing the JVM for server-class systems. Monica was voted a Rock Star speaker @JavaOne 2013 and was the performance lead for the Garbage First Garbage Collector (G1 GC). You can follow Monica on Twitter @mon_beck.