Benchmarking pitfalls
Getting a benchmark, and in particular a microbenchmark, to answer the question that is actually being asked can be surprisingly difficult. More often than not, benchmarks end up measuring something completely different from what you set out to measure. Even when you do manage to measure the effect in question, those measurements can be corrupted by other effects. It was clear at the start of this exercise that this benchmark would need to be thoroughly scrutinized and reviewed by others if I were to avoid falling into the trap of reporting bad benchmarking numbers. Aside from having the benchmark peer reviewed, I used a number of tools and techniques to validate the results, which I'll discuss in the following sections.
Statistical treatment of results
The vast majority of operations that a computer executes complete in a constant amount of time. In my experience, even non-deterministic operations complete in a near-constant amount of time under most conditions. It is this property of computing that gives us a tool, a measurement, to let us know when things aren't going as expected. That tool is statistics and the measurement is variance. I don't want to go beyond a hand-waving explanation of why we should report more than just averages, but the reasoning goes like this: if I give the CPU a fixed number of instructions and it doesn't complete them within a relatively constant time frame, it is an indication that there is some external influence on my measurements. When I see a large variance in my benchmarking numbers, I know I have to look for that external influence and deal with it.
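To make this concrete, here is a minimal sketch of how benchmark timings can be given this statistical treatment. The class and the runBenchmark() stand-in are mine, not part of the benchmark discussed in this article; substitute your own code under test.

public class VarianceReport {
    private static final int RUNS = 30;

    public static void main(String[] args) {
        long[] timings = new long[RUNS];
        for (int i = 0; i < RUNS; i++) {
            long start = System.currentTimeMillis();
            runBenchmark();
            timings[i] = System.currentTimeMillis() - start;
        }
        double mean = 0;
        for (long t : timings) mean += t;
        mean /= RUNS;
        double variance = 0;
        for (long t : timings) variance += (t - mean) * (t - mean);
        variance /= RUNS;
        // a variance that is large relative to the mean signals outside interference
        System.out.println("mean: " + mean + " ms, variance: " + variance);
    }

    private static void runBenchmark() {
        // hypothetical stand-in for one fixed-work iteration of the code under test
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 1000000; i++) sb.append('x');
    }
}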
While this variance effect is exaggerated in microbenchmarks, it is muted in larger ones. In larger benchmarks, the measured aspects of the application under study interfere with each other, and this in itself introduces some variance. Even so, variance still provides good information as a measure of the level of interference. Under a constant load I would expect the variance to, well, not vary all that much. Runs whose variance is noticeably larger or smaller than the rest are an indication that the benchmark is not properly isolated or configured. This different treatment of the same measurement really points out the difference between a full benchmark and a microbenchmark.
One last point: low variance still isn't an indication that you are measuring what you think you are measuring. It is only an indication that, whatever it is you are measuring, the measurement itself is most likely clean enough to answer the ultimate question.
Warming up the method cache
One of the activities that will interfere with your benchmark is JIT compilation of your code. The responsible agent, Hotspot, constantly profiles your application looking for opportunities to apply optimizations. When it finds one, it directs a JIT compiler to recompile the section of code in question. It then uses a technique known as On Stack Replacement (OSR) to switch to executing the new code. Performing an OSR has all kinds of knock-on effects, including having to pause the executing thread. Of course all of this activity interferes with our benchmark, and it is exactly this type of interference that will skew our results. We have two tools at our disposal to identify when our code has been JIT'ed: the first is (of course) variance, and the second is the flag -XX:+PrintCompilation. Fortunately, most if not all of the code will be JIT'ed fairly early in the benchmark, so we can treat compilation like any other startup anomaly. All we need to do is run the benchmark until all of the code has been JIT'ed and only then start measuring. This warmup phase is commonly known as "warming up the method cache".
Most JVMs run in interpreted and native modes at the same time; this is known as mixed-mode execution. Over time, Hotspot and the JIT convert interpreted code into native code using profiling information. To decide which optimizations should be applied, Hotspot samples method invocations and branches. Once a method reaches a specified threshold, Hotspot directs the JIT to produce native code for it. The threshold can be set using the -XX:CompileThreshold flag: for example, -XX:CompileThreshold=10000 will cause Hotspot to compile code after it has been executed 10,000 times.
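For example, combining that flag with -XX:+PrintCompilation lets you watch each method cross the threshold and get compiled during warmup (LockTest is the benchmark class used throughout this article):

>java -server -XX:+PrintCompilation -XX:CompileThreshold=10000 LockTest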
Heap management
The next activity to consider is garbage collection, or what is more formally known as heap management. Several memory-management activities occur regularly during the execution of any application: resizing heap spaces, reclaiming memory that is no longer used, moving data from one place to another, and so on. All of these maintenance activities require the JVM to compete with your application for the CPU. The question is: should benchmark measurements include the time needed for this maintenance? The answer depends on the type of question you're trying to answer. In this instance I'm interested only in lock acquisition costs, which means I need to make sure my measurements exclude garbage collection pauses. Once again, variance can tell us that something is bothering our benchmark, and when that happens, garbage collection is a usual suspect. The best way to convict it is to turn on GC logging using the -verbose:gc flag.
In my benchmark I'm doing a lot of String, StringBuffer, and StringBuilder manipulation. All told, the benchmark creates about 40 million objects on each run. With that level of object churn there was no question that garbage collection was going to be a problem. I used two techniques to deal with it. First, I increased the size of Eden to prevent garbage collection from running during a single iteration of the benchmark. For this I used the following command line:
>java -server -XX:+EliminateLocks -XX:+UseBiasedLocking -verbose:gc -XX:NewSize=1500m -XX:SurvivorRatio=200000 LockTest
I then included the code in Listing 1 to prepare the heap for the next iteration.
System.gc();
Thread.sleep(1000);
Listing 1. Running GC followed by a short sleep
The purpose of the sleep is to allow the garbage collector to finish up after it has released the other threads. Note that some processors will step down their clock speed when there is no activity on the CPU, so introducing a sleep may add a delay to the benchmark while the CPU clock spins back up. If your processor supports this feature, you may have to go into the firmware and turn off that handy power-saving feature for the duration of the benchmarking exercise.
The flag settings used above did not prevent GC from running altogether. That said, it only ran once for each test case, and each time the pause was small enough to have little overall effect on the results, which was good enough for our purposes.
Biased locking delay
There was yet another effect that had a significant impact on the benchmarking results. Whereas most optimizations are applied very early on, for some unknown reason biased locking was only applied three to four seconds into the run. At the risk of sounding like a broken record, variance once again played an important role in determining that this was a problem. The flag -XX:+TraceBiasedLocking was helpful in tracking it down. Warmup times were adjusted to account for this delay.
Other Hotspot-provided optimizations
Hotspot doesn't stop profiling your code once it has applied an optimization. Instead, it keeps profiling, looking for further opportunities to optimize. With locking, a number of optimizations are simply not allowed because they would break the specification laid out by the Java Memory Model. Those restrictions quickly melt away, however, if the lock can be JIT'ed away. In this single-threaded benchmark it is quite safe for Hotspot to elide the lock, and doing so opens the door to other optimizations such as method inlining, loop-invariant hoisting, and dead-code elimination.
If we consider the code in this sidebar, we can see that both A and B are invariant, so we can avoid the repeated calculation by hoisting it out of the loop and introducing a third variable, as shown in the second listing. Traditionally it has been the programmer's responsibility to perform this task, but Hotspot is more than capable of recognizing loop invariants and hoisting them out of loops. Thus we may write the code in listing 2 while executing something that looks more like listing 3.

int A = 1;
int B = 2;
int sum = 0;
for (int i = 0; i < someThing; i++) sum += A + B;

Listing 2. Loop containing invariant

int A = 1;
int B = 2;
int sum = 0;
int invariant = A + B;
for (int i = 0; i < someThing; i++) sum += invariant;

Listing 3. Loop with hoisted invariant
It is debatable whether these optimizations should be allowed, or whether we should somehow try to prevent them from being applied. At the very least we should know when they are being applied. Dead-code elimination is one optimization we definitely want to defeat, otherwise our benchmark will be measuring nothing! Hotspot may recognize that we are not using the results of the concatBuffer and concatBuilder operations, and that these operations have no side effects, in which case there isn't any point in executing the code at all. Once it has determined that the code is "dead", it can direct the JIT to eliminate it. Fortunately, my benchmark has confused Hotspot so that it doesn't recognize this optimization... yet.
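A simple defense, should Hotspot ever catch on, is to consume every result so that the computation has an observable side effect. The sketch below shows one way to do it inside the benchmark's measurement loop; the checksum accumulator is my addition, not part of the original benchmark:

// consume every result so Hotspot cannot prove the loop is dead code
long checksum = 0;
for (int i = 0; i < MAX; i++) {
    checksum += concatBuilder("Josh", "James", "Duke").length();
}
// printing the accumulated value keeps the results observably live
System.out.println("checksum: " + checksum);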
If inlining is prevented by the presence of a lock but not prevented without it, then we need to make sure that we don't include the cost of the extra method call in our results. One technique you can currently use to fool Hotspot is to introduce an interface (listing 4).
public interface Concat {
    String concatBuffer(String s1, String s2, String s3);
    String concatBuilder(String s1, String s2, String s3);
}

public class LockTest implements Concat {
    ...
}

Listing 4. Using an interface to prevent method inlining
Another way to prevent inlining is to use the command-line option -XX:-Inline. I verified that differences due to method inlining were not a factor in any of the measurements reported in this benchmark.
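For example, using the same LockTest class as before:

>java -server -XX:-Inline LockTest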
Trace output
Finally, I'd like to show you some of the output produced when running with the flags shown below.
>java -server -XX:+DoEscapeAnalysis -XX:+PrintCompilation -XX:+EliminateLocks -XX:+UseBiasedLocking -XX:+TraceBiasedLocking LockTest
Figure 1. Trace output from benchmark
By default the JVM starts 12 threads, including the main thread, Reference Handler, Finalizer, Attach Listener, and so on. The first section, in grey, reports on the alignment of these thread objects so that they can work with biased locking (note that all addresses end in 00). You can safely ignore this section. The next section, in yellow, contains information about compiled methods. If we look at lines 5 and 12, we can see that they are marked with an extra "s"; the information in table 1 tells us that those methods are synchronized. Lines containing a "%" have been compiled using OSR. The line annotated in red is where biased locking is enabled. Finally, the aqua-blue area at the bottom is where the benchmark timings start. You can see from the output that by the time the benchmark starts, all of the compiling has taken place, which verifies that the warmup period was long enough. For more details about the logging output you may like to visit [http://forum.java.sun.com/thread.jspa?forumID=27&messageID=980887&threadID=235212] and [http://www.unixville.com/~moazam/stories/2004/06/17/thePrintcompilationFlagAndHowToReadItsOutput.html].
Table 1. Compilation codes
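For reference, the annotation codes commonly documented for -XX:+PrintCompilation output in this generation of Hotspot are listed below; treat this as a guide rather than an authoritative specification:

b : blocking compile
* : generating a native wrapper
s : synchronized method
! : method has exception handlers
% : compiled via on stack replacement (OSR)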
Single core results
While most of my colleagues were using Intel Core 2 Duo processors, a few of them were using older uni-processor machines. On these older machines, the results of the StringBuffer benchmark were almost identical to those produced by the StringBuilder implementation. Since the difference could be explained by a number of factors, I needed another test to eliminate as many of the possibilities as I could. The best option turned out to be re-running the benchmark on my machine with one of the cores of the Core 2 Duo turned off in the BIOS. The results of those runs can be seen in figure 2.
Figure 2. Single-core Performance
As was the case with the multi-core runs, the baseline values were obtained with all three optimizations turned off. Once again, StringBuilder performance remained flat throughout. More interestingly, though still marginally slower than StringBuilder, StringBuffer performance was much closer to that of StringBuilder than it was on the multi-core platform. It is this test that starts to paint a picture of what is really being measured by this benchmark.
In a multi-core world, sharing data between threads takes on a whole new look. All modern CPUs use local memory caches to minimize the latency of fetching instructions and data from memory. When we use a lock, we cause a memory barrier to be inserted into the execution path. The memory barrier is a signal telling the CPU that it must coordinate with all the other CPUs in order to fetch the most up-to-date values. To do this, the CPUs communicate with each other, which causes each processor to pause the application threads it is currently running. How long this takes turns out to be a function of the CPU's memory model: more conservative memory models may offer more thread safety, but they also take longer to coordinate among all the cores. With the Core 2 Duo, enabling the second core increased the runtime of this fixed-work benchmark from 3731 ms to 6574 ms, that is, to 176% of the single-core time. Clearly any help that Hotspot can give us here could bring marked improvements in the overall performance of our applications.
Does escape analysis actually work?
One of the optimizations that was clearly available here yet wasn't exploited was lock elision. That said, lock elision has only recently been implemented, and it depends upon escape analysis, a profiling technique which has itself only recently been implemented. To be fair, it's been well advertised that these technologies work for only a limited number of cases. For instance, escape analysis works for simple loops that increment a local variable within a block synchronized on a locally declared lock [http://blog.nirav.name/2007_02_01_archive.html]. It also works as promised for the Monte Carlo test of the Scimark2 benchmark (see [http://math.nist.gov/scimark2/index.html]).
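In rough terms, the pattern that escape analysis is reported to handle looks like the following sketch. This is my own minimal example of the "local variable, locally declared lock" case described above, not code taken from either benchmark:

// The lock object is allocated locally and never escapes the method,
// so Hotspot can prove no other thread could ever contend for it
// and may elide the synchronization entirely.
public static int localLockSum(int max) {
    Object lock = new Object();
    int sum = 0;
    for (int i = 0; i < max; i++) {
        synchronized (lock) {
            sum++; // increments a local variable under a local lock
        }
    }
    return sum;
}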
Putting escape analysis to the test
So, why does it work for these cases and not for our benchmark? I tried to reduce the benchmark by inlining the required methods of StringBuffer and StringBuilder. I also modified the code in an attempt to coerce escape analysis into working, expecting that at some point the locks would be elided and I'd see a boost in performance. It would be dishonest to say that working through this benchmark wasn't both confusing and frustrating. I had to hit ctrl-z many times in my editor to get back to a version where I thought elision had been working, only to find that for some reason it no longer did. Sometimes lock elision just started working for no apparent reason.
Finally I came to recognize that there seems to be some relation between the enabling of lock elision and the data size of the locked object. You can see this when you run the code in listing 2 (the EATest class at the end of this article). In the first pair of runs below there is no difference in the run times, which implies that -XX:+DoEscapeAnalysis had no effect.
>java -server -XX:-DoEscapeAnalysis EATest
thread unsafe: 941 ms.
thread safe: 1960 ms.
Thread safety overhead: 208%
>java -server -XX:+DoEscapeAnalysis EATest
thread unsafe: 941 ms.
thread safe: 1966 ms.
Thread safety overhead: 208%
In the next two runs, I removed the unused field index from the class ThreadSafeObject. As you can see, there is now a significant boost in performance when escape analysis is turned on.
>java -server -XX:-DoEscapeAnalysis EATest
thread unsafe: 934 ms.
thread safe: 1962 ms.
Thread safety overhead: 210%
>java -server -XX:+DoEscapeAnalysis EATest
thread unsafe: 933 ms.
thread safe: 1119 ms.
Thread safety overhead: 119%
These escape analysis numbers hold for both Windows and Linux. On Mac OS X, however, the extra unused variable has no effect: both versions of the benchmark come in at 120%. This leads me to believe that the range of situations in which escape analysis is effective is a bit wider on Mac OS X. My guess is that the optimization is implemented very conservatively and is disabled early based on various conditions, such as the data size of the lock object and OS-specific features.
Conclusions
When I first started this exercise to examine the effectiveness of Hotspot's lock optimizations, I figured it would take a couple of hours of my time and in the end I'd have a useful blog entry. But as is the case with just about any benchmark, the process of verifying and interpreting the results ended up lasting many weeks. It also required that I collaborate with a number of other experts, who in turn spent more time validating the benchmark and offering their interpretations. Even after all of this work, it is still impossible to say much of anything beyond which optimizations appeared to work and which ones didn't. Though this article quotes specific results, they are specific to my hardware, and one can only speculate whether you would see the same kinds of results on your systems. I also started with what I thought was a pretty reasonable microbenchmark, but in the end it took considerable tweaking to satisfy both myself and the people who reviewed the code that the benchmark wasn't being busted by Hotspot applying other, unwanted optimizations. In short, the exercise turned out to be much more difficult than I first anticipated.
If you need to run a multi-threaded application on a multi-core machine and you care about performance, then clearly you need to be continuously upgrading to the latest revision of whichever version of the JDK you're working with. Many, but not all, of the optimizations are back-ported from the latest version to previous ones. You will also need to ensure that all of the threading optimizations are enabled. In Java 6.0 they are enabled by default, but in Java 5.0 you will need to explicitly set them on the command line. And if you run a single-threaded application on a multi-core machine, you might even make it run faster by switching off all but one core.
At a lower level, locking overhead is much lower on a single core than on a dual-core processor. Inter-core coordination, that is, memory-barrier semantics, clearly has its cost, as evidenced by the performance of the benchmark when running on my system with one of the cores turned off. It is apparent that we really need the threading optimizations to reduce this overhead. Fortunately, lock coarsening and, in particular, biased locking did appear to have a significant effect on benchmark performance. I was hoping that escape analysis combined with lock elision would have had a much greater influence than it did; this technology works, but only in a very limited number of cases. To be fair, escape analysis is still in its infancy, and it will take time to mature this complex tool.
In conclusion, the best benchmark is your application running on your hardware. The best this article has to offer are some insights into what may be going on if your multi-threaded performance isn't quite what you hoped it would be.
About Jeroen Borgers
Jeroen Borgers is a senior consultant with Xebia - IT Architects. Xebia is an international IT consultancy and project organization specialized in enterprise Java and agile development. Jeroen helps customers with enterprise Java performance issues and is an instructor of Java Performance Tuning courses. He has worked on various Java projects in several industries since 1996, as a developer, architect, team lead, quality officer, mentor, auditor, performance tester, and tuner. He has worked exclusively on performance assignments since 2005.
Acknowledgments
This article could only have been written with the help of several others. Special thanks go to:
Dr. Cliff Click, former lead architect of Sun's server VM and currently working with Azul Systems; for helping with the analysis and pointing me at valuable resources.
Kirk Pepperdine, Java performance authority; for guiding me through the process, in addition to contributing to and extensively editing this article.
David Dagastine, Sun JVM performance team lead, for explaining and pointing me in the right direction.
Several of my Xebia colleagues for performing the benchmarks.
Resources
Java Concurrency in Practice, Brian Goetz et al.
Java theory and practice: Synchronization optimizations in Mustang
Did escape analysis escape from Java 6
Dave Dice's Weblog
Java SE 6 Performance White Paper
Listing 1.
public class LockTest {
    private static final int MAX = 20000000; // 20 million

    public static void main(String[] args) throws InterruptedException {
        // warm up the method cache
        for (int i = 0; i < MAX; i++) {
            concatBuffer("Josh", "James", "Duke");
            concatBuilder("Josh", "James", "Duke");
        }
        System.gc();
        Thread.sleep(1000);

        // measure StringBuffer
        long start = System.currentTimeMillis();
        for (int i = 0; i < MAX; i++) {
            concatBuffer("Josh", "James", "Duke");
        }
        long bufferCost = System.currentTimeMillis() - start;
        System.out.println("StringBuffer: " + bufferCost + " ms.");

        System.gc();
        Thread.sleep(1000);

        // measure StringBuilder
        start = System.currentTimeMillis();
        for (int i = 0; i < MAX; i++) {
            concatBuilder("Josh", "James", "Duke");
        }
        long builderCost = System.currentTimeMillis() - start;
        System.out.println("StringBuilder: " + builderCost + " ms.");
        // overhead formula mirrors the one used in EATest (Listing 2)
        System.out.println("Thread safety overhead of StringBuffer: "
                + ((bufferCost * 10000 / (builderCost * 100)) - 100) + "%\n");
    }

    public static String concatBuffer(String s1, String s2, String s3) {
        StringBuffer sb = new StringBuffer();
        sb.append(s1);
        sb.append(s2);
        sb.append(s3);
        return sb.toString();
    }

    public static String concatBuilder(String s1, String s2, String s3) {
        StringBuilder sb = new StringBuilder();
        sb.append(s1);
        sb.append(s2);
        sb.append(s3);
        return sb.toString();
    }
}
Listing 2.
public class EATest {
    private static final int MAX = 200000000; // 200 million

    public static void main(String[] args) throws InterruptedException {
        // warm up the method cache
        sumThreadUnsafe();
        sumThreadSafe();
        sumThreadUnsafe();
        sumThreadSafe();
        System.out.println("Starting test");

        long start;

        start = System.currentTimeMillis();
        sumThreadUnsafe();
        long unsafeCost = System.currentTimeMillis() - start;
        System.out.println(" thread unsafe: " + unsafeCost + " ms.");

        start = System.currentTimeMillis();
        sumThreadSafe();
        long safeCost = System.currentTimeMillis() - start;
        System.out.println(" thread safe: " + safeCost + " ms.");
        System.out.println("Thread safety overhead: "
                + ((safeCost * 10000 / (unsafeCost * 100)) - 100) + "%\n");
    }

    public static int sumThreadSafe() {
        String[] names = new String[] { "Josh", "James", "Duke", "B" };
        ThreadSafeObject ts = new ThreadSafeObject();
        int sum = 0;
        for (int i = 0; i < MAX; i++) {
            sum += ts.test(names[i % 4]);
        }
        return sum;
    }

    public static int sumThreadUnsafe() {
        String[] names = new String[] { "Josh", "James", "Duke", "B" };
        ThreadUnsafeObject tus = new ThreadUnsafeObject();
        int sum = 0;
        for (int i = 0; i < MAX; i++) {
            sum += tus.test(names[i % 4]);
        }
        return sum;
    }
}

final class ThreadUnsafeObject {
    // private int index = 0;
    private int count = 0;
    private char[] value = new char[1];

    public int test(String str) {
        value[0] = str.charAt(0);
        count = str.length();
        return count;
    }
}

final class ThreadSafeObject {
    private int index = 0; // remove this line, or just the '= 0', and it will go faster!!!
    private int count = 0;
    private char[] value = new char[1];

    public synchronized int test(String str) {
        value[0] = str.charAt(0);
        count = str.length();
        return count;
    }
}