Introduction - Threading optimizations in Java 6
Sun, IBM, BEA and others have given much attention to optimizing lock management and synchronization in their respective Java 6 virtual machine offerings. Features such as biased locking, lock coarsening, lock elision by escape analysis, and adaptive spin locking are all designed to increase concurrency by allowing more effective sharing amongst application threads. As sophisticated and interesting as each of these features is, the question remains: will they actually make good on these promises? In this two-part article I will explore these features and attempt to answer the performance question with the aid of a single-threaded benchmark.
Locking is Pessimistic
The locking model supported in Java (as is the case with most threading libraries) is a very pessimistic one. If there is even the remotest danger that two or more threads will use data in a way that interferes with each other, we are forced into the draconian solution of using locks to prevent this from happening. Yet research has shown that locks are rarely, if ever, contended; in other words, a thread asking for a lock will rarely have to wait to acquire it. Even so, the act of asking for a lock triggers a sequence of actions that can accumulate into significant, unavoidable overhead.
We do have some choice. Consider, for example, the thread-safe StringBuffer. Ask yourself this: if you've ever used StringBuffer knowing that it would only be accessed from a single thread, why didn't you use StringBuilder instead?
Knowing that most locks won't be contended, or will rarely be contended, isn't very helpful, because even the smallest probability that two threads could access the same data pretty much forces us to protect that access with a lock via synchronization. It is only when we look at a lock in the context of a runtime environment that we can finally answer the question: did we really need the lock? To obtain an answer, the JVM developers have been experimenting with HotSpot and the JIT. From this work we now have adaptive spinning, biased locking, and two forms of lock elimination known as lock coarsening and lock elision. Before we start benchmarking, let's take some time to review these features so that we all understand how they work.
Escape analysis - lock elision explained
Escape analysis is an examination of the scope of all references in a running application. It is carried out as a normal part of the HotSpot profiler's work. If HotSpot (via escape analysis) can determine that references to an object are limited to a local scope, and that none of those references can "escape" to a broader scope, it can direct the JIT to apply a number of runtime optimizations. One such optimization is known as lock elision. When references to a lock are limited to some local scope, only the thread that created the lock will ever have access to it. Under those conditions the data protected by the synchronized block can never be contended, which implies that we never really needed the lock and that it can safely be elided (omitted). Consider the following method:
public String concatBuffer(String s1, String s2, String s3) {
    StringBuffer sb = new StringBuffer();
    sb.append(s1);
    sb.append(s2);
    sb.append(s3);
    return sb.toString();
}
Figure 1. String concatenation using local StringBuffer
If we examine the variable sb, we can quickly determine that it only lives within the confines of the concatBuffer method. Furthermore, references to sb never "escape" the scope in which it is declared, so there is no way another thread could ever gain access to our copy of sb. With this knowledge in hand, we know that the locks protecting sb can be elided.
On the surface it looks as though lock elision allows us to write thread-safe code without paying any synchronization penalty in cases where synchronization really wasn't needed. Whether it really works is a question we will investigate with our benchmark.
Biased locking explained
Biased locking is derived from the observation that most locks are never accessed by more than one thread during their lifetime, and that on those rare occasions when multiple threads do share data, access is rarely contended. To understand the advantage of biased locking in light of this observation, we first need to review how locks (monitors) are acquired.
Acquiring a lock is a two-step dance. First you need to acquire a lease; once you have the lease in hand, you are free to grab the lock. To acquire the lease, the thread must execute an expensive atomic instruction, and releasing the lock traditionally releases the lease as well. In light of our observation, it seems we should be able to optimize the case where a thread loops over a synchronized block of code. One thing we could do is coarsen the lock to include the loop, so that the thread acquires the lock once instead of loop-count times. However, this isn't a good solution, because it may lock the other threads out of fair access. A more sensible solution is to bias the lock towards the looping thread.
Biasing a lock to a thread means that the thread is not required to release the lease on the lock, so subsequent lock acquisitions are much less expensive. The thread will only release the lease if another thread attempts to acquire the lock. The Java 6 HotSpot/JIT implementation applies biased locking by default.
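To illustrate the pattern that biased locking targets, consider this sketch (a hypothetical Counter class, not taken from the benchmark). A single thread repeatedly enters a synchronized method in a loop; after the first acquisition, HotSpot can bias the monitor toward that thread so the remaining acquisitions skip the expensive atomic instruction.
public class Counter {
    private int count = 0;

    // The monitor on this instance is acquired and released on every call.
    public synchronized void increment() {
        count++;
    }

    public synchronized int get() {
        return count;
    }

    public static void main(String[] args) {
        Counter counter = new Counter();
        // Uncontended: only the main thread ever touches the lock, so
        // with -XX:+UseBiasedLocking the lease is retained between calls.
        for (int i = 0; i < 1000000; i++) {
            counter.increment();
        }
        System.out.println(counter.get());
    }
}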
Lock coarsening explained
Yet another threading optimization is lock coarsening, or merging, in which adjacent synchronized blocks are merged into one synchronized block. A variation on this theme is to combine multiple synchronized methods into one; this optimization can be applied if all the methods use the same lock object. Consider the example shown in Figure 2.
public static String concatToBuffer(StringBuffer sb, String s1, String s2, String s3) {
    sb.append(s1);
    sb.append(s2);
    sb.append(s3);
    return sb.toString();
}
Figure 2. String concatenation using non-local StringBuffer
In this case the StringBuffer has non-local scope and can be accessed by several threads, so escape analysis will determine that its lock cannot be safely elided. If the lock happens to be accessed by only one thread, biased locking can still be applied. Interestingly, the decision to coarsen can be made independently of the number of threads contending for the lock. Here, an instance lock is acquired four times in a row: three times for the append calls and once for the toString call. The first step is to inline these methods; then all four lock acquisitions can be replaced with a single one that surrounds the entire method body.
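Conceptually, the result behaves as if we had written the following (a hand-written sketch for illustration only; the JIT applies this transformation to compiled code, not to the source):
public static String concatToBufferCoarsened(StringBuffer sb, String s1, String s2, String s3) {
    // One coarsened acquisition of sb's monitor replaces the four
    // separate acquisitions made by append (three times) and toString.
    synchronized (sb) {
        // In the real transformation the inlined append/toString bodies run
        // without their own locking; written in source form, these nested
        // calls simply re-enter the monitor we already hold, which is cheap.
        sb.append(s1);
        sb.append(s2);
        sb.append(s3);
        return sb.toString();
    }
}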
The net effect is a longer critical section, which may cause other threads to stall and throughput to drop. For this reason, locks inside a loop are not coarsened to include the looping construct.
Thread suspending versus spinning
When a thread waits for a lock to be released by another thread, it is usually suspended by the operating system. Suspending a thread requires the operating system to swap it out of the CPU, often before its time quantum has been consumed. When the thread that holds the lock leaves the critical section, the suspended thread needs to be woken up, re-scheduled, and context-switched back onto the CPU. All of this activity can place extra strain on the JVM, the OS, and the hardware.
The useful observation in this instance is that locks are normally held for very brief periods of time, which implies that if we wait a bit, we may be able to get the lock without being suspended. To wait, all we do is put the thread into a busy loop (a spin). This technique is known as spin locking.
Spin locks work well in cases where lock durations are very short. On the other hand, if the lock is held for a longer time, spinning is wasteful, as the spinning thread consumes CPU without doing anything useful. When spin locking was introduced in JDK 1.4.2, lock acquisition became a two-phase affair: spin for a fixed number of iterations (ten by default), and only suspend the thread if the lock is still unavailable.
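As an illustration only (HotSpot implements spinning inside the VM, not in Java code), a spin-then-suspend lock might look like the following sketch, with Thread.yield() standing in crudely for true OS suspension:
import java.util.concurrent.atomic.AtomicBoolean;

public class SpinThenYieldLock {
    private static final int SPIN_LIMIT = 10; // mirrors the JDK 1.4.2 default
    private final AtomicBoolean locked = new AtomicBoolean(false);

    public void lock() {
        int spins = 0;
        // Try to grab the lock with an atomic compare-and-set.
        while (!locked.compareAndSet(false, true)) {
            if (++spins > SPIN_LIMIT) {
                // Past the spin budget: give up the CPU instead of burning it.
                Thread.yield();
                spins = 0;
            }
        }
    }

    public void unlock() {
        locked.set(false);
    }
}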
Adaptive spinning
JDK 1.6 introduced adaptive spinning. With adaptive spinning, the duration of the spin is no longer fixed but is determined by a policy based on previous spin attempts on the same lock and on the state of the lock owner. If spinning succeeded on that same lock object in the recent past and the thread holding the lock is running, then spinning is likely to succeed again, and it will be applied for a relatively longer time, e.g. 100 iterations. On the other hand, if spinning is very unlikely to succeed, it is left out altogether, thereby not wasting any CPU cycles.
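The idea can be sketched as follows (an illustration only; HotSpot's actual policy is internal to the VM and also takes the state of the lock owner into account). A per-lock spin budget grows when spinning recently succeeded and shrinks, possibly to zero, when it did not:
import java.util.concurrent.atomic.AtomicBoolean;

public class AdaptiveSpinLock {
    private final AtomicBoolean locked = new AtomicBoolean(false);
    private volatile int spinBudget = 10; // adjusted by recent success

    public void lock() {
        int budget = spinBudget;
        for (int i = 0; i < budget; i++) {
            if (locked.compareAndSet(false, true)) {
                // Spinning paid off: be willing to spin longer next time.
                spinBudget = Math.min(budget * 2, 100);
                return;
            }
        }
        // Spinning did not pay off: shrink the budget, possibly to zero,
        // which skips spinning entirely on later attempts. (A real policy
        // would also let the budget recover.)
        spinBudget = Math.max(budget / 2, 0);
        // Fall back to "suspension", crudely modeled here with yield().
        while (!locked.compareAndSet(false, true)) {
            Thread.yield();
        }
    }

    public void unlock() {
        locked.set(false);
    }
}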
StringBuffer vs. StringBuilder benchmark
The road to deciding exactly how to measure the effects of all these nifty optimizations wasn't so smooth. The first and foremost question was: what should the benchmark look like? To answer it, I decided to look at some of the common idioms that people use in their code all the time. The one that stood out was the age-old question of the cost savings of using StringBuffer instead of String.
The familiar advice is: use StringBuffer instead of String if you are going to be mutating it. The reason is quite clear. String is immutable, and if our work requires mutations, StringBuffer is the less costly alternative. Interestingly enough, the advice fails to recognize StringBuffer's unsynchronized cousin, StringBuilder, introduced in Java 1.5. Since the only difference between StringBuilder and StringBuffer is synchronization, a benchmark that measures the performance difference between the two should reveal the cost of synchronization. The quest starts with the first question: what is the cost of uncontended locking?
The essence of the benchmark (shown in Listing 1) is to take a couple of strings and concatenate them. The initial capacity of the underlying buffer is large enough to hold the three strings being joined; this minimizes the amount of work in the critical section and biases the measurement towards the cost of synchronization.
Benchmark results
Below are results for the sensible combinations of the three options of EliminateLocks, UseBiasedLocking, and DoEscapeAnalysis.
Figure 3. Benchmark results
Discussion of results
The purpose of using the unsynchronized StringBuilder was to provide a baseline measurement of performance. I also wanted to see whether the optimizations would have any effect on StringBuilder performance. As can be seen, StringBuilder performance remained constant throughout the exercise; because these flags are aimed directly at optimizing the use of locks, this result was expected. At the other end of the performance spectrum, the synchronized StringBuffer without any of the optimizations runs about three times slower.
Moving from left to right across the results presented in Figure 3, we see a decent boost in performance that can be attributed to EliminateLocks. However, that boost pales in comparison to the one provided by biased locking. In fact, with the exception of column C, every run with biased locking turned on ended up providing approximately the same boost in performance. But what of column C?
While processing the raw data, I noticed that one run in six took significantly longer. The difference was large enough that the benchmark appeared to be reporting on two completely different optimizations. After some thought, I decided to report the higher and lower values separately (runs B and C). Short of a deeper investigation, I can only speculate that more than one optimization can be applied (most likely two) and that some race condition exists in which biased locking wins most of the time, but not all of the time. When the other optimization wins, the application of biased locking is either prevented or delayed.
The curious result is the one produced by escape analysis. Given the single-threaded nature of this benchmark, I fully expected escape analysis to elide the lock, rendering StringBuffer performance equivalent to that of StringBuilder. Quite clearly, that didn't happen. Yet another problem: timings taken on my machine varied from run to run. To further muddy the water, I had several of my colleagues run the tests on their systems. In several cases the optimizations did not speed things up all that much.
Premature conclusions
Though the results exhibited in Figure 3 are less than I hoped for, they do indicate that the optimizations remove most of the locking overhead. However, the fact that the test runs made by my colleagues produced different results would seem to challenge the veracity of the results. Is the benchmark measuring locking overhead, or is our conclusion premature and something else is going on? In part 2 of this article we'll take a deeper look at this benchmark in order to answer these questions. In the process we will discover that getting results is easy; determining whether the results answer the question we asked is a much more complex task.
public class LockTest {
    private static final int MAX = 20000000; // 20 million

    public static void main(String[] args) throws InterruptedException {
        // warm up the method cache
        for (int i = 0; i < MAX; i++) {
            concatBuffer("Josh", "James", "Duke");
            concatBuilder("Josh", "James", "Duke");
        }
        System.gc();
        Thread.sleep(1000);
        System.out.println("Starting test");

        long start = System.currentTimeMillis();
        for (int i = 0; i < MAX; i++) {
            concatBuffer("Josh", "James", "Duke");
        }
        long bufferCost = System.currentTimeMillis() - start;
        System.out.println("StringBuffer: " + bufferCost + " ms.");
        System.gc();
        Thread.sleep(1000);

        start = System.currentTimeMillis();
        for (int i = 0; i < MAX; i++) {
            concatBuilder("Josh", "James", "Duke");
        }
        long builderCost = System.currentTimeMillis() - start;
        System.out.println("StringBuilder: " + builderCost + " ms.");
        System.out.println("Thread safety overhead of StringBuffer: "
                + ((bufferCost * 10000 / (builderCost * 100)) - 100) + "%\n");
    }

    public static String concatBuffer(String s1, String s2, String s3) {
        StringBuffer sb = new StringBuffer();
        sb.append(s1);
        sb.append(s2);
        sb.append(s3);
        return sb.toString();
    }

    public static String concatBuilder(String s1, String s2, String s3) {
        StringBuilder sb = new StringBuilder();
        sb.append(s1);
        sb.append(s2);
        sb.append(s3);
        return sb.toString();
    }
}
Listing 1. The LockTest benchmark
In part II we'll take a deeper look at the techniques we used to validate the benchmark results, and then answer the real question.
Running the Benchmark
I ran this benchmark on my 32-bit Windows Vista laptop with an Intel Core 2 Duo, using Java 1.6.0_04. Note that all of the optimizations are implemented in the server VM, which is not the default VM on my platform; it isn't even shipped with the JRE, only with the JDK. To ensure that I was using the server VM, I specified the -server option on the command line. The other options are:
- -XX:+DoEscapeAnalysis, off by default
- -XX:+UseBiasedLocking, on by default
- -XX:+EliminateLocks, on by default
To run the benchmark, compile the sources and then use a command line similar to:
java -server -XX:+DoEscapeAnalysis LockTest
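For example, the combinations of options discussed above can be exercised with command lines such as the following (a + turns an option on, a - turns it off):
java -server -XX:-UseBiasedLocking -XX:-EliminateLocks LockTest
java -server LockTest
java -server -XX:+DoEscapeAnalysis LockTest
java -server -XX:+DoEscapeAnalysis -XX:-UseBiasedLocking LockTest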
About Jeroen Borgers
Jeroen Borgers is a senior consultant with Xebia - IT Architects. Xebia is an international IT consultancy and project organization specialized in enterprise Java and agile development. Jeroen helps customers with enterprise Java performance issues and is an instructor of Java performance tuning courses. He has worked on various Java projects in several industries since 1996, as a developer, architect, team lead, quality officer, mentor, auditor, performance tester and tuner. He has worked exclusively on performance assignments since 2005.
Acknowledgments
This article could only have been written with the help of several others. Special thanks go to:
Dr. Cliff Click, former lead architect of Sun's server VM and currently working with Azul Systems, for helping with the analysis and pointing at valuable resources.
Kirk Pepperdine, Java performance authority, for helping me throughout the process, in addition to contributing to and extensively editing the article.
David Dagastine, Sun JVM performance team lead, for explaining and pointing me in the right direction.
Several of my Xebia colleagues for performing the benchmarks.
Resources
Java Concurrency in Practice, Brian Goetz et al.
Java theory and practice: Synchronization optimizations in Mustang
Did escape analysis escape from Java 6
Dave Dice's Weblog
Java SE 6 Performance White Paper