Key Takeaways
- Virtual threads are an important advancement in Java concurrent programming, but they do not offer a clear advantage over Open Liberty’s existing autonomic thread pool for running typical cloud-native Java workloads.
- For CPU-intensive workloads, throughput is lower with virtual threads than with Open Liberty’s thread pool, for reasons that are not yet fully understood.
- Because of the thread-per-request model, virtual thread ramp-up time from idle to maximum throughput is quicker than that of Open Liberty’s thread pool.
- Memory footprint in Open Liberty deployments can vary greatly based on factors like application design, workload level, and garbage collection behavior, so the reduced footprint of virtual threads may not result in an overall reduction in memory used.
- Virtual threads showed some unexpected performance issues in some use cases that Java developers should be aware of. We are working with the OpenJDK Community to investigate the root cause and to try to resolve the issues.
The release of JDK 21 brought into general availability a much-publicized new feature, Java Virtual Threads. This feature marks a significant leap forward in how Java developers can handle concurrency in their applications. Some of the aims of Java Virtual Threads include:
- lightweight, scalable, and user-friendly concurrency model
- efficient utilization of system resources
- "dramatically reduce the effort of writing, maintaining, and observing high-throughput concurrent applications" (JEP425)
Virtual threads have sparked great interest in the Java developer community, including among application frameworks such as Open Liberty, an open-source, modular, cloud-native Java application runtime. As members of the Liberty performance engineering team, we evaluated whether this new Java capability could benefit our users, or even potentially replace the thread pool logic currently used in the Liberty application runtime itself. At the very least, we wanted to better understand the virtual thread technology and how it performs, so that we could provide informed guidance to Liberty users.
This article reports our findings. It includes:
- An overview of the Java Virtual Thread implementation.
- An overview of the current Liberty thread pool technology.
- Our evaluation across some performance metrics, including some unexpected observations.
- A summary of our findings.
Java Virtual Threads
Virtual threads were first introduced in JDK 19, enhanced in JDK 20, and finalized in JDK 21 (as described in JDK Enhancement Proposal (JEP) 444).
Historically, Java developers implemented applications using the "thread-per-request" model, where each request is handled by a dedicated thread for the duration of its lifecycle. These threads (referred to as platform threads) are implemented as wrappers around an operating system thread (OS thread). However, OS threads use a lot of system memory and are scheduled by the OS layer, and this can lead to scaling issues as more and more of them are deployed.
One of the key motivations for virtual threads is to preserve the simplicity of the thread-per-request model while avoiding the high cost of dedicated OS threads. Virtual threads address this by creating each thread as a lightweight object on the Java heap and mounting it on an OS thread only while it is actually running. This "sharing" of OS threads provides better utilization of system resources. In theory, this gives virtual threads a major scalability advantage: developers can now effectively use "millions of threads" in a single JVM.
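The following minimal sketch, adapted from the style of example used in JEP 444, shows the thread-per-request model with virtual threads; the one-second sleep is illustrative and stands in for any blocking operation:

```java
import java.time.Duration;
import java.util.concurrent.Executors;
import java.util.stream.IntStream;

public class VirtualThreadSketch {
    public static void main(String[] args) {
        // Each submitted task gets its own virtual thread; while a task blocks
        // (here, sleeping), its carrier OS thread is released to run other
        // virtual threads.
        try (var executor = Executors.newVirtualThreadPerTaskExecutor()) {
            IntStream.range(0, 1_000_000).forEach(i ->
                executor.submit(() -> {
                    Thread.sleep(Duration.ofSeconds(1)); // stands in for any blocking call
                    return i;
                }));
        } // close() waits for all tasks to complete
    }
}
```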
The following diagram shows the many-to-one relationship between Java virtual threads and OS threads, which are then scheduled to be run at the CPU level.
Open Liberty’s autonomic thread pool
Open Liberty’s shared thread pool approach also minimizes the high cost of dedicated OS threads. Liberty uses shared threads (referred to as the "Liberty thread pool") to perform application business logic functions and separate threads for I/O functions. Additionally, the Liberty thread pool is adaptive and is sized autonomically (as described in this post). For most use cases, there is no need for additional tuning, though the minimum and maximum pool sizes can optionally be configured.
Unlike a web server (such as the Helidon Web Server, implemented using virtual threads), an application runtime like Liberty isn’t just establishing an I/O connection that then sits idle for long periods. The applications that run on Liberty generally perform some noticeable amount of business logic, which requires CPU resources. A Liberty deployment does not typically use thousands or millions of threads because CPU resources are fully consumed by a few hundred threads (or fewer), especially in containers or pods that have allocations of just a few, or even fractions of, CPUs.
Performance Tests
We focused our evaluation primarily on use cases and configurations commonly used by Liberty customers. We used our existing benchmark applications to compare the relative performance of Liberty’s thread pool and virtual threads. These benchmark applications use REST and MicroProfile and perform some basic business logic during the transaction.
We aimed to model what most Liberty users would see if we replaced the autonomic thread pool in Liberty with virtual threads. For this reason, our evaluation focused primarily on configurations with 10s-100s of threads. However, we extended the evaluation to also compare Liberty’s thread pool and virtual thread behavior with a few thousand threads, because running with many threads is an advertised strength of virtual threads.
To evaluate a use case that would exercise virtual thread unmount and mount actions, we used an online banking simulation app that generates a request to a remote system, which responds after a configurable delay. The delayed response means that the threads in the system under test are blocked on the I/O and are not being used by the CPU for some period. This app generates the type of work that allows virtual threads to be unmounted mid-transaction and then remounted after the reply from the remote system (i.e., it allows sharing of OS threads).
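The sketch below is an illustrative approximation of the transaction this app performs; the resource name, remote URL, and delay parameter are assumptions for illustration, not the actual benchmark code. The blocking remote call is the point at which a virtual thread can be unmounted from its carrier OS thread:

```java
import jakarta.ws.rs.GET;
import jakarta.ws.rs.Path;
import jakarta.ws.rs.client.Client;
import jakarta.ws.rs.client.ClientBuilder;

@Path("/account")
public class AccountResource {
    private final Client client = ClientBuilder.newClient();

    @GET
    @Path("/balance")
    public String balance() {
        // The remote simulator replies after a configurable delay; the handling
        // thread is blocked on this call, doing no CPU work, until the reply arrives.
        return client.target("http://remote-bank:9080/simulator/reply?delayMillis=2")
                     .request()
                     .get(String.class);
    }
}
```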
Test Case Environment
We ran these performance tests with both Eclipse Temurin (OpenJDK with HotSpot JVM) and IBM Semeru Runtimes (OpenJDK with OpenJ9 JVM). We observed similar performance differentials between Liberty’s thread pool and virtual threads with both JDKs. Unless otherwise noted, the results shown below were produced running Liberty 23.0.0.10 GA with Temurin 21.0.1_12 release.
Disclaimer: Our evaluation of virtual threads focused on whether there would be performance benefits to Liberty users if the autonomic thread pool were replaced by implementing virtual threads using the "thread-per-request" model as described above. This is an important context to keep in mind while reading through the test cases since the results could be completely different for another application runtime that does not have a self-tuning thread pool like Liberty does.
Test Case 1: CPU throughput
Objective: Evaluate CPU throughput to discover whether there is any loss of performance when using virtual threads vs Liberty’s thread pool.
Findings: For some configurations, workloads had 10-40% lower throughput when using virtual threads than when using Liberty’s thread pool.
For this test, we ran several CPU-intensive apps and compared how many transactions per second (tps) can be completed on a given number of CPUs that are running with virtual threads vs Liberty’s thread pool. We used Apache JMeter to drive various loads to get a small system to reach higher and higher levels of CPU utilization.
In one example, we ran the online banking app with a short 2 ms delay so that the virtual threads functionality (mount/unmount/remount on OS threads) is exercised for each individual task while the application overall is still fairly CPU-intensive. The load was gradually increased, running long enough (150 s) at each load level to get a stable average throughput measurement.
At low load levels, the online banking app's virtual thread throughput was roughly equal to Liberty's thread pool throughput (see graph), with virtual threads using somewhat more CPU (CPU utilization not shown). As the load increased, transactions per second using virtual threads gradually fell behind Liberty’s thread pool.
We expected that virtual threads might be somewhat slower in this sort of CPU-intensive application because virtual threads do not make code run any faster than it runs on traditional Java platform threads, and there is some overhead with virtual threads, including:
- Mounting and unmounting: Virtual threads are mounted on a platform thread to run and unmounted at blocking points and when execution is complete. Also, JVM Tool Interface (JVMTI) notifications are emitted for each mount or unmount action. These actions are lightweight but not at zero cost.
- Garbage collection: A virtual thread object is created and discarded for each transaction, with allocation and garbage collection costs.
- Loss of thread-linked context: Liberty uses ThreadLocal variables to share common information across requests. The efficiency of this approach when using pooled threads is lost with virtual threads, since the ThreadLocal goes away with the virtual thread. As part of this project, we converted major ThreadLocal uses to other non-thread-linked sharing mechanisms, but a number of smaller-impact instances are still present (see the sketch after this list).
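As a rough illustration of the third point (the formatter cache and class names below are hypothetical, not Liberty code): a ThreadLocal cache pays off when pooled threads are reused across many requests, but a fresh virtual thread per request starts with an empty ThreadLocal every time.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

public class TimestampUtil {
    // On a pooled platform thread, this formatter is created once per worker and
    // then reused for every request that worker handles. With a new virtual thread
    // per request, the initializer runs again on every single request.
    private static final ThreadLocal<DateTimeFormatter> FORMATTER =
            ThreadLocal.withInitial(() -> DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss"));

    public static String timestamp() {
        return FORMATTER.get().format(LocalDateTime.now());
    }
}
```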
However, CPU profiling showed that none of these potential virtual thread overheads was large enough to explain the observed throughput discrepancy. We discuss other possible causes in the "Unexpected virtual threads performance findings" section later.
Virtual threads did not make Java code execute any faster compared to running the same code on regular Java platform threads in the Liberty thread pool for a CPU-intensive application on a small number of CPUs (a typical use case for Liberty).
Test Case 2: Ramp-up time
Objective: Quantify how quickly virtual threads get to full throughput compared to Liberty’s thread pool.
Findings: When a heavy load is suddenly applied, apps running on virtual threads reach maximum throughput significantly faster than when running on Liberty’s thread pool.
The simple model of virtual thread usage is that every task gets its own (virtual) thread to run on, so our Liberty virtual threads prototype launched a new virtual thread to execute each task received from the load driver. Thus, with virtual threads, every task immediately has a thread to run on, while with Liberty’s thread pool, a task might have to wait for a thread to become available.
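In outline, the prototype's dispatch path looked something like the following simplified sketch (not the actual Liberty code):

```java
public class VirtualThreadDispatcher {
    // Every task received from the load driver is started on its own new virtual
    // thread, so no task has to wait for a pooled worker thread to become free.
    public void dispatch(Runnable task) {
        Thread.ofVirtual().start(task);
    }
}
```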
To adequately test this scenario, we needed to run our online banking application with a long enough response latency to cause several thousand simultaneous transactions to saturate the CPU. This workload required thousands of threads to handle the transactions, either per-transaction virtual threads or traditional Java platform threads in Liberty’s thread pool.
Handling thousands of threads in Liberty’s thread pool
We found that Liberty’s thread pool ran fine with a few thousand threads. Because of the commentary in various virtual thread discussions about the issues with using many platform threads, we were on the lookout for signs of trouble in Liberty’s thread pool, such as instability or other symptoms of a "too many threads" problem when handling a few thousand threads. We did not see problems of that sort.
On the contrary, we found that throughput was actually slightly higher (2-3%) with Liberty’s thread pool than with virtual threads. CPU usage was about 10% lower with Liberty’s thread pool, and throughput per unit of CPU utilization was 12-15% higher (mostly due to the design of the autonomic controls that determine the thread pool's size). The Liberty thread pool autonomics allow the pool to grow to thousands of threads if required by the workload while maintaining stable operation.
Ramp-up time using Liberty’s thread pool vs virtual threads
In the scaling evaluation, the ramp-up time for virtual threads to go from low load to full capacity was very quick. Liberty’s thread pool ramp-up was slower because it adjusts gradually based on observed throughput; Liberty’s thread pool makes decisions to grow, shrink, or stay the same size at 1500 ms intervals, and it would take tens of minutes to gradually decide that more and more threads should be added to handle the offered load.
As a result of this testing, we modified Liberty’s thread pool autonomics to grow the pool more aggressively when there are more idle CPU resources available and Liberty’s thread pool request queue is deep. With this fix (available in Open Liberty 23.0.0.10 onwards), when a heavy load was suddenly applied (over 30 secs) to the online banking app running on Liberty’s thread pool, the app now reached peak throughput only about 20-30 seconds (instead of tens of minutes) after the same app running on virtual threads, even with a workload requiring about 6000 threads on an idle JVM (see graph). The virtual threads prototype was still quicker to ramp up because it gave a new virtual thread to every request upon arrival, but the difference in acceleration between virtual threads and Liberty’s thread pool was greatly reduced.
Test Case 3: Memory footprint
Objective: Determine how much memory is used by the Java process under constant load, for both virtual threads and Liberty’s thread pool.
Findings: The smaller per-thread footprint of virtual threads had only a relatively small direct effect in a configuration requiring a few hundred threads and may be outweighed by the effect of other memory usages in the JVM.
Virtual threads use less memory (Java process size) than traditional platform threads because they do not require a dedicated backing OS thread. This test case measured how this per-thread memory advantage of virtual threads translated into the total memory usage of the JVM at typical Liberty workload levels. We found a rather mixed set of results.
We expected to see loads that were running with virtual threads consistently using less memory than when running the same load with the Liberty thread pool. What we found instead is that sometimes the virtual thread configuration used less memory, but sometimes it used more.
This variability arose because factors other than the thread implementation contributed to memory usage by the Java process. One element that had a significant impact on the variability of memory usage in our testing was DirectByteBuffers (DBBs), which are part of the Java networking infrastructure. (See the ByteBuffer API for background on Direct ByteBuffers.)
DirectByteBuffers are a two-part structure, with a small Java reference object on the heap and a variable-sized (generally much larger) memory area in native, off-heap memory. When the reference object is no longer needed, it becomes eligible for garbage collection, and the associated native memory is freed once the object is actually collected. If the reference object survives long enough to be promoted to the old-gen area (in a typical Java generational GC model), the native memory allocation is held until a global GC. Because global GCs are (by design) infrequent, this allocation and retention pattern can cause the Java process footprint to grow significantly larger than the active runtime usage.
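A minimal sketch of this lifecycle (the buffer size is illustrative):

```java
import java.nio.ByteBuffer;

public class DirectBufferSketch {
    public static void main(String[] args) {
        // Small Java reference object on the heap, plus a 1 MB backing
        // allocation in native (off-heap) memory.
        ByteBuffer buf = ByteBuffer.allocateDirect(1024 * 1024);
        buf.putInt(42);   // reads and writes go to the native memory area

        buf = null;       // the heap reference is now unreachable...
        // ...but the native 1 MB is released only when a GC actually collects
        // the reference object; if it was promoted to old-gen first, that may
        // not happen until a global GC, so the process footprint stays inflated.
    }
}
```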
Note: This test was run with a small minimum heap size and a relatively large maximum heap size. This was to allow the variability of heap memory usage to be evident as one of the factors affecting total memory usage by the JVM.
In some cases where a load run with virtual threads took more memory than the same load with Liberty’s thread pool, we found that the difference was attributable to DirectByteBuffers retention. This does not indicate a problem with virtual threads: how long DirectByteBuffers memory is retained depends on the interplay of several factors, including transaction duration, Java heap nursery size, and tenure promotion timing. We could run the same test with a slightly different configuration or tuning and have virtual threads use less memory than Liberty’s thread pool, with the difference coming from the DirectByteBuffers retention.
For example, a slight 10% increase in workload caused a 25% decrease in memory used by the online banking app running on the Liberty thread pool but caused a 185% increase in memory used by the same app running on virtual threads (see graphs).
The native memory saved by not backing each virtual thread with a dedicated OS thread is real, but it may be relatively small compared to other memory used in the application runtime. In configurations where only a few hundred threads are required, this saving may be eclipsed by other effects that are somewhat hard to predict, such as the rate of Java heap growth and the timeliness of the garbage collections that release associated native memory, such as DirectByteBuffers.
In performance work, the mantra YMMV ("your mileage may vary") is well known. Some users of virtual threads will see a decrease in their system's total memory usage, and some will see an increase. Only a relatively small portion of those changes in memory usage will be attributable to virtual threads.
Unexpected virtual threads performance findings
Our virtual threads investigation involved many experiments with our benchmark apps, varying the number of CPUs, amount of load, remote delay (for the online banking app), heap size, etc. These experiments produced some very unexpected findings that do not fit neatly in the preceding sections.
In particular, when running short-duration tasks on two CPUs, we sometimes saw very poor performance with virtual threads. We tracked this down to how the Linux kernel scheduler interacts with Java’s ForkJoinPool thread management. Newer versions of the Linux kernel scheduler changed those interactions, but we still saw poor virtual thread performance, just in different ways. Virtual threads users might encounter similar problems and should be aware that upgrading to a newer Linux kernel changes the behavior rather than fixing it.
In this test, we used our MicroProfile benchmark application, mp-ping, which performs a simple "ping" on a REST service. The load driver hits a REST URL on the mp-ping app running on Liberty and receives an immediate "ping" response (0.05-0.10 ms).
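The resource is essentially a no-op; the sketch below is an illustrative approximation, not the actual mp-ping source:

```java
import jakarta.ws.rs.GET;
import jakarta.ws.rs.Path;
import jakarta.ws.rs.Produces;
import jakarta.ws.rs.core.MediaType;

@Path("/ping")
public class PingResource {
    // No remote calls and no blocking: each request occupies a thread for well
    // under a millisecond, which makes scheduling overheads easy to see.
    @GET
    @Produces(MediaType.TEXT_PLAIN)
    public String ping() {
        return "ping";
    }
}
```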
Low throughput and low CPU when running on virtual threads
We found that running short-duration tasks (mp-ping) on 2-CPU configurations on virtual threads produced much lower throughput than when running on Liberty’s thread pool and correspondingly lower CPU utilization. The throughput on virtual threads was as low as 50-55% of the throughput on Liberty’s thread pool, as can be seen in the following graph.
The poor performance was also present, though less severe, with longer-duration tasks (up to 1 ms) and with more CPUs.
We reproduced this problem of low throughput and low CPU utilization with virtual threads on several different hardware platforms with different Linux kernel levels to ensure that the behavior was not an artifact of some quirk on the original test system. We also created a simple standalone application that generates tasks that burn CPU for a configurable period, and it showed similar low throughput and low CPU utilization behavior with virtual threads, so the poor performance is not somehow caused by Liberty.
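The standalone reproducer was along these lines; this is a simplified sketch with illustrative parameter values, not the exact code we used:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class CpuBurnDriver {
    public static void main(String[] args) {
        // Duration of each CPU-bound task, in microseconds (configurable).
        long burnMicros = args.length > 0 ? Long.parseLong(args[0]) : 100;

        try (ExecutorService exec = Executors.newVirtualThreadPerTaskExecutor()) {
            for (int i = 0; i < 1_000_000; i++) {
                exec.submit(() -> {
                    long end = System.nanoTime() + burnMicros * 1_000;
                    while (System.nanoTime() < end) {
                        // busy-spin: pure CPU work, no blocking
                    }
                });
            }
        } // close() waits for all submitted tasks to finish
    }
}
```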
ForkJoinPool and the Linux kernel scheduler
Investigation of the root cause of this poor performance on virtual threads revealed that Java’s ForkJoinPool, which manages the platform threads that underpin virtual threads, was parking one of the platform threads for 10-13 ms at a time when there was plenty of work available to be done. With one platform thread parked, the virtual threads do not get to run promptly, leading to the low throughput and low CPU utilization that we observed.
Further investigation implicated the Linux thread scheduler: tracing showed ForkJoinPool code calling unpark on the parked platform thread, but the thread did not resume promptly. We concluded that the poor performance was caused by the interaction between the Linux thread scheduler and ForkJoinPool worker thread management. This interaction was not a problem for Liberty’s thread pool because it does not use ForkJoinPool to manage platform threads.
We experimented with the available ForkJoinPool tuning options, Linux scheduler tuning options, and various modifications to the ForkJoinPool implementation, producing some minor performance improvements but not significantly closing the gap with Liberty thread pool performance.
Note: Our investigation suggests that running with 2 CPUs is probably the worst case for the problems we have found with virtual threads in the 4.18 Linux kernel. The performance issues were still present but less prominent when running the same workloads on test systems with 1 CPU or on 4 or more CPUs.
Low throughput and high CPU when running virtual threads
The testing described in the previous two sections was mainly on Linux kernel 4.18, which is the currently available kernel in Red Hat Enterprise Linux (RHEL) 8. We found a different performance problem for virtual threads when we ran the same tests on newer Linux kernel 5.14 (RHEL 9) and kernel 6.2 (Ubuntu 22.04).
With the newer Linux kernels, running the mp-ping app on virtual threads still produced somewhat lower throughput than on Liberty’s thread pool but with higher CPU utilization. As the load increased, the throughput on virtual threads was 20-30% lower than the throughput on Liberty’s thread pool, as can be seen in the following graph.
These findings show that, for some workloads, virtual threads can exhibit different performance issues depending on the Linux kernel level.
Next steps to investigate the cause of these behaviors
We discussed these findings with members of the OpenJDK community and are continuing to investigate and test modifications with them. The runs shown in both graphs used the latest nightly build of Temurin 22, which includes the latest revisions to ForkJoinPool (currently under revision), to check whether those revisions corrected the problems we originally observed with Temurin 21; they did not.
Further investigation is needed to fully determine the root cause and resolution, and we are actively working with the OpenJDK community on this. We would like to thank Doug Lea (a leader in Java concurrency work and the author of the ForkJoinPool class) and others in the OpenJDK community for their assistance in our investigation of these virtual threads performance issues. We are reporting the problems here as a heads-up for virtual threads users who may encounter similar issues, depending on their virtual threads use case.
For anyone interested in reproducing the problems described in "Unexpected virtual threads performance findings", we have provided a README with instructions in a GitHub repo.
Summary and conclusions
We investigated virtual threads' performance using some simple applications representative of typical customer uses of Liberty, focusing on three main performance aspects:
- Throughput: Virtual threads performed worse than Liberty’s thread pool in the apps that we tried. This poor performance was observed at varying levels depending on the number of CPUs, task duration, Linux kernel level, and Linux scheduler tuning.
- Ramp-up: When workload arrives in bursts, with long task durations requiring many threads, virtual threads ramp up to full throughput more quickly than Liberty’s thread pool, although recent improvements to the pool's ramp-up behavior have greatly narrowed that advantage.
- Memory footprint: The effect of virtual threads’ smaller per-thread footprint is relatively small in a configuration requiring a few hundred threads, and may be outweighed by the effect of other memory usages in the JVM.
In addition, we were surprised to find a performance problem, for certain use cases, when running on virtual threads. We traced this problem to an interaction between the Linux kernel scheduler and Java’s ForkJoinPool thread management. This problem persists, though differently, even with newer versions of the kernel.
After comparing Liberty’s existing thread management with the new Java Virtual Threads, we found that the existing Liberty thread pool produces comparable or often better performance for Liberty (and, therefore, for any applications running on Liberty) at a moderately high level of concurrency (thousands of threads). While virtual threads can show advantages over Liberty’s thread pool at higher concurrency levels, this depends on the right conditions: long task delays, a large number of CPUs, or a combination of these factors.
Java application developers can still use virtual threads in their own applications that run on Liberty, but we decided against replacing the Liberty thread pool with virtual threads for the time being. As discussed earlier in this article, there are plenty of use cases where virtual threads are likely to be very useful in simplifying the development of multithreaded applications. However, as described, there are also some issues that developers should be aware of in certain kinds of applications. By sharing our experience in this article, we hope that Java developers will be better informed on when and whether to implement virtual threads in their own applications.