Netflix, a long-time Java adopter, recently upgraded to Java 21 and is now using new features such as generational ZGC, introduced in JEP 439, and virtual threads, introduced in JEP 444, to improve performance across its extensive microservices fleet. Virtual threads, designed for high-throughput concurrent applications, showed early promise but also brought unexpected challenges in real-world scenarios.
In a recent post on the Netflix Tech Blog, the JVM Ecosystem team shared insights from their experience with virtual threads, in particular an issue where services experienced timeouts and hung instances. The issue stemmed from the interaction of virtual threads with blocking operations and OS thread availability, resulting in a deadlock-like situation in their Spring Boot-based applications.
Netflix engineers observed intermittent timeouts and non-responsive instances in services running on Java 21 with Spring Boot 3 and embedded Tomcat. Although the JVM instances remained up, they stopped serving traffic, and the number of sockets stuck in the closeWait state rose sharply. A socket enters this state (CLOSE-WAIT in TCP terms) when the remote side has closed the connection but the local side has not yet closed its end. RFC 793 describes the state in its terminology section.
Initial diagnostics suggested that virtual threads were implicated in the issue, although they didn't appear in traditional thread dumps. Using jcmd Thread.dump_to_file, the team found thousands of "blank" virtual threads, indicating threads that had been created but had not yet started running. The issue was traced to Tomcat's request handling, where new virtual threads were created but couldn't be scheduled because no OS threads were available.
#119821 "" virtual
#119820 "" virtual
#119823 "" virtual
#120847 "" virtual
#119822 "" virtual
...
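For readers who want to capture a comparable dump from inside a running service, the following is a minimal sketch, not Netflix's tooling. It uses the JDK 21 HotSpotDiagnosticMXBean.dumpThreads method, which is the programmatic counterpart of jcmd <pid> Thread.dump_to_file. The class name, temp directory, and file name are hypothetical, chosen only for illustration.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import com.sun.management.HotSpotDiagnosticMXBean.ThreadDumpFormat;
import java.io.IOException;
import java.lang.management.ManagementFactory;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper class; the name and output location are illustrative only.
class VirtualThreadDump {

    /** Writes a JSON thread dump to a fresh temp file and returns its path. */
    static Path dumpAllThreads() throws IOException {
        // dumpThreads requires an absolute path to a file that does not exist yet,
        // so a new temp directory is created for every dump.
        Path dir = Files.createTempDirectory("thread-dump");
        Path out = dir.resolve("threads.json").toAbsolutePath();
        HotSpotDiagnosticMXBean bean =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        bean.dumpThreads(out.toString(), ThreadDumpFormat.JSON);
        return out;
    }

    public static void main(String[] args) throws Exception {
        // A parked virtual thread, so the process has one alive while the dump is taken.
        Thread vt = Thread.ofVirtual().name("demo-virtual").start(() -> {
            try {
                Thread.sleep(5_000);
            } catch (InterruptedException ignored) {
                // interrupted by main once the dump has been written
            }
        });
        Path out = dumpAllThreads();
        System.out.println("Thread dump written to " + out);
        vt.interrupt();
        vt.join();
    }
}
```

ThreadDumpFormat.TEXT_PLAIN can be used instead of JSON when the output is meant to be read by eye rather than parsed.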
The analysis revealed that Tomcat's virtual thread executor was creating threads for each request, but these threads were stuck waiting for a lock. Specifically, the threads were pinned to OS threads due to blocking operations within synchronized blocks, exacerbated by the limited number of available OS threads in the ForkJoinPool.
The result resembled a classic deadlock: the virtual threads pinned to every available OS thread were all waiting for the same lock, so no OS thread was left to run any other virtual thread, including one that could have acquired the lock and made progress. This prevented new virtual threads from being scheduled, effectively stalling the application.
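The starvation effect can be illustrated with a small, terminating experiment. This is a sketch under stated assumptions, not the reproduction Netflix built: it assumes the default scheduler parallelism (one carrier thread per available processor), and the class and method names are hypothetical. Each blocker thread sleeps while holding its own monitor, so there is no lock contention at all, only pinning. A probe virtual thread submitted afterwards measures how long it waits for a carrier.

```java
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical demo class; names and timings are illustrative only.
class PinningStarvationDemo {

    /** Returns how many milliseconds a fresh virtual thread waited before it first ran. */
    static long probeDelayMillis(int blockers, Duration holdTime) throws InterruptedException {
        CountDownLatch allInside = new CountDownLatch(blockers);
        List<Thread> threads = new ArrayList<>();
        for (int i = 0; i < blockers; i++) {
            Object monitor = new Object(); // one monitor per thread: no contention
            threads.add(Thread.ofVirtual().start(() -> {
                synchronized (monitor) {
                    allInside.countDown();
                    try {
                        Thread.sleep(holdTime); // blocking call made while holding a monitor
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                }
            }));
        }
        allInside.await(); // every blocker is now inside its synchronized block

        long submitted = System.nanoTime();
        long[] firstRan = new long[1];
        Thread probe = Thread.ofVirtual().start(() -> firstRan[0] = System.nanoTime());
        probe.join(); // join makes the probe's write visible to this thread
        for (Thread t : threads) {
            t.join();
        }
        return TimeUnit.NANOSECONDS.toMillis(firstRan[0] - submitted);
    }

    public static void main(String[] args) throws InterruptedException {
        // Assumption: scheduler parallelism is left at its default (available processors).
        int carriers = Runtime.getRuntime().availableProcessors();
        long delay = probeDelayMillis(carriers, Duration.ofSeconds(2));
        System.out.println("Probe virtual thread waited " + delay + " ms before running");
    }
}
```

On Java 21 the probe would be expected to wait roughly as long as the blockers sleep, because every carrier is occupied by a pinned thread. On later JDKs where synchronized no longer pins (JEP 491), the delay should be close to zero. The exact numbers depend on the JDK version and scheduler settings.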
To investigate further, Netflix's JVM Ecosystem team used a heap dump to inspect the lock's state and confirmed that no thread owned it, yet the threads waiting for it were unable to proceed. This was a transient state that should have resolved itself but instead left the application in a deadlock-like situation.
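The same two questions (who owns the lock, and who is queued behind it) can also be asked of a live lock object. The sketch below is an illustrative alternative to heap-dump analysis, not the method the team used. It assumes the lock is a java.util.concurrent.locks.ReentrantLock, which the text above does not state, and the InspectableLock subclass is a hypothetical helper that exposes ReentrantLock's protected getOwner and getQueuedThreads methods.

```java
import java.util.Collection;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical demo class; assumes the lock of interest is a ReentrantLock.
class LockInspection {

    /** ReentrantLock keeps owner and waiters protected; this subclass exposes them. */
    static final class InspectableLock extends ReentrantLock {
        Thread owner() {
            return getOwner();
        }

        Collection<Thread> waiters() {
            return getQueuedThreads();
        }
    }

    /** Summarizes the owner and the (estimated) number of queued threads. */
    static String describe(InspectableLock lock) {
        Thread owner = lock.owner();
        return "owner=" + (owner == null ? "none" : owner.getName())
                + ", queued=" + lock.getQueueLength();
    }

    public static void main(String[] args) throws InterruptedException {
        InspectableLock lock = new InspectableLock();
        lock.lock();
        Thread waiter = Thread.ofVirtual().name("waiter").start(() -> {
            lock.lock();
            lock.unlock();
        });
        // Spin until the virtual thread is actually queued on the lock.
        while (!lock.hasQueuedThread(waiter)) {
            Thread.onSpinWait();
        }
        System.out.println(describe(lock)); // lock held by main, waiter queued
        lock.unlock();
        waiter.join();
        System.out.println(describe(lock)); // lock free, nobody queued
    }
}
```

A state with no owner but a non-empty queue is normally short-lived: it exists only between one thread releasing the lock and the next waiter waking up to take it. Seeing it persist is the signal that the next waiter cannot be scheduled.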
The team identified the root cause and developed a reproducible test case to prevent similar issues in the future. While virtual threads in Java 21 have shown potential for improving performance by reducing overhead, this case highlights the importance of understanding their interaction with existing threading models and locking mechanisms.
Adding to Netflix's findings, a recent case study on InfoQ also delves into the practical challenges and benefits of virtual threads, particularly in scenarios involving heavy concurrent workloads. This study underscores the need for careful consideration and testing when integrating virtual threads into production systems, as even small architectural details can lead to significant performance impacts.
In addition to virtual threads, Netflix’s adoption of generational ZGC has also played a crucial role in optimizing its systems, as mentioned in one of the recent articles. ZGC, with its ability to maintain low pause times even as heap sizes grow, has significantly improved Netflix's application performance by reducing garbage collection overhead and enhancing responsiveness. More on generational ZGC can be found in this InfoQ news item.
Netflix also has a robust alert system, leveraging its Atlas Streaming Eval platform, which was vital in identifying and diagnosing these issues. The system, designed for improved real-time monitoring and alerting, enabled the team to catch instances in a problematic state and provided critical data for retroactive analysis.
Despite the challenges, Netflix is optimistic about the future of virtual threads and anticipates further improvements in upcoming Java releases, particularly in addressing the integration challenges with locking primitives. This case study is a valuable example for performance engineers and developers as they explore virtual threads in their applications.