Production debugging is hard, and it's getting harder. With architectures becoming more distributed and code more asynchronous, pinpointing and resolving errors that happen in production is no child's play. In this article we'll cover some advanced techniques that can help you get to the root cause of painful production bugs more quickly, without adding material overhead to your already busy production environment.
Better jstacks. jstack has been around for a long time (about as long as the JVM itself) and to this day remains a crucial tool in every developer's arsenal. Whenever you're staring at a Java process that's hung or not responding, it's your go-to tool for seeing the stack trace of each thread. Even so, jstack has a couple of disadvantages that detract from its ability to help you debug complex issues in production. The first is that while it tells you what each thread is doing through its stack trace, it doesn't tell you why it's doing it (the kind of information usually only available through a debugger). The second is that it doesn't tell you when the thread started doing it.
Fortunately, there's a great way to fix that and make a good tool even better: injecting dynamic variable state into each thread's data. The key lies in one of the most unexpected places. You see, at no point in its execution does jstack query the JVM or the application code for variable state to present. There is, however, one important exception which we can leverage to turn lemons into lemonade: the thread's name (set via Thread.setName()), whose value is injected into the stack dump. By setting it correctly you can move away from uninformative jstack thread printouts that look like this:
"pool-1-thread-1" #17 prio=5 os_prio=31 tid=0x00007f9d620c9800 nid=0x6d03 in Object.wait() [0x000000013ebcc000]
Compare that with the following thread printout that contains a description of the actual work being done by the thread, the input parameters passed to it, and the time in which it started processing the request:
"pool-1-thread-1 #17: Queue: ACTIVE_PROD, MessageID: AB5CAD, type: Analyze, TransactionID: 56578956, Start Time: 10/8/2014 18:34"
Here's an example of how to set a stateful thread name:
private void processMessage(Message message) { // an entry point into your code
    String originalName = Thread.currentThread().getName();
    try {
        // fold the transaction ID, message type/ID and start time into the thread name
        Thread.currentThread().setName(prettyFormat(originalName, getCurrTransactionID(),
                message.getMsgType(), message.getMsgID(), getCurrentTime()));
        doProcessMessage(message);
    } finally {
        Thread.currentThread().setName(originalName); // restore the original name
    }
}
In this example, where the thread is processing messages off a queue, we see the target queue from which the thread is dequeuing messages, the ID of the message being processed, the transaction to which it relates (critical for reproducing the issue locally), and last, but far from least, the time at which processing of this message began. This last bit of information enables you to look at a server jstack with upwards of a hundred worker threads and see which ones started first and are most likely causing the application server to hang.
An example of how an enhanced jstack shows dynamic variable state for each thread in the dump. Thread start time is marked as TS.
The capability works just as well when you're using a profiler, a commercial monitoring tool, a JMX console, or Java Mission Control. In all of these cases, your ability to look at live thread state, or a historic thread dump, and see exactly what each thread is doing and when it started is materially enhanced by having stateful thread contexts. The same variable state will also be visible in any JDK or commercial debugger or profiler.
But the value of thread names doesn't stop there. They play an even bigger role in production debugging, even if you don't use jstack at all. One instance is the global Thread.UncaughtExceptionHandler callback, which serves as a last line of defense before an uncaught exception terminates a thread (or before the thread is returned to the pool).
By the time an uncaught exception handler is reached, code execution has stopped and both the stack frames and their variable state have already been rolled back. The only state left from which to log the task the thread was processing, its parameters and its start time is captured by - you guessed it - a stateful thread name (along with any additional thread-local variables loaded onto the thread).
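To make that concrete, here's a minimal sketch of registering a default uncaught exception handler that logs the thread's name together with the exception (the logging target is illustrative, not from the original article):

Thread.setDefaultUncaughtExceptionHandler(new Thread.UncaughtExceptionHandler() {
    @Override
    public void uncaughtException(Thread t, Throwable e) {
        // log the thread's current name (which carries whatever state was
        // formatted into it) together with the uncaught exception
        System.err.println("Uncaught exception in thread: " + t.getName());
        e.printStackTrace();
    }
});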
It's important to keep in mind that a threading framework might implicitly catch exceptions without you knowing it. A good example is ThreadPoolExecutor, which catches any exception thrown by your Runnable and delegates it to its afterExecute method, which you can override to log something meaningful (a sketch follows below). So whatever framework you use, be mindful that if a thread fails you still have a chance to log the exception and the thread's state, to avoid tasks disappearing into the ether.
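For instance, an override might look like this (the pool configuration and logging target are illustrative):

import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class LoggingThreadPool extends ThreadPoolExecutor {
    public LoggingThreadPool(int poolSize) {
        super(poolSize, poolSize, 0L, TimeUnit.MILLISECONDS, new LinkedBlockingQueue<Runnable>());
    }

    @Override
    protected void afterExecute(Runnable r, Throwable t) {
        super.afterExecute(r, t);
        // t is non-null when a task passed to execute() threw; tasks submitted via
        // submit() wrap their exceptions in the returned Future instead
        if (t != null) {
            System.err.println("Task failed on thread: " + Thread.currentThread().getName());
            t.printStackTrace();
        }
    }
}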
Throughput and deadlock jstacks. Another disadvantage of tools like jstack or Mission Control is that they need to be activated manually on a target JVM that is experiencing issues. This reduces their effectiveness in production, where 99% of the time issues occur when you're not there to debug them.
Happily enough, there's a way to activate jstack programmatically when your application's throughput falls under a specific threshold or meets a specific set of conditions. You can even automatically activate jstack when your JVM deadlocks, to see exactly which threads are deadlocked and what all the other threads are doing (coupled, of course, with dynamic variable state for each one, courtesy of stateful thread names). This can be invaluable, as deadlocks and concurrency issues are sporadic for the most part and notoriously hard to reproduce. Activating jstack automatically at the moment of deadlock, producing a dump that also contains the stateful information for each thread, can be a huge catalyst in your ability to reproduce and solve these kinds of bugs.
Click here to learn more about how to automatically detect deadlocks from within your application.
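As a rough illustration of the idea, here's a minimal sketch using the JDK's built-in ThreadMXBean (the scheduling interval and logging target are up to you):

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class DeadlockWatchdog implements Runnable {
    @Override
    public void run() {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        long[] deadlocked = threads.findDeadlockedThreads(); // null if no deadlock
        if (deadlocked != null) {
            // dump every thread (stacks, monitors and locks) - the programmatic
            // equivalent of a jstack, including our stateful thread names
            for (ThreadInfo info : threads.dumpAllThreads(true, true)) {
                System.err.println(info);
            }
        }
    }
}
// schedule it periodically, e.g.:
// Executors.newSingleThreadScheduledExecutor()
//     .scheduleAtFixedRate(new DeadlockWatchdog(), 1, 1, TimeUnit.MINUTES);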
Capturing live variables. We’ve talked about ways of capturing state from the JVM through thread contexts. This approach, however effective, is restricted to variables that you had to format into the thread name in advance. Ideally, we want to be able to go in and get the value of any variable from any point in the code from a live JVM, without attaching a debugger or redeploying code. A great tool that’s been around for a while, but hasn’t got the recognition it deserves, lets you do just that; and that tool is BTrace.
BTrace is a helpful tool that lets you run Java-like scripts on top of a live JVM to capture or aggregate any form of variable state without restarting the JVM or deploying new code. This enables you to do pretty powerful things like printing the stack traces of threads, writing to a specific file, or printing the number of items in any queue or connection pool, and much more.
This is done using BTrace scripting, a Java-like syntax in which you write functions that are injected into the code at locations of your choice through bytecode transformation (a process we'll touch on below). The best way to try out the tool is to attach its sample scripts to a live application. Usage is very straightforward: from your command line simply enter btrace <JVM pid> <script name>. There's no need to restart your JVM.
BTrace is very well documented and comes with many sample scripts (see below) to cover various common debugging scenarios around IO, memory allocation and class loading. Here are a couple of powerful examples of things you can do very easily with BTrace -
- NewArray.java: Print whenever a new char[] is allocated, and also add your own conditions based on its value. Pretty handy for selective memory profiling.
- FileTracker.java: Print whenever the application writes to a specific file location. Great for pinpointing the cause of excessive IO operations.
- Classload.java: React whenever a target class is loaded into the JVM. Very useful for debugging “jar-hell” situations.
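To give a feel for the syntax, here's a sketch of a small BTrace script (written against the classic com.sun.btrace packages; the traced class and method are hypothetical):

import com.sun.btrace.annotations.BTrace;
import com.sun.btrace.annotations.OnMethod;
import static com.sun.btrace.BTraceUtils.*;

@BTrace
public class MessageTracer {
    // fires on entry to our (hypothetical) message processing method
    @OnMethod(clazz = "com.example.MessageProcessor", method = "processMessage")
    public static void onProcess() {
        // strcat is used because BTrace scripts may not allocate new objects
        println(strcat("processMessage entered on thread: ", name(currentThread())));
        jstack(); // print the current thread's stack trace
    }
}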
BTrace was designed as a non-intrusive tool, which means it cannot alter application state or flow control. That's a good thing, as it reduces the chance of negatively interfering with the execution of live code and makes its use in production much more acceptable. But this capability comes with some heavy restrictions: you can't create objects (not even Strings!), call into your own or 3rd party code (to perform actions such as logging), or even do simple things such as looping, for fear of creating an infinite loop. To do those you'll have to use the next technique - writing your own Java agent.
A Java agent is a jar file to which the JVM hands an Instrumentation object, enabling you to modify bytecode that has already been loaded into the JVM in order to alter its behavior. This essentially lets you "rewrite" code that has already been loaded and compiled by the JVM, without restarting the application or changing the .class file on disk. Think of it as BTrace on steroids: you can inject new code anywhere in your app, into both your own and 3rd party code, to capture any piece of information you want.
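Here's a minimal sketch of what such an agent's skeleton might look like (the target class name is hypothetical, and the actual bytecode rewriting with ASM, Javassist or similar is left as a placeholder):

import java.lang.instrument.ClassFileTransformer;
import java.lang.instrument.Instrumentation;
import java.security.ProtectionDomain;

public class StateCaptureAgent {
    // referenced from the agent jar's manifest via the Premain-Class attribute
    public static void premain(String agentArgs, Instrumentation inst) {
        inst.addTransformer(new ClassFileTransformer() {
            @Override
            public byte[] transform(ClassLoader loader, String className,
                                    Class<?> classBeingRedefined,
                                    ProtectionDomain protectionDomain,
                                    byte[] classfileBuffer) {
                // internal names use '/' separators; skip classes we don't care about
                if (!"com/example/MessageProcessor".equals(className)) {
                    return null; // null means "leave this class unchanged"
                }
                // rewrite the bytecode here (e.g. with an ASM ClassVisitor) and return it
                return classfileBuffer;
            }
        });
    }
}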
The biggest downside to writing your own agent is that unlike BTrace, which lets you write Java-like scripts to capture state, Java agents operate at the bytecode level. This means that if you want to inject code into an application you'll have to produce the right bytecode. This can be tricky, because bytecode can be hard to produce and read, following an operand-stack syntax which is similar in many ways to assembly language. And to make things harder, since the injected bytecode must pass verification, any mismatch with the location into which it is injected will be rejected without much fanfare by the JVM's verifier.
To assist with this, many bytecode generation libraries have been written over the years, such as Javassist and ASM (which is my personal favorite). A great hack I find myself using quite a lot is the ASM byteview IDE plugin, which lets you type any Java code in your editor and automatically generates the ASM code that would produce its equivalent bytecode, which you can then copy and paste into your agent.
Click here for a real-world example of a sample agent we used to detect sporadic memory leaks coming from 3rd party code on our server, correlating them to application state.
Dynamically generating bytecode generation scripts from your IDE using the ASM bytecode plugin.
The last technique I'd like to touch on briefly is building native JVM agents. This approach uses the JVM TI C++ API layer, which gives you unprecedented control over and access to the internals of the JVM. This includes things like getting callbacks whenever GC starts and stops, whenever new threads are spawned, whenever monitors are acquired, and many more low-level capabilities. This is by far the most powerful approach to acquiring state from running code, as you are essentially running at the JVM level.
But with great power comes great responsibility, and some pretty complex challenges make this approach relatively harder to implement. The first is that since you're running at the JVM level, you're no longer writing cross-platform Java but low-level, platform-dependent C++. The second is that the APIs themselves, while extremely powerful, are hard to use and can impact performance significantly, depending on the specific set of capabilities you're consuming.
On the plus side, if used correctly, this layer provides terrific access to parts of the JVM which would otherwise be closed to us in our search for the root cause of production bugs. When we began writing Takipi for production debugging, I'm not sure we knew the extent to which JVM TI would play a crucial role in our ability to build the tool. The reason is that through this layer you're able to detect exceptions, see calls into the OS, or map application code without manual user instrumentation. If you have the time to take a look at this API layer, I highly recommend it, as it opens a unique window into the JVM that not many of us know about.
About the Author
Tal Weiss is the CEO of Takipi. Tal has been designing scalable, real-time Java and C++ applications for the past 15 years. He still enjoys analyzing a good bug though, and instrumenting Java code. In his free time Tal plays Jazz drums.