The system layer cake
The box, as depicted in figure 1, consists of four layers labeled People, Application, Java Virtual Machine, and Hardware. As we explore the box further we will see the major role that each of these layers plays in how our systems perform. We will also see how the box can help us organize our efforts to find our performance bottlenecks.

Figure 1. The Box
People
It has been said, in regard to performance, that if you change the people you've changed the problem. What this is saying is that performance bottlenecks are sensitive to the load that is put on the system. Since changing any layer in the box leaves us with a different system, it is only consistent that the box includes People. Having the People layer represent only people isn't enough, however. People also stands for anything that drives our system, including batch processes and other systems. These all put demands on the layers below, which in turn consume the scarce resources they provide.
From this we can see that in order to understand our system, we must first understand what the People are doing. This understanding is what lets us set up a load test, and it is this setup that will be critical in helping us identify bottlenecks. You may be thinking that UML already contains something called an actor, whose role is to describe the forces acting upon the system. Indeed it does, but while a UML actor describes what is supposed to happen, it often lacks the other information that is needed in order to set up a load test.
The list of things we need to know in order to create a good simulation includes the number of users, what they are doing, how often they are doing it, and when they are doing it. We also need to consider scenarios such as beginning- and end-of-shift activities, seasonal trends, special events, and the ever-present 2 a.m. backup. Once we've collected all of this information and scripted it into a load testing tool, we are ready to start the process of finding our bottlenecks.
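To make these ingredients concrete, here is a minimal, hand-rolled sketch of what such a simulation boils down to. The user count, think time, test duration, and the performScenario() action are all invented for illustration; in practice these numbers come from the usage data described above, and the scenario would be scripted in a proper load testing tool rather than coded by hand.

    import java.util.concurrent.*;

    public class SimpleLoadDriver {
        static final int USERS = 50;                // assumed number of concurrent users
        static final long THINK_TIME_MS = 2000;     // assumed pause between user actions
        static final long TEST_DURATION_MS = 60000; // assumed length of the test run

        public static void main(String[] args) throws InterruptedException {
            ExecutorService users = Executors.newFixedThreadPool(USERS);
            final long end = System.currentTimeMillis() + TEST_DURATION_MS;
            for (int i = 0; i < USERS; i++) {
                users.submit(() -> {
                    while (System.currentTimeMillis() < end) {
                        long start = System.nanoTime();
                        performScenario();          // stand-in for one scripted user action
                        long elapsedMs = (System.nanoTime() - start) / 1000000;
                        System.out.println("response time (ms): " + elapsedMs);
                        try {
                            Thread.sleep(THINK_TIME_MS);
                        } catch (InterruptedException e) {
                            return;
                        }
                    }
                });
            }
            users.shutdown();
            users.awaitTermination(TEST_DURATION_MS + 10000, TimeUnit.MILLISECONDS);
        }

        // Hypothetical user action; replace with the request being simulated.
        static void performScenario() {
            try {
                Thread.sleep(50);
            } catch (InterruptedException ignored) {
            }
        }
    }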
Application
The layer below People is the Application. I like to think of this layer as a mapping between the People and the JVM and Hardware. If we've been clever or lucky, we will have managed to define an efficient mapping, one that makes minimal use of the underlying resources to satisfy the requests of the People. In more common terms we would say that our system is efficient. If it isn't, then every book, article, piece of advice, and fiber in our bodies tells us that this is the layer we are going to have to work on, so why not just ignore the box and start here?
I have had the opportunity to give one very simple performance tuning exercise to several hundred developers. The modest goal is to make the method in question run three times faster. The observations, though unscientific, were nonetheless stunning. After 30 minutes of effort, fewer than 2% of all participants were able to identify the bottleneck, and the vast majority of those who did ignored the code and looked at what the lower layers of the box were telling them. The conclusion from this unscientific set of observations: ignore the code until after you've looked at the lower layers of the box. Even then, your foray into the code should use a profiler as a guide.
Java Virtual Machine
Just as the Application can be seen as a mapping between People and the JVM, the JVM can be seen as a mapping between the Application and the Hardware. While we don't really have the option of changing the code in the JVM, we may have the option of changing the JVM itself. More likely, though, we will work to configure (tune) the JVM by setting some of its many command line switches. An improperly configured JVM, one that artificially starves an application of resources, will have a huge negative impact on application performance.
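As a purely illustrative example, and not a recommendation, a HotSpot launch line with a few of the switches most commonly adjusted might look like the following, where MyApplication is a placeholder and the values are assumptions: -Xms and -Xmx pin the initial and maximum heap size, -XX:+UseParallelGC selects the parallel (throughput) collector, and -verbose:gc turns on the garbage collection logging discussed later in this article.

    java -Xms512m -Xmx512m -XX:+UseParallelGC -verbose:gc MyApplication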
Hardware
The final layer in the box is Hardware. This is a static layer with a finite capacity. The CPU can only process so many instructions per second, memory can only hold so much data, I/O channels have limited data transfer rates, and disks have a fixed capacity. It hardly needs to be said that if you don't have enough capacity in your hardware, your application's performance will suffer. Given the direct impact that hardware has on performance, all investigations must start here.
Treating the kernel like a rented mule
There are many tools that will help you see if your system is starved for one of the four primary resources: CPU, memory, disk, and network I/O. On Windows it is as simple as opening the Task Manager, as seen in figure 2. What this will invariably show is that your application is either consuming 100% of the CPU or is unable to consume 100% of the CPU. While this may not sound like much to go on, it is in fact a very valuable clue, especially when you combine it with other tidbits of information.
Figure 2. Windows Task Manager
If we look at figure 2 we can see that the CPU is running hot, in fact very hot. However, aside from the occasional spike, it runs at less than 100% utilization. Notice the red line below the green line in the CPU utilization graph. That line represents how much of the CPU the Windows kernel is consuming. Normally we would like kernel (system) CPU utilization to be less than 20% of the total. Anything greater is an indication that something in our application code is causing the operating system to work much harder than it should. To understand why, we need to understand what the kernel could be up to, and understanding what the kernel is up to can be invaluable in telling us what to look for in our code.
The tasks performed by the kernel include context switching, thread scheduling, memory management, interrupt handling, and so on. All of these activities require use of the CPU. Let's consider the task of managing memory. As demand for system memory increases, the kernel will most likely start ejecting pages from memory to disk. It will also be required to swap a page back in should an application make reference to it. In most cases the page to be ejected will be the one that has been least recently used. Finding that page requires use of the CPU, as does driving the disk I/O channel. The latter step of reading or writing to the disk will cause the kernel to stall, and a stalled kernel and a stalled application will not be using the CPU. However, if memory utilization becomes critical we will see the kernel frequently trying to figure out which page to eject, mixed in with the disk I/O. It is this scanning activity that can be responsible for high system CPU utilization numbers.
If we look at the bottom panel of figure 2 we can quickly see that there is plenty of memory in this system and most of it is free. Thus we can, with reasonable confidence, eliminate memory starvation as the source of the problem. We still need to do a little more work to completely confirm this finding, but instead of doing that now we will move on to the next contender, context switching.
Threads get swapped in and out of the CPU on a regular basis. It is this activity that allows us to run many processes at the same time. The time quantum given to each thread is fixed; when a thread has consumed its time quantum it will be replaced by another. All of this activity is managed by the thread scheduler running in the kernel, and to do this work the kernel must use the CPU. Under normal conditions the work involved in rescheduling a thread is barely noticeable. However, there are conditions under which a thread may be removed from the CPU before it has finished its time quantum. Some of the more common reasons are being blocked on I/O or being forced to wait for a lock. Repeated removal of threads from the CPU makes the scheduler work harder, and if it happens very frequently it can drive kernel CPU utilization high enough to impact application performance.
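The following contrived sketch, not taken from any real application, shows the kind of code that produces this behavior: every thread is funnelled through a single lock, so most of them are repeatedly blocked and rescheduled long before their time quantum is used up. Run it and watch the kernel (system) CPU time climb.

    // Contrived example: many threads contending for one monitor.
    public class HotLock {
        private static final Object LOCK = new Object();
        private static long counter = 0;

        public static void main(String[] args) {
            int threads = 32; // assumed; more threads than cores makes the contention obvious
            for (int i = 0; i < threads; i++) {
                new Thread(() -> {
                    while (true) {
                        synchronized (LOCK) { // every thread serializes on the same lock
                            counter++;        // trivial work while holding the lock
                        }
                    }
                }).start();
            }
        }
    }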
In the case of the graph in figure 2, frequent early context switching is the most likely cause of the high kernel CPU utilization. Knowing this provides us with a good starting point for further investigation and characterization of the problem. To review, we have characterized the behavior and we have some idea of what type of code may cause this type of problem. At this point we could look into the code and see if we can find anything that could cause the premature context switches, but there are still a lot of other possibilities. We will do better to use the box as a guide to help us further characterize the problem before we dive into the code. For example, we may want to re-aim our monitoring at network or disk activity, or check whether we are experiencing pressure on a lock. Once again we can turn to tools that read counters maintained by the kernel (on both Windows and Unix) or to tools such as VTune (Intel) or CodeAnalyst (AMD).
While high CPU utilization can be a problem, it is more likely that an I/O- or lock-bound application will display a strong aversion to the CPU. This aversion can be so strong that CPU utilization actually decreases as the load increases. What you will see, however, is that kernel CPU utilization takes up a significant portion of the overall CPU utilization.
Taking out the trash
In addition to looking at the hardware, we need to consider the effects that can be introduced by the JVM. The primary resources that the JVM provides us with are threads and memory (Java heap space). A large heap will cause your application to "stall" for long periods of time; a small heap will produce frequent short stalls. In either case, the process of managing memory, garbage collection, can consume a serious amount of CPU. Long story short, an improperly sized heap will cause your JVM to do a lot more work than is normally necessary, and a hard-working JVM is stealing CPU cycles from your application. Unlike kernel CPU utilization, JVM CPU utilization as reported by the system is not broken down into time spent running garbage collection and time spent running the application.
To measure the efficiency (or inefficiency) of the garbage collector we need to monitor garbage collection activity. We can do this by setting the -verbose:gc flag on the command line. This will cause a summary of each garbage collection to be logged to standard out. By using the numbers found in the GC log we can calculate GC throughput or GC efficiency.
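For example, with a HotSpot JVM the launch line and the resulting log entries look roughly like the lines below. The numbers are invented and the exact format varies between JVMs and versions, but each entry reports the heap occupancy before and after the collection, the total heap size, and the pause time.

    java -verbose:gc MyApplication

    [GC 65536K->16321K(262144K), 0.0312 secs]
    [Full GC 16321K->9728K(262144K), 0.2145 secs]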
GC efficiency is defined as the time spent in garbage collection over the running time of the application. Since GC can run thousands of times in the normal course of an application's runtime, it is best measured using a tool such as Tagtraum's GCViewer or HP's HPjtune (both freely available). A GC efficiency greater than 10% is an indication that one needs to tune the JVM heap. If CPU utilization is high and GC is running fine, then it is most likely that an algorithmic inefficiency is at fault, and to diagnose that possibility we'd turn to an execution profiler. Again, before turning to a profiler it is best to try to narrow the problem by looking at the box.
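As a back-of-the-envelope illustration, with invented numbers: if the pause times in the GC log sum to 75 seconds over a 600-second run, the arithmetic works out as follows.

    public class GCEfficiency {
        public static void main(String[] args) {
            double gcSeconds = 75.0;       // assumed: sum of the pause times in the GC log
            double elapsedSeconds = 600.0; // assumed: total running time of the application
            double overheadPercent = gcSeconds / elapsedSeconds * 100.0;
            // Prints 12.5 - above the 10% guideline, so this heap needs tuning.
            System.out.println("GC overhead: " + overheadPercent + "%");
        }
    }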
Threads and Thread Pooling
Application server vendors learned very early on that letting every request spawn a new thread could quickly destabilize an otherwise well-running system. The solution was to introduce thread pooling. Thread pooling works to limit the level of activity in your application by limiting the number of requests it can handle at once. The upside of thread pooling is that a system under load should maintain maximal throughput; the downside is that under high loads some requests may have to wait a long time to be serviced. The size of the thread pool is a tunable that can have a dramatic effect on performance. Too large a pool and you've negated its advantages. Too small and requests that could be processed won't be, which will inflate their response times. The only way to know if you've struck the proper balance is to monitor the number of active threads, response times, and the level of utilization of critical system resources.
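Here is a minimal sketch of the idea using the standard java.util.concurrent classes rather than any particular application server's pooling implementation. The pool size and queue length are assumptions, and they are exactly the tunables being discussed: the pool caps concurrent work, the queue holds requests that must wait, and the rejection policy decides what happens when both are full.

    import java.util.concurrent.*;

    public class RequestPool {
        public static void main(String[] args) {
            int poolSize = 25;      // assumed: maximum number of requests handled concurrently
            int queueLength = 100;  // assumed: requests allowed to wait for a thread

            ThreadPoolExecutor pool = new ThreadPoolExecutor(
                    poolSize, poolSize,
                    0L, TimeUnit.MILLISECONDS,
                    new ArrayBlockingQueue<Runnable>(queueLength),
                    new ThreadPoolExecutor.CallerRunsPolicy()); // back-pressure instead of unbounded growth

            for (int i = 0; i < 500; i++) {
                final int requestId = i;
                pool.execute(() -> handleRequest(requestId)); // requests beyond the pool size wait in the queue
            }
            pool.shutdown();
        }

        // Hypothetical request handler standing in for real application work.
        static void handleRequest(int id) {
            try {
                Thread.sleep(20);
            } catch (InterruptedException ignored) {
            }
        }
    }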
There are several ways to infer that your thread pools are too small. Most involve comparing the complete round trip time with the server's internal response time for a given request. If there is a large gap that cannot be accounted for by network latency, then most likely your thread pool is too small. On the other hand, if there is a large gap that cannot be accounted for and your hardware utilization indicates that the pool is, if anything, too large, then it could be that you simply don't have enough hardware or are suffering from an algorithmic inefficiency. Either way you've gained valuable insight into where to look next.
If we've eliminated the hardware and we've determined that the JVM is properly configured, the only things left to consider are lock contention and interactions with external systems. Each of these cases is characterized by threads that are stalled and thus unable to do "useful work". Useful is in quotes because there is no doubt that interacting with, say, a credit card service is clearly useful work. That said, long stalls may be an indication that the transaction should be handled asynchronously, freeing the thread to perform other useful tasks.
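A sketch of that idea follows. The slow external call, a hypothetical credit card authorization with an invented latency, is handed to a separate executor so the request thread can get on with other work and only blocks at the point where the result is actually needed.

    import java.util.concurrent.*;

    public class AsyncAuthorization {
        private static final ExecutorService authPool = Executors.newFixedThreadPool(4);

        public static void main(String[] args) throws Exception {
            // Submit the slow call and carry on with other work in the meantime.
            Future<Boolean> approval = authPool.submit(() -> authorize("card-number", 99.95));

            doOtherUsefulWork(); // the request thread is not stalled while the call is in flight

            boolean approved = approval.get(5, TimeUnit.SECONDS); // block only when the answer is needed
            System.out.println("approved: " + approved);
            authPool.shutdown();
        }

        // Stand-in for a remote credit card service with a long round trip.
        static boolean authorize(String card, double amount) throws InterruptedException {
            Thread.sleep(2000);
            return true;
        }

        static void doOtherUsefulWork() { /* other processing for this request */ }
    }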
Avoid the code
In this bottom-to-top investigative process we can see that the goal is to eliminate as many potential sources of the bottleneck as can possibly be eliminated before we start looking at the code. We also want to use the clues that we have derived from monitoring our system to narrow the focus of our search when we start to apply profiling tools to the problem. In most cases the last thing that we want to do is look at the code. In fact, should you feel the need to start digging around in the code, you've most likely not obtained the right measurement, the one that would tell you exactly where to look. For sure there will be times when you will need to rummage about in the code, but it is an activity you want to avoid because it leads to guessing, and guessing is the bane of all performance tuning activities.
Conclusion
Real applications create volumes of data that one must search through to find potential bottlenecks. Given the vast amounts of data that tools can produce, it is not surprising that teams try to take shortcuts and guess at potential problems. Sometimes these guesses are correct, but just as often they turn out to be wrong. The difficulty is that with guessing you will see inconsistency in your results. The purpose of the box is to eliminate guessing by showing us how to sequence an investigation. It also works to help us understand what is important in each of the major components of our systems. With the box, teams should be able to increase their effectiveness in finding and eliminating performance bottlenecks.
About the Author
Kirk Pepperdine is a Java performance tuning specialist. When he is not tuning applications you will find him teaching about performance tuning or writing articles like this one.