It is notoriously hard to detect and distinguish busy and hung threads in todays complex enterprise middleware even more so when the middleware is distributed across multiple processing nodes with temporal execution dependency (locking, co-ordination). A large part of the reason for this is the lack of current execution information associated with frames within thread stack dumps. For one we cannot easily determine from two thread dumps listing the same call stack for a thread whether the thread in question is hung waiting on a response from a resource  or whether the thread is performing the same operation again following receipt of another request into a processing queue. We might not even be able to detect the case were the thread is merely busy making calls from the stack listed in the dump to other fast executing methods not listed.

To help with the problem analysis we have recently enhanced our optional stack probes provider to collect additional information such as the number of times a uniquely identifiable stack appears across two mark requests from within our management console and the number of calls (or frames) made during the execution interval of a caller frame.

Here is a Java class I quickly wrote to demonstrate the problem and the solution. I have created two Runnable classes that mimic the behavior of hung and busy threads.

stacks.code.1

Here is a sample of a partial stack dump during the execution of the above class. As you can see it is pretty hard to actually determine the type of problem that may be present ignoring the obvious clue in the class naming.

stacks.dump.1

Below is a resource metering snapshot taken during a similar execution run with our dynamic probes instrumentation enabled. What we can quickly see from this view is that the run() call executed by Thread-0 has only made one additional direct (or indirect) call during its ongoing execution interval. Whereas the run() call executed by Thread-1 has been busy performing more than 1 million calls (stack push operations).

stacks.probes.1

Now we cannot be sure that Thread-0 is effectively hung because it might have only just began executing prior to the snapshot capture. Likewise we cannot be sure of our initial prognosis for Thread-1 as it might just be finishing up or it has been active for a prolonged period in the runtime waking up occasionally to service queued requests.

So lets make a mark (reset) from within our management console that instructs the remote managed runtime to mark all current resource metering related data including associated stacks. Then we wait a little while (a few seconds) and then capture another snapshot shown below. From the Marks table column we can see that some parts of the stacks present in this new snapshot capture were present prior to our recent mark remote operation – marks > 0. The only new stack frame is the Busy.doit() call which has a thread execution sequence id of 3031907.

In addition we can see from the Count Delta column that only the run() call executed by Thread-1 has performed additional (instrumented) calls during its still unfinished execution.

stacks.probes.2

To be sure that the stacks visible from within the management console have indeed an execution period of longer than a few seconds we can make another mark. Now any element of a stack listed in a snapshot following the last mark operation with a Marks value greater than 1 has at least an execution duration of the time difference between both mark operations.

For those elements with a Marks value of 1 we can only say with confidence that the stack (or partial stack) has existed since the period of the last mark to the point at which a snapshot is captured which can be done repeatedly without marking.

stacks.probes.3
Now if that is not exciting enough we can easily perform the same steps across a complete cluster of nodes from a single view within our management console and this works for all probe related extensions whether the probes represent actual Java methods or contextually mapped Ruby and Python calls.

Leave a Reply