It is notoriously hard to detect and distinguish busy and hung threads in todays complex enterprise middleware even more so when the middleware is distributed across multiple processing nodes with temporal execution dependency (locking, co-ordination). A large part of the reason for this is the lack of current execution information associated with frames within thread stack dumps. For one we cannot easily determine from two thread dumps listing the same call stack for a thread whether the thread in question is hung waiting on a response from a resource  or whether the thread is performing the same operation again following receipt of another request into a processing queue. We might not even be able to detect the case were the thread is merely busy making calls from the stack listed in the dump to other fast executing methods not listed.

To help with the problem analysis we have recently enhanced our optional stack probes provider to collect additional information such as the number of times a uniquely identifiable stack appears across two mark requests from within our management console and the number of calls (or frames) made during the execution interval of a caller frame.

Here is a Java class I quickly wrote to demonstrate the problem and the solution. I have created two Runnable classes that mimic the behavior of hung and busy threads.

stacks.code.1

Here is a sample of a partial stack dump during the execution of the above class. As you can see it is pretty hard to actually determine the type of problem that may be present ignoring the obvious clue in the class naming.

stacks.dump.1

Below is a resource metering snapshot taken during a similar execution run with our dynamic probes instrumentation enabled. What we can quickly see from this view is that the run() call executed by Thread-0 has only made one additional direct (or indirect) call during its ongoing execution interval. Whereas the run() call executed by Thread-1 has been busy performing more than 1 million calls (stack push operations).

stacks.probes.1

Now we cannot be sure that Thread-0 is effectively hung because it might have only just began executing prior to the snapshot capture. Likewise we cannot be sure of our initial prognosis for Thread-1 as it might just be finishing up or it has been active for a prolonged period in the runtime waking up occasionally to service queued requests.

So lets make a mark (reset) from within our management console that instructs the remote managed runtime to mark all current resource metering related data including associated stacks. Then we wait a little while (a few seconds) and then capture another snapshot shown below. From the Marks table column we can see that some parts of the stacks present in this new snapshot capture were present prior to our recent mark remote operation – marks > 0. The only new stack frame is the Busy.doit() call which has a thread execution sequence id of 3031907.

In addition we can see from the Count Delta column that only the run() call executed by Thread-1 has performed additional (instrumented) calls during its still unfinished execution.

stacks.probes.2

To be sure that the stacks visible from within the management console have indeed an execution period of longer than a few seconds we can make another mark. Now any element of a stack listed in a snapshot following the last mark operation with a Marks value greater than 1 has at least an execution duration of the time difference between both mark operations.

For those elements with a Marks value of 1 we can only say with confidence that the stack (or partial stack) has existed since the period of the last mark to the point at which a snapshot is captured which can be done repeatedly without marking.

stacks.probes.3
Now if that is not exciting enough we can easily perform the same steps across a complete cluster of nodes from a single view within our management console and this works for all probe related extensions whether the probes represent actual Java methods or contextually mapped Ruby and Python calls.

There are two basic approaches to developing diagnostics for any enterprise Java application server (be it OSGi based or not). The first approach is to create lots of custom code that treats every subsystem, service and component uniquely different in terms of presentation, collection and modeling forcing the creation of extraneous interfaces and/or additions to existing classes to support state extraction. The second approach involves defining a common model and representation of state data (including meta data) that can be used across multiple diagnostics (state) datasources, implemented by a single provider using runtime object state inspection (without any code changes) and supporting exploration within a single management console in a consistent manner and accessible whilst being disconnected.

Unfortunately nearly all application server vendors select the first approach including those newcomers that claim to be anti-bloat ware.

Naturally we opted for the second option building a kind of dynamic state based configuration management database (CMDB) snapshot facility with supporting disconnected offline analysis tooling.

Amazingly some vendors require you to start an actual server instance to view the contents of a dump in a readable format.

With JXInsight all that is required to capture diagnostics state is to simply drop a technology specific observation extension into the class path of an application server. Here I have used our jxinsight-ext-aj-osgi-bundle-observe.jar to collect actual state diagnostics for live OSGi bundles.

Note: A snapshot can contain both dynamic & static, public & private, state. With static state typically representing configuration loaded from a persistent store such as a file system.

1

The depth to which object (graph) state is collected within a diagnostic snapshot is configurable but typically 6-8 is sufficient for most offline analysis purposes.

2

By combining both state and metadata into a single diagnostic snapshot we can render, highlight and annotate various items within the model – all in a consistent manner with as much detail as one would required in whatever role we assume during the problem analysis.

Note: We record the time(stamp) the object was constructed or first appeared to the diagnostics runtime as well as the system identity hash code.

3

If we want to enhance the collection capabilities of the diagnostics system we simply drop in another extension library. Here I have added jxinsight-ext-aj-osgi-service-observe.jar

4

We can even record much more dynamic state including that which is being referenced directly or indirectly during request processing. Here I have added jxinsight-ext-aj-osgi-service-diagnostics.jar to the classpath to collect at any moment in time within the management console the object graph state related to arguments and target objects currently executing methods defined in the one or more of the OSGi service related interfaces. The method being called is getServiceReferences by thread server-dm-1 with a single string argument on a target object of type BundContext.

5