When performing JVM runtime diagnostics the standard (and need I say pretty crude) approach adopted by most working for companies with very little in the way of mature IT management processes and tooling is the thread (or exception) stack trace dump.

There are numerous problems with this approach including the lack of object & argument state information (already solved quite comprehensively) as well as timing & resource usage which I hope to address here.

Lets start with the sample code used to generate the stack trace print outs presented in this entry.

A simple deep recursive call chain execution ending with a wait followed by a stack trace print.

probestackcode1

Here is a small extract of a single stack trace printed by the above code.

probestackout

The problem with the output is that we have no inclination to how long each frame has been on the threads execution call stack and the resources it has consumed since added.

We cannot even determine whether two identical stack traces, printed by the same thread, represent the exact same execution call (frame) stack. Is this a new request at the same execution point or has the code stalled (waited) since the last print?

In addition the stack trace print contains frames that represent noise in our runtime analysis if we were focused on tracking down performance bottlenecks. This can also be the case when analyzing applications deployed to standard middleware containers.

Now lets rewrite the waiting method to use our recently announced stack probes provider extension.

probestackcode21

Here is an extract of the stack trace printed.

probestackxml1

A few observations. It is in XML format. Each fired and metered Probe (frame) has a unique id within the context of each thread. The wall clock time (in microseconds) is listed as a Meter for each frame on the stack.

Note: With the id we can compare across stack trace prints and performance call stack sampling with a much higher degree of accuracy (no second guessing whether the method indeed remained active during interval).

Did you notice that the intermediary stacks.Dumping.call method is absent in the above output? The reason is the call(int) method had been executed a sufficiently number of times (default=1000) prior to this particular stack print for the resource metering runtime to have already determined that it is not a hotspot (in terms of inherent cost) and does not need to be metered any further. It was fired but not metered.

Note: The runtime allows for different meters to be used in determining whether a particular named probe (method) is a hotspot or not.

Here is another stack trace printed later in the same run.

probestackxml2

Now all the stacks.Dumping.$* methods have completely disappeared. We are left with probes fired and metered for methods marked as hotspots or still under analysis by the probes runtime metering strategies.

You have probably already guessed that if you enable additional resource meters within the metering runtime these will also be included in the probe stack prints.

probestackxml3

Conventional call stack dumps quickly fill up a log file with a large amount of irrelevant and duplicated data but they are also extremely costly both in terms of wall clock time and CPU usage.

probestackchart

I am not advocating the complete elimination of thread stack trace dumps from logs though I am sure many of us would wish it to be so especially when staring at the same exception printed 20 times or more as it passes back up through the request processing pipeline.

Having the complete call frame stack can be helpful in a few cases but for the most part it is woefully inefficient (both at runtime and off-line analysis), wasteful, and severely lacking in important runtime diagnostic information – object state and resource usage.

There are solutions available today that make problem determination pretty efficient even for the most complex of systems. You just need to step outside of the runtime and to stop staring into the Sun (it can blind you).

Note: With the solution above probe stack prints can also cross multiple language boundaries – Java-to-Ruby.

7 Responses to “Better Java Thread Stack Trace Dumps”

  1. Ian Clarke Says:

    Sometimes you see a problem, and you think – “Hey, I could use XML to solve this problem”. Then you have two problems…

    ;-P

    • williamlouth Says:

      Hi Ian,

      I think we all agree that XML can be abused by developers especially when used as a execution language or as a means to create hundreds of POJO’s. I believe in this case its usage is justified and plays to its strengths especially in light of the differences in stack trace print-outs across the various JVM’s. We can easily output in different formats and you have access to the data yourself via the Open API.

      The central problem was lack of information and not the actual formats of stack traces which is a problem that has been focused on by many other stack trace parsing tools which in my opinion seems to completely missed the point.


  2. [...] 11, 2009 Following my last post in which I presented an alternative (more like complimentary) Java stack trace print I have [...]

  3. afgafa Says:

    I agree William, Ian seems to troll with no actual argument. If every usage of XML was a problem, I don’t think it would exist!

  4. Peter Booth Says:

    You make a good point about the crude, but useful nature of Java stack traces. There is another side to this discussion though: you market a sophisticated, powerful tool that makes the stack trace seem crude.

    However there are many shops that don’t ever produce stack traces in production, look at verbose:gc output, enable sar, etc because they fear impacting their system. I frequently encounter developers who have no runtime metrics whatsoever, regardless of how crude and imperfect they are, and my job is allay fears and convince that more knowledge is a better thing.

  5. williamlouth Says:

    Global stack dumps as they currently stand (expensive with low information content) and in light of current operating contexts (high number of threads, huge stack execution depths) should be avoided and used as a last resort. Unfortunately that is not the reality.

    If there is a better way why would you want to use stack trace dump?

    Judging by the number of “stack trace” and “logging” searches hitting this blog I think we are unfortunately going to have to stay in the dark ages for quite some time more.

  6. williamlouth Says:

    Incidentally with our probes stacks you obtain much more than a method dump as our named probes (that are stacked) can be contextual – derived from say a http/jdbc url, sql/ejb query, or ruby/scripting call.


Leave a Reply