Hybrid Probe Profiling: Execution + Sampling
October 29, 2009
Whilst I have long argued that execution based profilers (metering agents & activity monitors) can out-perform sample based profilers on many levels (when implemented right) I have always recognized that statistical sampling techniques do occasionally have their place in ones arsenal in tackling the complexity of a real-world application/system performance model collected from a production environment under particular runtime constraints. That is why when we introduced our pioneering strategy based resource metering approach I made sure to include a number of out-of-the-box strategies that have sampling characteristics. This list initially included frequency, interval, sample, and burst. All of which have the following metering behavior – a non-metered gap and a metered run.

The two main differences in the metering strategies relate to the unit of measurement for both gap and run (time or frequency), and the scope of the strategy (global or probe name based).
Note: JXInsight’s Probes technology supports complex chaining of metering strategies allowing you to derive your very own metering strategy from base (built-in) strategies.
Each of the above mentioned metering strategies have their strengths and weaknesses depending on the application (or problem) context but none are truly comparable to how a standard call stack sampling profiler works so in a recent update to our product I approached this problem from a different angle within our runtime model by way of a new meter named sample.count which is basically a low-resolution clock (counter) with a user configurable time interval (sample) for increments (ticks). In doing so I addressed one of the key performance indicators (KPI) of a typical execution based profiler – the number of relatively costly (mostly native) counter access calls (which would otherwise occur in close time proximity in a heavily load system).
So how does this match much closer with the behavior of a call stack sampler? Well first of all the number of samples (meter increments) and the cost of the underlying native system calls is approximately the same (it is also a known quantity). Second of all only fired probes (metered activities) which have a metered execution period spanning at least one sample window increment (tick) will have their metering totals updated which means that the accuracy for high frequency and low cost methods is similar. Were it differs from sampling is in the increased accuracy of the performance model by including frequency (invocation) counts as well as collecting the min, avg and max for sample.count meter readings. In addition we can still utilize our default hotspot metering strategy by simply changing the strategy’s meter and the threshold value used.
Note: Thread call stack samplings in ability to track the complete duration (or ticks) of a method’s invocation restricts its application solely to development environments.
Naturally I was curious to see how a sample.count metering model compared with one based on the clock.time meter. So I metered the startup execution of a JBoss 4.x application server instance with a sample JBoss Seam application deployed. For this initial run I disabled our default hotspot metering strategy. The startup took less than 31 seconds with 776,468 method invocations metered within the org.jboss.* package which. This translates to 1.5 million clock time access calls – 50k / sec.
Note: “Count” is the number of method invocations metered and “Total” is the sum of the ticks during the execution period of the metered method invocation.

Here is the sample.count metering model for a similar startup sequence. The breakdown of the metering “Total %” column looks very similar to that of the clock.time model. The startup time took less than 30 seconds which translates (approximately) to 30,000 clock ticks with a default tick interval of 1ms – 1k / sec.

Of course one would expect the previous models to be very similar. A real comparison test would be with the hotspot metering strategy enabled which attempts to eliminate the metering of low cost instrumented method invocations (reducing the cumulative count). Here is the clock.time metering model for the startup. The number of metered probes fired has now halved though this reduction would be much higher in a long running system as we by default meter the first 1,000 firings per named probe.

With the hotspot metering strategy configured to use the sample.count meter the breakdown of both the “Count” and “Total” columns looks again similar to the clock.time metering model.

Though I still prefer to use our sample based metering strategies (more so because of familiarity) which guarantee accurate metering of those probes sampled the sample.count meter approach allows me to create a filtering like mechanism to the model at the granularity that suits the context of the investigation. With the meter I get the frequency for every probe fired which the strategy approach does not afford me but I only aggregate the metering total for those probes fired and metered that exceed a meter threshold I have defined.
Java Call Stack Sampling in the Wild
October 21, 2009
To be in anyway successful at managing & monitoring real-world Java enterprise applications in production (that excludes Spring “PetClinic”) one needs to assess the runtime impact of the various performance data collection techniques and their actual (tool) implementations. Unfortunately most programmers fail to understand this and instead develop/promote/deploy tools that are clearly completely inadequate for the task at hand – in a production context. They incorrectly assume ease of use equates to lightweight when in fact the only thing lightweight is the effort and due diligence performed in understanding the problem domain and managing the associated risks. It is for this reason why most operations teams immediately discount (with any evaluation) the use of monitoring tools a developer might have used in any way during localized tuning.
Recently this was made so evident when its was falsely claimed (without any qualification other than wishful thinking) that a Java sample based profiling tool could out-perform JXInsight – a dynamic strategy based resource metering solution. I will save the publication of a benchmark comparison for a (near) future blog but instead show you a chart (taken from within the “visual” tool itself) depicting the performance impact of such a technique on an application with a realistic number of threads (100) running with call stacks of realistic sizes (100-200).
Note: The similarities of both the 100 ms and 1,000 ms run sections indicates the tool was not able to perform at the lower of the sampling rates.

Even at a 10 second sample interval rate the runtime impact is excessive ignoring the fact that at this resolution the data would be completely meaningless unless the application was a dog in terms of performance which begs the question “How did such a beast get into production in the first place?”.
Yes the overhead of sampling is zero when not being performed but what type of enterprise application is not monitored continuously in production? Instrumentation needs to be a primary source of resource metering & software execution metric performance data. Sampling should be considered (and only after discounting many other sources and techniques) when the current instrumentation coverage is not adequate at the time a problem appears though I am doubtful of its effectiveness with the noise that would be created when enabled based on the charting above.
(Java) Call Stack Problem Analysis Simplified
October 6, 2009
It is notoriously hard to detect and distinguish busy and hung threads in todays complex enterprise middleware even more so when the middleware is distributed across multiple processing nodes with temporal execution dependency (locking, co-ordination). A large part of the reason for this is the lack of current execution information associated with frames within thread stack dumps. For one we cannot easily determine from two thread dumps listing the same call stack for a thread whether the thread in question is hung waiting on a response from a resource or whether the thread is performing the same operation again following receipt of another request into a processing queue. We might not even be able to detect the case were the thread is merely busy making calls from the stack listed in the dump to other fast executing methods not listed.
To help with the problem analysis we have recently enhanced our optional stack probes provider to collect additional information such as the number of times a uniquely identifiable stack appears across two mark requests from within our management console and the number of calls (or frames) made during the execution interval of a caller frame.
Here is a Java class I quickly wrote to demonstrate the problem and the solution. I have created two Runnable classes that mimic the behavior of hung and busy threads.

Here is a sample of a partial stack dump during the execution of the above class. As you can see it is pretty hard to actually determine the type of problem that may be present ignoring the obvious clue in the class naming.

Below is a resource metering snapshot taken during a similar execution run with our dynamic probes instrumentation enabled. What we can quickly see from this view is that the run() call executed by Thread-0 has only made one additional direct (or indirect) call during its ongoing execution interval. Whereas the run() call executed by Thread-1 has been busy performing more than 1 million calls (stack push operations).

Now we cannot be sure that Thread-0 is effectively hung because it might have only just began executing prior to the snapshot capture. Likewise we cannot be sure of our initial prognosis for Thread-1 as it might just be finishing up or it has been active for a prolonged period in the runtime waking up occasionally to service queued requests.
So lets make a mark (reset) from within our management console that instructs the remote managed runtime to mark all current resource metering related data including associated stacks. Then we wait a little while (a few seconds) and then capture another snapshot shown below. From the Marks table column we can see that some parts of the stacks present in this new snapshot capture were present prior to our recent mark remote operation – marks > 0. The only new stack frame is the Busy.doit() call which has a thread execution sequence id of 3031907.
In addition we can see from the Count Delta column that only the run() call executed by Thread-1 has performed additional (instrumented) calls during its still unfinished execution.

To be sure that the stacks visible from within the management console have indeed an execution period of longer than a few seconds we can make another mark. Now any element of a stack listed in a snapshot following the last mark operation with a Marks value greater than 1 has at least an execution duration of the time difference between both mark operations.
For those elements with a Marks value of 1 we can only say with confidence that the stack (or partial stack) has existed since the period of the last mark to the point at which a snapshot is captured which can be done repeatedly without marking.

Now if that is not exciting enough we can easily perform the same steps across a complete cluster of nodes from a single view within our management console and this works for all probe related extensions whether the probes represent actual Java methods or contextually mapped Ruby and Python calls.