Profiling: Sampling versus Execution (Part 2)
January 16, 2009
This is part 2 of a series of blog entries examining the claim that call stack sample based profiling offers much less overhead than instrumented/execution based profiling. In this entry I have constructed a simple benchmark class based on the following observations raised in part 1 related to enterprise Java applications in the wild (which I assume was the context when the health center monitoring tool was marketed to the audience at the conference).
- Large number (>50) of request processing threads
- Very deep call stacks (>200) with a high percentage of call frames non-application related (especially so when using frameworks such as Spring)
- High degree of database activity with high latency costs (>10 ms)
The benchmark class (partially listed) below does not necessarily look to declare a winner between sampling and execution profiling in terms of overhead as the answer depends on the product used and the application itself. It is an attempt at understanding the difference in the runtime cost and behavior of both techniques – across different products and execution contexts. To do this I have collected a sufficient amount of test data to help me in answering the following questions :-
- What is the typical CPU cost of each data collection technique?
- What is the rate of change in CPU cost as the average call stack depth is increased?
- Is the CPU cost of a data collection technique an accurate measurement of its runtime overhead? (Does sampling throttle the throughput of other threads?)
- Does a data collection technique have any inherent scalability issues in an enterprise setting when factoring in the previously mentioned observations and the trend of increasing multi-core computing?

As you can see from the code above I have created some real challenges for an instrumentation/ execution based profiling solution in that we have a disproportionate number of methods that are relatively cheap (mimicking lightweight request processing steps), easily in-lined (to some degree) and executed at a much higher frequency than the final sleep call (mimicking a database call).
Note: Another reason for the test execution time being largely dominated by the sleep time is that I can quickly calculate the CPU overhead of a product and collection technique using the process’s CPU time.
I conducted more than 50 test runs, each lasting between 100-300 seconds, comparing JXInsight instrumentation/execution based resource metering solution with the call stack sample based profiling option offered by the 3 most popular Java profilers on the market. The first chart below shows the wall clock time for each benchmark test run with varying call stack depths.
Note: There was no filtering of the class and methods to be instrumented in the execution profiling test runs which is not typical of a production environment (see observations above).

From the chart above we can see that when the call stack depth is kept relatively low (<100) the overhead of sampling could be categorized as being light weight (2-10%). But once we hit the 200 call stack depth the overhead of sampling grows quickly and becomes excessively heavy weight (20%-200%). This is not the case for our execution profiling solution, JXInsight, which keeps the overhead extremely low (1-5%) via a dynamic metering strategy whilst firing a relatively large number of probes (41-161 million) per test run.
Note: For every sample cycle performed by the other products JXInsight can fire (begin-end) between 5,000-20,000 execution based probes at an extremely low cost of between 0.5 to 1.0 microseconds depending on whether a firing is metered (underlying counter read) or not.
The next chart shows the CPU consumption (following the execution of a warm-up cycle) for each benchmark test run with varying call stack depths.

As expected the CPU cost increases inline with the increase in call stack depth across all sample based profilers. But what is interesting about this chart is that it shows the mix-mode based profiler (A) having a lower CPU cost than JXInsight whilst having a significantly higher wall clock time (the slowest reported in the previous chart). It would appear that this product has traded throughput/scalability for increased levels of efficiency (smaller code paths?) using possibly coarse grain synchronization (serialized processing) during measurement and/or the queuing events & updates for processing by a (single?) non-application thread.
Similarly sample based profiler (B) has near equal CPU costs but much higher clock times than JXInsight. This is more than likely caused by a reduction in the call throughput due to the suspension of all application threads (> 40) during each call stack sample cycle which from the CPU times reported we know to be anything but short or cheap (0.5-1.0 seconds).
If an execution profiling solution can keep the measurement overhead sufficiently low it can scale as the application workload scales across cores. This is not the case with call stack sampling which is rarely implemented in a multi-threaded manner as it is inherently serialized at costly execution points to maintain some degree of accuracy in the data reported.
Is call (stack) sample profiling more light weight than execution profiling? No if you are using JXInsight. If not then the answer really depends on a number of factors including the average sampled call stack depth, the average number of active threads, the request processing characteristics (CPU / IO), the number of computing cores, the sampling profiler, and sampling configuration (interval).
In the next entry I will look at the characteristics of the data collected by each technique and its applicability in the context of application performance management.
Update 19th Jan 2009: Whilst looking at the probe snapshots generated during our test runs I noticed that I had mistakenly used an instrumented version of our probes runtime (internally we use a number of counters to instrument parts of our runtime with itself) overstating both our clock times and cpu times. Here is a chart showing the reduction in CPU usage when using the non-instrumented production version of our libraries – at least 10% reduction across each test run.

Update 19th Jan 2009: The JXInsight test runs were executed with the following system properties.
jxinsight.server.profiler.autostart=false
jxinsight.server.probes.strategy.hotspot.statistic=inherent.avg
(Historical) Aside: When we released JDBInsight 2.0 (a very long time ago) it offered an extremely efficient call stack collection facility implemented natively within our JVMPI agent with maximum frame size limit of 128. We subsequently changed this to 256 when object-relation mapping tools became more widespread adopted along with increased indirection in the middleware request processing pipeline. When developers started adding IOC frameworks into the application stack the size of the stack was increased beyond 320. Will this grow even further? I think it has already happened with some middleware products using renamed packages of the same IOC frameworks internally in the pipeline. A stack size depth of 400 or more sounds absurd and presents a strong argument for rethinking exception handling and the usage of logging by developers today.
Update 16th October 2009
Our overhead has now dropped to 1 nanosecond for a non-hotspot method.
http://williamlouth.wordpress.com/2009/07/27/one-billion-operations-per-second/