This is part 2 of a series of blog entries examining the claim that call stack sample based profiling offers much less overhead than instrumented/execution based profiling. In this entry I have constructed a simple benchmark class based on the following observations raised in part 1 related to enterprise Java applications in the wild (which I assume was the context when the health center monitoring tool was marketed to the audience at the conference).

  • Large number (>50) of request processing threads
  • Very deep call stacks (>200) with a high percentage of call frames non-application related (especially so when using frameworks such as Spring)
  • High degree of database activity with high latency costs (>10 ms)

The benchmark class (partially listed) below does not necessarily look to declare a winner between sampling and execution profiling in terms of overhead as the answer depends on the product used and the application itself. It is an attempt at understanding the difference in the runtime cost and behavior of both techniques – across different products and execution contexts. To do this I have collected a sufficient amount of test data to help me in answering the following questions :-

  • What is the typical CPU cost of each data collection technique?
  • What is the rate of change in CPU cost as the average call stack depth is increased?
  • Is the CPU cost of a data collection technique an accurate measurement of its runtime overhead? (Does sampling throttle the throughput of other threads?)
  • Does a data collection technique have any inherent scalability issues in an enterprise setting when factoring in the previously mentioned observations and the trend of increasing multi-core computing?

samplingcode2

As you can see from the code above I have created some real challenges for an instrumentation/ execution based profiling solution in that we have a disproportionate number of methods that are relatively cheap (mimicking lightweight request processing steps), easily in-lined (to some degree) and executed at a much higher frequency than the final sleep call (mimicking a database call).

Note: Another reason for the test execution time being largely dominated by the sleep time is that I can quickly calculate the CPU overhead of a product and collection technique using the process’s CPU time.

I conducted more than 50 test runs, each lasting between 100-300 seconds, comparing JXInsight instrumentation/execution based resource metering solution with the call stack sample based profiling option offered by the 3 most popular Java profilers on the market. The first chart below shows the wall clock time for each benchmark test run with varying call stack depths.

Note: There was no filtering of the class and methods to be instrumented in the execution profiling test runs which is not typical of a production environment (see observations above).

samplingchart12

From the chart above we can see that when the call stack depth is kept relatively low (<100) the overhead of sampling could be categorized as being light weight (2-10%). But once we hit the 200 call stack depth the overhead of sampling grows quickly and becomes excessively heavy weight (20%-200%). This is not the case for our execution profiling solution, JXInsight, which keeps the overhead extremely low (1-5%) via a dynamic metering strategy whilst firing a relatively large number of probes (41-161 million) per test run.

Note: For every sample cycle performed by the other products JXInsight can fire (begin-end) between 5,000-20,000 execution based probes at an extremely low cost of between 0.5 to 1.0 microseconds depending on whether a firing is metered (underlying counter read) or not.

The next chart shows the CPU consumption (following the execution of a warm-up cycle) for each benchmark test run with varying call stack depths.

samplingchart23

As expected the CPU cost increases inline with the increase in call stack depth across all sample based profilers. But what is interesting about this chart is that it shows the mix-mode based profiler (A) having a lower CPU cost than JXInsight whilst having a significantly higher wall clock time (the slowest reported in the previous chart). It would appear that this product has traded throughput/scalability for increased levels of efficiency (smaller code paths?) using possibly coarse grain synchronization (serialized processing) during measurement and/or the queuing events & updates for processing by a (single?) non-application thread.

Similarly sample based profiler (B) has near equal CPU costs but much higher clock times than JXInsight. This is more than likely caused by a reduction in the call throughput due to the suspension of all application threads (> 40) during each call stack sample cycle which from the CPU times reported we know to be anything but short or cheap (0.5-1.0 seconds).

If an execution profiling solution can keep the measurement overhead sufficiently low it can scale as the application workload scales across cores. This is not the case with call stack sampling which is rarely implemented in a multi-threaded manner as it is inherently serialized at costly execution points to maintain some degree of accuracy in the data reported.

Is call (stack) sample profiling more light weight than execution profiling? No if you are using JXInsight. If not then the answer really depends on a number of factors including the average sampled call stack depth, the average number of active threads, the request processing characteristics (CPU / IO), the number of computing cores, the sampling profiler, and sampling configuration (interval).

In the next entry I will look at the characteristics of the data collected by each technique and its applicability in the context of application performance management.

Update 19th Jan 2009: Whilst looking at the probe snapshots generated during our test runs I noticed that I had mistakenly used an instrumented version of our probes runtime (internally we use a number of counters to instrument parts of our runtime with itself) overstating both our clock times and cpu times. Here is a chart showing the reduction in CPU usage when using the non-instrumented production version of our libraries – at least 10% reduction across each test run.

samplingchart3

Update 19th Jan 2009: The JXInsight test runs were executed with the following system properties.
jxinsight.server.profiler.autostart=false
jxinsight.server.probes.strategy.hotspot.statistic=inherent.avg

(Historical) Aside: When we released JDBInsight 2.0 (a very long time ago) it offered an extremely efficient call stack collection facility implemented natively within our JVMPI agent with maximum frame size limit of 128. We subsequently changed this to 256 when object-relation mapping tools became more widespread adopted along with increased indirection in the middleware request processing pipeline. When developers started adding IOC frameworks into the application stack the size of the stack was increased beyond 320. Will this grow even further? I think it has already happened with some middleware products using renamed packages of the same IOC frameworks internally in the pipeline. A stack size depth of 400 or more sounds absurd and presents a strong argument for rethinking exception handling and the usage of logging by developers today.

Update 16th October 2009
Our overhead has now dropped to 1 nanosecond for a non-hotspot method.
http://williamlouth.wordpress.com/2009/07/27/one-billion-operations-per-second/

13 Responses to “Profiling: Sampling versus Execution (Part 2)”


  1. [...] Profiling: Sampling versus Execution (Part 2) [...]


  2. First, kudos on being one serious coder and building/selling a product. I mean it.

    So am I, though I’ve been around a lot longer. I’ve seen hot ideas come & go. For the general world-view on performance tuning, I’m trying to get folks to step back and take a second look at common wisdom (which never hurts). For example, for Java tuning, consider this link:

    http://stackoverflow.com/questions/266373/one-could-use-a-profiler-but-why-not-just-halt-the-program/317160#317160

    The broader implication is that basic assumptions about profiling could stand to be reconsidered.

  3. yuzzamatuzz Says:

    How much overhead does JXinsight introduce when the machine is fully loaded? Although it’s interesting to see how much CPU time is added when the application isn’t really doing anything, it’s something of an artificial scenario that focuses on only one part of the problem (the overhead of collecting a data point). But your original point was how well it works on an enterprise Java application, which is generally trying to push as many transactions through the machine as it can. It spends it doing lots of “stuff” and I’m left wondering how and you filter out the probes for all that “stuff”? Sampling-based approaches filter out “stuff” by only looking periodically. How do you filter it out?

    Another thing I’m confused about is whether this tool is something Azul-specific or if it runs as efficiently on off-the-shelf systems?

    • williamlouth Says:

      Any type of performance management is focused on key performance indicators. JXInsight allows these to be modeled as either probes (activity) or as meters (generally a thread specific counter). The selection of which one to use really depends on the software & system execution contexts and whether the underlying software execution behavior (the code) can be effectively changed directly or indirectly (callers). Probes (metered activities) generally have a higher footprint & overhead than meters though this depends on the frequency of the metering and its underlying implementation (is it already tracked by another system).

      The primary overhead for a probe is not in the firing but in the metering which consists of two parts – meter access time and aggregated metering update. Probes in JXInsight are associated with dynamic metering strategies that determine whether an actually firing results in a metering. We have a large number of strategies available via a simple system property and customers can combine them together to make even more exotic metering strategies to suit the execution nature of the application being monitored. Hotspot is our default one and it basically does what the JVM hotspot engine does but naturally from a metering perspective.

    • williamlouth Says:

      Nothing specific to Azul we run on nearly all Java EE supported platforms: Linux, Solaris, HP-UX, AIX, Windows, Mac OS X along with all flavors of chip architectures.

      We do have meters (counters) specific to the Azul hardware which points to our extensibility of our metering runtime.

    • williamlouth Says:

      By the way most of our posts are focused on heavily loaded systems. We are called into accounts after most other IT mgmt vendors have failed miserably in collecting data needed to resolve complex system execution issues. We leave the low hang fruit issues to other developer centric tools parading as production oriented solutions.

  4. yuzzamatuzz Says:

    And not to be negative, but what does “our overhead has now dropped to 1 nanosecond for a non hotspot method” even mean? 3 months ago, you removed any information about this statement from the article you linked to (I’ll assume it was there originally)…

    Some answers that might help understand this statement: Is the overhead higher for a hotspot method? On what kind of machine, running with kind of application, probing what kind of information, with what level of accuracy, over what period of time, etc.

    It’s an aggressively framed posting, I admit, but then again: the statement is virtually information free linking to an also information free link as supporting evidence for the statement! I am not necessarily disputing your statement, I’m saying it’s impossible to interpret it without more information.

    • williamlouth Says:

      Naturally the overhead for a metered firing of probe is higher than 1 nanosecond. It generally really does not matter what type of application once a method becomes marked as a non-hotspot (which again is defined by the customer based on metering statistics) we effectively short-cut any further execution of the instrumentation within the method itself. We do this is a clever, efficient and safe manner and customers can combine this with other metering strategies some of which can be sample based.

      The cost of a metered probe ranges between 250-350 nanoseconds on pretty standard Intel hardware. The largest element of this is the cost in accessing the underlying meter counter (or counters). The rest is related to the update of metering associated with one or more metering cost groups represented by a composite probe name.

    • williamlouth Says:

      I am not sure whether the last paragraph remains valid after my responses on this matter. Surely you do understand that we cannot just put out all the details of the implementation.

      What exactly is your intent?

      I have a reputation in the industry for not putting out BS and delivering the goods when all others have failed. I value this. If I say we are the fastest we are the fastest. When this is not the case I will correct this – by making us the fastest once again (I love a good challenge). If that is not good enough for you then so be it. Its not like I am trying to sell to you and even so I would expect you to benchmark us against all other vendors you are considering.

      • yuzzamatuzz Says:

        I wasn’t (trying to) ask you for all the details, I was (trying to) understand at a relatively high level what I felt was the key difference between your approach and the approaches you’re trying to contrast with. In my experience, it’s far from typical that you get both lower overhead *and* better information, so I wanted to understand your achievement in more detail to make sure it wasn’t just marketing hype (I’m convinced it isn’t that). I think I understand a little better after your recent postings. I won’t ask for more.

        I’m sorry if you thought I was trying to get you to divulge your secrets. That was absolutely *not* my intent, though in retrospect I suppose I did push you pretty hard for more information.

        You naturally have a lot more background information about your technology, its history, and its successes, than appears in these articles whereas I can only interpret what you’ve written here and ask (hopefully not too impertinent) questions when I don’t see the evidence for some of the statements you make.

        I don’t know you, so I don’t know your reputation. I hope no slight was taken because none was intended. I had questions when I read your articles, so I asked them. I did freely admit that the last one in particular was on the aggressive side :) .

        Best of success to you and your company.

    • williamlouth Says:

      By the way I would appreciate you sending me an email with your contact details (IBM?).

    • williamlouth Says:

      Honestly I think the key selling point of our solution is that our Open API and (concept) model are beautiful designed & aligned when measured in terms of extensibility, versatility, application and fit for purpose.

      Its terribly unfortunate we cannot make publicly available the documentation highlighting this because of one or two companies who systematically clone our innovation and product feature set. But this has its benefits in that the customers we have partnered with have pretty smart people who see beyond the passion we have for our work.

  5. williamlouth Says:

    The aggressiveness of the last posting is probably more of a reflection of my frustration with current thinking in the industry which seems to have not at all progressed in all the years I have been doing things differently (at least that is my intention in hope for finding a better approach).

    The industry seems so obsessed with delivering many disjointed management consoles for each element of an application/runtime stack. No thought is given to an unified set of models and interfaces for data collection that is efficient, extensible and versatile.

    Looking beneath these consoles one quickly realizes that it is all lip stick on a pig (and people seem to buy porkies). Whereas we first design the model, then the Open API, then build one or more instrumentation integrations, and finally create the visualizations within the console.


Leave a Reply