At the recent Devoxx conference (which I did not manage to attend) it was claimed that sample-based profiling has much less overhead than execution profiling. The claim was made by a person who works on a not-yet-released product that supports only sample-based profiling, during a session that opened with the statement “Any performance tuning advice provided in this presentation….. will be wrong!”

This same person previously claimed GC actually made applications faster. If this holds true for an application, then more than likely the application has serious resource bottlenecks elsewhere in the request processing pipeline, which the additional GC stop-the-world events alleviate by inadvertently throttling traffic and reducing resource contention (concurrency).

But before testing the validity of such a claim (ignoring the actual benefit of the data collection), let's consider the typical production workload context for enterprise Java applications.

  • Large number (>50) of request processing threads
  • Very deep call stacks (>200) with a high percentage of call frames non-application related (especially so when using frameworks such as Spring)
  • High degree of database activity with high latency costs (>10 ms)

This means there is a high probability that, when the sampling profiler executes a measurement cycle (every 1-5 ms?), a large number of threads will have very deep call stacks that are by and large of little value to application performance analysis: mostly non-application code with no application context.

Obtaining the call stack for a thread is incredibly expensive (we know this from the cost of throwing exceptions), and it is typically done after the sampling profiler has temporarily suspended all threads (more cores, more waste). This expense does not even include the cost of comparing each thread's call stack with the one previously collected, recording timing and updating statistics, a cost that grows with each thread and each frame.
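To make the per-cycle bookkeeping concrete, here is a minimal sketch of what a naive in-process sampler has to do on every measurement cycle. It is an illustration only, not the implementation of any real profiler: the class `NaiveSampler` and its methods are hypothetical names, and it uses the standard `Thread.getAllStackTraces()` call rather than a native agent.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical naive sampler, for illustration only; not any vendor's implementation.
final class NaiveSampler {
    // Last stack seen per thread id, kept so each cycle can be compared with the previous one.
    private final Map<Long, StackTraceElement[]> previous = new HashMap<>();
    private long framesCaptured;
    private long framesChanged;

    // One measurement cycle: capture the stack of every live thread, then process each.
    void sampleOnce() {
        Map<Thread, StackTraceElement[]> stacks = Thread.getAllStackTraces();
        for (Map.Entry<Thread, StackTraceElement[]> entry : stacks.entrySet()) {
            record(entry.getKey().getId(), entry.getValue());
        }
    }

    // Per-thread work: compare with the previous stack and update the counters.
    void record(long threadId, StackTraceElement[] stack) {
        StackTraceElement[] old = previous.put(threadId, stack);
        framesCaptured += stack.length;
        int shared = (old == null) ? 0 : commonDepth(old, stack);
        framesChanged += stack.length - shared;
    }

    // Number of frames the two stacks share, counted from the outermost (root) frame.
    static int commonDepth(StackTraceElement[] a, StackTraceElement[] b) {
        int n = 0;
        while (n < a.length && n < b.length
                && a[a.length - 1 - n].equals(b[b.length - 1 - n])) {
            n++;
        }
        return n;
    }

    long framesCaptured() { return framesCaptured; }

    long framesChanged() { return framesChanged; }

    public static void main(String[] args) {
        NaiveSampler sampler = new NaiveSampler();
        sampler.sampleOnce();
        sampler.sampleOnce();
        System.out.println("frames captured: " + sampler.framesCaptured()
                + ", frames changed: " + sampler.framesChanged());
    }
}
```

Every frame of every thread is touched at least once per cycle, and again during the comparison, so the work scales with threads multiplied by stack depth regardless of how many of those frames belong to application code.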

I think the problem here is that the session speaker incorrectly assumes that other tools managing applications in production using some form of execution analysis (1) instrument every single method on the call stack, (2) measure every method invocation occurrence, and (3) have a relatively high overhead in measuring those invocations. This is certainly not the case today.

Unless we are talking about a “HelloWorld” application with only one main thread of execution being profiled, a dynamic, strategy-based execution profiling (metering) solution can indeed outperform simplistic sample-based profilers whilst collecting much more relevant data, discarding noise, at a much higher degree of accuracy. This will be demonstrated in part 2.

Aside: One of the reasons that sample-based profiling is offered by vendors is that it simplifies the development work for the product team. There is no requirement to deliver technology-specific extensions, configuration options or an open API to allow custom extension. There is an overhead reduction, but it is at the vendor's development site!

14 Responses to “Profiling: Sampling versus Execution (Part 1)”


  1. [...] entry I have constructed a simple benchmark class based on the following observations raised in part 1 related to enterprise Java applications in the wild (which I assume was the context when the health [...]

  2. yuzzamatuzz Says:

    “I think the problem here is that the session speaker incorrectly assumes that other tools managing applications in production using some form of execution analysis (1) instrument every single method on the call stack, (2) measure every method invocation occurrence, and (3) have a relatively high overhead in the measuring invocations. This is certainly not the case today.”

    If #1 and #2 aren’t true, doesn’t that just mean that modern tools are also doing a form of sampling, just using different triggers for the samples (admittedly more intelligent ones than just “N ms has elapsed”)? So far all I see is lower overhead on the probes and smarter choice of triggers (which are good things, no doubt). But I think you’re clouding the issue by making it an anti-sampling diatribe. Even #3 is just a simple conclusion from #1 and #2 rather than an additional complaint.

    • williamlouth Says:

      Yes, sampling in the statistical sense (though combined with one or more strategies), but not in the “instrumentation” & “collection technique” sense, which is what the particular slide was comparing. I do not think I am muddying the waters at all; most software performance engineers would assume an “instrumentation” context.

      Anyway I am more focused on the actual sampling data collection technique (please see part 2) which is call stack based.

    • williamlouth Says:

      Clearly call stack sampling analysis does not scale both in terms of the runtime and the offline analysis. Even performance engineers at Sun admit some of its problems though they do seem to think the sadistic (statistical) nature of the collection & analysis work is “fun”.

      http://weblogs.java.net/blog/sdo/archive/2009/10/16/fun-jstack

      • yuzzamatuzz Says:

        I don’t see why call stack sampling *needs* to be inefficient. The problem is that the JVM has little knowledge about how far up the stack things become “interesting” to a profiling user scenario. Current stack walkers have likely received the performance improvement attention appropriate for the frequency of handling exceptions (after all, they’re called “exceptions” for a reason). Using that same code for a call stack profiler changes the equation (it’s not used only in exceptional cases anymore) so it deserves more effort to improve its performance. So that’s just engineering effort needed.

        Forgive me for not being familiar with the profiling tool: does it solve the “what’s interesting to look at” question and is that a part of why it’s effective, in your opinion?

      • williamlouth Says:

        By the way on further reflection I would not classify what our solution does as “sampling”. It is strategy based. Admittedly of the 16 or so base metering strategies we support 4 could be classified as sample based (time, frequency) but most are based on behavioral aspects: concurrent, entry point, warm-up, initial, exclude, include, hotspot, dynamic, checkpoint, delay, highcpu, busythread, busy, …..

  3. yuzzamatuzz Says:

    I think opening a session with the comment you quoted isn’t necessarily a bad thing; it just reflects the complex nature of giving advice on performance tuning, especially when you don’t have time to go into minute detail explaining the contexts where the advice is appropriate and where it’s not. Seems like an appropriate standard disclaimer to me.

  4. williamlouth Says:

    Leaving out the context of any advice is dangerous and such disclaimers make what follows completely irrelevant to the subject (other than to highlight the complexity as you rightly point out) – just entertainment.

  5. williamlouth Says:

    “Creating a stack trace extremely quickly can be done”

    600 ns per thread is not extremely quick in my book, or anyone's book for that matter, with regard to enterprise applications with demanding performance requirements.

  6. williamlouth Says:

    I think you miss the reason why sampling was introduced & used, ignoring the obvious problems with it. Sampling was used because (primitive) execution-based profiling/monitoring solutions incurred excessive overhead. The trade-off in sampling is loss of accuracy and reduced statistical data collection (averages, max, min, stddev, var,…). That is assuming we are only talking about desktop applications with a single thread of execution. Today, with a large number of concurrent threads executing within each JVM process and most making database/messaging calls at very deep call stack depths, the cost-benefit analysis of sampling is pretty dismal. Getting the call stack for a single thread with a depth of 100+ will cost at least 100 microseconds of CPU time before it is even processed by a tool. Now multiply that by 100 threads. Then multiply that by 2x-4x for call stack depths of between 300-400 and you can see the problem. Then factor in that most native samplers first suspend the execution of all threads before collecting the actual call stack (frames). It seems a no-brainer, to me at least, when one takes into account that less than 10% of the methods fired actually need to be metered and their frequency percentage is even less than that. A smart execution profiling/metering solution will outperform a sampling solution both in terms of cost (overhead) and benefit (data collection).
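The multiplication in the comment above can be written out as a small sketch. The inputs are the figures quoted there (100 microseconds per stack, 100 threads, a 2x-4x factor for deeper stacks); the class and method names are hypothetical, and the result covers stack capture only, before any suspension cost or tool-side processing.

```java
// Back-of-the-envelope estimate using only the figures quoted in the comment above.
final class SamplingCostEstimate {
    // CPU time for one sampling cycle, in milliseconds (capture only).
    static double cycleMillis(int threads, double microsPerStack, double depthFactor) {
        return threads * microsPerStack * depthFactor / 1000.0;
    }

    public static void main(String[] args) {
        // 100 threads at depth 100+, 100 microseconds each.
        System.out.printf("depth 100+: %.0f ms%n", cycleMillis(100, 100.0, 1.0));
        // Depths of 300-400 cost 2x to 4x as much.
        System.out.printf("depth 300-400: %.0f to %.0f ms%n",
                cycleMillis(100, 100.0, 2.0), cycleMillis(100, 100.0, 4.0));
    }
}
```

With these inputs one cycle comes to about 10 ms of CPU time at a depth of 100+, and 20 to 40 ms at depths of 300-400, which is far longer than a 1-5 ms sampling interval.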

  7. williamlouth Says:

    I am actually going to correct myself here. Our native agent generates thread call stacks, java.lang.reflect.Method[], very efficiently and that takes 100-200 microseconds.

    Doing it the standard (naive) way via a call to Thread.currentThread().getStackTrace() takes on average between 600-750 microseconds for a depth of 200, with occasional outliers in the order of 1-2 milliseconds due to the excessive object allocation nature of this particular call. So a stack depth of over 300 (unfortunately a standard these days) will take over 1 ms, which is equivalent to the cost of a distributed call. This is why the latest crop of straw man Java sampling profilers have a minimum interval of 100 ms. Now factor in the drop of throughput during this period due to thread suspension and the conclusion is pretty obvious even for “peter the programmer” who should probably be called “peter the plumber”.
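For anyone who wants to check figures like these on their own hardware, the following is a rough sketch of a timing harness for the naive call at a chosen depth. It is a simple loop, not a rigorous benchmark (no JMH, minimal warm-up), the class `StackTraceCost` is a hypothetical name, and the numbers it prints will vary with JVM version and machine.

```java
// Rough timing sketch for Thread.currentThread().getStackTrace() at a given depth.
final class StackTraceCost {
    // Recurse to add `depth` extra frames, then capture the stack at the bottom.
    static StackTraceElement[] captureAtDepth(int depth) {
        if (depth <= 0) {
            return Thread.currentThread().getStackTrace();
        }
        return captureAtDepth(depth - 1);
    }

    // Average wall-clock cost per capture, in microseconds.
    static double averageMicros(int depth, int iterations) {
        long frames = 0;
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            frames += captureAtDepth(depth).length; // use the result so it is not discarded
        }
        long elapsed = System.nanoTime() - start;
        if (frames < (long) depth * iterations) {
            throw new IllegalStateException("captured fewer frames than expected");
        }
        return elapsed / 1000.0 / iterations;
    }

    public static void main(String[] args) {
        averageMicros(200, 2_000); // warm-up pass, result ignored
        for (int depth : new int[] {100, 200, 300, 400}) {
            System.out.printf("depth %d: %.1f us per capture%n",
                    depth, averageMicros(depth, 5_000));
        }
    }
}
```

This measures only a thread capturing its own stack; it does not include the cost of suspending other threads, which a sampler collecting stacks for the whole process would add on top.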

  8. yuzzamatuzz Says:

    My point was just that the design point for stack walking in current JVMs was driven primarily by creating stack traces for exceptions, which don’t happen very often (and so the performance of walking stacks wasn’t given too much attention — so you walk the stack when the exception occurs). Creating a stack trace extremely quickly can be done, it just creates a higher overhead on application execution time when you don’t need the result of the walk (no different than your instrumentation probes, except you seem to know which methods are “important” for tracing whereas the JVM wouldn’t usually have that kind of information, I don’t think). But if you’re constantly connected to a profiler or other performance tool, then perhaps it makes sense to invest in that kind of approach so you can get the trace without the current degree of overhead.

    Similarly, there’s no reason thread stacks cannot be walked asynchronously…you don’t need to stop all the threads all at the same time.

    So I’m still curious how you identify what you suggest are the 10% of the methods fired that need to be actually metered? Does the tool figure it out automatically, somehow, or does it rely on input from the user of the tool? That part seems to me to be the key advantage of your approach, rather than the relative perceived inherent costs of doing things in currently engineered JVMs.

  9. williamlouth Says:

    Please read the rest of the blog entries on this site that demonstrate how we do this, which by the way is entirely in your hands in terms of its metering definition.

  10. williamlouth Says:

    Yes, sampling, unlike instrumentation, has no overhead when not enabled, but how many applications in the enterprise are not actually managed & monitored? That said, we have managed to drop the overhead so low that it is generally not noticeable at all; instrumentation can be created in such a way that when disabled the impact is negligible (e.g. DTrace).

    Sample away if you are a Java desktop developer but please do not use this in production and continuously.

