One of the hardest challenges for any application performance management tool is in reducing the runtime overhead to levels acceptable for production environments whilst collecting as much measurement data as possible for the purposes of service level management and problem management. For the typical code profiling solution this is an impossible task without significant filtering which incidentally most of the time results in the actual bottleneck being hidden as filtering is usually based on namespaces rather the actual performance profile of the methods themselves.

Even with significant filtering most profiling tools cannot perform at levels required for applications that have incredibly low request processing times whilst executing a significant number of method calls. If a request executes in 250 microseconds (which pretty much rules out call stack sampling) and during its execution it makes 2,500 method calls it is pretty easy to see that achieving an acceptable level of overhead, 2% (5 microseconds), is an insurmountable task especially when a system call to the clock counter costs between 0.2 – 0.3 microseconds and you need to do this twice for each of the 2,500 method calls.

With a dynamic measurement (metering) solution such as the product I work on you can reduce the overhead of an instrumented method call, that is not deemed a hotspot (dynamically at runtime), to just under 0.03 microseconds but multiplying that by 2,500 we still come out above the acceptable levels for this extreme type of request processing.

Can such applications then be profiled? Yes. There are two approaches one which I will discuss in this blog entry and the other in a future entry.

Instrument->Measure->Refine->Repeat
Clearly a solution to the problem described above must involve reducing the cost of both the instrumentation and measurement. But how can this be done in such a way that we do not eliminate expensive calls from the performance model whilst ensuring we do not inadvertently include calls perturbed (directly and indirectly) by the remaining instrumentation. The solution is to iteratively refine the instrumentation based on measurements that identify execution hotspots. With each iteration the hotspot list is reassessed in light of the reduced level of instrumentation.

imrr

For this process to be much more effective than the guesswork performed by way of manual namespace filtering the underlying runtime and tooling must at minimum

  1. Perform the measurement with as low overhead as possible (which is why our product out-performs the competition by a huge margin)
  2. Allow the hotspot determination to be based on the dynamic analysis of resource usage across one or more meters (via composite cost meters)
  3. Automate the accurate generation of the refined instrumentation set (it is a one-click operation in our product).

To demonstrate this approach I am going to use a real-world software example – Oracle Coherence. Shipped with the Coherence product is a command line interface that allows one to interactively execute Map.put() and Map.get() operations on a distributed cache. The command line facility also allows you to perform such operations in bulk and to repeat them. Here is the command sequence I will execute across each test presented below.

> cache cache
> #100 bulkput

The above commands will create a distributed cache named “cache” and repeat 100 times a bulkput command which by default executes 1,000 Map.put(Integer,byte[]) commands. In total 100,000 Map.put calls are executed.

Below is a metering model collected following the above command sequence which does include a small number of call counts related to the start-up and excludes constructor calls. Over 2,300 method calls are made per Map.put call. Running the test without any instrumentation the average execution time is 0.265 microseconds per Map.put call. This is our baseline.

Note: This particular use case is a best (if unrealistic) case for Coherence as the command uses a byte[] (easy IO mapping work) as the put value, whereas most applications would have objects that need to be serialized which is very expensive relatively speaking that is.

table1

The above metering was collected without any of our production-oriented features enabled such as our dynamic hotspot metering strategy. By re-enabling such features whilst keeping the same level of instrumentation we get the metering model shown below with a significant reduction (1000x) in the number of firing probes that were actually metered (calls measured). Unfortunately we still take a performance hit in dynamically assessing whether a called method (firing probe) is a hotspot or not (metered or not). The average execution time for each Map.put is now 0.640 microseconds.

Note: The definition of a hotspot is highly configurable within our runtime. For the tests reported in this entry I defined it to be any method with an average inherent wall clock time of 50 or more microseconds after 1000 executions globally.

table2

Fortunately from within the management console we can automatically generate a new probe extension (jar) based on those probes labeled hotspots within the metering model exported from the managed runtime. One click operation!!! Instead of instrumenting over 4,500 methods we now have a small list of methods to be instrumented spread across a total of 5 classes.

xml1

Below is the metering model following the deployment of the new probe extension containing the refined instrumentation. We have now halved the metering count but more importantly the instrumentation overhead associated with methods not deemed hotspots has been removed. The average execution time has dropped back down to 0.285 microseconds.

table3

Generating a new revised probe extension based on the hotspots in this new metering model reveals a further reduction in the methods and classes deemed hotspots.

xml2

Here is the final metering model captured based on the refined hotspots – a ever so small reduction. The average execution time has now dropped to 0.275 microseconds which equates to 0.010 microseconds of overhead which is largely a result of the underlying system clock calls.

table4

The following bar chart compares the changes across each iteration with the baseline reported above and that of another Java profiler.

chart

Note: Hopefully this explains why today we ship with over 600 technology specific extension libraries that specifically target the main client call entry points (mainly interfaces) into each technology.

In a follow-up to this entry I will revisit the tests again but this time using another technique ideal for CPU analysis in situations were the resolution of the system counter is insufficient – execution count profiling.

One Response to “Instrument->Measure->Refine->Repeat”


  1. [...] 4, 2009 In a previous blog entry, Instrument->Measure->Refine->Repeat , I discussed how to approach the performance analysis of software with request processing [...]


Leave a Reply