Self Optimizing Java Hotspot Detection
November 30, 2009
I was recently contacted by Solomon Duskis to see whether I would be interested in assisting him with some performance tuning work he was involved with on the JBoss RESTEasy open source project. He was in particular concerned that his findings based on data reported by other profiling tools was inaccurate due to their excessive overhead especially when the entry point operations being tested had sub-millisecond execution times.
Looking at the initial tests results of a prolonged benchmark run, which exhibited relatively (but to be expected) high GC activity and extremely short high frequency method executions, I thought this would be an interesting test for some new features I had been working on in recent software updates to JXInsight specifically adjusted clock time metering and self-optimizing hotspot detection.
Note: Once you start profiling & tuning at the microsecond level you can effectively rule out using any thread cpu based timers as their resolution is very low and their cost of access relatively high.
The first thing I did to understand the software execution behavior of the benchmark was to execute a run with our complete coverage dynamic probes extension, weaving (at load-time) resource metering into all methods within the org.jboss.resteasy.* and org.codehaus.jackson.*. For this run I disabled the default clock.time meter and instead installed our probe.count meter so that I could count the number of metered method executions made directly and indirectly by each package, class and method.
The Count column below tells me the number of probes fired per package – in total approximately 604 million. The Total column tells me how many probes, including those in other packages, were fired and metered during the course of executing the probes listed in the Count column with the last column, Total (I), telling me how many of those probes are actually inherent (directly attributed) to the package.
Note: The
Total (I)column can be less than theCountcolumn because probes do not include their own firing within their reported resource metering which also explains why theTotal (I)column can be greater than theCountcolumn as in the case ororg.codehaus.jackson.*which calls back intoorg.jboss.resteasy.*. This might be a little bit confusing because we have effectively mapped a resource meter to the actual occurrence (count) of the resource metering but to me that is the beauty of our approach – we can meter anything including the act of metering itself.
The metering data above helped me in very quickly understanding the degree of impact a standard code profiler would have before I even ran such a test. Knowing that most other profilers have an overhead of between 1 and 5 microseconds per method execution I can estimate the total benchmark time from a baseline of 30 seconds. In the best case we are talking about 10 1/2 minutes (634 seconds) though from my own testing the best code profiler took more than 16 minutes (980 seconds). In the worst case more than a hour. With such numbers you are only ever likely to find low hanging fruit along with a large number of red herrings especially when one factors in the amount of GC activity generated.
The above does not really present much of a problem to JXInsight because its metering of any method’s executions is strategy based (In fact we support multiple chained metering strategies). With the default hotspot metering strategy installed the runtime will dynamically disable the instrumentation after a number (default is 1000) of method executions if at that time a specified metering statistic does not exceed a particular threshold. But when a hotspot metering threshold is defined as low as 5 microseconds as in the case of my own benchmarking even the initial number of warmup metered method executions before the hotspot determination is made can have some degree of impact on the actual test run timing and the resulting hotspot determination (especially callers which can be incorrectly labeled hotspots).
Previously the solution was to generate and deploy a new hotspot instrumentation extension after and before each repeat of the benchmark test (see Instrument->Measure->Refine->Repeat). In JXInsight 5.7.41 we offered up another option that can be enabled to defer the generation of extensions until much later in the test cycle by automatically persisting (and reloading) the hotspot metering analysis across consecutive benchmark test runs allowing the runtime to immediately upon initialization disable non-hotspots before their first firing whilst refining the previous hotspot candidate list. This is all done without any intervention by the performance engineer. The performance engineer just keeps re-running the benchmark until the test times and number of hotspots remains constant across runs. The hotspot metering runtime continually builds on-top of what it previously determined – self optimizing.
Here is the output of 13 benchmark test runs I executed in a matter of a few minutes.
Note: For the 9th test run I changed the hotspot metering configuration (whilst still reusing the previous hotspot analysis) to use the inherent min statistic to get rid pass thru elements within a call chain.
Note: For the 13th test run I used a hotspot instrumentation extension generated from within the console based on the metering model collected from the 12th test run removing completely all instrumentation from non-hotspot methods and disabling the metering strategy probes provider extension.
Note: The benchmark duration is measured in seconds and the metering threshold in microseconds.
Update: In our latest internal build, 5.7.42, which includes the ability to enable continuous hotspot performance analysis the benchmark time has now dropped down under 33 seconds.
Near to the end of the testing we were still metering 10 millions calls and incurring only a 5 second slow down compared with a competitors best time of 950 seconds. Here is the breakdown of the resource metering by the main top packages for 12th test run prior to the generation of the hotspot instrumentation extension.
Note: The c
lock.adj.timemeter was used instead of theclock.timemeter to discount gc activity in the hotspot metering analysis.

Looking at the inherent clock time averages for hotspot candidates (a partial list is show below) we can see that there was still room for further reduction (in overhead) if we increased the hotspot metering warmup count or instead this was continuously re-assessed during the course of an individual test execution (coming soon).

In a follow-up to this entry I am going to compare the hotspot results from a number of tools showing how surprisingly different they can be beyond the low hanging fruit cases and how widely inaccurate they are no matter what profiling technique is used – execution/event based or call stack sampling. Free Advice != Good Advice.
Hybrid Probe Profiling: Execution + Sampling
October 29, 2009
Whilst I have long argued that execution based profilers (metering agents & activity monitors) can out-perform sample based profilers on many levels (when implemented right) I have always recognized that statistical sampling techniques do occasionally have their place in ones arsenal in tackling the complexity of a real-world application/system performance model collected from a production environment under particular runtime constraints. That is why when we introduced our pioneering strategy based resource metering approach I made sure to include a number of out-of-the-box strategies that have sampling characteristics. This list initially included frequency, interval, sample, and burst. All of which have the following metering behavior – a non-metered gap and a metered run.

The two main differences in the metering strategies relate to the unit of measurement for both gap and run (time or frequency), and the scope of the strategy (global or probe name based).
Note: JXInsight’s Probes technology supports complex chaining of metering strategies allowing you to derive your very own metering strategy from base (built-in) strategies.
Each of the above mentioned metering strategies have their strengths and weaknesses depending on the application (or problem) context but none are truly comparable to how a standard call stack sampling profiler works so in a recent update to our product I approached this problem from a different angle within our runtime model by way of a new meter named sample.count which is basically a low-resolution clock (counter) with a user configurable time interval (sample) for increments (ticks). In doing so I addressed one of the key performance indicators (KPI) of a typical execution based profiler – the number of relatively costly (mostly native) counter access calls (which would otherwise occur in close time proximity in a heavily load system).
So how does this match much closer with the behavior of a call stack sampler? Well first of all the number of samples (meter increments) and the cost of the underlying native system calls is approximately the same (it is also a known quantity). Second of all only fired probes (metered activities) which have a metered execution period spanning at least one sample window increment (tick) will have their metering totals updated which means that the accuracy for high frequency and low cost methods is similar. Were it differs from sampling is in the increased accuracy of the performance model by including frequency (invocation) counts as well as collecting the min, avg and max for sample.count meter readings. In addition we can still utilize our default hotspot metering strategy by simply changing the strategy’s meter and the threshold value used.
Note: Thread call stack samplings in ability to track the complete duration (or ticks) of a method’s invocation restricts its application solely to development environments.
Naturally I was curious to see how a sample.count metering model compared with one based on the clock.time meter. So I metered the startup execution of a JBoss 4.x application server instance with a sample JBoss Seam application deployed. For this initial run I disabled our default hotspot metering strategy. The startup took less than 31 seconds with 776,468 method invocations metered within the org.jboss.* package which. This translates to 1.5 million clock time access calls – 50k / sec.
Note: “Count” is the number of method invocations metered and “Total” is the sum of the ticks during the execution period of the metered method invocation.

Here is the sample.count metering model for a similar startup sequence. The breakdown of the metering “Total %” column looks very similar to that of the clock.time model. The startup time took less than 30 seconds which translates (approximately) to 30,000 clock ticks with a default tick interval of 1ms – 1k / sec.

Of course one would expect the previous models to be very similar. A real comparison test would be with the hotspot metering strategy enabled which attempts to eliminate the metering of low cost instrumented method invocations (reducing the cumulative count). Here is the clock.time metering model for the startup. The number of metered probes fired has now halved though this reduction would be much higher in a long running system as we by default meter the first 1,000 firings per named probe.

With the hotspot metering strategy configured to use the sample.count meter the breakdown of both the “Count” and “Total” columns looks again similar to the clock.time metering model.

Though I still prefer to use our sample based metering strategies (more so because of familiarity) which guarantee accurate metering of those probes sampled the sample.count meter approach allows me to create a filtering like mechanism to the model at the granularity that suits the context of the investigation. With the meter I get the frequency for every probe fired which the strategy approach does not afford me but I only aggregate the metering total for those probes fired and metered that exceed a meter threshold I have defined.
Java Call Stack Sampling in the Wild
October 21, 2009
To be in anyway successful at managing & monitoring real-world Java enterprise applications in production (that excludes Spring “PetClinic”) one needs to assess the runtime impact of the various performance data collection techniques and their actual (tool) implementations. Unfortunately most programmers fail to understand this and instead develop/promote/deploy tools that are clearly completely inadequate for the task at hand – in a production context. They incorrectly assume ease of use equates to lightweight when in fact the only thing lightweight is the effort and due diligence performed in understanding the problem domain and managing the associated risks. It is for this reason why most operations teams immediately discount (without any evaluation) the use of monitoring tools a developer might have used in any way during localized tuning.
Recently this was made so evident when its was falsely claimed (without any qualification other than wishful thinking) that a Java sample based profiling tool could out-perform JXInsight – a dynamic strategy based resource metering solution. I will save the publication of a benchmark comparison for a (near) future blog but instead show you a chart (taken from within the “visual” tool itself) depicting the performance impact of such a technique on an application with a realistic number of threads (100) running with call stacks of realistic sizes (100-200).
Note: The similarities of both the 100 ms and 1,000 ms run sections indicates the tool was not able to perform at the lower of the sampling rates.

Even at a 10 second sample interval rate the runtime impact is excessive ignoring the fact that at this resolution the data would be completely meaningless unless the application was a dog in terms of performance which begs the question “How did such a beast get into production in the first place?”.
Yes the overhead of sampling is zero when not being performed but what type of enterprise application is not monitored continuously in production? Instrumentation needs to be a primary source of resource metering & software execution metric performance data. Sampling should be considered (and only after discounting many other sources and techniques) when the current instrumentation coverage is not adequate at the time a problem appears though I am doubtful of its effectiveness with the noise that would be created when enabled based on the charting above.
Metering (Profiling) Tagged Execution Periods
September 21, 2009
One of the many cool and innovative features we have recently pushed out the door with our regular software updates is an enhancement to tagging that really makes this unique feature extremely useful and versatile in the analysis of software under construction during development & test or under management in production. Now when you set a tag at either the global or thread context level not only do we aggregate resource metering for the tagged metering group but we do this for every other metering group updated during the tagged execution period. We do this by creating an extended @tag metering group that concatenates the name of the metered group updated with the value of tag.

Being able to inject tags remotely from within our management console or command line interface or locally from within the actual managed runtime via our Open API gives us the ability answer two very broad questions:
1. What is the resource usage for a package/class/method during a particular tagged execution period?
2. How does the resource usage for a particular probe compare across tagged and non-tagged execution periods in terms of count, total, min, max and average?
Previously the way to answer such questions at a global level across various profiling and application performance management (APM) tools would be to reset all data collected prior to the execution period, specify a period filter on an event journal table or log, or in our case mark the current totals and then perform delta analysis following the completion of the execution metered. But this leads to the creation of multiple disconnected performance models that cannot be easily compared and reconciled. In addition it is not possible to narrow the analysis to one or more threads or have concurrent tagged execution periods with each thread having its own tag value which is required when the tag values reflect the some aspect of the execution context such as the web application user name, request parameter or session attribute.

Here are some concrete kind use cases where this capability might be much more appropriate during analysis across multiple application life cycle management phases.
1. During the testing of a web application interaction consisting of a series of steps and pages a tag based on the step or page name is set globally allowing the developer to break down the resource metering costs for a particular package/class/method by the tagged step (or page).
2. During unit testing a tag is set at the thread context level based on the name of each @Test method executed providing code coverage like analysis for direct and indirect executed code by test method. This data can also be used in generating dependency analysis reports for such tests.
3. During a system failure (i.e. database bounce) a global tag is injected identifying the failure period allowing service level management to later break-out such costs on a per activity/interaction basis.
4. In diagnosing a user specific problem a thread context level tag is set identifying the user or the particular interaction at the entry point of request processing enabling resource metering to be reported both from an operational perspective (code) and contextual (user marker) perspective side by side.
5. In delivering elastic computing capabilities to an application a tag is injected into a cluster of runtimes at each dynamic change (a capacity profile) in the computing capacity made available to the application in the cloud which can then later be analyzed offline to determine reasons for the increase in capacity not being fully utilized due to possible inherent scalability problems with the applications own code base. Likewise one can easily determine which particular activities are impacted more than others by decreases in computing capacity due to elasticity policies.
The Fastest Ruby Profiler is a Java Profiler
August 5, 2009
Recently we announced that JXInsight can execute 1 billion instrumented Java method calls per second once the multi-resource (hotspot) strategy based metering runtime determines a particular method is a non-hotspot. So I was curious to see how our JRuby-to-Ruby cross language profiling extension faired with such optimizations as our probes technology is not in anyway tied to metering just named Java code execution constructs. The good news is that our technology can drop the overhead so low that it even out-performs no-op instrumentation calls in Ruby itself which means we are faster than any Ruby language based profiling or benchmarking solution (naturally when executed on the Java platform – JRuby VM).
Here is the Ruby micro-benchmark I used to determine the above.

Here is a chart showing the average time reported by multiple executions of the run method without and with our probes based instrumentation. Our instrumentation overhead is just over 15% but that is 15% on a method, call, that does nothing. Effectively our overhead is just under 15 nanoseconds for non-hotspots.
The overhead incurred in the instrumentation and measurement of fine grain method executions, which dominate the execution profile of most web applications, is a key performance indicator (KPI) for any enterprise performance management and monitoring solution.

Here is a partial screenshot of the resulting resource metering in our application management console. Note the strike-through on the call method indicating the associated instrumentation has been disabled following the execution of 1000 probe firings and the determination by the hotspot metering strategy that the named probe, Object.call, is a non-hotspot.

To determine the best (if not impossible) time a Ruby based profiling solution could do I added both a before and after method call to the call method which is typical of how most dynamic instrumentation agents work across various languages and runtimes. Both methods do nothing.

Here is a revised chart with the average execution time for the above manual instrumentation added. JXInsight is 3 times faster than a Ruby solution that does nothing.

Note: Disabling our hotspot metering strategy and comparing the performance with a version having a callout to
Benchmark.measure()JXInsight is more than 15 times more efficient. Moving to JRuby has many advantages for production applications both in terms of the raw speed of the runtime itself and the supporting management and monitoring tools.
Hardware
Model Name: MacBook Pro
Model Identifier: MacBookPro5,1
Processor Name Intel Core 2 Duo
Processor Speed: 2.53 GHz
Number Of Processors: 1
Total Number Of Cores: 2
L2 Cache: 6 MB
Memory: 4 GB
Bus Speed: 1.07 GHz
JRuby
jruby 1.3.1 (ruby 1.8.6p287) (2009-06-15 2fd6c3d)
Runtime
java version "1.6.0_15"
Java(TM) SE Runtime Environment (build 1.6.0_15-b03-223)
Java HotSpot(TM) 64-Bit Server VM (build 14.1-b02-92, mixed mode)
One Billion Operations Per Second
July 27, 2009
The problem with (published) performance benchmarks is that they are there to be beaten (luckily only ever by ourselves) and today well we have shattered all our previous benchmark results, including one last week, which means I am now faced with the daunting task of going over all my previous related postings and redoing the testing or just simply linking to the latest and greatest numbers. I think I will go with second option for now.

JXInsight’s multi-resource metering & monitoring (& profiling) technology is truly the fastest and most scalable Java solution on the market that is designed specifically for production usage in the most demanding of environments. Getting to this level of software quality is never easy. Its been a long and hard slog but Rod and I (co-founders) have managed to consistently out class the competition in terms of open & extensible API design, efficient runtime & tooling, and in the delivery of innovative information visualizations for the vast amount of data that the product can collect today.
Last week I managed to get back working on cool and innovative (hopefully you will agree) visualization work that I had designed (lots of sketches scattered around my office) a long time back but never managed to get the push to actually realize them within the product until today when we released a first in a series of visualization updates I am actively working on that will raise the bar (sky high) once again (well that is the motivational chant used currently).
In this first update I have tried to tackle a common problem with most application performance profiling/tracing/logging/journaling tools that use ordered child nodes within trees to display sequential execution behavior. The problem being that they do not scale in the enterprise context (too many nested branches and large sequences of nodes), are wasteful of precious screen real-estate, and cause repetitive strain injury (RSI) with excessive clicking to expand & contract nodes (seriously).
The solution is to not use a tree. Instead we use an ordered table that overlays the hierarchical execution nature (caller-to-callee) on top of a flattened sequential execution flow and makes it contextual (based on the current selection). And best of all we can display the execution of multiple concurrent threads side-by-side (to be exact its row by row) which is impossible with trees as the basis for any such visualization. Here is an animated gif showing this in action.
By the way the graphical tree overlay also highlights re-entrant calls on the billed probe stack. Can you see the visual pattern indicator?

Execution Profiling: Counting KPI’s
May 4, 2009
In a previous blog entry, Instrument->Measure->Refine->Repeat , I discussed how to approach the performance analysis of software with request processing execution times in the low microseconds via an iterative hotspot detection process that seeks to refine the instrumentation and measurement with each cycle, reducing overhead and minimizing perturbation in the data collected. In this follow-up I would like to present an alternative and complimentary approach that allows us to largely discount any instrumentation and measurement overhead by focusing on the rate of change in key performance indicators (KPI) that are not time based.
But first allow me to ask the question “Why do we profile the execution of software?“. For many developers the answer is simple – “to tune”. But this is not entirely the only valid reason and in fact this answer is more of a by-product of other activities and tasks within software performance engineering (SPE).
For large complex systems we observe and measure the execution of software in order to understand, model, quantify, and manage its behavior (resources usage & utilization, performance, scalability, runtime system dependencies) across multiple software releases and under different operating & executing contexts. Naturally we need a way to effectively manage all this information. This is achieved by metering (tracking) a relatively small number of indicators that measure operational activities that have a significant impact on performance, correlating the rate of change in such indicators with higher-level activities and various operating parameters.
Note: In fact the ability to detect and assess underlying changes in both the system & software execution models across deployments is of paramount importance in delivering service level management and reducing the time window for problem resolution in production environments.
The choice of a good key performance indicator depends largely on the type of software being monitored and managed. For standard business applications this can be the number and timing for requests serviced by various resource managers (file system, database, messaging middleware, data grid, compute grid). For low latency request processing systems there are a number of useful candidates for key performance indicators that are not entirely time dependent or impacted by instrumentation including: method execution counts, field read/write counts, object allocation counts & sizes (and to some lesser extent cpu time).
Lets apply this approach to the performance analysis of Oracle Coherence used in my previous blog entry which you should have already read at this stage. Before looking at any execution profile data you first need to know what was the activity being tested and how was it parameterized. Here it is:
>cache cache
>#10 bulkput
>#10 bulkput
Executing the above command sequence in the Oracle Coherence interactive shell results in a total of 20,000 SafeNamedCache.put() calls. We can see this below in the metering table with the Count column representing the number of times this particular method was called.
Lets keep this number, 20,000, in mind when looking at all numbers presented below.

The Total column represents the number of methods executed within the com.tangosol.* package during the execution of the SafeNamedCache.put() 20,000 calls by the same calling thread. The Total(I) tells us the number of methods called that were directly (inherently) attributed to the execution of the SafeNamedCache.put()
So 20,000 SafeNamedCache.put() calls resulted in 13.7 million method calls, 40,000 of which were made directly from within the method itself. But what about the other 38 million (51 less 13) call? Why only 13.7 million? The reason is that these other 38 million calls were made outside of the execution of the cache operation and more importantly by other threads. Expansion of the metering model tree reveals one of the main contributing classes, PacketPublisher, whose method executions resulted in a total of more than 22 million (1.3 million x 17) method calls, 4 million of which were made directly from within the class.
Note: For extremely low latency systems the overhead of excessive method calls can be relatively high even if in-lined to some degree because there still needs to be some housekeeping performed by the runtime (call stack mgmt). When looking for possible candidates to tune start with those probes with high averages.

In analyzing the performance of our own software product we use two low level meters – field.read.count and field.write.count. Below is such an analysis with 37.6 million reads performed on fields residing in the com.tangosol.* package during the execution of 52.1 million method calls also within the com.tangosol.* package.
Note: We ourselves monitor call paths within our probes runtime at the nanosecond level by averaging billions of test executions and by managing the number of field reads and writes.

Comparing the Count column with the Total(I) columns shows that a vast number of calls within the PacketPublisher class are simple field getters – an average of 1 field read per method call. It also reveals that during the execution of methods within this class 18.5 millions fields are read with 1 million within the class’s own methods and the rest in methods executed elsewhere in the com.tangosol.* package.

Another useful meter in analyzing software with a large amount of thread co-ordination is waiting.count which tracks the number of Object.wait() calls made by a thread.
Note: The
waiting.countmeter can be used in “profiling nothing” as a high number of waits calls is generally indicative of worker threads starved of work items to process.

For low latency systems you really need to look at the rate of object allocation and the resulting byte sizes. This is not only important for performance management but for capacity management as well as most OutOfMemoryErrors are the result of high concurrent request processing placing too much demand on the available memory pool and not some permanent memory leakage (more like a temporal & transient waste).
Note: Execution count profiling allows one to identify and quantify the impact of particular request execution parameters have on key performance indicators – correlating rate of change across multiple variables.

Focusing on the average object allocation count per method execution can sometimes reveal opportunities for tuning which by the way does not necessarily imply that objects should be cached – always try to eliminate work completely before improving the efficiency of the work via reuse.
From the metering table below we can see how many objects are allocated directly from within a method and how many are typically allocated during its execution including from within methods called directly and indirectly.
Note: Object allocation metering can also reveal the cost of initialization – methods only ever executed once.

Key performance indicators can be used to drive change (tune) but they can also be used to effectively manage and track change not necessarily tied to the achievement of an aligned target. They can also be used to identify unplanned changes that would otherwise result in customers notifying you of visible side effects at a much higher granularity and largely disconnected from the problem domain.
In a follow-up to this entry I will show how easy it is to create your own custom key performance indicators with our aspect library extensions.
Profiling The Nothing – Part 1
April 24, 2009
Kirk Pepperdine recently queried whether JXInsight could indeed profile nothing. Naturally JXInsight can profile the “nothing” in all its many forms (resource contention, coupling & co-ordination) but to keep the first part of this follow-up short & sweet let me show you how to detect latency in browser based operations from a software execution model perspective. In this scenario a single button click on a web page to cancel a hotel booking that results in the user returning to a page displaying the list of current bookings remaining.
And here it is – our timeline analysis of the two resulting HTTP requests processed by the same worker thread (this need not always be the case) within an application server.

The light blue rectangles represent the time the thread was processing within the servlet layer. The yellow and brown the time the thread was processing within the JNDI and EJB layers respectively. The bracket above the brown rectangle in the first HTTP request represents the actual active resource (JDBC) transaction window.
Finally the whitespace (the “nothing”) in the middle of the timeline graphic is the time spent outside of the container sending/receiving/processing the request at the client – 15-20% of the total user experience time.
Next up profiling the “nothing” from a system execution model perspective.
Instrument->Measure->Refine->Repeat
April 6, 2009
One of the hardest challenges for any application performance management tool is in reducing the runtime overhead to levels acceptable for production environments whilst collecting as much measurement data as possible for the purposes of service level management and problem management. For the typical code profiling solution this is an impossible task without significant filtering which incidentally most of the time results in the actual bottleneck being hidden as filtering is usually based on namespaces rather the actual performance profile of the methods themselves.
Even with significant filtering most profiling tools cannot perform at levels required for applications that have incredibly low request processing times whilst executing a significant number of method calls. If a request executes in 250 microseconds (which pretty much rules out call stack sampling) and during its execution it makes 2,500 method calls it is pretty easy to see that achieving an acceptable level of overhead, 2% (5 microseconds), is an insurmountable task especially when a system call to the clock counter costs between 0.2 – 0.3 microseconds and you need to do this twice for each of the 2,500 method calls.
With a dynamic measurement (metering) solution such as the product I work on you can reduce the overhead of an instrumented method call, that is not deemed a hotspot (dynamically at runtime), to just under 0.03 microseconds but multiplying that by 2,500 we still come out above the acceptable levels for this extreme type of request processing.
Can such applications then be profiled? Yes. There are two approaches one which I will discuss in this blog entry and the other in a future entry.
Instrument->Measure->Refine->Repeat
Clearly a solution to the problem described above must involve reducing the cost of both the instrumentation and measurement. But how can this be done in such a way that we do not eliminate expensive calls from the performance model whilst ensuring we do not inadvertently include calls perturbed (directly and indirectly) by the remaining instrumentation. The solution is to iteratively refine the instrumentation based on measurements that identify execution hotspots. With each iteration the hotspot list is reassessed in light of the reduced level of instrumentation.

For this process to be much more effective than the guesswork performed by way of manual namespace filtering the underlying runtime and tooling must at minimum
- Perform the measurement with as low overhead as possible (which is why our product out-performs the competition by a huge margin)
- Allow the hotspot determination to be based on the dynamic analysis of resource usage across one or more meters (via composite cost meters)
- Automate the accurate generation of the refined instrumentation set (it is a one-click operation in our product).
To demonstrate this approach I am going to use a real-world software example – Oracle Coherence. Shipped with the Coherence product is a command line interface that allows one to interactively execute Map.put() and Map.get() operations on a distributed cache. The command line facility also allows you to perform such operations in bulk and to repeat them. Here is the command sequence I will execute across each test presented below.
> cache cache
> #100 bulkput
The above commands will create a distributed cache named “cache” and repeat 100 times a bulkput command which by default executes 1,000 Map.put(Integer,byte[]) commands. In total 100,000 Map.put calls are executed.
Below is a metering model collected following the above command sequence which does include a small number of call counts related to the start-up and excludes constructor calls. Over 2,300 method calls are made per Map.put call. Running the test without any instrumentation the average execution time is 0.265 microseconds per Map.put call. This is our baseline.
Note: This particular use case is a best (if unrealistic) case for Coherence as the command uses a
byte[](easy IO mapping work) as theputvalue, whereas most applications would have objects that need to be serialized which is very expensive relatively speaking that is.

The above metering was collected without any of our production-oriented features enabled such as our dynamic hotspot metering strategy. By re-enabling such features whilst keeping the same level of instrumentation we get the metering model shown below with a significant reduction (1000x) in the number of firing probes that were actually metered (calls measured). Unfortunately we still take a performance hit in dynamically assessing whether a called method (firing probe) is a hotspot or not (metered or not). The average execution time for each Map.put is now 0.640 microseconds.
Note: The definition of a hotspot is highly configurable within our runtime. For the tests reported in this entry I defined it to be any method with an average inherent wall clock time of 50 or more microseconds after 1000 executions globally.

Fortunately from within the management console we can automatically generate a new probe extension (jar) based on those probes labeled hotspots within the metering model exported from the managed runtime. One click operation!!! Instead of instrumenting over 4,500 methods we now have a small list of methods to be instrumented spread across a total of 5 classes.

Below is the metering model following the deployment of the new probe extension containing the refined instrumentation. We have now halved the metering count but more importantly the instrumentation overhead associated with methods not deemed hotspots has been removed. The average execution time has dropped back down to 0.285 microseconds.

Generating a new revised probe extension based on the hotspots in this new metering model reveals a further reduction in the methods and classes deemed hotspots.

Here is the final metering model captured based on the refined hotspots – a ever so small reduction. The average execution time has now dropped to 0.275 microseconds which equates to 0.010 microseconds of overhead which is largely a result of the underlying system clock calls.

The following bar chart compares the changes across each iteration with the baseline reported above and that of another Java profiler.

Note: Hopefully this explains why today we ship with over 600 technology specific extension libraries that specifically target the main client call entry points (mainly interfaces) into each technology.
In a follow-up to this entry I will revisit the tests again but this time using another technique ideal for CPU analysis in situations were the resolution of the system counter is insufficient – execution count profiling.
Efficient Runtime Analysis of Hibernate
February 24, 2009
In a recent article claiming to “demystify” the Hibernate session context & cache a rather expensive (@runtime) and somewhat laborious (@analysis-time) approach was presented to show how the impact of certain operations in the Session interface have on the persistence context & cache components as well as database access patterns.
Note: The approach involved journaling (@runtime) each call along with some primitive (string) parameters and then presenting the entries as a sequenced call tree with the important measurements need for this rather small runtime analysis nested and scattered across different call branches.
But getting the answer to such questions as “Which entities are being added to the persistence context?” and “Which software activities trigger entity additions and database queries?” can be so much easier for both the user and the runtime once you have the right measurement model that is inherently extensible, uniform, and flexible – probes, meters and counters.
Lets start with the counters. Below is a table showing some of the counters maintained by our (@load-time) instrumentation extension libraries for Hibernate.
Note: The
hibernate.add.counttracks the entities added to the persistence context.

The counters are maintained per thread and can be aggregated at the process, host and cluster levels within the management console. We can also mark the counters to monitor the rate of change during a particular measurement interval.
In addition we can add other extension libraries to the class path to count at a contextual level (the class of the entity) as well.
Note: The class of the entity prefixes the
hibernate.*.countcounter.

Now that we know what is happening in terms of our data model lets find out why by simply mapping (via configuration) custom meters to some of the above counters and then enabling probes (activities) at the appropriate application level (Seam Bean).
With the following metering table it is very easy to determine which packages, classes and methods result in entities being added.

We can also perform a similar analysis across all meters mapped to counters.

If you want you can even map a meter to an individual entity.* counter and rename it within the metering model. I actually recently did this for a customer that helped resolve a significant issue with an application in production.
This approach can be applied to other persistence frameworks (a similar set of counters are supplied for JPA) as well as data caches such as Oracle’s Coherence.

The only skill required in effectively using our probes technology is in deciding which software execution points become counters (and possibly mapped resource meters) and which become probes (hierarchical cost/metered groups).
Annotated Java Thread Call Stacks
February 11, 2009
Following my last post in which I presented an alternative (more like complimentary) Java stack trace print I have received a few inquiries offline questioning my view that call stacks are not necessarily that important for runtime diagnostics.
To be honest I am somewhat surprised by this incorrect perception because I have always placed a significant value on the efficient collection of thread call stacks especially when only a small portion of the code base is instrumented. Having a call stack at particular instrumented points in a software’s execution is important which is why in 2002 I made it a core element within our own software & system execution models for both database transaction path analysis and component distributed tracing. This was prior to the introduction of StackTraceElement (@since 1.4) when amazingly some tools were actually using printStackTrace to obtain the calling context.
Unfortunately the benefit of the StackTraceElement class was greatly diminished by its initial class design (and possibly its name) which was probably heavily influenced by the usage of stack traces at the time and possibly the need to transmit them over the wire (RMI?).
Can you see the problem?

The important context fields are Strings. In a typical container environment with a proliferation of ClassLoaders it is practically impossible to accurately obtain access to the declaring java.lang.Class instance. Even if this was possible within a “hello world” like application we still cannot access the java.lang.reflect.Method as there is no parameter list only a line number into a file (allowing for overloaded method names).
When I designed our native profiling agent (initially JVMPI now JVMTI) I set out to ensure that for every frame on the stack we could obtain the Method reference and that the cost in obtaining the call stack from within the runtime itself would be significantly cheaper than any other means available especially for call stacks that appear repeatedly during execution.
What did the agent solution buy us? Well it allowed us to offer the ability to apply filtering to a collected call stack based on class meta data (what it implements, what it extends) and not just on the package/class/method name as offered by other tools. It also allowed use to add meta data to our software execution model which could be used to more effectively visualize call stacks. Here are some screenshots taken from within the product’s management console to illustrate this last point.
Icons: Using the meta data each Class within the model is given one or more classifications each mapped to an icon.
Note: Classifications have a ranking which determines which one is used for rendering when there are multiple classifications assigned.

Signatures: The meta data allows us to display a tooltip for a Method listing its modifiers, parameter types and return type.
Note: You might be surprised to see the
abstractkeyword. This is because forProxyclasses we substitute the actual interfaceClassandMethod. The * denotes this.

Annotations: The meta data allows us to present to the user the Annotations assigned to a Method.

Class & Classloader Hierarchies & CodeSources: The meta data allows us to display the Class & ClassLoader hierarchies for a Method’s declaring Class including the CodeSource of the Class.
Note: The
ejb:sessionclass classification is based on the presence of the @Statefulannotation.

Compressed Call Stack Rendering: With the meta data I can quickly process (visually that is) the following 26 call stack collections immediately identifying the sequence of calls between major classified components/classes.
[Servlet]->[JSF View]->[SessionBean]->[SeamBean]->[JPA Class]

Note: Collecting call stacks is important in creating a software & system execution model but one cannot ignore there is a relatively high cost to the collection even if you are using our agent which significantly out-performs all other available means and tools today. Call stacks should only be collected at coarse grain component boundaries i.e. database calls, out-bound remote calls, messaging endpoints, service/activity entry points, etc.
Better Java Thread Stack Trace Dumps
February 8, 2009
When performing JVM runtime diagnostics the standard (and need I say pretty crude) approach adopted by most working for companies with very little in the way of mature IT management processes and tooling is the thread (or exception) stack trace dump.
There are numerous problems with this approach including the lack of object & argument state information (already solved quite comprehensively) as well as timing & resource usage which I hope to address here.
Lets start with the sample code used to generate the stack trace print outs presented in this entry.
A simple deep recursive call chain execution ending with a wait followed by a stack trace print.

Here is a small extract of a single stack trace printed by the above code.

The problem with the output is that we have no inclination to how long each frame has been on the threads execution call stack and the resources it has consumed since added.
We cannot even determine whether two identical stack traces, printed by the same thread, represent the exact same execution call (frame) stack. Is this a new request at the same execution point or has the code stalled (waited) since the last print?
In addition the stack trace print contains frames that represent noise in our runtime analysis if we were focused on tracking down performance bottlenecks. This can also be the case when analyzing applications deployed to standard middleware containers.
Now lets rewrite the waiting method to use our recently announced stack probes provider extension.

Here is an extract of the stack trace printed.

A few observations. It is in XML format. Each fired and metered Probe (frame) has a unique id within the context of each thread. The wall clock time (in microseconds) is listed as a Meter for each frame on the stack.
Note: With the id we can compare across stack trace prints and performance call stack sampling with a much higher degree of accuracy (no second guessing whether the method indeed remained active during interval).
Did you notice that the intermediary stacks.Dumping.call method is absent in the above output? The reason is the call(int) method had been executed a sufficiently number of times (default=1000) prior to this particular stack print for the resource metering runtime to have already determined that it is not a hotspot (in terms of inherent cost) and does not need to be metered any further. It was fired but not metered.
Note: The runtime allows for different meters to be used in determining whether a particular named probe (method) is a hotspot or not.
Here is another stack trace printed later in the same run.

Now all the stacks.Dumping.$* methods have completely disappeared. We are left with probes fired and metered for methods marked as hotspots or still under analysis by the probes runtime metering strategies.
You have probably already guessed that if you enable additional resource meters within the metering runtime these will also be included in the probe stack prints.

Conventional call stack dumps quickly fill up a log file with a large amount of irrelevant and duplicated data but they are also extremely costly both in terms of wall clock time and CPU usage.

I am not advocating the complete elimination of thread stack trace dumps from logs though I am sure many of us would wish it to be so especially when staring at the same exception printed 20 times or more as it passes back up through the request processing pipeline.
Having the complete call frame stack can be helpful in a few cases but for the most part it is woefully inefficient (both at runtime and off-line analysis), wasteful, and severely lacking in important runtime diagnostic information – object state and resource usage.
There are solutions available today that make problem determination pretty efficient even for the most complex of systems. You just need to step outside of the runtime and to stop staring into the Sun (it can blind you).
Note: With the solution above probe stack prints can also cross multiple language boundaries – Java-to-Ruby.
Listening to software as it executes
January 22, 2009
Have you ever wished you could listen in on the execution of your software at particular points in its execution? Are you tired of writing stop watch classes and bracketing up a piece of code you are currently working on with a try..finally block? Well when we developed the event probes provider extension announced today we initially envisaged the work being used by partners and customers to simplify integration with existing event & alert management solutions at customer sites. But in writing the documentation for this extension I realized it could also be used during development and test. Here are some examples I quickly came up with.
Note: The actual instrumentation of the methods is all done transparently at load-time though the instrumentation does use the same underlying Probes Open API.
The first example shows how we can register with the current thread execution context a probes event listener that will receive a notification when any method called by the thread has completed its execution.

Here is the output printed to the console.
events.Events.waiting waited=10008352
Note: JXInsight reports all resource metering time units in microseconds.
This next example shows how easy it is to print resource metering (clock.time) for outbound calls by registering a probes event listener prior to the actual call invocation sequence.
Much simpler than adding timing code before and after each call listed.

Here is the output printed to the console.
events.Events.call1 timed=1000457
events.Events.call2 timed=2000625
events.Events.call3 timed=3000892
The following example shows how to trace direct and indirect method executions made by the thread within the execution interval of a method in which a listener is added (and removed).

Here is the output printed to the console.
entering: events.Events.outer
entering: events.Events.inner
exiting: events.Events.inner
exiting: events.Events.outer
The last example uses the same tracing like mechanism to determine the maximum call stack depth reached starting from a particular method execution frame.

Here is the output printed to the console.
max call stack depth was 5
Have you ever wondered how long it takes to get from one method frame to another method frame nested deeply amongst all the indirection of supposedly light weight frameworks? What about the time interval between two method calls not stacked or in any way directly call dependent? All these and many more ad hoc queries can be answered whilst still relying on JXInsight to do what it does best – instrument, meter, measure, aggregate, correlate and visualize.
Profiling: Sampling versus Execution (Part 2)
January 16, 2009
This is part 2 of a series of blog entries examining the claim that call stack sample based profiling offers much less overhead than instrumented/execution based profiling. In this entry I have constructed a simple benchmark class based on the following observations raised in part 1 related to enterprise Java applications in the wild (which I assume was the context when the health center monitoring tool was marketed to the audience at the conference).
- Large number (>50) of request processing threads
- Very deep call stacks (>200) with a high percentage of call frames non-application related (especially so when using frameworks such as Spring)
- High degree of database activity with high latency costs (>10 ms)
The benchmark class (partially listed) below does not necessarily look to declare a winner between sampling and execution profiling in terms of overhead as the answer depends on the product used and the application itself. It is an attempt at understanding the difference in the runtime cost and behavior of both techniques – across different products and execution contexts. To do this I have collected a sufficient amount of test data to help me in answering the following questions :-
- What is the typical CPU cost of each data collection technique?
- What is the rate of change in CPU cost as the average call stack depth is increased?
- Is the CPU cost of a data collection technique an accurate measurement of its runtime overhead? (Does sampling throttle the throughput of other threads?)
- Does a data collection technique have any inherent scalability issues in an enterprise setting when factoring in the previously mentioned observations and the trend of increasing multi-core computing?

As you can see from the code above I have created some real challenges for an instrumentation/ execution based profiling solution in that we have a disproportionate number of methods that are relatively cheap (mimicking lightweight request processing steps), easily in-lined (to some degree) and executed at a much higher frequency than the final sleep call (mimicking a database call).
Note: Another reason for the test execution time being largely dominated by the sleep time is that I can quickly calculate the CPU overhead of a product and collection technique using the process’s CPU time.
I conducted more than 50 test runs, each lasting between 100-300 seconds, comparing JXInsight instrumentation/execution based resource metering solution with the call stack sample based profiling option offered by the 3 most popular Java profilers on the market. The first chart below shows the wall clock time for each benchmark test run with varying call stack depths.
Note: There was no filtering of the class and methods to be instrumented in the execution profiling test runs which is not typical of a production environment (see observations above).

From the chart above we can see that when the call stack depth is kept relatively low (<100) the overhead of sampling could be categorized as being light weight (2-10%). But once we hit the 200 call stack depth the overhead of sampling grows quickly and becomes excessively heavy weight (20%-200%). This is not the case for our execution profiling solution, JXInsight, which keeps the overhead extremely low (1-5%) via a dynamic metering strategy whilst firing a relatively large number of probes (41-161 million) per test run.
Note: For every sample cycle performed by the other products JXInsight can fire (begin-end) between 5,000-20,000 execution based probes at an extremely low cost of between 0.5 to 1.0 microseconds depending on whether a firing is metered (underlying counter read) or not.
The next chart shows the CPU consumption (following the execution of a warm-up cycle) for each benchmark test run with varying call stack depths.

As expected the CPU cost increases inline with the increase in call stack depth across all sample based profilers. But what is interesting about this chart is that it shows the mix-mode based profiler (A) having a lower CPU cost than JXInsight whilst having a significantly higher wall clock time (the slowest reported in the previous chart). It would appear that this product has traded throughput/scalability for increased levels of efficiency (smaller code paths?) using possibly coarse grain synchronization (serialized processing) during measurement and/or the queuing events & updates for processing by a (single?) non-application thread.
Similarly sample based profiler (B) has near equal CPU costs but much higher clock times than JXInsight. This is more than likely caused by a reduction in the call throughput due to the suspension of all application threads (> 40) during each call stack sample cycle which from the CPU times reported we know to be anything but short or cheap (0.5-1.0 seconds).
If an execution profiling solution can keep the measurement overhead sufficiently low it can scale as the application workload scales across cores. This is not the case with call stack sampling which is rarely implemented in a multi-threaded manner as it is inherently serialized at costly execution points to maintain some degree of accuracy in the data reported.
Is call (stack) sample profiling more light weight than execution profiling? No if you are using JXInsight. If not then the answer really depends on a number of factors including the average sampled call stack depth, the average number of active threads, the request processing characteristics (CPU / IO), the number of computing cores, the sampling profiler, and sampling configuration (interval).
In the next entry I will look at the characteristics of the data collected by each technique and its applicability in the context of application performance management.
Update 19th Jan 2009: Whilst looking at the probe snapshots generated during our test runs I noticed that I had mistakenly used an instrumented version of our probes runtime (internally we use a number of counters to instrument parts of our runtime with itself) overstating both our clock times and cpu times. Here is a chart showing the reduction in CPU usage when using the non-instrumented production version of our libraries – at least 10% reduction across each test run.

Update 19th Jan 2009: The JXInsight test runs were executed with the following system properties.
jxinsight.server.profiler.autostart=false
jxinsight.server.probes.strategy.hotspot.statistic=inherent.avg
(Historical) Aside: When we released JDBInsight 2.0 (a very long time ago) it offered an extremely efficient call stack collection facility implemented natively within our JVMPI agent with maximum frame size limit of 128. We subsequently changed this to 256 when object-relation mapping tools became more widespread adopted along with increased indirection in the middleware request processing pipeline. When developers started adding IOC frameworks into the application stack the size of the stack was increased beyond 320. Will this grow even further? I think it has already happened with some middleware products using renamed packages of the same IOC frameworks internally in the pipeline. A stack size depth of 400 or more sounds absurd and presents a strong argument for rethinking exception handling and the usage of logging by developers today.
Update 16th October 2009
Our overhead has now dropped to 1 nanosecond for a non-hotspot method.
http://williamlouth.wordpress.com/2009/07/27/one-billion-operations-per-second/

