Self Optimizing Java Hotspot Detection
November 30, 2009
I was recently contacted by Solomon Duskis to see whether I would be interested in assisting him with some performance tuning work he was involved with on the JBoss RESTEasy open source project. He was in particular concerned that his findings based on data reported by other profiling tools was inaccurate due to their excessive overhead especially when the entry point operations being tested had sub-millisecond execution times.
Looking at the initial tests results of a prolonged benchmark run, which exhibited relatively (but to be expected) high GC activity and extremely short high frequency method executions, I thought this would be an interesting test for some new features I had been working on in recent software updates to JXInsight specifically adjusted clock time metering and self-optimizing hotspot detection.
Note: Once you start profiling & tuning at the microsecond level you can effectively rule out using any thread cpu based timers as their resolution is very low and their cost of access relatively high.
The first thing I did to understand the software execution behavior of the benchmark was to execute a run with our complete coverage dynamic probes extension, weaving (at load-time) resource metering into all methods within the org.jboss.resteasy.* and org.codehaus.jackson.*. For this run I disabled the default clock.time meter and instead installed our probe.count meter so that I could count the number of metered method executions made directly and indirectly by each package, class and method.
The Count column below tells me the number of probes fired per package – in total approximately 604 million. The Total column tells me how many probes, including those in other packages, were fired and metered during the course of executing the probes listed in the Count column with the last column, Total (I), telling me how many of those probes are actually inherent (directly attributed) to the package.
Note: The
Total (I)column can be less than theCountcolumn because probes do not include their own firing within their reported resource metering which also explains why theTotal (I)column can be greater than theCountcolumn as in the case ororg.codehaus.jackson.*which calls back intoorg.jboss.resteasy.*. This might be a little bit confusing because we have effectively mapped a resource meter to the actual occurrence (count) of the resource metering but to me that is the beauty of our approach – we can meter anything including the act of metering itself.
The metering data above helped me in very quickly understanding the degree of impact a standard code profiler would have before I even ran such a test. Knowing that most other profilers have an overhead of between 1 and 5 microseconds per method execution I can estimate the total benchmark time from a baseline of 30 seconds. In the best case we are talking about 10 1/2 minutes (634 seconds) though from my own testing the best code profiler took more than 16 minutes (980 seconds). In the worst case more than a hour. With such numbers you are only ever likely to find low hanging fruit along with a large number of red herrings especially when one factors in the amount of GC activity generated.
The above does not really present much of a problem to JXInsight because its metering of any method’s executions is strategy based (In fact we support multiple chained metering strategies). With the default hotspot metering strategy installed the runtime will dynamically disable the instrumentation after a number (default is 1000) of method executions if at that time a specified metering statistic does not exceed a particular threshold. But when a hotspot metering threshold is defined as low as 5 microseconds as in the case of my own benchmarking even the initial number of warmup metered method executions before the hotspot determination is made can have some degree of impact on the actual test run timing and the resulting hotspot determination (especially callers which can be incorrectly labeled hotspots).
Previously the solution was to generate and deploy a new hotspot instrumentation extension after and before each repeat of the benchmark test (see Instrument->Measure->Refine->Repeat). In JXInsight 5.7.41 we offered up another option that can be enabled to defer the generation of extensions until much later in the test cycle by automatically persisting (and reloading) the hotspot metering analysis across consecutive benchmark test runs allowing the runtime to immediately upon initialization disable non-hotspots before their first firing whilst refining the previous hotspot candidate list. This is all done without any intervention by the performance engineer. The performance engineer just keeps re-running the benchmark until the test times and number of hotspots remains constant across runs. The hotspot metering runtime continually builds on-top of what it previously determined – self optimizing.
Here is the output of 13 benchmark test runs I executed in a matter of a few minutes.
Note: For the 9th test run I changed the hotspot metering configuration (whilst still reusing the previous hotspot analysis) to use the inherent min statistic to get rid pass thru elements within a call chain.
Note: For the 13th test run I used a hotspot instrumentation extension generated from within the console based on the metering model collected from the 12th test run removing completely all instrumentation from non-hotspot methods and disabling the metering strategy probes provider extension.
Note: The benchmark duration is measured in seconds and the metering threshold in microseconds.
Update: In our latest internal build, 5.7.42, which includes the ability to enable continuous hotspot performance analysis the benchmark time has now dropped down under 33 seconds.
Near to the end of the testing we were still metering 10 millions calls and incurring only a 5 second slow down compared with a competitors best time of 950 seconds. Here is the breakdown of the resource metering by the main top packages for 12th test run prior to the generation of the hotspot instrumentation extension.
Note: The c
lock.adj.timemeter was used instead of theclock.timemeter to discount gc activity in the hotspot metering analysis.

Looking at the inherent clock time averages for hotspot candidates (a partial list is show below) we can see that there was still room for further reduction (in overhead) if we increased the hotspot metering warmup count or instead this was continuously re-assessed during the course of an individual test execution (coming soon).

In a follow-up to this entry I am going to compare the hotspot results from a number of tools showing how surprisingly different they can be beyond the low hanging fruit cases and how widely inaccurate they are no matter what profiling technique is used – execution/event based or call stack sampling. Free Advice != Good Advice.
Metering (Profiling) Tagged Execution Periods
September 21, 2009
One of the many cool and innovative features we have recently pushed out the door with our regular software updates is an enhancement to tagging that really makes this unique feature extremely useful and versatile in the analysis of software under construction during development & test or under management in production. Now when you set a tag at either the global or thread context level not only do we aggregate resource metering for the tagged metering group but we do this for every other metering group updated during the tagged execution period. We do this by creating an extended @tag metering group that concatenates the name of the metered group updated with the value of tag.

Being able to inject tags remotely from within our management console or command line interface or locally from within the actual managed runtime via our Open API gives us the ability answer two very broad questions:
1. What is the resource usage for a package/class/method during a particular tagged execution period?
2. How does the resource usage for a particular probe compare across tagged and non-tagged execution periods in terms of count, total, min, max and average?
Previously the way to answer such questions at a global level across various profiling and application performance management (APM) tools would be to reset all data collected prior to the execution period, specify a period filter on an event journal table or log, or in our case mark the current totals and then perform delta analysis following the completion of the execution metered. But this leads to the creation of multiple disconnected performance models that cannot be easily compared and reconciled. In addition it is not possible to narrow the analysis to one or more threads or have concurrent tagged execution periods with each thread having its own tag value which is required when the tag values reflect the some aspect of the execution context such as the web application user name, request parameter or session attribute.

Here are some concrete kind use cases where this capability might be much more appropriate during analysis across multiple application life cycle management phases.
1. During the testing of a web application interaction consisting of a series of steps and pages a tag based on the step or page name is set globally allowing the developer to break down the resource metering costs for a particular package/class/method by the tagged step (or page).
2. During unit testing a tag is set at the thread context level based on the name of each @Test method executed providing code coverage like analysis for direct and indirect executed code by test method. This data can also be used in generating dependency analysis reports for such tests.
3. During a system failure (i.e. database bounce) a global tag is injected identifying the failure period allowing service level management to later break-out such costs on a per activity/interaction basis.
4. In diagnosing a user specific problem a thread context level tag is set identifying the user or the particular interaction at the entry point of request processing enabling resource metering to be reported both from an operational perspective (code) and contextual (user marker) perspective side by side.
5. In delivering elastic computing capabilities to an application a tag is injected into a cluster of runtimes at each dynamic change (a capacity profile) in the computing capacity made available to the application in the cloud which can then later be analyzed offline to determine reasons for the increase in capacity not being fully utilized due to possible inherent scalability problems with the applications own code base. Likewise one can easily determine which particular activities are impacted more than others by decreases in computing capacity due to elasticity policies.
Profiling The Nothing – Part 1
April 24, 2009
Kirk Pepperdine recently queried whether JXInsight could indeed profile nothing. Naturally JXInsight can profile the “nothing” in all its many forms (resource contention, coupling & co-ordination) but to keep the first part of this follow-up short & sweet let me show you how to detect latency in browser based operations from a software execution model perspective. In this scenario a single button click on a web page to cancel a hotel booking that results in the user returning to a page displaying the list of current bookings remaining.
And here it is – our timeline analysis of the two resulting HTTP requests processed by the same worker thread (this need not always be the case) within an application server.

The light blue rectangles represent the time the thread was processing within the servlet layer. The yellow and brown the time the thread was processing within the JNDI and EJB layers respectively. The bracket above the brown rectangle in the first HTTP request represents the actual active resource (JDBC) transaction window.
Finally the whitespace (the “nothing”) in the middle of the timeline graphic is the time spent outside of the container sending/receiving/processing the request at the client – 15-20% of the total user experience time.
Next up profiling the “nothing” from a system execution model perspective.
A Healthy Mistrust of Java Based Stress|Load Test Tools
December 16, 2008
I was recently asked my opinion of Java stress|load test tools. I had considered providing a very detailed technical review of the various products and open source projects on the market but considering this was a performance workshop that was already running over its time schedule I opted instead for a quick visualization tour that would bring home one of the biggest issues with a fundamental feature of such tools – inaccurate reporting.
To visually demonstrate this I opened up a timeline analysis snapshot in JXInsight of a load-test run targeting an Java EE application deployed in JBoss with the remote procedure call traffic generated from a ITKO Lisa server process. Both processes were monitored with our distributed tracing technology and product/middleware related extensions.
The screenshot below shows the traces recorded across both processes with the bottom two rows representing load generating threads and the two rows above representing the corresponding traces in JBoss. In the screenshot I have selected a trace in the bottom row which will result in a rectangle overlay being drawn to include the corresponding distributed trace in the second process.

To fast track to my observations in the field when using such tools I simply added a highlight to the timeline – High GC.

Here is the same timeline graph above but with a highlight on only those traces that have a “High GC” symptom associated by way of an automated observation inspection. All the application server side traces have faded leaving the spotlight on the threads executing the load scripts in the test server with reported GC times between 20 and 40% of the wall clock time.

What impact does this have on the load tests results? Well responses times are overstated with the throughput reduced and understated in reports.
It is possible that your tests might not be impacted to the same degree but can you be sure? Have you load and performance tested the test tools you are using today?

