I recently came across a very interesting paper on the increasing usage of fine-scale charging made possible by the latest advances in technology in which the authors argue that in many cases such an approach maybe inappropriate.

To some degree I agree with the authors of the paper at least when metering is applied to the physical world and transaction costs are superfluous and static. But I would argue that in the virtual world (of cloud computing) such costs can be greatly mitigated by a dynamic resource metering runtime (used as the basis for a charge back scheme) that is largely strategy based and driven by a policy that factors in the cost of collection and adjusts accordingly.

Note: The resource metering profile of a process, activity or request has many uses and users – billing (or charge back) is just one.

Now being “strategy based” might imply that in some cases charges (measurements) are waived (or understated) but that need not necessarily be the case. For the vast majority of cases there is always some larger encompassing activity (call execution) context were charges (measurements) can be appropriately assigned. What metering strategies do is change how direct the association of charges (measurements) is. In fact a well defined strategy can help one be more effective in determining the most appropriate point in an activity (call execution) for cost reduction (performance improvement).

Hopefully if you managed to stay with me up to this point we can now apply to this to some actual code. The following code resembles one of the problems highlighted in the paper. With the inner() method instrumented with execution timers the transaction (measurement) cost (overhead) is double the actual execution time (cost).

Here is a sample metering (charge) model without any strategies applied. The inherent execution costs (charges) are assigned largely to outer() and inner() – 2.6 secs and 1.6 secs respectively. But how accurate is this and how useful is it when one considers that there is not really anything that can be done to inner() to make it faster. It would be better to focus on the source of the inner() calls.

Lets run the application again this time with our hotspot metering strategy enabled and configured to disable metering of probes which have an average clock time of less than 75 microseconds after the first 1000 metered firings.

The metering model shown below has drastically changed from the previous one above. The majority of the inherent total has now been assigned to outer() following the dynamic disabling of the inner() instrumentation. None of the metering has been discarded. We have just changed the location were costs (charges) are assigned. In doing so we have also greatly reduced the measurement (transaction) overhead – the overall time has dropped down from 4.2 secs to 0.7 secs. But notice that the average of outer() is 66 microseconds. Why was this not disabled? By default the hotspot metering strategy makes its hotspot determination at the 1,000 execution point and does not re-evaluate this again. At that point the average was higher than our hotspot threshold.

Note: The hotspot metering strategy can be configured to use any meter included in the metering model and you can easily create your own meters that represent concepts within the domain model of the application metered.

Here is another metering model with continuous hotspot analysis enabled. At the 2,000 execution mark the outer() method becomes a non-hotspot in terms of our definition. The majority of costs have now moved up to main() which if you look at the code is probably were they should be if one cannot change the number of loops made in the outer() method.

Finally an advanced resource metering runtime can learn not only during a particular process execution but across multiple process executions. Enabling this capability and running the application again we get the metering model below. Both inner() and outer() have been disabled from the start.

It is worth noting that assignment of costs follows how the Hotspot JVM will in-line code.

Today’s web applications have execution paths with huge stack depths  - a trend that is unlikely to be reversed anytime soon. New web application frameworks deliver increased developer productivity at a cost that is to some degree incurred during performance testing/tuning and production problem resolution in the increased complexity involved in the analysis of the software execution being monitored.

Profiling tools based on thread call stack sampling are now pretty much useless due to their excessive overhead and the amount of noise within any derived performance model. Even with standard package filtering options available in most code profilers (call stack sampling or event/execution based) one ends up losing far too much execution context because it is largely an all or nothing choice in terms of each package.

Here is a partial call path for an inbound HTTP request collected from an instance of Apache ESME running in a Tomcat server that has already had standard filtering of web container packages applied to it. More than 30 deep and we have not even seen sight of actual ESME code.

Note: You need to navigate much much deeper  (> 60) to begin to understand what is the underlying domain action(s) performed in the processing of this request.

To address this problem we recently released an update to JXInsight that delivers a highly innovative approach to managing such complexity without losing all of a request’s execution path context as processing moves down through the various web application layers - smart dynamic runtime filtering.

This is achieved by enabling our tierpoint probes provider and defining (via a few system properties) logical tiers mapped to one or more probe groups (typically a package or class though methods are supported). The provider then limits resource metering to execution points at the boundaries of such tiers filtering out all internal method dispatches within a tier.

Here is the same HTTP request as above but this time with our tierpoint provider enabled and tiers defined for LiftApache ESMEApache Derby and Scala. From L.doFilter to E.find we have excluded 60 call frames.

Note: This is performed within the managed runtime which means we can drastically reduce the amount of metering performed whilst still assigning inherent resource metering accordingly across tiers (packages).

Note: The probe names for each element of a track are displayed in shortened form to conserve precious (blog/desktop) real-estate.

I was recently contacted by Solomon Duskis to see whether I would be interested in assisting him with some performance tuning work he was involved with on the JBoss RESTEasy open source project. He was in particular concerned that his findings based on data reported by other profiling tools was inaccurate due to their excessive overhead especially when the entry point operations being tested had sub-millisecond execution times.

Looking at the initial tests results of a prolonged benchmark run, which exhibited relatively (but to be expected) high GC activity and extremely short high frequency method executions, I thought this would be an interesting test for some new features I had been working on in recent software updates to JXInsight specifically adjusted clock time metering and self-optimizing hotspot detection.

Note: Once you start profiling & tuning at the microsecond level you can effectively rule out using any thread cpu based timers as their resolution is very low and their cost of access relatively high.

The first thing I did to understand the software execution behavior of the benchmark was to execute a run with our complete coverage dynamic probes extension, weaving (at load-time) resource metering into all methods within the org.jboss.resteasy.* and org.codehaus.jackson.*. For this run I disabled the default clock.time meter and instead installed our probe.count meter so that I could count the number of metered method executions made directly and indirectly by each package, class and method.

The Count column below tells me the number of probes fired per package – in total approximately 604 million. The Total column tells me how many probes, including those in other packages, were fired and metered during the course of executing the probes listed in the Count column with the last column, Total (I), telling me how many of those probes are actually inherent (directly attributed) to the package.

Note: The Total (I) column can be less than the Count column because probes do not include their own firing within their reported resource metering which also explains why the Total (I) column can be greater than the Count column as in the case or org.codehaus.jackson.* which calls back into org.jboss.resteasy.*. This might be a little bit confusing because we have effectively mapped a resource meter to the actual occurrence (count) of the resource metering but to me that is the beauty of our approach – we can meter anything including the act of metering itself.

The metering data above helped me in very quickly understanding the degree of impact a standard code profiler would have before I even ran such a test. Knowing that most other profilers have an overhead of between 1 and 5 microseconds per method execution I can estimate the total benchmark time from a baseline of 30 seconds. In the best case we are talking about 10 1/2 minutes (634 seconds) though from my own testing the best code profiler took more than 16 minutes (980 seconds). In the worst case more than a hour. With such numbers you are only ever likely to find low hanging fruit along with a large number of red herrings especially when one factors in the amount of GC activity generated.

The above does not really present much of a problem to JXInsight because its metering of any method’s executions is strategy based (In fact we support multiple chained metering strategies). With the default hotspot metering strategy installed the runtime will dynamically disable the instrumentation after a number (default is 1000) of method executions if at that time a specified metering statistic does not exceed a particular threshold. But when a hotspot metering threshold is defined as low as 5 microseconds as in the case of my own benchmarking even the initial number of warmup metered method executions before the hotspot determination is made can have some degree of impact on the actual test run timing and the resulting hotspot determination (especially callers which can be incorrectly labeled hotspots).

Previously the solution was to generate and deploy a new hotspot instrumentation extension after and before each repeat of the benchmark test (see Instrument->Measure->Refine->Repeat). In JXInsight 5.7.41 we offered up another option that can be enabled to defer the generation of extensions until much later in the test cycle by automatically persisting (and reloading) the hotspot metering analysis across consecutive benchmark test runs allowing the runtime to immediately upon initialization disable non-hotspots before their first firing whilst refining the previous hotspot candidate list. This is all done without any intervention by the performance engineer. The performance engineer just keeps re-running the benchmark until the test times and number of hotspots remains constant across runs. The hotspot metering runtime continually builds on-top of what it previously determined – self optimizing.

Here is the output of 13 benchmark test runs I executed in a matter of a few minutes.

Note: For the 9th test run I changed the hotspot metering configuration (whilst still reusing the previous hotspot analysis) to use the inherent min statistic to get rid pass thru elements within a call chain.

Note: For the 13th test run I used a hotspot instrumentation extension generated from within the console based on the metering model collected from the 12th test run removing completely all instrumentation from non-hotspot methods and disabling the metering strategy probes provider extension.

Note: The benchmark duration is measured in seconds and the metering threshold in microseconds.

Update: In our latest internal build, 5.7.42, which includes the ability to enable continuous hotspot performance analysis the benchmark time has now dropped down under 33 seconds.

Update: JXInsight 5.7.42 has been released with continuous hotspot metering analysis following an initial hotspot classification allowing for reclassification and subsequent disablement of associated instrumentation in the event of changes in resource usage.

Near to the end of the testing we were still metering 10 millions calls and incurring only a 5 second slow down compared with a competitors best time of 950 seconds. Here is the breakdown of the resource metering by the main top packages for 12th test run prior to the generation of the hotspot instrumentation extension.

Note: The clock.adj.time meter was used instead of the clock.time meter to discount gc activity in the hotspot metering analysis.


Looking at the inherent clock time averages for hotspot candidates (a partial list is show below) we can see that there was still room for further reduction (in overhead) if we increased the hotspot metering warmup count or instead this was continuously re-assessed during the course of an individual test execution (coming soon).


In a follow-up to this entry I am going to compare the hotspot results from a number of tools showing how surprisingly different they can be beyond the low hanging fruit cases and how widely inaccurate they are no matter what profiling technique is used – execution/event based or call stack sampling. Free Advice != Good Advice.

Whilst I have long argued that execution based profilers (metering agents & activity monitors) can out-perform sample based profilers on many levels (when implemented right) I have always recognized that statistical sampling techniques do occasionally have their place in ones arsenal in tackling the complexity of a real-world application/system performance model collected from a production environment under particular runtime constraints. That is why when we introduced our pioneering strategy based resource metering approach I made sure to include a number of out-of-the-box strategies that have sampling characteristics. This list initially included frequency, interval, sample, and burst. All of which have the following metering behavior – a non-metered gap and a metered run.

hybridsampling.1
The two main differences in the metering strategies relate to the unit of measurement for both gap and run (time or frequency), and the scope of the strategy (global or probe name based).

Note: JXInsight’s Probes technology supports complex chaining of metering strategies allowing you to derive your very own metering strategy from base (built-in) strategies.

Each of the above mentioned metering strategies have their strengths and weaknesses depending on the application (or problem) context but none are truly comparable to how a standard call stack sampling profiler works so in a recent update to our product I approached this problem from a different angle within our runtime model by way of a new meter named sample.count which is basically a low-resolution clock (counter) with a user configurable time interval (sample) for increments (ticks). In doing so I addressed one of the key performance indicators (KPI) of a typical execution based profiler – the number of relatively costly (mostly native) counter access calls (which would otherwise occur in close time proximity in a heavily load system).

So how does this match much closer with the behavior of a call stack sampler? Well first of all the number of samples (meter increments) and the cost of the underlying native system calls is approximately the same (it is also a known quantity). Second of all only fired probes (metered activities) which have a metered execution period spanning at least one sample window increment (tick) will have their metering totals updated which means that the accuracy for high frequency and low cost methods is similar. Were it differs from sampling is in the increased accuracy of the performance model by including frequency (invocation) counts as well as collecting the min, avg and max for sample.count meter readings. In addition we can still utilize our default hotspot metering strategy by simply changing the strategy’s meter and the threshold value used.

Note: Thread call stack samplings in ability to track the complete duration (or ticks) of a method’s invocation restricts its application solely to development environments.

Naturally I was curious to see how a sample.count metering model compared with one based on the clock.time meter. So I metered the startup execution of a JBoss 4.x application server instance with a sample JBoss Seam application deployed. For this initial run I disabled our default hotspot metering strategy. The startup took less than 31 seconds with 776,468 method invocations metered within the org.jboss.* package which. This translates to 1.5 million clock time access calls – 50k / sec.

Note: “Count” is the number of method invocations metered and “Total” is the sum of the ticks during the execution period of the metered method invocation.

hybridsampling.2

Here is the sample.count metering model for a similar startup sequence. The breakdown of the metering “Total %” column looks very similar to that of the clock.time model. The startup time took less than 30 seconds which translates (approximately) to 30,000 clock ticks with a default tick interval of 1ms – 1k / sec.

hybridsampling.3

Of course one would expect the previous models to be very similar. A real comparison test would be with the hotspot metering strategy enabled which attempts to eliminate the metering of low cost instrumented method invocations (reducing the cumulative count). Here is the clock.time metering model for the startup. The number of metered probes fired has now halved though this reduction would be much higher in a long running system as we by default meter the first 1,000 firings per named probe.

hybridsampling.4

With the hotspot metering strategy configured to use the sample.count meter the breakdown of both the “Count” and “Total” columns looks again similar to the clock.time metering model.

hybridsampling.5

Though I still prefer to use our sample based metering strategies (more so because of familiarity) which guarantee accurate metering of those probes sampled the sample.count meter approach allows me to create a filtering like mechanism to the model at the granularity that suits the context of the investigation. With the meter I get the frequency for every probe fired which the strategy approach does not afford me but I only aggregate the metering total for those probes fired and metered that exceed a meter threshold I have defined.

It is notoriously hard to detect and distinguish busy and hung threads in todays complex enterprise middleware even more so when the middleware is distributed across multiple processing nodes with temporal execution dependency (locking, co-ordination). A large part of the reason for this is the lack of current execution information associated with frames within thread stack dumps. For one we cannot easily determine from two thread dumps listing the same call stack for a thread whether the thread in question is hung waiting on a response from a resource  or whether the thread is performing the same operation again following receipt of another request into a processing queue. We might not even be able to detect the case were the thread is merely busy making calls from the stack listed in the dump to other fast executing methods not listed.

To help with the problem analysis we have recently enhanced our optional stack probes provider to collect additional information such as the number of times a uniquely identifiable stack appears across two mark requests from within our management console and the number of calls (or frames) made during the execution interval of a caller frame.

Here is a Java class I quickly wrote to demonstrate the problem and the solution. I have created two Runnable classes that mimic the behavior of hung and busy threads.

stacks.code.1

Here is a sample of a partial stack dump during the execution of the above class. As you can see it is pretty hard to actually determine the type of problem that may be present ignoring the obvious clue in the class naming.

stacks.dump.1

Below is a resource metering snapshot taken during a similar execution run with our dynamic probes instrumentation enabled. What we can quickly see from this view is that the run() call executed by Thread-0 has only made one additional direct (or indirect) call during its ongoing execution interval. Whereas the run() call executed by Thread-1 has been busy performing more than 1 million calls (stack push operations).

stacks.probes.1

Now we cannot be sure that Thread-0 is effectively hung because it might have only just began executing prior to the snapshot capture. Likewise we cannot be sure of our initial prognosis for Thread-1 as it might just be finishing up or it has been active for a prolonged period in the runtime waking up occasionally to service queued requests.

So lets make a mark (reset) from within our management console that instructs the remote managed runtime to mark all current resource metering related data including associated stacks. Then we wait a little while (a few seconds) and then capture another snapshot shown below. From the Marks table column we can see that some parts of the stacks present in this new snapshot capture were present prior to our recent mark remote operation – marks > 0. The only new stack frame is the Busy.doit() call which has a thread execution sequence id of 3031907.

In addition we can see from the Count Delta column that only the run() call executed by Thread-1 has performed additional (instrumented) calls during its still unfinished execution.

stacks.probes.2

To be sure that the stacks visible from within the management console have indeed an execution period of longer than a few seconds we can make another mark. Now any element of a stack listed in a snapshot following the last mark operation with a Marks value greater than 1 has at least an execution duration of the time difference between both mark operations.

For those elements with a Marks value of 1 we can only say with confidence that the stack (or partial stack) has existed since the period of the last mark to the point at which a snapshot is captured which can be done repeatedly without marking.

stacks.probes.3
Now if that is not exciting enough we can easily perform the same steps across a complete cluster of nodes from a single view within our management console and this works for all probe related extensions whether the probes represent actual Java methods or contextually mapped Ruby and Python calls.

One of the many cool and innovative features we have recently pushed out the door with our regular software updates is an enhancement to tagging that really makes this unique feature extremely useful and versatile in the analysis of software under construction during development & test or under management in production. Now when you set a tag at either the global or thread context level not only do we aggregate resource metering for the tagged metering group but we do this for every other metering group updated during the tagged execution period. We do this by creating an extended @tag metering group that concatenates the name of the metered group updated with the value of tag.

tagging.startup

Being able to inject tags remotely from within our management console or command line interface or locally from within the actual managed runtime via our Open API gives us the ability answer two very broad questions:

1. What is the resource usage for a package/class/method during a particular tagged execution period?

2. How does the resource usage for a particular probe compare across tagged and non-tagged execution periods in terms of count, total, min, max and average?

Previously the way to answer such questions at a global level across various profiling and application performance management (APM) tools would be to reset all data collected prior to the execution period, specify a period filter on an event journal table or log, or in our case mark the current totals and then perform delta analysis following the completion of the execution metered. But this leads to the creation of multiple disconnected performance models that cannot be easily compared and reconciled. In addition it is not possible to narrow the analysis to one or more threads or have concurrent tagged execution periods with each thread having its own tag value which is required when the tag values reflect the some aspect of the execution context such as the web application user name, request parameter or session attribute.

tagging.steps

Here are some concrete kind use cases where this capability might be much more appropriate during analysis across multiple application life cycle management phases.

1. During the testing of a web application interaction consisting of a series of steps and pages a tag based on the step or page name is set globally allowing the developer to break down the resource metering costs for a particular package/class/method by the tagged step (or page).

2. During unit testing a tag is set at the thread context level based on the name of each @Test method executed providing code coverage like analysis for direct and indirect executed code by test method. This data can also be used in generating dependency analysis reports for such tests.

3. During a system failure (i.e. database bounce) a global tag is injected identifying the failure period allowing service level management to later break-out such costs on a per activity/interaction basis.

4. In diagnosing a user specific problem a thread context level tag is set identifying the user or the particular interaction at the entry point of request processing enabling resource metering to be reported both from an operational perspective (code) and contextual (user marker) perspective side by side.

5. In delivering elastic computing capabilities to an application a tag is injected into a cluster of runtimes at each dynamic change (a capacity profile) in the computing capacity made available to the application in the cloud which can then later be analyzed offline to determine reasons for the increase in capacity not being fully utilized due to possible inherent scalability problems with the applications own code base. Likewise one can easily determine which particular activities are impacted more than others by decreases in computing capacity due to elasticity policies.

The problem with (published) performance benchmarks is that they are there to be beaten (luckily only ever by ourselves) and today well we have shattered all our previous benchmark results, including one last week, which means I am now faced with the daunting task of going over all my previous related postings and redoing the testing or just simply linking to the latest and greatest numbers. I think I will go with second option for now.

JXInsight’s multi-resource metering & monitoring (& profiling) technology is truly the fastest and most scalable Java solution on the market that is designed specifically for production usage in the most demanding of environments. Getting to this level of software quality is never easy. Its been a long and hard slog but Rod and I (co-founders) have managed to consistently out class the competition in terms of open & extensible API design, efficient runtime & tooling, and in the delivery of innovative information visualizations for the vast amount of data that the product can collect today.

Last week I managed to get back working on cool and innovative (hopefully you will agree) visualization work that I had designed (lots of sketches scattered around my office) a long time back but never managed to get the push to actually realize them within the product until today when we released a first in a series of visualization updates I am actively working on that will raise the bar (sky high) once again (well that is the motivational chant used currently).

In this first update I have tried to tackle a common problem with most application performance profiling/tracing/logging/journaling tools that use ordered child nodes within trees to display sequential execution behavior. The problem being that they do not scale in the enterprise context (too many nested branches and large sequences of nodes), are wasteful of precious screen real-estate, and cause repetitive strain injury (RSI) with excessive clicking to expand & contract nodes (seriously).

The solution is to not use a tree. Instead we use an ordered table that overlays the hierarchical execution nature (caller-to-callee) on top of a flattened sequential execution flow and makes it contextual (based on the current selection). And best of all we can display the execution of multiple concurrent threads side-by-side (to be exact its row by row) which is impossible with trees as the basis for any such visualization. Here is an animated gif showing this in action.

By the way the graphical tree overlay also highlights re-entrant calls on the billed probe stack. Can you see the visual pattern indicator?

treeoverlay

In a previous blog entry, Instrument->Measure->Refine->Repeat , I discussed how to approach the performance analysis of software with request processing execution times in the low microseconds via an iterative hotspot detection process that seeks to refine the instrumentation and measurement with each cycle, reducing overhead and minimizing perturbation in the data collected. In this follow-up I would like to present an alternative and complimentary approach that allows us to largely discount any instrumentation and measurement overhead by focusing on the rate of change in key performance indicators (KPI) that are not time based.

But first allow me to ask the question “Why do we profile the execution of software?“. For many developers the answer is simple – “to tune”. But this is not entirely the only valid reason and in fact this answer is more of a by-product of other activities and tasks within software performance engineering (SPE).

For large complex systems we observe and measure the execution of software in order to understand, model, quantify, and manage its behavior (resources usage & utilization, performance, scalability, runtime system dependencies) across multiple software releases and under different operating & executing contexts. Naturally we need a way to effectively manage all this information. This is achieved by metering (tracking) a relatively small number of indicators that measure operational activities that have a significant impact on performance, correlating the rate of change in such indicators with higher-level activities and various operating parameters.

Note: In fact the ability to detect and assess underlying changes in both the system & software execution models across deployments is of paramount importance in delivering service level management and reducing the time window for problem resolution in production environments.

The choice of a good key performance indicator depends largely on the type of software being monitored and managed. For standard business applications this can be the number and timing for requests serviced by various resource managers (file system, database, messaging middleware, data grid, compute grid). For low latency request processing systems there are a number of useful candidates for key performance indicators that are not entirely time dependent or impacted by instrumentation including: method execution counts, field read/write counts, object allocation counts & sizes (and to some lesser extent cpu time).

Lets apply this approach to the performance analysis of Oracle Coherence used in my previous blog entry which you should have already read at this stage. Before looking at any execution profile data you first need to know what was the activity being tested and how was it parameterized. Here it is:

>cache cache
>#10 bulkput
>#10 bulkput

Executing the above command sequence in the Oracle Coherence interactive shell results in a total of 20,000 SafeNamedCache.put() calls. We can see this below in the metering table with the Count column representing the number of times this particular method was called.

Lets keep this number, 20,000, in mind when looking at all numbers presented below.

execprofiling1

The Total column represents the number of methods executed within the com.tangosol.* package during the execution of the SafeNamedCache.put() 20,000 calls by the same calling thread. The Total(I) tells us the number of methods called that were directly (inherently) attributed to the execution of the  SafeNamedCache.put()

So 20,000 SafeNamedCache.put() calls resulted in 13.7 million method calls, 40,000 of which were made directly from within the method itself. But what about the other 38 million (51 less 13) call? Why only 13.7 million? The reason is that these other 38 million calls were made outside of the execution of the cache operation and more importantly by other threads. Expansion of the metering model tree reveals one of the main contributing classes, PacketPublisher, whose method executions resulted in a total of more than 22 million (1.3 million x 17) method calls, 4 million of which were made directly from within the class.

Note: For extremely low latency systems the overhead of excessive method calls can be relatively high even if in-lined to some degree because there still needs to be some housekeeping performed by the runtime (call stack mgmt). When looking for possible candidates to tune start with those probes with high averages.

execprofiling21

In analyzing the performance of our own software product we use two low level meters – field.read.count and field.write.count. Below is such an analysis with 37.6 million reads performed on fields residing in the com.tangosol.* package during the execution of 52.1 million method calls also within the com.tangosol.* package.

Note: We ourselves monitor call paths within our probes runtime at the nanosecond level by averaging billions of test executions and by managing the number of field reads and writes.

execprofiling3

Comparing the Count column with the Total(I) columns shows that a vast number of calls within the PacketPublisher class are simple field getters – an average of 1 field read per method call. It also reveals that during the execution of methods within this class 18.5 millions fields are read with 1 million within the class’s own methods and the rest in methods executed elsewhere in the com.tangosol.* package.

execprofiling4

Another useful meter in analyzing software with a large amount of thread co-ordination is waiting.count which tracks the number of Object.wait() calls made by a thread.

Note: The waiting.count meter can be used in “profiling nothing” as a high number of waits calls is generally indicative of worker threads starved of work items to process.

execprofiling51

For low latency systems you really need to look at the rate of object allocation and the resulting byte sizes. This is not only important for performance management but for capacity management as well as most OutOfMemoryErrors are the result of high concurrent request processing placing too much demand on the available memory pool and not some permanent memory leakage (more like a temporal & transient waste).

Note: Execution count profiling allows one to identify and quantify the impact of particular request execution parameters have on key performance indicators – correlating rate of change across multiple variables.

execprofiling6

Focusing on the average object allocation count per method execution can sometimes reveal opportunities for tuning which by the way does not necessarily imply that objects should be cached – always try to eliminate work completely before improving the efficiency of the work via reuse.

From the metering table below we can see how many objects are allocated directly from within a method and how many are typically allocated during its execution including from within methods called directly and indirectly.

Note: Object allocation metering can also reveal the cost of initialization – methods only ever executed once.

execprofiling7

Key performance indicators can be used to drive change (tune) but they can also be used to effectively manage and track change not necessarily tied to the achievement of an aligned target. They can also be used to identify unplanned changes that would otherwise result in customers notifying you of visible side effects at a much higher granularity and largely disconnected from the problem domain.

In a follow-up to this entry I will show how easy it is to create your own custom key performance indicators with our aspect library extensions.

Kirk Pepperdine recently queried whether JXInsight could indeed profile nothing. Naturally JXInsight can profile the “nothing” in all its many forms (resource contention, coupling & co-ordination) but to keep the first part of this follow-up short & sweet let me show you how to detect latency in browser based operations from a software execution model perspective. In this scenario a single button click on a web page to cancel a hotel booking that results in the user returning to a page displaying the list of current bookings remaining.

And here it is – our timeline analysis of the two resulting HTTP requests processed by the same worker thread (this need not always be the case) within an application server.

profilingnothing11

The light blue rectangles represent the time the thread was processing within the servlet layer. The yellow and brown the time the thread was processing within the JNDI and EJB layers respectively. The bracket above the brown rectangle in the first HTTP request represents the actual active resource (JDBC) transaction window.

Finally the whitespace (the “nothing”) in the middle of the timeline graphic is the time spent outside of the container sending/receiving/processing the request at the client – 15-20% of the total user experience time.

Next up profiling the “nothing” from a system execution model perspective.

I recently came across a customer that was using our probes Open API in a novel way to simulate a real-world business process and composed activities for the purpose of cost analysis. This is quite unusual as the technology is mainly used in metering and costing services and activities that are purely software based. But here the customer has created a software execution engine that models the real-world and uses our software instrumentation & measurement technologies to meter the simulated model in terms of the real-world cost drivers.

Note: Our probes technology is unlike any other performance management solution in that it is not tied in anyway to method execution constructs. Composite named probes can represent any identifiable execution activity and resources can be abstract representing such things as chargeable cloud service interactions. This allows us to meter across languages within a single execution context.

This usage of our technology made me revisit a recent discussion I had with a cloud computing vendor regarding how to apply an activity based costing approach to hosted computing appliances in the cloud. The vendor in question wanted to initially offer very high level metering information to customers before rolling out a complete activity based costing solution that would allow customers to expose activities within their software to our extensible resource metering engine delivering an unprecedented cost and performance management offering. What if I created a software model of the cloud platform itself – a model that would act as a proxy or facade on to the underlying cloud and that would receive events or control actions directly.

mvmmodel

The following table shows the mapping from the cloud model to our Open API interfaces and then to the corresponding activity based costing concepts.

Note: This mapping is largely only applicable to the context described here as a resource in activity based costing can be mapped to a Meter, Counter, or Measure within our Open API.

mvmtable

Lets create a mock implementation to demonstrate this approach to using activity based costing to meter at the VM level by starting with a rather simplified VM interface representing a virtual machine within a cloud.

mvmcode1

Here is a Mock implementation class of the VM interface. I have added a units field representing the number of processors assigned to the virtual machine. Naturally most of these calls would be delegated to an agent running on the virtual machine itself or possibly some central information system.

mvmcode2

Below is the Worker thread class that will run within the runtime for the lifetime of the VM being modeled. In the run() method the worker registers two Resources, vm.cpu.time and vm.up.time with the current thread resource metering context before starting the metered activity.

Note: JXInsight comes with 30+ built in resource meters and offers the ability to register custom resources via the (pushed based) Counter interface or (pulled based) Measure interface.

mvmcode3

Finally here is the Cloud class I used to interact with the model from the command line. The class supports 3 commands: start (vm), stop (vm) and tag. The tag command allows the customer to set an alternative metering cost structure for probes (activities) being metered.

mvmcode41

Before running up the mock simulation I created a jxinsight.override.config file to setup the meter-to-resource mapping as well as some unit costing. Here is the contents of the file.

jxinsight.server.probes.tag.enabled=true
jxinsight.server.probes.meter.clock.time.include=false
jxinsight.server.probes.meter.resources=vm.cpu.time,vm.up.time
jxinsight.server.probes.meter.unit.cost.include=true
jxinsight.server.probes.meter.resource.vm.cpu.time.unit.cost=2

And here is a small transcript of an interaction with the cloud.

start com.acme.sales.crm.vm1 4
# tv advert airing
tag tvadvert
# vm1 is loaded need more capacity
start com.acme.sales.crm.vm2 4
# resuming normal service
stop com.acme.sales.crm.vm2
tag
stop com.acme.sales.crm.vm1

Connecting our management console to the Java runtime which is acting as a proxy to the cloud infrastructure I captured the following resource metering & costing model.

Note: The metering (cost) group structure shown below is derived from the composite name used in creating a probe (activity) which itself a parsed representation of the virtual machine identifier.

mvmmetering1

In a follow-up to this blog entry I will show how easy it is to combine this cloud model based approach with an actual standard software activity based costing approach provided to customers and used by cloud service solutions (storage, queues).

One of the hardest challenges for any application performance management tool is in reducing the runtime overhead to levels acceptable for production environments whilst collecting as much measurement data as possible for the purposes of service level management and problem management. For the typical code profiling solution this is an impossible task without significant filtering which incidentally most of the time results in the actual bottleneck being hidden as filtering is usually based on namespaces rather the actual performance profile of the methods themselves.

Even with significant filtering most profiling tools cannot perform at levels required for applications that have incredibly low request processing times whilst executing a significant number of method calls. If a request executes in 250 microseconds (which pretty much rules out call stack sampling) and during its execution it makes 2,500 method calls it is pretty easy to see that achieving an acceptable level of overhead, 2% (5 microseconds), is an insurmountable task especially when a system call to the clock counter costs between 0.2 – 0.3 microseconds and you need to do this twice for each of the 2,500 method calls.

With a dynamic measurement (metering) solution such as the product I work on you can reduce the overhead of an instrumented method call, that is not deemed a hotspot (dynamically at runtime), to just under 0.03 microseconds but multiplying that by 2,500 we still come out above the acceptable levels for this extreme type of request processing.

Can such applications then be profiled? Yes. There are two approaches one which I will discuss in this blog entry and the other in a future entry.

Instrument->Measure->Refine->Repeat
Clearly a solution to the problem described above must involve reducing the cost of both the instrumentation and measurement. But how can this be done in such a way that we do not eliminate expensive calls from the performance model whilst ensuring we do not inadvertently include calls perturbed (directly and indirectly) by the remaining instrumentation. The solution is to iteratively refine the instrumentation based on measurements that identify execution hotspots. With each iteration the hotspot list is reassessed in light of the reduced level of instrumentation.

imrr

For this process to be much more effective than the guesswork performed by way of manual namespace filtering the underlying runtime and tooling must at minimum

  1. Perform the measurement with as low overhead as possible (which is why our product out-performs the competition by a huge margin)
  2. Allow the hotspot determination to be based on the dynamic analysis of resource usage across one or more meters (via composite cost meters)
  3. Automate the accurate generation of the refined instrumentation set (it is a one-click operation in our product).

To demonstrate this approach I am going to use a real-world software example – Oracle Coherence. Shipped with the Coherence product is a command line interface that allows one to interactively execute Map.put() and Map.get() operations on a distributed cache. The command line facility also allows you to perform such operations in bulk and to repeat them. Here is the command sequence I will execute across each test presented below.

> cache cache
> #100 bulkput

The above commands will create a distributed cache named “cache” and repeat 100 times a bulkput command which by default executes 1,000 Map.put(Integer,byte[]) commands. In total 100,000 Map.put calls are executed.

Below is a metering model collected following the above command sequence which does include a small number of call counts related to the start-up and excludes constructor calls. Over 2,300 method calls are made per Map.put call. Running the test without any instrumentation the average execution time is 0.265 microseconds per Map.put call. This is our baseline.

Note: This particular use case is a best (if unrealistic) case for Coherence as the command uses a byte[] (easy IO mapping work) as the put value, whereas most applications would have objects that need to be serialized which is very expensive relatively speaking that is.

table1

The above metering was collected without any of our production-oriented features enabled such as our dynamic hotspot metering strategy. By re-enabling such features whilst keeping the same level of instrumentation we get the metering model shown below with a significant reduction (1000x) in the number of firing probes that were actually metered (calls measured). Unfortunately we still take a performance hit in dynamically assessing whether a called method (firing probe) is a hotspot or not (metered or not). The average execution time for each Map.put is now 0.640 microseconds.

Note: The definition of a hotspot is highly configurable within our runtime. For the tests reported in this entry I defined it to be any method with an average inherent wall clock time of 50 or more microseconds after 1000 executions globally.

table2

Fortunately from within the management console we can automatically generate a new probe extension (jar) based on those probes labeled hotspots within the metering model exported from the managed runtime. One click operation!!! Instead of instrumenting over 4,500 methods we now have a small list of methods to be instrumented spread across a total of 5 classes.

xml1

Below is the metering model following the deployment of the new probe extension containing the refined instrumentation. We have now halved the metering count but more importantly the instrumentation overhead associated with methods not deemed hotspots has been removed. The average execution time has dropped back down to 0.285 microseconds.

table3

Generating a new revised probe extension based on the hotspots in this new metering model reveals a further reduction in the methods and classes deemed hotspots.

xml2

Here is the final metering model captured based on the refined hotspots – a ever so small reduction. The average execution time has now dropped to 0.275 microseconds which equates to 0.010 microseconds of overhead which is largely a result of the underlying system clock calls.

table4

The following bar chart compares the changes across each iteration with the baseline reported above and that of another Java profiler.

chart

Note: Hopefully this explains why today we ship with over 600 technology specific extension libraries that specifically target the main client call entry points (mainly interfaces) into each technology.

In a follow-up to this entry I will revisit the tests again but this time using another technique ideal for CPU analysis in situations were the resolution of the system counter is insufficient – execution count profiling.

Following my last post in which I presented an alternative (more like complimentary) Java stack trace print I have received a few inquiries offline questioning my view that call stacks are not necessarily that important for runtime diagnostics.

To be honest I am somewhat surprised by this incorrect perception because I have always placed a significant value on the efficient collection of thread call stacks especially when only a small portion of the code base is instrumented. Having a call stack at particular instrumented points in a software’s execution is important which is why in 2002 I made it a core element within our own software & system execution models for both database transaction path analysis and component distributed tracing. This was prior to the introduction of StackTraceElement (@since 1.4) when amazingly some tools were actually using printStackTrace to obtain the calling context.

Unfortunately the benefit of the StackTraceElement class was greatly diminished by its initial class design (and possibly its name) which was probably heavily influenced by the usage of stack traces at the time and possibly the need to transmit them over the wire (RMI?).

Can you see the problem?

stacktraceelementcode1

The important context fields are Strings. In a typical container environment with a proliferation of ClassLoaders it is practically impossible to accurately obtain access to the declaring java.lang.Class instance. Even if this was possible within a “hello world” like application we still cannot access the java.lang.reflect.Method as there is no parameter list only a line number into a file (allowing for overloaded method names).

When I designed our native profiling agent (initially JVMPI now JVMTI) I set out to ensure that for every frame on the stack we could obtain the Method reference and that the cost in obtaining the call stack from within the runtime itself would be significantly cheaper than any other means available especially for call stacks that appear repeatedly during execution

What did the agent solution buy us? Well it allowed us to offer the ability to apply filtering to a collected call stack based on class meta data (what it implements, what it extends) and not just on the package/class/method name as offered by other tools. It also allowed use to add meta data to our software execution model which could be used to more effectively visualize call stacks. Here are some screenshots taken from within the product’s management console to illustrate this last point.

Icons: Using the meta data each Class within the model is given one or more classifications each mapped to an icon.

Note: Classifications have a ranking which determines which one is used for rendering when there are multiple classifications assigned.

annotatedstack1

Signatures: The meta data allows us to display a tooltip for a Method listing its modifiers, parameter types and return type.

Note: You might be surprised to see the abstract keyword. This is because for Proxy classes we substitute the actual interface Class and Method. The * denotes this.

annotatedstack2
Annotations: The meta data allows us to present to the user the Annotations assigned to a Method.

annotatedstack3

Class & Classloader Hierarchies & CodeSources:  The meta data allows us to display the Class & ClassLoader hierarchies for a Method’s declaring Class including the CodeSource of the Class.

Note: The ejb:session class classification is based on the presence of the @Stateful annotation.

annotatedstack4

Compressed Call Stack Rendering: With the meta data I can quickly process (visually that is) the following 26 call stack collections immediately identifying the sequence of calls between major classified components/classes.

[Servlet]->[JSF View]->[SessionBean]->[SeamBean]->[JPA Class]

annotatedstack5

Note: Collecting call stacks is important in creating a software & system execution model but one cannot ignore there is a relatively high cost to the collection even if you are using our agent which significantly out-performs all other available means and tools today. Call stacks should only be collected at coarse grain component boundaries i.e. database calls, out-bound remote calls, messaging endpoints, service/activity entry points, etc.

When performing JVM runtime diagnostics the standard (and need I say pretty crude) approach adopted by most working for companies with very little in the way of mature IT management processes and tooling is the thread (or exception) stack trace dump.

There are numerous problems with this approach including the lack of object & argument state information (already solved quite comprehensively) as well as timing & resource usage which I hope to address here.

Lets start with the sample code used to generate the stack trace print outs presented in this entry.

A simple deep recursive call chain execution ending with a wait followed by a stack trace print.

probestackcode1

Here is a small extract of a single stack trace printed by the above code.

probestackout

The problem with the output is that we have no inclination to how long each frame has been on the threads execution call stack and the resources it has consumed since added.

We cannot even determine whether two identical stack traces, printed by the same thread, represent the exact same execution call (frame) stack. Is this a new request at the same execution point or has the code stalled (waited) since the last print?

In addition the stack trace print contains frames that represent noise in our runtime analysis if we were focused on tracking down performance bottlenecks. This can also be the case when analyzing applications deployed to standard middleware containers.

Now lets rewrite the waiting method to use our recently announced stack probes provider extension.

probestackcode21

Here is an extract of the stack trace printed.

probestackxml1

A few observations. It is in XML format. Each fired and metered Probe (frame) has a unique id within the context of each thread. The wall clock time (in microseconds) is listed as a Meter for each frame on the stack.

Note: With the id we can compare across stack trace prints and performance call stack sampling with a much higher degree of accuracy (no second guessing whether the method indeed remained active during interval).

Did you notice that the intermediary stacks.Dumping.call method is absent in the above output? The reason is the call(int) method had been executed a sufficiently number of times (default=1000) prior to this particular stack print for the resource metering runtime to have already determined that it is not a hotspot (in terms of inherent cost) and does not need to be metered any further. It was fired but not metered.

Note: The runtime allows for different meters to be used in determining whether a particular named probe (method) is a hotspot or not.

Here is another stack trace printed later in the same run.

probestackxml2

Now all the stacks.Dumping.$* methods have completely disappeared. We are left with probes fired and metered for methods marked as hotspots or still under analysis by the probes runtime metering strategies.

You have probably already guessed that if you enable additional resource meters within the metering runtime these will also be included in the probe stack prints.

probestackxml3

Conventional call stack dumps quickly fill up a log file with a large amount of irrelevant and duplicated data but they are also extremely costly both in terms of wall clock time and CPU usage.

probestackchart

I am not advocating the complete elimination of thread stack trace dumps from logs though I am sure many of us would wish it to be so especially when staring at the same exception printed 20 times or more as it passes back up through the request processing pipeline.

Having the complete call frame stack can be helpful in a few cases but for the most part it is woefully inefficient (both at runtime and off-line analysis), wasteful, and severely lacking in important runtime diagnostic information – object state and resource usage.

There are solutions available today that make problem determination pretty efficient even for the most complex of systems. You just need to step outside of the runtime and to stop staring into the Sun (it can blind you).

Note: With the solution above probe stack prints can also cross multiple language boundaries – Java-to-Ruby.

ABC for Cloud Computing

January 27, 2009

How Fast and at What Cost? 
Whilst there are many benefits to cloud computing in terms of dynamic provisioning and on-demand scaling it presents challenges to IT management organizations in the performance monitoring of applications and streamlining processing costs as result of cloud computing service charges. Fortunately there is a common solution to tackling both challenges that offers to make the cost of both the operation and quality of an IT service delivered transparent, measurable and manageable. A solution that allows one to explicitly model the relation between the performance of an IT service and its operating costs when delivered at particular levels of quality (performance) to consumers – activity-based costing (ABC)

Activity-based costing consists of a service costing method and a resource consumption model that facilitates better decision-making by interpreting relationships between services delivered and their operational costs.

ABC In Practice
Below I have created a mock Twitter application to demonstrate some of the key concepts and design choices in applying activity based costing (ABC) to applications deployed to the cloud. The mock application I have created runs on a mock Google AppEngine platform and stores user tweets in a mock Amazon S3 service.

Advice: I strongly recommend that you read our Metering the Cloud article as well as some basic information on the mocked application and technologies before proceeding if you have not already done so. 

Activity & Resource Mappings
An activity is represented as a named Probe with composite names representing hierarchical cost centers (Groups).

A resource is represented by either a Counter or Meter. The main difference between a Counter and a Meter is that changes to a Meter (which can in fact be mapped to a counter) are tracked at the activity level via a Metering associated with a Group (full or partial name).

Note: The mock application has been simplified significantly leaving out nested activities (probes), tracking of costs by context paths, and optimal metering strategies (cost versus overhead).

Within each of the mock cloud technology components I have inserted instrumentation for the purpose of tracking and costing of software activities (Probes) and resource usage (Meters & Counters). These lines are bookmarked in the left margin.

Note: The instrumentation inserted would normally be weaved into the code base transparently at load-time using our aspect libraries. 

Mock Twitter WepApp
The Twitter WebApp component mimics the handling of web requests dispatched by the Google Engine. The request handling code is the main entry point from which requests can be dispatched to two service points – status and timeline.

The status service point will update the latest status text associated for the specified user by writing the binary version of the text to an Amazon S3 Bucket keyed on the user’s name.

The timeline service point will list previous status updates for the specified user by reading all objects stored in an Amazon S3 Bucket keyed on the user’s name.

Tracking relevant activities (software services) is the foundation of a successful ABC model so this is the only component in the application that is instrumented with named Probes. A single Probe is fired and metered in the request handling code creating a dynamic composite named Probe with the following hierarchical cost center (metered groups):

twitter.users.${user}.${service}

In addition a Counter, representing a service charge, is incremented for each request delivered to the status service point. Charges are assigned to status updates by the user and not the viewing of such updates by the user and others.

ccabctwitter

Mock Google Engine
The Google Engine component mimics a web request processing dispatcher. It reads a single line representing a URL from a text console and forwards both a Request and Response object to the registered Handler

From the perspective of the Twitter application the software execution of the Engine is modeled as a resource (with multiple cost driver) and not as an activity with Counters accumulating both the counts of requests dispatched and associated CPU usage.

ccabcengine1

Mock Google Engine Request
The Google Request component mimics the reading and parsing of a web request. A single Counter is maintained (across threads) by the Request component tracking the number of bytes read which is used by Google in its cloud computing billing rate plan.

Note: Counters are thread specific so their accumulation can easily be mapped to a metered resource (meter) which is in turn mapped to a metered activity (probe) executed by a request handling thread. This is a very important distinction with legacy system monitoring solutions working with metrics at the process level and oblivious to the causality of changes. 

ccabcrequest

Mock Google Engine Response
The Google Response component mimics the writing of a web response back to the user. It also increments the same Counter used by the Google Request component recording the number of bytes written out.

ccabcrespone

Mock Amazon S3 Bucket
The Amazon Bucket component mimics the storage capabilities of the S3 service. Tweets that have been transformed into binary form are written and read from a Bucket maintained on per user basis (ignoring the limits imposed by Amazon).

Unlike the Google Request and Response components the Amazon Bucket maintains two distinct Counters tracking bytes transfered during read and write operations to and from the Bucket.  

ccabcbucket

Test Scenario
The following series of console input lines have been executed with each of the resource metering model configurations depicted below.

in: service/status/user/wlouth/text/hello world
in: service/status/user/wlouth/text/hello cloud
in: service/timeline/user/wlouth
out: hello cloud
out: hello world

To help with understanding the execution flow I have collected a probes tracking model based solely on code level instrumentation performed by our load-time bytecode weaving agent and some aspect extension libraries.

Note: In this partial ABC model every method execution is represented as an activity itself which is not necessarily what we would want in the context of the large cloud application.

ccabctracking

Resource Metering Models
Executing the test scenario (this time with no code level instrumentation) we get the following resource Metering table view that tracks the changes in Meter usage at the activity level (leaf node) and associated cost centers (root and inner nodes). This type of data is similar (to some degree) to what is found in transaction monitoring and hotspot (CPU) detection profiling tools.

Note: The Count is the number of Probes fired and metered. The Total is the cumulative change in the Meter during this period.

ccabcmetering1

Here is a table listing the values of the inserted Counters at the process level. This is type of the data presented by system management tools with no actual knowledge or tracking of the software activity treating the process largely as a black box with a basic and flat metric surface.

ccabccounting1

What we really need initially is a way of combining these two perspectives especially as our monthly cloud computing billing is closely tied to increases in one or more of these resource counters updated by the cloud platform and cloud service components.

Well it is actually not so hard (at least when using our resource metering runtime) because Counters are in fact thread (activity) specific and only aggregated at the process level when the counting probes provider extension is enabled.

Lets create some Meters mapped to to Counters and execute our test again with the following configuration.

ccabcconfig21

Below is the new Metering table view captured following the test execution. With just an external configuration change we can now see which activities impact which Counters via the Meter mapping.

Additionally we have introduced a charge back mechanism by way of the twitter.service.charge Meter (more on this later).

ccabcmetering2

Cost Metering Models
Now lets try to introduce a more explicit cost model by introducing unit cost Meters based on the Counters with some relabeling of the mapped Meters to hide the underlying technologies and billable cloud services by appending the following properties to the configuration.

ccabcconfig31

Below is the revised metering model after executing the test scenario. The model now includes a new unit cost based meter, io.cost, representing our storage costs. From the model we can see that the storage cost in delivering the status and timeline services (activities) to user wlouth is 66 (io.read.bytes*1 + io.write.bytes*2). This is higher than our service charge units (20) so we might need to revise things slightly in our next configuration.

Note: A good practice (of abstraction) in cost modeling is to create an internal unit charge scheme and associated unit currency.

ccabcmetering3

Our final configuration assigns a unit cost to our service charge Meter as well as relabeling the mapped Meter.

ccabcconfig4

And here is the resulting model. Yes! The cloud application is making profit (well at least in terms of the storage cost).

ccabcmetering4

An important benefit in our innovative resource metering approach is that we now have a single model for both cost management and service management (performance & capacity management) that can in addition provide business insight into software service sales (and in real-time alongside operational costs).

This unified model allows performance engineers to much more easily factor in cost when making trade-offs (in terms of resource usage) in order to improve the response times and throughput of one or more software services. With each possible performance optimization (CPU vs IO) the cost of the change can be evaluated. Here is a table that combines the code centric and service centric views into a single model.

ccabcmetering5

Naturally the model lends itself to being distributed (and disconnected) across multiple processing units and execution flows in the cloud whilst offering a user defined consolidated view from any business & service management workstation.

ccabcmetering6

Summary
Cost management is important for companies using cloud computing as a platform for application delivery as it provides a framework for planning and controlling decisions related to such services in terms of service value and computing cost. It also serves to help cloud computing vendors to deliver resources in a cost effective manner whilst maximizing value. Finally, knowledge of costs is required to make intelligent decisions related to cost-justified service quality.