Profiling The Nothing – Part 1
April 24, 2009
Kirk Pepperdine recently queried whether JXInsight could indeed profile nothing. Naturally JXInsight can profile the “nothing” in all its many forms (resource contention, coupling & co-ordination) but to keep the first part of this follow-up short & sweet let me show you how to detect latency in browser based operations from a software execution model perspective. In this scenario a single button click on a web page to cancel a hotel booking that results in the user returning to a page displaying the list of current bookings remaining.
And here it is – our timeline analysis of the two resulting HTTP requests processed by the same worker thread (this need not always be the case) within an application server.

The light blue rectangles represent the time the thread was processing within the servlet layer. The yellow and brown the time the thread was processing within the JNDI and EJB layers respectively. The bracket above the brown rectangle in the first HTTP request represents the actual active resource (JDBC) transaction window.
Finally the whitespace (the “nothing”) in the middle of the timeline graphic is the time spent outside of the container sending/receiving/processing the request at the client – 15-20% of the total user experience time.
Next up profiling the “nothing” from a system execution model perspective.
Cloud Modeling with Activity Based Costing
April 18, 2009
I recently came across a customer that was using our probes Open API in a novel way to simulate a real-world business process and composed activities for the purpose of cost analysis. This is quite unusual as the technology is mainly used in metering and costing services and activities that are purely software based. But here the customer has created a software execution engine that models the real-world and uses our software instrumentation & measurement technologies to meter the simulated model in terms of the real-world cost drivers.
Note: Our probes technology is unlike any other performance management solution in that it is not tied in anyway to method execution constructs. Composite named probes can represent any identifiable execution activity and resources can be abstract representing such things as chargeable cloud service interactions. This allows us to meter across languages within a single execution context.
This usage of our technology made me revisit a recent discussion I had with a cloud computing vendor regarding how to apply an activity based costing approach to hosted computing appliances in the cloud. The vendor in question wanted to initially offer very high level metering information to customers before rolling out a complete activity based costing solution that would allow customers to expose activities within their software to our extensible resource metering engine delivering an unprecedented cost and performance management offering. What if I created a software model of the cloud platform itself – a model that would act as a proxy or facade on to the underlying cloud and that would receive events or control actions directly.

The following table shows the mapping from the cloud model to our Open API interfaces and then to the corresponding activity based costing concepts.
Note: This mapping is largely only applicable to the context described here as a resource in activity based costing can be mapped to a
Meter,Counter, orMeasurewithin our Open API.

Lets create a mock implementation to demonstrate this approach to using activity based costing to meter at the VM level by starting with a rather simplified VM interface representing a virtual machine within a cloud.

Here is a Mock implementation class of the VM interface. I have added a units field representing the number of processors assigned to the virtual machine. Naturally most of these calls would be delegated to an agent running on the virtual machine itself or possibly some central information system.

Below is the Worker thread class that will run within the runtime for the lifetime of the VM being modeled. In the run() method the worker registers two Resources, vm.cpu.time and vm.up.time with the current thread resource metering context before starting the metered activity.
Note: JXInsight comes with 30+ built in resource meters and offers the ability to register custom resources via the (pushed based)
Counterinterface or (pulled based)Measureinterface.

Finally here is the Cloud class I used to interact with the model from the command line. The class supports 3 commands: start (vm), stop (vm) and tag. The tag command allows the customer to set an alternative metering cost structure for probes (activities) being metered.

Before running up the mock simulation I created a jxinsight.override.config file to setup the meter-to-resource mapping as well as some unit costing. Here is the contents of the file.
jxinsight.server.probes.tag.enabled=true
jxinsight.server.probes.meter.clock.time.include=false
jxinsight.server.probes.meter.resources=vm.cpu.time,vm.up.time
jxinsight.server.probes.meter.unit.cost.include=true
jxinsight.server.probes.meter.resource.vm.cpu.time.unit.cost=2
And here is a small transcript of an interaction with the cloud.
start com.acme.sales.crm.vm1 4
# tv advert airing
tag tvadvert
# vm1 is loaded need more capacity
start com.acme.sales.crm.vm2 4
# resuming normal service
stop com.acme.sales.crm.vm2
tag
stop com.acme.sales.crm.vm1
Connecting our management console to the Java runtime which is acting as a proxy to the cloud infrastructure I captured the following resource metering & costing model.
Note: The metering (cost) group structure shown below is derived from the composite name used in creating a probe (activity) which itself a parsed representation of the virtual machine identifier.

In a follow-up to this blog entry I will show how easy it is to combine this cloud model based approach with an actual standard software activity based costing approach provided to customers and used by cloud service solutions (storage, queues).
Instrument->Measure->Refine->Repeat
April 6, 2009
One of the hardest challenges for any application performance management tool is in reducing the runtime overhead to levels acceptable for production environments whilst collecting as much measurement data as possible for the purposes of service level management and problem management. For the typical code profiling solution this is an impossible task without significant filtering which incidentally most of the time results in the actual bottleneck being hidden as filtering is usually based on namespaces rather the actual performance profile of the methods themselves.
Even with significant filtering most profiling tools cannot perform at levels required for applications that have incredibly low request processing times whilst executing a significant number of method calls. If a request executes in 250 microseconds (which pretty much rules out call stack sampling) and during its execution it makes 2,500 method calls it is pretty easy to see that achieving an acceptable level of overhead, 2% (5 microseconds), is an insurmountable task especially when a system call to the clock counter costs between 0.2 – 0.3 microseconds and you need to do this twice for each of the 2,500 method calls.
With a dynamic measurement (metering) solution such as the product I work on you can reduce the overhead of an instrumented method call, that is not deemed a hotspot (dynamically at runtime), to just under 0.03 microseconds but multiplying that by 2,500 we still come out above the acceptable levels for this extreme type of request processing.
Can such applications then be profiled? Yes. There are two approaches one which I will discuss in this blog entry and the other in a future entry.
Instrument->Measure->Refine->Repeat
Clearly a solution to the problem described above must involve reducing the cost of both the instrumentation and measurement. But how can this be done in such a way that we do not eliminate expensive calls from the performance model whilst ensuring we do not inadvertently include calls perturbed (directly and indirectly) by the remaining instrumentation. The solution is to iteratively refine the instrumentation based on measurements that identify execution hotspots. With each iteration the hotspot list is reassessed in light of the reduced level of instrumentation.

For this process to be much more effective than the guesswork performed by way of manual namespace filtering the underlying runtime and tooling must at minimum
- Perform the measurement with as low overhead as possible (which is why our product out-performs the competition by a huge margin)
- Allow the hotspot determination to be based on the dynamic analysis of resource usage across one or more meters (via composite cost meters)
- Automate the accurate generation of the refined instrumentation set (it is a one-click operation in our product).
To demonstrate this approach I am going to use a real-world software example – Oracle Coherence. Shipped with the Coherence product is a command line interface that allows one to interactively execute Map.put() and Map.get() operations on a distributed cache. The command line facility also allows you to perform such operations in bulk and to repeat them. Here is the command sequence I will execute across each test presented below.
> cache cache
> #100 bulkput
The above commands will create a distributed cache named “cache” and repeat 100 times a bulkput command which by default executes 1,000 Map.put(Integer,byte[]) commands. In total 100,000 Map.put calls are executed.
Below is a metering model collected following the above command sequence which does include a small number of call counts related to the start-up and excludes constructor calls. Over 2,300 method calls are made per Map.put call. Running the test without any instrumentation the average execution time is 0.265 microseconds per Map.put call. This is our baseline.
Note: This particular use case is a best (if unrealistic) case for Coherence as the command uses a
byte[](easy IO mapping work) as theputvalue, whereas most applications would have objects that need to be serialized which is very expensive relatively speaking that is.

The above metering was collected without any of our production-oriented features enabled such as our dynamic hotspot metering strategy. By re-enabling such features whilst keeping the same level of instrumentation we get the metering model shown below with a significant reduction (1000x) in the number of firing probes that were actually metered (calls measured). Unfortunately we still take a performance hit in dynamically assessing whether a called method (firing probe) is a hotspot or not (metered or not). The average execution time for each Map.put is now 0.640 microseconds.
Note: The definition of a hotspot is highly configurable within our runtime. For the tests reported in this entry I defined it to be any method with an average inherent wall clock time of 50 or more microseconds after 1000 executions globally.

Fortunately from within the management console we can automatically generate a new probe extension (jar) based on those probes labeled hotspots within the metering model exported from the managed runtime. One click operation!!! Instead of instrumenting over 4,500 methods we now have a small list of methods to be instrumented spread across a total of 5 classes.

Below is the metering model following the deployment of the new probe extension containing the refined instrumentation. We have now halved the metering count but more importantly the instrumentation overhead associated with methods not deemed hotspots has been removed. The average execution time has dropped back down to 0.285 microseconds.

Generating a new revised probe extension based on the hotspots in this new metering model reveals a further reduction in the methods and classes deemed hotspots.

Here is the final metering model captured based on the refined hotspots – a ever so small reduction. The average execution time has now dropped to 0.275 microseconds which equates to 0.010 microseconds of overhead which is largely a result of the underlying system clock calls.

The following bar chart compares the changes across each iteration with the baseline reported above and that of another Java profiler.

Note: Hopefully this explains why today we ship with over 600 technology specific extension libraries that specifically target the main client call entry points (mainly interfaces) into each technology.
In a follow-up to this entry I will revisit the tests again but this time using another technique ideal for CPU analysis in situations were the resolution of the system counter is insufficient – execution count profiling.