This was my first time to attend (and speak) at the Jazoon conference and though I was only able to attend for 2 days I really did have a great time attending the conference and talking in person (and drinking) with a number of well known individuals in the Java industry including Bela “a blast” Ban (JBoss), Mike “I’m nice” Keith (Oracle), and Kirk Pepperdine (Kodewerk) who dared discuss and debate application performance management & monitoring approaches with me over lunches and walks to and from our hotel.

Kirk just about to experience my Irish resolve.

I did attend the keynote session given by Adrian Colyer (SpringSource) and I must say I was impressed with his presentation skills even though it was a tad bit too long in terms of the green theme and light on in-depth technical analysis but I suppose there is only so much you can do in such a short time especially when predicting the future which could all change with an acquisition.

Adrian was as ever economical with the facts in his assessment of which VM language will win the war when he claimed that his choice of Groovy was made prior to the acquisition of the company behind Groovy & Grails – hence it was a pure technical decision. He failed to state that the VC that provided financial support to SpringSource at the very outset had also invested in the acquired company and had been promoting the technology to Rod and Co. for some time. Whether it had any bearing on the final decision is a different matter but it should have been disclosed in his argument to the audience questioning his objectivity.

The same cannot be said for the other main session I attended in which Peter “let me put this chastity belt on your classloaders before everything drops out under” Kriens (OSGi) who did a very poor job of presenting a supposedly technical session which in fact turned out to be more like “let me preach & market my 10 year old modular framework (that keeps re-inventing new component models)”. Peter really used every cheap marketing trick in the book when it came to the information (can we even call it [data] that, more like the gospel according to Peter) he presented in his slides. Below is an actual slide from the presentation.

osgi-heaven
Heaven awaits those who embrace OSGi?

The session might have been somewhat informative if he stopped talking about his 10 years of servitude, left out the gimmicky slides, and limited his bashing of Java and Sun at every opportunity during the talk. It felt like 10 years had passed when the session was over which should not have been the case for this particular technology which will have served an important purpose when Java does indeed deliver a built-in and much improved replacement. His insistence that modularity should be managed entirely outside of the language and oblivious to the Java runtime sounded like someone completely divorced from reality (vital lies?).

When Juergen Hoeller (SpringSource) was pulled over to our table I did question him on whether Rod (SpringSource) should continue blogging or go back to coding factory classes. What is worse? His reply was “No comment” but his body language betrayed him.

During Peters talk he mentioned that we was instructed by the OSGi committee to not talk about the original target environment for the technology – the embedded market.

Not all OSGi members share Peter’s view that the default behavior for OSGi should be to break existing enterprise Java applications which he refers to as “no pain, no gain”. There appears to be two camps. The hard-liners (and groupies/followers whose only response to issues raised is to name call those who dare mention them) and those (mainly app server vendors) who just want to get things to work sometimes only using the technology to segregate container libraries from application deployments.

Peter has responded (partially) and apparently “new” is (only) good when it is in relation to “OSGi (component) programming models” but not when it is in relation to an approach to modularity that involves enhancements to the language and runtime which are outside of Peter’s control (hence his rejection). I cannot believe that Peter & Co. would have designed OSGi as it is today if they had the opportunity to extend the language, runtime and tooling to accommodate their design. I refuse to believe that they did not compromise their design (and resulting implementations) especially in light of this and the original target environment – the embedded space. At the very least they would have looked at ways of reducing they own class loader hacks. I also doubt that Sun would have allowed any proposed design break compatibility in a number of ways that OSGi has done so (and some for no good reason). Even if Sun fails to delivery a deployment centric modularity approach we still might gain from their efforts as a smaller (modular) VM could gives us the ultimate class isolation container – an operating system process (and many of them hosting specific services connected to a grid/cloud). Lets not be fooled in to believing that to achieve modularity we just put a class chastity belt around our private parts. In the enterprise space state is just as important as behavior and it is not so easy to tackle in or outside of a container which is why most systems go occasionally offline (at least internally) to sync changes across different data processing domains (messaging, databases, …). Even if you can isolate classes from each other you cannot do the same for the external side effects (of multiple & different versions). This last item seems foreign to many OSGi disciples who apparently do not understand one of the main reasons why operations stop and restart processes in applying a managed change – to be sure the process can indeed be restarted when something goes terribly amiss in production outside of the container (hardware). It is much easier rolling back at the time the change is applied.

The “chastity belt” reference reflects my view that modularity, which should be managed at many levels in the design and development process and not just in bundling, has more to do with education and support during the early application life-cycle than a cruel troublesome enforcement at the late stage of post-deployment (into the bad real world). We need modularity to be supported in different ways (not just a single realization) and at different stages with each building on top and augmenting that which precedes its. I think currently there is way to much emphasis placed on bundle separation (thinking being that the more modules the greater the modularity). We should look at packaging (bundling) of application artifacts to indicate deployment collocation for reasons other than a “modularity nirvana”. We should also encourage diversity in approaches to how modularity is enforced at runtime (if it need be or not) and not just standardize on a particular implementation. The language and tooling needs to help capture and convey intent not target implementation. Modularity is not achieved by just creating many little bundles and hiding classes which seems to be Peter’s fixation (crucifix). I have seen very modular software designs that never needed runtime policing of visibility (outside of what is provided by the language) maybe they were not so dynamic (at least within a single runtime and in terms of service implementation) but how many organizations really need such capabilities and have the ability and tooling to effectively manage and control such environments in an enterprise context. Very few and even if they had the capability within the organization most try to avoid such scenarios (multiple live bundle/library versions in a single container) because the risks (diagnostics is problematic) far outweigh the benefits (”its cool”).

Peter did not actually address my main criticism points of his presentation. The cheap tricks in how he presented the information to back up his OSGi marketing campaign like the dependency graph of the Java class library with no comparable graph of a similar runtime and SDK built over the same period (Java 1.x-1.6) and offering the same breath and depth of services using OSGi. How much different would it indeed look like? Would any noticeable differences reflect the choice of the component approach or just the talent and longevity of the people involved. Great products are created by great teams. Yes there is a need for a framework (of reference and values) along with a support system but does it need to be policed in the way proposed which in my opinion reflects the constraints on the original design space (or the authors cynical view of the rest of us)? Does it need to be marketed as “the” silver bullet for the software industry that ultimately fixes us to a future designed 10 years ago? I hope not. We need to learn from previous work along with the mistakes made and 10 years gives you plenty of time to do this no matter how hard you try not to as time and the pace of innovation is not on your side.

Mike over at Canoo has a write-up of the talk I gave today at the Jazoon 09 conference on applying activity based costing in the cloud. I understand some of his concerns but I hope he will see that the model itself has application in many areas not all cost related (at least not directly) including performance and capacity management.

I am delighted he is excited (“it’s certainly an incredibly powerful idea”) about the possibility of our OpenCore Probes API becoming a standard which we can all expect to be available in the runtime and which can be extended by cloud or runtime vendors via the registration of their own custom meters.

I would like to add that metering does not necessarily imply that every activity has to be costed. Metering is a powerful and scalable way of analyzing the resource usage of a software without suffering from the obvious scalability and performance problems of call or path trace tools. Activity analysis can be the sole purpose of the data collected and the model generated. During the demo I showed how you can even create a meter based on the frequency or resource usage of another metered activity. This is what so cool about our design and implementation – you can model the execution of a block code as an activity or meter or both.

Activity based metering (ABM) is an incredibly important source of information on behavioral patterns of users which can in fact increase development budgets in areas that have higher profit margins or are used much more frequently (and hopefully appreciated) by clients. More and more of our daily activities are mirrored in the execution of software in todays web and business applications. Threads are our delegates (agents) acting on our behalf.

It is also extremely important when you are a Web 2.0 startup with a growing user base and no clue of what business model can be applied that will eventually offset operational costs that will likely to continue to grow in the future.

Next time I intend ending the session with a clip from the Matrix showing our life streams (threads/calls) flowing down the panel with a little William (or Smith) assuming control of one or more threads temporarily.

In a previous blog entry, Instrument->Measure->Refine->Repeat , I discussed how to approach the performance analysis of software with request processing execution times in the low microseconds via an iterative hotspot detection process that seeks to refine the instrumentation and measurement with each cycle, reducing overhead and minimizing perturbation in the data collected. In this follow-up I would like to present an alternative and complimentary approach that allows us to largely discount any instrumentation and measurement overhead by focusing on the rate of change in key performance indicators (KPI) that are not time based.

But first allow me to ask the question “Why do we profile the execution of software?“. For many developers the answer is simple – “to tune”. But this is not entirely the only valid reason and in fact this answer is more of a by-product of other activities and tasks within software performance engineering (SPE).

For large complex systems we observe and measure the execution of software in order to understand, model, quantify, and manage its behavior (resources usage & utilization, performance, scalability, runtime system dependencies) across multiple software releases and under different operating & executing contexts. Naturally we need a way to effectively manage all this information. This is achieved by metering (tracking) a relatively small number of indicators that measure operational activities that have a significant impact on performance, correlating the rate of change in such indicators with higher-level activities and various operating parameters.

Note: In fact the ability to detect and assess underlying changes in both the system & software execution models across deployments is of paramount importance in delivering service level management and reducing the time window for problem resolution in production environments.

The choice of a good key performance indicator depends largely on the type of software being monitored and managed. For standard business applications this can be the number and timing for requests serviced by various resource managers (file system, database, messaging middleware, data grid, compute grid). For low latency request processing systems there are a number of useful candidates for key performance indicators that are not entirely time dependent or impacted by instrumentation including: method execution counts, field read/write counts, object allocation counts & sizes (and to some lesser extent cpu time).

Lets apply this approach to the performance analysis of Oracle Coherence used in my previous blog entry which you should have already read at this stage. Before looking at any execution profile data you first need to know what was the activity being tested and how was it parameterized. Here it is:

>cache cache
>#10 bulkput
>#10 bulkput

Executing the above command sequence in the Oracle Coherence interactive shell results in a total of 20,000 SafeNamedCache.put() calls. We can see this below in the metering table with the Count column representing the number of times this particular method was called.

Lets keep this number, 20,000, in mind when looking at all numbers presented below.

execprofiling1

The Total column represents the number of methods executed within the com.tangosol.* package during the execution of the SafeNamedCache.put() 20,000 calls by the same calling thread. The Total(I) tells us the number of methods called that were directly (inherently) attributed to the execution of the  SafeNamedCache.put()

So 20,000 SafeNamedCache.put() calls resulted in 13.7 million method calls, 40,000 of which were made directly from within the method itself. But what about the other 38 million (51 less 13) call? Why only 13.7 million? The reason is that these other 38 million calls were made outside of the execution of the cache operation and more importantly by other threads. Expansion of the metering model tree reveals one of the main contributing classes, PacketPublisher, whose method executions resulted in a total of more than 22 million (1.3 million x 17) method calls, 4 million of which were made directly from within the class.

Note: For extremely low latency systems the overhead of excessive method calls can be relatively high even if in-lined to some degree because there still needs to be some housekeeping performed by the runtime (call stack mgmt). When looking for possible candidates to tune start with those probes with high averages.

execprofiling21

In analyzing the performance of our own software product we use two low level meters – field.read.count and field.write.count. Below is such an analysis with 37.6 million reads performed on fields residing in the com.tangosol.* package during the execution of 52.1 million method calls also within the com.tangosol.* package.

Note: We ourselves monitor call paths within our probes runtime at the nanosecond level by averaging billions of test executions and by managing the number of field reads and writes.

execprofiling3

Comparing the Count column with the Total(I) columns shows that a vast number of calls within the PacketPublisher class are simple field getters – an average of 1 field read per method call. It also reveals that during the execution of methods within this class 18.5 millions fields are read with 1 million within the class’s own methods and the rest in methods executed elsewhere in the com.tangosol.* package.

execprofiling4

Another useful meter in analyzing software with a large amount of thread co-ordination is waiting.count which tracks the number of Object.wait() calls made by a thread.

Note: The waiting.count meter can be used in “profiling nothing” as a high number of waits calls is generally indicative of worker threads starved of work items to process.

execprofiling51

For low latency systems you really need to look at the rate of object allocation and the resulting byte sizes. This is not only important for performance management but for capacity management as well as most OutOfMemoryErrors are the result of high concurrent request processing placing too much demand on the available memory pool and not some permanent memory leakage (more like a temporal & transient waste).

Note: Execution count profiling allows one to identify and quantify the impact of particular request execution parameters have on key performance indicators – correlating rate of change across multiple variables.

execprofiling6

Focusing on the average object allocation count per method execution can sometimes reveal opportunities for tuning which by the way does not necessarily imply that objects should be cached – always try to eliminate work completely before improving the efficiency of the work via reuse.

From the metering table below we can see how many objects are allocated directly from within a method and how many are typically allocated during its execution including from within methods called directly and indirectly.

Note: Object allocation metering can also reveal the cost of initialization – methods only ever executed once.

execprofiling7

Key performance indicators can be used to drive change (tune) but they can also be used to effectively manage and track change not necessarily tied to the achievement of an aligned target. They can also be used to identify unplanned changes that would otherwise result in customers notifying you of visible side effects at a much higher granularity and largely disconnected from the problem domain.

In a follow-up to this entry I will show how easy it is to create your own custom key performance indicators with our aspect library extensions.

Kirk Pepperdine recently queried whether JXInsight could indeed profile nothing. Naturally JXInsight can profile the “nothing” in all its many forms (resource contention, coupling & co-ordination) but to keep the first part of this follow-up short & sweet let me show you how to detect latency in browser based operations from a software execution model perspective. In this scenario a single button click on a web page to cancel a hotel booking that results in the user returning to a page displaying the list of current bookings remaining.

And here it is – our timeline analysis of the two resulting HTTP requests processed by the same worker thread (this need not always be the case) within an application server.

profilingnothing11

The light blue rectangles represent the time the thread was processing within the servlet layer. The yellow and brown the time the thread was processing within the JNDI and EJB layers respectively. The bracket above the brown rectangle in the first HTTP request represents the actual active resource (JDBC) transaction window.

Finally the whitespace (the “nothing”) in the middle of the timeline graphic is the time spent outside of the container sending/receiving/processing the request at the client – 15-20% of the total user experience time.

Next up profiling the “nothing” from a system execution model perspective.

I recently came across a customer that was using our probes Open API in a novel way to simulate a real-world business process and composed activities for the purpose of cost analysis. This is quite unusual as the technology is mainly used in metering and costing services and activities that are purely software based. But here the customer has created a software execution engine that models the real-world and uses our software instrumentation & measurement technologies to meter the simulated model in terms of the real-world cost drivers.

Note: Our probes technology is unlike any other performance management solution in that it is not tied in anyway to method execution constructs. Composite named probes can represent any identifiable execution activity and resources can be abstract representing such things as chargeable cloud service interactions. This allows us to meter across languages within a single execution context.

This usage of our technology made me revisit a recent discussion I had with a cloud computing vendor regarding how to apply an activity based costing approach to hosted computing appliances in the cloud. The vendor in question wanted to initially offer very high level metering information to customers before rolling out a complete activity based costing solution that would allow customers to expose activities within their software to our extensible resource metering engine delivering an unprecedented cost and performance management offering. What if I created a software model of the cloud platform itself – a model that would act as a proxy or facade on to the underlying cloud and that would receive events or control actions directly.

mvmmodel

The following table shows the mapping from the cloud model to our Open API interfaces and then to the corresponding activity based costing concepts.

Note: This mapping is largely only applicable to the context described here as a resource in activity based costing can be mapped to a Meter, Counter, or Measure within our Open API.

mvmtable

Lets create a mock implementation to demonstrate this approach to using activity based costing to meter at the VM level by starting with a rather simplified VM interface representing a virtual machine within a cloud.

mvmcode1

Here is a Mock implementation class of the VM interface. I have added a units field representing the number of processors assigned to the virtual machine. Naturally most of these calls would be delegated to an agent running on the virtual machine itself or possibly some central information system.

mvmcode2

Below is the Worker thread class that will run within the runtime for the lifetime of the VM being modeled. In the run() method the worker registers two Resources, vm.cpu.time and vm.up.time with the current thread resource metering context before starting the metered activity.

Note: JXInsight comes with 30+ built in resource meters and offers the ability to register custom resources via the (pushed based) Counter interface or (pulled based) Measure interface.

mvmcode3

Finally here is the Cloud class I used to interact with the model from the command line. The class supports 3 commands: start (vm), stop (vm) and tag. The tag command allows the customer to set an alternative metering cost structure for probes (activities) being metered.

mvmcode41

Before running up the mock simulation I created a jxinsight.override.config file to setup the meter-to-resource mapping as well as some unit costing. Here is the contents of the file.

jxinsight.server.probes.tag.enabled=true
jxinsight.server.probes.meter.clock.time.include=false
jxinsight.server.probes.meter.resources=vm.cpu.time,vm.up.time
jxinsight.server.probes.meter.unit.cost.include=true
jxinsight.server.probes.meter.resource.vm.cpu.time.unit.cost=2

And here is a small transcript of an interaction with the cloud.

start com.acme.sales.crm.vm1 4
# tv advert airing
tag tvadvert
# vm1 is loaded need more capacity
start com.acme.sales.crm.vm2 4
# resuming normal service
stop com.acme.sales.crm.vm2
tag
stop com.acme.sales.crm.vm1

Connecting our management console to the Java runtime which is acting as a proxy to the cloud infrastructure I captured the following resource metering & costing model.

Note: The metering (cost) group structure shown below is derived from the composite name used in creating a probe (activity) which itself a parsed representation of the virtual machine identifier.

mvmmetering1

In a follow-up to this blog entry I will show how easy it is to combine this cloud model based approach with an actual standard software activity based costing approach provided to customers and used by cloud service solutions (storage, queues).

One of the hardest challenges for any application performance management tool is in reducing the runtime overhead to levels acceptable for production environments whilst collecting as much measurement data as possible for the purposes of service level management and problem management. For the typical code profiling solution this is an impossible task without significant filtering which incidentally most of the time results in the actual bottleneck being hidden as filtering is usually based on namespaces rather the actual performance profile of the methods themselves.

Even with significant filtering most profiling tools cannot perform at levels required for applications that have incredibly low request processing times whilst executing a significant number of method calls. If a request executes in 250 microseconds (which pretty much rules out call stack sampling) and during its execution it makes 2,500 method calls it is pretty easy to see that achieving an acceptable level of overhead, 2% (5 microseconds), is an insurmountable task especially when a system call to the clock counter costs between 0.2 – 0.3 microseconds and you need to do this twice for each of the 2,500 method calls.

With a dynamic measurement (metering) solution such as the product I work on you can reduce the overhead of an instrumented method call, that is not deemed a hotspot (dynamically at runtime), to just under 0.03 microseconds but multiplying that by 2,500 we still come out above the acceptable levels for this extreme type of request processing.

Can such applications then be profiled? Yes. There are two approaches one which I will discuss in this blog entry and the other in a future entry.

Instrument->Measure->Refine->Repeat
Clearly a solution to the problem described above must involve reducing the cost of both the instrumentation and measurement. But how can this be done in such a way that we do not eliminate expensive calls from the performance model whilst ensuring we do not inadvertently include calls perturbed (directly and indirectly) by the remaining instrumentation. The solution is to iteratively refine the instrumentation based on measurements that identify execution hotspots. With each iteration the hotspot list is reassessed in light of the reduced level of instrumentation.

imrr

For this process to be much more effective than the guesswork performed by way of manual namespace filtering the underlying runtime and tooling must at minimum

  1. Perform the measurement with as low overhead as possible (which is why our product out-performs the competition by a huge margin)
  2. Allow the hotspot determination to be based on the dynamic analysis of resource usage across one or more meters (via composite cost meters)
  3. Automate the accurate generation of the refined instrumentation set (it is a one-click operation in our product).

To demonstrate this approach I am going to use a real-world software example – Oracle Coherence. Shipped with the Coherence product is a command line interface that allows one to interactively execute Map.put() and Map.get() operations on a distributed cache. The command line facility also allows you to perform such operations in bulk and to repeat them. Here is the command sequence I will execute across each test presented below.

> cache cache
> #100 bulkput

The above commands will create a distributed cache named “cache” and repeat 100 times a bulkput command which by default executes 1,000 Map.put(Integer,byte[]) commands. In total 100,000 Map.put calls are executed.

Below is a metering model collected following the above command sequence which does include a small number of call counts related to the start-up and excludes constructor calls. Over 2,300 method calls are made per Map.put call. Running the test without any instrumentation the average execution time is 0.265 microseconds per Map.put call. This is our baseline.

Note: This particular use case is a best (if unrealistic) case for Coherence as the command uses a byte[] (easy IO mapping work) as the put value, whereas most applications would have objects that need to be serialized which is very expensive relatively speaking that is.

table1

The above metering was collected without any of our production-oriented features enabled such as our dynamic hotspot metering strategy. By re-enabling such features whilst keeping the same level of instrumentation we get the metering model shown below with a significant reduction (1000x) in the number of firing probes that were actually metered (calls measured). Unfortunately we still take a performance hit in dynamically assessing whether a called method (firing probe) is a hotspot or not (metered or not). The average execution time for each Map.put is now 0.640 microseconds.

Note: The definition of a hotspot is highly configurable within our runtime. For the tests reported in this entry I defined it to be any method with an average inherent wall clock time of 50 or more microseconds after 1000 executions globally.

table2

Fortunately from within the management console we can automatically generate a new probe extension (jar) based on those probes labeled hotspots within the metering model exported from the managed runtime. One click operation!!! Instead of instrumenting over 4,500 methods we now have a small list of methods to be instrumented spread across a total of 5 classes.

xml1

Below is the metering model following the deployment of the new probe extension containing the refined instrumentation. We have now halved the metering count but more importantly the instrumentation overhead associated with methods not deemed hotspots has been removed. The average execution time has dropped back down to 0.285 microseconds.

table3

Generating a new revised probe extension based on the hotspots in this new metering model reveals a further reduction in the methods and classes deemed hotspots.

xml2

Here is the final metering model captured based on the refined hotspots – a ever so small reduction. The average execution time has now dropped to 0.275 microseconds which equates to 0.010 microseconds of overhead which is largely a result of the underlying system clock calls.

table4

The following bar chart compares the changes across each iteration with the baseline reported above and that of another Java profiler.

chart

Note: Hopefully this explains why today we ship with over 600 technology specific extension libraries that specifically target the main client call entry points (mainly interfaces) into each technology.

In a follow-up to this entry I will revisit the tests again but this time using another technique ideal for CPU analysis in situations were the resolution of the system counter is insufficient – execution count profiling.

Here are the slides with some commentary from the CloudCamp session “Metering the Cloud” I gave last week in Cologne at the WebHostingDay (WHD) conference.

The slides look at metering (and utility billing) from the perspective of two machines. The first being a washing machine representing the physical world and the second a virtual machine representing the virtual world in the cloud.
introduction

Locations: Today billing is typically assigned based on the address of a household (account or cost center). Every household has an on-site meter for each utility (electricity, water, gas).  This is also the case in the virtual world with companies receiving a single bill at the end of the month for the usage of computing capacity and/or services in the cloud.
locations
Machines: In the near future we will see each of our household appliances being fitted with their own power meters that feed metering data into a local management system providing real-time reporting on electricity consumption.

This is also happening in the cloud computing space with itemized billing per virtual machine/appliance. In the cloud this is much easier at least when talking about CPU consumption (not necessarily power usage) because instrumentation is already built into the operating system for monitoring purposes.

This level of reporting, typical of what is provided by todays commercial and open source system management solution, is woefully insufficient for the purposes of cost management as well as performance-, capacity- and service-management as it is devoid of the actual execution context which is extremely important in managing the typical cloud application which has varying workload characteristics.
machines
Programs: Todays household appliances are multi-purpose (multi-function) and offer an array of standard operating programs. With each program having potentially different utility (electricity, water) consumption levels customers will eventually demand that metering be reported at the individual program level within each appliance.

Similarly in the cloud customers will want to see their own metered resource usage and metered cloud service interactions to be reported at the program type and possibly program instance level. This could potentially lead to a radically (but hopefully beneficial) change in how software and technologies are evaluated for inclusion in the building of cloud applications as the once hidden operational cost plays a bigger role in the decision making process.
programs
Activities: When selecting a standard washing program users have additional operational choices. The spin rate can be increased or decreased. The washing temperature lowered or raised. Each of these has the potential to alter the standard utility consumption. Ideally metering would be reported by activity within a program per application within a household.

Likewise standard software programs can operate significantly different (in terms of resource consumption) due to differences in the workload characteristics of the various activities bundled in the program. To effectively manage such programs customers will require much greater insight into the software activities executing within each program. Programs cannot appear to be black boxes with flat anemic management interfaces especially as programs (operating system processes) act as mere bootstrap containers for the real work (software activity) performed by worker threads acting on input.

Customers will want to meter the software activity across key input ports, processing points, and output sinks. If you are not metering it then you are not managing it.
activities
Resources: A number of household appliances are multi-meter based. In the case of our washing machine the primary meters are electricity and water. But the washing powder used during a washing could also be modeled as a resource if the appliances could also track the quantities added and consumed during a programs run.

A virtual machine in the cloud can have many meters. Each of the cloud services (external) and cloud software (internal) utilized by a cloud application would be represented as a meter. The power consumption would also be a meter as well as the cpu/wall clock time elapsed during the lifetime of an activity (within a program, within a virtual machine, with a cloud). Unlike in our appliance example the metering of multiple resources (some global based) in the cloud must deal with concurrency, apportioning consumption across concurrent executing activities (as accurately as possible).
resources
Cost Groups: Most household appliances operate upon a direct interaction with a member of the household. Being able to identify the member at the point of the interaction would allow for metering to not only model the costs in terms of operational activities but also in terms of cost groups. It should also be possible to associate one or more additional cost groups that indirectly initiated the interaction – i.e. the football club (cost group) tells me (cost group) it is my turn this week to clean the clothing.

There can be many cost groups associated with a cloud application (especially for multi-tenant deployments). The costs groups can have deep hierarchies. An user initiating a request can belong to a team, department, unit, organization, and company. Metering in the cloud must support the tracking of resource usage and costs across distinct cost hierarchies and within a cost hierarchy (path) across cost groups with a clear distinction between direct and indirect metering.
groups
Services: When managing the household budget we look to group our utility costs into household services – heating, lighting, cooking. Unfortunately without activity level metering this is not always possible as one utility (resource) can be used by multiple appliances across different service groups. This problem is also compounded by the fact that appliances themselves are multi-purpose and can be connected (plumbing/wiring) to multiple appliances each of which can be in a different service group.

Luckily software is much easier to instrument & measure (meter) than hardware as long as meta-data can be associated with each software activity identify the service context for a metered execution and this service context can be passed across processing container boundaries.
services
I would like to point out that metering is not just for cost management purposes. The same model can be used for performance management, capacity planning, service level management even service value pricing.

More Information: If you are interested in learning more about activity based costing and how it can be applied to manage application performance, control computing costs, and enable fair and accurate charge back plans then please visit our website shown below.
meteringthecloud
DZone Link: Cloud Computing – A Tale of Two Machines

In a recent article claiming to “demystify” the Hibernate session context & cache a rather expensive (@runtime) and somewhat laborious (@analysis-time) approach was presented to show how the impact of certain operations in the Session interface have on the persistence context & cache components as well as database access patterns.

Note: The approach involved journaling (@runtime) each call along with some primitive (string) parameters and then presenting the entries as a sequenced call tree with the important measurements need for this rather small runtime analysis nested and scattered across different call branches.

But getting the answer to such questions as “Which entities are being added to the persistence context?” and “Which software activities trigger entity additions and database queries?” can be so much easier for both the user and the runtime once you have the right measurement model that is inherently extensible, uniform, and flexible – probes, meters and counters.

Lets start with the counters. Below is a table showing some of the counters maintained by our (@load-time) instrumentation extension libraries for Hibernate.

Note: The hibernate.add.count tracks the entities added to the persistence context.

hibernatecounters1

The counters are maintained per thread and can be aggregated at the process, host and cluster levels within the management console. We can also mark the counters to monitor the rate of change during a particular measurement interval.

In addition we can add other extension libraries to the class path to count at a contextual level (the class of the entity) as well.

Note: The class of the entity prefixes the hibernate.*.count counter.

hibernatecounters2

Now that we know what is happening in terms of our data model lets find out why by simply mapping (via configuration) custom meters to some of the above counters and then enabling probes (activities) at the appropriate application level (Seam Bean).

With the following metering table it is very easy to determine which packages, classes and methods result in entities being added.

hibernatecounters3

We can also perform a similar analysis across all meters mapped to counters.

hibernatecounters4

If you want you can even map a meter to an individual entity.* counter and rename it within the metering model. I actually recently did this for a customer that helped resolve a significant issue with an application in production.

This approach can be applied to other persistence frameworks (a similar set of counters are supplied for JPA) as well as data caches such as Oracle’s Coherence.

jxinsight5710coherence2

The only skill required in effectively using our probes technology is in deciding which software execution points become counters (and possibly mapped resource meters) and which become probes (hierarchical cost/metered groups).

Following my last post in which I presented an alternative (more like complimentary) Java stack trace print I have received a few inquiries offline questioning my view that call stacks are not necessarily that important for runtime diagnostics.

To be honest I am somewhat surprised by this incorrect perception because I have always placed a significant value on the efficient collection of thread call stacks especially when only a small portion of the code base is instrumented. Having a call stack at particular instrumented points in a software’s execution is important which is why in 2002 I made it a core element within our own software & system execution models for both database transaction path analysis and component distributed tracing. This was prior to the introduction of StackTraceElement (@since 1.4) when amazingly some tools were actually using printStackTrace to obtain the calling context.

Unfortunately the benefit of the StackTraceElement class was greatly diminished by its initial class design (and possibly its name) which was probably heavily influenced by the usage of stack traces at the time and possibly the need to transmit them over the wire (RMI?).

Can you see the problem?

stacktraceelementcode1

The important context fields are Strings. In a typical container environment with a proliferation of ClassLoaders it is practically impossible to accurately obtain access to the declaring java.lang.Class instance. Even if this was possible within a “hello world” like application we still cannot access the java.lang.reflect.Method as there is no parameter list only a line number into a file (allowing for overloaded method names).

When I designed our native profiling agent (initially JVMPI now JVMTI) I set out to ensure that for every frame on the stack we could obtain the Method reference and that the cost in obtaining the call stack from within the runtime itself would be significantly cheaper than any other means available especially for call stacks that appear repeatedly during execution

What did the agent solution buy us? Well it allowed us to offer the ability to apply filtering to a collected call stack based on class meta data (what it implements, what it extends) and not just on the package/class/method name as offered by other tools. It also allowed use to add meta data to our software execution model which could be used to more effectively visualize call stacks. Here are some screenshots taken from within the product’s management console to illustrate this last point.

Icons: Using the meta data each Class within the model is given one or more classifications each mapped to an icon.

Note: Classifications have a ranking which determines which one is used for rendering when there are multiple classifications assigned.

annotatedstack1

Signatures: The meta data allows us to display a tooltip for a Method listing its modifiers, parameter types and return type.

Note: You might be surprised to see the abstract keyword. This is because for Proxy classes we substitute the actual interface Class and Method. The * denotes this.

annotatedstack2
Annotations: The meta data allows us to present to the user the Annotations assigned to a Method.

annotatedstack3

Class & Classloader Hierarchies & CodeSources:  The meta data allows us to display the Class & ClassLoader hierarchies for a Method’s declaring Class including the CodeSource of the Class.

Note: The ejb:session class classification is based on the presence of the @Stateful annotation.

annotatedstack4

Compressed Call Stack Rendering: With the meta data I can quickly process (visually that is) the following 26 call stack collections immediately identifying the sequence of calls between major classified components/classes.

[Servlet]->[JSF View]->[SessionBean]->[SeamBean]->[JPA Class]

annotatedstack5

Note: Collecting call stacks is important in creating a software & system execution model but one cannot ignore there is a relatively high cost to the collection even if you are using our agent which significantly out-performs all other available means and tools today. Call stacks should only be collected at coarse grain component boundaries i.e. database calls, out-bound remote calls, messaging endpoints, service/activity entry points, etc.

When performing JVM runtime diagnostics the standard (and need I say pretty crude) approach adopted by most working for companies with very little in the way of mature IT management processes and tooling is the thread (or exception) stack trace dump.

There are numerous problems with this approach including the lack of object & argument state information (already solved quite comprehensively) as well as timing & resource usage which I hope to address here.

Lets start with the sample code used to generate the stack trace print outs presented in this entry.

A simple deep recursive call chain execution ending with a wait followed by a stack trace print.

probestackcode1

Here is a small extract of a single stack trace printed by the above code.

probestackout

The problem with the output is that we have no inclination to how long each frame has been on the threads execution call stack and the resources it has consumed since added.

We cannot even determine whether two identical stack traces, printed by the same thread, represent the exact same execution call (frame) stack. Is this a new request at the same execution point or has the code stalled (waited) since the last print?

In addition the stack trace print contains frames that represent noise in our runtime analysis if we were focused on tracking down performance bottlenecks. This can also be the case when analyzing applications deployed to standard middleware containers.

Now lets rewrite the waiting method to use our recently announced stack probes provider extension.

probestackcode21

Here is an extract of the stack trace printed.

probestackxml1

A few observations. It is in XML format. Each fired and metered Probe (frame) has a unique id within the context of each thread. The wall clock time (in microseconds) is listed as a Meter for each frame on the stack.

Note: With the id we can compare across stack trace prints and performance call stack sampling with a much higher degree of accuracy (no second guessing whether the method indeed remained active during interval).

Did you notice that the intermediary stacks.Dumping.call method is absent in the above output? The reason is the call(int) method had been executed a sufficiently number of times (default=1000) prior to this particular stack print for the resource metering runtime to have already determined that it is not a hotspot (in terms of inherent cost) and does not need to be metered any further. It was fired but not metered.

Note: The runtime allows for different meters to be used in determining whether a particular named probe (method) is a hotspot or not.

Here is another stack trace printed later in the same run.

probestackxml2

Now all the stacks.Dumping.$* methods have completely disappeared. We are left with probes fired and metered for methods marked as hotspots or still under analysis by the probes runtime metering strategies.

You have probably already guessed that if you enable additional resource meters within the metering runtime these will also be included in the probe stack prints.

probestackxml3

Conventional call stack dumps quickly fill up a log file with a large amount of irrelevant and duplicated data but they are also extremely costly both in terms of wall clock time and CPU usage.

probestackchart

I am not advocating the complete elimination of thread stack trace dumps from logs though I am sure many of us would wish it to be so especially when staring at the same exception printed 20 times or more as it passes back up through the request processing pipeline.

Having the complete call frame stack can be helpful in a few cases but for the most part it is woefully inefficient (both at runtime and off-line analysis), wasteful, and severely lacking in important runtime diagnostic information – object state and resource usage.

There are solutions available today that make problem determination pretty efficient even for the most complex of systems. You just need to step outside of the runtime and to stop staring into the Sun (it can blind you).

Note: With the solution above probe stack prints can also cross multiple language boundaries – Java-to-Ruby.

ABC for Cloud Computing

January 27, 2009

How Fast and at What Cost? 
Whilst there are many benefits to cloud computing in terms of dynamic provisioning and on-demand scaling it presents challenges to IT management organizations in the performance monitoring of applications and streamlining processing costs as result of cloud computing service charges. Fortunately there is a common solution to tackling both challenges that offers to make the cost of both the operation and quality of an IT service delivered transparent, measurable and manageable. A solution that allows one to explicitly model the relation between the performance of an IT service and its operating costs when delivered at particular levels of quality (performance) to consumers – activity-based costing (ABC)

Activity-based costing consists of a service costing method and a resource consumption model that facilitates better decision-making by interpreting relationships between services delivered and their operational costs.

ABC In Practice
Below I have created a mock Twitter application to demonstrate some of the key concepts and design choices in applying activity based costing (ABC) to applications deployed to the cloud. The mock application I have created runs on a mock Google AppEngine platform and stores user tweets in a mock Amazon S3 service.

Advice: I strongly recommend that you read our Metering the Cloud article as well as some basic information on the mocked application and technologies before proceeding if you have not already done so. 

Activity & Resource Mappings
An activity is represented as a named Probe with composite names representing hierarchical cost centers (Groups).

A resource is represented by either a Counter or Meter. The main difference between a Counter and a Meter is that changes to a Meter (which can in fact be mapped to a counter) are tracked at the activity level via a Metering associated with a Group (full or partial name).

Note: The mock application has been simplified significantly leaving out nested activities (probes), tracking of costs by context paths, and optimal metering strategies (cost versus overhead).

Within each of the mock cloud technology components I have inserted instrumentation for the purpose of tracking and costing of software activities (Probes) and resource usage (Meters & Counters). These lines are bookmarked in the left margin.

Note: The instrumentation inserted would normally be weaved into the code base transparently at load-time using our aspect libraries. 

Mock Twitter WepApp
The Twitter WebApp component mimics the handling of web requests dispatched by the Google Engine. The request handling code is the main entry point from which requests can be dispatched to two service points – status and timeline.

The status service point will update the latest status text associated for the specified user by writing the binary version of the text to an Amazon S3 Bucket keyed on the user’s name.

The timeline service point will list previous status updates for the specified user by reading all objects stored in an Amazon S3 Bucket keyed on the user’s name.

Tracking relevant activities (software services) is the foundation of a successful ABC model so this is the only component in the application that is instrumented with named Probes. A single Probe is fired and metered in the request handling code creating a dynamic composite named Probe with the following hierarchical cost center (metered groups):

twitter.users.${user}.${service}

In addition a Counter, representing a service charge, is incremented for each request delivered to the status service point. Charges are assigned to status updates by the user and not the viewing of such updates by the user and others.

ccabctwitter

Mock Google Engine
The Google Engine component mimics a web request processing dispatcher. It reads a single line representing a URL from a text console and forwards both a Request and Response object to the registered Handler

From the perspective of the Twitter application the software execution of the Engine is modeled as a resource (with multiple cost driver) and not as an activity with Counters accumulating both the counts of requests dispatched and associated CPU usage.

ccabcengine1

Mock Google Engine Request
The Google Request component mimics the reading and parsing of a web request. A single Counter is maintained (across threads) by the Request component tracking the number of bytes read which is used by Google in its cloud computing billing rate plan.

Note: Counters are thread specific so their accumulation can easily be mapped to a metered resource (meter) which is in turn mapped to a metered activity (probe) executed by a request handling thread. This is a very important distinction with legacy system monitoring solutions working with metrics at the process level and oblivious to the causality of changes. 

ccabcrequest

Mock Google Engine Response
The Google Response component mimics the writing of a web response back to the user. It also increments the same Counter used by the Google Request component recording the number of bytes written out.

ccabcrespone

Mock Amazon S3 Bucket
The Amazon Bucket component mimics the storage capabilities of the S3 service. Tweets that have been transformed into binary form are written and read from a Bucket maintained on per user basis (ignoring the limits imposed by Amazon).

Unlike the Google Request and Response components the Amazon Bucket maintains two distinct Counters tracking bytes transfered during read and write operations to and from the Bucket.  

ccabcbucket

Test Scenario
The following series of console input lines have been executed with each of the resource metering model configurations depicted below.

in: service/status/user/wlouth/text/hello world
in: service/status/user/wlouth/text/hello cloud
in: service/timeline/user/wlouth
out: hello cloud
out: hello world

To help with understanding the execution flow I have collected a probes tracking model based solely on code level instrumentation performed by our load-time bytecode weaving agent and some aspect extension libraries.

Note: In this partial ABC model every method execution is represented as an activity itself which is not necessarily what we would want in the context of the large cloud application.

ccabctracking

Resource Metering Models
Executing the test scenario (this time with no code level instrumentation) we get the following resource Metering table view that tracks the changes in Meter usage at the activity level (leaf node) and associated cost centers (root and inner nodes). This type of data is similar (to some degree) to what is found in transaction monitoring and hotspot (CPU) detection profiling tools.

Note: The Count is the number of Probes fired and metered. The Total is the cumulative change in the Meter during this period.

ccabcmetering1

Here is a table listing the values of the inserted Counters at the process level. This is type of the data presented by system management tools with no actual knowledge or tracking of the software activity treating the process largely as a black box with a basic and flat metric surface.

ccabccounting1

What we really need initially is a way of combining these two perspectives especially as our monthly cloud computing billing is closely tied to increases in one or more of these resource counters updated by the cloud platform and cloud service components.

Well it is actually not so hard (at least when using our resource metering runtime) because Counters are in fact thread (activity) specific and only aggregated at the process level when the counting probes provider extension is enabled.

Lets create some Meters mapped to to Counters and execute our test again with the following configuration.

ccabcconfig21

Below is the new Metering table view captured following the test execution. With just an external configuration change we can now see which activities impact which Counters via the Meter mapping.

Additionally we have introduced a charge back mechanism by way of the twitter.service.charge Meter (more on this later).

ccabcmetering2

Cost Metering Models
Now lets try to introduce a more explicit cost model by introducing unit cost Meters based on the Counters with some relabeling of the mapped Meters to hide the underlying technologies and billable cloud services by appending the following properties to the configuration.

ccabcconfig31

Below is the revised metering model after executing the test scenario. The model now includes a new unit cost based meter, io.cost, representing our storage costs. From the model we can see that the storage cost in delivering the status and timeline services (activities) to user wlouth is 66 (io.read.bytes*1 + io.write.bytes*2). This is higher than our service charge units (20) so we might need to revise things slightly in our next configuration.

Note: A good practice (of abstraction) in cost modeling is to create an internal unit charge scheme and associated unit currency.

ccabcmetering3

Our final configuration assigns a unit cost to our service charge Meter as well as relabeling the mapped Meter.

ccabcconfig4

And here is the resulting model. Yes! The cloud application is making profit (well at least in terms of the storage cost).

ccabcmetering4

An important benefit in our innovative resource metering approach is that we now have a single model for both cost management and service management (performance & capacity management) that can in addition provide business insight into software service sales (and in real-time alongside operational costs).

This unified model allows performance engineers to much more easily factor in cost when making trade-offs (in terms of resource usage) in order to improve the response times and throughput of one or more software services. With each possible performance optimization (CPU vs IO) the cost of the change can be evaluated. Here is a table that combines the code centric and service centric views into a single model.

ccabcmetering5

Naturally the model lends itself to being distributed (and disconnected) across multiple processing units and execution flows in the cloud whilst offering a user defined consolidated view from any business & service management workstation.

ccabcmetering6

Summary
Cost management is important for companies using cloud computing as a platform for application delivery as it provides a framework for planning and controlling decisions related to such services in terms of service value and computing cost. It also serves to help cloud computing vendors to deliver resources in a cost effective manner whilst maximizing value. Finally, knowledge of costs is required to make intelligent decisions related to cost-justified service quality.

Have you ever wished you could listen in on the execution of your software at particular points in its execution? Are you tired of writing stop watch classes and bracketing up a piece of code you are currently working on with a try..finally block? Well when we developed the event probes provider extension announced today we initially envisaged the work being used by partners and customers to simplify integration with existing event & alert management solutions at customer sites. But in writing the documentation for this extension I realized it could also be used during development and test. Here are some examples I quickly came up with. 

Note: The actual instrumentation of the methods is all done transparently at load-time though the instrumentation does use the same underlying Probes Open API.

The first example shows how we can register with the current thread execution context a probes event listener that will receive a notification when any method called by the thread has completed its execution.

Here is the output printed to the console.

events.Events.waiting waited=10008352

Note: JXInsight reports all resource metering time units in microseconds.

This next example shows how easy it is to print resource metering (clock.time) for outbound calls by registering a probes event listener prior to the actual call invocation sequence.

Much simpler than adding timing code before and after each call listed.

listenertiming1

Here is the output printed to the console.

events.Events.call1 timed=1000457
events.Events.call2 timed=2000625
events.Events.call3 timed=3000892

The following example shows how to trace direct and indirect method executions made by the thread within the execution interval of a method in which a listener is added (and removed).

Here is the output printed to the console.

entering: events.Events.outer
entering: events.Events.inner
exiting: events.Events.inner
exiting: events.Events.outer

The last example uses the same tracing like mechanism to determine the maximum call stack depth reached starting from a particular method execution frame.

listenerstacking1

Here is the output printed to the console.

max call stack depth was 5

Have you ever wondered how long it takes to get from one method frame to another method frame nested deeply amongst all the indirection of supposedly light weight frameworks? What about the time interval between two method calls not stacked or in any way directly call dependent? All these and many more ad hoc queries can be answered whilst still relying on JXInsight to do what it does best – instrument, meter, measure, aggregate, correlate and visualize.

This is part 2 of a series of blog entries examining the claim that call stack sample based profiling offers much less overhead than instrumented/execution based profiling. In this entry I have constructed a simple benchmark class based on the following observations raised in part 1 related to enterprise Java applications in the wild (which I assume was the context when the health center monitoring tool was marketed to the audience at the conference).

  • Large number (>50) of request processing threads
  • Very deep call stacks (>200) with a high percentage of call frames non-application related (especially so when using frameworks such as Spring)
  • High degree of database activity with high latency costs (>10 ms)

The benchmark class (partially listed) below does not necessarily look to declare a winner between sampling and execution profiling in terms of overhead as the answer depends on the product used and the application itself. It is an attempt at understanding the difference in the runtime cost and behavior of both techniques – across different products and execution contexts. To do this I have collected a sufficient amount of test data to help me in answering the following questions :-

  • What is the typical CPU cost of each data collection technique?
  • What is the rate of change in CPU cost as the average call stack depth is increased?
  • Is the CPU cost of a data collection technique an accurate measurement of its runtime overhead? (Does sampling throttle the throughput of other threads?)
  • Does a data collection technique have any inherent scalability issues in an enterprise setting when factoring in the previously mentioned observations and the trend of increasing multi-core computing?

samplingcode2

As you can see from the code above I have created some real challenges for an instrumentation/ execution based profiling solution in that we have a disproportionate number of methods that are relatively cheap (mimicking lightweight request processing steps), easily in-lined (to some degree) and executed at a much higher frequency than the final sleep call (mimicking a database call).

Note: Another reason for the test execution time being largely dominated by the sleep time is that I can quickly calculate the CPU overhead of a product and collection technique using the process’s CPU time.

I conducted more than 50 test runs, each lasting between 100-300 seconds, comparing JXInsight instrumentation/execution based resource metering solution with the call stack sample based profiling option offered by the 3 most popular Java profilers on the market. The first chart below shows the wall clock time for each benchmark test run with varying call stack depths.

Note: There was no filtering of the class and methods to be instrumented in the execution profiling test runs which is not typical of a production environment (see observations above).

samplingchart12

From the chart above we can see that when the call stack depth is kept relatively low (<100) the overhead of sampling could be categorized as being light weight (2-10%). But once we hit the 200 call stack depth the overhead of sampling grows quickly and becomes excessively heavy weight (20%-200%). This is not the case for our execution profiling solution, JXInsight, which keeps the overhead extremely low (1-5%) via a dynamic metering strategy whilst firing a relatively large number of probes (41-161 million) per test run.

Note: For every sample cycle performed by the other products JXInsight can fire (begin-end) between 5,000-20,000 execution based probes at an extremely low cost of between 0.5 to 1.0 microseconds depending on whether a firing is metered (underlying counter read) or not.

The next chart shows the CPU consumption (following the execution of a warm-up cycle) for each benchmark test run with varying call stack depths.

samplingchart23

As expected the CPU cost increases inline with the increase in call stack depth across all sample based profilers. But what is interesting about this chart is that it shows the mix-mode based profiler (A) having a lower CPU cost than JXInsight whilst having a significantly higher wall clock time (the slowest reported in the previous chart). It would appear that this product has traded throughput/scalability for increased levels of efficiency (smaller code paths?) using possibly coarse grain synchronization (serialized processing) during measurement and/or the queuing events & updates for processing by a (single?) non-application thread.

Similarly sample based profiler (B) has near equal CPU costs but much higher clock times than JXInsight. This is more than likely caused by a reduction in the call throughput due to the suspension of all application threads (> 40) during each call stack sample cycle which from the CPU times reported we know to be anything but short or cheap (0.5-1.0 seconds). 

If an execution profiling solution can keep the measurement overhead sufficiently low it can scale as the application workload scales across cores. This is not the case with call stack sampling which is rarely implemented in a multi-threaded manner as it is inherently serialized at costly execution points to maintain some degree of accuracy in the data reported.

Is call (stack) sample profiling more light weight than execution profiling? No if you are using JXInsight. If not then the answer really depends on a number of factors including the average sampled call stack depth, the average number of active threads, the request processing characteristics (CPU / IO), the number of computing cores, the sampling profiler, and sampling configuration (interval).

In the next entry I will look at the characteristics of the data collected by each technique and its applicability in the context of application performance management.

Update 19th Jan 2009: Whilst looking at the probe snapshots generated during our test runs I noticed that I had mistakenly used an instrumented version of our probes runtime (internally we use a number of counters to instrument parts of our runtime with itself) overstating both our clock times and cpu times. Here is a chart showing the reduction in CPU usage when using the non-instrumented production version of our libraries – at least 10% reduction across each test run.

samplingchart3

Update 19th Jan 2009: The JXInsight test runs were executed with the following system properties.
jxinsight.server.profiler.autostart=false
jxinsight.server.probes.strategy.hotspot.statistic=inherent.avg

(Historical) Aside: When we released JDBInsight 2.0 (a very long time ago) it offered an extremely efficient call stack collection facility implemented natively within our JVMPI agent with maximum frame size limit of 128. We subsequently changed this to 256 when object-relation mapping tools became more widespread adopted along with increased indirection in the middleware request processing pipeline. When developers started adding IOC frameworks into the application stack the size of the stack was increased beyond 320. Will this grow even further? I think it has already happened with some middleware products using renamed packages of the same IOC frameworks internally in the pipeline. A stack size depth of 400 or more sounds absurd and presents a strong argument for rethinking exception handling and the usage of logging by developers today. 

We recently announced support for continuous performance management via an extension to our resource metering runtime that transparently injects named assertions (defined externally in a configuration file) into unit test classes (JUnit or TestNG).  

If you look at the screenshot accompanying the update announcement you will have noticed that there was no evidence of any code changes as it is all down under the covers at load-time using byte code instrumentation. But did you know that you can in fact use both transparent instrumentation & measurement with explicit assertions.

A reason to use explicit assertions would be to use more elaborate expressions than the configuration based threshold expressions.

Below is an example of  using transparent measurement (resource metering) with explicit performance related assertions. An additional performance test method (testDBTime) has been created which executes the functional test method (testBD) a number of times to get an average timing. To handle the case of a GC cycle (caused possibly by previous test executions) perturbing the test the assertion discounts GC time from the clock time measure (meter).

Note: The actual measurement (metering) of the functional test method is performed by byte code inserted at load-time.

assertman11

Note: The time reported above would have caused the clock time assertion to fail if we had not discounted the gc time that occurred because of the repeated allocation of BigDecimal objects.

If there is only one test method executing concurrently you can simplify the code above by using our marking facility to track the resource metering delta caused by the transparently injected instrumentation.

Note: The actual measurement (metering) of the functional test method is performed by byte code inserted at load-time.

assertman21

Finally you can simply use both our consumer and producer aspects of our open API to manually insert instrumentation and assertions.

Note: The name used in creating the measurement probe below is arbitrary. I have used the same name that would be created by the load-time byte code instrumentation.

assertman3

Note: Using the same open API above you can easily create a custom dashboard within your own web application that offers a simpler performance management and problem diagnostic view on our resource metering data available in our award winning management console.

At the recent Devoxx conference (which I did not manage to attend) it was claimed that sample based profiling has much less overhead than execution profiling. This claim was made by a person that works on a product (not yet released) that only supports sample based profiling during a session which opened up with the statement “Any performance tuning advice provided in this presentation….. will be wrong!” 

This same person previously claimed GC actually made applications faster. If this holds true for an application then more than likely the application has serious resource bottlenecks elsewhere in the request processing pipeline which the additional GC stop-the-world event is alleviating by inadvertently throttling traffic and reducing resource contention (concurrency).

But before testing the validity of such a claims (ignoring the actual benefit of the data collection) lets consider the typical production workload context for enterprise Java applications.

  • Large number (>50) of request processing threads
  • Very deep call stacks (>200) with a high percentage of call frames non-application related (especially so when using frameworks such as Spring)
  • High degree of database activity with high latency costs (>10 ms)

Which means that there is a high probability that when the sample profiler executes a measurement cycle (every 1-5 ms?) a large number of threads will have very deep call stacks that by and large are of little value in terms of the application performance analysis – non-application code and with no application context.

Obtaining the call stack for a thread is incredibly expensive (we know this from the cost of throwing exceptions) and this is typically performed after all threads have been suspended temporarily by the sampling profiler (more cores, more waste). This expense does not even include the cost in performing a per thread call stack comparison with the previous call stack collected, recording timing and updating statistics – a cost that grows with each thread and each frame.

I think the problem here is that the session speaker incorrectly assumes that other tools managing applications in production using some form of execution analysis (1) instrument every single method on the call stack, (2) measure every method invocation occurrence, and (3) have a relatively high overhead in the measuring invocations. This is certainly not the case today

Unless we are talking about a “HelloWorld” application with only one main thread of execution being profiled a dynamic strategy based execution profiling (metering) solution can indeed out-perform simplistic sample based profilers whilst collecting much more relevant data, discarding noise, at a much higher degree of accuracy. This will be demonstrated in part 2.

Aside: One of the reasons that sample based profiling is offered by vendors is that it simplifies the development work for the product team. There is no requirement to deliver technology specific extensions, configuration options or an open API to allow custom extension. There is overhead reduction it is at the vendors development site!!!