In a previous blog entry, Instrument->Measure->Refine->Repeat , I discussed how to approach the performance analysis of software with request processing execution times in the low microseconds via an iterative hotspot detection process that seeks to refine the instrumentation and measurement with each cycle, reducing overhead and minimizing perturbation in the data collected. In this follow-up I would like to present an alternative and complimentary approach that allows us to largely discount any instrumentation and measurement overhead by focusing on the rate of change in key performance indicators (KPI) that are not time based.

But first allow me to ask the question “Why do we profile the execution of software?“. For many developers the answer is simple – “to tune”. But this is not entirely the only valid reason and in fact this answer is more of a by-product of other activities and tasks within software performance engineering (SPE).

For large complex systems we observe and measure the execution of software in order to understand, model, quantify, and manage its behavior (resources usage & utilization, performance, scalability, runtime system dependencies) across multiple software releases and under different operating & executing contexts. Naturally we need a way to effectively manage all this information. This is achieved by metering (tracking) a relatively small number of indicators that measure operational activities that have a significant impact on performance, correlating the rate of change in such indicators with higher-level activities and various operating parameters.

Note: In fact the ability to detect and assess underlying changes in both the system & software execution models across deployments is of paramount importance in delivering service level management and reducing the time window for problem resolution in production environments.

The choice of a good key performance indicator depends largely on the type of software being monitored and managed. For standard business applications this can be the number and timing for requests serviced by various resource managers (file system, database, messaging middleware, data grid, compute grid). For low latency request processing systems there are a number of useful candidates for key performance indicators that are not entirely time dependent or impacted by instrumentation including: method execution counts, field read/write counts, object allocation counts & sizes (and to some lesser extent cpu time).

Lets apply this approach to the performance analysis of Oracle Coherence used in my previous blog entry which you should have already read at this stage. Before looking at any execution profile data you first need to know what was the activity being tested and how was it parameterized. Here it is:

>cache cache
>#10 bulkput
>#10 bulkput

Executing the above command sequence in the Oracle Coherence interactive shell results in a total of 20,000 SafeNamedCache.put() calls. We can see this below in the metering table with the Count column representing the number of times this particular method was called.

Lets keep this number, 20,000, in mind when looking at all numbers presented below.

execprofiling1

The Total column represents the number of methods executed within the com.tangosol.* package during the execution of the SafeNamedCache.put() 20,000 calls by the same calling thread. The Total(I) tells us the number of methods called that were directly (inherently) attributed to the execution of the  SafeNamedCache.put()

So 20,000 SafeNamedCache.put() calls resulted in 13.7 million method calls, 40,000 of which were made directly from within the method itself. But what about the other 38 million (51 less 13) call? Why only 13.7 million? The reason is that these other 38 million calls were made outside of the execution of the cache operation and more importantly by other threads. Expansion of the metering model tree reveals one of the main contributing classes, PacketPublisher, whose method executions resulted in a total of more than 22 million (1.3 million x 17) method calls, 4 million of which were made directly from within the class.

Note: For extremely low latency systems the overhead of excessive method calls can be relatively high even if in-lined to some degree because there still needs to be some housekeeping performed by the runtime (call stack mgmt). When looking for possible candidates to tune start with those probes with high averages.

execprofiling21

In analyzing the performance of our own software product we use two low level meters – field.read.count and field.write.count. Below is such an analysis with 37.6 million reads performed on fields residing in the com.tangosol.* package during the execution of 52.1 million method calls also within the com.tangosol.* package.

Note: We ourselves monitor call paths within our probes runtime at the nanosecond level by averaging billions of test executions and by managing the number of field reads and writes.

execprofiling3

Comparing the Count column with the Total(I) columns shows that a vast number of calls within the PacketPublisher class are simple field getters – an average of 1 field read per method call. It also reveals that during the execution of methods within this class 18.5 millions fields are read with 1 million within the class’s own methods and the rest in methods executed elsewhere in the com.tangosol.* package.

execprofiling4

Another useful meter in analyzing software with a large amount of thread co-ordination is waiting.count which tracks the number of Object.wait() calls made by a thread.

Note: The waiting.count meter can be used in “profiling nothing” as a high number of waits calls is generally indicative of worker threads starved of work items to process.

execprofiling51

For low latency systems you really need to look at the rate of object allocation and the resulting byte sizes. This is not only important for performance management but for capacity management as well as most OutOfMemoryErrors are the result of high concurrent request processing placing too much demand on the available memory pool and not some permanent memory leakage (more like a temporal & transient waste).

Note: Execution count profiling allows one to identify and quantify the impact of particular request execution parameters have on key performance indicators – correlating rate of change across multiple variables.

execprofiling6

Focusing on the average object allocation count per method execution can sometimes reveal opportunities for tuning which by the way does not necessarily imply that objects should be cached – always try to eliminate work completely before improving the efficiency of the work via reuse.

From the metering table below we can see how many objects are allocated directly from within a method and how many are typically allocated during its execution including from within methods called directly and indirectly.

Note: Object allocation metering can also reveal the cost of initialization – methods only ever executed once.

execprofiling7

Key performance indicators can be used to drive change (tune) but they can also be used to effectively manage and track change not necessarily tied to the achievement of an aligned target. They can also be used to identify unplanned changes that would otherwise result in customers notifying you of visible side effects at a much higher granularity and largely disconnected from the problem domain.

In a follow-up to this entry I will show how easy it is to create your own custom key performance indicators with our aspect library extensions.