ABC for Cloud Computing

January 27, 2009

How Fast and at What Cost? 
Whilst there are many benefits to cloud computing in terms of dynamic provisioning and on-demand scaling it presents challenges to IT management organizations in the performance monitoring of applications and streamlining processing costs as result of cloud computing service charges. Fortunately there is a common solution to tackling both challenges that offers to make the cost of both the operation and quality of an IT service delivered transparent, measurable and manageable. A solution that allows one to explicitly model the relation between the performance of an IT service and its operating costs when delivered at particular levels of quality (performance) to consumers – activity-based costing (ABC)

Activity-based costing consists of a service costing method and a resource consumption model that facilitates better decision-making by interpreting relationships between services delivered and their operational costs.

ABC In Practice
Below I have created a mock Twitter application to demonstrate some of the key concepts and design choices in applying activity based costing (ABC) to applications deployed to the cloud. The mock application I have created runs on a mock Google AppEngine platform and stores user tweets in a mock Amazon S3 service.

Advice: I strongly recommend that you read our Metering the Cloud article as well as some basic information on the mocked application and technologies before proceeding if you have not already done so. 

Activity & Resource Mappings
An activity is represented as a named Probe with composite names representing hierarchical cost centers (Groups).

A resource is represented by either a Counter or Meter. The main difference between a Counter and a Meter is that changes to a Meter (which can in fact be mapped to a counter) are tracked at the activity level via a Metering associated with a Group (full or partial name).

Note: The mock application has been simplified significantly leaving out nested activities (probes), tracking of costs by context paths, and optimal metering strategies (cost versus overhead).

Within each of the mock cloud technology components I have inserted instrumentation for the purpose of tracking and costing of software activities (Probes) and resource usage (Meters & Counters). These lines are bookmarked in the left margin.

Note: The instrumentation inserted would normally be weaved into the code base transparently at load-time using our aspect libraries. 

Mock Twitter WepApp
The Twitter WebApp component mimics the handling of web requests dispatched by the Google Engine. The request handling code is the main entry point from which requests can be dispatched to two service points – status and timeline.

The status service point will update the latest status text associated for the specified user by writing the binary version of the text to an Amazon S3 Bucket keyed on the user’s name.

The timeline service point will list previous status updates for the specified user by reading all objects stored in an Amazon S3 Bucket keyed on the user’s name.

Tracking relevant activities (software services) is the foundation of a successful ABC model so this is the only component in the application that is instrumented with named Probes. A single Probe is fired and metered in the request handling code creating a dynamic composite named Probe with the following hierarchical cost center (metered groups):

twitter.users.${user}.${service}

In addition a Counter, representing a service charge, is incremented for each request delivered to the status service point. Charges are assigned to status updates by the user and not the viewing of such updates by the user and others.

ccabctwitter

Mock Google Engine
The Google Engine component mimics a web request processing dispatcher. It reads a single line representing a URL from a text console and forwards both a Request and Response object to the registered Handler

From the perspective of the Twitter application the software execution of the Engine is modeled as a resource (with multiple cost driver) and not as an activity with Counters accumulating both the counts of requests dispatched and associated CPU usage.

ccabcengine1

Mock Google Engine Request
The Google Request component mimics the reading and parsing of a web request. A single Counter is maintained (across threads) by the Request component tracking the number of bytes read which is used by Google in its cloud computing billing rate plan.

Note: Counters are thread specific so their accumulation can easily be mapped to a metered resource (meter) which is in turn mapped to a metered activity (probe) executed by a request handling thread. This is a very important distinction with legacy system monitoring solutions working with metrics at the process level and oblivious to the causality of changes. 

ccabcrequest

Mock Google Engine Response
The Google Response component mimics the writing of a web response back to the user. It also increments the same Counter used by the Google Request component recording the number of bytes written out.

ccabcrespone

Mock Amazon S3 Bucket
The Amazon Bucket component mimics the storage capabilities of the S3 service. Tweets that have been transformed into binary form are written and read from a Bucket maintained on per user basis (ignoring the limits imposed by Amazon).

Unlike the Google Request and Response components the Amazon Bucket maintains two distinct Counters tracking bytes transfered during read and write operations to and from the Bucket.  

ccabcbucket

Test Scenario
The following series of console input lines have been executed with each of the resource metering model configurations depicted below.

in: service/status/user/wlouth/text/hello world
in: service/status/user/wlouth/text/hello cloud
in: service/timeline/user/wlouth
out: hello cloud
out: hello world

To help with understanding the execution flow I have collected a probes tracking model based solely on code level instrumentation performed by our load-time bytecode weaving agent and some aspect extension libraries.

Note: In this partial ABC model every method execution is represented as an activity itself which is not necessarily what we would want in the context of the large cloud application.

ccabctracking

Resource Metering Models
Executing the test scenario (this time with no code level instrumentation) we get the following resource Metering table view that tracks the changes in Meter usage at the activity level (leaf node) and associated cost centers (root and inner nodes). This type of data is similar (to some degree) to what is found in transaction monitoring and hotspot (CPU) detection profiling tools.

Note: The Count is the number of Probes fired and metered. The Total is the cumulative change in the Meter during this period.

ccabcmetering1

Here is a table listing the values of the inserted Counters at the process level. This is type of the data presented by system management tools with no actual knowledge or tracking of the software activity treating the process largely as a black box with a basic and flat metric surface.

ccabccounting1

What we really need initially is a way of combining these two perspectives especially as our monthly cloud computing billing is closely tied to increases in one or more of these resource counters updated by the cloud platform and cloud service components.

Well it is actually not so hard (at least when using our resource metering runtime) because Counters are in fact thread (activity) specific and only aggregated at the process level when the counting probes provider extension is enabled.

Lets create some Meters mapped to to Counters and execute our test again with the following configuration.

ccabcconfig21

Below is the new Metering table view captured following the test execution. With just an external configuration change we can now see which activities impact which Counters via the Meter mapping.

Additionally we have introduced a charge back mechanism by way of the twitter.service.charge Meter (more on this later).

ccabcmetering2

Cost Metering Models
Now lets try to introduce a more explicit cost model by introducing unit cost Meters based on the Counters with some relabeling of the mapped Meters to hide the underlying technologies and billable cloud services by appending the following properties to the configuration.

ccabcconfig31

Below is the revised metering model after executing the test scenario. The model now includes a new unit cost based meter, io.cost, representing our storage costs. From the model we can see that the storage cost in delivering the status and timeline services (activities) to user wlouth is 66 (io.read.bytes*1 + io.write.bytes*2). This is higher than our service charge units (20) so we might need to revise things slightly in our next configuration.

Note: A good practice (of abstraction) in cost modeling is to create an internal unit charge scheme and associated unit currency.

ccabcmetering3

Our final configuration assigns a unit cost to our service charge Meter as well as relabeling the mapped Meter.

ccabcconfig4

And here is the resulting model. Yes! The cloud application is making profit (well at least in terms of the storage cost).

ccabcmetering4

An important benefit in our innovative resource metering approach is that we now have a single model for both cost management and service management (performance & capacity management) that can in addition provide business insight into software service sales (and in real-time alongside operational costs).

This unified model allows performance engineers to much more easily factor in cost when making trade-offs (in terms of resource usage) in order to improve the response times and throughput of one or more software services. With each possible performance optimization (CPU vs IO) the cost of the change can be evaluated. Here is a table that combines the code centric and service centric views into a single model.

ccabcmetering5

Naturally the model lends itself to being distributed (and disconnected) across multiple processing units and execution flows in the cloud whilst offering a user defined consolidated view from any business & service management workstation.

ccabcmetering6

Summary
Cost management is important for companies using cloud computing as a platform for application delivery as it provides a framework for planning and controlling decisions related to such services in terms of service value and computing cost. It also serves to help cloud computing vendors to deliver resources in a cost effective manner whilst maximizing value. Finally, knowledge of costs is required to make intelligent decisions related to cost-justified service quality.

Have you ever wished you could listen in on the execution of your software at particular points in its execution? Are you tired of writing stop watch classes and bracketing up a piece of code you are currently working on with a try..finally block? Well when we developed the event probes provider extension announced today we initially envisaged the work being used by partners and customers to simplify integration with existing event & alert management solutions at customer sites. But in writing the documentation for this extension I realized it could also be used during development and test. Here are some examples I quickly came up with. 

Note: The actual instrumentation of the methods is all done transparently at load-time though the instrumentation does use the same underlying Probes Open API.

The first example shows how we can register with the current thread execution context a probes event listener that will receive a notification when any method called by the thread has completed its execution.

Here is the output printed to the console.

events.Events.waiting waited=10008352

Note: JXInsight reports all resource metering time units in microseconds.

This next example shows how easy it is to print resource metering (clock.time) for outbound calls by registering a probes event listener prior to the actual call invocation sequence.

Much simpler than adding timing code before and after each call listed.

listenertiming1

Here is the output printed to the console.

events.Events.call1 timed=1000457
events.Events.call2 timed=2000625
events.Events.call3 timed=3000892

The following example shows how to trace direct and indirect method executions made by the thread within the execution interval of a method in which a listener is added (and removed).

Here is the output printed to the console.

entering: events.Events.outer
entering: events.Events.inner
exiting: events.Events.inner
exiting: events.Events.outer

The last example uses the same tracing like mechanism to determine the maximum call stack depth reached starting from a particular method execution frame.

listenerstacking1

Here is the output printed to the console.

max call stack depth was 5

Have you ever wondered how long it takes to get from one method frame to another method frame nested deeply amongst all the indirection of supposedly light weight frameworks? What about the time interval between two method calls not stacked or in any way directly call dependent? All these and many more ad hoc queries can be answered whilst still relying on JXInsight to do what it does best – instrument, meter, measure, aggregate, correlate and visualize.

This is part 2 of a series of blog entries examining the claim that call stack sample based profiling offers much less overhead than instrumented/execution based profiling. In this entry I have constructed a simple benchmark class based on the following observations raised in part 1 related to enterprise Java applications in the wild (which I assume was the context when the health center monitoring tool was marketed to the audience at the conference).

  • Large number (>50) of request processing threads
  • Very deep call stacks (>200) with a high percentage of call frames non-application related (especially so when using frameworks such as Spring)
  • High degree of database activity with high latency costs (>10 ms)

The benchmark class (partially listed) below does not necessarily look to declare a winner between sampling and execution profiling in terms of overhead as the answer depends on the product used and the application itself. It is an attempt at understanding the difference in the runtime cost and behavior of both techniques – across different products and execution contexts. To do this I have collected a sufficient amount of test data to help me in answering the following questions :-

  • What is the typical CPU cost of each data collection technique?
  • What is the rate of change in CPU cost as the average call stack depth is increased?
  • Is the CPU cost of a data collection technique an accurate measurement of its runtime overhead? (Does sampling throttle the throughput of other threads?)
  • Does a data collection technique have any inherent scalability issues in an enterprise setting when factoring in the previously mentioned observations and the trend of increasing multi-core computing?

samplingcode2

As you can see from the code above I have created some real challenges for an instrumentation/ execution based profiling solution in that we have a disproportionate number of methods that are relatively cheap (mimicking lightweight request processing steps), easily in-lined (to some degree) and executed at a much higher frequency than the final sleep call (mimicking a database call).

Note: Another reason for the test execution time being largely dominated by the sleep time is that I can quickly calculate the CPU overhead of a product and collection technique using the process’s CPU time.

I conducted more than 50 test runs, each lasting between 100-300 seconds, comparing JXInsight instrumentation/execution based resource metering solution with the call stack sample based profiling option offered by the 3 most popular Java profilers on the market. The first chart below shows the wall clock time for each benchmark test run with varying call stack depths.

Note: There was no filtering of the class and methods to be instrumented in the execution profiling test runs which is not typical of a production environment (see observations above).

samplingchart12

From the chart above we can see that when the call stack depth is kept relatively low (<100) the overhead of sampling could be categorized as being light weight (2-10%). But once we hit the 200 call stack depth the overhead of sampling grows quickly and becomes excessively heavy weight (20%-200%). This is not the case for our execution profiling solution, JXInsight, which keeps the overhead extremely low (1-5%) via a dynamic metering strategy whilst firing a relatively large number of probes (41-161 million) per test run.

Note: For every sample cycle performed by the other products JXInsight can fire (begin-end) between 5,000-20,000 execution based probes at an extremely low cost of between 0.5 to 1.0 microseconds depending on whether a firing is metered (underlying counter read) or not.

The next chart shows the CPU consumption (following the execution of a warm-up cycle) for each benchmark test run with varying call stack depths.

samplingchart23

As expected the CPU cost increases inline with the increase in call stack depth across all sample based profilers. But what is interesting about this chart is that it shows the mix-mode based profiler (A) having a lower CPU cost than JXInsight whilst having a significantly higher wall clock time (the slowest reported in the previous chart). It would appear that this product has traded throughput/scalability for increased levels of efficiency (smaller code paths?) using possibly coarse grain synchronization (serialized processing) during measurement and/or the queuing events & updates for processing by a (single?) non-application thread.

Similarly sample based profiler (B) has near equal CPU costs but much higher clock times than JXInsight. This is more than likely caused by a reduction in the call throughput due to the suspension of all application threads (> 40) during each call stack sample cycle which from the CPU times reported we know to be anything but short or cheap (0.5-1.0 seconds).

If an execution profiling solution can keep the measurement overhead sufficiently low it can scale as the application workload scales across cores. This is not the case with call stack sampling which is rarely implemented in a multi-threaded manner as it is inherently serialized at costly execution points to maintain some degree of accuracy in the data reported.

Is call (stack) sample profiling more light weight than execution profiling? No if you are using JXInsight. If not then the answer really depends on a number of factors including the average sampled call stack depth, the average number of active threads, the request processing characteristics (CPU / IO), the number of computing cores, the sampling profiler, and sampling configuration (interval).

In the next entry I will look at the characteristics of the data collected by each technique and its applicability in the context of application performance management.

Update 19th Jan 2009: Whilst looking at the probe snapshots generated during our test runs I noticed that I had mistakenly used an instrumented version of our probes runtime (internally we use a number of counters to instrument parts of our runtime with itself) overstating both our clock times and cpu times. Here is a chart showing the reduction in CPU usage when using the non-instrumented production version of our libraries – at least 10% reduction across each test run.

samplingchart3

Update 19th Jan 2009: The JXInsight test runs were executed with the following system properties.
jxinsight.server.profiler.autostart=false
jxinsight.server.probes.strategy.hotspot.statistic=inherent.avg

(Historical) Aside: When we released JDBInsight 2.0 (a very long time ago) it offered an extremely efficient call stack collection facility implemented natively within our JVMPI agent with maximum frame size limit of 128. We subsequently changed this to 256 when object-relation mapping tools became more widespread adopted along with increased indirection in the middleware request processing pipeline. When developers started adding IOC frameworks into the application stack the size of the stack was increased beyond 320. Will this grow even further? I think it has already happened with some middleware products using renamed packages of the same IOC frameworks internally in the pipeline. A stack size depth of 400 or more sounds absurd and presents a strong argument for rethinking exception handling and the usage of logging by developers today.

Update 16th October 2009
Our overhead has now dropped to 1 nanosecond for a non-hotspot method.
http://williamlouth.wordpress.com/2009/07/27/one-billion-operations-per-second/

We recently announced support for continuous performance management via an extension to our resource metering runtime that transparently injects named assertions (defined externally in a configuration file) into unit test classes (JUnit or TestNG).  

If you look at the screenshot accompanying the update announcement you will have noticed that there was no evidence of any code changes as it is all down under the covers at load-time using byte code instrumentation. But did you know that you can in fact use both transparent instrumentation & measurement with explicit assertions.

A reason to use explicit assertions would be to use more elaborate expressions than the configuration based threshold expressions.

Below is an example of  using transparent measurement (resource metering) with explicit performance related assertions. An additional performance test method (testDBTime) has been created which executes the functional test method (testBD) a number of times to get an average timing. To handle the case of a GC cycle (caused possibly by previous test executions) perturbing the test the assertion discounts GC time from the clock time measure (meter).

Note: The actual measurement (metering) of the functional test method is performed by byte code inserted at load-time.

assertman11

Note: The time reported above would have caused the clock time assertion to fail if we had not discounted the gc time that occurred because of the repeated allocation of BigDecimal objects.

If there is only one test method executing concurrently you can simplify the code above by using our marking facility to track the resource metering delta caused by the transparently injected instrumentation.

Note: The actual measurement (metering) of the functional test method is performed by byte code inserted at load-time.

assertman21

Finally you can simply use both our consumer and producer aspects of our open API to manually insert instrumentation and assertions.

Note: The name used in creating the measurement probe below is arbitrary. I have used the same name that would be created by the load-time byte code instrumentation.

assertman3

Note: Using the same open API above you can easily create a custom dashboard within your own web application that offers a simpler performance management and problem diagnostic view on our resource metering data available in our award winning management console.

At the recent Devoxx conference (which I did not manage to attend) it was claimed that sample based profiling has much less overhead than execution profiling. This claim was made by a person that works on a product (not yet released) that only supports sample based profiling during a session which opened up with the statement “Any performance tuning advice provided in this presentation….. will be wrong!” 

This same person previously claimed GC actually made applications faster. If this holds true for an application then more than likely the application has serious resource bottlenecks elsewhere in the request processing pipeline which the additional GC stop-the-world event is alleviating by inadvertently throttling traffic and reducing resource contention (concurrency).

But before testing the validity of such a claims (ignoring the actual benefit of the data collection) lets consider the typical production workload context for enterprise Java applications.

  • Large number (>50) of request processing threads
  • Very deep call stacks (>200) with a high percentage of call frames non-application related (especially so when using frameworks such as Spring)
  • High degree of database activity with high latency costs (>10 ms)

Which means that there is a high probability that when the sample profiler executes a measurement cycle (every 1-5 ms?) a large number of threads will have very deep call stacks that by and large are of little value in terms of the application performance analysis – non-application code and with no application context.

Obtaining the call stack for a thread is incredibly expensive (we know this from the cost of throwing exceptions) and this is typically performed after all threads have been suspended temporarily by the sampling profiler (more cores, more waste). This expense does not even include the cost in performing a per thread call stack comparison with the previous call stack collected, recording timing and updating statistics – a cost that grows with each thread and each frame.

I think the problem here is that the session speaker incorrectly assumes that other tools managing applications in production using some form of execution analysis (1) instrument every single method on the call stack, (2) measure every method invocation occurrence, and (3) have a relatively high overhead in the measuring invocations. This is certainly not the case today

Unless we are talking about a “HelloWorld” application with only one main thread of execution being profiled a dynamic strategy based execution profiling (metering) solution can indeed out-perform simplistic sample based profilers whilst collecting much more relevant data, discarding noise, at a much higher degree of accuracy. This will be demonstrated in part 2.

Aside: One of the reasons that sample based profiling is offered by vendors is that it simplifies the development work for the product team. There is no requirement to deliver technology specific extensions, configuration options or an open API to allow custom extension. There is overhead reduction it is at the vendors development site!!!