I have spent the past year doing performance modelling of a new product that is being rolled out progressively to clients. The most frustrating thing has been the lack of meaningful feedback from the product support group. Every time I raise a concern that the performance is not as expected, I get the platitude that the product is performing as expected.
To illustrate this problem, I have mocked up the following graph:
The metric of concern is the maximum CPU consumption recorded by SAR in a week for the database server. This metric was chosen because the product is very CPU intensive as a result of query tuning. Nearly all of the I/O load is due to logging and checkpointing.
Note: In order to standardise the CPU usage, I chose to present the CPU usage as if it were a single server. Thus, when the CPU usage exceeds 100%, the database was configured as a RAC. This also makes it easy to identify when to expand the RAC.
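The normalisation described above can be sketched as follows, under the assumption that each RAC node reports its own SAR CPU percentage and the single-server figure is simply their sum (the function name and figures are mine, for illustration only):

```python
# Hypothetical sketch: combine per-node SAR CPU percentages into one
# "single server" figure, so anything above 100% signals a RAC.
def single_server_cpu(node_cpu_percent):
    """Sum per-node CPU% into a single-server-equivalent percentage."""
    return sum(node_cpu_percent)

print(single_server_cpu([65.0]))        # one node: 65.0
print(single_server_cpu([70.0, 55.0]))  # two-node RAC: 125.0 (exceeds 100%)
```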
The red line shows the weekly measurements. They show a low variance for the first seventeen (17) weeks, when the client numbers were low (50 to 55). Based on those stable measurements, I came up with the following simple model:
This model is crude, but it appears to fit the first seventeen (17) weeks well. It is plotted as the green dashed line, and the number of clients is overlaid on the secondary Y-axis as the solid blue line to emphasise the linear relationship.
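A fit of this kind, CPU usage proportional to client count with no intercept, can be sketched as below. The weekly figures are invented for illustration; the real SAR data is not shown in this post.

```python
# Sketch of a zero-intercept linear fit: cpu = k * clients.
# All numbers here are invented stand-ins for the stable weeks.
clients = [50, 52, 55, 53, 51]            # assumed weekly client counts
cpu_max = [40.0, 41.5, 44.0, 42.5, 40.8]  # assumed weekly max CPU (%)

# Least-squares slope with no intercept: k = sum(x*y) / sum(x*x).
k = sum(x * y for x, y in zip(clients, cpu_max)) / sum(x * x for x in clients)
residuals = [y - k * x for x, y in zip(clients, cpu_max)]
print(f"slope: {k:.3f} CPU% per client")
print(f"largest residual: {max(abs(r) for r in residuals):.2f} CPU%")
```

With low-variance data like the first seventeen weeks, the residuals stay small, which is what made the simple model look so convincing.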
The problems began in week #18, when the actual measurements fell far short of the expected ones. Product support said this was expected: take-up of the product was slow because the third tranche of 45 clients was still using an older version of the product and was in a longer transition.
This explanation seemed to be blown away when the next tranche of 100 clients was added and the CPU usage climbed in week #29. Again, product support told me that this was expected.
And of course, the following week (#30) told a different story. Again, it was expected. (Is it possible to surprise the product support group?)
I was rather annoyed by all of this because I knew the model was wrong. What I needed to find out was when and where the model was wrong.
If the clients are homogeneous in behaviour, then the model could be adjusted to:
This introduces the problem of how to explain why the product consumes one-eighth of a server for zero (0) clients!
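Refitting with an intercept term is a one-line change to ordinary least squares. In the sketch below the numbers are invented so that the intercept lands near 12.5%, i.e. one-eighth of a server at zero clients; the real fitted values would of course differ.

```python
# Sketch of an ordinary-least-squares fit with an intercept:
# cpu = intercept + slope * clients. Numbers are invented.
clients = [50, 100, 150, 200]
cpu_max = [42.5, 72.6, 102.4, 132.5]

n = len(clients)
xbar = sum(clients) / n
ybar = sum(cpu_max) / n
slope = sum((x - xbar) * (y - ybar) for x, y in zip(clients, cpu_max)) \
        / sum((x - xbar) ** 2 for x in clients)
intercept = ybar - slope * xbar
print(f"cpu = {intercept:.1f} + {slope:.3f} * clients")
```

A non-zero intercept like this is what forces the awkward question above: what is the product doing with an eighth of a server when nobody is using it?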
Or the model could be wrong in assuming that the clients are homogeneous in their use of the product. Without a way to classify the clients, it would be impossible to refine the model further.
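If per-client CPU figures were available (they are not in this post), one way to test the homogeneity assumption would be a simple two-group split: bimodal per-client usage would point to light and heavy client classes. The helper and the numbers below are hypothetical.

```python
# Hypothetical sketch: crude 1-D k-means with k=2 to split clients
# into "light" and "heavy" users by per-client CPU consumption.
def two_means(values, iters=20):
    """Return (low_group, high_group) after iterating two centroids."""
    lo, hi = min(values), max(values)
    low, high = [], []
    for _ in range(iters):
        low = [v for v in values if abs(v - lo) <= abs(v - hi)]
        high = [v for v in values if abs(v - lo) > abs(v - hi)]
        if low:
            lo = sum(low) / len(low)
        if high:
            hi = sum(high) / len(high)
    return low, high

per_client_cpu = [0.2, 0.25, 0.22, 0.8, 0.85, 0.21, 0.79]  # invented
light, heavy = two_means(per_client_cpu)
print(f"{len(light)} light clients, {len(heavy)} heavy clients")
```

Two well-separated groups like this would justify a model with a separate per-client cost for each class, rather than one shared slope.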