Monitor What Matters, Simplicity in a Sea of Numbers

2023 Nov 05 · 4 min

I’m hosting a Diwali party this year and there’ll be about 8 people in total.

The issue is I’ve only got 6 chairs, because, well… I usually only need a couple, so buying 4 extra felt like a very safe bet at the time.

And yet, here I am, wondering whether I need to buy 2 more chairs, or uninvite my 2 least favourite guests.

Both options have long-lasting implications, but one of them will continue to affect my life even after I’ve churned through all my friends. Tricky.

If only chairs had a Pay-As-You-Go option; charging only for the time we actually need them.

It’d be cheaper overall, we’d stub our toes less and I might even keep more friends. Maybe.

This kind of short-term need for capacity is why there’s such a demand for Cloud Computing.

Hardware is super expensive to purchase and maintain, so a Pay-As-You-Go solution is often the best choice.

Unlike chairs, however, where 1 chair = 1 person (usually… hopefully), 1 computer may support 10s or even 1000s of people.

This makes it much less obvious when you need to scale up and get more compute.

It’s commonplace to monitor the hardware’s metrics to see how the environment and the workload within it are performing. Some typical ones are:

- CPU and Memory usage
- Database Query Time
- Disk Utilisation

I have some issues with these kinds of metrics though.

- These metrics describe the environment, not the application. This makes it incredibly difficult to act on the information proactively unless you also know exactly how the application is doing. I feel this is a really common misunderstanding.
- They all need to be monitored in tandem. Any one of the metrics could hint at degraded performance, so if you haven’t got all the bases covered you might miss something important. This increases the overall solution complexity and the likelihood that something breaks.
- Few people understand them well. This silos the knowledge and creates a dependency on a specific team, hindering innovation and requiring a translation exercise whenever a scaling-related project becomes a business priority.

Okay, so what’s better than these industry standards? What metrics avoid the shortcomings I listed above?

Well, if we flip the shortcomings into requirements, we get the following:

Individually actionable metrics which describe the workload itself and resonate with the entire business.

I like to call a metric of this kind a Core Performance Indicator (CPI).

The issue now is that this new metric specification rules out all the typical industry standards, so what else can we use?

The CPIs we’re looking for are actually within the codebase itself.

They hide amongst the core application logic. Innocent little variables and objects, secretly holding a monopoly on all the diamond insights.

Consider these example applications and their Core Performance Indicators:

- Real-Time Chat app - Number of active WebSocket connections or messages per second
- E-commerce Website - Shopping cart operations per second or number of active shopping carts
- IoT Device Hub - Number of active devices or messages sent/received from devices

You can see how these metrics have meaning and value individually: they describe the workload itself, and the entire business understands them.

Every application is unique, and so is the way to scale it.
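As a rough illustration of scaling on a CPI instead of CPU or memory, the sketch below turns one workload number (active WebSocket connections, from the chat app example) into an instance count. The function name `desired_instances` and the capacity of 500 connections per instance are made-up assumptions for the example; a real figure would have to come from measuring your own application.

```python
import math


def desired_instances(active_connections: int,
                      connections_per_instance: int = 500,  # assumed capacity, not a real benchmark
                      min_instances: int = 1) -> int:
    """Return how many instances are needed to serve the current CPI value."""
    if connections_per_instance <= 0:
        raise ValueError("connections_per_instance must be positive")
    # Round up: a partly used instance still has to exist.
    needed = math.ceil(active_connections / connections_per_instance)
    # Keep a floor so the service never scales to zero.
    return max(min_instances, needed)


print(desired_instances(1200))  # 3
```

The point of the sketch is that the input is a number the whole business can read, and the scaling rule is a single line of arithmetic.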

As you’ve probably gathered, these workload-specific metrics aren’t available ‘out of the box’. They’re custom metrics which need to be set up in whichever monitoring framework you use.
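To make "custom metric" a little more concrete, here is a minimal sketch of tracking one CPI, the number of active connections, inside the application. The class `ActiveConnectionsGauge` and its method names are hypothetical and use only the Python standard library; in practice you would most likely use the gauge type your monitoring framework already provides and let it handle export.

```python
import threading


class ActiveConnectionsGauge:
    """Thread-safe count of currently open connections (a hypothetical CPI)."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._value = 0

    def connection_opened(self) -> None:
        with self._lock:
            self._value += 1

    def connection_closed(self) -> None:
        with self._lock:
            # Never report a negative count if a close event is duplicated.
            self._value = max(0, self._value - 1)

    def value(self) -> int:
        with self._lock:
            return self._value


gauge = ActiveConnectionsGauge()

# Simulate three clients connecting and one disconnecting.
for _ in range(3):
    gauge.connection_opened()
gauge.connection_closed()

print(gauge.value())  # 2
```

The two update methods would be called from the places in the codebase where connections are accepted and dropped, which is exactly where the CPI already lives.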

These would be determined and configured as part of a CPI Discovery project.

Before you hesitate and decide this is all too much work, I want to quickly explain some of the benefits…

It makes the following people happy:
- End User - better performance and experience
- Infrastructure Team - more efficient monitoring and fewer outages
- Finance Team - reduction in cloud costs

You’re doing this work already; the effort is just being spent on less efficient metrics, outages and the time taken to explain it all.

Simply carrying out this kind of project will bring so much more than a single metric. It will optimise scaling protocols, result in a more robust and cost-efficient solution, and undoubtedly uncover workload behaviours you never knew existed, allowing you to make better technical decisions going forward.

This may seem daunting, but with the right team and the right intentions, the challenge becomes a powerful learning process. It’s just about taking that first step.

Identifying Core Performance Indicators isn’t about picking what’s easiest to measure; it’s about understanding the core actions that affect your application’s performance.

In an upcoming post, I’ll give a step-by-step guide on how to do this right the first time, ensuring you’re not just measuring activity, but capturing the pulse of your application’s health.

If this all sounds like something you need right away, or if you simply don’t want to wait until your infrastructure starts to struggle, then just reach out to me and let’s have a chat about it!

Scaling is an art, not just a process. It requires consistent attention and fine-tuning, much like the balance of our chairs-to-friends ratio.

Are there any aspects of your application that could become a Core Performance Indicator?
How frequently are applications updated compared to their monitoring and scaling code?
Have you discovered an unconventional metric that transformed the way you scale? If yes, share it below and help spark innovation.

mds.coffee
