Proactive monitoring is the cornerstone of how we deliver outstanding systems reliability. Instead of reacting to customer complaints about a server being down or a service being unavailable, we seek to understand all the ways in which things may fail, and we test for them relentlessly. Our happiest moments are when our monitoring screens show nothing but healthy systems.
Of course, it’s impossible to keep things this way permanently. Systems and the ways they are used are extremely fluid: servers need regular upgrades, and they will be attacked from the outside world (and occasionally by unwitting trusted end users). Whatever the cause, the key is for us to detect anomalies before the end user does.
Instead of starting the clock on remediation after the end user reports a problem, we start fixing it as soon as it shows up on our monitoring screens.
So instead of our KPI being “time to resolve a ticket arising from a phone call/email” (aka resolution time), our KPI becomes the number of downtime reports, which we work to minimize. Eventually we reach the point where the number of daily calls about downtime is less than one, which it has been for years now, despite our managing thousands of virtual machines and services that support hundreds of thousands of end users.
That’s the Lightspeed miracle: changing the way we look at the problem lowers stress for the backend engineering team and delivers superior customer satisfaction (it helps that we’re efficient and therefore affordable too), resulting in the highest customer retention rate in the managed services business (hint: we’re twenty years old).
To paraphrase an old adage: “If a server fails and gets fixed before any end user realizes there was a problem, was it ever really down?”