Whether you're a video gamer, a sports nut, or a D&D fan, you're probably used to reading, calculating, and tracking scores. Simply put, keeping score is just a way to measure something — usually to determine a winner.
In DevOps, there are plenty of scores and measurements. Apdex, as used by New Relic, tracks user satisfaction with a website's response time. And APM has a slew of its own metrics and scores to help engineers keep up on application performance.
"If it doesn't matter who wins or loses, then why do they keep score?" —Vince Lombardi, football player, coach, and NFL executive
But what about scores and metrics for downtime prevention?
No Adequate Measure for Downtime Prevention Exists
The closest scores related to downtime that we know of are MTTR (mean time to resolution/recovery), MTTD (mean time to detection/discovery), and SLAs (service-level agreements). Each of these measures the reactive nature of alerts and notifications. Even combined, they are wholly inadequate for measuring downtime prevention. SLAs, for example, are intended to measure site usability. Instead, they measure whatever the company thinks it can get away with.
With the relaunch of Blue Matador this month, we're introducing to the world the first and only metric for measuring a team's responsiveness in fixing issues that lead to outages. We're calling it your Downtime Prevention Score™, and it's a single, reportable number that represents your ability to prevent downtime.
Built right into your Blue Matador dashboard, your DPS is measured each minute, and the dashboard also displays the last 30 days so you can see how you're trending. It's designed to give your CEO, CTO, VP of engineering, board, and other non-ops people greater visibility into how your app is maintaining uptime as your customers see it.
We hope it will be used as an OKR (objectives and key results), as a figure displayed on your company's TV status board, and as an internal figure for the engineering team to understand their own performance. Ultimately, it's designed to give greater peace of mind to all the stakeholders who use your app or service. Peace of mind is what Blue Matador is all about.
Creating the Formula
“That which is measured improves. That which is measured and reported improves exponentially.” —Karl Pearson, father of mathematical statistics
To make it easily understandable and instantly familiar, the DPS is represented as a number from 0 to 100. It's a percentage, really, but reported as a whole number and without a percent sign (%). The formula is simple, but scientific:
Note that total samples does not include P5 issues (see below). The formula is derived from the number of recommendations in each slice of your Prevention Timeline in any given minute. The weight for each slice is given below:
P1: "Now" issues
These issues hurt your score the most (P1 x -2) because they're critical and happening now; they represent current downtime.
P2: "Today" issues
While P1 issues actively detract from your score, P2 issues add to your score (P2 x 0.5), because there is still some time to prevent them. If the DPS measures your responsiveness to preventing downtime, then this figure means you're cutting it close, but there's still a chance to save customers from connectivity issues.
P3: "Tomorrow" issues
P3 represents even more time to fix the issue before it affects customers, which is why it contributes more to your score than P2 does (P3 x 0.75). There's an active buffer in P3s, and the higher weight reflects it.
P4: "One week" issues
Finally, P4 issues hurt you the least and are fully weighted (P4 x 1, essentially) because they're far enough out (up to 7 days) that you still have plenty of time to be proactive and plan for them before they affect your systems.
We omitted P5 issues from the DPS formula intentionally. These issues are up to 30 days away and are on the horizon. Given that you haven't done anything with them yet, we reasoned, P5s shouldn't help or hurt your score. So they're not weighted at all.
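The weights above can be combined into a small calculation. The following is a minimal sketch, not Blue Matador's implementation: it assumes the score is the weighted sum of P1 through P4 counts divided by the total number of P1 through P4 samples, scaled to 100, rounded, and clamped to the 0 to 100 range. The function name, the input shape, and the clamping and rounding steps are all assumptions made for illustration.

```python
# Weights per slice as stated in the article; P5 is deliberately excluded.
WEIGHTS = {"P1": -2.0, "P2": 0.5, "P3": 0.75, "P4": 1.0}


def downtime_prevention_score(counts):
    """Return a 0-100 whole-number score from per-slice issue counts.

    ASSUMPTION: weighted sum / total P1-P4 samples, scaled to 100,
    rounded, and clamped to 0-100. The exact formula may differ.
    """
    # Total samples ignores P5 (and any other unknown slice).
    total = sum(counts.get(slice_name, 0) for slice_name in WEIGHTS)
    if total == 0:
        # An account with no data starts at 0, per the article.
        return 0
    weighted = sum(WEIGHTS[s] * counts.get(s, 0) for s in WEIGHTS)
    percent = 100.0 * weighted / total
    # Clamp so that heavy P1 counts cannot push the score below 0.
    return int(round(max(0.0, min(100.0, percent))))


if __name__ == "__main__":
    sample = {"P1": 0, "P2": 2, "P3": 4, "P4": 4, "P5": 10}
    print(downtime_prevention_score(sample))
```

Under these assumptions, a minute with 2 P2 issues, 4 P3 issues, and 4 P4 issues (and no P1s) works out to (1 + 3 + 4) / 10, or a score of 80. The 10 P5 issues in the sample have no effect.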
In a nutshell, the DPS measures the time buffer you regularly have to prevent issues. It's not the likelihood of an issue occurring in the first place. It's how long you have to solve issues before they could turn into dreadful downtime for you and your customers.
Improving Your DPS
"What gets measured gets managed." —Peter Drucker, Austrian-American management guru
Every Blue Matador account starts out with a DPS of 0 because there is no data initially. But if you followed the onboarding steps correctly, you installed Blue Matador not only on your development environment but also your production one, so data should start flowing within minutes.
The first week or so of using Blue Matador will result in a fluctuating DPS number, and you'll see the figure stabilize over the following weeks and months as your team's habits and responsiveness in solving issues become apparent. But as with a GPA, you can improve your Downtime Prevention Score by performing better: in this case, by responding to recommendations coming down your Prevention Timeline.
How difficult is it to recover from a poor DPS number? Unlike GPAs, we made the DPS a figure you can improve without having to retake your computer science classes. This is because we can calculate your score at any time. In your dashboard, the large DPS figure is your current, or active, score.
The daily report of your scores (the smaller numbers under your active score) shows you how you're trending and might even reveal patterns in your team's responsiveness. (For example, do you typically have lower DPS numbers on Mondays? Perhaps it's time to assign engineers to fix problems on the weekend.)
When you respond to issues more quickly, your score is influenced more by P3s and P4s, which makes your scores go up each day.
We want the Downtime Prevention Score to become a part of your daily vernacular when using Blue Matador. A score of 100 is possible, but not common. Strive for scores of 90 or higher to maximize your uptime and reach greater levels of reliability for your customers while simultaneously providing more insight to other stakeholders in your company.
Ultimately, higher Downtime Prevention Scores indicate your team's growing maturity at solving downtime issues preventively, not reactively. And when you prevent downtime proactively, your company experiences peace of mind.
Want to track the DPS number for your own systems? Sign up for a free 30-day trial of Blue Matador's recommendation engine for preventing downtime.
Blue Matador Staff
We began with the goal of making monitoring truly proactive — not reactive. At Blue Matador, we provide peace of mind for DevOps professionals by enabling them to proactively monitor their infrastructure for the first time.