DevOps can be more than a full-time job. Thanks to on-call assignments and unexpected downtime, site reliability engineers also work nights and weekends, because they are tasked with maintaining uptime no matter when an outage strikes.
At Blue Matador, we get how hard that can be on you and your life because we've been there ourselves. In fact, our founder brought Lucid Software's servers back online the day his third son was born (it's an interesting story). Notifying your team about impending issues, so you have time to act ahead of them instead of reacting after the fact, is the impetus for our software.
Downtime breaks more than apps. It also breaks your sprint.
How do you plan for downtime when you only receive notifications after it has already started? Your team ends up spending story points fixing server issues when it had planned to develop, fix bugs, or add and test features.
Let's talk a little more about traditional monitoring alerts and why they're insufficient for preventing service interruptions in most cases.
Alerts vs. Recommendations
Traditional DevOps monitoring tools only alert you to issues you have already configured them to watch for. This means you either (a) need to experience downtime to know what to configure going forward, or (b) endlessly tweak alert thresholds so you're not over-notified and spammed by your own software.
The noise and hassle this creates make traditional alerting a poor fit for preventing outages you didn't know to set up alerts for in the first place. This is especially true because alerts typically come from issues that are happening now, or that happened yesterday or within the last month. We've found that monitoring tools rarely look forward. They're usually focused on presenting historical data (and they're really good at it, too). What you really want is for your software to show you future trends, and then to tell you how to plan for them.
Example of a recommendation from Blue Matador's recommendation engine.
Blue Matador's recommendations (as we call them instead of alerts) don't just alert you to a problem. They also tell you when it will occur and how you can solve it. Our recommendations are automatically configured for you the moment you install our agent on your production servers, and we actively check for more than 30 leading indicators of downtime.
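To make "tell you when it will occur" concrete, here is a minimal sketch of one way a tool could project a trend forward: fit a straight line to recent samples of a metric and extrapolate to a limit. This is an illustration only, not a description of how Blue Matador's recommendation engine works. The function name `hours_until_threshold`, the disk-usage numbers, and the choice of a simple linear fit are all assumptions made for the example.

```python
from typing import Optional, Sequence, Tuple


def hours_until_threshold(
    samples: Sequence[Tuple[float, float]], threshold: float
) -> Optional[float]:
    """Estimate hours from the last sample until a metric reaches `threshold`.

    `samples` holds (hours, value) pairs, for example disk usage in percent.
    Returns None when there is no upward trend to extrapolate.
    Hypothetical helper: a plain least-squares line, nothing more.
    """
    if len(samples) < 2:
        return None
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        return None  # all samples share one timestamp, so no slope exists
    # Ordinary least-squares slope and intercept of value over time.
    slope = sum((t - mean_t) * (v - mean_v) for t, v in samples) / var_t
    if slope <= 0:
        return None  # flat or improving, nothing to project
    intercept = mean_v - slope * mean_t
    crossing = (threshold - intercept) / slope
    # Report time remaining relative to the most recent sample.
    return max(0.0, crossing - samples[-1][0])


# Made-up data: disk usage grows from 70% to 74% over two days.
usage = [(0.0, 70.0), (24.0, 72.0), (48.0, 74.0)]
remaining = hours_until_threshold(usage, threshold=100.0)
print(f"Projected to fill in about {remaining / 24:.0f} days")
```

With this made-up data the disk is projected to fill in roughly 13 days, which is the kind of lead time that lets a fix be planned rather than rushed. Real metrics are rarely this linear, so treat the output as a rough estimate.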
Prevention Timeline: Your Prioritized Calendar with Actionable Recommendations
The reasons above are why we invented the Prevention Timeline, a DevOps-specific planning tool that's an integral part of your Blue Matador dashboard. Like a calendar, your timeline gives you an at-a-glance look at your upcoming events. Unlike your personal or work calendar, however, the events in the Prevention Timeline are downtime-causing issues projected ahead of time, up to 30 days in advance. It's a glimpse into your infrastructure's future, and keeping track of it helps you stay ahead of the curve.
Here's what it looks like, with each "slice" of your timeline representing a different priority level for upcoming issues. Each slice has its own rating, from P5 to P1, with lower numbers indicating more urgent issues to fix so your app doesn't suffer performance problems.
Your timeline helps you plan downtime prevention into your sprint right now. By planning proactive fixes into your sprints, your team gains more time to work on what really matters: your actual job. You know, the things listed in your job description that don't relate to fighting fires.
It works because we give you up to 30 days of lookahead for issues coming down the pipeline on your production environment. We automatically prioritize these issues for you based on when the downtime would occur if the issue is not addressed.
Prevention Timeline Slices
Let's take a look at each "slice" of the timeline in greater detail.
ONE MONTH (30D TO PREVENT)
Issues on the horizon appear here and need to be planned into your sprint to prevent downtime.
ONE WEEK (7D TO PREVENT)
P4 issues are great for visibility into infrastructure problems over the work week and weekend. If you have them taken care of, you can go into your weekend or vacation with greater peace of mind.
The recommendations in this slice are more urgent because you have less time to plan them into your current sprint. You may still have time to look at the thorough fixes in our troubleshooting tips for these recommendations. We also include short-term solutions in case it's 1 AM and you just want these issues to go away quickly.
TOMORROW (48H TO PREVENT)
Now that you've allocated time to address Blue Matador's P5 and P4 recommendations, focus on the highest priority first. You should fix issues a day in advance to maintain a healthy time buffer. That's why P3 recommendations, signifying issues that will occur tomorrow, are where you want to spend most of your time.
Recommendations in your P3 slice have become increasingly urgent. Downtime hasn't occurred yet, but your application's performance will suffer soon if you don't get this fixed quickly. These recommendations likely need your attention with short-term solutions from our troubleshooting tips.
TODAY (24H TO PREVENT)
These are extremely urgent incidents that will happen within the current day. Your time buffer is gone by P2. Fix them now to avoid downtime!
Since downtime is imminent, you won't have time to plan these issues into your sprint. Instead, just get cracking on fixing them before they turn into an outage. Definitely look at the short-term solutions in the troubleshooting tips for each recommendation in your P2 slice, as you don't have any time to delay.
The final slice, P1, holds issues that are happening now. Other metrics and dashboards are probably reflecting failures. Address these immediately, because customers are already suffering from downtime.
The quick-fix troubleshooting tips are your best bet for resolving these issues ASAP.
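The slices above amount to bucketing each issue by how long you have until downtime would occur. A minimal sketch of that mapping follows. The time windows come from the slice headings, but the function name `priority_slice`, the exact boundary handling (for example, whether exactly 24 hours counts as P2), and the sample recommendations are assumptions for illustration, not Blue Matador's implementation.

```python
from typing import Optional


def priority_slice(hours_to_downtime: float) -> Optional[str]:
    """Map projected hours until downtime to a timeline slice.

    Boundaries are assumed inclusive. Returns None for issues beyond
    the 30-day lookahead, which would not appear on the timeline.
    """
    if hours_to_downtime <= 0:
        return "P1"  # happening now
    if hours_to_downtime <= 24:
        return "P2"  # today
    if hours_to_downtime <= 48:
        return "P3"  # tomorrow
    if hours_to_downtime <= 7 * 24:
        return "P4"  # one week
    if hours_to_downtime <= 30 * 24:
        return "P5"  # one month
    return None


# Hypothetical recommendations: (description, projected hours until downtime).
recommendations = [
    ("Disk filling on db-1", 312.0),
    ("SSL certificate expiring", 40.0),
    ("Memory leak on web-3", 6.0),
]

# Most urgent first, the order in which you would work through them.
for name, hours in sorted(recommendations, key=lambda r: r[1]):
    print(priority_slice(hours), name)
```

Sorting by time to downtime puts the most urgent slice at the top of the list, which matches the advice above to work on the highest priority first.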
With our thorough active system checks, you will sometimes receive recommendations on your timeline for problems you didn't really know about before. This might especially be true if you're new to DevOps or a junior-level employee. For instance, we notify you about your virtual servers' CPU steal time, a common but often under-reported metric that indicates potential downtime. If you're unfamiliar with it, don't worry. We've got you covered with an incredibly helpful and succinct guide on how to resolve steal time and prevent it from happening again. We'll even tell you what's causing it.
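If you want to see steal time for yourself, here is a minimal sketch of how it can be measured on a Linux guest. The kernel exposes cumulative CPU counters in /proc/stat, where the eighth value on the aggregate "cpu" line is steal. The helper names `steal_percent` and `read_cpu_line` and the sample counter values are assumptions for this example, and this is not how the Blue Matador agent is implemented.

```python
def steal_percent(before: str, after: str) -> float:
    """Percent of CPU time stolen between two aggregate 'cpu' lines of /proc/stat.

    Field order after the 'cpu' label:
    user nice system idle iowait irq softirq steal guest guest_nice
    """
    def parse(line: str) -> list:
        fields = line.split()
        if not fields or fields[0] != "cpu":
            raise ValueError("expected the aggregate 'cpu' line from /proc/stat")
        return [int(x) for x in fields[1:]]

    a, b = parse(before), parse(after)
    # guest and guest_nice are already counted inside user and nice,
    # so the total uses only the first eight counters.
    total = sum(b[:8]) - sum(a[:8])
    if total <= 0:
        return 0.0
    return 100.0 * (b[7] - a[7]) / total


def read_cpu_line(path: str = "/proc/stat") -> str:
    """Read the aggregate CPU line (Linux only)."""
    with open(path) as f:
        return f.readline()


# Made-up samples taken some seconds apart on a busy virtual server.
first = "cpu  100 0 50 800 10 0 5 35 0 0"
second = "cpu  180 0 90 1500 20 0 10 200 0 0"
print(f"steal: {steal_percent(first, second):.1f}%")
```

On a real server you would call `read_cpu_line()` twice, a few seconds apart, and pass both lines to `steal_percent`. A value that stays high means the hypervisor is giving CPU time your server asked for to other tenants.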
Use Your Prevention Timeline for Greater Freedom Outside Your Job
The Prevention Timeline helps you plan what you can do right now to prevent downtime. When you can trust that your planning is accurate, you gain greater freedom in your job. Even if you're on call, you can use your Prevention Timeline to see what issues, if any, would appear during your shift. You can fix them ahead of time (at your desk during regular work hours) and go home knowing that everything will work.
We hope you will use your Prevention Timeline as part of your daily, monthly, and sprint planning. As you work our recommendations into your sprints, and start fixing the underlying issues ahead of time, you'll find greater freedom to go home to your friends or family at night or on the weekends. Relax, because you know everything's alright thanks to your Prevention Timeline.
Want to see the Prevention Timeline queue up specific recommendations for your unique production environment? Try it free for 30 days.
Blue Matador Staff
We began with the goal of making monitoring truly proactive — not reactive. At Blue Matador, we provide peace of mind for DevOps professionals by enabling them to proactively monitor their infrastructure for the first time.