25 April, 2017
How ‘Essential Alerts’ Could Have Saved the Millennium Falcon; How It Could Save Your Servers
Remember how the Millennium Falcon’s hyperdrive was constantly breaking, all the way from Star Wars Episode IV to VI? Han Solo and Chewbacca could have benefited from having system alerts (instant alerts about critical hardware and software system components) installed on their “fastest hunk of junk in the galaxy.” (And yes, at Blue Matador, we do talk about Star Wars like it’s a documentary, not a sci-fi movie.)
Essential Alerts™ (as we call them) are a critical part of an overall DevOps monitoring strategy. Or at least they should be. Also called “metric,” “core,” “system,” or “hardware” alerts by others, these crucial notifications let you know if your servers’ CPU usage spirals out of control, disk space is dangerously low, increased latency is clogging the pipes, or you’ve just “got a bad feeling about this.” They’re a great complement to Lumberjack’s log alerts, which tell you something’s wrong with your servers and applications.
So here’s an honest question: Why don’t monitoring tools include these types of alerts out of the box? We all want to have high uptime on our servers, and these are the most common culprits for failure, so why aren’t metric alerts set up for you automatically? Just think: in 2017, do you really want to be spending time setting up alerts on swap usage? You see, as much as we like system and app logs here at Blue Matador, we haven’t ignored the crucial system metrics that indicate the health of the hardware your servers run on, either.
In a nutshell, monitoring platforms should have one goal: Increase uptime. Unless you’re a hawkish daytrader, you don’t monitor for monitoring’s sake. As a DevOps engineer, you want to be able to detect errors and prevent problems from affecting your server so you can go home at the normal end of your day to your family/friends/Minecraft server and enjoy the rest of your evening. And on the lowest level, Lumberjack’s Essential Alerts help you achieve that goal.
Having used Splunk, Loggly, Datadog, Nagios, and more, we have experience o’ plenty with monitoring tools at Blue Matador. We’ve noticed a thing or two about hardware alerts in the logging and monitoring space: They’re never configured for you out of the box.
Case in point: We make fun of Splunk a bit here on this blog, and we don’t intend to do so too much, but there’s a reason we’re calling them out again right now. The most expensive logging solution on the planet offers these system alerts but doesn’t even bother to set up baseline alerts for your hardware automatically after installation.
Here’s a story that illustrates why this is important. A couple of our engineers are alumni of Lucid Software, a software company that makes cloud-based chart and page layout software à la Google Docs. Lucid relies on Linux servers that generate PDF exports for users wanting to save and share their creations.
One day, customers started complaining about exporting errors. The DevOps team (made up of our humble founders and others) spent copious amounts of time investigating and troubleshooting the issue, all while customers went without export capabilities. It turned out that the failed exports were caused by faulty code that wasn’t deleting old PDFs from the servers’ hard drives like it should have been, so the PDF mount point filled up. The logging solution used at the time didn’t report any errors because there was nothing wrong in the logs. The problem could instead have been detected by system metrics that monitor server disk usage, but (you guessed it) those hadn’t been set up on that particular mount point.
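To make the story concrete, here is a minimal sketch of the kind of disk-usage check that would have caught the full mount point. It is not Lumberjack’s implementation; the function names, the example path, and the 90% threshold are assumptions chosen purely for illustration. It relies only on `shutil.disk_usage` from the Python standard library.

```python
import shutil


def disk_usage_percent(path):
    """Return the percentage of space used on the file system holding path."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:  # some pseudo file systems report no capacity
        return 0.0
    return 100.0 * usage.used / usage.total


def check_mount(path, threshold_percent=90.0):
    """Return (alert, percent): alert is True once usage reaches the threshold.

    The 90% default is a hypothetical value, not a recommended setting.
    """
    percent = disk_usage_percent(path)
    return percent >= threshold_percent, percent


if __name__ == "__main__":
    # "/" is only an example; point this at the mount you care about.
    alert, percent = check_mount("/")
    status = "ALERT" if alert else "ok"
    print(f"{status}: / is {percent:.1f}% full")
```

Run on a schedule against every mount point, not just the root volume, a check like this flags a filling disk even when the application logs look perfectly healthy.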
We know what this kind of frustration feels like, so we are building Lumberjack, our AI-powered centralized log management solution, with out-of-the-box Essential Alerts included. After installing Lumberjack (it only takes 23 seconds on Linux or Windows), metric alerts with finely tuned defaults are automatically configured for you, so you’ll never be surprised by running out of inodes ever again.
Here’s a non-exhaustive list of the kinds of Essential Alerts Lumberjack configures for you:
- CPU usage & system load
- Memory allocation & usage
- Disk bandwidth, latency, & usage
- File systems (inodes)
- Software interrupts
- Swap activity & usage
- Network bandwidth & latency
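As a rough illustration of what “configured for you” means for a list like this, the sketch below compares a set of metric readings against default thresholds and reports which ones are breached. Everything here is hypothetical: the metric names, the threshold values, and the helper functions are assumptions for the example, not Lumberjack’s actual defaults or API. The inode helper uses `os.statvfs`, which is available on Unix-like systems only.

```python
import os

# Hypothetical default thresholds, chosen only for illustration.
THRESHOLDS = {
    "cpu_percent": 90.0,
    "memory_percent": 90.0,
    "disk_percent": 85.0,
    "inode_percent": 85.0,
    "swap_percent": 50.0,
}


def inode_usage_percent(path):
    """Percentage of inodes in use on the file system holding path (Unix only)."""
    stats = os.statvfs(path)
    if stats.f_files == 0:  # some file systems do not report inode counts
        return 0.0
    return 100.0 * (stats.f_files - stats.f_ffree) / stats.f_files


def breached(readings, thresholds=THRESHOLDS):
    """Return the sorted names of metrics at or above their threshold.

    Readings without a configured threshold are ignored.
    """
    return sorted(
        name
        for name, value in readings.items()
        if name in thresholds and value >= thresholds[name]
    )
```

For example, `breached({"cpu_percent": 95.0, "swap_percent": 10.0})` flags only the CPU reading. A real agent would also have to collect the readings and revisit the thresholds over time, which is the part the alerts above are meant to take off your plate.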
And because our smart agent is written entirely in lightning-quick Go, it’s extremely efficient on your server and has lower overhead than our closest competitors (<2%). After all, we don’t subscribe to Heisenberg’s uncertainty principle.
We know that you’ll find out about errors before your customers do, which is not how it went for us with our logging solution at Lucid. Here are some other use cases where Lumberjack’s metric alerts will save you time at work (and your nights at home):
Not a backend person, but need some robust alerts? Use a single program and monitor everything on your servers.
No time to constantly update alerts & thresholds? We do it automatically for you.
Already have basic system alerts, but need something more substantial? Use centralized logging, which aggregates all your logs in one place.
We’re excited to bring Lumberjack to market this year so you don’t have to monitor your systems like a hawk, freeing you up to do more of what you love to do in DevOps. Want to be a part of our beta? Request your invite here.