4 August, 2017
What We Learned Using Watchdog to Monitor Our Own Servers
At Blue Matador, we build products that we know to be useful and use ourselves. Our Watchdog server monitoring product is no exception. Since our beta, we have been using our own monitoring software to, well, monitor our own infrastructure.
Here are a few examples of what we learned. They show how Watchdog provides immediate value to organizations of every size, from small businesses to large enterprises.
Running Out of Inodes
Most people have never even heard of inodes. Most technical people may have heard of them, but have probably never actually run out of inodes on a system. Our entire philosophy behind Watchdog is to automatically detect the things you didn’t even know could be problems affecting your servers.
We experienced this firsthand when one of our servers came close to running out of inodes. Our Smart Agent caught the issue on a server with 95% inode utilization, and we had time to correct the problem before it became a production issue.
The root cause? Ubuntu automatic updates, which had left behind packages that were no longer needed. On this particular server, the fix was as simple as running
sudo apt-get autoremove
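To make the inode check concrete, here is a minimal sketch of how inode utilization can be measured on a Unix-like system with Python's standard `os.statvfs` call. It is an illustration only, not how the Smart Agent is implemented; the function names and the 90% threshold are assumptions chosen for the example.

```python
import os


def inode_utilization(total_inodes, free_inodes):
    """Return the percentage of inodes in use.

    Some filesystems report zero total inodes; treat that as 0% used.
    """
    if total_inodes <= 0:
        return 0.0
    return 100.0 * (total_inodes - free_inodes) / total_inodes


def check_inodes(path="/", threshold_pct=90.0):
    """Return (used_pct, alert) for the filesystem containing `path`.

    `threshold_pct` is a hypothetical alert level, not a Watchdog setting.
    """
    stats = os.statvfs(path)  # f_files = total inodes, f_ffree = free inodes
    used_pct = inode_utilization(stats.f_files, stats.f_ffree)
    return used_pct, used_pct >= threshold_pct
```

Calling `check_inodes("/")` returns the current utilization for the root filesystem and whether it crosses the threshold. The same numbers are available on the command line from `df -i`.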
CPU Steal Time
Another rarely seen issue is CPU steal time: time during which your virtual server was ready to run but had to wait because the hypervisor was serving another virtual machine on the same physical host. Our servers are all Amazon EC2 instances, meaning they run on shared hardware and can experience CPU steal time. We recently saw this on some servers that were running on m3.medium instances.
During our beta, these instances were overprovisioned and an m3.medium was appropriate. Recently, however, we have seen an increase in traffic overall and these servers had to be upgraded. Because we received alerts from Watchdog about CPU steal time, we were able to determine that we should upgrade to m4.large instances to give more dedicated CPU to the servers.
This would not have been possible when just monitoring load or user and system CPU, as most other monitoring tools do, because those metrics were staying mostly flat. If you are experiencing CPU steal time but are not utilizing the CPU very much, then it is possible that the physical host has been oversubscribed by your cloud provider. The fix in that case is as simple as stopping and then starting your VM so it moves to a different physical host.
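For readers who want to see where steal time comes from, the sketch below computes it on Linux from two samples of the aggregate `cpu` line in `/proc/stat`, where steal is the eighth counter. This is an illustrative example under that assumption, not Watchdog's implementation; the function names are made up for the example.

```python
def parse_cpu_line(line):
    """Parse the aggregate 'cpu' line from /proc/stat.

    Returns the first eight counters:
    user, nice, system, idle, iowait, irq, softirq, steal.
    Guest time is skipped because it is already counted in user and nice.
    """
    fields = line.split()
    if not fields or fields[0] != "cpu" or len(fields) < 9:
        raise ValueError("expected an aggregate 'cpu' line with a steal field")
    return [int(value) for value in fields[1:9]]


def steal_percent(before_line, after_line):
    """Return steal time as a percentage of all CPU time between two samples."""
    before = parse_cpu_line(before_line)
    after = parse_cpu_line(after_line)
    deltas = [new - old for old, new in zip(before, after)]
    total = sum(deltas)
    if total <= 0:
        return 0.0
    return 100.0 * deltas[7] / total  # index 7 is the steal counter


def read_cpu_line(path="/proc/stat"):
    """Read the aggregate 'cpu' line (Linux only)."""
    with open(path) as stat_file:
        return stat_file.readline()
```

Taking one `read_cpu_line()` sample, waiting a few seconds, taking a second, and passing both to `steal_percent` gives the steal percentage over that interval. A value that stays well above zero while user and system CPU remain flat matches the situation described above.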
Increased Resource Usage
Just this morning, Watchdog alerted our ops team about increased context switches, load, RAM usage, and processes launched on one of the nodes in our Kubernetes cluster. Watchdog was able to detect this anomaly because it knows the normal usage of these resources on this host for Monday mornings, and it saw a huge difference compared to the previous Monday.
The root cause? A different node in our cluster had gone offline and all of its running containers got moved to this one, causing an increase in resource usage. This is a perfect example of an issue that did not cause downtime immediately but would have if left unattended. Since Watchdog was able to alert us about an unexpected change in our system’s behavior, we were able to quickly resolve the issue with no customer impact.
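The idea of comparing a metric against what is normal for the same time of the week can be sketched in a few lines. The example below is a generic z-score check against readings from the same weekly slot in previous weeks. It is an assumption made for illustration, not a description of Watchdog's actual detection algorithm, and the function name and threshold are hypothetical.

```python
from statistics import mean, stdev


def is_anomalous(history, current, z_threshold=3.0):
    """Flag `current` if it deviates strongly from the same weekly slot's history.

    `history` holds readings from the same slot in earlier weeks, for example
    context switches per second on previous Monday mornings.
    """
    if len(history) < 2:
        return False  # not enough data to estimate what is normal
    baseline = mean(history)
    spread = stdev(history)
    if spread == 0:
        return current != baseline
    return abs(current - baseline) / spread > z_threshold
```

With a history such as `[1000, 1100, 950, 1050]`, a new reading of `4000` is flagged while `1020` is not. Keeping a separate history per host, per metric, and per weekly slot is what lets a check like this distinguish a busy Monday morning from a genuine change in behavior.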
Watchdog is Free Forever
I know these are only three examples of how Watchdog helped us out, but the value is clear. Each of these issues was missed by other monitoring tools on our systems and would have led to hard-to-debug issues within a few weeks.
Couple Watchdog’s zero-configuration setup with the fact that it is free forever, and you have a winning monitoring solution that is guaranteed to save you the embarrassment of a production outage.
Install Watchdog for free on all your servers right now and see what notifications you’ve been missing out on. Couple it with a free 14-day trial of Lumberjack centralized log management to get predictive alerts from your log entries as well.