No false positives

One of the worst things about monitoring systems is false positives.

If the rate of false positives is high enough (and it doesn't take much), you stop trusting your monitoring system and start doubting the failures it reports.


We used various other monitoring systems (Nagios, Zabbix, hosting providers' monitoring systems, etc.) before developing MonitLab.

We suffered from far too many false positive notifications.
A notification must mean “something is definitely terribly wrong – wake up and fix it!”.

To put it humbly, we believe that MonitLab achieves that by reducing the false positive rate to a low enough value (if not to 0% for the general use case).

 

How do we decrease monitoring false positive rates?

 

MonitLab eliminates monitoring false positives and only delivers notifications when real failures happen by employing several different techniques:


Monitoring from outside

 

We've had clients who ran a ready-made Nagios setup themselves to monitor their own infrastructure. This is a terrible idea: it leads either to a huge number of false positives or to notifications not arriving at all when things go horribly wrong.

Fact #1: you'd better not run a monitoring system on the same infrastructure that you want monitored.

Fact #2: you'd better not make your technical/operations team responsible for the monitoring system as well.

Fact #3: you can't trust that your monitoring system is working unless you have a 2nd or 3rd monitoring system watching over it.

By design, using MonitLab means that the monitoring system is not part of your network/infrastructure, is not affected by your team, and is someone else's problem to worry about.

MonitLab is our core business, so naturally, we make sure it's running and it's doing its job well.

If you're running Nagios and looking for an alternative that is easy to switch to, check out our Nagios compatibility article.


Checking from multiple locations

 

Monitoring from a single location leads to false positives whenever the connectivity between your server and the check location experiences temporary problems.

MonitLab monitors from multiple locations at the same time.

You decide which of our check locations should be monitoring your service (website, email server or other type of check).
The more locations you select, the more information we gather and the better we perform.

We use a consensus (voting) protocol to come up with a representative status based on multiple results coming from different places on Earth.

Only if enough check locations report a problem do we consider your service to be failing and make some noise to tell you about it.
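To make the voting idea concrete, here is a minimal sketch in Python. It is not MonitLab's actual protocol: the function name `consensus_status`, the location names, and both thresholds (`min_failures`, `failure_ratio`) are assumptions chosen purely for illustration.

```python
def consensus_status(results, min_failures=2, failure_ratio=0.5):
    """Combine per-location check results into one status.

    results: dict mapping a location name to True (check passed)
             or False (check failed).
    The thresholds are illustrative assumptions, not MonitLab's values.
    """
    if not results:
        # No location reported anything, so no decision can be made.
        return "UNKNOWN"

    failures = sum(1 for ok in results.values() if not ok)

    # Require both an absolute number of failing locations and a
    # majority of the reporting locations before declaring DOWN.
    enough_absolute = failures >= min_failures
    enough_relative = failures / len(results) > failure_ratio

    return "DOWN" if enough_absolute and enough_relative else "UP"


# One location with a connectivity hiccup is not enough to alert.
print(consensus_status({"frankfurt": True, "new-york": False, "tokyo": True}))

# Two of three locations agreeing on a failure is.
print(consensus_status({"frankfurt": False, "new-york": False, "tokyo": True}))
```

Requiring both an absolute count and a ratio is one possible design: a single failing location can never trigger a notification on its own, no matter how few locations are selected.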


Check-infrastructure health monitoring


We constantly monitor our monitoring system to ensure that our check infrastructure is not facing any problems.

MonitLab is a distributed and horizontally scalable system: we are prepared to run tens, hundreds, or however many check servers are necessary in a single geographic area to handle all the “checking” we have to do for you.

We work closely with our server and internet providers to ensure we're getting the best service possible.


Check-infrastructure self-health evaluation

 

With other monitoring systems (especially ready-made ones), you may get a failure notification for one of two reasons:

  • your service is actually failing (e.g. your website is DOWN)
  • the monitoring/checking server is having problems of its own that make it think your service is down (e.g. your website is UP, but the system incorrectly believes it's DOWN) – a false positive

At MonitLab, we've worked hard to eliminate this false positive monitoring scenario.

Our check servers opt out of casting a success/failure vote if they decide they're not competent to make that decision at the given moment.

Each check server continuously evaluates its own health and can detect when it's having problems that would prevent it from casting a reliable success/failure vote.
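The opt-out behaviour can be sketched roughly as follows. This is an illustration under stated assumptions, not MonitLab's implementation: the `Vote` enum, the `cast_vote` function, the idea of probing known-good reference endpoints as a self-test, and the `min_healthy_ratio` threshold are all hypothetical.

```python
from enum import Enum


class Vote(Enum):
    SUCCESS = "success"
    FAILURE = "failure"
    ABSTAIN = "abstain"


def cast_vote(target_ok, reference_results, min_healthy_ratio=0.8):
    """Decide what a single check server should vote.

    target_ok: True if the check of the monitored service passed.
    reference_results: list of booleans from a hypothetical self-test,
        e.g. probing endpoints that are known to be reachable.
    min_healthy_ratio: assumed threshold below which the server
        considers itself unfit to judge.
    """
    if not reference_results:
        # No self-test data at all: the server cannot vouch for itself.
        return Vote.ABSTAIN

    healthy_ratio = sum(reference_results) / len(reference_results)
    if healthy_ratio < min_healthy_ratio:
        # The server's own connectivity looks broken, so a failed
        # check says more about the server than about the service.
        return Vote.ABSTAIN

    return Vote.SUCCESS if target_ok else Vote.FAILURE


# Healthy server, failing service: a real failure vote.
print(cast_vote(False, [True, True, True, True, True]))

# Unhealthy server, "failing" service: the server stays out of the vote.
print(cast_vote(False, [True, False, False, True, False]))
```

An abstaining server contributes neither a success nor a failure, so only servers that trust their own view of the network influence the final status.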



With all these measures in place, we can't remember the last time we received a failure notification that turned out to be a false positive.