How CheckMK predictive monitoring works

Posted on Sun 07 October 2018 in Articles

Introduction

Recently I have been working with members of our NOC (network operations center) on improving the monitoring tooling and processes for some of our production systems. One problem we have run into is overly sensitive (noisy) alerts. This is partly because many of our alerts are based on a metric value (CPU usage, request rate, etc.) exceeding a threshold. These thresholds can be difficult to set properly, and it is often hard to determine what "normal" looks like for a metric until the system has been live for a few weeks.

The primary monitoring tool we use at Ionic Security is CheckMK, an open source monitoring suite based on Nagios. I wondered whether CheckMK had tools or settings that could help, and it turns out it has a feature called "Predictive Analytics" designed for exactly this problem. Since I spend a good amount of my time working on anomaly detection systems at Ionic, I was interested in learning how CheckMK's solution works and what its shortcomings might be. This article covers some of what I learned.

Walkthrough of the core functions used in predictive analytics

The core prediction code can be found in the public CheckMK git repository here: http://git.mathias-kettner.de/git/?p=check_mk.git;a=blob;f=cmk_base/prediction.py. Everything that follows is my interpretation of the contents of this file.

CheckMK's metric system is based on pnp4nagios, which uses RRDs (round robin databases) to store data. RRDs are a very interesting storage mechanism for time series data, but how they work is beyond the scope of this article. The most important detail to know is that they store older data with less precision than newer data. These RRD files are used as the source of metric data.

Stepping through the compute_prediction function, we see the following steps:

  1. Data is loaded from RRD files into a 2 dimensional list named slices.
    • The loop covers the window from from_time back to the start time (horizon days in the past).
      • The loop moves from more recent data to older data so that the first row of data it accumulates has the smallest timestep (most accuracy).
      • This is necessary because of the compression of older points in a time series in RRDs.
    • The loop proceeds through time at a granularity dependent on the user-specified periodicity of the data.
      • The options are minute, hour, day, and wday.
        • minute means that data is bucketed per minute of the hour. This is really just for testing.
        • hour means that data is bucketed per hour of day.
        • day means that data is bucketed per day of the month.
        • wday means that data is bucketed per day of the week.
    • The slices list built by this loop has two dimensions.
      • The first dimension is the index of the slice of data (e.g. data for a single day).
      • The second dimension is a set of tuples where the first 2 elements describe the shape of the data and the 3rd element is the data points themselves.
  2. A second loop iterates over the data points in the first row of slices, creating a list named consolidated (a code sketch of this step follows the walkthrough).
    • The function grabs the point from each slice at the same time offset into that slice.
      • Since not all slices have the same number of points, slices for older time ranges are interpolated to calculate the value at the requested offset.
    • Basic statistics (min, max, mean, stddev) are calculated for the N points at the same offset in each slice and appended to consolidated.
      • These stats tell us about the properties of a single second of a metric.
      • For example, one of these points in consolidated will contain stats for the last N Mondays at 17:20:25 if periodicity is set to wday.
  3. The function returns a dictionary with the consolidated list of per-second statistics in the field points.

Thus at the end of this we have stats for every second at the user-specified periodicity. These stats tell us what is normal for any given second of a metric.
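
To make the consolidation step concrete, here is a minimal sketch of the idea in plain Python. This is not the actual CheckMK code: the real implementation interpolates older, coarser slices rather than taking the nearest sample, and the names and tuple layout here are illustrative.

    # Minimal sketch of the consolidation step (illustrative, not CheckMK's code).
    # Each slice is a list of samples for one period (e.g. one past Monday when
    # periodicity is wday); slices[0] is the most recent and has the most points.

    def consolidate(slices):
        """Compute per-offset statistics across N historical periods."""
        consolidated = []
        num_points = len(slices[0])  # the newest slice defines the resolution
        for offset in range(num_points):
            points = []
            for s in slices:
                # Map this offset onto a possibly coarser, older slice.
                # (The real code interpolates instead of taking the nearest sample.)
                value = s[offset * len(s) // num_points]
                if value is not None:
                    points.append(value)
            if not points:
                consolidated.append((None, None, None, None))
                continue
            average = sum(points) / float(len(points))
            stddev = (sum((p - average) ** 2 for p in points) / float(len(points))) ** 0.5
            consolidated.append((average, min(points), max(points), stddev))
        return consolidated

With wday periodicity, slices would hold one list per past Monday (for example), and consolidated would end up with one tuple of statistics per time offset within the day.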

These predictions are used inside the function get_levels, which performs the following steps.

  1. Calculates the paths where the prediction should be stored for a given metric.
  2. Attempts to load the prediction data.
    • This load works by simply evaling the contents of a file.
    • The prediction will be recalculated if:
      • The prediction file doesn't exist.
      • The prediction file is corrupt or in an out of date format.
      • The prediction is out of date.
        • The recalculation interval is predefined for each periodicity option (minute, hour, day, wday) in the constant prediction_periods.
    • The function also cleans up any unused prediction files that may be sitting around (e.g. from different parameter settings for prediction).
  3. (Optionally) recalculates the prediction.
    • This is a call to compute_prediction.
    • The raw contents of the dictionary returned by compute_prediction are written into a file (which is why the contents can be loaded via eval).
  4. Looks up the data for the time point being analyzed in the prediction, and loads the average and stddev values.
    • As far as I can tell the values for max and min are never used again after they are calculated.
  5. Calculates admissible levels (warning and critical) for upper and lower bounds.
    • There are 3 methods that can be used to calculate levels (sketched in code below).
      • absolute = level is the average plus a constant value
      • relative = level is the average times a constant percent
      • stddev = level is the average plus a multiple of the standard deviation
  6. Returns these levels to be used by the alerting and plotting systems in CheckMK.

Thus at the end of the process what we get is a system that stores, on disk, statistics about the historical data of each metric series configured for predictive monitoring, re-computing those statistics as the data ages out. These statistics are loaded and used to calculate bounds for new values in the metric series according to user-configured parameters.
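
To illustrate step 5, here is a rough sketch of the three level-calculation methods. The function name, signature, and parameters are my own simplification rather than the actual CheckMK API; it only shows how a predicted average and stddev become warning/critical thresholds.

    # Rough sketch of the three level-calculation methods (illustrative only;
    # the real code also handles lower bounds, missing data, unit scaling, etc.).

    def sketch_levels(reference_avg, reference_stdev, how, warn, crit, upper=True):
        """Turn a predicted average/stddev into (warning, critical) thresholds."""
        sign = 1 if upper else -1
        if how == "absolute":
            # Level is the average plus (or minus) a constant value.
            return (reference_avg + sign * warn,
                    reference_avg + sign * crit)
        if how == "relative":
            # Level is the average scaled by a constant percentage.
            return (reference_avg * (1 + sign * warn / 100.0),
                    reference_avg * (1 + sign * crit / 100.0))
        if how == "stddev":
            # Level is the average plus (or minus) a multiple of the stddev.
            return (reference_avg + sign * warn * reference_stdev,
                    reference_avg + sign * crit * reference_stdev)
        raise ValueError("unknown levels method: %s" % how)

    # Example: warn at +2 sigma and go critical at +3 sigma above the prediction.
    warn_level, crit_level = sketch_levels(50.0, 4.0, "stddev", 2, 3)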

Opportunities for enhancement

First, there are a few things that seem a bit odd about this system.

  1. Evaling files to load data.
    • I don't remember ever seeing this before now, and I am actually a bit surprised it works.
    • Since this is just a basic Python dictionary, something like JSON would be more expected (see the sketch after this list). A fancier system could use mmap'd files.
    • I am curious about the relative overhead of this approach vs. JSON. Lightweight compression might even make sense if loading is IO-dominated (though I suspect it is mostly CPU-bound because of the need to parse the data on load).
  2. All data must be loaded for a single prediction to be computed.
    • Breaking the prediction storage up into smaller pieces would mean that the system would have to load much smaller files.
    • For example, if data is stored in per-hour slices, we would need to load only ~0.6% (= 1/(24*7)) of the data needed for wday calculations.
    • This would mean more IO when data is recomputed, and more small files to manage, but it seems like this would be worthwhile for a read-mostly system.
  3. Predictions are done for 1 data point at a time.
    • Batching seems like it would make sense, especially since predictions are very fast to calculate once the data is loaded.
  4. No tests.
    • I don't see any test code for this fairly complex series of operations in the CheckMK code base.
    • The code would be fairly hard to test well because the functions are not very granular (they mix IO, business logic, failure checks, etc.).
    • Breaking these giant functions into smaller well-named pieces would make the code easier to understand and easier to test.
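
As an example of how small the change for point 1 could be, here is a hypothetical JSON-based save/load. The function names, file layout, and error handling are my own assumptions, not CheckMK's:

    import json
    import os

    # Hypothetical replacement for the eval-based persistence: store the
    # prediction dictionary as JSON and treat unreadable files as "needs rebuild".

    def save_prediction(path, prediction):
        tmp_path = path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump(prediction, f)
        os.rename(tmp_path, path)  # swap in one step so readers never see partial data

    def load_prediction(path):
        try:
            with open(path) as f:
                return json.load(f)
        except (OSError, IOError, ValueError):
            return None  # missing or corrupt: the caller recomputes the prediction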

Second, given what I know about other similar anomaly detection systems, there are a few things that could be quickly improved with minimal code changes.

  1. Use robust estimators.
    • It is now common to use median / MAD for monitoring applications instead of mean / standard deviation because the robust estimators are less sensitive to outliers.
    • Adding this as an option (i.e. letting the user select which to use) would be quite simple; a sketch follows this list.
  2. Apply smoothing to the prediction values.
    • A simple rolling average would smooth out the data points and make the series less prone to sporadic jumps.
    • Not all data benefits from smoothing (and it can be dangerous to smooth the prediction but not the incoming data), but a small moving window of ~30 seconds would likely just make predictions more consistent.
    • The stddev/MAD values could be smoothed in addition to the average/median values.
    • The user could select the sliding window size, and this could be applied to the consolidated list in compute_prediction as a post-processing step (see the second sketch below).
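
Here is a minimal sketch of the robust estimators from point 1. The helpers are my own; they just show how a median and MAD could stand in for the mean and standard deviation computed over the same points:

    # Sketch of point 1: median and MAD (median absolute deviation) as drop-in
    # analogues for the mean and standard deviation.

    def median(values):
        ordered = sorted(values)
        n = len(ordered)
        mid = n // 2
        if n % 2 == 1:
            return ordered[mid]
        return (ordered[mid - 1] + ordered[mid]) / 2.0

    def mad(values):
        m = median(values)
        # The 1.4826 factor rescales the MAD so it is comparable to a standard
        # deviation for normally distributed data.
        return 1.4826 * median([abs(v - m) for v in values])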
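
And a minimal sketch of the smoothing from point 2, applied to a list of per-offset predicted averages as a post-processing step. The default window size and the handling of missing values are assumptions on my part:

    # Sketch of point 2: a centered moving average over the predicted values.

    def smooth(values, window=30):
        """Apply a centered moving average with the given window (in samples)."""
        half = window // 2
        smoothed = []
        for i in range(len(values)):
            lo = max(0, i - half)
            hi = min(len(values), i + half + 1)
            neighborhood = [v for v in values[lo:hi] if v is not None]
            smoothed.append(sum(neighborhood) / len(neighborhood) if neighborhood else None)
        return smoothed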

However, even with all this there are a few things I really like about the current solution.

  1. Does all its work in 1 loop.
    • Ordinarily I would expect that, once you find data that is out of date, you would put an item on a queue to kick off a background process to rebuild it, keeping the read path fast.
    • However this means you have to handle missing/malformed and out of date records separately.
    • Even though writes and reads happen in the same code path, keeping everything in one loop makes the logic very easy to follow.
  2. Results are easily interpretable.
    • Your baseline value is always the average of the last N values for a given time offset into a cycle.
    • This is very easy to understand, and it makes the system less surprising.

Summary

The predictive analytics feature of CheckMK is a fairly simple system (~300 LOC) based on simple statistics over historical data. There are some odd parts of the implementation that go against best practices for the design of more modern analytics systems, and there are several opportunities for improvement. Still, the system has been used by thousands of organizations to solve real business problems.

The biggest point I personally took away from this research is a reminder that, even for complex problems, sometimes simple solutions are good enough to do the job.