7/27/18

Every once in a while, my brother throws some interesting consulting work my way. A few days ago he gave me "flapping nodes". While it sounds like a medical condition, "flapping nodes" is actually a problem system administrators can face when they are monitoring their systems. If a node has events that are on, then off, then on, off, on, off, etc., and sending out notifications at each off/on, it is a nuisance because they are flooded with messages. There can also be several thousand nodes to monitor! One could use predeterminded cutoffs, like "send out a notification only if there are 3 off/ons in a row", but that is not a general solution. Statistics to the rescue! First, some details.

Note for this work, the number of "nodes" ultimately doesn't matter, as each will track its buffer of events independently. Also, each node may have their state changes come in at differing times, but for each node, the time is constant. In other words, one node may get state updates every 5 minutes, one may get them every 30 seconds, etc., but it won't change per node. So solving this problem for one node in a time-independent manner will solve it for all.

Lets say node "a" has states Up and Down and is getting state updates every so many minutes. We'll keep the last 20 for the sake of argument. Reading from left to right, oldest state event to newest, we have:

A node that isn't flapping:

[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1] ...

And at some point starts flapping:

[0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1] ...

And then starts settling down to a constant state:

[1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1] ...

We can solve this problem using a **nonparametric test for nonrandomness**, from I believe the 1930s-1950s, called the **Runs Test**. It tests the hypotheses:

H_{0}: sequence is random

H_{1}: there are runs present (ie. sequence is nonrandom)

One simply has to count the number of 0s, n_{0}, the number of 1s, n_{1}, and the number of runs, R. A run is simply the number of groups of 0s and 1s. For example, the three sequences above have 3, 12, and 11 runs.

We also need the expected number of runs, E(R), and the expected standard deviation of runs, Std(R).

E(R) = ((2*n_{0}*n_{1})/(n_{0}+n_{1}))+1

Std(R) = ((2*n_{0}*n_{1}*(2*n_{0}*n_{1}-n_{0}-n_{1}))/((n_{0}+n_{1})^{2}*(n_{0}+n_{1}-1)))^{0.5}

Ideally, we'd want at least 25 observations to use the normal distribution. You want this for two reasons. The first reason is that your probability statements will be better. The second reason is that working with the discrete distribution is really, really nasty for this particular problem even if that would get you exact probability statements.

Converting into a Z-score, we get Z_{obs} = (|R-E(R)|-.5)/Std(R), because we are making a "continuity correction". Then you'd conclude there are runs of the 10101010101010 "flapping node" variety for large values of Z_{obs}. Specifically, you'd compare Z_{obs} to Z_{alpha/2}, where alpha = P(concluding flapping nodes are present|flapping nodes are not present), ie. making an error.

I think this can be done using a sliding window of size n events. I also think this could possibly be done using weights, to weight the most recent events more and the older ones less, in effect making some old 1s disappear, and then the runs test be done on that adjusted column of 0s and 1s (although I'm not entirely sure how that would change the statistical theory, or if that is less ideal than just using a smaller window).

What if you have more than 2 states, for example, Up, Warning, Down? There is a "k-category extension" of the runs test that handles these situations. The only things that would change are the equations for the expected number of runs, E(R), and the expected standard deviation of runs, Std(R). All of the other logic would be the same. Here are E(R) and Std(R) for the k-category extension of the runs test:

E(R) = [ n*(n+1)-sum(n_{i}^{2}, i=1 to k) ] / n

Std(R) = ( [ sum(n_{i}^{2}, i=1 to k)*(sum(n_{i}^{2}, i=1 to k) + n(n+1)) - 2n*sum(n_{i}^{3}, i=1 to k) - n^{3} ] / (n^{2}*(n-1)) )^{.5}

I hope you found this interesting. Please check out the great book __Nonparametric Statistical Inference__, by Gibbons and Chakraborti, for the theory on the runs test and many, many more nonparametric statistics topics

Thanks for reading!

If you enjoyed *any* of my content, please consider supporting it in a variety of ways:

- Check out a random article at http://statisticool.com/random.htm
- Buy what you need on Amazon from my affiliate link
- Share my Shutterstock photo gallery
- Sign up to be a Shutterstock contributor