Streaming mean and standard deviation


If you have N numbers and you want their mean, how fast can you compute it? For small N, the calculation is very quick. However, when N gets large (and in this day and age of terabytes of real-time Internet, genome, geophysical, and satellite data, it often does), the calculation can take much too long, even for something this simple.

Streaming statistics is a way around this. Instead of storing N numbers and then calculating a statistic, T, from the stored data, one calculates the statistic in real time and updates it as each new number arrives. The differences are summarized below:

Standard: Get x1, x2, ..., xn, store to vector X, calculate T(X)

Streaming: Get x1, calculate T(x1), get x2, calculate f(x2,T(x1)), get x3, calculate f(x3,f(x2,T(x1))), etc.

With streaming statistics one can still store the data "for the record", but the statistics are not calculated as T(X).
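The streaming scheme can be sketched in a few lines of Python. This is only an illustration: the names `stream_statistic`, `first`, and `update` are made up here, with `first` playing the role of T(x1) and `update` playing the role of f. A running maximum is used as the statistic simply because its update rule is obvious.

```python
def stream_statistic(values, first, update):
    """Yield the current statistic after each new value arrives."""
    stat = None
    for i, x in enumerate(values):
        # First value: T(x1). Every later value: f(x, previous statistic).
        stat = first(x) if i == 0 else update(x, stat)
        yield stat


# Running maximum: T(x1) = x1 and f(x, T) = max(x, T).
running_max = list(stream_statistic([3, 1, 4, 1, 5], first=lambda x: x, update=max))
print(running_max)  # [3, 3, 4, 4, 5]
```

Note that the function never looks back at earlier values; it only keeps the latest statistic.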

Here is the theory behind a streaming mean and a streaming standard deviation.

Let $\bar{x}_t$ be the mean of the first $t$ numbers. Then

$$\bar{x}_t = \frac{(t-1)\,\bar{x}_{t-1} + x_t}{t}$$
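For readers who want the missing step, the update follows from splitting the newest number off the sum. Here $x_i$ denotes the $i$-th number received and $\bar{x}_t$ the mean of the first $t$ numbers:

```latex
\bar{x}_t
  = \frac{1}{t}\sum_{i=1}^{t} x_i
  = \frac{1}{t}\left(\sum_{i=1}^{t-1} x_i + x_t\right)
  = \frac{(t-1)\,\bar{x}_{t-1} + x_t}{t}
```

The middle step uses the fact that the sum of the first $t-1$ numbers equals $(t-1)\,\bar{x}_{t-1}$, so only the previous mean and the count need to be kept.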

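For the standard deviation, we first have to calculate a streaming mean of the $t$ squared numbers, $\overline{x^2}_t$, updated in the same way as the mean, and then the streaming (sample) standard deviation, defined for $t \ge 2$, is

$$\mathrm{stddev}_t = \left(\frac{t\,\overline{x^2}_t - t\,\bar{x}_t^{\,2}}{t-1}\right)^{0.5}$$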
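As a rough Python sketch of the two formulas above (the class name `StreamingStats` and its attribute names are invented for this illustration), one can keep three numbers: the count, the running mean, and the running mean of squares. One caveat: subtracting the squared mean from the mean of squares can lose precision in floating point when the mean is large compared with the spread, so the sketch clamps tiny negative values to zero before taking the square root.

```python
import math


class StreamingStats:
    """Running mean and sample standard deviation, one value at a time."""

    def __init__(self):
        self.t = 0          # how many values have been seen so far
        self.mean = 0.0     # running mean of the values
        self.mean_sq = 0.0  # running mean of the squared values

    def update(self, x):
        """Fold one new value into the running statistics."""
        self.t += 1
        t = self.t
        self.mean = ((t - 1) * self.mean + x) / t
        self.mean_sq = ((t - 1) * self.mean_sq + x * x) / t

    def stddev(self):
        """Sample standard deviation; NaN until two values have arrived."""
        if self.t < 2:
            return float("nan")
        var = (self.t * self.mean_sq - self.t * self.mean ** 2) / (self.t - 1)
        return math.sqrt(max(var, 0.0))  # clamp rounding error below zero


stats = StreamingStats()
for value in [2, 4, 4, 4, 5, 5, 7, 9]:
    stats.update(value)
print(stats.mean, stats.stddev())
```

No list of past values is kept, so memory use stays constant however long the stream runs.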

I wrote the following program for my calculator that calculates a streaming mean and standard deviation.

0→t
0→xbar
0→xbar2
Lbl a
t+1→t
Input "Number "&string(t),x
((t-1)*xbar+x)/t→xbar
((t-1)*xbar2+x^2)/t→xbar2
Disp "Mean=",xbar
If t>1
Disp "Stddev=",sqrt((t*xbar2-t*xbar^2)/(t-1))
Goto a
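To check the formulas by hand, here is a short trace for the made-up inputs 2, 4, 6, applying the mean update to both the numbers and their squares. The standard deviation is undefined at $t = 1$ because of the division by $t - 1$.

```latex
\begin{aligned}
t = 1:\quad & \bar{x}_1 = 2, \qquad \overline{x^2}_1 = 4 \\
t = 2:\quad & \bar{x}_2 = \frac{1 \cdot 2 + 4}{2} = 3, \qquad
              \overline{x^2}_2 = \frac{1 \cdot 4 + 16}{2} = 10, \qquad
              \mathrm{stddev}_2 = \sqrt{\frac{2 \cdot 10 - 2 \cdot 9}{1}} = \sqrt{2} \approx 1.414 \\
t = 3:\quad & \bar{x}_3 = \frac{2 \cdot 3 + 6}{3} = 4, \qquad
              \overline{x^2}_3 = \frac{2 \cdot 10 + 36}{3} = \frac{56}{3}, \qquad
              \mathrm{stddev}_3 = \sqrt{\frac{56 - 48}{2}} = 2
\end{aligned}
```

These match the ordinary sample standard deviations of {2, 4} and {2, 4, 6} computed from the stored data.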

The above procedure makes the mean and standard deviation computable by streaming. Therefore, any statistic that is a function of the mean and/or the standard deviation can also be calculated by streaming.
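As an illustration of that last point, here is a small Python sketch that derives two further statistics from the streaming state alone. The function name `derived_stats` and the choice of statistics (standard error of the mean and coefficient of variation) are examples picked for this sketch; it assumes at least two numbers have been seen.

```python
import math


def derived_stats(t, mean, mean_sq):
    """Statistics built only from the count, mean, and mean of squares (t >= 2)."""
    var = (t * mean_sq - t * mean ** 2) / (t - 1)
    stddev = math.sqrt(max(var, 0.0))  # clamp rounding error below zero
    return {
        "stddev": stddev,
        # Standard error of the mean: stddev / sqrt(t).
        "std_error": stddev / math.sqrt(t),
        # Coefficient of variation: stddev / mean (undefined for a zero mean).
        "coeff_var": stddev / mean if mean != 0 else float("nan"),
    }


# State after streaming 2, 4, 4, 4, 5, 5, 7, 9: t = 8, mean = 5, mean of squares = 29.
print(derived_stats(8, 5.0, 29.0))
```

Because the inputs are just the three running values, these can be refreshed after every new number at no extra storage cost.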
