2020

Binapprox median algorithm

Statistics show us that median lies between the range [mean-stddeviation (inclusive), mean+stddeviation (exclusive)]. Binapprox algorithm exploits this fact to find the approx median.

‌

It is a fast algorithm to find the approximate running median of a data set. The algorithm is as follows:

Step 1: Compute the mean μ and standard deviation σ2. Step 2: Form B bins across [μ−σ, μ+σ], map each data point to a bin Step 3. Find the bin b that contains the median. Step 4: Return the midpoint of bin b

‌

(Ref:

https://www.stat.cmu.edu/~ryantibs/papers/median.pdf⁠

)

‌

The algorithm has a worst case running time of O(n), keeping in mind that the median is approximate and the actual median can differ by (stdDev/noOfBins)

‌

This article will focus on calculating running median on 1D array. Calculation of median in 2D matrices (in our case the fits data matrix) will be covered in Median Image Stacking page.

‌

⁠

‌

Step 1: Computing the mean and standard deviation.

‌

In 2 separate passes of the array O(n), we can calculate the mean and std deviation. I used numpy's mean and std functions to do the same.

mean = np.mean(arr);

std = np.std(arr);

‌

For 2D arrays, we will use Welford's online algorithm (mentioned in Mean Image Stacking page). Explained more in Median Image Stacking page.

⁠

‌

Step 2: Form B bins across [mean-stddev, mean+stddev], mapping each data point to the bin

‌

To create the bins, I used arange method of numpy, with step = binwidth = (2*stddev)/B

minVal = mean - std;

maxVal = mean + std;

binWidth = (2* std)/binCount;

bins = np.arange(start = minVal, stop=maxVal+binWidth, step=binWidth);

‌

We will also create a left_bin with values less than minVal, it will help us find the bin which will contain the median.

left_bin = sum(i < minVal for i in arr)

‌

Mapping each data point to bin

‌

Since the value (mean + stddev) is exclusive, we will first filter out the values and then use hist method of plt to put each value in bins.

exit: ⌘↩

filteredAr = arr[arr < maxVal];

counts, edges, plot = plt.hist(filteredAr, bins)

‌

⁠

‌

Step 3,4: Find the bin that contains the median and return the midpoint.

‌

Imagine a histogram like this:

⁠

‌

Each bin will contain the numbers of elements of array within the width. If we start from the min value + the left_bin count, we count elements in each bin, adding them, till we reach the half of size of array, which is where the median will lie. The approx median will be the avg of bin edges of that bin.

mid = (np.size(arr) + 1)/2;

count = left;

i = -1

while (count < mid and i < binCount-1) :

i+=1;

count += countInEachBin[i];

binStart = mean-std;

binWidth = (2* std)/binCount;

binBoundLower = (binStart) + ((i)* (binWidth));

binBoundHigh = binBoundLower + binWidth;

return (binBoundHigh + binBoundLower)/2;

‌

Note that this code can be optimized further, the purpose of the article was to get familiarize with the algorithm. As mentioned in the research paper too, binapprox algorithm is very competitive with quicksort but when new data comes, this algorithm will function very well.

Want to print your doc?
This is not the way.

Try clicking the ⋯ next to your doc name or using a keyboard shortcut (

CtrlP

) instead.