Calculate Percentiles in O(1) space and O(n) time
An unprecise algorithm to compute percentiles of a stream of data with one scan and a fixed buffer.
Introduction
In my work, I needed to calculate the percentile of a stream of data. The stream can be very long and the data keeps coming. Each time new data arrives, I need to update my percentile. After the update is done, the data is no more needed. The order of the data is pretty random with no obvious pattern. The result does not need to be always 100% precise, some reasonable error is tolerable.
Based on the above requirement, I tried to Google to see if there is any existent code, but I didn't find the exact solution. Although I found John D. Cook's article very inspiring.
Background
The idea is, instead of storing the whole set of data and sorting it, in most cases, we only need to maintain a sorted list to store the data around the current percentile. When new data arrives, we decide if we should insert the new data into the list, and if so, either the head or the tail of the list would be abandoned. As the data keeps coming, the sorted list slides in the whole data set. And in a not too bad environment, we expect the right percentile will stay in the sliding window.
Occasionally, especially when the incoming data is in some order, the right percentile could be lost from the list.
Using the code
Following is the code. As demonstrated, with the buffer size set to 30, we calculate .05 percentile of 10,000 uniform random numbers and get good results.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Diagnostics;
class testPercentile
{
public void Run()
{
Random autoRand = new Random( );
const double percentage = .05;
Percentile<double> percentile=new Percentile<double>(percentage,50);
List<double> list=new List<double>();
for (int i = 0; i < 10000; i++)
{
double tmp = autoRand.NextDouble();
Debug.WriteLine(tmp);
percentile.add(tmp);
int index = list.BinarySearch(tmp);
if (index < 0)
list.Insert(~index, tmp);
else
list.Insert(index, tmp);
index = list.BinarySearch(percentile.value);
if (index < 0) index = ~index;
Console.WriteLine( index*1.0 / (i + 1)-percentage);
}
Console.WriteLine();
Console.WriteLine(percentile.value);
}
}
class Percentile<T> where T : IComparable
{
uint position, size,count;
double percentile;
List<T> list;
public Percentile(double percentile, uint buffersize)
{
size = buffersize;
this.percentile = percentile;
count=position = 0;
list = new List<T>();
}
public void add(T newvalue)
{
if (++count > size)
{
int newpos = (int)System.Math.Floor(count * percentile - size / 2.0);
if (newpos < 0) newpos = 0;
if (newvalue.CompareTo(list.First())>0)
{
if (newpos > position)
{
list.RemoveAt(0);
++position;
}
else
{
if (newvalue.CompareTo ( list.Last())>=0)
return;
list.RemoveAt(list.Count - 1);
}
}
else
{
if (newpos > position)
{
++position;
}
else
{
list.RemoveAt(list.Count - 1);
list.Insert(0, newvalue);
}
return;
}
}
int index = list.BinarySearch(newvalue);
if (index < 0)
list.Insert(~index, newvalue);
else
list.Insert(index, newvalue);
}
public T value
{
get
{
return list.ElementAt((int)System.Math.Floor(count*percentile-position));
}
}
}
Points of interest
I'm lazy. I found my code seems to have bias because when the error occurs, the calculated percentile is always larger than the real percentile. Can anyone help me reduce the bias?