12,290,311 members (68,505 online)
alternative version

21.3K views
9 bookmarked
Posted

Calculate Percentiles in O(1) space and O(n) time

, 1 Mar 2009 CPOL
 Rate this:
An unprecise algorithm to compute percentiles of a stream of data with one scan and a fixed buffer.

Introduction

In my work, I needed to calculate the percentile of a stream of data. The stream can be very long and the data keeps coming. Each time new data arrives, I need to update my percentile. After the update is done, the data is no more needed. The order of the data is pretty random with no obvious pattern. The result does not need to be always 100% precise, some reasonable error is tolerable.

Based on the above requirement, I tried to Google to see if there is any existent code, but I didn't find the exact solution. Although I found John D. Cook's article very inspiring.

Background

The idea is, instead of storing the whole set of data and sorting it, in most cases, we only need to maintain a sorted list to store the data around the current percentile. When new data arrives, we decide if we should insert the new data into the list, and if so, either the head or the tail of the list would be abandoned. As the data keeps coming, the sorted list slides in the whole data set. And in a not too bad environment, we expect the right percentile will stay in the sliding window.

Occasionally, especially when the incoming data is in some order, the right percentile could be lost from the list.

Using the code

Following is the code. As demonstrated, with the buffer size set to 30, we calculate .05 percentile of 10,000 uniform random numbers and get good results.

```using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Diagnostics;

class testPercentile
{
public void Run()
{
Random autoRand = new Random( );
const double percentage = .05;
Percentile<double> percentile=new Percentile<double>(percentage,50);
List<double> list=new List<double>();
for (int i = 0; i < 10000; i++)
{
double tmp = autoRand.NextDouble();
Debug.WriteLine(tmp);
int index = list.BinarySearch(tmp);
if (index < 0)
list.Insert(~index, tmp);
else
list.Insert(index, tmp);
index = list.BinarySearch(percentile.value);
if (index < 0) index = ~index;
Console.WriteLine( index*1.0 / (i + 1)-percentage);
}

Console.WriteLine();
Console.WriteLine(percentile.value);
}
}
class Percentile<T> where T : IComparable
{
uint position, size,count;
double percentile;
List<T> list;
public Percentile(double percentile, uint buffersize)
{
size = buffersize;
this.percentile = percentile;
count=position = 0;
list = new List<T>();
}
{
if (++count > size)
{
int newpos = (int)System.Math.Floor(count * percentile - size / 2.0);
if (newpos < 0) newpos = 0;
if (newvalue.CompareTo(list.First())>0)
{
if (newpos > position)
{
list.RemoveAt(0);
++position;
}
else
{
if (newvalue.CompareTo ( list.Last())>=0)
return;
list.RemoveAt(list.Count - 1);
}
}
else
{
if (newpos > position)
{
++position;
}
else
{
list.RemoveAt(list.Count - 1);
list.Insert(0, newvalue);
}
return;
}
}
int index = list.BinarySearch(newvalue);
if (index < 0)
list.Insert(~index, newvalue);
else
list.Insert(index, newvalue);
}
public T value
{
get
{
return list.ElementAt((int)System.Math.Floor(count*percentile-position));
}
}
}```

Points of interest

I'm lazy. I found my code seems to have bias because when the error occurs, the calculated percentile is always larger than the real percentile. Can anyone help me reduce the bias?

Share

 United States
No Biography provided

You may also be interested in...

 First Prev Next
 Some similar issue cmbtobmc6-Nov-14 4:39 cmbtobmc 6-Nov-14 4:39
 Bias progress krn_2k8-Jan-10 2:53 krn_2k 8-Jan-10 2:53
 My vote of 1 joker995-Mar-09 3:24 joker99 5-Mar-09 3:24
 How to remove the bias wamckee2-Mar-09 13:47 wamckee 2-Mar-09 13:47
 Are you sure? jerron3-Mar-09 12:07 jerron 3-Mar-09 12:07
 Last Visit: 31-Dec-99 18:00     Last Update: 23-May-16 23:56 Refresh 1