13,143,693 members (29,390 online)
alternative version

#### Stats

44.9K views
31 bookmarked
Posted 28 Apr 2008

# Calculating Percentiles in Memory-bound Applications

, 26 Jun 2014
 Rate this:
How to compute percentiles of a stream of data too large to fit into memory at once

## Introduction

Suppose you have a long list of numbers and you want to find the 5th and 95th percentiles. If the list is small enough, you can read the list into memory, sort it, and see which numbers are 5% of the way from each end. Simple. But what do you do if the list is being generated sequentially and the entire list is too large to fit into memory at once? This article presents a simple C++ class for solving this problem. The class takes template parameters so it can be used with any data type.

## Motivating Application

The solution presented here could be used in a variety of memory-bound problems, but I'll describe briefly the problem that originally motivated the code. As part of its processing, the statistical package WFMM generates a huge matrix one row at a time, then reports a couple percentiles of each column. It would be nice if it were possible to produce the data one column at a time, but that's not the case. The nature of the problem is that rows can be computed independently but columns cannot. The size of this matrix varies according to the input, and the amount of available memory determines the maximum problem size the software can handle. The code presented here was used in the WFMM project to conserve memory.

## Background

To make the problem tangible, suppose we have a 1000 numbers and we want to know what the 50th number would be if we were to sort the list in increasing order. Imagine we want to do this using as little memory as possible. We could read in the first 50 numbers and sort them. Then when we read in the 51st number, we compare it to the largest of the 50 numbers we've saved. If the new number is larger, we throw it away because we know it cannot be the 50th smallest number of the full list. If the new number is smaller, we throw away the largest number we were keeping and insert the new number into our cache. This gives us a way to compute the 50th smallest number using 50 memory locations. Obviously we could use a similar approach to find the 50th largest number as well.

Could we do any better? No. Until we see the last 50 numbers, we can't rule out the possibility of any one of the numbers we're saving being the one we're after. When we see the 951st number, we can rule out one of the numbers in our cache. When we see the 952nd number we can throw out another number from the cache, etc. But for the majority of the runtime of the algorithm, we need to hold on to 50 numbers.

In general, if you want to find the nth smallest number in a list of M numbers, you need to save n numbers along the way. (Unless n is larger than half of M. Then you could turn the problem around and find the (M-n)th largest number in the list.) So if you want to calculate the 5th or 95th percentile of a list of M numbers, you need to store 0.05 M numbers. If you want to find both the 5th and the 95th percentiles you'd need to store 0.10 M numbers along the way. By storing just the numbers you need, you can solve a problem 10 times larger than you could if you were to load everything into memory first and then sort.

## Using the Code

The class `TailKeeper` keeps track of the “tails” of a sequence of numbers, the largest and smallest values, in order to compute upper and lower percentiles as described above. Values are inserted into a hash data structure as they arrive. This allows us to keep the stored numbers sorted and make fast inserts without requiring additional memory.

The class uses templates to work with any data type that supports comparison, not just numeric types. Also, the input and storage types may differ. In our application, the input data had type `double` but was stored as type `float`. This allowed us to save twice as much data in the same memory. There was enough noise in the data that the loss of precision in casting `double` to `float` did not matter.

To use `TailKeeper` in your application, `#include` the files TailKeeper.h and FixedSizeHeap.h. The class needs two constants to specify which values you're looking for. If you want to find the mth smallest and nth largest values, you can specify m and n either as arguments to the constructor or as arguments to the `Initialize` function.

For example, suppose you're wanting to find the 40th smallest and the 10th largest element in a list of `double`s and you are willing to cast your input to `float`s to save space. You could declare a `TailKeeper` class as follows:

`TailKeeper<double, float> tk(40, 10);`

The input type defaults to `double` and the storage type defaults to `float` and so `<double, float>` could be left out in this case.

As each value in the list becomes available, insert it into `TailKeeper` by calling the `AddSample` method:

`tk.AddSample( x );`

To find the 40th smallest element along the way, call `GetMaxLeftTail()`. Similarly, to find the 10th largest element, call `GetMinRightTail()`. (The smallest 40 elements seen at any point constitute the left tail, so the maximum of these is the 40th smallest. The largest 10 elements are the right tail, so the minimum of these is the 10th largest.)

## Testing the Code

The demo project first tests the `FixedSizeHeap` class that the `TailKeeper` depends on by checking and inserting several values and comparing the internal state of the heap to the results of manual calculations.

The project then creates a list of 1000 random integers and finds the 5th and 96th percentiles using `TailKeeper` directly by sorting the list.

```///////////////////////////////////////////////////////////////////////
// Generate a list of random integers.
int length = 1000;
std::vector<int> data(length);
data[0] = 137; // arbitrary seed
for (int i = 1; i != length; ++i)
{
// This is George Marsaglia's "CONG" random number generator.
data[i] = 69069*data[i-1] + 1234567;
}

///////////////////////////////////////////////////////////////////////
// Find the 50th smallest and 40th largest elements using TailKeeper.
int lower = 50, upper = 40;
TailKeeper<int,><int, int> tk(lower, upper);
for (int i = 0; i != length; ++i)
int leftTailTK = tk.GetMaxLeftTail();
int rightTailTK = tk.GetMinRightTail();

///////////////////////////////////////////////////////////////////////
// Find the same values directly by sorting the entire list.
std::sort(data.begin(), data.end());
int leftTailSort = data[lower-1];
int rightTailSort = data[length - upper];

///////////////////////////////////////////////////////////////////////
// Compare the results.
assert(leftTailTK == leftTailSort);
assert(rightTailTK == rightTailSort);</int,></int>```

For further testing, you could vary the random number generator seed value `data[0]` or the values of the `length`, `upper`, and `lower` parameters.

## History

• 28th April, 2008: Initial post

## About the Author

 Singular Value Consulting United States
I am an independent consultant in software development and applied mathematics. I help companies learn from their data to make better decisions.

Check out my blog or send me a note.

## You may also be interested in...

 Pro Pro

## Comments and Discussions

 First Prev Next
 Data structure YvesDaoust1-Jul-14 23:33 YvesDaoust 1-Jul-14 23:33
 Other algorithms krn_2k29-Jan-09 23:25 krn_2k 29-Jan-09 23:25
 Re: Other algorithms krn_2k12-Feb-09 1:38 krn_2k 12-Feb-09 1:38
 Nice article! wtwhite7-May-08 3:53 wtwhite 7-May-08 3:53
 More algorithms Daniel Grunwald28-Apr-08 23:11 Daniel Grunwald 28-Apr-08 23:11
 Re: More algorithms John D. Cook29-Apr-08 2:51 John D. Cook 29-Apr-08 2:51
 Re: More algorithms Daniel Grunwald29-Apr-08 2:58 Daniel Grunwald 29-Apr-08 2:58
 Re: More algorithms John D. Cook29-Apr-08 3:40 John D. Cook 29-Apr-08 3:40
 Re: More algorithms Thomas Guest29-Apr-08 22:11 Thomas Guest 29-Apr-08 22:11
 Re: More algorithms John D. Cook30-Apr-08 0:07 John D. Cook 30-Apr-08 0:07
 Last Visit: 31-Dec-99 18:00     Last Update: 21-Sep-17 14:12 Refresh 1

General    News    Suggestion    Question    Bug    Answer    Joke    Praise    Rant    Admin

Use Ctrl+Left/Right to switch messages, Ctrl+Up/Down to switch threads, Ctrl+Shift+Left/Right to switch pages.