Reservoir Sampling of Data Streams

George Swan

4.60/5 (4 votes)

Jun 19, 2017

CPOL

3 min read

8991

An introduction to the Reservoir Sampling of Data Streams.

Introduction

Reservoir sampling is a technique to enable a representative sample of a large dataset to be taken. It can be used in cases where the size of the dataset is unknown and it uses very little of the processor’s memory as only the sample needs to be stored. Each entity is processed one at a time and is either added to the sample or rejected, this makes it ideal for sampling a steam of data.

Implementation.

The technique uses a reservoir, usually an array, in which to store the sample. The algorithm reads data into the sample array until it is full. After that, entities are added to the sample with a probability of s/n, where s is the sample size and n is the number of items currently read from the stream. An item selected for inclusion replaces one of the items in the sample. The item to be replaced is chosen randomly with a probability of 1/s.

Proof.

The proof needs to show that every item read from the stream has the same probability of being included in the sample.
Considering the base case where the number of items n read from the stream equals the size of the sample s. The probability of any item being included is s/n=1. The algorithm is correctly applied here as the items read from the stream are added to the sample with 100% probability until the sample is full. The algorithm will be shown to apply for all values of n subsequent to the base case if it can be shown that the probability of an item being included at any step k is correct when the probability of an item being included in the previous case , step k-1, is also correct. This is because the base case has been proven to be correct and the base case represents the previous case for the next case and the next case is the previous case for the next but one case etc.
Given that, at step k, the true probability of being included in the sample, where k<=n, is s/k, it needs to be proved that, at step k+1, the true probability is s/(k+1).
At step k+1 the probability of a previously selected item being removed from the sample is
1/s*s/(k+1)=1/(k+1)
So the probability, Pnr, of a selection not being removed is
Pnr=1-1/(k+1)
Pnr*(k+1)=k+1-1
Pnr=k/(k+1)
The probability of an item being in the sample is the probability of the item being selected at step k multiplied by the probability of it not being removed at step k+1
s/k*k/(k+1)=s/(k+1)
This is the same probability as the item read at step k+1 has of being selected when applying the algorithm. So all items are selected with the same probability.

The Algorithm.

The following example reads integers from a file stream. There is only one pass of the data and the same random number is used both to select an item for inclusion and to remove an item from the sample. The maximum random number that can be generated is the int32 maximum value of 2147483647

  public static int[] SampleData(int sampleSize, string fileName)
        {
            int s = sampleSize;
            var rnd = new Random();
            var reservoir = new int[s];
            using (var reader = new BinaryReader(File.Open(fileName, FileMode.Open)))
            {
                int n;
                for (n = 0; n < s; n++)
                {
                    reservoir[n] = reader.ReadInt32();
                }

                while (reader.BaseStream.Position < reader.BaseStream.Length)
                {
                    int item = reader.ReadInt32();
                    n++;
                    int r = rnd.Next(n);
                    //The number of possible integers generated from rnd(n) is n, 
                    // but their range is 0 to n-1. So, when n==s
                    // the item is always selected
                    if (r < s)
                    {
                        reservoir[r] = item;
                    }
                }
            }
            return reservoir;
        }

Conclusion.

Reservoir sampling allows a basic processor with a minimal amount of memory to produce a truly representative sample from terabytes of data. It’s an excellent technique and it’s hope that this article may enable it to be more widely recognised.

References.

The proof uses the mathematical principle of induction. There is a good demonstration of that principle at Maths is Fun.
There is a detailed explanation of a reservoir sampling proof on video here.
Code Project. Random Number Generation Methods by Peter Occil.
Wikipedia.