Pitfalls in Random Number Generation

John D. Cook

4.82/5 (77 votes)

Aug 12, 2008

BSD

10 min read

188424

Some of the subtle problems that can arise when working with random number generators

Introduction

Random number generation is subtle. Random number generators contain deterministic algorithms designed to produce output that simulates non-deterministic behavior. It's amazing that there are algorithms that do this well enough for many applications. But unless used carefully, random number generators can misbehave in mysterious ways.

Uniform Random Number Generators

It makes no sense to ask for a random number without some context. You need to specify random according to what distribution. What most people have in mind by default is random numbers drawn uniformly between 0 and 1, that is, every number in this range is equally likely. This is the simplest case. However, you may need random samples that have a normal (Gaussian) distribution, or an exponential distribution, or any of a wide variety of distributions that arise in applications.

Random number generators for other distributions have at their core a uniform random number generator. This means that the uniform generator is the most important link in the chain. If this core generator has poor statistical properties, nothing built on top of it is likely to be any better. For serious applications, don't trust a uniform random number generator that hasn't been through some sort of public review. This article does not cover how to build a good uniform generator. Most people should not attempt to write their own core uniform generation algorithm but instead use something like the Mersenne Twister algorithm that has been subject to much professional scrutiny.

Non-uniform Random Number Generators

Applications usually require more than just uniform random samples from the interval (0, 1). The simplest variation would be random samples in the range (a, b). If u is a uniform sample from (0, 1) then a + (b-a)*u is a uniform sample from the interval (a, b). Another simple variation would be needing to produce a 1 with probability p and 0 with probability 1-p (technically, a Bernoulli random sample). This can be done by first generating a random sample u from (0, 1) and then returning 1 if u < p and 0 otherwise.

Other random number distributions are not as simple. For example, applications often need random samples to have a "normal" (Gaussian) distribution. You may be able to find software that implements the sampling distribution you need. For example, the Visual C++ implementation of the C++ Technical Report 1 supports these random number distributions.

Bernoulli
Binomial
Exponential
Gamma
Geometric
Normal
Poisson

See Random Number Generation with C++ TR1 for details. The same article also explains how to use the built-in generators to create generators for four common distributions not included in the library: Cauchy, chi-squared, Student t, and Snedecor F.

Be aware that there are conflicting conventions for how to parameterize distributions. For example, some books and software libraries parameterize a normal distribution in terms of its variance σ² while others parameterize in terms of the standard deviation σ. Likewise some books and libraries parameterize the exponential and gamma distributions in terms of their mean μ while other use the rate λ = 1/μ. To make matters worse, these variations coincide for the most common default parameters. This can lead to a nasty bug.

For example, assume your code relies on normal random variables and you test your code by setting σ = 1 explicitly or by using a default value. You could assume your code uses the variance when it actually uses standard deviation, but you wouldn't uncover your error because σ²=1 also means that σ = 1. Then later when your code is in production, someone passes in a value of 10 expecting a variance of 10 but they actually get a variance of 100. And if you only tested the average of your samples, you also wouldn't catch this problem. The other distributions have a similar problem. The default parameter is 1, in which case you can't tell the mean from the rate. If you only test your code with parameters near 1, you may find later that someone expecting very large values got very small values or vice versa.

See notes on Mathematica and in R/S-PLUS for how these packages parameterize distributions and for some specific distributions have alternative parameterizations.

Boundary Values

You need to know the exact details of your core random number generator. When it generates a number between 0 and 1, is zero included? Is one included? This may seem that this is a trivial point, but it matters. The output of this generator may be (and often is) the input to a function that is undefined at 0 or 1.

Suppose you need to generate random samples from a distribution not included with your library. The most general method is to use the inverse CDF to generate samples. Suppose you want to generate samples from a random variable with a cumulative distribution function (CDF) given by F(x). If you first take a uniform sample u from (0, 1) then return F^-1(u), the resulting sample will have the desired distribution. For example, if your library did not have code for generating samples from an exponential distribution, you could make your own by taking the natural logarithm of samples from a uniform distribution. As long as your uniform random number generator returns numbers strictly bigger than 0, this works. But if your uniform generator ever returns exactly 0, the logarithm will blow up. When generating from other distributions, there could be a problem when the uniform generator returns exactly 1.

The uniform random number generators specified by C++ TR1 sample from the half-open interval [0, 1). That is, 0 is a possible return value but 1 is not. I don't know what their reasoning was for such a choice, but in my opinion it would have been safer to sample from the open interval (0, 1), i.e. to exclude both end points. When using this generator, depending on your application, you may need to test for exactly zero values and discard these.

Setting Seeds

Random number generators have a seed that is used to begin the sequence of random samples. There can be a number of subtle problems with how a random number generator is seeded.

Suppose a program seeds a random number generator with a fixed seed, say 21495. If you run this program again with the same seed, it will produce the same series of results. This is good news for reproducibility and testability, but it might not be the behavior you expect. If this seed is part of a game, players may be disappointed if your game always acts the same way given the same moves. A common remedy for this situation is to have your program set the seed from the system clock when the program is launched or when the game begins.

Imagine an application that outputs exactly one random value then exists. (For example, a program to randomly assign patients to one of two treatments in a clinical trial. You fire up the program, get a treatment assignment, and exit.) If this program used a seed hard-coded into the source code, it would return the same value every time! To get around this, such a program needs to persist the state of the random number generator between sessions so that the generator will pick back up where it left off the last time the program executed. (It may be adequate for some applications to set the seed from the system time each time the application starts, but this may not be good enough for statistical applications.)

Next consider two chunks of code that need to generate random numbers. These may be on the same thread or on different threads. If the two code sections contain the same random number generator set with the same seed, the sequences of random numbers produced in the two cases will be identical. This is probably not what you want. Either the two pieces of code should call a common random number generator (with the corresponding concurrency issues settled) or each use separate seeds.

Now consider the problem of running copies of a program on many different machines with a large queue of such programs to run. This is a common scenario when using a cluster or grid for a simulation study. If you set the random number generator seed for each program using the system clock, it's possible that two jobs will start at the same time. The output from both jobs will be identical, meaning one of the jobs was wasted. One solution to this problem is to use a GUID — a 128-bit globally unique identifier — to set the seed. Algorithms for generating GUIDs use information about the machine the code is executing on to guarantee that the GUIDs are unique across time and machines. However, before you set the seed of a random number generator from a GUID, you must be using a generator with at least 128 bits of state. If a generator just uses say 64 bits from the GUID, there is no guarantee that the results of reducing a GUID to fewer bits is unique across machines. One possible solution in that case is to use the system time to set 32 bits and use the machine's IP address to fill the other 32 bits.

Testing

Testing random number generators is subtle business. For uniform random number generators, there are test suites such as George Marsaglia's DIEHARD suite. Both the Mersenne Twister and the C# generator presented in the CodeProject article Simple Random Number Generation pass these tests. I will assume you're starting with a reliable uniform generator and need to test the way you've turned that uniform generator into a non-uniform generator. Not many people need to write their own uniform random number generator. However, it's not uncommon to need to create a non-uniform random number generator for a distribution not directly supported by a random number generation library.

A simple place to start testing is to look at the sample mean and sample variance, if the distribution you're sampling from has a mean and variance. Without going into a course in statistics it would be hard to say just how close the mean and variance of the samples should come to the theoretical mean and variance of the distribution. However, a bug in your code is likely to show up as a large discrepancy between sample and theoretical values.

If your samples appear to have the right mean and variance, the next step would be to test the overall distribution. The "Kolmogorov-Smirnov" test will test how well your samples match the distribution you are sampling from. This test is described in section 3.3.1 of Volume 2 of Donald Knuth's series The Art of Computer Programming. If the K-S test passes, that's a good sign that your code is correct. If the K-S test fails, use a new seed and try again. If the test usually passes, occasional failures are probably due to random fluctuation. (Statistical tests are supposed to fail occasionally, just not often!) If the test fails consistently, a bug is the more likely explanation. For more details, see the book chapter How to test a random number generator from Beautiful Testing.

Summary

In short, here are some tips for working with random number generators.

Start with a high-quality uniform generator that has been publically reviewed.
Be careful about parameterizations when using non-uniform generators.
Pay attention to the exact range of values your generator can emit.
Understand whether you want to use a fixed seed or to use system time to set the seed.
Set seeds from GUIDs when running jobs on several machines.
When you write your own non-uniform generator, test sample means and variances first then use a Kolmogorov-Smirnov test.

History

12 August 2008: Initial post
23 October 2008: Revised text, added references