12,446,146 members (76,923 online)
Technical Blog
alternative version

19.9K views
15 bookmarked
Posted

# Massively Parallel RNG using CUDA C, Thrust and C#

, 22 Sep 2011 CPOL
 Rate this:
Massively Parallel Random Nunber Generation using CUDA C, Thrust and C#

## Introduction

In this post, I’ll give you some simple examples of how to use the massively parallel GPU on your computer to generate uniformly distributed psuedo-random numbers. Why GPU? Because it is orders of magnitude faster than CPU, does not occupy your CPU time, it is already on all computers and many other reasons I mentioned in my previous post. While there are maybe hundreds of ways to generate pseudorandom numbers, I only covered four ways to do it on NVIDIA cards using CUDA related APIs:

1. A Basic Linear Congruential Generator (LCG) implementation using CUDA C
2. A Thrust C++ template library implementation
3. An NVIDIA CURAND implementation
4. A Mersenne Twister implementation using CUDA C

In order to demonstrate how to utilize the GPU, the implementations are provided as C++ native DLLs and used within C# sample application. Perhaps there are many other APIs and ways worth talking about to utilize your GPU within your C# application, but this post’s scope is limited only to the subject I mentioned above. I’m suggesting that you visit http://gpgpu.org/, if you already did not, to see the endless possibilities in this area.

While I was preparing these samples, I saw that visualizing the data is very important to understand the algorithms. Therefore, I used Microsoft WPF on C# to visualize the generated random numbers. You can use your own application and copy the classes under the RNGLibs folder.

## Background

### About Random Number Generators (RNG)

The generation of random numbers is important in many applications like simulations, cryptography, sampling and mostly in statistics. A sequence of numbers is random when it does not have a recognizable pattern in it or in other words if it is non-deterministic. Although non-deterministic random numbers are ideal, the computer generated, deterministic random numbers can be statistically “random enough”. These random numbers are named as pseudo-random numbers and can have easily identifiable patterns if the algorithm is not chosen wisely. (http://en.wikipedia.org/wiki/Pseudorandom_number_generator).

There are many pseudo-random number generators and also many different implementations of them in sequential and parallel environments. (http://en.wikipedia.org/wiki/List_of_pseudorandom_number_generators) In this post, I used only Linear Congruential Generators, Xorshift and Mersenne Twister. Therefore, I explained only these three algorithms, but you can use CUDA to develop other RNGs too.

### Thrust

As I mentioned in my previous post, writing code using CUDA API is very powerful in terms of controlling the hardware, but there are high level libraries like Thrust C++ template library, which provide many fundamental programming logic like sorting, prefix-sums, reductions, transformations, etc. The best part is that Thrust consists only of header files and is distributed with CUDA 4.0 installation.

## Projects

I’ve used Visual Studio 2010 to host one C# Windows Application and native C++ DLLs for RNG implementations as seen in the solution structure below:

• `RNGVisual `(Main C# application)
• `CURACRNGLib `(CUDA C RNG implementation)
• `CURANDRNGLib `(CURAND RNG implementation)
• `ThrustRNGLib `(Thrust RNG Implementation)
• `MersenneTwisterRNGLib `(Mersenne Twister RNG implementation)

The only additional API is the NVIDIA CUDA Toolkit 4.0, which you will need to install along with a supported graphics device driver from the same link. Also do not forget to install the Build Customization BUG FIX Update from the same link or from here.

Once the CUDA Toolkit is installed, creating CUDA enabled projects is really simple. For those who are not familiar using native C++ CUDA enabled projects, please follow the steps below to create one:

• Create a Visual C++ console project in Visual Studio 2010 by selecting DLL and Empty project on the wizard
• Open Build Customization from the C++ project context menu, and check the CUDA 4.0(.targets, .props) checkbox
• Open the project properties, expand the Configuration Properties, Linker and select Input. Edit the additional dependencies and add cudart.lib.
• Add a new empty source file ending with .cu.

## Implementation

### WPF Application

The `RNGVisual `C# WPF application provides visualization of the random numbers in 2D and 3D. It allows you to select an RNG algorithm (.NET, CUDA C, CURAND, Thrust or Merseene Twister) and allows you to set some display parameters and processor parameters. With any number count below 10K, all RNGs calculate in about one millisecond and most of the time is spent drawing the squares to the screen. Therefore, the time should not confuse you in terms of performance comparison. You can run the algorithms with 100K numbers without the visualization and see the difference on your hardware. But please be aware that it is better to use CUDA events with cudaEventRecord to time GPU execution more precisely.

RNGVisual

`RNGVisual `implements various proxy classes, which uses platform invoke to call RNG methods exported from the native C++ DLLs. I used the same export and import technique in my previous post. The RNG libraries have the following two exports, one for CPU implementation and one for GPU implementation:

```extern "C" __declspec(dllexport) void __cdecl
GPU_RNG(float*, unsigned int, unsigned int);

extern "C" __declspec(dllexport) void __cdecl
CPU_RNG(float*, unsigned int, unsigned int);```

The first argument is a pointer to the memory location to hold the random numbers. The second argument is the size of the array and the last argument is the initial seed.

An important point of random number generation is selecting the seed value, because the same seed will give the same result. While there are many different techniques studied, I used my own method of combining current time, CPU load and available physical memory with the help of the Windows Management Instrumentation (WMI); it still does not perform well in multi-threaded solutions, but it gives at least a better random start. The implementation is in the `CPUHelper` class of the `RNGVisual `application.

### A Linear Congruential Generator (LCG) implementation using CUDA C

The first RNG project is using native CUDA Runtime API to implement the oldest and best-known pseudorandom number generator algorithms named LCG. LCG is fast and requires minimal memory to retain state. Therefore, LCG is very efficient for simulating multiple independent streams. But LCGs have some disadvantages and should not be used for applications where high-quality randomness is critical. In fact, the simple example I implemented repeats numbers in a very short period and should be enhanced with methods like explained in GPUGems 3 (37-4).

LCG is as simple as seen in the formula below; starting with a seed (Xn), the next random number is determined with `(a * seed + c) mod m`.
`Xn+1 = (a Xn + c) (mod m)`
where `0 < m, 0 < a < m, 0 <= x0 < m` and `0 <= c < m`.

Below is a sequential implementation of the LCG algorithm, which generates 100 pseudo-random numbers:

```random[0] = 123; //some initial seed
for(int i = 1; i < 100; i++)
random[i] = ( a * random[i-1] + c) % m; ```

The `CUDACRNGLib `project has a very basic implementation of LCG by distributing the work onto 256 threads. Because the same seed will result in the same random number, first we generate different random seeds for every thread. When the kernel below is executed, every thread generates one section of the random number sequence:

```__global__ void RNGKernel(float * randomNumbers, unsigned int numberCount,
unsigned int * seeds, unsigned int c, unsigned int a, unsigned int M)
{
int startIdx = threadIdx.x * numberCount;
unsigned int x = seeds[threadIdx.x];
for(int i=0; i < numberCount; i++) {
x = (a * x + c) % M; //M is shown for purpose of example
randomNumbers[startIdx + i]= (float)x / (float)M; //normalize
}
}```

As I mentioned before, this implementation is simplified to give you an idea how you can start using CUDA C. It even has static block count of one and thread count of 256. If you plan to go for production code, it is good to start many blocks of threads. You may want to check out a better implementation on GPU Gems 3 (37-7) or check out Arnold and Meel’s implementation, which provides also better randomness.

### A Thrust C++ Template Library Implementation

The Thrust library default random engine (default_random_engine) is a Linear Congruential Generator (may change in the future) with a = 48271, c = 0 and m = 2^31. Because c equals to zero, the algorithm is also named multiplicative congruential method or Lehmer RNG.

The `ThrustRNGLib `has a very basic implementation of Thrust default random engine by running the following functor to generate one random number. A functor is a type of class in C++ that overloads the `operator() `and therefore allows to be called like an ordinary function. Thrust provides `unary_function` and `binary_function` functors. Below I used the `unary_function `because my functor requires on argument to be passed into the function:

```struct RandomNumberFunctor :
public thrust::unary_function<unsigned int, float>
{
unsigned int mainSeed;
RandomNumberFunctor(unsigned int _mainSeed) :
mainSeed(_mainSeed) {}

__host__ __device__
float operator()(unsigned int threadIdx)
{
unsigned int seed = hash(threadIdx) * mainSeed;

// seed a random number generator
thrust::default_random_engine rng(seed);

// create a mapping from random numbers to [0,1)
thrust::uniform_real_distribution<float> u01(0,1);

return u01(rng);
}
};```

Using thrust to utilize the GPU is the simplest way to go. You can see the difference by comparing the GPU_RNG from below to the `CUDACRNGLib GPU_RNG `implementation. Using CUDA C gives you full control of the toolkit but it comes with the price of writing more code.

```extern void GPU_RNG(float * h_randomData, unsigned int dataCount, unsigned int mainSeed)
{
//Allocate device vector
thrust::device_vector<float> d_rngBuffer(dataCount);

//generate random numbers
thrust::transform(thrust::counting_iterator<int>(0),
thrust::counting_iterator<int>(dataCount),
d_rngBuffer.begin(), RandomNumberFunctor(mainSeed));

//copy the random mask back to host
thrust::copy(d_rngBuffer.begin(), d_rngBuffer.end(), h_randomData);
}```

Another good part of Thrust is that every implementation (except copy) exists for GPU as well as for CPU. The CPU implementation is also another three lines of code, this time by using the `thrust::generate` to generate the random numbers by using the C++ standard template library `rand` method and then after using `thrust::transform` to normalize the integer result into float with the help of the `[](float n) {return n / (RAND_MAX + 1);}` lambda expression. I used the lambda expression instead of a functor to show you also this possibility. Especially on the upcoming
Microsoft C++ AMP, lambda expressions play a big role. Lambda expression are handy in C++ as well as in C#, but it comes with a price of giving up unit testing of the inline expression.

### An NVIDIA CURAND Implementation

The NVIDIA CURAND library provides an API for simple and efficient generation of high-quality pseudorandom and quasirandom numbers. The `CURAND `library default pseudorandom engine is a `XORWOW `implementation of the Xorshift RNG (page 5) and it produces higher quality random numbers than LCG.

In order to start using `CURAND`, you only need to include the curand.h header and add the curand.lib to the Linker additional dependencies on the Linker settings.

Like `ThrustRNGLib `Thrust implementation, the `CURANDRNGLib `has a very basic implementation by running the following main code to generate a series of random numbers:

```....
//Create a new generator
curandCreateGenerator(&m_prng, CURAND_RNG_PSEUDO_DEFAULT);
//Set the generator options
curandSetPseudoRandomGeneratorSeed(m_prng, (unsigned long) mainSeed);
//Generate random numbers
curandGenerateUniform(m_prng, d_randomData, dataCount);
....```

`CURAND `provides the `curandCreateGeneratorHost` method besides the `curandCreateGenerator` method, to generate random numbers on the CPU instead of the GPU. Therefore the CPU part is as simple as the GPU part.

### A Mersenne Twister Implementation using CUDA C

Mersenne Twister (MT) is an algorithm developed by Makoto Matsumoto and provides very fast generation of high-quality random numbers. (MT Home Page) A common Mersenne twister implementation, uses an LCG to generate seed data.
Originally there are two MT algorithms suitable to use with CUDA: TinyMT and Mersenne Twister for Graphics Processors (MTGP). But I implemented part of the code from the NVIDIA CUDA Toolkit 4.0 MersenneTwister sample, which uses the original code from Makoto Matsumoto anyways.

The Mersenne Twister RNG is maybe the most complicated implementation out of the other three RNGs I provided, but with that you can look into the algorithm, unlike `CURAND`. The MersenneTwisterRNG.cpp file from the `MersenneTwisterRNGLib` project is the entry point to the library and exports the same `GPU_RNG` and `CPU_RNG` methods as the other libraries. I simplified the host code as much as possible and placed all GPU logic into the GPURNG.cu file. The remaining simple host code can be seen below:

```extern void GPU_RNG(float * h_randomData, unsigned int dataCount,
unsigned int mainSeed)
{
float * d_randomData = 0;

//load GPU twisters configuration
return;
seedMTGPU(mainSeed);

//find the rounded up data count
//because the generator generates in multiples of 4096
int numbersPerRNG = iAlignUp(iDivUp(dataCount, MT_RNG_COUNT), 2);
int randomDataCount = MT_RNG_COUNT * numbersPerRNG;

//allocate device memory
size_t randomDataSize = randomDataCount * sizeof(float);
cudaMalloc((void**)&d_randomData, randomDataSize);

//Call the generator
RNGOnGPU(32, 128, d_randomData, numbersPerRNG);

//Make sure all GPU work is done

//Copy memory back to the device
cudaMemcpy(h_randomData, d_randomData, dataCount * sizeof(float),
cudaMemcpyDeviceToHost);

//free device memory
cudaFree(d_randomData);
}```

## About the Implementations

The Pseudo-random number generators I provided on this post are widely used algorithms, but still I'm not guaranteeing that any of them will perform good enough in your particular solution. In fact, I left some generators poor by purpose to point out the core algorithm and provide variations in randomness. Also, for sake of clarity there is almost no exception handling and logging implemented. This is not an API; my goal is to give you a high level idea how you can use Thrust, CUDA C, CURAND to generate pseudo-random number. Therefore, it is important that you research the algorithms on-line and re-factor the code for your own use.

Some of the code is taken from original NVIDIA samples and have copyright notice on them, the rest is my code which is free to use without any restriction or obligation.

## Conclusion

As in every field of computer science, there are many ways to solve a problem and the possibilities expand exponentially. I just scratched the surface of the possibility of using CUDA to add pseudorandom number generation to your C# application. I hope this post will help you in your project.

## About the Author

 United States
I am an MCAD, MCSD, MCTS, MCPD, MCT and Certified CUDA Programmer.
You can find more technical articles on my blog at http://www.adnanboz.com.

## You may also be interested in...

 Pro Pro

## Comments and Discussions

 First Prev Next
 My vote of 5 manoj kumar choubey23-Feb-12 21:19 manoj kumar choubey 23-Feb-12 21:19
 My vote of 5 Sergio Andrés Gutiérrez Rojas24-Sep-11 7:19 Sergio Andrés Gutiérrez Rojas 24-Sep-11 7:19
 Re: My vote of 5 Adnan Boz25-Sep-11 5:55 Adnan Boz 25-Sep-11 5:55
 Last Visit: 31-Dec-99 18:00     Last Update: 24-Aug-16 7:35 Refresh 1

General    News    Suggestion    Question    Bug    Answer    Joke    Praise    Rant    Admin

Use Ctrl+Left/Right to switch messages, Ctrl+Up/Down to switch threads, Ctrl+Shift+Left/Right to switch pages.