
Posted 10 May 2014

# Parallelised Monte Carlo Algorithms #1

Updated 26 May 2014
How to make the best use of current technology for computationally intensive applications.


## Introduction

This is the first in a series of articles I hope to write about parallelism and distribution of computationally intensive applications.

In this article, I present a number of techniques available to .NET developers for parallelising Monte Carlo routines on a single machine. We will see that while the TPL has its benefits, it is difficult to write TPL routines that outperform thread pooling and mixed C#/C++ implementations.

## Background

Briefly, a Monte Carlo routine is any iterative routine that uses random number generation to calculate a result whose accuracy improves as the number of iterations grows. Because each iteration is independent, we can distribute the computation across any number of locations and aggregate the results afterwards.
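To make the idea concrete before we tackle the torus, here is a minimal Monte Carlo sketch of my own (not part of the article's code): estimate pi by sampling points in the unit square and counting the fraction that land inside the quarter circle.

```cpp
#include <random>

// Minimal Monte Carlo illustration (not the article's torus problem):
// sample points uniformly in the unit square and count how many fall
// inside the quarter circle; the hit ratio approximates pi/4.
double EstimatePi(long iterations, unsigned seed)
{
    std::mt19937 engine(seed);
    std::uniform_real_distribution<double> dist(0.0, 1.0);
    long inside = 0;
    for (long i = 0; i < iterations; i++)
    {
        double x = dist(engine);
        double y = dist(engine);
        if (x * x + y * y < 1.0)
            inside++;                 // point landed inside the quarter circle
    }
    return 4.0 * inside / iterations; // hit ratio * 4 approximates pi
}
```

Because every sample is independent, batches run on different threads (or machines) can simply be summed before the final division, which is exactly the property the rest of this article exploits.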

Confused? Then go here for a decent description.

For the purposes of this article, I've used the version of this algorithm presented in Numerical Recipes in C by Press, Flannery, Teukolsky and Vetterling (pages 239-240), where the routine finds the weight and centre of mass of a section of torus with varying density. In plain English: we're chopping up a lumpy doughnut and finding the centre of mass of a single chunk.

## Using the Code

The attached solution should compile and run within Visual Studio Express 2012 for Windows Desktop; build it in Release mode. It includes four projects:

1. `Montecarlo` - containing just .NET code
2. `ComponentWin32` - a native C++ DLL with static exports
3. `ComponentCLR` - a C++/CLI assembly that can be referenced by any other .NET assembly
4. `CppConsole` - a simple runner for the native Monte Carlo implementation

## Points of Interest

There are several techniques demonstrated, which I've developed and tested on my four-core desktop machine:

1. Pure C# on a single thread which we shall use as our benchmark for accuracy
2. Native C++ on a single thread as our benchmark for speed
3. A mixed C#/C++ version to demonstrate one method of integration
4. Single threaded PInvoke version where we create an unmanaged object and call methods on it
5. A Threadpool version where the work is split up, executed by pooled threads and aggregated afterwards
6. TPL `Parallel.For` methods where we improve the performance and accuracy of the result with each evolution
7. Threadpooled and Tasked PInvoke where we attempt to use the fastest execution method with .NET parallelisation.

Starting off with a single-threaded approach in C#, we can rapidly implement our basic routine and use it to validate the results of each subsequent evolution.

```
public void Calculate(ref ResultsdotNet results)
{
    double w, x, y, s, z, dw, dx, dy, dz, sw, swx, swy, swz, varw, varx, vary, varz, ss, vol;
    w = x = y = s = z = dw = dx = dy = dz = sw = swx = swy = swz = varw = varx = vary = varz = 0.0;
    ss = 0.2 * (Math.Exp(5.0) - Math.Exp(-5.0));
    vol = 3.0 * 7.0 * ss;
    long count = 0;
    long n = results.Iterations;
    for (int j = 1; j <= n; j++)
    {
        // Sample a random point in the enclosing region
        x = 1.0 + 3.0 * r.NextDouble();
        y = (-3.0) + 7.0 * r.NextDouble();
        s = 0.00135 + ss * r.NextDouble();
        z = 0.2 * Math.Log(5.0 * s);

        double b = Math.Sqrt((x * x) + (y * y)) - 3.0;
        double a = (z * z) + (b * b);
        if (a < 1.0)        // the point lies inside the torus section
        {
            sw += 1.0;
            swx += x;
            swy += y;
            swz += z;
            varw += 1.0;    // accumulate (was "varw = 1.0", which breaks the dW estimate)
            varx += x * x;
            vary += y * y;
            varz += z * z;
        }
        count++;
    }

    results.W = vol * sw / n;
    results.X = vol * swx / n;
    results.Y = vol * swy / n;
    results.Z = vol * swz / n;
    results.dW = vol * Math.Sqrt((varw / n - Math.Pow((sw / n), 2.0)) / n);
    results.dX = vol * Math.Sqrt((varx / n - Math.Pow((swx / n), 2.0)) / n);
    results.dY = vol * Math.Sqrt((vary / n - Math.Pow((swy / n), 2.0)) / n);
    results.dZ = vol * Math.Sqrt((varz / n - Math.Pow((swz / n), 2.0)) / n);
}
```

Running this in Release mode for 10 million iterations gives:

We know we can make this faster, but as long as other approaches give us similar numbers, we know we are on the right track.

Let's try the same thing in C++ and see how that compares.

```
vector<double> TorusMontecarlo::Calculate(int n)
{
    double w, x, y, s, z, dw, dx, dy, dz, sw, swx, swy, swz, varw, varx, vary, varz, ss, vol;
    w = x = y = s = z = dw = dx = dy = dz = sw = swx = swy = swz = varw = varx = vary = varz = 0.0;
    ss = 0.2 * (exp(5.0) - exp(-5.0));
    vol = 3.0 * 7.0 * ss;
    for (int j = 1; j <= n; j++)
    {
        x = 1.0 + 3.0 * RandomNumbers::ran3(&idum);
        y = (-3.0) + 7.0 * RandomNumbers::ran3(&idum);
        s = 0.00135 + ss * RandomNumbers::ran3(&idum);
        z = 0.2 * log(5.0 * s);
        if (SQR(z) + SQR(sqrt(SQR(x) + SQR(y)) - 3.0) < 1.0)
        {
            sw += 1.0;
            swx += x;
            swy += y;
            swz += z;
            varw += 1.0;    // accumulate, as in the C# version
            varx += SQR(x);
            vary += SQR(y);
            varz += SQR(z);
        }
    }
    w = vol * sw / n;
    x = vol * swx / n;
    y = vol * swy / n;
    z = vol * swz / n;
    dw = vol * sqrt((varw / n - SQR(sw / n)) / n);
    dx = vol * sqrt((varx / n - SQR(swx / n)) / n);
    dy = vol * sqrt((vary / n - SQR(swy / n)) / n);
    dz = vol * sqrt((varz / n - SQR(swz / n)) / n);
    double res[] = {w, dw, x, dx, y, dy, z, dz};
    vector<double> result(res, res + sizeof(res) / sizeof(double));
    return result;
}
```

Without much knowledge of how to optimise for the Microsoft C++ compiler, and using a rather primitive stopwatch implementation, we get similar accuracy in about two-thirds of the time:
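The "primitive stopwatch" mentioned above isn't shown in the article; a simple sketch of my own using C++11 `<chrono>` (class name and interface are illustrative, not the article's timer) could look like:

```cpp
#include <chrono>

// A simple stopwatch sketch using std::chrono. This is an illustration,
// not the timer actually used for the article's measurements.
class Stopwatch
{
    std::chrono::steady_clock::time_point start_;
public:
    Stopwatch() : start_(std::chrono::steady_clock::now()) {}

    // Reset the starting point to now.
    void Restart() { start_ = std::chrono::steady_clock::now(); }

    // Milliseconds elapsed since construction or the last Restart().
    double ElapsedMilliseconds() const
    {
        auto now = std::chrono::steady_clock::now();
        return std::chrono::duration<double, std::milli>(now - start_).count();
    }
};
```

Using `steady_clock` rather than the wall clock avoids measurements being skewed by system time adjustments mid-run.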

Obviously, C++ is not everyone's cup of tea, and developing full applications in it rather than just libraries is probably ill-advised for business software. So let's see how we do when mixing C++ with the managed world.

### Mixed C# and C++

Let's flip the `/clr` switch on the compiler, add some supporting classes and call this routine (implemented in ComponentCLR.dll) from C#.

Not so impressive, but still beats our pure C# implementation.

What if we expose our native Win32 C++ implementation and invoke it statically from C# via PInvoke?
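On the native side, a static export for PInvoke is just an `extern "C"` function. The function name, signature and body below are illustrative, not the actual `ComponentWin32` export:

```cpp
// Sketch of a static DLL export as PInvoke would consume it. The name and
// signature are illustrative; the real ComponentWin32 export differs.
#ifdef _WIN32
  #define MC_EXPORT extern "C" __declspec(dllexport)
#else
  #define MC_EXPORT extern "C"
#endif

// extern "C" disables C++ name mangling so [DllImport] can find the symbol.
MC_EXPORT double RunTorusMontecarlo(long iterations)
{
    // ... run the Monte Carlo routine here and return, say, the weight W ...
    return iterations > 0 ? 0.0 : -1.0;  // placeholder body for this sketch
}
```

On the C# side, this would then be declared with `[DllImport]` against the DLL name and called like any static method.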

Our unmanaged object performs as well as our native implementation, so why bother with managed C++?

From .NET 1.1 onwards, C# developers have been able to take advantage of the ThreadPool to parallelise work without having to instantiate threads manually. To do so optimally, we need to find the number of cores on the machine and split the work equally among them. Once the work has been queued to the ThreadPool and kicked off, the parent thread waits until each batch signals that it has completed.

Something like the below should do the trick:

```
public ResultsdotNet CalculateThreadpool(ResultsdotNet total)
{
    ManualResetEvent[] events = new ManualResetEvent[Environment.ProcessorCount];
    List<ResultsdotNet> results = new List<ResultsdotNet>();

    for (int i = 1; i <= Environment.ProcessorCount; i++)
    {
        ResultsdotNet batchResult = new ResultsdotNet()
        {
            Iterations = total.Iterations / Environment.ProcessorCount,
            ManualResetEvent = new ManualResetEvent(false)
        };
        events[i - 1] = batchResult.ManualResetEvent;
        results.Add(batchResult);

        // Queue the batch: each pooled thread runs the single-threaded
        // routine on its own ResultsdotNet and signals when done.
        ThreadPool.QueueUserWorkItem(state =>
        {
            ResultsdotNet batch = (ResultsdotNet)state;
            Calculate(ref batch);
            batch.ManualResetEvent.Set();
        }, batchResult);
    }

    WaitHandle.WaitAll(events);

    foreach (ResultsdotNet batchResult in results)
    {
        total.dW += batchResult.dW;
        total.dX += batchResult.dX;
        total.dY += batchResult.dY;
        total.dZ += batchResult.dZ;
        total.W += batchResult.W;
        total.X += batchResult.X;
        total.Y += batchResult.Y;
        total.Z += batchResult.Z;
        total.Iterations += batchResult.Iterations;
    }

    total.dW = total.dW / results.Count;
    total.dX = total.dX / results.Count;
    total.dY = total.dY / results.Count;
    total.dZ = total.dZ / results.Count;
    total.W = total.W / results.Count;
    total.X = total.X / results.Count;
    total.Y = total.Y / results.Count;
    total.Z = total.Z / results.Count;

    return total;
}
```
```

Not a vast amount of sophisticated code, though it could certainly be more elegant using the TPL. When we run it purely in C# for a total of 10 million iterations spread equally among the CPU cores, we get:

This is almost four times faster than single-threaded. We can assume that, minus some minor overhead from task switching, Monte Carlo routines will scale very well indeed!

Let's try the TPL, and specifically the `Parallel.For` method, to see if we can get good performance and more elegant code. For example, you might expect the following to be as good as, if not better than, using the ThreadPool; it is newer, after all:

```
public void CalculateParallel(ref ResultsdotNet results)
{
    double w, x, y, s, z, dw, dx, dy, dz, swx, swy, swz, varw, varx, vary, varz, ss, vol, sw;
    w = x = y = s = z = dw = dx = dy = sw = dz = swx = swy = swz = varw = varx = vary = varz = 0.0;
    ss = 0.2 * (Math.Exp(5.0) - Math.Exp(-5.0));
    vol = 3.0 * 7.0 * ss;
    long count = 0;
    long n = results.Iterations;
    Parallel.For(0, n, f =>
    {
        // Creating a new Random per iteration is expensive, and the captured
        // accumulators below are shared across threads without synchronisation.
        Random r = new Random(Environment.TickCount);

        x = 1.0 + 3.0 * r.NextDouble();
        y = (-3.0) + 7.0 * r.NextDouble();
        s = 0.00135 + ss * r.NextDouble();
        z = 0.2 * Math.Log(5.0 * s);
        double b = Math.Sqrt((x * x) + (y * y)) - 3.0;
        double a = Math.Pow(z, 2.0) + (b * b);
        if (a < 1.0)
        {
            sw += 1;
            swx += x;
            swy += y;
            swz += z;
            varw += 1.0;
            varx += Math.Pow(x, 2.0);
            vary += Math.Pow(y, 2.0);
            varz += Math.Pow(z, 2.0);
        }
        Interlocked.Increment(ref count);
    });
    results.W = vol * sw / n;
    results.X = vol * swx / n;
    results.Y = vol * swy / n;
    results.Z = vol * swz / n;
    results.dW = vol * Math.Sqrt((varw / n - Math.Pow((sw / n), 2.0)) / n);
    results.dX = vol * Math.Sqrt((varx / n - Math.Pow((swx / n), 2.0)) / n);
    results.dY = vol * Math.Sqrt((vary / n - Math.Pow((swy / n), 2.0)) / n);
    results.dZ = vol * Math.Sqrt((varz / n - Math.Pow((swz / n), 2.0)) / n);
}
```

The result:

Not so impressive: it performs worse than the ThreadPool. Each iteration is too small a unit of work, so it looks like we're not splitting the work up correctly between the threads that perform it.

To take care of the division-of-labour issue, let's use `Parallel.ForEach`, but also take advantage of partitioning in order to batch the work more efficiently:

```
long n = results.Iterations;
OrderablePartitioner<Tuple<long, long>> chunkPart =
    Partitioner.Create(0, n, n / 100);
long count = 0;
Parallel.ForEach(chunkPart, chunkRange =>
{
    // process iterations chunkRange.Item1 (inclusive) to chunkRange.Item2 (exclusive)
});
```

Note the third argument, where I split the total iterations into 100 partitions. Naturally, you might assume that sizing the partitions according to the number of cores would perform similarly to the Threadpool example above. Unfortunately, that doesn't seem to be the case. Feel free to tweak the code and prove it to yourself.
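The idea behind chunked partitioning can be sketched outside the TPL as well. The following is my own illustration using `std::thread` (function name and shape are mine, not the article's code): each worker sums a contiguous range into a thread-local accumulator and touches the shared total only once per chunk, instead of once per iteration.

```cpp
#include <mutex>
#include <thread>
#include <vector>

// Chunked accumulation sketch: each worker owns a contiguous range and a
// local accumulator, so the shared total is locked only once per chunk.
double ChunkedSum(long n, int chunks)
{
    double total = 0.0;
    std::mutex m;
    std::vector<std::thread> workers;
    long chunkSize = n / chunks;
    for (int c = 0; c < chunks; c++)
    {
        long begin = c * chunkSize;
        long end = (c == chunks - 1) ? n : begin + chunkSize; // last chunk takes the remainder
        workers.emplace_back([&, begin, end]()
        {
            double local = 0.0;               // thread-local accumulator
            for (long i = begin; i < end; i++)
                local += (double)i;
            std::lock_guard<std::mutex> lock(m);
            total += local;                   // one synchronised merge per chunk
        });
    }
    for (auto& t : workers)
        t.join();
    return total;
}
```

This is the same principle the `Partitioner` applies: make the synchronised section rare and cheap relative to the unsynchronised work.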

Next, let's use interlocking to make the operations on shared state atomic. As there is no Interlocked method for adding doubles in .NET, we'll have to implement our own:

```
public static double Add(ref double location1, double value)
{
    double newCurrentValue = location1; // read the current value as a starting point
    while (true)
    {
        double currentValue = newCurrentValue;
        double newValue = currentValue + value;
        // CompareExchange stores newValue only if location1 still equals
        // currentValue; either way it returns what was actually in location1.
        newCurrentValue = Interlocked.CompareExchange(ref location1, newValue, currentValue);
        if (newCurrentValue == currentValue)
            return newValue; // no other thread intervened, so our add took effect
    }
}
```
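For comparison, the same compare-and-swap loop in C++ (my sketch, not from the article) would use `std::atomic<double>`, which before C++20 likewise has no fetch-add for floating-point types:

```cpp
#include <atomic>

// CAS loop for adding to a double, mirroring the C# Interlocked pattern:
// retry until no other thread changed the value between our read and write.
double AtomicAdd(std::atomic<double>& target, double value)
{
    double current = target.load();
    double desired;
    do
    {
        desired = current + value;
        // On failure, compare_exchange_weak refreshes 'current' with the
        // value another thread stored, so the next attempt uses fresh data.
    } while (!target.compare_exchange_weak(current, desired));
    return desired;
}
```

`compare_exchange_weak` may fail spuriously on some architectures, which is harmless here because the loop simply retries.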

With partitioning and interlocking added in, what do we get?

This time, the results look correct and we've achieved a significant improvement in speed, but not enough to beat the Threadpool or our mixed C#/C++ solutions.

Clearly, how we divide the work and then put the results back together is a big factor, so let's compare the traditional ThreadPool with a Task-based approach using similar logic to partition the work:

```
public void Run(ResultsdotNet results, int partitions)
{
    long iterationsPerTask = results.Iterations / partitions;
    Task[] tasks = new Task[partitions];
    for (int n = 1; n <= partitions; n++)
    {
        tasks[n - 1] = Task.Factory.StartNew(() =>
        {
            // Each task runs its own batch; aggregation afterwards
            // mirrors the ThreadPool example above.
            ResultsdotNet batch = new ResultsdotNet() { Iterations = iterationsPerTask };
            Calculate(ref batch);
        });
    }
    Task.WaitAll(tasks);
}
```

Note here that (for lack of a better technique) we are again using the number of processor cores to partition our work, and this time we are able to use a value type for each batch.

This is much better, with results comparable to, if not slightly better than, the ThreadPool example. However, when I ran it again on my laptop (another four-core machine), the results were reversed: the ThreadPool ran slightly faster than Tasks, so I expect your results will also vary.

Based on our single-threaded testing, we might expect a mix of threading and C#/PInvoke to give us the best performance. So let's put that to the test.

Unfortunately, the original C++ implementation of ran3() from Numerical Recipes is not thread-safe. We'll have to refactor it slightly:

```
#define MBIG 1000000000
#define MSEED 161803398
#define MZ 0
#define FAC (1.0/MBIG)

RandomNumbers::RandomNumbers()
{
    iff = 0;
}

double RandomNumbers::ran3(long *idum)
{
    long mj, mk;
    int i, ii, k;

    if (*idum < 0 || iff == 0) {    // initialise the generator state on first use
        iff = 1;
        mj = labs(MSEED - labs(*idum));
        mj %= MBIG;
        ma[55] = mj;
        mk = 1;
        for (i = 1; i <= 54; i++) {
            ii = (21 * i) % 55;
            ma[ii] = mk;
            mk = mj - mk;
            if (mk < MZ) mk += MBIG;
            mj = ma[ii];
        }
        for (k = 1; k <= 4; k++)
            for (i = 1; i <= 55; i++) {
                ma[i] -= ma[1 + (i + 30) % 55];
                if (ma[i] < MZ) ma[i] += MBIG;
            }
        inext = 0;
        inextp = 31;
        *idum = 1;
    }
    if (++inext == 56) inext = 1;
    if (++inextp == 56) inextp = 1;
    mj = ma[inext] - ma[inextp];
    if (mj < MZ) mj += MBIG;
    ma[inext] = mj;
    return mj * FAC;
}
#undef MBIG
#undef MSEED
#undef MZ
#undef FAC
```

When we run this via the Threadpool, we get:

Surprisingly, we get WORSE results than we might expect. Why? The most obvious candidate is that we create a new instance of the C++ calculator with every thread. We can probably do better by creating a pool of calculation objects, but we'll leave that to the next update.
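One source of shared state that modern C++ can eliminate without any hand-refactoring is the random generator itself. As a sketch of my own (not the article's code), C++11 `<random>` plus `thread_local` gives each thread its own engine:

```cpp
#include <random>

// Thread-safe random numbers without shared state: each thread lazily
// constructs and owns its own engine and distribution, so no locking or
// per-object refactoring of a global generator is needed.
double ThreadLocalUniform()
{
    thread_local std::mt19937_64 engine(std::random_device{}());
    thread_local std::uniform_real_distribution<double> dist(0.0, 1.0);
    return dist(engine); // uniform in [0, 1)
}
```

This trades ran3's deterministic seeding for per-thread independence; if reproducibility matters, each thread's engine can instead be seeded from its thread index.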

### Conclusions

Clearly, it's not easy to beat single-threaded methods with multi-threaded techniques for a 'simple' iterative routine like Monte Carlo integration. As we saw with the Parallel.For methods, the key is to partition the work efficiently between each unit of execution and then aggregate the results correctly at the end.

So far, we have found that Tasks and thread pooling are our best bets as we scale out, though which of these you use will depend on your circumstances. It looks almost certain that combining these with a PInvoke-based calculation pool, using the techniques explained here, would achieve the most efficient use of our resources.

So far, we have only really looked at utilising the CPU on a single machine. But what about the graphics card, or all those wasted high-powered desktops sitting around your organisation? In the next article, we will look at building a calculation pool and compare and combine it with other techniques, such as CUDA and distributed grid processing, to see how we can scale further.

All suggestions for improvements are happily accepted.

## History

1. Creation
2. Added some performance improvements suggested by Andy Harman and Alexey.
3. Tagged with C++ to expose the article to that audience in case there are some helpful suggestions
4. Fixed the zip file, updated images and commentary on single threaded and Parallel.For
5. Edited article

## About the Author

CatchExAs, aka Nick Wilton, is Technical Lead at Alpha Integralis Limited in the United Kingdom, a software development consultancy to companies in the City of London.

Main interests include exploring Algo-Trading, Quant Development and Physics using C# and Java.

www.nickwilton.info/Blog
