
# Cudafy Me: Part 3 of 4

12 Oct 2012, CPOL

These posts are meant to inspire you to enter into the world of graphics processor programming.

Posts in This Series

Full Source Code

http://cudafytsp.codeplex.com/

Recap

In the last post, we put together a multi-threaded, multi-core CPU solution to the traveling salesman problem. Here was the main outer for loop that accomplished our work:

{
    var bestDistance = float.MaxValue;
    var bestPermutation = -1;
    var locker = new Object();
    var stopWatch = Stopwatch.StartNew();

    Parallel.For(0, _permutations, permutation =>
    {
        var path = new int[1, Cities];
        var distance = FindPathDistance(_permutations, permutation,
            Cities, _latitudes, _longitudes, path, 0);
        lock (locker)
        {
            if (distance < bestDistance)
            {
                bestDistance = distance;
                bestPermutation = permutation;
            }
        }
    });

    return new Answer { Distance = bestDistance,
        Permutation = bestPermutation, Milliseconds = stopWatch.ElapsedMilliseconds };
}

Massively Parallel GPGPU Solution

Now we will modify the solution to take advantage of an NVIDIA GPGPU.

{
    var gpu = CudafyHost.GetDevice();
    var km = CudafyTranslator.Cudafy();
    gpu.LoadModule(km);

    var stopWatch = Stopwatch.StartNew();
    var answer = new float[BlocksPerGrid, 2];
    var gpuLatitudes = gpu.CopyToDevice(_latitudes.ToArray());
    var gpuLongitudes = gpu.CopyToDevice(_longitudes.ToArray());
    var gpuAnswer = gpu.Allocate(answer);

    // Launch BlocksPerGrid blocks of ThreadsPerBlock threads each.
    gpu.Launch(BlocksPerGrid, ThreadsPerBlock).GpuFindPathDistance(
        _permutations, Cities, gpuLatitudes, gpuLongitudes, gpuAnswer);
    gpu.Synchronize();

    // One (distance, permutation) pair per block comes back to the CPU.
    gpu.CopyFromDevice(gpuAnswer, answer);
    gpu.FreeAll();

    var bestDistance = float.MaxValue;
    var bestPermutation = 0;
    for (var i = 0; i < BlocksPerGrid; i++)
    {
        if (answer[i, 0] < bestDistance)
        {
            bestDistance = answer[i, 0];
            bestPermutation = (int)answer[i, 1];
        }
    }

    return new Answer { Distance = bestDistance,
        Permutation = bestPermutation, Milliseconds = stopWatch.ElapsedMilliseconds };
}

Some things to notice:

1. We gain access to the GPGPU by calling CudafyHost.GetDevice. There are nuances here, but for now let's keep it simple.
2. We transfer data (the longitudes and latitudes) to the GPGPU using its device memory. Again, nuances but nothing too weird.
3. We launch a function called GpuFindPathDistance. We will be looking at that code right after we discuss the concepts of threads, blocks, the grid, and memory.
4. We call Synchronize, which, as you might expect, waits for the GPU operation to complete.
5. We then transfer some answers from the GPGPU back into main memory and look for the shortest permutation. For now, rest assured that even if we are performing millions or billions of permutations, the number of answers we need to loop through will be a relatively small number, like 1024 or less.

Threads, Blocks, the Grid, and Memory

In Cudafy, a thread is basically what you might expect it to be. The concepts of blocks and grids, however, will be new to CPU developers.

• Block: Think of a block as a group of threads that can allocate and share GPGPU memory. If one thread in a block allocates memory, all threads in the block have access to that memory. There is also an ability to synchronize the threads within a block. There is a limit, however, of only 512 or 1024 threads per block depending on the hardware.
• Grid: Think of the grid as the GPGPU itself. In other words, if we see a reference to blocks per grid, that really means the total number of blocks. There is a hardware limit of 65535 blocks per grid in older hardware, and 2^31-1 blocks per grid in newer hardware.
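The hardware limits above constrain how a launch configuration is chosen. Here is a quick serial sketch of one reasonable strategy (plain Python; the helper name and clamping approach are illustrative, not from the Cudafy library):

```python
# Hypothetical helper: pick a launch configuration under the per-block and
# per-grid limits described above. When the work count exceeds the clamped
# grid, the extra work is handled by having each thread stride over it.
def launch_config(work_items, threads_per_block=128,
                  max_threads_per_block=1024, max_blocks_per_grid=65535):
    if threads_per_block > max_threads_per_block:
        raise ValueError("exceeds hardware thread limit")
    blocks = (work_items + threads_per_block - 1) // threads_per_block
    blocks = min(blocks, max_blocks_per_grid)  # excess handled by striding
    return blocks, threads_per_block

print(launch_config(1_000_000))  # -> (7813, 128)
```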

So in the code above, suppose we specify 128 threads per block and 128 blocks per grid. That means we will have 16384 threads working the problem.

There are many types of GPGPU memory, but in this series of posts we will cover only two kinds:

• Device Memory: This kind of memory can be accessed (read and written) by all threads in all blocks in the GPGPU and by the CPU. This is not as fast as shared memory.
• Shared Memory: This kind of memory is allocated by GPGPU threads and is accessible only among the threads in the same block. In other words, shared memory allocations take place once per block.

GpuFindPathDistance

Now the fun begins with some real heavy-duty GPGPU code that implements the GpuFindPathDistance function we launched above. The first thing you should notice is that it is C#. It also calls our function from part 1 of this series, FindPathDistance (which in turn calls PathFromRoutePermutation), so none of that needs to be rewritten.

[Cudafy]
public static void GpuFindPathDistance(GThread thread, int permutations,
    int cities, float[] latitudes, float[] longitudes, float[,] answer)
{
    var threadIndex = thread.threadIdx.x;   // thread index within the block
    var blockIndex = thread.blockIdx.x;     // block index within the grid
    var threadsPerBlock = thread.blockDim.x;
    var blocksPerGrid = thread.gridDim.x;
    var permutation = threadIndex + blockIndex * threadsPerBlock;

    // Shared memory: allocated once per block, sized for every thread in it.
    // ThreadsPerBlock and MaxCities must be compile-time constants.
    var path = thread.AllocateShared<int>("path", ThreadsPerBlock, MaxCities);
    var bestDistances = thread.AllocateShared<float>("dist", ThreadsPerBlock);
    var bestPermutations = thread.AllocateShared<int>("perm", ThreadsPerBlock);

    var bestDistance = float.MaxValue;
    var bestPermutation = 0;

    while (permutation < permutations)
    {
        var distance = FindPathDistance(permutations, permutation,
            cities, latitudes, longitudes, path, threadIndex);
        if (distance < bestDistance)
        {
            bestDistance = distance;
            bestPermutation = permutation;
        }
        permutation += threadsPerBlock * blocksPerGrid; // grid stride
    }

    bestDistances[threadIndex] = bestDistance;
    bestPermutations[threadIndex] = bestPermutation;
    thread.SyncThreads();

    // Tree reduction: halve the active threads each pass until
    // bestDistances[0] holds the best result for the whole block.
    for (var i = threadsPerBlock / 2; i > 0; i /= 2)
    {
        if (threadIndex < i)
        {
            if (bestDistances[threadIndex] > bestDistances[threadIndex + i])
            {
                bestDistances[threadIndex] = bestDistances[threadIndex + i];
                bestPermutations[threadIndex] = bestPermutations[threadIndex + i];
            }
        }
        thread.SyncThreads();
    }

    if (threadIndex == 0)
    {
        answer[blockIndex, 0] = bestDistances[0];
        answer[blockIndex, 1] = bestPermutations[0];
    }
}

Computing the Permutation Number

Each thread computes its starting permutation number from its own coordinates: threadIdx.x + blockIdx.x * blockDim.x. When it finishes one permutation, it jumps ahead by the total number of threads in the grid (blockDim.x * gridDim.x) and repeats until it passes the last permutation. The threads thereby divide the permutations among themselves without any coordination.

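The permutation-number assignment can be simulated serially (a plain Python sketch; names are illustrative, not the Cudafy API):

```python
# Grid-stride assignment: thread t in block b starts at its global index and
# advances by the total number of threads in the grid.
def permutations_for(block, thread, threads_per_block, blocks_per_grid, total):
    start = thread + block * threads_per_block
    stride = threads_per_block * blocks_per_grid
    return list(range(start, total, stride))

# With 2 blocks of 4 threads over 20 permutations, the threads collectively
# visit every permutation exactly once.
covered = sorted(p for b in range(2) for t in range(4)
                 for p in permutations_for(b, t, 4, 2, 20))
print(covered == list(range(20)))  # True
```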
Shared Memory

The three lines of code that allocate memory look as if each thread in each block is allocating memory. This is an understandable mistake, and simply a nuance of how the GPGPU works. In reality, this memory is only allocated once per block. Therefore, whatever we allocate needs to be enough for all threads in the block. That is why the number of items allocated is usually some multiple of the number of threads. In our solution, we allocate one floating point number to track the shortest distance found by each thread. We also need to store the permutation number that corresponds to the shortest distance. Finally, we need some scratch pad memory to hold the city path sequence. Now it might make sense to you why we wrote PathFromRoutePermutation as we did in part 1.
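Because the allocations are sized per thread, the shared-memory budget grows with the block size. A rough sizing sketch (Python; 32-bit values and 11 cities are illustrative assumptions, not figures from this article):

```python
# Rough shared-memory budget per block for the allocations described above:
# one float distance and one int permutation per thread, plus a path scratch
# row of `cities` ints per thread. Assumes 32-bit (4-byte) values.
def shared_bytes_per_block(threads_per_block, cities):
    distances = 4 * threads_per_block
    permutations = 4 * threads_per_block
    paths = 4 * threads_per_block * cities
    return distances + permutations + paths

print(shared_bytes_per_block(128, 11))  # -> 6656 bytes, well within limits
```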

Clever Algorithm

At the end of the permutation loop (at the first thread.SyncThreads), we are left with 128 best distances and permutation numbers per block. We need the call to SyncThreads to ensure that all 128 threads have made it to this point before proceeding.
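The halving loop is a classic parallel tree reduction: on each pass, each thread in the first half of the active range compares its value against a partner half a stride away. Simulated serially (illustrative Python, not the kernel itself):

```python
# Serial simulation of the in-block tree reduction: each pass halves the
# number of active slots until slot 0 holds the block's best distance.
def block_reduce_min(dist, perm):
    n = len(dist)  # must be a power of two, e.g. 128 threads per block
    i = n // 2
    while i > 0:
        for t in range(i):  # on the GPU these comparisons run in parallel
            if dist[t] > dist[t + i]:
                dist[t] = dist[t + i]
                perm[t] = perm[t + i]
        i //= 2
    return dist[0], perm[0]

d = [5.0, 2.5, 9.0, 1.5, 7.0, 3.0, 8.0, 4.0]
p = [0, 1, 2, 3, 4, 5, 6, 7]
print(block_reduce_min(d, p))  # -> (1.5, 3)
```

With 128 values the loop finishes in seven passes (log2 of 128), which is why the reduction is so much cheaper than comparing all values serially on one thread.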

Once we have narrowed down the results to one best distance per block, we place that result into the GPGPU device memory that is accessible by the CPU. The CPU is then tasked with retrieving these results and comparing the best result of each block to get the final answer. We could perform this last pass in the GPGPU. Would that be better? You tell me.


Software Developer (Senior), LECO Corporation, United States
John Hauck has been developing software professionally since 1981, and focused on Windows-based development since 1988. For the past 17 years John has been working at LECO, a scientific laboratory instrument company, where he manages software development. John also served as the manager of software development at Zenith Data Systems, as the Vice President of software development at TechSmith, as the lead medical records developer at Instrument Makar, as the MSU student who developed the time and attendance system for Dart container, and as the high school kid who wrote the manufacturing control system at Wohlert. John loves the Lord, his wife, their three kids, and sailing on Lake Michigan.
