These posts are meant to inspire you to enter into the world of graphics processor programming.

**Posts in This Series**

**Full Source Code**

http://cudafytsp.codeplex.com/

**Recap**

In the last post, we put together a multi-threaded, multi-core CPU solution to the traveling salesman problem. Here was the main outer `for`

loop that accomplished our work:

private Answer MpuTsp()
{
var bestDistance = float.MaxValue;
var bestPermutation = -1;
var locker = new Object();
var stopWatch = Stopwatch.StartNew();
Parallel.For(0, _permutations, permutation =>
{
var path = new int[1, Cities];
var distance = FindPathDistance(_permutations, permutation,
Cities, _latitudes, _longitudes, path, 0);
lock (locker)
{
if (distance < bestDistance)
{
bestDistance = distance;
bestPermutation = permutation;
}
}
});
return new Answer { Distance = bestDistance,
Permutation = bestPermutation, Milliseconds = stopWatch.ElapsedMilliseconds };
}

**Massively Parallel GPGPU Solution**

Now we will modify the solution to take advantage of an NVidia GPGPU.

private Answer GpuTsp()
{
var gpu = CudafyHost.GetDevice();
var km = CudafyTranslator.Cudafy();
gpu.LoadModule(km);
var stopWatch = Stopwatch.StartNew();
var answer = new float[BlocksPerGrid, 2];
var gpuLatitudes = gpu.CopyToDevice(_latitudes.ToArray());
var gpuLongitudes = gpu.CopyToDevice(_longitudes.ToArray());
var gpuAnswer = gpu.Allocate(answer);
gpu.Launch(BlocksPerGrid, ThreadsPerBlock,"GpuFindPathDistance",
_permutations, Cities, gpuLatitudes, gpuLongitudes, gpuAnswer);
gpu.Synchronize();
gpu.CopyFromDevice(gpuAnswer, answer);
var bestDistance = float.MaxValue;
var bestPermutation = 0;
for (var i = 0; i < BlocksPerGrid; i++)
{
if (answer[i, 0] < bestDistance)
{
bestDistance = answer[i, 0];
bestPermutation = (int)answer[i, 1];
}
}
return new Answer { Distance = bestDistance,
Permutation = bestPermutation, Milliseconds = stopWatch.ElapsedMilliseconds };
}

Some things to notice:

- We gain access to the GPGPU by calling
`CudafyHost.GetDevice`

. There are nuances here, but for now, let’s call it simple. - We transfer data (the longitudes and latitudes) to the GPGPU using its device memory. Again, nuances but nothing too weird.
- We launch a function called
`GpuFindPathDistance`

. We will be looking at that code right after we discuss the concepts of threads, blocks, the grid, and memory. - We call Synchronize, which, as you might expect, waits for the GPU operation to complete.
- We then transfer some answers from the GPGPU back into main memory and look for the shortest permutation. For now, rest assured that even if we are performing millions or billions of permutations, the number of answers we need to loop through will be a relatively small number, like 1024 or less.

**Threads, ****Blocks, **the Grid, and Memory

In Cudafy, a thread is basically what you might expect it to be. What is new to CPU developers are the concepts of blocks and grids.

**Block**: Think of a block as a group of threads that can allocate and share GPGPU memory. If one thread in a block allocates memory, all threads in the block have access to that memory. There is also an ability to synchronize the threads within a block. There is a limit, however, of only 512 or 1024 threads per block depending on the hardware. **Grid**: Think of the grid as the GPGPU itself. In other words, if we see a reference to blocks per grid, that really means the total number of blocks. There is a hardware limit of 65535 blocks per grid in older hardware, and 2^31-1 blocks per grid in newer hardware.

So in the code above, suppose we specify 128 threads per block and 128 blocks per grid. That means we will have 16384 threads working the problem.

There are many types of GPGPU memory, but in this series of posts we will cover only two kinds:

**Device Memory**: This kind of memory can be accessed (read and written) by all threads in all blocks in the GPGPU and by the CPU. This is not as fast as shared memory. **Shared Memory**: This kind of memory is allocated by GPGPU threads and is accessible only among the threads in the same block. In other words, shared memory allocations take place once per block.

**GpuFindPathDistance**

Now the fun begins with some real heavy-duty GPGPU code that implements the
`GpuFindPathDistance`

function we launched above. The first thing you should notice is that it is C#. Also, it calls our function from part 1 of this series of posts:
`FindPathDistance`

(which in turn calls `PathFromRoutePermutation`

) – so none of that needs to be re-written.

public static void GpuFindPathDistance(GThread thread, int permutations,
int cities, float[] latitudes, float[] longitudes, float[,] answer)
{
var threadIndex = thread.threadIdx.x; var blockIndex = thread.blockIdx.x; var threadsPerBlock = thread.blockDim.x;
var blocksPerGrid = thread.gridDim.x;
var threadsPerGrid = threadsPerBlock * blocksPerGrid;
var permutation = threadIndex + blockIndex * threadsPerBlock;
var paths = thread.AllocateShared<int>("path", ThreadsPerBlock, Cities);
var bestDistances = thread.AllocateShared<float>("dist", ThreadsPerBlock);
var bestPermutations = thread.AllocateShared<int>("perm", ThreadsPerBlock);
var bestDistance = float.MaxValue;
var bestPermutation = 0;
while (permutation < permutations)
{
var distance = FindPathDistance(permutations, permutation,
cities, latitudes, longitudes, paths, threadIndex);
if (distance < bestDistance)
{
bestDistance = distance;
bestPermutation = permutation;
}
permutation += threadsPerGrid;
}
bestDistances[threadIndex] = bestDistance;
bestPermutations[threadIndex] = bestPermutation;
thread.SyncThreads();
for (var i = threadsPerBlock / 2; i > 0; i /= 2)
{
if (threadIndex < i)
{
if (bestDistances[threadIndex] > bestDistances[threadIndex + i])
{
bestDistances[threadIndex] = bestDistances[threadIndex + i];
bestPermutations[threadIndex] = bestPermutations[threadIndex + i];
}
}
thread.SyncThreads();
}
if (threadIndex == 0)
{
answer[blockIndex, 0] = bestDistances[0];
answer[blockIndex, 1] = bestPermutations[0];
}
}

**Computing the Permutation Number**

There is a Cudafy-supplied parameter of type
`GThread`

. From `GThread`

we can determine which thread and block we are in, as well as the total number of threads and blocks. The formula to determine the permutation from all this looks pretty straightforward. However, we might suspect that the number of permutations needing to be computed is more than the total number of threads running. Therefore, we use a for loop to calculate additional permutations as needed. For example, the fifth thread in the eleventh block begins by computing the permutation 10*128+4 (if there are 128 threads per block). Once this thread computes the distance for permutation 1284, it will compute the distance for permutation 1284+128*128 (if there are 128 threads per block and 128 blocks). The loop will continue until the next permutation to compute exceeds the number of permutation computations required. The first thread in the first block will likely loop one more time than the last thread in the last block since the number of permutations is probably not divisible by 128*128.

**Shared Memory**

The three lines of code that allocate memory look as if each thread in each block is allocating memory. This is an understandable mistake, and simply a nuance of how the GPGPU works. In reality, this memory is only allocated once per block. Therefore, whatever we allocate needs to be enough for all threads in the block. That is why the number of items allocated is usually some multiple of the number of threads. In our solution, we allocate one floating point number to track the shortest distance found by each thread. We also need to store the permutation number that corresponds to the shortest distance. Finally, we need some scratch pad memory to hold the city path sequence. Now it might make sense to you why we wrote
`PathFromRoutePermutation`

as we did in part 1.

**Clever Algorithm**

At the end of the permutation loop (at the first
`thread.SyncThreads`

), we are left with 128 best distances and permutation numbers per block. We need the call to
`SyncThreads`

to ensure all the all 128 threads have made it to this point before proceeding.

What we want to do now is loop through all 128 best distances to find the smallest one. One solution is to let thread 0 loop through the data while the other 127 threads do nothing. This is easily done by enclosing the loop inside of an if statement. Or, we could come up with some clever algorithm that used more than 1 thread to go faster. The for loop in the code above does this most elegantly. This code was adapted from the code I found in a wonderfully helpful book called CUDA By Example. The basic idea is for thread 0 to compare its own results with that of thread 64’s results. Thread 0 then places the shortest of those two results back into thread 0’s results. At the same time thread 1 does the same thing with the results of thread 65. This will keep 64 of the 128 threads busy for a wee bit. At the end of that iteration of the for loop we are left with 64 best distances. We loop again and are left with 32 best distances. You can see how this ends.

**Final Answer**

Once we have narrowed down the results to one best distance per block, we place that result into the GPGPU device memory that is accessible by the CPU. The CPU is then tasked with retrieving these results and comparing the best result of each block to get the final answer. We could perform this last pass in the GPGPU. Would that be better? You tell me.

**Discussion**

>>> Part 4 – Discussion