Smoked CUDA Cheese and Macaroni

John Michael Hauck

5.00/5 (1 vote)

Jul 22, 2013

CPOL

8 min read

14951

Writing massively parallel Windows software in C++ that takes full advantage of the processing power found in the video cards of today’s gaming computers.

This blog post is inspired by one of the greatest meals of all time: Smoked Gouda Macaroni and Cheese. Yet the post itself is about software programming. Specifically, it is about writing massively parallel Windows software in C++ that takes full advantage of the processing power found in the video cards of today’s gaming computers. These programs can easily process many gigabytes of data per second.

[Recipe: Smoked Gouda Macaroni and Cheese]

Source Code

https://www.assembla.com/code/cudafyperformance/subversion/nodes

Introduction

Unlike aged cheese, there is no virtue in software that takes time to accomplish its purpose. The name of the game is speed, and today’s speed can be found in its purest form in the video card. One of the major video card manufacturers is a company called NVidia. In 2007, they introduced a tool called CUDA which allows C++ programmers to tap the computing power of their video cards. CUDA (which is pronounced “Gouda”) is smoking-fast. It is no accident that one of the world’s fastest computers (Titan) is composed of these graphics processors.

Yet CUDA is not the only high performance technology for Windows C++ developers. In 2011, Microsoft introduced a library called “Accelerated Massive Parallelism”, or AMP. Where CUDA works only on NVidia hardware, AMP works on a variety of hardware devices including other graphics cards, such as those from AMD. Microsoft’s future plans include many twists and turns to support cloud computing, XBOX devices and new multi-core offerings like Intel’s Phi. The Microsoft graphics depicting this strategy appears fairly complex and intertwined, kind of like pasta.

So that then brings us to the main course of this post – a Visual Studio solution that contains a blend of CUDA and AMP. Measuring performance is certainly a large part of this endeavor. We have all read many articles that compare various technologies to see which one is the fastest. Many conclude that “it depends”. Well, of course, it depends. Be assured that I will not use that phrase to conclude this post, but rather use it as a disclaimer before I offer up very specific numbers and conclusions. Now that the disclaimer has been stated, I am going to mix the smoking CUDA cheese and AMP macaroni together in one test to see how they compare.

Oh, one last disclaimer… I’ve learned that when making claims of performance, it often happens that I’ve overlooked a technique or slipped in a bug that causes an arguably irresponsible result. To that end, please accept my apologies and rest assured, I will update this blog entry as I learn about my mistakes.

Testing Memory Transfer Performance

While GPUs can process data very quickly, getting the data to the GPU processor (and back) can be relatively slow. In this test, I am using a Dell Precision T3600, 16GB RAM, Intel Xeon E5-2665 0 @ 2.40GHz, and an NVidia GTXTitan to measure the throughput of transferring an array of floating point values from CPU RAM to GPU RAM and back to CPU RAM again. I have written this test in both CUDA C and C++ AMP. I cycle through various buffer sizes on both platforms to see how transfer size affects performance.

Side Dish - Streams

In addition to varying the buffer size, I also ran the test using a different number of concurrently issued memory transfer requests. This requires a bit of explanation.

In CUDA, concurrent memory transfers are accomplished with a concept called streams. All commands issued using the same stream will be performed in sequential order. Commands issued on separate streams may proceed concurrently. Consider this example:

CPU issues an asynchronous upload (to the GPU) on stream 1
CPU issues an asynchronous upload on stream 2
CPU issues an asynchronous download on stream 1
CPU issues an asynchronous download on stream 2

Operation C is guaranteed to begin only after operation A completes. However, C can begin regardless of the completion status of operation B.

Strictly speaking, AMP does not support streams. It does however return a “future” for asynchronous memory operations that can be waited upon and used to invoke the next operation.

In CUDA, data processing routines (called kernels) can also be scheduled with streams. This then allows the programmer to interleave uploads, processing, and downloads all using the fairly intuitive stream concept.

In AMP, there is no asynchronous parallel_for_each. This then makes the code that implements interleaved memory transfers a bit awkward. Rumors have it that someday a parallel_for_each_async will make its debut and all will be better.

The Test

Here is the C++ AMP code fragment that performs the transfer operations:

for( auto myloop = 0; myloop < 1000 / streams; myloop++ )
{
    for (auto stream = 0; stream < streams; stream++)
    {
        if( downloadFutures[stream].valid() )
        {
            downloadFutures[stream].wait();
            loops++;
        }
 
        uploadFutures[stream] = copy_async(*cpuMemIns[stream], *gpuMemIns[stream]);
    }
 
    for (auto stream = 0; stream < streams; stream++)
    {
        if( uploadFutures[stream].valid() )
        {
            uploadFutures[stream].wait();
        }
 
        downloadFutures[stream] = copy_async(*gpuMemOuts[stream], *cpuMemOuts[stream]);
    }
}

Here is the CUDA C code fragment that performs the transfer operations:

for( auto myloop = 0; myloop < 500 / streams; myloop++ )
{
    for (auto stream = 0; stream < streams; stream++)
    {
        he(cudaStreamSynchronize(cudaStreams[stream]));
        upload(cpuMemIns[stream], gpuMemIns[stream], cudaStreams[stream], bufferSize);
        download(cpuMemOuts[stream], gpuMemOuts[stream], cudaStreams[stream], bufferSize);
        loops++;
    }
}

Check out the complete source code for details on how to allocate the memory in each platform. Look for the functions called “…ThroughputTest”.

The Memory Transfer Results

What these tests show is that when transferring less than 1048545 (1024*1024-31) floats, CUDA shows a 400% performance advantage over AMP. When the transfers are larger than this, the differences diminish. I have been trying to determine the cause of this performance threshold in AMP with no success. I am hopeful someone can shed some light on this for me.

The tests also show that for a single stream, CUDA performs at 130% the speed of AMP. For multiple streams operating on many data points, the performance advantage of CUDA becomes much more insignificant.

If your data happens to be 8-bit or 16-bit values, then there is an additional complication in the AMP environment. That is, AMP does not support these value types. A worst case scenario then is that you increase the number of bits of your data to 32-bit values and incur the overhead of transferring a large number of zero bits. Of course, you could pack multiple values into a 32-bit AMP integer, with the resulting code complexities.

AMP’s array_view

In addition to the “array” class used in the performance test above, AMP offers another mechanism called array_view. The array_view class’s purpose is to remove the complexity of data transfer calls from the programmer.

In order to benchmark the memory transfer performance of AMP’s array_view, we need to force the array_view to transfer data to the GPU. The only way to do this is to write some code that runs on the GPU that accesses the GPU data. The array_view will then recognize this code and then correctly transfers the data between the CPU and GPU as needed.

To compare the transfer speed with the array_view implementation, I have added the same GPU code to the array implementation. Adding this GPU code slows down the array implementation from 10.4 GB/s to 9.0 GB/s.

The performance of array_view shows a large performance dip below 1024*1024 floats. This is a slightly different position than with array (which dips below 1024*1024-31). The array_view implementation also shows a significant (1:2.3) overall lethargy compared to the array implementation.

Smoothing Data

To test computational performance, we will smooth 1,048,576 floating point values using a set of 63 smoothing coefficients in both CUDA and AMP. We will make our timings based upon repeatedly smoothing the data so that we are measuring the sustained throughput of the system.

The source code provided also allows us to test the two platforms with or without overlapped I/O (concurrent streams). It also allows us to test the platforms using different kernel complexities:

Normal – Invoke the smoothing kernel one time
Faster – Invoke a specially optimized smoothing kernel
Nothin – Invoke the smoothing kernel but do no smoothing
5Times – Invoke the smoothing kernel 5 times for each data point

Using the numbers from each platform’s best configuration, we can compare the “Faster smooth, overlapped I/O” of AMP (at 1.24 ms) with the similar CUDA run (at 0.74 ms). From this, we can conclude that CUDA takes a little more than half the time that AMP does to perform the same work.

Beyond Performance

At this point, it might seem easy to conclude that CUDA is the preferred technology. Yet AMP offers many advantages not to be overlooked by the Visual Studio developer.

For example, AMP is integrated with Visual Studio and does not require the sometimes tricky CUDA installation and setup. As of this writing, CUDA still does not integrate nicely with VS2012. This may indeed be fixed, but it does show that with CUDA, you are relying on two separate vendors instead of one.

Another area where AMP shines is the ability to run on non-NVidia hardware. At first blush, that may not seem like a big deal in an NVidia-dominated market. Yet with the introduction of Intel’s Phi, it may be that NVidia’s compute market share will be seriously challenged.

Finally, AMP may simply feel more comfortable to the seasoned C++ developer than the strange mix found in CUDA C. Many of the constructs in AMP meld seamlessly into the C+ framework. Therefore, a small performance hit may be worth it in the long run of production code.