
Faster copies to CUDA GPUs

By Nick Kopp, 21 Jun 2012

Introduction

When developing CUDA applications for NVIDIA GPUs, an important factor is the performance of data transfers between the CPU and the GPU. Since these transfers are often the bottleneck of an algorithm, it is important that they are optimized.

Background 

Modern GPUs are connected to their host CPUs via PCI-Express (PCIe). With 16-lane Gen2 PCIe, sustained speeds of around 6 GBytes/sec are possible, and for graphics applications this is typically plenty. In multi-GPU set-ups the PCIe bus will often drop back to 8 lanes without any impact on frames per second. This is because much of the graphics data is held on the GPU and does not need to be updated from the host in its entirety very often. It is also why the recent move to Gen3 PCIe does not provide significant speed-ups for games.

However, when using a GPU for compute, for example via CUDA, PCIe bandwidth becomes an important factor, and Gen3 PCIe will be welcome. Fresh input data is usually streamed to the GPU and, unlike in a graphics application, result data is also returned to the host. Together this puts much greater demands on PCIe performance.

Tests 

We want to test the bandwidth of both uploading data to and downloading data from a GPU. All tests are performed on a Windows 7 64-bit machine with two GPUs installed: a GeForce GTX 460 and a Quadro 4000. Each transfer copies a 256 MByte block. Each test is run with two different types of system memory:

  1. Non-pinned memory (ordinary pageable memory, allocated with malloc in C or new in C#; the matching device buffer is allocated with cudaMalloc).
  2. Pinned memory (page-locked memory allocated using cudaHostAlloc).

Test One

  1. GeForce GTX 460
  2. Standard WDDM (Windows Display Driver Model) Driver (not optimized for transfers) 
copy_timed
GeForce GTX 460
Using non-optimized driver.
Using cudaMalloc: 
        MB/s during copy up: 3730 
        MB/s during copy down: 3673 
Using cudaHostAlloc:
        MB/s during copy up: 5718 
        MB/s during copy down: 6287 
Done!  

Test Two 

  1. Quadro 4000 
  2. Windows Tesla Compute Cluster (TCC) Driver (optimized for transfers, but no display possible)  
copy_timed
Quadro 4000
Using optimized driver.
Using cudaMalloc: 
        MB/s during copy up: 5558
        MB/s during copy down: 5364
Using cudaHostAlloc:
        MB/s during copy up: 5737
        MB/s during copy down: 6188 
Done!  
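The figures above can be turned into per-block numbers with some quick arithmetic. The sketch below (plain Python, using the upload rates from the WDDM run on the GTX 460 in Test One) works out the time per 256 MB block and the pinned-over-pageable speedup:

```python
# Per-block transfer times implied by the measured upload rates
# (WDDM driver, GeForce GTX 460, taken from Test One above).
BLOCK_MB = 256  # each copy in the test moves a 256 MB block

rates = {
    "pageable (cudaMalloc path)": 3730,   # MB/s
    "pinned (cudaHostAlloc path)": 5718,  # MB/s
}

# time per block in milliseconds: MB / (MB/s) * 1000
times_ms = {name: 1000.0 * BLOCK_MB / rate for name, rate in rates.items()}
speedup = rates["pinned (cudaHostAlloc path)"] / rates["pageable (cudaMalloc path)"]

for name, t in times_ms.items():
    print(f"{name}: {t:.1f} ms per 256 MB block")
print(f"pinned upload speedup: {speedup:.2f}x")
```

So on this machine pinned memory cuts the upload of a 256 MB block from roughly 69 ms to about 45 ms, a factor of about 1.5.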

Discussion 

What is evident is that the TCC driver gives a massive performance boost for non-pinned memory transfers. Why is this significant? Why not just always use pinned memory? Well, pinned memory is in relatively short supply compared to non-pinned, and it is less flexible in use. We see this especially when using the CUDA .NET wrapper CUDAfy.NET: an array of CLR Int32s, for example, can be transferred directly only from non-pinned memory. Pinned memory is therefore used primarily as a staging post, and you are forced to copy into it first. Unless that copy is done in parallel with the transfer to the GPU, you lose all the advantage of pinned memory. Doing it in parallel is possible, but it complicates the code and puts stress on system memory: when you start adding up 6 GBytes/sec copies to and from the GPU plus copies in and out of staging buffers, it does not take much to saturate system memory.
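The staging cost can be put into rough numbers. If the copy into the pinned staging buffer runs at b_mem MB/s and the PCIe transfer at b_pcie MB/s, doing the two steps one after the other gives an effective rate of 1/(1/b_mem + 1/b_pcie). This is only an illustrative model with hypothetical figures, not a measurement:

```python
def effective_rate(b_mem, b_pcie):
    """Effective MB/s when a copy into a pinned staging buffer (b_mem MB/s)
    must complete before the PCIe transfer (b_pcie MB/s) starts:
    the per-MB times of the two steps simply add up."""
    return 1.0 / (1.0 / b_mem + 1.0 / b_pcie)

# Hypothetical figures: 8000 MB/s host memcpy, 5700 MB/s pinned PCIe copy.
print(f"{effective_rate(8000, 5700):.0f} MB/s")
```

With these assumed figures the sequential staged copy lands around 3300 MB/s, below the non-pinned rate achieved under the TCC driver, which is exactly why staging only pays off when it is overlapped with the PCIe transfer.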

How do you enable TCC? Install the Tesla driver and then switch the driver model with NVIDIA's command-line nvidia-smi tool (for example nvidia-smi -g 0 -dm 1; the exact flags depend on the driver version, so check nvidia-smi --help). See this post for more info.

Code 

The following code was used to compare the GeForce GTX 460 with the Quadro 4000. Both were in the system at the same time. The GTX 460 was in slot 0 but shows up as CUDA device 1; the Quadro 4000 is in slot 1 and is CUDA device 0. The Quadro has no display attached and the TCC driver is enabled.

The code is written in C# using the open source CUDAfy.NET library and is based on the C code in the book CUDA by Example by Sanders and Kandrot. Source code and downloads are available on CodePlex. You will need V1.10 to run the code below as-is, or just build from the latest sources.

/* 
* This software is based upon the book CUDA By Example by Sanders and Kandrot
* and source code provided by NVIDIA Corporation.
* It is a good idea to read the book while studying the examples!
*/
using System;
using System.Collections.Generic;
using System.Text;
using Cudafy;
using Cudafy.Host;
namespace CudafyByExample
{
    public class copy_timed
    {
        public const int SIZE = 64*1024*1024;
        private GPGPU _gpu;
        private float cuda_malloc_test(int size, bool up) 
        {
            // Non-pinned (pageable) host memory: an ordinary CLR array.
            int[] a = new int[size];
            int[] dev_a = _gpu.Allocate<int>(size);
            
            _gpu.StartTimer();
            
            // Time 100 copies in the requested direction.
            for (int i = 0; i < 100; i++) 
            {
                if (up)
                    _gpu.CopyToDevice(a, dev_a);
                else
                    _gpu.CopyFromDevice(dev_a, a);
            }
            float elapsedTime = _gpu.StopTimer();
            _gpu.FreeAll();
   
            GC.Collect();
            return elapsedTime;
        }
        private float cuda_host_alloc_test(int size, bool up) 
        {
            // Pinned (page-locked) host memory allocated via cudaHostAlloc.
            IntPtr a = _gpu.HostAllocate<int>(size);
            int[] dev_a = _gpu.Allocate<int>(size);
            
            _gpu.StartTimer();
            
            // Time 100 copies in the requested direction.
            for (int i = 0; i < 100; i++) 
            {
                if (up)
                    _gpu.CopyToDevice(a, 0, dev_a, 0, size);
                else
                    _gpu.CopyFromDevice(dev_a, 0, a, 0, size);
            }
            float elapsedTime = _gpu.StopTimer();
            _gpu.FreeAll();
            _gpu.HostFree(a);
            GC.Collect();
            return elapsedTime;
        }

        public void Execute() 
        {
            float elapsedTime;
            float MB = (float)100*SIZE*sizeof(int)/1024/1024;
            _gpu = CudafyHost.GetDevice(CudafyModes.Target, 0);
            var props = _gpu.GetDeviceProperties();
            Console.WriteLine(props.Name);
            Console.WriteLine("Using {0}optimized driver.", 
                              props.HighPerformanceDriver ? "" : "non-");
            // try it with malloc
            elapsedTime = cuda_malloc_test(SIZE, true);
            Console.WriteLine("Time using cudaMalloc: {0} ms",
                    elapsedTime);
            Console.WriteLine("\tMB/s during copy up: {0}",
                    MB / (elapsedTime / 1000));
            elapsedTime = cuda_malloc_test(SIZE, false);
            Console.WriteLine("Time using cudaMalloc: {0} ms",
                    elapsedTime);
            Console.WriteLine("\tMB/s during copy down: {0}",
                    MB / (elapsedTime / 1000));
            // now try it with cudaHostAlloc
            elapsedTime = cuda_host_alloc_test(SIZE, true);
            Console.WriteLine("Time using cudaHostAlloc: {0} ms",
                    elapsedTime);
            Console.WriteLine("\tMB/s during copy up: {0}",
                    MB / (elapsedTime / 1000));
            elapsedTime = cuda_host_alloc_test(SIZE, false);
            Console.WriteLine("Time using cudaHostAlloc: {0} ms",
                    elapsedTime);
            Console.WriteLine("\tMB/s during copy down: {0}",
                    MB / (elapsedTime / 1000));
        }
    } 
}
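The bandwidth arithmetic in Execute can be checked standalone: 100 iterations of a 64 Mi-element int copy move 100 x 64 x 1024 x 1024 x 4 bytes, i.e. 25600 MB in total. A quick Python sketch of the same formula:

```python
SIZE = 64 * 1024 * 1024        # ints per copy, as in the C# constant
SIZEOF_INT = 4                 # bytes, matching sizeof(int) in the C# code
MB = 100 * SIZE * SIZEOF_INT / 1024 / 1024   # total MB moved over 100 copies
print(MB)  # 25600.0

def mb_per_s(elapsed_ms):
    # same formula as the C#: MB / (elapsedTime / 1000)
    return MB / (elapsed_ms / 1000)
```

For example, mb_per_s(1000) is 25600: moving the whole 25600 MB in one second.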

Points of Interest 

Using TCC has other advantages, notably that you can use Remote Desktop. With the WDDM drivers, none of your GPUs will show up when you connect to your PC remotely, so your CUDA apps will not run: they cannot discover any CUDA GPUs. TCC overcomes this.

Quadro vs GeForce: the Quadro 4000 costs around five times the price of the GeForce GTX 460. And what do you get for that? Lower performance! The memory is slower, the clock speed is lower, the CUDA core count is comparable and the WDDM copy bandwidth is the same. But you do get to use the TCC driver and, equally important, you can overlap uploading with downloading: the Quadro 4000 has dual copy engines compared to the single engine on the GeForce. If you architect your application correctly you will have uploads, downloads and compute kernels all operating in parallel, and the combination of these factors can suddenly give boosts of 50% or more.
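The value of overlapping can be sketched with simple timing arithmetic. With everything serialized, each block costs upload + kernel + download; with uploads, downloads and kernels fully pipelined (dual copy engines plus asynchronous kernel launches), the steady-state cost per block is only the slowest stage. The stage times below are purely hypothetical:

```python
def serial_ms(up, kernel, down):
    # no overlap: the three stages run back to back for every block
    return up + kernel + down

def pipelined_ms(up, kernel, down):
    # full overlap: in steady state a new block completes every
    # max(stage) milliseconds; the slowest stage is the bottleneck
    return max(up, kernel, down)

up, kernel, down = 45, 50, 42   # ms per 256 MB block, illustrative only
s, p = serial_ms(up, kernel, down), pipelined_ms(up, kernel, down)
print(f"serial: {s} ms/block, pipelined: {p} ms/block, speedup {s / p:.2f}x")
```

With stages this evenly balanced the pipelined version processes blocks almost three times faster, which is where end-to-end application boosts of 50% or more can come from.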

GPUDirect: at the time of writing this remains a rather mysterious and poorly documented area. GPUDirect is an NVIDIA marketing term covering a range of optimized transfers between GPUs and other devices. The umbrella is unfortunately wide and includes:

  1. Transfers between 3rd party cards and GPU via system memory (GPUDirect for Video).
  2. Transfers between 3rd party cards and GPU NOT via system memory.  
  3. Transfers between GPU and network cards (read some InfiniBand) via system memory.
  4. Transfers between GPU and network cards (read some InfiniBand) NOT via system memory (i.e., via the IOH bridge chip). 
  5. Transfers between two CUDA GPUs via system memory. 
  6. Transfers between two CUDA GPUs NOT via system memory. 
  7. Access to peer GPU's memory from within kernel of another GPU (NOT via system memory). 
Here's my take on it:
  1. GPUDirect for Video is aimed at producers of frame grabbers, to optimize transfers to the GPU. The API is only available to those producers, for them to incorporate into their applications and libraries, so it is transparent to end-users. Linux and Windows are supported with a Quadro 4000 or higher. Note that you cannot have a GeForce GPU in your system, even one not involved in the processing; its mere presence prevents operation. TCC is not necessary (thankfully, since you most likely want to display the data from the frame grabber). 
  2. This is something that will open up exciting possibilities in CUDA 5. Linux support only.
  3. Mellanox and QLogic/Intel. Linux. 
  4. Mellanox, Extoll and QLogic/Intel.  Linux. 
  5. This can be done using cuMemcpyPeer. 
  6. This can be done using cuMemcpyPeer in a system that supports it: both devices using the TCC driver (or running under Linux) and of compute capability 2.0 or higher. When the copy bypasses system memory you can see around a 75% increase over bullet point #5, plus less load on the CPU and system memory. 
  7. If the system supports it (check with cuDeviceCanAccessPeer), enable peer access using cuCtxEnablePeerAccess.

Is there any way to get TCC on the cheap? Well, yes: there are hacks posted on some forums that modify your GeForce firmware to fool the NVIDIA driver into thinking it is a Tesla 2090. Suddenly you can enable TCC and have dual copy engines. However, this carries a real risk of permanently rendering your GPU useless.

History

Initial release.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

About the Author

Nick Kopp
Systems Engineer, Hybrid DSP Systems
Netherlands
Nick is co-owner of Hybrid DSP, a company specialized in high-speed data acquisition, processing and storage.

CUDAfy.NET took considerable effort to develop and we ask nothing in return from users of the LGPL library other than that you please consider donating to Harmony through Education. This small charity helps handicapped children in developing countries by providing suitable schooling.
