Install Cuda and Use Managed Code with VS 2008 Express on Windows 7-x64

Mark H Bishop

4.73/5 (9 votes)

Jan 11, 2012

CPOL

5 min read

45118

Getting Cuda started on a VS Express budget

Introduction

Note: This article is still relevant but I have changed my approach to GPU programming. I now use CUDA with Java and JCuda from an Eclipse IDE. See my new approach at CodeProject Article 513265

Getting NVidia Cuda up and running when you are on a Visual Studio Express budget can be frustrating, particularly if you want to access Cuda functions from managed code. There are plenty of resources on line to help you on your way but you have to combine information from different sources – while avoiding certain dead ends. It’s a little hit and miss. I hope you can benefit from my journey so far.

For now, I decided to keep it simple: use VS 2008 Express, write my own wrappers, and stick to the x86 platform. Here’s how I succeeded:

Background

I have not configured Cuda for VS 2010 Express. I understand that part of the process requires configuring your 2010 project to use the VS 2008 (VC 90) compiler instead of the VS 2010 (VC100) compiler. Most likely there are a few other hacks required to get things going. There appear be some resources that provide direction on doing this. In particular, I saw one article that looks promising at http://blog.cuvilib.com/2011/02/24/how-to-run-cuda-in-visual-studio-2010/
Running managed code using configurations other than x86 did not work for me. There are several convoluted posts on the web concerning this configuration with VS Express. Google search “Visual C++ 2008 Express Edition And 64-Bit Targets” for some entertaining ways to break your VS Express install.
Working out the install in a virtual machine first is a good idea but it was unclear to me how to access the host’s GPU hardware directly from my guest machine. My VBox virtual graphics adapter is not Cuda enabled and, as best I can tell, Cuda no longer easily supports the emulator mode. So I used the standard technique: make mistakes, break the install, reinstall, and follow the smoke.
I am particularly interested in Fourier transforms on the GPU. Only a few of the canned wrappers sport CUFFT functionality. Cudafy (CodePlex) seemed the most promising but it’s not (yet) an out of the box set-up when you have VS Express.

First time setup

Be sure you have a Cuda enabled card. NVidia has an exhaustive list of compatible GPUs on their Developer Zone web site. http://developer.nvidia.com/cuda-gpus. (I have a GeForce GTX 560 GPU.) If you are not sure, have a look at the GPU Caps Viewer. I am usually hesitant to download many utilities like this, but I have used this application for a few years now, it is widely recognized, and it has a solid green WOT rating. It will fairly reliably identify your GPU and report its OpenGl and Cuda capabilities.

Install VC++ and VSC^# 2008 Express, then verify install with a “Hello World” test in each.

Install the NVidia Development Drivers, Cuda Toolkit, and the Cuda SDK from http://developer.nvidia.com/cuda-toolkit-40

Take Note: From release notes in Toolkit (Start -> Programs -> NVidia): The Win7 environment variables need to be fixed on the v4.1 RC2 installation for Windows7-x64: Environment variables written by the installer may have mistakenly included an extra slash in the path specification.

Double check the environment variables (Computer -> Properties -> Advanced -> Advanced tab):

CUDA_BIN_PATH %CUDA_PATH%\bin

CUDA_INC_PATH %CUDA_PATH%\include

CUDA_LIB_PATH %CUDA_PATH%\lib\x64

Check the install so far:

From a command window run: nvcc –V (You should get a compilation release message.)

Find bandwidthTest.exe (C:\ProgramData\NVIDIA Corporation\NVIDIA GPU Computing SDK 4.1\C\bin\win64\Release) and run it.

Also try oceanFFT.exe

Copy all *.rules files in “C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v4.1\extras\visual_studio_integration\rules” to “C:\Program Files (x86)\Microsoft Visual Studio 9.0\VC\VCProjectDefaults”

Copy “C:\ProgramData\NVIDIA Corporation\NVIDIA GPU Computing SDK 4.1\C\doc\syntax_highlighting\visual_studio_8\usertype.dat” to “C:\Program Files (x86)\Microsoft Visual Studio 9.0\Common7\IDE” folder

Open VC++ and at Tools -> Options:

Text Editor -> File Extensions add two extensions: .cu and .cuh

Projects and Solutions -> VC++ Directories

add %CUDA%bin in the directories for Executable Files

add %CUDA Directory%include in the directories for Include Files

add %CUDA%lib in the directories for Library Files

Close VC++ and reopen, then load your “hello world” program and make sure it still works.

Creating projects

Example: A simple bare-bones wrapper for FFT:

Create a new, empty, Win 32 project named BareBonesCuda. Check the “dll” checkbox on the next page.

Add a source file – type cpp – but name it with .cu extension, eg: test.cu

Right-click the project and choose Custom Build Rules. Tick the box for CUDA Runtime API. There will be two. I use the one that does not have the version # after the name.

Right-click the project and choose Properties.

Under Linker -> General -> Additional Library Directories add: $(CUDA_PATH)/lib/$(PlatformName);

Under Linker -> Input -> Additional Dependencies add: cudart.lib cufft.lib

Paste the following into test.cu:

#include "cufft.h"

 
extern "C" int __declspec(dllexport) __stdcall _Fft(float real[], float imaginary[], int N, int batchSize)

{

            cufftComplex *a_h, *a_d;

            cufftHandle plan;

            int i, nBytes;

            nBytes = sizeof(cufftComplex)*N*batchSize;

            a_h = (cufftComplex *)malloc(nBytes);

            for (i=0; i < N*batchSize; i++) {

                        a_h[i].x = real[i];

                        a_h[i].y = imaginary[i];

            }

            cudaMalloc((void **)&a_d, nBytes);

            if ( cudaGetLastError ( ) != cudaSuccess ) {

                        cufftDestroy(plan);

                        free(a_h); cudaFree(a_d);

                        //False = 0: error condition

                        return 0;

            }

            cudaMemcpy(a_d, a_h, nBytes, cudaMemcpyHostToDevice);

            if (cufftPlan1d(&plan, N, CUFFT_C2C, batchSize) != CUFFT_SUCCESS)

            {

                        cufftDestroy(plan);

                        free(a_h); cudaFree(a_d);

                        //False = 0: error condition

                        return 0;

            }

            cufftExecC2C(plan, a_d, a_d, CUFFT_FORWARD);

            cudaDeviceSynchronize();

            cudaMemcpy(a_h, a_d, nBytes, cudaMemcpyDeviceToHost);

            for (i=0; i < N*batchSize; i++) {

                        real[i] = a_h[i].x;

                        imaginary[i] = a_h[i].y;

            }

            cufftDestroy(plan);

            free(a_h); cudaFree(a_d);

            return 1;

}

Build it. (I hope it works for you too.)

Use the dll in C#

In the example above a file named BareBonesCuda.dll was created in the Debug folder for the solution. Make note of it.

Create a new C# console application. Change the configuration to x86 then debug the empty solution once. This will create a folder in your solution called \bin\x86\Debug. Copy your BareBonesCuda.dll into this folder.

Paste the following into Program.cs:

#include "cufft.h"

 
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Runtime.InteropServices;

namespace MyTestSharp
{
 class Program
 {
 static void Main(string[] args)
 {
  test();
 }

 [DllImport("BareBonesCuda.dll", CallingConvention = CallingConvention.StdCall, EntryPoint = "_Fft")]
 public static extern int _Fft(float[] real, float[] imaginary, int N, int batchSize);
 private static List<float[]> fftFloat(float[] real, float[] imaginary, int N)
 {
  int oK = _Fft(real, imaginary, N, 1);
  List<float[]> fftResult = new List<float[]>();
  fftResult.Add(real);
  fftResult.Add(imaginary);
  return fftResult;
 }

 private static void test()
 {
  int N = 32768;
  float[] real = new float[N];
  float[] imaginary = new float[N];
  StringBuilder sb = new StringBuilder(); ;
  char br = (char)13;

  for (int i = 0; i < N; i++)
  {
  real[i] = (float)i + 1;
  sb.Append(real[i].ToString());
  sb.Append(" + ");
  imaginary[i] = 0;
  sb.Append(imaginary[i].ToString());
  sb.Append(br);
  }

  Console.WriteLine(sb.ToString());
  sb = new StringBuilder();

  List<float[]> result = fftFloat(real, imaginary, N);
  for (int i = 0; i < N; i++)
  {
  sb.Append(real[i].ToString());
  sb.Append(" + ");
  sb.Append(imaginary[i].ToString());
  sb.Append(br);
  }

  Console.WriteLine(sb.ToString());
 }
 }
}

Run it. (Again, I hope it works for you too.)

References

Some references I found useful:

http://developer.download.nvidia.com/compute/cuda/3_1/docs/GettingStartedWindows.pdf

http://www.programmerfish.com/how-to-run-cuda-on-visual-studio-2008-vs08/

http://www.isnull.com.ar/2010/12/tutorial-cuda-32-and-visual-studio-2008.html

Syntax coloring: http://www.c-sharpcorner.com/uploadfile/rafaelwo/cuda-integration-with-C-Sharp/

http://developer.download.nvidia.com/compute/cuda/1_1/CUFFT_Library_1.1.pdf<

http://www.codeproject.com/Messages/4106223/Re-unmanaged-returning-arrays.aspx

Some results

Now that I am up and running, I am very happy with my Cuda performance. Using the CUFFT 1-D, forward, complex Fourier transform with double precision numbers as an example, I see a GPU/CPU performance advantage approaching 270x. For the CPU side of my test I am using a simple recursive radix-2 implementation based on the Sedgwick/ Wayne Java procedure. The transforms from the GPU and CPU versions agree exactly (to machine precision)! My GPU handles vectors up to length N = 16777216… and does it in 0.5 seconds.