Install Cuda and Use Managed Code with VS 2008 Express on Windows 7-x64

Mark H Bishop, 24 Dec 2012
Getting Cuda started on a VS Express budget

Introduction 

Note:  This article is still relevant but I have changed my approach to GPU programming. I now use CUDA with Java and JCuda from an Eclipse IDE. See my new approach at CodeProject Article 513265

Getting NVidia Cuda up and running when you are on a Visual Studio Express budget can be frustrating, particularly if you want to access Cuda functions from managed code. There are plenty of resources online to help you on your way, but you have to combine information from different sources while avoiding certain dead ends. It’s a little hit and miss. I hope you can benefit from my journey so far.

For now, I decided to keep it simple: use VS 2008 Express, write my own wrappers, and stick to the x86 platform. Here’s how I succeeded:

Background

  • I have not configured Cuda for VS 2010 Express. I understand that part of the process requires configuring your 2010 project to use the VS 2008 (VC90) compiler instead of the VS 2010 (VC100) compiler. Most likely a few other hacks are required to get things going. There appear to be some resources that provide direction on doing this. In particular, I saw one article that looks promising at http://blog.cuvilib.com/2011/02/24/how-to-run-cuda-in-visual-studio-2010/

  • Running managed code using configurations other than x86 did not work for me. There are several convoluted posts on the web concerning this configuration with VS Express. Google search “Visual C++ 2008 Express Edition And 64-Bit Targets” for some entertaining ways to break your VS Express install.

  • Working out the install in a virtual machine first is a good idea but it was unclear to me how to access the host’s GPU hardware directly from my guest machine. My VBox virtual graphics adapter is not Cuda enabled and, as best I can tell, Cuda no longer easily supports the emulator mode. So I used the standard technique: make mistakes, break the install, reinstall, and follow the smoke.

  • I am particularly interested in Fourier transforms on the GPU. Only a few of the canned wrappers sport CUFFT functionality. Cudafy (CodePlex) seemed the most promising but it’s not (yet) an out of the box set-up when you have VS Express.

First time setup

  • Be sure you have a Cuda enabled card. NVidia has an exhaustive list of compatible GPUs on their Developer Zone web site: http://developer.nvidia.com/cuda-gpus (I have a GeForce GTX 560 GPU.) If you are not sure, have a look at the GPU Caps Viewer. I am usually hesitant to download utilities like this, but I have used this application for a few years now, it is widely recognized, and it has a solid green WOT rating. It will fairly reliably identify your GPU and report its OpenGL and Cuda capabilities.

  • Install Visual C++ 2008 Express and Visual C# 2008 Express, then verify each install with a “Hello World” test.

  • Install the NVidia driver, the Cuda Toolkit, and the GPU Computing SDK. The paths below assume v4.1 of the Toolkit and SDK.

  • Take note: according to the release notes that ship with the Toolkit (Start -> Programs -> NVidia), the environment variables written by the v4.1 RC2 installer on Windows 7 x64 may mistakenly include an extra slash in the path specification and need to be fixed by hand.

  • Double check the environment variables (Computer -> Properties -> Advanced system settings -> Advanced tab -> Environment Variables):

    • CUDA_BIN_PATH = %CUDA_PATH%\bin

    • CUDA_INC_PATH = %CUDA_PATH%\include

    • CUDA_LIB_PATH = %CUDA_PATH%\lib\x64

  • Check the install so far:

    • From a command window run: nvcc -V (You should get a compilation release message.)

    • Find bandwidthTest.exe (C:\ProgramData\NVIDIA Corporation\NVIDIA GPU Computing SDK 4.1\C\bin\win64\Release) and run it.

    • Also try oceanFFT.exe

  • Copy all *.rules files in “C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v4.1\extras\visual_studio_integration\rules” to “C:\Program Files (x86)\Microsoft Visual Studio 9.0\VC\VCProjectDefaults”

  • Copy “C:\ProgramData\NVIDIA Corporation\NVIDIA GPU Computing SDK 4.1\C\doc\syntax_highlighting\visual_studio_8\usertype.dat” to “C:\Program Files (x86)\Microsoft Visual Studio 9.0\Common7\IDE” folder

  • Open VC++ and go to Tools -> Options:

    • Text Editor -> File Extensions: add two extensions, .cu and .cuh

    • Projects and Solutions -> VC++ Directories:

      • add the Cuda bin folder, $(CUDA_PATH)\bin, to the directories for Executable Files

      • add the Cuda include folder, $(CUDA_PATH)\include, to the directories for Include Files

      • add the Cuda lib folder, $(CUDA_PATH)\lib, to the directories for Library Files

  • Close VC++ and reopen, then load your “hello world” program and make sure it still works.
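The environment variable check above can also be scripted. The following is only a minimal sketch, not part of the original setup: the helper names (hasDoubledSeparator, readEnvironment, countSuspectCudaVariables) are made up for illustration, and it assumes that the "extra slash" problem shows up as two path separators in a row.

```cpp
#include <cassert>
#include <cstdlib>
#include <string>

// True when the character is a Windows path separator.
static bool isSeparator(char c) { return c == '\\' || c == '/'; }

// True when two separators appear back to back. The pair at positions
// 0 and 1 is skipped because a leading "\\" is a legitimate UNC prefix.
bool hasDoubledSeparator(const std::string& path) {
    for (std::string::size_type i = 1; i + 1 < path.size(); ++i) {
        if (isSeparator(path[i]) && isSeparator(path[i + 1])) return true;
    }
    return false;
}

// Returns the value of an environment variable, or an empty string
// when the variable is not set.
std::string readEnvironment(const char* name) {
    assert(name != NULL);
    const char* value = std::getenv(name);
    return value ? std::string(value) : std::string();
}

// Counts how many of the three Cuda variables listed above are either
// unset or contain a doubled separator (hypothetical helper).
int countSuspectCudaVariables() {
    const char* names[] = { "CUDA_BIN_PATH", "CUDA_INC_PATH", "CUDA_LIB_PATH" };
    int suspect = 0;
    for (int i = 0; i < 3; ++i) {
        const std::string value = readEnvironment(names[i]);
        if (value.empty() || hasDoubledSeparator(value)) ++suspect;
    }
    return suspect;
}
```

Calling countSuspectCudaVariables() from a small console program should return 0 on a clean install. Any other value means at least one variable is worth inspecting by hand.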

Creating projects

Example: A simple bare-bones wrapper for FFT:

  • Create a new, empty Win32 project named BareBonesCuda. Check the “DLL” checkbox on the next page.

  • Add a source file of type cpp, but name it with a .cu extension, e.g. test.cu

  • Right-click the project and choose Custom Build Rules. Tick the box for CUDA Runtime API. There will be two entries; I use the one that does not have the version number after its name.

  • Right-click the project and choose Properties:

    • Under Linker -> General -> Additional Library Directories add: $(CUDA_PATH)/lib/$(PlatformName);

    • Under Linker -> Input -> Additional Dependencies add: cudart.lib cufft.lib

Paste the following into test.cu:

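#include <stdlib.h>
#include <cuda_runtime.h>
#include "cufft.h"

// Forward complex-to-complex FFT, computed in place on the split real[] and
// imaginary[] arrays. Returns 1 on success and 0 on any error.
extern "C" int __declspec(dllexport) __stdcall _Fft(float real[], float imaginary[], int N, int batchSize)
{
    cufftComplex *a_h = NULL, *a_d = NULL;
    cufftHandle plan;
    int i, nBytes;
    int ok = 0;

    nBytes = sizeof(cufftComplex)*N*batchSize;

    // Pack the split arrays into interleaved complex values on the host
    a_h = (cufftComplex *)malloc(nBytes);
    if (a_h == NULL) {
        return 0;
    }
    for (i=0; i < N*batchSize; i++) {
        a_h[i].x = real[i];
        a_h[i].y = imaginary[i];
    }

    if (cudaMalloc((void **)&a_d, nBytes) != cudaSuccess) {
        // No plan and no device buffer exist yet: release the host buffer only
        free(a_h);
        return 0;
    }
    if (cudaMemcpy(a_d, a_h, nBytes, cudaMemcpyHostToDevice) != cudaSuccess) {
        free(a_h); cudaFree(a_d);
        return 0;
    }
    if (cufftPlan1d(&plan, N, CUFFT_C2C, batchSize) != CUFFT_SUCCESS) {
        // Plan creation failed, so there is no plan to destroy
        free(a_h); cudaFree(a_d);
        return 0;
    }

    // Transform in place on the device, then copy the result back
    if (cufftExecC2C(plan, a_d, a_d, CUFFT_FORWARD) == CUFFT_SUCCESS &&
        cudaDeviceSynchronize() == cudaSuccess &&
        cudaMemcpy(a_h, a_d, nBytes, cudaMemcpyDeviceToHost) == cudaSuccess) {
        for (i=0; i < N*batchSize; i++) {
            real[i] = a_h[i].x;
            imaginary[i] = a_h[i].y;
        }
        ok = 1;
    }

    cufftDestroy(plan);
    free(a_h); cudaFree(a_d);
    return ok;
}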

Build it. (I hope it works for you too.)
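Before trusting the wrapper, it can help to have something to compare its output against. The following is a minimal sketch of a CPU reference, not part of the DLL: referenceDft and maxAbsDifference are hypothetical names, and the code assumes the same conventions as _Fft (split real/imaginary arrays, batches stored one after another, an unnormalized forward transform with a negative exponent, 1 for success and 0 for failure). It evaluates the DFT sum directly, which costs N*N operations per transform, so it only suits small N.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical CPU reference for the _Fft wrapper: forward DFT computed
// straight from the definition, in place on split real/imaginary arrays.
int referenceDft(float real[], float imaginary[], int n, int batchSize) {
    if (!real || !imaginary || n <= 0 || batchSize <= 0) return 0;
    const double twoPi = 2.0 * std::acos(-1.0);
    std::vector<double> outRe(n), outIm(n);
    for (int b = 0; b < batchSize; ++b) {
        float* re = real + b * n;        // start of this batch
        float* im = imaginary + b * n;
        for (int k = 0; k < n; ++k) {
            double sumRe = 0.0, sumIm = 0.0;
            for (int j = 0; j < n; ++j) {
                // Twiddle factor exp(-2*pi*i*j*k/n); j*k is reduced mod n
                // to keep the angle small.
                const long long jk = (static_cast<long long>(j) * k) % n;
                const double angle = -twoPi * static_cast<double>(jk) / n;
                const double c = std::cos(angle), s = std::sin(angle);
                sumRe += re[j] * c - im[j] * s;
                sumIm += re[j] * s + im[j] * c;
            }
            outRe[k] = sumRe;
            outIm[k] = sumIm;
        }
        for (int k = 0; k < n; ++k) {    // write the result back in place
            re[k] = static_cast<float>(outRe[k]);
            im[k] = static_cast<float>(outIm[k]);
        }
    }
    return 1;
}

// Largest absolute element-wise difference between two arrays.
float maxAbsDifference(const float a[], const float b[], int count) {
    assert(count >= 0);
    float worst = 0.0f;
    for (int i = 0; i < count; ++i) {
        const float d = std::fabs(a[i] - b[i]);
        if (d > worst) worst = d;
    }
    return worst;
}
```

For the input 1, 2, 3, 4 with zero imaginary parts, the transform is approximately 10, -2 + 2i, -2, -2 - 2i. Running the same input through the DLL and comparing the two results with maxAbsDifference gives a quick sanity check.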

Use the dll in C#

In the example above, the build created a file named BareBonesCuda.dll in the Debug folder for the solution. Make a note of its location.

Create a new C# console application. Change the configuration to x86 then debug the empty solution once. This will create a folder in your solution called \bin\x86\Debug. Copy your BareBonesCuda.dll into this folder.

Paste the following into Program.cs:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Runtime.InteropServices;

namespace MyTestSharp
{
    class Program
    {
        static void Main(string[] args)
        {
            test();
        }

        // Native export from BareBonesCuda.dll. It overwrites real[] and imaginary[]
        // with the transform and returns 1 on success, 0 on error.
        [DllImport("BareBonesCuda.dll", CallingConvention = CallingConvention.StdCall, EntryPoint = "_Fft")]
        public static extern int _Fft(float[] real, float[] imaginary, int N, int batchSize);

        private static List<float[]> fftFloat(float[] real, float[] imaginary, int N)
        {
            int ok = _Fft(real, imaginary, N, 1);
            if (ok == 0)
            {
                throw new InvalidOperationException("_Fft reported an error.");
            }
            List<float[]> fftResult = new List<float[]>();
            fftResult.Add(real);
            fftResult.Add(imaginary);
            return fftResult;
        }

        private static void test()
        {
            int N = 32768;
            float[] real = new float[N];
            float[] imaginary = new float[N];
            StringBuilder sb = new StringBuilder();

            // Input signal: 1, 2, 3, ... with zero imaginary part
            for (int i = 0; i < N; i++)
            {
                real[i] = (float)i + 1;
                imaginary[i] = 0;
                sb.Append(real[i].ToString());
                sb.Append(" + ");
                sb.Append(imaginary[i].ToString());
                sb.AppendLine();
            }

            Console.WriteLine(sb.ToString());
            sb = new StringBuilder();

            // The transform comes back in the same two arrays
            List<float[]> result = fftFloat(real, imaginary, N);
            for (int i = 0; i < N; i++)
            {
                sb.Append(result[0][i].ToString());
                sb.Append(" + ");
                sb.Append(result[1][i].ToString());
                sb.AppendLine();
            }

            Console.WriteLine(sb.ToString());
        }
    }
}

Run it. (Again, I hope it works for you too.)

References

Some references I found useful:

http://developer.download.nvidia.com/compute/cuda/3_1/docs/GettingStartedWindows.pdf

http://www.programmerfish.com/how-to-run-cuda-on-visual-studio-2008-vs08/

http://www.isnull.com.ar/2010/12/tutorial-cuda-32-and-visual-studio-2008.html

Syntax coloring: http://www.c-sharpcorner.com/uploadfile/rafaelwo/cuda-integration-with-C-Sharp/

http://developer.download.nvidia.com/compute/cuda/1_1/CUFFT_Library_1.1.pdf

http://www.codeproject.com/Messages/4106223/Re-unmanaged-returning-arrays.aspx

Some results

Now that I am up and running, I am very happy with my Cuda performance. Using the CUFFT 1-D, forward, complex Fourier transform with double precision numbers as an example, I see a GPU/CPU performance advantage approaching 270x. For the CPU side of my test I am using a simple recursive radix-2 implementation based on the Sedgewick/Wayne Java procedure. The transforms from the GPU and CPU versions agree exactly (to machine precision)! My GPU handles vectors up to length N = 16777216… and does it in 0.5 seconds.
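For readers who want a CPU baseline of their own, a recursive radix-2 transform is short to write. The following is a minimal C++ sketch of the general technique only. It is not the Java code used for the timings above, the name fftRadix2 is made up for illustration, and it assumes the input length is a power of two.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

typedef std::complex<double> Complex;

// Recursive radix-2 decimation-in-time FFT (forward, unnormalized).
// The input length must be a power of two.
std::vector<Complex> fftRadix2(const std::vector<Complex>& x) {
    const std::size_t n = x.size();
    assert(n != 0 && (n & (n - 1)) == 0);
    if (n == 1) return x;

    // Split into even-indexed and odd-indexed samples
    std::vector<Complex> even(n / 2), odd(n / 2);
    for (std::size_t k = 0; k < n / 2; ++k) {
        even[k] = x[2 * k];
        odd[k] = x[2 * k + 1];
    }

    // Transform each half recursively
    const std::vector<Complex> e = fftRadix2(even);
    const std::vector<Complex> o = fftRadix2(odd);

    // Combine the halves with the twiddle factors exp(-2*pi*i*k/n)
    const double pi = std::acos(-1.0);
    std::vector<Complex> y(n);
    for (std::size_t k = 0; k < n / 2; ++k) {
        const Complex w = std::polar(1.0, -2.0 * pi * static_cast<double>(k) / static_cast<double>(n));
        y[k] = e[k] + w * o[k];
        y[k + n / 2] = e[k] - w * o[k];
    }
    return y;
}
```

Each level of recursion copies its input into two half-length vectors, which keeps the code close to the textbook description at the cost of extra allocation. That overhead is one reason a baseline like this is easy for a tuned library to beat.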

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

About the Author

Mark H Bishop
PEI Watershed Alliance
United States
I am a scientist and not a professional programmer. I program primarily to perform matrix computations for regression analysis, process signals, acquire data from sensors, and to control devices.
 
I have a personal webpage at www.mark-bishop.net

Article Copyright 2012 by Mark H Bishop