Posted 11 Jan 2012

Install Cuda and Use Managed Code with VS 2008 Express on Windows 7-x64

Updated 24 Dec 2012
Getting Cuda started on a VS Express budget


Note: This article is still relevant, but I have changed my approach to GPU programming. I now use CUDA with Java and JCuda from the Eclipse IDE. See my new approach in CodeProject Article 513265.

Getting NVidia Cuda up and running on a Visual Studio Express budget can be frustrating, particularly if you want to access Cuda functions from managed code. There are plenty of resources online to help you on your way, but you have to combine information from different sources while avoiding certain dead ends. It's a little hit and miss. I hope you can benefit from my journey so far.

For now, I decided to keep it simple: use VS 2008 Express, write my own wrappers, and stick to the x86 platform. Here’s how I succeeded:


  • I have not configured Cuda for VS 2010 Express. I understand that part of the process requires configuring your 2010 project to use the VS 2008 (VC90) compiler instead of the VS 2010 (VC100) compiler. Most likely a few other hacks are required to get things going. There appear to be some resources that provide direction on doing this. In particular, I saw one article that looks promising at

  • Running managed code using configurations other than x86 did not work for me. There are several convoluted posts on the web about this configuration with VS Express. Search Google for “Visual C++ 2008 Express Edition And 64-Bit Targets” to find some entertaining ways to break your VS Express install.

  • Working out the install in a virtual machine first is a good idea, but it was unclear to me how to give my guest machine direct access to the host's GPU hardware. My VirtualBox virtual graphics adapter is not Cuda enabled and, as best I can tell, Cuda no longer easily supports emulation mode. So I used the standard technique: make mistakes, break the install, reinstall, and follow the smoke.

  • I am particularly interested in Fourier transforms on the GPU. Only a few of the canned wrappers offer CUFFT functionality. Cudafy (CodePlex) seemed the most promising, but it is not (yet) an out-of-the-box setup when you have VS Express.

First time setup

  • Be sure you have a Cuda-enabled card. NVidia has an exhaustive list of compatible GPUs on its Developer Zone web site. (I have a GeForce GTX 560 GPU.) If you are not sure, have a look at the GPU Caps Viewer. I am usually hesitant to download utilities like this, but I have used this application for a few years now, it is widely recognized, and it has a solid green WOT rating. It fairly reliably identifies your GPU and reports its OpenGL and Cuda capabilities.

  • Install Visual C++ 2008 Express and Visual C# 2008 Express, then verify each install with a “Hello World” test.

Take Note: According to the release notes in the Toolkit (Start -> Programs -> NVidia), the environment variables need to be fixed after the v4.1 RC2 installation on Windows 7 x64, because the installer may have mistakenly included an extra slash in the path specification.
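The release note does not say which variables are affected. As a minimal sketch, assuming the problem shows up as two path separators in a row, a small C helper can flag and collapse the extra slash in a copy of the value. The function names are hypothetical, and using CUDA_PATH (the variable referenced later in the linker settings) as the one to check is an assumption for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* True for either Windows path separator. */
static bool is_sep(char c)
{
    return c == '\\' || c == '/';
}

/* Hypothetical helper: true if path contains two separators in a row.
   A leading pair (a UNC prefix such as \\server\share) is not flagged. */
bool has_doubled_separator(const char *path)
{
    size_t i;
    assert(path != NULL);
    if (path[0] == '\0' || path[1] == '\0')
        return false;
    for (i = 2; path[i] != '\0'; i++) {
        if (is_sep(path[i]) && is_sep(path[i - 1]))
            return true;
    }
    return false;
}

/* Hypothetical helper: removes extra separators in place,
   keeping the first two characters (and so any UNC prefix) unchanged. */
void collapse_doubled_separators(char *path)
{
    size_t r = 0, w = 0;
    assert(path != NULL);
    while (r < 2 && path[r] != '\0')
        path[w++] = path[r++];
    for (; path[r] != '\0'; r++) {
        if (is_sep(path[r]) && is_sep(path[w - 1]))
            continue; /* drop the extra separator */
        path[w++] = path[r];
    }
    path[w] = '\0';
}

/* Hypothetical convenience check, e.g. env_var_has_doubled_separator("CUDA_PATH"). */
bool env_var_has_doubled_separator(const char *name)
{
    const char *value = getenv(name);
    return value != NULL && has_doubled_separator(value);
}
```

The check only reads the variable; the actual fix is still made by hand in the Environment Variables dialog, since getenv returns a read-only copy for the current process.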

  • Double check the environment variables (Computer -> Properties -> Advanced -> Advanced tab):




  • Check the install so far:

  • From a command window, run nvcc -V (you should get a compilation release message).

  • Find bandwidthTest.exe (C:\ProgramData\NVIDIA Corporation\NVIDIA GPU Computing SDK 4.1\C\bin\win64\Release) and run it.

  • Also try oceanFFT.exe

  • Copy all *.rules files in “C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v4.1\extras\visual_studio_integration\rules” to “C:\Program Files (x86)\Microsoft Visual Studio 9.0\VC\VCProjectDefaults”

  • Copy “C:\ProgramData\NVIDIA Corporation\NVIDIA GPU Computing SDK 4.1\C\doc\syntax_highlighting\visual_studio_8\usertype.dat” to “C:\Program Files (x86)\Microsoft Visual Studio 9.0\Common7\IDE” folder

  • Open VC++ and at Tools -> Options:

  • Text Editor -> File Extensions add two extensions: .cu and .cuh

  • Projects and Solutions -> VC++ Directories

    • add $(CUDA_PATH)\bin to the directories for Executable Files

    • add $(CUDA_PATH)\include to the directories for Include Files

    • add $(CUDA_PATH)\lib\$(PlatformName) to the directories for Library Files

  • Close VC++ and reopen, then load your “hello world” program and make sure it still works.

Creating projects

Example: A simple bare-bones wrapper for FFT:

  • Create a new Win32 project named BareBonesCuda. On the next page, select DLL as the application type and tick Empty project.

  • Add a source file (type cpp), but give it a .cu extension, e.g.:

  • Right-click the project and choose Custom Build Rules. Tick the box for CUDA Runtime API. There will be two; I use the one without a version number after the name.

  • Right-click the project and choose Properties.

  • Under Linker -> General -> Additional Library Directories add: $(CUDA_PATH)/lib/$(PlatformName);

  • Under Linker -> Input -> Additional Dependencies add: cudart.lib cufft.lib

Paste the following into your .cu file:

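#include <stdlib.h>
#include <cuda_runtime.h>
#include "cufft.h"

// Forward, complex-to-complex FFT of batchSize signals of length N, computed
// in place on the GPU. real[] and imaginary[] each hold N*batchSize values and
// are overwritten with the transform. Returns 1 on success, 0 (false) on error.
extern "C" int __declspec(dllexport) __stdcall _Fft(float real[], float imaginary[], int N, int batchSize)
{
    cufftComplex *a_h, *a_d = NULL;
    cufftHandle plan;
    int i;
    size_t nBytes = sizeof(cufftComplex) * (size_t)N * batchSize;

    // Interleave the separate real/imaginary arrays into host memory
    a_h = (cufftComplex *)malloc(nBytes);
    if (a_h == NULL)
        return 0;
    for (i = 0; i < N*batchSize; i++) {
        a_h[i].x = real[i];
        a_h[i].y = imaginary[i];
    }

    // Allocate device memory and copy the input to the GPU
    if (cudaMalloc((void **)&a_d, nBytes) != cudaSuccess) {
        free(a_h);
        return 0; // False = 0: error condition
    }
    cudaMemcpy(a_d, a_h, nBytes, cudaMemcpyHostToDevice);

    // Plan batchSize 1-D complex-to-complex transforms of length N
    if (cufftPlan1d(&plan, N, CUFFT_C2C, batchSize) != CUFFT_SUCCESS) {
        free(a_h); cudaFree(a_d);
        return 0; // False = 0: error condition
    }

    // Forward transform, in place on the device
    if (cufftExecC2C(plan, a_d, a_d, CUFFT_FORWARD) != CUFFT_SUCCESS) {
        cufftDestroy(plan); free(a_h); cudaFree(a_d);
        return 0; // False = 0: error condition
    }

    // Copy the result back and split it into the caller's arrays
    cudaMemcpy(a_h, a_d, nBytes, cudaMemcpyDeviceToHost);
    for (i = 0; i < N*batchSize; i++) {
        real[i] = a_h[i].x;
        imaginary[i] = a_h[i].y;
    }

    // Release the plan and both buffers
    cufftDestroy(plan);
    free(a_h); cudaFree(a_d);
    return 1;
}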


Build it. (I hope it works for you too.)

Use the dll in C#

In the example above, a file named BareBonesCuda.dll was created in the solution's Debug folder. Make a note of its location.

Create a new C# console application. Change the configuration to x86, then debug the empty solution once. This creates a \bin\x86\Debug folder in the project. Copy your BareBonesCuda.dll into this folder.

Paste the following into Program.cs:

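using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Runtime.InteropServices;

namespace MyTestSharp
{
    class Program
    {
        static void Main(string[] args)
        {
            test();
        }

        // Import the function exported by the CUDA wrapper DLL (x86, __stdcall)
        [DllImport("BareBonesCuda.dll", CallingConvention = CallingConvention.StdCall, EntryPoint = "_Fft")]
        public static extern int _Fft([In, Out] float[] real, [In, Out] float[] imaginary, int N, int batchSize);

        // Runs one forward FFT of length N in place on the GPU.
        // Returns { real, imaginary } on success, or null if _Fft returned 0.
        private static List<float[]> fftFloat(float[] real, float[] imaginary, int N)
        {
            int oK = _Fft(real, imaginary, N, 1);
            if (oK == 0)
                return null;
            List<float[]> fftResult = new List<float[]>();
            fftResult.Add(real);
            fftResult.Add(imaginary);
            return fftResult;
        }

        private static void test()
        {
            int N = 32768;
            float[] real = new float[N];
            float[] imaginary = new float[N];
            StringBuilder sb = new StringBuilder();
            string br = Environment.NewLine;

            // Test signal: real part 1, 2, ..., N; imaginary part zero
            for (int i = 0; i < N; i++)
            {
                real[i] = (float)i + 1;
                imaginary[i] = 0;
            }

            List<float[]> result = fftFloat(real, imaginary, N);
            if (result == null)
            {
                Console.WriteLine("_Fft returned an error.");
                return;
            }

            // Print the first few coefficients as "re + im i"
            for (int i = 0; i < 8; i++)
                sb.Append(result[0][i] + " + " + result[1][i] + "i" + br);
            Console.Write(sb.ToString());
        }
    }
}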

Run it. (Again, I hope it works for you too.)



Some results

Now that I am up and running, I am very happy with my Cuda performance. Using the CUFFT 1-D, forward, complex Fourier transform with double precision numbers as an example, I see a GPU/CPU performance advantage approaching 270x. For the CPU side of my test I am using a simple recursive radix-2 implementation based on the Sedgewick/Wayne Java procedure. The transforms from the GPU and CPU versions agree to machine precision. My GPU handles vectors up to length N = 16,777,216 (2^24), and does it in 0.5 seconds.



This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


About the Author

Mark H Bishop
Founder PEI Watershed Alliance, Inc.
United States
I am an analytical chemist and an educator. I program primarily to perform matrix computations for regression analysis, process signals, acquire data from sensors, and to control devices.

I participate in many open source development communities and Linux user forums. I occasionally perform IT contract work, primarily focused on network design/deployment and penetration testing for small organizations.

I am a member of several community-interest groups such as the Prince Edward Island Watershed Alliance, the Lot 11 and Area Watershed Management Group, and the Petersham Historic Commission.

Article Copyright 2012 by Mark H Bishop