
A Neural Network on GPU

By billconan, kavinguy, 13 Mar 2008
 

Introduction

An Artificial Neural Network is an information processing method inspired by the way biological nervous systems, such as the brain, process information. It is composed of a large number of highly interconnected processing elements (neurons) working in unison to solve specific problems. Neural Networks have been widely used to classify "analog" signals, including handwriting, voice and image recognition. Neural Networks can also be used in computer games, where they let a game learn adaptively from player behavior. This technique has been used in racing games, so that computer-controlled opponent cars can learn how to drive from human players.

Since a Neural Network requires a considerable number of vector and matrix operations to get results, it is well suited to a parallel programming model running on Graphics Processing Units (GPUs). Our goal is to use the power of GPUs to boost the performance of a Neural Network that solves handwriting recognition problems.

This project started as our graphics architecture course project. We run on the GPU the same Neural Network described by Mike O'Neill in his brilliant article "Neural Network for Recognition of Handwritten Digits".

About the Neural Network

A Neural Network consists of two basic kinds of elements: neurons and connections. Neurons are linked to each other through connections to form a network. This is a simplified theoretical model of the human brain.

A Neural Network often has multiple layers; the neurons of one layer connect to neurons of the next layer in some way. Every connection is assigned a weight value. At the beginning, input data are fed into the neurons of the first layer. The value of a second-layer neuron is the weighted sum of all first-layer neurons connected to it, and so on for the following layers. Finally, we reach the last layer, which is the output. All the computations involved in evaluating a Neural Network are essentially dot products.
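The weighted sum for a single neuron can be sketched in plain C++ as follows. This is only an illustration: the function name `neuronValue` is hypothetical, and placing the bias weight first in the weight array is an assumption made to mirror the layout used by the CUDA kernel later in this article.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Weighted sum for one neuron: bias + sum(inputs[i] * weights[i + 1]).
// Assumption: weights[0] is the bias weight, followed by one weight per input.
float neuronValue(const std::vector<float>& inputs,
                  const std::vector<float>& weights)
{
    assert(weights.size() == inputs.size() + 1);
    float sum = weights[0];                      // bias term
    for (std::size_t i = 0; i < inputs.size(); ++i)
        sum += inputs[i] * weights[i + 1];       // dot product with the inputs
    return sum;
}
```

For two inputs 1 and 2 with weights {0.5, 2, 3}, the result is 0.5 + 1*2 + 2*3 = 8.5.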

The secret of a Neural Network lies in its weight values: with the right values it works well. However, at the beginning, we don't know those values. Therefore, we need to train the network with sample inputs and compare the outcomes with the desired answers. A training algorithm takes the errors as input and modifies the network weights. With enough patience, the Neural Network can be trained to achieve high accuracy.
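The article does not spell out the training algorithm, so the following is only a stand-in to show the idea of "take the error, adjust the weights". It applies the simple delta rule to one linear neuron; the name `trainStep` and the `learningRate` parameter are hypothetical, and the real network is trained with Mike's code, not with this function.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One delta-rule update for a single linear neuron (no activation function).
// Assumption: weights[0] is the bias weight; learningRate is a hypothetical
// tuning parameter. Returns the error (output - target) before the update.
float trainStep(std::vector<float>& weights, const std::vector<float>& inputs,
                float target, float learningRate)
{
    assert(weights.size() == inputs.size() + 1);
    float output = weights[0];
    for (std::size_t i = 0; i < inputs.size(); ++i)
        output += inputs[i] * weights[i + 1];

    const float error = output - target;
    weights[0] -= learningRate * error;                   // bias input is always 1
    for (std::size_t i = 0; i < inputs.size(); ++i)
        weights[i + 1] -= learningRate * error * inputs[i];
    return error;
}
```

Calling it repeatedly on the same sample shrinks the error step by step, which is the behavior the paragraph above describes.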

IllustrationNeuralNet.gif

The neural network we implemented is a 5-layer network of the kind known as a convolutional neural network. This kind of network has proven suitable for recognizing handwritten digits. For more theoretical details, please check out Mike's article and the references he lists.

The first three layers of our neural network consist of several feature maps, each smaller than those of the previous layer. Our input is a 29*29 image of a digit, so the first layer has 29*29 = 841 neurons. The second layer is a convolutional layer with 6 feature maps. Each feature map is a 13*13 image sampled from the first layer, and each pixel/neuron in a feature map is computed by a 5*5 convolutional kernel applied to the input layer. So, there are 13*13*6 = 1014 nodes/neurons in this layer, (5*5+1)*6 = 156 weights (the +1 is the bias node), and 1014*(5*5+1) = 26364 connections to the first layer.

Layer 3 is also a convolutional layer, but with 50 smaller feature maps. Each feature map is 5*5 in size, and each pixel in these feature maps is computed by a 5*5 convolutional kernel applied to the corresponding areas of all 6 feature maps of the previous layer. There are thus 5*5*50 = 1250 neurons in this layer and (5*5+1)*6*50 = 7800 weights. Since each neuron reads a 5*5 window from each of the 6 feature maps plus one bias, there are 1250*(5*5*6+1) = 188750 connections.

The fourth layer is a fully-connected layer with 100 neurons. Since it is fully-connected, each of the 100 neurons in the layer is connected to all 1250 neurons in the previous layer. There are therefore 100 neurons in it, 100*(1250+1) = 125100 weights and 100*1251 = 125100 connections.

Layer 5 is the final output layer. This layer is also a fully-connected layer with 10 units. Each of the 10 neurons in this layer is connected to all 100 neurons of the previous layer. There are 10 neurons in Layer 5, 10*(100+1) = 1010 weights and 10*101 = 1010 connections.

As you can see, although structurally simple, this Neural Network is a huge data structure.
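The neuron and weight counts quoted above can be re-derived with a few lines of C++. This is a sketch for checking the arithmetic only; the names `LayerSize`, `convLayer`, `fullLayer` and `totalWeights` are hypothetical and do not appear in the project code.

```cpp
#include <cassert>

// Neuron and weight counts for one layer.
struct LayerSize { int neurons; int weights; };

// Convolutional layer with `maps` feature maps of side*side neurons.
// Each map holds (5*5 + 1) weights per input map: a 5*5 kernel plus one bias slot.
constexpr LayerSize convLayer(int maps, int side, int inputMaps)
{
    return LayerSize{ maps * side * side, (5 * 5 + 1) * inputMaps * maps };
}

// Fully-connected layer: one weight per input neuron plus one bias per neuron.
constexpr LayerSize fullLayer(int neurons, int inputs)
{
    return LayerSize{ neurons, neurons * (inputs + 1) };
}

constexpr int inputNeurons = 29 * 29;                           // layer 1: 841
constexpr LayerSize layer2 = convLayer(6, 13, 1);               // 1014 neurons, 156 weights
constexpr LayerSize layer3 = convLayer(50, 5, 6);               // 1250 neurons, 7800 weights
constexpr LayerSize layer4 = fullLayer(100, layer3.neurons);    // 125100 weights
constexpr LayerSize layer5 = fullLayer(10, layer4.neurons);     // 1010 weights

// Total number of float weights in the whole network.
constexpr int totalWeights()
{
    return layer2.weights + layer3.weights + layer4.weights + layer5.weights;
}

static_assert(totalWeights() == 134066, "weight counts should match the text");
```

In total the network holds 134066 weights, almost all of them in the fully-connected fourth layer.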

Previous GPU Implementation

The Fast Artificial Neural Network Library (FANN) has a very simple GPU implementation of a Neural Network using GLSL. Each neuron is represented by a single color channel of a texture pixel. This network is very specific: neuron values range from 0 to 1 and have a precision of only 8 bits. The implementation takes advantage of the hardware-accelerated dot product function to calculate neurons. Both neurons and weights are stored in texture maps.

This implementation is straightforward and easy, but limited. First, our neural network requires 32-bit float precision for each neuron. Since our network has five layers, precision lost at the first level can accumulate and alter the final results. And because a handwriting recognition system must be sensitive enough to detect slight differences between inputs, using only 8 bits to represent a neuron is unacceptable. Secondly, ordinary Neural Networks map neuron values to the range from 0 to 1. However, the Neural Network in our program, which is specifically designed for handwriting recognition, has a special activation function that maps each neuron value to the range from -1 to 1. Therefore, if a neuron were represented by a single color value as in the FANN library, our neurons would lose even more precision. Finally, the FANN method uses a dot product to compute neurons, which is suitable for fully connected Neural Networks. In our implementation, the Neural Network is partially connected, and the computations performed on it involve dot products of large vectors.
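To get a feel for the precision argument, here is a small sketch of what storing a neuron value from [-1, 1] in one 8-bit channel would do. The linear mapping and the names `encode8` and `decode8` are assumptions for illustration; they are not taken from FANN or from our code.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Store a neuron value from [-1, 1] in one 8-bit color channel (0..255).
// Assumption: a plain linear mapping with rounding to the nearest level.
std::uint8_t encode8(float v)
{
    assert(v >= -1.0f && v <= 1.0f);
    return static_cast<std::uint8_t>(std::lround((v + 1.0f) * 0.5f * 255.0f));
}

// Recover the approximate neuron value from the channel.
float decode8(std::uint8_t c)
{
    return static_cast<float>(c) / 255.0f * 2.0f - 1.0f;
}
```

With only 256 levels spread over [-1, 1], neighboring levels are 2/255 apart, so a round trip can be off by up to 1/255, and nearby inputs such as 0.100 and 0.101 land on the same level. A 32-bit float avoids this collapse.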

Our Implementation

Because of the GLSL limitations mentioned above, we finally chose CUDA. A Neural Network is suitable for the GPU because its training and its execution are two separate processes. Once the network is properly trained, using it requires no write access to the weights, so there is no synchronization issue to address. Moreover, neurons on the same network level are completely independent of each other, so neuron value computations can be highly parallelized.

In our code, the weights for the first layer are stored as an array, and these inputs are copied to the device. For each network level there is a CUDA function that computes the neuron values of that level, since parallelism can only be achieved within one level and the connections differ between levels. The connections of the Neural Network are defined implicitly in the CUDA functions by the equations that compute the next level's neurons; no explicit connection data structure exists in our code. This is one main difference between our code and Mike's CPU version.

cuda.PNG

For example, each neuron value of the second level is a weighted sum of 25 neurons of the first level plus one bias. The second level is composed of 6 feature maps, each 13*13 in size. We assign a block ID to each feature map and a thread ID to each neuron on a feature map, so every feature map is handled by one block and each pixel on it by one thread.

This is the CUDA function that computes the second network layer:

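__global__ void executeFirstLayer
    (float *Layer1_Neurons_GPU,float *Layer1_Weights_GPU,float *Layer2_Neurons_GPU)
{
    // One block per feature map, one thread per pixel of the 13*13 map.
    int blockID=blockIdx.x;
    int pixelX=threadIdx.x;
    int pixelY=threadIdx.y;

    // Offsets of a 5*5 window inside the 29-pixel-wide input image.
    const int kernelTemplate[25] = {
        0,  1,  2,  3,  4,
        29, 30, 31, 32, 33,
        58, 59, 60, 61, 62,
        87, 88, 89, 90, 91,
        116,117,118,119,120 };

    // Each feature map owns 26 weights: one bias followed by 25 kernel weights.
    int weightBegin=blockID*26;
    // The window moves over the input in steps of 2 pixels.
    int windowX=pixelX*2;
    int windowY=pixelY*2;

    float result=0;

    // Start with the bias weight.
    result+=Layer1_Weights_GPU[weightBegin];

    ++weightBegin;

    // Weighted sum over the 5*5 input window.
    for(int i=0;i<25;++i)
    {
        result+=Layer1_Neurons_GPU
            [windowY*29+windowX+kernelTemplate[i]]*Layer1_Weights_GPU[weightBegin+i];
    }

    // Scaled tanh activation; float literals keep the math in single precision.
    result=(1.7159f*tanhf(0.66666667f*result));

    Layer2_Neurons_GPU[13*13*blockID+pixelY*13+pixelX]=result;
}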

All other levels are computed in the same way; the only difference is the equation used to calculate the neurons.
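When working on such a kernel it helps to have a CPU reference to compare the GPU output against. The following C++ sketch is our reading of the executeFirstLayer kernel above written as ordinary loops; the function name `firstLayerCpu` is hypothetical and this function is not part of the project source.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// CPU reference for executeFirstLayer: 29*29 inputs -> 6 feature maps of 13*13.
// The loops over map, y and x play the roles of blockIdx.x, threadIdx.y and
// threadIdx.x in the kernel.
std::vector<float> firstLayerCpu(const std::vector<float>& input,    // 841 values
                                 const std::vector<float>& weights)  // 156 values
{
    assert(input.size() == 29u * 29u && weights.size() == 6u * 26u);
    std::vector<float> output(6 * 13 * 13);
    for (int map = 0; map < 6; ++map)
        for (int y = 0; y < 13; ++y)
            for (int x = 0; x < 13; ++x) {
                const float* w = &weights[map * 26];
                float sum = w[0];                              // bias weight
                for (int ky = 0; ky < 5; ++ky)                 // 5*5 window,
                    for (int kx = 0; kx < 5; ++kx)             // moved in steps of 2
                        sum += input[(2 * y + ky) * 29 + (2 * x + kx)]
                             * w[1 + ky * 5 + kx];
                // Same scaled tanh activation as the kernel.
                output[13 * 13 * map + y * 13 + x] =
                    1.7159f * std::tanh(0.66666667f * sum);
            }
    return output;
}
```

The expression `(2 * y + ky) * 29 + (2 * x + kx)` is the same index the kernel builds from `windowY*29+windowX+kernelTemplate[i]`; the largest index it can reach is 28*29+28 = 840, the last input pixel.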

program.PNG

The main program first transfers all the input data to the GPU, then calls each CUDA function in order, and finally reads back the answer.

recod.jpg

The user interface is a separate program written in C#. Users draw a digit with the mouse on the input pad; the program then generates a 29*29 image and calls the Neural Network kernel program. The kernel program, as described above, reads the input image and feeds it into our Neural Network. Results are also returned through files and then read back by the user interface.

Here is a screenshot. After drawing a digit, we get all 10 neuron values of the last network layer. The index of the maximum neuron value is the most likely digit. We shade the candidates in different depths of red according to their likelihood.
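Picking the answer from the 10 output neurons is a plain arg-max. A minimal C++ sketch, with the hypothetical name `recognizedDigit`, could look like this:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Index of the largest output neuron, i.e. the recognized digit (0-9).
// Ties resolve to the lowest index.
int recognizedDigit(const std::vector<float>& outputs)
{
    assert(!outputs.empty());
    std::size_t best = 0;
    for (std::size_t i = 1; i < outputs.size(); ++i)
        if (outputs[i] > outputs[best]) best = i;
    return static_cast<int>(best);
}
```

For example, if the third output neuron (index 2) holds the largest value, the recognized digit is 2.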

On the right, the user interface will print out feature maps of the first three layers.

Note that C# under Windows XP has a resolution issue. We tested our program at 120 dpi. A 96 dpi resolution setting could shift the input image, which badly affects accuracy.

No training part is included in our GPU implementation. We used Mike's code to train all the weights and cached them in files.

Result

Accuracy

Our Neural Network can achieve 95% accuracy. The database we used to train the network is MNIST, which contains 60000 handwriting examples from different people. Dr. LeCun reports that this network converges after around 25 passes of training, and our test confirms this number. We ended up with only around 1400 misrecognized samples out of 60000 inputs.

Also note that there is a bug in Mike's code. This is the corrected code for initializing the second layer:

for ( fm=0; fm<50; ++fm)
{
  for ( ii=0; ii<5; ++ii )
  {
    for ( jj=0; jj<5; ++jj )
    {
      // iNumWeight = fm * 26;  // 26 is the number of weights per feature map
      iNumWeight = fm * 156;  // 156 is the number of weights per feature map
      NNNeuron& n = *( pLayer->m_Neurons[ jj + ii*5 + fm*25 ] );

      n.AddConnection( ULONG_MAX, iNumWeight++ );  // bias weight

      for ( kk=0; kk<25; ++kk )
      {
        // note: max val of index == 1013, corresponding to 1014 neurons in prev layer
        n.AddConnection(       2*jj + 26*ii + kernelTemplate2[kk], iNumWeight++ );
        n.AddConnection( 169 + 2*jj + 26*ii + kernelTemplate2[kk], iNumWeight++ );
        n.AddConnection( 338 + 2*jj + 26*ii + kernelTemplate2[kk], iNumWeight++ );
        n.AddConnection( 507 + 2*jj + 26*ii + kernelTemplate2[kk], iNumWeight++ );
        n.AddConnection( 676 + 2*jj + 26*ii + kernelTemplate2[kk], iNumWeight++ );
        n.AddConnection( 845 + 2*jj + 26*ii + kernelTemplate2[kk], iNumWeight++ );
      }
    }
  }
} 

Please refer to this for the details about this bug.

Our GPU implementation is based on the corrected version; however, the difference in accuracy is small.
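To see what the corrected initialization produces, the sketch below lists the connections of a single neuron in this layer as (source neuron, weight index) pairs. It is a hypothetical helper, `layer3Connections`, not code from either project. The contents of `kernelTemplate2` are not shown in this article, so the sketch assumes it holds the offsets of a 5*5 window on a 13-wide map (`ky*13 + kx`), and it uses -1 instead of `ULONG_MAX` to mark the bias.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// (source neuron index in the previous layer, weight index) pairs for the
// neuron at row ii, column jj of feature map fm. The first entry is the bias,
// marked with source index -1 (the original code uses ULONG_MAX).
std::vector<std::pair<int, int>> layer3Connections(int fm, int ii, int jj)
{
    assert(fm >= 0 && fm < 50 && ii >= 0 && ii < 5 && jj >= 0 && jj < 5);
    std::vector<std::pair<int, int>> conns;
    int weight = fm * 156;                      // corrected offset per feature map
    conns.emplace_back(-1, weight++);           // bias weight
    for (int ky = 0; ky < 5; ++ky)
        for (int kx = 0; kx < 5; ++kx) {
            const int offset = ky * 13 + kx;    // assumed kernelTemplate2[ky*5 + kx]
            for (int src = 0; src < 6; ++src)   // one connection per 13*13 source map
                conns.emplace_back(169 * src + 2 * jj + 26 * ii + offset, weight++);
        }
    return conns;
}
```

Each neuron gets 151 connections and uses the weight indices fm*156 through fm*156+150. With the original offset of fm*26, the weight ranges of neighboring feature maps would overlap, which is the bug being fixed.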

Performance

performance.PNG

The major reason for using the GPU to compute a Neural Network is to achieve higher performance, and the outcome is promising compared to the CPU implementation. The table above compares the execution time of the GPU version, the EmuRelease version and the CPU version running on a single input sample. The GPU version is 270 times faster than the CPU version and 516.6 times faster than the EmuRelease version. To be more accurate, we also considered the IO time consumption of the GPU version. Even when the IO time is included, our method is still 10 times faster. And in practical use, the weight values need to be loaded onto the device only once.

History

  • 14th March, 2008: Initial post

License

This article, along with any associated source code and files, is licensed under The Creative Commons Attribution-ShareAlike 2.5 License

About the Authors

billconan
Other
United States
Member
Ph.D. Student
University of California, Davis

kavinguy
Other
United States
Member
Ph.D. student,
University of California, Davis


Comments and Discussions

 
Question: Thanks for sharing this information! (yoels, 5 Apr '13 - 9:02)
I am definitely going to port my ANN code from C++ to cuda, especially after reading the paper written by the current winners of the MNIST contest...
(Search for "Multi-column Deep Neural Networks for Image Classification")
 
My main concern is naturally the training itself, but this is still a nice start reference :)
 
Thanks for sharing your experience!
Question: Help me! (22101991, 14 Oct '12 - 14:06)
error: macro "CUT_DEVICE_INIT" requires 2 arguments, but only 1 given

Question: great, thank you. (Baolei Fan, 15 Jul '12 - 4:12)
it is very helpful to me.
General: My vote of 5 (Kanasz Robert, 8 Jun '11 - 3:52)
Very good article. ;)
General: i have not worked Demos and source code of NN (edem46, 10 Apr '11 - 22:08)
Hello,
 
i got vs2008 SP1 and CUDA v3.2, but when i want to run Demos, NN.exe cause a error so system closed NN.exe,
 
afterthat, i want to build source code, I created a new project and added source code to it. but i got some error.
 
Can you say that where i make mistake or can you check source code of NN or reload source as VS.net project files.
 
Thanks.
General: My vote of 3 (4288, 20 Mar '11 - 5:08)
sorry but the code is not optimized in my opinion
General: Re: My vote of 3 (Divydeep Agarwal, 14 Apr '11 - 10:03)
can u help me to get the code to work
General: Re: My vote of 3 (4288, 15 Apr '11 - 3:30)
isn't that working to you?
---

General: Codes do not work (cardano712 Mar '11 - 15:14)
None of these uploaded programs work. It's nothing except for wasting time.
General: My vote of 5 (Kanasz Robert, 10 Jan '11 - 11:18)
Very interesting article. In the future I want to try experimenting with the GPU. Well done.
Question: how exactly did you obtain the weights files? (fizdumn, 14 Jul '10 - 6:54)
how exactly did you obtain the weights files using mike's code?
Question: not fully optimized code?? (ilios86, 10 Jul '10 - 22:18)
Hi.
 
I analyzed the execution of the program, and I found that computing the 4th layer (function executeThirdLayer) is the bottleneck of the program.
 
I am curious why the above implementation does not use multiple threads in each block while computing the 4th and 5th layers.
 
Because of that, the benefit of parallelism is not well exposed.
 
I'm not familiar with Neural Network. So i'm not sure whether the reason of such unparallelized code is limit of NN problem or not.
 
Could you give me any comment?
 
Thank you.
General: Can't Get it to work (KarlDuke, 2 Dec '09 - 11:17)
When the paint2.exe calls NN.exe the application crashes...
The same happens when NN.exe is called separately.
General: Stuck on 2 (Zodical, 27 Nov '09 - 14:42)
I can build everything, but every time I run it, it gives the answer 2, no matter what number I draw. It even does that if I try to delete all the data files. I'll try rebooting in case the weights have been loaded into the graphics card and are not being refreshed.
General: The CUDA code is not optimized (Julekmen, 20 Oct '09 - 22:53)
I like the program, but the CUDA code is not optimized - optimizations strategies were not applied. I optimized executeThirdLayer kernel to see the difference between optimized and unoptimized code. What I changed:
- Weights in each block (neuron) are padded to 1280 bytes - this makes coalesced reads
- Partial sums are copied to shared memory and then summed by one thread in each block(neuron)
 
In cudaprof, the standard method took 3 ms to complete. My optimized method took 0.3 ms to complete!
It is easy to write a program for CUDA, but it's difficult to make it work really fast :)
 
My project and code is here:
http://rapidshare.com/files/295861492/NeuralNetworks_CUDA.zip.html
 
You can switch between standard and optimized versions by commenting/uncommenting "#define ROMANOWSKI_THIRD_LAYER"
News: Re: The CUDA code is not optimized (sergneu, 13 Nov '09 - 1:49)
Error :sigh:
To download this file, the uploader either needs to transfer this file into his/her Collector's Account, or upload the file again. The file can later be moved to a Collector's Account. The uploader just needs to click the delete link of the file to get further information.
General: help me (hamzadj23, 28 Sep '09 - 7:08)
Please, would you help me prepare my project on handwritten word or letter recognition with a TDNN (time delay neural network)? The language is C# or .
Thank you very much
contact me in:
Doc_dh23000@hotmail.com

 
:((
General: Permission to redistribute source code (cael47, 10 Sep '09 - 13:28)
Hello,
 
Just to let you know, we have published a paper on our GPGPU performance simulator, called GPGPU-Sim. You can find it available for download at www.gpgpu-sim.org .
It has become quite popular in the last little while, and a lot of people have been asking for the applications that we used as benchmarks in our paper, "Analyzing CUDA Workloads using a Detailed GPU Simulator" (http://www.ece.ubc.ca/~aamodt/papers/gpgpusim.ispass09.pdf), where we used your Neural Network application as one of our benchmarks.
We are currently asking the CUDA application authors whether we could package their source code together to make available for download along with our GPU simulator. Would it be OK to do so?
 
Just to let you know, in our paper, we slightly modified your code to increase the amount of thread-level parallelism by allowing concurrent execution on multiple digits.
 
George
General: Another interesting Neural Network application (SharkTime, 6 Aug '09 - 8:01)
Great, thanks to you.
 
You may be interested about another neural network application, see Sharky Neural Network.
 
This is free software for playing with neural networks classification (for Windows XP/Vista).
You can see network results during learning like a movie - live view.
 
You may also be interested in another CodeProject article: Neural Network Classifier
 
Regards,
SharkTime.com
General: I can not obtain a good recognition of this method (bogycat, 9 May '09 - 17:12)
Hello!
 
I have realize your method depend on MNIST with the weight of Mr. Mike O'Neil's
 
but I can not get the accuracy that Mr. Mike O'Neil get,
 
About only one in ten can be recognized correctly
 
Could you email me and send your recognition result on the test set of MNIST?
 
Thanks! :)
 
Email: hanxiaoxue724@hotmail.com
General: i think it possible to train the convolutional neural network in CUDA (xam_jjf@yahoo.com.cn, 8 Dec '08 - 14:08)
i think it is possible to train the convolutional neural network in CUDA, it only needs more shared memory.
we can compute every delta value separately and finally sum them up in the kernel function, one kernel function per layer, or some other way. does anyone else agree with me? :)
 
as for the performance, i think CUDA is the better architecture for neural network computing, more like a brain than a pc.
General: where can i get Layer_1.neu , Layer_2.neu ... [modified] (xam_jjf@yahoo.com.cn, 30 Nov '08 - 14:01)
where can i get Layer_1.neu , Layer_2.neu ...
i can't get it work!? i use 9600 GT and sdk has been setuped.
 
who will be kind to tell me how can i get it work?
 
and my english is poor, i hope my words will be understood.
 
modified on Wednesday, December 3, 2008 8:40 PM

General: Re: where can i get Layer_1.neu , Layer_2.neu ... [modified] (xam_jjf@yahoo.com.cn, 3 Dec '08 - 18:42)
i get it.
 
Layer_1.neu , Layer_2.neu ... be created by nn.exe.
but nn.exe can't be execute in my system.
 
i modify the main function in nn.cu and link the nn.cu ,then it works. but i don't know why?
 
main(int argc, char** argv)
{
NeuralNetwork();
//CUT_EXIT(argc, argv);
}
General: Re: where can i get Layer_1.neu , Layer_2.neu ... [modified] (xam_jjf@yahoo.com.cn, 4 Dec '08 - 15:53)
i hope it can be used to training the network.
General: Re: where can i get Layer_1.neu , Layer_2.neu ... [modified] (Nguyennamhsm, 5 Sep '12 - 7:47)
u miss cudart.dll file .download it from....google


Last Updated 14 Mar 2008
Article Copyright 2008 by billconan, kavinguy
Everything else Copyright © CodeProject, 1999-2013