Can anyone help me with matrix addition in CUDA C? It should handle both square and non-square matrices, and the block dimension should be 2D. Here is the code I have written, but it does not work for matrices larger than 2x2. Can anyone help me solve this?

#include <cstdio>
#include <cmath>
#include <cuda.h>
#define blocksize 16
texture<float, 1, cudaReadModeElementType> texVecA;
texture<float, 1, cudaReadModeElementType> texVecB;

__constant__ int x;
__constant__ int y;

__global__ void MatrixAdd_d(float *C)
{
    int N = x*y;
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    int j = blockIdx.y*blockDim.y + threadIdx.y;
    int index = i*N + j;
    float flValA = tex1Dfetch(texVecA, index);
    float flValB = tex1Dfetch(texVecB, index);
    if (i < x && j < y)
    {
        C[index] = flValA + flValB;
    }
}

int main()
{
    float *a_h, *b_h, *c_h; // pointers to host memory; a.k.a. CPU
    float *a_d, *b_d, *c_d; // pointers to device memory; a.k.a. GPU
    int n, m, i, j, index;
    printf("Enter dimensions of matrix (rows columns):\n");
    scanf("%d %d", &n, &m);
    int N = m*n;
    // allocate arrays on host
    a_h = (float *)malloc(sizeof(float)*n*m);
    b_h = (float *)malloc(sizeof(float)*n*m);
    c_h = (float *)malloc(sizeof(float)*n*m);
    // allocate arrays on device
    cudaMalloc((void **)&a_d,m*n*sizeof(float));
    cudaMalloc((void **)&b_d,m*n*sizeof(float));
    cudaMalloc((void **)&c_d,m*n*sizeof(float));
    // initialize the arrays
	printf("Enter elements of first Matrix:\n");

	for(int i = 0; i < n; i++)
		for(int j = 0; j < m; j++)
			scanf("%f", &a_h[i * m + j]);
	printf("Enter elements of second matrix:\n");
	for(int i = 0; i < n; i++)
		for(int j = 0; j < m; j++)
			scanf("%f", &b_h[i * m + j]);


    // copy the data to the device and run the kernel
    cudaMemcpy(a_d, a_h, m*n*sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(b_d, b_h, m*n*sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpyToSymbol(x, &m, sizeof(int), 0);
    cudaMemcpyToSymbol(y, &n, sizeof(int), 0);
    cudaBindTexture(0, texVecA, a_d, N * sizeof(float));
    cudaBindTexture(0, texVecB, b_d, N * sizeof(float));
    dim3 dimBlock(blocksize, blocksize);
    dim3 dimGrid(ceil(float(n)/float(dimBlock.x)), ceil(float(n)/float(dimBlock.y)));
    MatrixAdd_d<<<dimGrid, dimBlock>>>(c_d);
    cudaMemcpy(c_h, c_d, m*n*sizeof(float), cudaMemcpyDeviceToHost);
    // print out the answer
    for (i = 0; i < n; i++)
        for (j = 0; j < m; j++)
        {
            index = j*m + i;
            printf("A + B = C: %d %d %f + %f = %f\n", i, j, a_h[index], b_h[index], c_h[index]);
        }
    // cleanup
    cudaUnbindTexture(texVecA);
    cudaUnbindTexture(texVecB);
    cudaFree(a_d); cudaFree(b_d); cudaFree(c_d);
    free(a_h); free(b_h); free(c_h);
    return 0;
}
Updated 22-Nov-11 18:57pm

1 solution

The NVIDIA GPU Computing SDK[^] has a few examples of matrix multiplication, which for all intents and purposes is the same as addition.

You shouldn't need texture memory for this. You are essentially accessing the whole chunk of memory in a linear manner, which is fine from normal global memory.
Texture memory is just global memory with an extra bit of hardware between it and the GPU, and that hardware adds a slight overhead. When texture memory is used correctly the benefits outweigh that overhead, but that is not the case here.
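
As a sketch of what that looks like, here is the same addition written against plain global memory. The kernel name and the use of width/height parameters (instead of your __constant__ variables) are my own choices, not from your code:

    // Hypothetical global-memory version: pass the matrices and their
    // dimensions straight to the kernel; no textures or __constant__ needed.
    __global__ void MatrixAddGlobal(const float *A, const float *B, float *C,
                                    int width, int height)
    {
        int col = blockIdx.x * blockDim.x + threadIdx.x;
        int row = blockIdx.y * blockDim.y + threadIdx.y;
        if (col < width && row < height)
        {
            int index = row * width + col;   // row-major, coalesced reads
            C[index] = A[index] + B[index];
        }
    }

You would launch it the same way as your current kernel, just with the extra pointer and dimension arguments instead of binding textures.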

Now for some issues with your code:
Your equation for calculating the index is backwards: you have x*size + y, but it should be y*size + x. That ordering gives coalesced memory reads, which will give a massive increase in performance.
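
To illustrate the indexing on the host side (a plain C sketch with made-up 2x3 dimensions, not taken from your post): in a row-major layout the element at (row, col) lives at row*width + col, where width is the number of columns, not row*N + col with N being the total element count as in your kernel.

    #include <stdio.h>

    /* Add two rows x cols matrices stored row-major in flat arrays. */
    static void mat_add(const float *a, const float *b, float *c,
                        int rows, int cols)
    {
        for (int row = 0; row < rows; row++)
            for (int col = 0; col < cols; col++)
            {
                int index = row * cols + col;  /* width = cols, NOT rows*cols */
                c[index] = a[index] + b[index];
            }
    }

    int main(void)
    {
        /* Hypothetical 2x3 example. */
        float a[] = {1, 2, 3, 4, 5, 6};
        float b[] = {10, 20, 30, 40, 50, 60};
        float c[6];
        mat_add(a, b, c, 2, 3);
        printf("%g %g %g\n", c[0], c[1], c[5]);  /* 11 22 66 */
        return 0;
    }

On the GPU, consecutive threads then touch consecutive addresses, which is exactly what makes the accesses coalesced.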

You should call __syncthreads() after your if. Refer to the CUDA programming guide[^] for more information.

I have to go somewhere now and I haven't run your code, so there may be other issues; compare your code against the SDK samples.

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)
