Can anyone help me with matrix addition in CUDA C? It should handle both square and non-square matrices, and the block dimension should be 2D. Here is the code I have written, but it does not work for matrices larger than 2x2. Can anyone help me solve this?

#include <cstdio>
#include <cmath>
#include <cuda.h>
#define blocksize 16
texture<float, 1, cudaReadModeElementType> texVecA;
texture<float, 1, cudaReadModeElementType> texVecB;

__constant__ int x;
__constant__ int y;

__global__ void MatrixAdd_d(float *C)
{
    int N = x*y;
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    int j = blockIdx.y*blockDim.y + threadIdx.y;
    int index = i*N + j;
    float flValA = tex1Dfetch(texVecA, index);
    float flValB = tex1Dfetch(texVecB, index);
    if (i < x && j < y)
    {
        C[index] = flValA + flValB;
    }
}

int main()
{
    float *a_h, *b_h, *c_h; // pointers to host memory; a.k.a. CPU
    float *a_d, *b_d, *c_d; // pointers to device memory; a.k.a. GPU
    int n, m, i, j, index;
    printf("Enter dimensions of matrix (rows columns):\n");
    scanf("%d %d", &n, &m);
    int N = m*n;
    // allocate arrays on host
    a_h = (float *)malloc(sizeof(float)*n*m);
    b_h = (float *)malloc(sizeof(float)*n*m);
    c_h = (float *)malloc(sizeof(float)*n*m);
    // allocate arrays on device
    cudaMalloc((void **)&a_d,m*n*sizeof(float));
    cudaMalloc((void **)&b_d,m*n*sizeof(float));
    cudaMalloc((void **)&c_d,m*n*sizeof(float));
    // initialize the arrays
	printf("Enter elements of first Matrix:\n");

	for(int i = 0; i < n; i++)
		for(int j = 0; j < m; j++)
			scanf("%f", &a_h[i * m + j]);
	printf("Enter elements of second matrix:\n");
	for(int i = 0; i < n; i++)
		for(int j = 0; j < m; j++)
			scanf("%f", &b_h[i * m + j]);


    // copy the data to the device and run the kernel
    cudaMemcpy(a_d, a_h, m*n*sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(b_d, b_h, m*n*sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpyToSymbol(x, &m, sizeof(int), 0);
    cudaMemcpyToSymbol(y, &n, sizeof(int), 0);
    cudaBindTexture(0, texVecA, a_d, N * sizeof(float));
    cudaBindTexture(0, texVecB, b_d, N * sizeof(float));
    dim3 dimBlock(blocksize, blocksize);
    dim3 dimGrid(ceil(float(n)/float(dimBlock.x)), ceil(float(n)/float(dimBlock.y)));
    MatrixAdd_d<<<dimGrid, dimBlock>>>(c_d);
    cudaMemcpy(c_h, c_d, m*n*sizeof(float), cudaMemcpyDeviceToHost);
    // print out the answer
    for (i = 0; i < n; i++)
        for (j = 0; j < m; j++)
        {
            index = j*m + i;
            printf("A + B = C: %d %d %f + %f = %f\n", i, j, a_h[index], b_h[index], c_h[index]);
        }
    // cleanup
    cudaUnbindTexture(texVecA);
    cudaUnbindTexture(texVecB);
    cudaFree(a_d); cudaFree(b_d); cudaFree(c_d);
    free(a_h); free(b_h); free(c_h);
    return 0;
}
Updated 22-Nov-11 18:57pm

1 solution

The NVIDIA GPU Computing SDK[^] has a few examples of matrix multiplication, which for all intents and purposes is the same as addition.

You shouldn't need texture memory for this. You are essentially accessing the whole chunk of memory in a linear manner, which is fine from normal global memory.
Texture memory is just global memory with an extra bit of hardware between it and the GPU, and that hardware adds a slight overhead. When texture memory is used correctly the benefits outweigh that overhead, but that is not the case here.
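
As a sketch of what that looks like, here is the same addition written against plain global memory. The kernel name and the use of width/height parameters (instead of your __constant__ variables) are my own choices, not from your code:

    // Hypothetical global-memory version: pass the matrices and their
    // dimensions straight to the kernel; no textures or __constant__ needed.
    __global__ void MatrixAddGlobal(const float *A, const float *B, float *C,
                                    int width, int height)
    {
        int col = blockIdx.x * blockDim.x + threadIdx.x;
        int row = blockIdx.y * blockDim.y + threadIdx.y;
        if (col < width && row < height)
        {
            int index = row * width + col;   // row-major, coalesced reads
            C[index] = A[index] + B[index];
        }
    }

You would launch it the same way as your current kernel, just with the extra pointer and dimension arguments instead of binding textures.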

Now for some issues with your code:
Your equation for calculating the index is backwards: you have x*size + y, but it should be y*size + x. That ordering gives coalesced memory reads, which will give a massive increase in performance.
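
To illustrate the indexing on the host side (a plain C sketch with made-up 2x3 dimensions, not taken from your post): in a row-major layout the element at (row, col) lives at row*width + col, where width is the number of columns, not row*N + col with N being the total element count as in your kernel.

    #include <stdio.h>

    /* Add two rows x cols matrices stored row-major in flat arrays. */
    static void mat_add(const float *a, const float *b, float *c,
                        int rows, int cols)
    {
        for (int row = 0; row < rows; row++)
            for (int col = 0; col < cols; col++)
            {
                int index = row * cols + col;  /* width = cols, NOT rows*cols */
                c[index] = a[index] + b[index];
            }
    }

    int main(void)
    {
        /* Hypothetical 2x3 example. */
        float a[] = {1, 2, 3, 4, 5, 6};
        float b[] = {10, 20, 30, 40, 50, 60};
        float c[6];
        mat_add(a, b, c, 2, 3);
        printf("%g %g %g\n", c[0], c[1], c[5]);  /* 11 22 66 */
        return 0;
    }

On the GPU, consecutive threads then touch consecutive addresses, which is exactly what makes the accesses coalesced.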

You should call __syncthreads() after your if. Refer to the CUDA programming guide[^] for more information.

I have to go somewhere now and I haven't run your code, so there may be other issues; compare your code against the SDK samples.

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)
