GPGPU on Accelerating Wave PDE

Alesiani Marco

4.93/5 (50 votes)

Apr 20, 2012

CPOL

41 min read

52340

1429

A Wave PDE simulation using GPGPU capabilities

Download GPGPUWave.zip - 632.6 KB

Abstract
Wave PDE
Solving the Wave PDE with finite difference method
Creating the algorithm
Implementing the Wave simulation
GPGPU Architectures
Massive parallelization with CUDA
Performance comparison
CUDA Kernels best practices
References

Abstract

This article aims at exploiting GPGPU (GP²U) computation capabilities to improve physical and scientific simulations. In order for the reader to understand all the passages, we will gradually proceed in the explanation of a simple physical simulation based on the well-known Wave equation. A classic CPU implementation will be developed and eventually another version using GPU workloads is going to be presented.

Wave PDE

PDE stands for “Partial differential equation” and indicates an equation which has one or more partial derivatives as independent variables in its terms. The order of the PDE is usually defined as the highest partial derivative in it. The following is a second-order PDE:

Usually a PDE of order n having m variables x_i for i=1,2…m is expressed as

A compact form to express is u_x and the same applies to (u_xx ) and (u_xy ).

The wave equation is a second order partial differential equation used to describe water waves, light waves, sound waves, etc. It is a fundamental equation in fields such as electro-magnetics, fluid dynamics and acoustics. Its applications involve modeling the vibration of a string or the air flow dynamics as a result from an aircraft movement.

We can derive the one-dimensional wave equation (i.e. the vibration of a string) by considering a flexible elastic string that is tightly bound between two fixed end-points lying of the x axis

A guitar string has a restoring force that is proportional to how much it’s stretched. Suppose that, neglecting gravity, we apply a y-direction displacement at the string (i.e. we’re slightly pulling it). Let’s consider only a short segment of it between x and x+ :

Let’s write down for the small red segment in the diagram above. We assume that the string has a linear density (linear density is a measure of mass per unit of length and is often used with one-dimensional objects). Recalling that if a one-dimensional object is made of a homogeneous substance of length L and total mass m the linear density is

we have m= .

The forces (as already stated we neglect gravity as well as air resistance and other tiny forces) are the tensions T at the ends. Since the string is actually waving, we can’t assume that the two T vectors cancel themselves: this is the force we are looking for. Now let’s make some assumptions to simplify things: first we consider the net force we are searching for as vertical (actually it’s not exactly vertical but very close)

Furthermore we consider the wave amplitude small. That is, the vertical component of the tension at the x+ end of the small segment of string is

The slope is if we consider dx and dy as the horizontal and vertical components of the above image. Since we have considered the wave amplitude to be very tiny, we can therefore assume . This greatly helps us: and the total vertical force from the tensions at the two ends becomes

The equality becomes exact in the limit .

We see that y is a function of x, but it is however a function of t as well: y=y(x,t). The standard convention for denoting differentiation with respect to one variable while the other is held constant (we’re looking at the sum of forces at one instant of time) let us write

The final part is to use Newton’s second law and put everything together: the sum of all forces, the mass (substituted with the linear density multiplied by the segment length) and the acceleration (i.e. just because we’re only moving in the vertical direction, remember the small amplitude approximation).

And we’re finally there: the wave equation. Using the spatial Laplacian operator, indicating the y function (depending on x and t) as u and substituting (a fixed constant) we have the common compact form

The two-dimensional wave equation is obtained by adding a second spatial term to the equation as follows

The constant c has dimensions of distance per unit time and thus represents a velocity. We won’t prove here that c is actually the velocity at which waves propagate along a string or through a surface (although it’s surely worth noting). This makes sense since the wave speed increases with tension experienced by the medium and decreases with the density of the medium.

In a physical simulation we need to take into account forces other than just the surface tension: the average amplitude of the waves on the surface diminishes in real-world fluids. We may therefore add a viscous damping force to the equation by introducing a force that acts in the opposite direction of the velocity of a point on the surface:

where the nonnegative constant represents the viscosity of the fluid (it controls how long it takes for the wave on the surface to calm down, a small allows waves to exist for a long time as with water while a large causes them to diminish rapidly as for thick oil).

Solving the Wave PDE with finite difference method

To actually implement a physical simulation which uses the wave PDE we need to find a method to solve it. Let’s solve the last equation we wrote with the damping force

here t is the time, x and y are the spatial coordinates of the 2D wave, c² is the wave speed and the damping factor. u=u(t,x,y) is the function we need to calculate: generally speaking it could represent various effects like a change of height of a pool’s water, electric potential in an electromagnetic wave, etc..

A common method to solve this problem is to use the finite difference method. The idea behind is basically to replace derivatives with finite differences which can be easily calculated in a discrete algorithm. If there is a function f=f(x) and we want to calculate its derivative respect to the x variable we can write

if h is a discrete small value (which is not zero), then we can approximate the derivative with

the error of such an approach could be derived from Taylor’s theorem but that isn’t the purpose of this paper.

A finite difference approach uses one of the following three approximations to express the derivative

Forward difference

Backward difference

Central difference

Let’s stick with the latter (i.e. central difference); this kind of approach can be generalized, so if we want to calculate the discrete version of f''(x) we could first write

and then calculate f''(x) as follows

The same idea is used to estimate the partial derivatives below

Let’s get back to the wave equation. We had the following

let’s apply the partial derivative formula just obtained to it (remember that u=u(t,x,y), that is, u depends by t,x and y)

This is quite a long expression. We just substituted the second derivatives and the first derivatives with the formulas we got before.

If we now consider , we are basically forcing the intervals where the derivatives are calculated to be the same for both the x and the y directions (and we can greatly simplify the expression):

To improve the readability, let us substitute some of the terms with letters

and we have

If you look at the t variables we have something like

This tells us that the new wave’s height (at t+1) will be the current wave height (t) plus the previous wave height (t-1) plus something in the present (at t) depending only by what are the values around the wave point we are considering.

This can be visualized as a sequence of time frames one by another where a point on the surface we are considering evolves

The object

has a central dot which represents a point (t,x,y) on the surface at time t. Since the term we previously called something_at_t(x,y) is actually

the value of the central point is influenced by five terms and the latter is its same value (-4u_t,x,y ) multiplied by -4.

Creating the algorithm

As we stated before, the wave PDE can be effectively solved with finite difference methods. However we still need to write some resolution code before a real physical simulation can be set up. In the last paragraph we eventually ended up obtaining

we then simplified this expression with

This is indeed a recursive form which may be modeled with the following pseudo-code

for each t in time
{
u_t+1<- u_t+u_t-1+something_at_t(x,y)

// Do something with the new u_t+1, e.g. you can draw the wave here

u_t-1<- u_t

u_t_<-u_t+1
_}

The above pseudo-code is supposed to run in retained-mode so the application might be able to draw the wave in each point of the surface we’re considering by simply calling draw_wave(u_t+1) .

Let assume that we’re modeling how a water wave evolves and so let the u function represent the wave height (0 – horizontal level): we can now write down the beginning of our pseudo-code

// Surface data

height = 500;

width = 500;

// Wave parameters

c = 1; // Speed of wave

dx = 1; // Space step

dt = 0.05; // Time step

k=0.002 // Decay factor

kdt=k*dt; // Decayment factor per timestep, recall that q = 2 - kdt, r = -1 +kdt

c1 = dt^2*c^2/dx^2; // b factor

// Initial conditions

u_current_t = zeros(height,width); // Create a height x width zero matrix

u_previous_t = u_current_t;

We basically defined a surface where the wave evolution will be drawn (a 500x500 area) and initialized the wave parameters we saw a few paragraphs ago (make sure to recall the q,r and b substitutions we did before). The initial condition is a zero matrix (u_current_t) so the entire surface is quiet and there’s no wave evolving.

Given that we are considering a matrix surface (every point located at (x;y) coordinates is described by a u value indicating the height of the wave there) we need to write code to implement the

line in the for cycle. Actually the above expression is a simplified form for

and we need to implement this last one. We may write down something like the following

for(t=0; t<A_LOT_OF_TIME; t += dt)

{

u_next_t = (2-kdt)*u_current_t+(kdt-1)*u_previous_t+c1*something_at_t(x,y)

u_previous_t = u_current_t; // Current becomes old

u_current_t = u_next_t; // New becomes current

// Draw the wave

draw_wave(u_current_t)

}

that is, a for cycle with index variable t increasing by dt at every step. Everything should be familiar by now because the (2-kdt),(kdt-1) and c1 terms are the usual q,r and b substitutions. Last thing we need to implement is the something_at_t(x,y) term, as known as:

The wave PDE we started from was

and the term we are interested now is this:

that, in our case, is

Since we have a matrix of points representing our surface, we are totally in the discrete field, and since we need to perform a second derivative on a matrix of discrete points our problem is the same as having an image with pixel intensity values I(x,y) and need to calculate its Laplacian

This is a common image processing task and it’s usually solved by applying a convolution filter to the image we are interested in (in our case: the matrix representing the surface). A common small kernel used to approximate the second derivatives in the definition of the Laplacian is the following

So in order to implement the

term, we need to apply the D Laplacian kernel as a filter to our image (i.e. the u_current_t):

u_next_t=(2-kdt)*u_current_t+(kdt-1)*u_previous_t+c1* convolution(u_current_t,D);

In fact in the element we saw earlier

the red dot elements are weighted and calculated with a 2D convolution of the D kernel.

An important element to keep in mind while performing the D kernel convolution with our u_current_t matrix is that every value in outside halos (every value involved in the convolution but outside the u_current_t matrix boundaries) should be zero as in the image below

In the picture above the red border is the u_current_t matrix perimeter while the blue 3x3 matrix is the D Laplacian kernel, everything outside the red border is zero. This is important because we want our surface to act like water contained in a recipient and the waves in the water to “bounce back” if they hit the container’s border. By zeroing all the values outside the surface matrix we don’t receive wave contributions from the outside of our perimeter nor do we influence in any way what’s beyond it. In addition the “energy” of the wave doesn’t spread out and is partially “bounced back” by the equation.

Now the algorithm is almost complete: our PDE assures us that every wave crest in our surface will be properly “transmitted” as a real wave. The only problem is: starting with a zero-everywhere matrix and letting it evolve would produce just nothing. Forever.

We will now add a small droplet to our surface to perturb it and set the wheels in motion.

To simulate as realistic as possible the effect of a droplet falling into our surface we will introduce a “packet” in our surface matrix. That means we are going to add a matrix that represents a discrete Gaussian function (similar to a Gaussian kernel) to our surface matrix. Notice that we’re going to “add”, not to “convolve”.

Given the 2D Gaussian formula

we have that A is the Gaussian amplitude (so will be our droplet amplitude), x₀ and y₀ are the center’s coordinates and the and are the spreads of the blob. Putting a negative amplitude like in the following image we can simulate a droplet just fallen into our surface.

To generate the droplet matrix we can use the following pseudo-code

// Generate a droplet matrix

dsz = 3; // Droplet size

da=0.07; // Droplet amplitude

[X,Y] = generate_XY_planes(dsz,da);

DropletMatrix = -da * exp( -(X/dsz)^2 -(Y/dsz)^2);

da is the droplet amplitude (and it’s negative as we just said), while dsz is the droplet size, that is, the Gaussian “bell” radius. X and Y are two matrices representing the X and Y plane discrete values

so the X and Y matrices for the image above are

And the final DropletMatrix is similar to the following

where the central value is -0.0700. If you drew the above matrix you would obtain a 3D Gaussian function which can now model our droplet.

The final pseudo-code for our wave algorithm is the following

// Surface data

height = 500;

width = 500;

// Wave parameters

c = 1; // Speed of wave

dx = 1; // Space step

dt = 0.05; // Time step

k=0.002 // Decay factor

kdt=k*dt; // Decayment factor per timestep, recall that q = 2 - kdt, r = -1 +kdt

c1 = dt^2*c^2/dx^2; // b factor

// Initial conditions

u_current_t=zeros(height,width); // Create a height x width zero matrix

u_previous_t = u_current_t;

// Generate a droplet matrix

dsz = 3; // Droplet size

da=0.07; // Droplet amplitude

[X,Y] = generate_XY_planes(dsz,da);

DropletMatrix = -da * exp( -(X/dsz)^2 -(Y/dsz)^2);

// This variable is used to add just one droplet

One_single_droplet_added = 0;

for(t=0; t<A_LOT_OF_TIME; t = t + dt)

{

u_next_t = (2-kdt)*u_current_t+(kdt-1)*u_previous_t+c1*convolution(u_current_t,D);

u_previous_t = u_current_t; // Current becomes old

u_current_t = u_next_t; // New becomes current

// Draw the wave

draw_wave(u_current_t)

if(One_single_droplet_added == 0)

{

One_single_droplet_added = 1; // no more droplets

addMatrix(u_current_t, DropletMatrix, [+100;+100]);

}

The variable One_single_droplet_added is used to check if the droplet has already been inserted (we want just one droplet). The addMatrix function adds the DropletMatrix values to the u_current_t surface matrix values centering the DropletMatrix at the point (100;100), remember that the DropletMatrix is smaller (or equal) to the surface matrix, so we just add the DropletMatrix’s values to the u_current_t’s values that fall within the dimensions of the DropletMatrix.

Now the algorithm is complete, although it is still a theoretical simulation. We will soon implement it with real code.

Implementing the Wave simulation

We will now discuss how the above algorithm has been implemented in a real C++ project to create a fully functional openGL physical simulation.

The sequence diagram below shows the skeleton of the program which basically consists of three main parts: the main module where the startup function resides as well as the kernel function which creates the matrix image for the entire window, an openGL renderer wrapper to encapsulate GLUT library functions and callback handlers and a matrix hand-written class to simplify matrix data access and manipulation. Although a sequence diagram would require following a standard software engineering methodology and its use is strictly enforced by predetermined rules, nonetheless we will use it as an abstraction to model the program’s control flow

The program starts at the main() and creates an openGLrenderer object which will handle all the graphic communication with the GLUT library and the callback events (like mouse movements, keyboard press events, etc.). OpenGL doesn’t provide routines for interfacing with a windowing system, so in this project we will rely on GLUT libraries which provide a platform-independent interface to manage windows and input events. To create an animation that runs as fast as possible we will set an idle callback with the glutIdleFunc() function. We will explain more about this later.

Initially the algorithm sets its initialization variables (time step, space step, droplet amplitude, Laplacian 2D kernel, etc.. practically everything we saw in the theory section) and every matrix corresponding to the image to be rendered is zeroed out. The Gaussian matrix corresponding to the droplets is also preprocessed. A structure defined in openGLRenderer’s header file contains all the data which should be retained across image renderings

 typedef struct kernelData
{
    float a1; // 2-kdt
    float a2; // kdt-1
    float c1; // decayment factor
    sMatrix* u;
    sMatrix* u0;
    sMatrix* D;
    int Ddim; // Droplet matrix width/height
    int dsz; // Gaussian radius
    sMatrix* Zd;
} kernelData;

The structure is updated each time a time step is performed since it contains both current and previous matrices that describe the waves evolution across time. Since this structure is both used by the openGL renderer and the main module to initialize it, the variable is declared as external and defined afterwards in the openGLrenderer cpp file (so its scope goes beyond the single translation unit). After everything has been set up the openGLRenderer class’ startRendering() method is called and the openGL main loop starts fetching events. The core of the algorithm we saw is in the main module’s kernel() function which is called every time an openGL idle event is dispatched (that is, the screen is updated and the changes will be shown only when the idle callback has completed, thus the amount of rendering work here performed should be minimized to avoid performance loss).

The kernel’s function code is the following

// This kernel is called at each iteration
// It implements the main loop algorithm and someway "rasterize" the matrix data
// to pass to the openGL renderer. It also adds droplets in the waiting queue
//
void kernel(unsigned char *ptr, kernelData& data)
{
    // Wave evolution
    sMatrix un(DIM,DIM);

    // The iterative discrete update (see documentation)
    un = (*data.u)*data.a1 + (*data.u0)*data.a2 + convolutionCPU((*data.u),(*data.D))*data.c1;

    // Step forward in time
    (*data.u0) = (*data.u);
    (*data.u) = un;

    // Prepare matrix data for rendering
    matrix2Bitmap( (*data.u), ptr );

    if(first_droplet == 1) // By default there's just one initial droplet
    {
        first_droplet = 0;
        int x0d= DIM / 2; // Default droplet center
        int y0d= DIM / 2;

        // Place the (x0d;y0d) centered Zd droplet on the wave data (it will be added at the next iteration)
        for(int Zdi=0; Zdi < data.Ddim; Zdi++)
        {
            for(int Zdj=0; Zdj < data.Ddim; Zdj++)
            {
                (*data.u)(y0d-2*data.dsz+Zdi,x0d-2*data.dsz+Zdj) += (*data.Zd)(Zdi, Zdj);
            }
        }
    }

    // Render the droplet queue
    m_renderer->renderWaitingDroplets();
 }

The pattern we followed in the wave PDE evolution can be easily recognized in the computational-intensive code line

un = (*data.u)*data.a1 + (*data.u0)*data.a2 + convolutionCPU((*data.u),(*data.D))*data.c1;

which basically performs the well-known iterative step

All constants are preprocessed to improve performances.

It is to be noticed that the line adds up large matrices which are referred in the code as sMatrix objects. The sMatrix class is an handwritten simple class that exposes simple operator overrides to ease working with matrices. Except that one should bear in mind that large matrix operations shall avoid passing arguments by value and that to create a new matrix and copy it to the destination a copy constructor is required (to avoid obtaining a shallow copy without the actual data), the code is pretty straight forwarding so no more words will be spent on it

// This class handles matrix objects
class sMatrix
{
public:
   int rows, columns;
   float *values;

   sMatrix(int height, int width)
   {
       if (height == 0 || width == 0)
      throw "Matrix constructor has 0 size";

       rows = height;
       columns = width;

       values = new float[rows*columns];
   }

   // Copy constructor, this is needed to perform a deep copy (not a shallow one)
   sMatrix(const sMatrix& mt)
   {
       this->rows = mt.rows;
       this->columns = mt.columns;

       this->values = new float[rows*columns];

       // Copy the values
       memcpy(this->values, mt.values, this->rows*this->columns*sizeof(float));
   }

   ~sMatrix()
   {
       delete [] values;
   }

   // Allows matrix1 = matrix2
   sMatrix& operator= (sMatrix const& m)
   {
       if(m.rows != this->rows || m.columns != this->columns)
       {
        throw "Size mismatch";
       }

       memcpy(this->values,m.values,this->rows*this->columns*sizeof(float));

      return *this; // Since "this" continues to exist after the function call, it is perfectly legit to return a reference
   }
    
   // Allows both matrix(3,3) = value and value = matrix(3,3)
   float& operator() (const int row, const int column)
   {
    // May be suppressed to slightly increase performances
    if (row < 0 || column < 0 || row > this->rows || column > this->columns)
        throw "Size mismatch";

    return values[row*columns+column]; // Since the float value continues to exist after the function call, it is perfectly legit to return a reference
   }

  // Allows scalar*matrix (e.g. 3*matrix) for each element
  sMatrix operator* (const float scalar)
  {
    sMatrix result(this->rows, this->columns);
    // Multiply each value by the scalar
    for(int i=0; i<rows*columns; i++)
    {
        result.values[i] = this->values[i]*scalar;
    }
    return result; // Copy constructor
  }

  // Allows matrix+matrix (if same size)
  sMatrix operator+ (const sMatrix& mt)
  {
    if (this->rows != mt.rows || this->columns != mt.columns)
        throw "Size mismatch";

    sMatrix result(this->rows, this->columns);
    // Sum each couple of values
    for(int i=0; i<rows; i++)
    {
        for(int j=0; j<columns; j++)
            result.values[i*columns+j] = this->values[i*columns+j] + mt.values[i*columns+j];
    }
    return result; // Copy constructor
  }
};

The convolution is performed with the following code (a classic approach):

// Returns the convolution between the matrix A and the kernel matrix B,
// A's size is preserved
//
sMatrix convolutionCPU(sMatrix& A, sMatrix& B)
{
  sMatrix result(A.rows, A.columns);
  int kernelHradius = (B.rows - 1) / 2;
  int kernelWradius = (B.columns - 1) / 2;
  float convSum, convProd, Avalue;
  for(int i=0; i<A.rows; i++)
  {
    for(int j=0; j<A.columns; j++)
    {
        // 
        // --------j--------->
        // _ _ _ _ _ _ _     
        // |            |    |
        // |            |    |
        // |       A    |    i
        // |            |    |
        // |            |    |
        // _ _ _ _ _ _ _|    v
        //
        convSum = 0;
        for(int bi=0; bi<B.rows; bi++)
        {
            for(int bj=0; bj<B.columns; bj++)
            {
                // A's value respect to the kernel center
                int relpointI = i-kernelHradius+bi;
                int relpointJ = j-kernelWradius+bj;
                if(relpointI < 0 || relpointJ < 0 || relpointI >= A.rows || relpointJ >= A.columns)
                    Avalue = 0;
                else
                    Avalue = A(i-kernelHradius+bi,j-kernelWradius+bj);
                convProd = Avalue*B(bi,bj);
                convSum += convProd;
            }
        }
            
        // Store the convolution result
        result(i,j) = convSum;
    }
  }
  return result;
}

After calculating the system’s evolution, the time elapsing is simulated by swapping the new matrix with the old one and discarding the previous state as we described before.

Then a matrix2Bitmap() call performs a matrix-to-bitmap conversion as its name suggests, more precisely the entire simulation area is described by a large sMatrix object which contains, obviously, float values. To actually render these values as pixel units we need to convert each value to its corresponding RGBA value and pass it to the openGLRenderer class (which in turn will pass the entire bitmap buffer to the GLUT library). In brief: we need to perform a float-to-RGBcolor mapping.

Since in the physical simulation we assumed that the resting water height is at 0 and every perturbation would heighten or lower this value (in particular the droplet Gaussian matrix lowers it by a maximum -0.07 factor), we are searching for a [-1;1] to color mapping. A HSV color model would better simulate a gradual color transition as we actually experience with our own eyes, but this would require converting it back to RGB values to set up a bitmap map to pass back at the GLUT wrapper. For performance reasons we chose to assign each value a color (colormap). A first solution would have been implementing a full [-1;1] -> [0x0;0xFFFFFF] mapping in order to cover all the possible colors in the RGB format

// Returns a RGB color from a value in the interval between [-1;+1]
RGB getColorFromValue(float value)
{
    RGB result;
 
    if(value <= -1.0f)
    {
            result.r = 0x00;
            result.g = 0x00;
            result.b = 0x00;
    }
    else if(value >= 1.0f)
    {
            result.r = 0xFF;
            result.g = 0xFF;
            result.b = 0xFF;
    }
    else
    {
            float step = 2.0f/0xFFFFFF;
 
            unsigned int cvalue = (unsigned int)((value + 1.0f)/step);
 
            if(cvalue < 0)
                    cvalue = 0;
            else if(cvalue > 0xFFFFFF)
                    cvalue = 0xFFFFFF;
 
            result.r = cvalue & 0xFF;
            result.g = (cvalue & 0xFF00) >> 8;
            result.b = (cvalue & 0xFF0000) >> 16;
    }
 
    return result;
}

However the above method is either performance-intensive and doesn’t render very good a smooth color transition: let’s take a look at a droplet mapped like that

looks more like a fractal rather than a droplet, so the above solution won’t work. A better way to improve performances (and the look of the image) is to hard-code a colormap in an array and to use it when needed:

float pp_step = 2.0f / (float)COLOR_NUM;
// The custom colormap, you can customize it to whatever you like
unsigned char m_colorMap[COLOR_NUM][3] = 
{
{0,0,143},{0,0,159},{0,0,175},{0,0,191},{0,0,207},{0,0,223},{0,0,239},{0,0,255},
{0,16,255},{0,32,255},{0,48,255},{0,64,255},{0,80,255},{0,96,255},{0,111,255},{0,128,255},
{0,143,255},{0,159,255},{0,175,255},{0,191,255},{0,207,255},{0,223,255},{0,239,255},{0,255,255},
{16,255,239},{32,255,223},{48,255,207},{64,255,191},{80,255,175},{96,255,159},{111,255,143},{128,255,128},
{143,255,111},{159,255,96},{175,255,80},{191,255,64},{207,255,48},{223,255,32},{239,255,16},{255,255,0},
{255,239,0},{255,223,0},{255,207,0},{255,191,0},{255,175,0},{255,159,0},{255,143,0},{255,128,0},
{255,111,0},{255,96,0},{255,80,0},{255,64,0},{255,48,0},{255,32,0},{255,16,0},{255,0,0},
{239,0,0},{223,0,0},{207,0,0},{191,0,0},{175,0,0},{159,0,0},{143,0,0},{128,0,0},
{100,0,20},{80,0,40},{60,0,60},{40,0,80},{20,0,100},{0,0,120}
};
// Returns a RGB color from a value in the interval between [-1;+1] using the above colormap
RGB getColorFromValue(float value)
{
  RGB result;
  unsigned int cvalue = (unsigned int)((value + 1.0f)/pp_step);
  if(cvalue < 0)
    cvalue = 0;
  else if(cvalue >= COLOR_NUM)
    cvalue = COLOR_NUM-1;
  result.r = m_colorMap[cvalue][0];
  result.g = m_colorMap[cvalue][1];
  result.b = m_colorMap[cvalue][2];
  return result;
}

Creating a colormap isn’t hard and different colormaps are freely available on the web which produce different transition effects. This time the result was much nicer (see the screenshot later on) and the performances (although an every-value conversion is always an intensive task) increased substantially.

The last part involved in the on-screen rendering is adding a droplet wherever the user clicks on the window with the cursor. One droplet is automatically added at the center of the surface (you can find the code in the kernel() function, it is controlled by the first_droplet variable) but the user can click everywhere (almost everywhere) on the surface to add another droplet in that spot. To achieve this a queue has been implemented to contain at the most 60 droplet centers where the Gaussian matrix will be placed (notice that the matrix will be added to the surface values that were already present in the spots, not just replace them).

#define MAX_DROPLET        60
typedef struct Droplet
{
    int mx;
    int my;
} Droplet; 
Droplet dropletQueue[MAX_DROPLET];
int dropletQueueCount = 0;

The queue system has been implemented for a reason: unlike the algorithm in pseudo-code we wrote before, rendering a scene with openGL requires the program to control the objects to be displayed in immediate-mode: that means the program needs to take care of what should be drawn before the actual rendering is performed, it cannot simply put a droplet to be rendered inside the data to be drawn because it could be in use (you can do this in retained-mode). Besides, we don’t know when a droplet will be added because it’s totally user-dependent. Because of that, every time the kernel() finishes, the droplet queue is emptied and the droplet Gaussian matrices are added to the data to be rendered (the surface data). The code which performs this is the following

void openGLRenderer::renderWaitingDroplets()
{
   // If there are droplets waiting to be rendered, schedule for the next rendering
   while(dropletQueueCount > 0)
   {
         dropletQueueCount--;
         addDroplet(dropletQueue[dropletQueueCount].mx,dropletQueue[dropletQueueCount].my);
   }       
}
 
void addDroplet( int x0d, int y0d )
{
   y0d = DIM - y0d;
   // Place the (x0d;y0d) centered Zd droplet on the wave data (it will be added at the next iteration)
   for(int Zdi=0; Zdi< m_simulationData.Ddim; Zdi++)
   {
       for(int Zdj=0; Zdj< m_simulationData.Ddim; Zdj++)
       {
          (*m_simulationData.u)(y0d-2*m_simulationData.dsz+Zdi,x0d-2*m_simulationData.dsz+Zdj)
+= (*m_simulationData.Zd)(Zdi, Zdj);
       }
   }
}

The code should be familiar by now: the addDroplet function simply adds the Zd 2D Gaussian matrix to the simulation data (the kernel data) at the “current” time, a.k.a. the u matrix which represents the surface.

The code loops until the keyboard callback handler (defined by the openGLrenderer) detects the Esc keypress, after that the application is issued a termination signal, the loops ends. The resources allocated by the program are freed before the exit signal is issued, however this might not be necessary since a program termination would let the OS free any of its previously allocated resources.

With the droplet-adding feature and all the handlers set the code is ready to run. This time the result is much nicer than the previous, that’s because we used a smoother colormap (take a look at the images below). Notice how the convolution term creates the “wave spreading” and the “bouncing” effect when computing values against the padded zero data outside the surface matrix (i.e. when the wave hits the window’s borders and is reflected back). The first image is the simulation in its early stage, that is when some droplets have just been added, the second image represents a later stage when the surface is going to calm down (in our colormap blue values are higher than red ones).

Since we introduced a damping factor (recall it from the theory section), the waves are eventually going to cease and the surface will sooner or later be quiet again. The entire simulation is (except for the colors) quite realistic but quite slow too. That’s because the entire surface matrix is being thoroughly updated and calculated by the system. The kernel() function runs continuously updating the rendering buffer. For a 512x512 image the CPU has to process a large quantity of data and it has also to perform 512x512 floating point convolutions. Using a profiler (like VS’s integrated one) shows that the program spends most of its time in the kernel() call (as we expected) and that the convolution operation is the most cpu-intensive.

It is also interesting to notice that the simulation speed decreases substantially when adding lots of droplets.

In a real scientific simulation environment gigantic quantities of data need to be processed in relatively small time intervals. That’s where GPGPU computing comes into the scene. We will briefly summarize what this acronym means and then we will present a GPU-accelerated version of the wave PDE simulation.

GPGPU Architectures

GPGPU stands for General Purpose computation on Graphics Processing Units and indicates using graphic processors and devices to perform high parallelizable computations that would normally be handled by CPU devices. The idea of using graphic devices to help CPUnits with their workloads isn’t new, but until recent architectures and frameworks like CUDA (©NVIDIA vendor-specific) or openCL showed up, programmers had to rely on series of workarounds and tricks to work with inconvenient and unintuitive methods and data structures. The reason why a developer should think about porting his CPU-native code into a new GPU version resides in the architecture design differences between CPUs and GPUs. While CPUs evolved (multicore) to gain performance advantages with sequential executions (pipelines, caches, control flows, etc..), GPUs evolved in a many-core way: they tended to operate at higher data bandwidths and chose to heavily increase their execution threads number. In the last years GPGPU has been seen as a new opportunity to use graphical processing units as algebraic coprocessors capable of handling massive parallelization and precision floating point calculations. The idea behind GPGPU architectures is letting CPU handling sequential parts of programs and GPU getting over with parallelizable parts. In fact many scientific applications and systems found their performances increased by such an approach and GPGPU is now a fundamental technology in many fields like medical imaging, physics simulations, signal processing, cryptography, intrusion detection, environmental sciences, etc..

Massive parallelization with CUDA

We chose to use the CUD-Architecture to parallelize some parts of our code. Parallelizing with GPGPU techniques means passing from a sequentially-designed code to a parallel-designed code, this also often means having to rewrite large parts of your code. The most obvious part of the entire simulation algorithm that could benefit of a parallel approach is the surface matrix convolution with the 2D Laplacian kernel.

Notice: for brevity’s sake, we will assume that the reader is already familiar with CUDA C/C++.

The CUDA SDK comes with a large variety of examples regarding how to implement an efficient convolution between matrices and how to apply a kernel as an image filter. However we decided to implement our own version rather than rely on standard techniques. The reasons are many:

we wanted to show how a massively parallel algorithm is designed and implemented
since our kernel is very small (3x3 2D Laplacian kernel, basically 9 float values) using a FFT approach like the one described by Victor Podlozhnyuk would be inefficient
the two-dimensional Gaussian kernel is the only radially symmetric function that is also separable, our kernel is not, so we cannot use separable convolution methods
such a small kernel seems perfect to be “aggressively cached” in the convolution operation. We’ll expand on that as soon as we describe the CUDA kernel designed

The most obvious way to perform a convolution (although extremely inefficient) consists in delegating each GPU thread multiple convolutions for each element across the entire image.

Take a look at the following image, we will use a 9x9 thread grid (just one block to simplify things) to perform the convolution. The purple square is our 9x9 grid while the red grids correspond to the 9x9 kernel. Each thread performs the convolution within its elements, then the X are shifted and the entire block is “virtually” moved to the right. When the X coordinate is complete (that is: the entire image horizontal area has been covered), the Y coordinate is incremented and the process starts again until completion. In the border area, every value outside the image will be set to zero.

The code for this simple approach is the following where A is the image matrix and B is the kernel.

__global__ void convolutionGPU(float *A, int A_cols, int A_rows, float *B, int B_Wradius, int B_Hradius,
int B_cols, int B_rows, float *result)
{
 
    // Initial position
    int threadXabs = blockIdx.x*blockDim.x + threadIdx.x;
    int threadYabs = blockIdx.y*blockDim.y + threadIdx.y;
    int threadXabsInitialPos = threadXabs;
 
    float convSum;
 
    while(threadYabs < A_rows)
    {
        while(threadXabs < A_cols)
        {
             // If we are still in the image, start the convolution
             convSum = 0.0f;
             // relative x coord to the absolute thread
             #pragma unroll
             for(int xrel=-B_Wradius;xrel<(B_Wradius+1); xrel++)
             {
                 #pragma unroll
                 for(int yrel=-B_Hradius;yrel<(B_Hradius+1); yrel++)
                 {
                       // Check the borders, 0 if outside
                       float Avalue;
                       if(threadXabs + xrel < 0 || threadYabs + yrel <0 || threadXabs + xrel >= A_cols || threadYabs + yrel >= A_rows)
                            Avalue = 0;
                       else
                            Avalue = A[ (threadYabs+yrel)*A_cols + (threadXabs + xrel) ];
 
                       // yrel+b_hradius makes the interval positive 
                       float Bvalue = B[ (yrel+B_Hradius)*B_cols + (xrel+B_Wradius) ];
 
                       convSum += Avalue * Bvalue;
                   }
              }
 
              // Store the result and proceed ahead in the grid
              result[threadYabs*A_cols + threadXabs ] = convSum;
 
              threadXabs += blockDim.x * gridDim.x;
         }
 
         // reset X pos and forward the Y pos
         threadXabs = threadXabsInitialPos;
         threadYabs += blockDim.y * gridDim.y;
    }
        
    // convolution finished
}

As already stated, this simple approach has several disadvantages

The kernel is very small, keeping it into global memory and accessing to it for every convolution performed is extremely inefficient
Although the matrix readings are partially coalesced, thread divergence can be significant with threads active in the border area and threads that are inactive
There’s no collaborative behavior among threads, although they basically use the same kernel and share a large part of the apron region

Hence a way better method to perform GPU convolution has been designed keeping in mind the points above.

The idea is simple: letting each thread load part of the apron and data regions in the shared memory thus maximizing readings coalescence and reducing divergence.

The code that performs the convolution on the GPU version of the simulation is the following

 // For a 512x512 image the grid is 170x170 blocks 3x3 threads each one
__global__ void convolutionGPU(float *A, float *result)
{
   __shared__ float data[laplacianD*2][laplacianD*2];
 
   // Absolute position into the image
   const int gLoc = threadIdx.x + IMUL(blockIdx.x,blockDim.x) + IMUL(threadIdx.y,DIM) + IMUL(blockIdx.y,blockDim.y)*DIM;
 
   // Image-relative position
   const int x0 = threadIdx.x + IMUL(blockIdx.x,blockDim.x);
   const int y0 = threadIdx.y + IMUL(blockIdx.y,blockDim.y);
 
   // Load the apron and data regions
 
   int x,y;
   // Upper left square
   x = x0 - kernelRadius;
   y = y0 - kernelRadius;
   if(x < 0 || y < 0)
         data[threadIdx.x][threadIdx.y] = 0.0f;
   else
         data[threadIdx.x][threadIdx.y] = A[ gLoc - kernelRadius - IMUL(DIM,kernelRadius)];
 
   // Upper right square
   x = x0 + kernelRadius + 1;
   y = y0 - kernelRadius;
   if(x >= DIM || y < 0)
         data[threadIdx.x + blockDim.x][threadIdx.y] = 0.0f;
   else
         data[threadIdx.x + blockDim.x][threadIdx.y] = A[ gLoc + kernelRadius+1 - IMUL(DIM,kernelRadius)];
 
   // Lower left square
   x = x0 - kernelRadius;
   y = y0 + kernelRadius+1;
   if(x < 0 || y >= DIM)
         data[threadIdx.x][threadIdx.y + blockDim.y] = 0.0f;
   else
         data[threadIdx.x][threadIdx.y + blockDim.y] = A[ gLoc - kernelRadius + IMUL(DIM,(kernelRadius+1))];
 
   // Lower right square
   x = x0 + kernelRadius+1;
   y = y0 + kernelRadius+1;
   if(x >= DIM || y >= DIM)
         data[threadIdx.x + blockDim.x][threadIdx.y + blockDim.y] = 0.0f;
   else
         data[threadIdx.x + blockDim.x][threadIdx.y + blockDim.y] = A[ gLoc + kernelRadius+1 + IMUL(DIM,(kernelRadius+1))];
 
   __syncthreads();
 
   float sum = 0;
   x = kernelRadius + threadIdx.x;
   y = kernelRadius + threadIdx.y;
 
   // Execute the convolution in the shared memory (kernel is in constant memory)
#pragma unroll
   for(int i = -kernelRadius; i<=kernelRadius; i++)
          for(int j=-kernelRadius; j<=kernelRadius; j++)
                  sum += data[x+i][y+j]  * gpu_D[i+kernelRadius][j+kernelRadius];
 
   // Transfer the risult to global memory
   result[gLoc] = sum;
 
}

The kernel only receives the surface matrix and the result where to store the convolved image. The kernel isn’t provided because it has been put into a special memory called “constant memory” which is read-only by kernels, pre-fetched and highly optimized to let all threads read from a specific location with minimum latency. The downside is that this kind of memory is extremely limited (in the order of 64Kb) so should be used wisely. Declaring our 3x3 kernel as constant memory grants us a significant speed advantage

__device__ __constant__ float gpu_D[laplacianD][laplacianD]; // Laplacian 2D kernel

The image below helps to determine how threads load from the surface matrix in the global memory the data and store them into faster on-chip shared memory before actually using them in the convolution operation. The purple 3x3 square is the kernel window and the central element is the value we are actually pivoting on. The grid is a 172x172 blocks 3x3 threads each one; each block of 3x3 threads have four stages to complete before entering the convolution loop: load the upper left apron and image data into shared memory (the upper left red square from the kernel purple window), load the upper right area (red square), load the lower left area (red square) and load the lower right area (idem). Since shared memory is only available to the threads in a block, each block loads its own shared area. Notice that we chose to let every thread read something from global memory to maximize coalescence, but we are not actually going to use every single element. The image shows a yellow area and a gray area: the yellow data is actually going to be used in the convolution operation for each element in the purple kernel square (it comprises aprons and data) while the gray area isn’t going to be used by any convolution performed by the block we are considering.

After filling each block’s shared memory array, the CUDA threads get synchronized to minimize their divergence. Then the execution of the convolution algorithm is performed: shared data is multiplied against constant kernel data resulting in a highly optimized operation.

The #pragma unroll directive instructs the compiler to unroll (where possible) the loop to reduce cycle control overhead and improve performances. A small example of loop unrolling: the following loop

for(int i=0;i<1000;i++)

a[i] = b[i] + c[i];

might be optimized by unrolling it

for(int i=0;i<1000;i+=2)

{

a[i] = b[i] + c[i];

a[i+1] = b[i+1] + c[i+1];

}

so that the control instructions are executed less and the overall loop improves its performances. It is to be noticed that almost every optimization in CUDA code needs to be carefully and thoroughly tested because a different architecture and different program control flows might produce different results (as well as different compiler optimizations that, unfortunately, cannot be always trusted).

Also notice that the IMUL macro is used in the code which is defined as

#define IMUL(a,b) __mul24(a,b)

On devices of CUDA compute capability 1.x, 32-bit integer multiplication is implemented using multiple instructions as it is not natively supported. 24-bit integer multiplication is natively supported via the __[u]mul24 intrinsic. However on devices of compute capability 2.0, however, 32-bit integer multiplication is natively supported, but 24-bit integer multiplication is not. __[u]mul24 is therefore implemented using multiple instructions and should not be used. So if you are planning to use the code on 2.x devices, make sure to redefine the macro directive.

A typical code which could call the kernel we just wrote could be

sMatrix convolutionGPU_i(sMatrix& A, sMatrix& B)
{
    unsigned int A_bytes = A.rows*A.columns*sizeof(float);
    sMatrix result(A.rows, A.columns);
    float *cpu_result = (float*)malloc(A_bytes);
       
    // Copy A data to the GPU global memory (B aka the kernel is already there)
    cudaError_t chk;
    chk = cudaMemcpy(m_gpuData->gpu_matrixA, A.values, A_bytes, cudaMemcpyHostToDevice);
    if(chk != cudaSuccess)
    {
        printf("\nCRITICAL: CANNOT TRANSFER MEMORY TO GPU");
        return result;
    }
       
    // Call the convolution kernel
    dim3 blocks(172,172);
    dim3 threads(3,3);
 
    convolutionGPU<<<blocks,threads>>>(m_gpuData->gpu_matrixA, m_gpuData->gpu_matrixResult);
 
    // Copy back the result
    chk = cudaMemcpy(cpu_result, m_gpuData->gpu_matrixResult, A_bytes, cudaMemcpyDeviceToHost);
    if(chk != cudaSuccess)
    {
         printf("\nCRITICAL: CANNOT TRANSFER MEMORY FROM GPU");
         return result;
    }
 
    // Allocate a sMatrix and return it with the GPU data
    free(result.values);
    result.values = cpu_result;
 
    return result;
}

obviously CUDA memory should be cudaMalloc-ated at the beginning of our program and freed only when the GPU work is complete.

However, as we stated before, converting a sequentially-designed program into a parallel one isn’t an easy task and often requires more than just a plain function-to-function conversion (it depends on the application). In our case substituting just a CPU-convolution function with a GPU-convolution function won’t work. In fact even though we distributed our workload in a better way from the CPU version (see the images below for a CPU-GPU exclusive time percentage), we actually slowed down the whole simulation.

The reason is simple: our kernel() function is called whenever a draw event is dispatched, so it needs to be called very often. Although the CUDA kernel is faster than the CPU convolution function and although GPU memory bandwidths are higher than CPU’s, transferring from (possibly paged-out) host memory to global device memory back and forth just kills our simulation performances. Applications which would benefit more from a CUDA approach usually perform a single-shot heavily-computational kernel workload and then transfer back the results. Real time applications might benefit from a concurrent kernels approach, but a 2.x capability device would be required.

In order to actually accelerate our simulation, a greater code revision is required.

Another more subtle thing to take into account when working with GPU code is CPU optimizations: take a look at the following asm codes for the CPU version of the line

un = (*data.u)*data.a1 + (*data.u0)*data.a2 + convolutionCPU((*data.u),(*data.D))*data.c1;

000000013F2321EF  mov        r8,qword ptr [m_simulationData+20h (13F234920h)]  
000000013F2321F6  mov        rdx,qword ptr [m_simulationData+10h (13F234910h)]  
000000013F2321FD  lea        rcx,[rbp+2Fh]  
000000013F232201  call       convolutionCPU (13F231EC0h)  
000000013F232206  nop  
000000013F232207  movss      xmm2,dword ptr [m_simulationData+8 (13F234908h)]  
000000013F23220F  lea        rdx,[rbp+1Fh]  
000000013F232213  mov        rcx,rax  
000000013F232216  call       sMatrix::operator* (13F2314E0h)  
000000013F23221B  mov        rdi,rax  
000000013F23221E  movss      xmm2,dword ptr [m_simulationData+4 (13F234904h)]  
000000013F232226  lea        rdx,[rbp+0Fh]  
000000013F23222A  mov        rcx,qword ptr [m_simulationData+18h (13F234918h)]  
000000013F232231  call       sMatrix::operator* (13F2314E0h)  
000000013F232236  mov        rbx,rax  
000000013F232239  movss      xmm2,dword ptr [m_simulationData (13F234900h)]  
000000013F232241  lea        rdx,[rbp-1]  
000000013F232245  mov        rcx,qword ptr [m_simulationData+10h (13F234910h)]  
000000013F23224C  call       sMatrix::operator* (13F2314E0h)  
000000013F232251  nop  
000000013F232252  mov        r8,rbx  
000000013F232255  lea        rdx,[rbp-11h]  
000000013F232259  mov        rcx,rax  
000000013F23225C  call       sMatrix::operator+ (13F2315B0h)  
000000013F232261  nop  
000000013F232262  mov        r8,rdi  
000000013F232265  lea        rdx,[rbp-21h]  
000000013F232269  mov        rcx,rax  
000000013F23226C  call       sMatrix::operator+ (13F2315B0h)  
000000013F232271  nop  
000000013F232272  cmp        dword ptr [rax],1F4h  
000000013F232278  jne        kernel+33Fh (13F2324CFh)  
000000013F23227E  cmp        dword ptr [rax+4],1F4h  
000000013F232285  jne        kernel+33Fh (13F2324CFh)  
000000013F23228B  mov        r8d,0F4240h  
000000013F232291  mov        rdx,qword ptr [rax+8]  
000000013F232295  mov        rcx,r12  
000000013F232298  call       memcpy (13F232DDEh)  
000000013F23229D  nop  
000000013F23229E  mov        rcx,qword ptr [rbp-19h]  
000000013F2322A2  call       qword ptr [__imp_operator delete (13F233090h)]  
000000013F2322A8  nop  
000000013F2322A9  mov        rcx,qword ptr [rbp-9]  
000000013F2322AD  call       qword ptr [__imp_operator delete (13F233090h)]  
000000013F2322B3  nop  
000000013F2322B4  mov        rcx,qword ptr [rbp+7]  
000000013F2322B8  call       qword ptr [__imp_operator delete (13F233090h)]  
000000013F2322BE  nop  
000000013F2322BF  mov        rcx,qword ptr [rbp+17h]  
000000013F2322C3  call       qword ptr [__imp_operator delete (13F233090h)]  
000000013F2322C9  nop  
000000013F2322CA  mov        rcx,qword ptr [rbp+27h]  
000000013F2322CE  call       qword ptr [__imp_operator delete (13F233090h)]  
000000013F2322D4  nop  
000000013F2322D5  mov        rcx,qword ptr [rbp+37h]  
000000013F2322D9  call       qword ptr [__imp_operator delete (13F233090h)]

and now take a look at the GPU version of the line

un = (*data.u)*data.a1 + (*data.u0)*data.a2 + convolutionGPU_i ((*data.u),(*data.D))*data.c1;

000000013F7E23A3  mov        rax,qword ptr [data]  
000000013F7E23AB  movss      xmm0,dword ptr [rax+8]  
000000013F7E23B0  movss      dword ptr [rsp+0A8h],xmm0  
000000013F7E23B9  mov        rax,qword ptr [data]  
000000013F7E23C1  mov        r8,qword ptr [rax+20h]  
000000013F7E23C5  mov        rax,qword ptr [data]  
000000013F7E23CD  mov        rdx,qword ptr [rax+10h]  
000000013F7E23D1  lea        rcx,[rsp+70h]  
000000013F7E23D6  call       convolutionGPU_i (13F7E1F20h)  
000000013F7E23DB  mov        qword ptr [rsp+0B0h],rax  
000000013F7E23E3  mov        rax,qword ptr [rsp+0B0h]  
000000013F7E23EB  mov        qword ptr [rsp+0B8h],rax  
000000013F7E23F3  movss      xmm0,dword ptr [rsp+0A8h]  
000000013F7E23FC  movaps     xmm2,xmm0  
000000013F7E23FF  lea        rdx,[rsp+80h]  
000000013F7E2407  mov        rcx,qword ptr [rsp+0B8h]  
000000013F7E240F  call       sMatrix::operator* (13F7E2B20h)  
000000013F7E2414  mov        qword ptr [rsp+0C0h],rax  
000000013F7E241C  mov        rax,qword ptr [rsp+0C0h]  
000000013F7E2424  mov        qword ptr [rsp+0C8h],rax  
000000013F7E242C  mov        rax,qword ptr [data]  
000000013F7E2434  movss      xmm0,dword ptr [rax+4]  
000000013F7E2439  movaps     xmm2,xmm0  
000000013F7E243C  lea        rdx,[rsp+50h]  
000000013F7E2441  mov        rax,qword ptr [data]  
000000013F7E2449  mov        rcx,qword ptr [rax+18h]  
000000013F7E244D  call       sMatrix::operator* (13F7E2B20h)  
000000013F7E2452  mov        qword ptr [rsp+0D0h],rax  
000000013F7E245A  mov        rax,qword ptr [rsp+0D0h]  
000000013F7E2462  mov        qword ptr [rsp+0D8h],rax  
000000013F7E246A  mov        rax,qword ptr [data]  
000000013F7E2472  movss      xmm2,dword ptr [rax]  
000000013F7E2476  lea        rdx,[rsp+40h]  
000000013F7E247B  mov        rax,qword ptr [data]  
000000013F7E2483  mov        rcx,qword ptr [rax+10h]  
000000013F7E2487  call       sMatrix::operator* (13F7E2B20h)  
000000013F7E248C  mov        qword ptr [rsp+0E0h],rax  
000000013F7E2494  mov        rax,qword ptr [rsp+0E0h]  
000000013F7E249C  mov        qword ptr [rsp+0E8h],rax  
000000013F7E24A4  mov        r8,qword ptr [rsp+0D8h]  
000000013F7E24AC  lea        rdx,[rsp+60h]  
000000013F7E24B1  mov        rcx,qword ptr [rsp+0E8h]  
000000013F7E24B9  call       sMatrix::operator+ (13F7E2BF0h)  
000000013F7E24BE  mov        qword ptr [rsp+0F0h],rax  
000000013F7E24C6  mov        rax,qword ptr [rsp+0F0h]  
000000013F7E24CE  mov        qword ptr [rsp+0F8h],rax  
000000013F7E24D6  mov        r8,qword ptr [rsp+0C8h]  
000000013F7E24DE  lea        rdx,[rsp+90h]  
000000013F7E24E6  mov        rcx,qword ptr [rsp+0F8h]  
000000013F7E24EE  call       sMatrix::operator+ (13F7E2BF0h)  
000000013F7E24F3  mov        qword ptr [rsp+100h],rax  
000000013F7E24FB  mov        rax,qword ptr [rsp+100h]  
000000013F7E2503  mov        qword ptr [rsp+108h],rax  
000000013F7E250B  mov        rdx,qword ptr [rsp+108h]  
000000013F7E2513  lea        rcx,[un]  
000000013F7E2518  call       sMatrix::operator= (13F7E2A90h)  
000000013F7E251D  nop  
000000013F7E251E  lea        rcx,[rsp+90h]  
000000013F7E2526  call        sMatrix::~sMatrix (13F7E2970h)  
000000013F7E252B  nop  
000000013F7E252C  lea        rcx,[rsp+60h]  
000000013F7E2531  call       sMatrix::~sMatrix (13F7E2970h)  
000000013F7E2536  nop  
000000013F7E2537  lea        rcx,[rsp+40h]  
000000013F7E253C  call       sMatrix::~sMatrix (13F7E2970h)  
000000013F7E2541  nop  
000000013F7E2542  lea        rcx,[rsp+50h]  
000000013F7E2547  call       sMatrix::~sMatrix (13F7E2970h)  
000000013F7E254C  nop  
000000013F7E254D  lea        rcx,[rsp+80h]  
000000013F7E2555  call       sMatrix::~sMatrix (13F7E2970h)  
000000013F7E255A  nop  
000000013F7E255B  lea        rcx,[rsp+70h]  
000000013F7E2560 call        sMatrix::~sMatrix(13F7E2970h)

the code, although the data involved are practically the same, looks much more bloated up (there are even nonsense operations, look at address 000000013F7E23DB). Probably letting the CPU finish the calculation after the GPU has done its work is not a good idea.

Since there are other functions which can be parallelized (like the matrix2bitmap() function), we need to move as much workload as possible on the device.

First we need to allocate memory on the device at the beginning of the program and deallocate it when finished. Small data like our algorithm parameters can be stored into constant memory to boost performances, while matrices large data is more suited into global memory (constant memory size is very limited).

// Initialize all data used by the device
// and the rendering simulation data as well
void initializeGPUData()
{
    /* Algorithm parameters */
 
    // Time step
    float dt = (float)0.05;
    // Speed of the wave
    float c = 1;
    // Space step
    float dx = 1;
    // Decay factor
    float k = (float)0.002;
    // Droplet amplitude (Gaussian amplitude)
    float da = (float)0.07;
        
    // Initialize u0
    sMatrix u0(DIM,DIM);
 
    for(int i=0; i<DIM; i++)
    {
          for(int j=0; j<DIM; j++)
          {
                u0(i,j) = 0.0f; // The corresponding color in the colormap for 0 is green
          }
    }
 
    // Initialize the rendering img to the u0 matrix
    CPUsMatrix2Bitmap(u0, renderImg);
        
    // Decayment per timestep
    float kdt=k*dt;
    // c1 constant
    float c1=pow(dt,2)*pow(c,2)/pow(dx,2);
 
    // Droplet as gaussian
    // This code creates a gaussian discrete droplet, see the documentation for more information
    const int dim = 4*dropletRadius+1;
    sMatrix xd(dim, dim);
    sMatrix yd(dim, dim);
    for(int i=0; i<dim; i++)
    {
           for(int j=-2*dropletRadius; j<=2*dropletRadius; j++)
           {
                  xd(i,j+2*dropletRadius) = j;
                  yd(j+2*dropletRadius,i) = j;
           }
    }
    float m_Zd[dim][dim];
    for(int i=0; i<dim; i++)
    {
           for(int j=0; j<dim; j++)
           {
                  // Calculate Gaussian centered on zero
                  m_Zd[i][j] = -da*exp(-pow(xd(i,j)/dropletRadius,2)-pow(yd(i,j)/dropletRadius,2));
           }
    }
 
    /* GPU data initialization */
 
    // Allocate memory on the GPU for u and u0 matrices
    unsigned int UU0_bytes = DIM*DIM*sizeof(float);
    cudaError_t chk;
    chk = cudaMalloc((void**)&m_gpuData.gpu_u, UU0_bytes);
    if(chk != cudaSuccess)
    {
         printf("\nCRITICAL: CANNOT ALLOCATE GPU MEMORY");
         return;
    }
    chk = cudaMalloc((void**)&m_gpuData.gpu_u0, UU0_bytes);
    if(chk != cudaSuccess)
    {
         printf("\nCRITICAL: CANNOT ALLOCATE GPU MEMORY");
         return;
    }
    // Allocate memory for ris0, ris1, ris2 and ptr matrices
    chk = cudaMalloc((void**)&m_gpuData.ris0, UU0_bytes);
    if(chk != cudaSuccess)
    {
         printf("\nCRITICAL: CANNOT ALLOCATE GPU MEMORY");
         return;
    }
    chk = cudaMalloc((void**)&m_gpuData.ris1, UU0_bytes);
    if(chk != cudaSuccess)
    {
         printf("\nCRITICAL: CANNOT ALLOCATE GPU MEMORY");
         return;
    }
    chk = cudaMalloc((void**)&m_gpuData.ris2, UU0_bytes);
    if(chk != cudaSuccess)
    {
         printf("\nCRITICAL: CANNOT ALLOCATE GPU MEMORY");
         return;
    }
    chk = cudaMalloc((void**)&m_gpuData.gpu_ptr, DIM*DIM*4);
    if(chk != cudaSuccess)
    {
         printf("\nCRITICAL: CANNOT ALLOCATE GPU MEMORY");
         return;
    }
 
    // Initialize to zero both u and u0
    chk = cudaMemcpy(m_gpuData.gpu_u0, u0.values, UU0_bytes, cudaMemcpyHostToDevice);
    if(chk != cudaSuccess)
    {
         printf("\nCRITICAL: CANNOT TRANSFER MEMORY TO GPU");
         return;
    }
    chk = cudaMemcpy(m_gpuData.gpu_u, u0.values, UU0_bytes, cudaMemcpyHostToDevice);
    if(chk != cudaSuccess)
    {
         printf("\nCRITICAL: CANNOT TRANSFER MEMORY TO GPU");
         return;
    }
        
    // Preload Laplacian kernel
    float m_D[3][3];
 
    m_D[0][0] = 0.0f; m_D[1][0] = 1.0f;  m_D[2][0]=0.0f;
    m_D[0][1] = 1.0f; m_D[1][1] = -4.0f; m_D[2][1]=1.0f;
    m_D[0][2] = 0.0f; m_D[1][2] = 1.0f;  m_D[2][2]=0.0f;
        
    // Copy Laplacian to constant memory
    chk = cudaMemcpyToSymbol((const char*)gpu_D, m_D, 9*sizeof(float), 0, cudaMemcpyHostToDevice);
    if(chk != cudaSuccess)
    {
          printf("\nCONSTANT MEMORY TRANSFER FAILED");
          return;
    }
 
    // Store all static algorithm parameters in constant memory
    const float a1 = (2-kdt);
    chk = cudaMemcpyToSymbol((const char*)&gpu_a1, &a1, sizeof(float), 0, cudaMemcpyHostToDevice);
    if(chk != cudaSuccess)
    {
          printf("\nCONSTANT MEMORY TRANSFER FAILED");
          return;
    }
    const float a2 = (kdt-1);
    chk = cudaMemcpyToSymbol((const char*)&gpu_a2, &a2, sizeof(float), 0, cudaMemcpyHostToDevice);
    if(chk != cudaSuccess)
    {
         printf("\nCONSTANT MEMORY TRANSFER FAILED");
         return;
    }
    chk = cudaMemcpyToSymbol((const char*)&gpu_c1, &c1, sizeof(float), 0, cudaMemcpyHostToDevice);
    if(chk != cudaSuccess)
    {
         printf("\nCONSTANT MEMORY TRANSFER FAILED");
         return;
    }
    const int ddim = dim;
    chk = cudaMemcpyToSymbol((const char*)&gpu_Ddim, &ddim, sizeof(int), 0, cudaMemcpyHostToDevice);
    if(chk != cudaSuccess)
    {
         printf("\nCONSTANT MEMORY TRANSFER FAILED");
         return;
    }
    const int droplet_dsz = dropletRadius;
    chk = cudaMemcpyToSymbol((const char*)&gpu_dsz, &droplet_dsz, sizeof(int), 0, cudaMemcpyHostToDevice);
    if(chk != cudaSuccess)
    {
         printf("\nCONSTANT MEMORY TRANSFER FAILED");
         return;
    }
    chk = cudaMemcpyToSymbol((constchar*)&gpu_Zd, &m_Zd, sizeof(float)*dim*dim, 0, cudaMemcpyHostToDevice);
    if(chk != cudaSuccess)
    {
         printf("\nCONSTANT MEMORY TRANSFER FAILED");
         return;
    }
 
    //
    // Initialize colormap and ppstep in constant memory
    chk = cudaMemcpyToSymbol((const char*)&gpu_pp_step, &pp_step, sizeof(float), 0, cudaMemcpyHostToDevice);
    if(chk != cudaSuccess)
    {
         printf("\nCONSTANT MEMORY TRANSFER FAILED");
         return;
    }
    chk = cudaMemcpyToSymbol((const char*)&gpu_m_colorMap, &m_colorMap, sizeof(unsigned char)*COLOR_NUM*3, 0, cudaMemcpyHostToDevice);
    if(chk != cudaSuccess)
    {
         printf("\nCONSTANT MEMORY TRANSFER FAILED");
         return;
    }
}
 
void deinitializeGPUData()
{
    // Free everything from device memory
    cudaFree(m_gpuData.gpu_u);
    cudaFree(m_gpuData.gpu_u0);
    cudaFree(m_gpuData.gpu_ptr);
    cudaFree(m_gpuData.ris0);
    cudaFree(m_gpuData.ris1);
    cudaFree(m_gpuData.ris2);
}

After initializing the GPU memory the openGLRenderer can be started as usual to call the kernel() function in order to obtain a valid render-able surface image matrix. But there’s a difference now, right in the openGLRenderer constructor

openGLRenderer::openGLRenderer(void)
{
     //. . .
        
     // Sets up the CUBLAS
     cublasStatus_t status = cublasInit();
     if (status != CUBLAS_STATUS_SUCCESS)
     {
           // CUBLAS initialization error
           printf("\nCRITICAL: CUBLAS LIBRARY FAILED TO LOAD");
           return;
     }
        
     // Set up the bitmap data with page-locked memory for fast access
     //no more : renderImg = new unsigned char[DIM*DIM*4];
     cudaError_t chk = cudaHostAlloc((void**)&renderImg, DIM*DIM*4*sizeof(char), cudaHostAllocDefault);
     if(chk != cudaSuccess)
     {
           printf("\nCRITICAL: CANNOT ALLOCATE PAGELOCKED MEMORY");
           return ;
     }
}

First we decided to use CUBLAS library to perform matrix addition for two reasons:

our row-major data on the device is ready to be used by the CUBLAS functions yet (cublasMalloc is just a wrapper around the cudaMalloc)
CUBLAS library is extremely optimized for large matrices operations; our matrices aren’t that big but this could help extending the architecture for a future version

Using our sMatrix wrapper is no more an efficient choice and we need to get rid of it while working on the device, although we can still use it for the initialization stage.

The second fundamental thing that we need to notice in the openGLRenderer constructor is that we allocated host-side memory (the memory that will contain the data to be rendered) with cudaHostAlloc instead of the classic malloc. As the documentation states, allocating memory with such a function grants that the CUDA driver will track the virtual memory ranges allocated with this function and accelerate calls to function like cudaMemCpy. Host memory allocated with cudaHostAlloc is often referred as “pinned memory”, and cannot be paged-out (and because of that allocating excessive amounts of it may degrade system performance since it reduces the amount of memory available to the system for paging). This expedient will grant additional speed in memory transfers between device and host.

We are now ready to take a peek at the revised kernel() function

// This kernel is called at each iteration
// It implements the main loop algorithm and someway "rasterize" the matrix data
// to be passed to the openGL renderer. It also adds droplets in the waiting queue
//
void kernel(unsigned char *ptr)
{
    // Set up the grid
    dim3 blocks(172,172);
    dim3 threads(3,3); // 516x516 img is 172x172 (3x3 thread) blocks
        
    // Implements the un = (*data.u)*data.a1 + (*data.u0)*data.a2 + convolution((*data.u),(*data.D))*data.c1;
    // line by means of several kernel calls
    convolutionGPU<<<blocks,threads>>>(m_gpuData.gpu_u, m_gpuData.ris0);
    // Now multiply everything by c1 constant
    multiplyEachElementby_c1<<<blocks,threads>>>(m_gpuData.ris0, m_gpuData.ris1);
    // First term is ready, now u*a1
    multiplyEachElementby_a1<<<blocks,threads>>>(m_gpuData.gpu_u, m_gpuData.ris0);
    // u0*a2
    multiplyEachElementby_a2<<<blocks,threads>>>(m_gpuData.gpu_u0, m_gpuData.ris2);
    // Perform the matrix addition with the CUBLAS library
    // un = ris0 + ris2 + ris1
    // Since everything is already stored as row-major device vectors, we don't need to do anything to pass it to the CUBLAS
    cublasSaxpy(DIM*DIM, 1.0f, m_gpuData.ris0, 1, m_gpuData.ris2, 1);
    cublasSaxpy(DIM*DIM, 1.0f, m_gpuData.ris2, 1, m_gpuData.ris1, 1);
    // Result is not in m_gpuData.ris1
 
    // Step forward in time
    cudaMemcpy(m_gpuData.gpu_u0, m_gpuData.gpu_u, DIM*DIM*sizeof(float), cudaMemcpyDeviceToDevice);
    cudaMemcpy(m_gpuData.gpu_u, m_gpuData.ris1, DIM*DIM*sizeof(float), cudaMemcpyDeviceToDevice);
 
    // Draw the u surface matrix and "rasterize" it into gpu_ptr
    gpuMatrix2Bitmap<<<blocks,threads>>>(m_gpuData.gpu_u, m_gpuData.gpu_ptr);
 
    // Back on the pagelocked host memory
    cudaMemcpy(ptr, m_gpuData.gpu_ptr, DIM*DIM*4, cudaMemcpyDeviceToHost);
        
    if(first_droplet == 1) // By default there's just one initial droplet
    {
         first_droplet = 0;
         int x0d= DIM / 2; // Default droplet center
         int y0d= DIM / 2;
 
         cudaMemcpy(m_gpuData.ris0, m_gpuData.gpu_u, DIM*DIM*sizeof(float), cudaMemcpyDeviceToDevice);
         addDropletToU<<<blocks,threads>>>(m_gpuData.ris0, x0d,y0d, m_gpuData.gpu_u);
    }
 
    // Add all the remaining droplets in the queue
    while(dropletQueueCount >0)
    {
         dropletQueueCount--;
         int y0d = DIM - dropletQueue[dropletQueueCount].my;
         // Copy from u to one of our buffers
         cudaMemcpy(m_gpuData.ris0, m_gpuData.gpu_u, DIM*DIM*sizeof(float), cudaMemcpyDeviceToDevice);
         addDropletToU<<<blocks,threads>>>(m_gpuData.ris0, dropletQueue[dropletQueueCount].mx,y0d, m_gpuData.gpu_u);
    }
 
    // Synchronize to make sure all kernels executions are done
    cudaThreadSynchronize();
 }

The line

un = (*data.u)*data.a1 + (*data.u0)*data.a2 + convolution((*data.u),(*data.D))*data.c1;

has completely been superseded by multiple kernel calls which respectively operate a convolution operation, multiply matrix data by algorithm constants and perform a matrix-matrix addition via CUBLAS. Everything is performed in the device including the point-to-RGBvalue mapping (which is a highly parallelizable operation since must be performed for every value in the surface image matrix). Stepping forward in time is also accomplished with device methods. Eventually the data is copied back to the page-locked pinned host memory and droplets waiting in the queue are added for the next iteration to the u surface simulation data matrix.

The CUDA kernels called by the kernel() function are the following

/******************************************************************************
                                CUDA KERNELS
/******************************************************************************/
 
// For a 512x512 image the grid is 170x170 blocks 3x3 threads each one
__global__ void convolutionGPU(float *A, float *result)
{
   __shared__ float data[laplacianD*2][laplacianD*2];
 
   // Absolute position into the image
   const int gLoc = threadIdx.x + IMUL(blockIdx.x,blockDim.x) + IMUL(threadIdx.y,DIM) + IMUL(blockIdx.y,blockDim.y)*DIM;
 
   // Image-relative position
   const int x0 = threadIdx.x + IMUL(blockIdx.x,blockDim.x);
   const int y0 = threadIdx.y + IMUL(blockIdx.y,blockDim.y);
 
   // Load the apron and data regions
 
   int x,y;
   // Upper left square
   x = x0 - kernelRadius;
   y = y0 - kernelRadius;
   if(x < 0 || y < 0)
        data[threadIdx.x][threadIdx.y] = 0.0f;
   else
        data[threadIdx.x][threadIdx.y] = A[ gLoc - kernelRadius - IMUL(DIM,kernelRadius)];
 
   // Upper right square
   x = x0 + kernelRadius + 1;
   y = y0 - kernelRadius;
   if(x >= DIM || y < 0)
        data[threadIdx.x + blockDim.x][threadIdx.y] = 0.0f;
   else
        data[threadIdx.x + blockDim.x][threadIdx.y] = A[ gLoc + kernelRadius+1 - IMUL(DIM,kernelRadius)];
 
   // Lower left square
   x = x0 - kernelRadius;
   y = y0 + kernelRadius+1;
   if(x < 0 || y >= DIM)
        data[threadIdx.x][threadIdx.y + blockDim.y] = 0.0f;
   else
        data[threadIdx.x][threadIdx.y + blockDim.y] = A[ gLoc - kernelRadius + IMUL(DIM,(kernelRadius+1))];
 
   // Lower right square
   x = x0 + kernelRadius+1;
   y = y0 + kernelRadius+1;
   if(x >= DIM || y >= DIM)
         data[threadIdx.x + blockDim.x][threadIdx.y + blockDim.y] = 0.0f;
   else
         data[threadIdx.x + blockDim.x][threadIdx.y + blockDim.y] = A[ gLoc + kernelRadius+1 + IMUL(DIM,(kernelRadius+1))];
 
   __syncthreads();
 
   float sum = 0;
   x = kernelRadius + threadIdx.x;
   y = kernelRadius + threadIdx.y;
 
   // Execute the convolution in the shared memory (kernel is in constant memory)
#pragma unroll
   for(int i = -kernelRadius; i<=kernelRadius; i++)
         for(int j=-kernelRadius; j<=kernelRadius; j++)
                  sum += data[x+i][y+j]  * gpu_D[i+kernelRadius][j+kernelRadius];
 
   // Transfer the risult to global memory
   result[gLoc] = sum;
 
}
 
__global__ void multiplyEachElementby_c1(float *matrix, float *result)
{
    // Absolute position into the image
    const int gLoc = threadIdx.x + IMUL(blockIdx.x,blockDim.x) + IMUL(threadIdx.y,DIM) + IMUL(blockIdx.y,blockDim.y)*DIM;
        
    // Multiply by c1 each matrix's element
    result[gLoc] = matrix[gLoc]*gpu_c1;
}
__global__ void multiplyEachElementby_a1(float *matrix, float *result)
{
    // Absolute position into the image
    const int gLoc = threadIdx.x + IMUL(blockIdx.x,blockDim.x) + IMUL(threadIdx.y,DIM) + IMUL(blockIdx.y,blockDim.y)*DIM;
        
    // Multiply by c1 each matrix's element
    result[gLoc] = matrix[gLoc]*gpu_a1;
}
__global__ void multiplyEachElementby_a2(float *matrix, float *result)
{
    // Absolute position into the image
    const int gLoc = threadIdx.x + IMUL(blockIdx.x,blockDim.x) + IMUL(threadIdx.y,DIM) + IMUL(blockIdx.y,blockDim.y)*DIM;
        
    // Multiply by c1 each matrix'selement
    result[gLoc] = matrix[gLoc]*gpu_a2;
}
 
// Associate a colormap RGB value to each point
__global__ void gpuMatrix2Bitmap(float *matrix, BYTE *bitmap)
{
    // Absolute position into the image
    const int gLoc = threadIdx.x + IMUL(blockIdx.x,blockDim.x) + IMUL(threadIdx.y,DIM) + IMUL(blockIdx.y,blockDim.y)*DIM;
        
    int cvalue = (int)((matrix[gLoc] + 1.0f)/gpu_pp_step);
 
    if(cvalue < 0)
          cvalue = 0;
    else if(cvalue >= COLOR_NUM)
          cvalue = COLOR_NUM-1;
 
    bitmap[gLoc*4] = gpu_m_colorMap[cvalue][0];
    bitmap[gLoc*4 + 1] = gpu_m_colorMap[cvalue][1];
    bitmap[gLoc*4 + 2] = gpu_m_colorMap[cvalue][2];
    bitmap[gLoc*4 + 3] = 0xFF; // Alpha
}
 
// Add a gaussian 2D droplet matrix to the surface data
// Warning: this kernel has a high divergence factor, it is meant to be seldom called
__global__ void addDropletToU(float *matrix, int x0d, int y0d, float *result)
{
    // Absolute position into the image
    const int gLoc = threadIdx.x + IMUL(blockIdx.x,blockDim.x) + IMUL(threadIdx.y,DIM) + IMUL(blockIdx.y,blockDim.y)*DIM;
    // Image relative position
    const int x0 = threadIdx.x + IMUL(blockIdx.x,blockDim.x);
    const int y0 = threadIdx.y + IMUL(blockIdx.y,blockDim.y);
 
    // Place the (x0d;y0d) centered Zd droplet on the wave data (it will be added at the next iteration)
    if (x0 >= x0d-gpu_dsz*2 && y0 >= y0d-gpu_dsz*2 && x0 <= x0d+gpu_dsz*2 && y0 <= y0d+gpu_dsz*2)
    {
         // Add to result the matrix value plus the Zd corresponding value
         result[gLoc] = matrix[gLoc] + gpu_Zd[x0 -(x0d-gpu_dsz*2)][y0 - (y0d-gpu_dsz*2)];
    }
    else
        result[gLoc] = matrix[gLoc]; // This value shouln't be changed
}

Notice that we preferred to “hardcode” the constant values usages with different kernels rather than introducing divergence with a conditional branch. The only kernel that increases thread divergence is the addDropletToU since only a few threads are actually performing the Gaussian packet-starting routine (see the theoric algorithm described a few paragraphs ago), but this isn’t a problem due to its low calling frequency.

Performance comparison

The timing measurements and performance comparisons have been performed on the following system

Intel 2 quad cpu Q9650 @ 3.00 Ghz

6 GB ram

64 bit OS

NVIDIA GeForce GTX 285 (1GB DDR3 @ 1476 Mhz, 240 CUDA cores)

The CUDA version we used to compile the projects is 4.2, if you have problems make sure to install the right version or change it as described in the readme file.

To benchmark the CUDA kernel execution we used the cudaEventCreate / cudaEventRecord / cudaEventSynchronize / cudaEventElapsedTime functions shipped with every CUDA version, while for the CPU version we used two Windows platform-dependent APIs: QueryPerformanceFrequency and QueryPerformanceCounter.

We split the benchmark into four stages: a startup stage where the only droplet in the image is the default one, a second stage when both the CPU and the GPU version stabilized themselves, a third one where we add 60-70 droplets to the rendering queue and a final one when the application is left running for 15-20 minutes. We saw that in every test the CPU performed worse than the GPU version which could rely on a large grid of threads ready to split up an upcoming significant workload and provide a fixed rendering time. On the other hand in the long term period, although the GPU still did better, the CPU version showed a small performance increment, perhaps thanks to the caching mechanisms.

Notice that an application operating on larger data would surely have taken a greater advantage from a massive parallelization approach. Our wave PDE simulation is quite simple indeed and did not require a significant workload thus reducing the performance gain that could have been achieved.

Once and for all: there’s not a general rule to convert a sequentially-designed algorithm to a parallel one and each case must be evaluated in its own context and architecture. Also using CUDA could provide a great advantage in scientific simulations, but one should not consider the GPU as a substitute of the CPU but rather as a algebraic coprocessor that can rely on massive data parallelization. Combining CPU sequential code parts with GPU parallel code parts is the key to succeed.

CUDA Kernels best practices

The last, but not the least, section of this paper provide a checklist of best-practices and errors to avoid when writing CUDA kernels in order to get the maximum from your GPU-accelerated application

Minimize host <-> device transfers, especially device -> host transfers which are slower, also if that could mean running on the device kernels that would not have been slower on the CPU.
Use pinned memory (pagelocked) on the host side to exploit bandwidth advantages. Be careful not to abuse it or you’ll slow your entire system down.
cudaMemcpy is a blocking function, if possible use it asynchronously with pinned memory + a CUDA stream in order to overlap transfers with kernel executions.
If your graphic card is integrated, zero-copy memory (that is: pinned memory allocated with cudaHostAllocMapped flag) always grants an advantage. If not, there is no certainty since the memory is not cached by the GPU.
Always check your device compute capabilities (use cudaGetDeviceProperties), if < 2.0 you cannot use more than 512 threads per block (65535 blocks maximum).
Check your graphic card specifications: when launching a AxB block grid kernel, each SM (streaming multiprocessor) can serve a part of them. If your card has a maximum of 1024 threads per SM you should size your blocks in order to fill as many of them as possible but not too many (otherwise you would get scheduling latencies). Every warp is usually 32 thread (although this is not a fixed value and is architecture dependent) and is the minimum scheduling unit on each SM (only one warp is executed at any time and all threads in a warp execute the same instruction - SIMD), so you should consider the following example: on a GT200 card you need to perform a matrix multiplication. Should you use 8x8, 16x16 or 32x32 threads per block?

For 8X8 blocks, we have 64 threads per block. Since each SM can take up to 1024 threads, there are 16 Blocks (1024/64). However, each SM can only take up to 8 blocks. Hence only 512 (64*8) threads will go into each SM -> SM execution resources are under-utilized; fewer wraps to schedule around long latency operations

For 16X16 blocks , we have 256 threads per Block. Since each SM can take up to 1024 threads, it can take up to 4 Blocks (1024/256) and the 8 blocks limit isn’t hit -> Full thread capacity in each SM and maximal number of warps for scheduling around long-latency operations (1024/32= 32 wraps).

For 32X32 blocks, we have 1024 threads per block -> Not even one can fit into an SM! (there’s a 512 threads per block limitation).

Check the SM registers limit per block and divide them by the thread number: you’ll get the maximum register number you can use in a thread. Exceeding it by just one per thread will cause less warp to be scheduled and decreased performance.
Check the shared memory per block and make sure you don’t exceed that value. Exceeding it will cause less warp to be scheduled and decreased performance.
Always check the thread number to be inferior than the maximum threads value supported by your device.
Use reduction and locality techniques wherever applicable.
If possible, never split your kernel code into conditional branches (if-then-else), different paths would cause more executions for the same warp and the following overhead.
Use reduction techniques with divergence minimization when possible (that is, try to perform a reduction with warps performing a coalesced reading per cycle as described in chapter 6 of Kirk and Hwu book)
Coalescence is achieved by forcing hardware reading consecutive data. If each thread in a warp access consecutive memory coalescence is significantly increased (half of the threads in a warp should access global memory at the same time), that’s why with large matrices (row-major) reading by columns is better than rows readings. Coalescence can be increased with locality (i.e. threads cooperating in loading data needed by other threads); kernels should perform coalesced readings with locality purposes in order to maximize memory throughputs. Storing values into shared memory (when possible) is a good practice too.
Make sure you don’t have unused threads/blocks, by design or because of your code. As said before graphic cards have limits like maximum threads on SM and maximum blocks per SM. Designing your grid without consulting your card specifications is highly discouraged.
As stated before adding more registers than the card maximum registers limit is a recipe for performance loss. Anyway adding a register may also cause instructions to be added, that is: more time to parallelize transfers or to schedule warps and better performances. Again: there’s no general rule, you should abide by the best practises when designing your code and then experiment by yourself.
Data prefetching means preloading data you don’t actually need at the moment to gain performance in a future operation (closely related to locality). Combining data prefetching in matrix tiles can solve many long-latency memory access problems.
Unrolling loops is preferable when applicable (as if the loop is small). Ideally loop unrolling should be automatically done by the compiler, but checking to be sure of that is always a better choice.
Reduce thread granularity with rectangular tiles (Chapter 6 Kirk and Hwu book) when working with matrices to avoid multiple row/columns readings from global memory by different blocks.
Textures are cached memory, if used properly they are significantly faster than global memory, that is: texture are better suited for 2D spatial locality accesses (e.g. multidimensional arrays) and might perform better in some specific cases (again, this is a case-by-case rule).
Try to mantain at least 25% of the overall registers occupied, otherwise access latency could not be hidden by other warps’ computations.
The number of threads per block should always be a multiple of warp size to favor coalesced readings.
Generally if a SM supports more than just one block, more than 64 threads per block should be used.
The starting values to begin experimenting with kernel grids are between 128 and 256 threads per block.
High latency problems might be solved by using more blocks with less threads instead of just one block with a lot of threads per SM. This is important expecially for kernels which often call __syncthreads().
If the kernel fails, use cudaGetLastError() to check the error value, you might have used too many registers / too much shared or constant memory / too many threads.
CUDA functions and hardware perform best with float data type. Its use is highly encouraged.
Integer division and modulus (%) are expensive operators, replace them with bitwise operations whenever possible. If n is a power of 2,

Avoid automatic data conversion from double to float if possible.
Mathematical functions with a __ preceeding them are hardware implemented, they’re faster but less accurate. Use them if precision is not a critical goal (e.g. __sinf(x) rather than sinf(x)).
Use signed integers rather than unsigned integers in loops because some compilers optimize signed integers better (overflows with signed integers cause undefined behavior, then compilers might aggressively optimize them).
Remember that floating point math is not associative because of round-off errors. Massively parallel results might differ because of this.
1) If your device supports concurrent kernel executions (see concurrentKernels device property) you might be able to gain additional performance by running kernels in parallel. Check your device specifications and the CUDA programming guides.

This concludes our article. The goal was an exploration of GPGPU applications capabilities to improve scientific simulations and programs which aim to large data manipulations.

References

Mathematics for 3D Game Programming and Computer Graphics Third Edition - Eric Lengyel
Programming Massively Parallel Processors – David B. Kirk, Wen-mei W. Hwu
CUDA by Example - Jason Sanders, Edward Kandrot
Analyzing Waves on a String - Michael Fowler (http://galileo.phys.virginia.edu/classes/152.mf1i.spring02/)
Frank Wattenberg - Estimating Second Derivatives Numerically
Maxim Vedenev (http://simulations.narod.ru/) - solve 2d wave equation with FDM
2008 Fall CS 525 CUDA Convolution - Sangyoon (James) Lee, University of Illinois (http://www.evl.uic.edu/sjames/cs525/final.html)