
Part 3: Work-Groups and Synchronization

6 Jan 2011

In his third tutorial, GPGPU expert Rob Farber introduces the OpenCL™ execution model and discusses how to coordinate computations among the work-items in a work-group.


This third article in a series on portable multithreaded programming using OpenCL™ will introduce the OpenCL™ execution model and discuss how to coordinate computations among the work-items in a work-group. The previous article (Part 2) introduced OpenCL memory spaces and provided a general C++ API OpenCL example to facilitate building and experimenting with your own OpenCL kernels on both CPU and GPU devices.

Specifically, this tutorial will explore the OpenCL execution model and will expand the example code from part 2 to:

  • Provide command-line arguments, such as preprocessor defines, to the OpenCL kernel build.
  • Refactor the part 2 example to better separate the OclTest class and make it easier to create new tests. Only minor changes were made, but they are worth noting: methods have been added to support tests that use 1D, 2D, or 3D work-groups; test-specific command-line preprocessing now localizes variables inside the OclTest class; and logic has been added so various OpenCL test kernels can be used without recompiling the host code.
  • Include an example OpenCL kernel that demonstrates the use of 2D work-groups to tile global memory and to synchronize shared (local) memory operations, so that each work-item in a work-group can perform a simple operation using global memory and multiple shared memory regions. At only 25 lines of code, counting comments, this example emphasizes how simple and intuitive indexing and synchronization within a parallel 2D code can be in OpenCL, and it should be easily extensible to more complex kernels.

The beauty inherent in the OpenCL design is that the data is localized and associated with a computation expressed by a work item or within a work-group. This effectively breaks the work and data into many small independent tasks that can be queued on one to many devices to achieve very high performance and scalability with parallel hardware.

For performance and convenience, developers can map this localized data onto work-groups with 1D, 2D, or 3D indexing. Through lightweight, efficient barriers, work-items can share data in high-speed local memory, enabling data reuse and complicated memory access patterns that greatly accelerate application performance.

It is important to note that section 3.4.3 of the OpenCL specification is very explicit that synchronization can only occur:

  • Between work-items in a single work-group.
  • Among commands enqueued to command-queue(s) in a single context.

This tutorial will focus on synchronization between work-items within a single work-group. It is also possible for work-items in different work-groups to coordinate execution through atomic global memory transactions, but it is generally good design practice to treat atomics for synchronization as a last resort: they can adversely affect scalability, require the use of slower global memory, introduce deadlock into a code, and limit portability, since atomic operations are an OpenCL extension supported by only some OpenCL runtimes. Still, synchronization via atomic operations can be a valuable and necessary capability for some computational problems; it will be discussed in a future tutorial, as will synchronization via command queues.

Succinctly: Work-items in different work-groups should never try to synchronize or share data, since the runtime provides no guarantee that all work-items are concurrently executing, and such synchronization easily introduces deadlocks.

The OpenCL Execution Model

The OpenCL execution model is based on the parallel execution of a computational kernel over a 1D, 2D, or 3D grid, or NDRange (“N-Dimensional Range”). A single kernel instance, or work-item, executes at each point in the global grid, and work-items are organized into work-groups that tile the global grid.

Aside from terminology, an essential aspect of the OpenCL execution model is that each work-item is assigned a unique global ID, plus a local ID and work-group ID, within the multidimensional space defined by the NDRanges. Through these unique identifiers, the OpenCL execution model allows the developer to identify exactly where each parallel instance of a kernel resides in the index space so it can perform the computations required to correctly implement an application. (Note: the programmer also has the ability to specify the work-group size or let the runtime make that decision.)

The following illustration shows a 3x3 grid of individually colored 2-dimensional (2x2) work-groups. Local coordinates for each work-item are shown along the diagonal.

[Image: image002.png]

Inside the kernel, global coordinates are found by calling get_global_id(index), where index is 0, 1, or 2 depending on the dimensionality of the grid. Coordinates local to the work-group are found via get_local_id(index), and the work-group's own coordinates via get_group_id(index). The number of dimensions in use is found with get_work_dim(). Other built-in functions are described in the Work-Item Built-in Functions section of the Khronos documentation.

The following illustration from Tim Mattson’s Supercomputing 2009 tutorial, slide 15, shows a nice example of the variety of these IDs:

[Image: image001.gif]

Synchronization within a Work-Group

Synchronization within a work-group occurs via either a memory fence or barrier function.

The difference is that a barrier requires all work-items in the work-group to stop at the barrier() call, while a memory fence only requires that loads and/or stores issued by the calling work-item before the mem_fence() call be committed to memory. It is important to understand that the OpenCL compiler is allowed to reorder memory operations, in both local and global memory, to best exploit the device architecture. Thus, a programmer cannot rely on the sequential order of memory accesses in the source code, as the actual operations may occur in a different order. Memory fence operations give developers the ability to enforce data dependencies.

Careful use of mem_fence() can greatly increase performance since these operations give the developer the opportunity to keep work-items active for as long as possible. The specification even allows the developer to separately control the ordering of both load operations with read_mem_fence() and store operations with write_mem_fence().
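
As a hedged illustration (a hypothetical kernel fragment, not part of this article's example), note that a fence only orders the calling work-item's own memory operations; it never makes one work-item wait for another:

```c
// Hypothetical kernel fragment illustrating per-work-item ordering.
// mem_fence() guarantees that this work-item's loads and stores issued
// before the fence are committed before any it issues after the fence.
// It does NOT stall other work-items -- use barrier() for that.
scratch[get_local_id(0)] = partialSum;
mem_fence(CLK_LOCAL_MEM_FENCE);
// subsequent local-memory operations by this work-item cannot be
// reordered ahead of the store above
```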

In contrast, barriers literally impose an execution roadblock, as all work-items in the work-group are required to reach the barrier before any can continue. The Khronos specification also notes in the quote below that barriers can result in deadlock in certain situations. (In addition, the specification requires that a barrier perform a memory fence on both reads and writes to prevent the compiler from reordering memory operations across it.)

"All work-items in a work-group executing the kernel on a processor must execute this function before any are allowed to continue execution beyond the barrier. This function must be encountered by all work-items in a work-group executing the kernel. If barrier is inside a conditional statement, then all work-items must enter the conditional if any work-item enters the conditional statement and executes the barrier. If barrier is inside a loop, all work-items must execute the barrier for each iteration of the loop before any are allowed to continue execution beyond the barrier.

"The barrier function also queues a memory fence (reads and writes) to ensure correct ordering of memory operations to local or global memory."
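
The rule can be made concrete with a hedged sketch (a hypothetical kernel, not part of this article's example): make the work conditional, never the barrier.

```c
__kernel void barrier_rule(__global float* data, int n)
{
  int i = get_global_id(0);

  // WRONG: work-items with i >= n never reach the barrier, so the
  // work-items that do reach it may wait forever (behavior undefined).
  // if (i < n) { data[i] *= 2.0f; barrier(CLK_GLOBAL_MEM_FENCE); }

  // RIGHT: every work-item in the group executes the barrier; only
  // the work itself is conditional.
  if (i < n) data[i] *= 2.0f;
  barrier(CLK_GLOBAL_MEM_FENCE);
}
```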

Following is a simple OpenCL kernel, fill_tiles.cl, that demonstrates the use of a 2D grid of work-groups, indexing within the work-group, plus allocation and synchronization within local memory. Preprocessor defines were explicitly used to show static allocation of local memory in a multidimensional array. Local memory can also be dynamically allocated via setKernelArg() with the C++ bindings or clSetKernelArg() in C.

__kernel void fill_tiles(__global float* a,
			__global float* b,
			__global float* c)
{

  // find our coordinates in the grid
  int row = get_global_id(1);
  int col = get_global_id(0);

  // allocate local memory for the workgroup
  __local float aTile[TILE_DIM_Y][TILE_DIM_X];
  __local float bTile[TILE_DIM_Y][TILE_DIM_X];
  
  // define the coordinates of this workitem thread 
  // in the 2D tile 
  int y = get_local_id(1);
  int x = get_local_id(0);

  aTile[y][x] = a[row*N + col];
  bTile[y][x] = b[row*N + col];
  barrier(CLK_LOCAL_MEM_FENCE);

  // Note the transposed indexing of aTile!
  c[row*N + col] = aTile[x][y] * bTile[y][x];
}

Notice that the allocation of the local memory aTile and bTile arrays occurs for the entire work-group with the lines:

  // allocate local memory for the workgroup
  __local float aTile[TILE_DIM_Y][TILE_DIM_X];
  __local float bTile[TILE_DIM_Y][TILE_DIM_X];

The local arrays are then filled by each work-item in the work-group. There is one work-item per local tile 2D index as the work-group was created to be of size (TILE_DIM_X, TILE_DIM_Y). Thus, there are TILE_DIM_X * TILE_DIM_Y concurrent work-items per tile.

Without the barrier synchronization, the contents of aTile and bTile may be undefined, as the work-item that uses bTile[y][x] in a calculation might run before the work-item that fills that same location from global memory! Remember that each coordinate in the local grid has a separate work-item that effectively runs in parallel with all the other work-items. Further, the implicit memory fence in the barrier keeps the compiler from reordering memory loads and stores, so the code executes correctly. Notice that no logic is required within the work-group to account for anything happening in any other work-group.

To demonstrate the speed of local memory, the value of c is calculated using transposed indices into the aTile array. The test class ensures that TILE_DIM_Y and TILE_DIM_X are equal so that the tiles are square. The host code also ensures that the value of the preprocessor define N correctly describes the row size of the M x N matrices contained in a, b, and c.
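
As noted above, the tiles could instead be sized at run time. Here is a hedged sketch, assuming a hypothetical variant of the kernel that takes __local pointer arguments in place of the preprocessor-sized arrays (the fragment requires the OpenCL headers and a valid kernel object, so it is illustration only):

```cpp
// Hypothetical kernel signature for run-time-sized local memory:
//   __kernel void fill_tiles(__global float* a, __global float* b,
//                            __global float* c,
//                            __local float* aTile, __local float* bTile);

// Host side with the C++ bindings: pass a size with no host pointer.
size_t tileBytes = TILE_DIM_X * TILE_DIM_Y * sizeof(float);
kernel.setArg(3, cl::__local(tileBytes));   // aTile
kernel.setArg(4, cl::__local(tileBytes));   // bTile

// Equivalent C API call for one argument:
//   clSetKernelArg(kernel, 3, tileBytes, NULL);
```

The kernel would then index the tiles manually, e.g. aTile[y*TILE_DIM_X + x], since the dimensions are no longer compile-time constants.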

Host Code to Test a 2D Work-Group

Following is a listing of workGroup2D.cpp. Search for NEW in the code to see the differences from the generic code used in part 2. Almost all the changes occur in the OclTest class rather than in main(). The most important change is the addition of two new methods, getGlobalWorkItems() and getWorkItemsInWorkGroup(), which specify the NDRanges needed to create multidimensional tests using work-groups. Please consult part 2 for additional information about this source code.

#define PROFILING // Define to see the time the kernel takes
#define __NO_STD_VECTOR // Use cl::vector instead of STL version
#define __CL_ENABLE_EXCEPTIONS // needed for exceptions
#include <CL/cl.hpp>
#include <fstream>
#include <iostream>
#include <string>
#include <cmath>
using namespace std;

// Note: Search for NEW to find the changes from part2.

class OclTest {
private:
  // NEW Experiment with the size of the tile to explore
  // performance on various devices (CPU, GPU, etc)
  static const int TILE_DIM_X=16;
  static const int TILE_DIM_Y=TILE_DIM_X;
  cl::Kernel kernel;
  cl_int mItems, nItems, nTiles;
  cl_int vLen, vSize;
  float  *h_vecA, *h_vecB, *h_vecC;
  cl::Buffer d_vecA, d_vecB, d_vecC;
  
public:
  OclTest( cl::Context& context, cl::vector<cl::Device>& devices,
	   const char* kernelFile, int argc, char *argv[])
  {
    // NEW: parse commandline arguments so all test variables
    // are now contained within the class.
    if(argc < 2) {
      cerr << "Use: cpu|gpu kernel sizeM sizeN" << endl;
      exit(EXIT_FAILURE);
    }
    mItems = atoi(argv[0]);
    nItems = atoi(argv[1]);

    nTiles = mItems * nItems;
    vLen = (mItems*TILE_DIM_Y)*(nItems*TILE_DIM_X);
    vSize = vLen * sizeof(float);

    // NEW: Demonstrate using defines in the ocl build
    string buildOptions;
    { // create preprocessor defines for the kernel
      char buf[256]; 
      // N is the row length (number of columns) of the matrices
      sprintf(buf,"-D TILE_DIM_X=%d -D TILE_DIM_Y=%d -D N=%d",
	      TILE_DIM_X, TILE_DIM_Y, nItems*TILE_DIM_X);
      buildOptions += string(buf);
    }
    
    // build the program from the source in the file
    ifstream file(kernelFile);
    string prog(istreambuf_iterator<char>(file),
		(istreambuf_iterator<char>()));
    cl::Program::Sources source( 1, make_pair(prog.c_str(),
					      prog.length()+1));
    cl::Program program(context, source);
    file.close();

    try {
      cerr << "buildOptions " << buildOptions << endl;
      program.build(devices, buildOptions.c_str() );
    } catch(cl::Error& err) {
      // Get the build log
      cerr << "Build failed! " << err.what() 
	   << '(' << err.err() << ')' << endl;
      cerr << "retrieving  log ... " << endl;
      cerr 
	<< program.getBuildInfo<CL_PROGRAM_BUILD_LOG>(devices[0])
	<< endl;
      exit(-1);
    }
    //NEW get the name of the kernel from the filename
    string kernelName = string(kernelFile)
      .substr(0,string(kernelFile).find(".cl"));
    cerr << "specified kernel: " <<  kernelName << endl;
    kernel = cl::Kernel(program, kernelName.c_str());
    
    // set up the kernel inputs
    h_vecA = new float[vLen];
    h_vecB = new float[vLen];
    h_vecC = new float[vLen];
    
    d_vecA = cl::Buffer(context, CL_MEM_READ_ONLY, vSize);
    d_vecB = cl::Buffer(context, CL_MEM_READ_WRITE, vSize);
    d_vecC = cl::Buffer(context, CL_MEM_READ_WRITE, vSize);
    
    kernel.setArg(0, d_vecA);
    kernel.setArg(1, d_vecB);
    kernel.setArg(2, d_vecC);
  }
  
  inline void initData(cl::CommandQueue& queue, cl::Event& event, 
		       int seed) 
  {
    srand(seed);
    //NEW initialize data between 0 - 1
    for(int i=0; i < vLen; i++) h_vecA[i] = rand()/(float)RAND_MAX;
    for(int i=0; i < vLen; i++) h_vecB[i] = rand()/(float)RAND_MAX;
    queue.enqueueWriteBuffer(d_vecA, CL_TRUE, 0, vSize, h_vecA);
    queue.enqueueWriteBuffer(d_vecB, CL_TRUE, 0, vSize, h_vecB);
  }

  inline cl::Kernel& getKernel() { return(kernel); }
  //NEW methods to return information for queuing work-groups
  cl::NDRange getGlobalWorkItems() {
    return( cl::NDRange(nItems*TILE_DIM_X, mItems*TILE_DIM_Y) ); 
  }
  cl::NDRange getWorkItemsInWorkGroup() {
    return( cl::NDRange(TILE_DIM_X, TILE_DIM_Y) ); 
  }

  //NEW test for results from the fill_tile.cl kernel
  inline int goldenTest(cl::CommandQueue& queue, cl::Event& event) 
  {
    event.wait();
#ifdef PROFILING
    cl_ulong start=
      event.getProfilingInfo<CL_PROFILING_COMMAND_START>();
    cl_ulong end=
      event.getProfilingInfo<CL_PROFILING_COMMAND_END>();
    double time = 1.e-9 * (end-start);
    cout << "Time for kernel to execute " << time << endl;
#endif

    // bring data back to the host via a blocking read
    queue.enqueueReadBuffer(d_vecC, CL_TRUE, 0, vSize, h_vecC);

    for(int row=0; row < mItems; row++) 
      for(int col=0; col < nItems; col++) {
	float a[TILE_DIM_Y][TILE_DIM_X];
	float b[TILE_DIM_Y][TILE_DIM_X];
	float c[TILE_DIM_Y][TILE_DIM_X];
	
	// fill a and b arrays
	for(int y=0; y< TILE_DIM_Y; y++) {
	  int rindex = (row*TILE_DIM_Y+y)*nItems*TILE_DIM_X;
	  for(int x=0; x < TILE_DIM_X; x++) {
	    a[y][x] = h_vecA[rindex + (col*TILE_DIM_X + x)];
	    b[y][x] = h_vecB[rindex + (col*TILE_DIM_X + x)];
	    c[y][x] = h_vecC[rindex + (col*TILE_DIM_X + x)];
	  }
	}
	// double check
	for(int y=0; y< TILE_DIM_Y; y++)
	  for(int x=0; x< TILE_DIM_X; x++) {
	    if( c[y][x] != (a[x][y]*b[y][x]) ) {
	      cerr << "Error on c[" << y << "][" << x << "]";
	      cerr << " " << c[y][x] << " " << (a[x][y]*b[y][x]) << endl;
	      return(1);
	    }
	  }
      }
    return(0);
  }
};

void displayPlatformInfo(cl::vector< cl::Platform > platformList,
			 int deviceType)
{
  // print out some device specific information
  cout << "Platform number is: " << platformList.size() << endl;
    
  string platformVendor;
  platformList[0].getInfo((cl_platform_info)CL_PLATFORM_VENDOR, 
			  &platformVendor);

  cout << "device Type " 
       << ((deviceType==CL_DEVICE_TYPE_GPU)?"GPU":"CPU") << endl;
  cout << "Platform is by: " << platformVendor << "\n";
}

int main(int argc, char* argv[])
{
  int seed=4;

  if( argc < 3) {
    cerr 
      << "Use: {cpu|gpu} kernelFile sizeM sizeN" 
      << endl;
    exit(EXIT_FAILURE);
  }

  // handle command-line arguments
  const string platformName(argv[1]);
  int deviceType = platformName.compare("cpu")?
    CL_DEVICE_TYPE_GPU:CL_DEVICE_TYPE_CPU;

  const char* kernelFile = argv[2];

  try {
    cl::vector< cl::Platform > platformList;
    cl::Platform::get(&platformList);
    
    displayPlatformInfo(platformList, deviceType);
    
    cl_context_properties cprops[3] = 
    {CL_CONTEXT_PLATFORM, 
     (cl_context_properties)(platformList[0])(), 0};

    cl::Context context(deviceType, cprops);
    
    cl::vector<cl::Device> devices = 
      context.getInfo<CL_CONTEXT_DEVICES>();
    
#ifdef PROFILING
    cl::CommandQueue queue(context, devices[0], 
			   CL_QUEUE_PROFILING_ENABLE);
#else
    cl::CommandQueue queue(context, devices[0], 0);
#endif

    OclTest test(context, devices, kernelFile, argc-3, argv+3);

    cl::Event event;
    test.initData(queue, event, seed);

    queue.enqueueNDRangeKernel(test.getKernel(), 
		 cl::NullRange, // offset starts at 0,0
		 test.getGlobalWorkItems(), // total work-items
		 test.getWorkItemsInWorkGroup(), // work-group size
		 NULL, &event);

    if(test.goldenTest(queue, event) == 0) {
      cout << "test passed" << endl;
    } else {
      cout << "TEST FAILED!" << endl;
    }
    
  } catch (cl::Error error) {
    cerr << "caught exception: " << error.what() 
	 << '(' << error.err() << ')' << endl;
  }
  return EXIT_SUCCESS;
}

Building and Performance

To build the host program under Linux, copy and paste workGroup2D.cpp to a file and use the following commands:

export ATISTREAMSDKROOT=$HOME/AMD/ati-stream-sdk-v2.2-lnx64

export LD_LIBRARY_PATH=$ATISTREAMSDKROOT/lib/x86:\
$ATISTREAMSDKROOT/lib/x86_64:$LD_LIBRARY_PATH

g++ -I $ATISTREAMSDKROOT/include workGroup2D.cpp -L \
$ATISTREAMSDKROOT/lib/x86_64 -lOpenCL -o workGroup2D

Copy and paste the source for fill_tiles.cl into the same directory.

Following are the results produced using an ATI Radeon HD 5870 GPU and an AMD Phenom™ II X6 1055T processor:

export ATISTREAMSDKROOT=$HOME/AMD/ati-stream-sdk-v2.2-lnx64

export LD_LIBRARY_PATH=$ATISTREAMSDKROOT/lib/x86:\
$ATISTREAMSDKROOT/lib/x86_64:$LD_LIBRARY_PATH

./workGroup2D gpu fill_tiles.cl 300 400
Platform number is: 1
device Type GPU
Platform is by: Advanced Micro Devices, Inc.
buildOptions -D TILE_DIM_X=16 -D TILE_DIM_Y=16 -D N=6400
specified kernel: fill_tiles
Time for kernel to execute 0.00356587
test passed

./workGroup2D cpu fill_tiles.cl 300 400 

Platform number is: 1
device Type CPU
Platform is by: Advanced Micro Devices, Inc.
buildOptions -D TILE_DIM_X=16 -D TILE_DIM_Y=16 -D N=6400
specified kernel: fill_tiles
Time for kernel to execute 1.0187
test passed

While this kernel is not a good test of performance, it nonetheless provides a sense of the performance difference between the GPU and CPU capabilities. This example code and kernel are intended to make it easy to experiment with different tile sizes and computations. Give it a try and see how work-groups and shared memory can benefit your applications. Are there any sizes that cause a bank conflict?

Summary

OpenCL advances the concept of “portable parallelism” as it is not just a language to create kernels that can run on CPUs, GPUs, DSPs, and other devices. It also defines capabilities to coordinate concurrent parallel computations using work-groups and shared data.

Key OpenCL coordination concepts include:

  • An NDRange (“N-Dimensional Range”) can define the size of 1D, 2D, or 3D work-groups, which is convenient, efficient, and can make code much more readable.
  • Kernels are instantiated as work-items that are grouped in work-groups. The developer can specify the work-group size or leave it to the runtime to decide.
  • The compiler can reorder loads and stores. Use mem_fence() to enforce data dependencies.
  • Barriers provide an excellent lightweight synchronization mechanism, but they require that all work-items reach the barrier before any can continue execution. In some circumstances fence operations may be more efficient.
  • Work-groups are independent. Barriers and memory fences do not synchronize across work-groups. Atomic operations can be used, but only with caution due to deadlock and scalability issues.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


About the Author

manythreads

United States
Rob Farber is a senior scientist and research consultant at the Irish Center for High-End Computing in Dublin, Ireland and Chief Scientist for BlackDog Endeavors, LLC in the US.
 
Rob has been on staff at several US national laboratories including Los Alamos National Laboratory, Lawrence Berkeley National Laboratory, and at Pacific Northwest National Laboratory. He also served as an external faculty member at the Santa Fe Institute, co-founded two successful start-ups, and has been a consultant to Fortune 100 companies. His articles have appeared in Dr. Dobb's Journal and Scientific Computing, among others. He recently completed a book teaching massive parallel computing.
