Part 1: OpenCL™ – Portable Parallelism
This first article in a series on portable multithreaded programming using OpenCL™ briefly discusses the thought behind the standard and demonstrates how to download and use the ATI Stream software development kit (SDK) to build and run an OpenCL program.
OpenCL is a cross-vendor standard that holds tremendous potential to exploit the massive parallelism of modern processors, embedded devices, and graphics processors (GPUs). Because of its broad industry support, OpenCL has the potential to become the de facto standard for portable multi-core and many-threaded applications. The allure of writing a single application that can run on platforms from embedded systems and handhelds to workstations and supercomputers is undeniable.
A key point to note is that in OpenCL the compiler is built into the runtime, which provides exceptional flexibility and portability: OpenCL applications can select and use different OpenCL devices in the system at runtime. It is even possible to create OpenCL application executables today that can use, without modification, devices that have not yet been invented!
The challenge with OpenCL principally resides in learning how to design parallel programs and write them so they are robust and perform well on a variety of heterogeneous hardware platforms. Knowledgeable developers are the key to capitalizing on multi-core and many-threaded OpenCL application and hardware investments, especially as the standard is evolving rapidly. These are the individuals who stay current with the standard and who can account for variability in vendor SDKs and implementations.
This tutorial series focuses on bringing knowledgeable C and C++ programmers quickly up to speed so they can work with OpenCL to write efficient portable parallel programs. For practical reasons, this series will utilize examples that run on multi-core processors and GPUs as well as a heterogeneous mix of the two. As the series continues (and depending on reader interest), the portable nature of OpenCL will be highlighted through working examples that run on a variety of platforms (embedded, laptop, desktop, and clusters) utilizing operating systems from the major vendors.
These tutorials are intended to teach people how to think and program in OpenCL. Individual tutorials will focus on conveying essential concepts, syntax and development tool knowledge so the reader can quickly see how to utilize a particular technical capability or release feature. Complete examples will be provided to minimize frustration and make it easy to copy, paste, build and immediately start working with the code. Explicit build instructions will be provided for at least one platform. When possible, performance numbers from different platforms will be included so developers can get a sense of portability and different platform performance capabilities.
Be aware that OpenCL is still very new. In their OpenCL overview, the Khronos Group, which manages the OpenCL standard, notes that Apple Computer first proposed that there should be an OpenCL standard in June 2008. The first v1.0 vendor implementations started shipping in the second half of 2009. The v1.1 standard was recently released in June 2010, along with conformant vendor implementations.
The thought behind OpenCL
The big idea behind OpenCL is a portable execution model that allows a kernel to execute at each point in a problem domain. A kernel is a function declared in a program and executed on an OpenCL device. It is identified by the __kernel qualifier applied to any function defined in a program. Kernels can operate in either a data-parallel or task-parallel fashion.
Essentially, the OpenCL developer can imagine working on a vector or other data structure using GPU SIMD (Single Instruction Multiple Data) or processor SSE (Streaming SIMD Extensions) instructions in a data-parallel manner, or in a task-parallel fashion that allows a number of different tasks to reside on a processor core or MIMD (Multiple Instruction Multiple Data) architecture. OpenCL compiler hints can be provided with function qualifiers such as vec_type_hint() or work_group_size_hint().
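For example, a kernel can pass these hints to the compiler as attribute qualifiers. The kernel below is purely illustrative (it is not from the article's example); it suggests a preferred vector type and work-group size, which an implementation is free to ignore:
__kernel __attribute__((vec_type_hint(float4)))
         __attribute__((work_group_size_hint(64, 1, 1)))
void scale(__global float4* data, const float factor)
{
    int i = get_global_id(0);     // one work-item per float4 element
    data[i] = data[i] * factor;
}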
- Choose the task-parallel model when independent threads can process separate functions. Task-level concurrency calls for independent work encapsulated in functions to be mapped to individual threads, which execute asynchronously.
- Choose the data-parallel threading model for compute-intensive loops where the same, independent operations are performed repeatedly. Data parallelism implies that the same independent operation is applied repeatedly to different data.
An OpenCL application runs on a host which submits work to the compute devices via queues. Implicit in this model is the idea that some form of data transfer occurs between the separate memory spaces of the host and one or more OpenCL devices. Applications queue kernel execution instances in-order, one queue per device. However, both in-order and out-of-order execution are possible on the device.
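As a rough sketch (assuming a context and device have already been obtained, and omitting error handling), the host creates one command queue per device; out-of-order execution on the device can be requested through a queue property, which an implementation may or may not support:
cl_int err;
// Default: an in-order queue for the device.
cl_command_queue queue = clCreateCommandQueue(context, device_id, 0, &err);
// Optional: ask for out-of-order execution on devices that support it.
cl_command_queue oooQueue = clCreateCommandQueue(context, device_id,
    CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE, &err);
// Data-parallel kernels are submitted with clEnqueueNDRangeKernel();
// a single task-parallel kernel instance can be submitted with clEnqueueTask().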
Following are some core OpenCL terms:
- Work item: the basic unit of work on an OpenCL device.
- Kernel: the code for a work item, which is basically a C function.
- Program: A collection of kernels and other functions.
- Context: The environment within which work items execute, which includes devices and their memories and command queues.
While the OpenCL application itself can be written in either C or C++, the source for the application kernels is written in a variant of the ISO C99 C-language specification. These kernels are compiled via the built-in runtime compiler or, if desired, the resulting binaries can be saved and loaded later. The OpenCL C language for kernels is:
- A subset of the ISO C99 standard that eliminates some features such as headers, function pointers, recursion, variable-length arrays, and bit fields.
- A superset of ISO C99 with additions for:
- Work-items and work-groups.
- Vector types.
- Synchronization.
- Address space qualifiers.
- It also includes a large set of built-in functions to facilitate OpenCL capabilities such as:
- Image manipulation.
- Work-item manipulation.
- Specialized math routines and other operations (a short kernel sketch illustrating several of these additions follows this list).
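The following kernel is a small illustrative sketch, not part of the article's example, that touches several of these additions: the __global and __local address space qualifiers, the float4 vector type, work-item built-ins, and a barrier for work-group synchronization:
__kernel void sum4(__global const float4* in,
                   __global float* out,
                   __local float* scratch)
{
    int gid = get_global_id(0);          // work-item built-ins
    int lid = get_local_id(0);

    float4 v = in[gid];                  // vector type
    scratch[lid] = v.x + v.y + v.z + v.w;

    barrier(CLK_LOCAL_MEM_FENCE);        // synchronize the work-group

    if (lid == 0) {                      // one work-item reduces the group
        float acc = 0.0f;
        for (int i = 0; i < (int)get_local_size(0); i++)
            acc += scratch[i];
        out[get_group_id(0)] = acc;
    }
}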
Other sources of information
- ATI Stream SDK v2: downloads, documentation, OpenCL Zone and forum.
- The Khronos Group: developer resources, manpages, specification, and forum.
- Apple Computer
- IBM
- NVIDIA
- Intel will make its OpenCL SDK available by the end of the year.
Installing the ATI Stream SDK v2
Complete download and installation instructions for the ATI Stream SDK v2 can be found here. Following is a brief synopsis:
- Download the ATI Stream SDK appropriate for your operating system (Linux or Windows). In this article, we use the current version of the ATI Stream SDK v2.2 for 64-bit Linux. Linux users must also install the ICD (Installable Client Driver) registration information, which allows implementations from multiple vendors to coexist and work properly.
- Extract the directory and associated files. Under Linux, use tar; Windows users will run the installation executable. Under Linux:
- mkdir AMD
- cd AMD
- tar -xzf ati-stream-sdk-v2.2-lnx64.tgz
- Download and install the ICD information. For Linux this is the file, icd-registration.tgz.
- As root, change to the root directory and extract:
- (cd /; tar -xzf icd-registration.tgz)
- Set the appropriate environment variables and build the samples. Under Linux,
- export ATISTREAMSDKROOT=$HOME/AMD/ati-stream-sdk-v2.2-lnx64
- export ATISTREAMSDKSAMPLESROOT=$HOME/AMD/ati-stream-sdk-v2.2-lnx64
- export LD_LIBRARY_PATH=$ATISTREAMSDKROOT/lib/x86:$ATISTREAMSDKROOT/lib/x86_64:$LD_LIBRARY_PATH
- cd ati-stream-sdk-v2.2-lnx64/
- make
Once the samples have been built, it is possible to see what devices are available by running the CLInfo application in the samples directory: ./samples/opencl/bin/x86_64/CLInfo.
An output similar to the following will appear, indicating that this system has both CPU and GPU devices available:
Number of platforms: 1
Platform Profile: FULL_PROFILE
Platform Version: OpenCL 1.1 ATI-Stream-v2.2 (302)
Platform Name: ATI Stream
Platform Vendor: Advanced Micro Devices, Inc.
Platform Extensions: cl_khr_icd cl_amd_event_callback
Platform Name: ATI Stream
Number of devices: 2
Device Type: CL_DEVICE_TYPE_CPU
Device ID: 4098
Max compute units: 6
Max work items dimensions: 3
Max work items[0]: 1024
Max work items[1]: 1024
Max work items[2]: 1024
Max work group size: 1024
Preferred vector width char: 16
Preferred vector width short: 8
Preferred vector width int: 4
Preferred vector width long: 2
Preferred vector width float: 4
Preferred vector width double: 0
Max clock frequency: 800Mhz
Address bits: 64
Max memory allocation: 1073741824
Image support: No
Max size of kernel argument: 4096
Alignment (bits) of base address: 1024
Minimum alignment (bytes) for any datatype: 128
Single precision floating point capability
Denorms: Yes
Quiet NaNs: Yes
Round to nearest even: Yes
Round to zero: Yes
Round to +ve and infinity: Yes
IEEE754-2008 fused multiply-add: No
Cache type: Read/Write
Cache line size: 64
Cache size: 65536
Global memory size: 3221225472
Constant buffer size: 65536
Max number of constant args: 8
Local memory type: Global
Local memory size: 32768
Profiling timer resolution: 1
Device endianess: Little
Available: Yes
Compiler available: Yes
Execution capabilities:
Execute OpenCL kernels: Yes
Execute native function: Yes
Queue properties:
Out-of-Order: No
Profiling : Yes
Platform ID: 0x7f47e30e2b20
Name: AMD Phenom(tm) II X6 1055T Processor
Vendor: AuthenticAMD
Driver version: 2.0
Profile: FULL_PROFILE
Version: OpenCL 1.1 ATI-Stream-v2.2 (302)
Extensions: cl_amd_fp64
cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics
cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics
cl_khr_int64_base_atomics cl_khr_int64_extended_atomics
cl_khr_byte_addressable_store cl_khr_gl_sharing cl_ext_device_fission
cl_amd_device_attribute_query cl_amd_printf
Device Type: CL_DEVICE_TYPE_GPU
Device ID: 4098
Max compute units: 20
Max work items dimensions: 3
Max work items[0]: 256
Max work items[1]: 256
Max work items[2]: 256
Max work group size: 256
Preferred vector width char: 16
Preferred vector width short: 8
Preferred vector width int: 4
Preferred vector width long: 2
Preferred vector width float: 4
Preferred vector width double: 0
Max clock frequency: 850Mhz
Address bits: 32
Max memory allocation: 134217728
Image support: Yes
Max number of images read arguments: 128
Max number of images write arguments: 8
Max image 2D width: 8192
Max image 2D height: 8192
Max image 3D width: 2048
Max image 3D height: 2048
Max image 3D depth: 2048
Max samplers within kernel: 16
Max size of kernel argument: 1024
Alignment (bits) of base address: 32768
Minimum alignment (bytes) for any datatype: 128
Single precision floating point capability
Denorms: No
Quiet NaNs: Yes
Round to nearest even: Yes
Round to zero: Yes
Round to +ve and infinity: Yes
IEEE754-2008 fused multiply-add: Yes
Cache type: None
Cache line size: 0
Cache size: 0
Global memory size: 536870912
Constant buffer size: 65536
Max number of constant args: 8
Local memory type: Scratchpad
Local memory size: 32768
Profiling timer resolution: 1
Device endianess: Little
Available: Yes
Compiler available: Yes
Execution capabilities:
Execute OpenCL kernels: Yes
Execute native function: No
Queue properties:
Out-of-Order: No
Profiling : Yes
Platform ID: 0x7f47e30e2b20
Name: Cypress
Vendor: Advanced Micro Devices, Inc.
Driver version: CAL 1.4.736
Profile: FULL_PROFILE
Version: OpenCL 1.1 ATI-Stream-v2.2 (302)
Extensions: cl_amd_fp64
cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics
cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics
cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_gl_sharing
cl_amd_device_attribute_query cl_amd_printf cl_amd_media_ops
Passed!
A First Application and OpenCL Kernel
The following application, first.cpp, is written in C++. However, it is written in a very straightforward procedural fashion to illustrate the steps required to create an OpenCL application that can work on either a GPU or a CPU. The application simply creates a vector of random values, transfers it to an OpenCL kernel that squares each value, and then retrieves the values on the host and double-checks the results.
To use OpenCL, the developer must:
- Define the platform.
- Execute code on the platform.
- Move data around in memory.
- Write (and build) programs. In this application, clCreateProgramWithSource() is used to compile the OpenCL kernel.
The following is the actual OpenCL kernel that is contained in the application. Note that:
- The kernel source is contained in a constant character string. Commercial developers should be aware that constant strings will very likely be visible to others in the executable image via a command such as strings. AMD provides a knowledge base article explaining how to address this with binary kernel generation (a brief sketch of the idea appears after the serial example below).
- This is a parallel kernel. Each thread gets its ID with the call to get_global_id(), which is used as the index into the vector.
const char *KernelSource = "\n" \
"__kernel void square( \n" \
" __global float* input, \n" \
" __global float* output, \n" \
" const unsigned int count) \n" \
"{ \n" \
" int i = get_global_id(0); \n" \
" if(i < count) \n" \
" output[i] = input[i] * input[i]; \n" \
"} \n" \
"\n";
In contrast, a serial function would look similar to the following:
void SerialSource(int n, float* input, float* output)
{
  for (int i=0; i<n; i++)
    output[i] = input[i] * input[i];
}
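As noted above, developers who prefer not to ship kernel source as a visible string can compile it once and save the device binary. The fragment below is only a sketch of the idea for a single-device program (illustrative variable names, no error handling); the AMD knowledge base article covers the details:
// Retrieve the binary of an already-built program (one device assumed).
size_t binSize;
clGetProgramInfo(program, CL_PROGRAM_BINARY_SIZES, sizeof(binSize), &binSize, NULL);
unsigned char* bin = new unsigned char[binSize];
unsigned char* bins[] = { bin };
clGetProgramInfo(program, CL_PROGRAM_BINARIES, sizeof(bins), bins, NULL);
// ... write 'bin' to a file here; in a later run, load it with:
// program = clCreateProgramWithBinary(context, 1, &device_id, &binSize,
//                                     (const unsigned char**)bins, NULL, &err);
delete [] bin;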
Following is the complete source for first.cpp:
#include <iostream>
using namespace std;
#define __NO_STD_VECTOR // Use cl::vector and cl::string and
#define __NO_STD_STRING // not STL versions, more on this later
#include <CL/cl.h>
#define DATA_SIZE (1024*1240)
const char *KernelSource = "\n" \
"__kernel void square( \n" \
" __global float* input, \n" \
" __global float* output, \n" \
" const unsigned int count) \n" \
"{ \n" \
" int i = get_global_id(0); \n" \
" if(i < count) \n" \
" output[i] = input[i] * input[i]; \n" \
"} \n" \
"\n";
int main(int argc, char* argv[])
{
int devType=CL_DEVICE_TYPE_GPU;
if(argc > 1) {
devType = CL_DEVICE_TYPE_CPU;
cout << "Using: CL_DEVICE_TYPE_CPU" << endl;
} else {
cout << "Using: CL_DEVICE_TYPE_GPU" << endl;
}
cl_int err; // error code returned from api calls
size_t global; // global domain size for our calculation
size_t local; // local domain size for our calculation
cl_platform_id cpPlatform; // OpenCL platform
cl_device_id device_id; // compute device id
cl_context context; // compute context
cl_command_queue commands; // compute command queue
cl_program program; // compute program
cl_kernel kernel; // compute kernel
// Connect to a compute device
err = clGetPlatformIDs(1, &cpPlatform, NULL);
if (err != CL_SUCCESS) {
cerr << "Error: Failed to find a platform!" << endl;
return EXIT_FAILURE;
}
// Get a device of the appropriate type
err = clGetDeviceIDs(cpPlatform, devType, 1, &device_id, NULL);
if (err != CL_SUCCESS) {
cerr << "Error: Failed to create a device group!" << endl;
return EXIT_FAILURE;
}
// Create a compute context
context = clCreateContext(0, 1, &device_id, NULL, NULL, &err);
if (!context) {
cerr << "Error: Failed to create a compute context!" << endl;
return EXIT_FAILURE;
}
// Create a command queue
commands = clCreateCommandQueue(context, device_id, 0, &err);
if (!commands) {
cerr << "Error: Failed to create a command commands!" << endl;
return EXIT_FAILURE;
}
// Create the compute program from the source buffer
program = clCreateProgramWithSource(context, 1,
(const char **) &KernelSource,
NULL, &err);
if (!program) {
cerr << "Error: Failed to create compute program!" << endl;
return EXIT_FAILURE;
}
// Build the program executable
err = clBuildProgram(program, 0, NULL, NULL, NULL, NULL);
if (err != CL_SUCCESS) {
size_t len;
char buffer[2048];
cerr << "Error: Failed to build program executable!" << endl;
clGetProgramBuildInfo(program, device_id, CL_PROGRAM_BUILD_LOG,
sizeof(buffer), buffer, &len);
cerr << buffer << endl;
exit(1);
}
// Create the compute kernel in the program
kernel = clCreateKernel(program, "square", &err);
if (!kernel || err != CL_SUCCESS) {
cerr << "Error: Failed to create compute kernel!" << endl;
exit(1);
}
// create data for the run
float* data = new float[DATA_SIZE]; // original data set given to device
float* results = new float[DATA_SIZE]; // results returned from device
unsigned int correct; // number of correct results returned
cl_mem input; // device memory used for the input array
cl_mem output; // device memory used for the output array
// Fill the vector with random float values
unsigned int count = DATA_SIZE;
for(int i = 0; i < count; i++)
data[i] = rand() / (float)RAND_MAX;
// Create the device memory vectors
//
input = clCreateBuffer(context, CL_MEM_READ_ONLY,
sizeof(float) * count, NULL, NULL);
output = clCreateBuffer(context, CL_MEM_WRITE_ONLY,
sizeof(float) * count, NULL, NULL);
if (!input || !output) {
cerr << "Error: Failed to allocate device memory!" << endl;
exit(1);
}
// Transfer the input vector into device memory
err = clEnqueueWriteBuffer(commands, input,
CL_TRUE, 0, sizeof(float) * count,
data, 0, NULL, NULL);
if (err != CL_SUCCESS) {
cerr << "Error: Failed to write to source array!" << endl;
exit(1);
}
// Set the arguments to the compute kernel
err = 0;
err = clSetKernelArg(kernel, 0, sizeof(cl_mem), &input);
err |= clSetKernelArg(kernel, 1, sizeof(cl_mem), &output);
err |= clSetKernelArg(kernel, 2, sizeof(unsigned int), &count);
if (err != CL_SUCCESS) {
cerr << "Error: Failed to set kernel arguments! " << err << endl;
exit(1);
}
// Get the maximum work group size for executing the kernel on the device
err = clGetKernelWorkGroupInfo(kernel, device_id,
CL_KERNEL_WORK_GROUP_SIZE,
sizeof(local), &local, NULL);
if (err != CL_SUCCESS) {
cerr << "Error: Failed to retrieve kernel work group info! "
<< err << endl;
exit(1);
}
// Execute the kernel over the vector using the
// maximum number of work group items for this device
global = count;
err = clEnqueueNDRangeKernel(commands, kernel,
1, NULL, &global, &local,
0, NULL, NULL);
if (err) {
cerr << "Error: Failed to execute kernel!" << endl;
return EXIT_FAILURE;
}
// Wait for all commands to complete
clFinish(commands);
// Read back the results from the device to verify the output
//
err = clEnqueueReadBuffer( commands, output,
CL_TRUE, 0, sizeof(float) * count,
results, 0, NULL, NULL );
if (err != CL_SUCCESS) {
cerr << "Error: Failed to read output array! " << err << endl;
exit(1);
}
// Validate our results
//
correct = 0;
for(int i = 0; i < count; i++) {
if(results[i] == data[i] * data[i])
correct++;
}
// Print a brief summary detailing the results
cout << "Computed " << correct << "/" << count << " correct values" << endl;
cout << "Computed " << 100.f * (float)correct/(float)count
<< "% correct values" << endl;
// Shutdown and cleanup
delete [] data; delete [] results;
clReleaseMemObject(input);
clReleaseMemObject(output);
clReleaseProgram(program);
clReleaseKernel(kernel);
clReleaseCommandQueue(commands);
clReleaseContext(context);
return 0;
}
The details of how first.cpp works are beyond the scope of this initial introductory article. At first glance the OpenCL code appears quite verbose for a program that does nothing more than square the values in a vector. However, this is a CPU and GPU application that can run on many different device types, as will be shown below. It is vendor agnostic and has the potential to run unchanged on future devices as well. This portability, along with the ability to exploit massively parallel hardware architectures, is a key strength of OpenCL.
Building the executable is straightforward under Linux:
- Copy the source and place it in a file first.cpp.
- Set the environment variable for the OpenCL home:
OCL_HOME=../ati-stream-sdk-v2.2-lnx64
- g++ -I $OCL_HOME/include -L $OCL_HOME/lib/x86_64 first.cpp -l OpenCL
The following shows this code running on both CPU and GPU devices:
bda:~/AMD/test$ time ./a.out
Using: CL_DEVICE_TYPE_GPU
Computed 1269760/1269760 correct values
Computed 100% correct values
real 0m0.354s
user 0m0.260s
sys 0m0.150s
bda:~/AMD/test$ time ./a.out CPU
Using: CL_DEVICE_TYPE_CPU
Computed 1269760/1269760 correct values
Computed 100% correct values
real 0m0.261s
user 0m0.200s
sys 0m0.040s
Notice that the GPU run took more time than the CPU run. The reason is that it took time to transfer the data from the host to the GPU device, while the CPU was able to start the calculation immediately. Experience has shown that high-performance GPU programming requires three steps:
- Get and keep the data on the GPU to eliminate PCI bus data transfer bottlenecks (see the short sketch after this list).
- Give the GPU enough work to do. Starting a kernel does require a small amount of overhead. However, modern GPUs are so fast that they can perform a significant amount of work while the kernel is being started. For this reason, kernel launches are queued on the device.
- Optimize the calculation to minimize the bottleneck in accessing the GPU memory. Again, GPU hardware is so fast that it is important to reuse data within memory local to the computational hardware (e.g. registers, etcetera) to prevent the computation from being bottlenecked by the GPU memory system.
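To illustrate the first point, the fragment below (hypothetical kernel and buffer names, kernel arguments already set, no error handling) writes the data to the device once, runs several kernels on the same device buffer, and reads the result back only at the end, so the data never crosses the PCIe bus between steps:
// Write once, compute many times, read once.
clEnqueueWriteBuffer(queue, devBuf, CL_FALSE, 0, bytes, hostIn, 0, NULL, NULL);
for (int pass = 0; pass < nPasses; pass++) {
    clEnqueueNDRangeKernel(queue, stepA, 1, NULL, &global, NULL, 0, NULL, NULL);
    clEnqueueNDRangeKernel(queue, stepB, 1, NULL, &global, NULL, 0, NULL, NULL);
}
clEnqueueReadBuffer(queue, devBuf, CL_TRUE, 0, bytes, hostOut, 0, NULL, NULL);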
Output from the OpenCL sample programs PCIeBandwidth and GlobalMemoryBandwidth shows the relative speed difference between the PCIe bus and the global memory bandwidth of an ATI Radeon™ HD 5800 Series graphics processor under Linux Ubuntu 10.04:
- ./samples/opencl/bin/x86/PCIeBandwidth
Host to device : 2.44032 GB/s
Device to host : 1.26776 GB/s
- ./samples/opencl/bin/x86/GlobalMemoryBandwidth
Global Memory Read AccessType : single VectorElements : 4 Bandwidth : 169.918 GB/s
Global Memory Read AccessType : linear VectorElements : 4 Bandwidth : 154.875 GB/s
Global Memory Read AccessType : linear(uncached) VectorElements : 4 Bandwidth : 118.425 GB/s
Global Memory Write AccessType : linear VectorElements : 4 Bandwidth : 177.312 GB/s
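On the system measured above, the host-to-device link moves roughly 2.4 GB/s while the GPU reads its own global memory at roughly 155 to 170 GB/s, a difference of about 60 to 70 times. This is why the first step above, keeping data on the device, matters so much: a kernel that must fetch its data across the PCIe bus for each use will be limited by the slower link rather than by the GPU.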
C++ programmers will note that the example code takes a very procedural approach. This was done on purpose to highlight the steps required in creating an OpenCL application for both C and C++ developers. The ATI Stream SDK samples include a simple Template.cpp example that uses a much more C++-like approach. This example is located in:
ati-stream-sdk-v2.2-lnx64/samples/opencl/cl/app/Template
Typing make in this directory will build the sample, which can be started by typing
build/debug/x86_64/Template. The Template_Kernels.cl file can be adapted to perform your own calculation. The documentation for this code is in the docs directory. While a good start, these OpenCL examples still have a way to go before they have the simplicity and power of a generic C++ data-parallel project. Existing projects show that it is possible to create succinct, easy-to-read data-parallel applications using STL-like vector classes. Try experimenting with this example code to create templates that are easy to use yet general; a rough sketch of one possible starting point follows.
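The wrapper below is purely hypothetical: it pairs a host std::vector with a device buffer so that transfers and cleanup are handled in one place. It assumes a valid context and command queue, uses blocking reads, omits error handling, and is meant only to suggest the shape of such a class:
#include <vector>
#include <CL/cl.h>

// Hypothetical helper: owns a device buffer that mirrors a host vector.
template <typename T>
class DeviceVector {
public:
    DeviceVector(cl_context ctx, cl_command_queue q, const std::vector<T>& host)
        : queue(q), count(host.size())
    {
        // Copy the host data to the device at construction time.
        buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR,
                             count * sizeof(T), (void*)&host[0], NULL);
    }
    ~DeviceVector() { clReleaseMemObject(buf); }

    // Blocking copy from the device back into a host vector.
    void read(std::vector<T>& host) const {
        clEnqueueReadBuffer(queue, buf, CL_TRUE, 0, count * sizeof(T),
                            &host[0], 0, NULL, NULL);
    }

    cl_mem handle() const { return buf; }   // pass to clSetKernelArg()

private:
    cl_command_queue queue;
    cl_mem buf;
    size_t count;
};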
Summary
Architecture and balance ratios are key concepts for understanding OpenCL device performance. In particular, the bandwidth of the link between the host and the OpenCL device, along with the memory bandwidth of the device, can be key performance metrics. In some cases, either of these hardware characteristics can make it too expensive to move a calculation from the host onto an OpenCL device. However, many problems require enough computation per data item transferred to greatly speed OpenCL applications. For more information on how machine characteristics, or balance ratios, define application performance, I suggest my introductory article in Scientific Computing, HPC Balance and Common Sense, as well as the extensive Atkins Report.
Finally, this tutorial has provided a quick and basic introduction to OpenCL along with an example code that can run on both CPU and GPU device types. Try it and see how it works.
OpenCL and the OpenCL logo are trademarks of Apple Inc. used by permission by Khronos.