Part 7: OpenCL plugins
This article will demonstrate how to create C/C++ plugins that can be dynamically loaded at runtime to add massively parallel OpenCL capabilities to an already running application
Part 6 in this series on portable parallelism with OpenCL™ taught how to mix OpenCL™ computation and OpenGL rendering within a single application. Primitive restart, an addition to the OpenGL 3.1 standard, was used in the example source code to greatly accelerate performance by computing and rendering data on the GPU, which avoided transfers across the PCIe bus and highlighted GPU performance.
This article will demonstrate how to create C/C++ plugins that can be dynamically loaded at runtime to add massively parallel OpenCL capabilities to an already running application. Dynamically loaded modules, via shared objects or DLLs (Dynamic-Link Libraries), are a popular design pattern for many applications, especially in the commercial marketplace. Developers who understand how to use OpenCL in a dynamically loaded runtime environment can create plugins that accelerate the performance of existing applications by an order of magnitude or more, simply by writing a new plugin that uses OpenCL.
As discussed in part 1 of this series, OpenCL application kernels are written in a variant of the ISO C99 C-language specification. These kernels are compiled at runtime for the destination device via the runtime OpenCL compiler. We know from previous articles in this series that OpenCL already creates and dynamically loads device-dependent code for use in an already running application. This dynamic compilation capability is a natural fit for a plugin environment. However, C/C++ plugins are still needed when some sequential operation is required to support a massively parallel OpenCL kernel, when working within a legacy plugin framework, or when the developer wishes to use multiple OpenCL devices. For these reasons, this tutorial will demonstrate how to create C/C++ plugins that are dynamically loaded into an application. These plugins can then create and load the massively parallel OpenCL kernels.
Looking ahead, the next article in this series will extend this plugin capability to incorporate OpenCL into heterogeneous workflows via a general-purpose "click together tools" framework that can stream arbitrary messages (vectors, arrays, and arbitrary, complex nested structures) within a single workstation, across a network of machines, or within a cloud computing framework. The ability to create scalable workflows is important because for many problems data handling and transformation can be as complex a problem as the computational problem used to produce the desired result.
The reader should note that dynamically compiled OpenCL plugins and kernels also open up the possibility of highly optimized kernel generation based on problem parameters. A number of papers and examples can be found on the Internet. Two examples are the presentation "Automatic OpenCL Optimization for Locality and Parallelism Management" and the paper "Automatically Generating and Tuning GPU Code for Sparse Matrix-Vector Multiplication from a High-Level Representation".
OpenCL for Libraries and Plug-ins
Most programmers are familiar with using libraries. A library is a collection of methods and functions contained in a single file. By convention, libraries are usually denoted by the library name followed by .a under Linux or .lib under Windows. Libraries are regularly used by most programmers because they allow code to be shared and changed in a modular fashion. During the compilation phase when building an executable, the compiler must note calls to any external library methods or functions. It is up to the linker, in a follow-on step, to resolve any remaining references and create an executable that can run on the computer.
Linking to external methods can occur:
Statically during the creation of the executable: Static linking means that all references are resolved when the executable is built. Further, the executable contains the explicit machine code to run all library functions used by the program.
Dynamically during load-time: Load time dynamic linkage happens when the executable is loaded into memory. Just like static linking, all symbols in the executable are resolved by linking with one or more .dll (Windows) or .so files (Linux) at program startup. This form of linkage provides fixed functionality (think of the C runtime library and other commonly used libraries). A big advantage of load-time linking is that all applications that link a library at load time will benefit from bug fixes and performance improvements just by installing a revised library file at the shared location. No applications need to be recompiled or relinked to use the improved library code. Further, shared libraries keep individual executable sizes small – a cost savings that is multiplied many times for libraries that are commonly used.
Dynamically during run-time: Run-time linking is used to load plugins, which allows generic functionality to be added without recompiling the application. Thus an application can call a generic external function, func(), whose functionality depends entirely on the plug-in that is loaded by the application. As mentioned previously, an application can literally write and compile the plugin (when a compiler is available) by generating problem-specific source code, which is then compiled and linked into the already running application. This trick allows application developers to create very highly optimized functions across a general problem domain when provided specific problem parameters. Many scientific applications utilize this capability to significantly improve performance.
For more information about the benefits of using libraries, DLLs (Dynamic-Link Libraries), and shared object files, see "Static, Shared Dynamic, and Loadable Linux Libraries" or the general Wikipedia discussion of DLLs.
Following is a simple C-language program that calls a generic external function, func(), and prints the value of x created by the generic function. An init(), func(), fini() framework (similar to a C++ object constructor, computational method, and destructor) is demonstrated in this simple source code to provide additional generality. This is a common design pattern in plugin programming: the init() method lets the programmer perform any initialization, while the fini() method gives the programmer the ability to perform any final processing and cleanup. In addition, the C library printf() function is called.
#include <stdio.h>
extern int init();
extern int func(int *);
extern int fini();
int main()
{
   int x;
   init();
   func(&x);
   printf("Example of static linking\n");
   printf("Valx=%d\n",x);
   fini();
   return 0;
}
The source code for dynCompile.cc extends this generic behavior and adds the capability to dynamically compile the .so (Shared Object) at runtime. The compiled method is then loaded and linked into the running executable. The name of the source file is specified by the user on the command line. It is not hard to see how this application could be extended to generate the source code that is then compiled to create the shared object plugin.
The code walk-through of dynCompile.cc starts with the include files needed to build it.
#include <cstdlib>
#include <sys/types.h>
#include <dlfcn.h>
#include <string>
#include <iostream>
using namespace std;
A global library handle and two pointer-to-function types are defined.
void *lib_handle;
typedef int (*initFini_t)();
typedef int (*func_t)(int*);
The main() method begins by parsing the command-line argument, which contains the filename of the source to be built. The command to build the .so is created and performed with a system() call. For a Linux environment, g++ is used. Windows users can call the Visual Studio cl.exe compiler.
int main(int argc, char **argv)
{
   if(argc < 2) {
      cerr << "Use: sourcefilename" << endl;
      return -1;
   }
   string base_filename(argv[1]);
   base_filename = base_filename.substr(0,base_filename.find_last_of("."));

   // build the shared object or dll
   string buildCommand("g++ -fPIC -shared ");
   buildCommand += string(argv[1])
      + string(" -o ") + base_filename + string(".so ");
   cerr << buildCommand << endl;
   if(system(buildCommand.c_str())) {
      cerr << "compile command failed!" << endl;
      cerr << "Build command " << buildCommand << endl;
      return -1;
   }
Assuming no errors occurred during the compilation phase, the next step is to load the library created in the previous step. If there is an error, the program exits.
   // load the library -------------------------------------------------
   string nameOfLibToLoad("./");
   nameOfLibToLoad += base_filename;
   nameOfLibToLoad += ".so";
   lib_handle = dlopen(nameOfLibToLoad.c_str(), RTLD_LAZY);
   if (!lib_handle) {
      cerr << "Cannot load library: " << dlerror() << endl;
      return -1;
   }
Finally, the symbols are loaded and the pointers to the init(), func(), and fini() methods are resolved.
   // load the symbols -------------------------------------------------
   initFini_t dynamicInit = NULL;
   func_t dynamicFunc = NULL;
   initFini_t dynamicFini = NULL;

   // reset errors
   dlerror();

   // load the function pointers
   dynamicFunc = (func_t) dlsym(lib_handle, "func");
   const char* dlsym_error = dlerror();
   if (dlsym_error) { cerr << "sym load: " << dlsym_error << endl; return -1; }
   dynamicInit = (initFini_t) dlsym(lib_handle, "init");
   dlsym_error = dlerror();
   if (dlsym_error) { cerr << "sym load: " << dlsym_error << endl; return -1; }
   dynamicFini = (initFini_t) dlsym(lib_handle, "fini");
   dlsym_error = dlerror();
   if (dlsym_error) { cerr << "sym load: " << dlsym_error << endl; return -1; }
Each function pointer is checked to see that the symbol has been resolved. If so, the function is called. As a convenience to the plugin author, any of the calls can be made optional, meaning the method does not need to be included in the compiled source file. All that is required is to modify the logic in the previous step so a failure to resolve a reference does not cause the application to exit.
   if( (*dynamicInit)() < 0) return -1;
   int x;
   (*dynamicFunc)(&x);
   cout << "Valx " << x << endl;
   if( (*dynamicFini)() < 0) return -1;
Finally, the libraries are unloaded and the application exits.
   // unload the library -----------------------------------------------
   dlclose(lib_handle);
}
Following is the source for a simple C++ plugin, cctest1.cc. This source code is very straightforward.
#include <iostream>
using namespace std;
extern "C" int init() {
cerr << "Hello from Init" << endl;
return(0);
}
extern "C" int func(int *i)
{
cerr << "Hello from Func" << endl;
*i=100;
return(1);
}
extern "C" int fini()
{
cerr << "Hello from Fini" << endl;
return(0);
}
The following script demonstrates how to build dynCompile.cc and run the cctest1.cc source code:
echo "------ building dynCompile -----"
g++ -o dynCompile dynCompile.cc -ldl
echo "------ dynamic version of cctest1.cc -----"
./dynCompile cctest1.cc
It produces the following output:
$ ./dynCompile cctest1.cc
g++ -fPIC -shared cctest1.cc -o cctest1.so
Hello from Init
Hello from Func
Valx 100
Hello from Fini
The application dynCompile.cc demonstrates how a sequential C/C++ plugin can be built and loaded into a running application. Further, it opens up the possibility for highly optimized automatic plugin generation based on problem parameters.
Using OpenCL in Plugins
The following source code, testStatic.cc, calls a shared object function myOCLfunction() that creates and runs an OpenCL kernel. Walking through the code, we see that the device context and queue are set up as described in part 5 of this article series. The user can specify that the plugin runs on either the CPU or a GPU.
#define PROFILING // Define to see the time the kernel takes
#define __NO_STD_VECTOR // Use cl::vector instead of STL version
#define __CL_ENABLE_EXCEPTIONS // needed for exceptions
#include <CL/cl.hpp>
#include <fstream>
#include <iostream>
using namespace std;
extern "C" int myOCLfunction(cl::CommandQueue&, const char*, int, char **);
int main(int argc, char* argv[])
{
if( argc < 3) {
cerr << "Use: {cpu|gpu} kernelFile" << endl;
exit(EXIT_FAILURE);
}
// handle command-line arguments
const string platformName(argv[1]);
const char* kernelFile = argv[2];
int ret= -1;
cl::vector<int> deviceType;
cl::vector< cl::CommandQueue > contextQueues;
// crudely parse the command line arguments.
if(platformName.compare("cpu")==0)
deviceType.push_back(CL_DEVICE_TYPE_CPU);
else if(platformName.compare("gpu")==0)
deviceType.push_back(CL_DEVICE_TYPE_GPU);
else { cerr << "Invalid device type!" << endl; return(1); }
// create the context and queues
try {
cl::vector< cl::Platform > platformList;
cl::Platform::get(&platformList);
// Get all the appropriate devices for the platform the
// implementation thinks we should be using.
// find the user-specified devices
cl::vector<cl::Device> devices;
for(int i=0; i < deviceType.size(); i++) {
cl::vector<cl::Device> dev;
platformList[0].getDevices(deviceType[i], &dev);
for(int j=0; j < dev.size(); j++) devices.push_back(dev[j]);
}
// set a single context
cl_context_properties cprops[] = {CL_CONTEXT_PLATFORM, NULL, 0};
cl::Context context(devices, cprops);
cout << "Using the following device(s) in one context" << endl;
for(int i=0; i < devices.size(); i++) {
cout << " " << devices[i].getInfo<CL_DEVICE_NAME>() << endl;
}
// Create the separate command queues to perform work
for(int i=0; i < devices.size(); i++) {
#ifdef PROFILING
cl::CommandQueue queue(context, devices[i],CL_QUEUE_PROFILING_ENABLE);
#else
cl::CommandQueue queue(context, devices[i],0);
#endif
contextQueues.push_back( queue );
}
// Call our OpenCL function using the first device. If desired,
// the reader can add a command-line argument to specify the
// device number.
ret = myOCLfunction(contextQueues[0], kernelFile, argc-3, argv+3);
} catch (cl::Error error) {
cerr << "caught exception: " << error.what()
<< '(' << error.err() << ')' << endl;
}
return ret;
}
As noted in the source comment, the function myOCLfunction() is called using the first device. If desired, the reader can add an additional command-line argument to specify a device number or change the code to run on multiple devices as shown in part 5 of this tutorial series. The generic function is passed the queue of the device as well as the name of the OpenCL kernel source file.
Following is a simple C++ plugin file to build and run the OpenCL kernel. Walking through the code, we see that the appropriate preprocessor defines and includes are provided. Also, the oclBuildProgram() method has been adapted from part 5 to build the OpenCL kernel for the device. Note that the device context and device information can be retrieved from the OpenCL queue via getInfo().
#define PROFILING // Define to see the time the kernel takes
#define __NO_STD_VECTOR // Use cl::vector instead of STL version
#define __CL_ENABLE_EXCEPTIONS // needed for exceptions
#include <CL/cl.hpp>
#include <fstream>
#include <iostream>
using namespace std;
#ifndef _OCL_BUILD
#define _OCL_BUILD
cl::Program oclBuildProgram( cl::CommandQueue& queue,
const char *kernelFile,
const char* myType)
{
cl::Context context = queue.getInfo<CL_QUEUE_CONTEXT>();
cl::Device device = queue.getInfo<CL_QUEUE_DEVICE>();
// Demonstrate using defines in the ocl build
string buildOptions;
{ // create preprocessor defines for the kernel
char buf[256];
sprintf(buf,"-D TYPE1=%s ", myType);
buildOptions += string(buf);
}
// build the program from the source in the file
ifstream file(kernelFile);
string prog(istreambuf_iterator<char>(file),
(istreambuf_iterator<char>()));
cl::Program::Sources source( 1, make_pair(prog.c_str(),
prog.length()+1));
cl::Program oclProg(context, source);
file.close();
try {
cerr << " buildOptions " << buildOptions << endl;
cl::vector<cl::Device> foo;
foo.push_back(device);
oclProg.build(foo, buildOptions.c_str() );
} catch(cl::Error& err) {
// Get the build log
cerr << "Build failed! " << err.what()
<< '(' << err.err() << ')' << endl;
cerr << "retrieving log ... " << endl;
cerr << oclProg.getBuildInfo<CL_PROGRAM_BUILD_LOG>(device)
<< endl;
exit(-1);
}
return(oclProg);
}
#endif
Also note that the function myOCLfunction() is declared extern "C" to prevent C++ name mangling, which allows this method to be called from a C-language program. The plugin parses additional command-line arguments passed from the running application. It then calls the oclBuildProgram() method to compile the OpenCL kernel code for the device. In this example, the type of vector the kernel operates on is defined to be an unsigned int. Through the use of preprocessor defines, this OpenCL kernel can be compiled to support any type (int, float, double) as discussed in part 4 of this series.
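The kernel source file itself is not listed in this article. A minimal func kernel consistent with the argument list used below (vector length, device vector, offset) and with the host-side doubling check would look something like the following sketch, with TYPE1 supplied through the -D build option:

```c
// TYPE1 is defined at OpenCL build time, e.g. -D TYPE1=uint
__kernel void func(int veclen, __global TYPE1* vec, int offset)
{
   int gid = get_global_id(0) + offset;
   if (gid < veclen) vec[gid] += vec[gid];
}
```

This is an illustrative reconstruction, not the article's original kernel; any kernel that doubles each element in place would satisfy the golden test.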
extern "C" int myOCLfunction( cl::CommandQueue& queue, const char* kernelFile,
int argc, char *argv[])
{
if(argc < 1) {
cerr << "myOCLfunction requires a vector size on the command-line" << endl;
return -1;
}
int vecsize = atoi(argv[0]);
unsigned int* vec = new unsigned int[vecsize];
int vecBytes = vecsize*sizeof(unsigned int);
cl::Context context = queue.getInfo<CL_QUEUE_CONTEXT>();
// Build the OCL program to use uints
cl::Program oclProg = oclBuildProgram(queue, kernelFile, "uint");
cl::Kernel funcKernel = cl::Kernel(oclProg, "func");
Astute readers will recognize that this plugin is adapted from the testSum.hpp code from part 5. A vector is filled with random numbers and passed to an OpenCL kernel. The simple double check on the host verifies that the OpenCL code correctly added each random number in the vector to itself. If the OpenCL and host results agree, a message "test passed" is printed. If incorrect results are found, the message "TEST FAILED!" will be printed.
// Fill the host memory with random data for the sums
srand(0);
for(int i=0; i < vecsize; i++) vec[i] = (rand()&0xffffff);
// Create a separate buffer for each device in the context
cl::Buffer d_vec;
d_vec = cl::Buffer(context, CL_MEM_READ_WRITE, vecBytes);
funcKernel.setArg(0,vecsize); // define the size of the vector
funcKernel.setArg(1,d_vec); // set the pointer to the device vector
funcKernel.setArg(2,0); // set the offset for the kernel
queue.enqueueWriteBuffer(d_vec, CL_TRUE,0, vecBytes, &vec[0]);
cl::Event event;
queue.enqueueNDRangeKernel(funcKernel,
cl::NullRange, // offset starts at 0
cl::NDRange( vecsize ), // parallelize by vecsize
cl::NDRange(1), // one work-item per work-group
NULL, &event);
// manually transfer the data from the device
queue.enqueueReadBuffer(d_vec, CL_TRUE, 0, vecBytes, &vec[0]);
queue.finish(); // wait for everything to finish
// perform the golden test
{
int i;
srand(0);
for(i=0; i < vecsize; i++) {
unsigned int r = (rand()&0xffffff);
r += r;
if(r != vec[i]) break;
}
if(i == vecsize) {
cout << "test passed" << endl;
} else {
cout << "TEST FAILED!" << endl;
}
}
delete [] vec;
return EXIT_SUCCESS;
}
These two example source files can be compiled with the following commands under Linux:
echo "---------------"
g++ -c -I $AMDAPPSDKROOT/include myOCLfunction.cc
g++ -I $AMDAPPSDKROOT/include -fopenmp testStatic.cc myOCLfunction.o -L $AMDAPPSDKROOT/lib/x86_64 -lOpenCL -o testStatic
This program can be adapted to dynamically load the plugin and call myOCLfunction(), as seen in the source code for testDynamic.cc below. The modified code performs the dynamic load and link. Note that the name of the shared object file is passed via a command-line argument.
#define PROFILING // Define to see the time the kernel takes
#define __NO_STD_VECTOR // Use cl::vector instead of STL version
#define __CL_ENABLE_EXCEPTIONS // needed for exceptions
#include <CL/cl.hpp>
#include <fstream>
#include <iostream>
#include <dlfcn.h>
using namespace std;
typedef int (*func_t)(cl::CommandQueue&, const char*, int, char **);
int main(int argc, char* argv[])
{
if( argc < 4) {
cerr << "Use: {cpu|gpu} kernelFile sharedObjectFile" << endl;
exit(EXIT_FAILURE);
}
// handle command-line arguments
const string platformName(argv[1]);
const char* kernelFile = argv[2];
const char* soFile = argv[3];
int ret= -1;
cl::vector<int> deviceType;
cl::vector< cl::CommandQueue > contextQueues;
// crudely parse the command line arguments.
if(platformName.compare("cpu")==0)
deviceType.push_back(CL_DEVICE_TYPE_CPU);
else if(platformName.compare("gpu")==0)
deviceType.push_back(CL_DEVICE_TYPE_GPU);
else { cerr << "Invalid device type!" << endl; return(1); }
// create the context and queues
try {
cl::vector< cl::Platform > platformList;
cl::Platform::get(&platformList);
// Get all the appropriate devices for the platform the
// implementation thinks we should be using.
// find the user-specified devices
cl::vector<cl::Device> devices;
for(int i=0; i < deviceType.size(); i++) {
cl::vector<cl::Device> dev;
platformList[0].getDevices(deviceType[i], &dev);
for(int j=0; j < dev.size(); j++) devices.push_back(dev[j]);
}
// set a single context
cl_context_properties cprops[] = {CL_CONTEXT_PLATFORM, NULL, 0};
cl::Context context(devices, cprops);
cout << "Using the following device(s) in one context" << endl;
for(int i=0; i < devices.size(); i++) {
cout << " " << devices[i].getInfo<CL_DEVICE_NAME>() << endl;
}
// Create the separate command queues to perform work
for(int i=0; i < devices.size(); i++) {
#ifdef PROFILING
cl::CommandQueue queue(context, devices[i],CL_QUEUE_PROFILING_ENABLE);
#else
cl::CommandQueue queue(context, devices[i],0);
#endif
contextQueues.push_back( queue );
}
// ----- Perform the dynamic load ---------------
string nameOfLibToLoad = string("./") + soFile;
void* lib_handle = dlopen(nameOfLibToLoad.c_str(), RTLD_LAZY);
if (!lib_handle) {
cerr << "Cannot load library: " << dlerror() << endl;
return -1;
}
// load the function pointers
func_t dynamicFunc = (func_t) dlsym(lib_handle, "myOCLfunction" );
const char* dlsym_error = dlerror();
if (dlsym_error) { cerr << "sym load: " << dlsym_error << endl; return -1;}
// Call our OpenCL function using the first device. If desired,
// the reader can add a command-line argument to specify the
// device number.
ret = (*dynamicFunc)(contextQueues[0], kernelFile, argc-4, argv+4);
} catch (cl::Error error) {
cerr << "caught exception: " << error.what()
<< '(' << error.err() << ')' << endl;
}
return ret;
}
The testDynamic executable and myOCLfunction.so file are created under Linux with the following commands:
echo "---------------"
g++ -fPIC -shared -I $AMDAPPSDKROOT/include myOCLfunction.cc -o myOCLfunction.so
g++ -I $AMDAPPSDKROOT/include testDynamic.cc -L $AMDAPPSDKROOT/lib/x86_64 -lOpenCL -o testDynamic
Running testStatic and testDynamic shows that the plugin works correctly on both the CPU and a GPU in both applications. The "test passed" messages demonstrate that the device and host results agree.
$ sh RUN
------- static test CPU ------
Using the following device(s) in one context
AMD Phenom(tm) II X6 1055T Processor
buildOptions -D TYPE1=uint
test passed
------- static test GPU ------
Using the following device(s) in one context
Cypress
buildOptions -D TYPE1=uint
test passed
------- dynamic test CPU ------
Using the following device(s) in one context
AMD Phenom(tm) II X6 1055T Processor
buildOptions -D TYPE1=uint
test passed
------- dynamic test GPU ------
Using the following device(s) in one context
AMD Phenom(tm) II X6 1055T Processor
buildOptions -D TYPE1=uint
test passed
Including and Protecting the OpenCL Source
In many cases, the developer might not want to include the OpenCL source code for the plugin in a separate file. Using multiple files complicates the installation and can introduce hard-to-diagnose installation and upgrade errors. Using a single file simplifies installation and can make OpenCL plugin deployments more robust. In particular, commercial developers will not want to release their OpenCL source code in a form that anyone can easily read.
The simplest way to include the OpenCL source in the plugin file is to use the Linux xxd hexdump tool to create a string that contains the OpenCL source code. The OpenCL method clCreateProgramWithSource() used in part 1 of this article series can then be used to compile the source code from a string.
The following command demonstrates how to create a C-language string from a file, foo.txt, that contains the text "This is a test file. It spans multiple lines." Running xxd -i foo.txt produces the following output:
unsigned char foo_txt[] = {
0x54, 0x68, 0x69, 0x73, 0x20, 0x69, 0x73, 0x20, 0x61, 0x20, 0x74, 0x65,
0x73, 0x74, 0x20, 0x66, 0x69, 0x6c, 0x65, 0x2e, 0x0a, 0x49, 0x74, 0x20,
0x73, 0x70, 0x61, 0x6e, 0x73, 0x20, 0x6d, 0x75, 0x6c, 0x74, 0x69, 0x70,
0x6c, 0x65, 0x20, 0x6c, 0x69, 0x6e, 0x65, 0x73, 0x2e, 0x0a
};
unsigned int foo_txt_len = 46;
The string foo_txt can be compiled into the plugin and passed to clCreateProgramWithSource(). A disadvantage of using xxd is that the OpenCL source string is no longer human readable. However, this does not protect the OpenCL kernel source, as it can easily be recovered from the .so or executable file by simply running the UNIX strings command.
Commercial developers can make it more difficult to get the source code for their OpenCL kernels by encrypting and decrypting the source text with a package like Keyczar on Google Code. The encrypted source string can still be created using xxd. Still, the OpenCL source code will reside in a buffer after decryption and during the call to clCreateProgramWithSource(). While encryption does make it harder, a motivated hacker can still find and print the source code from that buffer.
As mentioned in part 1, offline compilation can create the OpenCL device binary for specific devices. Just as the application binary provides protection against reverse engineering, OpenCL binaries obfuscate the OpenCL kernel code. Again, this binary can be included in the source code using the output from xxd. AMD provides a knowledge base article explaining how to perform offline kernel compilation. A technical limitation of this method is that the plugin can only support devices for which pre-compiled OpenCL binaries are included. Depending on the business model, this restriction could be an advantage or a drawback.
Summary
With the ability to create OpenCL plugins, application programmers can write and support generic applications that deliver accelerated performance when a GPU is present and CPU-based performance when a GPU is not available. These plugin architectures are well understood and a convenient way to leverage existing applications and code bases. They also help preserve existing software investments.
The ability to dynamically compile OpenCL source code and link it into a running application opens a host of opportunities for optimizing code generators. This capability is part of the foundation upon which the portable parallelism of OpenCL resides. As mentioned in this article and utilized in scientific computation, dynamically generating optimized code for specific parameter sets in a general problem domain can achieve very high performance – far beyond what a single "one size fits all" generic code can deliver.
The next article in this series will extend this capability to exploit hybrid CPU/GPU computation in heterogeneous workflows so developers can capitalize on GPU acceleration and CPU capabilities in their production workflows.