Part 7: OpenCL plugins
This article will demonstrate how to create C/C++ plugins that can be dynamically loaded at runtime to add massively parallel OpenCL capabilities to an already running application
Part 6 in this series on portable parallelism with OpenCL™ taught how to mix OpenCL™ computation and OpenGL rendering within a single application. Primitive restart, an addition to the OpenGL 3.1 standard, was used in the example source code to greatly accelerate performance by computing and rendering data on the GPU, which avoided transfers across the PCIe bus and highlighted GPU performance.
This article will demonstrate how to create C/C++ plugins that can be dynamically loaded at runtime to add massively parallel OpenCL capabilities to an already running application. Dynamically loaded modules, via shared objects or DLLs (Dynamic-Link Libraries), are a popular design pattern for many applications, especially in the commercial marketplace. Developers who understand how to use OpenCL in a dynamically loaded runtime environment can create plugins that accelerate the performance of existing applications by an order of magnitude or more, simply by writing a new plugin that uses OpenCL.
As discussed in part 1 of this series, OpenCL application kernels are written in a variant of the ISO C99 C-language specification. These kernels are compiled at runtime for the destination device via the runtime OpenCL compiler. We know from previous articles in this series that OpenCL already creates and dynamically loads device-dependent code for use in an already running application. This dynamic compilation capability is a natural fit for a plugin environment. However, C/C++ plugins are still needed when some sequential operation is required to support a massively parallel OpenCL kernel, when working within a legacy plugin framework, or when the developer wishes to use multiple OpenCL devices. For these reasons, this tutorial will demonstrate how to create C/C++ plugins that are dynamically loaded into an application. These plugins can then create and load the massively parallel OpenCL kernels.
Looking ahead, the next article in this series will extend this plugin capability to incorporate OpenCL into heterogeneous workflows via a general-purpose "click together tools" framework that can stream arbitrary messages (vectors, arrays, and arbitrary, complex nested structures) within a single workstation, across a network of machines, or within a cloud computing framework. The ability to create scalable workflows is important because for many problems data handling and transformation can be as complex a problem as the computational problem used to produce the desired result.
The reader should note that dynamically compiled OpenCL plugins and kernels also open up the possibility of highly optimized kernel generation based on problem parameters. A number of papers and examples can be found on the Internet. Two examples are the presentation "Automatic OpenCL Optimization for Locality and Parallelism Management" and the paper "Automatically Generating and Tuning GPU Code for Sparse Matrix-Vector Multiplication from a High-Level Representation".
OpenCL for Libraries and Plug-ins
Most programmers are familiar with using libraries. A library is a collection of methods and functions contained in a single file. By convention, libraries are usually denoted by the library name followed by .a under Linux or .lib under Windows. Libraries are regularly used by most programmers because they allow code to be shared and changed in a modular fashion. During the compilation phase when building an executable, the compiler must note calls to any external library methods or functions. It is up to the linker, in a follow-on step, to resolve any remaining references and create an executable that can run on the computer.
Linking to external methods can occur:
Statically during the creation of the executable: Static linking means that all references are resolved when the executable is built. Further, the executable contains the explicit machine code to run all library functions used by the program.
Dynamically during load-time: Load time dynamic linkage happens when the executable is loaded into memory. Just like static linking, all symbols in the executable are resolved by linking with one or more .dll (Windows) or .so files (Linux) at program startup. This form of linkage provides fixed functionality (think of the C runtime library and other commonly used libraries). A big advantage of load-time linking is that all applications that link a library at load time will benefit from bug fixes and performance improvements just by installing a revised library file at the shared location. No applications need to be recompiled or relinked to use the improved library code. Further, shared libraries keep individual executable sizes small – a cost savings that is multiplied many times for libraries that are commonly used.
Dynamically during run-time: Run-time linking is used to load plugins, which allows generic functionality to be added without recompiling the application. Thus an application can call a generic external function, func(), whose functionality depends entirely on the plug-in that is loaded by the application. As mentioned previously, an application can literally write and compile the plugin (when a compiler is available) by generating problem-specific source code, which is then compiled and linked into the already running application. This trick allows application developers to create very highly optimized functions across a general problem domain when provided specific problem parameters. Many scientific applications utilize this capability to significantly improve performance.
For more information about the benefits of using libraries, DLLs (Dynamic-Link Libraries), and shared object files, see "Static, Shared Dynamic, and Loadable Linux Libraries" or the general Wikipedia discussion of DLLs.
Following is a simple C-language program that calls a generic external function, func(), and prints the value of x created by the generic function. An init(), func(), fini() framework (similar to a C++ object constructor, computational method, and destructor) is demonstrated in this simple source code to provide additional generality. This is a common design pattern in plugin programming: the init() method lets the programmer perform any initialization, while the fini() method gives the programmer the ability to perform any final processing and cleanup. In addition, the C library printf() function is called.
#include <stdio.h>
extern int init();
extern int func(int *);
extern int fini();
int main()
{
   int x;
   init();
   func(&x);
   printf("Example of static linking\n");
   printf("Valx=%d\n",x);
   fini();
   return 0;
}
The source code for dynCompile.cc extends this generic behavior and adds the capability to dynamically compile the .so (Shared Object) at runtime. The compiled method is then loaded and linked into the running executable. The name of the source file is specified by the user on the command line. It is not hard to see how this application could be extended to generate the source code that is then compiled to create the shared object plugin.
The code walk-through of dynCompile.cc starts with the include files needed to build it.
#include <cstdlib>
#include <sys/types.h>
#include <dlfcn.h>
#include <string>
#include <iostream>
using namespace std;
A global library handle and two pointer-to-function types are defined.
void *lib_handle;
typedef int (*initFini_t)();
typedef int (*func_t)(int*);
The main() method begins by parsing the command-line argument, which contains the filename of the source to be built. The command to build the .so is created and performed with a system() call. For a Linux environment, g++ is used. Windows users can call the Visual Studio cl.exe compiler.
int main(int argc, char **argv)
{
   if(argc < 2) {
      cerr << "Use: sourcefilename" << endl;
      return -1;
   }
   string base_filename(argv[1]);
   base_filename = base_filename.substr(0,base_filename.find_last_of("."));

   // build the shared object or dll
   string buildCommand("g++ -fPIC -shared ");
   buildCommand += string(argv[1])
      + string(" -o ") + base_filename + string(".so ");
   cerr << buildCommand << endl;
   if(system(buildCommand.c_str())) {
      cerr << "compile command failed!" << endl;
      cerr << "Build command " << buildCommand << endl;
      return -1;
   }
Assuming no errors occurred during the compilation phase, the next step is to load the library created in the previous step. If there is an error, the program exits.
   // load the library -------------------------------------------------
   string nameOfLibToLoad("./");
   nameOfLibToLoad += base_filename;
   nameOfLibToLoad += ".so";
   lib_handle = dlopen(nameOfLibToLoad.c_str(), RTLD_LAZY);
   if (!lib_handle) {
      cerr << "Cannot load library: " << dlerror() << endl;
      return -1;
   }
Finally, the symbols are loaded and the pointers to the init(), func(), and fini() methods are resolved.
   // load the symbols -------------------------------------------------
   initFini_t dynamicInit = NULL;
   func_t dynamicFunc = NULL;
   initFini_t dynamicFini = NULL;

   // reset errors
   dlerror();

   // load the function pointers
   dynamicFunc = (func_t) dlsym(lib_handle, "func");
   const char* dlsym_error = dlerror();
   if (dlsym_error) { cerr << "sym load: " << dlsym_error << endl; return -1; }
   dynamicInit = (initFini_t) dlsym(lib_handle, "init");
   dlsym_error = dlerror();
   if (dlsym_error) { cerr << "sym load: " << dlsym_error << endl; return -1; }
   dynamicFini = (initFini_t) dlsym(lib_handle, "fini");
   dlsym_error = dlerror();
   if (dlsym_error) { cerr << "sym load: " << dlsym_error << endl; return -1; }
Each function pointer is checked to see that the symbol has been resolved. If so, the function is called. As a convenience to the plugin author, any of the calls can be made optional, meaning the method does not need to be included in the compiled source file. All that is required is to modify the logic in the previous step so a failure to resolve a reference does not cause the application to exit.
   if( (*dynamicInit)() < 0) return -1;
   int x;
   (*dynamicFunc)(&x);
   cout << "Valx " << x << endl;
   if( (*dynamicFini)() < 0) return -1;
Finally, the libraries are unloaded and the application exits.
   // unload the library -----------------------------------------------
   dlclose(lib_handle);
}
Following is the source for a simple C++ plugin, cctest1.cc. This source code is very straightforward.
#include <iostream>
using namespace std;
extern "C" int init() {
cerr << "Hello from Init" << endl;
return(0);
}
extern "C" int func(int *i)
{
cerr << "Hello from Func" << endl;
*i=100;
return(1);
}
extern "C" int fini()
{
cerr << "Hello from Fini" << endl;
return(0);
}
The following script demonstrates how to build dynCompile.cc and run the cctest1.cc source code:
echo "------ building dynCompile -----"
g++ -o dynCompile dynCompile.cc -ldl
echo "------ dynamic version of cctest1.cc -----"
./dynCompile cctest1.cc
It produces the following output:
$ ./dynCompile cctest1.cc
g++ -fPIC -shared cctest1.cc -o cctest1.so
Hello from Init
Hello from Func
Valx 100
Hello from Fini
The application dynCompile.cc demonstrates how a sequential C/C++ plugin can be built and loaded into a running application. Further, it opens up the possibility for highly optimized automatic plugin generation based on problem parameters.
Using OpenCL in Plugins
The following source code, testStatic.cc, calls a shared object function myOCLfunction() that creates and runs an OpenCL kernel. Walking through the code, we see that the device context and queue are set up as described in part 5 of this article series. The user can specify that the plugin runs on either the CPU or a GPU.
#define PROFILING // Define to see the time the kernel takes
#define __NO_STD_VECTOR // Use cl::vector instead of STL version
#define __CL_ENABLE_EXCEPTIONS // needed for exceptions
#include <CL/cl.hpp>
#include <fstream>
#include <iostream>
using namespace std;
extern "C" int myOCLfunction(cl::CommandQueue&, const char*, int, char **);
int main(int argc, char* argv[])
{
if( argc < 3) {
cerr << "Use: {cpu|gpu} kernelFile" << endl;
exit(EXIT_FAILURE);
}
// handle command-line arguments
const string platformName(argv[1]);
const char* kernelFile = argv[2];
int ret= -1;
cl::vector<int> deviceType;
cl::vector< cl::CommandQueue > contextQueues;
// crudely parse the command line arguments.
if(platformName.compare("cpu")==0)
deviceType.push_back(CL_DEVICE_TYPE_CPU);
else if(platformName.compare("gpu")==0)
deviceType.push_back(CL_DEVICE_TYPE_GPU);
else { cerr << "Invalid device type!" << endl; return(1); }
// create the context and queues
try {
cl::vector< cl::Platform > platformList;
cl::Platform::get(&platformList);
// Get all the appropriate devices for the platform the
// implementation thinks we should be using.
// find the user-specified devices
cl::vector<cl::Device> devices;
for(int i=0; i < deviceType.size(); i++) {
cl::vector<cl::Device> dev;
platformList[0].getDevices(deviceType[i], &dev);
for(int j=0; j < dev.size(); j++) devices.push_back(dev[j]);
}
// set a single context
cl_context_properties cprops[] = {CL_CONTEXT_PLATFORM, NULL, 0};
cl::Context context(devices, cprops);
cout << "Using the following device(s) in one context" << endl;
for(int i=0; i < devices.size(); i++) {
cout << " " << devices[i].getInfo<CL_DEVICE_NAME>() << endl;
}
// Create the separate command queues to perform work
for(int i=0; i < devices.size(); i++) {
#ifdef PROFILING
cl::CommandQueue queue(context, devices[i],CL_QUEUE_PROFILING_ENABLE);
#else
cl::CommandQueue queue(context, devices[i],0);
#endif
contextQueues.push_back( queue );
}
// Call our OpenCL function using the first device. If desired,
// the reader can add a command-line argument to specify the
// device number.
ret = myOCLfunction(contextQueues[0], kernelFile, argc-3, argv+3);
} catch (cl::Error error) {
cerr << "caught exception: " << error.what()
<< '(' << error.err() << ')' << endl;
}
return ret;
}
As noted in the source comment, the function myOCLfunction() is called using the first device. If desired, the reader can add an additional command-line argument to specify a device number or change the code to run on multiple devices as shown in part 5 of this tutorial series. The generic function is passed the queue of the device as well as the name of the OpenCL kernel source file.
Following is a simple C++ plugin file to build and run the OpenCL kernel. Walking through the code, we see that the appropriate preprocessor defines and includes are provided. Also, the oclBuildProgram() method has been adapted from part 5 to build the OpenCL kernel for the device. Note that the device context and device information can be retrieved from the OpenCL queue via getInfo().
#define PROFILING // Define to see the time the kernel takes
#define __NO_STD_VECTOR // Use cl::vector instead of STL version
#define __CL_ENABLE_EXCEPTIONS // needed for exceptions
#include <CL/cl.hpp>
#include <fstream>
#include <iostream>
using namespace std;
#ifndef _OCL_BUILD
#define _OCL_BUILD
cl::Program oclBuildProgram( cl::CommandQueue& queue,
const char *kernelFile,
const char* myType)
{
cl::Context context = queue.getInfo<CL_QUEUE_CONTEXT>();
cl::Device device = queue.getInfo<CL_QUEUE_DEVICE>();
// Demonstrate using defines in the ocl build
string buildOptions;
{ // create preprocessor defines for the kernel
char buf[256];
sprintf(buf,"-D TYPE1=%s ", myType);
buildOptions += string(buf);
}
// build the program from the source in the file
ifstream file(kernelFile);
string prog(istreambuf_iterator<char>(file),
(istreambuf_iterator<char>()));
cl::Program::Sources source( 1, make_pair(prog.c_str(),
prog.length()+1));
cl::Program oclProg(context, source);
file.close();
try {
cerr << " buildOptions " << buildOptions << endl;
cl::vector<cl::Device> foo;
foo.push_back(device);
oclProg.build(foo, buildOptions.c_str() );
} catch(cl::Error& err) {
// Get the build log
cerr << "Build failed! " << err.what()
<< '(' << err.err() << ')' << endl;
cerr << "retrieving log ... " << endl;
cerr << oclProg.getBuildInfo<CL_PROGRAM_BUILD_LOG>(device)
<< endl;
exit(-1);
}
return(oclProg);
}
#endif
Also note that the function myOCLfunction() is declared extern "C" to prevent C++ name mangling, which allows this method to be called from a C-language program. The plugin parses additional command-line arguments passed from the running application. It then calls the oclBuildProgram() method to compile the OpenCL kernel code for the device. In this example, the type of vector the kernel operates on is defined to be an unsigned int. Through the use of preprocessor defines, this OpenCL kernel can be compiled to support any type (int, float, double) as discussed in part 4 of this series.
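The kernel source file itself is not listed in this article. A minimal func kernel consistent with the argument list used below (vector length, device vector, offset) and with the host-side doubling check would look something like the following sketch, with TYPE1 supplied through the -D build option:

```c
// TYPE1 is defined at OpenCL build time, e.g. -D TYPE1=uint
__kernel void func(int veclen, __global TYPE1* vec, int offset)
{
   int gid = get_global_id(0) + offset;
   if (gid < veclen) vec[gid] += vec[gid];
}
```

This is an illustrative reconstruction, not the article's original kernel; any kernel that doubles each element in place would satisfy the golden test.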
extern "C" int myOCLfunction( cl::CommandQueue& queue, const char* kernelFile,
int argc, char *argv[])
{
if(argc < 1) {
cerr << "myOCLfunction requires a vector size on the command-line" << endl;
return -1;
}
int vecsize = atoi(argv[0]);
unsigned int* vec = new unsigned int[vecsize];
int vecBytes = vecsize*sizeof(unsigned int);
cl::Context context = queue.getInfo<CL_QUEUE_CONTEXT>();
// Build the OCL program to use uints
cl::Program oclProg = oclBuildProgram(queue, kernelFile, "uint");
cl::Kernel funcKernel = cl::Kernel(oclProg, "func");
Astute readers will recognize that this plugin is adapted from the testSum.hpp code from part 5. A vector is filled with random numbers and passed to an OpenCL kernel. The simple double check on the host verifies that the OpenCL code correctly added each random number in the vector to itself. If the OpenCL and host results agree, a message "test passed" is printed. If incorrect results are found, the message "TEST FAILED!" will be printed.
// Fill the host memory with random data for the sums
srand(0);
for(int i=0; i < vecsize; i++) vec[i] = (rand()&0xffffff);
// Create a separate buffer for each device in the context
cl::Buffer d_vec;
d_vec = cl::Buffer(context, CL_MEM_READ_WRITE, vecBytes);
funcKernel.setArg(0,vecsize); // define the size of the vector
funcKernel.setArg(1,d_vec); // set the pointer to the device vector
funcKernel.setArg(2,0); // set the offset for the kernel
queue.enqueueWriteBuffer(d_vec, CL_TRUE,0, vecBytes, &vec[0]);
cl::Event event;
queue.enqueueNDRangeKernel(funcKernel,
cl::NullRange, // offset starts at 0
cl::NDRange( vecsize ), // parallelize by vecsize
cl::NDRange(1), // one work-item per work-group
NULL, &event);
// manually transfer the data from the device
queue.enqueueReadBuffer(d_vec, CL_TRUE, 0, vecBytes, &vec[0]);
queue.finish(); // wait for everything to finish
// perform the golden test
{
int i;
srand(0);
for(i=0; i < vecsize; i++) {
unsigned int r = (rand()&0xffffff);
r += r;
if(r != vec[i]) break;
}
if(i == vecsize) {
cout << "test passed" << endl;
} else {
cout << "TEST FAILED!" << endl;
}
}
delete [] vec;
return EXIT_SUCCESS;
}
These two example source files can be compiled with the following commands under Linux:
echo "---------------"
g++ -c -I $AMDAPPSDKROOT/include myOCLfunction.cc
g++ -I $AMDAPPSDKROOT/include -fopenmp testStatic.cc myOCLfunction.o -L $AMDAPPSDKROOT/lib/x86_64 -lOpenCL -o testStatic
This program can be adapted to dynamically load the plugin and call myOCLfunction(), as seen in the source code for testDynamic.cc below. The modified code performs the dynamic load and link. Note that the name of the shared object file is passed via a command-line argument.
#define PROFILING // Define to see the time the kernel takes
#define __NO_STD_VECTOR // Use cl::vector instead of STL version
#define __CL_ENABLE_EXCEPTIONS // needed for exceptions
#include <CL/cl.hpp>
#include <fstream>
#include <iostream>
#include <dlfcn.h>
using namespace std;
typedef int (*func_t)(cl::CommandQueue&, const char*, int, char **);
int main(int argc, char* argv[])
{
if( argc < 4) {
cerr << "Use: {cpu|gpu} kernelFile sharedObjectFile" << endl;
exit(EXIT_FAILURE);
}
// handle command-line arguments
const string platformName(argv[1]);
const char* kernelFile = argv[2];
const char* soFile = argv[3];
int ret= -1;
cl::vector<int> deviceType;
cl::vector< cl::CommandQueue > contextQueues;
// crudely parse the command line arguments.
if(platformName.compare("cpu")==0)
deviceType.push_back(CL_DEVICE_TYPE_CPU);
else if(platformName.compare("gpu")==0)
deviceType.push_back(CL_DEVICE_TYPE_GPU);
else { cerr << "Invalid device type!" << endl; return(1); }
// create the context and queues
try {
cl::vector< cl::Platform > platformList;
cl::Platform::get(&platformList);
// Get all the appropriate devices for the platform the
// implementation thinks we should be using.
// find the user-specified devices
cl::vector<cl::Device> devices;
for(int i=0; i < deviceType.size(); i++) {
cl::vector<cl::Device> dev;
platformList[0].getDevices(deviceType[i], &dev);
for(int j=0; j < dev.size(); j++) devices.push_back(dev[j]);
}
// set a single context
cl_context_properties cprops[] = {CL_CONTEXT_PLATFORM, NULL, 0};
cl::Context context(devices, cprops);
cout << "Using the following device(s) in one context" << endl;
for(int i=0; i < devices.size(); i++) {
cout << " " << devices[i].getInfo<CL_DEVICE_NAME>() << endl;
}
// Create the separate command queues to perform work
for(int i=0; i < devices.size(); i++) {
#ifdef PROFILING
cl::CommandQueue queue(context, devices[i],CL_QUEUE_PROFILING_ENABLE);
#else
cl::CommandQueue queue(context, devices[i],0);
#endif
contextQueues.push_back( queue );
}
// ----- Perform the dynamic load ---------------
string nameOfLibToLoad = string("./") + soFile;
void* lib_handle = dlopen(nameOfLibToLoad.c_str(), RTLD_LAZY);
if (!lib_handle) {
cerr << "Cannot load library: " << dlerror() << endl;
return -1;
}
// load the function pointers
func_t dynamicFunc = (func_t) dlsym(lib_handle, "myOCLfunction" );
const char* dlsym_error = dlerror();
if (dlsym_error) { cerr << "sym load: " << dlsym_error << endl; return -1;}
// Call our OpenCL function using the first device. If desired,
// the reader can add a command-line argument to specify the
// device number.
ret = (*dynamicFunc)(contextQueues[0], kernelFile, argc-4, argv+4);
} catch (cl::Error error) {
cerr << "caught exception: " << error.what()
<< '(' << error.err() << ')' << endl;
}
return ret;
}
The testDynamic executable and myOCLfunction.so file are created under Linux with the following commands:
echo "---------------"
g++ -fPIC -shared -I $AMDAPPSDKROOT/include myOCLfunction.cc -o myOCLfunction.so
g++ -I $AMDAPPSDKROOT/include testDynamic.cc -L $AMDAPPSDKROOT/lib/x86_64 -lOpenCL -o testDynamic
Running testStatic and testDynamic shows that the plugin works correctly on both the CPU and a GPU in both applications. The "test passed" messages demonstrate that the device and host results agree.
$ sh RUN
------- static test CPU ------
Using the following device(s) in one context
AMD Phenom(tm) II X6 1055T Processor
buildOptions -D TYPE1=uint
test passed
------- static test GPU ------
Using the following device(s) in one context
Cypress
buildOptions -D TYPE1=uint
test passed
------- dynamic test CPU ------
Using the following device(s) in one context
AMD Phenom(tm) II X6 1055T Processor
buildOptions -D TYPE1=uint
test passed
------- dynamic test GPU ------
Using the following device(s) in one context
AMD Phenom(tm) II X6 1055T Processor
buildOptions -D TYPE1=uint
test passed
Including and Protecting the OpenCL Source
In many cases, the developer might not want to include the OpenCL source code for the plugin in a separate file. Using multiple files complicates the installation and can introduce hard-to-diagnose installation and upgrade errors. Using a single file simplifies installation and can make OpenCL plugin deployments more robust. In particular, commercial developers will not want to release their OpenCL source code in a form that anyone can easily read.
The simplest way to include the OpenCL source in the plugin file is to use the Linux xxd hexdump tool to create a string that contains the OpenCL source code. The OpenCL method clCreateProgramWithSource() used in part 1 of this article series can then be used to compile the source code from a string.
The following command demonstrates how to create a C-language string from a file, foo.txt, that contains the text "This is a test file. It spans multiple lines." Running xxd -i foo.txt produces the following output:
unsigned char foo_txt[] = {
0x54, 0x68, 0x69, 0x73, 0x20, 0x69, 0x73, 0x20, 0x61, 0x20, 0x74, 0x65,
0x73, 0x74, 0x20, 0x66, 0x69, 0x6c, 0x65, 0x2e, 0x0a, 0x49, 0x74, 0x20,
0x73, 0x70, 0x61, 0x6e, 0x73, 0x20, 0x6d, 0x75, 0x6c, 0x74, 0x69, 0x70,
0x6c, 0x65, 0x20, 0x6c, 0x69, 0x6e, 0x65, 0x73, 0x2e, 0x0a
};
unsigned int foo_txt_len = 46;
The string foo_txt can be compiled into the plugin and passed to clCreateProgramWithSource(). A disadvantage of using xxd is that the OpenCL source string is no longer human readable. However, this does not protect the OpenCL kernel source, as it can easily be recovered from the .so or executable file by simply running the UNIX strings command.
Commercial developers can make it more difficult to get the source code for their OpenCL kernels by encrypting and decrypting the source text with a package like Keyczar on Google Code. The encrypted source string can still be created using xxd. Still, the OpenCL source code will reside in a buffer after decryption and during the call to clCreateProgramWithSource(). While encryption does make it harder, a motivated hacker can still find and print the source code from that buffer.
As mentioned in part 1, offline compilation can create the OpenCL device binary for specific devices. Just as the application binary provides protection against reverse engineering, OpenCL binaries obfuscate the OpenCL kernel code. Again, this binary can be included in the source code using the output from xxd. AMD provides a knowledge base article explaining how to perform offline kernel compilation. A technical limitation of this method is that the plugin can only support devices for which pre-compiled OpenCL binaries are included. Depending on the business model, this restriction could be an advantage or a drawback.
Summary
With the ability to create OpenCL plugins, application programmers can write and support generic applications that deliver accelerated performance when a GPU is present and CPU-based performance when a GPU is not available. These plugin architectures are well understood and a convenient way to leverage existing applications and code bases. They also help preserve existing software investments.
The ability to dynamically compile OpenCL source code and link it into a running application opens a host of opportunities for optimizing code generators. This capability is part of the foundation upon which the portable parallelism of OpenCL resides. As mentioned in this article and utilized in scientific computation, dynamically generating optimized code for specific parameter sets in a general problem domain can achieve very high performance – far beyond what a single "one size fits all" generic code can deliver.
The next article in this series will extend this capability to exploit hybrid CPU/GPU computation in heterogeneous workflows so developers can capitalize on GPU acceleration and CPU capabilities in their production workflows.