The previous article, part 4 of this series, covered the OpenCL™ runtime and demonstrated how to perform concurrent computations among the work queues of heterogeneous devices. In this article, the example source code from part 4 is adapted to queue a user-specified number of tasks split amongst multiple CPU and GPU command queues. Both blocking I/O and mapped I/O are used to tie data to kernels running on each device. The supplied OpenCL kernel, which only adds a number to itself, was purposely kept simple to focus attention on the queue and data management issues. The reader can easily supply other, more complicated kernels via a command-line argument to evaluate them in multiple environments when utilizing (1) just the host processor, (2) one or more GPU devices, or (3) a mix of all devices at once. As in the example in part 4, concurrent execution is initiated succinctly via an OpenMP pragma.
The source code in this article continues to use a simple
yet useful preprocessor capability to pass C++ template types to an OpenCL
kernel. While the example in this article does not modify the data type to evaluate performance, this capability was kept as a convenience should the reader choose to evaluate the performance of alternative kernels in a mixed, multi-device environment.
Memory Affinity
OpenCL uses a relaxed memory consistency model as
discussed in part
2 of this series. The beauty inherent in the OpenCL memory design is that data
is localized and associated with a work item,
or within a work-group, by the programmer. These work items can then be queued
on many devices to achieve very high performance and scalability. This model
does require that the programmer assume responsibility to ensure all tasks see
a consistent view of memory.
OpenCL utilizes buffer objects within a context to share
data between devices, which differs from the programming model expected by
those who program conventional shared-memory machines. Buffer objects provide
the foundation through which the programmer and OpenCL runtime work together in
concert to create a single program that can run without recompilation on a
multitude of machine and device configurations. For example, the executable
from this article ran without recompilation on a single processor, a system
with a single GPU, and a larger system with a CPU and multiple GPUs.
It is worth noting that many OpenCL programmers become
confused when they encounter the maximum size limitation of a buffer, which is generally
much smaller than the size of the memory on their GPU. To avoid this
limitation, it is necessary to think in terms of partitioning a computation to
run across one or more devices. A buffer then becomes a natural way to express
the distribution of work - rather than a mechanism through which a memory image
is moved to a single massively threaded device. To preserve efficiency, OpenCL
provides both mapped and asynchronous buffers so computation can proceed while
additional data is being transferred. In this way, OpenCL programmers can create
applications that deliver both high performance and portable parallelism
without recompilation on a wide variety of user hardware configurations.
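Readers who bump into the buffer-size limit can query it directly through the C++ wrapper. The following minimal sketch (a hypothetical helper, not part of this article's example) computes how many buffers a large allocation would need to be split into:
// Sketch only: how many buffers are needed to hold dataBytes, given the
// device's per-buffer limit? "chunksNeeded" is a hypothetical helper.
size_t chunksNeeded(const cl::Device& device, size_t dataBytes)
{
   // This limit is frequently far smaller than CL_DEVICE_GLOBAL_MEM_SIZE.
   cl_ulong maxAlloc = device.getInfo<CL_DEVICE_MAX_MEM_ALLOC_SIZE>();
   return (dataBytes + maxAlloc - 1) / maxAlloc; // round up
}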
Explicit, programmer-initiated transfers occur by queuing one or more transfers on a command queue. Examples include:
C API:
- clEnqueueReadBuffer(), clEnqueueReadImage()
- clEnqueueWriteBuffer(), clEnqueueWriteImage()
- clEnqueueCopyBuffer(), clEnqueueCopyImage()
C++ API:
- cl::enqueueReadBuffer(), cl::enqueueWriteBuffer()
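For illustration, a blocking write/read pair using the C++ wrapper's command-queue methods might look like the following sketch, where queue, d_buf, h_data, and nBytes are assumed to already exist:
// Sketch: blocking transfers (queue, d_buf, h_data, nBytes are assumed names).
queue.enqueueWriteBuffer(d_buf, CL_TRUE, 0, nBytes, h_data); // host -> device
// ... enqueue kernels that operate on d_buf ...
queue.enqueueReadBuffer(d_buf, CL_TRUE, 0, nBytes, h_data);  // device -> host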
Data transfers can be either blocking, in which case the queue waits for the transfer to complete, or asynchronous, in which case events provide notification when a transfer has completed. Using asynchronous data transfers benefits application performance by allowing computation to overlap with data movement - thus decreasing the time to solution. Since the PCIe bus is full duplex, meaning that it can transfer data in both directions at the same time, there is a potential 2-times increase in data transfer bandwidth that can be achieved.
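A sketch of the asynchronous form (same assumed names as above) shows the call returning immediately while an event reports completion:
// Sketch: non-blocking write with event notification (assumed names).
cl::Event xferDone;
queue.enqueueWriteBuffer(d_buf, CL_FALSE, 0, nBytes, h_data, NULL, &xferDone);
// ... perform independent host work while the PCIe transfer is in flight ...
xferDone.wait(); // block only at the point the transferred data is needed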
Alternatively, regions of the object data can be implicitly transferred by mapping buffers into the host address space. These transfers can occur both asynchronously and on a demand basis, meaning only portions of the data required by a calculation are moved and cached on a device. API examples include:
C API:
- clEnqueueMapBuffer(), clEnqueueMapImage()
- clEnqueueUnmapMemObject()
C++ API:
- cl::Buffer() (via various flags discussed below)
- cl::enqueueMapBuffer(), cl::enqueueMapImage()
- cl::enqueueUnmapMemObject()
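As a minimal sketch of the mapped style (again assuming queue, d_buf, nBytes, and an element count n), the buffer is mapped into host memory, written through the returned pointer, and unmapped so the device sees the changes:
// Sketch: map, fill through the host pointer, then unmap (assumed names).
uint* p = (uint*)queue.enqueueMapBuffer(d_buf, CL_TRUE, CL_MAP_WRITE,
                                        0, nBytes);
for(int i=0; i < n; i++) p[i] = i;      // write via the mapped region
queue.enqueueUnmapMemObject(d_buf, p);  // changes become visible to the device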
It can be confusing to determine which flags, or combination of flags, are needed to correctly create and potentially map a buffer to a
device. (The following is excerpted
from an excellent explanation by user bwatt on the Khronos.org message board.)
There are three flags that can be used in mapping memory or
when creating a buffer with the C++ OpenCL wrapper:
- CL_MEM_ALLOC_HOST_PTR
- CL_MEM_COPY_HOST_PTR
- CL_MEM_USE_HOST_PTR
Five valid combinations can be created with these three flags:
1. Non-mapped buffers requiring manual data transfers:
   a. No flags specified
   b. CL_MEM_COPY_HOST_PTR
2. Mapped buffers:
   a. CL_MEM_ALLOC_HOST_PTR
   b. (CL_MEM_ALLOC_HOST_PTR | CL_MEM_COPY_HOST_PTR)
   c. CL_MEM_USE_HOST_PTR
Letting OpenCL allocate the memory (options 1a and 2a) provides the greatest likelihood of delivering good performance in a portable fashion, as the buffer can be allocated internally to best conform to alignment, pinned memory, and other device-specific performance criteria. When porting an existing application, it might be necessary to use already-allocated regions of memory, in which case CL_MEM_USE_HOST_PTR may be the only option, though it may not be the best one from a performance point of view.
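The five combinations translate directly into cl::Buffer constructor calls. A sketch follows, where context, nBytes, and a valid host pointer h are assumed:
// Sketch of the five combinations (context, nBytes, h are assumed names).
cl::Buffer b1a(context, 0, nBytes);                        // 1a: no flags
cl::Buffer b1b(context, CL_MEM_COPY_HOST_PTR, nBytes, h);  // 1b
cl::Buffer b2a(context, CL_MEM_ALLOC_HOST_PTR, nBytes);    // 2a
cl::Buffer b2b(context, CL_MEM_ALLOC_HOST_PTR
                        | CL_MEM_COPY_HOST_PTR, nBytes, h); // 2b
cl::Buffer b2c(context, CL_MEM_USE_HOST_PTR, nBytes, h);   // 2c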
Choosing a Queuing Model
Derek Gerstmann provided an excellent presentation for SIGGRAPH Asia 2009, “Advanced OpenCL Event Model Usage”, which discusses several use cases including:
- 1x In-Order Queue, 1x Context, 1x Device.
- 1x Out-of-Order Queue, 1x Context, 1x Device.
- 2x In-Order Queues, 2x Separate Contexts, 2x Devices.
- 2x In-Order Queues, 1x Combined Context, 2x Devices.
The example in this article utilizes a single context with a
single command queue per device, or the “Cooperative Multi-Device Usage Model”.
This implies that all objects created in the single context are shared by all
the command-queues. Appendix A, section A.1 of the OpenCL specification, “Shared OpenCL Objects,” indicates:
- OpenCL memory objects, program objects and kernel objects
are created using a context and can be shared across multiple
command-queues created using the same context.
- A command-queue can cache
changes to the state of a memory object on the device associated with the
command-queue.
- The application needs to implement appropriate
synchronization across threads on the host processor to ensure that the
changes to the state of a shared object (such as a command-queue object,
memory object, program or kernel object) happen in the correct order
(deemed correct by the application) when multiple command-queues in
multiple threads are making changes to the state of a shared object.
Succinctly, be careful how buffers are utilized by multiple
devices as the runtime may introduce copies and other behavior that can affect
performance and even program correctness. A good, recent discussion on this
topic occurred in this
thread on the Khronos.org message board. On the basis of this discussion, the example code in this article dedicates a buffer per device, both for explicit data transfers and when mapping memory to the devices.
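When a shared buffer cannot be avoided, events offer one way to impose the required ordering across queues. The following sketch (hypothetical kernels and queue names; not the approach taken by this article's example) makes a consumer kernel on queueB wait for a producer kernel on queueA:
// Sketch: event-ordered access to one shared buffer from two queues in the
// same context (queueA, queueB, and both kernels are assumed names).
cl::Event producerDone;
queueA.enqueueNDRangeKernel(producerKernel, cl::NullRange,
                            cl::NDRange(n), cl::NDRange(1),
                            NULL, &producerDone);
cl::vector<cl::Event> waitList;
waitList.push_back(producerDone);
queueB.enqueueNDRangeKernel(consumerKernel, cl::NullRange,
                            cl::NDRange(n), cl::NDRange(1),
                            &waitList, NULL); // waits for the producer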
Example
The source code for testSum.cpp creates a two-dimensional integer array in host memory according to the sizes specified by the user on the command line. This array is filled with random numbers and transferred (or mapped) to one or more devices. The OpenCL kernel, simpleAdd.cl, simply adds each array element to itself.
As discussed in part 4, the C++ template header file, testSum.hpp, makes simple use of template specialization to get the typename of a template parameter for several atomic datatypes (float, double, int, etcetera). This method keeps the code simple and does not complicate it with too much C++ wizardry. Since the kernel is built inside the instantiated class, this typename can be passed to the OpenCL kernel build via preprocessor defines.
The complete code for testSum.hpp follows. Note that it has been adapted to support multiple kernel invocations, essentially by removing all the code inside the initData method. Aside from that, this code is very similar to the C++ template files used in the previous two articles.
#include <cmath>
#include <algorithm>

// Use template specialization to map atomic C++ types to OpenCL type names.
template<class T>
struct TypeName {
   string getName();
private:
   T *t;
};
template<> string TypeName<double>::getName() { return(string("double")); }
template<> string TypeName<float>::getName() { return(string("float")); }
template<> string TypeName<unsigned long>::getName() { return(string("ulong")); }
template<> string TypeName<long>::getName() { return(string("long")); }
template<> string TypeName<unsigned int>::getName() { return(string("uint")); }
template<> string TypeName<int>::getName() { return(string("int")); }
template<> string TypeName<unsigned char>::getName() { return(string("uchar")); }
template<> string TypeName<char>::getName() { return(string("char")); }

template <typename TYPE1>
class OclTest {
private:
   cl::Kernel kernel;
   string myType;
   cl::Event event;
   int vecsize;

public:
   cl::Event *getEventPtr() { return &event; }

   OclTest() {}
   OclTest(cl::CommandQueue& queue, const char* kernelFile,
           int argc, char *argv[])
   {
      cl::Device device = queue.getInfo<CL_QUEUE_DEVICE>();
      cl::Context context = queue.getInfo<CL_QUEUE_CONTEXT>();
      cout << "---------- building OpenCL kernel ("
           << kernelFile << ") -----" << endl;

      // Pass the template type to the OpenCL build as a preprocessor define.
      myType = TypeName<TYPE1>().getName();
      cout << "  My type is " << myType.c_str() << endl;
      string buildOptions;
      { char buf[256];
        sprintf(buf, "-D TYPE1=%s ", myType.c_str());
        buildOptions += string(buf);
      }

      // Read the kernel source and build it for this queue's device.
      ifstream file(kernelFile);
      string prog(istreambuf_iterator<char>(file),
                  (istreambuf_iterator<char>()));
      cl::Program::Sources source(1, make_pair(prog.c_str(),
                                               prog.length()+1));
      cl::Program program(context, source);
      file.close();
      try {
         cerr << "  buildOptions " << buildOptions << endl;
         cl::vector<cl::Device> foo;
         foo.push_back(device);
         program.build(foo, buildOptions.c_str());
      } catch(cl::Error& err) {
         cerr << "Build failed! " << err.what()
              << '(' << err.err() << ')' << endl;
         cerr << "retrieving log ... " << endl;
         cerr << program.getBuildInfo<CL_PROGRAM_BUILD_LOG>(device)
              << endl;
         exit(-1);
      }
      // The kernel name is the kernel file name without the ".cl" extension.
      string kernelName = string(kernelFile)
                             .substr(0, string(kernelFile).find(".cl"));
      kernel = cl::Kernel(program, kernelName.c_str());
   }

   inline void initData(int _vecsize) { vecsize = _vecsize; }
   inline cl::Kernel& getKernel() { return(kernel); }

   cl::NDRange getGlobalWorkItems() {
      return( cl::NDRange( vecsize ) ); // one work item per element
   }
   cl::NDRange getWorkItemsInWorkGroup() {
      return( cl::NDRange(1) ); // 1D local range to match the 1D global range
   }
};
testSum.cpp
The focus of testSum.cpp is to demonstrate the creation of multiple command queues within a single context while maintaining data affinity across multiple kernel invocations on a number of devices. Using the C++ OpenCL wrapper makes the code that finds all the devices and places them into a single context fairly compact:
cl::vector< cl::Platform > platformList;
cl::Platform::get(&platformList);

cl::vector<cl::Device> devices;
for(int i=0; i < deviceType.size(); i++) {
   cl::vector<cl::Device> dev;
   platformList[0].getDevices(deviceType[i], &dev);
   for(int j=0; j < dev.size(); j++) devices.push_back(dev[j]);
}

cl_context_properties cprops[] = {CL_CONTEXT_PLATFORM, NULL, 0};
cl::Context context(devices, cprops);
cout << "Using the following device(s) in one context" << endl;
for(int i=0; i < devices.size(); i++) {
   cout << "  " << devices[i].getInfo<CL_DEVICE_NAME>() << endl;
}
This code has been tested on a system containing two ATI
Radeon HD 5870 GPUs and an AMD Phenom™ II X6 1055T processor running the latest
AMD Accelerated Parallel Processing (APP) SDK (formerly known as ATI Stream).
(Note: the code in part 3 also works correctly on multi-GPU systems.)
Creating the separate command queues, one per device, is also compact:
cl::vector< cl::CommandQueue > contextQueues;
for(int i=0; i < devices.size(); i++) {
#ifdef PROFILING
   cl::CommandQueue queue(context, devices[i], CL_QUEUE_PROFILING_ENABLE);
#else
   cl::CommandQueue queue(context, devices[i], 0);
#endif
   contextQueues.push_back( queue );
}
C++ preprocessor conditionals are utilized to choose between mapped and unmapped buffers. Defining the preprocessor variable USE_MAP compiles the code with mapped buffers that implicitly transfer the data to the OpenCL devices. The default is to use explicit, blocking transfers.
int nDevices = contextQueues.size();
unsigned int* vec = new uint[nvec*vecsize];
int vecBytes=vecsize*sizeof(uint);

srand(0);
for(int i=0; i < (nvec*vecsize); i++) vec[i] = (rand()&0xffffff);

#ifdef USE_MAP
// One buffer per device that holds the entire host array.
cl::vector< cl::Buffer > d_vec;
for(int i=0; i < contextQueues.size(); i++) {
   d_vec.push_back(cl::Buffer(context, CL_MEM_COPY_HOST_PTR,
                              nvec*vecBytes, vec) );
}
int vecOffset=vecBytes;
#else
// One single-row buffer per device; rows are explicitly transferred.
cl::vector< cl::Buffer > d_vec;
for(int i=0; i < contextQueues.size(); i++) {
   d_vec.push_back(cl::Buffer(context, CL_MEM_READ_WRITE, vecBytes) );
}
int vecOffset=0;
#endif
As with the example in part 4, OpenMP is utilized to concurrently queue commands in parallel on all the devices. This provides a simple and concise way to introduce parallelism via an OpenMP pragma, as seen in the code snippet below. More information on OpenMP can be found on the Internet; an excellent starting place is the Wikipedia article on OpenMP. Note that the kernel runs multiple times on each queue.
#pragma omp parallel for
for(int i=0; i < contextQueues.size(); i++) {
   test[i].initData(vecsize);
   test[i].getKernel().setArg(0,vecsize);
   test[i].getKernel().setArg(1,d_vec[i]);
   // Round-robin the rows (vectors) across the devices.
   for(int j=i; j < nvec; j += nDevices) {
#ifdef USE_MAP
      test[i].getKernel().setArg(2,j);
#else
      test[i].getKernel().setArg(2,0);
      contextQueues[i].enqueueWriteBuffer(d_vec[i], CL_TRUE, 0, vecBytes,
                                          &vec[j*vecsize]);
#endif
      contextQueues[i].enqueueNDRangeKernel(test[i].getKernel(),
            cl::NullRange, test[i].getGlobalWorkItems(),
            test[i].getWorkItemsInWorkGroup(),
            NULL, test[i].getEventPtr());
      contextQueues[i].enqueueReadBuffer(d_vec[i], CL_TRUE,
            j * vecOffset, vecBytes, &vec[j*vecsize]);
   }
   contextQueues[i].finish();
}
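Since the queues are created with CL_QUEUE_PROFILING_ENABLE when PROFILING is defined, the event captured through getEventPtr() can also report kernel timing. A minimal sketch of how the time of the last kernel on queue i might be printed (this is not part of the listing below):
// Sketch: read start/end times from the last recorded kernel event.
cl::Event& e = *test[i].getEventPtr();
e.wait(); // ensure the kernel has completed
cl_ulong start = e.getProfilingInfo<CL_PROFILING_COMMAND_START>();
cl_ulong stop  = e.getProfilingInfo<CL_PROFILING_COMMAND_END>();
cout << "kernel time " << (stop - start)*1e-9 << " seconds" << endl;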
Once all the devices have completed, the host double-checks the results and prints out either success or failure:
{
   int i;
   srand(0);
   for(i=0; i < (nvec*vecsize); i++) {
      unsigned int r = (rand()&0xffffff);
      r += r; // each element should have been added to itself
      if(r != vec[i]) break;
   }
   if(i == (nvec*vecsize)) {
      cout << "test passed" << endl;
   } else {
      cout << "TEST FAILED!" << endl;
   }
}
Following is the complete code for testSum.cpp.
#define PROFILING // Define to see the time the kernel takes
#define __NO_STD_VECTOR // Use cl::vector instead of STL version
#define __CL_ENABLE_EXCEPTIONS // needed for exceptions
#include <CL/cl.hpp>
#include <fstream>
#include <iostream>
using namespace std;
#include "testSum.hpp"
int main(int argc, char* argv[])
{
   if( argc < 5) {
      cerr << "Use: {cpu|gpu|both} kernelFile nvec vecsize" << endl;
      exit(EXIT_FAILURE);
   }

   // Parse the command-line arguments.
   const string platformName(argv[1]);
   const char* kernelFile = argv[2];
   int nvec = atoi(argv[3]);
   int vecsize = atoi(argv[4]);

   cl::vector<int> deviceType;
   if(platformName.compare("cpu")==0)
      deviceType.push_back(CL_DEVICE_TYPE_CPU);
   else if(platformName.compare("gpu")==0)
      deviceType.push_back(CL_DEVICE_TYPE_GPU);
   else if(platformName.compare("both")==0) {
      deviceType.push_back(CL_DEVICE_TYPE_GPU);
      deviceType.push_back(CL_DEVICE_TYPE_CPU);
   } else { cerr << "Invalid device type!" << endl; return(1); }

   try {
      // Gather all requested devices into a single context.
      cl::vector< cl::Platform > platformList;
      cl::Platform::get(&platformList);
      cl::vector<cl::Device> devices;
      for(int i=0; i < deviceType.size(); i++) {
         cl::vector<cl::Device> dev;
         platformList[0].getDevices(deviceType[i], &dev);
         for(int j=0; j < dev.size(); j++) devices.push_back(dev[j]);
      }
      cl_context_properties cprops[] = {CL_CONTEXT_PLATFORM, NULL, 0};
      cl::Context context(devices, cprops);
      cout << "Using the following device(s) in one context" << endl;
      for(int i=0; i < devices.size(); i++) {
         cout << "  " << devices[i].getInfo<CL_DEVICE_NAME>() << endl;
      }

      // One command queue per device.
      cl::vector< cl::CommandQueue > contextQueues;
      for(int i=0; i < devices.size(); i++) {
#ifdef PROFILING
         cl::CommandQueue queue(context, devices[i], CL_QUEUE_PROFILING_ENABLE);
#else
         cl::CommandQueue queue(context, devices[i], 0);
#endif
         contextQueues.push_back( queue );
      }

      // Build the kernel for each device.
      cl::vector< OclTest<uint> > test;
      for(int i=0; i < contextQueues.size(); i++) {
         test.push_back(OclTest<uint>(contextQueues[i],
                                      kernelFile, argc-3, argv+3));
      }

      int nDevices = contextQueues.size();
      unsigned int* vec = new uint[nvec*vecsize];
      int vecBytes=vecsize*sizeof(uint);

      // Fill the host array with random numbers.
      srand(0);
      for(int i=0; i < (nvec*vecsize); i++) vec[i] = (rand()&0xffffff);

#ifdef USE_MAP
      // One buffer per device that holds the entire host array.
      cl::vector< cl::Buffer > d_vec;
      for(int i=0; i < contextQueues.size(); i++) {
         d_vec.push_back(cl::Buffer(context, CL_MEM_COPY_HOST_PTR,
                                    nvec*vecBytes, vec) );
      }
      int vecOffset=vecBytes;
#else
      // One single-row buffer per device; rows are explicitly transferred.
      cl::vector< cl::Buffer > d_vec;
      for(int i=0; i < contextQueues.size(); i++) {
         d_vec.push_back(cl::Buffer(context, CL_MEM_READ_WRITE, vecBytes) );
      }
      int vecOffset=0;
#endif

      // Concurrently queue work on all devices, one OpenMP thread per queue.
#pragma omp parallel for
      for(int i=0; i < contextQueues.size(); i++) {
         test[i].initData(vecsize);
         test[i].getKernel().setArg(0,vecsize);
         test[i].getKernel().setArg(1,d_vec[i]);
         for(int j=i; j < nvec; j += nDevices) {
#ifdef USE_MAP
            test[i].getKernel().setArg(2,j);
#else
            test[i].getKernel().setArg(2,0);
            contextQueues[i].enqueueWriteBuffer(d_vec[i], CL_TRUE, 0, vecBytes,
                                                &vec[j*vecsize]);
#endif
            contextQueues[i].enqueueNDRangeKernel(test[i].getKernel(),
                  cl::NullRange, test[i].getGlobalWorkItems(),
                  test[i].getWorkItemsInWorkGroup(),
                  NULL, test[i].getEventPtr());
            contextQueues[i].enqueueReadBuffer(d_vec[i], CL_TRUE,
                  j * vecOffset, vecBytes, &vec[j*vecsize]);
         }
         contextQueues[i].finish();
      }

      // Double-check the results on the host.
      {
         int i;
         srand(0);
         for(i=0; i < (nvec*vecsize); i++) {
            unsigned int r = (rand()&0xffffff);
            r += r;
            if(r != vec[i]) break;
         }
         if(i == (nvec*vecsize)) {
            cout << "test passed" << endl;
         } else {
            cout << "TEST FAILED!" << endl;
         }
      }
      delete [] vec;
   } catch (cl::Error& error) {
      cerr << "caught exception: " << error.what()
           << '(' << error.err() << ')' << endl;
   }
   return EXIT_SUCCESS;
}
The following commands build the code using the recently released AMD version 2.5 SDK. The first builds the default version with explicit, blocking transfers; the second specifies the C++ preprocessor define USE_MAP to build the mapped-buffer version.
g++ -I $ATISTREAMSDKROOT/include -fopenmp testSum.cpp -L $ATISTREAMSDKROOT/lib/x86_64 -lOpenCL -o testSum
g++ -D USE_MAP -I $ATISTREAMSDKROOT/include -fopenmp testSum.cpp -L $ATISTREAMSDKROOT/lib/x86_64 -lOpenCL -o testSum
simpleAdd.cl
The complete source listing for simpleAdd.cl follows:
// Each work item adds one element of row "offset" to itself.
inline __kernel void simpleAdd(int veclen, __global TYPE1* c, int offset)
{
   int index = get_global_id(0);
   c[index + offset*veclen] += c[index + offset*veclen];
}
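Because the host code derives the kernel name from the kernel file name (the text before ".cl"), any alternative kernel supplied on the command line must match its file name and keep the same argument signature. A hypothetical example, saved as moreWork.cl, performs extra per-element arithmetic while still doubling each value so the host-side verification passes:
// Hypothetical kernel (moreWork.cl): same signature as simpleAdd, but with
// added busywork per element; the net result is still c += c, so the
// host-side check in testSum.cpp still succeeds.
__kernel void moreWork(int veclen, __global TYPE1* c, int offset)
{
   int index = get_global_id(0);
   TYPE1 v = c[index + offset*veclen];
   for(int k=0; k < 100; k++) v = (v + v) - v; // busywork; v is unchanged
   c[index + offset*veclen] = v + v;
}
It would then be invoked as, for example, ./testSum both moreWork.cl 300 10240.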
Command-line options allow setting the following:
- cpu, gpu, or both:
- cpu: run only on the processor cores.
- gpu: run on all the GPU devices.
- both: run on both CPU and GPU devices.
- The filename of the OpenCL kernel.
- The number of rows in the array. This also defines how many kernel
invocations will happen across the devices (one kernel invocation per row).
- The number of columns in each row. This defines how much data will be
transferred across the PCIe bus.
The following shows an example
output when running on all the devices:
$ ./testSum both simpleAdd.cl 300 10240
Using the following device(s) in one context
Cypress
Cypress
AMD Phenom(tm) II X6 1055T Processor
---------- building OpenCL kernel (simpleAdd.cl) -----
My type is uint
buildOptions -D TYPE1=uint
---------- building OpenCL kernel (simpleAdd.cl) -----
My type is uint
buildOptions -D TYPE1=uint
---------- building OpenCL kernel (simpleAdd.cl) -----
My type is uint
buildOptions -D TYPE1=uint
test passed
Summary
The example in this article demonstrated that it is possible
to concisely specify an application that can run on numerous device
configurations without recompilation. This is the essence of the portable
parallelism of OpenCL.
Tying data to computation is a key capability that permits
work to be queued on many devices to achieve very high performance and scalability.
Rather than thinking in terms of a memory image, OpenCL programmers need to think in terms of buffers that can be moved as needed amongst devices. In this way, the OpenCL mapping and data transfer capabilities can be utilized. This differs from the monolithic view of a single large memory presented by most conventional SMP (symmetric multiprocessing) architectures. With buffer objects, the OpenCL programmer and runtime can work together to have a single program run, without recompilation, on a single processor, a hybrid CPU/GPU system, or a system with a CPU and multiple GPUs.
Rob Farber is a senior scientist and research consultant at the Irish Centre for High-End Computing in Dublin, Ireland, and Chief Scientist for BlackDog Endeavors, LLC in the US.
Rob has been on staff at several US national laboratories including Los Alamos National Laboratory, Lawrence Berkeley National Laboratory, and Pacific Northwest National Laboratory. He also served as an external faculty member at the Santa Fe Institute, co-founded two successful start-ups, and has been a consultant to Fortune 100 companies. His articles have appeared in Dr. Dobb's Journal and Scientific Computing, among others. He recently completed a book teaching massively parallel computing.