## Introduction

VexCL is vector expression template library for OpenCL. It has been created for ease of C++ based OpenCL development. Multi-device (and multi-platform)
computations are supported. Source code of the library is publicly available under MIT
license. See the Doxygen-generated documentation at http://ddemidov.github.com/vexcl.

This is the first of two articles describing the VexCL library. The first part is an introduction to the VexCL interface.
The second part compares VexCL performance
with existing GPGPU implementations. The comparison is based on odeint -
a modern C++ library for numerical solutions of ordinary differential equations.

To quote Khronos group, the organization behind
the OpenCL standard, *OpenCL is the first open, royalty-free standard
for cross-platform, parallel programming of modern processors found in personal computers, servers, and handheld/embedded devices. OpenCL (Open Computing
Language) greatly improves speed and responsiveness for a wide spectrum of applications in numerous market categories from gaming and entertainment to scientific and medical software.*

The weakest sides of OpenCL are, probably, lack of tools and libraries around it and
the amount of boilerplate code needed to develop OpenCL applications. The VexCL
library tries to solve the latter issue. To start with an example, consider the "hello world" problem of OpenCL: addition of two vectors.
The following code block contains the program implementing the task with pure OpenCL. Note that
the official C++ bindings are used here; the C variant would be much more verbose.

#include <iostream>
#include <vector>
#include <string>
#define __CL_ENABLE_EXCEPTIONS
#include <CL/cl.hpp>
static const char source[] =
"#if defined(cl_khr_fp64)\n"
"# pragma OPENCL EXTENSION cl_khr_fp64: enable\n"
"#elif defined(cl_amd_fp64)\n"
"# pragma OPENCL EXTENSION cl_amd_fp64: enable\n"
"#else\n"
"# error double precision is not supported\n"
"#endif\n"
"kernel void add(\n"
" uint n,\n"
" global const double *a,\n"
" global const double *b,\n"
" global double *c\n"
" )\n"
"{\n"
" uint i = get_global_id(0);\n"
" if (i < n) {\n"
" c[i] = a[i] + b[i];\n"
" }\n"
"}\n";
int main() {
const unsigned int N = 1 << 20;
try {
std::vector<cl::Platform> platform;
cl::Platform::get(&platform);
if (platform.empty()) {
std::cerr << "OpenCL platforms not found." << std::endl;
return 1;
}
cl::Context context;
std::vector<cl::Device> device;
for(auto p = platform.begin(); device.empty() && p != platform.end(); p++) {
std::vector<cl::Device> pldev;
try {
p->getDevices(CL_DEVICE_TYPE_GPU, &pldev);
for(auto d = pldev.begin(); device.empty() && d != pldev.end(); d++) {
if (!d->getInfo<CL_DEVICE_AVAILABLE>()) continue;
std::string ext = d->getInfo<CL_DEVICE_EXTENSIONS>();
if (
ext.find("cl_khr_fp64") == std::string::npos &&
ext.find("cl_amd_fp64") == std::string::npos
) continue;
device.push_back(*d);
context = cl::Context(device);
}
} catch(...) {
device.clear();
}
}
if (device.empty()) {
std::cerr << "GPUs with double precision not found." << std::endl;
return 1;
}
std::cout << device[0].getInfo<CL_DEVICE_NAME>() << std::endl;
cl::CommandQueue queue(context, device[0]);
cl::Program program(context, cl::Program::Sources(
1, std::make_pair(source, strlen(source))
));
try {
program.build(device);
} catch (const cl::Error&) {
std::cerr
<< "OpenCL compilation error" << std::endl
<< program.getBuildInfo<CL_PROGRAM_BUILD_LOG>(device[0])
<< std::endl;
return 1;
}
cl::Kernel add = cl::Kernel(program, "add");
std::vector<double> a(N, 1);
std::vector<double> b(N, 2);
std::vector<double> c(N);
cl::Buffer A(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
a.size() * sizeof(double), a.data());
cl::Buffer B(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
b.size() * sizeof(double), b.data());
cl::Buffer C(context, CL_MEM_READ_WRITE,
c.size() * sizeof(double));
add.setArg(0, N);
add.setArg(1, A);
add.setArg(2, B);
add.setArg(3, C);
queue.enqueueNDRangeKernel(add, cl::NullRange, N, cl::NullRange);
queue.enqueueReadBuffer(C, CL_TRUE, 0, c.size() * sizeof(double), c.data());
std::cout << c[42] << std::endl;
} catch (const cl::Error &err) {
std::cerr
<< "OpenCL error: "
<< err.what() << "(" << err.err() << ")"
<< std::endl;
return 1;
}
}

I am sorry for wasting page space, but that is the point of this example. Feel free to collapse this ugly monster away as soon as you note that it contains
133 lines of code. As you can see, the basic steps of an OpenCL program are selection of
the compute device, initialization of the OpenCL context, building the OpenCL
program, allocating and initializing memory on the selected device, and running the compute kernel. VexCL strives to simplify (and get rid of, where possible)
these steps. A program that solves the same problem with the help of the VexCL library is presented below.

#include <iostream>
#include <vector>
#include <vexcl/vexcl.hpp>
int main() {
const unsigned int N = 1 << 20;
try {
vex::Context ctx(
vex::Filter::Type(CL_DEVICE_TYPE_GPU) &&
vex::Filter::DoublePrecision &&
vex::Filter::Count(1)
);
if (ctx.queue().empty()) {
std::cerr << "GPUs with double precision not found." << std::endl;
return 1;
}
std::cout << ctx << std::endl;
std::vector<double> a(N, 1);
std::vector<double> b(N, 2);
std::vector<double> c(N);
vex::vector<double> A(ctx.queue(), a);
vex::vector<double> B(ctx.queue(), b);
vex::vector<double> C(ctx.queue(), N);
C = A + B;
vex::copy(C, c);
std::cout << c[42] << std::endl;
} catch (const cl::Error &err) {
std::cerr << "OpenCL error: " << err << std::endl;
return 1;
}
}

This program is much more concise (45 lines to be precise) and almost as effective. VexCL uses template metaprogramming techniques, so most of the work
is done at compilation time. The only considerable overhead the VexCL example has is dynamic construction of kernel source. But that is
a relatively
lightweight operation and it is performed only once, when an expression is first encountered in your program.

VexCL makes heavy use of C++11 features, so your compiler has to be modern enough. GCC versions 4.6 and above are fully supported. Microsoft Visual C++
2010 manages to compile the project with some features disabled: since it does not support variadic templates, only one-argument built-in functions are enabled;
user functions are not available at all.

## VexCL interface description

### Compute device selection

VexCL supports multi-device computations. Compute devices you use may even belong to different OpenCL platforms. For example,
a single VexCL context may
include AMD and NVIDIA graphic cards as well as Intel CPU. The easiest way to initialize VexCL is by using
the `vex::Context`

class. Its constructor
accepts a *device filter* or a functor acting on `cl::Device`

. Several standard
filters are defined. Compute devices may be filtered by name, vendor, platform, type (CPU or GPU), etc. Device filters may be combined with logical
operators. For example, the following piece of code lists all available GPUs supporting double precision computations:

vex::Context ctx(
vex::Filter::Type(CL_DEVICE_TYPE_GPU) && vex::Filter::DoublePrecision
);
std::cout << ctx << std::endl;

In case built-in functionality is not enough, users may provide their own filters.

Almost any class in VexCL accepts `std::vector`

of `cl::CommandQueue`

s as a constructor argument. Each command queue
is presumably located on a separate compute device. This list of queues may be obtained by a call to
the `queue()`

method of
the initialized
`vex::Context`

class, or be supplied by a user. This allows to easily incorporate VexCL into existing projects that have their own means of OpenCL initialization.

### Vector arithmetic

Once you have initialized the VexCL context, you can allocate device vectors on the associated devices. Each device in
the VexCL context gets its own share of vector
memory and computational work. Vectors in VexCL are partitioned across all devices given at construction time. All vectors use
the same partitioning scheme.
This guarantees that corresponding elements of two equally sized vectors are located on
the same compute devices and no inter-device data transfer is required for computations.

In the example below, device vector `x`

is allocated across all devices supporting double precision.

vex::Context ctx(vex::Filter::DoublePrecision);
const size_t N = 1024 * 1024;
vex::vector<double> x(ctx.queue(), N);

The default partitioning scheme is based on the measuring performance of the code `a = b + c;`

on every device in
the context, where `a`

,
`b`

, and `c`

are vectors residing on the benchmarked device. This test is performed first time any multi-device vector is allocated.
Each device gets part of a vector proportional to its performance. This is another overhead introduced when using
the VexCL library in a multi-device
context. But this overhead is easily removed by providing a device weighting function. If, for example, you have
a homogeneous set of compute devices, then
equal partitioning is the best choice. In order to skip the bandwidth test, you assign
the same weight to each device:

vex::partitioning_scheme<>::set(
[](const cl::Context&, const cl::Device&) { return 1.0; }
);

Note that you have to set the partitioning scheme *before* you allocate any vector. Otherwise,
the default partitioning scheme will be used. The partitioning behaviour may be altered only once. This guarantees that all vectors are consistently partitioned.

Once the device vectors are allocated, simple and intuitive arithmetic expressions may be used to alter their state. For every expression you use,
the appropriate
kernel is generated, compiled (first time it is encountered in your program), and called automatically. If you want to get sources of the generated kernels
output to the standard stream, define a `VEXCL_SHOW_KERNELS`

macro before including
the VexCL headers.

const size_t n = 1 << 20;
std::vector<float> x(n);
std::generate(x.begin(), x.end(), [](){ return (float)rand() / RAND_MAX; });
vex::vector<float> X(ctx.queue(), x);
vex::vector<float> Y(ctx.queue(), n);
vex::vector<float> Z(ctx.queue(), n);
Y = 42;
Z = sqrt(2 * X) + pow(cos(Y), 2.0);

Computational work is split between devices. Vector expressions supported by VexCL are
embarrassingly parallel, so if you have two or more compute devices,
then you should get linear speedup for your code. You can copy the result back to
the host or you can use `vex::vector::operator[]`

to read (or write)
vector elements directly. Though the latter technique is very ineffective and should be used for debugging purposes only:

copy(Z, x);
assert(x[42] == Z[42]);

### Reductions

Reduction of a vector to a single value is a commonly used operation. You use reduction to obtain
the sum of vector elements, or to find the maximum/minimum element
of a vector. The class template `vex::Reductor`

allows to perform reduction of an arbitrary expression. Supported types of reduction are
`SUM`

, `MIN`

, and `MAX`

. This is how you obtain
the sum of all elements in an expression:

vex::vector<double> x(ctx.queue(), 4096);
vex::vector<double> y(ctx.queue(), 4096);
vex::Reductor<double,vex::SUM> sum(ctx.queue());
std::cout << sum(x) << std::endl;
std::cout << sum(x * y) << std::endl;
std::cout << sum(x > 1) << std::endl;

Another example is finding the maximum distance from the origin for a set of 2D points. Point coordinates are stored in `x`

and `y`

vectors:

vex::Reductor<double, vex::MAX> max(ctx.queue());
std::cout << max(sqrt(pow(x, 2.0) + pow(y, 2.0))) << std::endl;

Note that the `vex::Reductor`

instance allocates a small temporary buffer on each device in its queue list at construction time, so for performance
reasons it should be allocated once and used throughout the application lifetime.

### User-defined functions

Simple functions to be used in kernels are easily defined in VexCL. All you need is to define
the function body, its return type, and the types of its arguments.
After that, the function may be used in vector expressions in the same way built-in functions are used. Note that
the function body has to be defined at global
scope and belongs to the `extern const char[]`

type. These requirements are necessary in order to use the body string as a template parameter.
The following example counts 2D points located in the first quadrant:

extern const char between_body[] = "return prm1 <= prm2 && prm2 <= prm3;";
UserFunction<between_body, bool(double, double, double)> between;
size_t points_in_1q(
const vex::Reductor<size_t, vex::SUM> &sum,
const vex::vector<double> &x,
const vex::vector<double> &y
)
{
return sum(between(0.0, atan2(y, x), M_PI/2));
}

Any valid vector expression may be used as a function argument. Note that function parameters in the body definition are always named as
`prm1`

, `prm2`

, etc. The function body does not have to be a one-liner: any valid sequence of OpenCL operators is allowed. Due to the fact
that OpenCL kernels are compiled at runtime, you will not get any compilation errors regarding your function until
the first expression using it is encountered in your program.

Custom functions may be used not only to introduce new functionality, but also for performance sake. See
the Kernel generation section for further explanations.

### Sparse matrix-vector products and stencil convolutions

One of the most common operations in linear algebra is matrix-vector multiplication.
The class `vex::SpMat`

holds the representation of a sparse
matrix, spanning several devices. Its constructor accepts a sparse matrix stored in CRS format.
On the compute devices, the matrix is stored in hybrid ELL-CRS format. In the example below,
the device matrix is constructed from host vectors and is used to compute the maximum residual value:

vex::SpMat<double> A(ctx.queue(), n, row.data(), col.data(), val.data());
r = f - A * u;
double res = max(fabs(r));

See file examples/cg.cpp for an example of implementing
the
conjugate gradient method for
the numerical solution of a system of linear equations.

Another commonly used operation is the convolution of a vector with a stencil. VexCL supports three kinds of stencils: simple stencils, generalized stencils,
and user-defined stencil operators. Simple stencil is based on a one-dimensional array and is used as is. Generalized stencil is based on a
dense two-dimensional matrix and is used together with a built-in or user-defined function of one argument. To define
a stencil operator, the user has to provide
its dimensions (width and center), and body string.

Simple stencil convolution comes in handy in many situations. For example, it allows us to apply a moving average filter to a device vector. All you need is
to construct a `vex::stencil`

object from `std::vector`

, a pair of iterators, or from an initializer list. You also need to specify
the position of the stencil center:

std::vector<double> sdata(5, 0.2);
vex::stencil<double> s(ctx.queue(), sdata, sdata.size() / 2);
y = x * s;

Generalized stencil is basically a small dense coefficient matrix. It is used in combination with a built-in or user-defined function of one argument.
For example, the following nonlinear vector operator:

y[i] = sin(x[i-1] - x[i]) + sin(x[i] - x[i+1]);

may be implemented as:

const uint rows = 2;
const uint cols = 3;
const uint center = 1;
vex::gstencil<double> S(ctx.queue(), rows, cols, center, {
1, -1, 0, 0, 1, -1 });
y = sin(x * S);

A user-defined function may be used together with generalized stencils. To implement the following operator:

y[i] = x[i] + pow(x[i-1] + x[i+1], 3);

you would write:

extern const char pow3_body[] = "return pow(prm1, 3);";
UserFunction<pow3_body, double(double)> pow3;
void pow3_oper(const vex::vector<double> &x, vex::vector<double> &y) {
gstencil<double> S(ctx.queue(), 1, 3, 1, {1, 0, 1});
y = x + pow3(x * S);
}

User-defined stencil operators allow to define efficient arbitrary stencils. You need to specify stencil dimensions and provide the body for the function
returning the local value of the stencil. Let us implement the previous example with
the help of a stencil operator. Note that in the body, string elements
of a vector that stencil is applied to are referred as `X[i]`

, where `i`

is relative to the stencil center:

extern const char pow3_op_body[] =
"double t = X[-1] + X[1];\n"
"return X[0] + t * t * t;"
const uint width = 3;
const uint center = 1;
vex::StencilOperator<double, width, center, pow3_op_body> pow3_op(ctx.queue());
y = pow3_op(x);

The latter implementation is more effective because it uses single kernel launch. You see, stencil convolution, as well as sparse matrix-vector
multiplication, is a two-step operation. First, halo points have to be exchanged between neighboring compute devices. Second,
the OpenCL kernel has to be
launched. Because of this, stencil convolution is allowed only in additive expressions. Internally, such expressions are computed separately. First, all
terms except stencil convolution are computed and stored to the result vector. Second, stencil convolution is computed and added to the result. So
the expression
`y = x + pow3(x * S);`

is functionally equivalent to:

y = x; y += pow3(x * S);

### Kernel generation

Each expression you use in your code results in the OpenCL kernel. For example, the following expression:

x = 2 * y - sin(z);

triggers generation, compilation, and launch of this kernel:

#if defined(cl_khr_fp64)
# pragma OPENCL EXTENSION cl_khr_fp64: enable
#elif defined(cl_amd_fp64)
# pragma OPENCL EXTENSION cl_amd_fp64: enable
#endif
kernel void Sub_Mul_cvsinv(
ulong n,
global double *res,
int prmll,
global double *prmlr,
global double *prmr1
)
{
size_t i = get_global_id(0);
size_t grid_size = get_num_groups(0) * get_local_size(0);
while (i < n) {
res[i] = ((prmll * prmlr[i]) - sin(prmr1[i]));
i += grid_size;
}
}

Kernel for each expression is generated and compiled only once (per OpenCL context), when it is first encountered in your program. Expression in the above
example corresponds to the expression tree in the figure below:

As you can see, the type of an expression tree depends on its operations and types of its operands. So an expression `2.0 * y - sin(z)`

would result in
a similar, but different tree, since the type of the constant here is `double`

instead of `int`

. This should be kept in mind if you want to limit
the number of generated kernels. Another thing to remember is that there is no way to tell if several terminals of a tree refer to the same data. Consider the
example from the User-defined functions section, where we had to count points in
the first quadrant. We could implement the example without using a custom function:

return sum(atan2(y, x) >= 0.0 && atan2(y, x) <= M_PI/2);

The resulting kernel would read data from vectors `x`

and `y`

twice, which is not very nice from
a performance standpoint.
The introduction of the `between`

function allows us to halve the number of global memory reads here!

### Custom kernels

As Kozma Prutkov repeatedly said, "One cannot embrace the unembraceable". So in order
to be usable, VexCL has to support custom kernels. When writing OpenCL kernel, you have to remember that VexCL vectors are distributed across all compute
devices present in the context. `vex::vector::operator()`

returns a `cl::Buffer`

object that stores part of a vector on a given device.
Other methods useful for kernel authors are `vex::vector::part_size()`

and `vex::vector::part_start()`

. The former returns
the size of a vector
partition located on a given device; the latter returns the number of first vector elements located on a given device. So `x.part_start(d+1) - x.part_start(d)`

is always equal to
`x.part_size(d)`

. If the result depends on the neighbor points, you have to remember that these points are possibly located on a different compute device.
In this case the exchange of these halo points has to be arranged manually.

The following example builds and launches a custom kernel for each device in the context:

vex::Context ctx( vex::Filter::Env );
const size_t n = 1 << 20;
vex::vector<float> x(ctx.queue(), n);
std::vector<cl::Kernel> kernel(ctx.size());
for(uint d = 0; d < ctx.size(); d++) {
cl::Program program = vex::build_sources(ctx.context(d), std::string(
"kernel void dummy(ulong size, global float *x)\n"
"{\n"
" size_t i = get_global_id(0);\n"
" if (i < size) x[i] = 4.2;\n"
"}\n"
));
kernel[d] = cl::Kernel(program, "dummy");
}
for(uint d = 0; d < ctx.size(); d++) {
kernel[d].setArg(0, static_cast<cl_ulong>(x.part_size());
kernel[d].setArg(1, x(d));
ctx.queue(d).enqueueNDRangeKernel(kernel[d], cl::NullRange, n, cl::NullRange);
}

In rare cases your kernel contains a syntax error, the error description will be output to
the standard error stream and a `cl::Error`

exception will be
thrown. By the way, VexCL overloads the output to a stream operator for `cl::Error`

objects to get more readable error messages.

## Scalability

Scalability of the library with respect to the number of compute devices is presented in this section. Effective performance (GFLOPS) and bandwidth
(GB/sec) were measured by launching a big number of test kernels on one, two, or three Nvidia Tesla C2070 cards.
The effect of adding a fourth, somewhat slower,
device (Intel Core i7) was tested as well. The results shown are averaged over 20 runs. The details of the experiments may be found in file
examples/benchmark.cpp. Basically,
the performance of the following code was measured:

a += b + c * d;
double s = sum(a * b);
y = x * s;
y += A * x;

As you can see, VexCL scales almost linearly with the addition of compute devices. Note that the performance and bandwidth for a stencil convolution operation are
much higher than for other primitives. This is due to the fact that much faster local (shared) memory is used in this algorithm, and formulas for effective
performance and bandwidth do not take this into account. Another thing worth noting is
the overall degradation of performance after an Intel CPU is added to the VexCL
context. The only primitive gaining speed from this addition is vector arithmetic. This is probably because
the performance of vector arithmetic was used as a basis for problem partitioning.

## Conclusion

VexCL allows to write compact and readable code without sacrificing too much performance. In the next article,
headmyshoulder will talk about integrating VexCL into odeint -
a modern C++ library
for numerical solution of ordinary differential equations. odeint now supports GPU computations both with CUDA by using
the
Thrust library and OpenCL by using VexCL. This allows us to compare
the performance of the two solutions.

## Acknowledgments

This work has been partially supported by the Russian Foundation for Basic Research (RFBR) grants No 12-01-00033 and 12-07-0007.