but beyond parallelizing the algorithm itself, a lot of work has to go into the surrounding machinery: platform and device preparation, buffer creation, and data transfers.
If you find yourself re-writing the same OpenCL stages for every new project, you can shorten the writing part with OOP: create an object that handles the specialized (not necessarily all) parts of OpenCL for you, following RAII-style rules, so that one line you write stands for tens of lines of OpenCL API calls and you no longer have to remember the order in which "resource release" operations must happen.
If you don't want to re-invent your own wheel, and you are ok with the bureaucratic latency of waiting for someone else's bugfixes and feature integrations, then there are other wheels:
SYCL:
#include <CL/sycl.hpp>
#include <vector>
#include <cstdlib>
using namespace cl::sycl;

std::vector<float> h_a(LENGTH);
std::vector<float> h_b(LENGTH);
std::vector<float> h_c(LENGTH);
std::vector<float> h_r(LENGTH, 0xdeadbeef);
int count = LENGTH;
for (int i = 0; i < count; i++) {
    h_a[i] = rand() / (float)RAND_MAX;
    h_b[i] = rand() / (float)RAND_MAX;
    h_c[i] = rand() / (float)RAND_MAX;
}
{
    buffer d_a(h_a);
    buffer d_b(h_b);
    buffer d_c(h_c);
    buffer d_r(h_r);
    queue myQueue;
    command_group(myQueue, [&]() {
        auto a = d_a.get_access<access::read>();
        auto b = d_b.get_access<access::read>();
        auto c = d_c.get_access<access::read>();
        auto r = d_r.get_access<access::write>();
        parallel_for(count, kernel_functor([=](id<> item) {
            int i = item.get_global(0);
            r[i] = a[i] + b[i] + c[i];
        }));
    });
}
ArrayFire:
#include <arrayfire.h>
#include <cstdio>
#include <cstdlib>
using namespace af;

int device = argc > 1 ? atoi(argv[1]) : 0;
af::setDevice(device);
af::info();
printf("Create a 5-by-3 matrix of random floats on the GPU\n");
array A = randu(5,3, f32);
af_print(A);
printf("Element-wise arithmetic\n");
array B = sin(A) + 1.5;
af_print(B);
printf("Negate the first three elements of second column\n");
B(seq(0, 2), 1) = B(seq(0, 2), 1) * -1;
af_print(B);
printf("Fourier transform the result\n");
array C = fft(B);
af_print(C);
printf("Grab last row\n");
array c = C.row(end);
af_print(c);
printf("Scan Test\n");
dim4 dims(16, 4, 1, 1);
array r = constant(2, dims);
af_print(r);
printf("Scan\n");
array S = af::scan(r, 0, AF_BINARY_MUL);
af_print(S);
printf("Create 2-by-3 matrix from host data\n");
float d[] = { 1, 2, 3, 4, 5, 6 };
array D(2, 3, d, afHost);
af_print(D);
printf("Copy last column onto first\n");
D.col(0) = D.col(end);
af_print(D);
printf("Sort A and print sorted array and corresponding indices\n");
array vals, inds;
sort(vals, inds, A);
af_print(vals);
af_print(inds);
CUDA has an easier entry point than OpenCL thanks to its runtime API. If you are ok with being restricted to Nvidia hardware, then have a look at this:
#include <stdio.h>

__global__
void saxpy(int n, float a, float *x, float *y)
{
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i < n) y[i] = a*x[i] + y[i];
}

int main(void)
{
    int N = 1<<20;
    float *x, *y, *d_x, *d_y;
    x = (float*)malloc(N*sizeof(float));
    y = (float*)malloc(N*sizeof(float));
    cudaMalloc(&d_x, N*sizeof(float));
    cudaMalloc(&d_y, N*sizeof(float));

    for (int i = 0; i < N; i++) {
        x[i] = 1.0f;
        y[i] = 2.0f;
    }

    cudaMemcpy(d_x, x, N*sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_y, y, N*sizeof(float), cudaMemcpyHostToDevice);

    saxpy<<<(N+255)/256, 256>>>(N, 2.0f, d_x, d_y);

    cudaMemcpy(y, d_y, N*sizeof(float), cudaMemcpyDeviceToHost);

    float maxError = 0.0f;
    for (int i = 0; i < N; i++)
        maxError = max(maxError, abs(y[i]-4.0f));
    printf("Max error: %f\n", maxError);

    cudaFree(d_x);
    cudaFree(d_y);
    free(x);
    free(y);
}
but if you prefer CUDA's driver API, its learning curve will be as steep as OpenCL's. Of course, with enough encapsulation you can make your own one-liner algorithms to sort, map and compute things. CUDA also ships many tools to aid developers in bugfixing, optimizing and computing (so you may not need to invent an FFT for yourself, but can directly use the relevant library, such as cuFFT, from the CUDA platform).