but beyond parallelizing the algorithm itself, a lot of work has to go into the surrounding machinery: platform and device preparation, buffer creation, and data transfers.
If you find yourself re-writing the same OpenCL stages for every new project, you can shorten the writing part with OOP: create an object that handles the specialized (not necessarily all) parts of OpenCL for you, following RAII-style rules, so that one line you write stands for tens of lines of OpenCL API calls and you no longer have to remember the order in which "resource release" operations must happen.
If you don't want to re-invent your own wheel, and you are ok with the bureaucratic latency of waiting for someone else's bugfixes and feature integrations, then there are other wheels:
SYCL:
#include <CL/sycl.hpp>
#include <vector>
#include <cstdlib>
using namespace cl::sycl;

std::vector<float> h_a(LENGTH);
std::vector<float> h_b(LENGTH);
std::vector<float> h_c(LENGTH);
std::vector<float> h_r(LENGTH, 0xdeadbeef);
int count = LENGTH;
for (int i = 0; i < count; i++) {
    h_a[i] = rand() / (float)RAND_MAX;
    h_b[i] = rand() / (float)RAND_MAX;
    h_c[i] = rand() / (float)RAND_MAX;
}
{
    buffer d_a(h_a);
    buffer d_b(h_b);
    buffer d_c(h_c);
    buffer d_r(h_r);
    queue myQueue;
    command_group(myQueue, [&]() {
        auto a = d_a.get_access<access::read>();
        auto b = d_b.get_access<access::read>();
        auto c = d_c.get_access<access::read>();
        auto r = d_r.get_access<access::write>();
        parallel_for(count, kernel_functor([=](id<> item) {
            int i = item.get_global(0);
            r[i] = a[i] + b[i] + c[i];
        }));
    });
}
ArrayFire:
#include <arrayfire.h>
#include <cstdio>
#include <cstdlib>
using namespace af;

int device = argc > 1 ? atoi(argv[1]) : 0;
af::setDevice(device);
af::info();
printf("Create a 5-by-3 matrix of random floats on the GPU\n");
array A = randu(5,3, f32);
af_print(A);
printf("Element-wise arithmetic\n");
array B = sin(A) + 1.5;
af_print(B);
printf("Negate the first three elements of second column\n");
B(seq(0, 2), 1) = B(seq(0, 2), 1) * -1;
af_print(B);
printf("Fourier transform the result\n");
array C = fft(B);
af_print(C);
printf("Grab last row\n");
array c = C.row(end);
af_print(c);
printf("Scan Test\n");
dim4 dims(16, 4, 1, 1);
array r = constant(2, dims);
af_print(r);
printf("Scan\n");
array S = af::scan(r, 0, AF_BINARY_MUL);
af_print(S);
printf("Create 2-by-3 matrix from host data\n");
float d[] = { 1, 2, 3, 4, 5, 6 };
array D(2, 3, d, afHost);
af_print(D);
printf("Copy last column onto first\n");
D.col(0) = D.col(end);
af_print(D);
printf("Sort A and print sorted array and corresponding indices\n");
array vals, inds;
sort(vals, inds, A);
af_print(vals);
af_print(inds);
CUDA has an easier entry point than OpenCL thanks to its runtime API. If you are ok with being restricted to Nvidia hardware, then have a look at this:
#include <stdio.h>

__global__
void saxpy(int n, float a, float *x, float *y)
{
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i < n) y[i] = a*x[i] + y[i];
}

int main(void)
{
    int N = 1<<20;
    float *x, *y, *d_x, *d_y;
    x = (float*)malloc(N*sizeof(float));
    y = (float*)malloc(N*sizeof(float));
    cudaMalloc(&d_x, N*sizeof(float));
    cudaMalloc(&d_y, N*sizeof(float));

    for (int i = 0; i < N; i++) {
        x[i] = 1.0f;
        y[i] = 2.0f;
    }

    cudaMemcpy(d_x, x, N*sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_y, y, N*sizeof(float), cudaMemcpyHostToDevice);

    saxpy<<<(N+255)/256, 256>>>(N, 2.0f, d_x, d_y);

    cudaMemcpy(y, d_y, N*sizeof(float), cudaMemcpyDeviceToHost);

    float maxError = 0.0f;
    for (int i = 0; i < N; i++)
        maxError = max(maxError, abs(y[i]-4.0f));
    printf("Max error: %f\n", maxError);

    cudaFree(d_x);
    cudaFree(d_y);
    free(x);
    free(y);
}
but if you prefer CUDA's driver API, its learning curve will be as steep as OpenCL's. Of course, with enough encapsulation you can make your own one-liner algorithms to sort, map and compute things. CUDA also ships many tools to aid developers in bugfixing, optimizing and computing (so you may not need to invent an FFT for yourself, but can directly use the relevant library, such as cuFFT, from the CUDA platform).