
<h1>Permutations with CUDA and OpenCL</h1>

<p>Finding lexicographical permutations on the GPU</p>
<ul class="download"><li><a href="PermCudaSrcVer100.zip">Download CUDA source - 5.4 KB</a></li>
<li><a href="PermOpenCLSrcVer100.zip">Download OpenCL source - 10 KB</a></li>
<li><a href="KB/recipes/PartTwoOfPermutationsInCS/TimePermutationsExe.zip">Download CPU benchmark - 30.6 KB</a></li></ul>

<ul>
<li><a href="#intro">Introduction</a></li>
<li><a href="#algo">Algorithm</a></li>
<li><a href="#cuda">CUDA Source Code</a></li>
<li><a href="#opencl">OpenCL Source Code</a></li>
<li><a href="#benchmark">Benchmark</a></li>
<li><a href="#interest">Points of Interest</a></li>
<li><a href="#conclusion">Conclusion</a></li>
<li><a href="#ref">References</a></li>
<li><a href="#history">History</a></li>
</ul>

<a name="intro"></a>
<h2>Introduction</h2>

<p>CUDA and OpenCL give any application access to the graphics processing unit for non-graphical computing, enabling massive general-purpose computation on GPUs. Before the advent of CUDA and OpenCL, GPGPU was done with graphics API shaders (HLSL and GLSL), which are awkward and unnatural for the task because the programmer has to 'bend' the shaders to do general-purpose programming. CUDA is available on NVIDIA graphics cards and x86 CPUs. OpenCL is available on NVIDIA and AMD graphics cards, x86 CPUs and ARM CPUs, whereas DirectCompute is only available on Windows Vista and Windows 7 with a DirectX 10/11 level graphics card. To read this article, the reader is expected to be fairly acquainted with how CUDA and OpenCL work, their memory models, and technical terms like kernel, host and device.</p>

<a name="algo"></a>
<h2>Algorithm</h2>

<p>The algorithm used in this article is basically the same as in my <a href="http://www.codeproject.com/Articles/21362/Permutations-in-C-Part-2">Permutations in C++, Part 2</a> article. It makes use of <a href="http://en.wikipedia.org/wiki/Factorial">factorial</a> decomposition to find lexicographical permutations and is zero based. As the reader can see from the rating of my previous article, I am not very good at explaining this algorithm, so the reader may want to read an explanation elsewhere as well. To explain the algorithm, we use two sets of elements: the first is the unpicked set, which initially holds all the elements, and the second is the picked set, which is initially empty but will eventually contain the final permutation. Elements move one by one from the unpicked set to the picked set. Let us walk through an example with an unpicked set of {0,1,2}. By computing the factorial of 3, we know that 3 elements have a total of 6 different permutations. Note: the factorial operation is written with the <code>!</code> sign; for example, the factorial of 2 is simply written as <code>2!</code>, and the factorial of 1 is 1.</p>

<pre>0,1,2
0,2,1
1,0,2
1,2,0
2,0,1
2,1,0
</pre>

<p>From this list of (zero-based) lexicographical permutations, we know that the first permutation (0th) is 0,1,2 and the last permutation (5th) is 2,1,0. Let us find the 4th permutation (which, from the list above, we know is 2,0,1) using factorial decomposition.</p>

<pre>2! * 0 = 0
2! * 1 = 2
2! * 2 = 4
</pre>

<p>The algorithm requires us to find the highest product that does not exceed our nth index, which is 4. The 3rd product equals 4, and the 3rd element of the unpicked set {0,1,2} is 2, so the first element of the permutation is 2. Before finding the 2nd element, we must subtract the found product from our nth index: 4 - 4 = 0. <code>{2,?,?}</code> is our current picked set, with <code>{0,1}</code> remaining in the unpicked set.</p>

<pre>1! * 0 = 0
1! * 1 = 1
</pre>

<p>Again, we find the highest product that does not exceed our subtracted nth index, which is 0. The 1st product equals 0, and the 1st element of the unpicked set <code>{0,1}</code> is 0, so the second element is 0. <code>{2,0,?}</code> is our current partially found permutation, with <code>{1}</code> remaining in the unpicked set. We need not compute any further because only 1 remains in the unpicked set, so the 4th permutation is <code>{2,0,1}</code>. Let's try another example! This time, we will find the 3rd permutation of the 3 elements <code>{0,1,2}</code>.</p>

<pre>2! * 0 = 0
2! * 1 = 2
2! * 2 = 4
</pre>

<p>As the reader should know by now, we need to find the highest product that does not exceed our nth index, which is 3. 3 is less than the 3rd product (4) but not less than the 2nd product (2), and the 2nd element of the unpicked set <code>{0,1,2}</code> is 1, so the first element is 1. Let us proceed to find the 2nd element of the picked set. First, we must subtract the found product from our nth index: 3 - 2 = 1. Note: <code>{1,?,?}</code> is our current picked set, with <code>{0,2}</code> remaining in the unpicked set.</p>

<pre>1! * 0 = 0
1! * 1 = 1
</pre>

<p>Again, we find the highest product that does not exceed our subtracted nth index, which is 1. The 2nd product equals 1, and the 2nd element of the unpicked set <code>{0,2}</code> is 2, so the second element is 2. <code>{1,2,?}</code> is our current partially found permutation, with <code>{0}</code> remaining in the unpicked set. Because only 0 remains in the unpicked set, the 3rd permutation is <code>{1,2,0}</code>.</p>
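<p>To make the walkthrough concrete, here is a minimal CPU-side sketch of the same factorial decomposition (plain C++ of my own, not the GPU code from the downloads; the kernels later in the article implement the same idea with a precomputed product table):</p>

<pre lang="C++">
#include &lt;cstdio&gt;
#include &lt;vector&gt;

// A direct transcription of the walkthrough: find the zero-based nth
// lexicographic permutation of {0,1,...,n-1} by factorial decomposition.
void NthPermutation(int n, long long nth, std::vector&lt;int&gt;&amp; picked)
{
    std::vector&lt;int&gt; unpicked;
    for (int i = 0; i &lt; n; ++i)
        unpicked.push_back(i);

    picked.clear();
    while (unpicked.size() &gt; 1)
    {
        // factorial of the number of elements that remain after this pick
        long long fact = 1;
        for (size_t i = 2; i &lt; unpicked.size(); ++i)
            fact *= (long long)i;

        // "find the highest product fact * j that does not exceed nth"
        size_t j = 0;
        while (j + 1 &lt; unpicked.size() &amp;&amp; fact * (long long)(j + 1) &lt;= nth)
            ++j;

        nth -= fact * (long long)j;
        picked.push_back(unpicked[j]);           // pick the (j+1)th unpicked element
        unpicked.erase(unpicked.begin() + j);
    }
    picked.push_back(unpicked[0]);               // only one element left
}

int main()
{
    std::vector&lt;int&gt; picked;
    NthPermutation(3, 4, picked);                // expect 2,0,1 as in the walkthrough
    for (size_t i = 0; i &lt; picked.size(); ++i)
        printf("%d ", picked[i]);
    printf("\n");
    return 0;
}
</pre>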

<a name="cuda"></a>
<h2>CUDA Source Code</h2>

<p>I present the source code for the CUDA kernel below. The kernel function is called <code>PermuteHybrid</code>. The <code>arrDest</code> parameter is an uninitialized array which will be filled with the results after the CUDA operation is completed. <code>offset</code> is the (zero-based) index of the first permutation to compute and <code>Max</code> is the number of permutations to compute. As a simple example to illustrate the use of the <code>offset</code> and <code>Max</code> parameters, let us assume you had 1000 permutations to find but your GPU only has enough memory to hold 500 permutations at any one time, so you have to split the CUDA operation into 2: the first operation computes the 0th to 499th permutations (<code>offset</code>=0 and <code>Max</code>=500) while the second operation computes the 500th to 999th permutations (<code>offset</code>=500 and <code>Max</code>=500). The constant array <code>arr</code> stores the precomputed products used by the factorial decomposition. Constant memory access in CUDA is faster than global memory access, but constant memory is usually limited to 64 KB and cannot be written to by the kernel. When we are finding the kth element out of a set of <code>n</code>, we use the factorial of the number of elements that remain unpicked after that choice; each row of <code>arr</code> corresponds to one such factorial, and column <code>j</code> stores the product of that factorial and <code>j</code>. Each element of <code>arr</code> is a 64-bit integer, which can store factorials up to 20!. The <code>next_permutation</code> function is copied and adapted from the Visual C++ 10 STL. <code>NEXT_PERM_LOOP</code> specifies how many additional consecutive permutations each work item generates with <code>next_permutation</code> after the one found by factorial decomposition.</p>

<pre lang="C++">
#define NUMBER_OF_ELEMENTS 5
#define BLOCK_DIM 1024
#define OFFSET 0
// When MAX_PERM = 0, means find all permutations
#define MAX_PERM 0
#define NEXT_PERM_LOOP 1

__constant__ long long arr[20][20] = { /*Not shown here to save space*/ };

// function to swap character 
// a - the character to swap with b
// b - the character to swap with a
__device__ void swap(
    char* a, 
    char* b)
{
    char tmp = *a;
    *a = *b;
    *b = tmp;
}


// function to reverse the array (sub array in array)
// first - 1st character in the array (sub-array in array)
// last - 1 character past the last character
__device__ void reverse(
    char* first, 
    char* last)
{    
    for (; first != last &amp;&amp; first != --last; ++first)
        swap(first, last);
}


// function to find the next permutation (sub array in array)
// first - 1st character in the array (sub-array in array)
// last - 1 character past the last character
__device__ void next_permutation(
    char* first, 
    char* last)
{
    char* next = last;
    --next;
    if(first == last || first == next)
        return;

    while(true)
    {
        char* next1 = next;
        --next;
        if(*next &lt; *next1)
        {
            char* mid = last;
            --mid;
            for(; !(*next &lt; *mid); --mid)
                ;
            swap(next, mid);
            reverse(next1, last);
            return;
        }

        if(next == first)
        {
            reverse(first, last);
            return;
        }
    }
}    

__global__ void PermuteHybrid(char* arrDest, long long* offset, long long* Max)
{
    long long index = threadIdx.x + blockIdx.x * blockDim.x;
    if(index &gt;= (*Max/(NEXT_PERM_LOOP+1)))
        return;

    index *= NEXT_PERM_LOOP+1;
    long long tmpindex = index;
    
    index += *offset;
    
    char arrSrc[NUMBER_OF_ELEMENTS];
    char arrTaken[NUMBER_OF_ELEMENTS];
    for(char i=0; i&lt;NUMBER_OF_ELEMENTS; ++i)
    {
        arrSrc[i] = i;
        arrTaken[i] = 0;
    }

    char size = NUMBER_OF_ELEMENTS;
    for(char i=NUMBER_OF_ELEMENTS-1; i&gt;=0; --i)
    {
        for(char j=i; j&gt;=0; --j)
        {
            if(index &gt;= arr[i][j])
            {
                char foundcnt = 0;
                index = index - arr[i][j];
                for(char k=0;k&lt;NUMBER_OF_ELEMENTS; ++k)
                {
                    if(arrTaken[k]==0) // not taken
                    {
                        if(foundcnt==j)
                        {
                            arrTaken[k] = 1; // set to taken
                            arrDest[ (tmpindex*NUMBER_OF_ELEMENTS) + (NUMBER_OF_ELEMENTS-size) ] = arrSrc[k];
                            break;
                        }
                        foundcnt++;
                    }
                }
                break;
            }
        }
        --size;
    }

    long long idx = tmpindex*NUMBER_OF_ELEMENTS;
    for(char a=1; a&lt;NEXT_PERM_LOOP+1; ++a)
    {
        long long idx2 = a*NUMBER_OF_ELEMENTS;
        for(char i=0; i&lt;NUMBER_OF_ELEMENTS; ++i)
        {
            arrDest[ idx + idx2 + i ] =
                arrDest[ idx + ((a-1)*NUMBER_OF_ELEMENTS) + i ];
        }
        next_permutation(arrDest + idx + idx2, 
            arrDest+idx + idx2 + NUMBER_OF_ELEMENTS);
    }
}
</pre>
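<p>The constant table <code>arr</code> is elided in the listing above to save space. Based on how the kernel indexes it, <code>arr[i][j]</code> holds <code>i! * j</code> (only the entries with <code>j &lt;= i</code> are ever read). Below is a minimal host-side sketch of my own that generates such a table; the downloadable source simply hard-codes the values instead:</p>

<pre lang="C++">
#include &lt;cstdio&gt;

// Sketch: print the 20x20 constant table used by the kernel, assuming
// arr[i][j] = i! * j. Even 19! * 19 still fits in a 64-bit integer.
int main()
{
    long long fact[20];
    fact[0] = 1;
    for (int i = 1; i &lt; 20; ++i)
        fact[i] = fact[i - 1] * (long long)i;

    printf("__constant__ long long arr[20][20] = {\n");
    for (int i = 0; i &lt; 20; ++i)
    {
        printf("    { ");
        for (int j = 0; j &lt; 20; ++j)
            printf("%lldLL%s", fact[i] * (long long)j, j == 19 ? " " : ", ");
        printf("}%s\n", i == 19 ? "" : ",");
    }
    printf("};\n");
    return 0;
}
</pre>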

<p>To control the operation parameters, we set the macros accordingly. As an example, the macros below find the 0th to 29th permutations of 5 elements. If the kernel cannot be run, the reader might want to experiment with values of <code>BLOCK_DIM</code> in multiples of 256.</p>

<pre lang="C++">#define NUMBER_OF_ELEMENTS 5
#define BLOCK_DIM 1024
#define OFFSET 0
// When MAX_PERM = 0, means find all permutations
#define MAX_PERM 30
#define NEXT_PERM_LOOP 1
</pre>

<p>The host code is presented below. Host code here refers to code executed on the host CPU, not on the device (GPU).</p>

<pre lang="C++">int main()
{
    long long Max = 0;

    if(MAX_PERM==0)
        Max = Fact(NUMBER_OF_ELEMENTS);
    else
        Max = MAX_PERM;

    long long offset = OFFSET;
    char* arrDest = new char[Max*NUMBER_OF_ELEMENTS];

    // Add vectors in parallel.
    cudaError_t cudaStatus = PermuteWithCuda(arrDest, &amp;offset, &amp;Max);
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "PermuteWithCuda failed!");
        return 1;
    }

    check(arrDest, Max);
    //display(arrDest, Max);

    printf("\nExecuted program successfully.\n");

    // cudaDeviceReset must be called before exiting in order for profiling and
    // tracing tools such as Parallel Nsight and Visual Profiler to show complete traces.
    cudaStatus = cudaDeviceReset();
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaDeviceReset failed!");
        return 1;
    }
    delete [] arrDest;
    return 0;
}

cudaError_t PermuteWithCuda(char* arrDest, long long* offset, long long* Max)
{
    char* dev_arrDest = 0;
    long long *dev_offset = 0;
    long long *dev_Max = 0;

    cudaError_t cudaStatus;

    // Choose which GPU to run on, change this on a multi-GPU system.
    cudaStatus = cudaSetDevice(0);
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaSetDevice failed!  Do you have a CUDA-capable GPU installed?");
        goto Error;
    }

    // Allocate GPU buffers.
    cudaStatus = cudaMalloc((void**)&amp;dev_arrDest, (*Max)*NUMBER_OF_ELEMENTS);
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed!");
        goto Error;
    }

    cudaStatus = cudaMalloc((void**)&amp;dev_offset, sizeof(long long));
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed!");
        goto Error;
    }

    cudaStatus = cudaMalloc((void**)&amp;dev_Max, sizeof(long long));
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed!");
        goto Error;
    }

    // Copy variable from host memory to GPU buffer.
    cudaStatus = cudaMemcpy(dev_offset, offset, sizeof(long long), cudaMemcpyHostToDevice);
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaMemcpy failed!");
        goto Error;
    }

    // Copy variable from host memory to GPU buffer.
    cudaStatus = cudaMemcpy(dev_Max, Max, sizeof(long long), cudaMemcpyHostToDevice);
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaMemcpy failed!");
        goto Error;
    }

    // Launch a kernel on the GPU with one thread for each element.
    int blocks = (*Max)/BLOCK_DIM;
    if((*Max)%BLOCK_DIM != 0)
        ++blocks;

    if(NEXT_PERM_LOOP&gt;0)
    {
        if(blocks%(NEXT_PERM_LOOP+1)==0)
            blocks = blocks/(NEXT_PERM_LOOP+1);
        else
            blocks = blocks/(NEXT_PERM_LOOP+1) + 1;
    }

    ++blocks;

    UINT wTimerRes = 0;
    bool init = InitMMTimer(wTimerRes);
    DWORD startTime = timeGetTime();

    PermuteHybrid&lt;&lt;&lt;blocks, BLOCK_DIM&gt;&gt;&gt;(dev_arrDest, dev_offset, dev_Max);

    // cudaDeviceSynchronize waits for the kernel to finish, and returns
    // any errors encountered during the launch.
    cudaStatus = cudaDeviceSynchronize();
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaDeviceSynchronize returned error code %d after launching addKernel!\n", cudaStatus);
        goto Error;
    }

    DWORD endTime = timeGetTime();
    printf("Timing: %dms\n", endTime-startTime);

    DestroyMMTimer(wTimerRes, init);

    // Copy output vector from GPU buffer to host memory.
    cudaStatus = cudaMemcpy(arrDest, dev_arrDest, (*Max) * NUMBER_OF_ELEMENTS, cudaMemcpyDeviceToHost);
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaMemcpy failed!");
        goto Error;
    }

Error:
    // Free all device allocations, not just the destination array.
    cudaFree(dev_arrDest);
    cudaFree(dev_offset);
    cudaFree(dev_Max);
    
    return cudaStatus;
}

long long Fact(long long n)
{
    long long fact = 1;
    for (long long i = 2; i &lt;= n; ++i)
    {
        fact *= i;
    }

    return fact;
}
</pre>
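<p>To illustrate the <code>offset</code>/<code>Max</code> splitting described earlier, a hypothetical driver (this snippet is my own sketch, not part of the downloadable source) could call <code>PermuteWithCuda</code> twice, once per chunk:</p>

<pre lang="C++">
// Hypothetical driver: compute all permutations in two chunks when the whole
// result does not fit in GPU memory at once (error handling omitted).
long long total = Fact(NUMBER_OF_ELEMENTS);
long long half  = total / 2;

char* chunk = new char[(total - half) * NUMBER_OF_ELEMENTS];

long long offset = 0;
long long Max = half;
PermuteWithCuda(chunk, &amp;offset, &amp;Max);     // permutations 0 .. half-1
// ... use or save the first chunk here ...

offset = half;
Max = total - half;
PermuteWithCuda(chunk, &amp;offset, &amp;Max);     // permutations half .. total-1
// ... use or save the second chunk here ...

delete [] chunk;
</pre>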


<p>I have included a <code>check</code> function to verify the generated permutations against <code>next_permutation</code>, and a <code>display</code> function to print the results. The reader might want to leave the <code>display</code> call commented out when the number of permutations is massive.</p>

<pre lang="C++">void check(char* arrDest, long long Max)
{
    printf("\nChecking...\n");

    char check[NUMBER_OF_ELEMENTS];
    for(int i=0; i&lt;NUMBER_OF_ELEMENTS ;++i)
    {
        check[i] = i;
    }
    
    if(OFFSET!=0)
    {
        for(int i=0; i&lt;OFFSET; ++i)
        {
            std::next_permutation(check, check+NUMBER_OF_ELEMENTS);
        }
    }

    for(int i=0; i&lt;Max ;++i)
    {
        for(int j=0;j&lt;NUMBER_OF_ELEMENTS;++j)
        {
            if(arrDest[i*NUMBER_OF_ELEMENTS+j] != check[j])
            {
                fprintf(stderr, "Diff check failed at %d!", i);
                return;
            }
        }

        std::next_permutation(check, check+NUMBER_OF_ELEMENTS);
    }
}

void display(char* arrDest, long long Max)
{
    for(int i=0; i&lt;Max ;++i)
    {
        for(int j=0;j&lt;NUMBER_OF_ELEMENTS;++j)
            printf("%d ", arrDest[i*NUMBER_OF_ELEMENTS+j]);
        printf("\n");
    }
}
</pre>

<a name="opencl"></a>
<h2>OpenCL Source Code</h2>

<p>The OpenCL kernel function is presented below. The OpenCL code is similar to the CUDA one, except that the host code and kernel code are written in two separate source files. This leads to the need to set the <code>NUMBER_OF_ELEMENTS</code> macro twice (once in the host file and once in the kernel source file). <code>arr</code> is a one-dimensional array, as opposed to the two-dimensional array in CUDA, because a two-dimensional constant array could not be defined in the OpenCL kernel here.</p>

<pre lang="C++">#pragma OPENCL EXTENSION cl_khr_byte_addressable_store: enable

#define NUMBER_OF_ELEMENTS 5
#define NEXT_PERM_LOOP 1

__constant long arr[400] = { /*Not shown here to save space*/ };

// function to swap character 
// a - the character to swap with b
// b - the character to swap with a
void swap(
    __global char* a, 
    __global char* b)
{
    char tmp = *a;
    *a = *b;
    *b = tmp;
}


// function to reverse the array (sub array in array)
// first - 1st character in the array (sub-array in array)
// last - 1 character past the last character
void reverse(
    __global char* first, 
    __global char* last)
{    
    for (; first != last &amp;&amp; first != --last; ++first)
        swap(first, last);
}


// function to find the next permutation (sub array in array)
// first - 1st character in the array (sub-array in array)
// last - 1 character past the last character
void next_permutation(
    __global char* first, 
    __global char* last)
{
    __global char* next = last;
    --next;
    if(first == last || first == next)
        return;

    while(true)
    {
        __global char* next1 = next;
        --next;
        if(*next &lt; *next1)
        {
            __global char* mid = last;
            --mid;
            for(; !(*next &lt; *mid); --mid)
                ;
            swap(next, mid);
            reverse(next1, last);
            return;
        }

        if(next == first)
        {
            reverse(first, last);
            return;
        }
    }
}    

__kernel void PermuteHybrid(__global char* arrDest, __global long* offset, __global long* Max)
{
    long index = get_global_id(0);
    if(index&gt;=(*Max/(NEXT_PERM_LOOP+1)))
        return;

    index *= NEXT_PERM_LOOP+1;
    long tmpindex = index;

    index += *offset;
    
    char arrSrc[NUMBER_OF_ELEMENTS];
    char arrTaken[NUMBER_OF_ELEMENTS];
    for(int i=0; i&lt;NUMBER_OF_ELEMENTS; ++i)
    {
        arrSrc[i] = i;
        arrTaken[i] = 0;
    }

    int size = NUMBER_OF_ELEMENTS;
    for(char i=NUMBER_OF_ELEMENTS-1; i&gt;=0; --i)
    {
        for(char j=i; j&gt;=0; --j)
        {
            if(index &gt;= arr[i*20+j])
            {
                char foundcnt = 0;
                index = index - arr[i*20+j];
                for(char k=0;k&lt;NUMBER_OF_ELEMENTS; ++k)
                {
                    if(arrTaken[k]==0) // not taken
                    {
                        if(foundcnt==j)
                        {
                            arrTaken[k] = 1; // set to taken
                            arrDest[ (tmpindex*NUMBER_OF_ELEMENTS) + (NUMBER_OF_ELEMENTS-size) ] = arrSrc[k];
                            break;
                        }
                        foundcnt++;
                    }
                }
                break;
            }
        }
        --size;
    }
    
    long idx = tmpindex*NUMBER_OF_ELEMENTS;
    for(char a=1; a&lt;NEXT_PERM_LOOP+1; ++a)
    {
        long idx2 = a*NUMBER_OF_ELEMENTS;
        for(char i=0; i&lt;NUMBER_OF_ELEMENTS; ++i)
        {
            arrDest[ idx + idx2 + i ] =
                arrDest[ idx + ((a-1)*NUMBER_OF_ELEMENTS) + i ];
        }
        next_permutation(arrDest + idx + idx2, 
            arrDest+idx + idx2 + NUMBER_OF_ELEMENTS);
    }
}
</pre>

<p>The host code for OpenCL is presented below. As the reader can see, the OpenCL host code looks more complex than the CUDA version, but both do mostly the same things; CUDA's compiler and runtime simply hide more of the setup from the host code.</p>

<pre lang="C++">#ifdef __APPLE__
#include &lt;OpenCL/cl.h&gt;
#else
#include &lt;CL/cl.h&gt;
#endif

#define NUMBER_OF_ELEMENTS 5
#define LOCALGROUPS 1024
#define OFFSET 0
// When MAX_PERM = 0, means find all permutations
#define MAX_PERM 0
#define NEXT_PERM_LOOP 1

// Function Prototypes
long long Fact(long long n);
void check(char* arrDest, long long Max);
void display(char* arrDest, long long Max);
bool InitMMTimer(UINT wTimerRes);
void DestroyMMTimer(UINT wTimerRes, bool init);

///
//  Create an OpenCL context on the first available platform using
//  either a GPU or CPU depending on what is available.
//
cl_context CreateContext()
{
    cl_int errNum;
    cl_uint numPlatforms;
    cl_platform_id firstPlatformId;
    cl_context context = NULL;

    // First, select an OpenCL platform to run on.  For this example, we
    // simply choose the first available platform.  Normally, you would
    // query for all available platforms and select the most appropriate one.
    errNum = clGetPlatformIDs(1, &amp;firstPlatformId, &amp;numPlatforms);
    if (errNum != CL_SUCCESS || numPlatforms &lt;= 0)
    {
        std::cerr &lt;&lt; "Failed to find any OpenCL platforms." &lt;&lt; std::endl;
        return NULL;
    }

    // Next, create an OpenCL context on the platform.  Attempt to
    // create a GPU-based context, and if that fails, try to create
    // a CPU-based context.
    cl_context_properties contextProperties[] =
    {
        CL_CONTEXT_PLATFORM,
        (cl_context_properties)firstPlatformId,
        0
    };
    context = clCreateContextFromType(contextProperties, CL_DEVICE_TYPE_GPU,
        NULL, NULL, &amp;errNum);
    if (errNum != CL_SUCCESS)
    {
        std::cout &lt;&lt; "Could not create GPU context, trying CPU..." &lt;&lt; std::endl;
        context = clCreateContextFromType(contextProperties, CL_DEVICE_TYPE_CPU,
            NULL, NULL, &amp;errNum);
        if (errNum != CL_SUCCESS)
        {
            std::cerr &lt;&lt; "Failed to create an OpenCL GPU or CPU context." &lt;&lt; std::endl;
            return NULL;
        }
    }

    return context;
}

///
//  Create a command queue on the first device available on the
//  context
//
cl_command_queue CreateCommandQueue(cl_context context, cl_device_id *device)
{
    cl_int errNum;
    cl_device_id *devices;
    cl_command_queue commandQueue = NULL;
    size_t deviceBufferSize = -1;

    // First get the size of the devices buffer
    errNum = clGetContextInfo(context, CL_CONTEXT_DEVICES, 0, NULL, &amp;deviceBufferSize);
    if (errNum != CL_SUCCESS)
    {
        std::cerr &lt;&lt; "Failed call to clGetContextInfo(...,GL_CONTEXT_DEVICES,...)";
        return NULL;
    }

    if (deviceBufferSize &lt;= 0)
    {
        std::cerr &lt;&lt; "No devices available.";
        return NULL;
    }

    // Allocate memory for the devices buffer
    devices = new cl_device_id[deviceBufferSize / sizeof(cl_device_id)];
    errNum = clGetContextInfo(context, CL_CONTEXT_DEVICES, deviceBufferSize, devices, NULL);
    if (errNum != CL_SUCCESS)
    {
        delete [] devices;
        std::cerr &lt;&lt; "Failed to get device IDs";
        return NULL;
    }

    // In this example, we just choose the first available device.  In a
    // real program, you would likely use all available devices or choose
    // the highest performance device based on OpenCL device queries
    commandQueue = clCreateCommandQueue(context, devices[0], 0, NULL);
    if (commandQueue == NULL)
    {
        delete [] devices;
        std::cerr &lt;&lt; "Failed to create commandQueue for device 0";
        return NULL;
    }

    *device = devices[0];
    delete [] devices;
    return commandQueue;
}

///
//  Create an OpenCL program from the kernel source file
//
cl_program CreateProgram(cl_context context, cl_device_id device, const char* fileName)
{
    cl_int errNum;
    cl_program program;

    std::ifstream kernelFile(fileName, std::ios::in);
    if (!kernelFile.is_open())
    {
        std::cerr &lt;&lt; "Failed to open file for reading: " &lt;&lt; fileName &lt;&lt; std::endl;
        return NULL;
    }

    std::ostringstream oss;
    oss &lt;&lt; kernelFile.rdbuf();

    std::string srcStdStr = oss.str();
    const char *srcStr = srcStdStr.c_str();
    program = clCreateProgramWithSource(context, 1,
        (const char**)&amp;srcStr,
        NULL, NULL);
    if (program == NULL)
    {
        std::cerr &lt;&lt; "Failed to create CL program from source." &lt;&lt; std::endl;
        return NULL;
    }

    errNum = clBuildProgram(program, 0, NULL, NULL, NULL, NULL);
    if (errNum != CL_SUCCESS)
    {
        // Determine the reason for the error
        char buildLog[16384];
        clGetProgramBuildInfo(program, device, CL_PROGRAM_BUILD_LOG,
            sizeof(buildLog), buildLog, NULL);

        std::cerr &lt;&lt; "Error in kernel: " &lt;&lt; std::endl;
        std::cerr &lt;&lt; buildLog;
        clReleaseProgram(program);
        return NULL;
    }

    return program;
}

///
//  Create memory objects used as the arguments to the kernel
//  The kernel takes three arguments: result (output), a (input),
//  and b (input)
//
bool CreateMemObjects(cl_context context, cl_mem memObjects[3],
    char *a, long long* offset, long long* Max)
{
    // Output buffer; no host pointer is needed here because the result is
    // read back with clEnqueueReadBuffer after the kernel completes.
    memObjects[0] = clCreateBuffer(context, CL_MEM_WRITE_ONLY,
        (*Max) * NUMBER_OF_ELEMENTS, NULL, NULL);
    memObjects[1] = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
        sizeof(long long), offset, NULL);
    memObjects[2] = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
        sizeof(long long), Max, NULL);

    if (memObjects[0] == NULL || memObjects[1] == NULL || memObjects[2] == NULL)
    {
        std::cerr &lt;&lt; "Error creating memory objects." &lt;&lt; std::endl;
        return false;
    }

    return true;
}

///
//  Cleanup any created OpenCL resources
//
void Cleanup(cl_context context, cl_command_queue commandQueue,
    cl_program program, cl_kernel kernel, cl_mem memObjects[3])
{
    for (int i = 0; i &lt; 3; i++)
    {
        if (memObjects[i] != 0)
            clReleaseMemObject(memObjects[i]);
    }
    if (commandQueue != 0)
        clReleaseCommandQueue(commandQueue);

    if (kernel != 0)
        clReleaseKernel(kernel);

    if (program != 0)
        clReleaseProgram(program);

    if (context != 0)
        clReleaseContext(context);

}

///
//    main() for HelloWorld example
//
int main(int argc, char** argv)
{
    cl_context context = 0;
    cl_command_queue commandQueue = 0;
    cl_program program = 0;
    cl_device_id device = 0;
    cl_kernel kernel = 0;
    cl_mem memObjects[3] = { 0, 0, 0 };
    cl_int errNum;
    // Create an OpenCL context on first available platform
    context = CreateContext();
    if (context == NULL)
    {
        std::cerr &lt;&lt; "Failed to create OpenCL context." &lt;&lt; std::endl;
        return 1;
    }

    // Create a command-queue on the first device available
    // on the created context
    commandQueue = CreateCommandQueue(context, &amp;device);
    if (commandQueue == NULL)
    {
        Cleanup(context, commandQueue, program, kernel, memObjects);
        return 1;
    }

    // Create OpenCL program from HelloWorld.cl kernel source
    char file[] = "C:\\Users\\wong\\Documents\\Visual Studio 2010\\Projects\\PermOpenCL3\\PermOpenCL3\\PermOpenCL3.cl";
    program = CreateProgram(context, device, file);
    if (program == NULL)
    {
        Cleanup(context, commandQueue, program, kernel, memObjects);
        return 1;
    }

    // Create OpenCL kernel
    kernel = clCreateKernel(program, "PermuteHybrid", NULL);
    if (kernel == NULL)
    {
        std::cerr &lt;&lt; "Failed to create kernel" &lt;&lt; std::endl;
        Cleanup(context, commandQueue, program, kernel, memObjects);
        return 1;
    }

    size_t param_value=0;
    size_t param_value_size_ret=0;
    clGetKernelWorkGroupInfo (kernel, device,
        CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE,
        sizeof(size_t),
        (void*)&amp;param_value,
        &amp;param_value_size_ret);

    if(param_value_size_ret!=sizeof(size_t))
    {
        std::cerr &lt;&lt; "clGetKernelWorkGroupInfo return different size for CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE" &lt;&lt; std::endl;
        Cleanup(context, commandQueue, program, kernel, memObjects);
        return 1;
    }

    size_t param_value2=0;
    size_t param_value_size_ret2=0;
    clGetKernelWorkGroupInfo (kernel, device,
        CL_KERNEL_WORK_GROUP_SIZE,
        sizeof(size_t),
        (void*)&amp;param_value2,
        &amp;param_value_size_ret2);

    if(param_value_size_ret2!=sizeof(size_t))
    {
        std::cerr &lt;&lt; "clGetKernelWorkGroupInfo return different size for CL_KERNEL_WORK_GROUP_SIZE" &lt;&lt; std::endl;
        Cleanup(context, commandQueue, program, kernel, memObjects);
        return 1;
    }

    long long offset = OFFSET;
    long long Max = 0;

    if(MAX_PERM==0)
        Max = Fact(NUMBER_OF_ELEMENTS);
    else
        Max = MAX_PERM;

    char* arrDest = new char[Max*NUMBER_OF_ELEMENTS];

    if (!CreateMemObjects(context, memObjects, arrDest, &amp;offset, &amp;Max))
    {
        Cleanup(context, commandQueue, program, kernel, memObjects);
        return 1;
    }

    // Set the kernel arguments (result, a, b)
    errNum = clSetKernelArg(kernel, 0, sizeof(cl_mem), &amp;memObjects[0]);
    errNum |= clSetKernelArg(kernel, 1, sizeof(cl_mem), &amp;memObjects[1]);
    errNum |= clSetKernelArg(kernel, 2, sizeof(cl_mem), &amp;memObjects[2]);
    if (errNum != CL_SUCCESS)
    {
        std::cerr &lt;&lt; "Error setting kernel arguments." &lt;&lt; std::endl;
        Cleanup(context, commandQueue, program, kernel, memObjects);
        return 1;
    }

    int GlobalGroups = Max/LOCALGROUPS;
    if(Max%LOCALGROUPS != 0)
        ++GlobalGroups;

    if(NEXT_PERM_LOOP&gt;0)
    {
        if(GlobalGroups%(NEXT_PERM_LOOP+1)==0)
            GlobalGroups = GlobalGroups/(NEXT_PERM_LOOP+1);
        else
            GlobalGroups = GlobalGroups/(NEXT_PERM_LOOP+1) + 1;
    }

    ++GlobalGroups;

    size_t globalWorkSize[1] = { GlobalGroups * LOCALGROUPS };
    int LocalGroups = Max &lt;= LOCALGROUPS ? 1 : LOCALGROUPS;
    size_t localWorkSize[1] = { LocalGroups };

    UINT wTimerRes = 0;
    bool init = InitMMTimer(wTimerRes);
    DWORD startTime = timeGetTime();

    // Queue the kernel up for execution across the array
    errNum = clEnqueueNDRangeKernel(commandQueue, kernel, 1, NULL,
        globalWorkSize, localWorkSize,
        0, NULL, NULL);
    if (errNum != CL_SUCCESS)
    {
        std::cerr &lt;&lt; "Error queuing kernel for execution." &lt;&lt; std::endl;
        Cleanup(context, commandQueue, program, kernel, memObjects);
        return 1;
    }

    clFinish(commandQueue);

    DWORD endTime = timeGetTime();
    char buf[50];
    sprintf(buf, "Timing: %dms\n", endTime-startTime);
    std::cout&lt;&lt;buf&lt;&lt;std::endl;

    DestroyMMTimer(wTimerRes, init);

    // Read the output buffer back to the Host
    errNum = clEnqueueReadBuffer(commandQueue, memObjects[0], CL_TRUE,
        0, Max*NUMBER_OF_ELEMENTS, arrDest,
        0, NULL, NULL);
    if (errNum != CL_SUCCESS)
    {
        std::cerr &lt;&lt; "Error reading result buffer." &lt;&lt; std::endl;
        Cleanup(context, commandQueue, program, kernel, memObjects);
        return 1;
    }

    check(arrDest, Max);
    //display(arrDest, Max);
    std::cout &lt;&lt; std::endl;
    std::cout &lt;&lt; "Executed program successfully." &lt;&lt; std::endl;
    Cleanup(context, commandQueue, program, kernel, memObjects);

    delete [] arrDest;

    return 0;
}

long long Fact(long long n)
{
    long long fact = 1;
    for (long long i = 2; i &lt;= n; ++i)
    {
        fact *= i;
    }

    return fact;
}
</pre>
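<p>As noted earlier, <code>NUMBER_OF_ELEMENTS</code> and <code>NEXT_PERM_LOOP</code> have to be defined in both the host file and the kernel file. One way around the duplication is to pass them as preprocessor options when the kernel is built. The helper below is a sketch of mine, not part of the downloadable source; it would replace the <code>clBuildProgram</code> call inside <code>CreateProgram</code>, and the <code>#define</code> lines in the <code>.cl</code> file would then be removed or wrapped in <code>#ifndef</code> guards:</p>

<pre lang="C++">
#include &lt;cstdio&gt;
#ifdef __APPLE__
#include &lt;OpenCL/cl.h&gt;
#else
#include &lt;CL/cl.h&gt;
#endif

// Hypothetical helper: build the OpenCL program with -D options so the macros
// only need to be defined on the host side.
cl_int BuildWithDefines(cl_program program, int numElements, int nextPermLoop)
{
    char options[128];
    sprintf(options, "-D NUMBER_OF_ELEMENTS=%d -D NEXT_PERM_LOOP=%d",
        numElements, nextPermLoop);
    return clBuildProgram(program, 0, NULL, options, NULL, NULL);
}
</pre>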

<p>The OpenCL version also has similar <code>check</code> and <code>display</code> functions. Their call should be commented out as needed.</p>

<pre lang="C++">void check(char* arrDest, long long Max)
{
    std::cout &lt;&lt; std::endl;
    std::cout &lt;&lt; "Checking..." &lt;&lt; std::endl;

    char check[NUMBER_OF_ELEMENTS];
    for(int i=0; i&lt;NUMBER_OF_ELEMENTS ;++i)
    {
        check[i] = i;
    }

    if(OFFSET!=0)
    {
        for(int i=0; i&lt;OFFSET; ++i)
        {
            std::next_permutation(check, check+NUMBER_OF_ELEMENTS);
        }
    }

    for(int i=0; i&lt;Max ;++i)
    {
        for(int j=0;j&lt;NUMBER_OF_ELEMENTS;++j)
        {
            if(arrDest[i*NUMBER_OF_ELEMENTS+j] != check[j])
            {
                printf("Diff check failed at %d!", i);
                return;
            }
        }

        std::next_permutation(check, check+NUMBER_OF_ELEMENTS);
    }

}

void display(char* arrDest, long long Max)
{
    for(int i=0; i&lt;Max ;++i)
    {
        for(int j=0;j&lt;NUMBER_OF_ELEMENTS;++j)
            std::cout &lt;&lt; (int)(arrDest[i*NUMBER_OF_ELEMENTS+j]);

        std::cout &lt;&lt; std::endl;
    }
}
</pre>


<a name="benchmark"></a>
<h2>Benchmark</h2>

<p>I have benchmarked the CUDA and OpenCL versions against the <a href="http://www.codeproject.com/KB/recipes/PartTwoOfPermutationsInCS/TimePermutationsExe.zip">CPU application</a>. The CPU and GPU used in the benchmark are an Intel i7 870 (4 cores, 8 hardware threads) at 2.93 GHz and an NVIDIA GeForce 460 respectively. The CPU application makes full use of all 8 hardware threads to find permutations: it uses factorial decomposition to split the nth permutations among the CPU threads, and in each worker thread the STL <code>next_permutation</code> is used to find every consecutive permutation from that thread's starting permutation. The results of computing the permutations of 11 elements are listed below. The total number of permutations of 11 elements is 39,916,800, so the array needed to store the results is 39,916,800 x 11 = 439,084,800 bytes. This is the maximum number of permutations my 1 GB GPU can store.</p>

<pre>
CPU    : 550ms

<b>Version 1 with pure factorial decomposition (Average timing)</b>
CUDA   : 550ms
OpenCL : 581ms

<b>Version 2 with 1 next_permutation per factorial decomposition (Average timing)</b>
CUDA   : 317ms
OpenCL : 373ms

<b>Version 3 with 9 next_permutation per factorial decomposition (Average timing)</b>
CUDA   : 681ms
OpenCL : 456ms
</pre>

<p>The 8-thread CPU and the GPU benchmark at about the same speed. Does that mean GPU permutation is useless? Not at all: it allows the permutation generation and your custom kernel to run wholly on the GPU, instead of a hybrid CPU/GPU approach. Factorial decomposition has complexity O(n^2) per permutation while STL <code>next_permutation</code> is O(n), but we cannot use <code>next_permutation</code> alone on the GPU because it depends on the current permutation to generate the next one. However, factorial decomposition can be used in conjunction with <code>next_permutation</code> on the GPU to improve performance (similar to what the CPU benchmark does). Using <code>next_permutation</code> more times per work item appears to give diminishing returns, because the number of work items must stay massive to keep the GPU busy. Version 1, with pure factorial decomposition, is the simplest to modify and call from your own kernel. Imagine you are using version 3, which generates several <code>next_permutation</code> results per kernel call: if your custom kernel only needs one permutation as input, what are you going to do with the rest of the permutations generated by <code>next_permutation</code>?</p>
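<p>For reference, here is a minimal sketch of the hybrid scheme described above, as the CPU benchmark uses it (illustrative code of my own, not the benchmark's actual source): each worker seeds its starting permutation by factorial decomposition, then advances with <code>std::next_permutation</code>. The kernels do the same thing, with <code>NEXT_PERM_LOOP</code> playing the role of the per-worker count.</p>

<pre lang="C++">
#include &lt;algorithm&gt;
#include &lt;cstdio&gt;
#include &lt;functional&gt;
#include &lt;thread&gt;
#include &lt;vector&gt;

static const int N = 11;        // number of elements
static const int THREADS = 8;   // hardware threads to use

void Worker(long long start, long long count, std::vector&lt;char&gt;&amp; out)
{
    // Seed: factorial decomposition of 'start', as in the kernels above.
    char perm[N];
    std::vector&lt;char&gt; unpicked;
    for (char i = 0; i &lt; N; ++i)
        unpicked.push_back(i);
    long long fact = 1;
    for (int i = 2; i &lt; N; ++i)
        fact *= i;
    long long idx = start;
    for (int remaining = N - 1; remaining &gt;= 1; --remaining)
    {
        long long j = idx / fact;            // highest j with fact * j &lt;= idx
        idx -= fact * j;
        perm[(N - 1) - remaining] = unpicked[(size_t)j];
        unpicked.erase(unpicked.begin() + (size_t)j);
        fact /= remaining;
    }
    perm[N - 1] = unpicked[0];

    // Walk: emit 'count' consecutive permutations with next_permutation.
    out.resize((size_t)count * N);
    for (long long p = 0; p &lt; count; ++p)
    {
        std::copy(perm, perm + N, out.begin() + (size_t)p * N);
        std::next_permutation(perm, perm + N);
    }
}

int main()
{
    long long total = 1;
    for (int i = 2; i &lt;= N; ++i)
        total *= i;                          // 11! = 39,916,800
    long long chunk = total / THREADS;

    std::vector&lt;std::vector&lt;char&gt; &gt; results(THREADS);
    std::vector&lt;std::thread&gt; pool;
    for (int t = 0; t &lt; THREADS; ++t)
    {
        long long start = (long long)t * chunk;
        long long count = (t == THREADS - 1) ? (total - start) : chunk;
        pool.push_back(std::thread(Worker, start, count, std::ref(results[t])));
    }
    for (size_t t = 0; t &lt; pool.size(); ++t)
        pool[t].join();

    printf("Generated %lld permutations on %d threads.\n", total, THREADS);
    return 0;
}
</pre>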

<a name="interest"></a>
<h2>Points of Interest</h2>

<p>How did my affection for permutations come about? It started in 1998 with a prize game held by the local radio station, FM 93.3. The station would give out 6 digits, and listeners had to arrange those 6 digits into two 3-digit numbers that add up to 933. The cash prize was very attractive, and it accumulated each time a listener failed to give the correct answer. I took the brute-force approach to solve the problem; it took me a few days to get the permutations working in Visual Basic 6. The funny thing was that I never managed to get through on the radio station's hotline to give my answer. I listened helplessly as the $1000+ cash prize flew away from me.</p>
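<p>For fun, here is a minimal sketch of that brute-force search in C++ (the digits below are made up for illustration; my original program was in Visual Basic 6):</p>

<pre lang="C++">
#include &lt;algorithm&gt;
#include &lt;cstdio&gt;

// Sketch of the brute-force search: arrange six given digits into two
// 3-digit numbers that add up to 933.
int main()
{
    int digits[6] = { 1, 1, 2, 2, 4, 5 };
    std::sort(digits, digits + 6);            // start from the first permutation
    do
    {
        int a = digits[0] * 100 + digits[1] * 10 + digits[2];
        int b = digits[3] * 100 + digits[4] * 10 + digits[5];
        if (a + b == 933)
            printf("%d + %d = 933\n", a, b);  // prints every arrangement that works
    } while (std::next_permutation(digits, digits + 6));
    return 0;
}
</pre>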

<a name="conclusion"></a>
<h2>Conclusion</h2>


<p>In this article, we have looked at how to generate lexicographical permutations on the GPU using CUDA and OpenCL. The benchmark seems to indicate that the CPU and GPU have similar performance with pure factorial decomposition, and that the hybrid approach of factorial decomposition plus <code>next_permutation</code> improves GPU performance by roughly 40%. The GPU retains the advantage that your custom operations can run entirely on the GPU, instead of a hybrid CPU/GPU approach in which the permutations are computed on the CPU beforehand. To build and run the CUDA source code, the reader must download the latest <a href="http://developer.nvidia.com/cuda-downloads">CUDA SDK</a>. Any comments on how to further optimize the CUDA and OpenCL code are most welcome.</p>

<a name="ref"></a>
<h2>References</h2>

<ul>
<li><strong>Programming Massively Parallel Processors: A Hands-on Approach</strong> by David B. Kirk, Wen-mei W. Hwu</li>
<li><strong>CUDA by Example</strong> by Jason Sanders and Edward Kandrot</li>
<li><strong>OpenCL Programming Guide</strong> by Aaftab Munshi, Benedict Gaster, Timothy G. Mattson, James Fung, Dan Ginsburg</li>
<li><strong>CUDA Application Design and Development</strong> by Rob Farber</li>
</ul>

<a name="history"></a>
<h2>History</h2>

<ul>
<li><strong>2012-05-09</strong>: Optimized with STL <code>next_permutation</code></li>
<li><strong>2012-05-09</strong>: Added OpenCL section and source code</li>
<li><strong>2012-05-08</strong>: Initial Release</li>
</ul>
