## Introduction

CUDA and OpenCL gives any application access to the graphics processing unit for non-graphical computing and enables massive general purpose computations using GPUs. Before the advent of CUDA and OpenCL, GPGPU is done using graphics API shaders (HLSL and GLSL) which are awkward and unnatural to the task at hand because programmer has to 'bend' the shaders to do general purpose programming. CUDA is available for use on NVidia graphics card and x86 CPU. OpenCL is available on NVidia and AMD graphics card, x86 CPU and ARM CPU. Whereas DirectCompute is only available on MS Vista and Windows 7 platforms and DirectX 10/11 level graphics card. To read this article, reader is expected to be fairly acquainted with CUDA and OpenCL workings, its memory model and technical terms like kernel, host and device.

## Algorithm

The algorithm used in this article is basically the same as my Permutation in C++, Part 2 article. The algorithm make use of factorial decomposition to find lexicographical permutations and is zero based. As the reader can see from the rating of my previous article, I am not very good at explaining this algorithm and reader may want to read the explanation elsewhere. To explain this algorithm, we will use 2 sets of elements, the first set is known as the unpicked set which will initially store all the elements and the second one is called the picked set which is initially empty but will eventually contain the final permutation. Element move one by one from unpicked set to the picked set. Let us walk through an example. Say we have a unpicked set of {0,1,2}. By computing the factorial of 3, we know that 3 elements have a total of 6 different permutations. Note: factorial operation is usually notated by `!`

sign, for example factorial of 2 is simply written as `2!`

. Factorial of 1 is defined to be 1.

0,1,2
0,2,1
1,0,2
1,2,0
2,0,1
2,1,0

From the list of (zero-based) lexicographical permutations, we know that the first permutation(0th) is 0,1,2 and the last permutation(5th) is 2,1,0. Let us find out about 4th permutation (which we know from the above list, is 2,0,1) using factorial decomposition.

2! * 0 = 0
2! * 1 = 2
2! * 2 = 4

The algorithm requires us to find highest product which is greater or equal to our nth index which is 4 so we know the first element is 2 because 4 is equal to the 3rd product and 3rd element in the unpicked set of {0,1,2} is 2. Let us proceed to find the 2nd element. Before that, we must subtract the found product from our nth index which is 4 - 4 = 0. `{2,?,?}`

is our current pciked set with `{0,1}`

remaining in the unpicked set.

1! * 0 = 0
1! * 1 = 1

And again, we find the highest number which is greater or equal to our subtracted nth index which is 0 so we know the second element is 0 because 0 is equal to the 1st product and 1st element of the unpicked set `{0,1}`

is 0. `{2,0,?}`

is our current partially found permutation with `{1}`

remaining in the unpicked set. We need not compute any further because only 1 is remaining in the unpicked set. So the 4th permutation is `{2,0,1}`

. Let's try another example! This time, we will find the 3th permutation of 3 elements `{0,1,2}`

.

2! * 0 = 0
2! * 1 = 2
2! * 2 = 4

As the reader should have know by now, we need to find highest product which is greater or equal to our nth index which is 3. We know the first element is 1 because 3 is less than the 3rd product but more than the 2nd product and 2nd element of the unpicked set `{0,1,2}`

is 1. Let us proceed to find the 2nd element of the picked set. First, we must subtract the found product from our nth index which is 3 - 2 = 1. Note: `{1,?,?}`

is our current picked set with `{0,2}`

remaining in the unpicked set.

1! * 0 = 0
1! * 1 = 1

And again, we find the highest number which is greater or equal to our subtracted nth index which is 1 so we know the second element is 2 because 2 is 2nd element of the unpicked set `{0,2}`

. `{1,2,?}`

is our current partially found permutation with `{0}`

remaining in the unpicked set. Because only 0 is remaining in the unpicked set, so the 3th permutation is `{1,2,0}`

.

## CUDA Source Code

I present the source code for the CUDA kernel. The kernel function is called `Permute`

. The `arrDest`

parameter is an uninitialized array which will be filled with the results after the CUDA operation is completed. `offset`

is the nth permutation to be start computing and `Max`

is the maximum permutation to compute. As a simple example to illustrate the use of `offset`

and `Max`

parameters, let assume you had 1000 permutations to find but your GPU only have enough memory to find 500 permutations at any one time, so you have to split the CUDA operation into 2. The first operation compute the 0th to 499th permuations (`offset`

=0 and `Max`

=500) while the second operation compute the 500th to 999th permutations (`offset`

=500 and `Max`

=500). The nth permutation used in the algorithm is zero-based. The constant array, `arr[i][j]`

stores the precomputed value of factorial(i) * j where i and j are both zero-based and factorial(0) is assumed to be zero instead of the usual 1 in order to get the code working correctly. Constant memory access in CUDA is faster than global memory but they are usually limited to 64KB and cannot be written to. When we are finding kth element out of a set of `n`

, we are using the factorial of (`n-k`

). Each element of `arr`

is a 64 bit integer which can store up to maximum factorial of 20. The `next_permutation`

is copied and modified from Visual C++ 10 STL. `NEXT_PERM_LOOP`

specifies the number of times you want to use `next_permutation`

.

#define NUMBER_OF_ELEMENTS 5
#define BLOCK_DIM 1024
#define OFFSET 0
#define MAX_PERM 0
#define NEXT_PERM_LOOP 1
__constant__ long long arr[20][20] = { };
__device__ void swap(
char* a,
char* b)
{
char tmp = *a;
*a = *b;
*b = tmp;
}
__device__ void reverse(
char* first,
char* last)
{
for (; first != last && first != --last; ++first)
swap(first, last);
}
__device__ void next_permutation(
char* first,
char* last)
{
char* next = last;
--next;
if(first == last || first == next)
return;
while(true)
{
char* next1 = next;
--next;
if(*next < *next1)
{
char* mid = last;
--mid;
for(; !(*next < *mid); --mid)
;
swap(next, mid);
reverse(next1, last);
return;
}
if(next == first)
{
reverse(first, last);
return;
}
}
}
__global__ void PermuteHybrid(char* arrDest, long long* offset, long long* Max)
{
long long index = threadIdx.x + blockIdx.x * blockDim.x;
if(index >= (*Max/(NEXT_PERM_LOOP+1)))
return;
index *= NEXT_PERM_LOOP+1;
long long tmpindex = index;
index += *offset;
char arrSrc[NUMBER_OF_ELEMENTS];
char arrTaken[NUMBER_OF_ELEMENTS];
for(char i=0; i<NUMBER_OF_ELEMENTS; ++i)
{
arrSrc[i] = i;
arrTaken[i] = 0;
}
char size = NUMBER_OF_ELEMENTS;
for(char i=NUMBER_OF_ELEMENTS-1; i>=0; --i)
{
for(char j=i; j>=0; --j)
{
if(index >= arr[i][j])
{
char foundcnt = 0;
index = index - arr[i][j];
for(char k=0;k<NUMBER_OF_ELEMENTS; ++k)
{
if(arrTaken[k]==0)
{
if(foundcnt==j)
{
arrTaken[k] = 1;
arrDest[ (tmpindex*NUMBER_OF_ELEMENTS) + (NUMBER_OF_ELEMENTS-size) ] = arrSrc[k];
break;
}
foundcnt++;
}
}
break;
}
}
--size;
}
long long idx = tmpindex*NUMBER_OF_ELEMENTS;
for(char a=1; a<NEXT_PERM_LOOP+1; ++a)
{
long long idx2 = a*NUMBER_OF_ELEMENTS;
for(char i=0; i<NUMBER_OF_ELEMENTS; ++i)
{
arrDest[ idx + idx2 + i ] =
arrDest[ idx + ((a-1)*NUMBER_OF_ELEMENTS) + i ];
}
next_permutation(arrDest + idx + idx2,
arrDest+idx + idx2 + NUMBER_OF_ELEMENTS);
}
}

To control operation parameters, we need to set macros accordingly. As an example, the macros below find permutations for 5 elements from 0th to 29th permutations. If the kernel cannot be run, the reader might want to play around with the values of `BLOCK_DIM`

in multiple of 256.

#define NUMBER_OF_ELEMENTS 5
#define BLOCK_DIM 1024
#define OFFSET 0
#define MAX_PERM 30
#define NEXT_PERM_LOOP 1

I have included a `check`

function to check the generated permutations against `next_permutation`

and a `display`

function to display the the results. Reader might want to comment out `display`

function for massive number of permutations.

void check(char* arrDest, long long Max)
{
printf("\nChecking...\n");
char check[NUMBER_OF_ELEMENTS];
for(int i=0; i<NUMBER_OF_ELEMENTS ;++i)
{
check[i] = i;
}
if(OFFSET!=0)
{
for(int i=0; i<OFFSET; ++i)
{
std::next_permutation(check, check+NUMBER_OF_ELEMENTS);
}
}
for(int i=0; i<Max ;++i)
{
for(int j=0;j<NUMBER_OF_ELEMENTS;++j)
{
if(arrDest[i*NUMBER_OF_ELEMENTS+j] != check[j])
{
fprintf(stderr, "Diff check failed at %d!", i);
return;
}
}
std::next_permutation(check, check+NUMBER_OF_ELEMENTS);
}
}
void display(char* arrDest, long long Max)
{
for(int i=0; i<Max ;++i)
{
for(int j=0;j<NUMBER_OF_ELEMENTS;++j)
printf("%d ", arrDest[i*NUMBER_OF_ELEMENTS+j]);
printf("\n");
}
}

## OpenCL Source Code

The OpenCL kernel function is presented below. The OpenCL code is similar to the CUDA one except the host code and kernel code are written in 2 separate source files. This leads to the need to set the `NUMBER_OF_ELEMENTS`

and `NEXT_PERM_LOOP`

macro twice (once in the host file and once in kernel source file). `arr`

is a single dimensional array as opposed to two dimensional array in CUDA because it is not possible to define 2 dimensional array in OpenCL.

#pragma OPENCL EXTENSION cl_khr_byte_addressable_store: enable
#define NUMBER_OF_ELEMENTS 5
#define NEXT_PERM_LOOP 1
__constant long arr[400] = { };
void swap(
__global char* a,
__global char* b)
{
char tmp = *a;
*a = *b;
*b = tmp;
}
void reverse(
__global char* first,
__global char* last)
{
for (; first != last && first != --last; ++first)
swap(first, last);
}
void next_permutation(
__global char* first,
__global char* last)
{
__global char* next = last;
--next;
if(first == last || first == next)
return;
while(true)
{
__global char* next1 = next;
--next;
if(*next < *next1)
{
__global char* mid = last;
--mid;
for(; !(*next < *mid); --mid)
;
swap(next, mid);
reverse(next1, last);
return;
}
if(next == first)
{
reverse(first, last);
return;
}
}
}
__kernel void PermuteHybrid(__global char* arrDest, __global long* offset, __global long* Max)
{
long index = get_global_id(0);
if(index>=(*Max/(NEXT_PERM_LOOP+1)))
return;
index *= NEXT_PERM_LOOP+1;
long tmpindex = index;
index += *offset;
char arrSrc[NUMBER_OF_ELEMENTS];
char arrTaken[NUMBER_OF_ELEMENTS];
for(int i=0; i<NUMBER_OF_ELEMENTS; ++i)
{
arrSrc[i] = i;
arrTaken[i] = 0;
}
int size = NUMBER_OF_ELEMENTS;
for(char i=NUMBER_OF_ELEMENTS-1; i>=0; --i)
{
for(char j=i; j>=0; --j)
{
if(index >= arr[i*20+j])
{
char foundcnt = 0;
index = index - arr[i*20+j];
for(char k=0;k<NUMBER_OF_ELEMENTS; ++k)
{
if(arrTaken[k]==0)
{
if(foundcnt==j)
{
arrTaken[k] = 1;
arrDest[ (tmpindex*NUMBER_OF_ELEMENTS) + (NUMBER_OF_ELEMENTS-size) ] = arrSrc[k];
break;
}
foundcnt++;
}
}
break;
}
}
--size;
}
long idx = tmpindex*NUMBER_OF_ELEMENTS;
for(char a=1; a<NEXT_PERM_LOOP+1; ++a)
{
long idx2 = a*NUMBER_OF_ELEMENTS;
for(char i=0; i<NUMBER_OF_ELEMENTS; ++i)
{
arrDest[ idx + idx2 + i ] =
arrDest[ idx + ((a-1)*NUMBER_OF_ELEMENTS) + i ];
}
next_permutation(arrDest + idx + idx2,
arrDest+idx + idx2 + NUMBER_OF_ELEMENTS);
}
}

The OpenCL version also has similar `check`

and `display`

functions. Their call should be commented out as needed.

void check(char* arrDest, long long Max)
{
std::cout << std::endl;
std::cout << "Checking..." << std::endl;
char check[NUMBER_OF_ELEMENTS];
for(int i=0; i<NUMBER_OF_ELEMENTS ;++i)
{
check[i] = i;
}
if(OFFSET!=0)
{
for(int i=0; i<OFFSET; ++i)
{
std::next_permutation(check, check+NUMBER_OF_ELEMENTS);
}
}
for(int i=0; i<Max ;++i)
{
for(int j=0;j<NUMBER_OF_ELEMENTS;++j)
{
if(arrDest[i*NUMBER_OF_ELEMENTS+j] != check[j])
{
printf("Diff check failed at %d!", i);
return;
}
}
std::next_permutation(check, check+NUMBER_OF_ELEMENTS);
}
}
void display(char* arrDest, long long Max)
{
for(int i=0; i<Max ;++i)
{
for(int j=0;j<NUMBER_OF_ELEMENTS;++j)
std::cout << (int)(arrDest[i*NUMBER_OF_ELEMENTS+j]);
std::cout << std::endl;
}
}

## Benchmark

I have benchmarked the CUDA against the CPU application. The CPU and GPU used in the benchmark, are Intel i7 870 (8 cores), 2.93Ghz and NVidia Geforce 460 respectively. The CPU application make full use of the 8 cores to find permutations. The CPU application is using factorial decomposition to split the nth permutations among different CPU cores and in each worker threads, STL `next_permutation`

is used to find every consecutive permutation from the first nth permutation. The results of computing permutations of 11 elements is listed below. The total number of permutations of 11 elements found is 39,916,800. The size of the array need to store the results is 39,916,800 x 11 = 439,084,800. This is the maximum number of permutations which my 1GB memory GPU can store.

CPU : 550ms
**Version 1 with pure factorial decomposition (Average timing)**
CUDA : 550ms
OpenCL : 581ms
**Version 2 with 1 next_permutation per factorial decomposition (Average timing)**
CUDA : 317ms
OpenCL : 373ms
**Version 3 with 9 next_permutation per factorial decomposition (Average timing)**
CUDA : 681ms
OpenCL : 456ms

The benchmark of 8 core CPU and the GPU is about the same. Does that mean the GPU permutation is useless? GPU permutation by no means is useless because it enables permutation and your custom kernel to be done wholly on GPU, instead of the CPU/GPU hybrid approach. Factorial decomposition has complexity of O(n^2) while STL next_permutation has complexity of O(n) but we cannot use purely `next_permutation`

on GPU because next_permutation depends on the current permutation to generate the next one. However, factorial decomposition can be used in conjunction with `next_permutation`

on GPU to improve performance (similar to what I have done with the CPU benchmark). Higher number of use of `next_permutation`

seem to have diminishing returns on performance as the number of workitems have to be massive to keep the GPU work units busy; 300ms is the lowest for 2, 3 or 4 next_permutation after which the performance will worsen. Version 1 with pure factorial decomposition is simple to be modified to be called from your custom kernel. Imagine you are using version 3 which can gives you flexibility of different number of `next_permutation`

in 1 kernel call, and if your custom kernel only needs 1 permutation input, what are the programmer going to do with the rest of permutations generated by `next_permutation`

?

## Massive Permutations Benchmark

I decide to modify the code to benchmark finding permutations of 12 elements. Total permutation is 479,001,600. Total bytes needed to store all permutation is 479,001,600 * 12 = 5,748,019,200 (around 6GB) which is way too much to store in my 1GB Graphics card. I break up the computation in batches of 12 to fit into my graphics card memory. The number you choose to break them up must be a factor of total permutations, meaning the total permutations could be cleanly divided by this number without remainder. The 1st batch is computed using factorial decomposition while the rest is using pure `next_permutation`

. The results is below.

Timing: 649ms
Checking...
next_permutation Timing: 18ms
Checking...
next_permutation Timing: 29ms
Checking...
next_permutation Timing: 18ms
Checking...
next_permutation Timing: 29ms
Checking...
next_permutation Timing: 18ms
Checking...
next_permutation Timing: 32ms
Checking...
next_permutation Timing: 18ms
Checking...
next_permutation Timing: 28ms
Checking...
next_permutation Timing: 18ms
Checking...
next_permutation Timing: 29ms
Checking...
next_permutation Timing: 19ms
Checking...
Total timing: 905ms
Executed program successfully.

The 1st timing using factorial decomposition, is 649ms. The total timing is 905ms. The CPU version takes 6.5 seconds. The GPU performance is improved over CPU version by 7x! The reader can find this benchmark source code under PermCudaBenchmark folder. The readers may wish to comment out the checking function which takes a long time to check all the generated permutations.

## Limitations

The presented algorithm can only find permutations for a maximum of 20 elements. This is because 64 bit integer can only store at most up to factorial(20). Note: A graphics card with 1 GB can store permutations of 11 elements (which is about 0.5 GB). Permuting 12 or more elements, the programmer have to break up the operation into batches (Preferably the 1st operation is using factorial decomposition while the rest is using `next_permutation`

). As indicated in this wiki doc, oclHashcat-plus permutation limit is 11. For those who are not familiar, oclHashcat-plus is the advanced password recovery software running on GPU. Also this algorithm does not handle duplicate elements. **A note of caution**: the OpenCL code is only tested on NVidia graphics card, AMD users have to enable the check function (enabled by default) to ensure the permutations are correct. As implied, OpenCL code is not tested on any x86 OpenCL implementation.

## Points of Interest

How did my affection for permutation came about? The history started in 1998, there was this prize game held by the local radio station, FM 93.3. The radio will give out 6 digits and the radio listeners had to arrange these 6 digits into 2 numbers with 3 digits, and these 2 numbers have to add up to 933. The cash prize was very attractive and it accumulated each time a listener fails to get the correct answer. I took the brute force approach to solve the problem. It took me a few days to get the permutations working in Visual Basic 6. The funny thing was I never managed to call through the radio station hotline to give my answer. I listened with helplessness, as the $1000+ cash prize flew away from me.

## Conclusion

In this article, we have look at how to implement generation of lexicographical permutations on GPU, using NVidia's CUDA. The benchmark seem to indicate both CPU and GPU have similar performance. We have improved the performance by 40% using hybrid approach of factorial decomposition and `next_permutation`

. GPU has the advantage of running your custom operations mainly using GPU, instead of hybrid CPU/GPU approach if you have computed permutation on the CPU beforehand. To build and run the CUDA source code, reader must download the latest CUDA SDK. The source code is hosted at Codeplex. Any comments on how to further optimize the CUDA and OpenCL code is most welcome.

## References

**Programming Massively Parallel Processors: A Hands-on Approach** by David B. Kirk, Wen-mei W. Hwu
**CUDA by Example** by Jason Sanders and Edward Kandrot
**OpenCL Programming Guide** by Aaftab Munshi, Benedict Gaster, Timothy G. Mattson, James Fung, Dan Ginsburg
**CUDA Application Design and Development** by Rob Farber

## History

**2012-05-15**: Add limitations section.
**2012-05-13**: Includes massive permutation benchmark which improve the performance over the CPU by 7 times.
**2012-05-09**: Optimize with STL `next_permutation`

**2012-05-09**: Added OpenCL section and source code
**2012-05-08**: Initial Release