Click here to Skip to main content
12,956,410 members (79,465 online)
Click here to Skip to main content
Add your own
alternative version


49 bookmarked
Posted 25 May 2010

A Brief Test on the Code Efficiency of CUDA and Thrust

, 27 Jun 2010 GPL3
Rate this:
Please Sign up or sign in to vote.
Verify the execution efficiency of a short CUDA program when using the library thrust


I am working on numerical simulations which are always pretty time consuming jobs. Most of these jobs take lots of hours to complete, even though multi-core CPUs are commonly used. Before I can afford a cluster, how to dramatically improve the calculation efficiency on my desktop computers to save computational effort became a critical problem I am facing and dreaming to achieve.

NVIDIA CUDA seems more and more popular and potential to solve the present problem with the power released from GPU. CUDA framework provides a modified C language and with its help, my C programming experiences can be re-used to implement numerical algorithms by utilising a GPU. Whilst thrust is a C++ template library for CUDA, thrust is aimed at improving developers' development productivity; however, the code execution efficiency is also of high priority for a numerical job. Someone stated that code execution efficiency could be lost to some extent due to the extra cost from using the library thrust. To judge this precisely, I did a series of basic tests in order to explore the truth. Basically, that is the purpose of this article.

My test computer is an Intel Q6600 quad core CPU plus 3G DDR2 800M memory. Although I don't have good hard drives, marked only 5.1 in Windows 7 32 bit, I think in this test of the calculation of the summation of squares, the access to hard drives might not be significant. The graphic card used is a GeForce 9800 GTX+ with 512M GDDR3 memory. The card is shown as:

[This article can also be referred from my blog (Free your CFD), "A short test on the code efficiency of CUDA and thrust".]

Algorithm in Raw CUDA

The test case I used is solving the summation of squares of an array of integers (random numbers ranged from 0 to 9), and, as I mentioned, a GeForce 9800 GTX+ graphic card running within Windows 7 32-bit system was employed for the testing. If in plain C language, the summation could be implemented by the following loop code, which is then executed on a CPU core:

int final_sum = 0;
for (int i = 0; i < DATA_SIZE; i++) {
	final_sum += data[i] * data[i];

Obviously, it is a serial computation. The code is executed in a serial stream of instructions. In order to utilise the power of CUDA, the algorithm has to be parallelised, and the more parallelisation is realised, the more potential power will be explored. With the help of my basic understanding on CUDA, I split the data into different groups and then used the equivalent number of threads on the GPU to calculate the summation of the squares of each group. Ultimately results from all the groups are added together to obtain the final result.

The algorithm designed is briefly shown in the figure:

The consecutive steps are:

  1. Copy data from the CPU memory to the GPU memory.
    cudaMemcpy(gpudata, data, sizeof(int) * DATA_SIZE, cudaMemcpyHostToDevice);
  2. Totally BLOCK_NUM blocks are used, and in each block THREAD_NUM threads are produced to perform the calculation. In practice, I used THREAD_NUM = 512, which is the greatest allowed thread number in a block of CUDA. Thereby, the raw data are separated into DATA_SIZE / (BLOCK_NUM * THREAD_NUM) groups.
  3. The access to the data buffer is designed as consecutive, otherwise the efficiency will be reduced.
  4. Each thread does its corresponding calculation.
    shared[tid] = 0;
    for (int i = bid * THREAD_NUM + tid; i < DATA_SIZE; i += BLOCK_NUM * THREAD_NUM) {
    	shared[tid] += num[i] * num[i];
  5. By using shared memory in the blocks, sub summation can be done in each block. Also, the sub summation is parallelised to achieve as high execution speed as possible. Please refer to the source code regarding the details of this part.
  6. The BLOCK_NUM sub summation results for all the blocks are copied back to the CPU side, and they are then added together to obtain the final value.
    cudaMemcpy(&sum, result, sizeof(int) * BLOCK_NUM, cudaMemcpyDeviceToHost);
    int final_sum = 0;
    for (int i = 0; i < BLOCK_NUM; i++) {
    	final_sum += sum[i];

Regarding the procedure, function QueryPerformanceCounter records the code execution duration, which is then used for comparison between the different implementations. Before each call of QueryPerformanceCounter, CUDA function cudaThreadSynchronize() is called to make sure that all computations on the GPU are really finished. (Please refer to the CUDA Best Practices Guide §2.1.)

More details on the raw CUDA code can be referred directly from the source code attached. Comments are also welcome.

Algorithm in Thrust

The application of the library thrust could make the CUDA code as simple as a plain C++ one. The usage of the library is also compatible with the usage of STL (Standard Template Library) of C++. For instance, the code for the calculation on GPU utilising thrust support is scratched like this:

thrust::host_vector<int> data(DATA_SIZE);
thrust::generate(data.begin(), data.end(), random());


thrust::device_vector<int> gpudata = data;

int final_sum = thrust::transform_reduce(gpudata.begin(), gpudata.end(),
    square<int>(), 0, thrust::plus<int>());

elapsed_time = (double)(elapsed_time_end.QuadPart - elapsed_time_start.QuadPart)
    / frequency.QuadPart;

printf("sum (on GPU): %d; time: %lf\n", final_sum, elapsed_time);

thrust::generate is used to generate the random data, for which the functor random is defined in advance. random was customised to generate a random integer ranged from 0 to 9.

// define functor for
// random number ranged in [0, 9]
class random
    int operator() ()
        return rand() % 10;

In comparison with the random number generation without thrust, the code could however not be as elegant.

// generate random number ranged in [0, 9]
void GenerateNumbers(int * number, int size)
	for (int i = 0; i < size; i++) {
		number[i] = rand() % 10;

Similarly square is a transformation functor taking one argument. Please refer to the source code for its definition. square was defined for __host__ __device__ and thus it can be used for both the CPU and the GPU sides.

// define transformation f(x) -> x^2
template <typename T>
struct square
	__host__ __device__
		T operator() (T x)
		return x * x;

That is all for the thrust based code. Is it concise enough? :) Here function QueryPerformanceCounter also records the code duration. On the other hand, the host_vector data is operated on CPU to compare. Using the code below, the summation is performed by the CPU end:


final_sum = thrust::transform_reduce(data.begin(), data.end(),
	square<int>(), 0, thrust::plus<int>());

elapsed_time = (double)(elapsed_time_end.QuadPart - elapsed_time_start.QuadPart)
	/ frequency.QuadPart;

printf("sum (on CPU): %d; time: %lf\n", final_sum, elapsed_time);

I also tested the performance if use thrust::host_vector<int> data as a plain array. This is supposed to cost more overhead, I thought, but we might be curious to know how much. The corresponding code is listed as:

final_sum = 0;
for (int i = 0; i < DATA_SIZE; i++)
    final_sum += data[i] * data[i];

printf("sum (on CPU): %d; time: %lf\n", final_sum, elapsed_time);

The execution time was recorded to compare as well.

Test Results on GPU & CPU

The previous experiences show that GPU surpasses CPU when massive parallel computation is realised. When DATA_SIZE increases, the potential of GPU calculation will be gradually released. This is predictable. Moreover, do we lose efficiency when we apply thrust? I guess so, since there is extra cost brought, but do we lose much? We have to judge from the comparison results.

When DATA_SIZE increases from 1 M to 32 M (1 M equals to 1 * 1024 * 1024), the results obtained are illustrated as the table:

The descriptions of the items are:

  • GPU Time: Execution time of the raw CUDA code
  • CPU Time: Execution time of the plain loop code running on the CPU
  • GPU thrust: Execution time of the CUDA code with thrust
  • CPU thrust: Execution time of the CPU code with thrust
  • CPU '': Execution time of the plain loop code based on thrust::host_vector

The corresponding trends can be summarised as:

or compare them by the column figure:

The speedup of GPU to CPU is obvious when DATA_SIZE is more than 4 M. Actually with greater data size, much better performance speedup can be obtained. Interestingly, in this region, the cost of using thrust is quite small, which can even be neglected. However, on the other hand, don't use thrust on the CPU side, neither thrust::transform_reduce method nor a plain loop on a thrust::host_vector; according to the figures, the cost brought is huge. Use a plain array and a loop instead.

From the comparison figure, we found that the application of thrust not only simplifies the code of CUDA computation, but also compensates the loss of efficiency when DATA_SIZE is relatively small. Therefore, it is strongly recommended.


Based on the tests performed, apparently, by employing parallelism, GPU shows greater potential than CPU does, especially for those calculations which contain much more parallel elements. This article also found that the application of thrust does not reduce the code execution efficiency on the GPU side, but brings dramatical negative changes in the efficiency on the CPU side. Consequently, it is better using plain arrays for CPU calculations.

In conclusion, the usage of thrust feels pretty good, because it improves the code efficiency, and with employing thrust, the CUDA code can be so concise and rapidly developed.

Code Instruction

The code file, contained in the zip package, includes the algorithms for the raw CUDA as well as thrust on both GPU and CPU. Note that the calculation execution has to be repeated enough times in order to extract average values for a practical benchmark test; for clarity and simplification, I didn't include this feature in the code attached, but it is easy to add.

The code was built and tested in Windows 7 32 bit plus Visual Studio 2008, CUDA 3.0 and the latest thrust 1.2. One also needs a NVIDIA graphic card as well as CUDA toolkit to run the programs. For instructions on installing CUDA, please refer to its official site CUDA Zone.


  • 25/05/2010: The first version of the present article was released.
  • 26/05/2010: Source code packages are attached and the article is also updated accordingly.
  • 05/06/2010: The two packages are incorporated together and the code is also improved according to recent readers' comments. The article is also updated accordingly, especially the algorithms implemented are described more detailedly.
  • 27/06/2010: The code was modified with the help of the recent comments. In particular, the necessary call of cudaThreadSynchronize() was added. Meanwhile, the presentation of the test results is also polished further to be clear and elegant.


This article, along with any associated source code and files, is licensed under The GNU General Public License (GPLv3)


About the Author

Wayne Wood
United States United States

Free your CFD

Working on numerical modelling on electromagnetic, thermal and fluid dynamics etc in power and energy field.

Programming in C/C++ from Visual C++ 6.0 and in C#/VB.NET since Visual Studio 2003. Experienced in MATLAB, Python and Fortran etc. Meanwhile I am also a Linux fan.

Happy to exchange ideas!

You may also be interested in...


Comments and Discussions

Generalthanks Pin
Member 1091497122-Oct-14 7:10
memberMember 1091497122-Oct-14 7:10 
NewsMy Results Pin
Member 1006078419-May-13 6:17
memberMember 1006078419-May-13 6:17 
QuestionWhat you really measure Pin
Mario Mulansky10-Jun-10 0:13
memberMario Mulansky10-Jun-10 0:13 
AnswerRe: What you really measure Pin
Wayne Wood12-Jun-10 23:10
memberWayne Wood12-Jun-10 23:10 
GeneralRe: What you really measure Pin
Mario Mulansky13-Jun-10 2:13
memberMario Mulansky13-Jun-10 2:13 
GeneralRe: What you really measure Pin
El Corazon14-Jun-10 16:35
memberEl Corazon14-Jun-10 16:35 
GeneralRe: What you really measure Pin
Wayne Wood16-Jun-10 8:23
memberWayne Wood16-Jun-10 8:23 
Generalconclusion Pin
epitalon2-Jun-10 22:52
memberepitalon2-Jun-10 22:52 
GeneralRe: conclusion Pin
Wayne Wood7-Jun-10 11:46
memberWayne Wood7-Jun-10 11:46 
GeneralRe: conclusion Pin
epitalon8-Jun-10 0:17
memberepitalon8-Jun-10 0:17 
GeneralRe: conclusion Pin
Wayne Wood8-Jun-10 2:28
memberWayne Wood8-Jun-10 2:28 
GeneralRe: conclusion Pin
El Corazon8-Jun-10 13:42
memberEl Corazon8-Jun-10 13:42 
GeneralRe: conclusion Pin
Wayne Wood9-Jun-10 0:21
memberWayne Wood9-Jun-10 0:21 
GeneralRe: conclusion Pin
El Corazon10-Jun-10 10:27
memberEl Corazon10-Jun-10 10:27 
GeneralRe: conclusion Pin
Wayne Wood12-Jun-10 7:24
memberWayne Wood12-Jun-10 7:24 
Generalexcellent article Pin
El Corazon29-May-10 11:25
memberEl Corazon29-May-10 11:25 
GeneralRe: excellent article Pin
Wayne Wood29-May-10 12:10
memberWayne Wood29-May-10 12:10 
GeneralRe: excellent article Pin
El Corazon29-May-10 12:56
memberEl Corazon29-May-10 12:56 
one other issue with the code, and it is only in respect to benchmarking practices. I moved the QueryPerformanceCounter up before the calculation of core clocks. Since this is simply a matter of determining how much use you put the GPU in while processing the problem, it really should not be part of the time, especially since it is calculated on the CPU not the GPU thus slowing down the evaluation. The mallocs frees and upload to card are fine, the goal in CUDA is to find a process that takes long enough to make the time transmitting to the GPU worthwhile, and it is a rough choice at times. The main rule of thumb in benchmarking properly and it is tricky at times, if it contributes to solving the problem, include it in the benchmark, if it contributes to evaluating the benchmark itself, do not time it, calculate it outside of the time end and before the next begin.

I am starting my times evaluation now. I just wanted us on the same page.

basically my code looks like this now:

<pre>	QueryPerformanceCounter(&elapsed_time_end);<br />
<br />
	clock_t min_start, max_end;<br />
	min_start = time_used[0];<br />
	max_end = time_used[BLOCK_NUM];<br />
	for (int i = 1; i < BLOCK_NUM; i++) {<br />
		if (min_start > time_used[i])<br />
			min_start = time_used[i];<br />
		if (max_end < time_used[i + BLOCK_NUM])<br />
			max_end = time_used[i + BLOCK_NUM];<br />
	}<br />
<br />
	elapsed_time = (double)(elapsed_time_end.QuadPart - elapsed_time_start.QuadPart)<br />
		/ frequency.QuadPart;<br />

John Andrew Holmes "It is well to remember that the entire universe, with one trifling exception, is composed of others."

Shhhhh.... I am not really here. I am a figment of your imagination.... I am still in my cave so this must be an illusion....

GeneralRe: excellent article Pin
Wayne Wood29-May-10 22:35
memberWayne Wood29-May-10 22:35 
GeneralRe: excellent article Pin
El Corazon30-May-10 10:16
memberEl Corazon30-May-10 10:16 
GeneralPlease the same Computation with CUDA Pin
Kevin Drzycimski26-May-10 6:35
memberKevin Drzycimski26-May-10 6:35 
GeneralRe: Please the same Computation with CUDA Pin
Wayne Wood26-May-10 11:33
memberWayne Wood26-May-10 11:33 
JokeRe: Please the same Computation with CUDA Pin
Kevin Drzycimski31-May-10 3:07
memberKevin Drzycimski31-May-10 3:07 
QuestionQuestion Pin
BryanWilkins26-May-10 3:47
memberBryanWilkins26-May-10 3:47 
AnswerRe: Question Pin
Wayne Wood26-May-10 4:05
memberWayne Wood26-May-10 4:05 

General General    News News    Suggestion Suggestion    Question Question    Bug Bug    Answer Answer    Joke Joke    Praise Praise    Rant Rant    Admin Admin   

Use Ctrl+Left/Right to switch messages, Ctrl+Up/Down to switch threads, Ctrl+Shift+Left/Right to switch pages.

Permalink | Advertise | Privacy | Terms of Use | Mobile
Web02 | 2.8.170525.1 | Last Updated 27 Jun 2010
Article Copyright 2010 by Wayne Wood
Everything else Copyright © CodeProject, 1999-2017
Layout: fixed | fluid