
A Brief Test on the Code Efficiency of CUDA and Thrust

By Wayne Wood, 27 Jun 2010
 

Introduction

I work on numerical simulations, which are always time-consuming jobs. Most of them take many hours to complete, even on multi-core CPUs. Until I can afford a cluster, dramatically improving the calculation efficiency of my desktop computer is a critical problem for me.

NVIDIA CUDA is becoming more and more popular and has the potential to solve this problem by harnessing the power of the GPU. The CUDA framework provides a modified C language, so my C programming experience can be reused to implement numerical algorithms on a GPU. thrust is a C++ template library for CUDA that aims to improve developer productivity; for a numerical job, however, code execution efficiency is also a high priority. I have seen it stated that some execution efficiency is lost to the extra cost of using thrust. To judge this, I ran a series of basic tests, and reporting them is the purpose of this article.

My test computer has an Intel Q6600 quad-core CPU and 3 GB of DDR2-800 memory. My hard drives are not good (they score only 5.1 in Windows 7 32-bit), but in this test, which calculates a sum of squares, hard drive access should not be significant. The graphics card is a GeForce 9800 GTX+ with 512 MB of GDDR3 memory. The card is shown as:

[This article can also be found on my blog (Free your CFD) under the title "A short test on the code efficiency of CUDA and thrust".]

Algorithm in Raw CUDA

The test case computes the sum of squares of an array of integers (random numbers ranging from 0 to 9) and, as mentioned, runs on a GeForce 9800 GTX+ graphics card under 32-bit Windows 7. In plain C, the summation can be implemented by the following loop, which executes on a single CPU core:

int final_sum = 0;
for (int i = 0; i < DATA_SIZE; i++) {
	final_sum += data[i] * data[i];
}

Obviously, this is a serial computation: the code is executed as a single stream of instructions. To utilise the power of CUDA, the algorithm has to be parallelised, and the more parallelism is achieved, the more of that power can be exploited. Based on my basic understanding of CUDA, I split the data into groups and used one GPU thread per group to calculate the sum of squares of that group. Finally, the results from all the groups are added together to obtain the final result.

The algorithm designed is briefly shown in the figure:

The consecutive steps are:

  1. Copy data from the CPU memory to the GPU memory.
    cudaMemcpy(gpudata, data, sizeof(int) * DATA_SIZE, cudaMemcpyHostToDevice);
  2. BLOCK_NUM blocks are used in total, and each block runs THREAD_NUM threads to perform the calculation. In practice, I used THREAD_NUM = 512, which is the maximum number of threads per block on this generation of CUDA hardware. The raw data are thereby divided into BLOCK_NUM * THREAD_NUM groups, one per thread, each containing about DATA_SIZE / (BLOCK_NUM * THREAD_NUM) elements.
  3. Neighbouring threads access consecutive elements of the data buffer; otherwise the efficiency is reduced.
  4. Each thread does its corresponding calculation.
    shared[tid] = 0;
    for (int i = bid * THREAD_NUM + tid; i < DATA_SIZE; i += BLOCK_NUM * THREAD_NUM) {
    	shared[tid] += num[i] * num[i];
    }
  5. Using shared memory, a partial sum is computed within each block. This partial summation is itself parallelised to run as fast as possible. Please refer to the source code for the details of this part.
  6. The BLOCK_NUM partial sums, one per block, are copied back to the CPU side and added together to obtain the final value.
    cudaMemcpy(sum, result, sizeof(int) * BLOCK_NUM, cudaMemcpyDeviceToHost);
    
    int final_sum = 0;
    for (int i = 0; i < BLOCK_NUM; i++) {
    	final_sum += sum[i];
    }
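
The steps above can be traced on the CPU with a minimal sketch in plain C++. It emulates the index pattern of step 4 and one common shared-memory pattern for step 5 (a pairwise tree reduction, which is an assumption here; the attached source may do this differently). The function names emulate_block and emulate_sum_of_squares are hypothetical and exist only for this illustration; nothing in it runs on the GPU.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// CPU-side emulation of one block: every "thread" tid accumulates a
// grid-strided slice into shared[tid], then the block's values are
// combined by a pairwise (tree) reduction.
int emulate_block(const std::vector<int>& num, int bid,
                  int block_num, int thread_num)
{
    // The tree reduction below assumes thread_num is a power of two.
    assert(thread_num > 0 && (thread_num & (thread_num - 1)) == 0);

    std::vector<int> shared(thread_num, 0);
    const std::size_t stride =
        static_cast<std::size_t>(block_num) * thread_num;

    // Step 4: same index pattern as the kernel loop above.
    for (int tid = 0; tid < thread_num; ++tid) {
        for (std::size_t i = static_cast<std::size_t>(bid) * thread_num + tid;
             i < num.size(); i += stride) {
            shared[tid] += num[i] * num[i];
        }
    }

    // Step 5: halve the number of active "threads" on each pass.
    // On the GPU, each pass would be followed by __syncthreads().
    for (int offset = thread_num / 2; offset > 0; offset /= 2) {
        for (int tid = 0; tid < offset; ++tid) {
            shared[tid] += shared[tid + offset];
        }
    }
    return shared[0];
}

// Step 6: add the per-block results on the host.
int emulate_sum_of_squares(const std::vector<int>& num,
                           int block_num, int thread_num)
{
    int final_sum = 0;
    for (int bid = 0; bid < block_num; ++bid) {
        final_sum += emulate_block(num, bid, block_num, thread_num);
    }
    return final_sum;
}
```

Because thread tid of block bid starts at bid * THREAD_NUM + tid and advances by BLOCK_NUM * THREAD_NUM, neighbouring threads always read neighbouring elements, which is the consecutive access pattern of step 3.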

To time the procedure, QueryPerformanceCounter records the code execution duration, which is then used to compare the different implementations. Before each call of QueryPerformanceCounter, the CUDA function cudaThreadSynchronize() is called to make sure that all computations on the GPU have really finished. (Please refer to the CUDA Best Practices Guide §2.1.)
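
QueryPerformanceCounter is specific to Windows. As a hedged, portable alternative, the same "synchronise, read the clock, work, synchronise, read the clock" pattern can be sketched with std::chrono from C++11. The helper time_seconds and its sync parameter are hypothetical names introduced for this sketch; for GPU work, sync would wrap cudaThreadSynchronize() (named cudaDeviceSynchronize() in later CUDA releases), and for CPU-only work it can do nothing.

```cpp
#include <cassert>
#include <chrono>

// Runs `work` once and returns the wall-clock time in seconds.
// `sync` is called right before each clock read so that pending
// asynchronous work is not left out of, or leaked into, the interval.
template <typename Sync, typename Work>
double time_seconds(Sync sync, Work work)
{
    sync();
    const std::chrono::steady_clock::time_point start =
        std::chrono::steady_clock::now();

    work();

    sync();
    const std::chrono::steady_clock::time_point end =
        std::chrono::steady_clock::now();

    const double seconds =
        std::chrono::duration<double>(end - start).count();
    assert(seconds >= 0.0);  // steady_clock never runs backwards
    return seconds;
}
```

Wrapping the host-to-device copy and the computation in two separate calls would give the two durations individually, which is one way to see how much of the total is data transfer.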

More details on the raw CUDA code can be found in the attached source code. Comments are also welcome.

Algorithm in Thrust

Using the thrust library can make CUDA code as simple as plain C++. Its usage is also compatible with that of the C++ STL (Standard Template Library). For instance, the code for the calculation on the GPU with thrust is sketched like this:

thrust::host_vector<int> data(DATA_SIZE);
srand(time(NULL));
thrust::generate(data.begin(), data.end(), random());

cudaThreadSynchronize();
QueryPerformanceCounter(&elapsed_time_start);

thrust::device_vector<int> gpudata = data;

int final_sum = thrust::transform_reduce(gpudata.begin(), gpudata.end(),
    square<int>(), 0, thrust::plus<int>());

cudaThreadSynchronize();
QueryPerformanceCounter(&elapsed_time_end);
elapsed_time = (double)(elapsed_time_end.QuadPart - elapsed_time_start.QuadPart)
    / frequency.QuadPart;

printf("sum (on GPU): %d; time: %lf\n", final_sum, elapsed_time);

thrust::generate generates the random data using the functor random, which is defined in advance. random was written to produce a random integer in the range 0 to 9.

// define functor for
// random number ranged in [0, 9]
class random
{
public:
    int operator() ()
    {
        return rand() % 10;
    }
};

By comparison, random number generation without thrust is not as elegant:

// generate random number ranged in [0, 9]
void GenerateNumbers(int * number, int size)
{
	srand(time(NULL));
	for (int i = 0; i < size; i++) {
		number[i] = rand() % 10;
	}
}

Similarly, square is a transformation functor taking one argument; its definition is shown below. square is declared __host__ __device__, so it can be used on both the CPU and the GPU sides.

// define transformation f(x) -> x^2
template <typename T>
struct square
{
	__host__ __device__
		T operator() (T x)
	{
		return x * x;
	}
};

That is all for the thrust-based code. Is it concise enough? :) Here, too, QueryPerformanceCounter records the execution time. For comparison, the host_vector data is also processed on the CPU. The code below performs the summation on the CPU side:

QueryPerformanceCounter(&elapsed_time_start);

final_sum = thrust::transform_reduce(data.begin(), data.end(),
	square<int>(), 0, thrust::plus<int>());

QueryPerformanceCounter(&elapsed_time_end);
elapsed_time = (double)(elapsed_time_end.QuadPart - elapsed_time_start.QuadPart)
	/ frequency.QuadPart;

printf("sum (on CPU): %d; time: %lf\n", final_sum, elapsed_time);
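
As a rough mental model of what the transform_reduce calls above compute, the sequential sketch below applies the unary functor to every element and folds the results with the binary operator, starting from the initial value. The names transform_reduce_sketch, square_host and transform_reduce_example are hypothetical and introduced only for this illustration; this is plain C++ and says nothing about how thrust actually schedules the work on the GPU.

```cpp
#include <cassert>
#include <functional>
#include <vector>

// Same functor as in the article, without the CUDA qualifiers so that it
// compiles with a plain C++ compiler.
template <typename T>
struct square_host
{
    T operator()(T x) const { return x * x; }
};

// Sequential sketch of what transform_reduce computes. The parameter
// order follows the thrust call above:
// (first, last, unary_op, init, binary_op).
template <typename Iterator, typename UnaryOp, typename T, typename BinaryOp>
T transform_reduce_sketch(Iterator first, Iterator last,
                          UnaryOp unary_op, T init, BinaryOp binary_op)
{
    T result = init;
    for (; first != last; ++first) {
        result = binary_op(result, unary_op(*first));
    }
    return result;
}

// Usage: 1*1 + 2*2 + 3*3 = 14.
inline void transform_reduce_example()
{
    const int values[] = {1, 2, 3};
    const int sum = transform_reduce_sketch(values, values + 3,
                                            square_host<int>(), 0,
                                            std::plus<int>());
    assert(sum == 14);
}
```

The type of the initial value (0, an int) also determines the type of the accumulated result.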

I also tested the performance of looping over thrust::host_vector<int> data as if it were a plain array. I expected this to carry more overhead, but was curious to know how much. The corresponding code is:

final_sum = 0;
for (int i = 0; i < DATA_SIZE; i++)
{
    final_sum += data[i] * data[i];
}

printf("sum (on CPU): %d; time: %lf\n", final_sum, elapsed_time);

The execution time was recorded for comparison as well.
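
If the extra cost of the loop comes from accessing elements through the container (an assumption; it was not profiled here and may depend heavily on compiler settings), one common workaround is to fetch the underlying pointer once and loop over that. In the sketch below, std::vector stands in for thrust::host_vector so that it builds without the CUDA toolkit, and sum_of_squares_raw is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// std::vector stands in for thrust::host_vector here (an assumption made
// so the sketch builds without the CUDA toolkit).
int sum_of_squares_raw(const std::vector<int>& data)
{
    if (data.empty()) {
        return 0;
    }
    // Fetch the underlying pointer once instead of calling operator[]
    // on the container for every element.
    const int* p = &data[0];
    assert(p != 0);  // a non-empty vector always has storage
    const std::size_t n = data.size();

    int final_sum = 0;
    for (std::size_t i = 0; i < n; ++i) {
        final_sum += p[i] * p[i];
    }
    return final_sum;
}
```

With full compiler optimisation the two forms may well perform the same, so this is worth measuring rather than assuming.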

Test Results on GPU & CPU

Previous experience shows that the GPU surpasses the CPU when massively parallel computation is achieved. As DATA_SIZE increases, the potential of GPU calculation is gradually released. This is predictable. But do we lose efficiency when we apply thrust? I guess so, since it brings extra cost, but do we lose much? We have to judge from the comparison results.

When DATA_SIZE increases from 1 M to 32 M (1 M equals 1 * 1024 * 1024), the results obtained are shown in the table:

The descriptions of the items are:

  • GPU Time: Execution time of the raw CUDA code
  • CPU Time: Execution time of the plain loop code running on the CPU
  • GPU thrust: Execution time of the CUDA code with thrust
  • CPU thrust: Execution time of the CPU code with thrust
  • CPU '': Execution time of the plain loop code based on thrust::host_vector
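
Since every implementation accumulates into an int, it may be worth checking how far the tested range is from the limit of that type. The sketch below is a back-of-the-envelope estimate, assuming rand() % 10 is roughly uniform, so that the mean square is (0 + 1 + 4 + ... + 81) / 10 = 28.5, and assuming a 32-bit int. The function names are hypothetical.

```cpp
#include <cassert>
#include <climits>

// Mean of x*x for x uniform in {0, ..., 9} is 28.5, written as 285 / 10
// to stay in integer arithmetic. A 64-bit type holds the estimate.
long long expected_sum_of_squares(long long data_size)
{
    assert(data_size >= 0);
    return data_size * 285 / 10;
}

// True when the expected sum still fits in a signed int.
bool expected_sum_fits_int(long long data_size)
{
    return expected_sum_of_squares(data_size) <= INT_MAX;
}
```

Under these assumptions the expected sum at 32 M is 956,301,312, which fits, and 64 M (about 1.9 billion) still fits, but 128 M (about 3.8 billion) does not, so larger runs would need a wider accumulator.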

The corresponding trends can be summarised as:

or compare them by the column figure:

The speedup of the GPU over the CPU is obvious when DATA_SIZE exceeds 4 M, and the larger the data size, the better the speedup. Interestingly, in this region the cost of using thrust is small enough to be neglected. On the other hand, don't use thrust on the CPU side, neither the thrust::transform_reduce method nor a plain loop over a thrust::host_vector; according to the figures, the cost is huge. Use a plain array and a loop instead.

From the comparison figure, we found that thrust not only simplifies the CUDA code but also makes up for the loss of efficiency when DATA_SIZE is relatively small. Therefore, it is strongly recommended.

Conclusion

Based on the tests performed, the GPU, by employing parallelism, shows greater potential than the CPU, especially for calculations that contain many parallel elements. This article also found that applying thrust does not reduce code execution efficiency on the GPU side, but dramatically reduces efficiency on the CPU side. Consequently, it is better to use plain arrays for CPU calculations.

In conclusion, using thrust feels pretty good, because it improves the code efficiency, and with thrust the CUDA code can be concise and rapidly developed.

Code Instruction

The code file thrustExample.cu, contained in the zip package, includes the algorithms for the raw CUDA as well as thrust on both GPU and CPU. Note that the calculation execution has to be repeated enough times in order to extract average values for a practical benchmark test; for clarity and simplification, I didn't include this feature in the code attached, but it is easy to add.

The code was built and tested on Windows 7 32-bit with Visual Studio 2008, CUDA 3.0 and the latest thrust 1.2. One also needs an NVIDIA graphics card as well as the CUDA toolkit to run the programs. For instructions on installing CUDA, please refer to its official site, CUDA Zone.

History

  • 25/05/2010: The first version of the present article was released.
  • 26/05/2010: Source code packages are attached and the article is also updated accordingly.
  • 05/06/2010: The two packages are combined and the code is improved according to recent readers' comments. The article is updated accordingly; in particular, the implemented algorithms are described in more detail.
  • 27/06/2010: The code was modified with the help of the recent comments. In particular, the necessary call of cudaThreadSynchronize() was added. Meanwhile, the presentation of the test results is also polished further to be clear and elegant.

License

This article, along with any associated source code and files, is licensed under The GNU General Public License (GPLv3)

About the Author

Wayne Wood
Engineer
United States
Member

Free your CFD

 
Working on numerical modelling on electromagnetic, thermal and fluid dynamics etc in power and energy field.
 
Programming in C/C++ from Visual C++ 6.0 and in C#/VB.NET since Visual Studio 2003. Experienced in MATLAB, Python and Fortran etc. Meanwhile I am also a Linux fan.
 
Happy to exchange ideas!

Comments and Discussions

 
News | My Results | Member 10060784 (member) | 19 May '13 - 6:17
Hi,
I ported it all to linux (it was not working as-is).
I don't see here how to attach a file though.
I have GeForce 210 and my CPU is Phenom-II x6 2.7Ghz.
This is what I observe (CPU is faster than CUDA or Thrust):
CUDA initialized.
Thrust v1.5
 
...
 
Data size = 32M
sum (on GPU): 956110635; time: 0.060000 (core clocks: 23561968)
sum (on CPU): 956110635; time: 0.020000
if with thrust support,
sum (on GPU): -608775556; time: 0.060000
sum (on CPU): -608775556; time: 0.030000
 
Data size = 64M
sum (on GPU): 1912583828; time: 0.130000 (core clocks: 50890972)
sum (on CPU): 1912583828; time: 0.060000
if with thrust support,
sum (on GPU): -1537996292; time: 0.120000
sum (on CPU): -1537996292; time: 0.060000
 
Data size = 128M
sum (on GPU): -469843183; time: 0.240000 (core clocks: 94416500)
sum (on CPU): -469843183; time: 0.100000
if with thrust support,
sum (on GPU): 841174922; time: 0.250000
sum (on CPU): 841174922; time: 0.110000
 
Of course my GPU is not one meant for such workloads, but I still expected it to do better than this.
 
By the way, there is a possibility in Thrust to change the backend from CUDA to OpenMP.
Then all thrust algorithms would use multithreading instead of GPU.
But it was not used here - only one core was loaded.
Question | What you really measure | Mario Mulansky (member) | 10 Jun '10 - 0:13
First of all, thanks for this small test program!
 
However, there are a few issues that make your test results not very representative. Let me first note that I used your code under Linux, so I had to change the time measurements because QueryPerformanceCounter is Windows only. The issues I found were:
1. You should add cudaThreadSynchronize() calls before each time measurement, that is, before each call of QueryPerformanceCounter, to make sure that all computations on the card are really finished. Have a look at the NVIDIA CUDA Best Practices Guide §2.1.
2. When measuring GPU performance you actually measure the time required to transfer your data to the device. This time is much larger than the one needed to do the actual computation. Additionally, this time scales linearly with data size - as seen in your results - while the computation time should remain approx. constant due to the massively parallel cores on the GPU. If you write your QueryPerformanceCounter AFTER the cudaMemcpy(gpudata, data, sizeof(int) * DATA_SIZE, cudaMemcpyHostToDevice); for the first case and AFTER the thrust::device_vector<int> vec_gpudata = vec_data; for the thrust case, you will get astonishing GPU performance. These results are then also more representative for comparing thrust and direct CUDA, as before you were mainly measuring memory transfer times.
3. I couldn't reproduce your bad thrust CPU performance at all. I'm not even sure whether, in the program I downloaded, thrust was used for the CPU calculation in the second case. To do so, you should simply apply transform_reduce on the host vector, e.g. final_sum = thrust::transform_reduce(vec_data.begin(), vec_data.end(), square<int>(), 0, thrust::plus<int>());. The performance of thrust on the CPU was roughly equal to the direct implementation in my runs - did you use full compiler optimization when creating the binaries?
 
Regards, Mario
Answer | Re: What you really measure | Wayne Wood (member) | 12 Jun '10 - 23:10
Hi Mario,
 
Thank you very much for your good comments. I learnt much from them.
 
1. I agree with you. cudaThreadSynchronize() should be called before recording the timestamp. Thanks for reminding me of this.
 
2. I agree with you. I understand what you are saying. However, my purpose is to measure the time needed to finish the calculation, including the data transfer time if a GPU is employed. When the data transfer time is unacceptably long, it doesn't make sense to use GPU calculation either, even though the calculation efficiency is astonishing. You are right to say I am actually measuring the data transfer time, because the computation time stays relatively small and constant. I should use a better way to make it clearer. Maybe the two times should be measured separately and then summarised together to give a clearer picture.
 
3. Trying final_sum = thrust::transform_reduce(vec_data.begin(), vec_data.end(), square<int>(), 0, thrust::plus<int>()); is helpful, but I forgot. Thanks for reminding me of this. I only used a thrust container instead of a plain array, but used a for loop to do the calculation, and the performance is really bad. I should try the thrust::transform_reduce method and see the difference.
 
Thanks again for your comments.
Best regards,
Wayne
 
http://code-saturne.blogspot.com

General | Re: What you really measure | Mario Mulansky (member) | 13 Jun '10 - 2:13
To the 2. point:
 
You are right that one should always include memory transfer times when comparing GPU and CPU performance. However, your primary aim was to compare plain CUDA and thrust, am I correct? For this case, it doesn't make too much sense in my opinion to include memory transfer. In both cases the memory transfer governs the execution time, which makes it impossible to identify differences between plain CUDA and thrust.
Additionally, one typically tries to do as much computation on the card as possible between data transfers. In your program just one kernel is executed on the data, which is not a typical use of the GPU. I recommend adding more arithmetic operations (that is, transform and reduce methods) to be performed on the data. However, separately measuring the memory transfer times sounds like a very good idea to me.
 
Regards, Mario
General | Re: What you really measure | El Corazon (member) | 14 Jun '10 - 16:35
Mario Mulansky wrote:
For this case, it doesn't make too much sense in my opinion to include memory transfer. In both cases the memory transfer governs the execution time which makes it impossible to identify differences between plain CUDA and thrust.

 
Actually, I would disagree. You would want to include memory transfer in comparison.
A) does defining Thrust Operators add overhead to regular CUDA defined kernels?
B) does defining Thrust storage affect memory transfer operations within CUDA?
 
Since that is the purpose of thrust, that is exactly what it should be measuring. As a caveat, I would say it is "somewhat" unfair to compare it to CPU in terms of one simple operation: the CPU will almost always win because it does not have the memory transfer times, though you can transfer it first then calculate time to see if you have any gains in raw performance. But in the grand scheme of things defining whether or not enough computation is present to justify CUDA operations you want to include the memory transfers again. Separate memory tests is a good idea. When I got his initial results my memory transfer times were longer because the frick'n Dell I have under-clocked the PCIex transfers it seems. But once on the board calculations go fast. so I was looping on the internal calculations.
 
It all comes down to intent. If the intent is to see whether Thrust adds significant overhead to the same CUDA operations, then you want to do it one way. You are right though, he makes no attempt at the higher-level algorithms, which seems to be where Thrust would have an advantage, if only because you don't have to code them yourself.
 
I followed on Wayne's first steps with additional tests and algorithms and eventually coding a link between Thrust storage and CudaFFT and soon cudaBLAS. Though I am still debating on how to handle Complex numbers either through cuComplex provided by nVidia or one of the many CUDA alternatives....
_________________________
John Andrew Holmes "It is well to remember that the entire universe, with one trifling exception, is composed of others."
 
Shhhhh.... I am not really here. I am a figment of your imagination.... I am still in my cave so this must be an illusion....

General | Re: What you really measure | Wayne Wood (member) | 16 Jun '10 - 8:23
El Corazon wrote:
Mario Mulansky wrote:
For this case, it doesn't make too much sense in my opinion to include memory transfer. In both cases the memory transfer governs the execution time which makes it impossible to identify differences between plain CUDA and thrust.

 
El Corazon wrote:
Separate memory tests is a good idea. When I got his initial results my memory transfer times were longer because the frick'n Dell I have under-clocked the PCIex transfers it seems. But once on the board calculations go fast. so I was looping on the internal calculations.

 
El Corazon wrote:
It all comes down to intent.

 
I think it is a good discussion here. Originally I only wanted to test if performance of thrust is acceptable, and because I meant to use it for calculations, I included the memory transfer time at that moment. I think Mario is right to say that this would confuse people, because we mix the calculation part together with the memory copy time, which makes the conclusion slightly unclear. However, on the other hand, it all comes down to intent, as you said :)
 
Separately measuring the time for data transfer and calculation would solve the problem, I believe, because it can suit different intents.
 
Many thanks indeed for all your helpful comments!
Best regards,
Wayne
 
http://code-saturne.blogspot.com

General | conclusion | epitalon (member) | 2 Jun '10 - 22:52
Wayne,
 
As long as you don't do the same test without Thrust, with CUDA only, and for the same set of data sizes, one cannot say that GPU is faster than CPU on this particular case.
If a computation program is coded with simple arrays, the compiler may optimize the binary code so as to use:
1) parallelism on multicore
2) vectorization based on SSE instructions, which are of SIMD type (Single Instruction Multiple Data)
 
So, if one has a quad-core computer, the CPU can achieve a parallelization level of 4x4 = 16.
This may help compute very fast.
 
I don't say that GPU will be slower than CPU for any data size. But it would be interesting to know for which data size the GPU starts being faster than the CPU. If there is such a data size...
 
Anyway, thanks for sharing your test.
General | Re: conclusion | Wayne Wood (member) | 7 Jun '10 - 11:46
Hi,
 
Thanks for your comments. I am going to test the code efficiency if multi-cores are used for parallelism. Also, I heard of SSE, but I don't know how to use it :( Any recommendations?
 
Because of the latency hidden in the memory access on the GPU side, data size needs to be relatively big to release the power of GPU computations. That is right. I recently updated this article, and I hope it helps. Any comments are surely welcome. :)
 
Many thanks.
Best regards,
Wayne
 
http://code-saturne.blogspot.com

General | Re: conclusion | epitalon (member) | 8 Jun '10 - 0:17
Hi,
in my current project, I process digital images. I use Visual Studio 2005 coupled with Intel 11.1 compiler.
The Intel compiler has a feature, automatic vectorization, which automatically compiles simple loops using SSE instructions.
SSE instructions are specific instructions that execute add, multiply, etc. on several items at a time (eight short int, four int or float, or two double precision float).
 
The Intel compiler also has a feature, automatic parallelism, which automatically breaks down a loop across all available cores. In my project however, I use OPEN-MP #pragma instructions to control the parallelism.
 
I believe that Visual Studio 2010 has a C++ compiler that has the same features as Intel.
 
Best regards,
Jean-Marie Epitalon
General | Re: conclusion | Wayne Wood (member) | 8 Jun '10 - 2:28
Thanks a lot for your information. It is pretty interesting to me. :-D
 
I know .NET 4.0 introduces a parallelism support, and therefore I can use it in C#. I want to try SSE and OPEN-MP in C++ as well to compare the results. I am going to check it.
 
Many thanks for your time.
Best regards,
Wayne
 
http://code-saturne.blogspot.com

General | Re: conclusion | El Corazon (member) | 8 Jun '10 - 13:42
Wayne Wood wrote:
Because of the latency hidden in the memory access on the GPU side, data size needs to be relatively big to release the power of GPU computations.

 
Just as important as data size is what you are doing with it. Data parallelism is a very different concept from pure instruction parallelism. Back when massively parallel computers were the size of office rooms, there was a continual argument over data parallel versus instruction parallel. Multi-core computing uses instruction parallel, CUDA uses data parallel. This particular project uses only 24% of my GPU when tested, less when I moved it to the bigger brother nv260. Yet my CUDAFFT tests of thrust vs. raw CUDA used 75/78% of my GPU respectively. It is important when you are dealing with a debate over CPU vs. GPU that you consider the algorithm itself. Some algorithms move to the GPU better than others. FFT is one of those that is not well suited for the GPU, but many people have found ways to adapt it to improve the speed until it is faster on the GPU.
 
The debate of CUDA vs Thrust adaption of CUDA is a much different and less convoluted debate than GPU vs. CPU.

General | Re: conclusion | Wayne Wood (member) | 9 Jun '10 - 0:21
Hi,
 
Your explanations always let me learn more :)
 
It is a good topic of instruction parallel vs. data parallel. According to your experiences, what are their respective suitabilities for different algorithms? i.e. theoretically, for what kinds of algorithms it is better to use instruction parallel like multi-core CPUs, and for what algorithms, data parallel methodologies like CUDA can show their potential?
 
Many thanks.
Best regards,
Wayne
 
http://code-saturne.blogspot.com

General | Re: conclusion | El Corazon (member) | 10 Jun '10 - 10:27
Wayne Wood wrote:
It is a good topic of instruction parallel vs. data parallel. According to your experiences, what are their respective suitabilities for different algorithms? i.e. theoretically, for what kinds of algorithms it is better to use instruction parallel like multi-core CPUs, and for what algorithms, data parallel methodologies like CUDA can show their potential?

 
whew... where to start?
 
I guess it really comes down to understanding your data. Cuda performance is about 100x and more than that of a CPU in a perfectly parallel algorithm. But this happens rarely. There are many ways to solve a problem, the obvious solution is not always the best one, and sometimes it is. I would start at the algorithm stage and learn the different ways of processing parallelism. http://gpgpu.org/static/s2007/slides/03-data-parallel-algorithms-and-data-structures.pdf https://computing.llnl.gov/tutorials/parallel_comp/
 
As mentioned there is automated loop parallelism, the famous "magic bullet" of parallelism. Not too surprisingly it doesn't always work the best. Often re-describing the problem leads to better or worse results. As you learn each problem you get a feel for the results. I was never taught parallelism, I learned it intuitively, so I have a very different view of it than most.

General | Re: conclusion | Wayne Wood (member) | 12 Jun '10 - 7:24
El Corazon wrote:
I guess it really comes down to understanding your data. Cuda performance is about 100x and more than that of a CPU in a perfectly parallel algorithm. But this happens rarely. There are many ways to solve a problem, the obvious solution is not always the best one, and sometimes it is. I would start at the algorithm stage and learn the different ways of processing parallelism. http://gpgpu.org/static/s2007/slides/03-data-parallel-algorithms-and-data-structures.pdf https://computing.llnl.gov/tutorials/parallel_comp/

 
You gave really good links for me to learn. Thanks a lot!
Best regards,
Wayne
 
http://code-saturne.blogspot.com

General | excellent article | El Corazon (member) | 29 May '10 - 11:25
One comment, and it's nitpicking, since CUDA code is so dependent upon the number of cores of the CPU and GPU as well as the memory speed of both for issuing/receiving commands. I would give your system specs, and include a spreadsheet of your sample times, or at least a data list of more than one in the download. The article itself doesn't need the full data-set, the graphs are perfect.
 
People like me will want to compare their systems with yours, so would love to have your full system specs of the test machine as well as the actual timed data to compare. For instance benchmarking my systems I have quadro, 260, 280 and 9800GT machines to test with, the times are all different and even the ratios since GPU memory is significantly faster on the 260/280 series. I hope late this year to have a 480 to test with to compare results, but that may be too much to hope for.
 
Thank you for a wonderful article, I just downloaded the code so will be conducting my own experiments for evaluation in our new project. Thank you!
 
Jeff

General | Re: excellent article | Wayne Wood (member) | 29 May '10 - 12:10
Thanks a lot for your comment. :-D
 
Basically my computer is a Q6600 2.4G with 3G DDR2 800 memory and a GeForce 9800 GTX+ card with 512M GDDR3 graphics memory. I use two old-fashioned hard drives on an IDE bus. Windows 7 rated their data transfer rate at only 5.1. This is my home computer and it is a pity that I haven't spent much money on it :( I envy you for having lots of fantastic cards to use! :cool: If you can show the test results on your series of cards, that would be greatly enjoyable.
 
Also I would add my computer specs to the article.
 
Ideally, I would also combine both source code packages together and make a full version to test more. If you are interested we can also compare test results then.
Best regards,
Wayne
 
http://code-saturne.blogspot.com/

General | Re: excellent article | El Corazon (member) | 29 May '10 - 12:56
one other issue with the code, and it is only in respect to benchmarking practices. I moved the QueryPerformanceCounter up before the calculation of core clocks. Since this is simply a matter of determining how much use you put the GPU in while processing the problem, it really should not be part of the time, especially since it is calculated on the CPU not the GPU thus slowing down the evaluation. The mallocs frees and upload to card are fine, the goal in CUDA is to find a process that takes long enough to make the time transmitting to the GPU worthwhile, and it is a rough choice at times. The main rule of thumb in benchmarking properly and it is tricky at times, if it contributes to solving the problem, include it in the benchmark, if it contributes to evaluating the benchmark itself, do not time it, calculate it outside of the time end and before the next begin.
 
I am starting my times evaluation now. I just wanted us on the same page.
 
basically my code looks like this now:
 
	QueryPerformanceCounter(&elapsed_time_end);
 
	clock_t min_start, max_end;
	min_start = time_used[0];
	max_end = time_used[BLOCK_NUM];
	for (int i = 1; i < BLOCK_NUM; i++) {
		if (min_start > time_used[i])
			min_start = time_used[i];
		if (max_end < time_used[i + BLOCK_NUM])
			max_end = time_used[i + BLOCK_NUM];
	}
 
	elapsed_time = (double)(elapsed_time_end.QuadPart - elapsed_time_start.QuadPart)
		/ frequency.QuadPart;

General | Re: excellent article | Wayne Wood (member) | 29 May '10 - 22:35
Right! I totally agree with you. It is supposed to be a mistake of mine. I will modify the code when I polish the current version.
 
Thanks a lot :)
Best regards,
Wayne
 
http://code-saturne.blogspot.com/

General | Re: excellent article | El Corazon (member) | 30 May '10 - 10:16
Wayne Wood wrote:
Thanks a lot

 
Glad to help, plus I am helping myself to a nice start on benchmarking Thrust templates for an upcoming project. Now I just need to throw a nv285 at it and see what happens. :)

General | Please the same Computation with CUDA | Kevin Drzycimski (member) | 26 May '10 - 6:35
Hello,
 
When I saw the title I was looking forward to a real comparison CUDA<->Thrust.
It is obvious that Thrust is faster than a CPU.
General | Re: Please the same Computation with CUDA | Wayne Wood (member) | 26 May '10 - 11:33
I did the comparison actually. It is my fault not to describe it clearly. :(
 
I have re-structured the article and hope it is clearer. Please see the section "Without thrust support, compare GPU and CPU"
 
Both code packages are also attached for providing detailed information.
 
Many thanks.
Best regards,
Wayne
 
http://code-saturne.blogspot.com/

Joke | Re: Please the same Computation with CUDA | Kevin Drzycimski (member) | 31 May '10 - 3:07
Thank you very much!
 
This was exactly the result I hoped for :-D
 
Now let's start introducing thrust into my meta-template-lib :cool:
 

Thanks again,
Kevin
Question | Question | BryanWilkins (member) | 26 May '10 - 3:47
Why would the presence of the template library affect the speed of the CPU code? This makes no sense at all to me. If I'm missing something please explain? :)
-Bryan
 
My latest programming adventure was coding the multimedia features for the Rip Ride Rockit coaster at Universal Studios Florida. I love my job.

Answer | Re: Question | Wayne Wood (member) | 26 May '10 - 4:05
When using thrust the cpu calculation used thrust::host_vector to store data, but in the pure cuda version I didn't include thrust at all, which means the data was stored in a plain array, i.e.
 
int data[DATA_SIZE];
 
Does that make sense?
 
I am sorry I didn't paste the code yet. I think it can be clearer when code is available here.
 
Thanks for your attention :-D
Best regards,
Wayne

General | Re: Question | BryanWilkins (member) | 27 May '10 - 6:06
Yes, now it makes sense to me... Duh, if I had read the code a little more carefully, I would have figured that out for myself. :O
-Bryan
 
My latest programming adventure was coding the multimedia features for the Rip Ride Rockit coaster at Universal Studios Florida. I love my job.

Article Copyright 2010 by Wayne Wood