Unfortunately, the code we have created is incorrect, and the result of the function is in general undefined. For example, it can be 298441282.231 instead of the expected value. Let's consider the possible causes.
The main cause of errors in parallel programs is incorrect handling of shared resources, that is, resources common to all launched threads, and in particular shared variables.
This program compiles successfully in the Microsoft Visual Studio 2005 environment, and the compiler does not even show a single warning. However, it is incorrect. To understand why, recall that variables in OpenMP programs are divided into shared variables, which exist as single copies and are available to all threads, and private variables, which are localized in a particular thread. In addition, there is a rule stating that, by default, all variables in OpenMP parallel regions are shared, except for the indexes of parallel loops and variables defined inside those parallel regions.
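As a minimal sketch illustrating this rule (a hypothetical fragment, not code from this article), consider the following loop: a is declared before the parallel region and is therefore shared, while the loop index i and the variable b declared inside the region are private:

#include <omp.h>
#include <stdio.h>

int main()
{
  int a = 0;                          // declared outside the region: shared by default
  #pragma omp parallel for num_threads(2)
  for (int i = 0; i < 4; i++)         // parallel loop index: private by default
  {
    int b = i * i;                    // declared inside the region: private
    a += b;                           // data race: both threads modify the shared a
  }
  printf("a = %d\n", a);              // the value of a is therefore not guaranteed
  return 0;
}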
From the example above it is obvious that the x, y and s variables are all shared, which is absolutely incorrect. The s variable should indeed be shared, as it is the accumulator in the given algorithm. But when working with x or y, each thread calculates the next value and writes it into the corresponding variable (x or y). In this case the result depends on the order in which the parallel threads execute. In other words, if the first thread calculates a value of x, writes it into the x variable, and then, after the second thread performs the same operations, tries to read the x variable back, it will get the value written by the last thread, i.e. the second one. Errors like this, where program behavior depends on the order in which different code sections execute, are called a race condition or data race: unsynchronized memory accesses occur, and the computational threads "race" one another.
To find such errors we need special tools. One of them is Intel Thread Checker (the product's site: http://www.intel.com/software/products/tcwin). It is provided as a module for the Intel VTune Performance Analyzer profiler, extending the existing means for working with multi-threaded code. Intel Thread Checker allows you to detect both the errors described above and many other defects, for example, deadlocks (mutual locks of computational threads) and memory leaks.
When Intel Thread Checker is installed, a new project category, Threading Wizards (wizards for working with threads), appears in the New Project dialog of Intel VTune Performance Analyzer, and among these wizards there is Intel Thread Checker Wizard. To launch the example, select the wizard and specify the path to the executable program in the next wizard window. The program will be executed, and the profiler will collect data about the application's operation. An example of the diagnostics provided by Intel Thread Checker is shown in Figure 1.
As we can see, the number of errors is large even for such a small program. Thread Checker groups the detected errors, estimates how critical they are for program operation, and provides descriptions for them, which increases the programmer's productivity. In addition, the Source View tab shows the application's source code with the erroneous sections highlighted (Figure 2).
We should mention that Intel Thread Checker cannot detect errors in some cases: when the faulty code rarely gets control, or when it is executed on a system with a different architecture. Errors can also be missed when the set of input data differs greatly from the data the program processes when used by end users. All this prevents us from being sure that a multi-threaded program contains no errors when it is tested only by dynamic analysis tools, whose results depend on the environment and the timing of execution.
But there is another tool, and this is good news for OpenMP developers. This tool is VivaMP, which offers an alternative approach to verifying parallel programs. VivaMP is built as a static code analyzer and allows you to test application code without launching it. To learn more about VivaMP, visit the developers' site: http://www.viva64.com/vivamp-tool/.
The VivaMP analyzer integrates into the Visual Studio 2005/2008 environment and provides a simple interface for testing applications (Figure 3).
If we run VivaMP on our example, we will get messages about errors in four different lines where variables are modified incorrectly (Figure 4).
Of course, static analysis has some disadvantages, just as dynamic analysis does. But when used together, these methodologies (the Intel Thread Checker and VivaMP tools) complement each other very well, and together they provide a rather reliable method of detecting errors in multi-threaded applications.
The error of writing into the x and y variables described above, detected by both Intel Thread Checker and VivaMP, can easily be corrected: you only need to add one more clause to the '#pragma omp parallel for' construction: private(x, y). Thus these two variables will be defined as private, and each computational thread will have its own copies of x and y. But you should also keep in mind that all the threads save their calculated results by adding them to the s variable. Errors of this kind occur when one computational thread tries to write a value into shared memory while a second thread is simultaneously accessing it. In the given example this can lead to an incorrect result.
Let's consider the s += j*y instruction. It is originally assumed that each thread adds its calculated result to the current value of the s variable, and then the other threads execute the same operations. However, in some cases two threads begin to execute the s += j*y instruction simultaneously: each of them first reads the current value of the s variable, then adds the result of the j*y multiplication to this value, and writes the final result back into the s variable.
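The effect of such interleaving is easy to reproduce. Below is a minimal sketch (a hypothetical demonstration, not code from this article) in which two threads perform the same non-atomic read-add-write sequence on a shared accumulator; some of the additions are typically lost:

#include <omp.h>
#include <stdio.h>

int main()
{
  double s = 0;
  // Two threads each execute many non-atomic updates of the shared s.
  #pragma omp parallel for num_threads(2)
  for (int i = 0; i < 2000000; i++)
    s += 1.0;   // read s, add 1, write s: three steps that can interleave
  // With correct synchronization the result would be exactly 2000000.
  printf("expected 2000000, got %.0f\n", s);
  return 0;
}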
Unlike a read, which can be performed in parallel and is quite fast, the write must be serialized. Consequently, if the first thread writes its new value first, the second thread, performing its write afterwards, will erase the result of the first thread's calculations, because both computational threads read one and the same value of the s variable and then write their own data into it. In other words, the value of s that the second thread writes into shared memory does not take into account the results of the first thread's calculations at all. You can avoid this situation by making sure that at any moment only one thread is allowed to execute the s += j*y operation. Such operations are called indivisible or atomic. To tell the compiler that a particular instruction is atomic, we use the #pragma omp atomic construction. The program code in which the described operations are corrected is shown in Listing 3.
Listing 3
double FixedFunctionOpenMP(int N)
{
  double x, y, s = 0;
  #pragma omp parallel for \
          private(x, y) num_threads(2)
  for (int i = 1; i <= N; i++)
  {
    x = (double)i / N;
    y = x;
    for (int j = 1; j <= N; j++)
    {
      #pragma omp atomic
      s += j * y;
      y = y * x;
    }
  }
  return s;
}
Having recompiled the program and analyzed it once again with Thread Checker, we will see that it contains no critical errors. Only two messages remain, telling us that the parallel threads terminate when reaching the return operator in the FixedFunctionOpenMP function. In the given example this is fine, because only the code inside this function is parallelized. The VivaMP static analyzer will show no messages on this code at all, as it is absolutely correct from its viewpoint.
But some work still remains. Let's see whether our code has really become more efficient after parallelization. Let's measure the execution time of three functions: the sequential one, the incorrect parallel one, and the correct parallel one. The results of this measurement for N = 15000 are given in Table 1.
| Function | Result | Execution time |
|----------|--------|----------------|
| Sequential variant of the function | 287305025.528 | 0.5781 seconds |
| Incorrect variant of the parallel function | 298441282.231 | 2.9531 seconds |
| Correct variant of the parallel function using the atomic directive | 287305025.528 | 36.8281 seconds |
What do we see in the table? The incorrect parallel variant of the function is several times slower than the sequential one. But we are not interested in that function. The problem is that the correct variant works more than 60 times slower than the sequential one. Do we need such parallelism? Of course not.
The reason is that we have chosen a very inefficient way of summing the result into the s variable: the atomic directive. This approach causes the threads to wait for each other very often. To avoid the constant contention on the atomic addition, we can use the special reduction clause. The reduction option specifies that the variable will receive the combined value at the exit from the parallel block. The following operations are permissible: +, *, -, &, |, ^, &&, ||. The modified variant of the function is shown in Listing 4.
Listing 4
double OptimizedFunction(int N)
{
  double x, y, s = 0;
  #pragma omp parallel for private(x, y) \
          num_threads(2) reduction(+: s)
  for (int i = 1; i <= N; i++)
  {
    x = (double)i / N;
    y = x;
    for (int j = 1; j <= N; j++)
    {
      s += j * y;
      y = y * x;
    }
  }
  return s;
}
In this case we get a variant of the function that is not only correct but also offers higher performance (Table 2). The speed of calculation has almost doubled compared to the sequential variant (more precisely, it has increased by a factor of 1.85), which is a good result for such functions.
| Function | Result | Execution time |
|----------|--------|----------------|
| Sequential variant of the function | 287305025.528 | 0.5781 seconds |
| Incorrect variant of the parallel function | 298441282.231 | 2.9531 seconds |
| Correct variant of the parallel function using the atomic directive | 287305025.528 | 36.8281 seconds |
| Correct variant of the parallel function using the reduction directive | 287305025.528 | 0.3125 seconds |
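To see why reduction is so much faster than atomic, here is a hand-written equivalent of what the reduction clause effectively does (a sketch based on the typical implementation, an assumption rather than code from this article): each thread accumulates into its own private copy, and the copies are combined only once at the end, so the threads never contend for s inside the loop:

double ManualReduction(int N)
{
  double s = 0;
  #pragma omp parallel num_threads(2)
  {
    double x, y, s_local = 0;   // private partial sum of this thread
    #pragma omp for
    for (int i = 1; i <= N; i++)
    {
      x = (double)i / N;
      y = x;
      for (int j = 1; j <= N; j++)
      {
        s_local += j * y;       // no synchronization inside the loop
        y = y * x;
      }
    }
    #pragma omp atomic          // a single synchronized addition per thread
    s += s_local;
  }
  return s;
}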
In conclusion I would like to point out once again that a working parallel program is not always an efficient one. And although parallel programming provides many ways to increase code efficiency, it demands attention and good knowledge of the technologies used. Fortunately, there are tools such as Intel Thread Checker and VivaMP which greatly simplify the creation and testing of multi-threaded applications. Dear readers, I wish you good luck in mastering this new field of knowledge!
#include "stdafx.h" #include <omp.h> #include <stdlib.h> #include <windows.h> class VivaMeteringTimeStruct { public: VivaMeteringTimeStruct() { m_userTime = GetCurrentUserTime(); } ~VivaMeteringTimeStruct() { printf("Time = %.4f seconds\n", GetUserSeconds()); } double GetUserSeconds(); private: __int64 GetCurrentUserTime() const; __int64 m_userTime; }; __int64 VivaMeteringTimeStruct::GetCurrentUserTime() const { FILETIME creationTime, exitTime, kernelTime, userTime; GetThreadTimes(GetCurrentThread(), &creationTime, &exitTime, &kernelTime, &userTime); __int64 curTime; curTime = userTime.dwHighDateTime; curTime <<= 32; curTime += userTime.dwLowDateTime; return curTime; } double VivaMeteringTimeStruct::GetUserSeconds() { __int64 delta = GetCurrentUserTime() - m_userTime; return double(delta) / 10000000.0; } double Function(int N) { double x, y, s=0; for (int i=1; i<=N; i++) { x = (double)i/N; y = x; for (int j=1; j<=N; j++) { s += j * y; y = y * x; }; }; return s; } double FunctionOpenMP(int N) { double x, y, s=0; #pragma omp parallel for num_threads(2) for (int i=1; i<=N; i++) { x = (double)i/N; y = x; for (int j=1; j<=N; j++) { s += j * y; y = y * x; }; }; return s; } double FixedFunctionOpenMP(int N) { double x, y, s=0; #pragma omp parallel for private(x,y) num_threads(2) for (int i=1; i<=N; i++) { x = (double)i/N; y = x; for (int j=1; j<=N; j++) { #pragma omp atomic s += j * y; y = y * x; }; }; return s; } double OptimizedFunction(int N) { double x, y, s=0; #pragma omp parallel for private(x,y) \ num_threads(2) reduction(+: s) for (int i=1; i<=N; i++) { x = (double)i/N; y = x; for (int j=1; j<=N; j++) { s += j * y; y = y * x; }; }; return s; } int _tmain(int , _TCHAR* []) { int N = 15000; { VivaMeteringTimeStruct Timer; printf("Result = %.3f ", Function(N)); } { VivaMeteringTimeStruct Timer; printf("Result = %.3f ", FunctionOpenMP(N)); } { VivaMeteringTimeStruct Timer; printf("Result = %.3f ", FixedFunctionOpenMP(N)); } { VivaMeteringTimeStruct Timer; printf("Result = %.3f ", OptimizedFunction(N)); } return 0; } |