OpenMP process time

Question

Hi everyone!
I'm working on a simple image processing program using OpenMP,
but I can't make my program as fast as it's supposed to be.

The processing time of the code below was 0.5 secs,
but with the OpenMP line commented out, it was only 0.1 secs.
I can't see what is going on.
Please tell me what I'm missing.

thanks in advance!
-carlos

[machine info]
cpu:intel core 2 quad 2.6ghz
memory:4gb
os:win xp professional

[code]

int width = 10000;
int length = 10000;
#pragma omp parallel for  // open mp line
for (int y = 0; y < length; y++) {
  int y_offset = y * width;
  const byte* source = SOURCE_IMAGE_ADDRESS_;
  source += y_offset;
  byte* destination = DESTINATION_IMAGE_ADDRESS_;
  destination += y_offset;
  for (int x = 0; x < width; x++) {
    *destination = (*source);
    source++;
    destination++;  
  }
}

Posted 15-Jan-12 17:19pm

alianzalima1978

Updated 16-Jan-12 1:36am

Stefan_Lang

v2

3 solutions

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

Sergey Alexandrovich Kryukov · Answer 1 · 2012-01-15T18:09:00

Solution 1

Did you gain 5 times in performance? That is a better result than I would expect. How could you expect more when you only have 4 cores? Do you think parallel processing is a miracle that can draw power from nowhere? :-)

—SA

Posted 15-Jan-12 18:09pm

Sergey Alexandrovich Kryukov

Comments

alianzalima1978 16-Jan-12 0:28am

Thanks for checking my question, SAKryukov!

I meant that I got worse performance using OpenMP:
- with "#pragma omp parallel for": 0.5 secs
- without "#pragma omp parallel for": 0.1 secs

Jochen Arndt · Answer 2 · 2012-01-15T22:26:00

Solution 2

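If the inner loop variable x is declared outside the parallel loop, it is shared by default, and you should make it private. (In the code as posted, x is declared inside the loop body, so each thread already has its own copy, and the clause below only applies if you move the declaration out.)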

C++

#pragma omp parallel for private(x)

This should boost performance. But I'm not sure if this is all.
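To make the data-sharing point concrete, here is a minimal sketch, not the original poster's actual code. It assumes `byte` is an 8-bit unsigned type and uses a hypothetical function name, `copy_rows_private_x`. The counter `x` is deliberately declared before the parallel loop, which is the situation where `private(x)` is needed. Without an OpenMP compiler switch (for example `/openmp` or `-fopenmp`) the pragma is ignored and the loop simply runs serially.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

typedef unsigned char byte;  // assumption: the question's `byte` is an 8-bit unsigned type

// Hypothetical helper: copies a width x height image row by row.
// `x` is declared before the parallel loop on purpose, so it has to be
// listed in private(); otherwise all threads would share one counter
// and race on it.
void copy_rows_private_x(const std::vector<byte>& src, std::vector<byte>& dst,
                         int width, int height) {
    assert(src.size() == static_cast<std::size_t>(width) * height);
    assert(dst.size() == src.size());
    const byte* in = src.data();
    byte* out = dst.data();

    int x;  // declared outside the loop: shared unless made private
    #pragma omp parallel for private(x)
    for (int y = 0; y < height; y++) {
        // Variables declared inside the loop body are private automatically.
        const std::size_t offset = static_cast<std::size_t>(y) * width;
        for (x = 0; x < width; x++) {
            out[offset + x] = in[offset + x];
        }
    }
}
```

If `x` is declared in the `for` statement itself, as in the question, the `private(x)` clause is unnecessary and would not even compile, because no `x` is visible at the point of the pragma.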

Posted 15-Jan-12 22:26pm

Jochen Arndt

Stefan_Lang · Answer 3 · 2012-01-16T01:56:00

Solution 3

"simple image process program using open mp"

That is the first mistake: Multithreading is never simple!

The second problem is memory access. I don't know how cleverly OMP goes about it, but I doubt it will recognize that for each y you access values from a very specific range of memory, that this range does not overlap with the ranges used for other values of y, and, more importantly, that it does not overlap with the memory range you write to. If it cannot recognize that, every single memory access would have to be synchronized, and the majority of your code would in effect run serially, whether you use OMP or not. Worse, the synchronization would probably take longer than the actual access!

I don't know OMP, but you must somehow tell it not to synchronize the memory accesses, either by specifying the memory ranges as local (or 'private', as used in Solution 2?) or maybe through other options.
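As far as I know, OpenMP does not insert any synchronization around ordinary loads and stores to shared memory. Avoiding conflicting accesses is left to the programmer, and here each row is written by exactly one iteration. The sketch below spells the data sharing out with `default(none)`, so the compiler rejects any variable whose sharing was not stated. The function name `copy_image` and the `byte` typedef are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

typedef unsigned char byte;  // assumption: 8-bit unsigned pixel type

// Hypothetical helper: returns a copy of a width x height image.
// Each iteration copies one row, and rows never overlap, so no
// synchronization between threads is needed inside the loop.
std::vector<byte> copy_image(const std::vector<byte>& src, int width, int height) {
    assert(src.size() == static_cast<std::size_t>(width) * height);
    std::vector<byte> dst(src.size());
    const byte* in = src.data();
    byte* out = dst.data();

    // default(none) forces every variable used in the loop to be listed;
    // the two pointers and the dimensions are only read, so sharing is safe.
    #pragma omp parallel for default(none) shared(in, out, width, height)
    for (int y = 0; y < height; y++) {
        const std::size_t offset = static_cast<std::size_t>(y) * width;
        std::memcpy(out + offset, in + offset, static_cast<std::size_t>(width));
    }
    return dst;
}
```

A plain copy like this is likely limited by memory bandwidth rather than by the CPU, so adding threads may not help much. That is a guess about the cause, not something measured in this thread.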

There's more to it, e.g. each core will try to load the memory it accesses into its own cache. If two (or more) cores access memory addresses that fall into the same cached block, the caches need to be synchronized every time this happens. You may not be aware of it, but if the memory addresses of your source and destination are close to each other, the cores may try to load the entire memory block that includes both addresses, and thus the caches would indeed overlap...
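The cache-overlap concern can be checked with a little arithmetic. The sketch below assumes 64-byte cache lines (typical for x86 CPUs such as the Core 2) and a buffer that starts on a cache-line boundary; both are assumptions, as are all the names. It computes which cache lines a row touches, and it copies with `schedule(static)`, which gives each thread a contiguous block of rows, so threads can only share a line where two blocks meet.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

typedef unsigned char byte;  // assumption: 8-bit unsigned pixel type

// Assumption: 64-byte cache lines and a line-aligned buffer start.
const std::size_t kCacheLine = 64;

// Index of the first cache line touched by row y.
std::size_t first_line(int y, int width) {
    return (static_cast<std::size_t>(y) * width) / kCacheLine;
}

// Index of the last cache line touched by row y.
std::size_t last_line(int y, int width) {
    assert(width > 0);
    return (static_cast<std::size_t>(y) * width + width - 1) / kCacheLine;
}

// Hypothetical helper: row-wise copy where each thread gets one
// contiguous block of rows.
void copy_rows_static(const std::vector<byte>& src, std::vector<byte>& dst,
                      int width, int height) {
    assert(src.size() == static_cast<std::size_t>(width) * height);
    assert(dst.size() == src.size());
    const byte* in = src.data();
    byte* out = dst.data();

    #pragma omp parallel for schedule(static)
    for (int y = 0; y < height; y++) {
        const std::size_t offset = static_cast<std::size_t>(y) * width;
        std::memcpy(out + offset, in + offset, static_cast<std::size_t>(width));
    }
}
```

With the question's width of 10000 bytes, row 0 ends in cache line 156 and row 1 starts in that same line, so neighbouring rows share exactly one line. Padding each row to a multiple of 64 bytes (10048, for instance) would remove even that overlap under the stated assumptions. Either way, the shared part is tiny compared to a whole row, so it is unlikely to explain a 5x slowdown on its own.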

As I said: Multithreading is never simple!