Hi everyone

I wrote a function to get the exponent of a number that is a power of two. Here is the code:

C++
int get_exponent(int n) {
    int exp = 0;
    while ((n = n / 2) > 0) {
    //while ((n = n >> 1) > 0) {
        exp++;
    }
    return exp;
}


Then it came to mind: "Well, let's replace the division operator with the shift operator to improve performance."

The result was not what I expected.

Using the division operator, the code takes about 1 second; using the shift operator, it takes about 1.3 seconds under the same conditions.
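Roughly, the timing loop I used looks like this (a sketch from memory; the test value is just a placeholder, and I ran 10,000,000 iterations of each version):

C++
#include <cstdio>
#include <ctime>

int get_exponent(int n);   // the function shown above

int main() {
    volatile int input = 1 << 30;   // volatile so the compiler cannot fold the call away
    volatile int sink = 0;          // keeps the result "used" so the loop is not removed
    std::clock_t start = std::clock();
    for (int i = 0; i < 10000000; ++i) {
        sink = get_exponent(input);
    }
    double seconds = double(std::clock() - start) / CLOCKS_PER_SEC;
    std::printf("10000000 calls took %.3f s (last result: %d)\n", seconds, sink);
    return 0;
}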

In debug mode, I checked the disassembly code that Visual Studio gave me (note: I don't understand anything about assembly language). I noticed that the assembly code was the same, except that when I use the division operator there are two more instructions.

So, can anyone explain to me the reason for this difference in execution time? I expected, in the worst case, the same execution time.

Thanks in advance
Filipe Marques

UPDATE 2: Added the full ASM code for release mode.

For the division operator:
ASM
PUBLIC  ?get_exponent@@YAHH@Z               ; get_exponent
; Function compile flags: /Ogtp
;   COMDAT ?get_exponent@@YAHH@Z
_TEXT   SEGMENT
?get_exponent@@YAHH@Z PROC              ; get_exponent, COMDAT
; _n$ = eax

; 124  :    int exp = 0;
; 125  :    while((n = n / 2) > 0) {

    cdq
    sub eax, edx
    sar eax, 1
    xor ecx, ecx
    test    eax, eax
    jle SHORT $LN7@get_expone
    npad    5
$LL2@get_expone:
    cdq
    sub eax, edx
    sar eax, 1

; 126  :    //while((n = n >> 1) > 0) {
; 127  :        exp++;

    inc ecx
    test    eax, eax
    jg  SHORT $LL2@get_expone
$LN7@get_expone:

; 128  :    }
; 129  :    return exp;

    mov eax, ecx

; 130  : }

    ret 0
?get_exponent@@YAHH@Z ENDP              ; get_exponent


For the shift operator:
ASM
PUBLIC  ?get_exponent@@YAHH@Z               ; get_exponent
; Function compile flags: /Ogtp
;   COMDAT ?get_exponent@@YAHH@Z
_TEXT   SEGMENT
?get_exponent@@YAHH@Z PROC              ; get_exponent, COMDAT
; _n$ = ecx

; 124  :    int exp = 0;
; 125  :    //while((n = n / 2) > 0) {
; 126  :    while((n = n >> 1) > 0) {

    sar ecx, 1
    xor eax, eax
    test    ecx, ecx
    jle SHORT $LN1@get_expone
$LL2@get_expone:
    sar ecx, 1

; 127  :        exp++;

    inc eax
    test    ecx, ecx
    jg  SHORT $LL2@get_expone
$LN1@get_expone:

; 128  :    }
; 129  :    return exp;
; 130  : }

    ret 0
?get_exponent@@YAHH@Z ENDP              ; get_exponent
Posted
Updated 26-Jan-13 9:59am
Comments
PIEBALDconsult 26-Jan-13 13:26pm    
Could you show the shifting method? If done properly, it should be faster.
Filipe Marques 26-Jan-13 13:35pm    
The shift method is the commented-out line.
PIEBALDconsult 26-Jan-13 14:00pm    
Ah, so you did. I had expected n >>= i :D
As mentioned, it's likely due to optimization; maybe compile without optimization, or don't use a literal 2.
Mike Meinz 26-Jan-13 14:15pm    
Funny that the divide version has two more instructions; I would think that one would take longer. Did you execute each version 10,000 times to get an average timing, or only a few times? If only a few times, it could be that some other task caused the divide version to run longer, and that task wasn't doing much when the shift version ran.

CDQ - Converts Double Word EAX to Quad Word EDX:EAX.
SUB - Subtracts EDX (Source) from EAX (Destination)
SAR - Shift Arithmetic Right

See http://web.itu.edu.tr/~kesgin/mul06/intel/
Filipe Marques 26-Jan-13 15:42pm    
I executed each one 10,000,000 times. Thanks for the link. I hope one day I'll gain the courage to learn a little bit about assembly :D

Since the compiler does far less optimization in debug mode, you should really compare the release versions.

By the way, is n a signed or an unsigned number? At first glance, I think that n is signed and that this is the reason for the extra instructions for the division (sign extension and subtraction). If n were unsigned, then my guess is that the code would be identical.

By the way, in both cases the SAR (Shift Arithmetic Right) instruction is used. This confirms that the number n is signed. I think the extra instructions are required to round odd negative numbers toward zero (by adding 1 before the shift).
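In C-like terms, the CDQ / SUB / SAR sequence implements something like the following (a rough sketch of the idea, not the compiler's literal output; the function name is just for illustration):

C++
// Roughly what the compiler does for "n / 2" when n is a signed int:
int divide_by_two(int n) {
    int sign = (n < 0) ? -1 : 0;   // CDQ: sign-extends n, so EDX holds -1 for negative n, 0 otherwise
    n = n - sign;                  // SUB: adds 1 to negative values so the shift rounds toward zero
    return n >> 1;                 // SAR: arithmetic shift right by one (what MSVC emits on x86)
}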

[ --- Improved answer --- ]

If n can be negative, division might be faster on average simply because it will loop less often in some cases.


For example, using division we have 5 steps starting from -31 before reaching 0:
1: -31 / 2 = -15
2: -15 / 2 = -7
3: -7 / 2 = -3
4: -3 / 2 = -1
5: -1 / 2 = 0


If instead we use >>, we already have 6 steps in that case, and in fact an arithmetic shift never turns -1 into 0 at all (it stays at -1):
1: -31 >> 1 = -16
2: -16 >> 1 = -8
3: -8 >> 1 = -4
4: -4 >> 1 = -2
5: -2 >> 1 = -1
6: -1 >> 1 = -1 (and it stays there)
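A small test program makes the two sequences easy to compare (a sketch for illustration only; right-shifting a negative int is arithmetic on MSVC/x86):

C++
#include <cstdio>

int main() {
    int d = -31;   // divided by 2 each step
    int s = -31;   // arithmetically shifted right each step
    for (int i = 1; i <= 6; ++i) {
        d = d / 2;    // rounds toward zero: -15, -7, -3, -1, 0, 0
        s = s >> 1;   // rounds toward negative infinity: -16, -8, -4, -2, -1, -1
        std::printf("step %d: division = %4d, shift = %4d\n", i, d, s);
    }
    return 0;
}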


For positive or even numbers, we would expect the shift to be a bit faster when n is signed (and exactly the same as division if n were unsigned).

If this is not the case, then it might be due to some obscure reason related to caching, jump prediction, or pipelining.

By the way, if n were made unsigned there might be a gain, and if n is never negative the results would be the same. If n is always a power of 2, then n should really be unsigned.

Finally, if you really want to optimize that function and you expect n to be greater than 255 (for example), then you should split the number into parts.

Something like:
C++
unsigned int n = /* some power of 2 */;
int exp = 0;
if (n > 0xFFFF) { exp = 16;  n = n / 65536; }
if (n > 0xFF)   { exp += 8;  n = n / 256; }
if (n > 0xF)    { exp += 4;  n = n / 16; }
if (n > 0x3)    { exp += 2;  n = n / 4; }
if (n > 0x1)    { exp += 1;  n = n / 2; }


Except for very small n (say less than 32), that code would be faster.

If n is not a power of 2, then the code gives the maximum exponent such that 2^exp is less than or equal to n.
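To sanity-check the snippet, you can wrap it in a function (the name get_exponent_unrolled is just for illustration) and compare it against every power of two:

C++
#include <cassert>

// The unrolled snippet above, wrapped in a function for testing.
int get_exponent_unrolled(unsigned int n) {
    int exp = 0;
    if (n > 0xFFFF) { exp = 16;  n = n / 65536; }
    if (n > 0xFF)   { exp += 8;  n = n / 256; }
    if (n > 0xF)    { exp += 4;  n = n / 16; }
    if (n > 0x3)    { exp += 2;  n = n / 4; }
    if (n > 0x1)    { exp += 1;  n = n / 2; }
    return exp;
}

int main() {
    for (int k = 0; k < 32; ++k) {
        assert(get_exponent_unrolled(1u << k) == k);   // 2^k must map back to k
    }
    return 0;
}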

And still faster:

If you compile in 32 bits under Visual C++, you can use the assembler BSR instruction.

Alternatively, you can use an intrinsic function (_BitScanReverse under Visual C++ 2005 and above).

For other compilers, you might have similar options.
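A sketch of the intrinsic version under Visual C++ (the function name and the handling of n == 0 are arbitrary choices here; GCC and Clang offer __builtin_clz for the same purpose):

C++
#include <intrin.h>
#include <cstdio>

// Returns the index of the highest set bit; for a power of two, that index is the exponent.
int get_exponent_bsr(unsigned long n) {
    unsigned long index;
    if (_BitScanReverse(&index, n) != 0) {   // the intrinsic returns 0 when n == 0 (no bit set)
        return (int)index;
    }
    return 0;   // n == 0 has no set bit; pick whatever result makes sense for your code
}

int main() {
    std::printf("%d\n", get_exponent_bsr(1024));   // prints 10
    return 0;
}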

Conclusion

In any case, you should use an unsigned number if n is always positive, as it will lead to more efficient code. The intrinsic function is the best option if you really need performance.

If you want portable code that should be fast, then my unrolled code above is the best option, particularly if n is expected to be larger than one byte.
 
Comments
Filipe Marques 26-Jan-13 16:05pm    
Yes, 'n' is a signed variable (int). I updated the question; I have now put the ASM code for both situations. The compiler still uses the SAR instruction. The only difference I see now is at the end of the code: with the division operator the compiler emits a MOV instruction, and with the shift operator it doesn't.
Philippe Mori 26-Jan-13 21:18pm    
It seems that you haven't fully understood my point. Since n is signed (and the compiler cannot know for sure that n is always positive), when doing a division it has to take into account the case where n is negative.

For example, if you have -3 (1111...1101) and you shift right (using SAR, thus keeping the sign), you get -2 (1111...1110), which is different from -3 / 2, which gives -1. In one case it is rounded toward negative infinity, and in the other case toward 0.

The compiler is able to fix that by adding 1 to negative numbers (it sign-extends the number into EDX, so EDX is -1 for negative numbers and 0 otherwise). Thus, for the division it effectively does n = (n - (n < 0 ? -1 : 0)) >> 1.

If the number n were unsigned instead, then a division by 2 would be equivalent to a right shift (without keeping the sign bit).

You have to understand that >> on a signed number is not the same as >> on an unsigned number. In the first case, the generated ASM instruction keeps the sign bit untouched (SAR), while in the second case the vacated bit is filled with 0 (SHR).
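For example (a small illustration; the bit patterns assume 32-bit two's complement, and the instructions mentioned are what MSVC emits on x86):

C++
#include <cstdio>

int main() {
    int s = -2;                      // bit pattern 0xFFFFFFFE on a 32-bit two's-complement machine
    unsigned int u = 0xFFFFFFFEu;    // same bit pattern, but unsigned
    std::printf("signed:   -2 >> 1 = %d\n", s >> 1);     // -1: SAR keeps the sign bit
    std::printf("unsigned:  u >> 1 = 0x%X\n", u >> 1);   // 0x7FFFFFFF: SHR shifts a 0 in
    return 0;
}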
Filipe Marques 27-Jan-13 9:36am    
Thanks a lot for the explanation, Philippe. I never realised that using a signed or an unsigned variable could have such an impact. Now I have the same performance in both situations, and the ASM code is the same too.
If the division really is faster, then you may be the victim/beneficiary of some form of auto-parallelization at runtime. Not necessarily because any auto-parallelization has been enabled, but, for example, the extra size and padding in the division case might push you onto 2 cache lines instead of one; changes in pipelining and hyperthreading can have strange corner cases, possibly slowing down the smaller shift case rather than speeding up the larger division case.

Sometimes these changes can be dramatic, especially when magnified by doing the same thing thousands of times in a tight loop. That in itself can cause side effects, because the machine doesn't know it's the same thing over again, so it can forget the previous result or calculation step. Something this small will fit a single Core i7 pipeline several times over, so it might come down to the end being different and causing a pipeline stall. There are even differences between Intel and AMD at this low a level in branch prediction, which can radically impact performance.

In conclusion, your guess is as good as mine; you'd need to ask Intel, assuming it's their chip you're executing on.
 
Optimization: As you may see in the ASM code, the compiler is smart enough to replace the division with the shift operation.
 
Comments
Filipe Marques 26-Jan-13 16:13pm    
I did that because on low-performance CPUs (like microcontrollers), division is an expensive instruction (correct me if I am wrong). I know that compilers nowadays optimize the code. What I expected was that, in the worst case, the division would take the same time as the shift operator (with the compiler transforming the division into a shift). The result was a difference on the order of 30%.
You did not supply the disassemblies for us to look at, so it is impossible to tell what the additional instructions do. Without that, my guess is that the compiler has an optimization phase that optimizes division by a constant 2 by "rewriting" your code. When you use the shift operator, the optimizer probably doesn't do much, if anything, to your code.
 
Comments
Filipe Marques 26-Jan-13 13:37pm    
Hi Mike, I already updated the question with the assembly code.
