Hi everyone

I wrote a function to get the exponent of a number that is a power of two. Here is the code:

C++
int get_exponent(int n) {
    int exp = 0;
    while ((n = n / 2) > 0) {
    //while ((n = n >> 1) > 0) {
        exp++;
    }
    return exp;
}


Then it came to mind: "Well, let's replace the division operator with the shift operator to improve performance."

The result was not what I expected.

Using the division operator, the code takes about 1 second; using the shift operator, it takes about 1.3 seconds under the same conditions.
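Roughly, the timing loop I used looks like this (a sketch from memory; the test value is just a placeholder, and I ran 10,000,000 iterations of each version):

C++
#include <cstdio>
#include <ctime>

int get_exponent(int n);   // the function shown above

int main() {
    volatile int input = 1 << 30;   // volatile so the compiler cannot fold the call away
    volatile int sink = 0;          // keeps the result "used" so the loop is not removed
    std::clock_t start = std::clock();
    for (int i = 0; i < 10000000; ++i) {
        sink = get_exponent(input);
    }
    double seconds = double(std::clock() - start) / CLOCKS_PER_SEC;
    std::printf("10000000 calls took %.3f s (last result: %d)\n", seconds, sink);
    return 0;
}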

In debug mode, I checked the disassembly code that Visual Studio gave me (note: I don't understand anything about assembly language). I noticed that the assembly code was the same, except that when I use the division operator there are two more instructions.

So, can anyone explain to me the reason for this difference in execution time? I expected, in the worst case, the same execution time.

Thanks in advance
Filipe Marques

UPDATE 2: Added the full ASM code for release mode.

For the division operator:
ASM
PUBLIC  ?get_exponent@@YAHH@Z               ; get_exponent
; Function compile flags: /Ogtp
;   COMDAT ?get_exponent@@YAHH@Z
_TEXT   SEGMENT
?get_exponent@@YAHH@Z PROC              ; get_exponent, COMDAT
; _n$ = eax

; 124  :    int exp = 0;
; 125  :    while((n = n / 2) > 0) {

    cdq
    sub eax, edx
    sar eax, 1
    xor ecx, ecx
    test    eax, eax
    jle SHORT $LN7@get_expone
    npad    5
$LL2@get_expone:
    cdq
    sub eax, edx
    sar eax, 1

; 126  :    //while((n = n >> 1) > 0) {
; 127  :        exp++;

    inc ecx
    test    eax, eax
    jg  SHORT $LL2@get_expone
$LN7@get_expone:

; 128  :    }
; 129  :    return exp;

    mov eax, ecx

; 130  : }

    ret 0
?get_exponent@@YAHH@Z ENDP              ; get_exponent


For the shift operator:
ASM
PUBLIC  ?get_exponent@@YAHH@Z               ; get_exponent
; Function compile flags: /Ogtp
;   COMDAT ?get_exponent@@YAHH@Z
_TEXT   SEGMENT
?get_exponent@@YAHH@Z PROC              ; get_exponent, COMDAT
; _n$ = ecx

; 124  :    int exp = 0;
; 125  :    //while((n = n / 2) > 0) {
; 126  :    while((n = n >> 1) > 0) {

    sar ecx, 1
    xor eax, eax
    test    ecx, ecx
    jle SHORT $LN1@get_expone
$LL2@get_expone:
    sar ecx, 1

; 127  :        exp++;

    inc eax
    test    ecx, ecx
    jg  SHORT $LL2@get_expone
$LN1@get_expone:

; 128  :    }
; 129  :    return exp;
; 130  : }

    ret 0
?get_exponent@@YAHH@Z ENDP              ; get_exponent
Posted
Updated 26-Jan-13 9:59am
Comments
PIEBALDconsult 26-Jan-13 13:26pm    
Could you show the shifting method? If done properly, it should be faster.
Filipe Marques 26-Jan-13 13:35pm    
The shift method is the commented-out line.
PIEBALDconsult 26-Jan-13 14:00pm    
Ah, so you did. I had expected n >>= i :D
As mentioned, it's likely due to optimization; maybe compile without optimization, or don't use a literal 2.
Mike Meinz 26-Jan-13 14:15pm    
Funny that the divide version has two more instructions; I would think that one would take longer. Did you execute each version 10,000 times to get an average timing, or only a few times? If only a few times, it could be that some other task caused the divide version to run longer, and that task wasn't doing much when the shift version ran.

CDQ - Converts Double Word EAX to Quad Word EDX:EAX.
SUB - Subtracts EDX (Source) from EAX (Destination)
SAR - Shift Arithmetic Right

See http://web.itu.edu.tr/~kesgin/mul06/intel/
Filipe Marques 26-Jan-13 15:42pm    
I executed each one 10,000,000 times. Thanks for the link. I hope one day I'll gain the courage to learn a little bit about assembly :D

Since the compiler does far less optimization in debug mode, you should really compare the release versions.

By the way, is n a signed or an unsigned number? At first glance, I think that n is signed and that this is the reason for the extra instructions for the division (sign extension and subtraction). If n were unsigned, then my guess is that the code would be identical.

By the way, in both cases the SAR (Shift Arithmetic Right) instruction is used. This confirms that the number n is signed. I think the extra instructions are required to round odd negative numbers toward zero (by adding 1 before the shift).
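In C-like terms, the CDQ / SUB / SAR sequence implements something like the following (a rough sketch of the idea, not the compiler's literal output; the function name is just for illustration):

C++
// Roughly what the compiler does for "n / 2" when n is a signed int:
int divide_by_two(int n) {
    int sign = (n < 0) ? -1 : 0;   // CDQ: sign-extends n, so EDX holds -1 for negative n, 0 otherwise
    n = n - sign;                  // SUB: adds 1 to negative values so the shift rounds toward zero
    return n >> 1;                 // SAR: arithmetic shift right by one (what MSVC emits on x86)
}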

[ --- Improved answer --- ]

If n can be negative, division might be faster on average simply because it will loop less often in some cases.


For example, using division we have 5 steps starting from -31 before reaching 0:
1: -31 / 2 = -15
2: -15 / 2 = -7
3: -7 / 2 = -3
4: -3 / 2 = -1
5: -1 / 2 = 0


If instead we use >>, we already have 6 steps in that case, and in fact an arithmetic shift never turns -1 into 0 at all (it stays at -1):
1: -31 >> 1 = -16
2: -16 >> 1 = -8
3: -8 >> 1 = -4
4: -4 >> 1 = -2
5: -2 >> 1 = -1
6: -1 >> 1 = -1 (and it stays there)
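A small test program makes the two sequences easy to compare (a sketch for illustration only; right-shifting a negative int is arithmetic on MSVC/x86):

C++
#include <cstdio>

int main() {
    int d = -31;   // divided by 2 each step
    int s = -31;   // arithmetically shifted right each step
    for (int i = 1; i <= 6; ++i) {
        d = d / 2;    // rounds toward zero: -15, -7, -3, -1, 0, 0
        s = s >> 1;   // rounds toward negative infinity: -16, -8, -4, -2, -1, -1
        std::printf("step %d: division = %4d, shift = %4d\n", i, d, s);
    }
    return 0;
}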


For positive or even numbers, we would expect the shift to be a bit faster when n is signed (and exactly the same as division if n were unsigned).

If this is not the case, then it might be due to some obscure reason related to caching, jump prediction, or pipelining.

By the way, if n were made unsigned there might be a gain, and if n is never negative the results would be the same. If n is always a power of 2, then n should really be unsigned.

Finally, if you really want to optimize that function and you expect n to be greater than 255 (for example), then you should split the number into parts.

Something like:
C++
unsigned int n = /* some power of 2 */;
int exp = 0;
if (n > 0xFFFF) { exp = 16;  n = n / 65536; }
if (n > 0xFF)   { exp += 8;  n = n / 256; }
if (n > 0xF)    { exp += 4;  n = n / 16; }
if (n > 0x3)    { exp += 2;  n = n / 4; }
if (n > 0x1)    { exp += 1;  n = n / 2; }


Except for very small n (say less than 32), that code would be faster.

If n is not a power of 2, then the code gives the maximum exponent such that 2^exp is less than or equal to n.
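To sanity-check the snippet, you can wrap it in a function (the name get_exponent_unrolled is just for illustration) and compare it against every power of two:

C++
#include <cassert>

// The unrolled snippet above, wrapped in a function for testing.
int get_exponent_unrolled(unsigned int n) {
    int exp = 0;
    if (n > 0xFFFF) { exp = 16;  n = n / 65536; }
    if (n > 0xFF)   { exp += 8;  n = n / 256; }
    if (n > 0xF)    { exp += 4;  n = n / 16; }
    if (n > 0x3)    { exp += 2;  n = n / 4; }
    if (n > 0x1)    { exp += 1;  n = n / 2; }
    return exp;
}

int main() {
    for (int k = 0; k < 32; ++k) {
        assert(get_exponent_unrolled(1u << k) == k);   // 2^k must map back to k
    }
    return 0;
}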

And still faster:

If you compile in 32 bits under Visual C++, you can use the assembler BSR instruction.

Alternatively, you can use an intrinsic function (_BitScanReverse under Visual C++ 2005 and above).

For other compilers, you might have similar options.
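A sketch of the intrinsic version under Visual C++ (the function name and the handling of n == 0 are arbitrary choices here; GCC and Clang offer __builtin_clz for the same purpose):

C++
#include <intrin.h>
#include <cstdio>

// Returns the index of the highest set bit; for a power of two, that index is the exponent.
int get_exponent_bsr(unsigned long n) {
    unsigned long index;
    if (_BitScanReverse(&index, n) != 0) {   // the intrinsic returns 0 when n == 0 (no bit set)
        return (int)index;
    }
    return 0;   // n == 0 has no set bit; pick whatever result makes sense for your code
}

int main() {
    std::printf("%d\n", get_exponent_bsr(1024));   // prints 10
    return 0;
}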

Conclusion

In any case, you should use an unsigned number if n is always positive, as it will lead to more efficient code. The intrinsic function is the best option if you really need performance.

If you want portable code that should be fast, then my unrolled code above is the best option, particularly if n is expected to be larger than one byte.
 
Comments
Filipe Marques 26-Jan-13 16:05pm    
Yes, 'n' is a signed variable (int). I updated the question; I have now put the ASM code for both situations. The compiler still uses the SAR instruction. The only difference I see now is at the end of the code: with the division operator the compiler emits a MOV instruction, and with the shift operator it doesn't.
Philippe Mori 26-Jan-13 21:18pm    
It seems that you haven't fully understood my point. Since n is signed (and the compiler cannot know for sure that n is always positive), when doing a division it has to take into account the case where n is negative.

For example, if you have -3 (1111...1101) and you shift right (using SAR, thus keeping the sign), you get -2 (1111...1110), which is different from -3 / 2, which gives -1. In one case it is rounded toward negative infinity, and in the other case toward 0.

The compiler is able to fix that by adding 1 to negative numbers (it sign-extends the number into EDX, so EDX is -1 for negative numbers and 0 otherwise). Thus, for the division it effectively does n = (n - (n < 0 ? -1 : 0)) >> 1.

If the number n were unsigned instead, then a division by 2 would be equivalent to a right shift (without keeping the sign bit).

You have to understand that >> on a signed number is not the same as >> on an unsigned number. In the first case, the generated ASM instruction keeps the sign bit untouched (SAR), while in the second case the vacated bit is filled with 0 (SHR).
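For example (a small illustration; the bit patterns assume 32-bit two's complement, and the instructions mentioned are what MSVC emits on x86):

C++
#include <cstdio>

int main() {
    int s = -2;                      // bit pattern 0xFFFFFFFE on a 32-bit two's-complement machine
    unsigned int u = 0xFFFFFFFEu;    // same bit pattern, but unsigned
    std::printf("signed:   -2 >> 1 = %d\n", s >> 1);     // -1: SAR keeps the sign bit
    std::printf("unsigned:  u >> 1 = 0x%X\n", u >> 1);   // 0x7FFFFFFF: SHR shifts a 0 in
    return 0;
}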
Filipe Marques 27-Jan-13 9:36am    
Thanks a lot for the explanation, Philippe. I never realised that using a signed or an unsigned variable could have such an impact. Now I have the same performance in both situations, and the ASM code is the same too.
If the division really is faster, then you may be the victim/beneficiary of some form of auto-parallelization at runtime. Not necessarily because any auto-parallelization has been enabled, but, for example, the extra size and padding in the division case might push you onto 2 cache lines instead of one; changes in pipelining and hyperthreading can have strange corner cases, possibly slowing down the smaller shift case rather than speeding up the larger division case.

Sometimes these changes can be dramatic, especially when magnified by doing the same thing thousands of times in a tight loop. That in itself can cause side effects, because the machine doesn't know it's the same thing over again, so it can forget the previous result or calculation step. Something this small will fit a single Core i7 pipeline several times over, so it might come down to the end being different and causing a pipeline stall. There are even differences between Intel and AMD at this low a level in branch prediction, which can radically impact performance.

In conclusion, your guess is as good as mine; you'd need to ask Intel, assuming it's their chip you're executing on.
 
Optimization: As you may see in the ASM code, the compiler is smart enough to replace the division with the shift operation.
 
Comments
Filipe Marques 26-Jan-13 16:13pm    
I did that because on low-performance CPUs (like microcontrollers), division is an expensive instruction (correct me if I am wrong). I know that compilers nowadays optimize the code. What I expected was that, in the worst case, the division would take the same time as the shift operator (with the compiler transforming the division into a shift). The result was a difference on the order of 30%.
You did not supply the disassemblies for us to look at, so it is impossible to tell what the additional instructions do. Without that, my guess is that the compiler has an optimization phase that optimizes division by a constant 2 by "rewriting" your code. When you use the shift operator, the optimizer probably doesn't do much, if anything, to your code.
 
Comments
Filipe Marques 26-Jan-13 13:37pm    
Hi Mike, I already updated the question with the assembly code.
