Click here to Skip to main content
12,892,596 members (47,842 online)
Click here to Skip to main content
Add your own
alternative version

Stats

5.5K views
90 downloads
15 bookmarked
Posted 14 Apr 2017

Understanding C/C++ Code Behavior by Analyzing Assembler Output

, 23 Apr 2017 CPOL
Rate this:
Please Sign up or sign in to vote.
A discussion on how Visual C/C++ code works with bit-wise logical and shift operators, as well as a loop optimization.

Introduction

The C/C++ standard defines the specification of language elements, such as syntax and grammar, while it doesn't strictly indicate how to implement an operator or a statement. Every software manufacturer is free to use any implementation as long as it fits the C/C++ standard. However, an implementation is closely related to performance of executing a program at runtime. My previous article Something You May Not Know About the Switch Statement in C/C++ [1] gives an example to analyze C/C++ switch statement behavior and efficiency with the assembly output.

This article will discuss some C/C++ operators and code snippets in Visual C++, by analyzing their assembly output generated with Visual Studio in three examples. First I'll show you how to understand conceptual difference between the bit-wise & and the logical && operators. The second is a specific sample for the left bit shift << and the right shift >> operators that let you see how they are implemented differently in Debug and Release build. Thirdly, as a topic of optimization, I show a sequential search test in array to set a benchmark illustrating how efficient the Release version is achieved over its Debug counterpart.

We’ll see what happens behind the Visual C/C++ code. We use Microsoft Visual Studio IDE, since it can generate the corresponding assembly listing on compiling. In order to read this article, a general understanding of Intel x86 assembly instructions and Microsoft MASM specifications [2] is necessary. As you can see shortly, all discussions here are based on a kind of “reverse engineering” and then, the article is never an accurate description of compiler implementations; while it can be an additional study material in learning assembly language programming.

In order to generate an assembler output in Visual Studio, open a .CPP file Property dialog, choose All Configurations, and under left C/C++ dropdown, select the Output Files category. On the right pane, choose the Assembly With Source Code (/FAs) option as shown below for ShiftAndTest.cpp:

   Image 1: the Assembly With Source Code option

When you compile ShiftAndTest.cpp, an assembly file called ShiftAndTest.asm will be generated either in Debug or Release folder. Using this option, the assembly listing includes the original C++ code line numbered for your reference, which is commented by a semi-colon as you’ll see soon.

I use Visual Studio with MASM assembler on the Windows system just for academic convenience for my class [3]. As assembly language programming is not potable, you can find similar tools on other platforms. A powerful online compiler (and dissembler) can be found at gcc.godbolt.org for GCC/Clang with many useful options for either ARM or x86. You definitely can try that to generate assembly listing alike.

Understanding bit-wise and logical operators

We all know the specification of operator & and && generally. Both are binary operators with two operands to perform Boolean AND logic as explained online at Cplusplus.com or Wikipedia,org. Also we understand the difference that the result of operator & is calculated by ANDing each bit (0 or 1) from two integers, while && only takes two operands as false and true represented by zero or non-zero.

Moreover, an operand of either operator & or && can be an expression that must be evaluated beforehand. In such a case, let’s consider the following code snippet for & and && with the same expression operands:

int i =5, j =3, k;
bool b;

k =7;
b = (i-5) & (k=i-j);
printf("Operator &,  b=%d, k=%d\n", b, k);

k =7;
b = (i-5) && (k=i-j);
printf("Operator &&, b=%d, k=%d\n", b, k);

What’s the difference? You may be stuck a while. Here is the output:

   Image 2: the console output for operator AND

It reminds you of the concept of short-circuit evaluation. To see what happens in the operator implementation, I generate the assembler output as below:

; 17   :    int i =5, j =3, k;
    mov    DWORD PTR _i$[ebp], 5
    mov    DWORD PTR _j$[ebp], 3
; 18   :    bool b;
; 19   :    k =7;
    mov    DWORD PTR _k$[ebp], 7

The code creates and initializes i, j, and k. Three local variables are easy to recognize, as denoted by the same name on the stack frame.

From the output, you can see that the & operator makes k=2, i.e., k= i-j. This means that the operator & performs full elevation of both expressions without using short-circuit, even the left i-5 is zero, the right side k=i-j still executed. The following code exposes details:

; 20   :    b = (i-5) & (k=i-j);
    mov    eax, DWORD PTR _i$[ebp]
    sub    eax, DWORD PTR _j$[ebp]
    mov    DWORD PTR _k$[ebp], eax
    mov    ecx, DWORD PTR _i$[ebp]
    sub    ecx, 5
    and    ecx, DWORD PTR _k$[ebp]
    setne    dl
    mov    BYTE PTR _b$[ebp], dl

The code simply does ij in EAX to assign it to k. Then it goes to the second SUB for i5. The AND instruction is used to perform the Boolean bit-wise logic. Since we only need the logical result true or false for variable b, the code checks the zero flag by using SETNE via DL. Interestingly, even when changing the order of two expression, you still get the same assembly code:

b = (k=i-j) & (i-5);

Now let’s take a look at the output of operator && that results in k=7. It means k unchanged without doing i-j. The operator && performs short-circuit evaluation, because the left i-5 is zero, the right side k=i-j skipped. The following gives such an implementation:

; 22   :    k =7;
    mov    DWORD PTR _k$[ebp], 7

; 23   :    b = (i-5) && (k=i-j);
    mov    eax, DWORD PTR _i$[ebp]
    sub    eax, 5
    je     SHORT $LN3@main
    mov    ecx, DWORD PTR _i$[ebp]
    sub    ecx, DWORD PTR _j$[ebp]
    mov    DWORD PTR _k$[ebp], ecx
    je     SHORT $LN3@main
    mov    DWORD PTR tv77[ebp], 1
    jmp    SHORT $LN4@main
$LN3@main:
    mov    DWORD PTR tv77[ebp], 0
$LN4@main:
    mov    dl, BYTE PTR tv77[ebp]
    mov    BYTE PTR _b$[ebp], dl

As && only for logical true or false, no bit-wise AND instruction needed. The code first does i5 and checks the zero flag with JE. If zero found, it jumps to the label $LN3@main to return zero as false. Otherwise, it falls through to the right operand evaluation of k=i-j and checks that second condition for the final result of variable b.

Also an interesting exercise is to change the conditions' order to verify both expressions evaluated in assembly output:

b = (k=i-j) && (i-5);

Now the similar comparison can be made between the bit-wise operator | and logical ||. See this example:

k =7;
b = (i-j) | ++k;
printf("Operator |,  b=%d, k=%d\n", b, k);
k =7;
b = (i-j) || ++k;
printf("Operator ||, b=%d, k=%d\n", b, k);

The console output is

   Image 3: the console output for operator OR

As for the bit-wise |, the code first does ++k unconditionally in EAX and saves it back to the variable k. Then i-j is calculated in ECX for the OR instruction:

; 26   :    k =7;
    mov    DWORD PTR _k$[ebp], 7

; 27   :    b = (i-j) | ++k;
    mov    eax, DWORD PTR _k$[ebp]
    add    eax, 1
    mov    DWORD PTR _k$[ebp], eax
    mov    ecx, DWORD PTR _i$[ebp]
    sub    ecx, DWORD PTR _j$[ebp]
    or     ecx, DWORD PTR _k$[ebp]
    setne    dl
    mov    BYTE PTR _b$[ebp], dl

For the logical ||, the left i-j is first calculated and checked non-zero. Only when JNE is not satisfied, will the right side ++k be calculated. In our test, i-j is non-zero and so ++k is ignored, as short-circuit implemented

; 29   :    k =7;
    mov    DWORD PTR _k$[ebp], 7

; 30   :    b = (i-j) || ++k;
    mov    eax, DWORD PTR _i$[ebp]
    sub    eax, DWORD PTR _j$[ebp]
    jne    SHORT $LN5@main
    mov    ecx, DWORD PTR _k$[ebp]
    add    ecx, 1
    mov    DWORD PTR _k$[ebp], ecx
    jne    SHORT $LN5@main
    mov    DWORD PTR tv128[ebp], 0
    jmp    SHORT $LN6@main
$LN5@main:
    mov    DWORD PTR tv128[ebp], 1
$LN6@main:
    mov    dl, BYTE PTR tv128[ebp]
    mov    BYTE PTR _b$[ebp], dl

In this example, I generated the above assembler output in Debug build. Usually, the code generated in Debug is more readable and understandable, but probably less efficient. The code generated from Release build is optimized to minimize speed or size. They are probably hard to read but more efficient.

Sometimes, the assembly listing generated is not completely to reflect the whole logics to implement high level language code, because some part of the logic could be implemented in preprocessing or elsewhere. A smart compiler may divide the responsibility that can happen in assembly time instead of at precious runtime. As an example, you can read the section of 'Using Binary Search' in the article [1] where I discussed such a scenario by one situation of the switch/case block.

Understanding optimization of bit-wise shift

In this section, I use an example to reverse bytes in a DWORD, 32-bit unsigned integer, by calling a C function named ReverseBytes() as this:

printf("Reverse Bytes:\n");
int n = 0x12345678;
printf("1. 32-bit Hexadecimal: %Xh\n", n);
printf("2. Now Reverse Bytes:  %Xh\n", n =ReverseBytes(n) );
printf("3. Again, Resume now:  %Xh\n", ReverseBytes(n) );

As a byte taking two hexadecimal digits, the result formatted in hexadecimal can be

   Image 4: the console output for Reverse Bytes

One solution is to simply use bit-wise right and left shift operators twice:

unsigned int ReverseBytes(unsigned x) 
{
   x = (x & 0xffff0000) >> 16 | (x & 0x0000ffff) << 16;
   x = (x & 0xff00ff00) >> 8  | (x & 0x00ff00ff) << 8;
   return x;
}

As for our test data 12345678h, the first statement shifting 16 bits makes this:

      x = 00001234h | 56780000h = 56781234h

And the second statement shifting 8 bits makes the result:

      x = 00560012h | 78003400h = 78563412h

The x86 instruction set provides powerful shifting and rotation instructions to implement logics of C bit-wise shift. Let’s take a first look at an assembly code generated from Debug build. To focus on our topic, I simply cut three lines of the C statement implementations without stack frame's prologue and epilogue here:

; 9    :    x = ((x & 0xffff0000) >> 16) | ((x & 0x0000ffff) << 16);
    mov    eax, DWORD PTR _x$[ebp]
    and    eax, -65536               ; ffff0000H
    shr    eax, 16                   ; 00000010H
    mov    ecx, DWORD PTR _x$[ebp]
    and    ecx, 65535                ; 0000ffffH
    shl    ecx, 16                   ; 00000010H
    or     eax, ecx
    mov    DWORD PTR _x$[ebp], eax

; 10   :    x = ((x & 0xff00ff00) >> 8) | ((x & 0x00ff00ff) << 8);
    mov    eax, DWORD PTR _x$[ebp]
    and    eax, -16711936            ; ff00ff00H
    shr    eax, 8
    mov    ecx, DWORD PTR _x$[ebp]
    and    ecx, 16711935             ; 00ff00ffH
    shl    ecx, 8
    or     eax, ecx
    mov    DWORD PTR _x$[ebp], eax

; 11   :    return x;
    mov    eax, DWORD PTR _x$[ebp]

Above two assembly blocks make a perfect mapping to the two C statements in logics. The SHR instruction is right shift for the >> operator and SHL is left shift for <<. Only variable x denoted by memory _x$[ebp] is moved to EAX and ECX to perform different shifting and then ORing them together in EAX that is saved back to _x$[ebp]. The difference of two blocks is just two specified masks for shifting either 16 or 8 bits.

The logic above is so clear and readable, which is helpful to understand how assembly instructions working step by step, since one C language statement corresponds multiple assembly instructions in details. However, the code might not be very efficient. Let’s turn to the assembly code generated from Release build as follows:

; _x$ = ecx
; 9    :    x = (x & 0xffff0000) >> 16 | (x & 0x0000ffff) << 16;
    rol    ecx, 16                      ; 00000010H
    mov    eax, ecx

; 10   :    x = (x & 0xff00ff00) >> 8  | (x & 0x00ff00ff) << 8;
; 11   :    return x;
    mov    edx, ecx
    shr    eax, 8
    shl    edx, 8
    xor    eax, edx
    and    eax, 16711935                ; 00ff00ffH
    shl    ecx, 8
    xor    eax, ecx

Surprisingly, at first, it doesn’t use x as a previous memory argument denoted by EBP. Directly using register ECX to represent x, is much faster. Next, it makes use of the rotation ROL that greatly simplifies the swap of 16 bits to obtain the middle result of 56781234h.

However, in the second block, the code right shifts 8 bits to have 00567812h in EAX and left shifts 8 bits to have 78123400h in EDX. The hard part for understanding is two XOR instructions. As a learning practice, you can trace the code to verify the result. Think of what they actually doing and how finally working out. When getting the result, the value 78563412h in EAX is exactly you expected to return.

Although the second block logic is not easy to understand, the first block 16-bit shift strategy is really great and heuristic. Could we get hint from this concise ROL instruction for an easy swap? Sure, see this:

      12345678h => 12347856h => 78561234h =>78563412h

Wow, this idea is much simpler and more understandable, even better than that of the Release build. I created such a solution, just by three ROL instructions. Try this assuming EAX initialized as 12345678h:

rol ax, 8h
rol eax, 10h
rol ax, 8h

An example of loop optimization in Release build

We have already realized that a smart compiler could do a good job in Release build for optimization. Usually, the Visual C/C++ project setting for Release build can be seen like this:

   Image 5: Release build settings for optimization

How much can be achieved in efficiency from a Debug build over a Release one? To answer this, I created a benchmark test with a simple sequential search to compare the time consumed as this:

const unsigned ARRAY_SIZE = 100000;     /* 100,000 */
const unsigned LOOP_SIZE = 1000000;     /* 1,000,000 */
long array[ARRAY_SIZE];

/* Fill an array with pseudorandom integers */
for(unsigned i = 0; i < ARRAY_SIZE; i++)
  array[i] = rand() % 1000;

long searchVal;
cout << "CPP - Enter an integer value to find: ";
cin >> searchVal;
cout << "Please wait...\n";

/* Test the function, -1 not found */
long index = -1;
time_t startTime, endTime;
time( &startTime );

/* Exhaustive loop test unconditionally */
for( unsigned n = 0; n < LOOP_SIZE; n++)
    index = ArraySearch( searchVal, array, ARRAY_SIZE );

time( &endTime );
cout << "Elapsed time: " << long(endTime - startTime)
     << " seconds. Index = " << index << endl;

In above main, the test will:

  • Fill an array with 100,000 pseudorandom integers ranged from zero to 999 inclusive
  • Let the user enter a search value, for either successful or failed test
  • Set a loop calling ArraySearch() a million (1,000,000) times and show the time span in seconds

I do not call srand() to initialize the generator with different seeds, and so the values of 100,000 pseudorandom integers are the same for each test. This can be verified with a successful search of 100. The Debug build found the index 1423 in 4 seconds as running on my computer:

   Image 6: Debug build found the index 1423 in 4 seconds

The Release build found the same index 1423 in zero second:

   Image 7: Release build found the same index 1423 in zero second

Now let’s search value 1000 purposely making a failure. The Debug build exhausted 251 seconds:

   Image 8: Debug build exhausted 255 seconds

While the Release build still resulted in zero second:

   Image 9: Release build still resulted in zero second

The C function of ArraySearch is quite straight:

extern "C" 
long ArraySearch(long searchVal, long array[], unsigned count) {
   for(unsigned i = 0; i < count; i++) {
      if( array[i] == searchVal )
         return i;
   }
  return -1;
}

Essencially, the compiler translates the above C function in different ways in assembly code. At first, let’s read it from Debug build:

PUBLIC    _ArraySearch
; Function compile flags: /Odtp
 _TEXT    SEGMENT

_i$22332 = -4                      ; size = 4
_searchVal$ = 8                     ; size = 4
_array$ = 12                        ; size = 4
_count$ = 16                        ; size = 4
_ArraySearch PROC
; 9    : {
    push   ebp
    mov    ebp, esp
    push   ecx

; 10   :    for(unsigned i = 0; i < count; i++) {

    mov    DWORD PTR _i$22332[ebp], 0
    jmp    SHORT $LN4@ArraySearc
$LN3@ArraySearc:
    mov    eax, DWORD PTR _i$22332[ebp]
    add    eax, 1
    mov    DWORD PTR _i$22332[ebp], eax
$LN4@ArraySearc:
    mov    ecx, DWORD PTR _i$22332[ebp]
    cmp    ecx, DWORD PTR _count$[ebp]
    jae    SHORT $LN2@ArraySearc

; 11   :       if( array[i] == searchVal )

    mov    edx, DWORD PTR _i$22332[ebp]
    mov    eax, DWORD PTR _array$[ebp]
    mov    ecx, DWORD PTR [eax+edx*4]
    cmp    ecx, DWORD PTR _searchVal$[ebp]
    jne    SHORT $LN1@ArraySearc

; 12   :          return i;

    mov    eax, DWORD PTR _i$22332[ebp]
    jmp    SHORT $LN5@ArraySearc
$LN1@ArraySearc:
; 13   :    }

    jmp    SHORT $LN3@ArraySearc
$LN2@ArraySearc:
; 14   :   return -1;

    or    eax, -1
$LN5@ArraySearc:
; 15   : }
    mov    esp, ebp
    pop    ebp
    ret    0
_ArraySearch ENDP
_TEXT    ENDS

As we can see, this behavior is clear and easy to understand, The Debug assembler output is more loyal to the original logic in C code.

First on the stack frame, we have three parameters, _searchVal$, _array$, and _count$, and one locale variable _i$22338. Under line 10 of the for-loop, _i$22338 is initialized to zero and the code jumps to $LN4@ArraySearc to check the loop condition against _count$. The block below the label $LN3@ArraySearc is to increment _i$22338.

To implement the line 11 of the if clause, the code compares _searchVal$ with the current _array$ element at _i$22338. If not matching, JNE will continue the loop back to $LN3@ArraySearc. If a match found, the code remembers _i$22338 in EAX to return. Under $LN2@ArraySearc is the Line 14, where the loop finishes without a match in failure.

Above descriptions are all the logics simply matching the C code. Obviously, it is not efficient by comparing to the Release build output. Let's see the following Release assembly code:

PUBLIC    _ArraySearch
; Function compile flags: /Ogtp
_TEXT    SEGMENT

searchVal$ = 8                     ; size = 4
_array$ = 12                        ; size = 4
_count$ = 16                        ; size = 4
_ArraySearch PROC 
; 9    : { 
    push    ebp
    mov    ebp, esp
 
; 10   :    for(unsigned i = 0; i < count; i++) {
 
    mov    ecx, DWORD PTR _count$[ebp]
    xor    eax, eax
    push   esi
    test   ecx, ecx
    je     SHORT $LN2@ArraySearc
    mov    edx, DWORD PTR _array$[ebp]
    mov    esi, DWORD PTR _searchVal$[ebp]
$LL4@ArraySearc:
 
; 11   :       if( array[i] == searchVal )
 
    cmp    DWORD PTR [edx+eax*4], esi
    je     SHORT $LN5@ArraySearc
 
; 10   :    for(unsigned i = 0; i < count; i++) {
 
    inc    eax
    cmp    eax, ecx
    jb     SHORT $LL4@ArraySearc
$LN2@ArraySearc:
 
; 12   :          return i;
; 13   :    }
; 14   :   return -1;
 
    or    eax, -1
$LN5@ArraySearc:
    pop    esi 
; 15   : }
 
    pop    ebp
    ret    0
_ArraySearch ENDP
_TEXT    ENDS

Fortunately, it’s not so hard to read, but it's really simplified and improved. At first, we still have three parameters, _searchVal$, _array$, and _count$, yet without a locale variable for i. The line 10 is implemented in two blocks separated by line 11’s logic. The first line 10 block is an initialization before the loop of $LL4@ArraySearc. The loop body consists of line 11 of checking the search value and line 10’s second part of incrementing index and checking the loop condition. Using two registers EAX and ECX, to control the iteration is more efficient, where ECX takes _count$ as a loop upper bound and EAX acts as an array index so without memory use in the loop body.

No doubt, this can be a good example to understand how to simplify a logic in assembly language programming. Frankly speaking, this assembly code improvement could be just a small part of optimization directly from the function ArraySearch(). Actually, the other code like the caller in main() should also make contributions for efficiency.

However, here even when you set LOOP_SIZE to 1,000,000,000, the failed search running by the Release build is still zero second while the Debug build consumes a large amount of time. You may wonder what technique in compiler was actually used behind the scene?

The modern compiler does optimizations much more sophisticated than that a programmer could predict or imagine, unless s/he works with performance profile tools in details. A powerful compiler can make use of advanced technologies in both hardware and software in multiple ways. An optimization also depends on the applications, systems, and environments. In a large C++ development, certainly you can delegate most optimizations to the compiler without your consern.

This article is not talking about how to optimize code in C/C++ manually. However, on the lower-level like embedded systems, basic code optimization could be a necessary consideration. Here, from the compiled assembly output, we can learn some ideas and skills for our education purpose. As you already see, depending on the specific code, we get quite different results to understand and compare the C/C++ code behaviors.

Summary

I talked about miscellaneous features in assembly language implementations interfaced with C operators, loops, and functions. To understand these examples actually require a lot of your background knowledge with x86 instructions and MASM specifications. By scrutinizing assembler code listing, I exposed some interesting C/C++ code behavior at runtime. To analyze a C/C++ program in Visual Studio, we can have either static assembly listings generated by the compiler or dynamic executions with the VS debug Disassembler. Here in this article, I use the assembly listing, since I have to compare C/C++ code behavior in both Debug and Release assembler outputs. The downloadable zip file contains two projects, ShiftAndTest and ArraySearch for all examples, including the .CPP source code and .ASM listings in VS 2010. The same listings can be generated by yourself with any recent VS IDE.

As we know, understanding correspondence between the high level language and the assembly language is an important topic in our classes to learn the lower level implementations based on one-to-many relationship. This is why we have to peek and detect C/C++ code behavior by digging deeper into assembler outputs for our education purpose. Assembly language is notable for its one-to-one correspondence between an instruction and its machine code. Via assembly code, you can get closer to the heart of the machine, investigating the instruction execution and performance. Assembly language programming plays an important role in both academic studies and industry developments. I hope this article could serve as some examples for student learning assemble programming and perhaps for developers as well. Any comments and suggestions are welcome.

References

History

  • April 23, 2017 -- Code optimization explained
  • April 14, 2017 -- Original version posted

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

Share

About the Author

Zuoliu Ding
Software Developer
United States United States
An Adjunct Faculty and Software Developer in Los Angeles and Orange County, CA

* Typical articles published in Dr. Dobb’s Journal and Windows Developer Magazine:

- A Silent Component Update for Internet Explorer
- Silent Application Update
- An MDI-Style Web Browser and Load Spy Monitor
- Implementing Wireless Print for WinNT/Win2K
- Multi-State Checkbox Tree Views
- A Generic Tool Tip Class
- An Easy Way to Add Tool Tips to Any MFC Control

- More from Google...

You may also be interested in...

Comments and Discussions

 
QuestionPremature optimization Pin
Stefan_Lang19-Apr-17 20:56
memberStefan_Lang19-Apr-17 20:56 
AnswerRe: Premature optimization Pin
Zuoliu Ding20-Apr-17 13:36
memberZuoliu Ding20-Apr-17 13:36 
QuestionEffect of Optimization setting Pin
Stefan_Lang19-Apr-17 19:55
memberStefan_Lang19-Apr-17 19:55 
QuestionGreat Article Pin
afigegoznaet17-Apr-17 2:36
professionalafigegoznaet17-Apr-17 2:36 
AnswerRe: Great Article Pin
Zuoliu Ding17-Apr-17 7:10
memberZuoliu Ding17-Apr-17 7:10 

General General    News News    Suggestion Suggestion    Question Question    Bug Bug    Answer Answer    Joke Joke    Praise Praise    Rant Rant    Admin Admin   

Use Ctrl+Left/Right to switch messages, Ctrl+Up/Down to switch threads, Ctrl+Shift+Left/Right to switch pages.

Permalink | Advertise | Privacy | Terms of Use | Mobile
Web02 | 2.8.170424.1 | Last Updated 23 Apr 2017
Article Copyright 2017 by Zuoliu Ding
Everything else Copyright © CodeProject, 1999-2017
Layout: fixed | fluid