
C++ Reverse Disassembly

, 25 Aug 2004
This article's aim is to provide material for modern day decompiling of an application written in C++

Technical Detail

This article aims to provide material for modern-day decompiling of applications written in C++. We assume you have a solid understanding of C++, x86 assembly, and Windows.

Overview and contents

  1. Why is C++ Decompiling possible?
    1. Intro
    2. Modern Day Example
    3. Compiler Specific
  2. C++ Protocols
    1. Intro
    2. Global Variables
    3. Expressions
    4. Return Values
    5. Function calls and the stack
    6. Local Variables
  3. C++ Keywords
    1. Intro
    2. If statement
    3. For Loop
    4. Structures
    5. Technical Algorithms
  4. Practical Decompiling
    1. Intro to Decompiling Windows application
    2. Decompiling a sample application

Special Case: Compiler Specific

Compiler Specific:

Every compiler is different. Their CrtlStartUp routines, the code they emit for statements (switch, if, while), and numerous other details vary, so even if you compile the same C++ code with two compilers, the end result will differ. Because of this I will stick with one and only one compiler: the Visual C++ compiler.

Visual C++ is produced by Microsoft and generates fast, well-optimized code. That is not to say the information in this book applies only to Visual C++; I am just saying that some of it may only work with Visual C++.

If you don’t have Visual C++, that is fine. There are many other compilers available, and most of this information is accurate for them too.

Chapter 1: Why is C++ Decompiling possible?

1.1 Intro:

I have been asked many times whether C++ decompiling is even possible, not only because of the complexity of a compiler but because of the massive amount of information lost in compiling: comments, include files, and macros, just to name a few. So one often wonders whether this is even worth pursuing. I want to start with what is totally lost when you compile a program and what stays; refer to table 1.1.1 to see what is lost and what remains.

Table 1.1.1

What is lost     What remains
-------------    ---------------------
Templates        Function calls
Classes          Dynamic linking calls
Macros           Switch statements
Include files    Local variables
Comments         Parameters

That is not to say everything in the “What remains” column is 100% there; it just means those things are relatively simple and practical to reverse engineer. Because of this I chose to deal with the “What remains” column first, because it’s much easier.

As we progress through this book, keep in mind that reverse engineering is rarely easy and takes lots of practice. It’s harder to reverse engineer something than to create it in the first place.

A good way to start out with reverse engineering is to decompile your own programs and see how each C++ construct works, then apply that knowledge in other areas, because looking at thousands of lines of assembly code is not really fun.

1.2 Modern Day Examples:

Now, while you’re reading this book you might start to think that “anything translated into a different language can be translated back into the same language”, right? Well, this is not the case in reverse engineering: a lot of things will be lost, and a lot of things you must make up (assume) along the way.

So I wanted to make sure I provided some practical examples of reverse engineering at the beginning of the book, to give you a sense of hope.

To begin reverse engineering, I decided to start with the main C++ function

int main(int argc, char * argv[])

Now, we can easily find this function in any executable file thanks to the PE format, which tells us where the executable starts: we can simply read the PE header of a specific executable and get its start address. Or can we?

This is where the C runtime library (CRTL) comes in. When you compile a C++ program, most compilers (this is compiler-specific stuff) produce an executable that runs in the following order

  1. CrtlStartUp();
  2. int main(int argc, char * argv[])
  3. CrtlCleanUp();

This means we can’t look into the PE file and get the start of our code; we can only get the start of the CrtlStartUp() code. We have two choices: reverse engineer the CrtlStartUp code or skip over it. I like the latter, and we will deal with the C runtime library later.

Chapter 2: C++ Protocols:

2.1 Intro

One of the main reasons C++ is so well suited to this is that compilers follow strict protocols in the assembly they generate. Some of these patterns are very static: a returned integer or pointer value is always put in the EAX register, and function calls nearly always use the stack. Because of this, reverse engineers can attack these static patterns and get a head start.

The first thing we should deal with is global variables, because if you’re coming from a high-level language you might have some misconceptions.

2.2 Global Variables

You know how many books say that memory is laid out unpredictably? Well, this is true for the most part, but the memory allocated for your application’s global variables is quite static. That’s right: each time you run your program, your statically allocated variables end up at the same place within the loaded image.

Another interesting fact is that a pointer variable doesn’t hold the data itself; it holds the address where the data is stored.

Here is a C++ Example:

#include "stdafx.h"
#include "windows.h"
 
char * globalvar = "Whats Up";
int APIENTRY WinMain(HINSTANCE hInstance,
HINSTANCE hPrevInstance,
LPSTR lpCmdLine,
int nCmdShow)
{
// TODO: Place code here.
globalvar = (char *)0x400000;
return 0;
}

Here is an in-depth look at the disassembly

00405030: global_var dd 405034h
00405034: global_var_value db 'Whats Up',0
mov global_var,400000h

OK, this shows that the pointer variable does not hold the string itself: as you can see, the compiler automatically initializes our global_var pointer to the address of global_var_value.

So far we know that a pointer variable just holds the address of a value, so we can change where the variable points, right? Yes we can, with mov global_var, 400000h. From then on, whenever the program accesses global_var, it will look at the value stored at 405030h and come up with 400000h.

If you’re confused, remember that global_var is stored at 405030h, and refer to picture 2.

Sample screenshot

This picture is pretty self-explanatory, and if you’re still confused about how everything works then I suggest you get a good assembly book and learn what indirect addressing is.

We have just dealt with a pointer variable; let’s now deal with a plain variable, which is much simpler.

#include "stdafx.h"
#include "windows.h"
 
char globalvar[] = "Whats Up";
 
int APIENTRY WinMain(HINSTANCE hInstance,
                     HINSTANCE hPrevInstance,
                     LPSTR lpCmdLine,
                     int nCmdShow)
{
    globalvar[0] = 'A';
    globalvar[5] = 'U';
    return 0;
}

Which when compiled becomes

00405030 global_var db 'Whats Up',0
 
mov byte ptr global_var, 'A'
mov byte ptr global_var + 5, 'U'

We instantly see that regular variables are a lot simpler than pointer variables: all we have to do is refer to the address in memory that holds our data. Of course, in machine code we can’t see pretty names like global_var, so here is a pure disassembly

00405030 'Whats Up',0
mov byte ptr [00405030], 'A'
mov byte ptr [00405035], 'U'

As you can see, we aren’t doing anything special, just modifying the values stored at 00405030 and 00405035.
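
To see the same difference from the C++ side, here is a minimal sketch. The names global_ptr, global_arr, ptr_storage and arr_storage are made up for illustration and do not appear in the original program. It only compares how much storage each kind of variable occupies: the pointer takes one address-sized slot, while the array takes the nine bytes of the string itself.

```cpp
#include <cstddef>

// Pointer variable: its storage holds an address
// (compare "global_var dd 405034h" in the listing).
const char* global_ptr = "Whats Up";

// Array variable: its storage holds the characters themselves
// (compare "global_var db 'Whats Up',0" in the listing).
char global_arr[] = "Whats Up";

// Storage occupied by each variable, known at compile time.
constexpr std::size_t ptr_storage = sizeof(global_ptr);  // size of one address
constexpr std::size_t arr_storage = sizeof(global_arr);  // 8 characters + terminating 0
```

On a 32-bit Visual C++ build ptr_storage would be 4, which matches the dd (double word) in the disassembly.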

You should have variables and pointer variables down pat by now, since this information will not be explained again. If there is something you don’t understand, read it over.

2.3 Expressions

OK, as we all know, C++ has a near-English syntax in which we can program. x86 assembly doesn’t. For example, take a look at the following statement

int s = 3 + 4 + 1 + 5 + 9;

How can we calculate something like this in assembly? Simple: look at the following C++ example

#include "stdafx.h"
#include "windows.h"
 
int s1 = 3;
int s2 = 4;
int s3 = 1;
int s4 = 1;
 
int APIENTRY WinMain(HINSTANCE hInstance,
                     HINSTANCE hPrevInstance,
                     LPSTR lpCmdLine,
                     int nCmdShow)
{
    // TODO: Place code here.
    s1 = s2 + s3 - s4 + 34;
    return 0;
}

Which when compiled becomes

00405030 s1 dd 3 
00405034 s2 dd 4 
00405038 s3 dd 1 
0040503C s4 dd 1 
 
         mov eax, s2
00401008 add eax, s3
0040100E sub eax, s4
00401014 add eax, 34
00401017 mov s1, eax

OK the compiler optimizes the code a little bit, but it’s still very easy to understand.

  • The first thing the compiler does is load eax with the value of s2, using mov eax, s2.
  • Now eax holds 4. Next we add s3 to eax.
  • Now eax holds 5. After that we subtract s4 from eax.
  • Now eax holds 4. After that we add 34 to eax.
  • Now eax holds 38.
  • We finish by moving eax into s1, so s1 now holds 38.
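
The register steps above can be traced in C++ with one local variable standing in for eax. This is only a sketch: evaluate_like_the_listing is a hypothetical helper name, and each line is annotated with the instruction it imitates.

```cpp
// Hypothetical helper: mirrors the listing one instruction per line,
// with a local variable playing the role of the eax register.
constexpr int evaluate_like_the_listing(int s2, int s3, int s4) {
    int eax = s2;   // mov eax, s2
    eax += s3;      // add eax, s3
    eax -= s4;      // sub eax, s4
    eax += 34;      // add eax, 34
    return eax;     // mov s1, eax
}
```

Calling evaluate_like_the_listing(4, 1, 1) follows the same path as the walkthrough: 4, then 5, then 4, then 38.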

You will often see the compiler use registers instead of variables in expression because registers are faster.

From this we can conclude that the compiler maps each mathematical operator to a specific x86 instruction. Here is a table

C++ Operator          X86 Instruction
------------------    ----------------------------------------------
* (Multiplication)    mul (imul for signed, fmul for floating point)
/ (Division)          div (idiv for signed, fdiv for floating point)
- (Subtraction)       sub
+ (Addition)          add

As you can see, we can easily decipher most statements in C++ using the table above.

For a test we will look at a sample disassembly dump and decompile it by hand to C++.

00000000 2
00000001 3
00000002 4
00000003 0
00000004 1
00000005 mov al, [00000000]
         add al, [00000001]
         mov ch, [00000002]
         mul ch
         mov [00000003], ax

OK, the first thing we do is try to figure out what types of variables are being used.

From what we can see, the code uses al and ch, which are 8-bit registers. Whenever memory is accessed through an 8-bit register, the variable is most likely a char.

Further down you see a “mov [00000003], ax”, and since ax is a 16-bit register, that variable’s type is short int.

Here is a small table, so you can map registers to variable types

X86 Registers               C++ Variable Type
------------------------    -----------------
8-bit registers (AL, AH)    char
16-bit registers (AX)       short int
32-bit registers (EAX)      int

So far we see 4 references to memory addresses, so we know we have 4 variables. The first one, [00000000], is obviously a char variable, since we see

mov al, [00000000]

and al is an 8-bit register.

So let’s give [00000000] the name s1. We also see that [00000000] through [00000002] are all accessed through 8-bit registers, meaning they are char variables too, and the last one, [00000003], which is used like “mov [00000003], ax”, is a short int since ax is 16 bits.

OK, let’s create another table, one which will hold variable names (aliases) for the addresses.

Although we can never get the original variable names back, we can always create our own.

Addresses    Variable names/aliases    Variable type
---------    ----------------------    -------------
0000:0000    s1                        char
0000:0001    s2                        char
0000:0002    s3                        char
0000:0003    s4                        short int

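You might wonder how a 16-bit value is laid out across 00000003 and 00000004. x86 is a little-endian machine: the least significant byte is stored at the lower address. A short int holding 1 would therefore appear as 01 at 00000003 and 00 at 00000004; the byte order shown in the dump above is the reverse of that and would read back as 0100h. Either way, the initial value is overwritten by the code, so it does not affect the decompilation.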
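
Byte order is easy to get backwards, so here is a small sketch of how a 16-bit value maps onto two consecutive byte addresses on a little-endian machine. TwoBytes, store_le16 and load_le16 are hypothetical names used only for this illustration.

```cpp
#include <cstdint>

// Two consecutive memory bytes, named by their relative addresses.
struct TwoBytes {
    std::uint8_t low_address;   // e.g. the byte at 00000003
    std::uint8_t high_address;  // e.g. the byte at 00000004
};

// Little-endian store: least significant byte goes to the lower address.
constexpr TwoBytes store_le16(std::uint16_t value) {
    return TwoBytes{ static_cast<std::uint8_t>(value & 0xFF),
                     static_cast<std::uint8_t>(value >> 8) };
}

// Little-endian load: the byte at the higher address is the high half.
constexpr std::uint16_t load_le16(TwoBytes b) {
    return static_cast<std::uint16_t>(b.low_address | (b.high_address << 8));
}
```

With these helpers, storing 1 produces the bytes 01 00, and loading the bytes 00 01 produces 0100h (256).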

Now the next thing we should do is rewrite the code above with the aliases we created

s1 db 2
s2 db 3
s3 db 4
s4 dw 1
mov al, s1
add al, s2
mov ch, s3
mul ch
mov s4, ax

Now, the first thing we do is “mov al, s1”.

OK, al now holds a value of 2. The next thing we do is “add al, s2”.

Now al has a value of 5, since s2 had a value of 3 in it. The next thing we do is “mov ch, s3”.

Now ch has a value of 4. After that we “mul ch”, so ax now has the value of al * ch.

And since al had a value of 5 in it and ch had a value of 4 in it, ax has the value of 20.

OK, we can start to decipher the C++ expression. The addition happens first and its result is then multiplied, so it is

(s1 + s2) * s3

After that we see “mov s4, ax”, so the complete C++ statement is

s4 = (s1 + s2) * s3;
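
Tracing the registers in C++ makes the grouping visible: the addition happens before the multiplication, so the result is 20, whereas evaluating s1 + s2 * s3 with normal C++ precedence gives 14. A minimal sketch, with mul_listing as a made-up helper name and fixed-width types standing in for the 8-bit and 16-bit registers:

```cpp
#include <cstdint>

// Hypothetical helper that imitates the listing instruction by instruction.
constexpr std::uint16_t mul_listing(std::uint8_t s1, std::uint8_t s2, std::uint8_t s3) {
    std::uint8_t al = s1;                           // mov al, s1
    al = static_cast<std::uint8_t>(al + s2);        // add al, s2 (wraps at 8 bits)
    std::uint8_t ch = s3;                           // mov ch, s3
    std::uint16_t ax =
        static_cast<std::uint16_t>(al * ch);        // mul ch: ax = al * ch
    return ax;                                      // mov s4, ax
}
```

For the values in the dump, mul_listing(2, 3, 4) gives 20, the same as (2 + 3) * 4.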

As you can see, we just went through a whole bunch of mess to come up with a simple C++ statement, and this only works for global variables, not local variables or structure members. So things will only get harder. Because of this I suggest you read carefully, and if you don’t understand something, read it over and over until you do.

2.4 Return Values

One of the fundamentals of C++ is returning values from a function call. This is actually a very simple procedure, because it simply involves placing the value into the eax register.

So when you have a statement like this

c = (char *) malloc (0xFF);

The first thing the compiler does is call malloc, and then it stores eax in c with “mov c, eax”.

For example, if you have a statement that returns 5, what you are really saying is

__asm
{
    Mov eax, 5
    Ret
}

Let’s have a little practice with a full disassembly dump

Mov eax,5
Add eax,2
Sub eax,1
Ret

And the C++ equivalent is

return 5 + 2 - 1;

This although simple is one of the most important concepts a C++ reverse engineer can learn.

2.5 Function Calls and the Stack

Now it’s time to get to the blood and guts of C++: function calls.

Function calls are fairly simple for the most part, because to an assembly programmer a function is just a label. For example,

int func () {return 1 ;}
func();

Would compile into

Func:
Mov eax, 1
Ret
Call Func

From this we can conclude one thing:

Function names are like variable names: each is just a reference to some address, which is the same thing as a label.

Here is a full disassembly dump for practice

0000:0000 0
0000:0001 0
0000:0002 0
0000:0003 0
 
0000:0005 mov eax,1
0000:0009 ret
 
0000:0010 call 0000:0005 ‘code starts here
0000:0015 mov [0000:0000],eax

OK, the first thing we see is that at address 0000:0015 the value of a 32-bit register is stored to a memory address, which means we have a 32-bit variable at hand, or an int variable to be more exact.

So let’s create an alias for the addresses 0000:0000 through 0000:0003, which will be s1.

Now let’s create a new disassembly with this added information

S1 dd 0
 
0000:0005 mov eax,1
0000:0009 ret
 
0000:0010 call 0000:0005 ‘code starts here
0000:0015 mov S1,eax

OK, the second thing we see is that the code starts at 0000:0010 and the first instruction is call 0000:0005.

Following the call to 0000:0005, we can see that the code there moves a value into eax and then returns. Back at address 0000:0015, the code moves eax into s1.

So we can now reverse engineer this whole program back into C++

int s1 = 0; // dd 0
int some_function()
{
    return 1; // mov eax, 1 : ret
}
int main()
{
    s1 = some_function(); // call 0000:0005 : mov s1, eax
    return 0;
}

Now what do we do when functions have parameters? Things get more complicated, because the compiler uses the stack to pass parameters.

It pushes the parameters right to left, meaning the last parameter goes in first and the first parameter goes in last.

For example, the C++ function call

func (1, 2);

Would compile into

Push 2
Push 1
Call func

Now let’s imagine a stack whose top is at address 32, so ESP = 32 before the call. A push first subtracts 4 from ESP and then stores the value at the new ESP. With that in mind, look at the table below

X86 Instruction    Stored at    ESP afterwards
---------------    ---------    --------------
push 2             [28]         ESP = 28
push 1             [24]         ESP = 24
call func          [20]         ESP = 20
push ebp           [16]         ESP = 16

Remember that when you issue a call instruction on an x86 machine, the processor pushes the address of the next instruction onto the stack so it knows where to return to. The last row, push ebp, is the first instruction of a normal function prologue; it is left out of the simplified function below, so inside that function ESP = 20.

Now that the parameters are on the stack, let’s look at the function itself

int func (int a, int b)
{
    return a + b;
}

  • The first thing the compiler does is “mov eax, [ESP + 4]”: ESP equals 20 and points at the return address, so the first parameter is stored at [24].
  • The second thing the compiler does is “add eax, [ESP + 8]”, since the second parameter is stored at [28].
  • The last thing the compiler does is ret.

So the full compilation would be

func:
  mov eax, [ESP + 4]
  add eax, [ESP + 8]
  ret

A neat little reverse engineering tip: since every stack slot is 4 bytes wide, once the function has set up a frame with push ebp and mov ebp, esp you can easily tell which parameter is being accessed.

[EBP] = Saved EBP
 
[EBP + 4] = Return address
 
[EBP + 8] = First parameter
 
[EBP + 12] = Second parameter
 
[EBP + 16] = Third parameter
 
[EBP + 20] = Fourth parameter

And so on….
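
To check these offsets without a debugger, here is a small sketch that simulates the stack for the call func(1, 2). Everything in it is hypothetical: Machine, push, load and enter_func are illustration names, and the return address 0x1234 and the caller's ebp 0xAAAA are made-up values. It assumes the real x86 rule that push subtracts 4 from ESP before storing.

```cpp
#include <cstdint>

// Hypothetical toy machine: 64 bytes of stack memory plus two registers.
struct Machine {
    std::uint32_t mem[16];  // stack memory, one element per 4-byte slot
    std::uint32_t esp;      // stack pointer (byte address)
    std::uint32_t ebp;      // frame pointer (byte address)
};

// push: subtract 4 from ESP, then store the value at the new ESP.
constexpr void push(Machine& m, std::uint32_t value) {
    m.esp -= 4;
    m.mem[m.esp / 4] = value;
}

// Read the 4-byte slot at a byte address.
constexpr std::uint32_t load(const Machine& m, std::uint32_t address) {
    return m.mem[address / 4];
}

// State of the machine just after func(1, 2) has set up its frame.
constexpr Machine enter_func() {
    Machine m{};
    m.esp = 32;          // imaginary starting stack pointer
    m.ebp = 0xAAAA;      // caller's frame pointer (made-up value)
    push(m, 2);          // push 2     (second parameter)
    push(m, 1);          // push 1     (first parameter)
    push(m, 0x1234);     // call func  (made-up return address)
    push(m, m.ebp);      // push ebp
    m.ebp = m.esp;       // mov ebp, esp
    return m;
}
```

Reading load(m, m.ebp + 8) on the returned machine gives 1 and load(m, m.ebp + 12) gives 2, which is the first/second parameter layout listed above.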

2.6 Local Variables

We just learned that parameters are stored on the stack. Now it’s time to learn about local variables, which are also stored on the stack but in a quite different way.

Here is an example

int func ()
{
    int a = 5;
    return a;
}

OK, to compile this code, the compiler must first reserve space on the stack with “sub ESP, 4”, since 4 bytes is the size of an int variable. Of course the compiler must first back up the esp register, and it does this with “mov ebp, esp”. But wait, before that the compiler must back up ebp, and it does this with “push ebp”. So the very first thing the compiler does is

; Setting up the stack frame
push ebp     ; back up ebp
mov ebp, esp ; back up ESP in ebp
sub esp, 4   ; reserve some space on the stack

Note: in unoptimized builds the compiler emits code like “Setting up the stack frame” in every function, whether or not you use local variables, and it uses ebp to reference parameters and local variables. Optimized builds may omit the frame.

In the “Function Calls and the Stack” section I used esp to reference parameters and left out the “Setting up the stack frame” code for clarity’s sake.

Now the second thing the compiler does is

mov dword ptr [ebp - 4], 5
mov eax, [ebp - 4]

If we had a second local variable, the compiler could simply use

mov dword ptr [ebp - 8], 5, and of course it would use sub ESP, 8 instead of sub ESP, 4.

The last thing the compiler does is restore the stack frame and return

; Cleaning up the stack frame
Mov ESP, ebp; restore stack pointer
Pop ebp; restore ebp
Ret

Note: in unoptimized builds the compiler emits the “Cleaning up the stack frame” code in every function, so we can detect functions by looking for code like it. I also skipped this in the “Function Calls and the Stack” section for clarity’s sake.
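
The whole life of the stack frame, from setup through the local variable to cleanup, can be simulated in a few lines. This is a sketch on a made-up toy machine: Machine, push, pop and run_func are hypothetical names, and 0x1234 and 0xAAAA are placeholder values for the return address and the caller's ebp.

```cpp
#include <cstdint>

// Hypothetical toy machine: 64 bytes of stack memory plus three registers.
struct Machine {
    std::uint32_t mem[16];  // stack memory, one element per 4-byte slot
    std::uint32_t esp;
    std::uint32_t ebp;
    std::uint32_t eax;
};

constexpr void push(Machine& m, std::uint32_t value) {
    m.esp -= 4;
    m.mem[m.esp / 4] = value;
}

constexpr std::uint32_t pop(Machine& m) {
    std::uint32_t value = m.mem[m.esp / 4];
    m.esp += 4;
    return value;
}

// Runs "int func() { int a = 5; return a; }" up to, but not including, ret.
constexpr Machine run_func() {
    Machine m{};
    m.esp = 32;
    m.ebp = 0xAAAA;                  // caller's frame pointer (made up)
    push(m, 0x1234);                 // call func: return address (made up)
    push(m, m.ebp);                  // push ebp
    m.ebp = m.esp;                   // mov ebp, esp
    m.esp -= 4;                      // sub esp, 4
    m.mem[(m.ebp - 4) / 4] = 5;      // mov [ebp - 4], 5
    m.eax = m.mem[(m.ebp - 4) / 4];  // mov eax, [ebp - 4]
    m.esp = m.ebp;                   // mov esp, ebp
    m.ebp = pop(m);                  // pop ebp
    return m;
}
```

After the cleanup code, eax holds the return value 5, ebp is back to the caller's value, and ESP points at the return address again, which is exactly what ret needs.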

Here is a full disassembly dump, for practice

0000:0000 0
0000:0004 push ebp
0000:0005 mov ebp,esp
0000:0007 sub esp, 8
0000:0010 mov [ebp - 4], 5
0000:0015 add [ebp - 4], [ebp + 8]
0000:0016 mov eax,[ebp - 4]
0000:0018 mov esp, ebp
0000:0020 pop ebp
0000:0021 ret
0000:0022 push ebp
0000:0023 mov ebp,esp
0000:0025 add [ebp + 8] , [0000:0000]
0000:0030 add [ebp + 8] , [ebp + 12]
0000:0031 mov eax,[ebp +8]
0000:0032 mov esp, ebp
0000:0035 pop ebp
0000:0036 ret
0000:0037 push ebp ;code start
0000:0038     mov ebp, esp
0000:0040     push 1
0000:0044     call 0000:0004
0000:0049     mov [0000:0000],eax
0000:0050     push 4
0000:0051     push 3
0000:0052     call 0000:0022
0000:0056     add [0000:0000],eax
0000:0058     mov esp, ebp
0000:0059     pop ebp

OK, the first thing we see is that memory address [0000:0000] is accessed through eax a lot, meaning we have a 32-bit variable, which is an int. The next thing we notice is that the stack frame is set up 3 times and cleaned up 3 times, which means we have 3 functions (and yes, int main(…) also sets up the stack frame and cleans it up).

So we have

Func1 ()
Func2 ()
Main ()

Next we see that Func1 is at address 0000:0004 and accepts one 32-bit parameter,

because at address 0000:0040 we push 1 onto the stack and then at address 0000:0044 we call Func1, so we can set up Func1’s declaration

0000:0004 Func1 (int a)

Now whenever Func1 does anything to [ebp + 8], we know that it is doing something to its first parameter. Looking into Func1’s code, we see that it has 1 local variable, because it references [ebp - 4].

Now let’s take a look at address 0000:0049, which is mov [0000:0000], eax, so we know that the original C++ code is something like

[0000:0000] = func1 (1);

Next we see at address 0000:0050 that we are pushing 4 onto the stack, after that we are pushing 3 onto the stack, and then we call 0000:0022.

Now we can set up Func2’s declaration

0000:0022 Func2(int a, int b)

At address 0000:0056 we see add [0000:0000], eax, which means the original C++ code is something like

[0000:0000] += Func2(3,4)

Remember we pushed 4 onto the stack first and 3 onto the stack second, because parameters are passed right to left.

Now that we have a lot of information, let’s make a new disassembly, one with aliases for all local variables and parameters in Func1 and Func2, since we know that whenever the code uses [ebp + …] it’s a parameter, and when it uses [ebp - …] it’s a local variable.

0000:0000 s1 dd 0
0000:0004 func1(int param_1): push ebp
{ local : local_var_1}

OK, I know, I made up a little assembly syntax, such as func1(int param_1) and

{Local : local_var_1 }

This is for clarity’s sake, that’s all.

Now let’s start with func1. At address 0000:0010 we see that it is moving 5 into local_var_1, which in C++ is

int local_var_1 = 5;

Next we see add local_var_1, param_1, which in C++ is

local_var_1 += param_1;

The last thing we see before we clean up the stack is mov eax, local_var_1, which in C++ is

return local_var_1;

So the full reverse engineered function is

int func1(int param_1)
{
    int local_var_1 = 5;
    local_var_1 += param_1;
    return local_var_1;
}

Now let’s go to func2. At address 0000:0025 we see add param_1, s1, which in C++ is

param_1 += s1;

After that we see add param_1, param_2, which in C++ is

param_1 += param_2;

The last thing we see before we clean up the stack is mov eax, param_1, which in C++ is

return param_1;

So the full reverse engineered function is

int func2(int param_1, int param_2)
{
    param_1 += s1;
    param_1 += param_2;
    return param_1;
}

Now we are able to reverse engineer the whole program

int s1 = 0;
int func1(int param_1)
{
    int local_var_1 = 5;
    local_var_1 += param_1;
    return local_var_1;
}
 
int func2(int param_1, int param_2)
{
    param_1 += s1;
    param_1 += param_2;
    return param_1;
}
 
int main()
{
    s1 = func1(1);
    s1 += func2(3,4);
}

This chapter might be a little hard to comprehend at first, since I presented a lot of “straight to the point” information. Again, if you don’t understand something, read it over, and if you still don’t understand, email vbmew@hotmail.com with your question.

Chapter 3: C++ Keywords

3.1 Intro:

What we have been doing so far is the easy stuff. It’s time to deal with C++ keywords, complex expressions, and some practical real-world examples.

3.2 If Statement

One of the statements people use most is the if statement, which logically compares values. Using this statement we can choose which path of execution our program should take.

An if statement can be very simple or very, very complex.

Take a look at the following examples.

if (i == 0)
    //do function
//continue

Now what if we had something like this

if (i == 0)
{
    int i2 = 0;
}
i2 = 3; //error: can't access i2 because it's not in your scope,
        // it's in the if statement's scope

Because of this we know that the compiler generates a stack frame for each if statement with braces, right? Wrong!

In reality i2 lives in main’s stack frame, but the compiler keeps it hidden. The reason I’m telling you this is that to reverse engineer if statements you must completely understand them.

The second example is

if ( (i == 0) || ( (i2 == 1) && (i3 == 2) ) )

The logic for this is: if i equals 0, or if i2 equals 1 and i3 equals 2.

Another example would be

if ( (c = (char *) malloc(0xFF) ) == NULL)

This is saying c = malloc(0xFF), and if malloc returns NULL this condition is true.

Yet another example is

if (malloc(0xFF)) //this is saying call malloc(0xFF) and if it returns 
                  //anything not equal to 0 then this condition is true

The last but not least example is

if (!malloc(0xFF)) //this is saying call malloc(0xFF) and if the returned 
                   // value is equal to zero then this condition is true

Thankfully, all these if statements can be reverse engineered back into (almost) exactly the way they were.

Now, the if statement maps directly to the x86 instruction cmp. With this in mind, take a look at the following C++ program

int main()
{
    int i = 0;
    if (i == 34)
        i += 23;
    return 1;
}

This compiles into the following

push ebp
mov ebp,esp ;setup the stack frame
sub esp, 4
 
mov dword ptr [ebp - 4], 0
cmp dword ptr [ebp - 4], 34
jnz continue_program
add dword ptr [ebp - 4], 23
 
continue_program:
 
mov eax,1
 
mov esp, ebp ;restore the stack frame
pop ebp
ret

Yes, I know, I decided to give you a complete disassembly to see if you remember the stack frame, and that [ebp - 4] means the first local variable created. And yes, int main has to set up the stack frame like every other function.

Now let’s learn how to turn this program back into C++

The first thing we do is look at mov [ebp - 4], 0, which tells us that the program is initializing a variable to 0.

Next we see a cmp instruction that compares [ebp - 4] with 34, so we know the program is using an if statement, you know, “if [ebp - 4] == 34”. What we should do now is create an alias for [ebp - 4]; we will use local_var_1. Next we see the instruction jnz, which is the same as jne, and which says: if [ebp - 4] (local_var_1) is not 34, then skip over the body of this if statement and jump to continue_program.

add [ebp - 4], 23, or add local_var_1, 23, is saying local_var_1 += 23; After that we “mov eax, 1”, clean up the stack frame, and then return.

Now let’s look at an if statement with multiple logical conditions

if ( (i == 0) || (i2 == 23) && (i3 == 21) )

if_block_check1:
   cmp i, 0
   jne if_block_check2
   jmp do_if
 
if_block_check2:
   cmp i2, 23
   jne skip_if
   cmp i3, 21
   jne skip_if
do_if:
   ; actions here
skip_if:

OK, the first thing we see is that in a multi-condition if statement, when one condition fails the code jumps to the next logical expression to see whether that one evaluates to true, as shown in figure 3.2.1

Sample screenshot

So if we have a multi-condition if statement and part of the expression succeeds, we continue to evaluate the expression until something is false.

Of course this is only true for the && operator. For the || operator, if one part of the expression is true we stop evaluating the entire expression and the if statement evaluates as true.
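
The jump structure can be written back in C++ as a chain of early exits, one per conditional jump. This is a sketch with made-up function names (condition_as_jumps and condition_as_written); the point is only that both forms agree for every input.

```cpp
// One early exit per conditional jump in the listing.
constexpr bool condition_as_jumps(int i, int i2, int i3) {
    if (i == 0)   return true;   // cmp i, 0 / jne if_block_check2 / jmp do_if
    if (i2 != 23) return false;  // cmp i2, 23 / jne skip_if
    if (i3 != 21) return false;  // cmp i3, 21 / jne skip_if
    return true;                 // fall through into do_if
}

// The original condition, with the implicit && grouping made explicit.
constexpr bool condition_as_written(int i, int i2, int i3) {
    return (i == 0) || ((i2 == 23) && (i3 == 21));
}
```

When you meet a ladder of cmp/jne pairs that all target the same skip label, reading it as a chain of && conditions is usually a good first guess.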

3.3 For Loop

The for loop is not only one of the most interesting things about C++, it is also one of the most used statements.

What makes the for loop interesting is its ability to evaluate 3 expressions

for ( <expression 1>; <expression 2>; <expression 3>)

The expressions are usually

for ( <assignment>; <conditional>; <increment | decrement>)

Reverse engineering the for statement is not hard, because in most cases it’s really an if statement that is re-evaluated after each pass

if (i < 4)
{
    i++;
    //do actions
}

Now for the for loop equivalent

for (int i = 0; i < 4; i++)
{
    //do actions
}

OK lets look at a simple reverse disassembly for the for loop

   mov dword ptr [ebp - 4], 0 ; initialize the local variable
   jmp condition
 
increment:
   add dword ptr [ebp - 4], 1
 
condition:
   cmp dword ptr [ebp - 4], 4
   jge done
 
loop_body:
   ; do actions
   jmp increment
 
done:

As you can see, the for loop is nothing more than a high-level if statement. The first thing we do is initialize the local variable on the stack; after that we check the condition. Then we run the loop body, jump back to increment, and fall through into the condition again, over and over until the condition is false.
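
The same control flow can be spelled out in C++ with an explicit condition check and increment, which is roughly what the compiler does for you. A minimal sketch; body_runs_for_loop and body_runs_lowered are hypothetical names, and both simply count how many times the loop body runs.

```cpp
// The loop as a programmer writes it.
constexpr int body_runs_for_loop() {
    int runs = 0;
    for (int i = 0; i < 4; i++) {
        ++runs;                // do actions
    }
    return runs;
}

// The loop in the order the disassembly executes it.
constexpr int body_runs_lowered() {
    int runs = 0;
    int i = 0;                 // mov [ebp - 4], 0 / jmp condition
    while (true) {
        if (i >= 4) break;     // cmp [ebp - 4], 4 / jge done
        ++runs;                // do actions
        i += 1;                // jmp increment / add [ebp - 4], 1
    }
    return runs;
}
```

Both functions return 4, so spotting the initialize, compare, body, increment pattern in a disassembly is enough to rebuild the for statement.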

3.4 Structures

Structures are very useful in C++ because of their ability to contain members. A structure lets you define a variable of any size. For example

struct test1
{
    int member1;
    int member2;
};

This creates a 64-bit (8-byte) variable in memory. So in a sense structures are regular variables that allow us to access certain parts of the variable independently of the others.

This makes them very useful, because if you were to use char test1[8]; you would create exactly the same thing in memory as struct test1, only it would be much harder to access the 4-byte members individually.

Here is an example of using test1 as a local variable

sub esp, 8                    ; reserve 8 bytes on the local stack
mov dword ptr [ebp - 4], 45   ; set member2 to 45
mov dword ptr [ebp - 8], 12   ; set member1 to 12

As you can see, the members are laid out in declaration order at ascending addresses. You might think that member1 would be the last on the stack, but it turns out to be the first: it sits at the lowest address, [ebp - 8], nearest the top of the stack.

For a global variable the compiler would simply reserve 8 bytes in the executable and reference each part individually, based on the member you have chosen.
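
The member positions can be checked from C++ with sizeof and offsetof. This sketch makes two assumptions that are labeled in the code: it uses std::int32_t in place of int so the 4-byte size is guaranteed, and frame_offset is a made-up helper that assumes the structure is the only local and therefore ends right at ebp.

```cpp
#include <cstddef>
#include <cstdint>

// Same shape as the article's test1; int32_t is used so each member is
// exactly 4 bytes (an assumption standing in for Visual C++'s 4-byte int).
struct test1 {
    std::int32_t member1;
    std::int32_t member2;
};

constexpr std::size_t member1_offset = offsetof(test1, member1);
constexpr std::size_t member2_offset = offsetof(test1, member2);

// Hypothetical helper: if the structure is the only local variable it
// occupies [ebp - sizeof(test1), ebp), so a member sits at this ebp offset.
constexpr int frame_offset(std::size_t member_offset) {
    return static_cast<int>(member_offset) - static_cast<int>(sizeof(test1));
}
```

frame_offset(member1_offset) is -8 and frame_offset(member2_offset) is -4, matching the [ebp - 8] and [ebp - 4] accesses in the listing.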

3.5 Technical Algorithms

I am providing some sample programs to demonstrate and help you understand some of the theory I presented in this book.

The following example shows that variables inside an if block are truly accessible to the whole function.

#include "stdafx.h"
#include <iostream>
using namespace std;
 
int main(int argc, char* argv[])
{
    __asm mov dword ptr [ebp -4], 23
    if(true)
    {
        int i;
        cout << i << endl;
    }
    return 1;
}

In an unoptimized build the output should be 23 even though we never initialize i. If you’re confused, remember that since i is the first and only local variable, its location is [ebp - 4].

This next example shows that structures are just regular variables with the added ability to be accessed in parts instead of as a whole.

#include "stdafx.h"
#include <iostream>
using namespace std;
struct test1
{
    int member1;
    int member2;
    int member3;
};
 
int main(int argc, char* argv[])
{
    test1 local_struct;
    local_struct.member1 = 1;
    local_struct.member2 = 1;
    local_struct.member3 = 1;
    __asm
    {
        add dword ptr [ ebp - 12],55 ; member1
        add dword ptr [ ebp - 8] , 100 ; member2
        add dword ptr [ ebp - 4] , 23 ; member3
    }
    cout << "member 1: " << local_struct.member1 << endl;
    cout << "member 2: " << local_struct.member2 << endl;
    cout << "member 3: " << local_struct.member3 << endl;
    return 1;
}

Output should be

member 1: 56
member 2: 101
member 3: 24 

Chapter 4: Practical Decompiling

This chapter aims to provide practical decompiling knowledge. In this chapter we will learn to use a disassembler and to decompile real-world applications.

4.1 Intro to Windows decompiling

Windows decompiling is not that difficult, since Windows programmers follow a strict programming pattern, calling functions such as CreateWindowEx or CreateDialog, and all windowed applications have message loops, which you can easily find. Before we really start getting into decompiling, let’s go over the basics. In the vast world of Windows there are many types of applications, and many more types of technology.

All of it is too much to cover in one tutorial. On top of that, this information only applies to applications that use the basic window functions, such as CreateWindowEx and CreateDialog. Applications made in Visual Basic or Delphi use their own engines, and those engines will not be covered. There is also MFC, which is simply a class wrapper around API calls but can greatly complicate things. We will be working on an application I made in pure Win32 API. All it does is show a window, but we all know showing a window requires a significant amount of work.

1. Create the window class

From this we can get the window procedure, in which all messages are handled.

The lpfnWndProc member of the WNDCLASSEX structure contains the address of the window procedure.

2. Create the Window itself.

We can retrieve every single constant by name, and most of the time the exact C/C++ equivalent.

3. The message Loop

All we have to do is look for a reference to GetMessage(…).

We start with the basic skeleton first, then move on to more complex stuff. It’s important to learn the basics first because they give you an idea of how the application is designed. We will be using PVdasm, which you can get from my site -

This is a very nice free disassembler.

4.2 Decompiling a sample application

First load up PvDasm, and your screen should look similar to Figure 4.2.1

(Figure 4.2.1)

Grab CreateWindow2 (the program we are going to decompile by hand) and Open it in the disassembler, your screen should look similar to figure 4.2.2

(Figure 4.2.2)

We see our entry point, but this is CRT code (C runtime library). How can we find the WinMain function? By references. We know that a WinMain function calls CreateWindowEx or RegisterClassEx, so if we can find where the program calls these functions, we can then begin to map out the program. You see, when you compile a program, a linker links it with libraries or DLLs (dynamic-link libraries). The functions you get from these DLLs are called imports. PVdasm can list all the imports a program has and show you the addresses from which they are called. To use this feature press Ctrl+N or press the import button. Your screen should look similar to Figure 4.2.3.

  • Step 1. Click the import button or press Ctrl+N.
  • Step 2. You should see a window with a list of imports; scroll down until you see CreateWindowEx.
Now we must find the start of the function. This is pretty easy if we follow these rules. The start of a function:

1. Consists of a
push ebp
mov ebp, esp
sub esp, <X>

2. Comes right after the end of the previous function, which looks like

mov esp, ebp
pop ebp
ret <X>

If we scroll up to address 0040104C we should see

0040104C push ebp
0040104D mov ebp, esp
0040104F sub esp, 50h 

After that we see

mov dword ptr ss:[ebp-30],00000030
mov dword ptr ss:[ebp-2C],00000003

Ok, so we know we have local variables, and they look very much like a structure. To find the WNDCLASSEX structure we need a reference point. A good reference to look for is LoadCursor. Just about every application uses this call, so simply press the import button or Ctrl+N and select LoadCursor.

Once you have selected LoadCursor you should then see something similar to

00401092 call ds:LoadCursorA

00401098 mov [ebp-14], eax

Ok, now we all know the return value of a function is stored in the eax register, and we know that the result is going into the hCursor member of WNDCLASSEX (because we are loading a cursor). So where is hCursor in memory? It’s at ebp-14h (yes, that’s 14 hex, not decimal; the disassembler prints all of these offsets in hex). With this information we can figure out where all the other members are too. Let’s take a quick look at the WNDCLASSEX structure:

typedef struct WNDCLASSEX {
UINT cbSize; // ebp - 30h
UINT style; // ebp - 2Ch
WNDPROC lpfnWndProc; // ebp - 28h
int cbClsExtra; // ebp - 24h
int cbWndExtra; // ebp - 20h
HINSTANCE hInstance; // ebp - 1Ch
HICON hIcon; // ebp - 18h
HCURSOR hCursor; // ebp - 14h <-- start calculating here
HBRUSH hbrBackground; // ebp - 10h
LPCSTR lpszMenuName; // ebp - 0Ch
LPCSTR lpszClassName; // ebp - 08h
HICON hIconSm; // ebp - 04h
} WNDCLASSEX;

As you can see, it’s easy to calculate the address of each structure member. Starting from a member whose location you know, add the size of each member as you move up toward the start of the structure (the distance below ebp grows), and subtract the size of each member as you move down toward the end. Every member of this structure is 4 bytes in a 32-bit program, so the offsets simply change in steps of 4. Now that we know the memory location of every member, we can begin to really understand how the program is put together. The first thing we do is get the value of each member in the structure, starting with cbSize.
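The offset arithmetic can also be checked mechanically. The following is only a sketch: WndClassEx32 and ebpDisplacement are hypothetical names made up for this illustration, every member is modelled as a 4-byte integer (which is what handles, pointers and ints are in a 32-bit program), and it assumes the structure is the local that ends exactly at ebp, as it does in this listing.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for the 32-bit WNDCLASSEX layout: each member is
// 4 bytes wide on x86, so a uint32_t models every handle, pointer and int.
struct WndClassEx32 {
    std::uint32_t cbSize;
    std::uint32_t style;
    std::uint32_t lpfnWndProc;
    std::uint32_t cbClsExtra;
    std::uint32_t cbWndExtra;
    std::uint32_t hInstance;
    std::uint32_t hIcon;
    std::uint32_t hCursor;
    std::uint32_t hbrBackground;
    std::uint32_t lpszMenuName;
    std::uint32_t lpszClassName;
    std::uint32_t hIconSm;
};

// The whole structure is 30h bytes, the value the program stores in cbSize.
static_assert(sizeof(WndClassEx32) == 0x30, "12 members of 4 bytes each");

// Distance below ebp of a member, assuming the last member sits at ebp-4
// (true for this listing, not for every function).
inline std::size_t ebpDisplacement(std::size_t memberOffset) {
    assert(memberOffset < sizeof(WndClassEx32));
    return sizeof(WndClassEx32) - memberOffset;
}
```

For example, ebpDisplacement(offsetof(WndClassEx32, hCursor)) gives 14h, which matches the mov [ebp-14], eax that follows the LoadCursorA call.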

1. cbSize

The first thing we see is mov dword ptr ss:[ebp-30],00000030, and we know that ebp-30h is the location of cbSize. So what we are really saying is mov dword ptr ss:[cbSize],30h. Of course we can go a step further: 30h (48 decimal) is the size of WNDCLASSEX, and cbSize is supposed to hold the size of WNDCLASSEX, so we can fully decompile this line to

wc.cbSize = sizeof(WNDCLASSEX);

2. style

mov dword ptr ss:[ebp-2C],00000003

Ok, what style is the program using? To figure this out we need to look into windows.h and get all the class style values. Now we could do a bit-by-bit compare by hand, but we don’t have time for that, so I made a small program called WinDasmRef. All we need to do is choose the section we want to look up (in our case, style from WNDCLASSEX) and enter a value, and bam, it returns the names of the constants that make up the value the user entered.

Refer to screenshot 4.2.5 for more information.

You can get this program from http://www.crackingislife.com/modules.php?name=Downloads&d_op=getit&lid=1

  • Step 1. Select a section
  • Step 2. Enter a value
  • Step 3. It will do a bit by bit compare for you and find all the values.

This program is nowhere near finished, but it is more than enough for this book.
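If you don’t have WinDasmRef at hand, the bit-by-bit compare is easy to sketch yourself. The code below is not WinDasmRef’s source, just a minimal illustration of the same idea; decodeClassStyle and kClassStyles are hypothetical names, and the table only lists some of the CS_ values from the Windows headers.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

struct StyleBit {
    std::uint32_t bit;
    const char* name;
};

// A subset of the class-style bits defined in the Windows headers.
static const StyleBit kClassStyles[] = {
    {0x0001, "CS_VREDRAW"}, {0x0002, "CS_HREDRAW"}, {0x0008, "CS_DBLCLKS"},
    {0x0020, "CS_OWNDC"},   {0x0040, "CS_CLASSDC"}, {0x0080, "CS_PARENTDC"},
    {0x0200, "CS_NOCLOSE"}, {0x0800, "CS_SAVEBITS"},
};

// Turns a raw style value into the names of the bits that are set.
std::string decodeClassStyle(std::uint32_t value) {
    std::string out;
    for (const StyleBit& entry : kClassStyles) {
        // Every table entry must be a single bit for this test to be valid.
        assert(entry.bit != 0 && (entry.bit & (entry.bit - 1)) == 0);
        if (value & entry.bit) {
            if (!out.empty()) out += " | ";
            out += entry.name;
            value &= ~entry.bit;  // mark this bit as explained
        }
    }
    if (value != 0) {  // bits that are set but not in the table
        if (!out.empty()) out += " | ";
        out += "<unknown bits>";
    }
    return out.empty() ? "0" : out;
}
```

Calling decodeClassStyle(3) yields "CS_VREDRAW | CS_HREDRAW". The order of the names does not matter, since OR-ing the two constants gives 3 either way.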

3. lpfnWndProc

mov dword ptr ss:[ebp-28],00401000

This is the most important and interesting member, because it holds the address of the window procedure, the function that handles every message sent to the window. From this we can tell that the window procedure is located at address 00401000 (in hex, of course).

4. cbClsExtra

mov dword ptr ss:[ebp-24],00000000

We are simply setting wc.cbClsExtra to 0.

5. cbWndExtra

mov dword ptr ss:[ebp-20],00000000

We are simply setting wc.cbWndExtra to 0.

6. hInstance

mov eax,dword ptr ss:[ebp+8] // parameter hInstance

mov dword ptr ss:[ebp-1C],eax // wc.hInstance

Remember the declaration of the main function is

int WINAPI WinMain(HINSTANCE hInstance, HINSTANCE hPrevInstance, LPSTR lpCmdLine, int nCmdShow)

and the first parameter (hInstance) is stored at ebp+8, the second parameter (hPrevInstance) at ebp+0Ch (12 decimal), and so on in steps of 4.

Now that eax holds the value of hInstance, we simply transfer that value to [ebp-1C], which is wc.hInstance. So in other words we are saying wc.hInstance = hInstance.
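The parameter positions follow directly from the standard stack frame, and a tiny sketch makes the rule explicit. The helper names argumentOffset and argumentCount are made up for this illustration; the code assumes the usual push ebp / mov ebp, esp prologue and 4-byte parameters, which holds for this sample but not for every function.

```cpp
#include <cassert>
#include <cstddef>

// With the standard frame, [ebp] holds the saved ebp and [ebp+4] the return
// address, so the first 4-byte argument lives at [ebp+8].
inline std::size_t argumentOffset(std::size_t index) {
    const std::size_t kFirstArgument = 8;
    const std::size_t kSlotSize = 4;
    return kFirstArgument + index * kSlotSize;
}

// Number of 4-byte arguments a stdcall function removes with "ret N".
inline std::size_t argumentCount(std::size_t retBytes) {
    assert(retBytes % 4 == 0);  // stack slots are always whole dwords
    return retBytes / 4;
}
```

So hInstance (index 0) is at ebp+8, hPrevInstance (index 1) at ebp+0Ch, and nCmdShow (index 3) at ebp+14h. The second helper works in the other direction: the ret 10 that ends WinMain later in this chapter removes 10h bytes, and argumentCount(0x10) gives the four WinMain parameters.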

7. hIcon

mov dword ptr ss:[ebp-18],00000000

We are simply setting wc.hIcon to 0.

8. hCursor

push 00007F00
mov ecx,DWORD ptr SS:[ebp+08]
push ecx
call USER32!LoadCursorA
mov dword ptr ss:[ebp-14],eax

Ok, the first thing we do is look at the declaration of LoadCursorA and find that it is

HCURSOR LoadCursorA(HINSTANCE hInstance, LPCSTR lpCursorName);

The last parameter is pushed first, so lpCursorName is the first value being pushed, which is 7F00.

If the program is not using a custom cursor (most don’t), we can look the value up in WinDasmRef. And yes, you can enter hex values in WinDasmRef; just make sure you type 0x7F00, not 7F00.

Refer to Figure 4.2.6.

(Figure 4.2.6)

Note: If you’re wondering why LoadCursor.cursorname wasn’t in the first picture, it is because I’m writing this program as I’m typing this book.
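The lookup of 7F00 can be sketched in a few lines as well. A cursor "name" whose upper 16 bits are zero is an integer resource ID rather than a pointer to a string, and the predefined cursors have fixed IDs in the Windows headers. The function names below are hypothetical, and only a handful of the predefined IDs are listed.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// A resource name with a zero high word is an integer ID, not a string pointer.
inline bool isIntegerResource(std::uint32_t value) {
    return (value >> 16) == 0;
}

// Extracts the 16-bit ID; only meaningful for integer resources.
inline std::uint16_t resourceId(std::uint32_t value) {
    assert(isIntegerResource(value));
    return static_cast<std::uint16_t>(value);
}

// Maps a pushed lpCursorName value to a predefined cursor constant.
std::string cursorName(std::uint32_t value) {
    if (!isIntegerResource(value)) return "<pointer to a name string>";
    switch (resourceId(value)) {
        case 32512: return "IDC_ARROW";    // 0x7F00
        case 32513: return "IDC_IBEAM";    // 0x7F01
        case 32514: return "IDC_WAIT";     // 0x7F02
        case 32515: return "IDC_CROSS";    // 0x7F03
        case 32516: return "IDC_UPARROW";  // 0x7F04
        default:    return "<custom or unlisted ID>";
    }
}
```

For the value in this listing, cursorName(0x7F00) returns "IDC_ARROW". A value such as 0x00406030 has a non-zero high word, so it would be an address of a name string inside the program instead.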

mov ecx,DWORD ptr SS:[ebp+08]

push ecx

Next we load ecx from SS:[ebp+8], which is hInstance, and then we push ecx onto the stack.

The stack now contains, from the first push to the last:

  • IDC_ARROW
  • hInstance

Then we see call USER32!LoadCursorA. We can turn this back into the complete original line of source, which is

LoadCursor(hInstance,IDC_ARROW);

Now we all know that LoadCursor returns the handle to the cursor in the eax register, so in

mov dword ptr ss:[ebp-14],eax

ebp-14 is the position of hCursor. Now let’s decompile the entire statement:

wc.hCursor = LoadCursor(hInstance,IDC_ARROW);

9. hbrBackground

push 01

call GDI32!GetStockObject

mov dword ptr ss:[ebp-10],eax

Ok, first we push 01 onto the stack and call GetStockObject. If we look at the declaration of GetStockObject, which is HGDIOBJ GetStockObject(int i), we know that the 01 selects a stock object, in this case a brush. So load up WinDasmRef and type 1 in; refer to Figure 4.2.7 for more information.

So we know the call is GetStockObject(LTGRAY_BRUSH). After that we see mov dword ptr ss:[ebp-10],eax, where eax holds the handle to the brush returned by GetStockObject and ebp-10 is the memory location of hbrBackground, so the full decompiled statement is

wc.hbrBackground = (HBRUSH) GetStockObject(LTGRAY_BRUSH);

10. lpszMenuName

mov dword ptr ss:[ebp-0C],0000000

We simply set lpszMenuName to 0.

11. lpszClassName

mov edx,dword ptr ds:[0040603C]

mov dword ptr ss:[ebp-08],edx

At address 0040603C is a pointer to our class name. How can I tell? Easy: the address is surrounded by brackets, so the instruction is reading a value from 0040603C rather than using 0040603C itself. We can easily use any hex editor to look at address 0040603C, as long as we know the image base.

The image base is the address at which the program is loaded into memory. To see the image base press Ctrl+P in PvDasm. A window similar to Figure 4.2.8 should come up.

(Figure 4.2.8)

We subtract the image base, which is 400000 in hex, from 0040603C, and we are left with 603C. (Treating this value as a file offset only works when the sections sit at the same offsets in the file as in memory, as they do in this sample; in general you have to translate the address through the PE section table.) Now if we go to offset 603C in the file we will see 30. We must read 3 more bytes, because Intel uses 32-bit addresses, so the four bytes in file order are 30 60 40 00.

Now 30 60 40 00 is in little-endian order, which the x86 uses, so we must convert it to big endian by reversing the order of the bytes, like this: 00406030. If we subtract the image base from that we get 6030, and if we look at offset 6030 we will see a ‘D’. If we keep reading up to the null terminator, we will see DECOMPILE.
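The same chase from address to pointer to string can be written down as a short sketch. The function names are hypothetical, the file is modelled as a plain byte vector, and the code makes the same simplifying assumption as the walkthrough: the file offset is the address minus the image base.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Reads four bytes stored least-significant first, the way x86 stores a dword.
std::uint32_t readDwordLE(const std::vector<std::uint8_t>& file, std::size_t offset) {
    assert(offset + 4 <= file.size());
    return static_cast<std::uint32_t>(file[offset])
         | (static_cast<std::uint32_t>(file[offset + 1]) << 8)
         | (static_cast<std::uint32_t>(file[offset + 2]) << 16)
         | (static_cast<std::uint32_t>(file[offset + 3]) << 24);
}

// Collects characters from offset up to (not including) the null terminator.
std::string readCString(const std::vector<std::uint8_t>& file, std::size_t offset) {
    std::string text;
    while (offset < file.size() && file[offset] != 0) {
        text += static_cast<char>(file[offset++]);
    }
    return text;
}

// Follows a pointer variable: address -> stored pointer -> string.
// Assumes file offset == address - imageBase, as in the walkthrough.
std::string stringBehindPointer(const std::vector<std::uint8_t>& file,
                                std::uint32_t imageBase, std::uint32_t address) {
    const std::uint32_t target = readDwordLE(file, address - imageBase);
    return readCString(file, target - imageBase);
}
```

With the bytes 30 60 40 00 at offset 603C, readDwordLE returns 00406030, and stringBehindPointer(file, 0x400000, 0x0040603C) then reads the text stored at offset 6030.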

Now that we have the name of our class, we can fully decompile the statement like this:

static char * szClass = "DECOMPILE";

wc.lpszClassName = szClass;

This follows because mov dword ptr ss:[ebp-08],edx stores edx, which holds the address of the string that szClass points to, and ebp-8 is the memory location of lpszClassName.

12. hIconSm

mov dword ptr ss:[ebp-4],00000000

This is simply setting hIconSm to 0.

Now that we are done with our whole window class, let’s have an overview of all the values:

WNDCLASSEX wc; // we don't know the exact name, but it has to be something
wc.cbSize = sizeof(WNDCLASSEX);
wc.style = CS_HREDRAW | CS_VREDRAW;
wc.lpfnWndProc = WndProc;
wc.cbClsExtra = 0;
wc.cbWndExtra = 0;
wc.hInstance = hInstance;
wc.hIcon = 0;
wc.hCursor = LoadCursor(hInstance,IDC_ARROW);
wc.hbrBackground = (HBRUSH) GetStockObject(LTGRAY_BRUSH);
wc.lpszMenuName = NULL;
wc.lpszClassName = szClass;
wc.hIconSm = NULL;

As you can see, we have decompiled this back to practically the exact source code.

Now we see the following code

lea eax,dword ptr ss:[ebp-30]
push eax
call USER32!RegisterClassExA
and eax,0000FFFF
test eax,eax
jnz 004010E4
push 0
push 00406054 ; ASCIIZ Crap
push 0040605C ; ASCIIZ Can’t register class
push 0
Call USER32!MessageBoxA
xor eax,eax
jmp 00401172

Let’s begin with

lea eax,dword ptr ss:[ebp-30]

push eax

call USER32!RegisterClassExA

The lea instruction loads eax with the address ebp-30, which is the address of the WNDCLASSEX structure, because [ebp-30] is its first member, cbSize. Now that eax holds the address of the structure, we push it onto the stack and call USER32!RegisterClassExA. If we look at the declaration of RegisterClassEx,

ATOM WINAPI RegisterClassExA(CONST WNDCLASSEX *);

we see that it returns the type ATOM, which is 16 bits. That is why we see and eax,0000FFFF, which masks off the upper 16 bits so that we don’t test a full 32-bit value. After that we see

test eax,eax

jnz 004010E4

This is simply saying that if eax is not zero, jump to 004010E4. The exact C++ code for this is

if(!RegisterClassEx(&wc))
{
//bad code here
}
//else continue (004010E4)
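To see why the mask matters, here is a small sketch of the same test outside the disassembler. The function names are made up for this illustration; the point is only that the decision depends on the low 16 bits of eax, whatever happens to be left in the upper half.

```cpp
#include <cassert>
#include <cstdint>

// ATOM is a 16-bit type, so only the low word of eax carries the result.
inline std::uint16_t atomFromEax(std::uint32_t eax) {
    return static_cast<std::uint16_t>(eax & 0x0000FFFFu);
}

// Mirrors "and eax,0000FFFF / test eax,eax / jnz": the failure branch is
// taken only when the masked value is zero.
inline bool registrationFailed(std::uint32_t eax) {
    return atomFromEax(eax) == 0;
}

// Usage: leftover bits in the upper word must not change the decision.
inline void demonstrateMask() {
    assert(registrationFailed(0xABCD0000u));   // low word is 0: failure
    assert(!registrationFailed(0x0000C001u));  // non-zero atom: success
}
```

Without the and instruction, a value like ABCD0000 would look non-zero to test eax,eax and the program would wrongly skip the error handling.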

Remember, the ‘!’ is saying that if RegisterClassEx returns 0, execute the bad code. As we continue on, we see that the program displays a message box if registration fails:

push 0

push 00406054 ; ASCIIZ Crap

push 0040605C ; ASCIIZ Can’t register class

push 0

Call USER32!MessageBoxA

and if we look at the declaration of MessageBox

int MessageBoxA(HWND hWnd, LPCSTR lpText, LPCSTR lpCaption, UINT uType);

Remember that the last parameter is pushed first, so reading the pushes from top to bottom:

  • push 0 is the uType parameter, the message box type
  • push 00406054 is lpCaption, the address of the ASCII string "Crap"
  • push 0040605C is lpText, the address of the ASCII string "Can't register class"
  • push 0 is the hWnd parameter, specifying that we have no owner window

To see what type 0 is, let’s crack open WinDasmRef. Refer to Figure 4.2.9 for more information.

So we can decompile the whole line into

MessageBox(NULL,"Can't register class","Crap",MB_OK);
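Matching pushes to parameters is the step where it is easiest to slip, so here is a minimal sketch of the rule. The function name argumentsInSourceOrder is hypothetical; it assumes the calling convention used by the Windows API, where arguments are pushed right to left, and that each argument is a single 4-byte push.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Takes the pushed values in the order the push instructions appear before
// a call and returns them in source order: the last push is the first
// argument of the function.
std::vector<std::uint32_t> argumentsInSourceOrder(const std::vector<std::uint32_t>& pushes,
                                                  std::size_t parameterCount) {
    assert(pushes.size() >= parameterCount);
    // Only the final parameterCount pushes belong to this call.
    return std::vector<std::uint32_t>(
        pushes.rbegin(),
        pushes.rbegin() + static_cast<std::ptrdiff_t>(parameterCount));
}
```

Feeding it the four pushes before MessageBoxA (0, 00406054, 0040605C, 0) returns 0, 0040605C, 00406054, 0, which lines up with hWnd, lpText, lpCaption and uType in the declaration.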

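After that we see

xor eax,eax

jmp 00401172

xor eax,eax sets eax to 0, and if we go see what’s at address 00401172, we will find

mov esp,ebp

pop ebp

ret 10

which is the exit code of the function (ret 10 also removes 10h = 16 bytes of arguments from the stack, the four 4-byte WinMain parameters). So we can decompile these lines to return 0. The full original code is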

if(!RegisterClassEx(&wc))
{
MessageBox(NULL,"Can't register class","Crap",MB_OK);
return 0;
}

As you can see, decompiling is quite simple for this basic Windows stuff, so I’m not going to bore you with the rest. If you have any questions, please check out our forums at http://www.eliteproxy.com/modules.php?name=Forums

More to come

Visual Basic 6.0 is next.

Credit

This paper was made possible by your donations. If you would like to continue to support Opcodevoid, then please donate.

Disclaimer

This book is provided as is; no warranty is implied or granted. What is presented in this book is copyrighted by Opcodevoid with all rights reserved. The information and algorithms may not be copied, reproduced, or distributed in any way without written permission from Opcodevoid or Opcodevoid Inc.

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.


About the Author

Opcodevoid

United States

Comments and Discussions

 
A too simplified picture (C-J Berg, 28 May 2003)
I would like to point out that this article gives a very simplified picture of a very complicated subject. I don't think the author has fully understood the complexity of modern C++ compilers, where optimization techniques such as interprocedural analysis, partial inlining, loop unrolling and complex branch optimization transforms the source code into machine code in such a way that an extensive amount of important information is completely lost.
 
Some statements, for instance "for each mathematical operator the compiler maps it with a specific X 86 Instructions", are unfortunately absolutely wrong (compilers are a lot more intelligent these days). Other statements like "Structures are very useful in C++ because of their ability to contain members" make me puzzled, to say the least.
 
Note, however, that I'm not saying that the subject is void of interest. Flow analysis and pseudo-code generation are very important for speeding up reverse engineering tasks such as analysis of computer viruses and performing product security assessments, but it's far from being as simple as one gets the impression of by reading the article, and it's never about recovering compilable source code.

Last Updated 26 Aug 2004
Article Copyright 2003 by Opcodevoid
Everything else Copyright © CodeProject, 1999-2014