You might be wondering how your computer works: what happens when you write a program and then compile it? What is assembler, and what is the basic principle
of programming in it? This tutorial should clarify this for you. It’s not intended to teach you assembly programming itself, but rather to give you the basics needed
to understand what’s actually going on under the hood. It also deliberately simplifies some things, so you’re not overwhelmed by additional information.
However, I assume that you have some knowledge of high level programming (C/C++, Visual Basic, Python, Pascal, Java, and tons more…).
I also hope that more experienced readers will forgive me for simplifying a lot of things here; my intention was to make the explanation clear and simple
for someone who doesn't have a clue about this topic.
Note: I will be very grateful for any feedback on this. It’s difficult to write explanations for people who don’t know much about the topic,
so I might have omitted some important things or not clarified something enough. If something is unclear, don’t hesitate to ask.
How does the processor (CPU) work?
You might know that the CPU (Central Processing Unit, or simply processor) is the “brain” of the computer, controlling all other parts of the computer and performing
various calculations and operations with data. But how does it achieve that?
The processor is a circuit designed to execute instructions: a whole series of them, one by one. The instructions to be executed
are stored in memory; in a PC, this is the main (operating) memory. Imagine the memory as a large grid of cells. Each cell can store a small number, and each cell
has its own unique number, its address. The processor sends the memory the address of a cell, and the memory responds with the value stored in that cell (a number, but it can represent anything: letters, graphics,
sound… everything can be converted to numerical values). Of course, the processor can also tell the memory to store a new number in a given cell.
Instructions themselves are basically numbers too: each simple operation is assigned its own unique numeric code. The processor retrieves this number and decides
what to do: for example, number 35 will cause the processor to copy data from one memory cell to another, number 48 can tell it to add two numbers together, and number
12 can tell it to perform a simple logical operation called OR.
Which operations are assigned to which numbers is decided by the engineers who design a given processor, or better said, a given processor architecture
(they of course decide other aspects of the processor too, but that’s not relevant now).
This set of rules is called the architecture. This way, manufacturers can create various processors that support a given architecture: they can differ in speed,
power consumption, and price, but they all understand the same codes as the same instructions.
Once the processor completes the action determined by the code (the instruction), it simply requests the following one and repeats the whole process.
Sometimes it can also decide to jump to different places in the memory, for example to some subroutine (function) or jump a few cells back to a previous instruction
and execute the same sequence again – basically creating a loop. The sequence of numerical codes that form the program is called machine code.
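The fetch-and-execute cycle described above can be sketched in a few lines of Python. This is only an illustration under assumptions: the opcodes 35, 48 and 12 come from the examples above, but the operand layout (a target and a source address after each opcode), the HALT code 0, and the names run and memory are made up for this sketch and don't correspond to any real processor.

```python
# Toy machine: memory is a list of cells, each addressed by its index.
# Opcodes 35/48/12 follow the article's examples; the operand layout
# (target address, source address) and HALT = 0 are assumptions.
HALT, OR, MOV, ADD = 0, 12, 35, 48

def run(memory, pc=0):
    """Fetch, decode and execute instructions until HALT is reached."""
    while True:
        opcode = memory[pc]                  # fetch the instruction code
        if opcode == HALT:
            return memory
        target, source = memory[pc + 1], memory[pc + 2]  # fetch operands
        if opcode == MOV:
            memory[target] = memory[source]
        elif opcode == ADD:
            memory[target] = memory[target] + memory[source]
        elif opcode == OR:
            memory[target] = memory[target] | memory[source]
        else:
            raise ValueError(f"unknown opcode {opcode} at address {pc}")
        pc += 3                              # request the following instruction

# Program at addresses 0..9, data at addresses 20..22.
memory = [0] * 32
memory[0:10] = [MOV, 22, 20,   # cell 22 = cell 20
                ADD, 22, 21,   # cell 22 = cell 22 + cell 21
                OR, 22, 20,    # cell 22 = cell 22 OR cell 20
                HALT]
memory[20], memory[21] = 6, 3
run(memory)
print(memory[22])  # 15
```

Note that the program and its data live in the same list: to this toy processor, both are just numbers in memory cells.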
What are instructions and how are they used?
As I already mentioned, instructions are very simple tasks that the processor can perform, each one having its own unique code. The circuit that makes
up the processor is designed to perform the given operations according to the codes it loads from memory. The numeric code is often called an opcode.
The operations that the instructions perform are usually very simple. Only by writing a sequence of these simple operations can you make the processor perform
a specific task. However, writing a sequence of numeric codes is quite tedious (though that’s how programming was done long ago), so the assembly programming
language was created. It assigns each opcode (the numeric code) a symbol: a name that roughly describes what the instruction does.
Given the previous examples, where number 35 makes the processor move data from one memory cell to another, we can assign this instruction the name
MOV, which is short for MOVe. Number 48, the instruction that adds two numbers together, gets the name ADD, and 12, which performs
the OR logical operation, gets the name OR.
The programmer writes a sequence of instructions (simple operations that the processor can perform) using these names, which are much easier
to read than bare numeric codes. Then they run a tool called an assembler (the term “assembler” is often used for the programming language as well,
though technically it means the tool), which converts these symbols to the appropriate numeric codes that can be executed by the processor.
However, in most cases, the instruction itself isn’t sufficient. For example, if you want to add two numbers together, you obviously need to specify them,
the same goes for logical operations, or moving data from a memory cell to another: you need to specify the address of the source and the target cell.
This is done by adding the so-called operands to the instruction – simply one or more values (numbers) that will provide additional information for the instruction
needed to perform a given operation. The operands are stored in the memory too, along with the instruction opcodes.
For example, if you want to move data from a location with address 1000 to a location 1258, you can write:
MOV 1258, 1000
The first number is the target address and the second is the source (in assembly, it’s quite common to write the target first and the source second).
The assembler (the tool that converts the source to machine code) stores these operands too. The processor first loads the instruction opcode,
which tells it that it must move data from one location to another. Of course, it needs to know from which location to move to what destination, so it loads
the operand values from memory too (they can be at the addresses right after the instruction opcode), and once it has all the necessary data, it performs the operation.
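What the assembler does with such a line can be sketched as follows. The opcode table reuses the numbers from the examples above; the function name assemble_line and the choice to store the operands directly after the opcode are assumptions made for this illustration, not the format of any real assembler.

```python
# Hypothetical opcode table using the numbers from the examples above.
OPCODES = {"MOV": 35, "ADD": 48, "OR": 12}

def assemble_line(line):
    """Translate one line such as 'MOV 1258, 1000' into numeric codes."""
    mnemonic, _, rest = line.strip().partition(" ")
    operands = [int(part) for part in rest.split(",")]
    # The opcode comes first; the operands are stored right after it.
    return [OPCODES[mnemonic.upper()]] + operands

machine_code = assemble_line("MOV 1258, 1000")
print(machine_code)  # [35, 1258, 1000]
```

The three resulting numbers are what would end up in three consecutive memory cells for the processor to load.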
Let’s look at some short code and describe what it does. Please note that this is pseudocode: it isn’t written for a specific architecture or language,
and various symbols can differ. The principle, however, remains the same.
MOV A, 2000
LOOP:
ADD A, #5
JNL A, #200, LOOP
MOV 2001, A
The first instruction moves the number from the memory cell with address 2000 to register A. A register is a temporary location where the processor stores numbers,
and a processor can have many registers like this. The second line contains something called a label: it’s not an instruction, it’s simply a mark in the source code that we can refer to later (you’ll see how).
On the third line, there’s an ADD instruction, which adds two numbers together. The operands are register A and the number 5 (the # mark tells the assembler
that it’s the number five, not the number in the memory cell with address 5). And remember, we stored the value from memory location 2000 in register A,
so whatever that value is, this instruction will add 5 to it.
The following instruction is called a conditional jump: the processor tests some condition and, based on the result, either jumps or doesn’t.
In this case, the condition is whether a given number is not larger than another one (JNL = Jump (if) Not Larger). The number in register A
is compared with the number 200 (again, the # mark means that it’s a direct number, not a number from the memory location with address 200).
If the number in A is not larger than 200 (the condition is true), the processor jumps to the instruction
specified by the third operand, and this is where our label comes in: the assembler tool (the translator) replaces “LOOP” with
the memory address of the instruction right after this mark.
So if the number is not larger than 200, the processor jumps back to the ADD instruction, adds 5 to the number in A again (which is already larger
from the previous calculation), and then gets back to the JNL instruction. If the number is still not larger than 200, it jumps back again; however,
if it is larger, the condition is no longer true, so no jump occurs and the following instruction gets executed. This one moves the value from
register A to the memory cell with address 2001, basically storing the resulting number there. It’s important to add that the memory cell with address
2000 still contains the original value: we created a copy of it in register A and didn’t modify the original.
This piece of code doesn’t really have much of a purpose; it’s simply for demonstration and also for some hypothetical architecture.
Real-world programs are composed of hundreds, thousands, even hundreds of thousands of instructions.
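To see the pseudocode above in action, here is a rough Python model of it, with each line annotated with the instruction it stands for. The starting value 180 in cell 2000 and the counter additions are assumptions added for illustration; the dictionary memory merely plays the role of the memory cells.

```python
# Step-by-step model of the pseudocode above. Memory is a dict from
# address to value; the starting value 180 is an arbitrary assumption.
memory = {2000: 180, 2001: 0}

a = memory[2000]          # MOV A, 2000
additions = 0             # bookkeeping for this sketch only
while True:               # LOOP:
    a += 5                # ADD A, #5
    additions += 1
    if not (a > 200):     # JNL A, #200, LOOP  (jump if A is not larger)
        continue          # jump back to LOOP
    break                 # condition false: continue with the next instruction
memory[2001] = a          # MOV 2001, A

print(memory[2001], additions)  # 205 5
print(memory[2000])             # 180, the original value is untouched
```

With 180 as the starting value, register A goes through 185, 190, 195, 200 and 205; only 205 is larger than 200, so the jump is taken four times.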
Architectures and assembly languages
I already mentioned the term architecture before: it describes features of certain processors. It describes what simple operations the processor
can perform (some may be able to perform only a dozen of them, some hundreds of various operations), and what opcodes each instruction has.
It also specifies a lot of other things: what and how many registers (small storage places directly in the processor itself, where the programmer
can temporarily store data) it has, how it can communicate with other chips and devices, like memory, chipset, graphics card, and other features of its function.
This means that every processor architecture has its own assembly language, because the instructions it has are different. Thus, assembly language (or simply assembler,
though that’s technically not correct) is not just one language, it’s a whole set of languages. They’re all quite similar, but differ in which instructions are available,
what their operands are, and in some other features specific to the processor. However, the basic principle is the same among them (unless it’s one of my WPU experimental
processors :-) ), so if you understand the principle of the assembler for one architecture, learning others will be a cinch.
So it’s important to understand: assembly language is always meant to be used with a specific architecture. For example, most personal computers use
an architecture called x86, or in the case of 64-bit systems and applications, its extension x64, so if you wanted to program for this architecture, you would use
the x86 assembly language. Many mobile devices use an architecture called ARM, so if you wanted to program these processors in assembler, you would use
the ARM assembly language. If you wanted to program some old console, like the Sega Genesis, you would use the 68000 assembly language, because it uses the Motorola 68000 processor, and so on.
There are hundreds of various architectures for various purposes.
Also, like I said, there are a lot of processors on the market with varying speeds, price, and power consumption, but many of them support the same
architecture – thus programs written in the given assembly language will work on them, they'll just run faster or slower.
However, programs created for one architecture generally won’t work on another one, because the processor is simply different: the opcodes are different,
the supported instructions and other features are different, so the machine code (a program – set of numeric codes) for x86 architecture would be gibberish
for the ARM architecture. This also means when you write a program in assembler for one processor architecture, it won’t work on another one: you would
need to rewrite it completely into assembly language for a different architecture.
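A small sketch may make the "gibberish" point more tangible. Both opcode tables below are invented for this illustration (ARCH_ONE reuses the numbers from the earlier examples, ARCH_TWO is arbitrary); neither describes x86, ARM, or any other real architecture.

```python
# Two made-up architectures that assign different numbers to the same
# operations. Neither table corresponds to a real processor.
ARCH_ONE = {35: "MOV", 48: "ADD", 12: "OR"}
ARCH_TWO = {12: "MOV", 35: "SUB", 7: "ADD"}

def decode(machine_code, architecture):
    """Name each opcode as the given architecture would understand it."""
    return [architecture.get(code, "???") for code in machine_code]

program = [35, 48, 12]             # opcodes only, written for ARCH_ONE
print(decode(program, ARCH_ONE))   # ['MOV', 'ADD', 'OR']
print(decode(program, ARCH_TWO))   # ['SUB', '???', 'MOV']
```

The same three numbers mean something entirely different to the second architecture, and one of them means nothing at all.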
Need for high level programming languages and compilers
There were two big pressing issues that led to the high level programming languages that you most probably already know. First, creating complex programs requires
dividing them into a lot of simple instructions, so to achieve more complex actions, you need to write a lot of instructions: this is both tedious and time consuming,
not to mention that it’s more difficult to understand. The second issue was already mentioned: a program written for one architecture won’t work on another
one without a complete rewrite. High level languages solve both these problems.
Complexity of programs
Let’s imagine that you want to perform a more complex calculation, for example, you want to calculate the result of A = 2 + ( 7 – 3 ) * 2.
However, the processor doesn’t support anything like that, it can only perform very simple operations. So if you want to write code in assembly,
you need to split this calculation into simple operations that the processor supports. For a mathematical expression, this is done the same way
you would do it if you were calculating the expression by hand in math class: first you calculate the value in the parentheses (subtract 3 from 7),
then multiply the result by 2, and finally add the number 2 to it. The result will be stored in register A. So the assembly code would look like
this (the “;” starts a comment, which is not part of the code, just a remark on what it does):
MOV A, #7   ; put the number 7 into register A
SUB A, #3   ; subtract 3 from it: A = 7 - 3
MUL A, #2   ; multiply the result by 2
ADD A, #2   ; add 2 to it; the result stays in A
Of course, you’ll rarely use fixed numbers in calculations; you’ll rather calculate with values from memory, so let's complicate the whole
process with the following change to the equation: @250 = @200 + ( @201 - @202 ) * @203. Here @(number) means “at address”: the number stored in the memory cell at the given address.
To complete the calculation, we first need to load the values from memory into registers, because this processor can’t calculate with numbers in memory cells directly.
Fortunately, it provides another register, B.
MOV B, 202   ; B = value at address 202
MOV A, 201   ; A = value at address 201
SUB A, B     ; A = @201 - @202
MOV B, 203
MUL A, B     ; A = (@201 - @202) * @203
MOV B, 200
ADD A, B     ; A = @200 + (@201 - @202) * @203
MOV 250, A   ; store the result at address 250
As you can see, even simple expressions can get complicated and require several lines of code, which is quite tedious. Not to mention
that it’s quite difficult to understand from the code what it is actually doing, unless you explain it in the comments. But unfortunately,
there’s no way around it: it’s simply how the processor works. You can imagine that with more complex code, the number of instructions and the tedium would increase rapidly.
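If you want to check the instruction sequence for @250 = @200 + ( @201 - @202 ) * @203 by hand, the following Python lines mirror it one instruction at a time. The input values 2, 7, 3 and 2 are assumptions borrowed from the earlier fixed-number example so that the result is easy to verify; the variables a and b stand for the registers A and B.

```python
# Memory cells 200..203 hold the inputs; the values are arbitrary
# assumptions chosen so the result is easy to check by hand.
memory = {200: 2, 201: 7, 202: 3, 203: 2, 250: 0}

b = memory[202]      # MOV B, 202
a = memory[201]      # MOV A, 201
a = a - b            # SUB A, B
b = memory[203]      # MOV B, 203
a = a * b            # MUL A, B
b = memory[200]      # MOV B, 200
a = a + b            # ADD A, B
memory[250] = a      # MOV 250, A

print(memory[250])   # 10, the same as 2 + (7 - 3) * 2
```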
Another problem is that the values you’re working with (you can basically consider them variables) are identified just by numbers (addresses), which are not exactly easy to deal with.
You can assign names to the addresses, but that just postpones the problem a bit: you still need to say that address 204 will be known under the name MYVARIABLE, and if the number
of variables increases, it will quickly get problematic, although assigning exact addresses is usually automated. Not to mention that you might use an address
as a specific variable only for a short while and then reuse it as another, but then you also have to make sure that both (or even more) usages won’t collide.
Where the compiler comes in
Okay, so here’s a question: if you can split expressions and tasks into a series of simple instructions and if you can assign memory locations for variables,
why can’t it be done by a program? And that’s exactly what the compiler does. The programming language specifies what kind of statements you can write and how, and the compiler must support them.
So you can simply write the following code (it’s C-like code):
int a, b, c, d, e;
a = 2;
b = 7;
c = 3;
d = 2;
e = a + (b - c) * d;
When you compile this code, the compiler analyzes (parses) it and finds that you want five variables. You don’t have to decide which memory
cells will be assigned to these variables: it’s all handled for you. For example, the compiler might decide that the contents of the variable named “a” will be stored in the memory
cell with address 200, “b” in 201, and so on. The compiler keeps track of these assignments, so wherever you use a given variable, it makes sure that
the proper memory address is used. In reality, the process is often a bit more complex than this, but the principle remains the same.
In the example code, there are a few value assignments, starting with “a = 2;”. The compiler reads this, and according to the rules
of the programming language, it means that the variable is assigned the value 2. The compiler knows which memory address corresponds to the variable
“a”, so it generates the proper instructions for you. Remember, the processor doesn’t understand expressions like “a = 2;”, it can only
work with simple instructions. It’s the compiler’s job to convert these high level statements into instructions that the processor understands:
MOV 200, #2
MOV 201, #7
MOV 202, #3
MOV 203, #2
This is hopefully simple enough; each assignment corresponds to one processor instruction (I’ve written them in assembly language; the compiler will
of course generate the appropriate numeric codes for each instruction, the machine code). However, the last statement, which assigns the result
of the expression “a + (b - c) * d” to the variable “e”, can’t be done using a single instruction, as you’ve seen before. Still, all you need to do
is write this expression; the compiler reads it and splits it into a series of simple instructions itself, without you even knowing
about it (until now at least :-) ). For example, it might generate the following instructions:
MOV B, 202
MOV A, 201
SUB A, B
MOV B, 203
MUL A, B
MOV B, 200
ADD A, B
MOV 204, A   ; e is stored at address 204, right after a, b, c, and d
I think it needs no explaining that it’s much easier to simply write “e = a + (b - c) * d” than a series of instructions, and the same principle applies
to everything. A high level programming language allows you to express the actions to be performed in a clearer, easier, and more understandable manner, and the compiler
takes care of converting this into a series of simple instructions the processor understands, and handles all other details for you.
This is called abstraction and solves the program complexity problem: you can write and manage much more complex programs, because you don’t have
to bother with all the details: they’re taken care of for you automatically.
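How a program can do this splitting on its own is sketched below. This is a deliberately naive toy, not how a real compiler works: it borrows Python's ast module to parse the expression (Python syntax stands in for the C-like statement), and the function name compile_assignment, the scratch cells starting at address 300, and the emitted pseudo-assembly are all assumptions of this sketch. A real compiler would produce far shorter code, closer to the hand-written sequence shown earlier.

```python
import ast

MNEMONICS = {ast.Add: "ADD", ast.Sub: "SUB", ast.Mult: "MUL"}

def compile_assignment(source, addresses, temp_base=300):
    """Translate e.g. 'e = a + (b - c) * d' into pseudo-assembly lines."""
    statement = ast.parse(source).body[0]
    code = []
    next_temp = temp_base

    def emit(node):
        """Emit instructions that leave the value of `node` in register A."""
        nonlocal next_temp
        if isinstance(node, ast.Name):
            code.append(f"MOV A, {addresses[node.id]}")
        elif isinstance(node, ast.Constant):
            code.append(f"MOV A, #{node.value}")
        elif isinstance(node, ast.BinOp):
            emit(node.right)                    # right operand ends up in A
            temp, next_temp = next_temp, next_temp + 1
            code.append(f"MOV {temp}, A")       # park it in a scratch cell
            emit(node.left)                     # left operand ends up in A
            code.append(f"MOV B, {temp}")
            code.append(f"{MNEMONICS[type(node.op)]} A, B")
        else:
            raise NotImplementedError(type(node).__name__)

    emit(statement.value)
    code.append(f"MOV {addresses[statement.targets[0].id]}, A")
    return code

# The compiler's bookkeeping: which variable lives at which address.
addresses = {"a": 200, "b": 201, "c": 202, "d": 203, "e": 204}
for line in compile_assignment("e = a + (b - c) * d", addresses):
    print(line)
```

The sketch emits 14 instructions where a person needs only 8, which also hints at why compiler-generated code isn't always optimal.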
It might be important to mention how some basic programming constructs are handled, for example, the “if” statement. Let’s consider the following C code:
if(a > 2) {
    b = 3;
} else {
    b = 5;
    c = 8;
}
a = 8;
The processor doesn’t understand what the “if” statement is; however, it has a conditional jump instruction: it jumps to another instruction
if a condition is true. So this fragment will be translated to the following assembly code:
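MOV A, 200     ; load variable a into register A
LGR #2         ; test whether A is larger than 2
JZ ELSE        ; the test failed (result is zero): jump to the else branch
MOV 201, #3    ; b = 3
JMP END        ; skip the else branch
ELSE:
MOV 201, #5    ; b = 5
MOV 202, #8    ; c = 8
END:
MOV 200, #8    ; a = 8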
The same goes for loops, except that the jump instruction jumps back in the sequence, so a portion of the code gets repeated.
When programming in a high level language, you don’t have to care how exactly the loop is constructed using jump instructions
(the processor can actually have several of them, each with a different condition). The compiler handles this arrangement for you: it chooses
the appropriate jump instruction and does any additional work needed so that the conditional jump can be performed. In this example, we assumed that the processor
doesn’t have a conditional jump that compares which number is larger by itself; instead, it has a separate instruction to test this, so the compiler first uses that instruction
to compare the two numbers and then uses the result with the appropriate conditional jump instruction.
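To illustrate how a loop is built from a test and jumps, here is a rough Python model of the C-like loop "i = 0; while (i < limit) i = i + 1;". The four numbered "instructions", the variable pc (the position of the next instruction) and the counter executed are all inventions of this sketch; they only mimic the compare-then-jump arrangement described above.

```python
def count_up(limit):
    """Model of 'i = 0; while (i < limit) i = i + 1;' using only jumps."""
    i = 0
    pc = 0                       # position of the instruction to execute next
    executed = 0                 # how many instructions were executed
    while pc < 4:
        executed += 1
        if pc == 0:              # compare: is i smaller than limit?
            flag = i < limit
            pc = 1
        elif pc == 1:            # conditional jump: leave the loop if not
            pc = 2 if flag else 4
        elif pc == 2:            # loop body: i = i + 1
            i += 1
            pc = 3
        elif pc == 3:            # unconditional jump back to the test
            pc = 0
    return i, executed

print(count_up(3))  # (3, 14)
```

Each pass through the loop costs four instructions here, plus a final compare and jump to leave it, which is why three iterations add up to 14.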
Need for the assembler nowadays
You might be wondering: if the compiler can handle all these things for you, what’s the point of knowing assembler today? There are several
answers: first, the compiler might not always generate optimal instructions: some actions can be done using fewer instructions in a less standard way
that the compiler doesn’t know of. This is a problem when you need to squeeze every bit of performance; in such cases, you can write instructions
for the performance critical parts yourself, ensuring that it works exactly the way you want it to.
This can be an even bigger problem when programming for small embedded devices with limited resources, where you simply can’t afford
any overhead (more instructions than are actually needed and suboptimal ways of solving things). Another reason is the limitations of the compiler: you’re limited to what
it supports. If you want to use some features of the processor that the compiler can’t generate instructions for (for example, new special instructions), you’ll need to write them yourself.
Knowledge of assembly is an absolute must if you want to analyze existing software, hack it (alter its behavior), or crack it. As you already know,
a program is composed of a series of simple instructions: numeric codes that represent various actions. It’s easy to disassemble an existing program when
you don’t have its source code: the numeric codes are simply replaced with their appropriate names, resulting in assembly language code,
so if you want to analyze and modify it, you must know assembly language.
It’s much more difficult to decompile the program, that is, to convert it back to its high level source code: this needs extensive analysis of the instructions
and their structure, and the resulting source code will still be very far from the original. Important things like the names of variables and functions,
and comments are lost during compilation (not necessarily for all languages), because they’re simply not needed by the processor: all it needs is a memory address, which is just a number.
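The disassembly step really is little more than a table lookup, as this sketch suggests. The instruction table is hypothetical (it reuses the numbers 35, 48 and 12 from the earlier examples and assumes two operands per instruction), and the name disassemble is made up for the illustration.

```python
# Hypothetical instruction table: opcode -> (name, number of operands).
INSTRUCTIONS = {35: ("MOV", 2), 48: ("ADD", 2), 12: ("OR", 2)}

def disassemble(machine_code):
    """Turn a flat list of numeric codes back into assembly text."""
    lines = []
    position = 0
    while position < len(machine_code):
        name, operand_count = INSTRUCTIONS[machine_code[position]]
        operands = machine_code[position + 1 : position + 1 + operand_count]
        lines.append(f"{name} " + ", ".join(str(value) for value in operands))
        position += 1 + operand_count
    return lines

print(disassemble([35, 1258, 1000, 48, 1258, 1001]))
# ['MOV 1258, 1000', 'ADD 1258, 1001']
```

Notice that the output contains only addresses: any variable names the original source used are gone, which is part of what makes decompiling so much harder.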
The second major problem of assembly programming is portability: to transfer your program to another platform with a different processor architecture,
you need to rewrite it completely in the assembly language for the target platform. Using a high level language solves this problem quite easily.
The code in the high level language is usually platform independent: it’s the compiler that generates the appropriate instructions, not you.
So if you want your code to run on a PC with x86 architecture, you give the sources to a compiler that generates instructions for x86.
If you want to make binary (machine code) for a mobile device with ARM architecture, you give the same source code to an ARM compiler and it will
generate instructions for that architecture, without you needing to do anything.
This of course requires that a compiler exists for the given architecture; if there is no compiler for a particular architecture,
then you’ll be left with assembly programming, unless you write the compiler yourself.
So far, we only dealt with something that’s called native code: languages that create native code result in instructions that are directly executed
by a given processor. If you create a binary file (containing these raw instructions for a given processor), it will only work on a given architecture;
if you want to use it on another, you’ll need to compile it for that architecture: generate the appropriate machine code that the processor of a given architecture understands.
However, there exists something that’s called an interpreted language, which makes portability much easier. I will mention this only briefly,
since the topic can be elaborated into a long article on its own. With an interpreted programming language, the source is left as it is, or it is compiled
into a “universal” assembly code (that’s what happens with Java – the resulting universal assembly code is called bytecode). If you want to run such a program,
you need an interpreter: it’s a program in native code – something that the processor understands directly, that can read this universal code and translate
it into instructions for the target architecture on the fly – as the program is run.
The advantage of this approach is easy portability, safety, and flexibility: you can write your program once and then it will run on every
architecture where the interpreter for the given language is available, without having to change a single thing about your program. Because the interpreter
is in control of what the program can do, safety is also increased, because it can choose to block certain actions, which is much more difficult with native code,
if not impossible. You can also test and modify your program quickly without having to compile it every time.
One of the major downsides is reduced speed: for example, in case of an assignment like “a = 5”, an interpreted language might need the processor to execute even
a few dozen instructions, which will read this statement, decide what it means, and then finally do it, while with a compiled language (resulting in native code),
a single instruction can often handle this task.
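The extra work an interpreter does for something as small as “a = 5” can be hinted at with a toy interpreter. Everything here is an assumption made for illustration: the function interpret handles only simple assignments of whole numbers, and the steps it counts are coarse; on a real machine, each of them would itself take many processor instructions.

```python
def interpret(source):
    """Run statements like 'a = 5' and count the interpreter's own steps."""
    variables = {}
    steps = 0
    for line in source.splitlines():
        steps += 1                      # read the statement
        name, sign, value = line.partition("=")
        steps += 1                      # split it into its parts
        if not sign:
            raise SyntaxError(f"expected an assignment: {line!r}")
        steps += 1                      # decide what the statement means
        variables[name.strip()] = int(value)
        steps += 1                      # finally perform the assignment
    return variables, steps

print(interpret("a = 5\nb = 7"))  # ({'a': 5, 'b': 7}, 8)
```

Only the last of the four steps does what a compiled program would do with a single MOV-like instruction; the rest is overhead paid every time the statement runs.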
If you got this far: congratulations! I hope that I helped to uncover some of the secrets of the workings of a processor and how it’s related to both
assembly (low level) and high level programming. While this doesn’t teach you how to program in assembly and how to hack/crack/analyze existing programs,
it hopefully gives you the required knowledge to start learning about these things and know what to expect.
Please note that a lot of things described were simplified for the sake of easy understanding. Many of the topics covered here would be enough to fill
a few books and I’m not currently planning to write any, at least not about these topics :-)
I’ll be grateful if you show any appreciation for this article and provide any feedback, whether it’s about the comprehensibility of the article or some mistakes
on my side (mostly grammar and spelling; please ignore deliberate simplifications of facts).
Thanks for reading.