You might be wondering how your computer works: what happens when you write a program and then compile it? What is assembler, and what is the basic principle of programming in it? This tutorial should clarify that for you. It’s not intended to teach you assembly programming itself, but rather to give you the basis you need to understand what’s actually going on under the hood. It also deliberately simplifies some things, so you’re not overwhelmed by additional information. However, I assume that you have some knowledge of high-level programming (C/C++, Visual Basic, Python, Pascal, Java and tons more…).
I also hope that more experienced readers will forgive me for simplifying a lot of things here; my intention was to make the explanation clear and simple for someone who doesn’t have a clue about this topic. Note: I will be very grateful for any feedback on this. It’s difficult to write explanations for people who don’t know much about the topic, so I might have omitted some important things or not clarified something enough. If something is unclear, don’t hesitate to ask.
How does a processor (CPU) work?
You might know that the CPU (Central Processing Unit, or simply processor) is the “brain” of the computer, controlling all other parts of the computer and performing various calculations and operations with data. But how does it achieve that?
A processor is a circuit designed to perform single instructions, or rather a whole series of them, one by one. The instructions to be executed are stored in memory; in a PC, it’s the operating memory (RAM). Imagine the memory as a large grid of cells. Each cell can store a small number, and each cell has its own unique number: its address. The processor tells the memory the address of a cell, and the memory responds with the value stored in that cell (a number, but it can represent anything: letters, graphics, sound… everything can be converted to numerical values). Of course, the processor can also tell the memory to store a new number in a given cell.
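If you know Python, the grid of cells can be pictured roughly like this. It’s only a toy model: the memory size of 16 cells and the helper names read and write are made up for this sketch.

```python
# A toy model of memory: a row of numbered cells, each holding a small number.
MEMORY_SIZE = 16  # assumed size; real memories have billions of cells
memory = [0] * MEMORY_SIZE

def read(address):
    """Return the number stored in the cell with the given address."""
    return memory[address]

def write(address, value):
    """Store a new number in the cell with the given address."""
    memory[address] = value

write(3, 72)
write(4, 105)
print(read(3), read(4))              # the raw numbers: 72 105
print(chr(read(3)) + chr(read(4)))   # the same numbers read as letters: Hi
```

The last line shows the point made above: the cells only hold numbers, and it’s up to the program to decide whether 72 means a quantity or the letter H.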
Instructions themselves are basically numbers too: each simple operation is assigned its own unique numeric code. The processor retrieves this number and decides what to do: for example, number 35 will cause the processor to copy data from one memory cell to another, number 48 can tell it to add two numbers together, and number 12 can tell it to perform a simple logical operation called OR.
Which operations are assigned to which numbers is decided by the engineers who design the given processor, or better said, the processor architecture: they decide what numeric codes will be assigned to the various operations (and of course they decide other aspects of the processor, but that’s not relevant now). This set of rules is called the architecture. This way, manufacturers can create various processors that support a given architecture: they can differ in speed, power consumption and price, but they all understand the same codes as the same instructions.
Once the processor completes the action determined by the code (the instruction), it simply requests the following one and repeats the whole process. Sometimes it can also decide to jump to a different place in the memory, for example to some subroutine (function), or to jump a few cells back to a previous instruction and execute the same sequence again, basically creating a loop. The sequence of numeric codes that forms the program is called machine code.
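The cycle described above can be sketched in a few lines of Python. Treat it as a toy model under some assumptions: the codes 35, 48 and 12 are the made-up examples from the text, the codes 0 (stop) and 7 (jump) are invented here, and each instruction is followed by extra numbers saying which cells it works with (these are the operands explained later in this article).

```python
COPY, ADD, OR = 35, 48, 12   # example codes from the text
STOP, JUMP = 0, 7            # assumed codes, invented for this sketch

def run(memory):
    """Fetch one instruction after another from memory and execute it."""
    pc = 0  # address of the next instruction (the "program counter")
    while True:
        opcode = memory[pc]                  # fetch the numeric code
        if opcode == STOP:                   # decide what it means and do it
            return
        if opcode == JUMP:
            pc = memory[pc + 1]              # continue at a different place
            continue
        target, source = memory[pc + 1], memory[pc + 2]
        if opcode == COPY:
            memory[target] = memory[source]
        elif opcode == ADD:
            memory[target] += memory[source]
        elif opcode == OR:
            memory[target] |= memory[source]
        else:
            raise ValueError(f"unknown opcode {opcode} at address {pc}")
        pc += 3                              # request the following instruction

# The program sits in cells 0..11, the data in cells 20..22.
memory = [0] * 32
memory[0:12] = [COPY, 22, 20,   # cell 22 = cell 20
                ADD, 22, 21,    # cell 22 = cell 22 + cell 21
                JUMP, 11,       # jump over the next instruction
                ADD, 22, 21,    # skipped, never executed
                STOP]
memory[20], memory[21] = 6, 4
run(memory)
print(memory[22])  # 10
```

Notice that the program and its data live in the same memory; the processor only tells them apart by where it happens to be reading from.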
What are instructions and how are they used?
As I already mentioned, instructions are very simple tasks that the processor can perform, each one having its own unique code. The circuit that makes up the processor is designed to perform the given operations according to the codes it loads from memory. The numeric code is often called an opcode.
The operations that the instructions perform are usually very simple. Only by writing a sequence of these simple operations can you make the processor perform some specific task. However, writing a sequence of numeric codes is quite tedious (though that’s how programming was done long ago), so assembly language was created. It assigns each opcode (the numeric code) a symbol: a name that roughly describes what the instruction does.
Given the previous examples, where number 35 makes the processor move data from one memory cell to another, we can give this instruction the name MOV, which is short for MOVe. Number 48, the instruction that adds two numbers together, gets the name ADD, and 12, which performs the logical OR operation, gets the name ORL.
A programmer writes a sequence of instructions (simple operations that the processor can perform) using these names, which are much easier to read than bare numeric codes. Then they run a tool called an assembler (the term “assembler” is often used for the programming language too, though technically it means the tool), which converts these symbols to the appropriate numeric codes that can be executed by the processor.
However, in most cases the instruction alone isn’t sufficient. For example, if you want to add two numbers together, you obviously need to specify them. The same goes for logical operations, or for moving data from one memory cell to another: you need to specify the addresses of the source and target cells. This is done by adding so-called operands to the instruction: one or more values (numbers) that provide the additional information the instruction needs to perform the given operation. The operands are stored in the memory too, along with the instruction opcodes.
For example, if you want to move data from the location with address 1000 to the location with address 1258, you can write:
MOV 1258, 1000
The first number is the target address and the second is the source (in assembly, you usually write the target first and the source second; it’s quite common). The assembler (the tool that converts the source to machine code) stores these operands too. When the processor loads the instruction opcode, the opcode tells it that it must move data from one location to another. Of course, it needs to know from which location and to what destination, so it loads the operand values from the memory too (they can be at the addresses right after the instruction opcode), and once it has all the necessary data, it performs the operation.
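To make the job of the assembler tool less abstract, here is a minimal sketch of one in Python. It assumes the made-up opcodes from the text (35, 48, 12) and only understands plain addresses as operands; the function name assemble is my own choice, and a real assembler does much more.

```python
OPCODES = {"MOV": 35, "ADD": 48, "ORL": 12}  # example codes from the text

def assemble(source):
    """Translate lines such as 'MOV 1258, 1000' into a flat list of numbers."""
    machine_code = []
    for line in source.splitlines():
        line = line.split(";")[0].strip()   # drop comments and surrounding spaces
        if not line:
            continue                        # nothing left on this line
        name, _, rest = line.partition(" ")
        machine_code.append(OPCODES[name.upper()])   # the opcode comes first
        # The operands are stored right after the opcode.
        machine_code.extend(int(op) for op in rest.split(",") if op.strip())
    return machine_code

print(assemble("MOV 1258, 1000\nADD 1258, 1001"))
# [35, 1258, 1000, 48, 1258, 1001]
```

The resulting list is exactly the kind of number sequence the processor reads: an opcode, its operands, then the next opcode.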
Let’s look at a short piece of code and describe what it does. Please note that it’s pseudocode: it’s not written for a specific architecture or language, and various symbols can differ. The principle, however, remains the same.
MOV A, 2000
LOOP:
ADD A, #5
JNL A, #200, LOOP
MOV 2001, A
The first instruction moves the number from the memory cell with address 2000 to register A: a temporary location where the processor can store numbers. It can have many registers like this. The second line contains something called a label: it’s not an instruction, it’s simply a mark in the source code that we may use later (you’ll see how).
On the third line, there’s an ADD instruction, which adds two numbers together. The operands are register A and the number 5 (the # mark tells the assembler that it’s the number five, not the number in the memory cell with address 5). And remember? We stored the value from memory location 2000 in register A, so whatever the value is, this instruction will add 5 to it.
The following instruction is called a conditional jump: the processor tests some condition and, based on the result, either jumps or not. In this case, the condition is whether a given number is not larger than another one (JNL = Jump (if) Not Larger). The number in register A is compared with the number 200 (again, the # mark means that it’s a direct number, not a number from the memory location with address 200). If the number in A is not larger than 200 (the condition is true), the processor jumps to the instruction specified by the third operand, and this is where our label comes in: the assembler tool (the translator) replaces “LOOP” with the memory address of the instruction right after this mark.
So if the number is not larger, the processor jumps back to the ADD instruction, again adds 5 to the number in A (which is already larger from the previous calculation), and then gets back to the JNL instruction. If the number is still not larger than 200, it jumps back again; however, once it is larger, the condition isn’t true anymore, so no jump occurs and the following instruction gets executed. This one moves the value from register A to the memory cell with address 2001, basically storing the resulting number there. It’s important to add that the memory cell with address 2000 still contains the original value: we created a copy of it in register A, we didn’t modify the original.
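If it helps, here is roughly the same logic written in Python, with each line annotated by the pseudo-instruction it stands for. The function name run_example and the starting value 180 are only assumptions for the sake of the demonstration.

```python
def run_example(memory):
    """Mimic the short pseudocode program; register A is a plain variable."""
    a = memory[2000]          # MOV A, 2000
    while True:               # LOOP:
        a += 5                # ADD A, #5
        if a > 200:           # JNL A, #200, LOOP: no jump once A is larger
            break
    memory[2001] = a          # MOV 2001, A
    return memory

memory = run_example({2000: 180, 2001: 0})
print(memory[2001])  # 205
print(memory[2000])  # 180, the original value is untouched
```

Starting from 180, register A goes through 185, 190, 195 and 200 (which is still not larger than 200, so the jump happens once more) and ends at 205.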
This piece of code doesn’t really have much of a purpose; it’s simply a demonstration, and for a hypothetical architecture at that. Real-world programs are composed of hundreds, thousands, even hundreds of thousands of instructions.
Architectures and assembly languages
I already mentioned the term architecture: it describes the features of certain processors. It describes what simple operations the processor can perform (some may be able to perform only a dozen of them, some hundreds) and what opcode each instruction has. It also specifies a lot of other things: what registers it has and how many (small storage places directly in the processor itself, where the programmer can temporarily store any data), how it communicates with other chips and devices, like the memory, chipset or graphics card, and other features of its function.
This means that every architecture has its own assembly language, because the instructions it has are different. Thus, assembly language (or simply assembler, though that’s technically not correct) is not just one language, it’s a whole family of languages. They’re all quite similar, but differ in what instructions there are, what the operands are, and some other features specific to the processor. However, the basic principle is the same among them (unless it’s one of my WPU experimental processors :-) ), so if you understand the principle of one assembler for a given architecture, learning others will be a cinch.
So it’s important to understand: assembly language is always meant to be used with a specific architecture. For example, most personal computers use an architecture called x86 or, in the case of 64-bit systems and applications, its extension x64, so if you wanted to program for this architecture, you would use x86 assembly language. Many mobile devices use an architecture called ARM, so if you programmed these processors in assembler, you would use ARM assembly language. If you wanted to program some old console, like the Sega Genesis, you would use 68000 assembly language, because it uses the Motorola 68000 processor, and so on. There are hundreds of architectures for various purposes.
Also, like I said, there are a lot of processors on the market with varying speed, price and power consumption, but many of them support the same architecture. Programs written in the given assembly language will work on all of them; they’ll just run faster or slower.
However, programs created for one architecture generally won’t work on another one, because the processor is simply different: the opcodes are different, the supported instructions and other features are different, so the machine code (a program, a set of numeric codes) for the x86 architecture would be gibberish to an ARM processor. This also means that when you write a program in assembler for one processor architecture, it won’t work on another one: you would need to rewrite it completely in the assembly language of the other architecture.
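A tiny Python sketch shows why the same numbers mean something else elsewhere. Both opcode tables are invented: the first one reuses the example codes from this article, the second one is a completely made-up second architecture.

```python
ARCH_ONE = {35: "MOV", 48: "ADD", 12: "ORL"}   # example codes from the text
ARCH_TWO = {35: "JMP", 48: "SUB", 12: "MOV"}   # a second, invented architecture

machine_code = [35, 48, 12]   # a "program" written for the first architecture

def decode(codes, architecture):
    """Show how a processor of the given architecture understands the codes."""
    return [architecture.get(code, "???") for code in codes]

print(decode(machine_code, ARCH_ONE))  # ['MOV', 'ADD', 'ORL']
print(decode(machine_code, ARCH_TWO))  # ['JMP', 'SUB', 'MOV']
```

The numbers are identical, but the second processor would do something entirely different with them.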
Need for high level programming languages and compilers
Two big, pressing issues led to the high-level programming languages that you most probably already know. First, creating complex programs requires dividing them into a lot of simple instructions, so to achieve more complex actions, you need to write a lot of instructions: this is both tedious and time-consuming, not to mention that the result is more difficult to understand. The second issue was already mentioned: a program written for one architecture won’t work on another one without a complete rewrite. High-level languages solve both of these problems.
Complexity of programs
Let’s imagine that you want to perform a more complex calculation, for example A = 2 + ( 7 - 3 ) * 2. However, the processor doesn’t support anything like that; it can only perform very simple operations. So if you want to write the code in assembly, you need to split this calculation into the simple operations that the processor supports. For a mathematical expression, this is done the same way you would do it if you were calculating the expression by hand in math class: first you calculate the value in the parentheses (subtract 3 from 7), then multiply the result by 2, and finally add 2 to it. The result will be stored in register A. So the assembly code would look like this (the “;” starts a comment: not part of the code, just a remark on what it does):
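MOV A, #7 ; load the number 7 into register A
SUB A, #3 ; subtract 3, A is now 4
MUL A, #2 ; multiply by 2, A is now 8
ADD A, #2 ; add 2, A is now 10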
Of course, you’ll rarely use fixed numbers in a calculation; you’ll rather calculate with values from the memory, so let’s complicate the whole process with the following change to the equation: @250 = @200 + ( @201 - @202 ) * @203. Here @(number) means “at address”: the number stored in the memory cell at the given address. To complete the calculation, we need to load the values from the memory first, because the processor doesn’t allow calculations with numbers at given memory addresses directly; however, it provides another register, B.
MOV B, 202
MOV A, 201
SUB A, B
MOV B, 203
MUL A, B
MOV B, 200
ADD A, B
MOV 250, A
As you can see, even a simple expression can get complicated and require several lines of code, which is quite tedious. Not to mention that it’s quite difficult to understand from the code what it’s actually doing, unless you explain it in comments. But unfortunately, there’s no way around it; it’s simply how the processor works. You can imagine that with more complex code, the number of instructions and the tediousness would increase rapidly.
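To check that the eight instructions really compute the expression, you can replay them step by step in Python. The values placed in cells 200 to 203 are just sample numbers (the ones from the first version of the expression), and the variables a and b stand in for registers A and B.

```python
memory = {200: 2, 201: 7, 202: 3, 203: 2, 250: 0}   # sample values

b = memory[202]      # MOV B, 202
a = memory[201]      # MOV A, 201
a = a - b            # SUB A, B
b = memory[203]      # MOV B, 203
a = a * b            # MUL A, B
b = memory[200]      # MOV B, 200
a = a + b            # ADD A, B
memory[250] = a      # MOV 250, A

print(memory[250])   # 10, the same as 2 + (7 - 3) * 2
```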
Another problem is that the values you’re working with (you can basically consider them variables) are just numbers (addresses), which are not exactly easy to deal with. You can assign names to the addresses, but that just delays the problem a bit: you still need to say that address 204 will be known under the name MYVARIABLE, and as the number of variables increases, it quickly gets problematic, although assigning exact addresses is usually automated. Not to mention that you might use one address as a specific variable only for a short while and then reuse it as another, but then you also have to make sure that both (or even more) usages won’t collide.
Where the compiler comes in
Okay, so here’s a question: if you can split an expression or task into a series of simple instructions, and if you can assign variables to memory locations, why can’t it be done by a program? And that’s exactly what a compiler does. The programming language specifies what kinds of statements you can write and how, and the compiler must support them.
So you can simply write the following code (it’s C-like code):
int a, b, c, d, e;
a = 2;
b = 7;
c = 3;
d = 2;
e = a + (b - c) * d;
When you compile this code, the compiler will analyze (parse) it and find that you want five variables. You don’t have to decide which memory cells will be assigned to these variables: it’s all handled for you. For example, the compiler might decide that the contents of the variable named “a” will be stored in the memory cell with address 200, “b” in 201 and so on. The compiler keeps track of these assignments, so wherever you use a given variable, it makes sure that the proper memory address is used. In reality, the process is often a bit more complex than this, but the principle remains the same.
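The bookkeeping the compiler does here can be pictured as a simple table from names to addresses. The following Python sketch is a simplification with made-up names (allocate, first_address); real compilers use registers, the stack and other tricks as well.

```python
def allocate(variable_names, first_address=200):
    """Give every declared variable its own memory cell, in declaration order."""
    table = {}
    for offset, name in enumerate(variable_names):
        if name in table:
            raise ValueError(f"variable {name!r} declared twice")
        table[name] = first_address + offset
    return table

addresses = allocate(["a", "b", "c", "d", "e"])
print(addresses)
# {'a': 200, 'b': 201, 'c': 202, 'd': 203, 'e': 204}
```

Whenever the compiler later sees the name “b” in your code, it looks it up in this table and uses 201 in the generated instruction.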
In the example code, there are a few value assignments, starting with “a = 2;”. The compiler reads this, and according to the rules of the programming language, it means that the variable is assigned the value 2. The compiler knows which memory address corresponds to the variable named “a”, so it generates the proper instructions for you. Remember, the processor doesn’t understand an expression like “a = 2;”, it can only work with simple instructions. It’s the compiler’s job to convert these high-level statements into instructions that the processor understands:
MOV 200, #2
MOV 201, #7
MOV 202, #3
MOV 203, #2
This is hopefully simple enough: each assignment corresponds to one processor instruction (I’ve written them in assembly language; the compiler will of course generate the appropriate numeric codes for each instruction, the machine code). However, the last statement, which assigns the result of the expression “a + (b - c) * d” to the variable “e”, can’t be done using a single instruction, as you’ve seen before. All you need to do, though, is write this expression; the compiler will read it and split it into a series of simple instructions itself, without you even knowing about it (until now at least :-) ). For example, it might generate the following instructions:
MOV B, 202
MOV A, 201
SUB A, B
MOV B, 203
MUL A, B
MOV B, 200
ADD A, B
MOV 204, A
I think it needs no explaining that it’s much easier to simply write “e = a + (b - c) * d” instead of a series of instructions, and the same principle applies to everything. A high-level programming language allows you to express the actions to be performed in a clearer, easier and more understandable manner, and the compiler takes care of converting this into the series of simple instructions the processor understands and of handling all the other details for you. This is called abstraction, and it solves the program complexity problem: you can write and manage much more complex programs because you don’t have to bother with all the details; they’re taken care of for you automatically.
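For the curious, here is a very small sketch of how such splitting can be automated. It borrows Python’s own parser (the standard ast module) to read the expression and then emits pseudo-assembly. Note the assumptions: it uses numbered registers R0, R1, R2… instead of the A and B used in this article, so the instruction order differs a little from the listing above, and the function name compile_assignment is made up.

```python
import ast

OPS = {ast.Add: "ADD", ast.Sub: "SUB", ast.Mult: "MUL"}

def compile_assignment(target, expression, addresses):
    """Turn 'target = expression' into a list of simple pseudo-instructions."""
    code = []

    def gen(node, reg):
        # Emit code that leaves the value of `node` in register R<reg>.
        if isinstance(node, ast.Name):
            code.append(f"MOV R{reg}, {addresses[node.id]}")
        elif isinstance(node, ast.Constant):
            code.append(f"MOV R{reg}, #{node.value}")
        elif isinstance(node, ast.BinOp) and type(node.op) in OPS:
            gen(node.left, reg)        # left side goes to R<reg>
            gen(node.right, reg + 1)   # right side goes to the next register
            code.append(f"{OPS[type(node.op)]} R{reg}, R{reg + 1}")
        else:
            raise ValueError("unsupported expression")

    gen(ast.parse(expression, mode="eval").body, 0)
    code.append(f"MOV {addresses[target]}, R0")   # store the result
    return code

addresses = {"a": 200, "b": 201, "c": 202, "d": 203, "e": 204}
for line in compile_assignment("e", "a + (b - c) * d", addresses):
    print(line)
```

For the expression above this prints MOV R0, 200, then MOV R1, 201, MOV R2, 202, SUB R1, R2, MOV R2, 203, MUL R1, R2, ADD R0, R1 and finally MOV 204, R0. A real compiler is far smarter about which registers it uses, but the principle of walking through the expression and emitting one simple instruction at a time is the same.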
It might be important to mention how some basic programming constructs are handled, for example the “if” statement. Let’s consider the following C code:
if(a > 2) {
    b = 3;
} else {
    b = 5;
    c = 8;
}
a = 8;
The processor doesn’t understand what an “if” statement is; however, it has a conditional jump instruction, which jumps to another instruction if some condition is true. So this fragment will be translated to the following assembly code:
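MOV A, 200 ; load variable a into register A
LGR #2 ; test whether A is larger than 2
JZ ELSE ; if it is not, jump to the else part
MOV 201, #3 ; b = 3
JMP END ; skip over the else part
ELSE:
MOV 201, #5 ; b = 5
MOV 202, #8 ; c = 8
END:
MOV 200, #8 ; a = 8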
The same goes for loops, except that the jump instruction jumps back in the sequence, so that portion of the code gets repeated. When programming in a high-level language, you don’t have to care how exactly the loop is constructed from jump instructions. A processor can actually have several of them, each with a different condition; the compiler handles this arrangement for you, chooses the appropriate jump instruction, and does any additional work needed so the conditional jump can be performed. In this example, we assumed that the processor doesn’t have a conditional jump instruction that compares which number is larger by itself. Instead, it has a separate instruction for this test, so the compiler first used the instruction that compares the two numbers and then used its result with the appropriate conditional jump instruction.
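You can think of the compiler as filling in fixed patterns of labels and jumps. The two Python functions below sketch such patterns for an if/else and for a loop. They are illustrative only: the function names are made up, the labels are fixed (a real compiler generates a fresh label for every statement), and I assume that the test leaves a zero result when the condition is false, so that JZ jumps in that case.

```python
def lower_if(test, then_body, else_body):
    """Lay out an if/else as straight-line code with two jumps."""
    return (test + ["JZ ELSE"]
            + then_body + ["JMP END", "ELSE:"]
            + else_body + ["END:"])

def lower_while(test, body):
    """Lay out a loop: the same idea, but the last jump goes backwards."""
    return ["LOOP:"] + test + ["JZ END"] + body + ["JMP LOOP", "END:"]

if_code = lower_if(["MOV A, 200", "LGR #2"],        # if (a > 2)
                   ["MOV 201, #3"],                 #     b = 3;
                   ["MOV 201, #5", "MOV 202, #8"])  # else b = 5; c = 8;
for line in if_code:
    print(line)
```

The printed listing has the same shape as the assembly code above, up to the END label. With lower_while, the test is placed right after the LOOP label and the body ends with a jump back to it.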
Need for the assembler nowadays
You might be wondering: if the compiler can handle all these things for you, what’s the point of knowing assembler today? There are several answers. First, the compiler might not always generate optimal instructions: some actions can be done using fewer instructions in a less standard way that the compiler doesn’t know of. This is a problem when you need to squeeze out every bit of performance; in such a case, you can write the instructions for the performance-critical part yourself, ensuring that it works exactly the way you want it to.
This might be an even bigger problem when programming for small embedded devices with limited resources, where you simply can’t afford any overhead (more instructions than are actually needed and suboptimal ways of solving things). Another reason is the limitations of the compiler: you’re limited to what it supports. If you want to use some feature of the processor that the compiler can’t generate instructions for, for example new special instructions, you’ll need to write them yourself.
Knowledge of assembly is an absolute must if you want to analyze existing software, hack it (alter its behavior) or crack it. As you already know, a program is composed of a series of simple instructions: numeric codes that represent various actions. It’s easy to disassemble an existing program when you don’t have its source code: the numeric codes are simply replaced with their corresponding names, resulting in assembly language code. So if you want to analyze and modify it, you must know assembly language.
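Disassembling really is that mechanical. Here is a minimal Python sketch, assuming the made-up opcodes used throughout this article and assuming that every instruction has exactly two operands (real architectures are not that regular, and the name disassemble is my own choice).

```python
NAMES = {35: "MOV", 48: "ADD", 12: "ORL"}   # example codes from the text
OPERAND_COUNT = 2                           # assumption: two operands each

def disassemble(machine_code):
    """Replace numeric codes with their names, one instruction per line."""
    lines = []
    step = 1 + OPERAND_COUNT
    for i in range(0, len(machine_code), step):
        opcode, *operands = machine_code[i:i + step]
        name = NAMES.get(opcode, f"??? ({opcode})")
        lines.append(f"{name} {', '.join(str(op) for op in operands)}")
    return lines

print(disassemble([35, 1258, 1000, 48, 1258, 1001]))
# ['MOV 1258, 1000', 'ADD 1258, 1001']
```

What you get back are only names and addresses; nothing in the numbers says what the cells 1258 or 1000 were called in the original source.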
It’s much more difficult to decompile the program, that is, to convert it back to its high-level source code. This needs extensive analysis of the instructions and their structure, and the resulting source code will still be very far from the original: important things like the names of variables and functions, and the comments, are lost during compilation (not necessarily for all languages), because they’re simply not needed by the processor. All it needs is a memory address, which is just a number.
The second major problem of assembly programming is portability: to transfer your program to another platform with a different processor architecture, you need to rewrite it completely in the assembly language of the target platform. Using a high-level language solves this problem quite easily. Code in a high-level language is usually platform independent: it’s the compiler that generates the appropriate instructions, not you.
So if you want your code to run on a PC with the x86 architecture, you give the sources to a compiler that generates instructions for x86. If you want to make a binary (machine code) for a mobile device with the ARM architecture, you give the same source code to an ARM compiler and it generates instructions for that architecture, without you needing to do anything.
This of course requires that a compiler exists for the given architecture. If there is no compiler for some particular architecture, you’ll be left with assembly programming, unless you write the compiler yourself.
So far, we have only dealt with something called native code: languages that produce native code result in instructions that are directly executed by the given processor. If you create a binary file (containing these raw instructions for the given processor), it will only work on that architecture. If you want to use it on another one, you’ll need to compile it for that architecture, that is, generate the machine code that a processor of that architecture understands.
However, there is also something called an interpreted language, which makes portability much easier. I will mention these only briefly, since the topic could be elaborated into a long article of its own. With an interpreted programming language, the source is left as it is, or it is compiled into some “universal” assembly code (that’s what happens with Java; the resulting universal code is called bytecode). If you want to run such a program, you need an interpreter: a program in native code (something the processor understands directly) that can read this universal code and translate it into instructions for the target architecture on the fly, as the program runs.
The advantages of this approach are very easy portability, safety and flexibility. You can write your program once and it will run on every architecture for which an interpreter for the given language is available, without having to change a single thing about your program. Because the interpreter is in control of what the program can do, safety is also increased: it can choose to block certain actions, which is much more difficult with native code, if not impossible. You can also test and modify your program quickly without having to compile it every time.
One of the major downsides is reduced speed: for example, in the case of an assignment like “a = 5”, an interpreted language might need the processor to execute even a few dozen instructions, which read this statement, decide what it means and then finally do it, while with a compiled language (resulting in native code) a single instruction can often handle this task.
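To get a feeling for where the extra work comes from, here is a toy interpreter in Python for a made-up mini-language with statements like “a = 5” or “b = a + 2”. It’s only a sketch: the function name interpret is invented, it supports just + and -, and it evaluates strictly from left to right.

```python
def interpret(source):
    """Run statements like 'a = 5' or 'b = a + 2' straight from the text."""
    variables = {}

    def value_of(token):
        # A token is either the name of a known variable or a plain number.
        return variables[token] if token in variables else int(token)

    for line in source.splitlines():
        if not line.strip():
            continue
        # Read the statement and decide what it means...
        name, expression = (part.strip() for part in line.split("=", 1))
        tokens = expression.split()
        result = value_of(tokens[0])
        for operator, token in zip(tokens[1::2], tokens[2::2]):
            if operator == "+":
                result += value_of(token)
            elif operator == "-":
                result -= value_of(token)
            else:
                raise ValueError(f"unsupported operator {operator!r}")
        # ...and only then finally do it.
        variables[name] = result
    return variables

print(interpret("a = 5\nb = a + 2"))
# {'a': 5, 'b': 7}
```

Every time a statement runs, the interpreter has to split the text, recognize names and numbers and look things up, and each of those steps costs many processor instructions. Compiled native code did all of that once, at compile time.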
If you got this far: congratulations! I hope that I helped to uncover some of the secrets of how the processor works and how it’s related to both assembly (low-level) and high-level programming. While this doesn’t teach you how to program in assembly or how to hack/crack/analyze existing programs, it hopefully gives you the knowledge required to start learning about these things and an idea of what to expect.
Please note that a lot of the things described were simplified for the sake of easy understanding. Many of the topics covered here would be enough to fill a few books, and I’m not currently planning to write any, at least not about these topics :-)
I’ll be grateful if you show any appreciation for this article and provide any feedback, whether it’s about the comprehensibility of the article or some mistake on my side (mostly grammar and spelling; please ignore deliberate simplifications of the facts).
Thanks for reading.