Github link: https://github.com/WindowsNT/asm now with VS solution and automatic compilation/ISO generation/bochs/vmware/virtualbox running.
The Infamous Trilogy: Part 1
This article targets the user who wants to know how the CPU works. I will explain some assembly basics, the real mode, the protected mode, and the long mode. Working Assembly code is included, so you can test for yourself how the processor works in real mode, how protected mode is entered, how we get into 64-bit mode, and finally, how to exit from all of them and go back to DOS.
Do you dare to follow? Let's go!
- Assembly knowledge. While you will not be writing code, basic knowledge such as registers, memory access, basic commands, and such stuff is helpful.
- Flat Assembler, a modern assembler which can make executables for Win32, x64, and DOS.
- Bochs, the must-have tool for system developers. I recommend it, not only because it is free, but because it has a debugger that will trap any exceptions your program will generate and tell you what happened.
In the distant past, Borland's Turbo Assembler (TASM) was used because it was the first one to allow 32-bit segments. But since then, quite a few things have changed, and TASM is dead. All the code here is configured to compile with the modern FASM, which can create executables for both DOS and Windows, 16/32/64 bit.
Visual Studio includes ML.EXE and ML64.EXE, the newer MASM versions. However, these assemblers only output Windows executables and only for their respective architecture, so ML.EXE only outputs 32-bit flat code for Win32, and ML64.EXE only outputs 64-bit flat code for Win64.
The code is a VS solution that will compile with FASM, then create an ISO with PowerShell with the program entry.exe. Then you can run and it will launch Bochs with the configuration file to load a bootable FreeDos Disk with a CD-ROM drive that contains the entry.exe, which you can run.
There are still some bugs for now, but keep going!
Assembly in General
The assembly language is basically a collection of low level instructions (opcodes) to do useful stuff and to access memory. In order to make things easier, there are registers - that is, places to get and set data.
Memory is not treated as a continuous array of bytes (like a C array). It is divided in segments. A segment has different meaning, depending on the CPU mode (Real, Protected, or Long mode). Each memory address is referred by a segment register which holds the segment value, and an offset indicating the distance from the start of the segment.
In order to make it easier for functions to have local variables (like in C++) and transfer data between them, each application sets up a special segment called "stack segment" which holds the address of the memory used for stack. Stack is a "LIFO" vector: the last you push to it is the first you get from it.
It is like pointers in C. If you have a variable that contains a pointer in C, you can use * to access the data. In assembly, you do the same with square brackets, for example mov ax,[bx].
What if you try to access something that doesn't exist? In DOS, you can trash yourself or the OS or both; in Protected mode systems, you will not be allowed to access non-existent memory and thus, the exception handler will be called (or else, your program will be sacked).
Function Calls and Interrupts
Just as in C++, there are function calls in assembly (no, not everything is implemented with goto :). Depending on the programming model (you will learn this later), a function call can be near (in which the current IP is pushed to the stack and the function exits with a RET) or far (in which the IP and CS are pushed, and the function exits with a RETF).
An Interrupt is basically a handler that gets executed when something calls it. Many times, these are software interrupts, which means that the program calls it using the INT instruction. In fact, all DOS/BIOS services are provided through interrupts, so if the programmer loads AH with 09h, points DS:DX to a string, and executes INT 21h, then the programmer knows that it has called function 09h of INT 21h, which is a DOS function to display the string pointed to by DS:DX. The address for each interrupt (there are 256) is stored in the Interrupt Table, whose location can be read with the SIDT instruction, and it is different, depending on the processor mode (you will learn about that later). When an interrupt is called, IP, CS, and the flags are pushed to the stack, and the interrupt exits with IRET.
An interrupt can also happen when something else occurs (usually, an exception). For example, when your code divides something by zero, INT 00h is automatically called. The code for INT 00h sees that you have tried to divide something by zero and thus, you can't continue. So Windows displays a nice message and closes your application.
If you could install an INT 00h handler yourself (via the DOS function 25h), then the exception would get to your code before making it to Windows (that's what the structured exception handling basically does), but you must still fix the error. If you can't fix the error, then you must abort - pretty much what Windows does.
In DOS, the default exception handlers do little more than block the CPU from further processing, so all you can do is press Ctrl+Alt+Del to reboot.
- 16-bit Flag Register
IP holds the current execution point. As commands are executed, IP changes its value automatically.
AX, BX, CX, and DX can be accessed either entirely (for example, dx) or through their low 8 bits (al, bl, cl, dl) or high 8 bits (ah, bh, ch, dh).
SI and DI are always accessed as 16-bit registers (there is no sl, sh) and they are generally used as pointers to data.
BP can also be used as a general-purpose register (although it is usually used to access the stack), and SP holds the pointer to the current stack entry. When you put something on the stack, SP is decreased by the size of the item (2 bytes for a 16-bit value) and the item is stored at SS:SP.
What happens when we push more than the current stack can hold? Boom, stack overflow. You have probably encountered it in your C++ recursive functions.
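To make the push/pop mechanics concrete, here is a small Python model of a real mode stack. It is only a sketch of the arithmetic, not assembly, and every name in it (Stack16, push, pop) is made up for illustration.

```python
# Toy model of a real mode stack living at SS:SP inside 1 MB of memory.
# Stack16, push and pop are illustrative names, not part of the article's code.

class Stack16:
    def __init__(self, ss, sp):
        self.mem = bytearray(1 << 20)   # 1 MB of real mode memory
        self.ss = ss                    # stack segment
        self.sp = sp                    # stack pointer (offset inside SS)

    def _addr(self, delta=0):
        # Physical address of the byte at SS:(SP + delta)
        return ((self.ss << 4) + ((self.sp + delta) & 0xFFFF)) & 0xFFFFF

    def push(self, value):
        # PUSH first decrements SP by 2, then stores the 16-bit word at SS:SP.
        self.sp = (self.sp - 2) & 0xFFFF
        self.mem[self._addr(0)] = value & 0xFF          # low byte first
        self.mem[self._addr(1)] = (value >> 8) & 0xFF   # then the high byte

    def pop(self):
        # POP reads the word at SS:SP, then increments SP by 2.
        value = self.mem[self._addr(0)] | (self.mem[self._addr(1)] << 8)
        self.sp = (self.sp + 2) & 0xFFFF
        return value


stack = Stack16(ss=0x2000, sp=0x0100)
stack.push(0x1234)
stack.push(0xABCD)
print(hex(stack.sp))     # SP moved down by 4 bytes
print(hex(stack.pop()))  # last in, first out
print(hex(stack.pop()))
```

Note how the stack grows downwards: every push lowers SP, and the last value pushed is the first one popped.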
The flags register is a set of 16 bits (not all are actually used) that change their value depending on the result of each opcode. The conditional variants of the JMP command (JB, etc.) jump or fall through depending on those flags. For example, the ZF (Zero Flag) is set to 1 when the result of an operation is zero. You can use pushf and popf to read or set the flags through the stack, for example by following pushf with pop ax to copy the flags into AX.
16-bit Segment Registers
These registers hold a value that identifies the current segment. The way this value is interpreted depends on the current CPU mode (Real/Protected/Long).
CS always holds the segment of the currently executing code. You cannot set CS by using, say, mov cs,ax. When you call a function that resides in another segment (FAR call), or when you jump to another segment (FAR jump), CS changes automatically.
DS holds the default data segment. That means that when a memory operand names no segment, the CPU uses DS: an instruction that loads AX from a label reads from the segment pointed to by DS, at that label's offset, and mov bx,[1000h] loads BX from the segment pointed to by DS with an offset of 1000h. If you want to use another segment, you must do so explicitly with a segment override, for example mov bx,[es:1000h].
When DI is used as the destination index of a string instruction (such as MOVSB or STOSB), ES is the default segment. When BP is used as an index, SS is the default segment. In all other cases, DS is the default segment. Note that not every register can be used as an index in real mode, for example, mov ax,[dx] is not valid in real mode.
ES, FS, and GS are general-purpose auxiliary segment registers.
SS holds the value of the stack segment.
32 bit registers are available in all of the modes (real, protected, and long).
- 32-bit Flag Register
Each of them is an extension of its corresponding 16-bit register. For example, AX is the low 16 bits of EAX.
The 32-bit registers are usable in Real mode, but indexes (such as ESI) cannot be used unless their upper 16 bits are zero (that is, you can use a max index of 65535 only). But see below for Unreal mode, which provides a way around this.
64 bit registers are only available when the processor is in 64-bit mode. They are not available in real or protected or compatibility mode.
- 64-bit Flag Register
In addition, x64 defines 8 more 64-bit registers (r8, r9, r10, r11, r12, r13, r14, r15) to be used as auxiliary registers, and some 128-bit registers to be used when programming multimedia.
In this manual, when operations are applied to both 32-bit and 64-bit registers, I use the notation X(reg), for example XIP, XSP etc.
These registers hold information about the current state of the CPU.
CR0 is mostly used to set the CPU to protected mode (bit 0), and enable paging (bit 31).
CR1 is reserved.
CR2 holds the Page Fault Linear address when a page fault exception is triggered.
CR3 holds the physical address of the page directory (the top-level paging table).
CR4 defines some other flags, like Physical Address Extensions and VM86 mode.
For more on these, see Control Registers in Wikipedia.
These registers hold information for hardware debugging.
DR0 to DR3 hold the linear addresses of 4 breakpoints, and DR7 holds some flags to use them. See Debug Registers for more.
These registers used to hold information for CPU testing (now removed from the set). See Test Registers for more.
Addressing and Segmentation
In Real mode, everything is 16 bits. The entire memory is not accessed with an absolute index from 0, but it is divided into segments. A segment value, multiplied by 16, gives the actual offset of the segment's start from address 0. To this, an offset value can be added to refer to a distance from the start of the segment. These two things (segment:offset) tell the CPU the absolute address of the memory we need to access. For example:
0000h : 0000h: Indicates segment 0, offset 0. 0*16 + 0 = 0 actual memory address.
0100h : 000Fh: Indicates segment 100h, offset 0Fh. 100h*16 + 0Fh = 100Fh = actual memory address.
0002h : 0000h: 2h*16 + 0 = 20h actual memory address.
0001h : 0010h: 1h*16 + 10h = 20h actual memory address. As you can see, memory addresses can overlap.
Because the segment and the offset are only 16-bit values, the limit of the memory accessible by this method is 0ffffh : 0010h = 1MB. Specifying the 0ffffh segment and an offset of 0010h or larger results in wrapping back to address 0 (see protected mode, A20 line). And because the area from 0a000h:0000 upwards is reserved for the system (screen, etc.), only 640 KB remains for DOS applications.
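The segment:offset examples above can be checked with a few lines of Python. This is only a sketch of the arithmetic: the function name linear_address and the a20_enabled parameter are my own, and the wrap at 1MB is modeled by simply masking the address to 20 bits.

```python
def linear_address(segment, offset, a20_enabled=False):
    """Real mode address translation: segment * 16 + offset."""
    address = ((segment & 0xFFFF) << 4) + (offset & 0xFFFF)
    if not a20_enabled:
        # Only 20 address lines are in use, so anything at or above 1 MB wraps.
        address &= 0xFFFFF
    return address


print(hex(linear_address(0x0100, 0x000F)))                    # 0x100f
print(hex(linear_address(0x0002, 0x0000)))                    # 0x20
print(hex(linear_address(0x0001, 0x0010)))                    # 0x20, same byte as above
print(hex(linear_address(0xFFFF, 0x0010)))                    # 0x0, wrapped around
print(hex(linear_address(0xFFFF, 0x0010, a20_enabled=True)))  # 0x100000
```

The last two lines show the same segment:offset pair landing on two different physical addresses, depending only on whether the 21st address line is allowed to carry.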
In addition, all segments have read/write/execute access from anywhere (that is, any program can read/write or execute code within any segment). Because in 16-bit real mode OS the CPU sees the memory the way above, any application can read from or write to any part of memory, including the part in which the OS resides. That is why a real mode OS is a single tasking OS.
In real mode, CS:IP holds the current execution point, DS holds the default data segment, and SS holds the stack segment. Any application that has more than 64K of a code or data segment must break it into multiple segments.
Interrupts are simply special functions that are called when something happens (a hardware interrupt, or an exception like a division by zero), or when called by software (by using the INT instruction - software interrupt). In real mode, there are 256 interrupts. The table that holds the segment:offset for each interrupt is initially put into absolute address 0, but (in 286+) may be put elsewhere when using the LIDT instruction (use SIDT to get the table address).
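As a rough illustration of how the real mode interrupt table is laid out, the Python sketch below treats memory as a byte array and stores each vector as 4 bytes, offset first and segment second. The helper names (read_vector, write_vector) and the handler address are invented for the example.

```python
import struct

def write_vector(memory, number, segment, offset, table_base=0):
    # Each real mode vector is 4 bytes: 16-bit offset, then 16-bit segment.
    struct.pack_into("<HH", memory, table_base + number * 4, offset, segment)

def read_vector(memory, number, table_base=0):
    """Return (segment, offset) of the handler for interrupt `number`."""
    offset, segment = struct.unpack_from("<HH", memory, table_base + number * 4)
    return segment, offset


memory = bytearray(256 * 4)                  # 256 vectors, 1 KB in total
write_vector(memory, 0x21, 0x1234, 0x5678)   # made-up handler at 1234h:5678h
segment, offset = read_vector(memory, 0x21)
print(f"INT 21h -> {segment:04X}:{offset:04X}")
```

This is the table that DOS function 25h updates for you when you install a handler.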
In Real mode, the OS provides features to the application via software interrupts, for example, DOS provides a range of functions in INT 21h.
The program gets loaded by DOS into a memory segment, and execution starts at the offset that is specified in the EXE's header (or at 0100h if it is a COM file which has no header). After that, the application is free to do anything, to completely trash the memory. This is a Real mode "feature": an application owns the entire machine. In addition, the application is allowed to communicate directly with any hardware (via in/out opcodes), thus bypassing any limitations or security restrictions the OS might have. And if the application crashes, the entire system crashes and you have to reboot.
Here is an easy "Hello World" sample for a 16-bit EXE that uses multiple segments:
FORMAT MZ ; DOS 16-bit EXE format
SEGMENT CODE16_2 USE16 ; Declare a 16-bit segment
SEGMENT CODE16 USE16 ; Declare a 16-bit segment
ORG 0 ; Says that the offset of the first opcode of this segment is 0
call CODE16_2:ShowMsg ; direct far call: pushes CS and IP, ShowMsg must exit with RETF
SEGMENT DATA16 USE16
Msg db "Hello World!$"
SEGMENT STACK USE16
stackdata dw 1024 dup (0) ; reserve 1024 words (2 KB) for the stack
How does the assembler know the actual values that the segments (such as stack16) will have at run time? It doesn't. What it does is put null values in their place and create entries in the EXE file (known as "relocations") so that the loader, once it has copied the code to memory, writes the true segment values at the specified addresses. And because this relocation map lives in the EXE header, which COM files do not have, COM files cannot have multiple segments even if they sum to less than 64KB in total.
This program calls a function ShowMsg in another segment via a far call, which uses a DOS function (INT 21h) to display text. However, it could do it as well by writing directly into the video buffer (which, for color text mode, resides in the segment 0b800h) thus bypassing any OS or any security the function 09h might implement. Therefore, multitasking is not possible because each application can easily write to anywhere, thus destroying another application's or the OS's data.
Here is an easy "Hello World" sample for 16-bit COM:
org 100h ; code starts at offset 100h
use16 ; use 16-bit code
Msg db "Hello World!$"
What are the differences here? All stuff (code, data, stack) must reside in one segment. Code must start from offset 100h (to allow DOS to put information to the low 100h bytes), and no separate stack segment or data segment can be defined - COM files are "memory maps" and are limited to 64KB. For that reason, COM files are rarely used.
Generally, a DOS program consists of some code segments, some data segments, and a stack segment like above. A DOS program calls DOS and BIOS functions (through Interrupts) and accomplishes its task.
Because segments are limited to 64KB, there are many programming models depending on the application's requirements:
- Tiny, when everything has to fit in one single segment (COM files).
- Small, when there is one code segment and one data segment. Calls and jumps are near.
- Medium, when there is one data segment but more code segments. Calls and jumps are far.
- Compact, when there is one code segment but more data segments. Calls and jumps are near.
- Large, when there are more code and data segments. Calls and jumps are far.
- Huge, when the data structures exceed 64KB in size and thus, they have to programmatically be split into segments.
The most common models are the Small and the Large.
In 32-bit protected code (we are not discussing 16-bit protected mode here because it is very rare), a segment can have any size, from 1 byte to 4GB. The OS defines the size of each segment, and now each segment can have limitations (read, write, execute on or off). This allows the OS to "protect" the memory. In addition, there are 4 levels of authority (0 to 3, 0 = highest), so, for example, when a user application runs in level 3, it cannot touch the OS which runs at level 0.
In addition, if a 32-bit protected mode task crashes, OS catches the exception and terminates the program safely without crashing any other application or the OS itself. This way, true multitasking can occur.
Many people believe that multitasking is the art of running applications at the same time. This is not true, for one CPU core can only execute one command at a time. What is really happening is that OS permits Task #1 to run for X time, switches to Task #2, permits it to run for X time, switches to Task #3, and this is so fast that it appears that it is simultaneous.
Enabling the A20 line is the first step to use memory above 1MB. This trick (available in 286+) is the way to earn 0xFFF0 bytes of RAM (in the range 0ffffh:0010 through 0ffffh:0ffffh) accessible in real mode. Enabling the line (via the keyboard controller! - yes I don't even understand why) forces the CPU to avoid wrapping. This memory (known as High Memory Area, HMA) is used by HIMEM.SYS to load parts of DOS to it and therefore make more low memory available for applications.
The following code enables/disables A20. Note that if HIMEM.SYS is installed, A20 is enabled by default. HIMEM.SYS should be queried to alter A20 status instead of doing it directly.
The following code checks A20 and returns 1 to CF if it is enabled, 0 otherwise.
Global Descriptor Table Type 1: Application Entries
The Global Descriptor Table is a table that contains all the globally visible segments. Each segment has properties like:
- Base address (physical address in memory)
- Access restrictions
For Protected mode, the system maintains the GDTR register (accessible via SGDT/LGDT) which contains 6-byte data:
- 2 bytes - size of the entire array. Because each GDT entry is an 8-byte entry, a maximum of 8192 entries may be specified.
- 4 bytes - physical address of the GDT array in memory.
There are two types for a GDT entry. An entry for the application (S flag == 1, see below), and an entry for the OS (S flag == 0).
The definition for a GDT for an application entry as a C++ structure is this:
struct GDT_STR {
unsigned short seg_length_0_15;
unsigned short base0_15;
unsigned char base16_23;
unsigned char flags;
unsigned char seg_length_16_19:4;
unsigned char access:4;
unsigned char base24_31;
};
Although this seems easy, it is more complicated than you might think. Let's examine the fields. Note that the analysis below is for the S bit set to 1 (user GDT). If this bit is 0, then we talk about System related GDT entries (more on that below).
- A 20-bit value describing the length of the segment. If the G flag (see below) is not set, this value represents the actual segment length. If the G flag is set, this value is multiplied with 4096 to represent the segment length. So if you set it to FFFFFh (20 bits) and G is set, it is 100000h * 4096 = 4GB.
- A 32-bit value indicating the start of the segment in physical memory.
- Flags for the segment
- Bit 0: Access
- 0 - Segment is not accessed.
- 1 - Segment is accessed. The CPU sets this bit when the segment is accessed, so the OS gets an idea of how frequently the segment is used and knows whether it can cache it to disk or not.
- Bit 1: Accessibility
- For a code segment, 1 means that the segment can also be read. For a data segment, 1 means that the segment can also be written.
- Bit 2: Subtype
- For Code Segment (Bit 3 == 1)
- 0 - Not conforming.
- 1 - Conforming. A conforming segment can be called from any segment that has equal or lower privilege, and the called code keeps running at the caller's privilege level. So if a segment is conforming with privilege level 0, you can call it from a privilege level 1, 2, or 3 segment. If the segment is not conforming, then it can only be called directly from a segment with the same privilege level.
- For Data Segment (Bit 3 == 0)
- 0 - Expand up. The segment starts from its base address and ends to its limit.
- 1 - Expand down. The segment starts from its limit and ends to its base, with the address going the reverse way. This flag was created so a stack segment could be easily expanded, but it is not used by today's OSs.
- Bit 3: Type
- 0 - Data segment.
- 1 - Code segment.
- Bit 4: S
- 0 - This descriptor is for the OS. If this bit is not set, the entire GDT entries have different meanings, discussed below.
- 1 - This descriptor is for the application.
- Bit 5-6 : DPL
- The privilege level of this segment, from 00 (highest) to 11b (3) (lowest).
- Bit 7 : P
- Set to 1 to indicate that the segment is present in memory. If the OS caches this segment to the disk, then it sets P to 0. Any attempt to access the removed segment causes an exception. The OS catches this exception, and reloads the segment to memory, setting P to 1 again.
As you saw, the segment might not be present in memory at all, which allows the OS to cache the segment to the disk and reload it only when it is needed.
The first entry in the GDT table is always 0. CPU does not read information from entry #0 and thus it is considered a "dummy" entry. This allows the programmer to put the 0 value to a segment register (DS, ES, FS, GS) without causing an exception.
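To see how the fields described above end up in the 8 bytes of an entry, here is a small Python sketch. It follows the field names of the C++ structure ("flags" for the byte holding P/DPL/S/type, "access" for the upper nibble holding G and the 32-bit flag); make_descriptor and segment_size are hypothetical helper names, and I assume the usual convention that the stored 20-bit value is the last valid offset, so the size is that value plus one.

```python
import struct

def make_descriptor(base, limit, flags, access):
    """Pack an 8-byte application descriptor.

    flags  : the byte with P, DPL, S and the type bits (e.g. 0x9A)
    access : the upper nibble of the next byte, G / 32-bit / reserved (e.g. 0xC)
    limit  : the 20-bit segment length value
    """
    return struct.pack(
        "<HHBBBB",
        limit & 0xFFFF,                                  # seg_length_0_15
        base & 0xFFFF,                                   # base0_15
        (base >> 16) & 0xFF,                             # base16_23
        flags & 0xFF,                                    # flags
        ((access & 0xF) << 4) | ((limit >> 16) & 0xF),   # access + seg_length_16_19
        (base >> 24) & 0xFF,                             # base24_31
    )

def segment_size(descriptor):
    """Size in bytes, assuming the stored value is the last valid offset/page."""
    low, _, _, _, byte6, _ = struct.unpack("<HHBBBB", descriptor)
    limit = low | ((byte6 & 0x0F) << 16)
    granular = bool(byte6 & 0x80)                        # G flag
    return (limit + 1) * 4096 if granular else limit + 1


code32 = make_descriptor(base=0, limit=0xFFFFF, flags=0x9A, access=0xC)
print(code32.hex())                   # ffff0000009acf00
print(segment_size(code32) == 2**32)  # True: a flat 4GB segment
```

The packed bytes correspond to the values 0ffffh, 0, 0, 9ah, 0cfh, 0 that a flat 32-bit code segment entry uses.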
The following code creates some GDT entries, then loads them:
struc GDT_STR s0_15,b0_15,b16_23,flags,access,b24_31
{
.s0_15 dw s0_15
.b0_15 dw b0_15
.b16_23 db b16_23
.flags db flags
.access db access
.b24_31 db b24_31
}
gdt_start dw gdt_size
gdt_ptr dd 0
dummy_descriptor GDT_STR 0,0,0,0,0,0
code32_descriptor GDT_STR 0ffffh,0,0,9ah,0cfh,0
; 9ah = 10011010b = Present, DPL 00, No System,
; Code Exec/Read. 0cfh access = 11001111b = Big, 32bit,
; <resvd 0>, 1111 more size
data32_descriptor GDT_STR 0ffffh,0,0,92h,0cfh,0
; 92h = 10010010b = Present, DPL 00, No System, Data Read/Write
stack32_descriptor GDT_STR 0ffffh,0,0,92h,0cfh,0
code16_descriptor GDT_STR 0ffffh,0,0,9ah,0,0
data16_descriptor GDT_STR 0ffffh,0,0,92h,0,0
stack16_descriptor GDT_STR 0ffffh,0,0,92h,0,0 ; 64k 16-bit data
gdt_size = $-(dummy_descriptor)
; I've only created it now for code32_descriptor.
Note you have to create entries for your current real mode segments if you want to access data in them.
In real mode, the segment registers (CS, DS, ES, SS, FS, and GS) specify a real mode segment. And you can put anything to them, no matter where it points. And you can read and write and execute from that segment. In protected mode, these registers are loaded with selectors.
- Bit 0 - 1 : RPL
- Requested Privilege Level. For a data segment, it must be numerically less than or equal to the segment's DPL (that is, at least as privileged).
- Bit 2 : TI
- If this bit is set to 1, the selector selects an entry from the LDT instead of the GDT (see below for LDT).
- Bits 3 - 15:
- Zero based index to the table (GDT or LDT).
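So, to load ES with the code32 segment (entry 1 in the GDT above, with TI = 0 and RPL = 0), we would put the selector 08h into AX and then do mov es,ax.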
In protected mode, you can't just select random values to the segment registers like in real mode. You must put valid values or you will get an exception.
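The selector layout above is easy to get wrong by hand, so here is a tiny Python sketch of the bit arithmetic. The function names make_selector and split_selector are made up for this example.

```python
def make_selector(index, ti=0, rpl=0):
    """Build a selector: bits 3-15 = table index, bit 2 = TI, bits 0-1 = RPL."""
    return ((index & 0x1FFF) << 3) | ((ti & 1) << 2) | (rpl & 3)

def split_selector(selector):
    """Break a selector back into its three fields."""
    return {
        "index": selector >> 3,
        "ti": (selector >> 2) & 1,   # 0 = GDT, 1 = LDT
        "rpl": selector & 3,
    }


print(hex(make_selector(1)))   # 0x8  -> second GDT entry, RPL 0
print(hex(make_selector(2)))   # 0x10 -> third GDT entry, RPL 0
print(split_selector(0x1B))    # {'index': 3, 'ti': 0, 'rpl': 3}
```

Because the low 3 bits hold TI and RPL, a ring 0 GDT selector is simply the entry number multiplied by 8, which is why entry offsets such as 08h and 10h can be used directly as selectors.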
The OS uses the LIDT instruction to load the interrupt table. The IDTR contains 6-byte data: 2 bytes for the length of the table and 4 for its physical address in memory. Each entry in it is now 8 bytes, describing the location of the interrupt handler.
struc IDT_STR ofs0_15,sel,zero,flags,ofs16_31
{
.ofs0_15 dw ofs0_15
.sel dw sel
.zero db zero
.flags db flags
.ofs16_31 dw ofs16_31
}
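The 8-byte entry layout above splits the handler's 32-bit offset in two halves around the selector and the flags. The Python sketch below packs one such entry; make_interrupt_gate is a hypothetical helper, and the default flags value 8Eh (present, DPL 0, type 1110 = 32-bit interrupt gate) is my assumption for a typical ring 0 handler.

```python
import struct

def make_interrupt_gate(handler_offset, selector, flags=0x8E):
    """Pack an 8-byte protected mode IDT entry.

    flags 0x8E = 10001110b: present, DPL 0, 32-bit interrupt gate (assumed default).
    """
    return struct.pack(
        "<HHBBH",
        handler_offset & 0xFFFF,            # ofs0_15
        selector & 0xFFFF,                  # sel: code segment selector
        0,                                  # zero
        flags & 0xFF,                       # flags
        (handler_offset >> 16) & 0xFFFF,    # ofs16_31
    )


gate = make_interrupt_gate(handler_offset=0x00123456, selector=0x08)
print(gate.hex())   # 56340800008e1200
```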
Let's see some code to define only one interrupt:
SEGMENT CODE32 USE32
SEGMENT DATA16 USE16
idt_PM_start dw idt_size
idt_PM_length dd 0
interrupt0 db 6 dup(0)
SEGMENT CODE16 USE16
mov [interrupt0 + 2],eax
Notice the =NO DEBUG HERE=. Once the IDT table has been reset, a real mode debugger cannot work. So if you try to step into LIDT, you will crash. And no, you cannot call DOS or BIOS interrupts from protected mode. However, Bochs has its own hardware debugger that can step into anything, so you can do your stuff there!
Preparing for Crash
It is very rare that your first protected mode application won't crash. When this happens, the CPU triple faults and gets reset. To avoid resetting, you can register a real mode routine to be executed:
If the CPU crashes, your routine will be executed. That routine must reset all registers and stack, then exit to DOS.
Entering Protected Mode
After that, you must execute a far jump to a protected mode code segment in order to flush any instructions that were prefetched in real mode. Using JMP FAR results in error, for the assembler does not know at this point that we are in protected mode. If this code segment is a 16-bit code segment, you must do:
If this code segment is a 32-bit code segment, you must do:
Before enabling interrupts, you must setup the stack and other registers:
mov ax, data_selector
mov ds, ax
mov ax, stack_selector
mov ss, ax ; then load ESP with the top of the stack
Exiting Protected Mode
Because protected mode cannot call DOS or BIOS interrupts, it is generally not useful to DOS applications. However, a 'bug' in the 386+ processor turned out to be a feature called unreal mode. The unreal mode is a method to access the entire 4GB of memory from real mode. This trick is undocumented, however a large number of applications (including HIMEM.SYS) are using it.
- Enable A20.
- Enter protected mode.
- Load a segment register (for example FS or GS) with a 4GB data segment.
- Return to real mode.
As long as the register does not change its value, it still points to a 4GB data segment, so it is possible to use it along with EDI to access the entire address space. After returning from protected mode, you can easily do:
mov byte [fs:edi],0
286 lacks this capability because to exit protected mode, the CPU has to be reset, so all registers are destroyed.
The following function is a routine that will put your CPU to unreal mode and set FS to a 32-bit data segment:
STRUC GDT_STR
s0_15 dw ?
b0_15 dw ?
b16_23 db ?
flags db ?
access db ?
b24_31 db ?
ENDS
SEGMENT CODE16 USE16 PUBLIC
; GDT definitions
gdt_start dw gdt_size
gdt_ptr dd 0
dummy_descriptor GDT_STR <0,0,0,0,0,0>
code16_descriptor GDT_STR <0ffffh,0,0,9ah,0,0> ; 64k 16-bit code
data32_descriptor GDT_STR <0ffffh,0,0,92h,0cfh,0> ; 4GB 32-bit data, 92h = 10010010b = Present, DPL 00, No System, Data Read/Write
gdt_size = $-(dummy_descriptor)
dummy_idx = 0h ; dummy selector
code16_idx = 08h ; offset of 16-bit code segment in GDT
data32_idx = 10h ; offset of 32-bit data segment in GDT
PROC _EnterUnreal FAR
mov ax,CODE16 ; get 16-bit code segment into AX
shl eax,4 ; make a physical address
mov [ds:code16_descriptor.b0_15],ax ; store it in the dscr
mov [ds:data32_descriptor.b0_15],ax ; store it in the dscr
; Set gdt ptr
add ax,offset dummy_descriptor
mov bx,offset gdt_start
and al,not 1
Global Descriptor Tables
When the S flag is set to 0, the meaning of a GDT entry is quite different.
flags - Flags for the segment
More on gates later in this article.
- Bits 3 2 1 0 : Type of the entry
- 0000 - Reserved
- 0001 - Available 16-bit TSS
- 0010 - Local Descriptor Table (LDT)
- 0011 - Busy 16-bit TSS
- 0100 - 16-bit Call Gate
- 0101 - Task Gate
- 0110 - 16-bit Interrupt Gate
- 0111 - 16-bit Trap Gate
- 1000 - Reserved
- 1001 - Available 32-bit TSS
- 1010 - Reserved
- 1011 - Busy 32-bit TSS
- 1100 - 32-bit Call Gate
- 1101 - Reserved
- 1110 - 32-bit Interrupt Gate
- 1111 - 32-bit Trap Gate
Local Descriptor Table
Local Descriptor Table (LDT) is a method for each application to have a private set of segments, loaded with the LLDT assembly instruction. The TI bit in the selector specifies if the segment loaded is from the GDT or from the LDT. This, although originally helpful, is not used in modern OSes because of Paging.
Call gates are a mechanism to switch from a low privilege code to a higher one, used for user-level code to call system-level code. You specify a 1100 type entry in the GDT with the following format:
unsigned short offs0_15;
unsigned short selector;
unsigned short argnum:5;
unsigned char r:3;
unsigned char type:5;
unsigned char dpl:2;
unsigned char P:1;
unsigned short offs16_31;
Using CALL FAR with the selector of this callgate (the offset is ignored) will switch to the gate and execute the higher level privilege commands. If argnum specifies parameters to be copied, the system copies them to the new stack after pushing SS and ESP and before pushing CS and EIP. Using RETF will return from the gate call.
Because nowadays there are the faster SYSENTER/SYSEXIT commands, call gates are rarely used anymore. Their usage is limited to:
- When you need transition other than Ring 3 <-> Ring 0 (Sys commands only go from 3 to 0 and vice versa)
- For malware exploits, patching the GDT to create call gates and then execute privilege commands. Note that in x64 versions of Windows, the "Kernel Patching Protection" prevents modification of the GDT.
SYSENTER / SYSEXIT
These instructions are the current way to switch from ring 3 to ring 0 and back. You use WRMSR to set the ring 0 values that SYSENTER loads: CS (MSR 0x174), XSP (0x175) and XIP (0x176). For SYSEXIT, XCX must hold the ring 3 stack pointer and XDX must contain the ring 3 IP. The value stored for CS must select the first of 4 consecutive GDT entries: the first is the ring 0 code, the second is the ring 0 data, the third is the ring 3 code and the fourth is the ring 3 data. These positions are fixed, so in order to use SYSENTER your GDT table must contain these entries in this order.
These opcodes only support switching between ring 3 and ring 0, but they are much faster.
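The fixed layout of the four entries means that every selector can be derived from the single value written to the CS MSR (0x174). The Python sketch below shows that derivation for 32-bit mode as I understand it from the Intel manuals; sysenter_selectors is an invented name and the value 08h is just an example.

```python
def sysenter_selectors(sysenter_cs):
    """Derive the four selectors used by SYSENTER/SYSEXIT from the CS MSR value."""
    base = sysenter_cs & 0xFFFC          # the RPL bits are forced to 0 for ring 0
    return {
        "ring0_code": base,              # loaded into CS by SYSENTER
        "ring0_data": base + 8,          # loaded into SS by SYSENTER
        "ring3_code": (base + 16) | 3,   # loaded into CS by SYSEXIT, RPL = 3
        "ring3_data": (base + 24) | 3,   # loaded into SS by SYSEXIT, RPL = 3
    }


for name, value in sysenter_selectors(0x08).items():
    print(f"{name}: {value:02X}h")
```

With 08h in the MSR, the ring 0 code segment is GDT entry 1, the ring 0 data is entry 2, and the ring 3 pair sits in entries 3 and 4.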
There are a number of problems that occur in a multitasking OS when the above setups are used:
- A task has to be loaded in memory entirely.
- DOS applications think that they always access the lowest MB of RAM, so they can't be put outside it.
- An application must handle its own segments which must be different from other applications', thus making the use of dynamic link libraries costly.
Paging is the method to redirect an address to another address. The address that the application uses is called the "linear address" and the actual address is the "physical address".
32 bit Protected Mode Paging
The simplest form of paging consists of 2 tables: The Page Directory and the Page Table. The Page Directory is an array of 1024 32-bit entries with the following format:
- P - Page is present in memory. This flag allows the OS to cache the pages back to disk, clear P, and reload them when a page fault is generated as software attempts to access the page.
- R - Page is Read Write if set, else Read only.
- U - If set, ring 3 (user) code can access this page. If clear, only supervisor code (rings 0 to 2) can.
- W - If set, write-through is enabled.
- D - If set, the page will not be cached.
- A - Set by the CPU when the page is accessed (like the GDT bit). The OS has to clear it itself.
- N - Set to 0.
- S - Set to 0. If PSE is enabled, see below.
- G - Set to 0.
- Addr - The upper 20 bits (the lower 12 are ignored because it must be 4096-aligned) of the physical address of the Page Table that this Page Directory entry points to.
The Page Table is an array of 1024 32-bit entries with the same format. The address points to the actual physical address that this page is mapped to.
Since there are 1024 directory entries, each pointing to a page table with 1024 entries, a total of 1024x1024 mappings are possible. Because the size of the page is 4096 bytes, we can map the entire 32-bit address space (1024 x 1024 x 4096 bytes = 4GB).
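To show how a linear address is actually walked through the two tables, here is a short Python model. It is a sketch, not how an OS would do it: the tables are plain Python lists, page tables are looked up in a dictionary keyed by their physical address, only the P bit is checked, and the names split_linear and translate are invented.

```python
def split_linear(address):
    """Split a 32-bit linear address into (directory index, table index, offset)."""
    directory_index = (address >> 22) & 0x3FF   # top 10 bits
    table_index = (address >> 12) & 0x3FF       # next 10 bits
    offset = address & 0xFFF                    # low 12 bits, inside the 4KB page
    return directory_index, table_index, offset

def translate(address, page_directory, page_tables):
    """Walk the directory and the table, returning the physical address."""
    directory_index, table_index, offset = split_linear(address)
    pde = page_directory[directory_index]
    if not pde & 1:                             # P bit clear
        raise LookupError("page fault: page table not present")
    pte = page_tables[pde & 0xFFFFF000][table_index]
    if not pte & 1:
        raise LookupError("page fault: page not present")
    return (pte & 0xFFFFF000) | offset


# One page table, placed at (made-up) physical address 5000h.
page_directory = [0] * 1024
table = [0] * 1024
page_directory[1] = 0x5000 | 0x3   # present + read/write
table[2] = 0x9000 | 0x3            # linear page maps to physical 9000h
page_tables = {0x5000: table}

print(split_linear(0x00402034))                                # (1, 2, 52)
print(hex(translate(0x00402034, page_directory, page_tables)))  # 0x9034
```

Any address whose directory or table entry has P clear raises the model's version of a page fault, which is the hook an OS uses to bring a page back from disk.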
Put the physical address of the page directory in CR3 before enabling paging (CR0 PG bit, bit 31).
If PSE is enabled (CR4 bit 4), then S can be 1, in which case the page size is 4MB instead, and the pages must be 4MB aligned. This mode is introduced to avoid lots of small pages, at the expense of more memory wasted if the needed memory is somewhat larger than 4MB. Fortunately, modes can be mixed.
Physical Address Extension (PAE)
PAE is the ability of x86 to use 36 address bits instead of 32. This increases the available memory from 4GB to 64GB. The 32-bit applications still see only a 4GB address space, but the OS can map (via paging) memory from the high area to the lower 4GB address space. This extension was added to x86 to cope with the (nowadays not enough) limit of 4GB, before 64-bit software came to the foreground.
Enabling PAE (CR4 bit 5) means that now you have 3 paging levels: In addition to PT and PDT, you now have the PDPT, the Page Directory Pointer Table, which has 4 64-bit entries. Each of the PDPT entries points to a Page Directory of 4KB (like in normal paging). Each entry in the new Page Directory is now 64 bits long (so there are 512 entries) and points to a Page Table of 4KB (like in normal paging), and each entry in the new Page Table is now 64 bits long, so there are 512 entries too. Because one directory would then cover only a quarter of the original mapping (512 x 512 x 4KB = 1GB), the PDPT holds 4 entries. The first entry maps the first 1GB, the 2nd the 2nd GB, the 3rd the 3rd GB and finally, the 4th entry maps the 4th GB.
But now the "S" bit in the PDT has a different meaning: If not set, it means that the page entry is 4KB but if set, it means that this entry does not point to a PT entry, but it describes itself a 2MB page. So you can have different levels of paging traversal depending on the S bit.
There is a new flag in the Page Directory entry as well, the NX bit (Bit 63) which, if set, prevents code execution in that page.
This system allows the OS to handle memory over 4GB, but since the address space is still 4GB, each process is still limited to 4GB. The memory can be up to 64GB but a process cannot see the entire memory.
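With PAE the 32-bit linear address is cut into 2 + 9 + 9 + 12 bits (or 2 + 9 + 21 bits when the directory entry describes a 2MB page). A minimal Python sketch of that split, with invented function names:

```python
def split_linear_pae(address):
    """4KB pages: (PDPT index, directory index, table index, offset)."""
    pdpt_index = (address >> 30) & 0x3          # 2 bits: which 1GB quarter
    directory_index = (address >> 21) & 0x1FF   # 9 bits: 512 directory entries
    table_index = (address >> 12) & 0x1FF       # 9 bits: 512 table entries
    offset = address & 0xFFF                    # 12 bits inside the page
    return pdpt_index, directory_index, table_index, offset

def split_linear_pae_2mb(address):
    """2MB pages (S bit set): (PDPT index, directory index, offset)."""
    return (address >> 30) & 0x3, (address >> 21) & 0x1FF, address & 0x1FFFFF


print(split_linear_pae(0xC0201005))      # (3, 1, 1, 5)
print(split_linear_pae_2mb(0xC0201005))  # (3, 1, 4101)
```

The first number shows which of the 4 PDPT entries, and therefore which GB of the address space, the address falls into.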
Initially, the Local Descriptor Table was used so each application could have a local array of segments. But because of paging, modern 32-bit OSs now use the "flat" mode. This way the applications receive the entire 4GB address space to hold code, data, and stack, but this portion of the address space is mapped into a different physical memory. So the applications can use the same memory addresses which are mapped to different physical addresses.
For example, see these two C++ programs running under 32-bit Windows:
// Program A
int flags = MB_OK;
const char* msg = "Hello there";
const char* title = "Title";

// Program B
int flags = MB_OK;
const char* msg = "Hello there";
const char* title = "Title";
This allows the application programmer to never consider segmentation. All pointers are near, there are no segments (all have the same value), and thus creating applications is easier. There is no such thing as a "small/large" model, because all the stuff is within the same segment.
The OS maps via paging all needed memory (say, a DLL) to some virtual address in the 32-bit address space, and the app will never consider far pointers or segmentation. CS, DS and SS permanently view the entire 4GB address space, but all addresses are virtual and mapped to the application using paging, therefore there is no segmentation.
So, in flat mode:
- All the segments are expanded to full 32-bit 4GB
- Via paging, different linear addresses can be mapped to the same physical address, and the same linear address (in different processes) can be mapped to different physical addresses
- No segmentation, LDT, call gates. No ring 1 or 2. Enter the kernel via SYSENTER/SYSEXIT or SYSCALL/SYSRET (the latter pair, btw, now reuses the old LOADALL opcodes 0x0F 0x05 and 0x0F 0x07).
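The mapping idea behind flat mode can be sketched with a toy model in Python. Everything here is hypothetical: `translate`, `process_a`, `process_b` and the frame numbers are invented for the illustration, and a real OS uses the multi-level page tables described earlier rather than a dictionary.

```python
PAGE_SIZE = 4096

def translate(page_map, linear):
    """Translate a linear address with a {linear page number: physical frame number} map."""
    page, offset = divmod(linear, PAGE_SIZE)
    if page not in page_map:
        raise LookupError("page fault at %#x" % linear)
    return page_map[page] * PAGE_SIZE + offset

# Two hypothetical processes. Both use linear page 0x401 for their own data,
# and both map the same physical frame 0x5000 (say, a shared DLL),
# each at a different linear page.
process_a = {0x401: 0x1234, 0x77000: 0x5000}
process_b = {0x401: 0x8888, 0x66000: 0x5000}
```

Here the same linear address 0x00401010 ends up at two different physical addresses, while the two different linear addresses 0x77000123 and 0x66000123 end up at the same physical address.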
Because of its simplicity, the "flat mode" is now the mode used by all 32-bit OSs, and also it is the only one that exists in 64-bit mode.
So far so good with protected mode, but most existing applications at that time were real-mode ones. Even today, many of them (mostly games) are still played under Windows. To force these applications (which think they own the machine) to cooperate, a special mode had to be created.
VM86 mode is enabled by a special flag (VM, bit 17) in the EFlags register. It gives the application a normal 16-bit DOS memory map of 640KB, which is of course forwarded via paging to the actual memory. This makes it possible to run multiple DOS applications at the same time without any risk of one application overwriting another. EMM386.EXE, the well-known old memory manager, puts the processor into that state. The OS monitors the process step by step, making sure that it won't execute something illegal (so don't expect to enter protected mode when EMM386.EXE is loaded, because once you try to set the GDT with LGDT, you will be sacked :).
Once the VM flag is set, you can load a normal "segment" to a segment register. Interrupt calls by DOS applications are caught by the OS and emulated through it - if possible. Also, some instructions are ignored, for example, if you do a CLI, the interrupts are not actually disabled. The OS sees that you prefer to not be interrupted and acts accordingly, but interrupts are still there.
All VM86 code executes in PL 3, the lowest privilege level. Ins/Outs to ports are also captured and emulated if possible. The interesting thing about VM86 is that there are two interrupt tables, one for the real and one for the protected mode. But only protected mode interrupts are executed.
VM86 is not available in long mode, so a 64-bit OS cannot execute 16-bit DOS code anymore. In order to execute such code, you need an emulator such as DOSBox.
HIMEM is the generic extended memory manager for DOS. At that time, extended memory was mostly, if not totally, used to cache data from the disk, especially from big apps. HIMEM puts the CPU in unreal mode (or it uses LOADALL in 286) and provides a simple interface to the applications that want more memory without messing with the protected mode details. By enabling the A20 line, HIMEM allowed a portion of DOS COMMAND.COM to reside in the high memory area. Because unreal mode is still real mode, your protected mode application can do the stuff we discussed even if HIMEM.SYS is loaded.
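The high memory area exists because of plain real-mode arithmetic: segment * 16 + offset can exceed 1MB by almost 64KB, and whether that overflow is reachable depends on the A20 line. A small Python sketch (the helper name `real_mode_linear` and its `a20_enabled` parameter are made up for this illustration):

```python
def real_mode_linear(segment, offset, a20_enabled=True):
    """Compute the linear address of a real-mode segment:offset pair (illustrative helper)."""
    linear = ((segment & 0xFFFF) << 4) + (offset & 0xFFFF)
    if not a20_enabled:
        # With A20 disabled, address line 20 is forced to 0, so addresses wrap at 1MB
        linear &= ~(1 << 20)
    return linear
```

With A20 enabled, FFFF:0010 is the first byte above 1MB and FFFF:FFFF is the last byte of the high memory area (0x10FFEF). With A20 disabled, the same pairs wrap around to the bottom of memory.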
At that time, a form of memory now eliminated, the "expanded" memory, existed. Many applications were written to take advantage of it, but the modern standard was the protected mode. EMM386 puts the CPU in VM86 mode and maps via paging memory over 1MB to real mode segments (over 0xA0000), so an application that would like to use expanded memory can use it via EMM386.EXE. In addition, EMM386 allowed "devicehigh" and "loadhigh" commands in CONFIG.SYS, allowing applications to get loaded to these high segments if possible.
Because VM86 mode is protected mode, your protected mode application cannot do the stuff we have discussed if EMM386 is loaded: once you try LGDT, your program will be terminated, because ring-3 applications (remember that in VM86 mode the DOS application runs in ring 3) cannot set the GDT.
DOS Protected Mode Interface is a system that allows DOS applications to run 32-bit code. Unreal mode was not enough because it only allows data to be moved, but not code to be executed. What a DPMI server does is to take care of the nasty tables we have discussed above, allowing the executable to specify 32-bit code directly. When the executable calls DOS, the DPMI server catches the call, switches to real mode, calls DOS, then back to protected mode.
At that time, a now non-existent and mostly undocumented instruction existed, LOADALL (0x0F 0x05 on the 286, 0x0F 0x07 on the 386). LOADALL was used, as the name implies, to load all the registers (including the GDTR and IDTR) from one table in memory. The 286 LOADALL (which was not accessible on the 386) read this table from the fixed memory address 0x800, whereas the 386 LOADALL reads the buffer pointed to by real mode ES:EDI. Because the CPU does not check in any way whether the values loaded by LOADALL are valid, LOADALL was used by many tools at the time, including HIMEM.SYS, for various infamous actions:
- To access the entire memory from real mode without entering protected mode or unreal mode.
- To run real mode code with paging.
- To get back to real mode code from protected mode without resetting the 286.
- To run normal 16-bit code inside protected mode without VM86 (which was not there in 286). This was done by trapping each memory access (which would lead to GPF because all the segments were marked non-present) and emulating the desired result by using another LOADALL. Of course this was too slow, but it led to the creation of the VM86 mode in 386, where LOADALL eventually faded out.
LOADALL 286 itself was mentioned in the manuals and was partially documented; by contrast, LOADALL 386 was heavily obscure, probably to induce the programmers to take advantage of the new VM86 mode.
The x64 CPU has three modes of work:
- Real mode, same as in DOS
- Legacy mode, same as 32-bit protected mode
- Long mode
Long mode has two sub-modes:
- Compatibility mode: same as 32-bit protected mode. This allows a 64-bit OS to run 32-bit applications.
- 64-bit mode: for 64-bit applications
To work in Long mode, the programmer must take into consideration the facts below:
- Unlike Protected mode, which can run with or without paging, long mode absolutely needs PAE and paging. That is, you cannot leave paging out even if your map is "see-through". You have to create PAE-style page tables, and "flat" mode is the only valid mode in long mode. No segmentation.
- AMD docs say that, in order to enter long mode, you have to enter protected mode - however, this has proven not to be true, since you can now get into long mode directly from real mode, by enabling protected mode and long mode within one instruction (this can work because Control Registers are accessible from real mode). I use that method in the long mode thread creation from real mode.
- Although in theory any 64-bit value could be used as an address, in practice we don't yet need 2^64 bytes of memory. Therefore, current implementations only implement 48-bit addressing, which forces all pointers to have bits 47-63 either all 0 or all 1. This means that you have 2 ranges of valid "canonical" addresses, one from 0 to 0x00007FFF'FFFFFFFF and one from 0xFFFF8000'00000000 through 0xFFFFFFFF'FFFFFFFF, for a total of 256TB of address space. Most OSes reserve the upper area for the kernel, and the lower area for user space. And no, you cannot use the unused bits to store extra smart information about the pointer, because the CPU does not ignore these bits; it requires them to be either all 1 or all 0. Forget your bad habits :)
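The canonical address rule is easy to express in code. This is a minimal Python sketch; the function name `is_canonical` and the `implemented_bits` parameter are invented for the illustration, with 48 bits assumed as described above.

```python
def is_canonical(address, implemented_bits=48):
    """Return True if bits (implemented_bits - 1) through 63 are all 0 or all 1."""
    if not 0 <= address < 1 << 64:
        raise ValueError("address must fit in 64 bits")
    top = address >> (implemented_bits - 1)              # bits 47..63 for 48-bit addressing
    all_ones = (1 << (64 - implemented_bits + 1)) - 1    # 17 one-bits for 48-bit addressing
    return top == 0 or top == all_ones
```

The last address of the lower range and the first address of the upper range pass the check, while anything in the hole between them (such as 0x00008000'00000000) does not.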
Long Mode Paging
In long mode, the paging system adds a new top-level structure, the PML4T, which has 512 64-bit entries, each pointing to one PDPT. The PDPT now has 512 entries as well (instead of 4 in x86 PAE mode). So you can now have 512 PDPTs, which means that one PT entry manages 4KB, one PDT entry manages 2MB (4KB * 512 PT entries), one PDPT entry manages 1GB (2MB * 512 PDT entries), and one PML4T entry manages 512GB (1GB * 512 PDPT entries). Since there are 512 PML4T entries, a total of 256TB (512GB * 512 PML4T entries) can be addressed.
This is another reason not to use the entire 64-bit for addressing. Using the entire thing would force us to have 6 levels of paging.
Each of the "S" bits in the PDPT/PDT entries can be 0 to indicate that there is a lower-level structure below, or 1 to indicate that the traversal ends here. If the PDT S flag is 1, the page size is 2MB; if the PDPT S flag is 1, the page size is 1GB.
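The arithmetic of the four-level scheme can be checked with a few lines of Python. The names below (`split_long_mode_address` and the `*_SPAN` constants) are made up for this sketch, which assumes 4KB pages and 48-bit addressing as described above.

```python
ENTRIES = 512                              # entries per table at every level
PT_ENTRY_SPAN = 4096                       # one PT entry manages 4KB
PD_ENTRY_SPAN = PT_ENTRY_SPAN * ENTRIES    # one PDT entry manages 2MB
PDPT_ENTRY_SPAN = PD_ENTRY_SPAN * ENTRIES  # one PDPT entry manages 1GB
PML4_ENTRY_SPAN = PDPT_ENTRY_SPAN * ENTRIES  # one PML4T entry manages 512GB
TOTAL_SPAN = PML4_ENTRY_SPAN * ENTRIES     # 256TB in total

def split_long_mode_address(linear):
    """Return (PML4T index, PDPT index, PDT index, PT index, offset) for a 4KB page."""
    return (
        (linear >> 39) & 0x1FF,
        (linear >> 30) & 0x1FF,
        (linear >> 21) & 0x1FF,
        (linear >> 12) & 0x1FF,
        linear & 0xFFF,
    )
```

Each index is 9 bits wide (512 entries) and the page offset is 12 bits, which adds up to the 48 implemented address bits. The first address of the upper canonical range, 0xFFFF8000'00000000, lands in PML4T entry 256, which is why the two canonical halves split the PML4T in the middle.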
Global Descriptor Table
- Creating a 64-bit segment
- A segment marked for 64-bit is pretty much the same as a 32-bit segment with a limit of 4GB, but with the L bit set to 1 and the D bit set to 0. The D bit is also 0 in 16-bit segments, but when the L bit is set, it indicates a 64-bit segment.
- 64-bit segments always start from 0 and always end in 0xFFFFFFFFFFFFFFFF.
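As a sketch of what the L and D bits look like in an actual GDT entry, the following Python helper packs an 8-byte code segment descriptor. The function name `make_code_descriptor` is invented, and the access byte 0x9A (present, ring 0, readable code segment) is an assumption chosen for the example, not a value taken from the article's code.

```python
import struct

def make_code_descriptor(long_mode, base=0, limit=0xFFFFF, access=0x9A):
    """Pack an 8-byte GDT code segment descriptor (illustrative helper).

    Flags nibble: G (bit 3), D (bit 2), L (bit 1), AVL (bit 0).
    """
    flags = 0x8                           # G=1: the limit is counted in 4KB units
    flags |= 0x2 if long_mode else 0x4    # 64-bit: L=1, D=0; 32-bit: L=0, D=1
    return struct.pack(
        "<HHBBBB",
        limit & 0xFFFF,                          # limit bits 0-15
        base & 0xFFFF,                           # base bits 0-15
        (base >> 16) & 0xFF,                     # base bits 16-23
        access,                                  # access byte
        (flags << 4) | ((limit >> 16) & 0xF),    # flags and limit bits 16-19
        (base >> 24) & 0xFF,                     # base bits 24-31
    )
```

The two descriptors differ only in the flags nibble: 0xA for the 64-bit segment against 0xC for the 32-bit one.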
If your GDT resides in the lower 4GB of memory, you need not change it after entering long mode. However, if you plan to call LGDT while in long mode, you must now deal with the 10-byte GDTR, which holds 2 bytes for the limit of the GDT (its length in bytes minus 1) and 8 bytes for its linear address.
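The layout of that 10-byte operand can be sketched in Python as follows. The helper name `make_gdtr64` and its parameters are hypothetical; the limit field follows the usual convention of size in bytes minus 1.

```python
import struct

def make_gdtr64(gdt_address, entry_count):
    """Pack the 10-byte operand for LGDT in long mode (illustrative helper)."""
    limit = entry_count * 8 - 1                   # each GDT entry is 8 bytes
    return struct.pack("<HQ", limit, gdt_address)  # 2-byte limit, 8-byte address
```

For a GDT with 3 entries at address 0x1000, the result is the limit 0x0017 followed by the 8-byte address, both little-endian.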
Any selector you might load to access a 64-bit segment is ignored, and DS, ES and SS are not used at all. All the segments are flat, and everything is done via paging. End of the segmentation era.
You have to reset the IDT to use 64-bit descriptors.
Each entry in it is now 16 bytes, describing the location of the interrupt handlers in 64-bit mode.
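.ofs0_15  dw ofs0_15    ; handler offset, bits 0-15
.sel      dw sel        ; code segment selector
.ist      db 0          ; Interrupt Stack Table index in bits 0-2 (0 = no stack switch)
.flags    db flags      ; type and attributes
.ofs16_31 dw ofs16_31   ; handler offset, bits 16-31
.ofs32_63 dd ofs32_63   ; handler offset, bits 32-63
.zero     dd zero       ; reserved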
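To double-check the 16-byte layout, here is a small Python sketch that packs one 64-bit IDT entry. The helper name `make_idt_entry64` is invented, and the default flags value 0x8E (present, ring 0, interrupt gate) is an assumption for the example. Note that the byte between the selector and the flags holds the Interrupt Stack Table index; without it, the fields would only add up to 15 bytes.

```python
import struct

def make_idt_entry64(handler, selector, flags=0x8E, ist=0):
    """Pack a 16-byte long mode IDT entry (illustrative helper)."""
    return struct.pack(
        "<HHBBHII",
        handler & 0xFFFF,                # offset bits 0-15
        selector,                        # code segment selector
        ist & 0x7,                       # IST index, upper bits zero
        flags,                           # type and attributes
        (handler >> 16) & 0xFFFF,        # offset bits 16-31
        (handler >> 32) & 0xFFFFFFFF,    # offset bits 32-63
        0,                               # reserved
    )
```

Packing a handler at 0x11223344'55667788 with selector 0x08 shows how the 64-bit offset is scattered across three fields of the entry.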
Entering Long Mode
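; 1. Turn off paging (we must be running in a see-through area)
mov eax, cr0
btr eax, 31            ; clear CR0.PG
mov cr0, eax
; 2. Set PAE
mov eax, cr4
bts eax, 5             ; set CR4.PAE
mov cr4, eax
; 3. CR3 is assumed to be loaded with the physical address of the page table
; 4. Enable long mode
mov ecx, 0c0000080h    ; EFER MSR number
rdmsr                  ; read EFER into EDX:EAX
bts eax, 8             ; set EFER.LME
wrmsr                  ; write EFER back
; 5. Enable paging, which activates long mode
mov eax, cr0
bts eax, 31            ; set CR0.PG
mov cr0, eax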
- Turn off paging, if enabled. To do that, you must ensure that you are running in a "see through" area.
- Set PAE, by setting bit 5 of CR4.
- Create the new page tables and load CR3 with them. Because CR3 is still 32 bits wide before entering Long mode, the page tables must reside in the lower 4GB.
- Enable Long mode (note, this does not enter Long mode, it just enables it).
- Enable paging. Enabling paging activates and enters Long mode.
Because the rdmsr and wrmsr opcodes are also available in Real mode, you can activate Long mode from Real mode directly by setting both the PE and PG bits of CR0 simultaneously.
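The sequence of bit changes can be followed with a toy model in Python. It only manipulates integers standing in for the registers (nothing here touches real hardware), and the names `enter_long_mode` and `long_mode_active` are invented for the illustration. The bit positions are PE = CR0 bit 0, PG = CR0 bit 31, PAE = CR4 bit 5 and LME = EFER bit 8.

```python
CR0_PE = 1 << 0        # protection enable
CR0_PG = 1 << 31       # paging
CR4_PAE = 1 << 5       # physical address extension
EFER_MSR = 0xC0000080  # MSR number passed in ECX to rdmsr/wrmsr
EFER_LME = 1 << 8      # long mode enable

def long_mode_active(cr0, cr4, efer):
    """Long mode is active only when PE, PG, PAE and LME are all set."""
    return bool(cr0 & CR0_PE and cr0 & CR0_PG and cr4 & CR4_PAE and efer & EFER_LME)

def enter_long_mode(cr0, cr4, efer, from_real_mode=False):
    """Apply the steps listed above to the given register values."""
    cr0 &= ~CR0_PG         # 1. paging off
    cr4 |= CR4_PAE         # 2. PAE on
    # 3. CR3 would be loaded with the page table address here
    efer |= EFER_LME       # 4. long mode enabled, but not yet active
    cr0 |= CR0_PG          # 5. paging on: long mode becomes active
    if from_real_mode:
        cr0 |= CR0_PE      # PE is set together with PG in the same CR0 write
    return cr0, cr4, efer
```

Starting from protected mode (PE already set) or from real mode with `from_real_mode=True`, the resulting values are CR0 = 0x80000001, CR4 = 0x20 and EFER = 0x100.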
Now you are in compatibility mode. Enter 64-bit mode by jumping to a 64-bit code segment:
The initial 64-bit segment must reside in the lower 4GB because compatibility mode does not see 64-bit addresses.
Note that you must use the linear address, because 64-bit segments always start from 0. Note also that if the current compatibility segment is 16-bit default, you have to use the 066h prefix.
The only thing you have to do in 64-bit mode is to reset the stack pointer (RSP) to a linear address. linear is a macro that finds the linear address of a target.
DS, SS and ES are not used in 64-bit mode. That is, if you want to access data in another segment, you cannot load DS with that segment's selector and access the data. You must specify the linear address of the data. Data and stack are always accessed with linear addresses. "Flat" mode is not only the default, it is the only one for 64-bit. However, GS and FS can still be used as auxiliary registers, and their values are still subject to verification against the GDT. In 32-bit Windows, FS points to the Thread Information Block; 64-bit Windows uses GS for that purpose.
Once you are in 64-bit mode, the defaults for the opcodes (except for jmp/call) are still 32-bit. So a REX prefix (0x40 to 0x4F) with its W bit set (0x48 to 0x4F) is required to mark a 64-bit operand. Your assembler handles that automatically if it supports a "code64" segment.
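The REX prefix is just the bit pattern 0100WRXB, which is where the 0x40 to 0x4F range comes from. A tiny Python sketch (the helper name `rex_prefix` is made up for the illustration):

```python
def rex_prefix(w=False, r=False, x=False, b=False):
    """Build a REX prefix byte: 0100 W R X B."""
    return 0x40 | (int(w) << 3) | (int(r) << 2) | (int(x) << 1) | int(b)

# mov eax, ebx is encoded as 89 D8; prefixing it with REX.W turns it into mov rax, rbx
MOV_EAX_EBX = bytes([0x89, 0xD8])
MOV_RAX_RBX = bytes([rex_prefix(w=True)]) + MOV_EAX_EBX
```

Only the W bit selects the 64-bit operand size; the R, X and B bits extend the register fields so that R8 to R15 can be reached.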
In addition, a 64-bit interrupt table must now be set with a new LIDT instruction, this time taking a 10-byte operand (2 bytes for the length and 8 for the location). Each entry in the IDT now takes 16 bytes, which include 2 for the selector and 8 for the offset.
Returning to Compatibility Mode
Because 0eah is not a valid jump opcode in 64-bit mode, you have to use a RETF trick to get back to a compatibility mode segment.
This gets you back to compatibility mode. 64-bit OSs keep jumping from 64-bit to compatibility mode in order to be able to run both 64-bit and 32-bit applications.
Why do Windows drivers have to be 64-bit for a 64-bit OS? Because no WOW64 for driver (ring 0) code exists. They could have created one if they wanted to - I guess they wanted to force manufacturers to finally move to 64-bit. Nice decision, I must admit.
Exiting from Long Mode
You have to set up all the registers again with 32-bit selectors - back to segmentation. Also, you must be in a see-through area, because to exit long mode you must deactivate paging. Of course, you can switch immediately to real mode by resetting the PE bit as well.
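mov eax, cr0
btr eax, 31            ; clear CR0.PG: paging off, long mode deactivated
mov cr0, eax
mov ecx, 0c0000080h    ; EFER MSR number
rdmsr                  ; read EFER into EDX:EAX
btr eax, 8             ; clear EFER.LME
wrmsr                  ; write EFER back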
Unreal Mode in 64-bit
I am sorry if I made you feel good for a moment. There is no such thing that would allow you to access over 4GB of RAM from real mode (unless AMD has an Easter egg in its CPU). In addition, although the 32-bit registers (EAX, EBX, etc.) are available in Real mode, the 64-bit registers (RAX, RBX, etc.) are not even available in compatibility mode - only in 64-bit mode. Or, who knows?
Virtual 86 Mode in 64-bit
Once the CPU enters Long mode, VM86 is not supported anymore. That is the reason why 64-bit OSs cannot run 16-bit DOS applications. However, emulators like DOSBox will run your old 16-bit game just fine.
DPMI for 64-bit
It does not exist, but I made something similar, which allows a DOS application to run multiple threads in real, protected and long mode while still having access to DOS interrupts.
Yup, I made it :)
The code presented here messes with everything we have discussed so far. It has yet some dirty functions, but it works. Have fun with it!
- 30/12/2018 - More details on implementation of flat mode, typos, thread code
- 25/12/2018 - Cleanup, Github code, VS solution
- 21/05/2015 - Paging analysis.
- 24/03/2015 - LOADALL added, and updated unreal mode code.
- 05/02/2015 - Callgates and SYSENTER/SYSEXIT information added.
- 30/09/2014 - Wow after 5 years, at least some typo fixing.
- 02/12/2009 - First release.