(untagged)

The Real, Protected, Long mode assembly tutorial for PCs

Michael Chourdakis

0.00/5 (No votes)

5 Oct 2014

Immerse yourself in system programming!

Download source code - 6.62 KB

Introduction

This article targets the user who wants to know how the CPU works. I will explain some assembly basics, the real mode, the protected mode, and the long mode. Working Assembly code is included, so you can test for yourself how the processor works in real mode, how protected mode is entered, how we get into 64-bit mode, and finally, how to exit from all of them and go back to DOS.

Do you dare to follow? Let's go!

Requirements

Assembly knowledge. While you will not be writing code, basic knowledge such as registers, memory access, basic commands, and such stuff is helpful.
Flat Assembler, a modern assembler which can make executables for Win32, x64, and DOS.
A clean DOS installation (see below). You can try the excellent FreeDOS.

DOS is not the easiest environment to work with, so you can simulate it in two ways:

VMware, the best virtualization software.
Bochs, the must-have tool for system developers. I recommend it, not only because it is free, but because it has a debugger that will trap any exceptions your program will generate and tell you what happened.

In the distant past, Borland's Turbo Assembler (TASM) was used because it was the first one to allow 32-bit segments. But since then, quite a few things have changed, and TASM is dead.

All the code here is configured to compile with the modern FASM, which can create executables for both DOS and Windows, 16/32/64 bit.

Visual Studio includes ML.EXE and ML64.EXE, the newer MASM versions. However, these assemblers only output Windows executables and only for their respective architecture, so ML.EXE only outputs 32-bit flat code for Win32, and ML64.EXE only outputs 64-bit flat code for Win64.

NASM is also good, but it hasn't so many output options as FASM.

Debugger

Yes, you have figured it out already. Your real mode debugger will not work once you step into the LIDT instruction (goodbye, protected mode debugging!). However there is a way around it, it is called 386SWAT and it is both a real and protected mode debugger, which you can invoke while in protected mode. For this to work, you must initialize some GDT entries for it. My code does that, so you can see how.

However, Bochs has its own hardware debugger that can step into anything, so you can do your stuff there!

Assembly in General

Introduction

The assembly language is basically a collection of low level instructions (opcodes) to do useful stuff and to access memory. In order to make things easier, there are registers - e.g., places to get and set data.

Segmentation

Memory is not treated as a continuous array of bytes (like a C array). It is divided in segments. A segment has different meaning, depending on the CPU mode (Real, Protected, or Long mode). Each memory address is referred by a segment register which holds the segment value, and an offset indicating the distance from the start of the segment.

Stack

In order to make it easier for functions to have local variables (like in C++) and transfer data between them, each application sets up a special segment called "stack segment" which holds the address of the memory used for stack. Stack is a "LIFO" vector: the last you push to it is the first you get from it.

Pointer Access

It is like pointers in C. If you have a variable that contains a pointer in C, you can use * to access the data. In assembly, you do the same with []:

; assume that DS has the data segment
mov si,000fh
mov dx,[si] ; Now dx has whatever was placed in DS:000fh

What if you try to access something that doesn't exist? In DOS, you can trash yourself or the OS or both; in Protected mode systems, you will not be allowed to access non-existent memory and thus, the exception handler will be called (or else, your program will be sacked).

Function Calls and Interrupts

Just as in C++, there are function calls in assembly (no, not everything is implemented with goto :). Depending on the programming model (you will learn this later), a function call can be near (in which the current IP is pushed to the stack and the function exits with a RET), far (in which the IP and CS are pushed, and the function exits with a RETF), or interrupt.

An Interrupt is basically a handler that gets executed when something calls it. Many times, these are software interrupts, which means that the program calls it using the INT instruction. In fact, all DOS/BIOS services are provided through interrupts, so if the programmer does:

; assume that DS has the data segment
mov ah,9
mov dx,msg;
int 21h;

then the programmer knows that it has called function 9 of INT 21h, which is a DOS function to display the string pointed to by DS:DX. The address for each interrupt (there are 255) is stored in the Interrupt Table, which is accessible via the SIDT command, and it is different, depending on the processor mode (you will learn about that later). IP, CS, and the flags are pushed to the stack, and the interrupt exits with IRET.

An interrupt can also happen when something else occurs (usually, an exception). For example, when your code divides something by zero, INT 00h is automatically called. The code for INT 00h sees that you have tried to divide something with zero and thus, you can't continue. So Windows displays a nice message and closes your application.

If you could install an INT 00h handler yourself (via the DOS function 25h), then the exception would get to your code before making it to Windows (that's what the structured exception handling basically does), but you must still fix the error. If you can't fix the error, then you must abort - pretty much what Windows does.

In DOS, the default exception handlers do little than block the CPU from further processing, so you can only do Ctrl+Alt+Del to resume.

Registers

16-bit Registers

AX
BX
CD
DX
SI
DI
BP
SP
IP
16-bit Flag Register

IP holds the current execution point. As commands are executed, IP changes its value automatically.

AX, BX, CX, DX can be accessed either entirely:

mov ax,1    ; ax is now 1
mov cx,ax  ; cx is now also 1

or using their low 8-bit (al, bl, cl, dl) or high bit (ah, bh, ch, dh)

xor ax,ax  ; ax is now 0

mov ah,1    ; ah is now 1, thus ax is now 0100h
mov al,2   ; al is now 2, thus ax is now 0102h

SI, DI are always accessed as 16-bit registers (there is no sl, sh) and they are generally used as pointers to data. BP can also be used as a generic-purpose register (although it is usually used to access the stack), and SP holds the pointer to the current stack entry. So let's see what happens when you put something to the stack:

; assume that SP has the value of 0100h
mov ax,3   ; ax is now 0
push ax    ; ax is put to the stack, SP has now the value of 00FEh; (100h - 2)
mov dx,5   ; dx is now 5
push dx    ; dx is put to the stack, SP is now 00FCh;
pop bx     ; bx gets the value from the stack top (5). SP gets back to 00FEh
pop cd     ; cx gets the value from the stack top (3). SP gets back to 0100h

What happens when we push more than the current stack can hold? Boom, stack overflow. You have probably encountered it in your C++ recursive functions.

The flags register is a set of 16 bits (not all are actually used) that change their value depending on the operation of each opcode. The variables of the JMP command (JZ, JAE, JB, etc.) can jump conditionally depending on those flags. For example, the ZF (Zero Flag) is set to 1 after an operation is zero:

mov ax,bx    ; Get value of BX to AX
or  ax,ax    ; Is AX 0? If yes, or ax,ax will also say 0, so ZF will be 1
je  AxIsZero 

jmp AxIsNotZero

You can use pushf and popf to set or read the flags to a register, for example:

pushf  ; push flags to stack
pop ax ; ax has now the flags

or al,1 ; set the bit 0 to 1

push ax
popf

16-bit Segment Registers

CS
DS
SS
ES
FS
GS

These registers hold a value that identifies the current segment. The way this value is interpreted depends on the current CPU mode (Real/Protected/Long).

CS always holds the segment of the currently executing code. You cannot set CS by using, say, mov cs,ax. When you call a function that resides in another segment (FAR call), or when you jump to another segment (FAR jump), CS changes.

DS holds the default data segment. That means that if you do this:

mov si,0

mov ax,[si]
mov bx,[1000h]

Then AX gets its value from the segment pointed to by DS, with the offset specified by SI, and BX gets its value from the segment pointed to by DS with an offset of 1000h. If you want to use another segment, you must do so explicitly:

mov di,0
mov ax,[fs:di]
mov bx,[es:1000h]

When DI is used as an index, ES holds the default segment. When BP is used as an index, SS is the default segment. In all other cases, DS is the default segment. Note that not every register can be used as an index in real mode, for example, mov ax,[dx] is not valid in real mode.

ES, FS, and GS are general-purpose auxiliary segment registers. SS holds the value of the stack segment.

32-bit Registers

32 bit registers are available in all of the modes (real, protected, and long).

EAX
EBX
ECD
EDX
ESI
EDI
EBP
ESP
EIP
32-bit Flag Register

Each of them is an extension of their relative 16-bit register. For example:

mov eax,0   ; eax is now 0, so ax is also 0.

mov ax,1    ; ax is now 1, eax is also 1
or eax,0FFFF0000; ax is now 1, eax is now 0FFFF0001h

The 32-bit registers are usable in Real mode, but indexes (EDI and ESI) cannot be used unless their upper 16 bits are zero (that is, you can use a max index of 65535 only). But see below for Unreal mode, which provides a way around this.

64-bit Registers

64 bit registers are only available when the processor is in 64-bit mode. They are not available in real or protected or compatibility mode.

RAX
RBX
RCX
RDX
RSI
RDI
RBP
RSP
RIP
64-bit Flag Register

In addition, x64 defines 8 more 64-bit registers (r8, r9, r10, r11, r12, r13, r14, r15) to be used as auxiliary registers, and some 128-bit registers to be used when programming multimedia.

Control Registers

CR0
CR1
CR2
CR3

These registers hold information about the current state of the CPU.

CR0 is mostly used to set the CPU to protected mode (bit 0), and enable paging (bit 31).
CR1 is reserved.
CR2 holds the Page Fault Linear address when a page fault exception is triggered.
CR3 holds the address of the paging table.
CR4 defines some other flags, like Physical Address Extensions and VM86 mode.

For more on these, see Control Registers in Wikipedia.

Debug Registers

DR0
DR1
DR2
DR3
DR6
DR7

These registers hold information for hardware debugging. DR0-DR3 hold the linear address of 4 breakpoints, and DR6-DR7 some flags to use them. See Debug Registers for more.

Test Registers

TR4
TR5
TR6
TR7

These registers used to hold information for CPU testing (now removed from the set). See Test Registers for more.

Real Mode

Addressing and Segmentation

In Real mode, everything is 16 bits. The entire memory is not accessed with an absolute index from 0, but it is divided into segments. Each segment represents the actual offset from 0, multiplied by 16. To this segment, an offset value can be added to refer to a distance from the start of this segment. These two things (segment:offset) tell the CPU the absolute value of the memory we need to access. For example:

0000h : 0000h: Indicates segment 0, offset 0. 0*16 + 0 = 0 actual memory address.
0100h : 000Fh: Indicates segment 100h, offset 0Fh. 100h*16 + 0Fh = 100Fh = actual memory address.
0002h : 0000h: 2h*16 + 0 = 20h actual memory address.
0001h : 0010h: 1h*16 + 10h = 20h actual memory address. As you can see, memory addresses can overlap.

Because the segment and the offset are only 16-bit values, the maximum memory accessible by this method is 0ffffh : 0010h = 1MB. Specifying 0ffffh segment and an offset larger than 0010h results in wrapping (see protected mode, A20 line). And because the area after 0a000h:0000 is reserved for the system (screen, etc.), only 640 KB remains for DOS applications.

In addition, all segments have read/write/execute access from anywhere (that is, any program can read/write or execute code within any segment). Because in 16-bit real mode OS the CPU sees the memory the way above, any application can read from or write to any part of memory, including the part in which the OS resides. That is why a real mode OS is a single tasking OS.

In real mode, CS:IP holds the current execution point, DS holds the default data segment, and SS holds the stack segment. Any application that has more than 64K of a code or data segment must break it into multiple segments.

Interrupts

Interrupts are simply special functions that are called when something happens (hardware interrupt), like a division by zero, or when called by software (by using the INT instruction - software interrupt). In real mode, there are 256 interrupts. The table that holds the segment:offset for each interrupt is initially put into absolute address 0, but (in 286+) may be put elsewhere when using the LIDT instruction (use SITD to get the table address).

In Real mode, the OS provides features to the application via software interrupts, for example, DOS provides a range of functions in INT 21h.

Program Execution

The program gets loaded by DOS into a memory segment, and execution starts at the offset that is specified in the EXE's header (or at 0100h if it is a COM file which has no header). After that, the application is free to do anything, to completely trash the memory. This is a Real mode "feature": an application owns the entire machine. In addition, the application is allowed to communicate directly with any hardware (via in/out opcodes), thus bypassing any limitations or security restrictions the OS might have. And if the application crashes, the entire system crashes and you have to reboot.

Code

Here is an easy "Hello World" sample for a 16-bit EXE that uses multiple segments:

FORMAT MZ                  ; DOS 16-bit EXE format
ENTRY CODE16:Main       ; Specify Entry point (i.e. the start address)
STACK STACK16:stackdata ; Specify The Stack Segment and Size
    
SEGMENT CODE16_2 USE16  ; Declare a 16-bit segment
    
    ShowMsg:
        mov ax,DATA16
        mov ds,ax            ; Load DS with our "default data segment"
        mov ax,0900h    
        mov dx,Msg    
        int 21h;            ; Call a DOS function: AX = 0900h (Show Message), 
                            ; DS:DX = address of a buffer, int 21h = show message 
    retf                    ; FAR return; we were called from 
                            ; another segment so we must pop IP and CS.
    
SEGMENT CODE16 USE16         ; Declare a 16-bit segment
    ORG 0                    ; Says that the offset of the first opcode 
                             ; of this segment must be 0.
    
    Main:
        mov ax,CODE16_2
        mov es,ax
        call far [es:ShowMsg] ; Call a procedure in another segment.
                              ; CS/IP are pushed to the stack.
        mov ax,4c00h          ; Call a DOS function: AX = 4c00h (Exit), int 21h = exit
        int 21h
    
SEGMENT DATA16 USE16
    Msg db "Hello World!$"
        
SEGMENT STACK USE16
    stackdata dw 0 dup(1024)  ; use 2048 bytes as stack. When program is initialized, 
                              ; SS and SP are automatically set.

How does the assembler know the actual value of the data16, code16, code16_2, and stack16 segments? It doesn't. What it does is to put null values, and then creates entries to the EXE file (known as "relocations") so the loader, once it copies the code to the memory, writes to the specified address, the true values of the segments. And because this relocation map has a header, COM files cannot have multiple segments even if they sum to less than 64KB in total.

This program calls a function ShowMsg in another segment via a far call, which uses a DOS function (09h, INT 21h) to display text. However, it could do it as well by writing directly into the video buffer (which for text mode, resides in the segment 0b000h) thus bypassing any OS or any security the function 09h might implement. Therefore, multitasking is not possible because each application can easily write to anywhere, thus destroying another application's or the OS's data.

Here is an easy "Hello World" sample for 16-bit COM:

org    100h         ; code starts at offset 100h
use16               ; use 16-bit code
                    ; The SS is same as CS (since only one 
                    ; segment is here) and SP is set to 0xFFFE

mov ax,0900h
mov dx,Msg
int 21h;
mov ax,4c00h
int 21h

Msg db "Hello World!$"

What are the differences here? All stuff (code, data, stack) must reside in one segment. Code must start from offset 100h (to allow DOS to put information to the low 100h bytes), and no stack segment or data segment must be defined - COM files are "memory maps" and are limited to 64KB. For that reason, COM files are rarely used.

Generally, a DOS program consists of some code segments, some data segments, and a stack segment like above. A DOS program calls DOS and BIOS functions (through Interrupts) and accomplishes its task.

Programming Models

Because segments are limited to 64KB, there are many programming models depending on the application's requirements:

Tiny, when everything has to fit in one single segment (COM files).
Small, when there is one code segment and one data segment. Calls and jumps are near.
Medium, when there is one data segment but more code segments. Calls and jumps are far.
Compact, when there is one code segment but more data segments. Calls and jumps are near.
Large, when there are more code and data segments. Calls and jumps are far.
Huge, when the data structures exceed 64KB in size and thus, they have to programmatically be split into segments.

The most common models are the Small and the Large.

Protected Mode

Segments

In 32-bit protected code (we are not discussing 16-bit protected mode here because it is very rare), a segment can have any size, from 1 byte to 4GB. The OS defines the size of each segment, and now each segment can have limitations (read, write, execute on or off). This allows the OS to "protect" the memory. In addition, there are 4 levels of authority (0 to 3, 0 = highest), so, for example, when a user application runs in level 3, it cannot touch the OS which runs at level 0.

In addition, if a 32-bit protected mode task crashes, OS catches the exception and terminates the program safely without crashing any other application or the OS itself. This way, true multitasking can occur.

Multitasking

Many people believe that multitasking is the art of running applications at the same time. This is not true, for one CPU core can only execute one command at a time. What is really happening is that OS permits Task #1 to run for X time, switches to Task #2, permits it to run for X time, switches to Task #3, and this is so fast that it appears that it is simultaneous.

A-20 line

Enabling the A-20 line is the first step to use memory above the 640KB. This trick (available in 286+) is the way to earn 0xFFF0 bytes of RAM (in the range 0ffffh:0010 through 0ffffh:0ffffh) accessible in real mode. Enabling the line (via the keyboard controller! - yes I don't even understand why) forces the CPU to avoid wrapping. This memory (known as High Memory Area, HMA) is used by HIMEM.SYS to load parts of DOS to it and therefore make more low memory available for applications.

The following code enables/disables A20. Note that if HIMEM.SYS is installed, A20 is enabled by default. HIMEM.SYS should be queried to alter A20 status instead of doing it directly.

WaitKBC:
   mov cx,0ffffh
   A20L:
   in al,64h
   test al,2
   loopnz A20L
ret

ChangeA20:
   call WaitKBC
   mov al,0d1h
   out 64h,al
   call WaitKBC
   mov al,0dfh ; use 0dfh to enable and 0ddh to disable.
   out 60h,al
ret

The following code checks A20 and returns 1 to CF if it is enabled, 0 otherwise.

CheckA20:
    PUSH ax 
    PUSH ds
    PUSH es 

    XOR ax,ax 
    MOV ds,ax 
    NOT ax 
    MOV es,ax 
    MOV ah,[ds:0] 
    CMP ah,[es:10h] 
    JNZ A20_ON 

    CLI 
    INC ah 
    MOV [ds:0],ah 
    CMP [es:10h],ah 
    PUSHF 
    DEC ah 
    MOV [ds:0],ah 
    STI 
    POPF 
    JNZ A20_ON 

    CLC 
    POP es
    POP ds
    POP ax 
    RET 

A20_ON: 
    STC 
    POP es
    POP ds
    POP ax 
RET

Global Descriptor Table Type 1: Application Entries

The Global Descriptor Table is a table that contains all the globally visible segments. Each segment has properties like:

Size
Base address (physical address in memory)
Access restrictions

For Protected mode, the system maintains the the GDTR register (accessible via SGDT / LGDT) which contains 6-byte data:

2 bytes - size of the entire array. Because each GDT entry is an 8-byte entry, a maximum of 8192 entries may be specified.
4 bytes - physical address of the GDT array in memory.

There are two types for a GDT entry. An entry for the application (S flag == 1, see below), and an entry for the OS (S flag == 0).

The definition for a GDT for an application entry as a C++ structure is this:

struct GDT_STR
{
    unsigned short seg_length_0_15;
    unsigned short base0_15;
    unsigned char  base16_23;
    unsigned char  flags;
    unsigned char  seg_length_16_19:4;
    unsigned char  access:4;
    usigned  char  base24_31;
};

Although this seems easy, it is more complicated than you might think. Let's examine the fields. Note that the analysis below is for the S bit set to 1 (user GDT). If this bit is 0, then we talk about System related GDT entries (more on that below).

seg_length
- A 20-bit value describing the length of the segment. If the G flag (see below) is not set, this value represents the actual segment length. If the G flag is set, this value is multiplied with 4096 to represent the segment length. So if you set it to FFFFFh (20 bits) and G is set, it is 10000h * 4096 = 4GB.
base
- A 32-bit value indicating the start of the segment in physical memory.
flags
- Flags for the segment
  - Bit 0: Type
    - 0 - Data
    - 1 - Code
  - Bit 1: Subtype
    - For Code Segment (B0 == 1)
      - 0 - Not conforming.
      - 1 - Conforming. A conforming segment can be called from any segment that has equal or higher privilege. So if a segment is conforming with privilege level 3, you can call it from a privilege level 0, 1, or 2 segment. If the segment is not conforming, then it can only be called from a segment with the same privilege level.
    - For Data Segment (B0 == 0)
      - 0 - Expand up. The segment starts from its base address and ends to its limit.
      - 1 - Expand down. The segment starts from its limit and ends to its base, with the address going the reverse way. This flag was created so a stack segment could be easily expanded, but it is not used by today's OSs.
  - Bit 2: Accessibility
    - For Code Segment (B0 == 1)
      Note that a code segment is not writable. However, because segment base addresses can overlap, you can create a writable data segment with the same base address and limit of a code segment.
      - 0 - Not readable. Any code that tries to read memory from this segment will generate an exception.
      - 1 - Readable.
    - For Data Segment (B0 == 0)
      - 0 - Not writable. Any code that tries to write to this data segment will generate an exception. Data segments are always readable.
      - 1 - Writable.
  - Bit 3: Access
    - 0 - Segment is not accessed.
    - 1 - Segment is accessed. The CPU sets this bit each time the segment is accessed, so the OS gets an idea how frequent is the access to the segment, so it knows if it can cache it to disk or not.
  - Bit 4: S
    - 0 - This descriptor is for the OS. If this bit is not set, the entire GDT entries have different meanings, discussed below.
    - 1 - This descriptor is for the application.
  - Bit 5-6 : DPL
    - The privilege level of this segment, from 00 (highest) to 11b (3) (lowest).
  - Bit 7 : P
    - Set to 1 to indicate that the segment is present in memory. If the OS caches this segment to the disk, then it sets P to 0. Any attempt to access the removed segment causes an exception. The OS catches this exception, and reloads the segment to memory, setting P to 1 again.
access
- Bit 0: AVL
  - Not used, set to 0.
- Bit 1: L
  - Set to 0 for 32-bit segments. If set to 1, it indicates 64-bit segments used in long mode.
- Bit 2: D
  Real mode segments are always 16-bit default.
  - When D is not set, the default for opcodes is 16-bit. The segment can still execute 32-bit commands by putting the 0x66 or 0x67 prefix to them.
  - When D is set, the default for opcodes is 32-bit. The segment can still execute 16-bit commands by putting the 0x66 or 0x67 prefix to them.
- Bit 3: G
  - Set to 1 to multiply the seg_length by 4096 to find the true segment length as discussed above.

As you saw, the segment might not be present in memory at all, which allows the OS to cache the segment to the disk and reload it only when it is needed.

The first entry in the GDT table is always 0. CPU does not read information from entry #0 and thus it is considered a "dummy" entry. This allows the programmer to put the 0 value to a segment register (DS, ES, FS, GS) without causing an exception.

The following code creates some GDT entries, then loads them:

struc GDT_STR s0_15,b0_15,b16_23,flags,access,b24_31
; 'access' taken as a byte, it is actually 4+4 bits
{
    .s0_15   dw s0_15
    .b0_15   dw b0_15
    .b16_23  db b16_23
    .flags   db flags
    .access  db access
    .b24_31  db b24_31
}

Note you have to create entries for your current real mode segments if you want to access data in them.

Selectors

In real mode, the segment registers (CS, DS, ES, SS, FS, GS) specify a real mode segment. And you can put anything to them, no matter where it points. And you can read and write and execute from that segment. In protected mode, these registers are loaded with selectors.

Selector

Bit 0 - 1 : RPL
- Requested Protection Level. It must be equal or less privileged than the segments DPL.
Bit 2 : TI
- If this bit is set to 1, the selector selects an entry from the LDT instead of the GDT (see below for LDT).
Bits 3 - 15:
- Zero based index to the table (GDT or LDT).

So, to load ES with the code32 segment, we would do:

mov ax,0008h  ; 0-1 : 00 privilege, 2 : 0 (GDT), 3-15 = 1 (Second Entry)
mov es,ax

In protected mode, you can't just select random values to the segment registers like in real mode. You must put valid values or you will get an exception.

Interrupts

The OS uses the LIDT instruction to load the interrupt table. The IDTR contains the 6-byte data, 2 for the length of the tables and 4 for the physical address in memory.

Each entry in it is now 8 bytes, describing the location of the interrupt handlers.

struc IDT_STR 
{
 .ofs0_15 dw ofs0_15
 .sel dw sel
 .zero db zero
 .flags db flags            ; 0 P,1-2 DPL, 3-7 index to the GDT
 .ofs16_31 dw ofs16_31
}

Let's see some code to define only one interrupt:

SEGMENT CODE32 USE32
    intr00:
      ; do nothing but return
     IRETD

...
SEGMENT DATA16 USE16

    idt_PM_start      dw             idt_size
    idt_PM_length dd 0
    interrupt0 db 6 dup(0)
    idt_size=$-(interruptsall)
...
SEGMENT CODE16 USE16

      xor eax,eax
      mov eax,CODE32
      shl eax,4 ; Make it physical address
      add eax,intr00 ; Add the offset
      mov [interrupt0 + 2],eax
      mov ax,0008h;  The selector of our COD32
      mov [interrupt0],ax
    
...
mov bx,idt_PM_start
mov ax,DATA16
mov ds,ax
; = NO DEBUG HERE =
cli
lidt [bx]

Notice the =NO DEBUG HERE=. Once the IDT table has been reset, a real mode debugger cannot work. So if you try to step into LIDT, you will crash. And no, you cannot call DOS or BIOS interrupts from protected mode. See below on how to use 386SWAT to be able to debug protected mode.

Preparing for Crash

It is very rare that your first protected mode application won't crash. When this happens, CPU does the triple fault and gets reset. To avoid resetting, you can put a real code to be executed:

MOV ax,40h 
MOV es,ax 
MOV di,67h 
MOV al,8fh 
OUT 70h,al 
MOV ax,ShutdownProc 
STOSW 
MOV ax,cs
STOSW 
MOV al,0ah 
OUT 71h,al 
MOV al,8dh 
OUT 70h,al

If the CPU crashes, your routine will be executed. That routine must reset all registers and stack, then exit to DOS.

Entering Protected Mode

cli
mov eax,cr0
or eax,1
mov cr0,eax

After that, you must execute a far jump to a protected mode code segment in order to clear possible invalid command cache. Using JMP FAR results in error, for the assembler does not know at this point that we are in protected mode. If this code segment is a 16-bit code segment, you must do:

db 0eah    ; Opcode for far jump
dw StartPM ; Offset to start, 16-bit
dw 018h    ; This is the selector for CODE16_DESCRIPTOR,
           ; assuming that StartPM resides in code16

If this code segment is a 32-bit code segment, you must do:

db 66h     ; Prefix for 32-bit
db 0eah    ; Opcode for far jump
dd StartPM ; Offset to start, 32-bit
dw 08h     ; This is the selector for CODE32_DESCRIPTOR,
           ; assuming that StartPM resides in code32

Before enabling interrupts, you must setup the stack and other registers:

mov ax, data_selector
mov ds,ax
mov ax, stack_selector
mov ss,ax
mov esp,1000h ; assuming that the limit of the stack segment 
              ; selected by stack_selector is 1000h bytes.
sti
...

Exiting Protected Mode

cli
mov eax,cr0
and eax,0ffffffeh
mov cr0,eax
mov ax,data16
mov ds,ax
mov ax,stack16
mov ss,ax
mov sp,1000h ; assuming that stack16 is 1000h bytes in length
mov bx,RealMemoryInterruptTableSavedWithSidt
litd [bx]
sti
; (You can debug here) ...

Unreal Mode

Because protected mode cannot call DOS or BIOS interrupts, it is generally not useful to DOS applications. However, a 'bug' in the 386+ processor turned out to be a feature called unreal mode. The unreal mode is a method to access the entire 4GB of memory from real mode. This trick is undocumented, however a large number of applications (including HIMEM.SYS) are using it.

Enable A20.
Enter protected mode.
Load a segment register (ES or FS or GS) with a 4GB data segment.
Return to real mode.

As long as the register does not change its value, it still points to a 4GB data segment, so it is possible to use it along with EDI to access the entire address space. After returning from protected mode, you can easily do:

; assuming FS has loaded a 4GB data segment from Protected Mode
mov edi,1048576 ; point above 1MB
mov byte [fs:edi],0 ; Set a byte above 1MB.

286 lacks this capability because to exit protected mode, the CPU has to be reset, so all registers are destroyed.

The following function is a routine that will put your CPU to unreal mode and set DS/ES to the 4GB stuff.

; 386+
SetUnreal:
jmp UnrealCode

gdt:
; Dummy First
    dw 0                        ; limit 15:
    dw 0                    ; base 15:0
    db 0                    ; base 23:16
    db 0                    ; access byte (descriptor type)
    db 0                    ; limit 19:16, flags
    db 0                    ; base 31:24
    DATA_SEL    equ     $-gdt    ; 0008h
    dw 0FFFFh
    dw 0
    db 0
    db 92h                  ; present, ring 0, data, expand-up, writable
    db 0CFh                    ; page-granular, 32-bit
    db 0
gdt_end:
gdt_ptr:
    dw gdt_end - gdt - 1    ; GDT limit
    dd 0                    ; linear adr of GDT (set later)

UnrealCode:

    pushad
    push ds
    push ds
    mov ax,code16
    mov ds,ax
    xor eax,eax             ; point gdt_ptr to gdt
    mov ax,ds
    shl eax,4
    add ax,offset gdt         ; EAX=linear address of gdt
    mov [gdt_ptr + 2],eax
    cli                     ; Interrupts off
    mov bx,offset gdt_ptr
    lgdt [bx]
    mov eax,cr0
    or al,1
    mov cr0,eax             ; partial switch to 32-bit pmode
    mov bx,DATA_SEL         ; selector to segment w/ 4G limit
    mov ds,bx
    mov es,bx               ; set seg limits in descriptor caches
    dec al
    mov cr0,eax             ; back to (un)real mode
    pop es                  ; segment regs back to old values,
    pop ds                   ; but now 32-bit addresses are OK
    popad
    sti
    ret

286 lacks this very useful mode because it can't exit protected mode without all the registers being destroyed.

Global Descriptor Table Type 2: OS Entries

When the S flag is set to 0, the meaning of a GDT entry is quite different.

flags - Flags for the segment

Local Descriptor Table

Local Descriptor Table (LDT) is a method for each application to have a private set of segments, loaded with the LLDT assembly instruction. The LDT bit in the selector specifies if the segment loaded is from the GDT or from the LDT. This, although originally helpful, is not used in modern OSes because of Paging.

Paging

There are a number of problems that occur in a multitasking OS when the above setups are used:

A task has to be loaded in memory entirely.
DOS applications think that they always access the lowest MB of RAM, so they can't be put outside it.
An application must handle its own segments which must be different from other applications', thus making an application of dynamic link libraries costly.

Paging is the method to redirect an address to another address. The address that the application uses is called the "linear address" and the actual address is the "physical address".

There are some methods to use paging, which we need not discuss here. What you have to do is to create two tables in memory: The Page Directory, an array of pointers to the page table, and the Page Table, an array of pointers to the physical memory. My code has a sample to create a 64-bit paging table.

Physical Address Extension (PAE)

PAE is the ability of x86 to use 36 address bits instead of 32. This increases the available memory from 4GB to 64GB. The 32-bit applications still see only a 4GB address space, but the OS can map (via paging) memory from the high area to the lower 4GB address space. This extension was added to x86 to cope with the (nowadays not enough) limit of 4GB, before 64-bit software came to the foreground.

Flat Mode

Initially, the Local Descriptor Table was used so each application could have a local array of segments. But because of paging, modern 32-bit OSs now use the "flat" mode. This way the applications receive the entire 4GB address space to hold code, data, and stack, but this portion of the address space is mapped into a different physical memory. So the applications can use the same memory addresses which are mapped to different physical addresses.

For example, see these two C++ programs running under 32-bit Windows:

int main()
{
    ; CS:EIP at this point is, (say) 010Ch : 00004000h. 
    int flags = MB_OK;
    char* msg = "Hello there";
    char* title = "Title";
    MessageBox(0,msg,title,flags);     ; Address of message box is (say) 00547D45h
}

This allows the application programmer to never consider segmentation. All pointers are near, there are no segments (all have the same value), and thus creating applications is easier. There is no thing as "small/large" model, because all the stuff is within the same segment.

Because of its simplicity, the "flat mode" is now the mode used by most common 32-bit OSs, and also it is the only one that exists in 64-bit mode.

VM86 Mode

So far all nice with protected mode, but many of the existing applications were real-mode at that time. Even today, many (mostly games) are played under Windows. To force these applications (which think they own the machine) to cooperate, a special mode should be created.

The VM86 mode is a special flag to the EFlags register, allowing a normal 16-bit DOS memory map of 640KB which is of course forwarded via paging to the actual memory - this makes it possible to run multiple DOS applications at the same time without risking any chance for one application to overwrite another. EMM386.EXE, the old known memory manager, puts the processor to that state. The OS performs a step-by-step watching to the process, making sure that the process won't execute something illegal (so don't expect to enter protected mode when EMM386.EXE is loaded because once you try to set the GDT with LGDT, you will be sacked :).

Once the VM flag is set, you can load a normal "segment" to a segment register. Interrupt calls by DOS applications are caught by the OS and emulated through it - if possible. Also, some instructions are ignored, for example, if you do a CLI, the interrupts are not actually disabled. The OS sees that you prefer to not be interrupted and acts accordingly, but interrupts are still there.

All VM86 code executes in PL 3, the lowest privilege level. Ins/Outs to ports are also captured and emulated if possible. The interesting thing about VM86 is that there are two interrupt tables, one for the real and one for the protected mode. But only protected mode interrupts are executed.

VM86 was removed from 64-bit mode, so a 64-bit OS cannot execute 16-bit DOS code anymore. In order to execute such code, you need an emulator such as DosBox.

HIMEM.SYS

HIMEM is the generic extended memory manager for DOS. At that time, extended memory was mostly, if not totally, used to cache data from the disk, especially from big apps. HIMEM puts the CPU in unreal mode and provides a simple interface to the applications that want more memory without messing with the protected mode details. By enabling the A20 line, HIMEM allowed a portion of DOS COMMAND.COM to reside in the high memory area. Because unreal mode is still real mode, your protected mode application can do the stuff we discussed even if HIMEM.SYS is loaded.

EMM386.EXE

At that time, a form of memory now eliminated, the "expanded" memory, existed. Many applications were written to take advantage of it, but the modern standard was the protected mode. EMM386 puts the CPU in VM86 mode and maps via paging memory over 1MB to real mode segments (over 0xA0000), so an application that would like to use expanded memory can use it via EMM386.EXE. In addition, EMM386 allowed "devicehigh" and "loadhigh" commands in CONFIG.SYS, allowing applications to get loaded to these high segments if possible.

Because VM86 mode is protected mode, your protected mode application cannot do the stuff we have discussed if EMM386 is loaded, for once you try LGDT your program will be terminated because ring-3 applications (remember that VM86 mode is protected mode in which the DOS application runs in ring-3), cannot set the GDT.

DPMI

DOS Protected Mode Interface is a system that allows DOS applications to run 32-bit code. Unreal mode was not enough because it only allows data to be moved, but not code to be executed. What a DPMI server does is to take care of the nasty tables we have discussed above, allowing the executable to specify 32-bit code directly. When the executable calls DOS, the DPMI server catches the call, switches to real mode, calls DOS, then back to protected mode.

Debugging Protected Mode

As you have figured by now, your real mode debugger won't work. However, 386SWAT will work. What you have to do is to reserve around 50 entries in the GDT for it, then call it (via an interrupt while in real mode) to fill these entries. Once you enter protected mode and setup the stack, you can place an INT 3; 386SWAT that will pop up.

See my code for further details.

What is missing from the 286 protected mode? 32-bit segments (286 only has 16-bit segments), VM86, and paging. Also, because 286 lacks the CR0, it uses SMSW/LMSW to read/write the MSW in order to enter protected mode, but this can't be used to exit from protected mode. Because of this, the processor has to be put to the triple fault state (i.e., reset manually), to return to real mode, which makes the Unreal mode impossible.

Long Mode

The x64 CPU has three modes of work:

Real mode, same as in DOS
Legacy mode, same as 32-bit protected mode
Long mode

Long mode has two sub-modes:

Compatibility mode: same as 32-bit protected mode. This allows a 64-bit OS to run 32-bit applications.
64 - bit mode: for 64-bit applications

To work in Long mode, the programmer must take into consideration the facts below:

Unlike Protected mode, which could ran without paging or PAE, Long mode absolutely needs PAE and paging. That is, you cannot leave paging out even if your map is "see-through". You have to create PAE - style page tables and "flat" mode is the only valid mode in long mode. Almost no segmentation.
AMD docs say that, in order to enter long mode, you have to enter protected mode - however, this has proven not to be true, since you can now get into long mode directly from real mode, by enabling protected mode and long mode within one instruction (this can work because Control Registers are accessible from real mode).

Global Descriptor Table

Creating a 64-bit segment
- A segment marked for 64-bit is pretty much the same like a 32-bit segment with a limit of 4GB, but with the L bit set to 1 and the D bit set to 0. The D bit is set to 0 in 16-bit segments, but when L bit is set, it indicates a 64-bit segment.
64-bit segments always start from 0 and always end in 0xFFFFFFFFFFFFFFFF.

If your GDT resides in the lower 4GB of memory, you need not change it after entering long mode. However, if you plan to call SGDT or LGDT while in long mode, you must now deal with the 10-byte GDTR, which holds two bytes for the length of the GDT and 8 bytes for the physical address of it.

Any selector you might load to access a 64-bit segment is ignored, and DS, ES, SS are not used at all. End of an era.

Interrupts

You have to reset the IDT to use 64-bit descriptors.

Each entry in it is now 16 bytes, describing the location of the interrupt handlers in 64-bit mode.

struc IDT_STR 
{
 .ofs0_15 dw ofs0_15
 .sel dw sel
 .flags db flags        
 .ofs16_31 dw ofs16_31
 .ofs32_63 dd ofs32_63
 .zero dd zero
}

Entering Long Mode

; Disable paging, assuming that we are in a see-through.
mov eax, cr0 ; Read CR0.
and eax,7FFFFFFFh; Set PE=0
mov cr0, eax ; Write CR0.

mov eax, cr4
bts eax, 5
mov cr4, eax

mov ecx, 0c0000080h ; EFER MSR number. 
rdmsr ; Read EFER.
bts eax, 8 ; Set LME=1.
wrmsr ; Write EFER.

; Enable Paging to activate Long Mode. Assuming that CR3
' is loaded with the physical address of the page table.
mov eax, cr0 ; Read CR0.
or eax,80000000h ; Set PE=1.
mov cr0, eax ; Write CR0.

Turn off paging, if enabled. To do that, you must ensure that you are running in a "see through" area.
Set PAE, by setting CR4's fifth bit.
Create the new page tables and load CR3 with them. Because CR3 is still 32-bits before entering Long mode, the page table must reside in the lower 4GB.
Enable Long mode (note, this does not enter Long mode, it just enables it).
Enable paging. Enabling paging activates and enters Long mode.

Because the rdmsr/wrmsr opcodes are also available in Real mode, you should be able to activate Long mode from Real mode directly.

Entering 64-bit

Now you are in compatibility mode. Enter 64-bit mode by jumping to a 64-bit code segment:

db 0eah
dd LinearAddressOfStart64
dw code64_idx

The initial 64-bit segment must reside in the lower 4GB because compatibility mode does not see 64-bit addresses.

Note that you must use the linear address, because 64-bit segments always start from 0. Note also that if the current compatibility segment is 16-bit default, you have to use the 066h prefix.

The only thing you have to do in 64-bit mode is to reset the RSP:

mov rsp,stack64_end

SS, DS, ES, are not used in 64-bit mode. That is, if you want to access data in another segment, you cannot load DS with that segment's selector and access the data. You must specify the linear address of the data. "Flat" mode is not only the default, it is the only one for 64-bit. However GS and FS can still be used as auxilliary registers and their values are still subject to verification from the GDT.

Once you are in 64-bit mode, the defaults for the opcodes (except from jmp/call) are still 32-bit. So a REX prefix is required (0x40 to 0x4F) to mark a 64-bit opcode. Your assembler handles that automatically if it supports a "code64" segment.

In addition, a 64-bit interrupt table must now be set with a new LIDT instruction, this time taking a 10-byte operator (2 for the length and 8 for the location), and each entry in the IDT table takes 10 bytes, 2 for the selector and 8 for the offset.

Returning to Compatibility Mode

Because 0eah is not a valid jump when in 64-bit mode, you have to use the RET trick to get back to a compatibility mode segment.

push code32_idx    ; The selector of the compatibility code segment
xor rcx,rcx    

mov ecx,Back32    ; The address must be an 64-bit address,
                  ; so upper 32-bits of RCX are zero.
push rcx
retf

This gets you back to compatibility mode. 64-bit OSs keep jumping from 64-bit to compatibility mode in order to be able to run both 64-bit and 32-bit applications.

Why do Windows drivers have to be 64-bit for a 64-bit OS? Because no WOW64 for driver (ring 0) code exists. They could have created one if they wanted to - I guess they wanted to force manufacturers to finally move to 64-bit. Nice decision, I must admit.

Exiting from Long Mode

You have to setup all the registers again with 32-bit selectors - back to segmentation. Also you must be in a see-through area because to exit long mode you must deactivate paging.

; We are now in Compatibility mode again
mov ax,stack32_idx 
mov ss,ax 
mov esp,stack32_end 
mov ax,data32_idx 
mov ds,ax
mov es,ax
mov ax,data16_idx
mov gs,ax
mov fs,ax

; Disable Paging to get out of Long Mode
mov eax, cr0 ; Read CR0.
and eax,7fffffffh ; Set PE=0.
mov cr0, eax ; Write CR0.

; Deactivate Long Mode
mov ecx, 0c0000080h ; EFER MSR number. 
rdmsr ; Read EFER.
btc eax, 8 ; Set LME=0.
wrmsr ; Write EFER.

; Back to the dirty, old, protected mode :(

Unreal Mode in 64-bit

I am sorry for I made you feel well for the moment. There is no such thing that would allow you to access over 4GB of RAM from real mode (unless AMD has an Easter egg in its CPU). In addition, although the 32-bit registers EAX, EBX, etc., are available in Real mode, the 64-bit registers RAX, RBX are not even available in compatibility mode - only in 64-bit mode. Or, who knows?

Virtual 86 Mode in 64-bit

Once the CPU enters Long mode, VM86 is not supported anymore. That is the reason why 64-bit OSs cannot run 16-bit applications. However, emulators like DosBox will run fine your 16-bit old game.

DPMI for 64-bit

In your dreams only. Or even, in my dreams only. Perhaps I will make it one day. Hey, is there any DOS game there that needs more than 4GB of RAM?

Debugging Long Mode

Unfortunately, protected mode debuggers do not work in long mode, just like real mode debuggers will not work in protected mode.

However, Bochs with my GUI debugger will work nicely.

The Code

The code presented here messes with everything we have discussed so far. It has yet some dirty functions, but it works. Have fun with it!

History

30/09/2014 - Wow after 5 years, at least some typo fixing.
02/12/2009 - First release.

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.

A list of licenses authors might use can be found here