Github link: https://github.com/WindowsNT/asm now with VS solution and automatic compilation/ISO generation/bochs/vmware/virtualbox running.

The Infamous Trilogy: Part 1

This article targets the user who wants to know how the CPU works. I will explain some assembly basics, the real mode, the protected mode, and the long mode. Working Assembly code is included, so you can test for yourself how the processor works in real mode, how protected mode is entered, how we get into 64-bit mode, and finally, how to exit from all of them and go back to DOS.

Do you dare to follow? Let's go!

Requirements

Assembly knowledge. While you will not be writing code, basic knowledge such as registers, memory access, basic commands, and such stuff is helpful.
Flat Assembler, a modern assembler which can make executables for Win32, x64, and DOS.
FreeDOS.
Bochs, the must-have tool for system developers. I recommend it, not only because it is free, but because it has a debugger that will trap any exceptions your program will generate and tell you what happened.

In the distant past, Borland's Turbo Assembler (TASM) was used because it was the first one to allow 32-bit segments. But since then, quite a few things have changed, and TASM is dead. All the code here is configured to compile with the modern FASM, which can create executables for both DOS and Windows, 16/32/64 bit.

Visual Studio includes ML.EXE and ML64.EXE, the newer MASM versions. However, these assemblers only output Windows executables and only for their respective architecture, so ML.EXE only outputs 32-bit flat code for Win32, and ML64.EXE only outputs 64-bit flat code for Win64.

The code is a VS solution that compiles with FASM, then uses PowerShell to create an ISO containing the program entry.exe. When you run it, it launches Bochs with a configuration file that boots a FreeDOS disk with a CD-ROM drive containing entry.exe, which you can then run.

There are still some bugs for now, but keep going!

Assembly in General

Introduction

Assembly language is basically a collection of low-level instructions (opcodes) that do useful work and access memory. To make things easier, there are registers, that is, places to get and set data.

Segmentation

Memory is not treated as a continuous array of bytes (like a C array). It is divided into segments. A segment has a different meaning depending on the CPU mode (Real, Protected, or Long mode). Each memory address is referred to by a segment register, which holds the segment value, and an offset indicating the distance from the start of the segment.

Stack

In order to make it easier for functions to have local variables (like in C++) and transfer data between them, each application sets up a special segment called the "stack segment", which holds the memory used for the stack. The stack is a "LIFO" vector: the last item you push to it is the first one you get from it.

Pointer Access

It is like pointers in C. If you have a variable that contains a pointer in C, you can use * to access the data. In assembly, you do the same with []:

; assume that DS has the data segment
mov si,000fh
mov dx,[si] ; Now dx has whatever was placed in DS:000fh

What if you try to access something that doesn't exist? In DOS, you can trash yourself or the OS or both; in Protected mode systems, you will not be allowed to access non-existent memory and thus, the exception handler will be called (or else, your program will be sacked).

Function Calls and Interrupts

Just as in C++, there are function calls in assembly (no, not everything is implemented with goto :). Depending on the programming model (you will learn this later), a function call can be near (in which the current IP is pushed to the stack and the function exits with a RET), far (in which the IP and CS are pushed, and the function exits with a RETF), or interrupt.

An Interrupt is basically a handler that gets executed when something calls it. Many times, these are software interrupts, which means that the program calls it using the INT instruction. In fact, all DOS/BIOS services are provided through interrupts, so if the programmer does:

; assume that DS has the data segment
mov ah,9
mov dx,msg;
int 21h;

then the programmer knows that it has called function 9 of INT 21h, which is a DOS function to display the string pointed to by DS:DX. The address for each interrupt (there are 256) is stored in the Interrupt Table, whose location can be read with the SIDT instruction, and it is different depending on the processor mode (you will learn about that later). IP, CS, and the flags are pushed to the stack, and the interrupt exits with IRET.

An interrupt can also happen when something else occurs (usually, an exception). For example, when your code divides something by zero, INT 00h is automatically called. The code for INT 00h sees that you have tried to divide something with zero and thus, you can't continue. So Windows displays a nice message and closes your application.

If you could install an INT 00h handler yourself (via the DOS function 25h), then the exception would get to your code before making it to Windows (that's what the structured exception handling basically does), but you must still fix the error. If you can't fix the error, then you must abort - pretty much what Windows does.

In DOS, the default exception handlers do little more than block the CPU from further processing, so you can only press Ctrl+Alt+Del to reboot.

Registers

16-bit Registers

AX
BX
CX
DX
SI
DI
BP
SP
IP
16-bit Flag Register

IP holds the current execution point. As commands are executed, IP changes its value automatically.

AX, BX, CX, DX can be accessed either entirely:

mov ax,1    ; ax is now 1
mov cx,ax  ; cx is now also 1

or using their low 8 bits (al, bl, cl, dl) or high 8 bits (ah, bh, ch, dh):

xor ax,ax  ; ax is now 0

mov ah,1    ; ah is now 1, thus ax is now 0100h
mov al,2   ; al is now 2, thus ax is now 0102h
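
To make the relation between AX and its two halves concrete, here is a minimal sketch in Python (not assembly, so it can be run anywhere). The helper names `set_high` and `set_low` are my own, invented only for this illustration; they model AX as a plain 16-bit integer.

```python
def set_high(ax, value):
    """Return AX with its high byte (AH) replaced by value."""
    return ((value & 0xFF) << 8) | (ax & 0x00FF)

def set_low(ax, value):
    """Return AX with its low byte (AL) replaced by value."""
    return (ax & 0xFF00) | (value & 0xFF)

ax = 0                 # xor ax,ax
ax = set_high(ax, 1)   # mov ah,1
ax = set_low(ax, 2)    # mov al,2
print(hex(ax))         # AX as one 16-bit value
```

Writing to one half never disturbs the other half, which is exactly what the masks express.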

SI, DI are always accessed as 16-bit registers (there is no sl, sh) and they are generally used as pointers to data. BP can also be used as a generic-purpose register (although it is usually used to access the stack), and SP holds the pointer to the current stack entry. So let's see what happens when you put something to the stack:

; assume that SP has the value of 0100h
mov ax,3   ; ax is now 3
push ax    ; ax is put to the stack, SP has now the value of 00FEh; (100h - 2)
mov dx,5   ; dx is now 5
push dx    ; dx is put to the stack, SP is now 00FCh;
pop bx     ; bx gets the value from the stack top (5). SP gets back to 00FEh
pop cx     ; cx gets the value from the stack top (3). SP gets back to 0100h
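
The same push/pop sequence can be followed with a toy model. This is only a sketch in Python under simple assumptions: the class `Stack16` is hypothetical, the stack segment is modeled as a dictionary, and each entry is one 16-bit word.

```python
class Stack16:
    """Toy model of a 16-bit stack: SP moves down by 2 on push, up by 2 on pop."""

    def __init__(self, sp):
        self.sp = sp
        self.memory = {}                      # offset in the stack segment -> word

    def push(self, value):
        self.sp = (self.sp - 2) & 0xFFFF      # SP is decremented first
        self.memory[self.sp] = value & 0xFFFF

    def pop(self):
        value = self.memory[self.sp]
        self.sp = (self.sp + 2) & 0xFFFF      # SP is incremented afterwards
        return value

stack = Stack16(sp=0x0100)
stack.push(3)        # push ax -> SP = 00FEh
stack.push(5)        # push dx -> SP = 00FCh
bx = stack.pop()     # pop bx  -> SP = 00FEh
cx = stack.pop()     # pop cx  -> SP = 0100h
```

The last value pushed is the first one popped, and SP ends where it started.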

What happens when we push more than the current stack can hold? Boom, stack overflow. You have probably encountered it in your C++ recursive functions.

The flags register is a set of 16 bits (not all are actually used) that change their value depending on the operation of each opcode. The conditional variants of the JMP command (JZ, JAE, JB, etc.) jump depending on those flags. For example, the ZF (Zero Flag) is set to 1 when the result of an operation is zero:

mov ax,bx    ; Get value of BX to AX
or  ax,ax    ; Is AX 0? If yes, or ax,ax will also say 0, so ZF will be 1
je  AxIsZero 

jmp AxIsNotZero

You can use pushf and popf to read the flags into a register or to set them from one, for example:

pushf  ; push flags to stack
pop ax ; ax has now the flags

or al,1 ; set the bit 0 to 1

push ax
popf

16-bit Segment Registers

CS
DS
SS
ES
FS
GS

These registers hold a value that identifies the current segment. The way this value is interpreted depends on the current CPU mode (Real/Protected/Long).

CS always holds the segment of the currently executing code. You cannot set CS by using, say, mov cs,ax. When you call a function that resides in another segment (FAR call), or when you jump to another segment (FAR jump), CS changes.

DS holds the default data segment. That means that if you do this:

mov si,0

mov ax,[si]
mov bx,[1000h]

Then AX gets its value from the segment pointed to by DS, with the offset specified by SI, and BX gets its value from the segment pointed to by DS with an offset of 1000h. If you want to use another segment, you must do so explicitly:

mov di,0
mov ax,[fs:di]
mov bx,[es:1000h]

When BP is used as a base, SS is the default segment. In string instructions (MOVS, STOS, and so on), the destination pointed to by DI is always taken from ES. In all other cases, DS is the default segment. Note that not every register can be used as an index in real mode; for example, mov ax,[dx] is not valid.

ES, FS, and GS are general-purpose auxiliary segment registers. SS holds the value of the stack segment.

32-bit Registers

32 bit registers are available in all of the modes (real, protected, and long).

EAX
EBX
ECX
EDX
ESI
EDI
EBP
ESP
EIP
32-bit Flag Register

Each of them is an extension of their relative 16-bit register. For example:

mov eax,0   ; eax is now 0, so ax is also 0.

mov ax,1    ; ax is now 1, eax is also 1
or eax,0FFFF0000h ; ax is now 1, eax is now 0FFFF0001h

The 32-bit registers are usable in Real mode, but indexes (EDI and ESI) cannot be used unless their upper 16 bits are zero (that is, you can use a max index of 65535 only). But see below for Unreal mode, which provides a way around this.

64-bit Registers

64 bit registers are only available when the processor is in 64-bit mode. They are not available in real or protected or compatibility mode.

RAX
RBX
RCX
RDX
RSI
RDI
RBP
RSP
RIP
64-bit Flag Register

In addition, x64 defines 8 more 64-bit registers (r8, r9, r10, r11, r12, r13, r14, r15) to be used as auxiliary registers, and some 128-bit registers to be used when programming multimedia.

In this manual, when operations are applied to both 32-bit and 64-bit registers, I use the notation X(reg), for example XIP, XSP etc.

Control Registers

CR0
CR1
CR2
CR3
CR4

These registers hold information about the current state of the CPU.

CR0 is mostly used to set the CPU to protected mode (bit 0), and enable paging (bit 31).
CR1 is reserved.
CR2 holds the Page Fault Linear address when a page fault exception is triggered.
CR3 holds the address of the paging table.
CR4 defines some other flags, like Physical Address Extension (PAE) and the Virtual-8086 Mode Extensions (VME).
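
The two CR0 bits mentioned above are plain bit masks. A minimal sketch in Python, with constant and function names (`CR0_PE`, `CR0_PG`, `set_bits`, `clear_bits`) chosen by me for illustration:

```python
CR0_PE = 1 << 0     # bit 0: Protection Enable
CR0_PG = 1 << 31    # bit 31: Paging

def set_bits(cr0, mask):
    """Model of 'or eax,mask' on a 32-bit register."""
    return (cr0 | mask) & 0xFFFFFFFF

def clear_bits(cr0, mask):
    """Model of 'and eax,not mask' on a 32-bit register."""
    return cr0 & ~mask & 0xFFFFFFFF

cr0 = 0x00000010                 # some starting value, assumed for the example
cr0 = set_bits(cr0, CR0_PE)      # enter protected mode
cr0 = clear_bits(cr0, CR0_PE)    # and leave it again
```

Note that the mask used to clear bit 0 must keep all other 31 bits, so as a 32-bit constant it is FFFFFFFEh (eight hex digits).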

For more on these, see Control Registers in Wikipedia.

Debug Registers

DR0
DR1
DR2
DR3
DR6
DR7

These registers hold information for hardware debugging. DR0-DR3 hold the linear address of 4 breakpoints, and DR6-DR7 some flags to use them. See Debug Registers for more.

Test Registers

TR4
TR5
TR6
TR7

These registers used to hold information for CPU testing (now removed from the set). See Test Registers for more.

Real Mode

Addressing and Segmentation

In Real mode, everything is 16 bits. Memory is not accessed with an absolute index from 0; it is divided into segments. The value in a segment register, multiplied by 16, gives the physical address where the segment starts. An offset is then added to refer to a distance from the start of this segment. Together, these two values (segment:offset) tell the CPU the absolute memory address we need to access. For example:

0000h : 0000h: Indicates segment 0, offset 0. 0*16 + 0 = 0 actual memory address.
0100h : 000Fh: Indicates segment 100h, offset 0Fh. 100h*16 + 0Fh = 100Fh = actual memory address.
0002h : 0000h: 2h*16 + 0 = 20h actual memory address.
0001h : 0010h: 1h*16 + 10h = 20h actual memory address. As you can see, memory addresses can overlap.
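
The arithmetic in these examples fits in one line. Here is a small sketch in Python; the function name `linear_address` is mine, and the A20 wrapping discussed later is deliberately ignored here.

```python
def linear_address(segment, offset):
    """Real mode address: segment * 16 + offset, both treated as 16-bit values."""
    return ((segment & 0xFFFF) << 4) + (offset & 0xFFFF)

print(hex(linear_address(0x0100, 0x000F)))
print(linear_address(0x0002, 0x0000) == linear_address(0x0001, 0x0010))
```

Two different segment:offset pairs giving the same result is the overlap mentioned above.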

Because the segment and the offset are only 16-bit values, the memory accessible by this method ends just below 0ffffh : 0010h = 1MB. Specifying the 0ffffh segment and an offset of 0010h or larger results in wrapping back to address 0 (see protected mode, A20 line). And because the area from 0a000h:0000 upwards is reserved for the system (screen, etc.), only 640 KB remains for DOS applications.

In addition, all segments have read/write/execute access from anywhere (that is, any program can read/write or execute code within any segment). Because in 16-bit real mode OS the CPU sees the memory the way above, any application can read from or write to any part of memory, including the part in which the OS resides. That is why a real mode OS is a single tasking OS.

In real mode, CS:IP holds the current execution point, DS holds the default data segment, and SS holds the stack segment. Any application that has more than 64K of a code or data segment must break it into multiple segments.

Interrupts

Interrupts are simply special functions that are called when something happens (hardware interrupt or exception, like a division by zero), or when called by software (by using the INT instruction - software interrupt). In real mode, there are 256 interrupts. The table that holds the segment:offset for each interrupt is initially located at absolute address 0, but (in 286+) may be put elsewhere by using the LIDT instruction (use SIDT to get the table address).
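
To see how the real mode table is laid out, here is a sketch in Python that assumes the table sits at address 0 and that each entry is 4 bytes, with the offset stored before the segment. The function names and the handler address 0F00h:1234h are made up for the example.

```python
import struct

def ivt_entry_address(int_no):
    """Physical address of the 4-byte vector for interrupt int_no (table at 0)."""
    return int_no * 4

def read_vector(memory, int_no):
    """Return (segment, offset); in memory the offset word comes first."""
    offset, segment = struct.unpack_from("<HH", memory, ivt_entry_address(int_no))
    return segment, offset

memory = bytearray(256 * 4)                                  # 256 vectors
# Pretend a handler for INT 21h lives at 0F00h:1234h (hypothetical address).
struct.pack_into("<HH", memory, ivt_entry_address(0x21), 0x1234, 0x0F00)
print(read_vector(memory, 0x21))
```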

In Real mode, the OS provides features to the application via software interrupts, for example, DOS provides a range of functions in INT 21h.

Program Execution

The program gets loaded by DOS into a memory segment, and execution starts at the offset that is specified in the EXE's header (or at 0100h if it is a COM file which has no header). After that, the application is free to do anything, to completely trash the memory. This is a Real mode "feature": an application owns the entire machine. In addition, the application is allowed to communicate directly with any hardware (via in/out opcodes), thus bypassing any limitations or security restrictions the OS might have. And if the application crashes, the entire system crashes and you have to reboot.

Code

Here is an easy "Hello World" sample for a 16-bit EXE that uses multiple segments:

FORMAT MZ                  ; DOS 16-bit EXE format
ENTRY CODE16:Main          ; Specify the entry point (i.e. the start address)
STACK STACK16:stack_top    ; Specify the initial SS:SP (the top of the stack area)
    
SEGMENT CODE16_2 USE16  ; Declare a 16-bit segment
    
    ShowMsg:
        mov ax,DATA16
        mov ds,ax            ; Load DS with our "default data segment"
        mov ax,0900h    
        mov dx,Msg    
        int 21h             ; Call a DOS function: AH = 09h (Show Message), 
                            ; DS:DX = address of a '$'-terminated string
    retf                    ; FAR return; we were called from 
                            ; another segment so we must pop IP and CS.
    
SEGMENT CODE16 USE16         ; Declare a 16-bit segment
    ORG 0                    ; Says that the offset of the first opcode 
                             ; of this segment must be 0.
    
    Main:
        call CODE16_2:ShowMsg ; Direct far call to a procedure in another segment.
                              ; CS and IP are pushed to the stack.
        mov ax,4c00h          ; Call a DOS function: AH = 4Ch (Exit), AL = exit code 0
        int 21h
    
SEGMENT DATA16 USE16
    Msg db "Hello World!$"
        
SEGMENT STACK16 USE16
    stackdata dw 1024 dup(0)  ; reserve 2048 bytes as stack
    stack_top:                ; When the program is initialized, SS:SP is 
                              ; automatically set to STACK16:stack_top.

How does the assembler know the actual value of the DATA16, CODE16, CODE16_2, and STACK16 segments? It doesn't. It writes placeholder values and creates entries in the EXE file (known as "relocations"), so the loader, once it has copied the code into memory, writes the true segment values at the specified addresses. And because this relocation table lives in the EXE header, which COM files lack, COM files cannot have multiple segments even if they sum to less than 64KB in total.
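
What the loader does with those relocation entries can be sketched in a few lines. This is a simplified model in Python, not the real DOS loader: I assume each relocation is a (segment, offset) pair relative to the start of the image, and that the word stored at that spot is a segment value relative to the image start, to which the load segment is added. The names `apply_relocations`, `image`, and the load segment 1000h are all invented for the example.

```python
import struct

def apply_relocations(image, relocations, load_segment):
    """Add load_segment to every 16-bit word listed in relocations.

    image        -- bytearray with the program as stored in the EXE
    relocations  -- list of (segment, offset) pairs relative to the image start
    load_segment -- segment at which the loader placed the image
    """
    for seg, off in relocations:
        pos = seg * 16 + off
        (value,) = struct.unpack_from("<H", image, pos)
        struct.pack_into("<H", image, pos, (value + load_segment) & 0xFFFF)
    return image

# Toy image: "mov ax,imm16" (opcode B8h) whose operand is a segment value
# relative to the start of the image (here 0002h).
image = bytearray(b"\xB8\x02\x00" + bytes(13))
apply_relocations(image, [(0, 1)], load_segment=0x1000)
print(image[:3].hex())
```

After the fix-up, the instruction loads the segment as it really exists in memory.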

This program calls a function ShowMsg in another segment via a far call, which uses a DOS function (09h, INT 21h) to display text. However, it could do it as well by writing directly into the video buffer (which, for color text mode, resides in segment 0b800h), thus bypassing any OS or any security the function 09h might implement. Therefore, multitasking is not possible because each application can easily write to anywhere, thus destroying another application's or the OS's data.

Here is an easy "Hello World" sample for 16-bit COM:

org    100h         ; code starts at offset 100h
use16               ; use 16-bit code
                    ; The SS is same as CS (since only one 
                    ; segment is here) and SP is set to 0xFFFE

mov ax,0900h
mov dx,Msg
int 21h;
mov ax,4c00h
int 21h

Msg db "Hello World!$"

What are the differences here? Everything (code, data, stack) must reside in one segment. Code must start at offset 100h (to allow DOS to put information in the first 100h bytes), and no stack segment or data segment needs to be defined. COM files are plain memory images and are limited to 64KB. For that reason, COM files are rarely used.

Generally, a DOS program consists of some code segments, some data segments, and a stack segment like above. A DOS program calls DOS and BIOS functions (through Interrupts) and accomplishes its task.

Programming Models

Because segments are limited to 64KB, there are many programming models depending on the application's requirements:

Tiny, when everything has to fit in one single segment (COM files).
Small, when there is one code segment and one data segment. Calls and jumps are near.
Medium, when there is one data segment but more code segments. Calls and jumps are far.
Compact, when there is one code segment but more data segments. Calls and jumps are near.
Large, when there are more code and data segments. Calls and jumps are far.
Huge, when the data structures exceed 64KB in size and thus, they have to programmatically be split into segments.

The most common models are the Small and the Large.

Protected Mode

Segments

In 32-bit protected code (we are not discussing 16-bit protected mode here because it is very rare), a segment can have any size, from 1 byte to 4GB. The OS defines the size of each segment, and now each segment can have limitations (read, write, execute on or off). This allows the OS to "protect" the memory. In addition, there are 4 levels of authority (0 to 3, 0 = highest), so, for example, when a user application runs in level 3, it cannot touch the OS which runs at level 0.

In addition, if a 32-bit protected mode task crashes, OS catches the exception and terminates the program safely without crashing any other application or the OS itself. This way, true multitasking can occur.

Multitasking

Many people believe that multitasking is the art of running applications at the same time. This is not true, for one CPU core can only execute one command at a time. What is really happening is that OS permits Task #1 to run for X time, switches to Task #2, permits it to run for X time, switches to Task #3, and this is so fast that it appears that it is simultaneous.
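
The time slicing described above can be sketched as a simple round-robin loop. This is only an illustration in Python under my own assumptions: the task names, the time units, and the function `round_robin` are hypothetical, and a real OS switches tasks from a timer interrupt rather than from a loop.

```python
from collections import deque

def round_robin(tasks, time_slice):
    """tasks: dict of name -> remaining time units.
    Returns the order in which the tasks received a slice."""
    queue = deque(tasks.items())
    timeline = []
    while queue:
        name, remaining = queue.popleft()
        timeline.append(name)              # this task runs for one slice
        remaining -= time_slice
        if remaining > 0:                  # not finished: back to the end of the line
            queue.append((name, remaining))
    return timeline

print(round_robin({"Task1": 3, "Task2": 1, "Task3": 2}, time_slice=1))
```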

A-20 line

Enabling the A20 line is the first step to use memory above 1MB. This trick (available in 286+) makes 0xFFF0 bytes of extra RAM (the range 0ffffh:0010h through 0ffffh:0ffffh) accessible in real mode. Enabling the line (via the keyboard controller! - yes I don't even understand why) stops the CPU from wrapping addresses at 1MB. This memory (known as the High Memory Area, HMA) is used by HIMEM.SYS to load parts of DOS into it and therefore make more low memory available for applications.
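
The effect of the A20 line on an address can be modeled as masking off address bit 20. A minimal sketch in Python, with `physical_address` being a name I made up for the illustration:

```python
def physical_address(segment, offset, a20_enabled):
    """Real mode address, with address line 20 forced to 0 when A20 is disabled."""
    linear = (segment << 4) + offset
    if not a20_enabled:
        linear &= 0xFFFFF          # wrap into the first megabyte
    return linear

first = physical_address(0xFFFF, 0x0010, a20_enabled=True)   # first byte above 1MB
last = physical_address(0xFFFF, 0xFFFF, a20_enabled=True)    # last reachable byte
hma_size = last - first + 1
print(hex(first), hex(last), hex(hma_size))
```

With A20 disabled, the same 0ffffh:0010h address lands on physical address 0, which is what the A20 check routine further down relies on.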

The following code enables/disables A20. Note that if HIMEM.SYS is installed, A20 is enabled by default. HIMEM.SYS should be queried to alter A20 status instead of doing it directly.

WaitKBC:
   mov cx,0ffffh
   A20L:
   in al,64h
   test al,2
   loopnz A20L
ret

ChangeA20:
   call WaitKBC
   mov al,0d1h
   out 64h,al
   call WaitKBC
   mov al,0dfh ; use 0dfh to enable and 0ddh to disable.
   out 60h,al
ret

The following code checks A20 and returns 1 to CF if it is enabled, 0 otherwise.

CheckA20:
    PUSH ax 
    PUSH ds
    PUSH es 

    XOR ax,ax 
    MOV ds,ax 
    NOT ax 
    MOV es,ax 
    MOV ah,[ds:0] 
    CMP ah,[es:10h] 
    JNZ A20_ON 

    CLI 
    INC ah 
    MOV [ds:0],ah 
    CMP [es:10h],ah 
    PUSHF 
    DEC ah 
    MOV [ds:0],ah 
    STI 
    POPF 
    JNZ A20_ON 

    CLC 
    POP es
    POP ds
    POP ax 
    RET 

A20_ON: 
    STC 
    POP es
    POP ds
    POP ax 
RET

Global Descriptor Table Type 1: Application Entries

The Global Descriptor Table is a table that contains all the globally visible segments. Each segment has properties like:

Size
Base address (physical address in memory)
Access restrictions

For Protected mode, the system maintains the GDTR register (accessible via SGDT / LGDT), which contains 6 bytes of data:

2 bytes - size of the entire array in bytes (strictly speaking, the size minus 1). Because each GDT entry is an 8-byte entry, a maximum of 8192 entries may be specified.
4 bytes - physical address of the GDT array in memory.
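
As a quick check of this 6-byte layout, here is a sketch in Python using the standard `struct` module. The function name `pack_gdtr` and the sample base address are assumptions for the example; the limit is computed as the table size minus 1, which is how the CPU interprets the field.

```python
import struct

def pack_gdtr(base, entry_count):
    """Return the 6-byte operand of LGDT: 16-bit limit, then 32-bit base."""
    limit = entry_count * 8 - 1            # table size in bytes, minus 1
    return struct.pack("<HI", limit, base)

MAX_ENTRIES = 0x10000 // 8                 # 64KB of table / 8 bytes per entry

gdtr = pack_gdtr(base=0x00012340, entry_count=7)
print(gdtr.hex(), MAX_ENTRIES)
```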

There are two types for a GDT entry. An entry for the application (S flag == 1, see below), and an entry for the OS (S flag == 0).

The definition for a GDT for an application entry as a C++ structure is this:

struct GDT_STR
{
    unsigned short seg_length_0_15;
    unsigned short base0_15;
    unsigned char  base16_23;
    unsigned char  flags;
    unsigned char  seg_length_16_19:4;
    unsigned char  access:4;
    unsigned char  base24_31;
};

Although this seems easy, it is more complicated than you might think. Let's examine the fields. Note that the analysis below is for the S bit set to 1 (user GDT). If this bit is 0, then we talk about System related GDT entries (more on that below).

seg_length
- A 20-bit value describing the length of the segment. If the G flag (see below) is not set, this value represents the actual segment length. If the G flag is set, this value is multiplied by 4096 to represent the segment length. So if you set it to FFFFFh (20 bits) and G is set, it is 100000h * 4096 = 4GB.
base
- A 32-bit value indicating the start of the segment in physical memory.
flags
- Flags for the segment
  - Bit 3: Type
    - 0 - Data
    - 1 - Code
  - Bit 2: Subtype
    - For Code Segment (Type == 1)
      - 0 - Not conforming.
      - 1 - Conforming. A conforming segment can be called from any segment that has equal or lower privilege, and the code keeps running at the caller's privilege level. So if a segment is conforming with privilege level 0, you can call it from a privilege level 1, 2, or 3 segment. If the segment is not conforming, then it can only be called directly from a segment with the same privilege level.
    - For Data Segment (Type == 0)
      - 0 - Expand up. The segment starts at its base address and ends at its limit.
      - 1 - Expand down. The segment starts at its limit and ends at its base, with the addresses going the reverse way. This flag was created so a stack segment could be easily expanded, but it is not used by today's OSs.
  - Bit 1: Accessibility
    - For Code Segment (Type == 1)
      Note that a code segment is not writable. However, because segment base addresses can overlap, you can create a writable data segment with the same base address and limit as a code segment.
      - 0 - Not readable. Any code that tries to read memory from this segment will generate an exception.
      - 1 - Readable.
    - For Data Segment (Type == 0)
      - 0 - Not writable. Any code that tries to write to this data segment will generate an exception. Data segments are always readable.
      - 1 - Writable.
  - Bit 0: Access
    - 0 - Segment is not accessed.
    - 1 - Segment is accessed. The CPU sets this bit when the segment is accessed, so the OS gets an idea of how often the segment is used and knows whether it can swap it out to disk or not.
  - Bit 4: S
    - 0 - This descriptor is for the OS. If this bit is not set, the entire GDT entries have different meanings, discussed below.
    - 1 - This descriptor is for the application.
  - Bit 5-6 : DPL
    - The privilege level of this segment, from 00 (highest) to 11b (3) (lowest).
  - Bit 7 : P
    - Set to 1 to indicate that the segment is present in memory. If the OS caches this segment to the disk, then it sets P to 0. Any attempt to access the removed segment causes an exception. The OS catches this exception, and reloads the segment to memory, setting P to 1 again.
access
- Bit 0: AVL
  - Not used, set to 0.
- Bit 1: L
  - Set to 0 for 32-bit segments. If set to 1, it indicates 64-bit segments used in long mode.
- Bit 2: D
  Real mode segments are always 16-bit default.
  - When D is not set, the default for opcodes is 16-bit. The segment can still execute 32-bit commands by putting the 0x66 or 0x67 prefix to them.
  - When D is set, the default for opcodes is 32-bit. The segment can still execute 16-bit commands by putting the 0x66 or 0x67 prefix to them.
- Bit 3: G
  - Set to 1 to multiply the seg_length by 4096 to find the true segment length as discussed above.

As you saw, the segment might not be present in memory at all, which allows the OS to cache the segment to the disk and reload it only when it is needed.
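
To check that the fields really combine into 8 bytes as described, here is a sketch in Python that packs one entry with the field names of GDT_STR. The function names `pack_descriptor` and `segment_size` are mine. The size helper treats the 20-bit field as the last valid offset (so it adds 1), which is an assumption about how the CPU reads the field rather than something spelled out above.

```python
import struct

def pack_descriptor(base, limit, flags, access):
    """Pack an 8-byte application GDT entry.

    limit  -- 20-bit segment length field
    flags  -- the 'flags' byte (P, DPL, S and the four type bits)
    access -- the 4-bit 'access' field (G, D, L, AVL)
    """
    return struct.pack(
        "<HHBBBB",
        limit & 0xFFFF,                                   # seg_length_0_15
        base & 0xFFFF,                                    # base0_15
        (base >> 16) & 0xFF,                              # base16_23
        flags & 0xFF,                                     # flags
        ((access & 0xF) << 4) | ((limit >> 16) & 0xF),    # access + seg_length_16_19
        (base >> 24) & 0xFF,                              # base24_31
    )

def segment_size(limit, granular):
    """Segment length in bytes for a 20-bit limit field and the G flag."""
    return (limit + 1) * 4096 if granular else limit + 1

# 4GB, 32-bit, readable code segment at base 0: flags 9Ah, access 1100b (G and D set)
code32 = pack_descriptor(base=0, limit=0xFFFFF, flags=0x9A, access=0xC)
print(code32.hex(), segment_size(0xFFFFF, granular=True))
```

Bytes 5 and 6 of the result are 9Ah and CFh, the two values that appear in the descriptor definitions of the assembly listing below.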

The first entry in the GDT table is always 0. CPU does not read information from entry #0 and thus it is considered a "dummy" entry. This allows the programmer to put the 0 value to a segment register (DS, ES, FS, GS) without causing an exception.

The following code creates some GDT entries, then loads them:

struc GDT_STR s0_15,b0_15,b16_23,flags,access,b24_31
; 'access' taken as a byte, it is actually 4+4 bits
{
    .s0_15   dw s0_15
    .b0_15   dw b0_15
    .b16_23  db b16_23
    .flags   db flags
    .access  db access
    .b24_31  db b24_31
}

gdt_start    dw gdt_size
gdt_ptr      dd 0
dummy_descriptor   GDT_STR 0,0,0,0,0,0
code32_descriptor  GDT_STR 0ffffh,0,0,9ah,0cfh,0 ; 4GB 32-bit code, 
       ; 9ah = 10011010b = Present, DPL 00, No System, 
       ; Code Exec/Read. 0cfh access = 11001111b = Big, 32bit,
       ; <resvd 0>, 1111 more size
data32_descriptor  GDT_STR 0ffffh,0,0,92h,0cfh,0 ; 4GB 32-bit data, 
       ; 92h = 10010010b = Present, DPL 00, No System, Data Read/Write
stack32_descriptor GDT_STR 0ffffh,0,0,92h,0cfh,0 ; 4GB 32-bit stack
code16_descriptor  GDT_STR 0ffffh,0,0,9ah,0,0    ; 64k 16-bit code
data16_descriptor  GDT_STR 0ffffh,0,0,92h,0,0    ; 64k 16-bit data
stack16_descriptor GDT_STR 0ffffh,0,0,92h,0,0    ; 64k 16-bit stack
gdt_size = $-(dummy_descriptor)

; For each of the descriptors, I create this code. 
; I've only shown it here for code32_descriptor.
 xor eax,eax
 mov     ax,CODE32 ; the definition of a CODE32 segment USE32 in our code
 shl     eax,4           ; make a physical address
 mov     [ds:code32_descriptor.b0_15],ax ; store the low 16 bits
 shr     eax,16
 mov     [ds:code32_descriptor.b16_23],al ; 
 mov     [ds:code32_descriptor.b24_31],ah ;


; assuming that DS points to the segment which all the above resides
; Set gdt ptr
xor     eax,eax
mov     ax,ds
shl     eax,4 ; By multiplying the segment with 16, we make it a physical address
add     eax,dummy_descriptor ; add the offset of the first entry (32-bit add, so a carry is not lost)
mov     [gdt_ptr],eax ; save pointer to the physical location
mov     bx,gdt_start
lgdt    [bx] ; load the GDT

Note you have to create entries for your current real mode segments if you want to access data in them.

Selectors

In real mode, the segment registers (CS, DS, ES, SS, FS, GS) specify a real mode segment. And you can put anything to them, no matter where it points. And you can read and write and execute from that segment. In protected mode, these registers are loaded with selectors.

Selector

Bit 0 - 1 : RPL
- Requested Privilege Level. When loading a data segment, it must be numerically less than or equal to the segment's DPL (that is, at least as privileged).
Bit 2 : TI
- If this bit is set to 1, the selector selects an entry from the LDT instead of the GDT (see below for LDT).
Bits 3 - 15:
- Zero based index to the table (GDT or LDT).

So, to load ES with the code32 segment, we would do:

mov ax,0008h  ; 0-1 : 00 privilege, 2 : 0 (GDT), 3-15 = 1 (Second Entry)
mov es,ax
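
Building and taking apart a selector is plain bit shifting. A small sketch in Python, with function names (`make_selector`, `split_selector`) that are mine:

```python
def make_selector(index, ti=0, rpl=0):
    """Combine a table index, the TI bit and the RPL into a 16-bit selector."""
    return (index << 3) | ((ti & 1) << 2) | (rpl & 3)

def split_selector(selector):
    """Return (index, ti, rpl) for a 16-bit selector."""
    return selector >> 3, (selector >> 2) & 1, selector & 3

print(hex(make_selector(1)))       # second GDT entry, privilege 0
print(split_selector(0x001B))      # index 3, GDT, RPL 3
```

Because the low three bits are flags, GDT selectors at privilege 0 are simply multiples of 8: 08h, 10h, 18h, 20h, and so on.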

In protected mode, you can't just load arbitrary values into the segment registers like in real mode. You must load valid selectors or you will get an exception.

Interrupts

The OS uses the LIDT instruction to load the interrupt table. The IDTR contains 6 bytes of data: 2 for the length of the table and 4 for its physical address in memory.

Each entry in it is now 8 bytes, describing the location of the interrupt handlers.

struc IDT_STR ofs0_15,sel,zero,flags,ofs16_31
{
 .ofs0_15 dw ofs0_15
 .sel dw sel
 .zero db zero
 .flags db flags            ; bit 7 P, bits 5-6 DPL, bit 4 zero, bits 0-3 gate type
 .ofs16_31 dw ofs16_31
}
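
The 8-byte entry can be checked with a short sketch in Python. The function name `pack_gate` is mine, and the default flags value 8Eh (present, DPL 0, 32-bit interrupt gate) is an assumption for the example, not a value taken from the listing above.

```python
import struct

def pack_gate(offset, selector, flags=0x8E):
    """Pack an 8-byte protected mode IDT entry in IDT_STR field order."""
    return struct.pack(
        "<HHBBH",
        offset & 0xFFFF,            # ofs0_15
        selector,                   # sel
        0,                          # zero
        flags,                      # flags
        (offset >> 16) & 0xFFFF,    # ofs16_31
    )

gate = pack_gate(offset=0x00012345, selector=0x0008)
print(gate.hex())
```

Note how the 32-bit handler offset is split in two halves around the selector and the flags.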

Let's see some code to define only one interrupt:

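SEGMENT CODE32 USE32
    intr00:
      ; do nothing but return
     IRETD

...
SEGMENT DATA16 USE16

    idt_PM_start  dw idt_size-1      ; limit of the table (size minus 1)
    idt_PM_ptr    dd 0               ; physical address of the table, set below
    interrupt0    db 8 dup(0)        ; one 8-byte entry
    idt_size=$-(interrupt0)
...
SEGMENT CODE16 USE16

      mov ax,DATA16
      mov ds,ax
      mov eax,intr00                  ; offset of the handler inside CODE32
                                      ; (selector 0008h has CODE32 as its base)
      mov [interrupt0],ax             ; low 16 bits of the offset
      mov word [interrupt0 + 2],0008h ; the selector of our CODE32
      mov byte [interrupt0 + 5],8eh   ; present, DPL 0, 32-bit interrupt gate
      shr eax,16
      mov [interrupt0 + 6],ax         ; high 16 bits of the offset
      xor eax,eax
      mov ax,DATA16
      shl eax,4                       ; make it a physical address
      add eax,interrupt0              ; add the offset of the first entry
      mov [idt_PM_ptr],eax
    
...
mov bx,idt_PM_start
; = NO DEBUG HERE =
cli
lidt [bx]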

Notice the =NO DEBUG HERE=. Once the IDT table has been reset, a real mode debugger cannot work. So if you try to step into LIDT, you will crash. And no, you cannot call DOS or BIOS interrupts from protected mode. However, Bochs has its own hardware debugger that can step into anything, so you can do your stuff there!

Preparing for Crash

It is very rare that your first protected mode application won't crash. When this happens, the CPU triple faults and gets reset. To regain control after the reset, you can register real mode code to be executed:

MOV ax,40h 
MOV es,ax 
MOV di,67h 
MOV al,8fh 
OUT 70h,al 
MOV ax,ShutdownProc 
STOSW 
MOV ax,cs
STOSW 
MOV al,0ah 
OUT 71h,al 
MOV al,8dh 
OUT 70h,al

If the CPU crashes, your routine will be executed. That routine must reset all registers and stack, then exit to DOS.

Entering Protected Mode

cli
mov eax,cr0
or eax,1
mov cr0,eax

After that, you must execute a far jump to a protected mode code segment in order to flush the instruction prefetch queue and load CS with a valid selector. Using JMP FAR results in an error, for the assembler does not know at this point that we are in protected mode. If the target code segment is a 16-bit code segment, you must do:

db 0eah    ; Opcode for far jump
dw StartPM ; Offset to start, 16-bit
dw 020h    ; This is the selector for CODE16_DESCRIPTOR (entry 4 in the GDT above),
           ; assuming that StartPM resides in code16

If this code segment is a 32-bit code segment, you must do:

db 66h     ; Prefix for 32-bit
db 0eah    ; Opcode for far jump
dd StartPM ; Offset to start, 32-bit
dw 08h     ; This is the selector for CODE32_DESCRIPTOR,
           ; assuming that StartPM resides in code32
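
The bytes emitted by those db/dw/dd lines can be reproduced with a short sketch in Python. The function names and the sample offset 1234h are invented for the illustration; the selectors 20h and 08h assume the GDT order used earlier in this article.

```python
import struct

def far_jump16(offset, selector):
    """Bytes of a far jump with a 16-bit offset: EA, offset, selector."""
    return struct.pack("<BHH", 0xEA, offset, selector)

def far_jump32(offset, selector):
    """Same jump with the 66h prefix and a 32-bit offset."""
    return struct.pack("<BBIH", 0x66, 0xEA, offset, selector)

print(far_jump16(0x1234, 0x0020).hex())
print(far_jump32(0x1234, 0x0008).hex())
```

In both forms the offset comes first and the selector last, in little-endian order.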

Before enabling interrupts, you must setup the stack and other registers:

mov ax, data_selector
mov ds,ax
mov ax, stack_selector
mov ss,ax
mov esp,1000h ; assuming that the limit of the stack segment 
              ; selected by stack_selector is 1000h bytes.
sti
...

Exiting Protected Mode

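cli
mov eax,cr0
and eax,0fffffffeh ; clear only bit 0 (PE)
mov cr0,eax
; a far jump to a real mode segment:offset belongs here, to reload CS
mov ax,DATA16
mov ds,ax
mov ax,STACK16
mov ss,ax
mov sp,1000h ; assuming that STACK16 is 1000h bytes in length
mov bx,RealMemoryInterruptTableSavedWithSidt
lidt [bx]
sti
; (You can debug here) ...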
; (You can debug here) ...

Unreal Mode

Because protected mode cannot call DOS or BIOS interrupts, it is generally not useful to DOS applications. However, a 'bug' in the 386+ processor turned out to be a feature called unreal mode. The unreal mode is a method to access the entire 4GB of memory from real mode. This trick is undocumented, however a large number of applications (including HIMEM.SYS) are using it.

Enable A20.
Enter protected mode.
Load a segment register (ES or FS or GS) with a 4GB data segment.
Return to real mode.

As long as the register does not change its value, it still points to a 4GB data segment, so it is possible to use it along with EDI to access the entire address space. After returning from protected mode, you can easily do:

; assuming FS has loaded a 4GB data segment from Protected Mode
mov edi,1048576 ; point above 1MB
mov byte [fs:edi],0 ; Set a byte above 1MB.

286 lacks this capability because to exit protected mode, the CPU has to be reset, so all registers are destroyed.

The following routine puts your CPU in unreal mode and leaves FS with a 4GB limit:

struc GDT_STR
        
                s0_15   dw ?
                b0_15   dw ?
                b16_23  db ?
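                access  db ?    ; access byte (P, DPL, S, type), e.g. 9ah or 92h
                flags   db ?    ; G and D/B in the high nibble, limit bits 16-19 in the low nibble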
                b24_31  db ?
ENDS        

SEGMENT CODE16 USE16 PUBLIC
ASSUME CS:CODE16

; GDT definitions
gdt_start dw gdt_size
gdt_ptr dd 0
dummy_descriptor GDT_STR <0,0,0,0,0,0>
code16_descriptor  GDT_STR <0ffffh,0,0,9ah,0,0>    ; 64k 16-bit code
data32_descriptor  GDT_STR <0ffffh,0,0,92h,0cfh,0> ; 4GB 32-bit data,   92h = 10010010b = Present, DPL 00, non-system, data read/write
gdt_size = $-(dummy_descriptor)

dummy_idx       = 0h    ; dummy selector
code16_idx      =       08h             ; offset of 16-bit code segment in GDT
data32_idx      =       10h             ; offset of 32-bit data  segment in GDT

PUBLIC _EnterUnreal
PROC _EnterUnreal FAR

    PUSHAD
    PUSH DS
    
    PUSH CS
    POP DS
    
    xor     eax,eax         ; clear the upper half of EAX before shifting
    mov     ax,CODE16 ; get 16-bit code segment into AX
    shl     eax,4           ; make a physical address
    mov     [ds:code16_descriptor.b0_15],ax ; store it in the dscr
    shr     eax,8
    mov     [ds:code16_descriptor.b16_23],ah

    XOR eax,eax
    mov     [ds:data32_descriptor.b0_15],ax ; store it in the dscr
    mov     [ds:data32_descriptor.b16_23],ah

    
    ; Set gdt ptr
    xor eax,eax
    mov     ax,CODE16
    shl     eax,4
    add     eax,offset dummy_descriptor ; 32-bit add, so a carry out of AX is not lost
    mov     [gdt_ptr],eax

    
    cli
    mov bx,offset gdt_start
    lgdt [bx]
    mov eax,cr0
    or al,1
    mov cr0,eax 
    
    mov ax,data32_idx
    mov fs,ax
    
    mov     eax,cr0         
    and     al,not 1        
    mov     cr0,eax         

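    MOV AX,0
    MOV FS,AX       ; real mode load: the base becomes 0, the cached 4GB limit stays
    POP DS
    POPAD    
    
    RETF            ; interrupts are still disabled here (see CLI above); the caller must STI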

ENDP
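
The descriptor initializers used in the routine above are easier to check when the packing is spelled out. This is a minimal C sketch, assuming the standard 8-byte segment descriptor layout; `encode_descriptor` is a hypothetical helper, not something from the article's sources.

```c
#include <assert.h>
#include <stdint.h>

/* Pack one GDT segment descriptor (hypothetical helper).
   The returned value, stored little-endian, gives the 8 descriptor
   bytes in memory order. */
uint64_t encode_descriptor(uint32_t base, uint32_t limit,
                           uint8_t access, uint8_t flags)
{
    assert(limit <= 0xFFFFFu);  /* the limit field is 20 bits wide */
    assert(flags <= 0x0Fu);     /* G, D/B, L, AVL */

    uint64_t d = 0;
    d |= (uint64_t)(limit & 0xFFFFu);              /* bytes 0-1: limit 0..15      */
    d |= (uint64_t)(base & 0xFFFFFFu) << 16;       /* bytes 2-4: base 0..23       */
    d |= (uint64_t)access << 40;                   /* byte 5: access byte         */
    d |= (uint64_t)((limit >> 16) & 0x0Fu) << 48;  /* byte 6, low: limit 16..19   */
    d |= (uint64_t)flags << 52;                    /* byte 6, high: flags         */
    d |= (uint64_t)(base >> 24) << 56;             /* byte 7: base 24..31         */
    return d;
}
```

With base 0, limit 0FFFFFh, access 92h and flags 0Ch (G and D set), the bytes come out as FF FF 00 00 00 92 CF 00, which matches data32_descriptor.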

Global Descriptor Tables

When the S flag is set to 0, the GDT entry describes a system segment or a gate rather than a code or data segment, and the type bits get a quite different meaning.

flags - Flags for the segment

More on gates later in this article.

Bits 3 2 1 0 : Type of the entry
- 0000 - Reserved
- 0001 - Available 16-bit TSS
- 0010 - Local Descriptor Table (LDT)
- 0011 - Busy 16-bit TSS
- 0100 - 16-bit Call Gate
- 0101 - Task Gate
- 0110 - 16-bit Interrupt Gate
- 0111 - 16-bit Trap Gate
- 1000 - Reserved
- 1001 - Available 32-bit TSS
- 1010 - Reserved
- 1011 - Busy 32-bit TSS
- 1100 - 32-bit Call Gate
- 1101 - Reserved
- 1110 - 32-bit Interrupt Gate
- 1111 - 32-bit Trap Gate

Local Descriptor Table

Local Descriptor Table (LDT) is a method for each application to have a private set of segments, loaded with the LLDT assembly instruction. The TI bit (bit 2) of the selector specifies whether the segment is loaded from the GDT or from the LDT. This, although originally helpful, is not used in modern OSes because of paging.
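
Since the selector carries the table choice, it helps to see how its 16 bits are split. A minimal C sketch follows; the struct and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* The three fields of a segment selector (hypothetical type). */
typedef struct {
    unsigned index; /* descriptor index inside the table */
    unsigned ti;    /* table indicator: 0 = GDT, 1 = LDT */
    unsigned rpl;   /* requested privilege level, 0..3 */
} SelectorParts;

SelectorParts split_selector(uint16_t selector)
{
    SelectorParts p;
    p.index = selector >> 3;        /* bits 3..15 */
    p.ti = (selector >> 2) & 1u;    /* bit 2 */
    p.rpl = selector & 3u;          /* bits 0..1 */
    return p;
}

uint16_t make_selector(unsigned index, unsigned ti, unsigned rpl)
{
    assert(index < 8192u && ti < 2u && rpl < 4u);
    return (uint16_t)((index << 3) | (ti << 2) | rpl);
}
```

So a selector like 18h means entry 3 of the GDT at RPL 0, while 0Fh means entry 1 of the LDT at RPL 3.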

Call Gates

Call gates are a mechanism to switch from a low privilege code to a higher one, used for user-level code to call system-level code. You specify a 1100 type entry in the GDT with the following format:

struct CALLGATE
{
    unsigned short offs0_15;
    unsigned short selector;
    unsigned short argnum:5;  // number of arguments to copy to the new stack from the current stack
    unsigned short r:3;       // Reserved
    unsigned short type:5;    // 01100 (S = 0 followed by type 1100)
    unsigned short dpl:2;     // DPL of this gate
    unsigned short P:1;       // Present bit
    unsigned short offs16_31;
};

Using CALL FAR with the selector of this call gate (the offset is ignored) will switch to the gate and execute the higher privilege code. On a privilege change, the CPU pushes the caller's SS and ESP on the new stack, copies the argnum parameters from the old stack, and then pushes CS and EIP. Using RETF will return from the gate call.
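
Bit-field layout is compiler dependent, so when building a gate by hand I find it safer to pack the bytes explicitly. This is a minimal C sketch assuming the standard 32-bit call gate layout; `encode_call_gate32` is a hypothetical name.

```c
#include <assert.h>
#include <stdint.h>

/* Pack a 32-bit call gate (hypothetical helper). The returned value,
   stored little-endian, gives the 8 gate bytes in memory order. */
uint64_t encode_call_gate32(uint16_t selector, uint32_t offset,
                            unsigned param_count, unsigned dpl)
{
    assert(param_count < 32u && dpl < 4u);

    uint64_t g = 0;
    g |= (uint64_t)(offset & 0xFFFFu);                  /* bytes 0-1: offset 0..15  */
    g |= (uint64_t)selector << 16;                      /* bytes 2-3: code selector */
    g |= (uint64_t)param_count << 32;                   /* byte 4: argument count   */
    g |= (uint64_t)(0x80u | (dpl << 5) | 0x0Cu) << 40;  /* byte 5: P, DPL, 01100    */
    g |= (uint64_t)(offset >> 16) << 48;                /* bytes 6-7: offset 16..31 */
    return g;
}
```

For a gate callable from ring 3 (DPL 3) the access byte becomes ECh: present bit, DPL 11, S = 0 and type 1100.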

Because nowadays there are the faster SYSENTER/SYSEXIT instructions, gates are no longer used. Their usage is limited to:

When you need a transition other than Ring 3 <-> Ring 0 (the SYS instructions only go from 3 to 0 and vice versa)
For malware exploits, patching the GDT to create call gates and then execute privileged instructions. Note that in x64 versions of Windows, "Kernel Patch Protection" prevents modification of the GDT.

SYSENTER / SYSEXIT

These instructions are the current way to switch from ring 3 to ring 0. You use WRMSR to set the new values for CS (MSR 0x174), XSP (MSR 0x175) and XIP (MSR 0x176). For SYSEXIT, XCX must hold the ring 3 stack pointer and XDX the ring 3 IP. The value stored for CS must be the selector of the first of 4 consecutive GDT entries: the first is the ring 0 code, the second is the ring 0 data, the third is the ring 3 code and the fourth is the ring 3 data. This layout is fixed, so in order to use SYSENTER your GDT must contain these entries in this order.

These opcodes only support switching between ring 3 and ring 0, but they are much faster.
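
The fixed selector layout can be written down as simple arithmetic on the value stored in MSR 0x174. A minimal C sketch, assuming the 32-bit SYSENTER/SYSEXIT behaviour; the type and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* The four selectors derived from the SYSENTER CS MSR (hypothetical type). */
typedef struct {
    uint16_t ring0_cs;  /* loaded by SYSENTER */
    uint16_t ring0_ss;  /* loaded by SYSENTER */
    uint16_t ring3_cs;  /* loaded by SYSEXIT, RPL forced to 3 */
    uint16_t ring3_ss;  /* loaded by SYSEXIT, RPL forced to 3 */
} SysenterSelectors;

SysenterSelectors sysenter_selectors(uint32_t sysenter_cs_msr)
{
    /* Only the low 16 bits count, and the RPL/TI bits are masked off. */
    uint16_t base = (uint16_t)(sysenter_cs_msr & 0xFFFCu);
    assert(base != 0);  /* a null selector makes SYSENTER fault */

    SysenterSelectors s;
    s.ring0_cs = base;
    s.ring0_ss = (uint16_t)(base + 8u);
    s.ring3_cs = (uint16_t)((base + 16u) | 3u);
    s.ring3_ss = (uint16_t)((base + 24u) | 3u);
    return s;
}
```

With 08h in the MSR this yields 08h, 10h, 1Bh and 23h for the four selectors.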

Paging

There are a number of problems that occur in a multitasking OS when the above setups are used:

A task has to be loaded in memory entirely.
DOS applications think that they always access the lowest MB of RAM, so they can't be put outside it.
An application must handle its own segments, which must be different from other applications' segments, thus making the use of dynamic link libraries costly.

Paging is the method to redirect an address to another address. The address that the application uses is called the "linear address" and the actual address is the "physical address".

32 bit Protected Mode Paging

The simplest form of paging consists of 2 tables: The Page Directory and the Page Table. The Page Directory is an array of 1024 32-bit entries with the following format:

PRUWDANSGA-Addr

P - Page is present in memory. This flag allows the OS to swap pages out to disk, clear P, and reload them when a page fault is generated because software attempts to access the page.
R - Page is Read Write if set, else Read only.
U - If set, ring 3 code can access this page; if clear, only supervisor code (rings 0 to 2) can.
W - If set, write-through is enabled.
D - If set, the page will not be cached.
A - Set by the CPU when the page is accessed. The CPU never clears it, that is up to the OS.
N - Set to 0 (in a Page Table entry this bit is the Dirty flag, set by the CPU when the page is written).
S - Set to 0. If PSE is enabled, see below.
G - Set to 0.
Addr - The upper 20 bits (the lower 12 are ignored because it must be 4096-aligned) of the physical address of the Page Table that this Page Directory entry points to.

The Page Table is an array of 1024 32-bit entries with the same format. The address points to the actual physical address that this page is mapped to.

Since there are 1024 directory entries, each pointing to a page table of 1024 entries, a total of 1024x1024 mappings are possible. Because the size of the page is 4096 bytes, we can map the entire 32-bit address space (1024 x 1024 x 4096 bytes = 4GB).

Put the physical address of the Page Directory in CR3 before enabling paging (CR0 PG bit, bit 31).
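
To see how the CPU walks these two tables, here is a minimal C sketch that splits a 32-bit linear address into its three parts (10 bits of directory index, 10 bits of table index, 12 bits of offset). All names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* The three parts of a 32-bit linear address with 4KB pages
   (hypothetical type). */
typedef struct {
    unsigned dir_index;    /* bits 22..31: index into the Page Directory */
    unsigned table_index;  /* bits 12..21: index into the Page Table */
    unsigned offset;       /* bits 0..11: offset inside the 4KB page */
} LinearParts32;

LinearParts32 split_linear32(uint32_t linear)
{
    LinearParts32 p;
    p.dir_index = linear >> 22;
    p.table_index = (linear >> 12) & 0x3FFu;
    p.offset = linear & 0xFFFu;
    return p;
}

uint32_t join_linear32(unsigned dir_index, unsigned table_index, unsigned offset)
{
    assert(dir_index < 1024u && table_index < 1024u && offset < 4096u);
    return ((uint32_t)dir_index << 22) | ((uint32_t)table_index << 12) | offset;
}

/* The address part of a directory or table entry: the low 12 bits
   hold the flags, so they are masked off. */
uint32_t entry_frame_address(uint32_t entry)
{
    return entry & 0xFFFFF000u;
}
```

For the linear address 00403025h the CPU reads directory entry 1, then entry 3 of the page table it points to, and finally adds the offset 025h to the physical page address found there.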

PSE

If PSE is enabled (CR4 bit 4), then S can be 1, in which case the Page Directory entry maps a 4MB page directly (no Page Table is used), and the pages must be 4MB aligned. This mode was introduced to avoid lots of small pages, at the expense of more wasted memory when the needed amount is only somewhat larger than a multiple of 4MB. Fortunately, modes can be mixed.

Physical Address Extension (PAE)

PAE is the ability of x86 to use 36 address bits instead of 32. This increases the available memory from 4GB to 64GB. The 32-bit applications still see only a 4GB address space, but the OS can map (via paging) memory from the high area to the lower 4GB address space. This extension was added to x86 to cope with the (nowadays not enough) limit of 4GB, before 64-bit software came to the foreground.

Enabling PAE (CR4 bit 5) means that now you have 3 paging levels: in addition to the PT and the PDT, you now have the PDPT, the Page Directory Pointer Table, which has 4 64-bit entries. Each of the PDPT entries points to a Page Directory of 4KB (like in normal paging). Each entry in the new Page Directory is now 64 bits long (so there are 512 entries) and points to a Page Table of 4KB (like in normal paging), and each entry in the new Page Table is now 64 bits long, so there are 512 entries. One Page Directory therefore maps only 512 x 512 x 4KB = 1GB, a quarter of the original mapping, and that is why the PDPT has 4 entries. The first entry maps the first GB, the 2nd the 2nd GB, the 3rd the 3rd GB and finally, the 4th entry maps the 4th GB.

But now the "S" bit in the PDT has a different meaning: if not set, the entry points to a Page Table of 4KB pages, but if set, the entry does not point to a PT at all and instead describes a 2MB page itself. So you can have different levels of paging traversal depending on the S bit.

There is a new flag in the Page Directory entry as well, the NX bit (Bit 63) which, if set, prevents code execution in that page.

This system allows the OS to handle memory over 4GB, but since the address space is still 4GB, each process is still limited to 4GB. The memory can be up to 64GB but a process cannot see the entire memory.

Flat Mode

Initially, the Local Descriptor Table was used so each application could have a local array of segments. But because of paging, modern 32-bit OSs now use the "flat" mode. This way the applications receive the entire 4GB address space to hold code, data, and stack, but this portion of the address space is mapped into a different physical memory. So the applications can use the same memory addresses which are mapped to different physical addresses.

For example, see these two C++ programs running under 32-bit Windows:

#include <windows.h>

int main()
{
    // CS:EIP at this point is, (say) 010Ch : 00004000h.
    int flags = MB_OK;
    const char* msg = "Hello there";
    const char* title = "Title";
    MessageBox(0,msg,title,flags);     // Address of message box is (say) 00547D45h
}

#include <windows.h>

int main()
{
    // CS:EIP at this point are the same as in the previous program. However paging actually
    // maps them to a different physical address so these
    // two programs do not interfere with the same memory.
    // This is transparent to the application
    int flags = MB_OK;
    const char* msg = "Hello there";
    const char* title = "Title";
    MessageBox(0,msg,title,flags);
    // Address of message box is (say) 00547D45h,
    // and this value is mapped to the same
    // memory as in the previous application, so the shared function
    // "MessageBox" is found only once in physical memory.
}

This allows the application programmer to never consider segmentation. All pointers are near, there are no segments (all have the same value), and thus creating applications is easier. There is no such thing as a "small/large" model, because all the stuff is within the same segment.

The OS maps via paging all needed memory (say, a DLL) to some virtual address in the 32-bit address space, and the app will never consider far pointers or segmentation. CS, DS and SS permanently view the entire 4GB address space, but all addresses are virtual and mapped to the application using paging, therefore there is no segmentation.

So, in flat mode:

All the segments are expanded to full 32-bit 4GB
Via paging, different linear addresses can be mapped to the same physical address, and the same linear address in different processes can be mapped to different physical addresses
No segmentation, LDT, call gates. No ring 1 or 2. Enter the kernel via SYSENTER/SYSEXIT, or via SYSCALL/SYSRET (which btw now reuse the old LOADALL opcodes).

Because of its simplicity, the "flat mode" is now the mode used by all 32-bit OSs, and also it is the only one that exists in 64-bit mode.

VM86 Mode

So far all nice with protected mode, but many of the existing applications were real-mode at that time. Even today, many (mostly games) are played under Windows. To force these applications (which think they own the machine) to cooperate, a special mode had to be created.

The VM86 mode is enabled by a special flag in the EFLAGS register (VM, bit 17), allowing a normal 16-bit DOS memory map of 640KB which is of course forwarded via paging to the actual memory - this makes it possible to run multiple DOS applications at the same time without any risk of one application overwriting another. EMM386.EXE, the old known memory manager, puts the processor in that state. The OS watches the process closely, making sure that it won't execute something illegal (so don't expect to enter protected mode when EMM386.EXE is loaded, because once you try to set the GDT with LGDT, you will be sacked :).

Once the VM flag is set, you can load a normal "segment" to a segment register. Interrupt calls by DOS applications are caught by the OS and emulated through it - if possible. Also, some instructions are ignored, for example, if you do a CLI, the interrupts are not actually disabled. The OS sees that you prefer to not be interrupted and acts accordingly, but interrupts are still there.

All VM86 code executes in PL 3, the lowest privilege level. Ins/Outs to ports are also captured and emulated if possible. The interesting thing about VM86 is that there are two interrupt tables, one for the real and one for the protected mode. But only protected mode interrupts are executed.

VM86 was removed from 64-bit mode, so a 64-bit OS cannot execute 16-bit DOS code anymore. In order to execute such code, you need an emulator such as DosBox.

HIMEM.SYS

HIMEM is the generic extended memory manager for DOS. At that time, extended memory was mostly, if not totally, used to cache data from the disk, especially from big apps. HIMEM puts the CPU in unreal mode (or it uses LOADALL in 286) and provides a simple interface to the applications that want more memory without messing with the protected mode details. By enabling the A20 line, HIMEM allowed a portion of DOS COMMAND.COM to reside in the high memory area. Because unreal mode is still real mode, your protected mode application can do the stuff we discussed even if HIMEM.SYS is loaded.

EMM386.EXE

At that time, a form of memory now eliminated, the "expanded" memory, existed. Many applications were written to take advantage of it, but the modern standard was the protected mode. EMM386 puts the CPU in VM86 mode and maps via paging memory over 1MB to real mode segments (over 0xA0000), so an application that would like to use expanded memory can use it via EMM386.EXE. In addition, EMM386 allowed "devicehigh" and "loadhigh" commands in CONFIG.SYS, allowing applications to get loaded to these high segments if possible.

Because VM86 mode is protected mode, your protected mode application cannot do the stuff we have discussed if EMM386 is loaded, for once you try LGDT your program will be terminated because ring-3 applications (remember that VM86 mode is protected mode in which the DOS application runs in ring-3), cannot set the GDT.

DPMI

DOS Protected Mode Interface is a system that allows DOS applications to run 32-bit code. Unreal mode was not enough because it only allows data to be moved, but not code to be executed. What a DPMI server does is to take care of the nasty tables we have discussed above, allowing the executable to specify 32-bit code directly. When the executable calls DOS, the DPMI server catches the call, switches to real mode, calls DOS, then back to protected mode.

LOADALL

At that time, a now non-existent and mostly undocumented instruction existed, LOADALL (0F 05 in 286, 0F 07 in 386). LOADALL was used, as the name implies, to load all the registers (including the GDTR and IDTR) from one table in memory. In 286 LOADALL (which was not available in 386), this table was fixed at memory address 0x800, whereas 386 LOADALL reads the buffer pointed to by real mode ES:EDI. Because the CPU does not check in any way if any of the values loaded by LOADALL is valid, LOADALL was used by many tools at the time, including HIMEM.SYS, for various infamous actions:

To access the entire memory from real mode without entering protected mode and unreal mode.
To run real code with paging.
To get back to real code from protected mode without resetting the 286.
To run normal 16-bit code inside protected mode without VM86 (which was not there in 286). This was done by trapping each memory access (which would lead to GPF because all the segments were marked non-present) and emulating the desired result by using another LOADALL. Of course this was too slow, but it led to the creation of the VM86 mode in 386, where LOADALL eventually faded out.

LOADALL 286 itself was mentioned in the manuals and was partially documented; by contrast, LOADALL 386 was heavily obscure, probably to induce the programmers to take advantage of the new VM86 mode.

Long Mode

The x64 CPU has three modes of work:

Real mode, same as in DOS
Legacy mode, same as 32-bit protected mode
Long mode

Long mode has two sub-modes:

Compatibility mode: same as 32-bit protected mode. This allows a 64-bit OS to run 32-bit applications.
64-bit mode: for 64-bit applications

To work in Long mode, the programmer must take into consideration the facts below:

Unlike Protected mode, which can run with or without paging, long mode absolutely needs PAE and paging. That is, you cannot leave paging out even if your map is "see-through". You have to create PAE - style page tables and "flat" mode is the only valid mode in long mode. No segmentation.
AMD docs say that, in order to enter long mode, you have to enter protected mode first - however, this has proven not to be true, since you can get into long mode directly from real mode, by enabling protected mode and paging within one instruction (this can work because Control Registers are accessible from real mode). I use that method in the long mode thread creation from real mode.
Although in theory any 64-bit value could be used as an address, in practice we do not yet need 2^64 bytes of memory. Therefore, current implementations only implement 48-bit addressing, which forces all pointers to have bits 47-63 either all 0 or all 1. This means that you have 2 ranges of valid "canonical" addresses, one from 0 to 0x00007FFF'FFFFFFFF and one from 0xFFFF8000'00000000 through 0xFFFFFFFF'FFFFFFFF, for a total of 256TB of space. Most OSes reserve the upper area for the kernel, and the lower area for the user space. And no, you cannot use the useless bits to store extra smart information about the pointer, because the CPU does not ignore these bits, it forces them to be either all 1 or all 0. Forget your bad habits :)
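
The canonical address rule from the last point can be checked with a few lines of C. This is a minimal sketch assuming 48-bit addressing as described above; the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Returns 1 if bits 47..63 of the address are all 0 or all 1. */
int is_canonical48(uint64_t address)
{
    uint64_t top = address >> 47;   /* the 17 bits 47..63 */
    return top == 0 || top == 0x1FFFFu;
}

/* Sign-extends a 48-bit value into a canonical 64-bit address. */
uint64_t make_canonical48(uint64_t low48)
{
    assert(low48 < ((uint64_t)1 << 48));
    if (low48 & ((uint64_t)1 << 47))
        return low48 | 0xFFFF000000000000ull;
    return low48;
}
```

The first address after the lower range, 0x00008000'00000000, is rejected, while its sign-extended form 0xFFFF8000'00000000 is the start of the upper range.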

Long Mode Paging

In long mode the paging system adds a new top level structure, the PML4T, which has 512 64-bit entries, each pointing to one PDPT, and now the PDPT has 512 entries as well (instead of 4 in the x86 mode). So now you can have 512 PDPTs, which means that one PT entry manages 4KB, one PDT entry manages 2MB (4KB * 512 PT entries), one PDPT entry manages 1GB (2MB * 512 PDT entries), and one PML4T entry manages 512GB (1GB * 512 PDPT entries). Since there are 512 PML4T entries, a total of 256TB (512GB * 512 PML4T entries) can be addressed.

This is another reason not to use the entire 64-bit for addressing. Using the entire thing would force us to have 6 levels of paging.

Each of the "S" bits in the PDPT/PDT can be 0 to indicate that there is a lower level structure below, or 1 to indicate that the traversal ends here. If the PDPT S flag is 1, then the page size is 1GB.
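
The four-level walk and the sizes quoted above can be verified with a minimal C sketch. It assumes 4KB pages and 48-bit addresses; the names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* The five parts of a 48-bit linear address with 4KB pages
   (hypothetical type). Each index is 9 bits wide. */
typedef struct {
    unsigned pml4;    /* bits 39..47 */
    unsigned pdpt;    /* bits 30..38 */
    unsigned pd;      /* bits 21..29 */
    unsigned pt;      /* bits 12..20 */
    unsigned offset;  /* bits 0..11 */
} LinearParts64;

LinearParts64 split_linear64(uint64_t linear)
{
    LinearParts64 p;
    p.pml4 = (unsigned)((linear >> 39) & 0x1FFu);
    p.pdpt = (unsigned)((linear >> 30) & 0x1FFu);
    p.pd = (unsigned)((linear >> 21) & 0x1FFu);
    p.pt = (unsigned)((linear >> 12) & 0x1FFu);
    p.offset = (unsigned)(linear & 0xFFFu);
    return p;
}

/* Bytes managed by one entry: level 0 = PT, 1 = PDT, 2 = PDPT, 3 = PML4T. */
uint64_t entry_coverage(unsigned level)
{
    assert(level < 4u);
    return (uint64_t)4096 << (9u * level);
}
```

Each level multiplies the coverage by 512, which gives 4KB, 2MB, 1GB and 512GB per entry, and 512 PML4T entries give the 256TB total.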

Global Descriptor Table

Creating a 64-bit segment
- A segment marked for 64-bit is pretty much the same as a 32-bit segment with a limit of 4GB, but with the L bit set to 1 and the D bit set to 0. The D bit is normally 0 only in 16-bit segments, but when the L bit is also set, it indicates a 64-bit segment.
64-bit segments always start from 0 and always end in 0xFFFFFFFFFFFFFFFF.

If your GDT resides in the lower 4GB of memory, you need not change it after entering long mode. However, if you plan to call SGDT or LGDT while in long mode, you must now deal with the 10-byte GDTR, which holds two bytes for the limit of the GDT and 8 bytes for its linear address.
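
Here is a minimal C sketch of that 10-byte operand, the memory image that LGDT reads and SGDT writes in long mode. The helper name is hypothetical, and it assumes the usual convention that the 2-byte limit field holds the table size in bytes minus one.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Build the 10-byte operand for LGDT/LIDT in long mode
   (hypothetical helper): 2 bytes of limit, then 8 bytes of base. */
size_t encode_table_register64(uint8_t out[10], uint64_t base,
                               uint32_t table_bytes)
{
    assert(out != NULL);
    assert(table_bytes >= 8u && table_bytes <= 65536u);

    uint16_t limit = (uint16_t)(table_bytes - 1u);  /* size minus one */
    out[0] = (uint8_t)(limit & 0xFF);
    out[1] = (uint8_t)(limit >> 8);
    for (int i = 0; i < 8; i++)
        out[2 + i] = (uint8_t)(base >> (8 * i));    /* little-endian base */
    return 10;
}
```

A GDT of three descriptors (24 bytes) at linear address 12340h is therefore described by the bytes 17 00 40 23 01 00 00 00 00 00.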

Any selector you might load to access a 64-bit segment is ignored, and DS, ES, SS are not used at all. All the segments are flat, and everything is done via paging. End of the segmentation era.

Interrupts

You have to reset the IDT to use 64-bit descriptors.

Each entry in it is now 16 bytes, describing the location of the interrupt handlers in 64-bit mode.

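struc IDT_STR ofs0_15,sel,flags,ofs16_31,ofs32_63,zero
{
 .ofs0_15 dw ofs0_15
 .sel dw sel
 .ist db 0              ; IST index in bits 0-2, 0 = not used
 .flags db flags        ; P, DPL and gate type
 .ofs16_31 dw ofs16_31
 .ofs32_63 dd ofs32_63
 .zero dd zero
}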
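
To double check the 16-byte size, here is a minimal C sketch that packs one 64-bit interrupt gate. It assumes the standard long mode gate layout, which has an IST byte between the selector and the type byte; the names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* One 16-byte IDT entry as two little-endian 64-bit halves
   (hypothetical type). */
typedef struct {
    uint64_t low;   /* bytes 0..7 */
    uint64_t high;  /* bytes 8..15 */
} IdtGate64;

IdtGate64 encode_idt_gate64(uint16_t selector, uint64_t handler,
                            unsigned ist, uint8_t type_attr)
{
    assert(ist < 8u);   /* the IST index is 3 bits wide */

    IdtGate64 g;
    g.low = (handler & 0xFFFFu)                     /* bytes 0-1: offset 0..15  */
          | ((uint64_t)selector << 16)              /* bytes 2-3: code selector */
          | ((uint64_t)ist << 32)                   /* byte 4: IST index        */
          | ((uint64_t)type_attr << 40)             /* byte 5: P, DPL, type     */
          | (((handler >> 16) & 0xFFFFu) << 48);    /* bytes 6-7: offset 16..31 */
    g.high = handler >> 32;                         /* bytes 8-11: offset 32..63,
                                                       bytes 12-15: zero */
    return g;
}
```

A type byte of 8Eh describes a present, ring 0, 64-bit interrupt gate.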

Entering Long Mode

; Disable paging, assuming that we are in a see-through.
mov eax, cr0 ; Read CR0.
and eax,7FFFFFFFh; Set PG=0
mov cr0, eax ; Write CR0.

; Enable PAE
mov eax, cr4
bts eax, 5
mov cr4, eax

mov ecx, 0c0000080h ; EFER MSR number. 
rdmsr ; Read EFER.
bts eax, 8 ; Set LME=1.
wrmsr ; Write EFER.

; Enable Paging to activate Long Mode. Assuming that CR3
; is loaded with the physical address of the page table.
mov eax, cr0 ; Read CR0.
or eax,80000000h ; Set PG=1.
mov cr0, eax ; Write CR0.

Turn off paging, if enabled. To do that, you must ensure that you are running in a "see through" area.
Set PAE, by setting bit 5 of CR4.
Create the new page tables and load CR3 with them. Because CR3 is still 32-bits before entering Long mode, the page table must reside in the lower 4GB.
Enable Long mode (note, this does not enter Long mode, it just enables it).
Enable paging. Enabling paging activates and enters Long mode.

Because the control registers and the rdmsr/wrmsr opcodes are also available in Real mode, you can activate Long mode from Real mode directly by setting both the PE and PG bits of CR0 simultaneously.
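
The bit masks involved in these steps are easy to mix up, so here is a minimal C sketch that models them on plain integers. Nothing here touches real registers, and the macro and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define CR0_PE   0x00000001u  /* bit 0: protection enable */
#define CR0_PG   0x80000000u  /* bit 31: paging */
#define CR4_PAE  0x00000020u  /* bit 5: physical address extension */
#define EFER_MSR 0xC0000080u  /* MSR number of EFER */
#define EFER_LME 0x00000100u  /* bit 8: long mode enable */

uint32_t cr0_disable_paging(uint32_t cr0) { return cr0 & ~CR0_PG; }
uint32_t cr4_enable_pae(uint32_t cr4) { return cr4 | CR4_PAE; }
uint32_t efer_enable_long_mode(uint32_t efer_low) { return efer_low | EFER_LME; }

/* From protected mode: PE is already set, only PG is added. */
uint32_t cr0_enable_paging(uint32_t cr0)
{
    assert((cr0 & CR0_PE) != 0);    /* PG without PE is not allowed */
    return cr0 | CR0_PG;
}

/* From real mode: PE and PG are set in the same write. */
uint32_t cr0_enter_from_real_mode(uint32_t cr0)
{
    return cr0 | CR0_PE | CR0_PG;
}

/* Long mode becomes active once LME, PAE and PG are all set. */
int long_mode_would_be_active(uint32_t cr0, uint32_t cr4, uint32_t efer_low)
{
    return (cr0 & CR0_PG) != 0 && (cr4 & CR4_PAE) != 0
        && (efer_low & EFER_LME) != 0;
}
```

Note that ~CR0_PG is 7FFFFFFFh and CR0_PG is 80000000h, the two constants used in the assembly above.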

Entering 64-bit

Now you are in compatibility mode. Enter 64-bit mode by jumping to a 64-bit code segment:

; also db 066h if entering from a 16-bit code segment
db 0eah
dd LinearAddressOfStart64
dw code64_idx ; selector of the 64-bit code segment (this name is an assumption)

The initial 64-bit segment must reside in the lower 4GB because compatibility mode does not see 64-bit addresses.

Note that you must use the linear address, because 64-bit segments always start from 0. Note also that if the current compatibility segment is 16-bit default, you have to use the 066h prefix.

The only thing you have to do in 64-bit mode is to reset the RSP:

linear rsp,stack64_end

linear is a macro that finds the linear address of a target. SS, DS, ES, are not used in 64-bit mode. That is, if you want to access data in another segment, you cannot load DS with that segment's selector and access the data. You must specify the linear address of the data. Data and stack are always accessed with linear addresses. "Flat" mode is not only the default, it is the only one for 64-bit. However, GS and FS can still be used as auxiliary registers, and their values are still verified against the GDT. In 32-bit Windows, FS points to the Thread Information Block; 64-bit Windows uses GS for that.

Once you are in 64-bit mode, the default operand size for the opcodes (except for jmp/call) is still 32-bit. So a REX prefix (0x40 to 0x4F) with its W bit set (0x48 to 0x4F) is required to mark a 64-bit operand. Your assembler handles that automatically if it supports a "code64" segment.
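
As a small illustration of the REX byte, here is a minimal C sketch that builds and tests it. The function names are hypothetical; the bit layout (0100WRXB) is the standard one.

```c
#include <assert.h>
#include <stdint.h>

/* A REX prefix is any byte of the form 0100WRXB. */
int is_rex_prefix(uint8_t byte)
{
    return (byte & 0xF0u) == 0x40u;
}

/* REX.W (bit 3) selects a 64-bit operand size. */
int rex_has_w(uint8_t byte)
{
    return is_rex_prefix(byte) && (byte & 0x08u) != 0;
}

uint8_t make_rex(unsigned w, unsigned r, unsigned x, unsigned b)
{
    assert(w < 2u && r < 2u && x < 2u && b < 2u);
    return (uint8_t)(0x40u | (w << 3) | (r << 2) | (x << 1) | b);
}
```

For example, the bytes 89 C3 encode mov ebx,eax, while 48 89 C3 (the same opcode behind a REX.W prefix) encode mov rbx,rax.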

In addition, a 64-bit interrupt table must now be set with a new LIDT instruction, this time taking a 10-byte operand (2 for the length and 8 for the location), and each entry in the IDT takes 16 bytes, which include a 2-byte selector and an 8-byte offset.

Returning to Compatibility Mode

Because 0eah is not a valid jump when in 64-bit mode, you have to use a RETF trick to get back to a compatibility mode segment.

push code32_idx    ; The selector of the compatibility code segment
xor rcx,rcx    

mov ecx,Back32    ; The address must be a 64-bit address,
                  ; so upper 32-bits of RCX are zero.
push rcx
retf

This gets you back to compatibility mode. 64-bit OSs keep jumping from 64-bit to compatibility mode in order to be able to run both 64-bit and 32-bit applications.

Why do Windows drivers have to be 64-bit for a 64-bit OS? Because no WOW64 for driver (ring 0) code exists. They could have created one if they wanted to - I guess they wanted to force manufacturers to finally move to 64-bit. Nice decision, I must admit.

Exiting from Long Mode

You have to set up all the registers again with 32-bit selectors - back to segmentation. Also you must be in a see-through area because to exit long mode you must deactivate paging. Of course, you can switch immediately to real mode by resetting the PE bit as well.

; We are now in Compatibility mode again
mov ax,stack32_idx 
mov ss,ax 
mov esp,stack32_end 
mov ax,data32_idx 
mov ds,ax
mov es,ax
mov ax,data16_idx
mov gs,ax
mov fs,ax

; Disable Paging to get out of Long Mode
mov eax, cr0 ; Read CR0.
and eax,7fffffffh ; Set PG=0.
mov cr0, eax ; Write CR0.

; Deactivate Long Mode
mov ecx, 0c0000080h ; EFER MSR number. 
rdmsr ; Read EFER.
btr eax, 8 ; Set LME=0 (BTC would only toggle the bit).
wrmsr ; Write EFER.

; Back to the dirty, old, protected mode :(

Unreal Mode in 64-bit

I am sorry if I got your hopes up for a moment. There is no such thing that would allow you to access over 4GB of RAM from real mode (unless AMD has an Easter egg in its CPU). In addition, although the 32-bit registers EAX, EBX, etc., are available in Real mode, the 64-bit registers RAX, RBX, etc., are not available even in compatibility mode - only in 64-bit mode. Or, who knows?

Virtual 86 Mode in 64-bit

Once the CPU enters Long mode, VM86 is not supported anymore. That is the reason why 64-bit OSs cannot run 16-bit applications. However, emulators like DosBox will run your old 16-bit game just fine.

DPMI for 64-bit

It does not exist, but I made something similar, which allows a DOS application to run multiple threads in real, protected and long mode while still having access to DOS interrupts.

http://www.codeproject.com/Articles/894522/Teh-Low-Level-M-ss-DOS-Multicore-Mode-Interface

Yup, I made it :)

The Code

The code presented here messes with everything we have discussed so far. It still has some dirty functions, but it works. Have fun with it!

History

30/12/2018 - More details on implementation of flat mode, typos, thread code
25/12/2018 - Cleanup, Github code, VS solution
21/05/2015 - Paging analysis.
24/03/2015 - LOADALL added, and updated unreal mode code.
05/02/2015 - Callgates and SYSENTER/SYSEXIT information added.
30/09/2014 - Wow after 5 years, at least some typo fixing.
02/12/2009 - First release.