Anyone who has already read my "infamous trilogy" would want to combine all of that material in one nice application. Here is such a combination, along with some new tips and techniques not discussed in the previous articles. It is implemented as a TSR which other apps can call for true multithreading in real, protected or long mode in raw DOS.
Using this app you can create a DOS app that can:
- Use all your CPUs together
- Lock/Unlock mutexes
- Start threads in real, protected and long mode (and soon, virtualized mode).
You need flat assembler and a FreeDOS installation in a virtualization environment that can expose multiple cores. VMware works. DOSBox doesn't, because it doesn't expose ACPI. Bochs will work in its special SMP edition, but only for real or protected mode.
- 1024 assembly books.
- 4.023 x 10^23 C++ lines written.
- 1 << 62 free space in your mind. The upper bits are reserved for the kernel.
- Lots of patience and humor :)
Locking the Mutex
Yes, in Win32 you have the nice mutex functions. But what about raw DOS?
First, a word about spin loops. When a Win32 thread calls WaitForSingleObject, the kernel checks whether the object is signaled and, if not, does not schedule the thread to resume. If there is no thread to schedule, the kernel halts the CPU core with the HLT instruction until the next interrupt arrives. In our little program we own the system and there is no scheduler, so the code will simply spin in a loop until the mutex is available.
Therefore, one would expect code like this:
; BL is the index of this CPU
.Loop1:
CMP byte [shared_var],0xFF ; shared_var would be 0xFF if mutex is released
JZ .OutLoop1 ; Free, so go and lock it
JMP .Loop1 ; Else, retry
.OutLoop1: MOV [shared_var],BL ; Lock it
Not so. The problem is that, when the mutex is released, another CPU might lock the variable before this code does. That is, another CPU might write to the variable after the JZ instruction but before the MOV instruction.
Therefore, we have to use some atomic operation to achieve the lock:
; BL is the index of this CPU
.Loop1:
CMP [shared_val],BL ; Perhaps it is locked to us anyway
JZ .OutLoop2 ; Yes, we already own it
CMP byte [shared_val],0xFF ; Free
JZ .OutLoop1 ; Yes
pause ; equal to rep nop.
JMP .Loop1 ; Else, retry
.OutLoop1:
; Lock is free, grab it
MOV AL,0xFF ; CMPXCHG compares the destination against AL
LOCK CMPXCHG [shared_val],BL
JNZ .Loop1 ; Write failed
.OutLoop2: ; Lock Acquired
The magic here is simple. We use the CMPXCHG instruction which, along with the LOCK prefix, atomically tests whether shared_val is still 0xFF (the value in AL) and, if so, writes BL to it and sets ZF. If another CPU has grabbed the mutex in the meantime, ZF is cleared and BL is not written to shared_val (AL receives the current value instead). Most convenient.
The other interesting thing is the pause opcode, a hint to the CPU that we are inside a spin loop. This improves performance because the CPU can avoid the memory-order mis-speculation penalty when the loop finally exits, and it also reduces power consumption while spinning.
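Releasing the mutex is the easy half. A minimal sketch, assuming the same conventions as the lock code (0xFF means free, BL is the index of this CPU) and a mutex byte addressed through DS:DI; the routine name MutexRelease and the carry flag convention are hypothetical and not taken from the driver source:

```asm
use16
; Hypothetical helper, not from the driver source.
; In:  DS:DI -> mutex byte, BL = index of this CPU
; Out: CF = 0 if released, CF = 1 if this CPU did not own the mutex
MutexRelease:
        cmp     byte [ds:di], bl        ; do we own it?
        jne     .NotOwner
        mov     byte [ds:di], 0xFF      ; mark it free again
        clc
        ret
.NotOwner:
        stc
        ret
```

No LOCK prefix is needed here: only the owner ever writes 0xFF, a single byte store is atomic on x86, and x86 does not reorder a store ahead of the loads and stores that precede it.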
Waking the CPUs
As we saw in the trilogy, we send the INIT and the SIPI. The CPU must start at a 4096-aligned address, so I've filled an array with NOPs and adjust the startup address accordingly. They say that the CPU starts in real mode, but what I believe is that the CPU starts in unreal mode if the main CPU has already been switched to unreal mode. Most weird, but don't rely on it. Assume that it starts in real mode, as they say.
Therefore, a "SipiStart" routine would look like this:
db 4096 dup (144) ; 144 = 0x90 = NOP, padding so that a 4096-aligned start address exists
lidt fword [ds:RealIDT]; Load real mode interrupts in case they are not loaded
call FAR CODE16:EnterUnreal; Far call because CS is not CODE16 at this point
; Enable APIC
MOV EDX,[FS:EDI]; unreal mode, FS:EDI works.
mov di,StartSipiAddrOfs ; a dd that contains pre-configured jump to the actual routine for this CPU
jmp far [ds:di]
Anyway, to access the APIC I have to enter unreal mode, so I call EnterUnreal. Note the FAR call: the segment in which EnterUnreal resides is not the same as the CS that is loaded during the SIPI. A newly awoken CPU must also set the spurious interrupt vector and the APIC software enable bit, as we have seen earlier. Finally, the code jumps far to the 'startup' address for the CPU, depending on the CPU index.
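The 4096-byte alignment exists because the SIPI carries only an 8-bit vector, and the awakened CPU begins executing at physical address vector * 4096. A small sketch of deriving that vector from a real mode segment:offset pair; the routine name GetSipiVector and its register interface are hypothetical:

```asm
use16
; Hypothetical helper, not from the driver source.
; In:  AX = real mode segment, BX = offset of the entry point
; Out: CF = 0 and AL = SIPI vector,
;      CF = 1 if the address is not 4096-aligned or not below 1 MB
GetSipiVector:
        push    edx
        movzx   edx, ax
        shl     edx, 4                  ; segment * 16
        movzx   eax, bx
        add     eax, edx                ; EAX = linear address of the entry point
        pop     edx
        test    eax, 0xFFF              ; low 12 bits must be zero
        jnz     .Bad
        cmp     eax, 0x100000           ; must lie in the first megabyte
        jae     .Bad
        shr     eax, 12                 ; page number = SIPI vector, fits in AL
        clc
        ret
.Bad:
        stc
        ret
```

For example, an entry point at 2000h:0000h has linear address 20000h, so the vector is 20h.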
The APIC provides us a way to send a message to another CPU. Apart from INIT and SIPI, which we saw earlier, the local APIC can be used to send a 'normal' interrupt, i.e. merely executing INT XX in the context of the target CPU. We have to take into consideration the following:
- If the CPU is in HLT state, the interrupt wakes it, and when the interrupt returns the CPU resumes with the instruction after the HLT opcode. If interrupts are also disabled (CLI), then we must send an NMI (a delivery mode in the APIC Interrupt Command Register) to wake the CPU.
- If the CPU is in HLT state and we send again an INIT and a SIPI, the CPU starts all over again from real mode.
- The interrupt must exist in the target processor. For example, in protected mode, the interrupt must have been defined in IDT.
- The Local APIC is mapped at the same memory address on every CPU (each CPU accesses its own), and the parameters of the IPI travel through shared memory. Therefore, we lock a mutex before we issue the interrupt.
- Because the registers cannot be passed from CPU to CPU, we have to write all the registers (that will be used for the interrupt, if any) in a separated memory area.
- The interrupt might fail. I don't know why, but that's what they say. So, you have to rely on some inter-CPU communication (via shared memory and mutexes) to verify the delivery. I'm doing that in my code with a simple flag.
- Finally, the handler of the interrupt must tell its own Local APIC that there is an "End of Interrupt". Remember out 020h,al in the past? Now we write the value 0 to the EOI register (LocalApic + 0xB0).
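The send and EOI steps from the list can be sketched as follows. This is an illustration under stated assumptions, not the driver's own routine: the Local APIC is assumed to sit at its usual base 0xFEE00000, FS is assumed to be 0 with unreal mode 32-bit access, and the target is identified by its APIC ID (which is not necessarily the CPU index used elsewhere in this article). The names SendFixedIPI and SendEOI are hypothetical.

```asm
use16
LAPIC_BASE   = 0xFEE00000               ; assumed default Local APIC base
LAPIC_EOI    = 0x0B0                    ; End Of Interrupt register
LAPIC_ICR_LO = 0x300                    ; Interrupt Command Register, bits 0-31
LAPIC_ICR_HI = 0x310                    ; Interrupt Command Register, bits 32-63

; Hypothetical helper. In: BL = destination APIC ID, CL = interrupt vector
SendFixedIPI:
        push    eax
        push    edi
        mov     edi, LAPIC_BASE
.WaitIdle:
        mov     eax, [fs:edi + LAPIC_ICR_LO]
        test    eax, 1 shl 12           ; delivery status, 1 = previous send pending
        jz      .Idle
        pause
        jmp     .WaitIdle
.Idle:
        movzx   eax, bl
        shl     eax, 24                 ; destination field is bits 56-63 of the ICR
        mov     [fs:edi + LAPIC_ICR_HI], eax
        movzx   eax, cl                 ; delivery mode 000b = fixed
        or      eax, 1 shl 14           ; level = assert
        mov     [fs:edi + LAPIC_ICR_LO], eax ; writing the low dword sends the IPI
        pop     edi
        pop     eax
        ret

; Hypothetical helper, to be called at the end of the interrupt handler
SendEOI:
        push    edi
        mov     edi, LAPIC_BASE
        mov     dword [fs:edi + LAPIC_EOI], 0
        pop     edi
        ret
```

The high dword is written first because the write to the low dword is what triggers the send. For an NMI, the delivery mode field (bits 8-10) would be 100b and the vector is ignored.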
My code defines a real mode INT F1h. This reads registers from a specified memory area, calls the interrupt and then restores the registers. This interrupt is called by IPI from the other CPUs (except the main one).
CPU #1 the Real mode
Since this CPU is running in real mode, you may want to call DOS. It will work, provided that no other CPU calls DOS at the same time, which of course cannot be assumed in our simple app. Therefore, you have to use int 0xF0.
Note that a real mode CPU does not need to send an IPI to CPU #0, because it can call a real mode interrupt itself, but it still needs synchronization.
print16 m2 ; instead of calling int 21h ah 9, it redirects it through int 61h, with mutex locks
CPU #2 the Protected mode
This code comes in two pieces: one in real mode to prepare for entering protected mode (GDT, IDT etc.), which ends with a db 066h 0eah far jump into 32-bit protected mode, and the 32-bit code itself. But now interrupts cannot be called directly (even with proper synchronization), so we send an IPI to CPU #0:
; mutex lock
mov di,mutexes + 2
mov word [ds:real_regs + 0],0900h
mov word [ds:real_regs + 6],m5
mov word [ds:real_regs + 12],DATA16
; Send the IPI
dw SendIPIF,code16_idx ; code16_idx = 20h, index in the GDT
; mutex unlock
mov di,mutexes + 2
Note the following issues:
- Calls to MutexLock16 and MutexFree16f are encoded with 66h 9ah; remember, they are far calls to a 16-bit code segment from a 32-bit code segment.
- MutexLock16 and MutexFree16f can be called from any mode as long as [DS:DI] has the mutex to lock. In real mode DS is a real mode segment, whereas in protected mode DS is the GDT selector of that real mode segment. In a similar way, FS in real mode is 0 with 32-bit unreal mode access, whereas in protected mode it is a GDT selector for a 32-bit data segment with base 0. So our code is callable from any mode.
- SendIPI32 is reimplemented in 32-bit, with EBX the index to the cpu, and ECX the IPI message.
CPU #3 the Long mode
As I had said (but not applied) in the trilogy, long mode can be entered directly from real mode, because the instructions RDMSR and WRMSR are available. This is also implemented in two pieces. One to prepare the long mode by:
- Loading the GDT.
- Preparing a see-through page table for the first 1GB and mapping the Local APIC to a fixed position (1GB - 2MB) in memory, because the Local APIC is usually located at 0xFEE00000, which means it won't be visible in our 1GB see-through, OR,
- Preparing a 4GB page table with 1GB pages, if your system supports 1GB pages. Most do. page1gb is a flag you can adjust to toggle that.
- Enabling PAE, PSE, and long mode.
And one to enter long mode by enabling paging, enabling interrupts with int 0xf0 accessible, then jumping to the code.
Because 64-bit mode cannot call 16-bit code (it can only jump to it), I've reimplemented the lock, some other functions and the IPI function for 64 bits. Remember also that, in 64-bit mode, DS, ES and SS have no meaning.
I've called it DOS Multicore Mode Interface. It is a driver which helps you develop 32 and 64 bit applications for DOS, using int 0xF0. This interrupt is accessible from real, protected and long mode. Put the function number in AH.
To check for existence, check the vector for INT 0xF0. If the driver exists, the handler starts with a 2-byte JMP, followed by 'dmmi' 'dmmi' in 2 dwords.
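A minimal sketch of that signature check as a standalone .COM program, assuming the handler layout described in the preceding paragraph (2 bytes of JMP followed by the 8 signature bytes). The label and message names are made up for the example:

```asm
org 100h
use16
        xor     ax, ax
        mov     es, ax
        les     bx, [es:0xF0*4]         ; ES:BX = INT 0xF0 handler, read from the IVT
        mov     ax, es
        or      ax, bx
        jz      NotFound                ; null vector, nothing is installed
        cmp     dword [es:bx+2], 'dmmi' ; skip the 2-byte JMP
        jne     NotFound
        cmp     dword [es:bx+6], 'dmmi'
        jne     NotFound
        mov     dx, MsgFound
        jmp     Print
NotFound:
        mov     dx, MsgMissing
Print:
        mov     ah, 9                   ; DOS: print the $-terminated string at DS:DX
        int     21h
        mov     ax, 4C00h               ; DOS: exit
        int     21h

MsgFound   db 'DMMI driver found', 13, 10, '$'
MsgMissing db 'DMMI driver not found', 13, 10, '$'
```

Checking the signature first avoids issuing INT 0xF0 on a machine where the vector points nowhere useful.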
Int 0xF0 provides the following functions to the main (real mode) thread only:
- AH = 0, verify existence. Return values, AX = 0xFACE if the driver exists, CX = total CPUs, BX = available CPUs. ES:DX = far procedure for calling from real mode of the services (instead of using the interrupt). GS:ESI = far procedure for calling from protected mode of the services (instead of using the interrupt), and EDI = near flat procedure for calling from long mode of the services.
- AH = 1, begin thread. BX is 0 to select any CPU, or > 0 to select a specific CPU. ECX is the number of loops to spin while waiting for the thread's startup notification (which the thread sends by calling function AH = 7); if ECX is zero, no waiting is done. AL bits 0-1 have the following meaning:
- 00, begin (un)real mode thread. EDX holds the segment in the upper 16 bits, and the offset in the lower 16 bits of the real mode far address of code. Returns CPU handle to AX (> 0) if successful, 0 otherwise. The code that starts in real mode has DS and FS loaded with unreal mode 32-bit limits, it must setup a stack, and must terminate itself with int 0xF0 function 2.
- 01, begin 32 bit protected mode thread. The upper and lower 16 bits of EDX hold a real mode 16-bit code segment and a 16-bit data segment respectively, both of which are accessible from the thread. ESI and EDI hold the physical addresses of the 32 bit code segment that the thread should start from (offset 0) and of a 32 bit data segment, respectively. The code that starts in protected mode will have DS loaded with the real data segment selector, ES loaded with the protected 32 bit data segment selector and GS loaded with the 16 bit code segment selector. Returns CPU handle to AX (> 0) if successful.
- 10, begin 64 bit long mode thread. ESI holds the physical address of the code to start in 64-bit long mode. Remember that long mode only has flat 64 bit zero-base segments. Returns CPU handle to AX (> 0) if successful.
If bit 2 of AL is 1 when beginning a real mode thread, then SI:DX holds the segment:offset of a procedure that is assumed to return with a RETF. A stack is set up automatically and the AH = 7 function is called before calling this procedure. This is useful for running a function written in a high level language as a real mode thread: the procedure need not call the AH = 7 function, and need not call AH = 2 to terminate. Declaring that function as "far" ensures that it will return with a RETF.
If ECX is > 0, then this function will spin loop until ECX gets to zero, unless your thread calls function AH = 7 to indicate that its initialization is complete.
- AH = 3, wait for CPU to finish. BX = handle. This uses a spinloop.
- AH = 5, mutex functions
- AX = 0x500, create mutex. This function returns a handle to BX and AX = 1 if successful.
- AX = 0x501, free mutex (handle in BX). AX = 1 on success.
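The main-thread functions can be combined as in the following sketch, a .COM program that starts a RETF-style real mode thread on any free CPU and waits for it. It assumes the driver is already installed (the INT 0xF0 vector is valid) and relies only on the register conventions listed above; whether the driver preserves other registers is not documented here, so the sketch reloads what it needs. Label and message names are hypothetical.

```asm
org 100h
use16
        mov     ah, 0                   ; function 0: verify existence
        int     0xF0
        cmp     ax, 0xFACE
        jne     Fail
        mov     ax, 0x0104              ; AH = 1 begin thread, AL = 100b:
                                        ; bits 0-1 = 00 real mode, bit 2 = RETF procedure
        xor     bx, bx                  ; BX = 0: any CPU
        xor     ecx, ecx                ; ECX = 0: do not spin waiting for startup
        mov     si, cs                  ; SI:DX = far address of the procedure
        mov     dx, ThreadProc
        int     0xF0
        test    ax, ax                  ; AX = CPU handle, 0 on failure
        jz      Fail
        mov     bx, ax
        mov     ah, 3                   ; function 3: wait for the CPU to finish
        int     0xF0
        cmp     byte [cs:Done], 1       ; did the other CPU run our code?
        jne     Fail
        mov     dx, MsgOk
        jmp     Print
Fail:
        mov     dx, MsgFail
Print:
        push    cs
        pop     ds                      ; make sure DS:DX addresses our messages
        mov     ah, 9
        int     21h
        mov     ax, 4C00h
        int     21h

; Runs on the other CPU. With bit 2 of AL set, the stack is set up for us
; and returning with RETF ends the thread.
ThreadProc:
        mov     byte [cs:Done], 1
        retf

Done    db 0
MsgOk   db 'Thread ran on another CPU', 13, 10, '$'
MsgFail db 'Could not run the thread', 13, 10, '$'
```

The thread writes through CS because the value of DS inside the thread is set by the driver, not by this program.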
Int 0xF0 provides the following functions to all threads, in real, protected or long mode:
- AH = 2, terminate thread. BX = handle (or 0 to terminate current thread). Currently, only BX = 0 is supported. Do not call this from your main CPU.
- AH = 4, execute real mode interrupt. AL is the interrupt number, BP holds the AX value and BX,CX,DX,SI,DI are passed to the interrupt. If this function is called from a real mode thread, then DS and ES are also passed to the interrupt. If this is a call from a protected or long mode thread, then DS and ES are loaded from the high 16 bits of ESI and EDI.
- AH = 5, mutex functions
- AL = 0x02, lock mutex. BX contains the mutex handle. AX = 1 if successful. This function will wait for the mutex if some other CPU has locked it, then it will lock it. ECX = wait timeout. If -1, waits indefinitely. Else waits in a loop until ECX = 0.
- AL = 0x03, release mutex. BX contains the mutex handle. AX = 1 if successful.
- AL = 0x04, wait for mutex. This function is similar to AL = 0x02, but it does not lock the mutex after waiting for it.
- AH = 7, thread started and initialization completed. You must call this function from your thread if the call to AH = 1 function had set ECX > 0, in order to properly synchronize the caller and the thread.
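As an illustration of the per-thread functions, here is a sketch in which a real mode thread takes a mutex, prints a string through DOS with function AH = 4, and releases the mutex. It assumes the driver is installed and uses only the register conventions from the two lists; the mutex is there purely to show the calls, and all label names are hypothetical.

```asm
org 100h
use16
        mov     ax, 0x0500              ; create mutex, handle returned in BX
        int     0xF0
        cmp     ax, 1
        jne     Exit
        mov     [cs:hMutex], bx
        mov     ax, 0x0104              ; begin real mode thread, RETF procedure
        xor     bx, bx                  ; any CPU
        xor     ecx, ecx                ; no startup wait
        mov     si, cs
        mov     dx, ThreadProc
        int     0xF0
        test    ax, ax
        jz      FreeMutex
        mov     bx, ax
        mov     ah, 3                   ; wait for the thread to finish
        int     0xF0
FreeMutex:
        mov     bx, [cs:hMutex]
        mov     ax, 0x0501              ; free mutex
        int     0xF0
Exit:
        mov     ax, 4C00h
        int     21h

; Runs on another CPU
ThreadProc:
        push    ds
        mov     bx, [cs:hMutex]
        mov     ecx, -1                 ; wait indefinitely
        mov     ax, 0x0502              ; lock mutex
        int     0xF0
        cmp     ax, 1
        jne     .Out
        push    cs
        pop     ds                      ; DS is passed on to the interrupt
        mov     dx, Msg                 ; DS:DX -> string for DOS function 9
        mov     bp, 0x0900              ; BP carries the AX value for the interrupt
        mov     ax, 0x0421              ; AH = 4 execute real mode interrupt, AL = 21h
        int     0xF0
        mov     bx, [cs:hMutex]
        mov     ax, 0x0503              ; release mutex
        int     0xF0
.Out:
        pop     ds
        retf

hMutex  dw 0
Msg     db 'Hello from another CPU', 13, 10, '$'
```

Reloading DS in real mode changes only the segment base, so the unreal mode limit that the driver set up for the thread is assumed to remain in effect.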
Now, if you have more than one CPU, your DOS game can directly access all 2^64 bytes of memory and all your CPUs, while still being able to call DOS directly. Isn't that fun?
INT 0x21 redirection
In order to avoid calling int 0xF0 directly from assembly and to make the driver compatible with higher level languages, an INT 0x21 redirection handler is installed. If you call INT 0x21 from the main thread, INT 0x21 is executed directly. If you call INT 0x21 from another thread, then INT 0xF0 function AX = 0x0421 is executed automatically.
So you can use your favorite stdio functions from a C function in another thread directly!
The code is available with a "testr" flag set to 1, in which case it sets up the interface, then performs some tests in real, protected and long mode, plus some mutex synchronization between 2 long mode threads.
If you set the "testr" flag to 0, then the app will install as a TSR, allowing other programs to benefit from its use. I've also included a dtest.asm sample to demonstrate execution.
Yes, dmmi2.asm has the whole thing in a single file. But the code is very clean (or so I've thought :P)
- Add a mode to start a virtualized thread.
- 22 - 5 - 2015 : Thanks to Brendan for the synchronization tip.
- 18 - 5 - 2015 : Fixed multiple call bug with End of Interrupt write.
- 17 - 5 - 2015 : First release.