This article targets the user who has first read my ASM tutorial (http://www.codeproject.com/KB/system/asm.aspx ) and wants to learn how Virtualization works. You want to create your own VMWare workstation? Let's go!
If you are a beginner programmer, quit right now.
If you are an advanced programer, quit right now.
If you are an expert programmer, quit right now. When you start reading this article you will feel like a beginner anyway.
BUT, since I am a beginner too, you will be eventually able to read what I have to say because I felt the same after I read the virtualization manuals. So keep reading!
We will create an application that prepares the CPU for virtualization, creates a guest, enters it and exits. All this will be done in x64 mode for simplicity. In x86 it is also possible, but we will focus on the x64 architecture to avoid unnecessary overhead in the code.
The code demonstrates only the basic VMX features and it might not work in your own CPU. However you can use bochs with virtualization enabled and then you will be able to test my code.
mov ecx, 0480h
A VMCS is a 4-KB aligned memory area used to support VM operations. It consists of 3 fields: 4 bytes that hold the revision number (0x480 MSR Register returned value), 4 bytes that are used for VMX Abort data (more on this later), and the rest is a collection of six fields to control the VM operations.
A VMXON region is a single VMCS region which you only need to initialize the revision number. Initialization of the VMXON region requires putting the correct revision number (first 4 bytes) as returned by the 0x480 MSR register above.
The VMXON instruction requires an address (e.g. VMXON [rdi]). This address should contain the 64-bit physical address of the VMXON region (4-KB aligned) and the first 4 bytes of that region should contain the VMX revision.
CR4 bit set for VMX operations
mov [rdi],ebx ; Put the revision. Rdi holds the VMCS address and ebx holds the revision
VMXON [rsi] ; Assuming rsi holds the address of the VMCS
The VMCS Groups
It was easy so far, but here starts your hell. The rest of the VMCS (that is, after the first 8 bytes (revision + VMX Abort) is divided into 6 subgroups:
Each of the above fields contains important information about how the VM starts (State after a VMEntry), what is the host state after a VMExit, when a VMExit will occur and others.
Func: VMX_TryGuest and VMX_TryGuest2
The Guest State
This contains the following information (In parentheses, the bit number):
The guest state describes the values of the registers that the cpu has after a VMEntry. Because you can totally control the registers, you can start a VM in any mode (real, protected, long etc). But even if you are to start a real mode VM (as my code does) you have to initialize the segment registers as normal p-mode selectors, with proper limits access etc.
The values that are used for the segment registers (limits, base address, selector , access rights and flags) are the same with those used in ordinary protected mode, so for example you will see my code adding a 0x92 access flag for a DS read/write data segment.
The Host State
This contains the following information (In parentheses, the bit number):
The host state tells the cpu how to return to the VMM after a VMExit.
Executon Control Fields
These fields essentially tell the CPU what is allowed to be executed in the VM and what is not. Everything not allowed causes a VMExit. The sections are:
My code only uses the pin-based and the processor based for simplicity, but these fields are your real swiss army knife; you can control entirely what the VM is and is not allowed to perform.
VM-Exit Control Fields These fields tell the CPU what to load and what to discard in case of a VMExit:
VM-Entry Control Fields
VM-Exit Information field (read only)
This means that, after a successful VM Entry, the guest will start with RIP set to 0.
You would think you were finished? Hahahah. Not so fast. You have to give your new Virtual Machine some memory to work, and you have to configure the EPT. An EPT is a mechanism that translates host physical address to guest physical addresses. And because the structure is similar to Paging structures, we will first going to review how paging works in x86 and in x64 mode.
In my previous article I noted that you need Paging and PAE style paging tables for long mode, but I didn't explain how to create such tables. Since they are required here as you see, here is a quick introduction to the paging mechanism.
Paging is the mechanism to map an actual physical address to another virtual address. Each process can have a full virtual address space without the need for the physical ram to be mapped to the entire range or even installed. The OS maps the same code (e.g. the kernel modules, or the API) to different address spaces for each process. When a process tries to access memory that doesn't exist, a "Page Fault" is generated.
Because paging has so many benefits, it's the only used in x86 mode (all modern OSes implement paging "flat" mode) and it is the only available in x64.
There are 2 tables that are used for paging: The Paging Directory and the Paging Table. Each one of these tables has a size of 4kb, containing 1024 dword entries. Their addresses must be alinged in a 4KB boundary as well. In the page directory, each entry points to a table in the paging table. In the paging table, each entry points to a physical address which is mapped to a virtual address. The virtual address is calculated using the offset in the paging directory and the offset in the paging table.
A See Through system (the style we used in the previous article) is a paging system in which each physical address is mapped to the same virtual address.
After you have created the page tables, load the Paging Directory address into CR3, then enable paging by setting CRO bit 31.
Each dword entry of the 1024 in the Paging Directory has the following format:
012 345 678 9-10 11-31
PRUWDA0SG - AV - Address
Each dword entry of the 1024 in the Page Table has also the above format. Here the address field is the physical address of a 4KB/4MB block (depending on the S flag in page directory) which is mapped. The virtual address that this physical address is mapped is the product of the multiplication of the page directory index by the page table index. Because each page can be 4MB in size and the page table has 1024 entries, a maximum of 4MB * 1024 = 4GB (the entire 32-bit space) is allowed to be mapped. The G flag, if set, prevents the INVPLG instruction to update the address in it's cache if CR3 is reset.
Physical Address Extensions (PAE)
PAE is a paging system to allow more than 4GB to be accessed in x86. PAE is mandatory for long mode as we saw. To enable PAE, set the CR4 bit 5 and after that, the CR3 points to a top level PAE table which consists of 4 64-bit entries.
So now there are 3 tables: the PDPT (Page Directory Pointer Table) (4 64-bit entries), the PDT (Page Directory Table) and the PT (Page Table).
But now the "S" bit in the PDT has a different meaning: If not set, it means that the page entry is 4KB but if set, it means that this entry does not point to a PT entry, but it describes itself a 2MB page. So you can have different levels of paging traversal depending on the S bit.
Each of the PDTD entries points to a Page Directory of 4KB (like in normal paging). Each entry in the new Page Directory is now 64 bit long (so there are 512 entries). Each entry in the new Page Directory points to a Page Table of 4KB (like in normal paging), and each entry in the new Page Table is now 64-bit long, so there are 512 entries. Because that would allow only a quarter of the original mapping, that's why 4 directory/table entries are supported. The first entry maps the first 1GB, the 2nd the 2nd GB, the 3rd the 3rd GB and finally, the 4th entry maps the 4th GB.
There is a new flag in the Page Directory entry as well, the NX bit (Bit 63) which, if set, prevents code execution in that page.
This system allows the OS to handle memory over 4GB, but since the address space is still 4GB, each process is still limited to 4GB. The memory can be up to 64GB but a process cannot see the entire memory.
In long mode the PAE system adds a new top level structure, the PML4T which has 512 64-bit long entries which point to one PDPT and now the PDPT has 512 entries as well (instead of 4 in the x86 mode). So now you can have 512 PDPTs which means that one PT entry manages 4KB, one PDT entry manages 2MB (4KB * 512 PT entries), one PDPT entry manages 1GB (2MB*512 PDT entries), and one PML4T entry manages 512 GB (1GB * 512 PDPT entries). Since there are 512 PML4T entries, a total of 256TB (512GB * 512 PML4T entries) can be addressed.
Each of the "S" bits in the PDPT/PDT can be 0 to indicate that there is a lower level structure below, or 1 to indicate that the traversal ends here. Note that some CPUs do not support 1GB page entries (i.e. the "S" bit set in a PDPT) and so does bochs by default. My code contains 2 functions for EPT initialization, one for 1GB support and one for PDT-level "S" support, which is most common. If you would like to test the 1GB pages, you must enable it in bochs using a CPUID flag in bochs' settings.
Now that we have reviewed how paging works, we are ready to configure the EPT - that is, to give some memory to our Virtual Machine.
Originally the VMX capabilities of the cpu required guests to start in paged protected mode, and VMM applications usually put the virtual cpu into VM86 mode, to allow OSes (which expect a clean real mode boot) to work. Soon they introduced the "Unrestricted Guest" flag (bit 7 in Secondary Exit Controls) that would allow a guest to start in real mode - and that we will be using here for simplicity. However putting the virtual cpu in real mode means we have to map the lower 640KBtyes, so we have to use EPT.
If your CPU doesn't allow the "unrestricted guest" mode, then you can setup a protected mode guest using similar code, because my code creates protected mode style segments anyway.
Of course, depending on your guest's initial state (for example, if you 'd want to start a guest in long mode) you would also need to configure Guest PAE, paging, proper CR4 and stuff. But our little application will configure a real mode guest, so it needs to map a region of host memory to guest physical address.EPT Translation uses the lower 48 bits (as the nowadays CPU actually do nowadays - not the entire 64-bit range is used).
The fortunate thing is that the EPT table is like the Page table and directories we have seen in our previous article. It consists of a top level PML4T which consists of 512 64-bit entries. And these entries either reference directly a memory area, or they reference a lower page table (PDPT, PDT or PG). The format of each entry is :
012 3-7 8-11 12:N-1 N:51 52:63RWE R I A R I
The PDPT entry is as follows: 012 3-5 6 7 8-11 12:29 30:N-1 N:51 52:63RWE MT PAT S I R A R I
The PDT entry is as follows: 012 3-5 6 7 8-11 12:20 21:N-1 N:51 52:63RWE MT PAT S I R A R I
012 3-5 6 7-11 12:N-1 N:51 52:63
RWE MT PAT I A R I
The initialization code for this VM is in VMX_TryGuest. It is setup so guest CR0 is set to real mode. VENTRY16.ASM has the entry point for our real mode guest, which does little but to set a flag and return to the VMM with VMCall.
Note that chances are that your cpu does not support the "Unrestricted guest" flag, so it can not start a Virtual Machine in real mode. If so, try the VM2 discussed below. To test if the cpu supports the unrestricted guest, check bit 5 of the IA32_VMX_MISC MSR (index 0x485):
Note that because the VMM has configured protected-mode style selectors, a real mode guest has to execute a JMP to a real-mode style CS (as if we were returning from protected mode)
Initialization for this VM is in VMX_TryGuest2. This time CR0 is set to be in protected paged mode, CR4 is loaded with the page directory (the very same used for normal protected mode since our EPT is a see-through). VENTRY32.ASM has the entry point for our protected mode guest and this time the selectors are ready to go. VENTRY32.ASM merely sets a flag and exits to the VMM with VMCall.
For a simple test, one might think that it should be easy to copy the actual bios to the virtual memory so, for example, DOS can boot. Right, and from what device will DOS boot since there is no one in the VM? That's why you have to duplicate an actual device into the VM using your custom BIOS in order to communicate with the host with a specific protocol, then emulate the allowed devices in order for the VM to function properly.
My application simply forwards memory in a see-through style, so calling BIOS and DOS from the VM is possible. But in real life you don't want to do that, as then the VM can ruin the VMM because they share the same memory.
In real life also, in case that the "unrestricted guest" isn't allowed, you have to start the guest in VM86 paging protected mode and if the guest likes itself to set protected/long mode (like an OS) you must catch the VMExit (which would occur when the guest software attempts to execute LGDT) and emulate all the calls that would otherwise fail (LGDT, LIDT, CR0 , Paging initialization etc) so the guest can assume that its operations were successful.
A VMExit can occur for various reasons, either because you had specified a VMExit reason in VMCS control/exit fields, or if the VM actually entered a shutdown state (for example, a ring 0 crash) that would reset the CPU if it would be run in an actual, non virtualized state, or anything else. Execution resumes at the VMCS host state saved (CS:XIP), and you can read the VMCS exit information (read only) to detect the reasons of the exit.
Use VMRESUME to resume the VM after an exit.
Some systems know that they run under virtualization (for example, VMWare drivers) and they do want to jump back to their host in order to exchange information. The VMCall opcode causes a VMExit to the host, and the virtualized system can exchange information with the host. My code also uses VMCall to exit to the host.Of course, if VMCall is executed in a non VMX-non root environment, an unrecognized opcode exception is thrown.
For simplicity, my code doesn't check for all features (that's the most probable reason it won't work in your raw DOS), but you should check the VMX MSRs for available features before testing them. Intel's 3B Appendix G contains all these MSRs. To load a MSR, you put its number to RCX and execute the rdmsr opcode. The result is in RAX.
So far we are interested in the VM science all right, but the programmer's soul will always contain notorious feelings like killing, revenging, cheating, cracking and all sort of that stuff.
Now we 'll take into account the fact that you are evil (or you wouldn't make it up to here) and discuss about Blue Pill. Blue Pill is a hypervisor virtus that controlls the entire OS. For this to work you would simply map the entire memory as a see through and start the VM with Windows in it, while configuring almost anything to cause a VMExit. Now whatever Windows tries will be reported to your hypervisor via VMExits, and using the injection technology you can fake any response - and since Intel doesn't have a (known) way to detect if an application is running in Virtualization, you will never get caught. Never? Who knows - but if you ever get caught let me assure you that I know nothing about it
But wait! CR4's 13th bit should be 1 inside a Virtual Machine so if that bit is 0 you definitely know you are not virtualized! But if this bit is 1, do you really know if you are under a VMM? Who knows. If anybody gets the Windows Loader source and finds out that a mov eax,cr4 - test eax 0x2000 - jz WE_ARE_OWNED sequence is there, let me know.
Another possible option to test if you are owned is the VMCall, which would raise an exception and you can catch it. However, did anyone ensure you that there wasn't an exit and your host injected an exception for you to catch and assume you are free?
Another possible option is to test if the CPU does not support unrestricted guests and you started in VM86 mode. If you see that you are running in VM86 mode then chances are that you are virtualized. But whoops - did we forget EMM386 exe? But Windows NT-based OSes do not load any DOS drivers so if NT loaded checks for VM86 and it is enabled, it may assume it is under virtualization.
There are a number of options that work and others that won't work:
As you saw, virtualization is not initially a very complex subject, but to make something that really works you need to implement a BIOS, drivers etc. That's why not many programmers really try such a thing and that's why only a few applications support virtualization. VMWare has completed a great deal of work to make their Workstation actually do the job.
The code is imported from my previous article and organized in 6 files. It is rather dirty, but it works.
Try it and tell me. If it doesn't work, tell me and help me to improve it. Either way, the fact that you are reading up to here is appreciated.
If you aren't disappointed by now, I urge you to apply to a Virtualization Software company like VMWare for a job - you will do very well. And tell them you have read my article, they might hire me as well GOOD LUCK.
This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)
General News Suggestion Question Bug Answer Joke Rant Admin
Use Ctrl+Left/Right to switch messages, Ctrl+Up/Down to switch threads, Ctrl+Shift+Left/Right to switch pages.