This program intercepts PC I/O port activity. It uses the X86 hardware debug registers as its ears and eavesdrops at the lowest level. Readers are assumed to be familiar with Windows device driver development and to have some knowledge of operating systems and the Intel X86 architecture.
I once wanted to analyze a program and find out what data it sends to and receives from some I/O ports. I was surprised to find few resources about I/O port sniffers on the web; I had thought it would be a common utility. So I had to write my own.
At first, I wrote a driver to hook the original HAL routines such as READ_PORT_UCHAR. Though it mostly worked, it failed on the program I wanted to analyze. I guess the programmer used assembly instructions like "in" and "out" directly in his code. So what could I do? Is there any way to intercept I/O port accesses from the bottom (the hardware level)? Of course there is. At least, WinDbg can do it: WinDbg can set breakpoints on I/O port accesses. It uses the X86 hardware debug registers as its deadly weapon.
Brief introduction of X86 debug registers
The Intel X86 has eight 32-bit debug registers, DR0 ~ DR7, to support hardware breakpoints. DR4 and DR5 are reserved, DR7 acts as the control register, DR6 as the status register, and DR0 ~ DR3 hold the breakpoint addresses, so only four hardware breakpoints can be set at the same time. A hardware breakpoint can be set on either a memory access or an I/O port access (I/O breakpoints also require the DE bit in CR4 to be set). When a hardware breakpoint is hit, the CPU sets the corresponding bits in DR6, raises a debug exception (#DB), and jumps to the ISR installed by the OS, just as it does with any other exception. A data or I/O breakpoint exception is a "trap", which means the #DB ISR is invoked after the CPU has executed the instruction that triggered the exception.
Setting or clearing the corresponding bits of DR7 enables or disables breakpoints. Bit 13 of DR7 is the global detection bit (GD, "general detect enable" in the Intel manual); if it is set, the CPU raises a debug exception whenever some code tries to access the debug registers. This is a "fault", which means the ISR is invoked before the CPU executes the offending instruction. The Intel manual says it is meant to support ICE (in-circuit emulator) debugging. At first I thought it was of no use to me, but in the end this bit saved my life.
Refer to the Intel system programming guide for details.
Now we have a powerful enough weapon and can set breakpoints on I/O port accesses. Once a breakpoint is hit by whatever code, a debug exception is raised and the ISR is called. So our next step is to replace the OS-provided debug handler with our own.
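To make the DR7 and DR6 bit layout more concrete, here is a minimal sketch in plain C that only computes register values; it does not touch the real debug registers. The bit positions follow the Intel manual, but the macro and function names (dr7_enable_io_breakpoint, dr6_first_hit) are made up for illustration and are not taken from the driver source.

```c
#include <assert.h>
#include <stdint.h>

/* DR7 fields, as laid out in the Intel system programming guide. */
#define DR7_GD     (1u << 13)  /* fault on any access to the debug registers */
#define DR7_RW_IO  0x2u        /* R/W = 10b: break on I/O read or write (needs CR4.DE = 1) */
#define DR7_LEN_1  0x0u        /* LEN = 00b: 1-byte access */
#define DR7_LEN_2  0x1u        /* LEN = 01b: 2-byte access */
#define DR7_LEN_4  0x3u        /* LEN = 11b: 4-byte access */

/* DR6 status bits. */
#define DR6_BD     (1u << 13)  /* #DB was caused by a debug register access (GD) */
#define DR6_BS     (1u << 14)  /* #DB was caused by a single step */

/* Hypothetical helper: return dr7 with breakpoint `index` (0..3) globally
 * enabled as an I/O breakpoint with the given LEN encoding. */
static uint32_t dr7_enable_io_breakpoint(uint32_t dr7, unsigned index, uint32_t len)
{
    unsigned field;

    assert(index < 4);
    field = 16u + 4u * index;            /* R/Wn at bits 16+4n, LENn at bits 18+4n */
    dr7 &= ~(0xFu << field);             /* clear the old R/W and LEN bits */
    dr7 |= DR7_RW_IO << field;           /* select "I/O read or write" */
    dr7 |= (len & 0x3u) << (field + 2u); /* access length */
    dr7 |= 1u << (2u * index + 1u);      /* Gn: global enable bit */
    return dr7;
}

/* Hypothetical helper: index of the first breakpoint flagged in DR6 (B0..B3),
 * or -1 if none of the four breakpoint bits is set. */
static int dr6_first_hit(uint32_t dr6)
{
    int i;

    for (i = 0; i < 4; ++i) {
        if (dr6 & (1u << i)) {
            return i;
        }
    }
    return -1;
}
```

In a driver, the breakpoint address register (for example DR0 = 0x60 for the keyboard data port) would be loaded first, and then the computed value would be written to DR7 with a "mov" instruction.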
Replace the debug handler
Here is an excellent article about how to hook an interrupt service routine.
At first, I tried to "hook" the OS-provided ISR and transfer control to the original code after my job was done. But the Windows #DB handler seems robust and performs too much verification: it always led to a blue screen no matter how I tried to wipe out the traces. So I had to replace the whole handler with my own.
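Replacing the handler comes down to rewriting the IDT entry for vector 1 (#DB). As a rough illustration, the sketch below models a 32-bit gate descriptor and shows how the handler address is split across it. The layout is the one documented by Intel; the type and function names (idt_gate32, gate_get_handler, gate_set_handler) are hypothetical and do not come from the driver source.

```c
#include <assert.h>
#include <stdint.h>

#define DEBUG_VECTOR 1  /* #DB is interrupt vector 1 */

/* 32-bit interrupt/trap gate descriptor, 8 bytes per IDT entry. */
typedef struct {
    uint16_t offset_low;   /* handler address, bits 0..15 */
    uint16_t selector;     /* code segment selector of the handler */
    uint8_t  reserved;
    uint8_t  type_attr;    /* gate type, DPL and present bit */
    uint16_t offset_high;  /* handler address, bits 16..31 */
} idt_gate32;

/* Reassemble the 32-bit handler address stored in a gate. */
static uint32_t gate_get_handler(const idt_gate32 *gate)
{
    return ((uint32_t)gate->offset_high << 16) | gate->offset_low;
}

/* Point a gate at a new handler and return the previous address.
 * Selector and attributes are left untouched. A real driver would do this
 * with interrupts disabled, and once per processor. */
static uint32_t gate_set_handler(idt_gate32 *gate, uint32_t handler)
{
    uint32_t old;

    assert(sizeof(idt_gate32) == 8);
    old = gate_get_handler(gate);
    gate->offset_low  = (uint16_t)(handler & 0xFFFFu);
    gate->offset_high = (uint16_t)(handler >> 16);
    return old;
}
```

In kernel mode, the base of the table would be obtained with the "sidt" instruction, and the entry at index DEBUG_VECTOR would be the one to patch; the old address should be kept so that it can be restored when the driver unloads.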
The ISR code has the following prototype:
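__declspec(naked) void __cdecl DebugHandler(BYTE *_eip, DWORD _cs, DWORD _eflags)
{
    __asm push ebp
    __asm mov ebp, esp
    __asm sub ebp, 4     // make [EBP+8] refer to the EIP pushed by the CPU
    // ... handler body ...
    __asm pop ebp
    __asm iretd          // return to the interrupted code
}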
The function is declared as __declspec(naked) to tell the compiler not to generate prologue and epilogue code (stack frame setup and local variable allocation) for us; we are responsible for that. This is not an ordinary routine called by another function; it is an ISR. The compiler has no knowledge of an interrupt handler's stack contents, but we can cheat the compiler and make it believe it's just a regular function.
According to the Intel manual, when an exception is raised, the CPU pushes the current EFLAGS, CS, and EIP onto the kernel stack (for some exceptions an error code is pushed as well) and jumps to the ISR. The "iretd" instruction pops these values from the stack and resumes the interrupted code. With the __cdecl calling convention, the caller pushes the parameters from right to left, and the call instruction then pushes the return address. Following is the comparison of the actual stack content and what the compiler thinks it should be.
From the diagram, we can see that the stack does not match, because the compiler assumes a return address has been pushed on the stack. So the compiler cannot reference the three parameters correctly. The instruction "__asm sub ebp, 4" fixes this problem.
The EBP register, also called the stack frame pointer, is used by the compiler to reference parameters and local variables. Almost every function starts with "PUSH EBP" followed by "MOV EBP, ESP", so the first parameter is referenced as [EBP+8] and the second as [EBP+12]. As the return address is missing from our stack, subtracting 4 from EBP makes the compiler happy. We can now use the three pseudo parameters as if they were really pushed by a caller, not by the hardware.
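The offset arithmetic behind the "sub ebp, 4" trick can be checked with a small model. The sketch below is plain C that only computes stack slot indices; it does not run as an ISR. The slot names and the helper resolve_param_slot are invented for this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Stack contents right after "push ebp" inside the handler, lowest address
 * first. Each slot is 4 bytes. There is no return address, because the
 * hardware pushed EFLAGS, CS and EIP instead of a caller. */
enum stack_slot {
    SLOT_SAVED_EBP = 0,  /* at [ESP]    */
    SLOT_EIP       = 1,  /* at [ESP+4]  */
    SLOT_CS        = 2,  /* at [ESP+8]  */
    SLOT_EFLAGS    = 3   /* at [ESP+12] */
};

/* A __cdecl compiler reads parameter n (0-based) at [EBP + 8 + 4n]. */
static int32_t cdecl_param_offset(int n)
{
    return 8 + 4 * n;
}

/* Hypothetical helper: which slot parameter n lands on when EBP has been
 * lowered by ebp_adjust bytes after "mov ebp, esp". */
static int resolve_param_slot(int n, int32_t ebp_adjust)
{
    assert(n >= 0 && n < 3);
    return (cdecl_param_offset(n) - ebp_adjust) / 4;
}
```

Without the adjustment, the first parameter resolves to the CS slot; with EBP lowered by 4, _eip, _cs and _eflags resolve to EIP, CS and EFLAGS respectively.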
To keep things simple, I don't declare any local variables. All temporary variables are declared as globals outside the function.
Prevent the breakpoints from being cleared by the scheduler
After some testing, I was excited to find that it worked, but only for a random, short time after it started. I found it ran longer if I stuck to the current window; if I switched to a new application, it stopped working immediately. It seemed a task switch would beat it down. A new task reloads its context from the TSS (task state segment); will the debug registers also be rewritten? After Googling, I was depressed to find that they will. Windows only allows hardware breakpoints for a single task, not for the whole system.
I was about to quit when I was lucky enough to recall an article in Phrack magazine. I had little knowledge of debug registers when I first read it, so it didn't impress me much. When I read it again, I felt foolish: I had even found the solution and coded it correctly, but abandoned it because of a minor mistake. It is so simple that I couldn't make myself believe it would work.
I mentioned that DR7 has a bit called global detection. If this bit is set, any debug register access raises a fault and control is transferred to the debug handler. If we adjust the saved EIP to point to the next instruction, then when "iretd" is executed, the instruction that accesses the debug register is simply skipped.
Debug registers can only be accessed by a "mov" to or from a general-purpose register such as EAX. The instruction is always three bytes long on X86. So in the ISR routine DebugHandler, a simple line "_eip += 3" is enough. That's why I declare _eip as a BYTE pointer.
As for the actual programming, some tricks are needed to prevent infinite loops when the global detection bit is set; refer to the source code. I abandoned my first attempt because I didn't watch this condition carefully. It is also very hard to debug this driver, because a single step in my #DB ISR code triggers the ISR again: the single step is also a #DB exception, so the code loops again and again in WinDbg. I don't know of any good method to debug a debug handler.
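The instruction-skipping step described above can be sketched on a plain byte buffer. The opcodes are the documented encodings of "mov" to and from a debug register (0F 23 /r and 0F 21 /r). The opcode check is an extra safety measure added for this illustration, whereas the article's handler simply adds 3 to _eip. The function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* "mov r32, DRn" is 0F 21 /r and "mov DRn, r32" is 0F 23 /r:
 * two opcode bytes plus one ModRM byte. */
#define MOV_DR_LENGTH 3

/* Return 1 if `code` points at a MOV to or from a debug register. */
static int is_mov_dr(const uint8_t *code)
{
    assert(code != 0);
    return code[0] == 0x0F && (code[1] == 0x21 || code[1] == 0x23);
}

/* Debug register number, taken from the reg field of the ModRM byte.
 * Only meaningful when is_mov_dr(code) is true. */
static unsigned mov_dr_register(const uint8_t *code)
{
    return (code[2] >> 3) & 0x7u;
}

/* Address to store back into the saved EIP: just past the MOV if the
 * faulting instruction is a debug register access, unchanged otherwise. */
static const uint8_t *skip_mov_dr(const uint8_t *eip)
{
    return is_mov_dr(eip) ? eip + MOV_DR_LENGTH : eip;
}
```

For example, the bytes 0F 23 F8 encode "mov dr7, eax"; skip_mov_dr returns the address of the byte that follows them, so "iretd" resumes after the write and the breakpoint settings survive. In the real handler, the BD bit of DR6 (bit 13) indicates that the #DB came from such an access.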
The code was now almost complete and tested great in a virtual machine. But when I tested it on my notebook, it behaved strangely and sometimes led to a blue screen. I was not surprised, as I had already thought about multi-processor problems. When I changed the number of CPUs of the virtual machine from 1 to 2, the code crashed there as well.
On a multi-processor, multi-core, or single-core Hyper-Threading platform, each logical processor has its own register set. My Intel i5 has two cores in a single physical package, and each core can execute two threads simultaneously (Hyper-Threading, not as good as it sounds). Windows tells me my notebook owns four strong hearts; of course, it's only a trick played together by Intel and Microsoft.
When we start a program, the operating system chooses a processor to execute its main thread. When the thread is put to sleep and later resumed by the scheduler, the OS picks a processor to run it by some algorithm, and most of the time the new processor differs from the previous one. As each processor has its own debug registers and interrupt descriptor table, the program only works while it is lucky enough to run on the processor that started it.
We can set the processor affinity in Task Manager to make our code always run on a specified core. But we are designing a device driver, and our #DB handler is called in an arbitrary thread context, which means all tasks would have to run on the same processor. Obviously, that's a stupid idea.
So we need to find a way to write to the register set of every processor. Some smart guys already solved this by using a DPC (deferred procedure call), which can be targeted at a specified processor. This technique is presented in the book "Rootkits: Subverting the Windows Kernel". But we needn't be that aggressive, because we can do it from user land with a documented API.
"SetThreadAffinityMask" can be used to bind a thread to a processor. We can first enumerate all the logical processors, then start a thread for each processor. In each thread, we call the driver to write that processor's register set, and clean up on exit. Browse the source code to see how to enumerate logical processors and how to bind a thread to each of them.
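A minimal sketch of the user-land approach follows. It is not the code of test.exe. As a simplification, it rebinds the calling thread to each processor in turn instead of starting one thread per processor, and the callback passed to run_on_each_processor stands for the (hypothetical) call into the driver. The mask-splitting helper is plain C; the Windows part uses only GetProcessAffinityMask and SetThreadAffinityMask.

```c
#include <assert.h>
#include <stdint.h>
#ifdef _WIN32
#include <windows.h>
#endif

/* Split an affinity mask into single-bit masks, one per logical processor.
 * Returns the number of masks written to `out` (at most 64). */
static int split_affinity_mask(uint64_t mask, uint64_t out[64])
{
    int bit;
    int count = 0;

    assert(out != 0);
    for (bit = 0; bit < 64; ++bit) {
        uint64_t single = (uint64_t)1 << bit;
        if (mask & single) {
            out[count++] = single;
        }
    }
    return count;
}

#ifdef _WIN32
/* Run `per_cpu` once on every logical processor available to the process.
 * `per_cpu` is a placeholder for the request that asks the driver to set up
 * the current processor. Returns the number of processors visited, or -1. */
static int run_on_each_processor(void (*per_cpu)(void))
{
    DWORD_PTR process_mask, system_mask;
    uint64_t masks[64];
    int i, count;

    if (!GetProcessAffinityMask(GetCurrentProcess(), &process_mask, &system_mask)) {
        return -1;
    }
    count = split_affinity_mask((uint64_t)process_mask, masks);
    for (i = 0; i < count; ++i) {
        /* The thread is rescheduled onto the selected processor. */
        if (SetThreadAffinityMask(GetCurrentThread(), (DWORD_PTR)masks[i]) == 0) {
            return -1;
        }
        per_cpu();
    }
    /* Restore the original affinity of the calling thread. */
    SetThreadAffinityMask(GetCurrentThread(), process_mask);
    return count;
}
#endif
```

The same loop would be run again on exit, with a callback that restores the original handler and clears the breakpoints on each processor.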
Source code and demo
The demo contains a driver "IOSniffer.sys" and a console application "test.exe". In the driver, I set hardware breakpoints on I/O ports 0x60 and 0x64, so it can be considered a lowest-level keyboard hook (it will not work with a USB keyboard). I haven't implemented any IOCTL codes; the breakpoints are hardcoded in the source. I will be happy if someone can turn it into an actually useful tool. I just lack the patience.
The demo uses only "DbgPrint" to communicate with the outside world, so you must install DbgView to watch the output messages.
Here is the sample output on my notebook. Apparently, Microsoft Windows is a good SMP system, as the processors are used evenly. If we dig further, we can peek at some system internals, as we catch the system every time it makes a task switch.
The code is tested on Windows XP SP3 32-bit. It won't work on the 64-bit version. Use it at your own risk.
The driver is built under WDK 7600.16385.1, and the test application is developed under VC10.0. They can be built under lower version development tools with minor or even no modification.
The driver contains some inline assembly code. Compared with ARM, I'm not familiar with X86 assembly at all. You may find my code awkward, as I always use general registers to access memory just as on a RISC platform, though X86 has much richer addressing modes and fewer general registers. I applied for the printed instruction set manual from Intel, but they told me it would never be available. So I had to beg my wife to print the manual. But she kicked my ass when she learned how much paper it would cost. The result is that my X86 assembly programming skill is still at the level it was when I was at school ten years ago, or even worse.