Using the Microsoft Driver Verification Tool
If you have ever used the Debugging Tools for Windows to analyze crash dumps, you have undoubtedly used WinDbg to open a crash dump file. WinDbg performs an internal analysis of the crash file and suggests that you start with the !analyze -v command. That command outputs the stack along with a lot of other information. The bottom of the stack shows where the thread transitioned into Kernel mode, and from there you walk up the stack to see if there is a culprit driver. While this is a solid debugging technique, sometimes a crash dump, or a set of them, will be unanalyzable. There may be no patterns in memory that point the finger at what is causing the system to crash, or the memory may simply be corrupt, so that the crash dump file points at Ntoskrnl.exe or win32k.sys rather than at the real culprit.
There is a way to turn unanalyzable crashes into analyzable ones by using the Microsoft Driver Verification Tool. This tool ships with every version of Windows, and is not a separate installation. It does not appear on the Start menu or in the Administrative Tools in the Control Panel. You launch it by typing “verifier” in the Run box of the Start menu, but you should know how it works before you use it. This article explains how to use the tool to turn unanalyzable crashes into analyzable ones. The Driver Verification Tool contains many options, some of which should be strictly avoided. For information on analyzing crash dump files, refer to “Analyzing Windows Crash Dump Files”.
To launch the Driver Verifier, we type “verifier.exe” in the Run box of the Start menu. The first dialog box shows a list of options. The option to choose is “Create custom settings (for code developers)”. Avoid the default “Standard settings” option. After clicking Next, we choose “Select individual settings from a full list”. Notice that we have chosen none of the default settings. After we click Next, we see a list of options that range from “Special pool” to “Miscellaneous checks”. We select all of them except “Low resource simulation”. Low resource simulation does exactly what its name says: it deliberately makes resource requests fail for the verified drivers in order to test how they behave when resources run out. We do not want to reboot into a system where resources are being drained on purpose, so we leave it off. The “Special pool” option is discussed later in this article. Let’s first examine “forced IRQL checking”.
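The same selection can also be expressed as a flags value for the verifier command line. The sketch below is only an illustration: the bit values are the ones recalled from Microsoft's documented "verifier /flags" table and should be confirmed against the documentation for your Windows version, and example.sys is a hypothetical driver name. It builds the mask for every option named above except low resource simulation.

```python
# Bit values are assumptions taken from the documented "verifier /flags" table;
# confirm them for your Windows version before relying on them.
VERIFIER_FLAGS = {
    "Special pool": 0x001,
    "Force IRQL checking": 0x002,
    "Low resources simulation": 0x004,
    "Pool tracking": 0x008,
    "I/O verification": 0x010,
    "Deadlock detection": 0x020,
    "Enhanced I/O verification": 0x040,
    "DMA checking": 0x080,
    "Security checks": 0x100,
    "Miscellaneous checks": 0x800,
}


def build_flags(excluded=("Low resources simulation",)):
    """OR together every option except the excluded ones."""
    flags = 0
    for name, bit in VERIFIER_FLAGS.items():
        if name not in excluded:
            flags |= bit
    return flags


def verifier_command(drivers, flags):
    """Format a 'verifier /flags ... /driver ...' command line."""
    return "verifier /flags 0x{:X} /driver {}".format(flags, " ".join(drivers))


flags = build_flags()
# example.sys is a hypothetical driver name used only for illustration.
command = verifier_command(["example.sys"], flags)
print(command)
```

The settings take effect after a reboot, just as they do when chosen in the wizard.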
Consider a driver that touches a piece of paged memory. The IRQL is currently at Passive level, which is the IRQL at which all user mode code runs. Because the driver has touched that memory, the memory manager must bring the data into physical RAM and connect it to that paged virtual address. Now the driver performs an operation that raises the IRQL to DISPATCH_LEVEL and immediately references that same paged buffer again. For that to show up as a bug caught by the Operating System, the memory manager would have to decide, within the few instructions between the first reference, the IRQL raise, and the second reference, that the touched page needs to be reused (or sent out to the paging file). That is extremely unlikely. With "forced IRQL checking" enabled, any time a driver that is being verified raises the IRQL to DISPATCH_LEVEL or higher, the memory manager takes all the pages that are part of the system's working set and connected to paged pool virtual memory, and disconnects them from physical memory.
Note that a working set is the amount of physical memory that is allocated to a process. The memory manager determines that amount by monitoring the behavior of the process, based on its memory demands and paging rates. So now, if that driver accesses the buffer again at DISPATCH_LEVEL, it generates a page fault, because the memory manager has to restore the connection between that virtual memory and physical memory. At that point the memory manager checks the current IRQL. It sees that the IRQL is DISPATCH_LEVEL or higher, determines that this is an illegal operation, and crashes the system. And that is what you want. You want to find which driver is buggy enough to get caught performing an illegal operation that would otherwise crash the system on some end user's machine or in some uncontrolled environment. Forced IRQL checking reveals which driver has exactly this kind of bug, and thus exposes the culprit.
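The effect can be illustrated with a toy model in user-mode Python. This is a sketch, not how the Windows memory manager is implemented: ToyMemoryManager, buggy_driver, and buggy.sys are hypothetical names, and the model tracks only which pageable pages are resident. Without the option, the second touch finds the page still resident and the bug stays hidden. With it, raising the IRQL trims the pages, so the second touch faults at DISPATCH_LEVEL and the system crashes.

```python
PASSIVE_LEVEL = 0
DISPATCH_LEVEL = 2


class BugCheck(Exception):
    """Stands in for a system crash (blue screen)."""


class ToyMemoryManager:
    def __init__(self, force_irql_checking):
        self.force_irql_checking = force_irql_checking
        self.irql = PASSIVE_LEVEL
        self.resident = set()  # pageable pages currently backed by RAM

    def touch(self, page, driver):
        if page in self.resident:
            return "hit"
        # Page fault: illegal at DISPATCH_LEVEL or higher.
        if self.irql >= DISPATCH_LEVEL:
            raise BugCheck(f"{driver} touched pageable memory at IRQL {self.irql}")
        self.resident.add(page)
        return "paged in"

    def raise_irql(self, new_irql):
        self.irql = new_irql
        if self.force_irql_checking and new_irql >= DISPATCH_LEVEL:
            self.resident.clear()  # trim pageable pages out of the working set


def buggy_driver(mm):
    mm.touch("paged_buffer", "buggy.sys")         # first touch at Passive level
    mm.raise_irql(DISPATCH_LEVEL)
    return mm.touch("paged_buffer", "buggy.sys")  # illegal second touch


print(buggy_driver(ToyMemoryManager(force_irql_checking=False)))
try:
    buggy_driver(ToyMemoryManager(force_irql_checking=True))
except BugCheck as crash:
    print("bug check:", crash)
```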
The pool tracking option is useful for finding driver memory leaks. I/O verification and enhanced I/O verification cause the Operating System's driver verifier code to rigorously inspect the data structure that a driver is passed and that the driver passes back to the system. That data structure is called an I/O Request Packet (IRP). The structure of an IRP has some special rules. It has to point to valid structures. It has to have a consistent set of values. So, the driver verifier inspects that packet after the driver has operated on it to make sure that it is still in a consistent state.
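To make the idea behind pool tracking concrete, here is a minimal sketch under the assumption that tracking amounts to remembering every outstanding allocation per driver and checking what is left when the driver unloads. ToyPoolTracker and leaky.sys are hypothetical names, and the real bookkeeping inside Windows is more involved.

```python
class ToyPoolTracker:
    def __init__(self):
        self.outstanding = {}       # driver name -> {address: size}
        self.next_address = 0x1000  # arbitrary starting address for the toy pool

    def allocate(self, driver, size):
        address = self.next_address
        self.next_address += size
        self.outstanding.setdefault(driver, {})[address] = size
        return address

    def free(self, driver, address):
        del self.outstanding[driver][address]

    def leaked_on_unload(self, driver):
        """Bytes the driver still holds when it unloads."""
        return sum(self.outstanding.get(driver, {}).values())


tracker = ToyPoolTracker()
kept = tracker.allocate("leaky.sys", 64)   # never freed: this is the leak
temp = tracker.allocate("leaky.sys", 128)
tracker.free("leaky.sys", temp)
leaked = tracker.leaked_on_unload("leaky.sys")
print("bytes leaked at unload:", leaked)
```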
To turn unanalyzable crashes into analyzable ones, we need a set of steps to follow, keeping in mind that certain crashes are related to certain conditions, which are the conditions these options test for. So, after selecting those options and clicking Next, choose “Select drivers from a list”. Do not select "Automatically select all drivers installed on the system". At that point, the wizard loads a list of drivers. Below that list is a button labeled "Add currently loaded driver(s) to the list". There may be a driver that you already know to be problematic and want to add. Sort the list by the Provider column to isolate the drivers that are not from Microsoft, and do a quick inventory. Select the suspicious drivers, enable Verifier on them, reboot, and see if the system crashes. If the system does not crash, take another step: select all unsigned drivers and/or third party device drivers, and run Verifier on those. If the system still does not crash, then as a last resort, run the driver verifier on every driver. But do not do this all at once. Choose around ten or twenty drivers at a time, enable Verifier, and then reboot. If you choose all of the drivers with those options configured, it could take your system 20 minutes to reboot. If you are doing this as an exercise, rather than trying to turn a real unanalyzable crash into an analyzable one, the system may behave differently for a short while but will eventually settle into stability.
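The escalation described above can be sketched as a small planning routine. This is only an illustration of the ordering and batching, not something Verifier provides: the function names and the drvNN.sys names are hypothetical. Suspects go first, then the remaining third party drivers, then everything else, in batches of about ten drivers, and no driver is repeated in a later round.

```python
def batches(drivers, size=10):
    """Yield consecutive slices of at most `size` drivers."""
    for start in range(0, len(drivers), size):
        yield drivers[start:start + size]


def verification_rounds(suspects, third_party, all_drivers, size=10):
    """Plan one reboot per batch: suspects, then third party, then the rest."""
    seen = set()
    rounds = []
    for stage in (suspects, third_party, all_drivers):
        remaining = [d for d in stage if d not in seen]
        seen.update(remaining)
        rounds.extend(batches(remaining, size))
    return rounds


# Hypothetical inventory of 25 drivers, for illustration only.
all_drivers = [f"drv{i:02d}.sys" for i in range(25)]
suspects = all_drivers[:2]
third_party = all_drivers[:7]

rounds = verification_rounds(suspects, third_party, all_drivers)
for number, batch in enumerate(rounds, start=1):
    print(f"round {number}: enable Verifier on {len(batch)} drivers, then reboot")
```

You would stop at the first round that produces an analyzable crash.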
Using the Notmyfault.exe Test Driver Program
Mark Russinovich, who is now employed by Microsoft, wrote a test program called “Notmyfault.exe”. This utility contains a device driver, myfault.sys, which causes particular types of crashes that correspond to particular Operating System conditions. Among the tools that he has written, this one is invaluable for gaining an understanding of system crashes and of what you can do to avoid them. While not mandatory, this tool is best run in a virtual environment. A virtual environment is a software layer that functions as a computing environment. If you download and install one of the trial copies of VMware Workstation, you will be able to install an Operating System within, but separated from, the Operating System you are running. Try it, and install an old version of Windows XP.
Even if you are not concerned with performance, there is another reason to go after batches of drivers if you are targeting all of the drivers, and that is the special pool option. When we invoke the Notmyfault.exe program to send the control request that tells the myfault.sys driver to perform a buffer overrun, the myfault.sys driver allocates a buffer from kernel memory and then writes past the end of that buffer. This corrupts the memory, as shown in the diagram:
Notice that we checked the buffer overflow choice. When we press the "Do bug" button, some random buffer in Kernel memory is going to be overwritten. The overwritten memory is therefore corrupt. But corrupting the memory does not by itself crash the system. The system crashes only when something references that corrupt memory. So, there is a potentially long delay between the corruption of that memory and the detection of that corruption. Normally, another driver or the Kernel makes the reference. Myfault.sys allocates a non-paged pool buffer and writes a string past the end, corrupting the pool header and the data structure that follows. So, we press "Do bug" once, and nothing happens. Maybe we press that button ten times, and still nothing has happened. One thing is for sure, however: we now have a very sick Kernel memory. If there is still no crash, run something like Internet Explorer to see if it exercises the Operating System enough for the memory corruption to be detected. If that doesn't work, run something heavier that stands more of a chance of crashing the system, something like Windows Messenger.
Assume that the system now crashes. Was it Windows Messenger that crashed the system? Definitely not. Rather, some software running in Kernel mode (not Windows Messenger) was invoked indirectly and caused the system to crash. When the Kernel detects a corrupt pool, the blue screen says to enable the driver verifier. It tells you what happened, but not why. At this point, we examine the crash file. The crash file shows the stack, and on that stack trace is a Microsoft device driver, possibly a very important one. That driver, however, only made a reference to the corrupted memory, and that reference is what crashed the system. It did not corrupt the memory.
Now, we perform the same test, but with the Driver Verifier enabled and all of the options selected (in particular, special pool, but again, do not enable low resource simulation). When you press the "Do bug" button, the driver attempts to write past the end of its allocation, but gets caught in the act. The system crashes immediately, and more importantly, the crash points directly at myfault.sys. That is, we have caught a driver red-handed. The verification option that triggered this is the special pool option. When a driver is being verified with the special pool option set, Windows tries to satisfy its memory allocations from a special region of memory, hence the name special pool. This region is special because every other page in it is an invalid page of memory, and because Windows aligns each driver buffer against the end of the page from which it is allocated. So, when the driver strolls off the end of its buffer, it doesn't end up in something else's buffer. It ends up touching one of these invalid pages of memory. The mere fact of touching an invalid page triggers a page fault. The page fault handler looks to see what is being referenced, sees that an invalid page of memory is being accessed from Kernel mode, and immediately crashes the system and tells you that there is a problem with the driver.
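The layout can be illustrated with a toy model. This is a simplified sketch with hypothetical names (ToySpecialPool, BugCheck): it models pages only as numbers, treats even pages as valid and odd pages as invalid, and ignores the alignment details of the real special pool. A 100-byte buffer is pushed against the end of its page, so the very first byte written past the end lands on the invalid page that follows.

```python
PAGE_SIZE = 4096  # small page size on x86


class BugCheck(Exception):
    """Stands in for a system crash (blue screen)."""


class ToySpecialPool:
    def __init__(self):
        self.owners = {}    # valid page number -> owning driver
        self.next_page = 0  # even pages hold data; odd pages are invalid

    def allocate(self, driver, size):
        if not 0 < size < PAGE_SIZE:
            raise ValueError("toy special pool only handles sub-page sizes")
        page = self.next_page
        self.next_page += 2  # skip the invalid page that follows
        self.owners[page] = driver
        # Push the buffer against the end of its page.
        return page * PAGE_SIZE + (PAGE_SIZE - size)

    def write(self, driver, address):
        page = address // PAGE_SIZE
        if page not in self.owners:
            raise BugCheck(f"{driver} touched invalid page {page}")


pool = ToySpecialPool()
buffer_start = pool.allocate("myfault.sys", 100)
for offset in range(100):  # writes inside the buffer are fine
    pool.write("myfault.sys", buffer_start + offset)
try:
    pool.write("myfault.sys", buffer_start + 100)  # one byte past the end
    caught = None
except BugCheck as crash:
    caught = str(crash)
print(caught)
```

In this model the overrun is reported at the moment of the write and with the name of the offending driver, which is the property that makes the real crash dump analyzable.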
There are a couple of limitations, however, even if you have enabled special pool on a particular driver or system-wide. Allocations that are sent to the special pool have to be slightly less than one page. That is, on an x86 system, the allocation has to be smaller than 4 KB, or 4096 bytes. So, when a driver does a large memory allocation, it is not going to come from the special pool. This means that if the driver overruns that buffer, it does not get the protection and the check that we had in the myfault.sys demo. The special pool is one of the driver verification options that does not require a reboot, but keep in mind that it is a limited resource. When the special pool has run out, drivers that are being verified have their pool allocations sent to the normal pool. In other words, they continue to be verified, but without the protection described above.
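These two limitations amount to a simple decision, sketched below. The function name and the way free special pool space is counted are assumptions made for illustration. The only rules taken from the text are that the allocation must be smaller than one page and that the special pool can run out.

```python
PAGE_SIZE = 4096  # small page size on x86


def choose_pool(size, special_pages_free):
    """Decide where a verified driver's allocation would come from."""
    if size >= PAGE_SIZE:
        return "normal pool"   # too large for the special pool
    if special_pages_free < 2:
        # Assumption: each allocation needs a data page plus an invalid page.
        return "normal pool"   # special pool exhausted
    return "special pool"


print(choose_pool(100, special_pages_free=10))
print(choose_pool(8192, special_pages_free=10))
print(choose_pool(100, special_pages_free=0))
```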
Another useful test is the system code overwrite test. System code overwrite happens when a driver has a bug that corrupts a pointer, and that pointer ends up pointing into the code of the Operating System kernel or other boot start drivers. Most of the time, this kind of access won't be detected. In those situations when it does get detected, Windows recognizes that a driver is trying to overwrite part of the code of the Operating System or another driver. For this to occur, Windows has to have a facility called system code write protection turned on. System code write protection is the mechanism by which the memory manager marks the code pages of the Operating System and the drivers as read-only. If a driver tries to write to those pages, a page fault is triggered, and the memory manager crashes the system with a stop code indicating that a driver tried to modify code. However, system code write protection is usually turned off for performance reasons. That is, for performance, the Kernel is not marked as read-only on most systems. To save space in the translation lookaside buffer (TLB), the cache on the CPU that maps virtual addresses to physical addresses, Windows maps the Operating System code and the boot start drivers into a single large page of memory. A standard, or "small", page on a typical x86 machine is 4 KB. An image, like a driver or the Kernel, is laid out in 4 KB pages, some holding code and some holding data. If the Operating System loads an entire image, with both code and data pages in it, into a large 4 MB page, it has no choice but to set the memory protection on that page to read/write. Otherwise, the Kernel and drivers wouldn't be able to modify their own data, which is also mapped into that large page. So, the system is nearly always set to have system code write protection off.
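The trade-off can be modeled in a few lines. This is a toy sketch with hypothetical names (map_image, write_to, buggy.sys) and a two-section image. It captures only the rule stated above: a large page carries one protection for everything inside it, while small pages can protect code and data separately.

```python
class BugCheck(Exception):
    """Stands in for a system crash (blue screen)."""


# Hypothetical image with one code section and one data section.
IMAGE_SECTIONS = [(".text", "code"), (".data", "data")]


def map_image(sections, use_large_page):
    """Return the protection each section ends up with."""
    if use_large_page:
        # One protection for the whole large page; data must stay writable.
        return {name: "read/write" for name, _kind in sections}
    return {
        name: ("read-only" if kind == "code" else "read/write")
        for name, kind in sections
    }


def write_to(protections, section, driver):
    if protections[section] == "read-only":
        raise BugCheck(f"{driver} attempted to write to read-only memory")
    return "written"


large_page_system = map_image(IMAGE_SECTIONS, use_large_page=True)
small_page_system = map_image(IMAGE_SECTIONS, use_large_page=False)

print(write_to(large_page_system, ".text", "buggy.sys"))  # goes unnoticed
try:
    write_to(small_page_system, ".text", "buggy.sys")
    outcome = "written"
except BugCheck as crash:
    outcome = str(crash)
print(outcome)
```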
A Brief Note
A common misconception is that when an executable like Notepad is launched, the entire image is loaded. In reality, only the pieces that are needed are loaded, which is the work of a "lazy allocator". As more features are used, more of the executable's image is read off of the disk. The necessary DLLs are not loaded in their entirety either. Only the referenced parts of those DLLs are loaded. This is called "virtual allocation". Any memory in Windows that is shareable is shared. This means code, including the code in DLLs, but not data. Another misconception is that if you load two instances of Notepad, there are two loaded images of Notepad. Windows realizes that the second instance uses an image that already has pieces loaded into physical RAM, and automatically connects the two virtual images to the same underlying pages. But the data typed into each Notepad instance is private to that instance. Therefore, the data is not shared, but the code that executes Notepad is shared, as are the DLLs.
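A toy model makes the sharing rule concrete. ToyMemory and its methods are hypothetical, and physical frames are just counters. Two Notepad instances that touch the same code page end up on the same frame, while each instance's data gets a frame of its own.

```python
class ToyMemory:
    def __init__(self):
        self.frames_in_use = 0
        self.code_frames = {}  # (image name, page index) -> physical frame

    def _new_frame(self):
        self.frames_in_use += 1
        return self.frames_in_use

    def touch_code(self, image, page_index):
        """Code pages are loaded on first use and shared afterward."""
        key = (image, page_index)
        if key not in self.code_frames:
            self.code_frames[key] = self._new_frame()
        return self.code_frames[key]

    def touch_data(self):
        """Data pages are private, so every touch gets its own frame."""
        return self._new_frame()


memory = ToyMemory()
first_code = memory.touch_code("notepad.exe", 0)   # first instance
second_code = memory.touch_code("notepad.exe", 0)  # second instance
first_data = memory.touch_data()
second_data = memory.touch_data()
print(first_code == second_code, first_data == second_data)
```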
There is a way to enable system code write protection explicitly in the Registry, but this is not necessary. When Verifier is on, even with the most minimal settings, system code write protection is enabled. So, when you turn on Verifier, the Kernel maps itself and the drivers with small pages, and an attempt to write to code therefore generates an immediate blue screen. If we select the "code overwrite" radio button in Notmyfault.exe, myfault.sys overwrites the first few bytes of NtReadFile, a frequently used system function. NtReadFile is the underlying system call that is made on behalf of a thread that reads from any file handle.
So, when we press "Do bug" on a default system, that code is overwritten. The write is permitted because the Kernel and boot driver images are mapped into a large page that is marked read/write. The damage shows up soon afterward, because something else calls into the NtReadFile function, runs into an instruction that has been overwritten, and causes a crash. If you look at the stop code, there is no pointer to the driver, because the driver that caused this crash is long since gone. If we looked at the crash dump file, we could easily find a key system component such as Win32k.sys on the stack, and blaming it would be a misdiagnosis. We could use the most advanced debugging commands and still be left with an unanalyzable crash file. So again, the solution is to use the driver verifier. Pressing "Do bug" with the driver verifier on means that system code write protection is on. When the system crashes, the blue screen immediately points at myfault.sys, and further text states that an attempt was made to write to read-only memory.
The crash dump file, which previously pointed at a driver that had nothing to do with the crash, now points to the correct driver and gives an accurate explanation of what went wrong. That is, with driver verifier turned on, system code write protection was on, and the exact culprit was caught as it happened. The unanalyzable crash dump file now appears as a basic crash dump file.
Stated loosely, when the debugger's analysis engine cannot draw a correct conclusion from Windows crash dump files, the goal is to transform those files into analyzable ones. Driver Verifier is the tool to help do that and, by exposing buggy drivers, to help improve the stability of your system.
References and Suggested Reading
- Advanced Windows Debugging by Mario Hewardt and Daniel Pravat
- Sysinternals Video Library, with David Solomon and Mark Russinovich