Analyzing Windows Crash Dump Files

31 Oct 2008
An article that focuses on how to analyze a crash dump file.

Introduction

This article focuses on using the Debugging Tools for Windows to analyze a crash dump. The intention is to encourage readers to use these techniques when their own systems crash; it is a skill that can be learned by anyone whose systems crash often. Analyzing a crash dump file generated by the Operating System can be an easy task once a few principles, and the tools needed to perform the analysis, are understood. The tools in question are the debuggers in the Debugging Tools for Windows package. After installing them, you can download the symbol files and cache them locally. During a debugging session, these symbol files can also be downloaded on demand from the Microsoft Symbol Server by setting the _NT_SYMBOL_PATH environment variable:

set _NT_SYMBOL_PATH=srv*c:\symbols*http://msdl.microsoft.com/download/symbols

On Vista, the variable can be made system-wide with setx, which takes the variable name and the value as separate arguments and accepts the /M switch at the end of the line. Notice how the symbols are cached locally in a directory called c:\symbols.

But what are symbols? When a program is built, the compilation process translates the human-readable source code into machine code. This code is normally placed in an object file, which contains a symbol table describing all the objects in the file that have external linkage. Symbols refer to variables and functions in the running program by the names given to them by the programmer in the source code. In order to display and interpret these names, the debugger requires information about the types of the variables and functions in the program, and about which instructions in the executable file correspond to which lines in the source code files. Such information takes the form of a symbol table, which the compiler and linker produce while building the executable. On Windows, this information is usually stored in a separate symbol (.pdb) file rather than in the executable itself, which is why it has to be downloaded.

The symbols downloaded from the Microsoft Symbol Server cover Microsoft code only. As we will see, a third party driver will typically not have symbols, and may also use a calling convention that omits the stack frame pointer. Such a driver might call an Operating System function in which the crash then occurs, but it is likely that the driver passed that function some erroneous data. Another powerful debugging tool is livekd.exe, written by Mark Russinovich. As we will see, he is also the author of a tool that deliberately causes a crash, so that one can learn how to analyze crash dump files and put that knowledge to practical use.
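To make the role of a symbol table concrete, the following minimal sketch resolves a raw address to the nearest preceding symbol plus an offset, which is the general form (name+0xNN) that a debugger prints. The table contents, names, and addresses are invented for illustration and are not taken from any real symbol file.

```python
import bisect

# Hypothetical symbol table: (start address, name), sorted by address.
# Real symbol files hold far more (types, line numbers); this models names only.
SYMBOLS = [
    (0x1000, "demo!Initialize"),
    (0x1400, "demo!ReadBuffer"),
    (0x1A00, "demo!Cleanup"),
]


def resolve(address):
    """Return 'name+0xoffset' for the closest symbol at or below address."""
    starts = [start for start, _ in SYMBOLS]
    index = bisect.bisect_right(starts, address) - 1
    if index < 0:
        return hex(address)  # no symbol covers this address
    start, name = SYMBOLS[index]
    return f"{name}+{address - start:#x}"
```

Without such a table, the best a debugger can do is report an offset from the start of the module, which is what happens for a driver whose symbols are not available.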

Before we discuss these tools and how they are used, we must first understand that when the system crashes, something normally went wrong in kernel mode. A device driver or an Operating System function running in kernel mode incurs an unhandled exception, such as a memory access violation: for example, an attempt to write to a read-only page, or an attempt to read an address that isn’t currently mapped and is therefore not a valid memory location. Stated loosely, an executing thread writes, or attempts to write, to a memory block that it does not own and corrupts the state of that block.

Crash dump analysis falls under the topic of memory analysis. A fundamental aspect of memory analysis is that the addresses used by the Operating System are not the same as the physical locations needed to locate data in a memory dump. Because there is generally not enough physical memory to contain all running processes simultaneously, the Windows Operating System must simulate a larger memory space. Configuring a full memory dump is usually not very practical, since it also captures user mode code and data, which are normally not needed for crash dump analysis. If something went wrong in kernel mode, a kernel memory dump is the best choice for analyzing a system crash. These settings are found under Startup and Recovery on the Advanced tab of the System applet in the Control Panel, the same applet that gives access to the Device Manager and the remote settings.

A Brief Look at Threads and Processes

A thread is a unit of execution. Threads are what the system schedules, and each contains its execution state: the register values, the instruction pointer, and the stack pointer. A process is a container that has at least one thread, a handle table, a security token, and an address space. All threads in a process share that process's private address space, so it is up to the programmer to synchronize access to shared data among these threads. In fact, part of the Windows memory protection scheme is premised on the fact that when a thread of a process is executing, the address space of that process is mapped into the microprocessor’s memory management hardware. A process therefore cannot see the address space of another process, because that address space is simply not loaded into the memory management hardware at the time. This does not mean that it can never access the address space of another process. To do so, it has to follow Windows security principles, open that process, and use special APIs to gain access to the remote process’s address space.

The Windows Memory Manager creates the illusion of a flat virtual address space, when in fact the memory management unit of the microprocessor maps virtual addresses to physical addresses. This simulation of a larger memory space is achieved by creating a virtual address space for each process, which is translated to physical storage locations through a series of data structures. The main data structures are the page directory and the page table. Mapping is done at the granularity of a page (4 kilobytes on x86). A process represents to the system an instance of a running program, and a user mode application maps its code and data into the virtual address space of its process. The Operating System also needs to map itself, the configured device drivers, and the data used by those drivers, which is stored on the kernel-mode heap. A virtual address used by a process therefore does not represent the actual physical location of an object in memory. Instead, the system maintains a page map for each process, an internal data structure used to translate virtual addresses into the corresponding physical addresses.
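As a rough illustration of how the page directory and page table are indexed, the sketch below splits a 32-bit virtual address into its three parts. It assumes the classic two-level x86 layout (10 bits of directory index, 10 bits of table index, 12 bits of offset); systems running with PAE enabled use a different split, so treat this only as a model of the idea. The sample address in the usage note is arbitrary.

```python
PAGE_SIZE = 4096  # 4 KB pages


def split_virtual_address(va):
    """Split a 32-bit virtual address into (directory index, table index, offset).

    Assumes classic two-level x86 paging: 10 bits + 10 bits + 12 bits.
    """
    directory_index = (va >> 22) & 0x3FF   # selects a page directory entry
    table_index = (va >> 12) & 0x3FF       # selects a page table entry
    offset = va & (PAGE_SIZE - 1)          # byte offset inside the 4 KB page
    return directory_index, table_index, offset
```

For example, split_virtual_address(0xBEC0C5E8) yields directory index 763, table index 12, and offset 0x5e8. Two addresses that differ only in the offset lie on the same page and therefore share one physical page mapping.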

Another aspect of memory protection is that the address space consists of both the user's portion and a portion dedicated to mapping the kernel, the drivers, and the data they use. It would pose a security risk if user mode components like Notepad could reach into kernel mode and read the data there, or even modify it. For this reason, Windows relies on the memory management hardware to mark the pages that represent kernel address space as system pages. Each user process has its own separate address space, whereas all kernel mode components share a single address space that user-mode threads cannot access.

The memory management hardware on the processors that Windows runs on prevents anything running in user mode from accessing pages that are marked as system pages. So, in order for a thread to make a system call, and thus enter Operating System code and access kernel memory, a transition has to occur. When a thread makes a system call, it calls a function in a DLL that executes a special instruction, which safely transitions the processor into an elevated access mode. On the x86 architecture, this elevated mode is called Ring 0: kernel-mode code runs in ring 0, and user mode code runs in ring 3. Threads switch from user mode to kernel mode and back every time they make a system call. Once the switch is made, the thread is executing in kernel mode, and the Operating System and the drivers running on its behalf have access to the protected kernel-mode memory.

Interrupt Request Levels: IRQLs

x86 interrupt controllers perform a level of interrupt prioritization, but Windows imposes its own interrupt priority scheme known as interrupt request levels (IRQLs). This scheme is a software concept that Windows uses to prioritize its own work. It is basically the priority of what is happening on the processor at that point. A few IRQLs are commonly related to crashes. The lowest level is PASSIVE_LEVEL, at which no software or hardware interrupts are masked. By definition, when the system is running user-mode code, the IRQL is at PASSIVE_LEVEL. The IRQL can only be elevated while the system is executing kernel mode code, typically in response to software or hardware interrupts that trigger the execution of interrupt service routines or deferred procedure calls. Even when running kernel-mode code, the system tries to keep the IRQL at PASSIVE_LEVEL, because keeping interrupts unmasked makes it more responsive to the devices that are interrupting it.

The next IRQL relevant to system crashes is DISPATCH_LEVEL. DISPATCH_LEVEL is the highest software interrupt level, and scheduler operations are mapped to this level. When the scheduler is operating on the system, it raises the IRQL to DISPATCH_LEVEL. Other operations can also raise the IRQL to DISPATCH_LEVEL, and when they do, the scheduler is disabled on that processor. A thread running in kernel mode can therefore ensure that it is not preempted by another thread on that processor by raising the IRQL to DISPATCH_LEVEL. This turns off the scheduler, and the thread can run whatever operation it is performing to completion. When it is done, it drops the IRQL back down to PASSIVE_LEVEL, which re-enables the scheduler.

A side effect of having the scheduler off at DISPATCH_LEVEL is that a driver executing at DISPATCH_LEVEL or higher cannot take a page fault. It cannot reference memory that is marked as pageable and is not present, because doing so would invoke the memory manager's page-in handler, which would be forced to issue a disk I/O (a hard fault). The memory manager would then suspend the thread until the I/O is complete, that is, until the referenced data has been brought in from disk (from either a mapped file or a page file). Placing the thread in a wait state amounts to calling the scheduler and telling it to find another thread to run on the CPU. But at DISPATCH_LEVEL, the scheduler is off. This is a violation of the Windows internal synchronization architecture, and the system therefore treats it as an illegal operation.
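The rule described above can be captured in a small toy model. The sketch below is not how the kernel is implemented; it only encodes the decision "a reference to a non-resident page is recoverable below DISPATCH_LEVEL and fatal at or above it". The function and exception names are hypothetical; the numeric value 2 for DISPATCH_LEVEL is the x86 value.

```python
PASSIVE_LEVEL = 0
DISPATCH_LEVEL = 2  # x86 value; only the comparison matters in this model


class BugCheck(Exception):
    """Models the system halting with a stop code."""


def touch_page(current_irql, page_is_resident):
    """Model a memory reference made at the given IRQL.

    Returns a description of the outcome, or raises BugCheck when a
    non-resident page is referenced at DISPATCH_LEVEL or above.
    """
    if page_is_resident:
        return "access ok"
    if current_irql >= DISPATCH_LEVEL:
        # The page-in handler would have to wait for disk I/O, which
        # requires the scheduler; at this IRQL the scheduler is off.
        raise BugCheck("page fault at IRQL %#x" % current_irql)
    return "page fault handled: thread waits for disk I/O"
```

In this model, touch_page(PASSIVE_LEVEL, False) is an ordinary hard fault, while the same reference made at an elevated IRQL raises BugCheck. A resident page can be touched safely at any level, which is why drivers must keep the data they use at high IRQL in non-paged memory.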

The Stack in Contrast with the Heap

The stack is a last-in, first-out data structure that the debugger reads frame by frame, from the most recent call back to the oldest. The heap is dynamically allocated memory that programs use when the size of their data structures cannot be determined statically. That is, the data structures grow and shrink as the program dictates the need for heap allocations. The heap grows from lower memory addresses toward higher addresses, the opposite of the stack. On Windows, each occupies its own region of the address space, so the heap and the stack do not run into each other. The data section of an application program holds initialized global and static variables, and the BSS section holds uninitialized ones; neither is allocated from the heap.

The stack holds the data that the hardware records, and that device drivers calling Operating System functions record, to allow nested function invocation. So, when a device driver calls the Operating System, the information stored on the stack is used to pass parameters to the Operating System function and to return to the function that called it. The stack stores the parameters passed, the return address, and local variables (information local to the function that is processing the request). In Windows, each thread has two stacks. One is used for user-mode execution; it resides in the user address space and is therefore accessible to any thread in the process. When a thread enters kernel mode, having invoked a system call, it runs off its kernel mode stack. The kernel mode stack resides in the kernel address space, and is therefore not accessible to threads running in user mode.

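[Figure: stack layout for nested calls from Function 1 to Function 2 to Function 3, showing return addresses, saved frame pointers, local variables, and arguments]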

The return address is saved at the time when Function 1 makes its call to Function 2. That is what the hardware saves, so that when Function 2 returns, the hardware knows where it should pick up execution inside of Function 1. Function 2, after it is called, begins by setting up its frame pointer, which it saves on the stack. It may also need some local buffers to use temporarily while it is executing. These local buffers are allocated on the stack and are shown as Local Variable 1 and Local Variable 2. When Function 2 calls Function 3, it passes the arguments on the stack in the same way that arguments were passed to it. The stack frame pointers clearly delineate the areas that correspond to each of the functions in that nesting.
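A debugger can recover the chain of callers from such a stack by following the saved frame pointers. The sketch below models this on a tiny, made-up stack: memory is a dictionary from address to stored value, each frame pointer points at the caller's saved frame pointer, and the return address sits 4 bytes above it (the usual 32-bit x86 arrangement). All addresses and values are hypothetical.

```python
def walk_frames(memory, frame_pointer):
    """Follow saved frame pointers and collect the return addresses.

    memory maps an address to the 32-bit value stored there. At each frame,
    memory[fp] holds the caller's frame pointer and memory[fp + 4] holds
    the return address into the caller.
    """
    return_addresses = []
    while frame_pointer != 0 and frame_pointer in memory:
        return_addresses.append(memory[frame_pointer + 4])
        frame_pointer = memory[frame_pointer]  # step to the caller's frame
    return return_addresses


# Hypothetical stack while Function 3 is executing.
STACK = {
    0x0FE0: 0x0FF0, 0x0FE4: 0x2222,  # Function 3's frame: returns into Function 2
    0x0FF0: 0x1000, 0x0FF4: 0x1111,  # Function 2's frame: returns into Function 1
    0x1000: 0x0000, 0x1004: 0x0AAA,  # Function 1's frame: end of the chain
}
```

Calling walk_frames(STACK, 0x0FE0) visits the three frames in order, innermost first. If a function on the stack never saved a frame pointer, the chain is broken at that point, and the walker would need other information to find the next frame.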

The scenario above illustrates a very simple calling convention for the debugger's analysis engine to follow. Other code is compiled differently, however. Much of the code in the kernel itself is built with the frame pointer omitted: no frame pointer is pushed onto the stack, which makes it difficult for the analysis engine to reconstruct the frames unless it has symbols. The analysis engine has symbols for all Microsoft code, but if third party drivers on the stack omit the frame pointer, it is difficult for the engine to figure out where their stack frames are.

When you open a crash dump file in the Windows debugger, it performs a basic analysis and essentially makes a guess as to who the culprit is. To do so, it internally invokes a command that you can also issue explicitly, called !analyze (use !analyze -v for verbose output). !analyze displays the stop code, its parameters, and a guess at the offending driver. It works mainly by looking at the stack. Sometimes, one of the bug check parameters is the instruction pointer (cs:eip) of the offending instruction. Using the loaded module list (the lm command), the debugger can determine which driver that instruction belongs to. In other cases, !analyze uses heuristics to walk the stack and determine what was happening at the time of the crash, and then performs a sort of profiling. If the crash occurred inside the Operating System, but the caller of the Operating System function that triggered the crash was a third party driver, the debugger might report "Probably caused by" and point the finger at the third party driver, even though the crash itself occurred in a Windows Operating System function. That guess is often reasonable, because it is very likely that the called function was passed some erroneous data (a pointer to a corrupt data structure, or some parameter that was invalid). If the debugger states that the crash was caused by, say, ntoskrnl.exe or some file system driver, do not take it at face value. Microsoft has gathered a large amount of crash data showing that at least 80% of crashes are caused by third party device drivers. This means that you have to do more digging.

Mark Russinovich, who is now employed by Microsoft, wrote an application called “Notmyfault.exe” that is downloadable as freeware in a zip file. The application is meant solely for educational purposes, to help users learn how to analyze and interpret a crash dump file. When the Operating System detects something that is outside any legal bounds, it calls an Operating System function called KeBugCheckEx (documented in the Windows DDK). This function takes the stop code and four parameters that are interpreted on a per-stop-code basis. KeBugCheckEx masks out all interrupts on all processors and then switches the display into a low-resolution VGA graphics mode, which lets the system paint a blue screen. The example shown below describes some of the contents of the blue screen, which is displayed while physical memory is dumped to disk. You can use the Notmyfault utility to generate a crash. You simply select the type of scenario that should cause the system crash:

  • High IRQL (kernel mode)
  • Buffer Overflow
  • Code Overwrite
  • Stack Trash
  • High IRQL (user mode)
  • Deadlock
  • Hang

Choosing the High IRQL (kernel mode) crash and clicking “do bug” causes the driver to allocate a buffer from paged pool, free the buffer, raise the IRQL to DISPATCH_LEVEL (the level at which Deferred Procedure Calls run) or above, and then touch the page it has freed. If a crash does not happen immediately, the driver continues reading memory past the end of the page until it causes a crash by accessing an invalid page. Pages that are accessed at DISPATCH_LEVEL or above must be physically present in memory. The crash occurs, and the blue screen shows DRIVER_IRQL_NOT_LESS_OR_EQUAL while the system begins dumping memory to the page file. Stated loosely, when the system reboots, smss.exe (the Session Manager) looks at the page file to see whether it contains a crash dump. If so, it arranges for another process to copy the dump out of the page file and write it to %systemroot%.

So, with our symbols configured in the WinDbg tool, we use the Notmyfault utility, ideally in a virtual environment. The writer of this paper strongly recommends that the student of crashes download a trial version of VMWare Workstation for Windows, in which another installation of Windows can be set up as an artificial computing environment. This virtual environment is a software layer that provides a computing environment separated from your own Operating System, so a deliberate crash, or even spyware, inside the virtual machine is normally contained there and does not affect the outer computer.

After the blue screen appears, the system reboots, and we open the crash file that should reside in the Windows directory: c:\Windows\MEMORY.DMP. Open this file by choosing the “Open Crash Dump” command in the File menu of WinDbg:

Microsoft (R) Windows Debugger Version 6.8.0004.0 X86
Copyright (c) Microsoft Corporation. All rights reserved.
Loading Dump File [C:\Windows\MEMORY.DMP]
Kernel Summary Dump File: Only kernel address space is available
Symbol search path is: c:\symbols;srv*c:\symbols*http://msdl.microsoft.com/download/symbols
Executable search path is: 
Windows Kernel Version 6001 (Service Pack 1) MP (2 procs) Free x86 compatible
Product: WinNt, suite: TerminalServer SingleUserTS
Built by: 6001.18145.x86fre.vistasp1_gdr.080917-1612
Kernel base = 0x81a4c000 PsLoadedModuleList = 0x81b63c70
Debug session time: Sat Nov  1 01:08:53.731 2008 (GMT-4)
System Uptime: 0 days 4:07:49.287
Loading Kernel Symbols
.......................................................................................
Loading User Symbols
PEB is paged out (Peb.Ldr = 7ffd500c).  Type ".hh dbgerr001" for details
Loading unloaded module list
BugCheck D1, {bec0c5e8, 1c, 0, b80493dd}

Use !analyze -v to get detailed debugging information.

Notice that the error message states that no symbol information could be loaded: Microsoft did not write this driver, as myfault.sys is a third party driver.

*** ERROR: Module load completed but symbols could not be loaded for myfault.sys
Page 52f17 not present in the dump file. Type ".hh dbgerr004" for details

PEB is paged out (Peb.Ldr = 7ffd500c).  Type ".hh dbgerr001" for details
PEB is paged out (Peb.Ldr = 7ffd500c).  Type ".hh dbgerr001" for details
Probably caused by : myfault.sys ( myfault+3dd )
Followup: MachineOwner

Now, we explicitly issue the !analyze debugger command in verbose mode:

0: kd> !analyze -v

DRIVER_IRQL_NOT_LESS_OR_EQUAL is a common stop code. It states that an attempt was made to reference a page of memory that was not present while the IRQL was too high for a page fault to be handled:

DRIVER_IRQL_NOT_LESS_OR_EQUAL (d1)
An attempt was made to access a pageable (or completely invalid) address at an
interrupt request level (IRQL) that is too high.  This is usually
caused by drivers using improper addresses.
If kernel debugger is available get stack backtrace.
Arguments:
Arg1: bec0c5e8, memory referenced
Arg2: 0000001c, IRQL
Arg3: 00000000, value 0 = read operation, 1 = write operation
Arg4: b80493dd, address which referenced memory
Debugging Details:
------------------
Page 52f17 not present in the dump file. Type ".hh dbgerr004" for details
PEB is paged out (Peb.Ldr = 7ffd500c).  Type ".hh dbgerr001" for details
PEB is paged out (Peb.Ldr = 7ffd500c).  Type ".hh dbgerr001" for details
READ_ADDRESS:  bec0c5e8 Paged pool
CURRENT_IRQL:  1c
FAULTING_IP:    
myfault+3dd
b80493dd 8b06            mov     eax,dword ptr [esi]
DEFAULT_BUCKET_ID:  VISTA_DRIVER_FAULT
BUGCHECK_STR:  0xD1
PROCESS_NAME:  NotMyfault.exe
TRAP_FRAME:  aaf75b78 -- (.trap 0xffffffffaaf75b78)
ErrCode = 00000000
eax=bec0b5e8 ebx=83cf7138 ecx=b74e421c edx=83625088 esi=bec0c5e8 edi=00000000
eip=b80493dd esp=aaf75bec ebp=aaf75c44 iopl=0         nv up ei ng nz na pe nc
cs=0008  ss=0010  ds=0023  es=0023  fs=0030  gs=0000             efl=00010286
myfault+0x3dd:
b80493dd 8b06            mov     eax,dword ptr [esi]  ds:0023:bec0c5e8=????????
Resetting default scope

Note that the question marks indicate inaccessible memory, and that the faulting IP (instruction pointer) points at the address that corresponds to the module that caused the crash.
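The stop code and its four arguments follow a fixed textual pattern, so they are easy to pull apart programmatically. The sketch below parses the debugger's BugCheck line and labels the arguments using the meanings that the !analyze output gives for stop code 0xD1. The function and field names are my own and are not part of any debugger API.

```python
import re

# Meaning of the four arguments for stop code 0xD1, following the
# "Arguments:" section of the !analyze output.
D1_FIELDS = ("memory_referenced", "irql", "operation", "faulting_address")


def parse_bugcheck(line):
    """Parse a line such as 'BugCheck D1, {bec0c5e8, 1c, 0, b80493dd}'."""
    match = re.match(r"BugCheck ([0-9A-Fa-f]+), \{(.*)\}", line.strip())
    if match is None:
        raise ValueError("not a BugCheck line: %r" % line)
    code = int(match.group(1), 16)
    args = [int(part, 16) for part in match.group(2).split(",")]
    return code, args


def describe_d1(args):
    """Label the arguments of a 0xD1 bug check."""
    info = dict(zip(D1_FIELDS, args))
    info["operation"] = "write" if info["operation"] == 1 else "read"
    return info
```

Applied to the line from this dump, the second argument decodes to IRQL 0x1c, well above DISPATCH_LEVEL, and the fourth argument is the same address that FAULTING_IP reports inside myfault.sys.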

LAST_CONTROL_TRANSFER:  from b80493dd to 81aa6d24
STACK_TEXT:  
aaf75b78 b80493dd badb0d00 83625088 00000003 nt!KiTrap0E+0x2ac
WARNING: Stack unwind information not available. Following frames may be wrong.

aaf75c44 81c98615 840bebd8 83cf7120 83cf7190 myfault+0x3dd
aaf75c64 81c98dba 83decf08 840bebd8 00000000 nt!IopSynchronousServiceTail+0x1d9
aaf75d00 81c82a8d 83decf08 83cf7120 00000000 nt!IopXxxControlFile+0x6b7
aaf75d34 81aa3a1a 00000080 00000000 00000000 nt!NtDeviceIoControlFile+0x2a
aaf75d34 77bf9a94 00000080 00000000 00000000 nt!KiFastCallEntry+0x12a
0012f9f4 00000000 00000000 00000000 00000000 0x77bf9a94

Notice that the bottom of the stack contains the address 0x77bf9a94, which belongs to the user-mode code that transitioned into kernel mode. Reading upward from there, nt!KiFastCallEntry+0x12a dispatches to nt!NtDeviceIoControlFile+0x2a (the system service behind the IOCTLs in the system), which calls nt!IopXxxControlFile+0x6b7, which calls nt!IopSynchronousServiceTail+0x1d9, which finally winds up in the device driver myfault.sys referencing invalid memory. The resulting page fault is handled by nt!KiTrap0E at the top of the stack. In other words, running the executable in user mode leads to a system call, made through the DeviceIoControl function in kernel32.dll, that enters kernel mode and eventually reaches the driver. The module name “nt” that precedes the system functions stands for Ntoskrnl.exe (the kernel). So, when the debugger performs its analysis, it sees the fault handler in nt and, below it, myfault. It recognizes “nt” as standing for the kernel image, says to itself "that is ours", and keeps tracing until it reaches a module that is not.
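The "that is ours, keep tracing" idea can be sketched as a simple scan over the stack text. This is a drastic simplification of what !analyze really does, and the function name and the six-field line format are assumptions based only on the STACK_TEXT layout shown above: it returns the first frame whose module is not on a list of trusted modules.

```python
def first_third_party_frame(stack_lines, trusted_modules=("nt",)):
    """Return the first frame symbol whose module is not in trusted_modules.

    Each frame line is assumed to have six fields, ending with a symbol such
    as 'nt!KiTrap0E+0x2ac' or 'myfault+0x3dd'. Other lines (warnings, blank
    lines) and frames with only a raw address are skipped.
    """
    for line in stack_lines:
        fields = line.split()
        if len(fields) != 6:
            continue  # not a frame line
        symbol = fields[-1]
        if symbol.lower().startswith("0x"):
            continue  # raw address, no module information
        module = symbol.split("!")[0].split("+")[0]
        if module not in trusted_modules:
            return symbol
    return None
```

Given the frames from this dump, the scan skips nt!KiTrap0E and stops at myfault+0x3dd, which matches the debugger's "Probably caused by" line. A stack consisting only of nt frames yields no suspect, which is the situation in which more digging is required.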

STACK_COMMAND:  kb
FOLLOWUP_IP: 
myfault+3dd
b80493dd 8b06            mov     eax,dword ptr [esi]
SYMBOL_STACK_INDEX:  1
SYMBOL_NAME:  myfault+3dd
FOLLOWUP_NAME:  MachineOwner
MODULE_NAME: myfault
IMAGE_NAME:  myfault.sys
DEBUG_FLR_IMAGE_TIMESTAMP:  453143ee
FAILURE_BUCKET_ID:  0xD1_myfault+3dd
BUCKET_ID:  0xD1_myfault+3dd
Followup: MachineOwner

If the driver is not as familiar (or as obvious) as myfault.sys, use the lm (list modules) command to look at the driver’s version information. Add the k (kernel modules) and v (verbose) options, along with the m (match) option followed by the name of the driver and a wildcard:

0: kd> lm kv m myfault*
start    end        module name
b8049000 b8049ec0   myfault    (no symbols)
    Loaded symbol image file: myfault.sys
    Image path: \??\C:\Windows\system32\drivers\myfault.sys
    Image name: myfault.sys
    Timestamp:        Sat Oct 14 16:09:18 2006 (453143EE)
    CheckSum:         0000295E
    ImageSize:        00000EC0
    File version:     2.0.0.0
    Product version:  2.0.0.0
    File flags:       0 (Mask 3F)
    File OS:          40004 NT Win32
    File type:        3.7 Driver
    File date:        00000000.00000000
    Translations:     0409.04b0
    CompanyName:      Sysinternals
    ProductName:      Sysinternals Myfault
    InternalName:     myfault.sys
    OriginalFilename: myfault.sys
    ProductVersion:   2.0
    FileVersion:      2.0
    FileDescription:  Crash Test Driver
    LegalCopyright:   Copyright (C) M. Russinovich 2002-2004
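Two of the values in this listing can be cross-checked with a few lines of arithmetic. The sketch below confirms that a given address falls inside a module's start and end range and computes its offset, and it decodes the hexadecimal image timestamp, which counts seconds since 1 January 1970 UTC. The helper names are hypothetical.

```python
from datetime import datetime, timezone


def module_offset(address, start, end):
    """Return address - start if address lies inside [start, end), else None."""
    if start <= address < end:
        return address - start
    return None


def decode_timestamp(hex_value):
    """Interpret a hex image timestamp as seconds since 1970-01-01 UTC."""
    return datetime.fromtimestamp(int(hex_value, 16), tz=timezone.utc)
```

With the values above, module_offset(0xb80493dd, 0xb8049000, 0xb8049ec0) gives 0x3dd, the same offset the debugger prints as myfault+0x3dd, and decode_timestamp("453143EE") gives 14 October 2006, 20:09:18 UTC, which is the 16:09:18 shown by the debugger in the GMT-4 time zone of this session.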

Although the example above is a basic one that uses a test driver to generate a crash and a crash file, the techniques involved can often be used in other situations. Should your system crash regularly and the analysis point to a particular device driver as the culprit, go to the website of the driver's vendor. It may be that your current version is outdated and the driver has since been updated several times. Download the latest version and see if the system still crashes. Some third party driver vendors may also offer symbol files with debugging information.

References:

  • Windows Internals 4th Edition, by Mark Russinovich and David A. Solomon.
  • Sysinternals Video Library, by Mark Russinovich and David A. Solomon.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

