Click here to Skip to main content
Click here to Skip to main content

Tagged as

Diagnosing Tricky ASP.NET Production Issues with DebugDiag and winDbg

, 10 Jul 2010
Rate this:
Please Sign up or sign in to vote.
Diagnosing Tricky ASP.NET Production Issues with DebugDiag and winDbg

Issues that (appear to) occur only in production can be very frustrating. Recently, I've had difficulty with a ‘normal’ product upgrade introducing several features and fixes… the code base worked in development, passed QA and passed UAT in the staging environment.

However, 5 minutes after deploying to live, the w3wp process crashes, shortly followed by the other servers in the cluster.

Bugger.

A quick look in the event logs show no unusual errors logged by the application, but the Application log records this:

Faulting application w3wp.exe, version 7.0.6002.18005, time stamp 0x49e023cf, 
faulting module kernel32.dll, version 6.0.6002.18005, time stamp 0x49e037dd, 
exception code 0xe053534f, fault offset 0x0003fbae, process id 0x%9, 
application start time 0x%10.

Fecking awesome, no quick fix help here. After rolling the site back to the previous version, a bit of Googling reveals that exception code 0xe053534f (catchy) is a StackOverflowException.

This is one of those lovely exceptions (like OutOfMemoryException) which you can’t really recover from (or indeed catch!), as the CLR will simply halt executing any of your code whilst it cleans up the mess and then roll over and die.

What I need is something that can detect the error, stop the CLR, and give me a ice cube in hell’s chance of seeing what’s up. What I need is a debugger.

But You Can’t Hook Up VS to a Production Website!

If this was happening on my local environment, VS was clearly the tool of choice. But for hermetically-sealed, locked down environments like Production – Remote Debugging just ain’t gonna happen.

Shockingly, it seems I’m not the only (or even the first) developer to have deployed code with a bug in. Fortunately, this means there’s already some tools out there to help, enter DebugDiag.

First, a bit of preparation… I pick a sacrificial server to host the debugger process and the faulty code and remove it from the public facing cluster.

After installing and running the app, you’re faced with a simple wizard for defining what you want to hook into:

image

It’s pretty self-explanatory. You pick from Crash (App be Bust), IIS Hang (Web App be Slow) or Memory and Handle Leak (App be Greedy).

I’ll be wanting ‘Crash’.

Next, you need to pick what you’re trying to debug:

image

Again, it’s pretty obvious. As I’m not looking to borrow trouble from other applications, I’m just interesting in what’s going wrong with w3wp. So ‘A Specific Process’ it is.

image

Selecting the Process Name ‘w3wp.exe means that debugger will hook into any and all w3wp processes that it can find. Checking the ‘This process instance only’ will limit it to just that Process ID (I don’t want this).

And now for ‘Advance Configuration

image

If I just wanted to capture a dump when a crash occurs, I could just click ‘Next’. But right now I’m after a specific exception (0xe053534f). If the application crashes after that’s sorted, then I’ll just have to repeat the process. So click the ‘Exceptions’ button:

image

Now, it’s just a case of filling in the Hex Error Code from the Application Log (e053534f) and selecting ‘Full Userdump’ and set the Action Limit to ‘1’. This will halt the process when the first matching exception is recorded and grab everything it can, ready for diagnosis.

There aren’t any more exceptions that I want to add, so I can just continue through the wizard and save the rule when prompted (taking a note of where the logs will be stored):

image

And Activate the rule.

image

Now for the bit I hope the client never finds out about….knowingly exposing the public to faulty code!

In order to identify the issue, I need to capture the event. So, with the rule active in DebugDiag, I re-add the sacrificial server back to the public facing cluster and wait for the error to occur…

image

Thankfully, it’s simply a case of monitoring the Userdump count for the rule on the DebugDiag window. When that hits 1, I can remove the server from the cluster again.

Analysing the Crash Dump and Identifying the Issue

Ok, I’ve captured the exception and dropped the sacrificial server from the cluster again and transferred the dump to my development machine. Let’s take a look at what DebugDiag found.

Switch to the ‘Advanced Analysis’ tab, select the ‘Crash/Hang Analyzers’ script and the dump data file and click ‘Start Analysis’.

image

After a few seconds, an Internet Explorer window pops up with the findings:

image

Errrk, the recommendation mentions a fault with mscorwks.dll. Not to worry, I happen to know that all Out of Memory and Stack Overflow originate from this library… it’s where they’re detected and raised! Not the root cause, for that I need to look further into the stacktrace:

image

Wow, a crystal clear view of what was occurring on the thread when the stack overflow occurred – if you’re a sodding machine. This is the native code view of the compiled managed code, so it’s been doubly mangled. I need a translator….WinDbg.

So, open the DebugDiag crash dump in WinDbg:

image

And you’ll get a nice little terminal window:

image

All very well and good, but not exactly helpful. So we need to tell it to analyse the dump with its .NET goggles on. This can be done by entering these commands into the command line at the bottom of the screen:

  • .loadby sos mscorwks
  • !clrstack

This should now give a much more readable and understandable stack trace:

image

(Code blacked out to protect the guilty…even if it is me!)

By reviewing the stack trace, you should be able to work out the code that’s trapped in a recursive loop (a repeating set of lines towards the top) and also its entry point (towards the bottom).

Now it’s back to the usual replicate/fix/test bug fixing process.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

Share

About the Author

Martin Jarvis
Software Developer (Senior) Freestyle Interactive Ltd
United Kingdom United Kingdom
I'm a lead developer for Freestyle Interactive Ltd where we create many wonderful websites built on Microsofts ASP.Net and Ektron CMS.
 
I've been developing .Net applications (both Windows and Web) since 2002.
Follow on   Twitter

Comments and Discussions

 
GeneralMy vote of 5 PinmemberDavid Prothero9-Sep-10 4:51 

General General    News News    Suggestion Suggestion    Question Question    Bug Bug    Answer Answer    Joke Joke    Rant Rant    Admin Admin   

Use Ctrl+Left/Right to switch messages, Ctrl+Up/Down to switch threads, Ctrl+Shift+Left/Right to switch pages.

| Advertise | Privacy | Mobile
Web02 | 2.8.140821.2 | Last Updated 10 Jul 2010
Article Copyright 2010 by Martin Jarvis
Everything else Copyright © CodeProject, 1999-2014
Terms of Service
Layout: fixed | fluid