Automating the Reporting of Critical Errors in Your Program

OlegKrivtsov

Nov 27, 2012

CPOL

How to automate the collection of information about critical errors occurring in your program, and so greatly simplify bug analysis and troubleshooting.

Introduction

The only way to avoid bugs is not to write code. Every developer makes bugs, and few like fixing them. A coding error may lead to incorrect data processing in one case and to an exception (a crash, or critical error) in another. In this article I will show how to automate the collection of information about critical errors occurring in your program, which greatly simplifies bug analysis and troubleshooting.

This article is aimed mainly at developers and QA engineers. A beginner level is enough to follow it. After reading this article you will understand:

why you cannot effectively fix software errors just by communicating with the user by e-mail;
what information about a crash in your software can be collected automatically;
how to use that information to fix the majority of user issues.

Do we really need to automate error reporting?

Do you know what reaction I got when I told a friend of mine that I was planning to use a crash reporting library in my software? I heard the following: "Oh, why would someone need such a complex system? In my projects, for example, if something happens with my software, I just ask the end user to write me a letter describing which button he had pressed and what he was doing. And then I reproduce the error and fix it."

Yes, programmers are lazy. We want to convince ourselves that we are able to fix errors detected by end users, and that we can easily collect information about the problem by e-mail. Similarly, we convince ourselves that writing good and stable code is possible without unit tests, and even without fixing the warnings the compiler sprinkles over the build. Additionally, the customer needs the project to "be ready ASAP". And we think that all errors will be detected and eliminated some time in the future. Testers should detect bugs, so let them do that. There is never enough time, deadlines are running out, and the boss does not approve all those "extra bells and whistles".

Practice shows that such projects can be, and are, carried out in record time, and perhaps most of the errors are caught by testers. Nevertheless, after the release, end users send tons of e-mails to the support department complaining about exceptions in the program. As time passes, the effect of lacking a consistent architecture, code comments, unit tests and the like becomes evident. Every developer who changes someone else's code introduces new bugs and new potential for exceptions; some are caught again by testers, and some make it into the release.

What do developers usually do when a user sends an E-mail asking for help?

When a user writes to our support department that a critical error occurred in our program on his machine, I usually ask support to find out the version of the application and of the operating system. This is because the user's software environment is very important for identifying the cause of the error.

The first problem is that I have to communicate with the user through an intermediary (the support department), and I still have to explain to them what I want. The second problem is that I have to ask the user for technical information, but users rarely have technical knowledge. Most users I have talked to seem to be good people, but they cannot express their ideas and do not know the technical terms. It is impossible to explain to them how to change a registry key or open the hidden LocalAppData folder.

When the user responds to my request, I almost always have some additional questions like "please try to press this button and change that option". In response, I get silence. The user simply does not have the patience to do these actions several times until I collect the required information.

In our projects, whether an NT service or a GUI application, I use logging. I write error messages and ongoing operations to the log file. In theory, the log file should help me collect the technical information about the application that the user does not know and cannot know.

When an error occurs, I ask the user to send me the log file. This is where the problems begin. In fact, it is rather difficult to explain to the user where exactly to find the log file and at what moment to copy it. This is not a trivial problem. Often the user restarts the application after the error has occurred; the log file is then overwritten and reaches me empty, to my bewilderment.

Generally, my main goal is to reproduce the bug on my machine, which allows me to fix it. With only the user's instructions, I rarely understand what actions to perform to crash the app. A bunch of text in the log rarely answers this question either.

Therefore, the next stage of my communication with the user is asking him to make a screenshot or record a video showing his actions and how he arrives at the error in the application. Do you think many people have sent me a video? No, but I have nevertheless met such craftsmen. And sometimes that video turned out to be the most useful thing for reproducing the error.

Thus, the main disadvantage of collecting data about a critical error manually is that the user is lazy and tongue-tied. He will not test your hypotheses to find the cause of the problem - he has neither the patience nor the technical knowledge for that.

So which is the better way to collect information about an error: manually or automatically?

The idea of automatic error data collection and error report delivery appeared at Microsoft in the early 2000s (a Microsoft Research publication describes this), when the company was taken aback by a hail of errors occurring in many of its products (including Windows and Office). The Windows team developed a tool that made a core dump of the system, which could then be analyzed.

Independently, the Office development team created a tool that was able to catch unhandled exceptions and create a minidump file. A minidump contained only the small fragments of the process's virtual memory needed to read the call stack of the thread in which the exception occurred. The minidump was very convenient because it could be automatically sent over the Internet.

They named this tool Windows Error Reporting (WER) and built it into Windows XP. Since then, WER has been catching errors in all Microsoft products. The following statistics describe the effectiveness of WER in Microsoft products:

Fixing 20 percent of the "top" detected errors can solve 80 percent of customer issues;
Fixing the cause of 1 percent of the errors can resolve 50 percent of customer issues.
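
To make figures like these concrete, here is a minimal sketch of how such a share can be computed from your own reports. It assumes each report has already been reduced to a signature string (here a hypothetical "module+offset" form); the function name coverageOfTopBuckets is mine, and this is not how WER itself groups reports.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <map>
#include <string>
#include <vector>

// Returns the fraction (0..1) of all reports that fall into the `topN`
// most frequent crash signatures. Signatures are hypothetical strings
// such as "module+offset"; any stable grouping key would do.
double coverageOfTopBuckets(const std::vector<std::string>& signatures,
                            std::size_t topN) {
    if (signatures.empty()) return 0.0;

    // Group identical signatures into buckets and count them.
    std::map<std::string, std::size_t> buckets;
    for (const auto& s : signatures) ++buckets[s];

    // Sort the bucket sizes, largest first.
    std::vector<std::size_t> counts;
    for (const auto& kv : buckets) counts.push_back(kv.second);
    std::sort(counts.begin(), counts.end(), std::greater<std::size_t>());

    // Add up the reports covered by the topN largest buckets.
    std::size_t covered = 0;
    for (std::size_t i = 0; i < counts.size() && i < topN; ++i)
        covered += counts[i];

    return static_cast<double>(covered) / signatures.size();
}
```

If one signature out of four accounts for half of the incoming reports, fixing that single cause removes half of the crashes users see, which is the kind of effect the statistics above describe.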

Any developer can use WER to automate the collection of data about errors in his application. WER sends an error report to a special WER server, which a developer can access “for free”. However, you need a code signing certificate from VeriSign, which costs $500 per year. The reasoning is that if you develop for the Windows platform you need this certificate anyway, so access to the WER server counts as “free”.

Besides WER, there are other free libraries for collecting data about software crashes. For example, in my C++ projects for Windows I use an open-source library called CrashRpt. To Linux users I can recommend the open-source Google Breakpad library.

Personally, I prefer CrashRpt, since I write programs only for Windows. In addition, this library allows me to send error reports not only to my server via HTTP, but also as an e-mail to my mailbox, which is more convenient for me. Below, I'll show you what else you can do with this library and provide some links to articles describing exactly how to do that.

What error data can be collected automatically?

So, let’s assume that a critical error (exception) has occurred in the application. The exception can be triggered by many factors: dereferencing a NULL pointer, stack overflow, memory exhaustion and so on. In MSDN, you can find dozens of functions that the C run-time provides for intercepting (handling) exceptions. CrashRpt uses those functions internally to catch exceptions.
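
As a rough illustration of what "intercepting" means, here is a minimal sketch that uses only the portable part of the run-time: std::set_terminate and std::signal. It is not CrashRpt's code. On Windows a crash reporter would additionally rely on platform-specific hooks such as SetUnhandledExceptionFilter, which are left out here. The function names onFatalSignal, onTerminate and installCrashHandlers are my own.

```cpp
#include <csignal>
#include <cstdlib>
#include <exception>

// Hypothetical hook: a real crash reporter would hand the error data over
// to a separate process here. Only async-signal-safe calls are allowed
// inside a signal handler, so this sketch just exits immediately.
extern "C" void onFatalSignal(int sig) {
    std::_Exit(128 + sig);  // common convention: 128 + signal number
}

// Called by the C++ run-time when an exception is never caught.
void onTerminate() {
    std::abort();  // raises SIGABRT, which ends up in onFatalSignal
}

// Installs the handlers once, early in program start-up.
void installCrashHandlers() {
    std::set_terminate(onTerminate);
    std::signal(SIGSEGV, onFatalSignal);  // invalid memory access
    std::signal(SIGFPE, onFatalSignal);   // arithmetic error
    std::signal(SIGABRT, onFatalSignal);  // abort() and failed assertions
}
```

The point of the sketch is only that every kind of failure needs its own hook, which is why a library that installs all of them for you saves a lot of work.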

When an exception occurs, the exception handler function runs CrashRpt code, which first of all collects the exception pointers (a structure that contains the exception address, code and type). Next, CrashRpt starts a new process and passes the exception pointers to it. The parent application (in which the exception occurred) may be unstable, so it is terminated as soon as all the error data have been extracted.

Minidump

The main work of collecting error data then continues in that new process.

First, the minidump file is written using dbghelp.dll, a system library from Microsoft. To do this, all the threads of the parent process are suspended and a "snapshot" of the process is recorded. The snapshot includes the names and versions of all DLL modules loaded into the process and the list of threads running in it. For each of those threads, an image of the call stack is recorded. The operating system version, the number of CPUs and their brand names are also written to the minidump file. A minidump is typically several tens of kilobytes in size.

Note: For advanced information on exception handling in Visual C++ and minidumps, you can refer to the Effective Exception Handling in Visual C++ article on CodeProject.

Once we have the minidump file, we can open it in Visual Studio and inspect the state of the program at the time of the crash: the application version, the operating system version, and the place in the code where the exception occurred (see the figure below). Doesn’t it make our life easier?

Logs

Automating the collection of information about a crash does not mean we must give up conventional logs. We can, as before, record current operations and errors in the log file, and have it added to the error report automatically on crash. This eliminates the confusion that used to arise when requesting the logs from the user "manually". Now the log will always contain the actual data at the time of the crash, and you don’t have to explain how to open the hidden LocalAppData folder and which file to take from it.

In addition to the log file, we can add any custom application-produced files we wish (for example, INI configuration files and so on).
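
For the attached log to really contain "the actual data at the time of the crash", the application has to push each message to disk as soon as it is written. A minimal sketch of such a logger follows, assuming a plain text log file; the class name FlushingLog is hypothetical, and the library call that attaches the file to the report is deliberately not shown.

```cpp
#include <fstream>
#include <string>

// A tiny logger that appends to the file and flushes after every message,
// so whatever was logged before a crash is already on disk when the
// crash reporter copies the file.
class FlushingLog {
public:
    // Open in append mode so a restart does not wipe earlier messages.
    explicit FlushingLog(const std::string& path)
        : out_(path, std::ios::app) {}

    bool ok() const { return out_.is_open(); }

    void write(const std::string& message) {
        out_ << message << '\n';
        out_.flush();  // do not leave the message in a memory buffer
    }

private:
    std::ofstream out_;
};
```

Flushing on every message costs some performance, so in a hot code path you may prefer to flush only on errors or at fixed intervals.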

Screenshots

What I do not like about minidumps and logs is that they often do not show how to reproduce the error. Yes, I can see the place in the program where the crash occurred, and I can hypothesize about why it happened. For example, crashes often occur because a variable is not initialized and the program accesses a garbage memory address.

Note: For advanced information on the most frequent causes of program crashes, you can refer to the Making Your C++ Code Robust article.

But no matter how hard I try, in most cases I cannot find the sequence of actions that reproduces the crash on my machine. This is not only because each user has a unique software environment that differs from mine. It is also because users have their own patterns of working with the program. The way you use the program, and the actions you perform in it, can be radically different from what the user does.

A screenshot is therefore a very useful piece of error information. It lets me see the user's screen at the time of the crash. The CrashRpt library can automatically create a screenshot and save it as a JPEG file (the compression quality can be adjusted) or as a PNG file. As a result, I can see which button the user clicked at the time of the error, which, believe me, helps to reproduce the problem.

The figure below shows an example screenshot created automatically at the moment of the application’s crash. The screenshot contains only the area of the application window; the rest of the screen is automatically painted black (to guard the privacy of the user).
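
The blackening step itself is simple. Here is a minimal sketch of the idea, assuming the captured screen is a flat array of 0x00RRGGBB pixels and the application window is a single rectangle; the Rect type and the maskOutside function are my own names, not CrashRpt's.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Window rectangle in screen pixels; right and bottom are exclusive.
struct Rect {
    int left, top, right, bottom;
};

// Paints black (0) every pixel of a width x height image that lies
// outside the `keep` rectangle. Pixels are stored row by row.
void maskOutside(std::vector<std::uint32_t>& pixels, int width, int height,
                 const Rect& keep) {
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            const bool inside = x >= keep.left && x < keep.right &&
                                y >= keep.top && y < keep.bottom;
            if (!inside)
                pixels[static_cast<std::size_t>(y) * width + x] = 0;
        }
    }
}
```

Anything the user had open next to your application (a mail client, a banking page) never leaves his machine this way.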

Video Recording

Remember when I said that a couple of times users have sent me videos showing what they were doing just before the program crashed? Well, with the CrashRpt library you do not have to ask the user to make a screen capture movie. The library will make it itself (with the end user’s consent, of course).

It is clear that we cannot predict the moment when a crash occurs. Therefore CrashRpt periodically (at intervals that you define) takes screenshots and saves them to disk in uncompressed form (as BMP files). As screenshots accumulate, the oldest ones are removed and new ones take their place. In tests on my machine this takes about 5-7% of CPU time and several hundred megabytes of disk space.
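
The "oldest out, newest in" scheme is a classic bounded buffer. Below is a minimal sketch of it, not CrashRpt's implementation; the class name RecentFrames is hypothetical, and a frame can be anything you like, for example the name of a BMP file on disk.

```cpp
#include <cstddef>
#include <deque>
#include <string>
#include <utility>

// Keeps only the most recent `capacity` frames; when the buffer is full,
// the oldest frame is dropped to make room for the new one.
template <typename Frame>
class RecentFrames {
public:
    explicit RecentFrames(std::size_t capacity) : capacity_(capacity) {}

    void push(Frame frame) {
        if (capacity_ == 0) return;  // recording disabled
        if (frames_.size() == capacity_) frames_.pop_front();
        frames_.push_back(std::move(frame));
    }

    // Frames in recording order, oldest first.
    const std::deque<Frame>& frames() const { return frames_; }

private:
    std::size_t capacity_;
    std::deque<Frame> frames_;
};
```

The capacity is what bounds the disk usage: at, say, 1920x1080 pixels and 24 bits per pixel, one uncompressed frame is about 6 MB, so a few dozen frames already add up to several hundred megabytes.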

If an exception occurs, the recorded screenshots are compressed using the Ogg Theora video codec, and you get a video file that you can open in the Chrome or Firefox browser, or in any video player.

I agree that video recording is quite resource-intensive, but you do not have to keep it enabled all the time. You can, for example, enable it only when all other attempts to collect information from the user have failed and requesting the video is the last chance to reproduce the error.

Conclusion

Of course, fixing the critical errors in your program is "possible" even without automating error data collection. With some degree of success, you can request the data you need from the user by e-mail. But automatic collection of error data does this far more effectively. It makes the developer's life easier, and not only after the software release. For example, we use automation inside our company at the early stages of alpha and beta testing to facilitate communication with QA engineers, because testers, too, do not always know where to get a log file or how to take a screenshot or record a video clip.

If you decide to give CrashRpt a try, please refer to the Integrating Crash Reporting into Your Application - A Beginners Tutorial article for detailed instructions and source code.

By the way, do you know the shortest program in C having a critical error in it? It is presented below:

main () {main ();}

History

27 November 2012 - Initial release.