How do you monitor and troubleshoot a legacy system that fails silently?
We run nightly batch processes to extract data from a legacy system. These processes fail fairly often without giving any indication that there was a problem. The legacy system is outside our control and is scheduled to be replaced. We have no way of making the system more reliable, but we came up with a simple way to detect failures.
There are many commercial software packages for monitoring processes, but we wrote our own because we thought we could do so in less time than it would take to evaluate third-party packages. Our simple solution was adequate for our needs, we understand it thoroughly, and we can customize it however we like.
We used to schedule a batch file that would kick off scripts that interact with the legacy system. We replaced the batch file with a PowerShell script, and we inserted a manager application between the scheduled script and the legacy processes. This manager application, called RunAndWait, provides monitoring and logging features that the legacy system does not. RunAndWait starts a child process and sends notification messages if the process does not complete within a specified period of time. The program also writes notes to a log file.
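The core idea, starting a child process and waiting for it with a time limit, fits in a few lines. What follows is an illustrative Python sketch, not the C# source of RunAndWait. The function name run_and_wait is hypothetical, and killing the overdue child is an assumption of the sketch: the description above only says that notifications are sent.

```python
import subprocess
import sys


def run_and_wait(command, timeout_minutes):
    """Run `command` (a list: program followed by its arguments) and wait.

    Returns (timed_out, exit_code). exit_code is None when the child
    exceeded the time limit and was killed.
    """
    proc = subprocess.Popen(command)
    try:
        exit_code = proc.wait(timeout=timeout_minutes * 60)
        return False, exit_code
    except subprocess.TimeoutExpired:
        proc.kill()  # assumption: this sketch terminates the overdue child
        proc.wait()  # reap the killed process so it does not linger
        return True, None


if __name__ == "__main__":
    # Example: run a trivial child with a one-minute limit.
    print(run_and_wait([sys.executable, "-c", "print('hello')"], 1))
```

The caller decides what to do with the result, such as sending notifications and writing a log entry when the child timed out.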
One advantage of using PowerShell to control the process is that it provides far greater diagnostic information when things go wrong. However, the biggest improvement comes from the RunAndWait program rather than the script that runs it.
Using the Code
RunAndWait is a C# 2.0 console application. You must have version 2.0 of the .NET Framework installed to run the application.
You can use RunAndWait to start a child process (such as a *.bat file) and wait a specified time for it to finish. If the child process doesn't complete within the specified time, then RunAndWait.exe will send an email to all the recipients listed in the EmailAddresses section of the file RunAndWait.exe.config. You can edit this file to add whatever email recipients you need. A separate email message is sent to each person in the list, so that if one of the email addresses is bad, only that message fails and the messages to the other recipients are still sent.
The SMTP server name is specified in the EmailServer section of the app.config file (deployed alongside the executable as RunAndWait.exe.config). The address used in the "From" field is specified in the EmailFromAddr section of the same file.
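The one-message-per-recipient approach can be sketched as follows. This is a Python illustration under stated assumptions, not the C# code of RunAndWait. The names build_message, smtp_sender and notify_each are hypothetical, and the server, sender address and recipient list are passed in as plain values rather than read from the configuration file.

```python
import smtplib
from email.message import EmailMessage


def build_message(from_addr, to_addr, subject, body):
    """Build a plain-text message addressed to a single recipient."""
    msg = EmailMessage()
    msg["From"] = from_addr
    msg["To"] = to_addr
    msg["Subject"] = subject
    msg.set_content(body)
    return msg


def smtp_sender(server):
    """Return a function that delivers one message through `server`."""
    def send(msg):
        with smtplib.SMTP(server) as smtp:
            smtp.send_message(msg)
    return send


def notify_each(send, from_addr, recipients, subject, body):
    """Send one message per recipient and return the addresses that failed."""
    failed = []
    for to_addr in recipients:
        try:
            send(build_message(from_addr, to_addr, subject, body))
        except (smtplib.SMTPException, OSError):
            # A bad address or a dropped connection affects only this message.
            failed.append(to_addr)
    return failed
```

Passing the send function as a parameter keeps the loop independent of the mail server, so the isolation behavior can be checked with a stand-in sender before it is pointed at a real SMTP host via smtp_sender.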
The program also writes status information to a file RunAndWait.exe.log.
Note that RunAndWait.exe and RunAndWait.exe.config must both be present when you use the program.
RunAndWait.exe requires three arguments, in this order:
- A number specifying how long to wait (in minutes) before deciding that the child process failed.
- The successful exit code of the child process. This argument is either an integer or the character *, which indicates that all return codes should be interpreted as success.
- The name of the program to run.
Any additional arguments to RunAndWait.exe are passed on as arguments to the child process. There is no limit on the number of additional arguments.
Launching RunAndWait.exe without any arguments will print an explanation of the expected arguments.
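The argument convention described above can be sketched as a small parser. This is a Python illustration, not the C# source. The function name parse_arguments and the wording of the usage message are hypothetical, and accepting fractional minutes is an assumption of the sketch.

```python
def parse_arguments(argv):
    """Split RunAndWait-style arguments into (minutes, expected_code, command).

    expected_code is an int, or None when '*' was given (any code is success).
    Raises ValueError when fewer than three arguments are supplied or when
    a numeric argument cannot be parsed.
    """
    if len(argv) < 3:
        raise ValueError("usage: <minutes> <exit-code|*> <program> [args...]")
    minutes = float(argv[0])  # assumption: fractional minutes are accepted
    expected_code = None if argv[1] == "*" else int(argv[1])
    command = argv[2:]  # program name followed by any pass-through arguments
    return minutes, expected_code, command
```

Everything from the third argument onward is kept as one list, so the program name and its pass-through arguments can be handed to the process launcher unchanged.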
RunAndWait.exe 30 0 c:\tools\MyProg.exe
In this example, RunAndWait.exe will launch MyProg.exe, wait up to 30 minutes for it to finish, and expect an exit code of 0. If the program does not complete within 30 minutes, or if the exit code is not 0, email notifications will be sent out and a record written to the log file.
RunAndWait.exe 20 * c:\tools\MyProg.exe myfile.txt
In this example, RunAndWait.exe will launch MyProg.exe with the argument myfile.txt and will wait up to 20 minutes for the process to run to completion. If the program does not complete within 20 minutes, email notifications will be sent out and a record written to the log file. Because the second argument is *, the exit code is not examined.
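The two examples differ only in how the outcome is judged. That decision can be sketched as one function. This is a Python illustration, not the C# source: the name describe_failure and the message wording are hypothetical, and None stands in for the * argument.

```python
def describe_failure(timed_out, exit_code, expected_code):
    """Return a failure description, or None when the run counts as a success.

    expected_code is an int, or None to accept any exit code ('*').
    """
    if timed_out:
        return "child process did not finish within the time limit"
    if expected_code is not None and exit_code != expected_code:
        return f"child process exited with code {exit_code}, expected {expected_code}"
    return None


if __name__ == "__main__":
    print(describe_failure(False, 1, 0))     # first example: wrong exit code
    print(describe_failure(False, 1, None))  # second example: code ignored
```

A non-None result is the point at which notifications would be sent and a record written to the log file.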
Points of Interest
RunAndWait illustrates the principle of first trying the simplest thing that might possibly work. There is a fair amount of exception handling and error checking — after all, the point of the program is to monitor unreliable processes — but the code is essentially simple. The first cut of the program turned out to be very useful and we haven't elaborated on the original program much since we first began using it.
- 24th June, 2008: Initial post