Tiered Error Management and Recovery -- A Defensive Programming Technique

Marc Clifton

25 Jun 2007, CPOL, 6 min read

A small RecoveryService class that implements multi-level error management with multiple retries per level.

Download source - 4.46 KB

Introduction

This article is a part of a series of articles covering the topic of "Defensive Programming". If anyone has any good stories or techniques that they've developed regarding defensive programming, I'd love to hear them.

Just about all of the work I do for various clients nowadays is complicated enough that managing errors is no longer a matter of try-catch blocks and giving up. Error management means doing something more than just giving up. For example, DVD or AVI players may fail because of a bug in the underlying DirectX technology (no, really???), requiring the player to restart or sometimes the service hosting the player to restart. As another example, communications with a server might go down, triggering a number of retries of the primary communication path before the application tries secondary options, such as a dial-up line, or finally decides to switch to an offline, smart client state.

In fact, terminating the application with an error message or putting up a "something has failed, please wait" message is no longer acceptable in a wide variety of applications. There are alternatives to try, often behind the scenes and transparent to the user, all to support the ever-growing requirement (or perception) that the computer, its applications, and the data being shipped around are "mission critical". This drives new technologies (such as RAID, redundant servers, etc.) and solutions (while of course creating another class of problems at the same time).

Aside: And do these redundancies work? Sometimes not. June 2007 saw major flight delays because a primary computer handling flight plans failed and the backup computer couldn't handle the workload. We also saw the computers responsible for keeping the Space Station oriented correctly fail, an odd situation where three computers are used to confirm decisions, yet all three computers failed to reboot. So clearly, error recovery and management is complicated, hard to test, and even the resources of the FAA or an international space administration can't seem to get it right.

But back to my much simpler universe. To deal with some of my clients' requirements, I've written a small RecoveryService class that implements multi-level error management with multiple retries per level. In actual use, I have found that this class greatly simplifies the error management/recovery logic that had previously been entangled in the application.

Requirements And Design

My requirements are very simple:

I want to specify a sequence of recovery methods that are called in order until one of them succeeds or all of them fail.
Each recovery method can be retried a specified number of times.
Recovery attempts should return to the caller after every iteration/escalation, to help prevent UI lockup and pegged processor utilization.

The first two requirements result in two simple classes: one defining a recovery method and its retry count, and one managing the recovery list and processing.

Bells & Whistles

I also added events that the application can hook to monitor recovery calls, retrying, escalating, and failing recovery.

Implementation

Defining Recovery Methods

The RecoveryInfo class is a container class for the recovery method and the number of retries (additional times) the method is called.

/// <summary>
/// Container class for the recovery method and the
/// number of retries (additional calls) for that method.
/// </summary>
public class RecoveryInfo
{
  protected RecoveryDlgt recoveryFnc;
  protected int maxRetries;

  /// <summary>
  /// Returns the recovery method.
  /// </summary>
  public RecoveryDlgt RecoveryFnc
  {
    get { return recoveryFnc; }
  }

  /// <summary>
  /// Returns the number of retries the method is called.
  /// </summary>
  public int MaxRetries
  {
    get { return maxRetries; } 
  }

  /// <summary>
  /// Constructor.
  /// </summary>
  /// <param name="fnc">The recovery method.</param>
  /// <param name="maxRetries">The number of retries.</param>
  public RecoveryInfo(RecoveryDlgt fnc, int maxRetries)
  {
     recoveryFnc = fnc;
     this.maxRetries = maxRetries;
  }
}
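The declaration of RecoveryDlgt is not shown in this article. The following is a minimal sketch of how a recovery method might be wrapped in a RecoveryInfo, under the assumption that the delegate receives the service through IRecoveryService and reports success by setting Ready to true. That signature is inferred from the settable Ready property and may differ from the downloadable source. FakeService, RecoveryInfoSketch and RestartPlayer are hypothetical names used only for illustration.

```csharp
using System;

// ASSUMPTION: the article never shows this declaration.
public delegate void RecoveryDlgt(IRecoveryService service);

// Trimmed copy of the article's interface, just enough to compile.
public interface IRecoveryService
{
  bool Ready { get; set; }
}

// Same container as above, repeated so the sketch compiles on its own.
public class RecoveryInfo
{
  private readonly RecoveryDlgt recoveryFnc;
  private readonly int maxRetries;

  public RecoveryDlgt RecoveryFnc { get { return recoveryFnc; } }
  public int MaxRetries { get { return maxRetries; } }

  public RecoveryInfo(RecoveryDlgt fnc, int maxRetries)
  {
    recoveryFnc = fnc;
    this.maxRetries = maxRetries;
  }
}

// Hypothetical stand-in for the real service.
public class FakeService : IRecoveryService
{
  private bool ready;
  public bool Ready { get { return ready; } set { ready = value; } }
}

public static class RecoveryInfoSketch
{
  // Hypothetical recovery method: pretend the player restarted cleanly.
  public static void RestartPlayer(IRecoveryService service)
  {
    service.Ready = true;
  }

  // One attempt plus up to 3 retries of RestartPlayer.
  public static RecoveryInfo Build()
  {
    return new RecoveryInfo(RestartPlayer, 3);
  }
}
```

A recovery method that cannot fix the problem would simply leave Ready untouched, which is what lets the service decide to retry or escalate.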

The Recovery Service

The recovery service class implements an interface that is used by the events and abstracts the implementation.

public interface IRecoveryService
{
  /// <summary>
  /// Gets/sets the ready state. If true, the instance is ready. If false,
  /// the instance is in error and needs to execute a recovery process.
  /// </summary>
  bool Ready { get; set;}

  /// <summary>
  /// Returns the current recovery level.
  /// </summary>
  int CurrentLevel { get;}

  /// <summary>
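  /// Returns the current retry count within the current recovery level.
  /// </summary>
  int CurrentRetry { get;}

  /// <summary>
  /// Check the state and call the appropriate recovery process if in error.
  /// Returns true if another call to Check is needed to retry or escalate.
  /// </summary>
  bool Check();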

  /// <summary>
  /// Call this method to force a specific recovery level. The enumeration
  /// value should correlate 1:1 to the recovery processes as initialized in
  /// the constructor. The enumeration is used as a convenience to make the
  /// application code more readable when calling this method.
  /// </summary>
  void SetFailureLevel(Enum t);
}

From the interface, one can glean the usage of this class.

Ready Property

The application sets the Ready property to false when an error is first detected, and the property is set back to true after recovering from the fault. When transitioning from a "ready" state to a "not ready" state, the recovery service resets the recovery level and retry counters:

/// <summary>
/// Gets/sets the ready state. If true, the instance is ready. If false,
/// the instance is in error and needs to execute a recovery process.
/// </summary>
public bool Ready
{
  get { return ready; }
  set 
  {
    if (ready != value)
    {
      ready = value;

      // If transitioning from a ready state to a failure state,
      // the retry count and current level need to be reset.
      if (!ready)
      {
        Reset();
      }
    }
  }
}

SetFailureLevel Method

This method is used to force the service into a specific failure level. It accepts an Enum whose values should correlate to the recovery methods specified in the array passed to the constructor. The enum values should therefore be contiguous from 0 to n-1, where "n" is the number of recovery levels. The enum is a convenience to make the code more readable when using this method.

/// <summary>
/// Call this method to force a specific recovery level. The enumeration
/// value should correlate 1:1 to the recovery processes as initialized in
/// the constructor. The enumeration is used as a convenience to make the
/// application code more readable when calling this method.
/// </summary>
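/// <param name="t">An enumeration value whose integer value is the
/// index of the recovery level to force.</param>
public void SetFailureLevel(Enum t)
{
  int level = Convert.ToInt32(t);

  // Parameter validation. The level must index an existing recovery level.
  if (level < 0 || level >= recoveryLevels.Count)
  {
    // The first argument is the parameter name, the second is the message.
    throw new ArgumentOutOfRangeException("t",
           "Enumeration exceeds available recovery levels.");
  }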

  // Go into failure mode and set the current level to the specified
  // enumeration value.
  ready = false;
  retryCount = 0;
  currentLevel = level;
  manuallySet = true;
}
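To make the enum-to-level mapping concrete, here is a small sketch that isolates the conversion and a range check in the spirit of SetFailureLevel (this sketch also rejects negative values). The MediaRecoveryLevel enumeration and the FailureLevelSketch helper are hypothetical; they only illustrate an enum whose values run contiguously from 0 to n-1 in the same order as the recovery levels.

```csharp
using System;

// Hypothetical enumeration. The values must be 0..n-1, in the same
// order as the RecoveryInfo entries passed to the constructor.
public enum MediaRecoveryLevel
{
  RestartPlayer = 0,
  RestartHostService = 1,
  SwitchToOffline = 2
}

public static class FailureLevelSketch
{
  // Converts an enum value to a recovery level index and validates it
  // against the number of configured levels.
  public static int ToLevel(Enum t, int levelCount)
  {
    // Enum implements IConvertible, so this yields the underlying value.
    int level = Convert.ToInt32(t);

    if (level < 0 || level >= levelCount)
    {
      throw new ArgumentOutOfRangeException("t",
          "Enumeration exceeds available recovery levels.");
    }

    return level;
  }
}
```

With three configured levels, MediaRecoveryLevel.RestartHostService maps to index 1. With only two configured levels, MediaRecoveryLevel.SwitchToOffline is rejected, which is the mismatch the parameter validation is there to catch.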

Check Method

The Check method is a means of checking the readiness of the "system" in a controlled manner. I typically call this method from a UI timer event or a worker thread. Both approaches raise concerns (synchronous error checking on the UI thread and asynchronous error checking on a worker thread) that are outside the scope of this article.
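As a rough illustration of the timer-driven approach, the sketch below wraps one call to Check per tick. It assumes Check returns a bool, as the implementation shown later in this article does. RecoveryPoller is a hypothetical helper, not part of the download, and it takes two delegates instead of the service itself so that it stays self-contained.

```csharp
using System;

// Hypothetical helper, not part of the article's download.
public class RecoveryPoller
{
  private readonly Func<bool> check;    // wraps recoveryService.Check
  private readonly Func<bool> isReady;  // wraps recoveryService.Ready
  private bool gaveUp;

  public RecoveryPoller(Func<bool> check, Func<bool> isReady)
  {
    this.check = check;
    this.isReady = isReady;
  }

  // True once Check has reported that nothing is left to try.
  public bool GaveUp { get { return gaveUp; } }

  // Call from a timer tick or a worker loop.
  // Each call results in at most one call to Check.
  public void OnTick()
  {
    if (gaveUp) return;

    bool tryAgain = check();

    // Check returns false both when the system is healthy and when
    // recovery has failed, so Ready is needed to tell the two apart.
    if (!tryAgain && !isReady())
    {
      gaveUp = true;
    }
  }
}
```

With a Windows Forms timer, this could be wired up along the lines of `timer.Tick += delegate { poller.OnTick(); };`, so the timer interval sets how quickly retries and escalations happen.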

CurrentLevel and CurrentRetry Properties

For diagnostic purposes, these two properties have been exposed.

The Recovery Workflow

This is an implementation detail, specific to how I wanted the recovery process to work and addresses the third requirement.

The Check method...

/// <summary>
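/// Check the state and call the appropriate recovery process if in error.
/// </summary>
/// <returns>True if a retry or an escalation is pending and Check should be
/// called again; false if ready or if recovery has failed.</returns>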
public bool Check()
{
  bool tryAgain = false;

  // If in error...
  if (!ready)
  {
    // ...attempt a recovery at the current level.
    AttemptRecovery(recoveryLevels[currentLevel]);

    // If still in error...
    if (!ready)
    {
      // ...retry or escalate.
      tryAgain = RetryOrEscalate(recoveryLevels[currentLevel]);
    }
  }

  return tryAgain;
}

...is designed to be called once for every retry or escalation until the recovery either succeeds or completely fails. Complete failure is indicated by the method returning false while Ready is also false. A return value of false is not enough on its own, because Check also returns false when the system is ready and no recovery is needed. A typical use is to have a worker thread or timer event call the Check method, and that's it. The service takes care of determining whether the system requires recovery and what recovery method to call. The application can determine how often it checks the system state and how quickly it retries/escalates recovery on a failure. This puts the application in control of processor utilization during recovery and helps to prevent the UI from locking up during a lengthy recovery process involving many retries and recovery levels.
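AttemptRecovery and RetryOrEscalate are not listed in this article. The following is one possible sketch of them, written to be consistent with the Check method above and with the attempt, retry and escalation counts asserted in the unit tests below. It is not the code from the download. The delegate signature, the Sketch-prefixed class names and the Raise helper are assumptions, and the manuallySet flag is left out because its use is not described here.

```csharp
using System;
using System.Collections.Generic;

// ASSUMPTION: the article does not show the delegate declaration.
public delegate void RecoveryDlgt(SketchRecoveryService service);

// Compact stand-in for the article's RecoveryInfo container.
public class SketchRecoveryInfo
{
  public readonly RecoveryDlgt RecoveryFnc;
  public readonly int MaxRetries;

  public SketchRecoveryInfo(RecoveryDlgt fnc, int maxRetries)
  {
    RecoveryFnc = fnc;
    MaxRetries = maxRetries;
  }
}

public class SketchRecoveryService
{
  private readonly List<SketchRecoveryInfo> recoveryLevels;
  private bool ready = true;
  private int currentLevel;
  private int retryCount;

  // Event names are taken from the unit test setup in the Usage section.
  public event EventHandler Recovering;
  public event EventHandler RetryingRecovery;
  public event EventHandler RaisingRecoveryLevel;
  public event EventHandler RecoveryFailure;

  public SketchRecoveryService(IEnumerable<SketchRecoveryInfo> levels)
  {
    recoveryLevels = new List<SketchRecoveryInfo>(levels);
  }

  public bool Ready
  {
    get { return ready; }
    set
    {
      if (ready != value)
      {
        ready = value;
        // Entering the failure state restarts at the first level.
        if (!ready) { currentLevel = 0; retryCount = 0; }
      }
    }
  }

  public bool Check()
  {
    bool tryAgain = false;
    if (!ready)
    {
      AttemptRecovery(recoveryLevels[currentLevel]);
      if (!ready)
      {
        tryAgain = RetryOrEscalate(recoveryLevels[currentLevel]);
      }
    }
    return tryAgain;
  }

  // One recovery attempt: announce it, then run this level's method.
  private void AttemptRecovery(SketchRecoveryInfo info)
  {
    Raise(Recovering);
    info.RecoveryFnc(this);
  }

  // Decides what the next Check call will do. Returns false once the
  // last level has used up its retries.
  private bool RetryOrEscalate(SketchRecoveryInfo info)
  {
    if (retryCount < info.MaxRetries)
    {
      retryCount++;
      Raise(RetryingRecovery);
      return true;
    }
    if (currentLevel < recoveryLevels.Count - 1)
    {
      currentLevel++;
      retryCount = 0;
      Raise(RaisingRecoveryLevel);
      return true;
    }
    Raise(RecoveryFailure);
    return false;
  }

  private void Raise(EventHandler handler)
  {
    if (handler != null) handler(this, EventArgs.Empty);
  }
}
```

In this sketch, the call that uses up the last retry of a level is also the call that escalates, so a level configured with 10 retries accounts for exactly 11 calls to Check. With levels of 10, 5 and 0 retries and recovery methods that never succeed, the 18th call to Check is the first to return false.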

Usage

I'm going to illustrate usage via the unit tests. By the way, this is an excellent example of using sequenced unit testing, where each unit test relies on the success and continued state of objects used in the previous unit test.

Initializing The RecoveryInfo List

A RecoveryService instance is designed to handle only one kind of error condition. That is, you would use separate instances to manage different kinds of errors, such as a media error, a server communication error, a data error, a timeout error, etc.

In the setup for the unit test, a three-tier recovery sequence is passed to the recovery service:

[TestFixtureSetUp]
public void TestFixtureSetup()
{
  recoveryService = new RecoveryService(new RecoveryInfo[]
  {
    new RecoveryInfo(Level1, 10),
    new RecoveryInfo(Level2, 5),
    new RecoveryInfo(Level3, 0),
  });

  recoveryService.Recovering += new EventHandler(OnRecovering);
  recoveryService.RetryingRecovery += new EventHandler(OnRetryingRecovery);
  recoveryService.RaisingRecoveryLevel += 
        new EventHandler(OnRaisingRecoveryLevel);
  recoveryService.RecoveryFailure += new EventHandler(OnRecoveryFailure);
}

The first tier attempts 10 retries, the second 5 retries, and the third 0 retries. A word about the word "retry": this is a retry count for the specific tier. Therefore, on failure, there are 11 recovery attempts at the first tier (1 attempt and 10 retries), 6 recovery attempts at the second tier (1 attempt and 5 retries), and 1 recovery attempt at the third tier (1 attempt and no retries).

The event handlers count the number of times the recovery (the attempt + retries) is attempted. In actual usage, you might hook these events to display an informative message to the user, suspend worker threads, etc.
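The handlers themselves are not listed in this article. A minimal sketch of what they might look like follows, assuming plain counters named after the fields that the test assertions check (recoveringCount, retryCount, raiseCount, failure). The RecoveryCounters class is hypothetical; in the actual tests these would more likely be fields and methods of the test fixture itself.

```csharp
using System;

// Hypothetical holder for the counters the unit tests assert on.
public class RecoveryCounters
{
  public int recoveringCount;  // Recovering events seen
  public int retryCount;       // RetryingRecovery events seen
  public int raiseCount;       // RaisingRecoveryLevel events seen
  public bool failure;         // set once RecoveryFailure fires

  // Each handler matches the EventHandler signature used in the setup.
  public void OnRecovering(object sender, EventArgs e)
  {
    recoveringCount++;
  }

  public void OnRetryingRecovery(object sender, EventArgs e)
  {
    retryCount++;
  }

  public void OnRaisingRecoveryLevel(object sender, EventArgs e)
  {
    raiseCount++;
  }

  public void OnRecoveryFailure(object sender, EventArgs e)
  {
    failure = true;
  }
}
```

The recovery methods Level1, Level2 and Level3 would increment l1Count, l2Count and l3Count in the same way and leave the service in its failed state, so that every tier is exercised.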

Calling the Check Method

Since we know that the first tier has been set up with 1 attempt and 10 retries, the unit test can be written as:

[Test, Sequence(0)]
public void Level1RecoveryTest()
{
  for (int i = 0; i < 11; i++)
  {
    recoveryService.Check();
  }

  Assertion.Assert(l1Count == 11, "Expected 1 try and 10 retries. Actual = " + 
       l1Count.ToString());
  Assertion.Assert(recoveringCount == 11, "Expected 11 recovering attempts.");
  Assertion.Assert(retryCount == 10, "Expected 10 retries. Actual = " + 
       retryCount.ToString());
  Assertion.Assert(raiseCount == 1, "Expected level raise.");
}

At the end, we assert that:

there were a total of 11 attempts
the Recovering event fired 11 times
the RetryingRecovery event fired 10 times
the RaisingRecoveryLevel event fired 1 time

We are now at the second tier of recovery:

[Test, Sequence(1)]
public void Level2RecoveryTest()
{
  recoveringCount = 0;
  retryCount = 0;

  for (int i = 0; i < 6; i++)
  {
    recoveryService.Check();
  }

  Assertion.Assert(l2Count == 6, "Expected 1 try and 5 retries. Actual = " + 
       l2Count.ToString());
  Assertion.Assert(recoveringCount == 6, "Expected 6 recovering attempts.");
  Assertion.Assert(retryCount == 5, "Expected 5 retries. Actual = " + 
       retryCount.ToString());
  Assertion.Assert(raiseCount == 2, "Expected level raise.");
}

And finally, the third tier of recovery:

[Test, Sequence(2)]
public void Level3RecoveryTest()
{
  retryCount = 0;
  recoveringCount = 0;
  recoveryService.Check();
  Assertion.Assert(l3Count == 1, "Expected 1 try.");
  Assertion.Assert(recoveringCount == 1, "Expected 1 recovering attempt.");
  Assertion.Assert(retryCount == 0, "Expected 0 retries. Actual = " + 
       retryCount.ToString());
  Assertion.Assert(raiseCount == 2, "Did not expect a level raise.");
  Assertion.Assert(failure, "Expected failure event to fire.");
}

This asserts that:

one try occurred
one recovery event occurred
there were no retries
the recovery service did not escalate
and the failure event fired, indicating that the recovery process failed

About the Download

The download consists of only two files:

RecoveryService.cs
IRecoveryService.cs

Conclusion

The recovery service is really just a specialized state machine, where each recovery level escalates (moves to the next recovery state) after a defined number of retries until the recovery succeeds or a terminal failure condition occurs. However, I find that it's easier to work with as a specialized service rather than using a generic state machine, especially regarding the different events that can be hooked to monitor the recovery process.

History

25th June, 2007: Initial post

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

Written By

Marc Clifton

Architect Interacx

United States

Blog: https://marcclifton.wordpress.com/
Home Page: http://www.marcclifton.com
Research: http://www.higherorderprogramming.com/
GitHub: https://github.com/cliftonm

All my life I have been passionate about architecture / software design, as this is the cornerstone to a maintainable and extensible application. As such, I have enjoyed exploring some crazy ideas and discovering that they are not so crazy after all. I also love writing about my ideas and seeing the community response. As a consultant, I've enjoyed working in a wide range of industries such as aerospace, boatyard management, remote sensing, emergency services / data management, and casino operations. I've done a variety of pro bono work for non-profit organizations related to nature conservancy, drug recovery and women's health.
