Robust C++: Safety Net

Greg Utas

Aug 28, 2019

GPL3

Keeping a program running when it would otherwise abort

Introduction

Some programs need to keep running even after nasty things happen, such as using an invalid pointer. Servers, other multi-user systems, and real-time games are a few examples. This article describes how to write robust C++ software that does not exit when the usual behavior is to abort. It also discusses how to capture information that facilitates debugging when nasty things occur in software that has been released to users.

Background

It is assumed that the reader is familiar with C++ exceptions. However, exceptions are not the only thing that robust software needs to deal with. It must also handle POSIX signals, which the operating system raises when something nasty occurs. The header <csignal> defines the following subset of POSIX signals for C/C++:

SIGINT: interrupt (usually when Ctrl-C is entered)
SIGILL: illegal instruction (perhaps a stack corruption that affected the instruction pointer)
SIGFPE: floating point exception (includes dividing by zero)
SIGSEGV: segmentation violation (using a bad pointer)
SIGTERM: forced termination (usually when the kill command is entered)
SIGABRT: abnormal termination (when abort is invoked by the C++ run-time environment)

Similar to how exceptions are caught by a catch statement, signals are caught by a signal handler. Each thread can register a signal handler against each signal that it wants to handle. A signal is simply an int that is passed to the signal handler as an argument.
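
As a minimal illustration of the paragraph above, the following sketch registers a handler with std::signal and then raises a signal on the calling thread with std::raise. The names lastSignal, OnSignal, and RaiseAndCatch are hypothetical and do not come from RSC; the handler only records the signal number, which is about all that a portable handler may safely do.

```cpp
#include <cassert>
#include <csignal>

//  Hypothetical flag that records the last signal seen by the handler.
//  A portable handler should only write to a volatile std::sig_atomic_t.
volatile std::sig_atomic_t lastSignal = 0;

extern "C" void OnSignal(int sig)
{
   lastSignal = sig;
}

//  Registers OnSignal against SIG, raises SIG on the calling thread, and
//  returns the signal number that the handler recorded.
int RaiseAndCatch(int sig)
{
   lastSignal = 0;
   auto prev = std::signal(sig, OnSignal);
   assert(prev != SIG_ERR);

   std::raise(sig);         // the handler runs before raise returns
   std::signal(sig, prev);  // restore the previous handler
   return lastSignal;
}
```

Calling RaiseAndCatch(SIGINT) should return SIGINT without terminating the program, because the handler replaced the default action.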

Using the Code

The code in this article is taken from the Robust Services Core (RSC). If this is the first time that you're reading an article about an aspect of RSC, please take a few minutes to read this preface.

Compiler options. The approach described in this article requires the following compiler options. They are set in the download's CMake files, but you need to set them in your own project if you are developing code based on this article:

Windows (MSVC or clang): /EHa
Linux (gcc): -fnon-call-exceptions

Overview of the Classes

An application developed using RSC derives from Thread to implement its threads. Everything described in this article then comes for free. This section also describes other classes that collaborate with Thread.

Thread

Software that wants to be continuously available must catch all exceptions. A single-threaded application could do this in main. But RSC supports multi-threading, so it does this in a base Thread class from which all other threads derive. Thread has a loop that invokes the application in a try clause that is followed by a series of catch clauses which handle any exception not caught by the application.
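
The shape of such a loop can be sketched as follows. This is a simplified illustration, not RSC's code (Thread::Start appears later in the article): NetResult, RunUnderSafetyNet, and FlakyWork are hypothetical names, and the policy of giving up after a fixed number of exceptions is an assumption made to keep the example short.

```cpp
#include <cassert>
#include <exception>
#include <functional>
#include <stdexcept>

//  Outcome of running work under the safety net (hypothetical names).
struct NetResult
{
   int traps;       // number of exceptions caught
   bool completed;  // true if the work eventually returned normally
};

//  Invokes WORK in a loop.  Any exception is caught and counted, and the
//  work is reentered unless it has already trapped MAXTRAPS times.
NetResult RunUnderSafetyNet(const std::function<void()>& work, int maxTraps)
{
   assert(maxTraps > 0);
   NetResult result{0, false};

   while(true)
   {
      try
      {
         work();
         result.completed = true;
         return result;
      }
      catch(const std::exception&)
      {
         ++result.traps;  // a real system would log the exception here
      }
      catch(...)
      {
         ++result.traps;  // exception of unknown type
      }

      if(result.traps >= maxTraps) return result;
   }
}

//  Example work: fails on its first two invocations, then succeeds.
int flakyCalls = 0;

void FlakyWork()
{
   if(++flakyCalls <= 2) throw std::runtime_error("simulated failure");
}
```

Running FlakyWork under this net reports two traps and a successful completion, whereas work that always throws is abandoned once maxTraps is reached.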

SysThread

This is a wrapper for a native thread and is created by Thread's constructor. Some of the implementation is platform-specific.

Daemon

When a Thread is created, it can register a Daemon to recreate the thread after it is forced to exit, which usually occurs when the thread has caused too many exceptions.

Exception

The direct use of <exception> is inappropriate in a system that needs to debug problems in released software. Consequently, RSC defines a virtual Exception class from which all of its exceptions derive. This class's primary responsibility is to capture the running thread's stack when an exception occurs. In this way, the entire chain of function calls that led to the exception will be available to assist in debugging. This is far more useful than the const char* returned by std::exception::what, stating something like "invalid string position", which specifies the problem but not where it arose and maybe not even uniquely where it was detected.
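
To show why the capture has to happen in the exception's constructor, here is a portable sketch that does not use any platform stack-walking API. Instead, it assumes that each function records its name in a per-thread list, which is a stand-in for a real stack trace. All of the names (callChain, FnGuard, TracedException) are hypothetical and are not part of RSC.

```cpp
#include <cassert>
#include <exception>
#include <sstream>
#include <string>
#include <vector>

//  Hypothetical per-thread record of the functions currently executing.
thread_local std::vector<const char*> callChain;

//  RAII guard: pushes a function's name on entry and pops it on exit.
struct FnGuard
{
   explicit FnGuard(const char* name) { callChain.push_back(name); }
   ~FnGuard() { assert(!callChain.empty()); callChain.pop_back(); }
};

//  An exception that snapshots the call chain when it is constructed,
//  before stack unwinding has a chance to pop any entries.
class TracedException : public std::exception
{
public:
   explicit TracedException(const char* what) : what_(what)
   {
      std::ostringstream stream;

      for(auto it = callChain.rbegin(); it != callChain.rend(); ++it)
      {
         stream << *it << '\n';
      }

      stack_ = stream.str();
   }

   const char* what() const noexcept override { return what_; }
   const std::string& Stack() const { return stack_; }
private:
   const char* what_;
   std::string stack_;
};

void Inner()
{
   FnGuard guard("Inner");
   throw TracedException("invalid string position");
}

void Outer()
{
   FnGuard guard("Outer");
   Inner();
}

//  Runs Outer and returns the call chain captured by the exception.
std::string CaptureTrace()
{
   try { Outer(); }
   catch(const TracedException& ex) { return ex.Stack(); }
   return std::string();
}
```

By the time the catch clause runs, the guards have already been destroyed and the call chain is empty, so only a snapshot taken at construction time can say where the exception arose.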

SysStackTrace

SysStackTrace is actually a namespace that wraps a handful of functions. The function of most interest is one that actually captures a thread's stack. Exception's constructor invokes this function, and so does a function (Debug::SwLog) whose purpose is to generate a debug log to record a problem that, although unexpected, did not actually result in an exception. All SysStackTrace functions are platform-specific.

SignalException

When a POSIX signal occurs, RSC throws it in a C++ exception so that it can be handled in the usual way, by unwinding the stack and deleting local objects. SignalException, derived from Exception, is used for this purpose. It simply records the signal that occurred and relies on its base class to capture the stack.

PosixSignal

Each signal supported within RSC must create a PosixSignal instance that includes its name (e.g. "SIGSEGV"), numeric value (11), explanation ("Invalid Memory Reference"), and other attributes. The PosixSignal instances for various signals defined by the POSIX standard, including those in <csignal>, are implemented as private members of the simple class SysSignals. The subset of signals supported on the target platform is then instantiated by SysSignals::CreateNativeSignals.
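
A registry of this kind might look roughly like the sketch below. It is not RSC's PosixSignal class: SignalInfo, SignalAttr, CreateSignals, and SignalName are hypothetical names, the attribute set is reduced to three flags, and the explanation given for SIGINT is an assumption.

```cpp
#include <bitset>
#include <cassert>
#include <csignal>
#include <map>
#include <string>

//  Hypothetical attribute flags for a signal.
enum SignalAttr { Native, Break, Final, SignalAttr_N };

struct SignalInfo
{
   std::string name;
   std::string explanation;
   std::bitset<SignalAttr_N> attrs;
};

//  Builds a registry, keyed by signal number, for two native signals.
std::map<int, SignalInfo> CreateSignals()
{
   std::map<int, SignalInfo> reg;

   std::bitset<SignalAttr_N> native;
   native.set(Native);
   reg[SIGSEGV] = SignalInfo{"SIGSEGV", "Invalid Memory Reference", native};

   auto brk = native;
   brk.set(Break);  // assumed: SIGINT is treated as a break signal
   reg[SIGINT] = SignalInfo{"SIGINT", "Terminal Interrupt", brk};

   assert(reg.size() == 2);
   return reg;
}

//  Returns the name registered for SIG, or "unknown".
std::string SignalName(const std::map<int, SignalInfo>& reg, int sig)
{
   auto it = reg.find(sig);
   return (it != reg.end()) ? it->second.name : std::string("unknown");
}
```

Keeping the attributes in one place lets the rest of the code ask questions such as "is this a break signal?" without hard-coding signal numbers.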

Throwing a SignalException turns out to be a useful way to recover from serious errors. RSC therefore defines signals for internal use in NbSignals.h. An instance of PosixSignal is also associated with each of these:

//  The following signals are proprietary and are used to throw a
//  SignalException outside the signal handler.
//
constexpr signal_t SIGNIL = 0;        // nil signal (non-error)
constexpr signal_t SIGWRITE = 121;    // write to protected memory
constexpr signal_t SIGCLOSE = 122;    // exit thread (non-error)
constexpr signal_t SIGYIELD = 123;    // ran unpreemptably too long
constexpr signal_t SIGSTACK1 = 124;   // stack overflow: attempt recovery
constexpr signal_t SIGSTACK2 = 125;   // stack overflow: exit thread
constexpr signal_t SIGPURGE = 126;    // thread killed or suicided
constexpr signal_t SIGDELETED = 127;  // thread unexpectedly deleted
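
The sketch below illustrates the idea of throwing an internal signal from ordinary code rather than from a signal handler. SimpleSignalException is a simplified stand-in for RSC's SignalException (it does not capture the stack), and DoWork and RunWork are hypothetical. The two constants are copied from the list above, with signal_t assumed to be an int.

```cpp
#include <cassert>
#include <exception>

using signal_t = int;  // assumed underlying type

constexpr signal_t SIGNIL = 0;      // nil signal (non-error)
constexpr signal_t SIGYIELD = 123;  // ran unpreemptably too long

//  Simplified stand-in for SignalException: it only records the signal.
class SimpleSignalException : public std::exception
{
public:
   explicit SimpleSignalException(signal_t sig) : signal_(sig) { }
   signal_t GetSignal() const { return signal_; }
   const char* what() const noexcept override { return "signal"; }
private:
   signal_t signal_;
};

//  Hypothetical work function that is cut off after LIMIT iterations,
//  using the same mechanism that a native signal would use.
void DoWork(int iterations, int limit)
{
   for(int i = 0; i < iterations; ++i)
   {
      if(i >= limit) throw SimpleSignalException(SIGYIELD);
   }
}

//  Runs DoWork and returns the signal that ended it (SIGNIL if none).
signal_t RunWork(int iterations, int limit)
{
   assert(limit >= 0);
   try { DoWork(iterations, limit); }
   catch(const SimpleSignalException& ex) { return ex.GetSignal(); }
   return SIGNIL;
}
```

Because the internal signal travels as an exception, the code that catches it can treat it exactly like a native signal such as SIGSEGV.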

Walkthroughs

Creating a Thread

Now for the details. Let's start by creating a Thread. A subclass can add its own thread-specific data—which means that there is no need for thread_local—but we're interested in Thread's constructor:

Thread::Thread(Faction faction, Daemon* daemon) :
   daemon_(daemon),
   faction_(faction)
{
   //  Thread uses the PIMPL idiom, with much of its data in priv_.
   //
   priv_.reset(new ThreadPriv);

   //  Create a new thread. StackUsageLimit is in words, so convert
   //  it to bytes.
   //
   auto prio = FactionToPriority(faction_);
   systhrd_.reset(new SysThread(this, prio,
      ThreadAdmin::StackUsageLimit() << BYTES_PER_WORD_LOG2));
   Singleton<ThreadRegistry>::Instance()->Created(systhrd_.get(), this);
   if(daemon_ != nullptr) daemon_->ThreadCreated(this);
}

This constructor creates an instance of SysThread, which in turn creates a native thread. The native thread is defined by the following attributes, most of which are passed to SysThread's constructor:

the Thread object being constructed (this)
its entry function (EnterThread for all Thread subclasses; it receives this as its argument)
its priority (RSC bases this on a thread's Faction, which is not relevant to this article)
its stack size, defined by the configuration parameter ThreadAdmin::StackUsageLimit

The new thread is then added to ThreadRegistry, which tracks all active threads.

Here is SysThread's constructor:

SysThread::SysThread(Thread* client, Priority prio, size_t size) :
   nid_(NIL_ID),
   nthread_(0),
   priority_(Priority_N),
   signal_(SIGNIL)
{
   Create(client, size);
   SetPriority(prio);
}

This invokes two platform-specific functions (see SysThread.win.cpp if you're interested in the details):

Create creates the native thread. Its platform-specific handle is saved in nthread_, and its thread number is saved in nid_.
SetPriority sets the thread's priority.

Entering a Thread

EnterThread is the entry function for all Thread subclasses.

static unsigned int EnterThread(void* arg)
{
   Debug::ft("NodeBase.EnterThread");

   //  Our argument is a pointer to a Thread.
   //
   auto thread = static_cast<Thread*>(arg);
   return thread->Start();
}

This enters the following, which sets up the safety net before it invokes thread-specific code:

main_t Thread::Start()
{
   auto started = false;

   while(true)
   {
      try
      {
         if(!started)
         {
            //  Immediately register to catch POSIX signals.
            //
            RegisterForSignals();

            //  Indicate that we're ready to run.  This blocks until we're
            //  scheduled in.  At that point, resume execution.
            //
            Ready();
            Resume(Thread_Start);
            started = true;
         }

         //  Perform any environment-specific initialization (and recovery,
         //  if reentering the thread). Exit the thread if this fails.
         //
         auto rc = systhrd_->Start();
         if(rc != 0) return Exit(rc);

         switch(priv_->traps_)
         {
         case 0:
            break;

         case 1:
         {
            //  The thread just trapped. Invoke its virtual Recover function
            //  in case it needs to clean up unfinished work before resuming
            //  execution.  (The full version of this code is more complex
            //  because it handles the case where Recover traps.)
            //
            priv_->traps_ = 0;
            Recover();
            break;
         }

         default:
            //
            //  TrapHandler (which appears later) should have prevented us
            //  from getting here.  Exit the thread.
            //
            return Exit(priv_->signal_);
         }

         //  Invoke the thread's entry function. If this returns,
         //  the thread exited voluntarily.
         //
         Enter();
         return Exit(SIGNIL);
      }

      //  Catch all exceptions. TrapHandler returns one of
      //  o Continue, to resume execution at the top of this loop
      //  o Release, to exit the thread after deleting it
      //  o Return, to exit the thread immediately
      //
      catch(SignalException& sex)
      {
         switch(TrapHandler(&sex, &sex, sex.GetSignal(), sex.Stack()))
         {
         case Continue: continue;
         case Release:  return Exit(sex.GetSignal());
         default:       return AbnormalExit(sex.GetSignal());
         }
      }
      catch(Exception& ex)
      {
         switch(TrapHandler(&ex, &ex, SIGNIL, ex.Stack()))
         {
         case Continue: continue;
         case Release:  return Exit(SIGNIL);
         default:       return AbnormalExit(SIGNIL);
         }
      }
      catch(std::exception& e)
      {
         switch(TrapHandler(nullptr, &e, SIGNIL, nullptr))
         {
         case Continue: continue;
         case Release:  return Exit(SIGNIL);
         default:       return AbnormalExit(SIGNIL);
         }
      }
      catch(...)
      {
         switch(TrapHandler(nullptr, nullptr, SIGNIL, nullptr))
         {
         case Continue: continue;
         case Release:  return Exit(SIGNIL);
         default:       return AbnormalExit(SIGNIL);
         }
      }
   }
}

When first entered, this code invoked RegisterForSignals, which registers SignalHandler against each signal that is native to the underlying platform. Registration is done by invoking signal (in <csignal>). Every thread must do this for each signal that it wants to handle, both when it is first entered and again each time that it receives a signal. This ensures that the thread will receive POSIX signals so that it can recover instead of allowing the program to abort:

void Thread::RegisterForSignals()
{
   auto& signals = Singleton<PosixSignalRegistry>::Instance()->Signals();

   for(auto s = signals.First(); s != nullptr; signals.Next(s))
   {
      if(s->Attrs().test(PosixSignal::Native))
      {
         signal(s->Value(), SignalHandler);
      }
   }
}
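
The need to register again after each signal can be demonstrated with a small, portable sketch. Some platforms restore a signal's default action once it has been delivered, so the handler below registers itself again before doing anything else. The names NativeSignals, CountingHandler, RegisterAll, and RaiseTwice are hypothetical, and the list of signals is an arbitrary choice for the example.

```cpp
#include <cassert>
#include <csignal>

//  Number of signals handled so far (hypothetical counter).
volatile std::sig_atomic_t handled = 0;

//  The signals to handle: an arbitrary choice for this sketch.
const int NativeSignals[] = { SIGINT, SIGTERM };

extern "C" void CountingHandler(int sig);

//  Registers the handler against each signal in the list.
void RegisterAll()
{
   for(int sig : NativeSignals)
   {
      auto prev = std::signal(sig, CountingHandler);
      assert(prev != SIG_ERR);
      (void) prev;
   }
}

extern "C" void CountingHandler(int sig)
{
   //  Register again first, in case the platform restored the default
   //  action when it delivered the signal.
   //
   std::signal(sig, CountingHandler);
   handled = handled + 1;
}

//  Raises SIG twice and returns the number of times it was handled.
int RaiseTwice(int sig)
{
   handled = 0;
   RegisterAll();
   std::raise(sig);
   std::raise(sig);
   return handled;
}
```

Without the call to std::signal inside the handler, the second raise could terminate the program on a platform that resets the handler after each delivery.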

We will look at SignalHandler later. To complete this section, we need to look at Start, which EnterThread invoked.

Each time through its loop, Start began by invoking SysThread::Start, which allows the native thread to perform any work that is required before it can safely run. This is platform-specific code which looks like this on Windows:

signal_t SysThread::Start()
{
   //  This is also invoked when recovering from a trap, so see if a stack
   //  overflow occurred. Some of these are irrecoverable, in which case
   //  returning SIGSTACK2 causes the thread to exit.
   //
   if(status_.test(StackOverflowed))
   {
      if(_resetstkoflw() == 0)
      {
         return SIGSTACK2;
      }

      status_.reset(StackOverflowed);
   }

   //  The translator for Windows structured exceptions must be installed
   //  on a per-thread basis.
   //
   _set_se_translator((_se_translator_function) SE_Handler);
   return 0;
}

The first part of this deals with thread stack overflows, which can be particularly nasty. The last part installs a Windows-specific handler. Windows doesn't normally raise POSIX signals, but instead has what it calls "structured exceptions". We therefore provide SE_Handler, which translates a Windows-specific exception into a POSIX signal that can be thrown using our SignalException. The code for this will appear later.

Exiting a Thread

Exit is normally invoked to exit a thread; this occurs when its Enter function returns or if it is forced to exit after an exception. Exit is only bypassed if a Thread somehow gets deleted while it is still running. In that case, TrapHandler returns Return, which causes the thread to exit immediately, given that it no longer has any objects to delete.

When a Thread object is deleted, its Daemon (if any) is notified so that it can recreate the thread. RSC also tracks mutex ownership, so it releases any mutex that the thread owns. Most operating systems do this anyway, but RSC generates a log to highlight that this occurred. Tracking mutex ownership also allows deadlocks to be debugged as long as the CLI thread is not involved in the deadlock.

main_t Thread::Exit(signal_t sig)
{
   delete this;
   return sig;
}

Thread::~Thread()
{
   //  Other than in very rare situations, the usual path is
   //  to schedule the next thread (via Suspend) and delete
   //  this thread's resources.
   //
   Suspend();
   ReleaseResources();
}

void Thread::ReleaseResources()
{
   //  If the thread has a daemon, tell it that the thread is
   //  exiting.  Remove the thread from the registry and free
   //  its native thread.
   //
   Singleton<ThreadRegistry>::Extant()->Erase(this);
   if(daemon_ != nullptr) daemon_->ThreadDeleted(this);
   systhrd_.reset();
}

Receiving a Windows Structured Exception

As previously mentioned, we register SE_Handler to map each Windows exception to a POSIX signal:

//  Converts a Windows structured exception to a POSIX signal.
//
void SE_Handler(uint32_t errval, const _EXCEPTION_POINTERS* ex)
{
   signal_t sig = 0;

   switch(errval)                         // errval:
   {
   case DBG_CONTROL_C:                    // 0x40010005
      sig = SIGINT;
      break;

   case DBG_CONTROL_BREAK:                // 0x40010008
      sig = SIGBREAK;
      break;

   case STATUS_ACCESS_VIOLATION:          // 0xC0000005
      //
      //  The following returns SIGWRITE instead of SIGSEGV if the exception
      //  occurred when writing to a legal address that was write-protected.
      //
      sig = AccessViolationType(ex);
      break;

   case STATUS_DATATYPE_MISALIGNMENT:     // 0x80000002
   case STATUS_IN_PAGE_ERROR:             // 0xC0000006
   case STATUS_INVALID_HANDLE:            // 0xC0000008
   case STATUS_NO_MEMORY:                 // 0xC0000017
      sig = SIGSEGV;
      break;

   case STATUS_ILLEGAL_INSTRUCTION:       // 0xC000001D
      sig = SIGILL;
      break;

   case STATUS_NONCONTINUABLE_EXCEPTION:  // 0xC0000025
      sig = SIGTERM;
      break;

   case STATUS_INVALID_DISPOSITION:       // 0xC0000026
   case STATUS_ARRAY_BOUNDS_EXCEEDED:     // 0xC000008C
      sig = SIGSEGV;
      break;

   case STATUS_FLOAT_DENORMAL_OPERAND:    // 0xC000008D
   case STATUS_FLOAT_DIVIDE_BY_ZERO:      // 0xC000008E
   case STATUS_FLOAT_INEXACT_RESULT:      // 0xC000008F
   case STATUS_FLOAT_INVALID_OPERATION:   // 0xC0000090
   case STATUS_FLOAT_OVERFLOW:            // 0xC0000091
   case STATUS_FLOAT_STACK_CHECK:         // 0xC0000092
   case STATUS_FLOAT_UNDERFLOW:           // 0xC0000093
   case STATUS_INTEGER_DIVIDE_BY_ZERO:    // 0xC0000094
   case STATUS_INTEGER_OVERFLOW:          // 0xC0000095
      sig = SIGFPE;
      _fpreset();
      break;

   case STATUS_PRIVILEGED_INSTRUCTION:    // 0xC0000096
      sig = SIGILL;
      break;

   case STATUS_STACK_OVERFLOW:            // 0xC00000FD
      //
      //  A stack overflow in Windows now raises the exception
      //  System.StackOverflowException, which cannot be caught.
      //  Stack checking in Thread should therefore be enabled.
      //
      sig = SIGSTACK1;
      break;

   default:
      sig = SIGTERM;
   }

   //  Handle SIG. This usually throws an exception; in any case, it will
   //  not return here. If it does return, there is no specific provision
   //  for reraising a structured exception, so simply return and assume
   //  that Windows will handle it, probably brutally.
   //
   Thread::HandleSignal(sig, errval);
}

Receiving a POSIX Signal

We registered SignalHandler to receive POSIX signals. Even on Windows, with its structured exceptions, this code is reached after invoking raise (in <csignal>):

void Thread::SignalHandler(signal_t sig)
{
   //  Re-register for signals before handling the signal.
   //
   RegisterForSignals();
   if(HandleSignal(sig, 0)) return;

   //  Either trap recovery is off or we received a signal that could not be
   //  associated with a thread. Restore the default handler for the signal
   //  and reraise it (to enter the debugger, for example).
   //
   signal(sig, SIG_DFL);
   raise(sig);
}

Converting a POSIX Signal to a SignalException

Now that we have a POSIX signal which was either received by SignalHandler or translated from a Windows structured exception by SE_Handler, we can turn it into a SignalException:

bool Thread::HandleSignal(signal_t sig, uint32_t code)
{
   auto thr = RunningThread(std::nothrow);

   if(thr != nullptr)
   {
      //  Turn the signal into a standard C++ exception so that it can
      //  be caught and recovery action initiated.
      //
      throw SignalException(sig, code);
   }

   //  The running thread could not be identified. A break signal (e.g.
   //  on ctrl-C) is sometimes delivered on an unregistered thread. If
   //  the RTC timeout is not being enforced and the locked thread has
   //  run too long, trap it; otherwise, assume that the purpose of the
   //  ctrl-C is to trap the CLI thread so that it will abort its work.
   //
   auto reg = Singleton<PosixSignalRegistry>::Instance();

   if(reg->Attrs(sig).test(PosixSignal::Break))
   {
      if(!ThreadAdmin::TrapOnRtcTimeout())
      {
         thr = LockedThread();

         if((thr != nullptr) && (SteadyTime::Now() < thr->priv_->currEnd_))
         {
            thr = nullptr;
         }
      }

      if(thr == nullptr) thr = Singleton<CliThread>::Extant();
      if(thr == nullptr) return false;
      thr->Raise(sig);
      return true;
   }

   return false;
}

The code after the throw requires some explanation. Break signals (SIGINT, SIGBREAK), which are generated when the user enters Ctrl-C or Ctrl-Break, often arrive on an unknown thread. It is reasonable to assume that the user wants to abort work that is taking too long or, worse, stuck in an infinite loop.

But what work should be aborted? Here, it must be pointed out that RSC strongly encourages the use of cooperative scheduling, where a thread runs unpreemptably ("locked") and yields after completing a logical unit of work. RSC only allows one unpreemptable thread to run at a time, and it also enforces a timeout on such a thread's execution. If the thread does not yield before the timeout, it receives the internal signal SIGYIELD, causing a SignalException to be thrown. During development, it is sometimes useful to disable this timeout. So in trying to identify which thread is performing the work that the user wants to abort, the first candidate is the thread that is running unpreemptably. However, this thread will only be interrupted if the use of SIGYIELD has been disabled and the thread has already run for longer than the timeout.

If interrupting the unpreemptable thread doesn't seem appropriate, the assumption is that CliThread should be interrupted. This thread is the one that parses and executes user commands entered through the console. So unless CliThread doesn't exist for some obscure reason, it will receive the break signal.

If a thread to interrupt has now been identified, Thread::Raise is invoked to deliver the signal to that thread.
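
The timeout on an unpreemptable thread, described above, can be sketched as a deadline that is checked at frequent points in the code. This is only an illustration: RunToCompletionTimer, YieldTimeout, and Overruns are hypothetical names, and the time of each check is passed in explicitly so that the behavior does not depend on how fast the code runs.

```cpp
#include <cassert>
#include <chrono>

using Clock = std::chrono::steady_clock;

//  Hypothetical stand-in for SignalException(SIGYIELD).
struct YieldTimeout { };

//  Tracks the time by which an unpreemptable thread must yield.
class RunToCompletionTimer
{
public:
   RunToCompletionTimer(Clock::time_point start, Clock::duration limit) :
      end_(start + limit) { }

   //  Invoked at frequent checkpoints.  Throws if the deadline has passed.
   void Check(Clock::time_point now) const
   {
      if(now >= end_) throw YieldTimeout();
   }
private:
   Clock::time_point end_;
};

//  Returns true if a checkpoint that occurs ELAPSED after the thread
//  started running trips a timer whose limit is LIMIT.
bool Overruns(std::chrono::milliseconds limit,
   std::chrono::milliseconds elapsed)
{
   assert(limit.count() > 0);
   auto start = Clock::now();
   RunToCompletionTimer timer(start, limit);

   try { timer.Check(start + elapsed); }
   catch(const YieldTimeout&) { return true; }
   return false;
}
```

A thread that checks in 5 ms into a 10 ms budget continues, whereas one that checks in after 20 ms is interrupted by the exception.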

Signaling Another Thread

Sending a signal to another thread is problematic. The raise function in <csignal> only signals the running thread. Nor does Windows appear to expose any function that could be used for the purpose. So what to do?

In RSC, the first thing that most functions do is call Debug::ft to identify the function that is now executing. These calls were removed from the code in this article, but now it is necessary to mention them. The original (and still extant) purpose of Debug::ft is to support a function trace tool, which is why most non-trivial functions invoke it. What this trace tool produces will be seen later. The pervasiveness of Debug::ft also allows it to be co-opted for other purposes. Because a thread is likely to invoke it frequently, it can check if the thread has a signal waiting. If so, boom! It can also check if the thread is at risk of overrunning its stack, in which case boom! (This is better than allowing an overrun to occur. As noted in SE_Handler, Windows no longer even allows a stack overflow exception to be intercepted.)

Here is the code that delivers a signal to another thread:

void Thread::Raise(signal_t sig)
{
   Debug::ft(Thread_Raise);

   auto reg = Singleton<PosixSignalRegistry>::Instance();
   auto ps1 = reg->Find(sig);

   //  If this is the running thread, throw the signal immediately. If the
   //  running thread can't be found, don't assert: the signal handler can
   //  invoke this when a signal occurs on an unknown thread.
   //
   auto thr = RunningThread(std::nothrow);

   if(thr == this)
   {
      throw SignalException(sig, 0);
   }

   //  If the signal will force the thread to exit, try to unblock it.
   //  Unblocking usually involves deallocating resources, so force the
   //  thread to sleep if it wakes up during Unblock().
   //
   if(ps1->Attrs().test(PosixSignal::Exit))
   {
      if(priv_->action_ == RunThread)
      {
         priv_->action_ = SleepThread;
         Unblock();
         priv_->action_ = ExitThread;
      }
   }

   SetSignal(sig);
   if(!ps1->Attrs().test(PosixSignal::Delayed)) SetTrap(true);
   if(ps1->Attrs().test(PosixSignal::Interrupt)) Interrupt(Signalled);
}

Given that the target thread can throw a SignalException for itself, via a check supported by Debug::ft, Raise does the following:

invokes SetSignal to record the signal against the thread
invokes Unblock (a virtual function) to unblock the thread if the signal will force it to exit
invokes SetTrap if the signal should be delivered as soon as possible instead of waiting until the next time the thread yields (this sets the flag that is checked via Debug::ft)
invokes Interrupt to wake up the thread if the signal should be delivered now instead of waiting until the thread resumes execution

In the above list, whether to invoke each of the last three functions is determined by various attributes that can be set in the signal's instance of PosixSignal.
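
The essence of this mechanism, in which one thread records a signal and the target thread throws it for itself at its next checkpoint, can be sketched as follows. Target, PendingSignal, and RunSteps are hypothetical names, Checkpoint plays the role that Debug::ft plays in RSC, and the second thread is simulated by a direct call so that the example stays deterministic.

```cpp
#include <atomic>
#include <cassert>

//  Hypothetical exception that carries a deferred signal.
struct PendingSignal { int sig; };

//  Per-thread state that another thread is allowed to write.
class Target
{
public:
   //  Invoked by another thread: records SIG for later delivery.
   void Raise(int sig) { pending_.store(sig); }

   //  Invoked frequently by the owning thread.  Throws if a signal
   //  is waiting, after clearing it.
   void Checkpoint()
   {
      auto sig = pending_.exchange(0);
      if(sig != 0) throw PendingSignal{sig};
   }
private:
   std::atomic<int> pending_{0};
};

//  Runs up to STEPS units of work, with a checkpoint before each one.
//  Returns the number of units completed before a signal was delivered.
int RunSteps(Target& self, int steps, int raiseAt, int sig)
{
   assert(sig != 0);
   int done = 0;

   try
   {
      for(int i = 0; i < steps; ++i)
      {
         if(i == raiseAt) self.Raise(sig);  // simulates another thread
         self.Checkpoint();
         ++done;
      }
   }
   catch(const PendingSignal& ps)
   {
      assert(ps.sig == sig);
   }

   return done;
}
```

The signal is only delivered when the target reaches a checkpoint, which is why the approach relies on checkpoints being pervasive.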

Capturing a Thread's Stack When an Exception Occurs

SignalException derives from Exception (which derives from std::exception). Although Exception is a virtual class, all RSC exceptions derive from it because its constructor captures the running thread's stack by invoking SysStackTrace::Display:

Exception::Exception(bool stack, fn_depth depth) : stack_(nullptr)
{
   //  When capturing the stack, exclude this constructor and those of
   //  our subclasses.
   //
   if(stack)
   {
      stack_.reset(new std::ostringstream);
      if(stack_ == nullptr) return;
      *stack_ << std::boolalpha << std::nouppercase;
      SysStackTrace::Display(*stack_, depth + 1);
   }
}

SignalException simply records the signal and a debug code after telling Exception to capture the stack:

SignalException::SignalException(signal_t sig, debug32_t errval) :
   Exception(true, 1),
   signal_(sig),
   errval_(errval)
{
}

Capturing a thread stack is platform-specific. See SysStackTrace.win.cpp for the Windows targets. Here is an example of its output within an RSC log for a Windows structured exception that got mapped to SIGSEGV. The stack trace is the portion after "Function Traceback":

    THR902 Jun-27-2022 15:16:16.123 on Reigi {3}
    in NodeTools.RecoveryThread (tid=20, nid=0x4eb8): trap number 2
    type=Signal
    signal : 11 (SIGSEGV: Illegal Memory Access)
    errval : 0xc0000005
    Function Traceback:
      NodeBase.Exception.Exception @ Exception.cpp + 53[28]
      NodeBase.SignalException.SignalException @ SignalException.cpp + 38[12]
      NodeBase.Thread.HandleSignal @ Thread.cpp + 1892[27]
      NodeBase.SE_Handler @ SysThread.win.cpp + 147[0]
      _NLG_Return2 @ <unknown file> (err=487)
      _NLG_Return2 @ <unknown file> (err=487)
      _NLG_Return2 @ <unknown file> (err=487)
      _NLG_Return2 @ <unknown file> (err=487)
      _CxxFrameHandler4 @ <unknown file> (err=487)
      __GSHandlerCheck_EH4 @ gshandlereh4.cpp + 86[0]
      _chkstk @ <unknown file> (err=487)
      RtlRestoreContext @ <unknown file> (err=487)
      KiUserExceptionDispatcher @ <unknown file> (err=487)
      NodeBase.Thread.CauseTrap @ Thread.cpp + 1264[5]
      NodeTools.RecoveryThread.UseBadPointer @ NtIncrement.cpp + 3405[0]
      NodeTools.RecoveryThread.Enter @ NtIncrement.cpp + 3304[0]
      NodeBase.Thread.Start @ Thread.cpp + 3124[0]
      NodeBase.EnterThread @ SysThread.win.cpp + 159[0]
      recalloc @ <unknown file> (err=487)
      BaseThreadInitThunk @ <unknown file> (err=487)
      RtlUserThreadStart @ <unknown file> (err=487)

In released software, users can collect these logs and send them to you. Better still, your software can include code to automatically send them to you over the internet. Each of these logs highlights a bug that needs to be fixed.

Recovering from an Exception

The above log was produced by TrapHandler, which was mentioned a long time ago as the function that Thread::Start invokes when it catches an exception:

Thread::TrapAction Thread::TrapHandler(const Exception* ex,
   const std::exception* e, signal_t sig, const std::ostringstream* stack)
{
   try
   {
      //  If this thread object was deleted, exit immediately.
      //
      if(sig == SIGDELETED)
      {
         return Return;
      }

      if(Singleton<Threads>::Instance()->GetState() != Constructed)
      {
         return Return;
      }

      //  The first time in, save the signal.  After that, we're dealing
      //  with a trap during trap recovery:
      //  o On the second trap, log it and force the thread to exit.
      //  o On the third trap, force the thread to exit.
      //  o On the fourth trap, exit without even deleting the thread.
      //    This will leak its memory, which is better than what seems
      //    to be an infinite loop.
      //
      auto retrapped = false;

      switch(++priv_->traps_)
      {
      case 1:
         SetSignal(sig);
         break;
      case 2:
         retrapped = true;
         break;
      case 3:
         return Release;
      default:
         return Return;
      }

      //  Record a stack overflow against the native thread wrapper
      //  for use by SysThread::Start.
      //
      if((sig == SIGSTACK1) && (systhrd_ != nullptr))
      {
         systhrd_->status_.set(SysThread::StackOverflowed);
      }

      auto exceeded = LogTrap(ex, e, sig, stack);

      //  Force the thread to exit if
      //  o it has trapped too many times
      //  o it trapped during trap recovery
      //  o this is a final signal
      // 
      auto sigAttrs = Singleton<PosixSignalRegistry>::Instance()->Attrs(sig);

      if(exceeded || retrapped || sigAttrs.test(PosixSignal::Final))
      {
         return Release;
      }

      //  Resume execution at the top of Start.
      //
      return Continue;
   }

   //  The following catch an exception during trap recovery (a nested
   //  exception) and invoke this function recursively to handle it.
   //
   catch(SignalException& sex)
   {
      switch(TrapHandler(&sex, &sex, sex.GetSignal(), sex.Stack()))
      {
      case Continue:
      case Release:
         return Release;
      default:
         return Return;
      }
   }
   catch(Exception& ex)
   {
      switch(TrapHandler(&ex, &ex, SIGNIL, ex.Stack()))
      {
      case Continue:
      case Release:
         return Release;
      default:
         return Return;
      }
   }
   catch(std::exception& e)
   {
      switch(TrapHandler(nullptr, &e, SIGNIL, nullptr))
      {
      case Continue:
      case Release:
         return Release;
      default:
         return Return;
      }
   }
   catch(...)
   {
      switch(TrapHandler(nullptr, nullptr, SIGNIL, nullptr))
      {
      case Continue:
      case Release:
         return Release;
      default:
         return Return;
      }
   }
}
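
The escalation policy in TrapHandler can be isolated into a small function, which makes it easier to see what happens on each successive trap. This sketch is an assumption-laden simplification: Escalate is a hypothetical name, and it omits logging as well as the check for a thread that has trapped too many times overall.

```cpp
#include <cassert>

enum TrapAction { Continue, Release, Return };

//  Decides what to do after a trap.  TRAPS counts the traps that have
//  occurred since the thread last recovered, and FINAL is true for a
//  signal that always forces the thread to exit.
TrapAction Escalate(int& traps, bool final)
{
   assert(traps >= 0);

   switch(++traps)
   {
   case 1:
      //  First trap: reenter the thread unless the signal is final.
      return final ? Release : Continue;
   case 2:
      //  Trapped during trap recovery: exit after deleting the thread.
      return Release;
   case 3:
      //  Trapped again: still try to delete the thread.
      return Release;
   default:
      //  Give up and exit without deleting the thread.
      return Return;
   }
}
```

The counter is reset to zero once the thread has recovered (in Thread::Start), so the later cases are only reached when traps occur back to back.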

Recreating a Thread

If a thread traps too often, it is forced to exit. But if the thread served an important purpose, there needs to be a way to recreate it.

In Creating a Thread, we saw that a thread could register a Daemon when it was created. And in Exiting a Thread, Daemon::ThreadDeleted was invoked when a thread exited. This function isn't virtual; it is the same for every Daemon:

void Daemon::ThreadDeleted(Thread* thread)
{
   //  This does not immediately recreate the deleted thread.  We only create
   //  threads when invoked by InitThread, which is not the case here.  So we
   //  must ask InitThread to invoke us.  During a restart, however, threads
   //  often exit, so there is no point doing this, and InitThread will soon
   //  invoke our Startup function so that we can create threads.
   //
   auto item = Find(thread);

   if(item != threads_.end())
   {
      threads_.erase(item);
      if(Restart::GetStage() != Running) return;
      Singleton<InitThread>::Instance()->Interrupt(InitThread::Recreate);
   }
}

When InitThread runs, it invokes the following when it sees that it was interrupted to recreate threads:

void InitThread::RecreateThreads()
{
   //  Invoke daemons with missing threads.
   //
   auto& daemons = Singleton<DaemonRegistry>::Instance()->Daemons();

   for(auto d = daemons.First(); d != nullptr; daemons.Next(d))
   {
      if(d->Threads().size() < d->TargetSize())
      {
         d->CreateThreads();
      }
   }

   //  This is reset after the above so that if a trap occurs, we will
   //  again try to recreate threads when reentered.
   //
   Reset(Recreate);
}

And the following finally invokes the virtual function Daemon::CreateThread:

void Daemon::CreateThreads()
{
   switch(traps_)  // initialized to 0 when creating a Daemon
   {
   case 0:
      break;

   case 1:
      //  CreateThread trapped.  Give the subclass a chance to
      //  repair any data before invoking CreateThread again.
      //
      ++traps_;
      Recover();
      --traps_;
      break;

   default:
      //  Either Recover trapped or CreateThread trapped again.
      //  Raise an alarm.
      //
      RaiseAlarm(GetAlarmLevel());
      return;
   }

   //  Try to create new threads to replace those that exited.
   //  Incrementing traps_, and clearing it on success, allows
   //  us to detect traps.
   //
   while(threads_.size() < size_)
   {
      ++traps_;
      auto thread = CreateThread();
      traps_ = 0;

      if(thread == nullptr)
      {
         RaiseAlarm(GetAlarmLevel());
         return;
      }

      threads_.insert(thread);
      ThreadAdmin::Incr(ThreadAdmin::Recreations);
   }

   RaiseAlarm(NoAlarm);
}
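
The trap-counting logic in Daemon::CreateThreads can be tried out in isolation. The following is a minimal sketch, not RSC code: ToyDaemon, FailNext, and Drive are hypothetical names, threads are represented by plain integers, and a trap is simulated by an ordinary C++ exception that the caller catches before reentering CreateThreads.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

//  Hypothetical, simplified stand-in for Daemon.  A "trap" is simulated
//  by an exception thrown from CreateThread.
//
class ToyDaemon
{
public:
   explicit ToyDaemon(size_t size) : size_(size) { }

   //  Makes the next COUNT calls to CreateThread trap.
   //
   void FailNext(int count) { failures_ = count; }

   void CreateThreads()
   {
      switch(traps_)
      {
      case 0:
         break;

      case 1:
         //  CreateThread trapped once.  Attempt recovery before retrying.
         //
         ++traps_;
         Recover();
         --traps_;
         break;

      default:
         //  Recover trapped or CreateThread trapped again.  Give up.
         //
         alarm_ = true;
         return;
      }

      while(threads_.size() < size_)
      {
         ++traps_;                      // remains set if CreateThread traps
         auto thread = CreateThread();
         traps_ = 0;                    // cleared on success
         threads_.push_back(thread);
      }

      alarm_ = false;
   }

   size_t ThreadCount() const { return threads_.size(); }
   int Recoveries() const { return recoveries_; }
   bool Alarm() const { return alarm_; }
private:
   int CreateThread()
   {
      if(failures_ > 0)
      {
         --failures_;
         throw std::runtime_error("simulated trap");
      }
      return nextId_++;
   }

   void Recover() { ++recoveries_; }

   size_t size_;
   std::vector<int> threads_;
   int traps_ = 0;
   int failures_ = 0;
   int recoveries_ = 0;
   int nextId_ = 1;
   bool alarm_ = false;
};

//  Hypothetical driver: reenters CreateThreads after each simulated trap
//  and returns the number of attempts that were made.
//
int Drive(ToyDaemon& daemon, int maxAttempts)
{
   for(int attempt = 1; attempt <= maxAttempts; ++attempt)
   {
      try
      {
         daemon.CreateThreads();
         return attempt;
      }
      catch(const std::exception&)
      {
         //  The trap was caught; loop around to reenter CreateThreads.
      }
   }
   return maxAttempts;
}
```

In this sketch, a single trap leads to one call to Recover followed by a successful retry, whereas two consecutive traps cause the third invocation to raise the alarm without creating any threads.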

Traces of the Code in Action

RSC has 29 tests that focus on exercising this software. Each of them does something nasty to see if the software can handle it without exiting. During these tests, the function trace tool is enabled so that Debug::ft will record all function calls. For the SIGSEGV test, which is associated with the log shown above, the output of the trace tool looks like this. When the tool is on, code slows down by a factor of about 4. When the tool is off, calls to Debug::ft incur very little overhead.

A Destructor Uses a Bad Pointer

A recently added test uses a bad pointer in the destructor of a concrete Thread subclass. This test should have been added long ago; it is an especially good one because an exception in a destructor normally causes a program to abort. Although RSC survives if compiled with Microsoft's C++ compiler, what occurs is interesting. The structured exception (Windows' SIGSEGV equivalent) gets intercepted and thrown as a C++ exception. But this exception is not caught immediately. The C++ runtime code handling the deletion catches the exception itself and continues its work of invoking the destructor chain. This is admirable because it allows the base Thread class to release its resources. Only afterwards does the C++ runtime rethrow the exception, which is finally caught by the safety net in Thread::Start. We now have the unusual situation of a member function running after its object has been deleted. Because Thread::TrapHandler is not virtual, it gets invoked successfully. When it notices that the thread has been deleted, it returns and exits the thread.
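
The way that the destructor chain continues can be approximated with an ordinary C++ exception. The following minimal sketch does not reproduce the structured exception scenario described above. It assumes a destructor that is explicitly declared noexcept(false) and throws, and the names BaseWorker, BadWorker, and DeleteBadWorker are hypothetical stand-ins rather than RSC classes.

```cpp
#include <stdexcept>
#include <string>
#include <vector>

//  Records the order in which things happen.
//
std::vector<std::string> destructionEvents;

struct BaseWorker  // hypothetical stand-in for Thread
{
   ~BaseWorker() { destructionEvents.push_back("~BaseWorker"); }
};

struct BadWorker : BaseWorker  // hypothetical stand-in for a Thread subclass
{
   //  Destructors are noexcept by default, so noexcept(false) is needed
   //  to let the exception escape instead of invoking std::terminate.
   //
   ~BadWorker() noexcept(false)
   {
      destructionEvents.push_back("~BadWorker");
      throw std::runtime_error("simulated bad pointer");
   }
};

//  Deletes a BadWorker and returns true if the exception thrown by its
//  destructor reached this function's catch clause.
//
bool DeleteBadWorker()
{
   auto worker = new BadWorker;

   try
   {
      delete worker;
   }
   catch(const std::exception&)
   {
      destructionEvents.push_back("caught");
      return true;
   }

   return false;
}
```

After DeleteBadWorker returns, destructionEvents shows that the base class destructor ran before the exception was caught, which is the same order of events as in the test described above.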

Points of Interest

It is only forthright to mention that the C++ standard does not support throwing an exception in response to a POSIX signal. In fact, it is undefined behavior for a signal handler to do almost anything in a C++ environment! A list of undefined behaviors appears here; those pertaining to signal handling are numbered 128 through 135. The detailed coding standard available on the same website makes these recommendations about signals:

SIG31-C. Do not access shared objects in signal handlers
SIG34-C. Do not call signal() from within interruptible signal handlers
SIG35-C. Do not return from a computational exception signal handler

Fortunately, much of this is theoretical rather than practical. The main reason that most things related to signal handling are undefined behavior is that different platforms support signals in different ways. Many of the risks that lead to undefined behavior result from race conditions that will rarely occur¹. Regardless, what can you do if your software has to be robust? It's far better to risk undefined behavior than to let your program exit.

The same rationale, of not being able to depend on how the underlying platform does something, does not excuse the standard's adoption of noexcept. Even if the standard allowed an exception to be thrown in response to a signal, that exception could not propagate out of a noexcept function. Even a non-virtual "getter" that simply returns a member's value is now at risk. If such a function is invoked with a bad this pointer, it will add an offset to that pointer and try to read memory. Boom! An ostensibly trivial noexcept function, through no fault of its own, has now caused the invocation of abort when the signal handler throws an exception to recover from the SIGSEGV.

The invocation of abort isn't the end of the world, let alone your program, because your signal handler can turn the SIGABRT into an exception. But now what are we dealing with, abort or an exception? What if the exception isn't "allowed", either because it occurred in a destructor or noexcept function? (Hands up, those of you who have never seen anything nasty happen in a destructor.)

When abort is invoked, the C++ standard says it is implementation dependent whether the stack is unwound in the same way as when an exception is thrown. That is, local objects may not get deleted. So if a function on the stack owns something in a unique_ptr local, it will leak. And if it has wrapped a mutex in a local object whose destructor releases the mutex whenever the function returns, the outcome could be far worse. This is assuming, of course, that your program will be allowed to survive. If it won't, it doesn't really matter.
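
What is at stake can be seen in the case where the stack is unwound. The following is a minimal sketch, assuming that the nasty event surfaces as an ordinary C++ exception that is caught further up the stack. The names resourceLock, Buffer, UpdateResource, and RecoveredCleanly are hypothetical.

```cpp
#include <memory>
#include <mutex>
#include <stdexcept>

std::mutex resourceLock;  // hypothetical lock protecting some shared data
int liveBuffers = 0;      // number of Buffer objects that currently exist

struct Buffer
{
   Buffer() { ++liveBuffers; }
   ~Buffer() { --liveBuffers; }
};

//  Acquires the lock and a buffer through local objects and then "traps".
//
void UpdateResource()
{
   std::lock_guard<std::mutex> guard(resourceLock);
   auto buffer = std::make_unique<Buffer>();
   throw std::runtime_error("simulated trap");
}

//  Returns true if, after the exception was caught, the mutex could be
//  acquired again and the buffer had been freed.
//
bool RecoveredCleanly()
{
   try
   {
      UpdateResource();
   }
   catch(const std::exception&)
   {
      //  Unwinding has already destroyed GUARD and BUFFER.
   }

   if(!resourceLock.try_lock()) return false;
   resourceLock.unlock();
   return (liveBuffers == 0);
}
```

Both cleanups depend entirely on the destructors of the local objects being run. If the stack is not unwound, neither destructor runs, so the buffer leaks and the mutex stays locked.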

Unless your software is shockingly infallible, it will occasionally cause an abort, and your C++ compiler had better allow this to turn into an exception that unwinds the stack in all circumstances. In the end, both your platform and compiler will make it either possible or virtually impossible to deliver robust C++ software.

To summarize, here are some things that the C++ standard should mandate to get serious about robustness:

A signal handler must be able to throw an exception when it receives a signal.
The stack must be unwound if the signal handler throws an exception in response to a SIGABRT.
std::exception's constructor must provide a way to capture debug information, such as a thread's stack, before the stack is unwound.
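
The last item relies on the fact that an exception's constructor runs at the throw point, before any unwinding occurs. The following minimal sketch illustrates this with a simulated call stack. It is not how a real stack would be captured: callStack, Frame, TracedException, and the other names are hypothetical, and real code would use a platform-specific facility to record the actual stack.

```cpp
#include <exception>
#include <string>
#include <vector>

//  Hypothetical record of the functions that are currently executing.
//
std::vector<std::string> callStack;

//  Adds a function to callStack on entry and removes it on exit,
//  including when the stack is unwound.
//
struct Frame
{
   explicit Frame(const char* name) { callStack.push_back(name); }
   ~Frame() { callStack.pop_back(); }
};

//  An exception whose constructor copies callStack.  The copy is made
//  when the exception is constructed, before unwinding destroys any Frame.
//
class TracedException : public std::exception
{
public:
   TracedException() : stack_(callStack) { }
   const char* what() const noexcept override { return "traced exception"; }
   const std::vector<std::string>& Stack() const { return stack_; }
private:
   std::vector<std::string> stack_;
};

void Inner()
{
   Frame frame("Inner");
   throw TracedException();
}

void Outer()
{
   Frame frame("Outer");
   Inner();
}

//  Returns the stack that was captured when the exception was thrown.
//
std::vector<std::string> CatchAndReport()
{
   try
   {
      Outer();
   }
   catch(const TracedException& ex)
   {
      return ex.Stack();
   }

   return { };
}
```

By the time the catch clause runs, callStack is empty because both Frame objects have been destroyed, but the exception still holds the copy that its constructor made.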

The good news is that platform and compiler vendors often make it possible to deliver robust software, despite what the standard fails to mandate.

Notes

¹ In UNIX-like environments, signals other than those discussed in this article are sometimes used as a primitive form of inter-thread communication. This greatly increases the risk of these race conditions and is not recommended here.

History

3rd September, 2020: Add section on recreating a thread
11th August, 2020: Add details about what happens when a thread is exited
27th May, 2020: Describe what happens when an exception occurs in a destructor
28th August, 2019: Initial version