Robust C++: Safety Net
Keeping a program running when it would otherwise abort
Introduction
Some programs need to keep running even after nasty things happen, such as using an invalid pointer. Servers, other multi-user systems, and real-time games are a few examples. This article describes how to write robust C++ software that does not exit when the usual behavior is to abort. It also discusses how to capture information that facilitates debugging when nasty things occur in software that has been released to users.
Background
It is assumed that the reader is familiar with C++ exceptions. However, exceptions are not the only thing that robust software needs to deal with. It must also handle POSIX signals, which the operating system raises when something nasty occurs. The header <csignal> defines the following subset of POSIX signals for C/C++:
- SIGINT: interrupt (usually when Ctrl-C is entered)
- SIGILL: illegal instruction (perhaps a stack corruption that affected the instruction pointer)
- SIGFPE: floating point exception (includes dividing by zero)
- SIGSEGV: segmentation violation (using a bad pointer)
- SIGTERM: forced termination (usually when the kill command is entered)
- SIGABRT: abnormal termination (when abort is invoked by the C++ run-time environment)
Similar to how exceptions are caught by a catch statement, signals are caught by a signal handler. Each thread can register a signal handler against each signal that it wants to handle. A signal is simply an int that is passed to the signal handler as an argument.
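As a minimal sketch of this mechanism (not RSC's code), the following registers a handler with std::signal and triggers it with std::raise. The names last_signal, on_signal, and raise_and_report are hypothetical. The handler does nothing except record the signal in a volatile std::sig_atomic_t, which is one of the few things that the standard allows a handler to do.

```cpp
#include <cassert>
#include <csignal>

// Hypothetical names for illustration; not part of RSC.
volatile std::sig_atomic_t last_signal = 0;

// The handler receives the signal as a plain int.
extern "C" void on_signal(int sig)
{
   last_signal = sig;
}

// Registers on_signal against SIG, raises SIG on the calling thread,
// and returns the value that the handler recorded.
int raise_and_report(int sig)
{
   last_signal = 0;
   auto prev = std::signal(sig, on_signal);
   assert(prev != SIG_ERR);
   std::raise(sig);         // the handler runs before raise returns
   std::signal(sig, prev);  // restore the previous handler
   return last_signal;
}
```

Calling raise_and_report(SIGTERM) should return SIGTERM instead of terminating the program, because the handler intercepted the signal.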
Using the Code
The code in this article is taken from the Robust Services Core (RSC). If this is the first time that you're reading an article about an aspect of RSC, please take a few minutes to read this preface.
Compiler options. The approach described in this article requires the following compiler options. They are set in the download's CMake files, but you need to set them in your own project if you are developing code based on this article:
- Windows (MSVC or clang): /EHa
- Linux (gcc): -fnon-call-exceptions
Overview of the Classes
An application developed using RSC derives from Thread to implement its threads. Everything described in this article then comes for free. This section also describes other classes that collaborate with Thread.
Thread
Software that wants to be continuously available must catch all exceptions. A single-threaded application could do this in main. But RSC supports multi-threading, so it does this in a base Thread class from which all other threads derive. Thread has a loop that invokes the application in a try clause that is followed by a series of catch clauses which handle any exception not caught by the application.
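The shape of such a loop can be sketched in a few lines. This is a simplified illustration, not RSC's Thread class: RunResult, run_with_safety_net, and the max_traps limit are hypothetical names, and a real system would log and recover where the comments indicate.

```cpp
#include <cassert>
#include <exception>
#include <functional>
#include <stdexcept>

// Hypothetical outcome of running a thread body under a safety net.
struct RunResult
{
   int traps;       // exceptions caught before the body returned or we gave up
   bool completed;  // true if the body eventually returned normally
};

// Invokes BODY repeatedly until it returns normally or has thrown
// MAX_TRAPS times. No exception escapes this function.
RunResult run_with_safety_net(const std::function<void()>& body, int max_traps)
{
   assert(max_traps > 0);
   int traps = 0;

   while(traps < max_traps)
   {
      try
      {
         body();
         return {traps, true};
      }
      catch(const std::exception&)
      {
         ++traps;  // a real system would log the exception and recover here
      }
      catch(...)
      {
         ++traps;  // unknown exception type
      }
   }

   return {traps, false};
}
```

A body that fails twice and then succeeds yields two traps and a completed run; a body that always throws is abandoned once the limit is reached.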
SysThread
This is a wrapper for a native thread and is created by Thread
's constructor. Some of the implementation is platform-specific.
Daemon
When a Thread
is created, it can register a Daemon
to recreate the thread after it is forced to exit, which usually occurs when the thread has caused too many exceptions.
Exception
The direct use of <exception> is inappropriate in a system that needs to debug problems in released software. Consequently, RSC defines a virtual Exception class from which all of its exceptions derive. This class's primary responsibility is to capture the running thread's stack when an exception occurs. In this way, the entire chain of function calls that led to the exception will be available to assist in debugging. This is far more useful than the const char* returned by std::exception::what, stating something like "invalid string position", which specifies the problem but not where it arose and maybe not even uniquely where it was detected.
SysStackTrace
SysStackTrace
is actually a namespace that wraps a handful of functions. The function of most interest is one that actually captures a thread's stack. Exception
's constructor invokes this function, and so does a function (Debug::SwLog
) whose purpose is to generate a debug log to record a problem that, although unexpected, did not actually result in an exception. All SysStackTrace
functions are platform-specific.
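To illustrate why the capture has to happen in the exception's constructor, here is a small portable sketch. It does not walk the real stack as SysStackTrace does. Instead, a hypothetical Frame object records each function's name on entry, and a hypothetical TracedException snapshots that chain when it is constructed, before unwinding removes the frames. All of the names are assumptions made for the example.

```cpp
#include <cassert>
#include <exception>
#include <string>
#include <vector>

// Stand-in for a platform-specific stack capture: each function of
// interest declares a Frame, which records its name for this thread.
thread_local std::vector<std::string> call_chain;

struct Frame
{
   explicit Frame(const char* fn) { call_chain.emplace_back(fn); }
   ~Frame() { assert(!call_chain.empty()); call_chain.pop_back(); }
};

// Snapshots the call chain in its constructor, before the stack unwinds.
class TracedException : public std::exception
{
public:
   TracedException()
   {
      for(const auto& fn : call_chain) trace_ += fn + '\n';
   }
   const char* what() const noexcept override { return "TracedException"; }
   const std::string& trace() const { return trace_; }
private:
   std::string trace_;
};

void inner() { Frame f("inner"); throw TracedException(); }
void outer() { Frame f("outer"); inner(); }

// Runs outer() and returns the trace captured when the exception was thrown.
std::string trace_of_failure()
{
   try { outer(); }
   catch(const TracedException& ex) { return ex.trace(); }
   return "";
}
```

By the time the catch clause runs, both Frame objects have been destroyed, yet the exception still carries the chain that led to the throw.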
SignalException
When a POSIX signal occurs, RSC throws it in a C++ exception so that it can be handled in the usual way, by unwinding the stack and deleting local objects. SignalException
, derived from Exception
, is used for this purpose. It simply records the signal that occurred and relies on its base class to capture the stack.
PosixSignal
Each signal supported within RSC must create a PosixSignal instance that includes its name (e.g. "SIGSEGV"), numeric value (11), explanation ("Invalid Memory Reference"), and other attributes. The PosixSignal instances for various signals defined by the POSIX standard, including those in <csignal>, are implemented as private members of the simple class SysSignals. The subset of signals supported on the target platform is then instantiated by SysSignals::CreateNativeSignals.
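A table of signal descriptors of this kind might look like the following sketch. SignalInfo, SignalTable, and make_table are hypothetical names, and the attribute settings are illustrative rather than RSC's actual ones. Only the attribute names Native, Break, and Final appear in the code shown in this article.

```cpp
#include <bitset>
#include <cassert>
#include <csignal>
#include <initializer_list>
#include <map>
#include <string>

// Hypothetical descriptor, loosely modeled on the attributes listed above.
struct SignalInfo
{
   enum Attr { Native, Break, Final, Attr_N };

   std::string name;
   std::string explanation;
   std::bitset<Attr_N> attrs;
};

class SignalTable
{
public:
   void add(int value, const std::string& name, const std::string& expl,
      std::initializer_list<SignalInfo::Attr> attrs)
   {
      SignalInfo info{name, expl, {}};
      for(auto a : attrs)
      {
         assert(a < SignalInfo::Attr_N);
         info.attrs.set(a);
      }
      table_[value] = info;
   }

   // Returns nullptr if VALUE was never registered.
   const SignalInfo* find(int value) const
   {
      auto it = table_.find(value);
      return (it != table_.end() ? &it->second : nullptr);
   }
private:
   std::map<int, SignalInfo> table_;
};

// Illustrative contents only.
SignalTable make_table()
{
   SignalTable t;
   t.add(SIGSEGV, "SIGSEGV", "Invalid Memory Reference", {SignalInfo::Native});
   t.add(SIGINT, "SIGINT", "Interrupt", {SignalInfo::Native, SignalInfo::Break});
   return t;
}
```

Keeping the attributes in a bitset lets code ask questions such as "is this a break signal?" without a switch statement on the signal's value.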
Throwing a SignalException
turns out to be a useful way to recover from serious errors. RSC therefore defines signals for internal use in NbSignals.h. An instance of PosixSignal
is also associated with each of these:
// The following signals are proprietary and are used to throw a
// SignalException outside the signal handler.
//
constexpr signal_t SIGNIL = 0; // nil signal (non-error)
constexpr signal_t SIGWRITE = 121; // write to protected memory
constexpr signal_t SIGCLOSE = 122; // exit thread (non-error)
constexpr signal_t SIGYIELD = 123; // ran unpreemptably too long
constexpr signal_t SIGSTACK1 = 124; // stack overflow: attempt recovery
constexpr signal_t SIGSTACK2 = 125; // stack overflow: exit thread
constexpr signal_t SIGPURGE = 126; // thread killed or suicided
constexpr signal_t SIGDELETED = 127; // thread unexpectedly deleted
Walkthroughs
Creating a Thread
Now for the details. Let's start by creating a Thread
. A subclass can add its own thread-specific data—which means that there is no need for thread_local
—but we're interested in Thread
's constructor:
Thread::Thread(Faction faction, Daemon* daemon) :
daemon_(daemon),
faction_(faction)
{
// Thread uses the PIMPL idiom, with much of its data in priv_.
//
priv_.reset(new ThreadPriv);
// Create a new thread. StackUsageLimit is in words, so convert
// it to bytes.
//
auto prio = FactionToPriority(faction_);
systhrd_.reset(new SysThread(this, prio,
ThreadAdmin::StackUsageLimit() << BYTES_PER_WORD_LOG2));
Singleton<ThreadRegistry>::Instance()->Created(systhrd_.get(), this);
if(daemon_ != nullptr) daemon_->ThreadCreated(this);
}
This constructor creates an instance of SysThread, which in turn creates a native thread. The arguments to SysThread's constructor are the thread's attributes:
- the Thread object being constructed (this)
- its entry function (EnterThread for all Thread subclasses; it receives this as its argument)
- its priority (RSC bases this on a thread's Faction, which is not relevant to this article)
- its stack size, defined by the configuration parameter ThreadAdmin::StackUsageLimit
The new thread is then added to ThreadRegistry, which tracks all active threads.
Here is SysThread
's constructor:
SysThread::SysThread(Thread* client, Priority prio, size_t size) :
nid_(NIL_ID),
nthread_(0),
priority_(Priority_N),
signal_(SIGNIL)
{
Create(client, size);
SetPriority(prio);
}
This has invoked two platform-specific functions (see SysThread.win.cpp if you're interested in the details):
- Create creates the native thread. Its platform-specific handle is saved in nthread_, and its thread number is saved in nid_.
- SetPriority sets the thread's priority.
Entering a Thread
EnterThread
is the entry function for all Thread
subclasses.
static unsigned int EnterThread(void* arg)
{
Debug::ft("NodeBase.EnterThread");
// Our argument is a pointer to a Thread.
//
auto thread = static_cast<Thread*>(arg);
return thread->Start();
}
This enters the following, which sets up the safety net before it invokes thread-specific code:
main_t Thread::Start()
{
auto started = false;
while(true)
{
try
{
if(!started)
{
// Immediately register to catch POSIX signals.
//
RegisterForSignals();
// Indicate that we're ready to run. This blocks until we're
// scheduled in. At that point, resume execution.
//
Ready();
Resume(Thread_Start);
started = true;
}
// Perform any environment-specific initialization (and recovery,
// if reentering the thread). Exit the thread if this fails.
//
auto rc = systhrd_->Start();
if(rc != 0) return Exit(rc);
switch(priv_->traps_)
{
case 0:
break;
case 1:
{
// The thread just trapped. Invoke its virtual Recover function
// in case it needs to clean up unfinished work before resuming
// execution. (The full version of this code is more complex
// because it handles the case where Recover traps.)
//
priv_->traps_ = 0;
Recover();
break;
}
default:
//
// TrapHandler (which appears later) should have prevented us
// from getting here. Exit the thread.
//
return Exit(priv_->signal_);
}
// Invoke the thread's entry function. If this returns,
// the thread exited voluntarily.
//
Enter();
return Exit(SIGNIL);
}
// Catch all exceptions. TrapHandler returns one of
// o Continue, to resume execution at the top of this loop
// o Release, to exit the thread after deleting it
// o Return, to exit the thread immediately
//
catch(SignalException& sex)
{
switch(TrapHandler(&sex, &sex, sex.GetSignal(), sex.Stack()))
{
case Continue: continue;
case Release: return Exit(sex.GetSignal());
default: return AbnormalExit(sex.GetSignal());
}
}
catch(Exception& ex)
{
switch(TrapHandler(&ex, &ex, SIGNIL, ex.Stack()))
{
case Continue: continue;
case Release: return Exit(SIGNIL);
default: return AbnormalExit(SIGNIL);
}
}
catch(std::exception& e)
{
switch(TrapHandler(nullptr, &e, SIGNIL, nullptr))
{
case Continue: continue;
case Release: return Exit(SIGNIL);
default: return AbnormalExit(SIGNIL);
}
}
catch(...)
{
switch(TrapHandler(nullptr, nullptr, SIGNIL, nullptr))
{
case Continue: continue;
case Release: return Exit(SIGNIL);
default: return AbnormalExit(SIGNIL);
}
}
}
}
When first entered, this code invoked RegisterForSignals
, which registers SignalHandler
against each signal that is native to the underlying platform. This is done by invoking signal
(in <csignal>
), which must be done by every thread, for each signal that it wants to handle, when it is first entered and after each time that it receives a signal. This ensures that the thread will receive POSIX signals so that it can recover instead of allowing the program to abort:
void Thread::RegisterForSignals()
{
auto& signals = Singleton<PosixSignalRegistry>::Instance()->Signals();
for(auto s = signals.First(); s != nullptr; signals.Next(s))
{
if(s->Attrs().test(PosixSignal::Native))
{
signal(s->Value(), SignalHandler);
}
}
}
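The need to register again after each signal can be shown with a standalone sketch. It uses only <csignal>; the names native_signals, signals_seen, counting_handler, and register_for_signals are hypothetical, and the handler merely counts signals instead of starting recovery.

```cpp
#include <cassert>
#include <csignal>
#include <cstddef>

// Signals from <csignal> that this sketch treats as "native".
const int native_signals[] =
   {SIGINT, SIGILL, SIGFPE, SIGSEGV, SIGTERM, SIGABRT};

volatile std::sig_atomic_t signals_seen = 0;

extern "C" void counting_handler(int sig)
{
   // Some platforms reset a signal to its default handler when it is
   // delivered, so register again before doing anything else.
   std::signal(sig, counting_handler);
   signals_seen = signals_seen + 1;
}

// Registers counting_handler against each native signal and returns
// the number of successful registrations.
std::size_t register_for_signals()
{
   std::size_t count = 0;

   for(int sig : native_signals)
   {
      if(std::signal(sig, counting_handler) != SIG_ERR) ++count;
   }

   return count;
}
```

Raising the same signal twice should reach the handler both times, whether or not the platform resets the handler on delivery.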
We will look at SignalHandler
later. To complete this section, we need to look at Start
, which EnterThread
invoked.
Each time through its loop, Start
began by invoking SysThread::Start
, which allows the native thread to perform any work that is required before it can safely run. This is platform-specific code which looks like this on Windows:
signal_t SysThread::Start()
{
// This is also invoked when recovering from a trap, so see if a stack
// overflow occurred. Some of these are irrecoverable, in which case
// returning SIGSTACK2 causes the thread to exit.
//
if(status_.test(StackOverflowed))
{
if(_resetstkoflw() == 0)
{
return SIGSTACK2;
}
status_.reset(StackOverflowed);
}
// The translator for Windows structured exceptions must be installed
// on a per-thread basis.
//
_set_se_translator((_se_translator_function) SE_Handler);
return 0;
}
The first part of this deals with thread stack overflows, which can be particularly nasty. The last part installs a Windows-specific handler. Windows doesn't normally raise POSIX signals, but instead has what it calls "structured exceptions". We therefore provide SE_Handler
, which translates a Windows-specific exception into a POSIX signal that can be thrown using our SignalException
. The code for this will appear later.
Exiting a Thread
Exit
is normally invoked to exit a thread; this occurs when its Enter
function returns or if it is forced to exit after an exception. Exit
is only bypassed if a Thread
somehow gets deleted while it is still running. In that case, TrapHandler
returns Return
, which causes the thread to exit immediately, given that it no longer has any objects to delete.
When a Thread
object is deleted, its Daemon
(if any) is notified so that it can recreate the thread. RSC also tracks mutex ownership, so it releases any mutex that the thread owns. Most operating systems do this anyway, but RSC generates a log to highlight that this occurred. Tracking mutex ownership also allows deadlocks to be debugged as long as the CLI thread is not involved in the deadlock.
main_t Thread::Exit(signal_t sig)
{
delete this;
return sig;
}
Thread::~Thread()
{
// Other than in very rare situations, the usual path is
// to schedule the next thread (via Suspend) and delete
// this thread's resources.
//
Suspend();
ReleaseResources();
}
void Thread::ReleaseResources()
{
// If the thread has a daemon, tell it that the thread is
// exiting. Remove the thread from the registry and free
// its native thread.
//
Singleton<ThreadRegistry>::Extant()->Erase(this);
if(daemon_ != nullptr) daemon_->ThreadDeleted(this);
systhrd_.reset();
}
Receiving a Windows Structured Exception
As previously mentioned, we register SE_Handler
to map each Windows exception to a POSIX signal:
// Converts a Windows structured exception to a POSIX signal.
//
void SE_Handler(uint32_t errval, const _EXCEPTION_POINTERS* ex)
{
signal_t sig = 0;
switch(errval) // errval:
{
case DBG_CONTROL_C: // 0x40010005
sig = SIGINT;
break;
case DBG_CONTROL_BREAK: // 0x40010008
sig = SIGBREAK;
break;
case STATUS_ACCESS_VIOLATION: // 0xC0000005
//
// The following returns SIGWRITE instead of SIGSEGV if the exception
// occurred when writing to a legal address that was write-protected.
//
sig = AccessViolationType(ex);
break;
case STATUS_DATATYPE_MISALIGNMENT: // 0x80000002
case STATUS_IN_PAGE_ERROR: // 0xC0000006
case STATUS_INVALID_HANDLE: // 0xC0000008
case STATUS_NO_MEMORY: // 0xC0000017
sig = SIGSEGV;
break;
case STATUS_ILLEGAL_INSTRUCTION: // 0xC000001D
sig = SIGILL;
break;
case STATUS_NONCONTINUABLE_EXCEPTION: // 0xC0000025
sig = SIGTERM;
break;
case STATUS_INVALID_DISPOSITION: // 0xC0000026
case STATUS_ARRAY_BOUNDS_EXCEEDED: // 0xC000008C
sig = SIGSEGV;
break;
case STATUS_FLOAT_DENORMAL_OPERAND: // 0xC000008D
case STATUS_FLOAT_DIVIDE_BY_ZERO: // 0xC000008E
case STATUS_FLOAT_INEXACT_RESULT: // 0xC000008F
case STATUS_FLOAT_INVALID_OPERATION: // 0xC0000090
case STATUS_FLOAT_OVERFLOW: // 0xC0000091
case STATUS_FLOAT_STACK_CHECK: // 0xC0000092
case STATUS_FLOAT_UNDERFLOW: // 0xC0000093
case STATUS_INTEGER_DIVIDE_BY_ZERO: // 0xC0000094
case STATUS_INTEGER_OVERFLOW: // 0xC0000095
sig = SIGFPE;
_fpreset();
break;
case STATUS_PRIVILEGED_INSTRUCTION: // 0xC0000096
sig = SIGILL;
break;
case STATUS_STACK_OVERFLOW: // 0xC00000FD
//
// A stack overflow in Windows now raises the exception
// System.StackOverflowException, which cannot be caught.
// Stack checking in Thread should therefore be enabled.
//
sig = SIGSTACK1;
break;
default:
sig = SIGTERM;
}
// Handle SIG. This usually throws an exception; in any case, it will
// not return here. If it does return, there is no specific provision
// for reraising a structured exception, so simply return and assume
// that Windows will handle it, probably brutally.
//
Thread::HandleSignal(sig, errval);
}
Receiving a POSIX Signal
We registered SignalHandler
to receive POSIX signals. Even on Windows, with its structured exceptions, this code is reached after invoking raise
(in <csignal>
):
void Thread::SignalHandler(signal_t sig)
{
// Re-register for signals before handling the signal.
//
RegisterForSignals();
if(HandleSignal(sig, 0)) return;
// Either trap recovery is off or we received a signal that could not be
// associated with a thread. Restore the default handler for the signal
// and reraise it (to enter the debugger, for example).
//
signal(sig, SIG_DFL);
raise(sig);
}
Converting a POSIX Signal to a SignalException
Now that we have a POSIX signal which was either received by SignalHandler
or translated from a Windows structured exception by SE_Handler
, we can turn it into a SignalException
:
bool Thread::HandleSignal(signal_t sig, uint32_t code)
{
auto thr = RunningThread(std::nothrow);
if(thr != nullptr)
{
// Turn the signal into a standard C++ exception so that it can
// be caught and recovery action initiated.
//
throw SignalException(sig, code);
}
// The running thread could not be identified. A break signal (e.g.
// on ctrl-C) is sometimes delivered on an unregistered thread. If
// the RTC timeout is not being enforced and the locked thread has
// run too long, trap it; otherwise, assume that the purpose of the
// ctrl-C is to trap the CLI thread so that it will abort its work.
//
auto reg = Singleton<PosixSignalRegistry>::Instance();
if(reg->Attrs(sig).test(PosixSignal::Break))
{
if(!ThreadAdmin::TrapOnRtcTimeout())
{
thr = LockedThread();
if((thr != nullptr) && (SteadyTime::Now() < thr->priv_->currEnd_))
{
thr = nullptr;
}
}
if(thr == nullptr) thr = Singleton<CliThread>::Extant();
if(thr == nullptr) return false;
thr->Raise(sig);
return true;
}
return false;
}
The code after the throw
requires some explanation. Break signals (SIGINT
, SIGBREAK
), which are generated when the user enters Ctrl-C or Ctrl-Break, often arrive on an unknown thread. It is reasonable to assume that the user wants to abort work that is taking too long or, worse, stuck in an infinite loop.
But what work should be aborted? Here, it must be pointed out that RSC strongly encourages the use of cooperative scheduling, where a thread runs unpreemptably ("locked") and yields after completing a logical unit of work. RSC only allows one unpreemptable thread to run at a time, and it also enforces a timeout on such a thread's execution. If the thread does not yield before the timeout, it receives the internal signal SIGYIELD
, causing a SignalException
to be thrown. During development, it is sometimes useful to disable this timeout. So in trying to identify which thread is performing the work that the user wants to abort, the first candidate is the thread that is running unpreemptably. However, this thread will only be interrupted if the use of SIGYIELD
has been disabled and the thread has already run for longer than the timeout.
If interrupting the unpreemptable thread doesn't seem appropriate, the assumption is that CliThread should be interrupted. This thread is the one that parses and executes user commands entered through the console. So unless CliThread doesn't exist for some obscure reason, it will receive the break signal.
If a thread to interrupt has now been identified, Thread::Raise
is invoked to deliver the signal to that thread.
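The selection rule described in this section can be restated as a small pure function. This is a sketch under stated assumptions: ThreadInfo, run_deadline, and choose_break_target are hypothetical names that stand in for RSC's LockedThread, currEnd_, and the logic inside HandleSignal.

```cpp
#include <cassert>
#include <chrono>

using Clock = std::chrono::steady_clock;

// Hypothetical stand-in for the per-thread data consulted by HandleSignal.
struct ThreadInfo
{
   Clock::time_point run_deadline;  // when the thread's timeout expires
};

// Returns the thread that should receive a break signal, or nullptr.
//  o LOCKED is the thread running unpreemptably (may be nullptr)
//  o CLI is the thread that executes console commands (may be nullptr)
//  o TIMEOUT_ENFORCED is true if overruns already cause SIGYIELD
const ThreadInfo* choose_break_target(const ThreadInfo* locked,
   const ThreadInfo* cli, bool timeout_enforced, Clock::time_point now)
{
   const ThreadInfo* target = nullptr;

   if(!timeout_enforced && (locked != nullptr) && (now >= locked->run_deadline))
   {
      target = locked;  // it has overrun and nothing else will stop it
   }

   if(target == nullptr) target = cli;
   return target;
}
```

The locked thread is chosen only when the timeout is disabled and the thread has already overrun; in every other case the signal goes to the CLI thread, if there is one.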
Signaling Another Thread
Sending a signal to another thread is problematic. The raise
function in <csignal>
only signals the running thread. Nor does Windows appear to expose any function that could be used for the purpose. So what to do?
In RSC, the first thing that most functions do is call Debug::ft to identify the function that is now executing. Most of these calls were removed from the code in this article, but now it is necessary to mention them. The original (and still extant) purpose of Debug::ft is to support a function trace tool, which is why most non-trivial functions invoke it. What this trace tool produces will be seen later. The pervasiveness of Debug::ft also allows it to be co-opted for other purposes. Because a thread is likely to invoke it frequently, it can check if the thread has a signal waiting. If so, boom! It can also check if the thread is at risk of overrunning its stack, in which case boom! (This is better than allowing an overrun to occur. As noted in SE_Handler, Windows no longer even allows a stack overflow exception to be intercepted.)
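The idea of a frequently invoked function that checks for a waiting signal can be sketched as follows. This is not RSC's Debug::ft: PendingSignal, pending_signal, post_signal, checkpoint, and run_until_signalled are hypothetical names, and a single global slot stands in for what would be per-thread state.

```cpp
#include <atomic>
#include <cassert>
#include <stdexcept>
#include <string>

// Hypothetical exception carrying the signal that was delivered.
class PendingSignal : public std::runtime_error
{
public:
   explicit PendingSignal(int sig) :
      std::runtime_error("signal " + std::to_string(sig)), value_(sig) { }
   int value() const { return value_; }
private:
   int value_;
};

// One slot per thread in a real system; a single slot keeps the sketch short.
std::atomic<int> pending_signal{0};

// Called by another thread to request that the target thread trap.
void post_signal(int sig)
{
   assert(sig != 0);
   pending_signal.store(sig);
}

// Called at the start of most functions. It throws on the target thread
// itself, so the stack unwinds in the usual way.
void checkpoint()
{
   int sig = pending_signal.exchange(0);
   if(sig != 0) throw PendingSignal(sig);
}

// Returns the signal that interrupted the loop, or 0 if it ran to completion.
int run_until_signalled(int iterations)
{
   try
   {
      for(int i = 0; i < iterations; ++i)
      {
         checkpoint();
      }
   }
   catch(const PendingSignal& ps)
   {
      return ps.value();
   }

   return 0;
}
```

Because the exception is thrown by the target thread at a point of its own choosing, no operating system facility for signaling another thread is needed.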
Here is the code that delivers a signal to another thread:
void Thread::Raise(signal_t sig)
{
Debug::ft(Thread_Raise);
auto reg = Singleton<PosixSignalRegistry>::Instance();
auto ps1 = reg->Find(sig);
// If this is the running thread, throw the signal immediately. If the
// running thread can't be found, don't assert: the signal handler can
// invoke this when a signal occurs on an unknown thread.
//
auto thr = RunningThread(std::nothrow);
if(thr == this)
{
throw SignalException(sig, 0);
}
// If the signal will force the thread to exit, try to unblock it.
// Unblocking usually involves deallocating resources, so force the
// thread to sleep if it wakes up during Unblock().
//
if(ps1->Attrs().test(PosixSignal::Exit))
{
if(priv_->action_ == RunThread)
{
priv_->action_ = SleepThread;
Unblock();
priv_->action_ = ExitThread;
}
}
SetSignal(sig);
if(!ps1->Attrs().test(PosixSignal::Delayed)) SetTrap(true);
if(ps1->Attrs().test(PosixSignal::Interrupt)) Interrupt(Signalled);
}
Given that the target thread can throw a SignalException for itself, via a check supported by Debug::ft, Raise does the following:
- invokes SetSignal to record the signal against the thread
- invokes Unblock (a virtual function) to unblock the thread if the signal will force it to exit
- invokes SetTrap if the signal should be delivered as soon as possible instead of waiting until the next time the thread yields (this sets the flag that is checked via Debug::ft)
- invokes Interrupt to wake up the thread if the signal should be delivered now instead of waiting until the thread resumes execution
In the above list, whether to invoke each of the last three functions is determined by various attributes that can be set in the signal's instance of PosixSignal
.
Capturing a Thread's Stack When an Exception Occurs
SignalException
derives from Exception
(which derives from std::exception
). Although Exception
is a virtual class, all RSC exceptions derive from it because its constructor captures the running thread's stack by invoking SysStackTrace::Display
:
Exception::Exception(bool stack, fn_depth depth) : stack_(nullptr)
{
// When capturing the stack, exclude this constructor and those of
// our subclasses.
//
if(stack)
{
stack_.reset(new std::ostringstream);
if(stack_ == nullptr) return;
*stack_ << std::boolalpha << std::nouppercase;
SysStackTrace::Display(*stack_, depth + 1);
}
}
SignalException
simply records the signal and a debug code after telling Exception
to capture the stack:
SignalException::SignalException(signal_t sig, debug32_t errval) :
Exception(true, 1),
signal_(sig),
errval_(errval)
{
}
Capturing a thread stack is platform-specific. See SysStackTrace.win.cpp for the Windows targets. Here is an example of its output within an RSC log for a Windows structured exception that got mapped to SIGSEGV
. The stack trace is the portion after "Function Traceback
":
THR902 Jun-27-2022 15:16:16.123 on Reigi {3}
in NodeTools.RecoveryThread (tid=20, nid=0x4eb8): trap number 2
type=Signal
signal : 11 (SIGSEGV: Illegal Memory Access)
errval : 0xc0000005
Function Traceback:
NodeBase.Exception.Exception @ Exception.cpp + 53[28]
NodeBase.SignalException.SignalException @ SignalException.cpp + 38[12]
NodeBase.Thread.HandleSignal @ Thread.cpp + 1892[27]
NodeBase.SE_Handler @ SysThread.win.cpp + 147[0]
_NLG_Return2 @ <unknown file> (err=487)
_NLG_Return2 @ <unknown file> (err=487)
_NLG_Return2 @ <unknown file> (err=487)
_NLG_Return2 @ <unknown file> (err=487)
_CxxFrameHandler4 @ <unknown file> (err=487)
__GSHandlerCheck_EH4 @ gshandlereh4.cpp + 86[0]
_chkstk @ <unknown file> (err=487)
RtlRestoreContext @ <unknown file> (err=487)
KiUserExceptionDispatcher @ <unknown file> (err=487)
NodeBase.Thread.CauseTrap @ Thread.cpp + 1264[5]
NodeTools.RecoveryThread.UseBadPointer @ NtIncrement.cpp + 3405[0]
NodeTools.RecoveryThread.Enter @ NtIncrement.cpp + 3304[0]
NodeBase.Thread.Start @ Thread.cpp + 3124[0]
NodeBase.EnterThread @ SysThread.win.cpp + 159[0]
recalloc @ <unknown file> (err=487)
BaseThreadInitThunk @ <unknown file> (err=487)
RtlUserThreadStart @ <unknown file> (err=487)
In released software, users can collect these logs and send them to you. Better still, your software can include code to automatically send them to you over the internet. Each of these logs highlights a bug that needs to be fixed.
Recovering from an Exception
The above log was produced by TrapHandler
, which was mentioned a long time ago as the function that Thread::Start
invokes when it catches an exception:
Thread::TrapAction Thread::TrapHandler(const Exception* ex,
const std::exception* e, signal_t sig, const std::ostringstream* stack)
{
try
{
// If this thread object was deleted, exit immediately.
//
if(sig == SIGDELETED)
{
return Return;
}
if(Singleton<Threads>::Instance()->GetState() != Constructed)
{
return Return;
}
// The first time in, save the signal. After that, we're dealing
// with a trap during trap recovery:
// o On the second trap, log it and force the thread to exit.
// o On the third trap, force the thread to exit.
// o On the fourth trap, exit without even deleting the thread.
// This will leak its memory, which is better than what seems
// to be an infinite loop.
//
auto retrapped = false;
switch(++priv_->traps_)
{
case 1:
SetSignal(sig);
break;
case 2:
retrapped = true;
break;
case 3:
return Release;
default:
return Return;
}
// Record a stack overflow against the native thread wrapper
// for use by SysThread::Start.
//
if((sig == SIGSTACK1) && (systhrd_ != nullptr))
{
systhrd_->status_.set(SysThread::StackOverflowed);
}
auto exceeded = LogTrap(ex, e, sig, stack);
// Force the thread to exit if
// o it has trapped too many times
// o it trapped during trap recovery
// o this is a final signal
//
auto sigAttrs = Singleton<PosixSignalRegistry>::Instance()->Attrs(sig);
if(exceeded | retrapped | sigAttrs.test(PosixSignal::Final))
{
return Release;
}
// Resume execution at the top of Start.
//
return Continue;
}
// The following catch an exception during trap recovery (a nested
// exception) and invoke this function recursively to handle it.
//
catch(SignalException& sex)
{
switch(TrapHandler(&sex, &sex, sex.GetSignal(), sex.Stack()))
{
case Continue:
case Release:
return Release;
default:
return Return;
}
}
catch(Exception& ex)
{
switch(TrapHandler(&ex, &ex, SIGNIL, ex.Stack()))
{
case Continue:
case Release:
return Release;
default:
return Return;
}
}
catch(std::exception& e)
{
switch(TrapHandler(nullptr, &e, SIGNIL, nullptr))
{
case Continue:
case Release:
return Release;
default:
return Return;
}
}
catch(...)
{
switch(TrapHandler(nullptr, nullptr, SIGNIL, nullptr))
{
case Continue:
case Release:
return Release;
default:
return Return;
}
}
}
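The escalation policy buried in TrapHandler's comments can be restated as a pure function. This sketch is a simplification with hypothetical names (TrapAction as a scoped enum, escalate); it omits logging, the SIGDELETED check, and the handling of nested exceptions.

```cpp
#include <cassert>

enum class TrapAction
{
   Continue,  // resume at the top of the thread's loop
   Release,   // delete the thread object, then exit the thread
   Return     // exit the thread without deleting anything
};

// Decides what to do after a trap.
//  o TRAPS is the number of traps since the thread last recovered,
//    including this one
//  o EXCEEDED is true if the thread has trapped too often overall
//  o FINAL_SIGNAL is true if the signal always forces the thread to exit
TrapAction escalate(int traps, bool exceeded, bool final_signal)
{
   assert(traps >= 1);

   switch(traps)
   {
   case 1:
      return (exceeded || final_signal ?
         TrapAction::Release : TrapAction::Continue);
   case 2:  // trapped during trap recovery
   case 3:  // trapped again while being forced to exit
      return TrapAction::Release;
   default:
      return TrapAction::Return;  // leak memory rather than loop forever
   }
}
```

Only a first trap with no aggravating condition lets the thread continue; each later trap takes a more drastic exit path.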
Recreating a Thread
If a thread traps too often, it is forced to exit. But if the thread served an important purpose, there needs to be a way to recreate it.
In Creating a Thread, we saw that a thread could register a Daemon when it was created. And in Exiting a Thread, Daemon::ThreadDeleted was notified when a thread exited. This function isn't virtual but is the same for every Daemon:
void Daemon::ThreadDeleted(Thread* thread)
{
// This does not immediately recreate the deleted thread. We only create
// threads when invoked by InitThread, which is not the case here. So we
// must ask InitThread to invoke us. During a restart, however, threads
// often exit, so there is no point doing this, and InitThread will soon
// invoke our Startup function so that we can create threads.
//
auto item = Find(thread);
if(item != threads_.end())
{
threads_.erase(item);
if(Restart::GetStage() != Running) return;
Singleton<InitThread>::Instance()->Interrupt(InitThread::Recreate);
}
}
When InitThread
runs, it invokes the following when it sees that it was interrupted to recreate threads:
void InitThread::RecreateThreads()
{
// Invoke daemons with missing threads.
//
auto& daemons = Singleton<DaemonRegistry>::Instance()->Daemons();
for(auto d = daemons.First(); d != nullptr; daemons.Next(d))
{
if(d->Threads().size() < d->TargetSize())
{
d->CreateThreads();
}
}
// This is reset after the above so that if a trap occurs, we will
// again try to recreate threads when reentered.
//
Reset(Recreate);
}
And the following finally invokes the virtual
function Daemon::CreateThread
:
void Daemon::CreateThreads()
{
switch(traps_) // initialized to 0 when creating a Daemon
{
case 0:
break;
case 1:
// CreateThread trapped. Give the subclass a chance to
// repair any data before invoking CreateThread again.
//
++traps_;
Recover();
--traps_;
break;
default:
// Either Recover trapped or CreateThread trapped again.
// Raise an alarm.
//
RaiseAlarm(GetAlarmLevel());
return;
}
// Try to create new threads to replace those that exited.
// Incrementing traps_, and clearing it on success, allows
// us to detect traps.
//
while(threads_.size() < size_)
{
++traps_;
auto thread = CreateThread();
traps_ = 0;
if(thread == nullptr)
{
RaiseAlarm(GetAlarmLevel());
return;
}
threads_.insert(thread);
ThreadAdmin::Incr(ThreadAdmin::Recreations);
}
RaiseAlarm(NoAlarm);
}
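Stripped of trap detection and alarms, the daemon's job is to top a pool back up to its target size. The following sketch shows only that core; PoolDaemon, worker_exited, and create_workers are hypothetical names, and workers are represented by integer identifiers rather than Thread objects.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical daemon that keeps a pool of workers at a target size.
class PoolDaemon
{
public:
   // FACTORY creates one worker and returns its identifier, or -1 on failure.
   PoolDaemon(std::size_t target, std::function<int()> factory) :
      target_(target), factory_(std::move(factory))
   {
      assert(factory_ != nullptr);
   }

   // Invoked when a worker exits: forget it so that it can be replaced.
   void worker_exited(int id)
   {
      workers_.erase(std::remove(workers_.begin(), workers_.end(), id),
         workers_.end());
   }

   // Creates workers until the pool reaches its target size. Returns
   // false (where a real daemon would raise an alarm) if creation fails.
   bool create_workers()
   {
      while(workers_.size() < target_)
      {
         int id = factory_();
         if(id < 0) return false;
         workers_.push_back(id);
      }

      return true;
   }

   std::size_t size() const { return workers_.size(); }
private:
   std::size_t target_;
   std::function<int()> factory_;
   std::vector<int> workers_;
};
```

After a worker exits, a later call to create_workers restores the pool, which mirrors how InitThread asks each daemon with missing threads to recreate them.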
Traces of the Code in Action
RSC has 29 tests that focus on exercising this software. Each of them does something nasty to see if the software can handle it without exiting. During these tests, the function trace tool is enabled so that Debug::ft
will record all function calls. For the SIGSEGV
test, which is associated with the log shown above, the output of the trace tool looks like this. When the tool is on, code slows down by a factor of about 4x. When the tool is off, calls to Debug::ft
incur very little overhead.
A Destructor Uses a Bad Pointer
A recently added test uses a bad pointer in the destructor of a concrete Thread
subclass. This test should have been added long ago; it is an especially good one because an exception in a destructor normally causes a program to abort. Although RSC survives if compiled with Microsoft's C++ compiler, what occurs is interesting. The structured exception (Windows' SIGSEGV
equivalent) gets intercepted and thrown as a C++ exception. But this exception is not caught immediately. The C++ runtime code handling the deletion catches the exception itself and continues its work of invoking the destructor chain. This is admirable because it allows the base Thread
class to release its resources. Only afterwards does the C++ runtime rethrow the exception, which is finally caught by the safety net in Thread::Start
. We now have the unusual situation of a member function running after its object has been deleted. Because Thread::TrapHandler
is not virtual
, it gets invoked successfully. When it notices that the thread has been deleted, it returns and exits the thread.
Points of Interest
It is only forthright to mention that the C++ standard does not support throwing an exception in response to a POSIX signal. In fact, it is undefined behavior for a signal handler to do almost anything in a C++ environment! A list of undefined behaviors appears here; those pertaining to signal handling are numbered 128 through 135. The detailed coding standard available on the same website makes these recommendations about signals:
- SIG31-C. Do not access shared objects in signal handlers
- SIG34-C. Do not call signal() from within interruptible signal handlers
- SIG35-C. Do not return from a computational exception signal handler
Fortunately, much of this is theoretical rather than practical. The main reason that most things related to signal handling are undefined behavior is that different platforms support signals in different ways. Many of the risks that lead to undefined behavior result from race conditions that will rarely occur (see note 1). Regardless, what can you do if your software has to be robust? It's far better to risk undefined behavior than to let your program exit.
The same rationale, of not being able to depend on how the underlying platform does something, does not excuse the standard's adoption of noexcept. If it were possible to throw an exception in response to a signal, any noexcept function would be unable to do so. Even a non-virtual "getter" that simply returns a member's value is now at risk. If such a function is invoked with a bad this pointer, it will add an offset to that pointer and try to read memory. Boom! An ostensibly trivial noexcept function, through no fault of its own, has now caused the invocation of abort when the signal handler throws an exception to recover from the SIGSEGV.
The invocation of abort
isn't the end of the world, let alone your program, because your signal handler can turn the SIGABRT
into an exception. But now what are we dealing with, abort
or an exception? What if the exception isn't "allowed", either because it occurred in a destructor or noexcept
function? (Hands up, those of you who have never seen anything nasty happen in a destructor.)
When abort is invoked (via std::terminate) because an exception went unhandled or tried to escape a noexcept function, the C++ standard says it is implementation-defined whether the stack is unwound in the same way as when an exception is caught. That is, local objects may not get deleted. So if a function on the stack owns something in a unique_ptr local, it will leak. And if it has wrapped a mutex in a local object whose destructor releases the mutex whenever the function returns, the outcome could be far worse. This is assuming, of course, that your program will be allowed to survive. If it won't, it doesn't really matter.
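What unwinding buys can be seen in a small sketch. All names here (lock_held, live_objects, LockGuard, Tracked, do_work, unwinding_released_everything) are hypothetical, and a boolean flag stands in for a mutex. When the exception is caught, the locals in do_work are destroyed on the way out; if the stack were not unwound, the flag would stay set and the object would never be freed.

```cpp
#include <cassert>
#include <memory>
#include <stdexcept>

// Hypothetical lock flag, standing in for a mutex.
bool lock_held = false;
int live_objects = 0;

struct LockGuard
{
   LockGuard() { assert(!lock_held); lock_held = true; }
   ~LockGuard() { lock_held = false; }
};

struct Tracked
{
   Tracked() { ++live_objects; }
   ~Tracked() { --live_objects; }
};

// Acquires resources in locals and then fails.
void do_work()
{
   LockGuard guard;
   auto owned = std::make_unique<Tracked>();
   throw std::runtime_error("failure while holding resources");
}

// Returns true if do_work's locals were cleaned up by the time its
// exception was caught.
bool unwinding_released_everything()
{
   try
   {
      do_work();
   }
   catch(const std::exception&)
   {
      return (!lock_held && live_objects == 0);
   }

   return false;
}
```

The function returns true only because a matching catch clause exists, which guarantees that the stack is unwound before the handler runs.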
Unless your software is shockingly infallible, it will occasionally cause an abort, and your C++ compiler had better allow this to turn into an exception that unwinds the stack in all circumstances. In the end, both your platform and compiler will make it either possible or virtually impossible to deliver robust C++ software.
To summarize, here are some things that the C++ standard should mandate to get serious about robustness:
- A signal handler must be able to throw an exception when it receives a signal.
- The stack must be unwound if the signal handler throws an exception in response to a SIGABRT.
- std::exception's constructor must provide a way to capture debug information, such as a thread's stack, before the stack is unwound.
The good news is that platform and compiler vendors often make it possible to deliver robust software, despite what the standard fails to mandate.
Notes
1 In UNIX-like environments, signals other than those discussed in this article are sometimes used as a primitive form of inter-thread communication. This greatly increases the risk of these race conditions and is not recommended here.
History
- 3rd September, 2020: Add section on recreating a thread
- 11th August, 2020: Add details about what happens when a thread is exited
- 27th May, 2020: Describe what happens when an exception occurs in a destructor
- 28th August, 2019: Initial version