Debugging Live Systems

Greg Utas

Rate me:

5.00/5 (11 votes)

10 Mar 2022GPL318 min read

31.3K

377

No breakpoints or drooling all over the console!

This article presents logging and trace tools that can be used in live systems. These tools gather information to help identify and reproduce bugs. One of the tools captures function calls and displays them in an indented "Who invoked whom?" format. A new tool is implemented by deriving from TraceRecord, which allows a common framework to manage all of a system's trace tools.

Introduction

During development, setting breakpoints and writing information to the console are probably the two most commonly used debugging techniques. But once software is released and running on a customer's system, these techniques are no longer acceptable. This article describes alternatives that can be used to debug live systems and that are also useful during development. They actually need to be sufficient during development, because they will be the only things available in a live system.

When released software contains a bug, the goal is to reproduce it and develop a fix offline. Sometimes the customer can tell you how to reproduce the problem. But when the customer can't, you need to gather enough data to figure out how to recreate it. There may be times when the customer doesn't even know there is a bug, but your software does. It should then collect the necessary data on its own.

Tracepoint debugging is sometimes available in a live system. Many debuggers support tracepoints as well as breakpoints. When the debugger hits a tracepoint, it collects information and continues without halting. Most often, you tell it to collect the contents of certain data structures. Because this technique is closely related to breakpoint debugging, this article does not discuss it in further detail.

Walkthroughs

The code in this article is taken from the Robust Services Core (RSC). If this is the first time that you're reading an article about an aspect of RSC, please take a few minutes to read this preface.

Software Error Log

RSC produces a software error log when Thread::Start catches an exception. This function is inherited by all threads that derive from RSC's Thread class. Exceptions have significant overhead, so applications should throw them very sparingly and catch them even less often. In the vast majority of cases, an exception represents a bug that needs to be fixed.

When Thread::Start catches an exception, the log that it produces includes a stack trace, as described in Capturing a Thread's Stack When An Exception Occurs. All exceptions in RSC derive from Exception, whose constructor captures the stack when the exception is thrown. This allows Thread::LogTrap to include a stack trace when it generates its log. This log, for example, occurs when using a bad pointer:

THR902 2-Aug-2019 09:45:43.183 on Reigi {5}
in NodeTools.RecoveryTestThread (tid=15, nid=0x0000bc60): trap number 2
type=Signal
signal : 11 (SIGSEGV: Invalid Memory Reference)
errval : 0xc0000005
Function Traceback:
  NodeBase.SignalException.SignalException @ signalexception.cpp + 40[12]
  NodeBase.Thread.HandleSignal @ thread.cpp + 1801[16]
  NodeBase.SE_Handler @ systhread.win.cpp + 126[13]
  _unDNameEx @ <unknown file> (err=487)
  is_exception_typeof @ <unknown file> (err=487)
  is_exception_typeof @ <unknown file> (err=487)
  is_exception_typeof @ <unknown file> (err=487)
  _CxxFrameHandler3 @ <unknown file> (err=487)
  RtlInterlockedCompareExchange64 @ <unknown file> (err=487)
  RtlInterlockedCompareExchange64 @ <unknown file> (err=487)
  KiUserExceptionDispatcher @ <unknown file> (err=487)
  NodeTools.RecoveryTestThread.Enter @ ntincrement.cpp + 3584[0]
  NodeBase.Thread.Start @ thread.cpp + 2799[15]
  NodeBase.Thread.EnterThread @ thread.cpp + 1564[0]
  BaseThreadInitThunk @ <unknown file> (err=487)
  RtlGetAppContainerNamedObjectPath @ <unknown file> (err=487)
  RtlGetAppContainerNamedObjectPath @ <unknown file> (err=487)

The stack must be captured when the exception is thrown, not when it is caught. By the time it's caught, the stack has unwound (in this case all the way to Thread::Start), so there is no longer anything useful to capture. This is why Exception's constructor captures the stack. Contrast this with exceptions thrown in the STL, where all you get is a string like "invalid string position", which says nothing about who caused the problem. The stack trace is vital because it not only identifies where the bug occurred (e.g., in a function that didn't check for a nullptr argument), but also how that code was reached (i.e., from a function that actually provided the bad argument).

When an error is so severe that the work in progress should be aborted, an application should either

invoke the function Debug::Assert, or
throw a SoftwareException or one of its subclasses.

Both of these produce a log that includes a stack trace. RSC handles these exception logs in the same way as all others. It puts them in a log buffer that is soon written to a file and, optionally, to the console. In released software, users can collect these logs and send them to you. Better still, your software can include code to automatically send them to you over the internet. Each of these logs highlights a bug that needs to be fixed.

Capturing a stack is platform specific; the Windows targets are in SysThreadStack.win.cpp. The above log only captured the chain of function calls that led to the exception. A useful enhancement is to also capture each stack frame and the CPU's registers. Within the log, these simply appear as hex values, but an offline tool can then tie into the compiler's object files and interpret the raw hex so that the values of each function's arguments and local variables can be studied at the time of the exception. But this is more platform-specific code.

Software Warning Log

Unlike the software error log, which is produced after a problem has occurred, a software warning log is used when a problem is detected proactively. Application software generates these logs, most often when it receives invalid input. It invokes Debug::SwLog to generate them:

C++

static void SwLog(fn_name_arg func,
   const std::string& errstr, debug64_t errval, bool stack = true);

func identifies the function that is generating the log
errstr explains the reason for the log
errval is a value to assist with debugging (debug64_t is an alias for uint64_t)
stack is usually true so that the log will include a stack trace to assist with debugging

Let's look at an example of a software warning log. The framework for RSC's command line interface (CLI) defines the base class CliParm, from which all CLI parameters ultimately derive. When a parameter registers with the CLI, it must provide a help string so that the CLI's help command can explain each of the parameters that a CLI command expects. We want to limit the length of these strings so that, after they are prefixed with some other information, they will still fit on a console that is 80 characters wide. CliParm's constructor therefore looks like this:

C++

CliParm::CliParm(c_string help, bool opt, c_string tag) :
   help_(help),
   opt_(opt),
   tag_(tag)
{
   Debug::Assert(help_ != nullptr);

   auto size = strlen(help_);
   auto total = ParmWidth + strlen(ParmExplPrefix) + size;

   if(size == 0)
      Debug::SwLog(CliParm_ctor, "help string empty", size);
   else if(total >= COUT_LENGTH_MAX)
      Debug::SwLog
         (CliParm_ctor, "help string too long", total - COUT_LENGTH_MAX + 1);
}

Debug::Assert throws an AssertionException if help is not provided. This would actually cause system initialization to fail, so it won't occur in released software: a developer must provide the value before the code can even be checked in. After that, the code generates a warning log if help is empty or too long to fit on one line of the console. So if we edit NbIncrement.cpp by changing

C++

fixed_string LogsListTextExpl =
   "shows info for all logs or the logs in a specific group";

to this:

C++

fixed_string LogsListTextExpl =
   "shows information for all logs or the logs in a specific group";

The following log then appears during system initialization:

SW900 19-Feb-2020 18:08:29.915 on Reigi {1}
in CliParm.ctor
errval=help string too long  offset=0x3
Function Traceback:
  NodeBase.Debug.GenerateSwLog @ debug.cpp + 102[22]
  NodeBase.Debug.SwLog @ debug.cpp + 214[25]
  NodeBase.CliParm.CliParm @ cliparm.cpp + 65[45]
  NodeBase.CliText.CliText @ clitext.cpp + 42[19]
  NodeBase.LogsListText.LogsListText @ nbincrement.cpp + 1435[25]
  NodeBase.LogsAction.LogsAction @ nbincrement.cpp + 1560[31]
  NodeBase.LogsCommand.LogsCommand @ nbincrement.cpp + 1575[39]
  NodeBase.NbIncrement.NbIncrement @ nbincrement.cpp + 3700[33]
  NodeBase.Singleton<NodeBase.NbIncrement>.Instance @ singleton.h + 81[34]
  NodeBase.NbModule.Startup @ nbmodule.cpp + 149[5]
  NodeBase.ModuleRegistry.Startup @ moduleregistry.cpp + 434[19]
  NodeBase.ModuleRegistry.Restart @ moduleregistry.cpp + 263[22]
  NodeBase.InitThread.InitializeSystem @ initthread.cpp + 354[0]
  NodeBase.InitThread.Enter @ initthread.cpp + 225[0]
  NodeBase.Thread.Start @ thread.cpp + 3232[15]
  NodeBase.Thread.EnterThread @ thread.cpp + 1755[0]
  BaseThreadInitThunk @ <unknown file> (err=487)
  RtlGetAppContainerNamedObjectPath @ <unknown file> (err=487)
  RtlGetAppContainerNamedObjectPath @ <unknown file> (err=487)

The log complains that the help string is too long, and the culprit is easily identified as a parameter used by LogsCommand. Looking at the code, 0x3 is the number of characters that need to be trimmed from the string.

The 225K lines of code in RSC's repository include over 700 calls to Debug::SwLog. It should be used often! But as with the software error log, each occurrence of a software warning log highlights a bug that needs to be fixed.

Object Dump

Some threads service a queue of work items one at a time. When an error occurs during one of the thread's transactions, a dump containing all the objects involved in the transaction can be very helpful during debugging. Logging objects takes a fair bit of time, so it is usually done only in conjunction with a software error log.

The following design approach supports an object dump during a software error scenario:

The application thread must track the object(s) involved in its current transaction.
The base Thread class must catch all exceptions, as described in Robust C++: Safety Net.
During its handling of an exception, Thread must invoke a virtual function that allows the application thread to generate a log containing the objects involved in the transaction.
Each object involved in the transaction must implement a virtual Display function (unless you're happy to decrypt a hex dump or write an offline tool that converts hex to symbolic values).

Here is a walkthrough of an object dump that follows a software error log. The initial log occurs during a testcase that prompts the telephony application included with RSC to throw an exception when it receives a message indicating that the far end has answered the call. InvokerThread is the thread that is scheduled to handle transactions for the telephony application:

THR902 2-Feb-2020 13:04:23.037 on Reigi {3}
in SessionBase.InvokerThread (tid=11, nid=0x000089b0): trap number 1
type=SessionBase Exception
errstr : trap recovery test
offset : 0x0
context : 000001D15B508178
Function Traceback:
  SessionBase.SbException.SbException @ context.cpp + 113[25]
  SessionBase.Context.Kill @ context.cpp + 608[26]
  CallBase.BcOaAnalyzeRemoteMessage.ProcessEvent @ bchandlers.cpp + 131[30]
  SessionBase.SbAnalyzeMessage.ProcessEvent @ sbhandlers.cpp + 74[40]
  SessionBase.ServiceSM.ProcessEvent @ servicesm.cpp + 559[46]
  SessionBase.SsmContext.ProcessIcMsg @ ssmcontext.cpp + 211[30]
  SessionBase.Context.ProcessMsg @ context.cpp + 700[0]
  SessionBase.Context.ProcessWork @ context.cpp + 773[40]
  SessionBase.InvokerPool.ProcessWork @ invokerpool.cpp + 554[0]
  SessionBase.InvokerThread.Enter @ invokerthread.cpp + 176[0]
  NodeBase.Thread.Start @ thread.cpp + 3235[0]
  NodeBase.Thread.EnterThread @ thread.cpp + 1755[0]
  BaseThreadInitThunk @ <unknown file> (err=487)
  RtlUserThreadStart @ <unknown file> (err=487)

The log that provides the object dump drones on and on, so you may want to hit the "Shrink" button. The log displays 9 objects:

the processing context (SsmContext) that aggregates the objects involved in the call;
a protocol state machine (PotsCallPsm) for messages to/from the user's telephone;
a protocol state machine (CipPsm) for messages to/from the far end of the call;
the message (CipMessage) from the far end of the call, along with its contents (SbIpBuffer);
two ports (MsgPort) through which the protocol state machines receive and send messages;
the service state machine for a basic call (PotsBcSsm); and
the event that the state machine is currently handling (AnalyzeMsgEvent).

SESS900 2-Feb-2020 13:04:23.037 on Reigi {4}
SessionBase.SsmContext
this : 000001D15B508178
patchArea : 0x0
pid      : 6
seq      : 1
link     : 0000000000000000
assigned : true
orphaned : 0
corrupt  : false
logged   : false
whichq  : 0000000000000000
link    :
  next : 0000000000000000
  prev : 0000000000000000
priMsgq :
  tail : 0000000000000000
  diff : 16
stdMsgq :
  tail : 0000000000000000
  diff : 16
enqTime : 3505468504310
pool    : 000001D15B0F4AD0
thread  : 000001D15C8113C8
faction : 5
state   : 3
prio    : 2
traceOn : true
trans   : 000001D15C8DBB00
trace : 4,9> 4,9> 3,2> 4,9>
portq :
  tail : 000001D15D4E02D0
  diff : 16
  [->]: 000001D15D4E0218 SessionBase.MsgPort
  [->]: 000001D15D4E02C0 SessionBase.MsgPort
psmq  :
  tail : 000001D15DC126B0
  diff : 16
  [->]: 000001D15DC12438 PotsBase.PotsCallPsm
  [->]: 000001D15DC126A0 CallBase.CipPsm
root : 000001D15EB2A3F8
--------------------------------------------------------
PotsBase.PotsCallPsm
this : 000001D15DC12438
patchArea : 0x0
pid      : 9
seq      : 1
link     : 000001D15DC126B0
assigned : true
orphaned : 0
corrupt  : false
logged   : false
ctx   : 000001D15B508178
upper : 0000000000000000
lower : 000001D15D4E0218
rcvdMsgq :
  tail : 0000000000000000
  diff : 16
ogMsgq   :
  tail : 0000000000000000
  diff : 16
sentMsgq :
  tail : 0000000000000000
  diff : 16
timerq   :
  tail : 0000000000000000
  diff : 16
fid      : 7
state    : 1
ogMsg        : 0000000000000000
sendRing     : false
sendScan     : false
sendCause    : false
sendFacility : false
header :
  signal : 0 (nullptr)
  port   : 11 (POTS 20000)
ring :
  on : false
scan :
  digits : false
  flash  : false
cause :
  cause : 0 (Invalid Cause)
facility :
  sid : 0 (nullptr)
  ind : 0
--------------------------------------------------------
CallBase.CipPsm
this : 000001D15DC126A0
patchArea : 0x0
pid      : 9
seq      : 1
link     : 000001D15DC12448
assigned : true
orphaned : 0
corrupt  : false
logged   : false
ctx   : 000001D15B508178
upper : 0000000000000000
lower : 000001D15D4E02C0
rcvdMsgq :
  tail : 000001D15D8F4288
  diff : 16
  [->]: 000001D15D8F4278 CallBase.CipMessage
ogMsgq   :
  tail : 0000000000000000
  diff : 16
sentMsgq :
  tail : 0000000000000000
  diff : 16
timerq   :
  tail : 0000000000000000
  diff : 16
fid      : 2
state    : 8
edge    : false
ogPsm   : 000001D15DC12438
ogTone  : 255
icTone  : 255
icMedia :
  rxFrom : 12 (POTS 20001)
ogMediaSent :
  rxFrom : 11 (POTS 20000)
ogMediaCurr :
  rxFrom : 11 (POTS 20000)
mep : 0000000000000000
iamTimer : false
--------------------------------------------------------
CallBase.CipMessage
this : 000001D15D8F4278
patchArea : 0x0
pid      : 7
seq      : 1
link     : 000001D15D8F4288
assigned : true
orphaned : 0
corrupt  : false
logged   : false
buff    : 000001D15B0CBE78
bt      : 0000000000000000
handled : false
saves   : 0
psm     : 000001D15DC126A0
whichq  : 000001D15DC126D8
--------------------------------------------------------
SessionBase.SbIpBuffer
this : 000001D15B0CBE78
patchArea : 0x0
pid      : 4
seq      : 1
link     : 0000000000000000
assigned : true
orphaned : 0
corrupt  : false
logged   : false
rxTicks : 3505468487443
buff     : 000001D15ED27A90
hdrSize  : 24
buffSize : 42
txAddr   : 192.168.0.10: 40000
rxAddr   : 192.168.0.10: 40000
dir      : 0
external : false
queued   : false
length   : 8
  00: 03 00 00 00 01 08 03 00 - 02 00 00 00 01 08 02 00  ................
  01: 02 80 03 00 04 00 08 00 - 00 00 02 00 0c 00 0b 00  .€..............
MsgHeader:
  txAddr : bid=3  seq=1  pid=8  fid=3
  rxAddr : bid=2  seq=1  pid=8  fid=2
  priority=2  initial=0  final=0  join=0  self=0
    injected=0  kill=0  route=2  length=8  spare=0x0
  protocol=3 (CipProtocol)  signal=4 (CipAnmSignal)
Parameters:
  Deleted parameter: pid=0, length=2
--------------------------------------------------------
SessionBase.MsgPort
this : 000001D15D4E0218
patchArea : 0x0
pid      : 8
seq      : 1
link     : 000001D15D4E02D0
assigned : true
orphaned : 0
corrupt  : false
logged   : false
ctx   : 000001D15B508178
upper : 000001D15DC12438
lower : 0000000000000000
locAddr :
  SessionBase.GlobalAddress
  this : 000001D15D4E0250
  patchArea : 0x0
  v4Addr : 192.168.0.10: 40002
  port   : 40002
  proto  : Any
  socket : 0000000000000000
  sbAddr : bid=1  seq=1  pid=8  fid=7
remAddr :
  SessionBase.GlobalAddress
  this : 000001D15D4E0280
  patchArea : 0x0
  v4Addr : 192.168.0.10: 40001
  port   : 40001
  proto  : Any
  socket : 0000000000000000
  sbAddr : bid=0  seq=0  pid=0  fid=6
msgRcvd : true
msgSent : true
--------------------------------------------------------
SessionBase.MsgPort
this : 000001D15D4E02C0
patchArea : 0x0
pid      : 8
seq      : 1
link     : 000001D15D4E0228
assigned : true
orphaned : 0
corrupt  : false
logged   : false
ctx   : 000001D15B508178
upper : 000001D15DC126A0
lower : 0000000000000000
locAddr :
  SessionBase.GlobalAddress
  this : 000001D15D4E02F8
  patchArea : 0x0
  v4Addr : 192.168.0.10: 40000
  port   : 40000
  proto  : Any
  socket : 0000000000000000
  sbAddr : bid=2  seq=1  pid=8  fid=2
remAddr :
  SessionBase.GlobalAddress
  this : 000001D15D4E0328
  patchArea : 0x0
  v4Addr : 192.168.0.10: 40000
  port   : 40000
  proto  : Any
  socket : 0000000000000000
  sbAddr : bid=3  seq=1  pid=8  fid=3
msgRcvd : true
msgSent : true
--------------------------------------------------------
PotsBase.PotsBcSsm
this : 000001D15EB2A3F8
patchArea : 0x0
pid      : 12
seq      : 1
link     : 0000000000000000
assigned : true
orphaned : 0
corrupt  : false
logged   : false
sid       : 2
currState : 8
nextState : 8
idled     : false
nextSap   : 0
nextSnp   : 0
triggered :
  1 2 4 5 7 8 9 12 13 15 16 18 19 20 23 24
ssmq :
  tail : 0000000000000000
  diff : 16
parentSsm : 0000000000000000
eventq[Active] :
  tail : 000001D15ECA0BF8
  diff : 16
  [->]: 000001D15ECA0BE8 SessionBase.AnalyzeMsgEvent
eventq[Pending] :
  tail : 0000000000000000
  diff : 16
eventq[Saved] :
  tail : 0000000000000000
  diff : 16
ctx : 000001D15B508178
mgwPsm : 0000000000000000
model : 1
upsm  : 000001D15DC12438
npsm  : 000001D15DC126A0
dialed :
  count  : 5
  digits : 20001
analysis :
  selector   : 1 (Directory Number)
  identifier : 20001
route :
  selector   : 7 (PotsCallFactory)
  identifier : 20001
proxyCount : 0
prof : 000001D15B162310
tid  : 0
trmt : 0000000000000000
--------------------------------------------------------
SessionBase.AnalyzeMsgEvent
this : 000001D15ECA0BE8
patchArea : 0x0
pid      : 11
seq      : 1
link     : 000001D15ECA0BF8
assigned : true
orphaned : 0
corrupt  : false
logged   : false
eid      : 1
owner    : 000001D15EB2A3F8
location : 0
msg : 000001D15D8F4278

The object dump was generated when Thread::Start invoked InvokerThread::Recover while handling the exception. InvokerThread tracks the running context in its ctx_ member. After it dumps the context and its objects, it deletes them:

C++

bool InvokerThread::Recover()
{
   if(ctx_ != nullptr)
   {
      ctx_->Dump();

      //  The implementation of unique_ptr.reset may be equivalent to this:
      //    item = this->ptr_;
      //    this->ptr_ = nullptr;
      //    delete item;
      //  That is, the unique_ptr does not point to the item being deleted while
      //  its destructor is executing.  Consequently, Context::RunningContext,
      //  which depends on our ctx_ field, cannot find the running context (the
      //  one being deleted).  So instead of simply invoking ctx_.reset, we do
      //  the following so that ctx_ will be valid during deletion:
      //
      auto ctx = ctx_.get();
      delete ctx;
      ctx_.release();
   }

   return true;
}

Context::Dump creates the log and uses Base::LogSubtended to dump all of the objects:

C++

void Context::Dump() const
{
   auto log = Log::Create(SessionLogGroup, SessionError);
   if(log == nullptr) return;
   LogSubtended(*log, Log::Tab, NoFlags);
   Log::Submit(log);
}

Base::LogSubtended uses Base::GetSubtended to populate an array with a root object and all of the objects that it transitively owns, after which it writes them to the log:

C++

void Base::LogSubtended(ostream& stream,
   const string& prefix, const Flags& options) const
{
   std::vector<Base*> objects;

   GetSubtended(objects);

   for(size_t i = 0; i < objects.size(); ++i)
   {
      if(i > 0) stream << prefix << string(60 - prefix.size(), '-') << CRLF;
      objects[i]->Display(stream, prefix, options);
   }
}

A class whose objects own other objects overrides Base::GetSubtended. It first invokes the base class version to add itself to the array and then invokes GetSubtended on the objects that it owns, so that they can also add themselves and their objects to the array:

C++

void Base::GetSubtended(std::vector<Base*>& objects) const
{
   objects.push_back(const_cast<Base*>(this));
}

Trace Buffer

The debugging techniques that we've looked at so far—software error logs, software warning logs, and object dumps—gather information passively. But when they prove inadequate, accessing the customer's system to enable trace tools is the next step. Although trace tools could be enabled by default, their overhead in a live system is usually prohibitive when used outside of specific scenarios, so they are used only when they might help to uncover how a bug occurs. A packet sniffer is an example of a trace tool that may be familiar to you.

To support trace tools, RSC provides a framework whose key classes are

TraceRecord, a virtual base class for defining events that can be recorded, and
TraceBuffer, for saving recorded events.

TraceBuffer was recently rewritten to be non-blocking. It's a simple circular buffer that contains pointers to a series of TraceRecords. After an application constructs a subclass of TraceRecord , it invokes TraceBuffer::Insert to add the record to the buffer. By default, a TraceRecord is deleted when the buffer is full. However, wraparound can be enabled so that newer records overwrite older ones.

C++

bool TraceBuffer::Insert(TraceRecord* record)
{
   //  Delete the record if no slot is available.
   //
   auto slot = AllocSlot();

   if(slot == SIZE_MAX)
   {
      delete record;
      return false;
   }

   //  If wraparound is allowed, there could already be a record
   //  in the slot.  If so, delete it.
   //
   auto prev = buff_[slot];

   if(prev != nullptr)
   {
      buff_[slot] = nullptr;
      delete prev;
   }

   record->slot_ = slot;
   buff_[slot] = record;
   return true;
}

It is TraceBuffer::AllocSlot that makes tracing non-blocking and thread-safe by incrementing the index bnext_ atomically when allocating the next slot in the buffer:

C++

size_t TraceBuffer::AllocSlot()
{
   //  This fails if
   //  o the buffer is not allocated
   //  o the buffer is locked
   //  o the buffer is full and wraparound is not enabled
   //
   if(buff_ == nullptr) return SIZE_MAX;

   if(softLocks_ > 0)
   {
      ++blocks_;
      return SIZE_MAX;
   }

   auto slot = bnext_.fetch_add(1);

   if(bnext_ >= size_)
   {
      ovfl_ = true;
      if(wrap_) return (slot & (size_ - 1));
      bnext_ = size_;
      return SIZE_MAX;
   }

   return slot;
}

The buffer can be locked (softLocks_), but this is only done by code that is iterating over the buffer's records, something that never occurs while an application that needs to be traced is running.

To control tracing, RSC's CLI provides the commands tools, query, set, include, exclude, start, stop, save, and clear. These are documented, along with all other CLI commands, in the file docs/help.cli. The include and exclude commands allow tracing to be restricted to specific threads, IP ports, IP addresses, protocols, or services, so that time isn't wasted constructing irrelevant trace records that quickly fill up the trace buffer. Although this feature is important even during development, it is essential before trace tools can be used in a live system.

The virtual base class TraceRecord is quite simple and contains three data members:

The record's owner, which identifies a specific trace tool.
The record's slot (index) once it has been added to the trace buffer.
An identifier whose interpretation is tool-specific and which allows the tool to use switch instead of dynamic_cast when it displays different types of events that it has collected.

Function Tracer

RSC's function tracer is one example of a trace tool. It's a good example because it's very general-purpose. And it's a bad example because it slows the system down by a factor of about 4x if used indiscriminately. But if used with suitable filtering or during off-peak times, it's a candidate for use in a live system. And during development, it can be a very useful tool indeed.

The interface to the function tracer is Debug::ft, which most RSC functions invoke in their first statement. These calls were edited out of the code previously shown in this article, but if we went back to the first code example and looked at the full version of CliParm's constructor, we would see this:

C++

fn_name CliParm_ctor = "CliParm.ctor";

CliParm::CliParm(c_string help, bool opt, c_string tag) :
   help_(help),
   opt_(opt),
   tag_(tag)
{
   Debug::ft(CliParm_ctor);  // followed by the code previously shown
}

The call to Debug::ft simply indicates that a function was invoked:

C++

void Debug::ft(fn_name_arg func)
{
   if(FcFlags_.none()) return;
   Thread::FunctionInvoked(func);
}

Simple. If no one cares about function calls, do nothing. Otherwise, call Thread::FunctionInvoked.

Who cares that a function was invoked?

The function trace tool, which records a sequence of function calls.
The Thread class, when it needs to deliver a signal to the thread that is now running.
The Thread class, when it wants to check if the running thread is at risk of overrunning its stack.

The ubiquity of Debug::ft allows it to support the latter two capabilities, which would otherwise be impossible to implement. Which brings us to this:

C++

void Thread::FunctionInvoked(fn_name_arg func)
{
   Thread* thr = nullptr;

   //  This handles the following:
   //  (a) Adding FUNC to a trace.
   //  (b) Causing a trap after a thread is scheduled in.
   //  (c) Causing a trap before a thread overflows its stack.
   //
   if(Debug::FcFlags_.test(Debug::TracingActive))
   {
      auto& lock = AccessFtLock();
      if(!lock.test_and_set())
      {
         if(TraceRunningThread(thr))
            FunctionTrace::Capture(func);
         lock.clear();
      }
   }

   if(Debug::FcFlags_.test(Debug::TrapPending))
   {
      if(thr == nullptr) thr = RunningThread(std::nothrow);
      if(thr == nullptr) return;
      thr->TrapCheck();
   }

   if(Debug::FcFlags_.test(Debug::StackChecking))
   {
      if(StackCheckCounter_ <= 1)
      {
         if(thr == nullptr) thr = RunningThread(std::nothrow);
         if(thr == nullptr) return;
         thr->StackCheck();
      }
      else
      {
         --StackCheckCounter_;
      }
   }
}

The first if statement checks if the function trace tool is enabled. If so, it sets a per-thread lock before it checks whether the running thread should be traced. If it should, it adds the function invocation to the trace buffer. The rest of the code deals with delivering signals to the running thread and checking if it is close to overrunning its stack.

The per-thread lock prevents nested calls to Debug::ft. If these aren't blocked, such that a function invoked during Debug::ft also invokes Debug::ft, the result is a stack overflow. You can remove the lock and try it yourself if you're not convinced. When it crashed and I saw the thread stack, I facepalmed. More recently, I added calls to Debug::noft to functions invoked during Debug::ft, to indicate that they don't bother invoking Debug::ft because doing so would be a waste of time. Besides, who wants to trace functions that are invoked to trace a function?

Assuming that FunctionTrace::Capture gets invoked, what does it do?

C++

void FunctionTrace::Capture(fn_name_arg func)
{
   //  The actual trace is
   //    func
   //      Debug::ft
   //        Thread::FunctionInvoked
   //          FunctionTrace::Capture
   //  FUNC is therefore three levels above the current depth.
   //
   if(Scope_ == CountsOnly)
   {
      auto buff = Singleton<TraceBuffer>::Instance();
      buff->RecordInvocation(func);
      return;
   }

   auto buff = Singleton<TraceBuffer>::Instance();
   auto depth = SysThreadStack::FuncDepth() - 3;
   auto rec = new FunctionTrace(func, depth);
   buff->Insert(rec);
}

The function tracer has two modes. One of these, CountsOnly, is fairly efficient because it simply counts how many times each function was invoked. This is useful when deciding which functions to optimize, because a small speed-up in a frequently invoked function usually provides much greater gains than a large speed-up in a function that is rarely invoked.

If we want to record more than the invocation count for each function, the platform-specific SysThreadStack is used to find the function's depth on the stack. This, along with the function's name, is passed to the FunctionTrace constructor, which also records the running thread's identifier and the current tick time (i.e., the time at which the function was invoked).

After tracing is stopped and the trace records are to be streamed, FunctionTrace records are processed to calculate the gross and net time spent in each function. The number of stack frames (depth) when a function was invoked supports a "staircase" output that shows how function calls were nested. And if more than one thread was traced, it is possible to see which thread was running when each function was invoked.

Here is an excerpt from a function trace. This is actually the first code to be captured during RSC's initialization, even before the entry to main. It shows the creation of three mutexes that register with MutexRegistry, which must be created before it can register the first mutex:

START OF TRACE: 30-Jan-2020 18:36:00.492 on Reigi

mm:ss.ttt  Thr  Event  TotalTime   NetTime  Function
---------  ---  -----  ---------   -------  --------
36:00.492    1>              134         1    :   :   : SysMutex.ctor
36:00.492    1                44        44    :   :   :   Permanent.ctor
36:00.492    1                47         5    :   :   :   Singleton.Instance
36:00.492    1                17         3    :   :   :   : Permanent.operator new
36:00.492    1                14         4    :   :   :   :   Memory.Alloc
36:00.492    1                10        10    :   :   :   :   : SysHeap.Alloc
36:00.492    1                24         2    :   :   :   : MutexRegistry.ctor
36:00.492    1                 7         7    :   :   :   :   Permanent.ctor
36:00.492    1                 9         9    :   :   :   :   Registry.ctor
36:00.492    1                 6         2    :   :   :   :   Registry.Init
36:00.492    1                 4         2    :   :   :   :   : Memory.Alloc
36:00.492    1                 2         2    :   :   :   :   :   SysHeap.Alloc
36:00.492    1                 1         1    :   :   :   : Singletons.BindInstance
36:00.492    1                42        18    :   :   :   MutexRegistry.BindMutex
36:00.492    1                24        24    :   :   :   : MutexRegistry.Find
36:00.492    1                26         0    :   :   : SysMutex.ctor
36:00.492    1                 7         7    :   :   :   Permanent.ctor
36:00.492    1                19         4    :   :   :   MutexRegistry.BindMutex
36:00.492    1                15        15    :   :   :   : MutexRegistry.Find
36:00.492    1               149         0    :   :   : SysMutex.ctor
36:00.492    1                 6         6    :   :   :   Permanent.ctor
36:00.492    1               143        11    :   :   :   MutexRegistry.BindMutex
36:00.492    1               132       132    :   :   :   : MutexRegistry.Find

The trace of the entire initialization is quite long and can be seen here, along with the invocation count for each function that appeared in the trace. If you search for ">" in the trace, you will find entries in the Thr column. That column displays the identifier of the running thread, with a ">" appended when a context switch occurs. Whenever RSC creates a thread during initialization, Windows starts to initiate context switches at a pace that is beyond manic! It only subsides because new threads suspend execution soon after being entered, and are only signaled to resume once initialization is completed.

Most non-trivial functions invoke Debug::ft so that they will appear in a function trace. Those that don't usually fall into one of the following categories:

functions defined in a header file
simple const functions, such as "getters"
functions that write data to a stream
iterators (e.g., First, Next, Prev, Last)
functions that construct or process trace records
functions that are invoked so often that they would fill the trace with noise

Other Trace Tools

The function tracer is one of RSC's many trace tools. Here is a brief description of the others:

MemoryTracer records alloc and free events for a heap and is typically used in combination with other trace tools.
ObjPoolTracer records dequeue, enqueue, claim, and recover events for the blocks in an object pool.
NetworkTracer records 18 different types of events on sockets.
TransTracer records the costs of receiving a message over the network and later processing it.
BufferTracer records the full contents of incoming and outgoing messages.
ContextTracer records 20 different types of events within RSC's state machine framework.
ParserTracer records 14 different types of events in the parser/interpreter that is part of RSC's C++ static analysis tool, resulting in a trace that resembles pseudo-code for a stack machine.

You can enable any subset of tools before starting a trace. After stopping the trace, you can disable some of them to exclude their records before using the CLI's >save command to generate a trace file. Each trace file contains events from all of the enabled tools, in the order in which they occurred. For example, this trace is of a basic call in RSC's telephony application, where A calls B, B answers, and they both disconnect. The Event column contains events recorded by NetworkTracer, TransTracer, and ContextTracer, and complete messages recorded by BufferTracer also appear in the trace. When TransTracer and ContextTracer are both used to record events, their events can also be processed to produce a message sequence chart.

In the Lab

The above trace tools are also useful in the lab, where they can be used with less caution. Two other tools are also useful during focused debugging:

To look for memory leaks, use the subcommands available through the CLI's >heap trace command.
To see context switches, use the subcommands available through the CLI's >sched command, which generates a timeline of threads that looks like this.

Points of Interest

Training

Although trace tools are primarily intended for debugging, they are also useful teaching aids that can help new developers understand how the code works. For example, the function tracer makes it easy to see which override of a virtual function gets invoked. And a message sequence chart, coupled with a state machine trace, provides an extremely good view of how the components of a reactive system collaborate.

Compiler Optimizations

Aggressive code optimization by the compiler can make it very difficult to debug a release build. It's one thing to inline a function defined in a header, but quite another to inline unrelated functions and eliminate local variables. Even the output from the function tracer can be hard to follow. If you need the ability to debug your software in live systems, it is highly recommended that you disable many compiler optimizations. Some of them yield little improvement and simply obfuscate the code.

Function Tracing and Object Creation/Deletion

If you look at the full version of FunctionTrace::Capture, you will come across code that inserts a call to a "C++.delete" at the start of a destructor chain. This function represents compiler-generated code that invokes the destructor chain, followed by the appropriate operator delete. When the first destructor is invoked, the function tracer doesn't know if a delete operator will subsequently be invoked, so it inserts a C++.delete. Later, when the function trace records are processed before being streamed, spurious calls to C++.delete (i.e., those that were not followed by a delete operator) are removed from the trace.

Here is a C++.delete that was added to a trace:

36:01.304    3                39         1    :   : C++.delete
36:01.304    3                18         6    :   :   FileRequest.dtor
36:01.304    3                12         7    :   :   : StreamRequest.dtor
36:01.304    3                 5         5    :   :   :   MsgBuffer.dtor
36:01.304    3                20         7    :   :   Pooled.operator delete
36:01.304    3                13         7    :   :   : ObjectPool.EnqBlock
36:01.304    3                 6         6    :   :   :   Q1Way.Enq

Constructors are even more problematic. First, a chain of constructor calls gets captured bottom to top, from base class to leaf class. This is somewhat confusing and results in misleading timing information when calculating the gross and net times spent in each function. The processing phase therefore reorders constructors by putting the leaf class first, followed by the base class, and thence up the class hierarchy. This more clearly shows which class is being constructed and provides the correct gross time for its constructor.

Another source of confusion is that functions invoked in a member initialization list appear before the constructor to which they belong. As with the basic constructor chain, this inverted ordering occurs because the body of a constructor must be entered before Debug::ft records its invocation.

Finally, as constructors step out (in depth) from base class to leaf class, there are cases where it is impossible to determine the correct leaf class. Provided that no constructor in the chain fails to invoke Debug::ft, each of them (from base class to leaf class) has a depth that is one less than the previous constructor. But if another constructor follows the true leaf constructor at a depth that is also one less, it will appear to be the leaf class. There seems to be no general way to fix this without knowledge of the class hierarchy. (When an object is allocated on the heap, however, the correct leaf class can always be determined, because its constructor must lie at the same depth as the call to operator new.)

Here is an excerpt from a trace in which no processing has been done to reorder constructors:

00:32.972    1              1011        63    :   :   : CxxStrLiteral.CreateRef
00:32.972    1               155       122    :   :   :   Base.operator new
00:32.972    1                23        23    :   :   :   :   CxxToken.ctor
00:32.972    1                33        33    :   :   :   : CxxNamed.ctor
00:32.972    1               169        11    :   :   :   QualName.ctor(string)
00:32.972    1                29        22    :   :   :   : Base.operator new
00:32.972    1                 6         6    :   :   :   :   : CxxToken.ctor
00:32.972    1                 7         7    :   :   :   :   CxxNamed.ctor
00:32.972    1               129       129    :   :   :   : TypeName.ctor
00:32.972    1               104        54    :   :   :   Base.operator new
00:32.972    1                10        10    :   :   :   :   : CxxToken.ctor
00:32.972    1                17        17    :   :   :   :   CxxNamed.ctor
00:32.972    1                44        44    :   :   :   : TypeSpec.ctor
00:32.972    1                 6         6    :   :   :   : TypeTags.ctor
00:32.972    1                43        43    :   :   :   DataSpec.ctor
00:32.972    1                 5         5    :   :   :   TypeTags.SetPointer
00:32.972    1               305       213    :   :   :   Base.operator new
00:32.972    1                 9         9    :   :   :   :   :   : CxxToken.ctor
00:32.972    1                17        17    :   :   :   :   :   CxxNamed.ctor
00:32.972    1                21        21    :   :   :   :   : CxxScoped.ctor
00:32.972    1                75        75    :   :   :   :   CxxScope.ctor
00:32.972    1                92        24    :   :   :   : Data.ctor
00:32.972    1                68         8    :   :   :   :   DataSpec.SetUserType
00:32.972    1                23        23    :   :   :   :   : TypeSpec.SetUserType
00:32.972    1                37        15    :   :   :   :   : QualName.SetUserType
00:32.972    1                22        22    :   :   :   :   :   TypeName.SetUserType
00:32.972    1                53        53    :   :   :   FuncData.ctor
00:32.972    1               114         8    :   :   :   Singleton.Instance
00:32.972    1                54        54    :   :   :   : Base.operator new
00:32.972    1                13        13    :   :   :   : CxxRoot.ctor
00:32.972    1                39        39    :   :   :   : Singletons.BindInstance

And here is the "corrected" version of the same excerpt:

36:00.493    1               710        61    :   :   : CxxStrLiteral.CreateRef
36:00.493    1                41        41    :   :   :   Base.operator new
36:00.493    1               192        21    :   :   :   QualName.ctor(string)
36:00.493    1                20        20    :   :   :   :   CxxToken.ctor
36:00.493    1                39        39    :   :   :   : CxxNamed.ctor
36:00.493    1                13        13    :   :   :   : Base.operator new
36:00.493    1               119         4    :   :   :   : TypeName.ctor
36:00.493    1                 3         3    :   :   :   :   : CxxToken.ctor
36:00.493    1               115       115    :   :   :   :   CxxNamed.ctor
36:00.493    1                14        14    :   :   :   Base.operator new
36:00.493    1                80         7    :   :   :   DataSpec.ctor
36:00.493    1                 3         3    :   :   :   :   : CxxToken.ctor
36:00.493    1                 3         3    :   :   :   :   CxxNamed.ctor
36:00.493    1                34        34    :   :   :   : TypeSpec.ctor
36:00.493    1                39        39    :   :   :   : TypeTags.ctor
36:00.493    1                 3         3    :   :   :   TypeTags.SetPointer
36:00.493    1                47        47    :   :   :   Base.operator new
36:00.493    1               195        82    :   :   :   FuncData.ctor
36:00.493    1                 3         3    :   :   :   :   :   : CxxToken.ctor
36:00.493    1                11        11    :   :   :   :   :   CxxNamed.ctor
36:00.493    1                10        10    :   :   :   :   : CxxScoped.ctor
36:00.493    1                56        56    :   :   :   :   CxxScope.ctor
36:00.493    1               113         9    :   :   :   : Data.ctor
36:00.493    1               104         8    :   :   :   :   DataSpec.SetUserType
36:00.493    1                 9         9    :   :   :   :   : TypeSpec.SetUserType
36:00.493    1                87        13    :   :   :   :   : QualName.SetUserType
36:00.493    1                74        74    :   :   :   :   :   TypeName.SetUserType
36:00.493    1                77         3    :   :   :   Singleton.Instance
36:00.493    1                41        41    :   :   :   : Base.operator new
36:00.493    1                 6         6    :   :   :   : CxxRoot.ctor
36:00.493    1                27        27    :   :   :   : Singletons.BindInstance

History

17^th August, 2020: Add "In the Lab" section
4^th February, 2020: Initial version

License

This article, along with any associated source code and files, is licensed under The GNU General Public License (GPLv3)

Written By

Greg Utas

Architect

United States

Author of Robust Services Core (GitHub) and Robust Communications Software (Wiley). Formerly Chief Software Architect of the core network servers that handle the calls in AT&T's wireless network.