Software Techniques for Lemmings

Greg Utas

17 Mar 2022 | GPLv3 | 23 min read

Are we about to go over a cliff?

This article summarizes commonly used software techniques that can be harmful, particularly when implementing a server or other large application. It discusses their drawbacks and when using them is nevertheless appropriate, and outlines alternatives that are usually preferable, with links to articles that go into greater depth.

Introduction

Some techniques that are common practice in the computing industry are ill-advised in servers and other serious applications. These techniques are sometimes used because they require little effort. Or maybe they're just the default way to do something when using a particular language, platform, or framework. Consequently, a central purpose of the Robust Services Core and the articles that I've written about it is to provide alternatives that are often preferable but not readily available or not even widely known. If this is the first time that you're reading an article about RSC, please take a few minutes to read this preface.

Disclaimers

There are legitimate uses for the techniques that this article disparages, and it even makes note of some of them. But these techniques are significantly overused, with little thought given to their drawbacks. Naturally, it depends on what you're doing. If you're building something for personal use and its quality doesn't matter that much, just do whatever. If you're writing software for a client or a desktop, some of the issues raised in this article won't apply. It is written primarily from the standpoint of designing software for a server that needs to be highly available, reliable, efficient, and scalable. But even if you're not working on server software, this article should alert you to the drawbacks of many frequently used techniques and make you aware of other options.

The article concludes with a discussion of some C++ topics because, despite its drawbacks, C++ is the language that I would choose for building a system with the characteristics just mentioned. However, the rest of the article discusses general design principles that are not specific to C++.

The Server Design Lame List¹

Threads

The problem isn't threads per se, but rather the number of them. Here are some typical examples:

Thread Per User or Thread Per TCP accept(): moronic
Thread Per Request: imbecilic
Thread Per Object: idiotic

The performance of a system with thousands of threads will be far from satisfying. Threads take time to create and schedule, and their stacks consume a lot of memory unless their sizes are engineered, which won't be the case in a system that spawns them mindlessly. We have a little job to do? Let's fork a thread, call join, and let it do the work. This was popular enough before the advent of <thread> in C++11, but <thread> did nothing to temper it. I don't see <thread> as being useful for anything other than toy systems, though it could be used as a base class to which many other capabilities would then be added.

Even apart from these Thread Per Whatever designs, some systems overuse threads because it's their only encapsulation mechanism. They're not very object-oriented and lack anything that resembles an application framework. So each developer creates their own little world by writing a new thread to perform a new function.

The main reason for writing a new thread should be to avoid complicating the thread loop of an existing thread. Thread loops should be easy to understand, and a thread shouldn't try to handle various types of work that force it to multitask and prioritize them, effectively acting as a scheduler itself.

Here are some ways to avoid an excessive number of threads:

Consider doing work in the current thread instead of spawning a new one.
Make threads daemons (which persist until a reboot, performing a specific type of work).
To perform a lot of work serially, put it on a work queue serviced by another thread. If a work item is addressed to the object that should handle it, and the objects don't perform blocking operations, one thread is sufficient to front-end all of these objects.
To perform a lot of work in parallel—something that won't happen any faster by creating a thread for each request—have the same thread send requests to other processors and process their replies. One thread can handle all of this unless it uses blocking sends, something that will be thoroughly condemned later.
Offload blocking operations to threads that specialize in them, such as a thread that front-ends cin, a thread that front-ends cout, threads that stream data to files, and threads that handle network I/O. To improve throughput, a thread pool is often appropriate for these (e.g., a pool of threads that write to disk, or a dedicated thread for each combination of IP port and protocol).
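
As a rough illustration of the work queue suggestion above, here is a minimal sketch in which one long-lived thread services a queue of work items, so that callers enqueue a job instead of spawning a thread for it. The names WorkQueue, Enqueue, and RunJobs are hypothetical, and std::thread is used only to keep the example self-contained; none of this is taken from RSC.

```cpp
#include <cassert>
#include <condition_variable>
#include <deque>
#include <functional>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical sketch: one long-lived thread services a queue of work items.
class WorkQueue
{
public:
   WorkQueue() : worker_([this] { Run(); }) { }

   ~WorkQueue()
   {
      {
         std::lock_guard<std::mutex> lock(mutex_);
         done_ = true;
      }
      ready_.notify_one();
      worker_.join();  // the worker drains any remaining items before exiting
   }

   // Called by any thread: queue a work item instead of spawning a thread.
   void Enqueue(std::function<void()> item)
   {
      {
         std::lock_guard<std::mutex> lock(mutex_);
         items_.push_back(std::move(item));
      }
      ready_.notify_one();
   }

private:
   // The thread loop: wait for work, run one item to completion, repeat.
   void Run()
   {
      for(;;)
      {
         std::function<void()> item;
         {
            std::unique_lock<std::mutex> lock(mutex_);
            ready_.wait(lock, [this] { return done_ || !items_.empty(); });
            if(items_.empty()) return;  // done_ is set and nothing is left
            item = std::move(items_.front());
            items_.pop_front();
         }
         item();  // runs outside the lock
      }
   }

   std::mutex mutex_;
   std::condition_variable ready_;
   std::deque<std::function<void()>> items_;
   bool done_ = false;
   std::thread worker_;  // declared last so that the other members exist first
};

// Usage: 100 small jobs, one thread, results produced in FIFO order.
std::vector<int> RunJobs()
{
   std::vector<int> results;
   {
      WorkQueue queue;
      for(int i = 0; i < 100; ++i)
         queue.Enqueue([&results, i] { results.push_back(i * i); });
   }  // the destructor returns only after the queue has been drained
   return results;
}
```

Because a single thread runs every item to completion, the items themselves need no locking against each other; the only mutex is the one that protects the queue.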

Scheduling

My article Robust C++: P and V Considered Harmful laments that preemptive and priority scheduling are the de facto standard in most operating systems. By initiating context switches seemingly at random, these scheduling disciplines create critical regions that multi-threaded applications must protect with mutexes.

If you look at that article, you will see that it received a healthy share of downvotes. Most offered no comment, but one stated that semaphores are universally accepted and that a different solution couldn't possibly be an improvement. This comment should simply have said "tl;dr" because, very early on, the article states that semaphores are indispensable but that, like goto, applications should only have to use them rarely.

Assume that we had no scheduler and that we held a meeting to discuss how to implement one. What group would settle on a design that

created as many critical regions as possible;
by so doing, added artificial complexity that had nothing to do with satisfying the product specification;
forced developers to agonize over "thread safety";
degraded system performance with mutex allocations, contentions, or even deadlocks; and
fostered errors that are easily made but that are difficult to reproduce and debug?

Clearly, such a design could only emerge from a group dominated by sadists. So my article recommends these alternatives instead:

Cooperative scheduling instead of preemption. Each thread yields when it is ready for a context switch. It usually does so in its thread loop, after it has completed a logical unit of work. If each logical unit of work runs to completion, there can be no critical regions within that code.
Proportional scheduling instead of priorities. Assign each thread to a faction—a group of threads that perform related work—and give each faction a share of the CPU time. One faction may get considerably more time than another, but each has important work to do and therefore gets some time, even when the system is heavily loaded. Unless you're running SETI processing in the background, it's unlikely that you have threads whose work is frivolous enough that it should be starved by threads of higher priority.

It isn't quite as simple as this, so the article discusses how to deal with scenarios that don't neatly fit into this scheme. Nevertheless, many systems would do themselves a huge favor by adopting these scheduling disciplines.
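
To make the idea of proportional scheduling concrete, the following sketch simulates a scheduler that always runs the faction that is furthest behind its share of the CPU. It is only a model under simple assumptions (fixed shares, one time unit per decision, no blocking); the names Faction, SelectFaction, and Simulate are hypothetical and do not describe how RSC implements its scheduler.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical: a faction is a group of threads with a share of the CPU time.
struct Faction
{
   std::string name;
   unsigned share;      // relative weight; must be greater than zero
   unsigned long used;  // time units consumed so far
};

// Picks the faction that is furthest behind its entitled share. The test
// used_a / share_a < used_b / share_b is cross-multiplied so that it can
// be evaluated with integer arithmetic. Ties go to the lower index.
std::size_t SelectFaction(const std::vector<Faction>& factions)
{
   assert(!factions.empty());
   std::size_t best = 0;
   for(std::size_t i = 1; i < factions.size(); ++i)
   {
      const Faction& a = factions[i];
      const Faction& b = factions[best];
      if(a.used * b.share < b.used * a.share) best = i;
   }
   return best;
}

// Simulates 'ticks' scheduling decisions, each of which runs the selected
// faction for one time unit.
std::vector<Faction> Simulate(std::vector<Faction> factions, unsigned ticks)
{
   for(unsigned t = 0; t < ticks; ++t)
      ++factions[SelectFaction(factions)].used;
   return factions;
}
```

With shares of 6, 3, and 1, the three factions in this model receive 60, 30, and 10 of every 100 time units. The faction with the smallest share is never starved, which is the property that distinguishes this from strict priority scheduling.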

Symmetric Multiprocessing

Just as preemptive and priority scheduling create as many critical regions as possible, so does running copies of an executable on CPUs that share memory. To use cooperative and proportional scheduling on such a multicore platform, sandbox each core by giving it a private memory segment and build a distributed system that treats each core as a separate node. This avoids the critical regions, mutex contentions, and cache collisions that often plague multicore systems. And because your application will be truly distributed, it will scale beyond the number of CPUs that are available on a single platform.

Memory Management

Although many applications use only the heap to allocate new objects, doing so has several major drawbacks in a system that will remain in service for a long time, each of which can eventually cause a crash:

Continually allocating and freeing blocks of various sizes can lead to fragmentation that eventually causes allocation to fail. To avoid this, the heap manager must merge adjacent free areas and use a best-fit policy, both of which add overhead.
A bug can result in a software component allocating all available memory.
Memory leaks can eventually cause the heap to run out of memory.

Robust C++: Object Pools describes an alternative that addresses these drawbacks. When the system initializes, it creates an object pool for each major subtree in the system's class hierarchy. The size of each pool is determined by a configuration parameter that specifies the number of blocks (for future objects) in the pool. A base class whose derived classes all share its pool overrides operator new and operator delete to use the pool instead of the heap. This provides some protection against memory gobblers and eliminates escalating fragmentation.
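
A minimal sketch of this arrangement follows. It assumes a pool of equal-sized blocks that is allocated once, and a base class whose operator new and operator delete draw from that pool. The names ObjectPool, Message, and TextMessage are hypothetical, and the block size and count are hard-coded where a real system would read them from configuration; the RSC implementation described in the linked article is considerably more elaborate.

```cpp
#include <cassert>
#include <cstddef>
#include <new>
#include <vector>

// Hypothetical sketch: a fixed number of equal-sized blocks, allocated once.
class ObjectPool
{
public:
   ObjectPool(std::size_t blockSize, std::size_t blockCount)
      : blockSize_(RoundUp(blockSize)),
        storage_(static_cast<char*>(::operator new(blockSize_ * blockCount)))
   {
      free_.reserve(blockCount);  // the free list itself never grows later
      for(std::size_t i = 0; i < blockCount; ++i)
         free_.push_back(storage_ + i * blockSize_);
   }

   ~ObjectPool() { ::operator delete(storage_); }

   ObjectPool(const ObjectPool&) = delete;
   ObjectPool& operator=(const ObjectPool&) = delete;

   // Fails, rather than growing, when the pool is exhausted.
   void* Allocate(std::size_t size)
   {
      if(size > blockSize_ || free_.empty()) throw std::bad_alloc();
      void* block = free_.back();
      free_.pop_back();
      return block;
   }

   void Free(void* block) { free_.push_back(static_cast<char*>(block)); }

   std::size_t Available() const { return free_.size(); }

private:
   // Rounds a block size up so that every block is suitably aligned.
   static std::size_t RoundUp(std::size_t n)
   {
      const std::size_t a = alignof(std::max_align_t);
      return (n + a - 1) / a * a;
   }

   std::size_t blockSize_;
   char* storage_;
   std::vector<char*> free_;
};

// Base class whose subclasses all share its pool.
class Message
{
public:
   virtual ~Message() = default;

   static void* operator new(std::size_t size) { return Pool().Allocate(size); }
   static void operator delete(void* block) { Pool().Free(block); }

   static ObjectPool& Pool()
   {
      static ObjectPool pool(64, 4);  // assumed sizes; normally configured
      return pool;
   }
};

class TextMessage : public Message
{
public:
   char text[32] = {};
};
```

Application code still writes new TextMessage and delete as usual. A component that leaks or gobbles messages can exhaust only this pool, and because every block has the same size, the pool cannot fragment.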

This design can also recover from memory leaks because the pools reflect the system's object model. An object pool audit (a thread) can therefore mark all of the blocks as orphaned, tell each pool to claim its in-use blocks, and then recover any orphans. This is garbage collection, but as a background, not a foreground, activity. The system does not have to spend nearly as much time on garbage collection as does Java or C#, and the garbage collector doesn't have to worry about freezing the system when it has lots of work to do. Applications are still expected to delete objects, so the recovery of an orphaned block highlights a bug that needs to be found and fixed.
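
The three audit phases (mark, claim, recover) can be sketched as follows. To keep the example short, blocks are represented only by their state, and the set of blocks that the application can still reach is a plain vector; both are simplifying assumptions, and the names AuditedPool and AuditDemo are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

enum class BlockState { Free, InUse, Orphaned };

// Hypothetical sketch: a pool that tracks only the state of each block.
class AuditedPool
{
public:
   explicit AuditedPool(std::size_t count) : blocks_(count, BlockState::Free) { }

   // Returns the index of the allocated block, or -1 if none is free.
   int Allocate()
   {
      for(std::size_t i = 0; i < blocks_.size(); ++i)
      {
         if(blocks_[i] == BlockState::Free)
         {
            blocks_[i] = BlockState::InUse;
            return static_cast<int>(i);
         }
      }
      return -1;
   }

   void Free(int block) { At(block) = BlockState::Free; }

   // Phase 1: presume that every block in use has been orphaned.
   void MarkAllOrphaned()
   {
      for(BlockState& state : blocks_)
         if(state == BlockState::InUse) state = BlockState::Orphaned;
   }

   // Phase 2: a block that is still reachable is claimed by its owner.
   void Claim(int block)
   {
      if(At(block) == BlockState::Orphaned) At(block) = BlockState::InUse;
   }

   // Phase 3: blocks that nobody claimed are returned to the pool.
   // Each one indicates a leak that should be investigated.
   std::size_t RecoverOrphans()
   {
      std::size_t recovered = 0;
      for(BlockState& state : blocks_)
      {
         if(state == BlockState::Orphaned)
         {
            state = BlockState::Free;
            ++recovered;
         }
      }
      return recovered;
   }

private:
   BlockState& At(int block)
   {
      assert(block >= 0 && static_cast<std::size_t>(block) < blocks_.size());
      return blocks_[static_cast<std::size_t>(block)];
   }

   std::vector<BlockState> blocks_;
};

// Usage: three blocks remain reachable and two are leaked.
std::size_t AuditDemo()
{
   AuditedPool pool(8);
   std::vector<int> owned;  // blocks that the application can still reach
   for(int i = 0; i < 3; ++i) owned.push_back(pool.Allocate());
   pool.Allocate();  // leaked: allocated, but nothing refers to it
   pool.Allocate();  // leaked

   pool.MarkAllOrphaned();
   for(int block : owned) pool.Claim(block);
   return pool.RecoverOrphans();
}
```

In a real system the claim phase would walk the objects that own blocks rather than a vector, which is why the approach depends on the pools reflecting the object model.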

The benefits of this design are one reason that I prefer C++ to paternalistic languages that preclude it. It relies on a placement new capability, which can also be used to improve performance in other situations. If I can allocate objects, it's not unreasonable to expect me to free them! And garbage-collected languages aren't a panacea for memory leaks in any case, because leaks can still occur when references aren't cleared.

Callbacks

Some frameworks use callbacks ubiquitously to implement the Observer pattern. A subscriber registers with a publisher and, when an event of interest occurs, the publisher notifies the subscriber by invoking a callback function that the subscriber supplied.

Callbacks are efficient and have legitimate uses. However, they also have drawbacks that merit consideration:

If a callback causes an exception by using a bad pointer, for example, the publisher's thread, not the subscriber's, is at risk.
If the callback does a lot of work, it can compromise how the system is engineered. Its work may run at the wrong priority or, if we're using proportional scheduling, in the wrong faction.
If the subscriber is implemented in a framework that provides event routing, the callback event does not pass through the framework. The subscriber receives it directly, bypassing the framework and the capabilities that it provides when events are routed through it in the usual manner.

When any of these drawbacks is a concern, a message should replace the callback. The publisher can send the message, or if the subscribers are heterogeneous and require different messages, their callbacks can do so.
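
The following sketch shows the first option, in which the publisher sends a message instead of invoking a callback. Notifying a subscriber only queues a message; the subscriber handles it later, in its own thread loop. The names are hypothetical, and the inbox is unprotected, which assumes either cooperative scheduling or a lock that has been omitted here for brevity.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <string>
#include <vector>

struct Message
{
   std::string event;
};

// Hypothetical subscriber that receives events as queued messages.
class Subscriber
{
public:
   // Called in the publisher's context: only queues the message.
   void Post(const Message& msg) { inbox_.push_back(msg); }

   // Called from the subscriber's own thread loop, so the work runs in
   // the subscriber's context (its thread, priority, or faction).
   std::size_t ProcessInbox()
   {
      std::size_t handled = 0;
      while(!inbox_.empty())
      {
         Message msg = inbox_.front();
         inbox_.pop_front();
         log_.push_back("handled " + msg.event);
         ++handled;
      }
      return handled;
   }

   const std::vector<std::string>& Log() const { return log_; }

private:
   std::deque<Message> inbox_;
   std::vector<std::string> log_;
};

// Hypothetical publisher that notifies subscribers by message.
class Publisher
{
public:
   void Subscribe(Subscriber& subscriber) { subscribers_.push_back(&subscriber); }

   void Notify(const std::string& event)
   {
      for(Subscriber* s : subscribers_) s->Post(Message{event});
   }

private:
   std::vector<Subscriber*> subscribers_;
};
```

Nothing that the subscriber does while handling the event can now raise an exception on the publisher's thread, and the message can pass through whatever framework normally routes the subscriber's events.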

Synchronous Messaging

A synchronous message is one that an application sends inline, after which it blocks to wait for a response. This is often implemented as a remote procedure call (RPC), which makes it look as if the application is calling a function that returns a result. So in the interests of brevity, let's refer to it as an SRPC (synchronous RPC).

For many reasons, SRPCs are one of the lamest things that you'll find in software:

If the application is interacting directly with the user, a spinning wheel or hourglass appears on the display during the SRPC. If the recipient of the SRPC fails to respond, the user can do nothing but watch this icon spin until the SRPC times out, by which time his patience has also timed out. In most cases, the user can only make the spinning icon go away by forcibly shutting down the application. Utterly despicable.
If the application is running in a server, the SRPC blocks the running thread, causing the overhead of a context switch. Far worse, a thread pool is needed so that a thread will be available to serve users when its peers are blocked on the SRPC. This is an example of the imbecilic Thread Per Request design mentioned earlier.
The application cannot perform work in parallel because SRPCs cause it to be performed sequentially.
If the application can also receive inputs from other sources during the SRPC, those inputs queue up instead of being handled in a timely manner.
Similar to a point previously made about callbacks, the SRPC's response or timeout arrives inside an application function, bypassing any framework that normally provides event routing.
The application doesn't need to include the SRPC in its state machine. If it uses SRPCs exclusively, it won't even have a state machine apart from its thread stack and instruction pointer! This makes it hard to trace its requirements through to its implementation and to get an overview of its behavior.
In some cases, an SRPC explicitly names the destination function for which the message is intended. This creates an undesirable coupling between the two software components involved in the transaction.
The SRPC looks like a function call, so it's easy to inadvertently hold a mutex during the SRPC, locking out all other users of the mutex until the SRPC completes. Sadly, I've seen systems where this occurred in several places.

Well-designed software avoids these drawbacks by using state machines and asynchronous messaging. However, this generally requires more development time. A state machine must be carefully designed so that it will have an event handler for each possible state-event combination. This is less of a concern with SRPCs, where events other than a reply or timeout simply queue up during an SRPC.

A potentially serious issue for a state machine is an explosion in its state-event space. This usually occurs when an application is implemented by a group of state machines that run on behalf of the same user. While the state machines in the group are exchanging messages to handle an external input, they are in transient states and are therefore loath to accept another external input until they finish handling the current one. Rather than using SRPCs during this interval, they can use asynchronous messages that have a higher priority than the external messages. Because the priority messages are processed first, the group can reach a stable state before it accepts another external message. The only restriction is that the state machines in the group must all run in the same processor, but this is also desirable for performance reasons.
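
The effect of priority messages can be sketched with a dispatcher that keeps two queues and always drains the internal one first. In this simplified model, handling an external request triggers an internal query and reply; the names Dispatcher, Kind, and RunGroup are hypothetical, and real state machines would of course do more than record a trace.

```cpp
#include <cassert>
#include <deque>
#include <string>
#include <vector>

// Requests arrive from outside the group; queries and replies are the
// messages that the group's state machines exchange among themselves.
enum class Kind { Request, Query, Reply };

struct Message
{
   Kind kind;
   int user;
};

// Hypothetical dispatcher with a priority queue for internal messages.
class Dispatcher
{
public:
   void Send(const Message& msg)
   {
      if(msg.kind == Kind::Request)
         external_.push_back(msg);
      else
         internal_.push_back(msg);
   }

   // Internal messages are always dequeued before external ones.
   // Returns false when both queues are empty.
   bool Next(Message& msg)
   {
      std::deque<Message>& queue = internal_.empty() ? external_ : internal_;
      if(queue.empty()) return false;
      msg = queue.front();
      queue.pop_front();
      return true;
   }

private:
   std::deque<Message> internal_;
   std::deque<Message> external_;
};

// Two external requests are already queued when processing begins.
std::vector<std::string> RunGroup()
{
   Dispatcher dispatcher;
   std::vector<std::string> trace;
   dispatcher.Send({Kind::Request, 1});
   dispatcher.Send({Kind::Request, 2});

   Message msg{Kind::Request, 0};
   while(dispatcher.Next(msg))
   {
      switch(msg.kind)
      {
      case Kind::Request:  // the group enters a transient state
         trace.push_back("request " + std::to_string(msg.user));
         dispatcher.Send({Kind::Query, msg.user});
         break;
      case Kind::Query:
         trace.push_back("query " + std::to_string(msg.user));
         dispatcher.Send({Kind::Reply, msg.user});
         break;
      case Kind::Reply:  // the group is stable again
         trace.push_back("reply " + std::to_string(msg.user));
         break;
      }
   }
   return trace;
}
```

The resulting trace is request 1, query 1, reply 1, request 2, query 2, reply 2. The second request is not seen until the first transaction has finished, so no state machine needs an event handler for an external request that arrives while it is in a transient state.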

Distribution

Transparent Distribution

Modifying a uniprocessor system to make it distributed, and doing so in a way that is transparent to its applications, is not an actual software design technique. Rather, it's something touted by those who have no experience with serious systems. The "transparent" solution usually involves adding SRPCs, which have already been unmasked as beyond lame.

Distribution introduces interprocessor messages, something that cannot be transparent for various reasons:

Another processor may fail to respond to a message because it (a) is unreachable; (b) is out of service; (c) suffered an exception; (d) discarded the message because of an overload control policy. The sender of the message must therefore handle a timeout.
The recipient of an interprocessor message may be slow to respond, which degrades the application's response time and potentially affects its engineering rules.
As discussed at the end of the previous section, the sender of the interprocessor message enters a new state while waiting for the response, so it must account for other messages that can arrive in this state.
If the ability to independently upgrade each processor's software is needed, the protocol between them must remain backward compatible.

Any claim of "transparent" distribution is fraudulent.

Approaches to Distribution

The motivation for distributing an application is often to improve throughput by adding more processing power. The question is then how to partition the application. The strategy of moving some of its components (functions) to other processors is known as heterogeneous distribution, but this is usually a poor choice for the following reasons:

It seldom improves scalability to a significant degree. What used to be achieved with a procedure call now requires an interprocessor message, which carries far more overhead. This overhead can significantly cut into the amount of processing time that was saved by offloading a portion of the application.
It does nothing to improve survivability, which can otherwise be a benefit of—or even a motivation for—distribution. If any processor that runs part of the application goes down, the whole application becomes unavailable.
It results in poor cohesion because the application's state is now distributed across multiple processors. The interprocessor messages may therefore need to exchange a lot of information. This can become especially challenging if the interprocessor protocols need to maintain backward compatibility.
It often leads to building different software loads, each one handling the subset of the application that can be assigned to a specific processor. This can become a significant administrative overhead.

For these reasons, homogeneous distribution is preferable. Instead of moving functions to other processors, this approach moves users to other processors. This is how scalable networks are designed, so the same approach can be used within a system. This is known as recursive design, and it doesn't suffer from the drawbacks mentioned above:

The system scales the same way that a network scales.
If a processor goes down, only the users served by that processor are affected.
Cohesion is preserved because each instance of the application still runs within a single processor. If the application is multi-user, the same protocol that previously supported collaboration across the network can still be used. A new interprocessor protocol is only required if all of the users previously had to be served by the same processor.
Each processor runs the same software load.

Nevertheless, there are situations where heterogeneous distribution is appropriate:

When it would be difficult or costly to distribute something. A large, frequently updated database is a good example. It would remain centralized while the application that uses it was distributed.
Vertically, between layers. Although homogeneous distribution is preferable within a layer, heterogeneous distribution is often preferable when upper and lower layer functions can be separated. When a network is designed in this way, it defines a protocol standard so that upper and lower layer components from different suppliers can interwork. The lower layer components may face hard real-time requirements, whereas the upper layer components may only face soft real-time requirements. Combining components that face different forces is challenging, so it can be prudent to separate them.

The end result is hierarchical distribution: homogeneous within a layer (peer-to-peer), but heterogeneous between upper and lower layers.

User Spaces

Many operating systems provide user spaces, meaning that each process runs in a private memory segment. By default, the process cannot access memory outside its own segment; doing so causes an exception.

User spaces are a great way to run unrelated applications on a single system. The firewalls between processes prevent a bug in one application from corrupting data in another application.

However, running components of a large application in separate user spaces is usually a poor design decision, primarily because of the performance penalties that it introduces:

Context switching between processes is far more expensive than context switching between threads.
A system with user spaces also segregates the kernel from applications. Entering and exiting kernel mode to access kernel functions is another overhead.
If one process needs to access data in another process, it must use a message to do so. What used to be a function call turns into something that can easily be 1,000 times more expensive.

The only benefit of the user spaces is that one component cannot cause another to crash. But if a component is poorly designed, there is nothing to prevent it from corrupting its own data. And since all of the components play a role in the application, a crash of one component renders the application unavailable after all.

The cost of accessing data with messages is so prohibitive that data required by multiple processes ends up migrating into a shared segment that all of them can access. However, this reintroduces much of the risk that the user spaces were supposed to eliminate.

The desire to protect the vulnerable shared segment without significantly compromising performance can lead to an epiphany: write-protect the shared segment, just don't read-protect it! There is no reason to prevent the components of an application from being able to read each other's data as long as it can still be encapsulated, which this design doesn't preclude.

Write-protected data must not change often, because every update involves unprotecting and re-protecting the shared segment, which is similar in cost to a context switch between processes. Fortunately, things like configuration data and large databases often satisfy this criterion. This data is vital to the application, and reloading it is time-consuming, so write-protecting it is beneficial and doesn't add undue overhead.
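
On a POSIX platform, the unprotect, update, and re-protect cycle might look like the sketch below, which uses mmap and mprotect on a single page. The class name ProtectedSegment and its interface are hypothetical, the segment is private to one process rather than shared, and Windows would need VirtualProtect instead; it is meant only to show where the cost of an update comes from.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <sys/mman.h>
#include <unistd.h>

// Hypothetical sketch (POSIX only): memory that can always be read but
// that can be written only within an explicit update.
class ProtectedSegment
{
public:
   ProtectedSegment()
      : size_(static_cast<std::size_t>(sysconf(_SC_PAGESIZE))),
        base_(mmap(nullptr, size_, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0))
   {
      assert(base_ != MAP_FAILED);
      Protect();
   }

   ~ProtectedSegment() { munmap(base_, size_); }

   ProtectedSegment(const ProtectedSegment&) = delete;
   ProtectedSegment& operator=(const ProtectedSegment&) = delete;

   // Reading requires no special action.
   const void* Data() const { return base_; }

   // Unprotects the segment, copies n bytes to the given offset, and
   // re-protects it. The two mprotect calls are the expensive part.
   bool Update(std::size_t offset, const void* src, std::size_t n)
   {
      if(offset > size_ || n > size_ - offset) return false;
      if(mprotect(base_, size_, PROT_READ | PROT_WRITE) != 0) return false;
      std::memcpy(static_cast<char*>(base_) + offset, src, n);
      return Protect();
   }

private:
   bool Protect() { return mprotect(base_, size_, PROT_READ) == 0; }

   std::size_t size_;
   void* base_;
};
```

Outside of Update, a write through a stray pointer into the segment raises SIGSEGV instead of silently corrupting the data.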

Now that we've arrived at this point, we might as well get rid of the user spaces and run the application in one process. The user spaces do very little to improve survivability. They just slow the system down and serve no useful purpose.

One situation where user spaces are appropriate is to run, on a single system, executables that normally run in separate nodes. This can reduce the cost of deploying the application to a relatively small group of users.

For further details about write-protected data, and a quick restart strategy to which it gives rise, see Robust C++: Initialization and Restarts.

Dynamic Linking

Dynamic linking refers to invoking software that is not loaded until an application requires it. Perhaps the most well-known example is the DLL (Dynamic-Link Library) implementation on Windows.

The advantage of dynamic linking is that multiple applications can share one instance of a library instead of individually including redundant copies. But when a system is dedicated to a specific purpose, it gains nothing by this. It is known, in advance, that the library is needed, so the application should be delivered as a single, statically linked executable. This has several advantages:

It avoids the boilerplate (e.g., dllexport, dllimport) associated with shared libraries.
When the system has initialized, it is fully ready to go. It won't take a deep breath while the loader brings in a shared library. This is in keeping with the general design principle that a server should construct as much of the system as possible during its initialization, so that it can provide predictable latency to client requests when it enters service.
It avoids the dangers of dependency hell, in which software crashes or behaves incorrectly when one of the shared libraries differs from the one under which the software was tested.

Software for a dedicated system should use static libraries exclusively.

C++

Although C++ is an excellent choice for implementing robust software, here are some guidelines that don't appear in the typical coding standard.

Preprocessor

Many uses of the preprocessor date to an era when compilers were far less skilled at optimization. They are now so good at it that it is difficult to debug a release build in which the compiler aggressively optimized the code. Many of the optimizations obfuscate the code to such an extent that, in Debugging Live Systems, I recommend disabling most of them.

In C, the preprocessor was also used to achieve the same things that C++ can better achieve with templates. Indeed, I never "got" templates until I saw them described as "macros on steroids".

In modern C++, there is far less justification for using preprocessor features than there was during its youth. But old habits die hard. Some aspects of the preprocessor, such as #include, can't be avoided, but those that force readers to figure out what the preprocessor will emit should be banished.

Specifically, I limit use of the preprocessor to the following:

#include directives
#ifndef for #include guards
#ifdef to include or exclude platform-specific code, which is almost always an entire file for a specific target
#define for a pseudo-keyword that maps to an empty string (I use NO_OP instead of a bare semicolon when part of a for statement is empty, with the same justification as [[fallthrough]])

That's it, and all of the following should be treated as abominations:

#define for a symbol that does not map to an empty string
#undef
macros
# operator (stringification)
## operator (concatenation)
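
As an example of the one permitted #define, the pseudo-keyword NO_OP from the first list might be used as follows. The function Length is hypothetical and exists only to provide a for statement whose initialization part is empty.

```cpp
#include <cassert>
#include <cstddef>

// A pseudo-keyword that expands to nothing. It documents that part of a
// statement was left empty on purpose.
#define NO_OP

// Returns the length of a null-terminated string.
std::size_t Length(const char* s)
{
   assert(s != nullptr);
   std::size_t n = 0;
   for(NO_OP; s[n] != '\0'; ++n) { }
   return n;
}
```

After preprocessing, the loop is simply for(; s[n] != '\0'; ++n), so a reader never has to work out what the macro emits.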

main()

In a large system, main can easily become swill that #includes the world. Robust C++: Initialization and Restarts describes how to initialize a system in a structured, layered manner.

Exceptions

For a number of reasons, exceptions should be used very sparingly:

When a function can throw an exception, its invokers must not only check what it returns but also decide whether to use try and catch. And if a function that is invoked transitively can also throw an exception that its invoker does not catch, the situation quickly gets unwieldy. Whoever termed exceptions "gotos from hell" probably worked on a system designed in this way.
Exceptions add overhead, both in the size of the executable and the time required to handle an exception.
A basic exception is useless for debugging purposes unless the problem can be reproduced after setting a breakpoint where the exception was thrown. It is far better—and the only way to capture debugging information in a live system—to derive from exception by implementing a subclass that captures the stack before it gets unwound. This will definitely make it expensive to throw an exception.

When an exception occurs, it should be treated as a bug that needs to be fixed. Robust C++: Safety Net describes how to turn POSIX signals and Windows Structured Exceptions into C++ exceptions so that a base Thread class can catch them and recover from things like SIGSEGV (a bad pointer) instead of allowing the application to exit.

Application functions should, for example, check arguments to avoid causing exceptions. Almost always, they should return a value indicating that they failed rather than throwing an exception. But when an error is so severe that the work being performed should be aborted, an exception is appropriate because the goal is to return to somewhere well down the function call stack and recover. A coding standard that prohibits the use of exceptions is foolishly ignoring their elegance in such situations.
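
The distinction between a routine failure and a severe one can be sketched as follows. A bad argument is reported through the return value, whereas a violated invariant throws an exception that is caught far down the stack, in a stand-in for a thread loop. All of the names are hypothetical, and the rule that items are never negative is an invented invariant for the sake of the example.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical: thrown only when the current work must be abandoned.
class AbortWorkException : public std::runtime_error
{
public:
   using std::runtime_error::runtime_error;
};

enum class Rc { Ok, BadIndex };

// Doubles the item at 'index'. A bad index is a routine failure, so it
// is reported through the return value. A negative item (an invented
// invariant violation) is severe enough to warrant an exception.
Rc Double(const std::vector<int>& items, std::size_t index, int& result)
{
   if(index >= items.size()) return Rc::BadIndex;
   if(items[index] < 0) throw AbortWorkException("negative item");
   result = 2 * items[index];
   return Rc::Ok;
}

enum class Outcome { Done, Rejected, Aborted };

// Stand-in for a thread loop: the one place that catches the exception,
// abandons the current unit of work, and carries on with the next one.
Outcome RunUnitOfWork(const std::vector<int>& items, std::size_t index, int& result)
{
   try
   {
      return (Double(items, index, result) == Rc::Ok) ?
         Outcome::Done : Outcome::Rejected;
   }
   catch(const AbortWorkException&)
   {
      // A real system would log the error here and clean up.
      return Outcome::Aborted;
   }
}
```

Only RunUnitOfWork contains a try block. Every function between it and the point of failure can ignore the exception, which is the elegance that a blanket prohibition on exceptions would give up.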

If a system wants to recover when the usual behavior is to abort, it will have to deal with exceptions or something similar. The question then arises, when may applications use them? And the answer is not never, but rarely.

noexcept

Tagging functions noexcept so the compiler can "optimize" the code is one of the more recent fetishes to defile the C++ world. I predict that noexcept will eventually go the way of the dodo, register, the original meaning of inline, and if a praiseworthy C++20 proposal succeeds, volatile.²

If we can catch SIGSEGV exceptions and prevent an abort, it is dangerous to tag any "trivial", non-virtual function as noexcept. The reason is that, if the this pointer passed to the function is bad, the function will cause an abort by using it. We're catching POSIX signals, so we can catch the resulting SIGABRT too, but whether the stack gets properly unwound (by deleting local objects) is left up to the compiler, so you can see the risk.

Some compilers insist that even non-trivial destructors be tagged noexcept. Although they are clueless, you will have to obey. But before voluntarily tagging anything noexcept, think carefully. The tool described in A Static Analysis Tool for C++ therefore suggests removing noexcept unless this would cause an error in a compliant compiler.

This section already mentioned why using noexcept on a non-virtual function is risky. When a virtual function is noexcept, its overrides must also be noexcept, which can be presumptuous. So in the end, noexcept should only be used when there is no choice, namely when overriding an external function that is tagged noexcept.
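
A familiar case of having no choice is std::exception::what, which the standard library declares noexcept. Any override must therefore be noexcept as well, as in this sketch (the class name AppException is hypothetical):

```cpp
#include <cassert>
#include <cstring>
#include <exception>

class AppException : public std::exception
{
public:
   // std::exception::what is declared noexcept, so this override must be
   // noexcept too. Omitting the specifier here is a compile-time error.
   const char* what() const noexcept override
   {
      return "application exception";
   }
};
```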

History

29th April, 2020: Added section on dynamic linking
6th March, 2020: Added section on symmetric multiprocessing
19th February, 2020: Minor changes
10th February, 2020: Initial version

Notes

¹ The title of this section is a tribute to the timeless article The Windows Sockets Lame List.

² C++20 deprecated some uses of volatile.

License

This article, along with any associated source code and files, is licensed under The GNU General Public License (GPLv3)

Written By

Greg Utas

Architect

United States

Author of Robust Services Core (GitHub) and Robust Communications Software (Wiley). Formerly Chief Software Architect of the core network servers that handle the calls in AT&T's wireless network.
