C++ is a large language—too large, some would argue. Because it's a superset of C, it's easy for developers with a C background to build a hybrid OO/non-OO system. C++ also kept the preprocessor, which is sometimes used in what can only be described as despicable ways. And rather than risk offending legacy systems, the C++ standards committee seems very reluctant to deprecate anything—but not at all reluctant to keep adding what seems like one pedantic feature after another, at least to those of us struggling to keep up.
As a result of all this, there are often many ways to do something in C++, and figuring out which way is best can be difficult. Without guidance, it can only be learned through torturous experience. It is therefore unsurprising that there are many books about C++ best practices, such as Scott Meyers' Effective C++. But it's easy to forget their recommendations when you're immersed in coding, especially when new to the language. Of course, some developers don't even bother to read such books, being of the "If it works, it's correct—so don't touch it!" school. Having a tool that could serve as an automated Scott Meyers code inspector would go a long way to addressing these issues.
When I started to develop the Robust Services Core (RSC), I had a reasonable knowledge of C++ but was far from proficient. The code grew very organically and was continually refactored. As I became more familiar with C++ and needed to revisit areas of the code that had lain dormant for a while, I kept finding things that I would now do differently. But there was always more code to develop and never enough time to do a tedious code inspection to find and "fix" all the things that could be improved.
Eventually I decided that, at the very least, it would be nice to clean up all the #include directives. Surely there was a publicly available tool for this. This was circa 2013, and the only thing I found was a Google initiative called "Include What You Use", which appeared to have been mothballed.1 I therefore decided to write such a tool as a diversion from the main focus of RSC.
Some diversion! It soon became apparent that fixing #include lists, to add the directives that should be there and remove those that shouldn't, meant writing a parser. And not just a parser, but something closer to a compiler, because it would also have to do name resolution and other things. Another option was to take an open-source C++ compiler and either modify it or extract the necessary information from files that it might produce.
Rather than give up, I decided to try writing the tool from scratch. It would be a learning experience, even if the attempt ultimately had to be abandoned. This article describes the current state of the code that emerged.
Using the Code
Not only does the code clean up #include directives, it also serves as an automated Scott Meyers code inspector that can implement some of its recommendations by suitably editing the source code. Its main drawback is that it only supports the subset of C++11 that RSC uses. Although this is a reasonable subset of the language, what's missing will hamper its usefulness to projects that use unsupported language features. Adding one of these missing language features can be anywhere from moderately easy to quite challenging. Nonetheless, feel free to request that a specific language feature be supported, or even volunteer to implement it! This will make the tool useful to a wider range of projects.
Unlike previous articles that I've written, this one focuses more on how to use the code, and not much on how it works. However, it will provide a high-level overview of the design as a roadmap for those who want to dig into the code.
Defining the Library
Before the tool can be used, the files that make up the code base must be defined. This can be done right after RSC starts by entering the command buildlib from the CLI. The ">" that precedes commands in this article is RSC's CLI prompt and is not entered; the article uses it to denote a CLI command. A dump of all CLI commands is available in help.cli2; scroll down to somewhere after line 1200, to "full", to see those in the ct directory, which is where the tool is implemented.

What this command does is execute the script buildlib, which contains a sequence of CLI commands. This results in the execution of the following commands, which are copied from the console transcript file that RSC generates, with commands not relevant to this article removed:
ct>import subs "subs"
ct>import nbase "nb"
ct>import ntool "nt"
ct>import ctool "ct"
ct>import nwork "nw"
ct>import sbase "sb"
ct>import stool "st"
ct>import mbase "mb"
ct>import cbase "cb"
ct>import pbase "pb"
ct>import onode "on"
ct>import cnode "cn"
ct>import rnode "rn"
ct>import snode "sn"
ct>import anode "an"
ct>import diplo "dip"
ct>import rsc "rsc"
The tool is in the ct directory, so the command >ct is used to access the CLI commands in that directory. The script lib.create is then read. It contains a series of >import commands that add, to the code library, all of the directories that are needed to compile the project (RSC, in this case). For example, the command
ct>import ctool "ct"
imports the code in the ct directory, which can subsequently be referred to as ctool in other CLI commands. The path to this directory is relative to the SourcePath configuration parameter. When RSC starts up, it obtains its configuration parameters from the file element.config. So to use the tools on your own code, you need to
- Modify element.config by setting its SourcePath entry to a directory that subtends all of your project's code files.
- Create a file similar to lib.create in the same directory as RSC's lib.create. Each of the >import commands in that file must specify a directory that is relative to your new setting.
- Copy the subs directory from RSC into your own project, just below your SourcePath directory, and include the >import command for "subs", as found in RSC's lib.create, in your version of lib.create.
- Modify the buildlib script to >read your version of lib.create.
Each >import command ends up creating a CodeDir instance for its directory and a CodeFile instance for each code file3 in that directory. There are currently two restrictions:
- Each file name must be unique (i.e. the same name cannot be used in more than one directory).
- All of the code files in a directory get imported (i.e. there is no way to exclude a code file).
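The import rules above can be illustrated with a small sketch. It is not RSC's implementation: the names IsCodeFile and FileIndex are hypothetical, and the extension list is simply copied from footnote 3. It only shows how a file might be classified as a code file and how the "unique file name" restriction might be enforced.

```cpp
#include <cstddef>
#include <map>
#include <set>
#include <string>

// Returns true if NAME looks like a code file: either it has no extension
// (e.g. <string>) or its extension is one of those listed in footnote 3.
bool IsCodeFile(const std::string& name)
{
   static const std::set<std::string> exts = {
      ".h", ".c", ".hpp", ".cpp", ".hxx", ".xxx", ".hh", ".cc", ".h++", ".c++" };

   const auto dot = name.rfind('.');
   if(dot == std::string::npos) return true;
   return exts.count(name.substr(dot)) != 0;
}

// Hypothetical stand-in for the code library: maps a file name to the
// directory that it was imported from.
class FileIndex
{
public:
   // Adds FILE, found in DIR. Fails if FILE is not a code file or if a
   // file with the same name was already imported from any directory.
   bool Add(const std::string& dir, const std::string& file)
   {
      if(!IsCodeFile(file)) return false;
      return files_.emplace(file, dir).second;
   }

   std::size_t Size() const { return files_.size(); }
private:
   std::map<std::string, std::string> files_;
};
```

In this sketch, adding Thread.h from a second directory is rejected, which mirrors the first restriction above.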
Parsing the Code
Once all of the source code directories have been imported, the entire code library can be parsed, which is a prerequisite to checking it with the static analysis tool. This is done with the command
>parse - win32 $files
- "-" specifies that no parser options are being used (the only options are ones that enable debug tools)
- win32 specifies that the target is 32-bit Windows (currently, only one other target is supported)
- $files is a built-in library variable that contains the set of all code files

If $files is replaced with ctool, meaning all the code files in the ct directory, the result (again taken from the console transcript file) looks like this:
ct>parse - win32 f ctool
// [many lines deleted]
As each file is parsed, its name is displayed. Template instantiations are indented (and indented further, when one template causes the instantiation of another).
The first RSC file to be parsed is FunctionGuard.h. The files that precede it are either from the standard library or Windows. However, they are not the actual instances of those files. Rather, they are taken from the subs directory, which contains simplified versions of them. These versions avoid the need to
- >import files that are external to the project from a wide range of directories
- #define all the names that would be needed to correctly navigate all the #ifdefs in external files
- support C++ language features used by external files but not by the project
- parse lots of things that the project doesn't use
Consequently, before you can >parse your own project, you must ensure that the subs directory contains a stand-in for each external header that your project #includes, and that each stand-in declares the items that you use from it. Note that in the case of templates, subs headers do not need to provide function definitions.
Performing a Code Inspection
Now that all of the code has been parsed, it can be checked for violations of design guidelines:
>check rsc $files
This produces the file rsc.check, which contains all of the warnings that were found. Basic documentation for each of the ~120 warnings that >check can produce can be seen in the file cppcheck.

If >check is run on a subset of the code, it will first >parse any unparsed code that would be needed in a successful build. This avoids false positives, such as warnings that a function is not defined or is unused.
Before merging into the master branch, I usually run >check on all of the code and use WinDiff to see if any new warnings have arisen since the last merge.
At present, the only way to suppress a warning is to modify the function
Because headers in the subs directory do not provide function implementations for templates, >check can erroneously recommend things such as
- removing an #include that is needed to make a destructor visible to a unique_ptr template instance
- declaring a data member const even though it is inserted in a set and must therefore allow
- removing most of the things in Allocators.h (which is only invoked from the STL, not from within RSC)
Applying the Recommendations
The >fix command is currently able to resolve about half of the warnings:
fix : Interactively fixes warnings detected by >check.
(0:121) : warning number from Wnnn (0 = all warnings)
(t|f) : prompt before fixing?
<str> : a set of code files
For example, the following modifies all code files by deleting unnecessary #include directives, which is warning 18:
>fix 18 f $files
To select which occurrences of a warning to fix, ask to be prompted. For example,
>fix 53 t $files
will prompt before fixing each occurrence of warning W053, "Data could be
Warning: Before using >fix, be sure that you can recover the original version of the file if something goes wrong. It works on RSC's code, but that doesn't mean it's been thoroughly tested!
Exporting the Library
After the code has been parsed, the >export command can generate any combination of the following files:
- A .lib file displays parsed code in a standard format and includes
  - the underlying type for each
  - the number of times each item was initialized, read, or written (for data) or called (for functions); and
  - the file in which each item was defined (for data and functions).
- A .trim file lists the external symbols used within each file, as well as the recommendations for which #include directives, using statements, and forward declarations the file should add or remove. Those recommendations also appear as warnings in the .check file.
- An .xref file contains a global cross-reference (each symbol, followed by a list of the files that use it).
Library Variables and Operators
Many of the CLI commands in the ct directory take an expression as their last parameter. This article only used $files, but an expression can contain both variables and operators. The user defines a variable with the >assign command, and the library also provides the following variables, which cannot be modified directly:
|Variable ||Contents |
|directories that have been added to the library by |
|all code files (headers and implementations) found in |
|headers in |
|implementations (.c*) in |
|headers that declare items which are external to the code base |
|headers that appear in an #include directive but whose directories were not added to the library by >import (which will cause >parse to fail) |
|all variables (those above, and any that the user has defined) |
An expression is evaluated left to right, but parentheses can be used to override this. A variable is a set of either directories or files. The following notation is used in the expressions that appear below:
|Set ||Contents |
|<ds> ||the name of a directory (as defined by >import) or a set of directories |
|<fs> ||the name of a specific file or a set of files |
|<s> ||a <ds> or an <fs> |
Here is a table of basic operators. The Result column is what the operator returns, which becomes the input to commands such as >list. The Expression column specifies the type of parameter(s) that the operator expects.
|Operator ||Result ||Expression ||Semantics |
| ||<s> ||<s1> ||set union of <s1> and <s2> (the '|' is optional) |
| ||<s> ||<s1> ||set intersection of <s1> and <s2> |
| ||<s> ||<s1> ||set difference between <s1> and <s2> |
| ||<fs> || ||the files in <ds> |
| ||<ds> || ||the directories in <fs> |
| ||<fs> ||<fs> ||files in <fs> with the file name <str>* |
| ||<fs> ||<fs> ||files in <fs> with the file type *.<str> |
| ||<fs> ||<fs> ||files in <fs> that contain <str> |
| ||<fs> ||<fs> ||files in <fs> whose directory is in <ds> |
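The union, intersection, and difference operators in the table above are ordinary set algebra. As a rough illustration only, here is how they could be written over sets of file names using the standard library. The names FileSet, Union, Intersection, and Difference are assumptions made for this sketch; they are not taken from RSC's SetOperations files.

```cpp
#include <algorithm>
#include <iterator>
#include <set>
#include <string>

// A set of file names; std::set keeps them sorted, which is what the
// standard set algorithms below require.
using FileSet = std::set<std::string>;

// Files that are in S1, in S2, or in both.
FileSet Union(const FileSet& s1, const FileSet& s2)
{
   FileSet result;
   std::set_union(s1.begin(), s1.end(), s2.begin(), s2.end(),
      std::inserter(result, result.end()));
   return result;
}

// Files that are in both S1 and S2.
FileSet Intersection(const FileSet& s1, const FileSet& s2)
{
   FileSet result;
   std::set_intersection(s1.begin(), s1.end(), s2.begin(), s2.end(),
      std::inserter(result, result.end()));
   return result;
}

// Files that are in S1 but not in S2.
FileSet Difference(const FileSet& s1, const FileSet& s2)
{
   FileSet result;
   std::set_difference(s1.begin(), s1.end(), s2.begin(), s2.end(),
      std::inserter(result, result.end()));
   return result;
}
```

Because each function returns a new set, calls can be nested in the same way that the CLI expressions combine operators.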
The following operators can also be used on a set of code files:
|Expression ||Operator Name ||Semantics |
| ||users ||files that #include any in <fs> |
| ||used by ||files that any in <fs> |
| ||affecters ||ub <fs>, transitively |
| ||affected by ||us <fs>, transitively |
| ||common affecters ||), where f1…fn are the files in <fs> |
After the code has been parsed, the following operators can also be used on a set of code files:
|Expression ||Operator Name ||Semantics |
| ||implements ||for each item declared (defined) in <fs>, add the file that defines (declares) it |
| ||needers ||files that also need <fs> in a build (im ab <fs>, transitively) |
| ||needed by ||files that <fs> also needs in a build (im as <fs>, transitively) |
These operators can help to analyze dependencies among code files. For example:
>import sbase "sb" // add SessionBase files to the library
>type us Thread.h // show all files that #include Thread.h
>assign h1 f sbase ft cpp // h1 = all SessionBase implementations
>assign c1 ab Thread.h // c1 = files that could be affected by changing Thread.h
>assign s1 h1 & c1 // s1 = SessionBase .cpps that could be affected by changing Thread.h
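To make the "transitively" in operators such as ab more concrete, here is a minimal sketch of how the set of files affected by a change could be computed from direct #include relationships. It is an assumption-laden illustration, not RSC's code: IncludeMap, Users, and AffectedBy are hypothetical names, and the sketch chooses to include the starting files in the result.

```cpp
#include <map>
#include <set>
#include <string>

using FileSet = std::set<std::string>;

// Maps each file to the set of files that it directly #includes.
using IncludeMap = std::map<std::string, FileSet>;

// Returns the files that directly #include any file in FS.
FileSet Users(const IncludeMap& incls, const FileSet& fs)
{
   FileSet users;

   for(const auto& entry : incls)
   {
      for(const auto& f : fs)
      {
         if(entry.second.count(f) != 0)
         {
            users.insert(entry.first);
            break;
         }
      }
   }

   return users;
}

// Returns FS plus every file that #includes a file in FS, either
// directly or through a chain of other files.
FileSet AffectedBy(const IncludeMap& incls, const FileSet& fs)
{
   FileSet result = fs;
   FileSet frontier = fs;

   while(!frontier.empty())
   {
      FileSet next;

      for(const auto& user : Users(incls, frontier))
      {
         // Only files seen for the first time need further expansion.
         if(result.insert(user).second) next.insert(user);
      }

      frontier = next;
   }

   return result;
}
```

The loop stops once an iteration adds no new files, so circular #include relationships do not cause it to run forever.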
What to #include
Interactions exist among the warnings for adding and removing #include directives, using statements, and forward declarations. CodeFile::Trim generates these warnings. Its basic rules are
- #include something if nothing guarantees that it will be visible transitively.
- Don't #include something that will definitely be visible transitively. It is necessary to #include a base class, as well as a class that is used directly. However, it is not necessary to #include their base classes, even when using something declared in one of those transitive base classes. Similarly, it is not necessary for a .cpp to #include anything that its header will #include.
- If a class is only used indirectly (i.e. as a pointer or reference type), don't #include it. Use a forward declaration instead. If there is no guarantee that one will be visible transitively, add one to this file.
- A header should not contain a using directive or declaration. It is therefore told to remove it, and any .cpp that relies on it is told to add it.
- If an #include, forward declaration, or using statement is not needed to resolve a symbol, remove it.
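The rules above can be seen in a small example. The classes here (Base, Worker, Logger) are invented for illustration, and the three "files" are collapsed into one listing so that it compiles on its own. Worker's header needs the full definition of its base class and of std::string, which it uses directly, but it only needs a forward declaration of Logger, which it uses through a pointer.

```cpp
#include <string>  // used directly (by value), so the header #includes it

// ---- what Worker.h would contain ------------------------------------

class Logger;  // used only as a pointer type: a forward declaration suffices

class Base  // would come from an #include: a base class must be fully visible
{
public:
   virtual ~Base() = default;
   virtual std::string Name() const { return "base"; }
};

class Worker : public Base
{
public:
   explicit Worker(Logger* log) : log_(log) { }
   std::string Name() const override { return "worker"; }
   void Report() const;  // defined in the .cpp, where Logger is complete
private:
   Logger* log_;
};

// ---- what Logger.h would contain ------------------------------------

class Logger
{
public:
   void Write(const std::string& text) { last_ = text; }
   const std::string& Last() const { return last_; }
private:
   std::string last_;
};

// ---- what Worker.cpp would contain (it #includes Logger.h) ----------

void Worker::Report() const
{
   // Calling a member function is a direct use, so Logger's definition
   // must be visible here, unlike in the header.
   if(log_ != nullptr) log_->Write(Name());
}
```

Under these rules, Worker.cpp would #include Logger.h but would not repeat the #include for std::string, because its own header already guarantees it.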
All of these warnings can be resolved by >fix, which will, for example, insert a forward declaration in the correct namespace and fully qualify symbols from another namespace when removing a using statement.
The parser is implemented using recursive descent, which makes its code easy to read and modify. The advent of unique_ptr was a godsend to these types of parsers, which were previously cursed by the need to delete objects when backing up. Placing each of these objects in a unique_ptr allows the parser to back up without writing any explicit deletion code.
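The benefit of unique_ptr when backing up can be shown with a toy recursive descent parser. This is only a sketch under invented assumptions: the two-rule grammar, the Node type, and the MiniParser class are hypothetical and far simpler than RSC's Parser. The point is the failure path in ParseCall, which restores the position and returns without deleting anything by hand.

```cpp
#include <cctype>
#include <cstddef>
#include <memory>
#include <string>
#include <utility>

// A parsed item: either a plain name or a call with an empty argument list.
struct Node
{
   enum Kind { Name, Call };
   Kind kind;
   std::string text;
   Node(Kind k, std::string t) : kind(k), text(std::move(t)) { }
};

class MiniParser
{
public:
   explicit MiniParser(std::string src) : src_(std::move(src)), pos_(0) { }

   // item := call | name
   std::unique_ptr<Node> ParseItem()
   {
      if(auto call = ParseCall()) return call;
      return ParseName();
   }
private:
   // name := letter+
   std::unique_ptr<Node> ParseName()
   {
      const auto start = pos_;
      while((pos_ < src_.size()) &&
         std::isalpha(static_cast<unsigned char>(src_[pos_]))) ++pos_;
      if(pos_ == start) return nullptr;
      return std::unique_ptr<Node>(
         new Node(Node::Name, src_.substr(start, pos_ - start)));
   }

   // call := name "()"
   std::unique_ptr<Node> ParseCall()
   {
      const auto start = pos_;
      auto name = ParseName();  // owns a Node if a name was found

      if((name != nullptr) && Match("()"))
      {
         name->kind = Node::Call;
         return name;
      }

      pos_ = start;    // back up to where this rule started
      return nullptr;  // NAME's Node, if any, is freed automatically
   }

   // Consumes S if it appears at the current position.
   bool Match(const std::string& s)
   {
      if(src_.compare(pos_, s.size(), s) != 0) return false;
      pos_ += s.size();
      return true;
   }

   std::string src_;
   std::size_t pos_;
};
```

When the input is just a name, ParseCall allocates a Node, fails to find the parentheses, and backs up; the Node is released as the unique_ptr goes out of scope, and ParseName then parses the same text again.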
The parser does not check everything in the same way that a full parser must. It assumes that the code correctly compiles and links, so it only contains enough checks to produce a correct parse. Its grammar, which is informally documented in the relevant functions, is also far simpler than a complete C++ grammar.
As each code file is read in, its #include relationships are noted. This allows a global compile order to be calculated. The only other preprocessing that occurs before parsing is to erase, within C++ code, any macro name that is defined as an empty string. Currently, the only such name is NO_OP, which RSC uses before a bare semicolon when a for statement is missing a parenthesized statement.
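One common way to derive a compile order from #include relationships is a depth-first topological sort, in which a file is emitted only after everything that it #includes. The sketch below illustrates that general technique; whether RSC calculates its order this way is not stated here, and the names IncludeMap, Visit, and CompileOrder are assumptions.

```cpp
#include <map>
#include <set>
#include <string>
#include <vector>

// Maps each file to the set of files that it directly #includes.
using IncludeMap = std::map<std::string, std::set<std::string>>;

namespace
{
// Emits everything that FILE #includes, and then FILE itself.
void Visit(const IncludeMap& incls, const std::string& file,
   std::set<std::string>& visited, std::vector<std::string>& order)
{
   // A file is handled once. This also prevents endless recursion if
   // two files #include each other.
   if(!visited.insert(file).second) return;

   const auto it = incls.find(file);

   if(it != incls.end())
   {
      for(const auto& inc : it->second) Visit(incls, inc, visited, order);
   }

   order.push_back(file);
}
}

// Returns the files in an order where each file appears after the files
// that it #includes.
std::vector<std::string> CompileOrder(const IncludeMap& incls)
{
   std::set<std::string> visited;
   std::vector<std::string> order;

   for(const auto& entry : incls)
   {
      Visit(incls, entry.first, visited, order);
   }

   return order;
}
```

With this ordering, a single pass over the files never encounters a symbol whose declaring header has not yet been processed, provided that the #include lists are complete.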
Once this simple preprocessing is complete, all of the code is parsed together, in a single pass. After an item is parsed, it is added to the scope (namespace, class, function, or code block) in which it appears, and its EnterScope function is invoked. After each function is parsed, it is "executed" by invoking its EnterBlock function. An item's EnterBlock function also invokes the same function on each of its constituent parts.
Some of the warnings generated by >check are detected during >import, some are detected during >parse, and some are detected during >check itself.

CodeFile::Trim, mentioned in the previous section, uses GetUsages to obtain, from all of its file's C++ entities, the symbols that are used (a) as base classes, (b) directly, and (c) indirectly, as well as those that were resolved by (d) forward declarations, (e) friend declarations, and (f) using statements.
The time required to >check all of RSC's code is similar to the time required for a complete build using Microsoft's C++ compiler. This isn't a true apples-to-apples comparison because >parse doesn't lay out memory or generate object code, but the compiler's time is also that of a debug build, not a release build. And >parse doesn't use more than one core, whereas Microsoft's compiler uses two when possible (at least on my quad core).
RSC contains about 212K lines of source, but half of that is blanks, comments, and left braces. When RSC starts up with its default configuration file under Win32, it grows to about 58MB of memory, which could be significantly reduced by changing various configuration parameters. After executing >export, it has grown by about another 285MB.
The tools don't generate any intermediate or scratch files; everything is kept in memory. Using files would be a significant change, so the amount of available memory ultimately limits the size of the code library that the tools can accommodate. But my guess is that anyone with that much code could also provide a machine with enough memory—or simply purchase a commercial equivalent of the tool.
List of Code Files
The ct directory contains all of the code. If you want to dive into it, here's a summary of the files in that directory:
|File (.h & .cpp) ||Description |
|CodeCoverage ||code coverage tool (not discussed in this article) |
|CodeDir ||a directory that contains source code |
|CodeDirSet ||a set of code directories |
|CodeFile ||a file that contains source code |
|CodeFileSet ||a set of code files |
|CodeIncrement ||CLI commands applicable to the ct directory |
|CodeSet ||base class for CodeDirSet and CodeFileSet |
|CodeTypes ||types for parsing and static analysis |
|CtModule ||initialization of ct directory |
|Cxx ||types for C++ |
|CxxArea ||namespaces, classes, and class template instances |
|CxxCharLiteral ||character literals |
|CxxDirective ||preprocessor directives |
|CxxExecute ||for tracking code parsing and "execution" |
|CxxFwd ||forward declarations |
|CxxNamed ||low-level named C++ items |
|CxxRoot ||global namespace and built-in terminals |
|CxxScope ||code blocks, data items, and functions |
|CxxScoped ||arguments, base classes, enums, enumerators, forwards, friends, terminals, typedefs, usings |
|CxxStatement ||statements used in functions |
|CxxStrLiteral ||string literals |
|CxxString ||string utilities |
|CxxSymbols ||parser symbol tables |
|CxxToken ||low-level unnamed C++ items |
|Editor ||source code editor for |
|Interpreter ||interprets expressions (in CLI commands) that manipulate instances of LibrarySet subclasses |
|Lexer ||lexical analysis utilities |
|Library ||code files, code directories, and CLI symbols |
|LibraryErrSet ||invoked when a CLI command does not apply to a set |
|LibraryItem ||base class for CodeDir, CodeFile, and LibrarySet |
|LibrarySet ||base class for CodeSet, LibraryErrSet, and LibraryVarSet (sets of items to which CLI commands can be applied) |
|LibraryTypes ||types for code library |
|LibraryVarSet ||built-in or user-defined library variables |
|Parser ||parser for C++ source code |
|SetOperations ||difference, intersection, and union operators for instances of LibrarySet |
1 While preparing this article, I checked to see if anything had changed. Google's project eventually gained traction and is now on GitHub. They took the approach of building on Clang, and they say that they're currently "alpha" quality and that changes to Clang sometimes break them.
2 The file help.cli has a .txt extension, which is omitted from file names in this article. These files are attached in a separate .zip file because article submission software currently replaces them with links that do not work.
3 A code file is assumed to be any file with no extension (e.g. <string>) or a .h, .c, .hpp, .cpp, .hxx, .xxx, .hh, .cc, .h++, or .c++ extension. This list is hard-coded.
- 4th October, 2019: Initial version