This article is a user guide to a static analysis tool for C++ code. Among other things, the tool can clean up #include lists, highlight violations of C++ best practices, and analyze dependencies within the code base. It can also implement some of its suggestions by editing the code. The article also provides a high-level overview of the tool's implementation.
C++ is a large language—too large, some would argue. Because it's a superset of C, it's easy for developers with a C background to build a hybrid OO/non-OO system. C++ also kept the preprocessor, which is sometimes used in what can only be described as despicable ways. And rather than risk offending legacy systems, the C++ standards committee seems very reluctant to deprecate anything—but not at all reluctant to keep adding what seems like one pedantic feature after another, at least to those of us struggling to keep up.
As a result of all this, there are often many ways to do something in C++, and figuring out which way is best can be difficult. Without guidance, it can only be learned through torturous experience. It is therefore unsurprising that there are many books about C++ best practices, such as Scott Meyers' Effective C++. But it's easy to forget their recommendations when you're immersed in coding, especially when new to the language. Of course, some developers don't even bother to read such books, being of the "If it works, it's correct—so don't touch it!" school. Having a tool that could serve as an automated Scott Meyers code inspector would go a long way to addressing these issues.
When I started to develop the Robust Services Core (RSC), I had a reasonable knowledge of C++ but was far from proficient. The code grew very organically and was continually refactored. As I became more familiar with C++ and needed to revisit areas of the code that had lain dormant for a while, I kept finding things that I would now do differently. But there was always more code to develop and never enough time to do a tedious code inspection to find and "fix" all the things that could be improved.
Eventually I decided that, at the very least, it would be nice to clean up all the #include directives. Surely there was a publicly available tool for this. This was circa 2013, and the only thing I found was a Google initiative called "Include What You Use", which appeared to have been mothballed.1 I therefore decided to write such a tool as a diversion from the main focus of RSC.
Some diversion! It soon became apparent that fixing #include lists, to add the directives that should be there and remove those that shouldn't, meant writing a parser. And not just a parser, but something closer to a compiler, because it would also have to do name resolution and other things. Another option was to take an open-source C++ compiler and either modify it or extract the necessary information from files that it might produce.
Rather than give up, I decided to try writing the tool from scratch. It would be a learning experience, even if the attempt ultimately had to be abandoned. This article describes the current state of the code that emerged.
Using the Code
Not only does the code clean up #include directives, it serves as an automated Scott Meyers code inspector that can implement some of its recommendations by suitably editing the source code. Its main drawback is that it only supports the subset of C++11 that RSC uses. Although this is a reasonable subset of the language, what's missing will hamper its usefulness to projects that use unsupported language features. Adding one of these missing language features can be anywhere from moderately easy to quite challenging. Nonetheless, feel free to request that a specific language feature be supported, or even volunteer to implement it! This will make the tool useful to a wider range of projects.
Unlike previous articles that I've written, this one focuses more on how to use the code, and not much on how it works. However, it will provide a high-level overview of the design as a roadmap for those who want to dig into the code.
Defining the Library
Before the tool can be used, the files that make up the code base must be defined. This can be done right after RSC starts by entering the command >read buildlib from the CLI. That ">" is RSC's CLI prompt and is not entered, but this article uses it to denote a CLI command. A dump of all CLI commands is available in help.cli2; scroll down to somewhere around line 1246, to "full", to see those in the ct directory, which is where the tool is implemented.
What >read buildlib does is execute the script buildlib, which contains a sequence of CLI commands. This results in the execution of the following commands, which are copied from the console transcript file that RSC generates, with commands not relevant to this article removed:
ct>import subs subs
ct>import nbase nb
ct>import ntool nt
ct>import ctool ct
ct>import nwork nw
ct>import sbase sb
ct>import stool st
ct>import mbase mb
ct>import cbase cb
ct>import pbase pb
ct>import onode on
ct>import cnode cn
ct>import rnode rn
ct>import snode sn
ct>import anode an
ct>import diplo dip
ct>import rsc rsc
The tool is in the ct directory, so the command >ct is used to access the CLI commands in that directory. The script lib.create is then read. It contains a series of >import commands that add, to the code library, all of the directories that are needed to compile the project (RSC, in this case). For example, the command
ct>import ctool ct
imports the code in the ct directory, which can subsequently be referred to as ctool in other CLI commands. The path to this directory is relative to the SourcePath configuration parameter. When RSC starts up, it obtains its configuration parameters from the file element.config. So to use the tools on your own code, you need to
- Modify element.config by setting its SourcePath entry to a directory that contains all of your project's code files.
- Create a file similar to lib.create in the same directory as RSC's lib.create. Each of the >import commands in that file must specify a directory that is relative to your new setting for SourcePath.
- Copy the subs directory from RSC into your own project, just below your SourcePath directory, and include the >import command for "subs", as found in RSC's lib.create, in your version of lib.create.
- Modify the buildlib script to >read your version of lib.create.
Each >import command ends up creating a CodeDir instance for its directory and a CodeFile instance for each code file3 in that directory. There are currently two restrictions:
- Each file name must be unique (i.e., the same name cannot be used in more than one directory).
- All of the code files in a directory get imported (i.e., there is no way to exclude a code file).
Parsing the Code
Once all of the source code directories have been imported, the entire code library can be parsed, which is a prerequisite to checking it with the static analysis tool. This is done with the command
>parse - win32 $files
where
- "-" specifies that no parser options are being used (the only options are ones that enable debug tools)
- win32 specifies that the target is 32-bit Windows (one other target is currently supported)
- $files is a built-in library variable that contains the set of all code files
If $files is replaced with the files in ctool, meaning all the code files in the ct directory, the result (again taken from the console transcript file) looks like this:
ct>parse - win32 f ctool
// [many lines deleted]
As each file is parsed, its name is displayed. Template instantiations are indented (and indented further, when one template causes the instantiation of another).
The first RSC file to be parsed is FunctionGuard.h. The files that precede it are either from the standard library or Windows. However, they are not the actual instances of those files. Rather, they are taken from the subs directory, which contains simplified versions of them. These versions avoid the need to
- >import files that are external to the project from a wide range of directories
- #define all the names that would be needed to correctly navigate all the #ifdefs in external files
- support C++ language features used by external files but not by the project
- parse lots of things that the project doesn't use
Consequently, before you can >parse your own project, you must ensure that the subs directory contains a stand-in for each external header that your project #includes, and that each stand-in declares the items that you use from it. Note that in the case of templates, subs headers do not need to provide function definitions.
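To give a feel for what such a stand-in might contain, here is a rough sketch of a reduced header for a project that only uses unique_ptr from <memory>. It is not copied from RSC's subs directory: the namespace name subs_std is an assumption made so that the sketch compiles on its own (a real stand-in would presumably declare its items in std), and the list of members is just a guess at what a small project might use. Note that the member functions are declared but not defined.

```cpp
// Hypothetical stand-in for <memory>, reduced to one class template.
// It lives in subs_std (an invented name) instead of std so that this
// sketch does not collide with the real standard library.
namespace subs_std
{
   template<typename T> class unique_ptr
   {
   public:
      // Declarations only: a stand-in does not need function definitions.
      unique_ptr() noexcept;
      explicit unique_ptr(T* p) noexcept;
      unique_ptr(unique_ptr&& that) noexcept;
      ~unique_ptr();
      unique_ptr& operator=(unique_ptr&& that) noexcept;
      T* get() const noexcept;
      T* release() noexcept;
      void reset(T* p = nullptr) noexcept;
      T& operator*() const;
      T* operator->() const noexcept;
      explicit operator bool() const noexcept;
   };
}
```

Anything that the project never uses, such as custom deleters or the array specialization, can simply be left out.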
Performing a Code Inspection
Now that all of the code has been parsed, it can be checked for violations of design guidelines:
>check rsc $files
This produces the file rsc.check, which contains all of the warnings that were found. Basic documentation for each of the 130 or so warnings that >check can produce can be seen in the file cppcheck.
If >check is run on a subset of the code, it will first >parse any unparsed code that would be needed in a successful build. This avoids false positives, such as warnings that a function is not defined or is unused.
Before merging into the master branch, I usually run >check on all of the code and use the diff tool in VS2017's GitHub plug-in to see if any new warnings have arisen since the last merge.
At present, the only way to suppress a warning is to modify the function
Because headers in the subs directory do not provide function implementations for templates, >check can erroneously recommend things such as
- removing an #include that is needed to make a destructor visible to a unique_ptr template instance
- declaring a data member const even though it is inserted in a set and must therefore allow
- removing most of the things in Allocators.h (which is only invoked from the STL, not from within RSC)
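The first of these false positives can be pictured with a small sketch. The classes Owner and Widget are hypothetical and are squeezed into one file here; in a real project, the second half would be Owner.cpp, and the definition of Widget would arrive through an #include of Widget.h. Owner's member functions never name a Widget member, so a tool that cannot see inside unique_ptr's destructor has no way to notice that Widget must be a complete type at that point.

```cpp
#include <cassert>
#include <memory>
#include <utility>

class Widget;  // a forward declaration suffices for the member below

class Owner
{
public:
   explicit Owner(std::unique_ptr<Widget> w);
   ~Owner();
private:
   std::unique_ptr<Widget> widget_;
};

// --- what follows stands in for Owner.cpp and its #include of Widget.h ---

class Widget
{
public:
   ~Widget() { ++destroyed_; }
   static int destroyed_;  // counts destructor calls for the demo below
};

int Widget::destroyed_ = 0;

Owner::Owner(std::unique_ptr<Widget> w) : widget_(std::move(w)) { }

// Instantiates unique_ptr<Widget>'s destructor, which deletes the Widget.
// Widget must be complete here even though this code never mentions it.
Owner::~Owner() { }

// Creates and destroys one Owner, then reports how many Widgets have died.
int DestroyOneOwner()
{
   { Owner owner(std::unique_ptr<Widget>(new Widget)); }
   return Widget::destroyed_;
}
```

If the definition of Widget were removed from the second half, a typical compiler would reject the deletion of an incomplete type, which is why the recommendation to remove the #include is wrong in this situation.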
Applying the Recommendations
The >fix command is currently able to resolve about half of the warnings:
fix : Interactively fixes warnings detected by >check.
(0:133) : warning number from Wnnn (0 = all warnings)
(t|f) : prompt before fixing?
<str> : a set of code files
For example, the following modifies all code files by deleting unnecessary #include directives, which is warning W018:
>fix 18 f $files
To select which occurrences of a warning to fix, ask to be prompted. For example,
>fix 53 t $files
will prompt before fixing each occurrence of warning W053, "Data could be
Warning: Before using >fix, be sure that you can recover the original version of the file if something goes wrong. It works on RSC's code, but that doesn't mean it's been thoroughly tested!
Exporting the Library
After the code has been parsed, the >export command can generate any combination of the following files:
- A .lib file displays parsed code in a standard format and includes
  - the underlying type for each
  - the number of times each item was initialized, read, or written (for data) or called (for functions); and
  - the file in which each item was defined (for data and functions).
- A .trim file lists the external symbols used within each file, as well as the recommendations for which #include directives, using statements, and forward declarations the file should add or remove. Those recommendations also appear as warnings in the .check file.
- An .xref file contains a global cross-reference (each symbol, followed by a list of the files that use it, along with the line numbers where the symbol appears).
Analyzing Code Dependencies
Many of the CLI commands in the ct directory take an expression as their last parameter. So far, we've only mentioned $files, but an expression can contain both variables and operators. The user defines a variable with the >assign command, and the library also provides the following variables, which cannot be modified directly:
|Variable ||Contents |
|directories that have been added to the library by >import |
|all code files (headers and implementations) found in |
|headers in |
|implementations (.c*) in |
|headers that declare items which are external to the code base |
|headers that appear in an #include directive but whose directories were not added to the library by >import (which will cause >parse to fail) |
|all variables (those above, and any that the user has defined) |
An expression is evaluated left to right, but parentheses can be used to override this. A variable is a set of either directories or files. The following notation is used in the expressions that appear below:
|Set ||Contents |
|D ||the name of a directory (as defined by >import) or a set of directories |
|F ||the name of a specific file or a set of files |
|C ||the name of a specific C++ code item or a set of such items |
|S ||any of the above (D, F, or C) |
Here are the operators that can be used as soon as >import commands have built the library. The Expression column specifies the type of parameter(s) that the operator expects. The Result column is what the operator returns, which can be used as the input to other operators or commands such as >list.
|Operator Name ||Expression ||Result ||Semantics |
|union ||S1 | S2 ||S ||set union of S1 and S2 (the '|' is optional) |
|intersection ||S1 & S2 ||S ||set intersection of S1 and S2 |
|difference ||S1 ||S ||set difference between S1 and S2 |
|files ||f S ||F ||the files in S |
|directories || ||D ||the directories in S |
|filename ||F ||F ||files in F with the file name <str>* |
|filetype ||F ||F ||files in F with the file type *.<str> |
|matches ||F ||F ||files in F whose name partially matches <str> |
|in ||F ||F ||files in F whose directory is in D |
|users ||us F ||F ||files that #include any in F |
|used by ||ub F ||F ||files that any in F #include |
|affecters ||as F ||F ||ub F, transitively |
|affected by ||ab F ||F ||us F, transitively |
|common affecters || ||F ||), where f1…fn are the files in F |
After the >parse command has run, additional operators become available on the compiled code:
|Operator Name ||Expression ||Result ||Semantics |
|implements || ||F ||for each item declared (defined) in F, add the file that defines (declares) it |
|needers || ||F ||files that also need F in a build (ab F, transitively) |
|needed by || ||F ||files that F also needs in a build (as F, transitively) |
|declared by ||db S ||C ||code items declared within S |
|declarers ||ds C ||C ||code items that declare the items in C |
|definitions || ||C ||distinct definitions of the items in C |
|referenced by ||rb S ||C ||code items referenced within S |
|referencers ||rs C ||C ||code items that reference those in C |
The main purpose of these operators is to analyze dependencies among code files and C++ items. Here are some simple examples to serve as an introduction.
In the first image, the first command lists the users of Thread.h. If you searched all of RSC's code files for "Thread.h", these are the files that you would find. Next, the .cpp files in the sbase directory are assigned to the variable sbim, and the files that could be affected by changing Thread.h are assigned to thrab. The intersection of sbim and thrab is assigned to sbthr and the result is displayed. These are the .cpp files in the sbase directory that could be affected by changing Thread.h. Finally, the files that implement Duration.h are listed. What's Thread.cpp doing in there?! Well, if you looked at the code, you would find that a number of constants declared in Duration.h, and used even before main is entered, are initialized at the bottom of Thread.cpp to avoid the "static initialization order fiasco" for which C++ is so infamous.
We continue by assigning Thread to thr. But Thread can refer to many things: the Thread class, its constructor, or one of its forward declarations. So we have to indicate that we want the constructor. If we had entered Thread::Thread instead, it would have been unambiguous. When we list thr, we see that it's a function in Thread.h. If we list its definition, we see that its implementation begins on line 1062 of Thread.cpp. Finally, listing the items referenced by thr reveals a forward declaration of Daemon and the enumerator Faction. These types are used to specify the parameters to the constructor.
If we list the items referenced by the definition of thr—that is, by the constructor's implementation—the output fills the screen:
Finally, listing the references to thr reveals the constructors that make a base class constructor call to it.
The operators that return sets of C++ code items, such as ds and rs, are recent additions to the static analysis tool. They allow dependencies between C++ code items, not just files, to be analyzed. This can assist the architect who wants to layer a monolithic code base by partitioning it into libraries so that the software not required by a given product can be excluded from its build. Although these operators return sets of C++ code items, the files where those items reside can easily be found by prefixing the f operator, and the db and rb operators can also be used on files or even directories. For example:
>list f ds C displays the files that declare the items in C
>list f rs C displays the files that reference the items in C
>list db F1 & rb F2 displays the items declared in files in F1 and referenced by files in F2
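The transitive file operators in the tables above can be pictured as a walk over the #include graph. The following is only a sketch of that idea and not RSC's implementation: the IncludeGraph type, the function names, and the sample file names are all invented for illustration, and the sketch does not try to decide whether a file should count as being affected by itself.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>

using FileSet = std::set<std::string>;

// Hypothetical model: maps each file to the files that it #includes directly.
using IncludeGraph = std::map<std::string, FileSet>;

// "users": the files that directly #include any file in TARGETS.
FileSet Users(const IncludeGraph& graph, const FileSet& targets)
{
   FileSet users;
   for(const auto& entry : graph)
      for(const auto& incl : entry.second)
         if(targets.count(incl) != 0) users.insert(entry.first);
   return users;
}

// "affected by": the users of TARGETS, applied transitively until no
// new files turn up.
FileSet AffectedBy(const IncludeGraph& graph, const FileSet& targets)
{
   FileSet result;
   FileSet frontier = targets;
   while(!frontier.empty())
   {
      FileSet next;
      for(const auto& file : Users(graph, frontier))
         if(result.insert(file).second) next.insert(file);
      frontier = next;
   }
   return result;
}

// A made-up library: b.h includes a.h, and two .cpp files include b.h.
IncludeGraph SampleGraph()
{
   IncludeGraph graph;
   graph["a.h"] = { };
   graph["b.h"] = { "a.h" };
   graph["b.cpp"] = { "b.h" };
   graph["c.cpp"] = { "b.h" };
   graph["d.cpp"] = { };
   return graph;
}
```

In this made-up library, only b.h is a user of a.h, but b.cpp and c.cpp are also affected by it because they #include b.h. Intersecting such a result with another set of files is what the & operator does in the examples above.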
What to #include
Interactions exist among the warnings for adding and removing #include directives, using statements, and forward declarations. CodeFile::Trim generates these warnings. Its basic rules are
- #include something if nothing guarantees that it will be visible transitively.
- Don't #include something that will definitely be visible transitively. It is necessary to #include a base class, as well as a class that is used directly. However, it is not necessary to #include their base classes, even when using something declared in one of those transitive base classes. Similarly, it is not necessary for a .cpp to #include anything that its header will #include.
- If a class is only used indirectly (i.e., as a pointer or reference type), don't #include it. Use a forward declaration instead. If there is no guarantee that one will be visible transitively, add one to this file.
- A header should not contain a using directive or declaration. It is therefore told to remove it, and any .cpp that relies on it is told to add it.
- If an #include, forward declaration, or using statement is not needed to resolve a symbol, remove it.
All of these warnings can be resolved by >fix, which will, for example, insert a forward declaration in the correct namespace and fully qualify symbols from another namespace when removing a using statement.
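As a small illustration of the difference between direct and indirect use, consider the following sketch of a header. The classes Shape, Canvas, and Circle are hypothetical. Shape is defined inline only so that the sketch compiles as a single file; in a real header it would come from an #include of Shape.h, which the rules require because Shape is a base class. Canvas is used only as a pointer type, so a forward declaration is all that is needed.

```cpp
#include <cassert>

// Stands in for #include "Shape.h": a base class must be fully visible.
class Shape
{
public:
   virtual ~Shape() = default;
   virtual double Area() const = 0;
};

// Canvas is only used indirectly (as a pointer), so it is forward
// declared instead of being #included.
class Canvas;

class Circle : public Shape
{
public:
   explicit Circle(double radius) : radius_(radius), canvas_(nullptr) { }
   double Area() const override { return 3.14159 * radius_ * radius_; }
   void Attach(Canvas* canvas) { canvas_ = canvas; }
   bool Attached() const { return canvas_ != nullptr; }
private:
   double radius_;
   Canvas* canvas_;  // indirect use: no member of Canvas is accessed
};
```

A .cpp that calls a member function of Canvas through that pointer uses Canvas directly, so by the rules above it would have to #include the header that defines Canvas.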
The parser is implemented using recursive descent, which makes its code easy to read and modify. The advent of unique_ptr was a godsend to these types of parsers, which were previously cursed by the need to delete objects when backing up. Placing each of these objects in a unique_ptr allows the parser to back up without having to write any code to delete them.
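The benefit of unique_ptr when backing up can be seen in a toy recursive descent parser. This sketch has nothing to do with RSC's actual Parser class: the grammar (names and nested calls such as f(a,g(b))), the Node type, and all function names are invented for illustration. The point is the Backup function, which only has to restore the position. Any nodes built during the failed attempt are deleted when their unique_ptr goes out of scope.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <memory>
#include <string>
#include <utility>
#include <vector>

struct Node
{
   std::string name;
   std::vector<std::unique_ptr<Node>> args;
};

class TinyParser
{
public:
   explicit TinyParser(const std::string& text) : text_(text), pos_(0) { }

   // call := name '(' [ expr { ',' expr } ] ')'
   std::unique_ptr<Node> ParseCall()
   {
      auto start = pos_;
      auto node = ParseName();
      if(node && Match('('))
      {
         if(Match(')')) return node;
         do
         {
            auto arg = ParseExpr();
            if(!arg) return Backup(start);
            node->args.push_back(std::move(arg));
         } while(Match(','));
         if(Match(')')) return node;
      }
      return Backup(start);  // node and its args are freed automatically
   }

   // expr := call | name
   std::unique_ptr<Node> ParseExpr()
   {
      auto node = ParseCall();
      if(node) return node;
      return ParseName();
   }

   // name := one or more letters
   std::unique_ptr<Node> ParseName()
   {
      auto start = pos_;
      while(pos_ < text_.size() &&
         std::isalpha(static_cast<unsigned char>(text_[pos_]))) ++pos_;
      if(pos_ == start) return nullptr;
      std::unique_ptr<Node> node(new Node);
      node->name = text_.substr(start, pos_ - start);
      return node;
   }

   std::size_t Pos() const { return pos_; }
private:
   // Consumes C if it is the next character.
   bool Match(char c)
   {
      if(pos_ < text_.size() && text_[pos_] == c) { ++pos_; return true; }
      return false;
   }

   // Restores the parse position.  No cleanup code is needed here.
   std::unique_ptr<Node> Backup(std::size_t start)
   {
      pos_ = start;
      return nullptr;
   }

   std::string text_;
   std::size_t pos_;
};
```

With raw pointers, every early return in ParseCall would need code to delete the partially built node and its arguments before backing up.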
The parser does not check everything in the same way that a full parser must. It assumes that the code correctly compiles and links, so it only contains enough checks to produce a correct parse. Its grammar, which is informally documented in the relevant functions, is also far simpler than a complete C++ grammar.
As each code file is read in during >import, its #include relationships are noted. This allows a global compile order to be calculated. The only other preprocessing that occurs before parsing is to erase, within C++ code, any macro name that is defined as an empty string. Currently, the only such name is NO_OP, which RSC uses before a bare semicolon when a for statement is missing a parenthesized statement.
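One common way to derive a compile order from #include relationships is a depth-first topological sort, in which a file is emitted only after everything that it #includes. Whether RSC calculates its order this way is not stated here, so treat the following as a sketch of the general technique. The IncludeGraph type, the function names, and the sample file names are invented for illustration.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical model: maps each file to the files that it #includes directly.
using IncludeGraph = std::map<std::string, std::vector<std::string>>;

// Appends FILE to ORDER after first appending everything it #includes.
static void Visit(const IncludeGraph& graph, const std::string& file,
   std::set<std::string>& done, std::vector<std::string>& order)
{
   // Skip files already visited.  This also stops a circular #include
   // from causing infinite recursion.
   if(!done.insert(file).second) return;

   auto it = graph.find(file);
   if(it != graph.end())
      for(const auto& incl : it->second) Visit(graph, incl, done, order);

   order.push_back(file);
}

// Returns the files in an order where each one follows what it #includes.
std::vector<std::string> CompileOrder(const IncludeGraph& graph)
{
   std::set<std::string> done;
   std::vector<std::string> order;
   for(const auto& entry : graph) Visit(graph, entry.first, done, order);
   return order;
}

// A made-up library with three files.
std::vector<std::string> SampleOrder()
{
   IncludeGraph graph;
   graph["app.cpp"] = { "app.h", "base.h" };
   graph["app.h"] = { "base.h" };
   graph["base.h"] = { };
   return CompileOrder(graph);
}
```

For the made-up library, base.h comes first, then app.h, and finally app.cpp, so each file's declarations are available before the files that depend on them.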
Once this simple preprocessing is complete, all of the code is parsed together, in a single pass. After an item is parsed, it is added to the scope (namespace, class, function, or code block) in which it appears, and its virtual EnterScope function is invoked. It is then compiled by invoking its virtual EnterBlock function. An item's EnterBlock function also invokes the same function on each of its constituent parts.
Some of the warnings generated by >check are detected during >import, some are detected during >parse, and some are detected during >check itself, through a virtual function.
CodeFile::Trim, mentioned in the previous section, uses the virtual function GetUsages to obtain, from all of its file's C++ items, the symbols that are used (a) as base classes, (b) directly, and (c) indirectly; those that were resolved by (d) forward declarations, (e) friend declarations, and (f) using statements; and those that were (g) inherited.
References are tracked by virtual functions such as AddReference. This allows >export to create its cross-reference. The C++ items declared by another are obtained from the virtual function GetDecls, which supports the db operator.
My system, a Dell XPS15 with a 2.6 GHz Intel i7, takes 3½ minutes to >export all of RSC's code. That's using RSC's release build, which disables various optimizations so that it can be debugged (it's about 3½ times as fast as a debug build, but only half as fast as a fully optimized release build). As a comparison, Microsoft's C++ compiler takes just under 7 minutes to build RSC. Of course, this isn't a true apples-to-apples comparison because >parse doesn't lay out memory, emit object code, or generate files. But it does gather information that a regular compiler doesn't, and that 3½ minutes also includes the time needed to inspect the code (>check) and generate several large files (>export).
RSC consists of about 800 source code files and 92K lines of code, excluding those that are blank or that only contain braces or comments. When its release build initializes using its default configuration file under Win32, it uses 61MB, which could be significantly reduced by changing various configuration parameters. After executing >export, it has grown by another 195MB.
The tools don't generate any intermediate or scratch files; everything is kept in memory. Using files would be a significant change, so the amount of available memory ultimately limits the size of the code library that the tools can accommodate. But my guess is that anyone with that much code could also provide a machine with enough memory—or simply purchase a commercial equivalent of the tool.
List of Code Files
The ct directory contains all of the code. If you want to dive into it, here's a summary of the files in that directory:
|File ||Description |
|code coverage tool (not discussed in this article) |
|a directory that contains source code |
|a set of code directories |
|a file that contains source code |
|a set of code files |
|a set of C++ code items |
|base class for CodeDirSet, CodeFileSet, and CodeItemSet |
|types for parsing and static analysis |
|CLI commands applicable to the ct directory |
|initialization of ct directory |
|types for C++ |
|namespaces, classes, and class template instances |
|character literals |
|preprocessor directives |
|tracks code during compilation |
|forward declarations |
|tracks an item's location in its source file |
|low-level named C++ items |
|global namespace and built-in terminals |
|code blocks, data items, and functions |
|arguments, base classes, enums, enumerators, forwards, friends, terminals, typedefs, usings |
|statements used in functions |
|string literals |
|string utilities |
|parser symbol tables |
|low-level unnamed C++ items |
|source code editor for |
|interprets expressions (in CLI commands) that manipulate instances of LibrarySet subclasses |
|lexical analysis for Parser and Editor |
|code files, code directories, and CLI symbols |
|generates an error message when a CLI command does not apply to a set |
|base class for CodeDir, CodeFile, LibrarySet, and CxxToken |
|base class for CodeSet, LibraryErrSet, and LibraryVarSet (sets of items to which CLI commands can be applied) |
|types for code library |
|built-in or user-defined library variables |
|parser for C++ source code |
|difference, intersection, and union operators for instances of LibrarySet |
1 While preparing this article, I checked to see if anything had changed. Google's project eventually gained traction and is now on GitHub. They took the approach of building on Clang, and they say that they're currently "alpha" quality and that changes to Clang sometimes break them.
2 The file help.cli has a .txt extension, which is omitted from file names in this article.
3 A code file is assumed to be any file with no extension (e.g., <string>) or a .h, .c, .hpp, .cpp, .hxx, .xxx, .hh, .cc, .h++, or .c++ extension. This is hard-coded in
- 4th October, 2019: Initial version