A Static Analysis Tool for C++

11 Oct 2019, GPL3
Automating Scott Meyers' recommendations and cleaning up #include directives

Introduction

C++ is a large language—too large, some would argue. Because it's a superset of C, it's easy for developers with a C background to build a hybrid OO/non-OO system. C++ also kept the preprocessor, which is sometimes used in what can only be described as despicable ways. And rather than risk offending legacy systems, the C++ standards committee seems very reluctant to deprecate anything—but not at all reluctant to keep adding what seems like one pedantic feature after another, at least to those of us struggling to keep up.

As a result of all this, there are often many ways to do something in C++, and figuring out which way is best can be difficult. Without guidance, it can only be learned through torturous experience. It is therefore unsurprising that there are many books about C++ best practices, such as Scott Meyers' Effective C++. But it's easy to forget their recommendations when you're immersed in coding, especially when new to the language. Of course, some developers don't even bother to read such books, being of the "If it works, it's correct—so don't touch it!" school. Having a tool that could serve as an automated Scott Meyers code inspector would go a long way to addressing these issues.

Background

When I started to develop the Robust Services Core (RSC), I had a reasonable knowledge of C++ but was far from proficient. The code grew very organically and was continually refactored. As I became more familiar with C++ and needed to revisit areas of the code that had lain dormant for a while, I kept finding things that I would now do differently. But there was always more code to develop and never enough time to do a tedious code inspection to find and "fix" all the things that could be improved.

Eventually I decided that, at the very least, it would be nice to clean up all the #include directives. Surely there was a publicly available tool for this. This was circa 2013, and the only thing I found was a Google initiative called "Include What You Use", which appeared to have been mothballed. [1] I therefore decided to write such a tool as a diversion from the main focus of RSC.

Some diversion! It soon became apparent that fixing #include lists, to add the directives that should be there and remove those that shouldn't, meant writing a parser. And not just a parser, but something closer to a compiler, because it would also have to do name resolution and other things. Another option was to take an open-source C++ compiler and either modify it or extract the necessary information from files that it might produce.

Rather than give up, I decided to try writing the tool from scratch. It would be a learning experience, even if the attempt ultimately had to be abandoned. This article describes the current state of the code that emerged.

Using the Code

Not only does the code clean up #include directives, it serves as an automated Scott Meyers code inspector that can implement some of its recommendations by suitably editing the source code. Its main drawback is that it only supports the subset of C++11 that RSC uses. Although this is a reasonable subset of the language, what's missing will hamper its usefulness to projects that use unsupported language features. Adding one of these missing language features can be anywhere from moderately easy to quite challenging. Nonetheless, feel free to request that a specific language feature be supported—or even volunteer to implement it! This will make the tool useful to a wider range of projects.

Unlike previous articles that I've written, this one focuses more on how to use the code, and not much on how it works. However, it will provide a high-level overview of the design as a roadmap for those who want to dig into the code.

Walkthroughs

Defining the Library

Before the tool can be used, the files that make up the code base must be defined. This can be done right after RSC starts by entering the command >read buildlib from the CLI. That ">" is RSC's CLI prompt and is not entered, but this article uses it to denote a CLI command. A dump of all CLI commands is available in help.cli [2]; scroll down to somewhere after line 1200, to "ct>help full", to see those in the ct directory, which is where the tool is implemented.

What >read buildlib does is execute the script buildlib, which contains a sequence of CLI commands. This results in the execution of the following commands, which are copied from the console transcript file that RSC generates, with commands not relevant to this article removed:

nb>read buildlib
nb>ct
ct>read lib.create
ct>import subs "subs"
ct>import nbase "nb"
ct>import ntool "nt"
ct>import ctool "ct"
ct>import nwork "nw"
ct>import sbase "sb"
ct>import stool "st"
ct>import mbase "mb"
ct>import cbase "cb"
ct>import pbase "pb"
ct>import onode "on"
ct>import cnode "cn"
ct>import rnode "rn"
ct>import snode "sn"
ct>import anode "an"
ct>import diplo "dip"
ct>import rsc   "rsc"

The tool is in the ct directory, so the command >ct is used to access the CLI commands in that directory. The script lib.create is then read. It contains a series of >import commands that add, to the code library, all of the directories that are needed to compile the project (RSC, in this case). For example, the command

ct>import ctool "ct"

imports the code in the ct directory, which can subsequently be referred to as ctool in other CLI commands. The path to this directory is relative to the SourcePath configuration parameter. When RSC starts up, it obtains its configuration parameters from the file element.config. So to use the tools on your own code, you need to

  • Modify element.config by setting its SourcePath entry to a directory that subtends all of your project's code files.
  • Create a file similar to lib.create in the same directory as RSC's lib.create. Each of the >import commands in that file must specify a directory that is relative to your new setting for SourcePath.
  • Copy the subs directory from RSC into your own project, just below your SourcePath directory, and include the command >import subs "subs", as found in RSC's lib.create, in your version of lib.create.
  • Modify the buildlib script to >read your version of lib.create.

Each >import command ends up creating a CodeDir instance for its directory and a CodeFile instance for each code file [3] in that directory. There are currently two restrictions:

  • Each file name must be unique (i.e. the same name cannot be used in more than one directory).
  • All of the code files in a directory get imported (i.e. there is no way to exclude a code file).
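The two restrictions can be pictured with a small sketch. It is not RSC's code: DemoLibrary, Import, and LooksLikeCodeFile are hypothetical names, the extension list is only a subset of the one given in note 3, and reporting a duplicate name (rather than handling it some other way) is an assumption.

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for the check that note 3 attributes to
// CxxString::IsCodeFile: no extension, or a known code extension.
// Only a few of the extensions listed in note 3 are included here.
inline bool LooksLikeCodeFile(const std::string& name)
{
   static const char* const exts[] = {"h", "c", "hpp", "cpp", "hh", "cc"};

   const auto dot = name.rfind('.');
   if(dot == std::string::npos) return true;  // e.g. <string>

   const std::string ext = name.substr(dot + 1);
   for(const char* e : exts)
   {
      if(ext == e) return true;
   }
   return false;
}

// Hypothetical library that maps each file name to its directory.
struct DemoLibrary
{
   std::map<std::string, std::string> fileToDir;

   // Imports every code file in NAMES as belonging to DIR.  Returns the
   // names that were rejected because a file with the same name had
   // already been imported (file names must be unique).
   std::vector<std::string> Import(const std::string& dir,
      const std::vector<std::string>& names)
   {
      std::vector<std::string> rejected;
      for(const auto& n : names)
      {
         if(!LooksLikeCodeFile(n)) continue;  // not a code file: skipped
         if(!fileToDir.emplace(n, dir).second) rejected.push_back(n);
      }
      return rejected;
   }
};
```

Every code file in the list is taken, so the sketch also mirrors the second restriction: there is no parameter for excluding a file.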

Parsing the Code

Once all of the source code directories have been imported, the entire code library can be parsed, which is a prerequisite to checking it with the static analysis tool. This is done with the command

>parse - win32 $files

in which

  • - specifies that no parser options are being used (the only options are ones that enable debug tools)
  • win32 specifies that the target is 32-bit Windows (currently, the only other target is win64)
  • $files is a built-in library variable that contains the set of all code files

If $files is replaced with f ctool, meaning all the code files in the ct directory, the result (again taken from the console transcript file) looks like this:

ct>parse - win32 f ctool
cstdint
cctype
cmath
csignal
cstdio
cstdlib
direct.h
exception
functional
iosfwd
utility
typeinfo
winerror.h
atomic
cstddef
ctime
ios
io.h
iterator
cstring
windows.h
ostream
iomanip
memory
new
queue
stack
unordered_map
algorithm
dbghelp.h
istream
intsafe.h
list
map
set
timeb.h
vector
winsock2.h
string
iostream
ws2tcpip.h
bitset
fstream
sstream
FunctionGuard.h
SysDecls.h
Clock.h
Q1Link.h
Q2Link.h
SysTypes.h
Algorithms.h
Debug.h
  std::bitset<unsigned int>
RegCell.h
Formatters.h
Exception.h
  std::unique_ptr<std::basic_ostringstream>
Memory.h
//
// [many lines deleted]
//
Cxx.cpp
CxxRoot.cpp
  std::unique_ptr<CodeTools::CxxStrLiteral<char,std::basic_string<char,std::char_traits<char>,std::allocator<char>>,CodeTools::Cxx::Encoding::ASCII>>
  std::vector<std::unique_ptr<CodeTools::CxxStrLiteral<char,std::basic_string<char,std::char_traits<char>,std::allocator<char>>,CodeTools::Cxx::Encoding::ASCII>>>
  std::move<std::unique_ptr<CodeTools::CxxStrLiteral<char,std::basic_string<char,std::char_traits<char>,std::allocator<char>>,CodeTools::Cxx::Encoding::ASCII>>>
  std::vector<std::unique_ptr<CodeTools::Macro>>
  std::move<std::unique_ptr<CodeTools::Macro>>
  std::unique_ptr<CodeTools::Define>
  std::move<std::unique_ptr<CodeTools::Define>>
  CodeTools::DisplayObjects<std::unique_ptr<CodeTools::Macro>>
    std::iterator_t<const std::unique_ptr<CodeTools::Macro>>
  NodeBase::Singleton<CodeTools::ParserTraceTool>
Parser.cpp
  CodeTools::CxxCharLiteral<char,CodeTools::Cxx::Encoding::ASCII>
  CodeTools::CxxCharLiteral<char16_t,CodeTools::Cxx::Encoding::U16>
  CodeTools::CxxCharLiteral<char32_t,CodeTools::Cxx::Encoding::U32>
  CodeTools::CxxCharLiteral<wchar_t,CodeTools::Cxx::Encoding::WIDE>
  std::iterator_t<CodeTools::Cxx::Keyword>
  std::iterator_t<const CodeTools::Cxx::Keyword>
  std::unique_ptr<CodeTools::StringLiteral>
  std::move<std::unique_ptr<CodeTools::StringLiteral>>
  std::unique_ptr<CodeTools::Elif>
  std::move<std::unique_ptr<CodeTools::Elif>>
  std::unique_ptr<CodeTools::Else>
  std::move<std::unique_ptr<CodeTools::Else>>
  std::unique_ptr<CodeTools::Endif>
  std::move<std::unique_ptr<CodeTools::Endif>>
  std::unique_ptr<CodeTools::Error>
  std::unique_ptr<CodeTools::Iff>
  std::move<std::unique_ptr<CodeTools::Iff>>
  std::unique_ptr<CodeTools::Ifdef>
  std::move<std::unique_ptr<CodeTools::Ifdef>>
  std::unique_ptr<CodeTools::Ifndef>
  std::move<std::unique_ptr<CodeTools::Ifndef>>
  std::unique_ptr<CodeTools::Line>
  std::unique_ptr<CodeTools::Pragma>
  std::unique_ptr<CodeTools::Undef>
  std::move<std::unique_ptr<CodeTools::Undef>>
  Total=225, failed=0

As each file is parsed, its name is displayed. Template instantiations are indented (and indented further, when one template causes the instantiation of another).

The first RSC file to be parsed is FunctionGuard.h. The files that precede it are either from the standard library or Windows. However, they are not the actual instances of those files. Rather, they are taken from the subs directory, which contains simplified versions of them. These versions avoid the need to

  • >import files that are external to the project from a wide range of directories
  • #define all the names that would be needed to correctly navigate all the #ifdefs in external files
  • support C++ language features used by external files but not by the project
  • parse lots of things that the project doesn't use

Consequently, before you can >parse your own project, you must ensure that the subs directory contains a stand-in for each external header that your project #includes, and that each stand-in declares the items that you use from it. Note that in the case of templates, subs headers do not need to provide function definitions.
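As a rough illustration of what "declares the items that you use" can mean for a template, here is a hypothetical stand-in for part of <memory>. It is a sketch only: the namespace demo_subs is invented so that the example does not collide with the real std, the member list is an arbitrary subset, and RSC's actual subs headers may be organized differently.

```cpp
#include <type_traits>
#include <utility>

// Hypothetical simplified stand-in for part of <memory>.  The members are
// declared but never defined, which is all that name resolution needs.
namespace demo_subs
{
   template<typename T> class unique_ptr
   {
   public:
      unique_ptr() noexcept;
      explicit unique_ptr(T* p) noexcept;
      unique_ptr(unique_ptr&& that) noexcept;
      ~unique_ptr();
      unique_ptr& operator=(unique_ptr&& that) noexcept;
      T* get() const noexcept;
      T* release() noexcept;
      void reset(T* p = nullptr) noexcept;
      T& operator*() const;
      T* operator->() const noexcept;
   };
}

// Declarations alone are enough to resolve a call and find its result
// type, which this compile-time check confirms.
static_assert(std::is_same<
   decltype(std::declval<demo_subs::unique_ptr<int>&>().get()),
   int*>::value, "get() should return T*");
```

Because nothing is defined, such a header could not be linked against, but it does not need to be: it only has to be parsed.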

Performing a Code Inspection

Now that all of the code has been parsed, it can be checked for violations of design guidelines:

>check rsc $files

This produces the file rsc.check, which contains all of the warnings that were found. Basic documentation for each of the ~120 warnings that >check can produce can be seen in the file cppcheck.

If >check is run on a subset of the code, it will first >parse any unparsed code that would be needed in a successful build. This avoids false positives, such as warnings that a function is not defined or is unused.

Before merging into the master branch, I usually run >check on all of the code and use WinDiff to see if any new warnings have arisen since the last merge.

At present, the only way to suppress a warning is to modify the function CodeWarning::Suppress.

Because headers in the subs directory do not provide function implementations for templates, >check can erroneously recommend things such as

  • removing an #include that is needed to make a destructor visible to a unique_ptr template instance
  • declaring a data member const even though it is inserted in a set and must therefore allow std::move
  • removing most of the things in Allocators.h (which is only invoked from the STL, not from within RSC)
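The second item can be illustrated in general terms. The sketch below is not taken from RSC, and the struct names are hypothetical; it only shows why a "could be const" recommendation can be wrong when library code needs to move or assign objects: a const data member removes the implicitly generated assignment operators.

```cpp
#include <string>
#include <type_traits>

// Hypothetical types: identical except for the const on the member.
struct Mutable
{
   std::string name;
};

struct Frozen
{
   const std::string name;
};

// With a non-const member, the compiler generates move assignment.
static_assert(std::is_move_assignable<Mutable>::value,
   "Mutable can be move-assigned");

// A const member cannot be assigned to, so both implicit assignment
// operators are deleted.
static_assert(!std::is_move_assignable<Frozen>::value,
   "Frozen cannot be move-assigned");
static_assert(!std::is_copy_assignable<Frozen>::value,
   "Frozen cannot be copy-assigned");

// Move construction still compiles, but it copies the const member.
static_assert(std::is_move_constructible<Frozen>::value,
   "Frozen is move-constructible, by copying its member");
```

A tool that cannot see the template code that performs the move or assignment has no way to notice this requirement, which is consistent with the explanation given above for these false positives.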

Applying the Recommendations

The >fix command is currently able to resolve about half of the warnings:

fix               : Interactively fixes warnings detected by >check.
  (0:121)         : warning number from Wnnn (0 = all warnings)
  (t|f)           : prompt before fixing?
  <str>           : a set of code files

For example, the following modifies all code files by deleting unnecessary #include directives, which is warning W018:

>fix 18 f $files

To select which occurrences of a warning to fix, ask to be prompted. For example,

>fix 53 t $files

will prompt before fixing each occurrence of warning W053, "Data could be const".

Warning: Before using >fix, be sure that you can recover the original version of the file if something goes wrong. It works on RSC's code, but that doesn't mean it's been thoroughly tested!

Exporting the Library

After the code has been parsed, the >export command can generate any combination of the following files:

  • A .lib file displays parsed code in a standard format and includes

    • the underlying type for each auto variable;

    • the number of times each item was

      • referenced,

      • initialized, read, or written (for data),

      • called (for functions); and

    • the file in which each item was defined (for data and functions).

  • A .trim file lists the external symbols used within each file, as well as the recommendations for which #include directives, using statements, and forward declarations the file should add or remove. Those recommendations also appear as warnings in the .check file.

  • An .xref file contains a global cross-reference (each symbol, followed by a list of the files that use it).

Digging Deeper

Library Variables and Operators

Many of the CLI commands in the ct directory take an expression as their last parameter. This article only used $files, but an expression can contain both variables and operators. The user defines a variable with the >assign command, and the library also provides the following variables, which cannot be modified directly:

Variable  Contents
$dirs     directories that have been added to the library by >import
$files    all code files (headers and implementations) found in $dirs
$hdrs     headers in $files
$cpps     implementations (.c*) in $files
$subs     headers that declare items which are external to the code base
$exts     headers that appear in an #include directive but whose directories were not added to the library by >import (which will cause >parse to fail)
$vars     all variables (those above, and any that the user has defined)

An expression is evaluated left to right, but parentheses can be used to override this. A variable is a set of either directories or files. The following notation is used in the expressions that appear below:

Set   Contents
<ds>  the name of a directory (as defined by >import) or a set of directories
<fs>  the name of a specific file or a set of files
<s>   a <ds> or an <fs>

Here is a table of basic operators. The Result column is what the operator returns, which becomes the input to commands such as >assign and >list. The Expression column specifies the type of parameter(s) that the operator expects.

Operator  Result  Expression     Semantics
|         <s>     <s1> | <s2>    set union of <s1> and <s2> (the '|' is optional)
&         <s>     <s1> & <s2>    set intersection of <s1> and <s2>
-         <s>     <s1> - <s2>    set difference between <s1> and <s2>
f         <fs>    f <ds>         the files in <ds>
d         <ds>    d <fs>         the directories in <fs>
fn        <fs>    <fs> fn <str>  files in <fs> with the file name <str>*
ft        <fs>    <fs> ft <str>  files in <fs> with the file type *.<str>
ms        <fs>    <fs> ms <str>  files in <fs> that contain <str>
in        <fs>    <fs> in <ds>   files in <fs> whose directory is in <ds>
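The set operators map naturally onto the standard library's set algorithms. The following is only a sketch of the semantics in the table, not the tool's Interpreter or SetOperations code; FileSet, Union, Intersection, Difference, and OfType are hypothetical names, and OfType approximates the ft operator by comparing the text after the last '.'.

```cpp
#include <algorithm>
#include <iterator>
#include <set>
#include <string>

using FileSet = std::set<std::string>;  // a set of file names

// <s1> | <s2>
inline FileSet Union(const FileSet& a, const FileSet& b)
{
   FileSet result;
   std::set_union(a.begin(), a.end(), b.begin(), b.end(),
      std::inserter(result, result.end()));
   return result;
}

// <s1> & <s2>
inline FileSet Intersection(const FileSet& a, const FileSet& b)
{
   FileSet result;
   std::set_intersection(a.begin(), a.end(), b.begin(), b.end(),
      std::inserter(result, result.end()));
   return result;
}

// <s1> - <s2>
inline FileSet Difference(const FileSet& a, const FileSet& b)
{
   FileSet result;
   std::set_difference(a.begin(), a.end(), b.begin(), b.end(),
      std::inserter(result, result.end()));
   return result;
}

// <fs> ft <str>: the files in FS whose type (extension) is EXT
inline FileSet OfType(const FileSet& fs, const std::string& ext)
{
   FileSet result;
   for(const auto& f : fs)
   {
      const auto dot = f.rfind('.');
      if(dot != std::string::npos && f.substr(dot + 1) == ext)
      {
         result.insert(f);
      }
   }
   return result;
}
```

With these helpers, an expression such as f sbase ft cpp & c1 would correspond to Intersection(OfType(filesInSbase, "cpp"), c1), where filesInSbase and c1 are sets built elsewhere.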

The following operators can also be used on a set of code files:

Expression  Operator Name     Semantics
us <fs>     users             files that #include any in <fs>
ub <fs>     used by           files that any in <fs> #include
as <fs>     affecters         ub <fs>, transitively
ab <fs>     affected by       us <fs>, transitively
ca <fs>     common affecters  (as f1) & (as f2) & … & (as fn), where f1…fn are the files in <fs>
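The first four of these operators can be sketched over a simple #include graph. This is an illustration of the semantics only: IncludeGraph and the function names are hypothetical, and including the starting files in the transitive results is an assumption that the table does not settle.

```cpp
#include <map>
#include <set>
#include <string>
#include <utility>

using FileSet = std::set<std::string>;
// Maps each file to the files that it #includes directly.
using IncludeGraph = std::map<std::string, FileSet>;

// ub <fs>: files that any file in FS #includes directly.
inline FileSet UsedBy(const IncludeGraph& g, const FileSet& fs)
{
   FileSet result;
   for(const auto& f : fs)
   {
      auto it = g.find(f);
      if(it != g.end()) result.insert(it->second.begin(), it->second.end());
   }
   return result;
}

// us <fs>: files that directly #include any file in FS.
inline FileSet Users(const IncludeGraph& g, const FileSet& fs)
{
   FileSet result;
   for(const auto& entry : g)
   {
      for(const auto& inc : entry.second)
      {
         if(fs.count(inc) != 0)
         {
            result.insert(entry.first);
            break;
         }
      }
   }
   return result;
}

// Applies STEP repeatedly until no new files appear.  The starting files
// are part of the result (an assumption made for this sketch).
template<typename Step>
FileSet Closure(const IncludeGraph& g, const FileSet& fs, Step step)
{
   FileSet result = fs;
   FileSet frontier = fs;
   while(!frontier.empty())
   {
      FileSet next;
      for(const auto& f : step(g, frontier))
      {
         if(result.insert(f).second) next.insert(f);
      }
      frontier = std::move(next);
   }
   return result;
}

// as <fs>: ub <fs>, transitively.
inline FileSet Affecters(const IncludeGraph& g, const FileSet& fs)
{
   return Closure(g, fs, UsedBy);
}

// ab <fs>: us <fs>, transitively.
inline FileSet AffectedBy(const IncludeGraph& g, const FileSet& fs)
{
   return Closure(g, fs, Users);
}
```

Because each pass only follows files that were newly added, the loop terminates even if the graph contains an #include cycle.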

After the code has been parsed, the following operators can also be used on a set of code files:

Expression  Operator Name  Semantics
im <fs>     implements     for each item declared (defined) in <fs>, add the file that defines (declares) it
ns <fs>     needers        files that also need <fs> in a build (im ab <fs>, transitively)
nb <fs>     needed by      files that <fs> also needs in a build (im as <fs>, transitively)

These operators can help to analyze dependencies among code files. For example:

>import sbase "sb"         // add SessionBase files to the library
>type us Thread.h          // show all files that #include Thread.h
>assign h1 f sbase ft cpp  // h1 = all SessionBase implementations
>assign c1 ab Thread.h     // c1 = files that could be affected by changing Thread.h
>assign s1 h1 & c1         // s1 = SessionBase .cpps that could be affected by changing Thread.h

What to #include

Interactions exist among the warnings for adding and removing #include directives, using statements, and forward declarations. CodeFile::Trim generates these warnings. Its basic rules are

  • Always #include something if nothing guarantees that it will be visible transitively.
  • Don't #include something that will definitely be visible transitively. It is necessary to #include a base class, as well as a class that is used directly. However, it is not necessary to #include their base classes, even when using something declared in one of those transitive base classes. Similarly, it is not necessary for a .cpp to #include anything that its header will #include.
  • If a class is only used indirectly (i.e. as a pointer or reference type), don't #include it. Use a forward declaration instead. If there is no guarantee that one will be visible transitively, add one to this file.
  • A header should not contain a using directive or declaration. It is therefore told to remove it, and any .cpp that relies on it is told to add it.
  • If an #include, forward declaration, or using statement is not needed to resolve a symbol, remove it.

All of these warnings can be resolved by >fix, which will, for example, insert a forward declaration in the correct namespace and fully qualify symbols from another namespace when removing a using statement.
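To make the "used indirectly" rule concrete, here is a small sketch in which the contents of a hypothetical Worker.h, Thread.h, and Worker.cpp are shown in a single block so that it compiles on its own. The class names and members are invented for illustration.

```cpp
// --- Worker.h (hypothetical) ---
// Thread appears only as a pointer type, so it is used indirectly.
// A forward declaration is enough; #include "Thread.h" is not needed.
class Thread;

class Worker
{
public:
   explicit Worker(Thread* owner) : owner_(owner) { }
   Thread* Owner() const { return owner_; }
private:
   Thread* owner_;
};

// --- Thread.h (hypothetical) ---
class Thread
{
public:
   explicit Thread(int id) : id_(id) { }
   int Id() const { return id_; }
private:
   int id_;
};

// --- Worker.cpp (hypothetical) ---
// Calling a member function is a direct use of Thread, so this file is
// the one that would need #include "Thread.h".
inline int OwnerId(const Worker& worker)
{
   return worker.Owner()->Id();
}
```

Under the rules above, Worker.h would be told to replace an #include of Thread.h with the forward declaration, and Worker.cpp would be told to add the #include if nothing else made Thread visible.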

High-Level Design

The parser is implemented using recursive descent, which makes its code easy to read and modify. The advent of unique_ptr was a godsend to these types of parsers, which were previously cursed by the need to delete objects when backing up. Placing each of these objects in a unique_ptr allows the parser to back up without having to write any code to delete them.
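The benefit of unique_ptr when backing up can be shown with a toy recursive descent parser. This is not RSC's parser: the grammar, the Node type, and the MiniParser class are hypothetical, and the live counter exists only to make the automatic cleanup observable.

```cpp
#include <cctype>
#include <cstddef>
#include <memory>
#include <string>
#include <utility>

struct Node
{
   static int live;  // number of nodes currently allocated
   int value = 0;
   std::unique_ptr<Node> left;
   std::unique_ptr<Node> right;
   Node() { ++live; }
   ~Node() { --live; }
};

int Node::live = 0;

class MiniParser
{
public:
   explicit MiniParser(std::string text) : text_(std::move(text)) { }

   // item := pair | group | number
   std::unique_ptr<Node> ParseItem()
   {
      const std::size_t start = pos_;
      if(auto pair = ParsePair()) return pair;
      pos_ = start;  // back up: nodes built by ParsePair are already gone
      if(auto group = ParseGroup()) return group;
      pos_ = start;
      return ParseNumber();
   }
private:
   bool AtDigit() const
   {
      return pos_ < text_.size() &&
         std::isdigit(static_cast<unsigned char>(text_[pos_])) != 0;
   }

   bool Skip(char c)
   {
      if(pos_ < text_.size() && text_[pos_] == c) { ++pos_; return true; }
      return false;
   }

   // number := digit+
   std::unique_ptr<Node> ParseNumber()
   {
      if(!AtDigit()) return nullptr;
      std::unique_ptr<Node> node(new Node);
      while(AtDigit()) node->value = node->value * 10 + (text_[pos_++] - '0');
      return node;
   }

   // pair := '(' number ',' number ')'
   std::unique_ptr<Node> ParsePair()
   {
      if(!Skip('(')) return nullptr;
      auto first = ParseNumber();
      if(!first || !Skip(',')) return nullptr;  // FIRST is freed here
      auto second = ParseNumber();
      if(!second || !Skip(')')) return nullptr;
      std::unique_ptr<Node> node(new Node);
      node->left = std::move(first);
      node->right = std::move(second);
      return node;
   }

   // group := '(' number ')'
   std::unique_ptr<Node> ParseGroup()
   {
      if(!Skip('(')) return nullptr;
      auto inner = ParseNumber();
      if(!inner || !Skip(')')) return nullptr;
      return inner;
   }

   std::string text_;
   std::size_t pos_ = 0;
};
```

When the input is "(12)", ParsePair allocates a node for 12 before discovering that no comma follows. Returning nullptr destroys that node automatically, so ParseItem only has to reset its position before trying ParseGroup.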

The parser does not check everything in the same way that a full parser must. It assumes that the code correctly compiles and links, so it only contains enough checks to produce a correct parse. Its grammar, which is informally documented in the relevant functions, is also far simpler than a complete C++ grammar.

As each code file is read in during >import, #include relationships are noted. This allows a global compile order to be calculated. The only other preprocessing that occurs before parsing is to erase, within C++ code, any macro name that is defined as an empty string. Currently, the only such name is NO_OP, which RSC uses before a bare semicolon when a for statement is missing a parenthesized statement.
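One common way to derive a compile order from #include relationships is a depth-first topological sort, sketched below. The article does not say which algorithm the tool uses, so this is an assumption, and IncludeMap, Visit, and CompileOrder are hypothetical names.

```cpp
#include <map>
#include <set>
#include <string>
#include <vector>

// Maps each file to the files that it #includes directly.
using IncludeMap = std::map<std::string, std::vector<std::string>>;

// Depth-first visit: FILE is appended to ORDER only after everything
// that it #includes has been appended.
inline void Visit(const std::string& file, const IncludeMap& includes,
   std::set<std::string>& done, std::vector<std::string>& order)
{
   if(!done.insert(file).second) return;  // already visited; stops cycles

   auto it = includes.find(file);
   if(it != includes.end())
   {
      for(const auto& inc : it->second) Visit(inc, includes, done, order);
   }
   order.push_back(file);
}

// Returns the files in an order in which each file follows the files
// that it #includes (assuming that there are no #include cycles).
inline std::vector<std::string> CompileOrder(const IncludeMap& includes)
{
   std::set<std::string> done;
   std::vector<std::string> order;
   for(const auto& entry : includes)
   {
      Visit(entry.first, includes, done, order);
   }
   return order;
}
```

For a graph in which App.cpp includes App.h and Thread.h, and App.h includes Thread.h, the resulting order is Thread.h, App.h, App.cpp.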

Once this simple preprocessing is complete, all of the code is parsed together, in a single pass. After an item is parsed, it is added to the scope (namespace, class, function, or code block) in which it appears, and its virtual EnterScope function is invoked. After each function is parsed, it is "executed" by invoking its virtual EnterBlock function. An item's EnterScope or EnterBlock function also invokes the same function on each of its constituent parts.
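The way a call propagates to an item's constituent parts can be pictured with a minimal composite. Only the function name EnterScope comes from the description above; the DemoItem class, its log parameter, and everything else in this sketch are hypothetical and much simpler than the real class hierarchy.

```cpp
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical parsed item that owns its constituent parts.
class DemoItem
{
public:
   explicit DemoItem(std::string name) : name_(std::move(name)) { }
   virtual ~DemoItem() = default;

   void Add(std::unique_ptr<DemoItem> part)
   {
      parts_.push_back(std::move(part));
   }

   // Records that this item has entered its scope, then invokes the
   // same function on each constituent part.  A subclass could override
   // this to do item-specific work before forwarding.
   virtual void EnterScope(std::vector<std::string>& log) const
   {
      log.push_back(name_);
      for(const auto& part : parts_) part->EnterScope(log);
   }
private:
   std::string name_;
   std::vector<std::unique_ptr<DemoItem>> parts_;
};
```

Calling EnterScope on an item that represents a class would therefore also reach the items that represent its data members and functions, in the order in which they were added.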

Some of the warnings generated by >check are detected during >import, some are detected during >parse, and some are detected during >check itself, through the virtual function Check. CodeFile::Trim, mentioned in the previous section, uses the virtual function GetUsages to obtain, from all of its file's C++ entities, the symbols that are used (a) as base classes, (b) directly, and (c) indirectly, as well as those that were resolved by (d) forward declarations, (e) friend declarations, and (f) using statements.

Performance

The time required to >parse and >check all of RSC's code is similar to the time required for a complete build using Microsoft's C++ compiler. This isn't a true apples-to-apples comparison because >parse doesn't lay out memory or generate object code, but its time is also that for a debug, not a release, build. And >parse doesn't use more than one core, whereas Microsoft's compiler uses two when possible (at least on my quad core).

RSC contains about 212K lines of source, but half of that is blanks, comments, and left braces. When RSC starts up with its default configuration file under Win32, it grows to about 58MB of memory, which could be significantly reduced by changing various configuration parameters. After executing >parse, >check, and >export, it has grown by about another 285MB.

The tools don't generate any intermediate or scratch files; everything is kept in memory. Using files would be a significant change, so the amount of available memory ultimately limits the size of the code library that the tools can accommodate. But my guess is that anyone with that much code could also provide a machine with enough memory—or simply purchase a commercial equivalent of the tool.

List of Code Files

The ct directory contains all of the code. If you want to dive into it, here's a summary of the files in that directory:

File (.h & .cpp)  Description
CodeCoverage      code coverage tool (not discussed in this article)
CodeDir           a directory that contains source code
CodeDirSet        a set of code directories
CodeFile          a file that contains source code
CodeFileSet       a set of code files
CodeIncrement     CLI commands applicable to the ct directory
CodeSet           base class for CodeDirSet and CodeFileSet
CodeTypes         types for parsing and static analysis
CtModule          initialization of ct directory
Cxx               types for C++
CxxArea           namespaces, classes, and class template instances
CxxCharLiteral    character literals
CxxDirective      preprocessor directives
CxxExecute        for tracking code parsing and "execution"
CxxFwd            forward declarations
CxxNamed          low-level named C++ items
CxxRoot           global namespace and built-in terminals
CxxScope          code blocks, data items, and functions
CxxScoped         arguments, base classes, enums, enumerators, forwards, friends, terminals, typedefs, usings
CxxStatement      statements used in functions
CxxStrLiteral     string literals
CxxString         string utilities
CxxSymbols        parser symbol tables
CxxToken          low-level unnamed C++ items
Editor            source code editor for >fix command
Interpreter       interprets expressions (in CLI commands) that manipulate instances of LibrarySet subclasses
Lexer             lexical analysis utilities
Library           code files, code directories, and CLI symbols
LibraryErrSet     invoked when a CLI command does not apply to a set
LibraryItem       base class for CodeDir, CodeFile, and LibrarySet
LibrarySet        base class for CodeSet, LibraryErrSet, and LibraryVarSet (sets of items to which CLI commands can be applied)
LibraryTypes      types for code library
LibraryVarSet     built-in or user-defined library variables
Parser            parser for C++ source code
SetOperations     difference, intersection, and union operators for instances of LibrarySet

Notes

[1] While preparing this article, I checked to see if anything had changed. Google's project eventually gained traction and is now on GitHub. They took the approach of building on Clang, and they say that they're currently "alpha" quality and that changes to Clang sometimes break them.

[2] The file help.cli has a .txt extension, which is omitted from file names in this article. These files are attached in a separate .zip file because article submission software currently replaces them with links that do not work.

[3] A code file is assumed to be any file with no extension (e.g. <string>) or a .h, .c, .hpp, .cpp, .hxx, .cxx, .hh, .cc, .h++, or .c++ extension. This is hard-coded in CxxString::IsCodeFile.

History

  • 4th October, 2019: Initial version

License

This article, along with any associated source code and files, is licensed under The GNU General Public License (GPLv3)

About the Author

Greg Utas
Architect
Canada
Author of Robust Services Core (GitHub) and Robust Communications Software (Wiley, 2005). Formerly Chief Software Architect of the servers (GSM MSCs) that handle the calls in AT&T's wireless network.


Article
Posted 7 Oct 2019
