Introduction
gram_grep is a search tool that goes far beyond the capabilities of grep. Searches can span multiple lines, may be chained together in a variety of ways, and can even utilise bison-style grammars.
Maybe you want a search to ignore comments, or to search only within strings. Maybe you have code with SQL inside strings, and that SQL itself contains strings that you want to search in. The possibilities are endless, and there is no limit to the number of chained sub-searches.
For example, here is how you would search for memory_file outside of C and C++ style comments:
gram_grep -vE "\/\/.*|\/\*(?s:.)*?\*\/" -E memory_file main.cpp
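The pattern passed to -vE above is a DFA regex understood by gram_grep's own engine, but Python's re module happens to accept the same syntax here, so we can sketch the effect of the pipeline (match everything outside comments, then look for the identifier) as an illustration. This is only an approximation of the chained search, not how gram_grep actually works internally:

```python
import re

# The C/C++ comment pattern from the command line, transcribed into
# Python regex syntax ((?s:.) makes "." match newlines locally).
comment = re.compile(r'//.*|/\*(?s:.)*?\*/')

src = 'int x; // memory_file mentioned here\n/* and\nhere */ memory_file y;'

# Emulate -vE ... -E memory_file: drop the comments, then search the rest.
stripped = comment.sub('', src)
print('memory_file' in stripped)  # the real identifier survives
```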
A Note on DOS Prompt Escapes
Note that '^' is the escape character in the command prompt, so if you want to use a literal '^' character you will have to double it up. The same goes for double quotes ('"'); in addition, any regex using them will need surrounding with double quotes, as will any regex containing the pipe symbol ('|') (although in that case you do not need to double it up).
Configuration Files Make Things Easier
It quickly gets tedious trying to escape characters correctly in a command shell, so we switch to a configuration file, which also lets us exclude strings:
gram_grep -f nosc.g main.cpp
The config file nosc.g looks like this:
%%
%%
%%
'([^'\\\r\n]|\\.)*' skip()
\"([^"\\\r\n]|\\.)*\" skip()
R\"\((?s:.)*?\)\" skip()
"//".*|"/*"(?s:.)*?"*/" skip()
memory_file 1
%%
Note how character literals are also skipped, just in case one contains a double quote! Also note how we have moved our search for memory_file directly into the config file, as this part of the config lists regexes that are passed to a lexer generator. This means we can specify the things we want to match (using 1 as the id in this case) or explicitly skip (using skip()) all within the same section. This mode alone already gives us far more searching power than traditional techniques.
If we wanted to search only in strings or comments, we would use 1 instead of skip() for those regexes and omit the memory_file line altogether. We would then pass memory_file with -E or -P as a command line parameter.
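For example, a strings-only variant of the config could look like this (a sketch assembled from the rules above; the filename strings_only.g is invented for illustration):

```
%%
%%
%%
'([^'\\\r\n]|\\.)*' skip()
\"([^"\\\r\n]|\\.)*\" 1
R\"\((?s:.)*?\)\" 1
"//".*|"/*"(?s:.)*?"*/" skip()
%%
```

It would then be driven with: gram_grep -f strings_only.g -E memory_file main.cpp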
Source Control
Note that it is possible to issue a command to check out files from source control:
gram_grep -r -E "v4\.5\.1" -replace v4.5.2 -o -checkout "tf.exe checkout $1" *.csproj
The above example would replace v4.5.1 with v4.5.2 in *.csproj files, checking out the files from TFS as they match. Note that there are also switches -startup and -shutdown with which you can run other commands at startup and exit respectively, if required (e.g., "tf.exe workspace /new /collection:http://... refactor /noprompt" and "tf.exe workspace /delete /collection:http://... refactor /noprompt").
The Configuration File Format
The config file has the following format:
<grammar/lexer directives>
%%
<grammar>
%%
<regex macros>
%%
<regexes>
%%
As implied above, the grammar/lexer directives, grammar and regex macros sections are all optional.
Here is an example of a simple grammar that recognises C++ strings split over multiple lines (strings.g):
/*
NOTE: in order to successfully find strings it is necessary to filter out comments and chars.
As a subtlety, comments could contain apostrophes (or even unbalanced double quotes in
an extreme case)!
*/
%token RawString String
%%
list: String { match = substr($1, 1, 1); };
list: RawString { match = substr($1, 3, 2); };
list: list String { match += substr($2, 1, 1); };
list: list RawString { match += substr($2, 3, 2); };
%%
%%
\"([^"\\\r\n]|\\.)*\" String
R\"\((?s:.)*?\)\" RawString
'([^'\\\r\n]|\\.)*' skip()
[ \t\r\n]+|"//".*|"/*"(?s:.)*?"*/" skip()
%%
Although the grammar is just about as simple as it gets, note the scripting that has been added. Each string fragment is appended to match, which can then be searched by a following search. This means we can search within C++ strings without worrying about how they are split over lines.
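The substr() calls above trim a fixed number of characters from each end of the token: one quote either side of a String, and the R"( / )" delimiters of a RawString. In Python terms (an illustrative sketch of the script semantics, not gram_grep code):

```python
def substr(s: str, omit_left: int, omit_right: int) -> str:
    """Rough Python equivalent of gram_grep's substr() script command."""
    return s[omit_left:len(s) - omit_right]

# String: strip the surrounding quotes
print(substr('"hello"', 1, 1))      # hello
# RawString: strip the R"( prefix and the )" suffix
print(substr('R"(world)"', 3, 2))   # world
```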
Note how we have switched from using 1 as the matching regex id to names that we have specified with %token and used in the grammar.
Example usage:
gram_grep -f sample_configs/strings.g -E grammar main.cpp
The full list of scripting commands is given below. You can see them in use in the more sophisticated examples that follow. $n, $from and $to refer to the item in the production you are interested in (numbering starts at 1).
erase($n);
erase($from, $to);
erase($from.second, $to.first);
insert($n, 'text');
insert($n.second, 'text');
match = $n;
match = substr($n, <omit from left>, <omit from right>);
match += $n;
match += substr($n, <omit from left>, <omit from right>);
replace($n, 'text');
replace($from, $to, 'text');
replace($from.second, $to.first, 'text');
replace_all($n, 'regex', 'text');
Notes on Grammars
By default, the entire grammar will match. However, there are times when you are only interested in whether specific parts of your grammar match. If you want to match only on particular grammar rules, use {} just before the terminating semi-colon for those rules. This technique is shown in a later example.
Most of the time, the only grammar/lexer directive you will care about is %token. However, others are also supported, such as %x (lexer start states), %captures, %option caseless and %prec, all of which appear or are mentioned later in this article.
Command Line Switches for gram_grep
- --help (Shows help)
- -checkout <checkout command (include $1 for pathname)>
- -E <regex> (Search using DFA regex)
- -Ee <regex> (As -E but continue searching from the end of the match)
- -exclude <wildcard> (Exclude any pathname matching the wildcard)
- -f <config file> (Search using a config file)
- -fe <config file> (As -f but continue searching from the end of the match)
- -force_write (If a file is read only, force it to be writable)
- -i (Case insensitive searching)
- -l (Output pathname only)
- -o (Output changes to the matching file)
- -P <regex> (Search using std::regex)
- -Pe <regex> (As -P but continue searching from the end of the match)
- -r, -R, --recursive (Recurse subdirectories)
- -replace <replacement literal text>
- -shutdown <command to run when exiting>
- -startup <command to run at startup>
- -T <text> (Search for plain text with support for capture ($n) syntax)
- -utf8 (In the absence of a BOM, assume UTF-8)
- -vE <regex> (Search using DFA regex; negated - match all text other than the regex)
- -VE <regex> (Search using DFA regex; all negated - match if the regex is not found)
- -vf <config file> (Search using a config file; negated - match all text other than the config)
- -Vf <config file> (Search using a config file; all negated - match if the config is not found)
- -vP <regex> (Search using std::regex; negated - match all text other than the regex)
- -VP <regex> (Search using std::regex; all negated - match if the regex is not found)
- -vT <text> (Search for plain text with support for capture ($n) syntax; negated)
- -VT <text> (Search for plain text with support for capture ($n) syntax; all negated)
- -writable (Only process files that are writable)
- <pathname>... (Files to search; wildcards supported)
Unicode
If an input file has a BOM (byte order marker), then that will be recognised. In the case of UTF-16, the contents will be automatically converted to UTF-8 in memory to allow uniform processing.
Unicode support can be enabled with the -utf8 switch. Two things happen with this switch enabled:
- Any files without a BOM (byte order marker) are assumed to be UTF-8.
- The lexer enables Unicode support (-E, -vE, -VE, -f, -vf, -Vf). Note that the std::regex switches (-P, -vP, -VP) do not currently support Unicode.
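The BOM handling described above can be sketched in Python (illustrative only; in particular, the no-BOM fallback code page chosen here is an assumption for the sketch, not gram_grep's actual behaviour):

```python
import codecs

def read_as_utf8(data: bytes, assume_utf8: bool = False) -> str:
    """Sketch of the BOM handling described above (not gram_grep's code)."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):].decode('utf-8')
    if data.startswith(codecs.BOM_UTF16_LE) or data.startswith(codecs.BOM_UTF16_BE):
        # UTF-16 input is converted so that later processing is uniform.
        return data.decode('utf-16')
    # No BOM: -utf8 assumes UTF-8; otherwise fall back to a single-byte code page.
    return data.decode('utf-8' if assume_utf8 else 'latin-1')

print(read_as_utf8('caf\u00e9'.encode('utf-16')))  # café
```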
Examples
Searching for SQL INSERT Commands Without a Column List
insert.g:
%token INSERT INTO Name String VALUES
%%
start: insert;
insert: INSERT into name VALUES;
into: INTO | %empty;
name: Name | Name '.' Name | Name '.' Name '.' Name;
%%
%%
(?i:INSERT) INSERT
(?i:INTO) INTO
(?i:VALUES) VALUES
\. '.'
(?i:[a-z_][a-z0-9@$#_]*|\[[a-z_][a-z0-9@$#_]*[ ]*\]) Name
'([^']|'')*' String
\s+|--.*|"/*"(?s:.)*?"*/" skip()
%%
The command line looks like this:
gram_grep -r -f sample_configs/insert.g *.sql
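As a cross-check of what insert.g accepts, the same shape can be approximated by a single (much cruder) Python regex; unlike the grammar, this sketch makes no attempt to skip comments or strings:

```python
import re

# An INSERT that goes straight from the (possibly dotted) table name to
# VALUES, i.e. with no "(col, ...)" list in between.
no_column_list = re.compile(
    r'\bINSERT\s+(?:INTO\s+)?[A-Za-z_][\w@$#]*(?:\.[A-Za-z_][\w@$#]*)*\s+VALUES\b',
    re.IGNORECASE)

print(bool(no_column_list.search("INSERT INTO t VALUES (1)")))            # True
print(bool(no_column_list.search("INSERT INTO t (a, b) VALUES (1, 2)")))  # False
```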
Searching for SQL MERGE Commands Without WITH(HOLDLOCK) Within Strings Only
First the string extraction (strings.g):
%token RawString String
%%
list: String { match = substr($1, 1, 1); };
list: RawString { match = substr($1, 3, 2); };
list: list String { match += substr($2, 1, 1); };
list: list RawString { match += substr($2, 3, 2); };
%%
%%
\"([^"\\\r\n]|\\.)*\" String
R\"\((?s:.)*?\)\" RawString
'([^'\\\r\n]|\\.)*' skip()
[ \t\r\n]+|"//".*|"/*"(?s:.)*?"*/" skip()
%%
Or if we wanted to scan C#:
%token String VString
%%
list: String { match = substr($1, 1, 1); };
list: VString { match = substr($1, 2, 1); };
list: list '+' String { match += substr($3, 1, 1); };
list: list '+' VString { match += substr($3, 2, 1); };
%%
ws [ \t\r\n]+
%%
\+ '+'
\"([^"\\\r\n]|\\.)*\" String
@\"([^"]|\"\")*\" VString
'([^'\\\r\n]|\\.)*' skip()
{ws}|"//".*|"/*"(?s:.)*?"*/" skip()
%%
Now the grammar to search inside the strings (merge.g):
%token AS Integer INTO MERGE Name PERCENT TOP USING
%%
merge: MERGE opt_top opt_into name opt_alias USING;
opt_top: %empty | TOP '(' Integer ')' opt_percent;
opt_percent: %empty | PERCENT;
opt_into: %empty | INTO;
name: Name | Name '.' Name | Name '.' Name '.' Name;
opt_alias: %empty | opt_as Name;
opt_as: %empty | AS;
%%
%%
(?i:AS) AS
(?i:INTO) INTO
(?i:MERGE) MERGE
(?i:PERCENT) PERCENT
(?i:TOP) TOP
(?i:USING) USING
\. '.'
\( '('
\) ')'
\d+ Integer
(?i:[a-z_][a-z0-9@$#_]*|\[[a-z_][a-z0-9@$#_]*[ ]*\]) Name
\s+ skip()
%%
The command line looks like this:
gram_grep -r -f sample_configs/strings.g -f sample_configs/merge.g *.cpp
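Chaining the two -f switches pipelines the searches: the matches produced by strings.g become the input to merge.g. The pipeline idea in miniature (a Python sketch using toy regexes in place of the real grammars; it handles none of the comment, TOP/INTO or alias subtleties the grammars do):

```python
import re

def string_literals(src: str) -> list[str]:
    """Stage 1 (toy strings.g): pull out the contents of C string literals."""
    return [m.group(1) for m in re.finditer(r'"((?:[^"\\]|\\.)*)"', src)]

# Stage 2 (toy merge.g): a MERGE whose target is not followed by WITH (HOLDLOCK).
bad_merge = re.compile(r'\bMERGE\b(?!\s+\w+\s+WITH\s*\(HOLDLOCK\))', re.IGNORECASE)

src = 'exec("MERGE t USING s ON ...");'
hits = [s for s in string_literals(src) if bad_merge.search(s)]
print(hits)  # ['MERGE t USING s ON ...']
```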
Looking for Uninitialised Variables in Headers
Note the use of {} here to specify that we only care when the rule item: Name; matches.
%token Bool Char Name NULLPTR Number String Type
%%
start: decl;
decl: Type list ';';
list: item | list ',' item;
item: Name {};
item: Name '=' value;
value: Bool | Char | Number | NULLPTR | String;
%%
NAME [_A-Za-z][_0-9A-Za-z]*
%%
= '='
, ','
; ';'
true|TRUE|false|FALSE Bool
nullptr NULLPTR
BOOL|BSTR|BYTE|COLORREF|D?WORD|DWORD_PTR Type
DROPEFFECT|HACCEL|HANDLE|HBITMAP|HBRUSH Type
HCRYPTHASH|HCRYPTKEY|HCRYPTPROV|HCURSOR|HDBC Type
HICON|HINSTANCE|HMENU|HMODULE|HSTMT|HTREEITEM Type
HWND|LPARAM|LPCTSTR|LPDEVMODE|POSITION|SDWORD Type
SQLHANDLE|SQLINTEGER|SQLSMALLINT|UINT|U?INT_PTR Type
UWORD|WPARAM Type
bool|(unsigned\s+)?char|double|float Type
(unsigned\s+)?int((32|64)_t)?|long|size_t Type
{NAME}(\s*::\s*{NAME})*(\s*[*])+ Type
{NAME} Name
-?\d+(\.\d+)? Number
'([^'\\\r\n]|\\.)*' Char
\"([^"\\\r\n]|\\.)*\" String
[ \t\r\n]+|"//".*|"/*"(?s:.)*?"*/" skip()
%%
The command line looks like this:
gram_grep -r -f sample_configs/uninit.g *.h
Automatically Converting boost::format to std::format
Note the use of a variety of scripting commands:
%token Integer Name RawString String
%%
start: '(' format list ')' '.' 'str' '(' ')'
/* Erase the first "(" and the trailing ".str()" */
{ erase($1);
erase($5, $8); };
start: 'str' '(' format list ')'
/* Erase "str(" */
{ erase($1, $2); };
format: 'boost' '::' 'format' '(' string ')'
/* Replace "boost" with "std" */
/* Replace the format specifiers within the strings */
{ replace($1, 'std');
replace_all($5, '%(\d+[Xdsx])', '{:$1}');
replace_all($5, '%((?:\d+)?\.\d+f)', '{:$1}');
replace_all($5, '%x', '{:x}');
replace_all($5, '%[ds]', '{}');
replace_all($5, '%%', '%');
erase($6); };
string: String;
string: RawString;
string: string String;
string: string RawString;
list: %empty;
list: list '%' param
/* Replace "%" with ", " */
{ replace($2, ', '); };
param: Integer;
param: name
/* Replace any trailing ".c_str()" calls with "" */
{ replace_all($1, '\.c_str\(\)$', ''); };
name: Name opt_func
| name deref Name opt_func;
opt_func: %empty | '(' opt_param ')';
deref: '.' | '->' | '::';
opt_param: %empty | Integer | name;
%%
%%
\( '('
\) ')'
\. '.'
% '%'
:: '::'
-> '->'
boost 'boost'
format 'format'
str 'str'
-?\d+ Integer
\"([^"\\\r\n]|\\.)*\" String
R\"\((?s:.)*?\)\" RawString
'([^'\\\r\n]|\\.)*' skip()
[_a-zA-Z][_0-9a-zA-Z]* Name
\s+|"//".*|"/*"(?s:.)*?"*/" skip()
%%
The command line looks like this:
gram_grep -o -r -f format.g *.cpp
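The replace_all() calls in the format rule are ordinary regex substitutions, so their combined effect on a format string can be checked with Python's re.sub (same patterns, with Python's backreference syntax):

```python
import re

def convert_spec(fmt: str) -> str:
    """Mirror the replace_all() chain from the grammar above."""
    fmt = re.sub(r'%(\d+[Xdsx])', r'{:\1}', fmt)
    fmt = re.sub(r'%((?:\d+)?\.\d+f)', r'{:\1}', fmt)
    fmt = re.sub(r'%x', r'{:x}', fmt)
    fmt = re.sub(r'%[ds]', r'{}', fmt)
    fmt = re.sub(r'%%', r'%', fmt)
    return fmt

print(convert_spec('%04d of %s at %.2f (%x) 100%%'))
# {:04d} of {} at {:.2f} ({:x}) 100%
```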
Coping With Nested Constructs Without Caring What They Are
This example finds an if statement, its opening parenthesis and its closing parenthesis, and copes with any parentheses nested in between. We introduce the nonsense token anything so that we stop matching directly after the closing parenthesis, and we rely on lexer states to cope with the nesting.
%token if anything
%x PREBODY BODY PARENS
%%
start: if '(' ')';
%%
any (?s:.)
char '([^'\\\r\n]|\\.)+'
name [A-Z_a-z][0-9A-Z_a-z]*
string \"([^"\\\r\n]|\\.)*\"|R\"\((?s:.)*?\)\"
ws [ \t\r\n]+|"//".*|"/*"(?s:.)*?"*/"
%%
<INITIAL>if<PREBODY> if
<PREBODY>[(]<BODY> '('
<PREBODY>.{+}[\r\n]<.> skip()
<BODY,PARENS>[(]<>PARENS> skip()
<PARENS>[)]<<> skip()
<BODY>[)]<INITIAL> ')'
<BODY,PARENS>{string}<.> skip()
<BODY,PARENS>{char}<.> skip()
<BODY,PARENS>{ws}<.> skip()
<BODY,PARENS>{name}<.> skip()
<BODY,PARENS>{any}<.> skip()
{string} anything
{char} anything
{ws} anything
{name} anything
{any} anything
%%
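The PREBODY/BODY/PARENS states are effectively maintaining a nesting count. The same idea in plain Python (a sketch that, unlike the config above, ignores strings, chars and comments):

```python
def if_condition(src: str) -> str:
    """Return the parenthesised condition of the first `if`, honouring nesting.

    Naive sketch: index('if') would also hit 'if' inside identifiers.
    """
    start = src.index('if')
    open_pos = src.index('(', start)
    depth = 0
    for i in range(open_pos, len(src)):
        if src[i] == '(':
            depth += 1
        elif src[i] == ')':
            depth -= 1
            if depth == 0:
                return src[open_pos:i + 1]
    raise ValueError('unbalanced parentheses')

print(if_condition('if (f(x, g(y)) == 0) return;'))  # (f(x, g(y)) == 0)
```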
Finding Unused Variables in C++ Functions
gram_grep -r *.cpp;*.h -f \configs\block.g -fe \configs\var.g -VT $1
block.g:
%token Name anything
%x BODY BRACES
%%
start: '{' '}';
%%
any (?s:.)
char '([^'\\]|\\.)+'
name [A-Z_a-z][0-9A-Z_a-z]*
string \"([^"\\]|\\.)*\"|R\"\((?s:.)*?\)\"
ws [ \t\r\n]+|\/\/.*|"/*"(?s:.)*?"*/"
%%
(class|struct|namespace|union)\s+{name}?[^;{]*\{ skip()
extern\s*["]C["]\s*\{ skip()
<INITIAL>\{<BODY> '{'
<BODY,BRACES>\{<>BRACES> skip()
<BRACES>\}<<> skip()
<BODY>\}<INITIAL> '}'
<BRACES,BODY>{string}<.> skip()
<BRACES,BODY>{char}<.> skip()
<BRACES,BODY>{ws}<.> skip()
<BRACES,BODY>{name}<.> skip()
<BRACES,BODY>{any}<.> skip()
{string} anything
{char} anything
{name} anything
{ws} anything
{any} anything
%%
var.g:
%captures
%token Name Keyword String Whitespace
%%
start: Name opt_template Whitespace (Name) opt_ws ';';
opt_template: %empty | '<' name '>';
name: Name | name '::' Name;
opt_ws: %empty | Whitespace;
%%
name [A-Z_a-z]\w*
%%
; ';'
< '<'
> '>'
:: '::'
#{name} Keyword
break Keyword
CExtDllState Keyword
CShellManager Keyword
CWaitCursor Keyword
continue Keyword
delete Keyword
enum Keyword
false Keyword
goto Keyword
namespace Keyword
new Keyword
return Keyword
throw Keyword
VTS_[0-9A-Z_]* Keyword
{name} Name
\"([^"\\\r\n]|\\.)*\" String
R\"\((?s:.)*?\)\" String
\s+ Whitespace
\/\/.* skip()
"" skip()
%%
All of these example configs are available in the zip with a .g extension.
Linux/g++
There is now a Makefile which allows you to build on Linux.
History
- 18/07/2017: Created
- 10/09/2017: Reworked so that searches can be pipelined
- 22/09/2017: Now finds all matches within a sub-search
- 24/09/2017: Added -v support
- 26/09/2017: Fixed config files
- 20/10/2017: Now ignoring zero length files
- 21/10/2017: Fixed {} handling
- 11/12/2017: Now supports C style comments in sections before regex macros
- 16/12/2017: Slight fix to parse() in main.cpp and introduced last_productions_ in parsertl/search.hpp
- 03/02/2018: Added match count and now outputting match with context
- 21/03/2018: Updated lexertl and parsertl libraries
- 14/04/2018: Added -V support
- 26/04/2018: Fixed out of bounds checking for substr()
- 17/05/2018: Took filesystem out of experimental (needs g++ 8.1 or latest VC++)
- 23/09/2018: Now supporting EBNF syntax
- 23/09/2018: Updated parsertl
- 06/10/2018: Updated parsertl
- 02/03/2019: Added checkout/replacement ability
- 28/08/2019: Added modifying actions for grammars
- 01/09/2019: Added replace_all()
- 07/09/2019: Added boost::format to std::format conversion example
- 22/09/2019: Added -exclude switch
- 04/10/2019: Added warnings for unused tokens
- 05/10/2019: Fixed line number reporting for unknown token names
- 17/01/2020: Updated lexertl
- 20/01/2020: Updated lexertl
- 23/03/2020: Added negated wildcard support (as per VS 2019 16.5) and now loading/saving UTF-16 correctly
- 24/03/2020: Fixed -exclude logic
- 02/05/2020: Added lexer state support
- 03/05/2020: Now checking all lexer states for missing ids before issuing warnings
- 04/05/2020: Added -writable flag
- 28/06/2020: Reworked article text
- 28/06/2020: Added lexer states example
- 28/06/2020: Updated zip with config files
- 30/06/2020: RawString can be multi-line, added RawString support to if.g
- 01/07/2020: Updated samples in zip again
- 03/07/2020: Added Source Control section to article
- 04/07/2020: Tweaks to switch explanations in article and help text in zip
- 04/07/2020: Added example startup and shutdown strings to article
- 16/07/2020: Added SQL INSERT check example
- 29/07/2020: -exclude, -f, -vf and -Vf can now take a semi-colon separated list of wildcards instead of just one
- 24/08/2020: Wildcards are now case sensitive for non-Windows builds
- 25/08/2020: Corrected negation character in wildcard character sets
- 07/09/2020: Pathnames can now be semicolon separated
- 20/12/2020: Added Unicode support with -utf8 switch
- 20/12/2020: Fixed bug where if a literal filename was passed with no path then it was not loaded
- 30/01/2021: Fixed literal filename bug for recursive mode
- 12/05/2021: Added -l switch and brought libraries up to date
- 30/05/2021: Added regex capture support for -replace ($0-$9)
- 03/06/2021: Added $0 capture support for -E switch (also fixed usage of -replace with -v switches)
- 01/07/2021: Updated parsertl (now consumes less memory)
- 03/07/2021: Now supporting stdin so that input can be piped in
- 08/07/2021: Added error message when combining stdin and -o
- 08/07/2021: Removed unnecessary newlines from strings
- 04/08/2021: Added -force_write switch
- 04/08/2021: -r is now order independent as it always should have been
- 06/08/2021: Fix to -l processing
- 07/08/2021: Added %option caseless directive
- 03/09/2021: Fixed -Wall warnings
- 01/11/2021: Updated article text
- 07/11/2021: More wildcard processing fixes
- 17/12/2021: Fix to boost::format() script
- 14/03/2022: When in cin mode, don't output the line number in matches
- 17/03/2022: Fix to end marker when doing a negated grammar search
- 21/01/2023: Added %captures support
- 30/01/2023: Pass iterators by reference in parsertl::search()
- 31/01/2023: Fixed lexer iterator guards in parsertl functions
- 27/02/2023: Upgraded to Unicode 15.1.0, added -hits switch
- 23/05/2023: Added skip_permission_denied to directory iterators
- 30/07/2023: Enabled %prec support and added C style comment support to the regex macros section (must occur before the macro name)
- 05/08/2023: Added new regex rule, corrected Charset regex
- 08/08/2023: Added missing backslashes to more regexes
- 09/08/2023: Added more missing backslashes to regexes
- 25/08/2023: Bug fix to lexertl regex macro handling of BOL and EOL, parsertl sm construction speedup
- 09/11/2023: Added new switches -Ee, -fe, -Pe, -T, -vT and -VT
- 10/11/2023: Corrected flags for -T
- 12/11/2023: Added basic -i support for -T switches
- 18/11/2023: -T, -vT, -VT searches now do automatic whole word matching as appropriate
- 12/12/2023: Updated searching in strings example as includes now use angle brackets
- 27/12/2023: Split source into multiple files
- 15/02/2024: Updated to use lexertl17 and parsertl17
- 13/03/2024: Fixed grammar for \x handling in strings and characters. Thanks mingodad@github!
- 06/04/2024: Updated comments.g to exclude C++ strings
- 19/04/2024: Corrected usage of negated wildcards
I started programming in 1983 using Sinclair BASIC, then moved on to Z80 machine code and assembler. In 1988 I programmed 68000 assembler on the ATARI ST and it was 1990 when I started my degree in Computing Systems where I learnt Pascal, C and C++ as well as various academic programming languages (ML, LISP etc.)
I have been developing commercial software for Windows using C++ since 1994.