gram_grep: grep for the 21st century

18 Jul 2017, CPOL

Introduction

gram_grep is a grep tool that, as well as supporting normal searches, lets you search using a lexer specification or even a full-blown grammar.

gram_grep has three main modes of operation:

  • Classic Mode
  • Lexer Mode
  • Parser Mode

It supports the following switches (this help is available with the --help switch):

  • --help (Shows help).
  • -checkout <checkout command (include $1 for pathname)>.
  • -E <regex> (Search using DFA regex).
  • -exclude <wildcard> (exclude any pathname matching wildcard).
  • -f <config file> (Search using config file).
  • -i (Case insensitive searching).
  • -o (Update matching file).
  • -P <regex> (Search using Perl regex).
  • -r, -R, --recursive (Recurse subdirectories).
  • -replace (Replacement text).
  • -shutdown <command to run when exiting>.
  • -startup <command to run at startup>.
  • -vE <regex> (Search using DFA regex, negated: match all text other than regex).
  • -VE <regex> (Search using DFA regex, all negated: match if regex not found).
  • -vf <config file> (Search using config file, negated: match all text other than regex).
  • -Vf <config file> (Search using config file, all negated: match if regex not found).
  • -vP <regex> (Search using Perl regex, negated: match all text other than regex).
  • -VP <regex> (Search using Perl regex, all negated: match if regex not found).
  • <pathname>... (Files to search (wildcards supported)).
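
The difference between the negated (-v) and all negated (-V) switches is easy to miss, so here is a minimal Python sketch of one reading of the help text above. It is not gram_grep's code: the function names `negated_matches` and `all_negated_match` are hypothetical, and Python's `re` module stands in for both the DFA and Perl regex engines.

```python
import re

def negated_matches(pattern: str, text: str) -> list[str]:
    """-v style (assumed): return the stretches of text the regex did NOT match."""
    pieces, pos = [], 0
    for m in re.finditer(pattern, text):
        if m.start() > pos:
            pieces.append(text[pos:m.start()])
        pos = m.end()
    if pos < len(text):
        pieces.append(text[pos:])
    return pieces

def all_negated_match(pattern: str, text: str) -> bool:
    """-V style (assumed): the input matches only if the regex is found nowhere."""
    return re.search(pattern, text) is None

print(negated_matches(r"\d+", "abc123def45"))    # ['abc', 'def']
print(all_negated_match(r"\d+", "abc123def45"))  # False
print(all_negated_match(r"\d+", "abcdef"))       # True
```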

The config file has the following format:

<grammar directives>
%%
<grammar>
%%
<regex macros>
%%
<regexes>
%%

The grammar directives, grammar and regex macros are all optional.
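
To make the layout concrete, here is a small Python sketch that splits a config file into its four sections on the `%%` separator lines. It only illustrates the format described above and is not how gram_grep itself reads the file; the function name `split_config` and the section keys are hypothetical, and it assumes each section is terminated by a line containing only `%%`.

```python
def split_config(text: str) -> dict[str, str]:
    """Split a config into its four %%-terminated sections (illustrative only)."""
    names = ["directives", "grammar", "macros", "regexes"]
    sections, current = [], []
    for line in text.splitlines():
        if line.strip() == "%%":
            # A separator line closes the section collected so far.
            sections.append("\n".join(current))
            current = []
        else:
            current.append(line)
    if len(sections) != len(names):
        raise ValueError(f"expected {len(names)} '%%' lines, found {len(sections)}")
    return dict(zip(names, sections))

# The three optional sections are simply left empty.
example = "%%\n%%\n%%\nmemory_file 1\n%%\n"
print(split_config(example)["regexes"])  # memory_file 1
```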

The following grammar directives are supported:

By default the entire grammar will match. However, there are times when you are only interested in whether specific parts of your grammar match. To match only on particular grammar rules, use {} just before the terminating semicolon for that rule. This technique is shown in the second example below.

It is also possible to build a string to be searched using some very basic scripting commands.

The commands are:

  • erase($n);
  • erase($from, $to);
  • erase($from.second, $to.first);
  • insert($n, 'text');
  • insert($n.second, 'text');
  • match = $n;
  • match = substr($n, <omit from left>, <omit from right>);
  • match += $n;
  • match += substr($n, <omit from left>, <omit from right>);
  • replace($n, 'text');
  • replace($from, $to, 'text');
  • replace($from.second, $to.first, 'text');
  • replace_all($n, 'regex', 'text');
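
As a rough mental model for the modifying commands, the Python sketch below treats each `$n` as a span of offsets into the source text and applies the collected edits from right to left so that earlier offsets stay valid. This is an assumption for illustration, not gram_grep's code: the `Span` class, the helper names, and the reading of `.first`/`.second` as the start and one-past-the-end offsets of a token are all hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Span:
    first: int   # offset of the token's first character (assumed meaning)
    second: int  # offset one past the token's last character (assumed meaning)

# Each helper returns an edit as (start, end, replacement).
def erase(span):               return (span.first, span.second, "")    # erase($n)
def erase_between(a, b):       return (a.second, b.first, "")          # erase($from.second, $to.first)
def insert_before(span, text): return (span.first, span.first, text)   # insert($n, 'text'), assumed
def insert_after(span, text):  return (span.second, span.second, text) # insert($n.second, 'text')
def replace(span, text):       return (span.first, span.second, text)  # replace($n, 'text')

def apply_edits(text: str, edits) -> str:
    """Apply edits right to left so untouched offsets remain correct."""
    for start, end, replacement in sorted(edits, reverse=True):
        text = text[:start] + replacement + text[end:]
    return text

source = "f(a, b)"
name, a, b = Span(0, 1), Span(2, 3), Span(5, 6)
edits = [replace(name, "g"), erase_between(a, b), insert_before(b, " + ")]
print(apply_edits(source, edits))  # g(a + b)
```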

If you want certain regexes to be skipped, use the skip() command. For other regexes (unless you are using a grammar), just use an unsigned integer greater than 0 (it would be wise to simply stick with 1).

When using a grammar, you will need to specify your tokens. Use these token names (case sensitive) for your regexes in this case.

Examples

Looking for text outside of strings and comments

%%
%%
%%
[\"]([^\"\\]|\\.)*[\"]            skip()
[/][/].*|[/][*].{+}[\r\n]*?[*][/] skip()
memory_file                       1
%%

Note that if we wanted to only search in strings or comments, we would use 1 instead of skip() for those regexes and omit the memory_file line altogether. We would then pass memory_file with -E or -P as a command line parameter.
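
The same skip idea can be approximated in plain Python, which may help if you have not used a lexer generator before. In this sketch strings and comments are matched first and thrown away, and only the last alternative is reported. It uses Python's `re` syntax rather than the regex syntax of the config above, and the function name `find_outside_strings_and_comments` is hypothetical.

```python
import re

SCANNER = re.compile(
    r'"(?:[^"\\]|\\.)*"'   # string literal: matched, then ignored (like skip())
    r'|//.*'               # line comment: matched, then ignored
    r'|/\*[\s\S]*?\*/'     # block comment: matched, then ignored
    r'|(memory_file)'      # the text we actually want, captured as group 1
)

def find_outside_strings_and_comments(source: str) -> list[int]:
    """Return the offsets of memory_file that are not in a string or comment."""
    return [m.start(1) for m in SCANNER.finditer(source) if m.group(1) is not None]

code = 'open("memory_file"); // memory_file\nmemory_file mf; /* memory_file */'
print(find_outside_strings_and_comments(code))  # [36]
```

Because the scanner consumes a whole string or comment in one match, any occurrence inside it can never be reported on its own.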

Looking for uninitialised variables in headers

%token Bool Char Name NULLPTR Number String Type
%%
start: decl;
decl: Type list ';';
list: item | list ',' item;
item: Name {};
item: Name '=' value;
value: Bool | Char | Number | NULLPTR | String;
%%
NAME  [_A-Za-z][_0-9A-Za-z]*
%%
=                                               '='
,                                               ','
;                                               ';'
true|TRUE|false|FALSE                           Bool
nullptr                                         NULLPTR
BOOL|BSTR|BYTE|COLORREF|D?WORD|DWORD_PTR        Type
DROPEFFECT|HACCEL|HANDLE|HBITMAP|HBRUSH         Type
HCRYPTHASH|HCRYPTKEY|HCRYPTPROV|HCURSOR|HDBC    Type
HICON|HINSTANCE|HMENU|HMODULE|HSTMT|HTREEITEM   Type
HWND|LPARAM|LPCTSTR|LPDEVMODE|POSITION|SDWORD   Type
SQLHANDLE|SQLINTEGER|SQLSMALLINT|UINT|U?INT_PTR Type
UWORD|WPARAM                                    Type
bool|(unsigned\s+)?char|double|float            Type
(unsigned\s+)?int((32|64)_t)?|long|size_t       Type
{NAME}(\s*::\s*{NAME})*(\s*[*])+                Type
{NAME}                                          Name
-?\d+([.]\d+)?                                  Number
'([^'\\]|\\.)*'                                 Char
["]([^\"\\]|\\.)*["]                            String
[ \t\r\n]+|[/][/].*|[/][*].{+}[\r\n]*?[*][/]    skip()
%%
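
For readers who want to see the effect of the `{}` marker without running the tool, here is a heavily simplified Python sketch of the same idea: tokenise the source, look for `Type item (',' item)* ';'`, and report only the names that have no initialiser. It is not gram_grep's algorithm. The token set is cut down to a handful of built-in types, and the names `tokenize`, `parse_decl` and `find_uninitialised` are hypothetical.

```python
import re

TOKENS = re.compile(r"""
    (?P<skip>\s+|//.*|/\*[\s\S]*?\*/)
  | (?P<Type>\b(?:bool|char|double|float|int|long|size_t)\b)
  | (?P<Bool>\b(?:true|false)\b)
  | (?P<Name>[_A-Za-z][_0-9A-Za-z]*)
  | (?P<Number>-?\d+(?:\.\d+)?)
  | (?P<punct>[=,;])
  | (?P<other>.)
""", re.VERBOSE)

def tokenize(source):
    """Return (kind, text) pairs, dropping whitespace and comments."""
    tokens = []
    for m in TOKENS.finditer(source):
        kind = m.lastgroup
        if kind == "skip":
            continue
        if kind == "punct":
            kind = m.group()  # use the character itself as the token kind
        tokens.append((kind, m.group()))
    return tokens

def parse_decl(tokens, i):
    """Try to parse one declaration at index i; return (names, next_index) or None."""
    def kind(j):
        return tokens[j][0] if j < len(tokens) else None
    if kind(i) != "Type":
        return None
    i += 1
    found = []
    while True:
        if kind(i) != "Name":
            return None
        name = tokens[i][1]
        i += 1
        if kind(i) == "=":                       # item: Name '=' value;
            if kind(i + 1) not in ("Bool", "Number"):
                return None
            i += 2
        else:                                    # item: Name {};
            found.append(name)
        if kind(i) == ";":
            return found, i + 1
        if kind(i) != ",":
            return None
        i += 1

def find_uninitialised(source):
    tokens, names, i = tokenize(source), [], 0
    while i < len(tokens):
        result = parse_decl(tokens, i)
        if result is None:
            i += 1                               # no declaration here, move on
        else:
            found, i = result
            names.extend(found)
    return names

print(find_uninitialised("int a, b = 1, c;\nfloat d = 2.5;\nint f(int x);"))  # ['a', 'c']
```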

Searching for SQL MERGE commands without WITH(HOLDLOCK) within strings only

First the string extraction (strings.g):

%token RawString String
%%
list: String { match = substr($1, 1, 1); };
list: RawString { match = substr($1, 3, 2); };
list: list String { match += substr($2, 1, 1); };
list: list RawString { match += substr($2, 3, 2); };
%%
%%
[\"]([^\"\\]|\\.)*[\"]                       String
R[\"][(].*?[)][\"]                           RawString
'([^'\\]|\\.)*'                              skip()
[ \t\r\n]+|[/][/].*|[/][*].{+}[\r\n]*?[*][/] skip()
%%

Or if we wanted to scan C#:

%token String VString
%%
list: String { match = substr($1, 1, 1); };
list: VString { match = substr($1, 2, 1); };
list: list '+' String { match += substr($3, 1, 1); };
list: list '+' VString { match += substr($3, 2, 1); };
%%
ws [ \t\r\n]+
%%
[+]                                    '+'
[\"]([^"\\]|\\.)*[\"]                  String
@[\"]([^\"]|[\"][\"])*["]              VString
'([^'\\]|\\.)*'                        skip()
{ws}|[/][/].*|[/][*].{+}[\r\n]*?[*][/] skip()
%%
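
The numeric arguments to substr() in the string-extraction configs above are the number of characters to drop from each end of the token, which is what strips the quotes. The Python sketch below mimics that for the three literal styles used. The function is a stand-in, not gram_grep's code, and returning an empty string when the offsets overlap is my assumption.

```python
def substr(token: str, omit_left: int, omit_right: int) -> str:
    """Drop omit_left characters from the front and omit_right from the back."""
    end = len(token) - omit_right
    # Assumption: overlapping offsets yield an empty string rather than an error.
    return token[omit_left:end] if omit_left <= end else ""

print(substr('"abc"', 1, 1))     # abc  (ordinary string: one quote each side)
print(substr('R"(abc)"', 3, 2))  # abc  (raw string: R"( on the left, )" on the right)
print(substr('@"abc"', 2, 1))    # abc  (C# verbatim string: @" on the left)

# match = ... followed by match += ... joins adjacent literals into one string.
match = substr('"SELECT "', 1, 1)
match += substr('"1"', 1, 1)
print(match)                     # SELECT 1
```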

Now the grammar to search inside the strings (merge.g):

%token AS Integer INTO MERGE Name PERCENT TOP USING
%%
merge: MERGE opt_top opt_into name opt_alias USING;
opt_top: %empty | TOP '(' Integer ')' opt_percent;
opt_percent: %empty | PERCENT;
opt_into: %empty | INTO;
name: Name | Name '.' Name | Name '.' Name '.' Name;
opt_alias: %empty | opt_as Name;
opt_as: %empty | AS;
%%
%%
(?i:AS)                                               AS
(?i:INTO)                                             INTO
(?i:MERGE)                                            MERGE
(?i:PERCENT)                                          PERCENT
(?i:TOP)                                              TOP
(?i:USING)                                            USING
\.                                                    '.'
\(                                                    '('
\)                                                    ')'
\d+                                                   Integer
(?i:[a-z_][a-z0-9@$#_]*|\[[a-z_][a-z0-9@$#_]*[ ]*\])  Name
\s+                                                   skip()
%%

The command line looks like this:

gram_grep -f strings.g -f merge.g test.txt
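
Passing two -f switches pipelines the searches: the second config only sees the text produced by the first. A rough Python equivalent of this two-stage search is sketched below. It is an approximation under several assumptions: it handles ordinary string literals only, ignores the TOP clause, uses a single regex in place of the merge grammar, and the names `extract_strings` and `unhinted_merges` are hypothetical.

```python
import re

STAGE1 = re.compile(r"""
    (?P<string>"(?:[^"\\]|\\.)*")
  | (?P<skip>\s+|//.*|/\*[\s\S]*?\*/|'(?:[^'\\]|\\.)*')
  | (?P<other>.)
""", re.VERBOSE)

# MERGE [INTO] name [[AS] alias] USING, i.e. no WITH (HOLDLOCK) before USING.
MERGE_WITHOUT_HINT = re.compile(
    r"\bMERGE\s+(?:INTO\s+)?[\w.\[\]]+"     # target table name
    r"(?:\s+(?:AS\s+)?(?!USING\b)\w+)?"     # optional alias
    r"\s+USING\b",
    re.IGNORECASE,
)

def extract_strings(source):
    """Stage 1: join adjacent string literals, as strings.g does with match +=."""
    strings, current = [], []
    for m in STAGE1.finditer(source):
        if m.lastgroup == "string":
            current.append(m.group()[1:-1])   # strip the surrounding quotes
        elif m.lastgroup == "other" and current:
            strings.append("".join(current))  # any other token ends the run
            current = []
    if current:
        strings.append("".join(current))
    return strings

def unhinted_merges(source):
    """Stage 2: search only inside the strings produced by stage 1."""
    return [s for s in extract_strings(source) if MERGE_WITHOUT_HINT.search(s)]

code = '''
auto a = "MERGE INTO t "
         "USING s ON 1=1";   // "MERGE x USING y" in a comment is ignored
auto b = "MERGE INTO t WITH (HOLDLOCK) USING s ON 1=1";
'''
print(unhinted_merges(code))  # ['MERGE INTO t USING s ON 1=1']
```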

All of these example configs are available in the zip with a .g extension.

Automatically Converting boost::format to std::format

%token Integer Name RawString String
%%
start: '(' format list ')' '.' 'str' '(' ')' { erase($1);
    erase($5, $8); };
start: 'str' '(' format list ')' { erase($1, $2); };
format: 'boost' '::' 'format' '(' string ')' { replace($1, 'std');
    replace_all($5, '%(\d+[Xdsx])', '{:$1}');
    replace_all($5, '%((?:\d+)?\.\d+f)', '{:$1}');
    replace_all($5, '%x', '{:x}');
    replace_all($5, '%[ds]', '{}');
    replace_all($5, '%%', '%');
    erase($6); };
string: String;
string: RawString;
string: string String;
string: string RawString;
list: %empty;
list: list '%' param { replace($2, ', '); };
param: Integer;
param: name { replace_all($1, '\.c_str\(\)$', ''); };
name: Name opt_func
    | name deref Name opt_func;
opt_func: %empty | '(' opt_param ')';
deref: '.' | '->' | '::';
opt_param: %empty | Integer | name;
%%
%%
\(                              '('
\)                              ')'
\.                              '.'
%                               '%'
::                              '::'
->                              '->'
boost                           'boost'
format                          'format'
str                             'str'
-?\d+                           Integer
\"([^"\\]|\\.)*\"               String
R\"\(.*?\)\"                    RawString
'([^'\\]|\\.)*'                 skip()
[_a-zA-Z][_0-9a-zA-Z]*          Name
\s+|\/\/.*|\/\*.{+}[\r\n]*?\*\/ skip()
%%
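
To check what the chain of replace_all() calls does to a format string, the same substitutions can be tried in Python. This sketch applies the patterns from the config in the same order using `re.sub`, with Python's `\1` in place of `$1`, and it assumes the `%%` rule is meant to apply to the format string. The function name `convert_format_string` is hypothetical, and it only covers the format string itself, not the rewriting of the `%` argument list.

```python
import re

# (pattern, replacement) pairs in the same order as the config above.
RULES = [
    (r"%(\d+[Xdsx])", r"{:\1}"),        # %5d   -> {:5d}
    (r"%((?:\d+)?\.\d+f)", r"{:\1}"),   # %.2f  -> {:.2f}
    (r"%x", "{:x}"),                    # %x    -> {:x}
    (r"%[ds]", "{}"),                   # %d %s -> {}
    (r"%%", "%"),                       # %%    -> %
]

def convert_format_string(fmt: str) -> str:
    """Rewrite boost::format directives as std::format replacement fields."""
    for pattern, replacement in RULES:
        fmt = re.sub(pattern, replacement, fmt)
    return fmt

print(convert_format_string("%5d items, %.2f%% done, id=%x, name=%s"))
# {:5d} items, {:.2f}% done, id={:x}, name={}
```

The order matters: the width and precision forms are handled before the bare `%d`/`%s` rule, and `%%` is unescaped last.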

The command line looks like this:

gram_grep -o -r -f format.g *.cpp

If you are using TFS and you need to check out the files, you could add:

-checkout "tf.exe checkout $1"

Linux/g++

I used the following command line to build under Linux/g++:

g++ -o gram_grep main.cpp -std=c++17 -lstdc++fs

History

18/07/2017 Created.

10/09/2017 Reworked so that searches can be pipelined.

22/09/2017 Now finds all matches within a sub-search.

24/09/2017 Added -v support.

26/09/2017 Fixed config files.

20/10/2017 Now ignoring zero length files.

21/10/2017 Fixed {} handling.

11/12/2017 Now supports C style comments in sections before regex macros.

16/12/2017 Slight fix to parse() in main.cpp and introduced last_productions_ in parsertl/search.hpp.

03/02/2018 Added match count and now outputting match with context.

21/03/2018 Updated lexertl and parsertl libraries.

14/04/2018 Added -V support.

26/04/2018 Fixed out of bounds checking for substr().

17/05/2018 Took filesystem out of experimental (needs g++ 8.1 or latest VC++).

23/09/2018 Now supporting EBNF syntax.

23/09/2018 Updated parsertl.

06/10/2018 Updated parsertl.

02/03/2019 Added checkout/replacement ability.

28/08/2019 Added modifying actions for grammars.

01/09/2019 Added replace_all().

07/09/2019 Added boost::format to std::format conversion example.

22/09/2019 Added -exclude switch.

04/10/2019 Added warnings for unused tokens.

05/10/2019 Fixed line number reporting for unknown token names.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL).

About the Author

Ben Hanson
Software Developer (Senior)
United Kingdom
I started programming in 1983 using Sinclair BASIC, then moving on to Z80 machine code and assembler. In 1988 I programmed 68000 assembler on the ATARI ST and it was 1990 when I started my degree in Computing Systems where I learnt Pascal, C and C++ as well as various academic programming languages (ML, LISP etc.)

I have been developing commercial software for Windows using C++ for 22 years.
