Click here to Skip to main content
14,268,557 members

Constructing Fast Lexical Analyzers with RE/flex - Why Another Scanner Generator?

Rate this:
5.00 (28 votes)
Please Sign up or sign in to vote.
5.00 (28 votes)
8 Aug 2019BSD
This article introduces RE/flex for C++. RE/flex generates fast lexical analyzers similar to Flex, but supports Unicode patterns, indent/nodent/dedent, lazy repeats, smart input text handling, and performance tuning. RE/flex is compatible with Bison/Yacc and accepts Flex lexer specifications.


In this article I will introduce the RE/flex lexical analyzer generator. The RE/flex open source project was motivated by the possibility to build a scanner generator based on an entirely different approach to tokenization that permits regex libraries to be used in the generated scanners. This approach provides additional flexibility to use a rich regex syntax with a choice of POSIX-mode (and even Perl-mode!) regex engines. Another motivation was the challenge to come up with an algorithm to construct deterministic finite automata (DFAs) for efficient pattern matching with lazy quantifiers (a.k.a. non-greedy repeats, lazy repeats, reluctant repeats). This algorithm is also new. My third goal was to make RE/flex faster than Flex, at least for typical cases. I knew that was a challenge because Flex (the fast lexical analyzer) has been around a long time and the contributors to Flex made it very fast. Performance tests show that RE/flex is faster than Flex and faster than other regex libraries.

Why RE/flex?

RE/flex is a new fast and flexible regex-based scanner generator for C++. It shares many similarities with Flex (the fast lexical analyzer) and Lex, but is fundamentally different in its overall design. This new design makes RE/flex faster than Flex and/or enables a more expressive regex syntax to be used in lexer specifications. It also adds flexibility by permitting other regex libraries to be used, such as Boost.Regex and the RE/flex regex engine, and perhaps other regex libraries in future releases.

This powerful new design is based on the fact that any regex engine can in principle be used to tokenize input: given a set of n patterns pi, i=1,...,n, a regex of the form "(p1) | (p2) | ... | (pn)" can be used to match and tokenize the input. The group capture index i of a matching pattern pi that is returned by the regex matcher identifies the pattern pi that matched. If a pattern uses a group (X) then that group must be converted to a non-capturing group of the form (?:X) before we can use it in our patterns. We then use repeated partial matching in our scanner to advance over the input to tokenize:

int Lexer::lex()
  if (!has_matcher())
    matcher("(p1)|(p2)|...|(pn)"); // new matcher engine if not set up already
  while (true)
    if (... EOF reached ...)
      return 0;
    switch (matcher().scan()) // scan and match next token, get capture index
      case 1: // pattern p1 matched
        ... // Action for pattern 1 ...
      case 2: // pattern p2 matched
        ... // Action for pattern 2 ...
      ... // etc ...
        ... // no match: this is the "default rule" to echo input character ...

The scanner source code generated by the reflex tool is structured in this way and is therefore very different from the source code generated by Flex. The generated source code is not only humanly readable, it also runs faster for typical scanner applications with reflex option --fast.

That is, a very fast DFA for pattern matching is generated with the reflex --fast option. The generated DFA is expressed in direct code instead of in tables. For example, the following hard-coded DFA that is generated by reflex for pattern "\w+" runs very efficiently to match words:

void reflex_code_FSM(reflex::Matcher& m)
  int c0 = 0, c1 = c0;
  c0 = c1, c1 = m.FSM_CHAR();
  if (97 <= c1 && c1 <= 122) goto S5;
  if (c1 == 95) goto S5;
  if (65 <= c1 && c1 <= 90) goto S5;
  if (48 <= c1 && c1 <= 57) goto S5;
  return m.FSM_HALT(c1);
  c0 = c1, c1 = m.FSM_CHAR();
  if (97 <= c1 && c1 <= 122) goto S5;
  if (c1 == 95) goto S5;
  if (65 <= c1 && c1 <= 90) goto S5;
  if (48 <= c1 && c1 <= 57) goto S5;
  return m.FSM_HALT(c1);

Another advantage of RE/flex is the rich regex pattern syntax offered by Boost.Regex and also by the new RE/flex regex library that is included with the RE/flex software. The RE/flex regex library supports Unicode, indent/nodent/dedent anchors, lazy quantifiers, word boundaries, and more.

How Fast is RE/flex?

RE/flex is faster than Flex when comparing Flex with the best performance options. RE/flex is also faster than Boost.Spirit.Lex, and much faster than popular regex libraries, such as Boost.Regex, C++11 std::regex, PCRE2 and RE2. For example, tokenizing a representative C source code file into 244 tokens takes only 11 microseconds:

Command / Function Software Time (μs)
reflex --fast RE/flex 1.3.5 11
flex -+ --full Flex 2.5.35 17
reflex --full RE/flex 1.3.5 19
boost::spirit::lex::lexertl::actor_lexer::iterator_type Boost.Spirit.Lex 1.66.0 40
hs_compile_multi(), hs_scan() Hyperscan 5.1.0 209
reflex -m=boost-perl Boost.Regex 1.66.0 230
pcre2_match() PCRE2 (pre-compiled) 10.30 318
RE2::Consume() RE2 (pre-compiled) 2018-04-01 417
reflex -m=boost Boost.Regex POSIX 1.66.0 450
RE2::Consume() RE2 POSIX (pre-compiled) 2018-04-01 1226
flex -+ Flex 2.5.35 3968
std::regex::cregex_iterator() C++11 std::regex 5979

The table shows the best times of 10 tests with average time in microseconds over 100 runs using clang 9.0.0 with -O2, 2.9 GHz Intel Core i7, 16 GB 2133 MHz LPDDR3. Hyperscan disqualifies as a potential scanner due to its "All matches reported" semantics resulting in 1915 matches for this test and problems with its event handler requirements. Download the tests here.

To create this comparison, I used RE/flex option --regexp-file to generate the regex string and used it with each regex library to tokenize a string of C source code stored in memory, thus avoiding the unpredictable overhead of file reads. This regex string has n=56 patterns of the form "(p1) | (p2) | ... | (pn)" where each pi is a pattern to match C/C++ token lexemes. The patterns use non-capturing groups (?:...) to replace parenthesis, which allows the capture group index to identify the token matched. In this way, I made the performance comparisons as fair as possible by measuring the time it takes to tokenize the input stored in memory instead of in files, i.e. to measure raw performance without IO overhead.

Compatibility of RE/flex with Flex and Bison/Yacc

When designing RE/flex using the aforementioned approach, I wanted to make sure that the project is compatible with Bison and Yacc parsers. These tools are popular and are typically pre-installed with Linux and Unix distributions. I also wanted to make sure that RE/flex accepts standard Flex specifications (using RE/flex compatibility option --flex). This means that if you are already familiar with Flex or Lex, then you will find RE/flex easy to use. To generate a yyFlexLexer class that has the same methods as a Flex++ yyFlexLexer class, use reflex option --flex.

RE/flex is for C++ projects only. Using C++ permitted a significant number of additions to the lexer specification language and permitted the introduction of many new methods that can be used in rule actions, as I will explain further in the next section. However, you can still use RE/flex with C code, such as Bison-generated parsers that expect a global yylex() function. To do so, use the --flex and --bison command-line options:

reflex --flex --bison lexspec.l

RE/flex also supports options to build Bison reentrant parsers, and bison-bridge and bison-locations parsers. These advanced parsers are MT-safe and manage the locations of syntax errors in cooperation with the lexer.

To get you started with RE/flex, you will find several examples in the RE/flex download package from GitHub and SourceForge. The examples include Bison parsers, from very basic to more advanced bison-bridge and bison-location parsers.

The Structure of a RE/flex Lexer Specification

The structure of a lexer specification is the same as Flex, but with some additions. A lexer specification consists of three parts separated by two lines with %%: the Definitions section, the Rules section, and the User code section.

  1. The Definitions section defines named patterns. That is, commonly used regex patterns can be named and used in other patterns as macros. This section may also contain user code: a %top block with code that goes all the way to the top of the generated scanner source code and the usual %{ code in between %} that is copied to the scanner's source code. A %class block defines the lexer class member variables and functions. The %init block defines the lexer class constructor code for initializations. We can also add one or more %option to configure the scanner.
  2. The Rules section defines pattern-action pairs. Actions are written in C++ and are triggered when the pattern matches.
  3. The User code section at the end of the specification can be used to add any C++ code necessary. For stand-alone scanners, this typically has a main() function defined.

The following outlines the basic structure of a RE/flex lexer specification:

// DEFINITIONS go here

  #include <iostream>
  // ... other stuff to #include <...>

  // ... other stuff to declare in C++

// customize the generated Lexer class (optional):
  void a_public_function();
  std::string a_private_string;

// customize the generated Lexer class default constructor (optional):
  a_private_string = "initialized";

// enable Unicode matching:
%option unicode

// add a named pattern:
newline  \n


// RULES go here, where a rule is a pattern and action

"foo"      { a_public_function(); }
"bar"      { a_private_string = text(); }
{newline}  // skip newline
.          { std::cerr << "unknown character at " << lineno() << ":" << columno(); }


// USER CODE goes here

int main()
  Lexer lexer;
  while (lexer.lex() != 0)

Patterns are common regular expressions, but extended with special Lex syntax features, which means that the named pattern {newline} is expanded into (?:\n), the quoted text "foo" is a literal match of foo, and a slash / (not shown in the example) is a lookahead pattern.

This example shows three built-in functions used in the rule actions: text() that returns the 0-terminated char text matched, lineno(), and columno(). The first two are the same as yytext and yylineno in Flex. Flex-compatible actions are enabled with reflex option --flex.

The following table shows some of the actions of a long list of functions that can be used in rule actions:

RE/flex action Flex action Result
text() YYText(), yytext 0-terminated text match
str() n/a std::string text match
wstr() n/a std::wstring text match
size() YYLeng(), yyleng size of the match in bytes
wsize() n/a number of wide chars matched
lineno() yylineno line number of match (>=1)
columno() n/a column number of match (>=0)
echo() ECHO echo text match
start(n) BEGIN n set start condition to n

The classic Flex actions listed in the table can be used in a RE/flex lexer specification with reflex option --flex. Because the matcher library object is accessible in the rule actions via matcher(), you can use it in your rule actions for added power. Here are some useful example matcher() methods:

RE/flex action Flex action Result
matcher().input() yyinput() get next char from input
matcher().winput() n/a get wide character from input
matcher().unput(c) unput(c) put back 8-bit char
matcher().more() yymore() concat the next match to this match
matcher().less(n) yyless(n) shrink match length to n

These are only examples. For a complete list, see the RE/flex user manual.

Unicode Pattern Matching

A scanner generator that cannot handle Unicode is as good as using a typewriter for texting on a mobile phone. Most modern programming languages have Unicode lexical structures that require Unicode-capable scanners to construct parsers and compilers. At least the widely-used UTF-8 encoding format should be natively supported. Scanner generators produce lexers that scan files (in addition to 8-bit strings and perhaps wide strings), which makes them different from regex engines (i.e., many regex libraries can be used with to wide strings). To scan files that contain with Unicode wide characters, the files have to be decoded from UTF-8, UTF-16, or full UTF-32 (UCS-4) formats. This decoding is done automatically on the fly in RE/flex.

Unicode pattern matching is enabled with reflex command-line option --unicode or with %option unicode in a lexer specification:

%option unicode

With this option, the . (dot), \w, \s, \l, \u, \u{XXXX}, and \p{C} patterns match Unicode. These match respectively any Unicode character (except newline \n), a Unicode word character, Unicode space, Unicode lower case letter, Unicode upper case letter, Unicode U+XXXX, and Unicode character class C.

\s+          // skip spacing
\w+          { std::cout << text() << "\n"; }
\p{Unicode}  { std::cout << "U+" << std::hex << wstr().at(0) << " \n"; }
.            { std::cerr << "invalid Unicode!\n"; }

This lexer removes all white space from a file and prints words that are composed of Unicode letters, numerical digits, and connector-punctuation such as underscores. The matched text is printed in UTF-8 format to std::cout with text() (which is the same as yytext in Flex syntax). Any other Unicode characters are printed in U+XXXX code using the first character of the wide string wstr() match. Finally, the . (dot) pattern is used to catch all else, including invalid characters that are outside of the valid Unicode range U+0000 to U+D7FF and U+E000 to U+10FFFF. When UTF-8 or UTF-16 files are scanned then an invalid UTF encoding such as an overlong form is also matched by . (dot).

To match any valid Unicode character, I recommend to use \p{Unicode}. We only want to use . (dot) for catch-all-else cases as needed. One such case is to match invalid content as shown in the example above to prevent the scanner from jamming on its input when nothing valid matches. For this reason, the . (dot) is designed to be permissive and matches valid Unicode but also matches invalid UTF-8 and UTF-16 sequences in order to catch lexical errors. The . (dot) is self-synchronizing, which means that it advances to the next character on the input after matching an invalid character.

Indent, nodent, and dedent Matching

RE/flex takes a direct approach to matching indent, nodent and dedent positions in text by introducing two new anchors, one for indent and one for dedent.

To match and set an indent stop position on a new line with margin spacing matched by ^[ \t]+, simply add the \i anchor at the end ^[ \t]+\i. This sets and matches a new indent stop when the margin spacing is increased.

To match a dedent when margin spacing is decreased, use the \j anchor ^[ \t]+\j. This matches a previous indent stop and removes it.

Because a more significant decrease in margin spacing may result in multiple indent stop positions to be removed all at once, we should also add a pattern with a single \j anchor to match each extra indent stop. This matches any extra dedents that occur on the same line, but one at a time is matched:

^[ \t]+    // nodent: text is aligned to the current indent margin
^[ \t]*\i  // indent: text is indented with new indent set and matched by \i
^[ \t]*\j  // dedent: text is dedented, matched by \j
\j         // dedent: text is dedented by an extra indent stop on the current line

Notice that matching a "nodent" line, which is aligned to the current margin, is done with a pattern that lacks the \i and \j anchors.

Let's see how this works. Say our text to scan contains the following:

def abs(n):
    if n < 0:
        n = -n
        return n
        return n

Then the four patterns we defined are matched in this order: 2, 2, 1, 3, 2, 3, 3. Note that the first line has no indent margin and no indent stop.

You are not restricted to use white space, such as [ \t]+, to define margin spacing. The margin spacing can be anything, not just spaces and tabs, as long as the pattern does not include a newline character \n.

It may be necessary to temporarily memorize the indent stops when scanning text over multiple lines by patterns that should not change the current indent stop positions, such as line continuations, multi-line comments and expressions that require "line joining" as in Python. This can be done by saving the current indent stops with matcher().push_stops() and retrieving them again with matcher().pop_stops().

For example, to scan a multi-line comment over multiple (indented) lines without affecting the current indent stops, we should save the current stops and transition to a new start condition state CONTINUE:

"/*"          matcher().push_stops(); // save the indent margin/tab stops
              BEGIN CONTINUE;         // BEGIN state transition to CONTINUE w/o indent matching
"*/"          matcher().pop_stops();  // restore the indent margin/tab stops
              BEGIN INITIAL;          // BEGIN state transition to INITIAL
.|\n          /* ignore */

We typically define start condition states, such as CONTINUE in this example, when scanning text that is matched by a different set of patterns and rules. The matcher().push_stops() method resets the indent stop positions. The positions can be accessed directly using matcher().stops() that returns a reference to a std::vector<size_t> with these positions that are ordered from low to high. You can change the vector as long as the low-to-high order is maintained.

There is a quick shortcut for matching multiple lines of text when the text has to be ignored by the scanner, such as multi-line comments and line continuations. To implement line continuations after a \ (backslash) placed at the end of a line, we must ignore the next indent. Similarly, C-style multi-line comments matched with the pattern "/*"(.|\n)*?"*/" (using a lazy repetition (.|\n)*?) should be ignored and not affect the current indent stops. To do this, we use the RE/flex (?^X) negative pattern that consumes text without actions, making the text invisible:

(?^"/*".*?"*/")  // ignore multi-line comments
(?^\\\n[ \t]*)   // lines ending in \ will continue on the next line

The negative pattern (?^X) is a RE/flex-specific addition. It is can be very handy to use to skip unwanted input, such as comments, without affecting indent stop positions. However, any action associated with a negative pattern won't be executed because negative patterns skip input and do not trigger a pattern match.

Finally, new so-called "undent" anchors \k may be used in a pattern. The \k undent anchor matches when the indent depth changed by another indent/dedent pattern at or before the position of the anchor \k in the matching input. Anchor \k restores the indent stops by undoing these changes (i.e. "undenting"). For example, to ignore multi-line comments that may be nested, we can use anchor \k as follows:

  int level;   // a variable to track the /*-comment nesting level


^[ \t]+        out() << "| ";    // nodent, text is aligned to current margin column
^[ \t]+\i      out() << "> ";    // indent
^[ \t]*\j      out() << "< ";    // dedent
\j             out() << "< ";    // dedent, triggered for each extra level dedented
[ \t]*"/*"\k?  level = 1;        // /*-comment after spacing, \k kills indent stop changes
               start(COMMENT);   // goto COMMENT
.|\n           echo();           // ECHO character

"/*"           ++level;          // allow nested /*-comments 
"*/"           if (--level == 0)
                 start(INITIAL); // goto INITIAL (back to initial state)
.|\n           // ignore all content in comments

Note that [ \t]*"/*"\k? matches the start of a /*-comment with leading white space. When this space is on the left margin it is also initially matched by the 2nd rule or the 3rd rule up to the /*, resulting in indent stop changes made by anchors \i and \j, respectively. Even though rule 2 and 3 won't match the /*, but only the space in the margin leading up to the /*, the \i and \j anchors will change the indent stops. Anchor \k restores these stops and matches if the indent stops actually changed. Rule 5 also applies to non-margin /*-comments, with the /*-comments placed anywhere. For these two reasons we made \k optional with \k?.

More details and examples of indent, nodent, dedent, and undent patterns can be found in the RE/flex documentation.

Lazy Repeats Are Nice, but Also Noughty

In the previous section, we saw how to use a lazy repeat (.|\n)*? to match C-style multi-line comments with "/*"(.|\n)*?"*/". The extra ? is called a lazy quantifier and changes the * greedy repeat into a *? non-greedy lazy repeat (also called reluctant repeat). Why is a lazy repeat needed here? Note that the pattern (.|\n) matches any character (. (dot) matches any character except newline \n). The problem is that a greedy repeat (.|\n)* in the pattern "/*"(.|\n)*"*/" eats all matching text including */, but up to the last */ in the input text before EOF. The usual way to work around greedy repeats is to prevent the repeated pattern from matching */, such as "/*"([^*]|\*+[^/])*"*/". By contrast, lazy repeats are simpler and more elegant.

The lazy quantifier ? can be used to make the following greedy patterns lazy:

X??Y      // matches zero or one X as needed to match Y next
X*?Y      // matches zero of more X as needed to match Y next
X+?Y      // matches at least one X or more as needed to match Y next
X{n,m}?Y  // matches at least n times and at most m times X to match Y next

Take, for example, the pattern (a|b)*?bb. The text abaabb and bb match this pattern, but bbb does not match (actually, there is a partial match since bb matches this pattern, but the third b remains on the input for the next match by a scanner).

What about the pattern a*? that has no continuing pattern Y? This matches nothing. When the pattern Y permits an empty text match or matches the same text as X, then lazyness kills X. To make lazy repeats behave nicely, make sure they are followed by a pattern that matches non-empty text or make sure they are properly anchored. For example, a*?$ matches a series of a's up to the end of the line. However, in this case, lazyness is not required and a greedy pattern a*$ works just fine.

Lazy repeats are not available in POSIX matchers. The concept was introduced in Perl that is regex-based using backtracking on the regex (or by using an alternative form of a matching engine). POSIX matchers are text-based and may use a deterministic finite automaton (DFA) to significantly speed up matching. A DFA-based matcher lacks lazy quantifiers, lookahead/behind, and grouping for group captures, due to the inability to dynamically evaluate the context in which this form of regex-based information applies.

RE/flex adds lazy quantifiers to its POSIX matcher using a new DFA construction algorithm (technical publications are pending). A lazy repeat effectively cuts edges from a DFA, but may also lengthen the DFA to permit partial matches until the accept state is reached. Consider the following DFA constructed for regex (a|b)*bb:

and compare this to the DFA constructed for regex (a|b)*?bb:

As you can see, the second DFA matches any a's and b's until bb is matched.

Using the Smart Input Class to Scan Files

We often have to scan files with text encoded in various popular formats such as UTF-8, UTF-16 or even in UTF-32, ISO-8859-1, EBCDIC and sometimes even older code pages such as code page 850 and CP-1252. To support multiple input file formats, the RE/flex reflex::Input class converts the input format on the fly while scanning input by using a smart buffering technique. This smart buffering does not require the entire file to be read first, allowing the implementation of streaming lexers and regex matchers. The streaming converter detects UTF byte order marks (BOM) in files to enable automatic UTF-8/16/32 conversion by normalizing the text to UTF-8 for matching. In fact, when Unicode mode for matching is enabled with %option unicode, the input should be UTF encoded in this way.

If the input file is encoded in EBCDIC, ISO-8859-1 or extended Latin-1, then you should determine this in your code and set the decoding flag to use this file with a RE/flex lexer. For example, as follows:

#include "lex.yy.h" // generated with reflex --flex --header-file

int main()
  FILE *fd = fopen("iso-8859-1-example-file.txt", "r");           // open a file
  if (fd != NULL)
    reflex::Input input(fd, reflex::Input::file_encoding::latin); // construct ISO-8859-1 input
    yyFlexLexer lexer(input);                                     // construct lexer with input
    while (lexer.yylex() != 0)                                    // scan file until EOF
    fclose(fd);                                                   // close the file

RE/flex has built-in support for UTF encodings, ISO-8859-1, EBCDIC, and code pages 427, 850, and 1250 to 1258. You can also define a custom code page to be used by the smart input class that will then custom-translate 8-bit characters to Unicode using the specified code page entries.

Performance Tuning

There are several options to debug a lexer and analyze its performance given some input text to scan. Use reflex command-line option -d to generate a scanner that prints the matched text. Use reflex command-line option -p to generate a scanner that prints a performance report. The report shows the performance statistics obtained for each rule defined in the lexer specification. The statistics are collected when the scanner runs on some given input.

The reflex -p option allows you to fine-tune the performance of your scanner by focusing on patterns and rules that turn out to be computationally expensive. This is perhaps best illustrated with an example. The JSON parser json.l located in the examples directory of the RE/flex download package was built with reflex option -p and then run on some given JSON input to analyze its performance:

reflex 0.9.22 json.l performance report:
  INITIAL rules matched:
    rule at line 51 accepted 58 times matching 255 bytes total in 0.009 ms
    rule at line 52 accepted 58 times matching 58 bytes total in 0.824 ms
    rule at line 53 accepted 0 times matching 0 bytes total in 0 ms
    rule at line 54 accepted 0 times matching 0 bytes total in 0 ms
    rule at line 55 accepted 0 times matching 0 bytes total in 0 ms
    rule at line 56 accepted 5 times matching 23 bytes total in 0.007 ms
    rule at line 57 accepted 38 times matching 38 bytes total in 0.006 ms
    rule at line 72 accepted 0 times matching 0 bytes total in 0 ms
    default rule accepted 0 times
  STRING rules matched:
    rule at line 60 accepted 38 times matching 38 bytes total in 0.021 ms
    rule at line 61 accepted 0 times matching 0 bytes total in 0 ms
    rule at line 62 accepted 0 times matching 0 bytes total in 0 ms
    rule at line 63 accepted 0 times matching 0 bytes total in 0 ms
    rule at line 64 accepted 0 times matching 0 bytes total in 0 ms
    rule at line 65 accepted 0 times matching 0 bytes total in 0 ms
    rule at line 66 accepted 0 times matching 0 bytes total in 0 ms
    rule at line 67 accepted 0 times matching 0 bytes total in 0 ms
    rule at line 68 accepted 314 times matching 314 bytes total in 0.04 ms
    rule at line 69 accepted 8 times matching 25 bytes total in 0.006 ms
    rule at line 72 accepted 0 times matching 0 bytes total in 0 ms
    default rule accepted 0 times
  WARNING: execution times are relative:
    1) includes caller's execution time between matches when yylex() returns
    2) perf-report instrumentation adds overhead and increases execution times

The timings shown include the time of the pattern match and the time of the code executed by the rule. If the rule returns to the caller, then the time spent by the caller is also included in this timing. For this example, we have two start condition states, namely INITIAL and STRING. The rule at line 52 is:

[][}{,:]        { return yytext[0]; }

This returns a [ or ] bracket, a { or } brace, a comma, or a colon to the parser. Since the pattern and rule are very simple, we do not expect these to contribute much to the 0.824 ms time spent on this rule. The parser code executed when the scanner returns could be expensive. Depending on the character returned, the parser constructs a JSON array (bracket) or a JSON object (brace), and populates arrays and objects with an item each time a comma is matched. But which is most expensive? To obtain timings of these events separately, we can split the rule up into three similar rules:

[][]        { return yytext[0]; }
[}{]        { return yytext[0]; }
[,:]        { return yytext[0]; }

Then, we use reflex option -p, recompile the scanner lex.yy.cpp. and rerun our experiment to see these changes:

rule at line 52 accepted 2 times matching 2 bytes total in 0.001 ms
rule at line 53 accepted 14 times matching 14 bytes total in 0.798 ms
rule at line 54 accepted 42 times matching 42 bytes total in 0.011 ms

So it turns out that the construction of a JSON object by the parser is, relatively speaking, the most expensive part of our application, when { and } are encountered on the input. We should focus our optimization effort there if we want to improve the overall speed of our JSON parser.

The JSON lexer and parser source code used in this example are available for download with this article.

Interactive Input With Readline

Shells, interpreters, and other interactive command-line tools allow users to enter commands that are interpreted and executed by the application. The GNU readline library greatly enhances the interactive experience by offering a history mechanism and line-based editing. By contrast, simple interactive input by reading standard input gives users a cumbersome experience.

To use readline with RE/flex is a good example of how we can use the Flex-like wrap() function to our advantage. The wrap() function is invoked when EOF is reached, to permit a new input source to be assigned to continue scanning. In this example, readline() returns a line of input that we scan as a string. When the scanner reaches the end of this string, it triggers the wrap() function, which is set up to return the next line returned by readline(), etc.

The lexer specification we give here first includes the required C header files in the %top block and then defines the lexer class wrap() function in the %class block and the lexer class construction in the %init block:

  #include <stdlib.h>
  #include <stdio.h>
  #include <readline/readline.h>
  #include <readline/history.h>

  const char *prompt;
  // we use wrap() to read the next line
  virtual bool wrap() {
    if (line)
      line = readline(prompt);
      if (line != NULL)
        if (*line)
    // wrap() == true means OK: wrapped after EOF
    // be careful when using Flex --flex: yywrap() == 0 means OK !!
    return line == NULL ? false : true;
  // the line returned by readline() without \n
  char *line;
  // the line with \n appended
  std::string linen;

  prompt = NULL; // we can set a prompt here
  line = readline(prompt);
  if (line != NULL)
    if (*line)

A readline-based example lexer similar to this example, but using Flex actions, is available for download with this article.


RE/flex is not merely a C++ version of Flex. It adds many new useful features that Flex lacks, including critically required Unicode pattern matching, accepting file input encoded in UTF-8/16/32 (with or without BOM), ASCII, ISO-8859-1, EBCDIC, various code pages, lazy repeats in POSIX mode, and the generation of direct-coded DFAs in efficient C++ source code.

RE/flex also includes a regex API for Boost.Regex, RE/flex regex, and C++11 std::regex. This API offers a uniform interface to select the appropriate regex pattern matcher class for the task, i.e. Perl matching with Boost.Regex or std::regex, and POSIX DFA matching with RE/flex regex. This easy-to-use regex API can be used in any C++ application to match input from strings, streams, and file descriptors (to plain files or encoded, e.g. in UTF-8/16/32).

Further Reading

To learn more about RE/flex, I encourage you to read Constructing Fast Lexical Analyzers with RE/flex and the RE/flex user guide.

Download & License

RE/flex is available for download form SourceForge and GitHub. RE/flex is licensed under the open source BSD-3 license.


  • 25th April, 2017: First version of this article
  • 30th April, 2017: Some cosmetic updates
  • 16th May, 2017: Minor updates for clarification
  • 25th June, 2017: Added download icon and explained the use of code pages
  • 31th July, 2019: Expanded indent/nodent/dedent section with \k anchor
  • 8th August, 2019: Updated performance comparisons to include RE/flex version 1.3.5


This article, along with any associated source code and files, is licensed under The BSD License


About the Author

Robert van Engelen
United States United States
Founder of Genivia inc, Professor of Computer Science

Comments and Discussions

PraiseAwesome!!! Pin
koothkeeper10-Aug-19 12:11
professionalkoothkeeper10-Aug-19 12:11 
Questionlexical analysis. Pin
risivincent1-Aug-19 1:37
memberrisivincent1-Aug-19 1:37 
AnswerRe: lexical analysis. Pin
honey the codewitch3-Aug-19 13:24
memberhoney the codewitch3-Aug-19 13:24 
Questiondo you know how to do implement arden's theorem in C++? Pin
honey the codewitch31-Jul-19 6:30
memberhoney the codewitch31-Jul-19 6:30 
PraiseI recently built one of these for C# Pin
honey the codewitch31-Jul-19 6:26
memberhoney the codewitch31-Jul-19 6:26 
QuestionSmall typo Pin
zvx22-Apr-19 4:40
memberzvx22-Apr-19 4:40 
AnswerRe: Small typo Pin
Robert van Engelen22-Apr-19 5:25
memberRobert van Engelen22-Apr-19 5:25 
GeneralMy vote of 5 Pin
Ancient Zygote1-May-17 9:09
memberAncient Zygote1-May-17 9:09 
GeneralRe: My vote of 5 Pin
Robert van Engelen17-May-17 10:05
memberRobert van Engelen17-May-17 10:05 

General General    News News    Suggestion Suggestion    Question Question    Bug Bug    Answer Answer    Joke Joke    Praise Praise    Rant Rant    Admin Admin   

Use Ctrl+Left/Right to switch messages, Ctrl+Up/Down to switch threads, Ctrl+Shift+Left/Right to switch pages.

Posted 25 Apr 2017


37 bookmarked