Constructing fast lexical analyzers with RE/flex - why another scanner generator?

This article introduces RE/flex for C++. RE/flex generates fast lexical analyzers similar to Flex, but supports Unicode patterns, indent/nodent/dedent, lazy repeats, smart input text handling, and performance tuning. RE/flex is compatible with Bison/Yacc.

Introduction

In this article I will give a brief introduction to the RE/flex lexical analyzer generator. The RE/flex open source project was motivated by the possibility of building a scanner generator based on an entirely different approach to tokenization, one that permits regex libraries to be used in the generated scanners. This approach provides additional flexibility to use a rich regex syntax with a choice of POSIX-mode (and even Perl-mode) regex engines. A second motivation was the challenge of coming up with an algorithm to construct deterministic finite automata (DFAs) for efficient pattern matching with lazy quantifiers (a.k.a. non-greedy repeats, lazy repeats, reluctant repeats). This algorithm is also new.

Why RE/flex?

RE/flex is a new fast and flexible regex-based scanner generator for C++. It shares many similarities with Flex (the fast lexical analyzer) and Lex, but is fundamentally different in its overall design, which is based on a simple basic idea. Depending on the regex engine selected, this idea makes RE/flex faster than Flex or enables a more expressive regex syntax to be used in lexer specifications. It also adds flexibility by permitting other regex libraries to be used, such as Boost.Regex and the RE/flex regex engine, and perhaps other regex libraries in future releases.

The basic idea is that any regex engine can in principle be used to tokenize input: given a set of n patterns pi, i=1,...,n, a regex of the form "(p1)|(p2)|...|(pn)" can be used to match and tokenize the input. The group capture index i of a matching pattern pi that is returned by the regex matcher identifies the pattern pi that matched. If a pattern uses a group (X), then that group must be converted to a non-capturing group of the form (?:X) before we can use it in our patterns. We then use repeated partial matching in our scanner to advance over the input to tokenize it:

int Lexer::lex()
{
  if (!has_matcher())
    matcher("(p1)|(p2)|...|(pn)"); // new matcher engine if not setup already
  while (true)
  {
    if (... EOF reached ...)
      return 0;
    switch (matcher().scan()) // scan and match next token, get capture index
    {
      case 1: // pattern p1 matched
        ... // Action for pattern 1 ...
        break;
      case 2: // pattern p2 matched
        ... // Action for pattern 2 ...
        break;
      ... // etc ...
        ...
      default:
        ... // no match: this is the "default rule" to echo input character ...
    }
  }
}

RE/flex uses this basic idea. The scanner source code generated by reflex is structured in this way and is therefore very different from the source code generated by Flex.
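To illustrate this principle outside of RE/flex, below is a minimal standalone sketch that uses C++11 std::regex instead of a generated scanner, with three hypothetical token patterns (identifier, number, operator). The capture group index identifies the matched pattern, exactly as described above:

#include <iostream>
#include <regex>
#include <string>

int main()
{
  // alternation of three token patterns: identifier, number, operator
  std::regex tokens("([A-Za-z_][A-Za-z0-9_]*)|([0-9]+)|([-+*/=])");
  std::string input = "x = 42 + y";
  for (auto it = std::sregex_iterator(input.begin(), input.end(), tokens);
       it != std::sregex_iterator(); ++it)
  {
    const std::smatch& m = *it;
    for (std::size_t i = 1; i < m.size(); ++i) // find the group that matched
      if (m[i].matched)
        std::cout << "pattern " << i << " matched: " << m.str() << "\n";
  }
}

Note that std::sregex_iterator silently skips unmatched text between matches, whereas a scanner advances with continuous partial matching and applies a default rule to unmatched input, as the scan() loop above shows.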

A very fast DFA for pattern matching is generated with the reflex --fast option. The generated DFA is expressed in direct code instead of in tables. This hard-coded DFA may run faster than a Flex-based scanner. For example, the following hard-coded DFA that is generated by reflex for pattern "\w+" runs very efficiently to match words:

void reflex_code_FSM(reflex::Matcher& m)
{
  int c0 = 0, c1 = c0;
  m.FSM_INIT(c1);
S0:
  c0 = c1, c1 = m.FSM_CHAR();
  if (97 <= c1 && c1 <= 122) goto S5;
  if (c1 == 95) goto S5;
  if (65 <= c1 && c1 <= 90) goto S5;
  if (48 <= c1 && c1 <= 57) goto S5;
  return m.FSM_HALT(c1);
S5:
  m.FSM_TAKE(1);
  c0 = c1, c1 = m.FSM_CHAR();
  if (97 <= c1 && c1 <= 122) goto S5;
  if (c1 == 95) goto S5;
  if (65 <= c1 && c1 <= 90) goto S5;
  if (48 <= c1 && c1 <= 57) goto S5;
  return m.FSM_HALT(c1);
}

Another advantage of RE/flex is the rich regex pattern syntax offered by Boost.Regex and also by the new RE/flex regex library that is included with the RE/flex software. The RE/flex regex library supports Unicode, indent/nodent/dedent anchors, lazy quantifiers, word boundaries, and more.

How fast is RE/flex?

RE/flex is faster than Flex, even when Flex is used with its best performance options. RE/flex is also faster than Boost.Spirit.Lex, and much faster than popular regex libraries such as Boost.Regex, C++11 std::regex, PCRE2, and RE2. For example, tokenizing a representative C source code file into 244 tokens takes only 13 microseconds:

Command / Function                                        Software                  Time (μs)
reflex --fast                                             RE/flex                   13
flex -+ --full                                            Flex                      17
reflex --full                                             RE/flex                   29
boost::spirit::lex::lexertl::actor_lexer::iterator_type   Boost.Spirit.Lex          40
reflex -m=boost-perl                                      Boost.Regex (Perl mode)   230
pcre2_match()                                             PCRE2 (pre-compiled)      318
reflex -m=boost                                           Boost.Regex (POSIX mode)  450
flex -+                                                   Flex                      3968
RE2::Consume()                                            RE2 (pre-compiled)        5088
std::cregex_iterator()                                    C++11 std::regex          14784

The table shows the best of 10 tests, where each test measures the average time in microseconds over 100 runs (using clang 8.0.0 with -O2 on a 2.9 GHz Intel Core i7 with 16 GB 2133 MHz LPDDR3).

To create this comparison, I used RE/flex option --regexp-file to generate the regex string and used it with each regex library to tokenize a string of C source code stored in memory, thus avoiding the unpredictable overhead of file reads. This regex string has n=57 patterns of the form "(p1)|(p2)|...|(pn)" where each pi is a pattern that matches a C/C++ token lexeme. The patterns use non-capturing groups (?:...) in place of parentheses, so that the capture group index identifies the token matched.
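The timing harness itself is simple. Below is a minimal sketch of the kind of loop used to collect these numbers, where tokenize() is a hypothetical stand-in for each library's tokenization code and the source text is a placeholder:

#include <algorithm>
#include <chrono>
#include <cstddef>
#include <cstring>
#include <iostream>

// stand-in for a library's tokenizer: replace with the code under test
static std::size_t tokenize(const char *source)
{
  return std::strlen(source); // placeholder work
}

int main()
{
  const char *c_source = "int main() { return 0; }"; // stand-in source text
  double best = 1e30;
  for (int test = 0; test < 10; ++test)       // keep the best of 10 tests
  {
    auto t0 = std::chrono::high_resolution_clock::now();
    for (int run = 0; run < 100; ++run)       // average over 100 runs
      tokenize(c_source);
    auto t1 = std::chrono::high_resolution_clock::now();
    best = std::min(best,
        std::chrono::duration<double, std::micro>(t1 - t0).count() / 100.0);
  }
  std::cout << "best average time: " << best << " microseconds\n";
}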

Compatibility of RE/flex with Flex and Bison/Yacc

When designing RE/flex using the aforementioned approach, I wanted to make sure that the project is compatible with Bison and Yacc parsers. These tools are popular and are typically pre-installed with Linux and Unix distributions. I also wanted to make sure that RE/flex accepts standard Flex specifications (using RE/flex compatibility option --flex). This means that if you are already familiar with Flex or Lex, then you will find RE/flex easy to use. To generate a yyFlexLexer class that has the same methods as a Flex++ yyFlexLexer class, use reflex option --flex.

RE/flex is for C++ projects only. Using C++ permitted a significant number of additions to the lexer specification language and permitted the introduction of many new methods that can be used in rule actions, as I will explain further in the next section. However, you can still use RE/flex with C code, such as Bison-generated parsers that expect a global yylex() function. To do so, use the --flex and --bison command-line options:

reflex --flex --bison lexspec.l
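A Bison-generated parser then simply calls the global yylex(). As a minimal sketch, a plain driver can do the same (this assumes lex.yy.cpp was generated with the options above, which also produce the global yytext):

#include <cstdio>

extern int yylex(void);   // global scanner function generated with --bison
extern char *yytext;      // global matched text, also generated with --bison

int main()
{
  int token;
  while ((token = yylex()) != 0) // yylex() returns 0 at EOF
    std::printf("token %d: %s\n", token, yytext);
}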

RE/flex also supports options to build Bison reentrant parsers, and bison-bridge and bison-locations parsers. These advanced parsers are MT-safe and manage the locations of syntax errors in cooperation with the lexer.

To get you started with RE/flex, you will find several examples in the RE/flex download package from GitHub and SourceForge. The examples include Bison parsers, from very basic to more advanced bison-bridge and bison-location parsers.

The structure of a RE/flex lexer specification

The structure of a lexer specification is the same as Flex, but with some additions. A lexer specification consists of three parts separated by two lines with %%: the Definitions section, the Rules section, and the User code section.

  1. The Definitions section defines named patterns. That is, commonly used regex patterns can be named and used in other patterns as macros. This section may also contain user code: a %top block with code that goes all the way to the top of the generated scanner source code and the usual code placed between %{ and %} that is copied into the scanner's source code. A %class block defines the lexer class member variables and functions. The %init block defines the lexer class constructor code for initializations. We can also add one or more %option lines to configure the scanner.
  2. The Rules section defines pattern-action pairs. Actions are written in C++ and are triggered when the pattern matches.
  3. The User code section at the end of the specification can be used to add any C++ code necessary. For stand-alone scanners this typically has a main() function defined. 

The following outlines the basic structure of a RE/flex lexer specification:

// DEFINITIONS go here

%top{
  #include <iostream>
  // ... other stuff to #include <...>
%}

%{
  // ... other stuff to declare in C++
%}

// customize the generated Lexer class (optional):
%class{
 public:
  void a_public_function();
 private:
  std::string a_private_string;
%}

// customize the generated Lexer class default constructor (optional):
%init{
  a_private_string = "initialized";
%}

// enable Unicode matching:
%option unicode

// add a named pattern:
newline  \n

%%

// RULES go here, where a rule is a pattern and action

"foo"      { a_public_function(); }
"bar"      { a_private_string = text(); }
{newline}  // skip newline
.          { std::cerr << "unknown character at " << lineno() << ":" << columno(); }

%%

// USER CODE goes here

int main()
{
  Lexer lexer;
  while (lexer.lex() != 0)
    continue;
}

Patterns are common regular expressions, extended with special Lex syntax features. This means that the named pattern {newline} is expanded into (?:\n), the quoted text "foo" is a literal match of foo, and a slash / (not shown in the example) is a lookahead pattern.
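For example, the slash lookahead (trailing context) can be used as follows, where WHILE_TOKEN is a hypothetical token constant; the trailing [ \t]*\( must be present for the rule to match but is not consumed as part of the match:

while/[ \t]*\(   { return WHILE_TOKEN; } // hypothetical token constant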

The lexer specification above shows three built-in functions used in the rule actions: text(), which returns the 0-terminated char text matched, lineno(), and columno(). The first two are the same as yytext and yylineno in Flex. Flex-compatible actions are enabled with reflex option --flex.

The following table shows some of the actions of a long list of functions that can be used in rule actions:

RE/flex action  Flex action       Result
text()          YYText(), yytext  0-terminated text match
str()           n/a               std::string text match
wstr()          n/a               std::wstring text match
size()          YYLeng(), yyleng  size of the match in bytes
wsize()         n/a               number of wide chars matched
lineno()        yylineno          line number of match (>=1)
columno()       n/a               column number of match (>=0)
echo()          ECHO              echo text match
start(n)        BEGIN n           set start condition to n

The classic Flex actions listed in the table can be used in a RE/flex lexer specification with reflex option --flex. Because the matcher library object is accessible in the rule actions via matcher(), you can use it in your rule actions for added power. Here are some useful example matcher() methods:

RE/flex action      Flex action  Result
matcher().input()   yyinput()    get next char from input
matcher().winput()  n/a          get wide character from input
matcher().unput(c)  unput(c)     put back 8-bit char
matcher().more()    yymore()     concat the next match to this match
matcher().less(n)   yyless(n)    shrink match length to n

These are only examples. For a complete list, see the RE/flex user manual.
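For instance, a rule action can bypass pattern matching and consume input directly with matcher().input(). The following sketch skips an HTML-style comment this way (assuming matcher().input() returns the EOF constant from <cstdio> at end of input):

"<!--"    { // consume an HTML comment by reading input directly (a sketch)
            int c1 = 0, c2 = 0, c3 = matcher().input();
            while (c3 != EOF && !(c1 == '-' && c2 == '-' && c3 == '>'))
              c1 = c2, c2 = c3, c3 = matcher().input();
          }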

Unicode pattern matching

A scanner generator that cannot handle Unicode is as good as using a typewriter for texting on a mobile phone. Most modern programming languages have Unicode lexical structures that require Unicode-capable scanners to construct parsers and compilers. At least the widely-used UTF-8 encoding format should be natively supported. Scanner generators produce lexers that scan files (in addition to 8-bit strings and perhaps wide strings), which makes them different from regex engines (i.e. many regex libraries can be used with wide strings). To scan files that contain Unicode wide characters, the files have to be decoded from UTF-8, UTF-16, or full UTF-32 (UCS-4) formats. This decoding is done automatically on the fly in RE/flex.

Unicode pattern matching is enabled with reflex command-line option --unicode or with %option unicode in a lexer specification:

%option unicode

With this option, the . (dot), \w, \s, \l, \u, \u{XXXX}, and \p{C} patterns match Unicode. These match, respectively, any Unicode character (except newline \n), a Unicode word character, Unicode space, a Unicode lower case letter, a Unicode upper case letter, Unicode character U+XXXX, and Unicode character class C.

\s+          // skip spacing
\w+          { std::cout << text() << "\n"; }
\p{Unicode}  { std::cout << "U+" << std::hex << wstr().at(0) << " \n"; }
.            { std::cerr << "invalid Unicode!\n"; }

This lexer removes all white space from a file and prints words that are composed of Unicode letters, numerical digits, and connector punctuation such as underscores. The matched text is printed in UTF-8 format to std::cout with text() (which is the same as yytext in Flex syntax). Any other Unicode character is printed as a U+XXXX code using the first character of the wide string wstr() match. Finally, the . (dot) pattern is used to catch all else, including invalid characters that are outside of the valid Unicode range U+0000 to U+D7FF and U+E000 to U+10FFFF. When UTF-8 or UTF-16 files are scanned, an invalid UTF encoding such as an overlong form is also matched by . (dot).

To match any valid Unicode character, I recommend using \p{Unicode}. We only want to use . (dot) for catch-all-else cases as needed. One such case is matching invalid content, as shown in the example above, to prevent the scanner from jamming on its input when nothing valid matches. For this reason, the . (dot) is designed to be permissive: it matches valid Unicode but also matches invalid UTF-8 and UTF-16 sequences in order to catch lexical errors. The . (dot) is self-synchronizing, which means that it advances to the next character on the input after matching an invalid character.

Indent, nodent, and dedent matching

RE/flex takes a direct approach to matching indent, nodent and dedent positions in text by introducing two new anchors, one for indent and one for dedent.

To match and set an indent stop position on a new line with margin spacing matched by ^[ \t]+, simply add the \i anchor at the end ^[ \t]+\i. This sets and matches a new indent stop when the margin spacing is increased.

To match a dedent when margin spacing is decreased, use the \j anchor ^[ \t]+\j. This matches a previous indent stop and removes it.

Because a more significant decrease in margin spacing may result in multiple indent stop positions being removed all at once, we should also add a pattern with a single \j anchor to match each extra indent stop. This matches the extra dedents that occur on the same line, one at a time:

^[ \t]+    // 1 nodent: text is aligned to the current indent margin
^[ \t]*\i  // 2 indent: text is indented with new indent set and matched by \i
^[ \t]*\j  // 3 dedent: text is dedented, matched by \j
\j         // 4 dedent: text is dedented by an extra indent stop on the current line

Notice that matching a "nodent" line, which is aligned to the current margin, is done with a pattern that lacks the \i and \j anchors.

Let's see how this works. Say our text to scan contains the following:

def abs(n):
    if n < 0:
        n = -n
        return n
    else:
        return n
# EOF

Then the four patterns we defined are matched in this order: 2, 2, 1, 3, 2, 3, 3. Note that the first line has no indent margin and no indent stop.
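A parser typically consumes indents and dedents as tokens. Here is a minimal sketch of the four rules with actions, assuming INDENT and DEDENT are hypothetical token constants defined by the parser:

^[ \t]+    { /* nodent: margin is aligned to the current indent stop */ }
^[ \t]*\i  { return INDENT; }
^[ \t]*\j  { return DEDENT; }
\j         { return DEDENT; } // one extra dedent stop on the same line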

You are not restricted to using white space, such as [ \t]+, to define margin spacing. The margin spacing can be anything, not just spaces and tabs, as long as the pattern does not include a newline character \n.

It may be necessary to temporarily memorize the indent stops when scanning text over multiple lines by patterns that should not change the current indent stop positions, such as line continuations, multi-line comments and expressions that require "line joining" as in Python. This can be done by saving the current indent stops with matcher().push_stops() and retrieving them again with matcher().pop_stops().

For example, to scan a multi-line comment over multiple (indented) lines without affecting the current indent stops, we should save the current stops and transition to a new start condition state CONTINUE:

"/*"          matcher().push_stops(); // save the indent margin/tab stops
              BEGIN CONTINUE;         // BEGIN state transition to CONTINUE w/o indent matching
<CONTINUE>{
"*/"          matcher().pop_stops();  // restore the indent margin/tab stops
              BEGIN INITIAL;          // BEGIN state transition to INITIAL
.|\n          /* ignore */
}

We typically define start condition states, such as CONTINUE in this example, when scanning text that is matched by a different set of patterns and rules. The matcher().push_stops() method resets the indent stop positions. The positions can be accessed directly using matcher().stops() that returns a reference to a std::vector<size_t> with these positions that are ordered from low to high. You can change the vector as long as the low-to-high order is maintained.
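For example, because matcher().stops() returns the vector of stop positions, a rule action can inspect it directly. The following sketch dumps the remaining indent stops at the end of the input (assuming <iostream> is included in the %top block):

<<EOF>>    {
             for (size_t pos : matcher().stops()) // ordered low to high
               std::cout << "indent stop at column " << pos << "\n";
             return 0;
           }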

There is a quick shortcut for matching multiple lines of text when the text has to be ignored by the scanner, such as multi-line comments and line continuations. To implement line continuations after a \ (backslash) placed at the end of a line, we must ignore the next indent. Similarly, C-style multi-line comments matched with the pattern "/*"(.|\n)*?"*/" (using a lazy repetition (.|\n)*?) should be ignored and not affect the current indent stops. To do this, we use the RE/flex (?^X) negative pattern that consumes text without actions, making the text invisible:

(?^"/*".*?"*/")  // ignore multi-line comments
(?^\\\n[ \t]*)   // lines ending in \ will continue on the next line

The negative pattern (?^X) is a RE/flex-specific addition.

Lazy repeats are nice, but also naughty

In the previous section we saw how to use a lazy repeat (.|\n)*? to match C-style multi-line comments with "/*"(.|\n)*?"*/". The extra ? is called a lazy quantifier and changes the greedy repeat * into a non-greedy lazy repeat *? (also called a reluctant repeat). Why is a lazy repeat needed here? Note that the pattern (.|\n) matches any character, since . (dot) matches any character except newline \n. The problem is that the greedy repeat (.|\n)* in the pattern "/*"(.|\n)*"*/" eats all matching text, including any */, up to the last */ in the input before EOF. The usual way to work around greedy repeats is to prevent the repeated pattern from matching */, as in "/*"([^*]|\*+[^/])*"*/". By contrast, lazy repeats are simpler and more elegant.

The lazy quantifier ? can be used to make the following greedy patterns lazy:

X??Y      // matches zero or one X as needed to match Y next
X*?Y      // matches zero or more X as needed to match Y next
X+?Y      // matches at least one X or more as needed to match Y next
X{n,m}?Y  // matches at least n times and at most m times X to match Y next

Take for example the pattern (a|b)*?bb. The text abaabb and bb match this pattern, but bbb does not match (actually, there is a partial match since bb matches this pattern, but the third b remains on the input for the next match by a scanner).

What about the pattern a*? that has no continuing pattern Y? This matches nothing. When the pattern Y permits an empty text match, or matches the same text as X, then laziness kills X. To make lazy repeats behave nicely, make sure they are followed by a pattern that matches non-empty text, or make sure they are properly anchored. For example, a*?$ matches a series of a's up to the end of the line. However, in this case laziness is not required and the greedy pattern a*$ works just fine.

Lazy repeats are not available in POSIX matchers. The concept was introduced by Perl, which matches regex patterns with backtracking (or with an alternative form of matching engine). POSIX matchers are text-based and may use a deterministic finite automaton (DFA) to significantly speed up matching. A DFA-based matcher lacks lazy quantifiers, lookahead/behind, and grouping for group captures, due to the inability to dynamically evaluate the context in which this form of regex-based information applies.

RE/flex adds lazy quantifiers to its POSIX matcher using a new DFA construction algorithm (technical publications are pending). A lazy repeat effectively cuts edges from a DFA, but may also lengthen the DFA to permit partial matches until the accept state is reached. Consider the following DFA constructed for regex (a|b)*bb:

[DFA diagram for (a|b)*bb: not reproduced here]

and compare this to the DFA constructed for regex (a|b)*?bb:

[DFA diagram for (a|b)*?bb: not reproduced here]

As you can see, the second DFA matches any a's and b's until bb is matched.

Using the smart input class to scan files

We often have to scan files with text encoded in various popular formats such as UTF-8, UTF-16 or even in UTF-32, ISO-8859-1, EBCDIC and sometimes even older code pages such as code page 850 and CP-1252. To support multiple input file formats, the RE/flex reflex::Input class converts the input format on the fly while scanning input by using a smart buffering technique. This smart buffering does not require the entire file to be read first, allowing the implementation of streaming lexers and regex matchers. The streaming converter detects UTF byte order marks (BOM) in files to enable automatic UTF-8/16/32 conversion by normalizing the text to UTF-8 for matching. In fact, when Unicode mode for matching is enabled with %option unicode, the input should be UTF encoded in this way.

If the input file is encoded in EBCDIC, ISO-8859-1, or extended Latin-1, then you should determine this in your code and set the decoding flag when using the file with a RE/flex lexer, for example as follows:

#include "lex.yy.h" // generated with reflex --flex --header-file

int main()
{
  FILE *fd = fopen("iso-8859-1-example-file.txt", "r");           // open a file
  if (fd != NULL)
  {
    reflex::Input input(fd, reflex::Input::file_encoding::latin); // construct ISO-8859-1 input
    yyFlexLexer lexer(input);                                     // construct lexer with input
    while (lexer.yylex() != 0)                                    // scan file until EOF
      continue;
    fclose(fd);                                                   // close the file
  }
}

RE/flex has built-in support for UTF encodings, ISO-8859-1, EBCDIC, and code pages 437, 850, and 1250 to 1258. You can also define a custom code page to be used by the smart input class, which will then translate 8-bit characters to Unicode using the specified code page entries.
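The following is a rough sketch of how a custom code page could be supplied, assuming the reflex::Input constructor accepts a code page table as a third argument, analogous to the ISO-8859-1 example above (the table contents and file name are placeholders; see the RE/flex user manual for the exact interface):

#include "lex.yy.h" // generated with reflex --flex --header-file

// a custom code page: 256 Unicode code points, one for each 8-bit input
// character (the entries shown here are placeholders)
static const unsigned short custom_codepage[256] = {
  0x0000, 0x0001, 0x0002, 0x0003, // ... map all 256 byte values to Unicode
};

int main()
{
  FILE *fd = fopen("legacy-encoded-file.txt", "r");
  if (fd != NULL)
  {
    // assumed constructor form with a custom code page table
    reflex::Input input(fd, reflex::Input::file_encoding::custom, custom_codepage);
    yyFlexLexer lexer(input);
    while (lexer.yylex() != 0)
      continue;
    fclose(fd);
  }
}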

Performance tuning

There are several options to debug a lexer and analyze its performance given some input text to scan. Use reflex command-line option -d to generate a scanner that prints the matched text. Use reflex command-line option -p to generate a scanner that prints a performance report. The report shows the performance statistics obtained for each rule defined in the lexer specification. The statistics are collected when the scanner runs on some given input.
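For example, assuming the options may be combined in one invocation:

reflex --flex -d -p lexspec.l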

The reflex -p option allows you to fine-tune the performance of your scanner by focusing on patterns and rules that turn out to be computationally expensive. This is perhaps best illustrated with an example. The JSON parser json.l, located in the examples directory of the RE/flex download package, was built with reflex option -p and then run on some given JSON input to analyze its performance:

reflex 0.9.22 json.l performance report:
  INITIAL rules matched:
    rule at line 51 accepted 58 times matching 255 bytes total in 0.009 ms
    rule at line 52 accepted 58 times matching 58 bytes total in 0.824 ms
    rule at line 53 accepted 0 times matching 0 bytes total in 0 ms
    rule at line 54 accepted 0 times matching 0 bytes total in 0 ms
    rule at line 55 accepted 0 times matching 0 bytes total in 0 ms
    rule at line 56 accepted 5 times matching 23 bytes total in 0.007 ms
    rule at line 57 accepted 38 times matching 38 bytes total in 0.006 ms
    rule at line 72 accepted 0 times matching 0 bytes total in 0 ms
    default rule accepted 0 times
  STRING rules matched:
    rule at line 60 accepted 38 times matching 38 bytes total in 0.021 ms
    rule at line 61 accepted 0 times matching 0 bytes total in 0 ms
    rule at line 62 accepted 0 times matching 0 bytes total in 0 ms
    rule at line 63 accepted 0 times matching 0 bytes total in 0 ms
    rule at line 64 accepted 0 times matching 0 bytes total in 0 ms
    rule at line 65 accepted 0 times matching 0 bytes total in 0 ms
    rule at line 66 accepted 0 times matching 0 bytes total in 0 ms
    rule at line 67 accepted 0 times matching 0 bytes total in 0 ms
    rule at line 68 accepted 314 times matching 314 bytes total in 0.04 ms
    rule at line 69 accepted 8 times matching 25 bytes total in 0.006 ms
    rule at line 72 accepted 0 times matching 0 bytes total in 0 ms
    default rule accepted 0 times
  WARNING: execution times are relative:
    1) includes caller's execution time between matches when yylex() returns
    2) perf-report instrumentation adds overhead and increases execution times

The timings shown include the time of the pattern match and the time of the code executed by the rule. If the rule returns to the caller, then the time spent by the caller is also included in this timing. For this example, we have two start condition states, namely INITIAL and STRING. The rule at line 52 is:

[][}{,:]        { return yytext[0]; }

This returns a [ or ] bracket, a { or } brace, a comma, or a colon to the parser. Since the pattern and rule are very simple, we do not expect these to contribute much to the 0.824 ms time spent on this rule. The parser code executed when the scanner returns could be expensive. Depending on the character returned, the parser constructs a JSON array (bracket) or a JSON object (brace), and populates arrays and objects with an item each time a comma is matched. But which is most expensive? To obtain timings of these events separately, we can split the rule up into three similar rules:

[][]        { return yytext[0]; }
[}{]        { return yytext[0]; }
[,:]        { return yytext[0]; }

Then we use reflex option -p, recompile the scanner lex.yy.cpp, and rerun our experiment to see these changes:

    rule at line 52 accepted 2 times matching 2 bytes total in 0.001 ms
    rule at line 53 accepted 14 times matching 14 bytes total in 0.798 ms
    rule at line 54 accepted 42 times matching 42 bytes total in 0.011 ms

So it turns out that the construction of a JSON object by the parser is, relatively speaking, the most expensive part of our application, triggered when { and } are encountered on the input. We should focus our optimization effort there if we want to improve the overall speed of our JSON parser.

The JSON lexer and parser source code used in this example are available for download with this article.

Interactive input with readline

Shells, interpreters, and other interactive command-line tools allow users to enter commands that are interpreted and executed by the application. The GNU readline library greatly enhances the interactive experience by offering a history mechanism and line-based editing. By contrast, simple interactive input by reading standard input gives users a cumbersome experience.

Using readline with RE/flex is a good example of how we can use the Flex-like wrap() function to our advantage. The wrap() function is invoked when EOF is reached, to permit a new input source to be assigned to continue scanning. In this example, readline() returns a line of input that we scan as a string. When the scanner reaches the end of this string, it triggers the wrap() function, which is set up to return the next line produced by readline(), and so on.

The lexer specification we give here first includes the required C header files in the %top block and then defines the lexer class wrap() function in the %class block and the lexer class construction in the %init block: 

%top{
  #include <stdlib.h>
  #include <stdio.h>
  #include <readline/readline.h>
  #include <readline/history.h>
%}

%class{
  const char *prompt;
  // we use wrap() to read the next line
  virtual bool wrap() {
    if (line)
    {
      free((void*)line);
      line = readline(prompt);
      if (line != NULL)
      {
        if (*line)
          add_history(line);
        linen.assign(line).push_back('\n');
        in(linen);
      }
    }
    // wrap() == true means OK: wrapped after EOF
    // careful: with reflex option --flex, yywrap() == 0 means OK !!
    return line == NULL ? false : true;
  }
  // the line returned by readline() without \n
  char *line;
  // the line with \n appended
  std::string linen;
%}

%init{
  prompt = NULL; // we can set a prompt here
  line = readline(prompt);
  if (line != NULL)
  {
    if (*line)
      add_history(line);
    linen.assign(line).push_back('\n');
  }
  in(linen);
%}

A readline-based example lexer similar to this example, but using Flex actions, is available for download with this article.

Conclusions

RE/flex is not merely a C++ version of Flex. It adds many new useful features that Flex lacks, including critically required Unicode pattern matching, accepting file input encoded in UTF-8/16/32 (with or without BOM), ASCII, ISO-8859-1, EBCDIC, various code pages, lazy repeats in POSIX mode, and the generation of direct-coded DFAs that may be faster than Flex scanners.

Further reading

To learn more about RE/flex I encourage you to read Constructing Fast Lexical Analyzers with RE/flex and the RE/flex user guide.

Download & License

RE/flex is available for download from SourceForge and GitHub. RE/flex is licensed under the open source BSD-3 license.

History

April 25: First version of this article.
April 30: Some cosmetic updates.
May 16: Minor updates for clarification.
June 25: Added download icon and explained the use of code pages.

License

This article, along with any associated source code and files, is licensed under The BSD License

About the Author

Robert van Engelen
CEO and CTO of Genivia Inc., United States