Posted 9 Apr 2001

A scanner and scanner generator

Supports both common approaches to scanners in one object.

Sample Image - ScanGen.gif


A scanner breaks a stream of characters into a sequence of tokens, much as a human reader groups characters into words, numbers and punctuation and thereby reaches a higher level of abstraction. The text:

0-201-05866-9, cancelled,    "Parallel Program Design"

could, for example, be translated into the tokens:

T_ISBN T_COMMA T_CANCELLED T_COMMA T_TITLE

where T_ISBN, T_COMMA and so on are integer constants. There are two approaches for implementing general scanners.

  • The scanner is an object of a class.

    The searched tokens are specified via calls to member functions.

  • The scanner is automatically generated from regular expressions.

    This is a two-phase approach. First, you specify the scanner, and then you run the generator, which outputs "C" source code.

The first approach is better suited for a project with frequent changes. The second approach gives you superior performance, but it has the disadvantage that the generated C code is nearly unreadable for human beings and therefore should not be edited by hand.

The scanner and scanner generator presented in this article combine both approaches and provide one interface for both implementation strategies.
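To make the first approach concrete, here is a minimal sketch of a scanner object whose tokens are specified through member-function calls. It is not REXI_Scan: the class TinyScanner, the Token enum and the function TokenizeSample are hypothetical names, the sketch relies on std::regex instead of the article's regular expression facility, and it simply takes the first rule that matches.

```cpp
#include <cassert>
#include <cstddef>
#include <regex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical token ids for this sketch only (not taken from REXI_Scan).
enum Token { T_EOS = 0, T_ILLEGAL, T_ISBN, T_COMMA, T_CANCELLED, T_TITLE };

// TinyScanner: rules are added by member-function calls, then Scan() is
// called repeatedly. The first matching rule wins (no longest-match logic).
class TinyScanner {
public:
    void AddSymbolDef(const std::string& regExp, int id) {
        assert(id != T_EOS && id != T_ILLEGAL);  // these ids are reserved
        m_rules.emplace_back(std::regex(regExp), id);
    }
    void SetSource(const std::string& src) { m_src = src; m_pos = 0; }

    int Scan() {
        // Skip blanks and tabs between tokens.
        while (m_pos < m_src.size() && (m_src[m_pos] == ' ' || m_src[m_pos] == '\t'))
            ++m_pos;
        if (m_pos >= m_src.size()) return T_EOS;
        for (const auto& rule : m_rules) {
            std::smatch m;
            // match_continuous forces the match to start at the current position.
            if (std::regex_search(m_src.cbegin() + m_pos, m_src.cend(), m, rule.first,
                                  std::regex_constants::match_continuous)
                && m.length(0) > 0) {
                m_token = m.str(0);
                m_pos += static_cast<std::size_t>(m.length(0));
                return rule.second;
            }
        }
        m_token = m_src.substr(m_pos, 1);  // no rule matched: report one character
        ++m_pos;
        return T_ILLEGAL;
    }
    const std::string& GetTokenString() const { return m_token; }

private:
    std::vector<std::pair<std::regex, int>> m_rules;
    std::string m_src, m_token;
    std::size_t m_pos = 0;
};

// Tokenize the sample line from the introduction and return the token ids.
std::vector<int> TokenizeSample() {
    TinyScanner s;
    s.AddSymbolDef("[0-9]+-[0-9]+-[0-9]+-[0-9]+", T_ISBN);
    s.AddSymbolDef(",", T_COMMA);
    s.AddSymbolDef("cancelled", T_CANCELLED);
    s.AddSymbolDef("\"[^\"\n]*\"", T_TITLE);
    s.SetSource("0-201-05866-9, cancelled,    \"Parallel Program Design\"");
    std::vector<int> ids;
    for (int t = s.Scan(); t != T_EOS; t = s.Scan()) ids.push_back(t);
    return ids;
}
```

Because the rules live in an ordinary object, changing a token definition only means changing one AddSymbolDef call, which is the flexibility the first approach is credited with.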

Interface for the scanner class

The scanner REXI_Scan is based on the regular expression facility already presented in the article 'Fast regular expressions'. To use it, you
  • specify a regular expression for each token to recognize.
  • set the source string.
  • call Scan repeatedly until it returns REXI_Scan::eEos.
class REXI_Scan : public REXI_Base
{
public:
    /* result constants such as eEos and eIllegal are not shown in this excerpt */
    REXI_Scan(char cLineBreak= '\n'); //related function 'GetNofLines'

/* 1. STEP: initialize scanner with symbol definitions */
    REXI_DefErr     AddSymbolDef        (string strRegExp,int nIdAnswer);
    REXI_DefErr     AddHelperRegDef     (string strName,string strRegExp);

    REXI_DefErr     SetToSkipRegExp     (string strRegExp= "[ \r\n\v\t]*");

/* 2. STEP: set source */
    inline  void    SetSource           (const char* pszSource);

/* 3. STEP: read next token, then return symbolId
   ('nIdAnswer' from 'AddSymbolDef') */
    int             Scan                ();

/* retrieve, set information after a call to 'Scan' */
    inline  string  GetTokenString      ()const;
    inline  void    SkipChars           (int nOfChars=1);
    inline  int     GetLastSymbol       ()const;
    inline  int     GetNofLines         ()const;
};

Example Usage

#include <iostream>
#include <string>
#include <vector>
// plus the header that declares REXI_Scan (from the article's download)
using namespace std;

// Token ids. The concrete values are an assumption; they must not collide
// with REXI_Scan::eEos and REXI_Scan::eIllegal.
enum ESymbol{ T_AVAILABLE=1, T_CANCELLED, T_COMMA, T_LINEBREAK, T_ISBN, T_TITLE };

struct Info{
        string  m_sISBN;    ESymbol m_eKey;  string  m_sTitle;
};

int main(int argc,char* argv[])
{
    const int ncOk= REXI_DefErr::eNoErr;
    const char szTestSrc[]= 
    "3-8272-5737-9,AVAILABLE,    \"XML praxis und referenz\"\r\n"
    "0-201-05866-9,cancelled,    \"Parallel Program Design\"\r\n";

    REXI_Scan scanner;
    REXI_DefErr err;
/* STEP 1: initialize scanner with symbol definitions */
    err= scanner.AddSymbolDef ("(AVAILABLE)\\i",T_AVAILABLE);
    err= scanner.AddSymbolDef ("(CANCELLED)\\i",T_CANCELLED);
    err= scanner.AddSymbolDef (",",T_COMMA);
    err= scanner.AddSymbolDef ("\\n",T_LINEBREAK);
    err= scanner.AddHelperRegDef("$Int_","[0-9]+\\-");
    err= scanner.AddSymbolDef ("$Int_ $Int_ $Int_ [0-9]+", T_ISBN);
    err= scanner.AddSymbolDef (" \"( [^\"\\n] | \\\"] )* \"", T_TITLE);
    err= scanner.SetToSkipRegExp("[ \\t\\v\\r]*");
/* STEP 2 : set source */
    scanner.SetSource(szTestSrc);
    int nNofLines=0;
    int nRes;
    Info info;
    vector<Info> vecInfos;
/* STEP 3: read until eos */
    while( (nRes=scanner.Scan())!=REXI_Scan::eEos ){
        switch(nRes){
        case T_AVAILABLE: 
        case T_CANCELLED:
            info.m_eKey= (enum ESymbol)nRes;
            break;
        case T_TITLE:
            info.m_sTitle= scanner.GetTokenString();
            break;
        case T_ISBN:
            info.m_sISBN=  scanner.GetTokenString();
            break;
        case T_LINEBREAK:   /* a record is complete at the end of a line */
            vecInfos.push_back(info); info= Info();
            break;
        case REXI_Scan::eIllegal: 
            cout    <<  "Illegal:"    <<  
                scanner.GetTokenString()  <<  endl;
            /* skip the rest of the faulty line */
            while( (nRes=scanner.Scan())!=REXI_Scan::eEos 
                                && nRes!= T_LINEBREAK);
            info= Info();
            break;
        }
    }
    cout   << "Number of correctly read records: "  
           <<  vecInfos.size() <<  endl;
    char c; cin >> c;
    return 0;
}

Interface for the scanner generator

The scanner generator is a very simple GUI program. It lets you specify and run a test scanner and then generates the source code for the specified scanner. The generated code uses a scanner derived from REXI_Scan and contains two alternative code parts. Depending on whether REXI_STATIC_SCANNER is defined (#ifdef REXI_STATIC_SCANNER), either an efficient hard-coded scanner or a scanner working like the one described above is activated.
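To give an impression of what "hard-coded" means here, the following is a minimal hand-written sketch of a recognizer for the pattern [0-9]+ (\. [0-9]+)?, written as an explicit state machine. It is not output of the generator; the function name MatchFloat and the state names are hypothetical, and real generated code may be organized quite differently.

```cpp
#include <cassert>
#include <cstddef>

// Hand-coded recognizer for the pattern [0-9]+ (\. [0-9]+)? .
// Returns the length of the longest match at the start of psz, or 0 if none.
std::size_t MatchFloat(const char* psz) {
    assert(psz != nullptr);
    enum State { sStart, sInt, sDot, sFrac } state = sStart;
    std::size_t nAccepted = 0;  // length of the longest accepted prefix so far
    for (std::size_t i = 0; ; ++i) {
        const char c = psz[i];
        const bool bDigit = (c >= '0' && c <= '9');
        switch (state) {
            case sStart:            // need at least one digit
                if (!bDigit) return 0;
                state = sInt; nAccepted = i + 1;
                break;
            case sInt:              // inside the integer part
                if (bDigit) nAccepted = i + 1;
                else if (c == '.') state = sDot;
                else return nAccepted;
                break;
            case sDot:              // a '.' only counts if a digit follows
                if (!bDigit) return nAccepted;
                state = sFrac; nAccepted = i + 1;
                break;
            case sFrac:             // inside the fractional part
                if (!bDigit) return nAccepted;
                nAccepted = i + 1;
                break;
        }
    }
}
```

Every state returns as soon as it sees a character it cannot consume, including the terminating '\0', so the loop never reads past the end of the string. Code of this kind avoids interpreting a pattern at run time, which is where the speed advantage of a generated scanner comes from, at the cost of readability.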

The specification for the scanner to be generated uses regular expressions and supports 4 different ways to specify a token, which are shown below.

if    #T_Quote= '[^']'    $Int= [0-9]+    ##T_FLOAT= $Int (\. $Int)?

It is important that you separate the token definitions by tabs. Now, let's see what the 4 definitions above mean.

1. Token   if    The scanner searches for exactly the word 'if' 
          and automagically creates a constant T_if for the token
2. Token   #T_Quote= '[^']'    The leading # means: 
          The next identifier up to the = is the name of the token constant, 
          then the token definition follows.
3. Helper  $Int= [0-9]+    Defines a helper definition, 
          which can be used later.
4. Token   ##T_FLOAT= $Int (\. $Int)? 
          The leading ## means: Same as # but do postprocessing 
          after recognizing this token.
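The four kinds of definition can be told apart by their leading characters. The following is a minimal sketch of how a tab-separated specification line could be split and classified; the names DefKind, Def, ParseDef and ParseSpec are hypothetical and not taken from the generator's source, and the sketch does no error reporting.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// The four kinds of definition described in the list above.
enum class DefKind { Keyword, NamedToken, Helper, PostProcessedToken };

struct Def {
    DefKind kind;
    std::string name;    // constant or helper name, e.g. "T_if", "T_Quote", "$Int"
    std::string regExp;  // text after '=' (for a keyword: the word itself)
};

static std::string Trim(const std::string& s) {
    const char* ws = " \r\n";
    const std::size_t b = s.find_first_not_of(ws);
    if (b == std::string::npos) return "";
    return s.substr(b, s.find_last_not_of(ws) - b + 1);
}

// Classify one (already trimmed, non-empty) field of the specification.
Def ParseDef(const std::string& f) {
    DefKind kind;
    std::size_t skip = 0;  // number of leading marker characters to drop from the name
    if (f.compare(0, 2, "##") == 0)     { kind = DefKind::PostProcessedToken; skip = 2; }
    else if (f.compare(0, 1, "#") == 0) { kind = DefKind::NamedToken;         skip = 1; }
    else if (f.compare(0, 1, "$") == 0) { kind = DefKind::Helper;             skip = 0; }
    else return {DefKind::Keyword, "T_" + f, f};  // plain word: constant T_<word>

    const std::size_t eq = f.find('=');
    assert(eq != std::string::npos);  // sketch only: a missing '=' is not handled
    return {kind, Trim(f.substr(skip, eq - skip)), Trim(f.substr(eq + 1))};
}

// Split a specification line at tabs and classify every non-empty field.
std::vector<Def> ParseSpec(const std::string& spec) {
    std::vector<Def> defs;
    std::size_t start = 0;
    while (start <= spec.size()) {
        const std::size_t tab = spec.find('\t', start);
        const std::size_t end = (tab == std::string::npos) ? spec.size() : tab;
        const std::string field = Trim(spec.substr(start, end - start));
        if (!field.empty()) defs.push_back(ParseDef(field));
        start = end + 1;
    }
    return defs;
}
```

Splitting at tabs only is what allows blanks inside a regular expression such as $Int (\. $Int)?, which is presumably why the generator insists on tabs as separators.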

Fragment of a generated scanner

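int    Simple::Scan()
{
#ifdef REXI_STATIC_SCANNER
    int nRes= FastScan();           // hard-coded scanner
#else
    int nRes= REXI_Scan::Scan();    // scanner driven by regular expressions
#endif
    switch(nRes){
        case eIllegal:{
            m_sIllegal= GetTokenString();
            return nRes;
        }
        case T_PRICE:{
            // add your postprocessing code here
            return nRes;
        }
        default: return nRes;
    }
}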

Intended Use

Potential application areas include scanning a comma-separated file, implementing a pretty printer for C++ source code, or building a scanner for an interpreter. There are also quite a lot of freely available scanner generators out there (lex and flex, for example), but as far as I know, none of them generates scanners with such a neat interface as this one.


This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.


About the Author

Martin Holzherr
Switzerland
Last Updated 10 Apr 2001
Article Copyright 2001 by Martin Holzherr