Introduction

This is yet another implementation of a string tokenizer. It breaks a string into tokens. The token kinds that appear in this article are Word, Number, QuotedString, Symbol, Unknown, and EOF.

Each token carries its line number, column number, kind, and string value.
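As a rough illustration of that shape, here is a minimal sketch of what such a token type could look like. It is not the Token class from the download: only the Kind and Value property names and the TokenKind members appear in this article, while Line, Column, the constructor, and the ToString format are assumptions made for the example.

```csharp
using System;

// Hypothetical sketch; the real definitions live in Token.cs in the download.
public enum TokenKind { Unknown, Word, Number, QuotedString, Symbol, EOF }

public class Token
{
    public int Line { get; }        // assumed name: line where the token starts
    public int Column { get; }      // assumed name: column where the token starts
    public TokenKind Kind { get; }  // property name used in the article
    public string Value { get; }    // property name used in the article

    public Token(TokenKind kind, string value, int line, int column)
    {
        Kind = kind;
        Value = value;
        Line = line;
        Column = column;
    }

    // Assumed display format, handy when debugging a token stream.
    public override string ToString()
    {
        return Kind + " at " + Line + ":" + Column + ": " + Value;
    }
}
```

Keeping the position on the token itself means an error message can point at the exact place in the input without asking the tokenizer again.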

Here is a small example of how it works:

string input = "hello \"cool programmer\", your number: 3.45!";

StringTokenizer tok = new StringTokenizer(input);
tok.IgnoreWhiteSpace = true;
tok.SymbolChars      = new char[]{',', ':'};

// Print every token up to, but not including, the final EOF token
Token token = tok.Next();
while (token.Kind != TokenKind.EOF)
{
    Console.WriteLine(token.Kind.ToString() + ": " + token.Value);
    token = tok.Next();
}

and the output will be:

Word: hello
QuotedString: "cool programmer"
Symbol: ,
Word: your
Word: number
Symbol: :
Number: 3.45
Unknown: !

Note that ! is returned as Unknown because it was not defined as a symbol. You specify which characters are symbols through tok.SymbolChars, and whether whitespace is skipped through tok.IgnoreWhiteSpace.
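The rule behind that result can be sketched in a few lines. This is only an illustration of the idea, not the tokenizer's actual code: the SymbolDemo class and its Classify helper are hypothetical, and the enum is cut down to the two kinds involved.

```csharp
using System;

// Reduced, hypothetical copy of the enum with only the kinds needed here.
public enum TokenKind { Unknown, Symbol }

public static class SymbolDemo
{
    // Hypothetical helper: a character counts as a Symbol only if it is
    // listed in symbolChars; anything else falls through to Unknown.
    public static TokenKind Classify(char ch, char[] symbolChars)
    {
        return Array.IndexOf(symbolChars, ch) >= 0
            ? TokenKind.Symbol
            : TokenKind.Unknown;
    }
}
```

With the symbol set from the example, new char[]{',', ':'}, Classify('!', ...) gives Unknown. Adding '!' to the array makes the same call give Symbol.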

All the source code is included, so you can customize and modify the tokenizer. This section explains how to extend it so that you can parse your own custom tokens.

Suppose you want to read $string as a special token called Variable, where a Variable is $ followed by a variable name: (ALPHANUMERIC | _)*. First, add a new value to the TokenKind enum (in Token.cs):

enum TokenKind
{
...,
Variable
}

Then, in the Next method of StringTokenizer, add a new case right before default:

switch (ch)
{
    ...
    case '$':
    {
        return ReadVariable();
    }

    default:
    ...
}

Finally, you just need to write the ReadVariable method:

protected Token ReadVariable()
{
    StartRead();    // mark the position where the token begins

    // Consume the first character, which is $. If you don't want $ to be
    // returned as part of the token's Value, call StartRead() after
    // Consume() instead of before it.
    Consume();

    while (true)
    {
        char ch = LA(0);    // look at the next available character

        // if it's a letter, digit, or underscore, consume it and keep reading
        if (Char.IsLetterOrDigit(ch) || ch == '_')
            Consume();
        else        // otherwise the variable name has ended
            break;
    }

    // CreateToken builds the token with its line and column positions.
    // Its value is the text from the point where StartRead was called
    // up to the current position.
    return CreateToken(TokenKind.Variable);
}

That's it!
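To try the mark-and-consume pattern without the full download, the following self-contained sketch may help. MiniScanner is a hypothetical stand-in, not the article's StringTokenizer: the helper names StartRead, Consume, LA, and ReadVariable come from the article, but their bodies, the '\0' end-of-input sentinel, and the choice to return a plain string rather than a Token are assumptions.

```csharp
using System;

// Hypothetical stand-in for the tokenizer; helper bodies are assumptions.
public class MiniScanner
{
    private const char EndOfInput = '\0';   // assumed sentinel for "no more input"
    private readonly string data;
    private int pos;      // index of the next unread character
    private int start;    // index where the current token began

    public MiniScanner(string text)
    {
        data = text ?? "";
    }

    // Look ahead 'count' characters without consuming anything.
    private char LA(int count)
    {
        int i = pos + count;
        return i < data.Length ? data[i] : EndOfInput;
    }

    private void Consume() { pos++; }

    private void StartRead() { start = pos; }

    // Reads "$" followed by letters, digits, or underscores and returns
    // the matched text, including the leading "$".
    public string ReadVariable()
    {
        if (LA(0) != '$')
            throw new InvalidOperationException("Expected '$' at current position.");

        StartRead();
        Consume();    // the leading '$'

        while (true)
        {
            char ch = LA(0);
            if (char.IsLetterOrDigit(ch) || ch == '_')
                Consume();
            else
                break;
        }

        return data.Substring(start, pos - start);
    }
}
```

For example, new MiniScanner("$user_name1 = 5").ReadVariable() returns "$user_name1" and stops at the space that follows the name.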

This tokenizer is simple out of the box, easy to use and modify, and adds little overhead. For more complicated parsing, a lexer/parser generator such as ANTLR (http://www.antlr.org/) is a much better fit.

Last Updated 7 Jul 2004 | Copyright © CodeProject, 1999-2009