Introduction
This is yet another implementation of a string tokenizer. It breaks a string into tokens, recognizing the following kinds:
- WORD - a series of alphabetic characters or _
- NUMBER - a decimal number
- QUOTEDSTRING - a string that starts and ends with " and uses "" as the escape sequence
- WHITESPACE - a space or tab
- EOL - end of line; recognizes Windows \r\n, Unix \n, and Mac \r
- SYMBOL - any symbol character (customizable)
Each token carries its line number, column number, kind, and string value.
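From that description and the example below, the token type presumably exposes something like the following. This is only a sketch: Kind and Value appear in the usage example, while Line and Column are assumed names; the real definitions live in Token.cs.
public class Token
{
    public int Line { get; }        // line where the token starts (assumed name)
    public int Column { get; }      // column where the token starts (assumed name)
    public TokenKind Kind { get; }  // Word, Number, QuotedString, ...
    public string Value { get; }    // the raw token text

    public Token(TokenKind kind, string value, int line, int column)
    {
        Kind = kind;
        Value = value;
        Line = line;
        Column = column;
    }
}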
Here is a small example of how it works:
string input = "hello \"cool programmer\", your number: 3.45!";

StringTokenizer tok = new StringTokenizer(input);
tok.IgnoreWhiteSpace = true;
tok.SymbolChars = new char[] { ',', ':' };

// Read and print tokens until EOF is reached.
Token token;
do
{
    token = tok.Next();
    Console.WriteLine(token.Kind.ToString() + ": " + token.Value);
} while (token.Kind != TokenKind.EOF);
and the output will be:
Word: hello
QuotedString: "cool programmer"
Symbol: ,
Word: your
Word: number
Symbol: :
Number: 3.45
Unknown: !
Note that ! is returned as Unknown because it wasn't defined as a symbol. You can specify which characters count as symbols through the SymbolChars property, and whether whitespace is skipped through the IgnoreWhiteSpace property.
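For example, a hypothetical tweak to the example above makes ! come back as Symbol instead of Unknown:
// With '!' in the symbol set, the trailing '!' is tokenized as Symbol.
tok.SymbolChars = new char[] { ',', ':', '!' };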
All the source code is included, so you can customize and modify the tokenizer. This little section explains how to extend the tokenizer so that you can parse your own custom tokens.
Suppose you want to read $string as a special token called Variable, where a Variable is $ followed by a variable name matching (ALPHANUMERIC | _)*. First, add a new value to the TokenKind enum (in Token.cs):
enum TokenKind
{
    ...,
    Variable
}
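For reference, assuming the enum members mirror the token kinds listed at the top (the names here are an educated guess apart from those visible in the example output), the enum would end up looking something like:
enum TokenKind
{
    Unknown,       // character that matched nothing else (e.g. the '!' above)
    Word,
    Number,
    QuotedString,
    WhiteSpace,
    Symbol,
    EOL,
    EOF,
    Variable       // our new custom token
}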
Next, in StringTokenizer, inside the Next method, add a new case right before default:
switch (ch)
{
    ...

    case '$':
    {
        return ReadVariable();
    }

    default:
        ...
}
Finally, write the ReadVariable method:
protected Token ReadVariable()
{
    StartRead();
    Consume(); // consume the leading '$'

    // Keep consuming characters that match (ALPHANUMERIC | _)*.
    while (true)
    {
        char ch = LA(0);
        if (Char.IsLetterOrDigit(ch) || ch == '_')
            Consume();
        else
            break;
    }

    return CreateToken(TokenKind.Variable);
}
That's it!
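As a quick sanity check, the driver loop from the first example should now report variables. A sketch, with the output implied by the extension above shown in comments:
string input = "$count: 42";

StringTokenizer tok = new StringTokenizer(input);
tok.IgnoreWhiteSpace = true;
tok.SymbolChars = new char[] { ':' };

Token token;
do
{
    token = tok.Next();
    Console.WriteLine(token.Kind + ": " + token.Value);
} while (token.Kind != TokenKind.EOF);

// Expected output (the loop also prints a final EOF token):
// Variable: $count
// Symbol: :
// Number: 42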
Of course, this tokenizer is very simple out of the box and easy to modify; for more complicated parsing, you are better off with a lexer/parser generator such as ANTLR (http://www.antlr.org/). For simple jobs, though, this tokenizer is easy to use and adds very little overhead.