Introduction
This is yet another implementation of a string tokenizer. It breaks a string into tokens. The following token kinds are recognized:
- WORD - a series of alphabetic characters or _
- NUMBER - a decimal number
- QUOTEDSTRING - a string that starts with " and ends with ", and uses "" to escape a quote inside the string
- WHITESPACE - a space or tab
- EOL - end of line; Windows \r\n, Unix \n, and Mac \r are all recognized
- SYMBOL - any symbol character (customizable)
Each token contains its line number, column number, kind, and string data.
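To make the EOL rule concrete, here is a minimal standalone sketch of how the three line endings could be told apart. The class and method names (EolSketch, EolLength) are hypothetical and are not part of the tokenizer's source; the actual implementation may differ.

```csharp
using System;

static class EolSketch
{
    // Returns the length of the line terminator that starts at index i,
    // or 0 if there is no terminator at that position.
    // "\r\n" (Windows) counts as one terminator of length 2;
    // "\n" (Unix) and a lone "\r" (Mac) have length 1.
    public static int EolLength(string s, int i)
    {
        if (i < 0 || i >= s.Length) return 0;
        if (s[i] == '\r')
            return (i + 1 < s.Length && s[i + 1] == '\n') ? 2 : 1;
        return s[i] == '\n' ? 1 : 0;
    }
}
```

Checking for \r first matters: otherwise a Windows line ending would be reported as two separate EOL tokens.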
Here is a small example of how it works:
string input = "hello \"cool programmer\", your number: 3.45!";
StringTokenizer tok = new StringTokenizer(input);
tok.IgnoreWhiteSpace = true;
tok.SymbolChars = new char[]{',', ':'};
Token token;
do
{
    token = tok.Next();
    Console.WriteLine(token.Kind.ToString() + ": " + token.Value);
} while (token.Kind != TokenKind.EOF);
and the output will be (followed by one last line for the EOF token, which this loop also prints):
Word: hello
QuotedString: "cool programmer"
Symbol: ,
Word: your
Word: number
Symbol: :
Number: 3.45
Unknown: !
Note that ! is returned as Unknown because it wasn't defined as a symbol. You can specify which characters are symbols through tok.SymbolChars, and whether whitespace is ignored through tok.IgnoreWhiteSpace.
All the source code is included, so you can customize and modify the tokenizer. This section explains how to extend the tokenizer so that you can parse your own custom tokens.
Suppose you want to read $string as a special token called Variable, where a Variable is $ followed by a variable name: (ALPHANUMERIC | _)*. First, add a new value to the TokenKind enum (in Token.cs):
enum TokenKind
{
    ...,
    Variable
}
and in StringTokenizer, inside the Next method, add a new case right before default:
switch (ch)
{
    ...
    case '$':
    {
        return ReadVariable();
    }
    default:
    ...
}
Then you just need to write the ReadVariable method:
protected Token ReadVariable()
{
    StartRead();
    Consume(); // consume the leading '$'
    while (true)
    {
        char ch = LA(0);
        // a variable name consists of letters, digits, or '_'
        if (Char.IsLetterOrDigit(ch) || ch == '_')
            Consume();
        else
            break;
    }
    return CreateToken(TokenKind.Variable);
}
That's it!
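If you want to try the scanning rule on its own, outside the tokenizer class, it can be sketched as a standalone function over a plain string. The names VariableSketch and ReadVariable(string, int) are assumptions made for this illustration; they are not part of the tokenizer's source.

```csharp
using System;

static class VariableSketch
{
    // Reads a variable of the form '$' (ALPHANUMERIC | _)* starting at s[start].
    // Returns the matched text including the leading '$',
    // or an empty string if s[start] is not '$'.
    public static string ReadVariable(string s, int start)
    {
        if (start < 0 || start >= s.Length || s[start] != '$')
            return string.Empty;

        int pos = start + 1; // skip the '$'
        while (pos < s.Length && (char.IsLetterOrDigit(s[pos]) || s[pos] == '_'))
            pos++;

        return s.Substring(start, pos - start);
    }
}
```

Note that, because the rule ends in *, a lone $ with no name after it is still accepted as a (nameless) variable. If that is not what you want, require at least one name character before returning.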
Out of the box this tokenizer is very simple, easy to use and modify, and has little overhead. For more complicated parsing, though, it's much better to use lexer/parser generator tools like ANTLR (http://www.antlr.org/).
Especially for ex-Java developers. Thanks, mate!
All generalizations are wrong, including this one! (\ /) (O.o) (><)
This is exactly what I was looking for. Except for a minor modification to get the position of the token, I am using it as is. The code is very cleanly written. Thank you for sharing it.
John Mathews
Cheers for the simple but effective class. After implementing my own IBindingListView, I needed a simple SQL-like expression parser that can handle strings similar to the ones used on DataView.RowFilter. The only problem is that I want my strings wrapped in single quotes rather than double quotes. So I added the following field and property to your class:
private char _stringChar;
public char StringChar { get { return _stringChar; } set { _stringChar = value; } }
and changed all instances of '"' to _stringChar.
Cheers!
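For readers who want to see the idea in isolation, here is a rough standalone sketch of reading a quoted string with a configurable quote character, where a doubled quote character stands for one literal quote. The names QuotedStringSketch and TryReadQuoted are hypothetical. Unlike the tokenizer, which returns the token text with its surrounding quotes, this sketch returns the unescaped contents.

```csharp
using System;
using System.Text;

static class QuotedStringSketch
{
    // Tries to read a quoted string that starts at s[start].
    // On success, 'value' holds the unescaped contents (without the outer quotes)
    // and 'end' is the index just past the closing quote.
    // Returns false if s[start] is not the quote character or the string is unterminated.
    public static bool TryReadQuoted(string s, int start, char quote,
                                     out string value, out int end)
    {
        value = string.Empty;
        end = start;
        if (start < 0 || start >= s.Length || s[start] != quote)
            return false;

        StringBuilder sb = new StringBuilder();
        int pos = start + 1;
        while (pos < s.Length)
        {
            if (s[pos] == quote)
            {
                if (pos + 1 < s.Length && s[pos + 1] == quote)
                {
                    sb.Append(quote); // doubled quote -> one literal quote
                    pos += 2;
                    continue;
                }
                value = sb.ToString(); // closing quote found
                end = pos + 1;
                return true;
            }
            sb.Append(s[pos]);
            pos++;
        }
        return false; // ran off the end without a closing quote
    }
}
```

With quote set to '\'' this handles RowFilter-style literals such as 'it''s'.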
Nice tokenizer, but I missed two things: 1) A token delimiter: if I use the string "This is a string" with the delimiter ' ' I get 4 tokens, but if I use the delimiter 'a' I get only 2 tokens. 2) A counter: the number of tokens in the string.
With a space you should get 4 tokens, and there is only one 'a', so the string is split into 2 tokens: [This is ] and [ string], which is the correct behaviour. To count tokens you can easily do:
Token token;
int count = 0;
do
{
    token = tok.Next();
    count++;
} while (token.Kind != TokenKind.EOF);
and count will hold the number of tokens, including the final EOF token.
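The delimiter behaviour described here is the same as that of the framework's string.Split, so it can be checked without the tokenizer at all. The helper name CountParts is made up for this sketch.

```csharp
using System;

static class SplitSketch
{
    // Counts the pieces produced by splitting s on a single delimiter character.
    public static int CountParts(string s, char delimiter)
    {
        return s.Split(delimiter).Length;
    }
}
```

For "This is a string", splitting on ' ' gives 4 parts, while splitting on 'a' gives 2 parts, "This is " and " string", because the string contains only one 'a'.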
I insert the following:
tok.IgnoreWhitespace = true;
I get the message 'Ader.Text.StringTokenizer' does not contain a definition for 'IgnoreWhiteSpace'.
Any Ideas?
Find an alternative to this solution here. This one seems more Java-like. And Java-like is more likelyer
http://www.c-sharpcorner.com//Code/2003/June/JavaLikeStringTokenizer.asp
Not really. That alternative is not even close to Java's StringTokenizer; all it does is break the string into strings, similarly to what string.Split does. If you want your tokens to be more like programming language tokens, you can't do that with it.
You're 100% right. This one here is almost like a regular lexer of the sort generated by ANTLR or any of the myriad LL(k)-style parser generators that happen to create lexers as well. That JavaLike foobar is really just a hack that doesn't do anything very versatile. You might as well use string.Split(). The only thing I find very odd about this particular one is that it seems to allow you to inherit from it and modify the behavior in the inherited version, but it hardcodes a few things as private (line, column) that inherited methods like CreateToken wouldn't be able to see. Oh well. It's a valiant effort! And of course, since it comes with the source, you can just hack it up yourself and add what you like without deriving from it.
Hey! I know I'm really late here but I had a question perhaps you could help with.
I'm so glad I found this code, it has worked out wonderfully for me. My own hack was much uglier, and didn't parse everything correctly.
The only thing I need out of this code that I can't figure out is how to (if at all possible) retrieve the index of the token from the main string. I know I can access the line and column number, but I need the exact position in the main string that the token is found. Is this possible?
Thanks, James Brannan
In StringTokenizer there is already a pos field that keeps track of the current position. There is also savePos, which keeps track of the beginning of the current token. You can just add a position field to the Token class and, whenever a token is created in StringTokenizer, pass savePos to it.
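As a rough illustration of the idea, the sketch below remembers the start index of each token while scanning and stores it on the token. It is a deliberately simplified, self-contained scanner that splits only on whitespace. PositionedToken and PositionSketch are hypothetical names, and only pos and savePos echo the fields mentioned above.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical token type that carries its absolute position in the input.
sealed class PositionedToken
{
    public readonly string Value;
    public readonly int Position; // zero-based index into the input string

    public PositionedToken(string value, int position)
    {
        Value = value;
        Position = position;
    }
}

static class PositionSketch
{
    // Splits the input into runs of non-whitespace characters,
    // recording where each run starts.
    public static List<PositionedToken> Scan(string input)
    {
        List<PositionedToken> tokens = new List<PositionedToken>();
        int pos = 0;
        while (pos < input.Length)
        {
            if (char.IsWhiteSpace(input[pos])) { pos++; continue; }

            int savePos = pos; // beginning of the current token
            while (pos < input.Length && !char.IsWhiteSpace(input[pos]))
                pos++;

            tokens.Add(new PositionedToken(
                input.Substring(savePos, pos - savePos), savePos));
        }
        return tokens;
    }
}
```

The stored Position can then be used directly with input.Substring or to highlight the token in the original text.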
Aha, thanks. Sorry--I probably should have really looked at the code before asking. As soon as I posted I found this. Thanks again!
James
Great (reinux, 17:11 16 Jul '04)
Bah! If only I had checked my email for the CodeProject newsletter a couple of days ago, when I had to make the exact same thing... Thanks though, this is great!
Ooh! (dzCepheus, 17:57 14 Jul '04)
This is *really* neat! I haven't downloaded it yet, but I've been *looking* for something like this for a while.
I do know that regexes can mimic some of the functionality you have here, but this really is a more programmer-friendly form, I think. Less arcane than constructing a regex for something like this. Very cool!
Skydive -- Testing gravity, one jump at a time.