StringTokenizer

7 Jul 2004
StringTokenizer class that can be used for breaking up a string (or stream) into smaller strings.

Introduction

This is yet another implementation of a string tokenizer. It breaks a string into tokens. The following token kinds are recognized:

  • WORD - series of alpha characters or _
  • NUMBER - decimal number
  • QUOTEDSTRING - string that starts with " and ends with " and uses "" as escape character
  • WHITESPACE - space or tab
  • EOL - end of line. Recognizes Windows \r\n, Unix \n, and Mac \r
  • SYMBOL - any symbol character (customizable)

Each token carries its line number, column number, kind, and string value.
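
To make the line/column bookkeeping concrete, here is a minimal sketch of how positions can be tracked while all three end-of-line styles are recognized. It is not taken from the download: the class name LinePositions, the method Locate, and the choice of 1-based numbering are assumptions made for illustration only.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of the article's source code. It only
// illustrates line/column tracking across \r\n, \n, and \r line breaks.
public static class LinePositions
{
    // Returns the (assumed 1-based) line and column of every character
    // that is not part of a line break.
    public static List<(char Ch, int Line, int Column)> Locate(string text)
    {
        var result = new List<(char Ch, int Line, int Column)>();
        int line = 1, column = 1;

        for (int i = 0; i < text.Length; i++)
        {
            char ch = text[i];
            if (ch == '\r')
            {
                // Windows "\r\n" is a single line break: skip the '\n'.
                if (i + 1 < text.Length && text[i + 1] == '\n')
                    i++;
                line++;
                column = 1;
            }
            else if (ch == '\n')
            {
                // Unix line break.
                line++;
                column = 1;
            }
            else
            {
                result.Add((ch, line, column));
                column++;
            }
        }
        return result;
    }
}
```

Calling LinePositions.Locate("a\r\nb\rc\nd") places each of the four letters at column 1 of lines 1 through 4, so a lone \r, a lone \n, and the \r\n pair each count as exactly one line break.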

Here is a small example of how it works:

C#
string input = "hello \"cool programmer\", your number: 3.45!";

StringTokenizer tok = new StringTokenizer(input);
tok.IgnoreWhiteSpace = true;
tok.SymbolChars      = new char[]{',', ':'};

Token token;
do
{
    token = tok.Next();
    Console.WriteLine(token.Kind.ToString() + ": " + token.Value);
        
} while (token.Kind != TokenKind.EOF);

and the output will be (plus a final line for the EOF token, which the loop prints before it exits):

Word: hello
QuotedString: "cool programmer"
Symbol: ,
Word: your
Word: number
Symbol: :
Number: 3.45
Unknown: !

Note that ! is returned as Unknown because it was not defined as a symbol. You specify which characters count as symbols through tok.SymbolChars, and you control whether whitespace tokens are skipped through tok.IgnoreWhiteSpace.
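
In the output above, the QuotedString value still includes its surrounding quotes. If you need the raw text, a small helper along the following lines could strip the quotes and collapse each "" escape into a single ". This is only a sketch: QuotedStrings and Unquote are hypothetical names and are not part of the tokenizer.

```csharp
using System;

// Hypothetical helper, not part of the article's source code.
public static class QuotedStrings
{
    // Removes the surrounding double quotes from a QuotedString value
    // and turns every doubled quote ("") inside it into a single quote.
    public static string Unquote(string tokenValue)
    {
        if (tokenValue == null)
            throw new ArgumentNullException(nameof(tokenValue));

        bool wrapped = tokenValue.Length >= 2
            && tokenValue[0] == '"'
            && tokenValue[tokenValue.Length - 1] == '"';
        if (!wrapped)
            throw new FormatException("Value is not wrapped in double quotes.");

        // Drop the first and last character, then unescape "" to ".
        string inner = tokenValue.Substring(1, tokenValue.Length - 2);
        return inner.Replace("\"\"", "\"");
    }
}
```

Under these assumptions, passing the token value "cool programmer" (quotes included) returns cool programmer without them.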

All the source code is included, so you can customize and modify the tokenizer. This section explains how to extend the tokenizer so that it can parse your own custom tokens.

Suppose you want to read $string as a special token called Variable, where a Variable is $ followed by a variable name: (ALPHANUMERIC | _)*. First, add a new value to the TokenKind enum (in Token.cs):

C#
enum TokenKind
{
    ...,
    Variable
}

Then, in the Next method of StringTokenizer, add a new case right before default:

C#
switch (ch)
{
    ...
    case '$':
    {
        return ReadVariable();
    }

    default:
    ...
}

Finally, you just need to write the ReadVariable method:

C#
protected Token ReadVariable()
{
    StartRead();    // this marks the position of the beginning of the token

    Consume();
    // consume the first character, which is $. If you don't want $ to be
    // returned as part of the token's Value, call StartRead() after Consume()

    while (true)
    {
        char ch = LA(0);    // look at next available character
        // if it's a letter, digit, or underscore, we just
        // consume it and continue reading
        if (Char.IsLetterOrDigit(ch) || ch == '_')
            Consume();
        else        // if not, we break the loop
            break;
    }

    // CreateToken creates the token with line and
    // column positions and the value of the token
    // is going to be from the string that was started
    // when StartRead was called, until current position
    return CreateToken(TokenKind.Variable);
}

That's it!
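
When testing a hand-written reader like this, it can help to compare it against a regular expression for the same grammar, $ followed by (ALPHANUMERIC | _)*. The sketch below is one way to do that. VariablePattern and MatchAt are hypothetical names, and the pattern assumes that ALPHANUMERIC means Unicode letters and decimal digits.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical cross-check, not part of the article's source code.
public static class VariablePattern
{
    // \G anchors the match at the position where matching starts;
    // then '$' followed by zero or more letters, decimal digits, or '_'.
    private static readonly Regex Pattern =
        new Regex(@"\G\$[\p{L}\p{Nd}_]*");

    // Returns the variable text that begins at 'start', or null when
    // the character at 'start' is not '$'.
    public static string MatchAt(string input, int start)
    {
        Match m = Pattern.Match(input, start);
        return m.Success ? m.Value : null;
    }
}
```

For the input x = $total_1 + 2, matching at index 4 yields $total_1, which is the Value a Variable token for that position would be expected to carry.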

This tokenizer is deliberately simple: it is easy to use, easy to modify, and adds little overhead. For more complicated parsing, a lexer/parser generator such as ANTLR (http://www.antlr.org/) is a much better fit.

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.



