The Token Iterator

3 May 2000
Token Iterator provides an easy to use, familiar, and customizable way in which to go through the tokens contained in a string
  • Download source files - 6 Kb
    The purpose of Token Iterator is to provide an easy-to-use, familiar, and customizable way to go through the tokens contained in a string. Our hypothetical example breaks a string into words and outputs the words as a list. Here is the code.

    #include <iostream>
    #include <iterator>
    #include <string>
    #include <algorithm>
    #include "tokenizer.h"
    
    
    int main(){
    	
    	using namespace std;
    	using namespace jrb_stl_extensions;
    	
    	// Separate a string into words
    	string s;
    	cout << "Please enter a string with punctuation\n";
    	getline(cin,s);
    	TokenIterator<string> begin(s), end;
    	copy(begin,end,ostream_iterator<string>(cout,"\n"));
    
    }

    A few words of explanation. This class and its supporting classes are packaged in the jrb_stl_extensions namespace.

    Easy to use. Only two lines do the actual work: the one that constructs the TokenIterator pair and the call to copy.

    Let's analyze this a bit more. TokenIterator has a default template parameter that specifies the TokenizerFunc, which is an STL-style functor. The default is PunctSpaceTokenizer, which splits the string on whitespace and punctuation. Its constructor has this signature:

    PunctSpaceTokenizer(bool returnPunct = false, StringType p = WT_Punctuation1, 
                        StringType w = WT_Whitespace)

    returnPunct - Tells whether we want to return the punctuation. Returning the punctuation can be important, say, when building a mathematical expression parser, where we want to skip whitespace but do not want to skip the +'s and -'s.

    p - Punctuation. Two constants are provided. WT_Punctuation1 contains all the punctuation on a standard American keyboard. WT_Punctuation2 is the same as WT_Punctuation1, except that it omits - (hyphen/dash) and ' (apostrophe/single quote). The reason is that some words are hyphenated or contain apostrophes (like can't), and we want to keep them as one token.

    w - Whitespace. The difference between whitespace and punctuation is that punctuation can be returned using the returnPunct flag. WT_Whitespace is a constant that contains the standard whitespace characters.
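To make the effect of returnPunct concrete, here is a minimal, self-contained sketch of the splitting rule described above. It is not the code in tokenizer.h: the function name splitPunctSpace and the default punctuation and whitespace sets are assumptions chosen for illustration, and the real WT_Punctuation1 and WT_Whitespace constants may contain different characters.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for the behavior described in the article, not the
// library's PunctSpaceTokenizer. The default character sets are assumptions.
// Splits `s` on whitespace and punctuation. When returnPunct is true, each
// punctuation character is emitted as its own one-character token.
std::vector<std::string> splitPunctSpace(const std::string& s,
                                         bool returnPunct = false,
                                         const std::string& punct = "+-*/(),.;:!?",
                                         const std::string& white = " \t\r\n") {
    std::vector<std::string> tokens;
    std::string current;
    for (char c : s) {
        const bool isPunct = punct.find(c) != std::string::npos;
        const bool isWhite = white.find(c) != std::string::npos;
        if (isPunct || isWhite) {
            if (!current.empty()) {        // any delimiter ends the pending word
                tokens.push_back(current);
                current.clear();
            }
            if (isPunct && returnPunct) {  // keep operators such as + and -
                tokens.push_back(std::string(1, c));
            }
        } else {
            current += c;
        }
    }
    if (!current.empty()) {
        tokens.push_back(current);         // flush the last word
    }
    return tokens;
}
```

Under these assumptions, splitPunctSpace("a + b-c") yields a, b, c, while splitPunctSpace("a + b-c", true) yields a, +, b, -, c. Whitespace is dropped in both cases, which is the distinction the w parameter draws.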

    Now, let's look at the familiar part. TokenIterator is an STL forward iterator and can be used with any STL algorithm that accepts one, such as copy. In addition, copying a TokenIterator does NOT copy the whole string. Since the string is NEVER modified, a reference-counted pointer to it is shared among all TokenIterators referring to that string.
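The sharing idea can be sketched with standard components. The class below is a hypothetical WordIterator, not the library's TokenIterator: it splits on whitespace only, and it uses std::shared_ptr as a stand-in for whatever reference-counted pointer the library uses internally. The useCount member exists only to make the sharing observable.

```cpp
#include <cstddef>
#include <iterator>
#include <memory>
#include <string>

// Hypothetical sketch: an iterator over whitespace-separated words whose
// copies share one immutable string instead of duplicating it.
class WordIterator {
public:
    using iterator_category = std::forward_iterator_tag;  // mirrors the article
    using value_type = std::string;
    using difference_type = std::ptrdiff_t;
    using pointer = const std::string*;
    using reference = const std::string&;

    WordIterator() = default;  // default-constructed iterator marks the end
    explicit WordIterator(const std::string& s)
        : text_(std::make_shared<const std::string>(s)) { advance(); }

    reference operator*() const { return token_; }
    pointer operator->() const { return &token_; }
    WordIterator& operator++() { advance(); return *this; }
    WordIterator operator++(int) { WordIterator old(*this); advance(); return old; }

    friend bool operator==(const WordIterator& a, const WordIterator& b) {
        if (!a.text_ || !b.text_) return a.text_ == b.text_;  // end equals only end
        return a.text_ == b.text_ && a.begin_ == b.begin_;
    }
    friend bool operator!=(const WordIterator& a, const WordIterator& b) {
        return !(a == b);
    }

    // Number of iterators currently sharing the string (0 for an end iterator).
    long useCount() const { return text_.use_count(); }

private:
    void advance() {
        if (!text_) return;  // already at the end
        const std::string& s = *text_;
        const std::size_t start = s.find_first_not_of(" \t\r\n", end_);
        if (start == std::string::npos) {  // no more words: become the end iterator
            text_.reset();
            token_.clear();
            return;
        }
        std::size_t stop = s.find_first_of(" \t\r\n", start);
        if (stop == std::string::npos) stop = s.size();
        token_.assign(s, start, stop - start);  // only the current token is copied
        begin_ = start;
        end_ = stop;
    }

    std::shared_ptr<const std::string> text_;  // shared, never modified
    std::string token_;
    std::size_t begin_ = 0;
    std::size_t end_ = 0;
};
```

Copying a WordIterator copies the shared pointer, the cached token, and two offsets, so the cost does not grow with the length of the underlying string. Each copy advances independently, which is what lets algorithms make more than one pass.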

    On to customizability. PunctSpaceTokenizer might not suffice for all your needs. Not to worry: TokenIterator is easily customized by supplying your own TokenizerFunc.

    Here are the requirements for TokenizerFunc.

    1. Typedefs - TokenType
      This will refer to the type of the token. For our examples this is string. This will be the return type of operator*(). An example where more than a string token would be needed might be a parser that returns an object that contains the string, and other identifying information.
    2. operator()(...)
      This has the following prototype
      iter operator()(iter* pTokEnd,iter end,TokenType& tToken)
      Return value - This returns the start of the next token in the string.
      pTokEnd - This should be set to the STL-style end position (i.e., one past the end) of the token in the string.
      end - This is the end position of the string, and is passed into the functor.
      tToken - This should be set to the current token. If TokenType is string, it will be [retval, *pTokEnd).
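As an illustration of these requirements, here is a sketch of a functor with the prototype above that returns only runs of digits. Two points are assumptions, since the article does not spell them out: that on entry *pTokEnd holds the position where scanning should resume, and that returning end signals that no token is left. DigitRunTokenizer and the collectTokens driver are hypothetical names; the driver merely stands in for what TokenIterator would do with the functor.

```cpp
#include <cctype>
#include <string>
#include <vector>

// Hypothetical functor following the TokenizerFunc shape described above.
struct DigitRunTokenizer {
    typedef std::string TokenType;
    typedef std::string::const_iterator iter;

    // Assumption: on entry *pTokEnd is where scanning resumes.
    // Assumption: returning `end` means there are no more tokens.
    iter operator()(iter* pTokEnd, iter end, TokenType& tToken) const {
        iter start = *pTokEnd;
        while (start != end && !std::isdigit(static_cast<unsigned char>(*start))) {
            ++start;                 // skip everything that is not a digit
        }
        iter stop = start;
        while (stop != end && std::isdigit(static_cast<unsigned char>(*stop))) {
            ++stop;                  // extend over the run of digits
        }
        *pTokEnd = stop;             // one past the end of the token
        tToken.assign(start, stop);  // the token is [retval, *pTokEnd)
        return start;
    }
};

// Hypothetical driver: calls the functor repeatedly and collects the tokens.
template <typename TokenizerFunc>
std::vector<typename TokenizerFunc::TokenType>
collectTokens(const std::string& s, TokenizerFunc func = TokenizerFunc()) {
    std::vector<typename TokenizerFunc::TokenType> out;
    std::string::const_iterator pos = s.begin();
    const std::string::const_iterator end = s.end();
    typename TokenizerFunc::TokenType token;
    while (func(&pos, end, token) != end) {
        out.push_back(token);
    }
    return out;
}
```

With these assumptions, collectTokens<DigitRunTokenizer>("a12b345 6") produces 12, 345, and 6. A functor for the real TokenIterator should be checked against tokenizer.h, because the calling convention there may differ from the one assumed here.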

    If you want an example, study the CSVTokenizer functor.

    We will now examine CSVTokenizer. CSVTokenizer breaks a string into fields separated by a character C (for example, a comma). The separator is a template parameter and can be any character. Assuming the separator is a comma, the string is broken into fields at each comma, except for commas inside quotes. In addition, the constructor takes an escape character that defaults to \ ('\\' in C syntax). It acts as the same character does in C: \" means a literal " and \\ means a literal \.

    Perhaps an example will help:

    John \"Big John\" Doe,"1111 Anytown, USA 12345" will be broken into
    John "Big John" Doe
    1111 Anytown, USA 12345
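The quoting and escaping rules described above can be sketched independently of the library. The function below is a hypothetical helper, not the library's CSVTokenizer: the name splitCsvLine and its parameters are assumptions, and it returns all fields at once rather than working as an iterator functor.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper illustrating the rules described in the article:
//  - the escape character makes the following character literal,
//  - double quotes group text (separators inside them are kept) and are not output,
//  - an unquoted separator ends the current field.
std::vector<std::string> splitCsvLine(const std::string& line,
                                      char separator = ',',
                                      char escape = '\\') {
    std::vector<std::string> fields;
    std::string current;
    bool inQuotes = false;
    for (std::size_t i = 0; i < line.size(); ++i) {
        const char c = line[i];
        if (c == escape && i + 1 < line.size()) {
            current += line[++i];      // take the next character literally
        } else if (c == '"') {
            inQuotes = !inQuotes;      // toggle quoting; drop the quote itself
        } else if (c == separator && !inQuotes) {
            fields.push_back(current); // unquoted separator ends the field
            current.clear();
        } else {
            current += c;
        }
    }
    fields.push_back(current);         // the last field has no trailing separator
    return fields;
}
```

Applied to the line in the example above, this sketch returns the same two fields. Note that it treats an empty line as a single empty field, which may or may not match what CSVTokenizer does.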

    An example of using it follows. Add the following lines below our previous sample.

    // Here is some code that will break up entries formatted like this:
    // "field1","field3","field5,5"
    // It uses C-like escape codes for quotes, namely \" for " and \\ for \.
    // A comma is the field separator unless it is embedded inside quotes.
    cout << "Please enter a comma separated line of fields\n";
    getline(cin,s);
    TokenIterator<string,CSVTokenizer<string> > begin2(s),end2;
    copy(begin2,end2,ostream_iterator<string>(cout,"\n"));

    This will output the fields.

    Well, there is my overview of TokenIterator. I hope you enjoy using this class.

    Note: When the sample is compiled with MSVC 6, two warnings result: one about main not having a return statement, and one about a template-generated identifier being truncated to 255 characters in debug builds.

    John Bandela
    Copyright 2000 John R. Bandela

    License

    This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.

    Comments and Discussions

     
    Link Error
    Bernhard, 22-Jun-03 23:37

    Re: Link Error
    Bernhard, 23-Jun-03 4:43
    If anyone is using this lib, here is the answer (from the Visual C++ forum):

    From: John M. Drescher
    The problem is that every file that includes this header will define space for these constants. I would make my own cpp file and put the constants in there as they are in the header, and in the header put extern before each constant and remove everything between the = and the ;

    John


    thanks john