A C++ STL String Tokenizer

13 Oct 2001, Ms-PL
A C++ STL tokenizer class that tokenizes a string when the set of separator characters is specified by another string

Introduction

The class CTokenizer presented in this article tokenizes an STL string when the set of separator characters is specified by a predicate class. To keep it general, it is designed as a class template:

template <class Pred>
class CTokenizer { /*...*/ };

The separating (tokenizing) criterion is implemented in the template argument, the predicate class Pred. Predicate classes are usually derived from unary_function<char, bool> and implement operator(). I give three example predicate classes: CIsSpace, where the set of separators contains the white-space characters 0x09-0x0D and 0x20; CIsComma, where the separator is the comma character ','; and CIsFromString, where the set of separators is given by the characters of an STL string. Other predicate classes can easily be added as needed.
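As an illustration of adding a predicate, here is a minimal sketch. The names CIsDigit and FirstSeparator are hypothetical and not part of the article's source. The sketch also assumes a newer compiler, so it does not derive from unary_function (deprecated in C++11 and removed in C++17); a class with a suitable operator() is all that find_if needs.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Hypothetical predicate (not in the original article): treats every
// decimal digit as a separator.
class CIsDigit
{
public:
  bool operator()(char c) const
  {
    // Cast first: passing a negative char value to std::isdigit is undefined.
    return std::isdigit(static_cast<unsigned char>(c)) != 0;
  }
};

// Helper for illustration only: returns the index of the first character
// of s for which pred is true, or s.size() if there is none.
template <class Pred>
std::string::size_type FirstSeparator(std::string const& s, Pred const& pred)
{
  return static_cast<std::string::size_type>(
      std::find_if(s.begin(), s.end(), pred) - s.begin());
}
```

Because CIsDigit is default-constructible and callable with a char, it should also be usable as the Pred argument of CTokenizer, for example CTokenizer<CIsDigit>::Tokenize(oResult, "ab1cd22ef").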

Implementation

First I will present the implemented predicates.

For the case where the separators are the white-space characters 0x09-0x0D and 0x20:

class CIsSpace : public unary_function<char, bool>
{
public:
  bool operator()(char c) const;
};

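inline bool CIsSpace::operator()(char c) const
{
  // std::isspace from <cctype> returns nonzero if c is a white-space
  // character (0x09-0x0D or 0x20 in the default "C" locale). The cast
  // avoids undefined behavior for negative char values.
  return std::isspace(static_cast<unsigned char>(c)) != 0;
}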

For the case where the separator is the comma character ',':

class CIsComma : public unary_function<char, bool>
{
public:
  bool operator()(char c) const;
};

inline bool CIsComma::operator()(char c) const
{
  return (',' == c);
}

For the case where the separator is a character from a set of characters given in a STL string:

class CIsFromString : public unary_function<char, bool>
{
public:
  //Constructor specifying the separators
  CIsFromString(string const& rostr) : m_ostr(rostr) {}
  bool operator()(char c) const;

private:
  string m_ostr;
};

inline bool CIsFromString::operator()(char c) const
{
  // find() returns string::npos when c is not in the separator string
  return m_ostr.find(c) != string::npos;
}

Finally, the string tokenizer class implements Tokenize() as a static member function. Notice that CIsSpace is the default predicate class of CTokenizer.

template <class Pred=CIsSpace>
class CTokenizer
{
public:
  //The predicate should evaluate to true when applied to a separator.
  static void Tokenize(vector<string>& roResult, string const& rostr, 
                       Pred const& roPred=Pred());
};

//The predicate should evaluate to true when applied to a separator.
template <class Pred>
inline void CTokenizer<Pred>::Tokenize(vector<string>& roResult, 
                                            string const& rostr, Pred const& roPred)
{
  //First clear the results vector
  roResult.clear();
  string::const_iterator it = rostr.begin();
  string::const_iterator itTokenEnd = rostr.begin();
  while(it != rostr.end())
  {
    //Eat separators, stopping at the end of the string
    while(it != rostr.end() && roPred(*it))
      ++it;
    //Find next token
    itTokenEnd = find_if(it, rostr.end(), roPred);
    //Append token to result
    if(it < itTokenEnd)
      roResult.push_back(string(it, itTokenEnd));
    it = itTokenEnd;
  }
}
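The loop skips a run of separators, takes everything up to the next separator as a token, and never emits an empty token. A minimal sketch of the same idea as a free function follows. TokenizeSketch and IsCommaOrSemicolon are hypothetical names, and std::find_if_not assumes C++11, which the article's code does not use. The skip step is bounded by the end iterator, so input that ends in separators is handled.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Sketch only: a hypothetical free-function counterpart of
// CTokenizer<Pred>::Tokenize that returns the tokens by value.
template <class Pred>
std::vector<std::string> TokenizeSketch(std::string const& s, Pred pred)
{
  std::vector<std::string> tokens;
  std::string::const_iterator it = s.begin();
  while (it != s.end())
  {
    // Skip a run of separators; find_if_not stops at s.end() at the latest.
    it = std::find_if_not(it, s.end(), pred);
    // The token extends up to the next separator or the end of the string.
    std::string::const_iterator tokenEnd = std::find_if(it, s.end(), pred);
    if (it != tokenEnd)
      tokens.push_back(std::string(it, tokenEnd));
    it = tokenEnd;
  }
  return tokens;
}

// Example separator test (assumption: a plain function instead of a
// predicate class; any callable taking a char works here).
inline bool IsCommaOrSemicolon(char c)
{
  return c == ',' || c == ';';
}
```

With this sketch, TokenizeSketch(std::string(",a,,b;"), IsCommaOrSemicolon) yields the two tokens "a" and "b": leading, doubled, and trailing separators produce no empty tokens. If empty fields matter (as in CSV data), a different splitting rule would be needed.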

How to use

The following code snippet shows a simple usage example for each of the implemented predicates:

//The results vector, declared once and reused (Tokenize() clears it)
vector<string> oResult;

//Test CIsSpace() predicate
cout << "Test CIsSpace() predicate:" << endl;
//Call tokenizer
CTokenizer<>::Tokenize(oResult, " wqd \t hgwh \t sdhw \r\n kwqo \r\n  dk ");
//Display results
for(size_t i=0; i<oResult.size(); i++)
  cout << oResult[i] << endl;

//Test CIsComma() predicate
cout << "Test CIsComma() predicate:" << endl;
//Call tokenizer
CTokenizer<CIsComma>::Tokenize(oResult, "wqd,hgwh,sdhw,kwqo,dk", CIsComma());
//Display results
for(size_t i=0; i<oResult.size(); i++)
  cout << oResult[i] << endl;

//Test CIsFromString predicate
cout << "Test CIsFromString() predicate:" << endl;
//Call tokenizer
CTokenizer<CIsFromString>::Tokenize(oResult, ":wqd,;hgwh,:,sdhw,:;kwqo;dk,",
                                    CIsFromString(",;:"));
//Display results
cout << "Display strings:" << endl;
for(size_t i=0; i<oResult.size(); i++)
  cout << oResult[i] << endl;

Conclusion

The project StringTok.zip attached to this article includes the source code of the presented CTokenizer class and some test code. I am interested in any opinions and new ideas about this implementation.

License

This article, along with any associated source code and files, is licensed under The Microsoft Public License (Ms-PL)


Written By
Web Developer
Romania

Comments and Discussions

 
Re: strtok ?
Aliff, 2-Sep-04 3:41
What about strtok_r?
It is a reentrant version of the strtok() function, with the prototype:
char *strtok_r(char *s, const char *delim, char **ptrptr);
It saves the pointer to the next token in ptrptr.
