A C++ STL String Tokenizer






A C++ STL tokenizer class capable of tokenizing a string when the set of separator characters is specified by another string
Introduction
The class CTokenizer presented in this article tokenizes an STL string, with the set of separator characters specified by a predicate class. To keep the design general, it is implemented as a class template:
template <class Pred> class CTokenizer { /*...*/ };
The separating (tokenizing) criterion is implemented by the predicate class Pred passed as the template argument.
The predicate classes are usually derived from unary_function<char, bool> and implement operator(). (Note that unary_function was deprecated in C++11 and removed in C++17; with newer compilers the base class can simply be omitted.) I give only three examples of predicate classes: CIsSpace, where the separators are the white-space characters 0x09-0x0D and 0x20; CIsComma, where the separator is the comma character ','; and CIsFromString, where the set of separators is given by the characters of an STL string. Other predicate classes can easily be added as needed.
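To illustrate how another predicate might be added, here is a minimal sketch of a hypothetical CIsDigit class that treats decimal digits as separators. It is not part of the attached project. The unary_function base class is left out, an assumption made so that the sketch also compiles with C++17 and later. The helper CountSeparators is likewise hypothetical and exists only to exercise a predicate on its own.

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical predicate (not in the original project): true for '0'..'9'.
class CIsDigit
{
public:
    bool operator()(char c) const
    {
        // Cast first: passing a negative char value to std::isdigit is undefined.
        return std::isdigit(static_cast<unsigned char>(c)) != 0;
    }
};

// Hypothetical helper: counts the characters a predicate treats as separators.
template <class Pred>
std::size_t CountSeparators(std::string const& rostr, Pred const& roPred)
{
    return static_cast<std::size_t>(
        std::count_if(rostr.begin(), rostr.end(), roPred));
}
```

Any class with a const operator()(char) returning bool can be used in the same way.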
Implementation
First I will present the implemented predicates.
For the case when the separators are the white-space characters 0x09-0x0D and 0x20:
class CIsSpace : public unary_function<char, bool>
{
public:
    bool operator()(char c) const;
};

inline bool CIsSpace::operator()(char c) const
{
    // isspace (from <cctype>) returns nonzero if c is a white-space character
    // (0x09-0x0D or 0x20 in the default "C" locale). The cast avoids undefined
    // behavior for negative char values.
    return isspace(static_cast<unsigned char>(c)) != 0;
}
For the case where the separator is the comma character ',':
class CIsComma : public unary_function<char, bool>
{
public:
    bool operator()(char c) const;
};

inline bool CIsComma::operator()(char c) const
{
    return (',' == c);
}
For the case where the separator is any character from a set of characters given in an STL string:
class CIsFromString : public unary_function<char, bool>
{
public:
    // Constructor specifying the separators
    CIsFromString(string const& rostr) : m_ostr(rostr) {}
    bool operator()(char c) const;

private:
    string m_ostr;
};

inline bool CIsFromString::operator()(char c) const
{
    // find returns string::npos when c is not one of the separators
    return m_ostr.find(c) != string::npos;
}
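CIsFromString searches the separator string once for every character it tests. When the separator set is large, a lookup table is one possible alternative. The sketch below assumes that trade-off is worthwhile; CIsFromTable is a hypothetical name and is not part of the attached project.

```cpp
#include <bitset>
#include <climits>
#include <string>

// Hypothetical variant of CIsFromString: one table lookup per tested character.
class CIsFromTable
{
public:
    // Constructor specifying the separators
    explicit CIsFromTable(std::string const& rostr)
    {
        for (std::string::size_type i = 0; i < rostr.size(); ++i)
            m_oTable.set(static_cast<unsigned char>(rostr[i]));
    }

    bool operator()(char c) const
    {
        // The index is always below UCHAR_MAX + 1, so test() cannot throw.
        return m_oTable.test(static_cast<unsigned char>(c));
    }

private:
    // One bit per possible unsigned char value
    std::bitset<UCHAR_MAX + 1> m_oTable;
};
```

For short separator strings such as ",;:" the plain find() is likely just as good, so this is only worth considering for long separator sets.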
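Finally, here is the string tokenizer class. Its Tokenize() function is a static member function, so no CTokenizer object has to be created in order to call it. Notice that CIsSpace is the default predicate of the class template, and a default-constructed predicate is the default argument of the Tokenize() function.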
template <class Pred = CIsSpace>
class CTokenizer
{
public:
    // The predicate should evaluate to true when applied to a separator.
    static void Tokenize(vector<string>& roResult, string const& rostr,
                         Pred const& roPred = Pred());
};

// The predicate should evaluate to true when applied to a separator.
template <class Pred>
inline void CTokenizer<Pred>::Tokenize(vector<string>& roResult,
                                       string const& rostr,
                                       Pred const& roPred)
{
    // First clear the results vector
    roResult.clear();
    string::const_iterator it = rostr.begin();
    string::const_iterator itTokenEnd = rostr.begin();
    while (it != rostr.end())
    {
        // Eat separators; stop at the end of the string so that trailing
        // separators never cause end() to be dereferenced
        while (it != rostr.end() && roPred(*it))
            ++it;
        // Find the end of the next token
        itTokenEnd = find_if(it, rostr.end(), roPred);
        // Append token to result
        if (it < itTokenEnd)
            roResult.push_back(string(it, itTokenEnd));
        it = itTokenEnd;
    }
}
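The two-phase scan used by Tokenize() (skip separators, then find the end of the token) can also be sketched as a self-contained free function. The names TokenizeSketch and CIsCommaSketch, and the choice to return the vector by value, are assumptions made for illustration and are not the article's interface. In this sketch the separator-skipping step is guarded so that it cannot run past the end of the string.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical stand-in for CIsComma, so that the sketch is self-contained.
struct CIsCommaSketch
{
    bool operator()(char c) const { return c == ','; }
};

// Hypothetical free-function version of the scan; oPred must return true
// for separator characters.
template <class Pred>
std::vector<std::string> TokenizeSketch(std::string const& rostr, Pred oPred)
{
    std::vector<std::string> oResult;
    std::string::const_iterator it = rostr.begin();
    while (it != rostr.end())
    {
        // Phase 1: skip separators, never stepping past end()
        while (it != rostr.end() && oPred(*it))
            ++it;
        // Phase 2: the token ends at the next separator or at end()
        std::string::const_iterator itTokenEnd =
            std::find_if(it, rostr.end(), oPred);
        // Only non-empty tokens are stored
        if (it != itTokenEnd)
            oResult.push_back(std::string(it, itTokenEnd));
        it = itTokenEnd;
    }
    return oResult;
}
```

Because runs of separators are skipped as a whole, consecutive separators never produce empty tokens: the input ",,a,,bc," yields just "a" and "bc".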
How to use
The following code snippets show simple usage examples, one for each of the implemented predicates:
// Test CIsSpace() predicate
cout << "Test CIsSpace() predicate:" << endl;
// The results vector
vector<string> oResult;
// Call the tokenizer
CTokenizer<>::Tokenize(oResult, " wqd \t hgwh \t sdhw \r\n kwqo \r\n dk ");
// Display results
for (size_t i = 0; i < oResult.size(); i++)
    cout << oResult[i] << endl;

// Test CIsComma() predicate
cout << "Test CIsComma() predicate:" << endl;
// The results vector
vector<string> oResult;
// Call the tokenizer
CTokenizer<CIsComma>::Tokenize(oResult, "wqd,hgwh,sdhw,kwqo,dk", CIsComma());
// Display results
for (size_t i = 0; i < oResult.size(); i++)
    cout << oResult[i] << endl;

// Test CIsFromString predicate
cout << "Test CIsFromString() predicate:" << endl;
// The results vector
vector<string> oResult;
// Call the tokenizer
CTokenizer<CIsFromString>::Tokenize(oResult, ":wqd,;hgwh,:,sdhw,:;kwqo;dk,", CIsFromString(",;:"));
// Display results
cout << "Display strings:" << endl;
for (size_t i = 0; i < oResult.size(); i++)
    cout << oResult[i] << endl;
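For the string-based separator case, the expected tokens can be cross-checked with the standard std::string search members find_first_not_of and find_first_of. This is a different technique from the article's predicate-based tokenizer, shown only as a sketch for comparison; the function name SplitByChars is hypothetical.

```cpp
#include <string>
#include <vector>

// Hypothetical cross-check: splits rostr at any character of roSeparators.
std::vector<std::string> SplitByChars(std::string const& rostr,
                                      std::string const& roSeparators)
{
    std::vector<std::string> oResult;
    // Start of the first token: first character that is not a separator
    std::string::size_type nStart = rostr.find_first_not_of(roSeparators, 0);
    while (nStart != std::string::npos)
    {
        // End of the token: next separator, or npos for the last token
        std::string::size_type nEnd = rostr.find_first_of(roSeparators, nStart);
        if (nEnd == std::string::npos)
        {
            oResult.push_back(rostr.substr(nStart));
            break;
        }
        oResult.push_back(rostr.substr(nStart, nEnd - nStart));
        // Skip the run of separators that follows the token
        nStart = rostr.find_first_not_of(roSeparators, nEnd);
    }
    return oResult;
}
```

Called with the input ":wqd,;hgwh,:,sdhw,:;kwqo;dk," and the separators ",;:", this sketch produces the five tokens wqd, hgwh, sdhw, kwqo and dk.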
Conclusion
The project StringTok.zip attached to this article includes the source code of the CTokenizer class presented here, together with some test code. I am interested in any opinions and new ideas about this implementation.