Click here to Skip to main content
Email Password   helpLost your password?

Introduction

This article will present the tokenizing and splitting functionality of a simple C++ library called the String Toolkit. Tokenization in the context of string processing, is the method by which a sequence of elements are broken up or fragmented in sub-sequences. The indices in the original sequence that determine such breaks in the sequence are known as delimiters.

There are two types of delimiters, normal or thin delimiters which are of length one element and a thick delimiters which are of length two or more elements. Even though tokenization is primarily used in conjunction with strings, any sequence of types that can be iterated in a linear fashion can be tokenized, examples may be list of integers, a vector of person classes or a map of strings.

Another Tokenizer?

To date there have been many attempts to define a "standard" Tokenizer implementation in C++. Of them all the likely candidate might be the implementation in the BOOST library. Regardless proposed implementations should to some extent consider one or more of the following areas: over-all usage patterns, constructions, generality (or is it genericty these days?) of design, performance-efficiency.

1. Over-all usage patterns

Here we are mainly concerned with how easy it is to instantiate the tokenizer and integrate it into existing processing patterns, which most often than not requires integration with C++ STL algorithms and containers. A tokenzier by definition would be classed as a producer, so the question becomes how easy is it for others to consume its output? Another issue is consistency of the definition of a token in the situation where one has consecutive delimiters but is not compressing them - can or should there be such a thing as an empty token? and what do preceding and trailing delimiters mean? and when should they be included as part of the token?

2. Constructions

In the context of string tokenizing, the majority of implementations return the token as a new instance of a string. This process requires a string to be created on heap, populated by the sub-string in question from the original sequence, then returned back (some of this may be alleviated by Return Value Optimization RVO). In the case of iterators this is essentially another copy to the caller. Furthermore two kinds of tokens can make this situation worse, they are primarily a large sequence made up of lots of very short tokens or a large sequence made up of lots of very large tokens. The solution is not to return the token as a string but rather as a range and allow the caller to decide how they wish to inspect the token.

StrTk Token Iterators - Copyright Arash Partow

This minor change in interface design provides a great deal of flexibility and performance gain.

3. Generality(Genericity) of design

Most tokenizer implementations concern themselves only with strings, which for the most part is ok, because most things that need tokenizing are strings. However there will be times when one has a sequence of types that may need to be tokenized that aren't strings, hence a tokenizer should be designed in such a way to enable such features, moreover it becomes clear that the most basic of tokenizing algorithms are invariant to the type of the delimiter.

4. Performance and Efficiency

Tokenizing strings range from low frequency inputs such as user input or parsing of simple configuration files to more complex situations such as tokenizing of HTML pages that a web crawler/indexer might do, to parsing of large market-data streams in FIX format. Performance is generally important and can usually be helped along with good usage patterns that encourage reuse of instances, minimal preprocessing and allow for user supplied predicates for the more nasty areas of the process. It should be noted that everything in the proceeding article can be done by purely using the STL, that said, C++'s ability to allow one to skin the proverbial cat in numerous way gives rise to novel solutions that are for the most part not of any practical use other than to showcase ones abilities in using the STL.

Getting started

The String Toolkit Library (StrTk) provides two common tokenization concepts, a split function and a token iterator. Both these concepts require the user to provide a delimiter predicate and an iterator range over which the process will be carried out.

The tokenization process can be further parametrized by switching between "compressed delimiters" or "no compressed delimiters" mode. This essentially means that consecutive delimiters will be compressed down to one and treated as such. Below are two tables depicting the expected tokens from various types of input. The tables represent no compressed and compressed tokenization processes respectively. The delimiter in this instance is a pipe symbol | and <> denotes an empty token.

No Compressed Delimiters

InputToken List
aa
a|ba,b
a||ba,<>,b
|a<>,a
a|a,<>
|a||b<>,a,<>,b
||a||b||<>,<>,a,<>,b,<>,<>
|<>,<>
||<>,<>,<>
|||<>,<>,<>,<>

Compressed Delimiters

InputToken List
aa
a||ba,b
|a||b<>,a,b
||a||b||<>,a,b,<>
|<>,<>
||<>,<>
|||<>,<>

Delimiters

Two forms of delimiters are supported and they are single delimiter predicate and multiple delimiters perdicate otherwise known as SDP and MDP respectively. Essentially an SDP is where there is only one type that can break or fragment the sequence, where as with MDPs there is more than one unique type that can break the sequence. It is possible to represent a SDP using the MDP, however from a performance POV having separate predicates is far more efficient. Additionally for strings based on char or unsigned char (8-bit versions) there is a MDP that has a look-up complexity of O(1) making it greatly more efficient than the basic MDP.

Single Delimiter Predicate

Instantiation requires specialization of type and construction requires an instance of the type.

strtk::single_delimiter_predicate<typename T>(const T& t)

strtk::single_delimiter_predicate<std::string::value_type> predicate('|');

Multiple Delimiter Predicate

Instantiation requires specialization of type and construction requires a sequence of potential delimiters through a range described by iterators.

strtk::multiple_delimiter_predicate<typename T>(Iterator begin, Iterator end);

std::string str_delimiters = " ,.;:<>'[]{}()_?/'`~!@#$%^&*|-_\"=+";
strtk::multiple_delimiter_predicate<char> mdp1(str_delimiters.begin(),str_delimiters.end());

unsigned int uint_delimiters[5] = {1,10,101,1010,10101};
strtk::multiple_delimiter_predicate<unsigned int> mdp2(uint_delimiters,uint_delimiters + 5);

As previously mentioned tokenization of data need not be limited to strings comprised of chars, but can also be extended to other PODs or complex types. In the above example a predicate used for tokenizing a sequence of unsigned ints is being defined.

Multiple Char Delimiter Predicate

Instantiation requires an iterator range based on either unsigned char or char. This delimiter is more efficient than the simple MDP as it has a predicate evaluation of O(1) due to the use of a lookup-table as opposed to O(n) where n is the number of delimiters. Performance increase is seen as more delimiters are used.

strtk::multiple_char_delimiter_predicate(const std::string&)
strtk::multiple_char_delimiter_predicate(const std::string::const_iterator begin,const std::string::const_iterator end)

strtk::multiple_char_delimiter_predicate predicate(" .;?");

The delimeter concept can be extended to the point where the predicate itself can act as a state machine transitioning from state to state based conditions and rules related to the current symbol being processed. This extension can lead to some very interesting parsing capabilities.

Split

This is a function that performs tokenization over an entire sequence in one go. strtk::split takes a sequence through a range of iterators or in the case of a string through a reference to its instance, a delimiter predicate and an output iterator. It populates the output iterator with the tokens it extracts. The tokens in this case are std::pairs of iterators for the sequence.

Split can be used in a "simple - no frills" manner by simply passing the necessary parameters:

std::string str = "abc|123|xyz|789";
strtk::std_string::token_list_type token_list;
strtk::split(" |.;?",str,std::back_inserter(token_list));

strtk::split can also be used in a more explicit manner whereby the exact type of delimiter predicate can be specificed by the user:

std::string str = "abc|123|xyz|789";
strtk::std_string::token_list_type token_list;

strtk::single_delimiter_predicate<std::string::value_type> predicate('|');
strtk::split(str,predicate,std::back_inserter(token_list));

strtk::split provides an additional usage option that allows the user to specify if they would like to either compress the delimiters and whether or not they would like to include the delimiter as part of the token range. This enum parameter is called strtk::split_options and has the following values:

Split OptionDefinition
strtk::split_options::default_modeDefault options
strtk::split_options::compress_delimitersConsecutive delimiters are treated as one
strtk::split_options::include_delimitersDelimiters are included in the resulting token range
strtk::split(str,
             predicate,
             std::back_inserter(token_list),
             strtk::split_options::compress_delimiters | strtk::split_options::include_delimiters);

Another way of using split may be to use the strtk::multiple_char_delimiter_predicate as follows:

std::string str = "abc?123;xyz.789";
strtk::std_string::token_list_type token_list;
strtk::multiple_char_delimiter_predicate predicate(" .;?");
strtk::split(str,predicate,std::back_inserter(token_list));

The contents of the token_list can be printed out as follows:

strtk::std_string::token_list_type::iterator it = token_list.begin();
while(it != token_list.end())
{
   std::string current_token = std::string((*it).first,(*it).second);
   std::cout << current_token << "\t";
   ++it;
}

Split N-Tokens

A natural extension is strtk::split_n. This function provides the ability to tokenize a sequence up until a specific number of tokens have been encountered or until there are no more tokens left. The return value of the strtk::split_n would be the number of tokens encountered.

std::string str = "token1?token2,token3;token4,token5";
strtk::std_string::token_list_type token_list;
const std::size_t token_count = 4;
strtk::split_n(" ,.;?",str,token_count,std::back_inserter(token_list));
strtk::std_string::token_list_type::iterator it = token_list.begin();
while(it != token_list.end())
{
   std::cout << "[" << std::string((*it).first,(*it).second) << "]\t";
   ++it;
}
std::cout << std::endl;

Offset Splitter

Another simple variant is the strtk::offset_splitter. This form of split takes a series of offsets and a iterator range or string and determines the tokens based on the offsets. This function can be set to perform a single pass of the offsets or to rotate them until the range has been entirely consumed. The example below demonstrates how a string representing time can be tokenized into its various parts (hour,minutes,seconds,milliseconds)

std::string str = "091011345";  // 9:10:11sec 345ms
const int offset_list[] = {2, // hour
                           2, // minute
                           2, // second
                           3};// ms
const strtk::offset_predicate<5> predicate(offset_list);
strtk::std_string::token_list_type token_list;
strtk::offset_splitter(str.begin(),str.end(),predicate,std::back_inserter(token_list));
strtk::std_string::token_list_type::iterator it = token_list.begin();
while(it != token_list.end())
{
   std::cout << "[" << std::string((*it).first,(*it).second) << "] ";
   ++it;
}
std::cout << std::endl;

Split Regex

Another form of splitter is based on the concept of using regular expressions as the delimiter predicate. Below is a simple example of splitting tokens wrapped in round-brackets.

std::string str = "(12)(345)(6789)(0ijkx)(yz)";
std::list<std::string> token_list;
strtk::split_regex("\\(.*?\\)",str,std::back_inserter(token_list));
std::list::iterator it = token_list.begin();
while(it != token_list.end())
{
   std::cout << "[" << (*it) << "]\t";
   ++it;
}
std::cout << std::endl;

Tokenizer

The tokenizer is specialized on a sequence iterator and predicate. It is constructed with a range of iterators for the sequence and a reference to the desired predicate. Defines exist for std::string tokenizers. The tokenizer provides an iteration pattern as a means for accessing the tokens in the sequence.

const unsigned int data_size = 12;
unsigned int data[data_size] = {1,2,3,0,4,5,6,0,7,8,0,9};
strtk::single_delimiter_predicate<unsigned int> predicate(0);
strtk::tokenizer< unsigned int*,strtk::single_delimiter_predicate<unsigned int> > tokenizer(data,data + data_size,predicate);

Adding "true" as the last parameter to the constructor of the tokenizer will enable compressed delimiters mode, by default it is set to false.

strtk::tokenizer< unsigned int*,strtk::single_delimiter_predicate<unsigned int> > tokenizer(data,data + data_size,predicate,true);

Iteration over the tokens is performed as follows:

strtk::tokenizer< unsigned int*,strtk::single_delimiter_predicate<unsigned int > >::iterator it = tokenizer.begin();
while (it != tokenizer.end())
{
   std::copy((*it).first,(*it).second,std::ostream_iterator<unsigned int>(std::cout," "));
   std::cout << std::endl;
   ++it;
}

A typical std::string can be tokenized in the following manner:

std::string str = "abc|123|xyz|789";
strtk::std_string::tokenizer<>::type tokenizer(str,"|");
strtk::std_string::tokenizer<>::type::iterator it = tokenizer.begin();
while (it != tokenizer.end())
{
   std::cout << "[" << std::string((*it).first,(*it).second) << "]\t";
   ++it;
}
std::cout << std::endl;

Another common situation may be tokenizing a sequence of strings, such as the following:

std::string str_list[] = { "abc" , "delimiter" , "ijk" , "delimiter" ,
                           "mno" , "delimiter" , "rst" , "uvw" ,
                           "delimiter" , "xyz" };

const std::size_t str_list_size = sizeof(str_list)/sizeof(std::string);

strtk::range_adapter<std::string> range(str_list,str_list_size);
strtk::single_delimiter_predicate<std::string> predicate("delimiter");
strtk::tokenizer< std::string*,strtk::single_delimiter_predicate<std::string> > tokenizer(range.begin(),range.end(),predicate);
strtk::tokenizer< std::string*,strtk::single_delimiter_predicate<std::string> >::iterator it = tokenizer.begin();
while(it != tokenizer.end())
{
   std::copy((*it).first,(*it).second,std::ostream_iterator<std::string>(std::cout," "));
   std::cout << std::endl;
   ++it;
}

Some Simple Examples

The following examples demonstrate parsing of PODs such as int and double into STL compatible sequences (std::vector, std::deque or std::list).

std::string int_string = "1,2,3,4,5,6,7,8,9,10,11,12,13,14,15";
std::vector<int> int_list;
strtk::parse(int_string,",",int_list);

std::string double_string = "123.456,789.012,345.678,901.234,567.890";
std::deque<double> double_list;
strtk::parse(double_string,",",double_list);

std::string data_string = "a,bc,def,ghij,klmno,pqrstu,vwxyz";
std::list<std::string> string_list;
strtk::parse(data_string,",",string_list);

Similar to what is described above, the following demonstrates parsing of upto "N" elements into an STL compatible sequence.

std::string int_string = "100,200,300,400,500,600,700,800,900";
std::vector<int> int_list;
strtk::parse_n(int_string,",",5,int_list);

std::string double_string = "123.456,789.012,345.678,901.234,567.890";
std::deque<double> double_list;
strtk::parse_n(double_string,",",3,double_list);

std::string data_string = "a,bc,def,ghij,klmno,pqrstu,vwxyz";
std::list<std::string> string_list;
strtk::parse_n(data_string,",",6,string_list);

Note: The return value of the routine strtk::parse_n indicates how many elements were parse and placed into the specified sequence.

A Practical Example

Lets assume you have been given an english text file to process, with the intention of extracting a lexicon from the file.

One solution would be to break the problem down to a line by line tokenization problem. In this case you would define a functional object such as the following which will take the container in which you plan on storing your tokens (words) and a predicate and insert the tokens as strings into your container.

template<typename Container, typename Predicate>
struct parse_line
{
public:
   parse_line(Container& container, const Predicate& predicate)
   : container_(container),
     predicate_(predicate)
   {}

   inline void operator() (const std::string& str) const
   {
      strtk::split(str,
                   predicate_,
                   strtk::range_to_type_back_inserter(container_),
                   strtk::split_options::compress_delimiters);
   }

private:
   Container& container_;
   const Predicate& predicate_;
};

The whole thing together would include a process of opening the file and reading it line by line each time invoking the parse_line would be as follows:

template<typename Container>
void parse_text(const std::string& file_name, Container& c)
{
   std::string delimiters = " ,.;:<>'[]{}()_?/'`~!@#$%^&*|-_\"=+";
   strtk::multiple_char_delimiter_predicate predicate(delimiters);
   strtk::for_each_line(file_name,parse_line<Container,strtk::multiple_char_delimiter_predicate>(c,predicate));
}

int main()
{
   std::string text_file_name = "text.txt";
   std::deque< std::string > word_list;
   parse_text(text_file_name,word_list);
   std::cout << "Token Count: " << word_list.size() << std::endl;
   return 0;
}

Now coming back to the original problem, that being the construction of a lexicon. In this case the set of "words" should only contain words of interest. For the sake of simplicity lets define words of interest as being anything other than the following prepositions: as, at, but, by, for, in, like, next, of, on, opposite, out, past, to, up and via. The list definition might be something like the following:

const std::string not_of_interest_list [] = { "as", "at", "but", "by", "for",
                                              "in", "like", "next", "of", "on",
                                              "opposite", "out", "past", "to",
                                              "up", "via", ""};

const std::size_t list_size = sizeof(not_of_interest_list)/sizeof(std::string);

Some minor updates to the parse_line predicate include using the filter_on_match predicate for determining if the currently processed token is a preposition and also the invocation of the range_to_type back_inserter to convert the tokens from their range iterator representation to a string representation compatible with the user defined container. For the new implementation to provide unique words of interest the simplest change that can be made is to replace the deque used as the container for the word_list to some kind of 1-1 associative container such as a map or set. The following is the improved version of the parse_line predicate:

template<typename Container, typename Predicate>
struct parse_line
{
public:
   parse_line(Container& container, const Predicate& predicate)
   : container_(container),
     predicate_(predicate),
     tmp_(""),
     tokenizer_(tmp_,predicate_,true),
     filter_(not_of_interest_list,not_of_interest_list + list_size,
             strtk::range_to_string_back_inserter_iterator<Container>(container_),
             true,false)
   {}

   inline void operator() (const std::string& s)
   {
      const filter_type& filter = filter_;
      strtk::for_each_token(s,tokenizer_,filter);
   }

private:
   Container& container_;
   const Predicate& predicate_;
   std::string tmp_;
   typename strtk::std_string_tokenizer<Predicate>::type tokenizer_;
   strtk::filter_on_match<strtk::range_to_string_back_inserter_iterator<Container>> filter_;
};

The above described predicate can be greatly simplified by using binders and various lamba expressions.

Another Example

When performing serialization or deserialization of an instance object such as a class or struct, a simple approach one could take would be to take each of the members and convert them into string representations and from those strings construct a larger string delimiting each member with a special character guaranteed not to exist in any of the string representations.

In this example we will assume that there exists a struct which represents the properties of a person, a person struct if you will:

struct person
{
   unsigned int id;
   std::string name;
   unsigned int age;
   double height
   double weight;
};

The process of populating a person struct would entail having an instance of a person and the necessary data string. The following is an example of how this would be done using the strtk::parse function.

Person Tuple Format

Token0Delimiter0Token1Delimiter1Token2Delimiter2Token3Delimiter3Token4
Unique ID(hex)|Name|Age|Height(m)|Weight(kg)
std::string data = "0xFA37ED12|Rumpelstiltskin|397|1.31|58.7";
person p;
strtk::hex_to_number_sink<unsigned int> hex_sink(p.id); // register id with the hex sink
strtk::parse(data,"|",hex_sink,p.name,p.age,p.height,p.weight);

Batch processing of a text file comprised of one person tuple per-line is somewhat similar to the previous example. A predicate is defined that takes a container specialized on the person struct, and a delimiter predicate with which the strtk::parse function will be invoked. This predicate is then instantiated and coupled with the text file name, is fed to the strtk::for_each_line processor.

template<typename Container, typename Predicate>
struct parse_person_tuple
{
public:
   parse_person_tuple(Container& container)
   : container_(container)
   {}

   inline void operator() (const std::string& s)
   {
      strtk::hex_to_number_sink<unsigned int> hex_sink(p_.id);
      if(strtk::parse(s,"|",hex_sink,p_.name,p_.age,p_.height,p_.weight))
      {
         container_.push_back(p_);
      }
   }

private:
   Container& container_;
   person p_;
};

Bringing the above pieces together to process a file results in the following:

int main()
{
   std::string text_file_name = "person_records.txt";
   std::deque<person> person_list;
   strtk::for_each_line(file_name,predicate_type(person_list));
   return 0;
}

Another similar example to the previous, would be parsing a text file of 3D coordinates into a sequence. This can be done easily and cleanly using lambdas and strtk as follows:

struct point
{
   double x,y,z;
};

int main()
{
   std::string point_data = "point_data.txt";
   std::deque<point> points;
   point p;
   strtk::for_each_line(point_data,
                        [](const std::string& str)
                        {
                           if(strtk::parse(str,",",p.x,p.y,p.z))
                              points.push_back(p);
                        });
   return 0;
}

Simple 3D Mesh File Format Parser

Lets assume there is a file format for expressing a mesh. The format consists of a section that defines the unique vertexes in the mesh, and another section that defines the triangles in the mesh as a tuple of three indicies indicating the vertexes used. Each section is preceded by an integer that denotes the number of subsequent elements in that section. An example of such a file is the following:

5
+1.0,+1.0,+1.0
-1.0,+1.0,-1.0
-1.0,-1.0,+1.0
+1.0,-1.0,-1.0
+0.0,+0.0,+0.0
4
0,1,4
1,2,4
2,3,4
3,1,4

Code to parse such a file format is as follows:

struct point
{
   double x,y,z;
};

struct triangle
{
  std::size_t i0,i1,i2;
};

int main()
{
   std::string mesh_file = "mesh.txt";
   std::ifstream stream(mesh_file.c_str());
   std::string s;
   // Process points section
   std::deque<point> points;
   point p;
   std::getline(stream,s);
   std::size_t point_count = strtk::string_to_type_converter<std::size_t>(s);
   strtk::for_each_line_n(stream,
                          point_count,
                          [](const std::string& line)
                          {
                             if(strtk::parse(line,",",p.x,p.y,p.z))
                                points.push_back(p);
                          });
   // Process triangles section
   std::deque<triangle> triangles;
   triangle t;
   std::getline(stream,s);
   std::size_t triangle_count = strtk::string_to_type_converter<std::size_t>(s);
   strtk::for_each_line_n(stream,
                          triangle_count,
                          [](const std::string& line)
                          {
                             if(strtk::parse(line,",",t.i0,t.i1,t.i2))
                                triangles.push_back(t);
                          });
   return 0;
}

Pick A Random Line From A Text File

A random line of text is to be picked/selected from a user provided text file comprised of lines of varying length, such that the probability of the line being picked is 1/N where N is the number of lines in the text file. There are many trivial solutions to this problem, however if one were to further constrain the problem by indicating the file is very large (TBs) and that the system upon which the solution is to run is very limited memory-wise, most if not all trivial solutions such as storing indexes of line offsets, or reading the entire file into memory etc will be eliminated.

There is however a very elegant solution to this problem of O(n), O(1) time and space complexities respectively, involving scanning the entire file line by line once, at every ith line choosing whether or not to replace the final result line with the current line by sampling a uniform distribution of range [0,1) and performing a replace if and only if the random value is less than 1/i.

The logic of this solution revolves around the fact that the probability for selecting the ith line will be 1/i and the total probability for selecting any of the previous i-1 lines will be 1 -(1/i) or (i-1)/i. Because there are (i-1) lines before the ith line, we divide the previous probability by (i-1), resulting in a probability of 1/i of being selected for any one of lines up to and including the ith line. If the ith line were to be the last line of the text file, then this results in each of the lines having a selection probability of 1/N - simple and sweet, and so to is the following implementation of said solution:

int main(int argc, char* argv[])
{
   std::string file_name = argv[1];
   std::string line;
   std::size_t i = 0;
   strtk::uniform_real_rng rng(static_cast<std::size_t>(::time(0));

   strtk::for_each_line(file_name,
                        [&i,&line,&rng](const std::string& s)
                        { if (rng() < (1.0 / ++i)) line = s; }
                       );
   std::cout << line << std::endl;
   return 0;
}

Note: TAOCP Volume II section 3.4.2 has an in depth discussion about this problem and other similar problems relating to random distributions. Also one should note that the above example has an inefficiency, in that upon every string replace, an actual string is being copied, normally all one needs to do is maintain a file offset to the beginning of the line, not doing this causes slow downs due to continuous memory allocations/reallocations which is made all the worse when large lines are encountered.

Token Grid

Strtk provides a means to easily parse and consume 2D grids of tokens in an efficient and simple manner. A grid is simply defined as a series of rows comprised of tokens. The ith token of a row is grouped with every ith token of all other rows to produce a column. Tokens can be processed as either rows or columns.

An example of a simple token grid, where each numeric value is deemed to be a token:

1.12.23.34.45.5
1.12.23.34.45.5
1.12.23.34.45.5
1.12.23.34.45.5
1.12.23.34.45.5

A token grid can be either passed in via a file or a string buffer. The delimiters to be used for parsing the columns and rows can also be provided, if not provided standard common defaults will be used.

The following demonstrates how each cell in the grid can be access and cast to a specific type.

std::string data = "1,2,3,4,5,6\n"
                   "7,8,9,0,1,2\n"
                   "3,4,5,6,7,8\n"
                   "9,0,1,2,3,4\n"
                   "5,6,7,8,9,0\n";

strtk::token_grid grid(data,data.size(),",");

for(std::size_t r = 0; r < grid.row_count(); ++r)
{
   strtk::token_grid::row_type row = grid.row(r);
   for(std::size_t c = 0; c < row.size(); ++c)
   {
      std::cout << grid.get<int>(r,c) << "\t";
   }
   std::cout << std::endl;
}

The following example demonstrates how averages over rows and columns of a token grid can be computed:

std::string data = "1.1,1.1,1.1,1.1,1.1,1.1\n"
                   "2.2,2.2,2.2,2.2,2.2,2.2\n"
                   "3.3,3.3,3.3,3.3,3.3,3.3\n"
                   "4.4,4.4,4.4,4.4,4.4,4.4\n"
                   "5.5,5.5,5.5,5.5,5.5,5.5\n"
                   "6.6,6.6,6.6,6.6,6.6,6.6\n";

strtk::token_grid grid(data,data.size(),",");

std::vector<double> avg_c(grid.row(0).size(),0.0);
std::vector<double> avg_r(grid.row_count(),0.0);
std::vector<double> tmp(grid.row(0).size(),0.0);
std::fill(avg_c.begin(),avg_c.end(),0.0);

for(std::size_t i = 0; i < grid.row_count(); ++i)
{
   grid.row(i).parse(tmp.begin());
   std::transform(avg_c.begin(),avg_c.end(),tmp.begin(),avg_c.begin(),std::plus());
   avg_r[i] = std::accumulate(tmp.begin(),tmp.end(),0.0) / tmp.size();
}

for(std::size_t i = 0; i < avg_c.size(); avg_c[i++] /= grid.row_count());

std::cout << "Column Averages:\t";
std::copy(avg_c.begin(),avg_c.end(),std::ostream_iterator(std::cout,"\t"));
std::cout << "\n";

std::cout << "Row Averages:\t";
std::copy(avg_r.begin(),avg_r.end(),std::ostream_iterator(std::cout,"\t"));
std::cout << "\n";

The original intent of the token grid was to support simple and efficient processing of data tuples. The following example demonstrates a simple summation of volume based on OHLC data.

                          //Date,Symbol,Open,Close,High,Low,Volume
std::string market_data = "20090701,GOOG,424.2000,418.9900,426.4000,418.1500,2310768\n"
                          "20090701,MSFT,24.0500,24.0400,24.3000,23.9600,54915127\n"
                          "20090702,GOOG,415.4100,408.4900,415.4100,406.8100,2517630\n"
                          "20090702,MSFT,23.7600,23.3700,24.0400,23.2100,65427699\n"
                          "20090703,GOOG,408.4900,408.4900,408.4900,408.4900,0\n"
                          "20090703,MSFT,23.3700,23.3700,23.3700,23.3700,0\n"
                          "20090706,GOOG,406.5000,409.6100,410.6400,401.6600,2262557\n"
                          "20090706,MSFT,23.2100,23.2000,23.2800,22.8700,49207638\n"
                          "20090707,GOOG,408.2400,396.6300,409.1900,395.9801,3260307\n"
                          "20090707,MSFT,23.0800,22.5300,23.1400,22.4600,52842412\n"
                          "20090708,GOOG,400.0000,402.4900,406.0000,398.0600,3441854\n"
                          "20090708,MSFT,22.3100,22.5600,22.6900,2200000,73023306\n"
                          "20090709,GOOG,406.1200,410.3900,414.4500,405.8000,3275816\n"
                          "20090709,MSFT,22.6500,22.4400,22.8100,22.3700,46981174\n"
                          "20090710,GOOG,409.5700,414.4000,417.3700,408.7000,2929559\n"
                          "20090710,MSFT,22.1900,22.3900,22.5400,22.1500,43238698\n";

strtk::token_grid grid(market_data,market_data.size(),",");

std::size_t goog_total_volume = 0; //should be long long.
std::size_t msft_total_volume = 0;

static const std::size_t volume_column = 6;
static const std::size_t symbol_column = 1;

grid.accumulate_column(volume_column,
                      [](const strtk::token_grid::row_type& row) -> bool
                      {
                         return "GOOG" == row.get<std::string>(symbol_column);
                      },
                      goog_total_volume);

grid.accumulate_column(volume_column,
                      [](const strtk::token_grid::row_type& row) -> bool
                      {
                         return "MSFT" == row.get<std::string>(symbol_column);
                      },
                      msft_total_volume);

std::cout << "Total Volume (GOOG): " << goog_total_volume << std::endl;
std::cout << "Total Volume (MSFT): " << msft_total_volume << std::endl;

Sequential Partitions

A typical operation carried out upon time-series data is to group tuples into buckets (or bins) based upon the time index value. For example grouping data into 2-minute buckets and then performing some kind of operation upon the grouped tuples such as a summation or an average etc.

The strtk::token_grid class provides a method known as sequential_partition. The sequential_partition method requires a Transition Predicate, a Function and optionally a row-range. The Transition Predicate consumes a row and returns true only if the row is in the next partition. All other subsequent consecutive rows until the transition predicate returns a true are said to be in the current partition. Prior to transitioning to a new partition, the function predicate is invoked and provided with the range of rows in the current partition.

The following example takes a simple time-series (time and value), partitions the tuples into groups of Time-Buckets of period length 3 and then proceeds to compute the total sum of each group. The below summarizer class provides provides a Transition Predicate and Function.

class summarizer
{
public:

   enum column_index
   {
      tick_time_column  = 0,
      tick_value_column = 1
   };

   summarizer(std::deque<double>& sum_value)
   : next_tick_time_(0),
     sum_value_(sum_value)
   {}

   // Transition Predicate
   bool operator()(const strtk::token_grid::row_type& row)
   {
      if (row.get<std::size_t>(tick_time_column) >= next_tick_time_)
      {
         next_tick_time_ = row.get<std::size_t>(tick_time_column) + 3;
         return true;
      }
      else
         return false;
   }

   // Function
   bool operator()(const strtk::token_grid& grid,
                   const strtk::token_grid::row_range_type& range)
   {
      double bucket_sum = 0.0;
      if(!grid.accumulate_column(tick_value_column,range,bucket_sum))
      {
         std::cout << "failed to accumulate!" << std::endl;
         return false;
      }
      else
         sum_value_.push_back(bucket_sum);
      return true;
   }

private:

   summarizer& operator=(const summarizer&);

   std::size_t next_tick_time_;
   std::deque<double>& sum_value_;
};

int main()
{
                      //time index, value
   std::string data = "10000,123.456\n"
                      "10001,612.345\n"
                      "10002,561.234\n"
                      "10003,456.123\n"
                      "10004,345.612\n"
                      "10005,234.561\n"
                      "10006,123.456\n";

   strtk::token_grid grid(data,data.size(),",");
   std::deque<double> sum_value;
   summarizer s(sum_value);
   grid.sequential_partition(s,s);
   for(std::size_t i = 0; i < sum_value.size(); ++i)
   {
      std::cout << "bucket[" << i << "] = " << sum_value[i] << std::endl;
   }
}

The expected output is as follows:

  bucket[0] = 1297.035
  bucket[1] = 1036.296
  bucket[2] = 123.456

Extending Delimiter Predicates

As previously mentioned the concept of a delimiter based predicate can lead to some very interesting solutions. A predicate as has been defined so far, with the exception of the offset predicate, has been a stateless entity. Adding the ability to maintain a state based on what the predicate has encountered so far can allow it to behave differently from the simple single and multiple delimiter predicates.

For this example, lets assume a typical command line parsing problem which consists of bracketed groupings and escapable special characters, which can be considered as being dual use delimiters and data. An example string is as follows:

  Data:  abc;"123, mno xyz",i\,jk
  Delimiters:  <space>;,.
  Expected Tokens: 
  Token0 = [abc]
  Token1 = [123, mno xyz]
  Token2 = [i\,jk]

In order to tokenize the above described string, one can create a composite predicate using a multiple char delimiter predicate and some simple state rules. The following is an example of such an extended predicate:

class extended_predicate
{
public:
   extended_predicate(const std::string& delimiters)
   : escape_(false),
     in_bracket_range_(false),
     mdp_(delimiters)
   {}

   inline bool operator()(const unsigned char c) const
   {
      if (escape_)
      {
         escape_ = false;
         return false;
      }
      else if ('\\' == c)
      {
         escape_ = true;
         return false;
      }
      else if ('"' == c)
      {
         in_bracket_range_ = !in_bracket_range_;
         return true;
      }
      else if (in_bracket_range_)
         return false;
      else
         return mdp_(c);
   }

   inline void reset()
   {
     escape_ = false;
     in_bracket_range_ = false;
   }

private:
   mutable bool escape_;
   mutable bool in_bracket_range_;
   mutable strtk::multiple_char_delimiter_predicate mdp_;
};

Usage of the newly defined extended predicate is as follows:

int main()
{
   std::string str = "abc;\"123, mno xyz\",i\\,jk";
   strtk::std_string::token_list_type token_list;
   strtk::split(extended_predicate(".,; "),
                str,
                std::back_inserter(token_list),
                strtk::split_options::compress_delimiters);
   return 0;
}

Strtk Library Dependency

Originally Strtk was mostly compatible with C++ compilers ranging from GCC v2.95 and Visual studio v6.0, however due to upgrades it now requires at least GCC v4.4, Intel C++ Compiler v11 or Visual Studio v9.0 to compile easily. It also requires the BOOST library for its boost::lexical_cast routine for types other than PODs. This dependency can be removed manually. For Visual Studio users, BoostPro provides a free and easy to use installer for the latest BOOST libraries that can be obtained from Here.

Conclusion

Like most things there is a trade-off between performance and usability with the above mentioned tokenizers and parsing methods. The original aim was to provide an interface similar to that of the BOOST tokenizer and split, but to also provide some simplifications that will hopefully provide the user more flexibility in the long run, this is by no means a complete replacement but rather a more simpler way of looking at the problem. Tokenizing a string isn't the most interesting problem one could tackle but it does have its interesting points when one has a few TB of data to process, doing it properly could mean the difference between finishing a simple data processing job today or next month.

You must Sign In to use this message board.
 
 
Per page   
 FirstPrevNext
GeneralNeat
Jerry Evans
5:35 23 Aug '09  
Nice to see that efficiency is still a concern amongst the brethren Smile

Downloading as I write as this may solve part of my BOOST phobia.

How about handling quoted items? e.g:

token , "quoted string contains , multiple ; delimiters" ; token

Many thanks

Jerry
GeneralRe: Neat
Arash Partow
2:20 24 Aug '09  
Jerry Evans wrote:
Nice to see that efficiency is still a concern amongst the brethren Smile


Yes some of us still linger about... Laugh


Jerry Evans wrote:
Downloading as I write as this may solve part of my BOOST phobia.


One source I've found to be quite useful when convincing people to start using boost, is the simple to use windows installer that is provided by boost consulting. It can install for a variety of VS versions, also has all the various combinations (debug, release, multithreaded, static, dynamic etc...), have a try of it:

http://www.boostpro.com/download[^]


Jerry Evans wrote:
How about handling quoted items? e.g:

token , "quoted string contains , multiple ; delimiters" ; token


I've added a section called "Extending Delimiter Predicates" that explains how that can be easily achieved with Strtk.
GeneralRe: Neat
Jerry Evans
6:15 24 Aug '09  
Arash, wonderful, thanks.

Can you supply a bit more detail about the BOOST dependency, espcially how to remove it Smile

Gonna get my 5.

Jerry
GeneralRe: Neat
Arash Partow
13:02 24 Aug '09  
Currently there are 3 dependencies

1. boost::lexical_cast
2. boost::regex
3. boost::random

regarding boost::lexical_cast, the standard solution is to assume that either in the std or anonymous namespaces there exists a streaming operator from an std stream to the type in question. boost::lexical_cast works on this premise. The reason why I thought boost::lexical_cast was an "ok" choice was because of this assumption and because in the future its highly likely someone will optimize the implementation. In the meantime Strtk has specialisations for PODs.

The calls to lexical_cast can be replaced by calls to something like the following:


template<typename T>
void convert_to_type(const std::string& s, T& t)
{
std::stringstream stream;
stream << s;
stream >> t;
}


As for the other two, you can comment out the lines:

#define ENABLE_RANDOM
#define ENABLE_REGEX

Or you can rename the includes to an std tr1 include. Once gcc and VS include tr1 as std, I'm going to remove those dependencies. For now disabling them turns off regex and random related functionality.

You should really investigate having boost around, as it is quite useful and well designed, and until tr1 libs become more prevalent in stock c++ compiler distributions all the nice things like bind, tuple, shared_ptr, regex, random, unordered containers etc can be found there.
GeneralRe: Neat
Jerry Evans
6:38 25 Aug '09  
Arash, thanks again. TR1 is included in VS2008 SP1.

The regex capability is important to me - I assume the TR1 regex is compatible with BOOST?

Agree 110% re BOOST capabilities. What I do not like is having to build N versions of the libraries.
GeneralRe: Neat
Arash Partow
12:57 25 Aug '09  
Jerry Evans wrote:
TR1 is included in VS2008 SP1.


Thats true, but its in the std::tr1 namespace. I believe the process states that at some point tr1 additions it will be promoted to the std namespace.
GeneralBoost.Spirit2.1
OvermindDL1
18:09 17 Aug '09  
I am curious as to if you did any performance measurements compared to Boost.Spirit2.1. It seems like everything example you shown above could be done in 2 or 3 lines of Boost.Spirit2.1. Also, since you are using lexical_cast (which is notoriously slow, although very powerful, but Spirit2.1 can do what lexical_cast does but many many times faster, and there are benchmarks) that will hobble the speed of the implementation as well.

It does appear that Spirit2.1 will be many times faster then the above article, and I could teach you how to use Spirit2.1 as well so you can produce your own benchmarks as proof. What I am curious about is the amount of code required to use the above library. Let me give some examples.

Let's take the first split example above, the above code is:
std::string str = "abc|123|xyz|789";
strtk::std_string::token_list_type token_list;
strtk::split(" |.;?",str,std::back_inserter(token_list));

The Spirit2.1 version would be like this:
using namespace boost::spirit;
std::string str = "abc|123|xyz|789";
std::vector<std::string> token_list; // or any arbitrary type that supports ".insert(element_type(std::string begin, end))"
qi::parse(str.begin(),str.end(),+(ascii::char_-ascii::char_(" |.;?"))%ascii::char_(" |.;?"),token_list);

And since it takes an iterator range (the only reason it looks longer) it can also parse null characters if the incoming iterator supported it (obviously std::string does not though, but an std::vector<char> would).

Now for something longer, lets take this example in the above article (Split-N Tokens):
std::string str = "token1?token2,token3;token4,token5";
strtk::std_string::token_list_type token_list;
const std::size_t token_count = 4;
strtk::split_n(" ,.;?",str,token_count,std::back_inserter(token_list));
strtk::std_string::token_list_type::iterator it = token_list.begin();
while(it != token_list.end())
{
std::cout << "[" << std::string((*it).first,(*it).second) << "]\t";
++it;
}
std::cout << std::endl;

And the Spirit2.1 version:
using namespace boost::spirit;
std::string str = "token1?token2,token3;token4,token5";
std::vector<std::string> token_list;
const std::size_t token_count = 4;
qi::parse(str.begin(),str.end(),repeat(0,token_count)[+(ascii::char_-ascii::char_(" ,.;?"))%ascii::char_(" ,.;?")],token_list);


How about something more complex, how about the 3d-file format parser (I have actually written one of these in Spirit2.1, both for a text format and for a binary format, Spirit2.1 handles both just fine). I am not going to copy it here because of the length, thus the Spirit2.1 version:
using namespace boost::spirit;
typedef boost::tuple<double,double,double> point; // My point 'class', although the class defined in the article would work just fine
typedef boost:;tuple<size_t,size_t,size_t> triangle; // I would think pointers to a point would work better, but whatever, indices here...
int main()
{
using namespace boost;
using namespace boost::spirit;
std::string mesh_file = "mesh.txt";
boost::spirit::file_iterator stream(mesh_file.c_str()); // std file_iterators are not bidirectional_iterators, only forward, so Spirit gives an easy to use wrapper, or make your own, either or
std::deque<point> points;
std::deque<triangle> triangles;
size_t t;

qi::parse(stream.begin(),stream.end(),
// first parse points
uint_[t=_val] >> eol >> repeat(ref(t))[skip(lit(','))[double_>>double_>>double_]%eol]
// then parse triangles
uint_[t=_val] >> eol >> repeat(ref(t))[skip(lit(','))[uint_>>uint_>>uint_]%eol]
,points, triangles);

return 0;
}

And there are other ways to parse it too that may be more clear, but that method is really efficient.

So, I am curious, why not use Spirit2.1 in your back-end and have you library as a wrapper? In fact, if you make a decent wrapper system for a lot of oft-used functionality then you could add your library (boost license of course) to the Spirit2.1 external components section so it would be distributed in the next Boost version (1.41). They are always looking for new things to add to that section.
GeneralRe: Boost.Spirit2.1
Arash Partow
13:17 18 Aug '09  
Hi Tim,

OvermindDL1 wrote:
I am curious as to if you did any performance measurements compared to Boost.Spirit2.1. It seems like everything example you shown above could be done in 2 or 3 lines of Boost.Spirit2.1


I've used what is now called Spirit classic in the past for various kinds of complex expression parsing and eval, I really haven't kept up to date with Spirit as of late due to the huge amount of flux in the codebase, when it settles down a bit, which I'm lead to believe it has, I'll probably have more time to to look as Qi and Proto etc...

Regarding lexical_cast, its predominantly used for string-to and string-from conversions. That said it can be easily removed. Its only called in two places. I've got some replacements for unsigned/signed ints that are quite fast, and I'm finalizing a double/float solution. I'll provide benchmarks in the article later on. But if the numbers which were coming up on the BOOST ML regarding atoi vs Qi comparisons are anything to go by then I believe my method still tops out.

Now back to the code, So from our back and forths do you still agree that it takes 2-3 line of Spirit2.1 to get an effectively similar running and performing result to that of what Strtk provides?

Lets see:
1. for the following Strtk code:

strtk::split(" |.;?",str,std::back_inserter(token_list));


you initially said that the following was equivalent:

qi::parse(str.begin(),str.end(),+ascii::char_%" |.;?",token_list); 


But that turned out to be wrong, after not being able to find much documentation on how to use qi::parse in this manner and messaging the mailing list, we find that the following is correct:

boost::spirit::qi::parse(str.begin(),str.end(),*(boost::spirit::char_-boost::spirit::char_(" .;?|"))%boost::spirit::char_(" .;?|"),token_list);


Thats ok, however when running we found a huge performance hit, but not to worry you had your solution all ready to go - as we thought the difference could be due to the fact that strtk was being "too" efficient in only returning a range, you provided the following equivelent bit of code to the original strtk::split call:


boost::spirit::qi::parse(str.begin(),str.end(),boost::spirit::qi::raw[*(boost::spirit::ascii::char_-boost::spirit::ascii::char_(" .;?|"))]%boost::spirit::ascii::char_(" .;?|"),token_list);


Performance got pretty close, infact somewhere near 79% that of strtk, but not good enough we thought, so you presented the next solution, the reason for this next solution was because you thought it unfair that the strtk::split should use a LUT. Your solution to that was as follows (btw I haven't tried this code so I'm just taking your word):

namespace boost { namespace spirit
{
BOOST_SPIRIT_TERMINAL_EX( chars_without );

struct chars_without_parser
:boost::spirit::qi::primitive_parser<chars_without_parser >
{
chars_without_parser(std::string charList)
{
memset(charArray,true,256);
for(std::string::const_iterator i=charList.begin();i!=charList.end();++i)
charArray[*i]=false;
}

template <typename Context, typename Iterator>
struct attribute
{
typedef iterator_range<Iterator> type;
};

template <typename Iterator, typename Context
, typename Skipper, typename Attribute>
inline bool parse(Iterator& first, Iterator const& last
, Context& context, Skipper const& skipper
, Attribute& attr) const
{
qi::skip_over(first, last, skipper);
Iterator i = first;

while(first!=last&&charArray[*first]) ++first;

qi::detail::assign_to(i, first, attr);

return true;
}

template <typename Context>
boost::spirit::qi::info what(Context& context) const
{
return boost::spirit::qi::info("chars_without-predicate");
}

bool charArray[256];
};
}}

namespace boost { namespace spirit
{
template <typename A0>
struct use_terminal<qi::domain
,terminal_ex<tag::chars_without, fusion::vector1<A0> > >
:mpl::true_ {};

template <>
struct use_lazy_terminal
<qi::domain, tag::chars_without, 1>
:mpl::true_ {};
}}

namespace boost { namespace spirit { namespace qi
{
template <typename Modifiers, typename A0>
struct make_primitive
<terminal_ex<tag::chars_without, fusion::vector1<A0> >
,Modifiers>
{
typedef chars_without_parser result_type;
template <typename Terminal>
result_type operator()(const Terminal& term, unused_type) const
{
return result_type(fusion::at_c<0>(term.args));
}
};
}}}

boost::spirit::qi::parse(str.begin(),str.end(),boost::spirit::chars_without(" .;?|")%boost::spirit::ascii::char_,token_list);


This modification, you claim, brought the performance to within 93% of strtk::split. An excellent effort indeed!

I would expect that someday the above code to be wrapped into an easy to use function, similar to the strtk::split function, and as you mentioned in your comment, placed into the external components section of Spirit.

I would also assume your other example (split_n equivalent) would need similar changes.

Regarding the 3d mesh example, is it possible with Qi to break out of parsing based on an error without using an exception? For example with Strtk you can replace for_each_line_n with for_each_line_conditional_n.



Efficiency is the main concern with Strtk. The areas I work in are mainly concerned with HPC and traditionally people steer clear of highly complex formats for data storage instead opting to use very simple to generate and parse formats. I also wonder what the compile times become when using such code in many different places, as increasing the compile times will definetly hurt the develop-compile-debug cycle (even if that assumes wrapping various bits of the code in shared libs etc).

In anycase I'll most definitely be investigating Qi and SpiritV2 in greater depth, as I really don't want to have to maintain a library like this, but unfortunetly circumstances over the last 7 years have required it to be so. Also one thing to remember, Qi in its current form has only been around for a year or so, So there'll still be a lot of people adopting it, probably not for old stable code but more so for new code etc.

I imagine one day being able to install a stock C++ compiler/standard libs and be able enjoy the variety of libraries and functionality that others who use C# or Java do in their environments. That said its really not that hard these days to get boost et al. installed. But for people starting out it can be quite a daunting task.

Tim thanks for these examples, its always nice to let people know there are other options out there Smile
GeneralRe: Boost.Spirit2.1
OvermindDL1
16:10 18 Aug '09  
I initially saw this article in the CodeProject newsletter in the new articles section, so I did not know it was as old as it was.   Smile

I am quite impressed with the speed of it, although all performance benefit it has still about Spirit is due to the character type being assumed it is ASCII, so it would not work with custom string types or unicode (such as ICU) or what-not, where-as Spirit2.1 is generic to work with it all.   I guess this shows that if you use ASCII only and simple tokenizing, strtk is slighter faster, or if you need generic use, Spirit2.1 is slightly faster.

I am still quite certain that in the 3d-file example that Spirit would still be faster, I think I will give that a try next.

Your library is quite impressive though, it is the first thing I have benchmarked against Spirit2.1 that it beats Spirit2.1 in (although Spirit2.1 was designed for parsing, not tokenizing to be honest, but I bet making a dedicated tokenizing type along the lines similar to what I made that you posted would cause it to have about the same speed then.   Although, Spirit2.1 could use your library as a back-end for tokenizing, it is generic enough to use *anything*.   Wink
GeneralRe: Boost.Spirit2.1
OvermindDL1
18:25 18 Aug '09  
Okay, I created a speed test for the 3d-file format one as well, the code is posted at the bottom.
When parsing with:
5
+1.0,+1.0,+1.0
-1.0,+1.0,-1.0
-1.0,-1.0,+1.0
+1.0,-1.0,-1.0
+0.0,+0.0,+0.0
4
0,1,4
1,2,4
2,3,4
3,1,4

As in your example above, with 10000 loops as in the code below by default, I got this:
[strtk] Token Count: 10000 Total time:   1.5410 Rate: 6489.1396tks/s
[spirit] Token Count: 10000 Total time: 1.1009 Rate: 9083.8705tks/s


When I changed the mesh.txt file to be like the above, but ten times the size, I got these times with 10000 loops:
[strtk] Token Count: 10000 Total time:   6.6937 Rate: 1493.9476tks/s
[spirit] Token Count: 10000 Total time: 2.3889 Rate: 4186.0282tks/s


Changing the mesh.txt to be 100 times the original size, and still with 10000 loops, I get this:
[strtk] Token Count: 10000 Total time:  60.7605 Rate: 164.5807tks/s
[spirit] Token Count: 10000 Total time: 16.1715 Rate: 618.3735tks/s


And I ran each multiple times and the timings are consistent.
So yea, it is as I expected. If you are parsing out an ascii string that is separated by delimiters, then this library rules, but for pretty much anything else, array loading, file parsing, whatever, Spirit is faster as it is a full generalized parser.

struct point
{
double x,y,z;
};

struct triangle
{
std::size_t i0,i1,i2;
};

BOOST_FUSION_ADAPT_STRUCT(point,
(double, x)
(double, y)
(double, z)
);
BOOST_FUSION_ADAPT_STRUCT(triangle,
(std::size_t, i0)
(std::size_t, i1)
(std::size_t, i2)
)

template<typename ContainerType>
struct AddToContainer_point
{
ContainerType &container;
AddToContainer_point(ContainerType &_container):container(_container) {}
inline void operator()(const std::string& line)
{
point p;
if(strtk::parse(line,",",p.x,p.y,p.z))
container.push_back(p);
}
};

template<typename ContainerType>
struct AddToContainer_triangle
{
ContainerType &container;
AddToContainer_triangle(ContainerType &_container):container(_container) {}
inline void operator()(const std::string& line)
{
triangle t;
if(strtk::parse(line,",",t.i0,t.i1,t.i2))
container.push_back(t);
}
};

void strtk_parse(const std::string &mesh_file_name)
{
std::ifstream stream(mesh_file_name.c_str());
std::string s;
// Process points section
std::deque<point> points;
std::getline(stream,s);
std::size_t point_count = strtk::string_to_type_converter<std::size_t>(s);
strtk::for_each_line_n(stream,
point_count,
AddToContainer_point<std::deque<point>>(points));
// Process triangles section
std::deque<triangle> triangles;
std::getline(stream,s);
std::size_t triangle_count = strtk::string_to_type_converter<std::size_t>(s);
strtk::for_each_line_n(stream,
triangle_count,
AddToContainer_triangle<std::deque<triangle>>(triangles));

}

void spirit_parse(const std::string &mesh_file_name)
{
using namespace boost::spirit;
using namespace boost::spirit::qi;
using boost::phoenix::ref;
using boost::spirit::_1;
boost::spirit::classic::file_iterator<> iterBegin(mesh_file_name);
const boost::spirit::classic::file_iterator<> iterEnd(iterBegin.make_end());
std::deque<point> points;
std::deque<triangle> triangles;
size_t t;

parse(iterBegin,iterEnd,
// first parse points
omit[uint_[ref(t)=_1]] >> eol >> repeat(ref(t))[double_>>','>>double_>>','>>double_>>eol]
// then parse triangles
>> omit[uint_[ref(t)=_1]] >> eol >> repeat(ref(t))[uint_>>','>>uint_>>','>>uint_>>eol]
,points,triangles);
}

void main(void)
{
const std::string filename("mesh.txt");
const size_t loopCount = 10000;

{
size_t i=loopCount+1;
util::high_resolution_timer t;
while(--i) strtk_parse(filename);
double d=t.elapsed();
printf("[strtk] Token Count: %d Total time: %8.4f\tRate: %8.4ftks/s\n",loopCount, d,(1.0 * loopCount) / (1.0 * d));
}
{
size_t i=loopCount+1;
util::high_resolution_timer t;
while(--i)spirit_parse(filename);
double d=t.elapsed();
printf("[spirit] Token Count: %d Total time: %8.4f\tRate: %8.4ftks/s\n",loopCount, d,(1.0 * loopCount) / (1.0 * d));
}
}


My compiler does not support C++1x anonymous functions, so I used a functor as well (which internally is implemented the same as a C++1x anonymous function anyway) for the string token library.
For the Spirit code I cannot use iofstream as its iterators are forward only, not bidirectional, and spirit requires bidirectional, so I would either need to load the whole file into a string, or use something that implements iofstream but with bidirectional iterators, and boost has such a class, so I used it, no clue of any performance differences, but since it supports bidirectional it has to do more work to seek, so possibly slower then iofstream, depends on how it is implemented, but that could affect things regardless, however the string tokenizer function could not be rewritten using iterators while keeping the same style. As-is though, the spirit version is a *lot* shorter code needed and is a lot faster.
The timer is a high_resolution_timing class also included with with boost, should have sub-millisecond accuracy on my platform.

Why can the string tokenizer library here not accept iterators, that is very odd, since if it took templated iterators then it could accept anything from strings to file iterators to just about kind of data chunk imaginable. As-is it seems that it can *only* take std::string and std::ifstream, and nothing else, *extremely* limiting...

Do you have another version of the above string tokenizer code that is more efficient, perhaps shorter, so I can run another test? Or is there some other speed test you want me to attempt?
GeneralRe: Boost.Spirit2.1
Arash Partow
20:48 18 Aug '09  
Thanks for getting back so quickly, the test code you've provided does seem very interesting. However I believe a substantial part of the the performance hit with the Strtk version comes from how the getline function is implemented.
Raymond Chen has a nice write-up on this here: http://blogs.msdn.com/oldnewthing/archive/2005/05/13/417183.aspx - I'm not saying this is the case for all implementations (including gnu stl) however its been my experience to be a general case.

Though in short you're right, there should be a more direct stream iterator based version. However I think we can still move forward with the benchmark as the common atomic operation is essentially that of parsing "n" elements (lets keep them as PODs for now), I wrote a simple benchmark, that removes the noise from file io etc and only focuses on parsing.

Obviously the Strtk implementation has the current limit of only being able to parse up to 12 elements. My hope is when variadic template arguments become more common place that I'll redo most of the Strtk API to use them rather than overloads.

For now I believe the following example is a far more general purpose example than the 3D mesh and should provide a better idea of where Qi resides with regards to such parsing tasks.

If you have time would you be able to write a Qi equivalent for this?



---snip---

#include <iostream>
#include <string>
#include <iterator>

#include "strtk.hpp"
#include <boost/tokenizer.hpp>

#ifdef WIN32
#include <windows.h>
#else
#include <sys/time.h>
#include <sys/types.h>
#endif

#ifdef WIN32

class timer
{
public:
timer() { QueryPerformanceFrequency(&clock_frequency); }
void start() { QueryPerformanceCounter(&start_time); }
void stop() { QueryPerformanceCounter(&stop_time); }
double time(){ return (1.0 *(stop_time.QuadPart - start_time.QuadPart)) / (1.0 * clock_frequency.QuadPart); }
private:
LARGE_INTEGER start_time;
LARGE_INTEGER stop_time;
LARGE_INTEGER clock_frequency;
};

#else

class timer
{
public:
void start() { gettimeofday(&start_time, 0); }
void stop() { gettimeofday(&stop_time, 0); }
double time()
{
double diff = (stop_time.tv_sec - start_time.tv_sec) * 1000000.0;
if (stop_time.tv_usec > start_time.tv_usec)
diff += (1.0 * (stop_time.tv_usec - start_time.tv_usec));
else if (stop_time.tv_usec < start_time.tv_usec)
diff -= (1.0 * (start_time.tv_usec - stop_time.tv_usec));

return (diff / 1000000.0);
}
private:
struct timeval start_time;
struct timeval stop_time;
struct timeval clock_frequency;
};

#endif

struct data_block
{
std::string d1;
char d2;
int d3;
unsigned int d4;
double d5;
float d6;
short d7;
unsigned short d8;
bool d9;
unsigned char d10;
long d11;
unsigned long d12;
};


void parse_test01()
{
data_block i1;
i1.d1 = "The1 quick2 brown3 fox4 jumps5 over6 the7 lazy8 dog9";
i1.d2 = 'x';
i1.d3 = -1234;
i1.d4 = 78901;
i1.d5 = 4567.8901;
i1.d6 = 123.456f;
i1.d7 = -16000;
i1.d8 = 15000;
i1.d9 = true;
i1.d10 = 0xEE;
i1.d11 = -737373;
i1.d12 = 333777l;

data_block i2;
i2.d1 = "The9 quick8 brown7 fox6 jumps5 over4 the3 lazy2 dog1";
i2.d2 = 'A';
i2.d3 = -4321;
i2.d4 = 11111;
i2.d5 = 98765.12345;
i2.d6 = 123.456f;
i2.d7 = -11111;
i2.d8 = 13333;
i2.d9 = true;
i2.d10 = 0xA5;
i2.d11 = -737373;
i2.d12 = 333777l;

std::string str1 = "";
std::string str2 = "";

strtk::construct(str1,"|",i1.d1,i1.d2,i1.d3,i1.d4,i1.d5,i1.d6,i1.d7,i1.d8,i1.d9,i1.d10,i1.d11,i1.d12);
strtk::construct(str2,"|",i2.d1,i2.d2,i2.d3,i2.d4,i2.d5,i2.d6,i2.d7,i2.d8,i2.d9,i2.d10,i2.d11,i2.d12);

const std::size_t count = 1000000;
timer t;
t.start();
for(std::size_t i = 0; i < count; ++i)
{
strtk::parse(str1,"|",i1.d1,i1.d2,i1.d3,i1.d4,i1.d5,i1.d6,i1.d7,i1.d8,i1.d9,i1.d10,i1.d11,i1.d12);
strtk::parse(str2,"|",i2.d1,i2.d2,i2.d3,i2.d4,i2.d5,i2.d6,i2.d7,i2.d8,i2.d9,i2.d10,i2.d11,i2.d12);
}
t.stop();
printf("Token Count: %d\tTotal time: %8.4f\tRate: %8.4fprs/s\n",2 * 12 * count,t.time(),(2.0 * count) / (1.0 * t.time()));
}

int main()
{
parse_test01();
return 0;
}

Token Count: 24000000 Total time: 18.3237 Rate: 109148.4462prs/s

---snip---





Note: 2 strings are used because gcc on O3 tries to be a bit "too" smart when optimizing.
GeneralRe: Boost.Spirit2.1
OvermindDL1
4:37 19 Aug '09  
I took your code, copied your parse_test01 function into a new parse_spirit01 function, added a couple namespace things and such at the beginning, and replaced the contents of the for loop with the spirit parser, everything else is kept identical. The function is at the bottom of this post. I then added the call to the function in main right after parse_test01(); so the second printed line will be Spirit's. Now, before I post the time, I did confirm that the struct is fully filled, I even made an altered case that cleared the data_block's first and checked them after by printing them, it is working correctly, and by checking the assembly, nothing is optimized out, it is still working in all its power. Given that, here are the times:
Token Count: 24000000 Total time: 21.2437 Rate: 94145.6219prs/s
Token Count: 24000000 Total time: 3.2008 Rate: 624836.2824prs/s


Remember, the second one is Spirit, which is many many times faster. To be fair, this case you gave is one of Spirit's many strong-points, numerical parsing and advanced stream manipulation, as such you are now seeing its power. Your library is still better for tokenizing. I may make a new Spirit terminal that is specific for very basic parsing splitting, probably something like: parse(str.begin(),str.end(),split(char_(" |!,"))[char_]);
Where the split is the thing I will make, that is a terminal can be a whole grammar, or the tiniest bit of a massive grammar (you should see the C parser made in Spirit, massive). Basically split will take one parameter and use that as the delimiter (and char_ taking a string treats them as individual delimiters, so " |," would match either a space, a | or a comma, but only 1, and something like ",.g-m|" would match either a comma, a period, a | or any character from g to m inclusive, only 1), then the [] part of split would be the things in between the delimeters, for basic tokenizing like your library does this would make Spirit equivalent in speed, if not even faster.

Also, I notice your strtk::construct function. Just as I am using Spirit.QI to parse from a stream into things, Spirit.Karma can parse from things into a stream using the same syntax as Spirit.QI (Spirit2.1 has 4 parts, the parser (QI), the streamer (Karma), the lexer (Lex), and the old ancient unused spirit (Classic)). It might be interesting to compare the speed of your construct function to an equivalent Karma grammar as well if you are curious.

And no, there is no real limit in Spirit parsers, so it can go well beyond 12 (the limit is the compiler template instances, which you can work around anyway by splitting the grammar's across translation units, that was really only a problem back in like gcc2 (not supported anyway) and MSVC2k2 and earlier (also not supported anyway)).



// This macro exposes data_block to Boost.Fusion as a boost::fusion::map, which is directly usable by Spirit2.1
BOOST_FUSION_ADAPT_STRUCT(data_block,
(std::string, d1)
(char, d2)
(int, d3)
(unsigned int, d4)
(double, d5)
(float, d6)
(short, d7)
(unsigned short, d8)
(bool, d9)
(unsigned char, d10)
(long, d11)
(unsigned long, d12)
);


void parse_spirit01()
{
using namespace boost::spirit;
using namespace boost::spirit::qi;
using namespace boost::spirit::iso8859_1;
using boost::phoenix::ref;
using boost::spirit::_1;
uint_parser<bool,2,1,1> bool_;

data_block i1;
i1.d1 = "The1 quick2 brown3 fox4 jumps5 over6 the7 lazy8 dog9";
i1.d2 = 'x';
i1.d3 = -1234;
i1.d4 = 78901;
i1.d5 = 4567.8901;
i1.d6 = 123.456f;
i1.d7 = -16000;
i1.d8 = 15000;
i1.d9 = true;
i1.d10 = 0xEE;
i1.d11 = -737373;
i1.d12 = 333777l;

data_block i2;
i2.d1 = "The9 quick8 brown7 fox6 jumps5 over4 the3 lazy2 dog1";
i2.d2 = 'A';
i2.d3 = -4321;
i2.d4 = 11111;
i2.d5 = 98765.12345;
i2.d6 = 123.456f;
i2.d7 = -11111;
i2.d8 = 13333;
i2.d9 = true;
i2.d10 = 0xA5;
i2.d11 = -737373;
i2.d12 = 333777l;

std::string str1 = "";
std::string str2 = "";

strtk::construct(str1,"|",i1.d1,i1.d2,i1.d3,i1.d4,i1.d5,i1.d6,i1.d7,i1.d8,i1.d9,i1.d10,i1.d11,i1.d12);
strtk::construct(str2,"|",i2.d1,i2.d2,i2.d3,i2.d4,i2.d5,i2.d6,i2.d7,i2.d8,i2.d9,i2.d10,i2.d11,i2.d12);

const std::size_t count = 1000000;
timer t;
t.start();
for(std::size_t i = 0; i < count; ++i)
{
phrase_parse(str1.begin(),str1.end(),
lexeme[raw[(*~char_('|'))]]
>>char_
>>int_
>>uint_
>>double_
>>float_
>>short_
>>ushort_
>>bool_
>>char_ // in iso8859_1 the char_ parser is unsigned
>>long_
>>ulong_
,lit('|'),i1);
phrase_parse(str2.begin(),str2.end(),
lexeme[raw[(*~char_('|'))]]
>>char_
>>int_
>>uint_
>>double_
>>float_
>>short_
>>ushort_
>>bool_
>>char_ // in iso8859_1 the char_ parser is unsigned
>>long_
>>ulong_
,lit('|'),i2);
}
t.stop();
printf("Token Count: %d\tTotal time: %8.4f\tRate: %8.4fprs/s\n",2 * 12 * count,t.time(),(2.0 * count) / (1.0 * t.time()));
}

GeneralRe: Boost.Spirit2.1
Jerry Evans
5:27 23 Aug '09  
Interesting comments about Spirit 2.1 - I have also avoided Boost & friends like the plague due to enormity of installation and complexity of build. Why it was ever allowed to extend out of inlined code is incomprehensible to me. (A sentiment echoed by Joel & friends I think as they shipped a mini-boost with Spirit 1.X)

Has the situation changed for Spirit 2.X ?

Also I think it is fair to point out that due to template metaprogramming Spirit can be incredibly hard to get working - there is no debugger for compile time template instantiation yet.

It seems to me that the solution presented here is small, flexible and comprehensible. Run-time performance can always be tweaked, as previous messages indicate. A fundamentally flawed architecture is much harder to fix Smile

All thoughts welcomed -
GeneralRe: Boost.Spirit2.1
OvermindDL1
13:41 23 Aug '09  
I am not really sure where you get the "enormity of installation and complexity of build" thoughts, but boost is anything but. If you are here on CodeProject I am going to guess you are running windows. In that case, if you want to play with Spirit2.1, first just sync to the boost trunk (http://svn.boost.org/svn/boost/trunk[^]) using TortoiseSVN or whatever SVN program you like. Then right then it is ready to use as far as most of it is concerned. The directory where you downloaded it you can just add to your global include directories. The better way to do it though would be to build it. Building it will do two things, build the boost project that have library parts, and copy all the header-only libraries into an easy to manage destination directory. First you open a command prompt to wherever you downloaded boost and type "bootstrap.bat" and it will prepare everything, then if you just run "bjam" it sets it up in the same directory. I actually like to put it in a different directory so I can always svn-update easily, so I use this instead in a batch file:
bjam --build-type=complete --toolset=msvc-8.0 --without-mpi --without-python --prefix=R:/SDKs/boost/built_head --build-dir=R:/SDKs/boost/build_head install>build-HEAD.log
And it will take about 30-min to an hour to run, you will see nothing printed to the screen because it is redirected to the build-HEAD.log file (or whatever else you wish to name it, or remove the ">build-HEAD.log" part to just dump it to the screen instead), and of course change the directories to what you wish. The build docs tell about all the kind of options you can use too, like if you have ICU it says how you can use that to add unicode support and such. When built it it put in the prefix directory (so R:/SDKs/boost/built_head on my computer). Inside that directory I add the lib and include\boost-1_40 folders to my global libraries and include directories respectively.

Basically:
- Download Boost Trunk over SVN
- Run: bootstrap.bat
- Run: bjam
- Add boost to your global include/library directories.

And that is a heck of a lot easier then many other build systems I have used. Smile
Feel free to email me if you ever need help or anything, but I would recommend to subscribing to the boost-users mailing list, you can get all the help you want from there, it is very active.

Also, Boost also comes with a little tool called "bcp", you can run bcp and pass in the names of boost libraries you use and it will make a self-contained directory of just that library and anything else it depends on, that is how the old Spirit website made that self-contained version, it is literally one command to generate it. Smile

I find Spirit incredibly easy to get working, although honestly, Classic Spirit was a bit of an annoyance at times, Spirit2.1 is very easy, it is designed to be as such. Do not take anything you know about Classic Spirit with you when you try Spirit2.1, it is a whole different creation, and a *VASTLY* better one at that, and it is easy to extend, as stated, it would be possible to even embed this library into a Spirit terminal. Wink

Hmm, perhaps in the next version he could add a single header that does have a Spirit2.1 terminal embedding the library into it, or a couple for different purposes, that would be pretty awesome I have to say.

Oh, almost forgot, they have been writing a Classic Spirit -> Spirit2.1 section in the Spirit2.1 documentation, they just started it a couple of days ago, but it is getting a lot more information on it each day, and if you keep checking this link you will see it update as they make edits: http://svn.boost.org/svn/boost/trunk/libs/spirit/doc/html/spirit/notes/porting_from_spirit_1_8_x.html[^]
GeneralRe: Boost.Spirit2.1
Arash Partow
2:11 24 Aug '09  
Hi Tim,

I've made some updates to Strtk, for the above test I'm now getting:

[strtk] Token Count: 24000000 Total time: 2.13 Rate: 952378.9523prs/s

I'm not sure how much closer Qi can get.


Note: The system I'm using is a thinkpad x61, gcc 4.4 O3
GeneralRe: Boost.Spirit2.1
OvermindDL1
0:46 26 Aug '09  
Sorry it took me so long, been *very* busy recently.
I downloaded the latest version of strtk, recompiled, and ran the test, *very* nice speed increase, I may have to diff the files to see what you did. Smile

Token Count: 24000000 Total time: 5.5049 Rate: 363310.3081prs/s
Token Count: 24000000 Total time: 3.2302 Rate: 619156.0109prs/s


First one is strtk, second is Spirit2.1, very nice closing of the gap there. Smile
And yes, I am still running it on my 1.8ghz, and I still ran it with real-time priority so nothing else running interferes with it. I ran it a few times and the times are consistently within a tenth of a second of what I posted.
Generalexcellent stuff!!!
gediner
12:52 22 Nov '08  
i read the article, i never thought string stuff in c++ could be made so easy. i went through ur library and found some other interesting stuff r u going to write an article about them?

excellent!
GeneralRe: excellent stuff!!!
Arash Partow
23:49 27 Nov '08  
thanks - there are actually a lot of good string processing libraries in c++ out there you just have to take the time to find them, as for more articles I might if I get the time...
GeneralToo complex library, there is a little alternative
Member 4619922
21:46 21 Apr '08  
I worked many years in Delphi and used NumToken and GetToken functions. When I started to work in C++, I rewrote these functions and they seem to work OK. Try this:

int NumToken(const char* String, const char SepChar)
//retrieves number of tokens from String with delimiter SepChar
{
int Result = 0;
char* StrBackup = new char[strlen(String) + 1];
strcpy(StrBackup, String); //strtok changes StrBackup, so we need backup copy of String
char* token = strtok(StrBackup, &SepChar);
while (token != NULL)
{
token = strtok(NULL, &SepChar);
Result++;
}
delete[] StrBackup;
return Result;
}

int GetToken(const char* String, int TokenNum, const char SepChar, char* Token)
//warning: TokenNum is counted from 1, not from 0. Token must have allocated enough of memory
//function copies appropriate token to Token buffer and returns its length.
{
Token[0] = 0;
if (TokenNum > NumToken(String, SepChar) || TokenNum <= 0)
return 0;
char* StrBackup = new char[strlen(String) + 1];
strcpy(StrBackup, String); //strtok changes StrBackup, so we need backup copy of String
int cnt = 1;
char* token = strtok(StrBackup, &SepChar);
while (cnt != TokenNum)
{
token = strtok(NULL, &SepChar);
cnt++;
}
strcpy(Token, token);
delete[] StrBackup;
return(strlen(Token));
}


Sample:
char b[100] = "sdf=5,ydrtg=9,dfgd=4,sdf=5,ydrtg=9,dfgd=4,sdf=5,ydrtg=9,dfgd=4,sdf=5,ydrtg=9,dfgd=4";
char u[20];
char h[20];
for (int i = 1; i <= NumToken(b, ','); i++)
{
GetToken(b, i, ',', u);
GetToken(u, 2, '=', h);
}


Petr Brant
GeneralRe: Too complex library, there is a little alternative [modified]
Arash Partow
1:24 22 Apr '08  
Hi Petr,

Thanks for the code snippet, even though your method does "seem" a little simpler it has about 5 serious problems that range from memory buffer over-runs to non-re-entrant issues.

Also my solution attempts to solve other intrinsic issues with regards to string or string like processing techniques other than simple code syntax. That said, my proposed solution can be just as simple as yours in-fact even simpler and much more safer and far more efficient. Feel free to check out some of the test examples in the download code.

An example you could consider is imagine I have a string of data that is say 5GB in length (that is roughly 10^9 bytes ), where every 4th char is the delimiter using your code how many times will strtok get called?

Another note, did you know that due to variable aliasing, conditionals - functions or simple expressions are always executed regardless of optimizations? - this is also true for Delphi.

In any case thanks again for the example, knowing how not to do something is just as valuable as knowing how to do it. I think people reading this thread will get something positive out of it.

Member 4619922 wrote:
I worked many years in Delphi and used NumToken and GetToken functions. When I started to work in C++, I rewrote these functions and they seem to work OK. Try this:

int NumToken(const char* String, const char SepChar)
//retrieves number of tokens from String with delimiter SepChar
{
int Result = 0;
char* StrBackup = new char[strlen(String) + 1];
strcpy(StrBackup, String); //strtok changes StrBackup, so we need backup copy of String
char* token = strtok(StrBackup, &SepChar);
while (token != NULL)
{
token = strtok(NULL, &SepChar);
Result++;
}
delete[] StrBackup;
return Result;
}

int GetToken(const char* String, int TokenNum, const char SepChar, char* Token)
//warning: TokenNum is counted from 1, not from 0. Token must have allocated enough of memory
//function copies appropriate token to Token buffer and returns its length.
{
Token[0] = 0;
if (TokenNum > NumToken(String, SepChar) || TokenNum <= 0)
return 0;
char* StrBackup = new char[strlen(String) + 1];
strcpy(StrBackup, String); //strtok changes StrBackup, so we need backup copy of String
int cnt = 1;
char* token = strtok(StrBackup, &SepChar);
while (cnt != TokenNum)
{
token = strtok(NULL, &SepChar);
cnt++;
}
strcpy(Token, token);
delete[] StrBackup;
return(strlen(Token));
}

Sample:
char b[100] = "sdf=5,ydrtg=9,dfgd=4,sdf=5,ydrtg=9,dfgd=4,sdf=5,ydrtg=9,dfgd=4,sdf=5,ydrtg=9,dfgd=4";
char u[20];
char h[20];
for (int i = 1; i <= NumToken(b, ','); i++)
{
GetToken(b, i, ',', u);
GetToken(u, 2, '=', h);
}

http://brant.wz.cz/
http://brant.wz.cz/forprogs.htm

Petr Brant


modified on Tuesday, April 22, 2008 6:36 AM

GeneralRe: Too complex library, there is a little alternative
Jörgen Sigvardsson
0:02 26 May '08  
Nailed it... Wink

--
Kein Mitleid Für Die Mehrheit

GeneralRe: Too complex library, there is a little alternative
Arash Partow
12:08 30 May '08  
Smile
GeneralRe: Too complex library, there is a little alternative
ulallala
5:24 7 Jul '08  
GJ Smile
GeneralIMHO, you basically over-engineered your library
Wong Shao Voon
23:34 31 Mar '08  
String Toolkit is everything how a STL string tokenizer should be but that is not how everyone would like to use a string tokenizer.

An real life analogy to this, would be DirectX 9 and XNA. I reserved my vote for this article.
GeneralI must be dense ... this isnt for beginners
Garth J Lancaster
16:34 25 Jan '08  
Arash - happy Australia Day out there in the Heard/Macquarie Islands !!

I value your work, but I must say (and these arnt criticisms, just observations) :-

1) it seems complex - possibly more complex than boost
2) It doesnt indicate that one needs boost to compile it - yet it depends on boost::lexical_cast
3) It wasnt immediately obvious how to switch between compressed/non-compressed delimeters - I see in your code that there's a replace_consecutive option/construct so Im guessing this is it...
4) I was confused in one way between split and tokenize - one of your examples - tokenizer_example11 actually had what I thought I would have been able to do with split, but cant figure out if I can use strtk::range_to_string_back_inserter() with split. The reason is, that usually, I split/tokenize to a vector, and check the length to see if there's enough tokens (keeping empty tokens), and then becuase I only need particular elements from a split/tokenized string/line, I can go vector[x], vector[y] for instance in a sparse fashion. The differentiator between split and tokenize in your notes was split performs the operation in one run - yet in my case Id then have to iterate the pairs, count/convert them to strings, whereas tokenize skips a few steps IF I use std::copy - I guess this comes down to flexibility/usage patterns

sorry, seems like Im grumbling - Im not - your work is great, it just took me a fair while to figure out how to do a relatively simple thing - maybe Im dense - I'd have to be, working on a National Day !!

cheers & thanks, I gave you a 5 anyway, Garth


Last Updated 30 Dec 2009 | Advertise | Privacy | Terms of Use | Copyright © CodeProject, 1999-2010