This article will present the tokenizing and splitting functionality of a simple C++ library called the String Toolkit. Tokenization in the context of string processing, is the method by which a sequence of elements are broken up or fragmented in sub-sequences. The indices in the original sequence that determine such breaks in the sequence are known as delimiters.
There are two types of delimiters, normal or thin delimiters which are of length one element and a thick delimiters which are of length two or more elements. Even though tokenization is primarily used in conjunction with strings, any sequence of types that can be iterated in a linear fashion can be tokenized, examples may be list of integers, a vector of person classes or a map of strings.
To date there have been many attempts to define a "standard" Tokenizer implementation in C++. Of them all the likely candidate might be the implementation in the BOOST library. Regardless proposed implementations should to some extent consider one or more of the following areas: over-all usage patterns, constructions, generality (or is it genericty these days?) of design, performance-efficiency.
Here we are mainly concerned with how easy it is to instantiate the tokenizer and integrate it into existing processing patterns, which most often than not requires integration with C++ STL algorithms and containers. A tokenzier by definition would be classed as a producer, so the question becomes how easy is it for others to consume its output? Another issue is consistency of the definition of a token in the situation where one has consecutive delimiters but is not compressing them - can or should there be such a thing as an empty token? and what do preceding and trailing delimiters mean? and when should they be included as part of the token?
In the context of string tokenizing, the majority of implementations return the token as a new instance of a string. This process requires a string to be created on heap, populated by the sub-string in question from the original sequence, then returned back (some of this may be alleviated by Return Value Optimization RVO). In the case of iterators this is essentially another copy to the caller. Furthermore two kinds of tokens can make this situation worse, they are primarily a large sequence made up of lots of very short tokens or a large sequence made up of lots of very large tokens. The solution is not to return the token as a string but rather as a range and allow the caller to decide how they wish to inspect the token.

This minor change in interface design provides a great deal of flexibility and performance gain.
Most tokenizer implementations concern themselves only with strings, which for the most part is ok, because most things that need tokenizing are strings. However there will be times when one has a sequence of types that may need to be tokenized that aren't strings, hence a tokenizer should be designed in such a way to enable such features, moreover it becomes clear that the most basic of tokenizing algorithms are invariant to the type of the delimiter.
Tokenizing strings range from low frequency inputs such as user input or parsing of simple configuration files to more complex situations such as tokenizing of HTML pages that a web crawler/indexer might do, to parsing of large market-data streams in FIX format. Performance is generally important and can usually be helped along with good usage patterns that encourage reuse of instances, minimal preprocessing and allow for user supplied predicates for the more nasty areas of the process. It should be noted that everything in the proceeding article can be done by purely using the STL, that said, C++'s ability to allow one to skin the proverbial cat in numerous way gives rise to novel solutions that are for the most part not of any practical use other than to showcase ones abilities in using the STL.
The String Toolkit Library (StrTk) provides two common tokenization concepts, a split function and a token iterator. Both these concepts require the user to provide a delimiter predicate and an iterator range over which the process will be carried out.
The tokenization process can be further parametrized by switching between "compressed delimiters" or "no compressed delimiters" mode. This essentially means that consecutive delimiters will be compressed down to one and treated as such. Below are two tables depicting the expected tokens from various types of input. The tables represent no compressed and compressed tokenization processes respectively. The delimiter in this instance is a pipe symbol | and <> denotes an empty token.
| Input | Token List |
a | a |
a|b | a,b |
a||b | a,<>,b |
|a | <>,a |
a| | a,<> |
|a||b | <>,a,<>,b |
||a||b|| | <>,<>,a,<>,b,<>,<> |
| | <>,<> |
|| | <>,<>,<> |
||| | <>,<>,<>,<> |
| Input | Token List |
a | a |
a||b | a,b |
|a||b | <>,a,b |
||a||b|| | <>,a,b,<> |
| | <>,<> |
|| | <>,<> |
||| | <>,<> |
Two forms of delimiters are supported and they are single delimiter predicate and multiple delimiters perdicate otherwise known as SDP and MDP respectively. Essentially an SDP is where there is only one type that can break or fragment the sequence, where as with MDPs there is more than one unique type that can break the sequence. It is possible to represent a SDP using the MDP, however from a performance POV having separate predicates is far more efficient. Additionally for strings based on char or unsigned char (8-bit versions) there is a MDP that has a look-up complexity of O(1) making it greatly more efficient than the basic MDP.
Instantiation requires specialization of type and construction requires an instance of the type.
strtk::single_delimiter_predicate<typename T>(const T& t)
strtk::single_delimiter_predicate<std::string::value_type> predicate('|');
Instantiation requires specialization of type and construction requires a sequence of potential delimiters through a range described by iterators.
strtk::multiple_delimiter_predicate<typename T>(Iterator begin, Iterator end);
std::string str_delimiters = " ,.;:<>'[]{}()_?/'`~!@#$%^&*|-_\"=+";
strtk::multiple_delimiter_predicate<char> mdp1(str_delimiters.begin(),str_delimiters.end());
unsigned int uint_delimiters[5] = {1,10,101,1010,10101};
strtk::multiple_delimiter_predicate<unsigned int> mdp2(uint_delimiters,uint_delimiters + 5);
As previously mentioned tokenization of data need not be limited to strings comprised of chars, but can also be extended to other PODs or complex types. In the above example a predicate used for tokenizing a sequence of unsigned ints is being defined.
Instantiation requires an iterator range based on either unsigned char or char. This delimiter is more efficient than the simple MDP as it has a predicate evaluation of O(1) due to the use of a lookup-table as opposed to O(n) where n is the number of delimiters. Performance increase is seen as more delimiters are used.
strtk::multiple_char_delimiter_predicate(const std::string&)
strtk::multiple_char_delimiter_predicate(const std::string::const_iterator begin,const std::string::const_iterator end)
strtk::multiple_char_delimiter_predicate predicate(" .;?");
The delimeter concept can be extended to the point where the predicate itself can act as a state machine transitioning from state to state based conditions and rules related to the current symbol being processed. This extension can lead to some very interesting parsing capabilities.
This is a function that performs tokenization over an entire sequence
in one go. strtk::split takes a sequence through
a range of iterators or in the case of a string through
a reference to its instance, a delimiter predicate and an output
iterator. It populates the output iterator with the tokens it
extracts. The tokens in this case are std::pairs of iterators for
the sequence.
Split can be used in a "simple - no frills" manner by simply passing the necessary parameters:
std::string str = "abc|123|xyz|789";
strtk::std_string::token_list_type token_list;
strtk::split(" |.;?",str,std::back_inserter(token_list));
strtk::split can also be used in a more explicit
manner whereby the exact type of delimiter predicate can be
specificed by the user:
std::string str = "abc|123|xyz|789";
strtk::std_string::token_list_type token_list;
strtk::single_delimiter_predicate<std::string::value_type> predicate('|');
strtk::split(str,predicate,std::back_inserter(token_list));
strtk::split provides an additional usage option that
allows the user to specify if they would like to either compress the
delimiters and whether or not they would like to include the
delimiter as part of the token range. This enum parameter is called
strtk::split_options and has the following values:
| Split Option | Definition |
strtk::split_options::default_mode | Default options |
strtk::split_options::compress_delimiters | Consecutive delimiters are treated as one |
strtk::split_options::include_delimiters | Delimiters are included in the resulting token range |
strtk::split(str,
predicate,
std::back_inserter(token_list),
strtk::split_options::compress_delimiters | strtk::split_options::include_delimiters);
Another way of using split may be to use the
strtk::multiple_char_delimiter_predicate as follows:
std::string str = "abc?123;xyz.789";
strtk::std_string::token_list_type token_list;
strtk::multiple_char_delimiter_predicate predicate(" .;?");
strtk::split(str,predicate,std::back_inserter(token_list));
The contents of the token_list can be printed out as follows:
strtk::std_string::token_list_type::iterator it = token_list.begin();
while(it != token_list.end())
{
std::string current_token = std::string((*it).first,(*it).second);
std::cout << current_token << "\t";
++it;
}
A natural extension is strtk::split_n. This function
provides the ability to tokenize a sequence up until a specific
number of tokens have been encountered or until there are no more
tokens left. The return value of the strtk::split_n
would be the number of tokens encountered.
std::string str = "token1?token2,token3;token4,token5";
strtk::std_string::token_list_type token_list;
const std::size_t token_count = 4;
strtk::split_n(" ,.;?",str,token_count,std::back_inserter(token_list));
strtk::std_string::token_list_type::iterator it = token_list.begin();
while(it != token_list.end())
{
std::cout << "[" << std::string((*it).first,(*it).second) << "]\t";
++it;
}
std::cout << std::endl;
Another simple variant is the strtk::offset_splitter.
This form of split takes a series of offsets and a iterator range or
string and determines the tokens based on the offsets. This function
can be set to perform a single pass of the offsets or to rotate
them until the range has been entirely consumed. The example
below demonstrates how a string representing time can be tokenized
into its various parts (hour,minutes,seconds,milliseconds)
std::string str = "091011345"; // 9:10:11sec 345ms
const int offset_list[] = {2, // hour
2, // minute
2, // second
3};// ms
const strtk::offset_predicate<5> predicate(offset_list);
strtk::std_string::token_list_type token_list;
strtk::offset_splitter(str.begin(),str.end(),predicate,std::back_inserter(token_list));
strtk::std_string::token_list_type::iterator it = token_list.begin();
while(it != token_list.end())
{
std::cout << "[" << std::string((*it).first,(*it).second) << "] ";
++it;
}
std::cout << std::endl;
Another form of splitter is based on the concept of using regular expressions as the delimiter predicate. Below is a simple example of splitting tokens wrapped in round-brackets.
std::string str = "(12)(345)(6789)(0ijkx)(yz)";
std::list<std::string> token_list;
strtk::split_regex("\\(.*?\\)",str,std::back_inserter(token_list));
std::list::iterator it = token_list.begin();
while(it != token_list.end())
{
std::cout << "[" << (*it) << "]\t";
++it;
}
std::cout << std::endl;
The tokenizer is specialized on a sequence iterator and predicate. It is constructed with a range of iterators for the sequence and a reference to the desired predicate. Defines exist for std::string tokenizers. The tokenizer provides an iteration pattern as a means for accessing the tokens in the sequence.
const unsigned int data_size = 12;
unsigned int data[data_size] = {1,2,3,0,4,5,6,0,7,8,0,9};
strtk::single_delimiter_predicate<unsigned int> predicate(0);
strtk::tokenizer< unsigned int*,strtk::single_delimiter_predicate<unsigned int> > tokenizer(data,data + data_size,predicate);
Adding "true" as the last parameter to the constructor of the tokenizer will enable compressed delimiters mode, by default it is set to false.
strtk::tokenizer< unsigned int*,strtk::single_delimiter_predicate<unsigned int> > tokenizer(data,data + data_size,predicate,true);
Iteration over the tokens is performed as follows:
strtk::tokenizer< unsigned int*,strtk::single_delimiter_predicate<unsigned int > >::iterator it = tokenizer.begin();
while (it != tokenizer.end())
{
std::copy((*it).first,(*it).second,std::ostream_iterator<unsigned int>(std::cout," "));
std::cout << std::endl;
++it;
}
A typical std::string can be tokenized in the following
manner:
std::string str = "abc|123|xyz|789";
strtk::std_string::tokenizer<>::type tokenizer(str,"|");
strtk::std_string::tokenizer<>::type::iterator it = tokenizer.begin();
while (it != tokenizer.end())
{
std::cout << "[" << std::string((*it).first,(*it).second) << "]\t";
++it;
}
std::cout << std::endl;
Another common situation may be tokenizing a sequence of strings, such as the following:
std::string str_list[] = { "abc" , "delimiter" , "ijk" , "delimiter" ,
"mno" , "delimiter" , "rst" , "uvw" ,
"delimiter" , "xyz" };
const std::size_t str_list_size = sizeof(str_list)/sizeof(std::string);
strtk::range_adapter<std::string> range(str_list,str_list_size);
strtk::single_delimiter_predicate<std::string> predicate("delimiter");
strtk::tokenizer< std::string*,strtk::single_delimiter_predicate<std::string> > tokenizer(range.begin(),range.end(),predicate);
strtk::tokenizer< std::string*,strtk::single_delimiter_predicate<std::string> >::iterator it = tokenizer.begin();
while(it != tokenizer.end())
{
std::copy((*it).first,(*it).second,std::ostream_iterator<std::string>(std::cout," "));
std::cout << std::endl;
++it;
}
The following examples demonstrate parsing of PODs such as int and
double into STL compatible sequences (std::vector,
std::deque or std::list).
std::string int_string = "1,2,3,4,5,6,7,8,9,10,11,12,13,14,15";
std::vector<int> int_list;
strtk::parse(int_string,",",int_list);
std::string double_string = "123.456,789.012,345.678,901.234,567.890";
std::deque<double> double_list;
strtk::parse(double_string,",",double_list);
std::string data_string = "a,bc,def,ghij,klmno,pqrstu,vwxyz";
std::list<std::string> string_list;
strtk::parse(data_string,",",string_list);
Similar to what is described above, the following demonstrates parsing of upto "N" elements into an STL compatible sequence.
std::string int_string = "100,200,300,400,500,600,700,800,900";
std::vector<int> int_list;
strtk::parse_n(int_string,",",5,int_list);
std::string double_string = "123.456,789.012,345.678,901.234,567.890";
std::deque<double> double_list;
strtk::parse_n(double_string,",",3,double_list);
std::string data_string = "a,bc,def,ghij,klmno,pqrstu,vwxyz";
std::list<std::string> string_list;
strtk::parse_n(data_string,",",6,string_list);
Note: The return value of the routine strtk::parse_n
indicates how many elements were parse and placed into the specified sequence.
Lets assume you have been given an english text file to process, with the intention of extracting a lexicon from the file.
One solution would be to break the problem down to a line by line tokenization problem. In this case you would define a functional object such as the following which will take the container in which you plan on storing your tokens (words) and a predicate and insert the tokens as strings into your container.
template<typename Container, typename Predicate>
struct parse_line
{
public:
parse_line(Container& container, const Predicate& predicate)
: container_(container),
predicate_(predicate)
{}
inline void operator() (const std::string& str) const
{
strtk::split(str,
predicate_,
strtk::range_to_type_back_inserter(container_),
strtk::split_options::compress_delimiters);
}
private:
Container& container_;
const Predicate& predicate_;
};
The whole thing together would include a process of opening the
file and reading it line by line each time invoking the
parse_line would be as follows:
template<typename Container>
void parse_text(const std::string& file_name, Container& c)
{
std::string delimiters = " ,.;:<>'[]{}()_?/'`~!@#$%^&*|-_\"=+";
strtk::multiple_char_delimiter_predicate predicate(delimiters);
strtk::for_each_line(file_name,parse_line<Container,strtk::multiple_char_delimiter_predicate>(c,predicate));
}
int main()
{
std::string text_file_name = "text.txt";
std::deque< std::string > word_list;
parse_text(text_file_name,word_list);
std::cout << "Token Count: " << word_list.size() << std::endl;
return 0;
}
Now coming back to the original problem, that being the construction of a lexicon. In this case the set of "words" should only contain words of interest. For the sake of simplicity lets define words of interest as being anything other than the following prepositions: as, at, but, by, for, in, like, next, of, on, opposite, out, past, to, up and via. The list definition might be something like the following:
const std::string not_of_interest_list [] = { "as", "at", "but", "by", "for",
"in", "like", "next", "of", "on",
"opposite", "out", "past", "to",
"up", "via", ""};
const std::size_t list_size = sizeof(not_of_interest_list)/sizeof(std::string);
Some minor updates to the parse_line predicate include using the
filter_on_match predicate for determining if the
currently processed token is a preposition and also the invocation of
the range_to_type back_inserter to convert the tokens
from their range iterator representation to a string representation
compatible with the user defined container. For the new implementation
to provide unique words of interest the simplest change that can be
made is to replace the deque used as the container for the word_list
to some kind of 1-1 associative container such as a map or set. The
following is the improved version of the parse_line
predicate:
template<typename Container, typename Predicate>
struct parse_line
{
public:
parse_line(Container& container, const Predicate& predicate)
: container_(container),
predicate_(predicate),
tmp_(""),
tokenizer_(tmp_,predicate_,true),
filter_(not_of_interest_list,not_of_interest_list + list_size,
strtk::range_to_string_back_inserter_iterator<Container>(container_),
true,false)
{}
inline void operator() (const std::string& s)
{
const filter_type& filter = filter_;
strtk::for_each_token(s,tokenizer_,filter);
}
private:
Container& container_;
const Predicate& predicate_;
std::string tmp_;
typename strtk::std_string_tokenizer<Predicate>::type tokenizer_;
strtk::filter_on_match<strtk::range_to_string_back_inserter_iterator<Container>> filter_;
};
The above described predicate can be greatly simplified by using binders and various lamba expressions.
When performing serialization or deserialization of an instance object such as a class or struct, a simple approach one could take would be to take each of the members and convert them into string representations and from those strings construct a larger string delimiting each member with a special character guaranteed not to exist in any of the string representations.
In this example we will assume that there exists a struct which represents the properties of a person, a person struct if you will:
struct person
{
unsigned int id;
std::string name;
unsigned int age;
double height
double weight;
};
The process of populating a person struct would entail having an
instance of a person and the necessary data string. The following is
an example of how this would be done using the
strtk::parse function.
| Token0 | Delimiter0 | Token1 | Delimiter1 | Token2 | Delimiter2 | Token3 | Delimiter3 | Token4 | Unique ID(hex) | | | Name | | | Age | | | Height(m) | | | Weight(kg) |
std::string data = "0xFA37ED12|Rumpelstiltskin|397|1.31|58.7";
person p;
strtk::hex_to_number_sink<unsigned int> hex_sink(p.id); // register id with the hex sink
strtk::parse(data,"|",hex_sink,p.name,p.age,p.height,p.weight);
Batch processing of a text file comprised of one person tuple
per-line is somewhat similar to the previous example. A predicate
is defined that takes a container specialized on the person struct,
and a delimiter predicate with which the
strtk::parse function will be invoked. This
predicate is then instantiated and coupled with the text
file name, is fed to the strtk::for_each_line
processor.
template<typename Container, typename Predicate>
struct parse_person_tuple
{
public:
parse_person_tuple(Container& container)
: container_(container)
{}
inline void operator() (const std::string& s)
{
strtk::hex_to_number_sink<unsigned int> hex_sink(p_.id);
if(strtk::parse(s,"|",hex_sink,p_.name,p_.age,p_.height,p_.weight))
{
container_.push_back(p_);
}
}
private:
Container& container_;
person p_;
};
Bringing the above pieces together to process a file results in the following:
int main()
{
std::string text_file_name = "person_records.txt";
std::deque<person> person_list;
strtk::for_each_line(file_name,predicate_type(person_list));
return 0;
}
Another similar example to the previous, would be parsing a text file of 3D coordinates into a sequence. This can be done easily and cleanly using lambdas and strtk as follows:
struct point
{
double x,y,z;
};
int main()
{
std::string point_data = "point_data.txt";
std::deque<point> points;
point p;
strtk::for_each_line(point_data,
[](const std::string& str)
{
if(strtk::parse(str,",",p.x,p.y,p.z))
points.push_back(p);
});
return 0;
}
Lets assume there is a file format for expressing a mesh. The format consists of a section that defines the unique vertexes in the mesh, and another section that defines the triangles in the mesh as a tuple of three indicies indicating the vertexes used. Each section is preceded by an integer that denotes the number of subsequent elements in that section. An example of such a file is the following:
5
+1.0,+1.0,+1.0
-1.0,+1.0,-1.0
-1.0,-1.0,+1.0
+1.0,-1.0,-1.0
+0.0,+0.0,+0.0
4
0,1,4
1,2,4
2,3,4
3,1,4
Code to parse such a file format is as follows:
struct point
{
double x,y,z;
};
struct triangle
{
std::size_t i0,i1,i2;
};
int main()
{
std::string mesh_file = "mesh.txt";
std::ifstream stream(mesh_file.c_str());
std::string s;
// Process points section
std::deque<point> points;
point p;
std::getline(stream,s);
std::size_t point_count = strtk::string_to_type_converter<std::size_t>(s);
strtk::for_each_line_n(stream,
point_count,
[](const std::string& line)
{
if(strtk::parse(line,",",p.x,p.y,p.z))
points.push_back(p);
});
// Process triangles section
std::deque<triangle> triangles;
triangle t;
std::getline(stream,s);
std::size_t triangle_count = strtk::string_to_type_converter<std::size_t>(s);
strtk::for_each_line_n(stream,
triangle_count,
[](const std::string& line)
{
if(strtk::parse(line,",",t.i0,t.i1,t.i2))
triangles.push_back(t);
});
return 0;
}
A random line of text is to be picked/selected from a user provided text file comprised of lines of varying length, such that the probability of the line being picked is 1/N where N is the number of lines in the text file. There are many trivial solutions to this problem, however if one were to further constrain the problem by indicating the file is very large (TBs) and that the system upon which the solution is to run is very limited memory-wise, most if not all trivial solutions such as storing indexes of line offsets, or reading the entire file into memory etc will be eliminated.
There is however a very elegant solution to this problem of O(n), O(1) time and space complexities respectively, involving scanning the entire file line by line once, at every ith line choosing whether or not to replace the final result line with the current line by sampling a uniform distribution of range [0,1) and performing a replace if and only if the random value is less than 1/i.
The logic of this solution revolves around the fact that the probability for selecting the ith line will be 1/i and the total probability for selecting any of the previous i-1 lines will be 1 -(1/i) or (i-1)/i. Because there are (i-1) lines before the ith line, we divide the previous probability by (i-1), resulting in a probability of 1/i of being selected for any one of lines up to and including the ith line. If the ith line were to be the last line of the text file, then this results in each of the lines having a selection probability of 1/N - simple and sweet, and so to is the following implementation of said solution:
int main(int argc, char* argv[])
{
std::string file_name = argv[1];
std::string line;
std::size_t i = 0;
strtk::uniform_real_rng rng(static_cast<std::size_t>(::time(0));
strtk::for_each_line(file_name,
[&i,&line,&rng](const std::string& s)
{ if (rng() < (1.0 / ++i)) line = s; }
);
std::cout << line << std::endl;
return 0;
}
Note: TAOCP Volume II section 3.4.2 has an in depth discussion about this problem and other similar problems relating to random distributions. Also one should note that the above example has an inefficiency, in that upon every string replace, an actual string is being copied, normally all one needs to do is maintain a file offset to the beginning of the line, not doing this causes slow downs due to continuous memory allocations/reallocations which is made all the worse when large lines are encountered.
Strtk provides a means to easily parse and consume 2D grids of tokens in an efficient and simple manner. A grid is simply defined as a series of rows comprised of tokens. The ith token of a row is grouped with every ith token of all other rows to produce a column. Tokens can be processed as either rows or columns.
An example of a simple token grid, where each numeric value is deemed to be a token:
1.1 | 2.2 | 3.3 | 4.4 | 5.5 |
1.1 | 2.2 | 3.3 | 4.4 | 5.5 |
1.1 | 2.2 | 3.3 | 4.4 | 5.5 |
1.1 | 2.2 | 3.3 | 4.4 | 5.5 |
1.1 | 2.2 | 3.3 | 4.4 | 5.5 |
A token grid can be either passed in via a file or a string buffer. The delimiters to be used for parsing the columns and rows can also be provided, if not provided standard common defaults will be used.
The following demonstrates how each cell in the grid can be access and cast to a specific type.
std::string data = "1,2,3,4,5,6\n"
"7,8,9,0,1,2\n"
"3,4,5,6,7,8\n"
"9,0,1,2,3,4\n"
"5,6,7,8,9,0\n";
strtk::token_grid grid(data,data.size(),",");
for(std::size_t r = 0; r < grid.row_count(); ++r)
{
strtk::token_grid::row_type row = grid.row(r);
for(std::size_t c = 0; c < row.size(); ++c)
{
std::cout << grid.get<int>(r,c) << "\t";
}
std::cout << std::endl;
}
The following example demonstrates how averages over rows and columns of a token grid can be computed:
std::string data = "1.1,1.1,1.1,1.1,1.1,1.1\n"
"2.2,2.2,2.2,2.2,2.2,2.2\n"
"3.3,3.3,3.3,3.3,3.3,3.3\n"
"4.4,4.4,4.4,4.4,4.4,4.4\n"
"5.5,5.5,5.5,5.5,5.5,5.5\n"
"6.6,6.6,6.6,6.6,6.6,6.6\n";
strtk::token_grid grid(data,data.size(),",");
std::vector<double> avg_c(grid.row(0).size(),0.0);
std::vector<double> avg_r(grid.row_count(),0.0);
std::vector<double> tmp(grid.row(0).size(),0.0);
std::fill(avg_c.begin(),avg_c.end(),0.0);
for(std::size_t i = 0; i < grid.row_count(); ++i)
{
grid.row(i).parse(tmp.begin());
std::transform(avg_c.begin(),avg_c.end(),tmp.begin(),avg_c.begin(),std::plus());
avg_r[i] = std::accumulate(tmp.begin(),tmp.end(),0.0) / tmp.size();
}
for(std::size_t i = 0; i < avg_c.size(); avg_c[i++] /= grid.row_count());
std::cout << "Column Averages:\t";
std::copy(avg_c.begin(),avg_c.end(),std::ostream_iterator(std::cout,"\t"));
std::cout << "\n";
std::cout << "Row Averages:\t";
std::copy(avg_r.begin(),avg_r.end(),std::ostream_iterator(std::cout,"\t"));
std::cout << "\n";
The original intent of the token grid was to support simple and efficient processing of data tuples. The following example demonstrates a simple summation of volume based on OHLC data.
//Date,Symbol,Open,Close,High,Low,Volume
std::string market_data = "20090701,GOOG,424.2000,418.9900,426.4000,418.1500,2310768\n"
"20090701,MSFT,24.0500,24.0400,24.3000,23.9600,54915127\n"
"20090702,GOOG,415.4100,408.4900,415.4100,406.8100,2517630\n"
"20090702,MSFT,23.7600,23.3700,24.0400,23.2100,65427699\n"
"20090703,GOOG,408.4900,408.4900,408.4900,408.4900,0\n"
"20090703,MSFT,23.3700,23.3700,23.3700,23.3700,0\n"
"20090706,GOOG,406.5000,409.6100,410.6400,401.6600,2262557\n"
"20090706,MSFT,23.2100,23.2000,23.2800,22.8700,49207638\n"
"20090707,GOOG,408.2400,396.6300,409.1900,395.9801,3260307\n"
"20090707,MSFT,23.0800,22.5300,23.1400,22.4600,52842412\n"
"20090708,GOOG,400.0000,402.4900,406.0000,398.0600,3441854\n"
"20090708,MSFT,22.3100,22.5600,22.6900,2200000,73023306\n"
"20090709,GOOG,406.1200,410.3900,414.4500,405.8000,3275816\n"
"20090709,MSFT,22.6500,22.4400,22.8100,22.3700,46981174\n"
"20090710,GOOG,409.5700,414.4000,417.3700,408.7000,2929559\n"
"20090710,MSFT,22.1900,22.3900,22.5400,22.1500,43238698\n";
strtk::token_grid grid(market_data,market_data.size(),",");
std::size_t goog_total_volume = 0; //should be long long.
std::size_t msft_total_volume = 0;
static const std::size_t volume_column = 6;
static const std::size_t symbol_column = 1;
grid.accumulate_column(volume_column,
[](const strtk::token_grid::row_type& row) -> bool
{
return "GOOG" == row.get<std::string>(symbol_column);
},
goog_total_volume);
grid.accumulate_column(volume_column,
[](const strtk::token_grid::row_type& row) -> bool
{
return "MSFT" == row.get<std::string>(symbol_column);
},
msft_total_volume);
std::cout << "Total Volume (GOOG): " << goog_total_volume << std::endl;
std::cout << "Total Volume (MSFT): " << msft_total_volume << std::endl;
A typical operation carried out upon time-series data is to group tuples into buckets (or bins) based upon the time index value. For example grouping data into 2-minute buckets and then performing some kind of operation upon the grouped tuples such as a summation or an average etc.
The strtk::token_grid class provides a method known as
sequential_partition. The sequential_partition
method requires a Transition Predicate, a Function
and optionally a row-range. The Transition Predicate
consumes a row and returns true only if the row is
in the next partition. All other subsequent consecutive rows until
the transition predicate returns a true are said to be
in the current partition. Prior to transitioning to a new partition,
the function predicate is invoked and provided with the range of
rows in the current partition.
The following example takes a simple time-series (time and
value), partitions the tuples into groups of Time-Buckets
of period length 3 and then proceeds to compute the total sum of each
group. The below summarizer class provides provides a
Transition Predicate and Function.
class summarizer
{
public:
enum column_index
{
tick_time_column = 0,
tick_value_column = 1
};
summarizer(std::deque<double>& sum_value)
: next_tick_time_(0),
sum_value_(sum_value)
{}
// Transition Predicate
bool operator()(const strtk::token_grid::row_type& row)
{
if (row.get<std::size_t>(tick_time_column) >= next_tick_time_)
{
next_tick_time_ = row.get<std::size_t>(tick_time_column) + 3;
return true;
}
else
return false;
}
// Function
bool operator()(const strtk::token_grid& grid,
const strtk::token_grid::row_range_type& range)
{
double bucket_sum = 0.0;
if(!grid.accumulate_column(tick_value_column,range,bucket_sum))
{
std::cout << "failed to accumulate!" << std::endl;
return false;
}
else
sum_value_.push_back(bucket_sum);
return true;
}
private:
summarizer& operator=(const summarizer&);
std::size_t next_tick_time_;
std::deque<double>& sum_value_;
};
int main()
{
//time index, value
std::string data = "10000,123.456\n"
"10001,612.345\n"
"10002,561.234\n"
"10003,456.123\n"
"10004,345.612\n"
"10005,234.561\n"
"10006,123.456\n";
strtk::token_grid grid(data,data.size(),",");
std::deque<double> sum_value;
summarizer s(sum_value);
grid.sequential_partition(s,s);
for(std::size_t i = 0; i < sum_value.size(); ++i)
{
std::cout << "bucket[" << i << "] = " << sum_value[i] << std::endl;
}
}
The expected output is as follows:
bucket[0] = 1297.035
bucket[1] = 1036.296
bucket[2] = 123.456
As previously mentioned the concept of a delimiter based predicate can lead to some very interesting solutions. A predicate as has been defined so far, with the exception of the offset predicate, has been a stateless entity. Adding the ability to maintain a state based on what the predicate has encountered so far can allow it to behave differently from the simple single and multiple delimiter predicates.
For this example, lets assume a typical command line parsing problem which consists of bracketed groupings and escapable special characters, which can be considered as being dual use delimiters and data. An example string is as follows:
Data: abc;"123, mno xyz",i\,jk
Delimiters: <space>;,.
Expected Tokens:
Token0 = [abc]
Token1 = [123, mno xyz]
Token2 = [i\,jk]
In order to tokenize the above described string, one can create a composite predicate using a multiple char delimiter predicate and some simple state rules. The following is an example of such an extended predicate:
class extended_predicate
{
public:
extended_predicate(const std::string& delimiters)
: escape_(false),
in_bracket_range_(false),
mdp_(delimiters)
{}
inline bool operator()(const unsigned char c) const
{
if (escape_)
{
escape_ = false;
return false;
}
else if ('\\' == c)
{
escape_ = true;
return false;
}
else if ('"' == c)
{
in_bracket_range_ = !in_bracket_range_;
return true;
}
else if (in_bracket_range_)
return false;
else
return mdp_(c);
}
inline void reset()
{
escape_ = false;
in_bracket_range_ = false;
}
private:
mutable bool escape_;
mutable bool in_bracket_range_;
mutable strtk::multiple_char_delimiter_predicate mdp_;
};
Usage of the newly defined extended predicate is as follows:
int main()
{
std::string str = "abc;\"123, mno xyz\",i\\,jk";
strtk::std_string::token_list_type token_list;
strtk::split(extended_predicate(".,; "),
str,
std::back_inserter(token_list),
strtk::split_options::compress_delimiters);
return 0;
}
Originally Strtk was mostly compatible with C++ compilers
ranging from GCC v2.95 and Visual studio v6.0, however due to
upgrades it now requires at least GCC v4.4, Intel C++ Compiler v11
or Visual Studio v9.0 to compile easily. It also requires the
BOOST library for its boost::lexical_cast routine
for types other than PODs. This dependency can be removed
manually. For Visual Studio users, BoostPro
provides a free and easy to use installer for the latest BOOST
libraries that can be obtained from Here.
Like most things there is a trade-off between performance and usability with the above mentioned tokenizers and parsing methods. The original aim was to provide an interface similar to that of the BOOST tokenizer and split, but to also provide some simplifications that will hopefully provide the user more flexibility in the long run, this is by no means a complete replacement but rather a more simpler way of looking at the problem. Tokenizing a string isn't the most interesting problem one could tackle but it does have its interesting points when one has a few TB of data to process, doing it properly could mean the difference between finishing a simple data processing job today or next month.
| You must Sign In to use this message board. | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||