Introduction
This article will present the tokenizing and splitting functionality
of a simple C++ library called the String Toolkit. Tokenization in
the context of string processing, is the method by which a sequence
of elements are broken up or fragmented in sub-sequences called tokens.
The indices in the original sequence that determine such breaks in the
sequence are known as delimiters.
There are two types of delimiters, normal or thin delimiters which
are of length one element and a thick delimiters which are of length
two or more elements. Even though tokenization is primarily used in
conjunction with strings, any sequence of types that can be iterated
in a linear fashion can be tokenized, examples may be list of
integers, a vector of person classes or a map of strings.
Another Tokenizer?
To date there have been many attempts to define a
"standard" Tokenizer implementation in C++. Of them all the
likely candidate might be the implementation in the Boost
library. Regardless proposed implementations should to some extent
consider one or more of the following areas: over-all usage
patterns, constructions, generality (or is it genericty these
days?) of design, performance-efficiency.
1. Over-all usage patterns
This requirement is concerned with how easy it is to instantiate the
tokenizer and integrate it into existing processing patterns, which
most often than not requires integration with C++ STL algorithms and
containers. A tokenzier by definition would be classed as a producer,
so the question becomes how easy is it for others to consume its
output? Another issue is consistency of the definition of a token in
the situation where one has consecutive delimiters but is not
compressing them - can or should there be such a thing as an
empty token? and what do preceding and trailing delimiters mean? and
when should they be included as part of the token?
2. Constructions
In the context of string tokenizing, the majority of implementations
return the token as a new instance of a string. This process requires
a string to be created on heap, populated by the sub-string in
question from the original sequence, then returned back (some of
this may be alleviated by Return Value Optimization RVO).
In the case of iterators this is essentially another copy to the
caller. Furthermore two kinds of tokens can make this situation
worse, they are primarily a large sequence made up of lots of very
short tokens or a large sequence made up of lots of very large
tokens. The solution is not to return the token as a string but
rather as a range and allow the caller to decide how they wish to
inspect and further manipulate the token.

This minor change in interface design provides a great deal of
flexibility and performance gain.
3. Generality(Genericity) of design
Most tokenizer implementations concern themselves only with strings,
which for the most part is ok, because most things that need
tokenizing are strings. However there will be times when one has a
sequence of types that may need to be tokenized that aren't strings,
hence a tokenizer should be designed in such a way to enable such
features, moreover it becomes clear that the most basic of tokenizing
algorithms are invariant to the type of the delimiter.
4. Performance and Efficiency
Tokenizing strings range from low frequency inputs such as user input
or parsing of simple configuration files to more complex situations
such as tokenizing of HTML pages that a web crawler/indexer might do,
to parsing of large market-data streams in FIX format. Performance is
generally important and can usually be helped along with good usage
patterns that encourage reuse of instances, minimal preprocessing and
allow for user supplied predicates for the more nasty areas of the
process. It should be noted that everything in the proceeding article
can be done by purely using the STL - that said, C++'s ability to
allow one to skin the proverbial cat in numerous way gives
rise to novel solutions that are for the most part not of any
practical use other than to showcase ones abilities in using the
STL.
Getting started
The String Toolkit Library (StrTk) provides two common tokenization
concepts, a split function and a token iterator. Both these concepts
require the user to provide a delimiter predicate and an iterator
range over which the process will be carried out.
The tokenization process can be further parametrized by switching
between "compressed delimiters" or "no compressed delimiters" mode.
This essentially means that consecutive delimiters will be compressed
down to one and treated as such. Below are two tables depicting the
expected tokens from various types of input. The tables represent no
compressed and compressed tokenization processes respectively. The
delimiter in this instance is a pipe symbol | and <> denotes
an empty token.
No Compressed Delimiters
| Input | Token List |
a | a |
a|b | a,b |
a||b | a,<>,b |
|a | <>,a |
a| | a,<> |
|a||b | <>,a,<>,b |
||a||b|| | <>,<>,a,<>,b,<>,<> |
| | <>,<> |
|| | <>,<>,<> |
||| | <>,<>,<>,<> |
Compressed Delimiters
| Input | Token List |
a | a |
a||b | a,b |
|a||b | <>,a,b |
||a||b|| | <>,a,b,<> |
| | <>,<> |
|| | <>,<> |
||| | <>,<> |
Delimiters
Two forms of delimiters are supported and they are single
delimiter predicate and multiple delimiters predicate otherwise known
as SDP and MDP respectively. Essentially an SDP is
where there is only one type that can break or fragment the
sequence, where as with MDPs there is more than one unique type that
can break the sequence. It is possible to represent a SDP using
the MDP, however from a performance POV having separate
predicates is far more efficient. Additionally for strings based
on char or unsigned char (8-bit versions) there is a MDP that has a
look-up complexity of O(1) making it greatly more efficient than the
basic MDP.
Single Delimiter Predicate
Instantiation requires specialization of type and construction
requires an instance of the type.
strtk::single_delimiter_predicate<typename T>(const T& t)
strtk::single_delimiter_predicate<std::string::value_type> predicate('|');
Multiple Delimiter Predicate
Instantiation requires specialization of type and construction
requires a sequence of potential delimiters through a range
described by iterators.
strtk::multiple_delimiter_predicate<typename T>(Iterator begin, Iterator end);
std::string str_delimiters = " ,.;:<>'[]{}()_?/'`~!@#$%^&*|-_\"=+";
strtk::multiple_delimiter_predicate<char> mdp1(str_delimiters.begin(),str_delimiters.end());
unsigned int uint_delimiters[5] = {1,10,101,1010,10101};
strtk::multiple_delimiter_predicate<unsigned int> mdp2(uint_delimiters,uint_delimiters + 5);
As previously mentioned tokenization of data need not be limited to
strings comprised of chars, but can also be extended to other PODs or
complex types. In the above example a predicate used for tokenizing a
sequence of unsigned ints is being defined.
Multiple Char Delimiter Predicate
Instantiation requires an iterator range based on either unsigned
char or char. This delimiter is more efficient than the simple MDP as
it has a predicate evaluation of O(1) due to the use
of a lookup-table as opposed to O(n) where n is the
number of delimiters. Performance increase is seen as more
delimiters are used.
strtk::multiple_char_delimiter_predicate(const std::string&)
strtk::multiple_char_delimiter_predicate(const std::string::const_iterator begin,const std::string::const_iterator end)
strtk::multiple_char_delimiter_predicate predicate(" .;?");
The delimeter concept can be extended to the point where the
predicate itself can act as a state machine transitioning from
state to state based on conditions and rules related to the current
symbol being processed. This simple extension can lead to some very
interesting parsing capabilities.
Split
This is a function that performs tokenization over an entire sequence
in one go. strtk::split takes a sequence through
a range of iterators or in the case of a string through
a reference to its instance, a delimiter predicate and an output
iterator or otherwise known as a type sink. It populates the output iterator with the tokens it
extracts. The tokens in this case are std::pairs of iterators for
the sequence.

Split can be used in a "simple - no frills" manner by simply
passing the necessary parameters:
std::string str = "abc|123|xyz|789";
strtk::std_string::token_list_type token_list;
strtk::split(" |.;?",str,std::back_inserter(token_list));
strtk::split can also be used in a more explicit
manner whereby the exact type of delimiter predicate can be
specificed by the user:
{
std::string str = "abc|123|xyz|789";
strtk::std_string::token_list_type token_list;
strtk::single_delimiter_predicate<std::string::value_type> predicate('|');
strtk::split(predicate,str,std::back_inserter(token_list));
}
{
std::string data = "abc|123|xyz|789";
std::deque<std::string> token_list;
strtk::split([](const char c)
{
return '|' == c;
},
data,
strtk::range_to_type_back_inserter(token_list));
}
strtk::split provides an additional usage option that
allows the user to specify if they would like to either compress the
delimiters and whether or not they would like to include the
delimiter as part of the token range. This enum parameter is called
strtk::split_options and has the following values:
| Split Option | Definition |
strtk::split_options::default_mode | Default options |
strtk::split_options::compress_delimiters | Consecutive delimiters are treated as one |
strtk::split_options::include_1st_delimiter | The first delimiter is included in the resulting token range |
strtk::split_options::include_delimiters | All delimiters are included in the resulting token range |
The simple example below demonstrates a split that will occur over
a string given a predicate where the provided split options indicate
that consecutive delimiters will be treated as one and also all delimiters
encountered after each token will also be included in the token up until
the next token begins.
strtk::split(predicate,
str,
std::back_inserter(token_list),
strtk::split_options::compress_delimiters | strtk::split_options::include_delimiters);
Another way of using split may be to use the
strtk::multiple_char_delimiter_predicate as follows:
std::string str = "abc?123;xyz.789";
strtk::std_string::token_list_type token_list;
strtk::multiple_char_delimiter_predicate predicate(" .;?");
strtk::split(predicate,str,std::back_inserter(token_list));
The contents of the token_list can be printed out as follows:
strtk::std_string::token_list_type::iterator itr = token_list.begin();
while (token_list.end() != itr)
{
std::cout << (*itr) << '\t';
++itr;
}
Split N-Tokens
A natural extension of strtk::split is strtk::split_n.
This function provides the ability to tokenize a sequence up until a specific
number of tokens have been encountered or until there are no more tokens left.
The return value of the strtk::split_n would be the number of tokens
encountered.
std::string data = "token1?token2,token3;token4,token5";
strtk::std_string::token_list_type token_list;
const std::size_t token_count = 4;
const std::string delimiters (" ,.;?");
strtk::split_n(delimiters,
data,
token_count,
std::back_inserter(token_list));
strtk::std_string::token_list_type::iterator itr = token_list.begin();
while (token_list.end() != itr)
{
std::cout << "[" << (*itr) << "]\t";
++itr;
}
std::cout << std::endl;
Offset Splitter
Another simple variant is the strtk::offset_splitter.
This form of split takes a series of offsets and an iterator range or
string and determines the tokens based on the offsets. This function
can be set to perform a single pass of the offsets or to rotate
them until the range has been completely traversed. The example
below demonstrates how a string representing date and time can be
tokenized into its constituent parts (year, month, day, hour, minutes
,seconds,milliseconds)
std::string time_data = "20000101091011345"; const std::size_t offset_list_size = 7;
const int offset_list[offset_list_size] = {
4, 2, 2, 2, 2, 2, 3 };
const strtk::offset_predicate<offset_list_size> predicate(offset_list);
strtk::std_string::token_list_type token_list;
strtk::offset_splitter(time_data,predicate,std::back_inserter(token_list));
strtk::std_string::token_list_type::iterator itr = token_list.begin();
while (token_list.end() != itr)
{
std::cout << "[" << (*itr) << "] ";
++itr;
}
std::cout << std::endl;
Split Regex
Another form of splitter is based on the concept of using regular
expressions as the delimiter predicate. Below is a simple example
of splitting tokens wrapped in round-brackets.
std::string str = "(12)(345)(6789)(0ijkx)(yz)";
std::list<std::string> token_list;
strtk::split_regex("\\(.*?\\)",
s,
std::back_inserter(token_list),
strtk::regex_match_mode::match_1);
std::list<std::string>::iterator itr = token_list.begin();
while (token_list.end() != itr)
{
std::cout << "[" << (*itr) << "]\t";
++itr;
}
std::cout << std::endl;
Note: The parameter regex_match_mode represents the capture of the
marked sub-expression in the current match. By default it is
strtk::regex_match_mode::match_all which in this case
would provide the entire match including the bounding pattern,
eg: Token3 would be (0ijkx). However in the above example
we are only interested in the sub-expression between the round-brackets,
hence strtk::regex_match_mode::match_1 is used resulting
in Token3 being 0ijkx.
The following examples demonstrate the use of strtk::split_regex
and strtk::split_regex_n routines in extracting specific
types of data - in this case the POD types int and double.
int main()
{
{
std::string data = "a 1^bc,0023| def?gh(4567ijk)-89 10l,m$n-op+123r@st+3u v*w2y56yz+";
std::deque<int> int_list;
strtk::split_regex("([+-]?([\\d]+))",
data,
strtk::range_to_type_back_inserter(int_list),
strtk::regex_match_mode::match_1);
}
{
std::string data = "ab$c1.1?d-2.2ef#ghi+3.3%(123.456)!&*-7.89E+12@^=";
std::deque<double> double_list;
strtk::split_regex(strtk::ieee754_expression,
data,
strtk::range_to_type_back_inserter(double_list),
strtk::regex_match_mode::match_1);
}
{
std::string data = "a 1^bc,0023| def?gh(4567ijk)-89 10l,m$n-op+123r@st+3u v*w2y56yz+";
std::deque<int> int_list;
strtk::split_regex_n("([+-]?([\\d]+))",
data,
3,
strtk::range_to_type_back_inserter(int_list),
strtk::regex_match_mode::match_1);
}
{
std::string data = "ab$c1.1?d-2.2ef#ghi+3.3%(123.456)!&*-7.89E+12@^=";
std::deque<double> double_list;
strtk::split_regex_n(strtk::ieee754_expression,
data,
4,
strtk::range_to_type_back_inserter(double_list),
strtk::regex_match_mode::match_1);
}
return 0;
}
The following table describes the various regex patterns built
into StrTk which can be used seemlessly with the strtk::split_regex
and strtk::split_regex_n routines.
| Regex | Definition |
strtk::uri_expression | URI, URL address e.g.: http://www.example.com, domain.example.net/index.html |
strtk::email_expression | E-mail address e.g.: some.one@example.com |
strtk::ip_expression | IPv4 address e.g.: 192.168.0.1, 127.0.0.1 |
strtk::ieee754_expression | Floating point value e.g.: 1.1, 1.234e-123, -1.0001E+10, 0.1234 |
Tokenizer
The tokenizer is specialized on a sequence iterator and predicate. It
is constructed with a range of iterators for the sequence and a
reference to the desired predicate. Definitions exist for std::string
tokenizers. The tokenizer provides an iteration pattern as a means
for accessing the tokens in the sequence.
const unsigned int data_size = 12;
unsigned int data[data_size] = {1,2,3,0,4,5,6,0,7,8,0,9};
strtk::single_delimiter_predicate<unsigned int> predicate(0);
typedef strtk::tokenizer<unsigned int*,strtk::single_delimiter_predicate<unsigned int> > tokenizer_type;
tokenizer_type tokenizer(data,data + data_size,predicate);
Similar to that of strtk::split,
strtk::tokenizer provides tokenizing options that are
passed in during construction. Below is a table depicting said
options:
| Tokenize Option | Definition |
strtk::tokenize_options::default_mode | Default options |
strtk::tokenize_options::compress_delimiters | Consecutive delimiters are treated as one |
strtk::tokenize_options::include_1st_delimiter | The first delimiter is included in the resulting token range |
strtk::tokenize_options::include_delimiters | All delimiters are included in the resulting token range |
typedef strtk::tokenizer<unsigned int*,strtk::single_delimiter_predicate<unsigned int> > tokenizer_type
tokenizer_type tokenizer(data, data + data_size,
predicate,
strtk::tokenize_options::compress_delimiters |
strtk::tokenize_options::include_1st_delimiter);
Furthermore, Iteration over the tokens of strtk::tokenizer
is performed as follows:
tokenizer_type::iterator itr = tokenizer.begin();
while (tokenizer.end() != itr)
{
std::copy((*itr).first,(*itr).second,std::ostream_iterator<unsigned int>(std::cout," "));
std::cout << std::endl;
++itr;
}
A typical std::string can be tokenized in the following
manner:
std::string str = "abc|123|xyz|789";
strtk::std_string::tokenizer<>::type tokenizer(str,"|");
strtk::std_string::tokenizer<>::type::iterator itr = tokenizer.begin();
while (tokenizer.end() != itr)
{
std::cout << "[" << (*itr) << "]\t";
++itr;
}
std::cout << std::endl;
Another common situation may be tokenizing a sequence of strings,
such as the following:
const std::string str_list[] = { "abc" , "delimiter" , "ijk" , "delimiter" ,
"mno" , "delimiter" , "rst" , "uvw" ,
"delimiter" , "xyz" };
const std::size_t str_list_size = sizeof(str_list) / sizeof(std::string);
strtk::range_adapter<std::string> range(str_list,str_list_size);
strtk::single_delimiter_predicate<std::string> predicate("delimiter");
typedef strtk::tokenizer<std::string*,strtk::single_delimiter_predicate<std::string> > tokenizer_type;
tokenizer_type tokenizer(range.begin(),range.end(),predicate);
tokenizer_type::iterator itr = tokenizer.begin();
while (tokenizer.end() != itr)
{
std::copy((*itr).first,(*itr).second,std::ostream_iterator<std::string>(std::cout," "));
std::cout << std::endl;
++itr;
}
The Parse Routine
Till now the mentioned routines worked specifically with tokens, or in
other words ranges of characters. The responsibility of managing the
tokens and converting the tokens to user specified types was done
manually via "range to type" oriented back inserters and converters.
This can be a bit cumbersome and as such StrTk provides a series of
helper routines called strtk::parse. Parse takes an
std::string representing a tuple of delimited values as
input data, a delimiter set, and a series of references to variables
that are to be populated with the values from the parsed tokens. The
following diagram demonstrates the flow of data, tokens and the
corresponding relationships and conversions between each of the
parameters.

Note: strtk::parse returns a boolean value of true
upon successful parsing and false for all other results. Situations
that cause strtk::parse to fail include:
- Insufficient number of tokens for the given number of variables
- Conversion failure from token range to variable type
- Empty or null token(s)
Some Simple Parse Examples
strtk::parse can take an arbitary number of variable
references. The code below demonstrates the basic usage of strtk::parse
taking various number of parameters.
std::string data = "abcde,-1234|567.890#1.1f";
std::string delimiters = ",|#";
std::string var0;
int var1;
double var2;
float var3;
strtk::parse(data,delimiters,var0);
strtk::parse(data,delimiters,var0,var1);
strtk::parse(data,delimiters,var0,var1,var2);
strtk::parse(data,delimiters,var0,var1,var2,var3);
The following examples demonstrate parsing of PODs such as int and
double into STL compatible sequences (std::vector,
std::deque, std::list, std::set,
std::queue, std::stack and
std::priority_queue).
std::string int_string = "1 +2 -3 4 +5 -6 7 +8 -9 10 +11 -12 13 +14 -15";
std::vector<int> int_vector;
strtk::parse(int_string," ",int_vector);
std::string double_string = "-123.456,789.012,-345.678,901.234,+567.890";
std::deque<double> double_deque;
strtk::parse(double_string,",",double_deque);
std::string data_string = "a,bc,def,ghij,klmno,pqrstu,vwxyz";
std::list<std::string> string_list;
strtk::parse(data_string,",",string_list);
std::string set_string = "a|bc#def|ghij#klmno|pqrstu#vwxyz";
std::set<std::string> string_set;
strtk::parse(set_string,"#|",string_set);
std::string queue_string = "value1,value2,value3,value4,value5";
std::queue<std::string> string_queue;
strtk::parse(queue_string,",|",string_queue);
std::string stack_string = "value1|value2,value3|value4,value5";
std::stack<std::string> string_stack;
strtk::parse(stack_string,",|",string_stack);
std::string priority_queue_string = "value1|value2,value3#value4,value5";
std::priority_queue<std::string> string_priority_queue;
strtk::parse(priority_queue_string,",|#",string_priority_queue);
Similar to what is described above, the following demonstrates parsing of upto "N" elements into an
STL compatible sequence.
std::string int_string = "100,-200,+300,400,-500,+600,700,-800,+900";
std::vector<int> int_vector;
strtk::parse_n(int_string,",",5,int_vector);
std::string double_string = "123.456,+789.012,345.678,-901.234,567.890";
std::deque<double> double_deque;
strtk::parse_n(double_string,",",3,double_deque);
std::string data_string = "a,bc,def,ghij,klmno,pqrstu,vwxyz";
std::list<std::string> string_list;
strtk::parse_n(data_string,",",6,string_list);
std::string set_string = "a|bc#def|ghij#klmno|pqrstu#vwxyz";
std::set<std::string> string_set;
strtk::parse_n(set_string,"#|",6,string_set);
std::string queue_string = "value0,value1,value2,value3,value4,value5";
std::queue<std::string> string_queue;
strtk::parse_n(queue_string,",|",4,string_queue);
std::string stack_string = "value0|value1|value2,value3|value4,value5";
std::stack<std::string> string_stack;
strtk::parse_n(stack_string,",|",4,string_stack);
std::string priority_queue_string = "value0#value1|value2,value3#value4,value5";
std::priority_queue<std::string> string_priority_queue;
strtk::parse_n(priority_queue_string,",|#",4,string_priority_queue);
Note: The return value of the routine strtk::parse_n
indicates how many elements were parsed and placed into the specified
sequence.
Some Initial Examples
As a first example, we'll tackle the simple problem of reversing words in a sentence.
That is given a sentence, to have the first word be the last and the last to be
the first, the second word to be the second last so on and so forth. Using StrTk
we come up with the following solution:
int main()
{
std::string sentence = "The quick brown fox jumps over the lazy dog";
std::cout << "Before: " << sentence << std::endl;
strtk::split(" ",
sentence,
strtk::functional_inserter(
[](const strtk::range::string& range)
{ strtk::reverse(range); })
);
strtk::reverse(sentence);
std::cout << "After: " << sentence << std::endl;
return 0;
}
With an expected output of:
Before: The quick brown fox jumps over the lazy dog
After: dog lazy the over jumps fox brown quick The
Another example, given a list of words to blank out and a sentence,
transform the sentence such that the blank-out words are
removed. Using StrTk we come up with the following solution:
int main()
{
std::string sentence = "The quick brown fox jumps over the lazy dog";
std::unordered_set<std::string> blankout_words;
blankout_words.insert("quick");
blankout_words.insert("over");
blankout_words.insert("lazy");
std::cout << "Before: " << sentence << std::endl;
strtk::split(" ",
sentence,
strtk::functional_inserter(
[&blankout_words](const strtk::range::string& range)
{
auto itr = blankout_words.find(range);
if (blankout_words.end() != itr)
{
strtk::fill(range,' ');
}
})
);
strtk::remove_consecutives_inplace(' ',sentence);
std::cout << "After: " << sentence << std::endl;
return 0;
}
With an expected output of:
Before: The quick brown fox jumps over the lazy dog
After: The brown fox jumps the dog
Another simple example would be given a text file to read each
of the lines and populate a word list structure by
tokenizing each line into words. The following is an example
of how this can be achieved using StrTk:
int main()
{
std::deque<std::string> word_list;
strtk::for_each_line("text.txt",
[&word_list](const std::string& line)
{
static const std::string delimiters = "0123456789()[]{}<>"
"\t\r\n ,,.;:'\""
"!@#$%^&*_-=+`~/\\";
strtk::parse(line,delimiters,word_list);
});
return 0;
}
The following simple example takes a user specificed text file,
proceeds to process it and return information relating to the
file, such as word, letter, uppercase, lowercase, vowel and
consonant counts.
int main()
{
std::size_t word_count = 0;
std::size_t letter_count = 0;
std::size_t uppercase_count = 0;
std::size_t lowercase_count = 0;
std::size_t vowel_count = 0;
std::size_t consonant_count = 0;
using namespace strtk;
for_each_line("data.txt",
[&](const std::string& line)
{
static multiple_char_delimiter_predicate is_vowel("AEIOUaeiou");
static multiple_char_delimiter_predicate is_lowercase(ext_string::all_lowercase_letters());
static const std::string delimiters = ext_string::all_chars()
- ext_string::all_lowercase_letters()
- ext_string::all_uppercase_letters();
split(delimiters,
line,
functional_inserter(
[&](const strtk::range::string& range)
{
if (0 == range.size()) return;
++word_count;
letter_count += range.size();
std::size_t current_lowercase_count = 0;
std::size_t current_vowel_count = 0;
for (std::size_t i = 0; i < range.size(); ++i)
{
if (is_vowel(range[i])) ++current_vowel_count;
if (is_lowercase(range[i])) ++current_lowercase_count;
}
uppercase_count += range.size() - current_lowercase_count;
lowercase_count += current_lowercase_count;
consonant_count += range.size() - current_vowel_count;
vowel_count += current_vowel_count;
})
);
});
std::cout << "Word count: " << word_count << std::endl;
std::cout << "Letter count: " << letter_count << std::endl;
std::cout << "Uppercase count: " << uppercase_count << std::endl;
std::cout << "Lowercase count: " << lowercase_count << std::endl;
std::cout << "Vowel count: " << vowel_count << std::endl;
std::cout << "Consonant count: " << consonant_count << std::endl;
}
As a final simple example, we would like to calculate the word
frequency model of a user specified text file. The process involves
reading each line, splitting the line into words, then incrementing
the relevant count for each word and maintaining a global word count.
Once the file has been processed, the occurrence frequency of each
word will be printed to stdout.
int main()
{
typedef std::unordered_map<std::string,unsigned int> map_t;
map_t word_hit_list;
unsigned int word_count = 0;
strtk::for_each_line("data.txt",
[&](const std::string& line)
{
static const std::string delimiters = strtk::ext_string::all_chars()
- strtk::ext_string::all_lowercase_letters()
- strtk::ext_string::all_uppercase_letters();
strtk::split(delimiters,
line,
strtk::functional_inserter(
[&](const strtk::range::string& range)
{
if (range.begin() == range.end()) return;
++word_count;
std::string word(range.begin(),range.end());
++word_hit_list[word];
})
);
});
for (map_t::value_type v : word_hit_list)
{
printf("%s %10d %10.9f\n",
strtk::text::right_align(15,' ',v.first).c_str(),
v.second,
(1.0 * v.second) / word_count);
}
return 0;
}
A Practical Example
Lets assume you have been given an english text file to process, with
the intention of extracting a lexicon from the file.
One solution would be to break the problem down to a line by line
tokenization problem. In this case you would define a functional
object such as the following which will take the container in which
you plan on storing your tokens (words) and a predicate and insert
the tokens as strings into your container.
template<typename Container, typename Predicate>
struct parse_line
{
public:
parse_line(Container& container, const Predicate& predicate)
: container_(container),
predicate_(predicate)
{}
inline void operator() (const std::string& str)
{
strtk::split(str,
predicate_,
strtk::range_to_type_back_inserter(container_),
strtk::split_options::compress_delimiters);
}
private:
Container& container_;
const Predicate& predicate_;
};
The whole thing together would include a process of opening the
file and reading it line by line each time invoking the
parse_line would be as follows:
template<typename Container>
void parse_text(const std::string& file_name, Container& c)
{
static const std::string delimiters = " ,.;:<>'[]{}()_?/"
"`~!@#$%^&*|-_\"=+\t\r\n\0"
"0123456789";
strtk::multiple_char_delimiter_predicate predicate(delimiters);
strtk::for_each_line(file_name,
parse_line<Container,strtk::multiple_char_delimiter_predicate>(c,predicate));
}
int main()
{
std::string text_file_name = "text.txt";
std::deque<std::string> word_list;
parse_text(text_file_name,word_list);
std::cout << "Token Count: " << word_list.size() << std::endl;
return 0;
}
Before we continue on with the example, a re-written version of the
above code using C++11 lambdas is as follows:
int main()
{
std::string text_file_name = "text.txt";
std::deque<std::string> word_list;
strtk::for_each_line(text_file_name,
[&word_list](const std::string& line)
{
static const std::string delimiters = " ,.;:<>'[]{}()_?/"
"`~!@#$%^&*|-_\"=+\t\r\n\0"
"0123456789";
strtk::parse(line,delimiters,word_list);
});
std::cout << "Token Count: " << word_list.size() << std::endl;
return 0;
}
Now coming back to the original problem, that being the construction
of a lexicon. In this case the set of "words" should only
contain words of interest. For the sake of simplicity lets define
words of interest as being anything other than the following
prepositions: as, at, but, by, for, in, like, next, of, on,
opposite, out, past, to, up and via. This type of list is commonly
known as a Stop Word List. In this example the stop-word list
definition will be as follows:
const std::string stop_word_list [] =
{
"as", "at", "but", "by", "for",
"in", "like", "next", "of", "on",
"opposite", "out", "past", "to",
"up", "via", ""
};
const std::size_t stop_word_list_size = sizeof(stop_word_list) / sizeof(std::string);
Some minor updates to the parse_line processor include using the
filter_on_match predicate for determining if the
currently processed token is a preposition and also the invocation of
the range_to_type back_inserter to convert the tokens
from their range iterator representation to a type representation
compatible with the user defined container. For the new implementation
to provide unique words of interest the simplest change that can be
made is to replace the deque used as the container for the word_list
to some kind of 1-1 associative container such as a set. The following
is the improved version of the parse_line processor:
template<typename Container, typename Predicate>
struct parse_line
{
public:
parse_line(Container& container, const Predicate& predicate)
: container_(container),
predicate_(predicate),
tmp_(" "),
tokenizer_(tmp_,predicate_,true),
filter_(stop_word_list,stop_word_list + stop_word_list_size,
strtk::range_to_string_back_inserter_iterator<Container>(container_),
true,false)
{}
inline void operator() (const std::string& s)
{
const filter_type& filter = filter_;
strtk::for_each_token(s,tokenizer_,filter);
}
private:
Container& container_;
const Predicate& predicate_;
std::string tmp_;
typename strtk::std_string_tokenizer<Predicate>::type tokenizer_;
strtk::filter_on_match<strtk::range_to_string_back_inserter_iterator<Container>> filter_;
};
The above described predicate can be greatly simplified by using
binders and various lamba expressions.
Another Example
When performing serialization or deserialization of an
instance object such as a class or struct, a simple approach
one could take would be to take each of the members and
convert them into string representations and from those strings
construct a larger string delimiting each member with a special
character guaranteed not to exist in any of the string
representations.
In this example we will assume that there exists a struct which
represents the properties of a person, a person struct if you
will:
struct person
{
unsigned int id;
std::string name;
unsigned int age;
double height
float weight;
};
The process of populating a person struct would entail having an
instance of a person and the necessary data string. The following is
an example of how this would be done using the
strtk::parse function.
Person Tuple Format
| Token0 | Delimiter0 | Token1 | Delimiter1 | Token2 | Delimiter2 | Token3 | Delimiter3 | Token4 |
Unique ID(hex) | | | Name | | | Age | | | Height(m) | | | Weight(kg) |
std::string data = "0xFA37ED12|Rumpelstiltskin|397|1.31|58.7";
person p;
strtk::hex_to_number_sink<unsigned int> hex_sink(p.id); strtk::parse(data,"|",hex_sink,p.name,p.age,p.height,p.weight);
Batch processing of a text file comprised of one person tuple
per-line is somewhat similar to the previous example. A predicate
is defined that takes a container specialized on the person struct,
and a delimiter predicate with which the
strtk::parse function will be invoked. This
predicate is then instantiated and coupled with the text
file name, is fed to the strtk::for_each_line
processor.
template<typename Container, typename Predicate>
struct parse_person_tuple
{
public:
parse_person_tuple(Container& container)
: container_(container),
hex_sink(p_.id)
{}
inline void operator() (const std::string& s)
{
if (strtk::parse(s,"|",hex_sink,p_.name,p_.age,p_.height,p_.weight))
container_.push_back(p_);
else
std::cerr << "Failed to parse: " << s << std::endl;
}
private:
Container& container_;
person p_;
strtk::hex_to_number_sink<unsigned int> hex_sink;
};
Bringing the above pieces together to process a file results in the
following:
int main()
{
std::string text_file_name = "person_records.txt";
std::deque<person> person_list;
strtk::for_each_line(file_name,predicate_type(person_list));
return 0;
}
Before we continue on with the example, a re-written version of the
above code using C++11 lambdas is as follows:
int main()
{
std::string text_file_name = "person_records.txt";
std::deque<person> person_list;
person p;
strtk::hex_to_number_sink<unsigned int> hex_sink;
strtk::for_each_line(text_file_name,
[&person_list,&p,&hex_sink](const std::string& line)
{
if (strtk::parse(line,"|",hex_sink,p.name,p.age,p.height,p.weight))
container_.push_back(p);
else
std::cerr << "Failed to parse: " << line << std::endl;
});
return 0;
}
To make things easier one could adapt a struct (made up entirely
of PODs) to a parser. This makes the usage syntax little easier
to follow. An example of this adaption is as follows:
struct type
{
std::string s;
double d;
int i;
char c;
bool b;
};
strtk_parse_begin(type)
strtk_parse_type(s)
strtk_parse_type(d)
strtk_parse_type(i)
strtk_parse_type(c)
strtk_parse_type(b)
strtk_parse_end()
int main()
{
type t;
std::string s = "abcdefghijklmnop|123.456|987654321|A|1";
strtk::parse(s,"|",t);
return 0;
}
Another similar example to the previous, would be parsing a text
file of 3D coordinates into a sequence. This can be done easily and
cleanly using lambdas and StrTk as follows:
struct point
{
double x,y,z;
};
int main()
{
std::string point_data = "point_data.txt";
std::deque<point> points;
point p;
strtk::for_each_line(point_data,
[&](const std::string& str)
{
if (strtk::parse(str,",",p.x,p.y,p.z))
points.push_back(p);
});
return 0;
}
Simple Date-Time Parser
Assuming the datetime struct defined below, and a string
representation of a combined date and time in the form of YYYY-MM-DD
HH:MM:SS.MS eg: 2005-06-26 15:57:03.678
struct datetime
{
unsigned int year;
unsigned int month;
unsigned int day;
unsigned int hour;
unsigned int minute;
unsigned int second;
unsigned int msecond;
};
The following assumes an input of date-time values separated by a
pipe. To facilitate parsing of a date-time by
the strtk::parse routine into an STL compatible sequence an
implementation of string_to_type_converter_impl specific
to the datetime type is required. The following demonstrates how such
a routine can be generated and used within the strtk::parse
context:
strtk_string_to_type_begin(datetime)
static const std::string delimiters ("-:. ");
return strtk::parse(begin, end, delimiters,
t.year, t.month, t.day,
t.hour, t.minute, t.second, t.msecond);
strtk_string_to_type_end()
A simple example of using strtk::parse in conjunction
with the newly integrated datetime variable mixed with variables
of other types is as follows:
int main()
{
const std::string data = "abc 123 xyz,2000-01-10 03:01:16.123|+98765.43210";
std::string var0;
datetime var1;
double var2;
strtk::parse(data,",|",var0,var1,var2);
return 0;
}
Bringing the above pieces together, in the following we can then
proceed to parse a sequence of date-time strings delimited by pipe
"|" into a deque of type datetime.
int main()
{
static const std::string data = "2000-01-10 03:01:16.123|2001-02-22 05:12:24.234|"
"2002-03-13 07:23:32.345|2003-04-24 09:34:47.456|"
"2004-05-15 11:46:51.767|2005-06-26 15:57:03.678|"
"2006-07-17 17:45:31.561|2007-08-26 19:06:02.809|"
"2008-09-18 21:16:23.267|2009-10-26 23:12:03.798|"
"2010-11-23 13:47:11.963|2011-12-26 15:35:08.168";
std::deque<datetime> datetime_list;
strtk::parse(data,"|",datetime_list);
for (std::size_t i = 0; i < datetime_list.size(); ++i)
{
datetime& dt = datetime_list[i];
std::cout << dt.year << "-"
<< strtk::text::right_align(2,'0', dt.month) << "-"
<< strtk::text::right_align(2,'0', dt.day) << " "
<< strtk::text::right_align(2,'0', dt.hour) << ":"
<< strtk::text::right_align(2,'0', dt.minute) << ":"
<< strtk::text::right_align(2,'0', dt.second) << "."
<< strtk::text::right_align(3,'0',dt.msecond) << std::endl;
}
return 0;
}
Parsing Sub-Lists
So far the demonstrated capabilities of the strtk::parse
function has been based on passing a series of parameters that are
populated in a linear fashion as the parser processes the tokens it
encounters. That said, some formats have their own sub-structures, a
simple example would be a list of values (such as integers) that need
to be loaded into a deque or stack. StrTk provides a series of sink
facilities that consume a range and an STL container which can be
forwarded onto strtk::parse.
In the following example, the data string is comprised of 3
separate lists delimited by a pipe "|". An integer, a
string and a double type list. Each list is to be parsed into an STL
container of appropriate type. In this case a vector, a deque and a
list. StrTk provides the ability to instantiate a sink for the
specific container type that is compatible with
strtk::parse.
int main()
{
std::string data = "1,+2,-3,4|abc,ijk,rst,xyz|123.456,+234.567,-345.678,456.789,567.890";
std::vector<int> int_vector;
std::deque<std::string> string_deque;
std::list<double> double_list;
std::set<int> int_set;
std::queue<std::string> string_queue;
std::stack<double> double_stack;
std::priority_queue<int> int_priority_queue;
strtk::vector_sink<int>::type vec_sink(",");
strtk::deque_sink<std::string>::type deq_sink(",");
strtk::list_sink<double>::type lst_sink(",");
strtk::set_sink<int>::type set_sink(",");
strtk::queue_sink<std::string>::type que_sink(",");
strtk::stack_sink<double>::type stk_sink(",");
strtk::priority_queue_sink<int>::type prq_sink(",");
strtk::parse(data,"|",vec_sink( int_vector),
deq_sink(string_deque),
lst_sink( double_list));
strtk::parse(data,"|",set_sink( int_set),
que_sink( string_queue),
stk_sink( double_stack),
prq_sink(int_priority_queue));
return 0;
}
If only a certain number of elements in the list are required, for
example only the first 3, then the element count on the sink can be
set appropriately. The above example could be modified as follows:
int main()
{
std::string data = "1,+2,-3,4|string0|abc,ijk,rst,xyz|string1|123.456,+234.567,-345.678,456.789,567.890";
std::vector<int> int_vector;
std::deque<std::string> string_deque;
std::list<double> double_list;
strtk::vector_sink<int>::type vec_sink(",");
strtk::deque_sink<std::string>::type deq_sink(",");
strtk::list_sink<double>::type lst_sink(",");
std::string string_0;
std::string string_1;
strtk::parse(data,"|",vec_sink( int_vector).count(2), string_0,
deq_sink(string_deque).count(3), string_1,
lst_sink( double_list).count(4)); return 0;
}
Note: If there aren't enough elements in a particular
list, then parsing of that list fails and subsequently the whole
strtk::parse call will fail as well.
Parsing Trailing-Lists
Another way one might want to parse a tuple of values might be to parse
a prefix of values into a specific number of possibly varying types,
then to parse the remaining values (assuming they are all of the same
type) into a sequence or list etc. StrTk provides the following simple
solution to the given problem, as demonstrated below:
int main()
{
{
std::string data = "A String Value,111.111,222.222,333.333,444.444,555.555";
std::string token;
std::vector<double> double_list;
strtk::parse(data,",",token,double_list);
}
{
std::string data = "A String Value,01-02-2003,111.111,222.222,333.333,444.444,555.555";
std::string token;
std::string date;
std::deque<double> double_list;
strtk::parse(data,",",token,date,double_list);
}
{
std::string data = "A String Value,01-02-2003,123456789,111.111,222.222,333.333,444.444,555.555";
std::string token;
std::string date;
int i;
std::list<double> double_list;
strtk::parse(data,",",token,date,i,double_list);
}
{
std::string data = "A String Value,01-02-2003,123456789,111.111,222.222,333.333,444.444,555.555";
std::string token;
std::string date;
int i;
double d;
std::vector<double> double_list;
strtk::parse(data,",",token,date,i,d,double_list);
}
{
std::string data = "A String Value,01-02-2003,123456789,111.111,222.222,333.333,444.444,555.555";
std::string token;
std::string date;
int i;
double d1;
double d2;
std::deque<double> double_list;
strtk::parse(data,",",token,date,i,d1,d2,double_list);
}
return 0;
}
Extending The Date-Time Parser Example
Building upon the previous datetime example, We are presented with a
tuple of data that represents an astronomical event. The event
defines a name, a location and a series of date-times in UTC the
event was observed. In order to construct the necessary sink(s) that
will be used for parsing the required type into a container, the
macro strtk_register_userdef_type_sink with the specified type
is invoked. The following is a definition of the struct one might
construct:
struct event_information
{
std::size_t id;
std::string name;
std::string location;
std::deque<datetime> observation_datetimes;
};
strtk_register_userdef_type_sink(datetime)
Bringing the above together with a call to strtk::parse
results in the following code which parses the event data tuple into
the allocated event_information instance.
int main()
{
std::string event_data = "172493|Lunar Impact|Mare Tranquillitatis|"
"2010-01-19 00:28:45.357,2010-02-18 00:57:07.109,"
"2010-03-20 01:15:11.261,2010-04-21 01:07:27.972";
strtk::deque_sink<datetime>::type deq_sink(",");
event_information evt_info;
strtk::parse(event_data,"|",evt_info.id,
evt_info.name,
evt_info.location,
deq_sink(evt_info.observation_datetimes));
return 0;
}
Ignoring Tokens During Parsing
There may be scenarios when given a delimited tuple of data, that one
or more of the tokens need to be ignored or skipped. StrTk provides a
mechanism called strtk::ignore_token that allows the
parser to consume specific tokens easily without affecting overall
performance. Below is an example of how strtk::ignore_token
can be used in conjunction with strtk::parse to skip the
2nd and 4th tokens in the tuple:
int main()
{
static const std::string data = "+123,ignore0,123.456,ignore1,abcdef,ignore2";
int i;
double d;
std::string s;
strtk::ignore_token ignore;
strtk::parse(data,",",i,ignore,d,ignore,s);
std::cout << "i=" << i << " d=" << d << " s=" << s << std::endl;
return 0;
}
Simple 3D Mesh File Format Parser
Lets assume there is a file format for expressing a mesh. The format
consists of a section that defines the unique vertexes in the mesh,
and another section that defines the triangles in the mesh as a tuple
of three indicies indicating the vertexes used. Each section is
preceded by an integer that denotes the number of subsequent
elements in that section. An example of such a file is the following:
5
+1.0,+1.0,+1.0
-1.0,+1.0,-1.0
-1.0,-1.0,+1.0
+1.0,-1.0,-1.0
+0.0,+0.0,+0.0
4
0,1,4
1,2,4
2,3,4
3,1,4
Code to parse such a file format is as follows:
struct point
{
double x,y,z;
};
struct triangle
{
std::size_t i0,i1,i2;
};
int main()
{
std::string mesh_file = "mesh.txt";
std::ifstream stream(mesh_file.c_str());
std::string s;
std::deque<point> points;
point p;
std::size_t point_count = 0;
strtk::parse_line(stream," ",point_count);
strtk::for_each_line_n(stream,
point_count,
[&points,&p](const std::string& line)
{
if (strtk::parse(line,",",p.x,p.y,p.z))
points.push_back(p);
});
std::deque<triangle> triangles;
triangle t;
std::size_t triangle_count = 0;
strtk::parse_line(stream," ",triangle_count);
strtk::for_each_line_n(stream,
triangle_count,
[&triangles,&t](const std::string& line)
{
if (strtk::parse(line,",",t.i0,t.i1,t.i2))
triangles.push_back(t);
});
return 0;
}
Simple Semantic Actions
A semantic action is an action that is undertaken upon a token, it can be in the
form of either an assessment or a manipulation. StrTk provides a very simplified
semantic action capability, named strtk::util::semantic_action for
types that are being parsed via the strtk::parse function. A function
(stateful or stateless), taking an iterator range is used to construct the
semantic_action. The body of the function performs whatever operations are required
and also makes sure to maintain the contract with regards to the return status
for the parse routine to complete successfully.
The following is an example where a comma delimited string is parsed into 3 types,
an integer, a double and a string. The rules regarding parsing and updating of the
variables is as follows, the int variable "i" will only be updated if the
value parsed is odd, the double value "d" will only be updated if the parsed
value is greater than or equal to 99.99 and the string value "s" will only
be updated if the presented range contains the string "ring". Upon every successful
update a corresponding counter will be incremented.
int main()
{
std::string data = "12345,123.456,A string";
int i = 0;
double d = 0.0;
std::string s = "";
std::size_t i_update_count = 0;
std::size_t d_update_count = 0;
std::size_t s_update_count = 0;
typedef strtk::range::ustring::const_iterator itr_type;
using strtk::util::semantic_action;
strtk::parse(data,",",
semantic_action([&i,&i_update_count]
(itr_type begin,itr_type end) -> bool
{
int temp = 0;
if (strtk::string_to_type_converter(begin,end,temp))
{
if (temp % 2 != 0)
{
i = temp;
++i_update_count;
}
return true;
}
else
return false;
}).ref(),
semantic_action([&d,&d_update_count]
(itr_type begin,itr_type end) -> bool
{
double temp = 0.0;
if (strtk::string_to_type_converter(begin,end,temp))
{
if (temp >= 99.99)
{
d = temp;
++d_update_count;
}
return true;
}
else
return false;
}).ref(),
semantic_action([&s,&s_update_count]
(itr_type begin,itr_type end) -> bool
{
static unsigned char pattern[] = "ring";
if (end != std::search(begin,end,pattern,pattern + 4))
{
s.assign(begin,end);
++s_update_count;
}
return true;
}).ref());
std::cout << "i: " << i << std::endl;
std::cout << "d: " << d << std::endl;
std::cout << "s: " << s << std::endl;
return 0;
}
Pick A Random Line From A Text File
A random line of text is to be selected from a user provided
text file comprised of lines of varying length, such that the
probability of the line being selected is 1/N where N is the number
of lines in the text file. There are many trivial solutions to this
problem, however if one were to further constrain the problem by
indicating the file is very large (TBs) and that the system upon
which the solution is to run is very limited memory-wise, most if
not all trivial solutions such as storing indexes of all line offsets,
or reading the entire file into memory etc will be eliminated.
However, there exists a very elegant solution to this problem of O(n),
O(1) time and space complexities respectively, that involves
scanning the entire file once line by line, and at every ith
line choosing whether or not to replace the final result line with
the current line by sampling a uniform distribution of range
[0,1) and performing a replace if and only if the random value is
less than 1 / i.

The logic behind this solution revolves around the fact that the
probability of selecting the ith line will be 1/i and as such the
total probability for selecting any of the previous i-1 lines will be
1 - (1/i) or (i - 1)/i. Because there are (i - 1) lines before the
ith line, we divide the previous sum of probabilities by (i - 1),
resulting in a selection probability of 1/i for any one of the lines
up to and including the ith line. If the ith line were to be the last
line of the text file, this then results in each of the lines
having a selection probability of 1/N - simple and sweet, and so
to is the following implementation of said solution:
int main(int argc, char* argv[])
{
std::string file_name = argv[1];
std::string line;
std::size_t i = 0;
strtk::uniform_real_rng rng(static_cast<std::size_t>(::time(0)));
strtk::for_each_line(file_name,
[&](const std::string& s)
{
if (rng() < (1.0 / ++i))
{
line = s;
}
});
std::cout << line << std::endl;
return 0;
}
What changes to the above code would be required If the probability of line selection was changed to 1/(N-K) where 0 <= K <= N and K is the number of lines what will be randomly ignored.
Note: TAOCP Volume II section 3.4.2 has an in depth discussion
about this problem and other similar problems relating to random
distributions. Also one should note that the above example has an
inefficiency, in that upon every string replace, an actual string is
being copied, normally all one needs to do is maintain a file offset
to the beginning of the line, not doing this causes slow downs due to
continuous memory allocations/reallocations which is made all the
worse when large lines are encountered.
The Buy Low And Sell High Problem
Assume we are given a piece of data in csv format which represents
a time-series for the close prices of the SPDR S&P 500 ETF.
The time-series ranges from 04/01/1999 to 11/11/2011. The objective
is to select two dates, the first being the buy date and the second
being the sell date, such that if we were to buy then sell X shares
of the ETF we will have maximized our profit. It should be noted that of course
the buy date must be before the sell date and that short selling is not
an option in this strategy. The following is a chart that represents the
price of the SPDR over the given period:

Through visual inspection we can approximate that the best buy date would be
towards the end of 2002 and the corresponding sell date would be towards
the end of 2007, as these points seem to give the largest difference between
the two prices. However it also does seems that a buy roughly at the start of
2009 and a sell at the beginning of 2011 could also provide such a large price
difference. As such the visual inspection approach has lead to an ambiguity,
hence a more thorough and precise approach is required.
One could take the naive approach to solving the buy low sell high problem, by
initially loading the entire time series into memory, then for each ith time
point take the pricei and test it against every other pricej
where i < j, and maintain a "best profit encountered" structure that
contains the best profit so far and the corresponding buy/sell prices and dates.
This solution has a few problems, initially it is of O(N2) time
complexity and O(N) space complexity. As an example for one million time points
it will require one trillion comparisons and one million date/price units of storage.
If memory and computational processing was limited on the hardware this solution
can not even be executed. Furthermore as suggested by the time-complexity as the
size of the data grows, regardless of computational abilities, the compute times
for the results would become astronomical and practically useless - specially for
real-time reactive systems.
In these types of problems one tries to assess if an on-line or streaming based
solution is feasible. That is a solution that does not require all the data to be
available all at once, can work on the data incrementally and requires no more
than one-pass for each piece of data. Such a solution would typically have a time
complexity of O(N) and a space complexity of O(1).
With regards to this problem the crucial insight required to convert the naive solution
from an O(N2) complex one to an on-line version of O(n) time complexity
is that every new global minima encountered is the beginning of a new period and an
indicator of an end to the previous period. Looking at the chart, if one were to scan
from left to right, the intuitive response is to find the point with the lowest price,
ignore everything preceding, then try and find the next point with the highest price or
global maxima. There are a few edge cases that need to be dealt with. The main one being
the problem described above that there are two buy/sell points that could potentially be
the solution. The way around this is to simply maintain the best encountered period, and
compare the profit from any new period to the best so far, if it is better(more) then
replace the best with the current period. Another edge case is when the data is in a
continual decline, in a situation like this there will be no profitable buy/sell points.
The following is a small subsection of the time-series in question:
Download spy500.csv
15/06/2000,148.16
16/06/2000,146.59
19/06/2000,148.47
20/06/2000,147.94
21/06/2000,147.88
22/06/2000,145.63
23/06/2000,144.38
26/06/2000,146.23
27/06/2000,145.16
28/06/2000,145.56
29/06/2000,144.19
The code below is a very simple single pass incremental solution to the given problem.
It reads every line of the input file, parses each line into a date and price variable,
checks to see if the current price is less than the current global minima price, if it
is the case, it will set the current period start to the current date and set the buy price
to be the current price, otherwise it checks to see if the current price is larger than the
global maxima price, if that is the case then it updates the current profit, sell price
and sell dates accordingly. If at the end, the buy price is not less than the sell price,
it is indicative that there exists no two time points within the given time series for
which a profitable transaction could occur, otherwise it prints out buy and sell dates
for the required transaction and the expected proft per share.
struct period
{
period()
: profit(std::numeric_limits<double>::min()),
buy_price(std::numeric_limits<double>::max()),
sell_price(std::numeric_limits<double>::min())
{}
bool operator>(const period&p) const
{
return profit > p.profit;
}
double profit;
double buy_price;
double sell_price;
std::string buy_date;
std::string sell_date;
};
int main()
{
period best_period;
period curr_period;
strtk::for_each_line("spy500.csv",
[&best_period,&curr_period]
(const std::string& line)
{
std::string date;
double price;
if (!strtk::parse(line,",",date,price)) return;
if (price < curr_period.buy_price)
{
if (curr_period > best_period)
{
best_period = curr_period;
}
curr_period.buy_date = date;
curr_period.buy_price = price;
curr_period.sell_price = std::numeric_limits<double>::min();
curr_period.sell_date = "";
}
else if (price > curr_period.sell_price)
{
curr_period.sell_price = price;
curr_period.sell_date = date;
curr_period.profit = curr_period.sell_price - curr_period.buy_price;
}
});
if (best_period.buy_price >= best_period.sell_price)
{
std::cout << "No period in time-series can be profitablly exploited." << std::endl;
}
else
{
std::cout << "Buy: " << best_period.buy_date << std::endl;
std::cout << "Sell: " << best_period.sell_date << std::endl;
std::cout << "Profit per share: $" << best_period.profit << std::endl;
}
return 0;
}
For the given data it is expected the above piece of code will produce the following output:
Buy: 09/10/2002
Sell: 09/10/2007
Profit per share: $78.38
What changes to the above piece of code would be required if:
- Rather than prices, we instead are given percentage differences from the previous price?
- We have the constraint that we can't hold a position for more than K days
- A penalty of 1/K% is applied to the profit for every day a position is held? (where K is the condition and value described above)
- Short selling is allowed?
- We want to process independent sections of the time series concurrently so as to speed up overall processing time.
- The prices could be extremely large or small?
Token Grid
StrTk provides a means to easily parse and consume 2D grids of tokens
in an efficient and simple manner. A grid is simply defined as a
series of rows comprised of tokens, otherwise known as Delimiter
Separated Values (DSV). The ith token of a row is grouped with
every ith token of all other rows to produce a column. Tokens can be
processed as either rows or columns.
An example of a simple token grid, where each numeric value is deemed
to be a token:
1.1 | 2.2 | 3.3 | 4.4 | 5.5 |
1.1 | 2.2 | 3.3 | 4.4 | 5.5 |
1.1 | 2.2 | 3.3 | 4.4 | 5.5 |
1.1 | 2.2 | 3.3 | 4.4 | 5.5 |
1.1 | 2.2 | 3.3 | 4.4 | 5.5 |
A token grid can be either passed in via a file or a string buffer.
The delimiters to be used for parsing the columns and rows can also
be provided, if not provided standard common defaults will be used.
The following demonstrates how each cell in the grid can be access
and cast to a specific type.
std::string data = "1,2,3,4,5,6\n"
"7,8,9,0,1,2\n"
"3,4,5,6,7,8\n"
"9,0,1,2,3,4\n"
"5,6,7,8,9,0\n";
strtk::token_grid grid(data,data.size(),",");
for (std::size_t r = 0; r < grid.row_count(); ++r)
{
strtk::token_grid::row_type row = grid.row(r);
for (std::size_t c = 0; c < row.size(); ++c)
{
std::cout << grid.get<int>(r,c) << '\t';
}
std::cout << std::endl;
}
The strtk::token_grid provides various helper functions
for traversing rows and columns in batch mode. The functions are namely:
for_each_row that is used for iterating either all or a sub-range
of rows of the token_grid, and for_each_column that is used
for iterating either all or a sub-range of columns of a row.
int main()
{
std::string data = "1.1,2.1,3.1,4.1,5.1,6.1,7.1\n" "1.2,2.2,3.2,4.2,5.2,6.2,7.2\n" "1.3,2.3,3.3,4.3,5.3,6.3,7.3\n" "1.4,2.4,3.4,4.4,5.4,6.4,7.4\n" "1.5,2.5,3.5,4.5,5.5,6.5,7.5\n" "1.6,2.6,3.6,4.6,5.6,6.6,7.6\n" "1.7,2.7,3.7,4.7,5.7,6.7,7.7\n" "1.8,2.8,3.8,4.8,5.8,6.8,7.8\n" "1.9,2.9,3.9,4.9,5.9,6.9,7.9\n";
strtk::token_grid grid(data,data.size(),",");
{
grid.for_each_row([](const strtk::token_grid::row_type& row)
{
for (std::size_t i = 0; i < row.size(); ++i)
{
std::cout << row.get<double>(i) << '\t';
}
std::cout << '\n';
});
}
{
grid.for_each_row(grid.range(2,6),
[](const strtk::token_grid::row_type& row)
{
for (std::size_t i = 0; i < row.size(); ++i)
{
std::cout << row.get<double>(i) << '\t';
}
std::cout << '\n';
});
}
{
grid.for_each_row([](const strtk::token_grid::row_type& row)
{
row.for_each_column([](const strtk::token_grid::row_type::range_type& range)
{
std::cout << std::string(range.first,range.second) << '\t';
});
std::cout << '\n';
});
}
{
grid.for_each_row(grid.range(2,6),
[](const strtk::token_grid::row_type& row)
{
row.for_each_column(row.range(1,5),
[](const strtk::token_grid::row_type::range_type& range)
{
std::cout << std::string(range.first,range.second) << '\t';
});
std::cout << '\n';
});
}
return 0;
}
The following example demonstrates how averages over rows and columns
of a token grid can be computed:
std::string data = "1.1,1.1,1.1,1.1,1.1,1.1\n"
"2.2,2.2,2.2,2.2,2.2,2.2\n"
"3.3,3.3,3.3,3.3,3.3,3.3\n"
"4.4,4.4,4.4,4.4,4.4,4.4\n"
"5.5,5.5,5.5,5.5,5.5,5.5\n"
"6.6,6.6,6.6,6.6,6.6,6.6\n";
strtk::token_grid grid(data,data.size(),",");
std::vector<double> avg_c(grid.row(0).size(),0.0);
std::vector<double> avg_r(grid.row_count(),0.0);
std::vector<double> tmp(grid.row(0).size(),0.0);
std::fill(avg_c.begin(),avg_c.end(),0.0);
for (std::size_t i = 0; i < grid.row_count(); ++i)
{
grid.row(i).parse<double>(tmp.begin());
std::transform(avg_c.begin(),avg_c.end(),tmp.begin(),avg_c.begin(),std::plus<double>());
avg_r[i] = std::accumulate(tmp.begin(),tmp.end(),0.0) / tmp.size();
}
for (std::size_t i = 0; i < avg_c.size(); avg_c[i++] /= grid.row_count());
std::cout << "Column Averages:\t";
std::copy(avg_c.begin(),avg_c.end(),std::ostream_iterator<double>(std::cout,'\t'));
std::cout << '\n';
std::cout << "Row Averages:\t";
std::copy(avg_r.begin(),avg_r.end(),std::ostream_iterator<double>(std::cout,'\t'));
std::cout << '\n';
Processing Of Comma Seperated Values Data
The original intent of the token grid was to support fast and
efficient processing of simple data tuples, such as comma seperated
values (CSV) formats et. al. The following example demonstrates a
simple summation of traded floor volume and average daily volume based
on NASDAQ OHLC (Open-High-Low-Close) data.
std::string market_data = "20090701,GOOG,424.2000,418.9900,426.4000,418.1500,2310768\n"
"20090701,MSFT,24.0500,24.0400,24.3000,23.9600,54915127\n"
"20090702,GOOG,415.4100,408.4900,415.4100,406.8100,2517630\n"
"20090702,MSFT,23.7600,23.3700,24.0400,23.2100,65427699\n"
"20090703,GOOG,408.4900,408.4900,408.4900,408.4900,0\n"
"20090703,MSFT,23.3700,23.3700,23.3700,23.3700,0\n"
"20090706,GOOG,406.5000,409.6100,410.6400,401.6600,2262557\n"
"20090706,MSFT,23.2100,23.2000,23.2800,22.8700,49207638\n"
"20090707,GOOG,408.2400,396.6300,409.1900,395.9801,3260307\n"
"20090707,MSFT,23.0800,22.5300,23.1400,22.4600,52842412\n"
"20090708,GOOG,400.0000,402.4900,406.0000,398.0600,3441854\n"
"20090708,MSFT,22.3100,22.5600,22.6900,2200000,73023306\n"
"20090709,GOOG,406.1200,410.3900,414.4500,405.8000,3275816\n"
"20090709,MSFT,22.6500,22.4400,22.8100,22.3700,46981174\n"
"20090710,GOOG,409.5700,414.4000,417.3700,408.7000,2929559\n"
"20090710,MSFT,22.1900,22.3900,22.5400,22.1500,43238698\n";
strtk::token_grid grid(market_data,market_data.size(),",");
struct stock_info
{
stock_info(const std::string& s = " ")
: symbol(s),
total_volume(0),
day_count(0),
average_daily_volume(0.0)
{}
std::string symbol;
unsigned long long total_volume;
std::size_t day_count;
double average_daily_volume;
};
stock_info goog("GOOG");
stock_info msft("MSFT");
static const std::size_t volume_column = 6;
static const std::size_t symbol_column = 1;
goog.day_count = grid.accumulate_column(volume_column,
[](const strtk::token_grid::row_type& row) -> bool
{
static const std::string google_symbol("GOOG");
return google_symbol == row.get<std::string>(symbol_column);
},
goog.total_volume);
msft.day_count = grid.accumulate_column(volume_column,
[](const strtk::token_grid::row_type& row) -> bool
{
static const std::string microsoft_symbol("MSFT");
return microsoft_symbol == row.get<std::string>(symbol_column);
},
msft.total_volume);
goog.average_daily_volume = (1.0 * goog.total_volume) / goog.day_count;
msft.average_daily_volume = (1.0 * msft.total_volume) / msft.day_count;
std::cout << "[GOOG] Total Volume: " << goog.total_volume << std::endl;
std::cout << "[MSFT] Total Volume: " << msft.total_volume << std::endl;
std::cout << "[GOOG] ADV: " << goog.average_daily_volume << std::endl;
std::cout << "[MSFT] ADV: " << msft.average_daily_volume << std::endl;
The strtk::token_grid is thread-safe iff read operations
are in play. As such the above calls to accumulate_column et al. can
all be safely and easily executed concurrently using threads. This
allows for a far more efficient data processing methodology.
TIMTOWTDI
Playing devil's advocate, another way of performing the above
processing task, assuming only the specific values for computing the
ADV are required and no further processing of the CSV data is
needed, then the problem can be solved efficiently by utilizing a
single pass of the data as follows:
std::string file_name = "market_data.csv";
std::unordered_map<std::string,stock_info> stock_map;
stock_map.insert(std::make_pair<std::string,stock_info>("GOOG",stock_info("GOOG")));
stock_map.insert(std::make_pair<std::string,stock_info>("MSFT",stock_info("MSFT")));
strtk::for_each_line(file_name,
[&](const std::string& line)
{
strtk::ignore_token ignore;
stock_info temp;
const bool result = strtk::parse(line,
",",
ignore,
temp.symbol,
ignore,
ignore,
ignore,
ignore,
temp.total_volume);
if (!result) return;
auto itr = stock_map.find(temp.symbol);
if (stock_map.end() == itr) return;
(*itr).second.total_volume += temp.total_volume;
(*itr).second.day_count++;
});
auto itr = stock_map.begin();
auto end = stock_map.end();
while (end != itr)
{
stock_info& stock = (*itr++).second;
stock.average_daily_volume = (1.0 * stock.total_volume) / stock.day_count;
}
TIMTOWTDI - II (with a vengeance)
Playing the devil's other advocate, the above two examples, have both required that
the filter condition be explicitly defined at compile time. However even though the
condition maybe be set in stone at compile time, some of the underlyings (such as
symbol) can be engineered to be modified at run-time. That still doesn't give us the
freedom to perform arbitrarily complex filter expressions determined at run-time.
That said, an extremely efficient and very simple solution is at hand. The solution
is called the C++ DSV Filter library, it is based on StrTK and ExprTk libraries. It
uses the strtk::token_grid as a CSV/DSV store and index, and ExprTk as
the underlying expression evaluation engine. The example below takes the OHLC market
data table defined above and performs a row-wise query. The expression's definition
is:
select all rows where the open price is greater than the close price and the
symbol matches the wild-card pattern of "*FT*" and the date is equal to or
after 20090101.
int main()
{
std::string file_name = "market_data.csv";
dsv_filter filter;
filter.set_input_delimiter(",");
if (!filter.load(file_name))
return 1;
std::string expression = "(open > close) and (symbol like '*FT*') and (date >= '20090101')";
filter.add_filter(filter_expression);
for (std::size_t row = 1; row < filter.row_count(); ++row)
{
if (dsv_filter::e_match == filter[row])
{
}
}
return 0;
}
Other example queries:
volume >= 1000000 and symbol == 'GOOG'
abs(open - close) > abs(high - low)
avg(open,close,high,low)* volume > 10^7 and inrange('20090702',date,'20090730')
(open > close) and (symbol like '*FT*') and (date >= '20090101')
It should be noted that in the example above, the rows begin at index 1.
That is done because the dsv_filter expects the first row or
row at index 0 to be a column definition header. The format of the column
definitions is to simply add a suffix of "_s" if the values in the column
are to be treated as strings or "_n" if they are to be treated as numbers.
When defining expressions the suffixes should not be included when including
the column names. The above mentioned OHLC csv file's header would be as
follows:
Date_n,Symbol_s,Open_n,Close_n,High_n,Low_n,Volume_n
C++ DSV Filter and Dependencies
Sequential Partitions
A typical operation carried out upon time-series data is to group
tuples into buckets (or bins) based upon the time index
value. For example grouping data into 2-minute buckets and then
performing some kind of operation upon the grouped tuples such as a
summation or an average etc. This process is sometimes also called:
"discretization"
The strtk::token_grid class provides a method known as
sequential_partition. The sequential_partition
method requires a Transition Predicate, a Function
and optionally a row-range. The Transition Predicate
consumes a row and returns true only if the row is
in the next partition. All other subsequent consecutive rows until
the transition predicate returns a true are said to be
in the current partition. Prior to transitioning to a new partition,
the function predicate is invoked and provided with the range of
rows in the current partition.
The following example takes a simple time-series (time and
value), partitions the tuples into groups of Time-Buckets
of period length 3 and then proceeds to compute the total sum of each
group. The below summarizer class provides provides a
Transition Predicate and Function.

class summarizer
{
public:
enum column_index
{
tick_time_column = 0,
tick_value_column = 1
};
summarizer(std::deque<double>& sum_value)
: next_tick_time_(0),
sum_value_(sum_value)
{}
inline bool operator()(const strtk::token_grid::row_type& row)
{
if (row.get<std::size_t>(tick_time_column) >= next_tick_time_)
{
next_tick_time_ = row.get<std::size_t>(tick_time_column) + 3;
return true;
}
else
return false;
}
inline bool operator()(const strtk::token_grid& grid,
const strtk::token_grid::row_range_type& range)
{
double bucket_sum = 0.0;
if (!grid.accumulate_column(tick_value_column,range,bucket_sum))
{
std::cout << "failed to accumulate!" << std::endl;
return false;
}
else
sum_value_.push_back(bucket_sum);
return true;
}
private:
summarizer& operator=(const summarizer&);
std::size_t next_tick_time_;
std::deque<double>& sum_value_;
};
int main()
{
std::string data = "10000,123.456\n"
"10001,612.345\n"
"10002,561.234\n"
"10003,456.123\n"
"10004,345.612\n"
"10005,234.561\n"
"10006,123.456\n";
strtk::token_grid grid(data,data.size(),",");
std::deque<double> sum_value;
summarizer s(sum_value);
grid.sequential_partition(s,s);
for (std::size_t i = 0; i < sum_value.size(); ++i)
{
std::cout << "bucket[" << i << "] = " << sum_value[i] << std::endl;
}
return 0;
}
The expected output is as follows:
bucket[0] = 1297.035
bucket[1] = 1036.296
bucket[2] = 123.456
Parsing CSV Files With Embedded Double-Quotes
One of the simple extensions to the CSV format is the concept of
double quoted tokens. Such tokens may contain column or row
delimiters. When such a scenario is encountered, all subsequent
delimiters are ignored, and kept as part of the token, until the
corresponding closing double quote is encountered. The StrTk
token_grid supports the parsing of such tokens. This
parsing mode can be easily activated via the token_grid
option set. Below is an example of a token_grid loading a
CSV data set representing various airports from around the world and
their specific codes and locations, in which some of the cells are
double-quoted:
| ICAO | IATA | Airport | City | Country |
| AYGA | GKA | "Goroka Gatue" | Goroka | Papua New Guinea |
| BGCO | GCO | "Nerlerit Inaat Constable Pynt" | "Nerlerit Inaat" | Greenland |
| BZGD | ZGD | Godley | Auckland | New Zealand |
| CYQM | YQM | "Greater Moncton International" | Moncton | Canada |
| EDRK | ZNV | "Koblenz Winningen" | Koblenz | Germany |
| FAHU | AHU | "HMS Bastard Memorial" | Kwazulu-Natal | South Africa |
| FQMP | MZB | "Mocimboa Da Praia" | "Mocimboa Da Praia" | Mozambique |
| KINS | INS | "Indian Springs AF AUX" | Indian Springs | USA |
| UHNN | HNN | Nikolaevsk | "Nikolaevsk Na Amure" | Russia |
| WBKK | BKI | "Kota Kinabalu International" | Kota Kinabalu | Malaysia |
| ZSJD | JDZ | "Jingdezhen Airport" | Jingdezhen | China |
The following is the StrTk code example using token_grid to parse the above CSV data set:
int main()
{
std::string airport_data_file_name = "airport_data.csv";
strtk::token_grid::options options;
options.column_delimiters = "| ,";
options.support_dquotes = true;
strtk::token_grid airport_grid(airport_data_file_name,options);
for (std::size_t r = 0; r < airport_grid.row_count(); ++r)
{
strtk::token_grid::row_type row = airport_grid.row(r);
for (std::size_t c = 0; c < row.size(); ++c)
{
std::cout << "[" << row.get<std::string>(c) << "] ";
}
std::cout << std::endl;
}
return 0;
}
Extending Delimiter Predicates
As previously mentioned the concept of a delimiter based predicate
can lead to some very interesting solutions. A predicate as has been
defined so far, with the exception of the offset predicate, has been
a stateless entity. Adding the ability to maintain a state based on
what the predicate has encountered so far can allow it to behave
differently from the simple single and multiple delimiter predicates.
For this example, lets assume a typical command line parsing problem
which consists of double quotation mark groupings and escapable
special characters, which can be considered being dual use as either
delimiters or data. An example input and output is as follows:
Inputs
Data | abc;"123, mno xyz",i\,jk |
Delimiters | <space>;,. |
Output Tokens
Token0 | abc |
Token1 | 123, mno xyz |
Token2 | i\,jk |
In order to tokenize the above described string, one can create a
composite predicate using a multiple char delimiter predicate and some
simple state rules. The following is an example of such an extended
predicate:
class extended_predicate
{
public:
extended_predicate(const std::string& delimiters)
: escape_(false),
in_bracket_range_(false),
mdp_(delimiters)
{}
inline bool operator()(const unsigned char c) const
{
if (escape_)
{
escape_ = false;
return false;
}
else if ('\\' == c)
{
escape_ = true;
return false;
}
else if ('"' == c)
{
in_bracket_range_ = !in_bracket_range_;
return true;
}
else if (in_bracket_range_)
return false;
else
return mdp_(c);
}
inline void reset()
{
escape_ = false;
in_bracket_range_ = false;
}
private:
mutable bool escape_;
mutable bool in_bracket_range_;
mutable strtk::multiple_char_delimiter_predicate mdp_;
};
Usage of the newly defined extended predicate is as follows:
int main()
{
std::string str = "abc;\"123, mno xyz\",i\\,jk";
strtk::std_string::token_list_type token_list;
strtk::split(extended_predicate(".,; "),
str,
std::back_inserter(token_list),
strtk::split_options::compress_delimiters);
return 0;
}
High Performance Key-Value Parsing
Taking our previous person struct as an example. It
is clear that the tuple format has to be very specific with
regards to the ordering of data. For most situations this is
acceptable as the serializers and deserializers attempt to
function in the simplest manner possible, however sometimes
pieces of information may come in different orderings, or may be
deemed optional and hence not be present in a particular
tuple of data.
An approach is required that provides the means of mapping
specific pieces of data to the corresponding variables (or
members) that will store or make sense of them, in a very
efficient and simple manner. For clarity purposes, the term
"mapping" here means to populate the desired member with the
given data. The association between the two being made with the
key that is paired in-line with the data. Key-Value pairs or some
times known as Attribute-Value pairs are one of the means
by which such an association can be accomplished. The following
is a diagram that demonstrates the mapping of various fields
of a data struct to their corresponding data elements in a tuple
comprised of key-value pairs.

In the diagram above, the key-value pairs are separated (delimited)
by the pipe symbol "|". With regards to the key-value pairs
themselves, the key is traditionally the first element in the pair
and the value is the second element, they are separated in this case
by a single equal sign "=".
Back to our original problem relating to the person struct.
If we were to add keys to each of the pieces of data, we could then not
only parse a tuple of data representing the various fields of the person
struct, but those fields could be any order within the tuple. Furthermore
any of the fields could be deemed optional, hence not necessarily be present
in the tuple. It should be noted that what would denote a correct or successful
parse of tuple may not only depend upon successful parsing of ranges into the
various types, but it may also depend upon the mandatory presence of certain
fields.
Our objective will be to successfully and in the most efficient way
possible parse the following list of tuples that represent instances
of our person struct.
Tuple0 = "ID=0xFA37ED12|NAME=Rumpelstiltskin|AGE=397|HEIGHT=1.31|WEIGHT=58.7"
Tuple1 = "NAME=Rumpelstiltskin|AGE=397|HEIGHT=1.31|WEIGHT=58.7|ID=0xFA37ED12"
Tuple2 = "AGE=397|HEIGHT=1.31|WEIGHT=58.7|ID=0xFA37ED12|NAME=Rumpelstiltskin"
Tuple3 = "HEIGHT=1.31|WEIGHT=58.7|ID=0xFA37ED12|NAME=Rumpelstiltskin|AGE=397"
Tuple4 = "WEIGHT=58.7|ID=0xFA37ED12|NAME=Rumpelstiltskin|AGE=397|HEIGHT=1.31"
Tuple5 = "ID=0xFA37ED12|WEIGHT=58.7|AGE=397|NAME=Rumpelstiltskin|HEIGHT=1.31"
Tuple6 = "WEIGHT=58.7|AGE=397|NAME=Rumpelstiltskin|HEIGHT=1.31|ID=0xFA37ED12"
Tuple7 = "AGE=397|NAME=Rumpelstiltskin|HEIGHT=1.31|ID=0xFA37ED12|WEIGHT=58.7"
Tuple8 = "NAME=Rumpelstiltskin|HEIGHT=1.31|ID=0xFA37ED12|WEIGHT=58.7|AGE=397"
Tuple9 = "HEIGHT=1.31|ID=0xFA37ED12|WEIGHT=58.7|AGE=397|NAME=Rumpelstiltskin"
StrTk provides a means to achieve the above key-value pair parsing task, namely
via the strtk::keyvalue::parser and associated key-mappers. The
following is a table that depicts the various kind of key to value mappers that
are available in the StrTk library:
Key To Value Mappers
| Mapper | Type | Key Lookup Complexity | Maximum Size |
strtk::keyvalue::stringkey_map | std::string | O(log(n)) | Limited to available memory |
strtk::keyvalue::uintkey_map | cardinal value | O(1) | Limited to expected maximum key value |
In the code below we initially begin by defining the delimiters we expect to see
between pairs of key-values and in between the key and value pairs themselves. In
the example below the pair_block_delimiter field denotes the delimiter
we expect between pairs (or blocks) of key-values and the field pair_delimiter
denotes the delimiter we expect between a particular key and value.
Next we define p as an instance of the person struct and
register each of the members of the instance p with a corresponding
key with the keyvalue::parser. After which we then process
each tuple of data, parsing the tuple, populating the instance p, then
printing out the various fields to stdout.
struct person
{
unsigned int id; std::string name; unsigned int age; double height float weight; };
int main()
{
const std::string person_data[] =
{
"ID=0xFA37ED12|NAME=Rumpelstiltskin|AGE=397|HEIGHT=1.31|WEIGHT=58.7",
"NAME=Rumpelstiltskin|AGE=397|HEIGHT=1.31|WEIGHT=58.7|ID=0xFA37ED12",
"AGE=397|HEIGHT=1.31|WEIGHT=58.7|ID=0xFA37ED12|NAME=Rumpelstiltskin",
"HEIGHT=1.31|WEIGHT=58.7|ID=0xFA37ED12|NAME=Rumpelstiltskin|AGE=397",
"WEIGHT=58.7|ID=0xFA37ED12|NAME=Rumpelstiltskin|AGE=397|HEIGHT=1.31",
"ID=0xFA37ED12|WEIGHT=58.7|AGE=397|NAME=Rumpelstiltskin|HEIGHT=1.31",
"WEIGHT=58.7|AGE=397|NAME=Rumpelstiltskin|HEIGHT=1.31|ID=0xFA37ED12",
"AGE=397|NAME=Rumpelstiltskin|HEIGHT=1.31|ID=0xFA37ED12|WEIGHT=58.7",
"NAME=Rumpelstiltskin|HEIGHT=1.31|ID=0xFA37ED12|WEIGHT=58.7|AGE=397",
"HEIGHT=1.31|ID=0xFA37ED12|WEIGHT=58.7|AGE=397|NAME=Rumpelstiltskin",
};
const std::size_t person_data_size = sizeof(person_data) / sizeof(std::string);
typedef unsigned char char_type;
typedef strtk::keyvalue::parser<strtk::keyvalue::stringkey_map> kvp_type;
typedef strtk::keyvalue::options<char_type> opts_type;
opts_type options;
options.pair_block_delimiter = '|';
options.pair_delimiter = '=';
kvp_type kvp(options);
person p;
strtk::hex_to_number_sink<unsigned int> h2ns(p.id);
kvp.register_keyvalue("ID" , h2ns);
kvp.register_keyvalue("NAME" , p. name);
kvp.register_keyvalue("AGE" , p. age);
kvp.register_keyvalue("HEIGHT", p.height);
kvp.register_keyvalue("WEIGHT", p.weight);
for (std::size_t i = 0; i < person_data_size; ++i)
{
if (kvp(person_data[i]))
{
std::cout << i << '\t'
<< "ID: " << p.id << '\t'
<< "Name: " << p.name << '\t'
<< "AGE: " << p.age << '\t'
<< "HEIGHT: " << p.height << '\t'
<< "WEIGHT: " << p.weight << '\n';
}
}
return 0;
}
It should be noted that the underlying associative container for the
strtk::keyvalue::stringkey_map can be explicitly specified
at compile time. By default it is std::map, however it can
be easily changed to std::unordered_map giving it a key
lookup complexity of O(1) or in fact any other container that is
compatible with STL associative map semantics. The following are three
examples of keyvalue_parsers specialized using std::unordered_map,
the first being based on the std::string type, the second
being based on the int type and the third being based on
the double type:
int main()
{
{
typedef unsigned char char_type;
typedef std::unordered_map<std::string,strtk::util::value> key_to_val_map_t;
typedef strtk::keyvalue::key_map<std::string,key_to_val_map_t> stringhashkey_mapper;
typedef strtk::keyvalue::parser<stringhashkey_mapper> kvp_type;
typedef strtk::keyvalue::options<char_type> opts_type;
opts_type options;
options.pair_block_delimiter = '|';
options.pair_delimiter = '=';
kvp_type kvp(options);
}
{
typedef unsigned char char_type;
typedef std::unordered_map<int,strtk::util::value> key_to_val_map_t;
typedef strtk::keyvalue::key_map<int,key_to_val_map_t> inthash_key_mapper;
typedef strtk::keyvalue::parser<inthash_key_mapper> kvp_type;
typedef strtk::keyvalue::options<char_type> opts_type;
opts_type options;
options.pair_block_delimiter = '|';
options.pair_delimiter = '=';
kvp_type kvp(options);
}
{
typedef unsigned char char_type;
typedef std::unordered_map<double,strtk::util::value> key_to_val_map_t;
typedef strtk::keyvalue::key_map<double,key_to_val_map_t> doublehash_key_mapper;
typedef strtk::keyvalue::parser<doublehash_key_mapper> kvp_type;
typedef strtk::keyvalue::options<char_type> opts_type;
opts_type options;
options.pair_block_delimiter = '|';
options.pair_delimiter = '=';
kvp_type kvp(options);
}
return 0;
}
Key-Value Pairs And Multiple Distinct Data Structures
To further the previous example, one might have a situation where
a tuple contains values for different members of different types.
The following example demonstrates parsing of key-value pairs that
map to members of multiple types. The tuple in the example consists
of values some of which are intended for the instance of data1 and
others which are intended for the instance of data2.
struct data1
{
int x0;
std::string x1;
double x2;
float x3;
}
struct data2
{
int y0;
std::string y1;
double y2;
float y3;
}
int main()
{
typedef unsigned char char_type;
typedef strtk::keyvalue::parser<strtk::keyvalue::stringkey_map> kvp_type;
typedef strtk::keyvalue::options<char_type> opts_type;
opts_type options;
options.pair_block_delimiter = '|';
options.pair_delimiter = '=';
kvp_type kvp(options);
data1 d1;
data2 d2;
std::string data = "D1X0=123456789|D1X1=AbCdEfG123456|D1X2=123.456|D1X3=0101.0202f|"
"D2Y0=987654321|D2Y1=tUvWxYz789012|D2Y2=789.012|D2Y3=0707.0909f";
kvp.register_keyvalue("D1X0", d1.x0);
kvp.register_keyvalue("D1X1", d1.x1);
kvp.register_keyvalue("D1X2", d1.x2);
kvp.register_keyvalue("D1X3", d1.x3);
kvp.register_keyvalue("D2Y0", d2.y0);
kvp.register_keyvalue("D2Y1", d2.y1);
kvp.register_keyvalue("D2Y2", d2.y2);
kvp.register_keyvalue("D2Y3", d2.y3);
kvp(data);
std::cout << "D1X0: " << d1.x0 << '\t'
<< "D1X1: " << d1.x1 << '\t'
<< "D1X2: " << d1.x2 << '\t'
<< "D1X3: " << d1.x3 << '\n';
std::cout << "D2Y0: " << d2.y0 << '\t'
<< "D2Y1: " << d2.y1 << '\t'
<< "D2Y2: " << d2.y2 << '\t'
<< "D2Y3: " << d2.y3 << '\n';
return 0;
}
Key-Value Pairs And Lists
The values in the key-value pairs need not always be singular. In some
scenarios, a value could be a list of values. The StrTk key-value parser
supports parsing of such key-value pairs through the previously demonstrated
sequence sink mechanisms.
In the following example we have a complex_data struct that
consists of some PODs but also a number of sequences (vector,deque,list).
The tuple to be parsed consists of simple key-value pairs for the PODs and
more complex looking pairs where the values for the specific sequence of values
are separated by commas ",".
struct complex_data
{
unsigned int v0;
std::vector<int> v1;
double v2;
std::deque<double> v3;
std::string v4;
std::list<std::string> v5;
};
int main()
{
typedef unsigned char char_type;
typedef strtk::keyvalue::parser<strtk::keyvalue::stringkey_map> kvp_type;
typedef strtk::keyvalue::options<char_type> opts_type;
opts_type options;
options.pair_block_delimiter = '|';
options.pair_delimiter = '=';
kvp_type kvp(options);
complex_data cd;
std::string data = "V0=123456789|V1=-3,-2,-1,0,+1,2+3|V2=111.222|V3=1.1,2.2,3.3,4.4,5.5|"
"V4=Some Text|V5=Text1,Text2,Text3,Text4";
strtk::vector_sink<int>::type vec_sink(",");
strtk::deque_sink<double>::type deq_sink(",");
strtk::list_sink<std::string>::type lst_sink(",");
kvp.register_keyvalue("V0", cd.v0);
kvp.register_keyvalue("V1", vec_sink(cd.v1));
kvp.register_keyvalue("V2", cd.v2);
kvp.register_keyvalue("V3", deq_sink(cd.v3));
kvp.register_keyvalue("V4", cd.v4);
kvp.register_keyvalue("V5", lst_sink(cd.v5));
kvp(data);
std::cout << "V0: " << cd.v0 << '\n'
<< "V1: " << strtk::join(" ",cd.v1) << '\n'
<< "V2: " << cd.v2 << '\n'
<< "V3: " << strtk::join(" ",cd.v3) << '\n'
<< "V4: " << cd.v4 << '\n'
<< "V5: " << strtk::join(" ",cd.v5) << '\n';
return 0;
}
Key-Value Pairs With Cardinal Keys
So far all the examples have assumed keys of arbitrary values. In
some situations such as the FIX protocol, the keys are always guaranteed
to be positive integer values. If this is the case then a different kind
of key-mapper can be used that is much more efficient than the general
purpose string key-mapper, providing a key lookup complexity of O(1).
As such StrTk provides the strtk::keyvalue::uintkey_map
type for this purpose. The only difference in terms of setting up the
uintkey_map from the stringkey_map is that it
requires a key_count to be set. This value represents the
largest possible key value that can exist. The following is an example of
how the strtk::keyvalue::uintkey_map key-mapper can be used.
struct data_store
{
char c; unsigned char uc; short s; unsigned short us; int i; std::string str; };
int main()
{
const std::string data = "121=A|122=z|123=-123|124=456|125=-12345678|126=Some simple text";
typedef strtk::keyvalue::parser<strtk::keyvalue::uintkey_map> kvp_type;
strtk::keyvalue::uintkey_map::options options;
options.key_count = 127; options.pair_block_delimiter = '|';
options.pair_delimiter = '=';
kvp_type kvp(options);
data_store ds;
kvp.register_keyvalue(121,ds. c);
kvp.register_keyvalue(122,ds. uc);
kvp.register_keyvalue(123,ds. s);
kvp.register_keyvalue(124,ds. us);
kvp.register_keyvalue(125,ds. i);
kvp.register_keyvalue(126,ds.str);
kvp(data);
return 0;
}
Optional Key-Value Pairs
As previously mentioned there is a use-case where certain fields
may be deemed optional and hence their absence would not constitute a
parsing error. That said, it would also be beneficial to know if a
particular field has been populated or not once the sequence of key
value pairs has been completely parsed and mapped. StrTk provides such
functionality though the use of the strtk::util::attribute
type.
The strtk::util::attribute acts as a proxy for the underlying
type which it is specialized upon. It provides a conversion cast to the
underlying type, and also maintains an 'initialised' state value that can
be used to query the attribute about the underlying type's initialised
status. The following is an example of how one could use the
strtk::util::attribute type in conjunction with the
keyvalue::parser:
struct data_store
{
strtk::util::attribute<int> d1;
strtk::util::attribute<unsigned int> d2;
strtk::util::attribute<double> d3;
strtk::util::attribute<float> d4;
strtk::util::attribute<std::string> d5;
};
int main()
{
std::string data = "INT=-1234|UINT=+5678|DOUBLE=1234.5678|FLOAT=90123.4567f|STRING=Some simple text";
typedef unsigned char char_type;
typedef strtk::keyvalue::parser<strtk::keyvalue::stringkey_map> kvp_type;
typedef strtk::keyvalue::options<char_type> opts_type;
opts_type options;
options.pair_block_delimiter = '|';
options.pair_delimiter = '=';
kvp_type kvp(options);
data_store ds;
d1.initialised() = false;
d2.initialised() = false;
d3.initialised() = false;
d4.initialised() = false;
d5.initialised() = false;
kvp.register_keyvalue("INT" , ds.d1);
kvp.register_keyvalue("UINT" , ds.d2);
kvp.register_keyvalue("DOUBLE", ds.d3);
kvp.register_keyvalue("FLOAT" , ds.d4);
kvp.register_keyvalue("STRING", ds.d5);
if (!kvp(data))
{
std::cout << "Failed to parse key-value data: " << data << std::endl;
return 1;
}
if (!d1.initialised()) { std::cout << "d1 is not present hence has not been initialised." << std::endl; }
if (!d2.initialised()) { std::cout << "d2 is not present hence has not been initialised." << std::endl; }
if (!d3.initialised()) { std::cout << "d3 is not present hence has not been initialised." << std::endl; }
if (!d4.initialised()) { std::cout << "d4 is not present hence has not been initialised." << std::endl; }
if (!d5.initialised()) { std::cout << "d5 is not present hence has not been initialised." << std::endl; }
return 0;
}
Semantic Actions with Key-Value Pairs
There may be times that when key-value pairs are being parsed certain actions need
to be executed or behaviours exhibited in-situ with the parsing process. As previously
mentioned, StrTk provides the type semantic_action that can act as a
proxy for a generic type during the parsing process that also takes a functor or
lambda and executes it at the conversion call. The following example demonstrates
the parsing of an array of tuples comprised of key-value pairs that map to members
of a struct namely data_store. The keys 111, 222 and 333 each represent
a specific value type, they also require a certain behavior to be exhibited. In this
example, for simplicity, as the values of the various keys are being parsed, a simple
message will be printed to the console denoting the nature of the parsing process.
The code is as follows:
struct data_store
{
data_store()
: i1(0),
d1(0.0),
s1("")
{}
int i1; double d1; std::string s1;
};
int main()
{
static const std::string data[] =
{
"111=-12345|222=1987.654321|333=An interesting string0", "222=2987.654321|333=An interesting string1|111=+12345", "333=An interesting string2|111=-12345|222=3987.654321", "222=4987.654321|111=+12345|333=An interesting string3", "333=An interesting string4|222=5987.654321|111=-12345", "111=+12345|333=An interesting string5|222=6987.654321", };
static const std::size_t data_size = sizeof(data) / sizeof(std::string);
typedef strtk::keyvalue::parser<strtk::keyvalue::uintkey_map> kvp_type;
strtk::keyvalue::uintkey_map::options options;
options.key_count = 334; options.pair_block_delimiter = '|';
options.pair_delimiter = '=';
kvp_type kvp(options);
data_store ds;
typedef strtk::range::ustring::const_iterator itr_type;
using strtk::util::semantic_action;
auto lambda_key111 = [&ds](itr_type begin,itr_type end) -> bool
{
if (!strtk::string_to_type_converter(begin,end,ds.i1))
{
std::cout << "Failed to parse value for key=111\n";
return false;
}
std::cout << "key[111] value=" << ds.i1 << std::endl;
return true;
};
auto lambda_key222 = [&ds](itr_type begin,itr_type end) -> bool
{
if (!strtk::string_to_type_converter(begin,end,ds.d1))
{
std::cout << "Failed to parse value for key=222\n";
return false;
}
std::cout << "key[222] value=" << ds.d1 << std::endl;
return true;
};
auto lambda_key333 = [&ds](itr_type begin,itr_type end) -> bool
{
if (!strtk::string_to_type_converter(begin,end,ds.s1))
{
std::cout << "Failed to parse value for key=333\n";
return false;
}
std::cout << "key[333] value=" << ds.s1 << std::endl;
return true;
};
kvp.register_keyvalue(111,semantic_action(lambda_key111).ref());
kvp.register_keyvalue(222,semantic_action(lambda_key222).ref());
kvp.register_keyvalue(333,semantic_action(lambda_key333).ref());
for (std::size_t i = 0; i < data_size; ++i)
{
if (!kvp(data[i]))
std::cout << "Error while parsing tuple: " << i << std::endl;
std::cout << std::endl;
}
return 0;
}
Note: It should be noted that semantic actions during the parsing process have
a multitude of uses, some of which are: validation of parsed values, that is making sure
that they're in a specified range or within a predefined set of values, complex parsed
value manipulations, invoking of external state-machines to transition to new states
based on the parsed value or even simply the presence of a key etc.
Performance Comparisons
The following are tables of results generated by running the
strtk_tokenizer_cmp test. Currently it covers simple
comparisons between Boost String Algorithms, Boost lexical_cast, The
Standard Library, Spirit (Karma Qi) and StrTk in the following areas:
- Tokenization
- Splitting
- Integer To String
- String To Integer
- String To Double
Scenario 0 - MSVC 2010 (64-bit, O2, Ot, GL and PGO)
| Source | Test | Size | Time(sec) | Rate | % from Baseline |
Boost | Tokenizer | 24000000 | 8.5857sec | 2795359.4074tks/sec | 100.0%, 100.0% |
StrTk | Tokenizer | 24000000 | 3.5019sec | 6853393.1186tks/sec | 40.7%, 245.1% |
Boost | Split | 9600000 | 5.5414sec | 1732414.5137tks/sec | 100.0%, 100.0% |
StrTk | Split | 9600000 | 0.8218sec | 11681814.9167tks/sec | 14.8%, 674.3% |
sprintf | Integer To String | 80000000 | 35.8128sec | 2233840.0564nums/sec | 100.0%, 100.0% |
Boost | Integer To String | 80000000 | 19.3994sec | 4123832.0477nums/sec | 54.1%, 184.6% |
Karma | Integer To String | 80000000 | 6.2528sec | 12794349.6524nums/sec | 17.4%, 572.7% |
StrTk | Integer To String | 80000000 | 1.5664sec | 51071439.9822nums/sec | 4.3%, 2286.2% |
atoi | String To Integer | 88500000 | 5.1802sec | 17084370.4936nums/sec | 100.0%, 100.0% |
Boost | String To Integer | 88500000 | 119.6261sec | 739805.3712nums/sec | 2309.2%, 4.3% |
Qi | String To Integer | 88500000 | 2.1951sec | 40317238.6629nums/sec | 42.3%, 235.9% |
StrTk | String To Integer | 88500000 | 1.8181sec | 48677773.5466nums/sec | 35.0%, 284.9% |
atof | String To Double | 30650000 | 15.2306sec | 2012396.7122nums/sec | 100.0%, 100.0% |
Boost | String To Double | 30650000 | 52.9244sec | 579127.8866nums/sec | 347.4%, 28.7% |
Qi | String To Double | 30650000 | 2.8665sec | 10692313.5853nums/sec | 18.8%, 531.3% |
StrTk | String To Double | 30650000 | 1.6069sec | 19073975.7679nums/sec | 10.5%, 947.8% |

Scenario 1 - MSVC 2010 (O2, Ot, GL and PGO)
| Source | Test | Size | Time(sec) | Rate | % from Baseline |
Boost | Tokenizer | 24000000 | 9.4715sec | 2533910.4769tks/sec | 100.0%, 100.0% |
StrTk | Tokenizer | 24000000 | 2.8889sec | 8307786.9292tks/sec | 30.5%, 327.8% |
Boost | Split | 9600000 | 7.2291sec | 1327965.9706tks/sec | 100.0%, 100.0% |
StrTk | Split | 9600000 | 1.1301sec | 8494610.9664tks/sec | 15.6%, 639.6% |
sprintf | Integer To String | 80000000 | 38.2576sec | 2091088.8038nums/sec | 100.0%, 100.0% |
Boost | Integer To String | 80000000 | 28.9931sec | 2759277.4769nums/sec | 75.7%, 131.9% |
Karma | Integer To String | 80000000 | 4.9173sec | 16269254.0190nums/sec | 12.8%, 778.0% |
StrTk | Integer To String | 80000000 | 1.8270sec | 43786838.0279nums/sec | 4.7%, 2093.9% |
atoi | String To Integer | 88500000 | 6.0076sec | 14731435.8942nums/sec | 100.0%, 100.0% |
Boost | String To Integer | 88500000 | 185.4955sec | 477100.6474nums/sec | 3087.0%, 3.2% |
Qi | String To Integer | 88500000 | 2.5060sec | 35314785.8370nums/sec | 41.7%, 239.7% |
StrTk | String To Integer | 88500000 | 2.2095sec | 40054213.0736nums/sec | 36.7%, 271.8% |
atof | String To Double | 30650000 | 17.6435sec | 1737179.9302nums/sec | 100.0%, 100.0% |
Boost | String To Double | 30650000 | 78.6528sec | 389687.3997nums/sec | 445.7%, 22.4% |
Qi | String To Double | 30650000 | 3.8034sec | 8058494.1994nums/sec | 21.5%, 463.8% |
StrTk | String To Double | 30650000 | 2.0450sec | 14987780.2310nums/sec | 11.5%, 862.7% |

Scenario 2 - MSVC 2008 SP1 (O2, Ot, GL and PGO)
| Source | Test | Size | Time(sec) | Rate | % from Baseline |
Boost | Tokenizer | 24000000 | 9.6533sec | 2486184.8282tks/sec | 100.0%, 100.0% |
StrTk | Tokenizer | 24000000 | 3.4748sec | 6906943.9529tks/sec | 35.9%, 277.8% |
Boost | Split | 9600000 | 10.2600sec | 935674.7490tks/sec | 100.0%, 100.0% |
StrTk | Split | 9600000 | 1.3793sec | 6959830.0652tks/sec | 13.4%, 743.8% |
sprintf | Integer To String | 80000000 | 24.6427sec | 3246397.8287nums/sec | 100.0%, 100.0% |
Boost | Integer To String | 80000000 | 27.5865sec | 2899968.5753nums/sec | 111.9%, 89.3% |
Karma | Integer To String | 80000000 | 5.4864sec | 14581630.6963nums/sec | 22.2%, 449.1% |
StrTk | Integer To String | 80000000 | 2.4224sec | 33025441.1256nums/sec | 9.8%, 1017.2% |
atoi | String To Integer | 88500000 | 5.9297sec | 14924814.8683nums/sec | 100.0%, 100.0% |
Boost | String To Integer | 88500000 | 186.1372sec | 475455.6660nums/sec | 3139.0%, 3.1% |
Qi | String To Integer | 88500000 | 2.0874sec | 42397446.1804nums/sec | 35.2%, 284.0% |
StrTk | String To Integer | 88500000 | 2.0485sec | 43202160.1371nums/sec | 34.5%, 289.4% |
atof | String To Double | 30650000 | 18.0458sec | 1698455.0767nums/sec | 100.0%, 100.0% |
Boost | String To Double | 30650000 | 77.4527sec | 395725.4111nums/sec | 429.2%, 23.2% |
Qi | String To Double | 30650000 | 3.9631sec | 7733881.1294nums/sec | 21.9%, 455.3% |
StrTk | String To Double | 30650000 | 2.0723sec | 14790236.0804nums/sec | 11.4%, 870.8% |

Scenario 3 - Intel C++ v11.1.060 IA-32 (O2, Ot, Qipo, QxHost and PGO)
| Source | Test | Size | Time(sec) | Rate | % from Baseline |
Boost | Tokenizer | 24000000 | 10.0096sec | 2397697.7836tks/sec | 100.0%, 100.0% |
StrTk | Tokenizer | 24000000 | 3.1837sec | 7538416.8541tks/sec | 31.8%, 314.4% |
Boost | Split | 9600000 | 9.5450sec | 1005760.0310tks/sec | 100.0%, 100.0% |
StrTk | Split | 9600000 | 1.4292sec | 6716893.1359tks/sec | 14.9%, 667.8% |
sprintf | Integer To String | 80000000 | 23.8979sec | 3347577.5824nums/sec | 100.0%, 100.0% |
Boost | Integer To String | 80000000 | 27.5618sec | 2902565.2045nums/sec | 115.3%, 86.7% |
Karma | Integer To String | 80000000 | 4.6600sec | 17167208.7654nums/sec | 19.4%, 512.8% |
StrTk | Integer To String | 80000000 | 2.8450sec | 28119857.2736nums/sec | 11.9%, 840.0% |
atoi | String To Integer | 88500000 | 5.9386sec | 14902610.8922nums/sec | 100.0%, 100.0% |
Boost | String To Integer | 88500000 | 180.5856sec | 490072.4001nums/sec | 3040.8%, 3.2% |
Qi | String To Integer | 88500000 | 2.5273sec | 35017073.8639nums/sec | 42.5%, 234.9% |
StrTk | String To Integer | 88500000 | 1.8718sec | 47281492.1287nums/sec | 31.5%, 317.2% |
atof | String To Double | 30650000 | 18.4357sec | 1662538.0810nums/sec | 100.0%, 100.0% |
Boost | String To Double | 30650000 | 78.1543sec | 392172.9598nums/sec | 423.9%, 23.5% |
Qi | String To Double | 30650000 | 2.8321sec | 10822353.0510nums/sec | 15.3%, 650.9% |
StrTk | String To Double | 30650000 | 2.2930sec | 13366541.5515nums/sec | 12.4%, 803.9% |

Scenario 4 - GCC 4.5 (O3, PGO)
| Source | Test | Size | Time(sec) | Rate | % from Baseline |
Boost | Tokenizer | 24000000 | 9.2510sec | 2594305.4347tks/sec | 100.0%, 100.0% |
StrTk | Tokenizer | 24000000 | 3.9717sec | 6042688.5734tks/sec | 42.9%, 232.9% |
Boost | Split | 9600000 | 5.0640sec | 1895728.2331tks/sec | 100.0%, 100.0% |
StrTk | Split | 9600000 | 1.5411sec | 6229231.8384tks/sec | 30.4%, 328.5% |
sprintf | Integer To String | 80000000 | 14.7807sec | 5412477.0993nums/sec | 100.0%, 100.0% |
Boost | Integer To String | 80000000 | 19.1131sec | 4185620.7707nums/sec | 129.3%, 77.3% |
Karma | Integer To String | 80000000 | 6.4455sec | 12411808.2841nums/sec | 43.6%, 229.3% |
StrTk | Integer To String | 80000000 | 4.5174sec | 17709364.5349nums/sec | 30.5%, 327.1% |
atoi | String To Integer | 88500000 | 5.2139sec | 16973721.6103nums/sec | 100.0%, 100.0% |
Boost | String To Integer | 88500000 | 50.5326sec | 1751344.8498nums/sec | 969.1%, 10.3% |
Qi | String To Integer | 88500000 | 1.9694sec | 44937612.8835nums/sec | 37.7%, 264.7% |
StrTk | String To Integer | 88500000 | 1.9008sec | 46558706.5833nums/sec | 36.4%, 274.2% |
atof | String To Double | 30650000 | 6.6975sec | 4576328.3036nums/sec | 100.0%, 100.0% |
Boost | String To Double | 30650000 | 29.6375sec | 1034162.2422nums/sec | 442.5%, 22.5% |
Qi | String To Double | 30650000 | 2.9852sec | 10267435.7138nums/sec | 44.5%, 224.3% |
StrTk | String To Double | 30650000 | 1.5961sec | 19202937.1409nums/sec | 23.8%, 419.6% |

Scenario 5 - GCC 4.5 (O3, PGO) Intel Atom N450
| Source | Test | Size | Time(sec) | Rate | % from Baseline |
Boost | Tokenizer | 24000000 | 29.1370sec | 823695.4389tks/sec | 100.0%, 100.0% |
StrTk | Tokenizer | 24000000 | 12.3607sec | 1941644.0499tks/sec | 42.4%, 235.7% |
Boost | Split | 9600000 | 16.5261sec | 580899.9726tks/sec | 100.0%, 100.0% |
StrTk | Split | 9600000 | 4.9102sec | 1955110.2611tks/sec | 29.7%, 336.5% |
sprintf | Integer To String | 80000000 | 50.3456sec | 1589015.6118nums/sec | 100.0%, 100.0% |
Boost | Integer To String | 80000000 | 91.1475sec | 877698.1401nums/sec | 181.0%, 55.2% |
Karma | Integer To String | 80000000 | 21.8904sec | 3654568.8712nums/sec | 43.4%, 229.9% |
StrTk | Integer To String | 80000000 | 12.1877sec | 6564009.9274nums/sec | 24.2%, 413.0% |
atoi | String To Integer | 88500000 | 17.6615sec | 5010896.5768nums/sec | 100.0%, 100.0% |
Boost | String To Integer | 88500000 | 191.9446sec | 461070.5357nums/sec | 1086.7%, 9.2% |
Qi | String To Integer | 88500000 | 6.2808sec | 14090561.7119nums/sec | 35.5%, 281.1% |
StrTk | String To Integer | 88500000 | 6.1552sec | 14378086.8208nums/sec | 34.8%, 286.9% |
atof | String To Double | 30650000 | 21.4865sec | 1426474.1027nums/sec | 100.0%, 100.0% |
Boost | String To Double | 30650000 | 139.8166sec | 219215.7409nums/sec | 650.7%, 15.3% |
Qi | String To Double | 30650000 | 11.3916sec | 2690567.9223nums/sec | 53.0%, 188.6% |
StrTk | String To Double | 30650000 | 6.4396sec | 4759608.7027nums/sec | 29.9%, 333.6% |

Scenario 6 - GCC 4.5 (O3, PGO) Intel Xeon E5540
| Source | Test | Size | Time(sec) | Rate | % from Baseline |
Boost | Tokenizer | 24000000 | 7.5657sec | 3172216.8787tks/sec | 100.0%, 100.0% |
StrTk | Tokenizer | 24000000 | 2.7379sec | 8765832.8290tks/sec | 36.1%, 276.3% |
Boost | Split | 9600000 | 3.0706sec | 3126386.1126tks/sec | 100.0%, 100.0% |
StrTk | Split | 9600000 | 1.1279sec | 8511136.2899tks/sec | 36.7%, 272.2% |
sprintf | Integer To String | 80000000 | 10.9012sec | 7338642.9638nums/sec | 100.0%, 100.0% |
Boost | Integer To String | 80000000 | 12.3317sec | 6487328.7872nums/sec | 113.1%, 88.3% |
Karma | Integer To String | 80000000 | 3.7202sec | 21504260.6660nums/sec | 34.1%, 293.0% |
StrTk | Integer To String | 80000000 | 2.5183sec | 31768042.4612nums/sec | 23.1%, 432.8% |
atoi | String To Integer | 88500000 | 4.0087sec | 22077037.6357nums/sec | 100.0%, 100.0% |
Boost | String To Integer | 88500000 | 30.3659sec | 2914454.4393nums/sec | 757.4%, 13.2% |
Qi | String To Integer | 88500000 | 1.7976sec | 49231871.5454nums/sec | 43.3%, 223.0% |
StrTk | String To Integer | 88500000 | 1.7384sec | 50908881.7303nums/sec | 43.3%, 230.5% |
atof | String To Double | 30650000 | 5.2118sec | 5880843.9328nums/sec | 100.0%, 100.0% |
Boost | String To Double | 30650000 | 21.5546sec | 1421966.9538nums/sec | 413.5%, 24.1% |
Qi | String To Double | 30650000 | 3.2149sec | 9533840.3118nums/sec | 61.6%, 162.1% |
StrTk | String To Double | 30650000 | 1.3929sec | 22003661.2944nums/sec | 26.7%, 374.1% |

Scenario 7 - GCC 4.5 (O3, PGO) Intel Xeon X5650
| Source | Test | Size | Time(sec) | Rate | % from Baseline |
Boost | Tokenizer | 24000000 | 4.1944sec | 5721901.2924tks/sec | 100.00%, 100.00% |
StrTk | Tokenizer | 24000000 | 2.5087sec | 9566860.3956tks/sec | 59.81%, 167.19% |
Boost | Split | 9600000 | 2.9104sec | 3298520.2014tks/sec | 100.00%, 100.00% |
StrTk | Split | 9600000 | 1.1105sec | 8644949.2334tks/sec | 38.15%, 262.08% |
sprintf | Integer To String | 80000000 | 9.7148sec | 8234840.3537nums/sec | 100.00%, 100.00% |
Boost | Integer To String | 80000000 | 11.5726sec | 6912860.7120nums/sec | 119.12%, 83.94% |
Karma | Integer To String | 80000000 | 4.0832sec | 19592620.4395nums/sec | 42.03%, 237.92% |
StrTk | Integer To String | 80000000 | 2.2204sec | 36029641.5861nums/sec | 22.85%, 437.52% |
atoi | String To Integer | 88500000 | 3.1028sec | 28522836.1561nums/sec | 100.00%, 100.00% |
Boost | String To Integer | 88500000 | 6.1270sec | 14444145.2249nums/sec | 197.46%, 50.64% |
Qi | String To Integer | 88500000 | 1.5313sec | 57794031.2153nums/sec | 49.35%, 202.62% |
StrTk | String To Integer | 88500000 | 1.4409sec | 61417814.6362nums/sec | 46.43%, 215.32% |
atof | String To Double | 42880000 | 6.1342sec | 6990292.6549nums/sec | 100.00%, 100.00% |
Boost | String To Double | 42880000 | 28.9461sec | 1481374.9232nums/sec | 772.37%, 21.19% |
Qi | String To Double | 42880000 | 3.6549sec | 11732137.3557nums/sec | 3644.68%, 167.83% |
StrTk | String To Double | 42880000 | 1.3860sec | 30937683.0792nums/sec | 460.19%, 442.58% |

Note 1: The tests are compiled with specific optimisation
flags to produce the best possible results for the respective
compilers and architectures. Furthermore the tests are run
natively (no virtualizations were used) on an almost completely idle
machine so as to reduce interference from background processes. The
Boost version used was 1.48. Furthermore the standard libraries
including libc were rebuilt for the linux system based tests, using
architecture specific flags and optimizations. The following is a
table mapping the scenarios to their respective architectures:
| Scenario | Architecture |
| 0 | ThinkPad W510 (64-Bit Intel Quad Core Extreme i7-920XM 2.0GHz, 16GB RAM, Windows 7) |
| 1-3 | ThinkPad x61 (32-Bit Intel Core 2 Duo 2.4GHz, 2GB RAM, Windows 7) |
| 4 | ThinkPad x61 (32-Bit Intel Core 2 Duo 2.4GHz, 2GB RAM, Ubuntu 10.10) |
| 5 | Acer Aspire One (32-Bit Intel Atom N450 1.6Ghz, 1GB RAM, Ubuntu 10.10) |
| 6 | HP Proliant DL380G6 (64-Bit Intel Xeon E5540 2.5GHz, 8GB RAM, Ubuntu 10.10) |
| 7 | Custom box (64-Bit Intel Xeon X5650 2.66GHz, 32GB RAM, Ubuntu 10.10) |
Note 2:
The percentages in the final column represent the percentage of the
current row versus the baseline in total running time and rate
respectively. For the first percentage the lower the value the better
and for the second percentage the higher the value the better. The
baseline used for a specific combination of tests is defined in the
following table:
| Test Combination | Baseline |
| Boost, StrTk | Boost |
| Boost, StdLib/STL, Spirit, StrTk | StdLib/STL |
| StdLib/STL, Spirit, StrTk | StdLib/STL |
Note 3:
The test sizes are set such that no single run will result in a
running time less than one second. This is done so as to ensure
that runs-per-second results are not deemed to have been projected.
In the future these sizes may need to be revisited once 3.5+GHz CPU
speeds become more commonplace. Furthermore the charts represent the
rate of operation over a one second interval - In short, the larger
the rate the better.
Note 4:
The binaries used for the above performance tests can be downloaded from
here
Note 5:
It would be great to have comparisons for other architectures. If you can
provide access to shell accounts with GCC 4.5+ or Clang/LLVM 2.0+ for the
following architectures: UltraSPARC T2 Plus, SPARC64 VII, POWER6/7, please
feel free to get in contact.
StrTk Library Dependency
StrTk makes use of the Boost library for its
boost::lexical_cast routine for types other than
PODs, and its TR1 compliant Random and Regex libraries. These
dependencies are not compulsory and can be easily removed simply
by defining the preprocessor: strtk_no_tr1_or_boost. That
said Boost is an integral part of modern C++ programming, and
having it around is as beneficial as having access to the STL,
hence it is recommended that it be installed. For Visual Studio
users, BoostPro provides a free and easy to use
installer for the latest Boost libraries that can be obtained
from Here. For Linux users, mainstream
distributions such as Ubuntu and Red-Hat(Fedora) provide easy
installation of the Boost libraries via their respective package
management systems. For more information please consult the
readme.txt found in the StrTk distribution.
Compiler Support
The following is a listing of the various compilers that StrTk can be
built with error and warning free.
- GCC - verions 4.1+
- Intel C++ Compiler - versions 8+
- Clang/LLVM - versions 1.0+
- MSVC - versions 7.1+
- Comeau C/C++ - versions 4.3+
Note: Versions of compilers prior to the ones denoted above
"should" compile, however they may require a very lineant
warning/error level be set during compilation.
Conclusion
StrTk was designed with performance and efficiency as its sole primary
principles, and as such some of the available interfaces may not be as
user-friendly as they should - however that said, the gains
made in other areas hopefully will compensate for any perceived
difficulties. Like most things there is a trade-off between
performance and usability with the above mentioned tokenizers and
parsing methods. The original aim was to provide an interface similar
to that of the Boost Tokenizer and Split routines, but to also avail
the developer simplifications that will hopefully provide them more
flexibility and efficiency in the long run. That said, tokenizing a
string isn't the most fascinating problem one could tackle but it does
have its interesting points when one has a few TB of data to process,
doing it properly could mean the difference between finishing a simple
data processing job today or next month.