Introduction
This article presents the tokenizing and splitting functionality of a
simple C++ library called the String Toolkit. Tokenization, in the
context of string processing, is the method by which a sequence of
elements is broken up into sub-sequences called tokens. The elements
in the original sequence that determine such breaks in the sequence
are known as delimiters.
There are two types of delimiters: normal (or thin) delimiters, which
are one element in length, and thick delimiters, which are two or more
elements in length. Even though tokenization is primarily used in
conjunction with strings, any sequence of types that can be iterated
over in a linear fashion can be tokenized; examples include a list of
integers, a vector of person classes or a map of strings.
Another Tokenizer?
To date there have been many attempts to define a
"standard" tokenizer implementation in C++. Of them all, the
most likely candidate is the implementation in the Boost
library. Regardless, any proposed implementation should to some extent
consider one or more of the following areas: overall usage
patterns, construction, generality (or is it genericity these
days?) of design, and performance/efficiency.
1. Overall Usage Patterns
This requirement is concerned with how easy it is to instantiate the
tokenizer and integrate it into existing processing patterns, which
more often than not requires integration with C++ STL algorithms and
containers. A tokenizer by definition would be classed as a producer,
so the question becomes: how easy is it for others to consume its
output? Another issue is consistency of the definition of a token in
the situation where one has consecutive delimiters but is not
compressing them: can or should there be such a thing as an
empty token? What do preceding and trailing delimiters mean? And
when should they be included as part of the token?
2. Constructions
In the context of string tokenizing, the majority of implementations
return each token as a new instance of a string. This process requires
a string to be created on the heap, populated by the sub-string in
question from the original sequence, then returned back (some of
this may be alleviated by Return Value Optimization (RVO)).
In the case of iterators this is essentially another copy to the
caller. Furthermore, two kinds of input can make this situation
worse: a large sequence made up of many very short tokens, or a
large sequence made up of many very large tokens. The solution is
not to return the token as a string but rather as a range, and to
allow the caller to decide how they wish to inspect and further
manipulate the token.

This minor change in interface design provides a great deal of
flexibility and performance gain.
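To make the idea concrete, the following is a minimal sketch of the range-based approach (illustrative only, not StrTk's actual definitions): a token is simply a pair of iterators into the original sequence, so no allocation or copying occurs until the caller explicitly asks for a string.

#include <iostream>
#include <iterator>
#include <string>
#include <utility>

// A token is a range over the original sequence - two iterators, no copy.
typedef std::pair<std::string::const_iterator,
                  std::string::const_iterator> token_range;

int main()
{
   std::string data = "abc|123";
   // Suppose a tokenizer produced this range for the first token:
   token_range token(data.begin(), data.begin() + 3);
   // The token can be inspected in place...
   std::cout << "length: " << std::distance(token.first, token.second) << '\n';
   // ...and materialized as a string only if actually required.
   std::string copy(token.first, token.second);
   std::cout << "copy: " << copy << '\n';
   return 0;
}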
3. Generality (Genericity) of Design
Most tokenizer implementations concern themselves only with strings,
which for the most part is fine, because most things that need
tokenizing are strings. However, there will be times when one has a
sequence of non-string types that needs to be tokenized, hence a
tokenizer should be designed in such a way as to enable such uses.
Moreover, it becomes clear that the most basic tokenizing
algorithms are invariant to the type of the delimiter.
4. Performance and Efficiency
String tokenizing scenarios range from low frequency inputs, such as
user input or the parsing of simple configuration files, to more
complex situations such as the tokenizing of HTML pages that a web
crawler/indexer might do, or the parsing of large market-data streams
in FIX format. Performance is generally important and can usually be
helped along by good usage patterns that encourage reuse of instances,
minimal preprocessing, and support for user supplied predicates for
the nastier areas of the process. It should be noted that everything
in the following article can be done purely using the STL/IOStream
libraries. That said, C++'s ability to let one skin the proverbial
cat in numerous ways gives rise to novel solutions that are for the
most part not of any practical use other than to showcase one's
abilities in using the STL/IOStreams.
Getting started
The String Toolkit Library (StrTk) provides two common tokenization
concepts: a split function and a token iterator. Both of these concepts
require the user to provide a delimiter predicate and an iterator
range over which the process will be carried out.
The tokenization process can be further parameterized by switching
between "compressed delimiters" and "no compressed delimiters" mode.
In compressed mode, consecutive delimiters are compressed
down to one and treated as such. Below are two tables depicting the
expected tokens from various types of input. The tables represent the
no compressed and compressed tokenization processes respectively. The
delimiter in this instance is the pipe symbol | and <> denotes
an empty token.
No Compressed Delimiters
Input | Token List |
a | a |
a|b | a,b |
a||b | a,<>,b |
|a | <>,a |
a| | a,<> |
|a||b | <>,a,<>,b |
||a||b|| | <>,<>,a,<>,b,<>,<> |
| | <>,<> |
|| | <>,<>,<> |
||| | <>,<>,<>,<> |
Compressed Delimiters
Input | Token List |
a | a |
a||b | a,b |
|a||b | <>,a,b |
||a||b|| | <>,a,b,<> |
| | <>,<> |
|| | <>,<> |
||| | <>,<> |
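The difference between the two modes can be reproduced directly with the strtk::split function introduced later in this article. The following is a short sketch covering the "a||b" row of each table:

std::string str = "a||b";
strtk::single_delimiter_predicate<std::string::value_type> predicate('|');

// Default mode: consecutive delimiters produce an empty token.
strtk::std_string::token_list_type default_tokens;
strtk::split(predicate, str, std::back_inserter(default_tokens));
// default_tokens: a, <>, b

// Compressed mode: consecutive delimiters are treated as one.
strtk::std_string::token_list_type compressed_tokens;
strtk::split(predicate,
             str,
             std::back_inserter(compressed_tokens),
             strtk::split_options::compress_delimiters);
// compressed_tokens: a, b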
Delimiters
Two forms of delimiters are supported: the single delimiter predicate
and the multiple delimiter predicate, otherwise known as SDP and MDP
respectively. Essentially, an SDP is one where only a single value can
break or fragment the sequence, whereas with an MDP there is more than
one unique value that can break the sequence. It is possible to
represent an SDP using an MDP; however, from a performance point of
view having separate predicates is far more efficient. Additionally,
for strings based on char or unsigned char (8-bit versions) there is
an MDP that has a look-up complexity of O(1), making it greatly more
efficient than the basic MDP.
Single Delimiter Predicate
Instantiation requires specialization of type and construction
requires an instance of the type.
strtk::single_delimiter_predicate<typename T>(const T& t)
strtk::single_delimiter_predicate<std::string::value_type> predicate('|');
Multiple Delimiter Predicate
Instantiation requires specialization of type and construction
requires a sequence of potential delimiters through a range
described by iterators.
strtk::multiple_delimiter_predicate<typename T>(Iterator begin, Iterator end);
std::string str_delimiters = " ,.;:<>'[]{}()_?/'`~!@#$%^&*|-_\"=+";
strtk::multiple_delimiter_predicate<char> mdp1(str_delimiters.begin(),str_delimiters.end());
unsigned int uint_delimiters[5] = {1,10,101,1010,10101};
strtk::multiple_delimiter_predicate<unsigned int> mdp2(uint_delimiters,uint_delimiters + 5);
As previously mentioned, tokenization of data need not be limited to
strings comprised of chars, but can also be extended to other PODs or
complex types. In the above example, a predicate for tokenizing a
sequence of unsigned ints is defined.
Multiple Char Delimiter Predicate
Instantiation requires an iterator range based on either unsigned
char or char. This delimiter is more efficient than the simple MDP as
it has a predicate evaluation complexity of O(1), due to the use of a
lookup table, as opposed to O(n) where n is the number of delimiters.
The performance gain grows as more delimiters are used.
strtk::multiple_char_delimiter_predicate(const std::string&)
strtk::multiple_char_delimiter_predicate(const std::string::const_iterator begin,const std::string::const_iterator end)
strtk::multiple_char_delimiter_predicate predicate(" .;?");
The delimiter concept can be extended to the point where the
predicate itself can act as a state machine transitioning from
state to state based on conditions and rules related to the current
symbol being processed. This simple extension can lead to some very
interesting parsing capabilities.
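As a hedged illustration of that idea (this predicate is not part of StrTk), the following sketch defines a stateful predicate for a CSV-like record, in which a comma acts as a delimiter only while outside of double quotes:

struct quoted_csv_predicate
{
   quoted_csv_predicate() : in_quotes_(false) {}

   inline bool operator()(const char c) const
   {
      if ('"' == c)
      {
         in_quotes_ = !in_quotes_; // toggle state on every quote
         return false;
      }
      // A comma only delimits when we are outside of quotes.
      return (',' == c) && !in_quotes_;
   }

   mutable bool in_quotes_;
};

Since strtk::split accepts arbitrary callables (as the lambda examples below demonstrate), such a predicate could be passed wherever a delimiter predicate is expected, e.g. strtk::split(quoted_csv_predicate(), data, std::back_inserter(token_list)). Note that the quote characters themselves remain part of the token ranges.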
Split
This is a function that performs tokenization over an entire sequence
in one go. strtk::split takes a sequence, either through a range of
iterators or, in the case of a string, through a reference to its
instance, a delimiter predicate, and an output iterator (otherwise
known as a type sink). It populates the output iterator with the
tokens it extracts. The tokens in this case are std::pairs of
iterators into the sequence.

Split can be used in a "simple - no frills" manner by simply
passing the necessary parameters:
std::string str = "abc|123|xyz|789";
strtk::std_string::token_list_type token_list;
strtk::split(" |.;?",str,std::back_inserter(token_list));
strtk::split can also be used in a more explicit manner, whereby the
exact type of delimiter predicate is specified by the user:
{
std::string str = "abc|123|xyz|789";
strtk::std_string::token_list_type token_list;
strtk::single_delimiter_predicate<std::string::value_type> predicate('|');
strtk::split(predicate,str,std::back_inserter(token_list));
}
{
std::string data = "abc|123|xyz|789";
std::deque<std::string> token_list;
strtk::split([](const char c)
{
return '|' == c;
},
data,
strtk::range_to_type_back_inserter(token_list));
}
strtk::split provides an additional usage option that allows the user
to specify whether they would like to compress the delimiters and
whether or not they would like to include the delimiters as part of
the token range. This enum parameter is called strtk::split_options
and has the following values:
Split Option | Definition |
strtk::split_options::default_mode | Default options |
strtk::split_options::compress_delimiters | Consecutive delimiters are treated as one |
strtk::split_options::include_1st_delimiter | The first delimiter is included in the resulting token range |
strtk::split_options::include_delimiters | All delimiters are included in the resulting token range |
The simple example below demonstrates a split over a string given a
predicate, where the provided split options indicate that consecutive
delimiters are to be treated as one and that all delimiters
encountered after each token are to be included in the token range,
up until the next token begins.
strtk::split(predicate,
str,
std::back_inserter(token_list),
strtk::split_options::compress_delimiters | strtk::split_options::include_delimiters);
Another way of using split may be to use the
strtk::multiple_char_delimiter_predicate
as follows:
std::string str = "abc?123;xyz.789";
strtk::std_string::token_list_type token_list;
strtk::multiple_char_delimiter_predicate predicate(" .;?");
strtk::split(predicate,str,std::back_inserter(token_list));
The contents of the token_list
can be printed out as follows:
strtk::std_string::token_list_type::iterator itr = token_list.begin();
while (token_list.end() != itr)
{
std::cout << (*itr) << '\t';
++itr;
}
Split N-Tokens
A natural extension of strtk::split is strtk::split_n.
This function provides the ability to tokenize a sequence until a
specified number of tokens have been encountered or until there are
no more tokens left. The return value of strtk::split_n is the number
of tokens encountered.
std::string data = "token1?token2,token3;token4,token5";
strtk::std_string::token_list_type token_list;
const std::size_t token_count = 4;
const std::string delimiters (" ,.;?");
strtk::split_n(delimiters,
data,
token_count,
std::back_inserter(token_list));
strtk::std_string::token_list_type::iterator itr = token_list.begin();
while (token_list.end() != itr)
{
std::cout << "[" << (*itr) << "]\t";
++itr;
}
std::cout << std::endl;
Offset Splitter
Another simple variant is the strtk::offset_splitter.
This form of split takes a series of offsets and an iterator range or
string, and determines the tokens based on the offsets. This function
can be set to perform a single pass of the offsets, or to rotate them
until the range has been completely traversed. The example below
demonstrates how a string representing a date and time can be
tokenized into its constituent parts (year, month, day, hour, minute,
second, millisecond):
std::string time_data = "20000101091011345";
const std::size_t offset_list_size = 7;
const int offset_list[offset_list_size] = {4, 2, 2, 2, 2, 2, 3};
const strtk::offset_predicate<offset_list_size> predicate(offset_list);
strtk::std_string::token_list_type token_list;
strtk::offset_splitter(time_data,predicate,std::back_inserter(token_list));
strtk::std_string::token_list_type::iterator itr = token_list.begin();
while (token_list.end() != itr)
{
std::cout << "[" << (*itr) << "] ";
++itr;
}
std::cout << std::endl;
Split Regex
Another form of splitter is based on the concept of using regular
expressions as the delimiter predicate. Below is a simple example
of splitting tokens wrapped in round-brackets.
std::string str = "(12)(345)(6789)(0ijkx)(yz)";
std::list<std::string> token_list;
strtk::split_regex("\\((.*?)\\)",
str,
std::back_inserter(token_list),
strtk::regex_match_mode::match_1);
std::list<std::string>::iterator itr = token_list.begin();
while (token_list.end() != itr)
{
std::cout << "[" << (*itr) << "]\t";
++itr;
}
std::cout << std::endl;
Note: The parameter regex_match_mode represents the capture
of the marked sub-pattern in the current match. By default it is
strtk::regex_match_mode::match_all, which in this case would provide
the entire match including the bounding pattern, e.g.: Token3 would
be (0ijkx). However, in the above example we are only interested in
the sub-pattern between the round-brackets, hence
strtk::regex_match_mode::match_1 is used, resulting in Token3 being
0ijkx.
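The following short sketch contrasts the two modes side by side, reusing the pattern and input string from the example above:

// match_all: each element is the full match, brackets included.
std::list<std::string> all_tokens;
strtk::split_regex("\\((.*?)\\)",
                   str,
                   std::back_inserter(all_tokens),
                   strtk::regex_match_mode::match_all);
// all_tokens: (12), (345), (6789), (0ijkx), (yz)

// match_1: each element is only the first marked sub-pattern.
std::list<std::string> inner_tokens;
strtk::split_regex("\\((.*?)\\)",
                   str,
                   std::back_inserter(inner_tokens),
                   strtk::regex_match_mode::match_1);
// inner_tokens: 12, 345, 6789, 0ijkx, yz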
The following examples demonstrate the use of strtk::split_regex
and strtk::split_regex_n
routines in extracting specific
types of data - in this case the POD types int and double.
int main()
{
{
std::string data = "a 1^bc,0023| def?gh(4567ijk)-89 10l,m$n-op+123r@st+3u v*w2y56yz+";
std::deque<int> int_list;
strtk::split_regex("([+-]?([\\d]+))",
data,
strtk::range_to_type_back_inserter(int_list),
strtk::regex_match_mode::match_1);
}
{
std::string data = "ab$c1.1?d-2.2ef#ghi+3.3%(123.456)!&*-7.89E+12@^=";
std::deque<double> double_list;
strtk::split_regex(strtk::ieee754_expression,
data,
strtk::range_to_type_back_inserter(double_list),
strtk::regex_match_mode::match_1);
}
{
std::string data = "a 1^bc,0023| def?gh(4567ijk)-89 10l,m$n-op+123r@st+3u v*w2y56yz+";
std::deque<int> int_list;
strtk::split_regex_n("([+-]?([\\d]+))",
data,
3,
strtk::range_to_type_back_inserter(int_list),
strtk::regex_match_mode::match_1);
}
{
std::string data = "ab$c1.1?d-2.2ef#ghi+3.3%(123.456)!&*-7.89E+12@^=";
std::deque<double> double_list;
strtk::split_regex_n(strtk::ieee754_expression,
data,
4,
strtk::range_to_type_back_inserter(double_list),
strtk::regex_match_mode::match_1);
}
return 0;
}
The following table describes the various regex patterns built
into StrTk which can be used seamlessly with the strtk::split_regex
and strtk::split_regex_n
routines.
Regex | Definition |
strtk::uri_expression | URI, URL address e.g.: http://www.example.com, domain.example.net/index.html |
strtk::email_expression | E-mail address e.g.: some.one@example.com |
strtk::ip_expression | IPv4 address e.g.: 192.168.0.1, 127.0.0.1 |
strtk::ieee754_expression | Floating point value e.g.: 1.1, 1.234e-123, -1.0001E+10, 0.1234 |
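As a brief hedged sketch of how these built-in patterns are used, the following extracts the IPv4 addresses embedded in a string via strtk::ip_expression (the input string here is illustrative):

std::string data = "connections: 192.168.0.1 -> 10.0.0.7 and 127.0.0.1";
std::list<std::string> ip_list;
strtk::split_regex(strtk::ip_expression,
                   data,
                   std::back_inserter(ip_list),
                   strtk::regex_match_mode::match_all);
// ip_list: 192.168.0.1, 10.0.0.7, 127.0.0.1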
Tokenizer
The tokenizer is specialized on a sequence iterator and predicate. It
is constructed with a range of iterators for the sequence and a
reference to the desired predicate. Definitions exist for std::string
tokenizers. The tokenizer provides an iteration pattern as a means
for accessing the tokens in the sequence.
const unsigned int data_size = 12;
unsigned int data[data_size] = {1,2,3,0,4,5,6,0,7,8,0,9};
strtk::single_delimiter_predicate<unsigned int> predicate(0);
typedef strtk::tokenizer<unsigned int*,strtk::single_delimiter_predicate<unsigned int> > tokenizer_type;
tokenizer_type tokenizer(data,data + data_size,predicate);
Similar to strtk::split, strtk::tokenizer provides tokenizing options
that are passed in during construction. Below is a table depicting
said options:
Tokenize Option | Definition |
strtk::tokenize_options::default_mode | Default options |
strtk::tokenize_options::compress_delimiters | Consecutive delimiters are treated as one |
strtk::tokenize_options::include_1st_delimiter | The first delimiter is included in the resulting token range |
strtk::tokenize_options::include_delimiters | All delimiters are included in the resulting token range |
typedef strtk::tokenizer<unsigned int*,strtk::single_delimiter_predicate<unsigned int> > tokenizer_type;
tokenizer_type tokenizer(data, data + data_size,
predicate,
strtk::tokenize_options::compress_delimiters |
strtk::tokenize_options::include_1st_delimiter);
Furthermore, iteration over tokens using strtk::tokenizer
is performed as follows:
tokenizer_type::iterator itr = tokenizer.begin();
while (tokenizer.end() != itr)
{
std::copy((*itr).first,(*itr).second,std::ostream_iterator<unsigned int>(std::cout," "));
std::cout << std::endl;
++itr;
}
A typical std::string
can be tokenized in the following
manner:
std::string str = "abc|123|xyz|789";
strtk::std_string::tokenizer<>::type tokenizer(str,"|");
strtk::std_string::tokenizer<>::type::iterator itr = tokenizer.begin();
while (tokenizer.end() != itr)
{
std::cout << "[" << (*itr) << "]\t";
++itr;
}
std::cout << std::endl;
Another common situation may be tokenizing a sequence of strings,
such as the following:
const std::string str_list[] = { "abc" , "delimiter" , "ijk" , "delimiter" ,
"mno" , "delimiter" , "rst" , "uvw" ,
"delimiter" , "xyz" };
const std::size_t str_list_size = sizeof(str_list) / sizeof(std::string);
strtk::range_adapter<std::string> range(str_list,str_list_size);
strtk::single_delimiter_predicate<std::string> predicate("delimiter");
typedef strtk::tokenizer<std::string*,strtk::single_delimiter_predicate<std::string> > tokenizer_type;
tokenizer_type tokenizer(range.begin(),range.end(),predicate);
tokenizer_type::iterator itr = tokenizer.begin();
while (tokenizer.end() != itr)
{
std::copy((*itr).first,(*itr).second,std::ostream_iterator<std::string>(std::cout," "));
std::cout << std::endl;
++itr;
}
Note: For performance and resource management purposes,
strtk::tokenizer does not take ownership of, or make an internal copy
of, the sequence being tokenized. As such, strtk::tokenizer expects
the range to remain valid for the entirety of the tokenization
process; this is also the case for the specified predicate.
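The following minimal sketch illustrates the lifetime requirement (the commented-out variant is the error to avoid):

// Correct: 'str' outlives the tokenizer that refers into it.
std::string str = "abc|123|xyz";
strtk::std_string::tokenizer<>::type tokenizer(str,"|");

// Incorrect: a temporary string would be destroyed immediately,
// leaving the tokenizer holding a dangling range.
//
// strtk::std_string::tokenizer<>::type bad(std::string("abc|123"),"|");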
The Parse Routine
Until now the routines mentioned have worked specifically with tokens,
or in other words, ranges of characters. The responsibility of
managing the tokens and converting them to user specified types was
handled manually via "range to type" oriented back inserters and
converters. This can be a bit cumbersome, and as such StrTk provides
a series of helper routines called strtk::parse. Parse takes a
std::string representing a tuple of delimited values as input data,
a delimiter set, and a series of references to variables that are to
be populated with the values from the parsed tokens. The following
diagram demonstrates the flow of data, tokens and the corresponding
relationships and conversions between each of the parameters.
[Diagram: flow of data from the input string through delimiters and tokens to the target variables]
Note: strtk::parse returns a boolean value of true upon successful
parsing and false for all other results. Situations that cause
strtk::parse to fail include (a short example follows the list):
- An insufficient number of tokens for the given number of variables
- A conversion failure from token range to variable type
- Empty or null token(s)
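The following sketch demonstrates checking the return value; the second call fails because the token "abc" cannot be converted to an int:

std::string good_data = "123,456";
std::string bad_data  = "123,abc";
int x = 0;
int y = 0;
// Succeeds: both tokens convert cleanly to int.
bool ok1 = strtk::parse(good_data,",",x,y);
// Fails: "abc" is not a valid int, so false is returned.
bool ok2 = strtk::parse(bad_data,",",x,y);
std::cout << ok1 << " " << ok2 << std::endl; // 1 0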
Some Simple Parse Examples
strtk::parse can take an arbitrary number of variable references. The
code below demonstrates the basic usage of strtk::parse with varying
numbers of parameters.
std::string data = "abcde,-1234|567.890#1.1f";
std::string delimiters = ",|#";
std::string var0;
int var1;
double var2;
float var3;
strtk::parse(data,delimiters,var0);
strtk::parse(data,delimiters,var0,var1);
strtk::parse(data,delimiters,var0,var1,var2);
strtk::parse(data,delimiters,var0,var1,var2,var3);
The following examples demonstrate parsing of PODs such as int and
double into STL compatible sequences (std::vector, std::deque,
std::list, std::set, std::queue, std::stack and std::priority_queue).
std::string int_string = "1 +2 -3 4 +5 -6 7 +8 -9 10 +11 -12 13 +14 -15";
std::vector<int> int_vector;
strtk::parse(int_string," ",int_vector);
std::string double_string = "-123.456,789.012,-345.678,901.234,+567.890";
std::deque<double> double_deque;
strtk::parse(double_string,",",double_deque);
std::string data_string = "a,bc,def,ghij,klmno,pqrstu,vwxyz";
std::list<std::string> string_list;
strtk::parse(data_string,",",string_list);
std::string set_string = "a|bc#def|ghij#klmno|pqrstu#vwxyz";
std::set<std::string> string_set;
strtk::parse(set_string,"#|",string_set);
std::string queue_string = "value1,value2,value3,value4,value5";
std::queue<std::string> string_queue;
strtk::parse(queue_string,",|",string_queue);
std::string stack_string = "value1|value2,value3|value4,value5";
std::stack<std::string> string_stack;
strtk::parse(stack_string,",|",string_stack);
std::string priority_queue_string = "value1|value2,value3#value4,value5";
std::priority_queue<std::string> string_priority_queue;
strtk::parse(priority_queue_string,",|#",string_priority_queue);
Similar to what is described above, the following demonstrates parsing of up to "N" elements into an
STL compatible sequence.
std::string int_string = "100,-200,+300,400,-500,+600,700,-800,+900";
std::vector<int> int_vector;
strtk::parse_n(int_string,",",5,int_vector);
std::string double_string = "123.456,+789.012,345.678,-901.234,567.890";
std::deque<double> double_deque;
strtk::parse_n(double_string,",",3,double_deque);
std::string data_string = "a,bc,def,ghij,klmno,pqrstu,vwxyz";
std::list<std::string> string_list;
strtk::parse_n(data_string,",",6,string_list);
std::string set_string = "a|bc#def|ghij#klmno|pqrstu#vwxyz";
std::set<std::string> string_set;
strtk::parse_n(set_string,"#|",6,string_set);
std::string queue_string = "value0,value1,value2,value3,value4,value5";
std::queue<std::string> string_queue;
strtk::parse_n(queue_string,",|",4,string_queue);
std::string stack_string = "value0|value1|value2,value3|value4,value5";
std::stack<std::string> string_stack;
strtk::parse_n(stack_string,",|",4,string_stack);
std::string priority_queue_string = "value0#value1|value2,value3#value4,value5";
std::priority_queue<std::string> string_priority_queue;
strtk::parse_n(priority_queue_string,",|#",4,string_priority_queue);
Note: The return value of the routine strtk::parse_n indicates how
many elements were parsed and placed into the specified sequence.
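A brief sketch using that return value to confirm how many elements were extracted:

std::string data = "10,20,30,40,50";
std::deque<int> values;
// Request at most 3 elements from the 5 available tokens.
const std::size_t parsed = strtk::parse_n(data,",",3,values);
// 'parsed' is 3 and 'values' contains 10, 20 and 30.
std::cout << "parsed " << parsed << " elements" << std::endl;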
Some Initial Simple Examples
Simple Example 0
As a first example, we'll tackle the simple problem of reversing the
words in a sentence. That is, given a sentence, the first word becomes
the last and the last becomes the first, the second word becomes the
second last, and so on. Using StrTk we come up with the following
solution:
int main()
{
std::string sentence = "The quick brown fox jumps over the lazy dog";
std::cout << "Before: " << sentence << std::endl;
strtk::split(" ",
sentence,
strtk::functional_inserter(
[](const strtk::range::string& range)
{ strtk::reverse(range); })
);
strtk::reverse(sentence);
std::cout << "After: " << sentence << std::endl;
return 0;
}
With an expected output of:
Before: The quick brown fox jumps over the lazy dog
After: dog lazy the over jumps fox brown quick The
Simple Example 1
Another example: given a list of words to blank out and a sentence,
transform the sentence such that the blanked-out words are removed.
Using StrTk we come up with the following solution:
int main()
{
std::string sentence = "The quick brown fox jumps over the lazy dog";
std::unordered_set<std::string> blankout_words;
blankout_words.insert("quick");
blankout_words.insert("over");
blankout_words.insert("lazy");
std::cout << "Before: " << sentence << std::endl;
strtk::split(" ",
sentence,
strtk::functional_inserter(
[&blankout_words](const strtk::range::string& range)
{
auto itr = blankout_words.find(range);
if (blankout_words.end() != itr)
{
strtk::fill(range,' ');
}
})
);
strtk::remove_consecutives_inplace(' ',sentence);
std::cout << "After: " << sentence << std::endl;
return 0;
}
With an expected output of:
Before: The quick brown fox jumps over the lazy dog
After: The brown fox jumps the dog
Simple Example 2
Another simple example: given a text file, read each of its lines and
populate a word list structure by tokenizing each line into words.
The following is an example of how this can be achieved using StrTk:
int main()
{
std::deque<std::string> word_list;
strtk::for_each_line("text.txt",
[&word_list](const std::string& line)
{
static const std::string delimiters = "0123456789()[]{}<>"
"\t\r\n ,,.;:'"
"!@#$%^&*_-=+`~/";
strtk::parse(line,delimiters,word_list);
});
return 0;
}
Simple Example 3
The following simple example takes a user specified text file,
proceeds to process it and returns information relating to
the file, such as word, letter, uppercase character,
lowercase character, vowel and consonant counts.
int main()
{
std::size_t word_count = 0;
std::size_t letter_count = 0;
std::size_t uppercase_count = 0;
std::size_t lowercase_count = 0;
std::size_t vowel_count = 0;
std::size_t consonant_count = 0;
using namespace strtk;
for_each_line("data.txt",
[&](const std::string& line)
{
static multiple_char_delimiter_predicate is_vowel("AEIOUaeiou");
static multiple_char_delimiter_predicate is_lowercase(ext_string::all_lowercase_letters());
static const std::string delimiters = ext_string::all_chars()
- ext_string::all_lowercase_letters()
- ext_string::all_uppercase_letters();
split(delimiters,
line,
functional_inserter(
[&](const strtk::range::string& range)
{
if (0 == range.size()) return;
++word_count;
letter_count += range.size();
std::size_t current_lowercase_count = 0;
std::size_t current_vowel_count = 0;
for (std::size_t i = 0; i < range.size(); ++i)
{
if (is_vowel(range[i])) ++current_vowel_count;
if (is_lowercase(range[i])) ++current_lowercase_count;
}
uppercase_count += range.size() - current_lowercase_count;
lowercase_count += current_lowercase_count;
consonant_count += range.size() - current_vowel_count;
vowel_count += current_vowel_count;
})
);
});
std::cout << "Word count: " << word_count << std::endl;
std::cout << "Letter count: " << letter_count << std::endl;
std::cout << "Uppercase count: " << uppercase_count << std::endl;
std::cout << "Lowercase count: " << lowercase_count << std::endl;
std::cout << "Vowel count: " << vowel_count << std::endl;
std::cout << "Consonant count: " << consonant_count << std::endl;
return 0;
}
Simple Example 4
For the next example, assume we have a text file with a list of names,
one per line that represents the order of people that entered a
building. Some of the people may enter and leave then reenter the
building many times, hence their name will appear multiple times in
the list. Our task is to reduce this list to a list of unique names
but to also maintain the relative order of names found in the original
list. The following is how this particular requirement can be
accomplished by using StrTk:
int main()
{
strtk::for_each_line("file_name.txt",
[](const std::string& line)
{
static std::unordered_set<std::string> line_set;
if (line_set.end() != line_set.find(line))
return;
line_set.insert(line);
std::cout << line << std::endl;
});
return 0;
}
Simple Example 5
As a final simple example, we would like to calculate the word
frequency model of a user specified text file. The process involves
reading each line, splitting the line into words, then incrementing
the relevant count for each word and maintaining a global word count.
Once the file has been processed, the occurrence frequency of each
word will be printed to stdout.
int main()
{
typedef std::unordered_map<std::string,unsigned int> map_t;
map_t word_hit_list;
unsigned int word_count = 0;
strtk::for_each_line("data.txt",
[&](const std::string& line)
{
static const std::string delimiters = strtk::ext_string::all_chars()
- strtk::ext_string::all_lowercase_letters()
- strtk::ext_string::all_uppercase_letters();
strtk::split(delimiters,
line,
strtk::functional_inserter(
[&](const strtk::range::string& range)
{
if (range.begin() == range.end()) return;
++word_count;
std::string word(range.begin(),range.end());
++word_hit_list[word];
})
);
});
for (map_t::value_type v : word_hit_list)
{
printf("%s %10d %10.9f\n",
strtk::text::right_align(15,' ',v.first).c_str(),
v.second,
(1.0 * v.second) / word_count);
}
return 0;
}
A Practical Example
Let's assume we have been given an English text file to process, with
the intention of extracting a lexicon from the file.
One solution would be to break the problem down into a line-by-line
tokenization problem. In this case we would define a functional
object such as the following, which takes the container in which
we plan on storing our tokens (words) and a predicate, and inserts
the tokens as strings into our container.
template<typename Container, typename Predicate>
struct parse_line
{
public:
parse_line(Container& container, const Predicate& predicate)
: container_(container),
predicate_(predicate)
{}
inline void operator() (const std::string& str)
{
strtk::split(predicate_,
str,
strtk::range_to_type_back_inserter(container_),
strtk::split_options::compress_delimiters);
}
private:
Container& container_;
const Predicate& predicate_;
};
Putting the whole thing together, the process of opening the file
and reading it line by line, each time invoking parse_line, would be
as follows:
template<typename Container>
void parse_text(const std::string& file_name, Container& c)
{
static const std::string delimiters = " ,.;:<>'[]{}()_?/"
"`~!@#$%^&*|-_\"=+\t\r\n"
"0123456789";
strtk::multiple_char_delimiter_predicate predicate(delimiters);
strtk::for_each_line(file_name,
parse_line<Container,strtk::multiple_char_delimiter_predicate>(c,predicate));
}
int main()
{
std::string text_file_name = "text.txt";
std::deque<std::string> word_list;
parse_text(text_file_name,word_list);
std::cout << "Token Count: " << word_list.size() << std::endl;
return 0;
}
Before we continue on with the example, a re-written version of the
above code using C++11 lambdas is as follows:
int main()
{
std::string text_file_name = "text.txt";
std::deque<std::string> word_list;
strtk::for_each_line(text_file_name,
[&word_list](const std::string& line)
{
static const std::string delimiters = " ,.;:<>'[]{}()_?/"
"`~!@#$%^&*|-_\"=+\t\r\n"
"0123456789";
strtk::parse(line,delimiters,word_list);
});
std::cout << "Token Count: " << word_list.size() << std::endl;
return 0;
}
Now coming back to the original problem, that being the construction
of a lexicon: in this case the set of "words" should only
contain words of interest. For the sake of simplicity, let's define
words of interest as being anything other than the following
prepositions: as, at, but, by, for, in, like, next, of, on,
opposite, out, past, to, up and via. This type of list is commonly
known as a stop word list. In this example the stop-word list
definition will be as follows:
const std::string stop_word_list [] =
{
"as", "at", "but", "by", "for",
"in", "like", "next", "of", "on",
"opposite", "out", "past", "to",
"up", "via", ""
};
const std::size_t stop_word_list_size = sizeof(stop_word_list) / sizeof(std::string);
Some minor updates to the parse_line processor include using the
filter_on_match predicate to determine if the currently processed
token is a preposition, and the invocation of the range_to_string
back inserter to convert the tokens from their range iterator
representation to a type compatible with the user defined container.
For the new implementation to provide unique words of interest, the
simplest change that can be made is to replace the deque used as the
container for the word_list with some kind of one-to-one associative
container, such as a std::set. The following is the improved version
of the parse_line processor:
template<typename Container, typename Predicate>
struct parse_line
{
public:
parse_line(Container& container, const Predicate& predicate)
: container_(container),
predicate_(predicate),
tmp_(" "),
tokenizer_(tmp_,predicate_,true),
filter_(stop_word_list,stop_word_list + stop_word_list_size,
strtk::range_to_string_back_inserter_iterator<Container>(container_),
true,false)
{}
inline void operator() (const std::string& s)
{
strtk::for_each_token(s,tokenizer_,filter_);
}
private:
Container& container_;
const Predicate& predicate_;
std::string tmp_;
typename strtk::std_string_tokenizer<Predicate>::type tokenizer_;
strtk::filter_on_match<strtk::range_to_string_back_inserter_iterator<Container>> filter_;
};
The above described functional object can be greatly simplified by
using binders and various lambda expressions, as the following sketch
demonstrates.
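The following is a hedged re-working of the lexicon builder in the style of the earlier lambda examples, composed only of routines shown in this article (the delimiter set here is illustrative). Tokens found in the stop-word set are simply skipped:

std::set<std::string> lexicon;
const std::set<std::string> stop_words(stop_word_list,
                                       stop_word_list + stop_word_list_size);
strtk::for_each_line("text.txt",
   [&](const std::string& line)
   {
      static strtk::multiple_char_delimiter_predicate
         delimiter(" ,.;:<>'[]{}()_?/`~!@#$%^&*|-=+\t\r\n0123456789");
      strtk::split(delimiter,
                   line,
                   strtk::functional_inserter(
                      [&](const strtk::range::string& range)
                      {
                         std::string word(range.begin(),range.end());
                         // Only insert non-empty words of interest.
                         if (!word.empty() && (stop_words.end() == stop_words.find(word)))
                            lexicon.insert(word);
                      }),
                   strtk::split_options::compress_delimiters);
   });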
Another Example
When performing serialization or deserialization of an object
instance, such as a class or struct, a simple approach one could
take would be to convert each of the members into a string
representation, and from those strings construct a larger string,
delimiting each member with a special character guaranteed not to
exist in any of the string representations.
In this example we will assume that there exists a struct which
represents the properties of a person, a person struct if you
will:
struct person
{
unsigned int id;
std::string name;
unsigned int age;
double height;
float weight;
};
The process of populating a person struct would entail having an
instance of a person and the necessary data string. The following is
an example of how this would be done using the
strtk::parse
function.
Person Tuple Format (each field delimited by a pipe |)
Token0 | Token1 | Token2 | Token3 | Token4 |
Unique ID(hex) | Name | Age | Height(m) | Weight(kg) |
std::string data = "0xFA37ED12|Rumpelstiltskin|397|1.31|58.7";
person p;
strtk::hex_to_number_sink<unsigned int> hex_sink(p.id);
strtk::parse(data,"|",hex_sink,p.name,p.age,p.height,p.weight);
Batch processing of a text file comprised of one person tuple
per line is somewhat similar to the previous example. A predicate
is defined that takes a container specialized on the person struct,
and a delimiter predicate with which the strtk::parse function will
be invoked. This predicate is then instantiated and, coupled with the
text file name, fed to the strtk::for_each_line processor.
template <typename Container>
struct parse_person_tuple
{
public:
parse_person_tuple(Container& container)
: container_(container),
hex_sink(p_.id)
{}
inline void operator() (const std::string& s)
{
if (strtk::parse(s,"|",hex_sink,p_.name,p_.age,p_.height,p_.weight))
container_.push_back(p_);
else
std::cerr << "Failed to parse: " << s << std::endl;
}
private:
Container& container_;
person p_;
strtk::hex_to_number_sink<unsigned int> hex_sink;
};
Bringing the above pieces together to process a file results in the
following:
int main()
{
std::string text_file_name = "person_records.txt";
std::deque<person> person_list;
strtk::for_each_line(text_file_name,parse_person_tuple<std::deque<person> >(person_list));
return 0;
}
Before we continue on with the example, a re-written version of the
above code using C++11 lambdas is as follows:
int main()
{
std::string text_file_name = "person_records.txt";
std::deque<person> person_list;
person p;
strtk::hex_to_number_sink<unsigned int> hex_sink(p.id);
strtk::for_each_line(text_file_name,
[&](const std::string& line)
{
if (strtk::parse(line,"|",hex_sink,p.name,p.age,p.height,p.weight))
person_list.push_back(p);
else
std::cerr << "Failed to parse: " << line << std::endl;
});
return 0;
}
To make things easier one could adapt a struct (made up entirely
of PODs) to the parser. This makes the usage syntax a little easier
to follow. An example of this adaption is as follows:
struct type
{
std::string s;
double d;
int i;
char c;
bool b;
};
strtk_parse_begin(type)
strtk_parse_type(s)
strtk_parse_type(d)
strtk_parse_type(i)
strtk_parse_type(c)
strtk_parse_type(b)
strtk_parse_end()
int main()
{
type t;
std::string s = "abcdefghijklmnop|123.456|987654321|A|1";
strtk::parse(s,"|",t);
return 0;
}
Another example, similar to the previous one, would be parsing a text
file of 3D coordinates into a sequence. This can be done easily and
cleanly using lambdas and StrTk as follows:
struct point
{
double x,y,z;
};
int main()
{
std::string point_data = "point_data.txt";
std::deque<point> points;
point p;
strtk::for_each_line(point_data,
[&](const std::string& str)
{
if (strtk::parse(str,",",p.x,p.y,p.z))
points.push_back(p);
});
return 0;
}
Simple Date-Time Parser
Assume the datetime struct defined below, and a string representation
of a combined date and time in the form YYYY-MM-DD HH:MM:SS.mss,
e.g.: 2005-06-26 15:57:03.678.
struct datetime
{
unsigned int year;
unsigned int month;
unsigned int day;
unsigned int hour;
unsigned int minute;
unsigned int second;
unsigned int msecond;
};
The following assumes an input of date-time values separated by a
pipe. To facilitate parsing of a date-time by the strtk::parse
routine into an STL compatible sequence, an implementation of
string_to_type_converter_impl specific to the datetime type is
required. The following demonstrates how such a routine can be
generated and used within the strtk::parse context:
strtk_string_to_type_begin(datetime)
static const std::string delimiters ("-:. ");
return strtk::parse(begin, end, delimiters,
t.year, t.month, t.day,
t.hour, t.minute, t.second, t.msecond);
strtk_string_to_type_end()
A simple example of using strtk::parse
in conjunction
with the newly integrated datetime variable mixed with variables
of other types is as follows:
int main()
{
const std::string data = "abc 123 xyz,2000-01-10 03:01:16.123|+98765.43210";
std::string var0;
datetime var1;
double var2;
strtk::parse(data,",|",var0,var1,var2);
return 0;
}
Bringing the above pieces together, we can then proceed to parse a
sequence of date-time strings delimited by a pipe "|" into a deque of
type datetime.
int main()
{
static const std::string data = "2000-01-10 03:01:16.123|2001-02-22 05:12:24.234|"
"2002-03-13 07:23:32.345|2003-04-24 09:34:47.456|"
"2004-05-15 11:46:51.767|2005-06-26 15:57:03.678|"
"2006-07-17 17:45:31.561|2007-08-26 19:06:02.809|"
"2008-09-18 21:16:23.267|2009-10-26 23:12:03.798|"
"2010-11-23 13:47:11.963|2011-12-26 15:35:08.168";
std::deque<datetime> datetime_list;
strtk::parse(data,"|",datetime_list);
for (std::size_t i = 0; i < datetime_list.size(); ++i)
{
datetime& dt = datetime_list[i];
std::cout << dt.year << "-"
<< strtk::text::right_align(2,'0', dt.month) << "-"
<< strtk::text::right_align(2,'0', dt.day) << " "
<< strtk::text::right_align(2,'0', dt.hour) << ":"
<< strtk::text::right_align(2,'0', dt.minute) << ":"
<< strtk::text::right_align(2,'0', dt.second) << "."
<< strtk::text::right_align(3,'0',dt.msecond) << std::endl;
}
return 0;
}
As a side note, the more commonly used date, time and date-time
formats can be easily parsed with a simple utilities library based on
StrTk called Datetime_Utils. The library makes use of the technique
described above, in conjunction with the strtk::offset_splitter, to
provide efficient and high performance parsers for formats such as
the ones denoted below:
Format | Example |
YYYYMMDD | 20060304 |
YYYYDDMM | 20060403 |
YYYY/MM/DD | 2006/03/04 |
YYYY/DD/MM | 2006/04/03 |
DD/MM/YYYY | 04/03/2006 |
HH:MM:SS.mss | 13:27:54.123 |
HH:MM:SS | 13:27:54 |
YYYYMMDD HH:MM:SS.mss | 20060304 13:27:54.123 |
YYYY/MM/DD HH:MM:SS.mss | 2006/03/04 13:27:54.123 |
DD/MM/YYYY HH:MM:SS.mss | 04/03/2006 13:27:54.123 |
YYYYMMDD HH:MM:SS | 20060304 13:27:54 |
YYYY/MM/DD HH:MM:SS | 2006/03/04 13:27:54 |
DD/MM/YYYY HH:MM:SS | 04/03/2006 13:27:54 |
YYYY-MM-DD HH:MM:SS.mss | 2006-03-04 13:27:54.123 |
DD-MM-YYYY HH:MM:SS | 04-03-2006 13:27:54 |
YYYY-MM-DDTHH:MM:SS | 2006-03-04T13:27:54 |
YYYY-MM-DDTHH:MM:SS.mss | 2006-03-04T13:27:54.123 |
YYYYMMDDTHH:MM:SS | 20060304T13:27:54 |
YYYYMMDDTHH:MM:SS.mss | 20060304T13:27:54.123 |
In the following simple example we have an array of data
representing tuples of trade executions in CSV format. The
objective is to populate the trade_list instance with the
given data via the defined trade struct. In the example the
dt_utils::datetime_format6 date-time parser is used; it populates a
general date-time type instance called dt_utils::datetime. If
the parse operation succeeds, the date-time components the trade
type requires are copied over and the instance itself is subsequently
added to trade_list.
struct trade
{
std::string ticker;
double price;
unsigned int volume;
unsigned short hr,min,sec,ms;
};
int main()
{
std::string trade_data[] =
{
"2006-03-04 13:27:54.123,ABC,12.347,6676",
"2006-03-04 13:27:54.231,XYZ,23.455,71547",
"2006-03-04 13:27:54.312,IJK,34.562,514",
"2006-03-04 13:27:55.263,PQR,67.893,5949",
"2006-03-04 13:27:55.327,ABC,78.963,19",
"2006-03-04 13:27:55.524,XYZ,45.677,1276",
"2006-03-04 13:27:56.623,IJK,36.182,6676",
"2006-03-04 13:27:56.877,PQR,62.339,1547"
};
std::deque<trade> trade_list;
trade t;
dt_utils::datetime dt;
dt_utils::datetime_format6 dt6(dt);
for (std::size_t i = 0; i < sizeof(trade_data) / sizeof(std::string); ++i)
{
bool result =
strtk::parse(trade_data[i],",",
dt6,t.ticker,t.price,t.volume);
if (result)
{
t.hr = dt.hour;
t.min = dt.minute;
t.sec = dt.second;
t.ms = dt.millisecond;
trade_list.push_back(t);
}
}
return 0;
}
Parsing Sub-Lists
So far the demonstrated capabilities of the strtk::parse function
have been based on passing a series of parameters that are populated
in a linear fashion as the parser processes the tokens it encounters.
That said, some formats have their own sub-structures; a simple
example would be a list of values (such as integers) that need to be
loaded into a deque or stack. StrTk provides a series of sink
facilities that consume a range and an STL container, and which can
be forwarded on to strtk::parse.
In the following example, the data string comprises three separate
lists delimited by a pipe "|": an integer list, a string list and a
double list. Each list is to be parsed into an STL container of the
appropriate type, in this case a vector, a deque and a list
respectively. StrTk provides the ability to instantiate a sink for
the specific container type that is compatible with strtk::parse.
int main()
{
std::string data = "1,+2,-3,4|abc,ijk,rst,xyz|123.456,+234.567,-345.678,456.789,567.890";
std::vector<int> int_vector;
std::deque<std::string> string_deque;
std::list<double> double_list;
std::set<int> int_set;
std::queue<std::string> string_queue;
std::stack<double> double_stack;
std::priority_queue<int> int_priority_queue;
strtk::vector_sink<int>::type vec_sink(",");
strtk::deque_sink<std::string>::type deq_sink(",");
strtk::list_sink<double>::type lst_sink(",");
strtk::set_sink<int>::type set_sink(",");
strtk::queue_sink<std::string>::type que_sink(",");
strtk::stack_sink<double>::type stk_sink(",");
strtk::priority_queue_sink<int>::type prq_sink(",");
strtk::parse(data,"|",vec_sink( int_vector),
deq_sink(string_deque),
lst_sink( double_list));
strtk::parse(data,"|",set_sink( int_set),
que_sink( string_queue),
stk_sink( double_stack),
prq_sink(int_priority_queue));
return 0;
}
If only a certain number of elements in the list are required, for
example only the first 3, then the element count on the sink can be
set appropriately. The above example could be modified as follows:
int main()
{
std::string data = "1,+2,-3,4|string0|abc,ijk,rst,xyz|string1|123.456,+234.567,-345.678,456.789,567.890";
std::vector<int> int_vector;
std::deque<std::string> string_deque;
std::list<double> double_list;
strtk::vector_sink<int>::type vec_sink(",");
strtk::deque_sink<std::string>::type deq_sink(",");
strtk::list_sink<double>::type lst_sink(",");
std::string string_0;
std::string string_1;
strtk::parse(data,"|",vec_sink( int_vector).count(2), string_0,
deq_sink(string_deque).count(3), string_1,
lst_sink( double_list).count(4));
return 0;
}
Note: If there aren't enough elements in a particular list, then
parsing of that list fails and subsequently the call to strtk::parse
as a whole will fail.
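A small sketch of this failure mode (the data here is illustrative): the vector sink expects five elements but only four are available, so the overall parse returns false.

std::string data = "1,2,3,4";
std::vector<int> int_vector;
strtk::vector_sink<int>::type vec_sink(",");
// Requesting 5 elements from a 4 element list causes the parse to fail.
if (!strtk::parse(data,"|",vec_sink(int_vector).count(5)))
{
   std::cout << "parse failed: insufficient elements" << std::endl;
}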
Parsing Trailing-Lists
Another way one might want to parse a tuple of values is to parse a
prefix of values into a specific number of possibly varying types,
and then parse the remaining values (assuming they are all of the
same type) into a sequence or list. StrTk provides the following
simple solution to this problem, as demonstrated below:
int main()
{
{
std::string data = "A String Value,111.111,222.222,333.333,444.444,555.555";
std::string token;
std::vector<double> double_list;
strtk::parse(data,",",token,double_list);
}
{
std::string data = "A String Value,01-02-2003,111.111,222.222,333.333,444.444,555.555";
std::string token;
std::string date;
std::deque<double> double_list;
strtk::parse(data,",",token,date,double_list);
}
{
std::string data = "A String Value,01-02-2003,123456789,111.111,222.222,333.333,444.444,555.555";
std::string token;
std::string date;
int i;
std::list<double> double_list;
strtk::parse(data,",",token,date,i,double_list);
}
{
std::string data = "A String Value,01-02-2003,123456789,111.111,222.222,333.333,444.444,555.555";
std::string token;
std::string date;
int i;
double d;
std::vector<double> double_list;
strtk::parse(data,",",token,date,i,d,double_list);
}
{
std::string data = "A String Value,01-02-2003,123456789,111.111,222.222,333.333,444.444,555.555";
std::string token;
std::string date;
int i;
double d1;
double d2;
std::deque<double> double_list;
strtk::parse(data,",",token,date,i,d1,d2,double_list);
}
return 0;
}
Extending The Date-Time Parser Example
Building upon the previous datetime example, we are presented with a
tuple of data that represents an astronomical event. The event
defines a name, a location and a series of date-times in UTC at which
the event was observed. In order to construct the necessary sink(s)
that will be used for parsing the required type into a container, the
macro strtk_register_userdef_type_sink is invoked with the specified
type. The following is a definition of the struct one might
construct:
struct event_information
{
std::size_t id;
std::string name;
std::string location;
std::deque<datetime> observation_datetimes;
};
strtk_register_userdef_type_sink(datetime)
Bringing the above together with a call to strtk::parse
results in the following code which parses the event data tuple into
the allocated event_information
instance.
int main()
{
std::string event_data = "172493|Lunar Impact|Mare Tranquillitatis|"
"2010-01-19 00:28:45.357,2010-02-18 00:57:07.109,"
"2010-03-20 01:15:11.261,2010-04-21 01:07:27.972";
strtk::deque_sink<datetime>::type deq_sink(",");
event_information evt_info;
strtk::parse(event_data,"|",evt_info.id,
evt_info.name,
evt_info.location,
deq_sink(evt_info.observation_datetimes));
return 0;
}
Token Processing During Parsing
StrTk offers a set of convenient and simple to use token processing
primitives that can be invoked during a call to the strtk::parse
routine to perform various actions upon the tokens being parsed.
These actions include such things as modifications and
validations of tokens.
The following is a list of token processing primitives used for
constraint and verification purposes:
- strtk::ignore_token
- strtk::expect
- strtk::iexpect
- strtk::like
- strtk::inrange
The following is a list of token processing primitives used for
modifying and normalising purposes:
- strtk::trim
- strtk::trim_leading
- strtk::trim_trailing
- strtk::as_lcase
- strtk::as_ucase
The primitives all return either a true or false value upon
parsing completion, which is then further used by the
strtk::parse
routine to determine if the parse
operation as a whole has succeeded or failed.
Ignore Token Processing
There may be scenarios where, given a delimited tuple of data, one
or more of the tokens need to be ignored or skipped. StrTk provides
a mechanism called strtk::ignore_token that allows the parser to
consume specific tokens easily without affecting overall performance.
Below is an example of how strtk::ignore_token can be used in
conjunction with strtk::parse to skip the 2nd and 4th tokens in the
tuple:
int main()
{
static const std::string data = "+123,ignore0,123.456,ignore1,abcdef,ignore2";
int i;
double d;
std::string s;
strtk::ignore_token ignore;
strtk::parse(data,",",i,ignore,d,ignore,s);
std::cout << "i = " << i << std::endl;
std::cout << "d = " << d << std::endl;
std::cout << "s = " << s << std::endl;
return 0;
}
Expect and IExpect Token Processing
When parsing a tuple, one may want to ensure that specific tokens
of the tuple have a certain string value. StrTk provides this
type of functionality via the strtk::expect and strtk::iexpect
mechanisms. The strtk::expect form enforces an exact string match,
whereas strtk::iexpect enforces only a case-insensitive match. The
following is an example where we attempt to parse a 'pascal-like'
variable declaration and definition. The requirement is that the
first token be "var", followed by a variable name, a type name of
'Integer' (which is not case sensitive) and an initial value.
int main()
{
static const std::string data = "var foo : InTeGeR = 3;";
std::string variable_name;
int initial_value;
bool result = strtk::parse(data,
" ;",
strtk::expect("var").ref(),
variable_name,
strtk::expect(":").ref(),
strtk::iexpect("Integer").ref(),
strtk::expect("=").ref(),
initial_value);
if (result)
std::cout << variable_name << " = " << initial_value << std::endl;
else
std::cout << "Failed to parse statement!" << std::endl;
return 0;
}
Like Token Processing
Similar to the above mentioned strtk::expect and strtk::iexpect
primitives, StrTk provides simple wildcard matching of tokens via
the strtk::like mechanism. The special characters '*' and '?' denote
'zero or more' and 'any single character' match modes respectively.
The following example uses strtk::like in conjunction with
strtk::iexpect to parse a tuple of key/value pairs.
int main()
{
static const std::string data = "token0=+123;token1=abc;token2=-456.678;";
int i;
std::string s;
double d;
strtk::parse(data,
"=;",
strtk::like("to*n?").ref(),
i,
strtk::like("token?").ref(),
s,
strtk::iexpect("tOkEn2").ref(),
d);
std::cout << "i = " << i << std::endl;
std::cout << "s = " << s << std::endl;
std::cout << "d = " << d << std::endl;
return 0;
}
In-Range Token Processing
When parsing tokens, one may wish to determine whether the token,
when viewed in its target type, resides within a specified range
[r0,r1]. As the tokens can be of any type, not necessarily just
string or numerical in nature, the type must be less-than comparable.
The following example attempts to parse a key/value tuple that
contains a temperature and a name component.
int main()
{
static const std::string data = "temperature=+123.456;name=Rumpelstilzchen";
double temperature;
std::string name;
strtk::parse(data,
"=;",
strtk::expect("temperature").ref(),
strtk::inrange(temperature,-432.1,+432.1).ref(),
strtk::expect("name").ref(),
strtk::inrange(name,"AAAA","zzzz").ref());
std::cout << "temperature = " << temperature << std::endl;
std::cout << "name = " << name << std::endl;
return 0;
}
Trim Token Processing
At times tokens within a tuple may have padding added to the left,
the right or both ends. When processing the token it may be
necessary to remove the superfluous padding before attempting to
convert the string or range representation of the token into its
target type (int, double etc). The example below demonstrates the
use of various forms of token trimming in conjunction with
strtk::parse.
int main()
{
{
std::string data = "****abc123****,****abc123****,****abc123****";
std::string s0;
std::string s1;
std::string s2;
strtk::parse(data,",",
strtk::trim("*",s0).ref(),
strtk::trim_leading ("*",s1).ref(),
strtk::trim_trailing("*",s2).ref());
std::cout << "s0 = [" << s0 << "]" << std::endl;
std::cout << "s1 = [" << s1 << "]" << std::endl;
std::cout << "s2 = [" << s2 << "]" << std::endl;
}
{
std::string data = "*?*?a string*?*?,*?*123456,123.456?*?*?";
std::string s;
int i;
double d;
strtk::parse(data,",",
strtk::trim("*?",s).ref(),
strtk::trim_leading ("?*",i).ref(),
strtk::trim_trailing("*?",d).ref());
std::cout << "s = [" << s << "]" << std::endl;
std::cout << "i = [" << i << "]" << std::endl;
std::cout << "d = [" << d << "]" << std::endl;
}
return 0;
}
Case Normalisation Token Processing
Another pair of token processing mechanisms provided by StrTk are
strtk::as_lcase and strtk::as_ucase. They convert the string
representation of the token to lowercase and uppercase characters
respectively. The following example parses a two token tuple,
converting the first token s0 to all lowercase and the second token
s1 to all uppercase.
int main()
{
std::string data = "AbCd,EfGhI";
std::string s0;
std::string s1;
strtk::parse(data,",",
strtk::as_lcase(s0).ref(),
strtk::as_ucase(s1).ref());
std::cout << "s0 = [" << s0 << "]" << std::endl;
std::cout << "s1 = [" << s1 << "]" << std::endl;
return 0;
}
Parsing Truncated Values
There may be times during parsing when a token which is intended to
be parsed as an integral type (e.g.: int, short, unsigned int et al)
is represented using decimal notation (e.g.: 1234.00000).
Normally, if the token were to be passed as-is it would cause a
conversion error due to the invalid characters within the token.
StrTk provides a facility called strtk::truncated_int
that can be used with both signed and unsigned integral types. The
truncated_int type is specialized with the required type, then an
instance of that type is registered with it, either prior to or
inline with the conversion/parse call. When the conversion occurs,
strtk::truncated_int simply redefines the end of the token range
to be the decimal point (if it is present) and then passes
it along to the appropriate string_to_type_converter_impl call.
int main()
{
{
int i = 0;
std::string data = "-1234.0000";
strtk::truncated_int<int> ti;
strtk::string_to_type_converter(data,ti(i));
}
{
int i0 = 0;
int i1 = 0;
std::string data = "-1234.0000|456.7890";
strtk::truncated_int<int> ti0;
strtk::truncated_int<int> ti1;
strtk::parse(data,"|",ti0(i0),ti1(i1));
}
}
An optional parameter that can be utilized is fractional_size, which
denotes the exact number of digits expected after the decimal place.
In the event this condition is not met, a conversion error is
returned.
int main()
{
std::string data = "-1234.000";
strtk::truncated_int<int> ti;
ti.fractional_size(3);
int i = 0;
if (strtk::string_to_type_converter(data,ti(i)))
std::cout << "i: " << i << std::endl;
else
std::cout << "Error parsing: " << data << std::endl;
return 0;
}
In the following example, we have an array of trade execution
tuples in CSV format. Each tuple comprises the following
fields: ticker name (string), trade id (uint64, right
aligned with space padding), execution price (double) and
executed volume (unsigned int with a 4 decimal place suffix).
The trade struct will be used to store the tuples in memory.
The objective is to parse each tuple and populate the trade_list
structure with all the trades, noting any errors that occur along
the way.
struct trade
{
unsigned long long id;
std::string ticker;
double price;
unsigned int volume;
};
int main()
{
std::string trade_data[] =
{
"AAA, 0,36.900,009352.0000",
"BBB, 1,46.000,009800.0000",
"CCC, 2,46.000,009400.0000",
"DDD, 3,46.000,001500.0000",
"AAA, 4,46.000,000600.0000",
"BBB, 5,46.000,039800.0000",
"CCC, 6,46.000,003200.0000",
"DDD, 7,46.000,000100.0000",
"AAA, 8,46.000,000800.0000",
"BBB, 9,48.500,002400.0000",
"CCC, 10,37.395,101200.0000",
"DDD, 11,37.193,200000.0000",
"AAA, 12,37.146,020000.0000",
"BBB, 13,38.314,001752.0000",
"CCC, 14,37.386,130000.0000",
"DDD, 15,37.443,100000.0000"
};
std::deque<trade> trade_list;
strtk::truncated_int<unsigned int> tui;
tui.fractional_size(4);
for (std::size_t i = 0; i < sizeof(trade_data) / sizeof(std::string); ++i)
{
trade t;
bool result = strtk::parse(trade_data[i],",",
t.ticker,
strtk::trim_leading(" ",t.id).ref(),
t.price,
tui(t.volume));
if (result)
trade_list.push_back(t);
else
std::cout << "Tuple parse error: " << trade_data[i] << std::endl;
}
return 0;
}
It should be noted that truncated_int can be used in conjunction
with the various StrTk token processing primitives, such as
strtk::trim, strtk::trim_leading, strtk::as_lcase et al. The
following example demonstrates parsing a tuple of values intended
for int types, where the tokens have random space padding. The
simple composition of strtk::truncated_int and strtk::trim allows
for efficient and error free parsing of the tuple.
int main()
{
std::string data = " 123.0 | 456.00 | 789.000";
int i0,i1,i2;
strtk::truncated_int<int> ti0;
strtk::truncated_int<int> ti1;
strtk::truncated_int<int> ti2;
strtk::parse(data,"|",
strtk::trim(" ",ti0(i0)).ref(),
strtk::trim(" ",ti1(i1)).ref(),
strtk::trim(" ",ti2(i2)).ref());
printf("i0 = [%d] i1 = [%d] i2 = [%d]\n",i0,i1,i2);
return 0;
}
Columnwise Parsing
In the previous section the ability to ignore tokens in a tuple was
discussed. The concept works well if only a few tokens need to be
ignored. However problems arise when the tuples contain a large number
of tokens and the tokens that are to be ignored are numerous and
distributed uniformly over the entire tuple.
Situations such as this one are common, and using the ignore_token
technique not only makes the coding of the solution cumbersome
and error prone, but also makes the parsing process itself quite
inefficient.
A natural extension to ignore_token that scales, and is also extremely
efficient, can be found in the combined functionality of
parse_columns and column_list. The column_list is a structure used to
hold the indexes of the tokens in the tuple that are required. The
indexes have to be valid, unique and in ascending order.
auto cl_even = strtk::column_list(0,2,4,6,8,10,12);
auto cl_odd = strtk::column_list(1,3,5,7);
The strtk::parse_columns function takes a string of
data representing a tuple, a delimiter to determine the tokens in
the tuple, a strtk::column_list and a compatible number
of types as target references.
The number of types has to be equal to the number of indexes in the
column_list, and the types need to be convertible within the strtk
namespace from a string range representation to the type. In the
following example we have a tuple consisting of integers. We're only
interested in the first five even numbered indexes in the tuple; the
code below demonstrates how the tuple is parsed with the given
constraints:
int main()
{
const std::string data = "1,22;333|4444 55555|666666;7777777,88888888 999;"
"0000|1111,22222,333333,4444444";
int i0,i1,i2,i3,i4;
strtk::parse_columns(data,
",;| ",
strtk::column_list(0,2,4,6,8),
i0,i1,i2,i3,i4);
return 0;
}
In the following example we have a tuple consisting of a mixture of
types. We are only interested in the tokens at indexes 1, 5 and 8
of the tuple, which happen to be of type int, double and std::string
respectively. The code below demonstrates how the tuple is
parsed with the given constraints:
int main()
{
const std::string data = "token0,1234;token2;token3,token4,11.22|token6;token7|"
"text-i-text,1 2 3 4 5 6 7";
int i;
double d;
std::string s;
strtk::parse_columns(data,
",;| ",
strtk::column_list(1,5,8),
i,d,s);
return 0;
}
Simple 3D Mesh File Format Parser
Let's assume there is a file format for expressing a mesh. The format
consists of a section that defines the unique vertices in the mesh,
and another section that defines the triangles in the mesh as a tuple
of three indices indicating the vertices used. Each section is
preceded by an integer that denotes the number of subsequent
elements in that section. An example of such a file is the following:
5
+1.0,+1.0,+1.0
-1.0,+1.0,-1.0
-1.0,-1.0,+1.0
+1.0,-1.0,-1.0
+0.0,+0.0,+0.0
4
0,1,4
1,2,4
2,3,4
3,1,4
Code to parse such a file format is as follows:
struct point
{
double x,y,z;
};
struct triangle
{
std::size_t i0,i1,i2;
};
int main()
{
std::string mesh_file = "mesh.txt";
std::ifstream stream(mesh_file.c_str());
std::string s;
std::deque<point> points;
point p;
std::size_t point_count = 0;
strtk::parse_line(stream," ",point_count);
strtk::for_each_line_n(stream,
point_count,
[&points,&p](const std::string& line)
{
if (strtk::parse(line,",",p.x,p.y,p.z))
points.push_back(p);
});
std::deque<triangle> triangles;
triangle t;
std::size_t triangle_count = 0;
strtk::parse_line(stream," ",triangle_count);
strtk::for_each_line_n(stream,
triangle_count,
[&triangles,&t](const std::string& line)
{
if (strtk::parse(line,",",t.i0,t.i1,t.i2))
triangles.push_back(t);
});
return 0;
}
Simple Semantic Actions
A semantic action is an action that is undertaken upon a token; it can be
in the form of either an assessment or a manipulation. StrTk provides a very
simplified semantic action capability, named strtk::util::semantic_action,
for types that are being parsed via the strtk::parse function.
A function (stateful or stateless) taking an iterator range is used to construct
the semantic_action. The body of the function performs whatever operations
are required and also makes sure to maintain the contract with regards to the
return status for the parse routine to complete successfully.
The following is an example where a comma delimited string is parsed into 3 types:
an integer, a double and a string. The rules regarding parsing and updating of the
variables are as follows: the int variable "i" will only be updated if the
value parsed is odd, the double value "d" will only be updated if the parsed
value is greater than or equal to 99.99, and the string value "s" will only
be updated if the presented range contains the string "ring". Upon every successful
update a corresponding counter will be incremented.
int main()
{
std::string data = "12345,123.456,A string";
int i = 0;
double d = 0.0;
std::string s = "";
std::size_t i_update_count = 0;
std::size_t d_update_count = 0;
std::size_t s_update_count = 0;
typedef strtk::range::ustring::const_iterator itr_type;
using strtk::util::semantic_action;
strtk::parse(data,",",
semantic_action([&i,&i_update_count]
(itr_type begin,itr_type end) -> bool
{
int temp = 0;
if (strtk::string_to_type_converter(begin,end,temp))
{
if (temp % 2 != 0)
{
i = temp;
++i_update_count;
}
return true;
}
else
return false;
}).ref(),
semantic_action([&d,&d_update_count]
(itr_type begin,itr_type end) -> bool
{
double temp = 0.0;
if (strtk::string_to_type_converter(begin,end,temp))
{
if (temp >= 99.99)
{
d = temp;
++d_update_count;
}
return true;
}
else
return false;
}).ref(),
semantic_action([&s,&s_update_count]
(itr_type begin,itr_type end) -> bool
{
static unsigned char pattern[] = "ring";
if (end != std::search(begin,end,pattern,pattern + 4))
{
s.assign(begin,end);
++s_update_count;
}
return true;
}).ref());
std::cout << "i: " << i << std::endl;
std::cout << "d: " << d << std::endl;
std::cout << "s: " << s << std::endl;
return 0;
}
Pick A Random Line From A Text File
A random line of text is to be selected from a user provided
text file comprised of lines of varying length, such that the
probability of the line being selected is 1/N where N is the number
of lines in the text file. There are many trivial solutions to this
problem, however if one were to further constrain the problem by
indicating the file is very large (TBs) and that the system upon
which the solution is to run is very limited memory-wise, most if
not all trivial solutions such as storing indexes of all line offsets,
or reading the entire file into memory etc will be eliminated.
However, there exists a very elegant solution to this problem, with
O(N) time and O(1) space complexity, that involves
scanning the entire file once line by line, and at every ith
line choosing whether or not to replace the final result line with
the current line by sampling a uniform distribution of range
[0,1) and performing a replace if and only if the random value is
less than 1/i.

The logic behind this solution revolves around the fact that the
probability of selecting the ith line will be 1/i and as such the
total probability for selecting any of the previous i-1 lines will be
1 - (1/i) or (i - 1)/i. Because there are (i - 1) lines before the
ith line, we divide the previous sum of probabilities by (i - 1),
resulting in a selection probability of 1/i for any one of the lines
up to and including the ith line. If the ith line were to be the last
line of the text file, this then results in each of the lines
having a selection probability of 1/N - simple and sweet, and so
too is the following implementation of said solution:
int main(int argc, char* argv[])
{
std::string file_name = argv[1];
std::string line;
std::size_t i = 0;
strtk::uniform_real_rng rng(static_cast<std::size_t>(::time(0)));
strtk::for_each_line(file_name,
[&](const std::string& s)
{
if (rng() < (1.0 / ++i))
{
line = s;
}
});
std::cout << line << std::endl;
return 0;
}
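As a quick sanity check, the uniformity of the selection can be verified empirically. The following is a small standalone sketch (an illustrative addition, not part of StrTk) that repeats the scan over a fixed number of notional lines using the standard <random> facilities and tallies how often each line ends up selected:
#include <cstddef>
#include <cstdio>
#include <random>
int main()
{
   static const std::size_t n = 5;            // number of notional 'lines'
   static const std::size_t trials = 1000000; // number of simulated scans
   std::mt19937 gen(12345);
   std::uniform_real_distribution<double> dist(0.0,1.0);
   std::size_t tally[n] = {0};
   for (std::size_t t = 0; t < trials; ++t)
   {
      std::size_t selected = 0;
      // scan the n lines, replacing the selection with probability 1/i
      for (std::size_t i = 1; i <= n; ++i)
      {
         if (dist(gen) < (1.0 / i))
            selected = i - 1;
      }
      ++tally[selected];
   }
   // each line should be selected roughly 1/n of the time
   for (std::size_t i = 0; i < n; ++i)
   {
      std::printf("line[%lu]: %5.2f%% (expected %5.2f%%)\n",
                  static_cast<unsigned long>(i),
                  (100.0 * tally[i]) / trials,
                  100.0 / n);
   }
   return 0;
}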
What changes to the above code would be required if the probability of line selection
was changed to 1/(N-K), where 0 <= K <= N and K is the number of lines that will be randomly
ignored?
Note: TAOCP Volume II section 3.4.2 has an in-depth discussion
of this problem, which is generally known as reservoir sampling,
and of other similar problems relating to random distributions. Also one
should note that the above example has an inefficiency: upon
every replace, an actual string is copied. Normally all
one needs to do is maintain a file offset to the beginning of the line;
not doing so causes slowdowns due to continuous memory allocations/reallocations,
which is made all the worse when large lines are encountered.
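A minimal sketch of the offset-based variant suggested in the note follows. It is an illustrative addition rather than the article's code: only the running byte offset of each line's first character is kept, and the selected line is re-read at the end. The offset arithmetic assumes simple LF line termination, so each line contributes size()+1 bytes.
int main(int argc, char* argv[])
{
   if (2 != argc) return 1;
   std::string file_name = argv[1];
   std::size_t i = 0;
   std::size_t offset = 0;          // offset of the line currently being visited
   std::size_t selected_offset = 0; // offset of the currently selected line
   strtk::uniform_real_rng rng(static_cast<std::size_t>(::time(0)));
   strtk::for_each_line(file_name,
                        [&](const std::string& s)
                        {
                           if (rng() < (1.0 / ++i))
                              selected_offset = offset;
                           offset += s.size() + 1; // assumes a single LF terminates each line
                        });
   // re-read only the selected line via the stored offset
   std::ifstream stream(file_name.c_str());
   stream.seekg(static_cast<std::streamoff>(selected_offset));
   std::string line;
   std::getline(stream,line);
   std::cout << line << std::endl;
   return 0;
}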
The Buy Low And Sell High Problem
Assume we are given a piece of data in csv format which represents
a time-series for the close prices of the SPDR S&P 500 ETF.
The time-series ranges from 04/01/1999 to 11/11/2011. The objective
is to select two dates, the first being the buy date and the second
being the sell date, such that if we were to buy then sell X shares
of the ETF we will have maximized our profit. It should be noted that of course
the buy date must be before the sell date and that short selling is not
an option in this strategy. The following is a chart that represents the
price of the SPDR over the given period:
[Chart: SPDR S&P 500 ETF close prices, 04/01/1999 - 11/11/2011]
Through visual inspection we can approximate that the best buy date
would be towards the end of 2002 and the corresponding sell date would
be towards the end of 2007, as these time points seem to give the largest
difference between the two prices. However it also seems that a buy
roughly at the start of 2009 and a sell at the beginning of 2011
could also provide such a large price difference. As such the visual
inspection approach has led to an ambiguity, hence a more thorough
and precise approach is required.
One could take the naive approach to solving the buy low sell high
problem: initially load the entire time series into memory, then
for each ith time point take price[i] and test it against
every other price[j] where i < j, maintaining a "best
profit encountered" structure that contains the best profit so
far and the corresponding buy/sell prices and dates. This solution has
a few problems: it is of O(N^2) time complexity
and O(N) space complexity. As an example, for one million time points
it will require one trillion comparisons and one million date/price
units of storage. If memory and computational processing were limited
on the hardware, this solution could not be practically executed in a
continuous online manner. Furthermore, as suggested by the
time-complexity, as the size of the data grows, regardless of computational
abilities, the compute times for the results would become astronomical
and practically useless - especially for real-time reactive systems.
In these types of problems one tries to assess if an online or
streaming based solution is feasible. That is a solution that does not
require the data to be available all at once, can work on the data
incrementally and requires no more than one-pass for each piece of
data. Such a solution would typically have a time complexity of O(N)
and a space complexity of O(1).
With regards to this problem the crucial insight required to convert
the naive solution from O(N^2) time complexity to an online
version of O(N) time complexity, is that every new global minimum
encountered is the beginning of a new period and an indicator of an
end to the previous period. Looking at the chart, if one were to scan
from left to right, the intuitive response is to find the point with
the lowest price, ignore everything preceding it, then try and find the
next point with the highest price or global maximum. There are a few
edge cases that need to be dealt with. The main one is the situation
described above, where there are two buy/sell points that could
potentially be the solution. The way around this is to simply maintain
the best encountered period, and compare the profit from any new
period to the best so far; if it is better (greater) then replace
the best with the current period. Another edge case is when the data
is in continual decline; in a situation like this there will
be no profitable buy/sell points. The following is a small
subsection of the time-series in question:
Download spy500.csv
15/06/2000,148.16
16/06/2000,146.59
19/06/2000,148.47
20/06/2000,147.94
21/06/2000,147.88
22/06/2000,145.63
23/06/2000,144.38
26/06/2000,146.23
27/06/2000,145.16
28/06/2000,145.56
29/06/2000,144.19
The code below is a very simple single pass incremental solution to the given problem.
It reads every line of the input file and parses each line into a date and price variable.
If the current price is less than the current global minimum price, it sets the current
period's start to the current date and the buy price to the current price; otherwise,
if the current price is larger than the current period's maximum price, it updates the
current profit, sell price and sell date accordingly. If at the end the buy price is not
less than the sell price, this indicates that there exist no two time points within the
given time series for which a profitable transaction could occur; otherwise it prints out
the buy and sell dates for the required transaction and the expected profit per share.
struct period
{
period()
: profit(std::numeric_limits<double>::lowest()),
buy_price(std::numeric_limits<double>::max()),
sell_price(std::numeric_limits<double>::lowest())
{}
bool operator>(const period& p) const
{
return profit > p.profit;
}
double profit;
double buy_price;
double sell_price;
std::string buy_date;
std::string sell_date;
};
int main()
{
period best_period;
period curr_period;
strtk::for_each_line("spy500.csv",
[&best_period,&curr_period]
(const std::string& line)
{
std::string date;
double price;
if (!strtk::parse(line,",",date,price)) return;
if (price < curr_period.buy_price)
{
if (curr_period > best_period)
{
best_period = curr_period;
}
curr_period.buy_date = date;
curr_period.buy_price = price;
curr_period.sell_price = std::numeric_limits<double>::lowest();
curr_period.sell_date = "";
}
else if (price > curr_period.sell_price)
{
curr_period.sell_price = price;
curr_period.sell_date = date;
curr_period.profit = curr_period.sell_price - curr_period.buy_price;
}
});
// commit the final period; without this the last period scanned
// would never be compared against the best encountered period
if (curr_period > best_period)
{
best_period = curr_period;
}
if (best_period.buy_price >= best_period.sell_price)
{
std::cout << "No period in the time-series can be profitably exploited." << std::endl;
}
else
{
std::cout << "Buy: " << best_period.buy_date << std::endl;
std::cout << "Sell: " << best_period.sell_date << std::endl;
std::cout << "Profit per share: $" << best_period.profit << std::endl;
}
return 0;
}
For the given data it is expected the above piece of code will produce the following output:
Buy: 09/10/2002
Sell: 09/10/2007
Profit per share: $78.38
What changes to the above piece of code would be required if:
- Rather than prices, we instead are given percentage differences from the previous price?
- We have the constraint that we can't hold a position for more than K days?
- A penalty of 1/K% is applied to the profit for every day a position is held? (where K is the value described in the previous point)
- Short selling is allowed?
- We want to process independent sections of the time series concurrently so as to speed up overall processing time?
- The prices could be extremely large or small?
Token Grid
StrTk provides a means to easily parse and consume 2D grids of tokens
in an efficient and simple manner. A grid is simply defined as a
series of rows comprised of tokens, otherwise known as Delimiter
Separated Values (DSV). The ith token of a row is grouped with
every ith token of all other rows to produce a column. Tokens can be
processed as either rows or columns.
An example of a simple token grid, where each numeric value is deemed
to be a token:
1.1 | 2.2 | 3.3 | 4.4 | 5.5 |
1.1 | 2.2 | 3.3 | 4.4 | 5.5 |
1.1 | 2.2 | 3.3 | 4.4 | 5.5 |
1.1 | 2.2 | 3.3 | 4.4 | 5.5 |
1.1 | 2.2 | 3.3 | 4.4 | 5.5 |
A token grid can be passed in either via a file or a string buffer.
The delimiters to be used for parsing the columns and rows can also
be provided; if not provided, standard common defaults will be used.
The following demonstrates how each cell in the grid can be accessed
and cast to a specific type.
std::string data = "1,2,3,4,5,6\n"
"7,8,9,0,1,2\n"
"3,4,5,6,7,8\n"
"9,0,1,2,3,4\n"
"5,6,7,8,9,0\n";
strtk::token_grid grid(data,data.size(),",");
for (std::size_t r = 0; r < grid.row_count(); ++r)
{
strtk::token_grid::row_type row = grid.row(r);
for (std::size_t c = 0; c < row.size(); ++c)
{
std::cout << grid.get<int>(r,c) << '\t';
}
std::cout << std::endl;
}
The strtk::token_grid provides various helper functions
for traversing rows and columns in batch mode. The functions are namely:
for_each_row, which is used for iterating either all or a sub-range
of rows of the token_grid, and for_each_column, which is used
for iterating either all or a sub-range of columns of a row.
int main()
{
std::string data = "1.1,2.1,3.1,4.1,5.1,6.1,7.1\n" "1.2,2.2,3.2,4.2,5.2,6.2,7.2\n" "1.3,2.3,3.3,4.3,5.3,6.3,7.3\n" "1.4,2.4,3.4,4.4,5.4,6.4,7.4\n" "1.5,2.5,3.5,4.5,5.5,6.5,7.5\n" "1.6,2.6,3.6,4.6,5.6,6.6,7.6\n" "1.7,2.7,3.7,4.7,5.7,6.7,7.7\n" "1.8,2.8,3.8,4.8,5.8,6.8,7.8\n" "1.9,2.9,3.9,4.9,5.9,6.9,7.9\n";
strtk::token_grid grid(data,data.size(),",");
{
grid.for_each_row(
[](const strtk::token_grid::row_type& row)
{
for (std::size_t i = 0; i < row.size(); ++i)
{
std::cout << row.get<double>(i) << '\t';
}
std::cout << '\n';
});
}
{
grid.for_each_row(grid.range(2,6),
[](const strtk::token_grid::row_type& row)
{
for (std::size_t i = 0; i < row.size(); ++i)
{
std::cout << row.get<double>(i) << '\t';
}
std::cout << '\n';
});
}
{
grid.for_each_row(
[](const strtk::token_grid::row_type& row)
{
row.for_each_column(
[](const strtk::token_grid::row_type::range_type& range)
{
std::cout << std::string(range.first,range.second) << '\t';
});
std::cout << '\n';
});
}
{
grid.for_each_row(grid.range(2,6),
[](const strtk::token_grid::row_type& row)
{
row.for_each_column(row.range(1,5),
[](const strtk::token_grid::row_type::range_type& range)
{
std::cout << std::string(range.first,range.second) << '\t';
});
std::cout << '\n';
});
}
return 0;
}
The following example demonstrates how averages over rows and columns
of a token grid can be computed:
std::string data = "1.1,1.1,1.1,1.1,1.1,1.1\n"
"2.2,2.2,2.2,2.2,2.2,2.2\n"
"3.3,3.3,3.3,3.3,3.3,3.3\n"
"4.4,4.4,4.4,4.4,4.4,4.4\n"
"5.5,5.5,5.5,5.5,5.5,5.5\n"
"6.6,6.6,6.6,6.6,6.6,6.6\n";
strtk::token_grid grid(data,data.size(),",");
std::vector<double> avg_c(grid.row(0).size(),0.0);
std::vector<double> avg_r(grid.row_count(),0.0);
std::vector<double> tmp(grid.row(0).size(),0.0);
std::fill(avg_c.begin(),avg_c.end(),0.0);
for (std::size_t i = 0; i < grid.row_count(); ++i)
{
grid.row(i).parse<double>(tmp.begin());
std::transform(avg_c.begin(),avg_c.end(),tmp.begin(),avg_c.begin(),std::plus<double>());
avg_r[i] = std::accumulate(tmp.begin(),tmp.end(),0.0) / tmp.size();
}
for (std::size_t i = 0; i < avg_c.size(); avg_c[i++] /= grid.row_count());
std::cout << "Column Averages:\t";
std::copy(avg_c.begin(),avg_c.end(),std::ostream_iterator<double>(std::cout,"\t"));
std::cout << '\n';
std::cout << "Row Averages:\t";
std::copy(avg_r.begin(),avg_r.end(),std::ostream_iterator<double>(std::cout,"\t"));
std::cout << '\n';
Processing Of Comma Separated Values Data
The original intent of the token grid was to support fast and
efficient processing of simple data tuples, such as comma separated
values (CSV) formats et al. The following example demonstrates a
simple summation of traded floor volume and average daily volume based
on NASDAQ OHLC (Open-High-Low-Close) data.
std::string market_data = "20090701,GOOG,424.2000,418.9900,426.4000,418.1500,2310768\n"
"20090701,MSFT,24.0500,24.0400,24.3000,23.9600,54915127\n"
"20090702,GOOG,415.4100,408.4900,415.4100,406.8100,2517630\n"
"20090702,MSFT,23.7600,23.3700,24.0400,23.2100,65427699\n"
"20090703,GOOG,408.4900,408.4900,408.4900,408.4900,0\n"
"20090703,MSFT,23.3700,23.3700,23.3700,23.3700,0\n"
"20090706,GOOG,406.5000,409.6100,410.6400,401.6600,2262557\n"
"20090706,MSFT,23.2100,23.2000,23.2800,22.8700,49207638\n"
"20090707,GOOG,408.2400,396.6300,409.1900,395.9801,3260307\n"
"20090707,MSFT,23.0800,22.5300,23.1400,22.4600,52842412\n"
"20090708,GOOG,400.0000,402.4900,406.0000,398.0600,3441854\n"
"20090708,MSFT,22.3100,22.5600,22.6900,2200000,73023306\n"
"20090709,GOOG,406.1200,410.3900,414.4500,405.8000,3275816\n"
"20090709,MSFT,22.6500,22.4400,22.8100,22.3700,46981174\n"
"20090710,GOOG,409.5700,414.4000,417.3700,408.7000,2929559\n"
"20090710,MSFT,22.1900,22.3900,22.5400,22.1500,43238698\n";
strtk::token_grid grid(market_data,market_data.size(),",");
struct stock_info
{
stock_info(const std::string& s = " ")
: symbol(s),
total_volume(0),
day_count(0),
average_daily_volume(0.0)
{}
std::string symbol;
unsigned long long total_volume;
std::size_t day_count;
double average_daily_volume;
};
stock_info goog("GOOG");
stock_info msft("MSFT");
static const std::size_t volume_column = 6;
static const std::size_t symbol_column = 1;
goog.day_count = grid.accumulate_column(volume_column,
[](const strtk::token_grid::row_type& row) -> bool
{
static const std::string google_symbol("GOOG");
return google_symbol == row.get<std::string>(symbol_column);
},
goog.total_volume);
msft.day_count = grid.accumulate_column(volume_column,
[](const strtk::token_grid::row_type& row) -> bool
{
static const std::string microsoft_symbol("MSFT");
return microsoft_symbol == row.get<std::string>(symbol_column);
},
msft.total_volume);
goog.average_daily_volume = (1.0 * goog.total_volume) / goog.day_count;
msft.average_daily_volume = (1.0 * msft.total_volume) / msft.day_count;
std::cout << "[GOOG] Total Volume: " << goog.total_volume << std::endl;
std::cout << "[MSFT] Total Volume: " << msft.total_volume << std::endl;
std::cout << "[GOOG] ADV: " << goog.average_daily_volume << std::endl;
std::cout << "[MSFT] ADV: " << msft.average_daily_volume << std::endl;
The strtk::token_grid is thread-safe provided only read operations
are in play. As such the above calls to accumulate_column et al. can
all be safely and easily executed concurrently using threads. This
allows for a far more efficient data processing methodology.
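By way of illustration, the following sketch (an addition, assuming the grid, goog, msft, volume_column and symbol_column definitions from the example above are in scope) runs the two accumulations on separate std::thread instances; since both calls only read from the token_grid, no synchronisation is required:
std::thread goog_thread([&]()
{
   goog.day_count = grid.accumulate_column(volume_column,
                      [](const strtk::token_grid::row_type& row) -> bool
                      {
                         return "GOOG" == row.get<std::string>(symbol_column);
                      },
                      goog.total_volume);
});
std::thread msft_thread([&]()
{
   msft.day_count = grid.accumulate_column(volume_column,
                      [](const strtk::token_grid::row_type& row) -> bool
                      {
                         return "MSFT" == row.get<std::string>(symbol_column);
                      },
                      msft.total_volume);
});
goog_thread.join();
msft_thread.join();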
TIMTOWTDI
Playing devil's advocate, there is another way of performing the above
processing task. Assuming only the specific values for computing the
ADV are required and no further processing of the CSV data is
needed, the problem can be solved efficiently by utilizing a
single pass of the data as follows:
std::string file_name = "market_data.csv";
std::unordered_map<std::string,stock_info> stock_map;
stock_map.insert(std::make_pair("GOOG",stock_info("GOOG")));
stock_map.insert(std::make_pair("MSFT",stock_info("MSFT")));
strtk::for_each_line(file_name,
[&](const std::string& line)
{
strtk::ignore_token ignore;
stock_info temp;
const bool result = strtk::parse(line,
",",
ignore,
temp.symbol,
ignore,
ignore,
ignore,
ignore,
temp.total_volume);
if (!result) return;
auto itr = stock_map.find(temp.symbol);
if (stock_map.end() == itr) return;
(*itr).second.total_volume += temp.total_volume;
(*itr).second.day_count++;
});
auto itr = stock_map.begin();
auto end = stock_map.end();
while (end != itr)
{
stock_info& stock = (*itr++).second;
stock.average_daily_volume = (1.0 * stock.total_volume) / stock.day_count;
}
TIMTOWTDI - II (with a vengeance)
Playing the devil's other advocate: the above two examples have both
required that the filter condition be explicitly defined at compile
time. Even though the condition may be set in stone at
compile time, some of the underlyings (such as symbol) can be
engineered to be modified at run-time. That still doesn't give us the
freedom to perform arbitrarily complex filter expressions determined
at run-time.
That said, an extremely efficient and very simple solution is at hand. The solution
is called the C++ DSV Filter library, which is based on the StrTk and ExprTk libraries.
It uses the strtk::token_grid as a CSV/DSV store and index, and ExprTk as
the underlying expression evaluation engine. The example below takes the OHLC market
data table defined above and performs a row-wise query. The expression's definition
is:
select all rows where the open price is greater than the close price and the
symbol matches the wild-card pattern of "*FT*" and the date is equal to or
after 20090101.
int main()
{
std::string file_name = "market_data.csv";
dsv_filter filter;
filter.set_input_delimiter(",");
if (!filter.load(file_name))
return 1;
std::string expression = "(open > close) and (symbol like '*FT*') and (date >= '20090101')";
filter.add_filter(expression);
for (std::size_t row = 1; row < filter.row_count(); ++row)
{
if (dsv_filter::e_match == filter[row])
{
// process the matching row (eg: print or collect it)
}
}
return 0;
}
Other example queries:
volume >= 1000000 and symbol == 'GOOG'
abs(open - close) > abs(high - low)
avg(open,close,high,low) * volume > 10^7 and inrange('20090702',date,'20090730')
(open > close) and (symbol like '*FT*') and (date >= '20090101')
It should be noted that in the example above, the rows begin at index 1.
That is done because the dsv_filter expects the first row (the
row at index 0) to be a column definition header. The format of the column
definitions is to simply add a suffix of "_s" if the values in the column
are to be treated as strings or "_n" if they are to be treated as numbers.
When defining expressions the suffixes should not be included in
the column names. The above mentioned OHLC csv file's header would be as
follows:
Date_s,Symbol_s,Open_n,Close_n,High_n,Low_n,Volume_n
Sequential Partitions
A typical operation carried out upon time-series data is to group
tuples into buckets (or bins) based upon the time index
value. For example grouping data into 2-minute buckets and then
performing some kind of operation upon the grouped tuples such as a
summation or an average etc. This process is sometimes also called
"discretization".
The strtk::token_grid class provides a method known as
sequential_partition. The sequential_partition
method requires a Transition Predicate, a Function
and optionally a row-range. The Transition Predicate
consumes a row and returns true only if the row is
in the next partition. All subsequent consecutive rows, until
the transition predicate next returns true, are said to be
in the current partition. Prior to transitioning to a new partition,
the function predicate is invoked and provided with the range of
rows in the current partition.
The following example takes a simple time-series (time and
value), partitions the tuples into groups of Time-Buckets
of period length 3 and then proceeds to compute the total sum of each
group. The summarizer class below provides both the
Transition Predicate and the Function.

class summarizer
{
public:
enum column_index
{
tick_time_column = 0,
tick_value_column = 1
};
summarizer(std::deque<double>& sum_value)
: next_tick_time_(0),
sum_value_(sum_value)
{}
inline bool operator()(const strtk::token_grid::row_type& row)
{
if (row.get<std::size_t>(tick_time_column) >= next_tick_time_)
{
next_tick_time_ = row.get<std::size_t>(tick_time_column) + 3;
return true;
}
else
return false;
}
inline bool operator()(const strtk::token_grid& grid,
const strtk::token_grid::row_range_type& range)
{
double bucket_sum = 0.0;
if (!grid.accumulate_column(tick_value_column,range,bucket_sum))
{
std::cout << "failed to accumulate!" << std::endl;
return false;
}
else
sum_value_.push_back(bucket_sum);
return true;
}
private:
summarizer& operator=(const summarizer&);
std::size_t next_tick_time_;
std::deque<double>& sum_value_;
};
int main()
{
std::string data = "10000,123.456\n"
"10001,612.345\n"
"10002,561.234\n"
"10003,456.123\n"
"10004,345.612\n"
"10005,234.561\n"
"10006,123.456\n";
strtk::token_grid grid(data,data.size(),",");
std::deque<double> sum_value;
summarizer s(sum_value);
grid.sequential_partition(s,s);
for (std::size_t i = 0; i < sum_value.size(); ++i)
{
std::cout << "bucket[" << i << "] = " << sum_value[i] << std::endl;
}
return 0;
}
The expected output is as follows:
bucket[0] = 1297.035
bucket[1] = 1036.296
bucket[2] = 123.456
Parsing CSV Files With Embedded Double-Quotes
One of the simple extensions to the CSV format is the concept of
double quoted tokens. Such tokens may contain column or row
delimiters. When such a scenario is encountered, all subsequent
delimiters are ignored, and kept as part of the token, until the
corresponding closing double quote is encountered. The StrTk
token_grid supports the parsing of such tokens. This
parsing mode can be easily activated via the token_grid
option set. Below is an example of a token_grid loading a
CSV data set representing various airports from around the world and
their specific codes and locations, in which some of the cells are
double-quoted:
ICAO | IATA | Airport | City | Country |
AYGA | GKA | "Goroka Gatue" | Goroka | Papua New Guinea |
BGCO | GCO | "Nerlerit Inaat Constable Pynt" | "Nerlerit Inaat" | Greenland |
BZGD | ZGD | Godley | Auckland | New Zealand |
CYQM | YQM | "Greater Moncton International" | Moncton | Canada |
EDRK | ZNV | "Koblenz Winningen" | Koblenz | Germany |
FAHU | AHU | "HMS Bastard Memorial" | Kwazulu-Natal | South Africa |
FQMP | MZB | "Mocimboa Da Praia" | "Mocimboa Da Praia" | Mozambique |
KINS | INS | "Indian Springs AF AUX" | Indian Springs | USA |
UHNN | HNN | Nikolaevsk | "Nikolaevsk Na Amure" | Russia |
WBKK | BKI | "Kota Kinabalu International" | Kota Kinabalu | Malaysia |
ZSJD | JDZ | "Jingdezhen Airport" | Jingdezhen | China |
The following is the StrTk code example using token_grid to parse the above CSV data set:
int main()
{
std::string airport_data_file_name = "airport_data.csv";
strtk::token_grid::options options;
options.column_delimiters = "| ,";
options.support_dquotes = true;
strtk::token_grid airport_grid(airport_data_file_name,options);
for (std::size_t r = 0; r < airport_grid.row_count(); ++r)
{
strtk::token_grid::row_type row = airport_grid.row(r);
for (std::size_t c = 0; c < row.size(); ++c)
{
std::cout << "[" << row.get<std::string>(c) << "] ";
}
std::cout << std::endl;
}
return 0;
}
Extending Delimiter Predicates
As previously mentioned the concept of a delimiter based predicate
can lead to some very interesting solutions. A predicate as has been
defined so far, with the exception of the offset predicate, has been
a stateless entity. Adding the ability to maintain a state based on
what the predicate has encountered so far can allow it to behave
differently from the simple single and multiple delimiter predicates.
For this example, let's assume a typical command line parsing problem
which consists of double quotation mark groupings and escapable
special characters, which can be considered dual use as either
delimiters or data. An example input and output is as follows:
Inputs
Data | abc;"123, mno xyz",i\,jk |
Delimiters | <space>;,. |
Output Tokens
Token0 | abc |
Token1 | 123, mno xyz |
Token2 | i\,jk |
In order to tokenize the above described string, one can create a
composite predicate using a multiple char delimiter predicate and some
simple state rules. The following is an example of such an extended
predicate:
class extended_predicate
{
public:
extended_predicate(const std::string& delimiters)
: escape_(false),
in_bracket_range_(false),
mdp_(delimiters)
{}
inline bool operator()(const unsigned char c) const
{
if (escape_)
{
escape_ = false;
return false;
}
else if ('\\' == c)
{
escape_ = true;
return false;
}
else if ('"' == c)
{
in_bracket_range_ = !in_bracket_range_;
return true;
}
else if (in_bracket_range_)
return false;
else
return mdp_(c);
}
inline void reset()
{
escape_ = false;
in_bracket_range_ = false;
}
private:
mutable bool escape_;
mutable bool in_bracket_range_;
mutable strtk::multiple_char_delimiter_predicate mdp_;
};
Usage of the newly defined extended predicate is as follows:
int main()
{
std::string str = "abc;\"123, mno xyz\",i\\,jk";
strtk::std_string::token_list_type token_list;
strtk::split(extended_predicate(".,; "),
str,
std::back_inserter(token_list),
strtk::split_options::compress_delimiters);
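// Illustrative addition (not part of the original example): print each
// token by dereferencing the iterator ranges held in the token_list
for (auto itr = token_list.begin(); token_list.end() != itr; ++itr)
{
std::cout << "[" << std::string(itr->first,itr->second) << "] ";
}
std::cout << std::endl;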
return 0;
}
High Performance Key-Value Parsing
Taking our previous person struct as an example, it
is clear that the tuple format has to be very specific with
regards to the ordering of data. For most situations this is
acceptable as the serializers and deserializers attempt to
function in the simplest manner possible, however sometimes
pieces of information may come in different orderings, or may be
deemed optional and hence not be present in a particular
tuple of data.
An approach is required that provides the means of mapping
specific pieces of data to the corresponding variables (or
members) that will store or make sense of them, in a very
efficient and simple manner. For clarity purposes, the term
"mapping" here means to populate the desired member with the
given data. The association between the two is made via the
key that is paired in-line with the data. Key-Value pairs, sometimes
known as Attribute-Value pairs, are one of the means
by which such an association can be accomplished. The following
is a diagram that demonstrates the mapping of various fields
of a data struct to their corresponding data elements in a tuple
comprised of key-value pairs.
[Diagram: fields of a data struct mapped to the corresponding key-value pairs within a tuple]
In the diagram above, the key-value pairs are separated (delimited)
by the pipe symbol "|". With regards to the key-value pairs
themselves, the key is traditionally the first element in the pair
and the value is the second element; they are separated in this case
by a single equal sign "=".
Back to our original problem relating to the person struct:
if we were to add keys to each of the pieces of data, we could then not
only parse a tuple of data representing the various fields of the person
struct, but those fields could be in any order within the tuple. Furthermore
any of the fields could be deemed optional, and hence not necessarily present
in the tuple. It should be noted that what denotes a correct or successful
parse of a tuple may not only depend upon successful parsing of ranges into the
various types, but may also depend upon the mandatory presence of certain
fields.
Our objective will be to successfully, and in the most efficient way
possible, parse the following list of tuples that represent instances
of our person struct.
Tuple0 = "ID=0xFA37ED12|NAME=Rumpelstiltskin|AGE=397|HEIGHT=1.31|WEIGHT=58.7"
Tuple1 = "NAME=Rumpelstiltskin|AGE=397|HEIGHT=1.31|WEIGHT=58.7|ID=0xFA37ED12"
Tuple2 = "AGE=397|HEIGHT=1.31|WEIGHT=58.7|ID=0xFA37ED12|NAME=Rumpelstiltskin"
Tuple3 = "HEIGHT=1.31|WEIGHT=58.7|ID=0xFA37ED12|NAME=Rumpelstiltskin|AGE=397"
Tuple4 = "WEIGHT=58.7|ID=0xFA37ED12|NAME=Rumpelstiltskin|AGE=397|HEIGHT=1.31"
Tuple5 = "ID=0xFA37ED12|WEIGHT=58.7|AGE=397|NAME=Rumpelstiltskin|HEIGHT=1.31"
Tuple6 = "WEIGHT=58.7|AGE=397|NAME=Rumpelstiltskin|HEIGHT=1.31|ID=0xFA37ED12"
Tuple7 = "AGE=397|NAME=Rumpelstiltskin|HEIGHT=1.31|ID=0xFA37ED12|WEIGHT=58.7"
Tuple8 = "NAME=Rumpelstiltskin|HEIGHT=1.31|ID=0xFA37ED12|WEIGHT=58.7|AGE=397"
Tuple9 = "HEIGHT=1.31|ID=0xFA37ED12|WEIGHT=58.7|AGE=397|NAME=Rumpelstiltskin"
StrTk provides a means to achieve the above key-value pair parsing task, namely
via the strtk::keyvalue::parser and associated key-mappers. The
following is a table that depicts the various kinds of key to value mappers that
are available in the StrTk library:
Key To Value Mappers
Mapper | Type | Key Lookup Complexity | Maximum Size |
strtk::keyvalue::stringkey_map | std::string | O(log(n)) | Limited to available memory |
strtk::keyvalue::uintkey_map | cardinal value | O(1) | Limited to expected maximum key value |
In the code below we begin by defining the delimiters we expect to see
between pairs of key-values and between the keys and values themselves.
The pair_block_delimiter field denotes the delimiter
we expect between pairs (or blocks) of key-values, and the field pair_delimiter
denotes the delimiter we expect between a particular key and value.
Next we define p as an instance of the person struct and
register each of the members of p with a corresponding
key via the keyvalue::parser. We then process
each tuple of data, parsing the tuple, populating the instance p, then
printing out the various fields to stdout.
struct person
{
unsigned int id;
std::string name;
unsigned int age;
double height;
float weight;
};
int main()
{
const std::string person_data[] =
{
"ID=0xFA37ED12|NAME=Rumpelstiltskin|AGE=397|HEIGHT=1.31|WEIGHT=58.7",
"NAME=Rumpelstiltskin|AGE=397|HEIGHT=1.31|WEIGHT=58.7|ID=0xFA37ED12",
"AGE=397|HEIGHT=1.31|WEIGHT=58.7|ID=0xFA37ED12|NAME=Rumpelstiltskin",
"HEIGHT=1.31|WEIGHT=58.7|ID=0xFA37ED12|NAME=Rumpelstiltskin|AGE=397",
"WEIGHT=58.7|ID=0xFA37ED12|NAME=Rumpelstiltskin|AGE=397|HEIGHT=1.31",
"ID=0xFA37ED12|WEIGHT=58.7|AGE=397|NAME=Rumpelstiltskin|HEIGHT=1.31",
"WEIGHT=58.7|AGE=397|NAME=Rumpelstiltskin|HEIGHT=1.31|ID=0xFA37ED12",
"AGE=397|NAME=Rumpelstiltskin|HEIGHT=1.31|ID=0xFA37ED12|WEIGHT=58.7",
"NAME=Rumpelstiltskin|HEIGHT=1.31|ID=0xFA37ED12|WEIGHT=58.7|AGE=397",
"HEIGHT=1.31|ID=0xFA37ED12|WEIGHT=58.7|AGE=397|NAME=Rumpelstiltskin",
};
const std::size_t person_data_size = sizeof(person_data) / sizeof(std::string);
typedef unsigned char char_type;
typedef strtk::keyvalue::parser<strtk::keyvalue::stringkey_map> kvp_type;
typedef strtk::keyvalue::options<char_type> opts_type;
opts_type options;
options.pair_block_delimiter = '|';
options.pair_delimiter = '=';
kvp_type kvp(options);
person p;
strtk::hex_to_number_sink<unsigned int> h2ns(p.id);
kvp.register_keyvalue("ID" , h2ns);
kvp.register_keyvalue("NAME" , p. name);
kvp.register_keyvalue("AGE" , p. age);
kvp.register_keyvalue("HEIGHT", p.height);
kvp.register_keyvalue("WEIGHT", p.weight);
for (std::size_t i = 0; i < person_data_size; ++i)
{
if (kvp(person_data[i]))
{
std::cout << i << '\t'
<< "ID: " << p.id << '\t'
<< "Name: " << p.name << '\t'
<< "AGE: " << p.age << '\t'
<< "HEIGHT: " << p.height << '\t'
<< "WEIGHT: " << p.weight << '\n';
}
}
return 0;
}
It should be noted that the underlying associative container for the
strtk::keyvalue::stringkey_map can be explicitly specified
at compile time. By default it is std::map, however it can
easily be changed to std::unordered_map, giving it a key
lookup complexity of O(1), or in fact to any other container that is
compatible with STL associative map semantics. The following are three
examples of keyvalue::parser instances specialized using std::unordered_map,
the first being based on the std::string type, the second
on the int type and the third on the double type:
int main()
{
{
typedef unsigned char char_type;
typedef std::unordered_map<std::string,strtk::util::value> key_to_val_map_t;
typedef strtk::keyvalue::key_map<std::string,key_to_val_map_t> stringhashkey_mapper;
typedef strtk::keyvalue::parser<stringhashkey_mapper> kvp_type;
typedef strtk::keyvalue::options<char_type> opts_type;
opts_type options;
options.pair_block_delimiter = '|';
options.pair_delimiter = '=';
kvp_type kvp(options);
}
{
typedef unsigned char char_type;
typedef std::unordered_map<int,strtk::util::value> key_to_val_map_t;
typedef strtk::keyvalue::key_map<int,key_to_val_map_t> inthash_key_mapper;
typedef strtk::keyvalue::parser<inthash_key_mapper> kvp_type;
typedef strtk::keyvalue::options<char_type> opts_type;
opts_type options;
options.pair_block_delimiter = '|';
options.pair_delimiter = '=';
kvp_type kvp(options);
}
{
typedef unsigned char char_type;
typedef std::unordered_map<double,strtk::util::value> key_to_val_map_t;
typedef strtk::keyvalue::key_map<double,key_to_val_map_t> doublehash_key_mapper;
typedef strtk::keyvalue::parser<doublehash_key_mapper> kvp_type;
typedef strtk::keyvalue::options<char_type> opts_type;
opts_type options;
options.pair_block_delimiter = '|';
options.pair_delimiter = '=';
kvp_type kvp(options);
}
return 0;
}
Key-Value Pairs And Multiple Distinct Data Structures
To further the previous example, one might have a situation where
a tuple contains values for members of different structures.
The following example demonstrates parsing of key-value pairs that
map to members of multiple types. The tuple in the example consists
of values, some of which are intended for the instance of data1 and
others which are intended for the instance of data2.
struct data1
{
int x0;
std::string x1;
double x2;
float x3;
};
struct data2
{
int y0;
std::string y1;
double y2;
float y3;
};
int main()
{
typedef unsigned char char_type;
typedef strtk::keyvalue::parser<strtk::keyvalue::stringkey_map> kvp_type;
typedef strtk::keyvalue::options<char_type> opts_type;
opts_type options;
options.pair_block_delimiter = '|';
options.pair_delimiter = '=';
kvp_type kvp(options);
data1 d1;
data2 d2;
std::string data = "D1X0=123456789|D1X1=AbCdEfG123456|D1X2=123.456|D1X3=0101.0202f|"
"D2Y0=987654321|D2Y1=tUvWxYz789012|D2Y2=789.012|D2Y3=0707.0909f";
kvp.register_keyvalue("D1X0", d1.x0);
kvp.register_keyvalue("D1X1", d1.x1);
kvp.register_keyvalue("D1X2", d1.x2);
kvp.register_keyvalue("D1X3", d1.x3);
kvp.register_keyvalue("D2Y0", d2.y0);
kvp.register_keyvalue("D2Y1", d2.y1);
kvp.register_keyvalue("D2Y2", d2.y2);
kvp.register_keyvalue("D2Y3", d2.y3);
kvp(data);
std::cout << "D1X0: " << d1.x0 << '\t'
<< "D1X1: " << d1.x1 << '\t'
<< "D1X2: " << d1.x2 << '\t'
<< "D1X3: " << d1.x3 << '\n';
std::cout << "D2Y0: " << d2.y0 << '\t'
<< "D2Y1: " << d2.y1 << '\t'
<< "D2Y2: " << d2.y2 << '\t'
<< "D2Y3: " << d2.y3 << '\n';
return 0;
}
Key-Value Pairs And Lists
The values in the key-value pairs need not always be singular. In some
scenarios a value could be a list of values. The StrTk key-value parser
supports parsing of such key-value pairs through the previously demonstrated
sequence sink mechanisms.
In the following example we have a complex_data struct that
consists of some PODs but also a number of sequences (vector, deque, list).
The tuple to be parsed consists of simple key-value pairs for the PODs, and
more complex looking pairs where the values for a specific sequence
are separated by commas ",".
struct complex_data
{
unsigned int v0;
std::vector<int> v1;
double v2;
std::deque<double> v3;
std::string v4;
std::list<std::string> v5;
};
int main()
{
typedef unsigned char char_type;
typedef strtk::keyvalue::parser<strtk::keyvalue::stringkey_map> kvp_type;
typedef strtk::keyvalue::options<char_type> opts_type;
opts_type options;
options.pair_block_delimiter = '|';
options.pair_delimiter = '=';
kvp_type kvp(options);
complex_data cd;
std::string data = "V0=123456789|V1=-3,-2,-1,0,+1,2+3|V2=111.222|V3=1.1,2.2,3.3,4.4,5.5|"
"V4=Some Text|V5=Text1,Text2,Text3,Text4";
strtk::vector_sink<int>::type vec_sink(",");
strtk::deque_sink<double>::type deq_sink(",");
strtk::list_sink<std::string>::type lst_sink(",");
kvp.register_keyvalue("V0", cd.v0);
kvp.register_keyvalue("V1", vec_sink(cd.v1));
kvp.register_keyvalue("V2", cd.v2);
kvp.register_keyvalue("V3", deq_sink(cd.v3));
kvp.register_keyvalue("V4", cd.v4);
kvp.register_keyvalue("V5", lst_sink(cd.v5));
kvp(data);
std::cout << "V0: " << cd.v0 << '\n'
<< "V1: " << strtk::join(" ",cd.v1) << '\n'
<< "V2: " << cd.v2 << '\n'
<< "V3: " << strtk::join(" ",cd.v3) << '\n'
<< "V4: " << cd.v4 << '\n'
<< "V5: " << strtk::join(" ",cd.v5) << '\n';
return 0;
}
Key-Value Pairs With Cardinal Keys
So far all the examples have assumed keys of arbitrary values. In
some situations, such as the FIX protocol,
the keys are always guaranteed to be positive integer values. If
this is the case then a different kind of key-mapper can be used
that is much more efficient than the general purpose string key-mapper,
providing a key lookup complexity of O(1). As such StrTk provides
the strtk::keyvalue::uintkey_map type for this purpose.
The only difference in setting up the uintkey_map
compared to the stringkey_map is that it requires a key_count
to be set. This value represents the largest possible key value that
can exist. The following is an example of how the strtk::keyvalue::uintkey_map
key-mapper can be used:
struct data_store
{
char c;
unsigned char uc;
short s;
unsigned short us;
int i;
std::string str;
};
int main()
{
const std::string data = "121=A|122=z|123=-123|124=456|125=-12345678|126=Some simple text";
typedef strtk::keyvalue::parser<strtk::keyvalue::uintkey_map> kvp_type;
strtk::keyvalue::uintkey_map::options options;
options.key_count = 127;
options.pair_block_delimiter = '|';
options.pair_delimiter = '=';
kvp_type kvp(options);
data_store ds;
kvp.register_keyvalue(121,ds.c);
kvp.register_keyvalue(122,ds.uc);
kvp.register_keyvalue(123,ds.s);
kvp.register_keyvalue(124,ds.us);
kvp.register_keyvalue(125,ds.i);
kvp.register_keyvalue(126,ds.str);
kvp(data);
return 0;
}
Optional Key-Value Pairs
As previously mentioned there is a use-case where certain fields
may be deemed optional, and hence their absence would not constitute a
parsing error. That said, it would also be beneficial to know whether
a particular field has been populated once the sequence of key-value
pairs has been completely parsed and mapped. StrTk provides such
functionality through the use of the strtk::util::attribute type.
The strtk::util::attribute acts as a proxy for the underlying
type upon which it is specialized. It provides a conversion cast to the
underlying type, and also maintains an 'initialised' state value that can
be used to query the attribute about the underlying type's initialised
status. The following is an example of how one could use the
strtk::util::attribute type in conjunction with the
keyvalue::parser:
struct data_store
{
strtk::util::attribute<int> d1;
strtk::util::attribute<unsigned int> d2;
strtk::util::attribute<double> d3;
strtk::util::attribute<float> d4;
strtk::util::attribute<std::string> d5;
};
int main()
{
std::string data = "INT=-1234|UINT=+5678|DOUBLE=1234.5678|FLOAT=90123.4567f|STRING=Some simple text";
typedef unsigned char char_type;
typedef strtk::keyvalue::parser<strtk::keyvalue::stringkey_map> kvp_type;
typedef strtk::keyvalue::options<char_type> opts_type;
opts_type options;
options.pair_block_delimiter = '|';
options.pair_delimiter = '=';
kvp_type kvp(options);
data_store ds;
ds.d1.initialised() = false;
ds.d2.initialised() = false;
ds.d3.initialised() = false;
ds.d4.initialised() = false;
ds.d5.initialised() = false;
kvp.register_keyvalue("INT" , ds.d1);
kvp.register_keyvalue("UINT" , ds.d2);
kvp.register_keyvalue("DOUBLE", ds.d3);
kvp.register_keyvalue("FLOAT" , ds.d4);
kvp.register_keyvalue("STRING", ds.d5);
if (!kvp(data))
{
std::cout << "Failed to parse key-value data: " << data << std::endl;
return 1;
}
if (!ds.d1.initialised()) { std::cout << "d1 has not been initialised." << std::endl; }
if (!ds.d2.initialised()) { std::cout << "d2 has not been initialised." << std::endl; }
if (!ds.d3.initialised()) { std::cout << "d3 has not been initialised." << std::endl; }
if (!ds.d4.initialised()) { std::cout << "d4 has not been initialised." << std::endl; }
if (!ds.d5.initialised()) { std::cout << "d5 has not been initialised." << std::endl; }
return 0;
}
Semantic Actions with Key-Value Pairs
There may be times when, as key-value pairs are being parsed, certain
actions need to be executed or behaviours exhibited in-situ with the
parsing process. As previously mentioned, StrTk provides the type
semantic_action, which can act as a proxy for a generic
type during the parsing process, taking a functor or lambda
and executing it at the conversion call. The following example
demonstrates the parsing of an array of tuples comprised of key-value
pairs that map to members of a struct named data_store.
The keys 111, 222 and 333 each represent a specific value type, and
each also requires a certain behaviour to be exhibited. In this example, for
simplicity, as the values of the various keys are being parsed, a
simple message will be printed to the console denoting the nature of
the parsing process. The code is as follows:
struct data_store
{
data_store()
: i1(0),
d1(0.0),
s1("")
{}
int i1;
double d1;
std::string s1;
};
int main()
{
static const std::string data[] =
{
"111=-12345|222=1987.654321|333=An interesting string0", "222=2987.654321|333=An interesting string1|111=+12345", "333=An interesting string2|111=-12345|222=3987.654321", "222=4987.654321|111=+12345|333=An interesting string3", "333=An interesting string4|222=5987.654321|111=-12345", "111=+12345|333=An interesting string5|222=6987.654321", };
static const std::size_t data_size = sizeof(data) / sizeof(std::string);
typedef strtk::keyvalue::parser<strtk::keyvalue::uintkey_map> kvp_type;
strtk::keyvalue::uintkey_map::options options;
options.key_count = 334;
options.pair_block_delimiter = '|';
options.pair_delimiter = '=';
kvp_type kvp(options);
data_store ds;
typedef strtk::range::ustring::const_iterator itr_type;
using strtk::util::semantic_action;
auto lambda_key111 = [&ds](itr_type begin,itr_type end) -> bool
{
if (!strtk::string_to_type_converter(begin,end,ds.i1))
{
std::cout << "Failed to parse value for key=111\n";
return false;
}
std::cout << "key[111] value=" << ds.i1 << std::endl;
return true;
};
auto lambda_key222 = [&ds](itr_type begin,itr_type end) -> bool
{
if (!strtk::string_to_type_converter(begin,end,ds.d1))
{
std::cout << "Failed to parse value for key=222\n";
return false;
}
std::cout << "key[222] value=" << ds.d1 << std::endl;
return true;
};
auto lambda_key333 = [&ds](itr_type begin,itr_type end) -> bool
{
if (!strtk::string_to_type_converter(begin,end,ds.s1))
{
std::cout << "Failed to parse value for key=333\n";
return false;
}
std::cout << "key[333] value=" << ds.s1 << std::endl;
return true;
};
kvp.register_keyvalue(111,semantic_action(lambda_key111).ref());
kvp.register_keyvalue(222,semantic_action(lambda_key222).ref());
kvp.register_keyvalue(333,semantic_action(lambda_key333).ref());
for (std::size_t i = 0; i < data_size; ++i)
{
if (!kvp(data[i]))
std::cout << "Error while parsing tuple: " << i << std::endl;
std::cout << std::endl;
}
return 0;
}
Note: Semantic actions during the parsing process have a multitude
of uses, some of which are: validation of parsed values (that is,
making sure they're in a specified range or within a predefined set
of values), complex parsed value manipulations, invoking external
state-machines to transition to new states based on the parsed value,
or even simply acting on the presence of a key.
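To make the validation use-case concrete, the following is a minimal sketch (an illustrative addition, reusing only constructs shown in the earlier examples) in which a semantic_action fails the parse outright unless the parsed value lies within an expected range:
int main()
{
   typedef strtk::range::ustring::const_iterator itr_type;
   using strtk::util::semantic_action;
   int percentage = 0;
   std::string data = "57";
   bool result = strtk::parse(data,",",
                    semantic_action([&percentage](itr_type begin,itr_type end) -> bool
                    {
                       int temp = 0;
                       if (!strtk::string_to_type_converter(begin,end,temp))
                          return false;
                       // reject values outside of the range [0,100]
                       if ((temp < 0) || (temp > 100))
                          return false;
                       percentage = temp;
                       return true;
                    }).ref());
   std::cout << (result ? "valid" : "invalid") << std::endl;
   return 0;
}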
An Attempt At Improving File Processing Performance
In a number of the examples above, a call to strtk::for_each_line
was invoked. The routine is intended to provide a simple wrapper around
the error-prone boilerplate code that is required when processing a
text file line by line.
The idea behind strtk::for_each_line is that the user
provides a lambda/functor that accepts a std::string; the routine will
in turn invoke the lambda, passing the current line it has read from
the file. All other details related to opening the file, reading
lines from it and handling errors are transparent to
the user specified lambda.
For most situations this is perfectly fine, however for certain
types of high performance data processing, the overhead incurred
from std::ifstream (virtual method calls on a per
char basis, multiple copies due to back buffers etc) via the
invocation of std::getline results in an unacceptable
performance profile.
Due to the simple nature of the I/O access pattern (forward only)
utilized within strtk::for_each_line, a memory mapped
file approach would be far more efficient, being roughly 5x-7x
faster than the standard strtk::for_each_line routine. The following
is a simple example of code that utilises the Boost.IOStreams mmap
facility, namely mapped_file_source, for mmap'ing
the user specified file, in conjunction with strtk::split
to delimit on line boundaries that are denoted by new-line characters.
namespace strtk
{
template <typename Function>
inline std::size_t for_each_line_mmap(const std::string& file_name,
Function function,
const std::size_t& buffer_size = one_kilobyte)
{
boost::iostreams::mapped_file_source input_source;
try
{
input_source.open(file_name);
}
catch (std::exception&)
{
return 0;
}
const char* data = input_source.data();
multiple_char_delimiter_predicate delimiter("\n");
std::string buffer;
buffer.reserve(buffer_size);
std::size_t line_count = 0;
split(delimiter,
data, data + input_source.size(),
functional_inserter(
[&](const range::string& range)
{
if (range.begin() == range.end())
return;
const char* end = range.end();
if ('\r' == *(end - 1)) // strip trailing CR from CRLF terminated lines
end--;
buffer.assign(range.begin(),end);
function(buffer);
++line_count;
}),
split_options::compress_delimiters);
return line_count;
}
}
Note: The following are some points to consider with the solution
presented above:
- Because the routine attempts to map the entire file, on 32-bit systems this may fail when the file is very large, due to the limited availability of virtual address space. This can be easily remedied by windowing the mmap access over the file.
- If the system is under heavy memory load, the performance observed is expected to be similar to, but no worse than, if the original strtk::for_each_line routine were used.
- The interface with the lambda can be improved to instead only pass a range, rather than populating and passing a std::string as is done when using std::getline. This improvement results in a roughly 5% increase in performance (see the sketch following these notes).
- An interesting side effect of the range based approach is that the ranges denote the exact offsets from the original file for the begin and end positions of the string currently being processed. This can be very useful in situations where the exact location of a token is required post processing - as is the case with the problem presented earlier: "Pick A Random Line From A Text File"
'for_each_line' via MMap Benchmark
The following benchmark demonstrates all three variations of the
strtk::for_each_line routine: normal,
mmap+std::string and mmap+ranges. The example
iterates through a file of roughly 825MB in size, comprised of
two types of CSV lines, each containing 100 tokens, where some
of the tokens are to be parsed as integers. Each line is parsed,
the integers extracted as necessary and then accumulated into
a final sum so as to minimize any aggressive optimizations.
The following is a set of results derived from a run of the above
benchmark, based on a build using GCC 4.8, with O2 and PGO, running on
a Xeon X5650 3.2GHz 64GB RAM, kernel 3.11. The base measure is lines
processed per second. The greater the value the higher
the performance.

Method | Lines/sec |
Standard | 126817 |
MMap | 710317 |
MMap+Range | 735317 |
The Letters Game
In the popular TV game show Countdown
(aka Letters and Numbers), contestants in the Letters round take turns
choosing letters from either a vowel or consonant bin. Typically up to
9 letters are chosen, after which the contestants are given a certain
amount of time (usually 30 seconds) to find the longest 'valid'
English word made up of only the letters that had been chosen. The
contestant with the longest valid word wins the round. What defines
a valid word is usually specific to the version of the show. Some examples
are using the OED
or the
Macquarie dictionary
coupled together with rules related to proper nouns, plurals and combination
words. The Letters round is essentially an anagram solving challenge.
The Letters game can be generally defined as: given a canonical set of
substrings of varying length called C, generated from the alphabet A,
and a subset of not necessarily unique elements derived from A called
D, find the longest substring in the set C that is also in the set of
the 2^|D| unique combinations generated from the set D.
A typical process for solving the Letters problem is as follows:
- Step 0 - A list of unique valid words is specified (eg: OED or Word List)
- Step 0.0 - The longest word length from the list of words is noted as being L
- Step 0.1 - Each word is normalised to a common case (eg: all lower case)
- Step 0.2 - Store the length of each word into a "word length" set
- Step 1 - For every word w, generate its key k by sorting the letters of the word (eg: for the word 'english', the key 'eghilns' is derived)
- Step 2 - Insert each key/word pair into an associative map M (eg: hash-table)
- Step 3 - A list of potentially non-unique letters of length N is specified, comprised of both consonants and vowels
- Step 3.0 - Lexicographically order this list of letters
- Step 4 - For every N-choose-I combination over the list of N letters, where I starts from min(L,N) and tends to 1:
- Step 4.0 - If no words of length I exist, proceed to the next value of I
- Step 4.1 - Test the current combination x as a key in M
- Step 4.2 - If x exists in M, add all the words associated with x into a solution list S
- Step 4.3 - Otherwise continue on with the next combinations and values of I
- Step 4.4 - Once all the combinations for the current I have been enumerated, present the solution list S (if it is not empty)
#include <algorithm>
#include <cstddef>
#include <iostream>
#include <iterator>
#include <set>
#include <string>
#include <unordered_map>
#include <unordered_set>
#include <utility>

#include "strtk.hpp"

int main()
{
   std::string word_list_file_name = "word_list.txt";

   typedef std::unordered_multimap<std::string,std::string> word_map_t;
   typedef std::pair<word_map_t::iterator,word_map_t::iterator> word_map_range_t;

   word_map_t word_map;
   std::set<unsigned int> word_length;
   std::size_t longest_word = 0;

   // Steps 0 - 2: build the key/word multimap, noting the longest
   // word and the set of word lengths encountered.
   strtk::for_each_line(word_list_file_name,
                        [&](std::string word)
                        {
                           if (word.empty()) return;
                           strtk::remove_leading_trailing(" \t\n\r",word);
                           strtk::convert_to_lowercase(word);
                           std::string key = word;
                           strtk::sort(key); // eg: 'english' -> 'eghilns'
                           word_map.insert(std::make_pair(key,word));
                           longest_word = std::max(longest_word,word.size());
                           word_length.insert(word.size());
                        });

   std::string letters;

   for (;;)
   {
      std::cout << "Enter letters: ";

      if (!std::getline(std::cin,letters))
         break;

      // Step 3: strip out any non-letter characters, then normalise
      // and lexicographically order what remains.
      static const std::string illegal_chars = strtk::ext_string::all_chars() -
                                               strtk::ext_string::all_letters();

      strtk::multiple_char_delimiter_predicate pred(illegal_chars);
      strtk::remove_inplace(pred,letters);

      if (letters.empty()) break;

      strtk::convert_to_lowercase(letters);
      strtk::sort(letters);

      const std::size_t upper_bound = std::min(longest_word,letters.size());

      std::unordered_set<std::string> solution_list;

      // Step 4: enumerate the I-length combinations, longest first.
      for (std::size_t i = upper_bound; i > 0; --i)
      {
         if (word_length.end() == word_length.find(i))
            continue; // no words of length i exist

         typedef std::string::iterator str_itr_t;

         strtk::for_each_combination(letters.begin(),letters.end(),
                                     i,
                                     [&](str_itr_t begin, str_itr_t end)
                                     {
                                        std::string key(begin,end);
                                        word_map_range_t itr_range = word_map.equal_range(key);

                                        if (0 == strtk::distance(itr_range)) return;

                                        auto itr = itr_range.first;

                                        while (itr_range.second != itr)
                                        {
                                           solution_list.insert(itr->second);
                                           ++itr;
                                        }
                                     });

         if (!solution_list.empty())
         {
            std::copy(solution_list.begin(),
                      solution_list.end(),
                      std::ostream_iterator<std::string>(std::cout,"\n"));
            break;
         }
      }
   }

   return 0;
}
Notes On The Letters Round Solution
The time complexity of the given solution is O(2^min(L,N)),
which is quite large. What makes the solution practical is the fact
that natural languages such as English tend to have short common words
derived from relatively small alphabets, with an upper word length of
about 10-12 characters (excluding names et al and of course
Pneumonoultramicroscopicsilicovolcanoconiosis). So, for example, an
N of 10 or even 20 (assuming L is adequately large) will only amount
to a total of 1024 or 1048576 unique combinations respectively - both
search spaces can be trivially enumerated by brute force in a mere
fraction of a millisecond on modern hardware. However the problem
space becomes daunting at around N of 64 and larger, at which point a
constant multiplier can be applied by distributing the enumeration
process and performing said computations concurrently. Note this will
not reduce the overall complexity of the solution, just the time it
takes to complete; furthermore, today this technique may only be
practical for values of N less than 67.
One final note: the above process will not merely return the first
solution it encounters, it will return all possible solutions for the
largest combination length encountered.
Past Letters Round Games
The following is a short list of the Letters round games played on Countdown
during the 2010 season.
TSDTOEIRO DELAWERAD FSGTIOAOV LESADEPIL YTAINROTD LLBUEISUT GEMUHRONA TRNSEAENI
BPSQEAEVN GOXERAMJA NTRTIOAEP FLUENTLIP SCRSIEONU HPAVSOEDA WKCJROEAM IGTYEARTN
DSOEXDALM SMNRIEIUG EFCAORTNA HGRNEOITE RGPNIEUQA TREULDEOF SNRTEUIDA TRUCEENDS
HTNXIUOAD TREULDEOF SNRTEUIDA TRUCEENDS HTNXIUOAD SWDEIUHBE NWCKPAIAE NYDAOUTAU
SGLAEOMVI GLQGEIODV RTLEOESTI JCRRUAEOS LSDSEAEFO TRMAEASRI NFRNOIAES BROADMOOR
Performance Comparisons
The following are tables of results generated by running the
strtk_tokenizer_cmp test. Currently it covers simple
comparisons between Boost String Algorithms, Boost lexical_cast, the
Standard Library, Spirit (Karma, Qi) and StrTk in the following areas:
- Tokenization
- Splitting
- Integer To String
- String To Integer
- String To Double
Scenario 0 - MSVC 2010 (64-bit, O2, Ot, GL and PGO)
Source | Test | Size | Time(sec) | Rate | % from Baseline |
Boost | Tokenizer | 24000000 | 8.5857sec | 2795359.4074tks/sec | 100.0%, 100.0% |
StrTk | Tokenizer | 24000000 | 3.5019sec | 6853393.1186tks/sec | 40.7%, 245.1% |
Boost | Split | 9600000 | 5.5414sec | 1732414.5137tks/sec | 100.0%, 100.0% |
StrTk | Split | 9600000 | 0.8218sec | 11681814.9167tks/sec | 14.8%, 674.3% |
sprintf | Integer To String | 80000000 | 35.8128sec | 2233840.0564nums/sec | 100.0%, 100.0% |
Boost | Integer To String | 80000000 | 19.3994sec | 4123832.0477nums/sec | 54.1%, 184.6% |
Karma | Integer To String | 80000000 | 6.2528sec | 12794349.6524nums/sec | 17.4%, 572.7% |
StrTk | Integer To String | 80000000 | 1.5664sec | 51071439.9822nums/sec | 4.3%, 2286.2% |
atoi | String To Integer | 88500000 | 5.1802sec | 17084370.4936nums/sec | 100.0%, 100.0% |
Boost | String To Integer | 88500000 | 119.6261sec | 739805.3712nums/sec | 2309.2%, 4.3% |
Qi | String To Integer | 88500000 | 2.1951sec | 40317238.6629nums/sec | 42.3%, 235.9% |
StrTk | String To Integer | 88500000 | 1.8181sec | 48677773.5466nums/sec | 35.0%, 284.9% |
atof | String To Double | 30650000 | 15.2306sec | 2012396.7122nums/sec | 100.0%, 100.0% |
Boost | String To Double | 30650000 | 52.9244sec | 579127.8866nums/sec | 347.4%, 28.7% |
Qi | String To Double | 30650000 | 2.8665sec | 10692313.5853nums/sec | 18.8%, 531.3% |
StrTk | String To Double | 30650000 | 1.6069sec | 19073975.7679nums/sec | 10.5%, 947.8% |

Scenario 1 - MSVC 2010 (O2, Ot, GL and PGO)
Source | Test | Size | Time(sec) | Rate | % from Baseline |
Boost | Tokenizer | 24000000 | 9.4715sec | 2533910.4769tks/sec | 100.0%, 100.0% |
StrTk | Tokenizer | 24000000 | 2.8889sec | 8307786.9292tks/sec | 30.5%, 327.8% |
Boost | Split | 9600000 | 7.2291sec | 1327965.9706tks/sec | 100.0%, 100.0% |
StrTk | Split | 9600000 | 1.1301sec | 8494610.9664tks/sec | 15.6%, 639.6% |
sprintf | Integer To String | 80000000 | 38.2576sec | 2091088.8038nums/sec | 100.0%, 100.0% |
Boost | Integer To String | 80000000 | 28.9931sec | 2759277.4769nums/sec | 75.7%, 131.9% |
Karma | Integer To String | 80000000 | 4.9173sec | 16269254.0190nums/sec | 12.8%, 778.0% |
StrTk | Integer To String | 80000000 | 1.8270sec | 43786838.0279nums/sec | 4.7%, 2093.9% |
atoi | String To Integer | 88500000 | 6.0076sec | 14731435.8942nums/sec | 100.0%, 100.0% |
Boost | String To Integer | 88500000 | 185.4955sec | 477100.6474nums/sec | 3087.0%, 3.2% |
Qi | String To Integer | 88500000 | 2.5060sec | 35314785.8370nums/sec | 41.7%, 239.7% |
StrTk | String To Integer | 88500000 | 2.2095sec | 40054213.0736nums/sec | 36.7%, 271.8% |
atof | String To Double | 30650000 | 17.6435sec | 1737179.9302nums/sec | 100.0%, 100.0% |
Boost | String To Double | 30650000 | 78.6528sec | 389687.3997nums/sec | 445.7%, 22.4% |
Qi | String To Double | 30650000 | 3.8034sec | 8058494.1994nums/sec | 21.5%, 463.8% |
StrTk | String To Double | 30650000 | 2.0450sec | 14987780.2310nums/sec | 11.5%, 862.7% |

Scenario 2 - MSVC 2008 SP1 (O2, Ot, GL and PGO)
Source | Test | Size | Time(sec) | Rate | % from Baseline |
Boost | Tokenizer | 24000000 | 9.6533sec | 2486184.8282tks/sec | 100.0%, 100.0% |
StrTk | Tokenizer | 24000000 | 3.4748sec | 6906943.9529tks/sec | 35.9%, 277.8% |
Boost | Split | 9600000 | 10.2600sec | 935674.7490tks/sec | 100.0%, 100.0% |
StrTk | Split | 9600000 | 1.3793sec | 6959830.0652tks/sec | 13.4%, 743.8% |
sprintf | Integer To String | 80000000 | 24.6427sec | 3246397.8287nums/sec | 100.0%, 100.0% |
Boost | Integer To String | 80000000 | 27.5865sec | 2899968.5753nums/sec | 111.9%, 89.3% |
Karma | Integer To String | 80000000 | 5.4864sec | 14581630.6963nums/sec | 22.2%, 449.1% |
StrTk | Integer To String | 80000000 | 2.4224sec | 33025441.1256nums/sec | 9.8%, 1017.2% |
atoi | String To Integer | 88500000 | 5.9297sec | 14924814.8683nums/sec | 100.0%, 100.0% |
Boost | String To Integer | 88500000 | 186.1372sec | 475455.6660nums/sec | 3139.0%, 3.1% |
Qi | String To Integer | 88500000 | 2.0874sec | 42397446.1804nums/sec | 35.2%, 284.0% |
StrTk | String To Integer | 88500000 | 2.0485sec | 43202160.1371nums/sec | 34.5%, 289.4% |
atof | String To Double | 30650000 | 18.0458sec | 1698455.0767nums/sec | 100.0%, 100.0% |
Boost | String To Double | 30650000 | 77.4527sec | 395725.4111nums/sec | 429.2%, 23.2% |
Qi | String To Double | 30650000 | 3.9631sec | 7733881.1294nums/sec | 21.9%, 455.3% |
StrTk | String To Double | 30650000 | 2.0723sec | 14790236.0804nums/sec | 11.4%, 870.8% |

Scenario 3 - Intel C++ v11.1.060 IA-32 (O2, Ot, Qipo, QxHost and PGO)
Source | Test | Size | Time(sec) | Rate | % from Baseline |
Boost | Tokenizer | 24000000 | 10.0096sec | 2397697.7836tks/sec | 100.0%, 100.0% |
StrTk | Tokenizer | 24000000 | 3.1837sec | 7538416.8541tks/sec | 31.8%, 314.4% |
Boost | Split | 9600000 | 9.5450sec | 1005760.0310tks/sec | 100.0%, 100.0% |
StrTk | Split | 9600000 | 1.4292sec | 6716893.1359tks/sec | 14.9%, 667.8% |
sprintf | Integer To String | 80000000 | 23.8979sec | 3347577.5824nums/sec | 100.0%, 100.0% |
Boost | Integer To String | 80000000 | 27.5618sec | 2902565.2045nums/sec | 115.3%, 86.7% |
Karma | Integer To String | 80000000 | 4.6600sec | 17167208.7654nums/sec | 19.4%, 512.8% |
StrTk | Integer To String | 80000000 | 2.8450sec | 28119857.2736nums/sec | 11.9%, 840.0% |
atoi | String To Integer | 88500000 | 5.9386sec | 14902610.8922nums/sec | 100.0%, 100.0% |
Boost | String To Integer | 88500000 | 180.5856sec | 490072.4001nums/sec | 3040.8%, 3.2% |
Qi | String To Integer | 88500000 | 2.5273sec | 35017073.8639nums/sec | 42.5%, 234.9% |
StrTk | String To Integer | 88500000 | 1.8718sec | 47281492.1287nums/sec | 31.5%, 317.2% |
atof | String To Double | 30650000 | 18.4357sec | 1662538.0810nums/sec | 100.0%, 100.0% |
Boost | String To Double | 30650000 | 78.1543sec | 392172.9598nums/sec | 423.9%, 23.5% |
Qi | String To Double | 30650000 | 2.8321sec | 10822353.0510nums/sec | 15.3%, 650.9% |
StrTk | String To Double | 30650000 | 2.2930sec | 13366541.5515nums/sec | 12.4%, 803.9% |

Scenario 4 - GCC 4.5 (O3, PGO)
Source | Test | Size | Time(sec) | Rate | % from Baseline |
Boost | Tokenizer | 24000000 | 9.2510sec | 2594305.4347tks/sec | 100.0%, 100.0% |
StrTk | Tokenizer | 24000000 | 3.9717sec | 6042688.5734tks/sec | 42.9%, 232.9% |
Boost | Split | 9600000 | 5.0640sec | 1895728.2331tks/sec | 100.0%, 100.0% |
StrTk | Split | 9600000 | 1.5411sec | 6229231.8384tks/sec | 30.4%, 328.5% |
sprintf | Integer To String | 80000000 | 14.7807sec | 5412477.0993nums/sec | 100.0%, 100.0% |
Boost | Integer To String | 80000000 | 19.1131sec | 4185620.7707nums/sec | 129.3%, 77.3% |
Karma | Integer To String | 80000000 | 6.4455sec | 12411808.2841nums/sec | 43.6%, 229.3% |
StrTk | Integer To String | 80000000 | 4.5174sec | 17709364.5349nums/sec | 30.5%, 327.1% |
atoi | String To Integer | 88500000 | 5.2139sec | 16973721.6103nums/sec | 100.0%, 100.0% |
Boost | String To Integer | 88500000 | 50.5326sec | 1751344.8498nums/sec | 969.1%, 10.3% |
Qi | String To Integer | 88500000 | 1.9694sec | 44937612.8835nums/sec | 37.7%, 264.7% |
StrTk | String To Integer | 88500000 | 1.9008sec | 46558706.5833nums/sec | 36.4%, 274.2% |
atof | String To Double | 30650000 | 6.6975sec | 4576328.3036nums/sec | 100.0%, 100.0% |
Boost | String To Double | 30650000 | 29.6375sec | 1034162.2422nums/sec | 442.5%, 22.5% |
Qi | String To Double | 30650000 | 2.9852sec | 10267435.7138nums/sec | 44.5%, 224.3% |
StrTk | String To Double | 30650000 | 1.5961sec | 19202937.1409nums/sec | 23.8%, 419.6% |

Scenario 5 - GCC 4.5 (O3, PGO) Intel Atom N450
Source | Test | Size | Time(sec) | Rate | % from Baseline |
Boost | Tokenizer | 24000000 | 29.1370sec | 823695.4389tks/sec | 100.0%, 100.0% |
StrTk | Tokenizer | 24000000 | 12.3607sec | 1941644.0499tks/sec | 42.4%, 235.7% |
Boost | Split | 9600000 | 16.5261sec | 580899.9726tks/sec | 100.0%, 100.0% |
StrTk | Split | 9600000 | 4.9102sec | 1955110.2611tks/sec | 29.7%, 336.5% |
sprintf | Integer To String | 80000000 | 50.3456sec | 1589015.6118nums/sec | 100.0%, 100.0% |
Boost | Integer To String | 80000000 | 91.1475sec | 877698.1401nums/sec | 181.0%, 55.2% |
Karma | Integer To String | 80000000 | 21.8904sec | 3654568.8712nums/sec | 43.4%, 229.9% |
StrTk | Integer To String | 80000000 | 12.1877sec | 6564009.9274nums/sec | 24.2%, 413.0% |
atoi | String To Integer | 88500000 | 17.6615sec | 5010896.5768nums/sec | 100.0%, 100.0% |
Boost | String To Integer | 88500000 | 191.9446sec | 461070.5357nums/sec | 1086.7%, 9.2% |
Qi | String To Integer | 88500000 | 6.2808sec | 14090561.7119nums/sec | 35.5%, 281.1% |
StrTk | String To Integer | 88500000 | 6.1552sec | 14378086.8208nums/sec | 34.8%, 286.9% |
atof | String To Double | 30650000 | 21.4865sec | 1426474.1027nums/sec | 100.0%, 100.0% |
Boost | String To Double | 30650000 | 139.8166sec | 219215.7409nums/sec | 650.7%, 15.3% |
Qi | String To Double | 30650000 | 11.3916sec | 2690567.9223nums/sec | 53.0%, 188.6% |
StrTk | String To Double | 30650000 | 6.4396sec | 4759608.7027nums/sec | 29.9%, 333.6% |

Scenario 6 - GCC 4.5 (O3, PGO) Intel Xeon E5540
Source | Test | Size | Time(sec) | Rate | % from Baseline |
Boost | Tokenizer | 24000000 | 7.5657sec | 3172216.8787tks/sec | 100.0%, 100.0% |
StrTk | Tokenizer | 24000000 | 2.7379sec | 8765832.8290tks/sec | 36.1%, 276.3% |
Boost | Split | 9600000 | 3.0706sec | 3126386.1126tks/sec | 100.0%, 100.0% |
StrTk | Split | 9600000 | 1.1279sec | 8511136.2899tks/sec | 36.7%, 272.2% |
sprintf | Integer To String | 80000000 | 10.9012sec | 7338642.9638nums/sec | 100.0%, 100.0% |
Boost | Integer To String | 80000000 | 12.3317sec | 6487328.7872nums/sec | 113.1%, 88.3% |
Karma | Integer To String | 80000000 | 3.7202sec | 21504260.6660nums/sec | 34.1%, 293.0% |
StrTk | Integer To String | 80000000 | 2.5183sec | 31768042.4612nums/sec | 23.1%, 432.8% |
atoi | String To Integer | 88500000 | 4.0087sec | 22077037.6357nums/sec | 100.0%, 100.0% |
Boost | String To Integer | 88500000 | 30.3659sec | 2914454.4393nums/sec | 757.4%, 13.2% |
Qi | String To Integer | 88500000 | 1.7976sec | 49231871.5454nums/sec | 43.3%, 223.0% |
StrTk | String To Integer | 88500000 | 1.7384sec | 50908881.7303nums/sec | 43.3%, 230.5% |
atof | String To Double | 30650000 | 5.2118sec | 5880843.9328nums/sec | 100.0%, 100.0% |
Boost | String To Double | 30650000 | 21.5546sec | 1421966.9538nums/sec | 413.5%, 24.1% |
Qi | String To Double | 30650000 | 3.2149sec | 9533840.3118nums/sec | 61.6%, 162.1% |
StrTk | String To Double | 30650000 | 1.3929sec | 22003661.2944nums/sec | 26.7%, 374.1% |

Scenario 7 - GCC 4.5 (O3, PGO) Intel Xeon X5650
Source | Test | Size | Time(sec) | Rate | % from Baseline |
Boost | Tokenizer | 24000000 | 4.1944sec | 5721901.2924tks/sec | 100.00%, 100.00% |
StrTk | Tokenizer | 24000000 | 2.5087sec | 9566860.3956tks/sec | 59.81%, 167.19% |
Boost | Split | 9600000 | 2.9104sec | 3298520.2014tks/sec | 100.00%, 100.00% |
StrTk | Split | 9600000 | 1.1105sec | 8644949.2334tks/sec | 38.15%, 262.08% |
sprintf | Integer To String | 80000000 | 9.7148sec | 8234840.3537nums/sec | 100.00%, 100.00% |
Boost | Integer To String | 80000000 | 11.5726sec | 6912860.7120nums/sec | 119.12%, 83.94% |
Karma | Integer To String | 80000000 | 4.0832sec | 19592620.4395nums/sec | 42.03%, 237.92% |
StrTk | Integer To String | 80000000 | 2.2204sec | 36029641.5861nums/sec | 22.85%, 437.52% |
atoi | String To Integer | 88500000 | 3.1028sec | 28522836.1561nums/sec | 100.00%, 100.00% |
Boost | String To Integer | 88500000 | 6.1270sec | 14444145.2249nums/sec | 197.46%, 50.64% |
Qi | String To Integer | 88500000 | 1.5313sec | 57794031.2153nums/sec | 49.35%, 202.62% |
StrTk | String To Integer | 88500000 | 1.4409sec | 61417814.6362nums/sec | 46.43%, 215.32% |
atof | String To Double | 42880000 | 6.1342sec | 6990292.6549nums/sec | 100.00%, 100.00% |
Boost | String To Double | 42880000 | 28.9461sec | 1481374.9232nums/sec | 772.37%, 21.19% |
Qi | String To Double | 42880000 | 3.6549sec | 11732137.3557nums/sec | 59.58%, 167.83% |
StrTk | String To Double | 42880000 | 1.3860sec | 30937683.0792nums/sec | 22.59%, 442.58% |

Note 1: The tests are compiled with specific optimisation
flags to produce the best possible results for the respective
compilers and architectures. The tests are run
natively (no virtualization was used) on an almost completely idle
machine so as to reduce interference from background processes. The
Boost version used was 1.55. Furthermore the standard libraries,
including libc, were rebuilt for the Linux-based tests using
architecture specific flags and optimizations. The following
table maps the scenarios to their respective architectures:
Scenario | Architecture |
0 | ThinkPad W510 (64-Bit Intel Quad Core Extreme i7-920XM 2.0GHz, 16GB RAM, Windows 7) |
1-3 | ThinkPad x61 (32-Bit Intel Core 2 Duo 2.4GHz, 2GB RAM, Windows 7) |
4 | ThinkPad x61 (32-Bit Intel Core 2 Duo 2.4GHz, 2GB RAM, Mint 15) |
5 | Acer Aspire One (32-Bit Intel Atom N450 1.6GHz, 1GB RAM, Mint 15) |
6 | HP Proliant DL380G6 (64-Bit Intel Xeon E5540 2.5GHz, 8GB RAM, Mint 15) |
7 | Custom box (64-Bit Intel Xeon X5650 2.66GHz, 32GB RAM, Mint 15) |
Note 2:
The percentages in the final column represent the current row as a
percentage of the baseline, in total running time and in rate
respectively. For the first percentage the lower the value the
better; for the second the higher the value the better. The
baseline used for a specific combination of tests is defined in the
following table:
Test Combination | Baseline |
Boost, StrTk | Boost |
Boost, StdLib/STL, Spirit, StrTk | StdLib/STL |
StdLib/STL, Spirit, StrTk | StdLib/STL |
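For example, taking the Scenario 0 tokenizer rows: StrTk's running time relative to the Boost baseline is 3.5019 / 8.5857 ≈ 41%, and its rate relative to the baseline is 6853393.1 / 2795359.4 ≈ 245%, which correspond to the 40.7% and 245.1% figures reported in that row's final column (the small differences are due to the rounding of the displayed values).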
Note 3:
The test sizes are set such that no single run will result in a
running time of less than one second. This is done so as to ensure
that the per-second rates are measured rather than extrapolated.
In the future these sizes may need to be revisited once 3.5+GHz CPU
speeds become more commonplace. Furthermore the charts represent the
rate of operation over a one second interval - in short, the larger
the rate the better.
Note 4:
The binaries used for the above performance tests can be downloaded from
here
Note 5:
It would be great to have comparisons for other architectures. If you can
provide access to shell accounts with GCC 4.5+ or Clang/LLVM 2.0+ for the
following architectures: UltraSPARC T2 Plus, SPARC64 VII, POWER6/7, please
feel free to get in contact.
A Final Digression - Fast Integer To String Conversion
The task of converting an integer to a std::string
is quite common and rather trivial. It is mainly used in serialisation,
such as printing to stdout or to a file. The above benchmarks do
include timings for int to std::string conversions,
however let's see if there's more that can be done.
The following takes this very simple task to the next level. Based
on a series of solutions found on StackOverflow
and elsewhere, the following benchmark has been derived. Its objective
is simple: determine the fastest integer to std::string
conversion routine written entirely in conforming C++. The results
are as follows:
Method | Rate (per/sec) |
sprintf | 3903490 |
karma | 17305869 |
zverovich | 19036262 |
voigt | 21489702 |
so | 32617186 |
timo | 46211794 |
hopman | 52160759 |
strtk | 52650806 |
jiaendu | 60955167 |

Note: The following are details, and also some points to consider,
with regard to the benchmark presented above:
-
A total of nine conversion routines are employed, of varying
complexities, ranging from simple divide/mod-and-loop strategies
(sketched after this list) to elaborate single pass bit-twiddle
LUT based strategies.
-
A wide range of values within the int space are used. They consist
of sets of small and large random values, values near to and including
the min/max of int, and all values from -20000000 to +20000000.
Furthermore the various values are interleaved amongst each other
during the main loop for each conversion routine.
-
Prior to the benchmark beginning, a simple sanity check is run over
all the conversion routines to make sure each routine functions
correctly.
-
Before each conversion routine's benchmark begins, an attempt is made
to flush the I/D caches. This is done so as to avoid the previous
conversion routine's execution profile affecting the next routine.
-
It should be noted that the results presented were derived from running
the benchmark upon a processor which has a large L1/L2 cache. This is
important because, when run on processors that have smaller caches or
none at all, the rankings change quite a bit. In fact the simple
div/mod-and-loop strategies will tend to outdo most of the LUT based
strategies under certain architectural conditions. This point is
applicable not only to processes running atop embedded and low power
processors but also to processes running within virtualized
environments.
-
The system details are: Intel Xeon W3680 3.3GHz, kernel 3.11, GCC 4.8 O2
compiled with PGO.
-
The conversion implementations found in boost::lexical_cast, NumToA
and Folly were found to be exceedingly lacking when it came to
performance and were hence left out.
StrTk Library Dependency
StrTk makes use of the Boost library for its
boost::lexical_cast
routine (for types other than PODs), and for its TR1-compliant
Random and Regex libraries. These dependencies are not compulsory
and can easily be removed simply by defining the preprocessor
symbol strtk_no_tr1_or_boost. That said, Boost is an integral part
of modern C++ programming, and having it around is as beneficial as
having access to the STL; hence it is recommended that it be
installed. For Visual Studio users, BoostPro provides a free and
easy to use installer for the latest Boost libraries, which can be
obtained from here. For Linux users, mainstream distributions such
as Ubuntu and Red Hat (Fedora) provide easy installation of the
Boost libraries via their respective package management systems.
For more information please consult the readme.txt found in the
StrTk distribution.
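As a brief sketch, removing the dependency amounts to ensuring the symbol is defined before the header is included, either directly in code or on the compiler command line (consult the readme.txt for the authoritative instructions):

// In code, prior to including the header:
#define strtk_no_tr1_or_boost
#include "strtk.hpp"

// or equivalently on the command line, eg: with GCC:
//   g++ -Dstrtk_no_tr1_or_boost ...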
Compiler Support
The following is a listing of the various compilers with which StrTk
can be built error and warning free.
- GCC - versions 3.1+
- Clang/LLVM - versions 1.0+
- Intel C++ Compiler - versions 8+
- MSVC - versions 7.1+
- Comeau C/C++ - versions 4.3+
- PGI C++ - versions 10.x+
- IBM XL C/C++ - versions 10.x+
Note: Versions of compilers prior to the ones denoted above
"should" compile, however they may require a very lenient
warning/error level to be set during compilation.
Conclusion
StrTk was designed with performance and efficiency as its primary
principles, and as such some of the available interfaces may not be
as user-friendly as they could be - however the gains made in other
areas will hopefully compensate for any perceived difficulties. Like
most things there is a trade-off between performance and usability
with the above mentioned tokenizers and parsing methods. The original
aim was to provide an interface similar to that of the Boost Tokenizer
and Split routines, but also to provide the developer with abstractions
and various other simplifications that will hopefully offer more
flexibility and efficiency in the long run. That said, tokenizing a
string isn't the most fascinating problem one could tackle, but it
does have its interesting points when one has a few TB of data to
process; doing it properly could mean the difference between finishing
a simple data processing job today or next month.