Introduction
This article will present the tokenizing and splitting functionality
of a simple C++ library called the String Toolkit. Tokenization in
the context of string processing, is the method by which a sequence
of elements are broken up or fragmented in sub-sequences called tokens.
The indices in the original sequence that determine such breaks in the
sequence are known as delimiters.
There are two types of delimiters, normal or thin delimiters which
are of length one element and thick delimiters which are of length
two or more elements. Even though tokenization is primarily used in
conjunction with strings, any sequence of types that can be iterated
in a linear fashion can be tokenized, examples may be list of
integers, a vector of person classes or a map of strings.
Another Tokenizer?
To date there have been many attempts to define a
"standard" Tokenizer implementation in C++. Of them all the
likely candidate might be the implementation in the Boost
library. Regardless proposed implementations should to some extent
consider one or more of the following areas: over-all usage
patterns, constructions, generality (or is it genericty these
days?) of design, performance-efficiency.
1. Over-all usage patterns
This requirement is concerned with how easy it is to instantiate the
tokenizer and integrate it into existing processing patterns, which
most often than not requires integration with C++ STL algorithms and
containers. A tokenzier by definition would be classed as a producer,
so the question becomes how easy is it for others to consume its
output? Another issue is consistency of the definition of a token in
the situation where one has consecutive delimiters but is not
compressing them - can or should there be such a thing as an
empty token? and what do preceding and trailing delimiters mean? and
when should they be included as part of the token?
2. Constructions
In the context of string tokenizing, the majority of implementations
return the token as a new instance of a string. This process requires
a string to be created on heap, populated by the sub-string in
question from the original sequence, then returned back (some of
this may be alleviated by Return Value Optimization RVO).
In the case of iterators this is essentially another copy to the
caller. Furthermore two kinds of tokens can make this situation
worse, they are primarily a large sequence made up of lots of very
short tokens or a large sequence made up of lots of very large
tokens. The solution is not to return the token as a string but
rather as a range and allow the caller to decide how they wish to
inspect and further manipulate the token.

This minor change in interface design provides a great deal of
flexibility and performance gain.
3. Generality(Genericity) of design
Most tokenizer implementations concern themselves only with strings,
which for the most part is ok, because most things that need
tokenizing are strings. However there will be times when one has a
sequence of types that may need to be tokenized that aren't strings,
hence a tokenizer should be designed in such a way to enable such
features, moreover it becomes clear that the most basic of tokenizing
algorithms are invariant to the type of the delimiter.
4. Performance and Efficiency
Tokenizing strings range from low frequency inputs such as user input
or parsing of simple configuration files to more complex situations
such as tokenizing of HTML pages that a web crawler/indexer might do,
to parsing of large market-data streams in FIX format. Performance is
generally important and can usually be helped along with good usage
patterns that encourage reuse of instances, minimal preprocessing and
allow for user supplied predicates for the more nasty areas of the
process. It should be noted that everything in the proceeding article
can be done by purely using the STL - that said, C++'s ability to
allow one to skin the proverbial cat in numerous way gives
rise to novel solutions that are for the most part not of any
practical use other than to showcase ones abilities in using the
STL.
Getting started
The String Toolkit Library (StrTk) provides two common tokenization
concepts, a split function and a token iterator. Both these concepts
require the user to provide a delimiter predicate and an iterator
range over which the process will be carried out.
The tokenization process can be further parametrized by switching
between "compressed delimiters" or "no compressed delimiters" mode.
This essentially means that consecutive delimiters will be compressed
down to one and treated as such. Below are two tables depicting the
expected tokens from various types of input. The tables represent no
compressed and compressed tokenization processes respectively. The
delimiter in this instance is a pipe symbol | and <> denotes
an empty token.
No Compressed Delimiters
| Input | Token List |
a | a |
a|b | a,b |
a||b | a,<>,b |
|a | <>,a |
a| | a,<> |
|a||b | <>,a,<>,b |
||a||b|| | <>,<>,a,<>,b,<>,<> |
| | <>,<> |
|| | <>,<>,<> |
||| | <>,<>,<>,<> |
Compressed Delimiters
| Input | Token List |
a | a |
a||b | a,b |
|a||b | <>,a,b |
||a||b|| | <>,a,b,<> |
| | <>,<> |
|| | <>,<> |
||| | <>,<> |
Delimiters
Two forms of delimiters are supported and they are single
delimiter predicate and multiple delimiters predicate otherwise known
as SDP and MDP respectively. Essentially an SDP is
where there is only one type that can break or fragment the
sequence, where as with MDPs there is more than one unique type that
can break the sequence. It is possible to represent a SDP using
the MDP, however from a performance POV having separate
predicates is far more efficient. Additionally for strings based
on char or unsigned char (8-bit versions) there is a MDP that has a
look-up complexity of O(1) making it greatly more efficient than the
basic MDP.
Single Delimiter Predicate
Instantiation requires specialization of type and construction
requires an instance of the type.
strtk::single_delimiter_predicate<typename T>(const T& t)
strtk::single_delimiter_predicate<std::string::value_type> predicate('|');
Multiple Delimiter Predicate
Instantiation requires specialization of type and construction
requires a sequence of potential delimiters through a range
described by iterators.
strtk::multiple_delimiter_predicate<typename T>(Iterator begin, Iterator end);
std::string str_delimiters = " ,.;:<>'[]{}()_?/'`~!@#$%^&*|-_\"=+";
strtk::multiple_delimiter_predicate<char> mdp1(str_delimiters.begin(),str_delimiters.end());
unsigned int uint_delimiters[5] = {1,10,101,1010,10101};
strtk::multiple_delimiter_predicate<unsigned int> mdp2(uint_delimiters,uint_delimiters + 5);
As previously mentioned tokenization of data need not be limited to
strings comprised of chars, but can also be extended to other PODs or
complex types. In the above example a predicate used for tokenizing a
sequence of unsigned ints is being defined.
Multiple Char Delimiter Predicate
Instantiation requires an iterator range based on either unsigned
char or char. This delimiter is more efficient than the simple MDP as
it has a predicate evaluation of O(1) due to the use
of a lookup-table as opposed to O(n) where n is the
number of delimiters. Performance increase is seen as more
delimiters are used.
strtk::multiple_char_delimiter_predicate(const std::string&)
strtk::multiple_char_delimiter_predicate(const std::string::const_iterator begin,const std::string::const_iterator end)
strtk::multiple_char_delimiter_predicate predicate(" .;?");
The delimiter concept can be extended to the point where the
predicate itself can act as a state machine transitioning from
state to state based on conditions and rules related to the current
symbol being processed. This simple extension can lead to some very
interesting parsing capabilities.
Split
This is a function that performs tokenization over an entire sequence
in one go. strtk::split takes a sequence through
a range of iterators or in the case of a string through
a reference to its instance, a delimiter predicate and an output
iterator or otherwise known as a type sink. It populates the output iterator with the tokens it
extracts. The tokens in this case are std::pairs of iterators for
the sequence.

Split can be used in a "simple - no frills" manner by simply
passing the necessary parameters:
std::string str = "abc|123|xyz|789";
strtk::std_string::token_list_type token_list;
strtk::split(" |.;?",str,std::back_inserter(token_list));
strtk::split can also be used in a more explicit
manner whereby the exact type of delimiter predicate can be
specified by the user:
{
std::string str = "abc|123|xyz|789";
strtk::std_string::token_list_type token_list;
strtk::single_delimiter_predicate<std::string::value_type> predicate('|');
strtk::split(predicate,str,std::back_inserter(token_list));
}
{
std::string data = "abc|123|xyz|789";
std::deque<std::string> token_list;
strtk::split([](const char c)
{
return '|' == c;
},
data,
strtk::range_to_type_back_inserter(token_list));
}
strtk::split provides an additional usage option that
allows the user to specify if they would like to either compress the
delimiters and whether or not they would like to include the
delimiter as part of the token range. This enum parameter is called
strtk::split_options and has the following values:
| Split Option | Definition |
strtk::split_options::default_mode | Default options |
strtk::split_options::compress_delimiters | Consecutive delimiters are treated as one |
strtk::split_options::include_1st_delimiter | The first delimiter is included in the resulting token range |
strtk::split_options::include_delimiters | All delimiters are included in the resulting token range |
The simple example below demonstrates a split that will occur over
a string given a predicate where the provided split options indicate
that consecutive delimiters will be treated as one and also all delimiters
encountered after each token will also be included in the token up until
the next token begins.
strtk::split(predicate,
str,
std::back_inserter(token_list),
strtk::split_options::compress_delimiters | strtk::split_options::include_delimiters);
Another way of using split may be to use the
strtk::multiple_char_delimiter_predicate as follows:
std::string str = "abc?123;xyz.789";
strtk::std_string::token_list_type token_list;
strtk::multiple_char_delimiter_predicate predicate(" .;?");
strtk::split(predicate,str,std::back_inserter(token_list));
The contents of the token_list can be printed out as follows:
strtk::std_string::token_list_type::iterator itr = token_list.begin();
while (token_list.end() != itr)
{
std::cout << (*itr) << '\t';
++itr;
}
Split N-Tokens
A natural extension of strtk::split is strtk::split_n.
This function provides the ability to tokenize a sequence up until a specific
number of tokens have been encountered or until there are no more tokens left.
The return value of the strtk::split_n would be the number of tokens
encountered.
std::string data = "token1?token2,token3;token4,token5";
strtk::std_string::token_list_type token_list;
const std::size_t token_count = 4;
const std::string delimiters (" ,.;?");
strtk::split_n(delimiters,
data,
token_count,
std::back_inserter(token_list));
strtk::std_string::token_list_type::iterator itr = token_list.begin();
while (token_list.end() != itr)
{
std::cout << "[" << (*itr) << "]\t";
++itr;
}
std::cout << std::endl;
Offset Splitter
Another simple variant is the strtk::offset_splitter.
This form of split takes a series of offsets and an iterator range or
string and determines the tokens based on the offsets. This function
can be set to perform a single pass of the offsets or to rotate
them until the range has been completely traversed. The example
below demonstrates how a string representing date and time can be
tokenized into its constituent parts (year, month, day, hour, minutes
,seconds,milliseconds)
std::string time_data = "20000101091011345"; const std::size_t offset_list_size = 7;
const int offset_list[offset_list_size] = {
4, 2, 2, 2, 2, 2, 3 };
const strtk::offset_predicate<offset_list_size> predicate(offset_list);
strtk::std_string::token_list_type token_list;
strtk::offset_splitter(time_data,predicate,std::back_inserter(token_list));
strtk::std_string::token_list_type::iterator itr = token_list.begin();
while (token_list.end() != itr)
{
std::cout << "[" << (*itr) << "] ";
++itr;
}
std::cout << std::endl;
Split Regex
Another form of splitter is based on the concept of using regular
expressions as the delimiter predicate. Below is a simple example
of splitting tokens wrapped in round-brackets.
std::string str = "(12)(345)(6789)(0ijkx)(yz)";
std::list<std::string> token_list;
strtk::split_regex("\\(.*?\\)",
s,
std::back_inserter(token_list),
strtk::regex_match_mode::match_1);
std::list<std::string>::iterator itr = token_list.begin();
while (token_list.end() != itr)
{
std::cout << "[" << (*itr) << "]\t";
++itr;
}
std::cout << std::endl;
Note: The parameter regex_match_mode represents the capture of the
marked sub-expression in the current match. By default it is
strtk::regex_match_mode::match_all which in this case
would provide the entire match including the bounding pattern,
eg: Token3 would be (0ijkx). However in the above example
we are only interested in the sub-expression between the round-brackets,
hence strtk::regex_match_mode::match_1 is used resulting
in Token3 being 0ijkx.
The following examples demonstrate the use of strtk::split_regex
and strtk::split_regex_n routines in extracting specific
types of data - in this case the POD types int and double.
int main()
{
{
std::string data = "a 1^bc,0023| def?gh(4567ijk)-89 10l,m$n-op+123r@st+3u v*w2y56yz+";
std::deque<int> int_list;
strtk::split_regex("([+-]?([\\d]+))",
data,
strtk::range_to_type_back_inserter(int_list),
strtk::regex_match_mode::match_1);
}
{
std::string data = "ab$c1.1?d-2.2ef#ghi+3.3%(123.456)!&*-7.89E+12@^=";
std::deque<double> double_list;
strtk::split_regex(strtk::ieee754_expression,
data,
strtk::range_to_type_back_inserter(double_list),
strtk::regex_match_mode::match_1);
}
{
std::string data = "a 1^bc,0023| def?gh(4567ijk)-89 10l,m$n-op+123r@st+3u v*w2y56yz+";
std::deque<int> int_list;
strtk::split_regex_n("([+-]?([\\d]+))",
data,
3,
strtk::range_to_type_back_inserter(int_list),
strtk::regex_match_mode::match_1);
}
{
std::string data = "ab$c1.1?d-2.2ef#ghi+3.3%(123.456)!&*-7.89E+12@^=";
std::deque<double> double_list;
strtk::split_regex_n(strtk::ieee754_expression,
data,
4,
strtk::range_to_type_back_inserter(double_list),
strtk::regex_match_mode::match_1);
}
return 0;
}
The following table describes the various regex patterns built
into StrTk which can be used seamlessly with the strtk::split_regex
and strtk::split_regex_n routines.
| Regex | Definition |
strtk::uri_expression | URI, URL address e.g.: http://www.example.com, domain.example.net/index.html |
strtk::email_expression | E-mail address e.g.: some.one@example.com |
strtk::ip_expression | IPv4 address e.g.: 192.168.0.1, 127.0.0.1 |
strtk::ieee754_expression | Floating point value e.g.: 1.1, 1.234e-123, -1.0001E+10, 0.1234 |
Tokenizer
The tokenizer is specialized on a sequence iterator and predicate. It
is constructed with a range of iterators for the sequence and a
reference to the desired predicate. Definitions exist for std::string
tokenizers. The tokenizer provides an iteration pattern as a means
for accessing the tokens in the sequence.
const unsigned int data_size = 12;
unsigned int data[data_size] = {1,2,3,0,4,5,6,0,7,8,0,9};
strtk::single_delimiter_predicate<unsigned int> predicate(0);
typedef strtk::tokenizer<unsigned int*,strtk::single_delimiter_predicate<unsigned int> > tokenizer_type;
tokenizer_type tokenizer(data,data + data_size,predicate);
Similar to that of strtk::split,
strtk::tokenizer provides tokenizing options that are
passed in during construction. Below is a table depicting said
options:
| Tokenize Option | Definition |
strtk::tokenize_options::default_mode | Default options |
strtk::tokenize_options::compress_delimiters | Consecutive delimiters are treated as one |
strtk::tokenize_options::include_1st_delimiter | The first delimiter is included in the resulting token range |
strtk::tokenize_options::include_delimiters | All delimiters are included in the resulting token range |
typedef strtk::tokenizer<unsigned int*,strtk::single_delimiter_predicate<unsigned int> > tokenizer_type
tokenizer_type tokenizer(data, data + data_size,
predicate,
strtk::tokenize_options::compress_delimiters |
strtk::tokenize_options::include_1st_delimiter);
Furthermore, Iteration over the tokens of strtk::tokenizer
is performed as follows:
tokenizer_type::iterator itr = tokenizer.begin();
while (tokenizer.end() != itr)
{
std::copy((*itr).first,(*itr).second,std::ostream_iterator<unsigned int>(std::cout," "));
std::cout << std::endl;
++itr;
}
A typical std::string can be tokenized in the following
manner:
std::string str = "abc|123|xyz|789";
strtk::std_string::tokenizer<>::type tokenizer(str,"|");
strtk::std_string::tokenizer<>::type::iterator itr = tokenizer.begin();
while (tokenizer.end() != itr)
{
std::cout << "[" << (*itr) << "]\t";
++itr;
}
std::cout << std::endl;
Another common situation may be tokenizing a sequence of strings,
such as the following:
const std::string str_list[] = { "abc" , "delimiter" , "ijk" , "delimiter" ,
"mno" , "delimiter" , "rst" , "uvw" ,
"delimiter" , "xyz" };
const std::size_t str_list_size = sizeof(str_list) / sizeof(std::string);
strtk::range_adapter<std::string> range(str_list,str_list_size);
strtk::single_delimiter_predicate<std::string> predicate("delimiter");
typedef strtk::tokenizer<std::string*,strtk::single_delimiter_predicate<std::string> > tokenizer_type;
tokenizer_type tokenizer(range.begin(),range.end(),predicate);
tokenizer_type::iterator itr = tokenizer.begin();
while (tokenizer.end() != itr)
{
std::copy((*itr).first,(*itr).second,std::ostream_iterator<std::string>(std::cout," "));
std::cout << std::endl;
++itr;
}
Note: For performance and efficient resource management purposes
the strtk::tokenizer does not take ownership or make an
internal copy of the sequence being tokenized, as such the
strtk::tokenizer expects the range to be valid during
the entirety of the tokenization process, this is also the case for
the specified predicate.
The Parse Routine
Till now the mentioned routines worked specifically with tokens, or in
other words ranges of characters. The responsibility of managing the
tokens and converting the tokens to user specified types was done
manually via "range to type" oriented back inserters and converters.
This can be a bit cumbersome and as such StrTk provides a series of
helper routines called strtk::parse. Parse takes an
std::string representing a tuple of delimited values as
input data, a delimiter set, and a series of references to variables
that are to be populated with the values from the parsed tokens. The
following diagram demonstrates the flow of data, tokens and the
corresponding relationships and conversions between each of the
parameters.

Note: strtk::parse returns a boolean value of true
upon successful parsing and false for all other results. Situations
that cause strtk::parse to fail include:
- Insufficient number of tokens for the given number of variables
- Conversion failure from token range to variable type
- Empty or null token(s)
Some Simple Parse Examples
strtk::parse can take an arbitrary number of variable
references. The code below demonstrates the basic usage of strtk::parse
taking various number of parameters.
std::string data = "abcde,-1234|567.890#1.1f";
std::string delimiters = ",|#";
std::string var0;
int var1;
double var2;
float var3;
strtk::parse(data,delimiters,var0);
strtk::parse(data,delimiters,var0,var1);
strtk::parse(data,delimiters,var0,var1,var2);
strtk::parse(data,delimiters,var0,var1,var2,var3);
The following examples demonstrate parsing of PODs such as int and
double into STL compatible sequences (std::vector,
std::deque, std::list, std::set,
std::queue, std::stack and
std::priority_queue).
std::string int_string = "1 +2 -3 4 +5 -6 7 +8 -9 10 +11 -12 13 +14 -15";
std::vector<int> int_vector;
strtk::parse(int_string," ",int_vector);
std::string double_string = "-123.456,789.012,-345.678,901.234,+567.890";
std::deque<double> double_deque;
strtk::parse(double_string,",",double_deque);
std::string data_string = "a,bc,def,ghij,klmno,pqrstu,vwxyz";
std::list<std::string> string_list;
strtk::parse(data_string,",",string_list);
std::string set_string = "a|bc#def|ghij#klmno|pqrstu#vwxyz";
std::set<std::string> string_set;
strtk::parse(set_string,"#|",string_set);
std::string queue_string = "value1,value2,value3,value4,value5";
std::queue<std::string> string_queue;
strtk::parse(queue_string,",|",string_queue);
std::string stack_string = "value1|value2,value3|value4,value5";
std::stack<std::string> string_stack;
strtk::parse(stack_string,",|",string_stack);
std::string priority_queue_string = "value1|value2,value3#value4,value5";
std::priority_queue<std::string> string_priority_queue;
strtk::parse(priority_queue_string,",|#",string_priority_queue);
Similar to what is described above, the following demonstrates parsing of up to "N" elements into an
STL compatible sequence.
std::string int_string = "100,-200,+300,400,-500,+600,700,-800,+900";
std::vector<int> int_vector;
strtk::parse_n(int_string,",",5,int_vector);
std::string double_string = "123.456,+789.012,345.678,-901.234,567.890";
std::deque<double> double_deque;
strtk::parse_n(double_string,",",3,double_deque);
std::string data_string = "a,bc,def,ghij,klmno,pqrstu,vwxyz";
std::list<std::string> string_list;
strtk::parse_n(data_string,",",6,string_list);
std::string set_string = "a|bc#def|ghij#klmno|pqrstu#vwxyz";
std::set<std::string> string_set;
strtk::parse_n(set_string,"#|",6,string_set);
std::string queue_string = "value0,value1,value2,value3,value4,value5";
std::queue<std::string> string_queue;
strtk::parse_n(queue_string,",|",4,string_queue);
std::string stack_string = "value0|value1|value2,value3|value4,value5";
std::stack<std::string> string_stack;
strtk::parse_n(stack_string,",|",4,string_stack);
std::string priority_queue_string = "value0#value1|value2,value3#value4,value5";
std::priority_queue<std::string> string_priority_queue;
strtk::parse_n(priority_queue_string,",|#",4,string_priority_queue);
Note: The return value of the routine strtk::parse_n
indicates how many elements were parsed and placed into the specified
sequence.
Some Initial Simple Examples
Simple Example 0
As a first example, we'll tackle the simple problem of reversing words in a sentence.
That is given a sentence, to have the first word be the last and the last to be
the first, the second word to be the second last so on and so forth. Using StrTk
we come up with the following solution:
int main()
{
std::string sentence = "The quick brown fox jumps over the lazy dog";
std::cout << "Before: " << sentence << std::endl;
strtk::split(" ",
sentence,
strtk::functional_inserter(
[](const strtk::range::string& range)
{ strtk::reverse(range); })
);
strtk::reverse(sentence);
std::cout << "After: " << sentence << std::endl;
return 0;
}
With an expected output of:
Before: The quick brown fox jumps over the lazy dog
After: dog lazy the over jumps fox brown quick The
Simple Example 1
Another example, given a list of words to blank out and a sentence,
transform the sentence such that the blank-out words are
removed. Using StrTk we come up with the following solution:
int main()
{
std::string sentence = "The quick brown fox jumps over the lazy dog";
std::unordered_set<std::string> blankout_words;
blankout_words.insert("quick");
blankout_words.insert("over");
blankout_words.insert("lazy");
std::cout << "Before: " << sentence << std::endl;
strtk::split(" ",
sentence,
strtk::functional_inserter(
[&blankout_words](const strtk::range::string& range)
{
auto itr = blankout_words.find(range);
if (blankout_words.end() != itr)
{
strtk::fill(range,' ');
}
})
);
strtk::remove_consecutives_inplace(' ',sentence);
std::cout << "After: " << sentence << std::endl;
return 0;
}
With an expected output of:
Before: The quick brown fox jumps over the lazy dog
After: The brown fox jumps the dog
Simple Example 2
Another simple example would be given a text file to read each
of the lines and populate a word list structure by
tokenizing each line into words. The following is an example
of how this can be achieved using StrTk:
int main()
{
std::deque<std::string> word_list;
strtk::for_each_line("text.txt",
[&word_list](const std::string& line)
{
static const std::string delimiters = "0123456789()[]{}<>"
"\t\r\n ,,.;:'"
"!@#$%^&*_-=+`~/";
strtk::parse(line,delimiters,word_list);
});
return 0;
}
Simple Example 3
The following simple example takes a user specified text file,
proceeds to process it and returns information relating to
the file, such as word, letter, uppercase character,
lowercase character, vowel and consonant counts.
int main()
{
std::size_t word_count = 0;
std::size_t letter_count = 0;
std::size_t uppercase_count = 0;
std::size_t lowercase_count = 0;
std::size_t vowel_count = 0;
std::size_t consonant_count = 0;
using namespace strtk;
for_each_line("data.txt",
[&](const std::string& line)
{
static multiple_char_delimiter_predicate is_vowel("AEIOUaeiou");
static multiple_char_delimiter_predicate is_lowercase(ext_string::all_lowercase_letters());
static const std::string delimiters = ext_string::all_chars()
- ext_string::all_lowercase_letters()
- ext_string::all_uppercase_letters();
split(delimiters,
line,
functional_inserter(
[&](const strtk::range::string& range)
{
if (0 == range.size()) return;
++word_count;
letter_count += range.size();
std::size_t current_lowercase_count = 0;
std::size_t current_vowel_count = 0;
for (std::size_t i = 0; i < range.size(); ++i)
{
if (is_vowel(range[i])) ++current_vowel_count;
if (is_lowercase(range[i])) ++current_lowercase_count;
}
uppercase_count += range.size() - current_lowercase_count;
lowercase_count += current_lowercase_count;
consonant_count += range.size() - current_vowel_count;
vowel_count += current_vowel_count;
})
);
});
std::cout << "Word count: " << word_count << std::endl;
std::cout << "Letter count: " << letter_count << std::endl;
std::cout << "Uppercase count: " << uppercase_count << std::endl;
std::cout << "Lowercase count: " << lowercase_count << std::endl;
std::cout << "Vowel count: " << vowel_count << std::endl;
std::cout << "Consonant count: " << consonant_count << std::endl;
return 0;
}
Simple Example 4
For the next example, assume we have a text file with a list of names,
one per line that represents the order of people that entered a
building. Some of the people may enter and leave then reenter the
building many times, hence their name will appear multiple times in
the list. Our task is to reduce this list to a list of unique names
but to also maintain the relative order of names found in the original
list. The following is how this particular requirement can be
accomplished by using StrTk:
int main()
{
strtk::for_each_line("file_name.txt",
[](const std::string& line)
{
static std::unordered_set<std::string> line_set;
if (line_set.end() != line_set.find(line))
return;
line_set.insert(line);
std::cout << line << std::endl;
});
return 0;
}
Simple Example 5
As a final simple example, we would like to calculate the word
frequency model of a user specified text file. The process involves
reading each line, splitting the line into words, then incrementing
the relevant count for each word and maintaining a global word count.
Once the file has been processed, the occurrence frequency of each
word will be printed to stdout.
int main()
{
typedef std::unordered_map<std::string,unsigned int> map_t;
map_t word_hit_list;
unsigned int word_count = 0;
strtk::for_each_line("data.txt",
[&](const std::string& line)
{
static const std::string delimiters = strtk::ext_string::all_chars()
- strtk::ext_string::all_lowercase_letters()
- strtk::ext_string::all_uppercase_letters();
strtk::split(delimiters,
line,
strtk::functional_inserter(
[&](const strtk::range::string& range)
{
if (range.begin() == range.end()) return;
++word_count;
std::string word(range.begin(),range.end());
++word_hit_list[word];
})
);
});
for (map_t::value_type v : word_hit_list)
{
printf("%s %10d %10.9f\n",
strtk::text::right_align(15,' ',v.first).c_str(),
v.second,
(1.0 * v.second) / word_count);
}
return 0;
}
A Practical Example
Lets assume you have been given an English text file to process, with
the intention of extracting a lexicon from the file.
One solution would be to break the problem down to a line by line
tokenization problem. In this case you would define a functional
object such as the following which will take the container in which
you plan on storing your tokens (words) and a predicate and insert
the tokens as strings into your container.
template<typename Container, typename Predicate>
struct parse_line
{
public:
parse_line(Container& container, const Predicate& predicate)
: container_(container),
predicate_(predicate)
{}
inline void operator() (const std::string& str)
{
strtk::split(str,
predicate_,
strtk::range_to_type_back_inserter(container_),
strtk::split_options::compress_delimiters);
}
private:
Container& container_;
const Predicate& predicate_;
};
The whole thing together would include a process of opening the
file and reading it line by line each time invoking the
parse_line would be as follows:
template<typename Container>
void parse_text(const std::string& file_name, Container& c)
{
static const std::string delimiters = " ,.;:<>'[]{}()_?/"
"`~!@#$%^&*|-_\"=+\t\r\n\0"
"0123456789";
strtk::multiple_char_delimiter_predicate predicate(delimiters);
strtk::for_each_line(file_name,
parse_line<Container,strtk::multiple_char_delimiter_predicate>(c,predicate));
}
int main()
{
std::string text_file_name = "text.txt";
std::deque<std::string> word_list;
parse_text(text_file_name,word_list);
std::cout << "Token Count: " << word_list.size() << std::endl;
return 0;
}
Before we continue on with the example, a re-written version of the
above code using C++11 lambdas is as follows:
int main()
{
std::string text_file_name = "text.txt";
std::deque<std::string> word_list;
strtk::for_each_line(text_file_name,
[&word_list](const std::string& line)
{
static const std::string delimiters = " ,.;:<>'[]{}()_?/"
"`~!@#$%^&*|-_\"=+\t\r\n\0"
"0123456789";
strtk::parse(line,delimiters,word_list);
});
std::cout << "Token Count: " << word_list.size() << std::endl;
return 0;
}
Now coming back to the original problem, that being the construction
of a lexicon. In this case the set of "words" should only
contain words of interest. For the sake of simplicity lets define
words of interest as being anything other than the following
prepositions: as, at, but, by, for, in, like, next, of, on,
opposite, out, past, to, up and via. This type of list is commonly
known as a Stop Word List. In this example the stop-word list
definition will be as follows:
const std::string stop_word_list [] =
{
"as", "at", "but", "by", "for",
"in", "like", "next", "of", "on",
"opposite", "out", "past", "to",
"up", "via", ""
};
const std::size_t stop_word_list_size = sizeof(stop_word_list) / sizeof(std::string);
Some minor updates to the parse_line processor include using the
filter_on_match predicate for determining if the
currently processed token is a preposition and also the invocation of
the range_to_type back_inserter to convert the tokens
from their range iterator representation to a type representation
compatible with the user defined container. For the new implementation
to provide unique words of interest the simplest change that can be
made is to replace the deque used as the container for the word_list
to some kind of 1-1 associative container such as a set. The following
is the improved version of the parse_line processor:
template<typename Container, typename Predicate>
struct parse_line
{
public:
parse_line(Container& container, const Predicate& predicate)
: container_(container),
predicate_(predicate),
tmp_(" "),
tokenizer_(tmp_,predicate_,true),
filter_(stop_word_list,stop_word_list + stop_word_list_size,
strtk::range_to_string_back_inserter_iterator<Container>(container_),
true,false)
{}
inline void operator() (const std::string& s)
{
const filter_type& filter = filter_;
strtk::for_each_token(s,tokenizer_,filter);
}
private:
Container& container_;
const Predicate& predicate_;
std::string tmp_;
typename strtk::std_string_tokenizer<Predicate>::type tokenizer_;
strtk::filter_on_match<strtk::range_to_string_back_inserter_iterator<Container>> filter_;
};
The above described predicate can be greatly simplified by using
binders and various lambda expressions.
Another Example
When performing serialization or deserialization of an
instance object such as a class or struct, a simple approach
one could take would be to take each of the members and
convert them into string representations and from those strings
construct a larger string delimiting each member with a special
character guaranteed not to exist in any of the string
representations.
In this example we will assume that there exists a struct which
represents the properties of a person, a person struct if you
will:
struct person
{
unsigned int id;
std::string name;
unsigned int age;
double height
float weight;
};
The process of populating a person struct would entail having an
instance of a person and the necessary data string. The following is
an example of how this would be done using the
strtk::parse function.
Person Tuple Format
| Token0 | Delimiter0 | Token1 | Delimiter1 | Token2 | Delimiter2 | Token3 | Delimiter3 | Token4 |
Unique ID(hex) | | | Name | | | Age | | | Height(m) | | | Weight(kg) |
std::string data = "0xFA37ED12|Rumpelstiltskin|397|1.31|58.7";
person p;
strtk::hex_to_number_sink<unsigned int> hex_sink(p.id); strtk::parse(data,"|",hex_sink,p.name,p.age,p.height,p.weight);
Batch processing of a text file comprised of one person tuple
per-line is somewhat similar to the previous example. A predicate
is defined that takes a container specialized on the person struct,
and a delimiter predicate with which the
strtk::parse function will be invoked. This
predicate is then instantiated and coupled with the text
file name, is fed to the strtk::for_each_line
processor.
template<typename Container, typename Predicate>
struct parse_person_tuple
{
public:
parse_person_tuple(Container& container)
: container_(container),
hex_sink(p_.id)
{}
inline void operator() (const std::string& s)
{
if (strtk::parse(s,"|",hex_sink,p_.name,p_.age,p_.height,p_.weight))
container_.push_back(p_);
else
std::cerr << "Failed to parse: " << s << std::endl;
}
private:
Container& container_;
person p_;
strtk::hex_to_number_sink<unsigned int> hex_sink;
};
Bringing the above pieces together to process a file results in the
following:
int main()
{
std::string text_file_name = "person_records.txt";
std::deque<person> person_list;
strtk::for_each_line(file_name,predicate_type(person_list));
return 0;
}
Before we continue on with the example, a re-written version of the
above code using C++11 lambdas is as follows:
int main()
{
std::string text_file_name = "person_records.txt";
std::deque<person> person_list;
person p;
strtk::hex_to_number_sink<unsigned int> hex_sink;
strtk::for_each_line(text_file_name,
[&](const std::string& line)
{
if (strtk::parse(line,"|",hex_sink,p.name,p.age,p.height,p.weight))
container_.push_back(p);
else
std::cerr << "Failed to parse: " << line << std::endl;
});
return 0;
}
To make things easier one could adapt a struct (made up entirely
of PODs) to a parser. This makes the usage syntax little easier
to follow. An example of this adaption is as follows:
struct type
{
std::string s;
double d;
int i;
char c;
bool b;
};
strtk_parse_begin(type)
strtk_parse_type(s)
strtk_parse_type(d)
strtk_parse_type(i)
strtk_parse_type(c)
strtk_parse_type(b)
strtk_parse_end()
int main()
{
type t;
std::string s = "abcdefghijklmnop|123.456|987654321|A|1";
strtk::parse(s,"|",t);
return 0;
}
Another similar example to the previous, would be parsing a text
file of 3D coordinates into a sequence. This can be done easily and
cleanly using lambdas and StrTk as follows:
struct point
{
double x,y,z;
};
int main()
{
std::string point_data = "point_data.txt";
std::deque<point> points;
point p;
strtk::for_each_line(point_data,
[&](const std::string& str)
{
if (strtk::parse(str,",",p.x,p.y,p.z))
points.push_back(p);
});
return 0;
}
Simple Date-Time Parser
Assuming the datetime struct defined below, and a string
representation of a combined date and time in the form of YYYY-MM-DD
HH:MM:SS.MS eg: 2005-06-26 15:57:03.678
struct datetime
{
unsigned int year;
unsigned int month;
unsigned int day;
unsigned int hour;
unsigned int minute;
unsigned int second;
unsigned int msecond;
};
The following assumes an input of date-time values separated by a
pipe. To facilitate parsing of a date-time by
the strtk::parse routine into an STL compatible sequence an
implementation of string_to_type_converter_impl specific
to the datetime type is required. The following demonstrates how such
a routine can be generated and used within the strtk::parse
context:
strtk_string_to_type_begin(datetime)
static const std::string delimiters ("-:. ");
return strtk::parse(begin, end, delimiters,
t.year, t.month, t.day,
t.hour, t.minute, t.second, t.msecond);
strtk_string_to_type_end()
A simple example of using strtk::parse in conjunction
with the newly integrated datetime variable mixed with variables
of other types is as follows:
int main()
{
const std::string data = "abc 123 xyz,2000-01-10 03:01:16.123|+98765.43210";
std::string var0;
datetime var1;
double var2;
strtk::parse(data,",|",var0,var1,var2);
return 0;
}
Bringing the above pieces together, in the following we can then
proceed to parse a sequence of date-time strings delimited by pipe
"|" into a deque of type datetime.
int main()
{
static const std::string data = "2000-01-10 03:01:16.123|2001-02-22 05:12:24.234|"
"2002-03-13 07:23:32.345|2003-04-24 09:34:47.456|"
"2004-05-15 11:46:51.767|2005-06-26 15:57:03.678|"
"2006-07-17 17:45:31.561|2007-08-26 19:06:02.809|"
"2008-09-18 21:16:23.267|2009-10-26 23:12:03.798|"
"2010-11-23 13:47:11.963|2011-12-26 15:35:08.168";
std::deque<datetime> datetime_list;
strtk::parse(data,"|",datetime_list);
for (std::size_t i = 0; i < datetime_list.size(); ++i)
{
datetime& dt = datetime_list[i];
std::cout << dt.year << "-"
<< strtk::text::right_align(2,'0', dt.month) << "-"
<< strtk::text::right_align(2,'0', dt.day) << " "
<< strtk::text::right_align(2,'0', dt.hour) << ":"
<< strtk::text::right_align(2,'0', dt.minute) << ":"
<< strtk::text::right_align(2,'0', dt.second) << "."
<< strtk::text::right_align(3,'0',dt.msecond) << std::endl;
}
return 0;
}
Parsing Sub-Lists
So far the demonstrated capabilities of the strtk::parse
function has been based on passing a series of parameters that are
populated in a linear fashion as the parser processes the tokens it
encounters. That said, some formats have their own sub-structures, a
simple example would be a list of values (such as integers) that need
to be loaded into a deque or stack. StrTk provides a series of sink
facilities that consume a range and an STL container which can be
forwarded onto strtk::parse.
In the following example, the data string is comprised of 3
separate lists delimited by a pipe "|". An integer, a
string and a double type list. Each list is to be parsed into an STL
container of appropriate type. In this case a vector, a deque and a
list. StrTk provides the ability to instantiate a sink for the
specific container type that is compatible with
strtk::parse.
int main()
{
std::string data = "1,+2,-3,4|abc,ijk,rst,xyz|123.456,+234.567,-345.678,456.789,567.890";
std::vector<int> int_vector;
std::deque<std::string> string_deque;
std::list<double> double_list;
std::set<int> int_set;
std::queue<std::string> string_queue;
std::stack<double> double_stack;
std::priority_queue<int> int_priority_queue;
strtk::vector_sink<int>::type vec_sink(",");
strtk::deque_sink<std::string>::type deq_sink(",");
strtk::list_sink<double>::type lst_sink(",");
strtk::set_sink<int>::type set_sink(",");
strtk::queue_sink<std::string>::type que_sink(",");
strtk::stack_sink<double>::type stk_sink(",");
strtk::priority_queue_sink<int>::type prq_sink(",");
strtk::parse(data,"|",vec_sink( int_vector),
deq_sink(string_deque),
lst_sink( double_list));
strtk::parse(data,"|",set_sink( int_set),
que_sink( string_queue),
stk_sink( double_stack),
prq_sink(int_priority_queue));
return 0;
}
If only a certain number of elements in the list are required, for
example only the first 3, then the element count on the sink can be
set appropriately. The above example could be modified as follows:
int main()
{
std::string data = "1,+2,-3,4|string0|abc,ijk,rst,xyz|string1|123.456,+234.567,-345.678,456.789,567.890";
std::vector<int> int_vector;
std::deque<std::string> string_deque;
std::list<double> double_list;
strtk::vector_sink<int>::type vec_sink(",");
strtk::deque_sink<std::string>::type deq_sink(",");
strtk::list_sink<double>::type lst_sink(",");
std::string string_0;
std::string string_1;
strtk::parse(data,"|",vec_sink( int_vector).count(2), string_0,
deq_sink(string_deque).count(3), string_1,
lst_sink( double_list).count(4)); return 0;
}
Note: If there aren't enough elements in a particular
list, then parsing of that list fails and subsequently the whole
strtk::parse call will fail as well.
Parsing Trailing-Lists
Another way one might want to parse a tuple of values might be to parse
a prefix of values into a specific number of possibly varying types,
then to parse the remaining values (assuming they are all of the same
type) into a sequence or list etc. StrTk provides the following simple
solution to the given problem, as demonstrated below:
int main()
{
{
std::string data = "A String Value,111.111,222.222,333.333,444.444,555.555";
std::string token;
std::vector<double> double_list;
strtk::parse(data,",",token,double_list);
}
{
std::string data = "A String Value,01-02-2003,111.111,222.222,333.333,444.444,555.555";
std::string token;
std::string date;
std::deque<double> double_list;
strtk::parse(data,",",token,date,double_list);
}
{
std::string data = "A String Value,01-02-2003,123456789,111.111,222.222,333.333,444.444,555.555";
std::string token;
std::string date;
int i;
std::list<double> double_list;
strtk::parse(data,",",token,date,i,double_list);
}
{
std::string data = "A String Value,01-02-2003,123456789,111.111,222.222,333.333,444.444,555.555";
std::string token;
std::string date;
int i;
double d;
std::vector<double> double_list;
strtk::parse(data,",",token,date,i,d,double_list);
}
{
std::string data = "A String Value,01-02-2003,123456789,111.111,222.222,333.333,444.444,555.555";
std::string token;
std::string date;
int i;
double d1;
double d2;
std::deque<double> double_list;
strtk::parse(data,",",token,date,i,d1,d2,double_list);
}
return 0;
}
Extending The Date-Time Parser Example
Building upon the previous datetime example, We are presented with a
tuple of data that represents an astronomical event. The event
defines a name, a location and a series of date-times in UTC the
event was observed. In order to construct the necessary sink(s) that
will be used for parsing the required type into a container, the
macro strtk_register_userdef_type_sink with the specified type
is invoked. The following is a definition of the struct one might
construct:
struct event_information
{
std::size_t id;
std::string name;
std::string location;
std::deque<datetime> observation_datetimes;
};
strtk_register_userdef_type_sink(datetime)
Bringing the above together with a call to strtk::parse
results in the following code which parses the event data tuple into
the allocated event_information instance.
int main()
{
std::string event_data = "172493|Lunar Impact|Mare Tranquillitatis|"
"2010-01-19 00:28:45.357,2010-02-18 00:57:07.109,"
"2010-03-20 01:15:11.261,2010-04-21 01:07:27.972";
strtk::deque_sink<datetime>::type deq_sink(",");
event_information evt_info;
strtk::parse(event_data,"|",evt_info.id,
evt_info.name,
evt_info.location,
deq_sink(evt_info.observation_datetimes));
return 0;
}
Token Processing During Parsing
StrTk offers a set of convenient and simple token processing
primitives that can be used during a call to the strtk::parse
routine to perform various actions upon the tokens being parsed.
These actions include such things as modifications and
validations of tokens.
The following is a list of token processing primitives used for
constraint and verification purposes:
- strtk::ignore_token
- strtk::expect
- strtk::iexpect
- strtk::like
- strtk::inrange
The following is a list of token processing primitives used for
modifying and normalising purposes:
- strtk::trim
- strtk::trim_leading
- strtk::trim_trailing
- strtk::as_lcase
- strtk::as_ucase
The primitives all return either a true or false value upon
parsing completion, which is then further used by the
strtk::parse routine to determine if the parse
operation as a whole has succeeded or failed.
Ignore Token Processing
There may be scenarios when given a delimited tuple of data, that
one or more of the tokens need to be ignored or skipped. StrTk
provides a mechanism called strtk::ignore_token that allows the
parser to consume specific tokens easily without affecting
overall performance. Below is an example of how
strtk::ignore_token can be used in conjunction with
strtk::parse to skip the 2nd and 4th tokens in the
tuple:
int main()
{
static const std::string data = "+123,ignore0,123.456,ignore1,abcdef,ignore2";
int i;
double d;
std::string s;
strtk::ignore_token ignore;
strtk::parse(data,",",i,ignore,d,ignore,s);
std::cout << "i = " << i << std::endl;
std::cout << "d = " << d << std::endl;
std::cout << "s = " << s << std::endl;
return 0;
}
Expect and IExpect Token Processing
When parsing a tuple, one may want to ensure that specific tokens
of the tuple are of a certain string value. StrTk provides this
type of functionality via the strtk::expect and
strtk::iexpect mechanisms. The strtk::expect form
enforces an exact string match, whereas the strtk::iexpect
enforces only a case-insensitive match. The following is an
example where we attempt to parse a 'pascal-like' variable
declaration and definition. The requirement is that the first
token be "var" followed by a variable name and then a type name
of 'Integer' which is not case sensitive.
int main()
{
static const std::string data = "var foo : InTeGeR = 3;";
std::string variable_name;
int initial_value;
bool result = strtk::parse(data,
" ;",
strtk::expect("var").ref(),
variable_name,
strtk::expect(":").ref(),
strtk::iexpect("Integer").ref(),
strtk::expect("=").ref(),
initial_value);
if (result)
std::cout << variable_name << " = " << initial_value << std::endl;
else
std::cout << "Failed to parse statement!" << std::endl;
return 0;
}
Like Token Processing
Similar to the above mentioned strtk::expect and strtk::iexpect
primitives, StrTk provides a simple wildcard matching of tokens
functionality via the strtk::like mechanism. The special
characters of '*' and '?' are used denoting 'zero or more' and
'zero or one' match modes respectively. The following example
uses the strtk::like in conjunction with strtk::expect to parse a
tuple of key/value pairs.
int main()
{
static const std::string data = "token0=+123;token1=abc;token2=-456.678;";
int i;
std::string s;
double d;
strtk::parse(data,
"=;",
strtk::like("to*n?").ref(),
i,
strtk::like("token?").ref(),
s,
strtk::iexpect("tOkEn2").ref(),
d);
std::cout << "i = " << i << std::endl;
std::cout << "s = " << s << std::endl;
std::cout << "d = " << d << std::endl;
return 0;
}
In-Range Token Processing
When parsing tokens, one may wish to determine if the token when
viewed in its target type resides within a specified range
[r0,r1]. As the tokens can be of any type, not necessarily
just string or numerical in nature, the type must have a less-
than comparable attribute. The following example attempts to
parse a key/value tuple that contains a temperature and a name
component.
int main()
{
static const std::string data = "temperature=+123.456;name=Rumpelstilzchen";
double temperature;
std::string name;
strtk::parse(data,
"=;",
strtk::expect("temperature").ref(),
strtk::inrange(temperature,-432.1,+432.1).ref(),
strtk::expect("name").ref(),
strtk::inrange(name,"AAAA","zzzz").ref());
std::cout << "temperature = " << temperature << std::endl;
std::cout << "name = " << name << std::endl;
return 0;
}
Trim Token Processing
At times tokens within a tuple may have padding added to either
the left, right or both ends. When processing the token it maybe
necessary to remove the superfluous padding before attempting
to convert the string or range representation of the token into
its target type(int, double etc). The example below, demonstrates
the use of various forms of token trimming in conjunction strtk:parse.
int main()
{
{
std::string data = "****abc123****,****abc123****,****abc123****";
std::string s0;
std::string s1;
std::string s2;
strtk::parse(data,",",
strtk::trim("*",s0).ref(),
strtk::trim_leading ("*",s1).ref(),
strtk::trim_trailing("*",s2).ref());
std::cout << "s0 = [" << s0 << "]" << std::endl;
std::cout << "s1 = [" << s1 << "]" << std::endl;
std::cout << "s2 = [" << s2 << "]" << std::endl;
}
{
std::string data = "*?*?a string*?*?,*?*123456,123.456?*?*?";
std::string s;
int i;
double d;
strtk::parse(data,",",
strtk::trim("*?",s).ref(),
strtk::trim_leading ("?*",i).ref(),
strtk::trim_trailing("*?",d).ref());
std::cout << "s = [" << s << "]" << std::endl;
std::cout << "i = [" << i << "]" << std::endl;
std::cout << "d = [" << d << "]" << std::endl;
}
return 0;
}
Case Normalisation Token Processing
Another pair of token processing mechanisms provided by StrTk,
are the strtk::as_lcase and strtk::as_ucase mechanisms. They
convert the string representation of the token to lowercase and
uppercase characters respectively. The following example, parses
a two token tuple, and converts the first token s0 to all
lowercase and the second token s1 to all uppercase.
int main()
{
std::string data = "AbCd,EfGhI";
std::string s0;
std::string s1;
strtk::parse(data,",",
strtk::as_lcase(s0).ref(),
strtk::as_ucase(s1).ref());
std::cout << "s0 = [" << s0 << "]" << std::endl;
std::cout << "s1 = [" << s1 << "]" << std::endl;
return 0;
}
Columnwise Parsing
In the previous section the ability to ignore tokens in a tuple was
discussed. The concept works well if only a few tokens need to be
ignored. However problems arise when the tuples contain a large number
of tokens and the tokens that are to be ignored are numerous and
distributed uniformly over the entire tuple.
Situations such as this one are common, and using the ignore_token
technique can not only make both the coding of the solution cumbersome
and error prone but also make the parsing process itself quite
inefficient.
A natural extension to ignore_token that scales and is also extremely
efficient, can be found in the combined functionalities of the
parse_columns and column_list. The column_list is a structure used to
hold the indexes of the tokens in the tuple that are required. The
indexes have to be valid, unique and in ascending order.
auto cl_even = strtk::column_list(0,2,4,6,8,10,12);
auto cl_odd = strtk::column_list(1,3,5,7);
The strtk::parse_columns function takes a string of
data representing a tuple, a delimiter to determine the tokens in
the tuple, a strkt::column_list and a compatible number
of types as target references.
The number of types has to be equal to the number of indexes in the
column_list, and the types need to be convertible within the strtk
namespace from a string range representation to the type. In the
following example we have a tuple consisting of integers. We're only
interested in the first four even numbered indexes in the tuple, the
code below demonstrates how the tuple is parsed with the given
constraints:
int main()
{
const std::string data = "1,22;333|4444 55555|666666;7777777,88888888 999;"
"0000|1111,22222,333333,4444444";
int i0,i1,i2,i3,i4;
strtk::parse_columns(data,
",;| ",
strtk::column_list(0,2,4,6,8),
i0,i1,i2,i3,i4);
return 0;
}
In the following example we have a tuple consisting of a mixture of
types. We are only interested in the first, fifth and eighth indexes
in the tuple, which happen to be of type int, double and std::string
respectively. The code below demonstrates how the tuple is
parsed with the given constraints:
int main()
{
const std::string data = "token0,1234;token2;token3,token4,11.22|token6;token7|"
"text-i-text,1 2 3 4 5 6 7";
int i;
double d;
std::string s;
strtk::parse_columns(data,
",;| ",
strtk::column_list(1,5,8),
i,d,s);
return 0;
}
Simple 3D Mesh File Format Parser
Lets assume there is a file format for expressing a mesh. The format
consists of a section that defines the unique vertexes in the mesh,
and another section that defines the triangles in the mesh as a tuple
of three indicies indicating the vertexes used. Each section is
preceded by an integer that denotes the number of subsequent
elements in that section. An example of such a file is the following:
5
+1.0,+1.0,+1.0
-1.0,+1.0,-1.0
-1.0,-1.0,+1.0
+1.0,-1.0,-1.0
+0.0,+0.0,+0.0
4
0,1,4
1,2,4
2,3,4
3,1,4
Code to parse such a file format is as follows:
struct point
{
double x,y,z;
};
struct triangle
{
std::size_t i0,i1,i2;
};
int main()
{
std::string mesh_file = "mesh.txt";
std::ifstream stream(mesh_file.c_str());
std::string s;
std::deque<point> points;
point p;
std::size_t point_count = 0;
strtk::parse_line(stream," ",point_count);
strtk::for_each_line_n(stream,
point_count,
[&points,&p](const std::string& line)
{
if (strtk::parse(line,",",p.x,p.y,p.z))
points.push_back(p);
});
std::deque<triangle> triangles;
triangle t;
std::size_t triangle_count = 0;
strtk::parse_line(stream," ",triangle_count);
strtk::for_each_line_n(stream,
triangle_count,
[&triangles,&t](const std::string& line)
{
if (strtk::parse(line,",",t.i0,t.i1,t.i2))
triangles.push_back(t);
});
return 0;
}
Simple Semantic Actions
A semantic action is an action that is undertaken upon a token, it can be
in the form of either an assessment or a manipulation. StrTk provides a very
simplified semantic action capability, named strtk::util::semantic_action
for types that are being parsed via the strtk::parse function.
A function (stateful or stateless), taking an iterator range is used to construct
the semantic_action. The body of the function performs whatever operations
are required and also makes sure to maintain the contract with regards to the
return status for the parse routine to complete successfully.
The following is an example where a comma delimited string is parsed into 3 types,
an integer, a double and a string. The rules regarding parsing and updating of the
variables is as follows, the int variable "i" will only be updated if the
value parsed is odd, the double value "d" will only be updated if the parsed
value is greater than or equal to 99.99 and the string value "s" will only
be updated if the presented range contains the string "ring". Upon every successful
update a corresponding counter will be incremented.
int main()
{
std::string data = "12345,123.456,A string";
int i = 0;
double d = 0.0;
std::string s = "";
std::size_t i_update_count = 0;
std::size_t d_update_count = 0;
std::size_t s_update_count = 0;
typedef strtk::range::ustring::const_iterator itr_type;
using strtk::util::semantic_action;
strtk::parse(data,",",
semantic_action([&i,&i_update_count]
(itr_type begin,itr_type end) -> bool
{
int temp = 0;
if (strtk::string_to_type_converter(begin,end,temp))
{
if (temp % 2 != 0)
{
i = temp;
++i_update_count;
}
return true;
}
else
return false;
}).ref(),
semantic_action([&d,&d_update_count]
(itr_type begin,itr_type end) -> bool
{
double temp = 0.0;
if (strtk::string_to_type_converter(begin,end,temp))
{
if (temp >= 99.99)
{
d = temp;
++d_update_count;
}
return true;
}
else
return false;
}).ref(),
semantic_action([&s,&s_update_count]
(itr_type begin,itr_type end) -> bool
{
static unsigned char pattern[] = "ring";
if (end != std::search(begin,end,pattern,pattern + 4))
{
s.assign(begin,end);
++s_update_count;
}
return true;
}).ref());
std::cout << "i: " << i << std::endl;
std::cout << "d: " << d << std::endl;
std::cout << "s: " << s << std::endl;
return 0;
}
Pick A Random Line From A Text File
A random line of text is to be selected from a user provided
text file comprised of lines of varying length, such that the
probability of the line being selected is 1/N where N is the number
of lines in the text file. There are many trivial solutions to this
problem, however if one were to further constrain the problem by
indicating the file is very large (TBs) and that the system upon
which the solution is to run is very limited memory-wise, most if
not all trivial solutions such as storing indexes of all line offsets,
or reading the entire file into memory etc will be eliminated.
However, there exists a very elegant solution to this problem of O(n),
O(1) time and space complexities respectively, that involves
scanning the entire file once line by line, and at every ith
line choosing whether or not to replace the final result line with
the current line by sampling a uniform distribution of range
[0,1) and performing a replace if and only if the random value is
less than 1 / i.

The logic behind this solution revolves around the fact that the
probability of selecting the ith line will be 1/i and as such the
total probability for selecting any of the previous i-1 lines will be
1 - (1/i) or (i - 1)/i. Because there are (i - 1) lines before the
ith line, we divide the previous sum of probabilities by (i - 1),
resulting in a selection probability of 1/i for any one of the lines
up to and including the ith line. If the ith line were to be the last
line of the text file, this then results in each of the lines
having a selection probability of 1/N - simple and sweet, and so
too is the following implementation of said solution:
int main(int argc, char* argv[])
{
std::string file_name = argv[1];
std::string line;
std::size_t i = 0;
strtk::uniform_real_rng rng(static_cast<std::size_t>(::time(0)));
strtk::for_each_line(file_name,
[&](const std::string& s)
{
if (rng() < (1.0 / ++i))
{
line = s;
}
});
std::cout << line << std::endl;
return 0;
}
What changes to the above code would be required If the probability of line selection
was changed to 1/(N-K) where 0 <= K <= N and K is the number of lines that will be randomly
ignored.
Note: TAOCP Volume II section 3.4.2 has an in depth discussion
about this problem, which is generally known as reservoir sampling,
and other similar problems relating to random distributions. Also one
should note that the above example has an inefficiency, in that upon
every string replace, an actual string is being copied, normally all
one needs to do is maintain a file offset to the beginning of the line,
not doing this causes slow downs due to continuous memory allocations/reallocations
which is made all the worse when large lines are encountered.
The Buy Low And Sell High Problem
Assume we are given a piece of data in csv format which represents
a time-series for the close prices of the SPDR S&P 500 ETF.
The time-series ranges from 04/01/1999 to 11/11/2011. The objective
is to select two dates, the first being the buy date and the second
being the sell date, such that if we were to buy then sell X shares
of the ETF we will have maximized our profit. It should be noted that of course
the buy date must be before the sell date and that short selling is not
an option in this strategy. The following is a chart that represents the
price of the SPDR over the given period:

Through visual inspection we can approximate that the best buy date
would be towards the end of 2002 and the corresponding sell date would
be towards the end of 2007, as these time points seem to give the largest
difference between the two prices. However it also does seems that a
buy roughly at the start of 2009 and a sell at the beginning of 2011
could also provide such a large price difference. As such the visual
inspection approach has lead to an ambiguity, hence a more thorough
and precise approach is required.
One could take the naive approach to solving the buy low sell high
problem, by initially loading the entire time series into memory, then
for each ith time point take the pricei and test it against
every other pricej where i < j, and maintain a "best
profit encountered" structure that contains the best profit so
far and the corresponding buy/sell prices and dates. This solution has
a few problems, initially it is of O(N2) time complexity
and O(N) space complexity. As an example for one million time points
it will require one trillion comparisons and one million date/price
units of storage. If memory and computational processing was limited
on the hardware this solution can not be practically executed in a
continuous online manner. Furthermore as suggested by the time
-complexity as the size of the data grows, regardless of computational
abilities, the compute times for the results would become astronomical
and practically useless - specially for real-time reactive systems.
In these types of problems one tries to assess if an online or
streaming based solution is feasible. That is a solution that does not
require the data to be available all at once, can work on the data
incrementally and requires no more than one-pass for each piece of
data. Such a solution would typically have a time complexity of O(N)
and a space complexity of O(1).
With regards to this problem the crucial insight required to convert
the naive solution from O(N2) complexity to an online
version of O(N) time complexity, is that every new global minima
encountered is the beginning of a new period and an indicator of an
end to the previous period. Looking at the chart, if one were to scan
from left to right, the intuitive response is to find the point with
the lowest price, ignore everything preceding, then try and find the
next point with the highest price or global maxima. There are a few
edge cases that need to be dealt with. The main one being the problem
described above that there are two buy/sell points that could
potentially be the solution. The way around this is to simply maintain
the best encountered period, and compare the profit from any new
period to the best so far, if it is better (more) then replace
the best with the current period. Another edge case is when the data
is in a continual decline, in a situation like this there will
be no profitable buy/sell points. The following is a small
subsection of the time-series in question:
Download spy500.csv
15/06/2000,148.16
16/06/2000,146.59
19/06/2000,148.47
20/06/2000,147.94
21/06/2000,147.88
22/06/2000,145.63
23/06/2000,144.38
26/06/2000,146.23
27/06/2000,145.16
28/06/2000,145.56
29/06/2000,144.19
The code below is a very simple single pass incremental solution to the given problem.
It reads every line of the input file, parses each line into a date and price variable,
checks to see if the current price is less than the current global minima price, if it
is the case, it will set the current period start to the current date and set the buy price
to be the current price, otherwise it checks to see if the current price is larger than the
global maxima price, if that is the case then it updates the current profit, sell price
and sell dates accordingly. If at the end, the buy price is not less than the sell price,
it is indicative that there exists no two time points within the given time series for
which a profitable transaction could occur, otherwise it prints out buy and sell dates
for the required transaction and the expected profit per share.
struct period
{
period()
: profit(std::numeric_limits<double>::min()),
buy_price(std::numeric_limits<double>::max()),
sell_price(std::numeric_limits<double>::min())
{}
bool operator>(const period&p) const
{
return profit > p.profit;
}
double profit;
double buy_price;
double sell_price;
std::string buy_date;
std::string sell_date;
};
int main()
{
period best_period;
period curr_period;
strtk::for_each_line("spy500.csv",
[&best_period,&curr_period]
(const std::string& line)
{
std::string date;
double price;
if (!strtk::parse(line,",",date,price)) return;
if (price < curr_period.buy_price)
{
if (curr_period > best_period)
{
best_period = curr_period;
}
curr_period.buy_date = date;
curr_period.buy_price = price;
curr_period.sell_price = std::numeric_limits<double>::min();
curr_period.sell_date = "";
}
else if (price > curr_period.sell_price)
{
curr_period.sell_price = price;
curr_period.sell_date = date;
curr_period.profit = curr_period.sell_price - curr_period.buy_price;
}
});
if (best_period.buy_price >= best_period.sell_price)
{
std::cout << "No period in time-series can be profitably exploited." << std::endl;
}
else
{
std::cout << "Buy: " << best_period.buy_date << std::endl;
std::cout << "Sell: " << best_period.sell_date << std::endl;
std::cout << "Profit per share: $" << best_period.profit << std::endl;
}
return 0;
}
For the given data it is expected the above piece of code will produce the following output:
Buy: 09/10/2002
Sell: 09/10/2007
Profit per share: $78.38
What changes to the above piece of code would be required if:
- Rather than prices, we instead are given percentage differences from the previous price?
- We have the constraint that we can't hold a position for more than K days
- A penalty of 1/K% is applied to the profit for every day a position is held? (where K is the condition and value described above)
- Short selling is allowed?
- We want to process independent sections of the time series concurrently so as to speed up overall processing time.
- The prices could be extremely large or small?
Token Grid
StrTk provides a means to easily parse and consume 2D grids of tokens
in an efficient and simple manner. A grid is simply defined as a
series of rows comprised of tokens, otherwise known as Delimiter
Separated Values (DSV). The ith token of a row is grouped with
every ith token of all other rows to produce a column. Tokens can be
processed as either rows or columns.
An example of a simple token grid, where each numeric value is deemed
to be a token:
1.1 | 2.2 | 3.3 | 4.4 | 5.5 |
1.1 | 2.2 | 3.3 | 4.4 | 5.5 |
1.1 | 2.2 | 3.3 | 4.4 | 5.5 |
1.1 | 2.2 | 3.3 | 4.4 | 5.5 |
1.1 | 2.2 | 3.3 | 4.4 | 5.5 |
A token grid can be either passed in via a file or a string buffer.
The delimiters to be used for parsing the columns and rows can also
be provided, if not provided standard common defaults will be used.
The following demonstrates how each cell in the grid can be access
and cast to a specific type.
std::string data = "1,2,3,4,5,6\n"
"7,8,9,0,1,2\n"
"3,4,5,6,7,8\n"
"9,0,1,2,3,4\n"
"5,6,7,8,9,0\n";
strtk::token_grid grid(data,data.size(),",");
for (std::size_t r = 0; r < grid.row_count(); ++r)
{
strtk::token_grid::row_type row = grid.row(r);
for (std::size_t c = 0; c < row.size(); ++c)
{
std::cout << grid.get<int>(r,c) << '\t';
}
std::cout << std::endl;
}
The strtk::token_grid provides various helper functions
for traversing rows and columns in batch mode. The functions are namely:
for_each_row that is used for iterating either all or a sub-range
of rows of the token_grid, and for_each_column that is used
for iterating either all or a sub-range of columns of a row.
int main()
{
std::string data = "1.1,2.1,3.1,4.1,5.1,6.1,7.1\n" "1.2,2.2,3.2,4.2,5.2,6.2,7.2\n" "1.3,2.3,3.3,4.3,5.3,6.3,7.3\n" "1.4,2.4,3.4,4.4,5.4,6.4,7.4\n" "1.5,2.5,3.5,4.5,5.5,6.5,7.5\n" "1.6,2.6,3.6,4.6,5.6,6.6,7.6\n" "1.7,2.7,3.7,4.7,5.7,6.7,7.7\n" "1.8,2.8,3.8,4.8,5.8,6.8,7.8\n" "1.9,2.9,3.9,4.9,5.9,6.9,7.9\n";
strtk::token_grid grid(data,data.size(),",");
{
grid.for_each_row(
[](const strtk::token_grid::row_type& row)
{
for (std::size_t i = 0; i < row.size(); ++i)
{
std::cout << row.get<double>(i) << '\t';
}
std::cout << '\n';
});
}
{
grid.for_each_row(grid.range(2,6),
[](const strtk::token_grid::row_type& row)
{
for (std::size_t i = 0; i < row.size(); ++i)
{
std::cout << row.get<double>(i) << '\t';
}
std::cout << '\n';
});
}
{
grid.for_each_row(
[](const strtk::token_grid::row_type& row)
{
row.for_each_column(
[](const strtk::token_grid::row_type::range_type& range)
{
std::cout << std::string(range.first,range.second) << '\t';
});
std::cout << '\n';
});
}
{
grid.for_each_row(grid.range(2,6),
[](const strtk::token_grid::row_type& row)
{
row.for_each_column(row.range(1,5),
[](const strtk::token_grid::row_type::range_type& range)
{
std::cout << std::string(range.first,range.second) << '\t';
});
std::cout << '\n';
});
}
return 0;
}
The following example demonstrates how averages over rows and columns
of a token grid can be computed:
std::string data = "1.1,1.1,1.1,1.1,1.1,1.1\n"
"2.2,2.2,2.2,2.2,2.2,2.2\n"
"3.3,3.3,3.3,3.3,3.3,3.3\n"
"4.4,4.4,4.4,4.4,4.4,4.4\n"
"5.5,5.5,5.5,5.5,5.5,5.5\n"
"6.6,6.6,6.6,6.6,6.6,6.6\n";
strtk::token_grid grid(data,data.size(),",");
std::vector<double> avg_c(grid.row(0).size(),0.0);
std::vector<double> avg_r(grid.row_count(),0.0);
std::vector<double> tmp(grid.row(0).size(),0.0);
std::fill(avg_c.begin(),avg_c.end(),0.0);
for (std::size_t i = 0; i < grid.row_count(); ++i)
{
grid.row(i).parse<double>(tmp.begin());
std::transform(avg_c.begin(),avg_c.end(),tmp.begin(),avg_c.begin(),std::plus<double>());
avg_r[i] = std::accumulate(tmp.begin(),tmp.end(),0.0) / tmp.size();
}
for (std::size_t i = 0; i < avg_c.size(); avg_c[i++] /= grid.row_count());
std::cout << "Column Averages:\t";
std::copy(avg_c.begin(),avg_c.end(),std::ostream_iterator<double>(std::cout,'\t'));
std::cout << '\n';
std::cout << "Row Averages:\t";
std::copy(avg_r.begin(),avg_r.end(),std::ostream_iterator<double>(std::cout,'\t'));
std::cout << '\n';
Processing Of Comma Separated Values Data
The original intent of the token grid was to support fast and
efficient processing of simple data tuples, such as comma separated
values (CSV) formats et. al. The following example demonstrates a
simple summation of traded floor volume and average daily volume based
on NASDAQ OHLC (Open-High-Low-Close) data.
std::string market_data = "20090701,GOOG,424.2000,418.9900,426.4000,418.1500,2310768\n"
"20090701,MSFT,24.0500,24.0400,24.3000,23.9600,54915127\n"
"20090702,GOOG,415.4100,408.4900,415.4100,406.8100,2517630\n"
"20090702,MSFT,23.7600,23.3700,24.0400,23.2100,65427699\n"
"20090703,GOOG,408.4900,408.4900,408.4900,408.4900,0\n"
"20090703,MSFT,23.3700,23.3700,23.3700,23.3700,0\n"
"20090706,GOOG,406.5000,409.6100,410.6400,401.6600,2262557\n"
"20090706,MSFT,23.2100,23.2000,23.2800,22.8700,49207638\n"
"20090707,GOOG,408.2400,396.6300,409.1900,395.9801,3260307\n"
"20090707,MSFT,23.0800,22.5300,23.1400,22.4600,52842412\n"
"20090708,GOOG,400.0000,402.4900,406.0000,398.0600,3441854\n"
"20090708,MSFT,22.3100,22.5600,22.6900,2200000,73023306\n"
"20090709,GOOG,406.1200,410.3900,414.4500,405.8000,3275816\n"
"20090709,MSFT,22.6500,22.4400,22.8100,22.3700,46981174\n"
"20090710,GOOG,409.5700,414.4000,417.3700,408.7000,2929559\n"
"20090710,MSFT,22.1900,22.3900,22.5400,22.1500,43238698\n";
strtk::token_grid grid(market_data,market_data.size(),",");
struct stock_info
{
stock_info(const std::string& s = " ")
: symbol(s),
total_volume(0),
day_count(0),
average_daily_volume(0.0)
{}
std::string symbol;
unsigned long long total_volume;
std::size_t day_count;
double average_daily_volume;
};
stock_info goog("GOOG");
stock_info msft("MSFT");
static const std::size_t volume_column = 6;
static const std::size_t symbol_column = 1;
goog.day_count = grid.accumulate_column(volume_column,
[](const strtk::token_grid::row_type& row) -> bool
{
static const std::string google_symbol("GOOG");
return google_symbol == row.get<std::string>(symbol_column);
},
goog.total_volume);
msft.day_count = grid.accumulate_column(volume_column,
[](const strtk::token_grid::row_type& row) -> bool
{
static const std::string microsoft_symbol("MSFT");
return microsoft_symbol == row.get<std::string>(symbol_column);
},
msft.total_volume);
goog.average_daily_volume = (1.0 * goog.total_volume) / goog.day_count;
msft.average_daily_volume = (1.0 * msft.total_volume) / msft.day_count;
std::cout << "[GOOG] Total Volume: " << goog.total_volume << std::endl;
std::cout << "[MSFT] Total Volume: " << msft.total_volume << std::endl;
std::cout << "[GOOG] ADV: " << goog.average_daily_volume << std::endl;
std::cout << "[MSFT] ADV: " << msft.average_daily_volume << std::endl;
The strtk::token_grid is thread-safe iff read operations
are in play. As such the above calls to accumulate_column et al. can
all be safely and easily executed concurrently using threads. This
allows for a far more efficient data processing methodology.
TIMTOWTDI
Playing devil's advocate, another way of performing the above
processing task, assuming only the specific values for computing the
ADV are required and no further processing of the CSV data is
needed, then the problem can be solved efficiently by utilizing a
single pass of the data as follows:
std::string file_name = "market_data.csv";
std::unordered_map<std::string,stock_info> stock_map;
stock_map.insert(std::make_pair<std::string,stock_info>("GOOG",stock_info("GOOG")));
stock_map.insert(std::make_pair<std::string,stock_info>("MSFT",stock_info("MSFT")));
strtk::for_each_line(file_name,
[&](const std::string& line)
{
strtk::ignore_token ignore;
stock_info temp;
const bool result = strtk::parse(line,
",",
ignore,
temp.symbol,
ignore,
ignore,
ignore,
ignore,
temp.total_volume);
if (!result) return;
auto itr = stock_map.find(temp.symbol);
if (stock_map.end() == itr) return;
(*itr).second.total_volume += temp.total_volume;
(*itr).second.day_count++;
});
auto itr = stock_map.begin();
auto end = stock_map.end();
while (end != itr)
{
stock_info& stock = (*itr++).second;
stock.average_daily_volume = (1.0 * stock.total_volume) / stock.day_count;
}
TIMTOWTDI - II (with a vengeance)
Playing the devil's other advocate, the above two examples, have both
required that the filter condition be explicitly defined at compile
time. However even though the condition maybe be set in stone at
compile time, some of the underlyings (such as symbol) can be
engineered to be modified at run-time. That still doesn't give us the
freedom to perform arbitrarily complex filter expressions determined
at run-time.
That said, an extremely efficient and very simple solution is at hand. The solution
is called the C++ DSV Filter library, it is based on StrTK and ExprTk libraries. It
uses the strtk::token_grid as a CSV/DSV store and index, and ExprTk as
the underlying expression evaluation engine. The example below takes the OHLC market
data table defined above and performs a row-wise query. The expression's definition
is:
select all rows where the open price is greater than the close price and the
symbol matches the wild-card pattern of "*FT*" and the date is equal to or
after 20090101.
int main()
{
std::string file_name = "market_data.csv";
dsv_filter filter;
filter.set_input_delimiter(",");
if (!filter.load(file_name))
return 1;
std::string expression = "(open > close) and (symbol like '*FT*') and (date >= '20090101')";
filter.add_filter(filter_expression);
for (std::size_t row = 1; row < filter.row_count(); ++row)
{
if (dsv_filter::e_match == filter[row])
{
}
}
return 0;
}
Other example queries:
volume >= 1000000 and symbol == 'GOOG'
abs(open - close) > abs(high - low)
avg(open,close,high,low) * volume > 10^7 and inrange('20090702',date,'20090730')
(open > close) and (symbol like '*FT*') and (date >= '20090101')
It should be noted that in the example above, the rows begin at index 1.
That is done because the dsv_filter expects the first row or
row at index 0 to be a column definition header. The format of the column
definitions is to simply add a suffix of "_s" if the values in the column
are to be treated as strings or "_n" if they are to be treated as numbers.
When defining expressions the suffixes should not be included when including
the column names. The above mentioned OHLC csv file's header would be as
follows:
Date_n,Symbol_s,Open_n,Close_n,High_n,Low_n,Volume_n
C++ DSV Filter and Dependencies
Sequential Partitions
A typical operation carried out upon time-series data is to group
tuples into buckets (or bins) based upon the time index
value. For example grouping data into 2-minute buckets and then
performing some kind of operation upon the grouped tuples such as a
summation or an average etc. This process is sometimes also called:
"discretization"
The strtk::token_grid class provides a method known as
sequential_partition. The sequential_partition
method requires a Transition Predicate, a Function
and optionally a row-range. The Transition Predicate
consumes a row and returns true only if the row is
in the next partition. All other subsequent consecutive rows until
the transition predicate returns a true are said to be
in the current partition. Prior to transitioning to a new partition,
the function predicate is invoked and provided with the range of
rows in the current partition.
The following example takes a simple time-series (time and
value), partitions the tuples into groups of Time-Buckets
of period length 3 and then proceeds to compute the total sum of each
group. The below summarizer class provides provides a
Transition Predicate and Function.

class summarizer
{
public:
enum column_index
{
tick_time_column = 0,
tick_value_column = 1
};
summarizer(std::deque<double>& sum_value)
: next_tick_time_(0),
sum_value_(sum_value)
{}
inline bool operator()(const strtk::token_grid::row_type& row)
{
if (row.get<std::size_t>(tick_time_column) >= next_tick_time_)
{
next_tick_time_ = row.get<std::size_t>(tick_time_column) + 3;
return true;
}
else
return false;
}
inline bool operator()(const strtk::token_grid& grid,
const strtk::token_grid::row_range_type& range)
{
double bucket_sum = 0.0;
if (!grid.accumulate_column(tick_value_column,range,bucket_sum))
{
std::cout << "failed to accumulate!" << std::endl;
return false;
}
else
sum_value_.push_back(bucket_sum);
return true;
}
private:
summarizer& operator=(const summarizer&);
std::size_t next_tick_time_;
std::deque<double>& sum_value_;
};
int main()
{
std::string data = "10000,123.456\n"
"10001,612.345\n"
"10002,561.234\n"
"10003,456.123\n"
"10004,345.612\n"
"10005,234.561\n"
"10006,123.456\n";
strtk::token_grid grid(data,data.size(),",");
std::deque<double> sum_value;
summarizer s(sum_value);
grid.sequential_partition(s,s);
for (std::size_t i = 0; i < sum_value.size(); ++i)
{
std::cout << "bucket[" << i << "] = " << sum_value[i] << std::endl;
}
return 0;
}
The expected output is as follows:
bucket[0] = 1297.035
bucket[1] = 1036.296
bucket[2] = 123.456
Parsing CSV Files With Embedded Double-Quotes
One of the simple extensions to the CSV format is the concept of
double quoted tokens. Such tokens may contain column or row
delimiters. When such a scenario is encountered, all subsequent
delimiters are ignored, and kept as part of the token, until the
corresponding closing double quote is encountered. The StrTk
token_grid supports the parsing of such tokens. This
parsing mode can be easily activated via the token_grid
option set. Below is an example of a token_grid loading a
CSV data set representing various airports from around the world and
their specific codes and locations, in which some of the cells are
double-quoted:
| ICAO | IATA | Airport | City | Country |
| AYGA | GKA | "Goroka Gatue" | Goroka | Papua New Guinea |
| BGCO | GCO | "Nerlerit Inaat Constable Pynt" | "Nerlerit Inaat" | Greenland |
| BZGD | ZGD | Godley | Auckland | New Zealand |
| CYQM | YQM | "Greater Moncton International" | Moncton | Canada |
| EDRK | ZNV | "Koblenz Winningen" | Koblenz | Germany |
| FAHU | AHU | "HMS Bastard Memorial" | Kwazulu-Natal | South Africa |
| FQMP | MZB | "Mocimboa Da Praia" | "Mocimboa Da Praia" | Mozambique |
| KINS | INS | "Indian Springs AF AUX" | Indian Springs | USA |
| UHNN | HNN | Nikolaevsk | "Nikolaevsk Na Amure" | Russia |
| WBKK | BKI | "Kota Kinabalu International" | Kota Kinabalu | Malaysia |
| ZSJD | JDZ | "Jingdezhen Airport" | Jingdezhen | China |
The following is the StrTk code example using token_grid to parse the above CSV data set:
int main()
{
std::string airport_data_file_name = "airport_data.csv";
strtk::token_grid::options options;
options.column_delimiters = "| ,";
options.support_dquotes = true;
strtk::token_grid airport_grid(airport_data_file_name,options);
for (std::size_t r = 0; r < airport_grid.row_count(); ++r)
{
strtk::token_grid::row_type row = airport_grid.row(r);
for (std::size_t c = 0; c < row.size(); ++c)
{
std::cout << "[" << row.get<std::string>(c) << "] ";
}
std::cout << std::endl;
}
return 0;
}
Extending Delimiter Predicates
As previously mentioned the concept of a delimiter based predicate
can lead to some very interesting solutions. A predicate as has been
defined so far, with the exception of the offset predicate, has been
a stateless entity. Adding the ability to maintain a state based on
what the predicate has encountered so far can allow it to behave
differently from the simple single and multiple delimiter predicates.
For this example, lets assume a typical command line parsing problem
which consists of double quotation mark groupings and escapable
special characters, which can be considered being dual use as either
delimiters or data. An example input and output is as follows:
Inputs
Data | abc;"123, mno xyz",i\,jk |
Delimiters | <space>;,. |
Output Tokens
Token0 | abc |
Token1 | 123, mno xyz |
Token2 | i\,jk |
In order to tokenize the above described string, one can create a
composite predicate using a multiple char delimiter predicate and some
simple state rules. The following is an example of such an extended
predicate:
class extended_predicate
{
public:
extended_predicate(const std::string& delimiters)
: escape_(false),
in_bracket_range_(false),
mdp_(delimiters)
{}
inline bool operator()(const unsigned char c) const
{
if (escape_)
{
escape_ = false;
return false;
}
else if ('\\' == c)
{
escape_ = true;
return false;
}
else if ('"' == c)
{
in_bracket_range_ = !in_bracket_range_;
return true;
}
else if (in_bracket_range_)
return false;
else
return mdp_(c);
}
inline void reset()
{
escape_ = false;
in_bracket_range_ = false;
}
private:
mutable bool escape_;
mutable bool in_bracket_range_;
mutable strtk::multiple_char_delimiter_predicate mdp_;
};
Usage of the newly defined extended predicate is as follows:
int main()
{
std::string str = "abc;\"123, mno xyz\",i\\,jk";
strtk::std_string::token_list_type token_list;
strtk::split(extended_predicate(".,; "),
str,
std::back_inserter(token_list),
strtk::split_options::compress_delimiters);
return 0;
}
High Performance Key-Value Parsing
Taking our previous person struct as an example. It
is clear that the tuple format has to be very specific with
regards to the ordering of data. For most situations this is
acceptable as the serializers and deserializers attempt to
function in the simplest manner possible, however sometimes
pieces of information may come in different orderings, or may be
deemed optional and hence not be present in a particular
tuple of data.
An approach is required that provides the means of mapping
specific pieces of data to the corresponding variables (or
members) that will store or make sense of them, in a very
efficient and simple manner. For clarity purposes, the term
"mapping" here means to populate the desired member with the
given data. The association between the two being made with the
key that is paired in-line with the data. Key-Value pairs or some
times known as Attribute-Value pairs are one of the means
by which such an association can be accomplished. The following
is a diagram that demonstrates the mapping of various fields
of a data struct to their corresponding data elements in a tuple
comprised of key-value pairs.

In the diagram above, the key-value pairs are separated (delimited)
by the pipe symbol "|". With regards to the key-value pairs
themselves, the key is traditionally the first element in the pair
and the value is the second element, they are separated in this case
by a single equal sign "=".
Back to our original problem relating to the person struct.
If we were to add keys to each of the pieces of data, we could then not
only parse a tuple of data representing the various fields of the person
struct, but those fields could be any order within the tuple. Furthermore
any of the fields could be deemed optional, hence not necessarily be present
in the tuple. It should be noted that what would denote a correct or successful
parse of tuple may not only depend upon successful parsing of ranges into the
various types, but it may also depend upon the mandatory presence of certain
fields.
Our objective will be to successfully and in the most efficient way
possible parse the following list of tuples that represent instances
of our person struct.
Tuple0 = "ID=0xFA37ED12|NAME=Rumpelstiltskin|AGE=397|HEIGHT=1.31|WEIGHT=58.7"
Tuple1 = "NAME=Rumpelstiltskin|AGE=397|HEIGHT=1.31|WEIGHT=58.7|ID=0xFA37ED12"
Tuple2 = "AGE=397|HEIGHT=1.31|WEIGHT=58.7|ID=0xFA37ED12|NAME=Rumpelstiltskin"
Tuple3 = "HEIGHT=1.31|WEIGHT=58.7|ID=0xFA37ED12|NAME=Rumpelstiltskin|AGE=397"
Tuple4 = "WEIGHT=58.7|ID=0xFA37ED12|NAME=Rumpelstiltskin|AGE=397|HEIGHT=1.31"
Tuple5 = "ID=0xFA37ED12|WEIGHT=58.7|AGE=397|NAME=Rumpelstiltskin|HEIGHT=1.31"
Tuple6 = "WEIGHT=58.7|AGE=397|NAME=Rumpelstiltskin|HEIGHT=1.31|ID=0xFA37ED12"
Tuple7 = "AGE=397|NAME=Rumpelstiltskin|HEIGHT=1.31|ID=0xFA37ED12|WEIGHT=58.7"
Tuple8 = "NAME=Rumpelstiltskin|HEIGHT=1.31|ID=0xFA37ED12|WEIGHT=58.7|AGE=397"
Tuple9 = "HEIGHT=1.31|ID=0xFA37ED12|WEIGHT=58.7|AGE=397|NAME=Rumpelstiltskin"
StrTk provides a means to achieve the above key-value pair parsing task, namely
via the strtk::keyvalue::parser and associated key-mappers. The
following is a table that depicts the various kind of key to value mappers that
are available in the StrTk library:
Key To Value Mappers
| Mapper | Type | Key Lookup Complexity | Maximum Size |
strtk::keyvalue::stringkey_map | std::string | O(log(n)) | Limited to available memory |
strtk::keyvalue::uintkey_map | cardinal value | O(1) | Limited to expected maximum key value |
In the code below we initially begin by defining the delimiters we expect to see
between pairs of key-values and in between the key and value pairs themselves. In
the example below the pair_block_delimiter field denotes the delimiter
we expect between pairs (or blocks) of key-values and the field pair_delimiter
denotes the delimiter we expect between a particular key and value.
Next we define p as an instance of the person struct and
register each of the members of the instance p with a corresponding
key with the keyvalue::parser. After which we then process
each tuple of data, parsing the tuple, populating the instance p, then
printing out the various fields to stdout.
struct person
{
unsigned int id; std::string name; unsigned int age; double height float weight; };
int main()
{
const std::string person_data[] =
{
"ID=0xFA37ED12|NAME=Rumpelstiltskin|AGE=397|HEIGHT=1.31|WEIGHT=58.7",
"NAME=Rumpelstiltskin|AGE=397|HEIGHT=1.31|WEIGHT=58.7|ID=0xFA37ED12",
"AGE=397|HEIGHT=1.31|WEIGHT=58.7|ID=0xFA37ED12|NAME=Rumpelstiltskin",
"HEIGHT=1.31|WEIGHT=58.7|ID=0xFA37ED12|NAME=Rumpelstiltskin|AGE=397",
"WEIGHT=58.7|ID=0xFA37ED12|NAME=Rumpelstiltskin|AGE=397|HEIGHT=1.31",
"ID=0xFA37ED12|WEIGHT=58.7|AGE=397|NAME=Rumpelstiltskin|HEIGHT=1.31",
"WEIGHT=58.7|AGE=397|NAME=Rumpelstiltskin|HEIGHT=1.31|ID=0xFA37ED12",
"AGE=397|NAME=Rumpelstiltskin|HEIGHT=1.31|ID=0xFA37ED12|WEIGHT=58.7",
"NAME=Rumpelstiltskin|HEIGHT=1.31|ID=0xFA37ED12|WEIGHT=58.7|AGE=397",
"HEIGHT=1.31|ID=0xFA37ED12|WEIGHT=58.7|AGE=397|NAME=Rumpelstiltskin",
};
const std::size_t person_data_size = sizeof(person_data) / sizeof(std::string);
typedef unsigned char char_type;
typedef strtk::keyvalue::parser<strtk::keyvalue::stringkey_map> kvp_type;
typedef strtk::keyvalue::options<char_type> opts_type;
opts_type options;
options.pair_block_delimiter = '|';
options.pair_delimiter = '=';
kvp_type kvp(options);
person p;
strtk::hex_to_number_sink<unsigned int> h2ns(p.id);
kvp.register_keyvalue("ID" , h2ns);
kvp.register_keyvalue("NAME" , p. name);
kvp.register_keyvalue("AGE" , p. age);
kvp.register_keyvalue("HEIGHT", p.height);
kvp.register_keyvalue("WEIGHT", p.weight);
for (std::size_t i = 0; i < person_data_size; ++i)
{
if (kvp(person_data[i]))
{
std::cout << i << '\t'
<< "ID: " << p.id << '\t'
<< "Name: " << p.name << '\t'
<< "AGE: " << p.age << '\t'
<< "HEIGHT: " << p.height << '\t'
<< "WEIGHT: " << p.weight << '\n';
}
}
return 0;
}
It should be noted that the underlying associative container for the
strtk::keyvalue::stringkey_map can be explicitly specified
at compile time. By default it is std::map, however it can
be easily changed to std::unordered_map giving it a key
lookup complexity of O(1) or in fact any other container that is
compatible with STL associative map semantics. The following are three
examples of keyvalue_parsers specialized using std::unordered_map,
the first being based on the std::string type, the second
being based on the int type and the third being based on
the double type:
int main()
{
{
typedef unsigned char char_type;
typedef std::unordered_map<std::string,strtk::util::value> key_to_val_map_t;
typedef strtk::keyvalue::key_map<std::string,key_to_val_map_t> stringhashkey_mapper;
typedef strtk::keyvalue::parser<stringhashkey_mapper> kvp_type;
typedef strtk::keyvalue::options<char_type> opts_type;
opts_type options;
options.pair_block_delimiter = '|';
options.pair_delimiter = '=';
kvp_type kvp(options);
}
{
typedef unsigned char char_type;
typedef std::unordered_map<int,strtk::util::value> key_to_val_map_t;
typedef strtk::keyvalue::key_map<int,key_to_val_map_t> inthash_key_mapper;
typedef strtk::keyvalue::parser<inthash_key_mapper> kvp_type;
typedef strtk::keyvalue::options<char_type> opts_type;
opts_type options;
options.pair_block_delimiter = '|';
options.pair_delimiter = '=';
kvp_type kvp(options);
}
{
typedef unsigned char char_type;
typedef std::unordered_map<double,strtk::util::value> key_to_val_map_t;
typedef strtk::keyvalue::key_map<double,key_to_val_map_t> doublehash_key_mapper;
typedef strtk::keyvalue::parser<doublehash_key_mapper> kvp_type;
typedef strtk::keyvalue::options<char_type> opts_type;
opts_type options;
options.pair_block_delimiter = '|';
options.pair_delimiter = '=';
kvp_type kvp(options);
}
return 0;
}
Key-Value Pairs And Multiple Distinct Data Structures
To further the previous example, one might have a situation where
a tuple contains values for different members of different types.
The following example demonstrates parsing of key-value pairs that
map to members of multiple types. The tuple in the example consists
of values some of which are intended for the instance of data1 and
others which are intended for the instance of data2.
struct data1
{
int x0;
std::string x1;
double x2;
float x3;
}
struct data2
{
int y0;
std::string y1;
double y2;
float y3;
}
int main()
{
typedef unsigned char char_type;
typedef strtk::keyvalue::parser<strtk::keyvalue::stringkey_map> kvp_type;
typedef strtk::keyvalue::options<char_type> opts_type;
opts_type options;
options.pair_block_delimiter = '|';
options.pair_delimiter = '=';
kvp_type kvp(options);
data1 d1;
data2 d2;
std::string data = "D1X0=123456789|D1X1=AbCdEfG123456|D1X2=123.456|D1X3=0101.0202f|"
"D2Y0=987654321|D2Y1=tUvWxYz789012|D2Y2=789.012|D2Y3=0707.0909f";
kvp.register_keyvalue("D1X0", d1.x0);
kvp.register_keyvalue("D1X1", d1.x1);
kvp.register_keyvalue("D1X2", d1.x2);
kvp.register_keyvalue("D1X3", d1.x3);
kvp.register_keyvalue("D2Y0", d2.y0);
kvp.register_keyvalue("D2Y1", d2.y1);
kvp.register_keyvalue("D2Y2", d2.y2);
kvp.register_keyvalue("D2Y3", d2.y3);
kvp(data);
std::cout << "D1X0: " << d1.x0 << '\t'
<< "D1X1: " << d1.x1 << '\t'
<< "D1X2: " << d1.x2 << '\t'
<< "D1X3: " << d1.x3 << '\n';
std::cout << "D2Y0: " << d2.y0 << '\t'
<< "D2Y1: " << d2.y1 << '\t'
<< "D2Y2: " << d2.y2 << '\t'
<< "D2Y3: " << d2.y3 << '\n';
return 0;
}
Key-Value Pairs And Lists
The values in the key-value pairs need not always be singular. In some
scenarios, a value could be a list of values. The StrTk key-value parser
supports parsing of such key-value pairs through the previously demonstrated
sequence sink mechanisms.
In the following example we have a complex_data struct that
consists of some PODs but also a number of sequences (vector,deque,list).
The tuple to be parsed consists of simple key-value pairs for the PODs and
more complex looking pairs where the values for the specific sequence of values
are separated by commas ",".
struct complex_data
{
unsigned int v0;
std::vector<int> v1;
double v2;
std::deque<double> v3;
std::string v4;
std::list<std::string> v5;
};
int main()
{
typedef unsigned char char_type;
typedef strtk::keyvalue::parser<strtk::keyvalue::stringkey_map> kvp_type;
typedef strtk::keyvalue::options<char_type> opts_type;
opts_type options;
options.pair_block_delimiter = '|';
options.pair_delimiter = '=';
kvp_type kvp(options);
complex_data cd;
std::string data = "V0=123456789|V1=-3,-2,-1,0,+1,2+3|V2=111.222|V3=1.1,2.2,3.3,4.4,5.5|"
"V4=Some Text|V5=Text1,Text2,Text3,Text4";
strtk::vector_sink<int>::type vec_sink(",");
strtk::deque_sink<double>::type deq_sink(",");
strtk::list_sink<std::string>::type lst_sink(",");
kvp.register_keyvalue("V0", cd.v0);
kvp.register_keyvalue("V1", vec_sink(cd.v1));
kvp.register_keyvalue("V2", cd.v2);
kvp.register_keyvalue("V3", deq_sink(cd.v3));
kvp.register_keyvalue("V4", cd.v4);
kvp.register_keyvalue("V5", lst_sink(cd.v5));
kvp(data);
std::cout << "V0: " << cd.v0 << '\n'
<< "V1: " << strtk::join(" ",cd.v1) << '\n'
<< "V2: " << cd.v2 << '\n'
<< "V3: " << strtk::join(" ",cd.v3) << '\n'
<< "V4: " << cd.v4 << '\n'
<< "V5: " << strtk::join(" ",cd.v5) << '\n';
return 0;
}
Key-Value Pairs With Cardinal Keys
So far all the examples have assumed keys of arbitrary values. In
some situations such as the FIX protocol, the keys are always guaranteed
to be positive integer values. If this is the case then a different kind
of key-mapper can be used that is much more efficient than the general
purpose string key-mapper, providing a key lookup complexity of O(1).
As such StrTk provides the strtk::keyvalue::uintkey_map
type for this purpose. The only difference in terms of setting up the
uintkey_map from the stringkey_map is that it
requires a key_count to be set. This value represents the
largest possible key value that can exist. The following is an example of
how the strtk::keyvalue::uintkey_map key-mapper can be used.
struct data_store
{
char c; unsigned char uc; short s; unsigned short us; int i; std::string str; };
int main()
{
const std::string data = "121=A|122=z|123=-123|124=456|125=-12345678|126=Some simple text";
typedef strtk::keyvalue::parser<strtk::keyvalue::uintkey_map> kvp_type;
strtk::keyvalue::uintkey_map::options options;
options.key_count = 127; options.pair_block_delimiter = '|';
options.pair_delimiter = '=';
kvp_type kvp(options);
data_store ds;
kvp.register_keyvalue(121,ds. c);
kvp.register_keyvalue(122,ds. uc);
kvp.register_keyvalue(123,ds. s);
kvp.register_keyvalue(124,ds. us);
kvp.register_keyvalue(125,ds. i);
kvp.register_keyvalue(126,ds.str);
kvp(data);
return 0;
}
Optional Key-Value Pairs
As previously mentioned there is a use-case where certain fields
may be deemed optional and hence their absence would not constitute a
parsing error. That said, it would also be beneficial to know if a
particular field has been populated or not once the sequence of key
value pairs has been completely parsed and mapped. StrTk provides such
functionality though the use of the strtk::util::attribute
type.
The strtk::util::attribute acts as a proxy for the underlying
type which it is specialized upon. It provides a conversion cast to the
underlying type, and also maintains an 'initialised' state value that can
be used to query the attribute about the underlying type's initialised
status. The following is an example of how one could use the
strtk::util::attribute type in conjunction with the
keyvalue::parser:
struct data_store
{
strtk::util::attribute<int> d1;
strtk::util::attribute<unsigned int> d2;
strtk::util::attribute<double> d3;
strtk::util::attribute<float> d4;
strtk::util::attribute<std::string> d5;
};
int main()
{
std::string data = "INT=-1234|UINT=+5678|DOUBLE=1234.5678|FLOAT=90123.4567f|STRING=Some simple text";
typedef unsigned char char_type;
typedef strtk::keyvalue::parser<strtk::keyvalue::stringkey_map> kvp_type;
typedef strtk::keyvalue::options<char_type> opts_type;
opts_type options;
options.pair_block_delimiter = '|';
options.pair_delimiter = '=';
kvp_type kvp(options);
data_store ds;
d1.initialised() = false;
d2.initialised() = false;
d3.initialised() = false;
d4.initialised() = false;
d5.initialised() = false;
kvp.register_keyvalue("INT" , ds.d1);
kvp.register_keyvalue("UINT" , ds.d2);
kvp.register_keyvalue("DOUBLE", ds.d3);
kvp.register_keyvalue("FLOAT" , ds.d4);
kvp.register_keyvalue("STRING", ds.d5);
if (!kvp(data))
{
std::cout << "Failed to parse key-value data: " << data << std::endl;
return 1;
}
if (!d1.initialised()) { std::cout << "d1 has not been initialised." << std::endl; }
if (!d2.initialised()) { std::cout << "d2 has not been initialised." << std::endl; }
if (!d3.initialised()) { std::cout << "d3 has not been initialised." << std::endl; }
if (!d4.initialised()) { std::cout << "d4 has not been initialised." << std::endl; }
if (!d5.initialised()) { std::cout << "d5 has not been initialised." << std::endl; }
return 0;
}
Semantic Actions with Key-Value Pairs
There may be times that when key-value pairs are being parsed certain
actions need to be executed or behaviours exhibited in-situ with the
parsing process. As previously mentioned, StrTk provides the type
semantic_action that can act as a proxy for a generic
type during the parsing process that also takes a functor or lambda
and executes it at the conversion call. The following example
demonstrates the parsing of an array of tuples comprised of key-value
pairs that map to members of a struct namely data_store.
The keys 111, 222 and 333 each represent a specific value type, they
also require a certain behavior to be exhibited. In this example, for
simplicity, as the values of the various keys are being parsed, a
simple message will be printed to the console denoting the nature of
the parsing process. The code is as follows:
struct data_store
{
data_store()
: i1(0),
d1(0.0),
s1("")
{}
int i1; double d1; std::string s1;
};
int main()
{
static const std::string data[] =
{
"111=-12345|222=1987.654321|333=An interesting string0", "222=2987.654321|333=An interesting string1|111=+12345", "333=An interesting string2|111=-12345|222=3987.654321", "222=4987.654321|111=+12345|333=An interesting string3", "333=An interesting string4|222=5987.654321|111=-12345", "111=+12345|333=An interesting string5|222=6987.654321", };
static const std::size_t data_size = sizeof(data) / sizeof(std::string);
typedef strtk::keyvalue::parser<strtk::keyvalue::uintkey_map> kvp_type;
strtk::keyvalue::uintkey_map::options options;
options.key_count = 334; options.pair_block_delimiter = '|';
options.pair_delimiter = '=';
kvp_type kvp(options);
data_store ds;
typedef strtk::range::ustring::const_iterator itr_type;
using strtk::util::semantic_action;
auto lambda_key111 = [&ds](itr_type begin,itr_type end) -> bool
{
if (!strtk::string_to_type_converter(begin,end,ds.i1))
{
std::cout << "Failed to parse value for key=111\n";
return false;
}
std::cout << "key[111] value=" << ds.i1 << std::endl;
return true;
};
auto lambda_key222 = [&ds](itr_type begin,itr_type end) -> bool
{
if (!strtk::string_to_type_converter(begin,end,ds.d1))
{
std::cout << "Failed to parse value for key=222\n";
return false;
}
std::cout << "key[222] value=" << ds.d1 << std::endl;
return true;
};
auto lambda_key333 = [&ds](itr_type begin,itr_type end) -> bool
{
if (!strtk::string_to_type_converter(begin,end,ds.s1))
{
std::cout << "Failed to parse value for key=333\n";
return false;
}
std::cout << "key[333] value=" << ds.s1 << std::endl;
return true;
};
kvp.register_keyvalue(111,semantic_action(lambda_key111).ref());
kvp.register_keyvalue(222,semantic_action(lambda_key222).ref());
kvp.register_keyvalue(333,semantic_action(lambda_key333).ref());
for (std::size_t i = 0; i < data_size; ++i)
{
if (!kvp(data[i]))
std::cout << "Error while parsing tuple: " << i << std::endl;
std::cout << std::endl;
}
return 0;
}
Note: It should be noted that semantic actions during the parsing process have
a multitude of uses, some of which are: validation of parsed values, that is making sure
that they're in a specified range or within a predefined set of values, complex parsed
value manipulations, invoking of external state-machines to transition to new states
based on the parsed value or even simply the presence of a key etc.
The Letters Game
In the popular TV game show Countdown (aka Letters and Numbers),
contestants in the Letters round take turns choosing letters from
either a vowel or consonant bin. Typically up to 9 letters are chosen,
after which the contestants are given a certain amount of time to find
the longest 'valid' English word made up of only the letters that had
been chosen. The contestant with the longest valid word wins the
round. What defines a valid word is usually specific to the version of
the show. Some examples are the use the OED
or the Macquarie dictionary
coupled together with rules related to proper nouns, plurals and
combination words. The Letters round is essentially an anagram
solving challenge.
The Letters game can be generally defined as: Given a canonical set of
substrings of varying length called C, generated from the alphabet A,
and a subset of not necessarily unique elements derived from A called
D, find the longest substring in the set C that is also in the set of
the 2|D| unique combinations generated from the set D.
A typical process for solving the Letters problem is as follows:
-
Step 0 - A list of unique valid words is specified (eg: OED or Word List)
- Step 0.0 - The longest word length from the list of words is noted as being L
- Step 0.1 - Each word is normalised to a common case. (eg: all lower case)
- Step 1 - For every word wi, generate the key ki by sorting the letters of the word. (eg: For the word 'english', the key 'eghilns' is derived.)
- Step 2 - Insert each key/word pair into an associative map M (eg: hash-table)
-
Step 3 - A list of potentially non-unique letters of length N is specified comprised of both consonants and vowels
- Step 3.0 - Lexicographically order this list of letters
-
Step 4 - For every N-choose-I combinations over the list of N letters, where I starts from min(L,N) and tends to 1:
- Step 4.0 - Test the current combination x as a key in M
- Step 4.1 - If x exists in M, add all the words associated with x into a solution list S,
- Step 4.2 - Otherwise continue on with the next combinations and values of I
- Step 4.3 - Once all the combinations for the current I have been enumerated present the solution list S - (if it is not empty)
int main()
{
std::string word_list_file_name = "word_list.txt";
typedef std::unordered_multimap<std::string,std::string> word_map_t;
typedef std::pair<word_map_t::iterator,word_map_t::iterator> word_map_range_t;
word_map_t word_map;
std::size_t longest_word = 0;
strtk::for_each_line(word_list_file_name,
[&](const std::string& word)
{
if (word.empty()) return;
strtk::remove_leading_trailing(" \t\n\r",word);
strtk::convert_to_lowercase(word);
std::string key = word;
strtk::sort(key);
word_map.insert(std::make_pair(key,word));
longest_word = std::max(longest_word,word.size());
});
std::string letters;
for (;;)
{
std::cout << "Enter letters: ";
if(!std::getline(std::cin,letters))
break;
static const std::string illegal_chars = strtk::ext_string::all_chars() -
strtk::ext_string::all_letters();
strtk::multiple_char_delimiter_predicate pred(illegal_chars);
strtk::remove_inplace(pred,letters);
if (letters.empty()) break;
strtk::convert_to_lowercase(letters);
strtk::sort(letters);
const std::size_t upper_bound = std::min(longest_word,letters.size());
std::unordered_set<std::string> solution_list;
for (std::size_t i = upper_bound; i > 0; --i)
{
typedef std::string::iterator str_itr_t;
strtk::for_each_combination(letters.begin(),letters.end(),
i,
[&](str_itr_t begin, str_itr_t end)
{
std::string key(begin,end);
word_map_range_t itr_range = word_map.equal_range(key);
if (0 == strtk::distance(itr_range)) return;
auto itr = itr_range.first;
while (itr_range.second != itr)
{
solution_list.insert(itr->second);
++itr;
}
});
if (!solution_list.empty())
{
std::copy(solution_list.begin(),
solution_list.end(),
std::ostream_iterator<std::string>(std::cout,"\n"));
break;
}
}
}
return 0;
}
Notes On The Letters Round Solution
The time complexity of the given solution is O(2min(L,N)), which
is quite large. What makes the solution practical, is the fact that human
languages tend to have short common words, with an upper range of about
10-12 characters (excluding names et al and of course Pneumonoultramicroscopicsilicovolcanoconiosis).
So for example an N of 10 (assuming L is adequately large), will only
amount to a total of 1024 combinations - which can be trivially enumerated.
The problem space starts to become daunting at around N of 64 and larger.
At which point a constant multiplier can be applied by distributing the
enumeration process and performing said computations concurrently. Note
this will not reduce the overall complexity of the solution, just
the time it will take to complete, furthermore today this technique may
only be practical for values of N less than 85.
One final note, the above process will not only provide the first
solution it encounters, it will return all possible solutions for
largest encountered length combination.
Past Letters Round Games
The following is a short list of the Letters round games played on countdown
during the 2010 season.
TSDTOEIRO DELAWERAD FSGTIOAOV LESADEPIL YTAINROTD LLBUEISUT GEMUHRONA TRNSEAENI
BPSQEAEVN GOXERAMJA NTRTIOAEP FLUENTLIP SCRSIEONU HPAVSOEDA WKCJROEAM IGTYEARTN
DSOEXDALM SMNRIEIUG EFCAORTNA HGRNEOITE RGPNIEUQA TREULDEOF SNRTEUIDA TRUCEENDS
HTNXIUOAD TREULDEOF SNRTEUIDA TRUCEENDS HTNXIUOAD SWDEIUHBE NWCKPAIAE NYDAOUTAU
SGLAEOMVI GLQGEIODV RTLEOESTI JCRRUAEOS LSDSEAEFO TRMAEASRI NFRNOIAES BROADMOOR
Performance Comparisons
The following are tables of results generated by running the
strtk_tokenizer_cmp test. Currently it covers simple
comparisons between Boost String Algorithms, Boost lexical_cast, The
Standard Library, Spirit (Karma Qi) and StrTk in the following areas:
- Tokenization
- Splitting
- Integer To String
- String To Integer
- String To Double
Scenario 0 - MSVC 2010 (64-bit, O2, Ot, GL and PGO)
| Source | Test | Size | Time(sec) | Rate | % from Baseline |
Boost | Tokenizer | 24000000 | 8.5857sec | 2795359.4074tks/sec | 100.0%, 100.0% |
StrTk | Tokenizer | 24000000 | 3.5019sec | 6853393.1186tks/sec | 40.7%, 245.1% |
Boost | Split | 9600000 | 5.5414sec | 1732414.5137tks/sec | 100.0%, 100.0% |
StrTk | Split | 9600000 | 0.8218sec | 11681814.9167tks/sec | 14.8%, 674.3% |
sprintf | Integer To String | 80000000 | 35.8128sec | 2233840.0564nums/sec | 100.0%, 100.0% |
Boost | Integer To String | 80000000 | 19.3994sec | 4123832.0477nums/sec | 54.1%, 184.6% |
Karma | Integer To String | 80000000 | 6.2528sec | 12794349.6524nums/sec | 17.4%, 572.7% |
StrTk | Integer To String | 80000000 | 1.5664sec | 51071439.9822nums/sec | 4.3%, 2286.2% |
atoi | String To Integer | 88500000 | 5.1802sec | 17084370.4936nums/sec | 100.0%, 100.0% |
Boost | String To Integer | 88500000 | 119.6261sec | 739805.3712nums/sec | 2309.2%, 4.3% |
Qi | String To Integer | 88500000 | 2.1951sec | 40317238.6629nums/sec | 42.3%, 235.9% |
StrTk | String To Integer | 88500000 | 1.8181sec | 48677773.5466nums/sec | 35.0%, 284.9% |
atof | String To Double | 30650000 | 15.2306sec | 2012396.7122nums/sec | 100.0%, 100.0% |
Boost | String To Double | 30650000 | 52.9244sec | 579127.8866nums/sec | 347.4%, 28.7% |
Qi | String To Double | 30650000 | 2.8665sec | 10692313.5853nums/sec | 18.8%, 531.3% |
StrTk | String To Double | 30650000 | 1.6069sec | 19073975.7679nums/sec | 10.5%, 947.8% |

Scenario 1 - MSVC 2010 (O2, Ot, GL and PGO)
| Source | Test | Size | Time(sec) | Rate | % from Baseline |
Boost | Tokenizer | 24000000 | 9.4715sec | 2533910.4769tks/sec | 100.0%, 100.0% |
StrTk | Tokenizer | 24000000 | 2.8889sec | 8307786.9292tks/sec | 30.5%, 327.8% |
Boost | Split | 9600000 | 7.2291sec | 1327965.9706tks/sec | 100.0%, 100.0% |
StrTk | Split | 9600000 | 1.1301sec | 8494610.9664tks/sec | 15.6%, 639.6% |
sprintf | Integer To String | 80000000 | 38.2576sec | 2091088.8038nums/sec | 100.0%, 100.0% |
Boost | Integer To String | 80000000 | 28.9931sec | 2759277.4769nums/sec | 75.7%, 131.9% |
Karma | Integer To String | 80000000 | 4.9173sec | 16269254.0190nums/sec | 12.8%, 778.0% |
StrTk | Integer To String | 80000000 | 1.8270sec | 43786838.0279nums/sec | 4.7%, 2093.9% |
atoi | String To Integer | 88500000 | 6.0076sec | 14731435.8942nums/sec | 100.0%, 100.0% |
Boost | String To Integer | 88500000 | 185.4955sec | 477100.6474nums/sec | 3087.0%, 3.2% |
Qi | String To Integer | 88500000 | 2.5060sec | 35314785.8370nums/sec | 41.7%, 239.7% |
StrTk | String To Integer | 88500000 | 2.2095sec | 40054213.0736nums/sec | 36.7%, 271.8% |
atof | String To Double | 30650000 | 17.6435sec | 1737179.9302nums/sec | 100.0%, 100.0% |
Boost | String To Double | 30650000 | 78.6528sec | 389687.3997nums/sec | 445.7%, 22.4% |
Qi | String To Double | 30650000 | 3.8034sec | 8058494.1994nums/sec | 21.5%, 463.8% |
StrTk | String To Double | 30650000 | 2.0450sec | 14987780.2310nums/sec | 11.5%, 862.7% |

Scenario 2 - MSVC 2008 SP1 (O2, Ot, GL and PGO)
| Source | Test | Size | Time(sec) | Rate | % from Baseline |
Boost | Tokenizer | 24000000 | 9.6533sec | 2486184.8282tks/sec | 100.0%, 100.0% |
StrTk | Tokenizer | 24000000 | 3.4748sec | 6906943.9529tks/sec | 35.9%, 277.8% |
Boost | Split | 9600000 | 10.2600sec | 935674.7490tks/sec | 100.0%, 100.0% |
StrTk | Split | 9600000 | 1.3793sec | 6959830.0652tks/sec | 13.4%, 743.8% |
sprintf | Integer To String | 80000000 | 24.6427sec | 3246397.8287nums/sec | 100.0%, 100.0% |
Boost | Integer To String | 80000000 | 27.5865sec | 2899968.5753nums/sec | 111.9%, 89.3% |
Karma | Integer To String | 80000000 | 5.4864sec | 14581630.6963nums/sec | 22.2%, 449.1% |
StrTk | Integer To String | 80000000 | 2.4224sec | 33025441.1256nums/sec | 9.8%, 1017.2% |
atoi | String To Integer | 88500000 | 5.9297sec | 14924814.8683nums/sec | 100.0%, 100.0% |
Boost | String To Integer | 88500000 | 186.1372sec | 475455.6660nums/sec | 3139.0%, 3.1% |
Qi | String To Integer | 88500000 | 2.0874sec | 42397446.1804nums/sec | 35.2%, 284.0% |
StrTk | String To Integer | 88500000 | 2.0485sec | 43202160.1371nums/sec | 34.5%, 289.4% |
atof | String To Double | 30650000 | 18.0458sec | 1698455.0767nums/sec | 100.0%, 100.0% |
Boost | String To Double | 30650000 | 77.4527sec | 395725.4111nums/sec | 429.2%, 23.2% |
Qi | String To Double | 30650000 | 3.9631sec | 7733881.1294nums/sec | 21.9%, 455.3% |
StrTk | String To Double | 30650000 | 2.0723sec | 14790236.0804nums/sec | 11.4%, 870.8% |

Scenario 3 - Intel C++ v11.1.060 IA-32 (O2, Ot, Qipo, QxHost and PGO)
| Source | Test | Size | Time(sec) | Rate | % from Baseline |
Boost | Tokenizer | 24000000 | 10.0096sec | 2397697.7836tks/sec | 100.0%, 100.0% |
StrTk | Tokenizer | 24000000 | 3.1837sec | 7538416.8541tks/sec | 31.8%, 314.4% |
Boost | Split | 9600000 | 9.5450sec | 1005760.0310tks/sec | 100.0%, 100.0% |
StrTk | Split | 9600000 | 1.4292sec | 6716893.1359tks/sec | 14.9%, 667.8% |
sprintf | Integer To String | 80000000 | 23.8979sec | 3347577.5824nums/sec | 100.0%, 100.0% |
Boost | Integer To String | 80000000 | 27.5618sec | 2902565.2045nums/sec | 115.3%, 86.7% |
Karma | Integer To String | 80000000 | 4.6600sec | 17167208.7654nums/sec | 19.4%, 512.8% |
StrTk | Integer To String | 80000000 | 2.8450sec | 28119857.2736nums/sec | 11.9%, 840.0% |
atoi | String To Integer | 88500000 | 5.9386sec | 14902610.8922nums/sec | 100.0%, 100.0% |
Boost | String To Integer | 88500000 | 180.5856sec | 490072.4001nums/sec | 3040.8%, 3.2% |
Qi | String To Integer | 88500000 | 2.5273sec | 35017073.8639nums/sec | 42.5%, 234.9% |
StrTk | String To Integer | 88500000 | 1.8718sec | 47281492.1287nums/sec | 31.5%, 317.2% |
atof | String To Double | 30650000 | 18.4357sec | 1662538.0810nums/sec | 100.0%, 100.0% |
Boost | String To Double | 30650000 | 78.1543sec | 392172.9598nums/sec | 423.9%, 23.5% |
Qi | String To Double | 30650000 | 2.8321sec | 10822353.0510nums/sec | 15.3%, 650.9% |
StrTk | String To Double | 30650000 | 2.2930sec | 13366541.5515nums/sec | 12.4%, 803.9% |

Scenario 4 - GCC 4.5 (O3, PGO)
| Source | Test | Size | Time(sec) | Rate | % from Baseline |
Boost | Tokenizer | 24000000 | 9.2510sec | 2594305.4347tks/sec | 100.0%, 100.0% |
StrTk | Tokenizer | 24000000 | 3.9717sec | 6042688.5734tks/sec | 42.9%, 232.9% |
Boost | Split | 9600000 | 5.0640sec | 1895728.2331tks/sec | 100.0%, 100.0% |
StrTk | Split | 9600000 | 1.5411sec | 6229231.8384tks/sec | 30.4%, 328.5% |
sprintf | Integer To String | 80000000 | 14.7807sec | 5412477.0993nums/sec | 100.0%, 100.0% |
Boost | Integer To String | 80000000 | 19.1131sec | 4185620.7707nums/sec | 129.3%, 77.3% |
Karma | Integer To String | 80000000 | 6.4455sec | 12411808.2841nums/sec | 43.6%, 229.3% |
StrTk | Integer To String | 80000000 | 4.5174sec | 17709364.5349nums/sec | 30.5%, 327.1% |
atoi | String To Integer | 88500000 | 5.2139sec | 16973721.6103nums/sec | 100.0%, 100.0% |
Boost | String To Integer | 88500000 | 50.5326sec | 1751344.8498nums/sec | 969.1%, 10.3% |
Qi | String To Integer | 88500000 | 1.9694sec | 44937612.8835nums/sec | 37.7%, 264.7% |
StrTk | String To Integer | 88500000 | 1.9008sec | 46558706.5833nums/sec | 36.4%, 274.2% |
atof | String To Double | 30650000 | 6.6975sec | 4576328.3036nums/sec | 100.0%, 100.0% |
Boost | String To Double | 30650000 | 29.6375sec | 1034162.2422nums/sec | 442.5%, 22.5% |
Qi | String To Double | 30650000 | 2.9852sec | 10267435.7138nums/sec | 44.5%, 224.3% |
StrTk | String To Double | 30650000 | 1.5961sec | 19202937.1409nums/sec | 23.8%, 419.6% |

Scenario 5 - GCC 4.5 (O3, PGO) Intel Atom N450
| Source | Test | Size | Time(sec) | Rate | % from Baseline |
Boost | Tokenizer | 24000000 | 29.1370sec | 823695.4389tks/sec | 100.0%, 100.0% |
StrTk | Tokenizer | 24000000 | 12.3607sec | 1941644.0499tks/sec | 42.4%, 235.7% |
Boost | Split | 9600000 | 16.5261sec | 580899.9726tks/sec | 100.0%, 100.0% |
StrTk | Split | 9600000 | 4.9102sec | 1955110.2611tks/sec | 29.7%, 336.5% |
sprintf | Integer To String | 80000000 | 50.3456sec | 1589015.6118nums/sec | 100.0%, 100.0% |
Boost | Integer To String | 80000000 | 91.1475sec | 877698.1401nums/sec | 181.0%, 55.2% |
Karma | Integer To String | 80000000 | 21.8904sec | 3654568.8712nums/sec | 43.4%, 229.9% |
StrTk | Integer To String | 80000000 | 12.1877sec | 6564009.9274nums/sec | 24.2%, 413.0% |
atoi | String To Integer | 88500000 | 17.6615sec | 5010896.5768nums/sec | 100.0%, 100.0% |
Boost | String To Integer | 88500000 | 191.9446sec | 461070.5357nums/sec | 1086.7%, 9.2% |
Qi | String To Integer | 88500000 | 6.2808sec | 14090561.7119nums/sec | 35.5%, 281.1% |
StrTk | String To Integer | 88500000 | 6.1552sec | 14378086.8208nums/sec | 34.8%, 286.9% |
atof | String To Double | 30650000 | 21.4865sec | 1426474.1027nums/sec | 100.0%, 100.0% |
Boost | String To Double | 30650000 | 139.8166sec | 219215.7409nums/sec | 650.7%, 15.3% |
Qi | String To Double | 30650000 | 11.3916sec | 2690567.9223nums/sec | 53.0%, 188.6% |
StrTk | String To Double | 30650000 | 6.4396sec | 4759608.7027nums/sec | 29.9%, 333.6% |

Scenario 6 - GCC 4.5 (O3, PGO) Intel Xeon E5540
| Source | Test | Size | Time(sec) | Rate | % from Baseline |
Boost | Tokenizer | 24000000 | 7.5657sec | 3172216.8787tks/sec | 100.0%, 100.0% |
StrTk | Tokenizer | 24000000 | 2.7379sec | 8765832.8290tks/sec | 36.1%, 276.3% |
Boost | Split | 9600000 | 3.0706sec | 3126386.1126tks/sec | 100.0%, 100.0% |
StrTk | Split | 9600000 | 1.1279sec | 8511136.2899tks/sec | 36.7%, 272.2% |
sprintf | Integer To String | 80000000 | 10.9012sec | 7338642.9638nums/sec | 100.0%, 100.0% |
Boost | Integer To String | 80000000 | 12.3317sec | 6487328.7872nums/sec | 113.1%, 88.3% |
Karma | Integer To String | 80000000 | 3.7202sec | 21504260.6660nums/sec | 34.1%, 293.0% |
StrTk | Integer To String | 80000000 | 2.5183sec | 31768042.4612nums/sec | 23.1%, 432.8% |
atoi | String To Integer | 88500000 | 4.0087sec | 22077037.6357nums/sec | 100.0%, 100.0% |
Boost | String To Integer | 88500000 | 30.3659sec | 2914454.4393nums/sec | 757.4%, 13.2% |
Qi | String To Integer | 88500000 | 1.7976sec | 49231871.5454nums/sec | 43.3%, 223.0% |
StrTk | String To Integer | 88500000 | 1.7384sec | 50908881.7303nums/sec | 43.3%, 230.5% |
atof | String To Double | 30650000 | 5.2118sec | 5880843.9328nums/sec | 100.0%, 100.0% |
Boost | String To Double | 30650000 | 21.5546sec | 1421966.9538nums/sec | 413.5%, 24.1% |
Qi | String To Double | 30650000 | 3.2149sec | 9533840.3118nums/sec | 61.6%, 162.1% |
StrTk | String To Double | 30650000 | 1.3929sec | 22003661.2944nums/sec | 26.7%, 374.1% |

Scenario 7 - GCC 4.5 (O3, PGO) Intel Xeon X5650
| Source | Test | Size | Time(sec) | Rate | % from Baseline |
Boost | Tokenizer | 24000000 | 4.1944sec | 5721901.2924tks/sec | 100.00%, 100.00% |
StrTk | Tokenizer | 24000000 | 2.5087sec | 9566860.3956tks/sec | 59.81%, 167.19% |
Boost | Split | 9600000 | 2.9104sec | 3298520.2014tks/sec | 100.00%, 100.00% |
StrTk | Split | 9600000 | 1.1105sec | 8644949.2334tks/sec | 38.15%, 262.08% |
sprintf | Integer To String | 80000000 | 9.7148sec | 8234840.3537nums/sec | 100.00%, 100.00% |
Boost | Integer To String | 80000000 | 11.5726sec | 6912860.7120nums/sec | 119.12%, 83.94% |
Karma | Integer To String | 80000000 | 4.0832sec | 19592620.4395nums/sec | 42.03%, 237.92% |
StrTk | Integer To String | 80000000 | 2.2204sec | 36029641.5861nums/sec | 22.85%, 437.52% |
atoi | String To Integer | 88500000 | 3.1028sec | 28522836.1561nums/sec | 100.00%, 100.00% |
Boost | String To Integer | 88500000 | 6.1270sec | 14444145.2249nums/sec | 197.46%, 50.64% |
Qi | String To Integer | 88500000 | 1.5313sec | 57794031.2153nums/sec | 49.35%, 202.62% |
StrTk | String To Integer | 88500000 | 1.4409sec | 61417814.6362nums/sec | 46.43%, 215.32% |
atof | String To Double | 42880000 | 6.1342sec | 6990292.6549nums/sec | 100.00%, 100.00% |
Boost | String To Double | 42880000 | 28.9461sec | 1481374.9232nums/sec | 772.37%, 21.19% |
Qi | String To Double | 42880000 | 3.6549sec | 11732137.3557nums/sec | 3644.68%, 167.83% |
StrTk | String To Double | 42880000 | 1.3860sec | 30937683.0792nums/sec | 460.19%, 442.58% |

Note 1: The tests are compiled with specific optimisation
flags to produce the best possible results for the respective
compilers and architectures. Furthermore the tests are run
natively (no virtualizations were used) on an almost completely idle
machine so as to reduce interference from background processes. The
Boost version used was 1.48. Furthermore the standard libraries
including libc were rebuilt for the linux system based tests, using
architecture specific flags and optimizations. The following is a
table mapping the scenarios to their respective architectures:
| Scenario | Architecture |
| 0 | ThinkPad W510 (64-Bit Intel Quad Core Extreme i7-920XM 2.0GHz, 16GB RAM, Windows 7) |
| 1-3 | ThinkPad x61 (32-Bit Intel Core 2 Duo 2.4GHz, 2GB RAM, Windows 7) |
| 4 | ThinkPad x61 (32-Bit Intel Core 2 Duo 2.4GHz, 2GB RAM, Ubuntu 10.10) |
| 5 | Acer Aspire One (32-Bit Intel Atom N450 1.6Ghz, 1GB RAM, Ubuntu 10.10) |
| 6 | HP Proliant DL380G6 (64-Bit Intel Xeon E5540 2.5GHz, 8GB RAM, Ubuntu 10.10) |
| 7 | Custom box (64-Bit Intel Xeon X5650 2.66GHz, 32GB RAM, Ubuntu 10.10) |
Note 2:
The percentages in the final column represent the percentage of the
current row versus the baseline in total running time and rate
respectively. For the first percentage the lower the value the better
and for the second percentage the higher the value the better. The
baseline used for a specific combination of tests is defined in the
following table:
| Test Combination | Baseline |
| Boost, StrTk | Boost |
| Boost, StdLib/STL, Spirit, StrTk | StdLib/STL |
| StdLib/STL, Spirit, StrTk | StdLib/STL |
Note 3:
The test sizes are set such that no single run will result in a
running time less than one second. This is done so as to ensure
that runs-per-second results are not deemed to have been projected.
In the future these sizes may need to be revisited once 3.5+GHz CPU
speeds become more commonplace. Furthermore the charts represent the
rate of operation over a one second interval - In short, the larger
the rate the better.
Note 4:
The binaries used for the above performance tests can be downloaded from
here
Note 5:
It would be great to have comparisons for other architectures. If you can
provide access to shell accounts with GCC 4.5+ or Clang/LLVM 2.0+ for the
following architectures: UltraSPARC T2 Plus, SPARC64 VII, POWER6/7, please
feel free to get in contact.
StrTk Library Dependency
StrTk makes use of the Boost library for its
boost::lexical_cast routine for types other than
PODs, and its TR1 compliant Random and Regex libraries. These
dependencies are not compulsory and can be easily removed simply
by defining the preprocessor: strtk_no_tr1_or_boost. That
said Boost is an integral part of modern C++ programming, and
having it around is as beneficial as having access to the STL,
hence it is recommended that it be installed. For Visual Studio
users, BoostPro provides a free and easy to use
installer for the latest Boost libraries that can be obtained
from Here. For Linux users, mainstream
distributions such as Ubuntu and Red-Hat(Fedora) provide easy
installation of the Boost libraries via their respective package
management systems. For more information please consult the
readme.txt found in the StrTk distribution.
Compiler Support
The following is a listing of the various compilers that StrTk can be
built with error and warning free.
- GCC - verions 3.1+
- Intel C++ Compiler - versions 8+
- Clang/LLVM - versions 1.0+
- MSVC - versions 7.1+
- Comeau C/C++ - versions 4.3+
Note: Versions of compilers prior to the ones denoted above
"should" compile, however they may require a very lenient
warning/error level be set during compilation.
Conclusion
StrTk was designed with performance and efficiency as its sole primary
principles, and as such some of the available interfaces may not be as
user-friendly as they should - however that said, the gains
made in other areas hopefully will compensate for any perceived
difficulties. Like most things there is a trade-off between
performance and usability with the above mentioned tokenizers and
parsing methods. The original aim was to provide an interface similar
to that of the Boost Tokenizer and Split routines, but to also avail
the developer simplifications that will hopefully provide them more
flexibility and efficiency in the long run. That said, tokenizing a
string isn't the most fascinating problem one could tackle but it does
have its interesting points when one has a few TB of data to process,
doing it properly could mean the difference between finishing a simple
data processing job today or next month.