Click here to Skip to main content
Click here to Skip to main content

C++ String Toolkit (StrTk) Tokenizer

, 7 Jan 2014 CPL
Rate this:
Please Sign up or sign in to vote.
A brief introduction to a C++ String Tokenizer implementation.

Introduction

This article will present the tokenizing and splitting functionality of a simple C++ library called the String Toolkit. Tokenization in the context of string processing, is the method by which a sequence of elements are broken up or fragmented into sub-sequences called tokens. The indices in the original sequence that determine such breaks in the sequence are known as delimiters.

There are two types of delimiters, normal or thin delimiters which are of length one element and thick delimiters which are of length two or more elements. Even though tokenization is primarily used in conjunction with strings, any sequence of types that can be iterated in a linear fashion can be tokenized, examples may be list of integers, a vector of person classes or a map of strings.

Another Tokenizer?

To date there have been many attempts to define a "standard" Tokenizer implementation in C++. Of them all the likely candidate might be the implementation in the Boost library. Regardless proposed implementations should to some extent consider one or more of the following areas: over-all usage patterns, constructions, generality (or is it genericty these days?) of design, performance-efficiency.

1. Overall Usage Patterns

This requirement is concerned with how easy it is to instantiate the tokenizer and integrate it into existing processing patterns, which most often than not requires integration with C++ STL algorithms and containers. A tokenzier by definition would be classed as a producer, so the question becomes how easy is it for others to consume its output? Another issue is consistency of the definition of a token in the situation where one has consecutive delimiters but is not compressing them - can or should there be such a thing as an empty token? and what do preceding and trailing delimiters mean? and when should they be included as part of the token?

2. Constructions

In the context of string tokenizing, the majority of implementations return the token as a new instance of a string. This process requires a string to be created on heap, populated by the sub-string in question from the original sequence, then returned back (some of this may be alleviated by Return Value Optimization RVO). In the case of iterators this is essentially another copy to the caller. Furthermore two kinds of tokens can make this situation worse, they are primarily a large sequence made up of lots of very short tokens or a large sequence made up of lots of very large tokens. The solution is not to return the token as a string but rather as a range and allow the caller to decide how they wish to inspect and further manipulate the token.

StrTk Token Iterators - Copyright Arash Partow

This minor change in interface design provides a great deal of flexibility and performance gain.

3. Generality(Genericity) of design

Most tokenizer implementations concern themselves only with strings, which for the most part is ok, because most things that need tokenizing are strings. However there will be times when one has a sequence of types that may need to be tokenized that aren't strings, hence a tokenizer should be designed in such a way to enable such features, moreover it becomes clear that the most basic of tokenizing algorithms are invariant to the type of the delimiter.

4. Performance and Efficiency

Tokenizing strings range from low frequency inputs such as user input or parsing of simple configuration files to more complex situations such as tokenizing of HTML pages that a web crawler/indexer might do, to parsing of large market-data streams in FIX format. Performance is generally important and can usually be helped along with good usage patterns that encourage reuse of instances, minimal preprocessing and allow for user supplied predicates for the more nasty areas of the process. It should be noted that everything in the proceeding article can be done by purely using the STL/IOStream libraries - that said, C++'s ability to allow one to skin the proverbial cat in numerous way gives rise to novel solutions that are for the most part not of any practical use other than to showcase ones abilities in using the STL/IOStreams.

Getting started

The String Toolkit Library (StrTk) provides two common tokenization concepts, a split function and a token iterator. Both these concepts require the user to provide a delimiter predicate and an iterator range over which the process will be carried out.

The tokenization process can be further parametrized by switching between "compressed delimiters" or "no compressed delimiters" mode. This essentially means that consecutive delimiters will be compressed down to one and treated as such. Below are two tables depicting the expected tokens from various types of input. The tables represent no compressed and compressed tokenization processes respectively. The delimiter in this instance is a pipe symbol | and <> denotes an empty token.

No Compressed Delimiters

InputToken List
aa
a|ba,b
a||ba,<>,b
|a<>,a
a|a,<>
|a||b<>,a,<>,b
||a||b||<>,<>,a,<>,b,<>,<>
|<>,<>
||<>,<>,<>
|||<>,<>,<>,<>

Compressed Delimiters

InputToken List
aa
a||ba,b
|a||b<>,a,b
||a||b||<>,a,b,<>
|<>,<>
||<>,<>
|||<>,<>

Delimiters

Two forms of delimiters are supported and they are single delimiter predicate and multiple delimiters predicate otherwise known as SDP and MDP respectively. Essentially an SDP is where there is only one type that can break or fragment the sequence, where as with MDPs there is more than one unique type that can break the sequence. It is possible to represent a SDP using the MDP, however from a performance POV having separate predicates is far more efficient. Additionally for strings based on char or unsigned char (8-bit versions) there is a MDP that has a look-up complexity of O(1) making it greatly more efficient than the basic MDP.

Single Delimiter Predicate

Instantiation requires specialization of type and construction requires an instance of the type.

strtk::single_delimiter_predicate<typename T>(const T& t)

strtk::single_delimiter_predicate<std::string::value_type> predicate('|');

Multiple Delimiter Predicate

Instantiation requires specialization of type and construction requires a sequence of potential delimiters through a range described by iterators.

strtk::multiple_delimiter_predicate<typename T>(Iterator begin, Iterator end);

std::string str_delimiters = " ,.;:<>'[]{}()_?/'`~!@#$%^&*|-_\"=+";
strtk::multiple_delimiter_predicate<char> mdp1(str_delimiters.begin(),str_delimiters.end());

unsigned int uint_delimiters[5] = {1,10,101,1010,10101};
strtk::multiple_delimiter_predicate<unsigned int> mdp2(uint_delimiters,uint_delimiters + 5);

As previously mentioned tokenization of data need not be limited to strings comprised of chars, but can also be extended to other PODs or complex types. In the above example a predicate used for tokenizing a sequence of unsigned ints is being defined.

Multiple Char Delimiter Predicate

Instantiation requires an iterator range based on either unsigned char or char. This delimiter is more efficient than the simple MDP as it has a predicate evaluation of O(1) due to the use of a lookup-table as opposed to O(n) where n is the number of delimiters. Performance increase is seen as more delimiters are used.

strtk::multiple_char_delimiter_predicate(const std::string&)
strtk::multiple_char_delimiter_predicate(const std::string::const_iterator begin,const std::string::const_iterator end)

strtk::multiple_char_delimiter_predicate predicate(" .;?");

The delimiter concept can be extended to the point where the predicate itself can act as a state machine transitioning from state to state based on conditions and rules related to the current symbol being processed. This simple extension can lead to some very interesting parsing capabilities.

Split

This is a function that performs tokenization over an entire sequence in one go. strtk::split takes a sequence through a range of iterators or in the case of a string through a reference to its instance, a delimiter predicate and an output iterator or otherwise known as a type sink. It populates the output iterator with the tokens it extracts. The tokens in this case are std::pairs of iterators for the sequence.

StrTk Split Routine - Copyright Arash Partow

Split can be used in a "simple - no frills" manner by simply passing the necessary parameters:

std::string str = "abc|123|xyz|789";
strtk::std_string::token_list_type token_list;
strtk::split(" |.;?",str,std::back_inserter(token_list));

strtk::split can also be used in a more explicit manner whereby the exact type of delimiter predicate can be specified by the user:

   //split using strtk predicates
   {
      std::string str = "abc|123|xyz|789";
      strtk::std_string::token_list_type token_list;

      strtk::single_delimiter_predicate<std::string::value_type> predicate('|');
      strtk::split(predicate,str,std::back_inserter(token_list));
   }

   //split using a lambda as a predicate
   {
      std::string data = "abc|123|xyz|789";
      std::deque<std::string> token_list;
      strtk::split([](const char c)
                   {
                      return '|' == c;
                   },
                   data,
                   strtk::range_to_type_back_inserter(token_list));
   }

strtk::split provides an additional usage option that allows the user to specify if they would like to either compress the delimiters and whether or not they would like to include the delimiter as part of the token range. This enum parameter is called strtk::split_options and has the following values:

Split OptionDefinition
strtk::split_options::default_modeDefault options
strtk::split_options::compress_delimitersConsecutive delimiters are treated as one
strtk::split_options::include_1st_delimiterThe first delimiter is included in the resulting token range
strtk::split_options::include_delimitersAll delimiters are included in the resulting token range

The simple example below demonstrates a split that will occur over a string given a predicate where the provided split options indicate that consecutive delimiters will be treated as one and also all delimiters encountered after each token will also be included in the token up until the next token begins.

strtk::split(predicate,
             str,
             std::back_inserter(token_list),
             strtk::split_options::compress_delimiters | strtk::split_options::include_delimiters);

Another way of using split may be to use the strtk::multiple_char_delimiter_predicate as follows:

std::string str = "abc?123;xyz.789";
strtk::std_string::token_list_type token_list;
strtk::multiple_char_delimiter_predicate predicate(" .;?");
strtk::split(predicate,str,std::back_inserter(token_list));

The contents of the token_list can be printed out as follows:

strtk::std_string::token_list_type::iterator itr = token_list.begin();
while (token_list.end() != itr)
{
   std::cout << (*itr) << '\t';
   ++itr;
}

Split N-Tokens

A natural extension of strtk::split is strtk::split_n. This function provides the ability to tokenize a sequence up until a specific number of tokens have been encountered or until there are no more tokens left. The return value of the strtk::split_n would be the number of tokens encountered.

std::string data = "token1?token2,token3;token4,token5";
strtk::std_string::token_list_type token_list;
const std::size_t token_count = 4;
const std::string delimiters (" ,.;?");
strtk::split_n(delimiters,
               data,
               token_count,
               std::back_inserter(token_list));
strtk::std_string::token_list_type::iterator itr = token_list.begin();
while (token_list.end() != itr)
{
   std::cout << "[" << (*itr) << "]\t";
   ++itr;
}
std::cout << std::endl;

Offset Splitter

Another simple variant is the strtk::offset_splitter. This form of split takes a series of offsets and an iterator range or string and determines the tokens based on the offsets. This function can be set to perform a single pass of the offsets or to rotate them until the range has been completely traversed. The example below demonstrates how a string representing date and time can be tokenized into its constituent parts (year, month, day, hour, minutes ,seconds,milliseconds)

std::string time_data = "20000101091011345"; //2000-01-01 9:10:11sec 345ms
const std::size_t offset_list_size = 7;
const int offset_list[offset_list_size] = {
                                            4, // year
                                            2, // month
                                            2, // day
                                            2, // hour
                                            2, // minute
                                            2, // second
                                            3  // ms
                                          };
const strtk::offset_predicate<offset_list_size> predicate(offset_list);
strtk::std_string::token_list_type token_list;
strtk::offset_splitter(time_data,predicate,std::back_inserter(token_list));

strtk::std_string::token_list_type::iterator itr = token_list.begin();
while (token_list.end() != itr)
{
   std::cout << "[" << (*itr) << "] ";
   ++itr;
}
std::cout << std::endl;

Split Regex

Another form of splitter is based on the concept of using regular expressions as the delimiter predicate. Below is a simple example of splitting tokens wrapped in round-brackets.

std::string str = "(12)(345)(6789)(0ijkx)(yz)";
std::list<std::string> token_list;
strtk::split_regex("\\(.*?\\)",
                   s,
                   std::back_inserter(token_list),
                   strtk::regex_match_mode::match_1);

std::list<std::string>::iterator itr = token_list.begin();
while (token_list.end() != itr)
{
   std::cout << "[" << (*itr) << "]\t";
   ++itr;
}
std::cout << std::endl;
</std::string>

Note: The parameter regex_match_mode represents the capture of the marked sub-pattern in the current match. By default it is strtk::regex_match_mode::match_all which in this case would provide the entire match including the bounding pattern, eg: Token3 would be (0ijkx). However in the above example we are only interested in the sub-pattern between the round-brackets, hence strtk::regex_match_mode::match_1 is used resulting in Token3 being 0ijkx.

The following examples demonstrate the use of strtk::split_regex and strtk::split_regex_n routines in extracting specific types of data - in this case the POD types int and double.

int main()
{
   {
      // Extract ints from data string
      std::string data = "a 1^bc,0023| def?gh(4567ijk)-89 10l,m$n-op+123r@st+3u v*w2y56yz+";
      std::deque<int> int_list;
      strtk::split_regex("([+-]?([\\d]+))",
                         data,
                         strtk::range_to_type_back_inserter(int_list),
                         strtk::regex_match_mode::match_1);
   }

   {
      // Extract doubles from data string
      std::string data = "ab$c1.1?d-2.2ef#ghi+3.3%(123.456)!&*-7.89E+12@^=";
      std::deque<double> double_list;
      strtk::split_regex(strtk::ieee754_expression,
                         data,
                         strtk::range_to_type_back_inserter(double_list),
                         strtk::regex_match_mode::match_1);
   }

   {
      // Extract the first 3 ints from data string
      std::string data = "a 1^bc,0023| def?gh(4567ijk)-89 10l,m$n-op+123r@st+3u v*w2y56yz+";
      std::deque<int> int_list;
      strtk::split_regex_n("([+-]?([\\d]+))",
                           data,
                           3,
                           strtk::range_to_type_back_inserter(int_list),
                           strtk::regex_match_mode::match_1);
   }

   {
      // Extract the first 4 doubles from data string
      std::string data = "ab$c1.1?d-2.2ef#ghi+3.3%(123.456)!&*-7.89E+12@^=";
      std::deque<double> double_list;
      strtk::split_regex_n(strtk::ieee754_expression,
                           data,
                           4,
                           strtk::range_to_type_back_inserter(double_list),
                           strtk::regex_match_mode::match_1);
   }

   return 0;
}

The following table describes the various regex patterns built into StrTk which can be used seamlessly with the strtk::split_regex and strtk::split_regex_n routines.

RegexDefinition
strtk::uri_expressionURI, URL address e.g.: http://www.example.com, domain.example.net/index.html
strtk::email_expressionE-mail address e.g.: some.one@example.com
strtk::ip_expressionIPv4 address e.g.: 192.168.0.1, 127.0.0.1
strtk::ieee754_expressionFloating point value e.g.: 1.1, 1.234e-123, -1.0001E+10, 0.1234

Tokenizer

The tokenizer is specialized on a sequence iterator and predicate. It is constructed with a range of iterators for the sequence and a reference to the desired predicate. Definitions exist for std::string tokenizers. The tokenizer provides an iteration pattern as a means for accessing the tokens in the sequence.

const unsigned int data_size = 12;
unsigned int data[data_size] = {1,2,3,0,4,5,6,0,7,8,0,9};
strtk::single_delimiter_predicate<unsigned int> predicate(0);
typedef strtk::tokenizer<unsigned int*,strtk::single_delimiter_predicate<unsigned int> > tokenizer_type;
tokenizer_type tokenizer(data,data + data_size,predicate);

Similar to that of strtk::split, strtk::tokenizer provides tokenizing options that are passed in during construction. Below is a table depicting said options:

Tokenize OptionDefinition
strtk::tokenize_options::default_modeDefault options
strtk::tokenize_options::compress_delimitersConsecutive delimiters are treated as one
strtk::tokenize_options::include_1st_delimiterThe first delimiter is included in the resulting token range
strtk::tokenize_options::include_delimitersAll delimiters are included in the resulting token range
typedef strtk::tokenizer<unsigned int*,strtk::single_delimiter_predicate<unsigned int> > tokenizer_type
tokenizer_type tokenizer(data, data + data_size,
                         predicate,
                         strtk::tokenize_options::compress_delimiters |
                         strtk::tokenize_options::include_1st_delimiter);

Furthermore, iteration over tokens using strtk::tokenizer is performed as follows:

tokenizer_type::iterator itr = tokenizer.begin();
while (tokenizer.end() != itr)
{
   std::copy((*itr).first,(*itr).second,std::ostream_iterator<unsigned int>(std::cout," "));
   std::cout << std::endl;
   ++itr;
}

A typical std::string can be tokenized in the following manner:

std::string str = "abc|123|xyz|789";
strtk::std_string::tokenizer<>::type tokenizer(str,"|");
strtk::std_string::tokenizer<>::type::iterator itr = tokenizer.begin();
while (tokenizer.end() != itr)
{
   std::cout << "[" << (*itr) << "]\t";
   ++itr;
}
std::cout << std::endl;

Another common situation may be tokenizing a sequence of strings, such as the following:

const std::string str_list[] = { "abc" , "delimiter" , "ijk" , "delimiter" ,
                                 "mno" , "delimiter" , "rst" , "uvw" ,
                                 "delimiter" , "xyz" };

const std::size_t str_list_size = sizeof(str_list) / sizeof(std::string);

strtk::range_adapter<std::string> range(str_list,str_list_size);
strtk::single_delimiter_predicate<std::string> predicate("delimiter");
typedef strtk::tokenizer<std::string*,strtk::single_delimiter_predicate<std::string> > tokenizer_type;
tokenizer_type tokenizer(range.begin(),range.end(),predicate);
tokenizer_type::iterator itr = tokenizer.begin();
while (tokenizer.end() != itr)
{
   std::copy((*itr).first,(*itr).second,std::ostream_iterator<std::string>(std::cout," "));
   std::cout << std::endl;
   ++itr;
}

Note: For performance and efficient resource management purposes the strtk::tokenizer does not take ownership or make an internal copy of the sequence being tokenized, as such the strtk::tokenizer expects the range to be valid during the entirety of the tokenization process, this is also the case for the specified predicate.

The Parse Routine

Till now the mentioned routines worked specifically with tokens, or in other words ranges of characters. The responsibility of managing the tokens and converting the tokens to user specified types was done manually via "range to type" oriented back inserters and converters. This can be a bit cumbersome and as such StrTk provides a series of helper routines called strtk::parse. Parse takes an std::string representing a tuple of delimited values as input data, a delimiter set, and a series of references to variables that are to be populated with the values from the parsed tokens. The following diagram demonstrates the flow of data, tokens and the corresponding relationships and conversions between each of the parameters.

StrTk Parse Routine - Copyright Arash Partow

Note: strtk::parse returns a boolean value of true upon successful parsing and false for all other results. Situations that cause strtk::parse to fail include:

  • Insufficient number of tokens for the given number of variables
  • Conversion failure from token range to variable type
  • Empty or null token(s)

Some Simple Parse Examples

strtk::parse can take an arbitrary number of variable references. The code below demonstrates the basic usage of strtk::parse taking various number of parameters.

std::string data = "abcde,-1234|567.890#1.1f";
std::string delimiters = ",|#";

std::string var0;
int var1;
double var2;
float var3;

strtk::parse(data,delimiters,var0);

strtk::parse(data,delimiters,var0,var1);

strtk::parse(data,delimiters,var0,var1,var2);

strtk::parse(data,delimiters,var0,var1,var2,var3);

The following examples demonstrate parsing of PODs such as int and double into STL compatible sequences (std::vector, std::deque, std::list, std::set, std::queue, std::stack and std::priority_queue).

// Insert into std::vector
std::string int_string = "1 +2 -3 4 +5 -6 7 +8 -9 10 +11 -12 13 +14 -15";
std::vector<int> int_vector;
strtk::parse(int_string," ",int_vector);

// Insert into std::deque
std::string double_string = "-123.456,789.012,-345.678,901.234,+567.890";
std::deque<double> double_deque;
strtk::parse(double_string,",",double_deque);

// Insert into std::list
std::string data_string = "a,bc,def,ghij,klmno,pqrstu,vwxyz";
std::list<std::string> string_list;
strtk::parse(data_string,",",string_list);

// Insert into std::set
std::string set_string = "a|bc#def|ghij#klmno|pqrstu#vwxyz";
std::set<std::string> string_set;
strtk::parse(set_string,"#|",string_set);

// Insert into std::queue
std::string queue_string = "value1,value2,value3,value4,value5";
std::queue<std::string> string_queue;
strtk::parse(queue_string,",|",string_queue);

// Insert into std::stack
std::string stack_string = "value1|value2,value3|value4,value5";
std::stack<std::string> string_stack;
strtk::parse(stack_string,",|",string_stack);

// Insert into std::priority_queue
std::string priority_queue_string = "value1|value2,value3#value4,value5";
std::priority_queue<std::string> string_priority_queue;
strtk::parse(priority_queue_string,",|#",string_priority_queue);

Similar to what is described above, the following demonstrates parsing of up to "N" elements into an STL compatible sequence.

// Insert 5 elements into std::vector
std::string int_string = "100,-200,+300,400,-500,+600,700,-800,+900";
std::vector<int> int_vector;
strtk::parse_n(int_string,",",5,int_vector);

// Insert 3 elements into std::deque
std::string double_string = "123.456,+789.012,345.678,-901.234,567.890";
std::deque<double> double_deque;
strtk::parse_n(double_string,",",3,double_deque);

// Insert 6 elements into std::list
std::string data_string = "a,bc,def,ghij,klmno,pqrstu,vwxyz";
std::list<std::string> string_list;
strtk::parse_n(data_string,",",6,string_list);

// Insert 6 elements into std::set
std::string set_string = "a|bc#def|ghij#klmno|pqrstu#vwxyz";
std::set<std::string> string_set;
strtk::parse_n(set_string,"#|",6,string_set);

// Insert 4 elements into std::queue
std::string queue_string = "value0,value1,value2,value3,value4,value5";
std::queue<std::string> string_queue;
strtk::parse_n(queue_string,",|",4,string_queue);

// Insert 4 elements into std::stack
std::string stack_string = "value0|value1|value2,value3|value4,value5";
std::stack<std::string> string_stack;
strtk::parse_n(stack_string,",|",4,string_stack);

// Insert 4 elements into std::priority_queue
std::string priority_queue_string = "value0#value1|value2,value3#value4,value5";
std::priority_queue<std::string> string_priority_queue;
strtk::parse_n(priority_queue_string,",|#",4,string_priority_queue);

Note: The return value of the routine strtk::parse_n indicates how many elements were parsed and placed into the specified sequence.

Some Initial Simple Examples

Simple Example 0

As a first example, we'll tackle the simple problem of reversing words in a sentence. That is given a sentence, to have the first word be the last and the last to be the first, the second word to be the second last so on and so forth. Using StrTk we come up with the following solution:

int main()
{
   std::string sentence = "The quick brown fox jumps over the lazy dog";
   std::cout << "Before: " << sentence << std::endl;
   strtk::split(" ",
                sentence,
                strtk::functional_inserter(
                   [](const strtk::range::string& range)
                   { strtk::reverse(range); })
               );
   strtk::reverse(sentence);
   std::cout << "After: " << sentence << std::endl;
   return 0;
}

With an expected output of:

Before: The quick brown fox jumps over the lazy dog
After: dog lazy the over jumps fox brown quick The

Simple Example 1

Another example, given a list of words to blank out and a sentence, transform the sentence such that the blank-out words are removed. Using StrTk we come up with the following solution:

int main()
{
   std::string sentence = "The quick brown fox jumps over the lazy dog";
   std::unordered_set<std::string> blankout_words;
   blankout_words.insert("quick");
   blankout_words.insert("over");
   blankout_words.insert("lazy");
   std::cout << "Before: " << sentence << std::endl;
   strtk::split(" ",
                sentence,
                strtk::functional_inserter(
                   [&blankout_words](const strtk::range::string& range)
                   {
                      auto itr = blankout_words.find(range);
                      if (blankout_words.end() != itr)
                      {
                         strtk::fill(range,' ');
                      }
                   })
               );
   strtk::remove_consecutives_inplace(' ',sentence);
   std::cout << "After: " << sentence << std::endl;
   return 0;
}

With an expected output of:

Before: The quick brown fox jumps over the lazy dog
After: The brown fox jumps the dog

Simple Example 2

Another simple example would be given a text file to read each of the lines and populate a word list structure by tokenizing each line into words. The following is an example of how this can be achieved using StrTk:

int main()
{
   std::deque<std::string> word_list;
   strtk::for_each_line("text.txt",
                        [&word_list](const std::string& line)
                        {
                           static const std::string delimiters = "0123456789()[]{}<>"
                                                                 "\t\r\n ,,.;:'"
                                                                 "!@#$%^&*_-=+`~/";
                           strtk::parse(line,delimiters,word_list);
                        });
   return 0;
}

Simple Example 3

The following simple example takes a user specified text file, proceeds to process it and returns information relating to the file, such as word, letter, uppercase character, lowercase character, vowel and consonant counts.

int main()
{
   std::size_t word_count      = 0;
   std::size_t letter_count    = 0;
   std::size_t uppercase_count = 0;
   std::size_t lowercase_count = 0;
   std::size_t vowel_count     = 0;
   std::size_t consonant_count = 0;

   using namespace strtk;

   for_each_line("data.txt",
                 [&](const std::string& line)
                 {
                    static multiple_char_delimiter_predicate is_vowel("AEIOUaeiou");
                    static multiple_char_delimiter_predicate is_lowercase(ext_string::all_lowercase_letters());
                    static const std::string delimiters = ext_string::all_chars()
                                                        - ext_string::all_lowercase_letters()
                                                        - ext_string::all_uppercase_letters();
                    split(delimiters,
                          line,
                          functional_inserter(
                             [&](const strtk::range::string& range)
                             {
                                if (0 == range.size()) return;
                                ++word_count;
                                letter_count += range.size();
                                std::size_t current_lowercase_count = 0;
                                std::size_t current_vowel_count     = 0;
                                for (std::size_t i = 0; i < range.size(); ++i)
                                {
                                   if (is_vowel(range[i])) ++current_vowel_count;
                                   if (is_lowercase(range[i])) ++current_lowercase_count;
                                }
                                uppercase_count += range.size() - current_lowercase_count;
                                lowercase_count += current_lowercase_count;
                                consonant_count += range.size() - current_vowel_count;
                                vowel_count += current_vowel_count;
                             })
                         );
                 });

   std::cout << "Word count:      " << word_count      << std::endl;
   std::cout << "Letter count:    " << letter_count    << std::endl;
   std::cout << "Uppercase count: " << uppercase_count << std::endl;
   std::cout << "Lowercase count: " << lowercase_count << std::endl;
   std::cout << "Vowel count:     " << vowel_count     << std::endl;
   std::cout << "Consonant count: " << consonant_count << std::endl;
   return 0;
}

Simple Example 4

For the next example, assume we have a text file with a list of names, one per line that represents the order of people that entered a building. Some of the people may enter and leave then reenter the building many times, hence their name will appear multiple times in the list. Our task is to reduce this list to a list of unique names but to also maintain the relative order of names found in the original list. The following is how this particular requirement can be accomplished by using StrTk:

int main()
{
   strtk::for_each_line("file_name.txt",
                        [](const std::string& line)
                        {
                           static std::unordered_set<std::string> line_set;
                           if (line_set.end() != line_set.find(line))
                              return;
                           line_set.insert(line);
                           std::cout << line << std::endl;
                        });
   return 0;
}

Simple Example 5

As a final simple example, we would like to calculate the word frequency model of a user specified text file. The process involves reading each line, splitting the line into words, then incrementing the relevant count for each word and maintaining a global word count. Once the file has been processed, the occurrence frequency of each word will be printed to stdout.

int main()
{
   typedef std::unordered_map<std::string,unsigned int> map_t;
   map_t word_hit_list;
   unsigned int word_count = 0;
   strtk::for_each_line("data.txt",
                        [&](const std::string& line)
                        {
                           static const std::string delimiters = strtk::ext_string::all_chars()
                                                               - strtk::ext_string::all_lowercase_letters()
                                                               - strtk::ext_string::all_uppercase_letters();
                            strtk::split(delimiters,
                                         line,
                                         strtk::functional_inserter(
                                            [&](const strtk::range::string& range)
                                            {
                                               if (range.begin() == range.end()) return;
                                               ++word_count;
                                               std::string word(range.begin(),range.end());
                                               ++word_hit_list[word];
                                            })
                                        );
                        });

   for (map_t::value_type v : word_hit_list)
   {
      printf("%s %10d %10.9f\n",
             strtk::text::right_align(15,' ',v.first).c_str(),
             v.second,
             (1.0 * v.second) / word_count);
   }
   return 0;
}

A Practical Example

Lets assume we have been given an English text file to process, with the intention of extracting a lexicon from the file.

One solution would be to break the problem down to a line by line tokenization problem. In this case we would define a functional object such as the following which will take the container in which we plan on storing our tokens (words) and a predicate and insert the tokens as strings into our container.

template<typename Container, typename Predicate>
struct parse_line
{
public:

   parse_line(Container& container, const Predicate& predicate)
   : container_(container),
     predicate_(predicate)
   {}

   inline void operator() (const std::string& str)
   {
      strtk::split(str,
                   predicate_,
                   strtk::range_to_type_back_inserter(container_),
                   strtk::split_options::compress_delimiters);
   }

private:

   Container& container_;
   const Predicate& predicate_;
};

The whole thing together would include a process of opening the file and reading it line by line each time invoking the parse_line would be as follows:

template<typename Container>
void parse_text(const std::string& file_name, Container& c)
{
   static const std::string delimiters = " ,.;:<>'[]{}()_?/"
                                         "`~!@#$%^&*|-_\"=+\t\r\n\0"
                                         "0123456789";
   strtk::multiple_char_delimiter_predicate predicate(delimiters);
   strtk::for_each_line(file_name,
                        parse_line<Container,strtk::multiple_char_delimiter_predicate>(c,predicate));
}

int main()
{
   std::string text_file_name = "text.txt";
   std::deque<std::string> word_list;
   parse_text(text_file_name,word_list);
   std::cout << "Token Count: " << word_list.size() << std::endl;
   return 0;
}

Before we continue on with the example, a re-written version of the above code using C++11 lambdas is as follows:

int main()
{
   std::string text_file_name = "text.txt";
   std::deque<std::string> word_list;
   strtk::for_each_line(text_file_name,
                        [&word_list](const std::string& line)
                        {
                           static const std::string delimiters = " ,.;:<>'[]{}()_?/"
                                                                 "`~!@#$%^&*|-_\"=+\t\r\n\0"
                                                                 "0123456789";
                           strtk::parse(line,delimiters,word_list);
                        });
   std::cout << "Token Count: " << word_list.size() << std::endl;
   return 0;
}

Now coming back to the original problem, that being the construction of a lexicon. In this case the set of "words" should only contain words of interest. For the sake of simplicity lets define words of interest as being anything other than the following prepositions: as, at, but, by, for, in, like, next, of, on, opposite, out, past, to, up and via. This type of list is commonly known as a Stop Word List. In this example the stop-word list definition will be as follows:

const std::string stop_word_list [] =
                  {
                     "as", "at", "but", "by", "for",
                     "in", "like", "next", "of", "on",
                     "opposite", "out", "past", "to",
                     "up", "via", ""
                  };

const std::size_t stop_word_list_size = sizeof(stop_word_list) / sizeof(std::string);

Some minor updates to the parse_line processor include using the filter_on_match predicate for determining if the currently processed token is a preposition and also the invocation of the range_to_type back_inserter to convert the tokens from their range iterator representation to a type representation compatible with the user defined container. For the new implementation to provide unique words of interest the simplest change that can be made is to replace the deque used as the container for the word_list to some kind of 1-1 associative container such as a set. The following is the improved version of the parse_line processor:

template<typename Container, typename Predicate>
struct parse_line
{
public:

   parse_line(Container& container, const Predicate& predicate)
   : container_(container),
     predicate_(predicate),
     tmp_(" "),
     tokenizer_(tmp_,predicate_,true),
     filter_(stop_word_list,stop_word_list + stop_word_list_size,
             strtk::range_to_string_back_inserter_iterator<Container>(container_),
             true,false)
   {}

   inline void operator() (const std::string& s)
   {
      const filter_type& filter = filter_;
      strtk::for_each_token(s,tokenizer_,filter);
   }

private:

   Container& container_;
   const Predicate& predicate_;
   std::string tmp_;
   typename strtk::std_string_tokenizer<Predicate>::type tokenizer_;
   strtk::filter_on_match<strtk::range_to_string_back_inserter_iterator<Container>> filter_;
};

The above described predicate can be greatly simplified by using binders and various lambda expressions.

Another Example

When performing serialization or deserialization of an instance object such as a class or struct, a simple approach one could take would be to take each of the members and convert them into string representations and from those strings construct a larger string delimiting each member with a special character guaranteed not to exist in any of the string representations.

In this example we will assume that there exists a struct which represents the properties of a person, a person struct if you will:

struct person
{
   unsigned int id;
   std::string name;
   unsigned int age;
   double height
   float weight;
};

The process of populating a person struct would entail having an instance of a person and the necessary data string. The following is an example of how this would be done using the strtk::parse function.

Person Tuple Format

Token0Delimiter0Token1Delimiter1Token2Delimiter2Token3Delimiter3Token4
Unique ID(hex)|Name|Age|Height(m)|Weight(kg)
std::string data = "0xFA37ED12|Rumpelstiltskin|397|1.31|58.7";
person p;
strtk::hex_to_number_sink<unsigned int> hex_sink(p.id); // register id with the hex sink
strtk::parse(data,"|",hex_sink,p.name,p.age,p.height,p.weight);

Batch processing of a text file comprised of one person tuple per-line is somewhat similar to the previous example. A predicate is defined that takes a container specialized on the person struct, and a delimiter predicate with which the strtk::parse function will be invoked. This predicate is then instantiated and coupled with the text file name, is fed to the strtk::for_each_line processor.

template<typename Container, typename Predicate>
struct parse_person_tuple
{
public:

   parse_person_tuple(Container& container)
   : container_(container),
     hex_sink(p_.id)
   {}

   inline void operator() (const std::string& s)
   {
      if (strtk::parse(s,"|",hex_sink,p_.name,p_.age,p_.height,p_.weight))
         container_.push_back(p_);
      else
         std::cerr << "Failed to parse: " << s << std::endl;
   }

private:

   Container& container_;
   person p_;
   strtk::hex_to_number_sink<unsigned int> hex_sink;
};

Bringing the above pieces together to process a file results in the following:

int main()
{
   std::string text_file_name = "person_records.txt";
   std::deque<person> person_list;
   strtk::for_each_line(file_name,predicate_type(person_list));
   return 0;
}

Before we continue on with the example, a re-written version of the above code using C++11 lambdas is as follows:

int main()
{
   std::string text_file_name = "person_records.txt";
   std::deque<person> person_list;
   person p;
   strtk::hex_to_number_sink<unsigned int> hex_sink;
   strtk::for_each_line(text_file_name,
                        [&](const std::string& line)
                        {
                           if (strtk::parse(line,"|",hex_sink,p.name,p.age,p.height,p.weight))
                              container_.push_back(p);
                           else
                              std::cerr << "Failed to parse: " << line << std::endl;
                        });
   return 0;
}

To make things easier one could adapt a struct (made up entirely of PODs) to a parser. This makes the usage syntax little easier to follow. An example of this adaption is as follows:

struct type
{
   std::string s;
   double d;
   int i;
   char c;
   bool b;
};

strtk_parse_begin(type)
 strtk_parse_type(s)
 strtk_parse_type(d)
 strtk_parse_type(i)
 strtk_parse_type(c)
 strtk_parse_type(b)
strtk_parse_end()

int main()
{
   type t;
   std::string s = "abcdefghijklmnop|123.456|987654321|A|1";
   strtk::parse(s,"|",t);
   return 0;
}

Another similar example to the previous, would be parsing a text file of 3D coordinates into a sequence. This can be done easily and cleanly using lambdas and StrTk as follows:

struct point
{
   double x,y,z;
};

int main()
{
   std::string point_data = "point_data.txt";
   std::deque<point> points;
   point p;
   strtk::for_each_line(point_data,
                        [&](const std::string& str)
                        {
                           if (strtk::parse(str,",",p.x,p.y,p.z))
                              points.push_back(p);
                        });
   return 0;
}

Simple Date-Time Parser

Assuming the datetime struct defined below, and a string representation of a combined date and time in the form of YYYY-MM-DD HH:MM:SS.MS eg: 2005-06-26 15:57:03.678

struct datetime
{
   unsigned int year;
   unsigned int month;
   unsigned int day;
   unsigned int hour;
   unsigned int minute;
   unsigned int second;
   unsigned int msecond;
};

The following assumes an input of date-time values separated by a pipe. To facilitate parsing of a date-time by the strtk::parse routine into an STL compatible sequence an implementation of string_to_type_converter_impl specific to the datetime type is required. The following demonstrates how such a routine can be generated and used within the strtk::parse context:

strtk_string_to_type_begin(datetime)
   static const std::string delimiters ("-:. ");
   return strtk::parse(begin, end, delimiters,
                       t.year, t.month,  t.day,
                       t.hour, t.minute, t.second, t.msecond);
strtk_string_to_type_end()

A simple example of using strtk::parse in conjunction with the newly integrated datetime variable mixed with variables of other types is as follows:

int main()
{
   const std::string data = "abc 123 xyz,2000-01-10 03:01:16.123|+98765.43210";

   std::string var0;
   datetime var1;
   double var2;

   strtk::parse(data,",|",var0,var1,var2);

   return 0;
}

Bringing the above pieces together, in the following we can then proceed to parse a sequence of date-time strings delimited by pipe "|" into a deque of type datetime.

int main()
{
   static const std::string data = "2000-01-10 03:01:16.123|2001-02-22 05:12:24.234|"
                                   "2002-03-13 07:23:32.345|2003-04-24 09:34:47.456|"
                                   "2004-05-15 11:46:51.767|2005-06-26 15:57:03.678|"
                                   "2006-07-17 17:45:31.561|2007-08-26 19:06:02.809|"
                                   "2008-09-18 21:16:23.267|2009-10-26 23:12:03.798|"
                                   "2010-11-23 13:47:11.963|2011-12-26 15:35:08.168";

   std::deque<datetime> datetime_list;
   strtk::parse(data,"|",datetime_list);

   for (std::size_t i = 0; i < datetime_list.size(); ++i)
   {
      datetime& dt = datetime_list[i];
      std::cout << dt.year << "-"
                << strtk::text::right_align(2,'0',  dt.month) << "-"
                << strtk::text::right_align(2,'0',    dt.day) << " "
                << strtk::text::right_align(2,'0',   dt.hour) << ":"
                << strtk::text::right_align(2,'0', dt.minute) << ":"
                << strtk::text::right_align(2,'0', dt.second) << "."
                << strtk::text::right_align(3,'0',dt.msecond) << std::endl;
   }
   return 0;
}

As a side note, the more commonly used date, time and date-time formats can be easily parsed with a simple utilities library based on StrTk called Datetime_Utils The library makes use of the technique described above in conjunction with the strtk::offset_splitter to provide efficient and high performance parsers for formats such as the ones denoted below:

FormatExample
YYYYMMDD 20060304
YYYYDDMM 20060403
YYYY/MM/DD 2006/03/04
YYYY/DD/MM 2006/04/03
DD/MM/YYYY 04/03/2006
HH:MM:SS.mss 13:27:54.123
HH:MM:SS 13:27:54
YYYYMMDD HH:MM:SS.mss 20060304 13:27:54.123
YYYY/MM/DD HH:MM:SS.mss2006/03/04 13:27:54.123
DD/MM/YYYY HH:MM:SS.mss04/03/2006 13:27:54.123
YYYYMMDD HH:MM:SS 20060304 13:27:54
YYYY/MM/DD HH:MM:SS 2006/03/04 13:27:54
DD/MM/YYYY HH:MM:SS 04/03/2006 13:27:54
YYYY-MM-DD HH:MM:SS.mss2006-03-04 13:27:54.123
DD-MM-YYYY HH:MM:SS 04-03-2006 13:27:54
YYYY-MM-DDTHH:MM:SS 2006-03-04T13:27:54
YYYY-MM-DDTHH:MM:SS.mss2006-03-04T13:27:54.123
YYYYMMDDTHH:MM:SS 20060304T13:27:54
YYYYMMDDTHH:MM:SS.mss 20060304T13:27:54.123

In the following simple example we have an array of data representing tuples of trade executions in CSV format. The objective is to populate the trade_list instance with the given data via the defined trade struct. In the example the dt_utils::datetime_format6 date-time parser is used, it populates a general date-time type instance called dt_utils::datetime. If the parse operation succeeds, then the date-time components the trade type requires are updated and the instance itself is subsequently added to the trade_list.

struct trade
{
   std::string ticker;
   double price;
   unsigned int volume;
   unsigned short hr,min,sec,ms;
};

int main()
{
   std::string trade_data[] =
                 {
                    "2006-03-04 13:27:54.123,ABC,12.347,6676",
                    "2006-03-04 13:27:54.231,XYZ,23.455,71547",
                    "2006-03-04 13:27:54.312,IJK,34.562,514",
                    "2006-03-04 13:27:55.263,PQR,67.893,5949",
                    "2006-03-04 13:27:55.327,ABC,78.963,19",
                    "2006-03-04 13:27:55.524,XYZ,45.677,1276",
                    "2006-03-04 13:27:56.623,IJK,36.182,6676",
                    "2006-03-04 13:27:56.877,PQR,62.339,1547"
                 };

   std::deque<trade> trade_list;
   trade t;
   dt_utils::datetime dt;
   dt_utils::datetime_format6 dt6(dt);

   for (std::size_t i = 0; i < sizeof(trade_data) / sizeof(std::string); ++i)
   {
      bool result =
         strtk::parse(trade_data[i],",",
                      dt6,t.ticker,t.price,t.volume);
      if (result)
      {
         t.hr  = td.hour;
         t.min = td.minute;
         t.sec = td.second;
         t.ms  = td.millisecond;
         trade_list.push_back(t);
      }
   }

   return 0;
}

Parsing Sub-Lists

So far the demonstrated capabilities of the strtk::parse function has been based on passing a series of parameters that are populated in a linear fashion as the parser processes the tokens it encounters. That said, some formats have their own sub-structures, a simple example would be a list of values (such as integers) that need to be loaded into a deque or stack. StrTk provides a series of sink facilities that consume a range and an STL container which can be forwarded onto strtk::parse.

In the following example, the data string is comprised of 3 separate lists delimited by a pipe "|". An integer, a string and a double type list. Each list is to be parsed into an STL container of appropriate type. In this case a vector, a deque and a list. StrTk provides the ability to instantiate a sink for the specific container type that is compatible with strtk::parse.

int main()
{
   std::string data = "1,+2,-3,4|abc,ijk,rst,xyz|123.456,+234.567,-345.678,456.789,567.890";

   //define containers
   std::vector<int> int_vector;
   std::deque<std::string> string_deque;
   std::list<double> double_list;
   std::set<int> int_set;
   std::queue<std::string> string_queue;
   std::stack<double> double_stack;
   std::priority_queue<int> int_priority_queue;

   //define sinks
   strtk::vector_sink<int>::type         vec_sink(",");
   strtk::deque_sink<std::string>::type  deq_sink(",");
   strtk::list_sink<double>::type        lst_sink(",");
   strtk::set_sink<int>::type            set_sink(",");
   strtk::queue_sink<std::string>::type  que_sink(",");
   strtk::stack_sink<double>::type       stk_sink(",");
   strtk::priority_queue_sink<int>::type prq_sink(",");


   strtk::parse(data,"|",vec_sink(  int_vector),
                         deq_sink(string_deque),
                         lst_sink( double_list));

   strtk::parse(data,"|",set_sink(           int_set),
                         que_sink(      string_queue),
                         stk_sink(      double_stack),
                         prq_sink(int_priority_queue));

   return 0;
}

If only a certain number of elements in the list are required, for example only the first 3, then the element count on the sink can be set appropriately. The above example could be modified as follows:

int main()
{
   std::string data = "1,+2,-3,4|string0|abc,ijk,rst,xyz|string1|123.456,+234.567,-345.678,456.789,567.890";

   std::vector<int> int_vector;
   std::deque<std::string> string_deque;
   std::list<double> double_list;

   strtk::vector_sink<int>::type        vec_sink(",");
   strtk::deque_sink<std::string>::type deq_sink(",");
   strtk::list_sink<double>::type       lst_sink(",");

   std::string string_0;
   std::string string_1;

   strtk::parse(data,"|",vec_sink(  int_vector).count(2),  // consume first 2 values
                         string_0,
                         deq_sink(string_deque).count(3),  // consume first 3 values
                         string_1,
                         lst_sink( double_list).count(4)); // consume first 4 values
   return 0;
}

Note: If there aren't enough elements in a particular list, then parsing of that list fails and subsequently the call to strtk::parse will fail.

Parsing Trailing-Lists

Another way one might want to parse a tuple of values might be to parse a prefix of values into a specific number of possibly varying types, then to parse the remaining values (assuming they are all of the same type) into a sequence or list etc. StrTk provides the following simple solution to the given problem, as demonstrated below:

int main()
{
   {
      std::string data = "A String Value,111.111,222.222,333.333,444.444,555.555";
      std::string token;
      std::vector<double> double_list;
      strtk::parse(data,",",token,double_list);
   }
   {
      std::string data = "A String Value,01-02-2003,111.111,222.222,333.333,444.444,555.555";
      std::string token;
      std::string date;
      std::deque<double> double_list;
      strtk::parse(data,",",token,date,double_list);
   }
   {
      std::string data = "A String Value,01-02-2003,123456789,111.111,222.222,333.333,444.444,555.555";
      std::string token;
      std::string date;
      int i;
      std::list<double> double_list;
      strtk::parse(data,",",token,date,i,double_list);
   }
   {
      std::string data = "A String Value,01-02-2003,123456789,111.111,222.222,333.333,444.444,555.555";
      std::string token;
      std::string date;
      int i;
      double d;
      std::vector<double> double_list;
      strtk::parse(data,",",token,date,i,d,double_list);
   }
   {
      std::string data = "A String Value,01-02-2003,123456789,111.111,222.222,333.333,444.444,555.555";
      std::string token;
      std::string date;
      int i;
      double d1;
      double d2;
      std::deque<double> double_list;
      strtk::parse(data,",",token,date,i,d1,d2,double_list);
   }
   return 0;
}

Extending The Date-Time Parser Example

Building upon the previous datetime example, We are presented with a tuple of data that represents an astronomical event. The event defines a name, a location and a series of date-times in UTC the event was observed. In order to construct the necessary sink(s) that will be used for parsing the required type into a container, the macro strtk_register_userdef_type_sink with the specified type is invoked. The following is a definition of the struct one might construct:

struct event_information
{
   std::size_t id;
   std::string name;
   std::string location;
   std::deque<datetime> observation_datetimes;
};

strtk_register_userdef_type_sink(datetime)

Bringing the above together with a call to strtk::parse results in the following code which parses the event data tuple into the allocated event_information instance.

int main()
{
   std::string event_data = "172493|Lunar Impact|Mare Tranquillitatis|"
                            "2010-01-19 00:28:45.357,2010-02-18 00:57:07.109,"
                            "2010-03-20 01:15:11.261,2010-04-21 01:07:27.972";

   strtk::deque_sink<datetime>::type deq_sink(",");

   event_information evt_info;
   strtk::parse(event_data,"|",evt_info.id,
                               evt_info.name,
                               evt_info.location,
                               deq_sink(evt_info.observation_datetimes));
   return 0;
}

Token Processing During Parsing

StrTk offers a set of convenient and simple to use token processing primitives that can be invoked during a call to the strtk::parse routine to perform various actions upon the tokens being parsed. These actions include such things as modifications and validations of tokens.

The following is a list of token processing primitives used for constraint and verification purposes:

  • strtk::ignore_token
  • strtk::expect
  • strtk::iexpect
  • strtk::like
  • strtk::inrange

The following is a list of token processing primitives used for modifying and normalising purposes:

  • strtk::trim
  • strtk::trim_leading
  • strtk::trim_trailing
  • strtk::as_lcase
  • strtk::as_ucase

The primitives all return either a true or false value upon parsing completion, which is then further used by the strtk::parse routine to determine if the parse operation as a whole has succeeded or failed.

Ignore Token Processing

There may be scenarios when given a delimited tuple of data, that one or more of the tokens need to be ignored or skipped. StrTk provides a mechanism called strtk::ignore_token that allows the parser to consume specific tokens easily without affecting overall performance. Below is an example of how strtk::ignore_token can be used in conjunction with strtk::parse to skip the 2nd and 4th tokens in the tuple:

int main()
{
   static const std::string data = "+123,ignore0,123.456,ignore1,abcdef,ignore2";

   int i;
   double d;
   std::string s;

   strtk::ignore_token ignore;
   strtk::parse(data,",",i,ignore,d,ignore,s);

   std::cout << "i = " << i << std::endl;
   std::cout << "d = " << d << std::endl;
   std::cout << "s = " << s << std::endl;

   return 0;
}

Expect and IExpect Token Processing

When parsing a tuple, one may want to ensure that specific tokens of the tuple are of a certain string value. StrTk provides this type of functionality via the strtk::expect and strtk::iexpect mechanisms. The strtk::expect form enforces an exact string match, whereas the strtk::iexpect enforces only a case-insensitive match. The following is an example where we attempt to parse a 'pascal-like' variable declaration and definition. The requirement is that the first token be "var" followed by a variable name and then a type name of 'Integer' which is not case sensitive.

int main()
{
   static const std::string data = "var foo : InTeGeR = 3;";

   std::string variable_name;
   int initial_value;

   bool result = strtk::parse(data,
                              " ;",
                              strtk::expect("var").ref(),
                              variable_name,
                              strtk::expect(":").ref(),
                              strtk::iexpect("Integer").ref(),
                              strtk::expect("=").ref(),
                              initial_value);

   if (result)
      std::cout << variable_name << " = " << initial_value << std::endl;
   else
     std::cout << "Failed to parse statement!" << std::endl;

   return 0;
}

Like Token Processing

Similar to the above mentioned strtk::expect and strtk::iexpect primitives, StrTk provides a simple wildcard matching of tokens functionality via the strtk::like mechanism. The special characters of '*' and '?' are used denoting 'zero or more' and 'zero or one' match modes respectively. The following example uses the strtk::like in conjunction with strtk::expect to parse a tuple of key/value pairs.

int main()
{
   static const std::string data = "token0=+123;token1=abc;token2=-456.678;";

   int i;
   std::string s;
   double d;

   strtk::parse(data,
                "=;",
                strtk::like("to*n?").ref(),
                i,
                strtk::like("token?").ref(),
                s,
                strtk::iexpect("tOkEn2").ref(),
                d);

   std::cout << "i = " << i << std::endl;
   std::cout << "s = " << s << std::endl;
   std::cout << "d = " << d << std::endl;

   return 0;
}

In-Range Token Processing

When parsing tokens, one may wish to determine if the token when viewed in its target type resides within a specified range [r0,r1]. As the tokens can be of any type, not necessarily just string or numerical in nature, the type must have a less- than comparable attribute. The following example attempts to parse a key/value tuple that contains a temperature and a name component.

int main()
{
   static const std::string data = "temperature=+123.456;name=Rumpelstilzchen";

   double temperature;
   std::string name;

   strtk::parse(data,
                "=;",
                //Process temperature section
                strtk::expect("temperature").ref(),
                strtk::inrange(temperature,-432.1,+432.1).ref(),
                //Process name section
                strtk::expect("name").ref(),
                strtk::inrange(name,"AAAA","zzzz").ref());

   std::cout << "temperature = " << temperature << std::endl;
   std::cout << "name = " << name << std::endl;

   return 0;
}

Trim Token Processing

At times tokens within a tuple may have padding added to either the left, right or both ends. When processing the token it maybe necessary to remove the superfluous padding before attempting to convert the string or range representation of the token into its target type(int, double etc). The example below, demonstrates the use of various forms of token trimming in conjunction strtk:parse.

int main()
{
   {
      std::string data = "****abc123****,****abc123****,****abc123****";

      std::string s0;
      std::string s1;
      std::string s2;

      strtk::parse(data,",",
                   strtk::trim("*",s0).ref(),
                   strtk::trim_leading ("*",s1).ref(),
                   strtk::trim_trailing("*",s2).ref());

      std::cout << "s0 = [" << s0 << "]" << std::endl;
      std::cout << "s1 = [" << s1 << "]" << std::endl;
      std::cout << "s2 = [" << s2 << "]" << std::endl;

      //Expected Output:
      //s0 = [abc123]
      //s1 = [abc123****]
      //s2 = [****abc123]
   }

   {
      std::string data = "*?*?a string*?*?,*?*123456,123.456?*?*?";

      std::string s;
      int i;
      double d;

      strtk::parse(data,",",
                   strtk::trim("*?",s).ref(),
                   strtk::trim_leading ("?*",i).ref(),
                   strtk::trim_trailing("*?",d).ref());

      std::cout << "s = [" << s << "]" << std::endl;
      std::cout << "i = [" << i << "]" << std::endl;
      std::cout << "d = [" << d << "]" << std::endl;

      //Expected Output:
      //s = [a string]
      //i = [123456]
      //d = [123.456]
   }

   return 0;
}

Case Normalisation Token Processing

Another pair of token processing mechanisms provided by StrTk, are the strtk::as_lcase and strtk::as_ucase mechanisms. They convert the string representation of the token to lowercase and uppercase characters respectively. The following example, parses a two token tuple, and converts the first token s0 to all lowercase and the second token s1 to all uppercase.

int main()
{
   std::string data = "AbCd,EfGhI";

   std::string s0;
   std::string s1;

   strtk::parse(data,",",
                strtk::as_lcase(s0).ref(),
                strtk::as_ucase(s1).ref());

   std::cout << "s0 = [" << s0 << "]" << std::endl;
   std::cout << "s1 = [" << s1 << "]" << std::endl;

   //Expected Output:
   //s0 = [abcd]
   //s1 = [EFGHI]

   return 0;
}

Parsing Truncated Values

There may be times during parsing when a token which is intended to be parsed as an integral type (eg: int, short, unsigned int et al) may be represented using decimal notation (eg: 1234.00000).

Normally if the token were to be passed as-is it would cause a conversion error due to the fact that there are invalid characters within the token.

StrTk provides a facility called strtk::truncated_int that can be used with both signed and unsigned integral types. The type truncated_int is specialised with the required type, then an instance of the type is registered with it either prior to or inline with the conversion/parse call. When the conversion occurs strtk::truncated_int simply redefines the end of the token range to be the decimal point (if it is present) and then passes it along to the appropriate string_to_type_converter_impl call.

int main()
{
   {
      //Convert decimal representation to an int
      int i = 0;
      std::string data = "-1234.0000";
      strtk::truncated_int<int> ti;
      strtk::string_to_type_converter(data,ti(i));
   }

   {
      //Parse a tuple of decimal values into ints i0 and i1
      int i0 = 0;
      int i1 = 0;
      std::string data = "-1234.0000|456.7890";
      strtk::truncated_int<int> ti0;
      strtk::truncated_int<int> ti1;
      strtk::parse(data,"|",ti0(i0),ti1(i1));
   }
}

An optional parameter that can be utilized is the 'fractional_size' which denotes the exact number of digits after the decimal place that is expected. In the event this condition is not met a conversion error is returned.

int main()
{
   std::string data = "-1234.000";
   strtk::truncated_int<int> ti;
   ti.fractional_size(3);
   int i = 0;

   if (strtk::string_to_type_converter(data,ti(i)))
      std::cout << "i: " << i << std::endl;
   else
      std::cout << "Error parsing: " << data << std::endl;

   return 0;
}

In the following example, we have an array of trade execution tuples in csv format. The tuple is comprised of the following fields: ticker name (string), trade id (uint64 right aligned with space as padding), execution price (double), executed volume (unsigned int with 4 decimal place suffix). The struct trade will be used to store the tuples in memory. The objective is to parse each tuple and populate the trade_list structure with all the trades, noting any errors that occur along the way.

struct trade
{
   unsigned long long id;
   std::string    ticker;
   double          price;
   unsigned int   volume;
};

int main()
{
   std::string trade_data[] =
               {
                 "AAA,   0,36.900,009352.0000",
                 "BBB,   1,46.000,009800.0000",
                 "CCC,   2,46.000,009400.0000",
                 "DDD,   3,46.000,001500.0000",
                 "AAA,   4,46.000,000600.0000",
                 "BBB,   5,46.000,039800.0000",
                 "CCC,   6,46.000,003200.0000",
                 "DDD,   7,46.000,000100.0000",
                 "AAA,   8,46.000,000800.0000",
                 "BBB,   9,48.500,002400.0000",
                 "CCC,  10,37.395,101200.0000",
                 "DDD,  11,37.193,200000.0000",
                 "AAA,  12,37.146,020000.0000",
                 "BBB,  13,38.314,001752.0000",
                 "CCC,  14,37.386,130000.0000",
                 "DDD,  15,37.443,100000.0000"
               };

   std::deque<trade> trade_list;

   strtk::truncated_int<unsigned int> tui;
   tui.fractional_size(4);

   for (std::size_t i = 0; i < sizeof(trade_data) / sizeof(std::string); ++i)
   {
      trade t;
      bool result = strtk::parse(trade_data[i],",",
                                 t.ticker,
                                 strtk::trim_leading(" ",t.id).ref(),
                                 t.price,
                                 tui(t.volume));
     if (result)
        trade_list.push_back(t);
     else
        std::cout << "Tuple parse error: " << trade_data[i] << std::endl;
   }

   return 0;
}

It should be noted that truncated_int can be used in conjunction with the various StrTk token processing primitives such as strtk::trim, strtk::trim_leading, strtk::as_lcase et al . The following example demonstrates parsing a tuple of values intended for int types, where the tokens have random space padding. The simple composition of strtk::truncated_int and strtk::trim allows for efficient and error free parsing of the tuple.

int main()
{
   std::string data = "   123.0 |  456.00    | 789.000";

   int i0,i1,i2;

   strtk::truncated_int<int> ti0;
   strtk::truncated_int<int> ti1;
   strtk::truncated_int<int> ti2;

   strtk::parse(data,"|",
                strtk::trim(" ",ti0(i0)).ref(),
                strtk::trim(" ",ti1(i1)).ref(),
                strtk::trim(" ",ti2(i2)).ref());

   printf("i0 = [%d]  i1 = [%d]  i2 = [%d]\n",i0,i1,i2);
   // i0 = [123]  i1 = [456]  i2 = [789]

   return 0;
}

Columnwise Parsing

In the previous section the ability to ignore tokens in a tuple was discussed. The concept works well if only a few tokens need to be ignored. However problems arise when the tuples contain a large number of tokens and the tokens that are to be ignored are numerous and distributed uniformly over the entire tuple.

Situations such as this one are common, and using the ignore_token technique can not only make both the coding of the solution cumbersome and error prone but also make the parsing process itself quite inefficient.

A natural extension to ignore_token that scales and is also extremely efficient, can be found in the combined functionalities of the parse_columns and column_list. The column_list is a structure used to hold the indexes of the tokens in the tuple that are required. The indexes have to be valid, unique and in ascending order.

//even indexes [0,12]
auto cl_even = strtk::column_list(0,2,4,6,8,10,12);

//odd indexes [1,7]
auto cl_odd = strtk::column_list(1,3,5,7);

The strtk::parse_columns function takes a string of data representing a tuple, a delimiter to determine the tokens in the tuple, a strkt::column_list and a compatible number of types as target references.

The number of types has to be equal to the number of indexes in the column_list, and the types need to be convertible within the strtk namespace from a string range representation to the type. In the following example we have a tuple consisting of integers. We're only interested in the first four even numbered indexes in the tuple, the code below demonstrates how the tuple is parsed with the given constraints:

int main()
{
   const std::string data = "1,22;333|4444 55555|666666;7777777,88888888 999;"
                            "0000|1111,22222,333333,4444444";

   int i0,i1,i2,i3,i4;

   strtk::parse_columns(data,
                        ",;| ",
                        strtk::column_list(0,2,4,6,8),
                        i0,i1,i2,i3,i4);
   return 0;
}

In the following example we have a tuple consisting of a mixture of types. We are only interested in the first, fifth and eighth indexes in the tuple, which happen to be of type int, double and std::string respectively. The code below demonstrates how the tuple is parsed with the given constraints:

int main()
{
   const std::string data = "token0,1234;token2;token3,token4,11.22|token6;token7|"
                            "text-i-text,1 2 3 4 5 6 7";

   int i;
   double d;
   std::string s;

   strtk::parse_columns(data,
                        ",;| ",
                        strtk::column_list(1,5,8),
                        i,d,s);
   return 0;
}

Simple 3D Mesh File Format Parser

Lets assume there is a file format for expressing a mesh. The format consists of a section that defines the unique vertexes in the mesh, and another section that defines the triangles in the mesh as a tuple of three indicies indicating the vertexes used. Each section is preceded by an integer that denotes the number of subsequent elements in that section. An example of such a file is the following:

5
+1.0,+1.0,+1.0
-1.0,+1.0,-1.0
-1.0,-1.0,+1.0
+1.0,-1.0,-1.0
+0.0,+0.0,+0.0
4
0,1,4
1,2,4
2,3,4
3,1,4

Code to parse such a file format is as follows:

struct point
{
   double x,y,z;
};

struct triangle
{
   std::size_t i0,i1,i2;
};

int main()
{
   std::string mesh_file = "mesh.txt";
   std::ifstream stream(mesh_file.c_str());
   std::string s;
   // Process points section
   std::deque<point> points;
   point p;
   std::size_t point_count = 0;
   strtk::parse_line(stream," ",point_count);
   strtk::for_each_line_n(stream,
                          point_count,
                          [&points,&p](const std::string& line)
                          {
                             if (strtk::parse(line,",",p.x,p.y,p.z))
                                points.push_back(p);
                          });

   // Process triangles section
   std::deque<triangle> triangles;
   triangle t;
   std::size_t triangle_count = 0;
   strtk::parse_line(stream," ",triangle_count);
   strtk::for_each_line_n(stream,
                          triangle_count,
                          [&triangles,&t](const std::string& line)
                          {
                             if (strtk::parse(line,",",t.i0,t.i1,t.i2))
                                triangles.push_back(t);
                          });
   return 0;
}

Simple Semantic Actions

A semantic action is an action that is undertaken upon a token, it can be in the form of either an assessment or a manipulation. StrTk provides a very simplified semantic action capability, named strtk::util::semantic_action for types that are being parsed via the strtk::parse function. A function (stateful or stateless), taking an iterator range is used to construct the semantic_action. The body of the function performs whatever operations are required and also makes sure to maintain the contract with regards to the return status for the parse routine to complete successfully.

The following is an example where a comma delimited string is parsed into 3 types, an integer, a double and a string. The rules regarding parsing and updating of the variables is as follows, the int variable "i" will only be updated if the value parsed is odd, the double value "d" will only be updated if the parsed value is greater than or equal to 99.99 and the string value "s" will only be updated if the presented range contains the string "ring". Upon every successful update a corresponding counter will be incremented.

int main()
{
   std::string data = "12345,123.456,A string";
   int i = 0;
   double d = 0.0;
   std::string s = "";
   std::size_t i_update_count = 0;
   std::size_t d_update_count = 0;
   std::size_t s_update_count = 0;

   typedef strtk::range::ustring::const_iterator itr_type;
   using strtk::util::semantic_action;

   strtk::parse(data,",",
                //Token_0 (i) - Parse and update if value is odd
                semantic_action([&i,&i_update_count]
                                (itr_type begin,itr_type end) -> bool
                                {
                                   int temp = 0;
                                   if (strtk::string_to_type_converter(begin,end,temp))
                                   {
                                      if (temp % 2 != 0)
                                      {
                                         i = temp;
                                         ++i_update_count;
                                      }
                                      return true;
                                   }
                                   else
                                      return false;
                                }).ref(),
                //Token_1 (d) - Parse and update if value is greater than 99.99
                semantic_action([&d,&d_update_count]
                                (itr_type begin,itr_type end) -> bool
                                {
                                   double temp = 0.0;
                                   if (strtk::string_to_type_converter(begin,end,temp))
                                   {
                                      if (temp >= 99.99)
                                      {
                                         d = temp;
                                         ++d_update_count;
                                      }
                                      return true;
                                   }
                                   else
                                      return false;
                                }).ref(),
                //Token_2 (s) - Parse and update if value contains "ring"
                semantic_action([&s,&s_update_count]
                                (itr_type begin,itr_type end) -> bool
                                {
                                   static unsigned char pattern[] = "ring";
                                   if (end != std::search(begin,end,pattern,pattern + 4))
                                   {
                                      s.assign(begin,end);
                                      ++s_update_count;
                                   }
                                   return true;
                                }).ref());

   std::cout << "i: " << i << std::endl;
   std::cout << "d: " << d << std::endl;
   std::cout << "s: " << s << std::endl;

   return 0;
}

Pick A Random Line From A Text File

A random line of text is to be selected from a user provided text file comprised of lines of varying length, such that the probability of the line being selected is 1/N where N is the number of lines in the text file. There are many trivial solutions to this problem, however if one were to further constrain the problem by indicating the file is very large (TBs) and that the system upon which the solution is to run is very limited memory-wise, most if not all trivial solutions such as storing indexes of all line offsets, or reading the entire file into memory etc will be eliminated.

However, there exists a very elegant solution to this problem of O(n), O(1) time and space complexities respectively, that involves scanning the entire file once line by line, and at every ith line choosing whether or not to replace the final result line with the current line by sampling a uniform distribution of range [0,1) and performing a replace if and only if the random value is less than 1 / i.

StrTk Line Probability Diagram - Copyright Arash Partow

The logic behind this solution revolves around the fact that the probability of selecting the ith line will be 1/i and as such the total probability for selecting any of the previous i-1 lines will be 1 - (1/i) or (i - 1)/i. Because there are (i - 1) lines before the ith line, we divide the previous sum of probabilities by (i - 1), resulting in a selection probability of 1/i for any one of the lines up to and including the ith line. If the ith line were to be the last line of the text file, this then results in each of the lines having a selection probability of 1/N - simple and sweet, and so too is the following implementation of said solution:

int main(int argc, char* argv[])
{
   std::string file_name = argv[1];
   std::string line;
   std::size_t i = 0;
   strtk::uniform_real_rng rng(static_cast<std::size_t>(::time(0)));
   strtk::for_each_line(file_name,
                        [&](const std::string& s)
                        {
                           if (rng() < (1.0 / ++i))
                           {
                              line = s;
                           }
                        });
   std::cout << line << std::endl;
   return 0;
}

What changes to the above code would be required If the probability of line selection was changed to 1/(N-K) where 0 <= K <= N and K is the number of lines that will be randomly ignored.

Note: TAOCP Volume II section 3.4.2 has an in depth discussion about this problem, which is generally known as reservoir sampling, and other similar problems relating to random distributions. Also one should note that the above example has an inefficiency, in that upon every string replace, an actual string is being copied, normally all one needs to do is maintain a file offset to the beginning of the line, not doing this causes slow downs due to continuous memory allocations/reallocations which is made all the worse when large lines are encountered.

The Buy Low And Sell High Problem

Assume we are given a piece of data in csv format which represents a time-series for the close prices of the SPDR S&P 500 ETF. The time-series ranges from 04/01/1999 to 11/11/2011. The objective is to select two dates, the first being the buy date and the second being the sell date, such that if we were to buy then sell X shares of the ETF we will have maximized our profit. It should be noted that of course the buy date must be before the sell date and that short selling is not an option in this strategy. The following is a chart that represents the price of the SPDR over the given period:

SPDR SP 500 ETF - Copyright Arash Partow

Through visual inspection we can approximate that the best buy date would be towards the end of 2002 and the corresponding sell date would be towards the end of 2007, as these time points seem to give the largest difference between the two prices. However it also seems that a buy roughly at the start of 2009 and a sell at the beginning of 2011 could also provide such a large price difference. As such the visual inspection approach has lead to an ambiguity, hence a more thorough and precise approach is required.

One could take the naive approach to solving the buy low sell high problem, by initially loading the entire time series into memory, then for each ith time point take the pricei and test it against every other pricej where i < j, and maintain a "best profit encountered" structure that contains the best profit so far and the corresponding buy/sell prices and dates. This solution has a few problems, initially it is of O(N2) time complexity and O(N) space complexity. As an example for one million time points it will require one trillion comparisons and one million date/price units of storage. If memory and computational processing was limited on the hardware this solution can not be practically executed in a continuous online manner. Furthermore as suggested by the time -complexity as the size of the data grows, regardless of computational abilities, the compute times for the results would become astronomical and practically useless - specially for real-time reactive systems.

In these types of problems one tries to assess if an online or streaming based solution is feasible. That is a solution that does not require the data to be available all at once, can work on the data incrementally and requires no more than one-pass for each piece of data. Such a solution would typically have a time complexity of O(N) and a space complexity of O(1).

With regards to this problem the crucial insight required to convert the naive solution from O(N2) complexity to an online version of O(N) time complexity, is that every new global minima encountered is the beginning of a new period and an indicator of an end to the previous period. Looking at the chart, if one were to scan from left to right, the intuitive response is to find the point with the lowest price, ignore everything preceding, then try and find the next point with the highest price or global maxima. There are a few edge cases that need to be dealt with. The main one being the problem described above that there are two buy/sell points that could potentially be the solution. The way around this is to simply maintain the best encountered period, and compare the profit from any new period to the best so far, if it is better (more) then replace the best with the current period. Another edge case is when the data is in a continual decline, in a situation like this there will be no profitable buy/sell points. The following is a small subsection of the time-series in question:

Download spy500.csv

15/06/2000,148.16
16/06/2000,146.59
19/06/2000,148.47
20/06/2000,147.94
21/06/2000,147.88
22/06/2000,145.63
23/06/2000,144.38
26/06/2000,146.23
27/06/2000,145.16
28/06/2000,145.56
29/06/2000,144.19

The code below is a very simple single pass incremental solution to the given problem. It reads every line of the input file, parses each line into a date and price variable, checks to see if the current price is less than the current global minima price, if it is the case, it will set the current period start to the current date and set the buy price to be the current price, otherwise it checks to see if the current price is larger than the global maxima price, if that is the case then it updates the current profit, sell price and sell dates accordingly. If at the end, the buy price is not less than the sell price, it is indicative that there exists no two time points within the given time series for which a profitable transaction could occur, otherwise it prints out buy and sell dates for the required transaction and the expected profit per share.

struct period
{
   period()
   : profit(std::numeric_limits<double>::min()),
     buy_price(std::numeric_limits<double>::max()),
     sell_price(std::numeric_limits<double>::min())
   {}
   bool operator>(const period&p) const
   {
      return profit > p.profit;
   }
   double profit;
   double buy_price;
   double sell_price;
   std::string buy_date;
   std::string sell_date;
};

int main()
{
   period best_period;
   period curr_period;

   strtk::for_each_line("spy500.csv",
                        [&best_period,&curr_period]
                        (const std::string& line)
                        {
                           std::string date;
                           double price;
                           if (!strtk::parse(line,",",date,price)) return;
                           if (price < curr_period.buy_price)
                           {
                              if (curr_period > best_period)
                              {
                                 best_period = curr_period;
                              }
                              curr_period.buy_date   = date;
                              curr_period.buy_price  = price;
                              curr_period.sell_price = std::numeric_limits<double>::min();
                              curr_period.sell_date  = "";
                           }
                           else if (price > curr_period.sell_price)
                           {
                              curr_period.sell_price = price;
                              curr_period.sell_date  = date;
                              curr_period.profit     = curr_period.sell_price - curr_period.buy_price;
                           }
                        });

   if (best_period.buy_price >= best_period.sell_price)
   {
      std::cout << "No period in the time-series can be profitably exploited." << std::endl;
   }
   else
   {
      std::cout << "Buy:    " << best_period.buy_date  << std::endl;
      std::cout << "Sell:   " << best_period.sell_date << std::endl;
      std::cout << "Profit per share: $" << best_period.profit << std::endl;
   }

   return 0;
}

For the given data it is expected the above piece of code will produce the following output:

Buy:    09/10/2002
Sell:   09/10/2007
Profit per share: $78.38

What changes to the above piece of code would be required if:

  • Rather than prices, we instead are given percentage differences from the previous price?
  • We have the constraint that we can't hold a position for more than K days
  • A penalty of 1/K% is applied to the profit for every day a position is held? (where K is the condition and value described above)
  • Short selling is allowed?
  • We want to process independent sections of the time series concurrently so as to speed up overall processing time.
  • The prices could be extremely large or small?

Token Grid

StrTk provides a means to easily parse and consume 2D grids of tokens in an efficient and simple manner. A grid is simply defined as a series of rows comprised of tokens, otherwise known as Delimiter Separated Values (DSV). The ith token of a row is grouped with every ith token of all other rows to produce a column. Tokens can be processed as either rows or columns.

An example of a simple token grid, where each numeric value is deemed to be a token:

1.12.23.34.45.5
1.12.23.34.45.5
1.12.23.34.45.5
1.12.23.34.45.5
1.12.23.34.45.5

A token grid can be either passed in via a file or a string buffer. The delimiters to be used for parsing the columns and rows can also be provided, if not provided standard common defaults will be used.

The following demonstrates how each cell in the grid can be access and cast to a specific type.

std::string data = "1,2,3,4,5,6\n"
                   "7,8,9,0,1,2\n"
                   "3,4,5,6,7,8\n"
                   "9,0,1,2,3,4\n"
                   "5,6,7,8,9,0\n";

strtk::token_grid grid(data,data.size(),",");

for (std::size_t r = 0; r < grid.row_count(); ++r)
{
   strtk::token_grid::row_type row = grid.row(r);
   for (std::size_t c = 0; c < row.size(); ++c)
   {
      std::cout << grid.get<int>(r,c) << '\t';
   }
   std::cout << std::endl;
}

The strtk::token_grid provides various helper functions for traversing rows and columns in batch mode. The functions are namely: for_each_row that is used for iterating either all or a sub-range of rows of the token_grid, and for_each_column that is used for iterating either all or a sub-range of columns of a row.

int main()
{
              //column  0   1   2   3   4   5   6
   std::string data = "1.1,2.1,3.1,4.1,5.1,6.1,7.1\n"  //row_0
                      "1.2,2.2,3.2,4.2,5.2,6.2,7.2\n"  //row_1
                      "1.3,2.3,3.3,4.3,5.3,6.3,7.3\n"  //row_2
                      "1.4,2.4,3.4,4.4,5.4,6.4,7.4\n"  //row_3
                      "1.5,2.5,3.5,4.5,5.5,6.5,7.5\n"  //row_4
                      "1.6,2.6,3.6,4.6,5.6,6.6,7.6\n"  //row_5
                      "1.7,2.7,3.7,4.7,5.7,6.7,7.7\n"  //row_6
                      "1.8,2.8,3.8,4.8,5.8,6.8,7.8\n"  //row_7
                      "1.9,2.9,3.9,4.9,5.9,6.9,7.9\n"; //row_8

   strtk::token_grid grid(data,data.size(),",");

   {
      // Process each row of the token grid
      grid.for_each_row(
         [](const strtk::token_grid::row_type& row)
         {
            for (std::size_t i = 0; i < row.size(); ++i)
            {
               std::cout << row.get<double>(i) << '\t';
            }
            std::cout << '\n';
         });
   }

   {
      // Process rows in the range [2,6] of the token grid
      grid.for_each_row(grid.range(2,6),
         [](const strtk::token_grid::row_type& row)
         {
            for (std::size_t i = 0; i < row.size(); ++i)
            {
               std::cout << row.get<double>(i) << '\t';
            }
            std::cout << '\n';
         });
   }

   {
      // Process each row and column of the token grid
      grid.for_each_row(
         [](const strtk::token_grid::row_type& row)
         {
            row.for_each_column(
               [](const strtk::token_grid::row_type::range_type& range)
               {
                  std::cout << std::string(range.first,range.second) << '\t';
               });
            std::cout << '\n';
         });
   }

   {
      // Process rows in the range [2,6] and the columns in the range [1,5] of the token grid
      grid.for_each_row(grid.range(2,6),
         [](const strtk::token_grid::row_type& row)
         {
            row.for_each_column(row.range(1,5),
               [](const strtk::token_grid::row_type::range_type& range)
               {
                  std::cout << std::string(range.first,range.second) << '\t';
               });
            std::cout << '\n';
         });
   }

   return 0;
}

The following example demonstrates how averages over rows and columns of a token grid can be computed:

std::string data = "1.1,1.1,1.1,1.1,1.1,1.1\n"
                   "2.2,2.2,2.2,2.2,2.2,2.2\n"
                   "3.3,3.3,3.3,3.3,3.3,3.3\n"
                   "4.4,4.4,4.4,4.4,4.4,4.4\n"
                   "5.5,5.5,5.5,5.5,5.5,5.5\n"
                   "6.6,6.6,6.6,6.6,6.6,6.6\n";

strtk::token_grid grid(data,data.size(),",");

std::vector<double> avg_c(grid.row(0).size(),0.0);
std::vector<double> avg_r(grid.row_count(),0.0);
std::vector<double> tmp(grid.row(0).size(),0.0);
std::fill(avg_c.begin(),avg_c.end(),0.0);

for (std::size_t i = 0; i < grid.row_count(); ++i)
{
   grid.row(i).parse<double>(tmp.begin());
   std::transform(avg_c.begin(),avg_c.end(),tmp.begin(),avg_c.begin(),std::plus<double>());
   avg_r[i] = std::accumulate(tmp.begin(),tmp.end(),0.0) / tmp.size();
}

for (std::size_t i = 0; i < avg_c.size(); avg_c[i++] /= grid.row_count());

std::cout << "Column Averages:\t";
std::copy(avg_c.begin(),avg_c.end(),std::ostream_iterator<double>(std::cout,'\t'));
std::cout << '\n';

std::cout << "Row Averages:\t";
std::copy(avg_r.begin(),avg_r.end(),std::ostream_iterator<double>(std::cout,'\t'));
std::cout << '\n';
</double></double></double></double>

Processing Of Comma Separated Values Data

The original intent of the token grid was to support fast and efficient processing of simple data tuples, such as comma separated values (CSV) formats et. al. The following example demonstrates a simple summation of traded floor volume and average daily volume based on NASDAQ OHLC (Open-High-Low-Close) data.

                          //Date,Symbol,Open,Close,High,Low,Volume
std::string market_data = "20090701,GOOG,424.2000,418.9900,426.4000,418.1500,2310768\n"
                          "20090701,MSFT,24.0500,24.0400,24.3000,23.9600,54915127\n"
                          "20090702,GOOG,415.4100,408.4900,415.4100,406.8100,2517630\n"
                          "20090702,MSFT,23.7600,23.3700,24.0400,23.2100,65427699\n"
                          "20090703,GOOG,408.4900,408.4900,408.4900,408.4900,0\n"
                          "20090703,MSFT,23.3700,23.3700,23.3700,23.3700,0\n"
                          "20090706,GOOG,406.5000,409.6100,410.6400,401.6600,2262557\n"
                          "20090706,MSFT,23.2100,23.2000,23.2800,22.8700,49207638\n"
                          "20090707,GOOG,408.2400,396.6300,409.1900,395.9801,3260307\n"
                          "20090707,MSFT,23.0800,22.5300,23.1400,22.4600,52842412\n"
                          "20090708,GOOG,400.0000,402.4900,406.0000,398.0600,3441854\n"
                          "20090708,MSFT,22.3100,22.5600,22.6900,2200000,73023306\n"
                          "20090709,GOOG,406.1200,410.3900,414.4500,405.8000,3275816\n"
                          "20090709,MSFT,22.6500,22.4400,22.8100,22.3700,46981174\n"
                          "20090710,GOOG,409.5700,414.4000,417.3700,408.7000,2929559\n"
                          "20090710,MSFT,22.1900,22.3900,22.5400,22.1500,43238698\n";

strtk::token_grid grid(market_data,market_data.size(),",");

struct stock_info
{
   stock_info(const std::string& s = " ")
   : symbol(s),
     total_volume(0),
     day_count(0),
     average_daily_volume(0.0)
   {}

   std::string symbol;
   unsigned long long total_volume;
   std::size_t day_count;
   double average_daily_volume;
};

stock_info goog("GOOG");
stock_info msft("MSFT");

static const std::size_t volume_column = 6;
static const std::size_t symbol_column = 1;

goog.day_count = grid.accumulate_column(volume_column,
                                        [](const strtk::token_grid::row_type& row) -> bool
                                        {
                                           static const std::string google_symbol("GOOG");
                                           return google_symbol == row.get<std::string>(symbol_column);
                                        },
                                        goog.total_volume);

msft.day_count = grid.accumulate_column(volume_column,
                                        [](const strtk::token_grid::row_type& row) -> bool
                                        {
                                           static const std::string microsoft_symbol("MSFT");
                                           return microsoft_symbol == row.get<std::string>(symbol_column);
                                        },
                                        msft.total_volume);

goog.average_daily_volume = (1.0 * goog.total_volume) / goog.day_count;
msft.average_daily_volume = (1.0 * msft.total_volume) / msft.day_count;

std::cout << "[GOOG] Total Volume: " << goog.total_volume << std::endl;
std::cout << "[MSFT] Total Volume: " << msft.total_volume << std::endl;

std::cout << "[GOOG] ADV: " << goog.average_daily_volume << std::endl;
std::cout << "[MSFT] ADV: " << msft.average_daily_volume << std::endl;

The strtk::token_grid is thread-safe iff read operations are in play. As such the above calls to accumulate_column et al. can all be safely and easily executed concurrently using threads. This allows for a far more efficient data processing methodology.

TIMTOWTDI

Playing devil's advocate, another way of performing the above processing task, assuming only the specific values for computing the ADV are required and no further processing of the CSV data is needed, then the problem can be solved efficiently by utilizing a single pass of the data as follows:

std::string file_name = "market_data.csv";

std::unordered_map<std::string,stock_info> stock_map;

stock_map.insert(std::make_pair<std::string,stock_info>("GOOG",stock_info("GOOG")));
stock_map.insert(std::make_pair<std::string,stock_info>("MSFT",stock_info("MSFT")));

strtk::for_each_line(file_name,
                     [&](const std::string& line)
                     {
                        strtk::ignore_token ignore;
                        stock_info temp;
                        const bool result = strtk::parse(line,
                                                         ",",
                                                         ignore,
                                                         temp.symbol,
                                                         ignore,
                                                         ignore,
                                                         ignore,
                                                         ignore,
                                                         temp.total_volume);
                        if (!result) return;
                        auto itr = stock_map.find(temp.symbol);
                        if (stock_map.end() == itr) return;
                        (*itr).second.total_volume += temp.total_volume;
                        (*itr).second.day_count++;
                     });

auto itr = stock_map.begin();
auto end = stock_map.end();

while (end != itr)
{
   stock_info& stock = (*itr++).second;
   stock.average_daily_volume = (1.0 * stock.total_volume) / stock.day_count;
}

TIMTOWTDI - II (with a vengeance)

Playing the devil's other advocate, the above two examples, have both required that the filter condition be explicitly defined at compile time. However even though the condition maybe be set in stone at compile time, some of the underlyings (such as symbol) can be engineered to be modified at run-time. That still doesn't give us the freedom to perform arbitrarily complex filter expressions determined at run-time.

That said, an extremely efficient and very simple solution is at hand. The solution is called the C++ DSV Filter library, it is based on StrTK and ExprTk libraries. It uses the strtk::token_grid as a CSV/DSV store and index, and ExprTk as the underlying expression evaluation engine. The example below takes the OHLC market data table defined above and performs a row-wise query. The expression's definition is:

select all rows where the open price is greater than the close price and the symbol matches the wild-card pattern of "*FT*" and the date is equal to or after 20090101.

int main()
{
   std::string file_name = "market_data.csv";
   dsv_filter filter;
   filter.set_input_delimiter(",");
   if (!filter.load(file_name))
      return 1;
   std::string expression = "(open > close) and (symbol like '*FT*') and (date >= '20090101')";
   filter.add_filter(filter_expression);
   for (std::size_t row = 1; row < filter.row_count(); ++row)
   {
      if (dsv_filter::e_match == filter[row])
      {
          // do something...
      }
   }
   return 0;
}

Other example queries:

  • volume >= 1000000 and symbol == 'GOOG'
  • abs(open - close) > abs(high - low)
  • avg(open,close,high,low) * volume > 10^7 and inrange('20090702',date,'20090730')
  • (open > close) and (symbol like '*FT*') and (date >= '20090101')

It should be noted that in the example above, the rows begin at index 1. That is done because the dsv_filter expects the first row or row at index 0 to be a column definition header. The format of the column definitions is to simply add a suffix of "_s" if the values in the column are to be treated as strings or "_n" if they are to be treated as numbers. When defining expressions the suffixes should not be included when including the column names. The above mentioned OHLC csv file's header would be as follows:

  Date_s,Symbol_s,Open_n,Close_n,High_n,Low_n,Volume_n

C++ DSV Filter and Dependencies

Sequential Partitions

A typical operation carried out upon time-series data is to group tuples into buckets (or bins) based upon the time index value. For example grouping data into 2-minute buckets and then performing some kind of operation upon the grouped tuples such as a summation or an average etc. This process is sometimes also called: "discretization"

The strtk::token_grid class provides a method known as sequential_partition. The sequential_partition method requires a Transition Predicate, a Function and optionally a row-range. The Transition Predicate consumes a row and returns true only if the row is in the next partition. All other subsequent consecutive rows until the transition predicate returns a true are said to be in the current partition. Prior to transitioning to a new partition, the function predicate is invoked and provided with the range of rows in the current partition.

The following example takes a simple time-series (time and value), partitions the tuples into groups of Time-Buckets of period length 3 and then proceeds to compute the total sum of each group. The below summarizer class provides provides a Transition Predicate and Function.

StrTk Sequential Partition - Copyright Arash Partow

class summarizer
{
public:

   enum column_index
   {
      tick_time_column  = 0,
      tick_value_column = 1
   };

   summarizer(std::deque<double>& sum_value)
   : next_tick_time_(0),
     sum_value_(sum_value)
   {}

   // Transition Predicate
   inline bool operator()(const strtk::token_grid::row_type& row)
   {
      if (row.get<std::size_t>(tick_time_column) >= next_tick_time_)
      {
         next_tick_time_ = row.get<std::size_t>(tick_time_column) + 3;
         return true;
      }
      else
         return false;
   }

   // Function
   inline bool operator()(const strtk::token_grid& grid,
                          const strtk::token_grid::row_range_type& range)
   {
      double bucket_sum = 0.0;
      if (!grid.accumulate_column(tick_value_column,range,bucket_sum))
      {
         std::cout << "failed to accumulate!" << std::endl;
         return false;
      }
      else
         sum_value_.push_back(bucket_sum);
      return true;
   }

private:

   summarizer& operator=(const summarizer&);

   std::size_t next_tick_time_;
   std::deque<double>& sum_value_;
};

int main()
{
                      //time index, value
   std::string data = "10000,123.456\n"
                      "10001,612.345\n"
                      "10002,561.234\n"
                      "10003,456.123\n"
                      "10004,345.612\n"
                      "10005,234.561\n"
                      "10006,123.456\n";

   strtk::token_grid grid(data,data.size(),",");
   std::deque<double> sum_value;
   summarizer s(sum_value);
   grid.sequential_partition(s,s);
   for (std::size_t i = 0; i < sum_value.size(); ++i)
   {
      std::cout << "bucket[" << i << "] = " << sum_value[i] << std::endl;
   }
   return 0;
}

The expected output is as follows:

  bucket[0] = 1297.035
  bucket[1] = 1036.296
  bucket[2] = 123.456

Parsing CSV Files With Embedded Double-Quotes

One of the simple extensions to the CSV format is the concept of double quoted tokens. Such tokens may contain column or row delimiters. When such a scenario is encountered, all subsequent delimiters are ignored, and kept as part of the token, until the corresponding closing double quote is encountered. The StrTk token_grid supports the parsing of such tokens. This parsing mode can be easily activated via the token_grid option set. Below is an example of a token_grid loading a CSV data set representing various airports from around the world and their specific codes and locations, in which some of the cells are double-quoted:

ICAOIATAAirportCityCountry
AYGAGKA"Goroka Gatue"GorokaPapua New Guinea
BGCOGCO"Nerlerit Inaat Constable Pynt""Nerlerit Inaat"Greenland
BZGDZGDGodleyAucklandNew Zealand
CYQMYQM"Greater Moncton International"MonctonCanada
EDRKZNV"Koblenz Winningen"KoblenzGermany
FAHUAHU"HMS Bastard Memorial"Kwazulu-NatalSouth Africa
FQMPMZB"Mocimboa Da Praia""Mocimboa Da Praia"Mozambique
KINSINS"Indian Springs AF AUX"Indian SpringsUSA
UHNNHNNNikolaevsk"Nikolaevsk Na Amure"Russia
WBKKBKI"Kota Kinabalu International"Kota KinabaluMalaysia
ZSJDJDZ"Jingdezhen Airport"JingdezhenChina

The following is the StrTk code example using token_grid to parse the above CSV data set:

int main()
{
   std::string airport_data_file_name = "airport_data.csv";

   strtk::token_grid::options options;
   options.column_delimiters = "| ,";
   options.support_dquotes = true;

   strtk::token_grid airport_grid(airport_data_file_name,options);

   // for each row r, for each column c, print cell[r,c]
   for (std::size_t r = 0; r < airport_grid.row_count(); ++r)
   {
      strtk::token_grid::row_type row = airport_grid.row(r);
      for (std::size_t c = 0; c < row.size(); ++c)
      {
         std::cout << "[" << row.get<std::string>(c) << "] ";
      }
      std::cout << std::endl;
   }
   return 0;
}
</std::string>

Extending Delimiter Predicates

As previously mentioned the concept of a delimiter based predicate can lead to some very interesting solutions. A predicate as has been defined so far, with the exception of the offset predicate, has been a stateless entity. Adding the ability to maintain a state based on what the predicate has encountered so far can allow it to behave differently from the simple single and multiple delimiter predicates.

For this example, lets assume a typical command line parsing problem which consists of double quotation mark groupings and escapable special characters, which can be considered being dual use as either delimiters or data. An example input and output is as follows:

Inputs

Dataabc;"123, mno xyz",i\,jk
Delimiters<space>;,.

Output Tokens

Token0abc
Token1123, mno xyz
Token2i\,jk

In order to tokenize the above described string, one can create a composite predicate using a multiple char delimiter predicate and some simple state rules. The following is an example of such an extended predicate:

class extended_predicate
{
public:

   extended_predicate(const std::string& delimiters)
   : escape_(false),
     in_bracket_range_(false),
     mdp_(delimiters)
   {}

   inline bool operator()(const unsigned char c) const
   {
      if (escape_)
      {
         escape_ = false;
         return false;
      }
      else if ('\\' == c)
      {
         escape_ = true;
         return false;
      }
      else if ('"' == c)
      {
         in_bracket_range_ = !in_bracket_range_;
         return true;
      }
      else if (in_bracket_range_)
         return false;
      else
         return mdp_(c);
   }

   inline void reset()
   {
      escape_ = false;
      in_bracket_range_ = false;
   }

private:

   mutable bool escape_;
   mutable bool in_bracket_range_;
   mutable strtk::multiple_char_delimiter_predicate mdp_;
};

Usage of the newly defined extended predicate is as follows:

int main()
{
   std::string str = "abc;\"123, mno xyz\",i\\,jk";
   strtk::std_string::token_list_type token_list;
   strtk::split(extended_predicate(".,; "),
                str,
                std::back_inserter(token_list),
                strtk::split_options::compress_delimiters);
   return 0;
}

High Performance Key-Value Parsing

Taking our previous person struct as an example. It is clear that the tuple format has to be very specific with regards to the ordering of data. For most situations this is acceptable as the serializers and deserializers attempt to function in the simplest manner possible, however sometimes pieces of information may come in different orderings, or may be deemed optional and hence not be present in a particular tuple of data.

An approach is required that provides the means of mapping specific pieces of data to the corresponding variables (or members) that will store or make sense of them, in a very efficient and simple manner. For clarity purposes, the term "mapping" here means to populate the desired member with the given data. The association between the two being made with the key that is paired in-line with the data. Key-Value pairs or some times known as Attribute-Value pairs are one of the means by which such an association can be accomplished. The following is a diagram that demonstrates the mapping of various fields of a data struct to their corresponding data elements in a tuple comprised of key-value pairs.

StrTk Key Value Pairs - Copyright Arash Partow

In the diagram above, the key-value pairs are separated (delimited) by the pipe symbol "|". With regards to the key-value pairs themselves, the key is traditionally the first element in the pair and the value is the second element, they are separated in this case by a single equal sign "=".

Back to our original problem relating to the person struct. If we were to add keys to each of the pieces of data, we could then not only parse a tuple of data representing the various fields of the person struct, but those fields could be any order within the tuple. Furthermore any of the fields could be deemed optional, hence not necessarily be present in the tuple. It should be noted that what would denote a correct or successful parse of tuple may not only depend upon successful parsing of ranges into the various types, but it may also depend upon the mandatory presence of certain fields.

Our objective will be to successfully and in the most efficient way possible parse the following list of tuples that represent instances of our person struct.

Tuple0 = "ID=0xFA37ED12|NAME=Rumpelstiltskin|AGE=397|HEIGHT=1.31|WEIGHT=58.7"
Tuple1 = "NAME=Rumpelstiltskin|AGE=397|HEIGHT=1.31|WEIGHT=58.7|ID=0xFA37ED12"
Tuple2 = "AGE=397|HEIGHT=1.31|WEIGHT=58.7|ID=0xFA37ED12|NAME=Rumpelstiltskin"
Tuple3 = "HEIGHT=1.31|WEIGHT=58.7|ID=0xFA37ED12|NAME=Rumpelstiltskin|AGE=397"
Tuple4 = "WEIGHT=58.7|ID=0xFA37ED12|NAME=Rumpelstiltskin|AGE=397|HEIGHT=1.31"
Tuple5 = "ID=0xFA37ED12|WEIGHT=58.7|AGE=397|NAME=Rumpelstiltskin|HEIGHT=1.31"
Tuple6 = "WEIGHT=58.7|AGE=397|NAME=Rumpelstiltskin|HEIGHT=1.31|ID=0xFA37ED12"
Tuple7 = "AGE=397|NAME=Rumpelstiltskin|HEIGHT=1.31|ID=0xFA37ED12|WEIGHT=58.7"
Tuple8 = "NAME=Rumpelstiltskin|HEIGHT=1.31|ID=0xFA37ED12|WEIGHT=58.7|AGE=397"
Tuple9 = "HEIGHT=1.31|ID=0xFA37ED12|WEIGHT=58.7|AGE=397|NAME=Rumpelstiltskin"

StrTk provides a means to achieve the above key-value pair parsing task, namely via the strtk::keyvalue::parser and associated key-mappers. The following is a table that depicts the various kind of key to value mappers that are available in the StrTk library:

Key To Value Mappers

MapperTypeKey Lookup ComplexityMaximum Size
strtk::keyvalue::stringkey_mapstd::stringO(log(n))Limited to available memory
strtk::keyvalue::uintkey_mapcardinal valueO(1)Limited to expected maximum key value

In the code below we initially begin by defining the delimiters we expect to see between pairs of key-values and in between the key and value pairs themselves. In the example below the pair_block_delimiter field denotes the delimiter we expect between pairs (or blocks) of key-values and the field pair_delimiter denotes the delimiter we expect between a particular key and value.

Next we define p as an instance of the person struct and register each of the members of the instance p with a corresponding key with the keyvalue::parser. After which we then process each tuple of data, parsing the tuple, populating the instance p, then printing out the various fields to stdout.

struct person
{
   unsigned int id;  // key = ID
   std::string name; // key = NAME
   unsigned int age; // key = AGE
   double height     // key = HEIGHT
   float weight;     // key = WEIGHT
};

int main()
{
   const std::string person_data[] =
            {
               "ID=0xFA37ED12|NAME=Rumpelstiltskin|AGE=397|HEIGHT=1.31|WEIGHT=58.7",
               "NAME=Rumpelstiltskin|AGE=397|HEIGHT=1.31|WEIGHT=58.7|ID=0xFA37ED12",
               "AGE=397|HEIGHT=1.31|WEIGHT=58.7|ID=0xFA37ED12|NAME=Rumpelstiltskin",
               "HEIGHT=1.31|WEIGHT=58.7|ID=0xFA37ED12|NAME=Rumpelstiltskin|AGE=397",
               "WEIGHT=58.7|ID=0xFA37ED12|NAME=Rumpelstiltskin|AGE=397|HEIGHT=1.31",
               "ID=0xFA37ED12|WEIGHT=58.7|AGE=397|NAME=Rumpelstiltskin|HEIGHT=1.31",
               "WEIGHT=58.7|AGE=397|NAME=Rumpelstiltskin|HEIGHT=1.31|ID=0xFA37ED12",
               "AGE=397|NAME=Rumpelstiltskin|HEIGHT=1.31|ID=0xFA37ED12|WEIGHT=58.7",
               "NAME=Rumpelstiltskin|HEIGHT=1.31|ID=0xFA37ED12|WEIGHT=58.7|AGE=397",
               "HEIGHT=1.31|ID=0xFA37ED12|WEIGHT=58.7|AGE=397|NAME=Rumpelstiltskin",
            };

   const std::size_t person_data_size = sizeof(person_data) / sizeof(std::string);

   //Basic type definition
   typedef unsigned char char_type;
   typedef strtk::keyvalue::parser<strtk::keyvalue::stringkey_map> kvp_type;
   typedef strtk::keyvalue::options<char_type> opts_type;

   //Setup the various delimiters
   opts_type options;
   options.pair_block_delimiter = '|';
   options.pair_delimiter       = '=';

   kvp_type kvp(options);

   person p;

   strtk::hex_to_number_sink<unsigned int> h2ns(p.id);

   //Define the mapping between the key and the
   //members of the person struct
   kvp.register_keyvalue("ID"    ,     h2ns);
   kvp.register_keyvalue("NAME"  , p.  name);
   kvp.register_keyvalue("AGE"   , p.   age);
   kvp.register_keyvalue("HEIGHT", p.height);
   kvp.register_keyvalue("WEIGHT", p.weight);

   //Go through each tuple
   for (std::size_t i = 0; i < person_data_size; ++i)
   {
      //If parsing of the ith tuple is successful
      //print the person struct out.
      if (kvp(person_data[i]))
      {
         std::cout << i                      << '\t'
                   << "ID: "     << p.id     << '\t'
                   << "Name: "   << p.name   << '\t'
                   << "AGE: "    << p.age    << '\t'
                   << "HEIGHT: " << p.height << '\t'
                   << "WEIGHT: " << p.weight << '\n';
     }
   }

   return 0;
}

It should be noted that the underlying associative container for the strtk::keyvalue::stringkey_map can be explicitly specified at compile time. By default it is std::map, however it can be easily changed to std::unordered_map giving it a key lookup complexity of O(1) or in fact any other container that is compatible with STL associative map semantics. The following are three examples of keyvalue_parsers specialized using std::unordered_map, the first being based on the std::string type, the second being based on the int type and the third being based on the double type:

int main()
{
   // std::string type key to value mapper based on std::unordered_map
   {
      //Basic type definition
      typedef unsigned char char_type;
      typedef std::unordered_map<std::string,strtk::util::value> key_to_val_map_t;
      typedef strtk::keyvalue::key_map<std::string,key_to_val_map_t> stringhashkey_mapper;
      typedef strtk::keyvalue::parser<stringhashkey_mapper> kvp_type;
      typedef strtk::keyvalue::options<char_type> opts_type;

      //Setup the various delimiters
      opts_type options;
      options.pair_block_delimiter = '|';
      options.pair_delimiter       = '=';

      kvp_type kvp(options);
   }

   // int type key to value mapper based on std::unordered_map
   {
      //Basic type definition
      typedef unsigned char char_type;
      typedef std::unordered_map<int,strtk::util::value> key_to_val_map_t;
      typedef strtk::keyvalue::key_map<int,key_to_val_map_t> inthash_key_mapper;
      typedef strtk::keyvalue::parser<inthash_key_mapper> kvp_type;
      typedef strtk::keyvalue::options<char_type> opts_type;

      //Setup the various delimiters
      opts_type options;
      options.pair_block_delimiter = '|';
      options.pair_delimiter       = '=';

      kvp_type kvp(options);
   }

   // double type key to value mapper based on std::unordered_map
   {
      //Basic type definition
      typedef unsigned char char_type;
      typedef std::unordered_map<double,strtk::util::value> key_to_val_map_t;
      typedef strtk::keyvalue::key_map<double,key_to_val_map_t> doublehash_key_mapper;
      typedef strtk::keyvalue::parser<doublehash_key_mapper> kvp_type;
      typedef strtk::keyvalue::options<char_type> opts_type;

      //Setup the various delimiters
      opts_type options;
      options.pair_block_delimiter = '|';
      options.pair_delimiter       = '=';

      kvp_type kvp(options);
   }

   return 0;
}

Key-Value Pairs And Multiple Distinct Data Structures

To further the previous example, one might have a situation where a tuple contains values for different members of different types. The following example demonstrates parsing of key-value pairs that map to members of multiple types. The tuple in the example consists of values some of which are intended for the instance of data1 and others which are intended for the instance of data2.

struct data1
{
   int x0;
   std::string x1;
   double x2;
   float x3;
}

struct data2
{
   int y0;
   std::string y1;
   double y2;
   float y3;
}

int main()
{
   typedef unsigned char char_type;
   typedef strtk::keyvalue::parser<strtk::keyvalue::stringkey_map> kvp_type;
   typedef strtk::keyvalue::options<char_type> opts_type;

   opts_type options;
   options.pair_block_delimiter = '|';
   options.pair_delimiter       = '=';

   kvp_type kvp(options);

   data1 d1;
   data2 d2;

   std::string data = "D1X0=123456789|D1X1=AbCdEfG123456|D1X2=123.456|D1X3=0101.0202f|"
                      "D2Y0=987654321|D2Y1=tUvWxYz789012|D2Y2=789.012|D2Y3=0707.0909f";

   //Register values for data1 type
   kvp.register_keyvalue("D1X0", d1.x0);
   kvp.register_keyvalue("D1X1", d1.x1);
   kvp.register_keyvalue("D1X2", d1.x2);
   kvp.register_keyvalue("D1X3", d1.x3);

   //Register values for data2 type
   kvp.register_keyvalue("D2Y0", d2.y0);
   kvp.register_keyvalue("D2Y1", d2.y1);
   kvp.register_keyvalue("D2Y2", d2.y2);
   kvp.register_keyvalue("D2Y3", d2.y3);

   //Parse the tuple
   kvp(data);

   std::cout << "D1X0: " << d1.x0 << '\t'
             << "D1X1: " << d1.x1 << '\t'
             << "D1X2: " << d1.x2 << '\t'
             << "D1X3: " << d1.x3 << '\n';

   std::cout << "D2Y0: " << d2.y0 << '\t'
             << "D2Y1: " << d2.y1 << '\t'
             << "D2Y2: " << d2.y2 << '\t'
             << "D2Y3: " << d2.y3 << '\n';

   return 0;
}

Key-Value Pairs And Lists

The values in the key-value pairs need not always be singular. In some scenarios, a value could be a list of values. The StrTk key-value parser supports parsing of such key-value pairs through the previously demonstrated sequence sink mechanisms.

In the following example we have a complex_data struct that consists of some PODs but also a number of sequences (vector,deque,list). The tuple to be parsed consists of simple key-value pairs for the PODs and more complex looking pairs where the values for the specific sequence of values are separated by commas ",".

struct complex_data
{
   unsigned int           v0;
   std::vector<int>       v1;
   double                 v2;
   std::deque<double>     v3;
   std::string            v4;
   std::list<std::string> v5;
};

int main()
{
   typedef unsigned char char_type;
   typedef strtk::keyvalue::parser<strtk::keyvalue::stringkey_map> kvp_type;
   typedef strtk::keyvalue::options<char_type> opts_type;

   opts_type options;

   options.pair_block_delimiter = '|';
   options.pair_delimiter       = '=';

   kvp_type kvp(options);

   complex_data cd;

   std::string data = "V0=123456789|V1=-3,-2,-1,0,+1,2+3|V2=111.222|V3=1.1,2.2,3.3,4.4,5.5|"
                      "V4=Some Text|V5=Text1,Text2,Text3,Text4";

   strtk::vector_sink<int>::type vec_sink(",");
   strtk::deque_sink<double>::type deq_sink(",");
   strtk::list_sink<std::string>::type lst_sink(",");

   //Register values for data1 type
   kvp.register_keyvalue("V0", cd.v0);
   kvp.register_keyvalue("V1", vec_sink(cd.v1));
   kvp.register_keyvalue("V2", cd.v2);
   kvp.register_keyvalue("V3", deq_sink(cd.v3));
   kvp.register_keyvalue("V4", cd.v4);
   kvp.register_keyvalue("V5", lst_sink(cd.v5));

   kvp(data);

   std::cout << "V0: " << cd.v0 << '\n'
             << "V1: " << strtk::join(" ",cd.v1) << '\n'
             << "V2: " << cd.v2 << '\n'
             << "V3: " << strtk::join(" ",cd.v3) << '\n'
             << "V4: " << cd.v4 << '\n'
             << "V5: " << strtk::join(" ",cd.v5) << '\n';

   return 0;
}

Key-Value Pairs With Cardinal Keys

So far all the examples have assumed keys of arbitrary values. In some situations such as the FIX protocol, the keys are always guaranteed to be positive integer values. If this is the case then a different kind of key-mapper can be used that is much more efficient than the general purpose string key-mapper, providing a key lookup complexity of O(1). As such StrTk provides the strtk::keyvalue::uintkey_map type for this purpose. The only difference in terms of setting up the uintkey_map from the stringkey_map is that it requires a key_count to be set. This value represents the largest possible key value that can exist. The following is an example of how the strtk::keyvalue::uintkey_map key-mapper can be used.

struct data_store
{
   char            c; // key = 121
   unsigned char  uc; // key = 122
   short           s; // key = 123
   unsigned short us; // key = 124
   int             i; // key = 125
   std::string   str; // key = 126
};

int main()
{
   const std::string data = "121=A|122=z|123=-123|124=456|125=-12345678|126=Some simple text";

   typedef strtk::keyvalue::parser<strtk::keyvalue::uintkey_map> kvp_type;

   strtk::keyvalue::uintkey_map::options options;

   options.key_count            = 127; //[0,126] --> 127 keys
   options.pair_block_delimiter = '|';
   options.pair_delimiter       = '=';

   kvp_type kvp(options);

   data_store ds;

   kvp.register_keyvalue(121,ds.  c);
   kvp.register_keyvalue(122,ds. uc);
   kvp.register_keyvalue(123,ds.  s);
   kvp.register_keyvalue(124,ds. us);
   kvp.register_keyvalue(125,ds.  i);
   kvp.register_keyvalue(126,ds.str);

   kvp(data);

   return 0;
}

Optional Key-Value Pairs

As previously mentioned there is a use-case where certain fields may be deemed optional and hence their absence would not constitute a parsing error. That said, it would also be beneficial to know if a particular field has been populated or not once the sequence of key value pairs has been completely parsed and mapped. StrTk provides such functionality though the use of the strtk::util::attribute type.

The strtk::util::attribute acts as a proxy for the underlying type which it is specialized upon. It provides a conversion cast to the underlying type, and also maintains an 'initialised' state value that can be used to query the attribute about the underlying type's initialised status. The following is an example of how one could use the strtk::util::attribute type in conjunction with the keyvalue::parser:

struct data_store
{
   strtk::util::attribute<int> d1;
   strtk::util::attribute<unsigned int> d2;
   strtk::util::attribute<double> d3;
   strtk::util::attribute<float> d4;
   strtk::util::attribute<std::string> d5;
};

int main()
{
   std::string data = "INT=-1234|UINT=+5678|DOUBLE=1234.5678|FLOAT=90123.4567f|STRING=Some simple text";

   typedef unsigned char char_type;
   typedef strtk::keyvalue::parser<strtk::keyvalue::stringkey_map> kvp_type;
   typedef strtk::keyvalue::options<char_type> opts_type;

   opts_type options;

   options.pair_block_delimiter = '|';
   options.pair_delimiter       = '=';

   kvp_type kvp(options);

   data_store ds;

   d1.initialised() = false;
   d2.initialised() = false;
   d3.initialised() = false;
   d4.initialised() = false;
   d5.initialised() = false;

   kvp.register_keyvalue("INT"   , ds.d1);
   kvp.register_keyvalue("UINT"  , ds.d2);
   kvp.register_keyvalue("DOUBLE", ds.d3);
   kvp.register_keyvalue("FLOAT" , ds.d4);
   kvp.register_keyvalue("STRING", ds.d5);

   if (!kvp(data))
   {
      std::cout << "Failed to parse key-value data: " << data << std::endl;
      return 1;
   }

   if (!d1.initialised()) { std::cout << "d1 has not been initialised." << std::endl; }
   if (!d2.initialised()) { std::cout << "d2 has not been initialised." << std::endl; }
   if (!d3.initialised()) { std::cout << "d3 has not been initialised." << std::endl; }
   if (!d4.initialised()) { std::cout << "d4 has not been initialised." << std::endl; }
   if (!d5.initialised()) { std::cout << "d5 has not been initialised." << std::endl; }

   return 0;
}

Semantic Actions with Key-Value Pairs

There may be times that when key-value pairs are being parsed certain actions need to be executed or behaviours exhibited in-situ with the parsing process. As previously mentioned, StrTk provides the type semantic_action that can act as a proxy for a generic type during the parsing process that also takes a functor or lambda and executes it at the conversion call. The following example demonstrates the parsing of an array of tuples comprised of key-value pairs that map to members of a struct namely data_store. The keys 111, 222 and 333 each represent a specific value type, they also require a certain behavior to be exhibited. In this example, for simplicity, as the values of the various keys are being parsed, a simple message will be printed to the console denoting the nature of the parsing process. The code is as follows:

struct data_store
{
   data_store()
   : i1(0),
     d1(0.0),
     s1("")
   {}

   int         i1; // key = 111
   double      d1; // key = 222
   std::string s1; // key = 333

};

int main()
{
   static const std::string data[] =
                      {
                         "111=-12345|222=1987.654321|333=An interesting string0", //Tuple0
                         "222=2987.654321|333=An interesting string1|111=+12345", //Tuple1
                         "333=An interesting string2|111=-12345|222=3987.654321", //Tuple2
                         "222=4987.654321|111=+12345|333=An interesting string3", //Tuple3
                         "333=An interesting string4|222=5987.654321|111=-12345", //Tuple4
                         "111=+12345|333=An interesting string5|222=6987.654321", //Tuple5
                      };

   static const std::size_t data_size = sizeof(data) / sizeof(std::string);

   typedef strtk::keyvalue::parser<strtk::keyvalue::uintkey_map> kvp_type;

   strtk::keyvalue::uintkey_map::options options;

   options.key_count            = 334; //[111,333]
   options.pair_block_delimiter = '|';
   options.pair_delimiter       = '=';

   kvp_type kvp(options);

   data_store ds;

   typedef strtk::range::ustring::const_iterator itr_type;
   using strtk::util::semantic_action;

   //Parsing action for key 111
   auto lambda_key111 = [&ds](itr_type begin,itr_type end) -> bool
                        {
                           if (!strtk::string_to_type_converter(begin,end,ds.i1))
                           {
                              std::cout << "Failed to parse value for key=111\n";
                              return false;
                           }
                           std::cout << "key[111] value=" << ds.i1 << std::endl;
                           return true;
                        };

   //Parsing action for key 222
   auto lambda_key222 = [&ds](itr_type begin,itr_type end) -> bool
                        {
                           if (!strtk::string_to_type_converter(begin,end,ds.d1))
                           {
                              std::cout << "Failed to parse value for key=222\n";
                              return false;
                           }
                           std::cout << "key[222] value=" << ds.d1 << std::endl;
                           return true;
                        };

   //Parsing action for key 333
   auto lambda_key333 = [&ds](itr_type begin,itr_type end) -> bool
                        {
                           if (!strtk::string_to_type_converter(begin,end,ds.s1))
                           {
                              std::cout << "Failed to parse value for key=333\n";
                              return false;
                           }
                           std::cout << "key[333] value=" << ds.s1 << std::endl;
                           return true;
                        };

   kvp.register_keyvalue(111,semantic_action(lambda_key111).ref());
   kvp.register_keyvalue(222,semantic_action(lambda_key222).ref());
   kvp.register_keyvalue(333,semantic_action(lambda_key333).ref());

   for (std::size_t i = 0; i < data_size; ++i)
   {
      if (!kvp(data[i]))
         std::cout << "Error while parsing tuple: " << i << std::endl;
      std::cout << std::endl;
   }

   return 0;
}

Note: It should be noted that semantic actions during the parsing process have a multitude of uses, some of which are: validation of parsed values, that is making sure that they're in a specified range or within a predefined set of values, complex parsed value manipulations, invoking of external state-machines to transition to new states based on the parsed value or even simply the presence of a key etc.

An Attempt At Improving File Processing Performance

In a number of examples from above, a call to strtk::for_each_line was invoked. The routine is intended to provide a simple wrapper around the error prone boiler plate code that is required when processing a text file line by line.

The idea behind strtk::for_each_line is that the user provide a lambda/functor that accepts a std::string, the routine will in turn invoke the lambda passing the current line it has read from the file, all other details related to opening of the file, reading lines from the file and the handling of errors are transparent to the user specified lambda.

For most situations this is perfectly fine, however for certain types of high performance data processing, the overhead incurred from std::ifstream (virtual method calls on a per char basis, multiple copies due to back buffers etc) via the invocation of std::getline, result in an unacceptable performance profile.

Due to the simple nature of the I/O access pattern (forward only) utilized within strtk::for_each_line, a memory mapped file approach would be far more efficient, being roughly 5x-7x faster than the standard strtk::for_each_line routine. The following is a simple example of code that utilises the Boost.IOStreams mmap facility namely: mapped_file_source for mmap'ing the user specified file in conjunction with strtk::split to delimit on line boundaries that are denoted by new-line characters.

namespace strtk
{
   template <typename Function>
   inline std::size_t for_each_line_mmap(const std::string& file_name,
                                         Function function,
                                         const std::size_t&
                                         buffer_size = one_kilobyte)
   {
      boost::iostreams::mapped_file_source input_source;

      try
      {
         input_source.open(file_name);
      }
      catch (std::exception&)
      {
         return 0;
      }

      const char* data = input_source.data();

      multiple_char_delimiter_predicate delimiter("\n");

      std::string buffer;
      buffer.reserve(buffer_size);
      std::size_t line_count = 0;

      split(delimiter,
            data, data + input_source.size(),
            functional_inserter(
               [&](const range::string& range)
               {
                  if (range.begin() == range.end())
                     return;
                  const char* end = range.end();
                  if (line_count && ('\r' == *(end - 1)))
                     end--;
                  buffer.assign(range.begin(),end);
                  function(buffer);
                  ++line_count;
               }),
            split_options::compress_delimiters);

      return line_count;
   }
}
Note: The following are some points to consider with the solution presented above:
  • Because the routine attempts to map the entire file, on 32-bit systems this may fail when the file is very large, due to the limited availability of virtual address space. This can be easily remedied by windowing the mmap access over the file.
  • If the system is under heavy load memory-wise, the performance observed is expected to be similar to but no worse than if the original strtk::for_each_line routine were used.
  • The interface with the lambda can be improved to instead only pass a range rather than to populate and pass a std::string as is done when using std::getline. This improvement results in a roughly 5% increase in performance.
  • An interesting side effect of the range based approached, is that the ranges denote the exact offsets from the original file for the begin and end positions of the string currently being processed. This can be very useful in situations where the exact location of a token is required post processing - as is the case with the problem presented earlier: "Pick A Random Line From A Text File"

'for_each_line' via MMap Benchmark

The following benchmark demonstrates all three variations of strtk::for_each_line routine, which include normal, mmap+std::string and mmap+ranges. The example, iterates through a file of roughly 825MB in size, comprised of two types of CSV lines, each containing 100 tokens, where some of the tokens are to be parsed as integers. Each line is parsed, the integers extracted as necessary and then accumulated into a final sum so as to minimize any aggressive optimizations.

The following is a set of results derived from a run of the above benchmark, based on a build using GCC 4.8, with O2 and PGO, running on a Xeon X5650 3.2GHz 64GB RAM, kernel 3.11. The base measure is lines processed per second. The greater the value the higher the performance.

StrTk MMap Benchmark Chart - Copyright Arash Partow

MethodLines/sec
Standard 126817
MMap 710317
MMap+Range735317

The Letters Game

In the popular TV game show Countdown (aka Letters and Numbers), contestants in the Letters round take turns choosing letters from either a vowel or consonant bin. Typically up to 9 letters are chosen, after which the contestants are given a certain amount of time (usually 30 seconds) to find the longest 'valid' English word made up of only the letters that had been chosen. The contestant with the longest valid word wins the round. What defines a valid word is usually specific to the version of the show. Some examples are using the OED or the Macquarie dictionary coupled together with rules related to proper nouns, plurals and combination words. The Letters round is essentially an anagram solving challenge.

The Letters game can be generally defined as: Given a canonical set of substrings of varying length called C, generated from the alphabet A, and a subset of not necessarily unique elements derived from A called D, find the longest substring in the set C that is also in the set of the 2|D| unique combinations generated from the set D.

A typical process for solving the Letters problem is as follows:

  • Step 0 - A list of unique valid words is specified (eg: OED or Word List)
    • Step 0.0 - The longest word length from the list of words is noted as being L
    • Step 0.1 - Each word is normalised to a common case. (eg: all lower case)
    • Step 0.2 - Store the length of the words into a "word length" set
  • Step 1 - For every word wi, generate the key ki by sorting the letters of the word. (eg: For the word 'english', the key 'eghilns' is derived.)
  • Step 2 - Insert each key/word pair into an associative map M (eg: hash-table)
  • Step 3 - A list of potentially non-unique letters of length N is specified comprised of both consonants and vowels
    • Step 3.0 - Lexicographically order this list of letters
  • Step 4 - For every N-choose-I combinations over the list of N letters, where I starts from min(L,N) and tends to 1:
    • Step 4.0 - If no words of length I exist proceed to the next value of I
    • Step 4.1 - Test the current combination x as a key in M
    • Step 4.2 - If x exists in M, add all the words associated with x into a solution list S,
    • Step 4.3 - Otherwise continue on with the next combinations and values of I
    • Step 4.4 - Once all the combinations for the current I have been enumerated present the solution list S - (if it is not empty)
int main()
{
   std::string word_list_file_name = "word_list.txt";
   typedef std::unordered_multimap<std::string,std::string> word_map_t;
   typedef std::pair<word_map_t::iterator,word_map_t::iterator> word_map_range_t;
   word_map_t word_map;
   std::set<unsigned int> word_length;
   std::size_t longest_word = 0;

   //Load the word map as per steps 0 and 1
   strtk::for_each_line(word_list_file_name,
                        [&](std::string word)
                        {
                           if (word.empty()) return;
                           strtk::remove_leading_trailing(" \t\n\r",word);
                           //Generate the key for the specified word.
                           strtk::convert_to_lowercase(word);
                           std::string key = word;
                           strtk::sort(key);
                           word_map.insert(std::make_pair(key,word));
                           longest_word = std::max(longest_word,word.size());
                           word_length.insert(word.size());
                        });

   std::string letters;

   for (;;)
   {
      //Load, prepare and validate the game letters
      std::cout << "Enter letters: ";
      if(!std::getline(std::cin,letters))
         break;

      static const std::string illegal_chars = strtk::ext_string::all_chars() -
                                               strtk::ext_string::all_letters();

      strtk::multiple_char_delimiter_predicate pred(illegal_chars);
      strtk::remove_inplace(pred,letters);

      if (letters.empty()) break;

      strtk::convert_to_lowercase(letters);
      strtk::sort(letters);

      const std::size_t upper_bound = std::min(longest_word,letters.size());

      std::unordered_set<std::string> solution_list;

      for (std::size_t i = upper_bound; i > 0; --i)
      {
         if (word_length.end() == word_length.find(i))
            continue;

         typedef std::string::iterator str_itr_t;

         //Enumerate all N-choose-I combinations as per step 4
         strtk::for_each_combination(letters.begin(),letters.end(),
                                     i,
                                     [&](str_itr_t begin, str_itr_t end)
                                     {
                                        std::string key(begin,end);
                                        word_map_range_t itr_range = word_map.equal_range(key);
                                        if (0 == strtk::distance(itr_range)) return;
                                        auto itr = itr_range.first;
                                        while (itr_range.second != itr)
                                        {
                                           solution_list.insert(itr->second);
                                           ++itr;
                                        }
                                    });

         //Present the solution list if solutions have been found as per step 4.4
         if (!solution_list.empty())
         {
            std::copy(solution_list.begin(),
                      solution_list.end(),
                      std::ostream_iterator<std::string>(std::cout,"\n"));
            break;
         }
      }
   }

   return 0;
}

Notes On The Letters Round Solution

The time complexity of the given solution is O(2min(L,N)), which is quite large. What makes the solution practical, is the fact that natural languages such as English tend to have short common words derived from relatively small alphabets, with an upper range length of about length 10-12 characters (excluding names et al and of course Pneumonoultramicroscopicsilicovolcanoconiosis). So for example an N of 10 or even 20 (assuming L is adequately large), will only amount to a total of 1024 and 1048576 unique combinations respectively - furthermore both search spaces can be trivially enumerated using brute force in a mere fraction of a millisecond using modern hardware.

However the problem space becomes daunting at around N of 64 and larger. At which point a constant multiplier can be applied by distributing the enumeration process and performing said computations concurrently. Note this will not reduce the overall complexity of the solution, just the time it will take to complete, furthermore today this technique may only be practical for values of N less than 67.

One final note, the above process will not only provide the first solution it encounters, it will return all possible solutions for largest encountered length combination.

Past Letters Round Games

The following is a short list of the Letters round games played on countdown during the 2010 season.

 TSDTOEIRO DELAWERAD FSGTIOAOV LESADEPIL YTAINROTD LLBUEISUT GEMUHRONA TRNSEAENI
 BPSQEAEVN GOXERAMJA NTRTIOAEP FLUENTLIP SCRSIEONU HPAVSOEDA WKCJROEAM IGTYEARTN
 DSOEXDALM SMNRIEIUG EFCAORTNA HGRNEOITE RGPNIEUQA TREULDEOF SNRTEUIDA TRUCEENDS
 HTNXIUOAD TREULDEOF SNRTEUIDA TRUCEENDS HTNXIUOAD SWDEIUHBE NWCKPAIAE NYDAOUTAU
 SGLAEOMVI GLQGEIODV RTLEOESTI JCRRUAEOS LSDSEAEFO TRMAEASRI NFRNOIAES BROADMOOR

Performance Comparisons

The following are tables of results generated by running the strtk_tokenizer_cmp test. Currently it covers simple comparisons between Boost String Algorithms, Boost lexical_cast, The Standard Library, Spirit (Karma Qi) and StrTk in the following areas:

  • Tokenization
  • Splitting
  • Integer To String
  • String To Integer
  • String To Double

Scenario 0 - MSVC 2010 (64-bit, O2, Ot, GL and PGO)

SourceTestSizeTime(sec)Rate% from Baseline
Boost Tokenizer 24000000 8.5857sec 2795359.4074tks/sec 100.0%, 100.0%
StrTk Tokenizer 24000000 3.5019sec 6853393.1186tks/sec 40.7%, 245.1%
Boost Split 9600000 5.5414sec 1732414.5137tks/sec 100.0%, 100.0%
StrTk Split 9600000 0.8218sec 11681814.9167tks/sec 14.8%, 674.3%
sprintfInteger To String80000000 35.8128sec 2233840.0564nums/sec 100.0%, 100.0%
Boost Integer To String80000000 19.3994sec 4123832.0477nums/sec 54.1%, 184.6%
Karma Integer To String80000000 6.2528sec 12794349.6524nums/sec 17.4%, 572.7%
StrTk Integer To String80000000 1.5664sec 51071439.9822nums/sec 4.3%, 2286.2%
atoi String To Integer88500000 5.1802sec 17084370.4936nums/sec 100.0%, 100.0%
Boost String To Integer88500000119.6261sec 739805.3712nums/sec2309.2%, 4.3%
Qi String To Integer88500000 2.1951sec 40317238.6629nums/sec 42.3%, 235.9%
StrTk String To Integer88500000 1.8181sec 48677773.5466nums/sec 35.0%, 284.9%
atof String To Double 30650000 15.2306sec 2012396.7122nums/sec 100.0%, 100.0%
Boost String To Double 30650000 52.9244sec 579127.8866nums/sec 347.4%, 28.7%
Qi String To Double 30650000 2.8665sec 10692313.5853nums/sec 18.8%, 531.3%
StrTk String To Double 30650000 1.6069sec 19073975.7679nums/sec 10.5%, 947.8%

StrTk MSVC10 Result Chart - Copyright Arash Partow


Scenario 1 - MSVC 2010 (O2, Ot, GL and PGO)

SourceTestSizeTime(sec)Rate% from Baseline
Boost Tokenizer 24000000 9.4715sec 2533910.4769tks/sec 100.0%, 100.0%
StrTk Tokenizer 24000000 2.8889sec 8307786.9292tks/sec 30.5%, 327.8%
Boost Split 9600000 7.2291sec 1327965.9706tks/sec 100.0%, 100.0%
StrTk Split 9600000 1.1301sec 8494610.9664tks/sec 15.6%, 639.6%
sprintfInteger To String80000000 38.2576sec 2091088.8038nums/sec 100.0%, 100.0%
Boost Integer To String80000000 28.9931sec 2759277.4769nums/sec 75.7%, 131.9%
Karma Integer To String80000000 4.9173sec 16269254.0190nums/sec 12.8%, 778.0%
StrTk Integer To String80000000 1.8270sec 43786838.0279nums/sec 4.7%, 2093.9%
atoi String To Integer88500000 6.0076sec 14731435.8942nums/sec 100.0%, 100.0%
Boost String To Integer88500000185.4955sec 477100.6474nums/sec3087.0%, 3.2%
Qi String To Integer88500000 2.5060sec 35314785.8370nums/sec 41.7%, 239.7%
StrTk String To Integer88500000 2.2095sec 40054213.0736nums/sec 36.7%, 271.8%
atof String To Double 30650000 17.6435sec 1737179.9302nums/sec 100.0%, 100.0%
Boost String To Double 30650000 78.6528sec 389687.3997nums/sec 445.7%, 22.4%
Qi String To Double 30650000 3.8034sec 8058494.1994nums/sec 21.5%, 463.8%
StrTk String To Double 30650000 2.0450sec 14987780.2310nums/sec 11.5%, 862.7%

StrTk MSVC10 Result Chart - Copyright Arash Partow


Scenario 2 - MSVC 2008 SP1 (O2, Ot, GL and PGO)

SourceTestSizeTime(sec)Rate% from Baseline
Boost Tokenizer 24000000 9.6533sec 2486184.8282tks/sec 100.0%, 100.0%
StrTk Tokenizer 24000000 3.4748sec 6906943.9529tks/sec 35.9%, 277.8%
Boost Split 9600000 10.2600sec 935674.7490tks/sec 100.0%, 100.0%
StrTk Split 9600000 1.3793sec 6959830.0652tks/sec 13.4%, 743.8%
sprintfInteger To String80000000 24.6427sec 3246397.8287nums/sec 100.0%, 100.0%
Boost Integer To String80000000 27.5865sec 2899968.5753nums/sec 111.9%, 89.3%
Karma Integer To String80000000 5.4864sec14581630.6963nums/sec 22.2%, 449.1%
StrTk Integer To String80000000 2.4224sec33025441.1256nums/sec 9.8%, 1017.2%
atoi String To Integer88500000 5.9297sec14924814.8683nums/sec 100.0%, 100.0%
Boost String To Integer88500000186.1372sec 475455.6660nums/sec3139.0%, 3.1%
Qi String To Integer88500000 2.0874sec42397446.1804nums/sec 35.2%, 284.0%
StrTk String To Integer88500000 2.0485sec43202160.1371nums/sec 34.5%, 289.4%
atof String To Double 30650000 18.0458sec 1698455.0767nums/sec 100.0%, 100.0%
Boost String To Double 30650000 77.4527sec 395725.4111nums/sec 429.2%, 23.2%
Qi String To Double 30650000 3.9631sec 7733881.1294nums/sec 21.9%, 455.3%
StrTk String To Double 30650000 2.0723sec14790236.0804nums/sec 11.4%, 870.8%

StrTk MSVC9 Result Chart - Copyright Arash Partow


Scenario 3 - Intel C++ v11.1.060 IA-32 (O2, Ot, Qipo, QxHost and PGO)

SourceTestSizeTime(sec)Rate% from Baseline
Boost Tokenizer 24000000 10.0096sec 2397697.7836tks/sec 100.0%, 100.0%
StrTk Tokenizer 24000000 3.1837sec 7538416.8541tks/sec 31.8%, 314.4%
Boost Split 9600000 9.5450sec 1005760.0310tks/sec 100.0%, 100.0%
StrTk Split 9600000 1.4292sec 6716893.1359tks/sec 14.9%, 667.8%
sprintfInteger To String80000000 23.8979sec 3347577.5824nums/sec 100.0%, 100.0%
Boost Integer To String80000000 27.5618sec 2902565.2045nums/sec 115.3%, 86.7%
Karma Integer To String80000000 4.6600sec17167208.7654nums/sec 19.4%, 512.8%
StrTk Integer To String80000000 2.8450sec28119857.2736nums/sec 11.9%, 840.0%
atoi String To Integer88500000 5.9386sec14902610.8922nums/sec 100.0%, 100.0%
Boost String To Integer88500000180.5856sec 490072.4001nums/sec3040.8%, 3.2%
Qi String To Integer88500000 2.5273sec35017073.8639nums/sec 42.5%, 234.9%
StrTk String To Integer88500000 1.8718sec47281492.1287nums/sec 31.5%, 317.2%
atof String To Double 30650000 18.4357sec 1662538.0810nums/sec 100.0%, 100.0%
Boost String To Double 30650000 78.1543sec 392172.9598nums/sec 423.9%, 23.5%
Qi String To Double 30650000 2.8321sec10822353.0510nums/sec 15.3%, 650.9%
StrTk String To Double 30650000 2.2930sec13366541.5515nums/sec 12.4%, 803.9%

StrTk Intel C++ v11 Result Chart - Copyright Arash Partow


Scenario 4 - GCC 4.5 (O3, PGO)

SourceTestSizeTime(sec)Rate% from Baseline
Boost Tokenizer 24000000 9.2510sec 2594305.4347tks/sec100.0%, 100.0%
StrTk Tokenizer 24000000 3.9717sec 6042688.5734tks/sec 42.9%, 232.9%
Boost Split 9600000 5.0640sec 1895728.2331tks/sec100.0%, 100.0%
StrTk Split 9600000 1.5411sec 6229231.8384tks/sec 30.4%, 328.5%
sprintfInteger To String8000000014.7807sec 5412477.0993nums/sec100.0%, 100.0%
Boost Integer To String8000000019.1131sec 4185620.7707nums/sec129.3%, 77.3%
Karma Integer To String80000000 6.4455sec12411808.2841nums/sec 43.6%, 229.3%
StrTk Integer To String80000000 4.5174sec17709364.5349nums/sec 30.5%, 327.1%
atoi String To Integer88500000 5.2139sec16973721.6103nums/sec100.0%, 100.0%
Boost String To Integer8850000050.5326sec 1751344.8498nums/sec969.1%, 10.3%
Qi String To Integer88500000 1.9694sec44937612.8835nums/sec 37.7%, 264.7%
StrTk String To Integer88500000 1.9008sec46558706.5833nums/sec 36.4%, 274.2%
atof String To Double 30650000 6.6975sec 4576328.3036nums/sec100.0%, 100.0%
Boost String To Double 3065000029.6375sec 1034162.2422nums/sec442.5%, 22.5%
Qi String To Double 30650000 2.9852sec10267435.7138nums/sec 44.5%, 224.3%
StrTk String To Double 30650000 1.5961sec19202937.1409nums/sec 23.8%, 419.6%

StrTk GCC v4.5 Result Chart - Copyright Arash Partow


Scenario 5 - GCC 4.5 (O3, PGO) Intel Atom N450

SourceTestSizeTime(sec)Rate% from Baseline
Boost Tokenizer 24000000 29.1370sec 823695.4389tks/sec 100.0%, 100.0%
StrTk Tokenizer 24000000 12.3607sec 1941644.0499tks/sec 42.4%, 235.7%
Boost Split 9600000 16.5261sec 580899.9726tks/sec 100.0%, 100.0%
StrTk Split 9600000 4.9102sec 1955110.2611tks/sec 29.7%, 336.5%
sprintfInteger To String80000000 50.3456sec 1589015.6118nums/sec 100.0%, 100.0%
Boost Integer To String80000000 91.1475sec 877698.1401nums/sec 181.0%, 55.2%
Karma Integer To String80000000 21.8904sec 3654568.8712nums/sec 43.4%, 229.9%
StrTk Integer To String80000000 12.1877sec 6564009.9274nums/sec 24.2%, 413.0%
atoi String To Integer88500000 17.6615sec 5010896.5768nums/sec 100.0%, 100.0%
Boost String To Integer88500000191.9446sec 461070.5357nums/sec1086.7%, 9.2%
Qi String To Integer88500000 6.2808sec14090561.7119nums/sec 35.5%, 281.1%
StrTk String To Integer88500000 6.1552sec14378086.8208nums/sec 34.8%, 286.9%
atof String To Double 30650000 21.4865sec 1426474.1027nums/sec 100.0%, 100.0%
Boost String To Double 30650000139.8166sec 219215.7409nums/sec 650.7%, 15.3%
Qi String To Double 30650000 11.3916sec 2690567.9223nums/sec 53.0%, 188.6%
StrTk String To Double 30650000 6.4396sec 4759608.7027nums/sec 29.9%, 333.6%

StrTk GCC v4.5 Atom N450 Result Chart - Copyright Arash Partow

Scenario 6 - GCC 4.5 (O3, PGO) Intel Xeon E5540

SourceTestSizeTime(sec)Rate% from Baseline
Boost Tokenizer 24000000 7.5657sec 3172216.8787tks/sec100.0%, 100.0%
StrTk Tokenizer 24000000 2.7379sec 8765832.8290tks/sec 36.1%, 276.3%
Boost Split 9600000 3.0706sec 3126386.1126tks/sec100.0%, 100.0%
StrTk Split 9600000 1.1279sec 8511136.2899tks/sec 36.7%, 272.2%
sprintfInteger To String8000000010.9012sec 7338642.9638nums/sec100.0%, 100.0%
Boost Integer To String8000000012.3317sec 6487328.7872nums/sec113.1%, 88.3%
Karma Integer To String80000000 3.7202sec21504260.6660nums/sec 34.1%, 293.0%
StrTk Integer To String80000000 2.5183sec31768042.4612nums/sec 23.1%, 432.8%
atoi String To Integer88500000 4.0087sec22077037.6357nums/sec100.0%, 100.0%
Boost String To Integer8850000030.3659sec 2914454.4393nums/sec757.4%, 13.2%
Qi String To Integer88500000 1.7976sec49231871.5454nums/sec 43.3%, 223.0%
StrTk String To Integer88500000 1.7384sec50908881.7303nums/sec 43.3%, 230.5%
atof String To Double 30650000 5.2118sec 5880843.9328nums/sec100.0%, 100.0%
Boost String To Double 3065000021.5546sec 1421966.9538nums/sec413.5%, 24.1%
Qi String To Double 30650000 3.2149sec 9533840.3118nums/sec 61.6%, 162.1%
StrTk String To Double 30650000 1.3929sec22003661.2944nums/sec 26.7%, 374.1%

StrTk GCC v4.5 Intel Xeon E5540 Result Chart - Copyright Arash Partow

Scenario 7 - GCC 4.5 (O3, PGO) Intel Xeon X5650

SourceTestSizeTime(sec)Rate% from Baseline
Boost Tokenizer 24000000 4.1944sec 5721901.2924tks/sec 100.00%, 100.00%
StrTk Tokenizer 24000000 2.5087sec 9566860.3956tks/sec 59.81%, 167.19%
Boost Split 9600000 2.9104sec 3298520.2014tks/sec 100.00%, 100.00%
StrTk Split 9600000 1.1105sec 8644949.2334tks/sec 38.15%, 262.08%
sprintfInteger To String80000000 9.7148sec 8234840.3537nums/sec 100.00%, 100.00%
Boost Integer To String8000000011.5726sec 6912860.7120nums/sec 119.12%, 83.94%
Karma Integer To String80000000 4.0832sec19592620.4395nums/sec 42.03%, 237.92%
StrTk Integer To String80000000 2.2204sec36029641.5861nums/sec 22.85%, 437.52%
atoi String To Integer88500000 3.1028sec28522836.1561nums/sec 100.00%, 100.00%
Boost String To Integer88500000 6.1270sec14444145.2249nums/sec 197.46%, 50.64%
Qi String To Integer88500000 1.5313sec57794031.2153nums/sec 49.35%, 202.62%
StrTk String To Integer88500000 1.4409sec61417814.6362nums/sec 46.43%, 215.32%
atof String To Double 42880000 6.1342sec 6990292.6549nums/sec 100.00%, 100.00%
Boost String To Double 4288000028.9461sec 1481374.9232nums/sec 772.37%, 21.19%
Qi String To Double 42880000 3.6549sec11732137.3557nums/sec 59.58%, 167.83%
StrTk String To Double 42880000 1.3860sec30937683.0792nums/sec 22.59%, 442.58%

StrTk GCC v4.5 Intel Xeon X5650 Result Chart - Copyright Arash Partow

Note 1: The tests are compiled with specific optimisation flags to produce the best possible results for the respective compilers and architectures. Furthermore the tests are run natively (no virtualizations were used) on an almost completely idle machine so as to reduce interference from background processes. The Boost version used was 1.55. Furthermore the standard libraries including libc were rebuilt for the linux system based tests, using architecture specific flags and optimizations. The following is a table mapping the scenarios to their respective architectures:

ScenarioArchitecture
0ThinkPad W510 (64-Bit Intel Quad Core Extreme i7-920XM 2.0GHz, 16GB RAM, Windows 7)
1-3ThinkPad x61 (32-Bit Intel Core 2 Duo 2.4GHz, 2GB RAM, Windows 7)
4ThinkPad x61 (32-Bit Intel Core 2 Duo 2.4GHz, 2GB RAM, Mint 15)
5Acer Aspire One (32-Bit Intel Atom N450 1.6Ghz, 1GB RAM, Mint 15)
6HP Proliant DL380G6 (64-Bit Intel Xeon E5540 2.5GHz, 8GB RAM, Mint 15)
7Custom box (64-Bit Intel Xeon X5650 2.66GHz, 32GB RAM, Mint 15)

Note 2: The percentages in the final column represent the percentage of the current row versus the baseline in total running time and rate respectively. For the first percentage the lower the value the better and for the second percentage the higher the value the better. The baseline used for a specific combination of tests is defined in the following table:

Test CombinationBaseline
Boost, StrTkBoost
Boost, StdLib/STL, Spirit, StrTkStdLib/STL
StdLib/STL, Spirit, StrTkStdLib/STL

Note 3: The test sizes are set such that no single run will result in a running time less than one second. This is done so as to ensure that runs-per-second results are not deemed to have been projected. In the future these sizes may need to be revisited once 3.5+GHz CPU speeds become more commonplace. Furthermore the charts represent the rate of operation over a one second interval - In short, the larger the rate the better.

Note 4: The binaries used for the above performance tests can be downloaded from here

Note 5: It would be great to have comparisons for other architectures. If you can provide access to shell accounts with GCC 4.5+ or Clang/LLVM 2.0+ for the following architectures: UltraSPARC T2 Plus, SPARC64 VII, POWER6/7, please feel free to get in contact.

A Final Digression - Fast Integer To String Conversion

The task of converting an integer to a std::string is quite common and rather trivial. It is mainly used in serialisations, such as printing to stdout or a file et al. The above benchmarks do include timings for int to std::string conversions, however let's see if there's more that can be done.

The following takes this very simple task to the next level. Based on a series solutions found on StackOverflow and elsewhere, the following benchmark has been derived. Its objective is simple: Determine the fastest integer to std::string conversion routine, written entirely in conforming C++. The results are as follows:

MethodRate (per/sec)
sprintf 3903490
karma 17305869
zverovich 19036262
voigt 21489702
so 32617186
timo 46211794
hopman52160759
strtk 52650806
jiaendu 60955167

Fast Integer To String Conversion Benchmark Result Chart - Copyright Arash Partow

Note: The following are details and also some points to consider with regards to the benchmark presented above:
  • A total of nine conversion routines are employed, of varying complexities, ranging from simple divide/mod and loop to elaborate single pass bit-twiddle LUT based strategies.
  • A wide range of values within the int space are used. They consist of sets of small and large random values, values near to and including min/max of int and all values from -20000000 to +20000000. Furthermore the various values are interleaved amoungst each other during the main loop for each conversion routine.
  • Prior to the benchmark beginning, a simple sanity check is run over all the conversion routines to make sure each routine functions correctly.
  • Before each conversion routine's benchmark begins, an attempt is made to flush the I/D caches. This is done so as to avoid the previous conversion routine's execution profile affecting the next routine.
  • It should be noted that the results presented were derived from running the benchmark upon a processor which has a large L1/L2 cache. This is important because when run on processors that have smaller caches or none at all, the rankings change quite a bit. In fact the simple div/mod and loop strategies will tend to out do most of the LUT based strategies under certain architectural conditions. This point is applicable not only to processes running atop of embedded and low power processors but also to processes running within virtualized environments.
  • The system details are: Intel Xeon W3680 3.3GHz, kernel 3.11, GCC 4.8 O2 compiled with PGO.
  • The conversion implementations found in boost::lexical_cast, NumToA and Folly were found to be exceedingly lacking when it came to performance and were hence left out.

StrTk Library Dependency

StrTk makes use of the Boost library for its boost::lexical_cast routine for types other than PODs, and its TR1 compliant Random and Regex libraries. These dependencies are not compulsory and can be easily removed simply by defining the preprocessor: strtk_no_tr1_or_boost. That said Boost is an integral part of modern C++ programming, and having it around is as beneficial as having access to the STL, hence it is recommended that it be installed. For Visual Studio users, BoostPro provides a free and easy to use installer for the latest Boost libraries that can be obtained from Here. For Linux users, mainstream distributions such as Ubuntu and Red-Hat(Fedora) provide easy installation of the Boost libraries via their respective package management systems. For more information please consult the readme.txt found in the StrTk distribution.

Compiler Support

The following is a listing of the various compilers that StrTk can be built with error and warning free.

  • GCC - verions 3.1+
  • Clang/LLVM - versions 1.0+
  • Intel C++ Compiler - versions 8+
  • MSVC - versions 7.1+
  • Comeau C/C++ - versions 4.3+
  • PGI C++ - versions 10.x+
  • IBM XL C/C++ - versions 10.x+

Note: Versions of compilers prior to the ones denoted above "should" compile, however they may require a very lenient warning/error level be set during compilation.

Conclusion

StrTk was designed with performance and efficiency as its sole primary principles, and as such some of the available interfaces may not be as user-friendly as they should - however that said, the gains made in other areas hopefully will compensate for any perceived difficulties. Like most things there is a trade-off between performance and usability with the above mentioned tokenizers and parsing methods. The original aim was to provide an interface similar to that of the Boost Tokenizer and Split routines, but to also avail the developer with abstractions and various other simplifications that will hopefully provide them more flexibility and efficiency in the long run. That said, tokenizing a string isn't the most fascinating problem one could tackle but it does have its interesting points when one has a few TB of data to process, doing it properly could mean the difference between finishing a simple data processing job today or next month.


License

This article, along with any associated source code and files, is licensed under The Common Public License Version 1.0 (CPL)

Share

About the Author

Arash Partow

Heard Island And Mcdonald Islands Heard Island And Mcdonald Islands
I'm a computer scientist by ambition and an artisan engineer by trade.

Comments and Discussions

 
AnswerRe: Question of the library PinmemberArash Partow22-Jun-13 11:47 
GeneralRe: Question of the library PinmemberAleksey197822-Jun-13 21:01 
GeneralRe: Question of the library PinmemberArash Partow22-Jun-13 22:44 
GeneralRe: Question of the library PinmemberAleksey197822-Jun-13 22:50 
GeneralRe: Question of the library PinmemberArash Partow22-Jun-13 23:05 
GeneralMy vote of 5 PinmemberD. Christian Ohle22-Jun-13 7:03 
GeneralRe: My vote of 5 PinmemberArash Partow22-Jun-13 11:32 
GeneralRe: My vote of 5 PinmemberD. Christian Ohle23-Jun-13 2:36 
Yes, STL is nice, very handy and allows good readable code.
Look and feeling like C#.
But there is the problem, many allocations of small memory blocks, and all the malloc and free calls behind are expensive in c++.
 
I working on graphics algorithms and speed is always very importand.
My experience is, with excessive use of STL in C++ it is many times slower than the same implementation in C#. Of course, new is fast in C#, no problems of memory fragmentation and the memory blocks are close together what is good for cach lines.
From there, for speed I try to avoid STL. It makes not much sense to have c++ dll's four times slower than in C#.
 
Unfortunatly I have no better base library. My way is to manage the data structures in big buffers, index buffers and so on and reuse the buffers if possible.
A lot of manual work, another style but the advantage is to get it approximately 4 times faster in c++ than the same in c#.
In my case it was possible to speed up the c++ code by a factor of 10.
GeneralRe: My vote of 5 PinmemberArash Partow23-Jun-13 18:14 
QuestionDifferent results PinmemberDr0pL3t25-Mar-13 2:23 
AnswerRe: Different results PinmemberArash Partow25-Mar-13 2:40 
GeneralRe: Different results PinmemberDr0pL3t31-Mar-13 23:09 
GeneralRe: Different results PinmemberArash Partow1-Apr-13 0:08 
GeneralRe: Different results PinmemberDr0pL3t18-Apr-13 19:51 
GeneralMy vote of 5 PinmemberG. Steudtel11-Feb-13 0:13 
GeneralRe: My vote of 5 PinmemberArash Partow11-Feb-13 0:20 
QuestionI'm impressed! PinmemberMarius Bancila16-Jan-13 11:26 
AnswerRe: I'm impressed! PinmemberArash Partow16-Jan-13 23:52 
GeneralMy vote of 5 PinmemberPhat (Phillip) H. VU8-Jan-13 21:57 
GeneralRe: My vote of 5 PinmemberArash Partow11-Jan-13 21:53 
Questionhex sink to vector [modified] Pinmemberbinkmer21-Dec-12 7:07 
AnswerRe: hex sink to vector PinmemberArash Partow22-Dec-12 11:51 
GeneralRe: hex sink to vector Pinmemberbinkmer22-Dec-12 23:15 
GeneralMy vote of 5 Pinmemberduccuongpx14-Nov-12 19:11 
GeneralRe: My vote of 5 PinmemberArash Partow15-Nov-12 8:41 

General General    News News    Suggestion Suggestion    Question Question    Bug Bug    Answer Answer    Joke Joke    Rant Rant    Admin Admin   

Use Ctrl+Left/Right to switch messages, Ctrl+Up/Down to switch threads, Ctrl+Shift+Left/Right to switch pages.

| Advertise | Privacy | Terms of Use | Mobile
Web03 | 2.8.141223.1 | Last Updated 7 Jan 2014
Article Copyright 2008 by Arash Partow
Everything else Copyright © CodeProject, 1999-2014
Layout: fixed | fluid