|
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Announcements
Want a new Job?
Chapters
Services
Feature Zones
|
Note: This is an unedited contribution. If this article is inappropriate,
needs attention or copies someone else's work without reference then please
Report This Article
IntroductionThis article will present the tokenizing and splitting functionality of a simple C++ library called the String Toolkit. Tokenization in the context of string processing, is the method by which a sequence of elements are broken up or fragmented in sub-sequences. The indicies in the original sequence that determine such breaks in the sequence are known as delimiters. There are two types of delimiters, normal or thin delimiters which are of length one element and a thick delimiters which are of length two or more elements. Even though tokenization is primarily used in conjunction with strings, any sequence of types that can be iterated in a linear fashion can be tokenized, examples may be list of integers, a vector of person classes or an map of strings. Another Tokenizer?To date there have been many attempts to define a "standard" Tokenizer implementation in C++. Of them all the most likely candidate would be the implementation in the BOOST library. Proposed implementations should to some extent consider one or more of the following areas: over-all usage patterns, constructions, generality (or is it genericty these days?) of design, performance. 1. Over-all usage patternsHere we are mainly concerned with how easy it is to instantiate the tokenizer and integrate it into existing processing patterns, which most often than not requires integration with C++ STL algorithms. A tokenzier by definition would be classed as a producer, so the question becomes how easy is it for others to consume its output? Another issue is consistency of the definition of a token in the situation where one has consecutive delimiters but is not compressing them - can or should or there be such a thing as an empty token? and what do preceding and trailing delimiters mean? 2. ConstructionsIn the context of string tokenizing, the majority of implementations return the token as a new instance of a string. This process requires a string to be created on heap, populated by the sub -string in question from the original sequence, then returned back (some of this may be alleviated by RVO). In the case of iterators this is essentially another copy to the caller. Furthermore two kinds of tokens can make this situation worse, they are primarily a large sequence made up of lots of very short tokens or a large sequence made up of lots of very large tokens. The solution is not to return the token as a string but rather as a range (BOOST split already has this feature) and allow the caller to decide how they wish to inspect the token.
This minor change in interface design provides a great deal of flexibility and performance gain. 3. Generality(Genericity) of designMost tokenizer implementations concern themselves only with strings, which for the most part is ok, because most things that need tokenizing are strings. However there will be times when one has a sequence of types that may need to be tokenized that aren't strings, hence a tokenizer should be designed in such a way to enable such features, moreover it becomes clear that the most basic of tokenizing algorithms are invariant to the type of the delimiter. 4. PerformanceTokenizing of strings range from low frequency inputs such as user input or parsing of simple configuration files to more complex situations such as tokenizing of HTML pages that a web crawler/indexer might do. Performance is generally important and can usually be helped along with good usage patterns that encourage reuse of instances, minimal preprocessing and allow for user supplied predicates for the more nasty areas of the process. Getting startedStrTk provides two common tokenization concepts, a split function and a token iterator. Both these concepts require the user to provide a delimiter predicate and an iterator range over which the process will be carried out. The tokenization process can be further parametrized by switching between "compressed delimiters" or "no compressed delimiters" mode. This essentially means that consecutive delimiters will be compressed down to one and treated as such. Below are two tables depicting the expected tokens from various types of input. The tables represent no compressed and compressed tokenization processes respectively. The delimiter in this instance is a pipe symbol | and <> denotes an empty token. No Compressed Delimiters
Compressed Delimiters
DelimitersTwo forms of delimiters are supported and they are single and multiple delimiters SDP and MDP respectively. Essentially a SDP is where there is only one type that can break or fragment the sequence, where as with MDPs there is more than one unique type that can break the sequence. It is possible to represent a SDP using the MDP, however from a performance POV having separate predicates is far more efficient. Additionally for strings based on char or unsigned char (8-bit versions) there is a MDP that has a look-up complexity of O(1) making it greatly more efficient than the basic MDP. Single delimiter predicateInstantiation requires specialization of type and construction requires an instance of the type. strtk::single_delimiter_predicate<typename T>(const T& t) strtk::single_delimiter_predicate<std::string::value_type> predicate('|'); Multiple delimiter predicateInstantiation requires specialization of type and construction requires a sequence of potential delimiters through a range described by iterators. strtk::multiple_delimiter_predicate<typename T>(Iterator begin, Iterator end); unsigned int delimiters[5] = {1,10,101,1010,10101}; strtk::multiple_delimiter_predicate<unsigned int> predicate(delimiters,delimiters + 5); Multiple char delimiter predicateInstantiation requires an iterator range based on either unsigned char or char. This delimiter is more efficient than the simple MDP as it has a predicate evaluation of O(1) due to the use of a lookup-table as opposed to O(n) where n is the number of delimiters. Performance increase is seen as more delimiters are used. strtk::multiple_char_delimiter_predicate(const std::string) strtk::multiple_char_delimiter_predicate(const std::string::const_iterator begin,const std::string::const_iterator end) strtk::multiple_char_delimiter_predicate predicate(" .;?"); Split
This is a function that performs tokenization over an entire sequence
in one go. std::string s = "abc|123|xyz|789"; std::list< std::pair< std::string::const_iterator,std::string::const_iterator > > token_list; strtk::single_delimiter_predicate<std::string::value_type> predicate('|'); strtk::split(s,predicate,std::back_inserter(token_list));
Adding "true" as the last parameter to the function call of
strtk::split(s,predicate,std::back_inserter(token_list),true);
Another way of using split may be to use the
std::string s = "abc?123;xyz.789"; std::list< std::pair< std::string::const_iterator,std::string::const_iterator > > token_list; strtk::multiple_char_delimiter_predicate predicate(" .;?"); strtk::split(s,predicate,std::back_inserter(token_list));
The contents of the std::list< std::pair< std::string::const_iterator,std::string::const_iterator > >::iterator it = token_list.begin(); while(it != token_list.end()) { std::string current_token = std::string((*it).first,(*it).second); std::cout << current_token << "\t"; ++it; } TokenizerThe tokenizer is specialized on a sequence iterator and predicate. It is constructed with a range of iterators for the sequence and a reference to the desired predicate. Defines exists for std::string tokenizers. The tokenizer provides an iteration pattern as a means for accessing the tokens in the sequence. const unsigned int data_size = 12; unsigned int data[data_size] = {1,2,3,0,4,5,6,0,7,8,0,9}; strtk::single_delimiter_predicate<unsigned int> predicate(0); strtk::tokenizer< unsigned int*,strtk::single_delimiter_predicate<unsigned int> > tokenizer(data,data + data_size,predicate); Adding "true" as the last parameter to the constructor of the tokenizer will enable compressed delimiters mode, by default it is set to false. strtk::tokenizer< unsigned int*,strtk::single_delimiter_predicate<unsigned int> > tokenizer(data,data + data_size,predicate,true); Iteration over the tokens is performed as follows: strtk::tokenizer< unsigned int*,strtk::single_delimiter_predicate<unsigned int > >::iterator it = tokenizer.begin(); while (it != tokenizer.end()) { std::copy((*it).first,(*it).second,std::ostream_iterator<unsigned int>(std::cout," ")); std::cout << std::endl; ++it; }
A typical std::string s = "abc|123|xyz|789"; strtk::single_delimiter_predicate<std::string::value_type> predicate('|'); strtk::std_string_tokenizer<strtk::single_delimiter_predicate<std::string::value_type> >::type tokenizer(s,predicate); strtk::std_string_tokenizer<strtk::single_delimiter_predicate<std::string::value_type> >::type::iterator it = tokenizer.begin(); while (it != tokenizer.end()) { std::string current_token = std::string((*it).first,(*it).second); std::cout << current_token << "\t"; ++it; } Another common situation may be tokenizing a sequence of strings, such as the following: const unsigned int str_list_size = 10; std::string str_list[str_list_size] = { "abc" , "delimiter" , "ijk" , "delimiter" , "mno" , "delimiter" , "rst" , "uvw" , "delimiter" , "xyz" }; strtk::range_adapter<std::string> range(str_list,str_list_size); strtk::single_delimiter_predicate<std::string> predicate("delimiter"); strtk::tokenizer< std::string*,strtk::single_delimiter_predicate<std::string> > tokenizer(range.begin(),range.end(),predicate); strtk::tokenizer< std::string*,strtk::single_delimiter_predicate<std::string> >::iterator it = tokenizer.begin(); while(it != tokenizer.end()) { std::copy((*it).first,(*it).second,std::ostream_iterator<std::string>(std::cout," ")); std::cout << std::endl; ++it; } A Practicle ExampleLets assume you have been given an english text file to process, with the intention of extracting a lexicon from the file. One solution would be to break the problem down to a line by line tokenization problem. In this case you would define a functional object such as the following which will take the container in which you plan on storing your token and a predicate and insert the tokens as strings into your container. template<typename Container, typename Predicate> struct parse_line { public: parse_line(Container& container, const Predicate& predicate) : container_(container), predicate_(predicate) {} inline void operator() (const std::string& s) const { strtk::split(s,predicate_,strtk::range_to_string_back_inserter(container_),true); } private: Container& container_; const Predicate& predicate_; };
The whole thing together would include a process of opening the
file and reading it line by line each time invoking the
template<typename Container> void parse_text(const std::string& file_name, Container& c) { std::string delimiters = " ,.;:<>'[]{}()_?/'`~!@#$%^&*|-_\"=+"; strtk::multiple_char_delimiter_predicate predicate(delimiters); strtk::for_each_line(file_name,parse_line<Container,strtk::multiple_char_delimiter_predicate>(c,predicate)); } int main(void) { std::string text_file_name = "text.txt"; std::deque< std::string > word_list; parse_text(text_file_name,word_list); std::cout << "Token Count: " << word_list.size() << std::endl; return 0; } Now coming back to the original problem, that being the construction of a lexicon. In this case the set of "words" should only contain words of interest. For the sake of simplicity lets define words of interest as being anything other than the following prepositions: as, at, but, by, for, in, like, next, of, on, opposite, out, past, to, up and via. The list definition might something like the following: const unsigned int list_size = 17; const std::string not_of_interest_list [list_size] = { "as", "at", "but", "by", "for", "in", "like", "next", "of", "on", "opposite", "out", "past", "to", "up", "via", ""};
Some minor updates to the parse_line predicate include using the
filter_on_match predicate for determining if the currently
processed token is a preposition and also the invocation of the
template<typename Container, typename Predicate> struct parse_line { public: parse_line(Container& container, const Predicate& predicate) : container_(container), predicate_(predicate), tmp_(""), tokenizer_(tmp_,predicate_,true), filter_(not_of_interest_list,not_of_interest_list + list_size, strtk::range_to_string_back_inserter_iterator<Container>(container_), true,false) {} inline void operator() (const std::string& s) { const filter_type& filter = filter_; strtk::for_each_token(s,tokenizer_,filter); } private: Container& container_; const Predicate& predicate_; std::string tmp_; typename strtk::std_string_tokenizer<Predicate>::type tokenizer_; strtk::filter_on_match<strtk::range_to_string_back_inserter_iterator<Container>> filter_; }; The above described predicate can be greatly simplified by using binders and various lamba expressions. Another ExampleWhen performing serialization or deserialization of an instance object such as a class or struct, a simple approach one could take would be to take each of the members and convert them into string representations and from those strings construct a larger string delimiting each member with a special character guaranteed not to exist in any of the string representations. In this example we will assume that there exists a struct which represents the properties of a person, a person struct if you will: struct person { std::string name; unsigned int age; double height double weight; };
The process of populating a person struct would entail having an
instance of a person and the necessary data string. The following is
an example of how this would be done using the
Person Tuple Format
std::string data = "Rumpelstiltskin|97|1.31|58.7"; person p; strtk::single_delimiter_predicate<std::string::value_type> predicate('|'); strtk::std_string_tokenizer<strtk::single_delimiter_predicate<std::string::value_type> >::type tokenizer(data,predicate); strtk::parse(tokenizer,p.name,p.age,p.height,p.weight);
Batch processing of a text file comprised of one person tuple
per-line is somewhat similar to the previous example. a predicate
is defined that takes a container specialized on the struct,
and a predicate from which it will instantiate a tokenizer. This
predicate is then instantiated and fed to the
template<typename Container, typename Predicate> struct parse_person_tuple { public: parse_person_tuple(Container& container, const Predicate& predicate) : container_(container), predicate_(predicate), tmp_(""), tokenizer_(tmp_,predicate_,true) {} inline void operator() (const std::string& s) { person p; strtk::parse(s,tokenizer_,p.name,p.age,p.height,p.weight); container_.push_back(p); } private: Container& container_; const Predicate& predicate_; std::string tmp_; typename strtk::std_string_tokenizer<Predicate>::type tokenizer_; }; Bringing the above pieces together to process a file: int main(void) { std::string text_file_name = "person_records.txt"; std::deque<person> person_list; typedef strtk::single_delimiter_predicate<std::string::value_type> sdp_predicate_type; typedef parse_person_tuple<std::deque<person>,sdp_predicate_type> predicate_type; strtk::for_each_line(file_name,predicate_type(person_list,sdp_predicate_type('|'))); return 0; } Strtk Library Dependency
Originally the library was mostly compatible with C++
compilers ranging from GCC 2.95 and Visual studio 6.0,
however due to upgrades it now requires at least GCC 3.4 and
Visual Studio 8.0 to compile easily. It also requires the
BOOST library for its ConclusionLike most things there is a trade-off between performance and usability with the above mentioned tokenizer. The original aim was to provide an interface similar to that of the BOOST tokenizer but to also provide some simplifications that will hopefully provide the user more flexibility in the long run, this is by no means a complete replacement but rather a more simpler way of looking at the problem. Tokenizing a string isn't the most interesting problem one could tackle but it does have its interesting points when one has a few TB of data to process, doing it properly could mean the difference between finishing the job this week or next month.
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||