Introduction

This article is about a tokenizer function which provides a very customizable way of breaking up strings. I made it because the

Background

For my CSV-like text file class, which I will present in another article, I was in need of a function that could break up strings into a series of tokens. After a search on Google for the term "tokenizer", the only useful thing I found was the

Features
Using the code

To use the function, you just need to provide an input string, a vector that will receive the output, and the various delimiters. Optionally, you can pass quote and/or escape characters. The defaults for the delimiters are the common CSV ones (space, tab, comma, colon, semicolon). The default quotes are " and ', and the default escape character is the backslash (\). By default, no delimiter characters are preserved.

The function prototype:

    void tokenize ( const string& str,
                    vector<string>& result,
                    const string& delimiters,
                    const string& delimiters_preserve,
                    const string& quote,
                    const string& esc );
Example:

    #include <iostream>
    #include <string>
    #include <vector>
    #include "tokenizer.h"

    using namespace std;

    int main()
    {
        // A string whose contents will be tokenized.
        string input;

        // define the characters that will break the string
        string delimiter = ",\t"; // use comma and tab

        // define the characters that will break the string
        // and generate tokens themselves
        string keep_delim = ";:"; // use semicolon and colon

        // define the characters that will protect the enclosed text
        string quote = "\'\""; // use single quote and double quote

        // define the characters that will protect the following
        // character
        string esc = "\\#"; // use backslash and the hash sign

        // vector that contains the tokens for input
        vector<string> tokens;

        tokenize ( input, tokens, delimiter, keep_delim, quote, esc );

        // to use the tokens, define a token-iterator
        vector<string>::iterator token;

        // and simply iterate through the tokens
        for ( token = tokens.begin(); tokens.end() != token; ++token ) {
            cout << *token << endl;
        }

        return 0;
    }

The demo application

By simply running the demo application, you will get the following output:

Demo application for the tokenizer function.
The tokens are in []:
This;string,is for demonstration.
[This]
[string]
[is]
[for]
[demonstration.]
Delimiters can be preserved: sqrt(17 * (20 + a))
[sqrt] [(] [17] [*] [(] [20] [+] [a] [)] [)]
"This;string;contains;quoted;text";and;escaped\;characters.
[This;string;contains;quoted;text]
[and]
[escaped;characters.]
You can also provide parameters, or edit and use the included batch file to use the demo file:

    TokenizerDemo filename [delimiters] [preserved delimiters] [quote chars] [escape chars]
How it works

Essentially, the string is iterated character by character, and each character is appended to the token string. Every time a character belongs to a delimiter, the token string is saved in a list and cleared for the next token. Furthermore, checks for special cases, like quotes, are made.

Implementation details

The first part of the function clears the result vector, and initializes variables that hold the current position of the character in the string, the state of quotes, and the current token. The second part is the loop that performs the splitting, and the third part adds the remaining token, if there is one left, to the result.

The loop:

For every character in the string
    Test if it is an escape character
        If yes, skip all other tests
    Test if it is a quote character
        If yes, skip all other tests
    Test if it is a delimiter
        Token is complete
    Test if it is a delimiter which should be preserved
        Token is complete
        flag the delimiter to be added
    Append the character to the current token if it isn't
    a special one.
    If the token is complete and not empty
        add the token to the results
    If the delimiter is preserved
        add it to the results
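
The steps above can be condensed into one self-contained function. What follows is only a minimal sketch under stated assumptions, not the article's actual `tokenize()`: the name `tokenize_sketch`, the return-by-value signature, and the default arguments (taken from the defaults described in the text) are all hypothetical. It also uses an `else if` chain, so a character listed in both delimiter groups is treated as a plain delimiter here.

```cpp
#include <string>
#include <vector>

// Hypothetical name and signature: a compact sketch of the loop described
// above, not the article's tokenize(). Defaults are assumed from the text.
std::vector<std::string> tokenize_sketch(const std::string& str,
                                         const std::string& delimiters = " \t,:;",
                                         const std::string& delimiters_preserve = "",
                                         const std::string& quote = "\"'",
                                         const std::string& esc = "\\")
{
    std::vector<std::string> result;
    std::string token;
    char current_quote = 0;  // 0 means "not inside quotes"

    for (std::string::size_type pos = 0; pos < str.length(); ++pos) {
        char ch = str[pos];
        bool add_char = true;        // assume an ordinary character
        bool token_complete = false;
        bool add_delimiter = false;

        if (esc.find(ch) != std::string::npos) {
            // Escape character: take the next character literally, if any.
            if (++pos < str.length()) ch = str[pos];
            else add_char = false;
        } else if (quote.find(ch) != std::string::npos &&
                   (current_quote == 0 || current_quote == ch)) {
            // Opening quote, or the matching closing quote: toggle the state.
            current_quote = (current_quote == 0) ? ch : 0;
            add_char = false;
        } else if (current_quote == 0) {
            // Delimiters are only recognized outside of quotes.
            if (delimiters.find(ch) != std::string::npos) {
                token_complete = true;
                add_char = false;
            } else if (delimiters_preserve.find(ch) != std::string::npos) {
                token_complete = true;
                add_delimiter = true;
                add_char = false;
            }
        }

        if (add_char) token.push_back(ch);
        if (token_complete && !token.empty()) {
            result.push_back(token);
            token.clear();
        }
        if (add_delimiter) result.push_back(std::string(1, ch));
    }

    if (!token.empty()) result.push_back(token);  // trailing token, if any
    return result;
}
```

Calling `tokenize_sketch("sqrt(17 * (20 + a))", " ", "()*+")` yields the ten tokens shown in the demo output: `sqrt ( 17 * ( 20 + a ) )`.
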
The loop iterates through the string character by character. It performs several tests on each character to decide what to do with it. Before any test is done, it is assumed that the character isn't one of the special characters:

    string::size_type len = str.length();
    while ( len > pos ) {
        ch = str.at(pos);
        delimiter = 0;
        bool add_char = true;

After extracting the character from the string, a check is done to see if the character belongs to the group of escape characters. If so, the position is increased by one to get the next character, if there is at least one more left. It's unnecessary to perform any further tests because an escaped character will be added to the current token regardless of what it is:

        if ( string::npos != esc.find_first_of(ch) ) {
            ++pos;
            if ( pos < len ) {
                ch = str.at(pos);
                add_char = true;
            } else {
                add_char = false;
            }
            escaped = true;
        }

After that, if the character belongs to the group of quote characters, it is checked to see if there's an open quote. If the open-quote state is set and the character matches the quote that opened it, the state is cleared; if the state is not set, it will be set. In the open-quote state, no delimiter checks will be done, and any delimiter, or any quote character other than the one that opened the quote, will be added to the current token:

        if ( false == escaped ) {
            if ( string::npos != quote.find_first_of(ch) ) {
                if ( false == quoted ) {
                    quoted = true;
                    current_quote = ch;
                    add_char = false;
                } else if ( current_quote == ch ) {
                    quoted = false;
                    current_quote = 0;
                    add_char = false;
                }
            }
        }

If the character doesn't match one of the above groups, it is checked to see if it belongs to the group of delimiters.
If it does, and the token string isn't empty, the token is flagged to be complete:

        if ( false == escaped && false == quoted ) {
            if ( string::npos != delimiters.find_first_of(ch) ) {
                if ( false == token.empty() ) {
                    token_complete = true;
                }
                add_char = false;
            }
        }

...and if the delimiter should be preserved, it will be indicated by the add-delimiter flag:

        bool add_delimiter = false;
        if ( false == escaped && false == quoted ) {
            if ( string::npos != delimiters_preserve.find_first_of(ch) ) {
                if ( false == token.empty() ) {
                    token_complete = true;
                }
                add_char = false;
                delimiter = ch;
                add_delimiter = true;
            }
        }

If the character isn't a special character, it will be appended to the end of the token string:

        if ( true == add_char ) {
            token.push_back( ch );
        }

If the token isn't empty and flagged to be complete, it is added to the results and reset for the next token:

        if ( true == token_complete && false == token.empty() ) {
            result.push_back( token );
            token.clear();
            token_complete = false;
        }

If the delimiter is flagged as a preserved one, it will be added to the results as a token:

        if ( true == add_delimiter ) {
            string delim_token;
            delim_token.push_back( delimiter );
            result.push_back( delim_token );
        }

When the loop is finished and the input string doesn't end with a delimiter, there may be a token left that hasn't been added yet because the token-complete flag is only set in the delimiter tests - or there could be an unclosed quote. Whatever the reason, if the token buffer isn't empty, it will be added to the results.

Points of interest

This is the second approach for the implementation. In the original function, I put all the special characters into one string and retrieved the position of one of these characters with the

After thinking for a few minutes about it, I thought I could iterate through the string character by character in the function and look if the character belongs to any of the special character groups.
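
To make the contrast with the first approach more concrete, here is a minimal sketch of a position-based split. It is an assumption about what such a function could look like, not the original code: it assumes `std::string::find_first_of` is used to locate the next delimiter, the name `split_by_position` is hypothetical, and quotes, escapes and preserved delimiters are left out to keep it short.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: splits on delimiters only (no quotes, escapes or
// preserved delimiters) by searching for positions instead of walking the
// string character by character.
std::vector<std::string> split_by_position(const std::string& str,
                                           const std::string& delimiters)
{
    std::vector<std::string> result;
    std::string::size_type begin = 0;

    while (begin < str.length()) {
        // Position of the next delimiter, or the end of the string.
        std::string::size_type end = str.find_first_of(delimiters, begin);
        if (end == std::string::npos) end = str.length();

        // Copy the substring [begin, end) unless it is empty.
        if (end > begin) result.push_back(str.substr(begin, end - begin));

        begin = end + 1;  // continue behind the delimiter
    }
    return result;
}
```

Here the token is copied in one piece from its begin and end positions, whereas the character-by-character loop builds it up incrementally.
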
The difference in the two approaches is that for the first approach, I have the positions (begin and end) of the substring to copy into the token-string, and in the second one, I just append the characters to a token string and clear it every time a delimiter is found.

Conclusion

I want to thank you, the reader, for your interest and feedback, and I want to thank the kind people who told me in the Lounge how to write a good introduction. However, I don't know if it turned out to be a good one. I don't know whether the words 'tokenizer, tokenizing' exist or not - for 'tokenized', the dictionary says something like 'translated to tokens', but the meanings should have become clear anyhow ;) Any unanswered questions? Feel free to ask. :)

License

The zlib/libpng license.

This software is provided 'as-is', without any express or implied warranty. In no event will the authors be held liable for any damages arising from the use of this software. Permission is granted to anyone to use this software for any purpose, including commercial applications, and to alter it and redistribute it freely, subject to the following restrictions:
History