A handy tokenizer function using the STL

Posted 1 Mar 2006

A handy and customizable tokenizer function that works with STL strings.
This article presents a tokenizer function that provides a highly customizable way of breaking up strings. I wrote it because std::string doesn't supply methods to efficiently break up its contents, and I didn't want to pull in another class to do it. The function is implemented using only the methods provided by std::string.


For my CSV-like text file class, which I will present in another article, I needed a function that could break up strings into a series of tokens. A search on Google for the term "tokenizer" turned up only one useful thing: the boost::tokenizer class. After tinkering with it a bit, I decided to implement my own function, because I didn't want to define types for the various TokenizerFunction models. However, I liked the features provided by the Boost class and implemented some of them in my function.


  • All of the delimiter-, quote-, and escape characters are 100% customizable.
  • Multiple characters can be specified for each delimiter-group.
  • Quote text to protect it from being tokenized.
  • Escape single characters to protect them.
  • Optionally keep delimiters as a token.

Using the code

To use the function, you just need to provide an input string, a vector that will receive the output, and the various delimiters. Optionally, you can pass quote and/or escape characters.

The defaults for the delimiters are the common CSV ones (Space, TAB, Comma, Colon, Semicolon). The default quotes are (" and '), and the default escape character is the backslash (\). No delimiter characters will be preserved, by default.

The function prototype:

void tokenize ( const string& str, vector<string>& result,
  const string& delimiters, const string& delimiters_preserve,
  const string& quote, const string& esc );

str — The input string.

This is the original string that will be tokenized.


result — The tokens.

This vector receives all the generated tokens.


delimiters — The delimiters that split the input string.

Default: the common CSV ones (space, TAB, comma, colon, semicolon)


delimiters_preserve — The delimiters that split the input string and appear in the result as tokens themselves.

No default characters.


quote — The quote characters.

A matching pair of quote characters protects the enclosed text.

Default: " and '


esc — The escape characters.

An escape character protects the single character that follows it.

Default: backslash (\)


#include <iostream>
#include <string>
#include <vector>
#include "tokenizer.h"

using namespace std;

// a string whose contents will be tokenized
string input;

// define the characters that will break the string
string delimiter = ",\t";  // use comma and tab

// define the characters that will break the string
// and generate tokens themselves
string keep_delim = ";:";  // use semicolon and colon

// define the characters that will protect the enclosed text
string quote = "\'\"";  // use single quote and double quote

// define the characters that will protect the following
// character
string esc = "\\#";  // use backslash and the hash sign

// vector that receives the tokens for input
vector<string> tokens;

tokenize ( input, tokens, delimiter, keep_delim, quote, esc );

// to use the tokens, define a token-iterator
vector<string>::iterator token;

// and simply iterate through the tokens
for ( token = tokens.begin(); tokens.end() != token; ++token )
    cout << *token << endl;

The demo application

By simply running the demo application, you will get the following output:

Demo application for the tokenizer function.
The tokens are in []:

This;string,is for      demonstration.


Delimiters can be preserved: sqrt(17 * (20 + a))
[sqrt]  [(]     [17]    [*]     [(]     [20]    [+]     [a]     [)]     [)]



You can also pass parameters on the command line, or edit and use the included batch file to run the demo:

TokenizerDemo   filename [delimiters] [preserved delimiters] 
         [quote chars] [escape chars]
  • All parameters except filename are optional, but you cannot skip a parameter in the middle. E.g. if you don't want to provide quote chars but need the escape chars, you must pass an empty parameter, like this: "".
  • You must quote the space character if you want to use it as a delimiter. E.g. if you want to use comma, semicolon, space, and colon: ",; :".
  • A " must be quoted, too, like this: """.
  • Only the first 15 lines of a file will be processed.

How it works

Essentially, the string is iterated character by character, and each character is appended to the token string. Every time a character belongs to a delimiter, the token string is saved in a list and cleared for the next token. Furthermore, checks for special cases, like quotes, are made.

Implementation details

The first part of the function clears the result vector, and initializes variables that hold the current position of the character in the string, the state of quotes, and the current token. The second part is the loop that performs the splitting, and the third part adds the remaining token, if there is one left, to the result.

The loop:

For every character in the string
    Test if it is an escape character
        If yes, skip all other tests
    Test if it is a quote character
        If yes, skip all other tests
    Test if it is a delimiter
        Token is complete
    Test if it is a delimiter which should be preserved
        Token is complete
        flag the delimiter to be added

    Append the character to the current token if it isn't 
    a special one.
    If the token is complete and not empty
        add the token to the results
    If the delimiter is preserved
        add it to the results

The loop iterates through the string character by character. It performs several tests on a character to be able to decide what to do with it. Before doing any test, it is assumed that the character isn't one of the special characters:

string::size_type len = str.length();
while ( len > pos ) {
    char ch =;
    ++pos;
    char delimiter = 0;
    bool add_char = true;

After extracting the character of the string, a check is done to see if the character belongs to the group of escape characters. If it is found so, the position is increased by one to get the next character, if there is at least one more left. It's unnecessary to perform any further tests because an escape character will be added to the current token regardless of what it is:

if ( string::npos != esc.find_first_of(ch) ) {
    if ( pos < len ) {
        ch =;
        ++pos;
        add_char = true;
    } else {
        add_char = false;
    }
    escaped = true;
}

After that, and if the character belongs to the group of quote-characters, it is checked to see if there's an open quote. If the open-quote state is set, it will be closed, if not it will be set. In the "open-quote" state, no delimiter-checks will be done, and any special character will be added to the current token:

if ( false == escaped ) {
    if ( string::npos != quote.find_first_of(ch) ) {
        if ( false == quoted ) {
            quoted = true;
            current_quote = ch;
            add_char = false;
        } else if ( current_quote == ch ) {
            quoted = false;
            current_quote = 0;
            add_char = false;
        }
    }
}

If the character doesn't match one of the above groups, it is checked to see if it belongs to the group of delimiters. If it does, and the token string isn't empty, the token is flagged to be complete:

if ( false == escaped && false == quoted ) {
    if ( string::npos != delimiters.find_first_of(ch) ) {
        if ( false == token.empty() ) {
            token_complete = true;
        }
        add_char = false;
    }
}

...and if the delimiter should be preserved, it will be indicated by the add-delimiter flag:

bool add_delimiter = false;
if ( false == escaped && false == quoted ) {
    if ( string::npos != delimiters_preserve.find_first_of(ch) ) {
        if ( false == token.empty() ) {
            token_complete = true;
        }
        add_char = false;
        delimiter = ch;
        add_delimiter = true;
    }
}

If the character isn't a special character, it will be appended to the end of the token-string:

if ( true == add_char ) {
    token.push_back( ch );
}

If the token isn't empty and flagged to be complete, it is added to the results and reset for the next token:

if ( true == token_complete && false == token.empty() ) {
    result.push_back( token );
    token.clear();
    token_complete = false;
}

If the delimiter is flagged as a preserved one, it will be added to the results as a token:

if ( true == add_delimiter ) {
    string delim_token;
    delim_token.push_back( delimiter );
    result.push_back( delim_token );
}

When the loop is finished and the input string doesn't end with a delimiter, there may be a token left that hasn't been added yet, because the token-complete flag is only set in the delimiter tests - or there could be an unclosed quote. Whatever the reason, if the token buffer isn't empty, it is added to the results.

Points of interest

This is the second approach to the implementation. In the original function, I put all the special characters into one string and retrieved the position of the next such character with the string::find_first_of method. This turned out to be unwieldy, because I had to double-check and handle exceptions like quotes and escaped characters.

After thinking about it for a few minutes, I decided to iterate through the string character by character and check whether each character belongs to any of the special character groups. The difference between the two approaches is that in the first one, I have the positions (begin and end) of the substring to copy into the token-string, while in the second one, I just append characters to a token string and clear it every time a delimiter is found.
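For comparison, the abandoned position-based approach can be sketched roughly like this. This is a hypothetical reconstruction (the function name tokenize_simple is mine), without quote or escape handling - which is exactly the bookkeeping that made it unwieldy:

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical sketch of the position-based first approach: find the
// next delimiter with find_first_of and copy the substring in between.
// Quotes and escapes would require extra checks around every match,
// which is what motivated the character-by-character rewrite.
void tokenize_simple( const std::string& str,
    std::vector<std::string>& result,
    const std::string& delimiters = " \t,:;" )
{
    result.clear();
    std::string::size_type begin = str.find_first_not_of( delimiters );
    while ( std::string::npos != begin ) {
        // end is npos when the last token runs to the end of the string;
        // substr clamps the count, so this still copies the right text
        std::string::size_type end = str.find_first_of( delimiters, begin );
        result.push_back( str.substr( begin, end - begin ) );
        begin = str.find_first_not_of( delimiters, end );
    }
}
```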


I want to thank you, the reader, for your interest and feedback, and I want to thank the kind people who told me in the Lounge how to write a good introduction. However, I don't know if it turned out to be a good one.

I don't know whether the words 'tokenizer, tokenizing' exist or not - for 'tokenized', the dictionary says something like 'translated to tokens', but the meanings should have become clear anyhow ;)

Any unanswered questions? Feel free to ask. :)


The zlib/libpng license.

This software is provided 'as-is', without any express or implied warranty. In no event will the authors be held liable for any damages arising from the use of this software. Permission is granted to anyone to use this software for any purpose, including commercial applications, and to alter it and redistribute it freely, subject to the following restrictions:

  1. The origin of this software must not be misrepresented; you must not claim that you wrote the original software. If you use this software in a product, an acknowledgment in the product documentation would be appreciated but is not required.
  2. Altered source versions must be plainly marked as such, and must not be misrepresented as being the original software.
  3. This notice may not be removed or altered from any source distribution.


  • 2006-02-10 - Initial version.
  • 2006-03-05 - Bug fix and some minor article changes. Thanks Elias for pointing this one out.


This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


Last Updated 7 Mar 2006
Article Copyright 2006 by Joerg Wiedenmann