Click here to Skip to main content
Click here to Skip to main content
Technical Blog

Splitting strings again – strtok redeemed

, 21 Oct 2012 BSD
Rate this:
Please Sign up or sign in to vote.
There are no standard string tokenisers in C++, except for strtok from the C library. While there are many tokenisers written by fellow programmers available, they tend to forget one of the strengths of strtok, which I implement in C++ here.

The C++ source files for the string tokenisers discussed in this post and the Splitting strings post, plus the code for Removing whitespace and Static assert in C++, can be found here:
http://coolcowstudio.co.uk/source/cpp/utilities.zip.
 

One of the more curious omissions from the C++ standard library is a string splitter, e.g. a function that can take a string and split it up into its constituent parts, or tokens, based on some delimiter. There is one in other popular languages ((C# – String.Split, Java – String.split, Python – string.split etc), but C++ programmers are left to roll their own, or use one from a third-party library like the boost::tokenizer (or the one I presented in Splitting strings). 

There are many ways of going this; the Stack Overflow question How do I tokenize a string in C++? has 23 answers at the time of writing, and those contain 20 different solutions (boost::tokenizer and strtok are suggested multiple times).

The strtok recommendations, however, all have comments pointing out the problems with this function – it’s destructive, and not reentrant (it can’t be nested or run in parallell on multiple strings). As functions go, strtok has a rather poor reputation – there’s even a popular reentrant version, strtok_r, available in many C library implementations, though it’s not a standard function.

So it’s a good thing that there are so many other ways of splitting strings, isn’t it? Well, yes, but the clever thing with strtok, which you don’t get from any of the string splitters, not even the flexibility-obsessed Boost, is that you can change the token separator as you go along. The splitters often give the option to provide a selection of possible delimiters like this:

  char punct = " ,.?!;:-";
  std::string text = "Silly, right? This is - obviously - an example: Boo!";
  
  std::vector<std::string> tokens;
  std::vector<char> separators(punct, punct + strlen(punct));
  
  tokenise_string(text, separators, tokens);
  // tokens now contains the eight words of the string

But what if there are characters that are both legal inside some tokens, and separators for others? What if you have to parse strings with the format [Last name],[First name] – [Profession] – [Age] like this table:

Bradshawe, Adam - Colonel, retired - 73
Burton-West,Jenny - Surgeon - 37
Smith,Ben - Taxi driver - 56

While a string splitter would stumble over this – since both space, hyphen and comma can be both delimiters and valid content – old strtok doesn’t even raise an eyebrow:

  const char* table = "Jones, Adam - Colonel, retired - 73\n"
    "Burton-West,Jenny - Surgeon - 37\n"
    "Smith,Ben - Taxi driver - 56";
  char* changeable = (char*) malloc(strlen(table)); // cast not needed in C
  strcpy(changeable, table);

  char *last, *first, *profession, *age;
  // Start it off with the first search
  last = strtok(changeable, " ,");
  while (last)
  {
    first = strtok(NULL, " -");
    profession = strtok(NULL, "-");
    // Only snag - this probably has a trailing space we need to trim
    int proflen = strlen(profession);
    if (profession[proflen - 1] == ' ')
      profession[proflen - 1] = '\0';
      
    age = strtok(NULL, " -\n");
    // Have whole row, take care of data
    // ...
    // Start next row, if any
    last = strtok(NULL, " ,");
  }
  free(changeable);

In other words, strtok offers unique functionality not found in the string splitters. So let’s design a C++ version that is efficient, non-destructive, and reentrant. Thanks to the object-orientation support of C++, we can let each tokeniser have a const reference to the string we’re tokenising. This gives us both reentrancy and non-destructiveness. And while we’re at it, let’s have a flag to decide whether we should include empty tokens or not – something quite useful that’s missing from strtok.

  class string_tokeniser
  {
    // The string we're searching in
    const std::string& source_;
    // Flag indicating whether to include empty strings
    bool empty_;
    // Current location in string
    std::string::size_type current_;
    // Length of current string
    std::string::difference_type length_;
    // Location to start next search
    std::string::size_type next_;
    
  public:
    // Constructor, setting the string to work on
    string_tokeniser(const std::string& source, bool empty = false);
    ...

We’ll also need variables to keep track of where we are, where to start looking for next token, and so on, which I’ve also included above.

Now, what do we want to do with this? Well, we actually want to do two distinct tasks – advance to the next token, and extract a token. In strtok, those are done at the same step, but since we’re not constrained by the limitations of C, it’s better to keep it tidy. Like I did in my string splitter, I’ll overload the advancing function to allow the user to give a single character, a selection of characters, or a whole string as a delimiter.

    ...
    // Advance to next token, by given character separator
    bool next(const std::string::value_type& separator);

    // Advance to next token, by any of given character separators
    bool next(const std::vector<char>& separators);

    // Advance to next token, by given string separator
    bool next(const std::string& separator);
    ...

These all return true if a new token was found in the string. We’ll also need a way of accessing the tokens found, and some housekeeping:

    ...
    // Get current token, if any, safely
    bool get_token(std::string& token) const;

    // Get current token, if any, otherwise an empty token
    std::string get_token() const;

    // Reset search
    void reset();

    // Check token availability
    bool has_token() const;

    // Check if search is at end
    bool at_end() const;

    // Get source string
    const std::string& get_source() const;
  };

That’s the interface, let’s do some implementation. The way this will work is that we keep hold of a current location and token length, which are used to retrieve the token using std::string::substr, while also keeping the next location in which to start the search for a token. This will have to be next_ = current_ + length_ + length of delimiter, so the next search does not pick up the last delimiter.

When the string_tokeniser is first created, we have no search results, so need to initialise appropriately. The same values are set on a reset, and used to get the current token and check status:

  // Constructor
  string_tokeniser::string_tokeniser(const std::string& source, bool empty /*= false*/)
    : source_(source)
    , current_(std::string::npos)
    , length_(0)
    , next_(source.empty() ? std::string::npos : 0)
    , empty_(empty)
  {}

  // Reset search so it can be restarted
  void string_tokeniser::reset()
  {
    current_ = std::string::npos;
    next_ = source_.empty() ? std::string::npos : 0;
    length_ = 0;
  }

  // Return true if there is a current token
  bool string_tokeniser::has_token() const
  {
    // Not worried about length here, as it might be an empty token
    return (std::string::npos != current_);
  }

  // Return true if no further searches can be done
  bool string_tokeniser::at_end() const
  { 
    return (std::string::npos == next_);
  }

  // Get source string
  const std::string& string_tokeniser::get_source() const
  {
    return source_;
  }

  // Get current token, if any, safely
  bool string_tokeniser::get_token(std::string& token) const
  {
    if (!has_token())
      return false;
    if (0 == length_)
      token.clear();
    else
      token = source_.substr(current_, length_);
    return true;
  }

  // Get current token, if any, otherwise an empty token
  std::string string_tokeniser::get_token() const
  {
    if (!has_token() || (0 == length_))
      return std::string();
    return source_.substr(current_, length_);
  }

Right, that just leaves the implementation of the key function: next(). Since there are three overloads, with almost identical implementation, the sensible thing is to break out most of the common stuff into a helper function. Unfortunately, we can’t easily do that, since if we do not care about empty tokens, we have to recurse and try to find the next, in the case of repeated delimiters, which means it will have to be aware of which overload to chose. Instead, we’ll break out the common handling into a template function:

In class declaration:
    ...
    // Helper - handle the result of a search, advancing to prepare for next
    template <typename T>
    bool handle_next(size_t advance, const T& separator);
  public:
    // Constructor, setting the string to work on
    ...

Implementation:
  // Advance to next token
  bool string_tokeniser::next(const std::string::value_type& separator)
  {
    // Store the start
    current_ = next_;
    if (at_end())
      return false;
    // Find next
    next_ = source_.find(separator, current_);
    // Deal with result of search
    return handle_next(1, separator);
  }

  // Advance to next token
  bool string_tokeniser::next(const std::string& separator)
  {
    // Store the start
    current_ = next_;
    if (at_end())
      return false;
    // Find next
    next_ = source_.find(separator, current_);
    // Deal with result of search
    return handle_next(separator.size(), separator);
  }

  // Advance to next token
  bool string_tokeniser::next(const std::vector<char>& separators)
  {
    // Store the start
    current_ = next_;
    if (at_end())
      return false;
    // Find next
    next_ = source_.find_first_of(&separators[0], current_, separators.size());
    // Deal with result of search
    return handle_next(1, separators);
  }

  // Handle the result of a search, advancing to prepare for next search
  template <typename T>
  bool string_tokeniser::handle_next(size_t advance, const T& separator)
  {
    if (std::string::npos == next_)
    {
      // Separator not found, but there might still be data, at the end 
      length_ = source_.size() - current_;
    }
    else
    {
      // Store the length of the current token
      length_ = next_ - current_;
      // and move next starting point to beyond the one we found
      next_ += advance;
      // In the case of double separators (e.g. | in "a|b||d"), this gives an 
      // empty token. If empties aren't accepted, we'll recurse
      if ((0 == length_) && !empty_)
      {
        return next(separator);
      }
    }
    // Do we have a token?
    if (0 < length_)
      return true;
    // Even if empties are accepted, dismiss an empty token at the end of a 
    // string (e.g. "a|b|" gives "a" and "b" only)
    if (!empty_ || (std::string::npos == next_))
    {
      // Invalidate current, so extraction isn't valid
      current_ = next_;
      return false;
    }
    return true;    
  }

There, all done. And because we can use both single characters, a selection of characters, and strings as delimiters, the equivalent of the strtok example avoids the need to trim spaces:

  std::string table = "Jones, Adam - Colonel, retired - 73\n"
    "Burton-West,Jenny - Surgeon - 37\n"
    "Smith,Ben - Taxi driver - 56";

  std::string last, first, profession, age;
  // Prepare delimiters to use
  std::vector<char> comma_space;
  comma_space.push_back(' ');
  comma_space.push_back(',');
  std::string sp_dash_sp(" - ");
  char endl('\n');

  string_tokeniser tok(table);
  while (!tok.at_end())
  {
    tok.next(comma_space);
    tok.get_token(last);
    tok.next(sp_dash_sp);
    tok.get_token(first);
    tok.next(sp_dash_sp);
    tok.get_token(profession);
    tok.next(endl);
    tok.get_token(age);
  }

Because it is non-destructive, it is by necessity less efficient than strtok, since the tokens have to be copied. On the other hand, if you have to copy the const char* to a char* buffer to use strtok, maybe the efficiency loss isn’t that bad.

As always, if you found this interesting or useful, or have suggestions for improvements, please let me know.

Update: Jens Ayton has informed me that C11 introduced strtok_s, which is even safer, but not, I believe, incorporated into C++11 .

License

This article, along with any associated source code and files, is licensed under The BSD License

Share

About the Author

Orjan Westin

United Kingdom United Kingdom
Orjan has worked as a professional developer - in Sweden and England - since 1993, using a wide range of languages (C++, Pascal, Delphi, C, C#, Visual Basic, PHP, Python and x86 assembler), but tends to return to C++.

Comments and Discussions

 
QuestionMessage Removed PinprofessionalWong Shao Voon12-Jun-14 17:37 
Questionstrtok_r easily available PinmemberWes Peters12-Jun-14 14:15 
AnswerRe: strtok_r easily available PinmemberOrjan Westin19-Jun-14 4:28 

General General    News News    Suggestion Suggestion    Question Question    Bug Bug    Answer Answer    Joke Joke    Rant Rant    Admin Admin   

Use Ctrl+Left/Right to switch messages, Ctrl+Up/Down to switch threads, Ctrl+Shift+Left/Right to switch pages.

| Advertise | Privacy | Terms of Use | Mobile
Web02 | 2.8.141216.1 | Last Updated 21 Oct 2012
Article Copyright 2012 by Orjan Westin
Everything else Copyright © CodeProject, 1999-2014
Layout: fixed | fluid