Template String Tokenizer

Nick Alexeev

Dec 6, 2006

CPOL

A template string tokenizer class that works with both CStringArray and CStringList.

Introduction

Often we need to parse a string and store the fragments in an array or a list. For example, we might need to parse a line from a comma-separated value (CSV) file or an NMEA sentence. MFC provides the CStringArray and CStringList classes for handling arrays and lists of strings, respectively. The idea of this submission is simple: a tokenizer class that inherits publicly from either CStringArray or CStringList, depending on a template parameter. Once the string is tokenized, the calling code can access the tokens by calling the methods of the parent collection class directly.

Parameterized inheritance

[Figure: OR-ed template inheritance (UML)]

Both of these collection classes inherit from CObject, so they carry MFC run-time class information (CRuntimeClass). The CStringTokenizer constructor uses this information to check that T is one of the two admissible parent classes, and asserts if it is anything else.

template <class T> // T is either a CStringArray or CStringList
class CStringTokenizer : public T
{
public:
 enum OPTIONS
 {
  IGNORE_EMPTY_TOKENS = 0x01,
  TERMINATING_STRING = 0x02
 };
 CStringTokenizer();
 // pStrTerminate defaults to a null pointer (no terminator)
 UINT Tokenize(CString& strSrc, LPCTSTR pStrDelimit, 
      LPCTSTR pStrTerminate = NULL, UINT iOffset = 0);
 virtual ~CStringTokenizer() {;} 
 void AddOptions(OPTIONS iOptions)  {m_iFlags |= iOptions;}
 void RemoveOptions(OPTIONS iOptions) {m_iFlags &= ~iOptions;}
protected:
 // helper function with 2 separate implementations
 // (template specialization) for CStringArray and CStringList
 void Add(LPCTSTR pStrNew);
 UINT m_iFlags;   // tokenization options
 CMutex m_mtxTokenize;  // serves to make the Tokenize() method non-reentrant
};

// PURPOSE:   Initialize the new CStringTokenizer object.
// Check the type of the parent class T.
// PRECONDITIONS: T is either CStringArray or CStringList
template <class T>
CStringTokenizer<T>::CStringTokenizer()
 : m_mtxTokenize(FALSE)  // the mutex is free for the taking
{
 m_iFlags = IGNORE_EMPTY_TOKENS;
 CRuntimeClass* pRTC = T::GetRuntimeClass();
 if (RUNTIME_CLASS(CStringArray) == pRTC) return; // admissible parent class
 if (RUNTIME_CLASS(CStringList) == pRTC)  return; // same as above
 ASSERT(FALSE);
}
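
The constructor above checks the parent class at run time, and only in debug builds where ASSERT is active. With a C++11 compiler, the same constraint could probably also be expressed at compile time. The following is a minimal sketch of that idea, not the article's code: it assumes std::vector and std::list as stand-ins for CStringArray and CStringList so that it builds without MFC, and the names Tokens, StringArray, and StringList are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <string>
#include <type_traits>
#include <vector>

// Stand-ins for CStringArray and CStringList (assumption: no MFC here).
using StringArray = std::vector<std::string>;
using StringList  = std::list<std::string>;

// Hypothetical tokenizer skeleton that inherits from its template parameter.
template <class T>
class Tokens : public T
{
    // Rejects any other parent class when the template is instantiated.
    static_assert(std::is_same<T, StringArray>::value ||
                  std::is_same<T, StringList>::value,
                  "T must be StringArray or StringList");
};

// Stores a few strings through the inherited interfaces and
// returns the total number of stored strings.
inline std::size_t demoParameterizedInheritance()
{
    Tokens<StringArray> arr;
    arr.push_back("alpha");     // inherited from std::vector
    assert(arr[0] == "alpha");  // array-style access still works

    Tokens<StringList> lst;
    lst.push_back("beta");      // inherited from std::list
    lst.push_front("gamma");    // list-only method, still reachable

    // Tokens<std::string> bad; // would fail to compile (static_assert)
    return arr.size() + lst.size();
}
```

The standard containers have no virtual destructor, so this sketch should not be deleted through a pointer to the base class; it only illustrates the inheritance pattern and the compile-time check.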

Template specialization helps to smooth-out the difference between CStringArray and CStringList

The addition of a new token to a collection is the only place where the tokenizer code has to interact with the parent collection class. Unfortunately, between CStringArray and CStringList, there isn't a method with a common name for adding a new member to a collection. CStringList has AddHead() and AddTail(), while CStringArray has Add().

At first, I tried to solve this problem with the RTTI built into the MFC framework, writing code that would choose the appropriate method at run time. This approach failed to compile: both branches are compiled for every T, so the call to Add() is rejected when T is CStringList, and the call to AddTail() is rejected when T is CStringArray. Then someone suggested template specialization, and it worked! I declared my own Add() method and wrote two separate implementations, one for each parent class.

// PURPOSE: A helper function, which does template specialization
// and resolves the difference between CStringArray and CStringList
template <>
void
CStringTokenizer<CStringArray>::Add(LPCTSTR pStrNew)
// [in] a string to be added to array
{
 TRY
 {
  CStringArray::Add(pStrNew); // can throw CMemoryException
 }
 CATCH(CMemoryException, pExc)
 {
  THROW(pExc);    // rethrow to the calling code
 }
 END_CATCH
}

// PURPOSE: A helper function, which does template
// specialization and resolves the difference
// between CStringArray and CStringList
template <>
void
CStringTokenizer<CStringList>::Add(LPCTSTR pStrNew)
// [in] a string to be added to list
{
 TRY
 {
  CStringList::AddTail(pStrNew); // can throw CMemoryException
 }
 CATCH(CMemoryException, pExc)
 {
  THROW(pExc);    // rethrow to the calling code
 }
 END_CATCH
}
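
For readers who want to try the specialization technique without MFC, here is a minimal portable sketch. It is an illustration under stated assumptions, not the article's code: ArrayLike and ListLike are hypothetical stand-ins that, like CStringArray and CStringList, expose differently named insertion methods, and Tokens and Append are made-up names.

```cpp
#include <cassert>
#include <list>
#include <string>
#include <vector>

// Hypothetical stand-ins with differently named insertion methods.
struct ArrayLike
{
    std::vector<std::string> items;
    void Add(const std::string& s) { items.push_back(s); }
};

struct ListLike
{
    std::list<std::string> items;
    void AddTail(const std::string& s) { items.push_back(s); }
};

template <class T>
class Tokens : public T
{
public:
    // Declared once; defined only through the two specializations below.
    void Append(const std::string& s);
};

// Specialization for the array-like parent: forwards to Add().
template <>
inline void Tokens<ArrayLike>::Append(const std::string& s)
{
    ArrayLike::Add(s);
}

// Specialization for the list-like parent: forwards to AddTail().
template <>
inline void Tokens<ListLike>::Append(const std::string& s)
{
    ListLike::AddTail(s);
}

// Appends through both specializations and concatenates the results.
inline std::string demoSpecializedAppend()
{
    Tokens<ArrayLike> a;
    a.Append("x");
    a.Append("y");
    Tokens<ListLike> l;
    l.Append("z");
    return a.items[0] + a.items[1] + l.items.front();
}
```

Each specialization is compiled only for its own parent class, so neither body ever refers to a method that the parent lacks. Instantiating Tokens with any other type and calling Append() would fail at link time, because no generic definition exists.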

Tokenization

Call the Tokenize(...) function to tokenize a string. After this call, you can deal with the tokens through the methods of CStringArray and CStringList. Note that the new tokens are appended to the collection, and Tokenize(...) doesn't remove the old tokens.

// PURPOSE:   Tokenize the string and APPEND the tokens into the parent collection class
// POSTCONDITIONS: The original string remains intact. The method can throw CMemoryException.
template <class T>
UINT // Offset of the next character after the terminator.
     // The return value and the iOffset parameter
     // can be used for parsing one string with successive calls to Tokenize().
CStringTokenizer<T>::Tokenize(CString& strSrc,   // [in] a string that will be tokenized. 
         LPCTSTR pStrDelimit, // [in] a set of delimiting characters
         LPCTSTR pStrTerminate, // [in] a set of terminating characters, 
                                // or a terminating sequence, depending on options
         UINT  iOffset)  // [in] Tokenization will start at this offset. Defaulted to zero
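
The body of Tokenize() is not reproduced here. To illustrate how the return value and the iOffset parameter can chain successive calls, the following is a rough sketch in standard C++. It is not the author's implementation: tokenizeFrom is a hypothetical free function, it always skips empty tokens, it treats the terminators as a set of characters, and returning the string length when no terminator is found is an assumption of this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Appends the tokens of src, starting at offset, to out. Stops at the first
// character that belongs to terminators. Returns the offset just past that
// terminator, or src.size() if no terminator was found (sketch assumption).
inline std::size_t tokenizeFrom(const std::string& src,
                                const std::string& delims,
                                const std::string& terminators,
                                std::size_t offset,
                                std::vector<std::string>& out)
{
    std::size_t end = src.find_first_of(terminators, offset);
    const std::size_t next = (end == std::string::npos) ? src.size() : end + 1;
    if (end == std::string::npos) end = src.size();

    std::size_t pos = offset;
    while (pos < end)
    {
        std::size_t stop = src.find_first_of(delims, pos);
        if (stop == std::string::npos || stop > end) stop = end;
        if (stop > pos)                       // skip empty tokens
            out.push_back(src.substr(pos, stop - pos));
        pos = stop + 1;
    }
    return next;
}

// Parses a two-line string with two successive calls.
inline std::vector<std::string> demoSuccessiveCalls()
{
    const std::string src = "a b\nc d";
    std::vector<std::string> out;
    std::size_t next = tokenizeFrom(src, " ", "\n", 0, out);  // first line
    assert(next == 4);                        // offset just past '\n'
    next = tokenizeFrom(src, " ", "\n", next, out);           // second line
    assert(next == src.size());               // no terminator left
    return out;                               // tokens were appended, not replaced
}
```

As with Tokenize(), the source string is left untouched and the second call appends to whatever the first call stored.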

Options

IGNORE_EMPTY_TOKENS

If there are two delimiters in a row, the token between them is an empty string. By default, this token is ignored. If RemoveOptions() is called with IGNORE_EMPTY_TOKENS, such tokens are added to the collection instead. Keeping empty tokens is useful for parsing CSV files and NMEA strings, where the position of a field carries meaning.
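
The effect of this option can be sketched with a small standalone splitter. This is an assumption-laden illustration rather than the class's code: split is a hypothetical function, and its treatment of a trailing delimiter (one final empty token when empty tokens are kept) is a choice made for this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Splits src at any character found in delims. When ignoreEmpty is false,
// the empty strings between adjacent delimiters are kept.
inline std::vector<std::string> split(const std::string& src,
                                      const std::string& delims,
                                      bool ignoreEmpty)
{
    std::vector<std::string> out;
    std::size_t pos = 0;
    for (;;)
    {
        std::size_t stop = src.find_first_of(delims, pos);
        if (stop == std::string::npos) stop = src.size();
        if (stop > pos || !ignoreEmpty)
            out.push_back(src.substr(pos, stop - pos));
        if (stop == src.size()) break;        // reached the end of the string
        pos = stop + 1;                       // continue after the delimiter
    }
    return out;
}

// Shows why empty tokens matter for column-oriented data.
inline bool demoEmptyTokens()
{
    const std::string row = "1,,3";           // second CSV column is empty
    const std::vector<std::string> kept    = split(row, ",", false);
    const std::vector<std::string> skipped = split(row, ",", true);
    // With empty tokens kept, "3" stays in the third column.
    // With empty tokens ignored, "3" slides into the second column.
    return kept.size() == 3 && kept[2] == "3" &&
           skipped.size() == 2 && skipped[1] == "3";
}
```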

TERMINATING_STRING

If this option is set, the tokenization stops when a terminating substring is encountered. Tokenize(...) treats pStrTerminate as an ordered substring. If this option is not set, the tokenization will stop when a character from a set of terminating characters is encountered. Tokenize(...) treats pStrTerminate as an unordered set of characters.
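
The difference between the two interpretations of pStrTerminate can be shown with the standard string search functions. This is only a sketch of the distinction, assuming std::string in place of CString; findTerminator is a hypothetical helper and not part of the class.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Returns the offset where tokenization would stop, or std::string::npos.
// asString == true  : term is an ordered substring (TERMINATING_STRING set)
// asString == false : term is an unordered set of characters (option not set)
inline std::size_t findTerminator(const std::string& src,
                                  const std::string& term,
                                  bool asString,
                                  std::size_t offset = 0)
{
    return asString ? src.find(term, offset)
                    : src.find_first_of(term, offset);
}

// Compares both interpretations on the same input.
inline bool demoTerminator()
{
    const std::string src = "Mary had a little lamb... for dinner.";
    // As a substring, "dinner" is first found at offset 30.
    const std::size_t asString = findTerminator(src, "dinner", true);
    // As a character set {d,i,n,e,r}, the 'r' in "Mary" already matches.
    const std::size_t asSet = findTerminator(src, "dinner", false);
    return asString == 30 && asSet == 2;
}
```

With the option cleared, a multi-character terminator can therefore end the tokenization much earlier than intended, which is the reason to set TERMINATING_STRING when the terminator is a word or a marker sequence.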

Thread safety notes

Even though a mutex protects the Tokenize(...) method from re-entrancy, the CStringTokenizer class is only partially thread-safe. MFC collection classes such as CStringArray and CStringList are thread-safe at the class level (separate objects can be used from separate threads), but not at the object level, and the mutex does not cover their accessor methods. If a producer thread writes tokens to the CStringTokenizer object by calling Tokenize(...) while a consumer thread reads them through the accessor methods of the parent collection class, the consumer may see a mixture of old and new data.
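
One possible way around this, sketched below under stated assumptions, is to parse into a private collection first and then hand the complete batch to readers in a single locked step. The sketch uses std::mutex and std::vector rather than CMutex and the MFC collections, and SharedTokens is a hypothetical wrapper that does not exist in the article's code.

```cpp
#include <cassert>
#include <mutex>
#include <string>
#include <vector>

// Hypothetical wrapper: readers only ever see a fully parsed batch.
class SharedTokens
{
public:
    // Producer side: replace the visible tokens with a finished batch.
    void publish(std::vector<std::string> fresh)
    {
        std::lock_guard<std::mutex> lock(m_mutex);
        m_tokens.swap(fresh);
    }

    // Consumer side: take a private copy while holding the lock.
    std::vector<std::string> snapshot() const
    {
        std::lock_guard<std::mutex> lock(m_mutex);
        return m_tokens;
    }

private:
    mutable std::mutex m_mutex;             // guards m_tokens
    std::vector<std::string> m_tokens;      // last published batch
};

// Single-threaded walk-through of the publish/snapshot sequence.
inline bool demoSnapshot()
{
    SharedTokens shared;
    shared.publish({"old1", "old2"});
    const std::vector<std::string> before = shared.snapshot();
    shared.publish({"new1"});               // the whole batch changes at once
    const std::vector<std::string> after = shared.snapshot();
    return before.size() == 2 && after.size() == 1 && after[0] == "new1";
}
```

The cost is one copy per read; whether that is acceptable depends on how large the token collections are and how often the consumer polls them.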

Demo application / Test bed

void TestTokenizer()
{
   TRACE("Beginning of template string tokenizer demo\n");

   // a string for parsing
   CString str1 = "She sells sea shells on a sea shore. \nShells  shine.";
   // declare a tokenizer class derived from CStringArray
   CStringTokenizer<CStringArray> strTokArray;
   // Don't ignore the empty tokens. By default, they are ignored.
   strTokArray.RemoveOptions(CStringTokenizer<CStringArray>::IGNORE_EMPTY_TOKENS);
   // tokenize words in the 1st line
   UINT iStartOffset = strTokArray.Tokenize(str1, ". ", "\n");
   TRACE("Tokens in the Array:\n");
   // You can treat the tokenizer just like a regular CStringArray!
   for (int i = 0; i < strTokArray.GetSize(); ++i)
      TRACE("\t%s\n", (LPCTSTR)strTokArray[i]);    // dump the parsed fragments

   // declare a tokenizer class derived from CStringList
   CStringTokenizer<CStringList> strTokList;
   // continue with the same string: tokenize the words after the 1st line
   strTokList.Tokenize(str1, ". ", "\n", iStartOffset);
   TRACE("Tokens in the List:\n");
   // You can treat the tokenizer just like a regular CStringList!
   POSITION pos;   // declared once, so both loops below can use it
   for (pos = strTokList.GetHeadPosition(); pos != NULL; )
      TRACE("\t%s\n", (LPCTSTR)strTokList.GetNext(pos));  // dump the parsed fragments

   // and another string for parsing
   str1 = "Mary had a little lamb... for dinner.";
   strTokList.RemoveAll();   // Tokenize() appends, so clear the old tokens first
   // terminate when a given substring is encountered
   strTokList.AddOptions(CStringTokenizer<CStringList>::TERMINATING_STRING);
   strTokList.Tokenize(str1, ". ", "dinner");  // tokenize
   TRACE("Tokens in the List:\n");
   for (pos = strTokList.GetHeadPosition(); pos != NULL; )
      TRACE("\t%s\n", (LPCTSTR)strTokList.GetNext(pos));  // dump the parsed fragments
   TRACE("End of template string tokenizer demo\n");
}

Conclusion

This idea seems very obvious, yet Googling for 'parser tokenizer CStringArray CStringList template' didn't produce anything similar. Perhaps I simply wasn't looking hard enough.

Of course, there are loads of string tokenizers out there on the web. Most of them have an interface similar to Java's StringTokenizer. I didn't follow this de facto standard. Maybe I should have. On the other hand, my class preserves the original string.

As usual, suggestions, bug notes, comments etc., are most welcome!

References

http://www.codeproject.com/cpp/strtok.asp: Another string tokenizer class on CodeProject.
http://www.codeguru.com/cpp/cpp/cpp_mfc/parsing/article.php/c781/: Yet another string tokenizer class (derived from CObject) on CodeGuru.
http://www.codeproject.com/string/cstringparser.asp
http://www.c-plusplus.de/forum/viewtopic-var-p-is-18971.html: String parser in German.

History

0.1: Initial submission: December 4, 2006.
0.2: Added a mutex to prevent re-entrance; added thread-safety notes: December 29, 2006.
0.3: Changed the tokenization algorithm code slightly; added the TERMINATING_STRING option and updated the demo app to exercise this option; added notes about the options: January 5, 2007.