Template String Tokenizer






A template string tokenizer class that works with both CStringArray and CStringList.
Introduction
Often we need to parse a string and store the fragments in an array or a list. For example, we might need to parse a line from a comma-separated values (CSV) file or an NMEA string. MFC provides the CStringArray and CStringList classes for handling arrays and lists of strings, respectively. The idea of this submission is simple: a tokenizer class that inherits publicly from either CStringArray or CStringList, depending on a template parameter. Once the string is tokenized, the calling code can access the tokens through direct calls to the methods of the parent collection class.
Parameterized inheritance
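Since both of these collection classes inherit from CObject, they support MFC's run-time type identification (RTTI). The CStringTokenizer constructor uses it to check the type of its parent, which prevents the CStringTokenizer class from inheriting from other classes. The check is an ASSERT, so it fires in debug builds only.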
template <class T> // T is either CStringArray or CStringList
class CStringTokenizer : public T
{
public:
    enum OPTIONS
    {
        IGNORE_EMPTY_TOKENS = 0x01,
        TERMINATING_STRING = 0x02
    };
    CStringTokenizer();
    UINT Tokenize(CString& strSrc, LPCTSTR pStrDelimit,
                  LPCTSTR pStrTerminate = NULL, UINT iOffset = 0);
    virtual ~CStringTokenizer() {}
    void AddOptions(OPTIONS iOptions) {m_iFlags |= iOptions;}
    void RemoveOptions(OPTIONS iOptions) {m_iFlags &= ~iOptions;}
protected:
    // helper function with 2 separate implementations
    // (template specialization) for CStringArray and CStringList
    void Add(LPCTSTR pStrNew);
    UINT m_iFlags; // tokenization options
    CMutex m_mtxTokenize; // serves to make the Tokenize() method non-reentrant (CMutex is declared in <afxmt.h>)
};
// PURPOSE: Initialize the new CStringTokenizer object.
// Check the type of the parent class T.
// PRECONDITIONS: T is either CStringArray or CStringList
template <class T>
CStringTokenizer<T>::CStringTokenizer()
    : m_mtxTokenize(FALSE) // the mutex is free for the taking
{
    m_iFlags = IGNORE_EMPTY_TOKENS;
    CRuntimeClass* pRTC = T::GetRuntimeClass();
    if (RUNTIME_CLASS(CStringArray) == pRTC) return; // admissible parent class
    if (RUNTIME_CLASS(CStringList) == pRTC) return; // same as above
    ASSERT(FALSE); // any other parent class is rejected (debug builds only)
}
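The same pattern can be tried outside MFC. The following is only a minimal sketch, assuming std::vector<std::string> and std::list<std::string> as stand-ins for CStringArray and CStringList. The names TokenCollection, AddToken and ExampleParameterizedInheritance are hypothetical and not part of the submitted class. It also swaps the run-time ASSERT for a compile-time static_assert, which is a different technique from the one used in the constructor above.

```cpp
#include <cassert>
#include <list>
#include <string>
#include <type_traits>
#include <vector>

// Hypothetical stand-in for CStringTokenizer: it inherits publicly from
// whichever container is passed as the template parameter.
template <class Container>
class TokenCollection : public Container
{
    // Compile-time counterpart of the run-time ASSERT: instantiating the
    // template with any other parent class fails to compile.
    static_assert(std::is_same<Container, std::vector<std::string>>::value ||
                  std::is_same<Container, std::list<std::string>>::value,
                  "Container must be std::vector<std::string> or std::list<std::string>");
public:
    // Both standard containers offer push_back(), so no specialization is needed here.
    void AddToken(const std::string& token) { this->push_back(token); }
};

// Usage: each object is accessed through the interface of its parent container.
inline void ExampleParameterizedInheritance()
{
    TokenCollection<std::vector<std::string>> asVector;
    asVector.AddToken("alpha");
    asVector.AddToken("beta");
    assert(asVector.size() == 2 && asVector[1] == "beta"); // vector interface

    TokenCollection<std::list<std::string>> asList;
    asList.AddToken("gamma");
    assert(asList.front() == "gamma");                     // list interface
}
```

Note that the standard containers have no virtual destructor, so an object of this sketch class should not be deleted through a pointer to its parent container.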
Template specialization helps to smooth out the difference between CStringArray and CStringList
The addition of a new token to a collection is the only place where the tokenizer code has to interact with the parent collection class. Unfortunately, CStringArray and CStringList have no method with a common name for adding a new member to a collection: CStringList has AddHead() and AddTail(), while CStringArray has Add().
At first, I tried to fix this problem with the RTTI built into the MFC framework, by writing code that would choose the appropriate method at run time. This approach failed to compile. Then someone suggested that I try template specialization, and it worked! I declared my own Add() method and provided two separate implementations, one for the case when CStringArray is the parent and one for the case when CStringList is the parent.
// PURPOSE: A helper function, which does template specialization
// and resolves the difference between CStringArray and CStringList
template <>
void
CStringTokenizer<CStringArray>::Add(LPCTSTR pStrNew)
// [in] a string to be added to the array
{
    TRY
    {
        CStringArray::Add(pStrNew); // can throw CMemoryException
    }
    CATCH(CMemoryException, pExc)
    {
        THROW(pExc); // rethrow to the calling code
    }
    END_CATCH
}
// PURPOSE: A helper function, which does template
// specialization and resolves the difference
// between CStringArray and CStringList
template <>
void
CStringTokenizer<CStringList>::Add(LPCTSTR pStrNew)
// [in] a string to be added to the list
{
    TRY
    {
        CStringList::AddTail(pStrNew); // can throw CMemoryException
    }
    CATCH(CMemoryException, pExc)
    {
        THROW(pExc); // rethrow to the calling code
    }
    END_CATCH
}
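For readers without MFC at hand, the specialization trick can be reproduced with two toy containers. This is a sketch under assumptions: ArrayLike, ListLike, Collector and Append are hypothetical names that merely mimic the Add()/AddTail() naming difference. The run-time approach presumably failed to compile because a single function body that calls both Add() and AddTail() must compile for every T, and each parent has only one of the two methods.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Two hypothetical containers whose "append" methods have different names,
// mirroring CStringArray::Add() and CStringList::AddTail().
struct ArrayLike
{
    std::vector<std::string> items;
    void Add(const std::string& s) { items.push_back(s); }
};
struct ListLike
{
    std::vector<std::string> items;
    void AddTail(const std::string& s) { items.push_back(s); }
};

template <class T>
class Collector : public T
{
public:
    // Declared once in the primary template; defined only by the
    // explicit specializations below.
    void Append(const std::string& s);
};

// Explicit specialization for ArrayLike: forward to Add().
template <>
inline void Collector<ArrayLike>::Append(const std::string& s) { ArrayLike::Add(s); }

// Explicit specialization for ListLike: forward to AddTail().
template <>
inline void Collector<ListLike>::Append(const std::string& s) { ListLike::AddTail(s); }

inline void ExampleSpecialization()
{
    Collector<ArrayLike> a;
    Collector<ListLike> l;
    a.Append("one");
    l.Append("one");
    // Same result through differently named parent methods.
    assert(a.items == l.items);
}
```

Because Append() has no generic definition, using Collector with any third parent type fails at link time, which is a second, cruder guard against inadmissible parents.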
Tokenization
Call the Tokenize(...) function to tokenize a string. After this call, you can work with the tokens through the methods of CStringArray or CStringList. Note that the new tokens are appended to the collection: Tokenize(...) doesn't remove the old tokens.
// PURPOSE: Tokenize the string and APPEND the tokens to the parent collection class
// POSTCONDITIONS: The original string remains intact. The method can throw CMemoryException.
// RETURNS: Offset of the next character after the terminator.
// The return value and the iOffset parameter
// can be used for parsing one string with successive calls to Tokenize().
template <class T>
UINT
CStringTokenizer<T>::Tokenize(CString& strSrc,       // [in] a string that will be tokenized
                              LPCTSTR pStrDelimit,   // [in] a set of delimiting characters
                              LPCTSTR pStrTerminate, // [in] a set of terminating characters,
                                                     // or a terminating sequence, depending on options
                              UINT iOffset)          // [in] Tokenization will start at this offset. Defaults to zero
Options
IGNORE_EMPTY_TOKENS
If there are two delimiters in a row, the token between them is an empty string. By default, this token will be ignored. If RemoveOptions() is called with IGNORE_EMPTY_TOKENS, these tokens will be added to the collection (not ignored). This option can be useful for parsing
TERMINATING_STRING
If this option is set, tokenization stops when the terminating substring is encountered: Tokenize(...) treats pStrTerminate as an ordered substring. If this option is not set, tokenization stops when any character from the set of terminating characters is encountered: Tokenize(...) treats pStrTerminate as an unordered set of characters.
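The two options can be illustrated with a small standard C++ function. This is not the implementation shipped with the article, only a sketch of the semantics described above. TokenizeSketch and its parameter order are hypothetical, and returning the length of the string when no terminator is found is an assumption.

```cpp
#include <cassert>
#include <string>
#include <vector>

enum Options : unsigned { IGNORE_EMPTY_TOKENS = 0x01, TERMINATING_STRING = 0x02 };

// Appends the tokens of src (starting at offset) to out and returns the offset
// of the first character after the terminator, or src.size() if none is found.
inline std::size_t TokenizeSketch(const std::string& src,
                                  const std::string& delims,
                                  const std::string& term,
                                  unsigned flags,
                                  std::vector<std::string>& out,
                                  std::size_t offset = 0)
{
    std::size_t end = src.size();   // where tokenization stops
    std::size_t next = src.size();  // value returned to the caller
    if (!term.empty() && offset <= src.size()) {
        const bool asString = (flags & TERMINATING_STRING) != 0;
        const std::size_t pos = asString
            ? src.find(term, offset)            // ordered substring
            : src.find_first_of(term, offset);  // any character of the set
        if (pos != std::string::npos) {
            end = pos;
            next = pos + (asString ? term.size() : 1);
        }
    }
    for (std::size_t start = offset; start <= end; ) {
        std::size_t stop = src.find_first_of(delims, start);
        if (stop == std::string::npos || stop > end) stop = end;
        const std::string token = src.substr(start, stop - start);
        // Two delimiters in a row yield an empty token, which is kept
        // only when IGNORE_EMPTY_TOKENS is cleared.
        if (!token.empty() || !(flags & IGNORE_EMPTY_TOKENS)) out.push_back(token);
        start = stop + 1;
    }
    return next;
}

inline void ExampleTokenizeSketch()
{
    std::vector<std::string> out;
    // The first call stops at ';' and skips the empty token between the commas.
    const std::size_t next = TokenizeSketch("a,,b;c,d", ",", ";", IGNORE_EMPTY_TOKENS, out);
    assert(next == 5 && out.size() == 2);
    // The second call resumes after the terminator and appends to the same collection.
    TokenizeSketch("a,,b;c,d", ",", ";", IGNORE_EMPTY_TOKENS, out, next);
    assert(out.size() == 4 && out[3] == "d");
}
```

As in the article's class, the source string is left untouched and tokens accumulate across calls until the caller clears the collection.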
Thread safety notes
Even though the Tokenize(...) method is protected from re-entrancy with a mutex, the CStringTokenizer class is only partially thread-safe. MFC classes, including the parent collection classes (CStringArray and CStringList), are thread-safe at the class level only: separate threads can safely use separate objects, but access to the same object from several threads must be synchronized by the caller. Parsing is therefore not thread-safe. If a producer thread writes tokens to a CStringTokenizer object by calling Tokenize(...) while a consumer thread reads tokens by calling the accessor methods of the parent collection class, the consumer may see a combination of old and new data.
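One common way to avoid the mixed old/new view described above is to guard both the writer and the reader with the same lock. The sketch below uses std::mutex from the standard library. GuardedTokens, Replace and Snapshot are hypothetical names, and in an MFC program the analogous tools would be CCriticalSection (or CMutex) together with CSingleLock. It illustrates the idea only and is not part of the submitted class.

```cpp
#include <cassert>
#include <mutex>
#include <string>
#include <vector>

// Hypothetical wrapper: a single mutex guards both writing and reading.
class GuardedTokens
{
public:
    // Producer side: replace the whole token set inside one critical section.
    void Replace(std::vector<std::string> fresh)
    {
        std::lock_guard<std::mutex> lock(m_mutex);
        m_tokens.swap(fresh);
    }

    // Consumer side: copy under the same lock, so the copy is either
    // entirely the old set or entirely the new one.
    std::vector<std::string> Snapshot() const
    {
        std::lock_guard<std::mutex> lock(m_mutex);
        return m_tokens;
    }

private:
    mutable std::mutex m_mutex;
    std::vector<std::string> m_tokens;
};

inline void ExampleGuardedTokens()
{
    GuardedTokens shared;
    shared.Replace({"old", "old"});
    const std::vector<std::string> before = shared.Snapshot();
    shared.Replace({"new", "new"});
    const std::vector<std::string> after = shared.Snapshot();
    assert(before[0] == "old" && before[1] == "old");
    assert(after[0] == "new" && after[1] == "new");
}
```

The producer would tokenize into a private collection first and publish it with Replace(), so the lock is held only for the swap rather than for the whole parse.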
Demo application / Test bed
void TestTokenizer()
{
    TRACE("Beginning of template string tokenizer demo\n");
    // a string for parsing
    CString str1 = "She sells sea shells on a sea shore. \nShells shine.";
    // declare a tokenizer class derived from CStringArray
    CStringTokenizer<CStringArray> strTokArray;
    // Don't ignore the empty tokens. By default, they are ignored.
    strTokArray.RemoveOptions(CStringTokenizer<CStringArray>::IGNORE_EMPTY_TOKENS);
    // tokenize words in the 1st line
    UINT iStartOffset = strTokArray.Tokenize(str1, ". ", "\n");
    TRACE("Tokens in the Array:\n");
    for (int i = 0; i < strTokArray.GetSize(); ++i)
        // You can treat the tokenizer just like a regular CStringArray!
        TRACE("\t%s\n", (LPCTSTR)strTokArray[i]); // dump the parsed fragments
    // declare a tokenizer class derived from CStringList
    CStringTokenizer<CStringList> strTokList;
    // tokenize the rest of the same string, starting after the terminator
    strTokList.Tokenize(str1, ". ", "\n", iStartOffset);
    TRACE("Tokens in the List:\n");
    POSITION pos; // declared once so that both loops below compile on any compiler
    for (pos = strTokList.GetHeadPosition(); pos != NULL; )
        // You can treat the tokenizer just like a regular CStringList!
        TRACE("\t%s\n", (LPCTSTR)strTokList.GetNext(pos)); // dump the parsed fragments
    // and another string for parsing
    str1 = "Mary had a little lamb... for dinner.";
    strTokList.RemoveAll();
    // terminate when a given substring is encountered
    strTokList.AddOptions(CStringTokenizer<CStringList>::TERMINATING_STRING);
    strTokList.Tokenize(str1, ". ", "dinner"); // tokenize
    TRACE("Tokens in the List:\n");
    for (pos = strTokList.GetHeadPosition(); pos != NULL; )
        // You can treat the tokenizer just like a regular CStringList!
        TRACE("\t%s\n", (LPCTSTR)strTokList.GetNext(pos)); // dump the parsed fragments
    TRACE("End of template string tokenizer demo\n");
}
Conclusion
This idea seems very obvious, so similar code probably exists and I may simply not have looked hard enough. Still, Googling for 'parser tokenizer CStringArray CStringList template' didn't turn up anything similar.
Of course, there are loads of string tokenizers out there on the web. Most of them have an interface similar to Java's StringTokenizer. I didn't follow this de facto standard. Maybe I should have. On the other hand, my class preserves the original string.
As usual, suggestions, bug notes, comments etc., are most welcome!
References
- http://www.codeproject.com/cpp/strtok.asp: Another string tokenizer class on CodeProject.
- http://www.codeguru.com/cpp/cpp/cpp_mfc/parsing/article.php/c781/: Yet another string tokenizer class (derived from CObject) on CodeGuru.
- http://www.codeproject.com/string/cstringparser.asp
- http://www.c-plusplus.de/forum/viewtopic-var-p-is-18971.html: String parser in German.
History
- 0.1: Initial submission: December 4, 2006.
- 0.2: Added a mutex to prevent re-entrance; added thread-safety notes: December 29, 2006.
- 0.3: Changed the tokenization algorithm code slightly; added the TERMINATING_STRING option and updated the demo app to exercise this option; added notes about the options: January 5, 2007.