A Tiny Variable String Splitter

25 Jan 2008 | CPOL | 8 min read
Tokenize and access string contents using a format mask

Introduction

Interpreting strings and extracting information from them has always been necessary, and it usually means writing complex-looking code and logic.

Many things can be handled with tokenizers; however, these leave little room for different variants of a string or for separators that vary from case to case.

Regular expressions allow more complex processing, but it is a pain to create them, debug them and understand them a few months later.

This article shows how to use a very small class that provides a "natural" way of dealing with complex extraction patterns.

Note: The article and sources are updated to handle the problems of the first version.

Background

The best way to demonstrate the need is to provide two simple examples:

Example 1

We need to process strings from a log file like the following:

------------------snip------------------

Process: Tsk Mgr.EXE - Start Date: 2008-20-01 Duration: 00:01:54 - Description: Task Manager
Process: EXPLORER.EXE - Start Date: 2008-20-01 Duration: 00:01:54 - Description: Explorer
End-of-day

Process: Tsk Mgr.EXE - Start Date: 2008-21-01 Duration: 00:00:12 - Description: Task Manager
...

------------------snap------------------

The parts of interest are the process name, the start date and the duration. We need to extract them.

The log file lines look similar, but the first problem is that there are lines we don't need, like "End-of-day" or the empty line.

The second problem is that standard tokenizers don't really help, as there is no single separator character. Space cannot be used, as the process name "Tsk Mgr.EXE" contains a space itself, which would shift all the following tokens.

The colon (:) *could* be used; however, we would then have to remove the " - Start Date" from the process name, and the duration (00:01:54), which contains colons itself, would be split although we want it in one piece.

If we explained to a human how to extract the values, we would tell them to take the part after "Process:" up to the dash "-" before "Start Date:", then the part after "Start Date:" up to "Duration:", and finally the part after "Duration:" up to the dash before "Description:".

Written in a "masked" way, the string processing format looks like:

Process: #### - Start Date: #### Duration: #### - Description: Task Manager

What we need to parse in our application are the "####" parts. And what would be more natural than being able to enter exactly this mask pattern in our application? To access the interesting parts later, we should be able to give them variable names, marked in the mask string by surrounding percent signs (%). So our mask string would look like

Process: %proc% - Start Date: %date% Duration: %duration% - Description: %desc%

Ideally, we would run each line of the log file through this mask and, if it succeeds, query the variables "proc", "date" and "duration" - and ignore "desc". If it fails, i.e. if the input string does not match the mask format, we would just continue with the next log line.
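To make the idea concrete, here is a minimal sketch of such a mask matcher in portable C++. It is not the article's CStringSplitter: the function name matchPercentMask, the use of std::string and std::map, and the rule that a value ends at the first occurrence of the following fixed text are assumptions made for illustration only.

```cpp
#include <cassert>
#include <map>
#include <string>

using Values = std::map<std::string, std::string>;

// Hypothetical helper: matches `input` against `mask`, where variables are
// written as %name%. Returns true and fills `values` if the line fits the
// mask; `values` is only meaningful when the function returns true.
bool matchPercentMask(const std::string& input, const std::string& mask, Values& values)
{
    values.clear();
    std::size_t in = 0;   // read position in input
    std::size_t mk = 0;   // read position in mask

    while (mk < mask.size()) {
        if (mask[mk] != '%') {
            // Fixed text: must match the input character by character.
            if (in >= input.size() || input[in] != mask[mk]) return false;
            ++in; ++mk;
            continue;
        }
        // Variable: read its name up to the closing '%'.
        std::size_t nameEnd = mask.find('%', mk + 1);
        if (nameEnd == std::string::npos) return false;   // unterminated name
        std::string name = mask.substr(mk + 1, nameEnd - mk - 1);
        mk = nameEnd + 1;

        // The fixed text up to the next variable (or mask end) delimits the value.
        std::size_t fixedEnd = mask.find('%', mk);
        if (fixedEnd == std::string::npos) fixedEnd = mask.size();
        std::string fixed = mask.substr(mk, fixedEnd - mk);

        std::size_t valueEnd;
        if (!fixed.empty()) {
            valueEnd = input.find(fixed, in);
            if (valueEnd == std::string::npos) return false;
        } else if (fixedEnd == mask.size()) {
            valueEnd = input.size();   // variable ends the mask: take the rest
        } else {
            return false;              // two variables in a row: ambiguous
        }
        values[name] = input.substr(in, valueEnd - in);
        in = valueEnd;
    }
    // Input left over after the mask is used up counts as a mismatch.
    return in == input.size();
}

// Usage: only lines that fit the mask yield values; the others are skipped.
void exampleLogLines()
{
    const std::string mask =
        "Process: %proc% - Start Date: %date% Duration: %duration% - Description: %desc%";
    Values v;
    assert(matchPercentMask("Process: Tsk Mgr.EXE - Start Date: 2008-20-01 "
                            "Duration: 00:01:54 - Description: Task Manager", mask, v));
    assert(v["proc"] == "Tsk Mgr.EXE" && v["date"] == "2008-20-01" && v["duration"] == "00:01:54");
    assert(!matchPercentMask("End-of-day", mask, v));
    assert(!matchPercentMask("", mask, v));
}
```

Because the value of "proc" ends at the fixed text " - Start Date: " rather than at a single separator character, the space inside "Tsk Mgr.EXE" causes no trouble.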

Example 2

We need to extract version information from strings like:

<softwareA> v1.4
<softwareB> v5
<softwareC> v1.3.1 Beta 5
<softwareD> v8.4.87.405 Alpha

The only thing these strings have in common is the space and lowercase "v" right before the version number. Again, we are only interested in part of each line: the plain version numbers without the postfixed "Alpha" or "Beta 5".

In this case, we would have two different masks:
one with additional text after the version number and one without. The first mask would look like:

%software% v%version% %postfix%

where we have only two fixed elements:
- the " v" and
- the space between the version and the postfix

and another mask with no postfix

%software% v%version%

We would first check the mask with the postfix and, if it fails (i.e. if there is no postfix), the one without.
(If we checked the second mask only, it would always succeed and include the words "Beta" and "Alpha" in the version as well - which we don't want.)


Handling strings this way makes it much easier to adapt to many different tasks without having to reprogram the post-tokenizer logic that is usually involved when extracting data from strings.

The idea originally comes from a very good MP3 tag editor, which allows extracting data like artist and track name from the variously formatted filenames on freedb.org by using placeholder variables in the way described above. This quickly attracted me, as it is very easy to understand. Many times I wished I had access to such a function, and as I never found anything similar, I decided to code it myself.

Using the code

The string splitter consists of three classes: the main class CStringSplitter, which is used by the application code, and the two helper classes CSearchStringChar and CSearchStringStr, which CStringSplitter uses to parse the mask and the string.

The usage of the string splitter is demonstrated in this small piece of code.

In the example, we use the percentage symbol '%' to indicate the start and end of a variable name:

C++
#include "StringSplitter.h"

    CStringSplitter Split( _T('%'), _T('%') );
    CString strLogLine,
            strValue,
            strMask = _T("Process: %proc% - Start Date: %date% Duration: %duration% - Description: %desc%");

    while ( !isEndOfFile() )
    {
        strLogLine = getNextLine();

        if ( Split.matchMask( strLogLine, strMask ) )
        {
            if ( Split.getValue( strValue, _T("proc") ) )
            {
                // do something with strValue (proc)
                ...
            }
            if ( Split.getValue( strValue, _T("date") ) )
            {
                // do something with strValue (date)
                ...
            }
            if ( Split.getValue( strValue, _T("duration") ) )
            {
                // do something with strValue (duration)
                ...
            }
        }
    }

When using the default opening and closing parentheses to indicate the placeholders, the mask in the example above would be:

Process: (proc) - Start Date: (date) Duration: (duration) - Description: (desc)

The CStringSplitter class

The following functionality is provided by the class to process a string.

Configuring the start and end characters of a variable part in the mask

The CStringSplitter class can be constructed with two optional parameters that specify the start and end characters for the variable names. These default to the opening and closing parentheses '(' and ')'.

The opening and closing characters can be the same (e.g. '%' as in the code example above).

Parsing a source string against a mask

The method matchMask checks an input line against the mask. It returns true if the processing was successful, false if not, i.e. if the mask does not fit onto the input string or if the mask contains syntactical errors.

Querying the values of the placeholders

After successfully processing the source string, the method getValue can be called to get the contents of the requested variable. Its first parameter is a string reference which will contain the value of the requested variable, if it exists. The return value is true if the variable is found or false if it does not exist.

Note: Variable names are treated case-insensitively! This can be changed in the getValue method by using the strcmp function instead of stricmp.
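Since stricmp is not part of the C or C++ standard, a port to another compiler may need its own comparison. The following is a minimal sketch of a case-insensitive name lookup with standard facilities only; equalsIgnoreCase, findValue and the (name, value) pair list are hypothetical and do not reflect how the class stores its variables.

```cpp
#include <cassert>
#include <cctype>
#include <string>
#include <utility>
#include <vector>

// Portable stand-in for stricmp-style equality: true if a and b are equal
// when ASCII case is ignored.
bool equalsIgnoreCase(const std::string& a, const std::string& b)
{
    if (a.size() != b.size()) return false;
    for (std::size_t i = 0; i < a.size(); ++i) {
        // Cast to unsigned char: passing negative values to tolower is undefined.
        if (std::tolower(static_cast<unsigned char>(a[i])) !=
            std::tolower(static_cast<unsigned char>(b[i])))
            return false;
    }
    return true;
}

// Hypothetical lookup over (name, value) pairs. With ignoreCase == false it
// behaves like the strcmp variant mentioned in the note above.
bool findValue(const std::vector<std::pair<std::string, std::string>>& vars,
               const std::string& name, std::string& value, bool ignoreCase = true)
{
    for (const auto& var : vars) {
        const bool same = ignoreCase ? equalsIgnoreCase(var.first, name)
                                     : var.first == name;
        if (same) { value = var.second; return true; }
    }
    return false;
}

void exampleLookup()
{
    const std::vector<std::pair<std::string, std::string>> vars = {
        {"proc", "EXPLORER.EXE"}, {"date", "2008-20-01"} };
    std::string value;
    assert(findValue(vars, "PROC", value) && value == "EXPLORER.EXE");
    assert(!findValue(vars, "PROC", value, false));   // case-sensitive lookup fails
}
```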

Processing multiple lines with the same mask

When thousands of lines have to be processed with the same mask, the mask needs to be parsed only once. This makes the string matching even faster.

The first step is to set up the mask by calling the setMask method with the mask string as the only parameter (which corresponds to the second parameter of matchMask). It returns true if the mask is valid or false if not.

The strings can then be matched in a loop by calling the matchLastMask method with the string to be matched as the only parameter (which corresponds to the first parameter of matchMask). It returns true if the processing was successful and false if not, i.e. if the mask does not fit the input string.

After success, the placeholder values can be retrieved as described above.

Here's the earlier example with the necessary changes (the setMask call before the loop and the matchLastMask call inside it):

    CStringSplitter Split( _T('%'), _T('%') );
    CString strLogLine,
            strValue,
            strMask = _T("Process: %proc% - Start Date: %date% Duration: %duration% - Description: %desc%");

    Split.setMask( strMask );
    while ( !isEndOfFile() )
    {
        strLogLine = getNextLine();

        if ( Split.matchLastMask( strLogLine ) )
        {
            if ( Split.getValue( strValue, _T("proc") ) )
            {
                // do something with strValue (proc)
                ...
            }
            ...
        }
    } 
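Why pre-parsing pays off can be seen in a small portable sketch. The MaskMatcher class below only mirrors the method names setMask, matchLastMask and getValue; its token list, its case-sensitive lookup and the lack of escape handling are assumptions of this sketch, not a description of the real CStringSplitter internals.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Sketch only: a hypothetical MaskMatcher built on std::string.
class MaskMatcher {
public:
    explicit MaskMatcher(char open = '(', char close = ')') : open_(open), close_(close) {}

    // Parses the mask once into alternating fixed-text and variable tokens.
    bool setMask(const std::string& mask)
    {
        tokens_.clear();
        matched_ = false;
        std::size_t pos = 0;
        while (pos < mask.size()) {
            if (mask[pos] == open_) {
                // Two variables in a row cannot be told apart.
                if (!tokens_.empty() && tokens_.back().isVariable) { tokens_.clear(); return false; }
                std::size_t end = mask.find(close_, pos + 1);
                if (end == std::string::npos) end = mask.size();   // auto-terminate
                tokens_.push_back({true, mask.substr(pos + 1, end - pos - 1), ""});
                pos = (end < mask.size()) ? end + 1 : end;
            } else {
                std::size_t end = mask.find(open_, pos);
                if (end == std::string::npos) end = mask.size();
                tokens_.push_back({false, mask.substr(pos, end - pos), ""});
                pos = end;
            }
        }
        return !tokens_.empty();
    }

    // Matches one input line against the stored tokens; no mask parsing here.
    bool matchLastMask(const std::string& input)
    {
        matched_ = false;
        std::size_t in = 0;
        for (std::size_t i = 0; i < tokens_.size(); ++i) {
            Token& t = tokens_[i];
            if (!t.isVariable) {
                if (input.compare(in, t.text.size(), t.text) != 0) return false;
                in += t.text.size();
            } else if (i + 1 == tokens_.size()) {
                t.value = input.substr(in);            // last token: rest of the line
                in = input.size();
            } else {
                std::size_t end = input.find(tokens_[i + 1].text, in);
                if (end == std::string::npos) return false;
                t.value = input.substr(in, end - in);
                in = end;
            }
        }
        matched_ = !tokens_.empty() && in == input.size();
        return matched_;
    }

    // Case-sensitive in this sketch; only valid after a successful match.
    bool getValue(std::string& value, const std::string& name) const
    {
        if (!matched_) return false;
        for (const Token& t : tokens_)
            if (t.isVariable && t.text == name) { value = t.value; return true; }
        return false;
    }

private:
    struct Token { bool isVariable; std::string text; std::string value; };
    char open_, close_;
    bool matched_ = false;
    std::vector<Token> tokens_;
};

// Usage: parse the mask once, then run every line through matchLastMask.
std::vector<std::string> processNames(const std::vector<std::string>& lines)
{
    std::vector<std::string> names;
    MaskMatcher m('%', '%');
    if (!m.setMask("Process: %proc% - Start Date: %date% Duration: %duration% - Description: %desc%"))
        return names;
    std::string value;
    for (const std::string& line : lines)
        if (m.matchLastMask(line) && m.getValue(value, "proc")) names.push_back(value);
    assert(names.size() <= lines.size());
    return names;
}
```

In this design the per-line work is reduced to a few substring searches, because the variable names and fixed text parts are already separated when the loop starts.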

Validity of Masks and Syntax

The parser tolerates typing errors; however, some syntax rules must be followed to get the expected result.

1. Two variables may not follow each other

At least one fixed character must separate them:

Parsing any string using the mask "(part1)(part2)" will, logically, not yield any usable results, as there is no separation between part1 and part2.

2. Using a variable's start character in fixed text

Given the following input string:

"Humidity %89"

When '%' is used as the variable start and end character, any doubled appearance of it in a fixed text section is interpreted as one occurrence of the character.

To handle the '%' sign in the fixed text, the mask string must be:

"Humidity %%%HUM%"

Note that there are three (3) sequential percent signs:

The first two belong to the fixed text; any doubled start character is interpreted as a single one, which gives "Humidity %".

The remaining part describes the variable: the third '%' is the start character, "HUM" is the variable name and the final '%' is the end character.

In order to prevent annoyances, the end-character is not allowed in the variable name, even if it is doubled.

3. Improper ending

A variable that is not terminated at the end of the mask string is terminated automatically.

4. Mask validation (e.g. of user input)

To use this class with user input fields, the easiest way to check the validity of a mask is to call setMask with the string. If it returns true, the mask can be used. However, it cannot be guaranteed that the results are what the user intended - Microsoft's PSI-API is not yet ready for public use ;-)
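The syntax rules above (no adjacent variables, doubled start characters in fixed text, automatic termination) can be sketched as a small mask parser whose return value doubles as a validity check. parseMask and MaskToken are hypothetical names; the real setMask may well be stricter or more lenient than this illustration.

```cpp
#include <cassert>
#include <string>
#include <vector>

struct MaskToken {
    bool isVariable;
    std::string text;   // the fixed text, or the variable name
};

// Splits a mask into fixed-text and variable tokens.
// Returns false if two variables follow each other without fixed text.
bool parseMask(const std::string& mask, char open, char close,
               std::vector<MaskToken>& tokens)
{
    tokens.clear();
    std::string fixed;          // fixed text collected so far
    std::size_t pos = 0;
    while (pos < mask.size()) {
        const char c = mask[pos];
        if (c != open) { fixed += c; ++pos; continue; }
        if (pos + 1 < mask.size() && mask[pos + 1] == open) {
            fixed += open;      // rule 2: doubled start character = one literal
            pos += 2;
            continue;
        }
        // A single start character begins a variable.
        if (fixed.empty() && !tokens.empty()) return false;   // rule 1
        if (!fixed.empty()) { tokens.push_back({false, fixed}); fixed.clear(); }
        std::size_t end = mask.find(close, pos + 1);
        if (end == std::string::npos) end = mask.size();      // rule 3
        tokens.push_back({true, mask.substr(pos + 1, end - pos - 1)});
        pos = (end < mask.size()) ? end + 1 : end;
    }
    if (!fixed.empty()) tokens.push_back({false, fixed});
    return true;
}

void exampleDoubling()
{
    std::vector<MaskToken> t;
    assert(parseMask("Humidity %%%HUM%", '%', '%', t));
    assert(t.size() == 2);
    assert(!t[0].isVariable && t[0].text == "Humidity %");
    assert(t[1].isVariable && t[1].text == "HUM");
    assert(!parseMask("(part1)(part2)", '(', ')', t));
}
```

Reading the mask strictly from left to right is what resolves the three percent signs: the first two are consumed as a pair before the third one can start a variable.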


Other platforms

The code uses some specifics of Visual Studio 2005; however, it should be portable to other platforms with little effort.

The first is MFC's CString for internal storage of variables and placeholders and for returning the variable's contents. This can be replaced by virtually any other string class, as nothing more than simple assignment is used.

The string processing itself uses the oldie-but-goldie C library functions strlen, strcpy, strchr, strstr and stricmp, or rather their _t-prefixed Visual Studio equivalents (such as _tcslen and _tcsicmp), to achieve MBCS / UNICODE compatibility with the same code.
The strcpy_s function of VS 2005 can be replaced by the corresponding "unsafe" strcpy function for other compilers without problems, as the memory for the string is allocated directly before copying.

Since the basic character handling functions are used instead of the high-level string class routines, the mask parsing should be quite fast, as these functions are assembly-optimized in most compiler libraries.

History

1.0 2008-01-21 Public release

1.1 2008-01-25 Problems of 1.0 fixed:

  • Variable open and close placeholders
  • Double open-characters inside fixed text are interpreted as single characters
  • Failsafe with syntax errors in mask

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


Written By
Germany

Comments and Discussions

 
General: Width Format Specification Added
Daniel B., 11-Oct-10 22:38
Question: How to use this class with StreamReader
brzi, 6-Aug-08 13:20
Answer: Re: How to use this class with StreamReader
mi-chi, 30-Sep-08 2:31
Question: Tiny Variable String Splitter: why not use Regular Expressions?
Jon Summers, 28-Jan-08 3:18
Answer: Re: Tiny Variable String Splitter: why not use Regular Expressions?
Mass Nerder, 29-Jan-08 5:31
General: Nice!
Abu Mami, 25-Jan-08 1:31
General: Regarding the use of % elsewhere in the input string
.dan.g., 21-Jan-08 11:17
General: Re: Regarding the use of % elsewhere in the input string
mi-chi, 25-Jan-08 0:26
Hello dan.g and thank you for your suggestion!

Very interesting direction, didn't think of that before.

Being happy to have a possible simple resolution for the problem, I started implementing this case and came to a strange result, which I didn't expect (case 1).

I did more tests with treating double appearances of the placeholder character as one (case 2), decided to expand them a bit and updated the article.

Here are the results:

1: Variable names with placeholder as start character and terminating space

Given the string from the example

Process: EXPLORER.EXE - Start Date: 2008-20-01 Duration: 00:01:54 - Description: Explorer

Using the suggested method, the mask would - on first thought - look like this:

Process: %proc - Start Date: %date Duration: %duration - Description: %desc

This, however, would not yield the expected result:
Since the space after a variable terminates the variable name, it no longer belongs to the following fixed text part. Above, the next fixed text part is
"- Start Date: " instead of
" - Start Date: "

Notice the missing space before the dash. And worse: the space from the source text would be appended to the value of %proc, i.e.
"EXPLORER.EXE " instead of
"EXPLORER.EXE"

The same is true for %date and %duration. Sure, this can be worked around by trimming the value later, but it is just not the result one might expect.

To get the needed result, we would have to include two spaces after each variable, one for terminating the name and one for the following text.

Treating the space after the variable name as both end-of-variable-name and space-in-text won't work because we don't always have spaces in the source text, e.g. in:

Wins(3)-Draws(2)-Losses(1)

which will make the mask look even more strange:

Wins(%WIN )-Draws(%DRAW )-Losses(%LOSSES )

And worse, when entered without spaces, by accident or by a user:
Wins(%WIN)-Draws(%DRAW)-Losses(%LOSSES)

we would get one long variable name:
"WIN)-Draws(%DRAW)-Losses(%LOSSES)"



Trying to find a better solution, this made me experiment with:


2: Double-escaping the placeholder character

There is only one big potential problem:

Given the input string

"Humidity: %89"

The mask string would be

"Humidity: %%%HUM%"

This leads to confusion about whether the user meant to have a '%' before the variable name or whether the variable should be named '%HUM'. This can be solved by simply setting the rule that the placeholder character is not allowed inside the variable name.


Resolution:
The updated article and sources will allow setting variable start and end characters for the placeholders, so the mask string can look like:

Brackets "(" and ")":

"Process: (proc) - Start Date: (date) Duration: (duration) - Description: (desc)"


Percent / space: "%" and " ", as suggested by you:

"Process: %proc  - Start Date: %date  Duration: %duration  - Description: %desc " (note the double spaces as described above)


And even the same character for both, start and end (which makes it compatible to the first version), e.g. '%':

Process: %proc% - Start Date: %date% Duration: %duration% - Description: %desc%


Further, doubled appearances of the start character in a fixed text section will be treated as a single character, and to resolve ambiguities with this rule, the end character may not be used in variable names - it will always be treated as end-of-variable-name.

For further additions, see the updated article.

Hope you will like it!
