
Very Fast Splitter with support for (Multi-Characters) String Separator

By Remon Zakaria, 17 Jun 2004

Introduction

This article presents a fast splitter function that combines two behaviors: the classic Split, which treats each character of the delimiter as a separator of its own, and a new mode that treats the whole delimiter string as a single separator.

Why New

Have you ever tried to write a file sender program with resume support? You need, for example, a simple protocol that carries the file ID, the packet number and the packet data. What separator can you use to split these fields? If you use a single character, you risk that the packet data contains the same character by chance, and then your program breaks. Instead, you can choose a sequence of characters that is very unlikely to occur in the data; let's assume it is '(::)'. The problem is that the ordinary Split function treats each character of that sequence as a separator on its own.

Example

Expression: Hi(::)How are you? :)I hope you are fine(::)

Output of ordinary Split:

  • Hi
  • Empty String
  • Empty String
  • Empty String
  • How are you?
  • Empty String
  • I hope you are fine
  • Empty String
  • Empty String
  • Empty String
  • Empty String

Output of my Split:

  • Hi
  • How are you? :)I hope you are fine
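
The difference can be reproduced with a short sketch. It assumes the delimiter is the four-character string (::) and writes the example expression out in plain characters. SplitWhole is a hypothetical helper, not part of the module in this article; it only illustrates whole-delimiter matching with String.IndexOf.

```csharp
using System;
using System.Collections.Generic;

public static class SplitComparison
{
    // Hypothetical helper (not the article's code): treats the
    // delimiter as one unit and finds it with String.IndexOf.
    public static List<string> SplitWhole(string expression, string delimiter)
    {
        if (string.IsNullOrEmpty(delimiter))
            throw new ArgumentException("Delimiter must not be empty.", "delimiter");

        var tokens = new List<string>();
        int start = 0;
        int hit;
        while ((hit = expression.IndexOf(delimiter, start, StringComparison.Ordinal)) >= 0)
        {
            tokens.Add(expression.Substring(start, hit - start));
            start = hit + delimiter.Length;
        }
        // Keep the tail only if something follows the last delimiter.
        if (start < expression.Length)
            tokens.Add(expression.Substring(start));
        return tokens;
    }

    public static void Main()
    {
        string expression = "Hi(::)How are you? :)I hope you are fine(::)";
        string delimiter = "(::)";

        // Ordinary Split: '(', ':' and ')' are each a separator.
        string[] perChar = expression.Split(delimiter.ToCharArray());
        Console.WriteLine(perChar.Length); // 11

        // Whole-delimiter split: only the full "(::)" separates tokens.
        List<string> whole = SplitWhole(expression, delimiter);
        Console.WriteLine(whole.Count);    // 2
    }
}
```

The ":)" inside the second token survives because it is only part of the delimiter, not the whole of it.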

Usage

Function Header

public static string[] Split (string Expression , 
       string Delimiter, bool SingleSeparator, 
       int Count , ComparisonMethod Compare)
  • Expression: Expression to split.
  • Delimiter: String to split with.
  • SingleSeparator: true to treat the whole Delimiter as a single separator; false to run the ordinary Split, where every Delimiter character is a separator.
  • Count: Maximum number of tokens to return. A negative value means no limit.
  • Compare: Indicates whether delimiter matching is case sensitive (Binary) or not (Text).

Split Module Code

using System;

namespace Infinity
{
    public enum ComparisonMethod
    {
        Binary = 0,
        Text = 1
    }

    namespace StringSplitter
    {
        /// <summary>
        /// Split strings with support to multi-character,
        /// multi-lines Delimiter 
        /// </summary>
        public class CSplitter
        {
            /// <summary>
            /// Holds the string to split 
            /// </summary>
            private static string  m_Expression ; 
            /// <summary>
            /// Delimiter to split the expression with 
            /// </summary>
            private static string  m_Delimiter ;
            /// <summary>
            /// Constructor for the Splitter
            /// </summary>
            public CSplitter()
            {
                //
                // TODO: Add constructor logic here
                //
            }

            private static bool 
              isValidDelimiterBinary(int StringIndex, 
              int DelimiterIndex )
            {
                if (DelimiterIndex == m_Delimiter.Length)  return true;
                if (StringIndex == m_Expression.Length) return false;
                //If the current character of the expression matches 
                //the current character of the Delimiter, 
                //then go to next character
                if (m_Expression[StringIndex] == 
                             m_Delimiter[DelimiterIndex]) 
                    return isValidDelimiterBinary(StringIndex + 1, 
                                               DelimiterIndex + 1);
                else
                    return false;
            }
            private static bool 
               isValidDelimiterText(int StringIndex, 
               int DelimiterIndex )
            {
                if (DelimiterIndex == m_Delimiter.Length)  return true;
                if (StringIndex == m_Expression.Length) return false;
                //If the current character of the expression 
                //matches the current character of the Delimiter, 
                //then go to next character
                if (Char.ToLower(m_Expression[StringIndex]) 
                        == Char.ToLower(m_Delimiter[DelimiterIndex])) 
                    return isValidDelimiterText(StringIndex + 1, 
                                        DelimiterIndex + 1);
                else
                    return false;
            }

            public static string[] Split(string Expression, 
               string  Delimiter, bool SingleSeparator, 
               int  Count, ComparisonMethod Compare) 
            {
                //Update Private Members 
                m_Expression = Expression;
                m_Delimiter = Delimiter;

                //Array to hold Splitted Tokens
                System.Collections.ArrayList Tokens = 
                      new System.Collections.ArrayList ();
                //If not using single separator, 
                //then use the regular split function
                if (!SingleSeparator)
                    if (Count >=0)
                      return Expression.Split(Delimiter.ToCharArray(), Count);
                    else
                      return Expression.Split(Delimiter.ToCharArray());

                //Check if count = 0 then return an empty array 
                if (Count ==0)
                    return new string [0];
                else
                    //Check if Count = 1 then return the whole expression
                    if (Count == 1)
                        return new string [] {Expression};
                    else
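                        Count --;

                //An empty Delimiter matches at every index and the loops
                //below would never advance, so return the expression whole
                if (Delimiter.Length == 0)
                    return new string [] {Expression};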

                // Indexer to loop over the string with 
                int i ;
                //The Start index of the current 
                //token in the expression
                int iStart = 0 ;

                if (Compare == ComparisonMethod.Binary) 
                {
                    for (i = 0 ; i < Expression.Length ; i++)
                    {
                        if (isValidDelimiterBinary(i, 0))
                        {
                            //Assign New Token 
                            Tokens.Add (Expression.Substring(iStart, 
                                                          i - iStart));
                            //Update Index 
                            i += Delimiter.Length - 1;
                            //Update Current Token Start index
                            iStart = i + 1;
                            //If we reached the tokens limit , then exit For 
                            if (Tokens.Count == Count && Count >= 0) break;
                        }
                    }
                }
                else
                {
                    for (i = 0 ; i < Expression.Length ; i++)
                    {
                        if (isValidDelimiterText(i, 0))
                        {
                            //Assign New Token 
                            Tokens.Add (Expression.Substring(iStart, 
                                                            i - iStart));
                            //Update Index 
                            i += Delimiter.Length - 1;
                            //Update Current Token Start index
                            iStart = i + 1;
                            //If we reached the tokens limit , then exit For 
                            if (Tokens.Count == Count && Count >= 0) break;
                        }
                    }
                }
                string LastToken = "";
                //If there is still data & have not been added
                if (iStart < Expression.Length)
                {
                    LastToken = Expression.Substring(iStart, 
                                        Expression.Length - iStart);
                    if(LastToken == Delimiter)
                        Tokens.Add (null);
                    else
                        Tokens.Add (LastToken);
                }
                else
                    //If there is no elements in the tokens array, 
                    //then pass the whole string as the one element
                    if (Tokens.Count == 0) Tokens.Add (Expression);
                //Return the split tokens
                return (string [])
                    Tokens.ToArray(typeof(string));
            }
        }
    }
}

Code in Details

Comparison Method Enumeration

public enum ComparisonMethod
{
       Binary = 0,
       Text = 1
}

Used to specify if the matching is case sensitive (Binary) or not (Text).

CSplitter members:

// Holds the string to split 
private static string  m_Expression ; 
// Delimiter to split the expression with 
private static string  m_Delimiter ;

I created these variables because the delimiter matching functions need them, and it makes little sense to pass them as parameters on every call. So I store them once as static members and pass only the indices into them, as you can see below. Note that because they are static, concurrent calls to Split from different threads would overwrite each other's values.

isValidDelimiterBinary Function

private static bool isValidDelimiterBinary(int StringIndex, int DelimiterIndex )
{
    if (DelimiterIndex == m_Delimiter.Length)  return true;
    if (StringIndex == m_Expression.Length) return false;
    //If the current character of the expression matches 
    //the current character of the Delimiter , then go to next character
    if (m_Expression[StringIndex] == m_Delimiter[DelimiterIndex]) 
        return isValidDelimiterBinary(StringIndex + 1, DelimiterIndex + 1);
    else
        return false;
}

This recursive function takes a start index into the Expression and a start index into the Delimiter. I think it holds the whole trick, so let's go through it line by line:

if (DelimiterIndex == m_Delimiter.Length)  return true;
if (StringIndex == m_Expression.Length) return false;
These are the two stop conditions:

The first one ends the recursion and returns true once ALL the delimiter characters have been checked and matched. The second one returns false when we reach the end of the expression before the delimiter has been fully checked.

if (m_Expression[StringIndex] == m_Delimiter[DelimiterIndex]) 
    return isValidDelimiterBinary(StringIndex + 1, DelimiterIndex + 1);
else
    return false;

If the current character of the expression matches the current character of the Delimiter, the function calls itself with both indices incremented by 1; otherwise it returns false. When you call it from the main function, all you have to do is pass the expression index where matching should start, and 0 as the DelimiterIndex so that matching begins at the first character of the delimiter.

bool res = isValidDelimiterText(i, 0);
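
For comparison, the same check can be sketched as a loop that receives both strings as parameters instead of reading static members. MatchesAt is a hypothetical name, not part of the module above; the sketch assumes the same contract (return true only if the full delimiter occurs at the given index).

```csharp
using System;

public static class DelimiterMatch
{
    // Hypothetical iterative counterpart of isValidDelimiterBinary.
    public static bool MatchesAt(string expression, int index, string delimiter)
    {
        // Not enough characters left to hold a full delimiter.
        if (index < 0 || index + delimiter.Length > expression.Length)
            return false;

        // Compare character by character and stop at the first mismatch.
        for (int d = 0; d < delimiter.Length; d++)
        {
            if (expression[index + d] != delimiter[d])
                return false;
        }
        return true;
    }

    public static void Main()
    {
        Console.WriteLine(MatchesAt("a(::)b", 1, "(::)")); // True
        Console.WriteLine(MatchesAt("a(::)b", 2, "(::)")); // False
        Console.WriteLine(MatchesAt("a(::", 1, "(::)"));   // False
    }
}
```

The up-front length check plays the role of the second stop condition: a delimiter that would run past the end of the expression can never match.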

isValidDelimiterText Function

It is exactly the same function, except that it matches in a case-insensitive way. I preferred to write two functions rather than check, for every expression character in the loop, whether the user wants case-sensitive matching or not. The only difference is this part of the comparison:

Char.ToLower(m_Expression[StringIndex]) == 
                        Char.ToLower(m_Delimiter[DelimiterIndex])

Here, I convert the two characters to lowercase before comparing them. Someone might ask: why not convert the whole string once into a temporary string and work with that? That's a good idea, but then I would loop once to convert and a second time to match, which is not efficient. Also, imagine a user who passes a long string (30000 characters, for instance) and wants only two elements back. You would convert the WHOLE string even though the first separator you need might be within the first 100 characters. I guess that would be a performance disaster. :)
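
One caveat worth knowing: Char.ToLower(char) uses the current culture, so results can differ between machines. A possible alternative, sketched below under the assumption that culture-independent matching is wanted, is String.Compare with StringComparison.OrdinalIgnoreCase. It still inspects only as many characters as the delimiter has, so nothing is converted in advance. MatchesAtIgnoreCase is a hypothetical name, not part of the module above.

```csharp
using System;

public static class DelimiterMatchIgnoreCase
{
    // Hypothetical culture-independent counterpart of isValidDelimiterText.
    public static bool MatchesAtIgnoreCase(string expression, int index, string delimiter)
    {
        // Not enough characters left to hold a full delimiter.
        if (index < 0 || index + delimiter.Length > expression.Length)
            return false;

        // Compare only delimiter.Length characters starting at index.
        return string.Compare(expression, index, delimiter, 0,
                              delimiter.Length,
                              StringComparison.OrdinalIgnoreCase) == 0;
    }

    public static void Main()
    {
        Console.WriteLine(MatchesAtIgnoreCase("oneENDtwo", 3, "end")); // True
        Console.WriteLine(MatchesAtIgnoreCase("oneENDtwo", 4, "end")); // False
    }
}
```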

Split Function

Now we come to the main function, which does all the work. First, we update the m_Expression and m_Delimiter member variables with the arguments passed in.

m_Expression = Expression;
m_Delimiter = Delimiter;

//Array to hold Tokenized Tokens
System.Collections.ArrayList Tokens = new System.Collections.ArrayList();

This is an ArrayList to hold the tokens. We use it because we need a fast, dynamically growing collection that can be converted to a string array.

SingleSeparator Parameter Handling

//If not using single separator, then use the regular split function
if (!SingleSeparator)
    if (Count >=0)
        return Expression.Split(Delimiter.ToCharArray(), Count);
    else
        return Expression.Split(Delimiter.ToCharArray());

This part checks whether the caller wants the regular Split method. If so, it also checks whether a Count was supplied (Count >= 0) and, if it was, passes it through to the regular Split.

Count Parameter Handling

//Check if count = 0 then return an empty array 
if (Count ==0)
        return new string [0];
else
        //Check if Count = 1 then return the whole expression
        if (Count == 1)
                return new string [] {Expression};
        else  
        Count--;

This part handles the special cases of the Count parameter as follows:

  • Count = 0: return an empty array.
  • Count = 1: return the original string as the only element.
  • Otherwise, decrement Count by one; the reason is explained later.
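
These special cases mirror what the built-in String.Split(char[], int) overload does. The short sketch below uses only that built-in overload with a comma separator (an illustrative choice, not the article's delimiter) to show the three situations.

```csharp
using System;

public static class CountSemantics
{
    public static void Main()
    {
        char[] comma = { ',' };

        // Count = 0: an empty array comes back.
        Console.WriteLine("a,b,c".Split(comma, 0).Length); // 0

        // Count = 1: the whole string is the only element.
        Console.WriteLine("a,b,c".Split(comma, 1)[0]);     // a,b,c

        // Count = 2: one token is split off, the rest stays together.
        string[] two = "a,b,c".Split(comma, 2);
        Console.WriteLine(two[0]); // a
        Console.WriteLine(two[1]); // b,c
    }
}
```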

The Main Loop

int i ; // Indexer to loop over the string with 
int iStart = 0 ; //The Start index of the current token in the expression 

if (Compare == ComparisonMethod.Binary) 
{
    for (i = 0 ; i < Expression.Length ; i++)
    {
        if (isValidDelimiterBinary(i, 0))
        {
            //Assign New Token 
            Tokens.Add (Expression.Substring(iStart, i - iStart));
            //Update Index 
            i += Delimiter.Length - 1;
            //Update Current Token Start index
            iStart = i + 1;
            //If we reached the tokens limit , then exit for 
            if (Tokens.Count == Count && Count >= 0) break;
        }
    }
}
else
{
    for (i = 0 ; i < Expression.Length ; i++)
    {
        if (isValidDelimiterText(i, 0))
        {
            //Assign New Token 
            Tokens.Add (Expression.Substring(iStart, i - iStart));
            //Update Index 
            i += Delimiter.Length - 1;
            //Update Current Token Start index
            iStart = i + 1;
            //If we reached the tokens limit , then exit for
            if (Tokens.Count == Count && Count >= 0) break;
        }
    }
}

Both branches of the if statement are the same; the only difference is that one calls isValidDelimiterBinary and the other calls isValidDelimiterText. I will explain the first branch (the binary matching):

for (i = 0 ; i < Expression.Length ; i++)
{
    if (isValidDelimiterBinary(i, 0))
    {
        //Assign New Token 
        Tokens.Add (Expression.Substring(iStart, i - iStart));
        //Update Index 
        i += Delimiter.Length - 1;
        //Update Current Token Start index
        iStart = i + 1;
        //If we reached the tokens limit , then exit for 
        if (Tokens.Count == Count && Count >= 0) break;
    }
}

This part does the looping. I used a for loop rather than an enumerator because I need an index to work with. Yes, I could use an enumerator with a manually incremented index, but why do the extra processing? :) Before we start, consider the string in the Demo Project: a(::)b(::)c()(::)(::)(::), which we will split by the (::) delimiter. First, we check whether the current Expression character begins a full occurrence of the Delimiter.

if (isValidDelimiterBinary(i, 0))

If so, we do the following:

//Assign New Token 
Tokens.Add (Expression.Substring(iStart, i - iStart));

We add the characters from the start index up to the character just before the current one. For example, for the first delimiter found, i = 1 and iStart = 0, so the token added is ‘a’.

//Update Index 
i += Delimiter.Length - 1;

Update the index i so that it jumps over the delimiter characters. It advances by Delimiter.Length - 1 here because the loop's own i++ supplies the last step.

//Update Current Token Start index
iStart = i + 1;

Update the start index of the next token, iStart, so that it points to the first character after the delimiter (‘b’ in our case).

//If we reached the tokens limit, then exit for 
if (Tokens.Count == Count && Count >= 0) break;

This part checks whether the caller asked for a limited number of tokens. If so, we stop one token short of Count (which we decremented above), because the rest of the string has to go into the last element of the returned array.

Remaining Characters Check

Now the loop has finished. Let's see whether any characters remain. If they do, we check whether they are exactly one more delimiter: if so, we add a null entry; otherwise, we add the remaining characters as the last token. If no characters remain, we check whether any token was collected; if none was, we add the whole string as the single token.

string LastToken = "";
//If there is still data & has not been added

if (iStart < Expression.Length){
  LastToken = Expression.Substring(iStart, Expression.Length - iStart);
  if(LastToken == Delimiter)
      Tokens.Add (null);
  else
      Tokens.Add (LastToken);
}
else
  //If there is no elements in the tokens array, 
  //then pass the whole string as the one element
      if (Tokens.Count == 0) Tokens.Add (Expression);

Return Array Of strings

Finally, we return the tokens to the caller as an array of strings.

//Return the split tokens
return (string [])Tokens.ToArray(typeof(string));
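
Putting the steps together, the whole walk-through can be condensed into a sketch that keeps no static state. This is not the module above: CompactSplitter and its bool ignoreCase parameter are hypothetical, it relies on String.IndexOf rather than the recursive matchers, it rejects an empty delimiter, and it returns a remainder that equals the delimiter as is instead of adding null.

```csharp
using System;
using System.Collections.Generic;

public static class CompactSplitter
{
    // Hypothetical compact variant; count < 0 means "no limit".
    public static string[] Split(string expression, string delimiter,
                                 int count, bool ignoreCase)
    {
        if (string.IsNullOrEmpty(delimiter))
            throw new ArgumentException("Delimiter must not be empty.", "delimiter");
        if (count == 0) return new string[0];
        if (count == 1) return new[] { expression };

        StringComparison mode = ignoreCase
            ? StringComparison.OrdinalIgnoreCase
            : StringComparison.Ordinal;

        var tokens = new List<string>();
        int start = 0;
        // With a limit, stop one token early so the remainder fits last.
        while (count < 0 || tokens.Count < count - 1)
        {
            int hit = expression.IndexOf(delimiter, start, mode);
            if (hit < 0) break;
            tokens.Add(expression.Substring(start, hit - start));
            start = hit + delimiter.Length;
        }
        // Add what follows the last delimiter. A trailing delimiter
        // yields no extra empty token, as in the walk-through above.
        if (start < expression.Length || tokens.Count == 0)
            tokens.Add(expression.Substring(start));
        return tokens.ToArray();
    }

    public static void Main()
    {
        string demo = "a(::)b(::)c()(::)(::)(::)";
        Console.WriteLine(Split(demo, "(::)", -1, false).Length); // 5
        Console.WriteLine(Split(demo, "(::)", 2, false)[1]);      // b(::)c()(::)(::)(::)
        Console.WriteLine(Split("oneENDtwo", "end", -1, true)[1]); // two
    }
}
```

Passing the strings as parameters keeps the method safe to call from several threads at once, at the cost of slightly longer argument lists.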

Disclaimer

This code is free for personal use. However, if you are going to use it for commercial purposes, you need to purchase a license.

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.


About the Author

Remon Zakaria
Web Developer
Egypt
No Biography provided

Last Updated 18 Jun 2004
Article Copyright 2004 by Remon Zakaria