Introduction
I have always appreciated the String.Split function (and the Split function provided with VB.Net). The Split function divides a text value into separate parts based on a specified character, or delimiter, and returns the parsed text value as a string array.
Unfortunately, the Split function does not support text qualifiers. A text qualifier is a character used to mark the bounds of a block of text. Usually a quotation mark or apostrophe is used as the text qualifier, although any character would work.
Since the Split function doesn't support text qualifiers, a text block is split whenever a delimiter is found inside it. It would be nice if we could pass in a text qualifier so that the text block would be treated as a single element.
In this article we will create a Split function for both VB.Net and C#.Net that will support text qualifiers.
In the next article, Parsing Command Line Arguments, we will add support for assignment operators.
Approach
To solve this problem we will need to look at each character in the text expression passed into the routine. We will need to identify when we are in a text block so that the delimiters will be ignored. When outside a text block the delimiter will identify a new element in our array.
We begin by creating our routine and loop.
C#
public string[] Split(string expression, string delimiter, string qualifier, bool ignoreCase)
{
    // Visit every character, including the last one.
    for (int _CharIndex=0; _CharIndex<expression.Length; _CharIndex++)
    {
    }
    return null; // placeholder; the result is built in the sections below
}
VB.Net
Public Function Split( _
    ByVal expression As String, _
    ByVal delimiter As String, _
    ByVal qualifier As String, _
    ByVal ignoreCase As Boolean) _
    As String()
    For _CharIndex As Integer = 0 To expression.Length - 1
    Next
    Return Nothing ' placeholder; the result is built in the sections below
End Function
Managing Text Qualifiers
Next we will add a boolean variable to track when we are in a text block. When a text qualifier is found we will set the boolean value to true. When another text qualifier is found we will set the boolean value back to false. This can be done very easily by setting the boolean value to the opposite of its current state. We just have to remember to initialize the boolean value to false, indicating that we aren't in a text block.
It is important to use the length of the text qualifier to define how many characters to compare. This way, if we want to use more than one character to define the bounds of the text block, we can. This can be desirable when a single qualifier character could show up inside the text block. For example, we may want to use a quotation mark followed by an apostrophe as the text qualifier. This way, if either symbol appears on its own inside the block, it won't terminate the block.
There is another way to manage text blocks that contain text qualifiers. You already use this alternative approach when assigning values to a string variable. Simply duplicate the text qualifier inside the text block. Fortunately this approach is already supported by the way we are tracking text qualifiers.
Take the following example that uses the quotation mark for the text qualifier:
"Example With ""Text Qualifier"" Inside Text Block"
The first quotation mark turns on the boolean flag. The second one turns it off; however, the next character is also a text qualifier, which turns it back on, making it appear that we never closed the text block. Simple and effective!
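To make the toggling concrete, here is a minimal sketch that records the in-block state after each character of a sample string. The class and method names (QualifierTrace, Trace) are hypothetical and only exist for this illustration; it assumes a single-character qualifier.

```csharp
using System;

static class QualifierTrace
{
    // Returns, for each character, whether the scan is inside a text block
    // once that character has been processed.
    public static bool[] Trace(string expression, char qualifier)
    {
        bool[] states = new bool[expression.Length];
        bool inBlock = false; // we start outside any text block
        for (int i = 0; i < expression.Length; i++)
        {
            if (expression[i] == qualifier)
                inBlock = !inBlock; // every qualifier flips the state
            states[i] = inBlock;
        }
        return states;
    }

    static void Main()
    {
        // The sample is: "a""b" c  (a doubled quotation mark inside the block)
        string sample = "\"a\"\"b\" c";
        bool[] states = Trace(sample, '"');
        for (int i = 0; i < sample.Length; i++)
            Console.WriteLine("{0} -> {1}", sample[i], states[i]);
    }
}
```

The doubled quotation mark flips the state off and immediately back on, so the character that follows it is still treated as part of the block.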
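C#
public string[] Split(string expression, string delimiter, string qualifier, bool ignoreCase)
{
    bool _QualifierState = false;
    for (int _CharIndex=0; _CharIndex<expression.Length; _CharIndex++)
    {
        // && short-circuits, so qualifier.Length is never read when qualifier is null.
        // This Compare overload takes a start index and a length, so a multi-character
        // qualifier near the end of the expression cannot cause an out-of-range error.
        if (!string.IsNullOrEmpty(qualifier)
            && (string.Compare(expression, _CharIndex, qualifier, 0, qualifier.Length, ignoreCase)==0))
        {
            _QualifierState = !(_QualifierState);
            // Step over the rest of a multi-character qualifier.
            _CharIndex += qualifier.Length - 1;
        }
    }
    return null; // placeholder; the result is built in the next section
}
VB.Net
Public Function Split( _
    ByVal expression As String, _
    ByVal delimiter As String, _
    ByVal qualifier As String, _
    ByVal ignoreCase As Boolean) _
    As String()
    Dim _QualifierState As Boolean = False
    For _CharIndex As Integer = 0 To expression.Length - 1
        If Not String.IsNullOrEmpty(qualifier) _
            AndAlso String.Compare(expression, _CharIndex, qualifier, 0, qualifier.Length, ignoreCase) = 0 Then
            _QualifierState = Not _QualifierState
            ' Step over the rest of a multi-character qualifier.
            _CharIndex += qualifier.Length - 1
        End If
    Next
    Return Nothing ' placeholder; the result is built in the next section
End Function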
Another benefit to the approach taken above is that if we don't want to use a text qualifier we don't have to. If the text qualifier is Nothing (VB.Net) or null (C#.Net) then the text qualifier logic is disabled!
Splitting the Text Expression
Now we are ready to search for the delimiter. We will use the length of the delimiter to define how many characters to use just in case we need to support more than one character for our delimiter. We will create a start index variable to use to track the first character of the text block. The current character index identifies the end of the text block.
Additionally, we will store the values in a System.Collections.ArrayList object. When we are finished loading the ArrayList we will copy the list to a String array and return the values.
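As an aside, and not part of the routine in this article: on .NET 2.0 or later a generic List<string> could stand in for the ArrayList, since its ToArray method returns a typed string array directly. The sketch below assumes that substitution; the class and method names (CollectDemo, ToStringArray) are hypothetical.

```csharp
using System;
using System.Collections.Generic;

static class CollectDemo
{
    // Hypothetical helper: gathers parts in a growable list, then returns an array.
    public static string[] ToStringArray(IEnumerable<string> parts)
    {
        List<string> values = new List<string>();
        foreach (string part in parts)
            values.Add(part);
        return values.ToArray(); // typed copy, no separate CopyTo step needed
    }

    static void Main()
    {
        string[] result = ToStringArray(new string[] { "one", "two" });
        Console.WriteLine(result.Length);
    }
}
```

Either collection works for this routine; the ArrayList version in the article simply needs the extra CopyTo call to get back to a String array.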
C#
public string[] Split(string expression, string delimiter, string qualifier, bool ignoreCase)
{
    bool _QualifierState = false;
    int _StartIndex = 0;
    System.Collections.ArrayList _Values = new System.Collections.ArrayList();
    // Visit every character, including the last one.
    for (int _CharIndex=0; _CharIndex<expression.Length; _CharIndex++)
    {
        // && short-circuits, so the Length properties are never read on a null string.
        // This Compare overload takes a start index and a length, so it cannot run
        // past the end of the expression.
        if (!string.IsNullOrEmpty(qualifier)
            && (string.Compare(expression, _CharIndex, qualifier, 0, qualifier.Length, ignoreCase)==0))
        {
            _QualifierState = !(_QualifierState);
            // Step over the rest of a multi-character qualifier.
            _CharIndex += qualifier.Length - 1;
        }
        else if (!(_QualifierState) && !string.IsNullOrEmpty(delimiter)
            && (string.Compare(expression, _CharIndex, delimiter, 0, delimiter.Length, ignoreCase)==0))
        {
            _Values.Add(expression.Substring(_StartIndex, _CharIndex - _StartIndex));
            // The next element starts after the whole delimiter, which may be
            // longer than one character.
            _StartIndex = _CharIndex + delimiter.Length;
            _CharIndex = _StartIndex - 1;
        }
    }
    if (_StartIndex<expression.Length)
        _Values.Add(expression.Substring(_StartIndex, expression.Length - _StartIndex));
    string[] _returnValues = new string[_Values.Count];
    _Values.CopyTo(_returnValues);
    return _returnValues;
}
VB.Net
Public Function Split( _
    ByVal expression As String, _
    ByVal delimiter As String, _
    ByVal qualifier As String, _
    ByVal ignoreCase As Boolean) _
    As String()
    Dim _QualifierState As Boolean = False
    Dim _StartIndex As Integer = 0
    Dim _Values As New System.Collections.ArrayList
    For _CharIndex As Integer = 0 To expression.Length - 1
        If Not String.IsNullOrEmpty(qualifier) _
            AndAlso String.Compare(expression, _CharIndex, qualifier, 0, qualifier.Length, ignoreCase) = 0 Then
            _QualifierState = Not _QualifierState
            ' Step over the rest of a multi-character qualifier.
            _CharIndex += qualifier.Length - 1
        ElseIf Not _QualifierState _
            AndAlso Not String.IsNullOrEmpty(delimiter) _
            AndAlso String.Compare(expression, _CharIndex, delimiter, 0, delimiter.Length, ignoreCase) = 0 Then
            _Values.Add(expression.Substring(_StartIndex, _CharIndex - _StartIndex))
            ' The next element starts after the whole delimiter.
            _StartIndex = _CharIndex + delimiter.Length
            _CharIndex = _StartIndex - 1
        End If
    Next
    If _StartIndex < expression.Length Then
        _Values.Add(expression.Substring(_StartIndex, expression.Length - _StartIndex))
    End If
    Dim _returnValues(_Values.Count - 1) As String
    _Values.CopyTo(_returnValues)
    Return _returnValues
End Function
Using the code
That's it! We are done. Now we can play with our new toy!
The code sample below parses a text value using a period for the delimiter and a quotation mark for the text qualifier. The input contains a period inside the quoted text block to demonstrate that a delimiter inside a text block does not split it. The quotation marks are kept in the output. The Split function returns a string array with two elements in it:
- This is an "example."
- Cool!
C#
using System.Windows.Forms;
public void Example()
{
    // A verbatim string (@"...") is required for the doubled quotation marks in C#.
    foreach (string _Part in Split(@"This is an ""example."".Cool!", ".", "\"", true))
        MessageBox.Show(this, _Part, "Split Example", MessageBoxButtons.OK);
}
VB.Net
Public Sub Example()
    ' In VB.Net a quotation mark inside a string literal is written as two quotation marks.
    For Each _Part As String In Split("This is an ""example."".Cool!", ".", """", True)
        MsgBox(_Part, MsgBoxStyle.OkOnly, "Split Example")
    Next
End Sub
Regular Expression Alternative
As an alternative you can parse the text as documented above using a Regular Expression. Regular Expressions are designed specifically for text parsing. While Regular Expressions are more elegant they increase the cost of maintaining the application because few people understand Regular Expressions and fewer yet can create the expressions (even when using tools).
In light of the benefits and costs associated with Regular Expressions it is worth taking time to demonstrate how a Regular Expression can solve our text parsing problem.
Abishek Bellamkonda was kind enough to provide a Regular Expression that could parse the text as documented above. Since I am no expert with Regular Expressions I won't dive into how the Regular Expression works.
Please don't bombard me with questions on this topic as I only understand abstract Regular Expression concepts. I am providing this as an example for those who are interested. I am not providing this as an explanation of how to make Regular Expressions.
C#
using System.Text.RegularExpressions;
public string[] Split(string expression, string delimiter, string qualifier, bool ignoreCase)
{
    // Note: {1} is placed inside a character class, so this pattern assumes
    // a single-character qualifier.
    string _Statement = String.Format("{0}(?=(?:[^{1}]*{1}[^{1}]*{1})*(?![^{1}]*{1}))",
        Regex.Escape(delimiter), Regex.Escape(qualifier));
    RegexOptions _Options = RegexOptions.Compiled | RegexOptions.Multiline;
    if (ignoreCase) _Options = _Options | RegexOptions.IgnoreCase;
    Regex _Expression = new Regex(_Statement, _Options);
    return _Expression.Split(expression);
}
VB.Net
Imports System.Text.RegularExpressions
Public Function Split( _
    ByVal expression As String, _
    ByVal delimiter As String, _
    ByVal qualifier As String, _
    ByVal ignoreCase As Boolean) _
    As String()
    ' Note: {1} is placed inside a character class, so this pattern assumes
    ' a single-character qualifier.
    Dim _Statement As String = String.Format("{0}(?=(?:[^{1}]*{1}[^{1}]*{1})*(?![^{1}]*{1}))", _
        Regex.Escape(delimiter), Regex.Escape(qualifier))
    Dim _Options As RegexOptions = RegexOptions.Compiled Or RegexOptions.Multiline
    If ignoreCase Then _Options = _Options Or RegexOptions.IgnoreCase
    Dim _Expression As Regex = New Regex(_Statement, _Options)
    Return _Expression.Split(expression)
End Function
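For readers who want to try the regular expression version outside a Windows Forms project, the following is a small console sketch. It wraps the same pattern in a hypothetical class named RegexSplitDemo (the class name and the Main method are assumptions for this example) and runs it against the sample text from the "Using the code" section.

```csharp
using System;
using System.Text.RegularExpressions;

static class RegexSplitDemo
{
    // Same pattern as the article's function; assumes a single-character qualifier.
    public static string[] Split(string expression, string delimiter, string qualifier, bool ignoreCase)
    {
        string statement = String.Format("{0}(?=(?:[^{1}]*{1}[^{1}]*{1})*(?![^{1}]*{1}))",
            Regex.Escape(delimiter), Regex.Escape(qualifier));
        RegexOptions options = RegexOptions.Multiline;
        if (ignoreCase) options |= RegexOptions.IgnoreCase;
        return new Regex(statement, options).Split(expression);
    }

    static void Main()
    {
        // Input text: This is an "example.".Cool!
        foreach (string part in Split(@"This is an ""example."".Cool!", ".", "\"", true))
            Console.WriteLine(part);
        // prints:
        // This is an "example."
        // Cool!
    }
}
```

The period inside the quotation marks is followed by an odd number of quotation marks, so the lookahead rejects it; only the period after the closing quotation mark is used as a split point.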
Here is the function (VB.net) that I use:
Public Overrides Function SplitLine(ByVal line As String, delimiter as Char, stringQualifier as Char) As String()
    ' This is a modified version of the code that I found some time ago.
    ' Credit for original code: http://www.freevbcode.com/ShowCode.asp?ID=4938&NoBox=True
    ' Original function was edited slightly to eliminate a bug.

    Dim i As Integer
    Dim SplitString as New List(Of String)
    Dim IsDelimiter As Boolean
    Dim Total As Integer
    Dim Ch As String
    Dim Section As String

    ' We want to count the delimiter unless it is within the text qualifier
    IsDelimiter = True
    Total = 0
    Section = Nothing

    For i = 0 To Len(line) - 1
        Ch = line(i)
        Select Case Ch
            Case stringQualifier
                IsDelimiter = Not IsDelimiter
            Case delimiter
                If IsDelimiter Then
                    ' Add current section to collection
                    SplitString.Add(Section)
                    Section = Nothing
                    Total += 1
                Else
                    ' Delimiter char is within text qualifier
                    ' and is included in the value
                    Section += Ch
                End If
            Case Else
                Section += Ch
        End Select
    Next

    ' Get the last field - as most files will not have an ending delimiter
    If IsDelimiter Then
        ' Add current section to collection
        SplitString.Add(Section)
    End If

    ' Convert List(of String) to String()
    Dim RetArr(0 To SplitString.Count - 1) As String
    For i = 0 To SplitString.Count - 1
        RetArr(i) = SplitString(i)
    Next
    Return RetArr
End Function
modified on Thursday, November 5, 2009 2:24 PM
For the simpler part, go down to the example. I had to do the analysis because I wrote 90% of this message only looking at the expression, not using anything to test the analysis until after.
Analysis of the expression
Original: "{0}(?=(?:[^{1}]*{1}[^{1}]*{1})*(?![^{1}]*{1}))"
Human Readable: "{delim}(?=(?:[^{qual}]*{qual}[^{qual}]*{qual})*(?![^{qual}]*{qual}))"
(These are actually new to me, so please correct me if I've got this wrong.)
The "?=" means to match the position before the expression following.
The "?:" means to match this but don't keep track of the group; this is here so that the only match in the expression is the position specified by the "?=".
The "?!" means to match a position where this expression is not found.
So for the purposes of analyzing the operation of this, we will ignore those.
Original: "{0} [^{1}]*{1}[^{1}]*{1} [^{1}]*{1}"
Human Readable: "{delim} [^{qual}]*{qual}[^{qual}]*{qual} [^{qual}]*{qual}"
{delim} The expression starts once it finds a delimiter
[^{qual}]* Run through anything that is not a qualifier, or nothing
{qual} We find a qualifier
[^{qual}]* Run through anything that is inside of that qualifier and the next
{qual} We find an end qualifier
In the original this ends our (?:exp) part; this block of matches can repeat any number of times, including 0
[^{qual}]* this finds anything after that block that isn't a qualifier
{qual} due to the (?!exp) this matches characters that are after the qualifier but not a delimiter (could someone confirm this is its only function?)
An example of how the Regex evaluator goes through this
"Key=\"value is here\" Name=\"test case\"more"
Where: {delim} = ' ' {qual} = '"'
1. The {delim} at the start causes it to jump completely over "Key=\"value and start to try to match the space after that, but due to the next part, non-matching qualifiers are not allowed
2. It goes on to the next space; again the match is not allowed
3. It has skipped over "Key=\"value is here\" and starts building a match from the space after that
4. [^{qual}]* matches the Name= after that space
5. {qual} matches the quote, [^{qual}]* matches everything up to the next {qual}, so this matched \"test case\"
6. (?![^{qual}]*{qual}) causes it to match the position directly after more
So it returns two positions to the Split function called in this project, one just after the {delim} it found, and one that would end up right before the next {delim}
7. It will start the match over again if it finds another {delim}
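One way to check an analysis like this is to ask the regex engine where it actually matches. The sketch below is an illustration only: the class and method names (LookaheadDemo, DelimiterPositions) are hypothetical, and the pattern is the article's expression with a space as the delimiter and a quotation mark as the qualifier. It lists the index of every delimiter the expression accepts in the sample string.

```csharp
using System;
using System.Text.RegularExpressions;

static class LookaheadDemo
{
    // Returns the index of each space that the pattern accepts as a delimiter.
    public static int[] DelimiterPositions(string input)
    {
        Regex re = new Regex(" (?=(?:[^\"]*\"[^\"]*\")*(?![^\"]*\"))");
        MatchCollection matches = re.Matches(input);
        int[] positions = new int[matches.Count];
        for (int i = 0; i < matches.Count; i++)
            positions[i] = matches[i].Index;
        return positions;
    }

    static void Main()
    {
        string input = "Key=\"value is here\" Name=\"test case\"more";
        foreach (int position in DelimiterPositions(input))
            Console.WriteLine(position); // prints 19
    }
}
```

Only the space at index 19 (between the two quoted blocks) is followed by an even number of quotation marks, so it is the only delimiter reported. The lookaheads consume no characters, which means the match itself is just that one space.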
As it's very common to have a string with different text qualifiers (typically " and ') I modified the split function to handle this case. Here's the code:
public static string[] Split(string value, string delimiter, params string[] qualifiers)
{
    string currentQualifier = null;
    int startIndex = 0;
    int trimLen = 0;
    System.Collections.ArrayList values = new System.Collections.ArrayList();
    for (int charIndex = 0; charIndex < value.Length; charIndex++)
    {
        if ( currentQualifier != null && string.Compare(value.Substring(charIndex, currentQualifier.Length), currentQualifier) == 0 )
        {
            trimLen = currentQualifier.Length;
            currentQualifier = null;
        }
        else if ( qualifiers != null && currentQualifier == null )
        {
            foreach(string qualifier in qualifiers)
            {
                if (string.Compare(value.Substring(charIndex, qualifier.Length), qualifier) == 0)
                {
                    currentQualifier = qualifier;
                    break;
                }
            }
        }
        if ( currentQualifier == null && delimiter != null && string.Compare(value.Substring(charIndex, delimiter.Length), delimiter) == 0 )
        {
            string val = value.Substring(startIndex + trimLen, charIndex - startIndex - trimLen * 2);
            if (val != delimiter) values.Add(val);
            startIndex = charIndex + 1;
            trimLen = 0;
        }
    }

    if (startIndex < value.Length) values.Add(value.Substring(startIndex + trimLen, value.Length - startIndex - trimLen * 2));
    string[] returnValues = new string[values.Count];
    values.CopyTo(returnValues);
    return returnValues;
}
Thanks a lot for your article.
Tilly www.utillyty.eu
I want to split a line and add the word "TO" wherever a space or comma appears in the line, using C#. For example, my line contains
abc def [xyz,pqr][out tan]
First I want to split the line so that I get [out tan] on a separate line. Second, I want to add the word "TO". My output must be
abc to def to [xyz to pqr] [out to tan]
How do I get this?
What you are describing requires multiple delimiter support and additional processing after parsing. This class won't support your requirement out of the box. You are in a custom code scenario.
You could use the class to split your value using a space as the delimiter. Then run the output from the first split through the class again with a comma defined as the delimiter. On the final pass you could then add the "to" to the end of the text as needed.
Hi Sowmy,
Use the Substring function and you will get exactly what you want. Here is the code; try it and you will get the exact solution.
string s="abc def [xyz,pqr][out tan]";
string m;
m = s.Substring(0,3).ToString() + " " + "to" + ";";
m += s.Substring(4,3).ToString() + " " + "to" + ";";
m += s.Substring(7,5).ToString() + " " + "to" + "" ;
m += s.Substring(13,4).ToString() + " " + "to" + "" ;
m += s.Substring(17,4).ToString() + " " + "to" + "" ;
m += s.Substring(22,4).ToString() + " " + "to" + "" ;
Response.Write(m.ToString());
Ramesh N
If the last character is the delimiter, you won't get an empty string as the last item in the array. Example: "745"|"blah blah blah"|""|5| <-- notice that the 5 is not wrapped in " because it is numeric.
Also, I would expect that the array would have the text qualifiers stripped off.
I also made another modification so that it only flips _QualifierState if it is in the text block, finds the text qualifier, and is either at the end of the string or immediately followed by the delimiter. I ran into a situation where the program writing the file would include " as part of the data, so something like this:
"745.65"|"blah blah" blah"|""|5|"something"|928
Thanks for the code. It was a big help and turned into the perfect function once I tweaked it a bit.
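The first two requests in this comment (an empty last element after a trailing delimiter, and qualifiers stripped from the returned values) can be sketched as follows. This is not the commenter's code, which was not posted; it is a minimal illustration that assumes a single-character delimiter and qualifier, and the names StrippingSplit and Split are hypothetical. It treats a doubled qualifier inside a block as one literal qualifier and does not attempt the third change (unescaped qualifiers inside the data).

```csharp
using System;
using System.Collections.Generic;
using System.Text;

static class StrippingSplit
{
    public static string[] Split(string line, char delimiter, char qualifier)
    {
        List<string> fields = new List<string>();
        StringBuilder current = new StringBuilder();
        bool inBlock = false;
        for (int i = 0; i < line.Length; i++)
        {
            char c = line[i];
            if (c == qualifier)
            {
                // A doubled qualifier inside a block is kept as one literal qualifier.
                if (inBlock && i + 1 < line.Length && line[i + 1] == qualifier)
                {
                    current.Append(qualifier);
                    i++;
                }
                else
                {
                    inBlock = !inBlock; // the qualifier itself is not copied
                }
            }
            else if (c == delimiter && !inBlock)
            {
                fields.Add(current.ToString());
                current.Length = 0;
            }
            else
            {
                current.Append(c);
            }
        }
        // Always add the last field, so a trailing delimiter yields an empty string.
        fields.Add(current.ToString());
        return fields.ToArray();
    }

    static void Main()
    {
        foreach (string field in Split("\"745\"|\"blah blah blah\"|\"\"|5|", '|', '"'))
            Console.WriteLine("[" + field + "]");
    }
}
```

Building each field in a StringBuilder, rather than taking substrings of the input, is what makes it easy to leave the qualifiers out of the result.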
I am trying to use the Regex version of this code. This is doing exactly what I am looking for, so thank you for sharing.
My question is about how it returns the split values with the text qualifier character included.
So if you have the string "first Field","Second Field","Third,Field" you get an array list of:
"first Field"
"Second Field"
"Third,Field"
It is split correctly; however, I would like it to return:
first Field
Second Field
Third,Field
I know I can take care of this using a Replace(exp, controlchars.quote, "") but is there a way to do this in the Regex, or elsewhere in the code? Thanks
Matt
I'm no expert with Regular Expressions. The RegEx version exists because of the generosity of reviewers like yourself. While I am sure there is a way to modify the expression to handle your concern I would simply remove it with a substring command instead of the replace command just to be safe.
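A minimal sketch of the Substring approach, assuming the qualifier should only be removed when it appears at both ends of a value. The helper name StripQualifier and its containing class are hypothetical; it would be applied to each element returned by Split.

```csharp
using System;

static class QualifierStripper
{
    // Removes one leading and one trailing qualifier, but only when both are present.
    public static string StripQualifier(string value, string qualifier)
    {
        if (!string.IsNullOrEmpty(qualifier)
            && value.Length >= 2 * qualifier.Length
            && value.StartsWith(qualifier, StringComparison.Ordinal)
            && value.EndsWith(qualifier, StringComparison.Ordinal))
        {
            return value.Substring(qualifier.Length, value.Length - 2 * qualifier.Length);
        }
        return value; // unqualified values are returned unchanged
    }

    static void Main()
    {
        Console.WriteLine(StripQualifier("\"Third,Field\"", "\"")); // prints Third,Field
        Console.WriteLine(StripQualifier("5", "\""));               // prints 5
    }
}
```

Unlike Replace, this leaves any qualifier characters in the middle of the value untouched.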
Bug (Vikcia, 22:02 11 Sep '07)
Example: Valda real estate, LTD;;19977*;;
If Split is used with the text qualifier " and the delimiter ; then the result is incorrect.
The Regexp version works correctly.
Thank you for the highly useful article. Here is a nice regular expression for csv files.
Regex re = new Regex(",(?=([^\"]*\"[^\"]*\")*(?![^\"]*\"))");
Dan Crowell www.crowsol.com
The regular expression you gave doesn't work. When I tried that code on the input string:
/cmd one two three "one two three" 'one two three' one two thre
it returned an array of 13 strings instead of 9
I am not sure why you are getting those results.
The following code example returns 11 results correctly. A space is treated as the delimiter and a quotation mark as the text qualifier.
Dim example As String
example = "/cmd one two three ""one two three"" 'one two three' one two thre"
System.Console.WriteLine("Input: " & example)
For Each _Value As String In Split(example, " ", """", True)
    System.Console.WriteLine("Output: " & _Value)
Next
System.Console.ReadKey()
The code sample above correctly returns the following results:
Input: /cmd one two three "one two three" 'one two three' one two thre
Output: /cmd
Output: one
Output: two
Output: three
Output: "one two three"
Output: 'one
Output: two
Output: three'
Output: one
Output: two
Output: thre
Sorry, I thought I had posted my fix later on.
Anyways, that regex, and the C# code with it, only work if the delimiter and qualifier are 1 character each. Because of this, I found the use of string types to be somewhat misleading (think C/C++ geek in a heavily unix-influenced world) in that snippet, and that was why I got the wrong output.
To cope with multiple characters as delimiters and text qualifiers, the regex needs a few extra []s to accommodate that and should look like this:
[{0}](?=(?:[^{1}]*[{1}][^{1}]*[{1}])*(?![^{1}]*[{1}]))
With that regex I get the result I was expecting from :
/cmd one two three "one two three" 'one two three' one two three
with delimiter = " \t" and qualifier = "\'\"":
/cmd
one
two
three
"one two three"
'one two three'
one
two
three
My personal experience is that if you're needing to make sure that "-quoted text is returned as a single token, chances are good that you'll also need '-quoted text to be similarly parsed as well. Maybe even `-quoted (back quotes), and all in the same string, too.
Anyways, your article did point me in the right direction as somebody experienced in C/C++ but new to C#/.Net. Thanks. I think you're probably also the quickest responding author I have seen here on Code Project.
Thanks much for your response. This is one of the things I am liking about the CodeProject. It's how people improve upon each other's ideas. I'm no RegEx expert and never claimed to be such.
Great Teamwork!
May I say, that's some good thinking, exactly what I was looking for. However, there are problems in the C# version. Firstly, you have round brackets on your _Values(_ValueIndex) at the bottom of Split; this won't compile, as array item delimiters are of course square: _Values[_ValueIndex]
Also you're missing a starting bracket on if ((qualifier!=null) && (string.Compare(expression.Substring(_CharIndex, qualifier.Length), qualifier, ignoreCase)==0)) in each function.
Oh yes, and you need a comparative and, &&, in the above (as added) in order to short-circuit the comparison, else you will be evaluating qualifier.Length on a null object and throwing another error.
Secondly the CalculateSplitLength may work correctly to pass a value out to an array redim in VB, however in the c# version this will always throw an IndexOutOfRange exception unless you seed the variable _ValueIndex with a 0 also.
I totally agree with Alnicol, if you use a collection or list that you can add values dynamically you have no need to perform the CalculateSplitLength routine, thus instantly doubling your performance.
Thanks again though, useful routine to have.
P
-- modified at 11:27 Wednesday 13th September, 2006
Thanks for pointing out the syntactical mistakes. I have corrected the article. Additionally I provided another example using Regular Expressions. Hope it helps!
Rather than using an array, why not use a Collection (such as an ArrayList or generic List)? Adding to a Collection is relatively fast - it doesn't work in the same way as the old 'ReDim Preserve' in VB6. You seem to do all the work twice ... once to work out the size of the array and then a second time to actually populate the array. I guess using a Collection wouldn't be quite as fast, but unless you are dealing with hundreds of thousands of 'splits', you'd have trouble telling the difference in speed.
Good suggestion. It makes the code smaller and easier to explain. Less code to support and easier to understand. Only downside is that the information has to be copied to a string array so that I can get the data back in the desired format. The string array is returned to simplify conversion from using the Visual Basic Split function.
Good job, thanks!
But there is a bug in the VB version. The parameter ignoreCase should be of type Boolean, not String.
Karel
Use the regular expression below to do a regular expression split; you will get the same result.
\.(?=(?:[^\"]*\"[^\"]*\")*(?![^\"]*\"))
The .NET code (in C#) to do this:
Regex re = new Regex("\\.(?=(?:[^\"]*\"[^\"]*\")*(?![^\"]*\"))", RegexOptions.IgnoreCase | RegexOptions.Compiled);
string[] arr = re.Split(@"test string ""example."".cool!");
for (int i = 0; i < arr.Length; i++)
    Console.WriteLine(arr[i]);
Examples
@"test string ""example."".cool!" gives out @"test string ""example.""", @"cool!"
@"test string ""example."".cool! test string '"example."".cool!.cool!" gives out @"test string ""example.""", @"cool! test string ""example.""", @"cool!", @"cool!"
I tend to use the brute force approach. Haven't worked much with regular expressions.
But you are absolutely correct. This is a good scenario for a regular expression.
Excellent alternative.
Thanks!