Licence BSD
First Posted 8 Sep 2009

Splitting a Line of Comma-Separated Text

By pat daburu | 11 Sep 2009 | Article
A quick and simple method for splitting up the lines in your .csv file.

Introduction

From time to time, you just need to open a comma-separated values (*.csv) text file and roll through the data. Unfortunately, once you have a line of text, you cannot simply split it on the commas, because a field may itself contain commas when it is enclosed in quotes. Here, I present a relatively simple lexer method that parses a line character by character (instead of using the Regular Expression engine) and that may get the job done for you.
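To make the problem concrete, here is a minimal sketch of what happens when a line is split on every comma. The class and method names (NaiveSplitDemo, NaiveSplit) and the sample line are made up for illustration only.

```csharp
using System;

public static class NaiveSplitDemo
{
    // Splits on every comma and ignores quotes entirely.
    public static string[] NaiveSplit(string line)
    {
        return line.Split(',');
    }

    public static void Main()
    {
        // One record with three intended fields; the second contains a comma.
        string line = "1,\"Smith, John\",42";
        string[] parts = NaiveSplit(line);

        // The quoted field is cut in two, so four pieces come back.
        Console.WriteLine(parts.Length); // 4
        foreach (string part in parts)
        {
            Console.WriteLine("[" + part + "]");
        }
    }
}
```

The quoted name is broken at its inner comma, which is exactly the case the method in this article is meant to handle.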

Background

I was looking for a quick copy/paste code snippet that would solve this problem. I found numerous Regular Expressions (which, by conventional wisdom with regard to style and practice, are generally the best way to solve this kind of problem); but none of the ones I tried seemed to parse everything in my file correctly. I found some other potential solutions online, but they wanted me to download something (... and I just wanted to copy/paste and move on, remember?). After fiddling with some expressions for a while, I figured I might actually get to my goal more quickly if I were to write a little lexer method that did the job.

Once I had the method written, I thought it might be helpful to someone else in two ways. First, it does solve a particular, common problem. Second, and perhaps more important, I thought that it could serve as an interesting working example of how to do this kind of string parsing, and as a starting point for someone who has a similar parsing task at hand. Otherwise, you can simply think of it as an exercise in do-it-yourself parsing. I have done my best to keep the code snippet simple and explicit (sometimes sacrificing generality for clarity) so that if you want to use it and modify it, it should be reasonably easy to do so.

One advantage to parsing in this way (rather than using an expression) is that if you need to modify the logic, you can step through it in serial fashion and examine the states of the variables character-by-character, rather than passing the task off to the Regular Expression engine.
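As a rough illustration of that kind of character-by-character inspection, the sketch below records the quote state after each character is read. It is a deliberately simplified scanner (every quote just toggles the state), not the method presented in this article, and the names QuoteStateTrace and Trace are hypothetical.

```csharp
using System;
using System.Collections.Generic;

public static class QuoteStateTrace
{
    // Returns one row per character: its index, the character itself, and
    // whether the scanner is inside quoted text after reading it.
    // Simplifying assumption: every quote character toggles the state.
    public static List<string> Trace(string line)
    {
        List<string> rows = new List<string>();
        bool inText = false;
        for (int i = 0; i < line.Length; i++)
        {
            char c = line[i];
            if (c == '"')
            {
                inText = !inText;
            }
            rows.Add(i + ": '" + c + "' inText=" + inText);
        }
        return rows;
    }

    public static void Main()
    {
        foreach (string row in Trace("a,\"b,c\""))
        {
            Console.WriteLine(row);
        }
    }
}
```

In the sample line, the comma at index 1 is read while inText is False, and the comma at index 4 is read while it is True, so only the first one would count as a field delimiter.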

But use Regular Expressions when you can.

Using the Code

The code is just a single method which I include here as a snippet. Pass a line of comma-separated values from your *.csv file to the method, and you should receive an array of the individual "fields" in the line.

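// Place this method inside a class. It requires:
//   using System.Collections.Generic;  (List<T>)
//   using System.Text;                 (StringBuilder)
public string[] SplitCsv(string s)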
{
    // Create a list to hold the tokens as we find them.
    List<string> tokens = new List<string>();
    // We'll need a "buffer" object to build up the tokens character-
    // by-character.
    StringBuilder buffer = new StringBuilder();

    // Convert the string to an array of characters.
    char[] chars = s.ToCharArray();
    // Declare a variable to hold the current character in the loop
    // below. We'll just re-use this variable each time.
    char c = char.MinValue;

    // We'll keep a couple of flags to manage state while we parse...
    // At any given moment, we'll want to know if we think we're
    // inside delimited text.
    bool inText = false;
    // And, as we evaluate one character, we'll want to know if the
    // one before it "escaped" it.
    bool escaped = false;

    // Now, let's look at each character...
    for (int i = 0; i < chars.Length; i++)
    {
        // Get the character at this index.
        c = chars[i];

        // If we are not currently within a block of text, and we've
        // hit the field delimiter (,)...
        if (!inText && c == ',')
        {
            // ...the contents of the buffer are a new token.
            tokens.Add(buffer.ToString());
            // Now clear the buffer.
            buffer.Length = 0;
            // And move along.
            continue;
        }

        // If this character is the "escape" character and we are
        // presently within text...
        if (c == '\\' && inText)
        {
            // If we weren't already in the escape mode, we are now.
            if (!escaped)
            {
                escaped = true;
            }
            else
            {
                // Otherwise, the previous character escaped this
                // one.
                buffer.Append(c);
                // And we're no longer in the escaped mode.
                escaped = false;
            }

            // This character is handled, so move along.
            continue;
        }

        // If we see a text delimiter, i.e. a quote ("), and the
        // previous character did not "escape" it...
        if (c == '"' && !escaped)
        {
            // But if this is the very first character we've seen
            // since the last field delimiter (,)...
            if (buffer.Length == 0)
            {
                // ...this is our signal that this field is delimited
                // with quotes.
                inText = true;
            }
            // Otherwise, if this is the last character in the string,
            // or the very next character is the field delimiter...
            else if (i == chars.Length - 1 || chars[i + 1] == ',')
            {
                // ...that means that text delimiting is at an end.
                inText = false;
            }
        }

        // If none of the blocks above handled this character,
        // simply add it to the buffer.
        buffer.Append(c);
        // Since this character was not the "escape" character (\),
        // we are not, at this point, in an escape mode.
        escaped = false;
    }

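    // Place the remaining buffer contents as the final token. An empty
    // buffer still counts when earlier tokens exist, so that a line
    // ending in a comma yields an empty last field.
    if (buffer.Length > 0 || tokens.Count > 0)
        tokens.Add(buffer.ToString());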

    // Convert the tokens to an array and return them.
    return tokens.ToArray();
}
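Note that the method appends every quote character to the buffer, so a quoted field comes back with its surrounding quotes still attached. If you would rather have bare values, one possible post-processing step is sketched below. The helper name Unquote is hypothetical and is not part of the method above; the sample tokens are what the method should return for the line 1,"Smith, John",42.

```csharp
using System;

public static class CsvTokenCleanup
{
    // Hypothetical helper: removes one pair of surrounding quotes, if present.
    // Tokens without a matching pair are returned unchanged.
    public static string Unquote(string token)
    {
        if (token.Length >= 2 && token[0] == '"' && token[token.Length - 1] == '"')
        {
            return token.Substring(1, token.Length - 2);
        }
        return token;
    }

    public static void Main()
    {
        // Tokens as SplitCsv would return them for: 1,"Smith, John",42
        string[] tokens = { "1", "\"Smith, John\"", "42" };
        foreach (string token in tokens)
        {
            Console.WriteLine(Unquote(token));
        }
    }
}
```

Keeping this step separate from the lexer leaves the parsing logic unchanged, and you can skip it when you need to know which fields were quoted in the file.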

I hope this lets you get on with parsing that file so you can get back to the business at hand, or maybe gives you some new ideas about different ways to tackle your string parsing problem.

History

  • 8th September, 2009: Initial post
  • 9th September, 2009: Added a couple of lines to the code snippet to handle problems with the "escape" mode, and to correct the fact that the last field in the line was not included in the list of tokens

License

This article, along with any associated source code and files, is licensed under the BSD License.

About the Author

pat daburu

Software Developer (Senior)

United States

Last Updated 11 Sep 2009
Article Copyright 2009 by pat daburu