CSV File Parser

#realJSOP

4.83/5 (36 votes)

Sep 12, 2016

CPOL

13 min read

113463

3358

Parse CSV files.

Download CSVParser.zip - 45.2 KB

Introduction

---------------------------------------------------------------
ATTENTION! - I've added an improved version of this article here - CSV/Excel File Parser - A Revisit[^]. Consider this article to be obsolete.
---------------------------------------------------------------

Let's be blunt - I honestly don't feel like typing an entire article about yet another CSV parser. In the end, people that are looking for a solution to a the same problem addressed herein probably won't actually read the text. They'll simply download the code, try to use it, and if I'm lucky, they'll take the time to use the debugger (using their uber awesome debugging skills) to figure out the issues on their own before posting a message below, demanding that I fix the code for their specific application.

For this reason, I'm not going to go into as much detail about how the code works as much as what it does and why it does it the way it does it when it isn't doing other things that need to be done. There will be few code snippets, absolutely no images, and as much blatant apathy as I can muster while still conveying the intended information.

Background

I live in a very bizarre programming world. My most persistent project involves importing data from almost five DOZEN different data sources, comprised mostly of Excel XLSX files, a web site that provides raw data in XML format, and a couple of actual database queries. The Excel spreadsheets come from a mix of database queries on web sites that can return the results in the form of said spreadsheet files, with the rest being manually generated by humans.

Garbage In

You would think that the human-generated files would present the most quirks because humans are flawed. Mind numbingly boring data entry, a substantial lack of monetary compensation in the form of a paycheck, and the fact that it's a government employee performing the work all conspire to form a perfect storm of inability to muster anything resembling an attention to detail, which leads to "nuances" in the tabular data. However, database pulls can be equally fraught with errors, (amusingly enough) especially if the data entry side of the database doesn't quite catch all of the potential errors that a human is apt to introduce.

Garbage Out

The primary method of importing said spreadsheets is a library called EPPlus. While it's a nice library in general, it has some weaknesses. The one that drove me to writing the code featured in this article is that, for some reason (as yet undiscovered by - well - anybody), some .XLSX files simply will not load using the library. This "nuance" forces me to use Excel to save the desired sheet as a CSV file, which then forced me to write more code to implement the capability. And that's why you're reading this.

Assumptions

As with most of my articles, this one is not about theory, or the newest doo-dad or gee-gaw that Microsoft seems to think we want to see in .Net. To put it simply, this is real-world code that lives and breathes in an actual project. As it gets used, it gets tested a little more thoroughly, and as problems crop up, they are promptly fixed. The code as presented here appears to work fairly well - today. Tomorrow will be a fresh hell of "might not", because I cannot think of every possible thing that might get thrown at it. I try to head most of the obvious stuff off, but like everything else related to programming, just when you think your code is idiot-proof, the world invents a better idiot, and you end up performing what I call knee-jerk programming.

This article assumes that you're a moderately accomplished developer, but one who wants some code to put out a fire that you'd really rather not put too much time into yourself. I didn't do anything too fancy or elegant, because "fancy and elegant" code is quite often a lot harder to understand and maintain. The code is heavily commented, so there should be ample explanation about how it works.

A very short sample file (comprised of a header row and two data rows) is provided to exercise the class. To ensure that the class meets your particular needs, use the included sample project to establish CSVParser's suitability, and make any changes that you deem necessary before putting it into your own project(s).

The Code

Once again, this article isn't as much about how it works, but more about what it does and why it does it. Keep in mind that more often than not, the answer to "why" is going to be because I'm the laziest redneck you'd ever want to meet. And I'm old. Really old. I simply don't care if it fits everyone's needs (especially anyone living outside the US), as long as it fits mine. Like I said, this code is about what *I* needed, and you're simply the beneficiary of my magnanimous outlook on sharing code.

By now, I've probably pissed more than a few people off, but that doesn't bother me because if I hadn't, I wouldn't be perpetuating my reputation here on CP, thus disappointing my legions of fans (okay, maybe one or two people would be disappointed, so "legions" is a subjective term).

What it is

The CSVParser class is an abstract class that parses a file (or stream) of comma-separated values. Being abstract, it must be inherited by a programmer-developed class, which must, at a minimum, implement the abstract methods. Most of the methods in the CSVParser class are virtual, allowing the programmer to override their functionality with new or supplementary processing.

What It Does

The code takes a comma-delimited text file (or stream) and parses each line into discrete fields.

Configuration Properties

public HasHeaderRow - Indicates that the first line of the file is a header row. All of the CSV files that I deal with have header rows, and that makes things quite a bit simpler for me. If you don't have a header row in your CSV file, a moderate amount of risk is introduced into the column identification process, because there is the possibility that the first line of data could be malformed. Default value is true.
public ExactDateTimeFormat - Indicates the DateTime format used when parsing a date time field. Default value is 'M/d/yyyy'.
public RemoveCurrencySymbols - In order to correctly cast values that represent currency, we have to strip the currency symbol (if it exists). Default value is true.
public CurrencySymbol - This is the currency symbol you want strip if RemoveCurrencySymbol is true. If this property is null or empty, the class uses the current culture to determine what the currency symbol is that you want to strip.
public ThrowFindExceptions - After a line is parsed, it's up to the calling method to process the resulting fields using the (overloaded) FindValue() method. If this flag is true, an exception will be thrown if the field being sought could not be cast to the appropriate type.

Internally Consumed Properties

protected DataStream - The stream either passed to the Parse method or created by loading the specified file.
protected Columns - When the stream is parsed, this dictionary (string, int) will be populated by the first row, whether it's row headers, or actual data. If HasHeaderRow is true, the keys stored in this dictionary will be the text contained in the parsed line. Excel automatically encloses header row fields with square brackets, but these brackets are stripped by the parser. If HasHeaderRow is false, column names are automatically assigned in the format "ColN", where "N" is the index number of the column. In both cases, the value of the KeyValuePair is the numeric index of the column.
protected CurrentData - this is a string array of the fields discovered when a line is parsed.
protected CurrentLine - this is the line currently being parsed. I implemented this property so that I could add the line to one of the custom exceptions supported by this class.
protected IsMalformed - Indicates that the current line is malformed.
public InvalidLines - Indicates a list of line indexes for lines that were invalid and could not be corrected.
public TotalLinesProcessed - Indicates a count of the total lines processed. This figure does not include blank lines or the header row.

Parsing, Generally Speaking

The constructor takes no parameters, allowing the programmer to use automatic properties to set the configuration properties as desired. I personally prefer this to setting up what results in being a seemingly endless succession of overridden constructors, each having a range of parameters that need to be provided. I still use parameterized constructors, but only when absolutely necessary or expedient. The constructor is only used to initialized the various List properties. In your derived class, you don't really have to do anything unless you prefer to set the configuration properties there instead of using automatic properties.

To start the parsing process, call the Parse method with the desired filename or stream object. Each line is parsed in turn, using the following method:

protected virtual string[] ReadFields(string text, bool removeQuotes=true)
{
    //assume we have a proper line of text
    this.IsMalformed = false;
    // split the string on commas (because this is a CSV file, after all)
    string[] parts = text.Trim().Split(',');
 
    // create a container for our results
    List<string> newParts = new List<string>();
    // set some initial values
    bool inQuotes = false;
    string currentPart = string.Empty;
 
    // iterate the parts array
    for (int i = 0; i < parts.Length; i++) 
    {
        // get the part at the current index
        string part = parts[i];
        // if we're in a quoted string and the current part starts with a single double 
        // quote AND currentPart isn't empty, assume the currentPart is complete, add it to 
        // the newParts list, and reset for the new part
        if (inQuotes && part.StartsWithSingleDoubleQuote()==true && !string.IsNullOrEmpty(currentPart))
        {
            currentPart = string.Concat(currentPart, "\"");
            newParts.Add(currentPart);
            currentPart = string.Empty;
            inQuotes = false;
        }
        // see if we're in a quoted string
        inQuotes = (inQuotes || (!inQuotes && part.StartsWithSingleDoubleQuote() == true));
        // if so, add the part to the current currentPart
        if (inQuotes)
        {
            currentPart = (string.IsNullOrEmpty(currentPart))? part : string.Format("{0},{1}", currentPart, part);
        }
        // otherwise, simply set the currentPart to the part
        else
        {
            currentPart = part;
        }
        // see if we're still in a quoted string
        inQuotes = (inQuotes && currentPart.EndsWithSingleDoubleQuote()==false);
        // if not
        if (!inQuotes)
        {
            // remove the quote characters
            currentPart = (removeQuotes) ? currentPart.Trim('\"') : currentPart;
            // put the currentPart into our container
            newParts.Add(currentPart);
            // reset the currentPart
            currentPart = string.Empty;
        }
    }
    this.IsMalformed = (inQuotes || (this.Columns.Count > 0 && newParts.Count != this.Columns.Count));
    return newParts.ToArray();
}

Originally, I was using the VisualBasic.FileIOTextFieldParser object to handle this, but I despise all things connected with VB, and adding a reference to a VB assembly simply felt wrong on so many levels (not to mention the fear that I woudl develop an unnatural desire to use goto statements). Since I'm handling all the other stuff (that I needed) that the VB object provided, I figured it would be a lot more "developery" to roll my own version of the TextFieldParser.ReadFields method.

When parsing of the line is finished, the abstract method ProcessFields(bool isMalFormed) is called. In your derived class, you would do something like this:

protected override void ProcessFields(bool isMalformed)
{
    if (this.CurrentData != null && !isMalformed)
    {
        // TO-DO: Your stuff
        try
        {
            int      col1 = this.FindValue("col1", -1);
            string   col2 = this.FindValue("col2", "ERROR");
            string   col3 = this.FindValue("col3", "ERROR");
            double   col4 = this.FindValue("col4", -1d);
            DateTime col5 = this.FindValue("col5", new DateTime(0));
        }
        catch (FindValueException fvex)
        {
            //TO-DO: react to an exception thrown because the value found could not be cast 
            //		 to the expected type.
        }
    }
}

This is where you retrieve the field values and do something with them. Most likely, that something would be to create an instance of an appropriate application-specific object and set its properties to the field values.

Remember that you can choose to throw exceptions if a field does not parse to the expected type (indicated by the receiving variable/property). However, setting a default to something that indicates an error can sometimes be infinitely more useful, especially when debugging. I have a method in my objects that performs validity checking based on the contents of the properties. I don't use it in this sample project, but I'm including it in this article because it might be useful to others.

public bool IsValid 
{
    get 
    {
        bool           valid = false;
        PropertyInfo[] infos = this.GetType().GetProperties();
        foreach(PropertyInfo info in infos)
        {
            if (info.Name != "IsValid")
            {
                object property   = info.GetValue(this, null);
                string propString = string.Format("{0}",property);
                valid             = (!propString.IsInExact("-1,-1.0,ERROR"));

                if (!valid)
                {
                    break;
                }
            }
        }
        // we don't need to check the dates specificially because if they weren't 
        // valid, this object would not have been created
        return valid;
    }
}

It uses reflection but that really can't be helped if you want to use it everywhere, such as from a base class or something. I also included the IsInExact() extension method in the project (this IsValid property utilizes it). You may note that it doesn't validate DateTimes because my code won't create the objects that contain this property if the dates aren't valid, but it wouldn't be difficult to include, even for a mildly skilled programmer.

Once the stream/file has been parsed, the CSVParser class calls the abstract Finished() method. This gives you the opportunity to do whatever you need to do with the parsed data that you've retained. It can also be of assistance during debugging, allowing you to examine the list of line indexes for valid, invalid, and corrected lines.

Finding Values

Once parsing has started, and the base class invokes the ProcessFields method, you can retrieve the data one field at a time. To do so, you use the FindValue() method. FindValue() is overridden for the four most common types (string, int, double, and DateTime). FindValue accepts the desired column name, as well as a default value to assign if there is a problem either finding the specified column, or not being able to convert the found data into the desired type.

Finding the specified column involves the IsLike extension method, which functions similarly to the SQL LIKE function. You can use any appropriate SQL wildcard character in the column name to find it in the dictionary of discovered columns. So, given a column name of "Really Long Column Name", you could use something like "really long col%", "%long column%", %column name", or "Really Long Column Name" to find that column in the dictionary. It is merely a vehicle to allow less typing. Of course, care must be taken so that the column name specified in FindValue is qualified enough to find the desired column. As will the SQL LIKE function, string matching is not case sensitive, regardless of whether wildcards are used or not.

What It Doesn't Do

Does not support any delimiter other than a comma (but can be easily modified to do so)
Does not do anything but parse the data lines. There is no direct support for populating external (programnmer-defined model) objects

Usage

First, you instantiate your derived parser object:

CSVFileParser parser = new CSVFileParser();
parser.Parse("sample1.csv");

Inside your derived parser object, override the two abstract methods:

public class CSVFileParser : CSVParser
{
    public CSVFileParser() : base()
    {
    }

    protected override void ProcessFields(bool isMalformed)
    {
        if (this.CurrentData != null && !isMalformed)
        {
            try
            {
	            // TO-DO: Your stuff (involves calling FindValue for each field 
                //        you want to retrieve from the current line)
            }
            catch (FindValueException fvex)
            {
                //TO-DO: react to an exception thrown because the value found could not be cast 
                //       to the expected type.
            }
        }
    }

    protected override void Finished()
    {
        //TO-DO: Celebration that could include dancing naked around a fire, pounding your 
        //       chest and singing songs of a successful parsing event.

        // At this point you can examine the properties that invalid lines, and the total lines processed 
		// (excluding the header row and blank lines).
    }
}

Of course, you can override pretty much any method in the base class to modify the parser's behavior.

Definitions

Malformed data - There's pretty much only one thing (that I can think of) that would indicate malformed data, and that would be a field that is supposed to be surrounded by double-quotes but has one of those quote characters missing. When Excel creates a CSV file from a worksheet, it will automatically enclose a field with double-quote characters, and from what I've observed, there should be no such thing as a malformed CSV file that is sourced from Excel. That leads me to postulate that malformed fields can only be generated by other software or when a human has manually mucked around in the file in question.

Caveats and Emptors

There are obviously other solutions to this problem that are either more minimalistic, or more substantive, and I leave you to your googling talents to find said alternatives. Please do not waste space in the comments section by informing me of these alternatives. I simply don't care. I wrote this code because it fits my needs. I'm sharing it because it might fit someone else's.

Article History

15 DEC 2016 - Fixed some spelling errors and expressed further disdain for lazy programmers with poor reading comprehension skills.
19 SEP 2016 - Oops. I left the wrong sample filename in program.cs. Uploaded a new version of the zip file.
18 SEP 2016 - I found a bug in the ReadFields method that would sometimes result in a "malformed" line, even when the line is not in fact malformed. Following that, I made several concentrated attempts to auto-correct malformed fields, but I came to the conclusion that a) a malformed file is the fault of the program that created it, or b) a human that messed with it. It is impossible to auto-correct a malformed file because there's no telling how many errors might occur, and where to accurately make the appropriate correction. In the interest of full disclosure, I included the old CSVParser source file that contains several attempts to resolve this. It is my considered opinion that you actually need an application written that allows you to visually inspect the errant lines and repair them manually, allowing you to reprocess the file with CSVParser.

I also deleted all references to auto-correcting from the article (but left the old code in the zip file so you could witness that tragic failures that ensued).

Finally, I added a bit of globalization to the class where removing currency symbols is concerned. If you don't specify the currency symbol to remove, the code uses the current culture info to determine the appropriate currency symbol.
12 SEP 2016 - Initial publication.