Click here to Skip to main content
Click here to Skip to main content

A Portable and Efficient Generic Parser for Flat Files

By , 16 Jul 2012
 

Introduction

We, as developers, are often faced with converting data from one format to another. For a project at work, I needed a portable solution that was efficient, had minimal external requirements, and parsed delimited and fixed-width data. As shown below, the GenericParser is a good replacement for any Microsoft provided solution and provides some unique functionality. The code is well organized and easy to follow to allow modification as necessary.

Note: The project was built using Visual Studio 2010, but the code is designed for .NET 2.0.

Definitions

  • Delimited data - Data whose columns are separated by a specific character (e.g., CSV - Comma Separated Values).
  • Fixed-width data - Data whose columns are of a set number of characters wide.

Features

The GenericParser (and the derived GenericParserAdapter) contains the following features:

  • Efficient - See Benchmarking below for more details.
    • Time: Approximately 3 to 10 times faster than any Microsoft provided solution
    • Memory: Approximately equal to or less than any Microsoft provided solution
  • Supports delimited and fixed-width formats
  • Supports a custom delimiter character (single character only)
  • Supports comment rows (single character marker)
  • Supports escape characters (single character only)
  • Supports a custom text qualifier to allow column/row delimiters to be ignored (e.g., multi-line data)
  • Supports escaped text qualifiers by doubling them up
  • Supports ignoring/including rows that contain no characters
  • Supports a header row
  • Supports the ability to dynamically add more columns to match the data
  • Supports the ability to enforce the number of columns to a specific number
  • Supports the ability to enforce the number of columns based on the first row
  • Supports trimming the strings of a column
  • Supports stripping off control characters
  • Supports reusing the same instance of the parser for different data sources
  • Supports TextReader and String (the file location) as data sources
  • Supports limiting the maximum number of rows to read
  • Supports customizing the size of the internal buffer
  • Supports skipping rows at the beginning of the data after the header row
  • Supports XML configuration which can be loaded/saved in numerous formats
  • Supports access to data via column name (when a header row is supplied)
  • Supports Unicode encoding
  • GenericParserAdapter supports skipping rows at the end of the data
  • GenericParserAdapter supports adding a line number to each row of output
  • GenericParserAdapter supports the following outputs - XML, DataTable, and DataSet
  • Thorough unit testing - 91.94% code coverage (tests supplied in source download)
  • Thorough XML documentation in code (including a .chm help file in the binary/source downloads)

Benchmarking

To benchmark the GenericParser, I chose to compare it to:

  • Microsoft's Text Driver
  • Microsoft's Text Field Parser
  • Sebastien Lorion's CsvReader 3.7 (CSV only - code found here[^])
  • GenericParser 1.0

To get a realistic datasource for benchmarking, I took 10 rows of data from the Northwind Database and replicated them for successively larger and larger sets of CSV and FixedWidth data. Using System.Diagnostics.Stopwatch to measure CPU usage, I executed each benchmark 10 times and averaged the results to minimize the amount of error in the instrumentation. For the memory usage, I used Visual Studio 2010's memory profiling and executed each benchmark only once.

I've tried to generate tests that exercise the code equally for each solution. As a caveat, these tests do not test every possible scenario - your mileage may vary. Please feel free to use my code as a basis (or create your own tests) to compare the code before you draw any conclusions.

For example, the tests below did not take into account escaped characters. In GenericParser 1.0, it allocated an additional buffer for escaped characters, which essentially doubled its memory requirements. In GenericParser 1.1, it reuses the existing buffer to unescape the column of data. You wouldn't see this benefit unless you specifically geared your tests to account for this.

Just because I know someone will comment about this, I am aware of FileHelpers[^], but I believe they fit into a different category which doesn't map easily for comparison to the above solutions. FileHelpers rely on a declarative definition of the file schema through attributes on concrete classes.  My solution depends on defining the schema through properties or XML.   You may feel free to compare them, if they fit into your problem space.

CPU Usage

Memory Usage

Note: Because profiling the memory was generating .vsp files upwards of 2 gigs and the memory usage seemed pretty stable, I only executed memory profiling for 10 to 10,000 rows of data.

Conclusion

As can be seen in the charts, GenericParser meets or exceeds anything Microsoft has put together in all areas. Furthermore, version 1.1 out performs version 1.0 in performance considerably, especially considering the bug fixes and the new features added.

As can be seen by the graphs, Sebastien Lorion's CsvReader is definitely the top contender for parsing delimited files. So, if you are looking at only parsing delimited files, I would highly recommend checking out his library. Otherwise, I find my library to be an effective implementation for being able to parse both formats.

In the source download, you can find all of my performance tests and results, including an Excel 2010 workbook that has all of the collected raw data together for charting purposes.

Using the Code

The code itself mimics most readers found within the .NET Framework, but the usage follows four basic steps:

  1. Set the data source through either the constructor or the SetDataSource() method.
  2. Configure the parser for the data source's format, either through properties, or by loading an XML CONFIG file via the Load() method.
  3. Call the Read() method and access the columns of data underneath, or for the GenericParserAdapter - GetXml(), GetDataTable(), GetDataSet() to extract data.
  4. Call Close() or Dispose().
DataSet dsResult;
 
// Using an XML Config file. 
using (GenericParserAdapter parser = new GenericParserAdapter("MyData.txt"))
{
    parser.Load("MyData.xml");
    dsResult = parser.GetDataSet();
}
 
// Or... programmatically setting up the parser for TSV. 
string strID, strName, strStatus;
using (GenericParser parser = new GenericParser())
{
    parser.SetDataSource("MyData.txt");
 
    parser.ColumnDelimiter = "\t".ToCharArray();
    parser.FirstRowHasHeader = true;
    parser.SkipStartingDataRows = 10;
    parser.MaxBufferSize = 4096;
    parser.MaxRows = 500;
    parser.TextQualifier = '\"';
 
    while (parser.Read())
    {
      strID = parser["ID"];
      strName = parser["Name"];
      strStatus = parser["Status"];
 
      // Your code here ...
    }
}
 
// Or... programmatically setting up the parser for Fixed-width. 
  using (GenericParser parser = new GenericParser())
  {
    parser.SetDataSource("MyData.txt");
 
    parser.ColumnWidths = new int[4] {10, 10, 10, 10};
    parser.SkipStartingDataRows = 10;
    parser.MaxRows = 500;
 
    while (parser.Read())
    {
      strID = parser["ID"];
      strName = parser["Name"];
      strStatus = parser["Status"];
 
      // Your code here ...
    }
}

Acknowledgements

While I did not create a derivative of Sebastien Lorion's CsvReader, I did use some of his concepts of provided functionality in his CsvReader for the GenericParser.

Tools Used

History

  • September 17, 2005 - 1.0 - First release
  • June 20, 2010 - 1.1
    • New features:
      • Supports ignoring/including blank rows of data (no characters found in row)
      • Supports the ability to enforce the number of columns based on the first row
      • Supports stripping off control characters
      • GenericParserAdapter supports skipping rows at the end of the data
      • Reduced memory overhead when using escaped characters
      • Support for specifying the data's encoding
    • Bug fixes:
      • Fixed a bug with parsing data with a header and no data
      • Fixed a bug in not handling text qualifiers/escape/comment characters consistently
      • Fixed a bug in reading a file across a high latency network
      • Fixed a bug with text qualifiers being interpreted in the middle of the column (only works if at the start and end of a column)
      • Fixed a bug with skipping row ends at the very end of the buffer
    • Breaking changes:
      • Fixed width parsing will no longer take text qualifiers or escape characters into account
      • The following properties have been converted to a char?:
        • ColumnDelimiter
        • CommentCharacter
        • EscapeCharacter
        • TextQualifier
      • RowDelimiter has been removed, and the code automatically handles looking for '\n' or '\r' to indicate a new row (assuming '\r' is not a column delimiter). If one of these characters is found, it will skip the paired '\n' or '\r' (assuming '\r' is not a column delimiter).
      • SkipDataRows has been renamed to SkipStartingDataRows
      • The FixedWidth property has been replaced by a property called TextFieldType, which is of the enum type FieldType
      • Due to the changes in the properties listed above, the XML produced by version 1.0 will not be 100% compatible with version 1.1
      • Read() will return true if it parses a header row and no data
      • ParserSetupException has been replaced by InvalidOperationException
      • Reworked the messages supplied in the exceptions to be more descriptive.
  • June 26, 2010 - 1.1.1
    • New features:
      • Reworked benchmarking to be more representative of real world data and switched over testing to not use DataSets
      • Slightly more efficient loading of configuration files
  • February 5, 2012 - 1.1.2
    • Bug fixes:
      • Fixed an issue where an exception was being thrown for the MaxBufferSize was too small, when it was indeed large enough (Reported by uberblue).
  • March 16, 2012 - 1.1.3
    • Bug fixes:
      • Fixed an issue where control characters were being removed accidently (Reported by John Voelk).
      • Fixed an issue where data at the end of the stream wasn't extracted properly (introduced in version 1.1.2).

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

About the Author

Andrew Rissing
Software Developer (Senior)
United States United States
Member
Since I've begun my profession as a software developer, I've learned one important fact - change is inevitable. Requirements change, code changes, and life changes.
 
So..If you're not moving forward, you're moving backwards.

Sign Up to vote   Poor Excellent
Add a reason or comment to your vote: x
Votes of 3 or less require a comment

Comments and Discussions

 
You must Sign In to use this message board.
Search this forum  
    Spacing  Noise  Layout  Per page   
QuestionStart/End byte offset optionsmemberMember 100367218 May '13 - 0:32 
Hi,
 
This looks almost perfect for my needs. I would like to be able to read lines from a file between 2 byte offsets. Would it be something you can add easily?
 
Background:
 
I want to be able to quickly divide a file into logical chunks which will then be loaded in parallel into SQL Server. This will be a major performance boost over the current BCP program and will negate the need to chainsaw the file first.
 
With my limited c# skills, I have already written the code to determine the start and end byte offsets of the lines that each thread will process.
AnswerRe: Start/End byte offset optionsmemberAndrew Rissing8 May '13 - 6:32 
I'm not 100% sure what you're asking for, but you could easily pass the parser whatever 'Stream' (see MemoryStream as a starting point) you want, which could contain the sectioned pieces to parse.
 
As for performance, I would think that having a thread to parse columns from each row would be horribly inefficient. Perhaps, a thread for every 200kb of data might be a good place to start.
 
Anyways, you likely don't need any help from me on this to use this parser in your scenario.
QuestionError RowsmemberMember 989307730 Apr '13 - 3:17 
How would you recommend retaining error rows? Currently, GenericParser throws an exception for bad data including "Expected column count of 4 not found. [Row: 10, Column: 2]". My goal is to end up with both a DataTable of parsed rows, and perhaps...
 
Dictionary<row number, raw text>
 
I got as far as changing:
 
while (this.Read())
 
to
 
bool continueParsing = this.Read();
 
while (continueParsing)
{
    ...
 
    try
    {
        continueParsing = this.Read();
    }
    catch (ParsingException)
    {
        // Something like: this._SkipToEndOfRow();
        // Something like: this.errorRows.Add(this.fileRowNumber, this.fileRowRawText);
    }
}
 

but discovered it is a character-by-character parser. How would I get the text of the entire row, and move the read index to the end of the row? Is there a better way you would recommend?
AnswerRe: Error RowsmemberAndrew Rissing30 Apr '13 - 6:15 
It really depends on what information you want to present back up the chain.
 
If you want the full text of the line (which might be quite large), you'll need to keep a running tally of this as you extract each column, 'reverse engineer' it by taking the parsed columns and reconstructing what the data would be, or just provide what the parsed values were.
 
From that point, your next question is how to continue the parsing. There are only two places in the code that call _CreateParsingException that allow for you to continue parsing (within _HandleEndOfRow and _ExtractColumn). You could just modify them to allow you to skip to the end of the row before throwing the exception. When the next read is called, it should progress as if nothing happened.
GeneralMy vote of 5memberhousemusic12 Apr '13 - 3:53 
Top notch
GeneralRe: My vote of 5memberAndrew Rissing12 Apr '13 - 4:25 
Thanks.
QuestionParsing Properties and XML for multi delimited rows.memberstixoffire18 Mar '13 - 18:38 
Hello I have a file that has a header that starts out with a semicolon ; and the next segment of data starts out with Colon :Template, which represents an object definition, then the next line is a colon : with the header info separated by columns followed by the info that goes into those columns separated by columns. How would I specify properties for this kind of file ?
AnswerRe: Parsing Properties and XML for multi delimited rows.memberAndrew Rissing19 Mar '13 - 4:31 
My gut feeling is that you probably will not be able to use this parser or any 'generic' parser out there for something like that, as the format sounds very unique. If you can provide a sample of what the data would look like, as your description is a little unclear, I could confirm my suspicions quite easily.
GeneralRe: Parsing Properties and XML for multi delimited rows.memberstixoffire19 Mar '13 - 14:49 
Here is the sample - I will cut it down a bit but you will get the jist of it.
 
; Created 3/11/2012 Project Sample
 

:TEMPLATE=ShoeFinish.Black
:ObjectName,FieldA,FieldB,FieldC .... all the way to FieldXYZ
Toledos,leather,,Strings,brown,,,,Special
 
:TEMPLATE=ShoeFinish.Brown
:ObjectName,FieldA,FieldB,FieldC .... all the way to FieldXYZ
Toledos,leather,,Strings,brown,,,,Special
 
The first ':' is the beginning of the group of items, the second ':' is the group of items fields with their respective values following.
 
BTW: I was thinking if this would not work that I might need to read it in with Stream Reader one line at a time and then parse with RegEx. but I am not sure about the best approach..
GeneralRe: Parsing Properties and XML for multi delimited rows.memberAndrew Rissing20 Mar '13 - 4:24 
Yeah, this is a pretty unique file format.
 
I'd say your best bet is RegEx, String.Split, or parsing the strings manually - either would suffice. Based on the complexity of the data encoded, you may find one form or another to be more successful at parsing the file. But that will be up to you to decide.
 
Good luck.
QuestionGetDataTable - File with No Rows Leaves out Headersmembermellamokb5 Mar '13 - 12:25 
If you use the GetDataTable method on a file that has a header row, but no content rows, the resulting DataTable will not have any Columns.   This seems like incorrect behavior to me and led to an obscure bug in our software because it complained that a core column was missing from the DataTable.   Can you resolve this please?
AnswerRe: GetDataTable - File with No Rows Leaves out HeadersmemberAndrew Rissing6 Mar '13 - 4:55 
It is a pretty simple fix. Update the final return statement in the Read method, as such:
public bool Read()
{
  /* .... */
 
  return ((this.m_lstData.Count > 0) || !this.m_blnSkipEmptyRows || (this.m_blnHeaderRowFound && (this.m_intDataRowNumber == 0)));
}
 
This should fix the problem you're having. Oddly enough, it only manifests itself if you have a blank row after the header. I believe someone else had reported a similar issue, but I unfortunately only resolved it for their specific case.
 
Applying the above change, should resolve your issue. I'll see if I can roll out this change sometime soon, but for now you can take the code and recompile it to suit your immediate needs.
QuestionBasic Newbie question - how do I add this code to my existing project or solution?memberwrxmarcus2 Mar '13 - 22:29 
Please feel free to redirect me if this is not an appropriate posting. I have been programming since the 80's, but sadly I am quite new to the Visual Studio environment (... well I used to program in VB4.0...).
 
I've spent about an hour attempting to figure out how to add this code (or project) to my existing project. I've then spent about another hour trying to learn about projects & solutions and adding references in VS, but again I have failed to get this code working when I compile. Can anyone give me step-by-step instructions for including this code?
 
Feel free to point me to something in this thread that I missed the already explains it. Thanks ahead of time!
-marcus
AnswerRe: Basic Newbie question - how do I add this code to my existing project or solution?memberAndrew Rissing3 Mar '13 - 14:31 
I'm not sure what exactly you are having a problem with, but I'd assume your issue is something at a very fundamental level. I'd recommend starting with a clean solution, adding a console application, and then including the references to the DLLs included in the article.
 
If that doesn't help you resolve your problem, feel free to post back with the specific error messages you're receiving and I'll try to help as best I can.
SuggestionSkipping header rowsmemberNeil Dobson19 Feb '13 - 12:36 
Hi, I have some files with a fixed number of rows at the top where there's some document titles and other random stuff. The real csv header always starts on row 7 and the data follows.
 
I've modified the code and inserted a new property SkipStartingHeaderRows and modified the ParseRowType method to skip header rows before evaluating m_blnFirstRowHasHeader. I have sample files for testing. Would I be able to submit this for consideration into the project?
 
Thank you.
GeneralRe: Skipping header rowsmemberAndrew Rissing19 Feb '13 - 15:21 
Sure, you definitely can.
 
You can either email me via this message or upload them via GitHub[^], whichever suits you.
 
I'll review the changes and, if everything looks good, incorporate them and add you as a contributer.
 
Thanks.
GeneralMy vote of 4memberBeyMelamed7 Jan '13 - 15:59 
Clear and concise. I would love Andrew to add Generic processing. What i mean is the use enumeration (or other methodology) to enable setting the reader based on some text token the semantics of which is the type of reader to process the input and return a dataset.
 
Much appreciated
QuestionSource file zip problemmemberrobertbielen5 Jan '13 - 15:06 
Hi Andrew,
 
There seems to be a problem with the zip file of the source - neither WinZip nor WinRAR recognize it to be a valid file. Could you possibly post a working version?
 
Thanks so much,
 
Robert
QuestionFeatures request - GenericParseAdapter [modified]memberCFQüeb5 Nov '12 - 7:03 
Hi Andrew,
 
After using your excellent tool to import flat file (Adapter mode), I have noticed the following:
 
bugs? Confused | :confused: (using Adapter mode)
Rows with a blank are considered in the final result and displayed to the user.
Rows with smaller or larger number of columns to those in ExpectedColumnCount property are also displayed. Maybe a property named AddOnlyRowsFulFillExpectedColumns will be nice.
 
new functionality
Possibility of allowing the developer to assign a string array with the headers used in the table used to display data in adapter mode.
Create a collection with rows ignored in the importing process.
 
Thank you. Blush | :O

modified 5 Nov '12 - 14:39.

QuestionSave and print option?memberMember 83787692 Nov '12 - 8:37 
Hi Andrew,
 
I have just read your article, which is interesting. My question: once ploted, is it possible to save as well as print the chart? Does this offer only line charts or other chart types?
Thank you in advance for your reply
AnswerRe: Save and print option?memberAndrew Rissing2 Nov '12 - 9:24 
You may want to re-read the article one more time. The GenericParser is used to parse data from flat files and does not contain any charting capability.
 
The charts I presented in the article are used solely for a performance comparison and are generated using Excel.
GeneralMy vote of 5memberCFQüeb31 Oct '12 - 13:06 
thank you for this great tool. Really useful.
GeneralRe: My vote of 5memberAndrew Rissing31 Oct '12 - 17:42 
You're welcome.
BugHeader row only bugmemberdukekujo7 Oct '12 - 17:33 
I've found that the utility does not work if the text file only has the header row, and no other rows. It thinks there are no columns. I'm hoping this can be fixed as I use the utility to determine the column order, even if there is no data in the file.
AnswerRe: Header row only bugmemberAndrew Rissing8 Oct '12 - 4:00 
using (var sr = new StringReader("a,b,c,d"))
using (var gp = new GenericParser(sr))
{
    gp.FirstRowHasHeader = true;
 
    var result = gp.Read(); // true
    var columnNames = Enumerable.Range(0, gp.LargestColumnCount).Select(x => gp.GetColumnName(x)).ToArray(); // { "a", "b", "c", "d" }
}
Try that and let me know if you have any further problems.

General General    News News    Suggestion Suggestion    Question Question    Bug Bug    Answer Answer    Joke Joke    Rant Rant    Admin Admin   

Permalink | Advertise | Privacy | Mobile
Web01 | 2.6.130523.1 | Last Updated 17 Jul 2012
Article Copyright 2005 by Andrew Rissing
Everything else Copyright © CodeProject, 1999-2013
Terms of Use
Layout: fixed | fluid