A portable and efficient generic parser for flat files

By Andrew Rissing

The GenericParser is a C# implementation of a parser for delimited and fixed width format files.
C#, Windows, .NET 1.1, VS.NET 2003, DBA, Dev
Posted: 19 Sep 2005

Introduction

We, as developers, are often faced with converting data from one format to another. Recently at work, I encountered a problem of this type. It required a portable solution that was efficient and could easily be plugged into our current web-based product with minimal external requirements.

After extensive research, I found two possible solutions: Microsoft's ODBC Text Driver and Sébastien Lorion's CsvReader on CodeProject. Comparing the two, I found the following pros and cons:

  • Microsoft's ODBC Text Driver: Pros
    • Stable
    • No coding required
    • Can parse delimited and fixed-width formats
  • Sébastien Lorion's CsvReader (v1.1): Pros
    • Efficient
    • Not tied to MDAC
    • Open source
  • Microsoft's ODBC Text Driver: Cons
    • Tied to MDAC
    • Schema.ini file required for configuration
    • Requires use of the file system to perform parsing
    • Slow when compared to other custom solutions
    • Closed source
  • Sébastien Lorion's CsvReader (v1.1): Cons
    • Could not handle fixed-width formats
    • No support for loading/saving configuration
    • Learning curve for modifying code

Since neither of these two solutions proved to be viable, I proceeded to write my own parser.

Note: Sébastien Lorion's CsvReader has been updated since I began my project, but it still does not include the fixed-width parsing or XML configuration that I needed.

Definitions

  • Delimited data - Data whose columns are separated by specific character(s).
  • Fixed-width data - Data whose columns each occupy a set number of characters.
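
To make the two definitions concrete, here is a minimal sketch that splits one line of each kind using only the .NET base class library. It is not the GenericParser's implementation; the class and method names (LineSplitSketch, SplitDelimited, SplitFixedWidth) are hypothetical, and trimming the padding from fixed-width columns is an assumption made for the example.

```csharp
using System;
using System.Collections.Generic;

public static class LineSplitSketch
{
    // Delimited: columns are separated by a specific character.
    public static string[] SplitDelimited(string line, char delimiter)
    {
        return line.Split(delimiter);
    }

    // Fixed-width: each column occupies a set number of characters.
    public static string[] SplitFixedWidth(string line, int[] widths)
    {
        List<string> columns = new List<string>();
        int offset = 0;

        foreach (int width in widths)
        {
            if (offset >= line.Length)
            {
                // The line ended before this column started.
                columns.Add(string.Empty);
            }
            else
            {
                // Guard against a short final column.
                int length = Math.Min(width, line.Length - offset);
                // Assumption: padding spaces are not part of the value.
                columns.Add(line.Substring(offset, length).Trim());
            }
            offset += width;
        }

        return columns.ToArray();
    }
}
```

For example, SplitDelimited("1,Ann,Active", ',') and SplitFixedWidth("1    Ann  Active", new int[] {5, 5, 6}) both yield the three columns 1, Ann, and Active.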

Features

The GenericParser (and the derived GenericParserAdapter) contains the following features:

  • Efficient - See Benchmarking below for more details.
    • Time: Approximately 3-100 times faster than the Microsoft Text Driver.
    • Memory: Approximately equal to the Microsoft Text Driver (about 2 KB more in memory consumption).
  • Supports delimited and fixed-width formats.
  • Supports comment rows.
  • Supports escape characters (single character only).
  • Ignores blank rows of data (no characters found in row).
  • Supports header row.
  • Supports the ability to dynamically add more columns to match the data.
  • Supports the ability to enforce the number of columns to a specific number.
  • Supports text qualifier to allow column/row delimiters to be ignored.
  • Supports multi-line data.
  • Supports trimming the strings of a column.
  • Supports reuse of the parser for different data sources.
  • Supports TextReader and string (the file location) as data sources.
  • Custom values can be supplied for:
    • Buffer size
    • Column delimiter
    • Row delimiter
    • Column widths (for fixed-width formats)
    • Comment character (single character)
    • Escape character (single character)
    • Max number of data rows to be read.
    • Number of data rows to skip from the beginning.
    • Expected number of columns.
    • Text qualifier (single character)
  • Supports XML configuration which can be loaded/saved in numerous formats.
  • Supports access to data via column name (when a header row is supplied).
  • Supports the following outputs - XML, DataTable, and DataSet.
  • Supports Unicode encoding.
  • GenericParserAdapter supports the ability to supply the line number that a row was retrieved from in the output.
  • Thorough testing with NUnit (tests supplied in demo project).
  • Thorough documentation in code (including NDoc produced help file supplied in the demo project).
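
To illustrate what a text qualifier does (the feature listed above that lets column delimiters be ignored inside qualified text), the following is a minimal sketch of the idea. It is an illustration under stated assumptions, not the GenericParser's code: the names QualifiedSplitSketch and Split are hypothetical, and the sketch deliberately ignores escape characters, row delimiters, and multi-line data.

```csharp
using System.Collections.Generic;
using System.Text;

public static class QualifiedSplitSketch
{
    // Splits one line on the delimiter, except where the delimiter
    // appears between a pair of text qualifier characters.
    public static List<string> Split(string line, char delimiter, char qualifier)
    {
        List<string> columns = new List<string>();
        StringBuilder current = new StringBuilder();
        bool insideQualifier = false;

        foreach (char c in line)
        {
            if (c == qualifier)
            {
                // Toggle the state; the qualifier itself is not kept.
                insideQualifier = !insideQualifier;
            }
            else if (c == delimiter && !insideQualifier)
            {
                // An unqualified delimiter ends the current column.
                columns.Add(current.ToString());
                current.Length = 0;
            }
            else
            {
                current.Append(c);
            }
        }

        // The last column has no trailing delimiter.
        columns.Add(current.ToString());
        return columns;
    }
}
```

With a comma delimiter and a single quote as the qualifier, the line 1,'Smith, John',Active splits into three columns, and the comma inside the quotes stays part of the second one.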

Benchmarking

To benchmark the GenericParser, I chose to compare it to the Microsoft Text Driver (because the Microsoft Text Driver can perform delimited and fixed-width parsing). I generated five sets of CSV and fixed-width data with 8, 80, 800, 8000, and 40000 rows of data. Then, I ran that data through four different benchmarks:

  1. Opening and reading each character in the data file.
  2. Directly loading the data into a DataSet without performing any parsing.
  3. Parsing the data with the Microsoft Text Driver.
  4. Parsing the data with the GenericParserAdapter.

I executed each benchmark 50 times and averaged the results to reduce measurement error. A few of the values still fall outside the expected range, which I attribute to other processes running on my machine.
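
The run-and-average approach can be sketched roughly as follows. This is not the benchmark harness from the demo project: the names BenchmarkSketch and AverageMilliseconds are hypothetical, and the sketch assumes a newer framework than .NET 1.1, since Stopwatch and Action were added in later versions.

```csharp
using System;
using System.Diagnostics;

public static class BenchmarkSketch
{
    // Runs the action the given number of times and returns the
    // mean elapsed time per run, in milliseconds.
    public static double AverageMilliseconds(Action action, int iterations)
    {
        if (iterations <= 0)
        {
            throw new ArgumentOutOfRangeException("iterations");
        }

        double total = 0;

        for (int i = 0; i < iterations; i++)
        {
            Stopwatch watch = Stopwatch.StartNew();
            action();
            watch.Stop();
            total += watch.Elapsed.TotalMilliseconds;
        }

        return total / iterations;
    }
}
```

A call such as BenchmarkSketch.AverageMilliseconds(ParseFile, 50), where ParseFile is a placeholder for whichever of the four benchmarks is being measured, would correspond to the 50 runs described above.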

The first two benchmarks were baselines to provide a sanity check. By loading up the file and reading each character, I knew that my parser could not perform faster than this benchmark. Additionally, I used the second benchmark to just load up the data into the DataSet to determine the minimum amount of memory needed to produce the DataSet.
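
As a rough illustration of the first baseline, reading every character without parsing anything might look like the sketch below. The names BaselineSketch and CountCharacters are hypothetical, and the actual baseline in the demo project may differ.

```csharp
using System.IO;

public static class BaselineSketch
{
    // Reads every character from the reader and does nothing else,
    // which gives a lower bound on the time any parser could take.
    public static long CountCharacters(TextReader reader)
    {
        long count = 0;

        // TextReader.Read() returns -1 once the input is exhausted.
        while (reader.Read() != -1)
        {
            count++;
        }

        return count;
    }
}
```

For a file, the reader would be a StreamReader opened on the data file; a StringReader works equally well for in-memory data.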

Using these baselines, I was able to provide some certainty to the other two benchmarks. I found that the memory consumption for the GenericParser was roughly equivalent to the Microsoft Text Driver (it averaged about 2 KB more for each test). So, the only factor to compare is the time it took to run the tests. Below are the charts for the time it took for both parsers to operate on the delimited and fixed-width data:

As can be seen in the charts, the GenericParser outperforms the Microsoft Text Driver. The GenericParser is roughly 3-100 times faster than the Microsoft Text Driver, depending on the size of the file.

Caution: It was observed during tests that the Microsoft Text Driver may surpass the GenericParser in efficiency as the number of rows increases beyond 40,000 rows of data. But this is expected as managed code would undoubtedly be slower as more objects are placed in memory to manage.

In the demo project, you can find all of my performance tests and results, including an Excel workbook that has all of the collected raw data together for charting purposes.

Using the code

The code itself mimics most readers found within the .NET Framework, and its usage follows four basic steps:

  1. Set the data source through either the constructor or the SetDataSource() method.
  2. Configure the parser for the data source's format, either through properties or by loading an XML config file via the Load() method.
  3. Call the Read() method and access the columns of the current row, or, for the GenericParserAdapter, call GetXml(), GetDataTable(), or GetDataSet() to extract the data.
  4. Call Close() or Dispose().

  DataSet dsResult;

  // Using an XML config file.
  using (GenericParserAdapter parser =
             new GenericParserAdapter("MyData.txt"))
  {
    parser.Load("MyData.xml");
    dsResult = parser.GetDataSet();
  }

  // Or... programmatically setting up the parser for TSV.
  string strID, strName, strStatus;
  using (GenericParser parser = new GenericParser())
  {
    parser.SetDataSource("MyData.txt");

    parser.ColumnDelimiter = "\t".ToCharArray();
    parser.FirstRowHasHeader = true;  // Enables access by column name.
    parser.SkipDataRows = 10;         // Skip the first 10 data rows.
    parser.MaxBufferSize = 4096;
    parser.MaxRows = 500;             // Read at most 500 data rows.
    parser.TextQualifier = '\'';

    while (parser.Read())
    {
      strID = parser["ID"];
      strName = parser["Name"];
      strStatus = parser["Status"];

      // Your code here ...
    }
  }

  // Or... programmatically setting up the parser for fixed-width.
  using (GenericParser parser = new GenericParser())
  {
    parser.SetDataSource("MyData.txt");

    parser.ColumnWidths = new int[4] {10, 10, 10, 10};
    // Access by column name needs a header row, so this assumes
    // that the first row of MyData.txt holds the column names.
    parser.FirstRowHasHeader = true;
    parser.SkipDataRows = 10;
    parser.MaxRows = 500;

    while (parser.Read())
    {
      strID = parser["ID"];
      strName = parser["Name"];
      strStatus = parser["Status"];

      // Your code here ...
    }
  }

Acknowledgements

While the GenericParser is not a derivative of Sébastien Lorion's CsvReader, I did borrow some of the concepts behind the functionality his CsvReader provides.

Tools used

Possible future enhancements

  • Type casting for each column based on XML CONFIG or 'type' row within the data.
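
As a rough idea of what such type casting could look like, the sketch below converts one parsed row of strings into typed values. This is not part of the GenericParser: the names ColumnTypeSketch and ConvertRow are hypothetical, and supplying the column types as a Type array (rather than reading them from the XML config or a 'type' row) is an assumption made to keep the example short.

```csharp
using System;
using System.Globalization;

public static class ColumnTypeSketch
{
    // Converts each string value to the type declared for its column.
    public static object[] ConvertRow(string[] values, Type[] columnTypes)
    {
        if (values.Length != columnTypes.Length)
        {
            throw new ArgumentException("One type is required per column.");
        }

        object[] typed = new object[values.Length];

        for (int i = 0; i < values.Length; i++)
        {
            // The invariant culture keeps number and date parsing
            // independent of the machine's regional settings.
            typed[i] = Convert.ChangeType(
                values[i], columnTypes[i], CultureInfo.InvariantCulture);
        }

        return typed;
    }
}
```

A row such as {"42", "Ann", "2005-09-17"} with the types int, string, and DateTime would come back as a boxed int, a string, and a DateTime. A value that cannot be converted causes Convert.ChangeType to throw, so a real implementation would need to decide how to report bad data.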

History

  • September 17th, 2005 - 1.0 - First release.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

About the Author

Andrew Rissing


Andrew Rissing started coding at the age of 13 and was immediately hooked. While he's progressed beyond the days of his T-81 and QBasic, the desire to code has not been lost. Most of his driving force to code comes from a desire to continuously improve upon current algorithms to find more optimal solutions.
Occupation: Software Developer (Senior)
Location: United States

Last Updated: 19 Sep 2005
Editor: Chris Maunder
Copyright 2005 by Andrew Rissing