License: The Code Project Open License (CPOL)
A portable and efficient generic parser for flat files
By Andrew Rissing
The GenericParser is a C# implementation of a parser for delimited and fixed-width format files.
C#, Windows, .NET 1.1, VS.NET2003, DBA, Dev
We, as developers, are often faced with converting data from one format to another. Recently at work, I encountered a problem of this type. It required a portable solution that was efficient and could easily be plugged into our current web-based product with minimal external requirements.
After extensive research, I found two possible solutions: Microsoft's ODBC Text Driver and Sébastien Lorion's CsvReader (available on CodeProject). I compared the two and weighed the pros and cons of each.
Since neither of these two solutions proved to be viable, I proceeded to write my own parser.
Note: Sébastien Lorion's CsvReader has been updated since I began my project, but it still does not include the fixed-width parsing and XML configuration that I need.
The GenericParser (and the derived GenericParserAdapter) contains the following features:
Supports TextReader and string (the file location) as data sources.
GenericParserAdapter returns the data as XML, a DataTable, or a DataSet.
GenericParserAdapter can include in its output the line number that each row was read from.
To benchmark the GenericParser, I chose to compare it to the Microsoft Text Driver (because the Microsoft Text Driver can perform delimited and fixed-width parsing). I generated five sets of CSV and fixed-width data with 8, 80, 800, 8000, and 40000 rows of data. Then, I ran that data through four different benchmarks:
Reading each character of the file without performing any parsing.
Loading the data directly into a DataSet without performing any parsing.
Parsing the data with the Microsoft Text Driver.
Parsing the data with the GenericParserAdapter.
I executed each benchmark 50 times and averaged the results to reduce the measurement error. A few of the values still fall outside the expected range, which I attribute to fluctuation in the other processes running on my machine.
The first two benchmarks were baselines that provide a sanity check. The first simply loads the file and reads each character, so my parser could not run faster than it. The second loads the data straight into the DataSet, which shows the minimum amount of memory needed to produce the DataSet.
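The repeat-and-average approach and the read-every-character baseline can be sketched roughly as follows. This is a minimal illustration, not the code from the demo project: the names `BenchmarkSketch`, `AverageMilliseconds`, and `ReadEveryCharacter` are hypothetical, the input is an in-memory string rather than the generated test files, and `Stopwatch` and `Action` assume a newer .NET version than the 1.1 runtime this article targets.

```csharp
using System;
using System.Diagnostics;
using System.IO;

public static class BenchmarkSketch
{
    // Runs the action `iterations` times and returns the mean elapsed milliseconds.
    public static double AverageMilliseconds(Action action, int iterations)
    {
        if (iterations <= 0)
            throw new ArgumentOutOfRangeException("iterations");

        double total = 0;
        for (int i = 0; i < iterations; i++)
        {
            Stopwatch watch = Stopwatch.StartNew();
            action();
            watch.Stop();
            total += watch.Elapsed.TotalMilliseconds;
        }
        return total / iterations;
    }

    // Baseline: reads every character without parsing and returns the count.
    public static int ReadEveryCharacter(TextReader reader)
    {
        int count = 0;
        while (reader.Read() != -1)
            count++;
        return count;
    }

    public static void Main()
    {
        // Stand-in data (assumption); the article's tests used generated files.
        string data = "1,Alice,Active\n2,Bob,Inactive\n";

        double mean = AverageMilliseconds(delegate
        {
            using (StringReader reader = new StringReader(data))
                ReadEveryCharacter(reader);
        }, 50);

        Console.WriteLine("Mean time over 50 runs: {0} ms", mean);
    }
}
```

Averaging over many runs smooths out one-off spikes caused by other processes, but it does not remove systematic effects such as a slower first run while code is being JIT-compiled.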
Using these baselines, I was able to provide some certainty to the other two benchmarks. I found that the memory consumption for the GenericParser was roughly equivalent to the Microsoft Text Driver (it averaged about 2 Kb more for each test). So, the only factor to compare is the time it took to run the tests. Below are the charts for the time it took for both parsers to operate on the delimited and fixed-width data:


As can be seen in the charts, the GenericParser outperforms the Microsoft Text Driver. The GenericParser is roughly 3-100 times faster than the Microsoft Text Driver, depending on the size of the file.
Caution: During testing, I observed that the Microsoft Text Driver may overtake the GenericParser in efficiency once the data grows beyond 40,000 rows. This is to be expected, since managed code slows down as more objects are placed in memory for the runtime to manage.
In the demo project, you can find all of my performance tests and results, including an Excel workbook that has all of the collected raw data together for charting purposes.
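The code itself mimics most readers found within the .NET Framework, and its usage follows four basic steps:
Set the data source, either through the constructor or the SetDataSource() method.
Configure the parser, either through its properties or by loading an XML configuration file with the Load() method.
Call the Read() method and access the columns of data underneath, or, for the GenericParserAdapter, call GetXml(), GetDataTable(), or GetDataSet() to extract the data.
Call Close() or Dispose().

    DataSet dsResult;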
    // Using an XML config file.
    using (GenericParserAdapter parser =
        new GenericParserAdapter("MyData.txt"))
    {
        parser.Load("MyData.xml");
        dsResult = parser.GetDataSet();
    }

    // Or...programmatically setting up the
    // parser for TSV.
    string strID, strName, strStatus;
    using (GenericParser parser = new GenericParser())
    {
        parser.SetDataSource("MyData.txt");
        parser.ColumnDelimiter = "\t".ToCharArray();
        parser.FirstRowHasHeader = true;
        parser.SkipDataRows = 10;
        parser.MaxBufferSize = 4096;
        parser.MaxRows = 500;
        parser.TextQualifier = '\'';
        while (parser.Read())
        {
            strID = parser["ID"];
            strName = parser["Name"];
            strStatus = parser["Status"];
            // Your code here ...
        }
    }

    // Or...programmatically setting up the parser
    // for fixed-width.
    using (GenericParser parser = new GenericParser())
    {
        parser.SetDataSource("MyData.txt");
        parser.ColumnWidths = new int[4] {10, 10, 10, 10};
        // Reading columns by name assumes the first row holds the column names.
        parser.FirstRowHasHeader = true;
        parser.SkipDataRows = 10;
        parser.MaxRows = 500;
        while (parser.Read())
        {
            strID = parser["ID"];
            strName = parser["Name"];
            strStatus = parser["Status"];
            // Your code here ...
        }
    }
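To illustrate what a setting such as ColumnWidths means for a single row, here is a small standalone sketch of fixed-width splitting. It is not the GenericParser's implementation: the `FixedWidthSketch` class and its `Split` method are hypothetical, and trimming the padding from each column is an assumption made for readability.

```csharp
using System;

public static class FixedWidthSketch
{
    // Cuts `line` into consecutive columns of the given widths.
    // Lines shorter than the total width yield shorter or empty trailing columns.
    public static string[] Split(string line, int[] widths)
    {
        string[] columns = new string[widths.Length];
        int position = 0;
        for (int i = 0; i < widths.Length; i++)
        {
            // Number of characters actually available for this column.
            int available = Math.Max(0, Math.Min(widths[i], line.Length - position));
            columns[i] = available > 0
                ? line.Substring(position, available).Trim()
                : string.Empty;
            position += widths[i];
        }
        return columns;
    }

    public static void Main()
    {
        // Three columns, each padded to 10 characters.
        string line = "1".PadRight(10) + "Alice".PadRight(10) + "Active".PadRight(10);
        string[] columns = Split(line, new int[] { 10, 10, 10 });
        Console.WriteLine(string.Join("|", columns)); // prints "1|Alice|Active"
    }
}
```

A real parser also has to deal with buffering, row terminators, and header rows, which this sketch leaves out on purpose.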
While the GenericParser is not a derivative of Sébastien Lorion's CsvReader, I did borrow some of the concepts behind the functionality his CsvReader provides.
Last Updated: 19 Sep 2005
Editor: Chris Maunder
Copyright 2005 by Andrew Rissing