Click here to Skip to main content
13,148,214 members (59,717 online)
Click here to Skip to main content
Add your own
alternative version

Stats

7.9K views
183 downloads
18 bookmarked
Posted 16 Mar 2017

Why to build your own CSV parser (or maybe not)

, 19 Mar 2017
Rate this:
Please Sign up or sign in to vote.
Comparison of CSV parsers plus one custom implementation.

Introduction

I will discuss various properties of a CSV parser, compare several parsers from NuGet and also providing my custom implementation.

Background

This is based on a work I did in the past for a customer in finances. We were building a calculation engine and importing CSV files was one of the main tasks there. I evaluated a lot of CSV parsers but eventually we decided to roll out our own implementation because of our unique requirements. As it's usually the case these special requirements were not that important after all but of course we learned that only after the implementation was finished. In the end we replaced the core of our custom solution with TextFieldParser which had better handling of quoted field. Let me now share what I learned about parsing CSVs.

What is a CSV and how to parse it

CSV format is defined by RFC4180 as this:

file = [header CRLF] record *(CRLF record) [CRLF]
header = name *(COMMA name)
record = field *(COMMA field)
name = field
field = (escaped / non-escaped)
escaped = DQUOTE *(TEXTDATA / COMMA / CR / LF / 2DQUOTE) DQUOTE
non-escaped = *TEXTDATA
COMMA = %x2C
CR = %x0D
DQUOTE =  %x22
LF = %x0A
CRLF = CR LF
TEXTDATA = %x20-21 / %x23-2B / %x2D-7E

Let us discuss few interesting facts there.

Comma as sole field separator

This is usually not true as people will send you files which can have pretty much anything for separator. Granted that comma, pipe and semicolon are the most common.

We actually had a requirement originally that the separator can be more than one character. This has never been used.

CR+LF as line ending

Again this requirement gets ignored very often. You can be sure that nobody will care that the file they are sending to you has Windows, Unix or Mac line endings or a mixture of thereof. You just have to deal with it. Ideally we should be able to parse any reasonable combination of CR and LF much like StreamReader.ReadLine method does.

In our implementation we were very specific and allowed the user to specify the line endings explicitly. This was never popular for good reasons - instead of solving the problem we just dumped it on the user. In the end we defaulted it to Windows+Unix+Mac and never touched it again.

Quoted fields

Quoted field must begin and end with double quote. Quote inside a field is escaped by doubling it. Quoted field can contain commas (the field separator) and new line (the line separator). This is very important as you can get all kind of characters in text fields.

Note that it is invalid for the quote to be included in the text data: abc "efg" hij. In theory this should fail as undefined state but in the real life you want to import as much as possible so this case would just be ignored and quote will be parsed as normal character.

I need to say that implementing a custom quote character is not much of a problem. Nevertheless I have never seen anything else than double quote. Occasionally people try to escape quotes using backslash but that's just wrong.

Skipping lines

There are no comments or empty lines!

In most implementations you are allowed to use single line comments. If you start the line with a hashtag (#), or a custom character of your choice, this line is skipped.

Likewise empty lines are not part of the grammar above. But many implementations still allow you to skip empty lines automatically.

Comments are useful - they are occasionally used. On the other hand I didn't see files with empty lines in production environment.

Occasionally there is the option to skip first N lines. This may bee useful if somebody sends you garbage at the beginning of the file. On the other side if there is garbage it is much better to use comments to skip it.

Data conversions

Some of the implementations allow you to trim white spaces around fields automatically. While this may be useful I feel it is not responsibility of the parser but rather some layer above. If this functionality is provided should be optional.

Some of the parsers try to do more than just giving you the lines and fields - they try to convert string into types specified by the user. Some implementers feel that loading data directly into POCO entities is the only correct way of parsing CSV. I certainly think these features have they target audience but in my case I was either disabling this functionality if possible or excluding the parser from my evaluation altogether as this didn't fit my needs.

Reading and writing data

There are implementations that parse the whole file into an array or list or DataTable and return the whole content to you in one piece. This is just fine for small files but becomes a blocker when you need to import gigabytes of data. I was only looking at implementations that could read the file line by line.

Some libraries provide the functionality to write CSV data. I didn't focus on this part as it was not important to my use case.

Feature comparison

Some of the interesting features of each parser. Empty fields mean I didn't find the respective information.

ParserAccessDelimiterLine EndingsQuoteEOL inside quoteCommentTrim white spaces
Microsoft.VisualBasic.FileIO.TextFieldParserRead linemultiple strings "yesmultiple stringsoptional
Nuget: Cinchoo ETL 1.0.2.4
(article on CP)
Enumeratorsingle stringsingle stringsingle charyesmultiple stringsyes
Nuget: Csv 1.0.11Enumerablesingle charStreamReader.ReadLine"nonooptional
Nuget: CsvHelper 3.0.0-beta7Read linesingle stringwindows, unixsingle charyessingle charoptional
Nuget: CsvToolkit 0.13.0Enumerablesingle charwindows, unixsingle charyesnooptional
Nuget: DevLib.Csv 2.16.23.19010Enumerablesingle charStreamReader.ReadLinesingle charnonono
Nuget: LibCsv4Net 1.8.9.1102Enumerablesingle charsingle stringsingle charyesnono

Nuget: LumenWorksCsvReader 3.9.1 
(article on CP)

Enumerablesingle charwindows, unixsingle charyessingle charoptional
Nuget: Net.Code.Csv 1.0.3Data readersingle char single charyessingle charoptional
Nuget: Nortal.Utilities.Csv 0.9.2Read linesingle charsingle stringsingle charyesnono
Nuget: Uncomplicated.Csv 1.5.2Read linesingle charwindowssingle charyesnono
My implementationRead linesingle charCR,LF,CR+LF,LF+CR normalizes into CR+LF"yes#no
CodeProject: C# - Light and Fast CSV ParserEnumerablesingle charwindows, unixsingle charyesnoyes
CodeProject: C# CSV File and String Reader ClassesRead linesingle char single charyesnono

Explanation of data access:

  • Read line: There is a method which can read a single line in form of a string array (or equivalent) and returns null when end of file is reached.
  • Enumerable: The parser returns an enumerable of lines (string arrays or equivalent).
  • Data reader: Parser implements IDataReader interface.

Using the code

To replicate the results shown below please use the SimpleCsvReader.Demo project which is a simple console application. There are three modes.

First you need to generate a random test file. You can specify number of lines i.e. the size in the second parameter.

SimpleCsvReader.Demo.exe /gen 10000

When you have the file generated you can parse it using the various parsers and record the time. The second parameter specifies how many times you want to repeat the measurements, averages of the results will be calculated.

SimpleCsvReader.Demo.exe /run 3

Finally there is a third option to verify if the parsed data matches the generated data. This is done using a SHA256 hash of the data written to and read from the file.

SimpleCsvReader.Demo.exe /verify

Performance comparison

The table shows the average time in seconds to parse a file with given number of rows. The sample files didn't contain any comments but they did contain quoted fields with line breaks. Please see also comments below the table.

Parser10k100k1M10M100M
Microsoft.VisualBasic.FileIO.TextFieldParser0.212.0420.17201.212061.95
Nuget: Cinchoo ETL 1.0.2.4 (3)0.716.8067.72678.226890.05
Nuget: Csv 1.0.11 (1)(3)0.666.5164.45662.486642.64
Nuget: CsvHelper 3.0.0-beta70.060.565.6055.69584.74
Nuget: CsvToolkit 0.13.0 (3)0.121.1811.68118.921221.08
Nuget: DevLib.Csv 2.16.23.19010 (1)0.040.353.5134.95374.55
Nuget: LibCsv4Net 1.8.9.11020.252.4624.50244.652507.25
Nuget: LumenWorksCsvReader 3.9.10.020.212.1120.96229.66
Nuget: Net.Code.Csv 1.0.3 (3)0.050.525.1651.30541.96
Nuget: Nortal.Utilities.Csv 0.9.20.060.595.8959.02613.04
Nuget: Uncomplicated.Csv 1.5.20.040.343.4333.95365.23
SimpleCsvParser (2)0.080.777.6377.53790.33
SimpleCsvParserMerged (2)0.070.656.4864.65675.81
CodeProject: C# - Light and Fast CSV Parser0.060.565.5356.26585.87
CodeProject: C# CSV File and String Reader Classes (3)0.050.515.1252.02545.95
  1. Doesn't respect line break inside quotes.
  2. My implementation.
  3. Additional overhead because output needs to be adapted to collection of strings.

The following graph shows the relative speed of the parser compared to the VB TextFieldParser which is used as reference. Parsers slower than the reference are not shown.

LumenWorks is the best. I was able to get under 40% with my own implementation. CsvHelper beta7 performance is about 30% better than the last stable version.

Custom implementation

First I need to say that what I'm showing here is not the implementation I wrote for my former employer as that is proprietary work. What I show here is my own implementation done for research purposes.

The code described in the section is located in SimpleCsvReader.Lib is the attached project.

Design

CSV parser can be implemented as a finite-state machine. Parsing one line can be represented by following diagram. Each transition is labeled with the corresponding input that triggers the transition slash the performed action. This is so called Mealy machine.

Start Line is the start state, End Line is the final state. To consume multiple line endings these are pre-processed so the state machine doesn't have to deal with it directly and is therefore represented by single token EOL. EOF is end of file. ELSE means any other character. When there is no more data in the input null is returned. If the line is empty and empty string array is returned.

Append char means adding the current char to the current field. End field takes the current field, adds it to the current line and clears the current field so parsing can continue. Keep char means that the input is not advanced to the next char so the same char is processed in the next state.

Please note that there is no transition from Quoted Field on EOF - this will throw an exception as the quoted field is not closed. Also there can be no text following Double Quote - this will again raise an exception. On the other hand we allow quote inside Regular Field.

TextReaderWrapper class

The underlying file is read char by char using TextReader.Read method. This returns an integer and returns -1 in case end of file is reached. To allow simple manipulation with line endings we wrap this and return -2 in case of line end.

This class has another important responsibility and that is tracking the current position in the input file. Not only the absolute position of the current character but also logical line and column number. This is useful when reporting errors to the user.

LineBuilder class

Here the line is built field by field and char by char. Current field is represented by a StringBuilder, current line by list of strings. Chars are being added to the current field, line ends are automatically translated to Environment.NewLine - the original value is lost. When field ends the string is appended to the current line and current field's StringBuilder is cleared. When the line ends the current line is returned to the caller.

Context and state classes

I chose to implement the parser using State design pattern. This has the benefit of having the transition logic nicely encapsulated. It is direct translation of the diagram above. Context class serves as the main hub facilitating the transitions and executing related actions while the actual logic deciding next state and what to do is in the respective state class.

CsvParser class

This is where is all comes together. CsvParser class is the main interface to the parser - the ReadLine method resides here. In case we would need more parameters than the delimiter this is the place where they should be processed and validated. We also catch the parsing exceptions here and inject line and column numbers to help with problem solving.

CsvParserMerged class

My parser was working fine, unit tests were passing but I was a bit disappointed with the performance. I though my parser would be lightning fast! And although I've surpassed the reference implementation by more than a half there were others who were better. And that of course I couldn't let go.

If you look at the code I didn't change that much. Mainly I merged everything into one class making it private. I also implemented the state machine using simple switch statement and states using enums. With all of this I was able to squeeze additional 5% improvement out of it.

By rewriting the state machine I got rid of the virtual methods. This in turn allowed the JIT compiler to inline far more stuff than before. I guess this is the main reason for the performance gain.

Points of Interest

If you think that CSV files are obsolete then you are probably right. There are better ways to exchange information. Is it dead? Not by far.

Can we further improve the performance?

It depends. Our implementation - the one from the beginning of this article - was not only parsing the file but also converting number and dates then wrapping it in a IDataReader implementation and finally inserting everything into database via SqlBulkCopy. If you imagine the whole pipeline it looks like this: FileStream > StreamReader > Data Conversion > Data Reader > SqlBulkCopy.

I spent some time with profiler looking at the performance and as you have probably guessed most time is spent in reading the file itself - no surprise there. The second bottleneck was data conversion. There is definitely room for improvement there.

Also there is a lot of buffering and copying between buffers going on. FileStream reads bytes into a buffer. StreamReader uses encoding to get chars from bytes and puts those into its own buffer. You then create strings out of these chars. Then conversions take in strings but internal implementations work with char arrays. SqlBulkCopy always calls GetValue method which returns object and your numbers and dates get boxed.

With some hacking you can save some memory allocations but I doubt it's worth the effort.

Is it worth to write my own CSV parser?

Not really. Just pick one from NuGet and be off with it. You can always replace it with something better if you need to.

Why we didn't use some existing library? Well one reason were the ridiculous requirements. Another reason was the conversion layer that turns strings into numbers and dates. You would be surprised how difficult is to get this right considering you need to parse data in different formats and cultures. This is why we kept the conversion layer and only replaced the parser.

Why we didn't use some existing solution like SSIS? Now that is a very good question and some day I just might write another article about that.

History

  • 16 March 2017 - Initial release.
  • 17 March 2017 - CsvParser 0.5.2 removed as it parses data upfront.
  • 20 March 2017 - new parsers added to the comparison:
    • Two parsers from CodeProject articles
    • Cinchoo ETL 1.0.2.4 
    • CsvHelper updated to 3.0.0-beta7

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

Share

About the Author

Tomas Takac
Software Developer (Senior)
Czech Republic Czech Republic
I started programming in Basic in 1989, learned Pascal in 1993, switched to C/C++ in 1997, discovered Java in 2001 and finally settled with C#/.NET in 2003.

You may also be interested in...

Comments and Discussions

 
QuestionNice Pin
Sacha Barber30-Mar-17 3:55
mvpSacha Barber30-Mar-17 3:55 
AnswerRe: Nice Pin
Tomas Takac30-Mar-17 21:48
memberTomas Takac30-Mar-17 21:48 
Questioncsv line parser Pin
obermd21-Mar-17 15:22
memberobermd21-Mar-17 15:22 
AnswerRe: csv line parser Pin
Tomas Takac21-Mar-17 22:07
memberTomas Takac21-Mar-17 22:07 
QuestionMy vote of #5 ... and, there are a number of recent CSV parsers on CP Pin
BillWoodruff16-Mar-17 23:54
mvpBillWoodruff16-Mar-17 23:54 
AnswerRe: My vote of #5 ... and, there are a number of recent CSV parsers on CP Pin
Tomas Takac17-Mar-17 5:10
memberTomas Takac17-Mar-17 5:10 
BugF# CsvParser results are not correct Pin
bvsms16-Mar-17 13:48
memberbvsms16-Mar-17 13:48 
GeneralRe: F# CsvParser results are not correct Pin
Tomas Takac16-Mar-17 21:45
memberTomas Takac16-Mar-17 21:45 
GeneralCRLF in QUOTEs Pin
PIEBALDconsult16-Mar-17 6:40
protectorPIEBALDconsult16-Mar-17 6:40 
GeneralRe: CRLF in QUOTEs Pin
Tomas Takac16-Mar-17 11:20
memberTomas Takac16-Mar-17 11:20 

General General    News News    Suggestion Suggestion    Question Question    Bug Bug    Answer Answer    Joke Joke    Praise Praise    Rant Rant    Admin Admin   

Use Ctrl+Left/Right to switch messages, Ctrl+Up/Down to switch threads, Ctrl+Shift+Left/Right to switch pages.

Permalink | Advertise | Privacy | Terms of Use | Mobile
Web04 | 2.8.170924.2 | Last Updated 20 Mar 2017
Article Copyright 2017 by Tomas Takac
Everything else Copyright © CodeProject, 1999-2017
Layout: fixed | fluid