Click here to Skip to main content
14,489,815 members

fastCSV - Tiny, Fast, Standard Compliant CSV Reader Writer

Rate this:
5.00 (23 votes)
Please Sign up or sign in to vote.
5.00 (23 votes)
26 Jan 2020CPOL
A CSV parser. CSV allows you to load tabular (2-dimensional) data into memory very quickly as opposed to other serializers like fastJSON

Introduction

With the increased prominence of machine learning and ingesting large datasets with the CSV format for this purpose, I decided to write a CSV parser which met my requirements of being small, fast and easy to use. Most libraries I looked at didn't really meet my requirements and fastCSV was born.

Also CSV allows you to load tabular (2-dimensional) data into memory very quickly as opposed to other serializers like fastJSON.

Features

  • Fully CSV standard compliant

    • Multi-line
    • Quoted columns
    • Keeps spaces between delimiters
  • Really fast reading and writing of CSV files (see performance)

  • Tiny 8kb DLL compiled to net40 or netstandard20

  • Ability to get a typed list of objects from a CSV file

  • Ability to filter a CSV file while loading

  • Ability to specify a custom delimiter

CSV Standard

You can read the CSV RFC here : https://tools.ietf.org/html/rfc4180 as a summary a CSV file can :

  • be multi lined if the values in a column contains new lines
  • a column must be quoted if it contains a new line, delmiter or quote character
    • quotes must be quoted
  • spaces between the delimiter are considered part of the column

Below is an example of a complex standard compliant CSV file from the wiki page https://en.wikipedia.org/wiki/Comma-separated_values :

Year,Make,Model,Description,Price
1997,Ford,"E350
F150","ac, abs,
moon",3000.00
1999,Chevy,"Venture ""Extended Edition""","",4900.00
1999,Chevy,"Venture ""Extended Edition, Very Large""",,5000.00
1996,Jeep,Grand Cherokee,"MUST SELL!
air,
"",moon,""
roof, loaded",4799.00
1999,BMW,Z3,"used",14900.00
1999, Toyota,Corolla,,7000.00   

as you can see some rows are multi lined and contain quotes and commas, which will give the table below:

Year Make Model Description Price
1997 Ford

E350

F150

ac, abc,

moon

3000.00
1999 Chevy Venture "Extended Edition"   4900.00
1999 Chevy Venture "Extended Edition, Very Large"   5000.00
1996 Jeep Grand Cherokee

MUST SELL!

air,

",moon,"

roof, loaded

4799.00
1999 BMW Z3 used 14900.00
1999  Toyota Corolla   7000.00

as you can see some columns are multi line, and " Toyota" column of the last row starts with a space.

Performance Benchmarks

Loading the https://www.ncdc.noaa.gov/orders/qclcd/QCLCD201503.zip (585Mb) file which has 4,496,263 rows on my machine as a relative comparison to other libraries:

  • fastCSV : 11.20s 639Mb used
  • NReco.CSV : 6.76s 800Mb used
  • fastCSV string.Split() : 11.50s 638Mb used
  • TinyCSVparser : 34s 992Mb used

As a comparison of a baseline of what is possible on the same dataset:

  1. File.ReadAllBytes() : 1.5s 573Mb used
  2. File.ReadAllLines() with no processing : 3.7s 1633Mb used
  3. File.ReadLines() with no processing : 1.9s
  4. File.ReadLines() + string.Split() no return list : 7.5s

The difference from 1 to 2 is the overhead of converting the bytes to Unicode strings : 2.2s

The difference between 2 and 3 is the memory overhead of creating string[] : 1.8s

The difference from 4 to fastCSV is the overhead of creating T objects and adding to a list : 4s

Roads not taken

  • Loading chunks in a buffer :

    • while initially I tried this route, it proved too complex and I couldn't get it to work properly. Judging by other libraries which did this, it is slow in comparison to the current implementation anyway.
  • StringBuilder character by character :

    • using this option proved too slow for parsing columns out of a line.

Using the code

Below are some examples of how to use fastCSV:

public class car
{
    // you can use fields or properties
    public string Year;
    public string Make;
    public string Model;
    public string Description;
    public string Price;
}

// listcars = List<car>
var listcars = fastCSV.ReadFile<cars>(
    "csvstandard.csv", // filename
    true,              // has header
    ',',               // delimiter
    (o, c) =>          // to object function o : car object, c : columns array read
    {
        o.Year = c[0];
        o.Make = c[1];
        o.Model = c[2];
        o.Description = c[3];
        o.Price = c[4];
        // add to list
        return true;
    });

fastCSV.WriteFile<LocalWeatherData>(
    "filename2.csv",   // filename
    new string[] { "WBAN", "Date", "SkyCondition" }, // headers defined or null
    '|',               // delimiter
    list,              // list of LocalWeatherData to save
    (o, c) =>          // from object function 
	{
    	c.Add(o.WBAN);
    	c.Add(o.Date.ToString("yyyyMMdd"));
    	c.Add(o.SkyCondition);
	});

Helper functions for performance

fastCSV has the following helper functions:

  • int ToInt(string s) creates an int from a string
  • int ToInt(string s, int index, int count) creates an int from a substring
  • DateTime ToDateTimeISO(string value, bool UseUTCDateTime) creates an ISO standard DateTime i.e. yyyy-MM-ddTHH:mm:ss ( optional part.nnnZ)
public class LocalWeatherData
{
    public string WBAN;
    public DateTime Date;
    public string SkyCondition;
}

var list = fastCSV.ReadFile<LocalWeatherData>("201503hourly.txt", true, ',', (o, c) =>
    {
        bool add = true;
        o.WBAN = c[0];
        // c[1] data is in "20150301" format
        o.Date = new DateTime(fastCSV.ToInt(c[1], 0, 4), 
                              fastCSV.ToInt(c[1], 4, 2), 
                              fastCSV.ToInt(c[1], 6, 2));
        o.SkyCondition = c[4];
        //if (o.Date.Day % 2 == 0)
        //    add = false;
        return add;
    });

Usage Scenarios

  • Filtering CSV while loading

    • In your map function while loading you can write conditions on your loaded line data and filter out lines you don't want by using return false;
  • Reading CSV to import to other systems

    • in your map function you can send the line data to another system and return false;
    • or process the entire file and use the List<T> returned
  • Processing/aggregating data while loading

    • You can have a List<T> which has no bearing on the columns of the CSV file and sum/min/max/avg/etc. the lines read

Inside the code

Essentially the reading is a loop through parsing a line, creating a generic element for a list, handing of the object created and the columns extracted from the line to the user defined map function and adding it to the list for return (if the map function says so) :

var c = ParseLine(line, delimiter, cols);
T o = new T();
var b = mapper(o, c);
if (b)
   list.Add(o);

Now the CSV standard complexity comes from handling multi lines correctly which is done by counting if there are odd number of quotes in a line, hence it's multi line and reading the lines until the quotes are even, which is done in the ReadFile() function.

The beauty of this approach is that it is simple, does no reflection and is really fast, with the control being in the users hands.

All the reading code is below :

public static List<T> ReadFile<T>(string filename, bool hasheader, char delimiter, ToOBJ<T> mapper) where T : new()
{
    string[] cols = null;
    List<T> list = new List<T>();
    int linenum = -1;
    StringBuilder sb = new StringBuilder();
    bool insb = false;
    foreach (var line in File.ReadLines(filename))
    {
        try
        {
            linenum++;
            if (linenum == 0)
            {
                if (hasheader)
                {
                    // actual col count
                    int cc = CountOccurence(line, delimiter);
                    if (cc == 0)
                        throw new Exception("File does not have '" + delimiter + "' as a delimiter");
                    cols = new string[cc + 1];
                    continue;
                }
                else
                    cols = new string[_COLCOUNT];
            }
            var qc = CountOccurence(line, '\"');
            bool multiline = qc % 2 == 1 || insb;

            string cline = line;
            // if multiline add line to sb and continue
            if (multiline)
            {
                insb = true;
                sb.Append(line);
                var s = sb.ToString();
                qc = CountOccurence(s, '\"');
                if (qc % 2 == 1)
                {
                    sb.AppendLine();
                    continue;
                }
                cline = s;
                sb.Clear();
                insb = false;
            }

            var c = ParseLine(cline, delimiter, cols);

            T o = new T();
            var b = mapper(o, c);
            if (b)
                list.Add(o);
        }
        catch (Exception ex)
        {
            throw new Exception("error on line " + linenum, ex);
        }
    }

    return list;
}

private unsafe static int CountOccurence(string text, char c)
{
    int count = 0;
    int len = text.Length;
    int index = -1;
    fixed (char* s = text)
    {
        while (index++ < len)
        {
            char ch = *(s + index);
            if (ch == c)
                count++;
        }
    }
    return count;
}

private unsafe static string[] ParseLine(string line, char delimiter, string[] columns)
{
    //return line.Split(delimiter);
    int col = 0;
    int linelen = line.Length;
    int index = 0;

    fixed (char* l = line)
    {
        while (index < linelen)
        {
            if (*(l + index) != '\"')
            {
                // non quoted
                var next = line.IndexOf(delimiter, index);
                if (next < 0)
                {
                    columns[col++] = new string(l, index, linelen - index);
                    break;
                }
                columns[col++] = new string(l, index, next - index);
                index = next + 1;
            }
            else
            {
                // quoted string change "" -> "
                int qc = 1;
                int start = index;
                char c = *(l + ++index);
                // find matching quote until delim or EOL
                while (index++ < linelen)
                {
                    if (c == '\"')
                        qc++;
                    if (c == delimiter && qc % 2 == 0)
                        break;
                    c = *(l + index);
                }
                columns[col++] = new string(l, start + 1, index - start - 3).Replace("\"\"", "\"");
            }
        }
    }

    return columns;
}

ParseLine() is responsible for extracting the columns from a line in an optimized unsafe way.

And the writing code is just :

public static void WriteFile<T>(string filename, string[] headers, char delimiter, List<T> list, FromObj<T> mapper)
{
    using (FileStream f = new FileStream(filename, FileMode.Create, FileAccess.Write))
    {
        using (StreamWriter s = new StreamWriter(f))
        {
            if (headers != null)
                s.WriteLine(string.Join(delimiter.ToString(), headers));

            foreach (var o in list)
            {
                List<object> cols = new List<object>();
                mapper(o, cols);
                for (int i = 0; i < cols.Count; i++)
                {
                    // qoute string if needed -> \" \r \n delim 
                    var str = cols[i].ToString();
                    bool quote = false;

                    if (str.IndexOf('\"') >= 0)
                    {
                        quote = true;
                        str = str.Replace("\"", "\"\"");
                    }

                    if (quote == false && str.IndexOf('\n') >= 0)
                        quote = true;

                    if (quote == false && str.IndexOf('\r') >= 0)
                        quote = true;

                    if (quote == false && str.IndexOf(delimiter) >= 0)
                        quote = true;

                    if (quote)
                        s.Write("\"");
                    s.Write(str);
                    if (quote)
                        s.Write("\"");

                    if (i < cols.Count - 1)
                        s.Write(delimiter);
                }
                s.WriteLine();
            }
            s.Flush();
        }
        f.Close();
    }
}

Sample Use cases

Splitting data sets for testing and training

In data science you generally split your data into training and testing sets, and in the example below every 3rd row is for testing ( you could make the splitting more elaborate) :

var testing = new List<LocalWeatherData>();
int line = 0;
var training = fastCSV.ReadFile<LocalWeatherData>("201503hourly.txt", true, ',', (o, c) =>
    {
        bool add = true;
        line++;
        o.Date = new DateTime(fastCSV.ToInt(c[1], 0, 4),
                              fastCSV.ToInt(c[1], 4, 2),
                              fastCSV.ToInt(c[1], 6, 2));
        o.SkyCondition = c[4];
        if (line % 3 == 0)
        {
            add = false;
            testing.Add(o);
        }
        return add;
    });

Appendix v2.0

With apologies to NReco.CSV, I messed up the timings for that library which was pointed out to me by KPixel on GitHub, this prompted me to go back to the drawing board and redo the internals of fastCSV for more speed.

The new code is even faster than a naïve roll-your-own File.ReadLines() and string.Split() which can't handle multi-lines.

Performance

The new performance numbers are below which in comparison to the v1 code is nearly 2x faster at the expense of a little more memory usage on the same 4,496,263 rows dataset.

  • fastCSV .net4 : 6.27s  753Mb used
  • fastCSV core : 6.51s  669Mb used

Interestingly on .net core, the library uses less memory.

Changes Made

  • First up was creating a buffered StreamReader instead of relying on File.ReadLines() this is handled by the BufReader class. This class also handles multi-lines without resorting to a StringBuilder and the possible case that a line is larger than the buffer being used to read it.
  • While testing I discovered that using IL to create the generic list objects is faster than using new T()
  • Poor mans (.net 4) Span with the MGSpan class which just passes around the buffer, start and length of the data and delays the creation of strings until they are actually used in the object filling delegate.
  • ReadFile<T>() looks cleaner now and does not use StringBuilder for multi-lines.

Some Weird Stuff

The first pass of FillBuffer() used _tr.BaseStream.Seek() to go back in the CSV data file when the end of the buffer being read was reached and a line was not completed, this was fine at first but failed non ASCII data and UTF8 encoded files, the reason being that some characters are 2 bytes not 1 and a char in .net while looking like a byte in fact can be 2 bytes. This messes the offset computation when seeking which results in reading incorrect lines.

To remedy this I used Array.Copy() to copy the characters to the start of the buffer and read the rest from the file until the buffer is full, interestingly again you can't use Buffer.BlockCopy() for the same reason as above.

A simple change of struct MGSpan instead of class MGSpan results in a very distinct speed up, probably because the objects are passed around on the stack and don't stay around long which faster than using the heap.

Previous Versions

History

  • Initial release : 1st January 2020
  • Update v2.0 : 27th January 2020
    • article corrections
    • speed ups

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

Share

About the Author

Mehdi Gholam
Architect -
United Kingdom United Kingdom
Mehdi first started programming when he was 8 on BBC+128k machine in 6512 processor language, after various hardware and software changes he eventually came across .net and c# which he has been using since v1.0.
He is formally educated as a system analyst Industrial engineer, but his programming passion continues.

* Mehdi is the 5th person to get 6 out of 7 Platinum's on Code-Project (13th Jan'12)
* Mehdi is the 3rd person to get 7 out of 7 Platinum's on Code-Project (26th Aug'16)

Comments and Discussions

 
QuestionNice work Pin
Mike Hankey8-Feb-20 5:13
professionalMike Hankey8-Feb-20 5:13 
AnswerRe: Nice work Pin
Mehdi Gholam8-Feb-20 5:16
MemberMehdi Gholam8-Feb-20 5:16 
GeneralMy vote of 5 Pin
Member 102717962-Feb-20 13:32
MemberMember 102717962-Feb-20 13:32 
QuestionTiny 8kb DLL and VB.NET Pin
Member 102717962-Feb-20 13:24
MemberMember 102717962-Feb-20 13:24 
AnswerRe: Tiny 8kb DLL and VB.NET Pin
Mehdi Gholam2-Feb-20 18:13
MemberMehdi Gholam2-Feb-20 18:13 
QuestionMultiple Delimiters Pin
Tom Corbett Space Cadet30-Jan-20 5:07
professionalTom Corbett Space Cadet30-Jan-20 5:07 
AnswerRe: Multiple Delimiters Pin
Mehdi Gholam30-Jan-20 5:14
MemberMehdi Gholam30-Jan-20 5:14 
GeneralMy vote of 5 Pin
BillWoodruff28-Jan-20 16:51
mveBillWoodruff28-Jan-20 16:51 
GeneralRe: My vote of 5 Pin
Mehdi Gholam28-Jan-20 19:24
MemberMehdi Gholam28-Jan-20 19:24 
Questionwhat is the point of this project? Pin
vickoza28-Jan-20 9:42
Membervickoza28-Jan-20 9:42 
AnswerRe: what is the point of this project? Pin
Mehdi Gholam28-Jan-20 19:24
MemberMehdi Gholam28-Jan-20 19:24 
QuestionBenchmark code Pin
João Matos Silva28-Jan-20 4:46
professionalJoão Matos Silva28-Jan-20 4:46 
AnswerRe: Benchmark code Pin
Mehdi Gholam28-Jan-20 4:50
MemberMehdi Gholam28-Jan-20 4:50 
QuestionInternationalization Pin
carloscs28-Jan-20 2:35
Membercarloscs28-Jan-20 2:35 
AnswerRe: Internationalization Pin
Mehdi Gholam28-Jan-20 4:24
MemberMehdi Gholam28-Jan-20 4:24 
QuestionEmbedded Newlines Pin
David A. Gray16-Jan-20 7:20
MemberDavid A. Gray16-Jan-20 7:20 
AnswerRe: Embedded Newlines Pin
Mehdi Gholam16-Jan-20 7:30
MemberMehdi Gholam16-Jan-20 7:30 
GeneralRe: Embedded Newlines Pin
David A. Gray16-Jan-20 7:39
MemberDavid A. Gray16-Jan-20 7:39 
GeneralRe: Embedded Newlines Pin
Mehdi Gholam16-Jan-20 7:45
MemberMehdi Gholam16-Jan-20 7:45 
QuestionExcellent! Question about "unsafe" code Pin
Gary Schumacher4-Jan-20 19:24
MemberGary Schumacher4-Jan-20 19:24 
AnswerRe: Excellent! Question about "unsafe" code Pin
Mehdi Gholam4-Jan-20 20:02
MemberMehdi Gholam4-Jan-20 20:02 
QuestionToInt and ToDateTime Pin
#realJSOP2-Jan-20 2:02
mva#realJSOP2-Jan-20 2:02 
AnswerRe: ToInt and ToDateTime Pin
Mehdi Gholam2-Jan-20 2:37
MemberMehdi Gholam2-Jan-20 2:37 
QuestionMy vote of 5 + question Pin
softexpert1-Jan-20 23:27
Membersoftexpert1-Jan-20 23:27 
AnswerRe: My vote of 5 + question Pin
Mehdi Gholam2-Jan-20 2:42
MemberMehdi Gholam2-Jan-20 2:42 

General General    News News    Suggestion Suggestion    Question Question    Bug Bug    Answer Answer    Joke Joke    Praise Praise    Rant Rant    Admin Admin   

Use Ctrl+Left/Right to switch messages, Ctrl+Up/Down to switch threads, Ctrl+Shift+Left/Right to switch pages.

Article
Posted 1 Jan 2020

Tagged as

Stats

17.2K views
981 downloads
50 bookmarked