Click here to Skip to main content
14,422,431 members

fastCSV

Rate this:
5.00 (10 votes)
Please Sign up or sign in to vote.
5.00 (10 votes)
15 Jan 2020CPOL
Tiny, fast, standard compliant CSV reader writer

Introduction

With the increased prominence of machine learning and ingesting large datasets with the CSV format for this purpose, I decided to write a CSV parser which met my requirements of being small, fast and easy to use. Most libraries I looked at didn't really meet my requirements and fastCSV was born.

Also CSV allows you to load tabular (2-dimensional) data into memory very quickly as opposed to other serializers like fastJSON.

Features

  • Fully CSV standard compliant

    • Multi-line
    • Quoted columns
    • Keeps spaces between delimiters
  • Really fast reading and writing of CSV files (see performance)

  • Tiny 8kb DLL compiled to net40 or netstandard20

  • Ability to get a typed list of objects from a CSV file

  • Ability to filter a CSV file while loading

  • Ability to specify a custom delimiter

CSV Standard

You can read the CSV RFC here : https://tools.ietf.org/html/rfc4180 as a summary a CSV file can :

  • be multi lined if the values in a column contains new lines
  • a column must be quoted if it contains a new line, delmiter or quote character
    • quotes must be quoted
  • spaces between the delimiter are considered part of the column

Below is an example of a complex standard compliant CSV file from the wiki page https://en.wikipedia.org/wiki/Comma-separated_values :

Year,Make,Model,Description,Price
1997,Ford,"E350
F150","ac, abs,
moon",3000.00
1999,Chevy,"Venture ""Extended Edition""","",4900.00
1999,Chevy,"Venture ""Extended Edition, Very Large""",,5000.00
1996,Jeep,Grand Cherokee,"MUST SELL!
air,
"",moon,""
roof, loaded",4799.00
1999,BMW,Z3,"used",14900.00
1999, Toyota,Corolla,,7000.00   

as you can see some rows are multi lined and contain quotes and commas, which will give the table below:

Year Make Model Description Price
1997 Ford

E350

F150

ac, abc,

moon

3000.00
1999 Chevy Venture "Extended Edition"   4900.00
1999 Chevy Venture "Extended Edition, Very Large"   5000.00
1996 Jeep Grand Cherokee

MUST SELL!

air,

",moon,"

roof, loaded

4799.00
1999 BMW Z3 used 14900.00
1999  Toyota Corolla   7000.00

as you can see some columns are multi line, and " Toyota" column of the last row starts with a space.

Performance Benchmarks

Loading the https://www.ncdc.noaa.gov/orders/qclcd/QCLCD201503.zip (585Mb) file which has 4,496,263 rows on my machine as a relative comparison to other libraries:

  • fastCSV : 11.20s 639Mb used
  • NReco.CSV : 19.05s 800Mb used
  • fastCSV string.Split() : 11.50s 638Mb used
  • TinyCSVparser : 34s 992Mb used

As a comparison of a baseline of what is possible on the same dataset:

  1. File.ReadAllBytes() : 1.5s 573Mb used
  2. File.ReadAllLines() with no processing : 3.7s 1633Mb used
  3. File.ReadLines() with no processing : 1.9s
  4. File.ReadLines() + string.Split() no return list : 7.5s

The difference from 1 to 2 is the overhead of converting the bytes to Unicode strings : 2.2s

The difference between 2 and 3 is the memory overhead of creating string[] : 1.8s

The difference from 4 to fastCSV is the overhead of creating T objects and adding to a list : 4s

Roads not taken

  • Loading chunks in a buffer :

    • while initially I tried this route, it proved too complex and I couldn't get it to work properly. Judging by other libraries which did this, it is slow in comparison to the current implementation anyway.
  • StringBuilder character by character :

    • using this option proved too slow for parsing columns out of a line.

Using the code

Below are some examples of how to use fastCSV:

public class car
{
    // you can use fields or properties
    public string Year;
    public string Make;
    public string Model;
    public string Description;
    public string Price;
}

// listcars = List<car>
var listcars = fastCSV.ReadFile<cars>(
    "csvstandard.csv", // filename
    true,              // has header
    ',',               // delimiter
    (o, c) =>          // to object function o : car object, c : columns array read
    {
        o.Year = c[0];
        o.Make = c[1];
        o.Model = c[2];
        o.Description = c[3];
        o.Price = c[4];
        // add to list
        return true;
    });

fastCSV.WriteFile<LocalWeatherData>(
    "filename2.csv",   // filename
    new string[] { "WBAN", "Date", "SkyCondition" }, // headers defined or null
    '|',               // delimiter
    list,              // list of LocalWeatherData to save
    (o, c) =>          // from object function 
	{
    	c.Add(o.WBAN);
    	c.Add(o.Date.ToString("yyyyMMdd"));
    	c.Add(o.SkyCondition);
	});

Helper functions for performance

fastCSV has the following helper functions:

  • int ToInt(string s) creates an int from a string
  • int ToInt(string s, int index, int count) creates an int from a substring
  • DateTime ToDateTimeISO(string value, bool UseUTCDateTime) creates an ISO standard DateTime i.e. yyyy-MM-ddTHH:mm:ss ( optional part.nnnZ)
public class LocalWeatherData
{
    public string WBAN;
    public DateTime Date;
    public string SkyCondition;
}

var list = fastCSV.ReadFile<LocalWeatherData>("201503hourly.txt", true, ',', (o, c) =>
    {
        bool add = true;
        o.WBAN = c[0];
        // c[1] data is in "20150301" format
        o.Date = new DateTime(fastCSV.ToInt(c[1], 0, 4), 
                              fastCSV.ToInt(c[1], 4, 2), 
                              fastCSV.ToInt(c[1], 6, 2));
        o.SkyCondition = c[4];
        //if (o.Date.Day % 2 == 0)
        //    add = false;
        return add;
    });

Usage Scenarios

  • Filtering CSV while loading

    • In your map function while loading you can write conditions on your loaded line data and filter out lines you don't want by using return false;
  • Reading CSV to import to other systems

    • in your map function you can send the line data to another system and return false;
    • or process the entire file and use the List<T> returned
  • Processing/aggregating data while loading

    • You can have a List<T> which has no bearing on the columns of the CSV file and sum/min/max/avg/etc. the lines read

Inside the code

Essentially the reading is a loop through parsing a line, creating a generic element for a list, handing of the object created and the columns extracted from the line to the user defined map function and adding it to the list for return (if the map function says so) :

var c = ParseLine(line, delimiter, cols);
T o = new T();
var b = mapper(o, c);
if (b)
   list.Add(o);

Now the CSV standard complexity comes from handling multi lines correctly which is done by counting if there are odd number of quotes in a line, hence it's multi line and reading the lines until the quotes are even, which is done in the ReadFile() function.

The beauty of this approach is that it is simple, does no reflection and is really fast, with the control being in the users hands.

All the reading code is below :

public static List<T> ReadFile<T>(string filename, bool hasheader, char delimiter, ToOBJ<T> mapper) where T : new()
{
    string[] cols = null;
    List<T> list = new List<T>();
    int linenum = -1;
    StringBuilder sb = new StringBuilder();
    bool insb = false;
    foreach (var line in File.ReadLines(filename))
    {
        try
        {
            linenum++;
            if (linenum == 0)
            {
                if (hasheader)
                {
                    // actual col count
                    int cc = CountOccurence(line, delimiter);
                    if (cc == 0)
                        throw new Exception("File does not have '" + delimiter + "' as a delimiter");
                    cols = new string[cc + 1];
                    continue;
                }
                else
                    cols = new string[_COLCOUNT];
            }
            var qc = CountOccurence(line, '\"');
            bool multiline = qc % 2 == 1 || insb;

            string cline = line;
            // if multiline add line to sb and continue
            if (multiline)
            {
                insb = true;
                sb.Append(line);
                var s = sb.ToString();
                qc = CountOccurence(s, '\"');
                if (qc % 2 == 1)
                {
                    sb.AppendLine();
                    continue;
                }
                cline = s;
                sb.Clear();
                insb = false;
            }

            var c = ParseLine(cline, delimiter, cols);

            T o = new T();
            var b = mapper(o, c);
            if (b)
                list.Add(o);
        }
        catch (Exception ex)
        {
            throw new Exception("error on line " + linenum, ex);
        }
    }

    return list;
}

private unsafe static int CountOccurence(string text, char c)
{
    int count = 0;
    int len = text.Length;
    int index = -1;
    fixed (char* s = text)
    {
        while (index++ < len)
        {
            char ch = *(s + index);
            if (ch == c)
                count++;
        }
    }
    return count;
}

private unsafe static string[] ParseLine(string line, char delimiter, string[] columns)
{
    //return line.Split(delimiter);
    int col = 0;
    int linelen = line.Length;
    int index = 0;

    fixed (char* l = line)
    {
        while (index < linelen)
        {
            if (*(l + index) != '\"')
            {
                // non quoted
                var next = line.IndexOf(delimiter, index);
                if (next < 0)
                {
                    columns[col++] = new string(l, index, linelen - index);
                    break;
                }
                columns[col++] = new string(l, index, next - index);
                index = next + 1;
            }
            else
            {
                // quoted string change "" -> "
                int qc = 1;
                int start = index;
                char c = *(l + ++index);
                // find matching quote until delim or EOL
                while (index++ < linelen)
                {
                    if (c == '\"')
                        qc++;
                    if (c == delimiter && qc % 2 == 0)
                        break;
                    c = *(l + index);
                }
                columns[col++] = new string(l, start + 1, index - start - 3).Replace("\"\"", "\"");
            }
        }
    }

    return columns;
}

ParseLine() is responsible for extracting the columns from a line in an optimized unsafe way.

And the writing code is just :

public static void WriteFile<T>(string filename, string[] headers, char delimiter, List<T> list, FromObj<T> mapper)
{
    using (FileStream f = new FileStream(filename, FileMode.Create, FileAccess.Write))
    {
        using (StreamWriter s = new StreamWriter(f))
        {
            if (headers != null)
                s.WriteLine(string.Join(delimiter.ToString(), headers));

            foreach (var o in list)
            {
                List<object> cols = new List<object>();
                mapper(o, cols);
                for (int i = 0; i < cols.Count; i++)
                {
                    // qoute string if needed -> \" \r \n delim 
                    var str = cols[i].ToString();
                    bool quote = false;

                    if (str.IndexOf('\"') >= 0)
                    {
                        quote = true;
                        str = str.Replace("\"", "\"\"");
                    }

                    if (quote == false && str.IndexOf('\n') >= 0)
                        quote = true;

                    if (quote == false && str.IndexOf('\r') >= 0)
                        quote = true;

                    if (quote == false && str.IndexOf(delimiter) >= 0)
                        quote = true;

                    if (quote)
                        s.Write("\"");
                    s.Write(str);
                    if (quote)
                        s.Write("\"");

                    if (i < cols.Count - 1)
                        s.Write(delimiter);
                }
                s.WriteLine();
            }
            s.Flush();
        }
        f.Close();
    }
}

Sample Use cases

Splitting data sets for testing and training

In data science you generally split your data into training and testing sets, and in the example below every 3rd row is for testing ( you could make the splitting more elaborate) :

var testing = new List<LocalWeatherData>();
int line = 0;
var training = fastCSV.ReadFile<LocalWeatherData>("201503hourly.txt", true, ',', (o, c) =>
    {
        bool add = true;
        line++;
        o.Date = new DateTime(fastCSV.ToInt(c[1], 0, 4),
                              fastCSV.ToInt(c[1], 4, 2),
                              fastCSV.ToInt(c[1], 6, 2));
        o.SkyCondition = c[4];
        if (line % 3 == 0)
        {
            add = false;
            test.Add(o);
        }
        return add;
    });

History

  • Initial release : 1st January 2020

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

Share

About the Author

Mehdi Gholam
Architect -
United Kingdom United Kingdom
Mehdi first started programming when he was 8 on BBC+128k machine in 6512 processor language, after various hardware and software changes he eventually came across .net and c# which he has been using since v1.0.
He is formally educated as a system analyst Industrial engineer, but his programming passion continues.

* Mehdi is the 5th person to get 6 out of 7 Platinum's on Code-Project (13th Jan'12)
* Mehdi is the 3rd person to get 7 out of 7 Platinum's on Code-Project (26th Aug'16)

Comments and Discussions

 
QuestionEmbedded Newlines Pin
David A. Gray16-Jan-20 8:20
MemberDavid A. Gray16-Jan-20 8:20 
AnswerRe: Embedded Newlines Pin
Mehdi Gholam16-Jan-20 8:30
MemberMehdi Gholam16-Jan-20 8:30 
GeneralRe: Embedded Newlines Pin
David A. Gray16-Jan-20 8:39
MemberDavid A. Gray16-Jan-20 8:39 
GeneralRe: Embedded Newlines Pin
Mehdi Gholam16-Jan-20 8:45
MemberMehdi Gholam16-Jan-20 8:45 
QuestionExcellent! Question about "unsafe" code Pin
Gary Schumacher4-Jan-20 20:24
MemberGary Schumacher4-Jan-20 20:24 
AnswerRe: Excellent! Question about "unsafe" code Pin
Mehdi Gholam4-Jan-20 21:02
MemberMehdi Gholam4-Jan-20 21:02 
QuestionToInt and ToDateTime Pin
#realJSOP2-Jan-20 3:02
mva#realJSOP2-Jan-20 3:02 
AnswerRe: ToInt and ToDateTime Pin
Mehdi Gholam2-Jan-20 3:37
MemberMehdi Gholam2-Jan-20 3:37 
QuestionMy vote of 5 + question Pin
softexpert2-Jan-20 0:27
Membersoftexpert2-Jan-20 0:27 
AnswerRe: My vote of 5 + question Pin
Mehdi Gholam2-Jan-20 3:42
MemberMehdi Gholam2-Jan-20 3:42 
GeneralMy vote of 5 Pin
Franc Morales1-Jan-20 15:46
MemberFranc Morales1-Jan-20 15:46 

General General    News News    Suggestion Suggestion    Question Question    Bug Bug    Answer Answer    Joke Joke    Praise Praise    Rant Rant    Admin Admin   

Use Ctrl+Left/Right to switch messages, Ctrl+Up/Down to switch threads, Ctrl+Shift+Left/Right to switch pages.

Article
Posted 1 Jan 2020

Tagged as

Stats

7K views
374 downloads
24 bookmarked