CSV Format Encoder and Decoder

Liu Junfeng


10 May 2009


Make use of the comma separated values file format.

Download source code - 10.9 KB

Introduction

The CSV (comma-separated values) file format is often used to exchange data between disparate applications. CSV has much lower overhead than XML, so it uses less bandwidth and storage. Another important advantage is that a CSV encoder and decoder are easy to write. Many informal documents describe the CSV format, but they differ in how special characters are handled. Here, I propose a definition that I expect most people can agree with.

Definition of the CSV format

CSV data contains a list of records, and a record contains a list of fields. Records are not required to have the same number of fields.

Basic rules:

Fields are separated with commas.
Each record occupies just one line.

Extended rules:

Padding spaces can be added ahead of a field.
Any field may be enclosed in double quotes.
The first record may be a record of column names.

Special rules:

Within a double-quoted string, \\, \", \r, \n, \t are treated as escape sequences.
If a field value contains leading spaces or special characters such as a comma, double quote, or line break, it must be enclosed in double quotes.
Empty strings must be double-quoted; null strings need not be.

Storage rules:

The text is treated as Unicode, and is loaded from and saved to a file using a specific encoding.

Usually, UTF-8 or UTF-16 is used. The byte order mark is optional for UTF-8 and required for UTF-16.
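
The storage rules can be illustrated with the standard System.IO and System.Text classes. This is only a sketch: the class name CsvFile and its method names are made up for illustration and are not part of the article's source. It assumes little-endian UTF-16, and relies on File.ReadAllText detecting a byte order mark when one is present.

```csharp
using System.IO;
using System.Text;

// Hypothetical helper, not part of the article's source code.
public static class CsvFile
{
    // Saves CSV text as UTF-8. The byte order mark is optional,
    // so the caller decides whether to write it.
    public static void SaveUtf8(string path, string csvText, bool writeBom)
    {
        File.WriteAllText(path, csvText, new UTF8Encoding(writeBom));
    }

    // Saves CSV text as little-endian UTF-16, always with a byte order mark.
    public static void SaveUtf16(string path, string csvText)
    {
        File.WriteAllText(path, csvText, new UnicodeEncoding(false, true));
    }

    // Loads CSV text. A byte order mark, if present, selects the encoding;
    // otherwise the file is read as UTF-8.
    public static string Load(string path)
    {
        return File.ReadAllText(path, Encoding.UTF8);
    }
}
```

The loaded string can then be passed to the decoder, and the encoder's output can be passed to one of the save methods.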

Padding spaces can be used to align fields to the same column. To keep the basic rules intact, special characters are handled by the special rules.

The grammar of CSV is expressed in PEG as:

CsvData <- Record* EndOfFile
Record <- !EndOfFile Field (Separator Field)* (EndOfLine/EndOfFile)
Field <- Spacing (QuotedText / UnQuotedText)
UnQuotedText <- (-",\"\r\n")*
QuotedText <- '"' (-"\"\r\n\\" / EscapeSequence)* '"'
EscapeSequence <- '\\\\' / '\\"' / '\\r' / '\\n' / '\\t'
Spacing <- Space*
Space <- ' ' / '\t'
Separator <- ','
EndOfLine <- '\r\n' / '\r' / '\n'
EndOfFile <- <end>

Here ",\"\r\n" denotes a character set, and -",\"\r\n" denotes its complement. According to this grammar, only leading spaces are ignored, and each record ends with a line break, except that the last record may instead end at the end of the file. Both choices simplify the grammar without hurting the formatting style.
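
To make the grammar more concrete, here is a small hand-written parser that follows the same rules. It is only a sketch and is not the generated Parser class used later in this article: the name MiniCsvParser is made up, records are returned as plain lists of strings, and I assume that a field starting with a double quote is always read as quoted text and that an empty unquoted field is read as null.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical hand-written parser, for illustration only.
public static class MiniCsvParser
{
    // CsvData <- Record* EndOfFile
    public static List<List<string>> Parse(string text)
    {
        List<List<string>> records = new List<List<string>>();
        int pos = 0;
        while (pos < text.Length)                          // !EndOfFile
        {
            List<string> fields = new List<string>();
            fields.Add(ParseField(text, ref pos));
            while (pos < text.Length && text[pos] == ',')  // (Separator Field)*
            {
                pos++;
                fields.Add(ParseField(text, ref pos));
            }
            if (pos < text.Length)                         // EndOfLine, else EndOfFile
            {
                if (text[pos] == '\r')
                {
                    pos++;
                    if (pos < text.Length && text[pos] == '\n') pos++;
                }
                else if (text[pos] == '\n')
                {
                    pos++;
                }
                else
                {
                    throw new FormatException("Unexpected character at position " + pos);
                }
            }
            records.Add(fields);
        }
        return records;
    }

    // Field <- Spacing (QuotedText / UnQuotedText)
    static string ParseField(string text, ref int pos)
    {
        while (pos < text.Length && (text[pos] == ' ' || text[pos] == '\t')) pos++;
        if (pos < text.Length && text[pos] == '"') return ParseQuoted(text, ref pos);

        int start = pos;
        while (pos < text.Length && ",\"\r\n".IndexOf(text[pos]) < 0) pos++;
        // An empty unquoted field is treated as a null value.
        return pos > start ? text.Substring(start, pos - start) : null;
    }

    // QuotedText <- '"' (ordinary char / EscapeSequence)* '"'
    static string ParseQuoted(string text, ref int pos)
    {
        StringBuilder value = new StringBuilder();
        pos++;                                             // opening quote
        while (pos < text.Length && text[pos] != '"')
        {
            char ch = text[pos++];
            if (ch == '\r' || ch == '\n')
                throw new FormatException("Raw line break inside a quoted field.");
            if (ch != '\\') { value.Append(ch); continue; }
            if (pos >= text.Length) throw new FormatException("Dangling backslash.");
            char esc = text[pos++];
            switch (esc)
            {
                case '\\': value.Append('\\'); break;
                case '"': value.Append('"'); break;
                case 'r': value.Append('\r'); break;
                case 'n': value.Append('\n'); break;
                case 't': value.Append('\t'); break;
                default: throw new FormatException("Unknown escape sequence \\" + esc);
            }
        }
        if (pos >= text.Length) throw new FormatException("Unterminated quoted field.");
        pos++;                                             // closing quote
        return value.ToString();
    }
}
```

Each method corresponds to one grammar rule, so the sketch can be checked against the PEG line by line. A generated parser would normally report errors through a success flag rather than exceptions; exceptions are used here only to keep the sketch short.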

Implementation code

Data structures

public partial class CsvData
{
    public CsvRecord Header;

    public List<CsvRecord> Records = new List<CsvRecord>();

    /// <summary>
    /// Indicates whether the data has a header record.
    /// </summary>
    public bool HasHeader
    {
        get { return Header != null; }
    }
}

public partial class CsvRecord
{
    public List<string> Fields = new List<string>();
}

Encoder

public class CsvEncoder
{
    public static string Encode(CsvData csvData)
    {
        return Encode(csvData, null);
    }

    /// <summary>
    /// Encode CsvData with format options.
    /// </summary>
    /// <param name="csvData">The data to encode.</param>
    /// <param name="formatOptions">FieldFormatOption dictionary
    /// keyed by 0-based field index; may be null.</param>
    public static string Encode(CsvData csvData, 
           Dictionary<int, FieldFormatOption> formatOptions)
    {
        return Encode(csvData, formatOptions, ",");
    }

    public static string Encode(CsvData csvData, 
           Dictionary<int, FieldFormatOption> formatOptions, string separator)
    {
        CsvEncoder encoder = new CsvEncoder();
        encoder.FormatOptions = formatOptions;
        encoder.Separator = separator;
        return encoder.EncodeCsvData(csvData);
    }

    Dictionary<int, FieldFormatOption> FormatOptions;
    string Separator = ",";
    static readonly char[] sepcialChars = new char[] { ',', '"', '\r', '\n' };

    private string EncodeCsvData(CsvData csvData)
    {
        StringBuilder text = new StringBuilder();
        if (csvData.HasHeader)
        {
            text.AppendLine(EncodeRecord(csvData.Header));
        }
        foreach (CsvRecord record in csvData.Records)
        {
            text.AppendLine(EncodeRecord(record));
        }
        return text.ToString();
    }

    private string EncodeRecord(CsvRecord record)
    {
        StringBuilder text = new StringBuilder();
        for (int i = 0; i < record.Fields.Count; i++)
        {
            string field = record.Fields[i];

            FieldFormatOption option = FieldFormatOption.Default;
            if (FormatOptions != null && FormatOptions.ContainsKey(i))
            {
                option = FormatOptions[i];
            }
            int charsToPad = 0;
            if (field != null)
            {
                string value = option.AlwaysQuoted ? "\"" + 
                       EscapeString(field) + "\"" : EncodeField(field);
                
                charsToPad = option.TotalWidth - GetTextWidth(value);
                if (option.AlignRight && charsToPad > 0)
                {
                    text.Append(new string(' ', charsToPad));
                }
                text.Append(value);
            }
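            bool isLastField = i == record.Fields.Count - 1;
            if (!isLastField)
            {
                text.Append(Separator);
            }

            // Left-aligned padding is written after the separator, so it becomes
            // leading space of the next field, which the decoder ignores. The last
            // field is not padded, because trailing spaces would not be ignored.
            if (!isLastField && !option.AlignRight && charsToPad > 0)
            {
                text.Append(new string(' ', charsToPad));
            }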
        }
        return text.ToString();
    }

    private static string EncodeField(string field)
    {
        if (field.Trim(' ', '\t').Length < field.Length || 
            field.IndexOfAny(sepcialChars) > -1)
        {
            return "\"" + EscapeString(field) + "\"";
        }
        else if (field == string.Empty)
        {
            return "\"\"";
        }
        else
        {
            return field;
        }
    }

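    // Estimates the display width of the text: characters below U+00FF
    // count as one column, all others (e.g. CJK) as two.
    static int GetTextWidth(string text)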
    {
        int width = 0;
        foreach (char ch in text)
        {
            if (ch < 0xff)
            {
                width += 1;
            }
            else
            {
                width += 2;
            }
        }
        return width;
    }

    static string EscapeString(string text)
    {
        if (text == null) return null;
        StringBuilder escapedtext = new StringBuilder();
        foreach (char ch in text)
        {
            switch (ch)
            {
                case '\\':
                    escapedtext.Append("\\\\");
                    break;
                case '\"':
                    escapedtext.Append("\\\"");
                    break;
                case '\r':
                    escapedtext.Append("\\r");
                    break;
                case '\n':
                    escapedtext.Append("\\n");
                    break;
                case '\t':
                    escapedtext.Append("\\t");
                    break;
                default:
                    escapedtext.Append(ch);
                    break;
            }
        }
        return escapedtext.ToString();
    }
}

public class FieldFormatOption
{
    public int TotalWidth;
    public bool AlignRight;
    public bool AlwaysQuoted;

    public FieldFormatOption() { }

    public FieldFormatOption(int totalWidth, 
           bool alignRight, bool alwaysQuoted)
    {
        TotalWidth = totalWidth;
        AlignRight = alignRight;
        AlwaysQuoted = alwaysQuoted;
    }

    public static readonly FieldFormatOption Default = 
                           new FieldFormatOption();
}

Decoder

public class CsvDecoder
{
    public static CsvData Decode(string text)
    {
        return Decode(text, false);
    }

    public static CsvData Decode(string text, bool hasHeader)
    {
        if (text == null)
        {
            throw new ArgumentNullException("text");
        }
        bool success;
        Parser parser = new Parser();
        CsvData csvData = parser.ParseCsvData(
                new TextInput(text), out success);
        if (success)
        {
            // Guard against empty input, which has no record to use as header.
            if (hasHeader && csvData.Records.Count > 0)
            {
                csvData.Header = csvData.Records[0];
                csvData.Records.RemoveAt(0);
            }
            return csvData;
        }
        else
        {
            throw new Exception("There are syntax errors in the csv text.");
        }
    }
}

public partial class Parser
{
    public CsvData ParseCsvData(ParserInput<char> input, out bool success)
    {
        this.SetInput(input);
        CsvData csvData = ParseCsvData(out success);
        return csvData;
    }

    // remaining methods omitted
}

A simple test

class Test
{
    static void Main(string[] args)
    {
        string text = @"A,B,C,D
1997,Ford,E350,jefferson st.
John,Doe,120 NJ, 08075
D:\test.csv, D:\test.csv, D:\test.csv, D:\test.csv
""  sp\t"", ""sp\"""", ""s,p"", ""sp\r\n""";

        CsvData csvData = CsvDecoder.Decode(text, true);

        Dictionary<int, FieldFormatOption> formatOptions = 
                  new Dictionary<int, FieldFormatOption>();
        formatOptions[0] = new FieldFormatOption(12, false, false);
        formatOptions[1] = new FieldFormatOption(12, true, false);
        formatOptions[2] = new FieldFormatOption(12, false, false);
        formatOptions[3] = new FieldFormatOption(15, true, true);
        string formatted = CsvEncoder.Encode(csvData, formatOptions, ", ");

        Console.Write(formatted);
        Console.ReadLine();
    }
}

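The result is the decoded data written back out, with each of the four columns padded to the width and alignment configured above.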

Points of interest

How do we encode binary data?

CSV is not a suitable format for storing large blocks of binary data. Small binary fields can be converted to text using hexadecimal or Base64 encoding.
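
As a rough sketch of such a conversion, the helper below turns a byte array into a Base64 or hexadecimal field value and back. The class name BinaryFieldCodec and its methods are made up for illustration; only Convert.ToBase64String, Convert.FromBase64String, and Convert.ToByte come from the .NET framework.

```csharp
using System;
using System.Text;

// Hypothetical helper, not part of the article's source code.
public static class BinaryFieldCodec
{
    // Base64 output uses only letters, digits, '+', '/' and '=',
    // so the resulting field never needs quoting.
    public static string ToBase64Field(byte[] data)
    {
        return Convert.ToBase64String(data);
    }

    public static byte[] FromBase64Field(string field)
    {
        return Convert.FromBase64String(field);
    }

    // Two upper-case hex digits per byte.
    public static string ToHexField(byte[] data)
    {
        StringBuilder hex = new StringBuilder(data.Length * 2);
        foreach (byte b in data)
        {
            hex.Append(b.ToString("X2"));
        }
        return hex.ToString();
    }

    public static byte[] FromHexField(string field)
    {
        if (field.Length % 2 != 0)
        {
            throw new FormatException("A hex field must have an even length.");
        }
        byte[] data = new byte[field.Length / 2];
        for (int i = 0; i < data.Length; i++)
        {
            data[i] = Convert.ToByte(field.Substring(i * 2, 2), 16);
        }
        return data;
    }
}
```

Base64 is the more compact of the two (about 4 characters per 3 bytes, against 2 characters per byte for hex), while hex is easier to read by eye.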

How do we encode multiple tables?

A blank line is read as a record with a single field whose value is null. Normally, a table has more than one column, so blank lines can be used to separate tables.
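
One possible way to do the splitting after decoding is sketched below. To keep the example self-contained, it works on plain lists of strings rather than on CsvRecord, with each inner list standing in for CsvRecord.Fields; the class name CsvTables is made up for illustration.

```csharp
using System.Collections.Generic;

// Hypothetical helper, not part of the article's source code.
public static class CsvTables
{
    // A blank line decodes to a record with exactly one null field.
    static bool IsBlank(List<string> record)
    {
        return record.Count == 1 && record[0] == null;
    }

    // Splits a flat list of records into tables at blank records.
    // Consecutive blank records do not produce empty tables.
    public static List<List<List<string>>> Split(List<List<string>> records)
    {
        List<List<List<string>>> tables = new List<List<List<string>>>();
        List<List<string>> current = new List<List<string>>();
        foreach (List<string> record in records)
        {
            if (IsBlank(record))
            {
                if (current.Count > 0)
                {
                    tables.Add(current);
                    current = new List<List<string>>();
                }
            }
            else
            {
                current.Add(record);
            }
        }
        if (current.Count > 0)
        {
            tables.Add(current);
        }
        return tables;
    }
}
```

Note that this only works when no table has a genuine single-column record with a null value, which is the assumption stated above.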

These features can be handled by the user of the CSV encoder/decoder.

History

2008-01-15 - Initial submission.
2009-05-09 - Enable EndOfFile to end the last record without EndOfLine.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)