CSV Filter

27 Sep 2009 · CPOL · 2 min read
How to read, filter and write your CSV files

Introduction

In this article, you will learn how to read a CSV (Comma Separated Values) file and re-write it to another file, applying a basic filter of your choice.

I wrote this article because I had an old, lengthy contact list on my desktop that I wanted to organize, de-duplicate and convert into a CSV file. Of course, I first had to use Microsoft Word to replace the semicolons with commas and insert the paragraph mark ^p at the end of each line, making sure I ended up with a two-column CSV file.

Background

To understand this article, you need to be familiar with the basics of I/O streams and ADO.NET.

Using the Code

First, we read the source CSV file using a BackgroundWorker component, taking care not to do anything on the worker thread that could cause a cross-thread exception.

We create a StreamReader, read the source file line by line, split each line on the specified separator character and check whether the number of resulting tokens matches the expected count; if it does, a row is added to a DataTable object:

C#
private void backgroundWorker1_DoWork(object sender, DoWorkEventArgs e)
{
    DataTable dt;
    StreamReader sr;
    StreamWriter sw;

    string[] tokens={String.Empty};

    //Reading Source
OpenFile:
    try
    {
        sr = new StreamReader(strSource);
    }
    catch (IOException)
    {
        DialogResult dr = MessageBox.Show("Source file in use!",
	"Error", MessageBoxButtons.AbortRetryIgnore, MessageBoxIcon.Error);
        if (dr == DialogResult.Retry)
            goto OpenFile;
        else
            return;
    }

    dt = new DataTable("CSV_SOURCE");

    for (int x = 1; x <= (int)numTokens.Value; x++)
        dt.Columns.Add("COL_" + x.ToString(), typeof(string));

    backgroundWorker1.ReportProgress(0, "Reading source file...");
    while (!sr.EndOfStream)
    {
        try
        {
            tokens = sr.ReadLine().Split(strSourceSeperator.ToCharArray()[0]);

            if (tokens.Length == (int)numTokens.Value)
                dt.Rows.Add(tokens);
        }
        catch (Exception ex) { MessageBox.Show(ex.Message); }
    }
    sr.Close();
}
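
The worker reports its progress with ReportProgress, so the user interface can be updated safely from the ProgressChanged event, which is raised on the UI thread. Here is a minimal sketch of such a handler; the progressBar1 and lblStatus control names are my assumptions and are not taken from the article's download:

C#
private void backgroundWorker1_ProgressChanged(object sender, ProgressChangedEventArgs e)
{
    // ProgressChanged runs on the UI thread, so touching controls here is safe.
    progressBar1.Value = e.ProgressPercentage;                 // assumed ProgressBar control
    lblStatus.Text = (e.UserState as string) ?? String.Empty;  // assumed status Label
}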

Our DataTable object's columns were created in a simple loop and named COL_1, COL_2, COL_3, etc., according to the number of tokens specified by the user and expected in the source file:

C#
dt = new DataTable("CSV_SOURCE");

for (int x = 1; x <= (int)numTokens.Value; x++)
    dt.Columns.Add("COL_" + x.ToString(), typeof(string));

Notice that before starting the worker, we had to create a set of class-level variables that store the input from the interface controls, so that the worker thread never reads the controls directly:

C#
string strFilter, strSourceSeperator, strTargetSeperator;
private void btnFilter_Click(object sender, EventArgs e)
{
    strFilter = txtFilter.Text;
    strSourceSeperator = txtSourceSeperator.Text;
    strTargetSeperator = txtTargetSeperator.Text;

    backgroundWorker1.RunWorkerAsync();
}
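
When DoWork finishes (or throws), the RunWorkerCompleted event is raised back on the UI thread. A sketch of a completion handler, under the assumption that you simply want to notify the user, could look like this:

C#
private void backgroundWorker1_RunWorkerCompleted(object sender, RunWorkerCompletedEventArgs e)
{
    // RunWorkerCompleted runs on the UI thread after DoWork returns or throws.
    if (e.Error != null)
        MessageBox.Show(e.Error.Message, "Error", MessageBoxButtons.OK, MessageBoxIcon.Error);
    else
        MessageBox.Show("Done filtering.", "CSV Filter", MessageBoxButtons.OK, MessageBoxIcon.Information);
}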

Now we remove the duplicates:

C#
//Removing Duplicates
if (chUnique.Checked)
    dt = dsHelper.SelectDistinct("DISTINCT_CSV_SOURCE", dt, "COL_1");

Actually, the code segment above repeats the mistake we just avoided: it reads chUnique.Checked directly from the worker thread. You should fix it by creating one more class-level bool variable and assigning the checkbox state to it before starting the worker.
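
A minimal sketch of that fix might look like the following; bUnique is a name I am introducing for illustration, and the same trick applies to any other control the worker reads directly (numTokens, for example):

C#
bool bUnique;   // hypothetical class-level copy of the checkbox state

private void btnFilter_Click(object sender, EventArgs e)
{
    bUnique = chUnique.Checked;   // captured on the UI thread before the worker starts
    // ... copy the other inputs as shown earlier, then:
    backgroundWorker1.RunWorkerAsync();
}

// Inside backgroundWorker1_DoWork, test the copied flag instead of the control:
if (bUnique)
    dt = dsHelper.SelectDistinct("DISTINCT_CSV_SOURCE", dt, "COL_1");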

For the SelectDistinct method, you need to include this helper class in your solution:

C#
public class DataSetHelper
{
    public DataSet ds;

    public DataSetHelper(ref DataSet DataSet)
    {
        ds = DataSet;
    }

    public DataSetHelper()
    {
        ds = null;
    }

    private bool ColumnEqual(object A, object B)
    {
        // Compares two values to see if they are equal. Also compares DBNULL.Value.
        // Note: If your DataTable contains object fields, then you must extend this
        // function to handle them in a meaningful way if you intend to group on them.

        if (A == DBNull.Value && B == DBNull.Value) //  both are DBNull.Value
            return true;
        if (A == DBNull.Value || B == DBNull.Value) //  only one is DBNull.Value
            return false;
        return (A.Equals(B));  // value type standard comparison
    }

    public DataTable SelectDistinct(string TableName, DataTable SourceTable, string FieldName)
    {
        DataTable dt = new DataTable(TableName);
        dt.Columns.Add(FieldName, SourceTable.Columns[FieldName].DataType);

        object LastValue = null;
        foreach (DataRow dr in SourceTable.Select("", FieldName))
        {
            if (LastValue == null || !(ColumnEqual(LastValue, dr[FieldName])))
            {
                LastValue = dr[FieldName];
                dt.Rows.Add(new object[] { LastValue });
            }
        }
        if (ds != null)
            ds.Tables.Add(dt);
        return dt;
    }
}
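
The snippets above call SelectDistinct through a dsHelper instance; the article does not show where it comes from, but a class-level field created with the parameterless constructor (my assumption) is enough, since leaving the DataSet null simply means the distinct table is returned without being added anywhere:

C#
// Assumed class-level field; with ds left null, SelectDistinct just returns the new table.
DataSetHelper dsHelper = new DataSetHelper();

Note that the table returned by SelectDistinct contains only the key column (COL_1 here): SourceTable.Select("", FieldName) sorts the rows by that field and only that single value is copied into the new table, so the remaining columns are dropped before the write step.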

Finally, we create a StreamWriter, iterate through the DataTable we built earlier row by row and column by column, write out the values with the new separator character (or keep the original one), and close the stream.

C#
//Writing Target
SaveFile:
    try
    {
        sw = new StreamWriter(strTarget);
    }
    catch (IOException)
    {
        DialogResult dr = MessageBox.Show("Target file in use!", 
	"Error", MessageBoxButtons.AbortRetryIgnore, MessageBoxIcon.Error);
        if (dr == DialogResult.Retry)
            goto SaveFile;
        else
            return;
    }

    backgroundWorker1.ReportProgress(50, "Writing to target file...");
    string tmpLine;
    foreach(DataRow dr in dt.Rows)
    {
        try
        {
            foreach (DataColumn dc in dt.Columns)
            {
                tmpLine = dr[dc].ToString();
                //Filtering
                switch(iFilter)
                {
                    case 0:
                    if (tmpLine.Contains(strFilter))
                        sw.Write(tmpLine);
                    break;

                    case 1:
                    if (!tmpLine.Contains(strFilter))
                        sw.Write(tmpLine);
                    break;
                }

                if (!dc.ColumnName.EndsWith("_"+ dt.Columns.Count.ToString()))
                    sw.Write(strTargetSeperator);
            }

            sw.Write("\r\n");
        }
        catch (Exception ex) { MessageBox.Show(ex.Message); }
    }

    sw.Close();

For filtering the results, we use a basic check: mode 0 keeps the fields that contain the filter text, while mode 1 keeps the fields that do not:

C#
tmpLine = dr[dc].ToString();
//Filtering
switch(iFilter)
{
    case 0:
        if (tmpLine.Contains(strFilter))
            sw.Write(tmpLine);
    break;

    case 1:
        if (!tmpLine.Contains(strFilter))
            sw.Write(tmpLine);
    break;
}
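
The iFilter variable selects the filter mode and, like the other inputs, should be captured from the UI before the worker starts. A sketch, assuming the mode is picked from a combo box named cboFilterMode (a control name I am inventing for illustration):

C#
int iFilter;   // 0 = keep fields that contain strFilter, 1 = keep fields that do not

private void btnFilter_Click(object sender, EventArgs e)
{
    iFilter = cboFilterMode.SelectedIndex;   // assumed ComboBox; read on the UI thread
    // ... capture the remaining inputs and call backgroundWorker1.RunWorkerAsync() as before.
}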

History

  • 27th September, 2009: Initial post

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


Written By
Retired QSoft
Yemen Yemen
Biography?! I'm not dead yet!
www.QSoftOnline.com

Comments and Discussions

 
CSV is not simply values split by commas - Nate P, 27-Sep-09 6:36

Re: CSV is not simply values split by commas - Tom1, 27-Sep-09 7:00

Re: CSV is not simply values split by commas - Nate P, 27-Sep-09 9:30

Re: CSV is not simply values split by commas - Tom1, 27-Sep-09 9:49

Re: CSV is not simply values split by commas - Goran Bacvarovski, 28-Sep-09 6:32

The RFC will only give you a standard - however, CSV as a "flat file" or plain-text format is very open to interpretation and can be customized to fit the needs of the application. What Office does with its CSV file filter comes close to some kind of standardization, but the CSV format still lacks standard header information that creators could use to specify at least the encoding and the text qualifiers, so that readers can decode the file and tell whether a comma sits inside a single column or starts a new one. For now, all you can hope for as a programmer is that whoever supplies the CSV files (whichever system produces them) will not change the format. As a programmer producing CSV files, it is best if you can include header information (define the file structure) and instruct your users (other systems) to parse those headers. The other quick fix is to read line by line and do some complex trial-and-error to work out which text qualifiers are used (compare the number of columns found in each row, then re-parse different rows with candidate qualifiers and see whether the column counts agree) - and it is hard to predict whether the file is even meant to have the same number of columns from row 1 to the end; if that is not the case, this logic will not work, and so on. There are millions of reasons why code like this may fail, and no quick fix can address all possible issues.

Re: CSV is not simply values split by commas - Liu Junfeng, 9-Oct-09 23:00
