Posted 11 Feb 2013
Licensed CPOL

Parse CSV file chunk by chunk and save in database

Updated 25 Nov 2013


In this article I will demonstrate how to read a large CSV file chunk by chunk (one chunk = a fixed number of lines), populate a System.Data.DataTable object, and bulk insert the data into a database.

I will explain in detail which .NET Framework components I used, the challenges I faced (memory management, performance, large file read/write, etc.), and how I resolved them. My intention in writing this article is to share my real-life experience with others who are working with a similar requirement now, or will be in the near future, so they can benefit from reading it.


One day a client sent us new requirements:


  • The client will upload very large CSV data files through their web application. 
  • After uploading finishes, the files will be stored at a specific server location.
  • A software agent (a Windows service, or a console app run by Task Scheduler) will parse those files sequentially and dump their data into a particular database.
  • The data file schema will be predefined, and this schema will be configurable via the database.
  • The client will configure the data file schema before uploading starts.
  • After the file data is dumped, the client will generate various reports from the database.


Uploaded File Structure   

Previously I mentioned that the file type will be .csv. We all know that a .csv-type file should have a predefined data separator; in our case that is the tab character, meaning the file contains tab-separated data. The first line of the file is its column definition (data schema). Each subsequent line represents a row/record, which needs to be dumped to the database periodically by a software agent/component.

The input data file contains four columns:

1. Code
2. YearlyExpense
3. LastName
4. FirstName


The file looks like this: 
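For illustration (the original screenshot is not reproduced here, and these values are made up), a tab-separated file with the four columns above might look like:

```text
Code	YearlyExpense	LastName	FirstName
A1001	5000	Smith	John
A1002	7500	Doe	Jane
```

Each field is separated by a single tab character, and the first line carries the column names.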


File Parsing  

First of all, I designed an interface IFileParser which has two methods.

  1. GetFileData
  2. WriteChunkData 
public interface IFileParser
{
    IEnumerable<DataTable> GetFileData(string sourceFileFullName);
    void WriteChunkData(DataTable table, string destinationTable,
         IList<KeyValuePair<string, string>> mapList);
}

1. GetFileData: The responsibility of the GetFileData method is to read a file chunk by chunk, populate a DataTable (ADO.NET object), and return that table to its caller.

2. WriteChunkData: The WriteChunkData method's responsibility is to insert the DataTable data into the database.

It needs to be clear that:

Multiple lines = 1 chunk of data = 1 DataTable.

So it will be better if, after reading each chunk of the data file, the method returns a single DataTable to its caller. The caller then calls the WriteChunkData method so the data is dumped to the database. This process continues until the end of the file is reached. Two very important .NET Framework features will be used:

  1. IEnumerable<T>   
  2. yield 

Interface methods 

IEnumerable<DataTable> IFileParser.GetFileData(string sourceFileFullName)
{
    bool firstLineOfChunk = true;
    int chunkRowCount = 0;
    DataTable chunkDataTable = null;
    string columnData = null;
    bool firstLineOfFile = true;

    using (var sr = new StreamReader(sourceFileFullName))
    {
        string line = null;
        // Read lines from the file until the end of the file is reached.
        while ((line = sr.ReadLine()) != null)
        {
            // The first line of the file is the column list; the DataTable
            // schema is created from it, so it must not be added as a row.
            if (firstLineOfFile)
            {
                columnData = line;
                firstLineOfFile = false;
                continue;
            }
            if (firstLineOfChunk)
            {
                firstLineOfChunk = false;
                chunkDataTable = CreateEmptyDataTable(columnData);
            }

            AddRow(chunkDataTable, line);
            chunkRowCount++;

            if (chunkRowCount == _chunkRowLimit)
            {
                firstLineOfChunk = true;
                chunkRowCount = 0;
                yield return chunkDataTable;
                chunkDataTable = null;
            }
        }
    }

    // Return the last chunk, which may hold fewer rows than the chunk size.
    if (chunkDataTable != null)
        yield return chunkDataTable;
}
Creating the DataTable without data:
private DataTable CreateEmptyDataTable(string firstLine)
{
    IList<string> columnList = Split(firstLine);
    var dataTable = new DataTable("tblData");
    dataTable.Columns.AddRange(columnList.Select(v => new DataColumn(v)).ToArray());
    return dataTable;
}
Adding data rows to the DataTable:
private void AddRow(DataTable dataTable, string line)
{
    DataRow newRow = dataTable.NewRow();
    IList<string> fieldData = Split(line);
    for (int columnIndex = 0; columnIndex < dataTable.Columns.Count; columnIndex++)
        newRow[columnIndex] = fieldData[columnIndex];
    dataTable.Rows.Add(newRow);
}
Splitting tab-separated data:
private IList<string> Split(string input)
{
    // Our data file is tab delimited.
    var dataList = new List<string>();
    foreach (string column in input.Split('\t'))
        dataList.Add(column);
    return dataList;
}

Why do I use IEnumerable<T> instead of IList<T> or ICollection<T>? The reason is that IList<T> and ICollection<T> do not support lazy evaluation. Another point is that inside the method I use the yield statement, which is used with return to lazily return chunk data.

If you plan to use the yield statement in a method, that method's return type must be one of:

  1. IEnumerable
  2. IEnumerable<T>
  3. IEnumerator
  4. IEnumerator<T>  

To use yield return you must respect some preconditions. If you use IList<T> instead of IEnumerable<T>, you will get a compilation error: 

System.Collections.Generic.IList<System.Data.DataTable>' is not an iterator interface type 

You will find the detailed preconditions for using the yield statement in this link.

The main benefit of the yield statement is that it supports lazily evaluated code. When data is returned by a yield statement, the current location in the code is remembered, and the remaining statements are executed the next time the iterator method is called.

Another simple example:  

public void ClientCode()
{
    foreach (int i in GetValues())
        Console.WriteLine(i);
}

public IEnumerable<int> GetValues()
{
    yield return 1;
    yield return 2;
    yield return 3;
    yield return 4;
}

The method returns 1 and it is printed to the console, next it returns 2 and that is printed, then 3, and finally 4. In this way the yield statement supports lazily evaluated code.

A StreamReader object is used for reading the data file line by line. It helps control memory pressure, because it brings a single line of data from disk into memory at a time.

In our data file, the first line is the column list and every subsequent line is a row/record. Some flag variables are used to track state, such as "first line of file" and "first line of chunk".

When reading starts, a DataTable is created while parsing the first chunk and data is populated into it. When the maximum chunk size is reached, the method returns the populated DataTable to its caller.

One very important concept needs to be understood properly before taking advantage of lazy evaluation. We said previously that using IEnumerable<T> with the yield statement gives lazy evaluation; sometimes, however, it can cause a performance penalty.

IEnumerable<DataTable> dataTables = fileParser.GetFileData(sourceFileFullName);
Console.WriteLine("Total data tables for first time:" + dataTables.Count());
Console.WriteLine("Total data tables for second time: " + dataTables.Count());
Console.WriteLine("Total data tables for third time: " + dataTables.Count()); 

Though you call the dataTables.Count() method three times, each call actually re-executes the FileParser.GetFileData method, so the whole file is parsed three times. This type of scenario causes a performance penalty and needs to be handled carefully; otherwise it will bring unfavorable consequences.
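One way to avoid the repeated parsing (a sketch using the variable names from the client code later in this article, not code from the original) is to enumerate the sequence exactly once and do all the work, including counting, in that single pass:

```csharp
// Sketch: count the chunks during the one-and-only enumeration instead
// of calling Count(), which would re-read the whole file on each call.
int tableCount = 0;
foreach (DataTable tbl in fileParser.GetFileData(sourceFileFullName))
{
    fileParser.WriteChunkData(tbl, destinationTableName, Map());
    tableCount++; // counted as part of the single pass
}
Console.WriteLine("Total data tables: " + tableCount);
```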

Bulk Insert 

void IFileParser.WriteChunkData(DataTable table, string destinationTable, 
        IList<KeyValuePair<string, string>> mapList)
{
    using (var bulkCopy = new SqlBulkCopy(_connectionString, SqlBulkCopyOptions.Default))
    {
        bulkCopy.BulkCopyTimeout = 0; // unlimited
        bulkCopy.DestinationTableName = destinationTable;
        foreach (KeyValuePair<string, string> map in mapList)
            bulkCopy.ColumnMappings.Add(map.Key, map.Value);
        bulkCopy.WriteToServer(table, DataRowState.Added);
    }
}

The SqlBulkCopy object is very efficient for writing huge amounts of data from a data source to a SQL Server database. It has a property named ColumnMappings; the source-to-destination column mapping information needs to be added to this property. Note that without mapping information, the object assumes the destination columns are in the same order as the source columns. So it is always better to add mapping information even when the source and destination columns are the same and in the same order. Some SqlBulkCopy behaviors you should know:

  • If the destination table has a column that is not mentioned in the source data, its default value will be inserted.
  • If a mapped destination column is not found in the destination table, an InvalidOperationException will be raised with the message "The given ColumnMapping does not match up with any column in the source or destination."
  • If a mapped source column is not found in the source DataTable, an InvalidOperationException will be raised with the message "The given ColumnName 'FirstName2' does not match up with any column in data source."
  • An interesting point: source column matching is case insensitive, but destination column matching is case sensitive. This means the destination database table's column name must appear with the same case in the mapping information; otherwise it is treated as a different column.

Client Code

IFileParser fileParser = new DefaultFileParser();
string destinationTableName = "Data";
string sourceDirectory = @"D:\Data";
string sourceFileFullName = GetFileFullName(sourceDirectory);
IEnumerable<DataTable> dataTables = fileParser.GetFileData(sourceFileFullName);

foreach(DataTable tbl in dataTables)            
    fileParser.WriteChunkData(tbl, destinationTableName, Map());
Console.WriteLine("Completed successfully!");
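The Map() helper used above is never shown in the article. A minimal sketch (the assumption that each source column maps to an identically named destination column is mine, based on the four columns described earlier) might look like:

```csharp
// Hypothetical sketch of the Map() helper: it pairs each source column
// with its destination column for SqlBulkCopy.ColumnMappings.
private static IList<KeyValuePair<string, string>> Map()
{
    return new List<KeyValuePair<string, string>>
    {
        new KeyValuePair<string, string>("Code", "Code"),
        new KeyValuePair<string, string>("YearlyExpense", "YearlyExpense"),
        new KeyValuePair<string, string>("LastName", "LastName"),
        new KeyValuePair<string, string>("FirstName", "FirstName")
    };
}
```

Remember that the destination names here must match the case of the actual database columns, per the SqlBulkCopy behavior noted above.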

The GetFileData method returns a list of DataTables. But since it is an IEnumerable<DataTable>, its body executes only when the sequence is actually enumerated. So here is one mistake you may make: 

try
{
    IEnumerable<DataTable> dataTables = fileParser.GetFileData(sourceFileFullName);
}
catch (Exception ex)
{
    // expecting to catch a parsing error here, but nothing is thrown yet
}

But this never throws or catches any exception, because the GetFileData method's code has not executed yet. So when you debug, the code will not step into that method, simply because it returns IEnumerable<T> using the yield statement. Only when the dataTables are accessed inside the foreach loop does the code execute and return the DataTables one by one.

I use the GC.Collect method once per iteration. Why is it needed? Although I use the tbl variable inside the foreach, each iteration assigns a new DataTable to that variable, and the previous one (after the database write) is left waiting for garbage collection to release its memory. Since this is a very large amount of data, I want it collected as quickly as possible once it is finished with, so I call GC.Collect to force collection. If you want to check its benefit:

long totalMemoryBefore = System.Diagnostics.Process.GetCurrentProcess().PrivateMemorySize64;
fileParser.WriteChunkData(tbl, destinationTableName, Map());
GC.Collect();
long totalMemoryAfter = System.Diagnostics.Process.GetCurrentProcess().PrivateMemorySize64;

You can use various counters to judge the memory cleanup after calling GC.Collect.
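Putting this together, the per-iteration collection described above might look like the following sketch (the Dispose call is my addition, not from the article; the variables are the ones from the client code):

```csharp
foreach (DataTable tbl in dataTables)
{
    fileParser.WriteChunkData(tbl, destinationTableName, Map());
    tbl.Dispose(); // release the finished chunk's resources early (my addition)
    GC.Collect();  // force collection of the previous chunk's data
}
```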

Precautions When Writing Code 

When you work with this type of requirement (file parsing and dumping to a database), you need to follow some techniques that will help you manage your code easily.


  • Write thorough data validation code before processing starts.
  • Code defensively as much as possible. 
  • Never trust data that you receive from different sources.
  • Validate the data schema first, then the data itself. XML data can be schema-validated with an XSD, but a text/CSV file has no such tool, so you have to do it manually. 
  • Throw custom exceptions with detailed error messages wherever a mismatch is found, so that you can understand where and what the problem is. 
  • Log custom exceptions with detailed error data and, if possible, a stack trace including the line number where the error was found. 
  • Do not correct any data at run time by your own decision without client/domain expert confirmation. Even a single space is an issue; you do not know whether that space has a special meaning or not. 
  • Write client notification code (e.g., emailing or sending SMS) so that if any error or exceptional situation occurs, the client knows about it as early as possible. 
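As a minimal example of the schema-first validation suggested above (a sketch; the method name and message format are my own, and InvalidDataException comes from System.IO):

```csharp
// Hypothetical validation helper: check that a tab-delimited line has
// the expected number of fields before accepting it as a row.
private static void ValidateRow(string line, int lineNumber, int expectedColumns)
{
    string[] fields = line.Split('\t');
    if (fields.Length != expectedColumns)
        throw new InvalidDataException(string.Format(
            "Line {0}: expected {1} columns but found {2}.",
            lineNumber, expectedColumns, fields.Length));
}
```

A check like this could be called from AddRow so that a malformed line produces a descriptive error instead of a misaligned database row.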


Points of Interest   

In my code sample and demonstration I did not present data validation or exception handling; I did not even show any defensive programming. You need to write validation code for validating the data. Though the data comes from a .csv file, it is really a text file, and text file data is always unstructured, so you need extra code for validation. Because of the tab separation, you also need to take special care with space, tab, carriage return, and line feed characters. You may need some extra code for data cleaning before dumping to the database, and you may need decisions from the client or a domain expert regarding that.


This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


About the Author

S. M. Ahasan Habib
Bangladesh Bangladesh
I believe in and stay with a living-learning culture. I believe learning is fun. I feel coding is like poetry. I work with various technologies in my professional career (AngularJS, ASP.NET MVC/WebAPI, Scala-Play Framework, Node.js, SharePoint, Redshift).


Comments and Discussions

Question: It would be even faster if you used... (Member 11157042, 30-Jun-16)

...an IDataReader and not a DataTable; DataTable is too slow. I did implement this a couple of years ago where I had to import massive CSV files. I think there is something called ObjectDataReader that does the job, but I also think a guy at Stack Overflow wrote a good one too; I don't remember the name now. You will notice the difference.

Article Copyright 2013 by S. M. Ahasan Habib