Introduction
In this article I will demonstrate how to read a large CSV file chunk by chunk (on a line basis), populate a DataTable object, and bulk insert the data into a database. I will explain which .NET Framework components are used, the challenges involved (memory pressure, performance, file I/O), their limitations, and some workarounds. My intention in writing this article is to share my real-life experience with people who will face a similar requirement in the near future, in the hope that it benefits them.
Background
The client requirement is that valid users will upload large CSV files from a web application from time to time. The files will be stored at a specified server location. At a certain interval, a software agent (perhaps a Windows service, or a console application run by Windows Task Scheduler) should parse those files sequentially and dump their contents into the database.
The data files are created based on predefined rules. Each data file has its own dictionary, which is registered in the database once, before the file upload process starts.
Uploaded data file structure
As mentioned previously, the file type is .csv. We all know that a .csv file has a predefined separator. Assume that the separator is a tab, i.e. the file contains tab-separated values. The first line of the file is its column definition; each subsequent line represents a row/record, which needs to be dumped into the database periodically by a service component. The service component may be a Windows service, a console application run periodically by Windows Task Scheduler, etc.
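With example data, the file might look like the following (the column names and values here are hypothetical, for illustration only):

Id	FirstName	LastName	Age
1	John	Smith	35
2	Jane	Doe	28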

Using the code
File parsing C# code example
First off, I design an interface with two methods.
public interface IFileParser
{
    IEnumerable<DataTable> GetFileData(string sourceFileFullName);
    void WriteChunkData(DataTable table, string destinationTable,
        IList<KeyValuePair<string, string>> mapList);
}
The responsibility of the GetFileData method is to read a file chunk by chunk, populate a DataTable (an ADO.NET object), and return that table to its caller. The responsibility of the WriteChunkData method is to bulk insert the DataTable data into the database. To be clear, one chunk of data means one populated DataTable. So the sequential process is: while reading the data file, one DataTable per chunk is returned to the caller; the caller then calls WriteChunkData to dump that data to the database, after which the next DataTable is created, populated, and returned. This process continues until the end of the file is reached. IEnumerable<T> and yield solve this problem, as the solution code will make clear. Now look at the implementation of the interface methods:
IEnumerable<DataTable> IFileParser.GetFileData(string sourceFileFullName)
{
    bool firstLineOfChunk = true;
    int chunkRowCount = 0;
    DataTable chunkDataTable = null;
    string columnData = null;
    bool firstLineOfFile = true;
    using (var sr = new StreamReader(sourceFileFullName))
    {
        string line = null;
        while ((line = sr.ReadLine()) != null)
        {
            // The first line of the file holds the column definitions.
            if (firstLineOfFile)
            {
                columnData = line;
                firstLineOfFile = false;
                continue;
            }
            // Start a new DataTable at the beginning of each chunk.
            if (firstLineOfChunk)
            {
                firstLineOfChunk = false;
                chunkDataTable = CreateEmptyDataTable(columnData);
            }
            AddRow(chunkDataTable, line);
            chunkRowCount++;
            // When the chunk is full, hand it to the caller and reset.
            if (chunkRowCount == _chunkRowLimit)
            {
                firstLineOfChunk = true;
                chunkRowCount = 0;
                yield return chunkDataTable;
                chunkDataTable = null;
            }
        }
    }
    // Return the final, partially filled chunk (if any).
    if (null != chunkDataTable)
        yield return chunkDataTable;
}
Creating the empty DataTable:
private DataTable CreateEmptyDataTable(string firstLine)
{
    IList<string> columnList = Split(firstLine);
    var dataTable = new DataTable("tblData");
    dataTable.Columns.AddRange(columnList.Select(v => new DataColumn(v)).ToArray());
    return dataTable;
}

Adding data rows to the DataTable:
private void AddRow(DataTable dataTable, string line)
{
    DataRow newRow = dataTable.NewRow();
    IList<string> fieldData = Split(line);
    for (int columnIndex = 0; columnIndex < dataTable.Columns.Count; columnIndex++)
    {
        newRow[columnIndex] = fieldData[columnIndex];
    }
    dataTable.Rows.Add(newRow);
}

Splitting the tab-separated string:
private IList<string> Split(string input)
{
    var dataList = new List<string>();
    foreach (string column in input.Split('\t'))
    {
        dataList.Add(column);
    }
    return dataList;
}

The GetFileData method returns IEnumerable<T>, not IList<T> or ICollection<T>, because the latter two do not support lazy return while IEnumerable<T> does. Inside the method I use the yield statement to return chunk data lazily. If you use a yield statement, the method's return type must be IEnumerable, IEnumerable<T>, IEnumerator, or IEnumerator<T>; to use yield return you must respect these preconditions. If you try to use IList<T> instead, you will get a compilation error.

The detailed preconditions for using the yield statement are described at this link.
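For example, a method like the following would not compile, because IList<DataTable> is not one of the iterator return types (the method name is hypothetical, for illustration only):

// Will NOT compile: a method containing yield return must return
// IEnumerable, IEnumerable<T>, IEnumerator, or IEnumerator<T>.
IList<DataTable> GetFileDataAsList(string sourceFileFullName)
{
    yield return new DataTable();
}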
The main benefit of the yield statement is that it supports lazily evaluated code. When data is returned from a yield statement, the current position in the code is remembered, and the remaining statements are executed the next time the iterator method is called. Here is another simple example:
public void ClientCode()
{
    foreach (int i in GetValues())
    {
        Console.WriteLine(i.ToString());
    }
}

public IEnumerable<int> GetValues()
{
    yield return 1;
    yield return 2;
    yield return 3;
    yield return 4;
}
The method returns 1 and the caller prints it to the console, then returns 2 and prints it, then 3, and finally 4. That is how yield supports lazily evaluated code.
A StreamReader object is used to read the data file line by line. It helps control memory pressure because it brings only a single line of data into memory at a time.
The first line of the data file is the column list; every following line is a data row/record. A few flag variables are used to track state, such as "first line of file" and "first line of chunk". When the first line of a chunk is read, a DataTable is created and population starts; when the maximum chunk size is reached, that DataTable is returned to the caller. One very important concept needs to be understood properly before taking advantage of lazy evaluation, and I want to discuss it now.
As said previously, using IEnumerable<T> with the yield statement gives us lazy evaluation. Sometimes, however, it can cause a performance penalty:
IEnumerable<DataTable> dataTables = fileParser.GetFileData(sourceFileFullName);
Console.WriteLine("Total data tables for first time:" + dataTables.Count());
Console.WriteLine("Total data tables for second time: " + dataTables.Count());
Console.WriteLine("Total data tables for third time: " + dataTables.Count());
Here dataTables.Count() is called three times, which means the FileParser.GetFileData method is actually executed three times, and each execution re-reads the whole file. Scenarios of that type cause a performance penalty and need to be handled carefully, otherwise they will bring unfavorable circumstances for us; see the sketch below for one way to avoid repeated enumeration.
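One simple way around it, sketched under the assumption that you only need the count itself more than once, is to enumerate a single time and cache the result:

// Sketch: call Count() once (the file is read once here) and reuse the value,
// instead of re-enumerating the sequence for every Console.WriteLine call.
IEnumerable<DataTable> dataTables = fileParser.GetFileData(sourceFileFullName);
int tableCount = dataTables.Count();
Console.WriteLine("Total data tables for first time: " + tableCount);
Console.WriteLine("Total data tables for second time: " + tableCount);
Console.WriteLine("Total data tables for third time: " + tableCount);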
Bulk insert C# code example
void IFileParser.WriteChunkData(DataTable table, string destinationTable,
    IList<KeyValuePair<string, string>> mapList)
{
    using (var bulkCopy = new SqlBulkCopy(_connectionString, SqlBulkCopyOptions.Default))
    {
        bulkCopy.BulkCopyTimeout = 0;
        bulkCopy.DestinationTableName = destinationTable;
        foreach (KeyValuePair<string, string> map in mapList)
        {
            bulkCopy.ColumnMappings.Add(map.Key, map.Value);
        }
        bulkCopy.WriteToServer(table, DataRowState.Added);
    }
}
The SqlBulkCopy object is very efficient at writing huge amounts of data from a data source to a SQL Server database. It has a property named ColumnMappings; the mapping between source and destination columns needs to be added to this property. Without mapping information, the object assumes the destination columns are in the same sequence as the source columns. So it is always better to add mapping information, even when the source and destination columns are the same and in the same order. There are some SqlBulkCopy behaviours you should know about.
These are:
- If the destination table has a column which is not mentioned in the source data, its default value will be inserted.
- If a mapped destination column is not found in the destination table, an InvalidOperationException is raised with the message "The given ColumnMapping does not match up with any column in the source or destination."
- If a mapped source column is not found in the source DataTable, an InvalidOperationException is raised with the message "The given ColumnName 'FirstName2' does not match up with any column in data source."
- Interestingly, source column names are case insensitive but destination column names are case sensitive. This means the destination database table column name must appear with the same casing in the mapping information; otherwise it is treated as a different column.
Client Code
IFileParser fileParser = new DefaultFileParser();
string destinationTableName = "Data";
string sourceDirectory = @"D:\Data";
string sourceFileFullName = GetFileFullName(sourceDirectory);
IEnumerable<DataTable> dataTables = fileParser.GetFileData(sourceFileFullName);
foreach (DataTable tbl in dataTables)
{
    fileParser.WriteChunkData(tbl, destinationTableName, Map());
    GC.Collect();
}
Console.WriteLine("Completed successfully!");
The GetFileData method returns a sequence of DataTables. Because it is an IEnumerable<DataTable>, its body only executes when the sequence is actually enumerated. So you may make a mistake like the following:
try
{
    IEnumerable<DataTable> dataTables = fileParser.GetFileData(sourceFileFullName);
}
catch (Exception ex)
{
    Console.WriteLine(ex.Message);
    throw;
}

This will never throw or catch any exception, because the GetFileData method's code does not execute here. So when you debug, keep in mind that the debugger will not step into that method at this point, simply because it returns IEnumerable<T> via the yield statement. Only when the data tables are accessed inside the foreach loop does the code execute and return one DataTable at a time. I call the GC.Collect method once per iteration. Why is it needed? Well, although I reuse the tbl variable inside the foreach, each iteration assigns a new DataTable to it; after the database write, the previous table is no longer referenced and its data waits for garbage collection to release the memory. Because the amount of data is very large, I want that memory collected as quickly as possible once a chunk is finished, so I call GC.Collect to force collection. If you want to check its benefit:
long totalMemoryBefore = System.Diagnostics.Process.GetCurrentProcess().PrivateMemorySize64;
fileParser.WriteChunkData(tbl, destinationTableName, Map());
GC.Collect();
long totalMemoryAfter = System.Diagnostics.Process.GetCurrentProcess().PrivateMemorySize64;
You can use various counters to judge the memory cleanup after calling GC.Collect.
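For example, alongside PrivateMemorySize64 you could take a rough reading of the managed heap with GC.GetTotalMemory (a sketch, not a precise measurement):

// Rough managed-heap check around one chunk write; passing true to
// GC.GetTotalMemory forces a collection before the value is returned.
long managedBefore = GC.GetTotalMemory(false);
fileParser.WriteChunkData(tbl, destinationTableName, Map());
long managedAfter = GC.GetTotalMemory(true);
Console.WriteLine("Managed bytes before: " + managedBefore);
Console.WriteLine("Managed bytes after: " + managedAfter);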
Points of Interest
In my code sample and demonstration I did not present data validation or exception handling, nor any defensive programming.
When writing code to capture real-life data, always respect the principle "Never trust data that you receive from different sources." So you need to write validation code to validate the data. Although the data comes from a .csv file, it is really just a text file, and text file data is always unstructured, so you need extra code to validate it; because of the tab separation, you need to take special care with space, tab, carriage return, and line feed characters.
You may also need to write some extra code for data cleaning before dumping the data into the database, and you may need a decision from the client or a domain expert about that. Since it is a matter of data, if you need to write any data cleaning logic you should inform the client/domain expert and get their decision; never decide by yourself, and if you must, inform them and have your decision approved. A sketch of such cleaning follows below.
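A minimal sketch of that kind of defensive handling, assuming the rule is simply to trim whitespace from each field and skip lines whose field count does not match the header (the actual rules must come from the client/domain expert), could be a variant of AddRow:

// Hypothetical defensive variant of AddRow: trims each field and skips
// rows whose field count does not match the column count.
private void AddRowSafe(DataTable dataTable, string line)
{
    string[] fieldData = line.Split('\t');
    if (fieldData.Length != dataTable.Columns.Count)
        return; // or log the bad line for later review

    DataRow newRow = dataTable.NewRow();
    for (int columnIndex = 0; columnIndex < dataTable.Columns.Count; columnIndex++)
    {
        newRow[columnIndex] = fieldData[columnIndex].Trim();
    }
    dataTable.Rows.Add(newRow);
}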
I have 10+ years of industry experience, mostly with Microsoft technologies. My areas of expertise are software architecture and design, applying various design patterns to real-life implementations, unit testing, performance tuning, security, JavaScript, Ajax, ASP.NET WebForms/MVC, SQL Server, NHibernate, etc.