
How to parse a large CSV file chunk by chunk and bulk insert it into a database

By S. M. Ahasan Habib, 13 Feb 2013
 

Introduction

In this article I will demonstrate how to read a large CSV file chunk by chunk (on a line basis), populate a DataTable object, and bulk insert the data into a database. I will explain which .NET Framework components are used, the challenges involved (memory pressure, performance, file I/O), the limitations, and some workarounds. My intention in writing this article is to share real-life knowledge with people who will face a similar requirement, so that they can benefit from it.

Background

The client requirement is that valid users will upload large CSV files from time to time through a web application. The files are stored in a specified server location. At a certain interval, a software agent (a Windows service, or a console application run by Windows Task Scheduler) should parse those files sequentially and dump them into the database. The data files are created based on some predefined rules. Each data file has its own dictionary, which is registered in the database once, before the file upload process starts.

Uploaded data file structure  

As mentioned previously, the file type is .csv. We all know that a .csv file has a predefined separator. Assume the separator here is a tab, i.e. the file contains tab-separated values. The first line of the file is the column definition. Each subsequent line represents a row/record, and these rows need to be dumped to the database periodically by a service component. The service component may be a Windows service, a console application run periodically by Windows Task Scheduler, and so on. If we want to see the file with example data, it looks something like the example below.
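For illustration only, with made-up column names (the real columns come from the client-defined data dictionary), a tab-separated data file could be:

FirstName	LastName	Age
John	Smith	35
Jane	Doe	29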

Using the code  

File parsing C# code example

First off, I design an interface that has two methods.

public interface IFileParser
{
    IEnumerable<DataTable> GetFileData(string sourceFileFullName);
    void WriteChunkData(DataTable table, string destinationTable, 
         IList<KeyValuePair<string, string>> mapList);
}
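One note before the implementation: the code below references two instance fields, _chunkRowLimit and _connectionString, and the client code later instantiates a concrete class named DefaultFileParser, but neither the class declaration nor the field values appear in the article. A minimal skeleton, with assumed values, could look like this:

// Minimal skeleton with assumed values; declared partial so the
// GetFileData/WriteChunkData implementations shown below complete it.
public partial class DefaultFileParser : IFileParser
{
    private readonly int _chunkRowLimit = 100000;   // rows per DataTable chunk (assumed)
    private readonly string _connectionString =
        @"Data Source=.;Initial Catalog=MyDb;Integrated Security=True";   // assumed
}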

The responsibility of the GetFileData method is to read a file chunk by chunk, populate a DataTable (an ADO.NET object) for each chunk, and hand that table to its caller. The responsibility of the WriteChunkData method is to bulk insert the DataTable data into the database. To be clear, one chunk means one populated DataTable. So the sequence is: while reading the data file, one DataTable per chunk is returned to the caller; the caller then calls WriteChunkData to dump that data to the database; after that, the next DataTable is created, populated, and returned. This process continues until the end of the file is reached. IEnumerable<T> and the yield statement solve this problem, as becomes clear after analyzing the solution code. Now let's look at the implementation of the interface methods:

IEnumerable<DataTable> IFileParser.GetFileData(string sourceFileFullName)
{            
    bool firstLineOfChunk = true;
    int chunkRowCount = 0;
    DataTable chunkDataTable = null;
    string columnData = null;
    bool firstLineOfFile = true;
    using (var sr = new StreamReader(sourceFileFullName))
    {
        string line = null;
            //Read lines from the file until the end of the file is reached.
        while ((line = sr.ReadLine()) != null)
        {
            //the first line of the file is the column list; create the
            //DataTable based on it.
            if (firstLineOfFile)
            {
                columnData = line;
                firstLineOfFile = false;
                continue;
            }
            if (firstLineOfChunk)
            {                       
                firstLineOfChunk = false;                        
                chunkDataTable = CreateEmptyDataTable(columnData);                        
            }                  
            AddRow(chunkDataTable, line);                   
            chunkRowCount++;
 
            if (chunkRowCount == _chunkRowLimit)
            {
                firstLineOfChunk = true;
                chunkRowCount = 0;
                yield return chunkDataTable;
                chunkDataTable = null;
            }
        }
    }
    //return the last chunk, which has fewer rows than the chunk size
    if (null != chunkDataTable)                           
        yield return chunkDataTable;            
}
Creating an empty DataTable:
private DataTable CreateEmptyDataTable(string firstLine)
{
    IList<string> columnList = Split(firstLine);
    var dataTable = new DataTable("tblData");
    dataTable.Columns.AddRange(columnList.Select(v => new DataColumn(v)).ToArray());
    return dataTable;
} 
Adding data rows to the DataTable:
private void AddRow(DataTable dataTable, string line)
{
    DataRow newRow = dataTable.NewRow();
    IList<string> fieldData = Split(line);
    for (int columnIndex = 0; columnIndex < dataTable.Columns.Count; columnIndex++)
    {
        newRow[columnIndex] = fieldData[columnIndex];
    }
    dataTable.Rows.Add(newRow);
} 
Splitting a tab-separated string:
private IList<string> Split(string input)
{
    //our csv file will be tab delimited
    var dataList = new List<string>();
    foreach (string column in input.Split('\t'))
    {
        dataList.Add(column);
    }
    return dataList;
} 

The method returns IEnumerable<T>, not IList<T> or ICollection<T>, because those interfaces do not support lazy (deferred) return, while IEnumerable<T> does. Inside the method I use the yield statement to lazily return each chunk of data. If you use the yield statement, the method's return type must be IEnumerable, IEnumerable<T>, IEnumerator, or IEnumerator<T>. To use yield return you must respect some preconditions; if you try to use IList<T> instead, you will get compilation errors.

The detailed preconditions for using the yield statement are described at this link.

The main benefit of the yield statement is that it supports lazily evaluated code. When data is returned from a yield statement, the current position in the code is remembered, and the remaining statements are executed the next time the iterator method is called. Another simple example:

public void ClientCode()
{
    foreach (int i in GetValues())
    {
        Console.WriteLine(i.ToString());
    }
}
public IEnumerable<int> GetValues()
{
    yield return 1;
    yield return 2;
    yield return 3;
    yield return 4;
} 

The method returns 1 and it is printed to the console, then returns 2 and it is printed, then 3, and finally 4. This is how yield supports lazily evaluated code.

A StreamReader object is used to read the data file line by line. It helps to control memory pressure, because only a single line of data is brought into memory at a time. The first line of the data file is the column list; every following line is a data row/record. A few flag variables track state, such as "first line of file" and "first line of chunk". When reading of the first line of a chunk starts, a DataTable is created and data is populated into it. When the maximum chunk size is reached, that DataTable is returned to the caller. One very important concept needs to be understood before taking advantage of lazy evaluation, so let's talk about that now. As said previously, using IEnumerable<T> with the yield statement gives us lazy evaluation, but sometimes it can cause a performance penalty.

IEnumerable<DataTable> dataTables = fileParser.GetFileData(sourceFileFullName);
Console.WriteLine("Total data tables for first time:" + dataTables.Count());
Console.WriteLine("Total data tables for second time: " + dataTables.Count());
Console.WriteLine("Total data tables for third time: " + dataTables.Count()); 

Although you call the dataTables.Count() method three times here, you actually execute FileParser.GetFileData three times as well, i.e. the file is parsed three times. This kind of scenario causes a performance penalty and needs to be handled carefully, otherwise it will lead to unfavorable results.
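If you genuinely need the count (or the chunks) more than once, materialize the sequence a single time instead of re-enumerating it. A minimal sketch (requires System.Linq):

// Enumerate, and therefore parse the file, exactly once,
// then reuse the in-memory list as many times as needed.
List<DataTable> materialized = fileParser.GetFileData(sourceFileFullName).ToList();
Console.WriteLine("Total data tables: " + materialized.Count);
Console.WriteLine("Same count again, no re-parsing: " + materialized.Count);

Of course, materializing every chunk loads the whole file into memory and defeats the point of lazy evaluation, so for large files the better option is simply to enumerate the sequence once inside a single foreach loop, as the client code later in this article does.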

Bulk insert C# code example

void IFileParser.WriteChunkData(DataTable table, string destinationTable, 
        IList<KeyValuePair<string, string>> mapList)
{
    using (var bulkCopy = new SqlBulkCopy(_connectionString, SqlBulkCopyOptions.Default))
    {               
        bulkCopy.BulkCopyTimeout = 0;//unlimited
        bulkCopy.DestinationTableName = destinationTable;
        foreach (KeyValuePair<string, string> map in mapList)
        {
            bulkCopy.ColumnMappings.Add(map.Key, map.Value);
        }
        bulkCopy.WriteToServer(table, DataRowState.Added);
    }
} 

The SqlBulkCopy object is very efficient at writing huge amounts of data to a SQL Server database from a data source. It has a property named ColumnMappings; mapping information between source and destination data columns needs to be added to this property. Without mapping information, the object assumes the destination columns are in the same sequence as the source columns. So it is always better to add mapping information, even when the source and destination columns are identical and in the same order. There are some SqlBulkCopy behaviours you should know about. These are:

  • If the destination table has a column that is not present in the source data, its default value is inserted.
  • If a mapped destination column is not found in the destination table, an InvalidOperationException is raised with the message “The given ColumnMapping does not match up with any column in the source or destination.”
  • If a mapped source column is not found in the source DataTable, an InvalidOperationException is raised with the message “The given ColumnName 'FirstName2' does not match up with any column in data source.”
  • Interestingly, the source column is case insensitive but the destination column is case sensitive. This means the destination database table column name must have exactly the same casing in the mapping information; otherwise it is treated as a different column.

Client Code  

IFileParser fileParser = new DefaultFileParser();
string destinationTableName = "Data";
string sourceDirectory = @"D:\Data";
string sourceFileFullName = GetFileFullName(sourceDirectory);
IEnumerable<DataTable> dataTables = fileParser.GetFileData(sourceFileFullName);

foreach(DataTable tbl in dataTables)            
{                
    fileParser.WriteChunkData(tbl, destinationTableName, Map());
    GC.Collect();
}
Console.WriteLine("Completed successfully!");

The GetFileData method returns a sequence of DataTables. Because it is an IEnumerable<DataTable>, its body only executes when the sequence is actually enumerated. So one mistake you might make is the following:

try
{
    IEnumerable<DataTable> dataTables = fileParser.GetFileData(sourceFileFullName);
}
catch (Exception ex)
{
    Console.WriteLine(ex.Message);
    throw;
} 

But this never throws or catches any exception, because the code of GetFileData does not execute here. So when you debug this method, keep in mind that execution will not step inside it at this point, simply because it returns IEnumerable<T> via the yield statement. Only when I access dataTables inside the foreach loop does the code execute and return the DataTables one by one. I call the GC.Collect method once per iteration. Why is it needed? Although I reuse the tbl variable inside the foreach, each iteration assigns a new DataTable to that variable, and the previous one (after the database write) becomes garbage waiting to be collected so its memory can be released. Because the amount of data is very large, I want that memory reclaimed as quickly as possible once the write finishes, so I call GC.Collect to force collection. If you want to check its benefit:

long totalMemoryBefore = System.Diagnostics.Process.GetCurrentProcess().PrivateMemorySize64;
fileParser.WriteChunkData(tbl, destinationTableName, Map());
GC.Collect();
long totalMemoryAfter = System.Diagnostics.Process.GetCurrentProcess().PrivateMemorySize64;  

You can use various counters to judge the memory cleanup after calling GC.Collect.
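For example, a rough way to observe the effect is to compare the managed heap size around the bulk write; this is just a diagnostic sketch, not production code:

// Rough diagnostic: compare the managed heap size around the bulk write.
long before = GC.GetTotalMemory(forceFullCollection: false);
fileParser.WriteChunkData(tbl, destinationTableName, Map());
GC.Collect();
GC.WaitForPendingFinalizers();
long after = GC.GetTotalMemory(forceFullCollection: false);
Console.WriteLine("Managed heap delta (bytes): " + (after - before));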

Points of Interest  

In my code sample and demonstration I did not show data validation, exception handling, or any defensive programming. When writing code to capture real-life data, always respect the principle “never trust data you receive from other sources.” So you need to write validation code for validating the data. Although the data comes from a .csv file, that is really just a text file, and text file data is unstructured, so you need extra code to validate it; because of the tab separation, you need to take special care with space, tab, carriage return, and line feed characters. You may also need to write extra code to clean the data before dumping it into the database, and you may need decisions from the client or a domain expert about that. Since this is a matter of data, if you need to write any data-cleaning logic you should inform the client or domain expert and get their decision rather than deciding by yourself; if you must decide, inform them and have your decision approved. A defensive variant of AddRow is sketched below.
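As one small illustration, a defensive variant of AddRow could verify that the field count matches the column count and trim stray whitespace before anything is dumped to the database. This is only a sketch under those assumptions (InvalidDataException lives in System.IO); the real validation and cleaning rules must come from the client or domain expert:

// Sketch of a defensive row add: validates the field count and trims whitespace.
private void AddRowSafe(DataTable dataTable, string line, int lineNumber)
{
    IList<string> fieldData = Split(line);
    if (fieldData.Count != dataTable.Columns.Count)
    {
        throw new InvalidDataException(string.Format(
            "Line {0}: expected {1} fields but found {2}.",
            lineNumber, dataTable.Columns.Count, fieldData.Count));
    }
    DataRow newRow = dataTable.NewRow();
    for (int columnIndex = 0; columnIndex < dataTable.Columns.Count; columnIndex++)
    {
        // Trim spaces and stray carriage-return/line-feed characters.
        newRow[columnIndex] = fieldData[columnIndex].Trim(' ', '\t', '\r', '\n');
    }
    dataTable.Rows.Add(newRow);
}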

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

About the Author

S. M. Ahasan Habib
Software Developer (Senior) The Jaxara IT Ltd.
Bangladesh Bangladesh
Member
I have 10+ years of industry experience, mostly with Microsoft technologies. My areas of expertise are software architecture, design, applying various design patterns to real-life implementations, unit testing, performance tuning, security, JavaScript, Ajax, ASP.NET WebForms/MVC, SQL Server, NHibernate, etc.
