Cinchoo ETL - Sorting Large CSV File using SqlServer
Easy way to sort large CSV files using Cinchoo ETL/SqlServer
1. Introduction
ChoETL
is an open source ETL (extract, transform and load) framework for .NET. It is a code based library for extracting data from multiple sources, transforming, and loading into your very own data warehouse in .NET environment. You can have data in your data warehouse in no time.
This article talks about sorting large CSV files using ChoCSVReader/ChoCSVWriter
components offered by ChoETL
framework. Large CSV files present a challenge when the need arises to sort. It fails with OutOfMemory
exception if you try to do with .NET application. With Microsoft Excel for sorting on large files, you will promptly receive an error the first time.
If a string
sort will work for you, then just use the Windows SORT
command. Sort the file and be done with it. It'll happily sort your large CSV file, and it's simple to use.
If you need to filter and convert the file, specifically the date / custom type fields, then you would need to write a small conversion program that converts the fields whatever you like, and rewrites the record. This is where Cinchoo ETL framework places a role to ease up the conversion of large CSV file the way you want with all possible requirements with minimal memory footprint and ultra-fast performance.
In this article, let's learn how to do that using a Cinchoo ETL framework and Microsoft SQLServer.
To learn more about Cinchoo ETL - CSVReader, please visit the CodeProject article here.
2. Requirement
This framework library is written in C# using .NET 4.5 Framework.
3. How to Use
3.1 Sample Data
Let's begin by looking into a simple example of reading the below CSV file having four columns.
Listing 1. Sample CSV Data File
CustId,Name,Balance,AddedDate
1732,Tom Perez,435.00,05/11/2002
5541,Pedro Gomez,12342.30,06/02/2000
1122,Mark Politti,0.00,01/02/2004
1924,Pablo Ramirez,3321.30,11/24/2002
The first thing to do is to install ChoETL.SqlServer
from nuget. To do this, run the following command in the Package Manager Console.
Install-Package ChoETL.SqlServer
Now, add ChoETL
namespace to the program.
using ChoETL;
3.2 Quick Sort- Zero Configuration Approach
This approach shows you how to do the sorting of large CSV file without any mapping class.
Listing 2. Sorting CSV Files without Mapping Class
public static void QuickSort()
{
using (var dr = new ChoCSVReader(@"Test.csv").WithFirstLineHeader().
WithField("CustId", fieldType: typeof(int)).
WithField("Name", fieldType: typeof(string)).
WithField("Balance", fieldType: typeof(double)).
WithField("AddedDate", fieldType: typeof(DateTime))
)
{
using (var dw = new ChoCSVWriter(@"TestOut.csv"))
dw.Write(dr.AsEnumerable().StageOnSqlServer("ORDER BY AddedDate"));
}
}
In the above, open the input CSV file for reading using ChoCSVReader
object. Then open ChoCSVWriter
for writing sorted CSV file. Use the StateOnSqlServer()
method on reader
object to load the data in the Microsoft SqlServer database for sorting operation. Finally, pass the SQL sort expression to the call on any CSV column to sort against it. This will produce the sorted CSV file.
Likewise, you can perform any type of projection, filtering or conversion on the staged data with minimal memory footprint.
3.3 Define Mapping Class
This approach talks about sorting of large CSV file using Mapping (POCO) Class. More type safe approach than the previous step. First, create a class which has properties with the same name of the column headings found in the CSV file. Below, you will find an example of a class which does this.
Listing 3. Mapping Class
public class Customer
{
public int CustId { get; set; }
public string Name { get; set; }
public decimal Balance { get; set; }
public DateTime AddedDate { get; set; }
}
3.3 Sort with ChoCSVReader/ChoCSVWriter
Finally, create an instance of ChoCSVReader
class to read the input file. Create an instance of ChoCSVWriter
for writing sorted CSV file as below.
Listing 4. Sorting CSV File with Mapping Class
public static void POCOSort()
{
using (var dr = new ChoCSVReader<Customer>(@"Test.csv").WithFirstLineHeader())
{
using (var dw = new ChoCSVWriter<Customer>(Console.Out))
dw.Write(dr.AsEnumerable().StageOnSqlServer().OrderBy(x => x.AddedDate));
}
}
In the above, the code uses the StateOnSqlServer()
method on reader
object to load the CSV data in the Microsoft SqlServer database for sorting operation. Finally, apply sorting expression on any CSV column to sort against it. This will produce the sorted CSV file.
The above approach inserts the records using INSERT
SQL to database, may have the performance bottleneck to this process when staging large data file in the database for this operation.
For that same reason, the next approach eliminates it by using the Bulk copy (bcp) operation to stage the data in the database for sorting operation. Sample shows how to use it.
Listing 4. Sorting CSV File with Bcp
public static void POCOSort()
{
using (var dr = new ChoCSVReader<Customer>(@"Test.csv").WithFirstLineHeader())
{
using (var dw = new ChoCSVWriter<Customer>(Console.Out))
dw.Write(dr.AsEnumerable().StageOnSqlServerUsingBcp().OrderBy(x => x.AddedDate));
}
}
Likewise, you can perform any type of projection, filtering or conversion on the staged data with minimal memory footprint.
For more information about Cinchoo ETL, please visit the below CodeProject article:
History
- 15th May, 2017: Initial version