I have to merge two 50 GB CSV files using .NET. Please help me with a quick process that takes less than 5 minutes.


What I have tried:

static void Main(string[] args)
{
    string sourceFile = @"D:\SingleBlockDataDump_June.csv";
    string destinationFile = @"D:\SingleBlockDataDump_July.csv"; // removed the stray "\" before the drive letter
    string logFilePath = @"D:\log.txt";

    DateTime startTime = DateTime.Now;
    string logText = "Started to merge: " + startTime.TimeOfDay + Environment.NewLine;

    // Note: ReadAllLines loads the entire file into memory at once,
    // which will not work well for a 50 GB file.
    string[] lines = File.ReadAllLines(sourceFile);

    using (StreamWriter fileDest = new StreamWriter(destinationFile, true))
    {
        foreach (string line in lines)
        {
            fileDest.WriteLine(line);
        }
    }

    DateTime endTime = DateTime.Now;
    logText += "Finished merging: " + endTime.TimeOfDay + Environment.NewLine;
    logText += "Elapsed Time: " + (endTime - startTime);

    using (StreamWriter writetext = new StreamWriter(logFilePath))
    {
        writetext.WriteLine(logText);
    }

    Console.ReadLine();
}
Posted
Updated 29-Nov-17 3:03am
Comments
F-ES Sitecore 29-Nov-17 8:22am    
Rather than doing ReadAllLines, try processing the file line by line. I can't guarantee it's going to have a whole lot of impact on the performance, but it's worth trying.

To preempt your next question, google "read file line by line c#".

CSV files are text files with a header line describing the columns, so if the file structures are the same, just append the second file to the first file while skipping the second file's header line (the first line).
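That approach can be sketched like this (a minimal sketch; the `CsvMerger` class and method names are illustrative, not from the original post — the key point is that File.ReadLines streams lazily, unlike ReadAllLines, so memory use stays flat even for multi-gigabyte inputs):

```csharp
using System.IO;
using System.Linq;

static class CsvMerger
{
    // Merge two CSVs that share the same header: copy the first file
    // verbatim, then append the second file minus its header line.
    public static void Merge(string firstPath, string secondPath, string outputPath)
    {
        File.Copy(firstPath, outputPath, overwrite: true);

        using (StreamWriter dest = File.AppendText(outputPath))
        {
            // ReadLines enumerates lazily; Skip(1) drops the header.
            foreach (string line in File.ReadLines(secondPath).Skip(1))
            {
                dest.WriteLine(line);
            }
        }
    }
}
```

This still pays the per-line overhead of decoding and re-encoding text, so it is simple rather than fast, but it will not exhaust memory the way ReadAllLines does.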
 
 
Don't read line by line (or even use ReadAllLines) if speed matters.

Allocate a large byte array to be used for copying. The size should be well below the available free memory to avoid swapping to disk.

For each file, get its size and, except for the first one, read the first line to skip it; that line's length gives the offset to the second line. Subtract the offset from the size.

Now use a loop for block-wise processing:

  • Determine the block size (min. of buffer size and remaining size)
  • Read into buffer
  • Write to output file
  • Decrement size by block size

Then there is only a single memory allocation, and reading and writing the raw file content avoids the per-line end-of-line checks. The required time then depends almost entirely on the speed of your storage device (HDD or SSD).
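The steps above can be sketched as follows (a sketch under stated assumptions: `BlockCopier`, `AppendBlocks`, and the 4 MB buffer size are illustrative, and the header skip assumes lines end with '\n', which also covers "\r\n"):

```csharp
using System;
using System.IO;

static class BlockCopier
{
    // One large reusable buffer, allocated a single time; keep its size
    // well below the available free memory (4 MB here for illustration).
    static readonly byte[] CopyBuffer = new byte[4 * 1024 * 1024];

    // Append sourcePath to dest using raw block reads/writes,
    // optionally skipping the first line (the CSV header).
    public static void AppendBlocks(string sourcePath, FileStream dest, bool skipHeader)
    {
        using (FileStream src = File.OpenRead(sourcePath))
        {
            if (skipHeader)
            {
                // Read byte-wise until the first '\n'; src.Position is
                // then the offset of the second line.
                int b;
                while ((b = src.ReadByte()) != -1 && b != '\n') { }
            }

            long remaining = src.Length - src.Position;
            while (remaining > 0)
            {
                // Block size = min(buffer size, remaining size)
                int blockSize = (int)Math.Min(CopyBuffer.Length, remaining);
                int read = src.Read(CopyBuffer, 0, blockSize);
                if (read <= 0) break; // defensive: unexpected end of file

                dest.Write(CopyBuffer, 0, read);
                remaining -= read;
            }
        }
    }
}
```

A merge then calls AppendBlocks once per input file against a single output FileStream, passing skipHeader: true for every file after the first.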
 
 
Comments
sankarisiva 30-Nov-17 8:41am    
Thanks. If you have any site reference for this implementation, please share the link.
Jochen Arndt 30-Nov-17 9:07am    
I don't have any specific site references because these tasks are all common and not complicated.

It is just getting the file size and doing binary reads and writes of files. The only tricky part might be mixing binary and text reads to get the length of the header line.

That might be solved by opening the file first in text mode to get the length, closing it, and opening it again in binary mode, using Seek() to skip the first line.

Another (probably faster) option is reading character by character from the file, already opened in binary mode, until a newline character occurs.

If you really need example code, you might search for something like "c# copy binary file", because such an example would contain most of the required code apart from the skipping of the first line.

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)