I have to merge two 50 GB CSV files using .NET. Please help me with a quick process that takes less than 5 minutes.


What I have tried:

static void Main(string[] args)
{
    string sourceFile = @"D:\SingleBlockDataDump_June.csv";
    string destinationFile = @"D:\SingleBlockDataDump_July.csv"; // removed the stray "\" before the drive letter
    string logFilePath = @"D:\log.txt";

    DateTime startTime = DateTime.Now;
    string logText = "Started to merge: " + startTime.TimeOfDay + Environment.NewLine;

    // Note: ReadAllLines loads the entire file into memory at once,
    // which will not work well for a 50 GB file.
    string[] lines = File.ReadAllLines(sourceFile);

    using (StreamWriter fileDest = new StreamWriter(destinationFile, true))
    {
        foreach (string line in lines)
        {
            fileDest.WriteLine(line);
        }
    }

    DateTime endTime = DateTime.Now;
    logText += "Finished merging: " + endTime.TimeOfDay + Environment.NewLine;
    logText += "Elapsed Time: " + (endTime - startTime);

    using (StreamWriter writetext = new StreamWriter(logFilePath))
    {
        writetext.WriteLine(logText);
    }

    Console.ReadLine();
}
Posted
Updated 29-Nov-17 3:03am
Comments
F-ES Sitecore 29-Nov-17 8:22am    
Rather than doing ReadAllLines, try processing the file line by line. I can't guarantee it's going to have a whole lot of impact on the performance, but it's worth trying.

To preempt your next question, google "read file line by line c#".

CSV files are text files with a header line describing the columns, so if the file structures are the same, just append the second file to the first file while skipping the second file's header line (the first line).
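That approach can be sketched like this (a minimal sketch; the `CsvMerger` class and method names are illustrative, not from the original post — the key point is that File.ReadLines streams lazily, unlike ReadAllLines, so memory use stays flat even for multi-gigabyte inputs):

```csharp
using System.IO;
using System.Linq;

static class CsvMerger
{
    // Merge two CSVs that share the same header: copy the first file
    // verbatim, then append the second file minus its header line.
    public static void Merge(string firstPath, string secondPath, string outputPath)
    {
        File.Copy(firstPath, outputPath, overwrite: true);

        using (StreamWriter dest = File.AppendText(outputPath))
        {
            // ReadLines enumerates lazily; Skip(1) drops the header.
            foreach (string line in File.ReadLines(secondPath).Skip(1))
            {
                dest.WriteLine(line);
            }
        }
    }
}
```

This still pays the per-line overhead of decoding and re-encoding text, so it is simple rather than fast, but it will not exhaust memory the way ReadAllLines does.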
 
 
Don't read line by line (or even use ReadAllLines) if speed matters.

Allocate a large byte array to be used for copying. The size should be well below the available free memory to avoid swapping to disk.

For each file, get its size and, except for the first one, read the first line to skip it; that line's length gives the offset to the second line. Subtract the offset from the size.

Now use a loop for block-wise processing:

  • Determine the block size (min. of buffer size and remaining size)
  • Read into buffer
  • Write to output file
  • Decrement size by block size

Then there is only a single memory allocation, and reading and writing the raw file content avoids the per-line end-of-line checks. The required time then depends almost entirely on the speed of your storage device (HDD or SSD).
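The steps above can be sketched as follows (a sketch under stated assumptions: `BlockCopier`, `AppendBlocks`, and the 4 MB buffer size are illustrative, and the header skip assumes lines end with '\n', which also covers "\r\n"):

```csharp
using System;
using System.IO;

static class BlockCopier
{
    // One large reusable buffer, allocated a single time; keep its size
    // well below the available free memory (4 MB here for illustration).
    static readonly byte[] CopyBuffer = new byte[4 * 1024 * 1024];

    // Append sourcePath to dest using raw block reads/writes,
    // optionally skipping the first line (the CSV header).
    public static void AppendBlocks(string sourcePath, FileStream dest, bool skipHeader)
    {
        using (FileStream src = File.OpenRead(sourcePath))
        {
            if (skipHeader)
            {
                // Read byte-wise until the first '\n'; src.Position is
                // then the offset of the second line.
                int b;
                while ((b = src.ReadByte()) != -1 && b != '\n') { }
            }

            long remaining = src.Length - src.Position;
            while (remaining > 0)
            {
                // Block size = min(buffer size, remaining size)
                int blockSize = (int)Math.Min(CopyBuffer.Length, remaining);
                int read = src.Read(CopyBuffer, 0, blockSize);
                if (read <= 0) break; // defensive: unexpected end of file

                dest.Write(CopyBuffer, 0, read);
                remaining -= read;
            }
        }
    }
}
```

A merge then calls AppendBlocks once per input file against a single output FileStream, passing skipHeader: true for every file after the first.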
 
 
Comments
sankarisiva 30-Nov-17 8:41am    
Thanks. If you have any site reference for this implementation, please share the link.
Jochen Arndt 30-Nov-17 9:07am    
I don't have any specific site references because these tasks are all common and not complicated.

It is just getting the file size and doing binary reads and writes of files. The only tricky part might be mixing binary and text reads to get the length of the header line.

That might be solved by opening the file first in text mode to get the length, closing it, and opening it again in binary mode, using Seek() to skip the first line.

Another (probably faster) option is reading character by character from the file, already opened in binary mode, until a newline character occurs.

If you really need example code, you might search for something like "c# copy binary file", because such an example would contain most of the required code apart from the skipping of the first line.

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)