It is the DataTable.AcceptChanges call that kills you.
I made a quick and dirty code example: with AcceptChanges inside the loop, it takes around 20 minutes to read 100 000 rows. With it outside the loop, where it belongs, it takes around 1 second.
(I also tried a test with 1 000 000 rows: it took 2 minutes with AcceptChanges outside the loop. With AcceptChanges inside the loop, the program had been running for 2 hours and still wasn't done.)
I have no idea how many columns you have, but given the time difference in my example, I don't think that matters much.
I didn't even try to optimize this code, either for speed or for memory.
// Requires: using System; using System.Data;
// using System.Diagnostics; using System.IO;
static void Main(string[] args)
{
    // Generate a 100 000-row CSV file to test with.
    using (TextWriter tw = new StreamWriter(@"C:\Temp\LargeCsvFile.csv"))
    {
        for (int i = 0; i < 100000; i++)
        {
            tw.WriteLine("Harry,Hacker,{0},2,3,4,5.0,{1}", i, DateTime.Now);
        }
    }

    // Build the table: an auto-increment key, eight data columns,
    // and one expression column.
    DataTable dtRecords = new DataTable("Records");
    DataColumn dcIndex = dtRecords.Columns.Add("Index", typeof(int));
    dcIndex.AutoIncrement = true;
    dcIndex.AutoIncrementSeed = 1;
    dcIndex.AutoIncrementStep = 1;
    DataColumn dcFirstName = dtRecords.Columns.Add("FirstName", typeof(string));
    DataColumn dcLastName = dtRecords.Columns.Add("LastName", typeof(string));
    DataColumn dcCol1 = dtRecords.Columns.Add("Col1", typeof(int));
    DataColumn dcCol2 = dtRecords.Columns.Add("Col2", typeof(int));
    DataColumn dcCol3 = dtRecords.Columns.Add("Col3", typeof(int));
    DataColumn dcCol4 = dtRecords.Columns.Add("Col4", typeof(int));
    DataColumn dcCol5 = dtRecords.Columns.Add("Col5", typeof(double));
    DataColumn dcDate = dtRecords.Columns.Add("Date", typeof(DateTime));
    DataColumn dcTime = dtRecords.Columns.Add("Time", typeof(DateTime));
    dcTime.Expression = "Date";   // computed from Date, never filled directly

    Stopwatch sw = new Stopwatch();
    sw.Start();
    DataRow dr;
    DataColumn dc;
    string[] values;
    string previousLine = "";
    foreach (string line in File.ReadAllLines(@"C:\Temp\LargeCsvFile.csv"))
    {
        // Skip consecutive duplicate lines before allocating a row.
        if (line.Equals(previousLine))
            continue;
        previousLine = line;

        dr = dtRecords.NewRow();
        values = line.Split(',');
        // Column 0 auto-increments and the last column is an expression,
        // so fill only the columns in between from the CSV fields.
        for (int i = 1; i < dtRecords.Columns.Count - 1; i++)
        {
            dc = dtRecords.Columns[i];
            dr[dc] = Convert.ChangeType(values[i - 1], dc.DataType);
        }
        dtRecords.Rows.Add(dr);
    }
    dtRecords.AcceptChanges();   // once, outside the loop, where it belongs
    sw.Stop();

    long time = sw.ElapsedMilliseconds;
    Console.WriteLine("Time {0}", time / 1000);
}
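For comparison, the slow variant differs by a single line: the same call moved inside the loop. Since DataTable.AcceptChanges commits every row in the table, calling it once per insert makes each iteration walk all rows added so far, which is why the cost grows quadratically. A sketch of just the end of the loop body, everything else unchanged:

    dtRecords.Rows.Add(dr);
    dtRecords.AcceptChanges();   // misplaced: re-commits every row on every iteration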
[UPDATE]
Just out of interest, I implemented a time print every 10 000 rows. This is of course with AcceptChanges inside the loop; here is the result:
Row No - hh:mm:ss
10000 - 00:00:04
20000 - 00:00:21
30000 - 00:00:56
40000 - 00:01:48
50000 - 00:03:02
60000 - 00:04:34
70000 - 00:06:28
80000 - 00:08:37
90000 - 00:11:08
100000 - 00:13:53
110000 - 00:16:59
120000 - 00:20:32
130000 - 00:24:32
140000 - 00:28:44
150000 - 00:33:12
160000 - 00:37:57
170000 - 00:43:29
180000 - 00:49:27
190000 - 00:55:44
200000 - 01:02:12
210000 - 01:09:10
220000 - 01:16:32
230000 - 01:24:08
240000 - 01:31:57
250000 - 01:40:26
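The printout came from instrumentation along these lines inside the read loop; the exact code wasn't shown above, so treat this as a sketch:

    // Print elapsed time every 10 000 rows.
    if (dtRecords.Rows.Count % 10000 == 0)
    {
        Console.WriteLine("{0} - {1:hh\\:mm\\:ss}",
            dtRecords.Rows.Count, sw.Elapsed);
    }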
It is amazing how one misplaced line of code can cause such a huge time penalty. Notice the growth pattern, too: doubling the row count from 100000 to 200000 takes the elapsed time from 00:13:53 to 01:02:12, more than four times as long, which is the signature of quadratic cost.
[UPDATE2]
It took 10 hours to process 600 000 rows, then I gave up.