Hi,

In my Windows Forms application I have a task to read 100,000 (1 lakh) records from a plain text file, load the data into a DataTable, add some new columns, and do some manipulation; finally, when the Export button is clicked, it should generate an Excel file.
Reading one whole text file (100,000 records) into the DataTable takes 15 to 20 minutes, so if there are 6 text
files in a particular folder, exporting them sheet by sheet takes almost 2 hours.
Could anyone please explain how to speed this up so the data is exported to an Excel file in less than 10 minutes?
The particular block of code below takes almost 15 minutes to execute.
C#
// Read the values from each line of the text file by splitting on commas,
// and add them to the DataTable.
for (int i = 1; i < lines.Count(); i++)
{
    // Skip duplicate entries (added 26-May-16)
    if (i < lines.Count() - 1 && lines[i] == lines[i + 1]) continue;

    dr = dt.NewRow();
    values = lines[i].Split(',');
    for (int j = 0; j < values.Length && j < columns.Count(); j++)
        dr[j] = values[j];

    dt.Rows.Add(dr);

    // 'dr' is still the row that was just added, so use it directly
    // instead of indexing dt.Rows[dt.Rows.Count - 1] each time.
    DateTime = Convert.ToDateTime(dr[Constants.Date].ToString());
    tempDate = DateTime.ToShortDateString();
    tempTime = DateTime.ToLongTimeString();
    FunName = dr[Constants.FuncName].ToString();
    dr[Constants.Index] = dt.Rows.Count;
    dr[Constants.Time] = tempTime;
    dr[Constants.Date] = tempDate;

    // Calling AcceptChanges once per row is the bottleneck.
    dt.AcceptChanges();
}


What I have tried:

I am working in C#; the code reads the text file and then manipulates the data in a DataTable.
Posted
Updated 6-Jun-16 6:26am
v2

It is the DataTable.AcceptChanges call that kills you.

I made this quick-and-dirty code example, and if I put AcceptChanges inside the loop, it takes around 20 minutes to read 100,000 rows.
When it is placed outside the loop, where it belongs, it takes around 1 second.
(I tried a test with 1,000,000 rows and it took 2 minutes with AcceptChanges outside the loop. With AcceptChanges inside the loop, the program had been running for 2 hours and was still not done.)

I have no idea how many columns you have, but given the time difference in my example, I don't think that matters much.

I didn't even try to optimize this code, neither for speed nor for memory.
C#
// Requires: using System; using System.Data; using System.Diagnostics; using System.IO;
static void Main(string[] args)
{
    // Create a test CSV file with 100,000 rows
    using (TextWriter tw = new StreamWriter(@"C:\Temp\LargeCsvFile.csv"))
    {
        for (int i = 0; i < 100000; i++)
        {
            tw.WriteLine("Harry,Hacker,{0},2,3,4,5.0,{1}", i, DateTime.Now);
        }
    }

    DataTable dtRecords = new DataTable("Records");

    // Use an auto-increment column instead of assigning the row number yourself
    DataColumn dcIndex = dtRecords.Columns.Add("Index", typeof(int));
    dcIndex.AutoIncrement = true;
    dcIndex.AutoIncrementSeed = 1;
    dcIndex.AutoIncrementStep = 1;

    DataColumn dcFirstName = dtRecords.Columns.Add("FirstName", typeof(string));
    DataColumn dcLastName = dtRecords.Columns.Add("LastName", typeof(string));
    DataColumn dcCol1 = dtRecords.Columns.Add("Col1", typeof(int));
    DataColumn dcCol2 = dtRecords.Columns.Add("Col2", typeof(int));
    DataColumn dcCol3 = dtRecords.Columns.Add("Col3", typeof(int));
    DataColumn dcCol4 = dtRecords.Columns.Add("Col4", typeof(int));
    DataColumn dcCol5 = dtRecords.Columns.Add("Col5", typeof(double));
    DataColumn dcDate = dtRecords.Columns.Add("Date", typeof(DateTime));

    // Computed column - its value is derived from the Date column via an expression
    DataColumn dcTime = dtRecords.Columns.Add("Time", typeof(DateTime));
    dcTime.Expression = "Date";

    Stopwatch sw = new Stopwatch();
    sw.Start();

    DataRow dr;
    DataColumn dc;
    string[] values;
    string previousLine = "";
    foreach (string line in File.ReadAllLines(@"C:\Temp\LargeCsvFile.csv"))
    {
        // Check for duplicates before allocating a new row
        if (line.Equals(previousLine))
            continue;
        previousLine = line;

        dr = dtRecords.NewRow();
        values = line.Split(',');
        // Skip column 0 (auto-increment) and the last column (expression)
        for (int i = 1; i < dtRecords.Columns.Count - 1; i++)
        {
            dc = dtRecords.Columns[i];
            dr[dc] = Convert.ChangeType(values[i - 1], dc.DataType);
        }
        dtRecords.Rows.Add(dr);

        // This line inside the loop increases the time tremendously
        // dtRecords.AcceptChanges();
    }
    dtRecords.AcceptChanges();
    sw.Stop();
    Console.WriteLine("Time {0} s", sw.ElapsedMilliseconds / 1000);
}


[UPDATE]
Just out of interest, I added a time printout every 10,000 rows; here is the result.
This is, of course, with AcceptChanges inside the loop.
Row No   hh:mm:ss
 10000 - 00:00:04
 20000 - 00:00:21
 30000 - 00:00:56
 40000 - 00:01:48
 50000 - 00:03:02
 60000 - 00:04:34
 70000 - 00:06:28
 80000 - 00:08:37
 90000 - 00:11:08
100000 - 00:13:53
110000 - 00:16:59
120000 - 00:20:32
130000 - 00:24:32
140000 - 00:28:44
150000 - 00:33:12
160000 - 00:37:57
170000 - 00:43:29
180000 - 00:49:27
190000 - 00:55:44
200000 - 01:02:12
210000 - 01:09:10
220000 - 01:16:32
230000 - 01:24:08
240000 - 01:31:57
250000 - 01:40:26

It is amazing how one misplaced line of code can cause such a huge time penalty.

[UPDATE2]
It took 10 hours to process 600,000 rows, and then I gave up.
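If you need to shave off more time, DataTable also has a bulk-load mode: BeginLoadData suspends notifications, index maintenance and constraint checking while you load, and EndLoadData turns them back on. A minimal sketch (the table and column names here are just placeholders, not taken from the question):

```csharp
using System;
using System.Data;

// Sketch: bulk-load rows between BeginLoadData and EndLoadData.
// LoadDataRow with acceptChanges=true also removes the need for a
// separate AcceptChanges pass at the end.
DataTable dt = new DataTable("Records");
dt.Columns.Add("Name", typeof(string));
dt.Columns.Add("Value", typeof(int));

dt.BeginLoadData();
for (int i = 0; i < 100000; i++)
{
    dt.LoadDataRow(new object[] { "Row" + i, i }, true);
}
dt.EndLoadData();

Console.WriteLine(dt.Rows.Count); // 100000
```

Since there is no primary key on the table, each LoadDataRow call simply appends a new row; with a primary key it would update an existing row instead.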
 
Comments
Member 8557048 20-Jun-16 6:00am    
Thank you very much, George, for your valuable answer. It has really helped me a lot.
Earlier, reading 40 lakh (4,000,000) records from different text files took 4.5 to 5 hours.
Now, amazingly, it takes 35 minutes.

Thank you so much
George Jonsson 21-Jun-16 4:10am    
You are welcome.
You can probably cut that time further with some optimization of the code.
Try using a CSV reader instead: A Fast CSV Reader[^] is pretty good. It's always been fast enough for me...
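As an alternative sketch (this is not the linked article's API), .NET itself ships a delimited-text parser, Microsoft.VisualBasic.FileIO.TextFieldParser, which also handles quoted fields containing commas, something a plain line.Split(',') gets wrong:

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using Microsoft.VisualBasic.FileIO; // add a reference to Microsoft.VisualBasic

// Sketch: parse CSV with TextFieldParser instead of string.Split.
// Write a tiny demo file containing a quoted field with an embedded comma.
string path = Path.Combine(Path.GetTempPath(), "TextFieldParserDemo.csv");
File.WriteAllLines(path, new[] { "Name,City", "\"Hacker, Harry\",Toronto" });

var rows = new List<string[]>();
using (var parser = new TextFieldParser(path))
{
    parser.TextFieldType = FieldType.Delimited;
    parser.SetDelimiters(",");
    while (!parser.EndOfData)
    {
        rows.Add(parser.ReadFields()); // one string[] per record
    }
}

// Each record has exactly 2 fields; the embedded comma did not split the name.
Console.WriteLine(rows[1][0]); // Hacker, Harry
```

From there you would add each string[] to the DataTable instead of collecting it in a list.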
 

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


