I am new to C#.

I need to analyze a log file containing 500,000+ lines.

I need to filter lines containing a specific keyword and store those in memory for further processing.

The lines have a fixed layout, so the keyword will be at the same position in all lines.

What is the fastest method of doing this in C#?

I have done something like this with TextFieldParser in Visual Basic, but it takes a long time, and I wonder if there is a faster way.
Comments
BillWoodruff 1-Jan-15 7:19am    
Hi, those are interesting articles; unfortunately, the author used DateTime to calculate his run-time comparisons rather than a Stopwatch, so the results really need to be re-timed. However, you can learn a lot from those articles!
Tommy Jensen 1-Jan-15 7:59am    
Thanks I will read those.

And thanks for the tip about Stopwatch. I actually did some timings myself with DateTime, subtracting the start time from the end time, for another loop I need to run once the lines have been read. I assume that Stopwatch is faster/more accurate than using DateTime.
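On the Stopwatch point: Stopwatch uses the high-resolution performance counter when one is available (see Stopwatch.IsHighResolution), whereas the resolution of DateTime.Now depends on the system timer. A minimal sketch of a timing helper follows; the `Timing` class and `Measure` method are hypothetical names used only for illustration.

```csharp
using System;
using System.Diagnostics;

public static class Timing
{
    // Hypothetical helper: runs the action once and returns the elapsed wall-clock time.
    public static TimeSpan Measure(Action action)
    {
        var watch = Stopwatch.StartNew();
        action();
        watch.Stop();
        return watch.Elapsed;
    }
}
```

It could be called as `Timing.Measure(() => ProcessLines(lines))`, where `ProcessLines` stands for whatever loop is being timed. For short loops, repeating the measurement several times gives a more trustworthy number than a single run.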
DamithSL 1-Jan-15 6:51am    
can you update the question with the code which you already tried?
Tommy Jensen 1-Jan-15 7:55am    
I don't have any code for C# yet. I don't have the VB code handy.

You probably got your answers, but I might have found an even faster method. I have to admit, I had only 18 MB of test data with around 225k lines. Still, it might be worth giving it a try. I made a small test comparing ReadLines and my MemoryMappedFile-based approach.

using System;
using System.Collections.Generic;
using System.IO;
using System.IO.MemoryMappedFiles;
using System.Text;

namespace MMText
{
    // Reads a text file line by line through a memory-mapped view.
    public class MemoryMappedTextFileReader : IDisposable
    {
        private readonly MemoryMappedFile memoryMappedFile;
        private bool disposed;

        public MemoryMappedTextFileReader(string fileName)
        {
            memoryMappedFile = MemoryMappedFile.CreateFromFile(fileName, FileMode.Open);
        }

        // Lazily yields each line; the view and the reader are released when enumeration ends.
        public IEnumerable<string> ReadLines()
        {
            // Note: a view created without an explicit size is rounded up to the
            // system page size, so it can be larger than the file on disk.
            using (var memoryMappedViewStream = memoryMappedFile.CreateViewStream())
            using (var sr = new StreamReader(memoryMappedViewStream, Encoding.UTF8, true, 4096))
            {
                string line;
                while ((line = sr.ReadLine()) != null)
                {
                    yield return line;
                }
            }
        }

        #region IDisposable implementation
        public void Dispose()
        {
            Dispose(true);
            GC.SuppressFinalize(this);
        }

        protected virtual void Dispose(bool disposing)
        {
            if (disposed)
                return;

            if (disposing)
            {
                memoryMappedFile.Dispose();
            }

            disposed = true;
        }
        #endregion
    }
}

And the test:
C#
using System;
using System.IO;
using System.Diagnostics;

namespace MMText
{

    class Program
    {
        public static void Main(string[] args)
        {
            long lines = 0;
            const string fileName = @"D:\TEMP\setupapi.dev.20140929_185959.log";

            var watch = Stopwatch.StartNew();
            foreach (var s in File.ReadLines(fileName))
            {
                lines++;
            }
            watch.Stop();
            TimeSpan ts = watch.Elapsed;
            string elapsedTime = String.Format("{0:00}:{1:00}:{2:00}.{3:00}", ts.Hours, ts.Minutes, ts.Seconds, ts.Milliseconds / 10);
            Console.WriteLine("ReadLines - Reading {0} lines took: {1}. Average: {2} ms/line", lines, elapsedTime, 1.0f*watch.ElapsedMilliseconds/lines);

            lines = 0;
            watch = Stopwatch.StartNew();
            using(var x = new MemoryMappedTextFileReader(fileName))
            {
                foreach(var s in x.ReadLines())
                {
                    lines++;
                }
            }
            watch.Stop();
            ts = watch.Elapsed;
            elapsedTime = String.Format("{0:00}:{1:00}:{2:00}.{3:00}", ts.Hours, ts.Minutes, ts.Seconds, ts.Milliseconds / 10);
            Console.WriteLine("MMF - Reading {0} lines took: {1}. Average: {2} ms/line", lines, elapsedTime, 1.0f*watch.ElapsedMilliseconds/lines);

            Console.Write("Press any key to continue . . . ");
            Console.ReadKey(true);
        }
    }
}


Here are the results:
ReadLines - Reading 225661 lines took: 00:00:00.35. Average: 0,001564293 ms/line
MMF - Reading 225662 lines took: 00:00:00.29. Average: 0,001320559 ms/line

The timings might differ from run to run, but the ratio stays about the same. You might have noticed the difference of one line. Interesting: opening the file in FAR Manager's editor shows 225662, so I don't know what ReadLines is missing there.

Still, one has to be careful with MMF. If you take this path, you should also read this: http://blogs.msdn.com/b/bclteam/archive/2011/06/06/memory-mapped-file-quirks.aspx
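One quirk that may be related to the extra line counted above: a view created with CreateViewStream() and no explicit size is rounded up to the system page size, so the reader can see padding bytes after the real end of the file. The following is a minimal sketch, not a tested replacement for the class above, that limits the view to the file's length. `MappedLineReader` is a hypothetical name, and the sketch assumes the file fits into a single view.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.IO.MemoryMappedFiles;
using System.Text;

public static class MappedLineReader
{
    // Hypothetical helper: maps the file and limits the view to the exact
    // file length, so page-rounding padding is never handed to the reader.
    public static IEnumerable<string> ReadLines(string fileName)
    {
        long length = new FileInfo(fileName).Length;
        if (length == 0)
            yield break; // an empty file cannot be memory-mapped

        using (var mmf = MemoryMappedFile.CreateFromFile(fileName, FileMode.Open))
        using (var view = mmf.CreateViewStream(0, length))
        using (var reader = new StreamReader(view, Encoding.UTF8))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
                yield return line;
        }
    }
}
```

Whether this accounts for the one-line difference in the results above would need to be checked against the actual log file.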

[Update: added memory usage tests]
I have updated the test application like this:
C#
        public static void Main(string[] args)
        {
            AppDomain.MonitoringIsEnabled = true;

            long lines = 0;
            const string fileName = @"D:\TEMP\setupapi.dev.20140929_185959.log";

            var watch = Stopwatch.StartNew();

            long frl_MU_b = AppDomain.CurrentDomain.MonitoringTotalAllocatedMemorySize;
            foreach (var s in File.ReadLines(fileName))
            {
                lines++;
            }
            long frl_MU_a = AppDomain.CurrentDomain.MonitoringTotalAllocatedMemorySize;
            watch.Stop();
            TimeSpan ts = watch.Elapsed;
            string elapsedTime = String.Format("{0:00}:{1:00}:{2:00}.{3:00}", ts.Hours, ts.Minutes, ts.Seconds, ts.Milliseconds / 10);
            Console.WriteLine("ReadLines - Reading {0} lines took: {1}. Average: {2} ms/line. Memory usage: {3}", lines, elapsedTime, 1.0f*watch.ElapsedMilliseconds/lines, frl_MU_a-frl_MU_b);

            lines = 0;
            watch = Stopwatch.StartNew();
            long mmf_MU_b = AppDomain.CurrentDomain.MonitoringTotalAllocatedMemorySize;
            using(var x = new MemoryMappedTextFileReader(fileName))
            {
                foreach(var s in x.ReadLines())
                {
                    lines++;
                }
            }
            long mmf_MU_a = AppDomain.CurrentDomain.MonitoringTotalAllocatedMemorySize;
            watch.Stop();
            ts = watch.Elapsed;
            elapsedTime = String.Format("{0:00}:{1:00}:{2:00}.{3:00}", ts.Hours, ts.Minutes, ts.Seconds, ts.Milliseconds / 10);
            Console.WriteLine("MMF - Reading {0} lines took: {1}. Average: {2} ms/line. Memory usage: {3}", lines, elapsedTime, 1.0f*watch.ElapsedMilliseconds/lines, mmf_MU_a-mmf_MU_b);

            Console.Write("Press any key to continue . . . ");
            Console.ReadKey(true);
        }

And here are the results:
ReadLines - Reading 225661 lines took: 00:00:00.36. Average: 0,001613039 ms/line. Memory usage: 41667828
MMF - Reading 225662 lines took: 00:00:00.35. Average: 0,001586443 ms/line. Memory usage: 37764368

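As you can see, the MMF approach allocates even less managed memory. Note that MonitoringTotalAllocatedMemorySize counts managed allocations made during the run, not the memory occupied by the mapped view itself.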
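Either reader can feed the filtering step from the original question. Below is a minimal sketch of keeping only the lines that have a keyword at a fixed position; the `LogFilter` class, the keyword and the column are hypothetical and would have to match the real log layout.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class LogFilter
{
    // Keeps the lines where keyword starts exactly at the zero-based column.
    public static List<string> FilterLines(IEnumerable<string> lines, string keyword, int column)
    {
        var matches = new List<string>();
        foreach (string line in lines)
        {
            // Skip lines too short to contain the keyword at that position.
            if (line.Length >= column + keyword.Length &&
                string.CompareOrdinal(line, column, keyword, 0, keyword.Length) == 0)
            {
                matches.Add(line);
            }
        }
        return matches;
    }
}
```

A call might look like `LogFilter.FilterLines(File.ReadLines(fileName), "ERROR", 24)`. Comparing at a known offset avoids scanning the whole line, and the ordinal comparison skips culture-sensitive rules, which is usually what you want for log keywords.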
 
Comments
BillWoodruff 1-Jan-15 21:09pm    
+5 Appreciate seeing solutions like this where research and timing was done !
Tommy Jensen 2-Jan-15 2:34am    
Thanks. I will try this during the weekend. Though I suspect this solution will be memory-intensive? The files I need to read can be many hundreds of megabytes, containing millions of lines.
Zoltán Zörgő 2-Jan-15 6:22am    
With MMF you can't control memory usage, that's true. So I have updated my project to report memory usage during the test. See the update. Interesting results.
Tommy Jensen 2-Jan-15 8:14am    
I have tested it now, and I think I will use the simple ReadLines version, as the difference on a file with 8 million lines was just 59/100 of a second. I also tried with 16 million lines but got an out-of-memory error. My files will never reach that number of lines, so the difference is minimal. I guess the problem with my old VB program was not the read/parse but simply poor code on my part.


Thanks all for helping. It is appreciated.
Maciej Los 2-Jan-15 6:36am    
Great job!
 
Comments
BillWoodruff 1-Jan-15 7:26am    
+4 That's a valuable summary thread; however, I note that no reply on that thread actually describes the timing technique used to measure performance. Some of the responses cite Dave Lozinski's articles, which are compromised because he used DateTime to measure performance rather than Stopwatch.

Of course, if Jon Skeet says a certain method is faster, I'd tend to assume he's actually done proper measurement ... based on my perception that he is one brilliant, and thorough, master of C# and .NET !

I also should say that if Mehdi Gholam told me one technique was faster, I would believe him ! :)

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


