Click here to Skip to main content
15,867,568 members
Articles / Programming Languages / C#

NLineDiff: A diff Implementation for Text Files

Rate me:
Please Sign up or sign in to vote.
4.42/5 (4 votes)
28 Aug 2009CPOL3 min read 30.2K   949   20   6
An entry for the Lean and Mean competition

screenshot

screenshot

screenshot

screenshot

Key: Red lines are those deleted from the first file and green lines are those inserted from the second file.

Introduction

This article is an entry for the Lean and Mean competition, which ran on The Code Project during August 2009.

I have implemented a slight variation of Eugene W. Myers' diff algorithm with the linear space refinement from the paper listed in the downloads.

The algorithm is O(ND) in time and O(N) in space, where:

N = number of lines in FileA + number of lines in FileB
D = number of different lines

The output is rendered using a simple WinForms WebBrowser control.

Results

For the test files, N = 3,573 and D = 19. On my 2.5 GHz machine, the solution takes about 5 ms and uses about 450 KB of memory.

Time

A large proportion of the time is spent reading the files from disk and splitting them into string arrays. Using File.ReadAllText takes about 1.7 ms for both files, while File.ReadAllLines takes about 1.9 ms. Using File.OpenRead and StreamReader.ReadLine executes in almost identical times.

Building the HTML from the results also takes some time. Just showing the changes takes about 0.15 ms, which rises to 2.6 ms if the common lines are also included. I have chosen to include the common lines to give more useful results.

As for the diff algorithm itself, the average of 100 iterations is about 0.33 ms. The time for a single run obviously varies more than the average, but consistently takes about 0.4 ms.

These operations give the total time of 1.9 + 0.4 + 2.6 ≈ 5 ms.

The diff is run once before the timing run to allow the JIT compiler to work. This takes about 27 ms, which dwarfs the calculation time. I think that not including this time is legitimate as Chris said in the Notes:

"... there's no need to report memory usage or speed of the framework you use to support your code."

If you disagree, you can add in the JIT time to give a total time of about 32 ms.

Space

The space used is the sum of three areas:

  • The files are loaded into memory, which takes about 360 KB.
  • They are split into string arrays, which takes about 30 KB.
  • The algorithm itself uses an int array, which is about 60 KB.

The sum gives a maximum working memory of 450 KB.

I see that other entries have used Process.PeakWorkingSet64, but this is mostly a measure of how much memory the framework is using. Running this code reports about 900 KB being used by the Thread.Sleep call, which is obviously not correct.

C#
static void Main( string[] args )
{
  var p = Process.GetCurrentProcess();
  var peak = p.PeakWorkingSet64;

  Thread.Sleep( 1000 );

  p.Refresh();

  Console.WriteLine( String.Format( "peak: {0:N0} bytes", p.PeakWorkingSet64 - peak ) );
}

Using the Code

The binary is a single executable that has the test files included as embedded resources. It is configured to load the two texts and split them into string arrays, then time the diff algorithm.

You can also supply two file paths on the command line to use those files.

The source contains some variations which are commented out in Form1.cs. You can edit this file and recompile to verify my time and space measurements on your machine.

Points of Interest

Myers' algorithm is a work of art and I have thoroughly enjoyed learning it. It has taken me about a week to implement it because that's how long it took for me to fully understand it. It is definitely worthy of an article on its own (coming soon!)

History

  • 2009 08 27: First edition
  • 2009 08 28: Fixed bug with overlapping diagonal when D=1 && Δ is even

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


Written By
United Kingdom United Kingdom
I discovered C# and .NET 1.0 Beta 1 in late 2000 and loved them immediately.
I have been writing software professionally in C# ever since

In real life, I have spent 3 years travelling abroad,
I have held a UK Private Pilots Licence for 20 years,
and I am a PADI Divemaster.

I now live near idyllic Bournemouth in England.

I can work 'virtually' anywhere!

Comments and Discussions

 
NewsVersion 1.1 posted Pin
Nicholas Butler27-Aug-09 9:38
sitebuilderNicholas Butler27-Aug-09 9:38 
GeneralRe: Version 1.1 posted [modified] Pin
Ilka Guigova28-Aug-09 10:15
Ilka Guigova28-Aug-09 10:15 
Hello Nick,

I confirm that now your application runs with various inputs.

I created grid3.html (to contain the contents of grid1.html 5 times) and grid4.html (to contain the contents of grid2.html 5 times) and did some more testing:
- NLineDiffDemo grid1.html grid1.html - it takes about 0.37 ms and 57 096 bytes
- NLineDiffDemo grid1.html grid2.html - it takes about 0.67 ms and 57 176 bytes
- NLineDiffDemo grid3.html grid3.html - it takes about 3 ms and 285 448 bytes
- NLineDiffDemo grid3.html grid4.html - it takes about 6 ms and 285 848 bytes
(If I got the units wrong, please ignore them and just keep the scale in mind. I apologize.)
And it confirms that the algorithm runs in O(N) in space and O(ND) in time.

Have you considered other approaches apart from the Myers' LCS algorithm? I am looking forward to your article on it.

I also like your output presentation very much : )

I will refrain from voting due to conflict of interest but I welcome any comments that you may have on my entry: Code Lean and Mean File DIFF (FIFF) Application[^] Smile | :)

Good Luck!
Ilka

I would imagine if you could understand Morse Code, a tap dancer would drive you crazy.
-- Mitch Hedberg (American Comedian, 1968-2005)

modified on Friday, August 28, 2009 5:51 PM

RantGood start, but severe issues... Pin
Pete Souza IV27-Aug-09 5:25
professionalPete Souza IV27-Aug-09 5:25 
AnswerRe: Good start, but severe issues... Pin
Nicholas Butler28-Aug-09 4:11
sitebuilderNicholas Butler28-Aug-09 4:11 
GeneralMy vote of 2 Pin
Pete Souza IV27-Aug-09 5:24
professionalPete Souza IV27-Aug-09 5:24 
GeneralReducing the memory usage with memory mapped files Pin
Nemanja Trifunovic27-Aug-09 5:16
Nemanja Trifunovic27-Aug-09 5:16 

General General    News News    Suggestion Suggestion    Question Question    Bug Bug    Answer Answer    Joke Joke    Praise Praise    Rant Rant    Admin Admin   

Use Ctrl+Left/Right to switch messages, Ctrl+Up/Down to switch threads, Ctrl+Shift+Left/Right to switch pages.