Click here to Skip to main content
Click here to Skip to main content
Go to top

NLineDiff: A diff Implementation for Text Files

, 28 Aug 2009
Rate this:
Please Sign up or sign in to vote.
An entry for the Lean and Mean competition

screenshot

screenshot

screenshot

screenshot

Key: Red lines are those deleted from the first file and green lines are those inserted from the second file.

Introduction

This article is an entry for the Lean and Mean competition, which ran on The Code Project during August 2009.

I have implemented a slight variation of Eugene W. Myers' diff algorithm with the linear space refinement from the paper listed in the downloads.

The algorithm is O(ND) in time and O(N) in space, where:

N = number of lines in FileA + number of lines in FileB
D = number of different lines

The output is rendered using a simple WinForms WebBrowser control.

Results

For the test files, N = 3,573 and D = 19. On my 2.5 GHz machine, the solution takes about 5 ms and uses about 450 KB of memory.

Time

A large proportion of the time is spent reading the files from disk and splitting them into string arrays. Using File.ReadAllText takes about 1.7 ms for both files, while File.ReadAllLines takes about 1.9 ms. Using File.OpenRead and StreamReader.ReadLine executes in almost identical times.

Building the HTML from the results also takes some time. Just showing the changes takes about 0.15 ms, which rises to 2.6 ms if the common lines are also included. I have chosen to include the common lines to give more useful results.

As for the diff algorithm itself, the average of 100 iterations is about 0.33 ms. The time for a single run obviously varies more than the average, but consistently takes about 0.4 ms.

These operations give the total time of 1.9 + 0.4 + 2.6 ≈ 5 ms.

The diff is run once before the timing run to allow the JIT compiler to work. This takes about 27 ms, which dwarfs the calculation time. I think that not including this time is legitimate as Chris said in the Notes:

"... there's no need to report memory usage or speed of the framework you use to support your code."

If you disagree, you can add in the JIT time to give a total time of about 32 ms.

Space

The space used is the sum of three areas:

  • The files are loaded into memory, which takes about 360 KB.
  • They are split into string arrays, which takes about 30 KB.
  • The algorithm itself uses an int array, which is about 60 KB.

The sum gives a maximum working memory of 450 KB.

I see that other entries have used Process.PeakWorkingSet64, but this is mostly a measure of how much memory the framework is using. Running this code reports about 900 KB being used by the Thread.Sleep call, which is obviously not correct.

static void Main( string[] args )
{
  var p = Process.GetCurrentProcess();
  var peak = p.PeakWorkingSet64;

  Thread.Sleep( 1000 );

  p.Refresh();

  Console.WriteLine( String.Format( "peak: {0:N0} bytes", p.PeakWorkingSet64 - peak ) );
}

Using the Code

The binary is a single executable that has the test files included as embedded resources. It is configured to load the two texts and split them into string arrays, then time the diff algorithm.

You can also supply two file paths on the command line to use those files.

The source contains some variations which are commented out in Form1.cs. You can edit this file and recompile to verify my time and space measurements on your machine.

Points of Interest

Myers' algorithm is a work of art and I have thoroughly enjoyed learning it. It has taken me about a week to implement it because that's how long it took for me to fully understand it. It is definitely worthy of an article on its own (coming soon!)

History

  • 2009 08 27: First edition
  • 2009 08 28: Fixed bug with overlapping diagonal when D=1 && Δ is even

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

Share

About the Author

Nicholas Butler

United Kingdom United Kingdom

I built my first computer, a Sinclair ZX80, on my 11th birthday in 1980.
In 1992, I completed my Computer Science degree and built my first PC.
I discovered C# and .NET 1.0 Beta 1 in late 2000 and loved them immediately.
I have been writing concurrent software professionally, using multi-processor machines, since 1995.
 
In real life, I have spent 3 years travelling abroad,
I have held a UK Private Pilots Licence for 20 years,
and I am a PADI Divemaster.
 
I now live near idyllic Bournemouth in England.
 
If you would like help with multithreading, please contact me via my website:
 
 
I can work 'virtually' anywhere!

Comments and Discussions

 
NewsVersion 1.1 posted PinmentorNick Butler27-Aug-09 9:38 
GeneralRe: Version 1.1 posted [modified] PinmemberIlka Guigova28-Aug-09 10:15 
RantGood start, but severe issues... PinmemberPeter Souza27-Aug-09 5:25 
AnswerRe: Good start, but severe issues... PinmentorNick Butler28-Aug-09 4:11 
GeneralMy vote of 2 PinmemberPeter Souza27-Aug-09 5:24 
GeneralReducing the memory usage with memory mapped files PinmemberNemanja Trifunovic27-Aug-09 5:16 

General General    News News    Suggestion Suggestion    Question Question    Bug Bug    Answer Answer    Joke Joke    Rant Rant    Admin Admin   

Use Ctrl+Left/Right to switch messages, Ctrl+Up/Down to switch threads, Ctrl+Shift+Left/Right to switch pages.

| Advertise | Privacy | Mobile
Web01 | 2.8.140916.1 | Last Updated 28 Aug 2009
Article Copyright 2009 by Nicholas Butler
Everything else Copyright © CodeProject, 1999-2014
Terms of Service
Layout: fixed | fluid