Click here to Skip to main content
Click here to Skip to main content

NLineDiffConsole: A Console diff Implementation for Text Files

, 31 Aug 2009
Rate this:
Please Sign up or sign in to vote.
An entry for the Lean and Mean competition

Introduction

This article is an entry for the Lean and Mean competition, which ran on The Code Project during August 2009.

This is a console version of my previous entry. As such, it also uses the diff algorithm with the linear space refinement. I have included the original paper in the downloads. In this version, I have halved the memory requirements, but the algorithm is the same.

Correction: According to Wikipedia, the algorithm was written by Douglas McIlroy in the early 70's while Eugene W. Myers only described it in his paper. Apologies.

Background

If N = number of lines in the first file, M = number of lines in the other file and D = number of different lines, then the diff algorithm is O( (N+M) D ) in time and O( N+M ) in space.

This is far better than the LCS algorithm, which is O( N * M ) in both time and space. For the test files N = 1784, M = 1789 and D = 19. So in this case, the complexity of the LCS algorithm is several orders of magnitude fatter and slower than the diff algorithm.

I have to repeat my appreciation of McIlroy's algorithm. The more I have learned about it, the more impressed I have become. I certainly couldn't invent anything better!

Using the Code

The demo is a console application that uses the diff implementation, which is in a separate assembly. The demo takes two file paths as arguments. It reads the files, calculates the diff and displays the results and some timing measurements. There is a #define in program.cs that optionally measures the change in Process.PeakWorkingSet64.

Results

The screenshot above shows the timing results for a typical run. Not including the JIT compilation gives times of about 5.5 ms. Including it gives overall times of about 32 ms.

The main arrays used by the algorithm are now a bit under 30 KB, so the calculated memory use is 420 KB plus change.

Other entries measure the change in Process.PeakWorkingSet64, so I will do the same. The measurements vary dramatically, but values of 600 - 700 KB are common.

History

  • 2009 08 31: First edition

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

About the Author

Nicholas Butler

United Kingdom United Kingdom

I built my first computer, a Sinclair ZX80, on my 11th birthday in 1980.
In 1992, I completed my Computer Science degree and built my first PC.
I discovered C# and .NET 1.0 Beta 1 in late 2000 and loved them immediately.
I have been writing concurrent software professionally, using multi-processor machines, since 1995.
 
In real life, I have spent 3 years travelling abroad,
I have held a UK Private Pilots Licence for 20 years,
and I am a PADI Divemaster.
 
I now live near idyllic Bournemouth in England.
 
If you would like help with multithreading, please contact me via my website:
 
 
I can work 'virtually' anywhere!

Comments and Discussions

 
GeneralGreat. PinmvpLuc Pattyn31-Aug-09 16:02 
AnswerRe: Great. PinmentorNick Butler31-Aug-09 23:47 
Hi Luc,
 
What are you doing up at 4 AM Smile | :)
 
Thanks for your feedback, especially the praise! You may have noticed that I have really enjoyed learning the diff algorithm Smile | :) It's a work of beauty and I plan to write a thorough tutorial, but that will take weeks.
 
As for the metrics, I have tried to be as informative as possible. However, there are many factors at work here. For example, when I wrote my first entry, NLineDiff, my solid state drive was fastest at reading the files. But for my second two entries, my VelociRaptor was marginally quicker. Don't ask me why!
 
I don't think PeakWorkingSet64 is a good measure of memory consumption - it is too blunt an instrument. It mostly indicates the memory used by the framework on background threads. Have you tried the code I added to the NLineDiff article? Also, the GC is a complicated beast. Try adding some GC.AddMemoryPressure calls to really liven things up! This is not native code.
 
My implementation of the diff algorithm uses two Int32 arrays which are allocated up front. The only other allocations are for storing the results, which are tiny in comparison. So I can, and have, calculated a memory consumption figure which is a good guide to the space requirements of the algorithm. This is much more accurate than measuring what the framework is up to.
 
For the time metrics, I have obviously used the Stopwatch class. I take snapshots at various intervals to provide more detail, but the overall figure is the total elapsed time. I have also reported the JIT time, although as I suspect this code will be used in a web app Big Grin | :-D the JIT time will be irrelevant.
 

 
Thanks again for the feedback and good luck Smile | :)
 

Nick
 
----------------------------------
Be excellent to each other Smile | :)

GeneralRe: Great. PinmvpLuc Pattyn31-Aug-09 23:57 
GeneralRe: Great. PinadminChris Maunder4-Sep-09 9:10 

General General    News News    Suggestion Suggestion    Question Question    Bug Bug    Answer Answer    Joke Joke    Rant Rant    Admin Admin   

Use Ctrl+Left/Right to switch messages, Ctrl+Up/Down to switch threads, Ctrl+Shift+Left/Right to switch pages.

| Advertise | Privacy | Mobile
Web03 | 2.8.140721.1 | Last Updated 31 Aug 2009
Article Copyright 2009 by Nicholas Butler
Everything else Copyright © CodeProject, 1999-2014
Terms of Service
Layout: fixed | fluid