Click here to Skip to main content
Click here to Skip to main content

An O(ND) Difference Algorithm for C#

By , 14 Mar 2006
 

Introduction

This article is about comparing text files, and the best and most famous algorithm to identify the differences between them. The source code that you can find in the download implements a small class with a simple to use API that does this job. You must have this in your collection of algorithms.

Besides the class that implements the algorithm, there is also a sample web application that compares two files and generates an HTML output with a combined and colored document.

The algorithm was first published 20 years ago under the title "An O(ND) Difference Algorithm and its Variations" by Eugene Myers, Algorithmica Vol. 1 No. 2, 1986, p 251. You can find a copy if it here. In this article, you can find an abstract recursive definition of the algorithm, using some pseudo-code that needs to be transferred to an existing programming language.

There are many C, Java, and Lisp implementations publicly available of this algorithm out there on the internet. Before I wrote the C# version, I discovered that almost all these implementations came from the same source (GNU diffutils) that is available under the (unfree) GNU public license and therefore cannot be reused as a source code for commercial or redistributable application, without being trapped by the GNU license.

There are very old C implementations that use other (worse) heuristic algorithms. Microsoft also published the source code of a diff-tool (WinDiff) that uses some tree structures. Also, a direct transfer from a C source to C# is not easy because there is a lot of pointer arithmetic in the typical C solutions, and I wanted a managed solution. I tried a lot of sources, but found no usable solution written for the .NET platform.

These are the reasons why I implemented the original published algorithm from scratch, and made it available without the GNU license limitations, under a Creative Commons Attribution 2.0 Germany License. The history of this implementation traces back to 2002 when I published a Visual Studio add-in that also compares files, see Visual Studio 2005 Add-in that adds diff tools, an explore command, subversion support, and web project reporting. I couldn't find any bugs in the last 3 years, so I think the code is pretty stable.

I didn't need a high performance diff tool. I will do some performance tweaking as and when needed, so please let me know. I have also dropped some hints in the source code on that topic.

How it works (briefly)

You can find an online working version of the download files here.

  • Comparing the characters of two huge text files is not easy to implement and tends to be slow. Comparing numbers is much easier, so the first step is to compute unique numbers for all the text lines. If the text lines are identical then identical numbers are computed.
  • There are some options that need to be considered before computing these numbers, which are normally useful for some kind of text: stripping off space characters and comparing case insensitive.
  • The core algorithm itself will compare two arrays of numbers, and the preparation is done in the private DiffCodes method and by using a Hashtable.
  • The methods are DiffText and DiffInts.
  • The core of the algorithm is built using two methods:
    • LCS: This is the divide-and-conquer implementation of the longest-common-subsequence algorithm.
    • SMS: This method finds the Shortest Middle Snake.
  • To get some usable performance, I did some changes to the original algorithm. The original algorithm was described using a recursive approach, and compares zero indexed sequences, and passes parts of these sequences as parameters. Extracting sub-arrays and rejoining them is very CPU and memory intensive. To avoid copying these sequences as arrays around, the used arrays together with the lower and upper bounds are passed while the sequences are not copied around all the time. This circumstance makes the LCS and SMS functions more complicate.
  • I added some code to the LCS function to get a fast response on sub-arrays that are identical, completely deleted, or inserted instead of recursively analyzing them down to the single number case.
  • The result is stored in two arrays that flag for modified (deleted or inserted) lines in the two data arrays. These bits are then analyzed to produce an array of objects that describe the found differences.
  • Read the original article if you want to understand more.

The API

To use this functionality, you only have to call one of the DiffText methods. They all get a pair of strings, and return an array of items that describe the inserts and deletes between the two strings. There are no "changes" reported. Instead, you can find an "insert" and "deleted" pair.

DiffText(string TextA, string TextB)

Finds the difference between two texts, comparing by text lines without any conversion. An array of items containing the differences is returned.

DiffText(string TextA, string TextB, bool trimSpace, bool ignoreSpace, bool ignoreCase)

Finds the difference between two texts, comparing by text lines with some optional conversions. An array of items containing the differences is returned.

Diff(int[] ArrayA, int[] ArrayB)

Finds the difference between two arrays of integers. An array of items containing the differences is returned.

A sample application for the algorithm

The sample is a small web application that calculates the difference between two text files and displays the result using HTML.

To setup the sample, you have to create a web-project and copy all the files found in the zip file into the directory. The implementation of the algorithm is given inside the "diff.cs" class.

Further possible optimizations (if you really need speed)

(First rule: don't do it; second: don't do it yet.)

The arrays DataA and DataB are passed as parameters, but are never changed after the creation, so they can be used as members of the class to avoid parameter overhead. In the SMS, there is a lot of boundary arithmetic in the for-D and for-k loops that can be done by incrementing and decrementing the local variables. The DownVector and UpVector arrays are always created and destroyed each time the SMS gets called. It is possible to reuse them when transferring them to members of the class.

See TODO: hints.

Security issues with the web application

  • Using this web-site implementation enables clients to view all the source code of your website. Just be sure that you do not use it as-it-is on a public server.
  • You should implement a check of the parameters, and allow only a diff output on the files that can be displayed (text based files).

License

This work is licensed under Creative Commons Attribution 2.0 Germany License.

Built-in self-test

The class now has a built-in self-test, that works without changing the code. If you compile the diff and debug class files together, you get an EXE file that tests some simple diff-scenarios that were used to discover the bugs in versions 1 and 2 (OutOfArrayBounds typically).

The compile command is:

csc /target:exe /out:diffTest.exe /debug /d:SELFTEST Diff.cs Debug.cs

Thanks for your feedback and the two reported cases that did not diff correctly. (It was always my fault, not a problem of the algorithm).

History

This work was first published here.

  • 2002.09.20
    • There was a "hang" in some situations.
    • Now, I understand a little bit more of the SMS algorithm.
    • There have been overlapping boxes; those where analyzed differently. One return-point is enough.
    • An assertion was added in CreateDiffs when in debug-mode, that counts the number of equal (not modified) lines in both arrays. They must be identical.
  • 2003.02.07
    • Out of bounds error in the Up/Down vector arrays in some situations.
    • The two vectors are now accessed using different offsets that are adjusted using the start k-Line.
    • A test case is added.
  • 2003.04.09
    • Another test that threw an exception was found, but that seems to be fixed by the 2002.09.20 work.
  • 2006.03.10
    • Refactored the API to static methods on the diff class to make the usage simpler.
    • The self test is now using the standard debug class.

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.

A list of licenses authors might use can be found here

About the Author

Matthias Hertel
Architect Deutsche Bank AG
Germany Germany
Member
see http://www.mathertel.de

Sign Up to vote   Poor Excellent
Add a reason or comment to your vote: x
Votes of 3 or less require a comment

Comments and Discussions

 
You must Sign In to use this message board.
Search this forum  
    Spacing  Noise  Layout  Per page   
QuestionWinForm or WPF versionmemberMember 78411351 Oct '12 - 11:13 
Hi,
 
Can you release a WinForm or WPF version please??
 
It will be very helpful for me!
 
10x,
El
QuestionFormating the resultsmemberMember 78411351 Oct '12 - 11:09 
Hi,
 
I would like to create a text comparer.
I used the method: diff.DiffText, but have no idea how to fomat the reulsts.
I need to show the diff/extra/same between 2 text files.
please help!
 
thanks,
El
AnswerRe: Formating the resultsmemberMatthias Hertel2 Oct '12 - 10:17 
I only have source code for HTML output that is embedded into the *.aspx files in the samples zip.
The text on Codeproject is older than the one published on my web site:
http://www.mathertel.de/Diff/default.aspx
GeneralReverting text to a previous statememberjsoldi1 Mar '11 - 8:12 
Is it possible to use this library to, having text A and B and difference X that is the change from A to B, use X and B to retrieve A?
GeneraldiffItem results are not consistentmembercmschick4 Dec '09 - 18:30 
I'm having trouble with interpreting the results of the diffItems. All is well if you compare say file A to file C or file B to C but if you store the results from each compare operation and then try to find a merged view the numbers in the diffItems don't add up. It would be nice if the diffItems were structured like this:
 
SourceIndex
DestinationIndex
Length
Enum Inserted, Replaced, Deleted, Unchanged End Enum
 
Here's a full blown example of the problem (no the auto-text is not always all caps):
 
FILE A
 
This is file A's text
it has multiple lines that are
always different than file C.
 
FILE B
 
This is file B's text. This text is always
different than file A's text. These two files
are really sections of File C.
Because this section is much longer than file A
It may have multiple inserts from some other
program.
 
FILE C
 
This is the main file it contains multiple lines
and contains the text of both file A and file B. However
another program had this file before I did and inserted
some text in between the file A text and some in between the
file B text like this:
 
This is file A's text
it has multiple lines that are
 
SOMEONE PUT SOME TEXT HERE
 
always different than file C.
 
This is file B's text. This text is always
 
SOME AUTO FOOTER TEXT THAT SOME PROGRAM ENTERED
 
different than file A's text. These two files
are really sections of File C.
Because this section is much longer than file A
It may have multiple inserts from some other
 
SOME AUTO FOOTER TEXT THAT SOME PROGRAM ENTERED
 
program.
 

__________________________________________________________________
Now after you have saved the results of both diffs try to figure out how
to replace the A file text in File C with a token like this {FileA} and the same for
file B like this {FileB}. You cannot. Because the diffItems startA, startB, and insertdB do not make sense and you cannot properly determine the line numbers in the source and destination files. What am I missing?
 
Thanks.
QuestionDiffWord?memberkeeara26 Jan '09 - 20:43 
The application is great and works well for DiffText and DiffInt but i think it would be great if there is DiffWord as well (to differentiate by words rather than by character) because sometimes if a character is in both new and old text, the way it is displayed it is harder to read.
 
Thanks and keep up the good work
Generaltext diffmemberm-khansari22 Dec '08 - 21:15 
hi
 
I want to use your class to compare two text.
the code is:
 
string A = "iMohammad";
string B = "Mohammadi";
my.utils.Diff diff = new my.utils.Diff();
my.utils.Diff.Item[] result = diff.DiffText(A,B);
foreach (my.utils.Diff.Item item in result)
Console.WriteLine(item.ToString());
 
but I recieve one Item object in Item Array
StartA=0, StartB=0 , DeletedA=1 , InsertedB=1
 
I expected to recieve this result
StartA=0,StartB=0,DeletedA=1,InsertedB=0;
StartA=10,StartB=10,DeletedA=0,InsertedB=1;
 
I implemment ToString() method in Item class.
 
what can I do?
 
thanks.
 




}
 
Mohammad Khansari

GeneralRe: text diffmemberMatthias Hertel26 Dec '08 - 0:06 
I've posted a sample that is covering this topic.
Have a look on my web site at: http://www.mathertel.de/Diff/DiffTextLines.aspx[^]
QuestionUnfree?memberBrianary30 Oct '07 - 14:40 
How do you figure?
 
Trapped? As opposed to your customers/readers, who are trapped otherwise?
QuestionProblem with your code... :( [modified]memberCélio16 Jul '07 - 2:48 
Hi.
 
I was very glad I found this sample. But then I got into this problem:
Try to do this:
 
File 1:
if (a > b)
{
  MessageBox.Show("Bigger")
}
 
if (a < b)
{
  MessageBox.Show("Smaller")
}

 
and then File 2:
if (a > b)
{
  MessageBox.Show("Bigger than...");
}

 
the line: 'MessageBox.Show("Bigger than...");' goes into if 'a < b' Frown | :(
 
Result of the sample modified by me in Windows Forms:
0 - if (a > b)
1 - {
  -     MessageBox.Show("Bigger");
  - }
  - 
  - if (a < b)
  - {
  -     MessageBox.Show("Smaller");
2 - MessageBox.Show("Bigger than...");
3 - }

 
... is this a bug?
 
CeliSoft

AnswerRe: Problem with your code... :(memberMatthias Hertel16 Jul '07 - 8:02 
Hi,
 
I just uploadedd your files to my web site. The bug doesn't show up.
See:
http://www.mathertel.de/Diff/DiffTest.aspx?oldfile=Celi_B.txt&newfile=Celi_A.txt[^]
I guess you have problem with the code that scans the returned diff items and draws the output.
 

 
Matthias
GeneralRe: Problem with your code... :(memberCélio16 Jul '07 - 22:52 
Hi Mr. Matthias,
 
"I guess you have problem with the code that scans the returned diff items and draws the output." -> I have tried those same files in your ASP.NET sample too. Same result...
I don't know how you got that result you replied.
 
Thanks any way... Sniff | :^)
 
CeliSoft

GeneralRe: Problem with your code... :(memberbendornis6 Sep '07 - 11:25 
The most likely reason that you got the output you did was that the code was different the last time he compiled under debug mode and that it's now in release mode compile. I've had a similar thing happen and took a while to figure it out. Basically you could test this by adding a new function and calling that function. If that function doesn't get called then you're not using the right compiled dll. (Probably confusing but I hope I got my point across correctly)
AnswerRe: Problem with your code... :(memberCélio6 Sep '07 - 22:49 
OK. Thanks for your answer. It helped me a lot!
 
CeliSoft

GeneralLicense/Attributionmemberchiphnc20 May '07 - 15:22 
I looked in the article & code comments, but didn't see what attribution you'd like per the Creative Commons Attribution 2.0 license.
 
Something simple, like this?
 
Uses text differencing code originally written by Matthias Hertel
http://www.mathertel.de


AnswerRe: License/AttributionmemberMatthias Hertel21 May '07 - 9:01 
I know what you mean... and I've changed the license for that reason to a bsd-style license in January 2007[^].
See http://www.mathertel.de/License.aspx[^] for more precise details.
 
... and I know that I should update this article here and write some more samples to use the class. You can also find updated downloads at http://www.mathertel.de/Diff/[^].
QuestionAny chance of a winforms sample app?membertonyk14 May '07 - 15:29 
I'm a C# newbie and can't figure out how to use this to show the diff between two strings.
 
Would it be possible for someone to post somewhere something simple like a button that would show diff between two strings in a text box on a form?
 
I would really really appreciate the help
 
Tony!

AnswerRe: Any chance of a winforms sample app?memberMatthias Hertel5 May '07 - 2:32 
Not a winforms sample but a step by step sample how to use this class can be found here:
http://www.mathertel.de/Diff/DiffTextLines.aspx[^].
Hope it helps.
QuestionRe: Any chance of a winforms sample app?membertonyk16 May '07 - 17:02 
Thanks for the link. I looked at that and I'm still confused Frown | :(
(Feeling pretty dense and newbish here like it should be obvious)
 
textBox1.Text = (my.utils.Diff.DiffText("Jimmy has class", "Jimmi has class", false, false, false).ToString());
 
I know the above I did up is wrong, but how would pull back the array of items containing the differences?
 
I really appreciate your help Smile | :)
I'll keep digging at it.
 
Tony!

QuestionRe: Any chance of a winforms sample app?memberMatthias Hertel7 May '07 - 9:24 
Ok ! ???
What do you expect that is shown in the textBox1.Text?
Maybe something like:
"Jimmy is not Jimmi"
"Jimm$d(y)$i(i) has class"
 
my.utils.Diff.DiffText
is comparing text (strings with multiple textlines) and not strings (strings with a single textline) which is a completely different situation.
Questionhow can modify it to support chinese??memberP.Dragon24 Mar '07 - 17:00 
hello,I'm chinese.
I found this src,but when I use it,it can't support chinese.
And I know this no much,so......
can you tell me how it can support chinese???
Thank you.
AnswerRe: how can modify it to support chinese??memberMatthias Hertel1 Apr '07 - 23:40 
Interesting question!
The core algorithm is not comparing characters so there should not be any problems in the core but inside the DiffCodes method that transfers text lines into comparable numbers.
Have a look at this implementation.
Anyway it's interesting and I have no Chinese test case (yet). You can send me some Chinese text files and a small test case implementation.
My mail address is available on my web site: http://www.mathertel.de/impressum.aspx[^].
 
Matthias Hertel, http://www.mathertel.de
AnswerRe: how can modify it to support chinese??memberplayhere20 Dec '11 - 20:45 
//aFile = File.OpenText(fileName);
//a = aFile.ReadToEnd();
//aFile.Close();
StreamReader reader = new StreamReader(fileName, Encoding.Default);
a = reader.ReadToEnd();
 
it's Encoding problem.
Questionbroken linkmemberpsidata9 Feb '07 - 1:25 

Link for
"An O(ND) Difference Algorithm and its Variations" by Eugene Myers, Algorithmica Vol. 1 No. 2, 1986, p 251.
http://xmailserver.org/diff2.pdf[^]
is broken
 
can we get this file please...
 
Thanks for the fantastic Article
 
Regards
AnswerRe: broken link: new LinkmemberMatthias Hertel9 Feb '07 - 9:39 
The publication is also available through the scientific literature digital library:
http://citeseer.ist.psu.edu/myers86ond.html[^]

General General    News News    Suggestion Suggestion    Question Question    Bug Bug    Answer Answer    Joke Joke    Rant Rant    Admin Admin   

Permalink | Advertise | Privacy | Mobile
Web01 | 2.6.130523.1 | Last Updated 14 Mar 2006
Article Copyright 2006 by Matthias Hertel
Everything else Copyright © CodeProject, 1999-2013
Terms of Use
Layout: fixed | fluid