|
|||||||||||||||||||||||
|
|||||||||||||||||||||||
|
Announcements
Chapters
Services
Feature Zones
|
IntroductionThis article is about comparing text files, and the best and most famous algorithm to identify the differences between them. The source code that you can find in the download implements a small class with a simple to use API that does this job. You must have this in your collection of algorithms. Besides the class that implements the algorithm, there is also a sample web application that compares two files and generates an HTML output with a combined and colored document. The algorithm was first published 20 years ago under the title "An O(ND) Difference Algorithm and its Variations" by Eugene Myers, Algorithmica Vol. 1 No. 2, 1986, p 251. You can find a copy if it here. In this article, you can find an abstract recursive definition of the algorithm, using some pseudo-code that needs to be transferred to an existing programming language. There are many C, Java, and Lisp implementations publicly available of this algorithm out there on the internet. Before I wrote the C# version, I discovered that almost all these implementations came from the same source (GNU diffutils) that is available under the (unfree) GNU public license and therefore cannot be reused as a source code for commercial or redistributable application, without being trapped by the GNU license. There are very old C implementations that use other (worse) heuristic algorithms. Microsoft also published the source code of a diff-tool (WinDiff) that uses some tree structures. Also, a direct transfer from a C source to C# is not easy because there is a lot of pointer arithmetic in the typical C solutions, and I wanted a managed solution. I tried a lot of sources, but found no usable solution written for the .NET platform. These are the reasons why I implemented the original published algorithm from scratch, and made it available without the GNU license limitations, under a Creative Commons Attribution 2.0 Germany License. The history of this implementation traces back to 2002 when I published a Visual Studio add-in that also compares files, see Visual Studio 2005 Add-in that adds diff tools, an explore command, subversion support, and web project reporting. I couldn't find any bugs in the last 3 years, so I think the code is pretty stable. I didn't need a high performance diff tool. I will do some performance tweaking as and when needed, so please let me know. I have also dropped some hints in the source code on that topic. How it works (briefly)You can find an online working version of the download files here.
The APITo use this functionality, you only have to call one of the DiffText(string TextA, string TextB)Finds the difference between two texts, comparing by text lines without any conversion. An array of items containing the differences is returned. DiffText(string TextA, string TextB, bool trimSpace, bool ignoreSpace, bool ignoreCase)Finds the difference between two texts, comparing by text lines with some optional conversions. An array of items containing the differences is returned. Diff(int[] ArrayA, int[] ArrayB)Finds the difference between two arrays of integers. An array of items containing the differences is returned. A sample application for the algorithmThe sample is a small web application that calculates the difference between two text files and displays the result using HTML. To setup the sample, you have to create a web-project and copy all the files found in the zip file into the directory. The implementation of the algorithm is given inside the "diff.cs" class. Further possible optimizations (if you really need speed)(First rule: don't do it; second: don't do it yet.) The arrays See TODO: hints. Security issues with the web application
LicenseThis work is licensed under Creative Commons Attribution 2.0 Germany License. Built-in self-testThe class now has a built-in self-test, that works without changing the code. If you compile the diff and debug class files together, you get an EXE file that tests some simple diff-scenarios that were used to discover the bugs in versions 1 and 2 ( The compile command is: csc /target:exe /out:diffTest.exe /debug /d:SELFTEST Diff.cs Debug.cs
Thanks for your feedback and the two reported cases that did not diff correctly. (It was always my fault, not a problem of the algorithm). HistoryThis work was first published here.
|
||||||||||||||||||||||