This is a generic algorithm for finding the difference between 2 objects, no matter what the type. The algorithm uses the
ICompare interface to tell if the items are the same. Maybe this could be used for producing a WiKi site in ASP.NET and track revisions.
I just wanted to let all you readers know, I don't have a degree in Computer Science and I haven't read or learned much about algorithm design. I haven't ever really analyzed a Diff algorithm before, but I know a little about them. If there is a way of improving this, or if I am going in the completely wrong direction, please let me know.
I am not really sure of the intimate details of how other Diff algorithms work (like the UNIX Diff utility), but it does use some ideas that I got from general reading on the topic. The Diff Engine uses a bit matrix that employs a sliding window of the 2 files. The matrix looks kind of like this:
The black squares are the squares with matching items. The algorithm starts at 0,0 and looks all the way along the x-axis to find a diagonal sequence, which is 2 or more matching characters in a row. The algorithm looks for the longest sequence, using the the closer of any that are the same length. It then looks down the y-axis, and if it finds a sequence that is larger and closer than any found on the x-axis, it uses that sequence. If it finds a sequence, it performs whatever operation is necessary to get to that sequence. In this case, it wouldn't find a diagonal sequence on the 0 x-axis or the y-axis, so it performs its standard operation, which is to delete the 'B' and insert the 'N'. It then increments our x and y location, so we are now at 1,1. Again, we do not find a diagonal sequence, so we perform the standard operation again. However, at 2,2 we find that we are in a sequence.
Once we find a sequence, first we slide the window down to where we are, since we won't be needing any of the previous characters any more. Then, we continually increment x and y until they reach a non matching square. At 4,4 there is no match, so we switch states again, meaning that we slide the window down, and once again look for a diagonal. Now, on the x-axis 4, we find a diagonal at 4,8, so now must perform whatever is necessary to get there. Moving in the y direction is an insertion, so we insert 'T', 'h', 'e' and ' '. At that point, we enter enter the diagonal, save state, etc.
We follow the diagonal down to the end from there. Pretty straight forward, right?
Using the code
I wrote this code so that it would be as generic as possible, so I built it around 3 basic classes and an interface. There are actually more classes but, I will get to those in a minute.
First, to do a comparison, you will need to subclass
ComparableStreamReader. This is an abstract class that I created, and it has only one method,
GetNext(), which returns an object that implements an
IComparable object. This is really the trick to this. In the demo, I have included a class called
ComparableStringReader, which just returns characters from a string, one at a time. If you look at the code, you will notice that it is actually returning strings with one character and not char objects. What can I say - string already implements
IComparable, and I am lazy. However, if you want to compare items that are not strings, you may have to create your own custom object that implements
public class ComparableStringReader : Diff.ComparableStreamReader
protected string _source;
protected int _location;
public ComparableStringReader(string source)
_source = source;
_location = 0;
public override IComparable GetNext()
if ( _location < _source.Length )
return _source.Substring(_location++, 1);
Once you have created a reader class, you just need to create an instance of the
DiffEngine has two properties, and one method,
Compare(). Compare takes in 2
ComparableStreamReaders, the source (original), and the destination (changed document).
Compare() returns an
EditScript object, which is just a collection of
DiffCommands objects. Before I go into the
DiffCommand objects, however, I want to mention the two properties. The property
WindowSize is an integer that defaults to 256. You have to be careful with this number however. The higher it goes, the more exact and concise the Diff script will be, but it uses memory at a rate of (n2/8). That's right. So at 256, we are only using 8K of memory to store the matrix (it's actually called an Edit Graph). However, if we decide to move up to 1024, we are not using 128K. Also, the higher the window size, the longer it takes to compute, at about the same rate (I am not really sure how to use big-oh notation, only read it, so I don't give it). 255 should work fine for most cases, especially if you are comparing entire lines, one at a time.
DiffEngine de = new DiffEngine();
de.WindowSize = 10;
ComparableStringReader csrSource = new ComparableStringReader(txtSource.Text);
ComparableStringReader csrDest = new ComparableStringReader(txtDestination.Text);
EditScript es = de.Compare(csrSource, csrDest);
Now, once the
Compare() method returns, it returns a script (that same script is available through the
Script property). The
EditScript, like I said before, is a collection of
DiffCommands. Each command represents an operation that must be performed against the source (original) to produce the destination (changed document). There are three types of commands, which are members of the
Delete commands utilize the
Length property of
DiffCommand class, and the
Insert command utilities the
Value property. The
Length property just represents the number of times to repeat the operation. The
Value property represents the object to be inserted, and it is also the exact object that was returned from
In the sample, I have used the
EditScript to provide an exploded view of the source and destination strings so that you can see what they had in common. I also generated a very simple Diff script that could be used to recreate the destination from the original.
B i t Matrix
N ot The Matrix
In this example, the Diff script is using three operators - '*' for
Skip followed by the number of characters to skip, '-' for
Delete followed by the number of characters to delete, and '+' for
Insert, followed by the character to insert.
Eventually, I will write a method to merge a source and an
EditScript to recreate the original, but first, I want to do some better testing just to make sure that this algorithm is actually viable. If you use it successfully, please let me know. Also, if you have any recommendations for improving the algorithm, also let me know.
Points of Interest
There is a reusable
BitMatrix class that this algorithm employs to cut down on memory use. If you want, you can change it to use just
bool values. This may or not speed up the algorithm, since you would be using 8 times more memory (unless the C# compiler does lots of optimizing!).
- 4/21/2004 - Version 0.9 - released.