14,095,566 members
alternative version

#### Stats

172.8K views
75 bookmarked
Posted 20 Apr 2004
Licenced

# A Generic - Reusable Diff Algorithm in C#

, 20 Apr 2004
A generic algorithm that can be used to find the difference between objects.

## Introduction

This is a generic algorithm for finding the difference between 2 objects, no matter what the type. The algorithm uses the `ICompare` interface to tell if the items are the same. Maybe this could be used for producing a WiKi site in ASP.NET and track revisions.

I just wanted to let all you readers know, I don't have a degree in Computer Science and I haven't read or learned much about algorithm design. I haven't ever really analyzed a Diff algorithm before, but I know a little about them. If there is a way of improving this, or if I am going in the completely wrong direction, please let me know.

## Background

I am not really sure of the intimate details of how other Diff algorithms work (like the UNIX Diff utility), but it does use some ideas that I got from general reading on the topic. The Diff Engine uses a bit matrix that employs a sliding window of the 2 files. The matrix looks kind of like this:

The black squares are the squares with matching items. The algorithm starts at 0,0 and looks all the way along the x-axis to find a diagonal sequence, which is 2 or more matching characters in a row. The algorithm looks for the longest sequence, using the the closer of any that are the same length. It then looks down the y-axis, and if it finds a sequence that is larger and closer than any found on the x-axis, it uses that sequence. If it finds a sequence, it performs whatever operation is necessary to get to that sequence. In this case, it wouldn't find a diagonal sequence on the 0 x-axis or the y-axis, so it performs its standard operation, which is to delete the 'B' and insert the 'N'. It then increments our x and y location, so we are now at 1,1. Again, we do not find a diagonal sequence, so we perform the standard operation again. However, at 2,2 we find that we are in a sequence.

Once we find a sequence, first we slide the window down to where we are, since we won't be needing any of the previous characters any more. Then, we continually increment x and y until they reach a non matching square. At 4,4 there is no match, so we switch states again, meaning that we slide the window down, and once again look for a diagonal. Now, on the x-axis 4, we find a diagonal at 4,8, so now must perform whatever is necessary to get there. Moving in the y direction is an insertion, so we insert 'T', 'h', 'e' and ' '. At that point, we enter enter the diagonal, save state, etc.

We follow the diagonal down to the end from there. Pretty straight forward, right?

## Using the code

I wrote this code so that it would be as generic as possible, so I built it around 3 basic classes and an interface. There are actually more classes but, I will get to those in a minute.

First, to do a comparison, you will need to subclass `ComparableStreamReader`. This is an abstract class that I created, and it has only one method, `GetNext()`, which returns an object that implements an `IComparable` object. This is really the trick to this. In the demo, I have included a class called `ComparableStringReader`, which just returns characters from a string, one at a time. If you look at the code, you will notice that it is actually returning strings with one character and not char objects. What can I say - string already implements `IComparable`, and I am lazy. However, if you want to compare items that are not strings, you may have to create your own custom object that implements `IComparable`.

```public class ComparableStringReader : Diff.ComparableStreamReader
{
protected string    _source;
protected int        _location;
{
_source = source;
_location = 0;
}

public override IComparable GetNext()
{
if ( _location < _source.Length )
return _source.Substring(_location++, 1);
else
return null;
}
}```

Once you have created a reader class, you just need to create an instance of the `DiffEngine` class. `DiffEngine` has two properties, and one method, `Compare()`. Compare takes in 2 `ComparableStreamReader`s, the source (original), and the destination (changed document). `Compare()` returns an `EditScript` object, which is just a collection of `DiffCommands` objects. Before I go into the `DiffCommand` objects, however, I want to mention the two properties. The property `WindowSize` is an integer that defaults to 256. You have to be careful with this number however. The higher it goes, the more exact and concise the Diff script will be, but it uses memory at a rate of (n2/8). That's right. So at 256, we are only using 8K of memory to store the matrix (it's actually called an Edit Graph). However, if we decide to move up to 1024, we are not using 128K. Also, the higher the window size, the longer it takes to compute, at about the same rate (I am not really sure how to use big-oh notation, only read it, so I don't give it). 255 should work fine for most cases, especially if you are comparing entire lines, one at a time.

```DiffEngine de = new DiffEngine();
de.WindowSize = 10;        // very small window for a small script

// do the compare
EditScript es = de.Compare(csrSource, csrDest);```

Now, once the `Compare()` method returns, it returns a script (that same script is available through the `Script` property). The `EditScript`, like I said before, is a collection of `DiffCommand`s. Each command represents an operation that must be performed against the source (original) to produce the destination (changed document). There are three types of commands, which are members of the `DiffCommandType` `enum`. There's `Insert`, `Delete`, and `Skip`. The `Skip` and `Delete` commands utilize the `Length` property of `DiffCommand` class, and the `Insert` command utilities the `Value` property. The `Length` property just represents the number of times to repeat the operation. The `Value` property represents the object to be inserted, and it is also the exact object that was returned from `ComparableStreamReader.GetNext()`.

In the sample, I have used the `EditScript` to provide an exploded view of the source and destination strings so that you can see what they had in common. I also generated a very simple Diff script that could be used to recreate the destination from the original.

```B i t     Matrix
N ot The Matrix
-1+N-1+o*2+T+h+e+ *6```

In this example, the Diff script is using three operators - '*' for `Skip` followed by the number of characters to skip, '-' for `Delete` followed by the number of characters to delete, and '+' for `Insert`, followed by the character to insert.

Eventually, I will write a method to merge a source and an `EditScript` to recreate the original, but first, I want to do some better testing just to make sure that this algorithm is actually viable. If you use it successfully, please let me know. Also, if you have any recommendations for improving the algorithm, also let me know.

## Points of Interest

There is a reusable `BitMatrix` class that this algorithm employs to cut down on memory use. If you want, you can change it to use just `bool` values. This may or not speed up the algorithm, since you would be using 8 times more memory (unless the C# compiler does lots of optimizing!).

## History

• 4/21/2004 - Version 0.9 - released.

A list of licenses authors might use can be found here

## Share

 Web Developer United States
I am the Sr. Applications Designer for the U.S. Olympic Committee. I have been writing software for over ten years, starting with QBasic when I was 10. I do not have a Computer Science degree, but I am very interested in advanced algorithms, so I learn as much as possible (although I must admit, not knowing calculus does make it difficult at times). I am always trying to find the next really difficult problem to solve programatically.

## You may also be interested in...

 FirstPrev Next
 Re: Why duplicate the items? Michael Potter4-May-04 3:27 Michael Potter 4-May-04 3:27
 Name of the algorithm mav.northwind21-Apr-04 20:21 mav.northwind 21-Apr-04 20:21
 Re: Name of the algorithm aprenot22-Apr-04 5:17 aprenot 22-Apr-04 5:17
 Re: Name of the algorithm aprenot22-Apr-04 5:19 aprenot 22-Apr-04 5:19
 Interesting Mark Focas21-Apr-04 13:00 Mark Focas 21-Apr-04 13:00
 Last Visit: 19-May-19 23:41     Last Update: 19-May-19 23:41 Refresh « Prev12