14,035,248 members
Add your own
alternative version

#### Stats

172.5K views
1.3K downloads
75 bookmarked
Posted 20 Apr 2004
Licenced

# A Generic - Reusable Diff Algorithm in C#

, 20 Apr 2004
A generic algorithm that can be used to find the difference between objects.

## Introduction

This is a generic algorithm for finding the difference between 2 objects, no matter what the type. The algorithm uses the `ICompare` interface to tell if the items are the same. Maybe this could be used for producing a WiKi site in ASP.NET and track revisions.

I just wanted to let all you readers know, I don't have a degree in Computer Science and I haven't read or learned much about algorithm design. I haven't ever really analyzed a Diff algorithm before, but I know a little about them. If there is a way of improving this, or if I am going in the completely wrong direction, please let me know.

## Background

I am not really sure of the intimate details of how other Diff algorithms work (like the UNIX Diff utility), but it does use some ideas that I got from general reading on the topic. The Diff Engine uses a bit matrix that employs a sliding window of the 2 files. The matrix looks kind of like this:

The black squares are the squares with matching items. The algorithm starts at 0,0 and looks all the way along the x-axis to find a diagonal sequence, which is 2 or more matching characters in a row. The algorithm looks for the longest sequence, using the the closer of any that are the same length. It then looks down the y-axis, and if it finds a sequence that is larger and closer than any found on the x-axis, it uses that sequence. If it finds a sequence, it performs whatever operation is necessary to get to that sequence. In this case, it wouldn't find a diagonal sequence on the 0 x-axis or the y-axis, so it performs its standard operation, which is to delete the 'B' and insert the 'N'. It then increments our x and y location, so we are now at 1,1. Again, we do not find a diagonal sequence, so we perform the standard operation again. However, at 2,2 we find that we are in a sequence.

Once we find a sequence, first we slide the window down to where we are, since we won't be needing any of the previous characters any more. Then, we continually increment x and y until they reach a non matching square. At 4,4 there is no match, so we switch states again, meaning that we slide the window down, and once again look for a diagonal. Now, on the x-axis 4, we find a diagonal at 4,8, so now must perform whatever is necessary to get there. Moving in the y direction is an insertion, so we insert 'T', 'h', 'e' and ' '. At that point, we enter enter the diagonal, save state, etc.

We follow the diagonal down to the end from there. Pretty straight forward, right?

## Using the code

I wrote this code so that it would be as generic as possible, so I built it around 3 basic classes and an interface. There are actually more classes but, I will get to those in a minute.

First, to do a comparison, you will need to subclass `ComparableStreamReader`. This is an abstract class that I created, and it has only one method, `GetNext()`, which returns an object that implements an `IComparable` object. This is really the trick to this. In the demo, I have included a class called `ComparableStringReader`, which just returns characters from a string, one at a time. If you look at the code, you will notice that it is actually returning strings with one character and not char objects. What can I say - string already implements `IComparable`, and I am lazy. However, if you want to compare items that are not strings, you may have to create your own custom object that implements `IComparable`.

```public class ComparableStringReader : Diff.ComparableStreamReader
{
protected string    _source;
protected int        _location;
public ComparableStringReader(string source)
{
_source = source;
_location = 0;
}

public override IComparable GetNext()
{
if ( _location < _source.Length )
return _source.Substring(_location++, 1);
else
return null;
}
}```

Once you have created a reader class, you just need to create an instance of the `DiffEngine` class. `DiffEngine` has two properties, and one method, `Compare()`. Compare takes in 2 `ComparableStreamReader`s, the source (original), and the destination (changed document). `Compare()` returns an `EditScript` object, which is just a collection of `DiffCommands` objects. Before I go into the `DiffCommand` objects, however, I want to mention the two properties. The property `WindowSize` is an integer that defaults to 256. You have to be careful with this number however. The higher it goes, the more exact and concise the Diff script will be, but it uses memory at a rate of (n2/8). That's right. So at 256, we are only using 8K of memory to store the matrix (it's actually called an Edit Graph). However, if we decide to move up to 1024, we are not using 128K. Also, the higher the window size, the longer it takes to compute, at about the same rate (I am not really sure how to use big-oh notation, only read it, so I don't give it). 255 should work fine for most cases, especially if you are comparing entire lines, one at a time.

```DiffEngine de = new DiffEngine();
de.WindowSize = 10;        // very small window for a small script

// create string readers
ComparableStringReader csrSource = new ComparableStringReader(txtSource.Text);
ComparableStringReader csrDest = new ComparableStringReader(txtDestination.Text);

// do the compare
EditScript es = de.Compare(csrSource, csrDest);```

Now, once the `Compare()` method returns, it returns a script (that same script is available through the `Script` property). The `EditScript`, like I said before, is a collection of `DiffCommand`s. Each command represents an operation that must be performed against the source (original) to produce the destination (changed document). There are three types of commands, which are members of the `DiffCommandType` `enum`. There's `Insert`, `Delete`, and `Skip`. The `Skip` and `Delete` commands utilize the `Length` property of `DiffCommand` class, and the `Insert` command utilities the `Value` property. The `Length` property just represents the number of times to repeat the operation. The `Value` property represents the object to be inserted, and it is also the exact object that was returned from `ComparableStreamReader.GetNext()`.

In the sample, I have used the `EditScript` to provide an exploded view of the source and destination strings so that you can see what they had in common. I also generated a very simple Diff script that could be used to recreate the destination from the original.

```B i t     Matrix
N ot The Matrix
-1+N-1+o*2+T+h+e+ *6```

In this example, the Diff script is using three operators - '*' for `Skip` followed by the number of characters to skip, '-' for `Delete` followed by the number of characters to delete, and '+' for `Insert`, followed by the character to insert.

Eventually, I will write a method to merge a source and an `EditScript` to recreate the original, but first, I want to do some better testing just to make sure that this algorithm is actually viable. If you use it successfully, please let me know. Also, if you have any recommendations for improving the algorithm, also let me know.

## Points of Interest

There is a reusable `BitMatrix` class that this algorithm employs to cut down on memory use. If you want, you can change it to use just `bool` values. This may or not speed up the algorithm, since you would be using 8 times more memory (unless the C# compiler does lots of optimizing!).

## History

• 4/21/2004 - Version 0.9 - released.

## License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.

A list of licenses authors might use can be found here

## About the Author

 Web Developer United States
I am the Sr. Applications Designer for the U.S. Olympic Committee. I have been writing software for over ten years, starting with QBasic when I was 10. I do not have a Computer Science degree, but I am very interested in advanced algorithms, so I learn as much as possible (although I must admit, not knowing calculus does make it difficult at times). I am always trying to find the next really difficult problem to solve programatically.

## Comments and Discussions

 First PrevNext
 Question XieGuiYan2-Mar-06 20:31 XieGuiYan 2-Mar-06 20:31
 Re: Question aprenot3-Mar-06 5:24 aprenot 3-Mar-06 5:24
 A better implementation Sire4047-Feb-05 22:21 Sire404 7-Feb-05 22:21
 Re: A better implementation Sire4047-Feb-05 23:03 Sire404 7-Feb-05 23:03
 good article fvcs20-Sep-04 18:08 fvcs 20-Sep-04 18:08
 A big THANK YOU (and some ideas) akjohnston23-Aug-04 3:31 akjohnston 23-Aug-04 3:31
 Working But ... shivpal26-Jun-04 1:30 shivpal 26-Jun-04 1:30
 Re: Working But ... Anonymous27-Jun-04 18:47 Anonymous 27-Jun-04 18:47
 Diff in Flash eddie1220815-Jun-04 11:28 eddie12208 15-Jun-04 11:28
 Demo Program rossryan30-May-04 18:29 rossryan 30-May-04 18:29
 Re: Demo Program aprenot30-May-04 18:32 aprenot 30-May-04 18:32
 Re: Demo Program rossryan30-May-04 21:20 rossryan 30-May-04 21:20
 Re: Demo Program aprenot31-May-04 5:23 aprenot 31-May-04 5:23
 Re: Demo Program naughton10-Nov-08 5:29 naughton 10-Nov-08 5:29
 similar function in vb/vb.net? Huisheng Chen9-May-04 16:00 Huisheng Chen 9-May-04 16:00
 Re: similar function in vb/vb.net? aprenot9-May-04 18:58 aprenot 9-May-04 18:58
 Re: similar function in vb/vb.net? akjohnston23-Aug-04 3:10 akjohnston 23-Aug-04 3:10
 Re: similar function in vb/vb.net? kbugg21-Oct-04 2:43 kbugg 21-Oct-04 2:43
 Re: similar function in vb/vb.net? akjohnston21-Oct-04 6:24 akjohnston 21-Oct-04 6:24
 Why duplicate the items? Michael Potter22-Apr-04 4:32 Michael Potter 22-Apr-04 4:32
 Re: Why duplicate the items? aprenot22-Apr-04 4:50 aprenot 22-Apr-04 4:50
 Actually, I had thought about using an array to contain the two lists of items originally. However, that would require the entire source and destination be loaded into memory. I decided to go with a stream based method so that I only have to have the current window's information in memory at one time, which is stored in the 2 buffers in the `DiffBitMatrix`. So, as with your example of doing a line by line comparison, if I have 2 - 4000 line files, each with approximately 30 characters per line we now have 240k in memory, which granted isn't much, but it would be drastically less if I was using a 100x100 byte window, which would only require 6k ((100+100)*30) to be in memory. Of course we would always have the 1250 bytes to hold the 100x100 matrix. Now granted there is the 2 arrays of pointers - so we have to add an extra 800 bytes, but at 8050 bytes is still way less than 240k. I still don't see how a list would be more efficient. However, maybe I am misunderstanding you (i dont speak english as well as I speak C# or C++). If I am, please reiterate what you mean in more detail. aprenot
 Re: Why duplicate the items? Michael Potter22-Apr-04 5:53 Michael Potter 22-Apr-04 5:53
 Re: Why duplicate the items? aprenot22-Apr-04 6:04 aprenot 22-Apr-04 6:04
 Re: Why duplicate the items? Michael Potter23-Apr-04 8:12 Michael Potter 23-Apr-04 8:12
 Re: Why duplicate the items? aprenot23-Apr-04 8:16 aprenot 23-Apr-04 8:16
 Last Visit: 24-Apr-19 12:31     Last Update: 24-Apr-19 12:31 Refresh 12 Next »

General    News    Suggestion    Question    Bug    Answer    Joke    Praise    Rant    Admin

Use Ctrl+Left/Right to switch messages, Ctrl+Up/Down to switch threads, Ctrl+Shift+Left/Right to switch pages.

Permalink | Advertise | Privacy | Cookies | Terms of Use | Mobile
Web03 | 2.8.190424.1 | Last Updated 21 Apr 2004
Article Copyright 2004 by aprenot
Everything else Copyright © CodeProject, 1999-2019
Layout: fixed | fluid