12,824,299 members (42,765 online)
alternative version

#### Stats

440.5K views
317 bookmarked
Posted 3 May 2004

# A Generic, Reusable Diff Algorithm in C# - II

, 10 Jun 2004 Public Domain
 Rate this:
A reusable difference engine written in C#.

## Introduction

During one of my daily visits to CodeProject, I ran across an excellent article by Aprenot: A Generic - Reusable Diff Algorithm in C#. Aprenot’s method of locating the Longest Common Sequence of two sequential sets of objects works extremely well on small sets. As the sets get larger, the algorithm begins experiencing the constraints of reality.

At the heart of the algorithm is a table that stores the comparison results of every item in the first set to every item in the second set. Although this method always produces a perfect difference solution, it can require an enormous amount of CPU and memory. The author solved these issues by only comparing small portions of the two data sets at a time. However, this solution makes the assumption that the changes between the data sets are very close together, and reports inefficient results when the data sets are large with dispersed changes.

This article will present an algorithm based on the one presented by aprenot. The goals of the algorithm are as follows:

1. Maintain the original concepts of Generic and Reusable.
2. Correctly handle large data sets.
3. Greatly lower the number of comparisons necessary.
4. Greatly lower the memory requirements.

## The Problem Defined

Given a data set of 100,000 items, with the need to compare it to a data set of similar size, you can quickly see the problem with using a table to hold the results. If each element in the result table was 1 bit wide, you would need over a Gigabyte of memory. To fill the table, you would need to execute 10 billion comparisons.

## Some Terminology

There are always two sets of items to compare. To help differentiate between the two sets, they will be called Source and Destination. The question we are usually asking is: What do we need to do to the Source list to make it look like the Destination list?

## Generic and Reusable

In order to maintain the generic aspects of the original algorithm, a generic structure is needed. I chose to make use of C#’s `interface`.

```public interface IDiffList
{
int Count();
IComparable GetByIndex(int index);
}```

Both the Destination list and Source list must inherit `IDiffList`. It is assumed that the list is indexed from 0 to `Count()`-1. Just like the original article, the `IComparable` interface is used to compare the items between the lists.

Included in the source code are two structures that make use of this interface; `DiffList_TextFile` and `DiffList_BinaryFile`. They both know how to load their respective files into memory and return their individual items as `IComparable` structures. The source code examples for these objects should be more than adequate for expanding the system to other object types. For example, the system can be easily expanded to compare rows between `DataSet`s or directory structures between drives.

## The Overall Solution

The problem presented earlier is very similar to the differences in sorting algorithms. A shell sort is very quick to code but highly inefficient when the data set gets large. A quick sort is much more efficient on a large dataset. It breaks up the data into very smaller chunks using recursion.

The approach this algorithm takes is similar to that of a quick sort. It breaks up the data into very smaller chunks and processes those smaller chunks through recursion. The steps the algorithm takes are as follows:

1. Find the current Longest Matching Sequence (LMS) of items.
2. Store this LMS in a results pile.
3. Process all data left above the LMS using recursion.
4. Process all data left below the LMS using recursion.

These steps recursively repeat until there is no more data to process (or no more matches are found). At first glance, you should be able to easily understand the recursion logic. What needs further explanation is Step 1. How do we find the LMS without comparing everything to everything? Since this process is called recursively, won’t we end up re-comparing some items?

## Finding the Longest Matching Sequence

First, we need to define where to look for the LMS. The system needs to maintain some boundaries within the Source and Destination lists. This is done using simple integer indexes called `destStart`, `destEnd`, `sourceStart` and `sourceEnd`. At first, these will encompass the entire bounds of each list. Recursion will shrink their ranges with each call.

To find the LMS, we use brute force looping with some intelligent short circuits. I chose to loop through the Destination items and compare them to the available Source items. The pseudo code looks something like this:

```For each Destination Item in Destination Range
Jump out if we can not mathematically find a longer match
Find the LMS for this destination item in the Source Range
If there is a match sequence
If it’s the longest one so far – store the result.
Jump over the Destination Items that are included in this sequence.
End if
Next For

If we have a good best match sequence
Store the match in a final match result list
If there is space left above the match in both
the Destination and Source Ranges
Recursively call function with the upper ranges
If there is space left above the match in both
The Destination and Source Ranges
Recursively call function with upper ranges
Else
There are no matches in this range so just drop out
End if```

The Jumps are what gives this algorithm a lot of its speed. The first mathematical jump looks at the result of some simple math:

```maxPossibleDestLength = (destEnd – destIndex) + 1;
if (maxPossibleDestLength <= curBestLength) break;```

This formula calculates the theoretical best possible match the current Destination item can produce. If it is less than (or equal too) the current best match length then there is no reason to continue in the loop through the rest of the Destination range.

The second jump is more of a leap of faith. We ignore overlapping matching sequences by jumping over the destination indexes that are internal to a current match sequence, and therefore, cut way down on the number of comparisons. This is a really big speed enhancement. If the lists we are testing contain a lot of repetitive data, we may not come up with the perfect solution, but we will find a valid solution fast. I found in testing that I had to manually create a set of files to demonstrate the imperfect solution. In practice, it should be very rare. You can comment out this jump in the source code and run your own tests. I have to warn you that it will greatly increase the calculation time on large data sets.

There is a very good chance during recursion that the same Destination Item will need to be tested again. The only difference in the test will be the width of the Source Range. Since the goal of the algorithm is to lower the number of comparisons, we need to store the previous match results. `DiffState` is the structure that stores the result. It contains an index of the first matching item in the source range and a length so that we know how far the match goes for. `DiffStateList` stores the `DiffState`s by destination index. The loop simple requests the `DiffState` for a particular destination index, and `DiffStateList` either returns a pre-calculated one or a new uncalculated one. There is a simple test performed to see if the `DiffState` needs to be recalculated given the current Destination and source ranges. `DiffState` will also store a status of 'No Match' when appropriate.

If a `DiffState` needs to be recalculated, an algorithm similar to the one above is called.

```For each Source Item in Source Range
Jump out if we can not mathematically find a longer match
Find the match length for the Destination Item
on the particular Source Index
If there is a match
If it’s the longest one so far – store the result
Jump over the Source Items that are included in the match
End if
End for

Store the longest sequence or mark as 'No Match'```

The Jumps are again giving us more speed by avoiding unnecessary item comparisons. I found that the second jump cuts the run time speed by 2/3rds on large data sets. Finding the match length for a particular destination item at a particular source index is just the result of comparing the item lists in sequence at those points and returning the number of sequential matches.

## Gathering the Results

After the algorithm has run its course, you will be left with an `ArrayList` valid match objects. I use an object called `DiffResultSpan` to store these matches. A quick sort of these objects will put them in the necessary sequential order. `DiffResultSpan` can also store the necessary delete, addition and replace states within the comparison.

To build the final result `ArrayList` of ordered `DiffResultSpan`s, we just loop through the matches filling in the blank (unmatched) indexes in between. We return the result as an ordered list of `DiffResultSpan`s that each contains a `DiffResultSpanStatus`.

```public enum DiffResultSpanStatus
{
NoChange, //matched
Replace,
DeleteSource,
}

public class DiffResultSpan : IComparable
{
private int _destIndex;
private int _sourceIndex;
private int _length;
private DiffResultSpanStatus _status;

public int DestIndex {get{return _destIndex;}}
public int SourceIndex {get{return _sourceIndex;}}
public int Length {get{return _length;}}
public DiffResultSpanStatus Status {get{return _status;}}

//.... other code removed for brevity
}```

You can now process the `ArrayList` as is necessary for your application.

## Algorithm Weaknesses

When using the algorithm described above, and when the data sets are completely different, it will compare every item in both sets before it finds out that there are no matches. This can be a very time consuming process on large datasets. On the other hand, if the data sets are equivalent, it will find this out in one iteration of the main loop.

Although it should always find a valid difference, there is a chance that the algorithm will not find the best answer. I have included a set of text files that demonstrate this weakness (source.txt and dest.txt). There is a sequence of 5 matches that is missed by the system because it is overlapped by a previous smaller sequence.

## Algorithm Update 6/10/2004

To help address the 2nd weakness above, three levels of optimization were added to the Diff Engine. Tests identified that large, highly redundant data produced extremely poor difference results. The following `enum `was added:

```public enum DiffEngineLevel
{
FastImperfect, //original level
Medium,
SlowPerfect
}
```

This can be passed to an additional `ProcessDiff()` method. The engine is still fully backward compatible and will default to the original Fast method. Only tests can identify if the `Medium` or `SlowPerfect` levels are necessary for your applications. The speed differences between the settings are quite large.

The differences in these levels effect when or if we jump over sections of the data when we find existing match runs. If you are interested, you will find the changes in the `ProcessRange()` method. They start on line 107 of Engine.cs.

## Included Files

The project DiffCalc is the simple front-end used to test the algorithm. It is capable of doing a text or binary diff between two files.

The DLL project DifferenceEngine is where all the work is done.

• Engine.cs: Contains the actual diff engine as described above.
• Structures.cs: Contains the structures used by the engine.
• TextFile.cs: Contains the structures designed to handle text files.
• BinaryFile.cs: Contains the structures designed to handle binary files.

## Special Note

The first line in Structures.cs is commented out. If you uncomment this line, you will lower the memory needs of the algorithm by using a `HashTable` instead of an allocated array to store the intermediate results. It does slow down the speed by a percent or two. I believe the decrease is due to the reflection that becomes necessary. I am hoping that the future .NET Generics will solve this issue.

## Summary

If you take a step back from the algorithm, you will see that it is essentially building the table described in aprenot's original article. It simply attempts to intelligently jump over large portions of the table. In fact, the more the two lists are the same, the quicker the algorithm will run. It also stores what it does calculate in a more efficient structure instead of rows & columns.

Hopefully, others will find good uses for this code. I am sure there are some more optimizations that will become apparent over time. Drop me a note when you find them.

## History

• May 18, 2004 'Off by One' error on line 106 in Engine.cs fixed. Thanks goes to Lutz Hanusch for identifying the bug.
• May 27, 2004 Incrementor was missing on line 96 in Results.cs in the demo. Thanks goes to Cash Foley for identifying the bug.
• June 10, 2004 Speed optimizations led to very poor diff results on redundant data. Two slower diff engine levels have been added to solve this problem. Thanks goes to Rick Morris for identifying this weakness.

## Share

 Chief Technology Officer United States
No Biography provided

## You may also be interested in...

 Re: Excellent ! Can we have a sample showing diffs on a word or character level ? shahrukhkiyani23-Feb-13 10:57 shahrukhkiyani 23-Feb-13 10:57
 Thank you with a 5 and slight improvement Kamran Behzad24-Jul-11 19:07 Kamran Behzad 24-Jul-11 19:07
 possible bug? Chuck141114-Jun-11 9:01 Chuck1411 14-Jun-11 9:01
 Reverting text to a previous state jsoldi1-Mar-11 9:33 jsoldi 1-Mar-11 9:33
 My vote of 5 whawke17-Feb-11 6:26 whawke 17-Feb-11 6:26
 Large Files > 2GB aselby119-Jul-10 7:08 aselby11 9-Jul-10 7:08
 FABULOUS Member 279070112-Jan-10 19:57 Member 2790701 12-Jan-10 19:57
 Why this one wins cmschick5-Dec-09 2:55 cmschick 5-Dec-09 2:55
 10+ from me. I love this work, it's clean and concise and doesn't fail on a 96 megabyte file like the Eugene Myers algorithm. It may not have the backing of the big league but I cannot find a failure for the life of me.I have been researching difference algorithms that have been ported to .NET for about a week now. The first one I decided to go with was an implementation of the Eugene Myers diff algorithm as it seemed bullet proof and there are visual descriptions about how it works. Not to mention it seems to be the standard and creme de la creme of differencing algorithms according to the majority of the world. You can find implementations here [^]and here[^].The logic seems undeniable. I must have read the white paper 5 times and reviewed at least 4 implementations each of which I tried.However, I've had nothing but problems with all of them. The returned differencing items are very unreliable from set to set and the amount of code required to do anything real is insane. There are structures such as StartA, StartB, InsertedB, and DeletedA. The problem lies in the interpretation of the results from one set to another. The structures cannot be evaluated stand alone as the values in the structure mean different things depending on the change type (inserted, deleted, unchanged etc...).Here's an example that is just plain impossible with the algorithm. When you require multiple change sets to be diffed and merged back into one to get the complete view of all of the changes. Anyone know how subversion does this? (I know probably the Myers diff algorithm).Here's a full blown example of the problem (no the auto-text is not always all caps):FILE AThis is file A's textit has multiple lines that arealways different than file C.FILE BThis is file B's text. This text is alwaysdifferent than file A's text. These two filesare really sections of File C.Because this section is much longer than file AIt may have multiple inserts from some otherprogram.FILE CThis is the main file it contains multiple lines and contains the text of both file A and file B. Howeveranother program had this file before I did and insertedsome text in between the file A text and some in between thefile B text like this:This is file A's textit has multiple lines that areSOMEONE PUT SOME TEXT HEREalways different than file C.This is file B's text. This text is alwaysSOME AUTO FOOTER TEXT THAT SOME PROGRAM ENTEREDdifferent than file A's text. These two filesare really sections of File C.Because this section is much longer than file AIt may have multiple inserts from some otherSOME AUTO FOOTER TEXT THAT SOME PROGRAM ENTEREDprogram.__________________________________________________________________Now after you have saved the results of both diffs try to figure out howto replace the A file text in File C with a token like this {FileA} and the same forfile B like this {FileB}. You cannot, at least not reliably. Because the diffItems startA, startB, and insertdB do not make sense from set to set and you cannot properly determine the line numbers in the source and destination files. What am I missing? I would love it if you could change my mind.
 Why MaxLineLength = 1024 helschaee24-Sep-09 22:13 helschaee 24-Sep-09 22:13
 Can we do same thing in asp.net? nidjain18-Sep-09 20:07 nidjain 18-Sep-09 20:07
 help with asp.net Rico Marcelo24-May-09 14:16 Rico Marcelo 24-May-09 14:16
 Re: help with asp.net jcrumble19-Aug-11 7:29 jcrumble 19-Aug-11 7:29
 Some bug Sergey Stoyan6-Feb-09 5:43 Sergey Stoyan 6-Feb-09 5:43
 Re: Some bug Moshe Ventura16-May-10 23:24 Moshe Ventura 16-May-10 23:24
 Re: Some bug Sergey Stoyan17-May-10 9:16 Sergey Stoyan 17-May-10 9:16
 Thanks & well done Mark Storen28-Oct-08 22:32 Mark Storen 28-Oct-08 22:32
 Re: Thanks & well done Paul Stamp3-Mar-09 10:31 Paul Stamp 3-Mar-09 10:31
 Is there any license for the demo project? Gigi Tse18-Jul-08 2:30 Gigi Tse 18-Jul-08 2:30
 Re: Is there any license for the demo project? Michael Potter18-Jul-08 4:06 Michael Potter 18-Jul-08 4:06
 Re: Is there any license for the demo project? Gigi Tse18-Jul-08 23:17 Gigi Tse 18-Jul-08 23:17
 Is IComparable really necessary? maphix1-Jul-08 1:50 maphix 1-Jul-08 1:50
 Presenting the result in percents? S. Kolic29-May-08 8:01 S. Kolic 29-May-08 8:01
 Re: Presenting the result in percents? Kamran Behzad24-Jul-11 19:16 Kamran Behzad 24-Jul-11 19:16
 What algorithm this one is based on? Member 436242422-Dec-07 2:06 Member 4362424 22-Dec-07 2:06
 Terms and conditions of usage Ali Ozgur22-Aug-07 23:04 Ali Ozgur 22-Aug-07 23:04
 Last Visit: 31-Dec-99 19:00     Last Update: 27-Mar-17 21:52 Refresh 123456 Next »