12,997,469 members (58,425 online)
alternative version

#### Stats

22.4K views
39 bookmarked
Posted 25 Apr 2009

# A Generic Differential Comparison Generator Using Longest Common Substring

, 25 Apr 2009
 Rate this:
An algorithm for describing differences between lists

## Introduction

I was recently tasked with creating a system that compares 2 versions of a legal document, and shows any edits which have been made. I chose to solve this problem by using a recursive "divide and conquer" approach, leveraging a modified version of the Longest Common Substring algorithm.

## Background

The Longest Common Substring algorithm (LCS) has been around for a while, and somewhat recently received a renewed round of interest as a way to compare sequences of  DNA. A great writeup of the details of how it works can be found here. Most standard implementations of the LCS return the length of the longest common substring; I have modified my  implementation to also return the starting positions of the LCS in both input lists.

The general idea of my code is that it takes as input 2 instances of `IList<T>` , where `T` implements `IComparable<T>`.  The longest common subsequence of these lists is found, and stored  in a `DiffItem` object, marked as an "Unchanged" sequence.  This effectively splits both lists into 3 regions; everything before the LCS, the LCS, and everything after it.  The leading region of the lists before the LCS, in turn, has its LCS calculated, thus splitting that region again into 3 sections; the same is done for the trailing sequence.   The recursion halts once there are no more unchecked sections in either list.

## Using the Code

The primary function is a `static `method of the `Diff` class, called `GetDiff`, which takes 2 `IList<T>` and returns a `List<DiffItem>`. The `DiffItem` class is simply a holder class which contains a `List<T>` and a `DiffType` enum value, which can be either Unchanged, Inserted, or Deleted. This collection of `DiffItem`s  can be used to do the following things:

• To just see the elements in the original list, walk the collection and return only the Unchanged and Deleted items
• To just see the elements of the new list, walk the collection and return only the Unchanged and Added items
• To show the comparisons of what has changed, walk the collection and output the Unchanged items normally, and apply some modification or formatting to the Added and Deleted items

The attached solution contains 2 projects. `DiffDesc` contains the core classes of the algorithm, and `DiffDescTest` is a testing console app which looks for 2 text files in the executing directory called "Original.txt" and "Changed.txt".  It loads these files, splitting each into a list of `string`s, where each `string `is a word (delimited on spaces and newlines). Once the set of `DiffItems `is created, a report is generated in the executing directory called "Differential.htm" which shows both inputs and the final markup showing changes, similar to that shown below.

 Original Changed Differential This is the first sentence, and won't be changed. This sentence will be removed. Finally, these closing words will also remain unchanged. This is the first sentence, and won't be changed. I am new content which was inserted!!! Finally, these closing words will also remain unchanged. This is the first sentence, and won't be changed. This sentence will be removed. I am new content which was inserted!!! Finally, these closing words will also remain unchanged.

## Points of Interest

My first implementation did character comparisons instead of whole-word comparisons; however I found this to be "too sensitive", detecting changes that did not actually have any semantic significance.  Even with it performing word comparisons, however, I still see some surprising output at times.  Given the original domain of intent of this solution, which is comparing legal documents, this makes sense; my brain reads the document not as a sequence of words, but as  a sequence of concepts or assertions.  The real question being asked is "What has changed in the meaning of these 2 documents", not so much a minimal edit distance of words or characters. However, a true semantic difference detector is just a tad outside of the scope of this project.  If anyone gets such a thing working, shoot me a message on your way to the bank!

Another enhancement which I am mulling over is that the documents which I will be comparing will most likely contain HTML markup. This means that when I go to insert spans in the differential report, they break proper HTML nesting rules, as the algorithm is unaware of the difference between markup and content. I am leaning towards a solution in which the first phase of preparation which splits the documents will be able to just grab content, but will remember what markup was in between the content regions so that it can be re-inserted into the final output.

## History

• 4/25/2009: Posted original version

## Share

 Architect United States
I run the technology group for a financial services firm in Boston, creating software largely for internal use which automates and assists in most all aspects of the firms operation. I tend to gravitate towards middle tier and back end development, preferring the pure abstractions of databases, algorithms and engines. I started out writing QuickBasic when I was about 10, then got in to VB, then VB.Net when it came out; for the past couple of years I've been rather squarely in the squiggly bracketed domain of C#.

## You may also be interested in...

 Pro

 First Prev Next
 My vote of 5 Filip D'haene26-May-11 0:31 Filip D'haene 26-May-11 0:31
 can´t wait for the html-able version ultimatediddy24-Aug-09 4:06 ultimatediddy 24-Aug-09 4:06
 My complument gianluigis2526-Apr-09 3:00 gianluigis25 26-Apr-09 3:00
 I've experimented the same problem just an year ago and I was quite surprised when I find out that no one had faced it in a serious way. Most of Diff program are just ridicolous. They just compare words or charachters but no one tried to realize an intelligent diff algorythm. The better would be to extract a template from a set of many documents. And it could be easy to do based on your software.
 Last Visit: 31-Dec-99 18:00     Last Update: 22-Jun-17 23:00 Refresh 1