I was recently tasked with creating a system that compares 2 versions of a legal document, and shows any edits which have been made. I chose to solve this problem by using a recursive "divide and conquer" approach, leveraging a modified version of the Longest Common Substring algorithm.
The Longest Common Substring algorithm (LCS) has been around for a while, and somewhat recently received a renewed round of interest as a way to compare sequences of DNA. A great writeup of the details of how it works can be found here. Most standard implementations of the LCS return the length of the longest common substring; I have modified my implementation to also return the starting positions of the LCS in both input lists.
The general idea of my code is that it takes as input 2 instances of
IList<T> , where
IComparable<T>. The longest common subsequence of these lists is found, and stored in a
DiffItem object, marked as an "Unchanged" sequence. This effectively splits both lists into 3 regions; everything before the LCS, the LCS, and everything after it. The leading region of the lists before the LCS, in turn, has its LCS calculated, thus splitting that region again into 3 sections; the same is done for the trailing sequence. The recursion halts once there are no more unchecked sections in either list.
Using the Code
The primary function is a
static method of the
Diff class, called
GetDiff, which takes 2
IList<T> and returns a
DiffItem class is simply a holder class which contains a
List<T> and a
DiffType enum value, which can be either Unchanged, Inserted, or Deleted. This collection of
DiffItems can be used to do the following things:
- To just see the elements in the original list, walk the collection and return only the Unchanged and Deleted items
- To just see the elements of the new list, walk the collection and return only the Unchanged and Added items
- To show the comparisons of what has changed, walk the collection and output the Unchanged items normally, and apply some modification or formatting to the Added and Deleted items
The attached solution contains 2 projects.
DiffDesc contains the core classes of the algorithm, and
DiffDescTest is a testing console app which looks for 2 text files in the executing directory called "Original.txt" and "Changed.txt". It loads these files, splitting each into a list of
strings, where each
string is a word (delimited on spaces and newlines). Once the set of
DiffItems is created, a report is generated in the executing directory called "Differential.htm" which shows both inputs and the final markup showing changes, similar to that shown below.
|This is the first sentence, and won't be changed. This sentence will be removed. Finally, these closing words will also remain unchanged. ||This is the first sentence, and won't be changed. I am new content which was inserted!!! Finally, these closing words will also remain unchanged. ||This is the first sentence, and won't be changed. This sentence will be removed. I am new content which was inserted!!! Finally, these closing words will also remain unchanged. |
Points of Interest
My first implementation did character comparisons instead of whole-word comparisons; however I found this to be "too sensitive", detecting changes that did not actually have any semantic significance. Even with it performing word comparisons, however, I still see some surprising output at times. Given the original domain of intent of this solution, which is comparing legal documents, this makes sense; my brain reads the document not as a sequence of words, but as a sequence of concepts or assertions. The real question being asked is "What has changed in the meaning of these 2 documents", not so much a minimal edit distance of words or characters. However, a true semantic difference detector is just a tad outside of the scope of this project. If anyone gets such a thing working, shoot me a message on your way to the bank!
Another enhancement which I am mulling over is that the documents which I will be comparing will most likely contain HTML markup. This means that when I go to insert spans in the differential report, they break proper HTML nesting rules, as the algorithm is unaware of the difference between markup and content. I am leaning towards a solution in which the first phase of preparation which splits the documents will be able to just grab content, but will remember what markup was in between the content regions so that it can be re-inserted into the final output.
- 4/25/2009: Posted original version