A word-wise HTML text compare and merge engine

Hongwei Shen

4.98/5 (30 votes)

Aug 27, 2005

3 min read

119757

1640

A C# .NET implemntation of HTML text compare and merge engine based on a similar algorithm as the Unix diff.

Download source files - 38.4 Kb

Introduction

The initial motivation for writing this code was to compare the staging version with the current production version of the HTML document inside RainbowPortal when workflow is turned on, in order to provide a way for editor and/or approver to visually compare what is modified. The code is not published in RainbowPortal.

After some research, I chose to use the algorithm proposed by Eugene W. Myers in his paper "An O(ND) Difference Algorithm and its Variation" which is also the algorithm used by Unix Diff. A copy of the file can be found here.

The Unix Diff implementation found here is referenced during the implementation.

Design

The overall requirement can be summarized briefly as:

The to-be-compared is HTML text which may contain HTML tags, HTML comments, whitespaces, and special characters such as " " "&#123" etc.
The documents should be compared as a whole instead of line by line.
Only the differences of content of the document is of importance because the formatting is already visually displayed.
The differences should be highlighted.

Since it is going to be document-wise comparison, for the sake of performance and also convenience for only comparing the content of the documents, I decided to compare the document word by word instead of character by character. Thus, the process can be split into two parts. The first one is to parse the document to find out the tags, natural words, punctuations, special characters and whitespaces etc. The second part is to compare the words, punctuations and some of the special characters to find out the differences. Since the ignored things during the comparison are still needed for reconstructing the document, they must be retained by associating them with the words they are adjacent to.

A class Word is defined to represent the parsed word as well as the tags, whitespaces, and special characters before and after it. The definition of class Word is like:

internal class Word : IComparable
{
    private string _word = string.Empty;
    private string _prefix = string.Empty;
    private string _suffix = string.Empty;
    public Word() {...}
    public Word(string word, string prefix, string suffix) {...}
    ...
    // for reconstructing the word with no change  
public string reconstruct(){...} 
    // for reconstructing the word that is changed
public string reconstruct(string beginTag, string endTag){...}
    public int CompareTo(object obj){...}
}

The parsing follows the following rules:

Anything starting with '<' and ending with '>' is treated as an HTML tag.
HTML tags and whitespaces are treated as prefix or suffix to adjacent words and are put in the prefix or suffix fields of the Word object.
English words separated by space(s), " ", "&#xxx", tailing punctuation are treated as words and are put in the word field of the Word class.
Whitespaces == {' ', '\t', '\n'}.

A strong typed collection WordsCollection is also defined to hold the parsed words.

The class HtmlTextParser is defined with one static method to do the parsing.

internal class HtmlTextParser
{
  static public WordsCollection parse(string s) {...}
}

Finally, class Merger is defined to do the dirty work of comparing and merging:

class Merger
{
  private WordsCollection _original;
  private WordsCollection _modified;
  private IntVector fwdVector;
  private IntVector bwdVector;
  public Merger(string original, string modified){ ..}
  ...
  private MiddleSnake findMiddleSnake(Sequence src, Sequence des){...} 

  private string doMerge(Sequence src, Sequence des){...}
  ...
  public string merge(){...}
}

The constructor of Merger class will take in the HTML text strings and call HtmlTextParser.parse() to parse them into two word collections. When the public method merge() is called, it constructs Sequence objects from the collections and call the private method doMerge(), and doMerge will recursively compare and merge the whole collections.

The engine and the test code are separated into two projects. In the test/bin/release directory, there are old.html and new.html files, running the test program will generate a file called merged.html. Use the browser to test them.

Conclusion

When the HTML document has about 170 pages with about 50% modifications, the engine takes about 15 seconds to parse, compare and merge the two files. If the document's length is less than 10 pages, it takes less than 1-2 seconds. The compared result is tested to be very accurate.