Click here to Skip to main content
15,860,859 members
Articles / Programming Languages / C#
Article

A word-wise HTML text compare and merge engine

Rate me:
Please Sign up or sign in to vote.
4.98/5 (31 votes)
27 Aug 20053 min read 113.2K   1.6K   40   33
A C# .NET implemntation of HTML text compare and merge engine based on a similar algorithm as the Unix diff.

Introduction

The initial motivation for writing this code was to compare the staging version with the current production version of the HTML document inside RainbowPortal when workflow is turned on, in order to provide a way for editor and/or approver to visually compare what is modified. The code is not published in RainbowPortal.

After some research, I chose to use the algorithm proposed by Eugene W. Myers in his paper "An O(ND) Difference Algorithm and its Variation" which is also the algorithm used by Unix Diff. A copy of the file can be found here.

The Unix Diff implementation found here is referenced during the implementation.

Design

The overall requirement can be summarized briefly as:

  1. The to-be-compared is HTML text which may contain HTML tags, HTML comments, whitespaces, and special characters such as " " "&#123" etc.
  2. The documents should be compared as a whole instead of line by line.
  3. Only the differences of content of the document is of importance because the formatting is already visually displayed.
  4. The differences should be highlighted.

Since it is going to be document-wise comparison, for the sake of performance and also convenience for only comparing the content of the documents, I decided to compare the document word by word instead of character by character. Thus, the process can be split into two parts. The first one is to parse the document to find out the tags, natural words, punctuations, special characters and whitespaces etc. The second part is to compare the words, punctuations and some of the special characters to find out the differences. Since the ignored things during the comparison are still needed for reconstructing the document, they must be retained by associating them with the words they are adjacent to.

A class Word is defined to represent the parsed word as well as the tags, whitespaces, and special characters before and after it. The definition of class Word is like:

C#
internal class Word : IComparable
{
    private string _word = string.Empty;
    private string _prefix = string.Empty;
    private string _suffix = string.Empty;
    public Word() {...}
    public Word(string word, string prefix, string suffix) {...}
    ...
    // for reconstructing the word with no change  
public string reconstruct(){...} 
    // for reconstructing the word that is changed
public string reconstruct(string beginTag, string endTag){...}
    public int CompareTo(object obj){...}
}

The parsing follows the following rules:

  1. Anything starting with '<' and ending with '>' is treated as an HTML tag.
  2. HTML tags and whitespaces are treated as prefix or suffix to adjacent words and are put in the prefix or suffix fields of the Word object.
  3. English words separated by space(s), "&nbsp;", "&#xxx", tailing punctuation are treated as words and are put in the word field of the Word class.
  4. Whitespaces == {' ', '\t', '\n'}.

A strong typed collection WordsCollection is also defined to hold the parsed words.

The class HtmlTextParser is defined with one static method to do the parsing.

C#
internal class HtmlTextParser
{
  static public WordsCollection parse(string s) {...}
}

Finally, class Merger is defined to do the dirty work of comparing and merging:

C#
class Merger
{
  private WordsCollection _original;
  private WordsCollection _modified;
  private IntVector fwdVector;
  private IntVector bwdVector;
  public Merger(string original, string modified){ ..}
  ...
  private MiddleSnake findMiddleSnake(Sequence src, Sequence des){...} 

  private string doMerge(Sequence src, Sequence des){...}
  ...
  public string merge(){...}
}

The constructor of Merger class will take in the HTML text strings and call HtmlTextParser.parse() to parse them into two word collections. When the public method merge() is called, it constructs Sequence objects from the collections and call the private method doMerge(), and doMerge will recursively compare and merge the whole collections.

The engine and the test code are separated into two projects. In the test/bin/release directory, there are old.html and new.html files, running the test program will generate a file called merged.html. Use the browser to test them.

Conclusion

When the HTML document has about 170 pages with about 50% modifications, the engine takes about 15 seconds to parse, compare and merge the two files. If the document's length is less than 10 pages, it takes less than 1-2 seconds. The compared result is tested to be very accurate.

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.

A list of licenses authors might use can be found here


Written By
Web Developer
Canada Canada
This member has not yet provided a Biography. Assume it's interesting and varied, and probably something to do with programming.

Comments and Discussions

 
QuestionLicence Pin
Roberto M.7921-Jan-21 20:08
Roberto M.7921-Jan-21 20:08 
Questionmerge more than 2 document let's say 3 or 5. Pin
engrwaqasmaroof2-Feb-14 21:30
engrwaqasmaroof2-Feb-14 21:30 
QuestionImage is not comparing Pin
Member 451251426-Jan-14 23:23
Member 451251426-Jan-14 23:23 
GeneralMy vote of 5 Pin
SavindraSingh6-Nov-12 22:08
SavindraSingh6-Nov-12 22:08 
QuestionNo result even after 30 mins on large files Pin
mzd_55517-May-12 19:49
mzd_55517-May-12 19:49 
QuestionNice Work.But does it capture formatting differences? Pin
rajeshperiyasamy23-Feb-11 23:47
rajeshperiyasamy23-Feb-11 23:47 
GeneralMy vote of 4 Pin
rajeshperiyasamy23-Feb-11 23:43
rajeshperiyasamy23-Feb-11 23:43 
QuestionHow can I use this for RichEditContol Pin
ABlokha7715-Jan-10 9:13
ABlokha7715-Jan-10 9:13 
GeneralImages Pin
arjun priyananth21-Oct-09 14:16
arjun priyananth21-Oct-09 14:16 
GeneralBig thanks, but got a problem Pin
Twing020-Jul-09 5:10
Twing020-Jul-09 5:10 
GeneralRe: Big thanks, but got a problem Pin
matrix198426-Oct-09 18:32
matrix198426-Oct-09 18:32 
GeneralA big thankyou Pin
Shaun Wilde22-Mar-09 23:37
Shaun Wilde22-Mar-09 23:37 
GeneralRegarding Table Comparison Pin
Wahmad2-Mar-09 2:16
Wahmad2-Mar-09 2:16 
GeneralRe: Regarding Table Comparison Pin
Hongwei Shen23-Mar-09 7:39
Hongwei Shen23-Mar-09 7:39 
GeneralTerrific Example of HTML Diff tool! Pin
Bob@work14-Nov-08 9:18
Bob@work14-Nov-08 9:18 
QuestionIs this DLL re-enterable and re-usable? Pin
Bill Ruf28-Jul-08 17:13
Bill Ruf28-Jul-08 17:13 
GeneralHTML Table comparison Pin
ashraf_jahangeer17-Sep-07 11:33
ashraf_jahangeer17-Sep-07 11:33 
GeneralRe: HTML Table comparison Pin
Christophe76523176527-May-08 0:18
Christophe76523176527-May-08 0:18 
GeneralRe: HTML Table comparison Pin
Hongwei Shen27-May-08 2:59
Hongwei Shen27-May-08 2:59 
GeneralRe: HTML Table comparison [modified] Pin
Christophe76523176529-May-08 2:06
Christophe76523176529-May-08 2:06 
GeneralRe: HTML Table comparison Pin
nidjain25-Aug-09 4:41
nidjain25-Aug-09 4:41 
GeneralPassing texts like parameters Pin
marioanj1-Mar-07 9:32
marioanj1-Mar-07 9:32 
GeneralI cant execute Pin
marioanj1-Mar-07 9:15
marioanj1-Mar-07 9:15 
GeneralYou rock Pin
Meraz2-Feb-07 8:48
Meraz2-Feb-07 8:48 
GeneralExcellent Pin
.snowboarder31-May-06 10:13
.snowboarder31-May-06 10:13 

General General    News News    Suggestion Suggestion    Question Question    Bug Bug    Answer Answer    Joke Joke    Praise Praise    Rant Rant    Admin Admin   

Use Ctrl+Left/Right to switch messages, Ctrl+Up/Down to switch threads, Ctrl+Shift+Left/Right to switch pages.