Click here to Skip to main content
Click here to Skip to main content

A word-wise HTML text compare and merge engine

By , 27 Aug 2005
Rate this:
Please Sign up or sign in to vote.

Introduction

The initial motivation for writing this code was to compare the staging version with the current production version of the HTML document inside RainbowPortal when workflow is turned on, in order to provide a way for editor and/or approver to visually compare what is modified. The code is not published in RainbowPortal.

After some research, I chose to use the algorithm proposed by Eugene W. Myers in his paper "An O(ND) Difference Algorithm and its Variation" which is also the algorithm used by Unix Diff. A copy of the file can be found here.

The Unix Diff implementation found here is referenced during the implementation.

Design

The overall requirement can be summarized briefly as:

  1. The to-be-compared is HTML text which may contain HTML tags, HTML comments, whitespaces, and special characters such as " " "&#123" etc.
  2. The documents should be compared as a whole instead of line by line.
  3. Only the differences of content of the document is of importance because the formatting is already visually displayed.
  4. The differences should be highlighted.

Since it is going to be document-wise comparison, for the sake of performance and also convenience for only comparing the content of the documents, I decided to compare the document word by word instead of character by character. Thus, the process can be split into two parts. The first one is to parse the document to find out the tags, natural words, punctuations, special characters and whitespaces etc. The second part is to compare the words, punctuations and some of the special characters to find out the differences. Since the ignored things during the comparison are still needed for reconstructing the document, they must be retained by associating them with the words they are adjacent to.

A class Word is defined to represent the parsed word as well as the tags, whitespaces, and special characters before and after it. The definition of class Word is like:

internal class Word : IComparable
{
    private string _word = string.Empty;
    private string _prefix = string.Empty;
    private string _suffix = string.Empty;
    public Word() {...}
    public Word(string word, string prefix, string suffix) {...}
    ...
    // for reconstructing the word with no change  
public string reconstruct(){...} 
    // for reconstructing the word that is changed
public string reconstruct(string beginTag, string endTag){...}
    public int CompareTo(object obj){...}
}

The parsing follows the following rules:

  1. Anything starting with '<' and ending with '>' is treated as an HTML tag.
  2. HTML tags and whitespaces are treated as prefix or suffix to adjacent words and are put in the prefix or suffix fields of the Word object.
  3. English words separated by space(s), "&nbsp;", "&#xxx", tailing punctuation are treated as words and are put in the word field of the Word class.
  4. Whitespaces == {' ', '\t', '\n'}.

A strong typed collection WordsCollection is also defined to hold the parsed words.

The class HtmlTextParser is defined with one static method to do the parsing.

internal class HtmlTextParser
{
  static public WordsCollection parse(string s) {...}
}

Finally, class Merger is defined to do the dirty work of comparing and merging:

class Merger
{
  private WordsCollection _original;
  private WordsCollection _modified;
  private IntVector fwdVector;
  private IntVector bwdVector;
  public Merger(string original, string modified){ ..}
  ...
  private MiddleSnake findMiddleSnake(Sequence src, Sequence des){...} 

  private string doMerge(Sequence src, Sequence des){...}
  ...
  public string merge(){...}
}

The constructor of Merger class will take in the HTML text strings and call HtmlTextParser.parse() to parse them into two word collections. When the public method merge() is called, it constructs Sequence objects from the collections and call the private method doMerge(), and doMerge will recursively compare and merge the whole collections.

The engine and the test code are separated into two projects. In the test/bin/release directory, there are old.html and new.html files, running the test program will generate a file called merged.html. Use the browser to test them.

Conclusion

When the HTML document has about 170 pages with about 50% modifications, the engine takes about 15 seconds to parse, compare and merge the two files. If the document's length is less than 10 pages, it takes less than 1-2 seconds. The compared result is tested to be very accurate.

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.

A list of licenses authors might use can be found here

About the Author

Hongwei Shen
Web Developer
Canada Canada
No Biography provided

Comments and Discussions

 
Questionmerge more than 2 document let's say 3 or 5. Pinmemberengrwaqasmaroof2-Feb-14 21:30 
QuestionImage is not comparing PinmemberMember 451251426-Jan-14 23:23 
GeneralMy vote of 5 PinmemberSavindraSingh6-Nov-12 22:08 
QuestionNo result even after 30 mins on large files Pinmembermzd_55517-May-12 19:49 
I am trying to merge to large files each is 10mb. Even after waiting for 30 minutes no result. I tried will smaller files and worked prefectly.
 
any idea?
QuestionNice Work.But does it capture formatting differences? PinmemberRajesh Periyasamy23-Feb-11 23:47 
GeneralMy vote of 4 PinmemberRajesh Periyasamy23-Feb-11 23:43 
QuestionHow can I use this for RichEditContol PinmemberABlokha7715-Jan-10 9:13 
GeneralImages Pinmemberarjun priyananth21-Oct-09 14:16 
GeneralBig thanks, but got a problem PinmemberTwing020-Jul-09 5:10 
GeneralRe: Big thanks, but got a problem Pinmembermatrix198426-Oct-09 18:32 
GeneralA big thankyou PinmemberShaun Wilde22-Mar-09 23:37 
GeneralRegarding Table Comparison PinmemberMember 46681312-Mar-09 2:16 
GeneralRe: Regarding Table Comparison PinmemberHongwei Shen23-Mar-09 7:39 
GeneralTerrific Example of HTML Diff tool! PinmemberBob@work14-Nov-08 9:18 
QuestionIs this DLL re-enterable and re-usable? PinmemberBill Ruf28-Jul-08 17:13 
GeneralHTML Table comparison Pinmemberinrad17-Sep-07 11:33 
GeneralRe: HTML Table comparison PinmemberChristophe76523176527-May-08 0:18 
GeneralRe: HTML Table comparison PinmemberHongwei Shen27-May-08 2:59 
GeneralRe: HTML Table comparison [modified] PinmemberChristophe76523176529-May-08 2:06 
GeneralRe: HTML Table comparison Pinmembernidjain25-Aug-09 4:41 
GeneralPassing texts like parameters Pinmembermarioanj1-Mar-07 9:32 
GeneralI cant execute Pinmembermarioanj1-Mar-07 9:15 
GeneralYou rock PinmemberMeraz2-Feb-07 8:48 
GeneralExcellent Pinmember.snowboarder31-May-06 10:13 
GeneralDownload still not available PinmemberGareth Brown28-Aug-05 2:37 
GeneralRe: Download still not available PinmemberHongwei Shen28-Aug-05 8:35 
GeneralRe: Download still not available PinmemberBimmelibimmbimmbimm31-Aug-05 1:17 
GeneralRe: Download still not available PinsussAnonymous31-Aug-05 3:12 
GeneralRe: Download still not available PinmemberHongwei Shen31-Aug-05 3:32 
GeneralRe: Download still not available PinmemberBimmelibimmbimmbimm31-Aug-05 4:01 
Generaldownload failed. PinmemberUnruled Boy27-Aug-05 5:55 
General[Msg Deleted] PinmemberHongwei Shen28-Aug-05 8:37 

General General    News News    Suggestion Suggestion    Question Question    Bug Bug    Answer Answer    Joke Joke    Rant Rant    Admin Admin   

Use Ctrl+Left/Right to switch messages, Ctrl+Up/Down to switch threads, Ctrl+Shift+Left/Right to switch pages.

| Advertise | Privacy | Mobile
Web04 | 2.8.140415.2 | Last Updated 27 Aug 2005
Article Copyright 2005 by Hongwei Shen
Everything else Copyright © CodeProject, 1999-2014
Terms of Use
Layout: fixed | fluid