
Check Help Links Tool

Nicholas Butler, 6 Oct 2005
A tool to check links across merged help (CHM) files.

Note: You will need the MSHTML library if you have only installed the .NET redistributable, not the SDK (or VS).

Application image

Introduction

This article presents a tool for checking links in help (CHM) files, including merged files. The app checks links in indexes, tables of contents, and in the topic files themselves, and presents the results in a simple tree view.

Background

I have a library which is documented using XML comments. I use NDoc to generate a CHM for the library reference help, but I also have a hand-written "master" CHM file which contains higher-level help. I merge these two help files, and they both contain links to each other. I couldn't find a tool to check these links, so I wrote one :)

This was actually quite easy to write as it is really just a front end for the HtmlHelp library and example viewer by Klaus Weisser. All the details of reading CHM files are handled by the library, which means all I had to do was write the UI and the checking algorithms.

Basically, if you need this tool, it will be invaluable; if you don't, well, thanks for looking.

Using the application

The application is very easy to use - at least I tried to make it that way.

The first thing to do is to open a CHM file. This loads the file and any merged files into the library, and starts to fill out the results tree. You can refresh the file at any time to get back to this state. At this stage, the index, table of contents, and topic files are present, but the links from the topic files are not.

It turns out that getting the links from an HTML file is not that easy. The only article I could find on the subject was on MSDN: Walkthrough: Accessing the DHTML DOM from C#. This basically says to open an IE window, load the file, and then get the links from the DOM. This is slow. And if your CHM file has thousands of pages, this is very slow. However, it is the most accurate method, and is therefore the default.

You can speed things up by orders of magnitude by selecting the "fast parsing" option on the toolbar. This uses regular expressions to search the raw HTML. It works well with simple HTML but is not perfect, although it does now skip commented-out blocks. It is useful as a first pass, but I suggest checking with the slow method before you ship your Help files.

Now that you have selected your parsing method, you can start checking by hitting the "Work" toolbar button or pressing Ctrl+W. This loads and parses all the topic files, and checks all the links found. It works on the subtree selected in the tree view. By default, the root node is selected, so the whole help collection is checked, but you can select a portion of the tree if you are just interested in that branch. The results are displayed by setting the icon of each tree view item.

Each item in the results has a state, which is one of "unknown", "good", "http", "script", or "broken". The state of a parent item is set according to the states of its child items, so a parent item is only marked as good if none of its children are marked as broken. This quickly gives you an idea of the correctness of your links, and the detailed metrics are displayed in brackets after each item.
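
The roll-up rule can be sketched as below. The type and method names are hypothetical, and the handling of "unknown" children is an assumption; the article only states that a parent is good when no child is broken.

```csharp
using System.Collections.Generic;

// The five states come from the article; the type names are hypothetical.
public enum LinkState { Unknown, Good, Http, Script, Broken }

public static class StateRollup
{
    // A parent is Broken if any child is Broken. Assumption: it stays
    // Unknown while any child is still unchecked, and is Good otherwise
    // (Http and Script children do not count against it).
    public static LinkState Combine(IEnumerable<LinkState> children)
    {
        bool anyUnknown = false;
        foreach (LinkState child in children)
        {
            if (child == LinkState.Broken) return LinkState.Broken;
            if (child == LinkState.Unknown) anyUnknown = true;
        }
        return anyUnknown ? LinkState.Unknown : LinkState.Good;
    }
}
```

Returning as soon as a broken child is seen keeps the check cheap on large trees, since one broken link is enough to decide the parent's state.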

You can then navigate the results using the Find feature, which supports regular expressions:

Find dialog image

You can also export the results to CSV files. This produces one overall summary file, and one details file for each CHM.

All this can be automated using command line options. These are:

  • -f / -fast: select fast mode
  • -s / -slow: select slow mode
  • -o / -open: open most recently used CHM
  • -o:"C:\path\xxx.chm": open specified CHM
  • -w / -work: check file
  • -x / -export: export reports to default folder
  • -x:"C:\path\folder": export to specified folder
  • -c / -close: close app

So an example could be: CheckHelpLinks.exe -f -o:"C:\test\test.chm" -w -x -c.

The exit code of the application is zero if there are no errors, and non-zero if there are broken links.
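
For a build script, the options and exit code above might be wrapped along these lines. This is only a sketch: the class and method names are hypothetical, and the path to CheckHelpLinks.exe is whatever applies on your machine.

```csharp
using System.Diagnostics;

public static class CheckHelpLinksRunner
{
    // Builds the same option string as the example in the article:
    // pick a mode, open the CHM, check it, export reports, close.
    public static string BuildArguments(string chmPath, bool fast)
    {
        string mode = fast ? "-f" : "-s";
        return mode + " -o:\"" + chmPath + "\" -w -x -c";
    }

    // Runs the tool and returns true when the exit code is zero,
    // i.e. when no broken links were reported.
    public static bool Run(string exePath, string chmPath, bool fast)
    {
        var info = new ProcessStartInfo(exePath, BuildArguments(chmPath, fast));
        info.UseShellExecute = false;
        using (Process process = Process.Start(info))
        {
            process.WaitForExit();
            return process.ExitCode == 0;
        }
    }
}
```

A build step could call Run and fail the build when it returns false.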

And that's about it. Go and fix your broken links and then check again!

Points of Interest

Reflection Magic

I needed access to a property declared as internal in the library. Obviously, this would normally not be available outside the library, but you can get access to it by using reflection. This is a very powerful practice, but it is not ideal: the library author would be quite right to change his internal implementation in future releases, which would break the client code. So, use at your own risk!

This code comes from the IndexItem.Load method in the ClassesIndex.cs file.

What I wanted to write was this:

    CHMFile chmFile = indexItem.ChmFile;

What I actually wrote was this:

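    // Requires: using System.Reflection;
    // Look up the non-public instance property by name, then read it.
    Type t = indexItem.GetType();
    PropertyInfo p = t.GetProperty( "ChmFile",
        BindingFlags.Instance | BindingFlags.NonPublic );
    CHMFile chmFile = ( CHMFile ) p.GetValue( indexItem, null );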
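
Since the internal property could disappear in a later release of the library, a slightly more defensive version may be worth having. The following is a self-contained sketch: DemoItem is a stand-in class invented for the example (it is not the library's IndexItem), and the helper name is hypothetical.

```csharp
using System;
using System.Reflection;

// Stand-in for a library type with an internal property.
public class DemoItem
{
    private readonly string _file;
    public DemoItem(string file) { _file = file; }
    internal string ChmFile { get { return _file; } }
}

public static class InternalAccess
{
    // Reads a non-public instance property by name and fails loudly
    // if the property is not there (for example, after a library update).
    public static object GetNonPublicProperty(object target, string name)
    {
        PropertyInfo p = target.GetType().GetProperty(
            name, BindingFlags.Instance | BindingFlags.NonPublic);
        if (p == null)
            throw new MissingMemberException(target.GetType().Name, name);
        return p.GetValue(target, null);
    }
}
```

With this, `(string) InternalAccess.GetNonPublicProperty(item, "ChmFile")` either returns the value or throws a descriptive exception, rather than a NullReferenceException from a missing PropertyInfo.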

Accessing the DHTML DOM

The only way I could find of accessing the DOM was by opening a web browser control, loading the HTML, and then getting an IHTMLDocument2 object from the Document property of the control. Apart from being slow, this is a bit tricky, since the loading process is asynchronous. I solved this by setting a flag before beginning the navigation, and waiting until the DocumentComplete event handler cleared the flag. During the wait, I call Application.DoEvents to keep the UI fresh. I don't know how good this is, as it all happens in a dialog box; what I really wanted was to run the dialog box's message loop, but I couldn't find a way of doing this :( The code is in the DlgBrowser.cs file, if you want to have a look.

The Regular Expressions

These are the RegEx's used in fast mode. There are two RegEx's each for anchors and links: the first matches unquoted attribute values and the second matches quoted ones:

  • Comment: ^(?<before>.*)(?<comment><!--.*?-->)(?<after>.*)$
  • Anchor 1: <\s*A\s[^>]*name\s*=\s*(?<anchor>[^'"].*?)[\s>]
  • Anchor 2: <\s*A\s[^>]*name\s*=\s*(?<quote>['"])(?<anchor>.*?)\k<quote>
  • Link 1: <\s*(?:A|AREA)\s[^>]*href\s*=\s*(?<url>[^'"].*?)[\s>]
  • Link 2: <\s*(?:A|AREA)\s[^>]*href\s*=\s*(?<quote>['"])(?<url>.*?)\k<quote>
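
As a rough illustration of how the comment and link expressions could be applied, here is a minimal sketch. The class and method names are hypothetical, and the RegexOptions are an assumption (the article does not say which options the tool uses); IgnoreCase is chosen because the patterns spell the tag names in upper case, and Singleline so that a comment may span several lines.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class FastLinkParser
{
    // Assumed options; not stated in the article.
    const RegexOptions Options = RegexOptions.IgnoreCase | RegexOptions.Singleline;

    static readonly Regex Comment = new Regex(
        @"^(?<before>.*)(?<comment><!--.*?-->)(?<after>.*)$", Options);
    static readonly Regex UnquotedLink = new Regex(
        @"<\s*(?:A|AREA)\s[^>]*href\s*=\s*(?<url>[^'""].*?)[\s>]", Options);
    static readonly Regex QuotedLink = new Regex(
        @"<\s*(?:A|AREA)\s[^>]*href\s*=\s*(?<quote>['""])(?<url>.*?)\k<quote>", Options);

    // Removes comment blocks one at a time until none are left.
    public static string StripComments(string html)
    {
        Match m = Comment.Match(html);
        while (m.Success)
        {
            html = m.Groups["before"].Value + m.Groups["after"].Value;
            m = Comment.Match(html);
        }
        return html;
    }

    // Returns href values: quoted matches first, then unquoted ones.
    public static List<string> ExtractLinks(string html)
    {
        string text = StripComments(html);
        var urls = new List<string>();
        foreach (Match m in QuotedLink.Matches(text)) urls.Add(m.Groups["url"].Value);
        foreach (Match m in UnquotedLink.Matches(text)) urls.Add(m.Groups["url"].Value);
        return urls;
    }
}
```

The anchor expressions can be used the same way by reading the "anchor" group instead of "url".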

Acknowledgements

Special thanks to Klaus Weisser for his excellent library and for his help while writing this.

Also, thanks to Ryan Pollack for his help with version 2.

More thanks to Karen Story and Dmitri Posudin for their help with version 3.

History

  • Version 3: 2005 September 30
    • advanced Find feature
    • anchors (bookmarks) now handled
    • improved RegEx's for Fast mode
    • strips comment blocks in Fast mode
    • results export to CSV file
    • view source of any link directly
    • JavaScript links handled separately
    • command line options added
    • now handles many more formats of links correctly
    • new icon
  • Version 2: 2005 August 20
    • links to CHM files with no 'ms-its' prefix now handled
    • links to external (not merged) CHM files now handled
    • HTTP links in CHM files now counted separately
    • a couple of bug fixes
  • Version 1: 2004 August 16
    • First release

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

About the Author

Nicholas Butler

United Kingdom

I built my first computer, a Sinclair ZX80, on my 11th birthday in 1980.
In 1992, I completed my Computer Science degree and built my first PC.
I discovered C# and .NET 1.0 Beta 1 in late 2000 and loved them immediately.
I have been writing concurrent software professionally, using multi-processor machines, since 1995.
 
In real life, I have spent 3 years travelling abroad,
I have held a UK Private Pilots Licence for 20 years,
and I am a PADI Divemaster.
 
I now live near idyllic Bournemouth in England.
 
If you would like help with multithreading, please contact me via my website:
 
 
I can work 'virtually' anywhere!

Last Updated 6 Oct 2005
Article Copyright 2004 by Nicholas Butler