HTML Table Of Contents Generator

Andrew Peace

Nov 4, 2001

A C# program which takes an HTML file as input and outputs a new file with a table of contents embedded.

Introduction
Usage
- Preparing your HTML file for processing.
- Creating the Table of Contents
Limitations and To-Do
How It Works
Conclusion
Updates

Introduction

Having used the Code Project for well over a year now (or at least it feels that way), I have found the articles to be extremely helpful and useful. However, one problem that is common across many of the longer articles, not only on this site but on others also, is that sometimes they can be difficult to navigate around. This holds true for some of my own articles as well. Although the article may be well structured it is not easy to see at a glance what the headings are and jump to specific parts of the document. In Word, I use the Document Map feature to provide this functionality, but there is no such facility in Internet Explorer.

The aim of this program is to provide an easy method to produce a table of contents in an HTML file, which can be used to navigate through the article. When thinking over the problem, I realised that the program needed to be fairly simple but adaptable, so that different styles of contents page can be generated. This may be important because whilst bulleted lists may be appropriate in one style, they may not be in another. Also, the user may wish to provide their own tags to change fonts etc. for the items in the contents listing. In addition to this, any proprietary text (i.e. text that is not taken from the article) should be customisable. This is mainly for the purpose of localisation - the World Wide Web is a global community and thus should be adaptable to fit the needs of more than one culture and language.

Another reason that I have written this small program is to give me a real world problem to tackle using the new .NET Framework from Microsoft. I have written the solution to this problem in Visual C#.NET Beta 2. Below, I outline the usage of the program, and then explain how I solved some of the major issues within the program, and some of the features of the .NET Framework that I have used to help make the task easier.

Usage

This program is extremely easy to use.

Preparing your HTML file for processing.

This is a very simple step, and just involves adding one line to your HTML where you would like the table of contents to appear:

<!-- INSERT contents -->

This comment must be on a line of its own, otherwise the program will not detect it. This could easily be solved, but when writing the program I decided that it would be better for the contents section to be kept in a "separate section" from the rest of the text anyway. Note that you must copy this line exactly - it is case sensitive.

I recommend that you keep this copy of the file for working on, as it will save you having to remove the table of contents and replace the above tag. If you need to edit it again in the future, just edit the original and re-generate the table of contents.

Creating the Table of Contents

Again, this is an easy process. Just run the program, and in the "File to Process" edit box enter the path to the file you wish to be processed, or use the "Browse" button. In the "Output File" edit box, specify the filename of the output file - i.e. the location of the file with the table of contents placed in it. You should not use the same filename for both input and output. The system is designed this way so that if an error occurs the original HTML will not be lost.

When you click on OK, you will see an overview of the file listed in a TreeView control. If this is correct (it will be unless your HTML doesn't follow the rules defined below) then clicking OK again will create the output file with the contents inserted at the specified point.

Limitations and To-Do

As this was originally just a tool for my own use, and a way of learning some of the basics of the .NET Framework and C#, there are still limitations in it:

- Better error-checking could have been put into place in some areas.
- The <!-- INSERT contents --> tag must be on a line of its own.
- The heading tags can be on the same line as other things, but they must not overflow onto a second line, because the program will not pick up the closing </hN> tag.
- The heading tags must be closed:
```
<h2>Whatever next?</h2><p>Some text</p>        <!-- correct -->
<h2>Whatever next?<p>Some text</p>             <!-- incorrect: heading not closed -->
```
  This is proper HTML anyway, and whether or not you plan to use this program it really should be adhered to.
- The code will include all heading levels, though it could easily be modified to stop at, say, <h4>. It can, however, be instructed to ignore headings that are more significant than a certain level. This is useful for Code Project articles, where the main headings are in fact <h2>.
- If you had the following, the heading would not be correctly marked with an <a name="xyz32"> tag, although the contents entry would still remain intact:
```

<h1>Heading 1</h1>
```

A feature that I do plan on putting into the code at a later date is the ability to run silently from the command line or a batch file. This would be fairly simple to do and may be useful in some circumstances. All options would be specifiable from the command line, although the defaults would be used for those not specified. An error would occur if the input and output filenames were omitted.

How It Works

Okay, so you know how to use it; here's the nitty gritty of how it actually does its job under the hood. There are a couple of interesting .NET features put to use in the program, the main two being Windows Forms and the Regular Expression classes. I will cover the general operation of the program below, and then focus on how I used these two features of the framework to help with writing the program.

General Operation

The program firstly uses a Windows Forms form to get the input and output filenames from the user. This is quite straightforward except perhaps one point related to adding filters to the common dialogs. In actual fact it's pretty similar to doing it in straight API, but because the Filters property of the dialog classes is a collection, I initially tried to add the two filters separately. In fact, this doesn't work and the following, more obvious code will do just fine:

fd.Filter = "HTML files|*.html;*.htm|All files (*.*)|*.*";

Then, it uses my HtmlContentsBuilder class to generate a tree of the headings within the file. I used a custom tree and didn't bother to implement collection interfaces, though there may be a better method of managing the tree.

The BuildTree() function runs through the HTML line by line. It first checks to see if the beginning of the line is a comment carried over from the previous line. If it is, it searches for the end of the comment. If the end of the comment is not on this line, it dumps the line and goes back to the beginning of the loop with the next line.

It then removes any comments on the line. If the end of the line is still inside a comment (i.e. the comment carries over to the next line), it sets a flag stating that the beginning of the next line is still a comment, dumps the line and goes back to the beginning of the loop with the next line.
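The comment handling described above can be sketched roughly as follows. This is a minimal illustration rather than the program's actual code: the `CommentStripper` class and its member names are hypothetical, and unlike the description above it keeps whatever non-comment text remains on a line instead of dumping the line.

```csharp
using System;
using System.Text;

// Hypothetical helper: strips HTML comments from lines fed to it one at a
// time, remembering whether a comment is still open at the end of a line.
public class CommentStripper
{
    // True when the previous line ended inside an unclosed comment.
    private bool inComment = false;

    // Returns the given line with all comment text removed.
    public string Strip(string line)
    {
        StringBuilder result = new StringBuilder();
        int pos = 0;

        while (pos < line.Length)
        {
            if (inComment)
            {
                int end = line.IndexOf("-->", pos, StringComparison.Ordinal);
                if (end < 0)
                {
                    // The comment runs past the end of this line.
                    return result.ToString();
                }
                inComment = false;
                pos = end + 3;
            }
            else
            {
                int start = line.IndexOf("<!--", pos, StringComparison.Ordinal);
                if (start < 0)
                {
                    // No more comments: keep the rest of the line.
                    result.Append(line, pos, line.Length - pos);
                    break;
                }
                // Keep the text before the comment, then skip into it.
                result.Append(line, pos, start - pos);
                inComment = true;
                pos = start + 4;
            }
        }
        return result.ToString();
    }
}
```

Because the instance remembers its state between calls, the same `CommentStripper` must be used for every line of a given file.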

Finally, it searches for heading tags. If it finds one, it then searches for the closing tag. This is done using regular expressions as described below.

When a heading tag is found, its level is checked. If it is on the same level as the previous heading, it is added to the right of the previous branch of the tree. If it is a less significant heading (e.g. <h3> is less significant than <h2>) then it is added as a child node of the previous branch of the tree. If it is a more significant heading then it is added to the right of the correct parent of the previous heading. The "new" leaf just added is then remembered as the next "previous" heading for the next iteration.
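The level rules above amount to walking back up the tree from the previous heading until a more significant one is found. The sketch below shows one way to express that. It is not the HtmlContentsBuilder code: `HeadingNode` and `HeadingTree` are hypothetical names, and it assumes heading levels are 1 or greater so that the artificial level-0 root always stops the walk.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical node type for illustration only.
public class HeadingNode
{
    public int Level;
    public string Text;
    public HeadingNode Parent;
    public List<HeadingNode> Children = new List<HeadingNode>();

    public HeadingNode(int level, string text)
    {
        Level = level;
        Text = text;
    }
}

public static class HeadingTree
{
    // Builds a tree from (level, text) pairs given in document order.
    // Assumes every level is >= 1; the root uses level 0.
    public static HeadingNode Build(IEnumerable<KeyValuePair<int, string>> headings)
    {
        HeadingNode root = new HeadingNode(0, "(root)");
        HeadingNode previous = root;

        foreach (KeyValuePair<int, string> h in headings)
        {
            HeadingNode node = new HeadingNode(h.Key, h.Value);

            // Walk up from the previous heading until we reach a node that
            // is more significant (lower level number) than the new one.
            // Same level -> sibling; less significant -> child.
            HeadingNode parent = previous;
            while (parent.Level >= node.Level)
            {
                parent = parent.Parent;
            }

            node.Parent = parent;
            parent.Children.Add(node);
            previous = node;   // remembered for the next iteration
        }
        return root;
    }
}
```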

The tree view is then filled using the tree just generated. If the user decides to generate the Contents HTML, the following process is used:

1. The first heading that should appear in the file is picked.
2. The HTML for the table of contents is produced by running through the tree.
3. A line of the file is read.
4. If the line contains the heading currently selected, the heading is replaced with a new version containing lower case h's for XHTML compliance and an <a name="xyz32"> tag. Note that an increasing number is added to the end of the name. This allows for situations where the same heading appears more than once.
5. If a replacement was made, the next logical heading is chosen and the program loops back to step three.
6. When the line is completely processed, it is searched for a <!-- INSERT contents --> tag. These tags are replaced with the table of contents that was built up previously. This is done at this stage to allow for future expansion; currently this tag must be on a line of its own.
7. The fully processed line is then saved to the output file.
8. The next line is retrieved and stages four through seven are repeated until the end of the file is reached.
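As a rough illustration of the output pass described above, here is a simplified sketch. It makes several assumptions that are not taken from the program itself: the `ContentsWriter` name, a flat (non-nested) `<ul>` for the contents, and anchor names formed from the heading text with spaces removed plus the heading's index. It works on in-memory lists rather than StreamReader and StreamWriter to keep the example short.

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using System.Text.RegularExpressions;

public static class ContentsWriter
{
    public const string Marker = "<!-- INSERT contents -->";

    // lines:    the input file, one entry per line.
    // headings: heading texts in document order, found by an earlier pass.
    public static List<string> Process(IList<string> lines, IList<string> headings)
    {
        // Assign anchor names and build the contents HTML up front.
        string[] anchors = new string[headings.Count];
        StringBuilder toc = new StringBuilder("<ul>");
        for (int i = 0; i < headings.Count; i++)
        {
            // The index suffix keeps names unique for repeated headings.
            anchors[i] = headings[i].Replace(" ", "") + i;
            toc.Append("<li><a href=\"#" + anchors[i] + "\">" + headings[i] + "</a></li>");
        }
        toc.Append("</ul>");

        List<string> output = new List<string>();
        int next = 0;   // index of the heading we expect to see next

        foreach (string original in lines)
        {
            string line = original;
            int searchFrom = 0;

            // A line may hold more than one heading, so keep looking.
            while (next < headings.Count)
            {
                Regex rx = new Regex(
                    "<h([1-6])>" + Regex.Escape(headings[next]) + "</h\\1>",
                    RegexOptions.IgnoreCase);
                Match m = rx.Match(line, searchFrom);
                if (!m.Success) break;

                // Rewrite with lower case tags and a named anchor.
                string level = m.Groups[1].Value;
                string replacement = "<h" + level + "><a name=\"" + anchors[next] + "\">"
                    + headings[next] + "</a></h" + level + ">";
                line = line.Substring(0, m.Index) + replacement
                    + line.Substring(m.Index + m.Length);
                searchFrom = m.Index + replacement.Length;
                next++;
            }

            // The marker must be on a line of its own (case sensitive).
            if (line.Trim() == Marker) line = toc.ToString();
            output.Add(line);
        }
        return output;
    }
}
```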

Note that the program uses the StreamReader and StreamWriter classes for input and output to the File objects. In hindsight the File objects could have been missed out altogether, but this detail is unimportant and by using them we do gain finer control.

Windows Forms

For its entire user interface, this program uses Windows Forms and the common dialogs. I have already demonstrated one small programming pitfall of the common dialogs. I am going to just mention briefly a couple of important points to consider when using Windows Forms to create your UI.

The first is determining whether the user clicked OK or Cancel after the dialog has closed. This may seem extremely obvious, but not knowing the Framework meant that it wasn't something I saw immediately. As before, the result is determined from the return value of the ShowDialog function. However, the problem is setting this result in the first place. To do this, when the user clicks OK, you should set the DialogResult property to DialogResult.OK in your OK-clicked event handler before closing:

private void btnOK_Click(object sender, System.EventArgs e)
{
	DialogResult = DialogResult.OK;
	Close();
}

The other issue with Windows Forms is the resizing of dialogs. I have to say this is done superbly with the new class library. It takes a matter of seconds to add resizing to your forms. All you need to do is set the "Anchor" properties for the controls that are on the form.

If a control is anchored on one side, then that side will always remain the same distance away from that edge of the form, even after resizing. If a side is not anchored, it does not follow that edge of the form when the form is resized. Here are some examples:

Anchoring a control to the bottom and right will cause the bottom edge of the control to stick to the bottom edge of the form and the right edge of the control to stick to the right edge of the form. Thus the control will remain the same size as before but will hold to the bottom-right corner of the form. This type of resizing is used for the buttons in the bottom-right of the main dialog of this sample application.

Anchoring a control on all four sides will cause all edges of the control to stick to the equivalent edge of the form. This will result in the control "stretching" in whichever direction the form is resized. This is used for the TreeView control in the main dialog of this sample.
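In code, the two anchoring arrangements just described look something like the sketch below. `AnchorDemoForm` and its sizes are made up for illustration and are not the sample application's actual dialog; in practice the designer generates equivalent assignments when you set the Anchor property in the Properties window.

```csharp
using System.Drawing;
using System.Windows.Forms;

// Hypothetical form showing the two anchoring styles discussed above.
public class AnchorDemoForm : Form
{
    public readonly TreeView Tree = new TreeView();
    public readonly Button OkButton = new Button();

    public AnchorDemoForm()
    {
        ClientSize = new Size(400, 300);

        // Anchored on all four sides: stretches with the form.
        Tree.SetBounds(10, 10, 380, 240);
        Tree.Anchor = AnchorStyles.Top | AnchorStyles.Bottom
                    | AnchorStyles.Left | AnchorStyles.Right;

        // Anchored bottom-right: keeps its size and follows that corner.
        OkButton.Text = "OK";
        OkButton.SetBounds(315, 265, 75, 25);
        OkButton.Anchor = AnchorStyles.Bottom | AnchorStyles.Right;

        Controls.Add(Tree);
        Controls.Add(OkButton);
    }
}
```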

Regular Expressions

Many of you, as programmers, will probably be familiar with the idea of regular expressions. Basically, they are a find-and-replace tool used for pattern matching, similar to the wildcard features of DOS.

The .NET framework provides a namespace that contains several classes related to regular expressions, giving us all the tools we need to use them in our programs. They are extremely useful in this program, where heading tags and comments need to be identified.

There are two options for using regular expressions: create a Regex object, or use the static methods provided. Either way is fine, but you might get better performance using the former method if you use the search pattern more than once, because it only needs to be "compiled" once. By this, I mean it only needs to be translated once from the form you enter, e.g. "<h[1-6]>", into the op-codes used internally.
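The two options can be compared side by side in a small sketch. The `HeadingFinder` class is hypothetical and the pattern is only an example (HTML defines heading levels 1 to 6); both methods give the same answer, and differ only in when the pattern is parsed.

```csharp
using System;
using System.Text.RegularExpressions;

public static class HeadingFinder
{
    // Option 1: a Regex object, built once and reused for every line.
    private static readonly Regex OpenTag =
        new Regex("<h[1-6]>", RegexOptions.IgnoreCase);

    public static bool HasHeadingInstance(string line)
    {
        return OpenTag.IsMatch(line);
    }

    // Option 2: the static method, which takes the pattern on each call.
    public static bool HasHeadingStatic(string line)
    {
        return Regex.IsMatch(line, "<h[1-6]>", RegexOptions.IgnoreCase);
    }
}
```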

The regular expression classes can be used simply as string searching tools, or the additional codes that can be inserted into the strings to represent certain wildcards' can be used. Here are the ones I used:

Code	Meaning
[XYZ]	Either an X, a Y, or a Z can appear at this point
[A-F]	Any character between A and F can appear at this point
\w	Any word character can appear here
\W	Any non-word character can appear here
*	The preceding item can appear zero or more times
+	The preceding item can appear one or more times

Note that, in C++ and C# at least, you must use a double backslash ("\\") for a single backslash to appear in the string. Alternatively, in C# you can prefix the string with the "@" symbol:

 "Hello \/ World" // error because \/ isn't a recognised escape sequence
@"Hello \/ World" // = "Hello \/ World", because of the usage of the @ sign
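To see the codes from the table in action with both string forms, here is a small sketch. `WildcardDemo` and `FirstMatch` are hypothetical names introduced for this example; the only framework call used is Regex.Match.

```csharp
using System;
using System.Text.RegularExpressions;

public static class WildcardDemo
{
    // Returns the text of the first match, or null if there is none.
    public static string FirstMatch(string input, string pattern)
    {
        Match m = Regex.Match(input, pattern);
        return m.Success ? m.Value : null;
    }

    // Examples:
    //   FirstMatch("<H2>Usage</H2>", "<[hH][1-6]>")         returns "<H2>"
    //   FirstMatch("<h2>Usage notes</h2>", @">\w+")         returns ">Usage"
    //   FirstMatch("Usage notes", "\\W+")                   returns " "
    //   FirstMatch("<h2></h2>", @"<h2>\w*</h2>")            returns "<h2></h2>"
}
```

Note that `@">\w+"` and `">\\w+"` describe exactly the same pattern; the verbatim form is simply easier to read once a pattern contains several backslashes.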

Conclusion

I think I've covered pretty much everything needed. I hope you find this a useful tool - the contents section at the top of this document was automatically generated using it. The code is fairly well commented though I don't claim perfection because I wrote it pretty quickly. As far as I know there aren't any bugs apart from the slight limitations listed at the head of the article, which I feel are mostly justified anyway. I am fairly busy at the minute but will hopefully get round to updating the article and giving the tool support for running from a command-line. Feel free to let me know of any comments or suggestions you might have.

Updates

The code was modified to generate nicer output by removing spaces from the link tags.
A bug was fixed in the way the program builds a regular expression, part of whose content comes from heading text within the file. The program neglected to escape characters in that text which are part of the regular expression syntax. See the RXDecorate function for more details.
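The RXDecorate function itself is not reproduced in this article. As an illustration of the same idea, the sketch below uses the framework's own Regex.Escape method instead, which may or may not match what RXDecorate does in detail. The `HeadingPattern` class is hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class HeadingPattern
{
    // Builds a pattern that matches one specific heading. Regex.Escape
    // makes characters such as '?', '+' and '(' in the heading text match
    // literally instead of being read as regular expression syntax.
    public static Regex ForHeading(string headingText)
    {
        return new Regex(
            "<h[1-6]>" + Regex.Escape(headingText) + "</h[1-6]>",
            RegexOptions.IgnoreCase);
    }
}
```

For a heading such as "Whatever next?", the unescaped pattern treats the question mark as "the preceding t is optional" and therefore fails to match the literal "?" in the file, whereas the escaped pattern matches as intended.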