Introduction

I have been working on a project with many "encyclopedia" articles. Some of the articles are lengthy, and I thought it would be nice to add a table of contents. But manually writing and maintaining a TOC while the document was still under revision... I know from experience that such a path leads to madness.

I did some digging and found a very good method using JavaScript. Two problems, though: I didn't like needing to rely on my users having JavaScript enabled, and the TOC was not displaying at all with IE8 in compatibility mode. (It was there according to the DOM, but not displaying, which was very frustrating.) I was also annoyed at the noticeable time lag between text first appearing in the browser and the TOC being displayed.

The logical way around these issues was to do the processing server side. I posted a Quick Question about how to intercept the text of a page, and CP member sTOPs provided a link that helped me to figure out my solution.

This article demonstrates how to dynamically create a table of contents and insert it into the web page before the page gets delivered to the user. Part 2, which still needs some work, will demonstrate how to use a pseudo-tag to generate inline references, similar to what Wikipedia does with its <ref> pseudo-tag.

Dealing with the HTML

This technique requires minimal changes in your HTML. Put the token {{toc}} where you want the table of contents to go, and the filter will do the rest. Using it "naked" will cause the Visual Studio validator to complain about text not being allowed in the <body> tag; if this annoys you, you can put the token inside a <div>. The filter is case insensitive, so you can also use {{TOC}}, {{Toc}}, and so on.

You will want to look at the <hx> headers that will be used to generate the TOC. On my site, <h1>, <h2>, and <h6> all have special uses, so my filter ignores them. This leaves <h3>, <h4>, and <h5> for use as content headings. Again, the filter is case insensitive, and should work just fine with, say, <H4>. The filter will also handle any existing id attributes on the header tags; if one is not provided, the filter will auto-generate one so that your TOC link has a place to land. Other attributes, if any, will be handled gracefully if they are in proper HTML format.

Please note: the tags are translated into XElement objects, which means that their content must be valid XML. If your headers include entities like &ldquo; that XML does not understand, you will get an error. Using numeric codes in your headers instead of entities will work with some codes; other codes cause an offset problem, so remember to check your work. It might be easier to use ASCII replacements (straight quotes instead of the fancy curved ones) or not use entities at all.
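
The entity restriction is easy to reproduce outside the filter. Here is a minimal console sketch; the module EntityDemo and the helper IsXmlValid are names I made up for illustration and are not part of the filter. It checks a named HTML entity, a numeric code, and a predefined XML entity, then compares the length of a header before and after it passes through XElement.

```vb
Imports System
Imports System.Xml
Imports System.Xml.Linq

Module EntityDemo
    'Hypothetical helper (not part of the filter):
    'True if the header text can be parsed as XML
    Function IsXmlValid(ByVal Tag As String) As Boolean
        Try
            XElement.Parse(Tag)
            Return True
        Catch ex As XmlException
            Return False
        End Try
    End Function

    Sub Main()
        Dim Original As String = "<h3>&#8220;Quoted&#8221;</h3>"

        'Named HTML entity: XML has no definition for &ldquo;
        Console.WriteLine(IsXmlValid("<h3>&ldquo;Quoted&rdquo;</h3>"))
        'Numeric character reference: valid XML
        Console.WriteLine(IsXmlValid(Original))
        'One of the five entities that XML predefines
        Console.WriteLine(IsXmlValid("<h3>Fish &amp; Chips</h3>"))

        'Length in the page text versus length after re-serializing
        Console.WriteLine(Original.Length)
        Console.WriteLine(XElement.Parse(Original).ToString().Length)
    End Sub
End Module
```

The first check prints False and the other two print True. If the two lengths at the end differ, that is a likely source of the offset problem: the filter measures the old tag with the re-serialized text, not with the text that is actually in the page.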

Lastly, look at your CSS. If you use my filter out of the box, make sure you include the style sheet so that your TOC looks nice and is properly formatted. A tiny bit of JavaScript is useful too if you want to let users minimize the TOC. Because the table of contents starts out open, it will not damage any functionality or layout if the user has scripting disabled.

Response.Filter: How it works

In the bad old pre-.NET days, writing a web filter was frustrating. Filters had to be written in C++, and the development cycle was basically compile, install on IIS, test, uninstall, try to find your bug, recompile, install, ad nauseam.

Nowadays, filters have become much easier. Internally, the Framework keeps track of the page's assembly and rendering with Streams. Intercept the stream, and you can tinker with the page before sending it on its merry way. Doing this is almost trivial:

If Response.ContentType = "text/html" Then
    Response.Filter = New MyNewFilter(Response.Filter)
End If

If the page has a content type of text/html, then set Response.Filter to be your new filter. This will prevent your filter from being invoked when the server is delivering images, PDFs, and other types of content. The constructor for MyNewFilter takes the previous filter; when yours has done its work, processing will move on to the next one in the chain. Yup, it really is that simple.

The next issue to consider is where to set the filter. From what I've been able to tell, you can do this pretty much any time before the page is delivered: from the page itself, from its master page, or even globally in Global.asax. Because most of my pages will have a table of contents, I have implemented the filter globally, in the PostReleaseRequestState event of the Application object. This event is one of the last to be fired before delivering the page, which makes it a logical choice. In the Global.asax file, I added this code:

Protected Sub Application_PostReleaseRequestState _
          (ByVal sender As Object, ByVal e As System.EventArgs)
    If Response.ContentType = "text/html" Then 
        Response.Filter = New TOCFilter(Response.Filter)
    End If
End Sub

The exact placement of this code may not matter: I have seen examples using the Page_Load event when only individual pages are being filtered. Very likely, all you need to do is use it in some event that every page will call.

Auxiliary class HeaderClass

To encapsulate some of the processing, I have an auxiliary class named HeaderClass.

Private Class HeaderClass
    Private pRank As String
    Private pTag As XElement

    'Return either the id attribute, or the tag's value
    Public ReadOnly Property Id() As String
        Get
            If pTag Is Nothing Then 
                Throw New Exception("Member 'Tag' was not instantiated.")
            End If

            If pTag.Attribute("id") IsNot Nothing Then
                Return pTag.Attribute("id").Value
            Else
                Return pTag.Value
            End If
        End Get
    End Property

    Public ReadOnly Property Length() As Integer
        Get
            Return pTag.ToString.Length
        End Get
    End Property

    Public ReadOnly Property Rank() As String
        Get
            Return pRank
        End Get
    End Property

    Public ReadOnly Property Tag() As XElement
        Get
            Return pTag
        End Get
    End Property

    'Generate a tag with the id attribute. Note the {0} 
    'in the attribute value: that will hold the unique
    'sequence index of the tag.
    Public ReadOnly Property TagReplacement() As String
        Get
            If pTag Is Nothing Then 
                Throw New Exception("Member 'Tag' was not instantiated.")
            End If

            Dim NewTag As New XElement(pTag) 
            NewTag.SetAttributeValue("id", _
                "{0}_" + Me.Id.Replace(" ", "_")) 
            Return NewTag.ToString
        End Get
    End Property

    'The text value of the tag
    Public ReadOnly Property Text() As String
        Get
            If pTag Is Nothing Then 
                Throw New Exception("Member 'Tag' was not instantiated.")
            End If

            Return pTag.Value
        End Get
    End Property

    'All we need to instantiate is a rank and the tag text
    Public Sub New(ByVal Rank As String, ByVal Tag As String)
        pRank = Rank
        Try
            pTag = XElement.Parse(Tag)
        Catch ex As Exception
            Throw New ArgumentException("Not a valid element.", "Tag", ex)
        End Try
    End Sub

End Class

The main purpose of this class is to hold the tag for later reference, with some extra functionality added to make things a bit cleaner.

Class TOCFilter

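Now we are ready to look at the filter itself. Response.Filter is a Stream object, so our filter needs to be based on System.IO.Stream. Stream is an abstract (MustInherit) class, so inheriting from it directly would mean overriding all of its abstract members; inheriting from MemoryStream avoids that work and does the job fine.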

Imports System.IO
Imports System.Text
Imports System.Text.RegularExpressions

Public Class TOCFilter
    Inherits MemoryStream

    Private Output As Stream        'The filter this one replaces; receives the result
    Private HTML As StringBuilder   'Accumulates the page across Write calls
    Private EOP As Regex            'Detects the end of the page

    Public Sub New(ByVal LinkedStream As Stream)
        Output = LinkedStream
        HTML = New StringBuilder
        EOP = New Regex("</html>", RegexOptions.IgnoreCase)
    End Sub

    Public Overrides Sub Write(ByVal buffer() As Byte, _
    ByVal offset As Integer, ByVal count As Integer)
        'Decode only the bytes that belong to this call
        Dim BufferStr As String = _
            UTF8Encoding.UTF8.GetString(buffer, offset, count)

        HTML.Append(BufferStr)
        If EOP.IsMatch(BufferStr) Then
            Dim PageContent As String = HTML.ToString

            'The magic happens here

            'The re-encoded page is a new array, so it is written from index 0
            Dim OutBytes() As Byte = UTF8Encoding.UTF8.GetBytes(PageContent)
            Output.Write(OutBytes, 0, OutBytes.Length)
        End If
    End Sub

End Class

The constructor takes the filter stream that it is replacing and puts it aside. When the new filter's Write method is invoked, the value of buffer() is accumulated until we have the whole page, which is then processed and chained to the next filter. The encoding work makes sure that the text gets stored in memory correctly; if you are not using UTF-8 (which is pretty standard nowadays), you will need to reference whatever encoding system your pages are using.
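
To see the accumulate-then-flush behaviour without a web server, here is a rough console sketch. BufferingFilter and RunDemo are names I made up, a plain MemoryStream stands in for the next filter in the chain, and upper-casing the page stands in for the real processing. This sketch tests the accumulated text rather than only the latest chunk, which is one way (an assumption on my part, not necessarily what the downloadable source does) to cope with a closing tag that is split across two writes.

```vb
Imports System
Imports System.IO
Imports System.Text

'Simplified stand-in for TOCFilter: buffers every chunk, then
'upper-cases the page as a placeholder for the real processing
Public Class BufferingFilter
    Inherits MemoryStream

    Private Output As Stream
    Private HTML As New StringBuilder

    Public Sub New(ByVal LinkedStream As Stream)
        Output = LinkedStream
    End Sub

    Public Overrides Sub Write(ByVal buffer() As Byte, _
    ByVal offset As Integer, ByVal count As Integer)
        HTML.Append(Encoding.UTF8.GetString(buffer, offset, count))

        'Test the whole accumulated page, not just this chunk
        Dim Page As String = HTML.ToString
        If Page.IndexOf("</html>", StringComparison.OrdinalIgnoreCase) >= 0 Then
            Dim OutBytes() As Byte = Encoding.UTF8.GetBytes(Page.ToUpperInvariant())
            Output.Write(OutBytes, 0, OutBytes.Length)
            HTML.Length = 0     'Reset the buffer for the next page
        End If
    End Sub
End Class

Module BufferingDemo
    Function RunDemo() As String
        Dim Sink As New MemoryStream
        Dim Wrapper As New BufferingFilter(Sink)

        'Deliver the page in pieces; the closing tag is split on purpose
        For Each Chunk As String In New String() {"<html><body>", "Hello", "</body></ht", "ml>"}
            Dim Bytes() As Byte = Encoding.UTF8.GetBytes(Chunk)
            Wrapper.Write(Bytes, 0, Bytes.Length)
        Next

        Return Encoding.UTF8.GetString(Sink.ToArray())
    End Function

    Sub Main()
        Console.WriteLine(RunDemo())
    End Sub
End Module
```

Nothing reaches Sink until the fourth chunk completes the closing tag, at which point the whole page is written in one go as <HTML><BODY>HELLO</BODY></HTML>. Rescanning the full buffer on every write is wasteful for large pages; it is used here only to keep the sketch short.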

The table of contents itself

Before we look at how the table of contents is generated, let's look at the markup it needs to produce. Here is a sample layout:

<table id="TOC">
  <tr>
    <th id="TOC_Header">Contents [<span id="TOC_Toggle" 
                  onclick="ShowHideToc();">Hide</span>]</th>
  </tr>

  <tr id="TOC_Content" style="display:block">
    <td>
      <table>
        <tr>
          <td class="TOC_level_H3">
            <a href="#1_H3_Header">1& nbsp;& nbsp;H3 Header</a>
          </td>
        </tr>
      </table>
    </td>
  </tr>
</table>

(Ignore the spaces in the & nbsp;, that is just so they will render as text and not as white space.)

What we have here is a table with an id of TOC that has two rows. The first row is the TOC's header, and its cell has an id of, oddly enough, TOC_Header. The span TOC_Toggle is linked to a very small bit of JavaScript which toggles the visibility of the second row, TOC_Content. The cell in that row holds another table, where each TOC entry has its own row. Those cells have one of three classes (TOC_level_H3, TOC_level_H4, or TOC_level_H5), which pad the left side to indent the entries. The link inside the cell points to the matching hx element further down the page.

Making this case-insensitive

There are two utility methods in TOCFilter, which allow the filter to work without regard to case.

Private Function StringContains(ByVal ThisString As String, _
ByVal SearchText As String) As Boolean
    Dim i As Integer = ThisString.IndexOf(SearchText, _
        StringComparison.CurrentCultureIgnoreCase)
    'IndexOf returns -1 when there is no match; 0 is a match at the very start
    Return (i <> -1)
End Function

Private Function StringReplace(ByVal ThisString As String, _
    ByVal Pattern As String, _
    ByVal ReplacementText As String) As String

    Return Regex.Replace(ThisString, Pattern, ReplacementText, RegexOptions.IgnoreCase)
End Function

StringContains does a case-insensitive IndexOf operation, and returns True if a match is found. StringReplace uses Regular Expressions to do a case-insensitive replace. Please note that while StringReplace is sufficient for this filter, it is not robust enough for most real-world situations. If you want to use it as-is, you do so at your own risk.
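
If you do want something a little sturdier, one option is to stop both arguments from being interpreted as regular expression syntax. This is only a sketch under that assumption; ReplaceLiteral is a hypothetical name and is not in the downloadable source. Regex.Escape neutralizes special characters in the search text, and doubling each $ keeps the replacement text from being read as a substitution such as $0.

```vb
Imports System
Imports System.Text.RegularExpressions

Module ReplaceDemo
    'Hypothetical helper: case-insensitive replace that treats both the
    'search text and the replacement text as plain literals
    Function ReplaceLiteral(ByVal ThisString As String, _
        ByVal SearchText As String, _
        ByVal ReplacementText As String) As String

        'Escape regex metacharacters in the text being searched for
        Dim Pattern As String = Regex.Escape(SearchText)
        'In a replacement pattern, $$ stands for one literal $
        Dim SafeReplacement As String = ReplacementText.Replace("$", "$$")

        Return Regex.Replace(ThisString, Pattern, SafeReplacement, _
            RegexOptions.IgnoreCase)
    End Function

    Sub Main()
        Console.WriteLine(ReplaceLiteral("<p>{{TOC}}</p>", "{{toc}}", _
            "<b>Save $0 today</b>"))
    End Sub
End Module
```

This prints <p><b>Save $0 today</b></p>. Without the doubling, $0 in the replacement would be expanded to the matched text, so a header containing a dollar amount could come out mangled.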

Overrides sub Write

Now that the theory and infrastructure are out of the way, let's look at the heart of the filter. First, we check to see if the TOC token is present; if not, there is no point generating a table that will not be inserted.

If StringContains(PageContent, "{{toc}}") Then
    Dim Headers As New SortedList(Of Integer, HeaderClass)
    Dim Tag As String = ""
    Dim i As Integer = 0
    Dim j As Integer = 0

    i = PageContent.IndexOf("<h3", StringComparison.CurrentCultureIgnoreCase)
    'IndexOf returns -1 when nothing is found; 0 is a valid position
    Do While i >= 0
        j = PageContent.IndexOf("</h3>", i + 1, _
            StringComparison.CurrentCultureIgnoreCase)
        If j < 0 Then Exit Do   'Unclosed header: stop scanning
        'j - i reaches the start of the closing tag; + 5 takes in "</h3>"
        Tag = PageContent.Substring(i, j - i + 5)
        Headers.Add(i, New HeaderClass("H3", Tag))
        i = PageContent.IndexOf("<h3", j, _
        StringComparison.CurrentCultureIgnoreCase)
    Loop

...

End If

This code searches for <h3> tags. If one is found, the text is copied from PageContent into Tag and added to Headers. The "+ 5" piece covers the five characters of the closing tag, so the copied text runs from the opening < to the final >. Copying one character more would pull in whatever follows the header, and XElement will only accept that if it happens to be white space. The code then looks for the next tag starting from the end of the previous one, until there are no more tags left. After this loop, two more retrieve the <h4> and <h5> tags.

Notice that Headers is a sorted list whose key is the starting position of the tag. This means that, no matter the order in which the tags are retrieved, they will come out of the list in the order they appear in the page text.
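
A quick sketch shows the effect. To keep it self-contained, the list here holds plain rank strings instead of HeaderClass objects, and the positions are made-up numbers; SortedListDemo and OrderedRanks are illustrative names only.

```vb
Imports System
Imports System.Collections.Generic

Module SortedListDemo
    Function OrderedRanks() As String
        Dim Headers As New SortedList(Of Integer, String)

        'Added the way the filter scans: every H3, then H4, then H5
        Headers.Add(120, "H3")
        Headers.Add(480, "H3")
        Headers.Add(300, "H4")
        Headers.Add(350, "H5")

        'Enumeration follows the key (page position), not insertion order
        Dim Parts As New List(Of String)
        For Each kvp As KeyValuePair(Of Integer, String) In Headers
            Parts.Add(kvp.Key.ToString() & ":" & kvp.Value)
        Next
        Return String.Join(" ", Parts.ToArray())
    End Function

    Sub Main()
        Console.WriteLine(OrderedRanks())
    End Sub
End Module
```

The output is 120:H3 300:H4 350:H5 480:H3, which is page order even though both H3 entries were added first.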

Once we have a list of the headers being indexed, we need to generate the table.

If Headers.Count > 0 Then
    Dim TocStr As New StringBuilder
    Dim H3 As Integer = 0
    Dim H4 As Integer = 0
    Dim H5 As Integer = 0
    Dim Index As String = ""
    Dim NewBufferStr As StringBuilder = Nothing
    Dim shift As Integer = 0
    Dim fudge As Integer = 0

    TocStr.AppendLine("<table id=""TOC"">")
    TocStr.Append(" <tr><th id=""TOC_Header"">")
    TocStr.Append("Contents [<span id=""TOC_Toggle"" onclick=""ShowHideToc();"">Hide</span>]")
    TocStr.AppendLine("</th></tr>")
    TocStr.AppendLine(" <tr style=""display:block;"" id=""TOC_Content"">")
    TocStr.AppendLine("  <td><table>")

    For Each kvp As KeyValuePair(Of Integer, HeaderClass) In Headers
        Select Case kvp.Value.Rank
            Case "H3"
                H3 += 1
                H4 = 0
                H5 = 0
                Index = String.Format("{0}", H3)
                fudge = 3 - Index.Length

            Case "H4"
                H4 += 1
                H5 = 0
                Index = String.Format("{0}.{1}", H3, H4)
                fudge = 3 - Index.Length

            Case "H5"
                H5 += 1
                Index = String.Format("{0}.{1}.{2}", H3, H4, H5)
                fudge = 3 - Index.Length
        End Select

        NewBufferStr = New StringBuilder
        NewBufferStr.Append(PageContent.Substring(0, shift + kvp.Key))
        NewBufferStr.AppendFormat(kvp.Value.TagReplacement, Index.Replace(".", "_"))
        NewBufferStr.Append(PageContent.Substring(shift + kvp.Key + kvp.Value.Length))

        shift += (kvp.Value.TagReplacement.Length - fudge - kvp.Value.Tag.ToString.Length)

        TocStr.AppendFormat("<tr><td class=""TOC_level_{0}"">", kvp.Value.Rank)
        TocStr.AppendFormat("<a href=""#{0}_{1}"">{2}&nbsp;&nbsp;{3}</a>", _
            Index.Replace(".", "_"), kvp.Value.Id.Replace(" ", "_"), Index, kvp.Value.Text)
        TocStr.AppendLine("</td></tr>")

        PageContent = NewBufferStr.ToString
    Next

    TocStr.AppendLine("  </table></td>")
    TocStr.AppendLine(" </tr>")
    TocStr.AppendLine("</table>")

    PageContent = StringReplace(PageContent, "{{toc}}", TocStr.ToString)

End If

This code is run only if there are headers found. After initializing the variables, it constructs the start of the table of contents in TocStr. Then it goes through every item in Headers. Depending on the type of the current header tag, the index values are reset and the index string is generated. fudge holds the difference between the three characters of the {0} placeholder in TagReplacement and the length of the index text that will replace it; it is used further down to correct shift.

Once we have these values, we splice out the old header tag and insert the new and improved one. The first part of PageContent is copied over to NewBufferStr, up to the location of the old tag. Then we append the new tag out of HeaderClass.TagReplacement. Because the property is already set up with a {0} placeholder for the index, we can use it to format the append. Then we move to the end of the old tag and copy the rest. shift is updated so we know how far out of synch the main buffer has gotten, then we add the link to the new header tag into TocStr. We reset PageContent to include the new tag, and move on to the next header in the list.

Please note that because TagReplacement is used as a format string, it should not contain any format characters other than the one that HeaderClass puts in. If you absolutely must have curly braces in your header text, you will need to give the header tag a safe id.
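
For the curious, here is what goes wrong and one possible workaround. This is only a sketch: SafeFormat is a hypothetical helper that is not in the filter, and it assumes that the only real placeholder in the template is {0}.

```vb
Imports System

Module BraceDemo
    'Hypothetical helper: doubles every brace so that String.Format treats
    'it as literal text, then restores the single {0} placeholder
    Function SafeFormat(ByVal Template As String, ByVal Index As String) As String
        Dim Escaped As String = Template.Replace("{", "{{").Replace("}", "}}")
        Escaped = Escaped.Replace("{{0}}", "{0}")
        Return String.Format(Escaped, Index)
    End Function

    Sub Main()
        'A header whose text (and therefore auto-generated id) has braces
        Dim Template As String = "<h3 id=""{0}_Using_{x}"">Using {x}</h3>"

        Try
            'Used as-is, {x} is not a valid format item
            Console.WriteLine(String.Format(Template, "1"))
        Catch ex As FormatException
            Console.WriteLine("FormatException: " & ex.Message)
        End Try

        Console.WriteLine(SafeFormat(Template, "1"))
    End Sub
End Module
```

The first attempt throws a FormatException; the second prints <h3 id="1_Using_{x}">Using {x}</h3>. The workaround has its own blind spot: a header whose text literally contains {0} would still be treated as a placeholder, so giving the header a safe id remains the simpler fix.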

It is really important to keep track of shift. Remember, the tag locations were based on the original scan. When we rewrite the tags, they will be longer: even if the tag already has an id, we are still adding the sequence index. shift allows us to keep track of where we are supposed to perform our cut.
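
The bookkeeping is easier to follow on a toy example. In this sketch (ShiftDemo and Splice are illustrative names, and the strings are invented), two three-character tags are found at positions 2 and 7 of the original text, and each is replaced by an eight-character tag.

```vb
Imports System

Module ShiftDemo
    Function Splice() As String
        Dim Content As String = "aa<X>bb<X>cc"
        'Positions recorded against the ORIGINAL text, before any rewriting
        Dim Positions() As Integer = {2, 7}
        Dim OldTag As String = "<X>"
        Dim NewTag As String = "<X id=1>"
        Dim shift As Integer = 0

        For Each Pos As Integer In Positions
            'Cut at the original position plus the growth so far
            Content = Content.Substring(0, shift + Pos) & NewTag & _
                Content.Substring(shift + Pos + OldTag.Length)
            'Each replacement makes the text five characters longer
            shift += NewTag.Length - OldTag.Length
        Next

        Return Content
    End Function

    Sub Main()
        Console.WriteLine(Splice())
    End Sub
End Module
```

The result is aa<X id=1>bb<X id=1>cc. Without shift, the second cut would be made at position 7 of the already lengthened text, which lands inside the first replacement tag.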

After stepping through Headers, we close the table of contents and use the case-insensitive StringReplace to swap out the {{toc}} token with the generated mark-up. The last thing that must be done is to write the modified text out to the next filter in the chain.

Those are features, not bugs

While making some corrections, I noticed a possible complaint that I want to head off.

The index for a level is initialized when that level is passed. That means that when you skip a level, like I did with the last TOC entry in the image at the article's start, you end up with an index value of zero. I'm calling this a feature, as good design means you do not skip levels. The zero will let you quickly find when you've done this.
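
Here is the numbering logic on its own, so you can see the zero appear. The counters work the same way as in the filter; the module name, the function name, and the sample sequence of ranks are mine.

```vb
Imports System
Imports System.Collections.Generic

Module IndexDemo
    'Returns the index strings for a sequence of header ranks
    Function BuildIndexes(ByVal Ranks() As String) As String
        Dim H3 As Integer = 0
        Dim H4 As Integer = 0
        Dim H5 As Integer = 0
        Dim Result As New List(Of String)

        For Each Rank As String In Ranks
            Select Case Rank
                Case "H3"
                    H3 += 1
                    H4 = 0      'A new H3 restarts both lower levels
                    H5 = 0
                    Result.Add(String.Format("{0}", H3))
                Case "H4"
                    H4 += 1
                    H5 = 0      'A new H4 restarts the H5 counter
                    Result.Add(String.Format("{0}.{1}", H3, H4))
                Case "H5"
                    H5 += 1
                    Result.Add(String.Format("{0}.{1}.{2}", H3, H4, H5))
            End Select
        Next

        Return String.Join(", ", Result.ToArray())
    End Function

    Sub Main()
        'The last header jumps from H3 straight to H5
        Console.WriteLine(BuildIndexes(New String() {"H3", "H4", "H5", "H3", "H5"}))
    End Sub
End Module
```

This prints 1, 1.1, 1.1.1, 2, 2.0.1. The final entry skipped the H4 level, so its middle number is still the zero that the second H3 reset it to.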

The current version of the filter will remove the {{toc}} token if no headers are found.

Moving on

The ability to easily intercept and alter a web page before it is delivered to the user opens some interesting possibilities, one of which I will cover in my next article. If you find other uses, I would enjoy hearing about them. And as always, if this article was useful to you, please vote it up.

History

Version 1 2011-03-09 - Initial release.
Version 2 2011-03-03 - Corrections and some minor additions.
Version 3 2011-03-15 - Fixed a bug in the source code: StringContains now returns (i <> -1).
Version 4 2011-03-29 - Rewrote the filter to handle the situation where the page content does not come in all at once.