First Posted 5 Dec 2004

An intelligent 404 page

By Tim_Mackey | 5 Dec 2004 | Article
This article explains how to enhance your 404 page by providing links to pages with a similar name to the requested page.

Introduction

Have you ever noticed that on big sites like www.microsoft.com, if you request a page that doesn't exist, they don't just say "Sorry, 404"? Instead, they give you a list of pages that are similar to the one you requested. This is obviously very nice for your users to have, and it's easy enough to integrate into your site. This article provides source code and explains the algorithm behind this feature. Note: the real benefit of the approach outlined here is the semi-intelligent string comparison.

Background

The need for this grew out of a project for a client of mine who was changing content management systems. Every URL in the site changed, so all the search engine results led to 404 pages. This was obviously a big inconvenience, so I put this together to help users find their way through the new site when arriving from a search engine.

See it in action

Go to this page (which doesn't exist), and the 404 page should give you a list of pages that have quite similar names.

Requirements

  • Your web site must be set up so that 404 pages get redirected to a .NET aspx page.
  • You must have some way of getting an array of all the page URLs in your site that you want to compare 404 requests against. If you have a content management system, there is probably a structure of all the pages stored in XML or a JavaScript array (for DHTML menus or something), or you could write your own query to get the pages from a database. If you do not use a content management system, you could hard-code a string array variable in the 404 page code-behind containing the page names, or think up some way of dynamically reading all the .aspx or .html pages from the file system.
  • When the 404 page is accessed, you need to know which page was requested. Using web.config, you can set up 404 errors to redirect to /404.aspx, and ASP.NET will append the requested path to the querystring. The source code here assumes you take this approach, but you can obviously adapt it to your own needs; simply change the GetRequestedUrl() function.
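
The last option in the second requirement, reading page names from the file system, could look something like the sketch below. It is only an illustration: the SitePageScanner class and GetPageNames method are hypothetical names that are not part of the article's download, and the code assumes .NET 2.0 or later (for SearchOption) and that file names alone are enough to identify a page.

```csharp
using System;
using System.Collections;
using System.IO;

public static class SitePageScanner
{
    // Hypothetical helper: returns the lower-cased, extension-less file
    // names of every .aspx and .html file under rootFolder (subfolders
    // included), ready to be compared against a requested page name.
    public static ArrayList GetPageNames(string rootFolder)
    {
        ArrayList names = new ArrayList();
        foreach (string pattern in new string[] { "*.aspx", "*.html" })
        {
            string[] paths = Directory.GetFiles(
                rootFolder, pattern, SearchOption.AllDirectories);
            foreach (string path in paths)
            {
                names.Add(Path.GetFileNameWithoutExtension(path).ToLower());
            }
        }
        return names;
    }
}
```

Inside the 404 page you might call it as SitePageScanner.GetPageNames(Server.MapPath("~/")). Scanning the disk on every 404 request could be slow on a large site, so caching the list is probably worthwhile, and you may want to leave the 404 page itself out of the list.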

Why Regular Expressions are not enough

To compare strings, you can use System.String.IndexOf or you can use regular expressions to match similarities, but all these methods are very unforgiving for slight discrepancies in the string. In the example URL above, the page name is December15-ISERCWorkshoponTesting.html but under the new content management system, the URL is December 15 - ISERC Workshop - Software Testing.html, which is different enough to make traditional string comparison techniques fall down.
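
To make the problem concrete, here is a small sketch of a substring test built on String.IndexOf. The ExactMatchDemo class and ContainsExact method are hypothetical names used only for this illustration.

```csharp
using System;

public static class ExactMatchDemo
{
    // Returns true only if 'requested' occurs verbatim (ignoring case)
    // somewhere inside 'candidate'. A single extra space or hyphen in
    // either string is enough to make this return false.
    public static bool ContainsExact(string candidate, string requested)
    {
        return candidate.IndexOf(
            requested, StringComparison.OrdinalIgnoreCase) >= 0;
    }
}
```

Calling ContainsExact("December 15 - ISERC Workshop - Software Testing", "December15-ISERCWorkshoponTesting") returns false, because the old name never appears character for character inside the new one, even though a human can see that they refer to the same page.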

So, I looked around for a fuzzy string comparison routine, and came across an algorithm devised by Vladimir Levenshtein. His algorithm figures out how different two strings are, based on how many character insertions, deletions and substitutions are necessary to change one string into the other. This is called the 'edit distance', i.e., how far you have to go to make two strings match. This is very useful because it tolerates slight differences in spacing, punctuation and spelling. I found this algorithm here where Lasse Johansen kindly ported it to C#. The algorithm is explained at that site, and it is well worth a read to see how it is done.
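
For readers who want to see the idea in code, here is a minimal sketch of the standard dynamic-programming form of the Levenshtein distance. It is not Lasse Johansen's port and not the Levenshtein.CalcEditDistance method shipped with the download; the EditDistance class and Compute method are names made up for this illustration.

```csharp
using System;

public static class EditDistance
{
    // Returns the minimum number of single-character insertions,
    // deletions and substitutions needed to turn 'a' into 'b'.
    public static int Compute(string a, string b)
    {
        if (a == null) a = "";
        if (b == null) b = "";

        // previous[j] = distance between the first i-1 chars of a
        // and the first j chars of b; current[j] is the same for i chars.
        int[] previous = new int[b.Length + 1];
        int[] current = new int[b.Length + 1];

        // Turning an empty prefix of a into j chars of b takes j insertions.
        for (int j = 0; j <= b.Length; j++) previous[j] = j;

        for (int i = 1; i <= a.Length; i++)
        {
            current[0] = i; // i deletions to reach an empty string
            for (int j = 1; j <= b.Length; j++)
            {
                int cost = (a[i - 1] == b[j - 1]) ? 0 : 1;
                int insertion = current[j - 1] + 1;
                int deletion = previous[j] + 1;
                int substitution = previous[j - 1] + cost;
                current[j] = Math.Min(Math.Min(insertion, deletion), substitution);
            }
            // Reuse the two rows instead of allocating a full matrix.
            int[] swap = previous;
            previous = current;
            current = swap;
        }
        return previous[b.Length];
    }
}
```

For example, EditDistance.Compute("kitten", "sitting") is 3: two substitutions and one insertion. Keeping only two rows of the table is a design choice that saves memory; the full matrix version gives the same result.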

Normalizing the Scores

I originally had a problem with the algorithm because it gave surprising results in certain situations. If the 404 page request was for 'hello' and there is a valid page called 'hello_new_version' and another valid page called 'abcde', then the 'abcde' page gets the better raw score, because fewer changes are needed to turn it into 'hello': just five substitutions, one per character. Turning 'hello_new_version' into 'hello' takes twelve deletions, even though it is semantically the better match. Fortunately, a kind newsgroup participant named Patrice suggested that I divide the score by the length of the comparison string, to normalize the results. This worked well, and I found that a normalized score between 0 (perfect match) and 0.6 (a good match) is worth including as a suggested page. You can change this value in the ComputeResults() method if you want to make it more or less flexible.
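
The normalization step can be sketched on its own, as below. The ScoreNormalizer class and Normalize method are hypothetical names; the division by the candidate page name's length mirrors what ComputeResults() does with s.Length, and the edit distances are passed in as plain numbers so the arithmetic is easy to follow.

```csharp
using System;

public static class ScoreNormalizer
{
    // Divides a raw edit distance by the length of the candidate page
    // name, giving the average number of edits per character.
    // 0.0 means an exact match; larger values mean a poorer match.
    public static double Normalize(int editDistance, string candidate)
    {
        if (candidate == null || candidate.Length == 0)
            throw new ArgumentException("candidate must be non-empty", "candidate");
        return (double)editDistance / candidate.Length;
    }
}
```

With the example above, Normalize(5, "abcde") gives 5 / 5 = 1.0, while Normalize(12, "hello_new_version") gives 12 / 17, roughly 0.71, so the semantically better page now ranks first. Note that 0.71 is still above the 0.6 cut-off, so for names with long suffixes like this one you may want to loosen the threshold.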

Code Summary

private void Page_Load(object sender, System.EventArgs e)
{
  GetRequestedUrl();
  SetUpSiteUrls();
  ComputeResults();
  BindList();
}

The above code shows the four key tasks that make up this solution. Each method is explained below:

Using the code

  1. GetRequestedUrl() simply figures out which page was requested. In this example, it is assumed that your web.config contains the following:
    <system.web>
      <customErrors mode="On">
        <error statusCode="404" redirect="/404.aspx" />
      </customErrors>
    </system.web>

    With this configuration, the querystring on the 404.aspx page contains the requested URL in the aspxerrorpath parameter.

    private void GetRequestedUrl()
    {
        // assumes that web.config redirects 404 requests to 404.aspx,
        // putting the requested url
        // in the querystring: ?aspxerrorpath=/whatever.aspx
        // (String.Concat turns a missing/null value into an empty string)
        this.requestedUrl = 
             String.Concat(Request.QueryString["aspxerrorpath"],"");
        if(this.requestedUrl == "") // nothing to compare with
            return;
        try
        {
            // keep only the page name: "/folder/whatever.aspx" -> "whatever"
            this.requestedUrl = 
               System.IO.Path.GetFileNameWithoutExtension(requestedUrl);
        }
        catch
        {
            return;    // the requested url contained illegal characters
        }
    }
  2. SetUpSiteUrls() is where you load in all the pages in your site. In my content management system, I have an XML file with all the names, so I do an XPath query and add in the names one by one to the ArrayList.
    private void SetUpSiteUrls()
    {
        this.validUrls = new ArrayList();
        /*
         * Insert code here to add the pages in your site to this arraylist
         */
    }
  3. ComputeResults() iterates through the URLs you set up in SetUpSiteUrls() and attaches a score to each one, indicating how close it is to the requested URL. It also sorts the results and discards any that are not a close match.
    private void ComputeResults()
    {
        // assign the class-level field (not a local variable),
        // so that BindList() can read the results afterwards
        this.results = new ArrayList();
        // build up an arraylist of the positive results
        foreach(string s in validUrls)
        {
            // don't waste time calculating the edit
            // distance of nothing with something
            if(s == null || s == "") continue;
            // compare both strings in lower case,
            // so that the match is case-insensitive
            double distance = Levenshtein.CalcEditDistance(s.ToLower(), 
                   this.requestedUrl.ToLower());
            double meanDistancePerLetter = (distance / s.Length);
            if(meanDistancePerLetter <= 0.60D)
            // anything between 0.0 and 0.6 is a good match.
            // The algorithm always returns a value >= 0
            {
                // add this result to the list. NOTE: you will need some
                // way of inserting the correct url in the hyperlink below
                // (the url represented by 's' doesn't
                // have a file extension or its folder context)
                results.Add(new DictionaryEntry(meanDistancePerLetter, 
                       "<a href='" + s + ".html'>" + s + "</a>"));
                // use dictionary entries because we want to store the score
                // and the hyperlink. can't use sortedlist because it doesn't
                // allow duplicate keys and we may have 2 hyperlinks
                // with the same edit distance.
            }
        }
        // best (lowest) scores first
        results.Sort(new ArrayListKeyAscend());
    }

    Important note: One thing to look out for is the innermost statement of the above code: results.Add(new DictionaryEntry(...)). I am adding an HTML hyperlink made from the name of the page + ".html". This may not be a correct link in your web site, because you may have removed the folder part of the URL while populating the validUrls ArrayList. You may need to expand the data structures used in this code to include the full URL for each page.

  4. BindList() simply binds the ArrayList of results to the DataList, which is configured to display them in a bulleted list.
    private void BindList()
    {
        // encode the requested name before echoing it into the page,
        // because it comes straight from the querystring
        string safeName = Server.HtmlEncode(this.requestedUrl);
        if(results.Count > 0)
        {
            this.lblHeader.Text = 
              "The following pages have similar names to <i>" + 
              safeName + "</i>";
            this.DataList1.DataSource = results;
            this.DataList1.DataBind();
        }
        else
        {
            this.lblHeader.Text = "Unable to find any pages in this site that" + 
                   " have similar names to <i>" + safeName + "</i>";
        }   
    }
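
ComputeResults() sorts with an ArrayListKeyAscend comparer that is not listed in the article. The author's version is in the download; the sketch below is only a guess at what such a comparer might look like, assuming every element is a DictionaryEntry whose Key is the double score stored in step 3.

```csharp
using System;
using System.Collections;

// Hypothetical reconstruction: orders DictionaryEntry items by their
// Key (the normalized score), lowest first, so that the closest
// matches end up at the top of the list.
public class ArrayListKeyAscend : IComparer
{
    public int Compare(object x, object y)
    {
        DictionaryEntry first = (DictionaryEntry)x;
        DictionaryEntry second = (DictionaryEntry)y;
        // Keys are boxed doubles, so unbox them before comparing.
        return ((double)first.Key).CompareTo((double)second.Key);
    }
}
```

Because the comparer only looks at the key, two pages with the same score are both kept, which is the reason the code uses an ArrayList of DictionaryEntry values rather than a SortedList.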

The 'magic' in the code is all done with the Levenshtein.CalcEditDistance method which returns the distance between two strings. It is included in the source.

WinForms Test Application

If you'd like to test out the Levenshtein algorithm, I've written a Windows Forms application that lets you enter a string (e.g., a page URL) and a list of strings to compare it against (e.g., all the page URLs in your site), and it gives you the 'edit distance' scores. Download here - 7.04 Kb.

Comments

I think this is a great feature because it adds significant value to the user experience of a web site. Please feel free to comment below if you have any questions, find any bugs, have improvements to suggest, can't get it working, or use it in a novel way.

Enjoy!

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.


About the Author

Tim_Mackey

Web Developer

Ireland

Last Updated 6 Dec 2004
Article Copyright 2004 by Tim_Mackey