Generate a Google Site Map Using the HTTP 404 Handler

24 Nov 2007 | CPOL | 6 min read
Site maps make your websites search engine friendly. Learn how to generate them dynamically using your site's HTTP 404 error handler page.

Introduction

When you designate a page to handle HTTP 404 (Not Found) errors on a website hosted by Microsoft Internet Information Services (IIS), that page doesn't have to return an HTTP 404 error at all. Instead, it can return dynamic content with an HTTP 200 (OK) result. This is helpful when, for example, you want to build a sitemap.xml file to improve how search engines index your website. In this article, I'll show you how I did this for my own blog.

Backgrounder

There are two ways that a 404 error handler page can be invoked when using IIS with ASP.NET. For the file types registered with ASP.NET (e.g. ASPX, ASMX), the <customErrors> element in the <system.web> section of your web.config file determines which page is invoked when different kinds of errors occur. For 404 errors, ASP.NET switches to the handler page by sending the client an HTTP 302 (Found) redirect. This is unacceptable when you want a clean, transparent transfer to the handler page without any knowledge on the part of the client. However, when IIS handles a 404 error instead of ASP.NET, it does something akin to a Server.Transfer() call under the hood, meaning that the client is not redirected. This is exactly what we need to implement our dynamically generated sitemap.xml file. Since XML files are not mapped to the ASP.NET engine by default, IIS handles the missing file itself and transfers to an ASP.NET page of our choice, where we can do whatever we like.
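For reference, the ASP.NET side of that registration might look like the fragment below. This is a minimal sketch: the page name ErrorNotFound404.aspx is the one used later in this article, and the mode value is an assumption that you should adjust for your own site.

```xml
<configuration>
  <system.web>
    <!-- ASP.NET answers a 404 for its own file types with a 302 redirect -->
    <customErrors mode="On">
      <error statusCode="404" redirect="~/ErrorNotFound404.aspx" />
    </customErrors>
  </system.web>
</configuration>
```

Because this path redirects the client, it is not the one that produces the site map. It only covers missing ASPX-style pages.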

Google's Use of Site Maps

Site maps used by Google and other search engines depend on a simple XML schema that is documented at http://www.sitemaps.org/protocol.php. If you're like me, the best way to understand such a simple schema is to look at a real, live site map. Load the live sitemap.xml file for my own blog into a new web browser window to see an example. It's very easy to understand, don't you think? Site maps are a good complement to the robots.txt file on your site because they let you specify what should be indexed by the search engine, whereas robots.txt specifies what should not be. Use the Google Webmaster Tools to register your site map when it's ready.
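If you don't have a live site map handy, the shape of the document is roughly as follows. This is an illustrative sketch only: the www.example.com host and the SomePost.ashx page name are placeholders, not entries from my blog.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.example.com/MyBlog.aspx</loc>
    <changefreq>daily</changefreq>
    <priority>1.0</priority>
  </url>
  <url>
    <loc>http://www.example.com/SomePost.ashx</loc>
    <lastmod>2007-11-24</lastmod>
    <changefreq>monthly</changefreq>
  </url>
</urlset>
```

Only the loc element is required inside each url element; lastmod, changefreq and priority are optional hints to the crawler.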

The 404 Handler Page Code

Of course, the key to being able to dynamically generate a sitemap.xml file using a 404 error handler page is that the sitemap.xml file must not exist, physically, on your site. Start by creating a new ASPX page that will do the work instead. Remember that this page is probably going to do double duty by generating your sitemap.xml file and by handling real "not found" problems. So, it should be styled in a way that matches your site design.

At my gotnet.biz website, I store a reference to the pages that I want Google to index in a database. To build a dynamic site map, all I need to do is add each of those pages as a <url> node, according to the sitemap.xml specification. Below is a helper method called AddUrlNodeToUrlSet, which does just that. It is shown here as part one of a two-part partial class:

C#
using System;
using System.Web;
using System.Web.UI;
using System.Xml;

public partial class ErrorNotFound404 : Page
{
    // the standard schema namespace and change frequencies
    // for site maps defined at http://www.sitemaps.org/protocol.php
    private static readonly string xmlns =
        "http://www.sitemaps.org/schemas/sitemap/0.9";
    private enum freq { always, hourly, daily, weekly, monthly, yearly, never }

    // add a url node to the specified XML document with standard
    // priority to the urlset at the document root
    private static void AddUrlNodeToUrlSet( Uri reqUrl, XmlDocument doc,
        string loc, DateTime? lastmod, freq? changefreq )
    {
        // sanity checks
        if (reqUrl == null || doc == null || loc == null)
            return;

        // call the overload with standard priority
        AddUrlNodeToUrlSet( reqUrl, doc, loc, lastmod, changefreq, null );
    }

    // add a url node to the specified XML document with variable
    // priority to the urlset at the document root
    private static void AddUrlNodeToUrlSet( Uri reqUrl, XmlDocument doc,
        string loc, DateTime? lastmod, freq? changefreq, float? priority )
    {
        // sanity checks
        if (reqUrl == null || doc == null || loc == null)
            return;

        // create the child url element
        XmlNode urlNode = doc.CreateElement( "url", xmlns );

        // format the URL based on the site settings:
        // SCHEME + AUTHORITY + VIRTUAL PATH + FILENAME
        // no manual entity escaping is done here because assigning the
        // value to InnerText below already escapes markup characters;
        // escaping by hand as well would produce "&amp;amp;" in the output
        string url = String.Format( "{0}://{1}{2}", reqUrl.Scheme,
            reqUrl.Authority, VirtualPathUtility.ToAbsolute(
            String.Format( "~/{0}", loc ) ) );

        // set up the loc node containing the URL and add it
        XmlNode newNode = doc.CreateElement( "loc", xmlns );
        newNode.InnerText = url;
        urlNode.AppendChild( newNode );

        // set up the lastmod node (if it should exist) and add it
        if (lastmod != null)
        {
            newNode = doc.CreateElement( "lastmod", xmlns );
            newNode.InnerText = lastmod.Value.ToString( "yyyy-MM-dd" );
            urlNode.AppendChild( newNode );
        }

        // set up the changefreq node (if it should exist) and add it
        if (changefreq != null)
        {
            newNode = doc.CreateElement( "changefreq", xmlns );
            newNode.InnerText = changefreq.Value.ToString();
            urlNode.AppendChild( newNode );
        }

        // set up the priority node (if it should exist) and add it
        if (priority != null)
        {
            newNode = doc.CreateElement( "priority", xmlns );
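            // out-of-range values fall back to the default of 0.5; the
            // invariant culture keeps the decimal separator as "."
            newNode.InnerText =
                (priority.Value < 0.0f || priority.Value > 1.0f)
                ? "0.5" : priority.Value.ToString( "0.0",
                System.Globalization.CultureInfo.InvariantCulture );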
            urlNode.AppendChild( newNode );
        }

        // add the new url node to the urlset node
        doc.DocumentElement.AppendChild( urlNode );
    }
}
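
One detail of XmlDocument is worth checking when you adapt this helper: assigning a string to InnerText stores the raw text, and entity escaping happens when the document is serialized. A value that has already been escaped by hand would therefore be escaped a second time. The standalone sketch below illustrates this outside of ASP.NET. The class name, method name and example URL are made up for the illustration.

```csharp
using System;
using System.Xml;

public static class InnerTextEscapingDemo
{
    // Builds a one-entry urlset for the given raw URL and returns the XML.
    public static string BuildLoc( string rawUrl )
    {
        const string ns = "http://www.sitemaps.org/schemas/sitemap/0.9";
        XmlDocument doc = new XmlDocument();
        doc.LoadXml( String.Format( "<urlset xmlns=\"{0}\"></urlset>", ns ) );

        XmlElement url = doc.CreateElement( "url", ns );
        XmlElement loc = doc.CreateElement( "loc", ns );
        loc.InnerText = rawUrl;   // raw text in, escaping is applied on output
        url.AppendChild( loc );
        doc.DocumentElement.AppendChild( url );
        return doc.OuterXml;
    }

    public static void Main()
    {
        // the ampersand in the query string is written as &amp; exactly once
        Console.WriteLine( BuildLoc(
            "http://www.example.com/view.aspx?id=1&page=2" ) );
    }
}
```
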

The AddUrlNodeToUrlSet method defined above is called from the Page_Load event handler to construct the sitemap.xml file. It simply adds one <url> node for each page on my site that I want to reference in the site map file. Please keep in mind that for my blog, I generate my site map from a list of page names stored in a database table. The next section of code opens that database and reads the results, so the code that finds the searchable pages on your own site might be very different. Now let's look at the Page_Load method in part two of this page:

C#
using System;
using System.Data.OleDb;
using System.Web;
using System.Web.UI;
using System.Xml;

public partial class ErrorNotFound404 : Page
{
    protected void Page_Load( object sender, EventArgs e )
    {
        string QS = Request.ServerVariables["QUERY_STRING"];

        // was it the sitemap.xml file that was not found?
        if (QS != null && QS.EndsWith( "sitemap.xml",
            StringComparison.OrdinalIgnoreCase ))
        {
            // build the sitemap.xml file dynamically by adding all of the
            // articles from the database, set the MIME type to text/xml
            // and stream the file back to the search engine bot
            XmlDocument doc = new XmlDocument();
            doc.LoadXml( String.Format( "<?xml version=\"1.0\" encoding" +
                "=\"UTF-8\"?><urlset xmlns=\"{0}\"></urlset>", xmlns ) );

            // add the fixed blog URL for this site with top priority
            AddUrlNodeToUrlSet( Request.Url, doc, "MyBlog.aspx", null,
                freq.daily, 1.0f );
            // NOTE: add more fixed urls as necessary for your site
            // this could be done programmatically or better still by
            // dependency injection

            // now query the database and add the virtual URLs for this site
            string connectionString = String.Format(
               "NOTE: set this to suit the needs of your content database" );
            string query = "SELECT PAGE_NAME, POSTING_DATE FROM BLOGDB " +
                "ORDER BY POSTING_DATE";

            // the using blocks dispose the connection, command and reader
            // even if an exception is thrown while reading
            using (OleDbConnection conn = new OleDbConnection( connectionString ))
            using (OleDbCommand cmd = new OleDbCommand( query, conn ))
            {
                conn.Open();
                using (OleDbDataReader rdr = cmd.ExecuteReader())
                {
                    while (rdr.Read())
                    {
                        object page_name = rdr[0];
                        object posting_date = rdr[1];
                        if (page_name != null && !(page_name is DBNull))
                        {
                            // "as" yields null for a DBNull date instead of
                            // throwing an InvalidCastException
                            AddUrlNodeToUrlSet( Request.Url, doc, String.Format(
                                "{0}.ashx", page_name.ToString().Trim() ),
                                posting_date as DateTime?, freq.monthly );
                        }
                    }
                }
            }

            // IMPORTANT - trace has to be disabled or the XML returned will
            // not be valid because the div tag inserted by the tracing code
            // will look like a second root XML node which is invalid
            Page.TraceEnabled = false;

            // IMPORTANT - you must clear the response in case handlers
            // upstream inserted anything into the buffered output already
            Response.Clear();

            // IMPORTANT - set the status to 200 OK, not the 404 Not Found
            // that this page would normally return
            Response.Status = "200 OK";

            // IMPORTANT - set the MIME type to XML
            Response.ContentType = "text/xml";

            // buffer the whole XML document and end the request
            Response.Write( doc.OuterXml );
            Response.End();
        }

        // not the sitemap.xml file so set the standard 404 error code
        Response.Status = "404 Not Found";
    }
}
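
The QUERY_STRING check in Page_Load can be made a little stricter. When IIS transfers to a URL-type custom error page, the query string normally has the form "404;" followed by the original absolute URL. The sketch below parses that form and compares the path exactly, so that a request for, say, old-sitemap.xml is not mistaken for the site map. The class and method names are hypothetical, and the sketch assumes the site map lives at the site root.

```csharp
using System;

public static class MissingUrlParser
{
    // Returns the path of the originally requested URL, or null when the
    // query string is not in the "404;<absolute url>" form.
    public static string GetMissingPath( string queryString )
    {
        if (String.IsNullOrEmpty( queryString ))
            return null;

        int sep = queryString.IndexOf( ';' );
        if (sep < 0 || sep == queryString.Length - 1)
            return null;

        Uri original;
        if (!Uri.TryCreate( queryString.Substring( sep + 1 ),
            UriKind.Absolute, out original ))
            return null;

        return original.AbsolutePath;
    }

    // True only when the missing file is sitemap.xml at the site root
    // (the root location is an assumption of this sketch).
    public static bool IsSiteMapRequest( string queryString )
    {
        return String.Equals( GetMissingPath( queryString ), "/sitemap.xml",
            StringComparison.OrdinalIgnoreCase );
    }

    public static void Main()
    {
        Console.WriteLine( IsSiteMapRequest(
            "404;http://www.example.com:80/sitemap.xml" ) );     // True
        Console.WriteLine( IsSiteMapRequest(
            "404;http://www.example.com:80/old-sitemap.xml" ) ); // False
    }
}
```

In the page itself, you would pass Request.ServerVariables["QUERY_STRING"] to IsSiteMapRequest in place of the EndsWith test.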

When Page_Load starts, it checks QUERY_STRING to see if sitemap.xml was the missing file that caused the transfer to happen. This is possible because, when IIS transfers to the error handler page, it appends the originally requested URL to QUERY_STRING. If the name is sitemap.xml, my code starts a new XML document and adds the virtual <url> nodes using the AddUrlNodeToUrlSet method shown above. Which page names you include in your site map depends entirely on your site's content, so you'll have to make most of your adjustments to my sample in that area. The code at the end of Page_Load deserves a closer look. Five key things happen at this point, in order:

  • You must disable page tracing if it's turned on. If you don't, ASP.NET appends a <div> element to the end of the document making your XML appear as though it has two root nodes, which will invalidate it.
  • You must clear the Response object in case some other code has already buffered some content to be sent back to the browser. You want just the XML of the site map in the output, nothing else.
  • You need to set the HTTP status code to 200 to make sure that the client sees the result of its request as successful. A search engine bot that receives a 404 status will treat the site map as missing, whatever the response body contains.
  • You must set the MIME type of Response to text/xml because that's what the search engine bots expect for the document type you are returning.
  • Finally, grab the OuterXml property of the XML document and Write() it back to the browser before ending the Response.

Configuring IIS to Transfer to the Error Handler Page

To get the page defined above to handle HTTP 404 errors, it first has to be registered with IIS. Remember, you can register the same ASPX page to handle errors for both ASP.NET and non-ASP.NET file types. For file types handled by ASP.NET, you register the page in web.config. Since the XML type is not handled by the ASP.NET engine, you need to tell IIS itself about the page, which, in IIS 6 and earlier, cannot be done through the web.config file. Instead, you must use the IIS Management Console to register the error handler page. The Microsoft TechNet website has very good instructions on this topic. On my test site, the registration of the ErrorNotFound404.aspx page in the IIS Management Console looks like this:

Screenshot - ConfiguringIISErrors.gif
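
The steps above use the IIS Management Console. On IIS 7 and later, a roughly equivalent registration can usually be written in web.config under <system.webServer>. The fragment below is a sketch under that assumption, not something I used for this article: the path assumes the handler page sits at the site root, and the httpErrors section may need to be unlocked by the server administrator before a site-level web.config is allowed to set it.

```xml
<configuration>
  <system.webServer>
    <httpErrors errorMode="Custom">
      <!-- replace the inherited 404 entry with our handler page -->
      <remove statusCode="404" subStatusCode="-1" />
      <error statusCode="404" path="/ErrorNotFound404.aspx"
             responseMode="ExecuteURL" />
    </httpErrors>
  </system.webServer>
</configuration>
```

The ExecuteURL response mode runs the page on the server without redirecting the client, which is the behavior the site map technique depends on.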

Conclusion

You can also register the same page as an error handler with ASP.NET via the web.config file as discussed above. Just be aware that when the ASP.NET engine handles an error, it will redirect the browser to the page you specify. So, if you're depending on a clean transfer to the error handler, you probably aren't going to get exactly what you want. For the sitemap.xml file, though, the approach shown above is very clean because of the way IIS handles missing files. Once you're done, use the Fiddler2 Web Debugging Proxy to request your sitemap.xml file and use the session inspector to see exactly what's happening on the wire. You'll see just how clean this code makes the would-be 404 error for that missing file appear to the search engine bots.

One Parting Thought

If you can generate a dynamic sitemap.xml file using this technique, you could probably use it to generate almost any kind of virtual file: robots.txt, RSS feeds, etc. This means that even more of your site could be dynamically generated from database content. Think about that. Enjoy!
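
As a starting point for that idea, the handler page could look up the missing path in a small table to decide whether it is a virtual file and which MIME type to send. The sketch below is hypothetical throughout: the class name, the entries in the table and the chosen MIME types are illustrative assumptions, and generating each file's content is left to you.

```csharp
using System;
using System.Collections.Generic;

public static class VirtualFileTable
{
    // Hypothetical table of virtual files and the MIME type for each one.
    private static readonly Dictionary<string, string> contentTypes =
        new Dictionary<string, string>( StringComparer.OrdinalIgnoreCase )
        {
            { "/sitemap.xml", "text/xml" },
            { "/robots.txt", "text/plain" },
            { "/rss.xml", "application/rss+xml" }
        };

    // Returns the MIME type for a virtual file, or null when the path is
    // not one we generate, in which case a real 404 should be returned.
    public static string GetContentType( string path )
    {
        string contentType;
        if (path != null && contentTypes.TryGetValue( path, out contentType ))
            return contentType;
        return null;
    }

    public static void Main()
    {
        Console.WriteLine( GetContentType( "/robots.txt" ) );
        Console.WriteLine( GetContentType( "/missing.htm" ) == null );
    }
}
```

A null result means the request was a genuine "not found", so the page should keep its 404 status and show its normal error content.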

History of This Article

  • 24 Nov 2007 - Initial publication
  • 28 Nov 2007 - Article edited and moved to the main CodeProject.com article base

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


Written By
Web Developer
United States
After 16 years as an ardent C++ aficionado, Kevin switched to C# in 2001. Recently, Kevin's been dabbling in dynamically typed languages. Kevin is the Software Architect for Snagajob.com, the #1 source for hourly and part-time employment on the web.

Kevin loves welding, riding motorcycles and spending time with his family. Kevin has also been an adjunct professor teaching software engineering topics at a college in his hometown of Richmond, Virginia since 2000. Check out Kevin's technical blog at www.gotnet.biz for more goodies.
