Utility to Generate Large XML Sitemaps from a Database

24 Jun 2012 · CPOL · 3 min read
This is a utility to generate large XML sitemaps from a database.

I wrote a quick utility a while ago to generate large XML sitemaps from a database. There are many sitemap generators out there, such as xml-sitemaps.com, and some now run as web tools, so you don’t even have to download anything.

The problem is that these tools are usually crawler-based. You point one at a starting URL, it crawls that page looking for links, adds them to the sitemap, and then continues on to each of those pages, crawling for more links, and so on.

I never really understood this. Google is already doing that (and believe me, they are doing it better). More importantly, these crawlers are usually not as forgiving to your web server as Google is. Some of them will blindly hit your server with as many connections as they can open. Not only can your server come to a screeching halt, but your sitemap will take forever to create, and the crawl might error out on the umpteenth page and corrupt the whole file.

Now if you are the supposed “webmaster” for this web site, isn’t it safe to assume that you have more intimate knowledge about what pages you want to include in your sitemap than a crawler does? If your website is pulling content from a database of some sort, the answer is likely yes.

Anyway, I created this little tool to make it easy for me to generate large XML sitemap files from a database query, so that I could pack these files chock-full of good links without sending a single HTTP request.

The code is a single file using an XmlTextWriter:

C#
using System;
using System.Collections.Generic;
using System.Text;
using System.Xml;
using System.IO;

class SiteMapWriter
{
    private const string xmlFormatString = 
    @"<?xml version=""1.0"" encoding=""UTF-8""?>";
    private const string xmlStylesheetString = 
    @"<?xml-stylesheet type=""text/xsl"" 
    href=""http://www.intelligiblebabble.com/files/style.xsl""?>";
    private const string xmlns = "http://www.sitemaps.org/schemas/sitemap/0.9";
    private const string xmlns_xsi = "http://www.w3.org/2001/XMLSchema-instance";
    private const string xsi_schemaLocation = 
    "http://www.sitemaps.org/schemas/sitemap/0.9\nhttp://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd";

    private int MaxUrls; //urls per file

    private XmlTextWriter writer;
    private XmlTextWriter parentwriter;
    private readonly HashSet<string> urls;
    private int NumberOfFiles;
    private int NumberOfUrls;

    private string BaseUri;
    private string FileDirectory;
    private string BaseFileName;

    /// <summary>
    /// Create SiteMapWriter Instance
    /// </summary>
    /// <param name="fileDirectory">
    /// the directory to generate the sitemap files to</param>
    /// <param name="baseFileName">
    /// filename to prefix all generated sitemap files</param>
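    /// <param name="baseUri">URL to prefix all generated URLs. 
    /// Leave blank if you want relative.</param>
    /// <param name="maxUrlsPerFile">maximum number of URLs written to each 
    /// child sitemap file. defaults to 30000</param>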
    public SiteMapWriter(string fileDirectory, string baseFileName, 
    string baseUri = "", int maxUrlsPerFile = 30000)
    {
        urls = new HashSet<string>();

        NumberOfFiles = 1;
        NumberOfUrls = 0;

        BaseUri = baseUri;
        BaseFileName = baseFileName;
        FileDirectory = fileDirectory;
        MaxUrls = maxUrlsPerFile;    

        // Path.Combine works whether or not fileDirectory ends with a separator
        var f = Path.Combine(fileDirectory, baseFileName + ".xml");

        if (File.Exists(f)) File.Delete(f);
        parentwriter = new XmlTextWriter(f, Encoding.UTF8) { Formatting = Formatting.Indented };

        parentwriter.WriteRaw(xmlFormatString);
        parentwriter.WriteRaw("\n");
        parentwriter.WriteRaw(xmlStylesheetString);

        parentwriter.WriteStartElement("sitemapindex");
        parentwriter.WriteAttributeString("xmlns", xmlns);
        parentwriter.WriteAttributeString("xmlns:xsi", xmlns_xsi);
        parentwriter.WriteAttributeString("xsi:schemaLocation", xsi_schemaLocation);

        CreateUrlSet();
    }

    /// <summary>
    /// Add Url to SiteMap
    /// </summary>
    /// <param name="loc">relative path to page</param>
    /// <param name="priority">the priority of the page 
    /// (double between 0 and 1). defaults to 0.5</param>
    /// <param name="changefreq">how often the page changes, 
    /// e.g. "daily", "weekly", or "monthly". 
    /// left out of the file if null.</param>
    public void AddUrl(string loc, double priority = 0.5, string changefreq = null)
    {
        if (urls.Contains(loc)) return;

        // roll over to a new file before writing, so that a full final file
        // does not leave an empty sitemap file behind it
        if (NumberOfUrls >= MaxUrls) LimitIsMet();

        writer.WriteStartElement("url");
        writer.WriteElementString("loc", loc);
        if (changefreq != null) { writer.WriteElementString("changefreq", changefreq); }
        // invariant culture keeps the decimal separator a '.' on every machine
        writer.WriteElementString("priority", priority.ToString("0.0000",
            System.Globalization.CultureInfo.InvariantCulture));
        writer.WriteEndElement();

        urls.Add(loc);
        NumberOfUrls++;
        if (NumberOfUrls % 2000 == 0)
            Console.WriteLine("Urls Processed: {0}", NumberOfUrls);
    }

    private void LimitIsMet()
    {
        //close out current file
        CloseWriter();
        //create and start new file
        NumberOfFiles++;
        NumberOfUrls = 0;
        CreateUrlSet();
    }

    private void CreateUrlSet()
    {
        // Path.Combine works whether or not FileDirectory ends with a separator
        string f = Path.Combine(FileDirectory,
            string.Format("{0}_{1}.xml", BaseFileName, NumberOfFiles));

        if(File.Exists(f)) File.Delete(f);
        writer = new XmlTextWriter(f, Encoding.UTF8) { Formatting = Formatting.Indented };

        writer.WriteRaw(xmlFormatString);
        writer.WriteRaw("\n");
        writer.WriteRaw(xmlStylesheetString);

        writer.WriteStartElement("urlset");
        writer.WriteAttributeString("xmlns", xmlns);
        writer.WriteAttributeString("xmlns:xsi", xmlns_xsi);
        writer.WriteAttributeString("xsi:schemaLocation", xsi_schemaLocation);
        AddSiteMapFile(string.Format("{0}_{1}.xml", BaseFileName, NumberOfFiles));
    }

    private void AddSiteMapFile(string filename)
    {
        parentwriter.WriteStartElement("sitemap");
        parentwriter.WriteElementString("loc", string.Concat(BaseUri,filename));
        parentwriter.WriteEndElement();
    }

    private void CloseWriter()
    {
        writer.WriteEndElement();
        writer.Flush();
        writer.Close();
    }

    /// <summary>
    /// Flushes and closes all open writers.
    /// </summary>
    public void Finish()
    {
        CloseWriter();

        parentwriter.WriteEndElement();
        parentwriter.Flush();
        parentwriter.Close();
    }
}

Not the most elegant code in the world, but it gets the job done. Note, as mentioned above, this is NOT for crawling pages for URLs… (although you could write a quick crawler and then use this alongside it to write the file… maybe a post for next week).
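For anyone curious what the parent index file boils down to, here is a minimal sketch of the same idea using XmlWriter.Create instead of XmlTextWriter. The names SitemapIndexSketch and Build, and the parameters, are made up for illustration; only the element names (sitemapindex, sitemap, loc) and the namespace URI come from the class above.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;
using System.Xml;

static class SitemapIndexSketch
{
    private const string SitemapNs = "http://www.sitemaps.org/schemas/sitemap/0.9";

    // Builds a sitemap index in memory. baseUri and childFileNames are
    // illustrative inputs (assumptions), not values taken from the article.
    public static string Build(string baseUri, IEnumerable<string> childFileNames)
    {
        var settings = new XmlWriterSettings
        {
            Indent = true,
            Encoding = new UTF8Encoding(false) // UTF-8 without a byte order mark
        };

        using (var stream = new MemoryStream())
        {
            using (var xml = XmlWriter.Create(stream, settings))
            {
                xml.WriteStartDocument();
                // Passing the namespace here puts every element in the sitemap namespace.
                xml.WriteStartElement("sitemapindex", SitemapNs);
                foreach (var name in childFileNames)
                {
                    xml.WriteStartElement("sitemap", SitemapNs);
                    xml.WriteElementString("loc", SitemapNs, baseUri + name);
                    xml.WriteEndElement();
                }
                xml.WriteEndElement();
                xml.WriteEndDocument();
            }
            return Encoding.UTF8.GetString(stream.ToArray());
        }
    }
}
```

Calling SitemapIndexSketch.Build("http://www.intelligiblebabble.com/", new[] { "SiteMap_1.xml", "SiteMap_2.xml" }) should give an index with one sitemap entry per child file.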

The upside, though, is that this will write your sitemaps essentially as quickly as you can pull the data from the database. I recently used it to generate more than 4 million URLs in around 30 seconds.

Some things to note:

You can try it out yourself like so:

C#
public static void Main(string[] args)
{
    var writer = new SiteMapWriter(
            @"C:\SiteMapDirectory\",
            "SiteMap",
            "http://www.intelligiblebabble.com/");

    for(int i = 0; i < 300000; i++)
    {
        writer.AddUrl(
            string.Format("{0}/index.html",Guid.NewGuid()));
    }
    writer.Finish();
}
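In real use, the loop would be fed from a query rather than from Guid.NewGuid(). Here is a rough sketch of that, assuming the first column of the result set holds the relative path. DatabaseFeedSketch, FeedUrls, and SampleRows are hypothetical names, and the DataTable only stands in for a real query result so the sketch runs without a database.

```csharp
using System;
using System.Data;

static class DatabaseFeedSketch
{
    // Reads column 0 of every row as a path and hands it to addUrl.
    // Returns the number of rows that were forwarded.
    public static int FeedUrls(IDataReader reader, Action<string> addUrl)
    {
        int count = 0;
        while (reader.Read())
        {
            if (reader.IsDBNull(0)) continue; // skip rows without a path
            addUrl(reader.GetString(0));
            count++;
        }
        return count;
    }

    // In-memory stand-in for a query result (assumption: one string column).
    public static DataTable SampleRows()
    {
        var table = new DataTable();
        table.Columns.Add("Path", typeof(string));
        table.Rows.Add("products/1/index.html");
        table.Rows.Add("products/2/index.html");
        return table;
    }
}
```

With the class above, the callback would be something like url => writer.AddUrl(url), and the reader would come from your own command's ExecuteReader() call.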

Here, I am generating 300,000 (fake) URLs. Google will not accept sitemap files with more than 50,000 URLs (any more than that and the files get pretty large anyway). As a result, this script creates a parent sitemap index file that links to child sitemap files, each holding fewer than 50,000 URLs. I have set the default maximum number of URLs per file to 30,000 just to be on the safe side, but you can override this in the constructor with the maxUrlsPerFile parameter.
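To get a feel for how many child files a given run needs at minimum, the arithmetic is just a ceiling division. The helper below is a hypothetical illustration (SitemapMath and MinimumFileCount are not part of the class above).

```csharp
using System;

static class SitemapMath
{
    // Smallest number of child files that can hold totalUrls
    // when each file takes at most maxUrlsPerFile entries.
    public static int MinimumFileCount(int totalUrls, int maxUrlsPerFile)
    {
        if (maxUrlsPerFile <= 0)
            throw new ArgumentOutOfRangeException("maxUrlsPerFile");
        if (totalUrls <= 0) return 0;

        // integer ceiling division
        return (totalUrls + maxUrlsPerFile - 1) / maxUrlsPerFile;
    }
}
```

For the example above, MinimumFileCount(300000, 30000) is 10. At the 50,000 limit, the same URLs would fit in 6 files.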

The Sitemaps generated will generally look like this:

XML
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" 
href="http://www.intelligiblebabble.com/files/style.xsl"?>
<urlset
	xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
	xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
	xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemap/0.9&#xA;
                        http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd">
  <url>
    <loc>2b55897a-43d1-4d3f-bec0-22ff50226a9a/index.html</loc>
    <priority>0.5000</priority>
  </url>
  <url>
    <loc>a4cb2c7a-3080-4e77-a5e8-27a869e33afc/index.html</loc>
    <priority>0.5000</priority>
  </url>
  <url>
    <loc>614f281b-c60b-4dca-9c5c-9c59f1fff1fe/index.html</loc>
    <priority>0.5000</priority>
  </url>
</urlset>
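If you want to sanity-check a generated file, one option is to read it back and count the url entries. This is only a sketch: SitemapCheckSketch and its members are hypothetical names, and it takes the XML as a string to stay self-contained (for a file on disk, XDocument.Load would be the equivalent call).

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

static class SitemapCheckSketch
{
    private static readonly XNamespace Ns = "http://www.sitemaps.org/schemas/sitemap/0.9";

    // The per-file limit mentioned in the article.
    public const int UrlLimit = 50000;

    // Counts the <url> entries directly under the root <urlset>.
    public static int CountUrls(string sitemapXml)
    {
        var doc = XDocument.Parse(sitemapXml);
        return doc.Root.Elements(Ns + "url").Count();
    }

    public static bool IsWithinLimit(string sitemapXml)
    {
        return CountUrls(sitemapXml) <= UrlLimit;
    }
}
```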

Some other things to note are:

  • A HashSet is used to ensure that no duplicate URLs are written to the files.
  • The AddUrl method accepts a changefreq parameter and a priority parameter to set the corresponding XML elements.
  • If you are generating the base URL from the database, or you would like to keep the URLs relative, you can leave off the baseUri parameter from the constructor.
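The duplicate check can be shown on its own. In this minimal sketch (UrlDeduplicator is a hypothetical name), HashSet<string>.Add already reports whether the value was new, so a separate Contains call is not strictly needed.

```csharp
using System;
using System.Collections.Generic;

sealed class UrlDeduplicator
{
    // Ordinal comparison: URLs that differ only by case count as different.
    private readonly HashSet<string> seen = new HashSet<string>(StringComparer.Ordinal);

    // Returns true the first time a URL is offered, false for repeats.
    public bool TryAdd(string url)
    {
        return seen.Add(url);
    }

    public int Count
    {
        get { return seen.Count; }
    }
}
```

Keep in mind that the set holds every URL in memory for the whole run, which is the price of the duplicate check.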

That’s pretty much it, please let me know if this was helpful to anyone.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


Written By
Founder
United States
My name is Leland Richardson. I love learning. At the time of writing this I am 23 years old and live in Houston, TX. I was born in West Palm Beach, Florida, grew up in St. Louis, Missouri, and went to school in Houston, Texas at Rice University.

At Rice I received two degrees: one in Physics and one in Mathematics. I love both. I never received any formal education on Computer Science, however, you will find that most of this blog will be about programming and web development. Nevertheless, I think being a good programmer is about being good at learning, and thinking logically about how to solve problems - of which I think my educational background has more than covered.

Since high-school, I had found that the easiest way to make money was by programming. Programming started off as a hobby and small interest, and slowly grew into a passion.

I have recently started working on a new startup here in Houston, TX. I won't bore you with the details of that just yet, but I am very excited about it and I think we can do big things. We plan to launch our project this year at SXSW 2013. What I will say for now is that we would like to create a company of talented software developers who are similarly ambitious and want to create cool stuff (and have fun doing it).
