This is a short piece of code that generates an XML Sitemap compatible with Google Sitemaps (Protocol Version 0.9). It can be used with Google Webmaster Tools to help Google index your website, and hopefully get you a bit more traffic.
This code also demonstrates how to change the Content-Type of a served web page (in this case, how to make the output of a PHP file be treated as XML).
Google uses Sitemaps to learn about your website. A Sitemap contains a list of URLs and, optionally, information on when each page was last updated, how often it changes, and how important it is relative to the other pages on your site. If your website isn't linked together very thoroughly (by design or otherwise), Sitemaps let Googlebot (and other search engines' bots, I believe) find pages that would normally be difficult to get at. See Google's own description of the files for more information.
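To make the format concrete, here is a minimal sketch of a one-entry Sitemap built in PHP. The element names come from the Sitemap 0.9 protocol; the helper name sitemapEntry, the example.com URL and the sample values are made up for illustration and are not part of the attached code.

```php
<?php
// Hypothetical helper: builds one <url> entry for a Sitemap.
// Only <loc> is required by the protocol; the other three are optional.
function sitemapEntry(string $loc, string $lastmod, string $changefreq, string $priority): string
{
    return "  <url>\n"
         . '    <loc>' . htmlspecialchars($loc, ENT_XML1 | ENT_QUOTES) . "</loc>\n"
         . "    <lastmod>$lastmod</lastmod>\n"
         . "    <changefreq>$changefreq</changefreq>\n"
         . "    <priority>$priority</priority>\n"
         . "  </url>\n";
}

// Wrap the entry in the <urlset> root element with the 0.9 namespace.
$sitemap = '<?xml version="1.0" encoding="UTF-8"?>' . "\n"
         . '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">' . "\n"
         . sitemapEntry('http://www.example.com/index.htm', '2008-01-15', 'monthly', '0.8')
         . "</urlset>\n";

echo $sitemap;
```

The URL is escaped because characters such as & are not allowed unescaped in XML.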
The files can be created by hand: they're only simple XML files and the protocol is well documented. That can become a pain, though, if you change your site around or frequently add new pages (as on many blogs). Also, why do something by hand when it's not very difficult to get a computer to do it for you?
Using the Code
The source code attached to this article is pretty much ready to go. All you should need to do is change a few variable values and it will create the Sitemap. First I'll explain how to set up the code. Then, if you want to know how it works, I'll move on to explaining.
The code's behaviour is controlled by changing the following variables:
// Set this to true to ignore all files beginning with .
$respectUnixHidden = true;
// Modify this to ignore specific files, delimited by ';'
$filesToIgnore = 'sitemap.php;error_log';
// Modify this to ignore all files with specific file extensions,
// delimited by ';'
$extensionsToIgnore = 'xml;css;txt';
// This is added in front of all the found filenames to produce URLs instead
// of local filenames. For example, if $rootUrl is http://www.freetools.org.uk/,
// index.htm will become http://www.freetools.org.uk/index.htm
$rootUrl = 'http://www.freetools.org.uk/';
$respectUnixHidden - This makes the code ignore files beginning with '.'. On a Unix system, these are treated as hidden files, common examples including .htaccess. You don't usually want Google indexing these files.
$filesToIgnore - This should be set to a semicolon-delimited list of specific files to ignore. These could include the Sitemap file itself, as well as logs and similar files.
$extensionsToIgnore - This lets you ignore all files ending with specific extensions. In this case, the code will ignore any .xml, .css or .txt files. In this code, the extension is defined as the bit of the filename after the last full-stop.
$rootUrl - This should be set to the URL of the directory in which the sitemap script sits, ending with a slash as in the example value. It is added to the beginning of every filename listed to give a full URL.
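As a rough sketch of how these settings could be applied to a list of filenames (the attached code may well do it differently), the three filters can be folded into one function. The name shouldIgnore and the sample file list are hypothetical.

```php
<?php
// Hypothetical helper: decides whether a filename should be left out of
// the Sitemap, using the three filter settings described above.
function shouldIgnore(string $file, bool $respectUnixHidden, string $filesToIgnore, string $extensionsToIgnore): bool
{
    // Hidden files on Unix start with a dot (e.g. .htaccess).
    if ($respectUnixHidden && $file !== '' && $file[0] === '.') {
        return true;
    }
    // Exact filename matches from the ';'-delimited list.
    if (in_array($file, explode(';', $filesToIgnore), true)) {
        return true;
    }
    // The extension is whatever follows the last full stop, if there is one.
    $dot = strrpos($file, '.');
    if ($dot !== false) {
        $extension = substr($file, $dot + 1);
        if (in_array($extension, explode(';', $extensionsToIgnore), true)) {
            return true;
        }
    }
    return false;
}

// Sample filenames, made up for this example.
$files = ['index.htm', '.htaccess', 'sitemap.php', 'style.css', 'about.php'];
$kept = array_values(array_filter($files, function ($f) {
    return !shouldIgnore($f, true, 'sitemap.php;error_log', 'xml;css;txt');
}));
print_r($kept);
```

With the settings shown earlier, only index.htm and about.php survive the filters. Note that the comparisons in this sketch are case-sensitive, so STYLE.CSS would not be caught by 'css'.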
If you put your modified copy of the file into the same directory as the rest of your website's files, it should now work, listing the URLs of all the files that aren't filtered out, along with their last modified dates.
Note: This code will only list the files in the same directory as the code itself, not anything in any sub-directories. To get around this, you can put a copy in each sub-directory and then use a Sitemap index file to 'join' them all together (See Google's Sitemap Definition). Alternatively, you can modify the code to fix this problem (or I will if there's enough demand and I have time). If you do this, please do share.
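For anyone who wants to try the second route, one possible starting point is PHP's built-in recursive directory iterators. This is only a sketch under my own assumptions: listUrlsRecursive is a hypothetical name, the filters are left out for brevity, and filenames are not URL-encoded.

```php
<?php
// Hypothetical helper: returns a URL for every file under $dir,
// including files in sub-directories.
function listUrlsRecursive(string $dir, string $rootUrl): array
{
    $dir = rtrim($dir, '/\\');
    $urls = [];
    $iterator = new RecursiveIteratorIterator(
        new RecursiveDirectoryIterator($dir, FilesystemIterator::SKIP_DOTS)
    );
    foreach ($iterator as $file) {
        if (!$file->isFile()) {
            continue;
        }
        // Path relative to $dir, with forward slashes for use in a URL.
        $relative = substr($file->getPathname(), strlen($dir) + 1);
        $urls[] = $rootUrl . str_replace(DIRECTORY_SEPARATOR, '/', $relative);
    }
    sort($urls);
    return $urls;
}
```

Calling listUrlsRecursive(__DIR__, $rootUrl) from the sitemap script would then cover the whole tree below it, so a single Sitemap file could replace the per-directory copies.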
Points of Interest
Defining Content Type
The Sitemap Protocol this code uses relies on XML files, which usually have the extension .xml. This code relies on PHP, and so usually has to have a PHP extension. This can potentially be a problem, as the file might not be recognised as an XML file, even though the output is valid XML. In fact, even if you did have the .xml extension, and your server was set up to treat it as a PHP file, this still probably wouldn't work. As far as the server is concerned, you're sending an HTML page to someone (PHP's default Content-Type is text/html), so that's what it tells that someone they're receiving.
This can be fixed by setting the Content-Type in the file's HTTP header. Explaining the nature of the Header is out of the scope of this article, but the problem can be solved with the following line:
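A call along these lines does the job. The text/xml type is my assumption here; application/xml is the other common choice for XML documents.

```php
<?php
// Must run before any output is sent to the client.
// text/xml is an assumed value; application/xml is also widely used.
header('Content-Type: text/xml');
```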
This essentially tells any program that downloads the file that it contains XML. It is important that this line appears before anything that is sent to the client, as after that it will have no effect. For example, this wouldn't work:
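A sketch of the failing case, assuming output buffering is switched off so that the echo goes straight to the client:

```php
<?php
echo "<!-- generated sitemap -->\n";
// The line above has already been sent, and the headers went with it,
// so this call comes too late. PHP normally reports a
// "headers already sent" warning and the Content-Type is left unchanged.
header('Content-Type: text/xml');
```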
And neither would this:
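Again a sketch, with output buffering assumed to be off. The first line of the file matters here:

```php
 <?php
// Everything outside the PHP tags is sent to the client as-is,
// including the single space at the very start of this file.
header('Content-Type: text/xml');
```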
This second example is much more subtle. If you look closely, there is a space before the opening <?php tag. This will be sent to the client, making it too late to send the header. This is a relatively common, and extremely annoying, mistake.