Creating multi-page gzip-compressed sitemap with sitemap index for Google Webmaster Tools

Andrew Zan, 28 Nov 2012, CPOL
This article is intended for people who are planning to develop a sitemap for a site with a large number of pages.

Introduction

A sitemap is an important part of your site, and a single sitemap file may list at most 50,000 URLs, so sites with more than 50,000 pages require a multi-page sitemap. Creating such a sitemap can be a challenge, especially when running on GoDaddy hosting, which is quite limiting.

Background

This article is intended for any audience that finds it useful. SQL knowledge is required in order to write the two SQL queries.

Using the code

The code is quite simple. The main idea is to calculate the total number of sitemap pages and to create one page at a time using a database offset (limiting the number of rows for each SQL request). Two files are included: generateSitemap.php and dbc.php. generateSitemap.php holds the logic, and dbc.php handles the database work. Two queries are needed: the first gets the total number of products designated for the sitemap, and the second returns the sitemap links for one page of products. Both queries are hardcoded in dbc.php.
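
The paging arithmetic can be sketched on its own, without a database. This is a minimal illustration, assuming PHP 7 or later for the type hints; the helper names sitemapPageCount and sitemapPageOffset are hypothetical and do not appear in the article's files.

```php
<?php
// Hypothetical helpers (not part of dbc.php) illustrating the paging arithmetic.

// Number of sitemap files needed for $totalRows rows at $perFile rows per file.
// Rounds up so a partial last page is counted; always at least one file.
function sitemapPageCount(int $totalRows, int $perFile): int
{
    return max(1, (int) ceil($totalRows / $perFile));
}

// Row offset to bind to "LIMIT ? OFFSET ?" for a zero-based page number.
function sitemapPageOffset(int $page, int $perFile): int
{
    return $page * $perFile;
}

$perFile = 30000;
$pages = sitemapPageCount(70000, $perFile); // 3 files for 70,000 rows

for ($page = 0; $page < $pages; $page++) {
    printf(
        "sitemap_%d.xml.gz: LIMIT %d OFFSET %d\n",
        $page,
        $perFile,
        sitemapPageOffset($page, $perFile)
    );
}
```

Pages are numbered from 0, so the last page with data is the page count minus one.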

dbc.php:

<?php
class dbc {

    public $dbserver = 'SERVER';
    public $dbusername = 'USERNAME';
    public $dbpassword = 'PASSWORD';
    public $dbname = 'DATABASE NAME';

    function openDb() {
        try {
            $db = new PDO('mysql:host=' . $this->dbserver . ';dbname=' .
              $this->dbname . ';charset=utf8', $this->dbusername, $this->dbpassword);
        } catch (PDOException $e) {
            die("error, please try again");
        }
        return $db;
    }

    function getTotalProductsInDatabase($recordsPerSiteMapFile) {
        $query = "SELECT count(*) as cnt FROM products";
        $dba = $this->openDb();
        $stmt = $dba->prepare($query);
        $stmt->execute();
        $row = $stmt->fetch();
        $dba = null;
        unset($dba);
        unset($stmt);
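        //return the total number of sitemap pages, rounding up so that a partial
        //last page is counted and an exact multiple does not add an empty page
        return max(1, (int) ceil($row['cnt'] / $recordsPerSiteMapFile));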
    }

    function getProductsForSitemapFileNumber($recordsPerSiteMapFile, $offset) {
        // Query that returns one column ("description") holding the sitemap links
        // for one sitemap file; the caller loops over the rows to build that file.
        // The query must end with "limit ? OFFSET ?": with a large number of records,
        // the table has to be read in chunks. Consider adding an ORDER BY on a unique
        // column so that rows cannot shift between pages.
        $query = "(select product_links as description from products limit ? OFFSET ?)";
        $dba = $this->openDb();
        $stmt = $dba->prepare($query);
        $stmt->bindValue(1, $recordsPerSiteMapFile, PDO::PARAM_INT);
        $stmt->bindValue(2, $offset * $recordsPerSiteMapFile, PDO::PARAM_INT);
        $stmt->execute();
        $rows = $stmt->fetchAll(PDO::FETCH_ASSOC);
        $dba = null;
        unset($dba);
        unset($stmt);
        return $rows;
    }
}
?>

The code below is the core logic. In brief: read the current page number from the query string, starting at 0 if none is specified. Get the links for the current page from the database (using the offset). Loop over the links and add each one to the sitemap. Save the sitemap as an XML file with gzip compression. Then either redirect to the next sitemap page or, on the last page, finish by creating the sitemap index file.
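
The "build one page and gzip it" step can be tried in isolation. The sketch below is not the article's script: buildGzippedSitemap is a hypothetical helper, the two URLs are made up, and it assumes the dom and zlib extensions are enabled. It uses createTextNode so that characters such as "&" are escaped by the DOM library itself.

```php
<?php
// Hypothetical helper: build one gzip-compressed sitemap page from a list of URLs.
function buildGzippedSitemap(array $urls): string
{
    $doc = new DOMDocument('1.0', 'UTF-8');
    $doc->formatOutput = true;

    $urlset = $doc->appendChild($doc->createElement('urlset'));
    // Namespace defined by the sitemap protocol.
    $urlset->setAttribute('xmlns', 'http://www.sitemaps.org/schemas/sitemap/0.9');

    foreach ($urls as $url) {
        $entry = $urlset->appendChild($doc->createElement('url'));
        // A text node is escaped on output, so "&" is written as "&amp;".
        $loc = $entry->appendChild($doc->createElement('loc'));
        $loc->appendChild($doc->createTextNode($url));
        $entry->appendChild($doc->createElement('priority', '0.5'));
    }

    // Compress the serialized XML at the highest gzip level.
    return gzencode($doc->saveXML(), 9);
}

// Example URLs are placeholders, not real pages.
$gz = buildGzippedSitemap([
    'http://www.sitename.com/red-widget',
    'http://www.sitename.com/item?id=1&ref=a',
]);

// Decompress again to inspect what a crawler would read.
$xml = gzdecode($gz);
echo $xml;
```

Writing $gz to disk with file_put_contents, as the script below does, yields a file that can be submitted as a .xml.gz sitemap.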

generateSitemap.php:

<?php
ini_set('display_errors', true);
error_reporting(E_ALL);
ini_set('memory_limit', '-1');
set_time_limit(0);
require 'dbc.php';

$db = new dbc();

//number of records per sitemap file: no more than 50,000, and the file size should not be over 10 MB
$recordsPerSiteMapFile = 30000;
$SERVER_NAME = "http://www.sitename.com/";
$rootPath = "/home/content/sitename.com/sitemaps/";
//the "sitemaps/" URL subdirectory is not declared here; it is hardcoded in createSiteMapIndexFile()
//page number from the browser address, cast to int because it is used in the file name;
//if no page is specified, 0 is assumed (a new run)
$currentPage = (int) getanyValue('page');
$amountOfPages = ($db->getTotalProductsInDatabase($recordsPerSiteMapFile)); //how many total sitemap pages are there
header("Content-type: text/html; charset=utf-8");

//Start making the XML file for the current sitemap page
$xmlDoc = new DOMDocument();
$root = $xmlDoc->appendChild(
        $xmlDoc->createElement("urlset"));
//namespace defined by the sitemap protocol (sitemaps.org)
$root->setAttribute("xmlns", "http://www.sitemaps.org/schemas/sitemap/0.9");

//get the records from the database for the current sitemap offset
//each row contains only 1 column, "description"; its value goes into the sitemap
$currentSitemapPageRows = ($db->getProductsForSitemapFileNumber($recordsPerSiteMapFile, $currentPage));

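//loop over each link and add it to the sitemap file
foreach ($currentSitemapPageRows as $row) {
    //square brackets: the curly-brace form $row{'description'} was removed in PHP 8
    $final_url = $SERVER_NAME . fixSymbols(getUrlFriendlyString($row['description']));
    $tutTag = $root->appendChild(
            $xmlDoc->createElement("url"));
    //escape XML special characters only; htmlentities() can emit HTML-only entities that are invalid in XML
    $tutTag->appendChild(
            $xmlDoc->createElement("loc", htmlspecialchars($final_url, ENT_XML1, 'UTF-8')));
    $tutTag->appendChild(
            $xmlDoc->createElement("priority", "0.5"));
}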
//sitemap file name
$fname = "sitemap_" . $currentPage . ".xml.gz";

$xmlDoc->formatOutput = true;
$theOutput = gzencode($xmlDoc->saveXML(), 9);

//create archive with the sitemap page
file_put_contents($rootPath . $fname, $theOutput);

unset($xmlDoc);
unset($currentSitemapPageRows);
unset($theOutput);
unset($tutTag);

//if the current page is the last page, create the sitemap index file.
//Otherwise, create the next sitemap file (redirect to this script with the next page number).
//Pages are numbered from 0, so the last page is $amountOfPages - 1.
if ($currentPage >= $amountOfPages - 1) {
    createSiteMapIndexFile($amountOfPages, $SERVER_NAME, $rootPath);
} else {
    ?>
    <script type="text/javascript">
        <!--
        window.location = "generateSitemap.php?page=<?php echo ($currentPage + 1); ?>"
        //-->
    </script>
    <?php
}

function createSiteMapIndexFile($totalPages, $SERVER_NAME, $rootPath) {
    //create a new XML sitemap index document listing every sitemap file (so Google can find all pages)
    $xmlDocIndex = new DOMDocument();
    $rootIndex = $xmlDocIndex->appendChild(
            $xmlDocIndex->createElement("sitemapindex"));

    //namespace defined by the sitemap protocol (sitemaps.org)
    $rootIndex->setAttribute("xmlns", "http://www.sitemaps.org/schemas/sitemap/0.9");

    //files are numbered 0 .. $totalPages - 1
    for ($i = 0; $i < $totalPages; $i++) {
        $fname = "sitemap_" . $i . ".xml.gz";
        $tutTag2 = $rootIndex->appendChild(
                $xmlDocIndex->createElement("sitemap"));
        $tutTag2->appendChild(
                $xmlDocIndex->createElement("loc", $SERVER_NAME . "sitemaps/" . $fname));
        $tutTag2->appendChild(
                $xmlDocIndex->createElement("lastmod", date('Y-m-d')));
    }

    //now we save the sitemapindex.xml
    $xmlDocIndex->formatOutput = true;
    $xmlDocIndex->save($rootPath . "sitemapindex.xml");

    echo '<a href="sitemaps/sitemapindex.xml">View Sitemap Index</a>';
}

function getUrlFriendlyString($str) {
    // convert spaces to '-', remove characters that are not alphanumeric
    // or a '-', combine multiple dashes (i.e., '---') into one dash '-'.
    $str = preg_replace("/[-]+/", "-", preg_replace("/[^a-z0-9-]/", 
      "", strtolower(str_replace(" ", "-", $str))));
    return $str;
}

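function getanyValue($param) {
    //cast to int so that arbitrary query-string text never reaches the file name or the SQL offset
    if (isset($_GET[$param])) {
        return (int) $_GET[$param];
    } else {
        return 0;
    }
}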

function fixSymbols($str) {
    //put your additional logic here if you need to pre-process your site links before adding them to the sitemap page
    return $str;
}
?>
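
To see what getUrlFriendlyString does to typical product names, the function can be run on a couple of sample strings. The sketch below repeats the same three steps so that it runs on its own; the sample inputs are made up, and the source file is assumed to be saved as UTF-8.

```php
<?php
// Same logic as getUrlFriendlyString() above, written step by step
// so the example runs without the rest of the script.
function getUrlFriendlyString($str) {
    $str = str_replace(" ", "-", $str);            // spaces become dashes
    $str = strtolower($str);                       // lowercase ASCII letters
    $str = preg_replace("/[^a-z0-9-]/", "", $str); // keep only a-z, 0-9 and '-'
    return preg_replace("/-+/", "-", $str);        // collapse runs of dashes
}

echo getUrlFriendlyString("Red  Widget -- 2012!"), "\n"; // red-widget-2012
echo getUrlFriendlyString("Café Crème"), "\n";           // caf-crme
```

Note that accented and other non-ASCII letters are dropped rather than transliterated; fixSymbols() is the place to add such handling if the site's real URLs need it.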

Points of Interest

The script is useful in memory- and resource-limited environments. Each sitemap page is generated in its own request, so the time-consuming work runs in batches instead of in one long request.

History

11/28/2012 - First release.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

About the Author

Andrew Zan

United States
Programming for fun using C#, Java, JSP, Servlets and PHP.

Last Updated 28 Nov 2012
Article Copyright 2012 by Andrew Zan