Google Site Map Crawler

By Summer_son

Console application that checks all URLs listed in a sitemap.xml file
C# (C#1.0, C#2.0, C#3.0), Windows (Win2K, WinXP, Win2003, Vista, TabletPC, Embedded), Win32, CEO
Posted: 13 Dec 2007

Introduction

Have you ever thought of trying to validate each URL listed in your sitemap file?

Background

I have a site with dynamically generated page links. Each link is derived from the page title, which can be any combination of letters, numbers, and symbols. The site removes all forbidden characters from the title before generating the URL, then trims and shortens it a bit; however, errors still occur from time to time. For example, because of the specifics of my URL conversion, a page with the title ''...IS_BROKEN'' ''' gets the following URL: /.IS_BROKEN+ The site's database holds thousands of pages, so I clearly cannot verify each page by hand.

From the list of dynamically generated URLs I generate a sitemap.xml file that contains all of the site's pages. Each time the map file is generated, I need to ensure that it contains no repeated items (which can happen when different pages have the same title) and that every URL is accessible, i.e. does not produce a bad request, a 404, or any similar error.
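
The duplicate check described above could look roughly like the following. This is only a sketch, not the article's actual code: the class name SitemapDuplicates and both method names are made up for illustration, and it assumes each page address sits in a <loc> element inside a <url> element. The XPath matches on local names so that it does not depend on which sitemap namespace the file declares.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

// Hypothetical helper; the names here are illustrative, not from the article.
public static class SitemapDuplicates
{
    // Returns the text of every <url>/<loc> element in document order.
    public static List<string> ReadLocations(XmlDocument doc)
    {
        List<string> urls = new List<string>();
        // local-name() ignores the namespace, so any sitemap schema version matches.
        foreach (XmlNode node in doc.SelectNodes("//*[local-name()='url']/*[local-name()='loc']"))
        {
            urls.Add(node.InnerText.Trim());
        }
        return urls;
    }

    // Returns each URL that appears more than once, reported a single time.
    public static List<string> FindDuplicates(IEnumerable<string> urls)
    {
        HashSet<string> seen = new HashSet<string>(StringComparer.Ordinal);
        HashSet<string> reported = new HashSet<string>(StringComparer.Ordinal);
        List<string> duplicates = new List<string>();
        foreach (string url in urls)
        {
            // Add returns false when the value is already in the set.
            if (!seen.Add(url) && reported.Add(url))
            {
                duplicates.Add(url);
            }
        }
        return duplicates;
    }
}
```

To use it, load the file with XmlDocument.Load, pass the document to ReadLocations, and hand the resulting list to FindDuplicates. The comparison is ordinal and case-sensitive, which is an assumption; a site that treats URLs case-insensitively may want a different comparer.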

So I created a C# program that walks through each URL listed in the sitemap.xml file and tries to access it. It logs every error that occurs to an output file, so it's easy to track down problem pages.

I use the XmlDocument class to load sitemap.xml, and the WebRequest and WebResponse classes to determine whether each URL exists.
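
A minimal sketch of how these classes might fit together is shown below. It is not the original program: the class name SitemapChecker, the CheckUrl helper, the command-line arguments, the 15-second timeout, and the tab-separated log format are all assumptions made for illustration. Note that WebRequest.Create is marked obsolete in recent .NET versions, although it still works.

```csharp
using System;
using System.IO;
using System.Net;
using System.Xml;

// Hypothetical console program; names and log format are illustrative only.
public static class SitemapChecker
{
    // Returns null when the URL responds successfully, otherwise a short error description.
    public static string CheckUrl(string url)
    {
        try
        {
            WebRequest request = WebRequest.Create(url);
            request.Timeout = 15000; // milliseconds; an arbitrary value for this sketch
            using (WebResponse response = request.GetResponse())
            {
                // For HTTP, GetResponse throws on 4xx/5xx, so reaching here means success.
                return null;
            }
        }
        catch (WebException ex)
        {
            HttpWebResponse http = ex.Response as HttpWebResponse;
            if (http != null)
            {
                string status = (int)http.StatusCode + " " + http.StatusDescription;
                http.Close();
                return status;
            }
            return ex.Status.ToString(); // e.g. Timeout or NameResolutionFailure
        }
        catch (UriFormatException ex)
        {
            return "Malformed URL: " + ex.Message;
        }
        catch (NotSupportedException ex)
        {
            return "Unsupported scheme: " + ex.Message;
        }
    }

    public static int Main(string[] args)
    {
        if (args.Length < 2)
        {
            Console.Error.WriteLine("usage: SitemapChecker <sitemap.xml> <errors.log>");
            return 2;
        }

        XmlDocument doc = new XmlDocument();
        doc.Load(args[0]);

        int failures = 0;
        using (StreamWriter log = new StreamWriter(args[1]))
        {
            // Match <loc> by local name so the sitemap namespace does not matter.
            foreach (XmlNode node in doc.SelectNodes("//*[local-name()='loc']"))
            {
                string url = node.InnerText.Trim();
                string error = CheckUrl(url);
                if (error != null)
                {
                    failures++;
                    log.WriteLine(url + "\t" + error);
                }
            }
        }

        Console.WriteLine(failures + " problem URL(s) logged to " + args[1]);
        return failures == 0 ? 0 : 1;
    }
}
```

Requests are issued one at a time, which keeps the sketch simple but will be slow for a sitemap with thousands of entries. Redirects are followed automatically by default, so a URL that redirects to a working page is counted as accessible.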

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

About the Author

Summer_son

Location: Ukraine

Last Updated: 13 Dec 2007
Copyright 2007 by Summer_son