Posted 6 Mar 2014
MIT

A Simple and Powerful Library to Deal with Web Robots Control Strategy

bluecurve01, 6 Mar 2014
How to parse robots.txt and robots meta tag

Introduction

In this tip, I'll present my library, WWW RobotRules (https://robotrules.codeplex.com/). It is a simple library that parses robots.txt files and robots meta tags. The library follows RFC 1808 (relative URLs) and RFC 1945 (HTTP/1.0).

Using the Code

Configuration

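  • RobotRulesUseCache: Boolean, activates or deactivates cache support
  • RobotRulesCacheLibrary: type definition string of the cache implementation, optional if RobotRulesUseCache is False
  • RobotRulesCacheTimeout: time span string such as 00:01:00, specifies the cache timeout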
<?xml version="1.0" encoding="utf-8" ?>
<configuration>
  <appSettings>
    <add key="RobotRulesUseCache" value="False"/>
    <add key="RobotRulesCacheLibrary" value="RobotRules.Cache.MemoryCache, RobotRules"/>
    <add key="RobotRulesCacheTimeout" value="00:01:00" />
  </appSettings>
</configuration>  
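
To make the three value formats concrete, here is a minimal sketch of how such settings could be interpreted. It is not the library's actual loading code, which the article does not show. A NameValueCollection stands in for ConfigurationManager.AppSettings, and the reading of RobotRulesCacheTimeout as a TimeSpan is an assumption based on the 00:01:00 sample value. The sketch is written as a C# 9+ top-level program so it runs as a single file.

```csharp
using System;
using System.Collections.Specialized;
using System.Globalization;

// Stand-in for ConfigurationManager.AppSettings (also a NameValueCollection).
var settings = new NameValueCollection
{
    { "RobotRulesUseCache", "False" },
    { "RobotRulesCacheLibrary", "RobotRules.Cache.MemoryCache, RobotRules" },
    { "RobotRulesCacheTimeout", "00:01:00" }
};

// bool.Parse accepts "True"/"False" in any letter case.
bool useCache = bool.Parse(settings["RobotRulesUseCache"]);

// Assumption: the timeout is a TimeSpan in hh:mm:ss form, so 00:01:00 is one minute.
TimeSpan timeout = TimeSpan.Parse(settings["RobotRulesCacheTimeout"], CultureInfo.InvariantCulture);

// "Namespace.Type, Assembly" is an assembly-qualified type name. Type.GetType
// returns null if the assembly cannot be loaded, so resolve it only when caching is on.
var cacheType = useCache ? Type.GetType(settings["RobotRulesCacheLibrary"]) : null;
string cacheName = cacheType == null ? "none" : cacheType.FullName;

Console.WriteLine($"useCache={useCache}, timeout={timeout}, cacheType={cacheName}");
```

With RobotRulesUseCache set to False, the cache type string is never resolved, which matches the note above that RobotRulesCacheLibrary is optional in that case.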

Use the Library

First, define a new parser with your robot user agent:

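using System;
using RobotRules;

// Declared as a field of your crawler class.
private RobotsFileParser RobotRules = new RobotsFileParser()
{
    // No semicolon inside an object initializer.
    LocalUserAgent = @"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
};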

Then, use it like this:

// Parse the robot rules for the site, then ask whether the URL may be crawled.
RobotRules.Parse(new Uri("http://blablabla.com"));
if (RobotRules.IsAllowed("Googlebot", new Uri("http://blablabla.com"))) {
   // your code ...
}
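
For readers who want to see what a check like IsAllowed involves, here is a simplified, self-contained sketch of robots.txt matching. It is not the library's implementation: the function name IsPathAllowed is hypothetical, and the sketch only handles User-agent and Disallow lines with plain prefix matching (no Allow lines, no wildcards). It prefers a group that names the robot and falls back to the "*" group otherwise.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: true when `path` is not disallowed for `userAgent`.
static bool IsPathAllowed(string robotsTxt, string userAgent, string path)
{
    var specific = new List<string>(); // Disallow prefixes from groups naming this robot
    var wildcard = new List<string>(); // Disallow prefixes from "User-agent: *" groups
    bool matchesAgent = false, matchesStar = false;
    bool specificGroupSeen = false, inAgentLines = false;

    foreach (var rawLine in robotsTxt.Split('\n'))
    {
        // Drop comments and surrounding whitespace (including '\r').
        var line = rawLine;
        int hash = line.IndexOf('#');
        if (hash >= 0) line = line.Substring(0, hash);
        line = line.Trim();

        int colon = line.IndexOf(':');
        if (colon < 0) continue; // blank or malformed line
        string field = line.Substring(0, colon).Trim().ToLowerInvariant();
        string value = line.Substring(colon + 1).Trim();

        if (field == "user-agent")
        {
            // Consecutive User-agent lines share one group; a new run starts a new group.
            if (!inAgentLines) { matchesAgent = false; matchesStar = false; }
            inAgentLines = true;
            if (value == "*") matchesStar = true;
            else if (value.Length > 0 &&
                     userAgent.IndexOf(value, StringComparison.OrdinalIgnoreCase) >= 0)
            {
                matchesAgent = true;
                specificGroupSeen = true;
            }
        }
        else
        {
            inAgentLines = false;
            // An empty Disallow value means "nothing is disallowed".
            if (field == "disallow" && value.Length > 0)
            {
                if (matchesAgent) specific.Add(value);
                if (matchesStar) wildcard.Add(value);
            }
        }
    }

    // A group that names the robot takes precedence over the "*" group.
    foreach (var prefix in specificGroupSeen ? specific : wildcard)
        if (path.StartsWith(prefix, StringComparison.Ordinal)) return false;
    return true;
}

string robots = "User-agent: *\nDisallow: /private/\n\nUser-agent: Googlebot\nDisallow: /tmp/\n";
Console.WriteLine(IsPathAllowed(robots, "Googlebot", "/tmp/a"));      // False
Console.WriteLine(IsPathAllowed(robots, "Googlebot", "/private/x")); // True
Console.WriteLine(IsPathAllowed(robots, "OtherBot", "/private/x"));  // False
```

A production parser also has to fetch the file, resolve relative URLs, and handle caching, which is what the library takes care of.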

This works for robots.txt, but what if the robot control rules are embedded in the HTML itself, as a robots meta tag?

Sample

<!DOCTYPE html>
 
<html lang="en" xmlns="http://www.w3.org/1999/xhtml">
<head>
    <meta charset="utf-8" />
    <title>Test</title>
    <meta name="robots" content="nofollow"/>
</head>
<body>
 
</body>
</html>

No problem: pass the page content to the library, like this:

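RobotsFileParser RobotRules = new RobotsFileParser()
{
    // Keep the verbatim string on one line; a line break would become part of the user agent.
    LocalUserAgent = @"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
};

// "HTML CONTENT" stands for the HTML source of the page being checked.
RobotControlStrategy strategy = RobotRules.CheckRobotControlStrategy("Googlebot", "HTML CONTENT");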

if (strategy.CanFollow)
{
    // your code
}
if (strategy.CanIndex)
{
    // your code
}
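
To illustrate how a robots meta tag maps onto the CanIndex and CanFollow flags, here is a small self-contained sketch. It is an assumption-laden illustration, not the library's code: the function name ReadRobotsMeta is hypothetical, the regular expression only handles the generic name="robots" tag with name written before content, and it ignores the user agent that CheckRobotControlStrategy accepts.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: derives index/follow permissions from <meta name="robots"> tags.
static (bool CanIndex, bool CanFollow) ReadRobotsMeta(string html)
{
    // Without any robots meta tag, both indexing and following are permitted.
    bool canIndex = true, canFollow = true;

    // Simplified pattern: quoted attribute values, name before content.
    var pattern = new Regex(
        "<meta\\s+name\\s*=\\s*[\"']robots[\"']\\s+content\\s*=\\s*[\"']([^\"']*)[\"']",
        RegexOptions.IgnoreCase);

    foreach (Match match in pattern.Matches(html))
    {
        // The content attribute is a comma-separated list of directives.
        foreach (var directive in match.Groups[1].Value.Split(','))
        {
            switch (directive.Trim().ToLowerInvariant())
            {
                case "noindex": canIndex = false; break;
                case "nofollow": canFollow = false; break;
                case "none": canIndex = false; canFollow = false; break; // noindex + nofollow
            }
        }
    }
    return (canIndex, canFollow);
}

var flags = ReadRobotsMeta("<head><meta name=\"robots\" content=\"nofollow\"/></head>");
Console.WriteLine($"CanIndex={flags.CanIndex}, CanFollow={flags.CanFollow}"); // CanIndex=True, CanFollow=False
```

For the sample page above, which declares content="nofollow", the page may be indexed but its links should not be followed.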

Points of Interest

  • Use MEF to load the cache plugin instead of reflection

History

  • V1 : 03/06/2014
  • V1.5.2.4
    • ICache now inherits from IDisposable
    • Fix cache initialization
    • RobotsFileParser is disposable
    • RobotsFileParser exposes the method ClearCache()
    • Add new configuration key RobotRulesCacheTimeout to specify cache timeout

License

This article, along with any associated source code and files, is licensed under The MIT License

About the Author

bluecurve01
Software Developer
France (Metropolitan)

Last Updated 6 Mar 2014
Article Copyright 2014 by bluecurve01