HTML Meta Tag Parser





Nov 27, 2006
This article describes a simple meta tag parser.
Introduction
This article shows a simple way to parse the meta tags in an HTML string. The class can return the meta data as HtmlMeta control objects (available from .NET 2.0 onward), or return a value indicating whether the meta data restricts a web crawler. It is a small rewrite of part of craigd's excellent code from the Searcharoo Too project. The purpose of the rewrite is to wrap the logic in a class so that the parsing method can be reused.
The meta tag is defined as follows (from the HTML 4.01 DTD):
<!ELEMENT META - O EMPTY -- generic metainformation>
<!ATTLIST META
%i18n; -- lang, dir, for use with content --
http-equiv NAME #IMPLIED -- HTTP response header name --
name NAME #IMPLIED -- metainformation name --
content CDATA #REQUIRED -- associated information --
scheme CDATA #IMPLIED -- select form of content --
>
The class consists of one enumeration and two static methods:
enum RobotHtmlMeta { None = 0, NoIndex, NoFollow, NoIndexNoFollow };
static List<HtmlMeta> Parse(string htmldata)
static RobotHtmlMeta ParseRobotMetaTags(string htmldata)
Parse parses the HTML string and creates a list of HtmlMeta objects.
ParseRobotMetaTags parses the HTML string (with the help of Parse) and returns an enumerated value indicating whether a robots meta tag was present and what it said.
Parsing relies on two regular expressions: the first finds the meta tags in the HTML, and the second finds the attributes within each meta tag. ParseRobotMetaTags does not use them directly; it works on the list returned by Parse.
Using the code
The class can be used straight away. The class methods are all static.
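As an illustration of how a caller might act on the value returned by ParseRobotMetaTags, here is a minimal sketch. It is not part of the original class: CrawlPolicyDemo, MayIndex and MayFollowLinks are hypothetical names, and the enum is redeclared only so that the snippet compiles on its own without System.Web.

```csharp
using System;

public static class CrawlPolicyDemo
{
    // Mirrors the enum declared in HtmlMetaParser so this snippet compiles alone.
    public enum RobotHtmlMeta { None = 0, NoIndex, NoFollow, NoIndexNoFollow }

    // Hypothetical helper: true when the page may be added to the index.
    public static bool MayIndex(RobotHtmlMeta directive)
    {
        return directive != RobotHtmlMeta.NoIndex
            && directive != RobotHtmlMeta.NoIndexNoFollow;
    }

    // Hypothetical helper: true when the links on the page may be followed.
    public static bool MayFollowLinks(RobotHtmlMeta directive)
    {
        return directive != RobotHtmlMeta.NoFollow
            && directive != RobotHtmlMeta.NoIndexNoFollow;
    }

    public static void Main()
    {
        // In a real crawler this value would come from
        // HtmlMetaParser.ParseRobotMetaTags(html); it is hard-coded here.
        RobotHtmlMeta directive = RobotHtmlMeta.NoFollow;

        Console.WriteLine("Index page:   " + MayIndex(directive));
        Console.WriteLine("Follow links: " + MayFollowLinks(directive));
    }
}
```

Keeping the decision logic in two small predicates means the crawler loop never has to compare against individual enum members.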
The entire code can be seen here:
using System;
using System.Collections.Generic;
using System.Text;
using System.Web;
using System.Text.RegularExpressions;
using System.Web.UI;
using System.Web.UI.HtmlControls;

namespace hwit.Parsers
{
    public class HtmlMetaParser
    {
        public enum RobotHtmlMeta { None = 0, NoIndex, NoFollow, NoIndexNoFollow };

        // Finds each complete <meta ...> tag together with its attribute list.
        // Built once and reused, because both patterns are constant.
        private static readonly Regex metaregex =
            new Regex(@"<meta\s*(?:(?:\b(\w|-)+\b\s*(?:=\s*(?:""[^""]*""|'" +
                      @"[^']*'|[^""'<> ]+)\s*)?)*)/?\s*>",
                      RegexOptions.IgnoreCase | RegexOptions.ExplicitCapture);

        // Finds one name=value pair inside a meta tag. The value may be
        // double-quoted, single-quoted or unquoted. The value group is matched
        // exactly once: repeating it would let an unquoted value swallow the
        // attributes (or the closing "/") that follow it.
        private static readonly Regex submetaregex =
            new Regex(@"(?<name>\b(\w|-)+\b)\s*=\s*" +
                      @"(""(?<value>[^""]*)""|'(?<value>[^']*)'|(?<value>[^""'<> ]+))",
                      RegexOptions.IgnoreCase | RegexOptions.ExplicitCapture);

        public static List<HtmlMeta> Parse(string htmldata)
        {
            List<HtmlMeta> MetaList = new List<HtmlMeta>();
            foreach (Match metamatch in metaregex.Matches(htmldata))
            {
                HtmlMeta mymeta = new HtmlMeta();
                bool hasContent = false;
                foreach (Match submetamatch in submetaregex.Matches(metamatch.Value))
                {
                    string name = submetamatch.Groups["name"].Value.ToLowerInvariant();
                    string value = submetamatch.Groups["value"].Value;

                    if ("http-equiv" == name)
                        mymeta.HttpEquiv = value;
                    // The name attribute is ignored once http-equiv has been seen.
                    if ("name" == name && mymeta.HttpEquiv == String.Empty)
                        mymeta.Name = value;
                    if ("scheme" == name)
                        mymeta.Scheme = value;
                    if ("content" == name)
                    {
                        mymeta.Content = value;
                        hasContent = true;
                    }
                }
                // Only tags with a content attribute are returned. Adding the
                // object here, after all attributes are read, avoids adding the
                // same tag twice.
                if (hasContent)
                    MetaList.Add(mymeta);
            }
            return MetaList;
        }

        public static RobotHtmlMeta ParseRobotMetaTags(string htmldata)
        {
            List<HtmlMeta> MetaList = HtmlMetaParser.Parse(htmldata);
            RobotHtmlMeta result = RobotHtmlMeta.None;
            foreach (HtmlMeta meta in MetaList)
            {
                // "robot" also matches "robots", so a single test is enough.
                if (meta.Name.ToLowerInvariant().IndexOf("robot") != -1)
                {
                    string content = meta.Content.ToLowerInvariant();
                    bool noindex = content.IndexOf("noindex") != -1;
                    bool nofollow = content.IndexOf("nofollow") != -1;

                    // The first robots tag that carries a directive wins.
                    if (noindex && nofollow)
                    {
                        result = RobotHtmlMeta.NoIndexNoFollow;
                        break;
                    }
                    if (noindex)
                    {
                        result = RobotHtmlMeta.NoIndex;
                        break;
                    }
                    if (nofollow)
                    {
                        result = RobotHtmlMeta.NoFollow;
                        break;
                    }
                }
            }
            return result;
        }
    }
}
The code is very simple, and my guess is that only the regular expressions could raise questions. The first regular expression finds all the meta tags, and the second extracts the attribute name/value pairs within each meta tag.
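The two-pass approach can be tried in isolation. The following is a minimal sketch rather than the class itself: it uses only System.Text.RegularExpressions, so it runs without a reference to System.Web. The patterns are adapted from the listing above, with the attribute pattern written to capture exactly one value per attribute. MetaRegexDemo and ExtractAttributes are names invented for this illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class MetaRegexDemo
{
    // Pass 1: "<meta", then any number of attributes (a name, optionally
    // followed by "=" and a double-quoted, single-quoted or bare value),
    // then an optional "/" and the closing ">".
    private static readonly Regex TagRegex = new Regex(
        @"<meta\s*(?:(?:\b(\w|-)+\b\s*(?:=\s*(?:""[^""]*""|'[^']*'|[^""'<> ]+)\s*)?)*)/?\s*>",
        RegexOptions.IgnoreCase | RegexOptions.ExplicitCapture);

    // Pass 2: one name=value pair. With ExplicitCapture only the named
    // groups "name" and "value" are captured.
    private static readonly Regex AttributeRegex = new Regex(
        @"(?<name>\b(\w|-)+\b)\s*=\s*" +
        @"(""(?<value>[^""]*)""|'(?<value>[^']*)'|(?<value>[^""'<> ]+))",
        RegexOptions.IgnoreCase | RegexOptions.ExplicitCapture);

    // Returns every attribute of every meta tag as a (name, value) pair,
    // with the attribute name lower-cased.
    public static List<KeyValuePair<string, string>> ExtractAttributes(string html)
    {
        var result = new List<KeyValuePair<string, string>>();
        foreach (Match tag in TagRegex.Matches(html))
        {
            foreach (Match attribute in AttributeRegex.Matches(tag.Value))
            {
                result.Add(new KeyValuePair<string, string>(
                    attribute.Groups["name"].Value.ToLowerInvariant(),
                    attribute.Groups["value"].Value));
            }
        }
        return result;
    }

    public static void Main()
    {
        string html =
            "<html><head><title>Demo</title>" +
            "<meta name=\"robots\" content=\"noindex, nofollow\">" +
            "<META HTTP-EQUIV='Content-Type' CONTENT=text/html />" +
            "</head><body></body></html>";

        foreach (KeyValuePair<string, string> pair in ExtractAttributes(html))
            Console.WriteLine(pair.Key + " = " + pair.Value);
    }
}
```

Running the second pass only on the text matched by the first pass keeps attribute-like text elsewhere in the page (for example in links or scripts) from being mistaken for meta attributes.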