HTML Meta Tag Parser

CodingDragon

Nov 27, 2006


This article describes a simple meta tag parser.

Introduction

This article shows a simple way to parse the meta tags in an HTML string. The class can return the meta data as HtmlMeta control objects (available from .NET 2.0 onward), or return a value indicating whether the meta data restricts a web crawler. It is a simple rewrite of part of craigd's excellent code from the Searcharoo Too project. The rewrite adds some structure (a class) and lets my own code reuse the parsing method.

The META element is declared in the HTML 4 DTD as follows:

<!ELEMENT META - O EMPTY   -- generic metainformation>
<!ATTLIST META
 %i18n;              -- lang, dir, for use with content --
 http-equiv  NAME    #IMPLIED  -- HTTP response header name  --
 name        NAME    #IMPLIED  -- metainformation name --
 content     CDATA   #REQUIRED -- associated information --
 scheme      CDATA   #IMPLIED  -- select form of content --
 >

The class consists of one enumeration and two static methods:

enum RobotHtmlMeta { None = 0, NoIndex, NoFollow, NoIndexNoFollow };
static List<HtmlMeta> Parse(string htmldata)
static RobotHtmlMeta ParseRobotMetaTags(string htmldata)

Parse parses the HTML string and creates a list of HtmlMeta objects.
ParseRobotMetaTags parses the HTML string (with the help of Parse) and returns an enumerated value indicating if a robot meta tag was present and what it said.
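To make the return values concrete, here is a minimal sketch of how the content of a robots meta tag maps onto the enumeration. It applies the same substring checks as ParseRobotMetaTags but works on a plain string, so it runs without a reference to System.Web. The names RobotDirectiveSketch and FromContent are illustrative only and are not part of the class.

```csharp
using System;

public enum RobotHtmlMeta { None = 0, NoIndex, NoFollow, NoIndexNoFollow }

public static class RobotDirectiveSketch
{
    // Hypothetical helper: maps the content attribute of a robots meta tag
    // onto the enumeration using simple substring checks.
    public static RobotHtmlMeta FromContent(string content)
    {
        string lowered = (content ?? String.Empty).ToLowerInvariant();
        bool noIndex = lowered.Contains("noindex");
        bool noFollow = lowered.Contains("nofollow");

        if (noIndex && noFollow) return RobotHtmlMeta.NoIndexNoFollow;
        if (noIndex) return RobotHtmlMeta.NoIndex;
        if (noFollow) return RobotHtmlMeta.NoFollow;
        return RobotHtmlMeta.None;
    }

    public static void Main()
    {
        Console.WriteLine(FromContent("NOINDEX, NOFOLLOW")); // NoIndexNoFollow
        Console.WriteLine(FromContent("noindex"));           // NoIndex
        Console.WriteLine(FromContent("index, follow"));     // None
    }
}
```

Note that substring checks of this kind do not recognise the "none" directive, which crawlers commonly treat as noindex plus nofollow; a caller that needs it would have to add that case.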

Parse uses two regular expressions: the first finds each meta tag in the HTML, and the second finds the attributes within a meta tag. ParseRobotMetaTags relies on them indirectly, through Parse.
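The two-pass idea can be tried on its own with nothing but System.Text.RegularExpressions. The sketch below follows the article's patterns, with the value group matched once per attribute so that an unquoted value cannot absorb what follows it. TwoPassSketch and Extract are hypothetical names, and the result is a plain dictionary per tag rather than an HtmlMeta object.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class TwoPassSketch
{
    // Pass 1: one match per <meta ...> tag, attribute list included.
    private static readonly Regex TagRegex = new Regex(
        @"<meta\s*(?:(?:\b(\w|-)+\b\s*(?:=\s*(?:""[^""]*""|'[^']*'|[^""'<> ]+)\s*)?)*)/?\s*>",
        RegexOptions.IgnoreCase | RegexOptions.ExplicitCapture);

    // Pass 2: one match per name=value pair inside a single tag.
    private static readonly Regex AttributeRegex = new Regex(
        @"(?<name>\b(\w|-)+\b)\s*=\s*(""(?<value>[^""]*)""|'(?<value>[^']*)'|(?<value>[^""'<> ]+))",
        RegexOptions.IgnoreCase | RegexOptions.ExplicitCapture);

    // Hypothetical helper: returns one name -> value dictionary per meta tag.
    public static List<Dictionary<string, string>> Extract(string html)
    {
        List<Dictionary<string, string>> tags = new List<Dictionary<string, string>>();
        foreach (Match tag in TagRegex.Matches(html))
        {
            Dictionary<string, string> attributes =
                new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase);
            foreach (Match attribute in AttributeRegex.Matches(tag.Value))
            {
                attributes[attribute.Groups["name"].Value] = attribute.Groups["value"].Value;
            }
            tags.Add(attributes);
        }
        return tags;
    }

    public static void Main()
    {
        string html = "<head><title>t</title>" +
                      "<meta name=\"robots\" content=\"noindex,nofollow\">" +
                      "<meta http-equiv='refresh' content=30 /></head>";
        foreach (Dictionary<string, string> tag in Extract(html))
        {
            foreach (KeyValuePair<string, string> pair in tag)
                Console.WriteLine(pair.Key + " = " + pair.Value);
            Console.WriteLine("--");
        }
    }
}
```

Running the second pattern only against the text of a matched tag keeps attribute-like text elsewhere in the page, such as a query string in a link, from being picked up as meta data.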

Using the code

The class can be used straight away: all of its methods are static, so no instance is needed.

The entire code can be seen here:

using System;
using System.Collections.Generic;
using System.Text;
using System.Web;
using System.Text.RegularExpressions;
using System.Web.UI;
using System.Web.UI.HtmlControls;

namespace hwit.Parsers
{
    public class HtmlMetaParser
    {
        public enum RobotHtmlMeta { None = 0, NoIndex, NoFollow, 
                                    NoIndexNoFollow };
   
        public static List<HtmlMeta> Parse(string htmldata)
        {
            // First pass: match each <meta ...> tag together with its attribute list.
            Regex metaregex = 
                new Regex(@"<meta\s*(?:(?:\b(\w|-)+\b\s*(?:=\s*(?:""[^""]*""|'" +
                          @"[^']*'|[^""'<> ]+)\s*)?)*)/?\s*>", 
                          RegexOptions.IgnoreCase | RegexOptions.ExplicitCapture);

            // Second pass: match one name=value pair at a time inside a tag.
            // The value group is matched exactly once per attribute; repeating it
            // would let an unquoted value absorb the attributes that follow.
            // Built once here rather than once per tag.
            Regex submetaregex = 
                new Regex(@"(?<name>\b(\w|-)+\b)\s*=\s*(""(?<value>[^""]*)""" +
                          @"|'(?<value>[^']*)'|(?<value>[^""'<> ]+))",
                          RegexOptions.IgnoreCase | RegexOptions.ExplicitCapture);

            List<HtmlMeta> MetaList = new List<HtmlMeta>();
            foreach (Match metamatch in metaregex.Matches(htmldata))
            {
                HtmlMeta mymeta = new HtmlMeta();
                bool hasContent = false;

                foreach (Match submetamatch in submetaregex.Matches(metamatch.Value))
                {
                    string name = submetamatch.Groups["name"].Value.ToLowerInvariant();
                    string value = submetamatch.Groups["value"].Value;

                    if ("http-equiv" == name)
                        mymeta.HttpEquiv = value;

                    // A name is only recorded when no http-equiv has been seen yet.
                    if ("name" == name && String.IsNullOrEmpty(mymeta.HttpEquiv))
                        mymeta.Name = value;

                    if ("scheme" == name)
                        mymeta.Scheme = value;

                    if ("content" == name)
                    {
                        mymeta.Content = value;
                        hasContent = true;
                    }
                }

                // Add each tag at most once, and only if it has a content attribute.
                if (hasContent)
                    MetaList.Add(mymeta);
            }
            return MetaList;
        }

        public static RobotHtmlMeta ParseRobotMetaTags(string htmldata)
        {
            List<HtmlMeta> MetaList = HtmlMetaParser.Parse(htmldata);

            RobotHtmlMeta result = RobotHtmlMeta.None;
            foreach (HtmlMeta meta in MetaList)
            {
                // "robot" is a substring of "robots", so one check covers both.
                if (meta.Name.ToLowerInvariant().IndexOf("robot") != -1)
                {
                    string content = meta.Content.ToLowerInvariant();
                    if (content.IndexOf("noindex") != -1 && 
                        content.IndexOf("nofollow") != -1)
                    {
                        result = RobotHtmlMeta.NoIndexNoFollow;
                        break;
                    }
                    if(content.IndexOf("noindex") != -1)
                    {
                        result = RobotHtmlMeta.NoIndex;
                        break;
                    }
                    if (content.IndexOf("nofollow") != -1)
                    {
                        result = RobotHtmlMeta.NoFollow;
                        break;
                    }
                }
            }
            return result;
        }
    }
}

The code is simple, and my guess is that only the regular expressions will raise questions. The first regular expression finds all the meta tags, and the second finds the attributes within a single meta tag.
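For readers who find the attribute expression hard to scan, the sketch below spells out the same idea with RegexOptions.IgnorePatternWhitespace, which allows line breaks and # comments inside the pattern. It is an annotated variant, not the exact pattern from the listing: the unquoted alternative uses \s where the listing uses a literal space, and the class and method names are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class AnnotatedAttributeSketch
{
    private static readonly Regex AttributeRegex = new Regex(@"
        (?<name>\b(\w|-)+\b)         # attribute name: word characters and hyphens
        \s*=\s*                      # equals sign, optional whitespace on both sides
        (
            ""(?<value>[^""]*)""     # value in double quotes
          | '(?<value>[^']*)'        # value in single quotes
          | (?<value>[^""'<>\s]+)    # unquoted value, ends at whitespace or a bracket
        )",
        RegexOptions.IgnoreCase | RegexOptions.ExplicitCapture |
        RegexOptions.IgnorePatternWhitespace);

    // Hypothetical helper: returns the attributes of one tag as name -> value.
    public static Dictionary<string, string> Attributes(string tag)
    {
        Dictionary<string, string> result =
            new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase);
        foreach (Match m in AttributeRegex.Matches(tag))
        {
            // All three alternatives share the group name, so exactly one
            // of them supplies the value for each match.
            result[m.Groups["name"].Value] = m.Groups["value"].Value;
        }
        return result;
    }

    public static void Main()
    {
        string tag = "<meta http-equiv = 'refresh' content=30 scheme=\"x y\">";
        foreach (KeyValuePair<string, string> pair in Attributes(tag))
            Console.WriteLine(pair.Key + " = " + pair.Value);
    }
}
```

ExplicitCapture means that only the named groups, name and value, capture anything; the plain parentheses serve for grouping alone. The tag-finding expression is built from the same pieces, wrapped in a repeated group between the opening <meta and the closing bracket.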