HTML Meta Tag Parser

27 Nov 2006
This article describes a simple Meta tag parser.

Introduction

This article shows a simple way to parse the meta tags in an HTML string. The class can return the meta data as HtmlMeta control objects (available from .NET 2.0 onward), or return a value indicating whether the meta data restricts a web crawler. It is a simple rewrite of part of craigd's excellent code from the Searcharoo Too project. The rewrite wraps the logic in a class so that the parsing method can be reused.

The META element is declared as follows in the HTML 4.01 DTD:

XML
<!ELEMENT META - O EMPTY   -- generic metainformation>
<!ATTLIST META
 %i18n;              -- lang, dir, for use with content --
 http-equiv  NAME    #IMPLIED  -- HTTP response header name  --
 name        NAME    #IMPLIED  -- metainformation name --
 content     CDATA   #REQUIRED -- associated information --
 scheme      CDATA   #IMPLIED  -- select form of content --
 >

The class exposes an enumeration and two static methods:

C#
enum RobotHtmlMeta { None = 0, NoIndex, NoFollow, NoIndexNoFollow };
static List<HtmlMeta> Parse(string htmldata)
static RobotHtmlMeta ParseRobotMetaTags(string htmldata)
  • Parse parses the HTML string and creates a list of HtmlMeta objects.
  • ParseRobotMetaTags parses the HTML string (with the help of Parse) and returns an enumerated value indicating if a robot meta tag was present and what it said.
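The decision that ParseRobotMetaTags makes about a robots content value can be sketched on its own. This is a minimal illustration rather than the article's code: RobotDirective and RobotContentSketch.Classify are hypothetical names, and the sketch takes the content attribute directly instead of parsing HTML.

```csharp
using System;

// Hypothetical enum mirroring the four outcomes of the article's RobotHtmlMeta.
public enum RobotDirective { None = 0, NoIndex, NoFollow, NoIndexNoFollow }

public static class RobotContentSketch
{
    // Hypothetical helper: maps the content attribute of a robots meta tag
    // to one of the four outcomes, ignoring letter case.
    public static RobotDirective Classify(string content)
    {
        if (content == null) return RobotDirective.None;

        bool noIndex =
            content.IndexOf("noindex", StringComparison.OrdinalIgnoreCase) >= 0;
        bool noFollow =
            content.IndexOf("nofollow", StringComparison.OrdinalIgnoreCase) >= 0;

        if (noIndex && noFollow) return RobotDirective.NoIndexNoFollow;
        if (noIndex) return RobotDirective.NoIndex;
        if (noFollow) return RobotDirective.NoFollow;
        return RobotDirective.None;
    }
}
```

Under these assumptions, RobotContentSketch.Classify("NOINDEX, nofollow") yields NoIndexNoFollow, while a value such as "index, follow" yields None because it contains neither directive.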

Both methods rely on regular expressions: the first expression finds each meta tag in the HTML, and the second extracts the attributes within a tag.

Using the code

The class can be used as is. All of its methods are static, so no instance is needed. Because it returns HtmlMeta objects, the project must reference System.Web.

The entire code can be seen here:

C#
using System;
using System.Collections.Generic;
using System.Text;
using System.Web;
using System.Text.RegularExpressions;
using System.Web.UI;
using System.Web.UI.HtmlControls;

namespace hwit.Parsers
{
    public class HtmlMetaParser
    {
        public enum RobotHtmlMeta { None = 0, NoIndex, NoFollow, 
                                    NoIndexNoFollow };
   
        public static List<HtmlMeta> Parse(string htmldata)
        {
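            // Matches a complete <meta ...> tag with any number of attributes,
            // whether or not the tag is self-closed with "/>".
            Regex metaregex = 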
                new Regex(@"<meta\s*(?:(?:\b(\w|-)+\b\s*(?:=\s*(?:""[^""]*""|'" +
                          @"[^']*'|[^""'<> ]+)\s*)?)*)/?\s*>", 
                          RegexOptions.IgnoreCase | RegexOptions.ExplicitCapture);

            List<HtmlMeta> MetaList = new List<HtmlMeta>();
            foreach (Match metamatch in metaregex.Matches(htmldata))
            {
                HtmlMeta mymeta = new HtmlMeta();

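                // Matches one name=value attribute. The value may be double-quoted,
                // single-quoted, or unquoted. Each value is matched exactly once so
                // that consecutive unquoted attributes are not merged into one value.
                Regex submetaregex = 
                    new Regex(@"(?<name>\b(\w|-)+\b)\s*=\s*" +
                              @"(""(?<value>[^""]*)""|'(?<value>[^']*)'" + 
                              @"|(?<value>[^""'<> ]+))",
                              RegexOptions.IgnoreCase | 
                              RegexOptions.ExplicitCapture);

                foreach (Match submetamatch in 
                         submetaregex.Matches(metamatch.Value))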
                {
                    if ("http-equiv" == 
                          submetamatch.Groups["name"].ToString().ToLower())
                        mymeta.HttpEquiv = 
                          submetamatch.Groups["value"].ToString();

                    if (("name" == 
                         submetamatch.Groups["name"].ToString().ToLower())
                         && (mymeta.HttpEquiv == String.Empty))
                       mymeta.Name = submetamatch.Groups["value"].ToString();

                    if ("scheme" == 
                        submetamatch.Groups["name"].ToString().ToLower())
                        mymeta.Scheme = submetamatch.Groups["value"].ToString();

                    if ("content" == 
                        submetamatch.Groups["name"].ToString().ToLower())
                    {
                        mymeta.Content = submetamatch.Groups["value"].ToString();
                        MetaList.Add(mymeta);
                    }
                }
            }
            return MetaList;
        }

        public static RobotHtmlMeta ParseRobotMetaTags(string htmldata)
        {
            List<HtmlMeta> MetaList = HtmlMetaParser.Parse(htmldata);

            RobotHtmlMeta result = RobotHtmlMeta.None;
            foreach (HtmlMeta meta in MetaList)
            {
                // "robot" is a substring of "robots", so a single test covers both.
                if (meta.Name.ToLower().IndexOf("robot") != -1)
                {
                    string content = meta.Content.ToLower();
                    if (content.IndexOf("noindex") != -1 && 
                        content.IndexOf("nofollow") != -1)
                    {
                        result = RobotHtmlMeta.NoIndexNoFollow;
                        break;
                    }
                    if(content.IndexOf("noindex") != -1)
                    {
                        result = RobotHtmlMeta.NoIndex;
                        break;
                    }
                    if (content.IndexOf("nofollow") != -1)
                    {
                        result = RobotHtmlMeta.NoFollow;
                        break;
                    }
                }
            }
            return result;
        }
    }
}

The code is simple; the regular expressions are probably the only part that raises questions. The first regular expression finds all the meta tags, and the second finds the attributes within a single meta tag.
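To see the two expressions at work without a System.Web dependency, the following minimal sketch collects each tag's attributes into a dictionary instead of an HtmlMeta object. MetaRegexSketch and ParseMetaAttributes are hypothetical names. The patterns are adapted from the listing above, except that the attribute pattern matches one value per attribute, which keeps consecutive unquoted attributes separate.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class MetaRegexSketch
{
    // Pass 1: a complete <meta ...> tag, including all of its attributes.
    private static readonly Regex TagRegex = new Regex(
        @"<meta\s*(?:(?:\b(\w|-)+\b\s*(?:=\s*(?:""[^""]*""|'[^']*'|[^""'<> ]+)\s*)?)*)/?\s*>",
        RegexOptions.IgnoreCase | RegexOptions.ExplicitCapture);

    // Pass 2: a single name=value pair inside one tag.
    private static readonly Regex AttributeRegex = new Regex(
        @"(?<name>\b(\w|-)+\b)\s*=\s*" +
        @"(""(?<value>[^""]*)""|'(?<value>[^']*)'|(?<value>[^""'<> ]+))",
        RegexOptions.IgnoreCase | RegexOptions.ExplicitCapture);

    // Returns one dictionary per meta tag, keyed by lower-cased attribute name.
    public static List<Dictionary<string, string>> ParseMetaAttributes(string html)
    {
        List<Dictionary<string, string>> tags =
            new List<Dictionary<string, string>>();

        foreach (Match tag in TagRegex.Matches(html))
        {
            Dictionary<string, string> attributes =
                new Dictionary<string, string>();

            foreach (Match attribute in AttributeRegex.Matches(tag.Value))
            {
                string name = attribute.Groups["name"].Value.ToLowerInvariant();
                // The indexer overwrites a repeated attribute instead of throwing.
                attributes[name] = attribute.Groups["value"].Value;
            }
            tags.Add(attributes);
        }
        return tags;
    }
}
```

Keeping both Regex instances in static fields means they are constructed once rather than once per matched tag, which is a small design difference from the listing.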

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.



Written By
Web Developer
Denmark
The CodingDragon started programming in 1999 in Java but quickly turned to .NET in 2000 - and has never looked back since.
