
HTML Meta Tag Parser

CodingDragon, 27 Nov 2006
This article describes a simple Meta tag parser.

Introduction

This article shows a simple way to parse the meta tags in an HTML string. The class can return the meta data as HtmlMeta control objects (available from .NET 2.0 onward), or return a value indicating whether the meta data restricts a web crawler. It is a simple rewrite of part of craigd's excellent code from the Searcharoo Too project. The purpose of the rewrite is to give the code some structure (a class) and to make the parsing method reusable from my own code.

The META element is declared as follows in the HTML 4.01 DTD:

<!ELEMENT META - O EMPTY   -- generic metainformation>
<!ATTLIST META
 %i18n;              -- lang, dir, for use with content --
 http-equiv  NAME    #IMPLIED  -- HTTP response header name  --
 name        NAME    #IMPLIED  -- metainformation name --
 content     CDATA   #REQUIRED -- associated information --
 scheme      CDATA   #IMPLIED  -- select form of content --
 >

The class consists of one enumeration and two static methods:

enum RobotHtmlMeta { None = 0, NoIndex, NoFollow, NoIndexNoFollow };
static List<HtmlMeta> Parse(string htmldata)
static RobotHtmlMeta ParseRobotMetaTags(string htmldata)
  • Parse parses the HTML string and creates a list of HtmlMeta objects.
  • ParseRobotMetaTags parses the HTML string (with the help of Parse) and returns an enumerated value indicating if a robot meta tag was present and what it said.

Parse uses two regular expressions: the first finds each meta tag in the HTML, and the second extracts the attributes from within a meta tag. ParseRobotMetaTags calls Parse, so it relies on the same expressions indirectly.
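The two-pass idea can be tried without a reference to System.Web. The following is only a minimal sketch under that assumption: the class name MetaTagSketch and the method ParseToDictionaries are hypothetical and are not part of the article's class. It reuses the article's two patterns but returns plain dictionaries instead of HtmlMeta objects.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the article's class: it returns plain
// dictionaries instead of HtmlMeta objects, so it needs no System.Web reference.
public static class MetaTagSketch
{
    // Pass 1: locate every <meta ...> tag (same pattern as the article).
    static readonly Regex TagRegex = new Regex(
        @"<meta\s*(?:(?:\b(\w|-)+\b\s*(?:=\s*(?:""[^""]*""|'[^']*'|[^""'<> ]+)\s*)?)*)/?\s*>",
        RegexOptions.IgnoreCase | RegexOptions.ExplicitCapture);

    // Pass 2: pull name=value pairs out of a single tag (same pattern as the article).
    static readonly Regex AttrRegex = new Regex(
        @"(?<name>\b(\w|-)+\b)\s*=\s*(""(?<value>[^""]*)""|'(?<value>[^']*)'|(?<value>[^""'<> ]+)\s*)+",
        RegexOptions.IgnoreCase | RegexOptions.ExplicitCapture);

    public static List<Dictionary<string, string>> ParseToDictionaries(string html)
    {
        var result = new List<Dictionary<string, string>>();
        foreach (Match tag in TagRegex.Matches(html))
        {
            // Attribute names are looked up case-insensitively.
            var attrs = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase);
            foreach (Match attr in AttrRegex.Matches(tag.Value))
            {
                attrs[attr.Groups["name"].Value] = attr.Groups["value"].Value;
            }
            result.Add(attrs);
        }
        return result;
    }
}
```

Calling MetaTagSketch.ParseToDictionaries on a page yields one dictionary per meta tag, keyed by attribute name, which makes it easy to inspect what each pass has matched.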

Using the code

The class can be used straight away. The class methods are all static.

The entire code can be seen here:

using System;
using System.Collections.Generic;
using System.Text;
using System.Web;
using System.Text.RegularExpressions;
using System.Web.UI;
using System.Web.UI.HtmlControls;

namespace hwit.Parsers
{
    public class HtmlMetaParser
    {
        public enum RobotHtmlMeta { None = 0, NoIndex, NoFollow, 
                                    NoIndexNoFollow };
   
        public static List<HtmlMeta> Parse(string htmldata)
        {
            // Matches a complete <meta ...> tag with quoted or unquoted attribute values.
            Regex metaregex = 
                new Regex(@"<meta\s*(?:(?:\b(\w|-)+\b\s*(?:=\s*(?:""[^""]*""|'" +
                          @"[^']*'|[^""'<> ]+)\s*)?)*)/?\s*>", 
                          RegexOptions.IgnoreCase | RegexOptions.ExplicitCapture);

            // Matches a single name=value attribute inside a meta tag.
            // Built once here rather than once per matched tag.
            Regex submetaregex = 
                new Regex(@"(?<name>\b(\w|-)+\b)\s*=\s*" + 
                          @"(""(?<value>[^""]*)""|'(?<value>[^']*)'" + 
                          @"|(?<value>[^""'<> ]+)\s*)+",
                          RegexOptions.IgnoreCase | 
                          RegexOptions.ExplicitCapture);

            List<HtmlMeta> MetaList = new List<HtmlMeta>();
            foreach (Match metamatch in metaregex.Matches(htmldata))
            {
                HtmlMeta mymeta = new HtmlMeta();

                foreach (Match submetamatch in 
                         submetaregex.Matches(metamatch.Value))
                {
                    // Attribute names are case-insensitive in HTML, so normalise once.
                    string attrname = 
                        submetamatch.Groups["name"].Value.ToLowerInvariant();
                    string attrvalue = submetamatch.Groups["value"].Value;

                    if ("http-equiv" == attrname)
                        mymeta.HttpEquiv = attrvalue;

                    // Name is only set when no http-equiv has been seen so far in this tag.
                    if (("name" == attrname)
                         && String.IsNullOrEmpty(mymeta.HttpEquiv))
                        mymeta.Name = attrvalue;

                    if ("scheme" == attrname)
                        mymeta.Scheme = attrvalue;

                    // A tag is added to the list only when its content attribute is found.
                    if ("content" == attrname)
                    {
                        mymeta.Content = attrvalue;
                        MetaList.Add(mymeta);
                    }
                }
            }
            return MetaList;
        }

        public static RobotHtmlMeta ParseRobotMetaTags(string htmldata)
        {
            List<HtmlMeta> MetaList = HtmlMetaParser.Parse(htmldata);

            RobotHtmlMeta result = RobotHtmlMeta.None;
            foreach (HtmlMeta meta in MetaList)
            {
                // "robot" is a substring of "robots", so one test covers both spellings.
                if(meta.Name.ToLowerInvariant().IndexOf("robot") != -1){
                    string content = meta.Content.ToLowerInvariant();
                    if (content.IndexOf("noindex") != -1 && 
                        content.IndexOf("nofollow") != -1)
                    {
                        result = RobotHtmlMeta.NoIndexNoFollow;
                        break;
                    }
                    if(content.IndexOf("noindex") != -1)
                    {
                        result = RobotHtmlMeta.NoIndex;
                        break;
                    }
                    if (content.IndexOf("nofollow") != -1)
                    {
                        result = RobotHtmlMeta.NoFollow;
                        break;
                    }
                }
            }
            return result;
        }
    }
}

The code is very simple, and my guess is that only the regular expressions could raise questions. The first regular expression finds all the meta tags and the second finds the attribute list within a meta tag.
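For readers who find the two patterns hard to read, here is a sketch of the same expressions laid out with RegexOptions.IgnorePatternWhitespace so that each part can carry a comment. The class name AnnotatedMetaRegex and the field names Tag and Attribute are hypothetical. The literal space inside the character classes is written as \x20 here, which is an assumption made for readability and should match the same character.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical container for commented versions of the article's two patterns.
public static class AnnotatedMetaRegex
{
    const RegexOptions Options = RegexOptions.IgnoreCase |
                                 RegexOptions.ExplicitCapture |
                                 RegexOptions.IgnorePatternWhitespace;

    // First pattern: one complete meta tag.
    public static readonly Regex Tag = new Regex(@"
        <meta \s*
        (?:
            (?: \b (\w|-)+ \b \s*            # attribute name: word characters or hyphens
                (?: = \s*                    # optional =value part
                    (?: ""[^""]*""           # double-quoted value
                      | '[^']*'              # single-quoted value
                      | [^""'<>\x20]+        # unquoted value
                    )
                    \s*
                )?
            )*                               # any number of attributes
        )
        /? \s* >                             # optional self-closing slash, then end of tag
        ", Options);

    // Second pattern: one name=value pair inside a tag.
    public static readonly Regex Attribute = new Regex(@"
        (?<name> \b (\w|-)+ \b )             # attribute name, captured as name
        \s* = \s*                            # equals sign with optional whitespace
        (
            "" (?<value> [^""]* ) ""         # double-quoted value, quotes not captured
          | '  (?<value> [^']*  ) '          # single-quoted value, quotes not captured
          | (?<value> [^""'<>\x20]+ ) \s*    # unquoted value up to a space, quote or bracket
        )+
        ", Options);
}
```

Because RegexOptions.ExplicitCapture is set, the unnamed groups do not capture, so only the name and value groups are available on each match.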

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.


About the Author

CodingDragon
Web Developer
Denmark
The CodingDragon started programming in 1999 in Java but quickly turned to .NET in 2000 - and has never looked back since.

Comments and Discussions

 
My vote of 3 (kailash kannan, 12-Jul-11 20:56)
It works great!! (jmfigueroa, 11-Feb-11 4:59)
Question: The REGEX does not work for... (Corgalore, 22-Oct-10 10:04)
Answer: Re: The REGEX does not work for... (kajalpatel, 3-Aug-12 7:19)
Thanks! (ptoloza, 20-Aug-09 5:11)
I used this regex:
String regExpMeta = "(\\s)*<meta\\s*(?:(?:\\b(\\w|-)+\\b\\s*(?:=\\s*(?:[\"\"[^\"\"]*\"\"|'[^']*'|[^\"\"'<> ]|[''[^'']*''|\"[^\"]*\"|[^''\"<> ]]]+)\\s*)?)*)/?\\s*>";
and this one:
String regExpAttr = "(?<name>\\b(\\w|-)+\\b)\\s*=\\s*(''(?<value>[^'']*)''|\"\"(?<value>[^\"\"]*)\"\"|\"(?<value>[^\"]*)\"|'(?<value>[^']*)'|(?<value>[^''\"/<> ]+)\\s*|(?<value>[^\"\"'/<> ]+)\\s*)+";
 
I did it so I can parse meta tags like this:
<meta name='a1sdf' content="z1xcv zx cv" />
<meta name='a2 sdf' content='z2xcvz xcv zxv '/>
<meta name="a3sdf" content='z3xcvb' />
<meta name="a4sdf" content="z4xcvbv"/>
 
Thanks!! :)
P.
Re: Thanks! (Romston, 9-Dec-11 4:24)
News: Also see (Ravi Bhavnani, 27-Nov-06 11:48)
Re: Also see (CodingDragon, 27-Nov-06 22:26)
Re: Also see (Ravi Bhavnani, 28-Nov-06 2:01)
Re: Also see (CodingDragon, 28-Nov-06 23:32)

Last Updated 27 Nov 2006
Article Copyright 2006 by CodingDragon
Everything else Copyright © CodeProject, 1999-2014