HTML Meta Tag Parser

CodingDragon, 27 Nov 2006
This article describes a simple meta tag parser.


This article shows a simple way to parse the meta tags in an HTML string. The class can return the metadata as HtmlMeta control objects (available from .NET 2.0 onwards), or return a value indicating whether the metadata restricts a web crawler. It is a simple rewrite of part of craigd's excellent code from the Searcharoo Too project. The rewrite adds some structure (a class) and lets my code reuse the parsing method.

The meta tag is defined in the HTML 4.01 DTD as follows:

<!ELEMENT META - O EMPTY   -- generic metainformation -->
<!ATTLIST META
 %i18n;              -- lang, dir, for use with content --
 http-equiv  NAME    #IMPLIED  -- HTTP response header name  --
 name        NAME    #IMPLIED  -- metainformation name --
 content     CDATA   #REQUIRED -- associated information --
 scheme      CDATA   #IMPLIED  -- select form of content --
 >

The class consists of one enumeration and two static methods:

enum RobotHtmlMeta { None = 0, NoIndex, NoFollow, NoIndexNoFollow };
static List<HtmlMeta> Parse(string htmldata)
static RobotHtmlMeta ParseRobotMetaTags(string htmldata)
  • Parse parses the HTML string and creates a list of HtmlMeta objects.
  • ParseRobotMetaTags parses the HTML string (with the help of Parse) and returns an enumerated value indicating if a robot meta tag was present and what it said.
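
The mapping from a robots content value to the enumeration can be sketched on its own, without any HTML parsing. This is a minimal illustration rather than the article's implementation: the class name RobotsContentDemo and the helper Classify are hypothetical, and the enumeration is redeclared here only so the example compiles by itself.

```csharp
using System;

public static class RobotsContentDemo
{
    // Redeclared copy of the article's enumeration, so the sketch is self-contained.
    public enum RobotHtmlMeta { None = 0, NoIndex, NoFollow, NoIndexNoFollow }

    // Hypothetical helper: classifies the content attribute of a robots meta tag.
    public static RobotHtmlMeta Classify(string content)
    {
        string lowered = (content ?? string.Empty).ToLowerInvariant();
        bool noIndex = lowered.Contains("noindex");
        bool noFollow = lowered.Contains("nofollow");

        // Test the combined case first so the single-directive cases cannot hide it.
        if (noIndex && noFollow) return RobotHtmlMeta.NoIndexNoFollow;
        if (noIndex) return RobotHtmlMeta.NoIndex;
        if (noFollow) return RobotHtmlMeta.NoFollow;
        return RobotHtmlMeta.None;
    }

    public static void Main()
    {
        Console.WriteLine(Classify("noindex, nofollow")); // NoIndexNoFollow
        Console.WriteLine(Classify("NOINDEX"));           // NoIndex
        Console.WriteLine(Classify("index, follow"));     // None
    }
}
```

A crawler would typically skip indexing the page for NoIndex and NoIndexNoFollow, and skip queuing its links for NoFollow and NoIndexNoFollow.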

Both methods use regular expressions: first to find the meta tags in the HTML, and then to find the attributes within each meta tag.
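
The two-pass idea can be tried in isolation with only System.Text.RegularExpressions. The sketch below uses deliberately simplified patterns, not the article's expressions, and the names TwoPassDemo and ExtractAttributes are made up for the illustration. Note that the simplified tag pattern would stop early at a '>' inside an attribute value.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class TwoPassDemo
{
    // Pass 1 (simplified): everything from "<meta" up to the next '>'.
    private static readonly Regex TagRegex =
        new Regex(@"<meta\b[^>]*>", RegexOptions.IgnoreCase);

    // Pass 2: one attribute written as name="value", name='value' or name=value.
    private static readonly Regex AttrRegex = new Regex(
        @"(?<name>[\w-]+)\s*=\s*(?:""(?<value>[^""]*)""|'(?<value>[^']*)'|(?<value>[^""'<>\s]+))",
        RegexOptions.IgnoreCase);

    // Hypothetical helper: returns "name=value" for every attribute of every meta tag.
    public static List<string> ExtractAttributes(string html)
    {
        List<string> result = new List<string>();
        foreach (Match tag in TagRegex.Matches(html))
        {
            foreach (Match attr in AttrRegex.Matches(tag.Value))
            {
                result.Add(attr.Groups["name"].Value + "=" + attr.Groups["value"].Value);
            }
        }
        return result;
    }

    public static void Main()
    {
        string html = "<head><meta name=\"robots\" content=\"noindex\">" +
                      "<meta http-equiv='refresh' content=5></head>";
        foreach (string pair in ExtractAttributes(html))
        {
            Console.WriteLine(pair);
        }
    }
}
```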

Using the code

The class can be used straight away. The class methods are all static.
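
A call site might look like the following sketch. It assumes the HtmlMetaParser class from this article is compiled into the project and that the project references System.Web (needed for HtmlMeta); the class name HtmlMetaParserUsage and the sample HTML are invented for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Web.UI.HtmlControls;
using hwit.Parsers;

public static class HtmlMetaParserUsage
{
    public static void Main()
    {
        // Sample input, made up for this illustration.
        string html =
            "<html><head>" +
            "<meta name=\"robots\" content=\"noindex\">" +
            "<meta http-equiv=\"Content-Type\" content=\"text/html; charset=utf-8\">" +
            "</head><body></body></html>";

        // List every meta tag found in the string.
        List<HtmlMeta> metas = HtmlMetaParser.Parse(html);
        foreach (HtmlMeta meta in metas)
        {
            Console.WriteLine("name='{0}' http-equiv='{1}' content='{2}'",
                              meta.Name, meta.HttpEquiv, meta.Content);
        }

        // Ask only whether the page restricts a crawler.
        HtmlMetaParser.RobotHtmlMeta robots = HtmlMetaParser.ParseRobotMetaTags(html);
        if (robots != HtmlMetaParser.RobotHtmlMeta.None)
        {
            Console.WriteLine("Crawler restriction: " + robots);
        }
    }
}
```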

The entire code can be seen here:

using System;
using System.Collections.Generic;
using System.Text;
using System.Web;
using System.Text.RegularExpressions;
using System.Web.UI;
using System.Web.UI.HtmlControls;

namespace hwit.Parsers
{
    public class HtmlMetaParser
    {
        public enum RobotHtmlMeta { None = 0, NoIndex, NoFollow, 
                                    NoIndexNoFollow };

        // Finds every meta tag in the HTML and converts it to an HtmlMeta object.
        public static List<HtmlMeta> Parse(string htmldata)
        {
            // Matches a whole <meta ...> tag, including its attribute list.
            Regex metaregex = 
                new Regex(@"<meta\s*(?:(?:\b(\w|-)+\b\s*(?:=\s*(?:""[^""]*""|'" +
                          @"[^']*'|[^""'<> ]+)\s*)?)*)/?\s*>", 
                          RegexOptions.IgnoreCase | RegexOptions.ExplicitCapture);

            // Matches one attribute: name="value", name='value' or name=value.
            Regex submetaregex = 
                new Regex(@"(?<name>\b(\w|-)+\b)\" + 
                          @"s*=\s*(""(?<value>" +
                          @"[^""]*)""|'(?<value>[^']*)'" + 
                          @"|(?<value>[^""'<> ]+)\s*)+",
                          RegexOptions.IgnoreCase | 
                          RegexOptions.ExplicitCapture);

            List<HtmlMeta> MetaList = new List<HtmlMeta>();
            foreach (Match metamatch in metaregex.Matches(htmldata))
            {
                HtmlMeta mymeta = new HtmlMeta();

                foreach (Match submetamatch in 
                         submetaregex.Matches(metamatch.Value))
                {
                    string attribute = 
                        submetamatch.Groups["name"].ToString().ToLower();

                    if ("http-equiv" == attribute)
                        mymeta.HttpEquiv = 
                            submetamatch.Groups["value"].ToString();

                    // A tag that already has http-equiv does not get a name.
                    if (("name" == attribute)
                         && (mymeta.HttpEquiv == String.Empty))
                        mymeta.Name = submetamatch.Groups["value"].ToString();

                    if ("scheme" == attribute)
                        mymeta.Scheme = submetamatch.Groups["value"].ToString();

                    if ("content" == attribute)
                        mymeta.Content = submetamatch.Groups["value"].ToString();
                }

                MetaList.Add(mymeta);
            }
            return MetaList;
        }

        // Reports whether a robots meta tag restricts indexing or link following.
        public static RobotHtmlMeta ParseRobotMetaTags(string htmldata)
        {
            List<HtmlMeta> MetaList = HtmlMetaParser.Parse(htmldata);

            RobotHtmlMeta result = RobotHtmlMeta.None;
            foreach (HtmlMeta meta in MetaList)
            {
                // "robot" also covers "robots".
                if (meta.Name.ToLower().IndexOf("robot") != -1)
                {
                    string content = meta.Content.ToLower();
                    // Test the combined case first; the else-if chain keeps
                    // the single-directive cases from overwriting it.
                    if (content.IndexOf("noindex") != -1 && 
                        content.IndexOf("nofollow") != -1)
                        result = RobotHtmlMeta.NoIndexNoFollow;
                    else if (content.IndexOf("noindex") != -1)
                        result = RobotHtmlMeta.NoIndex;
                    else if (content.IndexOf("nofollow") != -1)
                        result = RobotHtmlMeta.NoFollow;
                }
            }
            return result;
        }
    }
}

The code is very simple, and my guess is that only the regular expressions could raise questions. The first regular expression finds all the meta tags and the second finds the attribute list within a meta tag.
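
One detail of those expressions that may raise questions is RegexOptions.ExplicitCapture, which both of them use. With that option, plain parentheses only group and do not capture, so only the named groups (name and value) appear in the result. The sketch below shows the effect on a small made-up pattern; ExplicitCaptureDemo, CountGroups and the sample input are illustrative names, not part of the article's code.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ExplicitCaptureDemo
{
    // Illustrative pattern: one named group plus two unnamed parenthesised groups.
    private const string Pattern = @"(?<name>(\w|-)+)\s*=\s*(\d+)";

    // Returns how many groups the match exposes under the given options.
    public static int CountGroups(RegexOptions options)
    {
        Match match = new Regex(Pattern, options).Match("max-age=60");
        return match.Groups.Count;
    }

    public static void Main()
    {
        // Group 0 (the whole match), two numbered groups, and "name".
        Console.WriteLine(CountGroups(RegexOptions.None));            // 4
        // Only group 0 and "name" remain.
        Console.WriteLine(CountGroups(RegexOptions.ExplicitCapture)); // 2
    }
}
```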


This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.



About the Author

Web Developer
Denmark
The CodingDragon started programming in 1999 in Java but quickly turned to .NET in 2000 - and has never looked back since.

Article Copyright 2006 by CodingDragon