HTML Parser Technique for Parsing Search Engines (Google)

Kisilevich Slava, 23 Sep 2005
Set of libraries for parsing results of popular search engines (Google, Yahoo!, Lycos, MSN, Netscape, Ask, AllTheWeb, AltaVista).

Search Engine Parser Test


One of the projects I am involved in provides a facility to query the Google search engine and use the links it returns. We used Google's API, and it was fine until one day in August, when it stopped working with non-English queries and merely returned irrelevant links. Hence I was forced to write my own Google parser.

Notes about the search engine parser

The Google (and other) search engine parsers are based on an HTML parser that I wrote long ago. I am not providing its source, but I am going to explain the building blocks of the parser.

In the demo project you can find seven additional search engine parsers (MSN, Netscape, AllTheWeb, AltaVista, Yahoo!, Ask, Lycos).

HTML parser basics

My HTML parser is a regular parser which scans for HTML tokens, that is, tags. That is all I need when I want to parse a regular HTML page.

But it is not enough when it comes to parsing a page of results from the Google search engine, because apart from the actual result links the page contains many things that I don't need.

I need some way to tell the parser which structures in the page are important. For this I define an XML file that lists the tags I want to extract, together with everything that lies inside those tags.

Here is an example of the XML file for the Google web page:

  <structure name="TABLE" startTag="table" endTag="/table"/>
  <structure name="PARAGRAPH" startTag="p" endTag="/p"/>

Why do I need <table> and <p> tags for the Google web page? Prior to specializing the HTML Parser for the Google search engine I just looked at the source of a random page from Google. I found out that the links that match the query are placed between <p> tags and the number of total results found is somewhere in the <table> tag.
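The parser's own loader is not published, so the following is only a minimal sketch of how such structure definitions could be read. It assumes the two `<structure>` elements sit under a root element (called `structures` here, a name that does not appear in the article), and the `StructureDefinition` and `StructureConfig` types are hypothetical helpers, not part of the author's library.

```csharp
using System;
using System.Collections.Generic;
using System.Xml.Linq;

// Hypothetical holder for one <structure> entry; not part of the article's library.
public sealed class StructureDefinition
{
    public string Name { get; }
    public string StartTag { get; }
    public string EndTag { get; }

    public StructureDefinition(string name, string startTag, string endTag)
    {
        Name = name;
        StartTag = startTag;
        EndTag = endTag;
    }
}

public static class StructureConfig
{
    // Reads every <structure> element directly under the (assumed) root element.
    public static List<StructureDefinition> Load(string xml)
    {
        var result = new List<StructureDefinition>();
        foreach (XElement e in XDocument.Parse(xml).Root.Elements("structure"))
        {
            // A missing attribute simply yields null here.
            result.Add(new StructureDefinition(
                (string)e.Attribute("name"),
                (string)e.Attribute("startTag"),
                (string)e.Attribute("endTag")));
        }
        return result;
    }
}
```

Calling `StructureConfig.Load` on the two definitions above, wrapped in the assumed root element, yields a `TABLE` entry and a `PARAGRAPH` entry that a scanner could use to decide which tags to collect.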

I could do more than that. The structure of the web page returned by Google (like that of other templatized web pages) is almost always the same, so I could record the exact position of the <table> tag that holds the number of matched results and retrieve it directly. But that would tie me to one specific template structure, and I was not sure that Google returns the same template every time.

Search Engine Parser

The SearchEngineParser class inherits from the HtmlParser class and is the base class for all flavours of search engine parsers.

It defines some abstract methods which each specific search engine parser must implement.

Google Search Engine Parser

One of the most important methods that is overridden is GetLinks(HtmlStructure,AddressLinkCollection).

HtmlStructure is defined in the HtmlParser class. It holds the part of the web page structure delimited by a specific tag. A structure can hold nested structures, the text found inside it, and any address links it contains.

The idea is to iterate through structures and extract the data. All that we need is to get the address links out of the structure.

The HtmlAnchor class holds the address link:

protected override void GetLinks(HtmlStructure structure,
                            AddressLinkCollection linkCollection)
{
    if (structure == null) return;

    // If the structure is the PARAGRAPH alias defined in the XML file,
    // then it probably holds the links found by Google.
    if (structure.TagName == "PARAGRAPH")
    {
        if (structure.Anchors != null && structure.Anchors.Count != 0)
        {
            IList anchors = structure.Anchors;

            foreach (HtmlAnchor anchor in anchors)
            {
                // Skip links whose text is "cached", "similar" or "view as".
                if (anchor.Text.IndexOf("cached") >= 0) continue;
                if (anchor.Text.IndexOf("similar") >= 0) continue;
                if (anchor.Text.IndexOf("view as") >= 0) continue;

                // Skip links that contain the word "google" too.
                if (anchor.Href.ToString().IndexOf("google") >= 0) continue;

                // All other links are valid; place them in the AddressLink collection.
                // NOTE: this listing is cut off after the first constructor argument.
                // The second argument and the Add call below are assumed.
                AddressLink link = new AddressLink(anchor.Href.ToString(),
                                                   anchor.Text);
                linkCollection.Add(link);
            }
        }
    }

    IList structList = structure.Structure;
    // Continue to iterate through nested structures.
    // NOTE: the loop body is missing from this listing; the recursive call is assumed.
    foreach (HtmlStructure struct_ in structList)
    {
        GetLinks(struct_, linkCollection);
    }
}

Explanation of the code above

We iterate through the structures found in the web page that match those defined in the XML file. For Google there are only two such structures: <table> and <p>.

Each time we get a PARAGRAPH structure, which is an alias for <p>, we know that this is a structure where Google may hold its result links.

But it is not quite that simple, because a paragraph also contains links that we don't need, such as the cached link or the similar pages link, and we must filter them out.

//if text of the address is cached or similar or view as then skip it
if (anchor.Text.IndexOf("cached") >= 0) continue;
if (anchor.Text.IndexOf("similar") >= 0) continue;
if (anchor.Text.IndexOf("view as") >= 0) continue;

//if the link contains google word then skip it too
if (anchor.Href.ToString().IndexOf("google") >= 0 ) continue;
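The same filter can be pulled out into a standalone predicate, which makes it easy to test without the parser. This is only a sketch: `ResultLinkFilter` and `IsResultLink` are hypothetical names, and unlike the checks above it compares case-insensitively, which is my assumption and not something the article states.

```csharp
using System;

public static class ResultLinkFilter
{
    // Anchor-text fragments that mark helper links (values taken from the article).
    private static readonly string[] SkippedTextFragments = { "cached", "similar", "view as" };

    // Returns true when the anchor looks like a real search result.
    public static bool IsResultLink(string anchorText, string href)
    {
        if (string.IsNullOrEmpty(href)) return false;

        string text = anchorText ?? string.Empty;
        foreach (string fragment in SkippedTextFragments)
        {
            // Assumption: ignore case, so "Cached" is treated like "cached".
            if (text.IndexOf(fragment, StringComparison.OrdinalIgnoreCase) >= 0) return false;
        }

        // Links that point back to the search engine itself are not results.
        return href.IndexOf("google", StringComparison.OrdinalIgnoreCase) < 0;
    }
}
```

A substring test on "google" is deliberately crude: it also rejects genuine results whose address happens to contain that word, which is the same trade-off the original checks make.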

The next two methods that must be implemented retrieve the total number of results found. In Google it appears in the form: "Results 1 - 10 of about 124,000 for omg data mining specification. (0.20 seconds)".

We need that 124,000.

GetTotalSearchResults(HtmlStructure) iterates through the structures. If it finds a TABLE structure, it calls FindTotalSearchResults(string text,out int total).

protected override int GetTotalSearchResults(HtmlStructure structure)
{
    int totalSearchResults = -1;
    if (structure == null) return -1;

    if (structure.TagName == "TABLE")
    {
        if (FindTotalSearchResults(structure.TextArray,out totalSearchResults))
        {
            m_totalSearchResults = totalSearchResults;
            m_isTotaSearchResultsFound = true;
            return totalSearchResults;
        }
    }

    IList structList = structure.Structure;

    // Recurse into nested structures until a total is found.
    foreach(HtmlStructure struct_ in structList)
    {
        totalSearchResults = GetTotalSearchResults(struct_);
        if (totalSearchResults >= 0) break;
    }
    return totalSearchResults;
}

FindTotalSearchResults(string text,out int total) uses regular expressions to find the number of the results.

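protected override bool FindTotalSearchResults(string text,out int total)
{
    total = -1;
    if (text.IndexOf(SearchResultTermPattern) < 0) return false;
    Match m = Regex.Match(text,TotalSearchResultPattern,
              RegexOptions.IgnoreCase | RegexOptions.Multiline);

    // NOTE: the lines around this block are missing from this listing.
    // A try/catch is assumed, so that a failed parse returns false
    // and a successful one returns true.
    try
    {
        string totalString = m.Groups["total"].Value;
        totalString = totalString.Replace(",","");
        total = int.Parse(totalString);
    }
    catch (Exception)
    {
        return false;
    }
    return true;
}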
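The article does not show the value of TotalSearchResultPattern, so here is a minimal self-contained sketch with an assumed pattern. It only keeps the named group `total`, which the listing above reads through `m.Groups["total"]`; the pattern text, the class name and the method name are all hypothetical.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

public static class TotalResultsExtractor
{
    // Assumed pattern: matches "of about 124,000" as well as "of 124,000".
    private const string AssumedPattern = @"of\s+(?:about\s+)?(?<total>\d[\d,]*)";

    public static bool TryFindTotal(string text, out int total)
    {
        total = -1;
        if (text == null) return false;

        Match m = Regex.Match(text, AssumedPattern, RegexOptions.IgnoreCase);
        if (!m.Success) return false;

        // Drop the thousands separators before parsing.
        string digits = m.Groups["total"].Value.Replace(",", "");
        if (!int.TryParse(digits, NumberStyles.None, CultureInfo.InvariantCulture, out total))
        {
            total = -1; // TryParse sets the out value to 0 on failure
            return false;
        }
        return true;
    }
}
```

Using int.TryParse instead of int.Parse avoids an exception when the number does not fit into an int. The thousands separator is hard-coded as a comma here, which holds for the English result page quoted above but is an assumption for other locales.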

The most important methods to override are Search() and its variant Search(int nextIndex).

public override bool Search()
{
    // NOTE: the last operand of this concatenation is missing from this listing.
    m_fileName = m_queryPathString = m_startQuerySearchPattern +
                 m_query + m_startSearchPattern + /* operand missing in this listing */;
    m_baseUri = new Uri(m_fileName);

    // Call the HtmlParser ParseMe method to parse the web page.
    bool isParsed = this.ParseMe();
    if (isParsed)
    {
        m_addressLinkCollection = new AddressLinkCollection();
        // Call GetLinks to fill the address link collection with links.
        // (The call itself is missing from this listing.)
        m_numberOfLinksRetrieved = m_addressLinkCollection.Count;
        m_totalLinksRetrieved += m_numberOfLinksRetrieved;
    }

    return isParsed;
}
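Search() builds the request address by concatenating a few pattern strings with the query. The article does not say how m_query is encoded, so the following is only a sketch of one way to assemble such an address safely for non-English queries. `QueryUrlBuilder`, its parameters and the example address are hypothetical; nothing here reflects the real URL format of any search engine.

```csharp
using System;
using System.Globalization;

public static class QueryUrlBuilder
{
    // baseUrl, queryParam and startParam are supplied by the caller;
    // this helper only takes care of escaping and concatenation.
    public static Uri Build(string baseUrl, string queryParam, string query,
                            string startParam, int startIndex)
    {
        if (startIndex < 0) throw new ArgumentOutOfRangeException(nameof(startIndex));

        // EscapeDataString percent-encodes the query as UTF-8, so spaces
        // and non-ASCII characters survive the trip to the server.
        string url = baseUrl + "?" + queryParam + "=" + Uri.EscapeDataString(query)
                   + "&" + startParam + "="
                   + startIndex.ToString(CultureInfo.InvariantCulture);
        return new Uri(url);
    }
}
```

For example, `QueryUrlBuilder.Build("http://www.example.com/search", "q", "data mining", "start", 10)` produces an address whose query string is `?q=data%20mining&start=10`.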


  1. I don't load the XML structure configuration from a separate file. Instead, I embed it into the assembly. The drawback is that the structure of the web page may change in the future, and the embedded configuration cannot then be updated without rebuilding the assembly.
  2. I tested all the search engine parsers on English, Arabic, Hebrew and Russian queries. They work just fine with those languages, but there are some inconsistencies, as with Yahoo!. Yahoo! declares a UTF-8 charset for its web page, but the actual encoding is language specific. Because my HTML parser relies on the declared charset of the web page, it does not recognize the actual encoding, so you will not see any text related to the link (check the demo).


This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.

About the Author

Kisilevich Slava
Germany

Comments and Discussions

Interesting Code - problem with additional Parsers
hynomark, 7-Feb-10 9:53

Hey Man,

Very good job with this. I can get Google to work, but I receive an error on all the other parsers.

System.MissingMethodException: Method not found: 'Void Slaks.Web.Parser.AddressLink..ctor(System.String, System.String)'.
at Slaks.Web.Parser.YahooParser.GetLinks(HtmlStructure structure, AddressLinkCollection linkCollection)
at Slaks.Web.Parser.YahooParser.Search()

The references are all in there, and they are fine; it just seems that something is missing from the DLLs. I am not even sure of the actual method it's trying to find, as the message is being chopped...

I have added functionality to look up the page rank and add that to the results. As I said, it's working fine with Google, but it would be cool if it worked with the other engines as well...

Anything you can share that might help me with that?



Last Updated 23 Sep 2005
Article Copyright 2005 by Kisilevich Slava