HTML Parser Technique for Parsing Search Engines (Google)

Kisilevich Slava, 23 Sep 2005
Set of libraries for parsing results of popular search engines (Google, Yahoo!, Lycos, MSN, Netscape, Ask, AllTheWeb, AltaVista).

Search Engine Parser Test

Introduction

One of the projects I am involved in provides a facility for querying the Google search engine and uses the links it returns. We used Google's API, and it worked fine until one day in August, when it stopped handling non-English queries and merely returned irrelevant links. Hence I was forced to write my own Google parser.

Notes about the search engine parser

The Google parser (like the other search engine parsers) is built on an HTML parser that I wrote long ago. I am not providing its source, but I am going to explain the parser's building blocks.

In the demo project you can find seven additional search engine parsers (MSN, Netscape, AllTheWeb, AltaVista, Yahoo!, Ask, Lycos).

HTML parser basics

My HTML parser is a regular parser that scans for HTML tokens, that is, tags. That is all I need when I want to parse an ordinary HTML page.

But it is not enough for parsing a results page from the Google search engine, because the page contains many things I don't need; all I want are the actual result links.

I need some way to tell the parser which structures in the page are important. For this I define an XML file that lists the tags I want to extract, together with everything that lies inside them.

Here is an example of the XML file for the Google web page:

  <structures>
    <structure name="TABLE" startTag="table" endTag="/table"/>
    <structure name="PARAGRAPH" startTag="p" endTag="/p"/>
  </structures>
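
As an illustration of how such a definition file could be consumed, here is a minimal sketch that reads the `<structures>` document into a lookup table keyed by start tag. The original HTML parser's source is not published, so `StructureDefinition` and `StructureDefinitionLoader` are hypothetical names, and the use of `System.Xml.Linq` is my assumption rather than the author's implementation.

```csharp
using System;
using System.Collections.Generic;
using System.Xml.Linq;

// Hypothetical container for one <structure> entry; not the article's class.
public sealed class StructureDefinition
{
    public string Name { get; private set; }
    public string StartTag { get; private set; }
    public string EndTag { get; private set; }

    public StructureDefinition(string name, string startTag, string endTag)
    {
        Name = name;
        StartTag = startTag;
        EndTag = endTag;
    }
}

public static class StructureDefinitionLoader
{
    // Parses the <structures> document and indexes the definitions by start tag.
    // The comparer ignores case so that <P> and <p> resolve to the same entry.
    public static Dictionary<string, StructureDefinition> Load(string xml)
    {
        var result = new Dictionary<string, StructureDefinition>(
            StringComparer.OrdinalIgnoreCase);

        XDocument doc = XDocument.Parse(xml);
        foreach (XElement element in doc.Root.Elements("structure"))
        {
            string startTag = (string)element.Attribute("startTag");
            if (string.IsNullOrEmpty(startTag)) continue; // skip incomplete entries

            result[startTag] = new StructureDefinition(
                (string)element.Attribute("name"),
                startTag,
                (string)element.Attribute("endTag"));
        }
        return result;
    }
}
```

With the Google definitions above, looking up `p` would return the PARAGRAPH entry, so a tokenizer could decide per tag whether to open a new structure or ignore the tag.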

Why do I need the <table> and <p> tags for the Google web page? Before specializing the HTML parser for the Google search engine, I looked at the source of a random results page from Google. I found that the links matching the query are placed between <p> tags and that the total number of results found is somewhere inside a <table> tag.

I could do more than that. The structure of the web pages returned by Google (as with other templatized web pages) is almost always the same, so I could record the exact position of the <table> tag that holds the number of matched results and retrieve it directly. But that would tie me to a specific template structure, and I was not sure that Google returns the same template every time.

Search Engine Parser

The SearchEngineParser class inherits from the HtmlParser class and is the base class for all flavours of search engine parsers.

It defines some abstract methods which must be implemented by a specific search engine parser.
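
The arrangement can be pictured with a small sketch: a base class owns the shared driver code, and each engine supplies the hooks. Since HtmlParser and HtmlStructure are not published, `PageNode`, `SearchEngineParserSketch`, `ExtractLinks` and `ParagraphLinkParser` are simplified stand-ins of my own. Only the hook names `GetLinks` and `FindTotalSearchResults` are taken from the article.

```csharp
using System;
using System.Collections.Generic;

// Simplified stand-in for a parsed page structure (hypothetical, not HtmlStructure).
public class PageNode
{
    public string TagName = "";
    public List<string> Links = new List<string>();
    public List<PageNode> Children = new List<PageNode>();
}

public abstract class SearchEngineParserSketch
{
    // Engine-specific hooks, named after the methods described in the article.
    protected abstract void GetLinks(PageNode structure, List<string> links);
    protected abstract bool FindTotalSearchResults(string text, out int total);

    // Shared driver: collects links by delegating to the engine-specific hook.
    public List<string> ExtractLinks(PageNode root)
    {
        var links = new List<string>();
        GetLinks(root, links);
        return links;
    }
}

// Example engine: treats every PARAGRAPH structure as a holder of result links.
public sealed class ParagraphLinkParser : SearchEngineParserSketch
{
    protected override void GetLinks(PageNode structure, List<string> links)
    {
        if (structure == null) return;
        if (structure.TagName == "PARAGRAPH") links.AddRange(structure.Links);
        foreach (PageNode child in structure.Children) GetLinks(child, links);
    }

    protected override bool FindTotalSearchResults(string text, out int total)
    {
        total = -1;
        return false; // not needed for this illustration
    }
}
```

The point of the design is that supporting another engine means writing only a new subclass; the traversal and bookkeeping in the base class stay untouched.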

Google Search Engine Parser

One of the most important methods that is overridden is GetLinks(HtmlStructure,AddressLinkCollection).

HtmlStructure is defined in the HtmlParser class. It holds the part of the web page delimited by a specific tag. A structure can hold nested structures, the text found inside it, and any address links it contains.

The idea is to iterate through the structures and extract the data. All we need is to get the address links out of each structure.

The HtmlAnchor class holds the address link:

    protected override void GetLinks(HtmlStructure structure, 
                                AddressLinkCollection linkCollection)
    {
        if (structure == null) return;
        //if the structure name is PARAGRAPH (defined in the XML file) then 
        //this structure probably 
        //holds the links found by Google
        if (structure.TagName == "PARAGRAPH")
        {
            if (structure.Anchors != null && structure.Anchors.Count != 0) 
            {

                IList anchors = structure.Anchors;

                foreach(HtmlAnchor anchor in anchors)
                {
                    //if the text of the address link contains cached, 
                    //similar or view as, then skip it
                    if (anchor.Text.IndexOf("cached") >= 0) continue;
                    if (anchor.Text.IndexOf("similar") >= 0) continue;
                    if (anchor.Text.IndexOf("view as") >= 0) continue;
                    
                    //if the link contains the word google then skip it too
                    if (anchor.Href.ToString().IndexOf("google") >= 0 ) continue;

                    //all other links are the valid links, 
                    //place them in AddressLink collection
                    AddressLink link = new AddressLink(anchor.Href.ToString(), 
                                                                 anchor.Text);
                    linkCollection.Add(link);
                }
            }
        }
        IList structList = structure.Structure;
        //continue to iterate through structures
        foreach(HtmlStructure struct_ in structList)
        {
            GetLinks(struct_,linkCollection);
        }
    }

Explanation of the code above

We iterate through the structures found in the web page that match the definitions in the XML file. For Google there are only two kinds of structure: <table> and <p>.

Each time we reach a PARAGRAPH structure, which is an alias for <p>, we know that it may hold Google's result links.

But things are not quite that simple: some links, such as the cached and similar-pages links, are not results, and we must filter them out.

    //if the text of the address link contains cached, similar or view as, then skip it
    if (anchor.Text.IndexOf("cached") >= 0) continue;
    if (anchor.Text.IndexOf("similar") >= 0) continue;
    if (anchor.Text.IndexOf("view as") >= 0) continue;
                    
    //if the link contains the word google then skip it too
    if (anchor.Href.ToString().IndexOf("google") >= 0 ) continue;
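
The same filtering rules can be collected into a single predicate, which makes them easier to test in isolation. The sketch below is a variation, not the article's code: `ResultLinkFilter` and `IsResultLink` are hypothetical names, and unlike the original it compares case-insensitively and tolerates a null anchor text, on the assumption that the link labels might appear capitalized.

```csharp
using System;

public static class ResultLinkFilter
{
    // Anchor texts that mark auxiliary links rather than search results.
    private static readonly string[] SkippedTexts = { "cached", "similar", "view as" };

    // Returns true when the anchor looks like a genuine search result.
    public static bool IsResultLink(string anchorText, string href)
    {
        if (string.IsNullOrEmpty(href)) return false;
        string text = anchorText ?? string.Empty;

        foreach (string marker in SkippedTexts)
        {
            // Case-insensitive match (a deliberate change from the original code).
            if (text.IndexOf(marker, StringComparison.OrdinalIgnoreCase) >= 0)
                return false;
        }

        // Links that point back at Google itself are not search results.
        return href.IndexOf("google", StringComparison.OrdinalIgnoreCase) < 0;
    }
}
```

Note that the "google" check is as coarse here as in the original: a legitimate result whose URL merely contains the word would also be dropped.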

The next two methods that must be implemented retrieve the total number of results found. Google reports it in the form: Results 1 - 10 of about 124,000 for omg data mining specification. (0.20 seconds).

We need that 124,000.

GetTotalSearchResults(HtmlStructure) iterates through the structures. When it finds a TABLE structure, it calls FindTotalSearchResults(string text, out int total).

    protected override int GetTotalSearchResults(HtmlStructure structure)
    {
        int totalSearchResults = -1;
        if (structure == null) return -1;
        if (structure.TagName == "TABLE")
        {
            if (FindTotalSearchResults(structure.TextArray,out totalSearchResults))
            {
                m_totalSearchResults = totalSearchResults;
                m_isTotaSearchResultsFound = true;
                return totalSearchResults;
            }
        }

        IList structList = structure.Structure;

        foreach(HtmlStructure struct_ in structList)
        {
            totalSearchResults = GetTotalSearchResults(struct_);
            if (totalSearchResults >= 0) break;
        }
        return totalSearchResults;
    }

FindTotalSearchResults(string text, out int total) uses a regular expression to find the number of results.

    protected override bool FindTotalSearchResults(string text,out int total)
    {
        total = -1;
        if (text.IndexOf(SearchResultTermPattern) < 0) return false;
        Match m = Regex.Match(text,TotalSearchResultPattern, 
                  RegexOptions.IgnoreCase | RegexOptions.Multiline);
 
        try
        {
            string totalString = m.Groups["total"].Value;
            totalString = totalString.Replace(",","");
            total = int.Parse(totalString);
        }
        catch(Exception)
        {
            return false;
        }
        return true;
    }
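
The article does not show the value of TotalSearchResultPattern, so the following is only a sketch of what such a pattern might look like for the "of about 124,000" wording quoted earlier. The pattern, the class name `TotalResultsSketch` and the method name `TryFindTotal` are my assumptions. It also checks `Match.Success` and uses `int.TryParse` instead of relying on an exception for the no-match case.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

public static class TotalResultsSketch
{
    // Assumed pattern: captures the digits (with thousands separators)
    // that follow the words "of about" into the named group "total".
    private const string Pattern = @"of\s+about\s+(?<total>\d[\d,]*)";

    // Sets total to the parsed count and returns true, or leaves it at -1
    // and returns false when the text holds no recognizable count.
    public static bool TryFindTotal(string text, out int total)
    {
        total = -1;
        if (string.IsNullOrEmpty(text)) return false;

        Match m = Regex.Match(text, Pattern, RegexOptions.IgnoreCase);
        if (!m.Success) return false;

        // AllowThousands accepts the "," separators under the invariant culture.
        int parsed;
        if (!int.TryParse(m.Groups["total"].Value, NumberStyles.AllowThousands,
                          CultureInfo.InvariantCulture, out parsed))
        {
            return false;
        }

        total = parsed;
        return true;
    }
}
```

A pattern like this is tied to the English wording of the results line, so each engine (and possibly each interface language) would need its own variant.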

The most important method to override is Search(), along with its variant Search(int nextIndex).

    public override bool Search()
    {
        m_fileName = m_queryPathString = m_startQuerySearchPattern + 
                     m_query + m_startSearchPattern + 
                     m_totalLinksRetrieved.ToString();
        m_baseUri = new Uri(m_fileName);
        //call the HtmlParser ParseMe method to parse the web page
        bool isParsed = this.ParseMe();
        if (isParsed) 
        {
            m_addressLinkCollection = new AddressLinkCollection();
            //call GetLinks to fill the address link collection with links
            this.GetLinks(this.RootStructure,m_addressLinkCollection);
            m_numberOfLinksRetrieved = m_addressLinkCollection.Count;
            m_totalLinksRetrieved += m_numberOfLinksRetrieved;
        }

        return isParsed;
    }
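
Search() builds the request address by concatenating a prefix, the query, a suffix and the running count of links already retrieved. The values of m_startQuerySearchPattern and m_startSearchPattern are not shown, so the sketch below assumes a `search?q=...&start=...` layout. `GoogleQueryUrl`, `Build` and both constants are hypothetical. It also escapes the query, which matters for multi-word and non-English queries.

```csharp
using System;
using System.Globalization;

public static class GoogleQueryUrl
{
    // Assumed stand-ins for m_startQuerySearchPattern and m_startSearchPattern.
    private const string QueryPrefix = "http://www.google.com/search?q=";
    private const string StartSuffix = "&start=";

    // Builds the address of the results page that begins at startIndex.
    public static Uri Build(string query, int startIndex)
    {
        if (query == null) throw new ArgumentNullException("query");
        if (startIndex < 0) throw new ArgumentOutOfRangeException("startIndex");

        // Percent-encode the query so spaces and non-ASCII characters survive.
        string escaped = Uri.EscapeDataString(query);
        string start = startIndex.ToString(CultureInfo.InvariantCulture);
        return new Uri(QueryPrefix + escaped + StartSuffix + start);
    }
}
```

Passing the number of links retrieved so far as `startIndex` mirrors how m_totalLinksRetrieved is appended in Search(), so that each call asks for the next page of results.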

After-notes

  1. I don't load the XML file with the structure configuration from disk. Instead, I embed it into the assembly. The drawback is that if the structure of the web page changes in the future, the definitions cannot be updated without rebuilding the assembly.
  2. I tested all the search engine parsers on English, Arabic, Hebrew and Russian queries, and they work fine with those languages. There are some inconsistencies, for example with Yahoo!: it declares a UTF-8 charset for its web page, but the actual encoding is language specific. Because my HTML parser relies on the declared charset, it does not recognize the actual encoding, so you will not see any text for the links (check the demo).
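
Regarding the first note, reading an XML file that was embedded into an assembly typically goes through `Assembly.GetManifestResourceStream`. The sketch below shows that call in isolation. `EmbeddedDefinitions` and `Read` are hypothetical names, and the resource name a real project would pass depends on its default namespace and folder layout, which the article does not state.

```csharp
using System;
using System.IO;
using System.Reflection;

public static class EmbeddedDefinitions
{
    // Returns the text of an embedded resource, or null when the assembly
    // contains no resource with the given name.
    public static string Read(Assembly assembly, string resourceName)
    {
        if (assembly == null) throw new ArgumentNullException("assembly");

        using (Stream stream = assembly.GetManifestResourceStream(resourceName))
        {
            if (stream == null) return null; // resource not found

            using (var reader = new StreamReader(stream))
            {
                return reader.ReadToEnd();
            }
        }
    }
}
```

One way to soften the drawback would be to look for an external file first and fall back to the embedded copy, so that the definitions can be patched without a rebuild.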

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.


About the Author

Kisilevich Slava
Engineer
Germany

Last Updated 23 Sep 2005
Article Copyright 2005 by Kisilevich Slava