HTML Parser Technique for Parsing Search Engines (Google)

23 Sep 2005, 4 min read
Set of libraries for parsing results of popular search engines (Google, Yahoo!, Lycos, MSN, Netscape, Ask, AllTheWeb, AltaVista).

Search Engine Parser Test

Introduction

One of the projects I am involved in provides a facility for querying the Google search engine and using the links it returns. We used Google's API, and it worked fine until one day in August, when it stopped working with non-English queries and returned only irrelevant links. So I was forced to write my own Google parser.

Notes about the search engine parser

The Google (and other) search engine parsers are based on an HTML parser that I wrote long ago. I am not providing its source, but I am going to explain the building blocks of the parser.

In the demo project you can find seven additional search engine parsers (MSN, Netscape, AllTheWeb, AltaVista, Yahoo!, Ask, Lycos).

HTML parser basics

My HTML parser is a regular parser that scans for HTML tokens, that is, tags. That is all I need when I want to parse a regular HTML page.
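Since the parser source is not included, here is a minimal sketch of what "scanning for tags" can look like. It is not the author's implementation: the `TagScanner` class, its `ScanTags` method, and the regular expression are assumptions made for illustration, and a real parser would also have to deal with comments, scripts, and malformed markup.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the article's library.
public static class TagScanner
{
    // Matches an opening or closing tag and captures its name,
    // e.g. "<p class='g'>" gives "p" and "</table>" gives "/table".
    private static readonly Regex TagPattern =
        new Regex(@"<\s*(?<name>/?[A-Za-z][A-Za-z0-9]*)[^>]*>");

    // Returns the tag names in document order, lower-cased.
    public static List<string> ScanTags(string html)
    {
        var tags = new List<string>();
        foreach (Match m in TagPattern.Matches(html))
        {
            tags.Add(m.Groups["name"].Value.ToLowerInvariant());
        }
        return tags;
    }
}
```

Calling `TagScanner.ScanTags(html)` on a page yields a flat list of tag names; the end-tag names use the same `/table` form as the `endTag` attributes in the XML file.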

But it is not enough when it comes to parsing a web page with results from the Google search engine, because that page contains many things I don't need. All I want from it are the actual search result links.

I need some way to tell the parser which structures in the page are important. For this I define an XML file that lists the tags I want to extract, together with everything that lies inside them.

Here is an example of the XML file for the Google web page:

XML
<structures>
  <structure name="TABLE" startTag="table" endTag="/table"/>
  <structure name="PARAGRAPH" startTag="p" endTag="/p"/>
</structures>
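A minimal sketch of how a file like this could be read, assuming nothing about the real parser beyond the XML shown above. The `StructureDefinition` and `StructureConfig` names are hypothetical, and indexing by start tag is just one possible design.

```csharp
using System;
using System.Collections.Generic;
using System.Xml.Linq;

// Hypothetical type: one <structure> entry from the XML file.
public sealed class StructureDefinition
{
    public string Name { get; }
    public string StartTag { get; }
    public string EndTag { get; }

    public StructureDefinition(string name, string startTag, string endTag)
    {
        Name = name;
        StartTag = startTag;
        EndTag = endTag;
    }
}

// Hypothetical loader, not part of the article's library.
public static class StructureConfig
{
    // Parses the <structures> document and indexes the definitions
    // by start tag, ignoring case ("TABLE" and "table" are the same tag).
    public static Dictionary<string, StructureDefinition> Load(string xml)
    {
        var byStartTag = new Dictionary<string, StructureDefinition>(
            StringComparer.OrdinalIgnoreCase);
        XDocument doc = XDocument.Parse(xml);
        foreach (XElement e in doc.Root.Elements("structure"))
        {
            string startTag = (string)e.Attribute("startTag");
            if (string.IsNullOrEmpty(startTag)) continue; // skip incomplete entries
            byStartTag[startTag] = new StructureDefinition(
                (string)e.Attribute("name"),
                startTag,
                (string)e.Attribute("endTag"));
        }
        return byStartTag;
    }
}
```

With the definitions indexed this way, a tag scanner only has to look up each tag name it meets: a hit means a structure opens there, and the definition's `EndTag` says where it closes.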

Why do I need <table> and <p> tags for the Google web page? Prior to specializing the HTML Parser for the Google search engine I just looked at the source of a random page from Google. I found out that the links that match the query are placed between <p> tags and the number of total results found is somewhere in the <table> tag.

I could go further. The structure of the web page returned by Google (like that of other template-based pages) is almost always the same, so I could record the exact position of the <table> tag that holds the number of results and read it directly. But that would tie me to one specific template, and I was not sure that Google returns the same template every time.

Search Engine Parser

The SearchEngineParser class inherits from the HtmlParser class and is the base class for all flavours of search engine parsers.

It defines some abstract methods which must be implemented by a specific search engine parser.

Google Search Engine Parser

One of the most important methods that is overridden is GetLinks(HtmlStructure,AddressLinkCollection).

HtmlStructure is defined in the HtmlParser class. It holds the part of the web page delimited by a specific tag. A structure can hold nested structures, the text found inside it, and any address links it contains.
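To make that shape concrete, here is a rough model of such a tree. It is only a sketch: `StructureSketch`, `AnchorSketch`, and the `Walk` method are hypothetical stand-ins and do not reproduce the real HtmlStructure or HtmlAnchor classes.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for an address link (href plus visible text).
public sealed class AnchorSketch
{
    public string Href { get; }
    public string Text { get; }

    public AnchorSketch(string href, string text)
    {
        Href = href;
        Text = text;
    }
}

// Hypothetical stand-in for a structure: a named node holding text,
// anchors, and nested structures.
public sealed class StructureSketch
{
    public string TagName { get; }
    public string Text { get; set; } = "";
    public List<AnchorSketch> Anchors { get; } = new List<AnchorSketch>();
    public List<StructureSketch> Children { get; } = new List<StructureSketch>();

    public StructureSketch(string tagName)
    {
        TagName = tagName;
    }

    // Depth-first walk: this node first, then every nested structure.
    public IEnumerable<StructureSketch> Walk()
    {
        yield return this;
        foreach (StructureSketch child in Children)
        {
            foreach (StructureSketch node in child.Walk())
            {
                yield return node;
            }
        }
    }
}
```

A page then becomes a root node whose descendants are the TABLE and PARAGRAPH structures, and extracting data is a matter of walking the tree and inspecting each node.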

The idea is to iterate through structures and extract the data. All that we need is to get the address links out of the structure.

The HtmlAnchor class holds the address link:

C#
protected override void GetLinks(HtmlStructure structure,
                            AddressLinkCollection linkCollection)
{
    if (structure == null) return;
    //if the structure name is PARAGRAPH (defined in the xml file) then
    //this structure probably holds
    //the links found by Google
    if (structure.TagName == "PARAGRAPH")
    {
        if (structure.Anchors != null && structure.Anchors.Count != 0)
        {

            IList anchors = structure.Anchors;

            foreach(HtmlAnchor anchor in anchors)
            {
                //if text of the address link is cached
                //or similar or view as then skip it
                if (anchor.Text.IndexOf("cached") >= 0) continue;
                if (anchor.Text.IndexOf("similar") >= 0) continue;
                if (anchor.Text.IndexOf("view as") >= 0) continue;

                //if the link contains google word then skip it too
                if (anchor.Href.ToString().IndexOf("google") >= 0 ) continue;

                //all other links are the valid links,
                //place them in AddressLink collection
                AddressLink link = new AddressLink(anchor.Href.ToString(),
                                                             anchor.Text);
                linkCollection.Add(link);
            }
        }
    }
    IList structList = structure.Structure;
    //continue to iterate through structures
    foreach(HtmlStructure struct_ in structList)
    {
        GetLinks(struct_,linkCollection);
    }
}

Explanation of the code above

We iterate through the structures found in the web page that match the definitions in the XML file. For Google there are only two: <table> and <p>.

Each time we get the PARAGRAPH structure which is an alias for <p>, we know that this is a structure where Google holds its link results.

It is not quite that simple, though, because some of the links in these structures, such as the cached and similar pages links, are not search results, and we must filter them out.

C#
//if text of the address is cached or similar or view as then skip it
if (anchor.Text.IndexOf("cached") >= 0) continue;
if (anchor.Text.IndexOf("similar") >= 0) continue;
if (anchor.Text.IndexOf("view as") >= 0) continue;

//if the link contains google word then skip it too
if (anchor.Href.ToString().IndexOf("google") >= 0 ) continue;
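The same filter can be pulled out into a single predicate, which makes it easy to try on its own. This is a sketch rather than the library's code: `ResultLinkFilter` and `IsResultLink` are hypothetical names, and the case-insensitive comparison is an assumption that the original code does not make.

```csharp
using System;

// Hypothetical helper, not part of the article's library.
public static class ResultLinkFilter
{
    // Link texts that mark navigation links rather than search results.
    private static readonly string[] SkippedTexts = { "cached", "similar", "view as" };

    // Returns true when the anchor looks like an actual search result.
    public static bool IsResultLink(string anchorText, string href)
    {
        if (anchorText == null || href == null) return false;

        foreach (string skipped in SkippedTexts)
        {
            // Assumption: compare without regard to case.
            if (anchorText.IndexOf(skipped, StringComparison.OrdinalIgnoreCase) >= 0)
                return false;
        }

        // Links that point back to Google itself are not results either.
        return href.IndexOf("google", StringComparison.OrdinalIgnoreCase) < 0;
    }
}
```

One consequence of matching on the word google anywhere in the address is that a genuine result whose URL happens to contain it is dropped as well. Comparing against the host part of the address only would be stricter, at the cost of a little more code.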

The next two methods that must be implemented are used for retrieving the number of the total results found. In Google it is of the form: Results 1 - 10 of about 124,000 for omg data mining specification. (0.20 seconds).

We need that 124,000.

GetTotalSearchResults(HtmlStructure) iterates through the structures. When it finds a TABLE structure, it calls FindTotalSearchResults(string text,out int total).

C#
protected override int GetTotalSearchResults(HtmlStructure structure)
{
    int totalSearchResults = -1;
    if (structure == null) return -1;
    if (structure.TagName == "TABLE")
    {
        if (FindTotalSearchResults(structure.TextArray,out totalSearchResults))
        {
            m_totalSearchResults = totalSearchResults;
            m_isTotaSearchResultsFound = true;
            return totalSearchResults;
        }
    }

    IList structList = structure.Structure;

    foreach(HtmlStructure struct_ in structList)
    {
        totalSearchResults = GetTotalSearchResults(struct_);
        if (totalSearchResults >= 0) break;
    }
    return totalSearchResults;
}

FindTotalSearchResults(string text,out int total) uses regular expressions to find the number of the results.

C#
protected override bool FindTotalSearchResults(string text,out int total)
{
    total = -1;
    //quick check: skip texts that cannot contain the result count
    if (text.IndexOf(SearchResultTermPattern) < 0) return false;
    Match m = Regex.Match(text,TotalSearchResultPattern,
              RegexOptions.IgnoreCase | RegexOptions.Multiline);
    //no match means the number is not in this text
    if (!m.Success) return false;

    try
    {
        //remove the thousands separators (124,000 -> 124000) before parsing
        string totalString = m.Groups["total"].Value;
        totalString = totalString.Replace(",","");
        total = int.Parse(totalString);
    }
    catch(Exception)
    {
        return false;
    }
    return true;
}
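The values of SearchResultTermPattern and TotalSearchResultPattern are not shown in the article. As a rough illustration only, the sketch below uses a pattern I made up for the "Results 1 - 10 of about 124,000" line quoted earlier. The `TotalResultsSketch` class and its pattern are assumptions, and the real patterns may well differ.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the article's library.
public static class TotalResultsSketch
{
    // Assumed pattern: the number that follows "of" or "of about".
    private const string Pattern = @"of\s+(?:about\s+)?(?<total>\d[\d,]*)";

    // Returns true and sets total when a result count is found;
    // otherwise returns false and leaves total at -1.
    public static bool TryFindTotal(string text, out int total)
    {
        total = -1;
        if (text == null) return false;

        Match m = Regex.Match(text, Pattern, RegexOptions.IgnoreCase);
        if (!m.Success) return false;

        // AllowThousands accepts the "," group separators of the invariant culture.
        int parsed;
        if (!int.TryParse(m.Groups["total"].Value, NumberStyles.AllowThousands,
                          CultureInfo.InvariantCulture, out parsed))
        {
            return false;
        }

        total = parsed;
        return true;
    }
}
```

The named group `total` plays the same role as in the method above. A pattern like this is tied to the English wording of the page, so a results page in another interface language would need its own pattern.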

The most important method to override is Search(), along with its variant Search(int nextIndex). It builds the query URL, parses the returned page, and fills the address link collection.

C#
public override bool Search()
{
    m_fileName = m_queryPathString = m_startQuerySearchPattern +
                 m_query + m_startSearchPattern +
                 m_totalLinksRetrieved.ToString();
    m_baseUri = new Uri(m_fileName);
    //call the HtmlParser ParseMe method to parse the web page
    bool isParsed = this.ParseMe();
    if (isParsed)
    {
        m_addressLinkCollection = new AddressLinkCollection();
        //call GetLinks to fill the address link collection with links
        this.GetLinks(this.RootStructure,m_addressLinkCollection);
        m_numberOfLinksRetrieved = m_addressLinkCollection.Count;
        m_totalLinksRetrieved += m_numberOfLinksRetrieved;
    }

    return isParsed;
}
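The query URL above is put together from fields whose values the article does not show. The sketch below illustrates the idea with stand-in values: the `q` and `start` parameters and the `SearchUrlSketch` class are assumptions, not the library's actual constants. It also escapes the query text, which the code above may or may not do elsewhere.

```csharp
using System;

// Hypothetical helper, not part of the article's library.
public static class SearchUrlSketch
{
    // Assumed stand-ins for m_startQuerySearchPattern and m_startSearchPattern.
    private const string QueryPrefix = "http://www.google.com/search?q=";
    private const string StartParameter = "&start=";

    // Builds the URL of the results page that begins after the
    // given number of links already retrieved.
    public static string BuildPageUrl(string query, int linksRetrieved)
    {
        // Percent-encode the query so that spaces and non-ASCII
        // characters survive the trip in the URL.
        string escaped = Uri.EscapeDataString(query);
        return QueryPrefix + escaped + StartParameter + linksRetrieved;
    }
}
```

Passing the running total of retrieved links as the start offset is what lets repeated calls step through the result pages one after another, in the same way that Search() appends m_totalLinksRetrieved to the URL.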

After-notes

  1. I don't load the XML file with the structure configuration from disk. Instead, I embed it into the assembly. One drawback is that the structure of the web page may change in the future, and the embedded definitions cannot then be updated without rebuilding the assembly.
  2. I tested all the search engine parsers on English, Arabic, Hebrew and Russian, and they work fine with those languages. There are some inconsistencies, for example in Yahoo!. Yahoo! declares the UTF-8 charset for its web page, but the actual encoding is language specific. Because my HTML parser relies on the declared charset, it does not recognize the actual encoding, so you will not see any text for the links (check the demo).
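One possible way around the charset mismatch in the second note is to decode strictly as UTF-8 first and fall back to another encoding when the bytes turn out not to be valid UTF-8. This is only a sketch of that idea and is not something the library does. The `CharsetFallbackSketch` class is hypothetical, and choosing the right fallback encoding for a given language is left to the caller.

```csharp
using System;
using System.Text;

// Hypothetical helper, not part of the article's library.
public static class CharsetFallbackSketch
{
    // Decodes body as UTF-8; if the bytes are not valid UTF-8,
    // decodes them with the caller-supplied fallback encoding instead.
    public static string Decode(byte[] body, Encoding fallback)
    {
        // throwOnInvalidBytes: true makes malformed UTF-8 raise an
        // exception instead of silently producing replacement characters.
        var strictUtf8 = new UTF8Encoding(false, true);
        try
        {
            return strictUtf8.GetString(body);
        }
        catch (DecoderFallbackException)
        {
            return fallback.GetString(body);
        }
    }
}
```

The check is not a full encoding detector: a page in a single-byte encoding that happens to contain only ASCII characters, or bytes that form valid UTF-8 by coincidence, is still decoded as UTF-8.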

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.

Written By
Engineer
Germany
