Hi, I have tried to make a start on a web crawler. I was progressing well until I hit a point of confusion that I can't figure out. I have written the following code:

I am passing "http://www.google.com" as the string "URL"
C#
public void crawlURL(string URL, string depth)
{
    if (!checkPageHasBeenCrawled(URL))
    {
        PageContent = getURLContent(URL);
        MatchCollection matches = Regex.Matches(PageContent, "href=\"", RegexOptions.IgnoreCase);
        int count = matches.Count;
    }
}

private string getURLContent(string URL)
{
    string content;
    HttpWebRequest request = (HttpWebRequest)WebRequest.Create(URL);
    request.UserAgent = "Fetching contents Data";

    // using blocks so the response, stream and reader are disposed
    // even if reading the page throws
    using (WebResponse response = request.GetResponse())
    using (Stream stream = response.GetResponseStream())
    using (StreamReader reader = new StreamReader(stream))
    {
        content = reader.ReadToEnd();
    }
    return content;
}

Problem:
I am trying to get all the links on the page (http://www.google.com), but the Regex gives me a lower count than I expect. It reports 19 matches, while when I searched the page source manually for the word "href=" I found 41 occurrences. I can't understand why the code gives me the lower count.
The other problem I am facing is how to make sure that the links I have obtained are not broken; they should map to somewhere. One way I can think of is to request each link the same way I am getting the www.google.com page content, with an HttpWebRequest object; a rough sketch of that idea is below. Is there another way, or would this be the best? Thanks!
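
For illustration only, this is roughly what I mean (the method name isLinkAlive is just something I made up, I am not sure whether a HEAD request is the right approach, and it assumes using System.Net):
C#
private bool isLinkAlive(string URL)
{
    try
    {
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(URL);
        request.Method = "HEAD";      // only fetch the headers, not the whole page
        request.Timeout = 5000;       // fail fast on dead hosts
        using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
        {
            int status = (int)response.StatusCode;
            return status >= 200 && status < 400;   // 2xx/3xx counts as alive
        }
    }
    catch (WebException)
    {
        // DNS failure, 4xx/5xx, timeout, etc. -- treat the link as broken
        return false;
    }
}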


Your code is good, but you have to do a lot more with it.

Point 1: count issue
Do not search for href=" because in some places the URL appears without quotes, e.g. href=www.xyz.com.
Do not search for href alone either, because it can also be used in JavaScript, e.g. href.indexOf("#") or something similar. That can be the reason the manual count is higher.
So search only for href=.

Point 2: find bad URLs
Take the content just after each href= up to the next space and build a URL from it (don't forget to prepend http:// if it is missing). If you get an exception (handle it), it is a bad URL; otherwise the URL is valid. To do this you have to loop over the content received from the first URL (i.e. http://www.google.com in your example) up to the last href=; a rough sketch of the idea follows.
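
Something like this (just to illustrate; the regex, the http:// prefixing and the method name checkLinks are simplified and made up, and relative links are not handled):
C#
private void checkLinks(string PageContent)
{
    // take everything after href= up to the next quote, space or '>'
    MatchCollection matches = Regex.Matches(PageContent,
        "href=[\"']?([^\"' >]+)", RegexOptions.IgnoreCase);

    foreach (Match match in matches)
    {
        string url = match.Groups[1].Value;

        if (!url.StartsWith("http"))
            url = "http://" + url;             // prepend the scheme if it is missing

        try
        {
            HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
            using (request.GetResponse()) { }  // no exception -> the URL is reachable
        }
        catch (Exception)
        {
            Console.WriteLine("Bad URL: " + url);
        }
    }
}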

How you write the logic for this is up to you.

Hope this will help.

Thanks,
Jitendra
 
You search for the string href=".
That will not catch all the links on Google.
You try to find links like this:
HTML
<a href="http://www.gmail.com">Gmail</a>

A link like that you will find.
But Google also has a few links without quotes, like this:
HTML
<a href=http://www.gmail.com>Gmail

Search for the string href= instead, and change your crawlURL method into this:
C#
public void crawlURL(string URL, string depth)
{
    if (!checkPageHasBeenCrawled(URL))
    {
        PageContent = getURLContent(URL);
        MatchCollection matches = Regex.Matches(PageContent, "href=", RegexOptions.IgnoreCase);
        int count = matches.Count;
    }
}


[EDIT]

But why search for an attribute at all?
If you search for the a-tag, you will find the links as well.
So, change the method into this:
C#
public void crawlURL(string URL, string depth)
{
    if (!checkPageHasBeenCrawled(URL))
    {
        PageContent = getURLContent(URL);
        MatchCollection matches = Regex.Matches(PageContent, "<a", RegexOptions.IgnoreCase);
        int count = matches.Count;
    }
}
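
If you also want the link targets themselves rather than just the count, you could use a capture group. This is only a rough sketch (a real HTML parser, e.g. the Html Agility Pack, would be more reliable than a regex for this):
C#
// capture the href value of each anchor tag, with or without quotes
MatchCollection matches = Regex.Matches(PageContent,
    "<a[^>]*href\\s*=\\s*[\"']?([^\"' >]+)", RegexOptions.IgnoreCase);

foreach (Match match in matches)
{
    string link = match.Groups[1].Value;   // the actual link target
    Console.WriteLine(link);
}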

Hope this helps.
 
Comments
Member 9568921 20-Nov-12 12:57pm    
It only moved the count from 19 to 20, but the problem remains the same. The Google source code contains the word "href=" 42 times, which is far more than I am getting. I tried to cater for the possibility of spaces before or after the word, but it still gives more or less the same result, i.e. the lower count. Any other idea?
Thomas Daniels 20-Nov-12 13:01pm    
Yes. I updated my answer.
Sergey Alexandrovich Kryukov 20-Nov-12 15:03pm    
Not clear. You say "you search for "href="" and later advise the same. You say "A link like that, you'll find. But Google has also a few links like this" and show below essentially the same link example...
--SA
