Hi, I have tried to make a start on writing a web crawler. It was progressing well until I ran into a point of confusion that I can't understand. I have written the following code:

I am passing "http://www.google.com" as the string "URL"
C#
public void crawlURL(string URL, string depth)
{
    if (!checkPageHasBeenCrawled(URL))
    {
        // PageContent is a field holding the downloaded HTML.
        PageContent = getURLContent(URL);

        // Count every occurrence of href=" in the page source.
        MatchCollection matches = Regex.Matches(PageContent, "href=\"", RegexOptions.IgnoreCase);
        int count = matches.Count;
    }
}

private string getURLContent(string URL)
{
    HttpWebRequest request = (HttpWebRequest)WebRequest.Create(URL);
    request.UserAgent = "Fetching contents Data";

    // using blocks ensure the response, stream and reader are closed
    // even if reading throws.
    using (WebResponse response = request.GetResponse())
    using (Stream stream = response.GetResponseStream())
    using (StreamReader reader = new StreamReader(stream))
    {
        return reader.ReadToEnd();
    }
}

Problem:
I am trying to get all the links on the page (http://www.google.com), but the Regex gives me a lower count than expected: it reports 19 matches, while a manual search of the page source for the word "href=" finds 41 occurrences. I can't understand why the code gives a lower count.
One other problem I am currently facing is how to make sure that the links I have obtained are not broken, i.e. that they actually map to some place. One way I can think of is to fetch each link's content with an HttpWebRequest object, the same way I am getting the www.google.com page. Is there another way, or would this be the best? Thanks!


You search for the string href="
This will not work for Google.
You are trying to find links like this:
HTML
<a href="http://www.gmail.com">Gmail</a>

A link like that, you'll find.
But Google also has a few links where the href value is not quoted, like this:
HTML
<a href=http://www.gmail.com>Gmail</a>

Search for the string href= instead, and change your crawlURL method into this:
C#
public void crawlURL(string URL, string depth)
{
    if (!checkPageHasBeenCrawled(URL))
    {
        PageContent = getURLContent(URL);
        MatchCollection matches = Regex.Matches(PageContent, "href=", RegexOptions.IgnoreCase);
        int count = matches.Count;
    }
}


[EDIT]

But why search for an attribute at all?
If you search for the a-tag instead, you'll find the links as well.
So, change the method into this:
C#
public void crawlURL(string URL, string depth)
{
    if (!checkPageHasBeenCrawled(URL))
    {
        PageContent = getURLContent(URL);
        MatchCollection matches = Regex.Matches(PageContent, "<a", RegexOptions.IgnoreCase);
        int count = matches.Count;
    }
}
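
If you need the link targets themselves and not just a count, a capturing regex can pull out the href value whether it is quoted or not. This is only a sketch (regex parsing of HTML is brittle, and the pattern below is mine, not from the question):
C#
// Captures the href value with double quotes, single quotes, or no quotes.
// .NET allows reusing the group name "url" across the alternatives.
MatchCollection matches = Regex.Matches(
    PageContent,
    @"href\s*=\s*(?:""(?<url>[^""]*)""|'(?<url>[^']*)'|(?<url>[^\s>]+))",
    RegexOptions.IgnoreCase);

foreach (Match m in matches)
{
    string link = m.Groups["url"].Value;
    // link now holds the raw attribute value, e.g. http://www.gmail.com
}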

Hope this helps.
 
Comments
Member 9568921 20-Nov-12 12:57pm    
It only moved the count from 19 to 20, but the problem remains the same. The Google source code contains the word "href=" 42 times, which is far more than I am getting. I tried to cater to the possibility of spaces before or after the word, but it still gives more or less the same low count. Any other ideas?
Thomas Daniels 20-Nov-12 13:01pm    
Yes. I updated my answer.
Sergey Alexandrovich Kryukov 20-Nov-12 15:03pm    
Not clear. You say "you search for href="" and later advise the same. You say "A link like that, you'll find. But Google has also a few links like this" and then show essentially the same link example below...
--SA
Your code is good, but you have to do a lot more with it.

Point 1: count issue
Do not use href=" because in some places the URL can appear unquoted, as in href=www.xyz.com.
Do not count bare href either, because it can also appear in JavaScript, e.g. href.indexOf("#") or something similar; this can be the reason the manual count is higher.
So, use only href=.

Point 2: finding bad URLs
Take the content just after each href= up to the next space and build a URL from it (don't forget to prepend http:// if it is missing). If you get an exception (handle it), it is a bad URL; otherwise the URL is valid. To achieve this you have to loop over the content received from the first URL (i.e. http://www.google.com in your example) up to the last href=; see the sketch after this paragraph.

How you write the logic for this is up to you.
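
A minimal sketch of that check, using a hypothetical helper name (isLinkAlive is not from the original post). It issues a HEAD request so no body is downloaded, and treats any WebException as a broken link:
C#
// Hypothetical helper: returns true if the URL responds, false if it is broken.
private bool isLinkAlive(string url)
{
    // Prepend http:// if the scheme is missing, as described above.
    if (!url.StartsWith("http", StringComparison.OrdinalIgnoreCase))
        url = "http://" + url;

    try
    {
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
        request.Method = "HEAD";   // headers only, no body download
        request.Timeout = 5000;    // don't hang on dead servers
        using (WebResponse response = request.GetResponse())
        {
            return true;           // got a successful response
        }
    }
    catch (WebException)
    {
        return false;              // DNS failure, timeout, 4xx/5xx, ...
    }
    catch (UriFormatException)
    {
        return false;              // the extracted text was not a valid URL
    }
}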

Hope this will help.

Thanks,
Jitendra
 
