Hi, I have tried to make a start on a web crawler. I was progressing well until I hit a point of confusion that I can't figure out. I have written the following code:

I am passing "http://www.google.com" as the string "URL"
C#
public void crawlURL(string URL, string depth)
{
    if (!checkPageHasBeenCrawled(URL))
    {
        PageContent = getURLContent(URL);
        MatchCollection matches = Regex.Matches(PageContent, "href=\"", RegexOptions.IgnoreCase);
        int count = matches.Count;
    }
}

private string getURLContent(string URL)
{
    string content;
    HttpWebRequest request = (HttpWebRequest)WebRequest.Create(URL);
    request.UserAgent = "Fetching contents Data";

    // using blocks so the response, stream and reader are disposed
    // even if reading the page throws
    using (WebResponse response = request.GetResponse())
    using (Stream stream = response.GetResponseStream())
    using (StreamReader reader = new StreamReader(stream))
    {
        content = reader.ReadToEnd();
    }
    return content;
}

Problem:
I am trying to get all the links on the page (http://www.google.com), but the Regex gives me a lower count than I expect. It reports 19 matches, while when I searched the page source manually for the word "href=" I found 41 occurrences. I can't understand why the code gives me the lower count.
The other problem I am facing is how to make sure that the links I have obtained are not broken; they should map to somewhere. One way I can think of is to request each link the same way I am getting the www.google.com page content, with an HttpWebRequest object; a rough sketch of that idea is below. Is there another way, or would this be the best? Thanks!
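
For illustration only, this is roughly what I mean (the method name isLinkAlive is just something I made up, I am not sure whether a HEAD request is the right approach, and it assumes using System.Net):
C#
private bool isLinkAlive(string URL)
{
    try
    {
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(URL);
        request.Method = "HEAD";      // only fetch the headers, not the whole page
        request.Timeout = 5000;       // fail fast on dead hosts
        using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
        {
            int status = (int)response.StatusCode;
            return status >= 200 && status < 400;   // 2xx/3xx counts as alive
        }
    }
    catch (WebException)
    {
        // DNS failure, 4xx/5xx, timeout, etc. -- treat the link as broken
        return false;
    }
}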


Your code is good, but you have to do a lot more with it.

Point 1: count issue
Do not search for href=" because in some places the URL appears without quotes, e.g. href=www.xyz.com.
Do not search for href alone either, because it can also be used in JavaScript, e.g. href.indexOf("#") or something similar. That can be the reason the manual count is higher.
So search only for href=.

Point 2: find bad URLs
Take the content just after each href= up to the next space and build a URL from it (don't forget to prepend http:// if it is missing). If you get an exception (handle it), it is a bad URL; otherwise the URL is valid. To do this you have to loop over the content received from the first URL (i.e. http://www.google.com in your example) up to the last href=; a rough sketch of the idea follows.
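
Something like this (just to illustrate; the regex, the http:// prefixing and the method name checkLinks are simplified and made up, and relative links are not handled):
C#
private void checkLinks(string PageContent)
{
    // take everything after href= up to the next quote, space or '>'
    MatchCollection matches = Regex.Matches(PageContent,
        "href=[\"']?([^\"' >]+)", RegexOptions.IgnoreCase);

    foreach (Match match in matches)
    {
        string url = match.Groups[1].Value;

        if (!url.StartsWith("http"))
            url = "http://" + url;             // prepend the scheme if it is missing

        try
        {
            HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
            using (request.GetResponse()) { }  // no exception -> the URL is reachable
        }
        catch (Exception)
        {
            Console.WriteLine("Bad URL: " + url);
        }
    }
}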

How you write the logic for this is up to you.

Hope this will help.

Thanks,
Jitendra
 
You search for the string href=".
That will not catch all the links on Google.
You try to find links like this:
HTML
<a href="http://www.gmail.com">Gmail</a>

A link like that you will find.
But Google also has a few links without quotes, like this:
HTML
<a href=http://www.gmail.com>Gmail

Search for the string href= instead, and change your crawlURL method into this:
C#
public void crawlURL(string URL, string depth)
{
    if (!checkPageHasBeenCrawled(URL))
    {
        PageContent = getURLContent(URL);
        MatchCollection matches = Regex.Matches(PageContent, "href=", RegexOptions.IgnoreCase);
        int count = matches.Count;
    }
}


[EDIT]

But why search for an attribute at all?
If you search for the a-tag, you will find the links as well.
So, change the method into this:
C#
public void crawlURL(string URL, string depth)
{
    if (!checkPageHasBeenCrawled(URL))
    {
        PageContent = getURLContent(URL);
        MatchCollection matches = Regex.Matches(PageContent, "<a", RegexOptions.IgnoreCase);
        int count = matches.Count;
    }
}
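
If you also want the link targets themselves rather than just the count, you could use a capture group. This is only a rough sketch (a real HTML parser, e.g. the Html Agility Pack, would be more reliable than a regex for this):
C#
// capture the href value of each anchor tag, with or without quotes
MatchCollection matches = Regex.Matches(PageContent,
    "<a[^>]*href\\s*=\\s*[\"']?([^\"' >]+)", RegexOptions.IgnoreCase);

foreach (Match match in matches)
{
    string link = match.Groups[1].Value;   // the actual link target
    Console.WriteLine(link);
}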

Hope this helps.
 
Comments
Member 9568921 20-Nov-12 12:57pm    
It only moved the count from 19 to 20, but the problem remains the same. The Google source code contains the word "href=" 42 times, which is far more than I am getting. I tried to cater for the possibility of spaces before or after the word, but it still gives more or less the same result, i.e. the lower count. Any other idea?
Thomas Daniels 20-Nov-12 13:01pm    
Yes. I updated my answer.
Sergey Alexandrovich Kryukov 20-Nov-12 15:03pm    
Not clear. You say "you search for "href="" and later advise the same. You say "A link like that, you'll find. But Google has also a few links like this" and show below essentially the same link example...
--SA
