Tags: C#, ASP.NET, web scraping
Hi, I have been trying to get started on writing a web crawler. It was progressing well until I ran into a problem I can't understand. I have written the following code:
 
I am passing "http://www.google.com" as the string URL.
public void crawlURL(string URL, string depth)
{
    if (!checkPageHasBeenCrawled(URL))
    {
        PageContent = getURLContent(URL);
        MatchCollection matches = Regex.Matches(PageContent, "href=\"", RegexOptions.IgnoreCase);
        int count = matches.Count;
    }
}
 
private string getURLContent(string URL)
{
    HttpWebRequest request = (HttpWebRequest)WebRequest.Create(URL);
    request.UserAgent = "Fetching contents Data";

    // using blocks ensure the response, stream, and reader are
    // disposed even if reading throws
    using (WebResponse response = request.GetResponse())
    using (Stream stream = response.GetResponseStream())
    using (StreamReader reader = new StreamReader(stream))
    {
        return reader.ReadToEnd();
    }
}
Problem:
I am trying to get all the links on the page (http://www.google.com), but the Regex match count is lower than expected. It reports 19 links, while when I searched the source code manually for the word "href=" I found 41 occurrences. I can't understand why the code gives a lower count.

One other problem I am facing is how to make sure that the links I have obtained are not broken; they should map to some place. One way I can think of is to fetch each link the same way I am getting the www.google.com page content, with an HttpWebRequest object. Is there another way, or would this be the best? Thanks!
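For the broken-link part of the question, a minimal sketch (building on the same HttpWebRequest approach already used above) is to issue a HEAD request, which checks that the link resolves without downloading the whole page body. The method name and timeout value here are illustrative choices, not anything from the original code:

```csharp
using System;
using System.Net;

class LinkChecker
{
    // Sketch: returns true if the URL responds without an error status.
    // A HEAD request avoids downloading the page body; note that some
    // servers reject HEAD, so a production crawler would fall back to
    // GET when HEAD is refused.
    public static bool IsLinkAlive(string url)
    {
        try
        {
            HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
            request.Method = "HEAD";
            request.Timeout = 5000; // milliseconds; fail fast on dead hosts
            using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
            {
                return (int)response.StatusCode < 400;
            }
        }
        catch (UriFormatException)
        {
            return false; // malformed URL
        }
        catch (WebException)
        {
            return false; // DNS failure, timeout, or 4xx/5xx status
        }
    }
}
```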
 
[Edit]Code block added[/Edit]
Posted 20-Nov-12 6:30am
Edited 20-Nov-12 6:37am

Solution 1

You search for the string href="
This will not work for Google.
You try to find links like this:
<a href="http://www.gmail.com">Gmail</a>
A link like that, you'll find.
But Google also has a few links where the URL is not quoted, like this:
<a href=http://www.gmail.com>Gmail</a>
Search for the string href= instead, and change your crawlURL method into this:
public void crawlURL(string URL, string depth)
{
    if (!checkPageHasBeenCrawled(URL))
    {
        PageContent = getURLContent(URL);
        MatchCollection matches = Regex.Matches(PageContent, "href=", RegexOptions.IgnoreCase);
        int count = matches.Count;
    }
}
 
[EDIT]
 
But why search for an attribute at all?
If you search for the a-tag, you'll find the links as well.
So, change the method into this:
public void crawlURL(string URL, string depth)
{
    if (!checkPageHasBeenCrawled(URL))
    {
        PageContent = getURLContent(URL);
        MatchCollection matches = Regex.Matches(PageContent, "<a", RegexOptions.IgnoreCase);
        int count = matches.Count;
    }
}
Hope this helps.
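Going one step beyond counting matches, a sketch of a regex that handles both the quoted and unquoted href forms discussed above, and captures the URL value itself, might look like this (the class and method names are illustrative; for real-world HTML a proper parser such as HtmlAgilityPack is more reliable than any regex):

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

class HrefExtractor
{
    // Matches href= followed by a double-quoted, single-quoted, or
    // unquoted value. Group 1 holds a double-quoted value, group 2 a
    // single-quoted one, group 3 an unquoted one.
    static readonly Regex HrefRegex = new Regex(
        "href\\s*=\\s*(?:\"([^\"]*)\"|'([^']*)'|([^\\s>]+))",
        RegexOptions.IgnoreCase);

    public static List<string> ExtractHrefs(string html)
    {
        var urls = new List<string>();
        foreach (Match m in HrefRegex.Matches(html))
        {
            // Pick whichever alternative actually matched.
            urls.Add(m.Groups[1].Success ? m.Groups[1].Value
                   : m.Groups[2].Success ? m.Groups[2].Value
                   : m.Groups[3].Value);
        }
        return urls;
    }
}
```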
Comments
Member 9568921 at 20-Nov-12 12:57pm
It only raised the count from 19 to 20, but the problem remains. The Google source code contains the word "href=" 42 times, which is far more than I am getting. I tried to cater for the possibility of spaces before or after the word, but it still gives more or less the same, too-low count. Any other ideas?
ProgramFOX at 20-Nov-12 13:01pm
Yes. I updated my answer.
Sergey Alexandrovich Kryukov at 20-Nov-12 15:03pm
Not clear. You say "you search for href="" and later advise the same. You say "A link like that, you'll find. But Google has also a few links like this" and then show essentially the same link example...
--SA

Solution 3

Your code is good, but you have to do a lot more with it.

Point 1: count issue
Do not use href=" because in some places the URL can appear as href=www.xyz.com.
Do not use href alone either, since it can also appear in JavaScript, e.g. href.indexOf("#") or something similar. This can be the reason your manual count is higher.
So, use only href=.

Point 2: finding bad URLs
Take the content just after href= up to the next space and build a URL from it (don't forget to prepend http:// if it is missing). If you get an exception (handle it), it means it is a bad URL; otherwise the URL is created successfully. To achieve this, loop over the content received from the first URL (i.e. http://www.google.com in your example) until the last href=.

How you write the logic for this is up to you.
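The normalization step in Point 2 could be sketched like this, using Uri.TryCreate in place of exception handling to reject malformed URLs (the class and method names are illustrative, not from the original post):

```csharp
using System;

class UrlNormalizer
{
    // Sketch of Point 2: take the raw text found after "href=", trim
    // surrounding quotes, prepend "http://" when no scheme is present,
    // and let Uri.TryCreate reject anything still malformed.
    public static bool TryNormalize(string raw, out Uri uri)
    {
        string candidate = raw.Trim('"', '\'');
        if (!candidate.StartsWith("http://") && !candidate.StartsWith("https://"))
            candidate = "http://" + candidate;
        return Uri.TryCreate(candidate, UriKind.Absolute, out uri);
    }
}
```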
 
Hope this will help.
 
Thanks,
Jitendra

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)
