Hi, I have tried to make a start on writing a web crawler. It was progressing well until I ran into a point of confusion that I can't understand. I have written the following code:

I am passing "http://www.google.com" as the string "URL"
C#
public void crawlURL(string URL, string depth)
{
    if (!checkPageHasBeenCrawled(URL))
    {
        // PageContent is a field holding the downloaded HTML.
        PageContent = getURLContent(URL);

        // Count every occurrence of href=" in the page source.
        MatchCollection matches = Regex.Matches(PageContent, "href=\"", RegexOptions.IgnoreCase);
        int count = matches.Count;
    }
}

private string getURLContent(string URL)
{
    HttpWebRequest request = (HttpWebRequest)WebRequest.Create(URL);
    request.UserAgent = "Fetching contents Data";

    // using blocks ensure the response, stream and reader are closed
    // even if reading throws.
    using (WebResponse response = request.GetResponse())
    using (Stream stream = response.GetResponseStream())
    using (StreamReader reader = new StreamReader(stream))
    {
        return reader.ReadToEnd();
    }
}

Problem:
I am trying to get all the links on the page (http://www.google.com), but the Regex gives me a lower count than expected: it reports 19 matches, while a manual search of the page source for the word "href=" finds 41 occurrences. I can't understand why the code gives a lower count.
One other problem I am currently facing is how to make sure that the links I have obtained are not broken, i.e. that they actually map to some place. One way I can think of is to fetch each link's content with an HttpWebRequest object, the same way I am getting the www.google.com page. Is there another way, or would this be the best? Thanks!


You search for the string href="
This will not work for Google.
You are trying to find links like this:
HTML
<a href="http://www.gmail.com">Gmail</a>

A link like that, you'll find.
But Google also has a few links where the href value is not quoted, like this:
HTML
<a href=http://www.gmail.com>Gmail</a>

Search for the string href= instead, and change your crawlURL method into this:
C#
public void crawlURL(string URL, string depth)
{
    if (!checkPageHasBeenCrawled(URL))
    {
        PageContent = getURLContent(URL);
        MatchCollection matches = Regex.Matches(PageContent, "href=", RegexOptions.IgnoreCase);
        int count = matches.Count;
    }
}


[EDIT]

But why search for an attribute at all?
If you search for the a-tag instead, you'll find the links as well.
So, change the method into this:
C#
public void crawlURL(string URL, string depth)
{
    if (!checkPageHasBeenCrawled(URL))
    {
        PageContent = getURLContent(URL);
        MatchCollection matches = Regex.Matches(PageContent, "<a", RegexOptions.IgnoreCase);
        int count = matches.Count;
    }
}
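
If you need the link targets themselves and not just a count, a capturing regex can pull out the href value whether it is quoted or not. This is only a sketch (regex parsing of HTML is brittle, and the pattern below is mine, not from the question):
C#
// Captures the href value with double quotes, single quotes, or no quotes.
// .NET allows reusing the group name "url" across the alternatives.
MatchCollection matches = Regex.Matches(
    PageContent,
    @"href\s*=\s*(?:""(?<url>[^""]*)""|'(?<url>[^']*)'|(?<url>[^\s>]+))",
    RegexOptions.IgnoreCase);

foreach (Match m in matches)
{
    string link = m.Groups["url"].Value;
    // link now holds the raw attribute value, e.g. http://www.gmail.com
}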

Hope this helps.
 
Comments
Member 9568921 20-Nov-12 12:57pm    
It only moved the count from 19 to 20, but the problem remains the same. The Google source code contains the word "href=" 42 times, which is far more than I am getting. I tried to cater to the possibility of spaces before or after the word, but it still gives more or less the same low count. Any other ideas?
Thomas Daniels 20-Nov-12 13:01pm    
Yes. I updated my answer.
Sergey Alexandrovich Kryukov 20-Nov-12 15:03pm    
Not clear. You say "you search for href="" and later advise the same. You say "A link like that, you'll find. But Google has also a few links like this" and then show essentially the same link example below...
--SA
Your code is good, but you have to do a lot more with it.

Point 1: count issue
Do not use href=" because in some places the URL can appear unquoted, as in href=www.xyz.com.
Do not count bare href either, because it can also appear in JavaScript, e.g. href.indexOf("#") or something similar; this can be the reason the manual count is higher.
So, use only href=.

Point 2: finding bad URLs
Take the content just after each href= up to the next space and build a URL from it (don't forget to prepend http:// if it is missing). If you get an exception (handle it), it is a bad URL; otherwise the URL is valid. To achieve this you have to loop over the content received from the first URL (i.e. http://www.google.com in your example) up to the last href=; see the sketch after this paragraph.

How you write the logic for this is up to you.
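
A minimal sketch of that check, using a hypothetical helper name (isLinkAlive is not from the original post). It issues a HEAD request so no body is downloaded, and treats any WebException as a broken link:
C#
// Hypothetical helper: returns true if the URL responds, false if it is broken.
private bool isLinkAlive(string url)
{
    // Prepend http:// if the scheme is missing, as described above.
    if (!url.StartsWith("http", StringComparison.OrdinalIgnoreCase))
        url = "http://" + url;

    try
    {
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
        request.Method = "HEAD";   // headers only, no body download
        request.Timeout = 5000;    // don't hang on dead servers
        using (WebResponse response = request.GetResponse())
        {
            return true;           // got a successful response
        }
    }
    catch (WebException)
    {
        return false;              // DNS failure, timeout, 4xx/5xx, ...
    }
    catch (UriFormatException)
    {
        return false;              // the extracted text was not a valid URL
    }
}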

Hope this will help.

Thanks,
Jitendra
 
