Hi, I have started writing a web crawler. It was progressing well until I ran into a point of confusion that I can't resolve. I have written the following code:
I am passing "http://www.google.com" as the string "URL"
public void crawlURL(string URL, string depth)
{
    if (!checkPageHasBeenCrawled(URL))
    {
        PageContent = getURLContent(URL);
        MatchCollection matches = Regex.Matches(PageContent, "href=\"", RegexOptions.IgnoreCase);
        int count = matches.Count;
    }
}
private string getURLContent(string URL)
{
    HttpWebRequest request = (HttpWebRequest)WebRequest.Create(URL);
    request.UserAgent = "Fetching contents Data";
    // using blocks ensure the response, stream, and reader are all disposed,
    // even if ReadToEnd throws (the original never closed the response)
    using (WebResponse response = request.GetResponse())
    using (Stream stream = response.GetResponseStream())
    using (StreamReader reader = new StreamReader(stream))
    {
        return reader.ReadToEnd();
    }
}
Problem:
I am trying to get all the links on the page (http://www.google.com), but the Regex gives me fewer matches than I expect. It reports a count of 19 links, while when I searched the page source manually for the word "href=" I found 41 occurrences. I can't understand why the code reports a lower count.
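(For reference, one thing I noticed while experimenting: my pattern only matches `href="` with a double quote, so any single-quoted or unquoted `href` attributes would be missed if the page mixes quoting styles. A more forgiving pattern, as a sketch with a made-up sample string, would be:)

```csharp
using System;
using System.Text.RegularExpressions;

class HrefCountSketch
{
    // Matches href=, an optional single or double quote, then captures the
    // URL up to the closing quote, whitespace, or '>'.
    static readonly Regex HrefPattern = new Regex(
        @"href\s*=\s*[""']?([^""'\s>]+)",
        RegexOptions.IgnoreCase);

    static void Main()
    {
        // Hypothetical sample mixing the three quoting styles.
        string html = "<a href=\"a.html\"><a href='b.html'><a HREF=c.html>";
        MatchCollection matches = HrefPattern.Matches(html);
        Console.WriteLine(matches.Count); // 3: all three styles are matched
    }
}
```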
Another problem I am currently facing is how to make sure that the links I have obtained are not broken; they should map to somewhere. The only way I can think of is to fetch each link the same way I fetch the www.google.com page content, with an HttpWebRequest object. Is there another way, or would this be the best approach? Thanks!!
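(One variation I have been considering, instead of downloading each page in full: issue a HEAD request and check only the status code. A rough sketch of what I mean, where `IsLinkAlive` is just a name I made up and real code would need more robust error handling:)

```csharp
using System;
using System.Net;

class LinkCheckSketch
{
    // Returns true if the URL answers a HEAD request with a non-error status.
    // Note: some servers reject HEAD, so a fallback GET may be needed.
    static bool IsLinkAlive(string url)
    {
        try
        {
            HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
            request.Method = "HEAD";   // ask for headers only, no body
            request.Timeout = 5000;    // fail fast on dead hosts
            using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
            {
                int status = (int)response.StatusCode;
                return status >= 200 && status < 400;
            }
        }
        catch (Exception)              // WebException, UriFormatException, etc.
        {
            return false;
        }
    }

    static void Main()
    {
        Console.WriteLine(IsLinkAlive("http://www.google.com"));
    }
}
```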
[Edit]Code block added[/Edit]