
Link Scanner

By Daniel Killewo, 14 Oct 2009
 


Introduction

Developers often have to write applications that parse some kind of input. This is a small example of how to parse a web page and get all the links that it contains. Examples like this are useful for beginner developers, and I think it will give an idea of how to create a simple parser. The example was written for a concrete problem, so it is not very abstract. The page to scan must be given as a URL.

Using the Code

Scanner.cs contains all of the logic:

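using System.Collections.Generic;
using System.IO;
using System.Net;
using System.Text.RegularExpressions;

public class Scanner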
{
    //regular expression patterns
    //absolute http/https url
    private static string urlPattern = @"http(s)?://([\w-]+\.)+[\w-]+(/[\w- ./?%&=]*)?";
    //opening <a ...> tag together with its attributes
    private static string tagPattern = @"<a\b[^>]*>";
    //e-mail address
    private static string emailPattern = @"\w+([-+.']\w+)*@\w+([-.]\w+)*\.\w+([-.]\w+)*";

    // gets all links that the url contains
    public static List<string> getInnerUrls(string url) {
        var innerUrls = new List<string>();

        //create the WebRequest for the url, e.g. "http://www.codeproject.com"
        WebRequest request = WebRequest.Create(url);

        //read the whole response body as html; the using blocks
        //dispose the response and the reader when done
        string htmlCode;
        using (WebResponse response = request.GetResponse())
        using (var reader = new StreamReader(response.GetResponseStream())) {
            htmlCode = reader.ReadToEnd();
        }

        List<string> links = getMatches(htmlCode);
        foreach (string link in links) {
            //a link that is neither an absolute url nor an e-mail address
            //is relative, so it refers to the same site
            if (!Regex.IsMatch(link, urlPattern) && !Regex.IsMatch(link, emailPattern)) {
                //form an absolute url for the link
                string absoluteUrlPath = getAblosuteUrl(getDomainName(url), link);
                innerUrls.Add(absoluteUrlPath);
            }
            else {
                innerUrls.Add(link);
            }
        }
        return innerUrls;
    }

    // get all links that the page contains
    private static List<string> getMatches(string source) {
        var matchesList = new List<string>();
        //get all matches of the tag pattern
        MatchCollection matches = Regex.Matches(source, tagPattern);
        //add the value of each double-quoted href attribute to the list
        foreach (Match match in matches) {
            string val = match.Value.Trim();
            if (val.Contains("href=\"")) {
                string link = getSubstring(val, "href=\"", "\"");
                matchesList.Add(link);
            }
        }
        return matchesList;
    }

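    private static string getSubstring(string source, string start, string end) {
        // implementation omitted: returns the substring of source
        // that lies between start and end
        throw new System.NotImplementedException();
    }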

    // creates an absolute url for a link that points to the same site
    private static string getAblosuteUrl(string domainName, string path) {
        // implementation omitted: forms and returns an absolute url
        // for a resource that belongs to the site
        throw new System.NotImplementedException();
    }

    private static string getDomainName(string url) {
        // implementation omitted: returns the url path where the page is stored
        throw new System.NotImplementedException();
    }
}
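
The bodies of getSubstring, getAblosuteUrl and getDomainName are not shown in the listing. The following is only a minimal sketch of what they might look like, not the original implementation. It makes a few assumptions: the class name ScannerHelpers is hypothetical, getDomainName is assumed to return just the scheme and host (the comment in the listing could also be read as the folder of the page), and relative links are therefore resolved against the site root. The method names are kept exactly as they are called in Scanner, and the methods are public here only so that they can be tried on their own.

```csharp
using System;

// Hypothetical stand-ins for the helper bodies that the listing leaves out.
public static class ScannerHelpers
{
    // Returns the text between the first occurrence of start and the
    // next occurrence of end, or an empty string if either is missing.
    public static string getSubstring(string source, string start, string end)
    {
        int from = source.IndexOf(start, StringComparison.Ordinal);
        if (from < 0) return string.Empty;
        from += start.Length;
        int to = source.IndexOf(end, from, StringComparison.Ordinal);
        return to < 0 ? string.Empty : source.Substring(from, to - from);
    }

    // Assumption: "domain name" means scheme plus host, without a trailing slash.
    public static string getDomainName(string url)
    {
        return new Uri(url).GetLeftPart(UriPartial.Authority);
    }

    // Resolves a relative path against the domain name using System.Uri.
    public static string getAblosuteUrl(string domainName, string path)
    {
        return new Uri(new Uri(domainName), path).AbsoluteUri;
    }
}
```

With these bodies, getDomainName("http://www.codeproject.com/KB/cs/page.aspx") gives "http://www.codeproject.com", and getAblosuteUrl("http://www.codeproject.com", "/KB/index.aspx") gives "http://www.codeproject.com/KB/index.aspx". Letting System.Uri do the resolution avoids joining strings by hand, which easily produces doubled or missing slashes.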

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

About the Author

Daniel Killewo
Software Developer
Ukraine
I'm a .NET developer. I love exploring and trying out new things.


Comments and Discussions

 
Question: convert to C# (sulmain, 19 Feb '13 - 9:50)
Can I convert it to C#?
Answer: Re: convert to C# (Daniel Killewo, 19 Feb '13 - 9:57)
Sorry, maybe I haven't understood your question correctly, but it is C#.
General: Re: convert to C# (sulmain, 27 Feb '13 - 21:29)
Yes, what I meant to ask is: can I convert it to a Windows application in C# (a desktop application)? ;)
General: Re: convert to C# (Daniel Killewo, 1 Mar '13 - 0:22)
Currently I don't intend to create a sample desktop application from this example. However, if you do it and would like to share the code, you can send it to me and I'll add it to the post.

Last Updated 14 Oct 2009
Article Copyright 2009 by Daniel Killewo
Everything else Copyright © CodeProject, 1999-2013