
Link Scanner

By Daniel Killyevo, 14 Oct 2009


Introduction

Developers often have to write applications that parse something. This small example shows how to parse a web page and get all the links it contains. Examples like this are useful for beginner developers, and I think it gives an idea of how to build a simple parser. It was written to solve a concrete problem, so it is not very abstract. The address of the web page must be passed as a URL.
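Because the scanner expects the page address to be a URL, it can help to check the input before requesting it. The following is only a minimal sketch and is not part of the original Scanner.cs; the class name UrlCheck and the method name IsWebUrl are hypothetical.

```csharp
using System;

public static class UrlCheck
{
    // Hypothetical helper (not in the original article): returns true
    // only for absolute http or https URLs.
    public static bool IsWebUrl(string candidate)
    {
        Uri uri;
        return Uri.TryCreate(candidate, UriKind.Absolute, out uri)
            && (uri.Scheme == Uri.UriSchemeHttp || uri.Scheme == Uri.UriSchemeHttps);
    }
}
```

A caller could run this check first and skip the scan when it returns false.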

Using the Code

Scanner.cs contains all of the logic:

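using System.Collections.Generic;
using System.IO;
using System.Net;
using System.Text.RegularExpressions;

public class Scanner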
{
    //regular expression patterns
    private static string urlPattern = @"http(s)?://([\w-]+\.)+[\w-]+(/[\w- ./?%&=]*)?";
    private static string tagPattern = @"<a\b[^>]*(.*?)";
    private static string emailPattern = @"\w+([-+.']\w+)*@\w+([-.]\w+)*\.\w+([-.]\w+)*";


    // gets all links that the url contains
    public static List<string> getInnerUrls(string url) {
        var innerUrls = new List<string>();

        //create the WebRequest for the url, e.g. "http://www.codeproject.com"
        WebRequest request = WebRequest.Create(url);

        //read the html code from the web response; the using blocks
        //dispose the response and the reader when reading is done
        string htmlCode;
        using (WebResponse response = request.GetResponse())
        using (var reader = new StreamReader(response.GetResponseStream())) {
            htmlCode = reader.ReadToEnd();
        }

        List<string> links = getMatches(htmlCode);
        foreach (string link in links) {
            //a link that is neither an absolute url nor an e-mail address
            //is relative to the same site
            if (!Regex.IsMatch(link, urlPattern) && !Regex.IsMatch(link, emailPattern)) {
                //form an absolute url for the link
                string absoluteUrlPath = getAblosuteUrl(getDomainName(url), link);
                innerUrls.Add(absoluteUrlPath);
            }
            else {
                innerUrls.Add(link);
            }
        }
        return innerUrls;
    }

    // get all links that the page contains
    private static List<string> getMatches(string source) {
        var matchesList = new List<string>();
        //get all matches of the tag pattern
        MatchCollection matches = Regex.Matches(source, tagPattern);
        //add the value of each href attribute to the list
        foreach (Match match in matches) {
            string val = match.Value.Trim();
            if (val.Contains("href=\"")) {
                string link = getSubstring(val, "href=\"", "\"");
                matchesList.Add(link);
            }
        }
        return matchesList;
    }

    private static string getSubstring(string source, string start, string end) {
        // return the substring
    }

    // creates an absolute url for a link which the site contains
    private static string getAblosuteUrl(string domainName, string path) {
        // form and return an absolute url for a link that refers to the same site
    }

    private static string getDomainName(string url) {
        // return the url path where the page is stored
    }
}

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

About the Author

Daniel Killyevo
Software Developer
Ukraine
I'm a .NET developer. I love exploring and trying out new things.

Last Updated 14 Oct 2009
Article Copyright 2009 by Daniel Killyevo