
Link Scanner

14 Oct 2009
Gets all links that a page contains.


Introduction

Developers often have to write applications that parse something. This is a small example of how to parse a web page and get all the links it contains. Examples like this are really useful for beginner developers, and I think it will give an idea of how to create a nice parser. The example was written for a concrete problem, so it is not very abstract. The path of the web page must be a URL.
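Since the scanner expects a URL rather than a file path, it can help to check the input before making a request. The following is only a minimal sketch of such a check: the class name UrlCheck and the method IsHttpUrl are hypothetical and are not part of Scanner.cs.

```csharp
using System;

// Hypothetical helper, not part of the article's Scanner class.
public static class UrlCheck
{
    // Returns true only for absolute http:// or https:// addresses.
    public static bool IsHttpUrl(string input)
    {
        Uri uri;
        return Uri.TryCreate(input, UriKind.Absolute, out uri)
            && (uri.Scheme == Uri.UriSchemeHttp || uri.Scheme == Uri.UriSchemeHttps);
    }
}
```

With a check like this, a caller could reject an input such as "www.codeproject.com" (no scheme) before it reaches WebRequest.Create.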

Using the Code

Scanner.cs contains all of the logic:

C#
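using System.Collections.Generic;
using System.IO;
using System.Net;
using System.Text.RegularExpressions;

public class Scanner
{
    //regular expression patterns
    private static readonly string urlPattern = @"http(s)?://([\w-]+\.)+[\w-]+(/[\w- ./?%&=]*)?";
    //matches an opening <a ...> tag up to, but not including, the closing '>'
    private static readonly string tagPattern = @"<a\b[^>]*";
    private static readonly string emailPattern = @"\w+([-+.']\w+)*@\w+([-.]\w+)*\.\w+([-.]\w+)*";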


    // gets all links that the url contains
    public static List<string> getInnerUrls(string url) {
        var innerUrls = new List<string>();

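        //create the WebRequest for the url, e.g. "http://www.codeproject.com"
        WebRequest request = WebRequest.Create(url);

        //read the html code from the response stream; the using blocks
        //dispose the response and the reader once the page has been read
        string htmlCode;
        using (WebResponse response = request.GetResponse())
        using (var reader = new StreamReader(response.GetResponseStream())) {
            htmlCode = reader.ReadToEnd();
        }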

        List<string> links = getMatches(htmlCode);
        foreach (string link in links) {
            //a link that is neither an absolute url nor an e-mail address refers to the same site
            if (!Regex.IsMatch(link, urlPattern) && !Regex.IsMatch(link, emailPattern)) {
                //form an absolute url for the link
                string absoluteUrlPath = getAblosuteUrl(getDomainName(url), link);
                innerUrls.Add(absoluteUrlPath);
            }
            else {
                innerUrls.Add(link);
            }
        }
        return innerUrls;
    }

    // get all links that the page contains
    private static List<string> getMatches(string source) {
        var matchesList = new List<string>();
        //get the collection that match the tag pattern
        MatchCollection matches = Regex.Matches(source, tagPattern);
        //add the text under the href attribute
        //to the list
        foreach (Match match in matches) {
            string val = match.Value.Trim();
            if (val.Contains("href=\"")) {
                string link = getSubstring(val, "href=\"", "\"");
                matchesList.Add(link);
            }
        }
        return matchesList;
    }

    private static string getSubstring(string source, string start, string end) {
        // return the substring
    }

    // creates an absolute url for a link that the site contains
    private static string getAblosuteUrl(string domainName, string path) {
        // form and return an absolute url for a link that refers to the same site
    }

    private static string getDomainName(string url) {
        // return the url path where the page is stored
    }
}
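The three helper methods at the end of the listing are shown only as comments. The sketch below is one possible way to fill them in, not the author's implementation. The class name ScannerHelpers and the PascalCase method names are hypothetical (GetAbsoluteUrl stands in for getAblosuteUrl). It also assumes that "domain name" means the scheme and host of the page, so a relative link without a leading slash is resolved against the site root rather than against the page's folder.

```csharp
using System;

// Hypothetical stand-ins for the helper stubs in Scanner.cs.
// The bodies are assumptions, not the author's code.
public static class ScannerHelpers
{
    // Returns the text between the first occurrence of start and the next
    // occurrence of end after it, or an empty string if either is missing.
    public static string GetSubstring(string source, string start, string end)
    {
        int from = source.IndexOf(start, StringComparison.Ordinal);
        if (from < 0) return string.Empty;
        from += start.Length;
        int to = source.IndexOf(end, from, StringComparison.Ordinal);
        if (to < 0) return string.Empty;
        return source.Substring(from, to - from);
    }

    // Returns the scheme and host of an absolute url,
    // e.g. "http://www.codeproject.com" (assumed meaning of "domain name").
    public static string GetDomainName(string url)
    {
        return new Uri(url).GetLeftPart(UriPartial.Authority);
    }

    // Resolves a relative link against the domain name returned above.
    public static string GetAbsoluteUrl(string domainName, string path)
    {
        return new Uri(new Uri(domainName), path).AbsoluteUri;
    }
}
```

With helpers along these lines in place, a call such as Scanner.getInnerUrls("http://www.codeproject.com") would return the absolute form of every double-quoted href found in the page's anchor tags. Letting System.Uri do the resolution avoids hand-written string concatenation for cases such as a leading slash in the link.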

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


Written By
Software Developer
Ukraine
I'm a .Net Developer. Love exploring and trying out new things.

Comments and Discussions

 
Question: convert to C# (sulmain, 19-Feb-13 9:50)
Answer: Re: convert to C# (Daniel Killyevo, 19-Feb-13 9:57)
General: Re: convert to C# (sulmain, 27-Feb-13 21:29)
General: Re: convert to C# (Daniel Killyevo, 1-Mar-13 0:22)
Currently I don't intend to create a sample desktop application for this example. However, if you do and would like to share the code, you can send it to me and I'll add it to the post.
General: link (eko85, 19-Apr-11 2:20)
General: Re: link (Daniel Killyevo, 19-Apr-11 2:22)
General: Re: link (sulmain, 5-Mar-13 22:36)
General: Nice! (MrReed, 15-Dec-10 22:07)
General: Great Job! (codeadborn, 20-Oct-09 10:22)
