Link Scanner

14 Oct 2009, CPOL
Gets all links that a page contains.

Introduction

Developers often have to write applications that parse something. This is a small example of how to parse a web page and get all the links that it contains. Examples like this are really useful for beginner developers, and I think it will give an idea of how to create a nice parser. It was written to solve a concrete problem, so it is not very abstract. The location of the web page must be given as a URL.

Using the Code

Scanner.cs contains all of the logic:

C#
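using System.Collections.Generic;
using System.IO;
using System.Net;
using System.Text.RegularExpressions;

public class Scanner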
{
    //regular expression patterns
    private static string urlPattern = @"http(s)?://([\w-]+\.)+[\w-]+(/[\w- ./?%&=]*)?";
    private static string tagPattern = @"<a\b[^>]*(.*?)";
    private static string emailPattern = @"\w+([-+.']\w+)*@\w+([-.]\w+)*\.\w+([-.]\w+)*";


    // gets all links that the url contains
    public static List<string> getInnerUrls(string url) {
        var innerUrls = new List<string>();

        //create the WebRequest for the url, e.g. "http://www.codeproject.com"
        WebRequest request = WebRequest.Create(url);

        //read the html code from the response stream;
        //the using blocks dispose the response and the reader afterwards
        string htmlCode;
        using (WebResponse response = request.GetResponse())
        using (var reader = new StreamReader(response.GetResponseStream()))
        {
            htmlCode = reader.ReadToEnd();
        }

        List<string> links = getMatches(htmlCode);
        foreach (string link in links) {
            //a link that is neither an absolute url nor an e-mail address
            //is relative, so it refers to the same site
            if (!Regex.IsMatch(link, urlPattern) && !Regex.IsMatch(link, emailPattern)) {
                //form an absolute url for the link
                string absoluteUrlPath = getAblosuteUrl(getDomainName(url), link);
                innerUrls.Add(absoluteUrlPath);
            }
            else {
                innerUrls.Add(link);
            }
        }
        return innerUrls;
    }

    // get all links that the page contains
    private static List<string> getMatches(string source) {
        var matchesList = new List<string>();
        //get all matches of the tag pattern
        MatchCollection matches = Regex.Matches(source, tagPattern);
        //add the value of each double-quoted href attribute
        //to the list
        foreach (Match match in matches) {
            string val = match.Value.Trim();
            if (val.Contains("href=\"")) {
                string link = getSubstring(val, "href=\"", "\"");
                matchesList.Add(link);
            }
        }
        return matchesList;
    }

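    // returns the part of source between the start and end markers
    private static string getSubstring(string source, string start, string end) {
        // body omitted in the article: return the substring
        throw new System.NotImplementedException();
    }

    // creates an absolute url for a link that is relative to the site
    private static string getAblosuteUrl(string domainName, string path) {
        // body omitted in the article: form and return the absolute url
        throw new System.NotImplementedException();
    }

    private static string getDomainName(string url) {
        // body omitted in the article: return the url path where the page is stored
        throw new System.NotImplementedException();
    }
}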
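
The bodies of the three helper methods are not shown in the listing. The following is a minimal sketch of how they might look, not the original code. The class name ScannerHelpers and the PascalCase method names are hypothetical, and the sketch assumes that relative links are resolved against the site root with System.Uri.

```csharp
using System;

// Hypothetical stand-ins for the helper methods whose bodies are not shown.
public static class ScannerHelpers
{
    // Returns the text between the first occurrence of start and the next
    // occurrence of end, or an empty string when either marker is missing.
    public static string GetSubstring(string source, string start, string end)
    {
        int from = source.IndexOf(start, StringComparison.Ordinal);
        if (from < 0) return string.Empty;
        from += start.Length;
        int to = source.IndexOf(end, from, StringComparison.Ordinal);
        return to < 0 ? string.Empty : source.Substring(from, to - from);
    }

    // Returns the scheme and host of the page url,
    // e.g. "http://www.codeproject.com" (assumption: the site root is wanted).
    public static string GetDomainName(string url)
    {
        return new Uri(url).GetLeftPart(UriPartial.Authority);
    }

    // Resolves a relative link against the site root.
    public static string GetAbsoluteUrl(string domainName, string path)
    {
        return new Uri(new Uri(domainName), path).AbsoluteUri;
    }
}
```

For a made-up page address such as "http://www.codeproject.com/some/page.aspx", GetDomainName gives "http://www.codeproject.com", and GetAbsoluteUrl with that value and "/faq.aspx" gives "http://www.codeproject.com/faq.aspx". Resolving against the site root is a simplification: a link such as "page2.html" on a page inside a subfolder would be better resolved against the full page URL.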

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


Written By
Software Developer
Ukraine
I'm a .NET developer. I love exploring and trying out new things.
