Posted 14 Oct 2009
Licensed CPOL

Link Scanner

Daniel Killyevo, 14 Oct 2009
Gets all links that a page contains.


Developers often have to write applications that parse something. This small example shows how to parse a web page and get all the links it contains. Examples like this are useful for beginner developers, and I think it gives an idea of how to create a simple parser. It was written for a concrete problem, so it is not very abstract. The page to scan must be given as a URL.

Using the Code

Scanner.cs contains all of the logic:

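using System.Collections.Generic;
using System.IO;
using System.Net;
using System.Text.RegularExpressions;

public class Scanner {
    //regular expression patterns
    private static string urlPattern = @"http(s)?://([\w-]+\.)+[\w-]+(/[\w- ./?%&=]*)?";
    private static string tagPattern = @"<a\b[^>]*(.*?)";
    private static string emailPattern = @"\w+([-+.']\w+)*@\w+([-.]\w+)*\.\w+([-.]\w+)*";

    // gets all links that the page at the given url contains
    public static List<string> getInnerUrls(string url) {
        var innerUrls = new List<string>();

        //create the WebRequest for the url
        WebRequest request = WebRequest.Create(url);

        //read the html code from the response stream;
        //the using blocks dispose the response and the reader
        string htmlCode;
        using (WebResponse response = request.GetResponse())
        using (var reader = new StreamReader(response.GetResponseStream())) {
            htmlCode = reader.ReadToEnd();
        }

        List<string> links = getMatches(htmlCode);
        foreach (string link in links) {
            //a link that is neither an absolute url nor an email address
            //refers to the same site
            if (!Regex.IsMatch(link, urlPattern) && !Regex.IsMatch(link, emailPattern)) {
                //form an absolute url for the link
                string absoluteUrlPath = getAblosuteUrl(getDomainName(url), link);
                innerUrls.Add(absoluteUrlPath);
            }
            else {
                //this branch is cut off in the listing; keeping the link as-is is an assumption
                innerUrls.Add(link);
            }
        }
        return innerUrls;
    }

    // get all links that the page contains
    private static List<string> getMatches(string source) {
        var matchesList = new List<string>();
        //get the collection of matches for the tag pattern
        MatchCollection matches = Regex.Matches(source, tagPattern);
        //add the text under the href attribute to the list
        foreach (Match match in matches) {
            string val = match.Value.Trim();
            if (val.Contains("href=\"")) {
                string link = getSubstring(val, "href=\"", "\"");
                matchesList.Add(link);
            }
        }
        return matchesList;
    }

    // returns the text between start and the next occurrence of end
    // (body omitted in the listing; this is an assumed reconstruction)
    private static string getSubstring(string source, string start, string end) {
        int from = source.IndexOf(start);
        if (from < 0) return string.Empty;
        from += start.Length;
        int to = source.IndexOf(end, from);
        return to < 0 ? source.Substring(from) : source.Substring(from, to - from);
    }

    // creates an absolute url for a source which the site contains
    // (body omitted in the listing; this is an assumed reconstruction)
    private static string getAblosuteUrl(string domainName, string path) {
        return domainName.TrimEnd('/') + "/" + path.TrimStart('/');
    }

    // returns the url path where the page is stored
    // (body omitted in the listing; this is an assumed reconstruction)
    private static string getDomainName(string url) {
        int schemeEnd = url.IndexOf("://");
        int lastSlash = url.LastIndexOf('/');
        //no path after the host: the url itself is the base
        if (lastSlash <= schemeEnd + 2) return url;
        return url.Substring(0, lastSlash);
    }
}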

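The getMatches method in the listing only picks up href values written with double quotes. As a rough sketch of one way to also accept single quotes, the attribute value could be captured with a named regex group. The class name HrefExtractor, the method Extract and the pattern are illustrative assumptions, not part of Scanner.cs, and a regex like this will still miss unquoted attributes and other unusual markup.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; not part of the article's Scanner.cs.
public static class HrefExtractor
{
    // Assumed pattern: an <a ...> tag whose href value is wrapped in either
    // double or single quotes. The named group "url" holds the value.
    private static readonly Regex HrefPattern = new Regex(
        @"<a\b[^>]*?\bhref\s*=\s*(?:""(?<url>[^""]*)""|'(?<url>[^']*)')",
        RegexOptions.IgnoreCase);

    // Returns the raw href values in document order, without resolving them.
    public static List<string> Extract(string html)
    {
        var result = new List<string>();
        foreach (Match m in HrefPattern.Matches(html))
        {
            result.Add(m.Groups["url"].Value);
        }
        return result;
    }
}
```

Because it works on a string, this can be tried without any network access: passing `<a href='a.html'>x</a>` to Extract returns a list with the single entry `a.html`.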

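The listing builds absolute URLs with its own getAblosuteUrl and getDomainName helpers. A possible alternative, which is not what the article's code does, is to let the framework's System.Uri class resolve a link against the page URL. The class name LinkResolver and the method ToAbsolute below are illustrative assumptions.

```csharp
using System;

// Hypothetical helper for illustration; not part of the article's Scanner.cs.
public static class LinkResolver
{
    // Resolves a link found on pageUrl into an absolute URL string.
    // Links that are already absolute are returned in normalized form.
    // Returns null when the link cannot be combined with the page URL.
    public static string ToAbsolute(string pageUrl, string link)
    {
        var baseUri = new Uri(pageUrl, UriKind.Absolute);
        Uri resolved;
        if (Uri.TryCreate(baseUri, link, out resolved))
        {
            return resolved.AbsoluteUri;
        }
        return null;
    }
}
```

For example, `LinkResolver.ToAbsolute("http://example.com/docs/page.html", "../img/a.png")` gives `http://example.com/img/a.png`, so segments such as `..` and links that start with `/` are handled without manual string work.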
This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


About the Author

Daniel Killyevo
Software Developer
Ukraine
I'm a .Net Developer. Love exploring and trying out new things.

Last Updated 14 Oct 2009
Article Copyright 2009 by Daniel Killyevo