Creating Link Extractor and Filter in C#: Part 1

3 Oct 2013 | CPOL | 2 min read
How to extract all the links from a webpage using a web client.

Introduction

In this article we will learn how to extract all the links from a web page using a web client (here, the Windows Forms WebBrowser control). By the end of this part you will have an application that extracts the links from a page; filtering those links by parameters of your choice is covered in the next part. Let's dive straight into the code.

Creating the Link Grabber

We are creating a link grabber, and it is a good idea to settle the logic before writing any code.

The logic is:

  • We need the address of the page to crawl. We can read it from a TextBox.
  • Next we download that page, using either a WebClient or a WebBrowser control. The code in this article uses a WebBrowser control, which gives us a parsed HtmlDocument.
  • Once we have the HTML document, we extract the links from it.
  • Most of the useful links are stored in the href attribute of the anchor (a) tags.
  • So we need all the anchor elements of the page, which GetElementsByTagName("a") returns as a collection.
  • Finally, we read the href attribute of each anchor and add it to a list. Here the list is a CheckedListBox.
  • At that point we have all the extracted links.
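
The extraction steps above can also be sketched without any UI. This is only a minimal illustration, not the approach used in the rest of this article: it assumes the page has already been downloaded into a string, and it swaps the DOM call GetElementsByTagName() for a regular expression, which is only an approximation of real HTML parsing. The names LinkSketch and ExtractLinks are made up for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class LinkSketch
{
    // Matches href="..." or href='...' inside an opening <a ...> tag.
    // Unquoted href values and unusual markup are not handled.
    private static readonly Regex AnchorHref = new Regex(
        @"<a\b[^>]*?\bhref\s*=\s*(?:""(?<url>[^""]*)""|'(?<url>[^']*)')",
        RegexOptions.IgnoreCase);

    // Collects the href value of every anchor found in the HTML text,
    // in document order. Anchors without an href are skipped.
    public static List<string> ExtractLinks(string html)
    {
        var links = new List<string>();
        foreach (Match m in AnchorHref.Matches(html))
        {
            string url = m.Groups["url"].Value.Trim();
            if (url.Length > 0)
            {
                links.Add(url);
            }
        }
        return links;
    }
}
```

In the WebClient variant of the logic, the string passed to ExtractLinks would come from WebClient.DownloadString(). Note that, unlike the WebBrowser control, this sketch returns href values exactly as written in the page, so relative links stay relative.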

With the logic settled, let's build the application.

The Code

The following steps build the grabber. The code uses the default control names (textBox1, button1 and checkedListBox1).

  1. Open Visual Studio and choose "New Project".
  2. Choose "Visual C#" -> "Windows" -> "Windows Forms Application".
  3. Drag a TextBox from the Toolbox onto the form.
  4. Drag a Button from the Toolbox onto the form and set its text to "grab".
  5. Drag a CheckedListBox from the Toolbox onto the form.
  6. Double-click the button to generate its click handler.
  7. Add the following code for the click handler:

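    using System;
    using System.Collections.Generic;
    using System.ComponentModel;
    using System.Data;
    using System.Drawing;
    using System.Linq;
    using System.Text;
    using System.Threading.Tasks;
    using System.Windows.Forms;
    namespace linkGrabber
    {
        public partial class Form1 : Form
        {
            public Form1()
            {
                InitializeComponent();
            }
            // Click handler: loads the page whose address is in the text box.
            private void button1_Click(object sender, EventArgs e)
            {
                Uri address;
                if (!Uri.TryCreate(textBox1.Text.Trim(), UriKind.Absolute, out address))
                {
                    MessageBox.Show("Please enter a full URL, for example http://example.com/");
                    return;
                }
                checkedListBox1.Items.Clear();
                WebBrowser wb = new WebBrowser();
                // Subscribe before navigating so the completion event cannot be missed.
                wb.DocumentCompleted += wb_DocumentCompleted;
                wb.Url = address;
            }
            // Raised when a document has finished loading; it can fire more than once
            // for pages that contain frames.
            void wb_DocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e)
            {
                HtmlDocument source = ((WebBrowser)sender).Document;
                extractLink(source);
            }
            // Adds the href of every anchor element to the checked list box.
            private void extractLink(HtmlDocument source)
            {
                HtmlElementCollection anchorList = source.GetElementsByTagName("a");
                foreach (HtmlElement item in anchorList)
                {
                    string href = item.GetAttribute("href");
                    // Skip anchors without an href and links that are already listed.
                    if (!string.IsNullOrEmpty(href) && !checkedListBox1.Items.Contains(href))
                    {
                        checkedListBox1.Items.Add(href);
                    }
                }
            }
        }
    }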
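
The click handler builds a Uri from whatever is typed in the text box, so input such as example.com (with no scheme) is not accepted as an absolute URL. One possible way to be more forgiving is sketched below. It is an assumption on my part rather than part of the original grabber: the helper name UrlInput.TryParse is hypothetical, and defaulting to http when the scheme is missing is a design choice you may not want.

```csharp
using System;

public static class UrlInput
{
    // Hypothetical helper: turns text typed by the user into an absolute
    // http or https Uri. Returns false when that is not possible.
    public static bool TryParse(string text, out Uri uri)
    {
        uri = null;
        if (string.IsNullOrWhiteSpace(text))
        {
            return false;
        }

        string candidate = text.Trim();
        // Assumption: default to http when the user leaves out the scheme.
        if (candidate.IndexOf("://", StringComparison.Ordinal) < 0)
        {
            candidate = "http://" + candidate;
        }

        Uri parsed;
        if (!Uri.TryCreate(candidate, UriKind.Absolute, out parsed))
        {
            return false;
        }
        // Only web pages make sense for the grabber, so reject other schemes.
        if (parsed.Scheme != Uri.UriSchemeHttp && parsed.Scheme != Uri.UriSchemeHttps)
        {
            return false;
        }

        uri = parsed;
        return true;
    }
}
```

In the form, the handler could call UrlInput.TryParse(textBox1.Text, out address) and only navigate the WebBrowser when it returns true.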


Summary

That’s it. You now have a working link grabber. You can extend it by adding a filter; in the next part I will show how to add a filter and how to download files. Thanks for reading, and don’t forget to comment and share.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL).


Written By
Student
India
