Technical Blog


Creating Link Extractor and Filter in C#: Part 1

3 Oct 2013, CPOL
How to extract all the links from a web page in C# using the Windows Forms WebBrowser control.

Introduction

In this article we will learn how to extract all the links from a web page. The page can be downloaded with either a web client or a WebBrowser control; the code in this part uses the WebBrowser control. By the end of this part you will have an application that extracts the links from a page. Filtering those links on the basis of parameters you choose is covered in the next part. So let’s dive into the code.

Creating the Link Grabber

Before writing the link grabber, it’s a good idea to clarify the logic behind it. So let’s define the logic.

The logic is:

  • We need the address of the page to crawl. We can read it from a TextBox.
  • Next we download the page. We can use either a web client or a WebBrowser control for this.
  • Once we have the HTML document, we extract the links from it.
  • Most of the useful links are in the href attribute of the anchor (<a>) tags.
  • So we need the anchor elements of the page, which we can get with GetElementsByTagName("a").
  • That gives us a collection of all the anchor elements.
  • For each anchor we read the href attribute and add it to a list. Here the list is a CheckedListBox.
  • Now we have all the extracted links.
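
The steps above can also be sketched without any UI, using the web client alternative mentioned in the second step. This is only a minimal sketch under stated assumptions, not the code this article builds (that uses the WebBrowser control): the class `LinkSketch` and its methods `ExtractHrefs` and `GrabLinks` are hypothetical names, and the regular expression is a simplification that only handles quoted href values. A real HTML parser would be more robust.

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the article's project.
public static class LinkSketch
{
    // Simplified pattern: matches <a ... href="..."> or <a ... href='...'>.
    // Assumption: href values are quoted; unquoted values are ignored.
    private static readonly Regex AnchorHref = new Regex(
        @"<a\b[^>]*?\shref\s*=\s*(?:""([^""]*)""|'([^']*)')",
        RegexOptions.IgnoreCase);

    // Extracts href values from an HTML string and resolves them
    // against the page address so relative links become absolute.
    public static List<string> ExtractHrefs(string html, Uri pageUri)
    {
        var links = new List<string>();
        foreach (Match m in AnchorHref.Matches(html))
        {
            string href = m.Groups[1].Success ? m.Groups[1].Value : m.Groups[2].Value;
            Uri absolute;
            if (href.Length > 0 && Uri.TryCreate(pageUri, href, out absolute))
            {
                links.Add(absolute.AbsoluteUri);
            }
        }
        return links;
    }

    // Downloads the page with WebClient, then extracts its links.
    public static List<string> GrabLinks(string pageUrl)
    {
        var pageUri = new Uri(pageUrl);
        using (var client = new WebClient())
        {
            string html = client.DownloadString(pageUri);
            return ExtractHrefs(html, pageUri);
        }
    }
}
```

Unlike the WebBrowser control, WebClient only downloads the raw HTML and does not run scripts, so links that a page adds with JavaScript would not appear in the result.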

Now let’s code the preceding logic.

The Code

The following steps build the grabber.

  1. Open Visual Studio and choose "New Project".
  2. Choose "Visual C#" -> Windows -> "Windows Forms Application".
  3. Drop a TextBox from the Toolbox onto the form.
  4. Drop a Button from the Toolbox onto the form and set its text to "grab" (the code below assumes the control keeps its default name, button1).
  5. Add a CheckedListBox from the Toolbox onto the form.
  6. Double-click the button to generate the click handler.
  7. Add the following code for the click handler:

    using System;
    using System.Collections.Generic;
    using System.ComponentModel;
    using System.Data;
    using System.Drawing;
    using System.Linq;
    using System.Text;
    using System.Threading.Tasks;
    using System.Windows.Forms;
    namespace linkGrabber
    {
        public partial class Form1 : Form
        {
            public Form1()
            {
                InitializeComponent();
            }
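            private void button1_Click(object sender, EventArgs e)
            {
                // Clear links left over from a previous grab.
                checkedListBox1.Items.Clear();
                WebBrowser wb = new WebBrowser();
                // Subscribe before navigating so the event cannot be missed.
                wb.DocumentCompleted += wb_DocumentCompleted;
                // Setting Url starts the navigation. The text must be an
                // absolute address such as http://example.com.
                wb.Url = new Uri(textBox1.Text);
            }
            void wb_DocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e)
            {
                WebBrowser wb = (WebBrowser)sender;
                // DocumentCompleted also fires for each frame; handle only the top-level page.
                if (e.Url != wb.Url) return;
                HtmlDocument source = wb.Document;
                extractLink(source);
            }
            private void extractLink(HtmlDocument source)
            {
                // Collect every anchor element on the page.
                HtmlElementCollection anchorList = source.GetElementsByTagName("a");
                foreach (HtmlElement item in anchorList)
                {
                    // Anchors without an href return an empty string; skip them.
                    string href = item.GetAttribute("href");
                    if (!string.IsNullOrEmpty(href))
                    {
                        checkedListBox1.Items.Add(href);
                    }
                }
            }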
        }
    }
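
The click handler passes the text box content straight to `new Uri(...)`, which throws a `UriFormatException` when the text is not an absolute address (for example `example.com` without a scheme). One way to guard against that is sketched below. The class `UrlInput` and the method `TryParsePageUrl` are hypothetical names, and treating scheme-less input as `http` is an assumption, not something the article specifies.

```csharp
using System;

// Hypothetical helper, not part of the article's code.
public static class UrlInput
{
    // Returns true and an absolute http/https Uri when the text can be
    // interpreted as a web address; otherwise returns false.
    public static bool TryParsePageUrl(string text, out Uri pageUri)
    {
        pageUri = null;
        if (string.IsNullOrWhiteSpace(text)) return false;

        string candidate = text.Trim();
        // Assumption: input without a scheme is meant as http.
        if (candidate.IndexOf("://", StringComparison.Ordinal) < 0)
        {
            candidate = "http://" + candidate;
        }

        Uri parsed;
        if (!Uri.TryCreate(candidate, UriKind.Absolute, out parsed)) return false;

        // Only web pages are of interest here; reject ftp, file and so on.
        if (parsed.Scheme != Uri.UriSchemeHttp && parsed.Scheme != Uri.UriSchemeHttps) return false;

        pageUri = parsed;
        return true;
    }
}
```

In the click handler, the result of `TryParsePageUrl(textBox1.Text, out pageUri)` could decide whether to assign `wb.Url = pageUri` or to show a message asking for a valid address.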


Summary

That’s it. You now have a working link grabber. You can extend it by adding a filter. In the next part I will show how to add a filter and how to download files. Thanks for reading, and don’t forget to comment and share.
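
As a rough idea of what such a filter could look like, the sketch below keeps only the links whose path ends with a given file extension. It is an assumption about one possible filter, not the approach taken in the next part; `LinkFilterSketch` and `FilterByExtension` are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, not part of the article's code.
public static class LinkFilterSketch
{
    // Keeps absolute links whose path ends with the given extension
    // (for example ".pdf"), ignoring case, and drops exact duplicates.
    public static List<string> FilterByExtension(IEnumerable<string> links, string extension)
    {
        return links
            .Where(link =>
            {
                Uri uri;
                // AbsolutePath excludes the query string, so "file.pdf?dl=1" still matches.
                return Uri.TryCreate(link, UriKind.Absolute, out uri)
                    && uri.AbsolutePath.EndsWith(extension, StringComparison.OrdinalIgnoreCase);
            })
            .Distinct()
            .ToList();
    }
}
```

With the grabber above, the input could be `checkedListBox1.Items.Cast<string>()`, since every item added to the list is an href string.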

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


About the Author

Arpit Jain
Student
India

Article Copyright 2013 by Arpit Jain