Web Extractor using Regular Expressions

27 Mar 2009, BSD License, 2 min read
A flexible open source web extractor that lets you specify your own regular expressions.

Introduction

What is a Web Extractor

A web extractor is software that extracts information such as URLs, images, email addresses, and phone and fax numbers from the web. The user specifies a URL to crawl, and the tool automatically extracts the required information.

WebExtractor360

WebExtractor360 is a free and open source web data extractor. It extracts images, phrases, HTML headers, HTML tables, URLs (links), URLs (keywords), email addresses, phone and fax numbers, and any other information on the web that can be described by a regular expression. Because the regular expression is user-defined, the tool is not limited to the built-in data types.

Supported Features

  • Extract URLs
  • Extract Images
  • Extract URL Titles
  • Extract HTML Tables
  • Extract Phone
  • Extract Fax
  • Extract Emails
  • Extract any web data using a Regular Expression

Using the Code

This project is written in C# (.NET 2.0). The software starts by crawling the specified URL or a local file resource. All data that matches the Match (Regular Expression) field is returned as a result. Once matching is complete for the specified URL, the crawler continues with the URLs that the page links to. The process repeats until the maximum number of URLs has been reached or there are no more URLs to process.
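The crawl loop described above can be sketched roughly as follows. This is a minimal illustration and not the WebExtractor360 source: the `CrawlSketch` class, the `Crawl` method, the `fetch` delegate, the simplified `href` pattern and the `example.test` pages are all assumptions made for the example. Pages are served from an in-memory dictionary so the sketch runs without network access.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class CrawlSketch
{
    // Hypothetical helper: crawls breadth-first from 'start', applies
    // 'matchPattern' to every fetched page, and stops after 'maxUrls'
    // pages or when no URLs are left to process.
    public static List<string> Crawl(Uri start, string matchPattern,
                                     int maxUrls, Func<Uri, string> fetch)
    {
        var results = new List<string>();
        var pending = new Queue<Uri>();
        var seen = new HashSet<string>();
        var match = new Regex(matchPattern, RegexOptions.IgnoreCase);
        // Simplified link pattern for this sketch only (quoted href values).
        var href = new Regex(@"href\s*=\s*[""'](?<url>[^""'#]+)[""']",
                             RegexOptions.IgnoreCase);

        pending.Enqueue(start);
        seen.Add(start.AbsoluteUri);
        int processed = 0;

        while (pending.Count > 0 && processed < maxUrls)
        {
            Uri current = pending.Dequeue();
            string html = fetch(current);
            processed++;
            if (html == null) continue; // page could not be fetched

            // Step 1: collect everything that matches the user's expression.
            foreach (Match m in match.Matches(html))
                results.Add(m.Value);

            // Step 2: queue every link that has not been seen before.
            foreach (Match m in href.Matches(html))
            {
                Uri next;
                if (Uri.TryCreate(current, m.Groups["url"].Value, out next)
                    && seen.Add(next.AbsoluteUri))
                {
                    pending.Enqueue(next);
                }
            }
        }
        return results;
    }

    public static void Main()
    {
        // Two made-up pages that link to each other.
        var pages = new Dictionary<string, string>
        {
            { "http://example.test/", "<a href=\"/a.html\">A</a> contact: one@example.test" },
            { "http://example.test/a.html", "<a href=\"/\">home</a> contact: two@example.test" }
        };
        Func<Uri, string> fetch = u =>
        {
            string html;
            return pages.TryGetValue(u.AbsoluteUri, out html) ? html : null;
        };

        // Prints one@example.test, then two@example.test.
        foreach (string r in Crawl(new Uri("http://example.test/"),
                                   @"[\w.+-]+@[\w-]+(\.[\w-]+)+", 10, fetch))
        {
            Console.WriteLine(r);
        }
    }
}
```

Passing the page source as a delegate keeps the loop itself independent of how pages are downloaded; a real crawler would supply a function that performs the HTTP request or reads the local file.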

During the crawl, the class in ExtractorProcessingEngine.cs performs the regular expression matching. The method that handles the matching is shown below:

C#
// Applies the user-supplied regular expression to the page contents
// and records every match.
public override void doHandleContents(Source source)
{
    int counter = 0; // number of matches found on this page
    Match m = new Regex(expressions[0], RegexOptions.IgnoreCase).Match(source.Data);
    while (m.Success)
    {
        Console.WriteLine("Results - " + m.Groups[PARAM].Value);

        // Store the page URL together with the captured text.
        ResultsValueObject rvo = new ResultsValueObject();
        rvo.WebPage = source.Uri.AbsoluteUri;
        rvo.Result = m.Groups[PARAM].Value;

        m_emailList.Add(rvo);
        counter++;
        m = m.NextMatch();
    }
}
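To try the same Match/NextMatch loop outside the project, a self-contained sketch might look like the one below. The `MatchSketch` class, the `ExtractAll` method, the email pattern and the sample HTML are assumptions made for illustration. The group name `PARAM` mirrors the constant used above, whose actual value is not shown in the article.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class MatchSketch
{
    // Returns the text captured by the named group for every match in 'data'.
    public static List<string> ExtractAll(string data, string pattern, string groupName)
    {
        var found = new List<string>();
        Match m = new Regex(pattern, RegexOptions.IgnoreCase).Match(data);
        while (m.Success)
        {
            found.Add(m.Groups[groupName].Value);
            m = m.NextMatch(); // continue searching after the previous match
        }
        return found;
    }

    public static void Main()
    {
        string html = "<h1>Contact</h1><p>Sales: sales@example.test, "
                    + "Support: support@example.test</p>";

        // Illustrative email pattern; the named group holds the value to report.
        string pattern = @"(?<PARAM>[\w.+-]+@[\w-]+(?:\.[\w-]+)+)";

        // Prints sales@example.test, then support@example.test.
        foreach (string s in ExtractAll(html, pattern, "PARAM"))
            Console.WriteLine(s);
    }
}
```

Reading the result from a named group rather than from the whole match lets a custom expression include surrounding context (a label or an HTML tag, for example) while still returning only the part of interest.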

The class in WebProcessingEngine.cs crawls the specified website automatically by finding all the hyperlinks on each page. The method that finds the hyperlinks is shown below:

C#
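public void HandleLinks(Source source)
{
    int counter = 0; // number of links accepted by AddWebPage

    // Capture the target of each href attribute. The negative lookahead is
    // meant to exclude fragment, mailto, location and javascript targets.
    // The pattern is split into two concatenated literals so that no line
    // break ends up inside the expression.
    Match m = new Regex(
        @"(?:href\s*=)(?:[\s""']*)(?!#|mailto|location.|javascript)" +
        @"(?<PARAM1>.*?)(?:[\s>""'])",
        RegexOptions.IgnoreCase).Match(source.Data);
    while (m.Success)
    {
        string link = m.Groups["PARAM1"].Value;

        // Optionally report each link through the callback.
        if (Singleton.GetInstance().ShareClientModelHolder
                .ShareSearchOptionsVO.ReportLinksFound)
        {
            ExtractorCurrentResultsValueObject vo =
                new ExtractorCurrentResultsValueObject();
            vo.NoProgress = 1;
            vo.Links = "Links Found - " + link;
            Console.WriteLine("MyLinks-" + vo.Links);
            m_action.CommonCallBack(vo);
        }
        if (AddWebPage(source.Uri, link))
        {
            counter++;
        }
        m = m.NextMatch();
    }
}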
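The link pattern above can be exercised on its own with a small sketch like the following. The `LinkSketch` class, the `FindLinks` method and the sample HTML are assumptions made for illustration, and resolving links with `Uri.TryCreate` is one possible approach, not necessarily what `AddWebPage` does. Note that for an excluded target such as `href='#top'`, the expression can still succeed with an empty capture through backtracking, so the sketch skips empty values.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class LinkSketch
{
    // Same expression as in HandleLinks, written as two concatenated literals.
    private static readonly Regex Href = new Regex(
        @"(?:href\s*=)(?:[\s""']*)(?!#|mailto|location.|javascript)" +
        @"(?<PARAM1>.*?)(?:[\s>""'])",
        RegexOptions.IgnoreCase);

    // Returns the absolute URL of every usable link found in 'html'.
    public static List<string> FindLinks(Uri pageUri, string html)
    {
        var links = new List<string>();
        Match m = Href.Match(html);
        while (m.Success)
        {
            string target = m.Groups["PARAM1"].Value;
            Uri absolute;
            // Skip empty captures and resolve relative links against the page URL.
            if (target.Length > 0 && Uri.TryCreate(pageUri, target, out absolute))
                links.Add(absolute.AbsoluteUri);
            m = m.NextMatch();
        }
        return links;
    }

    public static void Main()
    {
        string html = "<a href=\"page2.html\">next</a> "
                    + "<a href='#top'>top</a> "
                    + "<a href=\"mailto:a@example.test\">mail</a> "
                    + "<a href=\"http://other.test/x\">other</a>";
        var page = new Uri("http://example.test/dir/index.html");

        // Prints http://example.test/dir/page2.html, then http://other.test/x.
        foreach (string link in FindLinks(page, html))
            Console.WriteLine(link);
    }
}
```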

This tool lets anyone extract arbitrary information from the web using regular expressions. It also ships with many commonly used regular expressions for extracting web data, so users with no knowledge of regular expressions can still use the software.

Original Website

For updates and more information on WebExtractor360, please visit the original site for this project: Web Extractor.

History

  • 27th March, 2009: Initial version

License

This article, along with any associated source code and files, is licensed under The BSD License.


