Extract RSS feeds from Web pages

25 Sep 2004, 4 min read
Shows how to extract RSS feeds from Web pages.

Introduction

I love RSS readers. They save me a lot of time. Wouldn't it be nice if we could convert any Web data into RSS format? Then we could view bank records, credit card records, online shop promotions, e-mail subscriptions, etc. in one standard way.

Unfortunately, not many Web sites provide RSS/Atom feeds. In this article, I will show that RSS extraction is a very simple task, especially if the proper technology is used.

How to extract

We will only consider Web pages built with HTML or DHTML. At first glance, the task looks very simple: download the HTML pages locally and parse them. But it can take hours to write the code even for a simple web site, and the code is hard to keep working because changes to the web site can break it.

The following approaches can be used to extract data from Web pages: "Raw" HTTP, IE Automation, and SWExplorerAutomation.

"Raw" HTTP

HTTP is the "raw" approach. We use WebRequest (.NET) to download the page source locally. The RSS data can then be extracted with XPath or regular expressions. To use XPath, the page source must first be converted to XML (XHTML), for example with HTML Tidy.
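As a rough illustration of the raw approach, the sketch below downloads a page with WebRequest and pulls link targets and link texts out of the source with a regular expression. The class and method names (RawHttpSketch, DownloadPage, ExtractLinks) are made up for this example, and the pattern assumes simple, quoted href attributes. Real pages will need something more robust.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Net;
using System.Text.RegularExpressions;

// Illustrative helper, not part of the article's sample code.
public static class RawHttpSketch
{
    // Downloads the page source as text.
    // Assumes the page needs no cookies, login, or JavaScript.
    public static string DownloadPage(string url)
    {
        WebRequest request = WebRequest.Create(url);
        using (WebResponse response = request.GetResponse())
        using (StreamReader reader = new StreamReader(response.GetResponseStream()))
        {
            return reader.ReadToEnd();
        }
    }

    // Returns (href, link text) pairs for every <a href="..."> found in the HTML.
    public static List<KeyValuePair<string, string>> ExtractLinks(string html)
    {
        var links = new List<KeyValuePair<string, string>>();
        var anchor = new Regex(
            "<a\\s[^>]*href\\s*=\\s*[\"']([^\"']+)[\"'][^>]*>(.*?)</a>",
            RegexOptions.IgnoreCase | RegexOptions.Singleline);
        foreach (Match m in anchor.Matches(html))
        {
            links.Add(new KeyValuePair<string, string>(
                m.Groups[1].Value, m.Groups[2].Value.Trim()));
        }
        return links;
    }
}
```

Calling RawHttpSketch.ExtractLinks(RawHttpSketch.DownloadPage(url)) chains the two steps. Keeping the download separate from the parsing makes it possible to try the pattern against saved HTML first.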

Pros

  • Performance is very fast.

Cons

  • Requires knowledge of TCP/IP, HTTP, HTTPS, cookies, etc.
  • Because HTML is often not well formed, HTML-to-XML conversion will not always work.
  • Very unstable. Even simple changes to a web page layout will break an extraction.
  • Will not work with web pages whose content is generated by JavaScript.
  • Time consuming to develop.

IE automation

This solution is based on accessing the HTML DOM. We can use Internet Explorer automation or host the WebBrowser control to get access to the HTML DOM data model.

Pros

  • Can work with any web page shown in IE.
  • Doesn't require knowledge of TCP/IP, HTTP, HTTPS, cookies, etc.

Cons

  • Changes to web site layout will break an extraction.
  • Requires a good knowledge of Web Browser events, HTML DOM, COM.
  • Not as fast as the HTTP approach.
  • Time consuming to develop.

SWExplorerAutomation

Picture 1. SWExplorerAutomation class diagram.

SWExplorerAutomation is a framework which converts a web application into programmable objects: scenes (pages) and controls. These objects are defined visually in a designer and are accessible from any .NET language.

Pros

  • Can work with any web page shown in IE.
  • Doesn't require knowledge of TCP/IP, HTTP, HTTPS, cookies, etc.
  • Separates data extraction from program logic.
  • Effectively handles error conditions.
  • Takes minutes to write code.

Cons

  • Not as fast as the HTTP approach.

SWExplorerAutomation Example

To illustrate how SWExplorerAutomation can be used to extract RSS feeds from web pages, I wrote a sample application which extracts an RSS feed from the CNN web site. I created the following definitions (scenes) for CNN pages: [CnnNews], [Sport], [Money], [Main]. Each scene contains an HtmlContent control which extracts data from a defined region of the page.

First, we create and initialize an ExplorerManager instance. ExplorerManager is initialized from the [cnn_rss.htp] project file, which was created visually in the SWExplorerAutomation designer. The Connect() method starts an Internet Explorer instance and connects to it. ExplorerManager then navigates the browser to the main CNN page.

C#
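// Start Internet Explorer and attach to it.
ExplorerManager explorerManager = new ExplorerManager();
explorerManager.Connect();
// Load the scene definitions created in the visual designer.
explorerManager.LoadProject(@"..\..\cnn_rss.htp");
explorerManager.Navigate("http://www.cnn.com/");
// Write the RSS channel header (rssw and scene are not declared in this excerpt).
rssw.WriteChannel("CNN", "CNN News", scene.Descriptor.Url);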

The code waits until the scene defined for the main CNN page is activated. It then uses XPathDataExtractor to extract the list of article links from the web page.

C#
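// Wait for the CnnNews scene to become active; give up when the timeout expires.
scene = explorerManager["CnnNews"];
if (!scene.WaitForActive(60000))
    return "";
// Evaluate the named XPath expression "ItemList" to get the article links.
XmlNodeList nodeList = ((HtmlContent)(scene["HtmlContent_0"]))
    .XPathDataExtractor.Expressions["ItemList"].SelectNodes();
for (int i = 0; i < nodeList.Count; i++) {
    //……..
}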
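To show what a link-selecting XPath expression does independently of SWExplorerAutomation, here is a minimal sketch that uses only System.Xml on a well-formed XHTML fragment. XPathSketch and SelectHrefs are names invented for this example. How XPathDataExtractor evaluates expressions against the live IE DOM is not shown in the article, so this is only an analogy.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

// Illustrative helper, not part of SWExplorerAutomation.
public static class XPathSketch
{
    // Evaluates xpath relative to the root element of a well-formed
    // XHTML fragment and returns the href attribute of each selected node.
    public static List<string> SelectHrefs(string xhtml, string xpath)
    {
        var doc = new XmlDocument();
        doc.LoadXml(xhtml);
        var hrefs = new List<string>();
        foreach (XmlNode node in doc.DocumentElement.SelectNodes(xpath))
        {
            XmlAttribute href = node.Attributes["href"];
            if (href != null)
            {
                hrefs.Add(href.Value);
            }
        }
        return hrefs;
    }
}
```

For example, an expression such as DIV[1]/DIV/A[1] selects the first link inside every DIV nested in the first top-level DIV. Adding a predicate like [position() != 2] to the inner DIV step skips the second block. XPath over XML is case sensitive, so the element names in the expression must match the case used in the document.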

The same sequence of actions (Navigate, Wait, Extract) is repeated for every article link.

C#
for (int i = 0; i < nodeList.Count; i++) {
  string link = nodeList[i].Attributes["href"].Value;
  explorerManager.Navigate(link);
  // Wait for whichever article scene matches the loaded page.
  Scene[] scenes = explorerManager.WaitForActive(
        new string[] {"Main", "Money", "Sport"}, 20000);
  if (scenes == null)
    continue; // nothing activated; skip this link
  scene = scenes[0];
  XPathDataExtractor xe =
    ((HtmlContent)(scene["HtmlContent_0"])).XPathDataExtractor;
  string title = xe.Expressions["Title"].SelectNodes()[0].InnerText;
  string pubDateStr = xe.Expressions["PubDate"].SelectNodes()[0].InnerText;
  WriteRssItem(title, link, pubDateStr,
    xe.Expressions["Content"].SelectNodes());
  scene.Deactivate();
}
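The snippets above call rssw.WriteChannel and WriteRssItem without showing how they are implemented. As a minimal sketch of what such a writer might look like, the code below builds an RSS 2.0 document with XmlWriter. The RssItemSketch and RssFeedSketch types are hypothetical, and the sketch takes the item description as a plain string rather than the node list used in the sample.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;

// Hypothetical item type; the article does not show its own.
public sealed class RssItemSketch
{
    public string Title;
    public string Link;
    public DateTime PubDate;
    public string Description;
}

// Hypothetical stand-in for the sample's RSS writing helpers.
public static class RssFeedSketch
{
    // Builds an RSS 2.0 document: one channel followed by its items.
    public static string BuildFeed(string title, string description, string link,
                                   IEnumerable<RssItemSketch> items)
    {
        var output = new StringWriter();
        var settings = new XmlWriterSettings { Indent = true, OmitXmlDeclaration = true };
        using (XmlWriter w = XmlWriter.Create(output, settings))
        {
            w.WriteStartElement("rss");
            w.WriteAttributeString("version", "2.0");
            w.WriteStartElement("channel");
            // title, link and description are the required channel elements.
            w.WriteElementString("title", title);
            w.WriteElementString("link", link);
            w.WriteElementString("description", description);
            foreach (RssItemSketch item in items)
            {
                w.WriteStartElement("item");
                w.WriteElementString("title", item.Title);
                w.WriteElementString("link", item.Link);
                // The "r" format yields an RFC 1123 date, e.g. "... 12:00:00 GMT".
                w.WriteElementString("pubDate",
                    item.PubDate.ToUniversalTime().ToString("r"));
                w.WriteElementString("description", item.Description);
                w.WriteEndElement(); // </item>
            }
            w.WriteEndElement(); // </channel>
            w.WriteEndElement(); // </rss>
        }
        return output.ToString();
    }
}
```

Note that a date scraped from a page as free text would have to be parsed into a DateTime before it can be written in the format RSS readers expect. WriteElementString escapes characters such as & and < automatically.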

The code is completely metadata driven and does not need to change if the CNN site design changes.

Using Visual Designer to create cnn_rss.htp

Screenshot 1. SWExplorerAutomation Visual Designer

To create cnn_rss.htp using SWDesigner

  1. On the Explorer menu, click Run.
  2. Navigate IE to http://www.cnn.com/.
  3. On the Scene Editor menu, click Start.
  4. Right-click in IE to show the context menu, then click SceneEditor\Text Selection Mode.
  5. Mark text on the CNN page, then click SceneEditor\Select control on the context menu. An HtmlContent control is added to the project.
  6. Rename the control to CnnNews.
  7. Run the XPathDataExtractor custom property editor.
  8. Define a named XPath expression: point at an HTML link with the mouse cursor and left-click to calculate its XPath expression. Then change the expression so that it selects the list of links (for example, DIV[1]/DIV[position() != 7]/A[1]).
  9. Click the Add button and rename the named expression to "ItemList".
  10. Click the Exec button to test the expression, then close the XPathDataExtractor dialog.
  11. Navigate to one of the news articles. Mark text on the page and create a control (as in step 5).
  12. Create the following named XPath expressions: PubDate, Content and Title.
  13. Change the scene descriptor URL pattern to the regular expression "http://www\.cnn\.com/2004(.*)" and change the title pattern to "CNN\.com\ -(.*)".
  14. Repeat steps 11-13 for Money and Sport.
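To get a feel for what the URL and title patterns from the steps above accept, they can be tried out with the standard .NET Regex class. The sketch below copies the two patterns verbatim. ScenePatternSketch and its Matches method are invented for this example, and the assumption that the framework requires both patterns to match is a guess.

```csharp
using System;
using System.Text.RegularExpressions;

// Illustrative check, not part of SWExplorerAutomation.
public static class ScenePatternSketch
{
    // Patterns as entered in the scene descriptor.
    public const string UrlPattern = @"http://www\.cnn\.com/2004(.*)";
    public const string TitlePattern = @"CNN\.com\ -(.*)";

    // Assumption: a page belongs to the scene when both patterns match.
    public static bool Matches(string url, string title)
    {
        return Regex.IsMatch(url, UrlPattern)
            && Regex.IsMatch(title, TitlePattern);
    }
}
```

With these patterns, a page at http://www.cnn.com/2004/WORLD/story.html titled "CNN.com - Some headline" matches, while the CNN front page does not, because its URL lacks the /2004 prefix. The hard-coded year also means the URL pattern stops matching articles published under a different year path.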

To view cnn_rss.htp using SWDesigner

  1. On the Project menu, click Open.
  2. Select "cnn_rss.htp".
  3. On the Scene Editor menu, click Start.
  4. Select the CnnNews scene. On the context menu, click Navigate.
  5. Run the XPathDataExtractor custom property editor.
  6. Repeat steps 4-5 for all scenes.

Using the code

Don't forget to register SWExplorerAutomation.dll. It is a Browser Helper Object and must be registered before the sample will work.

Summary

This article explains how to extract RSS feeds from web pages using SWExplorerAutomation. It took me less than 10 minutes to write and test the example code. Future articles will explain SWExplorerAutomation in more detail and in more complex situations.

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.
