Posted 25 Sep 2004

Extract RSS feeds from Web pages

Alex Furman, 25 Sep 2004
Shows how to extract RSS feeds from Web pages.


I love RSS readers. They save me a lot of time. Wouldn't it be nice if we could convert any Web data into RSS format? Then we could view bank records, credit card records, online shop promotions, e-mail subscriptions, etc. in one standard way.

Unfortunately, not many Web sites provide RSS/Atom feeds. In this article, I will show that RSS extraction is a simple task, especially when the proper technology is used.

How to extract

We will only consider Web pages built with HTML or DHTML. At first glance, the task looks very simple: download the HTML pages locally and parse them. But it can take hours to write the code even for a simple web site, and the code is hard to keep working, because changes to the web site can break it.

The following approaches can be used to extract data from Web pages: "Raw" HTTP, IE Automation, and SWExplorerAutomation.

"Raw" HTTP

HTTP is the "raw" approach. We use WebRequest (.NET) to download the page source locally. The RSS data can then be extracted with XPath or regular expressions. To use XPath, the page source must first be converted to XML (XHTML) using HTML Tidy.


Pros

  • Very fast.

Cons

  • Requires knowledge of TCP/IP, HTTP, HTTPS, cookies, etc.
  • Because HTML is often not well formed, HTML-to-XML conversion will not always work.
  • Very unstable. Even simple changes to a web page layout will break an extraction.
  • Does not work with pages whose content is generated by JavaScript.
  • Time-consuming to develop.
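
As a rough illustration of the "raw" approach, the sketch below downloads a page with WebRequest and pulls links out of the source with a regular expression. The class name RawHttpSketch, its method names, and the link pattern are assumptions made for this example and are not part of the article's download. The pattern is deliberately simplistic, which also shows why this approach is fragile.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration only; not from the article's download.
public static class RawHttpSketch
{
    // Matches <a ... href="...">text</a>. Simplistic on purpose: real-world
    // HTML (unquoted attributes, scripts, comments) will defeat it.
    private static readonly Regex LinkPattern = new Regex(
        "<a\\s[^>]*?href\\s*=\\s*[\"']([^\"']+)[\"'][^>]*>(.*?)</a>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Downloads the page source with WebRequest.
    // Note: newer .NET versions mark WebRequest as obsolete in favor of HttpClient.
    public static string Download(string url)
    {
        WebRequest request = WebRequest.Create(url);
        using (WebResponse response = request.GetResponse())
        using (StreamReader reader = new StreamReader(response.GetResponseStream()))
        {
            return reader.ReadToEnd();
        }
    }

    // Returns (href, link text) pairs for every anchor found in the HTML.
    public static List<KeyValuePair<string, string>> ExtractLinks(string html)
    {
        var links = new List<KeyValuePair<string, string>>();
        foreach (Match m in LinkPattern.Matches(html))
        {
            string href = m.Groups[1].Value;
            // Strip any nested tags (for example <b>) from the link text.
            string text = Regex.Replace(m.Groups[2].Value, "<[^>]+>", "").Trim();
            links.Add(new KeyValuePair<string, string>(href, text));
        }
        return links;
    }
}
```

A caller would pass the result of RawHttpSketch.Download(url) to ExtractLinks and turn each pair into an RSS item. Any change to the page markup that the pattern does not anticipate silently changes the result, which is the instability listed above.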

IE automation

This solution is based on accessing the HTML DOM. We can use Internet Explorer automation or host the Web Browser control to get access to the HTML DOM data model.


Pros

  • Works with any web page that can be shown in IE.
  • Doesn't require knowledge of TCP/IP, HTTP, HTTPS, cookies, etc.

Cons

  • Changes to the web site layout will break an extraction.
  • Requires a good knowledge of Web Browser events, the HTML DOM, and COM.
  • Not as fast as the HTTP approach.
  • Time-consuming to develop.


SWExplorerAutomation

Picture 1. SWExplorerAutomation class diagram.

SWExplorerAutomation is a framework which converts a web application into programmable objects: scenes (pages) and controls. Those objects are defined visually in a visual designer and are accessible from any .NET language.


Pros

  • Works with any web page that can be shown in IE.
  • Doesn't require knowledge of TCP/IP, HTTP, HTTPS, cookies, etc.
  • Separates data extraction from program logic.
  • Handles error conditions effectively.
  • Takes minutes to write code.

Cons

  • Not as fast as the HTTP approach.

SWExplorerAutomation Example

To illustrate how SWExplorerAutomation can be used to extract RSS feeds from web pages, I wrote a sample application which extracts an RSS feed from the CNN web site. I created the following definitions (scenes) for the CNN pages: [CnnNews], [Sport], [Money], [Main]. Each scene contains an HtmlContent control which extracts data from a defined place on the page.

First, we create and initialize an ExplorerManager instance. ExplorerManager is initialized from the [cnn_rss.htp] project file, which was created visually in the SWExplorerAutomation designer. The ExplorerManager Connect() method starts an Internet Explorer instance and connects to it. ExplorerManager then navigates the browser to the main CNN page.

ExplorerManager explorerManager = new ExplorerManager();
rssw.WriteChannel("CNN", "CNN News", scene.Descriptor.Url);

The code waits until the scene defined for the main CNN page is activated. It uses XPathDataExtractor to extract the list of article links from the web page.

scene = explorerManager["CnnNews"];
if (!scene.WaitForActive(60000))
    return "";
XmlNodeList nodeList = ((HtmlContent)(scene["HtmlContent_0"])).
for (int i = 0; i < nodeList.Count; i++) {

The same sequence of actions (Navigate, Wait, Extract) is repeated for every article link.

for (int i = 0; i < nodeList.Count; i++) {
  string link = nodeList[i].Attributes["href"].Value;
  Scene[] scenes = explorerManager.WaitForActive(
        new string[] {"Main", "Money", "Sport"}, 20000);
  if (scenes == null)
    continue; // no article scene became active in time; skip this link
  scene = scenes[0];
  XPathDataExtractor xe =
  string title = xe.Expressions["Title"].SelectNodes()[0].InnerText;
  string pubDateStr = xe.Expressions["PubDate"].SelectNodes()[0].InnerText;
  WriteRssItem(title, link,

The code is completely metadata-driven and does not need to change if the CNN site design changes.
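
The snippets above call an RSS writer (rssw.WriteChannel, WriteRssItem) whose implementation is not shown. The following is a minimal sketch of how such a helper might look, built on System.Xml.XmlWriter. The class name RssWriterSketch, the WriteItem method, and the meaning assigned to each parameter (title, description, link) are assumptions for illustration; the helper in the article's download may differ.

```csharp
using System;
using System.IO;
using System.Xml;

// Hypothetical stand-in for the article's RSS writer; names and parameter
// meanings are assumptions, not taken from the original download.
public sealed class RssWriterSketch : IDisposable
{
    private readonly XmlWriter writer;

    public RssWriterSketch(TextWriter output)
    {
        var settings = new XmlWriterSettings { Indent = true, OmitXmlDeclaration = true };
        writer = XmlWriter.Create(output, settings);
        writer.WriteStartElement("rss");
        writer.WriteAttributeString("version", "2.0");
    }

    // Opens <channel> and writes its three required RSS 2.0 elements.
    public void WriteChannel(string title, string description, string link)
    {
        writer.WriteStartElement("channel");
        writer.WriteElementString("title", title);
        writer.WriteElementString("link", link);
        writer.WriteElementString("description", description);
    }

    // Writes one complete <item>; XmlWriter escapes special characters.
    public void WriteItem(string title, string link, string pubDate)
    {
        writer.WriteStartElement("item");
        writer.WriteElementString("title", title);
        writer.WriteElementString("link", link);
        writer.WriteElementString("pubDate", pubDate);
        writer.WriteEndElement(); // </item>
    }

    public void Dispose()
    {
        writer.WriteEndDocument(); // closes any elements still open (</channel>, </rss>)
        writer.Dispose();
    }
}
```

Wrapping the writer in a using block around the extraction loop ensures the channel and rss elements are closed even if the loop exits early.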

Using Visual Designer to create cnn_rss.htp

Screenshot 1. SWExplorerAutomation Visual Designer

To create cnn_rss.htp using SWDesigner

  1. On the Explorer menu, click Run.
  2. Navigate IE to
  3. On the Scene Editor menu, click Start.
  4. Right-click to show the IE context menu. Click SceneEditor\Text Selection Mode.
  5. Mark text on the CNN page. Click SceneEditor\Select control on the context menu. The HtmlContent control will be added to the project.
  6. Rename the control to CnnNews.
  7. Run the XPathDataExtractor custom property editor.
  8. Define a named XPath expression: point at an HTML link with the mouse cursor and left-click to calculate its XPath expression. Change the expression so that it selects the list of links (for example, DIV[1]/DIV[position() != 7]/A[1]).
  9. Click the Add button. Rename the named expression to "ItemList".
  10. Click the Exec button to test the expression, then close the XPathDataExtractor dialog.
  11. Navigate to one of the news articles. Mark text on the page and create a control (step 5).
  12. Create the following named XPath expressions: PubDate, Content and Title.
  13. Change the Scene descriptor URL pattern to the regular expression “http://www\.cnn\.com/2004(.*)” and change the title pattern to “CNN\.com\ -(.*)”
  14. Repeat 11-13 for Money and Sport.
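
To get a feel for what the expressions in the steps above select, they can be tried outside the designer with the standard .NET XML and regex classes. The sketch below is only an approximation: the class and method names are hypothetical, it assumes the XPath is evaluated relative to the root element of a well-formed XHTML fragment, and it assumes a scene matches when both its URL pattern and its title pattern match. How SWExplorerAutomation evaluates them internally is not described in the article.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;
using System.Xml;

// Hypothetical helper for experimenting with the designer's expressions.
public static class DesignerExpressionsSketch
{
    // Evaluates an XPath expression relative to the document element and
    // returns the href attribute of every selected node that has one.
    public static List<string> SelectHrefs(string xhtml, string xpath)
    {
        var doc = new XmlDocument();
        doc.LoadXml(xhtml);
        var hrefs = new List<string>();
        foreach (XmlNode node in doc.DocumentElement.SelectNodes(xpath))
        {
            XmlAttribute href = node.Attributes["href"];
            if (href != null)
                hrefs.Add(href.Value);
        }
        return hrefs;
    }

    // True when the URL and the title both match the given regular expressions.
    // Regex.IsMatch is unanchored, so a pattern may match anywhere in the input.
    public static bool MatchesScene(string url, string title,
                                    string urlPattern, string titlePattern)
    {
        return Regex.IsMatch(url, urlPattern) && Regex.IsMatch(title, titlePattern);
    }
}
```

With a fragment whose first DIV contains eight child DIVs, each holding one link, SelectHrefs with DIV[1]/DIV[position() != 7]/A[1] returns the first link of every child except the seventh. In the title pattern, the backslash before the space makes the space a literal character, so a title such as "CNN.com - World" matches.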

To view cnn_rss.htp using SWDesigner

  1. On the Project menu, click Open.
  2. Select “cnn_rss.htp”.
  3. On the Scene Editor menu, click Start.
  4. Select the CnnNews scene. On the context menu, click Navigate.
  5. Run the XPathDataExtractor custom property editor.
  6. Repeat 4-5 for all scenes.

Using the code

Don't forget to register SWExplorerAutomation.dll: it is a Browser Helper Object and must be registered before it can be used.


This article explains how to extract RSS feeds from web pages using SWExplorerAutomation. It took me less than 10 minutes to write and test the example code for the article. Future articles will cover SWExplorerAutomation in more detail and in more complex situations.


This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.



About the Author

Alex Furman
Web Developer
United States

Last Updated 25 Sep 2004
Article Copyright 2004 by Alex Furman