Extract RSS feeds from Web pages

25 Sep 2004
Shows how to extract RSS feeds from Web pages.

Introduction

I love RSS readers. They save me a lot of time. Wouldn't it be nice if we could convert any Web data into RSS format? Then we could view bank records, credit card records, online shop promotions, e-mail subscriptions, etc. in one standard way.

Unfortunately, few Web sites provide RSS/Atom feeds. In this article, I will show that RSS extraction is a simple task, especially if the right technology is used.

How to extract

We will only consider Web pages developed using HTML or DHTML. At first glance, the task looks very simple: download the HTML pages locally and parse them. But it can take hours to write the code even for a simple web site, and it is hard to keep the code working; web site changes can break it.

The following approaches can be used to extract data from Web pages: "Raw" HTTP, IE Automation, and SWExplorerAutomation.

"Raw" HTTP

The "raw" HTTP approach uses WebRequest (.NET) to download the page source locally. The RSS data can then be extracted with XPath or regular expressions. To use XPath, the page source must first be converted to XML (XHTML), for example with HTML Tidy.
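As a rough illustration of the "raw" approach, the sketch below downloads a page with WebRequest and pulls anchors out of it with a regular expression. The class name, method names and the anchor pattern are my own assumptions for illustration, not part of the article's sample; the pattern is deliberately naive and only handles double-quoted href attributes, which is exactly the kind of fragility listed in the cons.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the article's download.
public static class RawHttpSketch
{
    // Naive pattern: matches <a ... href="...">text</a> with a double-quoted href only.
    private static readonly Regex AnchorPattern = new Regex(
        @"<a\s+[^>]*?href\s*=\s*""(?<href>[^""]*)""[^>]*>(?<text>.*?)</a>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Downloads the page source as text.
    public static string Download(string url)
    {
        WebRequest request = WebRequest.Create(url);
        using (WebResponse response = request.GetResponse())
        using (StreamReader reader = new StreamReader(response.GetResponseStream()))
        {
            return reader.ReadToEnd();
        }
    }

    // Returns (href, link text) pairs for every anchor the pattern finds.
    public static List<KeyValuePair<string, string>> ExtractLinks(string html)
    {
        var links = new List<KeyValuePair<string, string>>();
        foreach (Match m in AnchorPattern.Matches(html))
        {
            links.Add(new KeyValuePair<string, string>(
                m.Groups["href"].Value,
                m.Groups["text"].Value.Trim()));
        }
        return links;
    }
}
```

Calling RawHttpSketch.ExtractLinks(RawHttpSketch.Download(url)) would give the candidate feed items; any change in the page markup that the pattern does not anticipate silently drops links.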

Pros

  • Performance is very fast.

Cons

  • Requires knowledge of TCP/IP, HTTP, HTTPS, cookies, etc.
  • Because HTML is often not well formed, HTML-to-XML conversion will not always work.
  • Very unstable. Even simple changes to a web page layout will break the extraction.
  • Will not work with web pages whose content is generated by JavaScript.
  • Time consuming to develop.

IE automation

This solution is based on accessing the HTML DOM. We can use Internet Explorer automation or host the Web Browser control to get access to the HTML DOM data model.

Pros

  • Can work with any web page shown in IE.
  • Doesn't require knowledge of TCP/IP, HTTP, HTTPS, cookies, etc.

Cons

  • Changes to web site layout will break an extraction.
  • Requires a good knowledge of Web Browser events, HTML DOM, COM.
  • Not as fast as the raw HTTP approach.
  • Time consuming.

SWExplorerAutomation

Picture 1. SWExplorerAutomation class diagram.

SWExplorerAutomation is a framework which converts a web application into programmable objects: scenes (pages) and controls. These objects are defined visually in a designer and are accessible from any .NET language.

Pros

  • Can work with any web page shown in IE.
  • Doesn't require knowledge of TCP/IP, HTTP, HTTPS, cookies, etc.
  • Separates data extraction from program logic.
  • Effectively handles error conditions.
  • Takes minutes to write code.

Cons

  • Not as fast as the raw HTTP approach.

SWExplorerAutomation Example

To illustrate how SWExplorerAutomation can be used to extract RSS feeds from web pages, I wrote a sample application which extracts an RSS feed from the CNN web site. I have created the following definitions (scenes) for CNN pages: [CnnNews], [Sport], [Money], [Main]. Each scene contains an HtmlContent control which extracts data from a defined region of the page.

First, we create and initialize an ExplorerManager instance. ExplorerManager is initialized from the [cnn_rss.htp] project file, which was created visually in the SWExplorerAutomation designer. The ExplorerManager Connect() method starts an Internet Explorer instance and connects to it. Then ExplorerManager navigates the browser to the main CNN page.

C#
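// Create the manager, start Internet Explorer and attach to it.
ExplorerManager explorerManager = new ExplorerManager();
explorerManager.Connect();
// Load the scene definitions created in the visual designer.
explorerManager.LoadProject(@"..\..\cnn_rss.htp");
explorerManager.Navigate("http://www.cnn.com/");
// rssw (the sample's RSS writer) and scene are declared elsewhere in the sample.
rssw.WriteChannel("CNN", "CNN News", scene.Descriptor.Url);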

The code waits until the scene defined for the main CNN page is activated. It then uses XPathDataExtractor to extract the list of article links from the web page.

C#
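scene = explorerManager["CnnNews"];
// Give the main page up to 60 seconds to load and match the scene.
if (!scene.WaitForActive(60000))
    return "";
// Evaluate the named XPath expression defined in the designer.
XmlNodeList nodeList = ((HtmlContent)(scene["HtmlContent_0"]))
    .XPathDataExtractor.Expressions["ItemList"].SelectNodes();
for (int i = 0; i < nodeList.Count; i++) {
    // ... process each article link
}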
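To make the role of a named expression such as "ItemList" more concrete, here is a small sketch that evaluates a similar relative XPath expression with the standard System.Xml classes. It does not use SWExplorerAutomation at all: the class name, the method and the well-formed sample fragment are assumptions for illustration, and how XPathDataExtractor evaluates expressions internally is not described in this article.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

// Hypothetical helper, not part of SWExplorerAutomation.
public static class ItemListSketch
{
    // Evaluates a relative XPath expression against the root element of a
    // well-formed fragment and returns the href of every selected node.
    public static List<string> SelectLinks(string xhtmlFragment, string xpath)
    {
        var doc = new XmlDocument();
        doc.LoadXml(xhtmlFragment);

        var hrefs = new List<string>();
        XmlNodeList nodes = doc.DocumentElement.SelectNodes(xpath);
        foreach (XmlNode node in nodes)
        {
            XmlAttribute href = node.Attributes["href"];
            if (href != null)
            {
                hrefs.Add(href.Value);
            }
        }
        return hrefs;
    }
}
```

For a fragment whose first DIV holds three child DIVs with one link each, an expression like DIV[1]/DIV[position() != 2]/A[1] selects the links in the first and third child and skips the second, which is the same trick the article's expression uses to exclude one block from the list.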

The same sequence of actions (Navigate, Wait, Extract) is repeated for every article link.

C#
for (int i = 0; i < nodeList.Count; i++) {
  string link = nodeList[i].Attributes["href"].Value;
  explorerManager.Navigate(link);
  // Wait up to 20 seconds for any of the article scenes to become active.
  Scene[] scenes = explorerManager.WaitForActive(
        new string[] {"Main", "Money", "Sport"}, 20000);
  if (scenes == null)
    continue;
  scene = scenes[0];
  XPathDataExtractor xe =
    ((HtmlContent)(scene["HtmlContent_0"])).XPathDataExtractor;
  string title = xe.Expressions["Title"].SelectNodes()[0].InnerText;
  string pubDateStr = xe.Expressions["PubDate"].SelectNodes()[0].InnerText;
  WriteRssItem(title, link, pubDateStr,
    xe.Expressions["Content"].SelectNodes());
  scene.Deactivate();
}
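The WriteRssItem and rssw.WriteChannel helpers are not shown in the article text. The sketch below is one way such a writer might look, built on System.Xml.XmlWriter; the class name SimpleRssWriter and its method signatures are assumptions and may differ from the code in the download. The publication date is written as the extracted string without conversion, so a real feed would probably need to reformat it.

```csharp
using System;
using System.IO;
using System.Xml;

// Hypothetical RSS 2.0 writer; the sample's real helper may differ.
public sealed class SimpleRssWriter : IDisposable
{
    private readonly XmlWriter writer;
    private bool channelOpen;

    public SimpleRssWriter(TextWriter output)
    {
        var settings = new XmlWriterSettings { Indent = true, OmitXmlDeclaration = true };
        writer = XmlWriter.Create(output, settings);
    }

    // Opens <rss><channel> and writes the channel's title, link and description.
    public void WriteChannel(string title, string description, string link)
    {
        writer.WriteStartElement("rss");
        writer.WriteAttributeString("version", "2.0");
        writer.WriteStartElement("channel");
        writer.WriteElementString("title", title);
        writer.WriteElementString("link", link);
        writer.WriteElementString("description", description);
        channelOpen = true;
    }

    // Writes one <item>; text content is escaped by XmlWriter.
    public void WriteItem(string title, string link, string pubDate, string description)
    {
        if (!channelOpen)
        {
            throw new InvalidOperationException("Call WriteChannel first.");
        }
        writer.WriteStartElement("item");
        writer.WriteElementString("title", title);
        writer.WriteElementString("link", link);
        writer.WriteElementString("pubDate", pubDate);
        writer.WriteElementString("description", description);
        writer.WriteEndElement();
    }

    // Closes </channel></rss> if they were opened, then flushes the writer.
    public void Dispose()
    {
        if (channelOpen)
        {
            writer.WriteEndElement(); // </channel>
            writer.WriteEndElement(); // </rss>
            channelOpen = false;
        }
        writer.Dispose();
    }
}
```

Wrapping the writer in a using block around the extraction loop keeps the feed well formed even when some articles are skipped because no scene became active.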

The code is completely metadata driven: if the CNN site design changes, only the scene definitions in the project file need to be updated, not the code.

Using Visual Designer to create cnn_rss.htp

Screenshot 1. SWExplorerAutomation Visual Designer

To create cnn_rss.htp using SWDesigner

  1. On the Explorer menu, click Run.
  2. Navigate IE to http://www.cnn.com/.
  3. On the Scene Editor menu, click Start.
  4. Right-click to show the IE context menu and click SceneEditor\Text Selection Mode.
  5. Mark text on the CNN page and click SceneEditor\Select control on the context menu. An HtmlContent control is added to the project.
  6. Rename the control to CnnNews.
  7. Run the XPathDataExtractor custom property editor.
  8. Define a named XPath expression: point at an HTML link with the mouse cursor and left-click to calculate its XPath expression. Change the expression so that it selects the list of links (for example, DIV[1]/DIV[position() != 7]/A[1]).
  9. Click the Add button. Rename the named expression to "ItemList".
  10. Click the Exec button to test the expression, then close the XPathDataExtractor dialog.
  11. Navigate to one of the news articles. Mark text on the page and create a control (as in step 5).
  12. Create the following named XPath expressions: PubDate, Content and Title.
  13. Change the Scene descriptor URL pattern to the regular expression "http://www\.cnn\.com/2004(.*)" and change the title pattern to "CNN\.com\ -(.*)".
  14. Repeat steps 11-13 for Money and Sport.
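The URL and title patterns of a scene descriptor are ordinary regular expressions, so they can be tried out with System.Text.RegularExpressions before they are put into the project. The sketch below is only a way to test such patterns; the class and method names are assumptions, and whether SWExplorerAutomation anchors the patterns or combines them exactly like this is not stated in the article.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical pattern tester, not part of SWExplorerAutomation.
public static class SceneMatchSketch
{
    // True when both the page URL and the page title match their patterns.
    public static bool Matches(string urlPattern, string titlePattern,
                               string url, string title)
    {
        return Regex.IsMatch(url, urlPattern) && Regex.IsMatch(title, titlePattern);
    }
}
```

With the patterns from the steps above, a page at http://www.cnn.com/2004/WORLD/story.html titled "CNN.com - Some headline" matches, while a page under http://money.cnn.com/ does not, which is why Money and Sport need their own scene descriptors.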

To view cnn_rss.htp using SWDesigner

  1. On the Project menu, click Open.
  2. Select "cnn_rss.htp".
  3. On the Scene Editor menu, click Start.
  4. Select the CnnNews scene. On the context menu, click Navigate.
  5. Run the XPathDataExtractor custom property editor.
  6. Repeat steps 4-5 for all scenes.

Using the code

Remember to register SWExplorerAutomation.dll before running the sample. It is a Browser Helper Object and has to be registered.

Summary

The article explains how to extract RSS feeds from web pages using SWExplorerAutomation. It took me less than 10 minutes to write and test the article's example code. Future articles will explain SWExplorerAutomation in more detail and in more complex situations.

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.

Comments and Discussions

 
Great! but I have a little trouble here.
wyx2000, 10-Oct-04 20:49

Re: Great! but I have a little trouble here.
Alex Furman, 11-Oct-04 5:20
That is probably my fault. There is an HtmlImage control in SWEA, but if the image is used as a link on a page, the program creates an HtmlAnchor instead.

I suggest the following:

1. Use the mouse to select (green) the "Permapink" text. Then use the CTRL key or the menu to create an HtmlContent control. Edit the descriptor of the control and change the XPath expression to "HTML[1]/BODY[1]/DIV[1]/TABLE[2]/TBODY[1]/TR[1]/TD[1]/TABLE[1]/TBODY[1]/TR[1]".
Select the Control View tab; you should see all the images there.

2. The control has an XPathDataExtractor property. It allows you to extract the URLs of all the images. Open the custom editor for the property. Highlight and click one of the images (shades), then click another shade. You will see the difference in the XPath: the TABLE and TD indexes are different. Replace TABLE[xxx] with TABLE and TD[xxx] with TD. The result will be the expression "TR[1]/TD[3]/TABLE/TBODY[1]/TR[2]/TD/TABLE[1]/TBODY[1]/TR[1]/TD[1]/A[1]/IMG[1]".
Press Exec and you will see all the image links. Add the expression to the named expression list (press the Add button). Rename it to ImageList. Press OK to save.

3. Run the script recorder. Select the scene and right-click. Select "Navigate" from the context menu. Select C#/VB.NET to generate template code. Click Create Visual Studio Project.
You now have the template code and a project file.

4. The updated code is below.
namespace Test {
    using System;
    using System.IO;
    using System.Xml;
    using System.Net;
    using SWExplorerAutomation.Client;
    using SWExplorerAutomation.Client.Controls;

    public class TestCode {

        public static void Main() {
            SWExplorerAutomation.Client.ExplorerManager explorerManager =
                new SWExplorerAutomation.Client.ExplorerManager();
            SWExplorerAutomation.Client.Scene scene;
            explorerManager.Connect(-1);
            explorerManager.LoadProject("C:\\Alf\\ImageExtractor\\ImageExtractor.htp");
            scene = explorerManager["Scene_0"];
            explorerManager.Navigate(scene.Descriptor.Url);

            // Wait up to 60 seconds for the scene to become active.
            scene.WaitForActive(60000);
            if (!scene.IsActive()) return;
            string xml = scene["HtmlContent_0"].OuterXml;
            XmlNodeList nl = (scene["HtmlContent_0"] as HtmlContent)
                .XPathDataExtractor.Expressions["ImageList"].SelectNodes();
            // Print each image URL and save the image as 0.gif, 1.gif, ...
            for (int i = 0; i < nl.Count; i++) {
                Console.WriteLine(nl[i].Attributes["src"].Value);
                System.Net.WebClient wc = new System.Net.WebClient();
                wc.DownloadFile(nl[i].Attributes["src"].Value, i.ToString() + ".gif");
            }
        }
    }
}

5. Unfortunately, I have found that the images are drawn by a script and not downloaded from the web site. That is a different story: Internet Explorer doesn't provide access to the drawn image. I have ideas on how to solve the problem, and if enough people request it, it will be implemented.

Alex.


AlexF