Extract RSS feeds from Web pages
By Alex Furman
Shows how to extract RSS feeds from Web pages.
C#, .NET 1.0, .NET 1.1, Win2K, WinXP, VS.NET 2003, Dev
I love RSS readers. They save me a lot of time. Wouldn't it be nice if we could convert any Web data into RSS format? Then we could view bank records, credit card records, online shop promotions, e-mail subscriptions, etc. in one standard way.
Unfortunately, not many Web sites provide RSS/ATOM feeds. In this article, I will show that RSS extraction is a very simple task, especially if the proper technology is used.
We will only consider Web pages that are built with HTML or DHTML. At first glance, the task looks very simple: download the HTML pages locally and parse them. But it can take hours to write the code even for a simple Web site, and it is hard to keep the code working, because changes to the Web site can break it.
The following approaches can be used to extract data from Web pages: "Raw" HTTP, IE Automation, and SWExplorerAutomation.
HTTP is the "raw" approach. We use WebRequest (.NET) to download the page source locally. The RSS data can then be extracted with XPath or regular expressions. To use XPath, the page source must first be converted to XML (XHTML), for example with HTML Tidy.
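The raw approach can be sketched roughly as follows. This is only an illustration, not code from the sample application: the class name RawHttpExtractor, its method names, and the anchor-matching regular expression are assumptions made for this example. WebRequest is the class named above; newer .NET versions favor HttpClient. A regular expression like this one only handles simple, double-quoted href attributes, which is exactly the kind of fragility that makes the raw approach hard to maintain.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; not part of the article's sample.
public static class RawHttpExtractor
{
    // Downloads the page source as a string using WebRequest.
    public static string Download(string url)
    {
        WebRequest request = WebRequest.Create(url);
        using (WebResponse response = request.GetResponse())
        using (StreamReader reader = new StreamReader(response.GetResponseStream()))
        {
            return reader.ReadToEnd();
        }
    }

    // Returns (href, link text) pairs for simple <a href="...">text</a> anchors.
    public static List<KeyValuePair<string, string>> ExtractLinks(string html)
    {
        List<KeyValuePair<string, string>> links =
            new List<KeyValuePair<string, string>>();
        Regex anchor = new Regex(
            "<a\\s[^>]*href\\s*=\\s*\"([^\"]+)\"[^>]*>(.*?)</a>",
            RegexOptions.IgnoreCase | RegexOptions.Singleline);

        foreach (Match m in anchor.Matches(html))
        {
            // Remove any nested tags from the link text.
            string text = Regex.Replace(m.Groups[2].Value, "<[^>]+>", "").Trim();
            links.Add(new KeyValuePair<string, string>(m.Groups[1].Value, text));
        }
        return links;
    }
}
```

A caller would pass the result of Download(url) to ExtractLinks and then turn each pair into an RSS item. Keeping the download and the parsing separate makes the parsing step easy to try out on a saved copy of the page.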
Pros
Cons
The IE Automation approach is based on accessing the HTML DOM. We can either automate Internet Explorer or host the Web Browser control to get access to the HTML DOM data model.
Pros
Cons

Picture 1. SWExplorerAutomation class diagram.
SWExplorerAutomation is a framework that converts a Web application into programmable objects: scenes (pages) and controls. These objects are defined visually in a visual designer and are accessible from any .NET language.
Pros
Cons
To illustrate how SWExplorerAutomation can be used to extract RSS feeds from Web pages, I wrote a sample application that extracts an RSS feed from the CNN Web site. I created the following definitions (scenes) for the CNN pages: [CnnNews], [Sport], [Money], [Main]. Each scene contains an HtmlContent control, which extracts data from a defined place on the page.
First, we create and initialize an ExplorerManager instance. ExplorerManager is initialized from the [cnn_rss.htp] project file, which was created visually in the SWExplorerAutomation designer. The ExplorerManager Connect() method starts an Internet Explorer instance and connects to it. ExplorerManager then navigates the browser to the main CNN page.
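ExplorerManager explorerManager = new ExplorerManager();
// Start Internet Explorer and attach to it
explorerManager.Connect();
// Load the scene definitions created in the visual designer
explorerManager.LoadProject(@"..\..\cnn_rss.htp");
explorerManager.Navigate("http://www.cnn.com/");
// Write the RSS channel header (rssw and scene are set up elsewhere in the sample)
rssw.WriteChannel("CNN", "CNN News", scene.Descriptor.Url);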
The code waits until the scene defined for the main CNN page is activated. It then uses XPathDataExtractor to extract the list of article links from the Web page.
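scene = explorerManager["CnnNews"];
// Wait for the scene to become active (timeout 60000, presumably milliseconds)
if (!scene.WaitForActive(60000))
    return "";
// Run the "ItemList" expression to collect the article links
XmlNodeList nodeList = ((HtmlContent)(scene["HtmlContent_0"]))
    .XPathDataExtractor.Expressions["ItemList"].SelectNodes();
for (int i = 0; i < nodeList.Count; i++) {
    // ...
}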
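To give a feel for what an XPath expression such as ItemList does, here is a minimal sketch that uses only the standard System.Xml classes. It is not the SWExplorerAutomation API: the class XPathLinkSketch, the method SelectHrefs, and the XPath in the usage note are assumptions made for illustration. It also assumes well-formed XHTML with no default namespace and no DOCTYPE; namespaced documents would need an XmlNamespaceManager.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

// Hypothetical helper for illustration; not part of SWExplorerAutomation.
public static class XPathLinkSketch
{
    // Loads well-formed XHTML and returns the href values of the nodes
    // selected by the given XPath expression, in document order.
    public static List<string> SelectHrefs(string xhtml, string xpath)
    {
        XmlDocument doc = new XmlDocument();
        doc.LoadXml(xhtml);

        List<string> hrefs = new List<string>();
        foreach (XmlNode node in doc.SelectNodes(xpath))
        {
            // Attributes is null for non-element nodes such as text nodes.
            if (node.Attributes == null)
                continue;
            XmlAttribute href = node.Attributes["href"];
            if (href != null)
                hrefs.Add(href.Value);
        }
        return hrefs;
    }
}
```

For example, SelectHrefs(xhtml, "//div[@id='news']//a") would collect the links inside a div whose id is "news". With plain XPath like this, the expression lives in the code, so a site redesign means editing and rebuilding the program.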
The same set of actions (Navigate, Wait, Extract) is repeated for every article link.
for (int i = 0; i < nodeList.Count; i++) {
    string link = nodeList[i].Attributes["href"].Value;
    explorerManager.Navigate(link);
    // Wait until one of the article scenes becomes active
    Scene[] scenes = explorerManager.WaitForActive(
        new string[] {"Main", "Money", "Sport"}, 20000);
    if (scenes == null)
        continue;   // no known scene matched; skip this link
    scene = scenes[0];
    XPathDataExtractor xe =
        ((HtmlContent)(scene["HtmlContent_0"])).XPathDataExtractor;
    string title = xe.Expressions["Title"].SelectNodes()[0].InnerText;
    string pubDateStr = xe.Expressions["PubDate"].SelectNodes()[0].InnerText;
    WriteRssItem(title, link, pubDateStr,
        xe.Expressions["Content"].SelectNodes());
    scene.Deactivate();
}
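The listings call rssw.WriteChannel and WriteRssItem without showing them. The following is one possible sketch of such a writer, built on the standard XmlTextWriter. The class name RssWriterSketch and its members are assumptions made for illustration and are not the code from the sample application. The sketch emits the three channel elements that RSS 2.0 requires (title, link, description) and puts the extracted content nodes, serialized as markup, into each item's description.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;

// Hypothetical RSS writer for illustration; the sample's own helper is not shown.
public class RssWriterSketch : IDisposable
{
    private readonly XmlTextWriter writer;

    public RssWriterSketch(TextWriter output)
    {
        writer = new XmlTextWriter(output);
        writer.Formatting = Formatting.Indented;
    }

    // Opens <rss><channel> and writes the required channel elements.
    public void WriteChannel(string title, string description, string link)
    {
        writer.WriteStartElement("rss");
        writer.WriteAttributeString("version", "2.0");
        writer.WriteStartElement("channel");
        writer.WriteElementString("title", title);
        writer.WriteElementString("link", link);
        writer.WriteElementString("description", description);
    }

    // Writes one <item>; the content nodes become escaped markup in <description>.
    public void WriteItem(string title, string link, string pubDate,
                          XmlNodeList content)
    {
        StringBuilder body = new StringBuilder();
        foreach (XmlNode node in content)
            body.Append(node.OuterXml);

        writer.WriteStartElement("item");
        writer.WriteElementString("title", title);
        writer.WriteElementString("link", link);
        writer.WriteElementString("pubDate", pubDate);
        writer.WriteElementString("description", body.ToString());
        writer.WriteEndElement();
    }

    // Close() also closes any elements that are still open (channel, rss).
    public void Dispose()
    {
        writer.Close();
    }
}
```

Note that RSS readers expect pubDate in RFC 822 format, so a date string scraped from a page may need to be parsed and reformatted before it is written.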
The code is completely metadata-driven and does not need to change when the CNN site design changes; the page-specific details live in the project file.
Screenshot 1. SWExplorerAutomation Visual Designer
HtmlContent control will be added to the project.
CnnNews.
PubDate, Content and Title.
Don't forget to register SWExplorerAutomation.dll: it is a Browser Helper Object and must be registered.
This article explained how to extract RSS feeds from Web pages using SWExplorerAutomation. It took me less than 10 minutes to write and test the example code. Future articles will explain SWExplorerAutomation in more detail and in more complex situations.
Last Updated: 25 Sep 2004. Editor: Smitha Vijayan
Copyright 2004 by Alex Furman. Everything else Copyright © CodeProject, 1999-2010.