Introduction
This article describes WebResourceProvider, a simple yet powerful framework for retrieving useful information from publicly available web services. I use the term "web service" in a generic, non-Microsoft sense, to mean information providers such as:
The demo application included with this article shows how you can easily create objects to get:
- stock quotes
- the weather for a US zip code
- the list of locations served by a US zip code
- the translation of a piece of text
- the list of broken links on an HTML page
- the list of top posters at CodeProject
A Word of Caution
Before you use WebResourceProvider to write the next killer app, be aware that there are legal and ethical issues regarding the use of information obtained from other sources. In particular, the terms of service (TOS) of content providers such as Yahoo, CNN, etc. clearly state what you can and cannot do with information retrieved from their sites. Even if you write a web resource provider for personal use only, you should take into consideration any undue stress that your object may put on a web server. The CodeProjectTopPosters example in the demo won't let you get at more than the top 40 CodeProject posters. Further, it pauses between multiple accesses to the CodeProject server in order to not overload it.
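The two courtesy measures mentioned above (a hard cap on how much is fetched, and a pause between accesses) can be sketched roughly as follows. This is not the code of the CodeProjectTopPosters sample; fetchPolitely and its parameters are hypothetical names chosen for illustration, and the fetchOne callback stands in for whatever actually performs one download.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <functional>
#include <string>
#include <thread>
#include <vector>

// Hypothetical helper, not part of WebResourceProvider: fetches at most
// maxItems items and sleeps for `pause` between consecutive requests.
std::vector<std::string> fetchPolitely(
    std::size_t requested,
    std::size_t maxItems,
    std::chrono::milliseconds pause,
    const std::function<std::string(std::size_t)>& fetchOne)
{
    assert(fetchOne && "a fetch callback is required");

    // Never ask the server for more than the cap, whatever the caller wants.
    const std::size_t count = std::min(requested, maxItems);

    std::vector<std::string> results;
    results.reserve(count);
    for (std::size_t i = 0; i < count; ++i) {
        if (i > 0) {
            // Pause between accesses so the server is not flooded.
            std::this_thread::sleep_for(pause);
        }
        results.push_back(fetchOne(i));
    }
    return results;
}
```

A caller asking for 100 items with a cap of 40 would trigger only 40 requests. The right pause length depends on the server and its terms of service, so it is left as a parameter here.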
How it Works

WebResourceProvider works by initializing itself, constructing a URL to be retrieved, downloading the resource, and extracting useful information from the downloaded content. The process is repeated until no more data needs to be downloaded.
You use WebResourceProvider by deriving your own resource provider class from it, and overriding any of these virtual methods (shown in red in the diagram on the right):
- init()
- constructUrl()
- isPost()
- getPostData()
- parseContent()
- moreAvailable()
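The cycle described above, and the way the overridable methods plug into it, can be sketched as follows. ProviderSketch is a stand-in written for this illustration, not the real WebResourceProvider: the method names follow the list above, but their signatures, the body of fetchResource(), and the download() placeholder are all assumptions. PagedProvider and its URL are likewise hypothetical.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Stand-in for the real base class. Method names follow the article;
// signatures and the loop body are assumptions made for this sketch.
class ProviderSketch {
public:
    virtual ~ProviderSketch() = default;

    // init, then (construct URL, download, parse) until the derived
    // class reports that nothing more is available.
    void fetchResource() {
        init();
        do {
            std::string url;
            constructUrl(url);
            assert(!url.empty() && "constructUrl() must supply a URL");
            const std::string content =
                isPost() ? download(url, getPostData()) : download(url, "");
            parseContent(content);
        } while (moreAvailable());
    }

protected:
    virtual void init() {}
    virtual void constructUrl(std::string& url) = 0;
    virtual bool isPost() { return false; }
    virtual std::string getPostData() { return ""; }
    virtual void parseContent(const std::string& content) = 0;
    virtual bool moreAvailable() { return false; }

    // Placeholder for the HTTP transfer: it only echoes its arguments.
    virtual std::string download(const std::string& url,
                                 const std::string& postData) {
        return "content of " + url + postData;
    }
};

// Hypothetical provider that walks three result pages.
class PagedProvider : public ProviderSketch {
public:
    std::vector<std::string> pages;

protected:
    void init() override { page_ = 0; pages.clear(); }
    void constructUrl(std::string& url) override {
        url = "http://example.com/list?page=" + std::to_string(page_ + 1);
    }
    void parseContent(const std::string& content) override {
        pages.push_back(content);  // a real provider would extract fields here
        ++page_;
    }
    bool moreAvailable() override { return page_ < 3; }

private:
    int page_ = 0;
};
```

Calling fetchResource() on a PagedProvider runs the cycle three times and leaves one entry per page in pages. A provider that needs only a single download can leave moreAvailable() at its default.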
WebResourceProvider
provides an assortment of methods to help parse downloaded content. They are:
| Method | Purpose |
| --- | --- |
| at | Checks whether current location is at a string |
| atExact | Case sensitive version of at() |
| skipTo | Advances current location to next occurrence of a string |
| skipToExact | Case sensitive version of skipTo() |
| skipBackTo | Retreats current location to previous occurrence of a string |
| skipBackToExact | Case sensitive version of skipBackTo() |
| extractTo | Extracts text from current location to the start of a string |
| extractToExact | Case sensitive version of extractTo() |
| extractToEnd | Extracts text from current location to end of content |
| getIndex | Returns current location |
| getLinks | Returns HREF and IMG links in content |
| resetIndex | Sets current location to start of content |
| replaceEvery | Replaces every occurrence of a string in content with another |
| removeComments | Removes comments from content |
| removeScripts | Removes scripts from content |
| removeEnclosingAnchorTag | Removes anchor tag enclosing a string |
| removeEnclosingQuotes | Removes quotes enclosing a string |
| removeHtml | Removes HTML from a string |
| trim | Removes leading and trailing whitespace from a string |
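To give a feel for how a "current location" style of parsing works, here is a small stand-alone sketch of three of the helpers. ContentCursor is a hypothetical class written with std::string for this illustration; it is not the library's implementation, and its signatures are assumptions. In particular, this sketch assumes that skipTo() leaves the location just past the matched string, which may differ from the real class.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// Illustrative cursor over downloaded content. Searches are case
// insensitive: they run against a lowercased copy of the content.
class ContentCursor {
public:
    explicit ContentCursor(const std::string& content)
        : content_(content), lowered_(lower(content)), index_(0) {}

    // Is the current location at `text`?
    bool at(const std::string& text) const {
        assert(index_ <= lowered_.size());
        return lowered_.compare(index_, text.size(), lower(text)) == 0;
    }

    // Moves just past the next occurrence of `text` (assumption, see the
    // lead-in). Leaves the location unchanged if `text` is not found.
    bool skipTo(const std::string& text) {
        const std::size_t pos = lowered_.find(lower(text), index_);
        if (pos == std::string::npos) return false;
        index_ = pos + text.size();
        return true;
    }

    // Copies everything between the current location and the start of
    // `text` into `result`, then moves to the start of `text`.
    bool extractTo(const std::string& text, std::string& result) {
        const std::size_t pos = lowered_.find(lower(text), index_);
        if (pos == std::string::npos) return false;
        result = content_.substr(index_, pos - index_);
        index_ = pos;
        return true;
    }

    std::size_t getIndex() const { return index_; }
    void resetIndex() { index_ = 0; }

private:
    static std::string lower(std::string s) {
        std::transform(s.begin(), s.end(), s.begin(), [](unsigned char c) {
            return static_cast<char>(std::tolower(c));
        });
        return s;
    }

    std::string content_;   // original text, used for extraction
    std::string lowered_;   // lowercased copy, used for searching
    std::size_t index_;     // current location
};
```

With content such as `<td>Price:</td><TD><b>42.50</b></td>`, a parser built this way would call skipTo("price:"), then skipTo("<b>"), then extractTo("</b>", value) to pull out the number regardless of the tag case used by the page.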
Sample Resource Providers
Here are screenshots of the sample resource providers in action.
The WeatherProvider object works by posting a request to CNN's weather form and parsing the returned information.
The Translator object works by posting a request to Google's translation engine and parsing the returned information. The request includes the translation mode. The sample performs a reverse translation and presents it along with the original text for comparison purposes.
The LinkChecker object is a thin layer above the WebResourceProvider class. It delegates the job of determining the document and image links at a URL to the base class.
LinkChecker::getLinks() does its best to determine the links on a page, but because of the large number of ways a link can be specified, this method may miss a few links.
The LinkChecker demo uses WebResourceProvider::urlExists() to check whether a link is valid.
Click here to see a screenshot of the LinkChecker demo run against the CodeProject home page. Keep up the good work, Chris!
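To illustrate why link extraction by text scanning can miss links, here is a deliberately simple sketch that collects the values of href= and src= attributes. extractLinks is a hypothetical free function written for this example; it is not the library's getLinks(), whose actual strategy may be quite different.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: returns the values of href= and src= attributes
// found in `html`, href values first. Quoted and unquoted values are
// handled; links assembled by script are not detected.
std::vector<std::string> extractLinks(const std::string& html) {
    // Search a lowercased copy so HREF, Href and href all match.
    std::string lowered = html;
    std::transform(lowered.begin(), lowered.end(), lowered.begin(),
                   [](unsigned char c) {
                       return static_cast<char>(std::tolower(c));
                   });

    const std::string attrs[] = {"href=", "src="};
    std::vector<std::string> links;
    for (const std::string& attr : attrs) {
        std::size_t pos = 0;
        while ((pos = lowered.find(attr, pos)) != std::string::npos) {
            pos += attr.size();
            if (pos >= html.size()) break;

            std::size_t end;
            const char quote = html[pos];
            if (quote == '"' || quote == '\'') {
                ++pos;                         // skip the opening quote
                end = html.find(quote, pos);   // value ends at closing quote
            } else {
                end = html.find_first_of(" \t\r\n>", pos);
            }
            if (end == std::string::npos) end = html.size();
            assert(end >= pos);

            if (end > pos) links.push_back(html.substr(pos, end - pos));
            pos = end;
        }
    }
    return links;
}
```

A page that builds its URLs in script, for example by concatenating strings and assigning the result to location, contains no href= or src= text for this scan to find, which is one way a checker of this kind ends up missing links.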
Using WebResourceProvider
To use WebResourceProvider, do the following:
- Build the WebResourceProvider_Lib project.
- Modify your application's project to look for header files and libraries in the WebResourceProvider_Lib project area.
- Derive a class from WebResourceProvider. You'll need to #include WebResourceProvider.h in your derived class's header file.
- Override the constructUrl() method in your derived class. This method specifies the URL to be downloaded.
- Override the parse() method in your derived class. This method extracts information from the downloaded content and stores it in the derived class's member variables.
- Optionally override other WebResourceProvider virtual methods. See the source code of the sample resource providers included in the demo project for examples.
- Link your application with WebResourceProvider.lib.
Acknowledgement
WebResourceProvider uses the following code written by others:
A Call for Interesting WebResourceProviders!
This is an invitation to the CP community to come up with interesting and useful web resource providers. Let your imagination (and coding prowess) flow! Please post your cool WebResourceProvider derived classes at CodeProject.
Revision History
- 23 Mar 2007
Updated parsing logic in WeatherProvider, ZipCodeProvider and Translator modules.
- 7 Oct 2006
Bug Fix: Updated parsing logic in Translator module.
- 7 Aug 2002
Bug Fix: Fixed computation error in extractToEnd().
Bug Fix: Added missing call to init() in fetchResource().
Added methods skipBackTo() and skipBackToExact().
- 8 May 2002
Added methods urlExists(), getLinks(), removeComments(), removeScripts(), findNoCase() and findStringInArray().
Modified at(), skipTo(), and extractTo() to be case insensitive. Added case sensitive analogs atExact(), skipToExact() and extractToExact().
Added LinkChecker sample to demo app.
Fixed a bug in parse() that caused the fetch status to be ignored.
Speeded up ZipCodeDecoder sample object.
- 30 April 2002
Corrected control flow image.
- 29 April 2002
Initial version of article.