StringParser






An object that makes it easy to extract information from strings, especially HTML content.
Introduction
StringParser is an object that helps you extract information from a string. The class is perhaps best suited to parse HTML pages downloaded from the web (see my WebResourceProvider class that helps you do this). You use StringParser by constructing it with some content (i.e. a string) and using its navigational and extraction methods to extract substrings from the content. StringParser also provides some static methods designed specifically for parsing HTML.
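To make the "construct, navigate, extract" idea concrete, here is a minimal sketch of a cursor-based parser. It is not the StringParser source: the class name MiniParser and its PascalCase methods are hypothetical stand-ins, and it assumes that navigation simply moves an index through the content and that a failed search leaves the index where it was.

```csharp
using System;

// Hypothetical stand-in, NOT the article's StringParser:
// a read-only string plus a cursor that the methods move forward.
public class MiniParser
{
    private readonly string content;
    private int pos;   // current cursor position within content

    public MiniParser(string content)
    {
        this.content = content ?? "";
    }

    // Move the cursor to the first character of the next occurrence of token.
    public bool SkipToStartOf(string token)
    {
        int i = content.IndexOf(token, pos, StringComparison.Ordinal);
        if (i < 0) return false;   // not found: cursor is left unchanged
        pos = i;
        return true;
    }

    // Move the cursor just past the next occurrence of token.
    public bool SkipToEndOf(string token)
    {
        if (!SkipToStartOf(token)) return false;
        pos += token.Length;
        return true;
    }

    // Copy the text between the cursor and the next occurrence of token,
    // then move the cursor past that token.
    public bool ExtractTo(string token, ref string result)
    {
        int i = content.IndexOf(token, pos, StringComparison.Ordinal);
        if (i < 0) return false;
        result = content.Substring(pos, i - pos);
        pos = i + token.Length;
        return true;
    }
}
```

With this sketch, constructing a MiniParser on "Hello Sally, how are you?" and calling SkipToEndOf(", ") followed by ExtractTo("?", ref s) leaves "how are you" in s. Every method returns a bool, which is what allows calls to be chained with && in the style of the examples in this article.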
API
Here are some of the methods provided by StringParser. Please see the accompanying documentation for an exhaustive list.
Navigational API
- resetPosition()
- skipToEndOf()
- skipToEndOfNoCase()
- skipToStartOf()
- skipToStartOfNoCase()
Extraction API
- extractTo()
- extractToNoCase()
- extractUntil()
- extractUntilNoCase()
- extractToEnd()
Position query API
- at()
- atNoCase()
HTML parsing API
- getLinks()
- removeComments()
- removeEnclosingAnchorTag()
- removeEnclosingQuotes()
- removeHtml()
- removeScripts()
Example 1 - Extracting delimited text
This example shows how to extract text contained between two delimiters.

    // Extract text between the comma and question mark
    string strExtract = "";
    string str = "Hello Sally, how are you?";
    StringParser p = new StringParser (str);
    if (p.skipToStartOf (",") && p.extractTo ("?", ref strExtract))
        Console.WriteLine ("Extracted text = {0}", strExtract);
    else
        Console.WriteLine ("No text extracted.");
Example 2 - Extracting the nth occurrence of a delimited string
This example shows how to obtain the href attribute of the third anchor tag (<a>) in an HTML string. The example assumes the string contains valid HTML.

    // Get href attribute of 3rd <a> tag
    string strExtract = "";
    string str = "..."; // HTML
    StringParser p = new StringParser (str);
    if (p.skipToStartOfNoCase ("<a") &&
        p.skipToStartOfNoCase ("<a") &&
        p.skipToStartOfNoCase ("<a") &&
        p.skipToStartOfNoCase ("href=\"") &&
        p.extractTo ("\"", ref strExtract))
        Console.WriteLine ("Extracted text = {0}", strExtract);
    else
        Console.WriteLine ("No text extracted.");
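Repeating the same call three times works for a fixed n, but the idea generalizes to a loop. The sketch below shows that loop using only string.IndexOf from the .NET base library rather than StringParser. The class and method names (NthOccurrence, IndexOfNth, HrefOfNthAnchor) are hypothetical, and like the example above it assumes well-formed HTML with double-quoted href values. Note that searching for "<a" also matches tags such as <abbr>.

```csharp
using System;

// Hypothetical helpers, not part of StringParser.
public static class NthOccurrence
{
    // Returns the index of the n-th (1-based) occurrence of token in text,
    // ignoring case, or -1 if there are fewer than n occurrences.
    public static int IndexOfNth(string text, string token, int n)
    {
        if (text == null || string.IsNullOrEmpty(token) || n < 1) return -1;
        int index = -1;
        int start = 0;
        for (int found = 0; found < n; found++)
        {
            index = text.IndexOf(token, start, StringComparison.OrdinalIgnoreCase);
            if (index < 0) return -1;
            start = index + token.Length;   // resume the search after this match
        }
        return index;
    }

    // Returns the href value of the n-th "<a" tag, or null if it cannot be found.
    public static string HrefOfNthAnchor(string html, int n)
    {
        int tag = IndexOfNth(html, "<a", n);
        if (tag < 0) return null;
        const string attr = "href=\"";
        int open = html.IndexOf(attr, tag, StringComparison.OrdinalIgnoreCase);
        if (open < 0) return null;
        int valueStart = open + attr.Length;
        int close = html.IndexOf('"', valueStart);
        return close < 0 ? null : html.Substring(valueStart, close - valueStart);
    }
}
```

The important detail is that each search resumes after the previous match (start = index + token.Length); searching again from the start of a match would find the same occurrence every time.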
Example 3 - Global case-insensitive replacement
This example shows how to case-insensitively replace a string in the parser's content.

    // Replace every occurrence of <td> with <td class="foo">
    string str = "..."; // HTML
    StringParser p = new StringParser (str);
    p.replaceEvery ("<td>", "<td class=\"foo\">");
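For comparison, here is roughly how the same kind of replacement could be done on a plain string with the .NET regular expression classes. This is a sketch of the technique, not how replaceEvery is implemented; the class and method names (TextReplace, ReplaceIgnoringCase) are hypothetical.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper, not part of StringParser.
public static class TextReplace
{
    // Replaces every occurrence of find in content with replacement, ignoring case.
    public static string ReplaceIgnoringCase(string content, string find, string replacement)
    {
        if (string.IsNullOrEmpty(content) || string.IsNullOrEmpty(find)) return content;

        // Regex.Escape makes characters such as '<', '.' or '*' match literally.
        // A MatchEvaluator returns the replacement verbatim, so a '$' in it
        // is not interpreted as a substitution pattern.
        return Regex.Replace(content, Regex.Escape(find), m => replacement,
                             RegexOptions.IgnoreCase);
    }
}
```

For instance, ReplaceIgnoringCase ("<TD>a</td><td>b", "<td>", "<td class=\"foo\">") rewrites both the upper-case and the lower-case opening tag.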
Example 4 - Poor man's web scraping
This example shows how to obtain a stock's quote from the content downloaded from Yahoo Finance (MSFT). The example makes assumptions about the format of the web page.

    // Scrape http://finance.yahoo.com/q?s=msft
    string strQuote = "";
    string str = "..."; // HTML downloaded from http://finance.yahoo.com/q?s=msft
    StringParser p = new StringParser (str);
    if (p.skipToEndOfNoCase ("Last Trade:</td><td class=\"yfnc_tabledata1\"><big><b>") &&
        p.extractTo ("</b>", ref strQuote))
        Console.WriteLine ("MSFT (delayed) = {0}", strQuote);
Example 5 - Get list of hyperlinked phrases
This example shows how to obtain the list of hyperlinked phrases in HTML content.

    ArrayList phrases = new ArrayList();
    string str = "..."; // HTML content
    StringParser p = new StringParser (str);
    while (p.skipToStartOfNoCase ("<a")) {
        string strPhrase = "";
        // Skip past the end of the opening tag, then take everything up to the closing tag
        if (p.skipToEndOf (">") && p.extractTo ("</a>", ref strPhrase))
            phrases.Add (strPhrase);
    }
Demo applications
C# applications (with full source code) that use StringParser can be found here:
- DomainWalker - a web topology analyzer
- GoogleTranslator - an object that uses Google to translate natural language
- SimpleRSS - an RSS channel reader
Revision History
- 15 Jan 2006
Initial version.