Introduction
StringParser is an object that helps you extract information from a string. The class is perhaps best suited to parsing HTML pages downloaded from the web (see my WebResourceProvider class, which helps you do this). You use StringParser by constructing it with some content (i.e. a string) and using its navigational and extraction methods to extract substrings from the content. StringParser also provides some static methods designed specifically for parsing HTML.
API
Here are some of the methods provided by StringParser. Please see the accompanying documentation for an exhaustive list.
Navigational API:
- resetPosition()
- skipToEndOf()
- skipToEndOfNoCase()
- skipToStartOf()
- skipToStartOfNoCase()

Extraction API:
- extractTo()
- extractToNoCase()
- extractUntil()
- extractUntilNoCase()
- extractToEnd()

Position query API:
- at()
- atNoCase()

HTML parsing API:
- getLinks()
- removeComments()
- removeEnclosingAnchorTag()
- removeEnclosingQuotes()
- removeHtml()
- removeScripts()
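
To make the idea of navigating and then extracting more concrete, here is a minimal sketch of a cursor-based parser built only on the standard string.IndexOf and Substring methods. MiniParser and its members are hypothetical names invented for this illustration; this is not the StringParser source, and the assumption that an extraction leaves the cursor just past the delimiter may not match the real class.

```csharp
using System;

// Hypothetical stand-in for illustration only; NOT the StringParser source.
public class MiniParser
{
    private readonly string content;
    private int position;   // index of the next unread character

    public MiniParser(string content)
    {
        this.content = content ?? "";
        this.position = 0;
    }

    // Moves the cursor to the first character of the next occurrence of text.
    public bool SkipToStartOf(string text)
    {
        int found = content.IndexOf(text, position, StringComparison.Ordinal);
        if (found < 0) return false;   // cursor is left unchanged on failure
        position = found;
        return true;
    }

    // Moves the cursor just past the next occurrence of text.
    public bool SkipToEndOf(string text)
    {
        if (!SkipToStartOf(text)) return false;
        position += text.Length;
        return true;
    }

    // Copies everything from the cursor up to (not including) text.
    // Assumption: the cursor then moves past text.
    public bool ExtractTo(string text, ref string extracted)
    {
        int found = content.IndexOf(text, position, StringComparison.Ordinal);
        if (found < 0) return false;
        extracted = content.Substring(position, found - position);
        position = found + text.Length;
        return true;
    }
}

public static class MiniParserDemo
{
    public static void Main()
    {
        string extracted = "";
        MiniParser p = new MiniParser("Hello Sally, how are you?");
        if (p.SkipToEndOf(", ") && p.ExtractTo("?", ref extracted))
            Console.WriteLine("Extracted text = {0}", extracted);   // Extracted text = how are you
        else
            Console.WriteLine("No text extracted.");
    }
}
```

Because every method returns a bool and leaves the cursor alone on failure, calls can be chained with && so that the extraction only runs when all the preceding navigation steps succeeded.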
Example 1 - Extracting delimited text
This example shows how to extract text contained between two delimiters.

    string strExtract = "";
    string str = "Hello Sally, how are you?";
    StringParser p = new StringParser (str);
    if (p.skipToStartOf (",") && p.extractTo ("?", ref strExtract))
        Console.WriteLine ("Extracted text = {0}", strExtract);
    else
        Console.WriteLine ("No text extracted.");
Example 2 - Extracting the nth occurrence of a delimited string
This example shows how to obtain the href attribute of the third anchor tag (<a>) in an HTML string. The example assumes the string contains valid HTML.

    string strExtract = "";
    string str = "...";
    StringParser p = new StringParser (str);
    if (p.skipToStartOfNoCase ("<a") &&
        p.skipToStartOfNoCase ("<a") &&
        p.skipToStartOfNoCase ("<a") &&
        p.skipToStartOfNoCase ("href=\"") &&
        p.extractTo ("\"", ref strExtract))
        Console.WriteLine ("Extracted text = {0}", strExtract);
    else
        Console.WriteLine ("No text extracted.");
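
For comparison, the same nth-occurrence lookup can be sketched with nothing but string.IndexOf, which may help when n is not known at compile time. IndexOfNth and HrefOfNthAnchor are hypothetical helpers written for this illustration and are not part of StringParser. As with the example above, the sketch assumes well-formed HTML with double-quoted href attributes, and the pattern "<a" also matches other tags that begin with those characters.

```csharp
using System;

public static class NthOccurrenceDemo
{
    // Returns the index of the n-th (1-based) case-insensitive occurrence
    // of text in content, or -1 if there are fewer than n occurrences.
    public static int IndexOfNth(string content, string text, int n)
    {
        int index = -1;
        for (int i = 0; i < n; i++)
        {
            // Resume the search one character past the previous match.
            index = content.IndexOf(text, index + 1, StringComparison.OrdinalIgnoreCase);
            if (index < 0) return -1;
        }
        return index;
    }

    // Returns the href value of the n-th anchor tag, or null if not found.
    public static string HrefOfNthAnchor(string html, int n)
    {
        const string marker = "href=\"";
        int anchor = IndexOfNth(html, "<a", n);
        if (anchor < 0) return null;

        int href = html.IndexOf(marker, anchor, StringComparison.OrdinalIgnoreCase);
        if (href < 0) return null;

        int start = href + marker.Length;
        int end = html.IndexOf('"', start);   // closing quote of the attribute
        if (end < 0) return null;

        return html.Substring(start, end - start);
    }

    public static void Main()
    {
        string html = "<a href=\"one.html\">1</a> <A HREF=\"two.html\">2</A> <a href=\"three.html\">3</a>";
        Console.WriteLine(HrefOfNthAnchor(html, 3));   // three.html
    }
}
```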
Example 3 - Global case-insensitive replacement
This example shows how to case-insensitively replace a string in the parser's content.

    string str = "...";
    StringParser p = new StringParser (str);
    p.replaceEvery ("<td>", "<td class=\"foo\">");
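
If you only need the replacement and not the rest of the parser, a similar effect can be sketched with the standard Regex class. This is an alternative technique, not how StringParser itself works (its implementation is not shown here), and ReplaceIgnoringCase is a hypothetical helper name. Regex.Escape makes the search text match literally, and doubling any '$' in the replacement prevents it from being read as a regex substitution.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ReplaceDemo
{
    // Replaces every occurrence of find in content, ignoring case.
    public static string ReplaceIgnoringCase(string content, string find, string replacement)
    {
        string pattern = Regex.Escape(find);                 // treat find as literal text
        string literal = replacement.Replace("$", "$$");     // "$$" emits a single '$'
        return Regex.Replace(content, pattern, literal, RegexOptions.IgnoreCase);
    }

    public static void Main()
    {
        string html = "<TD>a</TD><td>b</td>";
        string result = ReplaceIgnoringCase(html, "<td>", "<td class=\"foo\">");
        Console.WriteLine(result);   // <td class="foo">a</TD><td class="foo">b</td>
    }
}
```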
Example 4 - Poor man's web scraping
This example shows how to obtain a stock's quote from content downloaded from Yahoo Finance (MSFT). The example makes assumptions about the format of the web page.

    string strQuote = "";
    string str = "...";
    StringParser p = new StringParser (str);
    if (p.skipToEndOfNoCase ("Last Trade:</td><td class=\"yfnc_tabledata1\"><big><b>") &&
        p.extractTo ("</b>", ref strQuote))
        Console.WriteLine ("MSFT (delayed) = {0}", strQuote);
Example 5 - Get list of hyperlinked phrases
This example shows how to obtain the list of hyperlinked phrases in HTML content.

    ArrayList phrases = new ArrayList();
    string str = "...";
    StringParser p = new StringParser (str);
    while (p.skipToStartOfNoCase ("<a")) {
        string strPhrase = "";
        // The phrase is the text between the end of the opening tag and the closing </a>.
        if (p.skipToEndOf (">") && p.extractTo ("</a>", ref strPhrase))
            phrases.Add (strPhrase);
    }
Demo applications
C# applications (with full source code) that use StringParser can be found here:
Revision History
- 15 Jan 2006: Initial version.
Ravi Bhavnani is an ardent fan of Microsoft technologies who loves building Windows apps, especially PIMs, system utilities, and things that go bump on the Internet. During his career, Ravi has developed expert systems, desktop imaging apps, marketing automation software, EDA tools, a platform to help people find, analyze and understand information, trading software for institutional investors and advanced data visualization solutions. He currently works for a company that provides enterprise workforce management solutions to large clients.
His interests include the .NET framework, reasoning systems, financial analysis and algorithmic trading, NLP, HCI and UI design. Ravi holds a BS in Physics and Math and an MS in Computer Science and was a Microsoft MVP (C++ and C# in 2006 and 2007). He is also the co-inventor of 3 patents on software security and generating data visualization dashboards. His claim to fame is that he crafted CodeProject's "joke" forum post icon.
Ravi's biggest fear is that one day he might actually get a life, although the chances of that happening seem extremely remote.