
StringParser

By Ravi Bhavnani, 15 Jan 2006
 

Introduction

StringParser is a class that helps you extract information from a string.  It is perhaps best suited to parsing HTML pages downloaded from the web (see my WebResourceProvider class, which helps you do this).  You construct a StringParser with some content (i.e. a string), then use its navigational and extraction methods to pull substrings out of that content.  StringParser also provides some static methods designed specifically for parsing HTML.

API

Here are some of the methods provided by StringParser.  Please see the accompanying documentation for an exhaustive list.

Navigational API
  • resetPosition()
  • skipToEndOf()
  • skipToEndOfNoCase()
  • skipToStartOf()
  • skipToStartOfNoCase()

Extraction API
  • extractTo()
  • extractToNoCase()
  • extractUntil()
  • extractUntilNoCase()
  • extractToEnd()

Position query API
  • at()
  • atNoCase()

HTML parsing API
  • getLinks()
  • removeComments()
  • removeEnclosingAnchorTag()
  • removeEnclosingQuotes()
  • removeHtml()
  • removeScripts()
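
As a rough mental model of the navigational and extraction methods listed above, the sketch below shows one way a cursor-based parser could work on top of String.IndexOf.  MiniParser and its members are hypothetical names invented for this illustration; this is not StringParser's source, and details such as whether the cursor moves past the delimiter after an extraction are assumptions.

```csharp
using System;

// Hypothetical stand-in for illustration only; NOT the article's StringParser.
public sealed class MiniParser
{
    private readonly string content;
    private int position;   // index of the next unread character

    public MiniParser(string content)
    {
        this.content = content ?? throw new ArgumentNullException(nameof(content));
    }

    // Moves the cursor back to the start of the content.
    public void ResetPosition() { position = 0; }

    // Moves the cursor to the first character of the next match.
    public bool SkipToStartOf(string token)
    {
        int i = content.IndexOf(token, position, StringComparison.Ordinal);
        if (i < 0) return false;          // no match: the cursor stays where it was
        position = i;
        return true;
    }

    // Moves the cursor just past the next match.
    public bool SkipToEndOf(string token)
    {
        int i = content.IndexOf(token, position, StringComparison.Ordinal);
        if (i < 0) return false;
        position = i + token.Length;
        return true;
    }

    // Copies the text between the cursor and the next match, then moves
    // the cursor past that match (an assumption made for this sketch).
    public bool ExtractTo(string token, ref string extracted)
    {
        int i = content.IndexOf(token, position, StringComparison.Ordinal);
        if (i < 0) return false;
        extracted = content.Substring(position, i - position);
        position = i + token.Length;
        return true;
    }
}
```

Under these assumptions, constructing MiniParser with "Hello Sally, how are you?", calling SkipToEndOf (",") and then ExtractTo ("?", ref s) leaves " how are you" (with its leading space) in s.  Each method returns false and leaves the cursor alone when the token is not found, which is what lets calls be chained with && as in the examples that follow.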

Example 1 - Extracting delimited text

This example shows how to extract text contained between two delimiters. 
  // Extract text between the comma and question mark
  string strExtract = "";
  string str = "Hello Sally, how are you?";
  StringParser p = new StringParser (str);
  if (p.skipToEndOf (",") && p.extractTo ("?", ref strExtract))
     Console.WriteLine ("Extracted text = {0}", strExtract);
  else
     Console.WriteLine ("No text extracted.");

Example 2 - Extracting the nth occurrence of a delimited string

This example shows how to obtain the href attribute of the third anchor tag (<a>) in an HTML string.  The example assumes the string contains valid HTML.
  // Get href attribute of 3rd <a> tag
  string strExtract = "";
  string str = "..."; // HTML
  StringParser p = new StringParser (str);
  if (p.skipToEndOfNoCase ("<a") &&   // move past the 1st <a
      p.skipToEndOfNoCase ("<a") &&   // move past the 2nd <a
      p.skipToEndOfNoCase ("<a") &&   // move past the 3rd <a
      p.skipToStartOfNoCase ("href=\"") &&
      p.extractTo ("\"", ref strExtract))
     Console.WriteLine ("Extracted text = {0}", strExtract);
  else
     Console.WriteLine ("No text extracted.");
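
When n is not known in advance, the repeated skip calls can be replaced by a loop.  The sketch below does this with plain String.IndexOf rather than StringParser; NthOccurrence, IndexAfterNth and HrefOfNthAnchor are hypothetical names made up for this illustration, and the code assumes double-quoted href attributes.  Like the example above, it treats any tag whose name starts with "a" (e.g. <abbr>) as an anchor.

```csharp
using System;

// Hypothetical helper for illustration; not part of StringParser.
public static class NthOccurrence
{
    // Returns the index just past the nth (1-based) case-insensitive
    // occurrence of token, or -1 if there are fewer than n occurrences.
    public static int IndexAfterNth(string content, string token, int n)
    {
        if (content == null) throw new ArgumentNullException(nameof(content));
        if (string.IsNullOrEmpty(token)) throw new ArgumentException("token must be non-empty", nameof(token));
        if (n < 1) throw new ArgumentOutOfRangeException(nameof(n));

        int position = 0;
        for (int found = 0; found < n; found++)
        {
            int i = content.IndexOf(token, position, StringComparison.OrdinalIgnoreCase);
            if (i < 0) return -1;
            position = i + token.Length;   // continue searching after this match
        }
        return position;
    }

    // Returns the href value of the nth anchor tag, or null if it cannot be found.
    // Assumes the attribute is written as href="..." with double quotes.
    public static string HrefOfNthAnchor(string html, int n)
    {
        int position = IndexAfterNth(html, "<a", n);
        if (position < 0) return null;

        const string marker = "href=\"";
        int start = html.IndexOf(marker, position, StringComparison.OrdinalIgnoreCase);
        if (start < 0) return null;
        start += marker.Length;

        int end = html.IndexOf('"', start);
        return end < 0 ? null : html.Substring(start, end - start);
    }
}
```

Calling NthOccurrence.HrefOfNthAnchor (str, 3) plays the same role as the chained calls in Example 2, and returns null where the example prints "No text extracted."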

Example 3 - Global case-insensitive replacement

This example shows how to case-insensitively replace a string in the parser's content.
  // Replace every occurrence of <td> with <td class="foo">
  string str = "..."; // HTML
  StringParser p = new StringParser (str);
  p.replaceEvery ("<td>", "<td class=\"foo\">");
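
For comparison, a similar case-insensitive replacement can be written with the standard Regex class alone.  This is only a sketch of the general technique, not how replaceEvery is implemented; CaseInsensitiveReplace and its ReplaceEvery method are hypothetical names used for this illustration.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; not part of StringParser.
public static class CaseInsensitiveReplace
{
    // Replaces every case-insensitive occurrence of 'find' with 'replacement'.
    public static string ReplaceEvery(string content, string find, string replacement)
    {
        // Regex.Escape makes the search text match literally. The evaluator
        // returns the replacement verbatim, so a '$' in it is not interpreted
        // as a regex substitution.
        return Regex.Replace(content, Regex.Escape(find), m => replacement, RegexOptions.IgnoreCase);
    }
}
```

With this helper, CaseInsensitiveReplace.ReplaceEvery (str, "<td>", "<td class=\"foo\">") rewrites both <td> and <TD>, leaving closing tags untouched.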

Example 4 - Poor man's web scraping

This example shows how to obtain a stock quote (MSFT) from a page downloaded from Yahoo Finance.  The example makes assumptions about the format of the web page.
  // Scrape http://finance.yahoo.com/q?s=msft
  string strQuote = "";
  string str = "..."; // HTML downloaded from http://finance.yahoo.com/q?s=msft
  StringParser p = new StringParser (str);
  if (p.skipToEndOfNoCase ("Last Trade:</td><td class=\"yfnc_tabledata1\"><big><b>") &&
      p.extractTo ("</b>", ref strQuote))
     Console.WriteLine ("MSFT (delayed) = {0}", strQuote);

Example 5 - Get list of hyperlinked phrases

This example shows how to obtain the list of hyperlinked phrases in HTML content.
  ArrayList phrases = new ArrayList();
  string str = "..."; // HTML content
  StringParser p = new StringParser (str);
  while (p.skipToStartOfNoCase ("<a")) {
    string strPhrase = "";
    // Skip the rest of the opening tag, then extract up to the closing </a>
    if (p.skipToEndOf (">") && p.extractTo ("</a>", ref strPhrase))
       phrases.Add (strPhrase);
  }
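
The same list can also be sketched with a regular expression instead of StringParser.  AnchorText and GetPhrases are hypothetical names used for this illustration, and the pattern is deliberately simple: it assumes anchors are not nested and that no attribute value contains a ">" character.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; not part of StringParser.
public static class AnchorText
{
    // <a ...> followed by the shortest possible body, then </a>.
    // Singleline lets '.' match newlines so multi-line link text is captured.
    private static readonly Regex Anchor = new Regex(
        @"<a\b[^>]*>(.*?)</a\s*>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Returns the text enclosed by each anchor tag, in document order.
    public static List<string> GetPhrases(string html)
    {
        var phrases = new List<string>();
        foreach (Match m in Anchor.Matches(html))
            phrases.Add(m.Groups[1].Value);
        return phrases;
    }
}
```

Unlike the "<a" search in the example above, the \b in this pattern keeps tags such as <abbr> from being mistaken for anchors.  Any markup inside the link text (e.g. <b>) is returned as-is.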

Demo applications

C# applications (with full source code) that use StringParser can be found here:

Revision History

  • 15 Jan 2006
    Initial version.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

About the Author

Ravi Bhavnani
Technical Lead
Canada
Ravi Bhavnani is an ardent fan of Microsoft technologies who loves building Windows apps, especially PIMs, system utilities, and things that go bump on the Internet. During his career, Ravi has developed expert systems, desktop imaging apps, marketing automation software, EDA tools, a platform to help people find, analyze and understand information, trading software for institutional investors and advanced data visualization solutions. He currently works for a company that provides enterprise workforce management solutions to large clients.
 
His interests include the .NET framework, reasoning systems, financial analysis and algorithmic trading, NLP, CHI and UI design. Ravi holds a BS in Physics and Math and an MS in Computer Science and was a Microsoft MVP (C++ and C# in 2006 and 2007). He is also the co-inventor of 2 patents on software security and generating data visualization dashboards. His claim to fame is that he crafted CodeProject's "joke" forum post icon.
 
Ravi's biggest fear is that one day he might actually get a life, although the chances of that happening seem extremely remote.


Comments and Discussions

Question: Re: I am newby and need your help. [modified] (lance.spurgeon, 28 May '12 - 8:59)
Hi Ravi
 
Thanks for the quick reply.
 
I have a user control with two textboxes and a Convert button. I want the user to enter HTML in textbox 1 and click the Convert button, and have your script insert the result in textbox 2. But I just can't get it right.
 
I converted your C# to VB through a code converter. Could you please take a look at the code I pasted earlier and tell me where I have gone wrong? Thanks for your time.
 
Cheers Lance

modified 29 May '12 - 12:47.


Last Updated 15 Jan 2006
Article Copyright 2006 by Ravi Bhavnani