
StringParser

By Ravi Bhavnani, 15 Jan 2006

Introduction

StringParser is an object that helps you extract information from a string.  The class is perhaps best suited to parse HTML pages downloaded from the web (see my WebResourceProvider class that helps you do this).  You use StringParser by constructing it with some content (i.e. a string) and using its navigational and extraction methods to extract substrings from the content.  StringParser also provides some static methods designed specifically for parsing HTML.

API

Here are some of the methods provided by StringParser.  Please see the accompanying documentation for an exhaustive list.

Navigational API
  • resetPosition()
  • skipToEndOf()
  • skipToEndOfNoCase()
  • skipToStartOf()
  • skipToStartOfNoCase()

Extraction API
  • extractTo()
  • extractToNoCase()
  • extractUntil()
  • extractUntilNoCase()
  • extractToEnd()

Position query API
  • at()
  • atNoCase()

HTML parsing API
  • getLinks()
  • removeComments()
  • removeEnclosingAnchorTag()
  • removeEnclosingQuotes()
  • removeHtml()
  • removeScripts()
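
The "start of" and "end of" methods differ only in where they leave the parser. The sketch below illustrates that difference with plain string.IndexOf calls. MiniParser and its member names are hypothetical stand-ins written for this illustration; they are not the StringParser source and may differ from it in detail.

```csharp
using System;

// Hypothetical stand-in for illustration only; not the StringParser source.
public class MiniParser
{
    private readonly string content;

    public int Position { get; private set; }

    public MiniParser(string content) { this.content = content; }

    // Moves to the first character of the next match (in the spirit of skipToStartOf).
    public bool SkipToStartOf(string s) { return Seek(s, false); }

    // Moves just past the next match (in the spirit of skipToEndOf).
    public bool SkipToEndOf(string s) { return Seek(s, true); }

    // Returns the text before the next match and moves just past the match
    // (in the spirit of extractTo).
    public bool ExtractTo(string s, ref string extract)
    {
        int pos = content.IndexOf(s, Position, StringComparison.Ordinal);
        if (pos == -1) return false;
        extract = content.Substring(Position, pos - Position);
        Position = pos + s.Length;
        return true;
    }

    private bool Seek(string s, bool positionAfter)
    {
        if (Position >= content.Length) return false;
        int pos = content.IndexOf(s, Position, StringComparison.Ordinal);
        if (pos == -1) return false;
        Position = positionAfter ? pos + s.Length : pos;
        return true;
    }
}
```

In this sketch, calling SkipToStartOf twice with the same text does not advance the second time, because the search begins at the current position and finds the same match again. Use SkipToEndOf when the delimiter itself should be stepped over.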

Example 1 - Extracting delimited text

This example shows how to extract text contained between two delimiters. 
  // Extract text between the comma and question mark
  string strExtract = "";
  string str = "Hello Sally, how are you?";
  StringParser p = new StringParser (str);
  // skipToEndOf() leaves the parser just after the comma, so the comma is not extracted
  if (p.skipToEndOf (",") && p.extractTo ("?", ref strExtract))
     Console.WriteLine ("Extracted text = {0}", strExtract);
  else
     Console.WriteLine ("No text extracted.");

Example 2 - Extracting the nth occurrence of a delimited string

This example shows how to obtain the href attribute of the third anchor tag (<a>) in an HTML string.  The example assumes the string contains valid HTML.
  // Get href attribute of 3rd <a> tag
  string strExtract = "";
  string str = "..."; // HTML
  StringParser p = new StringParser (str);
  // skipToEndOfNoCase() moves past each match, so the next call finds the next tag
  if (p.skipToEndOfNoCase ("<a") &&
      p.skipToEndOfNoCase ("<a") &&
      p.skipToEndOfNoCase ("<a") &&
      p.skipToEndOfNoCase ("href=\"") &&
      p.extractTo ("\"", ref strExtract))
     Console.WriteLine ("Extracted text = {0}", strExtract);
  else
     Console.WriteLine ("No text extracted.");
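
Repeating the same skip call works for a small, fixed n. For a variable n, a loop is easier to maintain. The following is a minimal sketch using only string.IndexOf; NthSketch and IndexOfNth are hypothetical names and are not part of StringParser.

```csharp
using System;

public static class NthSketch
{
    // Returns the index of the nth (1-based) case-insensitive occurrence
    // of value in content, or -1 if there are fewer than n occurrences.
    public static int IndexOfNth(string content, string value, int n)
    {
        if (string.IsNullOrEmpty(value))
            throw new ArgumentException("value must not be empty", "value");
        if (n < 1)
            throw new ArgumentOutOfRangeException("n", "n must be 1 or greater");

        int index = -1;
        for (int i = 0; i < n; i++)
        {
            // Resume one character past the previous match so it is not found again.
            index = content.IndexOf(value, index + 1, StringComparison.OrdinalIgnoreCase);
            if (index == -1) return -1;
        }
        return index;
    }
}
```

Like the example above, this matches raw text: searching for "<a" also matches other tags whose names begin with "a", so it assumes the input contains only the tags you expect.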

Example 3 - Global case-insensitive replacement

This example shows how to case-insensitively replace a string in the parser's content.
  // Replace every occurrence of <td> with <td class="foo">
  string str = "..."; // HTML
  StringParser p = new StringParser (str);
  p.replaceEvery ("<td>", "<td class=\"foo\">");
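
For comparison, the same kind of replacement can be done on a plain string with the .NET Regex class. This is a sketch of an alternative technique, not a description of how replaceEvery works internally; ReplaceSketch and ReplaceEveryNoCase are hypothetical names.

```csharp
using System.Text.RegularExpressions;

public static class ReplaceSketch
{
    // Replaces every case-insensitive occurrence of 'occurrence' with 'replacement'.
    public static string ReplaceEveryNoCase(string content, string occurrence, string replacement)
    {
        // Regex.Escape makes the search text literal; doubling '$' keeps the
        // replacement literal too, since '$' introduces substitutions in .NET.
        return Regex.Replace(content,
                             Regex.Escape(occurrence),
                             replacement.Replace("$", "$$"),
                             RegexOptions.IgnoreCase);
    }
}
```

Because Regex.Replace makes a single pass over the input, replacement text is never rescanned, so a replacement that happens to contain the search text is safe here.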

Example 4 - Poor man's web scraping

This example shows how to obtain a stock's quote from the content downloaded from Yahoo Finance (MSFT).  The example makes assumptions about the format of the web page.
  // Scrape http://finance.yahoo.com/q?s=msft
  string strQuote = "";
  string str = "..."; // HTML downloaded from http://finance.yahoo.com/q?s=msft
  StringParser p = new StringParser (str);
  if (p.skipToEndOfNoCase ("Last Trade:</td><td class=\"yfnc_tabledata1\"><big><b>") &&
      p.extractTo ("</b>", ref strQuote))
     Console.WriteLine ("MSFT (delayed) = {0}", strQuote);
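
Because scraping depends on page markup that can change without notice, it helps to make the failure case explicit. The sketch below wraps the "skip past a marker, then extract up to the next marker" pattern in a helper that reports whether both markers were found. ScrapeSketch and TryExtractBetween are hypothetical names, and the code uses only string.IndexOf rather than StringParser.

```csharp
using System;

public static class ScrapeSketch
{
    // Finds startMarker (ignoring case), then returns the trimmed text up to
    // the next endMarker. Returns false, with value empty, if either is missing.
    public static bool TryExtractBetween(string content, string startMarker,
                                         string endMarker, out string value)
    {
        value = "";
        int start = content.IndexOf(startMarker, StringComparison.OrdinalIgnoreCase);
        if (start == -1) return false;
        start += startMarker.Length;

        int end = content.IndexOf(endMarker, start, StringComparison.OrdinalIgnoreCase);
        if (end == -1) return false;

        value = content.Substring(start, end - start).Trim();
        return true;
    }
}
```

A caller can then treat a false return as "the page layout no longer matches my assumptions" instead of silently printing nothing.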

Example 5 - Get list of hyperlinked phrases

This example shows how to obtain the list of hyperlinked phrases in HTML content.
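  ArrayList phrases = new ArrayList();
  string str = "..."; // HTML content
  StringParser p = new StringParser (str);
  while (p.skipToStartOfNoCase ("<a")) {
    string strPhrase = "";
    // Skip the rest of the opening tag, then extract up to the closing </a>
    if (!p.skipToEndOf (">") || !p.extractToNoCase ("</a>", ref strPhrase))
       break; // unterminated tag: stop instead of looping forever
    phrases.Add (strPhrase);
  }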

Demo applications

C# applications (with full source code) that use StringParser can be found here:

Revision History

  • 15 Jan 2006
    Initial version.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

About the Author

Ravi Bhavnani
Technical Lead
Canada
Ravi Bhavnani is an ardent fan of Microsoft technologies who loves building Windows apps, especially PIMs, system utilities, and things that go bump on the Internet. During his career, Ravi has developed expert systems, desktop imaging apps, marketing automation software, EDA tools, a platform to help people find, analyze and understand information, trading software for institutional investors and advanced data visualization solutions. He currently works for a company that provides enterprise workforce management solutions to large clients.
 
His interests include the .NET framework, reasoning systems, financial analysis and algorithmic trading, NLP, HCI and UI design. Ravi holds a BS in Physics and Math and an MS in Computer Science and was a Microsoft MVP (C++ and C# in 2006 and 2007). He is also the co-inventor of 2 patents on software security and generating data visualization dashboards. His claim to fame is that he crafted CodeProject's "joke" forum post icon.
 
Ravi's biggest fear is that one day he might actually get a life, although the chances of that happening seem extremely remote.

Comments and Discussions

 
Question: I am newby and need your help. (Member 8416441, 27-May-12 21:41)
Imports System
Imports System.Collections.Generic
Imports System.Linq
Imports System.Web
Imports System.Web.UI
Imports System.Web.UI.WebControls
Imports System.Collections
 
Imports RavSoft.StringParser
Partial Public Class Pages_HTMLtoTEXT_Converter
Inherits Global.Gamesys.Web.PageBase

Public ReadOnly Property CssClass() As String
Get
Return "DiscussionsPage"
End Get
End Property

Protected Sub Page_Load(ByVal sender As Object, ByVal e As EventArgs) Handles Me.Load
End Sub
 
Protected Sub Button2_Click(ByVal sender As Object, ByVal e As System.EventArgs) Handles Button2.Click
output.InnerHtml = removeHtml(input.InnerHtml)
End Sub
 
End Class
 
Namespace RavSoft
'''
''' A class that helps you to extract information from a string.
'''

Public Class StringParser
'''
''' Default constructor.
'''

Public Sub New()
End Sub
 
'''
''' Constructs a StringParser with specific content.
'''

''' The parser's content.
Public Sub New(ByVal strContent As String)
Content = strContent
End Sub
 
'''//////////
' Properties
 
''' Gets and sets the content to be parsed.
Public Property Content() As String
Get
Return m_strContent
End Get
Set(ByVal value As String)
m_strContent = value
m_strContentLC = m_strContent.ToLower()
resetPosition()
End Set
End Property
 
''' Gets the parser's current position.
Public ReadOnly Property Position() As Integer
Get
Return m_nIndex
End Get
End Property
 
'''//////////////
' Static methods
 
'''
''' Retrieves the collection of HTML links in a string.
'''

''' The string.
''' Root url (may be null).
''' Collection of document link strings.
''' Collection of image link strings.
Public Shared Sub getLinks(ByVal strString As String, ByVal strRootUrl As String, ByRef documents As ArrayList, ByRef images As ArrayList)
' Remove comments and JavaScript and fix links
strString = StringParser.removeComments(strString)
strString = StringParser.removeScripts(strString)
Dim parser As New StringParser(strString)
parser.replaceEvery("'", """")
 
' Set root url
Dim rootUrl As String = ""
If strRootUrl IsNot Nothing Then
rootUrl = strRootUrl.Trim()
End If
If (rootUrl.Length > 0) AndAlso Not rootUrl.EndsWith("/") Then
rootUrl += "/"
End If
 
' Extract HREF targets
Dim strUrl As String = ""
parser.resetPosition()
While parser.skipToEndOfNoCase("href=""")
If parser.extractTo("""", strUrl) Then
strUrl = strUrl.Trim()
If strUrl.Length > 0 Then
If strUrl.IndexOf("mailto:") = -1 Then
 
' Get fully qualified url (best guess)
If Not strUrl.StartsWith("http://") AndAlso Not strUrl.StartsWith("ftp://") Then
Try
Dim uriBuilder As New UriBuilder(rootUrl)
uriBuilder.Path = strUrl
strUrl = uriBuilder.Uri.ToString()
Catch generatedExceptionName As Exception
strUrl = "http://" & strUrl
End Try
End If
 
' Add url to document list if not already present
If Not documents.Contains(strUrl) Then
documents.Add(strUrl)
End If
End If
End If
End If
End While
 
' Extract SRC targets
parser.resetPosition()
While parser.skipToEndOfNoCase("src=""")
If parser.extractTo("""", strUrl) Then
strUrl = strUrl.Trim()
If strUrl.Length > 0 Then
 
' Get fully qualified url (best guess)
If Not strUrl.StartsWith("http://") AndAlso Not strUrl.StartsWith("ftp://") Then
Try
Dim uriBuilder As New UriBuilder(rootUrl)
uriBuilder.Path = strUrl
strUrl = uriBuilder.Uri.ToString()
Catch generatedExceptionName As Exception
strUrl = "http://" & strUrl
End Try
End If
 
' Add url to images list if not already present
If Not images.Contains(strUrl) Then
images.Add(strUrl)
End If
End If
End If
End While
End Sub
 
'''
''' Removes all HTML comments from a string.
'''

''' The string.
''' Comment-free version of string.
Public Shared Function removeComments(ByVal strString As String) As String
' Return comment-free version of string
Dim strCommentFreeString As String = ""
Dim strSegment As String = ""
Dim parser As New StringParser(strString)
 
While parser.extractTo("<!--", strSegment)
strCommentFreeString += strSegment
If Not parser.skipToEndOf("-->") Then
Return strString
End If
End While
 
parser.extractToEnd(strSegment)
strCommentFreeString += strSegment
Return strCommentFreeString
End Function
 
'''
''' Returns an unanchored version of a string, i.e. without the enclosing
''' leftmost <a...> and rightmost </a> tags.
'''

''' The string.
''' Unanchored version of string.
Public Shared Function removeEnclosingAnchorTag(ByVal strString As String) As String
Dim strStringLC As String = strString.ToLower()
Dim nStart As Integer = strStringLC.IndexOf("<a")
If nStart <> -1 Then
nStart += 1
nStart = strStringLC.IndexOf(">", nStart)
If nStart <> -1 Then
nStart += 1
Dim nEnd As Integer = strStringLC.LastIndexOf("</a>")
If nEnd <> -1 Then
Dim strRet As String = strString.Substring(nStart, nEnd - nStart)
Return strRet
End If
End If
End If
Return strString
End Function
 
'''
''' Returns an unquoted version of a string, i.e. without the enclosing
''' leftmost and rightmost double " characters.
'''

''' The string.
''' Unquoted version of string.
Public Shared Function removeEnclosingQuotes(ByVal strString As String) As String
Dim nStart As Integer = strString.IndexOf("""")
If nStart <> -1 Then
Dim nEnd As Integer = strString.LastIndexOf("""")
If nEnd > nStart Then
Return strString.Substring(nStart + 1, nEnd - nStart - 1)
End If
End If
Return strString
End Function
 
'''
''' Returns a version of a string without any HTML tags.
'''

''' The string.
''' Version of string without HTML tags.
Public Shared Function removeHtml(ByVal strString As String) As String
' Do some common case-sensitive replacements
Dim replacements As New Hashtable()
replacements.Add("&nbsp;", " ")
replacements.Add("&amp;", "&")
replacements.Add("&aring;", "")
replacements.Add("&auml;", "")
replacements.Add("&eacute;", "")
replacements.Add("&iacute;", "")
replacements.Add("&igrave;", "")
replacements.Add("&ograve;", "")
replacements.Add("&ouml;", "")
replacements.Add("&quot;", """")
replacements.Add("&szlig;", "")
Dim parser As New StringParser(strString)
For Each key As String In replacements.Keys
Dim val As String = TryCast(replacements(key), String)
If strString.IndexOf(key) <> -1 Then
parser.replaceEveryExact(key, val)
End If
Next
 
' Do some sequential replacements
parser.replaceEveryExact("&#0", "&#")
parser.replaceEveryExact("&#39;", "'")
parser.replaceEveryExact("</", " <~/")
parser.replaceEveryExact("<~/", "</")
 
' Case-insensitive replacements
replacements.Clear()
replacements.Add("<br>", " ")
replacements.Add("<p>", " ")
For Each key As String In replacements.Keys
Dim val As String = TryCast(replacements(key), String)
If strString.IndexOf(key) <> -1 Then
parser.replaceEvery(key, val)
End If
Next
strString = parser.Content
 
' Remove all tags
Dim strClean As String = ""
Dim nIndex As Integer = 0
Dim nStartTag As Integer = 0
While (InlineAssignHelper(nStartTag, strString.IndexOf("<", nIndex))) <> -1
 
' Extract to start of tag
Dim strSubstring As String = strString.Substring(nIndex, (nStartTag - nIndex))
strClean += strSubstring
nIndex = nStartTag + 1
 
' Skip over tag
Dim nEndTag As Integer = strString.IndexOf(">", nIndex)
If nEndTag = (-1) Then
Exit While
End If
nIndex = nEndTag + 1
End While
 
' Gather remaining text
If nIndex < strString.Length Then
strClean += strString.Substring(nIndex, strString.Length - nIndex)
End If
strString = strClean
strClean = ""
 
' Finally, reduce spaces
parser.Content = strString
parser.replaceEveryExact("  ", " ")
strString = parser.Content.Trim()
 
' Return the de-HTMLized string
Return strString
End Function
 
'''
''' Removes all scripts from a string.
'''

''' The string.
''' Version of string without any scripts.
Public Shared Function removeScripts(ByVal strString As String) As String
' Get script-free version of content
Dim strStringSansScripts As String = ""
Dim strSegment As String = ""
Dim parser As New StringParser(strString)
 
While parser.extractToNoCase("<script", strSegment)
strStringSansScripts += strSegment
If Not parser.skipToEndOfNoCase("</script>") Then
parser.Content = strStringSansScripts
Return strString
End If
End While
 
parser.extractToEnd(strSegment)
strStringSansScripts += strSegment
Return (strStringSansScripts)
End Function
 
'''//////////
' Operations
 
'''
''' Checks if the parser is case-sensitively positioned at the start
''' of a string.
'''

''' The string.
'''
''' true if the parser is positioned at the start of the string, false
''' otherwise.
'''

Public Function at(ByVal strString As String) As Boolean
If m_strContent.IndexOf(strString, Position) = Position Then
Return (True)
End If
Return (False)
End Function
 
'''
''' Checks if the parser is case-insensitively positioned at the start
''' of a string.
'''

''' The string.
'''
''' true if the parser is positioned at the start of the string, false
''' otherwise.
'''

Public Function atNoCase(ByVal strString As String) As Boolean
strString = strString.ToLower()
If m_strContentLC.IndexOf(strString, Position) = Position Then
Return (True)
End If
Return (False)
End Function
 
'''
''' Extracts the text from the parser's current position to the case-
''' sensitive start of a string and advances the parser just after the
''' string.
'''

''' The string.
''' The extracted text.
''' true if the parser was advanced, false otherwise.
Public Function extractTo(ByVal strString As String, ByRef strExtract As String) As Boolean
Dim nPos As Integer = m_strContent.IndexOf(strString, Position)
If nPos <> -1 Then
strExtract = m_strContent.Substring(m_nIndex, nPos - m_nIndex)
m_nIndex = nPos + strString.Length
Return (True)
End If
Return (False)
End Function
 
'''
''' Extracts the text from the parser's current position to the case-
''' insensitive start of a string and advances the parser just after the
''' string.
'''

''' The string.
''' The extracted text.
''' true if the parser was advanced, false otherwise.
Public Function extractToNoCase(ByVal strString As String, ByRef strExtract As String) As Boolean
strString = strString.ToLower()
Dim nPos As Integer = m_strContentLC.IndexOf(strString, Position)
If nPos <> -1 Then
strExtract = m_strContent.Substring(m_nIndex, nPos - m_nIndex)
m_nIndex = nPos + strString.Length
Return (True)
End If
Return (False)
End Function
 
'''
''' Extracts the text from the parser's current position to the case-
''' sensitive start of a string and position's the parser at the start
''' of the string.
'''

''' The string.
''' The extracted text.
''' true if the parser was advanced, false otherwise.
Public Function extractUntil(ByVal strString As String, ByRef strExtract As String) As Boolean
Dim nPos As Integer = m_strContent.IndexOf(strString, Position)
If nPos <> -1 Then
strExtract = m_strContent.Substring(m_nIndex, nPos - m_nIndex)
m_nIndex = nPos
Return (True)
End If
Return (False)
End Function
 
'''
''' Extracts the text from the parser's current position to the case-
''' insensitive start of a string and position's the parser at the start
''' of the string.
'''

''' The string.
''' The extracted text.
''' true if the parser was advanced, false otherwise.
Public Function extractUntilNoCase(ByVal strString As String, ByRef strExtract As String) As Boolean
strString = strString.ToLower()
Dim nPos As Integer = m_strContentLC.IndexOf(strString, Position)
If nPos <> -1 Then
strExtract = m_strContent.Substring(m_nIndex, nPos - m_nIndex)
m_nIndex = nPos
Return (True)
End If
Return (False)
End Function
 
'''
''' Extracts the text from the parser's current position to the end
''' of its content and does not advance the parser's position.
'''

''' The extracted text.
Public Sub extractToEnd(ByRef strExtract As String)
strExtract = ""
If Position < m_strContent.Length Then
Dim nRemainLen As Integer = m_strContent.Length - Position
strExtract = m_strContent.Substring(Position, nRemainLen)
End If
End Sub
 
'''
''' Case-insensitively replaces every occurence of a string in the
''' parser's content with another.
'''

''' The occurrence.
''' The replacement string.
''' The number of occurences replaced.
Public Function replaceEvery(ByVal strOccurrence As String, ByVal strReplacement As String) As Integer
' Initialize replacement process
Dim nReplacements As Integer = 0
strOccurrence = strOccurrence.ToLower()
 
' For every occurence...
Dim nOccurrence As Integer = m_strContentLC.IndexOf(strOccurrence)
While nOccurrence <> -1
 
' Create replaced substring
Dim strReplacedString As String = m_strContent.Substring(0, nOccurrence) & strReplacement
 
' Add remaining substring (if any)
Dim nStartOfRemainingSubstring As Integer = nOccurrence + strOccurrence.Length
If nStartOfRemainingSubstring < m_strContent.Length Then
Dim strSecondPart As String = m_strContent.Substring(nStartOfRemainingSubstring, m_strContent.Length - nStartOfRemainingSubstring)
strReplacedString += strSecondPart
End If
 
' Update the original string
m_strContent = strReplacedString
m_strContentLC = m_strContent.ToLower()
nReplacements += 1
 
' Find the next occurence
nOccurrence = m_strContentLC.IndexOf(strOccurrence)
End While
Return (nReplacements)
End Function
 
'''
''' Case sensitive version of replaceEvery()
'''

''' The occurrence.
''' The replacement string.
''' The number of occurences replaced.
Public Function replaceEveryExact(ByVal strOccurrence As String, ByVal strReplacement As String) As Integer
Dim nReplacements As Integer = 0
While m_strContent.IndexOf(strOccurrence) <> -1
m_strContent = m_strContent.Replace(strOccurrence, strReplacement)
nReplacements += 1
End While
m_strContentLC = m_strContent.ToLower()
Return (nReplacements)
End Function
 
'''
''' Resets the parser's position to the start of the content.
'''

Public Sub resetPosition()
m_nIndex = 0
End Sub
 
'''
''' Advances the parser's position to the start of the next case-sensitive
''' occurence of a string.
'''

''' The string.
''' true if the parser's position was advanced, false otherwise.
Public Function skipToStartOf(ByVal strString As String) As Boolean
Dim bStatus As Boolean = seekTo(strString, False, False)
Return (bStatus)
End Function
 
'''
''' Advances the parser's position to the start of the next case-insensitive
''' occurence of a string.
'''

''' The string.
''' true if the parser's position was advanced, false otherwise.
Public Function skipToStartOfNoCase(ByVal strText As String) As Boolean
Dim bStatus As Boolean = seekTo(strText, True, False)
Return (bStatus)
End Function
 
'''
''' Advances the parser's position to the end of the next case-sensitive
''' occurence of a string.
'''

''' The string.
''' true if the parser's position was advanced, false otherwise.
Public Function skipToEndOf(ByVal strString As String) As Boolean
Dim bStatus As Boolean = seekTo(strString, False, True)
Return (bStatus)
End Function
 
'''
''' Advances the parser's position to the end of the next case-insensitive
''' occurence of a string.
'''

''' The string.
''' true if the parser's position was advanced, false otherwise.
Public Function skipToEndOfNoCase(ByVal strText As String) As Boolean
Dim bStatus As Boolean = seekTo(strText, True, True)
Return (bStatus)
End Function
 
' ////////////////////////
' Implementation (members)
 
''' Content to be parsed.
Private m_strContent As String = ""
 
''' Lower-cased version of content to be parsed.
Private m_strContentLC As String = ""
 
''' Current position in content.
Private m_nIndex As Integer = 0
 
' ////////////////////////
' Implementation (methods)
 
'''
''' Advances the parser's position to the next occurence of a string.
'''

''' The string.
''' Flag: perform a case-insensitive search.
''' Flag: position parser just after string.
'''
Private Function seekTo(ByVal strString As String, ByVal bNoCase As Boolean, ByVal bPositionAfter As Boolean) As Boolean
If Position < m_strContent.Length Then
 
' Find the start of the string - return if not found
Dim nNewIndex As Integer = 0
If bNoCase Then
strString = strString.ToLower()
nNewIndex = m_strContentLC.IndexOf(strString, Position)
Else
nNewIndex = m_strContent.IndexOf(strString, Position)
End If
If nNewIndex = -1 Then
Return (False)
End If
 
' Position the parser
m_nIndex = nNewIndex
If bPositionAfter Then
m_nIndex += strString.Length
End If
Return (True)
End If
Return (False)
End Function
Private Shared Function InlineAssignHelper(Of T)(ByRef target As T, ByVal value As T) As T
target = value
Return value
End Function
End Class
End Namespace

Last Updated 15 Jan 2006
Article Copyright 2006 by Ravi Bhavnani