Posted 15 Jan 2006
Comments and Discussions
|
I love a simple tool that does what it does. Thanks for not building a framework that I had to import 50 assemblies for, and it only works in 4.5 and not 3.5 or 4.0 but in 4.0000....
|
/ravi
|
Thanks. It's used to much advantage here[^] and here[^].
/ravi
|
Hi Ravi, you have explained scraping very well. I am working on a project in which I need to scrape data from a grid view. The problem is that the grid view is paginated. How can I fetch the data for the entire grid view?
|
This question is better suited to my article WebResourceProvider goes .NET[^]. The key is the continueFetching() method where you return true when you discover you have more pages to fetch. The framework will then call getFetchUrl() which returns the url of the next page to parse. You keep doing this until you discover there are no more pages to scrape.
/ravi
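To make the loop described above concrete, here is a minimal editorial sketch. It does not use the WebResourceProvider API: `PagedFetchSketch`, `FetchAll`, `fetchPage`, `findNextUrl` and `maxPages` are hypothetical names chosen for illustration. `findNextUrl` stands in for the "are there more pages, and where is the next one?" decision, and returns null or an empty string when nothing is left to fetch.

```csharp
using System;
using System.Collections.Generic;

public static class PagedFetchSketch
{
    // Fetches pages starting at firstUrl until findNextUrl reports no further page.
    // fetchPage and findNextUrl are hypothetical caller-supplied hooks; they are
    // NOT part of WebResourceProvider or StringParser.
    public static List<string> FetchAll(
        string firstUrl,
        Func<string, string> fetchPage,
        Func<string, string> findNextUrl,
        int maxPages = 100)
    {
        var pages = new List<string>();
        var seen = new HashSet<string>();
        string url = firstUrl;

        // Stop when there is no next url, when a url repeats (loop guard),
        // or when the page cap is reached.
        while (!string.IsNullOrEmpty(url) && seen.Add(url) && pages.Count < maxPages)
        {
            string html = fetchPage(url);
            pages.Add(html);
            url = findNextUrl(html); // null or "" means "no more pages"
        }
        return pages;
    }
}
```

Note that an ASP.NET GridView pager usually pages through a form postback rather than a plain link, so for such pages `fetchPage` would probably need to POST the form fields instead of issuing a simple GET.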
|
Thank you, sir. I am taking your suggestion into account. But if I can't resolve it, please help me accomplish the task.
|
I can try to help if you have a specific question about your code. But please don't expect me (or anyone else at CP) to write your code for you. It doesn't work that way.
/ravi
|
Sure, sir. If my code is not working, I will post it so you can identify the error.
|
Hello Ravi,
Firstly, thank you for the good work you are doing already! Although it may be asking for a bit more, I'll give it a try.
I'm looking for a tool/program/software that connects to Amibroker (http://amibroker.in) using any software/code/script (DDE/C#/Excel or whatever) to retrieve real-time stock quotes from Google Finance, updated every minute (or maybe every 2 minutes).
There is paid software that does the job, but I'm looking to see if anyone can provide a free interface.
There is a free program, DataFeeder from stocklive.in, which downloads data from Yahoo Finance and feeds it to Amibroker (i.e. it writes the stock quote minute by minute into the Amibroker database/txt file, and Amibroker automatically updates the chart as soon as it finds the new info). However, Yahoo quotes are delayed by 10-15 minutes, so I am looking for an alternative based on Google Finance.
Thanks in advance.
Raj
|
Imports System
Imports System.Collections.Generic
Imports System.Linq
Imports System.Web
Imports System.Web.UI
Imports System.Web.UI.WebControls
Imports System.Collections
Imports RavSoft.StringParser
Partial Public Class Pages_HTMLtoTEXT_Converter
Inherits Global.Gamesys.Web.PageBase
Public ReadOnly Property CssClass() As String
Get
Return "DiscussionsPage"
End Get
End Property
Protected Sub Page_Load(ByVal sender As Object, ByVal e As EventArgs) Handles Me.Load
End Sub
Protected Sub Button2_Click(ByVal sender As Object, ByVal e As System.EventArgs) Handles Button2.Click
output.InnerHtml = removeHtml(input.InnerHtml)
End Sub
End Class
Namespace RavSoft
''' <summary>
''' A class that helps you to extract information from a string.
'''
Public Class StringParser
''' <summary>
''' Default constructor.
'''
Public Sub New()
End Sub
''' <summary>
''' Constructs a StringParser with specific content.
'''
''' <param name="strContent">The parser's content.
Public Sub New(ByVal strContent As String)
Content = strContent
End Sub
'''//////////
' Properties
''' <summary>Gets and sets the content to be parsed.
Public Property Content() As String
Get
Return m_strContent
End Get
Set(ByVal value As String)
m_strContent = value
m_strContentLC = m_strContent.ToLower()
resetPosition()
End Set
End Property
''' <summary>Gets the parser's current position.
Public ReadOnly Property Position() As Integer
Get
Return m_nIndex
End Get
End Property
'''//////////////
' Static methods
''' <summary>
''' Retrieves the collection of HTML links in a string.
'''
''' <param name="strString">The string.
''' <param name="strRootUrl">Root url (may be null).
''' <param name="documents">Collection of document link strings.
''' <param name="images">Collection of image link strings.
Public Shared Sub getLinks(ByVal strString As String, ByVal strRootUrl As String, ByRef documents As ArrayList, ByRef images As ArrayList)
' Remove comments and JavaScript and fix links
strString = StringParser.removeComments(strString)
strString = StringParser.removeScripts(strString)
Dim parser As New StringParser(strString)
parser.replaceEvery("'", """")
' Set root url
Dim rootUrl As String = ""
If strRootUrl IsNot Nothing Then
rootUrl = strRootUrl.Trim()
End If
If (rootUrl.Length > 0) AndAlso Not rootUrl.EndsWith("/") Then
rootUrl += "/"
End If
' Extract HREF targets
Dim strUrl As String = ""
parser.resetPosition()
While parser.skipToEndOfNoCase("href=""")
If parser.extractTo("""", strUrl) Then
strUrl = strUrl.Trim()
If strUrl.Length > 0 Then
If strUrl.IndexOf("mailto:") = -1 Then
' Get fully qualified url (best guess)
If Not strUrl.StartsWith("http://") AndAlso Not strUrl.StartsWith("ftp://") Then
Try
Dim uriBuilder As New UriBuilder(rootUrl)
uriBuilder.Path = strUrl
strUrl = uriBuilder.Uri.ToString()
Catch generatedExceptionName As Exception
strUrl = "http://" & strUrl
End Try
End If
' Add url to document list if not already present
If Not documents.Contains(strUrl) Then
documents.Add(strUrl)
End If
End If
End If
End If
End While
' Extract SRC targets
parser.resetPosition()
While parser.skipToEndOfNoCase("src=""")
If parser.extractTo("""", strUrl) Then
strUrl = strUrl.Trim()
If strUrl.Length > 0 Then
' Get fully qualified url (best guess)
If Not strUrl.StartsWith("http://") AndAlso Not strUrl.StartsWith("ftp://") Then
Try
Dim uriBuilder As New UriBuilder(rootUrl)
uriBuilder.Path = strUrl
strUrl = uriBuilder.Uri.ToString()
Catch generatedExceptionName As Exception
strUrl = "http://" & strUrl
End Try
End If
' Add url to images list if not already present
If Not images.Contains(strUrl) Then
images.Add(strUrl)
End If
End If
End If
End While
End Sub
''' <summary>
''' Removes all HTML comments from a string.
'''
''' <param name="strString">The string.
''' <returns>Comment-free version of string.
Public Shared Function removeComments(ByVal strString As String) As String
' Return comment-free version of string
Dim strCommentFreeString As String = ""
Dim strSegment As String = ""
Dim parser As New StringParser(strString)
While parser.extractTo("<!--", strSegment)
strCommentFreeString += strSegment
If Not parser.skipToEndOf("-->") Then
Return strString
End If
End While
parser.extractToEnd(strSegment)
strCommentFreeString += strSegment
Return strCommentFreeString
End Function
''' <summary>
''' Returns an unanchored version of a string, i.e. without the enclosing
''' leftmost <a...> and rightmost </a> tags.
'''
''' <param name="strString">The string.
''' <returns>Unanchored version of string.
Public Shared Function removeEnclosingAnchorTag(ByVal strString As String) As String
Dim strStringLC As String = strString.ToLower()
Dim nStart As Integer = strStringLC.IndexOf("<a")
If nStart <> -1 Then
nStart += 1
nStart = strStringLC.IndexOf(">", nStart)
If nStart <> -1 Then
nStart += 1
Dim nEnd As Integer = strStringLC.LastIndexOf("</a>")
If nEnd <> -1 Then
Dim strRet As String = strString.Substring(nStart, nEnd - nStart)
Return strRet
End If
End If
End If
Return strString
End Function
''' <summary>
''' Returns an unquoted version of a string, i.e. without the enclosing
''' leftmost and rightmost double " characters.
'''
''' <param name="strString">The string.
''' <returns>Unquoted version of string.
Public Shared Function removeEnclosingQuotes(ByVal strString As String) As String
Dim nStart As Integer = strString.IndexOf("""")
If nStart <> -1 Then
Dim nEnd As Integer = strString.LastIndexOf("""")
If nEnd > nStart Then
' Start just after the opening quote so that neither quote is returned
Return strString.Substring(nStart + 1, nEnd - nStart - 1)
End If
End If
Return strString
End Function
''' <summary>
''' Returns a version of a string without any HTML tags.
'''
''' <param name="strString">The string.
''' <returns>Version of string without HTML tags.
Public Shared Function removeHtml(ByVal strString As String) As String
' Do some common case-sensitive replacements
Dim replacements As New Hashtable()
replacements.Add("&nbsp;", " ")
replacements.Add("&amp;", "&")
replacements.Add("&aring;", "")
replacements.Add("&auml;", "")
replacements.Add("&eacute;", "")
replacements.Add("&iacute;", "")
replacements.Add("&igrave;", "")
replacements.Add("&ograve;", "")
replacements.Add("&ouml;", "")
replacements.Add("&quot;", """")
replacements.Add("&szlig;", "")
Dim parser As New StringParser(strString)
For Each key As String In replacements.Keys
Dim val As String = TryCast(replacements(key), String)
If strString.IndexOf(key) <> -1 Then
parser.replaceEveryExact(key, val)
End If
Next
' Do some sequential replacements
parser.replaceEveryExact("&#0", "&#")
parser.replaceEveryExact("&#39;", "'")
parser.replaceEveryExact("</", " <~/")
parser.replaceEveryExact("<~/", "</")
' Case-insensitive replacements
replacements.Clear()
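replacements.Add("&nbsp;", " ")
' Assumption: the tag in the next key was stripped when the code was posted; <br> is a guess
replacements.Add("<br>", " ")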
For Each key As String In replacements.Keys
Dim val As String = TryCast(replacements(key), String)
If strString.IndexOf(key) <> -1 Then
parser.replaceEvery(key, val)
End If
Next
strString = parser.Content
' Remove all tags
Dim strClean As String = ""
Dim nIndex As Integer = 0
Dim nStartTag As Integer = 0
While (InlineAssignHelper(nStartTag, strString.IndexOf("<", nIndex))) <> -1
' Extract to start of tag
Dim strSubstring As String = strString.Substring(nIndex, (nStartTag - nIndex))
strClean += strSubstring
nIndex = nStartTag + 1
' Skip over tag
Dim nEndTag As Integer = strString.IndexOf(">", nIndex)
If nEndTag = (-1) Then
Exit While
End If
nIndex = nEndTag + 1
End While
' Gather remaining text
If nIndex < strString.Length Then
strClean += strString.Substring(nIndex, strString.Length - nIndex)
End If
strString = strClean
strClean = ""
' Finally, reduce spaces
parser.Content = strString
' Collapse runs of spaces: replace two spaces with one until none remain
parser.replaceEveryExact("  ", " ")
strString = parser.Content.Trim()
' Return the de-HTMLized string
Return strString
End Function
''' <summary>
''' Removes all scripts from a string.
'''
''' <param name="strString">The string.
''' <returns>Version of string without any scripts.
Public Shared Function removeScripts(ByVal strString As String) As String
' Get script-free version of content
Dim strStringSansScripts As String = ""
Dim strSegment As String = ""
Dim parser As New StringParser(strString)
While parser.extractToNoCase("<script", strSegment)
strStringSansScripts += strSegment
If Not parser.skipToEndOfNoCase("</script>") Then
parser.Content = strStringSansScripts
Return strString
End If
End While
parser.extractToEnd(strSegment)
strStringSansScripts += strSegment
Return (strStringSansScripts)
End Function
'''//////////
' Operations
''' <summary>
''' Checks if the parser is case-sensitively positioned at the start
''' of a string.
'''
''' <param name="strString">The string.
''' <returns>
''' true if the parser is positioned at the start of the string, false
''' otherwise.
'''
Public Function at(ByVal strString As String) As Boolean
If m_strContent.IndexOf(strString, Position) = Position Then
Return (True)
End If
Return (False)
End Function
''' <summary>
''' Checks if the parser is case-insensitively positioned at the start
''' of a string.
'''
''' <param name="strString">The string.
''' <returns>
''' true if the parser is positioned at the start of the string, false
''' otherwise.
'''
Public Function atNoCase(ByVal strString As String) As Boolean
strString = strString.ToLower()
If m_strContentLC.IndexOf(strString, Position) = Position Then
Return (True)
End If
Return (False)
End Function
''' <summary>
''' Extracts the text from the parser's current position to the case-
''' sensitive start of a string and advances the parser just after the
''' string.
'''
''' <param name="strString">The string.
''' <param name="strExtract">The extracted text.
''' <returns>true if the parser was advanced, false otherwise.
Public Function extractTo(ByVal strString As String, ByRef strExtract As String) As Boolean
Dim nPos As Integer = m_strContent.IndexOf(strString, Position)
If nPos <> -1 Then
strExtract = m_strContent.Substring(m_nIndex, nPos - m_nIndex)
m_nIndex = nPos + strString.Length
Return (True)
End If
Return (False)
End Function
''' <summary>
''' Extracts the text from the parser's current position to the case-
''' insensitive start of a string and advances the parser just after the
''' string.
'''
''' <param name="strString">The string.
''' <param name="strExtract">The extracted text.
''' <returns>true if the parser was advanced, false otherwise.
Public Function extractToNoCase(ByVal strString As String, ByRef strExtract As String) As Boolean
strString = strString.ToLower()
Dim nPos As Integer = m_strContentLC.IndexOf(strString, Position)
If nPos <> -1 Then
strExtract = m_strContent.Substring(m_nIndex, nPos - m_nIndex)
m_nIndex = nPos + strString.Length
Return (True)
End If
Return (False)
End Function
''' <summary>
''' Extracts the text from the parser's current position to the case-
''' sensitive start of a string and position's the parser at the start
''' of the string.
'''
''' <param name="strString">The string.
''' <param name="strExtract">The extracted text.
''' <returns>true if the parser was advanced, false otherwise.
Public Function extractUntil(ByVal strString As String, ByRef strExtract As String) As Boolean
Dim nPos As Integer = m_strContent.IndexOf(strString, Position)
If nPos <> -1 Then
strExtract = m_strContent.Substring(m_nIndex, nPos - m_nIndex)
m_nIndex = nPos
Return (True)
End If
Return (False)
End Function
''' <summary>
''' Extracts the text from the parser's current position to the case-
''' insensitive start of a string and position's the parser at the start
''' of the string.
'''
''' <param name="strString">The string.
''' <param name="strExtract">The extracted text.
''' <returns>true if the parser was advanced, false otherwise.
Public Function extractUntilNoCase(ByVal strString As String, ByRef strExtract As String) As Boolean
strString = strString.ToLower()
Dim nPos As Integer = m_strContentLC.IndexOf(strString, Position)
If nPos <> -1 Then
strExtract = m_strContent.Substring(m_nIndex, nPos - m_nIndex)
m_nIndex = nPos
Return (True)
End If
Return (False)
End Function
''' <summary>
''' Extracts the text from the parser's current position to the end
''' of its content and does not advance the parser's position.
'''
''' <param name="strExtract">The extracted text.
Public Sub extractToEnd(ByRef strExtract As String)
strExtract = ""
If Position < m_strContent.Length Then
Dim nRemainLen As Integer = m_strContent.Length - Position
strExtract = m_strContent.Substring(Position, nRemainLen)
End If
End Sub
''' <summary>
''' Case-insensitively replaces every occurence of a string in the
''' parser's content with another.
'''
''' <param name="strOccurrence">The occurrence.
''' <param name="strReplacement">The replacement string.
''' <returns>The number of occurences replaced.
Public Function replaceEvery(ByVal strOccurrence As String, ByVal strReplacement As String) As Integer
' Initialize replacement process
Dim nReplacements As Integer = 0
strOccurrence = strOccurrence.ToLower()
' For every occurence...
Dim nOccurrence As Integer = m_strContentLC.IndexOf(strOccurrence)
While nOccurrence <> -1
' Create replaced substring
Dim strReplacedString As String = m_strContent.Substring(0, nOccurrence) & strReplacement
' Add remaining substring (if any)
Dim nStartOfRemainingSubstring As Integer = nOccurrence + strOccurrence.Length
If nStartOfRemainingSubstring < m_strContent.Length Then
Dim strSecondPart As String = m_strContent.Substring(nStartOfRemainingSubstring, m_strContent.Length - nStartOfRemainingSubstring)
strReplacedString += strSecondPart
End If
' Update the original string
m_strContent = strReplacedString
m_strContentLC = m_strContent.ToLower()
nReplacements += 1
' Find the next occurence
nOccurrence = m_strContentLC.IndexOf(strOccurrence)
End While
Return (nReplacements)
End Function
''' <summary>
''' Case sensitive version of replaceEvery()
'''
''' <param name="strOccurrence">The occurrence.
''' <param name="strReplacement">The replacement string.
''' <returns>The number of occurences replaced.
Public Function replaceEveryExact(ByVal strOccurrence As String, ByVal strReplacement As String) As Integer
Dim nReplacements As Integer = 0
While m_strContent.IndexOf(strOccurrence) <> -1
m_strContent = m_strContent.Replace(strOccurrence, strReplacement)
nReplacements += 1
End While
m_strContentLC = m_strContent.ToLower()
Return (nReplacements)
End Function
''' <summary>
''' Resets the parser's position to the start of the content.
'''
Public Sub resetPosition()
m_nIndex = 0
End Sub
''' <summary>
''' Advances the parser's position to the start of the next case-sensitive
''' occurence of a string.
'''
''' <param name="strString">The string.
''' <returns>true if the parser's position was advanced, false otherwise.
Public Function skipToStartOf(ByVal strString As String) As Boolean
Dim bStatus As Boolean = seekTo(strString, False, False)
Return (bStatus)
End Function
''' <summary>
''' Advances the parser's position to the start of the next case-insensitive
''' occurence of a string.
'''
''' <param name="strText">The string.
''' <returns>true if the parser's position was advanced, false otherwise.
Public Function skipToStartOfNoCase(ByVal strText As String) As Boolean
Dim bStatus As Boolean = seekTo(strText, True, False)
Return (bStatus)
End Function
''' <summary>
''' Advances the parser's position to the end of the next case-sensitive
''' occurence of a string.
'''
''' <param name="strString">The string.
''' <returns>true if the parser's position was advanced, false otherwise.
Public Function skipToEndOf(ByVal strString As String) As Boolean
Dim bStatus As Boolean = seekTo(strString, False, True)
Return (bStatus)
End Function
''' <summary>
''' Advances the parser's position to the end of the next case-insensitive
''' occurence of a string.
'''
''' <param name="strText">The string.
''' <returns>true if the parser's position was advanced, false otherwise.
Public Function skipToEndOfNoCase(ByVal strText As String) As Boolean
Dim bStatus As Boolean = seekTo(strText, True, True)
Return (bStatus)
End Function
' ////////////////////////
' Implementation (members)
''' <summary>Content to be parsed.
Private m_strContent As String = ""
''' <summary>Lower-cased version of content to be parsed.
Private m_strContentLC As String = ""
''' <summary>Current position in content.
Private m_nIndex As Integer = 0
' ////////////////////////
' Implementation (methods)
''' <summary>
''' Advances the parser's position to the next occurence of a string.
'''
''' <param name="strString">The string.
''' <param name="bNoCase">Flag: perform a case-insensitive search.
''' <param name="bPositionAfter">Flag: position parser just after string.
''' <returns>
Private Function seekTo(ByVal strString As String, ByVal bNoCase As Boolean, ByVal bPositionAfter As Boolean) As Boolean
If Position < m_strContent.Length Then
' Find the start of the string - return if not found
Dim nNewIndex As Integer = 0
If bNoCase Then
strString = strString.ToLower()
nNewIndex = m_strContentLC.IndexOf(strString, Position)
Else
nNewIndex = m_strContent.IndexOf(strString, Position)
End If
If nNewIndex = -1 Then
Return (False)
End If
' Position the parser
m_nIndex = nNewIndex
If bPositionAfter Then
m_nIndex += strString.Length
End If
Return (True)
End If
Return (False)
End Function
Private Shared Function InlineAssignHelper(Of T)(ByRef target As T, ByVal value As T) As T
target = value
Return value
End Function
End Class
End Namespace
|
Do you have a specific question?
/ravi
|
Hi Ravi
Thanks for the quick reply.
I have a user control with 2 textboxes and a Convert button. I want the user to enter HTML in textbox 1 and click the Convert button, and have your script insert the result in textbox 2. But I just can't get it right.
I converted your C# to VB through a code converter. Could you please take a look at the code I pasted earlier and tell me where I have gone wrong? Thanks for your time.
Cheers Lance
modified 29-May-12 12:47pm.
|
skipTo and extractTo methods fit my case very well
|
Hi, this is really good stuff!
Any thoughts on the best way to extract just the meta keywords and descriptions?
I had been stuck on this for about a week when I came across your site.
I already have the HTML string; I was just wondering about the best way to get those elements out.
Thanks!
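One possible approach, offered as an editorial sketch rather than as part of StringParser: a regular expression that picks out the `content` attribute of a named `<meta>` tag. `MetaTagSketch` and `GetMetaContent` are hypothetical names. The sketch assumes quoted attribute values, assumes `name` appears before `content` inside the tag, and will cut a value short if it contains a quote character.

```csharp
using System;
using System.Text.RegularExpressions;

public static class MetaTagSketch
{
    // Returns the content attribute of <meta name="..."> (case-insensitive),
    // or null if no such tag is found.
    // Assumes name precedes content and that both values are quoted.
    public static string GetMetaContent(string html, string name)
    {
        string pattern =
            "<meta\\s+[^>]*name\\s*=\\s*[\"']" + Regex.Escape(name) +
            "[\"'][^>]*content\\s*=\\s*[\"']([^\"']*)[\"']";
        Match m = Regex.Match(html, pattern, RegexOptions.IgnoreCase);
        return m.Success ? m.Groups[1].Value : null;
    }
}
```

Calling `GetMetaContent(html, "keywords")` and `GetMetaContent(html, "description")` on the downloaded HTML string would then return the two values, or null where a tag is missing.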
|
Hi, great helpful class. A couple of questions:
Is it possible to search/extract backwards? Often it is necessary to search for some unchanging string to locate information, then read backwards to get the variable (unsearchable) data located just in front of it.
Also, is it possible to record a Position, then move back to that position if a search fails: MoveTo(Position)?
Thanks!
|
krn_2k wrote: Is it possible to search/extract backwards?
This functionality is present in the MFC version[^] of the WebResourceProvider class. I'll add it to StringParser.
krn_2k wrote: is it possible to record a Position, then move back to that position if a search fails: MoveTo(Position)?
I'll add that property.
Thanks for your comments!
/ravi
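Until such a method exists in StringParser, the backward read can be approximated with plain string calls. The following is a minimal editorial sketch, not the MFC implementation mentioned above: `BackwardSearchSketch` and `ExtractBefore` are hypothetical names. It finds a fixed anchor string, then looks backwards from it for a start marker and returns the text in between.

```csharp
using System;

public static class BackwardSearchSketch
{
    // Returns the text between the last occurrence of startMarker that precedes
    // anchor and the anchor itself, or null if either string is not found.
    public static string ExtractBefore(string content, string anchor, string startMarker)
    {
        int anchorPos = content.IndexOf(anchor, StringComparison.Ordinal);
        if (anchorPos == -1) return null;

        // Only the text in front of the anchor is searched, from its end backwards.
        string head = content.Substring(0, anchorPos);
        int markerPos = head.LastIndexOf(startMarker, StringComparison.Ordinal);
        if (markerPos == -1) return null;

        return head.Substring(markerPos + startMarker.Length);
    }
}
```

For example, `ExtractBefore("<td>42.5</td><td>USD</td>", "</td><td>USD", "<td>")` returns `"42.5"`. Restoring a saved position would need a setter on `Position`, which is read-only in the VB listing posted earlier in this thread.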
|
Great thank you, got my vote!
|
I tried to extract all the tags from my web page.
I used the following code, but I didn't get the desired result.
Can you verify whether my code uses StringParser correctly?
string html = "URL";
System.Net.HttpWebRequest webrequest = (HttpWebRequest)System.Net.WebRequest.Create(html);
webrequest.Method = "GET"; // GET is the default; set it before the request is sent so it takes effect
System.Net.HttpWebResponse webresponse = (HttpWebResponse)webrequest.GetResponse();
StreamReader webstream = new StreamReader(webresponse.GetResponseStream(), Encoding.ASCII);
string strml = webstream.ReadToEnd();
ArrayList phrases = new ArrayList();
string str = strml; // HTML content
StringParser p = new StringParser(str);
while (p.skipToStartOfNoCase("<input"))
{
string strPhrase = "";
// Note: the delimiter passed to extractTo was lost when this post was rendered
if (p.skipToEndOfNoCase(">") && p.extractTo("", ref strPhrase))
phrases.Add(strPhrase);
}
webstream.Close();
webresponse.Close();
webrequest.Abort();
|
It's hard to tell if your logic is correct without knowing the downloaded content and what you're looking to extract. I could try to help if you posted a sample url and the desired extract target.
/ravi