
Parsing an HTML document by using a recursive function

marc9889, 14 Apr 2010
This is an example of one way to parse an HTML document by using a recursive function. In this example, the HTML document is loaded from a text file, but the code also shows (in commented-out lines) how to use a web page as the source.
 
When the recursive function is called, a conditional statement checks each HTML element for child elements. If the element has children, the function calls itself on those children, so each child is checked for children in turn. Eventually the function reaches an element with no children, and that element's inner text (and its href attribute, if it has one) is appended to a text box. If this description is confusing, stepping through the program in the debugger may help.
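
The same pattern can be tried without a form or a browser control. The following is only a minimal console sketch of the idea described above: the Node class, the Walk routine, and the sample tree are hypothetical stand-ins for HtmlElement and its Children collection, not part of the article's code.

```vbnet
Imports System
Imports System.Collections.Generic

Public Module RecursionSketch

    ' Hypothetical stand-in for HtmlElement: a name, optional text, and children.
    Public Class Node
        Public Name As String
        Public Text As String
        Public Children As New List(Of Node)

        Public Sub New(ByVal nodeName As String, Optional ByVal nodeText As String = "")
            Name = nodeName
            Text = nodeText
        End Sub
    End Class

    ' Visits every node in the list. A node with children triggers a
    ' recursive call; a node without children has its text printed.
    Public Sub Walk(ByVal nodes As List(Of Node))
        For Each n As Node In nodes
            If n.Children.Count > 0 Then
                Console.WriteLine("<" & n.Name & ">")
                Walk(n.Children)
                Console.WriteLine("</" & n.Name & ">")
            Else
                Console.WriteLine("     " & n.Text)
            End If
        Next
    End Sub

    Public Sub Main()
        ' Build a tiny tree: Table -> Row -> two Cells.
        Dim row As New Node("Row")
        row.Children.Add(New Node("Cell", "one"))
        row.Children.Add(New Node("Cell", "two"))

        Dim table As New Node("Table")
        table.Children.Add(row)

        Dim roots As New List(Of Node)
        roots.Add(table)
        Walk(roots)
    End Sub

End Module
```

The recursion stops on its own because every call works on a smaller part of the tree, and a node without children never calls Walk again.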
 
Import the namespace as shown:
Imports System.Windows.Forms
 

    Private Sub Button1_Click(ByVal sender As System.Object, ByVal e As System.EventArgs) Handles Button1.Click

        Try
            'To parse a live web page instead, uncomment the next two lines
            'and assign myHTML from webclient.DownloadString(url).
            'Dim webclient As System.Net.WebClient = New System.Net.WebClient
            'Dim url As String = "http://www.somewebsite.com"
            Dim myHTML As String '= webclient.DownloadString(url)

            'instead of downloading the HTML, let's get it from a file
            Dim filePath As String = "C:\htmlsourcefile.txt"
            'Using closes the file even if ReadToEnd throws
            Using myStreamReader As New System.IO.StreamReader(filePath)
                myHTML = myStreamReader.ReadToEnd()
            End Using

            'suppress script error dialogs before any content is loaded
            WebBrowser1.ScriptErrorsSuppressed = True
            'start from a blank page, then write the HTML into its document
            WebBrowser1.Navigate("about:blank")
            WebBrowser1.Document.Write(myHTML)
            Dim doc As HtmlDocument = WebBrowser1.Document

            append("The document title is: " & doc.Title)

            'walk the <head> element first (no value is returned)
            Dim elements As HtmlElementCollection = doc.GetElementsByTagName("head")
            getChildren(elements)
            append(vbCrLf)

            'same routine again, just for the <body> this time
            elements = doc.GetElementsByTagName("body")
            getChildren(elements)

        Catch ex As Exception
            'show the full exception text in the output box
            append(ex.ToString)
        End Try

    End Sub
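
The commented-out lines in the handler above hint at downloading the page instead of reading a file. One way to keep both options available is a small helper that picks the source from the string it is given. This is a sketch under assumptions: the LoadHtml name and the "starts with http" test are mine, not the article's. WebClient.DownloadString and File.ReadAllText are the standard .NET calls.

```vbnet
Imports System
Imports System.IO
Imports System.Net

Public Module HtmlSourceSketch

    ' Hypothetical helper: returns the HTML text from a URL or a local file.
    Public Function LoadHtml(ByVal source As String) As String
        If source.StartsWith("http://", StringComparison.OrdinalIgnoreCase) OrElse _
           source.StartsWith("https://", StringComparison.OrdinalIgnoreCase) Then
            ' Web page: download the markup as a single string.
            Using client As New WebClient()
                Return client.DownloadString(source)
            End Using
        End If

        ' Anything else is treated as a path to a local file.
        Return File.ReadAllText(source)
    End Function

    Public Sub Main(ByVal args As String())
        ' Usage: pass a URL or a file path as the only argument.
        If args.Length = 1 Then
            Dim html As String = LoadHtml(args(0))
            Console.WriteLine(html.Length & " characters loaded")
        End If
    End Sub

End Module
```

In the button handler, myHTML could then be assigned from LoadHtml(filePath) or LoadHtml(url) without changing the rest of the code.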
 
This is the recursive function
    Private Sub getChildren(ByVal xElementCollection As HtmlElementCollection)
        Dim xLabel As String

        For Each parentElement As HtmlElement In xElementCollection
            If parentElement.Children.Count > 0 Then

                'map the tag name to a friendlier label for the output
                Select Case parentElement.TagName.ToLower()
                    Case "tr" : xLabel = "Row"
                    Case "td" : xLabel = "Cell"
                    Case "th" : xLabel = "Header"
                    Case "a" : xLabel = "Anchor"
                    Case "tbody" : xLabel = "T-Body"
                    Case "div" : xLabel = "Division"
                    Case "head" : xLabel = "Head"
                    Case "body" : xLabel = "Body"
                    Case "table" : xLabel = "Table"
                    Case "p" : xLabel = "Paragraph"
                    Case Else : xLabel = "element not specified"
                End Select

                append("<" & xLabel & ">")
                'recursion: process this element's children the same way
                getChildren(parentElement.Children)
                'closing marker for the label opened above
                append("</" & xLabel & ">")

            Else

                'leaf element: output its text, or the placeholder "Null" if it has none
                If Not String.IsNullOrEmpty(parentElement.InnerText) Then
                    append("     " & parentElement.InnerText)
                Else
                    append("     Null")
                End If

                'output the link target, if the element has one
                Dim href As String = parentElement.GetAttribute("href")
                If Not String.IsNullOrEmpty(href) Then
                    append("     " & href)
                End If

            End If
        Next

    End Sub
 
 
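One last thing: rather than calling TextBox1.AppendText("one two three") directly everywhere, I route all output through a small helper, which also keeps a copy of the text in outputXL.
 
    'not declared in the original listing; assumed to be a form-level String
    Private outputXL As String = ""

    Private Sub append(ByVal myTextToAppend As String)
        TextBox1.AppendText(myTextToAppend & vbCrLf)
        Application.DoEvents() 'let the UI repaint during long runs
        outputXL = outputXL & myTextToAppend & vbCrLf
    End Sub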

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

About the Author

marc9889
Software Developer (Senior), Jacobs Technology
United States

Comments and Discussions

 
Trellium, 9-Oct-11 6:25: This code worked very well for our needs with minimal tweaki...
www.marcjohnson.us, 6-Dec-11 3:44 (reply): Thank you Trellium, and I hope it is still working for you. Yes, you are exactly right! The document needs to be well formed before this function can be run without risk of breakage. That was an assumption on my part, but thanks for mentioning it.
sumboddie, 1-May-11 7:27: I am trying to use this code by uncommenting the statements ...

Article Copyright 2010 by marc9889