
HTML Parsing using .NET Framework

Fabricio Miranda, 29 Jun 2007
In this article, I will show how to parse and modify a piece of HTML code using a handy .NET assembly.
Download the demo class of this example :

Introduction

Hello again!

In this article we will parse HTML using a single .NET Framework assembly, Microsoft.mshtml.
Its namespace is called mshtml, and I will show how this assembly and its objects can help with some basic tasks.

The assembly

The assembly is located in the folder shown below:

[Screenshot: mshtml.jpg]

If you are working on a remote website, or on shared drives from another environment, just copy this assembly into the /bin folder.

First we need to add the assembly reference to our project.

Project Properties >> Add reference >> .NET >> Microsoft.mshtml

The HTML block

This is the simple HTML block that we will be working with:

Dim myHTML$ = _
          vbCrLf & "<html>" _
        & vbCrLf & "<body>" _
        & vbCrLf & "<input type=""text"" name=""myTextBox""/>" _
        & vbCrLf & "<input type=""button"" name=""myButton""/>" _
        & vbCrLf & "<input type=""checkbox"" name=""myCheckBox"" class=""removeMe""/>" _
        & vbCrLf & "</body>" _
        & vbCrLf & "</html>"


As you can see, it is as simple as three inputs.

The objects

This assembly has objects that represent each of our HTML tags. In this example I will focus on the INPUT tag, using the IHTMLInputElement object.

There is also a generic object that gives us extra properties: IHTMLElement.
To access these objects, don't forget the following import:

Imports mshtml
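
To make the difference between the two objects concrete, here is a minimal sketch of looking at one element through both interfaces. The module name InterfaceSketch and the helper DescribeInput are hypothetical names made up for this illustration; they are not part of the demo class.

```vbnet
Imports mshtml

Module InterfaceSketch

    '::......... Hypothetical helper: describe an element using both interfaces
    Function DescribeInput(ByVal element As IHTMLElement) As String

        '::......... TryCast gives Nothing when the element is not an INPUT
        Dim input As IHTMLInputElement = TryCast(element, IHTMLInputElement)
        If input Is Nothing Then
            Return element.tagName & " (not an input)"
        End If

        '::......... tagName comes from the generic object, name and type from the INPUT object
        Return element.tagName & " name=" & input.name & " type=" & input.type

    End Function

End Module
```

The generic IHTMLElement carries what every tag has (tagName, className, style, outerHTML), while IHTMLInputElement adds what only an INPUT has (name, type, value).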

The function

And here is the full function that we will be using, with some comments:

    Function parseMyHtml(ByVal htmlToParse As String) As String

        '::......... Declare a new HTML document to use, and write our normal HTML into it
        Dim htmlDocument As IHTMLDocument2 = New HTMLDocumentClass()
        htmlDocument.write(htmlToParse)
        htmlDocument.close()

        '::......... With this we retrieve the collection of all HTML elements in our HTML block
        Dim allElements As IHTMLElementCollection = CType(htmlDocument.body.all, IHTMLElementCollection)

        '::......... Find our INPUT element by name in the collection, and set a new value
        Dim myTextBox As IHTMLInputElement = CType(allElements.item("myTextBox"), IHTMLInputElement)
        myTextBox.value = "This is my text box!"

        '::......... Our button, but now as an "IHTMLElement", the generic object that gives us more properties
        '::......... And set a new attribute on our element
        Dim myButton As IHTMLElement = CType(allElements.item("myButton"), IHTMLElement)
        myButton.setAttribute("onClick", "javascript:alert('This is the button!')")

        '::......... The same button as an input, so we can set its value
        Dim myButton2 As IHTMLInputElement = CType(allElements.item("myButton"), IHTMLInputElement)
        myButton2.value = "Click me!"

        '::......... Get the INPUT group of elements
        Dim allInputs As IHTMLElementCollection = CType(allElements.tags("input"), IHTMLElementCollection)
        Dim element As IHTMLElement

        '::......... Change some properties
        For Each element In allInputs
            element.style.border = "1px solid red"
            element.style.fontFamily = "Verdana"

            '::......... I don't want any "removeMe" classed elements, so let's remove them
            If element.className = "removeMe" Then
                element.outerHTML = ""
            End If
        Next

        '::......... Return the parent element content ( BODY > HTML )
        Return htmlDocument.body.parentElement.outerHTML

    End Function
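
One thing to keep in mind about the loop above: it removes elements while it is still walking the collection. That works in this example, where the "removeMe" element is the last input, but I would not rely on it when the matching elements sit in the middle of the collection. A more cautious variant, sketched below, first copies the matches into an ordinary list and only then removes them. The module name RemovalSketch and the helper RemoveByClass are hypothetical names for this illustration.

```vbnet
Imports System.Collections.Generic
Imports mshtml

Module RemovalSketch

    '::......... Hypothetical helper: remove every <tagName> element with the given class
    Sub RemoveByClass(ByVal doc As IHTMLDocument2, ByVal tagName As String, ByVal cssClass As String)

        Dim matches As IHTMLElementCollection = CType(doc.all.tags(tagName), IHTMLElementCollection)

        '::......... First pass: only collect, do not touch the document yet
        Dim toRemove As New List(Of IHTMLElement)
        For Each element As IHTMLElement In matches
            If element.className = cssClass Then
                toRemove.Add(element)
            End If
        Next

        '::......... Second pass: remove the collected elements, same technique as in the article
        For Each element As IHTMLElement In toRemove
            element.outerHTML = ""
        Next

    End Sub

End Module
```

With the document from the function above, the call would be RemoveByClass(htmlDocument, "input", "removeMe").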

This function shows several of these objects in use, along with their methods and properties.
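
Calling the function can be sketched as below. This assumes a console project that references Microsoft.mshtml and that the parseMyHtml function from this article is pasted into the same module; the module name Demo is hypothetical. Marking Main with STAThread is my assumption here, since it is the usual recommendation for COM components such as MSHTML.

```vbnet
Imports System
Imports mshtml

Module Demo

    '::......... Paste the parseMyHtml function from the article here

    <STAThread()> _
    Sub Main()

        '::......... The same HTML block used in the article
        Dim myHTML As String = _
                  vbCrLf & "<html>" _
                & vbCrLf & "<body>" _
                & vbCrLf & "<input type=""text"" name=""myTextBox""/>" _
                & vbCrLf & "<input type=""button"" name=""myButton""/>" _
                & vbCrLf & "<input type=""checkbox"" name=""myCheckBox"" class=""removeMe""/>" _
                & vbCrLf & "</body>" _
                & vbCrLf & "</html>"

        '::......... Parse, modify, and show the processed HTML
        Dim result As String = CStr(parseMyHtml(myHTML))
        Console.WriteLine(result)

    End Sub

End Module
```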

The result HTML

After calling the function and showing the result to the user, we get the following processed HTML block:

<HTML><HEAD></HEAD>
<BODY><INPUT style="BORDER-RIGHT: red 1px solid; BORDER-TOP: red 1px solid; BORDER-LEFT: red 1px solid;
 BORDER-BOTTOM: red 1px solid; FONT-FAMILY: Verdana" value="This is my text box!" name=myTextBox>
 <INPUT style="BORDER-RIGHT: red 1px solid; BORDER-TOP: red 1px solid; BORDER-LEFT: red 1px solid; 
BORDER-BOTTOM: red 1px solid; FONT-FAMILY: Verdana" type=button value="Click me!" 
name=myButton onClick="javascript:alert('This is the button!')">  </BODY></HTML>

The output is a bit messy, because mshtml rewrites the markup as it processes it: tag names come back in upper case, and the single border style is expanded into four separate border properties (top, right, bottom and left). But believe me, the browser will recognize it.

Conclusion

After testing this assembly with another, much larger page (almost 120 KB of pure HTML with JavaScript/CSS, images and some other stuff), the assembly proved very robust and did not report any errors.

This is very useful when you have HTML entered by a user and you need to change some of its properties.

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.


About the Author

Fabricio Miranda
Web Developer
Brazil

Last Updated 29 Jun 2007
Article Copyright 2007 by Fabricio Miranda