HTML Parsing using .NET Framework

29 Jun 2007, 2 min read
In this article, I will show how to parse and modify a piece of HTML code using a handy .NET assembly.

Introduction

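Hello again!

In this article we will parse HTML using a single assembly referenced from a .NET Framework project: Microsoft.mshtml.
Its namespace is called mshtml. Strictly speaking, it is a primary interop assembly: a managed wrapper around the MSHTML COM DLL that ships with Internet Explorer, so calls go through COM interop under the hood. I will show how this assembly and its objects can help with some basic tasks.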

The assembly

The assembly is located in the folder shown below:

Screenshot - mshtml.jpg

If you are working on a remote website, or on a shared drive mapped from another environment, just copy this assembly into the /bin folder.

First we need to add the assembly reference to our project.

Project Properties >> Add reference >> .NET >> Microsoft.mshtml

The HTML block

This is the simple HTML block that we will be working with:

VB.NET
Dim myHTML$ = _
          vbCrLf & "<html>" _
        & vbCrLf & "<body>" _
        & vbCrLf & "<input type=""text"" name=""myTextBox""/>" _
        & vbCrLf & "<input type=""button"" name=""myButton""/>" _
        & vbCrLf & "<input type=""checkbox"" name=""myCheckBox"" class=""removeMe""/>" _
        & vbCrLf & "</body>" _
        & vbCrLf & "</html>"


As you can see, it is just three inputs.

The objects

This assembly has objects that represent each of our HTML tags. In this example I will focus on the INPUT tag, using the IHTMLInputElement object.

There is also a generic object that gives us extra properties: IHTMLElement.
To access these objects, do not forget the following import:

VB.NET
Imports mshtml
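
As a rough sketch of how the two objects relate: both are views of the same underlying element, so one INPUT can be read through either of them. The helper name describeInput is hypothetical and not part of the download; the sketch assumes the Microsoft.mshtml reference has been added and that the name matches exactly one element.

```vbnet
Imports mshtml

Module InterfaceSketch

    ' Hypothetical helper: reads one named INPUT element through both interfaces.
    Function describeInput(ByVal html As String, ByVal inputName As String) As String
        Dim doc As IHTMLDocument2 = New HTMLDocumentClass()
        doc.write(html)
        doc.close()

        ' body.all and item() are typed as Object, so cast explicitly
        Dim elements As IHTMLElementCollection = DirectCast(doc.body.all, IHTMLElementCollection)
        Dim found As Object = elements.item(inputName)
        If found Is Nothing Then Return "(not found)"

        ' The same element, seen through the specific and the generic interface
        Dim asInput As IHTMLInputElement = DirectCast(found, IHTMLInputElement)
        Dim asElement As IHTMLElement = DirectCast(found, IHTMLElement)

        ' type comes from IHTMLInputElement; tagName and className from IHTMLElement
        Return asElement.tagName & " type=" & asInput.type & " class=" & asElement.className
    End Function

End Module
```

The explicit casts are a design choice for projects compiled with Option Strict On; with Option Strict Off, plain assignments to the interface types also compile.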

The function

And here we have the full function that we will be using, with some comments :

VB.NET
Function parseMyHtml(ByVal htmlToParse$) As String

    '::......... Declare a new HTML document to use, and write our normal HTML
    Dim htmlDocument As IHTMLDocument2 = New HTMLDocumentClass()
    htmlDocument.write(htmlToParse)
    htmlDocument.close()

    '::......... Retrieve the collection of all HTML elements in our HTML block
    Dim allElements As IHTMLElementCollection = htmlDocument.body.all

    '::......... Find our INPUT element in the collection by name, and set a new value
    Dim myTextBox As IHTMLInputElement = allElements.item("myTextBox")
    myTextBox.value = "This is my text box!"

    '::......... Our button, this time as an "IHTMLElement", the generic object that gives us more properties
    '::......... Set a new attribute on the element
    Dim myButton As IHTMLElement = allElements.item("myButton")
    myButton.setAttribute("onClick", "javascript:alert('This is the button!')")

    '::......... The same button as an input, so we can set its value
    Dim myButton2 As IHTMLInputElement = allElements.item("myButton")
    myButton2.value = "Click me!"

    '::......... Get the INPUT group of elements
    Dim allInputs As IHTMLElementCollection = allElements.tags("input")
    Dim element As IHTMLElement

    '::......... Change some properties
    For Each element In allInputs
        element.style.border = "1px solid red"
        element.style.fontFamily = "Verdana"

        '::......... We do not want any elements with the "removeMe" class, so remove them
        If element.className = "removeMe" Then
            element.outerHTML = ""
        End If
    Next

    '::......... Return the parent element content ( BODY > HTML )
    Return htmlDocument.body.parentElement.outerHTML

End Function

This function shows several of these objects in use, along with their methods and properties.
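
A minimal sketch of calling the function from a console program might look like this. The module name Demo and the use of a console program are assumptions, not part of the original download, and the sketch expects the parseMyHtml function from the listing above to be pasted into the same module. MSHTML is an apartment-threaded COM component, so the entry point is marked STA here as a precaution.

```vbnet
Imports System
Imports mshtml

Module Demo

    ' Assumption: parseMyHtml (from the article listing) is pasted into this module.

    <STAThread()> _
    Sub Main()
        ' The same three inputs used in the article
        Dim myHTML As String = _
              "<html><body>" _
            & "<input type=""text"" name=""myTextBox""/>" _
            & "<input type=""button"" name=""myButton""/>" _
            & "<input type=""checkbox"" name=""myCheckBox"" class=""removeMe""/>" _
            & "</body></html>"

        ' Print the processed markup returned by the function
        Console.WriteLine(parseMyHtml(myHTML))
    End Sub

End Module
```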

The result HTML

Calling the function and displaying the result gives the following processed HTML:

HTML
<HTML><HEAD></HEAD>
<BODY><INPUT style="BORDER-RIGHT: red 1px solid; BORDER-TOP: red 1px solid; BORDER-LEFT: red 1px solid;
 BORDER-BOTTOM: red 1px solid; FONT-FAMILY: Verdana" value="This is my text box!" name=myTextBox>
 <INPUT style="BORDER-RIGHT: red 1px solid; BORDER-TOP: red 1px solid; BORDER-LEFT: red 1px solid; 
BORDER-BOTTOM: red 1px solid; FONT-FAMILY: Verdana" type=button value="Click me!" 
name=myButton onClick="javascript:alert('This is the button!')">  </BODY></HTML>

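The output is a bit messy because mshtml rewrites the markup as it processes it: the border shorthand in the style attribute is expanded into a separate declaration for each side, tag names are uppercased, and some attribute values lose their quotes. Believe me, though, the browser will still recognize it.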
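
To look at the style expansion on its own, a small sketch like the following sets the border shorthand on a single element and reads back the style text that mshtml stores. The helper name showExpandedStyle and the one-input markup are hypothetical, and the exact text returned may vary with the installed MSHTML version, so no expected output is shown.

```vbnet
Imports mshtml

Module StyleSketch

    ' Hypothetical helper: sets a shorthand border and returns the stored style text.
    Function showExpandedStyle() As String
        Dim doc As IHTMLDocument2 = New HTMLDocumentClass()
        doc.write("<html><body><input type=""text"" name=""t""/></body></html>")
        doc.close()

        Dim elements As IHTMLElementCollection = DirectCast(doc.body.all, IHTMLElementCollection)
        Dim box As IHTMLElement = DirectCast(elements.item("t"), IHTMLElement)

        ' Same shorthand assignment as in the article's loop
        box.style.border = "1px solid red"

        ' cssText is the inline style as mshtml serializes it
        Return box.style.cssText
    End Function

End Module
```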

Conclusion

I also tested this assembly with another, much larger page (almost 120 KB of HTML with JavaScript, CSS, images and other content). It held up well and did not report any errors.

This is very useful when you have HTML entered by a user and you need to change some of its properties.

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.


Written By
Web Developer
Brazil

Comments and Discussions

 
GeneralMy vote of 5 Pin
evry1falls10-Mar-13 10:49
evry1falls10-Mar-13 10:49 
GeneralNot a good idea Pin
Chris Maunder3-Jul-07 6:14
cofounderChris Maunder3-Jul-07 6:14 
GeneralAnd the point is Pin
Not Active30-Jun-07 5:20
mentorNot Active30-Jun-07 5:20 
FYI - not an assembly
KellyLeahy, 29-Jun-07 17:38
This is actually not an assembly, at least not a "true" assembly. It's a primary interop assy, which means it's a wrapper for a COM dll. This one happens to be the MSHTML DLL that is a component of IE.

You can treat it as a .NET DLL, but it's actually not managed code and is doing interop calls to COM objects under the hood.

Kelly Leahy
Milliman, inc.