HTML Parsing using .NET Framework






Jun 29, 2007
In this article, I will show how to parse and modify a piece of HTML code using a handy .NET assembly.
Introduction
Hello again!
In this article we will work on HTML parsing using a single .NET Framework assembly.
The namespace is called mshtml. I will show how this assembly and its objects can help with some basic tasks.
The assembly
The assembly is located in the folder below:
If you are working against a remote website or a shared drive from another environment, just copy this assembly into the /bin folder.
First, we need to add the assembly reference to our project.
Project Properties >> Add reference >> .NET >> Microsoft.mshtml
The HTML block
This is the simple HTML block that we will be working with:
Dim myHTML$ = _
vbCrLf & "<html>" _
& vbCrLf & "<body>" _
& vbCrLf & "<input type=""text"" name=""myTextBox""/>" _
& vbCrLf & "<input type=""button"" name=""myButton""/>" _
& vbCrLf & "<input type=""checkbox"" name=""myCheckBox"" class=""removeMe""/>" _
& vbCrLf & "</body>" _
& vbCrLf & "</html>"
As you can see, it is as simple as three inputs.
The objects
This assembly has an object for each of our HTML tags. In this example I will focus on the INPUT tag, using the interface IHTMLInputElement.
There is also a generic interface, IHTMLElement, that gives us extra properties.
To access these objects, don't forget the following import:
Imports mshtml
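To make the relationship between the two interfaces more concrete, here is a minimal sketch that walks the INPUT elements as generic IHTMLElement objects and then views each one as an IHTMLInputElement. The module name InputLister and the shortened HTML string are hypothetical, chosen only for illustration. The sketch assumes the Microsoft.mshtml reference described above is in place and that the document is ready to query right after close().

```vbnet
Imports System
Imports mshtml

' Hypothetical demo module; the name is not part of the article's code.
Module InputLister
    Sub Main()
        ' A shortened version of the article's HTML block (assumption for brevity)
        Dim html As String = "<html><body>" _
            & "<input type=""text"" name=""myTextBox""/>" _
            & "<input type=""checkbox"" name=""myCheckBox""/>" _
            & "</body></html>"

        ' Load the markup into an in-memory document
        Dim doc As IHTMLDocument2 = New HTMLDocumentClass()
        doc.write(html)
        doc.close()

        ' body.all and tags() are typed as Object, so cast them explicitly
        Dim allElements As IHTMLElementCollection = _
            CType(doc.body.all, IHTMLElementCollection)
        Dim inputs As IHTMLElementCollection = _
            CType(allElements.tags("input"), IHTMLElementCollection)

        For Each element As IHTMLElement In inputs
            ' View the same element through the INPUT-specific interface
            Dim inputElement As IHTMLInputElement = _
                TryCast(element, IHTMLInputElement)
            If inputElement IsNot Nothing Then
                Console.WriteLine(inputElement.name & " (" & inputElement.type & ")")
            End If
        Next
    End Sub
End Module
```

The generic interface is convenient for things every tag has (style, className, outerHTML), while the INPUT-specific one exposes members such as name, type and value.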
The function
And here is the full function that we will be using, with some comments:
Function parseMyHtml(ByVal htmlToParse As String) As String
    '::......... Declare a new HTML document to use, and write our HTML into it
    Dim htmlDocument As IHTMLDocument2 = New HTMLDocumentClass()
    htmlDocument.write(htmlToParse)
    htmlDocument.close()
    '::......... Retrieve the collection of all HTML elements in our HTML block
    Dim allElements As IHTMLElementCollection = htmlDocument.body.all
    '::......... Find our INPUT element by name in the collection, and set a new value
    Dim myTextBox As IHTMLInputElement = allElements.item("myTextBox")
    myTextBox.value = "This is my text box!"
    '::......... Our button, but now as an "IHTMLElement", the generic object that gives us more properties
    '::......... And set a new attribute on our element
    Dim myButton As IHTMLElement = allElements.item("myButton")
    myButton.setAttribute("onClick", "javascript:alert('This is the button!')")
    '::......... The same button as an input, so we can set its value
    Dim myButton2 As IHTMLInputElement = allElements.item("myButton")
    myButton2.value = "Click me!"
    '::......... Get the group of INPUT elements
    Dim allInputs As IHTMLElementCollection = allElements.tags("input")
    Dim element As IHTMLElement
    '::......... Change some properties
    For Each element In allInputs
        element.style.border = "1px solid red"
        element.style.fontFamily = "Verdana"
        '::......... I don't want any elements with the "removeMe" class, so let's remove them
        If element.className = "removeMe" Then
            element.outerHTML = ""
        End If
    Next
    '::......... Return the content of the body's parent element ( BODY > HTML )
    Return htmlDocument.body.parentElement.outerHTML
End Function
In this function we can see some of these objects and their functions/properties in use.
The result HTML
After calling the function and showing the result to the user, we get the following processed HTML block:
<HTML><HEAD></HEAD>
<BODY><INPUT style="BORDER-RIGHT: red 1px solid; BORDER-TOP: red 1px solid; BORDER-LEFT: red 1px solid;
BORDER-BOTTOM: red 1px solid; FONT-FAMILY: Verdana" value="This is my text box!" name=myTextBox>
<INPUT style="BORDER-RIGHT: red 1px solid; BORDER-TOP: red 1px solid; BORDER-LEFT: red 1px solid;
BORDER-BOTTOM: red 1px solid; FONT-FAMILY: Verdana" type=button value="Click me!"
name=myButton onClick="javascript:alert('This is the button!')"> </BODY></HTML>
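The code is a bit nasty because mshtml reprocesses it: tag names come back in upper case, quotes around simple attribute values are dropped, and a shorthand such as the style border is expanded into every single border side instead. But believe me, the browser will recognize that.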
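If you want to see this rewriting on a single element, the following minimal sketch sets the shorthand border on one input and prints what mshtml stores. The module name StyleExpansionDemo and the one-input HTML string are hypothetical. As with the main function, it assumes the Microsoft.mshtml reference is in place and that the document can be queried right after close().

```vbnet
Imports System
Imports mshtml

' Hypothetical demo module; the name is not part of the article's code.
Module StyleExpansionDemo
    Sub Main()
        ' Load a one-input document (assumption: shortened from the article's HTML)
        Dim doc As IHTMLDocument2 = New HTMLDocumentClass()
        doc.write("<html><body><input type=""text"" name=""myTextBox""/></body></html>")
        doc.close()

        ' body.all and item() are typed as Object, so cast them explicitly
        Dim allElements As IHTMLElementCollection = _
            CType(doc.body.all, IHTMLElementCollection)
        Dim box As IHTMLElement = _
            CType(allElements.item("myTextBox"), IHTMLElement)

        ' Set the shorthand property, exactly as the article's function does
        box.style.border = "1px solid red"

        ' Print the style text and the markup as mshtml now serializes them
        Console.WriteLine(box.style.cssText)
        Console.WriteLine(box.outerHTML)
    End Sub
End Module
```

Comparing the two printed lines with the string that was written in shows which parts of the markup mshtml has rewritten.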
Conclusion
After testing this assembly with another page, a big one with almost 120 KB of pure HTML plus JavaScript/CSS, images and some other stuff, the assembly proved very robust and didn't report any errors.
It is very useful when you have some HTML entered by a user and you need to change some properties.