HTML DOM Using .NET

ai8rahim

Jun 5, 2010

CPOL


Retrieving & Processing HTML from Websites in .NET Applications

Download demo application - 21.96 KB

Introduction

The software development landscape has changed significantly with the proliferation of Web technologies. As a result, most applications now have some form of connectivity or integration with another application, web service, web application, remote database, and so on.

This article therefore touches on one specific area: HTML content and the DOM. In doing so, it investigates two approaches available in .NET that can be used to bring the two together for a practical purpose.

The examples provided are based on .NET code and libraries. However, the concepts remain the same, because HTML and the DOM are independent of any programming language. This article is not exhaustive in any manner, but references are provided for those seeking more in-depth coverage.

Background

According to W3C [1], HTML is the publishing language for the World Wide Web. This basically means that HTML is the language that is used to display content in your web browser when you visit any website.

HTML (HyperText Markup Language) is a markup language in which predefined tags instruct the browser how content should appear. For example, <h1>This is a heading</h1> uses the heading tag, which tells the browser that the text “This is a heading” should be displayed in bold and larger than the body text of the web page. Different tags serve different purposes, and they are defined by the W3C in its language specifications. At the time of writing, the latest specification for HTML is HTML 4.01 [2]. The purpose of a specification is to state how a language should be used, i.e. the recommendations of its creators: the HTML 4.01 specification describes how HTML should be used in your websites and what the language is supposed to do. There is also XHTML 1.0 [3], a reformulation of HTML 4 as an XML application, which was designed with the intent of harnessing the power of XML in web pages.

DOM (Document Object Model) is an interface that allows applications to dynamically access the content, structure and style of documents. It is not restricted to a specific platform or language [4]. W3C has defined several levels of DOM (e.g. DOM Levels 1 to 3) and also several modules for each level (e.g. Core, XML, HTML, etc., 14 modules altogether). An implementation (application, agent, library, API, SDK, etc.) is said to conform to a certain DOM level or module if it supports all the interfaces for that module and the associated semantics [5] [6].

Approach

The steps that will be taken to demonstrate how HTML DOM can be used in .NET are:

Step 1. Retrieve the HTML Content
Step 2. Process the HTML Content using DOM
Step 3. Make use of the processed HTML Content

The following are the details of these steps.

Step 1. Retrieve the HTML Content

To retrieve the HTML content, the System.Net.WebClient class will be used. This class provides common methods for sending data to and receiving data from a resource identified by a URI [7]. According to RFC 3986 [8], a URI (Uniform Resource Identifier) is a compact sequence of characters that identifies an abstract or physical resource. It provides a simple and extensible means for identifying a resource. The commonly used term “URL” refers to a subset of URIs. More on this topic can be found in the RFC 3986 document (link in the reference section).
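To make the idea of a URI more concrete, the following is a minimal sketch that uses the System.Uri class to split an address into its main components. The class name UriParts and the method Describe are hypothetical helpers introduced only for this illustration, and the path and query string in the usage note are made up.

```csharp
using System;

public static class UriParts
{
    // Returns a one-line summary of the main components of an absolute URI.
    // Throws UriFormatException if the address is not a valid absolute URI.
    public static string Describe(string address)
    {
        Uri uri = new Uri(address);

        return string.Format(
            "scheme={0}; host={1}; path={2}; query={3}",
            uri.Scheme,        // e.g. "http"
            uri.Host,          // e.g. "www.cnn.com"
            uri.AbsolutePath,  // the path part, starting with "/"
            uri.Query);        // the query string, including the leading "?"
    }
}
```

For example, UriParts.Describe("http://www.cnn.com/world/index.html?edition=intl") reports the scheme, host, path and query separately, which is the same breakdown WebClient relies on when it decides how to reach the resource.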

The following code (Code Listing 1) retrieves the HTML content from www.cnn.com and displays it in the textBox1 control. Before the code can be compiled, remember to import the System.Net and System.IO namespaces.

Code Listing 1:

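// Requires: using System; using System.IO; using System.Net;
string htmlContent;

// WebClient, the stream and the reader are all disposable,
// so the using blocks guarantee cleanup even if an exception occurs
using (WebClient client = new WebClient())
// Retrieve resource as a stream
using (Stream data = client.OpenRead(new Uri("http://www.cnn.com")))
using (StreamReader reader = new StreamReader(data))
{
    // Retrieve the text
    htmlContent = reader.ReadToEnd();
}

textBox1.Text = htmlContent;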
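When only the text of the page is needed, the stream handling can be left to WebClient itself. The following is a minimal sketch using WebClient.DownloadString; the class name HtmlFetcher and the method Fetch are hypothetical, and setting the encoding to UTF-8 is an assumption about the page being retrieved rather than something that holds for every website.

```csharp
using System;
using System.Net;
using System.Text;

public static class HtmlFetcher
{
    // Downloads the resource at the given URI and returns it as a string.
    public static string Fetch(Uri address)
    {
        using (WebClient client = new WebClient())
        {
            // Assumption: the page is encoded as UTF-8
            client.Encoding = Encoding.UTF8;

            // DownloadString opens the resource, reads it fully and closes it
            return client.DownloadString(address);
        }
    }
}
```

A call such as HtmlFetcher.Fetch(new Uri("http://www.cnn.com")) would then yield the same kind of string as htmlContent in Code Listing 1.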

An alternative to using the System.Net.WebClient class would be to use the WebBrowser Control. Simply place the WebBrowser control on the Form and use the following code to go to a preferred URI.

webBrowser1.Navigate(new Uri(txtURL.Text));

Once the page has loaded, the DocumentCompleted event is raised, and the HTML content can be captured in its handler through the DocumentText property (Code Listing 2).

The WebBrowser control may be suitable if you wish to display the page to the user. However, if you only want to capture the HTML content, the WebClient class is more suitable and efficient.

Code Listing 2:

private void webBrowser1_DocumentCompleted(
    object sender, WebBrowserDocumentCompletedEventArgs e)
{
    // DocumentText holds the HTML of the page that has just finished loading
    textBox1.Text = webBrowser1.DocumentText;
}

Step 2. Process the HTML Content using DOM

Once the HTML content has been retrieved, DOM can be used for processing based on your needs. You could construct a full DOM tree, or you could simply extract specific tags, ids, or even content from the web page. For this part, I will use the Microsoft HTML Object Library, which is a COM library that has to be referenced in your Visual Studio project. Once the reference has been added, import the mshtml namespace in the code with a using directive.

The mshtml namespace consists of different interfaces that can be used to access the Dynamic HTML (DHTML) Object Model [9][10]. The IHTMLDocument2 interface will be used in this article. This interface can be used to get information about the document, and also to examine and modify HTML elements and text in the document [11].

First, create a document object and obtain its IHTMLDocument2 interface. All the elements in the document can be accessed through this interface. The following code shows how this can be done purely with the mshtml interfaces. Most of the available examples use the WebBrowser control, so this is an alternative approach. After the document object is created, the document is constructed from the HTML content (htmlContent) retrieved in Code Listing 1. Following this, other interfaces can be used to access the document's elements. The following code (Code Listing 3) shows how all the tags are traversed and displayed in a ListBox by tag name.

Code Listing 3:

// Obtain the document interface
IHTMLDocument2 htmlDocument = (IHTMLDocument2)new mshtml.HTMLDocument();

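// Construct the document from the retrieved HTML,
// then close the output stream so parsing can complete
htmlDocument.write(htmlContent);
htmlDocument.close();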

listBox1.Items.Clear();

// Extract all elements
IHTMLElementCollection allElements = htmlDocument.all;

// Iterate all the elements and display tag names
foreach (IHTMLElement element in allElements)
{
    listBox1.Items.Add(element.tagName);
}
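A flat list of tag names becomes more informative once it is summarised. The following is a minimal sketch that counts how often each tag name occurs; it works on any sequence of strings, so it does not depend on mshtml. The class name TagSummary and the method Count are hypothetical names chosen for this illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class TagSummary
{
    // Counts how often each tag name occurs, ignoring case.
    // Keys in the result are upper-cased tag names.
    public static Dictionary<string, int> Count(IEnumerable<string> tagNames)
    {
        return tagNames
            .GroupBy(name => name, StringComparer.OrdinalIgnoreCase)
            .ToDictionary(
                group => group.Key.ToUpperInvariant(),
                group => group.Count(),
                StringComparer.OrdinalIgnoreCase);
    }
}
```

With the ListBox filled as in Code Listing 3, a call such as TagSummary.Count(listBox1.Items.Cast<string>()) would give the number of occurrences per tag, assuming every item in the ListBox is a string.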

More specific queries can also be run on the elements, such as extracting all the links or all the images in a particular web page. The following code (Code Listing 4) shows how all the image elements can be extracted and their sources displayed in a ListBox.

Code Listing 4:

// Extract all image elements
IHTMLElementCollection imgElements = htmlDocument.images;

// Iterate through each image element
foreach (IHTMLImgElement img in imgElements)
{
    listBox1.Items.Add(img.src);
}
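Depending on how the page was written and how the document was loaded, the src values collected from the image elements may be relative paths or may use schemes that cannot be downloaded as pictures. The following is a minimal sketch, under that assumption, of how each value could be resolved against the address of the page before it is used. The class name ImageUrlResolver and the method Resolve are hypothetical helpers, not part of mshtml.

```csharp
using System;

public static class ImageUrlResolver
{
    // Resolves an image source against the URI of the page it came from.
    // Returns the absolute http/https URL, or null if the value is empty,
    // malformed, or uses another scheme.
    public static string Resolve(Uri pageUri, string src)
    {
        if (string.IsNullOrWhiteSpace(src))
        {
            return null;
        }

        Uri absolute;
        if (!Uri.TryCreate(pageUri, src, out absolute))
        {
            return null;
        }

        // Keep only addresses that can be fetched over HTTP or HTTPS
        if (absolute.Scheme != Uri.UriSchemeHttp &&
            absolute.Scheme != Uri.UriSchemeHttps)
        {
            return null;
        }

        return absolute.AbsoluteUri;
    }
}
```

Inside the loop of Code Listing 4, the result of ImageUrlResolver.Resolve(new Uri("http://www.cnn.com"), img.src) could be added to the ListBox instead of img.src, skipping any value for which the method returns null.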

Step 3. Make Use of the Processed HTML Content

There are numerous applications for this approach of extracting HTML content. Once the content has been extracted, selected elements can be processed through the DOM. For example, building on the listings above, a simple image gallery can be built from the images used in a website. The following code iterates through all the items in the ListBox and adds PictureBoxes dynamically to a FlowLayoutPanel using the image sources retrieved in the previous step.

Code Listing 5:

// Iterate through all the items in the listbox
for (int i = 0; i < listBox1.Items.Count; i++)
{
    // Create PictureBox dynamically
    // and set its properties
    PictureBox pic = new PictureBox();
    pic.Width = 100;
    pic.Height = 100;
    pic.SizeMode = PictureBoxSizeMode.StretchImage;

    // Assign location of picture
    // from the listbox
    pic.ImageLocation = listBox1.Items[i].ToString();

    // Add PictureBox to a panel
    flowLayoutPanel1.Controls.Add(pic);
}

Conclusion

This article has covered just a little of the HTML DOM and how it can be used within .NET. While the applications are numerous, I hope that readers will have some direction and know where to start when solving problems in this domain. If you wish to build photo galleries from pictures on other websites, monitor changes made to web pages, or even develop spiders that crawl several sites, this approach is a simple and efficient way of going about it.

Several other techniques can be used, and very helpful third-party tools and libraries built just for this purpose are also available, some free and others commercial. The following is a screenshot of the demo application built using the code presented in this article. It is available for download from my blog. If you have any feedback, ideas or queries, please send me an email.

Figure 1: Demo Application Screenshot

References

[1] http://www.w3.org/html/
[2] http://www.w3.org/TR/html401/
[3] http://www.w3.org/TR/xhtml1/
[4] http://www.w3.org/DOM/
[5] http://www.w3.org/TR/2000/REC-DOM-Level-2-Core-20001113/introduction.html
[6] http://www.w3.org/2003/02/06-dom-support.html
[7] http://msdn.microsoft.com/en-us/library/system.net.webclient.aspx
[8] ftp://ftp.rfc-editor.org/in-notes/rfc3986.txt
[9] http://msdn.microsoft.com/en-us/library/bb508515%28v=VS.85%29.aspx
[10] http://msdn.microsoft.com/en-us/library/ms533044%28v=VS.85%29.aspx
[11] http://msdn.microsoft.com/en-us/library/aa752574%28v=VS.85%29.aspx

History

5th June, 2010: Initial post