Posted 12 Oct 2015

Using HtmlAgility pack and CssSelectors

DotNetSteve, 12 Oct 2015
Observations while parsing HTML files

Introduction

To start, I don't claim to be an expert in XPath or regular expressions, but the following are some observations I have made while parsing HTML documents for client projects.

In the following examples I am using Html Agility Pack (HAP) to load the HTML into a document object model (DOM) and parse it into nodes. Additionally, there are cases where I have had to parse the document on nodes which are not elements, such as comments.

In addition to observations about HAP in general, I’ll point out extension methods provided by the HAP.CssSelectors package, which allow for much easier selection.

Background

I have been successfully using Html Agility Pack for a client, parsing HTML documents to extract pertinent information. The CssSelectors extensions add a powerful new level of abstraction for gathering the required data.

Using the code

The packages for the example need to be imported using NuGet. The package references are included in the project, but you will need to enable NuGet package restore so that the libraries are downloaded.

In the project I have included a really simple html file with examples of issues I have needed to address in my projects. 

To test without any modifications, you will need to copy the HTML file to the following drive and directory – C:\testdata

Html Agility Pack provides a number of classes and enums which represent various parts of the DOM. These include HtmlAttribute, HtmlAttributeCollection, HtmlCommentNode and so on.

The first class we are going to examine is the HtmlDocument class. This class has the methods to load and parse the document into its respective parts.

In the attached source code I call out each section of the code using the nomenclature of (Part X) where X is a number. 

To use it, create an instance of the document:

(Part 1)

HtmlAgilityPack.HtmlDocument agpack = new HtmlAgilityPack.HtmlDocument();

The next method to call is the method to load the document. You can load from either a string:

agpack.LoadHtml(htmlString);
// or from a file:
agpack.Load(@"c:\testdata\testdat.htm");

Like a web browser, HAP is forgiving about the HTML supplied. You can query for errors, but malformed markup will not break the load.

The included file has a missing closing tag on the second font tag and a misplaced end tag. It works fine in a browser and does not throw an error in HAP, but the problems can be checked for.

(Part 2)

var errors = agpack.ParseErrors;

 

ParseErrors returns a collection of errors, from which you can also get a count. Interestingly enough, the missing closing font tag was not reported as an error, but the misplaced </tr> was.
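As a rough sketch of how the errors could be inspected, the helper below loads HTML from a string and builds one readable line per entry in ParseErrors. The class and method names are hypothetical; Line, LinePosition, Code and Reason are properties of HAP's HtmlParseError.

```csharp
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

// Hypothetical helper, not part of the attached project.
public static class ParseErrorSketch
{
    // Loads HTML from a string and returns one readable line per parse error.
    public static List<string> DescribeParseErrors(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html); // does not throw on malformed markup

        return doc.ParseErrors
            .Select(e => $"line {e.Line}, col {e.LinePosition}: {e.Code} - {e.Reason}")
            .ToList();
    }
}
```

Calling ParseErrorSketch.DescribeParseErrors on the contents of the test file should list the misplaced </tr> described above.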

 

Once the Document has been loaded, the two main methods to use for searching are:

SelectNodes(string XPath)  // from the DocumentNode
GetElementbyId(string Id) // from the HtmlDocument

Since an ID should be unique within a document, GetElementbyId returns a single node. SelectNodes returns a collection of nodes, because an XPath expression may match one or more items.
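One thing worth guarding against: when nothing matches, SelectNodes returns null rather than an empty collection, and GetElementbyId likewise returns null for an unknown ID. A small wrapper along these lines (the class and method names are my own, for illustration) keeps calling code free of null checks:

```csharp
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

// Hypothetical helper, not part of HAP itself.
public static class SelectSketch
{
    // Wraps SelectNodes so callers always get a sequence, never null.
    public static IEnumerable<HtmlNode> SelectOrEmpty(HtmlNode node, string xpath)
    {
        IEnumerable<HtmlNode> found = node.SelectNodes(xpath); // null when nothing matches
        return found ?? Enumerable.Empty<HtmlNode>();
    }
}
```

With this, a loop such as foreach (var n in SelectSketch.SelectOrEmpty(agpack.DocumentNode, "//table")) is safe even when the document contains no tables.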

My client has an application which appends several files together, delimiting each document with a start and end comment. The following is how I handle splitting this document back into its constituent parts. The file I have included has a section which is delineated with comments. The comments are in the form:

<!-- Start Table: 1234 --> HTML Body <!-- End Table -->

where 1234 might represent some type of account number that we need for processing.

(Part 3)

You could use the following to get the comment:

var comment = agpack.DocumentNode.SelectNodes("//comment()[contains(., 'Start Table:')]");

This says: anywhere in the document (“//”), select comment nodes whose own text (“.”, the current node) contains the words Start Table:.

Since this is a comment, it has no child nodes and the inner text is simply the text of the comment itself. This is useful if what you want to do is parse the comment to determine a value in the comment (account number in this case) but doesn’t really help when you want the text between the comments. To accomplish this, I fall back to Regular Expressions and grouping.
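To illustrate the first use, pulling the account number out of the comment, here is a minimal sketch. It assumes the comment format shown earlier; the class and method names are hypothetical.

```csharp
using System.Text.RegularExpressions;
using HtmlAgilityPack;

// Hypothetical helper for illustration only.
public static class CommentSketch
{
    // Returns the digits after "Start Table:" in the first matching comment, or null.
    public static string FindAccountNumber(HtmlDocument doc)
    {
        HtmlNode comment = doc.DocumentNode
            .SelectSingleNode("//comment()[contains(., 'Start Table:')]");
        if (comment == null) return null; // no such comment in the document

        Match m = Regex.Match(comment.InnerText, @"Start Table:\s*(\d+)");
        return m.Success ? m.Groups[1].Value : null;
    }
}
```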

(Part 4)

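var html = Regex.Match(agpack.DocumentNode.InnerHtml, @"<!-- Start Table: \d* -->(?<one>.*)<!-- End Table -->", RegexOptions.Singleline).Groups["one"];

Now html.Value holds the markup between the two comments. Note that .* is greedy: if the document contains several Start/End pairs, this match runs from the first start comment to the last end comment.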
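When a file holds several delimited sections, a variation on this pattern can return each one separately. The sketch below is one possible approach rather than the code from the attached project: it uses a lazy quantifier so that each match stops at the nearest end comment, and it captures the account number at the same time. The class, method and group names are my own.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration only.
public static class SectionSketch
{
    // Lazy ".*?" stops at the nearest End Table comment,
    // so each delimited section is matched on its own.
    private static readonly Regex SectionPattern = new Regex(
        @"<!-- Start Table: (?<id>\d+) -->(?<body>.*?)<!-- End Table -->",
        RegexOptions.Singleline);

    // Returns (account number, inner HTML) pairs in document order.
    public static List<KeyValuePair<string, string>> Split(string html)
    {
        var sections = new List<KeyValuePair<string, string>>();
        foreach (Match m in SectionPattern.Matches(html))
        {
            sections.Add(new KeyValuePair<string, string>(
                m.Groups["id"].Value, m.Groups["body"].Value));
        }
        return sections;
    }
}
```

Each body can then be loaded into its own HtmlDocument with LoadHtml if it needs further parsing.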

Moving on to finding elements in the DOM, the first example is finding a node using GetElementbyId. There are three tables, but only two have an ID assigned to them. One is ID=”abc”, the other is ID=”table3”.

Let’s start with looking at table with id=”abc”:

(Part 5)

var node = agpack.GetElementbyId("abc");

This will return a single node representing the table. Its InnerHtml contains all the markup between the <table></table> tags, and its ChildNodes collection holds the nodes representing the DOM structure of the table.

(Part 6)

One approach to getting the row nodes is to use LINQ to discover them, such as:

var rownodes = node.ChildNodes.Where(w => w.OriginalName == "tr");

This sort of works: if you check the count you will see you have three rows. However, there are actually four rows. The first, which is wrapped in a <thead></thead>, is not found because it is not a direct child of the table.

Another approach is to use SelectNodes on the node to discover the tr elements.

rownodes = node.SelectNodes("tr");

But this also fails to find all the rows, finding only the immediate children.

What about node.SelectNodes("/tr")? This returns nothing (null), because a leading “/” starts the search at the document root.

What about node.SelectNodes("//tr")? The good news is that it found the missing row; the bad news is that it also found every other row in the document (12 in total).

After a little digging I found the following two solutions worked:

rownodes = node.SelectNodes(node.XPath + "//tr");

//or

// http://www.w3schools.com/xsl/xpath_axes.asp
rownodes = node.SelectNodes("descendant::tr");
Both return all four rows, which was interesting to me. I think I had assumed HAP would run SelectNodes from the current node and that “//tr” would have worked. Alas, “//” says to search from the root of the document. The second option does work, because the descendant axis starts from the currently selected node.
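The difference between these paths is easy to check side by side. The sketch below is a hypothetical helper (names are mine) that counts the matches for a given path relative to a table. It can also be used to try “.//tr”, the standard XPath way of writing “any tr below the current node”, which should behave like “descendant::tr” here.

```csharp
using HtmlAgilityPack;

// Hypothetical helper for illustration only.
public static class RowSketch
{
    // Counts the nodes matched by an XPath evaluated from the element with the given id.
    public static int CountMatches(HtmlDocument doc, string tableId, string xpath)
    {
        HtmlNode table = doc.GetElementbyId(tableId);
        if (table == null) return 0;            // unknown id

        HtmlNodeCollection rows = table.SelectNodes(xpath);
        return rows == null ? 0 : rows.Count;   // SelectNodes returns null on no match
    }
}
```

For example, comparing RowSketch.CountMatches(agpack, "abc", "tr") with the same call using ".//tr" and "//tr" shows direct children, all descendants, and the whole document respectively.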

Similarly, we can find all the td elements of the tr elements using the same procedure. Note that for table 3 we bring back twelve td elements, even though they are nested inside <tr>, <font> and <span> elements.

(Part 7)

node = agpack.GetElementbyId("table3");
var nodes = node.SelectNodes("descendant::td");

 

Let’s move on to HAP.CssSelectors.

This sits on top of Html Agility Pack, and installing its NuGet package will in fact ensure that HAP is installed as well.

It allows you to select elements using CSS selectors rather than XPath. For example:

 (Part 8)

rownodes = agpack.QuerySelectorAll("#abc tr");

In this case I did not need to start from the node; selecting from the whole document returned the expected four rows.

listTDNodes = agpack.QuerySelectorAll("#table3 td");

This returned twelve items, four rows by three columns.

Here is an example of getting only the three <td>s in the second row:

listTDNodes = agpack.QuerySelectorAll("#table3 tr:nth-child(2) td");

One thing to note: the QuerySelectorAll method returns a generic list of nodes rather than the HtmlNodeCollection returned by SelectNodes. This is important to know if you plan to mix and match.
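One way to mix and match is to write shared code against IEnumerable<HtmlNode>, which HtmlNodeCollection implements and which I am assuming the list returned by QuerySelectorAll implements as well. A minimal sketch, with hypothetical names:

```csharp
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

// Hypothetical helper for illustration only.
public static class MixSketch
{
    // Accepts any sequence of nodes, whether it came from an XPath query
    // or from a CSS selector query, and returns the trimmed text of each.
    public static List<string> CellTexts(IEnumerable<HtmlNode> cells)
    {
        if (cells == null) return new List<string>(); // SelectNodes may return null
        return cells.Select(c => c.InnerText.Trim()).ToList();
    }
}
```

Both MixSketch.CellTexts(node.SelectNodes("descendant::td")) and MixSketch.CellTexts(agpack.QuerySelectorAll("#table3 td")) can then go through the same code path.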

In addition to selecting by id(#) you can select by class(.), much easier than looking for an attribute with class using XPath.

listTDNodes = agpack.QuerySelectorAll(".table");

This returns the first and third tables, which both have the class table.
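For comparison, here is roughly what the same class lookup looks like in plain XPath. Because a class attribute can hold several space-separated names, a simple contains(@class, 'table') would also match a class such as tables, so the usual idiom pads the value with spaces first. The helper names below are hypothetical.

```csharp
using HtmlAgilityPack;

// Hypothetical helper for illustration only.
public static class ClassSketch
{
    // Builds an XPath matching elements whose class attribute contains
    // className as a whole, space-separated token.
    // Assumes className contains no quote characters.
    public static string ClassXPath(string className)
    {
        return "//*[contains(concat(' ', normalize-space(@class), ' '), ' "
               + className + " ')]";
    }

    // Counts the elements in the document carrying the given class.
    public static int CountByClass(HtmlDocument doc, string className)
    {
        HtmlNodeCollection found = doc.DocumentNode.SelectNodes(ClassXPath(className));
        return found == null ? 0 : found.Count;
    }
}
```

Set against that expression, the “.table” selector is clearly the easier one to read and maintain.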

Points of Interest

In conclusion, the CssSelectors extension is another useful tool to select elements easily without the need to dig deep into XPath or iterate through collections. I know I will be looking forward to implementing some of these findings into my own work.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

About the Author

DotNetSteve
Software Developer (Senior) Polaris Solutions
United States
Steven Contos

Working in varied settings from small entrepreneurial companies to Fortune 500 companies. Skilled in analyzing client needs and developing solutions that are sound and effective.

Strong analytic capabilities with proven accomplishments in developing programs that exceed or meet stated goals, consistently work well, are easily maintained and fully documented. Versed in a number of SDLC technologies including Agile and Scrum, dedicated to deliver high quality software on time and on budget.

Experienced in helping companies and teams change their culture. Providing clear vision, asking tough questions of both developers and business, leading by example and building trust among all concerned.

Article Copyright 2015 by DotNetSteve