To start, I don't claim to be an expert in XPath or Regular Expressions but the following are some observations I have made while parsing HTML documents for client projects.
In the following examples I am using Html Agility Pack (HAP) to load the HTML into a document object model (DOM) and parse it into nodes. Additionally, there are cases where I have had to parse the document on items which are not elements, such as comments.
In addition to observations about HAP in general, I’ll point out extension methods provided by the HAP.CssSelectors package, which allow for much easier selection.
I have been successfully using Html Agility Pack for a client, parsing HTML documents to extract pertinent information. The CssSelectors extensions add a powerful new level of abstraction for gathering the required data.
Using the code
Packages for the example will need to be imported using NuGet. The package descriptions will be loaded in the project but you will need to set NuGet package manager to restore the libraries.
In the project I have included a really simple html file with examples of issues I have needed to address in my projects.
To test without any modifications, you will need to copy the HTML file to the following drive and directory – C:\testdata
Html Agility Pack provides a number of classes and enums which represent various parts of the DOM. These include HtmlAttribute, HtmlAttributeCollection, HtmlCommentNode and so on.
The first class we are going to examine is the HtmlDocument class. This class has the methods to load and parse the document into its respective parts.
In the attached source code I call out each section of the code using the nomenclature of (Part X) where X is a number.
To use it, create an instance with the following line:
HtmlAgilityPack.HtmlDocument agpack = new HtmlAgilityPack.HtmlDocument();
The next method to call is the one that loads the document. You can load from either a string (LoadHtml) or a file (Load).
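As a minimal sketch of the two loading options, the helper below wraps LoadHtml and Load. The class name DocumentLoader and its method names are hypothetical, made up for illustration; only HtmlDocument, LoadHtml and Load come from HAP.

```csharp
using HtmlAgilityPack;

// Hypothetical helper, not part of HAP: shows the two ways to fill an HtmlDocument.
public static class DocumentLoader
{
    // Parses markup that is already in memory.
    public static HtmlDocument FromString(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        return doc;
    }

    // Parses a file on disk; the caller supplies the path.
    public static HtmlDocument FromFile(string path)
    {
        var doc = new HtmlDocument();
        doc.Load(path);
        return doc;
    }
}
```

Either method returns a document whose DocumentNode is the root for later queries.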
Like a web browser, HAP is forgiving of the HTML supplied. You can query for errors, but malformed markup will not make it throw.
The included file has a missing close on the second font tag and a misplaced end tag. It works fine in a browser and does not throw an error in HAP, but the problems can be checked for.
var errors = agpack.ParseErrors;
ParseErrors returns a collection of the errors found, which you can count. Interestingly enough, the missing closing font tag was not reported as an error, but the misplaced </tr> was.
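To see what was actually reported, you can walk the collection. The sketch below is one way to do it; the class name ParseErrorReport and the method Describe are hypothetical, while Code, Line, LinePosition and Reason are properties of HAP's HtmlParseError.

```csharp
using System.Collections.Generic;
using HtmlAgilityPack;

// Hypothetical helper: turns each HAP parse error into one readable line.
public static class ParseErrorReport
{
    public static List<string> Describe(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html); // does not throw on malformed markup

        var lines = new List<string>();
        foreach (HtmlParseError error in doc.ParseErrors)
        {
            lines.Add($"{error.Code} at line {error.Line}, column {error.LinePosition}: {error.Reason}");
        }
        return lines;
    }
}
```

An empty list means HAP had no complaints, which, as noted above, does not guarantee the markup is valid.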
Once the document has been loaded, the two main methods to use for searching are GetElementbyId and SelectNodes.
Since an ID should be unique, GetElementbyId returns a single node. SelectNodes brings back a collection of nodes, because an XPath expression may match one or more items.
My client has an application which appends several files together, delimiting each document with a start and end comment. The following is how I handle splitting this document back into its constituent parts. The file I have included has a section which is delineated with comments. The comments are in the form:
<!-- Start Table: 1234 -->
HTML Body
<!-- End Table -->
where 1234 might represent some type of account number that we need for processing.
You could use the following to get the comment:
var comment = agpack.DocumentNode.SelectNodes("//comment()[contains(., 'Start Table:')]");
This says: from the whole document (“//”), select the comments whose own text (“.”) contains the words “Start Table:”.
Since this is a comment, it has no child nodes and the inner text is simply the text of the comment itself. This is useful if what you want to do is parse the comment to determine a value in the comment (account number in this case) but doesn’t really help when you want the text between the comments. To accomplish this, I fall back to Regular Expressions and grouping.
var html = Regex.Match(agpack.DocumentNode.InnerHtml, @"<!-- Start Table: \d* -->(?<one>.*)<!-- End Table -->", RegexOptions.Singleline).Groups["one"];
Now html.Value holds the text between the two comments.
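When the file contains several delimited sections, a greedy “.*” runs from the first start comment to the last end comment. The sketch below is one possible variation, not the code from the attached project: it uses a lazy quantifier and Regex.Matches to pull out every section together with its number. The class name TableSections, the group names account and body, and the assumption that each number appears only once are mine.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper: splits appended documents back into sections keyed by number.
public static class TableSections
{
    // The lazy ".*?" stops at the first "End Table" comment, so each section matches separately.
    private static readonly Regex SectionPattern = new Regex(
        @"<!-- Start Table: (?<account>\d+) -->(?<body>.*?)<!-- End Table -->",
        RegexOptions.Singleline);

    public static Dictionary<string, string> Split(string html)
    {
        var sections = new Dictionary<string, string>();
        foreach (Match match in SectionPattern.Matches(html))
        {
            // Assumes numbers are unique; a repeated number overwrites the earlier section.
            sections[match.Groups["account"].Value] = match.Groups["body"].Value;
        }
        return sections;
    }
}
```

Calling TableSections.Split(agpack.DocumentNode.InnerHtml) would then give one entry per delimited section.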
Moving on to finding elements in the DOM, the first example is finding a node using GetElementbyId. There are three tables, but only two have an ID assigned to them. One is id=”abc”, the other is id=”table3”.
Let’s start with looking at table with id=”abc”:
var node = agpack.GetElementbyId("abc");
This will return a single node representing the table. The InnerHtml will contain all the text between the <table></table> tags. It will also contain a collection of nodes representing the DOM structure of the table.
One approach to getting the row nodes is to use LINQ to discover them, such as:
var rownodes = node.ChildNodes.Where(w => w.OriginalName == "tr");
This sort of works: if you check the count you will see you have three rows. However, there are actually four rows. The first, wrapped in <thead></thead>, is not found because it is not a direct child of the table.
Another approach is to use SelectNodes on the node to discover the tr elements.
rownodes = node.SelectNodes("tr");
But this also fails to find all the rows, finding only the table's immediate children.
What about node.SelectNodes("/tr")? This matches nothing, and SelectNodes returns null rather than an empty collection.
What about node.SelectNodes("//tr")? The good news is that it found the missing row. The bad news is that it also found every other row in the document, twelve in all.
After a little digging I found the following two solutions worked:
rownodes = node.SelectNodes(node.XPath + "//tr");
rownodes = node.SelectNodes("descendant::tr");
Both return all four rows, which was interesting to me. I had assumed HAP would run SelectNodes relative to the current node, so that “//tr” would have worked. Alas, “//” says to search from the root of the document, whichever node you call it on. The first option works because it prefixes the node's own path, and the second because the descendant axis starts from the currently selected node.
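The sketch below compares three ways of staying inside the current node: the descendant axis used above, the relative path “.//tr”, and HAP's Descendants method with LINQ. The class name RowCounter and its method are hypothetical. Because SelectNodes returns null when nothing matches, the counts are guarded.

```csharp
using System.Linq;
using HtmlAgilityPack;

// Hypothetical helper: counts the tr elements below the element with the given id.
public static class RowCounter
{
    public static (int ViaAxis, int ViaDot, int ViaLinq) CountRows(string html, string id)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        HtmlNode table = doc.GetElementbyId(id);

        // Both expressions are evaluated relative to 'table', not the document root.
        HtmlNodeCollection viaAxis = table.SelectNodes("descendant::tr");
        HtmlNodeCollection viaDot = table.SelectNodes(".//tr");

        // SelectNodes yields null for no matches, so fall back to zero.
        return (viaAxis?.Count ?? 0, viaDot?.Count ?? 0, table.Descendants("tr").Count());
    }
}
```

Descendants never returns null, which makes it a convenient choice when an empty result is expected.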
Similarly, we can find all the td elements of the tr elements using the same procedures. Note that for table 3 we bring back twelve td elements even though they are children of <tr> and <font> and <span> elements.
node = agpack.GetElementbyId("table3");
var nodes = node.SelectNodes("descendant::td");
Let’s move on to HAP.CssSelectors
This sits on top of Html Agility Pack, and its NuGet package will in fact ensure that HAP is installed as a dependency.
It allows you to select elements using CSS selectors rather than XPath. For example:
rownodes = agpack.QuerySelectorAll("#abc tr");
In this case I did not need to search from the node. Simply selecting from the whole document returned the expected four rows.
var listTDNodes = agpack.QuerySelectorAll("#table3 td");
This returned twelve items, four rows by three columns.
Here is an example of getting only the three <td>s in the second row.
listTDNodes = agpack.QuerySelectorAll("#table3 tr:nth-child(2) td");
One thing to note: the QuerySelectorAll method returns a list of nodes rather than the HtmlNodeCollection that SelectNodes returns. This is important to know if you plan to mix and match.
In addition to selecting by id(#) you can select by class(.), much easier than looking for an attribute with class using XPath.
listTDNodes = agpack.QuerySelectorAll(".table");
This returns the first and third tables, which both have the class “table”.
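For comparison, here is a sketch of what the same class lookup takes in plain XPath, using only HAP itself. The class name ClassQuery and its members are hypothetical. XPath 1.0 has no class selector, so the expression pads the attribute with spaces to match “table” as a whole word and avoid hits on values such as “tablet”.

```csharp
using HtmlAgilityPack;

// Hypothetical helper: the XPath equivalent of the CSS selector ".table".
public static class ClassQuery
{
    // Pads @class with spaces so that only the whole word "table" matches.
    public const string TableClassXPath =
        "//*[contains(concat(' ', normalize-space(@class), ' '), ' table ')]";

    public static int CountWithTableClass(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        HtmlNodeCollection matches = doc.DocumentNode.SelectNodes(TableClassXPath);
        return matches?.Count ?? 0; // SelectNodes returns null when nothing matches
    }
}
```

Set against that expression, the one-character “.” in the CSS selector is clearly easier to read and to get right.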
Points of Interest
In conclusion, the CssSelectors extension is another useful tool to select elements easily without the need to dig deep into XPath or iterate through collections. I know I will be looking forward to implementing some of these findings into my own work.