Introduction
This article presents an implementation of XPathNavigator that can be used to perform XPath queries over HTML documents.
Background
The task of locating specific elements and attributes in HTML documents
arises from time to time when working with a web browser control. It
is useful for filtering content, performing transformations, or
automation such as filling in forms. Of course, one can write code that
traverses an HTML document for each specific case, but I believe it is
much better to use an existing query language, such as XPath, for this
purpose. There are at least two ways to add XPath support for HTML
documents. The first is to convert HTML to XML, which is
straightforward but inefficient, since the entire document must be
processed up front. The second, which is specific to the .NET
Framework, is to create a class derived from XPathNavigator[1] that
exposes the HTML document structure in a form suitable for XPath queries.
Creating a specific XPathNavigator for HTML documents is an elegant
approach that doesn't require duplicating the document structure, but
initially I thought it wouldn't be easy to implement. However, after
studying MSDN[1][2], I found that it is not difficult. All you need is
to create a class derived from XPathNavigator and implement at least
the following members:
- NameTable
- Clone
- NodeType
- LocalName
- Name
- NamespaceURI
- Prefix
- BaseURI
- IsEmptyElement
- MoveToFirstAttribute
- MoveToNextAttribute
- MoveToFirstNamespace
- MoveToNextNamespace
- MoveToNext
- MoveToPrevious
- MoveToFirstChild
- MoveToParent
- MoveTo
- MoveToId
- IsSamePosition
- Value
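To give a feel for the amount of work involved, here is a minimal sketch of a class that overrides every member from the list. RootOnlyNavigator is a hypothetical name used only for illustration; it is not part of the article's code and exposes a document consisting of nothing but the root node.

```csharp
using System.Xml;
using System.Xml.XPath;

// Hypothetical example: a navigator over a document that has only a root node.
sealed class RootOnlyNavigator : XPathNavigator
{
    private readonly XmlNameTable nameTable;

    public RootOnlyNavigator() : this(new NameTable()) { }
    private RootOnlyNavigator(XmlNameTable nameTable) { this.nameTable = nameTable; }

    // Properties describing the current node (always the root in this sketch).
    public override XmlNameTable NameTable { get { return nameTable; } }
    public override XPathNodeType NodeType { get { return XPathNodeType.Root; } }
    public override string LocalName { get { return string.Empty; } }
    public override string Name { get { return string.Empty; } }
    public override string NamespaceURI { get { return string.Empty; } }
    public override string Prefix { get { return string.Empty; } }
    public override string BaseURI { get { return string.Empty; } }
    public override bool IsEmptyElement { get { return false; } }
    public override string Value { get { return string.Empty; } }

    // Clones share the name table so they are recognized as the same document.
    public override XPathNavigator Clone() { return new RootOnlyNavigator(nameTable); }
    public override bool IsSamePosition(XPathNavigator other) { return IsSameDocument(other); }
    public override bool MoveTo(XPathNavigator other) { return IsSameDocument(other); }

    // There is nowhere to go from the root of an otherwise empty document.
    public override bool MoveToFirstAttribute() { return false; }
    public override bool MoveToNextAttribute() { return false; }
    public override bool MoveToFirstNamespace(XPathNamespaceScope scope) { return false; }
    public override bool MoveToNextNamespace(XPathNamespaceScope scope) { return false; }
    public override bool MoveToNext() { return false; }
    public override bool MoveToPrevious() { return false; }
    public override bool MoveToFirstChild() { return false; }
    public override bool MoveToParent() { return false; }
    public override bool MoveToId(string id) { return false; }

    private bool IsSameDocument(XPathNavigator other)
    {
        RootOnlyNavigator that = other as RootOnlyNavigator;
        return that != null && that.nameTable == nameTable;
    }
}
```

A real navigator keeps a reference to the current document node and updates it in the MoveTo* methods; here every move simply fails because there is nothing to move to.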
So I created a class, HtmlXPathNavigator, that implements all of these
members. When designing this class I used the State design pattern[3],
slightly adapted to achieve shorter code. The only difference from the
classical State pattern is that HtmlXPathNavigator (the context) is not
passed to the methods that modify the state. Instead, these methods
return a new state or null, and the navigator checks the return value
and sets the new state if it is not null.

As you can see from the diagram above, there are four states:
RootState, ElementState, AttributeState, and TextState, all derived
from the base class State. They represent different parts of an HTML
document, and I think there is no need to describe them in more detail
since the names are self-descriptive.
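The state-returning variant of the pattern described above can be sketched roughly as follows. This is a simplified illustration, not the article's actual code: the states here hold plain strings instead of MSHTML DOM nodes, only two of the four states are shown, and SketchNavigator and Transition are hypothetical names.

```csharp
// Base class: every move returns the state to switch to, or null on failure.
abstract class State
{
    public abstract string Name { get; }
    public virtual State MoveToFirstChild() { return null; }
    public virtual State MoveToParent() { return null; }
}

sealed class RootState : State
{
    public override string Name { get { return string.Empty; } }

    // Assumption for the sketch: the root always has a single HTML child.
    public override State MoveToFirstChild() { return new ElementState(this, "HTML"); }
}

sealed class ElementState : State
{
    private readonly State parent;
    private readonly string name;

    public ElementState(State parent, string name)
    {
        this.parent = parent;
        this.name = name;
    }

    public override string Name { get { return name; } }
    public override State MoveToParent() { return parent; }
}

// The context: it never passes itself to the states.
sealed class SketchNavigator
{
    private State state = new RootState();

    public string Name { get { return state.Name; } }

    public bool MoveToFirstChild() { return Transition(state.MoveToFirstChild()); }
    public bool MoveToParent() { return Transition(state.MoveToParent()); }

    // Adopt the new state if there is one; otherwise stay put and report failure.
    private bool Transition(State next)
    {
        if (next == null) return false;
        state = next;
        return true;
    }
}
```

Because the states never see the context, each MoveTo* method in the navigator collapses to a one-line call, which is where the shorter code comes from.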
HtmlXPathNavigator also provides the capability of tracing method
invocations. I used tracing while debugging the navigator and
deliberately left this functionality in, since it may be useful to see
which methods are called when an XPath query is processed.
Tracing may be turned on in the configuration file or by changing the
default source level to SourceLevels.Information in the source code. I
recommend not using the default trace listener, which writes to the
debug console, since non-trivial XPath queries generate lots of
messages that flood the console window. The following configuration can
be used to enable HtmlXPath tracing and direct the output to the
HtmlXPath.log file.
<?xml version="1.0" encoding="utf-8"?>
<configuration>
  <system.diagnostics>
    <sources>
      <source name="HtmlXPath" switchName="SourceSwitch"
              switchType="System.Diagnostics.SourceSwitch">
        <listeners>
          <add name="Log"
               type="System.Diagnostics.TextWriterTraceListener"
               initializeData="HtmlXPath.log"/>
          <remove name="Default"/>
        </listeners>
      </source>
    </sources>
    <switches>
      <add name="SourceSwitch" value="Information"/>
    </switches>
    <trace autoflush="true"/>
  </system.diagnostics>
</configuration>
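For comparison, roughly the same setup can be done in code with the System.Diagnostics API. The sketch below assumes you own the TraceSource instance; how the navigator declares its own source is not shown in this article, and TraceSetup and CreateHtmlXPathSource are hypothetical names.

```csharp
using System.Diagnostics;

static class TraceSetup
{
    // Builds a TraceSource configured like the XML above.
    public static TraceSource CreateHtmlXPathSource(string logPath)
    {
        // Off by default, mirroring a source that stays silent until enabled.
        TraceSource source = new TraceSource("HtmlXPath", SourceLevels.Off);

        // Equivalent of <add name="SourceSwitch" value="Information"/>.
        source.Switch.Level = SourceLevels.Information;

        // Equivalent of <remove name="Default"/> and <add name="Log" .../>.
        source.Listeners.Remove("Default");
        source.Listeners.Add(new TextWriterTraceListener(logPath, "Log"));

        // Equivalent of <trace autoflush="true"/>.
        Trace.AutoFlush = true;
        return source;
    }
}
```

After calling TraceSetup.CreateHtmlXPathSource("HtmlXPath.log"), a call such as source.TraceInformation("MoveToFirstChild") appends a line to the log file; call source.Close() when done to release the file.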
Using the code
Please refer to the sample included in the archive and to the MSDN documentation for the XPathNavigator class[1].
Points of Interest
Note that XPath queries are case-sensitive while HTML element names are not. However, the MSHTML DOM document seems to have all element names converted to uppercase. Take this into account when writing your queries.
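The effect is easy to reproduce with any XPathNavigator. The sketch below uses the standard XPathDocument over a small uppercase document as a stand-in for an MSHTML document; the markup and the CaseSensitivityDemo class are made up for illustration, and the HtmlXPathNavigator constructor is deliberately not used because it is not described here.

```csharp
using System.IO;
using System.Xml.XPath;

static class CaseSensitivityDemo
{
    // Counts the nodes matched by an XPath query over the stand-in document.
    public static int Count(string xpath)
    {
        const string markup =
            "<HTML><BODY><FORM><INPUT type='text' name='q'/></FORM></BODY></HTML>";
        XPathNavigator navigator =
            new XPathDocument(new StringReader(markup)).CreateNavigator();
        return navigator.Select(xpath).Count;
    }
}
```

With this document, CaseSensitivityDemo.Count("//INPUT[@type='text']") returns 1, while the lowercase query "//input[@type='text']" returns 0.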
References
[1] XPathNavigator Class
[2] XPathNavigator over Different Stores
[3] Erich Gamma, Richard Helm, Ralph Johnson, and John Vlissides, Design Patterns: Elements of Reusable Object-Oriented Software, Addison-Wesley, 1995, ISBN 0-201-63361-2
Download htmlxpath.zip - 2.77 KB