Licence MIT
First Posted 27 Jan 2008

XPath for HTML

By Victor Zverovich | 27 Jan 2008 | Article
Implementation of XPathNavigator for HTML.

Introduction

This article presents an implementation of XPathNavigator that can be used to perform XPath queries over HTML documents.

Background

The task of locating specific elements and attributes in HTML documents arises from time to time when working with a web browser control. Locating them is useful for filtering content, performing transformations, or automation such as filling in forms. Of course, one can write code that traverses an HTML document for each specific case, but I believe it is much better to use an existing query language, such as XPath, for this purpose. There are at least two ways to add XPath support for HTML documents. The first is to convert HTML to XML, which is straightforward but inefficient, since the entire document must be processed. The second, which is specific to the .NET Framework, is to create a class derived from XPathNavigator[1] that exposes the HTML document structure in a form suitable for XPath queries.
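For comparison, the first approach (convert the whole document, then query it) might look like the following sketch. It is written in Python with only the standard library rather than in .NET, and the names `HtmlToTree`, `html_to_tree` and the `VOID` set are hypothetical helpers invented for this illustration. It assumes reasonably well-balanced HTML and is not a general-purpose converter.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Elements that never have a closing tag (a partial list, for illustration).
VOID = {"br", "hr", "img", "input", "meta", "link"}


class HtmlToTree(HTMLParser):
    """Builds a complete element tree from HTML; every node is copied."""

    def __init__(self):
        super().__init__()
        self.builder = ET.TreeBuilder()

    def handle_starttag(self, tag, attrs):
        # HTMLParser reports tag names in lowercase.
        # Valueless attributes (e.g. "disabled") arrive as None.
        self.builder.start(tag, {k: v or "" for k, v in attrs})
        if tag in VOID:
            self.builder.end(tag)

    def handle_endtag(self, tag):
        if tag not in VOID:
            self.builder.end(tag)

    def handle_data(self, data):
        self.builder.data(data)


def html_to_tree(html):
    parser = HtmlToTree()
    parser.feed(html)
    parser.close()
    return parser.builder.close()


page = ('<html><body><form><input name="q">'
        '<input name="lang"></form></body></html>')
tree = html_to_tree(page)

# The query runs only after the whole document has been duplicated.
matches = tree.findall(".//input[@name='q']")
print(len(matches))
```

The conversion step visits and copies every node before the first query can run, which is the cost the navigator-based approach avoids.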

Creating a specific XPathNavigator for HTML documents is an elegant approach that doesn’t require duplicating the document structure, but initially I thought it wouldn’t be easy to implement. However, after studying MSDN[1][2], I found that it’s not difficult. All you need is to create a class derived from XPathNavigator and implement at least the following properties and methods:

  • NameTable
  • Clone
  • NodeType
  • LocalName
  • Name
  • NamespaceURI
  • Prefix
  • BaseURI
  • IsEmptyElement
  • MoveToFirstAttribute
  • MoveToNextAttribute
  • MoveToFirstNamespace
  • MoveToNextNamespace
  • MoveToNext
  • MoveToPrevious
  • MoveToFirstChild
  • MoveToParent
  • MoveTo
  • MoveToId
  • IsSamePosition
  • Value

So I created a class, HtmlXPathNavigator, that implements all of these members. When designing this class I used the State design pattern[3], slightly adapted to make the code shorter. The only difference from the classical State pattern is that HtmlXPathNavigator (the context) is not passed to the methods that modify the state. Instead, these methods return a new state or null, and the navigator checks the return value and switches to the new state if it is not null.

[Figure: htmlxpath class diagram]

As you can see from the diagram above, there are four states: RootState, ElementState, AttributeState and TextState, all derived from the base class State. They represent different parts of an HTML document, and I think there is no need to describe them in more detail since the names are self-descriptive.
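The State variant described above, where a move returns a new state or null and the context swaps states only on success, can be sketched as follows. This is a language-neutral illustration in Python, not the article's C# code. `Node` and `Navigator` are hypothetical stand-ins, and only an element state is shown; the real class also has root, attribute and text states.

```python
class Node:
    """Hypothetical stand-in for an HTML DOM node."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.children = []
        if parent is not None:
            parent.children.append(self)


class State:
    """Base state: each move returns a new State, or None if it fails."""

    def move_to_first_child(self):
        return None

    def move_to_next(self):
        return None

    def move_to_parent(self):
        return None


class ElementState(State):
    def __init__(self, node):
        self.node = node

    def move_to_first_child(self):
        children = self.node.children
        return ElementState(children[0]) if children else None

    def move_to_next(self):
        parent = self.node.parent
        if parent is None:
            return None
        index = parent.children.index(self.node) + 1
        if index < len(parent.children):
            return ElementState(parent.children[index])
        return None

    def move_to_parent(self):
        parent = self.node.parent
        return ElementState(parent) if parent is not None else None


class Navigator:
    """Context: owns the current state and replaces it only on success."""

    def __init__(self, root):
        self._state = ElementState(root)

    @property
    def name(self):
        return self._state.node.name

    def _apply(self, new_state):
        # The state never sees the navigator; it only hands back a result.
        if new_state is None:
            return False
        self._state = new_state
        return True

    def move_to_first_child(self):
        return self._apply(self._state.move_to_first_child())

    def move_to_next(self):
        return self._apply(self._state.move_to_next())

    def move_to_parent(self):
        return self._apply(self._state.move_to_parent())


html = Node("HTML")
Node("HEAD", html)
Node("BODY", html)

nav = Navigator(html)
print(nav.move_to_first_child(), nav.name)
```

Because a failed move returns nothing, the navigator stays where it was, which matches the boolean contract of the MoveTo* methods without each state needing a reference to the navigator.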

HtmlXPathNavigator can also trace method invocations. I used tracing while debugging the navigator and deliberately left this functionality in, since it may be useful to see which methods are called when an XPath query is processed.

Tracing can be turned on in the configuration file or by changing the default source level to SourceLevels.Information in the source code. I recommend against using the default trace listener, which writes to the debug console, because non-trivial XPath queries generate a large number of messages and flood the console window. The following configuration enables HtmlXPath tracing and directs the output to the HtmlXPath.log file.

<?xml version="1.0" encoding="utf-8" ?>
<configuration>
  <system.diagnostics>
    <sources>
      <source name="HtmlXPath" switchName="SourceSwitch"
              switchType="System.Diagnostics.SourceSwitch">
        <listeners>
          <add name="Log"
               type="System.Diagnostics.TextWriterTraceListener"
               initializeData="HtmlXPath.log"/>
          <remove name="Default"/>
        </listeners>
      </source>
    </sources>
    <switches>
      <add name="SourceSwitch" value="Information"/>
    </switches>
    <trace autoflush="true"/>
  </system.diagnostics>
</configuration>

Using the code

Please refer to the sample included in the archive and to the MSDN documentation for the XPathNavigator class[1].

Points of Interest

Note that XPath queries are case-sensitive, while HTML element names are not. However, the MSHTML DOM seems to convert all element names to uppercase. Take this into account when writing your queries.
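The effect of the note above can be illustrated with a small sketch. It uses Python's standard ElementTree module as a stand-in XPath engine, not MSHTML or HtmlXPathNavigator, and the sample document with uppercase element names is an assumption that mimics the behaviour described here.

```python
import xml.etree.ElementTree as ET

# Element names are uppercase, as the MSHTML DOM appears to report them.
doc = ET.fromstring("<HTML><BODY><A>link</A></BODY></HTML>")

upper = doc.findall(".//A")   # matches the element
lower = doc.findall(".//a")   # matches nothing: names are case-sensitive

print(len(upper), len(lower))
```

A query written with lowercase names, as HTML is usually authored, silently returns an empty result rather than an error, so uppercase element names are the safer choice here.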

References

  1. XPathNavigator Class, MSDN
  2. XPathNavigator over Different Stores, MSDN
  3. Erich Gamma, Richard Helm, Ralph Johnson, and John Vlissides, Design Patterns: Elements of Reusable Object-Oriented Software, Addison-Wesley, 1995, ISBN 0-201-63361-2


Download htmlxpath.zip - 2.77 KB

License

This article, along with any associated source code and files, is licensed under The MIT License

About the Author

Victor Zverovich



United Kingdom

Last Updated 27 Jan 2008
Article Copyright 2008 by Victor Zverovich