Click here to Skip to main content
Click here to Skip to main content

Tagged as

Go to top

How to Use XPaths Robustly

, 18 Jan 2013
Rate this:
Please Sign up or sign in to vote.
How to use XPaths robustly

In an earlier post I referred to XPaths but did not explain how to use them.

Say we have the following HTML document:

<html>  
 <body>
  <div></div>  
  <div id="content">  
   <ul>  
    <li>First item</li>  
    <li>Second item</li>  
   </ul>  
  </div>  
 </body>  
</html>

To access the list elements we follow the HTML structure from the root tag down to the li's:

html > body > 2nd div > ul > many li's.

An XPath to represent this traversal is:

/html[1]/body[1]/div[2]/ul[1]/li

If a tag has no index then every tag of that type will be selected:

/html/body/div/ul/li

XPaths can also use attributes to select nodes:

/html/body/div[@id="content"]/ul/li 

And instead of using an absolute XPath from the root the XPath can be relative to a particular node by using double slash:

//div[@id="content"]/ul/li

This is more reliable than an absolute XPath because it can still locate the correct content after the surrounding structure is changed.

There are other features in the XPath standard but the above are all I use regularly.

A handy way to find the XPath of a tag is with Firefox's Firebug extension. To do this open the HTML tab in Firebug, right click the element you are interested in, and select "Copy XPath". (Alternatively use the "Inspect" button to select the tag.)
This will give you an XPath with indices only where there are multiple tags of the same type, such as:

/html/body/div[2]/ul/li

One thing to keep in mind is Firefox will always create a tbody tag within tables whether it existed in the original HTML or not. This has tripped me up a few times!

For one-off scrapes the above XPath should be fine. But for long term repeat scrapes it is better to use a relative XPath around an ID element with attributes instead of indices. From my experience such an XPath is more likely to survive minor modifications to the layout. However for a more robust solution see my SiteScraper library, which I will introduce in a later post.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

Share

About the Author

Richard Penman

Australia Australia
No Biography provided

Comments and Discussions

 
-- There are no messages in this forum --
| Advertise | Privacy | Mobile
Web03 | 2.8.140916.1 | Last Updated 18 Jan 2013
Article Copyright 2013 by Richard Penman
Everything else Copyright © CodeProject, 1999-2014
Terms of Service
Layout: fixed | fluid