Technical Blog

Parsing HTML with Python

Richard Penman, 23 Jan 2013 CPOL

HTML is a tree structure: at the root is an <html> tag, followed by the <head> and <body> tags, and then more tags before the content itself. However, when a webpage is downloaded, all one gets is a series of characters. Working directly with that text is fine when using regular expressions, but often we want to traverse the webpage content, which requires parsing the tree structure.

Unfortunately, the HTML of many webpages around the internet is invalid. For example, a list may be missing its closing tags:

<ul>  
 <li>abc
 <li>def  
 <li>ghi
</ul>

but it still needs to be interpreted as a proper list:

  • abc
  • def
  • ghi

This means we can't naively parse HTML by assuming a tag ends when we find the next closing tag. Instead it is best to use one of the many HTML parsing libraries available, such as BeautifulSoup, lxml, html5lib, and libxml2dom.
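To make the problem concrete, here is a minimal sketch of the naive approach, using only the standard re module. The variable names and the regular expression are my own illustration: the pattern assumes each item runs from <li> to the next </li>, which is exactly the assumption that invalid HTML breaks.

```python
import re

# The invalid list from above: no <li> is ever closed.
unclosed = "<ul>\n <li>abc\n <li>def\n <li>ghi\n</ul>"
# A partly closed variant: only the second <li> lacks its closing tag.
partly_closed = "<ul><li>abc</li><li>def<li>ghi</li></ul>"

# Naive rule: an item runs from <li> to the next </li>.
naive_item = re.compile(r"<li>(.*?)</li>", re.DOTALL)

print(naive_item.findall(unclosed))       # [] (no </li> exists to match)
print(naive_item.findall(partly_closed))  # ['abc', 'def<li>ghi']
```

In the first case every item is lost, and in the second the last two items are merged into one, so neither result matches the three-item list a browser would display.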

The best known and most widely used of these appears to be BeautifulSoup: a Google search for "Python web scraping module" currently returns BeautifulSoup as the first result.
However, I use lxml instead because I find it more robust when parsing bad HTML. Additionally, Ian Bicking found lxml more efficient than the other parsing libraries, though my priority is accuracy over speed.

You will need lxml version 2 or later, which includes the html module. On Ubuntu releases up to 8.10, which came with an earlier version, this meant compiling lxml from source.
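If you are unsure which version is installed, one way to check is through the LXML_VERSION tuple in lxml.etree. The has_html_module name below is only for illustration:

```python
from lxml import etree

# LXML_VERSION is a tuple of integers: (major, minor, patch, build).
print(etree.LXML_VERSION)

# The html module is included from version 2 onwards.
has_html_module = etree.LXML_VERSION >= (2, 0)
print(has_html_module)
```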

Here is an example of how to parse broken HTML like the list above with lxml:

>>> from lxml import html
>>> tree = html.fromstring('<ul><li>abc</li><li>def<li>ghi</li></ul>')
>>> tree.xpath('//ul/li')
[<Element li at 959553c>, <Element li at 95952fc>, <Element li at 959544c>]
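The element objects on their own do not show what was recovered, so the following sketch goes one step further and reads the text of each item. It is my own extension of the example, run against the fully unclosed list from earlier; text_content() is the lxml.html method that returns an element's text, and the variable names are illustrative.

```python
from lxml import html

# The invalid list from earlier in the article: no <li> is closed.
broken = "<ul>\n <li>abc\n <li>def\n <li>ghi\n</ul>"
tree = html.fromstring(broken)

# Each <li> becomes its own element, even without closing tags.
items = [li.text_content().strip() for li in tree.xpath('//ul/li')]
print(items)  # ['abc', 'def', 'ghi']
```

The strip() call is needed because the whitespace and newlines between the items are kept as part of each element's text.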

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

About the Author

Richard Penman

Australia

Article Copyright 2013 by Richard Penman