
Web scraping with regular expressions

Richard Penman, 18 Jan 2013

Using regular expressions for web scraping is sometimes criticized, but I believe they still have their place, particularly for one-off scrapes. Let's say I want to extract the title of a particular webpage. Here are implementations using BeautifulSoup, lxml, and regular expressions, each timed over 100 calls:

import re
import time
import urllib2
from BeautifulSoup import BeautifulSoup
from lxml import html as lxmlhtml


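def timeit(fn, *args):
    # Call fn 100 times and print the total elapsed time in milliseconds
    t1 = time.time()
    for i in range(100):
        fn(*args)
    t2 = time.time()
    print '%s took %0.3f ms' % (fn.func_name, (t2-t1)*1000.0)


def bs_test(html):
    # BeautifulSoup: build the full tree, then walk down to <title>
    # (this returns the Tag itself; its text is available as .string)
    soup = BeautifulSoup(html)
    return soup.html.head.title

def lxml_test(html):
    # lxml: build the full tree, then select <title> with XPath
    tree = lxmlhtml.fromstring(html)
    return tree.xpath('//title')[0].text_content()

def regex_test(html):
    # Regex: scan the raw string, with no tree construction
    return re.findall('<title>(.*?)</title>', html)[0]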
    
    
if __name__ == '__main__':
    url = 'http://webscraping.com/blog/Web-scraping-with-regular-expressions/'
    html = urllib2.urlopen(url).read()
    for fn in (bs_test, lxml_test, regex_test):
        timeit(fn, html)

The results are:

regex_test took 40.032 ms
lxml_test took 1863.463 ms
bs_test took 54206.303 ms
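
The ratios between these totals are easy to check by hand. The short sketch below uses only the three figures reported above; the names `totals_ms`, `RUNS`, and `slowdown` are made up for illustration.

```python
# Totals reported above, in milliseconds, each for 100 calls.
totals_ms = {
    'regex_test': 40.032,
    'lxml_test': 1863.463,
    'bs_test': 54206.303,
}
RUNS = 100  # iterations per function in the timing loop


def slowdown(name, baseline='regex_test'):
    """Return how many times slower `name` was than `baseline`."""
    return totals_ms[name] / totals_ms[baseline]


for name, total in totals_ms.items():
    # Per-call time is the total divided by the number of runs.
    print('%s: %.3f ms per call, %.1fx the regex time'
          % (name, total / RUNS, slowdown(name)))
```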

Each figure is the total for 100 calls. That means for this use case lxml takes roughly 47x longer than regular expressions and BeautifulSoup over 1,350x! This is because lxml and BeautifulSoup parse the entire document into their internal format, when only the title is required.
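
The pattern in the benchmark only matches a bare, lowercase `<title>` on a single line. A slightly more forgiving variant might look like the sketch below, written for Python 3 with the standard library only. The pattern and the name `extract_title` are assumptions for illustration, not part of the original benchmark, and it still scans the raw string rather than building a tree.

```python
import re

# Hypothetical pattern: tolerates attributes on <title>, mixed case,
# and newlines inside the title text.
TITLE_RE = re.compile(r'<title\b[^>]*>(.*?)</title\s*>',
                      re.IGNORECASE | re.DOTALL)


def extract_title(html):
    """Return the text of the first <title> element, or None if absent."""
    match = TITLE_RE.search(html)  # search() stops at the first match
    if match is None:
        return None
    # Collapse runs of whitespace (including newlines) into single spaces.
    return ' '.join(match.group(1).split())
```

Using `re.search` instead of `re.findall` means the scan ends at the first title rather than continuing through the rest of the page, and a missing title yields `None` instead of an `IndexError`.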

XPaths are very useful for most web scraping tasks, but there is still a use case for regular expressions.
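
The benchmark listing above is Python 2 (print statement, `urllib2`, BeautifulSoup 3). For readers on Python 3, a rough sketch of the same timing harness follows. It is limited to the standard library and the regex case so that it runs without lxml or BeautifulSoup installed. The name `time_calls` and the `SAMPLE_HTML` stand-in document are assumptions for illustration, and timings from it are not comparable to the figures reported above.

```python
import re
import time


def time_calls(fn, *args, runs=100):
    """Call fn(*args) `runs` times and return the total elapsed time in ms."""
    start = time.perf_counter()
    for _ in range(runs):
        fn(*args)
    return (time.perf_counter() - start) * 1000.0


def regex_test(html):
    # Same pattern as the original benchmark.
    return re.findall('<title>(.*?)</title>', html)[0]


# Hypothetical stand-in for the downloaded page.
SAMPLE_HTML = ('<html><head><title>Web scraping with regular expressions'
               '</title></head><body></body></html>')

if __name__ == '__main__':
    elapsed = time_calls(regex_test, SAMPLE_HTML)
    print('%s took %0.3f ms' % (regex_test.__name__, elapsed))
```

`time.perf_counter` is used here in place of `time.time` because it is intended for measuring short intervals.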

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

About the Author

Richard Penman

Australia

Article Copyright 2013 by Richard Penman