Using regular expressions for web scraping is sometimes criticized, but I believe they still have their place, particularly for one-off scrapes. Let's say I want to extract the title of a particular webpage. Here are implementations using BeautifulSoup, lxml, and regular expressions, with a timer to compare them:
import re, time, urllib2
from BeautifulSoup import BeautifulSoup
from lxml import html as lxmlhtml

def timeit(fn, *args):
    # call fn 100 times and report the total elapsed time
    t1 = time.time()
    for i in range(100):
        fn(*args)
    t2 = time.time()
    print '%s took %0.3f ms' % (fn.func_name, (t2-t1)*1000.0)

def bs_test(html):
    soup = BeautifulSoup(html)
    return soup.html.head.title

def lxml_test(html):
    tree = lxmlhtml.fromstring(html)
    return tree.xpath('//title')[0].text_content()

def regex_test(html):
    return re.findall('<title>(.*?)</title>', html)

if __name__ == '__main__':
    url = 'http://webscraping.com/blog/Web-scraping-with-regular-expressions/'
    html = urllib2.urlopen(url).read()
    for fn in (bs_test, lxml_test, regex_test):
        timeit(fn, html)
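As an aside, the standard library's timeit module can replace a hand-rolled timer like the one above. A minimal sketch, using a local HTML string instead of the downloaded page so it runs offline:

```python
import re
import timeit

# A local snippet stands in for the downloaded page so the sketch runs offline.
HTML = '<html><head><title>Example</title></head><body></body></html>'

def regex_test(html):
    return re.findall('<title>(.*?)</title>', html)

# timeit.timeit calls the callable `number` times and returns total seconds
elapsed = timeit.timeit(lambda: regex_test(HTML), number=100)
print('regex_test took %0.3f ms' % (elapsed * 1000.0))
```

timeit also disables garbage collection during the run by default, which makes repeated measurements a little more stable than wrapping time.time() calls by hand.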
The results are:
regex_test took 40.032 ms
lxml_test took 1863.463 ms
bs_test took 54206.303 ms
That means for this use case lxml takes roughly 45x longer than regular expressions, and BeautifulSoup over 1000x longer! This is because lxml and BeautifulSoup parse the entire document into their internal formats, when only the title is required.
XPath is very useful for most web scraping tasks, but there is still a place for regular expressions.
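For anything beyond a quick one-off, the title regex can be hardened a little without reaching for a full parser. The extract_title helper below is my own illustration, not part of the benchmark above; it tolerates attributes on the tag, mixed case, and a title that spans lines:

```python
import re

def extract_title(html):
    # [^>]* allows attributes on the <title> tag; IGNORECASE handles
    # <TITLE>; DOTALL lets the lazy (.*?) capture span newlines.
    match = re.search(r'<title[^>]*>(.*?)</title>', html,
                      re.IGNORECASE | re.DOTALL)
    return match.group(1).strip() if match else None

print(extract_title('<html><head><TITLE>\n My Page </TITLE></head></html>'))
# -> My Page
```

It will still break on pathological markup (a title inside an HTML comment, say), which is exactly the kind of input where a real parser earns its overhead.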