Articles by Richard Penman (Technical Blogs: 24)

Average blogs rating: 4.83

Applications & Tools

Webpage screenshots with Webkit [Technical Blog]
Posted: 9 Jan 2013   Updated: 9 Jan 2013   Views: 3,346   Rating: 0.0 / 5    Votes: 0   Popularity: 0.0
Licence: The Code Project Open License (CPOL)      Bookmarked: 1   Downloaded: 0
My solution using Webkit.
How to teach yourself web scraping [Technical Blog]
Posted: 9 Jan 2013   Updated: 9 Jan 2013   Views: 4,236   Rating: 0.0 / 5    Votes: 0   Popularity: 0.0
Licence: The Code Project Open License (CPOL)      Bookmarked: 11   Downloaded: 0
How to learn about web scraping.
How to use proxies [Technical Blog]
Posted: 9 Jan 2013   Updated: 9 Jan 2013   Views: 3,441   Rating: 0.0 / 5    Votes: 0   Popularity: 0.0
Licence: The Code Project Open License (CPOL)      Bookmarked: 2   Downloaded: 0
How to use proxies.
Crawling with threads [Technical Blog]
Posted: 13 Jan 2013   Updated: 13 Jan 2013   Views: 3,075   Rating: 0.0 / 5    Votes: 0   Popularity: 0.0
Licence: The Code Project Open License (CPOL)      Bookmarked: 1   Downloaded: 0
Crawling with threads.
Using Google Translate to crawl a website [Technical Blog]
Posted: 13 Jan 2013   Updated: 13 Jan 2013   Views: 1,835   Rating: 5.00/5    Votes: 1   Popularity: 0.00
Licence: The Code Project Open License (CPOL)      Bookmarked: 2   Downloaded: 0
Using Google Translate to crawl a website.
Scraping dynamic data [Technical Blog]
Posted: 15 Jan 2013   Updated: 15 Jan 2013   Views: 1,671   Rating: 0.0 / 5    Votes: 0   Popularity: 0.0
Licence: The Code Project Open License (CPOL)      Bookmarked: 2   Downloaded: 0
I have three solutions for periodically scraping a website.
Why web2py? [Technical Blog]
Posted: 15 Jan 2013   Updated: 15 Jan 2013   Views: 2,054   Rating: 5.00/5    Votes: 1   Popularity: 0.00
Licence: The Code Project Open License (CPOL)      Bookmarked: 0   Downloaded: 0
Why web2py?
Scraping JavaScript based web pages with Chickenfoot [Technical Blog]
Posted: 16 Jan 2013   Updated: 16 Jan 2013   Views: 2,322   Rating: 0.0 / 5    Votes: 0   Popularity: 0.0
Licence: The Code Project Open License (CPOL)      Bookmarked: 2   Downloaded: 0
Scraping JavaScript based web pages with Chickenfoot.

Client side scripting

Scraping JavaScript webpages with Webkit [Technical Blog]
Posted: 15 Jan 2013   Updated: 15 Jan 2013   Views: 1,742   Rating: 0.0 / 5    Votes: 0   Popularity: 0.0
Licence: The Code Project Open License (CPOL)      Bookmarked: 2   Downloaded: 0
Webkit has now been ported to the Qt framework and can be used through its Python bindings.

HTML / CSS

Scraping multiple JavaScript webpages with webkit [Technical Blog]
Posted: 9 Jan 2013   Updated: 9 Jan 2013   Views: 3,565   Rating: 0.0 / 5    Votes: 0   Popularity: 0.0
Licence: The Code Project Open License (CPOL)      Bookmarked: 5   Downloaded: 0
I made an earlier post about using webkit to process the JavaScript in a webpage so you can access the resulting HTML and how to apply it to multiple webpages.
How to Use XPaths Robustly [Technical Blog]
Posted: 18 Jan 2013   Updated: 18 Jan 2013   Views: 1,236   Rating: 0.0 / 5    Votes: 0   Popularity: 0.0
Licence: The Code Project Open License (CPOL)      Bookmarked: 0   Downloaded: 0
How to use XPaths robustly.
Parsing HTML with Python [Technical Blog]
Posted: 23 Jan 2013   Updated: 23 Jan 2013   Views: 4,118   Rating: 0.0 / 5    Votes: 0   Popularity: 0.0
Licence: The Code Project Open License (CPOL)      Bookmarked: 3   Downloaded: 0
HTML is a tree structure: at the root is a <html> tag followed by the <head> and <body> tags and then more tags before the content itself. However, when a webpage is downloaded all one gets is a series of characters. Working directly with that text is fine when using regular expressions, but often we want to traverse

Web Security

Automatic web scraping [Technical Blog]
Posted: 9 Jan 2013   Updated: 9 Jan 2013   Views: 3,933   Rating: 0.0 / 5    Votes: 0   Popularity: 0.0
Licence: The Code Project Open License (CPOL)      Bookmarked: 5   Downloaded: 0
I have been interested in automatic approaches to web scraping for a few years now. During university I created the SiteScraper library, which used training cases to automatically scrape webpages. This approach was particularly useful for scraping a website periodically because the model could automat
How to protect your data [Technical Blog]
Posted: 16 Jan 2013   Updated: 16 Jan 2013   Views: 1,625   Rating: 0.0 / 5    Votes: 0   Popularity: 0.0
Licence: The Code Project Open License (CPOL)      Bookmarked: 1   Downloaded: 0
Some strategies to protect your data.
How to crawl websites without being blocked [Technical Blog]
Posted: 16 Jan 2013   Updated: 16 Jan 2013   Views: 2,187   Rating: 0.0 / 5    Votes: 0   Popularity: 0.0
Licence: The Code Project Open License (CPOL)      Bookmarked: 1   Downloaded: 0
How to crawl websites without being blocked.
Web scraping with regular expressions [Technical Blog]
Posted: 18 Jan 2013   Updated: 18 Jan 2013   Views: 1,523   Rating: 0.0 / 5    Votes: 0   Popularity: 0.0
Licence: The Code Project Open License (CPOL)      Bookmarked: 2   Downloaded: 0
Using regular expressions for web scraping is sometimes criticized, but I believe they still have their place, particularly for one-off scrapes. Let's say I want to extract the title of a particular webpage - here is an implementation using BeautifulSoup, lxml, and regular expressions: import re impor
Typical web scraping job [Technical Blog]
Posted: 23 Jan 2013   Updated: 23 Jan 2013   Views: 3,637   Rating: 5.00/5    Votes: 1   Popularity: 0.00
Licence: The Code Project Open License (CPOL)      Bookmarked: 4   Downloaded: 0
In this post I will clarify what I do by walking through a simple web scraping job I worked on.

Database

Automatically importing CSV into MySQL [Technical Blog]
Posted: 7 Jan 2013   Updated: 7 Jan 2013   Views: 3,696   Rating: 5.00/5    Votes: 1   Popularity: 0.00
Licence: The Code Project Open License (CPOL)      Bookmarked: 3   Downloaded: 0
Sometimes I need to import large spreadsheets into MySQL. The easy way would be to assume all fields are varchar, but then the database would lose features such as ordering by a numeric field. The hard way would be to manually determine the type of each field to define the schema. That doesn't sound mu

Other .NET Languages

How to make python faster [Technical Blog]
Posted: 7 Jan 2013   Updated: 7 Jan 2013   Views: 2,961   Rating: 0.0 / 5    Votes: 0   Popularity: 0.0
Licence: The Code Project Open License (CPOL)      Bookmarked: 3   Downloaded: 0
Python and other scripting languages are sometimes dismissed because of their inefficiency compared to compiled languages like C. For example, here are implementations of the Fibonacci sequence in C and Python: int fib(int n) { if (n < 2) return n; else return fib(n - 1) + fib(n - 2); } int m

Libraries

The SiteScraper module [Technical Blog]
Posted: 13 Jan 2013   Updated: 13 Jan 2013   Views: 2,815   Rating: 0.0 / 5    Votes: 0   Popularity: 0.0
Licence: The Code Project Open License (CPOL)      Bookmarked: 3   Downloaded: 0
Automatically scraping website data based on example cases.
Increase your Google App Engine quotas for free [Technical Blog]
Posted: 16 Jan 2013   Updated: 16 Jan 2013   Views: 1,605   Rating: 4.00/5    Votes: 1   Popularity: 0.00
Licence: The Code Project Open License (CPOL)      Bookmarked: 1   Downloaded: 0
How to increase your Google App Engine quotas for free.

Threads, Processes & IPC

Threading with webkit [Technical Blog]
Posted: 9 Jan 2013   Updated: 9 Jan 2013   Views: 2,813   Rating: 0.0 / 5    Votes: 0   Popularity: 0.0
Licence: The Code Project Open License (CPOL)      Bookmarked: 4   Downloaded: 0
In a previous post I showed how to scrape a list of webpages. Here is an updated example that downloads the content in multiple threads.

Uncategorised Technical Blogs

Generating a website screenshot history [Technical Blog]
Posted: 7 Jan 2013   Updated: 7 Jan 2013   Views: 6,162   Rating: 0.0 / 5    Votes: 0   Popularity: 0.0
Licence: The Code Project Open License (CPOL)      Bookmarked: 0   Downloaded: 0
There is a nice website screenshots.com that hosts historic screenshots for many websites.

Reviews on Third Party Products and Tools

Solving CAPTCHA with OCR [Technical Blog]
Posted: 7 Jan 2013   Updated: 7 Jan 2013   Views: 4,138   Rating: 5.00/5    Votes: 1   Popularity: 0.00
Licence: The Code Project Open License (CPOL)      Bookmarked: 2   Downloaded: 0
Some websites require passing a CAPTCHA to access their content. As I have written before, these can be parsed using the deathbycaptcha API; however, for large websites with many CAPTCHAs this becomes prohibitively expensive. For example, solving 1 million CAPTCHAs with this API would cost $1390. Fort

Richard Penman

Australia
Copyright © CodeProject, 1999-2013