Technical Blog

Solving CAPTCHA with OCR

Richard Penman, 7 Jan 2013, CPOL

Editorial Note

This article appears in the Third Party Product Reviews section. Articles in this section are for the members only and must not be used by tool vendors to promote or advertise products in any way, shape or form. Please report any spam or advertising.

Some websites require passing a CAPTCHA to access their content. As I have written before, these can be solved using the deathbycaptcha API; however, for large websites with many CAPTCHAs this becomes prohibitively expensive. For example, solving 1 million CAPTCHAs with this API would cost $1390.

Fortunately, many CAPTCHAs are weak and can be solved by cleaning the image and running simple OCR. Here are some example CAPTCHA images from a website I worked with recently:

Helpfully, the distracting marks are lighter than the text, so the image can be thresholded to isolate the text:
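A minimal sketch of that thresholding step, assuming PIL (or its fork Pillow) is installed. The function names and the cutoff of 128 are illustrative choices, not values taken from the downloadable code; the right cutoff depends on how light the distracting marks are in a given CAPTCHA.

```python
def to_black_or_white(value, cutoff=128):
    """Map one grayscale value (0-255) to black (0) or white (255)."""
    return 0 if value < cutoff else 255


def threshold_image(path, cutoff=128):
    """Load an image and return a black-and-white copy.

    Pixels darker than `cutoff` (the text) become black; lighter
    pixels (the distracting marks and background) become white.
    """
    from PIL import Image  # imported here so the helper above needs no dependencies

    gray = Image.open(path).convert("L")  # "L" = 8-bit grayscale
    # point() applies the mapping to every possible pixel value
    return gray.point(lambda v: to_black_or_white(v, cutoff))
```

Lowering the cutoff removes more of the light marks but can start eating into thin strokes of the text, so it is worth checking a few thresholded images by eye.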

Now the resulting images can be passed to an OCR program to extract the text. Here are results from 3 popular open source OCR tools:
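As a rough sketch, each tool can be driven from Python through its command line. The command lines below are assumptions based on common usage (`tesseract <image> stdout` on recent Tesseract releases, `gocr -i <image>`, `ocrad <image>`) and are not necessarily what the downloadable code does; the image also has to be saved in a format each tool can read. The names `OCR_COMMANDS`, `clean_ocr_output` and `run_ocr` are made up for this example.

```python
import subprocess

# Assumed command lines; check each tool's documentation for your version.
OCR_COMMANDS = {
    "tesseract": lambda path: ["tesseract", path, "stdout"],
    "gocr": lambda path: ["gocr", "-i", path],
    "ocrad": lambda path: ["ocrad", path],
}


def clean_ocr_output(raw):
    """Remove the whitespace and newlines OCR tools leave in short strings."""
    return "".join(raw.split())


def run_ocr(tool, path):
    """Run one OCR tool on an image file and return the cleaned text."""
    result = subprocess.run(
        OCR_COMMANDS[tool](path),
        capture_output=True,  # collect stdout instead of printing it
        text=True,
        check=True,  # raise CalledProcessError if the tool fails
    )
    return clean_ocr_output(result.stdout)
```

For example, `run_ocr("gocr", "captcha.pnm")` would return the text Gocr reads from a hypothetical thresholded file `captcha.pnm`.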

Tool         Captcha 1   Captcha 2   Captcha 3   Result
Actual text  7rrg5       hirbZ       izi3b
Tesseract    7rrq5       hirbZ       izi3b       2 / 3
Gocr         7rr95       _i_bz       izi3b       1 / 3
Ocrad        7rrgS       hi_bL       iLi3b       0 / 3
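The Result column can be reproduced with a few lines of Python. This sketch assumes the unlabelled first row of the table is the actual CAPTCHA text and counts a CAPTCHA as solved only when the whole string matches; the variable and function names are illustrative.

```python
# Values copied from the table above.
ACTUAL = ["7rrg5", "hirbZ", "izi3b"]
RESULTS = {
    "Tesseract": ["7rrq5", "hirbZ", "izi3b"],
    "Gocr": ["7rr95", "_i_bz", "izi3b"],
    "Ocrad": ["7rrgS", "hi_bL", "iLi3b"],
}


def count_correct(predicted, actual):
    """Count CAPTCHAs where the OCR output matches the actual text exactly."""
    return sum(p == a for p, a in zip(predicted, actual))


for tool, predicted in RESULTS.items():
    print(f"{tool}: {count_correct(predicted, ACTUAL)} / {len(ACTUAL)}")
```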

Excellent results. Getting 100% accuracy is not necessary when solving CAPTCHAs: real people make mistakes too, so websites will just respond with another CAPTCHA to solve.
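Because a wrong answer just produces a fresh CAPTCHA, an imperfect solver can simply try again. The sketch below shows that loop under loud assumptions: `fetch_captcha`, `recognise` and `submit` are hypothetical callables you would supply yourself (download the image, run threshold plus OCR, post the answer and report success), and `max_attempts` is an arbitrary limit.

```python
def solve_with_retries(fetch_captcha, recognise, submit, max_attempts=5):
    """Keep requesting and answering CAPTCHAs until one is accepted.

    Returns (accepted_text, attempts_used), or (None, max_attempts)
    if every attempt was rejected.
    """
    for attempt in range(1, max_attempts + 1):
        image = fetch_captcha()   # hypothetical: get a new CAPTCHA image
        guess = recognise(image)  # hypothetical: threshold + OCR
        if submit(guess):         # hypothetical: True if the site accepts it
            return guess, attempt
    return None, max_attempts
```

Keeping a cap on the number of attempts avoids hammering a site when the OCR is failing on every image.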

Tesseract only confused 'g' with 'q', and Gocr thought that 'g' was a '9', which is understandable. Even though Ocrad did not get any correct on this small sample set, it was close every time. And this was without training on the font or fixing the text orientation.
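One way to put a number on "close" is the fraction of characters that match position by position. This is an illustrative metric I am assuming here, not something the original code computes; the strings are the Ocrad outputs and actual texts from the table.

```python
def char_accuracy(predicted, actual):
    """Fraction of positions where the predicted character matches."""
    longest = max(len(predicted), len(actual))
    if longest == 0:
        return 1.0  # two empty strings are trivially identical
    matches = sum(p == a for p, a in zip(predicted, actual))
    return matches / longest


ocrad = ["7rrgS", "hi_bL", "iLi3b"]
actual = ["7rrg5", "hirbZ", "izi3b"]
for guess, truth in zip(ocrad, actual):
    print(guess, truth, char_accuracy(guess, truth))
```

On this sample Ocrad gets 4, 3 and 4 characters out of 5 right, so a little post-processing or font training could plausibly turn its near misses into matches.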

If you are interested, the Python code used is available for download here. It depends on PIL for image processing and on each of the OCR tools.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

About the Author

Richard Penman

Australia

Last Updated 7 Jan 2013
Article Copyright 2013 by Richard Penman
Everything else Copyright © CodeProject, 1999-2014