Hi,

I am OCRing some bills from my scanner with a well-known OCR library. It's very good, and returns all the text it finds as a big string filled with the OCR'd text.

Near the top of the bill, there is a line that says

January 16, 2016


I have tried splitting the output into lines, but the date lands on a different line for each bill. It is always in the same <long month> <day number>, <four-digit year> format.

What is a Regex I can use to munch on the text, and pick out the date in that format?

What I have tried:

I've searched Google repeatedly, but I'm probably not using the right search terms. Any tips would help!
Updated 17-Oct-17 17:58pm

1 solution

First thing, grab a copy of Expresso.
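Your regex will probably wind up something like
/(January|February|....|December)\s*\d{1,2}\s*,\s*\d{4}/
The parentheses around the month names matter: without them, the alternation would match a bare "January" on its own, and only "December" would have to be followed by the day and year.
I've thrown in the \s* tokens so it will be tolerant of variable amounts of whitespace (in my experience, OCR output needs that).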
Feel free to munch this up however you like.
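
If it helps, here is a minimal Python sketch of that pattern with all twelve months spelled out and the month alternation wrapped in a group, so the day and year are required after every month. Python is just my choice for illustration (the question doesn't say which language or OCR library is in use), and the names MONTHS, DATE_RE, find_bill_date and the sample text are made up for the example.

```python
import re

# Full month names, joined into one alternation (hypothetical helper name).
MONTHS = ("January|February|March|April|May|June|July|"
          "August|September|October|November|December")

# The group around MONTHS makes the day/year part apply to every month.
# \s* tolerates missing or extra whitespace, which OCR output often has.
DATE_RE = re.compile(r"\b(" + MONTHS + r")\s*(\d{1,2})\s*,\s*(\d{4})\b")


def find_bill_date(ocr_text):
    """Return (month, day, year) for the first date found, or None."""
    match = DATE_RE.search(ocr_text)
    if match is None:
        return None
    month, day, year = match.groups()
    return month, int(day), int(year)


# Made-up OCR output; the date can sit on any line.
sample = "ACME Utilities\nAccount 12345\nJanuary 16, 2016\nAmount due: $42.00"
print(find_bill_date(sample))
```

Because search() scans the whole string, there is no need to split the OCR output into lines first. If the OCR sometimes misreads the comma or the letters, the pattern would need loosening, which this sketch doesn't attempt.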

edit: oops! removed spurious []

edit2: corrected spelling of Expresso
 

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


