Posted 31 Mar 2018
Licenced Apache

Data Scraping from Image using Tesseract

Eric M. H. Goh, 31 Mar 2018
Scrape data from image using Tesseract OCR engine

Introduction

Data science is a growing field. According to the CRISP-DM model and other data mining process models, data must be collected before knowledge can be mined from it or predictive analysis carried out. Data collection can involve data scraping, which includes web scraping (HTML to text) as well as image-to-text and video-to-text conversion. Once the data is in text format, text mining techniques are usually used to extract knowledge from it.

In this article, I am going to introduce you to Optical Character Recognition (OCR) to convert images to text. I developed Just Another Tesseract Interface (JATI) to convert images into text files, and consolidate them into a set of text data for text mining and natural language processing.

JATI interfaces with the Tesseract OCR engine to convert an image into text, and its source code is included. This article explains how to interface with the popular open-source Tesseract OCR engine from C#.

Selecting the Image Portion to Convert

Running OCR on a whole image is easy, but I want to be able to select just a portion of the image. Restricting OCR to the region of interest can also improve the accuracy of the result. In JATI, the user therefore clicks on the PictureBox image and drags to draw a rectangle around the portion to convert, and the selected area is then cropped. The following steps accomplish this.

References:

  1. http://www.c-sharpcorner.com/UploadFile/hirendra_singh/how-to-make-image-editor-tool-in-C-Sharp-cropping-image/
  2. https://stackoverflow.com/questions/34551800/get-the-exact-size-of-the-zoomed-image-inside-the-picturebox

Include the System.Drawing and System.Windows.Forms namespaces (System is needed for Math):

using System;
using System.Drawing;
using System.Windows.Forms;

Mouse Down event for PictureBox1:

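// Fields shared by the three mouse handlers; declare them on the form.
int startX, startY, curX, curY;
Pen selPen = new Pen(Color.Red, 1);

void PictureBox1MouseDown(object sender, MouseEventArgs e)
{
    if (e.Button != MouseButtons.Left)
        return;

    // Show a crosshair while selecting and remember where the drag started.
    // curX/curY are reset too, so a click without a drag selects nothing.
    Cursor = Cursors.Cross;
    startX = curX = e.X;
    startY = curY = e.Y;

    // Repaint to clear any outline left over from a previous selection.
    pictureBox1.Refresh();
}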

Mouse Move event for PictureBox1:

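void PictureBox1MouseMove(object sender, MouseEventArgs e)
{
    if (e.Button != MouseButtons.Left)
        return;

    // Erase the previous outline before drawing the new one.
    pictureBox1.Refresh();
    curX = e.X;
    curY = e.Y;

    // Normalise the corners so the rectangle is still drawn when the
    // user drags up or to the left (negative width or height).
    Rectangle rect = new Rectangle(Math.Min(startX, curX), Math.Min(startY, curY),
                                   Math.Abs(curX - startX), Math.Abs(curY - startY));

    // Dispose the temporary Graphics object once the outline is drawn.
    using (Graphics g = pictureBox1.CreateGraphics())
    {
        g.DrawRectangle(selPen, rect);
    }
}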

Mouse Up event for PictureBox1:

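void PictureBox1MouseUp(object sender, MouseEventArgs e)
{
    Cursor = Cursors.Arrow;

    // Normalised selection rectangle, valid for any drag direction.
    Rectangle rect = new Rectangle(Math.Min(startX, curX), Math.Min(startY, curY),
                                   Math.Abs(curX - startX), Math.Abs(curY - startY));

    // Nothing to crop without an image or with an empty selection
    // (a zero-sized Bitmap cannot be created).
    if (pictureBox1.Image == null || rect.Width == 0 || rect.Height == 0)
        return;

    Bitmap _img = new Bitmap(rect.Width, rect.Height);

    // Scale a copy of the image to the PictureBox size so that mouse
    // coordinates can be used as pixel coordinates, then copy the selection.
    using (Bitmap OriginalImage = new Bitmap(pictureBox1.Image, pictureBox1.Width, pictureBox1.Height))
    using (Graphics g = Graphics.FromImage(_img))
    {
        g.InterpolationMode = System.Drawing.Drawing2D.InterpolationMode.HighQualityBicubic;
        g.PixelOffsetMode = System.Drawing.Drawing2D.PixelOffsetMode.HighQuality;
        g.CompositingQuality = System.Drawing.Drawing2D.CompositingQuality.HighQuality;

        g.DrawImage(OriginalImage, 0, 0, rect, GraphicsUnit.Pixel);
    }

    // Show the cropped selection at its natural size.
    pictureBox2.Image = _img;
    pictureBox2.SizeMode = PictureBoxSizeMode.Zoom;
    pictureBox2.Width = _img.Width;
    pictureBox2.Height = _img.Height;
}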

The code above crops the selected portion of the image and places it in pictureBox2. A step-by-step explanation follows.

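Create a Rectangle for the selection, normalised so that it stays valid when the user drags up or to the left:

Rectangle rect = new Rectangle(Math.Min(startX, curX), Math.Min(startY, curY),
                               Math.Abs(curX - startX), Math.Abs(curY - startY));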

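Copy the original image into a Bitmap scaled to the size of pictureBox1, so that mouse coordinates on the control can be used as pixel coordinates: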

Bitmap OriginalImage = new Bitmap(pictureBox1.Image, pictureBox1.Width, pictureBox1.Height);

Create a new Bitmap the size of the selection:

Bitmap _img = new Bitmap(rect.Width, rect.Height);

Create a Graphics Object based on the new Bitmap Object:

Graphics g = Graphics.FromImage(_img);

Configure the Graphics object for high-quality rendering:

g.InterpolationMode = System.Drawing.Drawing2D.InterpolationMode.HighQualityBicubic;
g.PixelOffsetMode = System.Drawing.Drawing2D.PixelOffsetMode.HighQuality;
g.CompositingQuality = System.Drawing.Drawing2D.CompositingQuality.HighQuality;

Crop the image to the selection and show the result in pictureBox2:

g.DrawImage(OriginalImage, 0, 0, rect, GraphicsUnit.Pixel);
pictureBox2.Image = _img;
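
The crop treats mouse coordinates as pixel coordinates of an image stretched to fill pictureBox1. The article does not say which SizeMode pictureBox1 uses; if it were set to PictureBoxSizeMode.Zoom (the case discussed in the second reference), the image would be scaled uniformly and centred, so the two coordinate systems would no longer line up. The following is a minimal sketch of that mapping under this assumption. ZoomMapper and TryMapToImage are hypothetical names and are not part of JATI.

```csharp
using System;

// Hypothetical helper, not part of JATI: maps a point inside a zoomed
// PictureBox to the corresponding pixel of the underlying image.
public static class ZoomMapper
{
    // Returns false when the point lies in the empty border around the image.
    public static bool TryMapToImage(
        int boxWidth, int boxHeight,      // client size of the PictureBox
        int imageWidth, int imageHeight,  // size of the original image
        int x, int y,                     // mouse position in the PictureBox
        out int imageX, out int imageY)
    {
        imageX = 0;
        imageY = 0;
        if (boxWidth <= 0 || boxHeight <= 0 || imageWidth <= 0 || imageHeight <= 0)
            return false;

        // Zoom mode scales uniformly by the smaller of the two ratios...
        double scale = Math.Min((double)boxWidth / imageWidth,
                                (double)boxHeight / imageHeight);

        // ...and centres the scaled image inside the control.
        double offsetX = (boxWidth - imageWidth * scale) / 2.0;
        double offsetY = (boxHeight - imageHeight * scale) / 2.0;

        // Undo the offset and the scaling to get image coordinates.
        double ix = (x - offsetX) / scale;
        double iy = (y - offsetY) / scale;
        if (ix < 0 || iy < 0 || ix >= imageWidth || iy >= imageHeight)
            return false;

        imageX = (int)ix;
        imageY = (int)iy;
        return true;
    }
}
```

Mapping both corners of the selection this way would let the crop be taken from the original image at full resolution instead of from a resized copy.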

To record the coordinates of the selection, I use:

string selCoordinates = string.Format("({0},{1},{2},{3})", startX, startY, curX, curY);

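The coordinate string records the drag points exactly as they were captured, so a drag towards the top-left puts the larger values first, and a drag that leaves the control produces values outside it. As a minimal sketch, the points could be normalised and clipped before they are used or logged. The Selection type below is hypothetical and is not part of JATI; maxWidth and maxHeight stand for the size of the PictureBox.

```csharp
using System;

// Hypothetical value type, not part of JATI.
public struct Selection
{
    public int Left, Top, Width, Height;

    // Builds a selection from the drag start and end points, given in any
    // order, clipped to a surface of maxWidth x maxHeight pixels.
    public static Selection FromDrag(int startX, int startY, int curX, int curY,
                                     int maxWidth, int maxHeight)
    {
        int left = Clamp(Math.Min(startX, curX), 0, maxWidth);
        int top = Clamp(Math.Min(startY, curY), 0, maxHeight);
        int right = Clamp(Math.Max(startX, curX), 0, maxWidth);
        int bottom = Clamp(Math.Max(startY, curY), 0, maxHeight);
        return new Selection { Left = left, Top = top, Width = right - left, Height = bottom - top };
    }

    // True when there is no area to crop (for example, a click without a drag).
    public bool IsEmpty
    {
        get { return Width <= 0 || Height <= 0; }
    }

    // Same "(x1,y1,x2,y2)" layout as the selCoordinates string.
    public override string ToString()
    {
        return "(" + Left + "," + Top + "," + (Left + Width) + "," + (Top + Height) + ")";
    }

    private static int Clamp(int value, int min, int max)
    {
        return Math.Max(min, Math.Min(max, value));
    }
}
```

Checking IsEmpty before cropping avoids trying to create a zero-sized Bitmap.
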
Image to Text Recognition using Tesseract

I use the Tesseract OCR engine to convert images into text. JATI runs Tesseract as an external process, so include the System.Diagnostics namespace (for Process) and System.IO (for Directory):

using System.Diagnostics;
using System.IO;

Save the cropped image selection from pictureBox2 into a temporary directory:

pictureBox2.Image.Save(Directory.GetCurrentDirectory() + "/JATI/temp/temp" + ".png");

Set the input file and output file for Tesseract OCR engine:

string input = Directory.GetCurrentDirectory() + "/JATI/temp/temp" + ".png";
string output = Directory.GetCurrentDirectory() + "/JATI/temp/temp" + ".txt";
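
The same paths can also be assembled with Path.Combine, which inserts the platform's directory separator rather than relying on hard-coded slashes. This is only a sketch: JatiPaths is a hypothetical helper, and baseDir stands for the value of Directory.GetCurrentDirectory().

```csharp
using System.IO;

// Hypothetical helper, not part of JATI: builds the temporary file paths
// used above from a base directory.
public static class JatiPaths
{
    // The cropped selection saved from pictureBox2.
    public static string TempImage(string baseDir)
    {
        return Path.Combine(baseDir, "JATI", "temp", "temp.png");
    }

    // Tesseract appends ".txt" to the output name itself, so the
    // command line needs the path without an extension.
    public static string OutputBase(string baseDir)
    {
        return Path.Combine(baseDir, "JATI", "temp", "temp");
    }

    // The text file that Tesseract is expected to write.
    public static string OutputText(string baseDir)
    {
        return OutputBase(baseDir) + ".txt";
    }
}
```

Keeping the extension-less output name separate means the later Replace(".txt", "") step would not be needed.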

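Start the Tesseract process with its arguments. The input and output paths are quoted so that a directory name containing spaces is not split into separate arguments. Tesseract adds the .txt extension itself, so the output argument is the path without it. Note that Tesseract 4 and later spell the page segmentation option --psm rather than -psm.

string outputBase = output.Replace(".txt", "");
string arguments = "--tessdata-dir ./JATI/ \"" + input + "\" \"" + outputBase + "\"" +
                   " -l " + languageTextBox.Text + " -psm " + psmTextBox.Text;
Process myProcess = Process.Start(Directory.GetCurrentDirectory() + "/JATI/tesseract.exe", arguments);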
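
Building the argument string in a separate function makes it easier to check before anything is launched. The sketch below keeps the argument order used in this article and wraps every path in quotes. TesseractCommand and its method names are hypothetical; ProcessStartInfo and the properties set on it are standard System.Diagnostics members. The simple quoting assumes the paths contain no double quotes and do not end in a backslash.

```csharp
using System.Diagnostics;

// Hypothetical helper, not part of JATI.
public static class TesseractCommand
{
    // Wraps a path in double quotes so that spaces survive command-line
    // parsing. Assumes no embedded quotes and no trailing backslash.
    public static string Quote(string path)
    {
        return "\"" + path + "\"";
    }

    // Same argument order as the Process.Start call in the article.
    public static string BuildArguments(string tessdataDir, string inputImage,
                                        string outputBase, string language, string psm)
    {
        return "--tessdata-dir " + Quote(tessdataDir) + " " +
               Quote(inputImage) + " " + Quote(outputBase) +
               " -l " + language + " -psm " + psm;
    }

    // Start information that hides the console window and captures
    // Tesseract's error stream so that failures can be reported.
    public static ProcessStartInfo CreateStartInfo(string exePath, string arguments)
    {
        return new ProcessStartInfo(exePath, arguments)
        {
            UseShellExecute = false,      // required for stream redirection
            CreateNoWindow = true,        // no console window flashes up
            RedirectStandardError = true  // read via Process.StandardError
        };
    }
}
```

The result of CreateStartInfo can be passed to Process.Start in place of the two-string overload.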

Wait for the process to exit:

myProcess.WaitForExit();
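
Once the process has exited, the recognised text is in the output file. The following sketch reads it back, assuming that a non-zero exit code signals failure (the usual command-line convention) and that the output is a text file at the path set earlier. OcrResult is a hypothetical helper; exitCode would come from myProcess.ExitCode.

```csharp
using System;
using System.IO;

// Hypothetical helper, not part of JATI.
public static class OcrResult
{
    // exitCode: Process.ExitCode, read after WaitForExit() has returned.
    // outputTextPath: the .txt file Tesseract was asked to write.
    public static string Read(int exitCode, string outputTextPath)
    {
        if (exitCode != 0)
            throw new InvalidOperationException("Tesseract exited with code " + exitCode + ".");

        if (!File.Exists(outputTextPath))
            throw new FileNotFoundException("Tesseract produced no output file.", outputTextPath);

        // Trim the surrounding whitespace that OCR output usually carries.
        return File.ReadAllText(outputTextPath).Trim();
    }
}
```

For example, `string text = OcrResult.Read(myProcess.ExitCode, output);` would return the recognised text or throw if the conversion failed.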

License

This article, along with any associated source code and files, is licensed under The Apache License, Version 2.0

About the Author

Eric M. H. Goh
Founder SVBook
Singapore Singapore
Eric Goh is a data scientist, software engineer, adjunct faculty member and entrepreneur with years of experience in multiple industries. His varied career includes data science, data and text mining, natural language processing, machine learning, intelligent system development, and engineering product design. He founded SVBook and extended it with DSTK.Tech and EMHAcademy. DSTK.Tech is where Eric develops his own DSTK data science software. Eric has also published 5 books at LeanPub and SVBook, and teaches the content at Udemy and EMHAcademy. In his free time, Eric is also an adjunct faculty member at University of the People.

Eric Goh has led teams on various industrial projects, including the advanced product code classification system project, which automates Singapore Customs' trade facilitation process, and Nanyang Technological University's data science projects, where he developed his own DSTK data science software. He has years of experience in C#, Java, C/C++, SPSS Statistics and Modeler, SAS Enterprise Miner, R, Python, Excel, Excel VBA and more. He won the Tan Kah Kee Young Inventors' Merit Award and was a shortlisted entry for the TelR Data Mining Challenge.

He holds a Master of Technology degree from the National University of Singapore, an Executive MBA degree from U21Global (currently GlobalNxt) and IGNOU, a Graduate Diploma in Mechatronics from A*STAR SIMTech (a national research institute located in Nanyang Technological University), and a Coursera Specialization Certificate in Business Statistics and Analysis from Rice University. He earned a Bachelor of Science degree in Computing from the University of Portsmouth after National Service. He is also an AIIM Certified Business Process Management Master (BPMM), a GSTF Certified Big Data Science Analyst (CBDSA), and an IES Certified Lecturer.

Specialties: Data Science, Text Mining, Social Network Analysis, Natural Language Processing, Machine Learning, Software Engineering, Mechatronics, Business.

Article Copyright 2018 by Eric M. H. Goh