How to convert PDF data to a Word document using OCR

Question

Hi,

How can I convert PDF data to a Word document using OCR?

1. The font size and font family should not change.
2. The styles should not change (bold, italic, underline, strikethrough, etc.).
3. If the PDF contains an image, the image should appear in the Word document.

I need .NET samples for reference.

What I have tried:

Posted 26-Mar-18 2:48am

Member 11183856

Updated 29-Mar-18 11:14am

Comments

MadMyche 26-Mar-18 9:27am

Searching Google for "Parse PDF C#" or "Automate PDF Export to Word" returns plenty of usable results. Nothing is perfect; it is up to you to decide what your needs are and to go through the thousands of results to find what works best.

http://www.adobe.com/devnet/acrobat.html
https://www.codeproject.com/Articles/12445/Converting-PDF-to-Text-in-C

Here is a question for you: what are you going to do if the font used is not installed on the local machine?

Richard Deeming 27-Mar-18 9:07am

The "What I have tried" box is where you show us what you have tried. It is NOT there for you to post a second copy of the "Describe the problem" text!

3 solutions

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

Richard MacCutchan · Answer 1 · 2018-03-26T02:52:00

Quote:
i need .net samples for reference.

Then you should use Google to search for them.

Comments

Member 11183856 26-Mar-18 8:52am

I tried, but I didn't find anything I could use as a reference solution.

Richard MacCutchan 26-Mar-18 8:54am

Then maybe you will have to write it yourself.

Maciej Los 26-Mar-18 13:55pm

5ed!

Maciej Los · Answer 2 · 2018-03-26T07:56:00

Check this: "How to convert PDF to Word document in C# and VB.NET in C# for Visual Studio 2010", and refer to this: "Convert Pdf To Word".

LEADTOOLS Support · Answer 3 · 2018-03-29T11:16:00

The .NET Framework classes cannot, on their own, OCR PDF files or convert them to Word, so you will need an SDK or library to do that. One option is described in this CodeProject article we posted a while back.
The LEADTOOLS SDK has been significantly improved since that article was published, but the .NET code is still simple to write and understand, and looks like this:

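// Requires the LEADTOOLS SDK (types from Leadtools, Leadtools.Codecs and Leadtools.Ocr) plus System.IO.
string fileName = @"inputFile.pdf";
// Create and start the OCR engine; RasterCodecs loads the PDF pages as raster images.
IOcrEngine ocrEngine = OcrEngineManager.CreateEngine(OcrEngineType.LEAD, false);
RasterCodecs rasterCodecs = new RasterCodecs();
rasterCodecs.Options.Load.AllPages = true;
ocrEngine.Startup(rasterCodecs, null, null, null);
IOcrDocument ocrDocument = ocrEngine.DocumentManager.CreateDocument();
CodecsImageInfo fileInfo = rasterCodecs.GetInformation(fileName, true);
// Page numbers are 1-based; load each page and add it to the OCR document.
for (int pageNumber = 1; pageNumber <= fileInfo.TotalPages; pageNumber++)
{
    ocrDocument.Pages.AddPage(rasterCodecs.Load(fileName, 0, CodecsLoadByteOrder.Bgr, pageNumber, pageNumber), null);
}
// Recognize the text, then save as "inputFile.docx" (not "inputFile.pdf.docx").
ocrDocument.Pages.Recognize(null);
ocrDocument.Save(Path.ChangeExtension(fileName, ".docx"), DocumentFormat.Docx, null);
// Release the document, shut down the engine, and free the codecs.
ocrDocument.Dispose();
ocrEngine.Shutdown();
ocrEngine.Dispose();
rasterCodecs.Dispose();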

If you would like to try it, you can download the free evaluation of the main LEADTOOLS setup from this page.