Hi,

How can I convert PDF data to a Word document using OCR?

1. The font size and font family should not change.
2. The styles should not change (bold, italic, underline, strikethrough, etc.).
3. If the PDF has an image, the image should appear in the Word document.

I need .NET samples for reference.

What I have tried:

Hi,

How can I convert PDF data to a Word document using OCR?

1. The font size and font family should not change.
2. The styles should not change (bold, italic, underline, strikethrough, etc.).
3. If the PDF has an image, the image should appear in the Word document.

I need .NET samples for reference.
Updated 29-Mar-18 11:14am
Comments
MadMyche 26-Mar-18 9:27am    
Searching Google for "Parse PDF C#" or "Automate PDF Export to Word" turns up plenty of usable results. Nothing is perfect; it is up to you to figure out what your needs are and go through the thousands of results to find what works best.

http://www.adobe.com/devnet/acrobat.html
https://www.codeproject.com/Articles/12445/Converting-PDF-to-Text-in-C

Here is a question for you: what are you going to do if the font used is not on the local machine?
Richard Deeming 27-Mar-18 9:07am    
The "What I have tried" box is where you show us what you have tried. It is NOT there for you to post a second copy of the "Describe the problem" text!

Quote:
I need .NET samples for reference.
Then you should use Google to search for them.
 
Comments
Member 11183856 26-Mar-18 8:52am    
I tried, but I didn't find anything I could use as a reference solution.
Richard MacCutchan 26-Mar-18 8:54am    
Then maybe you will have to write it yourself.
Maciej Los 26-Mar-18 13:55pm    
5ed!
 
The .NET Framework classes on their own cannot OCR PDF files and convert them to Word, so you will need an SDK or library for that. One option is described in this CodeProject article we posted a while back.
The LEADTOOLS SDK has been significantly improved since that article was published, but the .NET code is still simple to write and understand. It looks like this:

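// Needs the Leadtools and Leadtools.Codecs namespaces, plus the OCR and
// document-writer namespaces that match your LEADTOOLS version.

// Create and start the OCR engine.
IOcrEngine ocrEngine = OcrEngineManager.CreateEngine(OcrEngineType.LEAD, false);
RasterCodecs rasterCodecs = new RasterCodecs();
rasterCodecs.Options.Load.AllPages = true;
ocrEngine.Startup(rasterCodecs, null, null, null);

string fileName = @"inputFile.pdf";
IOcrDocument ocrDocument = ocrEngine.DocumentManager.CreateDocument();

// Load each PDF page as an image (page numbers are 1-based) and add it to the OCR document.
CodecsImageInfo fileInfo = rasterCodecs.GetInformation(fileName, true);
for (int pageNumber = 1; pageNumber <= fileInfo.TotalPages; pageNumber++)
{
    ocrDocument.Pages.AddPage(rasterCodecs.Load(fileName, 0, CodecsLoadByteOrder.Bgr, pageNumber, pageNumber), null);
}

// Recognize all pages, then save the result as DOCX (here: inputFile.pdf.docx).
ocrDocument.Pages.Recognize(null);
ocrDocument.Save(fileName + ".docx", DocumentFormat.Docx, null);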
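
Note that the Save call above appends ".docx" to the full input name, so the output is named inputFile.pdf.docx. If you would rather get inputFile.docx, here is a minimal sketch using only System.IO. The class and method names (OutputNaming, GetDocxPath) are hypothetical and not part of any SDK:

```csharp
using System.IO;

public static class OutputNaming
{
    // Hypothetical helper: replaces the input file's extension with .docx
    // instead of appending to it.
    public static string GetDocxPath(string inputPath)
    {
        return Path.ChangeExtension(inputPath, ".docx");
    }
}
```

You could then pass OutputNaming.GetDocxPath(fileName) as the first argument to Save in place of fileName + ".docx".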

If you would like to try it, you can download the free evaluation of the main LEADTOOLS setup from this page.
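
On the question in the comments about fonts that are not on the local machine: a Word document normally refers to a font by name, so if that font is missing, Word will display the text in a substitute. Here is a rough sketch of checking for a font before you rely on it. It assumes .NET Framework on Windows with a reference to System.Drawing, and FontCheck, IsInstalled and FirstInstalled are hypothetical names, not part of LEADTOOLS or any other SDK:

```csharp
using System;
using System.Drawing.Text;
using System.Linq;

public static class FontCheck
{
    // Returns true if a font family with this name is installed on the machine.
    public static bool IsInstalled(string familyName)
    {
        using (var installed = new InstalledFontCollection())
        {
            return installed.Families.Any(f =>
                string.Equals(f.Name, familyName, StringComparison.OrdinalIgnoreCase));
        }
    }

    // Returns the first candidate that is installed, or null if none is.
    public static string FirstInstalled(params string[] candidates)
    {
        return candidates.FirstOrDefault(IsInstalled);
    }
}
```

For example, FontCheck.FirstInstalled("Garamond", "Times New Roman") gives you a fallback name to use when the first choice is missing. How you apply that name to the converted document depends on the library you choose.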
 

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


