Hi,

I have to convert an MS Word (*.doc/*.docx) file to a text file (*.txt) using VC++ and MFC.

The problem is that when I convert a Word file to a text file, the resulting text file is not readable. It shows text like "´•IOÃ0…ïHü‡ÈW”¸p@5åÀr„JqvIj/ò¸Û¿gÜ%j«¶)P.‘ç½÷yœ™tfºN&àQY“³ë¬Ã0ÒÊT9û¼¤w,Á L!jk ".

The reverse conversion, from text to *.doc, works fine.

I checked that the font properties are the same. I even pasted the converted garbage text into a Word file, but it showed the same garbage.

Let me know if any further information is required.

Thanks
Updated 7-Sep-11 21:59pm
Comments
OriginalGriff 8-Sep-11 4:24am    
Which converter are you using?

This happens because of the way the two applications (MS Word and Notepad) store and handle their data.
A '.txt' file holds plain characters (typically ANSI) without any formatting, whereas a '.doc' / '.docx' file can hold Unicode characters and, more importantly, other data such as formatting and images, all stored in a container that is not plain text.
So when you display those contents as if they were a '.txt' file, you end up seeing these junk characters.
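
To make the point above concrete, here is a minimal sketch that checks what kind of container a file really is by looking at its first bytes. A '.docx' is a ZIP archive and starts with the bytes "PK\x03\x04"; a legacy '.doc' is an OLE compound file and starts with D0 CF 11 E0 A1 B1 1A E1. The names `WordContainer`, `ClassifyHeader` and `ClassifyFile` are made up for this illustration and are not part of MFC or of any Word API. Note that a ZIP signature only tells you the file is a ZIP archive, not that it is definitely a Word document.

```cpp
#include <cstddef>
#include <cstring>
#include <fstream>
#include <string>

// Hypothetical enum for illustration only.
enum class WordContainer { Zip, Ole, Unknown };

// Classify a buffer by its leading signature ("magic") bytes.
WordContainer ClassifyHeader(const unsigned char* data, std::size_t size)
{
    // ZIP local file header signature, used by .docx packages.
    static const unsigned char kZip[] = { 0x50, 0x4B, 0x03, 0x04 };
    // OLE compound file signature, used by legacy .doc files.
    static const unsigned char kOle[] = { 0xD0, 0xCF, 0x11, 0xE0,
                                          0xA1, 0xB1, 0x1A, 0xE1 };

    if (size >= sizeof(kZip) && std::memcmp(data, kZip, sizeof(kZip)) == 0)
        return WordContainer::Zip;
    if (size >= sizeof(kOle) && std::memcmp(data, kOle, sizeof(kOle)) == 0)
        return WordContainer::Ole;
    return WordContainer::Unknown;
}

// Read up to the first 8 bytes of a file and classify them.
// A file that cannot be opened yields WordContainer::Unknown.
WordContainer ClassifyFile(const std::string& path)
{
    std::ifstream in(path.c_str(), std::ios::binary);
    unsigned char header[8] = {};
    in.read(reinterpret_cast<char*>(header), sizeof(header));
    const std::size_t got = static_cast<std::size_t>(in.gcount());
    return ClassifyHeader(header, got);
}
```

If `ClassifyFile` reports `Zip` or `Ole`, copying the raw bytes into a '.txt' file can only produce junk, because the text is packed inside the container rather than stored as plain characters.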
 
 
MS Word adds all kinds of information to the document (e.g. document properties such as date and author, MS copyright stuff, etc.), but it shouldn't be too hard to filter out that part.

What is difficult, however, is the additional information MS Word puts right into the text, such as formatting information, anchors, special characters, image data or other embedded elements. In addition to that, the exact format of an MS Word document may change between versions, so you might need to figure out which version the document was saved with.

All that said, the effort isn't worth it: just use MS Word itself to save the document in another format, using "Save As"!
 
 
Comments
Simon Bang Terkildsen 8-Sep-11 10:43am    
+5 Save as for the win
Philippe Mori 8-Sep-11 20:32pm    
Or you can use your preferred technology to access such documents, like Automation, the Open XML SDK (not recommended for beginners) or third-party libraries. The newer format (*.docx) is easier to parse: it is essentially a zip file that contains several files, each with its own purpose. The old format is a proprietary binary format...
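
As a rough illustration of the "zip file that contains files" point: in a *.docx package the main body text lives in the part named word/document.xml, where paragraphs are `<w:p>` elements and the visible text sits inside `<w:t>` elements. The sketch below assumes you have already unzipped that part into a string with a ZIP library of your choice (that step is not shown). `ExtractDocxText` and `DecodeEntities` are hypothetical helper names, and this is a naive tag scanner, not a real XML parser: it ignores headers, footers, footnotes, numeric character references and many other details.

```cpp
#include <cstddef>
#include <cstring>
#include <string>

// Decode the five predefined XML entities; anything else is copied as-is.
static std::string DecodeEntities(const std::string& s)
{
    static const struct { const char* name; char ch; } kEntities[] = {
        { "&amp;", '&' }, { "&lt;", '<' }, { "&gt;", '>' },
        { "&quot;", '"' }, { "&apos;", '\'' }
    };
    std::string out;
    for (std::size_t i = 0; i < s.size(); ) {
        bool matched = false;
        if (s[i] == '&') {
            for (const auto& e : kEntities) {
                const std::size_t len = std::strlen(e.name);
                if (s.compare(i, len, e.name) == 0) {
                    out += e.ch;
                    i += len;
                    matched = true;
                    break;
                }
            }
        }
        if (!matched)
            out += s[i++];
    }
    return out;
}

// Collect the text of every <w:t> run; end each paragraph with '\n'
// and turn <w:tab/> into a tab character. Input is assumed to be the
// UTF-8 content of word/document.xml, and the output is UTF-8 as well.
std::string ExtractDocxText(const std::string& xml)
{
    std::string out;
    std::size_t pos = 0;
    while (pos < xml.size()) {
        const std::size_t lt = xml.find('<', pos);
        if (lt == std::string::npos) break;
        const std::size_t gt = xml.find('>', lt);
        if (gt == std::string::npos) break;
        const std::string tag = xml.substr(lt + 1, gt - lt - 1);
        pos = gt + 1;

        if (tag == "/w:p") {
            out += '\n';                       // end of paragraph
        } else if (tag == "w:tab/") {
            out += '\t';                       // explicit tab
        } else if ((tag == "w:t" || tag.compare(0, 4, "w:t ") == 0)
                   && tag[tag.size() - 1] != '/') {
            const std::size_t end = xml.find("</w:t>", pos);
            if (end == std::string::npos) break;
            out += DecodeEntities(xml.substr(pos, end - pos));
            pos = end + 6;                     // skip past "</w:t>"
        }
    }
    return out;
}
```

The returned string can be written to the *.txt file as it is (it stays UTF-8). For anything beyond a quick experiment, Automation with "Save As" or a proper library is still the safer route.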

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)