See more: C++ MFC VisualC++

I have to convert an MS Word file (*.doc/*.docx) to a text file (*.txt) using VC++ and MFC.

The problem is that when I convert the Word file to a text file, the resulting text file is not readable. It shows text like "´•IOÃ0…ïHü‡ÈW”¸p@5åÀr„JqvIj/ò¸Û¿gÜ%j«¶)P.‘ç½÷yœ™tfºN&àQY“³ë¬Ã0ÒÊT9û¼¤w,Á L!jk ".

The reverse conversion, from text to *.doc, works fine.

I checked that the font properties are the same. I even pasted the converted garbage text into a Word file, but it showed the same garbage.

Let me know if any further information is required.

Posted 7-Sep-11 20:18pm
Updated 7-Sep-11 21:59pm
OriginalGriff 8-Sep-11 4:24am
Which converter are you using?

Solution 1

This happens because of the way the two applications (MS Word and Notepad) store and handle their data.
A file with the extension '.txt' holds plain characters (typically ANSI) without any formatting, whereas a '.doc' or '.docx' file can hold Unicode characters and, more importantly, other data such as formatting and images.
So when the raw contents of a Word file are shown as if they were a '.txt' file, the parts that plain text cannot represent end up as junk characters.
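
To illustrate the point about the file contents: a quick way to confirm that a Word file is not plain text is to look at its first bytes. The sketch below assumes standard C++11 and checks two well-known signatures: the ZIP local file header "PK\x03\x04" (a .docx file is a ZIP package) and the OLE compound file header used by the older binary .doc format. The names WordFileKind, classifyHeader and classifyFile are made up for this example and are not part of MFC or any Word API.

```cpp
#include <cstddef>
#include <cstring>
#include <fstream>
#include <string>

// Hypothetical classification used only in this example.
enum class WordFileKind { ZipContainer, OleCompound, Unknown };

// Classify a buffer by its leading signature bytes.
WordFileKind classifyHeader(const unsigned char* data, std::size_t size)
{
    // ZIP local file header: "PK" 0x03 0x04 (used by .docx packages).
    static const unsigned char zipMagic[4] = {0x50, 0x4B, 0x03, 0x04};
    // OLE compound file header (used by the binary .doc format).
    static const unsigned char oleMagic[8] =
        {0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1};

    if (size >= sizeof oleMagic && std::memcmp(data, oleMagic, sizeof oleMagic) == 0)
        return WordFileKind::OleCompound;
    if (size >= sizeof zipMagic && std::memcmp(data, zipMagic, sizeof zipMagic) == 0)
        return WordFileKind::ZipContainer;
    return WordFileKind::Unknown;
}

// Read up to 8 bytes from the file and classify them.
// A file that cannot be opened yields Unknown.
WordFileKind classifyFile(const std::string& path)
{
    std::ifstream in(path, std::ios::binary);
    unsigned char header[8] = {};
    in.read(reinterpret_cast<char*>(header), sizeof header);
    return classifyHeader(header, static_cast<std::size_t>(in.gcount()));
}
```

If classifyFile returns anything other than Unknown for the input, copying its bytes into a .txt file will produce junk like the text in the question. Note that a ZIP signature only says the file is a ZIP package, not that it is definitely a Word document.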

Solution 2

MS Word adds all kinds of information to the document (e.g. document properties such as date and author, MS copyright stuff, etc.), but it shouldn't be too hard to filter out that part.

What is difficult, however, is the additional information MS Word puts right into the text, such as formatting information, anchors, special characters, image data or other embedded elements. In addition, the exact format of an MS Word document may change between versions, so you might need to figure out which version the document was saved with.

All that said, the effort isn't worth it: just use MS Word itself to store the document in another format, using "Save As"!
Simon Bang Terkildsen 8-Sep-11 10:43am
+5 Save as for the win
Philippe Mori 8-Sep-11 20:32pm
Or you can use your preferred technology to access such documents, like Automation, the Open XML SDK (not recommended for beginners) or third-party libraries. The newer format (*.docx) is easier to parse. It is essentially a zip file that contains several files, each with its own purpose. The old format is a proprietary binary format...
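
As a rough illustration of why the .docx format is easier to parse: inside the zip package, the main body is stored in word/document.xml, where paragraphs are w:p elements and the visible text sits in w:t elements. The sketch below assumes that word/document.xml has already been extracted from the package with a ZIP library of your choice (not shown) and loaded into a std::string. It is a simplified tag scanner, not a real XML parser: it ignores numeric character references, tables, headers, footnotes and an unescaped '>' inside attribute values. The names extractText, tagIs and decodeEntities are made up for this example.

```cpp
#include <cstddef>
#include <cstring>
#include <string>

// True if the tag starting at xml[pos] (a '<') has exactly the given name,
// i.e. the name is followed by '>', '/', or whitespace.
static bool tagIs(const std::string& xml, std::size_t pos, const std::string& name)
{
    if (xml.compare(pos + 1, name.size(), name) != 0) return false;
    const std::size_t end = pos + 1 + name.size();
    if (end >= xml.size()) return false;
    const char c = xml[end];
    return c == '>' || c == '/' || c == ' ' || c == '\t' || c == '\r' || c == '\n';
}

// Decode the five predefined XML entities; anything else is copied unchanged.
static std::string decodeEntities(const std::string& s)
{
    static const struct { const char* name; char ch; } table[] = {
        {"&amp;", '&'}, {"&lt;", '<'}, {"&gt;", '>'}, {"&quot;", '"'}, {"&apos;", '\''}};
    std::string out;
    std::size_t i = 0;
    while (i < s.size()) {
        bool matched = false;
        if (s[i] == '&') {
            for (const auto& e : table) {
                const std::size_t n = std::strlen(e.name);
                if (s.compare(i, n, e.name) == 0) {
                    out += e.ch;
                    i += n;
                    matched = true;
                    break;
                }
            }
        }
        if (!matched) out += s[i++];
    }
    return out;
}

// Collect the contents of <w:t> elements; end each paragraph with '\n'.
std::string extractText(const std::string& xml)
{
    std::string out;
    bool inText = false;
    std::size_t i = 0;
    while (i < xml.size()) {
        if (xml[i] == '<') {
            const std::size_t close = xml.find('>', i);
            if (close == std::string::npos) break;      // truncated input
            if (tagIs(xml, i, "w:t"))
                inText = xml[close - 1] != '/';         // <w:t/> has no text
            else if (tagIs(xml, i, "/w:t"))
                inText = false;
            else if (tagIs(xml, i, "/w:p"))
                out += '\n';                            // paragraph end
            else if (tagIs(xml, i, "w:tab"))
                out += '\t';
            else if (tagIs(xml, i, "w:br"))
                out += '\n';                            // explicit line break
            i = close + 1;
        } else {
            if (inText) out += xml[i];
            ++i;
        }
    }
    return decodeEntities(out);
}
```

The returned string uses the same encoding as the XML (normally UTF-8), so write it to the .txt file in binary mode or convert it first if the target must be ANSI. For anything beyond simple documents, the "Save As" or Automation route is likely the more dependable choice.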

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

Last Updated 8 Sep 2011
Copyright © CodeProject, 1999-2017