See more: C++ MFC VC++
Hi,
 
I have to convert an MS Word (*.doc/*.docx) file to a text file (*.txt) using VC++ and MFC.
 
The problem is that when I try to convert the Word file to a text file, the resulting text file is not readable. It shows text like "´•IOÃ0…ïHü‡ÈW”¸p@5åÀr„JqvIj/ò¸Û¿gÜ%j«¶)P.‘ç½÷yœ™tfºN&àQY“³ë¬Ã0ÒÊT9û¼¤w,Á L!jk ".
 
The reverse conversion, from text to *.doc, works fine.
 
I checked that the font properties are the same. I even pasted the converted garbage text back into a Word file, but it showed the same garbage.
 
Please let me know if any further information is required.
 
Thanks
Posted 7-Sep-11 20:18pm
Edited 7-Sep-11 21:59pm
v2
Comments
OriginalGriff at 8-Sep-11 4:24am
   
Which converter are you using?

Solution 1

This happens because of the way the two applications (MS Word and Notepad) store and handle their data.
A file with the extension '.txt' holds plain characters (typically ANSI) without any formatting, whereas a '.doc' or '.docx' file is a container that can hold Unicode characters and, more importantly, other data such as formatting and images.
So when you copy or display the raw contents of such a file as if it were a '.txt' file, the parts that are not plain text show up as these junk characters.
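To see why the raw bytes cannot be read as text, it can help to check what kind of container the file really is. The sketch below is only an illustration: the names `ContainerKind`, `ClassifyHeader` and `ClassifyFile` are made up for this example. It relies on two well-known signatures: the old binary '.doc' format is an OLE compound file that starts with the bytes D0 CF 11 E0 A1 B1 1A E1, and a '.docx' file is a zip archive that starts with "PK" followed by 03 04.

```cpp
#include <cstddef>
#include <cstring>
#include <fstream>
#include <string>

// Hypothetical names for this sketch; not part of MFC or any Word API.
enum class ContainerKind { OleCompound, Zip, Unknown };

// Classifies a buffer by its leading signature ("magic") bytes.
ContainerKind ClassifyHeader(const unsigned char* data, std::size_t size)
{
    // Signature of an OLE compound file (used by the binary .doc format).
    static const unsigned char kOle[8] =
        { 0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1 };
    // Local file header signature of a zip archive (used by .docx).
    static const unsigned char kZip[4] = { 'P', 'K', 0x03, 0x04 };

    if (size >= sizeof(kOle) && std::memcmp(data, kOle, sizeof(kOle)) == 0)
        return ContainerKind::OleCompound;
    if (size >= sizeof(kZip) && std::memcmp(data, kZip, sizeof(kZip)) == 0)
        return ContainerKind::Zip;
    return ContainerKind::Unknown;
}

// Reads the first bytes of a file and classifies them.
// A file that cannot be opened or is too short yields Unknown.
ContainerKind ClassifyFile(const std::string& path)
{
    std::ifstream in(path, std::ios::binary);
    unsigned char header[8] = {};
    in.read(reinterpret_cast<char*>(header), sizeof(header));
    return ClassifyHeader(header, static_cast<std::size_t>(in.gcount()));
}
```

If `ClassifyFile` reports `OleCompound` or `Zip`, writing the bytes unchanged into a '.txt' file will give junk, because the text has to be extracted from the container first.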

Solution 2

MS Word adds all kinds of information to the document (e.g. document properties such as date and author, MS copyright stuff, etc.), but it shouldn't be too hard to filter out that part.
 
What is difficult, however, are the additional bits of information MS Word puts right into the text, such as formatting information, anchors, special characters, image data or other embedded elements. In addition to that, the exact format of an MS Word document may change between versions, so you might need to figure out which version the doc was stored with.
 
All that said, the effort isn't worth it: just use MS Word itself to store the document in another format, using "Save As"!
Comments
Simon Bang Terkildsen at 8-Sep-11 10:43am
   
+5 Save as for the win
Philippe Mori at 8-Sep-11 20:32pm
   
Or you can use your preferred technology to access such documents, like Automation, the Open XML SDK (not recommended for beginners) or third-party libraries. The newer format (*.docx) is easier to parse. Such a file is essentially a zip file that contains several files, each with its own purpose. The old format is a proprietary binary format...
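As a rough illustration of the *.docx route: once the main part `word/document.xml` has been read out of the zip (unzipping needs a zip library and is not shown here), the visible text sits in `<w:t>` elements and paragraphs end at `</w:p>`. The sketch below assumes the XML is already in a string. `ExtractDocxText` and `DecodeEntities` are hypothetical helper names, and this is a minimal scanner, not a real XML parser: it ignores tabs, tables, headers, footnotes, numeric character references and any attribute value containing '>'.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: decodes the five predefined XML entities.
static std::string DecodeEntities(const std::string& s)
{
    static const struct { const char* name; char ch; } kEntities[] = {
        { "&amp;", '&' }, { "&lt;", '<' }, { "&gt;", '>' },
        { "&quot;", '"' }, { "&apos;", '\'' }
    };
    std::string out;
    for (std::size_t i = 0; i < s.size(); ) {
        bool matched = false;
        if (s[i] == '&') {
            for (const auto& e : kEntities) {
                const std::size_t len = std::char_traits<char>::length(e.name);
                if (s.compare(i, len, e.name) == 0) {
                    out += e.ch;
                    i += len;
                    matched = true;
                    break;
                }
            }
        }
        if (!matched)
            out += s[i++];
    }
    return out;
}

// Hypothetical helper: collects the text of all <w:t> runs and puts a
// newline at the end of each paragraph and at each <w:br/> line break.
std::string ExtractDocxText(const std::string& xml)
{
    std::string text;
    std::size_t pos = 0;
    while ((pos = xml.find('<', pos)) != std::string::npos) {
        const std::size_t close = xml.find('>', pos);
        if (close == std::string::npos)
            break;
        // Tag content without the angle brackets, e.g. "w:t xml:space=..."
        const std::string tag = xml.substr(pos + 1, close - pos - 1);
        if (tag == "/w:p") {
            text += '\n';                       // end of a paragraph
        } else {
            const bool selfClosing = !tag.empty() && tag.back() == '/';
            const std::string name = tag.substr(0, tag.find_first_of(" /"));
            if (name == "w:t" && !selfClosing) {
                const std::size_t end = xml.find("</w:t>", close);
                if (end == std::string::npos)
                    break;                      // malformed input: stop
                text += DecodeEntities(xml.substr(close + 1, end - close - 1));
                pos = end + 6;                  // skip past "</w:t>"
                continue;
            }
            if (name == "w:br" || (name == "w:p" && selfClosing))
                text += '\n';                   // line break or empty paragraph
        }
        pos = close + 1;
    }
    return text;
}
```

The returned string is UTF-8 (assuming the XML was UTF-8, which is the usual case) and can be written to a *.txt file as it is. For anything beyond simple documents, "Save As" or Automation will give more faithful results.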

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

Last Updated 8 Sep 2011
Copyright © CodeProject, 1999-2014