I am reading an MS Word document using C#. I want only the words (upper case and lower case), not spaces, commas, numbers, special characters, symbols, etc. Kindly help me with a good solution with code. Thanks in advance.
Comments
Nitesh Luharuka 13-Oct-12 4:56am
   
Can you show the sample doc file you are trying to read?

1 solution


Solution 1

Hi,
A reliable, professional-grade solution requires a lot of programming and is not a trivial task. One good example you can find online is my free Semantic Analyzer, which extracts words and sentences from arbitrary text (multilingual, by the way) and then applies a concordance calculator to compute the frequency of word occurrences.

In general, you must first get a string containing the plain text of interest (no formatting, etc.), then remove all special characters (such as ",", ":", ";") using either String.Replace() or a regular expression, replacing each with a space so that adjacent words do not run together, and then apply String.Split() with the " " separator. You will get an array of strings containing the words in the text. In a real-world solution you must do much more string processing, for example collapsing runs of blank spaces "     " into a single one " ". As mentioned above, a complete production-grade solution goes far beyond the boundary of a single article and is also subject/domain-specific. You should probably start with a simple prototype and then tune it to fit your particular case. For your immediate needs, you can use my free online Semantic Analyzer, which provides reasonable accuracy.
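
The steps described above could be sketched roughly as follows. This is only a minimal prototype, not a production-grade implementation: it assumes the plain text has already been read from the Word document into a string, and the names WordExtractor and ExtractWords are made up for this illustration. It uses a regular expression rather than String.Replace(), and it treats anything that is not a Unicode letter (digits, punctuation, symbols, whitespace) as a separator.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration only; not part of any library.
public static class WordExtractor
{
    // Returns the words (runs of letters) found in plainText.
    public static string[] ExtractWords(string plainText)
    {
        if (string.IsNullOrEmpty(plainText))
        {
            return new string[0];
        }

        // \p{L} matches any Unicode letter, so [^\p{L}]+ is a run of
        // everything else. Each such run becomes a single space, which
        // keeps "a,b" from fusing into "ab".
        string lettersOnly = Regex.Replace(plainText, @"[^\p{L}]+", " ");

        // RemoveEmptyEntries drops the empty strings that leading or
        // trailing spaces would otherwise produce.
        return lettersOnly.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
    }
}
```

For example, WordExtractor.ExtractWords("Hello, world! 123 C# rocks.") yields the four strings Hello, world, C and rocks. Note that this simple rule also splits contractions and hyphenated words ("don't" becomes "don" and "t"), so a real solution would need to decide how to treat such cases.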

Kind regards,
AB
   
Comments
RaviRanjanKr 13-Oct-12 23:59pm
   
My 5+
DrABELL 14-Oct-12 0:07am
   
Thanks!
Marco Bertschi 16-Oct-12 5:21am
   
Maybe the Microsoft.Office.Interop.Word assembly (the Word interop DLL that comes with the MS Office suite) provides some additional help, but I am not sure about it.

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)



