C#
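using System.IO;

string input = @"path";
string contents = File.ReadAllText(input);

// String.Replace matches substrings, so "in" is also removed from inside "marking".
foreach (string word in stopWord)
{
   contents = contents.Replace(word, "");
}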

I want to remove stop words at the string level. My code also removes characters from inside other words whenever a stop word matches part of a word. For example, if "in" appears in "marking", the word becomes "markg" because "in" is removed. How can I do this at the string level (whole words only) instead of at the character level?
Updated 18-Aug-15 22:56pm
Comments
CPallini 19-Aug-15 3:16am    
Please elaborate. Possibly an example could help.
Sergey Alexandrovich Kryukov 19-Aug-15 3:19am    
I think I understood it. Please see Solution 1.
"Byte level" is totally confusing. .NET characters are not bytes.
—SA

1 solution

The concept of a "stop word" is not properly defined here. A proper definition should take into account not just the word itself but also its context. In your case, this can be quite simple: a stop word should be not just a string but a rule. I would advise using Regex instead. Your "stop words" will then be not just strings but matching rules expressed as regular expression patterns.

Please see:
https://en.wikipedia.org/wiki/Regular_expression,
https://msdn.microsoft.com/en-us/library/system.text.regularexpressions(v=vs.110).aspx,
https://msdn.microsoft.com/en-us/library/system.text.regularexpressions.regex(v=vs.110).aspx.

See also: https://msdn.microsoft.com/en-us/library/system.text.regularexpressions.regex.replace%28v=vs.110%29.aspx.

I don't know your rules, so you had better learn regular expressions and formulate what your "stop words" are yourself.

Good luck.
—SA
 
Comments
Matt T Heffron 19-Aug-15 11:41am    
Just adding to your answer: ;-)

If OP's "stop words" are just words in the general English sense, then the pattern for each word is just the text of that word with the \b anchor at each end:
stop word: in
pattern: \bin\b
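
To make the \b idea concrete, here is a minimal sketch, not necessarily what the OP should ship. The class and method names (WholeWordDemo, RemoveWholeWord) are made up for illustration, and Regex.Escape is an extra precaution in case a stop word contains regex metacharacters:

```csharp
using System;
using System.Text.RegularExpressions;

static class WholeWordDemo
{
    // Removes every whole-word occurrence of `word` from `text`.
    // Regex.Escape guards against stop words that contain regex metacharacters.
    public static string RemoveWholeWord(string text, string word)
    {
        string pattern = @"\b" + Regex.Escape(word) + @"\b";
        return Regex.Replace(text, pattern, "");
    }
}

// "marking in progress".Replace("in", "")                    -> "markg  progress"
// WholeWordDemo.RemoveWholeWord("marking in progress", "in") -> "marking  progress"
```

Note that the removed word leaves its surrounding spaces behind, so the result has a double space where "in" used to be.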

Also, processing the "contents" string repeatedly is quite inefficient. With Regex, construct a single pattern from all of the stop word patterns joined with the alternation operator |, and then call the Regex.Replace method once.
stop words: in, on, at
pattern: \bin\b|\bon\b|\bat\b
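
A rough sketch of the single-pattern approach, under some assumptions: StopWordFilter, BuildPattern and RemoveStopWords are hypothetical names, the pattern is factored as \b(?:in|on|at)\b (equivalent to the alternation above), and case-insensitive matching plus space collapsing are my additions rather than anything the OP asked for:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

static class StopWordFilter
{
    // Builds one pattern such as \b(?:in|on|at)\b from the stop-word list.
    public static Regex BuildPattern(IEnumerable<string> stopWords)
    {
        string alternatives = string.Join("|", stopWords.Select(w => Regex.Escape(w)));
        return new Regex(@"\b(?:" + alternatives + @")\b", RegexOptions.IgnoreCase);
    }

    // Removes all stop words in a single pass over the text.
    public static string RemoveStopWords(string text, IEnumerable<string> stopWords)
    {
        string stripped = BuildPattern(stopWords).Replace(text, "");
        // Collapse runs of spaces left behind where words were removed
        // (only spaces, so line breaks in the file are preserved).
        return Regex.Replace(stripped, " {2,}", " ").Trim();
    }
}

// StopWordFilter.RemoveStopWords(
//     "The marking is in the book on the table at home",
//     new[] { "in", "on", "at" })
// -> "The marking is the book the table home"
```

One caveat: \b treats a hyphen as a word boundary, so "in" would also be removed from "in-house". If that matters, the rule needs to be stricter than plain \b.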
Sergey Alexandrovich Kryukov 19-Aug-15 13:03pm    
Yes, I understand, but it means that the stop word is not just a string but a "separate word", so the knowledge needed to define a stop word is more than just the string containing that word. Yes, a single pattern is a good point; this is what I meant, although it could make for a pretty complex pattern. Yes, \b should do the trick for words as they are understood in languages like English.

Thank you very much.
—SA

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


