I need to remove certain words from a text, together with the separators that follow them. The problem is that my program removes only one separator after each word, but there can be many. How can I remove the remaining separators? I also need to make sure that the word is not attached to other letters: for example, fHouse or Housef should not be removed.

What I have tried:

C#
public static void Process(string fin, string fout)
{
    using (var foutv = File.CreateText(fout)) // fout - OutPut.txt
    {
        using (StreamReader reader = new StreamReader(fin)) // fin - InPut.txt
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                string[] WordsToRemove = { "Home", "House", "Room" };
                char[] seperators = {';', ' ', '.', ',', '!', '?', ':'};
                foreach (string word in WordsToRemove)
                {
                    foreach (char seperator in seperators)
                    {
                        line = line.Replace(word + seperator, string.Empty);
                    }
                }
                foutv.WriteLine(line);
            }
        }
    }
}

Input:
;;;;;;;;;;,,,,,,,,,, fhgkHouse!House!Dog;;;!!Inside!C!Room!Home!House!Room;;;;;;;;;;!Table!London!Computer!Room;..;

Results I get:
;;;;;;;;;;,,,,,,,,,, fhgkDog;;;!!Inside!C!;;;;;;;;;!Table!London!Computer!..;

The results should be:
fhgkHouse!Dog;;;!!Inside!C!Table!London!Computer!
Updated 21-Nov-22 15:42pm
Comments
PIEBALDconsult 21-Nov-22 21:43pm    
I'd use a Regular Expression.
Graeme_Grant 22-Nov-22 0:57am    
Added a regex version just for you! ❤️

1 solution

Here is a working solution:
1. Trim unwanted leading characters.
2. Remove unwanted words.
3. Remove the unwanted characters that trail the removed words.
C#
using System;
using System.IO;
using System.Linq;
using System.Text;

string file = "data.txt";

string[] wordsToRemove = { "Home", "House", "Room" };
char[] seperators = {';', ' ', '.', ',', '!', '?', ':'}; 

string rawText = File.ReadAllText(file);

bool isCapturing = false;
bool isTrimming = false;
int start = -1;

StringBuilder sb = new();

for (int i = 0; i < rawText.Length; i++)
{
    // A letter or digit outside a word marks the start of a new word.
    if (start == -1 && char.IsLetterOrDigit(rawText[i]))
    {
        isCapturing = true; // first word reached, leading junk is over
        isTrimming = false;
        start = i;
    }

    // Between words: copy separators unless they trail a removed word.
    if (start == -1 && isCapturing)
    {
        // A '!' ends the run being trimmed and is itself dropped.
        if (isTrimming && rawText[i].Equals('!'))
        {
            isTrimming = false;
            continue;
        }

        if (!isTrimming)
            sb.Append(rawText[i]);
    }

    // Inside a word: a separator ends it.
    if (start > -1 && seperators.Contains(rawText[i]))
    {
        string word = rawText.Substring(start, i - start);

        if (!wordsToRemove.Any(x => x
                .Equals(word, StringComparison.InvariantCultureIgnoreCase)))
            sb.Append(rawText.Substring(start, i - start + 1)); // keep word + separator
        else
            isTrimming = true; // trim unwanted characters

        start = -1;
    }
}

Console.WriteLine(sb);

Output:
fhgkHouse!Dog;;;!!Inside!C!Table!London!Computer!
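
The three steps above can also be sketched as a single pass over alternating runs of separator and non-separator characters. This is only a sketch, not the code from the answer: the class and method names (SeparatorRunScanner, RemoveWords) are hypothetical, a "word" is assumed to be any maximal run of non-separator characters, the comparison is assumed to be case-insensitive like the answer's, and the text is assumed to be a single line.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class SeparatorRunScanner
{
    // Hypothetical helper: drops leading separators, drops every listed word,
    // and drops the whole run of separators that follows a dropped word.
    public static string RemoveWords(string text, IEnumerable<string> wordsToRemove, char[] separators)
    {
        // Assumption: case-insensitive matching, as in the answer above.
        var remove = new HashSet<string>(wordsToRemove, StringComparer.OrdinalIgnoreCase);
        var sb = new StringBuilder();
        bool skipSeparators = true; // true at the start so leading separators are dropped
        int i = 0;

        while (i < text.Length)
        {
            bool isSeparator = Array.IndexOf(separators, text[i]) >= 0;

            // Find the end of the current run (all separators or all non-separators).
            int j = i;
            while (j < text.Length && (Array.IndexOf(separators, text[j]) >= 0) == isSeparator)
                j++;

            if (isSeparator)
            {
                // Keep the run unless it follows a removed word (or starts the text).
                if (!skipSeparators)
                    sb.Append(text, i, j - i);
            }
            else
            {
                // Only an exact whole-run match is removed, so "fHouse" and "Housef" stay.
                string word = text.Substring(i, j - i);
                skipSeparators = remove.Contains(word);
                if (!skipSeparators)
                    sb.Append(word);
            }

            i = j;
        }

        return sb.ToString();
    }
}
```

Called with the question's input line, the same three words and the same seven separators, this should return fhgkHouse!Dog;;;!!Inside!C!Table!London!Computer! as well. Because the final run is handled like any other, a last word that is not followed by a separator is still kept or removed correctly.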


UPDATE

@PIEBALDconsult, here is a regex version just for you...
C#
using System.Text.RegularExpressions;

string file = "data.txt";

string[] wordsToRemove = { "Home", "House", "Room" };

string rawText = File.ReadAllText(file);

string pattern = $"^.*?(?=[a-z])|(?<![a-z])((?=(?:{string.Join("|", wordsToRemove)}))(.*?)(?:\\!|\\z))";

string result = Regex.Replace(rawText, pattern, "", RegexOptions.IgnoreCase);

Console.WriteLine(result);

Output:
fhgkHouse!Dog;;;!!Inside!C!Table!London!Computer!

For an explanation of how it works, paste the regular expression and the test string into regex101 (build, test, and debug regex).
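
If the word must not be attached to other letters on either side (the fHouse / Housef case from the question), one possible variation is to guard the word with a lookbehind and a lookahead, and to consume every separator that follows it. This is a sketch under assumptions, not the pattern from the answer: the names WholeWordRegex and RemoveWords are hypothetical, "letter" is taken to mean \p{L}, matching is case-insensitive like the pattern above, and the text is assumed to be a single line.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class WholeWordRegex
{
    // Hypothetical helper: removes each listed word that is not adjacent to a
    // letter, plus all separators after it, then strips leading separators.
    public static string RemoveWords(string text, string[] wordsToRemove, char[] separators)
    {
        // Escape everything so that characters such as '.' and '?' are literal.
        string words = string.Join("|", wordsToRemove.Select(w => Regex.Escape(w)));
        string sep = "(?:" + string.Join("|",
            separators.Select(c => Regex.Escape(c.ToString()))) + ")";

        // (?<!\p{L}) : no letter directly before the word
        // (?!\p{L})  : no letter directly after the word
        // {sep}*     : every separator that follows, not just one
        string pattern = $@"(?<!\p{{L}})(?:{words})(?!\p{{L}}){sep}*";

        string result = Regex.Replace(text, pattern, "",
            RegexOptions.IgnoreCase | RegexOptions.CultureInvariant);

        // Without RegexOptions.Multiline, ^ matches only at the start of the text.
        return Regex.Replace(result, $"^{sep}+", "");
    }
}
```

For the question's input line, words and separators this should also give fhgkHouse!Dog;;;!!Inside!C!Table!London!Computer!, while a line such as "fHouse Housef" is left unchanged.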

Enjoy!
 
Comments
Graeme_Grant 21-Nov-22 21:46pm    
@PIEBALDconsult (reply to a deleted message) I would too but am too lazy to write it ... another way is to use Span<T>, but instead I went with a quick'n'dirty answer. The logic is there if anyone wants to do it... :P
MrDomke 22-Nov-22 2:50am    
Heyy, I just tried your suggestion but this problem pops up https://prnt.sc/bozcRF3XvoPF
Graeme_Grant 22-Nov-22 3:17am    
Here is the working project: TrimText.zip - Google Drive. Also, don't forget the regex101 website (set to C#).
Graeme_Grant 22-Nov-22 17:03pm    
How did you go with the downloadable sample project that was posted for you? Did you try the answer out on the regex101 website? Are you still having issues?

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


