I have 26 text files and I want to remove certain [special groups of words] from all of them. For now, I have one specific group of text to remove.
I'm open to solutions that don't use regex, but I would prefer a solution in that direction (if possible).
---------------------------------------
sample:
< I>(î áëþäàõ â ðåñòîðàíå)< /I>
< I>(÷åã,î-ë. — of)< /I>
< I>n< /I>
< I>áèáë.< /I>
---------------------------------------

I am thinking of using regular expressions for this, but I need a regex pattern that finds < I>, then any text inside, and stops after finding < /I>.
I know I can use @"< I>\w*", but I can't work out how to go any further...

// Note: in the real data there is no space between < and I>;
// I added one here because the tag interferes with this HTML page.
if (line[1].Contains("< I>"))
{
    string[] segment = Regex.Split(line[1], "< I>");
}

(PS: my English is not as good as a native speaker's, and my level in C# is not very advanced either. Thank you for understanding.)

---continued:
I found a nice regex snippet that looks promising:
"[^"]*"  [solution to match any string within double quotes]

Right now I am delving into regex, and it will take some time until I am familiar with it. Until then this question will unfortunately remain open; I will close it in the end. If you find something useful in the meantime, I will look it over. Thanks.
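
The double-quote snippet works by matching "anything except the closing delimiter". A minimal sketch of the same idea adapted to the <I> tags follows. It assumes the tagged text never contains a '<' character and that the real files have no space inside the tags; the class and method names are made up for illustration.

```csharp
using System.Text.RegularExpressions;

public static class NegatedClassSketch
{
    // Removes every <I>...</I> span, tags included.
    // Assumption: the text between the tags never contains '<'.
    public static string RemoveItalicSpans(string input)
    {
        // [^<]* plays the role that [^"]* plays in the double-quote snippet.
        return Regex.Replace(input, @"<I>[^<]*</I>", string.Empty);
    }
}
```

Calling NegatedClassSketch.RemoveItalicSpans on a line of text returns that line with the tagged spans cut out; surrounding spaces are left as they were.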
Updated 30-Jan-12 1:36am
Comments
   
Something I always wanted to know but was afraid to ask:
Where does this Russian text in the archaic Cyrillic Windows-1251 encoding come from? Yes, I know it was the Windows encoding before Unicode and NT, a proprietary one. Where does it come from these days? :-)

Thank you,
--SA
Amir Mahfoozi 29-Jan-12 3:04am
   
Are they ASP.NET files ?
_Q12_ 29-Jan-12 6:36am
   
SAKryukov, I want to make a personal translator (en-ru/ru-en) so that I can learn Russian a little faster; that is why these characters appear in my samples. I have been struggling to build this mini dictionary for some months now. The problem is not the code itself or the tools I use; the BIG problem (and the time-consuming one) is narrowing it down as closely as possible to MY needs. At the same time I write the words down with a pencil to learn them, besides programming. The final result must remain in my head after all, but I dream and imagine that the software can help a bit (I'm not 100% sure that's true). I have made 5 projects for this [big project] alone, in different variants, and I'm learning from mistakes, because mistakes are all I have made so far (and it's very frustrating, believe me). Unfortunately I did not learn all the basics of programming; I just cope with the gaps and press on. In the end something will crack and I will get what I need from it (I hope).
The response was a bit long because it's a bit complicated to sum up in a few words.
Sorry about the boring explanation.
The words and grammar (as you have found out by now) I learn by myself and with help from specific (linguistic) sites, again for my own pleasure and curiosity. I use this forum solely for programming purposes, not linguistic ones. I talk a lot, don't I? :-)
BTW, the Cyrillic words you see there, I personally don't see them at all... that's why I posted them in this form, thinking that nobody would notice their origin... but now I'm discouraged by how little I know.
Returning to my original problem: how do I build that regex? I sincerely don't know how to make it (I'm at a medium level in programming), and with my hand on my heart, I am after nothing other than knowledge (believe me).
Amir Mahfoozi 29-Jan-12 6:42am
   
Dear _q12_, are you saying this to me or to SA? If to SA, then you have mistakenly replied to my comment. BTW, I think your problem can be solved by using XML loaders :)
_Q12_ 29-Jan-12 6:51am
   
Amir, sorry for not responding to you sooner; the files are *.txt. They cannot be used with "xml loaders". They contain a lot of text that has by now been modified so much that it is no longer in its default format for easy manipulation.
The text is heavily scrambled.
I need to clean the remaining "garbage" out of it and leave it in an accessible format for future use.
So I need basic string manipulation for it.
   
No problem, Amir, I happened to look at it.
--SA
Amir Mahfoozi 30-Jan-12 1:22am
   
OK SA :)
   
Your text was unreadable to me, but I used to recognize this encoding in its Latin rendering, as it was very common in the past. It caught my eye, so I re-coded it into Unicode just to confirm it was Russian.

Here are my notes: at the present level of technology, you can get more or less adequate translation from/to some languages, but certainly not for Russian. The best results I ever saw were from Google Translate, but even Google Translate cannot adequately translate from Russian, never mind to Russian; the results look really ridiculous. Generally, automatic translators or even spellcheckers (forget grammar checkers, I don't think they exist) are a perpetual source of modern Russian jokes. This language is too complex for that. Sometimes explaining why a simple phrase should be worded one way and not another takes a whole article of a couple of pages or more.

Google Translate can be used just as a dictionary, though. You can also find an on-line dictionary; one good-quality example is ABBYY, see http://lingvopro.abbyyonline.com/en.

I used to buy and use their installable desktop application. It's fast, but generally these guys don't know what programming is (I knew that team many years ago; they have not improved much).

I have my own dictionary (sorry, I'm not ready to share it at this point) based on XDXF format (http://xdxf.sourceforge.net) which is better, but the available dictionaries are not that good.

--SA
_Q12_ 30-Jan-12 8:42am
   
As you put it, "This language is too complex for that"; I think so too... but it is complex from the grammar point of view. Words mean something else to me. I learned English in school, but what I really learned best from that language was the words (strings). I never liked grammar (even in my native language); it came after I was able to play with words, make phrases in my head while walking down the street, etc. Only once or twice did I seriously open a grammar book to learn something from it. That is my experience with English. From that experience, logically, I adapt to Russian, and I want to learn it the way I learned English. Meaning, I first learn a bunch of words, then, once I can play with them a little, I will pay attention to grammar as I go. I think it is a VERY hard language and I appreciate it for its hardness. I have a sort of attraction to it. (Don't ask me why, because I don't know myself.) Maybe I like the best things... and the best ones are also the hardest to obtain. Or maybe it is a way to escape from routine.
I am using sites like these to learn it:
http://www.lexilogos.com/clavier/russkij.htm -very flexible keyboard
http://translate.google.co.uk/?sl=en&tl=ro#ru|en| -for most used words(but some conjugations are not helping)
http://www.alphadictionary.com/rusgrammar/possess.html -for grammar
http://forum.wordreference.com/showthread.php?t=1069603 -with a vast references links and that is the best one I could find. (and believe me I searched a lot)
-----------------
That was the learning Class.
I am now looking for methods (as simple as possible) to learn it as fast as I can. I am aware that it will take me years, but I want to put it to the test. To speed things up I need some kind of software that can satisfy some elementary logical needs (I want all of them, but I can't have that): a kind of database with words, both en and ru (which I am working on right NOW), then the ability to navigate the Russian internet (I actually have this functionality in my program), and some goodies in my software (which I will implement as I go).
That's all. (man... that was long)
PS: You did not actually ask a question... but from the context I presume that you were asking, so this is the whole explanation.
Right now I am delving a bit more into regex to learn more about this alien language (it is a mini language in itself, from what I have read). So I will come back with updates as soon as I finish reading and testing.
Thanks for reading.
_Q12_ 30-Jan-12 9:05am
   
------------------- away from principal subject-------------------
SAKryukov, I want to ask you something very important for the development of my project.
Do you know a way to retrieve programmatically (in C#), for example from this page:
http://translate.google.co.uk/?sl=en&tl=ro#en|ru|free
the word [бесплатно]?
I have tried to retrieve it (into a label in C#), but they have some kind of counter there that counts words and the spaces between them. I was able to retrieve words from that page, but with a delay of a few words... if you understand what I mean.
[be= ],[bee= ],[bees=быть],[beast=пчела] and so on.
I must enter the same word twice to get the proper translation.
Any ideas?

Solution 1

I hope this gives you the general idea for doing the job:
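using System;
using System.Text;
using System.Text.RegularExpressions;

// matches any opening or closing tag, with or without attributes
string pattern = @"</?\w+((\s+\w+(\s*=\s*(?:"".*?""|'.*?'|[^'"">\s]+))?)+\s*|\s*)/?>";
Regex regex = new Regex(pattern, RegexOptions.Multiline);

StringBuilder sb = new StringBuilder();
sb.Append(@"<I>abc</I>");
sb.Append(@"<I>def</I>");
sb.Append(@"<I>gfi</I>");
sb.Append(@"<I>jkl</I>");
var input = sb.ToString();

var matches = regex.Matches(input);

// matches come in pairs: matches[i] is an opening tag, matches[i + 1] its closing tag
for (int i = 0; i < matches.Count - 1; i += 2)
{
    int start = matches[i].Index + matches[i].Length;  // first character after the opening tag
    int length = matches[i + 1].Index - start;         // up to the start of the closing tag
    Console.WriteLine(input.Substring(start, length));
}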


Hope it helps.
   
Comments
_Q12_ 29-Jan-12 7:37am
   
Hmm... it looks very complicated. I am very new to regex and have only learned to use \w, \*, \s... I was imagining a simpler solution than this, but I see now that it's a very complicated business.
Do you know another (much simpler) way than this?
In the end I will try it and see what I can come up with, but it's brain-squeezing. I will accept your answer after all, but honestly, right now I don't understand squat of what you wrote there... maybe 10%.
Amir Mahfoozi 29-Jan-12 7:45am
   
I didn't invent it ;) I just copied and pasted it from Stack Overflow. But the rest of the code is mine :)
This person has described it to some extent: http://haacked.com/archive/2004/10/25/usingregularexpressionstomatchhtml.aspx
BTW, I feel that HTML Agility Pack will solve your problem: http://htmlagilitypack.codeplex.com/
Give it a try when you have time.
_Q12_ 29-Jan-12 7:47am
   
In pseudocode I was thinking something like this:
search for < I>; when found, store its index.
search for < /I>; when found, store its index too.
everything between those 2 indexes becomes "".
finally, remove the text between those indexes.
Right now I don't care whether it is done with or without regex; I want it simple, not complicated.
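
The steps in this pseudocode could be sketched with plain IndexOf calls, no regex needed. This is only a sketch under stated assumptions: the class and method names are hypothetical, tags are assumed not to be nested, and an opening tag without a closing tag is left in the text untouched.

```csharp
using System;
using System.Text;

public static class IndexBasedRemoval
{
    // Removes every span that starts with openTag and ends with closeTag,
    // tags included, using only IndexOf.
    public static string RemoveSpans(string input, string openTag, string closeTag)
    {
        var result = new StringBuilder();
        int position = 0; // start of the text not yet copied

        while (true)
        {
            int start = input.IndexOf(openTag, position, StringComparison.Ordinal);
            if (start < 0) break;

            int end = input.IndexOf(closeTag, start + openTag.Length, StringComparison.Ordinal);
            if (end < 0) break; // unmatched opening tag: keep the rest as it is

            // Copy the text before the opening tag, then jump past the closing tag.
            result.Append(input, position, start - position);
            position = end + closeTag.Length;
        }

        // Copy whatever follows the last removed span.
        result.Append(input, position, input.Length - position);
        return result.ToString();
    }
}
```

For the real files the call would be something like IndexBasedRemoval.RemoveSpans(line, "<I>", "</I>"), with the tags written without the extra space used on this page.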
Amir Mahfoozi 29-Jan-12 7:54am
   
You can do it with both regex and HTML Agility Pack. The code above will do what you have described here. If you suspect there may be some syntactically incorrect occurrences, it will need some modification; if not, it solves your problem, I think.
   
_q12_,

Sorry, you really need to review your attitude. We already had an unpleasant discussion when I answered your other question, but believe me, I do it only to help you.

You keep saying "too complex", "very complicated". Whenever you need to get anything good, anything at all, it is never easy. There are many easy things, but they usually have very little value.

--SA

Solution 2

Hello _q12_,

Your question is a bit ambiguous. Assuming that you have a predefined list of redundant entries, Regex does not help much. Nonetheless, the following might help:


using System.Collections.Generic;
using System.IO;
using System.Text.RegularExpressions;

static void Main(string[] args)
{
    List<string> redundant = new List<string>()
    {
        "abc",
        "xyz",
        "...",
    };
    string file = "datafileX.txt";

    string data = File.ReadAllText(file);
    data = ReplaceRedundantContent(data, redundant);
    File.WriteAllText(file, data);

}

private static string ReplaceRedundantContent(string data, List<string> redundant)
{
    string result = data;
    foreach (string remove in redundant)
    {
        // all characters to be taken literally
        string pattern = Regex.Escape("<I>"+remove+"</I>");
        result = Regex.Replace(result, pattern, "");
    }
    return result;
}
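
Since the question mentions 26 files, the single-file example above could be extended to a whole folder. The following is a rough sketch, not part of the original answer: the class name, the method name and the idea of passing the cleaning step in as a delegate are all assumptions made for illustration. The encoding is passed explicitly because the sample text appears to be Windows-1251 rather than Unicode.

```csharp
using System;
using System.IO;
using System.Text;

public static class FolderCleaner
{
    // Applies 'clean' to every *.txt file in 'folder' and rewrites only
    // the files whose content actually changed.
    // Returns the number of files rewritten.
    public static int CleanFolder(string folder, Encoding encoding, Func<string, string> clean)
    {
        int changed = 0;
        foreach (string path in Directory.GetFiles(folder, "*.txt"))
        {
            // Read and write with the same explicit encoding so that
            // non-ASCII text is not damaged on the round trip.
            string original = File.ReadAllText(path, encoding);
            string cleaned = clean(original);
            if (cleaned != original)
            {
                File.WriteAllText(path, cleaned, encoding);
                changed++;
            }
        }
        return changed;
    }
}
```

A call might look like FolderCleaner.CleanFolder(folder, Encoding.GetEncoding(1251), text => ReplaceRedundantContent(text, redundant)), where folder is whatever directory holds the files. On newer .NET runtimes, code page 1251 may first have to be made available through the code pages encoding provider. It would be wise to try this on a copy of the files first, because the originals are overwritten.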


If you want to search for any text between the <I> and </I>, you may use the following pattern:
"<I>.*?</I>"

This matches all text by taking as little as possible, indicated by the question mark. If the question mark was not there, the match would be "greedy", meaning, that as much as possible is taken.
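
To make the difference between the lazy and the greedy form concrete, here is a small sketch that removes the tagged spans both ways. The class and method names are hypothetical. RegexOptions.Singleline is an addition that is not in the pattern above; it lets the dot match line breaks as well, in case a tagged span runs over more than one line.

```csharp
using System.Text.RegularExpressions;

public static class LazyVersusGreedy
{
    // Lazy (.*?): each <I>...</I> pair is matched and removed on its own.
    public static string RemoveLazy(string input)
    {
        return Regex.Replace(input, "<I>.*?</I>", string.Empty, RegexOptions.Singleline);
    }

    // Greedy (.*): a single match runs from the first <I> to the last </I>,
    // so the text between two tagged spans is removed as well.
    public static string RemoveGreedy(string input)
    {
        return Regex.Replace(input, "<I>.*</I>", string.Empty, RegexOptions.Singleline);
    }
}
```

With the input "one <I>a</I> two <I>b</I> three", the lazy version keeps the word "two" and the greedy version swallows it, which is why the lazy form is the one to use here.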

Cheers

Andi
   

Solution 3

_q12_ wrote: "Right now I dont care if it can be done with or without regex, i want it simple - not complicated."

Okay, now that you've opened that door :) : Try this:
private string testString = @"
    < I>(î áëþäàõ â ðåñòîðàíå)< /I>
    < I>(÷åã,î-ë. — of)< /I>
    < I>n< /I>
    < I>áèáë.< /I>";

// Splitting on both the opening and the closing tag leaves only the tag contents.
private string[] stringSeparators = new string[] { "< I>", "< /I>" };

private List<string> cleanStrings = new List<string>();

// assumes you have a Button named 'button1' on a Form
// with this Click EventHandler "wired-up"
private void button1_Click(object sender, EventArgs e)
{
    string[] splitTestString = testString.Split(stringSeparators, StringSplitOptions.RemoveEmptyEntries);

    foreach (string theStr in splitTestString)
    {
        string trimmed = theStr.Trim();

        // skip the whitespace-only pieces left between a closing tag and the next opening tag
        if (trimmed.Length == 0) continue;

        cleanStrings.Add(trimmed);

        // seeing is believing
        Console.WriteLine(trimmed);
    }
}
p.s. I have no doubt one of our "virtuosos" here will simplify this even further !
   

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)



