Please help me, programmers. I'm confused about how to write code for parsing English sentences.
How do I recognize where one sentence ends and the next one begins when the text has embedded punctuation marks?
For example :

Input :

First sentence. Second sentence! Third sentence? .Yes.
.....Nice to meet you.... I am okay


Output :
arr[0] = First sentence
arr[1] = Second sentence
arr[2] = Third sentence
arr[3] = Yes
arr[4] = Nice to meet you
arr[5] = I am okay


Explanation :
I want to split on { ?, !, one or more dots } and discard any piece that has no meaning.

This is my code:

C#
string[] word = new string[100];
string inputRtb = rtbInput.Text;

string plot1 = "";
string plot2 = "";

string[] splitString = inputRtb.Split(new char[] {' ', '\t', '\n'});

int j = 0;
int pos = 0;
for (int i = 0 ; i < splitString.Length; i++)
{
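    string token = splitString[i].Trim();
    if (token != "" && token != ".")
    {
        // Index into the trimmed token; the untrimmed length could run past its end.
        if (token[token.Length - 1] == '.')
        {
            // C# has no substr(); Substring(start, length) drops the trailing dot.
            plot2 = token.Substring(0, token.Length - 1);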
            if (plot1 == "")
                plot1 += plot2;
            else
                plot1 += " " + plot2;
            pos++;
        }

        else if (plot1 == "")
            plot1 += splitString[i].Trim();
        else
            plot1 += " " + splitString[i].Trim();
    }


    if (plot1 != "" && splitString[i].Trim() == ".")
    {
        word[j++] = plot1;
        plot1 = "";
    }
    else if (pos > 0)
    {
        word[j++] = plot1;
        plot1 = "";
        pos--;
    }
    else if (plot1 != "" && i == splitString.Length-1)
    {
        word[j++] = plot1;
        plot1 = "";
    }
}
Updated 8-Mar-13 4:15am
Comments
Menon Santosh 8-Mar-13 8:23am    
Please elaborate on your problem.
Member 10531704 18-Jan-14 9:11am    
What are rtbInput and inputRtb?
Richard MacCutchan 8-Mar-13 8:27am    
Don't try and do it with basic split methods. You need to write your own parser that recognises specific sentence terminators.
Berry Harahap 8-Mar-13 8:32am    
@Santosh: Hello.
My problem: if the INPUT is ...Asia
I have difficulty splitting it. I want the OUTPUT to be Asia

or INPUT : Thanks...
OUTPUT : Thanks

INPUT : ..Thanks..
OUTPUT : Thanks

INPUT : Thanks
OUTPUT : Thanks

INPUT : Thanks
OUTPUT : Thanks

INPUT : !!!Thanks
OUTPUT : Thanks

INPUT : Thanks?????
OUTPUT : Thanks


I can't work out the code to do this. Any solution?
I'm just a student.
joshrduncan2012 8-Mar-13 9:14am    
I would look at writing your own version of the parser (like Richard said) using the String.Split() method. You may get some blanks in the array it creates, so you would have to throw away the blank elements.

This task is not as trivial as it sounds: the meaning of a punctuation character is intrinsically context dependent.
E.g. the dots at the beginning of this line do not each end a sentence. ;-)
Or "(see item 1. above)" does not end after the dot.
There are many more such cases, also with other punctuation characters.
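
To make the context dependency concrete, here is a rough sketch of a hand-written scanner. The heuristic is only an assumption for illustration: a run of terminators ends a sentence only at the end of the text, or when whitespace and then an uppercase letter follow it. The names NaiveSentenceScanner and Scan are hypothetical, and the terminators are kept in the output.

```csharp
using System;
using System.Collections.Generic;

public static class NaiveSentenceScanner
{
    // Hypothetical heuristic, not a complete solution: a run of terminators ends a
    // sentence only at the end of the text, or when it is followed by whitespace
    // and then an uppercase letter.
    public static List<string> Scan(string text)
    {
        var sentences = new List<string>();
        int start = 0;
        int i = 0;
        while (i < text.Length)
        {
            if (!IsTerminator(text[i])) { i++; continue; }

            // Consume the whole run of terminators ("...", "?!", and so on).
            int runEnd = i;
            while (runEnd < text.Length && IsTerminator(text[runEnd])) runEnd++;

            // Skip the whitespace that follows the run.
            int next = runEnd;
            while (next < text.Length && char.IsWhiteSpace(text[next])) next++;

            bool atEnd = next == text.Length;
            bool newSentenceFollows = !atEnd && next > runEnd && char.IsUpper(text[next]);
            if (atEnd || newSentenceFollows)
            {
                string sentence = text.Substring(start, runEnd - start).Trim();
                if (sentence.Length > 0) sentences.Add(sentence);
                start = next;
            }
            i = runEnd;
        }

        // Keep trailing text that has no terminator at all.
        if (start < text.Length)
        {
            string rest = text.Substring(start).Trim();
            if (rest.Length > 0) sentences.Add(rest);
        }
        return sentences;
    }

    private static bool IsTerminator(char c) => c == '.' || c == '?' || c == '!';
}
```

For "See item 1. above for details. E.g. the dots here do not end a sentence. Right?" this yields three sentences, because neither "1." nor "E.g." is followed by an uppercase letter. It still fails on cases like "Dr. Smith", which is exactly why the general problem is hard.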

But it looks like this is not the topic of the question.
So, if you simply want to get the chunks of text between some delimiters, treating repeated delimiters as one delimiter and stripping leading and trailing whitespace from the chunks found, then the following would do:
C#
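// requires: using System; using System.Linq;
string fullText = "..."; // input text
char[] delim = ".?!;".ToCharArray(); // add more single-character delimiters as needed
var sentences = fullText.Split(delim, StringSplitOptions.RemoveEmptyEntries)
    .Select(s => s.Trim())
    .Where(s => s.Length > 0); // drop chunks that were only whitespace, e.g. the " " in "? ."
foreach (var s in sentences) Console.WriteLine(s);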

Cheers
Andi
 
 
Comments
Berry Harahap 9-Mar-13 3:30am    
==Dear Andreas Gieriet==
A very simple approach, and a good job, sir.
Would you mind giving your opinion on the code below? Which would you choose: your code, this code, or a combination? Please give me a recommendation.
=================================================
ArrayList arrSentence = new ArrayList();
string temp = inputRtb;
temp = temp.Replace(Environment.NewLine, " ");
char[] arrSeparator = { '.', '?', '!' };

string[] splitInput = temp.Split(arrSeparator, StringSplitOptions.RemoveEmptyEntries);
for (int i = 0; i < splitInput.Length; i++) {
int pos = temp.IndexOf(splitInput[i].ToString());
char[] oChars = temp.Trim().ToCharArray();
char charCollection = oChars[pos + splitInput[i].Length];
arrSentence.Add(splitInput[i].ToString().Trim() + charCollection.ToString());
}
==================================================

Please give me a recommendation, sir.
Cheers
Andi :) :)
Andreas Gieriet 9-Mar-13 15:00pm    
I never use non-generic containers unless I'm forced to, so ArrayList is a no-go for me.
Naming a variable temp is a no-go too. If you cannot think of a reasonable name, you have a design problem.
You use ToString() in several places where it is nonsense. Know which types you are using and why!
Calling a variable of type char charCollection is misleading.

Generally: you have an inconsistent naming scheme (why do you call one variable oChars and the others without any prefix?) and you have weird type selections or conversions (e.g. see ToString()).

My advice: learn the C# types and choose a naming scheme for variables that conveys the meaning of the stored values.

Don't take this criticism personally; take it as advice for improvement.
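
As one possible illustration of that advice (a sketch only, not the one right way), the snippet could be reshaped roughly like this. It assumes the intent was to keep each sentence together with the single terminator that ended it. The names SentenceSplitter, SplitKeepingTerminator, singleLine and the rest are my own choices, not anything from the original code.

```csharp
using System;
using System.Collections.Generic;

public static class SentenceSplitter
{
    private static readonly char[] Terminators = { '.', '?', '!' };

    // Returns each sentence followed by the one terminator that ended it.
    // Text after the last terminator is returned without a terminator.
    public static List<string> SplitKeepingTerminator(string inputText)
    {
        // Treat line breaks as ordinary spaces, as the original snippet did.
        string singleLine = inputText.Replace(Environment.NewLine, " ");
        var sentences = new List<string>();   // generic list instead of ArrayList

        int sentenceStart = 0;
        while (sentenceStart < singleLine.Length)
        {
            int terminatorIndex = singleLine.IndexOfAny(Terminators, sentenceStart);
            if (terminatorIndex < 0)
            {
                // No terminator left: keep the remaining text if it is not blank.
                string rest = singleLine.Substring(sentenceStart).Trim();
                if (rest.Length > 0) sentences.Add(rest);
                break;
            }

            string body = singleLine
                .Substring(sentenceStart, terminatorIndex - sentenceStart)
                .Trim();
            // Skip empty bodies, which is what repeated terminators like "..." produce.
            if (body.Length > 0) sentences.Add(body + singleLine[terminatorIndex]);

            sentenceStart = terminatorIndex + 1;
        }
        return sentences;
    }
}
```

Scanning with IndexOfAny from the current position avoids looking up each piece again with IndexOf, which would find the first occurrence of a repeated sentence rather than the current one.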

Regards
Andi
Berry Harahap 9-Mar-13 19:59pm    
== Mr Andreas Gieriet ==
I appreciate your suggestions. Thanks a lot. :) :) :)
Sergey Alexandrovich Kryukov 13-Mar-13 1:24am    
Right, a 5.
—SA
Andreas Gieriet 13-Mar-13 3:08am    
Thanks for your 5!
Andi
Here is a rough algorithm, but it should get the job done.

public static string[] ParseSentences(string sentence)
{
    char[] terminators = { '.', '?', '!' };

    List<string> sentences = new List<string>(
        sentence.Split(terminators, StringSplitOptions.RemoveEmptyEntries));
    for (int i = sentences.Count - 1; i >= 0; i--)
        if (sentences[i].Trim().Length == 0)
            sentences.RemoveAt(i);
        else
            sentences[i] = sentences[i].Trim();

    return sentences.ToArray();
}
 
 
Comments
Berry Harahap 8-Mar-13 19:03pm    
Dear @Mr MichaelBergsma, I'd like to try this code, but this line gives an error:
List<string> sentences = new List<string>(sentences.Split(terminators, StringSplitOptions.RemoveEmptyEntries))
The error is at sentences.Split: List<string> doesn't contain a definition for 'Split'. What reference should I add?
Andreas Gieriet 9-Mar-13 0:30am    
Read carefully: there are two variables with almost the same name: sentence and sentences.
Cheers
Andi
Sergey Alexandrovich Kryukov 13-Mar-13 1:24am    
That's pretty much it... My 5.
—SA
This one line will do the trick; you just need to add all the sentence terminators to the array.
C#
string[] splitString = inputRtb.Split(new char[] { '!', '?', '.', '\t', '\n' }, StringSplitOptions.RemoveEmptyEntries);
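
The split on its own leaves the surrounding spaces in each piece, and pieces made up only of spaces survive RemoveEmptyEntries (for example the " " between "?" and "." in the sample input). A minimal sketch of a cleanup pass on top of the same line follows; the class and method names are hypothetical.

```csharp
using System;
using System.Linq;

public static class SplitCleanup
{
    // Same delimiters as the one-liner above, followed by a trim-and-filter pass.
    public static string[] SplitAndClean(string inputRtb)
    {
        string[] splitString = inputRtb.Split(
            new char[] { '!', '?', '.', '\t', '\n' },
            StringSplitOptions.RemoveEmptyEntries);

        return splitString
            .Select(piece => piece.Trim())       // remove leading and trailing whitespace
            .Where(piece => piece.Length > 0)    // drop pieces that were only whitespace
            .ToArray();
    }
}
```

Called with the two-line sample input from the question, this returns the six entries listed there, from "First sentence" to "I am okay".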
 
 
Comments
Sergey Alexandrovich Kryukov 13-Mar-13 1:25am    
That's pretty much it, too, a 5.
—SA
[no name] 13-Mar-13 2:17am    
Thanks Sergey.

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


