I have this string:

<p><img width="600" height="366" src="http://www.channelstv.com/wp-content/uploads/2014/01/boko-haram-2.jpg" class="attachment-post-thumbnail wp-post-image" alt="boko-haram-2" /></p><em><strong><Eleven people have been killed in a late night attack in Jakana village, Kaga Local Government Area of Borno State, north east Nigeria as residents of Magumeri flee after rumour of an impending attack by Boko Haram.</strong></em>

I want to extract only the valid words and skip anything that is not readable. I have tried this:

C#
try
{
    string msg = "&lt;p&gt;&lt;img width=&quot;600&quot; height=&quot;366&quot; src=&quot;http://www.channelstv.com/wp-content/uploads/2014/01/boko-haram-2.jpg&quot; class=&quot;attachment-post-thumbnail wp-post-image&quot; alt=&quot;boko-haram-2&quot; /&gt;&lt;/p&gt;&lt;em&gt;&lt;strong&gt;&lt;Eleven people have been killed in a late night attack in Jakana village, Kaga Local Government Area of Borno State, north east Nigeria as residents of Magumeri flee after rumour of an impending attack by Boko Haram.&lt;/strong&gt;&lt;/em&gt;";

    // Takes 105 characters starting at index 20.
    string retrieve = msg.Substring(20, 105);
}
catch (ArgumentOutOfRangeException)
{
    // Substring throws ArgumentOutOfRangeException (not IndexOutOfRangeException)
    // when the start index or the length falls outside the string.
}


but it doesn't give me the desired result.

Any assistance will be appreciated. Thanks in advance...


[edit]SHOUTING removed - OriginalGriff[/edit]
Updated 31-Mar-14 6:15am
Comments
OriginalGriff 31-Mar-14 11:52am    
DON'T SHOUT. Using all capitals is considered shouting on the internet, and rude (using all lower case is considered childish). Use proper capitalisation if you want to be taken seriously.
Uwakpeter 31-Mar-14 11:57am    
Thanks for the correction
[no name] 31-Mar-14 11:53am    
Probably because "msg2" is not defined.
Uwakpeter 31-Mar-14 11:57am    
msg2 holds the string
[no name] 31-Mar-14 11:58am    
No it doesn't. There is no "msg2" defined anywhere in your code sample.

Please, read my comments to the question.

There are two general ways to extract a substring from a source string:
1) How to: Search Strings Using String Methods (C# Programming Guide)
2) How to: Search Strings Using Regular Expressions (C# Programming Guide)
 
 
At the risk of appearing arrogant, I will tell you what you really want to do.
  1. First, HTML decode the string.  This converts HTML encoded content like "&lt;" to "<".  See this link for how to do this.
  2. Once you have a string that contains HTML and plain text, remove the HTML using my StringParser utility's removeHtml() method.
  3. You will be left with just the text that you wanted in the first place.


Using your example, the resulting string will be:

Eleven people have been killed in a late night attack in Jakana village, Kaga Local Government Area of Borno State, north east Nigeria as residents of Magumeri flee after rumour of an impending attack by Boko Haram.
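
The three steps above can also be sketched without any third-party code. This is only a rough sketch and not the StringParser utility: it assumes `System.Net.WebUtility.HtmlDecode` for step 1 and a simple regular expression standing in for `removeHtml()` in step 2. The names `HtmlTextExtractor` and `ExtractText` are made up for illustration, and a regex is not a real HTML parser, so treat it as adequate only for short snippets like the one in the question.

```csharp
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of StringParser or the .NET framework.
public static class HtmlTextExtractor
{
    public static string ExtractText(string input)
    {
        if (string.IsNullOrEmpty(input))
        {
            return string.Empty;
        }

        // Step 1: turn entities such as &lt; and &quot; back into < and ".
        string decoded = WebUtility.HtmlDecode(input);

        // Step 2: replace every complete tag with a space. Excluding '<' inside
        // the tag keeps the stray '<' before "Eleven" from swallowing the sentence.
        string noTags = Regex.Replace(decoded, "<[^<>]+>", " ");

        // Drop any unmatched angle brackets that are left over.
        string noBrackets = noTags.Replace("<", "").Replace(">", "");

        // Step 3: collapse runs of whitespace and trim the ends.
        return Regex.Replace(noBrackets, @"\s+", " ").Trim();
    }
}
```

Calling `HtmlTextExtractor.ExtractText(msg)` on the string from the question should leave only the sentence that starts with "Eleven people have been killed". Decoding first means the same call works whether the input is still HTML encoded or already raw HTML.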

/ravi
 
 
Comments
Maciej Los 31-Mar-14 12:56pm    
Self-reference, I like it!
Ravi Bhavnani 31-Mar-14 13:58pm    
Thanks. :) I only mentioned it because I thought it fit the bill nicely.

/ravi
Uwakpeter 16-Apr-14 7:35am    
Please, where is the StringParser utility class or method?
Ravi Bhavnani 16-Apr-14 8:14am    
Did you try clicking the link in the solution?

/ravi
Uwakpeter 16-Apr-14 9:05am    
Yes, I did, but I didn't see the StringParser class!
C#
string retrieve = msg.Substring(msg.IndexOf("Eleven"), (msg.IndexOf(".<") - msg.IndexOf("Eleven")));
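
One caveat with the line above: `IndexOf` returns -1 when a marker is missing, and `Substring` then throws. A guarded variant might look like the following sketch. The helper name `Between` is hypothetical, and it assumes the markers are written in the same form as the string being searched (raw `<` versus encoded `&lt;`).

```csharp
using System;

// Hypothetical helper for illustration only.
public static class SubstringHelper
{
    // Returns the text from the first occurrence of `start` up to (but not
    // including) the next occurrence of `end`, or null if either is missing.
    public static string Between(string source, string start, string end)
    {
        int from = source.IndexOf(start, StringComparison.Ordinal);
        if (from < 0)
        {
            return null;
        }

        // Search for the end marker only after the start marker.
        int to = source.IndexOf(end, from + start.Length, StringComparison.Ordinal);
        if (to < 0)
        {
            return null;
        }

        return source.Substring(from, to - from);
    }
}
```

Like the one-liner, `SubstringHelper.Between(msg, "Eleven", ".<")` stops just before the end marker, so the final period is not included in the result.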
 
 
Comments
Uwakpeter 31-Mar-14 12:12pm    
The text will not always be "Eleven" at that position. So what happens if it changes to something else?
Jawad Ahmed Tanoli 31-Mar-14 12:15pm    
So your string pattern can change every time?
What do you know that will be common in your messages?
Uwakpeter 31-Mar-14 12:18pm    
It changes every time, and nothing is going to be common. If I could filter for the alphanumerics and be left with only readable words, that would suffice. Thanks.
Jawad Ahmed Tanoli 31-Mar-14 12:24pm    
Then it will be hard to parse, because there must be something unique to search for. Reading only the alphabetic characters is also complicated, because the markup contains other words too. Try saving several message strings, then observe what they have in common and where the desired message sits in each one. Once you know that, it is possible to read it accurately.

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)