Hi,
I have an input string and want to extract several substrings from it. My input string has this format:
C#
string inputStr = "substr1<strlabel1>  substr2<strlabel2>  substr3<strlabel3> …  ";

Each substr may contain any character, including spaces and punctuation marks such as .!:?/}{][\)(, except the < and > characters.
Each strlabel consists of word characters (\w*).
For example:
C#
string inputStr = "this<zm>  is<vbb>  an<aa> simple example<jh>  for<ppr>  your<zm>";
The resulting substrings must be as follows:
C#
string[] substrs = { "this", " is", " an", " simple example", " for", " your" };
string[] strlabels = { "zm", "vbb", "aa", "jh", "ppr", "zm" };

How can I extract each substr and strlabel from the input string?
Updated 4-Jan-16 8:50am
Comments
BillWoodruff 4-Jan-16 14:15pm    
Would you be better off here if the result of parsing your string was a data structure that expressed the relationship between the two types of items?

It appears you will not have guaranteed unique values for the two types of 'whatevers' in your source string, so you can't use a Dictionary (which requires unique Keys); however, you could use a List<KeyValuePair<string,string>>, or a Tuple, to handle duplicates.
Maciej Los 4-Jan-16 14:54pm    
BillWoodruff 4-Jan-16 16:27pm    
Sorry I missed this, Maciej
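
The List<KeyValuePair<string,string>> idea from the comment above could be sketched roughly as follows. This is only an illustration: the `TaggedText` class, the `ParsePairs` method, the combined pattern and the call to `Trim()` are assumptions and do not come from the thread.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class TaggedText
{
    // Hypothetical helper: returns (substring, label) pairs in input order.
    // Duplicate labels such as "zm" are fine because a list does not need unique keys.
    public static List<KeyValuePair<string, string>> ParsePairs(string input)
    {
        var pairs = new List<KeyValuePair<string, string>>();

        // Group 1: any run of characters other than '<' and '>' (the substring).
        // Group 2: the word characters between '<' and '>' (the label).
        foreach (Match m in Regex.Matches(input, @"([^<>]*)<(\w*)>"))
        {
            // Trim() is an assumption; drop it to keep the surrounding spaces.
            pairs.Add(new KeyValuePair<string, string>(
                m.Groups[1].Value.Trim(), m.Groups[2].Value));
        }
        return pairs;
    }
}

class Program
{
    static void Main()
    {
        string inputStr = "this<zm>  is<vbb>  an<aa> simple example<jh>  for<ppr>  your<zm>";
        foreach (var pair in TaggedText.ParsePairs(inputStr))
        {
            Console.WriteLine("{0} => {1}", pair.Key, pair.Value);
        }
    }
}
```

Keeping each substring next to its label avoids having to keep two parallel arrays in step.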

This is a typical RegEx problem.
In other words, you need to learn regular expressions (aka RegEx):
Regex Class (System.Text.RegularExpressions)
To debug your RegEx, you may find these resources useful:
Debuggex: Online visual regex tester. JavaScript, Python, and PCRE.
and
perlre - perldoc.perl.org

For your question, pay attention to the Matches function. Since you want two lists, you will need one RegEx per list.
The two RegEx should look like:
([^<>]*)<[^<>]*>
[^<>]*<([^<>]*)>
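
As a minimal sketch, the two patterns above might be applied with Regex.Matches like this. The `TwoRegexDemo` class and its method names are hypothetical; in both patterns, capture group 1 holds the part of interest.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

static class TwoRegexDemo
{
    // Hypothetical helper: runs a pattern and collects capture group 1 of every match.
    static string[] Group1Values(string input, string pattern)
    {
        return Regex.Matches(input, pattern)
            .Cast<Match>()                      // MatchCollection is non-generic on older frameworks
            .Select(m => m.Groups[1].Value)
            .ToArray();
    }

    // First pattern: captures the text in front of each <label>.
    public static string[] ExtractSubstrs(string input)
    {
        return Group1Values(input, @"([^<>]*)<[^<>]*>");
    }

    // Second pattern: captures the text between '<' and '>'.
    public static string[] ExtractLabels(string input)
    {
        return Group1Values(input, @"[^<>]*<([^<>]*)>");
    }
}

class Program
{
    static void Main()
    {
        string inputStr = "this<zm>  is<vbb>  an<aa> simple example<jh>  for<ppr>  your<zm>";
        Console.WriteLine(string.Join("|", TwoRegexDemo.ExtractSubstrs(inputStr)));
        Console.WriteLine(string.Join("|", TwoRegexDemo.ExtractLabels(inputStr)));
    }
}
```

The substrings keep the spaces that surround them in the input. Call Trim() on each value if that is not wanted.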
 
This can be very simple:
C#
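// Requires: using System; using System.Collections.Generic;
List<string> Words = new List<string>();
List<string> Tags = new List<string>();

// Splitting on both '<' and '>' yields alternating word/tag items.
// Note: RemoveEmptyEntries drops an empty word or tag, which would shift the pairing.
var splitItems = inputStr.Split(new char[] {'<', '>'}, StringSplitOptions.RemoveEmptyEntries);

// Stop at i + 1 < Length so trailing text without a tag cannot cause an out-of-range access.
for (int i = 0; i + 1 < splitItems.Length; i += 2)
{
    Words.Add(splitItems[i]);
    Tags.Add(splitItems[i + 1]);
}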
 
 
C#
// Requires: using System; using System.Linq;
string inputStr = "this<zm>  is<vbb>  an<aa> simple example<jh>  for<ppr>  your<zm>";
// Splitting on '<' yields "this", "zm>  is", "vbb>  an", ... (label first, then the next substring).
string[] tokens = inputStr.Split(new char[] { '<' }, StringSplitOptions.RemoveEmptyEntries);
// The substring is whatever follows '>' (or the whole token when it has no '>').
string[] substrs = tokens
    .Select(s => s.Substring(s.IndexOf('>') + 1))
    .Where(val => !string.IsNullOrEmpty(val))
    .ToArray();
// The label is whatever precedes '>'; tokens without '>' or with an empty label are skipped.
string[] strlabels = tokens
    .Where(s => s.IndexOf('>') > 0)
    .Select(s => s.Substring(0, s.IndexOf('>')))
    .ToArray();
 
Use a regex:
[^<>]+(?=(\<.*?\>)|$)
Should do it.
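
A rough sketch of how this pattern behaves in .NET may help. With it, each match value is a substring, and capture group 1 (set inside the lookahead) holds the following tag including its angle brackets. The `LookaheadDemo` class and `Extract` method are hypothetical names, and stripping the brackets with `Trim('<', '>')` is an assumption about the desired result.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class LookaheadDemo
{
    // Hypothetical helper: returns (substring, label) tuples found by the lookahead pattern.
    public static List<Tuple<string, string>> Extract(string input)
    {
        var result = new List<Tuple<string, string>>();
        foreach (Match m in Regex.Matches(input, @"[^<>]+(?=(\<.*?\>)|$)"))
        {
            // m.Value is the text in front of the tag. Group 1 is the tag itself,
            // e.g. "<zm>", or empty when the match ended at the end of the input.
            string label = m.Groups[1].Value.Trim('<', '>');
            result.Add(Tuple.Create(m.Value, label));
        }
        return result;
    }
}

class Program
{
    static void Main()
    {
        string inputStr = "this<zm>  is<vbb>  an<aa> simple example<jh>  for<ppr>  your<zm>";
        foreach (var item in LookaheadDemo.Extract(inputStr))
        {
            Console.WriteLine("[{0}] -> [{1}]", item.Item1, item.Item2);
        }
    }
}
```

Because of the `|$` alternative, any text after the last tag (trailing spaces, for instance) is also returned as a substring, with an empty label.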
 
 
Comments
ridoy 4-Jan-16 14:31pm    
I think so, a 5.

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


