Hi,
I have an input string and want to extract several substrings from it. My input string has this format:
C#
string inputStr = "substr1<strlabel1>  substr2<strlabel2>  substr3<strlabel3> …  ";

Each substr can contain any character, including spaces and punctuation marks such as .!:?/}{][\)(, except the < and > characters.
Each strlabel consists of word characters (\w*).
For example:
C#
string inputStr = "this<zm>  is<vbb>  an<aa> simple example<jh>  for<ppr>  your<zm>";
The resulting substrings must be as follows:
C#
string[] substrs = {"this", " is", " an", " simple example", " for", " your"};
string[] strlabels = {"zm", "vbb", "aa", "jh", "ppr", "zm"};

How can I extract each substr and strlabel from the input string?
Updated 4-Jan-16 8:50am
Comments
BillWoodruff 4-Jan-16 14:15pm    
Would you be better off here if the result of parsing your string was a data structure that expressed the relationship between the two types of items?

It appears you will not have guaranteed unique values for the two types of 'whatevers' in your source string, so you can't use a Dictionary (which requires unique Keys); however, you could use a List<KeyValuePair<string,string>>, or a Tuple, to handle duplicates.
Maciej Los 4-Jan-16 14:54pm    
BillWoodruff 4-Jan-16 16:27pm    
Sorry I missed this, Maciej
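
The paired structure suggested in the comments above could look roughly like the following. This is only a sketch: the PairParser class, its Parse method, and the combined two-group pattern are hypothetical and not from the thread, and storing the substr as Key and the strlabel as Value is an assumption.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class PairParser
{
    // Hypothetical helper: returns (substr, strlabel) pairs in input order.
    // A list of pairs, unlike a Dictionary, tolerates repeated substrs or labels.
    public static List<KeyValuePair<string, string>> Parse(string input)
    {
        var pairs = new List<KeyValuePair<string, string>>();

        // Group 1: text before '<' (the substr). Group 2: text between '<' and '>' (the label).
        foreach (Match m in Regex.Matches(input, @"([^<>]*)<([^<>]*)>"))
        {
            pairs.Add(new KeyValuePair<string, string>(m.Groups[1].Value, m.Groups[2].Value));
        }
        return pairs;
    }
}
```

Calling PairParser.Parse(inputStr) keeps each substr next to its label, so the two never drift out of step the way two separate arrays can. Text after the last label is ignored by this pattern.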

Use a regex:
[^<>]+(?=(\<.*?\>)|$)
Should do it.
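
As a rough illustration of how that pattern might be applied from C#, here is a minimal sketch. The LookaheadExtractor class and its Extract method are made-up names, and reading the label out of capture group 1 is my own assumption about how the answer intends the lookahead to be used.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class LookaheadExtractor
{
    // Pattern from the answer: a run of non-bracket characters that is
    // followed by "<label>" (captured in group 1) or by the end of the input.
    private static readonly Regex Pattern = new Regex(@"[^<>]+(?=(\<.*?\>)|$)");

    public static void Extract(string input, List<string> substrs, List<string> labels)
    {
        foreach (Match m in Pattern.Matches(input))
        {
            substrs.Add(m.Value);

            // Group 1 succeeds only when the match is followed by a label;
            // it includes the brackets, so strip them.
            if (m.Groups[1].Success)
            {
                labels.Add(m.Groups[1].Value.Trim('<', '>'));
            }
        }
    }
}
```

The substrs come back with their original leading whitespace. Because of the `$` alternative, text after the last label is also reported as a substr, just without a matching label.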
 
 
Comments
ridoy 4-Jan-16 14:31pm    
I think so, a 5.
This can be very simple:
C#
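List<string> Words = new List<string>();
List<string> Tags = new List<string>();

// Splitting on both brackets yields alternating items: substr, label, substr, label, ...
// Note: RemoveEmptyEntries drops empty substrs or labels, which would shift the alternation.
var splitItems = inputStr.Split(new char[] {'<', '>'}, StringSplitOptions.RemoveEmptyEntries);

// Stop at the last complete pair so trailing text without a label cannot cause an IndexOutOfRangeException.
for (int i = 0; i + 1 < splitItems.Length; i += 2)
{
    Words.Add(splitItems[i]);
    Tags.Add(splitItems[i + 1]);
}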
 
 
This is a typical RegEx problem.
In other words, you need to learn regular expressions (aka RegEx):
Regex Class (System.Text.RegularExpressions)
To debug your RegEx, you may find these resources useful:
Debuggex: Online visual regex tester. JavaScript, Python, and PCRE.
perlre - perldoc.perl.org

For your question, pay attention to the Matches method. Since you want two lists, you will need one RegEx per list.
The two RegEx should look like:
([^<>]*)<[^<>]*>
[^<>]*<([^<>]*)>
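
A minimal sketch of running these two patterns through Regex.Matches, assuming capture group 1 is what should be collected from each match. The TwoRegexExtractor class and its method names are hypothetical.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class TwoRegexExtractor
{
    // Both patterns match a whole "substr<strlabel>" pair;
    // they differ only in which part is captured in group 1.
    private const string SubstrPattern = @"([^<>]*)<[^<>]*>";
    private const string LabelPattern = @"[^<>]*<([^<>]*)>";

    // Collects group 1 of every match, in input order.
    private static string[] FirstGroups(string input, string pattern)
    {
        return Regex.Matches(input, pattern)
                    .Cast<Match>()
                    .Select(m => m.Groups[1].Value)
                    .ToArray();
    }

    public static string[] Substrs(string input)
    {
        return FirstGroups(input, SubstrPattern);
    }

    public static string[] Labels(string input)
    {
        return FirstGroups(input, LabelPattern);
    }
}
```

Since both patterns match exactly the same spans, the two arrays always have the same length and line up index by index. A single pattern with two capture groups would also work, but this sketch follows the one-pattern-per-list approach described in the answer.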
 
C#
// Requires: using System; using System.Linq;
string inputStr = "this<zm>  is<vbb>  an<aa> simple example<jh>  for<ppr>  your<zm>";
string[] tokens = inputStr.Split(new char[] { '<' }, StringSplitOptions.RemoveEmptyEntries);
// Every token after the first has the form "label>substr"; a token without '>' is a plain substr.
string[] substrs = tokens.Select(s => s.Contains('>') ? s.Split(new char[] { '>' }, StringSplitOptions.RemoveEmptyEntries).Count() > 1 ? s.Split(new char[] { '>' }, StringSplitOptions.RemoveEmptyEntries)[1] : string.Empty : s)
    .Where(val => !string.IsNullOrEmpty(val)).ToArray();
string[] strlabels = tokens.Select(s => s.Contains('>') ? s.Split(new char[] { '>' }, StringSplitOptions.RemoveEmptyEntries)[0] : string.Empty)
    .Where(val => !string.IsNullOrEmpty(val)).ToArray();
 

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


