Hi, I have a string:

"CRUDE OIL, LIGHT SWEET - ICE FUTURES EUROPE" ,110517 ,2011-05-17,067411, "(CONTRACTS OF 1,000 BARRELS)"

I am looking for a regular expression that splits the string at the commas but keeps the characters within the double quotes together.

Something like ...regex.split(str, ", [^\"])") won't work.

Can anyone please help me out?

Best
Mho
Comments
Sergey Alexandrovich Kryukov 25-May-11 16:22pm
   
Tag the language and platform!
--SA

Try this Regex: ,(?!(?<=(?:^|,)\s*\x22(?:[^\x22]|\x22\x22|\\\x22)*,)(?:[^\x22]|\x22\x22|\\\x22)*\x22\s*(?:,|$))

C# example:
string input = "\"CRUDE OIL, LIGHT SWEET - ICE FUTURES EUROPE\" ,110517 ,2011-05-17,067411, \"(CONTRACTS OF 1,000 BARRELS)\"";

string[] strings = System.Text.RegularExpressions.Regex.Split(input, @",(?!(?<=(?:^|,)\s*\x22(?:[^\x22]|\x22\x22|\\\x22)*,)(?:[^\x22]|\x22\x22|\\\x22)*\x22\s*(?:,|$))");


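Output is (shown without the white space that Split leaves around some elements, since only the commas are removed):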

"CRUDE OIL, LIGHT SWEET - ICE FUTURES EUROPE"
110517
2011-05-17
067411
"(CONTRACTS OF 1,000 BARRELS)"
   
Comments
Manfred Rudolf Bihy 26-May-11 7:57am
   
Wow! Me's speechless :)
Take a 5, it's all I can give.
Kim Togo 26-May-11 8:23am
   
Thanks Manfred :-)
mhogli 26-May-11 17:19pm
   
That's brilliant and it works! Thank you very much Kim, five points from Germany to Denmark.
Kim Togo 27-May-11 10:11am
   
Thanks mhogli, and you are welcome. If it has solved your problem, then please press "Accept Answer". This will help other CP members find the right solution.
Which language are you using?
Kim Togo 31-May-11 9:12am
   
Hi mhogli
Did the solution solve your problem?
mhogli 31-May-11 16:13pm
   
For a couple of lines, this solution is perfect. For my 20,000 lines it is too slow.
Kim Togo 1-Jun-11 2:38am
   
Good to know. What if you change the line to a static variable and have .NET pre-compile the regex with RegexOptions.Compiled?

public static System.Text.RegularExpressions.Regex CommaRegex = new System.Text.RegularExpressions.Regex(@",(?!(?<=(?:^|,)\s*\x22(?:[^\x22]|\x22\x22|\\\x22)*,)(?:[^\x22]|\x22\x22|\\\x22)*\x22\s*(?:,|$))", System.Text.RegularExpressions.RegexOptions.Compiled | System.Text.RegularExpressions.RegexOptions.Singleline);

And then call CommaRegex.Split in the method that handles one line at a time from the ?
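A minimal sketch of how that suggestion might be wired up, assuming the lines are already available as strings. The class name CsvRegexSplitter and the method SplitLines are hypothetical; the pattern is copied unchanged from the answer above. Whether this is fast enough for 20,000 lines is something only a measurement can tell.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original answer.
public static class CsvRegexSplitter
{
    // Built once and reused for every line; pattern copied from the answer.
    private static readonly Regex CommaRegex = new Regex(
        @",(?!(?<=(?:^|,)\s*\x22(?:[^\x22]|\x22\x22|\\\x22)*,)(?:[^\x22]|\x22\x22|\\\x22)*\x22\s*(?:,|$))",
        RegexOptions.Compiled | RegexOptions.Singleline);

    // Splits every line at commas outside quotes and trims each field,
    // because Split removes only the comma and leaves nearby spaces.
    public static List<string[]> SplitLines(IEnumerable<string> lines)
    {
        var rows = new List<string[]>();
        foreach (string line in lines)
        {
            string[] fields = CommaRegex.Split(line);
            for (int i = 0; i < fields.Length; i++)
            {
                fields[i] = fields[i].Trim();
            }
            rows.Add(fields);
        }
        return rows;
    }
}
```

On Framework 3.5 it could be called as CsvRegexSplitter.SplitLines(System.IO.File.ReadAllLines(path)), where path is whatever file holds the data.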
fjdiewornncalwe 30-May-11 14:57pm
   
Impressive. +5.
Kim Togo 31-May-11 2:29am
   
Thank you Marcus.
This is a case where regular expressions do not play well. You need a function that splits on ',' and blank space. On .NET the best way is to use string.Split; on other platforms something similar is still the best option.

I could give more details if you tagged your platform/language. Do it now!

[EDIT per discussion, below]
For an example of a solution to a similar problem, please see my article: Enumeration-based Command Line Utility.

In the code for this article, I had trouble building a good simulator to test the command line in tricky situations where quotation marks are used to pass command-line arguments containing blank spaces. .NET parses a raw single-string command line into an array of strings in a rather cunning way, splitting the line at blank spaces but preserving the blank spaces inside quoted fragments. It can even split more or less reasonably if the user makes mistakes in balancing the quotation marks. I needed to simulate this in order to speed up testing.

Section "6. Testing" of the article explains this problem and the code. Look at this section to find where the algorithm is implemented in my demo/test code. I analyzed the problem from different standpoints and concluded that regular expressions would not really be helpful, so I ended up with direct string manipulation.
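For the comma-separated string in the question, direct string handling could look roughly like the sketch below. This is not the code from the article; the names QuotedCsv and SplitLine are hypothetical, and it assumes that quotes are balanced and that a field never contains a backslash-escaped quote. It keeps the quotes in the output and trims white space around each field.

```csharp
using System.Collections.Generic;
using System.Text;

// Hypothetical helper, written only to illustrate the idea.
public static class QuotedCsv
{
    // Splits one line at the commas that lie outside double quotes.
    public static List<string> SplitLine(string line)
    {
        var fields = new List<string>();
        var current = new StringBuilder();
        bool inQuotes = false;

        foreach (char c in line)
        {
            if (c == '"')
            {
                // Toggle the state; a doubled quote ("") toggles twice
                // and therefore leaves the state unchanged.
                inQuotes = !inQuotes;
                current.Append(c);
            }
            else if (c == ',' && !inQuotes)
            {
                // A comma outside quotes ends the current field.
                fields.Add(current.ToString().Trim());
                current.Length = 0; // StringBuilder.Clear() needs .NET 4
            }
            else
            {
                current.Append(c);
            }
        }

        // The last field has no trailing comma.
        fields.Add(current.ToString().Trim());
        return fields;
    }
}
```

Each line is scanned once, character by character, with no backtracking, so the cost grows linearly with the length of the line.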

Hope it can be useful.

--SA
   
Comments
mhogli 26-May-11 3:20am
   
Thank you for your support. I use VB.NET and Framework 3.5.

Why do you think regex is not an option? The expression "\sa[^ub].*?\s", for example, finds every word that begins with an "a", up to the next blank space, but it does not match if a "u" or "b" follows the "a". That's why I thought regex would be the best solution for splitting such strings.
   
I think this is an unnecessary complication. You devise a more or less tricky Regex and then need to traverse all the matches anyway. I solved a very similar task recently, considered different options, and ended up using string manipulation. If you use C#, I could give you the reference...
--SA
mhogli 30-May-11 3:09am
   
Despite using VB.NET, I am very interested in your statement. How did you end up solving this challenge with string manipulation?
Sergey Alexandrovich Kryukov 30-May-11 14:17pm
   
Sure. See the updated solution and my article; find the source code the way I explained.
--SA
Espen Harlinn 30-May-11 19:00pm
   
Good point - regex seems like overkill, my 5
Sergey Alexandrovich Kryukov 30-May-11 22:23pm
   
Thank you, Espen. I would not call it overkill; rather, regex under-delivers in cases like this.
--SA
mhogli 31-May-11 16:11pm
   
For more than 20,000 lines of CSV data, regex seems to have a slight performance problem. For that reason I would prefer your approach.
Sergey Alexandrovich Kryukov 31-May-11 19:55pm
   
Thank you, but I'm not sure about performance. You can test it.
Will you formally accept my answer (green button)?
Thank you.
--SA
I've been playing around with Expresso to help me with regex.
   
Comments
Sergey Alexandrovich Kryukov 25-May-11 16:25pm
   
OK, and do you know a good pattern? No! This is not a task for Regex.
Please see my answer.
--SA
I wrote an article for this, but it doesn't use Regex.

Persistent String Parser
   
Comments
mhogli 30-May-11 15:41pm
   
OK, but that doesn't parse a typical CSV string such as 1,2,3,4, "5, 78" very well.

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)



