char[] chars = s.ToCharArray(); is unnecessary; just index into the string.
return tokens.ToArray(); I would just return the List and let the caller decide what to do.
I also prefer to allow the caller to specify the delimiter(s) and escape character(s) plus the ability to skip empty fields and limit the number of fields returned.
And have it as a method in a library, not a snippet.
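For what it's worth, here is a minimal sketch of what such a library method might look like. The class name `FieldSplitter`, the parameter names, and the rule that the last allowed field keeps the rest of the input unsplit are all my own assumptions, not something taken from the article's code.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class FieldSplitter
{
    // Hypothetical helper: splits s on any of the given delimiters.
    // A character that follows an escape character is taken literally.
    // maxFields <= 0 means "no limit"; once the limit is reached, the
    // remaining delimiters are kept as part of the final field.
    public static List<string> Split(
        string s,
        char[] delimiters,
        char[] escapes,
        bool skipEmpty = false,
        int maxFields = 0)
    {
        var fields = new List<string>();
        var current = new StringBuilder();

        for (int i = 0; i < s.Length; i++)   // index into the string directly
        {
            char c = s[i];
            bool atLimit = maxFields > 0 && fields.Count == maxFields - 1;

            if (Array.IndexOf(escapes, c) >= 0 && i + 1 < s.Length)
            {
                current.Append(s[++i]);      // keep the escaped character as-is
            }
            else if (!atLimit && Array.IndexOf(delimiters, c) >= 0)
            {
                AddField(fields, current, skipEmpty);
            }
            else
            {
                current.Append(c);
            }
        }

        AddField(fields, current, skipEmpty); // the last field has no trailing delimiter
        return fields;                        // a List; the caller can call ToArray() if needed
    }

    private static void AddField(List<string> fields, StringBuilder current, bool skipEmpty)
    {
        if (!(skipEmpty && current.Length == 0))
        {
            fields.Add(current.ToString());
        }
        current.Clear();
    }
}
```

Called as `FieldSplitter.Split(line, new[] { ',' }, new[] { '\\' })`, it treats `\,` as a literal comma inside a field.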
Hi
The code for the escape character appears to have a problem.
Escape mode.
"c" is a char variable, but you are comparing it to "//" which is 2 characters.
AECAEC
Hello. Thank you for having a look. I am wondering though, if the line you are referring to is this one...
// If this character is the "escape" character and we are
// presently within text...
if (c == '\\' && inText)
...? If so, my intention is to compare c to the single backslash (\). So, the first backslash is escaping the second one. I don't think the compiler would actually let me get away with saying if(c == '//') (using single quotes).
Please let me know if I have misunderstood. And thanks again!
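To make the point concrete, here is a small sketch (the class and method names are made up for illustration) showing that `'\\'` in C# source is one `char`, the backslash, even though it takes two keystrokes to write.

```csharp
public static class EscapeDemo
{
    // '\\' is an escape sequence: the first backslash escapes the second,
    // so the literal denotes a single backslash character.
    public static bool IsBackslash(char c)
    {
        return c == '\\';
    }

    // The string literal "\\" likewise holds exactly one character.
    public static int BackslashLiteralLength()
    {
        return "\\".Length;
    }
}
```

`EscapeDemo.BackslashLiteralLength()` returns 1, and `EscapeDemo.IsBackslash('/')` is false because the forward slash is a different character.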
pat daburu wrote: I don't think the compiler would actually let me get away with saying if(c == '//')
In C it's legal to put two characters in a character literal, but not in C#. :(
I use this RegEx to split lines on ";", excluding the ones inside quoted strings.
private Regex LineSplitter = new Regex("(\"(?<value>((\"\")|[^\"])*)\"|(?<value>[^;]*));", RegexOptions.ExplicitCapture | RegexOptions.IgnoreCase | RegexOptions.Singleline);
One drawback is that I have to add a ";" at the end of each line (which is bad). I may be able to find the correct way to write my regex, but I haven't tried yet.
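One possible way around the trailing ";" is to assert the delimiter with lookarounds instead of consuming it. The sketch below is only that, a sketch: the class name `SemicolonSplitter` is made up, the pattern is a variation on the one above rather than a tested replacement, and doubled quotes inside a quoted value are left as they are.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class SemicolonSplitter
{
    // (?<=^|;)  the field must start at the beginning of the line or right after a ";"
    // (?=;|$)   the field must be followed by a ";" or the end of the line
    // Neither lookaround consumes the ";", so no trailing delimiter is needed.
    private static readonly Regex Field = new Regex(
        "(?<=^|;)(\"(?<value>(\"\"|[^\"])*)\"|(?<value>[^;]*))(?=;|$)",
        RegexOptions.ExplicitCapture | RegexOptions.Singleline);

    public static List<string> Split(string line)
    {
        var values = new List<string>();
        foreach (Match m in Field.Matches(line))
        {
            // Both alternatives store their text in the same named group.
            values.Add(m.Groups["value"].Value);
        }
        return values;
    }
}
```

With this, `SemicolonSplitter.Split("a;\"b;c\";;d")` should give four values: `a`, `b;c`, an empty string, and `d`.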
I would also suggest the RegEx approach.
---
Voloda
Regular expressions are great for this kind of task. Most of the time, that's what I would use. I wanted to post this particular code in case it might be helpful to someone who's having trouble tweaking their expression, and to demonstrate how you can do step-by-step something similar to what the expression engine does.
I think this kind of string parsing is a highway to hell: each change tends to break existing functionality, and you eventually end up with one function containing many screens of useless code. Additionally:
1. Your code probably doesn't cover the "" case, where a doubled quote should in some cases act as an escaped quote.
2. Your code probably swallows \ characters entirely.
3. What's the intention of the escaped status variable? Its usage is not very clear, as it's just set and reset.
---
Voloda
"Highway to Hell"... like the AC/DC song. Thanks for the QA. The code that deals with escaping was indeed faulty. The code I posted also didn't include the last item in the string as one of the tokens. I have a version that fixes those and I'll update the article as soon as possible. (I'm new to posting here, so I have to figure out how to go about doing that. Any advice?)
As for the best approach being the regular expression: You'll get no argument from me. Regex is my go-to tool, and I almost never resort to this kind of parsing. If you know of a pattern that works well for this, please consider posting it (or a link to it).
I would imagine it would also be helpful if the pattern were broken down and explained so a coder could make modifications if the pattern wasn't handling a given case, or if they needed to, say, change from comma-separated values to some other delimiter, eliminate or otherwise handle domain-specific characters, etc.
To the question of the relative complexity of parsing like this... I don't know about you, but whenever I modify a complex regular expression, I tend to feel there's a certain risk of breaking it. That's not to say I think this approach is better, just that there may be a balance to be sought in some cases. But, as I say, if you are of the mind that it should be the regex parser or nothing, I can respect that.
Thanks again.
pat daburu wrote: As for the best approach being the regular expression: You'll get no argument from me. Regex is my go-to tool, and I almost never resort to this kind of parsing. If you know of a pattern that works well for this, please consider posting it (or a link to it).
I think that the pattern above seems very good.
pat daburu wrote: I would imagine it would also be helpful if the pattern were broken down and explained so a coder could make modifications if the pattern wasn't handling a given case, or if they needed to, say, change from comma-separated values to some other delimiter, eliminate or otherwise handle domain-specific characters, etc.
You can build the regex dynamically, so changing the delimiter is not a problem.
pat daburu wrote: To the question of the relative complexity of parsing like this... I don't know about you, but whenever I modify a complex regular expression, I tend to feel there's a certain risk of breaking it. That's not to say I think this approach is better, just that there may be a balance to be sought in some cases. But, as I say, if you are of the mind that it should be the regex parser or nothing, I can respect that.
Yes, I would agree that regular expressions often look like magic.
But you can very easily use unit testing to check your regex's behaviour against a large set of representative strings that it should be able to parse. After a change, you will also see immediately whether something is broken.
---
Voloda
modified on Wednesday, September 9, 2009 2:24 PM
Regular expressions are the way to go.
Here's a quick regular expression that'll put your elements within a line within nested capture groups. I just whipped it up, so test if you want to use it.
(?:"(?<element>[^"]*)",?)+
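In .NET, a named group inside a repeated construct keeps every iteration in its `Captures` collection, which is how the individual elements can be read back from a pattern like this. A rough sketch follows; the class name `QuotedLineReader` is invented, and it only reads the first run of quoted elements on the line.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class QuotedLineReader
{
    private static readonly Regex Line =
        new Regex("(?:\"(?<element>[^\"]*)\",?)+");

    public static List<string> Elements(string line)
    {
        var result = new List<string>();
        Match m = Line.Match(line);
        if (m.Success)
        {
            // Group.Value would return only the last iteration;
            // Captures holds one entry per repetition of the group.
            foreach (Capture c in m.Groups["element"].Captures)
            {
                result.Add(c.Value);
            }
        }
        return result;
    }
}
```

For a line such as `"a","b, c","d"` this yields three elements, with the comma inside the second one preserved.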
Yes, but... what about a CSV with mixed quoted and unquoted values?
Muhammad "The Greatest" Ali,123,"Main St.","Louisville, Kentucky, U.S.", ...
The code here shouldn't have a problem with the fact that "Main St." is in quotes, but "123" is not. (I have the idea that this is your concern, but I may misunderstand.)
On the other hand, it would be confused by...
Muhammad "The Greatest" Ali
...as a value in a field, expecting instead to see it as...
"Muhammad \"The Greatest\" Ali"
...with quotes at the beginning and the end, and intermediate quotes escaped.
If you pass it this string...
"Muhammad \"The Greatest\" Ali",123,"Main St.","Louisville, Kentucky, U.S."
You should get these tokens...
[0] "Muhammad "The Greatest" Ali"
[1] 123
[2] "Main St."
[3] "Louisville, Kentucky, U.S."
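For readers who want to see the behaviour described above in one place, here is a minimal char-by-char sketch. It is not the article's code; the class name `CsvTokenizer` is made up, and it assumes a comma delimiter, a backslash escape that only applies inside quotes, and tokens that keep their surrounding quotes.

```csharp
using System.Collections.Generic;
using System.Text;

public static class CsvTokenizer
{
    public static List<string> Tokenize(string s)
    {
        var tokens = new List<string>();
        var token = new StringBuilder();
        bool inText = false;   // true between an opening and a closing quote
        bool escaped = false;  // true when the previous character was an escape

        foreach (char c in s)
        {
            if (escaped)
            {
                token.Append(c);      // take the escaped character literally
                escaped = false;
            }
            else if (c == '\\' && inText)
            {
                escaped = true;       // drop the backslash, remember it
            }
            else if (c == '"')
            {
                inText = !inText;     // quotes toggle text mode and stay in the token
                token.Append(c);
            }
            else if (c == ',' && !inText)
            {
                tokens.Add(token.ToString());   // a comma outside quotes ends the token
                token.Clear();
            }
            else
            {
                token.Append(c);
            }
        }

        tokens.Add(token.ToString()); // the last token has no trailing comma
        return tokens;
    }
}
```

The final `tokens.Add` after the loop is what keeps the last item in the string from being lost.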
So, I suppose I'm still wondering if you encounter files that have the delimiters showing up in the middle of a field?
pat daburu wrote: it would be confused by
Mine wouldn't.
pat daburu wrote: intermediate quotes escaped
Yes, that would be preferable, and is how I would probably create it.
pat daburu wrote: wondering if you encounter
No, but I can create them just to mess with you.
Dealing with issues such as a mix of quoted and non-quoted fields isn't much more difficult. However, there needs to be a consistent usage of the delimiters in play. Otherwise, you can never be certain about where one field ends.
tgrt wrote: consistent usage of the delimiters
Within one file, yes, but not necessarily in files from diverse sources. The goal (in my opinion) is a general-purpose CSV parser.
...splitting the string on commas, and then checking each resulting string to see if it starts with a quotation mark? If so, search until a string is found that ends with a quotation mark (it may be the same string), then join the searched string(s) together, adding commas between them. Finally, strip off the leading and trailing quote and replace each doubled quote ("") with a single quote (").
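A rough sketch of that split-and-rejoin idea is below. The class name `SplitAndRejoin` is invented, and there is one deliberate deviation: instead of only checking whether a piece ends with a quotation mark, it treats a field as closed once it contains an even number of quotes, since a piece can also end in a doubled (escaped) quote.

```csharp
using System.Collections.Generic;

public static class SplitAndRejoin
{
    public static List<string> Parse(string line)
    {
        var fields = new List<string>();
        string[] pieces = line.Split(',');

        for (int i = 0; i < pieces.Length; i++)
        {
            string field = pieces[i];

            if (field.Length > 0 && field[0] == '"')
            {
                // Re-attach following pieces (with the comma that split them)
                // until the quotes in the field are balanced.
                while (CountQuotes(field) % 2 != 0 && i + 1 < pieces.Length)
                {
                    field += "," + pieces[++i];
                }

                // Strip the surrounding quotes and collapse doubled quotes.
                if (field.Length >= 2 && field[field.Length - 1] == '"')
                {
                    field = field.Substring(1, field.Length - 2).Replace("\"\"", "\"");
                }
            }

            fields.Add(field);
        }

        return fields;
    }

    private static int CountQuotes(string s)
    {
        int count = 0;
        foreach (char c in s)
        {
            if (c == '"') count++;
        }
        return count;
    }
}
```

A field whose closing quote never arrives is returned as-is, quotes included, rather than raising an error; whether that is the right behaviour depends on the data.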
That would be an excellent strategy to solve the particular problem. I went with the char-by-char approach mostly to demonstrate a more general solution to parsing out the text. My thinking was that a person could take the code I've posted here and use it as a kind of template if they so desired. By adding states, and code blocks to handle each state, you could handle a fairly wide variety of cases. (Though there would certainly come a point where letting the regular expression language do the work would make more sense.)
blah,blah "blah, blah" blah,blah
That's a good catch, but it's a case the code consciously doesn't consider. I had the understanding that in a .csv file, if a field is quoted, the entire field would be delimited, rather than having quotes show up in the middle. So, this method would expect to see...
blah,blah,"blah, blah",blah, blah
...but not your example...
blah,blah "blah, blah" blah,blah
...where the quotes show up in the middle of a field. While I haven't seen this case come up, I don't know for certain that it never does. Have you encountered .csv files where that happens?