I haven't done any testing with DBCS or other code pages; I've only tested with ANSI/Unicode, so I can't say whether it works with DBCS-encoded files. But if you can translate the file to Unicode first, it should work fine.
I found the following that discusses a way to translate a code page to Unicode:
http://www.dotnet247.com/247reference/msgs/2/12808.aspx
I hope that helps....
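If it helps, the idea amounts to decoding with the source code page and re-encoding as Unicode. Here is a rough Python sketch (purely illustrative; cp1252 is just an example source code page):

```python
def transcode_to_unicode(src_path, dst_path, source_codepage="cp1252"):
    """Decode a file written in a specific code page and rewrite it
    as UTF-8 (Unicode) so the parser only ever sees Unicode text."""
    with open(src_path, "r", encoding=source_codepage) as src:
        text = src.read()
    with open(dst_path, "w", encoding="utf-8") as dst:
        dst.write(text)
```

The same two-step decode/re-encode works for any code page whose name the runtime recognizes.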
---
It is going to be a while before I can begin implementing v2.0 due to time constraints. But as I stated, I'm posting the plans for the next version so that others can critique them before I begin. So, please let me know what you think...Thanks.
I am worried that the change to VS 2005 and C# 2.0 might cause some people headaches, but as far as the code goes, there's not too much tied specifically to C# 2.0. I might be able to provide some .NET 1.1-ready code, but that's dependent on how much time I have, so *shrug*.
Improvement Plans:
1. Separating FixedWidth/Delimited parsing code to reduce overhead of checking internal state and to allow improvements in efficiency for FixedWidth parsing (should increase it dramatically).
2. Adding new events to the GenericParserAdapter.
-OnColumnChangeError(object sender, GenericParserColumnChangeErrorEventArgs e) - Fires when the row being changed throws a constraint error.
-OnConversionError(object sender, GenericParserConversionErrorEventArgs e) - Fires when a row cannot be properly converted to the target type.
-TableNewRow(object sender, GenericParserTableNewRowEventArgs e) - Fires after a new row is created for the table.
-ColumnChanging(object sender, GenericParserColumnChangeEventArgs e) - Fires before a value is placed into a column on the current row.
-ColumnChanged(object sender, GenericParserColumnChangeEventArgs e) - Fires after a value is placed into a column on the current row.
The sender will always be the GenericParserAdapter. The GenericParserAdapter will expose various information about the current line number, row number, column number, etc.
GenericParserColumnChangeErrorEventArgs contains:
-Column - DataColumn
-Row - DataRow
-ProposedValue - Object
-ErrorHandled - bool (false by default); if it is still false upon return, an exception is thrown.
GenericParserConversionErrorEventArgs contains:
-Column - DataColumn
-Row - DataRow
-ProposedValue - Object
-ErrorHandled - bool (false by default); if it is still false upon return, an exception is thrown.
GenericParserTableNewRowEventArgs contains:
-Row - DataRow
GenericParserColumnChangeEventArgs contains:
-Column - DataColumn
-Row - DataRow
-ProposedValue - Object
I'm providing these to be identical to what's currently provided on a DataTable in .NET 2.0. I decided to place the events on the GenericParserAdapter to keep the events on the DataTable 'pristine', separate from the events required for parsing. Plus, I wanted to provide the information exposed through the GenericParserAdapter via the 'sender' parameter. Hence, the design...
Furthermore, the events will allow developers to enhance the parsing however they need. At this point, if someone is using the GenericParserAdapter, it's because they don't want to interface with strings and place them into their own data structure. Performance becomes slightly less important compared to functionality. If you want higher performance, you use the GenericParser, but at the cost of rolling your own 'adapter'.
Just FYI...I'm not too sure if I'll keep EventArgs names that long...but we'll see.
3. You can supply a DataSet/DataTable to the GenericParserAdapter to fill with data (this allows for type casting).
4. Have a flag that indicates whether the incoming values are mapped to the target DataTable via ColumnIndex or ColumnName (a header must be supplied for the latter).
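For what it's worth, the 'ErrorHandled' semantics in the event plan above can be sketched language-agnostically. Here is a minimal Python illustration (the class and handler names are hypothetical, not the actual GenericParserAdapter API): the event fires, each handler may repair the value and mark the error handled, and an exception is raised only if no handler did.

```python
class ConversionErrorEventArgs:
    """Illustrative stand-in for GenericParserConversionErrorEventArgs
    (hypothetical, not the real class): carries the column, row, and
    proposed value, plus an error_handled flag defaulting to False."""
    def __init__(self, column, row, proposed_value):
        self.column = column
        self.row = row
        self.proposed_value = proposed_value
        self.error_handled = False

def fire_conversion_error(handlers, column, row, proposed_value):
    """Fire the event; if no handler sets error_handled, an exception
    is thrown upon return, mirroring the described semantics."""
    args = ConversionErrorEventArgs(column, row, proposed_value)
    for handler in handlers:
        handler(args)
    if not args.error_handled:
        raise ValueError("Cannot convert %r for column %s"
                         % (proposed_value, column))
    return args.proposed_value

# Example handler: substitute None for values that fail to convert.
def substitute_none(args):
    args.proposed_value = None
    args.error_handled = True
```

With `substitute_none` registered, a bad value is silently replaced; with no handlers, the conversion error propagates as an exception.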
Platform changes:
-Using C# 2.0 (though only a small usage of Generics and Nullable types)
-Using VS 2005
-Using VS 2005 Unit Testing (which is almost synonymous with NUnit)
Fixes:
-Fix for truly 'nulling' out EscapeCharacter and CommentCharacter
-Fix for skipping rows bug
-Improving the performance tests
---
Andrew,
Looks good.
One thing I'd also like to see added is support for excluding footer lines. One way of implementing this is to have an internal buffer that is n+1 lines long (where n is the number of lines of footer). When the file is first read the buffer is filled. The next Read() pops the top line off the buffer and reads in the next line to the bottom. EOF occurs when you can't fill the buffer with a line from the file (at which point it contains the last n lines of the file).
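Richard's n+1-line buffer can be sketched as follows (a Python illustration of the algorithm only, not the actual parser code):

```python
def read_excluding_footer(lines, footer_count):
    """Yield every line except the last footer_count lines, using a
    bounded buffer as described: once the buffer holds more than
    footer_count lines, the oldest line is safe to emit; at EOF the
    buffer contains exactly the footer, which is discarded."""
    buffer = []
    for line in lines:
        buffer.append(line)
        if len(buffer) > footer_count:
            yield buffer.pop(0)
```

Since only footer_count + 1 lines are ever held, memory stays constant no matter how large the file is.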
Keep up the good work !
Richard Lavey - Xenomorph
---
Just want to let you know that I'm working on some fixed-width imports and this is a great tool! Awesome job!
---
Thanks
---
It would be nice if there was an option to strip or replace control characters within the data. I have cases where there are zero bytes in the data, and without stripping them, XML and XSLT do not like the generated data.
I also noticed a couple of places in the code where m_chEscapeCharacter and m_chTextQualifier were not being checked against NULL_CHAR before use. I fixed this in Read() by changing the code to:
if (this.m_ParserRowState != RowState.CommentRow)
{
    if (this.m_chEscapeCharacter != NULL_CHAR)
    {
        if (this.m_blnEscapeCharacterFound)
        {
            this.m_blnEscapeCharacterFound = false;
            ++this.m_intCurrentIndex;
            continue;
        }

        // Skip this character and then the next one, so that we ignore the escaped character.
        if (this.m_caBuffer[this.m_intCurrentIndex] == this.m_chEscapeCharacter)
        {
            this.m_blnEscapeCharacterFound = true;
            ++this.m_intCurrentIndex;
            continue;
        }
    }

    if (this.m_chTextQualifier != NULL_CHAR)
    {
        // Text qualifiers cause us to ignore row/column delimiters.
        if (this.m_caBuffer[this.m_intCurrentIndex] == this.m_chTextQualifier)
        {
            this.m_blnInText = !this.m_blnInText;
            ++this.m_intCurrentIndex;
            continue;
        }

        // If we're still within text, we don't care about the row/column delimiters.
        if (this.m_blnInText)
        {
            ++this.m_intCurrentIndex;
            continue;
        }
    }
}
Thanks
---
Could you explain a little more? I wasn't aware that you could find a 'Null' character in a file, unless I'm missing something. So, it shouldn't matter if you check to see if the TextQualifier is 'Null' or not. Any examples would be great.
As for the other request about stripping control characters, could you provide an example as well what you're asking for.
Thanks.
---
I have several cases where files generated from mainframe systems have embedded zero bytes (i.e., '\0'). Using your parser and C#, a string can get created with an embedded '\0', and not always at the end of the string. C# just treats it like any other character, but if you try to use that string in an XML document and/or send it through the XSLT processor, it causes a problem.
If the text qualifier and/or escape characters are set to '\0' to mean that none was specified, your code should not treat '\0' as the actual escape character or text qualifier. For example, if it thinks '\0' is the escape character and a '\0' appears in a field, it may treat it as an escape and fail to recognize the column delimiter following the '\0'.
For now, I am stripping all control characters using Char.IsControl(), but that means I have to loop through every character of every string returned from the parser. It would be nice if the parser could be configured to either strip or replace control characters with some other character. I have examples with '\t', '\0', 0x1a, and others.
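Joe's workaround amounts to something like the following (a Python sketch; .NET's Char.IsControl roughly corresponds to Unicode general category Cc):

```python
import unicodedata

def scrub_control_chars(value, replacement=""):
    """Strip (or replace) control characters -- Unicode category Cc,
    which covers '\0', '\t', 0x1a, etc. -- from a parsed field."""
    return "".join(
        replacement if unicodedata.category(ch) == "Cc" else ch
        for ch in value
    )
```

Passing a replacement character instead of the default empty string turns stripping into substitution.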
Thanks for your quick reply,
Joe
---
I don't know when I'll get to these changes (I'm under a time crunch at work), but I'll see what I can do.
I can make the control characters and such 'nullable', so that I can use null as the 'unset' state. I wasn't aware that you could put a '\0' into a file, but I guess I should have at least tried it ;-D.
Anyway, does that solve your problem on both ends?
If not email me back and we'll discuss offline what you'd need to have happen.
---
Yes, making the control characters and such 'nullable' will fix my first problem. My other issue is stripping control characters, which I can do outside of the parser, unless you think that's a generally useful feature.
Thanks
---
I will probably leave that sort of functionality up to the user to either modify the code or process their data afterwards.
I'm actually putting together a design for the next release of the tool, which will include events/type casting/performance improvements. I'm hoping it'll make using the GenericParser easier. When I've sorted out a few more details, I'll post my final thoughts on this thread in case anyone has comments to add.
---
I find myself wondering about the actual intended purpose of the text qualifier. It defaults to the double quote (").
This causes severe problems when encountered in normal fashion:
How tall are you?,72" tall,All the rest of the text on the line gets put into the same column,unless the remainder of the row is longer than 1024 characters,which causes an error in the parsing and crashes the code
IIRC, the CSV spec calls for a text qualifier that really only gets checked at the beginning of a column:
no text qualifier,"a qualifier because a comma (,) is in this column","a qualifier, because I want commas,,, but double quotes ("") are treated specialy.",if the line doesn't begin with the text qualifier then "quote marks" are ignored mid phrase
I believe the GenericParser falls short in this regard.
Is there a workaround?
---
I had a little trouble figuring out exactly what you were asking, but I can at least give you an answer to your first question.
To overcome the issue you're seeing where the data does not contain 'quoted' text, you can just set the text qualifier to null ('\0').
The intent of the text qualifiers was for use at the beginning/end of a column to prevent any internal 'delimiter' characters from being recognized as delimiters; instead they're treated as data. For example:
A,B,"
C,and more text",D
Would translate to:
Column 1: A
Column 2: B
Column 3:
C,and more text
Column 4: D
Now, from what I've seen of commonly used formats, this is typical. But I don't mean to say I'm an expert on what everyone else is doing, so I'm open to changing this behavior.
I do recognize the shortcoming I believe you're trying to point out. The only change that would seem prudent in this regard would be the following behavior:
1) Text qualifiers are only checked at the initial start of a column.
2) If a string did not start with a text qualifier, then all of the text qualifiers in the string are ignored.
3) If a string did start with a text qualifier, then a text qualifier in the string is treated as the 'end', unless of course it is doubled up for escaping purposes. This would also impact how the parser handles 'escaped' characters, but that is not too big of a deal.
Would that seem to solve your problems?
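For the record, rules 1-3 above can be sketched as follows (a Python illustration only, not the parser's actual implementation):

```python
def split_row(line, delimiter=",", qualifier='"'):
    """Split one CSV row under the proposed rules: (1) a text
    qualifier is only recognized at the start of a column; (2) a
    column that does not start with the qualifier keeps any
    qualifiers verbatim; (3) inside a quoted column, a doubled
    qualifier is an escaped literal and a single one ends the text."""
    columns, i, n = [], 0, len(line)
    while i <= n:
        if i < n and line[i] == qualifier:
            # Quoted column: scan for the closing qualifier.
            i += 1
            buf = []
            while i < n:
                if line[i] == qualifier:
                    if i + 1 < n and line[i + 1] == qualifier:
                        buf.append(qualifier)  # doubled -> literal quote
                        i += 2
                    else:
                        i += 1  # closing qualifier found
                        break
                else:
                    buf.append(line[i])
                    i += 1
            columns.append("".join(buf))
            i += 1  # step over the delimiter (or past end of line)
        else:
            # Unquoted column: qualifiers inside are plain data.
            j = line.find(delimiter, i)
            if j == -1:
                columns.append(line[i:])
                break
            columns.append(line[i:j])
            i = j + 1
    return columns
```

On an unquoted row like a,b,c"c,d"d,e""e every embedded quote survives verbatim (rule 2), while a quoted column such as "C,and more text" keeps its internal comma as data.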
---
Yes, it seems you understood both what I'm asking and what the intended behavior should be.
Setting the text qualifier to '\0' does fix the case where the file doesn't use any text qualifiers, but some of the files I've been given to parse use a text qualifier in some places and "misuse" it in others, so a fix like this would have to be implemented.
Also, in my testing yesterday, I came across a couple of other cases that your code didn't handle properly:
a,b,"c",d
yours handles properly as:
a
b
c
d
but even according to your "spec", it doesn't properly handle
a,b,c"c,d"d,e""e
it takes it apart as:
a
b
c"c,d"d
e""e
If it is treating it as a text qualifier, shouldn't it remove the quotes? Or, in the case of adjacent double quotes, return one?
---
Well, the problem with the old code was that it merely looked for a text qualifier and then said, "OK, I'm in text now; ignore delimiters." It didn't take into account where that text qualifier was found. I've already corrected the current version of the code.
All I need to do now is add unit tests and update the code that extracts the final string from what it perceives as the column. I have to add the functionality that doubling the text qualifier will just escape the character and that should be about it. I never implemented anything like that in version 1.0.
Now, just to make sure, once I've implemented the changes I'm (you're) proposing, the example of:
a,b",c"c,d""d,"e""e",""f
Would be parsed as:
a
b"
c"c
d""d
e"e
""f
I was using the ADO text driver as a basis for what's 'expected' functionality wise, but I'm not 100% sure if that's what you're asking for.
As a special case, I'm basically treating anything not marked as text (meaning surrounded with the text qualifiers) as literal text. So, if you double up the text qualifier and you're not in text, you don't assume it's trying to escape anything, because there's no need to escape it. I could definitely make this an option to toggle, but I'm not sure if I'm just creating superfluous code here. Thoughts?
---
I think you're right on the money here.
I don't think that it should escape it. I can see that someone might potentially want it to escape this (which would make d"d in your above example), but I don't think people will actually want it. It's not very useful.
Curious, when can we expect a fix for this?
Yesterday, I started creating an uber hack for it, and it's ugly, but I'm not very far. I'll need a fix in the next week or two, so I may have to hack it myself.
---
I'll see what I can do this weekend. I've got about half of the code changes done already. I might just go ahead and release a point release since I've got a bug I've gotta squash anyway. We'll see what comes of this weekend. I could also send you an updated version of the parser ahead of a 'formal' release.
We'll just have to play it by ear...
---
k, thanks. Please keep me posted
---
Just in case anyone else is experiencing the same woes, I've got a beta version of v1.1 available. It's unit tested, so really the only thing beta about it is that I've got some other additions I want to make before releasing it.
Other than that, it's good to go.
Email me through Codeproject if you want the complete project sent to you.
If I get overloaded with requests, I'll just go through and post it up here. I'll keep everyone updated via this page.
---
Nice, I think I'll wait until it actually gets released. Do you know when that will be?
---
Not sure exactly when that'll be. I'm planning on improving the efficiency of the GenericParser and adding to the GenericParserAdapter the ability to type cast the data coming out (with events to further help transform the data). So, probably not for at least two weeks from now. Up to you...
---
So, we found a CSV that we need to parse, and a bug in their CSV (on row 14 thousand something) fatally exposed your improper handling of text qualifiers. So I would like the fixed version of this code. Is it something that could be available to me now?
McKay Salisbury
---
Check your hotmail email account and you shall find what you're looking for.
---
Hey guys...sorry for the long delay in producing the next version. Life/work/etc. has gotten in the way of continued development. I do plan on adding some type casting to this class and improving the efficiency of the current code. So, eventually it'll have some more bells and whistles...
Just a note: I will also be fixing the small bug where the code skips lines when a line happens to fall on the end of a buffer. I've applied the following patch to the code, which I'll offer to you as well until I'm ready to formally release the next version.
Replace your current Read() function on the GenericParser class (GenericParser.cs) with this:
public bool Read()
{
    this._CheckDiposed();

    // Setup some internal variables for the parsing.
    this._InitializeParse();

    // Do we need to stop parsing rows?
    if (this.m_ParserState == ParserState.Finished)
        return false;

    // Read chunks of the data into the buffer, then parse each chunk for data.
    while (this._ReadDataIntoBuffer())
    {
        while (this.m_intCurrentIndex < this.m_intCharactersInBuffer)
        {
            // If we're in the state of parsing the row type, it means we're at the beginning of a row.
            if (this.m_ParserRowState == RowState.GetRowType)
            {
                this._ParseRowType();

                // In the event that we read in a comment row, we need to skip
                // the comment character, just in case they use it for something
                // else in the file (row delimiter possibly).
                if (this.m_ParserRowState == RowState.CommentRow)
                {
                    ++this.m_intCurrentIndex;
                    continue;
                }
            }

            ////////////////////////////////////////////////
            // At this point, we're parsing character by  //
            // character to find the end of a row/column. //
            ////////////////////////////////////////////////

            // If we're in a comment, we want to bypass any special considerations
            // and just find the end.
            if (this.m_ParserRowState != RowState.CommentRow)
            {
                if (this.m_blnEscapeCharacterFound)
                {
                    this.m_blnEscapeCharacterFound = false;
                    ++this.m_intCurrentIndex;
                    continue;
                }

                // Skip this character and then the next one, so that we ignore the escaped character.
                if (this.m_caBuffer[this.m_intCurrentIndex] == this.m_chEscapeCharacter)
                {
                    this.m_blnEscapeCharacterFound = true;
                    ++this.m_intCurrentIndex;
                    continue;
                }

                // Text qualifiers cause us to ignore row/column delimiters.
                if (this.m_caBuffer[this.m_intCurrentIndex] == this.m_chTextQualifier)
                {
                    this.m_blnInText = !this.m_blnInText;
                    ++this.m_intCurrentIndex;
                    continue;
                }

                // If we're still within text, we don't care about the row/column delimiters.
                if (this.m_blnInText)
                {
                    ++this.m_intCurrentIndex;
                    continue;
                }
            }

            if (this._IsEndOfRow())
            {
                // Move back one character to get the last character in the column
                // (ended with row delimiter).
                if (!this._IsEmptyRow(this.m_intCurrentIndex))
                {
                    if ((this.m_ParserRowState == RowState.DataRow)
                     || (this.m_ParserRowState == RowState.SkippedRow))
                        ++this.m_intDataRowNumber;

                    if ((this.m_ParserRowState == RowState.DataRow)
                     || (this.m_ParserRowState == RowState.HeaderRow))
                        this._ExtractColumn(this.m_intCurrentIndex - 1);
                }

                // Add the length of the RowDelimiter to the CurrentIndex to move us along.
                this.m_intCurrentIndex += this.m_caRowDelimiter.Length;
                this.m_intCurrentColumnStartIndex = this.m_intCurrentIndex;

                // Ensure that we have some data before trying to do something with it.
                // This prevents problems with empty rows.
                if (this.m_saData.Count > 0)
                {
                    // Have we got a row that meets our expected number of columns?
                    if ((this.m_intExpectedColumnCount > 0) && (this.m_saData.Count != this.m_intExpectedColumnCount))
                        throw new ParsingException(string.Format("Number of columns ({0}) differs from the expected column count ({1}).",
                            this.m_saData.Count,
                            this.m_intExpectedColumnCount),
                            this.m_intFileRowNumber);

                    // If we were in a data row, we need to stop.
                    if (this.m_ParserRowState == RowState.DataRow)
                    {
                        // We only indicate that we've found a 'valid' end of row when there's data.
                        this.m_blnEndOfRowFound = true;
                        break;
                    }
                    else if (this.m_ParserRowState == RowState.HeaderRow)
                        this._SetColumnNames();
                }

                this.m_ParserRowState = RowState.GetRowType;
                continue;
            }
            else if ((this.m_ParserRowState != RowState.CommentRow) && this._IsEndOfColumn())
            {
                // Mark that this row has an end of column (ensures that we didn't find an empty row).
                this.m_blnEndOfColumnFound = true;

                // Move back one character to get the last character in the column
                // (ended with column delimiter).
                if ((this.m_ParserRowState == RowState.DataRow)
                 || (this.m_ParserRowState == RowState.HeaderRow))
                    this._ExtractColumn(this.m_intCurrentIndex - 1);

                // Add the length of the ColumnDelimiter to the CurrentIndex to move us along.
                if (!this.m_blnFixedWidth)
                    this.m_intCurrentIndex += this.m_caColumnDelimiter.Length;

                // Set the start index at the start of a new column.
                this.m_intCurrentColumnStartIndex = this.m_intCurrentIndex;
                continue;
            }
            else
                ++this.m_intCurrentIndex;
        }

        // We found the end of a row; return normally.
        if (this.m_blnEndOfRowFound)
            return (this.m_saData.Count > 0);
        else
        {
            //////////////////////////////////////////////////
            // At this point, the buffer has been expended. //
            //////////////////////////////////////////////////

            // We ran out of data; flush out the last column and return.
            if (this.m_BufferState == BufferState.NoFetchableData)
            {
                this.m_BufferState = BufferState.NoDataLeft;
                this.m_ParserState = ParserState.Finished;

                if (!this._IsEmptyRow(this.m_intCurrentIndex))
                {
                    if ((this.m_ParserRowState == RowState.DataRow)
                     || (this.m_ParserRowState == RowState.SkippedRow))
                        ++this.m_intDataRowNumber;

                    // Move back one character to get the last character in the column (ended with EOF).
                    if (this.m_ParserRowState == RowState.DataRow)
                    {
                        // There's one column left to extract.
                        this._ExtractColumn(this.m_intCurrentIndex - 1);
                        return (this.m_saData.Count > 0);
                    }
                }

                // There's nothing left to extract.
                return false;
            }
            else
            {
                // Move the leftover data in the buffer to the front and start over.
                this._CopyRemainingDataToFront(this.m_intCurrentColumnStartIndex);

                // Indicate that we need to fetch more data.
                this.m_BufferState = BufferState.FetchData;
                continue;
            }
        }
    }

    this.m_ParserState = ParserState.Finished;
    return false;
}
---
I do have a beta v1.1 out that fixes the problem with skipping rows and a problem found with TextQualifiers, if you need this before I make a formal release let me know via email. There are no known bugs in the beta, as I've performed adequate unit testing on these changes. I'm just waiting on releasing it formally until I've completed more of my goals for the next release.