Introduction
I’ve been working on a time-series analysis project where the data are stored as structures in massive binary files. Importing the files into a database would cause a performance hit with no value added, so dealing with the files in their original binary format is the best option. My initial assumption was that throughput would be limited by disk speed, but I found that my first implementation resulted in 100% CPU utilization on my research box. It was obviously time to optimize.
While there is a wealth of information available on the innumerable ways of reading files with C#, there is virtually no discussion about the performance implications of various design decisions. Hopefully, this article will allow the reader to improve the performance of binary file reading in their application and will shed some light on some of the undocumented performance traps hidden in the System.IO classes.
Is there Data?
It may seem silly to have a section on checking for the end of a file (EOF), but there are a plethora of methods employed by programmers, and improperly checking for the EOF can absolutely cripple performance and introduce mysterious errors and exceptions to your application.
BinaryReader.PeekChar Method
If you are using this method in any application, god save you. Based on its frequent appearance in .NET newsgroups, this method is widely used, but I’m not sure why it even exists. According to Microsoft, the BinaryReader.PeekChar method “Returns the next available character and does not advance the byte or character position.” The return value is an int containing “The next available character, or -1 if no more characters are available or the stream does not support seeking.” Gee, that sounds awfully useful in determining if we’re at the end of the stream.
The BinaryReader class is used for reading binary files, which are broken into bytes, not chars, so why peek at the next char rather than the next byte? I could understand if there was an issue implementing a common interface, but the TextReader derived classes just use Peek. Why doesn’t the BinaryReader include a plain old Peek method that returns the next byte as an int? By now, you’re probably wondering why I’m ranting so much about this. Who cares? So, you get the next byte for free? Well, something entirely unnatural happens somewhere in the bowels of this method that periodically results in a “Conversion Buffer Overflow” exception. As the result of some dark voodoo process, certain two-byte combinations in your binary file cannot be converted into an appropriate return value by the method. I have no idea why certain byte combinations have been deemed toxic to PeekChar, but prepare for freaky results if you use it.
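For a seekable stream such as a FileStream, a byte-level peek can be sketched without involving any character decoding. The helper below is hypothetical: the names `StreamPeek` and `PeekByte` are not part of System.IO. It reads one byte with Stream.ReadByte and, if a byte was available, seeks back over it.

```csharp
using System;
using System.IO;

public static class StreamPeek
{
    // Hypothetical helper (not a System.IO API): returns the next byte as an int
    // without consuming it, or -1 at the end of the stream.
    public static int PeekByte(Stream stream)
    {
        if (!stream.CanSeek)
            throw new NotSupportedException("PeekByte needs a seekable stream.");

        int b = stream.ReadByte();               // -1 signals end of stream
        if (b != -1)
            stream.Seek(-1, SeekOrigin.Current); // step back over the byte just read
        return b;
    }
}
```

This still costs a read and a seek per call, so it is a correctness fix rather than a fast way to test for EOF in a tight loop.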

Stream.Position >= Stream.Length
This test is pretty straightforward. If your current position is greater than or equal to the length of the stream, you’re going to be pretty hard-pressed to read any additional data. As it turns out, this statement is a massive performance bottleneck.
After finishing the initial build of my application, it was time for some optimization. I downloaded the ANTS Profiler Demo from Red Gate Software, and was shocked to find that over half the execution time of my program was being spent in the EOF method of my data reader. Without the profiler results, I never would have imagined that this innocuous-looking line of code was cutting the performance of my application in half. After all, I opened the FileStream using the FileShare.Read option, so there was no danger of the file’s length changing, but it appears as though the position and file length are not cached by the class, so every call to Position or Length results in another file system query. In my benchmarking, I’ve found that calling both Position and Length takes twice as long as calling one or the other.
_position >= _length (Cache it yourself)
It’s sad, but true. This is the fastest method by a long shot. Get the length of your FileStream once when you open it, and don’t forget to advance your position counter every time you read. Maybe Microsoft will fix this performance trap someday, but until then, don’t forget to cache the file length and position yourself!
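A minimal sketch of the cache-it-yourself idea follows. The class name `CachedEofReader` and its members are made up for illustration and are not the test app's code. It assumes the file cannot change length while it is open (for example, because it was opened with FileShare.Read).

```csharp
using System;
using System.IO;

// Hypothetical wrapper: queries the stream length once and tracks position by hand.
public sealed class CachedEofReader : IDisposable
{
    private readonly BinaryReader _reader;
    private readonly long _length;   // read from the stream once, at construction
    private long _position;          // advanced manually on every read

    public CachedEofReader(Stream stream)
    {
        _reader = new BinaryReader(stream);
        _length = stream.Length;
        _position = stream.Position;
    }

    // Two field comparisons; no Position or Length calls on the stream.
    public bool EOF
    {
        get { return _position >= _length; }
    }

    public long ReadInt64()
    {
        long value = _reader.ReadInt64();
        _position += sizeof(long);
        return value;
    }

    public byte[] ReadBytes(int count)
    {
        byte[] data = _reader.ReadBytes(count);
        _position += data.Length;    // ReadBytes can return fewer bytes near the end
        return data;
    }

    public void Dispose()
    {
        _reader.Close();
    }
}
```

A typical loop would then be `while (!reader.EOF) { ... }`, with every read going through the wrapper so the counter stays in sync.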
Read It!
Now that we know there’s data, we have to read it into our data structures. I’ve included three different approaches, with varying merits. I did not include the unsafe approach of casting a byte array of freshly read data into a structure because I prefer to avoid unsafe code if at all possible.
FileStream.Read with PtrToStructure
Logically, I assumed that the fastest way to read in a structure would be the functional equivalent of C++’s basic_istream::read method. There are plenty of articles and newsgroup posts about using the Marshal class in order to torture raw bits into a struct. The cleanest implementation I’ve found is this:
public static TestStruct FromFileStream(FileStream fs)
{
    // Read one structure's worth of bytes; Read may return fewer than requested.
    byte[] buff = new byte[Marshal.SizeOf(typeof(TestStruct))];
    int amt = 0;
    while (amt < buff.Length)
    {
        int n = fs.Read(buff, amt, buff.Length - amt);
        if (n == 0)
            throw new EndOfStreamException(); // truncated file: don't loop forever
        amt += n;
    }
    // Pin the buffer so the GC cannot move it while it is marshaled.
    GCHandle handle = GCHandle.Alloc(buff, GCHandleType.Pinned);
    TestStruct s = (TestStruct)Marshal.PtrToStructure(
        handle.AddrOfPinnedObject(), typeof(TestStruct));
    handle.Free();
    return s;
}
BinaryReader.ReadBytes with PtrToStructure
This approach is functionally almost identical to the FileStream.Read approach, but I provided it as a more apples-to-apples comparison to the other BinaryReader approach. The code is as follows:
public static TestStruct FromBinaryReaderBlock(BinaryReader br)
{
    int size = Marshal.SizeOf(typeof(TestStruct));
    byte[] buff = br.ReadBytes(size);
    // ReadBytes returns a shorter array if the stream ends early.
    if (buff.Length < size)
        throw new EndOfStreamException();
    GCHandle handle = GCHandle.Alloc(buff, GCHandleType.Pinned);
    TestStruct s = (TestStruct)Marshal.PtrToStructure(
        handle.AddrOfPinnedObject(), typeof(TestStruct));
    handle.Free();
    return s;
}
BinaryReader with individual Read calls for structure fields
I assumed that this would be the slowest method for filling my data structures, and it was certainly the least sexy approach. Here’s the relevant sample code:
public static TestStruct FromBinaryReaderField(BinaryReader br)
{
    // Fields must be read in the same order they were written to disk.
    TestStruct s = new TestStruct();
    s.longField = br.ReadInt64();
    s.byteField = br.ReadByte();
    s.byteArrayField = br.ReadBytes(16);
    s.floatField = br.ReadSingle();
    return s;
}
Results
As I’ve already foreshadowed, my assumptions about the performance of the various read techniques were entirely wrong for my data structures. Using the BinaryReader to populate the individual fields of my structures was more than twice as fast as the other methods. These results are highly sensitive to the number of fields in your structure. If you are concerned about performance, I recommend testing both approaches. I found that, at about 40 fields, the results for the three approaches were almost equivalent, and beyond that, the block reading approaches gained the upper hand.
Using the Test App
I’ve thrown together a quick benchmarking application with simplified reading classes to demonstrate the techniques outlined so far. It has facilities to generate sample data and benchmark the three reading approaches with dynamic and cached EOF sensing.
Generating Test Data
By default, test data is created in the same directory as the executable with the filename “sampledata.bin”. The number of records to be created can be varied. Ten million records will take up a little bit more than 276 MB, so make sure you have enough disk space to accommodate the data. The ‘Randomize Output’ checkbox determines whether each record will be created using random data to thwart NTFS’s disk compression. Click the ‘Generate Data’ button to build the file.
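The file size quoted above can be checked with a little arithmetic. The sketch below assumes each record is stored with no padding, using the field sizes from the field-by-field reader earlier (an 8-byte long, 1 byte, a 16-byte array and a 4-byte float). The class name `SampleDataSize` is made up for illustration.

```csharp
using System;

public static class SampleDataSize
{
    // Assumption: fields are packed with no padding between them.
    public const int RecordSize = 8 + 1 + 16 + 4; // 29 bytes per record

    public static long FileBytes(long recordCount)
    {
        return recordCount * RecordSize;
    }

    // Size in megabytes, counting 1 MB as 1024 * 1024 bytes.
    public static double FileMegabytes(long recordCount)
    {
        return FileBytes(recordCount) / (1024.0 * 1024.0);
    }
}
```

Under that assumption, ten million records come to 290,000,000 bytes, or roughly 276.6 MB.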
Benchmarking
Benchmarking results are more reliable when averaged over many trials. Adjust the number of trials for each test scenario using the ‘Test Count’ box. ‘Update Frequency’ can be used to adjust how frequently the status bar will inform you of progress. Designate an update frequency greater than the number of records to avoid including status bar updates in your benchmark results. The ‘Drop Best and Worst Trials from Average’ check box will omit the longest and shortest trial from the average entry; they will still be listed in the ‘Results’ ListView. Select the readers to be tested using the checkboxes; ‘BinaryReader Block’ corresponds to the PtrToStructure approach. Select the ‘EOF detection’ methods to test; ‘Dynamic’ uses the Length and Position properties each time EOF is called. Click ‘Run Tests’ to generate results.
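The idea behind dropping the best and worst trials can be sketched as follows. This is not the test app's actual code; the names `TrialStats` and `TrimmedAverage` are hypothetical, and the fallback for fewer than three trials is an assumption made for this sketch.

```csharp
using System;

public static class TrialStats
{
    // Mean of the trial times with the single fastest and single slowest removed.
    // Falls back to a plain mean when there are fewer than three trials.
    public static double TrimmedAverage(double[] seconds)
    {
        if (seconds == null || seconds.Length == 0)
            throw new ArgumentException("At least one trial is required.");

        double sum = 0, min = seconds[0], max = seconds[0];
        foreach (double s in seconds)
        {
            sum += s;
            if (s < min) min = s;
            if (s > max) max = s;
        }

        if (seconds.Length < 3)
            return sum / seconds.Length;

        return (sum - min - max) / (seconds.Length - 2);
    }
}
```

For trial times of 1, 2, 3 and 10 seconds, the trimmed average is 2.5 seconds, whereas the plain mean would be 4 seconds.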
Miscellaneous Findings
StructLayoutAttribute
If you’re reading pre-defined binary files, you will become very familiar with the StructLayoutAttribute. This attribute allows you to tell the compiler specifically how to lay out a struct in memory using the LayoutKind and Pack parameters. Marshaling a byte array into a structure where the memory layout differs from its layout on disk will result in corrupted data. Make sure they match.
Warning! Depending on the way a structure is saved, you may need to read and discard empty packing bytes between reading fields when using the BinaryReader.
MarshalAsAttribute
Be sure to use the MarshalAsAttribute for all fixed-width arrays in your structure. Structures with variable-length arrays cannot be marshaled to or from pointers.
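As an illustration of both attributes, here is one way a structure with the fields used in this article might be declared. The article does not show the real TestStruct definition, so the layout below (sequential, Pack = 1, a 16-byte inline array) is an assumption inferred from the field-by-field reader, and `LayoutCheck` is a made-up helper.

```csharp
using System;
using System.Runtime.InteropServices;

// Assumed layout: sequential fields with Pack = 1, i.e. no padding bytes.
[StructLayout(LayoutKind.Sequential, Pack = 1)]
public struct TestStruct
{
    public long longField;                                 // 8 bytes
    public byte byteField;                                 // 1 byte
    [MarshalAs(UnmanagedType.ByValArray, SizeConst = 16)]
    public byte[] byteArrayField;                          // 16 bytes, stored inline
    public float floatField;                               // 4 bytes
}

public static class LayoutCheck
{
    // Size of the marshaled (on-disk) representation of TestStruct.
    public static int MarshaledSize()
    {
        return Marshal.SizeOf(typeof(TestStruct));
    }
}
```

With this declaration the marshaled size is 29 bytes. With default packing, padding would be inserted to align the fields and the size would no longer match a tightly packed file.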
Writing Data
Writing binary data can be accomplished in the same ways as reading. I imagine that the performance considerations are very similar as well. So, writing out the fields of a structure using the BinaryWriter is probably optimal for small structures. Larger structures can be marshaled into byte arrays using this pattern:
public byte[] ToByteArray()
{
    byte[] buff = new byte[Marshal.SizeOf(typeof(TestStruct))];
    // Pin the buffer so the marshaler can write into it directly.
    GCHandle handle = GCHandle.Alloc(buff, GCHandleType.Pinned);
    Marshal.StructureToPtr(this, handle.AddrOfPinnedObject(), false);
    handle.Free();
    return buff;
}
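For the small-structure case, field-by-field writing might look like the sketch below. It mirrors the field-by-field reader shown earlier. The struct here is a plain stand-in for the article's TestStruct, and the name `TestStructWriter` and the check on the array length are additions for illustration.

```csharp
using System;
using System.IO;

// Plain stand-in for the article's TestStruct (no marshaling attributes needed here).
public struct TestStruct
{
    public long longField;
    public byte byteField;
    public byte[] byteArrayField; // expected to hold exactly 16 bytes
    public float floatField;
}

public static class TestStructWriter
{
    // Writes the fields in the same order the field-by-field reader expects them.
    public static void WriteFields(BinaryWriter bw, TestStruct s)
    {
        if (s.byteArrayField == null || s.byteArrayField.Length != 16)
            throw new ArgumentException("byteArrayField must hold 16 bytes.");

        bw.Write(s.longField);
        bw.Write(s.byteField);
        bw.Write(s.byteArrayField); // Write(byte[]) emits the raw bytes, no length prefix
        bw.Write(s.floatField);
    }
}
```

Each call emits 29 bytes, so a file written this way has no packing bytes to skip when it is read back.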
Marshal.SizeOf
Even small changes to a method can yield a significant boost to performance when the method is called millions or billions of times during the execution of a program. Apparently, Marshal.SizeOf is evaluated at runtime even when there is a call to typeof as the parameter. I shaved several minutes off of my application’s execution time by creating a class with a static Size property to use in place of Marshal.SizeOf. Since the return value is calculated every time the application is started, the dangers of using a constant for size are avoided.
internal sealed class TSSize
{
    // Computed once, when the class is first used.
    private static readonly int _size;

    static TSSize()
    {
        _size = Marshal.SizeOf(typeof(TestStruct));
    }

    public static int Size
    {
        get { return _size; }
    }
}
 |
Hi,
I liked reading your article and appreciate you doing the profiling. I would've never thought that Position and Length are not cached.
My question is, what about read buffers? I have written an ID3 Tag library to read ID3 tags from MP3 files. ID3 tags exist at the beginning of the file and an older tag exists at the end of the file. An ID3 tag can have a cover art JPG image in it. So what I'm doing is reading 512k bytes into a byte array. Then I'm creating a MemoryStream from that byte array. Would there be a large performance hit from doing it in the MemoryStream, rather than just parsing the byte array?
I'm not going to change my current implementation any time soon. If there is a large hit, then I can just create my own MemoryStream class. Before I do though, I want to know if anyone has done some profiling with MemoryStreams.
 |
Hi,
Nice article, however my problem is slightly different and I am not getting an answer from anyone, because Microsoft never provides any performance benchmarking. I am creating a benchmark myself, but still thought I would ask you.
We have a web server and I have a million small images, each stored as a plain ID.dat file (ID is a long number) starting from 1; right now the count is close to a million. All are stored in one folder, and the problem is that taking backups and dealing with individual files is slow.
Now I want to write a file server that combines 1,000 small files into one big flat file, appended one after another, of course in a format with a header describing each file chunk followed by the chunk itself. I would also add CRC values in between to check and maintain consistency.
Now the question is: when a client requests a file, should I open a new file handle, read this big file from the offset for the given length and deliver the file, or should I keep one file open all the time, guard it with the "lock" keyword, and deliver the file bytes to clients from multiple threads?
For example...
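public void DeliverFile(HttpResponse response, long startOffset, long length)
{
    // Option 1: open a fresh read-only handle for every request.
    using (FileStream fs = new FileStream(_fileName, FileMode.Open, FileAccess.Read, FileShare.Read))
    {
        fs.Seek(startOffset, SeekOrigin.Begin);
        byte[] buffer = new byte[5120]; // 5k
        long remaining = length;
        while (remaining > 0)
        {
            int count = fs.Read(buffer, 0, (int)Math.Min(buffer.Length, remaining));
            if (count == 0) break;
            response.OutputStream.Write(buffer, 0, count);
            remaining -= count;
        }
    }
}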
OR...
FileStream fs; // class member, always initialized and open...
static Object fileLock = new Object();
public void DeliverFile(HttpResponse response, long startOffset, long length)
{
    // Option 2: one shared handle; the lock serializes Seek + Read across threads.
    lock (fileLock)
    {
        byte[] buffer = new byte[5120]; // 5k
        fs.Seek(startOffset, SeekOrigin.Begin);
        long remaining = length;
        while (remaining > 0)
        {
            int count = fs.Read(buffer, 0, (int)Math.Min(buffer.Length, remaining));
            if (count == 0) break;
            response.OutputStream.Write(buffer, 0, count);
            remaining -= count;
        }
    }
}
Please note, this code is just an idea and it's not tested. I just need to know which approach will be faster. I assume that opening and closing files will definitely put some load on the disk, since it needs to keep updating the last update record of each file, whereas keeping one file open all the time will not put load on the machine.
Combining various files into one file makes it easier to take backups, keep multiple copies and move data to various places.
Programming is fun. -Akash Kava
 |
MonoDevelop's import functions didn't work properly, so I created a new project and referenced Mono's System.dll, System.Data.dll, System.Drawing.dll, and System.Windows.Forms.dll; it compiled successfully. I ran the resulting binary from the command line ("mono CSharpFileIODemo.exe") and the app opened and functioned properly under Ubuntu 8.04. The relative performance of the different read methods was the same as shown under Windows.
 |
Anthony,
I was looking through your code and immediately thought that things were a bit unfairly matched because in your "Marshalling" method every time you make a read / write to the stream you create a new buffer, a new handle and free it. Since the structure does not (usually) change during runtime then why don't you create the buffer once, the handle (and pin it) once and then free it once all the reading is done?
I've put together a simple benchmark (note I didn't do the manual method of BinaryReader.ReadInt32 etc). In this one it uses both a normal FileStream and a BinaryReader/Writer and you'll see from this[^] screenshot that there is quite a drastic difference (this sample reads / writes 100,000 structures, and the time is given in ticks).
You can see the sample code here[^]. Note it'll require .NET 3.5 / VS2008 because I used a couple of LINQ statements etc to compact the code.
It can be converted for use on 2.0 quite easily; in fact, the core methods, apart from the use of the var keyword, are normal .NET 2.0.
 |
Hello, it seems like you did a good job on this example...thank you so much.
but i have a question, in your example, the struct had values of known size, like int, double, even the bytes array you created you have specified a fixed size for it,
the problem now...is that i want to do something like this:
public struct blablabla
{
    public ArrayList variableSizedData;
    public string variableSizedString;
}
given that i know where exactly in the file does each value start and end... what is the fastest approach then? because it now takes me 13 seconds to read and parse and mount into memory a 175MB DBF file... and i really really need to make them not more than 5 secs 
i currently use something like this:
Stream myDBFStream = new FileStream(DBaseFilePath, FileMode.Open, FileAccess.Read);
BinaryReader myBinReader = new BinaryReader(myDBFStream);
//reading the header data (first 32 bytes)
byte[] firstThirtyTwo = myBinReader.ReadBytes(32);
dBaseheader = new DBHeader();
dBaseheader.VersionNumber = (VersionType)firstThirtyTwo[0];
DateTime dt = new DateTime(firstThirtyTwo[1] + 1900, firstThirtyTwo[2], firstThirtyTwo[3]);
dBaseheader.DateOfLastUpdate = dt;
dBaseheader.NumberOfRecords = BitConverter.ToInt32(firstThirtyTwo, 4);
dBaseheader.LengthOfHeader = BitConverter.ToInt16(firstThirtyTwo, 8);
dBaseheader.LengthOfEachRecord = BitConverter.ToInt16(firstThirtyTwo, 10);
dBaseheader.IncompleteTransaction = BitConverter.ToBoolean(firstThirtyTwo, 14);
dBaseheader.EncryptionFlag = BitConverter.ToBoolean(firstThirtyTwo, 15);
dBaseheader.LanguageDriver = firstThirtyTwo[29];
and then i loop on each record using the number of records property... and inside each loop of those....i make a nested loop to parse and fill the fields in each record one by one...
it's a bit of a drama here the code works...its just 13 secs is too damn much... any ideas???
ZooM
 |
There isn't any obvious way to optimise your data reading much further, but it's possible that the constructor for your DBHeader class is what is slowing things down. Move the line "dBaseheader = new DBHeader();" to a point outside your loop (before it, obviously) and see how the speed goes. This will tell you if that's where your delays are occurring. If this is the issue, consider creating an array of structures instead of instantiating objects (if this is possible based on your data). If this is not the issue, try skipping the file stream ahead by 32 bytes each time instead of reading them, to see if it's the ReadBytes(32) that is causing performance issues.
Of course, use a profiler and you don't need to do trial and error...
Cheers, Jason
 |
Binary File Reading with C#
I have my own binary file reading code in C#, but I want to display the data in a data grid.
I don't know how to find the headers in my binary file. Please help me.
 |
Hi Anthony
Thanks for the article - I had no idea that Position/Length would be slow for a File Stream - useful information.
With regard to your PeekChar comments: You are right that BinaryReader deals with bytes not chars and that is exactly the reason that the method is called PeekChar. Chars are not necessarily one byte in length - it depends on the encoding that was used. By default, UTF8 encoding is used and so most chars are written using one byte but they could take two. The PeekChar function will use its internal Encoding class to retrieve the next char in the stream. If the next byte(s) in the stream were not *specifically written* as encoded chars you will get errors when the first byte indicates a 2-byte char but the second byte is not valid for a UTF-8 encoded char. The bottom line is that you have to know what type of data is next in the stream - unless you know there is a char in the stream, don't use PeekChar.
Your article implies that you need to find the EOF because the number of records is not known at read time but why this is isn't clear (I may have missed something of course!). I can think of two solutions for this: Since you are talking about a file stream and a fixed-size struct then dividing the file size by the struct length will give you the number of contained records. Just calculate this first, read the correct number of records and then close the file - no need to look at Position at all.
Your first paragraph leads me to believe that you are processing collections of TestStructs rather than individual ones. Assuming you are creating the file from a TestStruct[] (or ArrayList or whatever) then you could store the length of the array first and then the structs data. To read them you can read the count first and then read the exact number back.
I had a play with your test app using the code from my Fast Serialization article. This involved changing TestStruct to add these interface methods:
void IOwnedDataSerializable.SerializeOwnedData(SerializationWriter writer, object context)
{
    writer.Write(longField);
    writer.Write(floatField);
    writer.Write(byteField);
    writer.WriteBytesDirect(byteArrayField);
}
void IOwnedDataSerializable.DeserializeOwnedData(SerializationReader reader, object context)
{
    longField = reader.ReadInt64();
    floatField = reader.ReadSingle();
    byteField = reader.ReadByte();
    byteArrayField = reader.ReadBytesDirect(16);
}
(A couple of minor changes were required to my code also - to make the Stream constructor public and to add support for direct writing of byte arrays - once I have tested them I will update the article)
Testing showed that timings were virtually identical to your fastest BinaryReaderField/Cached method - for 10,000,000 items only +/- 0.1 sec. difference depending on whether reading each item individually or as a complete array.
The benefits of doing this are:
1) Any type of object can be stored including strings, float[,], bitmaps etc. (as other commenters have mentioned)
2) Would be compatible with remoting.
3) Possibility of reducing storage size/increasing speed if you know your values will be in a certain range. (Not much with your particular test data types but if you have strings or lots of values that are 0 then there is huge scope here)
4) Works with classes as well as structs including inheritance.
5) Full control over what gets serialized - can deal with aggregated objects.
6) No need for Marshal or StructLayout or unsafe code.
Would be interesting to see what sort of timings can be achieved with your real data rather than non-optimizable randomized data.
Cheers Simon
 |
public int show(int[] k)
{
    y = k[0];
    z = k[1];
    y = Convert.ToInt32(textRandFrom.Text);
    z = Convert.ToInt32(textRandTo.Text);
    if (this.DialogResult == DialogResult.OK)
    {
        k[0] = y;
        k[1] = z;
        return k[];
    }
    else
    {
        return k[];
    }
}
// the problem is, how to return the whole of the array??
//plz..
me..
 |
When I save the struct, the string content is not saved; a pointer to the string object is saved instead. How do I save fixed-length string fields?
Leonardo.
 |
Hello,
I've searched through the Web but couldn't find a fast way to read data from a binary file into a 2-dimensional array of floats in C#. I tried to change the code from this article:
temperatureField = (float[,])Marshal.PtrToStructure(handle.AddrOfPinnedObject(), typeof(float[,]));
but in this line I get exception "No parameterless constructor defined for this object."
Can anyone help me with my problem?
vanix
 |
I am currently developing a small, very simple and stupid "database" system for .NET and I have found similar results. The standard implementation of (file)stream Position and Length is awfully slow. Generally it is safe to cache the position in your own custom stream. However, you should be careful with the length value, because another thread or process may change the file length without your stream being notified (it is also remotely possible that, if the stream's length suddenly becomes smaller than your current position, your cached position points to an invalid location). If you are caching the file length or position, always open files without sharing them with another process, or at least make sure that your program is not reducing file lengths (of course, you can't ensure that some other program won't do it...).
I guess this is the major reason why Microsoft hasn't cached the position and length values in the stream object. It isn't safe if the file is shared, since another thread or process could change the file length.
If you want to store data in a stream, I would prefer to use either a custom interface that has read and write methods, or BinaryFormatter. Using marshalling and storing raw data seems like a hack: it creates a dependency on the underlying .NET platform implementation, might store unnecessary data in the stream, and gives the stored object no control over what is stored. Marshalling is also a technology that I, at least, try to avoid until all other possibilities are exhausted; my experience is that it is damn slow.
For example (custom data interface):
// Custom data interface definition
public interface IData
{
    void Write( BinaryWriter writer );
    void Read( BinaryReader reader );
}

// Object implementing the interface
public class MyDataObject : IData
{
    // This data is not stored, because it records when _this_ instance was created
    private long m_instanceCreatedTime = DateTime.Now.Ticks;
    // Stored/restored value
    private long m_myDataValue;

    public void Write( BinaryWriter writer )
    {
        writer.Write( m_myDataValue );
    }

    public void Read( BinaryReader reader )
    {
        m_myDataValue = reader.ReadInt64();
    }
}

// Some class has the following storing and restoring methods...
// Note: You need to create the data variable somewhere first.
// This might need a mechanism that decides what kind of data object you need to create
// before calling the Store/Restore methods.
public void Store( string fileName, IData data )
{
    // Open stream without sharing
    using ( FileStream stream = new FileStream( fileName, FileMode.Create, FileAccess.Write, FileShare.None ) )
    {
        using ( BinaryWriter writer = new BinaryWriter( stream ) )
        {
            data.Write( writer );
        }
    }
}

public void Restore( string fileName, IData data )
{
    // Allow read sharing
    using ( FileStream stream = new FileStream( fileName, FileMode.Open, FileAccess.Read, FileShare.Read ) )
    {
        using ( BinaryReader reader = new BinaryReader( stream ) )
        {
            data.Read( reader );
        }
    }
}
This is the internet, where the men are men, the women are men and kids are the FBI.
-- modified at 11:06 Friday 2nd December, 2005
 |
Same here. I write information retrieval systems (search engines); for which some of them reside completely on disk.
All interfaces I use for binary data have a write, read and a size member. Furthermore I'm using a buffer at the end of the file for new entries (cache 1) and a read/write through cache (cache 2) to make things more pleasant.
About the caching of position and length - I tend to disagree. There are a lot of situations where the file is locked for reading, and in doing so the mutex problem is not an issue.
Cheers,
Stefan de Bruijn, Senior search engineer, Teezir
 |
what if you have a structure like this
struct foo
{
    public bitmap bmp;
    public byte dummy;
}
wouldn't the bitmap object create a problem? how would you save that?
 |
Based on your results, it seems you don't really need to lay out your structure using StructLayoutAttribute. If reading and writing by fields, memory layout and packing are irrelevant. Would you agree, or do you still see the need to lay out memory according to the file structure, byte for byte?
 |
I'm sure that not many people will have gone so deeply into the performance of the IO framework, so I think you should email the Base Class Library feedback email *goes to find* bclpub@microsoft.com
 |
Yes, thank you. This is a very interesting analysis, and the benchmarking methodology looks sound.
I must say that reading & writing text got a lot more attention than reading & writing binary file formats. The ResourceReader class is probably our heaviest internal user of BinaryReader, but in there the file format is well-defined so that we generally should be able to seek to a location and read a specific number of bytes. We have no need to test for EOF one byte at a time, so you've hit a code pattern that we didn't optimize. (And ResourceReader often uses the now-public UnmanagedMemoryStream & unsafe code to avoid reading arrays via the Stream classes altogether. Casting UnmanagedMemoryStream's PositionPointer property to a value type with sequential layout will often give you the best performance, if you can ensure you don't run into alignment issues on IA64. And a MemoryMappedFileStream could make the copying data into the GC heap unnecessary.)
FileStream's Position property's performance is better in Whidbey as long as the FileStream's handle was not retrieved via the Handle property, nor passed into the constructors.
PeekChar will throw an exception if you're using an Encoding such as UTF-8 (BinaryReader's default) that has a very rigid data format, and you attempt to read non-UTF-8 data, such as an int field right after your string field. You can work around this using the ASCIIEncoding (though you'll give up the ability to read Unicode data), or by not using PeekChar to detect EOF. Reading bytes from FileStream's Read(byte[], int, int) method until it returns 0 is the authoritative way to detect the (current) end of the stream.
As a side note, for an interesting performance challenge pitting off native C++ code vs. a managed app using StreamReader, please look at Rico's Performance Quiz #6, where he and Raymond Chen are trying to write competing implementations of a fast Chinese/English dictionary reader. http://blogs.msdn.com/ricom/archive/2005/05/10/416151.aspx[^] It may give you a few insights to how to write a fast parser, and the results are somewhat surprising.
Brian Grunkemeyer "M$" CLR Base Class Library team
 |
Good idea. Here was the relatively cogent response I got. I don’t agree with some of the logic regarding the PeekChar method, but overall I’m pretty impressed with the speed and completeness. I’ll update my article in the next week or so to include some of Ravi’s suggestions.
-Anthony
Ravi's Response:
Anthony,
Thanks for your report. What is the build you are using? If it is not Whidbey Beta2, you might want to upgrade. I'll try to address your questions below:
1) BinaryReader.PeekChar
I agree with you on the usability issues with PeekChar. Throughout System.IO we are plagued by the shortcomings of this API's current design. We need to spend some time cleaning this up in a future version.
Problem: BinaryReader is a convenient means by which you can read binary data directly into primitive types. In that sense, supporting the Char data type is essential. But any time you try to peek anything other than the head byte from a stream, it's trouble. This is especially true if you are reading a stream via an un-buffered reader. The chief issue here is that when you associate a particular encoding with your reader, we try to decode a full character per that encoding from the peeked bytes. The Char type is 2 bytes long (wide char) and so we try to read 2 bytes from the underlying stream and form a character with the decoder obtained from the given encoding.
However, there is no guarantee that these 2 bytes always will yield a character; for instance, think of surrogate Unicode pairs. A character in an encoding can be composed of any arbitrary number of bytes (you can play with Encoding.GetMaxByteCount(1)). In this case, it is not recommended that we break the character sequence and return the characters individually (for example, high surrogate Char and low surrogate Char). So we basically return you -1, indicating there is nothing available to return at this time. However, keep in mind that the decoder remembers this state and caches the partial sequence of bytes internally.
So when you call PeekChar subsequently, we again read the next 2 bytes and give them to the decoder to combine with the cached bytes; maybe now it can form a character. Great! But since we are looking for only one Char to give back to you from PeekChar, the decoder can't fit the entire character sequence in a buffer that is one Char long. Hence the reason why you sometimes get the "not enough buffer to convert the character" error.
Solution: I suppose since the return type of PeekChar is Int32, we can return you up to 4 bytes (i.e 2 Chars) at a time. However, you now need to extract the chars out of the Int32 type yourself and note that this only solves the problem partially.
Ideally, we would need a PeekChar method that can return you a Char[], which might solve the problem (but probably ugly!). Also, we can bake the concept of Peekable down at the Stream level so that we wouldn't have the issue of reading a byte off the stream for peeking from a reader and not being able to cache that or put it back into the stream (if the stream is un-seekable). Along the same lines, I think we should also add the concept of IsBlocked to the Stream so that if there is a way you can detect this reliably in your stream, the higher layers can make educated choice rather than guessing.
As I said, these issues are interesting and we need to spend some time thinking about the right solution. Unfortunately, in Whidbey, unless you know the nature of the bytes in your stream, calling PeekChar/ReadChar is probably not the right thing to do. Maybe you can work around this by designing your own version of peek along the lines of what I've outlined.
2) Performance issues with FileStream.Position & Length
Position: Unless, we have exposed the handle of the file (i.e, you either called FileStream.SafeFileHandle or constructed the FileStream with your own handle), querying Position should be just an arithmetic operation of internal buffer positions. I would be surprised if you see a perf issue here.
On the other hand, if you have exposed the handle, then we can't make any assumption that the current instance of FileStream is the only one manipulating the underlying file pointer, hence we need to query Win32 dynamically.
Length: We could do some optimization such as the Position implementation above and look at the FileMode and cache this but it is better if we don't. You should cache this explicitly based on your scenario as you know your case the best. Querying Length dynamically when you don't expect it to change is certainly not the right thing.
3) BinaryReader.ReadXxx vs. Marshal.PtrToStructure
For the data types that the binary reader supports, it's best to use it directly rather than the roundabout approach of pinning a buffer and getting marshaling to do the conversion.
Thanks -Ravi