
Fast Binary File Reading with C#

28 Jun 2005, 8 min read
Exploring the fastest way to read structures from a binary file in C#.

Introduction

I’ve been working on a time-series analysis project where the data are stored as structures in massive binary files. Importing the files into a database would cause a performance hit with no value added, so dealing with the files in their original binary format is the best option. My initial assumption was that throughput would be limited by disk speed, but I found that my first implementation resulted in 100% CPU utilization on my research box. It was obviously time to optimize.

While there is a wealth of information available on the innumerable ways of reading files with C#, there is virtually no discussion about the performance implications of various design decisions. Hopefully, this article will allow the reader to improve the performance of binary file reading in their application and will shed some light on some of the undocumented performance traps hidden in the System.IO classes.

Is there Data?

It may seem silly to have a section on checking for the end of a file (EOF), but there is a plethora of methods employed by programmers, and improperly checking for the EOF can absolutely cripple performance and introduce mysterious errors and exceptions to your application.

BinaryReader.PeekChar Method

If you are using this method in any application, god save you. Based on its frequent appearance in .NET newsgroups, this method is widely used, but I’m not sure why it even exists. According to Microsoft, the BinaryReader.PeekChar method “Returns the next available character and does not advance the byte or character position.” The return value is an int containing “The next available character, or -1 if no more characters are available or the stream does not support seeking.” Gee, that sounds awfully useful in determining if we’re at the end of the stream.

The BinaryReader class is used for reading binary files, which are broken into bytes, not chars, so why peek at the next char rather than the next byte? I could understand if there was an issue implementing a common interface, but the TextReader derived classes just use Peek. Why doesn’t the BinaryReader include a plain old Peek method that returns the next byte as an int? By now, you’re probably wondering why I’m ranting so much about this. Who cares? So, you get the next byte for free? Well, something entirely unnatural happens somewhere in the bowels of this method that periodically results in a “Conversion Buffer Overflow” exception. As the result of some dark voodoo process, certain two-byte combinations in your binary file cannot be converted into an appropriate return value by the method. I have no idea why certain byte combinations have been deemed toxic to PeekChar, but prepare for freaky results if you use it.
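As a rough illustration of the “plain old Peek” wished for above, here is a minimal sketch of a byte-level peek. `StreamPeek` and `PeekByte` are hypothetical names, not part of System.IO, and the helper assumes a seekable stream. It never touches a character decoder, so no byte combination can upset it, but it costs a read plus a seek, so it is not meant as an EOF test inside a hot loop.

```csharp
using System;
using System.IO;

public static class StreamPeek
{
    // Hypothetical helper (not part of System.IO): returns the next byte
    // as an int without consuming it, or -1 at the end of the stream.
    public static int PeekByte(Stream stream)
    {
        if (!stream.CanSeek)
            throw new NotSupportedException("PeekByte needs a seekable stream.");

        int next = stream.ReadByte();             // -1 signals end of stream
        if (next != -1)
            stream.Seek(-1, SeekOrigin.Current);  // step back over the byte we just read
        return next;
    }
}
```

Calling `StreamPeek.PeekByte(fs)` on an open FileStream leaves `fs.Position` unchanged.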

Stream.Position >= Stream.Length

This test is pretty straightforward. If your current position is greater than or equal to the length of the stream, you’re going to be pretty hard-pressed to read any additional data. As it turns out, this statement is a massive performance bottleneck.

After finishing the initial build of my application, it was time for some optimization. I downloaded the ANTS Profiler Demo from Red Gate Software, and was shocked to find that over half the execution time of my program was being spent in the EOF method of my data reader. Without the profiler results, I never would have imagined that this innocuous-looking line of code was cutting the performance of my application in half. After all, I opened the FileStream using the FileShare.Read option, so there was no danger of the file’s length changing, but it appears as though the position and file length are not cached by the class, so every call to Position or Length results in another file system query. In my benchmarking, I’ve found that calling both Position and Length takes twice as long as calling one or the other.

_position >= _length (Cache it yourself)

It’s sad, but true. This is the fastest method by a long shot. Get the length of your FileStream once when you open it, and don’t forget to advance your position counter every time you read. Maybe Microsoft will fix this performance trap someday, but until then, don’t forget to cache the file length and position yourself!
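A minimal sketch of the caching idea follows. The wrapper name `CachedEofReader` and its members are made up for illustration. It assumes nothing else moves the stream position or changes the file length while the reader is in use (for example, a file opened with FileShare.Read).

```csharp
using System;
using System.IO;

// Hypothetical wrapper: reads the stream length once and tracks the position
// itself, so the EOF test is a comparison of two fields.
public sealed class CachedEofReader
{
    private readonly BinaryReader _reader;
    private readonly long _length;   // queried once, when the reader is created
    private long _position;          // advanced by hand after every read

    public CachedEofReader(Stream stream)
    {
        _reader = new BinaryReader(stream);
        _length = stream.Length;
        _position = stream.Position;
    }

    // No call to Stream.Position or Stream.Length here.
    public bool EOF
    {
        get { return _position >= _length; }
    }

    public long ReadInt64()
    {
        long value = _reader.ReadInt64();
        _position += sizeof(long);   // keep the cached position in step
        return value;
    }

    public byte ReadByte()
    {
        byte value = _reader.ReadByte();
        _position += sizeof(byte);
        return value;
    }

    public void Close()
    {
        _reader.Close();             // also closes the underlying stream
    }
}
```

A read loop then becomes `while (!reader.EOF) { ... }`. Every read method added to the wrapper has to advance `_position` by exactly the number of bytes it consumed, otherwise the EOF test drifts.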

Read It!

Now that we know there’s data, we have to read it into our data structures. I’ve included three different approaches, with varying merits. I did not include the unsafe approach of casting a byte array of freshly read data into a structure because I prefer to avoid unsafe code if at all possible.

FileStream.Read with PtrToStructure

Logically, I assumed that the fastest way to read in a structure would be the functional equivalent of C++’s basic_istream::read method. There are plenty of articles and newsgroup posts about using the Marshal class in order to torture raw bits into a struct. The cleanest implementation I’ve found is this:

C#
// Requires: using System.IO; using System.Runtime.InteropServices;
public static TestStruct FromFileStream(FileStream fs)
{
    //Create Buffer
    byte[] buff = new byte[Marshal.SizeOf(typeof(TestStruct))];
    int amt = 0;
    //Loop until we've read enough bytes (usually once)
    while (amt < buff.Length)
    {
        int read = fs.Read(buff, amt, buff.Length - amt); //Read bytes
        if (read == 0) //Read returns 0 at end of file; without this check the loop never ends
            throw new EndOfStreamException();
        amt += read;
    }
    //Make sure that the Garbage Collector doesn't move our buffer
    GCHandle handle = GCHandle.Alloc(buff, GCHandleType.Pinned);
    try
    {
        //Marshal the bytes
        return (TestStruct)Marshal.PtrToStructure(handle.AddrOfPinnedObject(),
            typeof(TestStruct));
    }
    finally
    {
        handle.Free(); //Give control of the buffer back to the GC, even if marshaling throws
    }
}

BinaryReader.ReadBytes with PtrToStructure

This approach is functionally almost identical to the FileStream.Read approach, but I provided it as a more apples-to-apples comparison to the other BinaryReader approach. The code is as follows:

C#
public static TestStruct FromBinaryReaderBlock(BinaryReader br)
{
    int size = Marshal.SizeOf(typeof(TestStruct));
    //Read byte array
    byte[] buff = br.ReadBytes(size);
    //ReadBytes returns a shorter array if the stream ends early
    if (buff.Length < size)
        throw new EndOfStreamException();
    //Make sure that the Garbage Collector doesn't move our buffer
    GCHandle handle = GCHandle.Alloc(buff, GCHandleType.Pinned);
    try
    {
        //Marshal the bytes
        return (TestStruct)Marshal.PtrToStructure(handle.AddrOfPinnedObject(),
            typeof(TestStruct));
    }
    finally
    {
        handle.Free(); //Give control of the buffer back to the GC
    }
}

BinaryReader with individual Read calls for structure fields

I assumed that this would be the slowest method for filling my data structures, and it was certainly the least sexy approach. Here’s the relevant sample code:

C#
public static TestStruct FromBinaryReaderField(BinaryReader br)
{
     TestStruct s = new TestStruct();//New struct
     s.longField = br.ReadInt64();//Fill the first field
     s.byteField = br.ReadByte();//Fill the second field
     s.byteArrayField = br.ReadBytes(16);//...
     s.floatField = br.ReadSingle();//...
     return s;
}

Results

As I’ve already foreshadowed, my assumptions about the performance of the various read techniques were entirely wrong for my data structures. Using the BinaryReader to populate the individual fields of my structures was more than twice as fast as the other methods. These results are highly sensitive to the number of fields in your structure. If you are concerned about performance, I recommend testing both approaches. I found that, at about 40 fields, the results for the three approaches were almost equivalent, and beyond that, the block reading approaches gained the upper hand.
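For readers who want to test both approaches on their own structures, here is a minimal timing sketch. It is not the article’s test app. The `TestStruct` layout is an assumption (the article never shows the declaration), `ReadBenchmark` and its members are hypothetical names, and the data comes from a MemoryStream, so only the CPU cost of the two techniques is compared, not disk throughput.

```csharp
using System;
using System.Diagnostics;
using System.IO;
using System.Runtime.InteropServices;

// Assumed layout: inferred from the fields read in FromBinaryReaderField.
[StructLayout(LayoutKind.Sequential, Pack = 1)]
public struct TestStruct
{
    public long longField;
    public byte byteField;
    [MarshalAs(UnmanagedType.ByValArray, SizeConst = 16)]
    public byte[] byteArrayField;
    public float floatField;
}

public static class ReadBenchmark
{
    public static readonly int Size = Marshal.SizeOf(typeof(TestStruct));

    // Field-by-field read, as in FromBinaryReaderField.
    public static TestStruct ReadFields(BinaryReader br)
    {
        TestStruct s = new TestStruct();
        s.longField = br.ReadInt64();
        s.byteField = br.ReadByte();
        s.byteArrayField = br.ReadBytes(16);
        s.floatField = br.ReadSingle();
        return s;
    }

    // Block read plus PtrToStructure, as in FromBinaryReaderBlock.
    public static TestStruct ReadBlock(BinaryReader br)
    {
        byte[] buff = br.ReadBytes(Size);
        GCHandle handle = GCHandle.Alloc(buff, GCHandleType.Pinned);
        try
        {
            return (TestStruct)Marshal.PtrToStructure(
                handle.AddrOfPinnedObject(), typeof(TestStruct));
        }
        finally
        {
            handle.Free();
        }
    }

    // Times one pass over `records` records; returns elapsed milliseconds.
    public static long Time(Func<BinaryReader, TestStruct> read, byte[] data, int records)
    {
        using (MemoryStream ms = new MemoryStream(data))
        using (BinaryReader br = new BinaryReader(ms))
        {
            Stopwatch sw = Stopwatch.StartNew();
            for (int i = 0; i < records; i++)
                read(br);
            sw.Stop();
            return sw.ElapsedMilliseconds;
        }
    }

    public static void Run(int records)
    {
        byte[] data = new byte[records * Size];
        new Random(1).NextBytes(data);   // fixed seed so every run reads the same bytes
        Console.WriteLine("Field reads: {0} ms", Time(ReadFields, data, records));
        Console.WriteLine("Block reads: {0} ms", Time(ReadBlock, data, records));
    }
}
```

Calling `ReadBenchmark.Run(1000000)` prints one timing per technique. Swapping in a structure with more fields shows where the crossover falls on a given machine.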

Using the Test App

I’ve thrown together a quick benchmarking application with simplified reading classes to demonstrate the techniques outlined so far. It has facilities to generate sample data and benchmark the three reading approaches with dynamic and cached EOF sensing.

Generating Test Data

By default, test data is created in the same directory as the executable with the filename “sampledata.bin”. The number of records to be created can be varied. Ten million records will take up a little bit more than 276 MB, so make sure you have enough disk space to accommodate the data. The ‘Randomize Output’ checkbox determines whether each record will be created using random data to thwart NTFS’s disk compression. Click the ‘Generate Data’ button to build the file.

Benchmarking

Benchmarking results are more reliable when averaged over many trials. Adjust the number of trials for each test scenario using the ‘Test Count’ box. ‘Update Frequency’ can be used to adjust how frequently the status bar will inform you of progress. Designate an update frequency greater than the number of records to avoid including status bar updates in your benchmark results. The ‘Drop Best and Worst Trials from Average’ check box will omit the longest and shortest trials from the average entry; they will still be listed in the ‘Results’ ListView. Select the readers to be tested using the checkboxes; ‘BinaryReader Block’ corresponds to the BinaryReader.ReadBytes with PtrToStructure approach. Select the ‘EOF detection’ methods to test; ‘Dynamic’ uses the Length and Position properties each time EOF is called. Click ‘Run Tests’ to generate results.
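The effect of dropping the best and worst trials can be sketched as follows. This is only an illustration of the idea, not the test app’s own code. `TrialStats` and `TrimmedAverage` are hypothetical names.

```csharp
using System;
using System.Linq;

public static class TrialStats
{
    // Average of the trial times with the single fastest and single slowest
    // trial removed. With fewer than three trials there is nothing sensible
    // to drop, so the plain average is returned instead.
    public static double TrimmedAverage(double[] trialMs)
    {
        if (trialMs == null || trialMs.Length == 0)
            throw new ArgumentException("At least one trial is required.");
        if (trialMs.Length < 3)
            return trialMs.Average();

        double sum = trialMs.Sum() - trialMs.Min() - trialMs.Max();
        return sum / (trialMs.Length - 2);
    }
}
```

For trial times of 10, 20, 30 and 1000 ms, the plain average is 265 ms while the trimmed average is 25 ms, which shows how much a single outlier (a cold disk cache, for instance) can distort the result.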

Miscellaneous Findings

StructLayoutAttribute

If you’re reading pre-defined binary files, you will become very familiar with the StructLayoutAttribute. This attribute allows you to tell the compiler specifically how to lay out a struct in memory using the LayoutKind and Pack parameters. Marshaling a byte array into a structure where the memory layout differs from its layout on disk will result in corrupted data. Make sure they match.

Warning! Depending on the way a structure is saved, you may need to read and discard empty packing bytes between reading fields when using the BinaryReader.
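As a sketch of what discarding packing bytes looks like, assume a hypothetical file whose records were written with natural alignment: the 4-byte float is aligned to a 4-byte boundary, which leaves 3 unused bytes after the 16-byte array and makes each record 32 bytes long. The struct fields and the `PaddedReader` name are assumptions for illustration.

```csharp
using System;
using System.IO;

// Assumed fields, matching those read in FromBinaryReaderField.
public struct TestStruct
{
    public long longField;
    public byte byteField;
    public byte[] byteArrayField;
    public float floatField;
}

public static class PaddedReader
{
    // Assumed on-disk record (32 bytes):
    //   offset  0: long  (8 bytes)
    //   offset  8: byte  (1 byte)
    //   offset  9: byte[16]
    //   offset 25: 3 padding bytes
    //   offset 28: float (4 bytes)
    public static TestStruct ReadPadded(BinaryReader br)
    {
        TestStruct s = new TestStruct();
        s.longField = br.ReadInt64();
        s.byteField = br.ReadByte();
        s.byteArrayField = br.ReadBytes(16);
        br.ReadBytes(3);                 // read and discard the packing bytes
        s.floatField = br.ReadSingle();
        return s;
    }
}
```

If the padding is not skipped, the float is assembled from three padding bytes plus the first byte of the real value, and every later record in the file is read from the wrong offset.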

MarshalAsAttribute

Be sure to use the MarshalAsAttribute for all fixed-width arrays in your structure. Structures with variable-length arrays cannot be marshaled to or from pointers.
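The two attributes can be seen working together in the sketch below. The article never shows the TestStruct declaration, so both structs here are assumptions built from the fields used in the read methods. `PackedRecord`, `DefaultRecord` and `LayoutCheck` are hypothetical names.

```csharp
using System;
using System.Runtime.InteropServices;

// Assumed on-disk layout with no padding between fields (Pack = 1).
[StructLayout(LayoutKind.Sequential, Pack = 1)]
public struct PackedRecord
{
    public long longField;
    public byte byteField;
    // Fixed-width array: SizeConst tells the marshaler how many elements
    // are stored inline in the structure.
    [MarshalAs(UnmanagedType.ByValArray, SizeConst = 16)]
    public byte[] byteArrayField;
    public float floatField;
}

// The same fields with default packing, for comparison.
[StructLayout(LayoutKind.Sequential)]
public struct DefaultRecord
{
    public long longField;
    public byte byteField;
    [MarshalAs(UnmanagedType.ByValArray, SizeConst = 16)]
    public byte[] byteArrayField;
    public float floatField;
}

public static class LayoutCheck
{
    // Marshaled size of a structure type in bytes.
    public static int SizeOf(Type t)
    {
        return Marshal.SizeOf(t);
    }

    // Marshaled offset of floatField within the structure.
    public static int FloatOffset(Type t)
    {
        return (int)Marshal.OffsetOf(t, "floatField");
    }
}
```

With Pack = 1 the marshaled size is 8 + 1 + 16 + 4 = 29 bytes, and floatField sits at offset 25. That matches the test data size quoted earlier, since 29 bytes times ten million records is 290,000,000 bytes, or about 276.6 MB. With default packing, floatField moves to offset 28 and the size grows to 32 bytes, which is exactly the kind of mismatch that corrupts marshaled data.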

Writing Data

Writing binary data can be accomplished in the same ways as reading. I imagine that the performance considerations are very similar as well. So, writing out the fields of a structure using the BinaryWriter is probably optimal for small structures. Larger structures can be marshaled into byte arrays using this pattern:

C#
public byte[] ToByteArray()
{
 byte[] buff = new byte[Marshal.SizeOf(typeof(TestStruct))];//Create Buffer
 GCHandle handle = GCHandle.Alloc(buff, GCHandleType.Pinned);//Hands off GC
 //Marshal the structure
 Marshal.StructureToPtr(this, handle.AddrOfPinnedObject(), false);
 handle.Free();
 return buff;
}
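For the small-structure case, a field-by-field writer that mirrors FromBinaryReaderField might look like the sketch below. The struct fields are assumed from the read methods, `TestStructWriter` and `WriteFields` are hypothetical names, and the performance of this approach has not been measured here.

```csharp
using System;
using System.IO;

// Assumed fields, matching those read in FromBinaryReaderField.
public struct TestStruct
{
    public long longField;
    public byte byteField;
    public byte[] byteArrayField;
    public float floatField;
}

public static class TestStructWriter
{
    // Writes the fields in the same order the reader consumes them.
    public static void WriteFields(BinaryWriter bw, TestStruct s)
    {
        // The reader always consumes exactly 16 bytes, so reject anything else.
        if (s.byteArrayField == null || s.byteArrayField.Length != 16)
            throw new ArgumentException("byteArrayField must hold exactly 16 bytes.");

        bw.Write(s.longField);       // 8 bytes
        bw.Write(s.byteField);       // 1 byte
        bw.Write(s.byteArrayField);  // 16 raw bytes, no length prefix
        bw.Write(s.floatField);      // 4 bytes
    }
}
```

Each call emits 29 bytes, so a file written this way can be read back with either the field-by-field reader or, given a Pack = 1 layout, the PtrToStructure readers.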

Marshal.SizeOf

Even small changes to a method can yield a significant boost to performance when the method is called millions or billions of times during the execution of a program. Apparently, Marshal.SizeOf is evaluated at runtime even when there is a call to typeof as the parameter. I shaved several minutes off my application’s execution time by creating a class with a static Size property to use in place of Marshal.SizeOf. Since the return value is calculated every time the application is started, the dangers of using a constant for size are avoided.

C#
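internal sealed class TSSize
{
 //Private and readonly so that nothing outside the static constructor can change it
 private static readonly int _size;

 static TSSize()
 {
  //Evaluated once, the first time the class is used
  _size = Marshal.SizeOf(typeof(TestStruct));
 }

 public static int Size
 {
  get
  {
    return _size;
  }
 }
}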

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.


Comments and Discussions

 
Question: Have you complained to M$? (DrGUI, 12-Jul-05 5:19)
Answer: Re: Have you complained to M$? (Brian Grunkemeyer, 29-Jul-05 16:52)
Answer: Re: Have you complained to M$? (Anthony Baraff, 29-Jul-05 17:01)
Good idea. Here was the relatively cogent response I got. I don’t agree with some of the logic regarding the PeekChar method, but overall I’m pretty impressed with the speed and completeness. I’ll update my article in the next week or so to include some of Ravi’s suggestions.

-Anthony

Ravi's Response:

Anthony,

Thanks for your report. What is the build you are using? If it is not
Whidbey Beta2, you might want to upgrade. I'll try to address your
questions below:

1) BinaryReader.PeekChar

I agree with you on the usability issues with PeekChar. Through out
System.IO we are plagued by the short comings of this API's current
design. We need to spend some time cleaning this up in the future
version.

Problem: BinaryReader is a convenient means by which you can read binary
data directly into primitive types. In that sense, supporting Char data
type is essential.
But any time you try to peek anything other than the head byte from a
stream, its trouble. This is especially true if you are reading a stream
via an un-buffered reader. The chief issue here is that when you
associate a particular encoding with your reader, we try to encode a
full character per that encoding from the peeked bytes. Char type is
2bytes long (wide char) and so we try to read 2 bytes from the
underlying stream and form a character with the decoder obtained from
the given encoding.

However, there is no guarantee that these 2 bytes always will yield a
character, for instance, think of surrogate Unicode pairs. A character
in a encoding can be composed of any arbitrary number of bytes (you can
play with Encoding.GetMaxByteCount(1)). In this case, it is not
recommended that we break the character sequence and return the
characters individually (for ex, high surrogate Char and low surrogate
Char). So we basically return you -1 indicating there is nothing
available to return at this time. However, keep in mind that the decoder
remembers this state and caches the partial sequence of bytes
internally.

So when you call PeekChar subsequently, we again read the next 2 bytes
and give it to the decoder to combine with the cached bytes, may be now
it can form a character. Great! But since we are looking for only one
Char to give back to you from PeekChar, decoder can't fit in the entire
character sequence in a one Char long buffer. Hence, the reason why you
sometime get the "not enough buffer to convert the character" error.

Solution: I suppose since the return type of PeekChar is Int32, we can
return you up to 4 bytes (i.e 2 Chars) at a time. However, you now need
to extract the chars out of the Int32 type yourself and note that this
only solves the problem partially.

Ideally, we would need a PeekChar method that can return you a Char[],
which might solve the problem (but probably ugly!). Also, we can bake
the concept of Peekable down at the Stream level so that we wouldn't
have the issue of reading a byte off the stream for peeking from a
reader and not being able to cache that or put it back into the stream
(if the stream is un-seekable). Along the same lines, I think we should
also add the concept of IsBlocked to the Stream so that if there is a
way you can detect this reliably in your stream, the higher layers can
make educated choice rather than guessing.

As I said, these issues are interesting and we need to spend some time
thinking about the right solution. Unfortunately, in Whidbey unless you
know the nature of bytes in you stream, calling PeekChar/ReadChar is
probably not the right thing to do. May be you can workaround this by
designing your own version of peek along the lines of what I've
outlined.


2) Performance issues with FileStream.Position & Length

Position: Unless, we have exposed the handle of the file (i.e, you
either called FileStream.SafeFileHandle or constructed the FileStream
with your own handle), querying Position should be just an arithmetic
operation of internal buffer positions. I would be surprised if you see
a perf issue here.

On the other hand, if you have exposed the handle, then we can't make
any assumption that the current instance of FileStream is the only one
manipulating the underlying file pointer, hence we need to query Win32
dynamically.

Length: We could do some optimization such as the Position
implementation above and look at the FileMode and cache this but it is
better if we don't. You should cache this explicitly based on your
scenario as you know your case the best. Querying Length dynamically
when you don't expect it to change is certainly not the right thing.


3) BinaryReader.ReadXxx Vs Marshal.PtrToStructure
For the binary reader supported data types, its best to use that
directly rather than the round about approach of pinning buffer and
getting marshaling to do the conversion.


Thanks
-Ravi