Posted 6 Mar 2010

# From Binary to Data Structures

6 Mar 2010 · CPOL
How to parse well-formed binary data into your data structures

## Introduction

Sometimes, you want to parse well-formed binary data and bring it into your objects to do some dirty stuff with it.

In the Windows world, most data structures are stored in a special binary format. Either we call a `WinApi` function, or we want to read from special files like images, spool files, executables or maybe the previously announced Outlook Personal Folders File.

Most specifications for these files can be found on the MSDN Library: Open Specification.

In my example, we are going to get the COFF (Common Object File Format) file header from a PE (Portable Executable). The exact specification can be found here: PECOFF.

## PE File Format and COFF Header

Before we start, we need to know how this file is formatted. The following figure shows an overview of the Microsoft PE executable format.

[Figure: overview of the Microsoft PE executable format. Source: Microsoft]

Our goal is to get the PE header. As we can see, the image starts with an MS-DOS 2.0 header, which is not important for us. From the documentation, we can read:

"...After the MS DOS stub, at the file offset specified at offset 0x3c, is a 4-byte...".

With this information, we know our reader has to jump to location `0x3c` and read the offset of the signature. The signature is always 4 bytes long and identifies the image as a PE file. The signature is: `PE\0\0`.

To verify this, we first seek to offset `0x3c`, read the offset stored there, and then check whether the file contains the signature at that position.

So we need to declare some constants, because we do not want magic numbers.

```csharp
private const int PeSignatureOffsetLocation = 0x3c;
private const int PeSignatureSize           = 4;
private const string PeSignatureContent     = "PE";
```

Then we need a method that moves the reader to the correct location to read the signature. With this method, we always move the underlying `Stream` of the `BinaryReader` to the start of the PE signature.

```csharp
private void SeekToPeSignature(BinaryReader br) {
    // seek to the offset for the PE signature
    br.BaseStream.Seek(PeSignatureOffsetLocation, SeekOrigin.Begin);
    // read the 4-byte file offset of the PE signature
    uint offsetToPeSig = br.ReadUInt32();
    // seek to the start of the PE signature
    br.BaseStream.Seek(offsetToPeSig, SeekOrigin.Begin);
}
```

Now we can check whether it is a valid PE image by reading the next 4 bytes, which should contain `PE`.

```csharp
private bool IsValidPeSignature(BinaryReader br) {
    // read 4 bytes to get the PE signature
    byte[] peSigBytes = br.ReadBytes(PeSignatureSize);
    // convert it to a string and trim \0 at the end of the content
    string peContent = Encoding.Default.GetString(peSigBytes).TrimEnd('\0');
    // check if PE is in the content
    return peContent.Equals(PeSignatureContent);
}
```
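The seek-and-check steps can be tried without a real executable. The following is a minimal sketch, assuming a hypothetical helper class `PeSignatureDemo` that is not part of the article's reader class. It builds a tiny fake image in memory (an assumption for illustration only) and adds bounds checks so that truncated input is rejected instead of throwing.

```csharp
using System.IO;
using System.Text;

// Hypothetical helper for illustration; not part of the article's reader class.
public static class PeSignatureDemo {
    private const int PeSignatureOffsetLocation = 0x3c;
    private const int PeSignatureSize = 4;

    // Returns true if the stream holds "PE\0\0" at the offset stored at 0x3c.
    public static bool HasPeSignature(Stream image) {
        // leaveOpen = true: the caller keeps ownership of the stream
        using (var br = new BinaryReader(image, Encoding.ASCII, true)) {
            if (image.Length < PeSignatureOffsetLocation + 4) return false;
            image.Seek(PeSignatureOffsetLocation, SeekOrigin.Begin);
            uint offsetToPeSig = br.ReadUInt32();   // BinaryReader reads little-endian
            if (offsetToPeSig + PeSignatureSize > image.Length) return false;
            image.Seek(offsetToPeSig, SeekOrigin.Begin);
            byte[] sig = br.ReadBytes(PeSignatureSize);
            return sig.Length == PeSignatureSize
                && sig[0] == (byte)'P' && sig[1] == (byte)'E'
                && sig[2] == 0 && sig[3] == 0;
        }
    }

    // Builds a fake image: zero-filled stub, offset 0x80 stored at 0x3c,
    // and the signature bytes at 0x80.
    public static byte[] BuildFakeImage() {
        var bytes = new byte[0x80 + PeSignatureSize];
        bytes[0x3c] = 0x80;          // little-endian 0x00000080
        bytes[0x80] = (byte)'P';
        bytes[0x81] = (byte)'E';     // the two trailing bytes stay 0
        return bytes;
    }
}
```

Calling `PeSignatureDemo.HasPeSignature(new MemoryStream(PeSignatureDemo.BuildFakeImage()))` exercises both seeks. Comparing raw bytes instead of a decoded string avoids any dependency on the default encoding.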

With this basic functionality, we have a good base reader class to try the different methods of parsing the COFF file header.

The COFF header has the following structure:

| Offset | Size | Field |
|---|---|---|
| 0 | 2 | `Machine` |
| 2 | 2 | `NumberOfSections` |
| 4 | 4 | `TimeDateStamp` |
| 8 | 4 | `PointerToSymbolTable` |
| 12 | 4 | `NumberOfSymbols` |
| 16 | 2 | `SizeOfOptionalHeader` |
| 18 | 2 | `Characteristics` |

If we translate this table to code, we get something like this:

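```csharp
[StructLayout(LayoutKind.Sequential, CharSet = CharSet.Unicode)]
public struct CoffHeader {
    public MachineType Machine;             // must be a ushort-based enum (2 bytes)
    public ushort NumberOfSections;
    public uint TimeDateStamp;
    public uint PointerToSymbolTable;
    public uint NumberOfSymbols;
    public ushort SizeOfOptionalHeader;
    public Characteristic Characteristics;  // must be a ushort-based enum (2 bytes)
}
```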
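A quick sanity check for this translation is to compare the marshaled size and field offsets of the structure with the table, whose sizes add up to 20 bytes. The sketch below assumes stand-in definitions for `MachineType` and `Characteristic`, since the article does not show them; the enum member names and the `CoffLayoutCheck` helper are hypothetical, and only a few values from the PE specification are listed.

```csharp
using System;
using System.Runtime.InteropServices;

// Stand-in enums (assumption): ushort-based so each occupies 2 bytes.
public enum MachineType : ushort { Unknown = 0x0, I386 = 0x14c, Amd64 = 0x8664 }

[Flags]
public enum Characteristic : ushort { ExecutableImage = 0x0002, Dll = 0x2000 }

[StructLayout(LayoutKind.Sequential)]
public struct CoffHeader {
    public MachineType Machine;
    public ushort NumberOfSections;
    public uint TimeDateStamp;
    public uint PointerToSymbolTable;
    public uint NumberOfSymbols;
    public ushort SizeOfOptionalHeader;
    public Characteristic Characteristics;
}

// Hypothetical helper that exposes the marshaled layout for inspection.
public static class CoffLayoutCheck {
    // Size of the structure as seen by the marshaler.
    public static int MarshaledSize() {
        return Marshal.SizeOf(typeof(CoffHeader));
    }

    // Byte offset of a field inside the marshaled structure.
    public static int OffsetOf(string fieldName) {
        return (int)Marshal.OffsetOf(typeof(CoffHeader), fieldName);
    }
}
```

If `MarshaledSize()` does not return 20, or an offset differs from the table, a member has the wrong type or is missing, and the structure-assignment approaches would read shifted data.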

All readers do the same thing, so we go to the patterns library in our head and see that the Strategy pattern or the Template Method pattern is sticking out of the bookshelf.

I have decided to take the Template Method pattern in this case, because `Parse()` should handle the IO for all implementations and the concrete parsing should be done in its derived classes.

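```csharp
public CoffHeader Parse() {
    // _fileName: assumed field holding the path of the PE file
    using (var br = new BinaryReader(File.Open(_fileName, FileMode.Open, FileAccess.Read))) {
        SeekToPeSignature(br);
        if (!IsValidPeSignature(br)) {
            // assumed error handling: reject files without a PE signature
            throw new BadImageFormatException();
        }
        return ParseInternal(br);
    }
}
```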

First we open the `BinaryReader` and seek to the PE signature, then we check that it is a valid PE signature; the rest is done by the derived implementations.

The first solution uses the `BinaryReader`. It is the general way to get the data: we only need to know the order, the data type and the size of each member. Because we read member by member, we could comment out the first line (the `StructLayout` attribute) of the `CoffHeader` structure, since we control the order of the member assignments ourselves.

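```csharp
protected override CoffHeader ParseInternal(BinaryReader br) {
    CoffHeader coff = new CoffHeader();
    coff.Machine = (MachineType)br.ReadUInt16();
    // ... read the remaining members in table order, matching each size
    return coff;
}
```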

If the structure is as short as the COFF header here and the specification never changes, there is probably no reason to change the strategy. But if a data type changes, a new member is added or the order of the members changes, the maintenance costs of this method are very high.
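To make the member-by-member approach concrete, here is a self-contained sketch that reads all seven members in table order. It assumes a stand-in structure `RawCoffHeader` that keeps the two enum members as plain `ushort` so the example needs no other types; `CoffBinaryReaderDemo` and the sample bytes are hypothetical and only serve as test input.

```csharp
using System.IO;

// Stand-in for the article's CoffHeader; enum members are kept as raw ushort.
public struct RawCoffHeader {
    public ushort Machine;
    public ushort NumberOfSections;
    public uint TimeDateStamp;
    public uint PointerToSymbolTable;
    public uint NumberOfSymbols;
    public ushort SizeOfOptionalHeader;
    public ushort Characteristics;
}

// Hypothetical helper for illustration.
public static class CoffBinaryReaderDemo {
    // Reads the members in table order. BinaryReader always reads
    // little-endian, which matches the PE format.
    public static RawCoffHeader Read(BinaryReader br) {
        var coff = new RawCoffHeader();
        coff.Machine = br.ReadUInt16();               // offset 0, 2 bytes
        coff.NumberOfSections = br.ReadUInt16();      // offset 2, 2 bytes
        coff.TimeDateStamp = br.ReadUInt32();         // offset 4, 4 bytes
        coff.PointerToSymbolTable = br.ReadUInt32();  // offset 8, 4 bytes
        coff.NumberOfSymbols = br.ReadUInt32();       // offset 12, 4 bytes
        coff.SizeOfOptionalHeader = br.ReadUInt16();  // offset 16, 2 bytes
        coff.Characteristics = br.ReadUInt16();       // offset 18, 2 bytes
        return coff;
    }

    // 20 made-up sample bytes laid out like a COFF header.
    public static byte[] SampleHeader() {
        return new byte[] {
            0x4c, 0x01,             // Machine = 0x014c
            0x03, 0x00,             // NumberOfSections = 3
            0x00, 0x00, 0x00, 0x00, // TimeDateStamp = 0
            0x00, 0x00, 0x00, 0x00, // PointerToSymbolTable = 0
            0x00, 0x00, 0x00, 0x00, // NumberOfSymbols = 0
            0xe0, 0x00,             // SizeOfOptionalHeader = 0x00e0
            0x02, 0x01              // Characteristics = 0x0102
        };
    }
}
```

Every change to the specification means touching `Read`, which is exactly the maintenance cost described above. In return, the code does not depend on how the runtime lays out the structure in memory.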

Another way to bring the data into this structure is a "magic" unsafe trick. As above, we know the layout and order of the data structure. Now we need the `StructLayout` attribute, because we have to ensure that the .NET runtime lays out the structure in memory in the same order as it is declared in the source code. We also need to enable "Allow unsafe code (/unsafe)" in the project's build properties.

Then we need to add the following constructor to the `CoffHeader` structure.

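```csharp
[StructLayout(LayoutKind.Sequential, CharSet = CharSet.Unicode)]
public struct CoffHeader {
    // ... members as declared above

    public CoffHeader(byte[] data) {
        unsafe {
            // pin the array so the GC cannot move it while we hold a pointer
            fixed (byte* packet = &data[0]) {
                this = *(CoffHeader*)packet;
            }
        }
    }
}
```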

The "magic" trick is in the statement:

`this = *(CoffHeader*)packet;`

What happens here? The pinned byte array is a fixed-size block of data somewhere in memory. We reinterpret its address as a pointer to a `CoffHeader` and dereference it. Because a `struct` in C# is a value type, the assignment operator `=` copies the whole content of the structure and not only a reference.

To fill the structure with data, we need to pass the data as bytes into the `CoffHeader` structure. This can be achieved by reading the exact size of the structure from the PE file.

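```csharp
protected override CoffHeader ParseInternal(BinaryReader br) {
    // read exactly as many bytes as the structure occupies
    return new CoffHeader(br.ReadBytes(Marshal.SizeOf(typeof(CoffHeader))));
}
```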

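This solution is the fastest way to parse the data and bring it into the structure, but it is unsafe and it could introduce some security and stability risks. For example, if `data` is shorter than the structure, the pointer cast reads past the end of the array.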

In this solution, we are using the same approach of the structure assignment as above. But we need to replace the unsafe part in the constructor with the following managed part:

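```csharp
[StructLayout(LayoutKind.Sequential, CharSet = CharSet.Unicode)]
public struct CoffHeader {
    // ... members as declared above

    public CoffHeader(byte[] data) {
        IntPtr coffPtr = IntPtr.Zero;
        try {
            int size = Marshal.SizeOf(typeof(CoffHeader));
            // copy the bytes into unmanaged memory and marshal them back
            coffPtr = Marshal.AllocHGlobal(size);
            Marshal.Copy(data, 0, coffPtr, size);
            this = (CoffHeader)Marshal.PtrToStructure(coffPtr, typeof(CoffHeader));
        } finally {
            Marshal.FreeHGlobal(coffPtr);
        }
    }
}
```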
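The same managed technique can be written once for any sequentially laid out structure. The following is a sketch under stated assumptions: `MarshalDemo.FromBytes<T>` and `SampleHeader` are hypothetical names that do not appear in the article, the length guard is an addition, and the byte order of the result follows the host, so the sample values assume a little-endian machine.

```csharp
using System;
using System.Runtime.InteropServices;

// Small sample structure used only to demonstrate the helper.
[StructLayout(LayoutKind.Sequential)]
public struct SampleHeader {
    public ushort A;
    public ushort B;
    public uint C;
}

// Hypothetical helper for illustration.
public static class MarshalDemo {
    // Copies the leading bytes of data into unmanaged memory and
    // marshals them into a structure of type T.
    public static T FromBytes<T>(byte[] data) where T : struct {
        int size = Marshal.SizeOf(typeof(T));
        // guard: Marshal.Copy would throw on a short array anyway,
        // but an explicit check gives a clearer error
        if (data == null || data.Length < size) {
            throw new ArgumentException("data is shorter than the structure", "data");
        }
        IntPtr ptr = Marshal.AllocHGlobal(size);
        try {
            Marshal.Copy(data, 0, ptr, size);
            return (T)Marshal.PtrToStructure(ptr, typeof(T));
        } finally {
            // always release the unmanaged buffer
            Marshal.FreeHGlobal(ptr);
        }
    }
}
```

With such a helper, a constructor on the structure is no longer required; a call like `MarshalDemo.FromBytes<SampleHeader>(bytes)` does the same work. The cost is one unmanaged allocation and two copies per call, which is the price of avoiding `/unsafe`.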

## Conclusion

We saw that we can parse well-formed binary data into our data structures using different approaches. The first is probably the clearest way, because we know each member, its size and its position, and we have control over reading the data for each member. But if we add a member or the structure changes for some reason, we need to change the reader.

The two other solutions use the approach of the structure assignment. In the unsafe implementation, we need to compile the project with the `/unsafe` option. We increase the performance, but we get some security risks.

## History

• 6th March 2010: Initial post

## Share

 Software Developer (Senior) Metrohm AG Switzerland
I'm a passionate programmer with heart and soul. I started programming in 2001 with C++, VB6 and then in 2002 with the .NET Framework. C# and I are still married, because we always solve our problems in a clean, easy and programmatic way.

My other passions are coffee and single-malt whisky. As an owner of a Rocket Espresso Milano, I always get the perfect shot to get up in the morning.
