Reading and writing assemblies are a vital part of any .NET metaprogramming library.

Duality at the Heart of Tao

From the very beginning, I wanted to make sure that Tao was capable of both reading and writing assemblies. After being overwhelmed after reading parts of the ECMA-335 spec, I realized that the only way to effectively create Tao is to allow it to be able to both read and write in small portions that I could incrementally understand. PEVerify was ultimately useless in this case since it only deals with validating entire assemblies rather than just validating the small binary subsets that are contained in each assembly.

The Goal

In an ideal scenario, Tao should be able to read an assembly (or part of an assembly), load it into memory, and then write it back to the disk and the bytes should look exactly the same as the bytes that it used to load it into memory. Given that even the slightest byte read or written in the wrong position could mean the difference between a valid and an invalid assembly, I needed to use an approach that will prevent me from both regressive bugs, and act as a guide when I read or write the incorrect bytes from an assembly. The problem is that since no such tools exist, what’s the best approach for incrementally round tripping an assembly?

Testing the Seemingly Untestable

Writing your own System.Reflection API from scratch can be very difficult, and there’s absolutely no room for mistakes. That’s why I chose to write it using a specification-driven approach since I didn’t have the time to test everything manually once the code was written. I needed an automated approach to testing since there were so many unwritten features that I still had to test. I also needed Tao to easily evolve as my understanding of the .NET assembly metadata increased over time. The idea was that I would use a hex editor to sample small byte arrays from an assembly and use those sample arrays as input to the code under test. Any particular round tripping feature in Tao would be considered “done” if what was read into memory could be serialized back to disk and match the sample array, byte-for-byte. At the time, there was so much that I didn’t know about the PE Format, so I wanted to start small, and use the tests as a guide to build my knowledge of the format over time. However, if you have a bunch of bytes in memory that I need to verify against a set of bytes on disk, how do you ensure that both sets of bytes match each other?

That’s where Tao’s hash function comes in handy:

<script src="https://gist.github.com/philiplaureano/5422630.js"></script>

These two extension methods form the heart of Tao’s self-verification system. It allowed me to incrementally read, write, and ultimately verify every single byte from a .NET assembly as my understanding of .NET assembly formats grew over time. The only problem I had at that point was that the test feedback was still too coarse. The hash functions allow you to verify a block of data, but how do you make it fail fast and track down the exact byte location if it writes the incorrect byte?

Divide and Conquer

In order to effectively test Tao, we’ll need to use a hash function that tracks down an invalid byte at the very point where the byte is written. Each test should be able to immediately fail fast and fail hard if it encounters an invalid byte being read or written to disk. Tao cannot afford to wait until an entire chunk of bytes is written before it can be verified, simply because there’s currently no way to pinpoint the exact location of a byte mismatch if the chunk gets too large. For example, if you want to do a random write of a byte array with two million elements and have it fail at the instant that it writes the first invalid byte, how do you find the exact point where the mismatch occurred with just a hashing function performed on the entire byte array?

As it turns out, if you have two side-by-side arrays of two million bytes each that Tao needs to diff, the simplest way to track down the differences between these two arrays using a hashing function is to eliminate the parts that they have in common. The bytes that will remain are the bytes that differ, and in this case, we only need to find the first byte position where there is a mismatch. One of the most effective ways to eliminate the common parts of these two byte arrays is to recursively halve and compare the two arrays into smaller chunks and compare each successively smaller chunk until the first mismatching byte is found. This approach allows Tao to compare even very large sets of data often using fewer than fifty comparisons, and aside from the recursive function call, the implementation speaks for itself:

<script src="https://gist.github.com/philiplaureano/5423009.js"></script>

Each call to getMismatchPosition recursively halves and compares the streams until the stream comparison size is only a single byte in length. Once the mismatch byte has been found, the comparison ends, and the local function returns the position where the mismatch occurred. Now that we have a way to pinpoint the exact byte mismatch position in any given stream, the next issue is finding a way to immediately fail a test at the very instant that any piece of Tao's code attempts to write an invalid chunk of bytes to a stream. In this case, this is where Tao's TracerStream class is immensely useful:

<script src="https://gist.github.com/philiplaureano/5423127.js"></script>

There's nothing particularly interesting about the StreamDecorator base class other than the fact that it wraps (or decorates) an existing stream so that we don't have to reimplement an entire stream in the TracerStream class if it just intercepts the write calls. What is interesting, however, is that the TracerStream class compares every chunk that is being written (just as it is being written) and immediately fails the write operation if the bytes being written don't match the bytes in the original stream. It's a very simple and effective way to incrementally verify parts of a binary, and it's very useful for verifying .NET assemblies. Now that we have written the basic tools for verifying byte streams in an assembly, the next task is to actually find samples that Tao can parse into memory, write to disk, and verify whether or not it matches the original bytes written into memory. Given this task, exactly what tools do we need to actually grab these byte samples, and start roundtripping these samples in Tao's tests?

Finding Binary Samples using the Right Toolset

One of the best tools for analyzing and sampling raw .NET assemblies is a tool called CFF Explorer. When I first started writing Tao, I used CFF Explorer’s hex editor to create binary dumps in C# that I could use as the expected byte arrays in Tao’s unit tests. For example, when I needed to test roundtriping MS-DOS headers with Tao, I used the sample bytes from CFF Explorer to verify that the DosHeaderWriter class was writing the correct bytes to its given output stream:

<script src="https://gist.github.com/philiplaureano/5423458.js"></script>

Closing the Loop

In a nutshell, the above test case demonstrates how Tao began, and shows how I designed Tao for roundtripping from the very beginning. The idea was that in order to roundtrip an assembly, we need to have the tools to sample small parts of the assembly so that we can, in turn, write small tests that have expectations and assertions that determine what an assembly should look like in memory, despite the fact that we might yet not understand everything about the .NET assembly format. In the next series of posts, we'll dive into the .NET format with the knowledge that each one of these successive read, write, and roundtripping tests in Tao will ensure that it will always be reading and writing the correct assemblies, without fear of having any bug regressions. Fewer regressions means that it will be easier to work with the .NET assembly format and we can just move on to working on the next part of reading/writing/roundtripping an assembly. It's a simple idea with some big design repercussions, and hopefully, these series of posts will show you the details of how I created Tao, and help you understand how I did it, as well as understand why I did it.

Coming Up in the Next Post

In the next post in this series, we'll talk about how to build the simplest possible assembly from scratch with ILASM and parse it with Tao. We'll also dive deeper into the CLR metadata format and explore the CLR metadata tables, and I'll briefly talk about how those tables bind an assembly together, and show you how Tao roundtrips those tables. Lastly, I'll also talk about some experimental uses for being able to directly manipulate those tables, and how they might just change the way you look at .NET assembly manipulation. One half of the post will talk about the tables, and the other half will talk about some of the crazy ideas which inspired me to write Tao in the first place, so stay tuned!