New features the .NET framework should have

Qwertie

4.00/5 (4 votes)

May 25, 2010

CPOL

13 min read

11182

I recently wrote a class called CPTrie that stores a sorted collection of strings or integers in less space than a Dictionary or SortedDictionary. It took a long time to develop this data structure in .NET while minimizing memory and CPU usage. I am fairly convinced that without some of .NET's restrictions, I could have made an even more compact data structure.

Allow me to suggest a few features that would make .NET better by allowing it to do certain tasks more efficiently, or by giving coders more flexible design options.

It is already easier to write efficient code in .NET than Java, of course, thanks to features like structs (value types), Generics and compiled expression trees. But it could be better.

Slices

I love the type safety that .NET provides, but people seem to forget that type safety doesn't have to be a straightjacket. They think, for instance, that pointers automatically kill type safety, apparently forgetting that "references" are nothing but neutered pointers. Pointers aren't inherently dangerous, rather it is certain operations that are dangerous, like reading or writing past the end of an array. If your language can prevent you from doing dangerous things with pointers then they suddenly become safe.

Google Go recognizes this, and provides a safe form of pointer called a slice. A slice is a pointer to an array that stores the array size with the pointer. This is not the same as a .NET array reference, which always points to the beginning of the array. A slice can point to a subset of an array, and if .NET had slices, all functions of the form

int Sum(int[] array, int startIndex, int endIndex) {...}

could be replaced simply with

int Sum(int[..] array) {...}

And you would probably call the function with a nice syntax like

int result = Sum(array[startIndex..endIndex]);
// or Sum(array) to pass the whole array

Plus, these functions often come with a "convenience overload" to process the entire array...

void Foo(int[] array) { Foo(array, 0, array.Length); }

...which would not be necessary with slices.

Slices can be more efficient than the traditional triplet (array, start, end) for three reasons:

A slice is two values (pointer + length). Passing it around is faster than passing around three values.

Triplets use 50% more memory (relevant if you need to store many of them)

All functions that access an array through a triplet must check the array index on every access. For instance, in today's .NET, Sum() might be written like this:
```
void Sum(int[] array, int startIndex, int endIndex)
{
    // I avoid using the "less than" sign because Blogger is still
    // terribly buggy after several years. Lazy bums.
    if (0>startIndex || startIndex>endIndex || endindex>array.Length)
       throw new ArgumentException();

    int total = 0;
    for (int i = startIndex; array.Length > i; i++)
       total += array[i];
    return total;
}
```
With slices:
```
void Sum(int[..] slice)
{
  int total = 0;
  for (int i = 0; slice.Length > i; i++)
     total += slice[i];
  return total;
}
```
Microsoft .NET can detect that a for loop goes from 0 to the array length, and optimize out the range check. It should be easy to do the same for slices, which could make a method like this significantly faster, yet still perfectly type-safe. The CLR is unable to optimize the first loop, however, and there may be additional overhead in the first loop because the JIT can only hold 2 or 3 variables in registers.

Slices would be even more useful if you could cast them in type-safe ways: for instance you should be able to cast a int[4] slice to a short[8] slice, and you should be able to cast a short[7] to an int[3] (there is the possibility of an alignment exception in the latter case on certain processors, but it is still "safe" in the sense that memory corruption is impossible). Slice casts would be extremely useful for parsing various binary data blobs (e.g. binary files, network packets), which are difficult to deal with in .NET today.

Such casts are perfectly safe, provided that casts involving reference types are prohibited.

There's one disadvantage of allowing the slice cast: the optimizer would have to assume that a value-type slice may point to the same physical memory as a different kind of slice or array. But C has the same problem and it still wins the speed race.

One roadblock to implementing slices in .NET may be a challenge to the garbage collector. In .NET today all references point to the beginning of an object. Slices can point to the middle of an object, which means that either the garbage collector needs a way to find the beginning of an object given its middle, or that slices will require that the arrays they point to be pinned. The latter "solution" is very undesirable, as it would mean slices could only be stored in the stack within "fixed" blocks (plus, pinning could tax the GC, but I don't know enough about the GC to say if this is a big problem).

Slices that point to null are also possible. .NET would have to guarantee that null slices always have a Length of 0. Note how null slices are advantageous over null array references, because it is always possible to access the Length field of a slice.

Fixed-size arrays

In some cases it would be really nice to have zero-overhead fixed-size arrays. For instance, I have seen people representing points like this:

class Point {
    private float[] _coords = new float[3];
    public float X { get { return _coords[0]; } set { _coords[0] = value; } }
    public float Y { get { return _coords[1]; } set { _coords[1] = value; } }
    public float Z { get { return _coords[2]; } set { _coords[2] = value; } }
    public float this[int n] {
                     get { return _coords[n]; } set { _coords[n] = value; }
    }
}

This is very inefficient if you need to keep track of large numbers of points, since the 12 bytes of coordinate data is accompanied by 20 bytes of overhead: 8 bytes for the class and 12 bytes for the array. Moreover, all access to the X, Y and Z properties is guarded by an array range check, even though the subscripts and array size are constants. Only the indexer should require a range check.

To solve these problems, .NET should offer "safe" fixed-size arrays that hold the array directly in the container class or struct. This would save 12 bytes of overhead and allow faster access to the array, as well as faster object construction.

You can already use fixed-size arrays in .NET, but they are considered "unsafe" and have the following disadvantages:

They require you to sprinkle the "unsafe" keyword all over the place.
You have to set a compiler option to allow "unsafe" code.
Your code cannot run in secure environments, such as inside a web browser.
Fixed-size arrays cannot hold references to classes, or structs (only primitives are allowed).

Provided that there are array index range checks, there is really nothing unsafe about a fixed-size array, and no reason they can't hold references or structs. The reason for these silly restrictions (I think) is that the CLR doesn't have built-in support for fixed-size arrays, so C# resorts to pointer arithmetic, which is unverifiable to the CLR (and in that sense "unsafe").

Fixed-size arrays would work very nicely with slices, so that you could offer access to the array via a property:

class Point {
    private fixed float _coords[3];
    ...
    public float[..] Coords { get { return _coords; } }
}

However, this would only be legal inside classes. A struct could not return a slice of itself, since a struct could cease to exist before the slice.

Go interfaces

In Go, classes do not have to explicitly say what interfaces they implement, and anybody that notices a pattern in the methods of two otherwise unrelated classes can declare that they both implement a certain interface.

Go's interfaces provide compile-time type checking, and since Go is built heavily around these "implicit" interfaces, I assume Google found a way to give them high performance similar to .NET interfaces. As such I think Microsoft would be smart to consider adding support for Go-style interfaces, which in turn would allow a fast implementation of the Go language on the CLR.

Go-style interfaces would allow developers to work around limitations in the way Microsoft designed interfaces. For instance, there are a lot of read-only collection classes (or collections like Stack where modifications are constrained), but Microsoft didn't bother to make a IReadOnlyCollection interface, so we're stuck writing four unsupported methods (Add, Clear, Remove, CopyTo) in addition to the three that matter (GetEnumerator, Count, Contains).

With Go interfaces, you can define your own private IReadOnlyCollection interface which all the BCL collections, and any collections made by third parties, automatically implement.

(I am not, of course, suggesting that traditional interfaces be abandoned. They are too well entrenched and may have a performance advantage anyhow.)

Default interface implementations

And that reminds me, interfaces should be able to have default implementations. For instance, ICollection.Contains(T) could have a default implementation like this:

bool Contains(T item) default {
    var c = EqualityComparer<T>.Default;
    foreach(T t in this)
        if (item.Equals(t))
            return true;
    return false;
}

Half the time this is exactly how one implements the Contains() method. Why require every collection author to write the same code? Of course, traits could also help us avoid code duplication and would be more powerful, so let me stick that on my list:

Traits

Read about them in this PDF.

Return type covariance

If class Circle is derived from class Shape, then the following code is perfectly safe and should be legal:

abstract class ShapeFactory {
    abstract public Shape New();
}
abstract class CircleFactory {
    override public Circle New() { ... }
}

I have encountered situations where covariance would be useful on multiple occasions and it pains me to have to find a different design or implement a workaround just because Microsoft won't add this trivial feature.

Thread-local variable inheritance

Just a small pet peeve. When you create a child thread, it loses all the values of thread-local variables from the parent. There should be an option to copy values from the parent thread. I heard a rumor that somebody (Apple?) has a patent on this trivial idea, though.

Binary-comparing two value types

Another trivial feature that Microsoft left out of .NET is the ability to bitwise-compare two values. While developing the VList data structures for .NET, I really wanted this feature. There's no reason the 'ceq' opcode shouldn't support it, and as far as I can tell the ECMA-335 standard does not state whether 'ceq' can be used on arbitrary value types. However, C# doesn't allow it so I assume it's not supported.

It's true that bitwise equality isn't "perfect". For instance, floating point has a "positive zero" and a "negative zero" that compare equal, but are bitwise unequal. However, if two value types ARE bitwise equal then they are the same for all practical purposes, and it's hard to think why you would waste time calling Equals() (which, in structs, uses reflection by default!)

In the case of reference types you can call object.ReferenceEquals to find out if two references are the same, but what if you are writing generic code that handles both values and references? In that case you cannot use object.ReferenceEquals, which is why a bitwise comparison operator would be so useful.

A concrete example of this occurs in my VList code, which has a "Smart Select" method:

public FVList<T> SmartSelect(Func<T, T> map)

This method of FVList passes each element of the FVList to the map() function. If, for every element of the list, map() returns the same value as it was given, SmartSelect() can simply return the original list instead of a new one. The ability to do this is very important for efficient functional programming; unfortunately, .NET screws it up by making it difficult to tell whether map() returned its argument unchanged or not. We have to use virtual function calls, and maybe even reflection, to tell whether the output is the same as the input. That's stupid.

Integers in pointers

Okay, what I'm about to suggest may sound crazy. You have been warned.

In .NET, all references have 0s in the bottom two bits of the pointer. This is also true for C++ pointers to heap objects, and the Ruby language interpreter and the Judy array library use this fact to store a kind of safe union in a pointer:

union {
    Object* pointer;
    int integer;
}

If the two lowest bits are 0, the value is a pointer, otherwise it is an integer of some kind. A similar technique is to use the lowest bit to discriminate a union between two pointer types:

union {
    Object0* pointer0; // if low bit is 0
    Object1* pointer1; // if low bit is 1
}

While abusing pointers like this in C is dangerous (since you could easily dereference an integer by mistake, etc.), formalizing the technique can make it safe (just as the Ruby language is safe from memory corruption despite using this kind of union extensively).

In C#, the second union would be almost pointless since objects are self-typed, but it might be useful instead if you could use the low bit of a pointer as a boolean flag. For instance, .NET's SortedDictionary uses the following node structure:

internal class Node
{
    private bool isRed;
    private T item;
    private Node left;
    private Node right;
}

There's something to be said for a straightforward implementation like this, but if somebody's using this class extensively with a large data set, they might appreciate the memory savings (up to 17%) that would come from hiding the "isRed" flag inside the 'left' or 'right' reference.

Of course, this technique is pretty obscure, and is rarely needed, yet occasionally it is the only efficient technique to do the job. In Ruby this feature is key to the type system that unifies integers and objects. In Ruby "everything is an object", but storing integers on the heap is very inefficient. Although all Ruby variables are implemented as pointers, if you store a small integer in a variable, rather than allocating a heap object, Ruby just encodes the integer right into the pointer. Can you think of an approach that is remotely as efficient? Similarly, the Judy array's ability to store data compactly is heavily reliant on the fact that it can change the meaning of a "pointer" based on its low bits. I wanted to port Judy to .NET, and struggled for quite some time before concluding that approximating Judy is impossible in .NET.

The main disadvantage of this feature would be that the garbage collector would have to be aware of these special pointer types; code to deal with them might slow down the GC slightly, even in code that doesn't use the special pointers. I suspect this feature could be added without modifying CIL or the JIT, and the code to handle them in managed code could be neatly tucked away in pointer-sized structures in the BCL.

Units

The traditional type system (roughly speaking) classifies data by its structure: an array holds data in a certain way; a linked list holds data in a different way. There is an orthogonal concept of type, the "unit" type, which says something about the data itself. For example, an integer may contain a quantity measured in bytes, kilobytes, kilometres, sectors or pixels. A double has a different physical structure than an int, but it could have the same unit type.

It is safe to add or subtract two values as long as they have the same units. I can add 10 bytes to 5 bytes and get 15 bytes, but if I add 5 bytes to 10 DWORDs there is probably an error in my code. Every engineer and scientist knows that if you multiply "2 kg" times "2 m/s^2", the result is "4 N", not "4 lbf" and code that computes the latter result might have destroyed the Mars Climate Orbiter.

A unit type checker like the one I wrote for boo would semi-automatically detect such errors and emit a warning message. Unit checking would work best if it had some support from the Microsoft BCL (Base Class Library). That's because a unit checker needs to track units as they pass through the BCL. For instance, if I write "Math.Abs(-7`pixels`), the unit checker has to be told that the return value of Math.Abs() has the same units as its argument. It would be nice if the BCL had attributes on its methods to provide this kind of information.

Now on CodeProject