D-style ranges in C# (.NET)
Ranges are the most important concept in the design of modern D collection code, and I want them to be included under the Loyc umbrella (but for now, don't worry what Loyc is.) As I wrote in an as-yet unpublished article on about the design of my new language, Enhanced C#:
Ranges are the most important concept in the design of modern D collection code, and I want them to be included under the Loyc umbrella (but for now, don't worry what Loyc is.)
As I wrote in an as-yet unpublished article on about the design of my new language, Enhanced C#:
The really key thing I like about .NET is that it is specifically a multi-language, multi-OS platform with a standard binary format on all platforms--much like Java, but technically superior. It's only "multi-OS", of course, insofar as you avoid Windows-only libraries such as WPF and WCF, and I would encourage you to do so (if your boss will allow it). .NET solves the "integration gap" as long as the languages you want to mash up are available in .NET.
Without a multi-language runtime, interoperability is limited to the lowest common denominator, which is C, which is an extremely painful limitation. Modern languages should be able to interact on a much higher level.
A multi-language platform avoids a key problem with most new languages: they define their own new "standard library" from scratch. You know what they say about standards, that the nice thing about standards is that there are so many to choose from? I hate that! I don't want to learn a new standard library with every new language I learn. Nor do I want to be locked into a certain API based on the operating system (Windows APIs and Linux APIs). And all the engineering effort that goes into the plethora of "standard" libraries could be better spent on other things. The existence of many standard libraries increases learning curves and severely limits interoperability between languages.
In my opinion, this is the most important problem .NET solves; there is just one standard library for all languages and all operating systems (plus an optional "extra" standard library per language, if necessary). All languages and all operating systems are interoperable--it's really nice! The .NET standard library (called the BCL, base class library) definitely could be and should be designed better, but at least they have the Right Idea.
Unfortunately, while the .NET standard library has a lot of useful functionality, it has a lot of missing pieces, and pieces whose design was poorly thought out.
Since Loyc is all about interoperability between languages, it makes sense that there should be a "Loyc standard library" which provides similar functionality in a variety of programming languages. The .NET BCL is a good start, but not quite good enough. I certainly haven't had time to design a full-fledged standard library, but I'd like to talk today about some ideas I have about collections.
One of the most important elements of any standard library is the way it handles
collections. Unfortunately, .NET's collection interfaces are a "bare minimum"
design. There's IEnumerable(T)
, ICollection(T)
,
IList(T)
, IDictionary(T)
and that's about all.
- Read-only lists must use the same interface as writable lists. This reduces
type safety, and a read-only list is incompatible with .NET
generics variance.
Plus, implementing the entire
IList(T)
interface is a huge pain in the butt when you only want to provide read-only access. - Similarly for read-only collections: you can either implement just
IEnumerable(T)
--but thenCount
andContains(
) are inaccessible--or the fullICollection(T)
interface withClear()
,Add()
,Remove()
, and even implementingIsReadOnly
andCopyTo
is a chore. - There are no interfaces with
AddRange()
orInsertRange(
) - There is no support for slices (sections of a list or a sequence)--this is the single biggest problem in my opinion.
- Although an enumerator can sometimes be restarted from the beginning with
Reset()
, you cannot save the current position and start from there later. - There is no specific support for backward enumeration, even though most collections could easily do it.
- Various elements of the design are less efficient than they could be, such
as the need to call two interface methods per iteration of
IEnumerable
, or the fact that two separate operations are required to perform operations such as "add value to dictionary if key does not already exist" (known asAddIfNotPresent
in my new mutable dictionary,MMap<T>
) or "get value and remove pair" (GetAndRemove
). - There's lots of other stuff one could ask for.
Here are some of the collection interfaces my Loyc.Essentials library includes already:
ISource(out T): IEnumerable(T)
plusICount
, which represents theCount
property alone (covariant)IListSource(out T)
: a simple read-only list collection (covariant)INegListSource(T)
,INegArray(T)
andINegAutoSizeArray(T)
: interfaces that represent lists that do not start at index zeroIPush(T)
,IPop(T)
,IStack(T)
,IQueue(T)
: interfaces with "push" and "pop" actions;
andIStack
(T)IQueue(T)
are identical, but represent different behaviorsISetImm(T,SetT)
: interface that represents an immutable set, with sub-interfacesISetOperations(T,SetT)
andISetTests(SetT).
(SetT
is the type that implements the interface. It's a bit inconvenient to use interfaces parameterized on the set type itself, but I found that actually implementing a set interface that is not parameterized on itself is a damned inconvenience, but I digress.)IArray(T)
: interface that represents a collection that can be modified, but cannot be resized.IAutoSizeArray(T):
anIArray(T)
that resizes itself automatically when you assign an item to a new index.IAddRange(T)
,IListRangeMethods(T)
: interfaces that contain additional collection methods such as
andAddRange
()RemoveRange()
ISinkCollection(in T)
,ISinkArray(in T),
ISinkList(in T)
: interfaces that represent collections you can write to, but not read from (contravariant).ICollectionEx(T)
: combinesISource(T)
,ISinkCollection(T)
,IAddRange(T)
andRemoveAll(Predicate(T))
, and of courseICollection(T)
.IListEx(T)
: combinesICollectionEx(T)
,IArray(T)
,ISinkList(T)
, and of courseIList(T)
.
I have observed that there are never quite enough interfaces: some people want an
interface that includes AddRange(...), others will want an interface with
RemoveAll(...)
; somebody will implement a collection that can Add() but not
Remove() items; and almost no one actually calls CopyTo()
. To meet
everyone's needs,
Go-style interfaces, which .NET does not support, would really help. With these,
the missing interfaces can be retrofitted onto existing classes. And certainly Go-style
interfaces are the best workaround for a standard library like the .NET BCL, whose
built-in interfaces are so terribly impoverished.
But without support for Go-style interfaces, the best alternative is to define a large number of interfaces, and to provide adapter types to coerce old collections to the new interfaces.
Anyway, all of this is just background for the real topic of today's post: Ranges.
Ranges are an improvement on the C++ concept of iterators. I don't exactly know how ranges were invented in D, but perhaps someone noticed that most of the C++ STL algorithms require pairs of iterators:
bool all_of(InputIterator first, InputIterator last, UnaryPredicate pred);
Function for_each(InputIterator first, InputIterator last, Function fn);
InputIterator find(InputIterator first, InputIterator last, const T& val);
difference_type count(InputIterator first, InputIterator last, const T& val);
ForwardIterator unique(ForwardIterator first, ForwardIterator last, BinaryPredicate pred);
void reverse(BidirectionalIterator first, BidirectionalIterator last);
void sort(RandomAccessIterator first, RandomAccessIterator last, Compare comp);
ForwardIterator lower_bound(ForwardIterator first, ForwardIterator last, const T& val);
In fact, most STL algorithms require exactly two iterators--a range--and none require only a single iterator (I think). Passing ranges as iterators tends to be highly cumbersome; you are constantly repeating yourself:
push_heap(currentNode().children.begin(), currentNode().children.end());
I once read an in-depth article that explained all of this in more detail--I can't find the article now, but it discussed a "corner case" in C++ that requires three iterators instead of two, as well as how STL algorithms that return a single iterator (within a range) can map to D by returning a range. Since D does not use the iterator concept, if a D algorithm wants to return a reference to a single element within a range, instead of returning an iterator, it will return a range whose "front" is the single element that the algorithm wanted to return. But this range can be moved forward as usual with "popFront". Thus, for example, if you search for "6" in a list of numbers from 1 to 10, the search algorithm would return a range from 6 to 10.
I frequently run into difficulty in .NET because it does not have a "range" concept. We have LINQ, but that's not really enough. For instance if I want to get a sequence representing the second half of an array, I could write
array.Skip(array.Length/2)
But the Skip()
operation just wasted array.Length/2
loop iterations skipping elements one at a time, wasting a lot of CPU time,
and the return value is no longer array-like--it's just an IEnumerable
that can only be read from beginning to end. A lot of programmers are willing to "solve"
the problem by throwing away CPU time:
array.Skip(array.Length/2).ToArray() // viola!
but that's not good enough for me.
I'd like to solve many of .NET's limitations by bringing ranges from D to C#. D ranges are explained in a tutorial at this link. In summary, D has four kinds of ranges:
InputRange
: requires the empty, front andpopFront()
member functionsForwardRange
: additionally requires the save member functionBidirectionalRange
: additionally requires the back andpopBack()
member functionsRandomAccessRange
: additionally requires the [] operator, and a length property if the range is finite
Now, there are some important differences between D and C#, so D-style ranges cannot be ported directly to C#:
- D has templated methods that can examine an object at compile time to decide
how to treat it. For example, to support output ranges, the
put(range, element)
operation can callrange.put(element)
if it exists, or if there is noput()
method, it can modifyrange.front
and then callrange.popFront()
instead. - D ranges are generally easier to copy (although this is not guaranteed).
The issue of how to copy a range is most irksome. To make this situation clear, imagine a C++ iterator. C++ iterators are modeled after C++ pointers, and if you copy a pointer:
int* x = &array[i];
int* y = x;
x++;
Then obviously, the two pointers are independent: modifying x did not affect y. The simple '=' operator makes a copy, and copies are independent of each other.
So that's how C++ iterators work too. When you transfer an iterator with '=', you expect the copies to be independent, and they are (when implemented properly).
Since ranges are essentially just pairs of iterators, ideally, you would expect ranges to work the same way (although I hear that in D, it doesn't always work that way.)
Unfortunately, in .NET it is generally impossible to make ranges work that way. Certainly some ranges can work that way, such as a simple array range, which would naturally be a struct type:
struct ArraySlice<T> : IRange<T> {
T[] _array;
int _start, _count;
...
public T this[int index] {
get {
/* TODO: check if index is out of range */
return _array[_start + index];
}
}
...
}
Unfortunately, using '=' to make a copy is impossible in general. First of all,
if IRange(T)
is an interface that represents a random-access range,
simply casting ArraySlice(int)
to IRange(int)
will defeat
the '=' operator, so that it no longer makes copies.
And even if every range type were a struct, that still doesn't mean that they could
be safely copied with '='. Consider a forward range that walks a
B+ tree. In order to walk the
tree, it is necessary to have a stack--a list of parent nodes, which will be an
array or perhaps a Stack(T)
:
struct TreeRange<T> : IFRange<T> { // forward range
private Stack<Pair<InternalNode<T>, int>> _parents;
private LeafNode<T> _curNode; // node being walked right now
private int _index; // the current item in _curNode
...
}
But C# does not allow any action to be taken when a struct is copied, so if the
array or Stack is stored in a struct, the Stack will not be copied when the '='
operator is used, even though all the other fields are copied. The two copies will
have independent _index
and _curNode
fields, but share
the same Stack! Therefore, after you've copied the range, advancing one of the copies
of the range will (after a few iterations, anyway) completely screw up the other
copy, making it behave in some bizarre way.
D mostly solves this problem with a "postblit constructor": like C#, D
copies value types bit-for-bit; but unlike C#, D then calls the "postblit constructor",
known as this(this)
in code, on the copy. This allows the
copy to properly "finish" the copy operation.
I cannot add postblit constructors to C#, but I still think that the range concept is important enough to support. Therefore, I am adding range interfaces to my Loyc.Essentials library in the Loyc.Collections namespace. I recommend that if a range type is not safe to copy with '=' (such as the TreeRange example), it should be a class instead. Luckily, the .NET garbage collector is much faster than the one in D, so using a class does not usually cause significant performance problems.
Besides, D does not solve the copying problem completely:
- D has a class of ranges called "input ranges" that cannot be copied, and/li>
- D ranges could be implemented as class objects, which are very nearly the same as C# class objects; the '=' operator does not copy classes
For this reason* D has a save()
function (actually it is a
property, curiously enough) on ranges that duplicates the range safely. (* I speak
as if I know what I'm talking about, but please note that I have not written any
real programs with D.)
For the C# version of ranges, the interfaces will inherit ICloneable<R>
so you can copy ranges explicitly with Clone()
; this is equivalent
to the save() method in D.
Here are the proposed range interfaces: IFRange
(forward range),
IBRange
(bidirectional range), and IRange
(finite random-access
range). Note that it is not really necessary to define an "input range";
the standard IEnumerator(T
) interface is already equivalent to an input
range. Perhaps I'll add an IIRange
(infinite random-access range) later,
and maybe an output range, but for now I have defined these three:
ppublic interface IFRange<out T> : IEnumerable<T>, ICloneable<IFRange<T>>
{
bool IsEmpty { get; }
T First { get; }
bool DropFirst();
T PopFirst(out bool empty);
}
public interface IBRange<out T> : IFRange<T>, ICloneable<IBRange<T>>
{
T Last { get; }
bool DropLast();
T PopLast(out bool empty);
}
public interface IRange<out T> : IBRange<T>, IListSource<T>, ICloneable<IRange<T>>
{
}
// IListSource<T> is a read-only list (not a range) and ISource<T>
// is simply IEnumerable plus Count. I am wondering whether the naming
// is appropriate; I chose the name "Source" as opposed to "Sink": a
// Source provides data, while Sink accepts data.
public interface IListSource<out T> : ISource<T>
{
T this[int index] { get; }
T TryGet(int index, ref bool fail);
IRange<T> Slice(int start, int count);
}
AnAnd
in addition there are three mutable variants, IMFRange
(allows
you to modify First), IMBRange
(allows you to modify First and Last)
and IMRange
(allows you to modify any element).
Tentatively I have used the names "First
" and "Last
"
instead of "Front
" and "Back
" because
the collection classes I've already written contain "First
"
and "Last
" properties. I'm debating whether to change this
for consistency with D. D's "empty" becomes "IsEmpty
"
in .NET, for consistency with other properties such as "IsReadOnly
".
"DropFirst()
" and "DropLast()
" could
just as easily be called "PopFirst()
" and "PopLast()
"
for consistency with D, but I would like to eventually have extension methods called "PopFirst()
"
and "PopLast()
" which not only chop off the first element
of the range, but return that first element at the same time. These extension methods
do not currently exist, however, due to a limitation of C#. Since a range could
be a struct
type, the extension method must take the
range by reference to guarantee that it will actually be modified! Unfortunately, "(this
ref Range self)
" is an illegal argument list in C#. Nevertheless, I
hope Enhanced C# will eventually support this type of extension method.
Finally, I've added special Pop operations that D does not have, and a Slice() method from IListSource:
T PopFirst(out bool empty);
T PopLast(out bool empty);
IRange<T> Slice(int start, int count);
Come to think of it, I don't know how D solves the problem of slicing random-access ranges. Probably something involving the "$" and ".." operators.
Anyway, the thing I wanted to highlight was that PopFirst
contains
the functionality of IsEmpty
, First, and DropFirst()
all
in one single method. It saves the value of IsEmpty
in the 'empty'
parameter, gets the value of First, drops the first element, and returns the old
value of First
(if !empty). I have
complained
loudly in the past about the fact that IEnumerator(T)
requires two interface calls just to yield a single element*, and D's range interface
is 50% worse, requiring three method calls per iteration! In D, this is rarely a
problem because D ranges are used directly rather than through interfaces, so the
compiler can inline the calls to all three methods. But in .NET land, interfaces
are used far more often, so the three calls are fairly expensive.
The question, though, is whether it's worth requiring both approaches. The "chatty" approach with three methods is more convenient for the average caller who doesn't care much about performance, but the single-call method is crucial for high performance and should not be eliminated. So what should I do? Keep the four methods, drop all but the last one, or have some kind of compromise?
If extension methods were compatible with ranges, a very reasonable compromise would
be to keep just IsEmpty
and PopFirst
, but eliminate
First
and DropFirst
. Then, two extension methods would
perform PopFirst
without requiring a cumbersome "out" argument:
public static T PopFirst<R, T>(this ref R range, T defaultValue) where R : IFRange<T>
{
bool fail;
T next = range.PopFirst(out fail);
if (!fail)
return next;
else
return defaultValue;
}
public static T PopFirst<R,T>(this ref R range) where R:IFRange<T>
{
bool fail;
T next = range.PopFirst(out fail);
if (fail) throw new EmptySequenceException();
return next;
}
But as I explained, these extension methods cannot currently exist--not as extension
methods, anyway. You can, however, call the non-extension method that does exist,
Range.PopFirst(ref r)
. It's better than nothing, anyway.
Another compromise would be to eliminate DropFirst()
, but keep First
in case you want to "peek" at the next value without removing it (note
that requiring a First property may increase the complexity or run-time size of
a range type; and technically it is possible to implement First() as an extension
method, so it is not strictly necessary for it to be part of the interface, although
it is potentially faster when it is built-in.)
Finally, I am wondering whether to actually use the term "random-access range" or whether to use the term "slice" instead. As far as I can tell, in D the term "slice" means "array range", i.e. a section of an array, not some other data type. But "slice" rolls off the tongue nicely, and I am tempted to use it for all random-access ranges. In that case, IRange(T) would be renamed ISlice(T). Opinions, anyone?
In any case, I am in the process of adding range support to my collections library, Loyc.Collections. Note that the collection interfaces will go in Loyc.Essentials, a library of small methods, simple data structures and core interfaces, while most of the actual collection implementations will go in Loyc.Collections. I will also have to review the existing interfaces to understand how they relate to ranges and whether one or more of them should be eliminated or combined with ranges somehow. Perhaps I'll post again when it's done.
Loyc.Collections contains some data structures I've published articles about already,
such as the
AList,
VList
and
CPTrie, but also some interesting unpublished data structures such as the BList,
BMultiMap, Set/MSet, Map/MMap, InvertibleSet, and DList. Let me know if you would
like me to prioritize writing articles about these things (I also have a parser
generator and a new programming language in the works, you know, plus I have to
find the time to set up a web site for all of this).
P.S. An ominous footnote:
The mutable ranges do not include a Clone method due to a limitation of C#. C# does
not support covariance, which means that every time a derived interface supports
cloning, the implementing class is required to write a separate clone method. Read-only
ranges already have to implement up to three clone methods: ICloneable(IFRange(T))
,
ICloneable(IBRange(T))
, ICloneable(IRange(T))
, and that's
in addition to the Clone method for the concrete type! If mutable ranges also supported
cloning, they would add up to three more clone methods, which is really getting
out of hand.
Even without mutable clone methods, implementing all the ICloneables
can be a chore. But you can just copy and paste these, inserting an appropriate
constructor call for your range type:
IFRange<T> ICloneable<IFRange<T>>.Clone() { return Clone(); }
IBRange<T> ICloneable<IBRange<T>>.Clone() { return Clone(); }
IRange<T> ICloneable<IRange<T>> .Clone() { return Clone(); }
public MyType Clone() { return new MyType(...); }
When complete, EC# will, of course, take care of this crap for you.
Update: it was pointed out to me that although algorithms
usually need ranges, there is other code that has some need for ordinary iterators.
In .NET we typically use an index when we need an iterator, but this only
works for random-access data structures, and it is not as type-safe as an iterator
(although it has the virtue of storage efficiency.) For non-random-access data structures,
the solutions are either ad-hoc (e.g. LinkedListNode
inside a LinkedList)
or non-existant (SortedDictionary
has neither an iterator nor a range
type; an iterator would have been one solution for
a href="http://stackoverflow.com/questions/1690929/what-net-dictionary-supports-a-find-nearest-key-operation">
this question I had). Comments are welcome about how to deal with the need for
iterators in .NET.
I suppose the problem with C++ iterators is that they are useless without external
context: you can't increment or decrement one without also comparing it to
begin()
or end()
in its container, which implies that the caller
must manually keep track of which container it came from. Thus, an iterator is hardly
an improvement over a plain-old integer index, the only advantages being
- You can dereference it without reference to its container (but this is nothing special; you can dereference an object reference too)
- Unlike an index, it's compatible with non-random-access data structures (and huge data structures, for which a 32-bit integer index does not suffice).
Perhaps the iterator concept could be improved by being made self-aware: if the iterator "knew" and could tell you when it was at the beginning or end of its container. This would increase the storage requirement in some circumstances but not others. For example, an iterator to a doubly-linked list node can tell when it is at the beginning or end, but an iterator to a singly-linked list node can only tell when it is at the end (or rather, when there are no more elements after the current one. If the iterator follows the 'null' pointer at the end of a linked list, it becomes null itself, and can no longer increment or decrement).
A pointer inside an array may or may not be able to tell if it is at the end depending on how arrays work. Perhaps the way D heap arrays work would allow an array iterator, implemented as a simple pointer, to still know when it is safe to increment and decrement. The simplest possible .NET array iterator is an array reference plus an index, and again the iterator can know when it is at the beginning or end of the array--except that if the iterator came from inside a range, it would be unaware of the range's boundaries, which may be smaller than the array's boundaries.
The question is, what does an appropriate iterator type look like? Could it follow an interface that would make it more useful than a C++ iterator, but still more efficient than a range?