Ranges are the most important concept in the design of modern
D collection code, and I want them to be included
under the Loyc umbrella (but for now, don't worry what Loyc is.)
As I wrote in an as-yet unpublished article on about the design of my new language,
Enhanced C#:
The really key thing I like about .NET is that it is specifically a multi-language,
multi-OS platform with a standard binary format on all platforms--much like
Java, but technically superior. It's only "multi-OS", of course, insofar
as you avoid Windows-only libraries such as WPF and WCF, and I would encourage
you to do so (if your boss will allow it). .NET solves the "integration
gap" as long as the languages you want to mash up are available in .NET.
Without a multi-language runtime, interoperability is limited to the lowest
common denominator, which is C, which is an extremely painful limitation. Modern
languages should be able to interact on a much higher level.
A multi-language platform avoids a key problem with most new languages: they
define their own new "standard library" from scratch. You know what
they say about standards, that the nice thing about standards is that there
are so many to choose from? I hate that! I don't want to learn
a new standard library with every new language I learn. Nor do I want to be
locked into a certain API based on the operating system (Windows APIs and Linux
APIs). And all the engineering effort that goes into the plethora of "standard"
libraries could be better spent on other things. The existence of many standard
libraries increases learning curves and severely limits interoperability between
languages.
In my opinion, this is the most important problem .NET solves; there is just
one standard library for all languages and all operating systems (plus an optional "extra"
standard library per language, if necessary). All languages and all operating
systems are interoperable--it's really nice! The .NET standard library (called
the BCL, base class library) definitely could be and should be designed better,
but at least they have the Right Idea.
Unfortunately, while the .NET standard library has a lot of useful functionality,
it has a lot of missing pieces, and pieces whose design was poorly thought out.
Since Loyc is all about interoperability between languages, it makes sense that
there should be a "Loyc standard library" which provides similar functionality
in a variety of programming languages. The .NET BCL is a good start, but not quite
good enough. I certainly haven't had time to design a full-fledged standard library,
but I'd like to talk today about some ideas I have about collections.
One of the most important elements of any standard library is the way it handles
collections. Unfortunately, .NET's collection interfaces are a "bare minimum"
design. There's IEnumerable(T)
, ICollection(T)
,
IList(T)
,
IDictionary(T)
and that's about all.
- Read-only lists must use the same interface as writable lists. This reduces
type safety, and a read-only list is incompatible with .NET
generics variance.
Plus, implementing the entire
IList(T)
interface is a huge pain in the butt
when you only want to provide read-only access. - Similarly for read-only collections: you can either implement just
IEnumerable(T)
--but then Count
and Contains(
)
are inaccessible--or the full ICollection(T)
interface with
Clear()
, Add()
, Remove()
, and even implementing
IsReadOnly
and CopyTo
is a chore. - There are no interfaces with
AddRange()
or InsertRange(
) - There is no support for slices (sections of a list or a sequence)--this
is the single biggest problem in my opinion.
- Although an enumerator can sometimes be restarted from the beginning with
Reset()
, you cannot save the current position and start from there
later. - There is no specific support for backward enumeration, even though most
collections could easily do it.
- Various elements of the design are less efficient than they could be, such
as the need to call two interface methods per iteration of
IEnumerable
,
or the fact that two separate operations are required to perform operations
such as "add value to dictionary if key does not already exist" (known
as AddIfNotPresent
in my new mutable dictionary, MMap<T>
)
or "get value and remove pair" (GetAndRemove
). - There's lots of other stuff one could ask for.
Here are some of the collection interfaces my Loyc.Essentials library includes already:
ISource(out T): IEnumerable(T)
plus ICount
, which
represents the Count
property alone (covariant) IListSource(out T)
: a simple read-only list collection (covariant) INegListSource(T)
, INegArray(T)
and INegAutoSizeArray(T)
:
interfaces that represent lists that do not start at index zero IPush(T)
, IPop(T)
, IStack(T)
,
IQueue(T)
: interfaces with "push" and "pop"
actions;
<code>IStack
(T) and IQueue(T)
are identical, but represent different behaviors ISetImm(T,SetT)
: interface that represents an immutable set,
with sub-interfaces ISetOperations(T,SetT)
and ISetTests(SetT).
(SetT
is the type that implements the interface. It's a bit inconvenient
to use interfaces parameterized on the set type itself, but I found
that actually implementing a set interface that is not parameterized
on itself is a damned inconvenience, but I digress.) IArray(T)
: interface that represents a collection that can
be modified, but cannot be resized. IAutoSizeArray(T):
an IArray(T)
that resizes itself
automatically when you assign an item to a new index. IAddRange(T)
, IListRangeMethods(T)
: interfaces
that contain additional collection methods such as
<code>
AddRange
() and RemoveRange()
ISinkCollection(in T)
, ISinkArray(in T),
ISinkList(in T)
: interfaces that represent collections you can write
to, but not read from (contravariant). ICollectionEx(T)
: combines ISource(T)
, ISinkCollection(T)
,
IAddRange(T)
and RemoveAll(Predicate(T))
, and of course
ICollection(T)
. IListEx(T)
: combines ICollectionEx(T)
, IArray(T)
,
ISinkList(T)
, and of course IList(T)
.
I have observed that there are never quite enough interfaces: some people want an
interface that includes AddRange(...), others will want an interface with
RemoveAll(...)
; somebody will implement a collection that can Add() but not
Remove() items; and almost no one actually calls
CopyTo()
. To meet
everyone's needs,
Go-style interfaces, which .NET does not support, would really help. With these,
the missing interfaces can be retrofitted onto existing classes. And certainly Go-style
interfaces are the best workaround for a standard library like the .NET BCL, whose
built-in interfaces are so terribly impoverished.
But without support for Go-style interfaces, the best alternative is to define a
large number of interfaces, and to provide adapter types to coerce old collections
to the new interfaces.
Anyway, all of this is just background for the real topic of today's post: Ranges.
Ranges are an improvement on the C++ concept of iterators. I don't exactly know
how ranges were invented in D, but perhaps someone noticed that most of the C++
STL algorithms require pairs of iterators:
bool all_of(InputIterator first, InputIterator last, UnaryPredicate pred);
Function for_each(InputIterator first, InputIterator last, Function fn);
InputIterator find(InputIterator first, InputIterator last, const T& val);
difference_type count(InputIterator first, InputIterator last, const T& val);
ForwardIterator unique(ForwardIterator first, ForwardIterator last, BinaryPredicate pred);
void reverse(BidirectionalIterator first, BidirectionalIterator last);
void sort(RandomAccessIterator first, RandomAccessIterator last, Compare comp);
ForwardIterator lower_bound(ForwardIterator first, ForwardIterator last, const T& val);
In fact, most STL algorithms require exactly two iterators--a range--and none require
only a single iterator (I think). Passing ranges as iterators tends to be highly
cumbersome; you are constantly repeating yourself:
push_heap(currentNode().children.begin(), currentNode().children.end());
I once read an in-depth article that explained all of this in more detail--I can't
find the article now, but it discussed a "corner case" in C++ that requires
three iterators instead of two, as well as how STL algorithms that return a single
iterator (within a range) can map to D by returning a range. Since D does not use
the iterator concept, if a D algorithm wants to return a reference to a single element
within a range, instead of returning an iterator, it will return a range whose "front"
is the single element that the algorithm wanted to return. But this range can be
moved forward as usual with "popFront". Thus, for example, if you search
for "6" in a list of numbers from 1 to 10, the search algorithm would
return a range from 6 to 10.
I frequently run into difficulty in .NET because it does not have a "range"
concept. We have LINQ, but that's not really enough. For instance if I want to get
a sequence representing the second half of an array, I could write
array.Skip(array.Length/2)
But the Skip()
operation just wasted
array.Length/2
loop iterations skipping elements one at a time, wasting a lot of CPU time,
and the return value is no longer
array-like--it's just an IEnumerable
that can only be read from beginning to end. A lot of programmers are willing to "solve"
the problem by throwing away CPU time:
array.Skip(array.Length/2).ToArray()
but that's not good enough for me.
I'd like to solve many of .NET's limitations by bringing ranges from D
to C#. D ranges are explained in
a tutorial at this link. In summary, D has four kinds of ranges:
InputRange
: requires the empty, front and popFront()
member functions ForwardRange
: additionally requires the save member function BidirectionalRange
: additionally requires the back and
popBack()
member functions RandomAccessRange
: additionally requires the [] operator, and
a length property if the range is finite
Now, there are some important differences between D and C#, so D-style ranges cannot
be ported directly to C#:
- D has templated methods that can examine an object at compile time to decide
how to treat it. For example, to support output ranges, the
put(range,
element)
operation can call
range.put(element)
if it exists, or if there is no put()
method, it
can modify range.front
and then call
range.popFront()
instead. - D ranges are generally easier to copy (although this is not guaranteed).
The issue of how to copy a range is most irksome. To make this situation clear,
imagine a C++ iterator. C++ iterators are modeled after C++ pointers, and if you
copy a pointer:
int* x = &array[i];
int* y = x;
x++;
Then obviously, the two pointers are independent: modifying x did not affect y.
The simple '=' operator makes a copy, and copies are independent of each other.
So that's how C++ iterators work too. When you transfer an iterator with '=', you
expect the copies to be independent, and they are (when implemented properly).
Since ranges are essentially just pairs of iterators, ideally, you would expect
ranges to work the same way (although I hear that in D, it doesn't always work that
way.)
Unfortunately, in .NET it is generally impossible to make ranges
work that way. Certainly some ranges can work that way, such as a simple array range,
which would naturally be a struct type:
struct ArraySlice<T> : IRange<T> {
T[] _array;
int _start, _count;
...
public T this[int index] {
get {
return _array[_start + index];
}
}
...
}
Unfortunately, using '=' to make a copy is impossible in general. First of all,
if IRange(T)
is an interface that represents a random-access range,
simply casting ArraySlice(int)
to IRange(int)
will defeat
the '=' operator, so that it no longer makes copies.
And even if every range type were a struct, that still doesn't mean that they could
be safely copied with '='. Consider a forward range that walks a
B+ tree. In order to walk the
tree, it is necessary to have a stack--a list of parent nodes, which will be an
array or perhaps a Stack(T)
:
struct TreeRange<T> : IFRange<T> {
private Stack<Pair<InternalNode<T>, int>> _parents;
private LeafNode<T> _curNode;
private int _index;
...
}
But C# does not allow any action to be taken when a struct is copied, so if the
array or Stack is stored in a struct, the Stack will not be copied when the '='
operator is used, even though all the other fields are copied. The two copies will
have independent _index
and _curNode
fields, but share
the same Stack! Therefore, after you've copied the range, advancing one of the copies
of the range will (after a few iterations, anyway) completely screw up the other
copy, making it behave in some bizarre way.
D mostly solves this problem with a "postblit constructor": like C#, D
copies value types bit-for-bit; but unlike C#, D then calls the "postblit constructor",
known as this(this)
in code, on the copy. This allows the
copy to properly "finish" the copy operation.
I cannot add postblit constructors to C#, but I still think that the range concept
is important enough to support. Therefore, I am adding range interfaces to my Loyc.Essentials
library in the Loyc.Collections namespace. I recommend that if a range type is not
safe to copy with '=' (such as the TreeRange example), it should be a class instead.
Luckily, the .NET garbage collector is much faster than the one in D, so using a
class does not usually cause significant performance problems.
Besides, D does not solve the copying problem completely:
- D has a class of ranges called "input ranges" that cannot be copied,
and/li>
- D ranges could be implemented as class objects, which are very nearly the
same as C# class objects; the '=' operator does not copy classes
For this reason* D has a save()
function (actually it is a
property, curiously enough) on ranges that duplicates the range safely. (* I speak
as if I know what I'm talking about, but please note that I have not written any
real programs with D.)
For the C# version of ranges, the interfaces will inherit
ICloneable<R>
so you can copy ranges explicitly with
Clone()
; this is equivalent
to the save() method in D.
Here are the proposed range interfaces: IFRange
(forward range),
IBRange
(bidirectional range), and IRange
(finite random-access
range). Note that it is not really necessary to define an "input range";
the standard IEnumerator(T
) interface is already equivalent to an input
range. Perhaps I'll add an IIRange
(infinite random-access range) later,
and maybe an output range, but for now I have defined these three:
ppublic interface IFRange<out T> : IEnumerable<T>, ICloneable<IFRange<T>>
{
bool IsEmpty { get; }
T First { get; }
bool DropFirst();
T PopFirst(out bool empty);
}
public interface IBRange<out T> : IFRange<T>, ICloneable<IBRange<T>>
{
T Last { get; }
bool DropLast();
T PopLast(out bool empty);
}
public interface IRange<out T> : IBRange<T>, IListSource<T>, ICloneable<IRange<T>>
{
}
public interface IListSource<out T> : ISource<T>
{
T this[int index] { get; }
T TryGet(int index, ref bool fail);
IRange<T> Slice(int start, int count);
}
AnAnd
in addition there are three mutable variants, IMFRange
(allows
you to modify First), IMBRange
(allows you to modify First and Last)
and IMRange
(allows you to modify any element).
Tentatively I have used the names "First
" and "Last
"
instead of "Front
" and "Back
" because
the collection classes I've already written contain "First
"
and "Last
" properties. I'm debating whether to change this
for consistency with D. D's "empty" becomes "IsEmpty
"
in .NET, for consistency with other properties such as "IsReadOnly
".
"DropFirst()
" and "DropLast()
" could
just as easily be called "PopFirst()
" and "PopLast()
"
for consistency with D, but I would like to eventually have extension methods called "PopFirst()
"
and "PopLast()
" which not only chop off the first element
of the range, but return that first element at the same time. These extension methods
do not currently exist, however, due to a limitation of C#. Since a range could
be a struct
type, the extension method must take the
range by reference to guarantee that it will actually be modified! Unfortunately, "
(this
ref Range self)
" is an illegal argument list in C#. Nevertheless, I
hope Enhanced C# will eventually support this type of extension method.
Finally, I've added special Pop operations that D does not have, and a Slice() method
from IListSource:
T PopFirst(out bool empty);
T PopLast(out bool empty);
IRange<T> Slice(int start, int count);
Come to think of it, I don't know how D solves the problem of slicing random-access
ranges. Probably something involving the "$" and ".." operators.
Anyway, the thing I wanted to highlight was that PopFirst
contains
the functionality of IsEmpty
, First, and DropFirst()
all
in one single method. It saves the value of IsEmpty
in the 'empty'
parameter, gets the value of First, drops the first element, and returns the old
value of First
(if !empty). I have
complained
loudly in the past about the fact that IEnumerator(T)
requires two interface calls just to yield a single element*, and D's range interface
is 50% worse, requiring three method calls per iteration! In D, this is rarely a
problem because D ranges are used directly rather than through interfaces, so the
compiler can inline the calls to all three methods. But in .NET land, interfaces
are used far more often, so the three calls are fairly expensive.
The question, though, is whether it's worth requiring both approaches. The "chatty"
approach with three methods is more convenient for the average caller who doesn't
care much about performance, but the single-call method is crucial for high performance
and should not be eliminated. So what should I do? Keep the four methods, drop all
but the last one, or have some kind of compromise?
If extension methods were compatible with ranges, a very reasonable compromise would
be to keep just IsEmpty
and PopFirst
, but eliminate
First
and DropFirst
. Then, two extension methods would
perform PopFirst
without requiring a cumbersome "out" argument:
public static T PopFirst<R, T>(this ref R range, T defaultValue) where R : IFRange<T>
{
bool fail;
T next = range.PopFirst(out fail);
if (!fail)
return next;
else
return defaultValue;
}
public static T PopFirst<R,T>(this ref R range) where R:IFRange<T>
{
bool fail;
T next = range.PopFirst(out fail);
if (fail) throw new EmptySequenceException();
return next;
}
But as I explained, these extension methods cannot currently exist--not as extension
methods, anyway. You can, however, call the non-extension method that does exist,
Range.PopFirst(ref r)
. It's better than nothing, anyway.
Another compromise would be to eliminate DropFirst()
, but keep First
in case you want to "peek" at the next value without removing it (note
that requiring a First property may increase the complexity or run-time size of
a range type; and technically it is possible to implement First() as an extension
method, so it is not strictly necessary for it to be part of the interface, although
it is potentially faster when it is built-in.)
Finally, I am wondering whether to actually use the term "random-access range"
or whether to use the term "slice" instead. As far as I can tell, in D
the term "slice" means "array range", i.e. a section of an array,
not some other data type. But "slice" rolls off the tongue nicely, and
I am tempted to use it for all random-access ranges. In that case, IRange(T) would
be renamed ISlice(T). Opinions, anyone?
In any case, I am in the process of adding range support to my collections library,
Loyc.Collections. Note that the collection interfaces will go in Loyc.Essentials,
a library of small methods, simple data structures and core interfaces, while most
of the actual collection implementations will go in Loyc.Collections. I
will also have to review the existing interfaces to understand how they relate to
ranges and whether one or more of them should be eliminated or combined with ranges
somehow. Perhaps I'll post again when it's done.
Loyc.Collections contains some data structures I've published articles about already,
such as the
AList,
VList
and
CPTrie, but also some interesting unpublished data structures such as the BList,
BMultiMap, Set/MSet, Map/MMap, InvertibleSet, and DList. Let me know if you would
like me to prioritize writing articles about these things (I also have a parser
generator and a new programming language in the works, you know, plus I have to
find the time to set up a web site for all of this).
P.S. An ominous footnote:
The mutable ranges do not include a Clone method due to a limitation of C#. C# does
not support covariance, which means that every time a derived interface supports
cloning, the implementing class is required to write a separate clone method. Read-only
ranges already have to implement up to three clone methods: ICloneable(IFRange(T))
,
ICloneable(IBRange(T))
, ICloneable(IRange(T))
, and that's
in addition to the Clone method for the concrete type! If mutable ranges also supported
cloning, they would add up to three more clone methods, which is really getting
out of hand.
Even without mutable clone methods, implementing all the
ICloneables
can be a chore. But you can just copy and paste these, inserting an appropriate
constructor call for your range type:
IFRange<T> ICloneable<IFRange<T>>.Clone() { return Clone(); }
IBRange<T> ICloneable<IBRange<T>>.Clone() { return Clone(); }
IRange<T> ICloneable<IRange<T>> .Clone() { return Clone(); }
public MyType Clone() { return new MyType(...); }
When complete, EC# will, of course, take care of this crap for you.
Update: it was pointed out to me that although algorithms
usually need ranges, there is other code that has some need for ordinary iterators.
In .NET we typically use an index when we need an iterator, but this only
works for random-access data structures, and it is not as type-safe as an iterator
(although it has the virtue of storage efficiency.) For non-random-access data structures,
the solutions are either ad-hoc (e.g. LinkedListNode
inside a LinkedList)
or non-existant (SortedDictionary
has neither an iterator nor a range
type; an iterator would have been one solution for
a href="http://stackoverflow.com/questions/1690929/what-net-dictionary-supports-a-find-nearest-key-operation">
this question I had). Comments are welcome about how to deal with the need for
iterators in .NET.
I suppose the problem with C++ iterators is that they are useless without external
context: you can't increment or decrement one without also comparing it to
begin()
or
end()
in its container, which implies that the caller
must manually keep track of which container it came from. Thus, an iterator is hardly
an improvement over a plain-old integer index, the only advantages being
- You can dereference it without reference to its container (but this is nothing
special; you can dereference an object reference too)
- Unlike an index, it's compatible with non-random-access data structures
(and huge data structures, for which a 32-bit integer index does not suffice).
Perhaps the iterator concept could be improved by being made self-aware: if the
iterator "knew" and could tell you when it was at the beginning or end
of its container. This would increase the storage requirement in some circumstances
but not others. For example, an iterator to a doubly-linked list node can tell when
it is at the beginning or end, but an iterator to a singly-linked list node can
only tell when it is at the end (or rather, when there are no more elements after
the current one. If the iterator follows the 'null' pointer at the end of a linked
list, it becomes null itself, and can no longer increment or decrement).
A pointer inside an array may or may not be able to tell if it is at the end depending
on how arrays work. Perhaps the way D heap arrays work would allow an array iterator,
implemented as a simple pointer, to still know when it is safe to increment and
decrement. The simplest possible .NET array iterator is an array reference plus
an index, and again the iterator can know when it is at the beginning or end of
the array--except that if the iterator came from inside a range, it would be unaware
of the range's boundaries, which may be smaller than the array's boundaries.
The question is, what does an appropriate iterator type look like? Could it follow
an interface that would make it more useful than a C++ iterator, but still more
efficient than a range?