Extended Generic Collection From Absolute Zero

Jason Pearce

Rate me:

4.64/5 (9 votes)

8 Mar 2010CPOL17 min read

25.6K

102

Building a generic collection class from scratch without just wrapping a List or ArrayList.

Download source code - 28.9 KB

Introduction

I decided to try my hand at a custom generic collection class, and started looking into the implementations of the generic collection classes such as List<T> or the interfaces like ICollection<T>. But while what I found as I went along was useful information, ultimately, one question rose above all others as I looked at example after example of simple and complex implementations of CollectionBase or ICollection<T> that at its core just wrapped an ArrayList or List<T> internally, which to me defeats the purpose of implementing a baser class or interface in the first place. If you could write a class that wraps a List<T> successfully, why bother with implementing ICollection or IList if you could just put a List<T> in your class? What if I didn't want to wrap an existing structure that is more advanced than the more primitive interfaces I was considering? What options are there?

Background

Eventually, I ended up with my own collection class, which I think could still be useful even in the age of 3.0 - HashSet<T> is a speedy unique-value list that can beat my collection on Contains checks for unique values, but it relies on being unique and cannot sort, and List<T> is slightly faster than my collection on Sorts, but not by much. Neither one still natively provide a single-value sorted unique-value list. (See Benchmarks near the end for actual numbers.)

Even if it is rendered obsolete by extending a List<T> to add a Unique function, I think it's still good information to know. I had never broken down a collection to this level before, and picked up a few new concepts.

Getting to the code

The most basic question I started with was simply this:

Given a class A, how do you implement A[index]? There was a quick obvious answer of an internal array, but the most obvious drawback to that was the static size, since we obviously will want a dynamic Add option that doesn't cause a recreation of the array on every Add. The less obvious drawback that I saw immediately that caused me to code far too much was that it did exactly what I didn't want, namely it implemented a pre-defined collection class (Array) even if this was a far simpler collection than List<T> and the like.

The answer that I ended up with is a nested class. A nested class is simply a class that is declared within your class, and in this case, I declared a nested class with a generic type property, and I can carry that generic type value to the nested class in the constructor. In this code, we see both a declaration of a class with a "generic" value type as well as a nested class that carries the same T type. Note that I am implementing zero inheritance, they derive directly from System.Object.

public class basicList<T> 
{
    //basicList constructors
    //...
    private class basicItem
    {
        private T _data;
        public T Value { get { return _data; } set { _data = value; } }
        public basicItem(T t) { _data = t; }
    }
}

Note that the nested class is private, but the constructor/property is public - if you had a public nested class, it could be accessed outside the parent class, but since it's private only, the parent class can access it; however, the methods your parent class need to use (including the constructor) must not be private.

Now, we need a method on the parent basicList class that creates a basicItem, something like this:

public class basicList<T> 
{
  public Add(T t) { basicItem newItem = new basicItem(t);}
  //basicItemClass
  //...
}

You are probably saying that's great that we made a basicItem with data, but how do we find anything once we leave Add?

Answer: we need to implement indexing of some kind. This is probably where people with more common sense or less free time than me decide to implement a List or ArrayList because they're pretty quick to spring to mind when you think to yourself, "What can I use to keep a collection of values and also easily access by index?" But, part of this project was no implementation of another collection class, partially because I want direct access to the data instead of just passing through another collection, and partially because that was the whole point I got started on.

So, with a collection of basicItem objects that contain T data, how can I keep track of them for indexing? The answer is what got me finally going: each basicItem needs a link to another basicItem in the form of a reference property, and we need to keep track of the count of objects, and the parent object needs to keep a reference to one of the objects. This string of references will keep the classes from getting collected as well as provide easy enumeration implementation. For performance reasons, I went with the ability to go forwards or backwards, so I need to keep track of:

basicList:

reference to first basicItem
reference to last basicItem
current count of items

basicItem:

reference to previous basicItem
reference to next basicItem

On construction of a basicItem in a basic Add function, it can know the previous item if it's not the first, but not the next, since in a basic Add, the new item is appended, there is nothing after it.

The new basicItem constructor and basicList constructor, and the Add method:

public class basicList<T> 
{
    private basicItem _firstItem, _lastItem;
    private int _count;
    
    public basicList()
    {
        _firstItem = null;
        _lastItem = null;
        _count = 0;
    }
    
    public void Add(T t)
    {
        basicItem n = new basicItem(t, _lastItem);
        if (_count == 0) { _firstItem = n; } //store first node
        else { _lastItem.Next = n; }
        //not first, put reference to previous last

        _lastItem = n;
        _count++;
    }
    
    private class basicItem
    {
        //next and previous as from a zero-based perspective
        private basicItem _prev, _next;
        private T _data;
        //properties
        public basicItem Previous { get { return _prev; } set { _prev = value; } }
        public basicItem Next { get { return _next; } set { _next = value; } }
        public T Value { get { return _data; } set { _data = value; } }
        //constructors
        public basicItem(T t) : this(t, null) { }
        public basicItem(T t, basicItem prev)
        {
            _prev = prev;
            _data = t;
        }
    }
}

Now we are getting somewhere! When basicList.Add() is called with a value, it first calls the basicItem constructor with the value given and a link to _lastItem to store in Previous. Note that if it is the first item in the collection, it gets a null reference from the basicList constructor, and if it is not, the previous _lastItem gets a Next link to the new basicItem. Items placed by Add are appended by default, so the item created is always the _lastItem. _count goes up by 1, and if it is the first item, it is also placed in _firstItem. This system also has the advantage of not hard-coding an index in the basicItem, but rather by letting it be relative, we can easily move/remove/add at will by simply changing the links of the basicItem to their new orientation.

Now to implement indexing! With what we have here, we know _firstItem and its index is always 0, and we have _lastItem and we know its index is always _count - 1. Two things we can implement here - one is a generic IEnumerator<T> to support foreach operations, and a this[int index] to allow direct access. Since we are implementing a IEnumerator<T>, our parent class can also inherit from IENumerable<T> for no more cost than simply pointing its implementation of IEnumerator<T> to the one we are creating. Incidentally, this is the only inheritance the basicList class will ever get - it's probably not adding much, but I thought I might as well.

New code:

public class basicList<T> : IEnumerable<T> 
{
    //snip unchanged basicList constructor 
    //and Add method and basicItem class
    
    public IEnumerator<T> GetEnumerator()
    {
        basicItem current = _lastItem;
        while (current != null)
        {
            yield return current.Value;
            current = current.Previous;
        }
    }
    
    IEnumerator IEnumerable.GetEnumerator() { return GetEnumerator(); }
    //IENumerable<T> implementation

    public T this[int index]
    {
        get
        {
            int curIndex = 0;
            basicItem curItem = _firstItem;
            while(curIndex < index)
            {
                curItem = _firstItem.Next;
                curIndex++;
            }
            if(curItem != null) { return curItem.Value;}
            return default(T);
        }
        set
        {
            int curIndex = 0;
            basicItem curItem = _firstItem;
            while(curIndex < index)
            {
                curItem = _firstItem.Next;
                curIndex++;
            }
            if(curItem != null) { curItem.Value = value;}
        }
    }
}

We now have a fully functional generic list! We can Add an item, foreach the collection, or get/set a value directly by index by simply going through our collection from the beginning (or end, more on this later).

Extending the list functions

OK, so a list we can append an item to, enumerate, and access by index is the most basic of lists. Surely, we can implement some other functions while still retaining zero inheritance (besides the freebie) and staying completely generic? Of course, we can!

Basic list functions - zero inheritance and fully generic:

AddRange
AddAt
RemoveAt
Clear
Contains

I'm not going to rewrite the code for AddAt/RemoveAt in this section, since I implemented some performance enhancement I haven't gotten to yet - but basically, because the indexes are not hard-coded, all you have to do is just re-hook the links. In the case of RemoveAt, you grab the basicItem at the index, then link the Previous and Next directly to each other, then either null the item (for actual removal) or just re-hook it wherever you want to go in the case of an uproot and replant. In the case of AddAt, you grab the items before and after, and hook them to the new item, and then hook the new item to the before and after. In both cases, the first/last need to be checked for replacement, and the count needs to be updated.

AddRange couldn't be easier, just take a T[] and loop through them, calling Add on each value. For Clear, I actually hooked into the constructor to call, but it nulls everything, resets values to 0, and parses through the collection removing links just to make sure we don't have a long chain floating out in the abyss that the collection doesn't clean up. We can implement Contains at this point, though we are limited to the Object inherited Equals.

public class basicList<T> : IEnumerable<T> 
{
    public basicList() { this.Clear(); } //moved to Clear function
    
    public void AddRange(T[] t) { for (int i = 0; i < t.Length; i++) { Add(t[i]); } }
    
    public bool Contains(T t)
    {
        foreach (T tVal in this) { if (tVal.Equals(t)) { return true; } }
        return false;
    }
    
    public void Clear()
    {
        if (_firstItem != null)
        {
            basicItem curItem = _firstItem;
            while (curItem.Next != null)
            {
                curItem = curItem.Next;
                basicItem prevItem = curItem.Previous;
                prevItem = null;
            }
        }
        _firstItem = null;
        _lastItem = null;
        _count = 0;
    }
}

This is about as far as we can possibly get with an unrestricted generic Object-derived no-inherit list. We only have the basic Object functions without limiting T. At this point, if we want more, we can either subclass basicList, or just use the minimal restriction on T to get the rest of the functions we want.

Performance enhancements

I would go directly to the part where we implement sorting and the extended functions, but unfortunately, that is pretty complex, and uses what I'm about to describe here. Rather than rewriting entire code segments to make them make sense without explaining this system, I thought I would explain it first.

What I implemented is what I call the pointer system. This is the main reason my list class can compete (and beat some) of the predefined collection classes. Now I know what pointer means, and I'm not really using pointers. I'm using object references, but for my own clarity, I call them pointers, since by grabbing one, you are essentially grabbing the object it's referring to.

Recall that _firstNode and _lastNode are essentially references to the first and last nodes, whose positions we know as 0 and (Count - 1), respectively. What this system does is keep more of these references, except instead of being anchored to the first and last, they are free to be wherever within those bounds. Necessarily, this system is pretty complex to keep track of, but it is vital to actually perform anything other than appending items. Remember that at this stage, to get index[y], we have to start at 0 and iterate 0 -> y. Some performance enhancement can be had by determining if y is closer to the end or start and going from there, and that was my initial implementation. But what happens when we have to go through a collection one by one and flip back and forth between nodes, such as an operation like sorting requires? I'll tell you - in small collections, it's not very noticeable, but in large ones, it's extremely noticeable, as to get to the middle of the collection, it always has to start from the start or end.

In my implementation, I allow the user to optionally set the number of pointers used; in large collections, larger numbers will help somewhat, although all my tests were done with the default number, which is 4.

Since this article is getting long already, I'll briefly explain the implementation which you can see in the source code.

Pointers are kept in two separate arrays, a basicItem[] _curPointers and a int[] _curPointerIndices.

_curPointers[] contains the actual object references
_curPointerIndices[] corresponds to basicList[index] to the current node index

When a call is made to get a pointer, the function _getClosestPointer just returns the index of the most strategic pointer to _getNodeAt, who then passes it on to the caller.

if none exists, it creates one at zero or the end depending on which is closer to _createNewPointer
if some exist, but making a new one at zero or end would be closer:

if there is an empty spot to create one, grab it and create it
if not, replace the least strategically placed pointer (based on proximity to other pointers and start/end) in _calcPointerWorthLeast

Functions then take that index and just use _getClosestPointer, and functions that grab the object reference use _getNodeAt
- functions keep track of their working object acquired from the pointer and its index, or just the index depending on the function
- they update the pointer/index where needed, either by directly inputting to _curPointers / _curPointerIndices or with _updatePointer
- if any nodes are added/removed during use, affected indices of pointers are updated accordingly with _movePointers
- pointers support a reverse lookup, given a basicItem _getPointerIndex can return the index of the pointer for updates

Enhanced functions

Now that I've explained that a bit, let's move on to the final part where we set up a Sort and reforged Contains, and enable auto-sorting on Add. First off, to do any kind of comparison other than the basic Object-derived Equals(obj), we need to finally put a little limitation on our T, namely asking it to implement IComparable<T> by tacking on where T:Comparable<T> to the class name. I chose to augment the current list, rather than stop the class here and inherit it and then tack on the requested inheritance to T, because IComparable<T> is incredibly easy to implement. It has only one method - CompareTo, which given t1.CompareTo(obj t2), returns -1 if t1 < t2, 0 if t1 = t2, 1 if t1 > t2. I usually implement that myself in custom classes with a redirect to a type-specific comparison and avoid the normal object kludge.

Now that we can guarantee T is IComparable, we can update Contains and implement sorting. A basic sort routine would be nice, but since I'm me, it's not basic. Before we get to that, though, here is the updated Contains method; recall that before this, we could only use Object.Equals; this is much faster:

public bool Contains(T t)
{
    //new and improved
    foreach (T tVal in this) { if (tVal.CompareTo(t) == 0) { return true; } } 
    return false;
}

It's now using CompareTo == 0. One other thing that you could do is a quick compare of Object.getHashCode first since that's quicker than CompareTo (which is still faster than Equals). But keep in mind, getHashCode isn't a bulletproof comparison operator, and if hash codes are equal, that doesn't necessarily mean the two objects are equal, thus the second check with CompareTo. I tested this with integers, and it sped up value checking considerably, but then when I used strings, it slowed it down- my guess is hash codes are equal more often, or it takes longer to calculate them with strings, and since this is intended to be a generic List with as few restrictions as possible, I didn't want to set up a special string handler, a special integer handler etc., and so set it back to the most reliable comparison. If I was going to use it with strings, I would just override the comparison in a new string-targeted class inheriting basicList.

On to sorting! At its heart, sorting only relies on a few things - that you have a reliable method to compare two values, which we do now with CompareTo, and that you can keep track of where you are while you're flipping around. The latter is much more complex with the pointer system I described above, but it's pretty impressive on a sort op.

The sorting I used here is a combination of several parts. I'll post the actual sort handling code, although most of it won't make a ton of sense since we didn't go into that much code on the pointer system and the sort function itself calls another function which calls another. Nonetheless, here is the current _handleSortCall (the public Sort looks at _autosort, and if it's already autosorted, it ignores the call).

private void _handleSortCall()
{
    //private method allows public sort to decide if 
    //it wants to actually sort or not and permits 
    //autosort property handler to bypass
    basicItem nEnd = _firstItem, nStart = _firstItem;
    int posEnd = 0; //start is always 0
    //working index so we don't have to enumerate 
    //every one in the sort op function to place it
    int workingIndex = 0;
    while (nEnd.Next != null)
    {//start at the 2nd node and work until there are no more nodes
        basicItem thisNode = nEnd.Next;
        workingIndex++;
        //first check to see if it even needs to be moved
        //since we are moving forward, we just need 
        //to make sure the previous is worth less
        if (thisNode.Value.CompareTo(thisNode.Previous.Value) < 0)
        {
            //remove before looking so we don't find the same value
            _removeNodeAtIndex(workingIndex, ref thisNode);
            int newIndex = _findProperIndexByValue(thisNode.Value, 0);
            _addNodeAtIndex(newIndex, ref thisNode);
            posEnd++;
            workingIndex++;
        }
        else
        {//sort not necessary
            nEnd = thisNode;
            posEnd++;
        }
    }
}

Nothing is passed to it since this is the basic sort call. The next more specific function is _findProperIndexByValue, which, given a value, figures out the first index where the next value is greater, basically an ascending sort finder. To do this, it creates up to 10 zones, and checks the starting value one at a time, and then recurses, narrows the zones, and repeats until it has less than 20 possibilities, at which point, it calls another function _findFirstBiggerValueIndex, whose function you can probably guess. The functions _removeNodeAtIndex and _addNodeAtIndex are exactly what you would think they are, the workhorses behind the AddAt and RemoveAt functions as well as absolutely vital for sort op since we are essentially plucking one node from its home and placing it somewhere else.

Since we are starting at the beginning and moving forward, any value behind the previous work node is guaranteed to be of lower or equal value than the current node, meaning that for each node we move forward, we only have to check if the value is larger than the last, and if it is, we can keep on moving, since a gap will be covered by later nodes being moved back if such a gap exists.

Epilogue

If you've read this far, I am quite proud of you. This is far, far longer than I anticipated it being, and I even started cutting it short once we got into the complex stuff. I do have a slight hope that somebody will find this useful despite there being obvious alternatives. There is a certain pride in building something from scratch that can compete with the established kings on nearly even ground. If not, it was a great brain exercise, hopefully for both of us.

Benchmarks!

I was curious how my little experiment stacked up against the big dogs, so I did an experiment using a 500K int collection. I also did a 10K string collection, but the results were so similar across the board I scrapped it until I felt good about this one. Only thing I will say about the strings is that my unsorted list's Add time was exactly identical to all the others, and its sort time was exactly the same as List<string> and ArrayList, so I'm not just hiding bad results.

Code used for benchmark (in the code, plus two short verifications that the last 50 and first 50 are correct):

for (int x = 500000; x > 0; x--) { list.Add(x); }
list.Sort();
for (int x = 500000; x > 0; x--) { if (!list.Contains(x)) { list.Add(x, null); } }

I wrapped a little timer function around each of these to provide the milliseconds elapsed, and here are the results:

Collection Class	Add Test	Sort Test	Contains Test
`basicList !_autoSort`	141 ms.	203 ms.	Stopped at 10 seconds - 8022 completed. Estimated 623 needed.
`basicList _autoSort`	453 ms.	0 ms.	Stopped at 10 seconds - 11648 completed. Estimated 429 needed.
`ArrayList`	109 ms.	718 ms.	Stopped at 10 seconds - 652 completed. Estimated 7692 needed.
`List<int>`	31 ms.	78 ms.	Stopped at 10 seconds - 805 completed. Estimated 6250 needed.
`Hashset`	125 ms.	NA	765 ms.
`Hashtable(int, null)`	281 ms.	NA	968 ms.
`SortedList<int, object> added (int, null)`	added 100k in ~1min	NA	NA
`Dictionary<int, object> added (int, null)`	156 ms.	NA	765 ms.
`SortedDictionary<int, object> added (int, null)`	703 ms.	0 ms.	781 ms.

I did do the testing in 3.0, as you can see with the HashSet, but downgraded the assembly to 2.0, since it doesn't need more than that. SortedList couldn't even make it out of the gate, funnily enough. The unsorted list was beaten in both Add and Sort by the generic List, but as you can see, managed to process 10x as many Contains, and the sort time was much better than ArrayList. The autoSorted basicList did 50% more Contains in 10 seconds, but still not good enough for me to implement a unique function just yet. If I can get that down to Dictionary sort times, then I finally have completely customizable competitive unique-value auto-sorted Set that is on equal grounds with the predefined ones. In my opinion, the non-unique unsorted and sorted are on pretty equal footing with the others. But I am biased!

Future plans

Currently (as of initial writing), there is also a unique attribute/public property/handler method in the class, which is commented out. You could implement it, it would require only adding another condition to the Add/AddAt/AddRange for unique values. The reason I haven't implemented it is not difficulty, but because I am unsatisfied with the performance of the Contains check. Since Unique would require a hit on the Contains check for every single Add, being satisfied is completely necessary for me. But, of course, I am benchmarking with 500K collections, and most applications would probably be fine.

Conclusion

Hope you found this as interesting an exercise as I did. But even a fraction is good enough for me to feel good about writing it!

Please feel free to let me know about bugs you find, I did test it pretty thoroughly, but I'm sure I didn't find everything.

As a footnote, if you are used to the normal ///writeup on every function, I apologize - I only commented where I felt necessary, and I do not like paging through a million pages of mostly comments. If, however, this gets some downloads, and it is requested, I can update it with that. I have yet to see if anyone is even interested in downloading the actual code, the interest for me was the concept.

History

3/8/2010 - Posted this!

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

Written By

Jason Pearce

United States

I enjoy reinventing the wheel in as many languages and platforms as possible!

Use Ctrl+Left/Right to switch messages, Ctrl+Up/Down to switch threads, Ctrl+Shift+Left/Right to switch pages.