Understanding and creating a .NET like List

Paulo Zemek

5.00/5 (24 votes)

Jul 8, 2016

CPOL

15 min read

22810

A List is like a resizable array. The main points are:

You can get or set its items using an indexer (that is, something like list[somePosition] = someValue);
You can Add() and Remove() values from it, effectively changing its size.

Surely there are other methods on the .NET List<T> implementation but I am not covering all the methods as many of them are there simply because it is good to have them. The real point is that a list works pretty similar to an array, but can be resized while an array can't (don't be fooled by the static Array.Resize() method as it actually does a copy of the array instead of resizing one like the name suggests).

If our purpose was to create a really, really naive implementation of a List we could do this:

public class List
{
  private object[] _array = new object[0];

  public int Count
  {
    get
    {
      return _array.Length;
    }
  }

  public object this[int index]
  {
    get
    {
      return _array[index];
    }
    set
    {
      _array[index] = value;
    }
  }

  public void Add(object item)
  {
    int positionOfNewItem = _array.Length;
    Array.Resize(ref _array, _array.Length + 1);
    _array[positionOfNewItem] = item;
  }
}

I said this was a very, very naive implementation. It is missing methods, like Remove(), Clear and others, it is missing parameter validations (what will happen if a value of -1 is given to the indexer?) and it is not a generic implementation (if we want a list of strings only, we will need to copy the code and change the type object by the type string) but most importantly: It is slow. Really slow.

Note that for each new item, an Array.Resize() is done, which actually allocates a new array and copies all items to the new array. So, to add only 4 items we will:

Allocate a new array (of size 1). The copy will do nothing, and we will set the item at index 0 (the first one);
Then we will allocate a new array, of size 2, we'll copy one item and, finally, set the item at index 1;
Then we will allocate a new array, of size 3, we'll copy 2 items and, finally, set the item at index 2.
Then we will allocate a new array, of size 4, we'll copy 3 items and, finally, set the item at index 3.

I think that you can imagine that the more items we have, the slower it gets. This is because the more items in the list, the slower the copy gets.

With 4 items it may not sound that bad. But what about 1000 items? Will we allocate 1000 new arrays and do 1000 copies, each time bigger and slower than the previous one?

The real answer is: If we don't care about wasting some memory, we can do better.

A better implementation

The .NET List<T> implementation (and most lists implementations that are backed by an array) allocates an array bigger than the amount of items it needs to store. So, yes, it does waste some memory if we don't do the right amount of Add() calls but from a performance perspective, it is much faster.

The implementation is more like this:

public class List
{
  private static readonly object _emptyArray = new object[0];

  private object[] _array = _emptyArray;
  private int _count;

  public int Count
  {
    get
    {
      return _count;
    }
  }

  public object this[int index]
  {
    get
    {
      if (index < 0 || index >= _count)
        throw new IndexOutOfRangeException("index");

      return _array[index];
    }
    set
    {
      if (index < 0 || index >= _count)
        throw new IndexOutOfRangeException("index");

      _array[index] = value;
    }
  }

  public void Add(object item)
  {
    if (_count == _array.Length)
    {
      int newCount = _count * 2;
      if (newCount == 0)
        newCount = 4;

      Array.Resize(ref _array, newCount);
    }

    _array[_count] = item;
    _count++;
  }
}

If you pay attention, I improved the getter and setter by throwing an appropriate exception if the index is invalid. This is necessary because now the Count may be different from the array Length. I also avoid allocating a new empty array per list, only a single empty array is allocated and shared by all the lists when they are first created (this is safe as a 0-Length array is immutable).

Actually, we could consider valid to set a value at any index and, if needed, resize the inner array. Well, I don't do this and the .NET List doesn't either and I have reasons to believe such automatic resizing is bad, but I don't want to lose focus here, so, at this moment, we must accept that we don't do automatic resizes in that situation.

Now take a look at the Add() implementation. It will only resize the array (which actually causes a new array allocation with a copy) if the actual Count equals the array Length. When doing the first add the Count will be zero and the array length too, so it will end-up allocating an array of size 4.

For the next 3 Add() calls, the array will have the right size, so the item will be set directly to the array and the _count will be incremented. And, when we finally arrive to the 5th Add(), it will double the array size. As it keeps doubling the size, even if we need to copy items at some moment, it will become rarer and rarer... so we can say that we keep a good speed for additions at the end. Actually, that's all the optimization that the .NET List class does regarding additions.

What else?

Well, I still didn't present the Remove() method (or any of the other methods), I still didn't talk about Object Oriented Programming Principles that we are following (or not), I didn't talk about the .NET interfaces or anything about different implementations. But I hope that which this lesson you understand that doing copies at every Add() is bad and that it should be avoided (that's why the .NET list uses that rule of doubling the inner array when its capacity is exhausted). I also hope you understand that using more memory than it is actually used can be a good thing, if you "expect" such memory to be used in the future.

So let's see some of the other methods:

RemoveLast()

The logic of doubling the array length to add new items at the end doesn't help in adding or removing items in the middle, but the fact that the array may be bigger than the actual number of items helps when removing items located at the end. We can say that by simply doing a

      _count--;

We can remove an item from the end. But such solution is still naive. What will happen if the user never adds a new item after a RemoveLast()? The item that's not accessible anymore (because it is outside the _count boundaries) can't be collected, because to the garbage collector it is still active (it is still part of an active array). It is important to note that when we talk about Garbage Collection, accessible items mean those items the Garbage Collector sees as accessible, not items that are really accessible. Putting an item in an array that's accessible, even if it is in a position that we will never access, is still considered an accessible item. So, a RemoveLast() is going to look more like this:

public void RemoveLast()
{
  if (_count == 0)
    throw new InvalidOperationException("This list is already empty.");

  _count--;
  _array[_count] = null;
}

That _array[_count] = null; means that such item can be collected if needed. It may be a waste of time if we simply add an item just after doing the remove, but the class doesn't know that, so it must do something smart for those cases we don't actually add a new item just after that call (and, well, setting a null value to an array index is pretty fast).

Clear()

The Clear() is both pretty simple and pretty problematic. What do we expect to happen after a Clear()?

Imagine that the actual list has 1 million items. Will the user of this list Clear() it and add 1 million items again, or will the user simply let it die cleared?

Actually, a "perfect" Clear() (in the sense of clearing even the used memory) will do this:

public void Clear()
{
  _array = _emptyArray;
  _count = 0;
}

But the .NET implementation does something like this:

public void Clear()
{
  Array.Clear(_array, 0, _count);
  _count = 0;
}

This, in my opinion, means that the developers hope a cleared list will become as big as it was before the Clear(). This can't be considered a bug, but it is a memory consumption problem. If you clear a list because you expect it to reduce its inner array size... well, it is not going to do that. Items that were in it can be collected, but the array itself can't. Also, if you think that doing a clear after finishing using your local list is helping in anything, then think again, because it isn't. A local list can already be collected when the method ends. Doing a Clear() just before that means you are losing time for no reason.

TrimExcess

This method exists to avoid wasting memory at the end of the list. If we add many items and then we know that we are not adding new items we can call TrimExcess to make sure the inner array "shrinks" to the same size as the actual Count. It is important to note that an action is only needed when the Count is smaller than the actual array size (the .NET implementation expects more than 10% space gain to actually do anything) and we must remember that the resize creates a new array with the smaller size, which means that we waste even more space before allowing the old array to be collected (so, don't call this method if your list is short lived).

This method is often used together with the Clear method to guarantee that the list is completely cleared and the memory from the inner array is reclaimed. In my opinion the .NET Clear method should be called RemoveAll and the Clear should use the solution that simply sets the array to an empty array. It would be faster and less confusing than doing a Clear and a TrimExcess to completely release memory.

The most basic implementation of a TrimExcess could be done with a single line of code, like this:

public void TrimExcess()
{
  Array.Resize(ref _array, _count);
}

AddRange()

The AddRange is not really smart, but it avoids some unnecessary validations and copies. Remember what happens at every Add()? We check if we need to resize the inner array and, independently if we do it or not, we set the item and the count. If we do thousands of Add calls we will have thousands of validations and possibly many resizes. Well, with this method we can validate the size only once, resize the array only once (if we need that at all) and then we can copy all the items and set the count only once.

The method could look like:

public void AddRange(IReadOnlyCollection collection)
{
  if (collection == null)
    throw new ArgumentNullException("collection");

  int count = collection.Count;
  if (count == 0)
    return;

  int finalCount = _count + count;
  if (finalCount > array.Length)
    Array.Resize(ref _array, finalCount);

  collection.CopyTo(_array, _count);
  _count = finalCount;
}

Actually, this is not the code of the List implementation in .NET. The List implementation in .NET is both more complex and "stupid". It tries to detect if the received collection is really an ICollection or not, because the parameter type is IEnumerable<T>. If it is an ICollection it uses the Count to do this optimized solution... if it is not, then it iterates the given collection adding the items normally. I don't consider that a good implementation, but this is how it works.

Generics

You probably know that the .NET List<T> is a generic type. That's what that <T> is all about. In my example I was using the object as the item types, but using the type as object has many problems. In special:

It allows any object to be put into the list, without any validation;
Users will need to cast the read items to their right type before using them. If they actually inserted an object of an invalid type, they will only see the error when doing the cast, not at the insert;
Value-types (ints, for example) will be boxed.

And we can solve all those problems by making the type generic. To do that, it is enough to put that <T> after the class name and replace all the object uses by a T. This means that we will have an array of ints if we use a List<int>, an array of strings if we use a List<string> and I think that you already got the idea. By using a generic type we will have better performance (at least for value types) and we will also have some compile-time validations, as a list of strings is not going to accept an int as parameter.

The interfaces

The .NET List implements lots of interfaces today. In fact, some of them are actually legacy and problematic.

For example, the non-generic ICollection has properties like IsSynchronized and SyncRoot. Putting a public SyncRoot property is an excess of responsibility (the List class doesn't need to care about this), also it is a possible source of bugs and it doesn't allow to use reader/writer locks. If someone wants to make a list thread-safe by using locks, it is the responsibility of that someone (or of that class) to deal with the locks. We should not add properties (and consequently fields) to our type for that kind of situation. The ICollection also has the CopyTo method, but it actually doesn't allow you to tell which part do you want to copy. You always copy the entire contents of one collection to an array (starting at a specific index). This is simply odd, in my opinion.

The generic interfaces, on the other hand, seem much more accurate. Yet the mutable interfaces appeared before the read-only ones, so the mutable ICollection<T> has the IsReadOnly property and, if it is true... well, you still have all the mutable methods available, but they are expected to throw exceptions if they are used. Only the new IReadOnly* interfaces are really slim. Anyways, I can't say that you should implement those interfaces.

The fact that most of those interfaces expose counter-intuitive properties (like the SyncRoot or the IsReadOnly on a mutable interface) only shows me that we should avoid implementing interfaces for the sake of completeness. Users can always create adapters that implement interfaces and redirect to our types if needed. In fact, if performance is a problem, as long as we don't seal our types (and sealing itself is another big topic if we want to discuss) users can simply inherit from the actual List and implement the interfaces they want, without the extra cost of redirection from one type to another (as happens with adapters). So, in my opinion, the only interface that's really worth implementing is the IEnumerable<T> interface... or maybe not even that one. See, what do you prefer:

foreach(string name in names)
{
  // Do something here.
}

Or do you prefer this:

foreach(string name in names.Enumerate())
{
  // Do something here.
}

Actually the first solution means we have to implement the IEnumerable<T> while the second solution means we only need to create a method Enumerate() that returns an IEnumerable<T>. I will let you decide which one is better, but in any case, the contents of the GetEnumerator() method (needed by the interface) or the contents of the Enumerate() method will look like this:

for(int i=0; i<_count; i++)
  yield return _array[i];

The attributes

Similarly to the interfaces, my implementation doesn't use any attributes. Some attributes are used to control serialization (as well as some interfaces) and a full-featured implementation may use them as well (the .NET list does).

The only thing I can say is that I consider it terrible to use attributes as a way to control serialization. To me, serialization is not the main purpose of a list and it should exist as a separate list. Everything that's needed for serialization is to scan the list (which is already possible) and serialize each one of the items. So, I will not even cover how to use the attributes on the list class because I consider that to be a bad practice.

The Constants

In many situations constant values are usually bad and making everything configurable is the way to go, yet we have some constants in this class: the 0, the 2 and the 4. They are not visible to the users of the class, so they don't use constant members, yet they are constants and we may want to avoid them.

0 and 4

Looking at the source code you can see that the list starts with an empty array and when the first item is added, an array of length 4 is allocated.

Both of those could be different. The list could start with an array of any size, or even with a null in place of the empty array.

Avoiding null is good to avoid extra checks for null... at least that's what people usually say. See the getter, if the list is empty (count 0) we will never access the array, we will throw at that moment. So, it doesn't really matter. But can we say the same for all the methods?

Well, starting with an array of Length at least 1 would help the resize to be simpler. The actual implementation checks if the size is 0 to set it to 4. If the array size was 1, it could double the array Length without ever checking for a size of zero.

2 - Doubling the array length

The number two is special because it is used to double the array length. A better implementation could simply allow users to configure this value.

The truth is: If we know the actual number of items, we can allocate an array to the right size and use it. If we are guessing, we must use something that works, even if it is not perfect.

By doubling the array length we may keep many unused items at the end. Yet, you may try to multiply its size by 3, by 1.5 or any value that works for you. I used the value two because this is the same value used by .NET. Feel free to try different amounts.

Doubling the Length is violated

Not everything is a rule. Doubling the array length when resizing on Add calls doesn't mean we do the same on AddRange calls. In that case, if a resize is needed, the exact final count is used as the resize size. If we are going to call AddRange many times, we may face the same problem the naive implementation had: Too many resizes will be necessary.

For example, imagine that each AddRange is receiving a collection with only one item. It would be effectively doing the same thing as an Add... yet, it will be doing a resize per call. Maybe the best way to go would be to keep doubling the size until we reached a value big enough to hold all the items. It would keep the rule used by normal adds with all the pros and cons, and will still keep AddRange's advantage of doing a single resize.

Why didn't I do that? Well... I was following what the .NET list does. Also, there's a big chance that users will call AddRange only once, so maybe this decision is the best one. That doesn't happen with the Add method, which is expected to be invoked many times.

Conclusion

I think this article shows how a simple to use and small class like a list can be quite complex in its development, full of small but very important considerations ranging from performance and memory consumption to configurability.

Hope you like and create your own implementation to better understand it.