Choosing the Right Collection

Arthur Minduca

4.91/5 (32 votes)

Apr 26, 2016

CPOL

11 min read

50478

How to choose the right data structure for a collection of elements

Introduction

Everyone once had the experience of not having the right tool for a job. You can get the job done, but it will cost you more effort, maybe some limitations and even some unwanted damage. But what if a person had all the right tools, knew how to use most of them, but still he proactively decided to stop thinking about them because he got that generic one that is “ok” for most of the tasks? What could be the consequences?

A good wrench can be enough to hit a nail on the wall, but it will never be as efficient as the simplest hammer.

In this post, I will cover some well-known basic data structures, discuss the importance of knowing how each one works and propose a systematic, time complexity-oriented approach for choosing a proper data structure in a given scenario. I will be using .NET collections as example. Warning: Be aware that there are tons of data structures out there and that each one is a perfect fit for a specific context. In this post, we will be covering the data structures that are usually available as built-in component on many languages.

How Many Times Did You Use a Collection Other Than List, Array and Dictionary?

When we need to group entities of the same type in a collection for general purposes, List(T) a default choice for many developers for a good reason: it’s practical and versatile as data structure. Furthermore, there is a good chance that in a random circumstance, it’s a good choice.

So Why Should I Bother About It?!

04 - Word Cloud

The simple answer is pretty straightforward: We shouldn’t be using a tool just because it’s capable of handling the job. When choosing a data structure, it’s important to consider how it will be used and the complexity of the operations that will be performed against it. With that in mind, choosing the best tool for the job is a matter of knowing the proper usage of each one and the complexity of their common operations.

The elaborate answer is still as simple as the simple answer: You should think about the proper data structure because it will improve code quality and readability, not rarely, it will be much simpler and productive to implement and, lastly but not the least, you will avoid hidden performance issues that are hard to track.

Know How They Work

It’s relevant the number of developers who don’t know (or know, but just don’t care) about how the collections they use work under the hood. This basic knowledge is the key to understanding the complexity of their operations. Actually, you don’t have to know the details. A simple glance can be just enough to support your decision about which tool to use under the proper circumstances.

By understanding our environment and making fortunate choices, we avoid common collateral damage that haunts developers on a regular basis and ensures best practices. For instance:

We Avoid Hidden Performance Issues Due To a Collection Misuse

List(T).Add has O(1) amortized time, which is nice, but List(T).Insert() and List(T).Remove() has O(n) complexity. During these operations, the elements after the target index are shifted accordingly in order to reorganize the dynamic array. That said, writing other than at the end of the collection can be computationally expensive and an innocent Insert can be a potential source of performance issues.

04 - ListInsert

When we understand how it works, we can anticipate these potential issues and be saved from being cursed by other developers for generating a problem that is hard to track.

We Ensure Proper Usage of Interfaces And Code is Easier to Understand

When you use a more dedicated collection, you explicitly specify how it should be used. A mere declaration naturally reduces the set of usage possibilities and keeps its affordance clear. For instance, if someone inspects a code he doesn’t know and finds a class that manipulates a Queue(T), it becomes naturally clear that this attribute is not inserting or removing elements at the middle of the collection. It means also that the first element added is always the first to be read and that at some point, the element is probably removed after its value is obtained. All that information by reading one single line of code.

By improving affordance, we ensure that the user (i.e., someone who analyses the code) doesn’t need to think (or think a lot) in order to understand what’s going on.

Affordance: A situation where an object’s sensory characteristics intuitively imply its functionality and use.
Source: Usability First.

That said, it becomes clear that if we used a List(T) in a scenario where a Queue(T) would be a fit, even if the performance matches, we lose in affordance by not using the right tool, thus compromising readability and interpretation.

Choosing the Right Data Structure: A Flowchart

I created this flowchart as a simple tool to support decision when choosing a data structure. As I said at the beginning, there are tons of data structures out there and each one is a perfect fit for a specific context. This chart contains only common data structures supported natively by many languages.

04 - Data Structures

There Is No Free Lunch

There is no data structure that will fit better in every context. In order to select a proper collection, you need to know what you can give away.

A dictionary, hashset or hashtable will give you O(1) amortized complexity for insert /add / remove, but you can’t guarantee order of the elements.
A Binary Search Tree will give you O(log n) for insert / remove and a predictable order. It’s far more efficient than a O(n) for big sets, but it’s not as fast as a dictionary.

O(log n): A logarithmic running-time function is faster than O(n) and slower than O(1). For instance, if you have a collection of 16 elements and need to find a specific element, the worst-case scenario of a O(n) function would be to find the desired element after 16 iterations, against 4 iterations of a O(log n) function (log2 16 = 4) and a direct access of a O(1) function. Example of a O(log n) algorithm: Binary Search.

That said, if you need fast access and you don’t need order, the "limitations" of a dictionary is not even an issue. However, if you need order and elements are constantly entering and leaving at unpredictable positions, most search trees will have O(log n) complexity for both writing and retrieval, which is pretty decent, but not as fast as the dictionary's amortized constant time.

Everything depends on your needs. The perfect fit will always be achieving maximum performance with minimum collateral damage due to the intrinsic limitations of a data structure.

Amortization is a method for analyzing a given algorithm's time complexity. (...) The motivation is that looking at the worst-case run time per operation can be too pessimistic. Example: If we started out with a dynamic array (List<T>) of size 4, it would take constant time to push four elements onto it. Yet, pushing a fifth element onto that array would take longer as the array would have to create a new array of double the current size (8), copy the old elements onto the new array, and then add the new element. (Thus, despite the eventual array copy, we still consider that List.Add has a constant amortized time).
Source: Wikipedia.

List, Array

Time Complexity

	List	Array	SortedList
Access by index	O(1)	O(1)	O(1)
Access by key	-	-	O(log n)
Search	O(n)	O(n)	O(log n)
Insert	O(n)	-	O(n)
Delete	O(n)	-	O(n)
Insert at the end	O(1) amortized	-	O(log n)
Delete at the end	O(1)	-	O(log n)

Lists and arrays are preferable when you need indexed or random access to elements.
They are also versatile and commonly used when you just want to group data for general purposes.
.NET lists have a Sort() method, which implements a Quick Sort. On average, this algorithm is an O(n*log n) operation. If you need to sort the elements only once and then you just add elements at the end, it will probably be better to use this method than to use a SortedList, since this last one has O(log n) for insertions at the end, against O(1) amortized of List<T>.
If you don’t need to add or remove elements after the collection is created, consider to use an array instead of a List<T>.
Avoid making heavy Insert() and Remove() operations in a List<T>. It’s an O(n) operation because the array is shifted. Remove an element at the end is O(1) because there is no shifting.
Add at the end is O(1) amortized. When the internal array reaches its limit, a new instance is created with double size and the elements are copied. If you know how many elements are to be inserted during creation of the collection, you can define its size during initialization and save time and space.

Stack, Queue

Time Complexity

	Stack / Queue
Access to the object at the top	O(1)
Search	O(n)
Insert	O(1) amortized
Delete	O(1)

These are perfect when you need a sequential list where elements are discarded after its value is retrieved.
Consider using a Queue<T> for First-in, First-out behavior (FIFO), or its thread-safe version ConcurrentQueue<T>.
Consider using a Stack<T> for Last-in, First-out behavior (LIFO), or its thread-safe version ConcurrentStack<T>.
Push / Enqueue is O(1) amortized. When the internal array reaches its limit, a new instance is created with double size and the elements are copied. If you know how many elements are to be inserted during creation of the collection, you can define its size during initialization and save time and space.
Pop / Dequeue is O(1).

Linked List (Singly, Doubly, Circular)

Time Complexity

	Linked List
Access to a node itself or its adjacencies	O(1)
Search	O(n)
Insert	O(1)
Delete	O(1)

Linked lists are preferable when you need constant time insertion and removal at the middle of the collection.
It is also useful when memory usage is critical and you don’t know the number of elements to be inserted, since there is no array copy in a linked list when it becomes too big.
You should use a Circular linked list if the problem is inherently circular (for instance, to control whose turn it is in a multi-player board game).
It should be Doubly-linked if you need to go back in the list, or Singly-linked if you just need to move forward.
Linked lists are not indexed. Thus, if you want to keep track of specific nodes, you might consider to keep a reference of them.
.NET LinkedList<T> is a Doubly-Linked List.Thus, you can naturally use it for a Singly-Linked List. If you need it to be circular, just make the first node to be the next of the last.

Dictionary, HashSet

Time Complexity (Average)

	Dictionary / HashSet
Access by key	O(1) amortized
Search	O(n)
Insert	O(1) amortized
Delete	O(1) amortized

Time Complexity (Worst)

	Dictionary / HashSet
Access by key	O(n)
Search	O(n)
Insert	O(n)
Delete	O(n)

Dictionaries and HashSets are great for fast access, but the order of the elements is not guaranteed.
HashSet is preferable over Dictionary when you need a collection of unique values.
The complexity of insert / get / remove includes the complexity of calculating the hash. So keep GetHashCode() simple and with constant time.
The worst case scenario is GetHashCode() returning the same value for all the entries. Elements with same hash are grouped in the same bucket to avoid collision. In this case, the elements in the bucket are iterated and Equals() of the key passed as argument is called against the key of each element in the bucket. The complexity of accessing the element is then the complexity of finding the element inside the bucket, which can be O(n).
ListDictionary is faster than Hashtable for small collections (10 items or fewer). The Dictionary<TKey, TValue> generic class provides faster lookup than the SortedDictionary<TKey, TValue> generic class. The multi-threaded implementation is ConcurrentDictionary<TKey, TValue>. ConcurrentBag provides fast multi-threaded insertion for unordered data.
Source: MSDN.

Search Tree

Time Complexity

	SortedDictionary	SortedList
Access by key	O(log n)	O(log n)
Access by zero-based index	-	O(1)
Search	O(log n)	O(log n)
Insert	O(log n)	O(n)
Delete	O(log n)	O(n)
Insert at the end	O(log n)	O(log n)
Delete at the end	O(log n)	O(log n)

Trees are great when you need fast insertion, deletion and retrieval.
It’s ideal in cases where data is entering / leaving constantly in different positions.
Search tree is a family of data structure and there are tons of trees. You should check which tree is the best for your scenario.
The SortedList<TKey, TValue> generic class is a binary search tree. It's similar to the SortedDictionary<TKey, TValue> generic class. The two classes have similar object models, and both have O(log n) retrieval. The two classes differ in memory use and speed of insertion and removal.
SortedList<TKey, TValue> uses less memory than SortedDictionary<TKey, TValue>.
SortedList<TKey,TValue> supports efficient indexed retrieval of keys and values through the collections returned by the Keys and Values properties.
SortedDictionary<TKey, TValue> has faster insertion and removal operations for unsorted data, O(log n) as opposed to O(n) for SortedList<TKey, TValue>.
If the list is populated all at once from sorted data, SortedList<TKey, TValue> is faster than SortedDictionary<TKey, TValue>.

Putting It All Together

We shouldn’t be using a data structure just because it’s capable of handling the job. It’s important to consider the scenario and the operations that will be performed. The understanding of how things works under the hood helps us make the judgement of which data structure will perform better and ensure application’s performance. It will also improve code quality and readability.