Click here to Skip to main content
13,146,395 members (48,003 online)
Click here to Skip to main content
Add your own
alternative version

Stats

17.5K views
39 bookmarked
Posted 25 Apr 2016

Choosing The Right Collection

, 26 Apr 2016
Rate this:
Please Sign up or sign in to vote.
How to chose the right data structure for a collection of elements

Everyone once had the experience of not having the right tool for a job. You can get the job done, but it will cost you more effort, maybe some limitations and even some unwanted damage. But what if a person had all the right tools, knew how to use most of them, but still he proactively decided stop thinking about them because he got that generic one that is “ok” for most of the tasks? What could be the consequences?

A good wrench can be enough to hit a nail on the wall, but it will never be as efficient as the simplest hammer.

In this post I will cover some well-known basic data structures, discuss the importance of knowing how each one works and propose a systematic, time complexity-oriented approach for choosing a proper data structure in a given scenario. I will be using .NET collections as example. Warning : Be aware that there are tons of data structures out there and that each one is a perfect fit for a specific context. In this post we will be covering the data structures that are usually available as built-in component on many languages.

How Many Times Did You Use a Collection Other Than List, Array and Dictionary?

When we need to group entities of same type in a collection for general purposes, List(T) a default choice for many developers for a good reason : it’s practical and versatile as data structure. Furthermore, there is a good chance that in a random circumstance it’s a good choice.

So Why Should I Bother About It?!

04 - Word Cloud
The simple answer is pretty straightforward : we shouldn’t be using a tool just because it’s capable of handling the job. When choosing a data structure, it’s important to consider how it will be used and the complexity of the operations that will be performed against it. With that in mind, choosing the best tool for the job is a matter of knowing the proper usage of each one and the complexity of their common operations.

The elaborate answer is still as simple as the simple answer : You should think about the proper data structure because it will improve code quality and readability, not rarely it will be much simpler and productive to implement and, lastly but not least, you will avoid hidden performance issues that are hard to track.

Know How They Work

It’s relevant the number of developers who don’t know (or know, but just don’t care) about how the collections they use work under the hood. This basic knowledge is the key to understand the complexity of their operations. Actually, you don’t have to know the details. A simple glance can be just enough to support your decision about which tool to use under the proper circumstances.

By understanding our environment and making fortunate choices, we avoid common collateral damage that haunts developers on a regular basis and ensure best practices. For instance :

We Avoid Hidden Performance Issues Due To a Collection Misuse

List(T).Add has O(1) amortized time, which is nice, but List(T).Insert() and List(T).Remove() has O(n) complexity. During these operations, the elements after the target index are shifted accordingly in order to reorganize the dynamic array. That said, writing other than at the end of the collection can be computationally expensive and an innocent Insert can be a potential source of performance issues.

04 - ListInsert

When we understand how it works, we can anticipate these potential issues and be saved from being cursed by other developers for generating a problem that is hard to track.

We Ensure Proper Usage of Interfaces And Code is Easier to Understand

When you use a more dedicated collection, you explicitly specify how it should be used. A mere declaration naturally reduces the set of usage possibilities and keeps its affordance clear. For instance, if someone inspects a code he doesn’t know and finds a class that manipulates a Queue(T), it becomes naturally clear that this attribute is not inserting or removing elements at the middle of the collection. It means also that the first element added is always the first to be read and that at some point the element is probably removed after its value is obtained. All that information by reading one single line of code.

By improving affordance, we ensure that the user (i.e. someone who analyses the code) doesn’t need to think (or think a lot) in order to understand what’s going on.

Affordance : a situation where an object’s sensory characteristics intuitively imply its functionality and use. Source: Usability First.

That said, it becomes clear that if we used a List(T) in a scenario where a Queue(T) would be a fit, even if the performance matches, we lose in affordance by not using the right tool, thus compromising readability and interpretation.

Choosing The Right Data Structure : A Flowchart

I created this flowchart as a simple tool to support decision when choosing a data structure. As I said at the beginning, there are tons of data structures out there and each one is a perfect fit for a specific context. This chart contains only common data structures supported natively by many languages. 04 - Data Structures

There Is No Free Lunch

There is no data structure that will fit better in every context. In order to select a proper collection, you need to know what you can give away.

  • A dictionary, hashset or hashtable will give you O(1) amortized complexity for insert /add / remove, but you can’t guarantee order of the elements.
  • A Binary Search Tree will give you O(log n) for insert / remove and a predictable order. It’s far more efficient than a O(n) for big sets, but it’s not as fast as a dictionary.
O(log n) : A logarithmic running-time function is faster than O(n) and slower than O(1). For instance, if you have a collection of 16 elements and need to find a specific element, the worst-case scenario of a O(n) function would be to find the desired element after 16 iterations, against 4 iterations of a O(log n) function (log2 16 = 4) and a direct access of a O(1) function. Example of a O(log n) algorithm : Binary Search.

That said, if you need fast access and you don’t need order, the "limitations" of a dictionary is not even an issue. However, if you need order and elements are constantly entering and leaving at unpredictable positions, most search trees will have O(log n) complexity for both writing and retrieval, which is pretty decent, but not as fast as the dictionary's amortized constant time.

Everything depends on your needs. The perfect fit is will always be achieving maximum performance with minimum collateral damage due to the intrinsic limitations of a data structure.

Amortization is a method for analyzing a given algorithm's time complexity. (...) The motivation is that looking at the worst-case run time per operation can be too pessimistic. Example : If we started out with a dynamic array (List<T>) of size 4, it would take constant time to push four elements onto it. Yet pushing a fifth element onto that array would take longer as the array would have to create a new array of double the current size (8), copy the old elements onto the new array, and then add the new element. (Thus, despite the eventual array copy, we still consider that List.Add has a constant amortized time). Source : Wikipedia.

List, Array

Time complexity

 ListArraySortedList
Access by indexO(1)O(1)O(1)
Access by key--O(log n)
SearchO(n)O(n)O(log n)
InsertO(n)-O(n)
DeleteO(n)-O(n)
Insert at the endO(1) amortized-O(log n)
Delete at the endO(1)-O(log n)
  • Lists and arrays are preferable when you need indexed or random access to elements.
  • They are also versatile and commonly used when you just want to group data for general purposes.
  • .NET lists have a Sort() method, which implements a Quick Sort. On average, this algorithm is an O(n*log n) operation. If you need to sort the elements only once and then you just add elements at the end, it will probably be better to use this method than to use a SortedList, since this last one has O(log n) for insertions at the end, against O(1) amortized of List<T>.
  • If you don’t need to add or remove elements after the collection is created, consider to use an array instead of a List<T>.
  • Avoid making heavy Insert() and Remove() operations in a List<T>. It’s an O(n) operation because the array is shifted. Remove an element at the end is O(1) because there is no shifting.
  • Add at the end is O(1) amortized. When the internal array reaches its limit, a new instance is created with double size and the elements are copied. If you know how many elements are to be inserted during creation of the collection, you can define its size during initialization and save time and space.

Stack, Queue

Time complexity

 Stack / Queue
Access to the object at the topO(1)
SearchO(n)
InsertO(1) amortized
DeleteO(1)
  • These are perfect when you need a sequential list where elements are discarded after its value is retrieved.
  • Consider using a Queue<T> for First-in, First-out behavior (FIFO), or its thread-safe version ConcurrentQueue<T>.
  • Consider using a Stack<T> for Last-in, First-out behavior (LIFO), or its thread-safe version ConcurrentStack<T>.
  • Push / Enqueue is O(1) amortized. When the internal array reaches its limit, a new instance is created with double size and the elements are copied. If you know how many elements are to be inserted during creation of the collection, you can define its size during initialization and save time and space.
  • Pop / Dequeue is O(1).

Linked List (Singly, Doubly, Circular)

Time complexity

 Linked List
Access to a node itself or its adjacenciesO(1)
SearchO(n)
Insert O(1)
DeleteO(1)
  • Linked lists are preferable when you need constant time insertion and remove at the middle of the collection.
  • It is also useful when memory usage is critical and you don’t know the number of elements to be inserted, since there is no array copy in a linked list when it becomes too big.
  • You should use a Circular linked list if the problem is inherently circular (for instance, to control whose turn it is in a multi-player board game).
  • It should be Doubly-linked if you need to go back in the list, or Singly-linked if you just need to move forward.
  • Linked lists are not indexed. Thus, if you want to keep track of specific nodes, you might consider to keep a reference of them.
  • .NET LinkedList<T> is a Doubly-Linked List.Thus, you can naturally use it for a Singly-Linked List. If you need it to be circular, just make the first node to be the next of the last.

Dictionary, HashSet

Time complexity (Average)

 Dictionary / HashSet
Access by keyO(1) amortized
SearchO(n)
InsertO(1) amortized
DeleteO(1) amortized

Time complexity (Worst)

 Dictionary / HashSet
Access by keyO(n)
SearchO(n)
InsertO(n)
DeleteO(n)
  • Dictionaries and HashSets are great for fast access, but the order of the elements is not guaranteed.
  • HashSet is preferable over Dictionary when you need a collection of unique values.
  • The complexity of insert / get / remove includes the complexity of calculating the hash. So keep GetHashCode() simple and with constant time.
  • The worst case scenario is GetHashCode() returning the same value for all the entries. Elements with same hash are grouped in the same bucket to avoid collision. In this case, the elements in the bucket are iterated and Equals() of the key passed as argument is called against the key of each element in the bucket. The complexity of accessing the element is then the complexity of finding the element inside the bucket, which can be O(n).
  • ListDictionary is faster than Hashtable for small collections (10 items or fewer). The Dictionary<TKey, TValue> generic class provides faster lookup than the SortedDictionary<TKey, TValue> generic class. The multi-threaded implementation is ConcurrentDictionary<TKey, TValue>. ConcurrentBag provides fast multi-threaded insertion for unordered data. Source: MSDN.

Search Tree

Time complexity

 SortedDictionarySortedList
Access by keyO(log n)O(log n)
Access by zero-based index-O(1)
SearchO(log n)O(log n)
InsertO(log n)O(n)
DeleteO(log n)O(n)
Insert at the endO(log n)O(log n)
Delete at the endO(log n)O(log n)
  • Trees are great when you need fast insertion, deletion and retrieval.
  • It’s ideal in cases where data is entering / leaving constantly in different positions.
  • Search tree is a family of data structure and there are tons of trees. You should check which tree is the best for your scenario.
  • The SortedList<TKey, TValue> generic class is a binary search tree. It's similar to the SortedDictionary<TKey, TValue> generic class. The two classes have similar object models, and both have O(log n) retrieval. The two classes differ in memory use and speed of insertion and removal.
  • SortedList<TKey, TValue> uses less memory than SortedDictionary<TKey, TValue>.
  • SortedList<TKey,TValue> supports efficient indexed retrieval of keys and values through the collections returned by the Keys and Values properties.
  • SortedDictionary<TKey, TValue> has faster insertion and removal operations for unsorted data, O(log n) as opposed to O(n) for SortedList<TKey, TValue>.
  • If the list is populated all at once from sorted data, SortedList<TKey, TValue> is faster than SortedDictionary<TKey, TValue>.

Putting It All Together

We shouldn’t be using a data structure just because it’s capable of handling the job. It’s important to consider the scenario and the operations that will be performed. The understanding of how things works under the hood helps us make the judgement of which data structure will perform better and ensure application’s performance. It will also improve code quality and readability.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

Share

About the Author

Arthur Minduca
Software Developer Nurun
Canada Canada
Work experience in Microsoft technologies, web and distributed systems. MSc in computer science (field of Machine Learning) at UFPE, bachelor of Computer Engineering at UPE, Microsoft certified and Software engineer at Revenu Québec, Canada.

You may also be interested in...

Pro
Pro

Comments and Discussions

 
Praisegood article Pin
Prasanna Murali7-Jul-16 3:26
memberPrasanna Murali7-Jul-16 3:26 
QuestionDictionary<> Pin
kosat30-Apr-16 3:03
memberkosat30-Apr-16 3:03 
GeneralMy vote of 4 Pin
cjb11026-Apr-16 22:57
membercjb11026-Apr-16 22:57 
General[My vote of 1] Your logic tree is all wrong Pin
SledgeHammer0126-Apr-16 9:42
memberSledgeHammer0126-Apr-16 9:42 
GeneralRe: [My vote of 1] Your logic tree is all wrong Pin
Arthur Minduca26-Apr-16 10:29
professionalArthur Minduca26-Apr-16 10:29 
GeneralRe: [My vote of 1] Your logic tree is all wrong Pin
SledgeHammer0126-Apr-16 11:02
memberSledgeHammer0126-Apr-16 11:02 
PraiseThanks for putting this together and for sharing! Pin
Slacker00726-Apr-16 2:00
professionalSlacker00726-Apr-16 2:00 
GeneralRe: Thanks for putting this together and for sharing! Pin
Arthur Minduca26-Apr-16 6:16
professionalArthur Minduca26-Apr-16 6:16 

General General    News News    Suggestion Suggestion    Question Question    Bug Bug    Answer Answer    Joke Joke    Praise Praise    Rant Rant    Admin Admin   

Use Ctrl+Left/Right to switch messages, Ctrl+Up/Down to switch threads, Ctrl+Shift+Left/Right to switch pages.

Permalink | Advertise | Privacy | Terms of Use | Mobile
Web02 | 2.8.170915.1 | Last Updated 26 Apr 2016
Article Copyright 2016 by Arthur Minduca
Everything else Copyright © CodeProject, 1999-2017
Layout: fixed | fluid