A Brief Survey of Data Structures (and a Little Digression about the Go Language)

Paolo Martinoli

5.00/5 (1 vote)

Apr 9, 2024

CPOL

27 min read

4971

Data structures (aka containers or collections) are data types that allow multiple values (elements or items) to be stored. In languages where variables can be typed, the elements of a container are normally required to be of the same type.

1 Access methods

Data structures differ in the way elements can be accessed. There are essentially three conceptually different methods:

sequential
random (by index)
associative (by key)

Access methods are not mutually exclusive options. Indeed, every practical container implementation allows sequential access.

1.1 Sequential access

Sequential access means that one can iterate (enumerate) over the elements of the container, accessing them one after another in a sequence.

Lists, trees and graphs

Sequential access, unlike the other two access methods, which designate a precise type of container, allows for several variations: lists, trees and graphs. A list, or linear container, is a special case of a tree, which in turn is a special case of a graph.

In a list – the simplest type of sequential container – each element has exactly one predecessor (except one, the first element, which has none) and one successor (except one, the last element, which has none).

When we release the uniqueness requirement for successors, we have a tree (see Section 2). In a tree, an element can have more than one successor, but still only one predecessor.

If we also relax the uniqueness requirement for predecessors, we have a graph.

Stacks, queues and sorted lists

Lists can be further specified, depending on where new elements can be inserted:

at the beginning of the sequence (stack or LIFO list);
at the end (queue or FIFO list);
in the middle of the sequence at an arbitrary position (linked list)¹.

Queues and stacks allow accessing their nodes in the same order in which they were added to the sequence (queues) or in the reverse order (stacks). This is why queues are also called FIFO (First In First Out) lists and stacks are called LIFO (Last In First Out) lists.

Since linked lists allow new elements to be inserted at an arbitrary position in the sequence, data can be kept in an order other than the insertion order. If the items to be stored can be compared and sorted in some way, then it is possible to construct a sorted list, taking care of inserting new items in the correct positions. The time overhead that occurs during the insertion process is recovered when the elements are accessed. (In practice, there is no point in placing an element at a particular position within a linked list, unless you want to construct a sorted list. The raison d'être of linked lists is to build sorted lists).

Doubly linked lists and double-ended queues

There are also doubly linked lists: these are linked lists that can be iterated to and fro in both directions.

There are also double-ended queues or deques. In a deque, new elements can be inserted or read and removed from either side of the sequence. Basically, we have a container that can act as both a stack and a queue. Unlike doubly linked lists, elements cannot be inserted at arbitrary positions in deques.

1.1.1 Some briefing about the implementation

The following discursive description is intended solely to depict how a list (or tree) works; all technical caveats are purposely left out.

At the simplest level, lists and trees can be implemented using dynamic allocation and pointers. We can imagine that each element of a list holds a pointer to the next element in the sequence. In the case of a tree, or a graph, each element contains not just a single pointer, but a number of pointers to its multiple successors. In the case of a doubly linked list, it also contains a pointer to the previous element of the sequence². In a list, for the last element, the pointer to the next element in the sequence has a null value. Likewise, in a doubly linked list, for the first element, the pointer to the previous element has a null value. A list internally stores a pointer to its first element. A doubly linked list also stores a pointer to its last element.

To be a little more specific, a naive implementation of a linked list with elements of type T consists in two generic³ classes list<T> and element<T>.

Class list<T> represents the container while class element<T> wraps a pointer to an element of type T and allows safe access to the elements, preventing direct access and possible misuse of pointers to them.

(I apologize in advance to the reader for the pedantry of the following exposition).

The list<T> class exposes an accessor method that returns a pointer of the first item in the list, an object of type element<T>. (A doubly linked list also has a method that returns a pointer to the last item in the list).

The element<T> class has basically two accessors: one that returns a pointer to the data embedded in the element, a pointer to an object of type T, and one that returns a pointer to the next element in the sequence (or null if there is none). (In a doubly linked list, the element<T> class also has a method that returns a pointer to the previous element of the list, or null).

The list<T> class has a mutator method for inserting a new element at the beginning of the list. The method takes a data item of type T as a parameter, creates a new element<T> object, copies the data item into it, and places the new element at the beginning of the list, arranging the internal pointers accordingly. (Again, a doubly linked list also has a mutator method for inserting a new element at the end of the list).

The list<T> class also has a mutator method for inserting a new element after a given element of the list, whose pointer is taken as a parameter⁴. (Yes, a doubly linked list also has a mutator method for inserting a new element before a given element of the list).

The list<T> class also has a mutator method for removing an element from the list; a pointer to the element before the element to be removed is passed to it as a parameter. There is also a method to remove the first element of the list. In a doubly linked list, it is possible to remove the element pointed to by the parameter (not its successor).

Rely on arrays in lieu of stacks and queues

Stacks could be implemented just like linked lists, with no method to insert a new element at any position. The same is true for queues, at the cost of an extra pointer pointing to the last element in the sequence. However, this is rarely done in practice. The great strength and distinguishing feature of lists (and trees) is the ability to insert elements at arbitrary positions. When this requirement is not present, as in the case of stacks, queues, and deques, it is usually preferred to use arrays (see Section 1.2), foregoing dynamic allocation, or to use dynamic arrays (see Section 3), using a sort of poor man's version of dynamic allocation. The stack implemented in this way (array) is the simplest of all possible structures; a single pointer to the first element is all that is needed to make things work. This simplicity is exploited at the operating system level. In our abstract machine model, each process is allocated a memory space, part of which is managed as a stack. A processor register, the famous stack pointer, is dedicated to holding the one pointer needed by the stack.

1.1.2 Assessment

Since every type of container, both random (arrays) and associative (maps), provides sequential access, you might ask what's the point of using data structures that provide only sequential access.

The answer is dynamic allocation, and especially the ability to insert elements at arbitrary positions. This, in turn, means the ability to build sorted lists and trees, and thus exploit the order at retrieval time. Search trees (see Section 2) work on this principle.

For sorted lists, although the items are ordered, retrieving a particular element is quite inefficient. It means traversing the list, accessing all items before it, until:

the element with that value is found;
an element with a greater value is found or the end of the list is reached, meaning that the requested item is not in the list.

As a result, the access time depends on both the position of the item in the list and on the size of the list (the number of items it contains).

In order to obtain a constant (or quasi‑constant) access time, it is necessary to resort to random access, accessing through consecutive indexes, or to associative access, accessing through generic identifying values, called keys.

1.2 Random access (by index)

Random access means that elements can be inserted and retrieved in one fell swoop by specifying their index, that is their position within the container. The access time is a constant, in that it does not depend neither on the value of the index, nor on the size of the container (the number of the elements in it). Indexes are consecutive integers, usually ranging from 0 to l-1, where l is the size (or length) of the container.

1.2.1 Implementation

The only way to implement random access is to resort to static allocation, that is all the memory required to store elements must be allocated when the container is created, and a maximum capacity must be specified. Furthermore, the allocated memory must be a single block of contiguous memory within the virtual space of the process. This is because the array implementation relies on the fact that, in our abstract model of a computer system, memory itself is a large array of words, with random access.

This type of container is known as an array.

For arrays, elements must be of the same type, it's not optional.

1.2.2 Assessment

For certain applications where small performance differences and constant, predictable access times are critical, static allocation may be the preferred option. Dynamic allocation is always slower, both when reading, because pointers have to be dereferenced, and when allocating, which has a cost. Beyond the case of high-performance applications, dynamic allocation is convenient. With dynamic allocation, memory is allocated only as elements are inserted, and there is no need to set a limit on their number in advance. Furthermore, when an element is removed from the container, the memory allocated for it is reclaimed.

To partially overcome the restrictions of static allocation, dynamic arrays can be defined (see Section 3). Dynamic arrays are dynamic only in the sense that they can be resized by allocating new memory when necessary, i.e. when the size limit of the array is reached. When an ordered sequence is needed, arrays don't allow you to shift adjacent elements to make room for new ones.

To get almost the same great advantage of arrays, which is a quasi-constant access time, plus the benefits of a fundamental access method, it is necessary to go with associative access.

1.3 Associative access (by key)

A key is a value that uniquely identifies an element of a container and that can be used to retrieve it, similar to how the elements of an array are retrieved by specifying their index. However, keys are not limited to contiguous integers but can be of any type (most commonly strings).

In this case, despite the apparent similarity, arrays cannot be used. Not so much because keys usually have different types than integers⁵. The real problem is that the space of possible keys is much larger than the number of elements that will actually be stored, and with an array, which is statically allocated, we are forced to preallocate enough contiguous memory locations to store one element for each of the possible keys. If the difference between the largest and smallest index of an array is n, then we must allocate enough space to hold all possible n elements.

It is clear that we need to resort to dynamic allocation, so that keys, unlike array indexes, do not have to represent a position within a block of contiguously allocated memory.

The price of this flexibility is that, unlike random access, the access time is not constant for all the elements, and on average it gets worse as the size of the container increases.

A container that provides associative access is called a map or a dictionary.

1.3.1 Some briefing about the implementation

Whereas arrays and lists are essentially straightforward to implement, maps enable complex and completely different implementations.

Hash tables

A typical implementation of a map makes use of hash tables.

Hash tables can be seen as a tradeoff between the amount of memory preallocated for the container and the average access time. With hash tables, a hash function is defined to map key values to indexes into an array of preallocated memory. Obviously, since the possible keys are much more than the allocated memory locations, the hash function cannot be injective, and there will be collisions, that is two or more different keys will be mapped to the same index. When this happens, the collision is handled (there are different strategies) and all elements with the same key are stored, but subsequent accesses to collided elements will take more time. Thus, access time results to be a constant in normal cases, but sometimes it is larger because of collisions. The more memory is preallocated, the lower the probability of a collision, and the lower the average access time. If enough memory is reserved for all possible keys, there are no collisions, we fall back to the random access case, and the access time becomes a constant.

Binary search trees

In addition to hash tables, maps can be implemented by a type of tree called a binary search tree (see Section 2).

This implementation allows fast enumeration of the elements in key order, which hash tables may not provide (depending on the nature of the hashing function). However, while binary search trees are significantly faster than linear lists, they are on average slower to access than hash tables.

1.3.2 Assessment

Evaluating different implementations of a map is a topic on which there is endless literature. What can be understood from just scratching the surface is that there is no implementation that is always better than any other. One cannot even say whether an implementation based on a hash table or a binary search tree is absolutely preferable. In fact, although hash tables are on average faster than search trees, the performance of a map depends on a variety of factors that are almost always very difficult to estimate a priori. Some of these factors are:

the size of the key space (i.e. the number of possible keys);
the number of elements that will be in the map on average, which is useful to allocate the hash table accordingly;
the ratio of the number of insertions/deletions to the number of reads (if the number of reads is not significantly higher than the number of insertions, the use search trees, which have a large insertion overhead, loses its effectiveness)
the need to enumerate elements in key order, in which case search trees are the choice.

A look at the most common implementations does not help to lean towards one or the other option (hash table or search trees). The C++ STL library, one of the obligatory references on the subject, defines a map class that is implemented via a particular variant of a search tree, a so‑called red‑black tree. It's only since C++11 that the library also defines a hash table‑based implementation of a map, the unordered_map class. In contrast, the Go language has a primitive language type, map, implemented via hash table. And it has no tree‑based map type even among the packages in its standard library.

2 Trees

As mentioned in Section 1.1, trees are sequential containers whose elements can have more than one successor but only one predecessor (except for the root, which has no predecessor at all).

Elements of trees are also called nodes. The successors of a node are called its child nodes and the node itself is indicated as their parent node. It is assumed that an ordering can be defined on the child nodes of any parent node. Each child node can be the root of another tree, called a subtree. If a node has no child nodes is called a leaf node.

Pre‑order, in‑order and post‑order traversal

The nodes of a tree can be enumerated just like the elements of a list; however, while a list establishes a single precise sequence, a tree can be iterated (or traversed) in a number of ways. For example, a tree can be traversed by a recursive process that consists in traversing the root node of the tree, add it to the sequence, and then traversing its subtrees, if any, in the same way. This type of traversal is called pre‑order traversal. Post‑order traversal is the same, but the parent node is added to the sequence after its subtrees have been traversed. Binary trees can undergo in‑order traversal, which is: for each node, the first of the node's two subtrees is traversed, the node is added to the sequence, and then the second subtree is traversed.

Pre‑order traversal can be used to make a mathematical expression in polish notation from a syntax tree. Pre‑order traversal is also used to create a copy of the tree. Each node is copied before copying its children.

Post‑order traversal can generate a mathematical expression in reverse polish notation. Post‑order traversal is also used to delete a tree. Each node is freed after freeing its children.

In‑order traversal is commonly used on binary search trees (see below).

Each linear sequence running from the root to a leaf node is called a branch of the tree, and its length is the number of nodes in it. The height of a tree is the length of its longest branch.

Well‑balanced trees

If the height of a tree is equal to its size (the number of nodes in it), then we have only one branch, and the tree degenerates to a list. At the opposite extreme of this case, we have another degenerate tree with one root node and n-1 branches of length 1, where n is the size of the tree. In this case, in practice, the storage policy is delegated to the container holding the n-1 child nodes of the root.

Between these extremes lies the virtuous case of a well‑balanced tree (or search tree), which can be defined as a tree with the smallest possible height. For a binary tree, this height is log(n), where log is the base 2 logarithm. Intuitively, keeping a tree well‑balanced is equivalent to keeping its branches as equal in length as possible.

Binary search trees

Like lists – if nodes can be compared and sorted somehow – trees can be sorted. A sorted tree is one in which the nodes are sorted as they are traversed according to some traversal method. In‑order traversal is usually chosen. This way, each node is greater than all the nodes in its left subtree and less than all the nodes in its right subtree.

A search tree is a particular type of sorted tree that is kept well‑balanced. Nodes are carefully inserted in the proper locations in order to maintain the balancing property (in addition to sorting). In some cases, to accomplish this task, nodes already in the tree must be rearranged. Greater efficiency in retrieving data compensates for the penalty in insertion time.

A binary tree is a tree in which each node has at most two children. As mentioned in Section 1.3.1, a special type of search tree, the binary search tree, is often used to implement a map. In this case, each node of the tree stores a couple (key, value) and the tree is kept sorted according to keys.

Binary search trees show us how an inherently sequential structure like a tree, if properly constructed, can be used to (efficiently) implement an associative container. Thanks to the well‑balanced property, search trees have fast key search times, against an overhead in node insertion, which, as mentioned earlier, can even result in the need to rearrange the nodes already in the tree, when a new insertion occurs.

Since binary search trees are sorted trees, they allow to (efficiently) enumerate the elements according to the key order.

Recursion

Most of the useful features of trees stem from the fact that trees are recursive data structures.

Informally, a recursive structure is a structure that can be defined in terms of itself.

This is a possible recursive definition of a tree: a tree is a node with n subtrees attached, where 0 ≤ n ≥ N.

For binary trees, N is 2. For linear lists, N is 1. The case when n is equal to 0 is the termination case, which avoids getting stuck in endless recursion.

What follows from recursiveness is the existence of a simple (recursive) procedure⁶ for traversing all the nodes in a tree without getting trapped in an infinite loop.

This fundamental property is not shared by graphs, that are not recursive structures. If we were to use the same recursive traversal algorithm that we use for trees to traverse a graph, the process would go into an infinite loop, because graphs contain cycles.

A recursive procedure is simple in that it can be implemented using a simple stack to store the progress of the process. In the vast majority of programming languages (those with functions and automatic allocation of local variables), the stack is the system stack in the virtual memory space of the process, and is managed automatically by the compiler (i.e., by code generated by the compiler) without explicit programmer intervention.

In the case of tree traversal, the stack stores the information about which node we are on at every step, and which subtree of that node we are going to traverse next (remember, we assume that the subtrees of a node are a sequence with a precise order). The recursive nature of the tree assures us that we will not revisit this node unless we return from traversing the subtree. In the case of a graph, however, we must check at each step whether we have landed on a node that has already been traversed, in order to discard the step. To perform this check, it is not enough to store the traversed nodes on a stack, but we need a more general data structure, a map or an array. (At the opposite extreme, we have linear lists, which do not have any state storage requirements for iteration).

3 Arrays of pointers and dynamic arrays

Arrays are such a powerful and simple tool that it is worthwhile looking for ways to reduce the impact of their big drawback, static memory allocation.

arrays of pointers

With data structures, using languages that permit the declaration of pointers to objects, we always are presented with the choice of defining a container of pointers instead of concrete objects.

While for all the other data structures – exploiting dynamic allocation – it is mostly the same in terms of memory consumption, for arrays the use of pointers is obviously a big gain. Especially for languages with managed memory, which relieves us of the bother of memory deallocation.

Dynamic arrays

Let us take a look at the built‑in implementation of arrays in the Go language⁷. In this language, arrays can grow in size as needed, something that we might call dynamic arrays.

Of course, dynamic arrays are implemented in other languages as well. For example, the C++ STL library has the class vector that does the job.

In both implementations (Go and STL), a dynamic array grows simply by allocating a larger array and copying all the elements of the original array into the new one. For the STL vector class, this is made necessary by the requirement that vectors, just like regular arrays, must use contiguous storage locations for their elements, so that they can also be accessed using offsets on regular pointers.

If this (nasty) requirement is dropped, different strategies can be considered. For example, a dynamic array class could maintain a pool of arrays. When it is necessary to increase the size of the dynamic array, and the first array in the pool is full, another one is allocated and added to the pool, and so on. No elements are moved from one array to another. Given an index of the dynamic array, a function calculates, in an obvious way, which array to access and the index relative to it. When the pool is full and no more arrays can be added, a "compacting" step is performed, replacing the arrays with a single large array that can hold all the elements in them, and copying all the elements into it.

4 Iterators

We have seen that, in addition to strictly sequential containers, arrays also allow sequential access to data, but in a different way, using an index that takes all corresponding values sequentially from the first element to the last. Some map implementations (certainly those based on binary search trees) also allow sequential access, in ways that depend on the particular implementation.

Iterators are an abstraction that allows us to sequentially access data structures of different types using the same code.

Class element<T> described in Section 1.1.1 is, in nuce, a basic version of iterator. It provides methods for iterating over the list that abstract from its implementation, i.e., avoid to directly accessing pointers to elements. Class element<T> is immutable, and its accessor that yields the next element returns a new element object. In common implementations, however, the iterator class is stateful, and there is a method that advances its internal pointer to the next element. In the C++ STL library, this method is implemented elegantly (for some) by overloading the increment operator (++). Of course, in a doubly linked list, there is a method that moves its internal pointer back to the previous element, and of course in the STL library this method is implemented by overloading the decrement operator (--). In the STL library, even the accessor method that retrieves the referenced data is implemented by overloading the dereference operator (*). Sometimes iterators are presented as an abstraction of the concept of pointer, which is particularly evident in the STL implementation.

The iterator interface⁸ can be implemented for any type of container besides lists, to achieve homogeneity of access. For example, in the case of arrays, the iterator class is supposed to hold the integer index of the current element, rather than a pointer to it, as for lists.

¹ Also known as Italian queue (someone always cuts in).

² The important thing is that, in a doubly linked list, for each element, we can tell the pointer to the next element from the pointer to the previous one. This essential information allows us to traverse the list correctly, keeping the direction (backwards or forwards) and avoiding loops. It's as if we had two lists that shared the same elements and went in opposite directions. If this information were missing, we would have neither a list nor a tree, but a graph, with a potential cycle for each element, because each element would have two successors and two predecessors.

³ Generic Programming (Generics)

Speaking of the implementation of data structures, we cannot fail to mention a construct present in the most advanced programming languages, a construct of general utility, but which finds its most natural application precisely for this purpose: this is the mechanism of generic programming. Thanks to it, it is possible to parameterize a data structure with respect to the type of the objects it contains. For example, instead of coding a list that contains elements of type T1 and another list, with almost identical code, intended to contain elements of type T2, and so on, you can write generic code that works for elements of an unspecified type T. In the program, the actual type of the contained elements is specified when the list is declared.

A common notation for generic types is the "angle bracket notation"; for example, a generic type (class) Stack is written Stack<T>. To declare a variable of type Stack that stores, say, strings, you'd write Stack<string> stringStack, replacing T with the actual type string.

Although not the first language to introduce this construct, generics have always been a distinctive feature of C++, where they are called templates. C++'s templates are a very efficient implementation of generic programming, an implementation of long standing (templates were introduced in the 3.0 release of the language, in 1991) but still subject to evolution. They are so important that the language's standard library, which, among other things, includes a wide variety of data structures, is called the Standard Template Library (STL).

Generic programming has been gradually introduced into almost all the major strongly typed languages. Examples of such languages are the Go language, where it was eventually introduced starting with version 1.18 in 2022, C# (.net) since version 2.0, released in 2005, and Java, since 2004.

⁴ Exercise: figure out why inserting an element before a given element of a linked list is very inefficient.

⁵ Remember that, in the world of programming, every object of any type can be reduced to a number; in fact, in the depths of the system, every object of any type is a number.

⁶ A recursive procedure is a procedure that calls itself.

⁷ A little digression about the Go language

Among its many interesting features, the Go language is worth studying because of its anti‑conventional attitude. The language, which does not care about trends, dares to abandon constructs that are considered indispensable today in favor of other, often simpler solutions. Consider this: the language is not object oriented, does not present a complete implementation of the class construct. You can declare methods on a structure, but there is no inheritance, no constructors and destructors, no class hierarchies. The language only allows you to declare and implement interfaces. (But type binding is not explicit; you never declare that a particular structure implements a particular interface; the compiler infers it. From the language documentation: "A type implements an interface by implementing its methods. There is no explicit declaration of intent, no implements keyword".

The language (horribile dictu) does not handle errors by throwing and catching exceptions, but by return values (facilitated by the fact that a function in Go can have more than one return value). Namespaces come in a simplified form. In the switch construct, it is not necessary to write break to avoid falling into the underlying case (standing ovation!). Pointers are used explicitly, without the pretence of reference types. Of course no form of pointer arithmetic is allowed, and access to arrays is guarded from index overflow by a special construct called slice. Generics have not been supported until version 1.18, dated March 2022.

To conclude, the language is memory managed but subject to compilation into executable code, an unusual combination. There is no virtual machine; run‑time services such as garbage‑collection are statically linked into each executable. Language and compile toolchain support TDD (Test Driven Development), external references are handled smoothly and it is easy to cross‑compile for multiple target operating systems.

Ah, in case you are not convinced of the subversive nature of language, here is a quote from the legendary Ken Thompson, one of the three main contributors to its development: When the three of us (Thompson, Rob Pike, and Robert Griesemer) got started, it was pure research. The three of us got together and decided that we hated C++!

⁸ An interface is a class with only pure virtual methods, with no implementation. The interface of a class is the public methods of the class.