Topological Sorting in C#

Tomas Takac

4.99/5 (51 votes)

Jan 27, 2015

CPOL

12 min read

79322

2277

How to sort an directed graph using topological sorting in C#

Part I

Introduction

Sorting a list of numbers or strings is easy. Sorting a list of items by a key is not complicated either. But what if you want to order a sequence of items so that an item must be preceded by all the other items it depends on? Welcome to topological sorting!

In the beginning I will show and explain a basic implementation of topological sort in C#. Then I will cover more complex scenarios and improve the solution step-by-step in the process.

Background

I came across this problem recently. We have a list of tasks that read data from a common store and then write the results back there. The trick is some of these tasks depend on results from other tasks. The engine which is running the tasks does not know anything about their internal structure so it cannot simply execute the dependencies first. We have to feed the tasks in correct order ourselves.

Source sequence with dependencies.

In the above image we have sequence of eight items named A to H. The arrows show how these depend on each other. Here A does not depend on anything else, it's a root element, B depends on C and E, while D depends on A etc. The goal is to order the items in such way that there are references from right to left only.

When I started to work on the problem I knew there is something like topological sorting but that was it. So I started to look up more information. There is a nice article about topological sorting on Wikipedia. Of course I was looking for a ready to use solution at first. After a quick search I found this implementation of the depth-first search. It wasn't quite what I needed so I grabbed the code and started from there.

The Algorithm

The depth-first search algorithm is not complicated. It works on a directed acyclic graph. Expressed in pseudocode it takes only a couple of lines:

source: the sequence of items to be sorted (input)
sorted: the list of sorted items (output)

while there are unmarked nodes in source
  select an unmarked node n
  visit(n)

function visit(node n)
  if n has temporary mark then this is a cyclic dependency
  if n has no mark
    add temporary mark to n    
    for each dependency of n
      visit(dependency)
    remove temporary mark from n
    add permanent mark to n
    add n to sorted

It doesn't matter how the input is ordered or where we start. We take an item form the input and walk its dependencies. The concept of marking the items help us to keep track of already visited nodes.

The temporary mark says we are currently processing the node. So if we see that mark it means the node is its own dependency i.e. there is a cyclic reference in the graph. Permanent mark means we already processed that item and it was placed in the output. In this case we don't need to do anything as this dependency was already satisfied.

Processing an item simply means going trough all it's dependencies first and add them to the sorted list. Then it is safe to add the processed item itself.

Basic Implementation

public static IList<T> Sort<T>(IEnumerable<T> source, Func<T, IEnumerable<T>> getDependencies)
{
    var sorted = new List<T>();
    var visited = new Dictionary<T, bool>();

    foreach (var item in source)
    {
        Visit(item, getDependencies, sorted, visited);
    }

    return sorted;
}

public static void Visit<T>(T item, Func<T, IEnumerable<T>> getDependencies, 
                   List<T> sorted, Dictionary<T, bool> visited)
{
    bool inProcess;
    var alreadyVisited = visited.TryGetValue(item, out inProcess);

    if (alreadyVisited)
    {
        if (inProcess)
        {
            throw new ArgumentException("Cyclic dependency found.");
        }
    }
    else
    {
        visited[item] = true;

        var dependencies = getDependencies(item);
        if (dependencies != null)
        {
            foreach (var dependency in dependencies)
            {
                Visit(dependency, getDependencies, sorted, visited);
            }
        }

        visited[item] = false;
        sorted.Add(item);
    }
}

As you can see, there is a list for sorted output and a dictionary to track marked (visited) items. The original implementation from which I'm heavily borrowing here used a HashSet to track visited items. But unfortunately it didn't have any notion of the temporary mark. This was replaced by checking the sorted output for the element in question. I decided to use a dictionary instead where I can store a bool flag for each element, true for temporary mark and false for permanent one. Furthermore, compared to the pseudocode I don't bother with selecting an unmarked item on the top level. There is no need because Visit method handles this situation out of the box.

Let's test it, shall we? First we define the class representing our items to be sorted.

public class Item
{
    public string Name { get; private set; }
    public Item[] Dependencies { get; private set; }

    public Item(string name, params Item[] dependencies)
    {
        Name = name;
        Dependencies = dependencies;
    }
}

Items' dependencies are expressed as straightforward as it gets - an array of items. Now the actual test scenario. Please note that we reuse the same instances for dependencies.

var a = new Item("A");
var c = new Item("C");
var f = new Item("F");
var h = new Item("H");
var d = new Item("D", a);
var g = new Item("G", f, h);
var e = new Item("E", d, g);
var b = new Item("B", c, e);

var unsorted = new[] { a, b, c, d, e, f, g, h };

var sorted = TopologicalSort.Sort(unsorted, x => x.Dependencies);

This will give following output:

A C D F H G E B

Check the diagram above, this is actually correct. The topological sort is working.

Item Equality

Do you remember when I pointed out we are reusing the same instances for dependencies? What if that wouldn't be the case? Let's rewrite the test a bit, now creating a new instance every time we need an item.

var a = new Item("A");
var c = new Item("C");
var f = new Item("F");
var h = new Item("H");
var d = new Item("D", new Item("A"));
var g = new Item("G", new Item("F"), new Item("H"));
var e = new Item("E", new Item("D", new Item("A")), 
        new Item("G", new Item("F"), new Item("H")));
var b = new Item("B", new Item("C"), new Item("E", new Item("D", new Item("A")), 
                  new Item("G", new Item("F"), new Item("H"))));

var unsorted = new[] { a, b, c, d, e, f, g, h };

var sorted = TopologicalSort.Sort(unsorted, x => x.Dependencies);

The output is quite different this time:

A C A D F H G E B C A D A D F H G E F F H G H

The reason for this is that the items are not recognized as equal anymore because they are different instances. But this is a valid scenario. You don't always have the power to ensure that there is only one instance of that particular item, e.g., when you read the items from database or deserialize from XML or retrieve form a web service.

Default equality comparison relies on reference equality. Of course, we can change this behavior by overriding Equals() and GetHashCode() methods in Item class. However, this is not always possible. Luckily enough, there is a dictionary constructor which takes an IEqualityComparer. What we need to do is create one for Item class and pass it into the sort method.

public class ItemEqualityComparer : EqualityComparer<Item>
{
    public override bool Equals(Item x, Item y)
    {
        return (x == null && y == null) || (x != null && y != null && x.Name == y.Name);
    }

    public override int GetHashCode(Item obj)
    {
        return obj == null ? 0 : obj.Name.GetHashCode();
    }
}

public static IList<T> Sort<T>(IEnumerable<T> source, 
       Func<T, IEnumerable<T>> getDependencies, IEqualityComparer<T> comparer = null)
{
    ...
    var visited = new Dictionary<T, bool>(comparer);
    ...

var sorted = TopologicalSort.Sort(unsorted, x => x.Dependencies, new ItemEqualityComparer());

Generic Comparer

In the previous section, we used the ItemEqualityComparer to override the equals logic. But this is again a very common scenario. It would be nice to extend our sort implementation to support this instead of having to create a specialized class for each item type we want to sort. In most cases, I guess the entity can be identified by some sort of key. We need a generic equality comparer which will compare items based on these keys.

public class GenericEqualityComparer<TItem,TKey> : EqualityComparer<TItem>
{
    private readonly Func<TItem, TKey> getKey;
    private readonly EqualityComparer<TKey> keyComparer;

    public GenericEqualityComparer(Func<TItem, TKey> getKey)
    {
        this.getKey = getKey;
        keyComparer = EqualityComparer<TKey>.Default;
    }

    public override bool Equals(TItem x, TItem y)
    {
        if (x == null && y == null)
        {
            return true;
        }
        if (x == null || y == null)
        {
            return false;
        }
        return keyComparer.Equals(getKey(x), getKey(y));
    }

    public override int GetHashCode(TItem obj)
    {
        if (obj == null)
        {
            return 0;
        }
        return keyComparer.GetHashCode(getKey(obj));
    }
}

The comparer simply takes a delegate to extract key form the item and then performs comparison using that key. A new overload of Sort method will take care of creating the comparer and passing it to the original Sort method.

public static IList<T> Sort<T, TKey>(IEnumerable<T> source, 
       Func<T, IEnumerable<T>> getDependencies, Func<T, TKey> getKey)
{
    return Sort<T>(source, getDependencies, new GenericEqualityComparer<T, TKey>(getKey));
}

Now we can use it in the test.

var sorted = TopologicalSort.Sort(unsorted, x => x.Dependencies, x => x.Name);

Indirect References

Up until now, we assumed that our items store their dependencies directly - as an array of items. What if we have only the keys of the dependencies, not the instances themselves? For topological sorting to work, we need the actual objects so we can "visit" them and in turn examine their dependencies. Let us introduce a new item class.

public class Item2
{
    public string Name { get; private set; }
    public string[] Dependencies { get; private set; }

    public Item2(string name, params string[] dependencies)
    {
        Name = name;
        Dependencies = dependencies;
    }
}

Instead of having an array of items stored in Dependencies property, we only have their names. We need to map these names back to the respective items. But we have all the items in the source sequence and we have a delegate to extract keys from those so we can build a dictionary.

private static Func<T, IEnumerable<T>> RemapDependencies<T, TKey>(IEnumerable<T> source, 
               Func<T, IEnumerable<TKey>> getDependencies, Func<T, TKey> getKey)
{
    var map = source.ToDictionary(getKey);
    return item =>
    {
        var dependencies = getDependencies(item);
        return dependencies != null
            ? dependencies.Select(key => map[key])
            : null;
    };
}

public static IList<T> Sort<T, TKey>(IEnumerable<T> source, 
       Func<T, IEnumerable<TKey>> getDependencies, Func<T, TKey> getKey)
{
    return Sort<T>(source, RemapDependencies(source, getDependencies, getKey));
}

Please note that in the new overload of Sort the function to extract dependencies from an item returns an enumeration of keys. RemapDependencies methods returns a new dependency extraction function which wraps the original one and translates the keys to actual items using the dictionary.

With all this in place, we can rewrite the test using the new item class.

var a = new Item2("A");
var b = new Item2("B", "C", "E");
var c = new Item2("C");
var d = new Item2("D", "A");
var e = new Item2("E", "D", "G");
var f = new Item2("F");
var g = new Item2("G", "F", "H");
var h = new Item2("H");

var unsorted = new[] { a, b, c, d, e, f, g, h };

var sorted = TopologicalSort.Sort(unsorted, x => x.Dependencies, x => x.Name);

The call to topological sort looks exactly the same as in previous section, but this time the new overload will be invoked because of the second argument. As you can check, this gives the same results as in previous tests.

Grouping

Let's go back to the example I gave in the beginning. A sequence of tasks which depend on each other is run by an execution engine. This is not a problem anymore since we can sort the sequence and run the tasks one by one. But imagine, we would like to run them in parallel. For example, by splitting the sorted sequence into batches of five items and then run tasks in a batch in parallel. We cannot really do that because the dependencies are expressed implicitly by the order of the items, more detailed information is not preserved. Let me redraw the diagram a bit.

The input sequence as directed graph.

This is our test sequence shown as directed graph. The arrows are pointing in the direction of our intended control flow. On the left, there are the root items with no dependencies. Then in the next level, there are items which depend on the root items, etc. If we knew the level in which the item resides, we could then create the batches without breaking any references.

public static IList<ICollection<T>> Group<T>(IEnumerable<T> source, 
Func<T, IEnumerable<T>> getDependencies, IEqualityComparer<T> comparer = null)
{
    var sorted = new List<ICollection<T>>();
    var visited = new Dictionary<T, int>(comparer);

    foreach (var item in source)
    {
        Visit(item, getDependencies, sorted, visited);
    }

    return sorted;
}

This is a modified version of the Sort method. It returns a list of collections, each collection represent a level in the graph, zero being the leftmost level - the root items. The other change is that the dictionary of visited items has integer values now. This represents the level of the respective item. It will help us to position the item one level higher than any of its dependencies.

public static int Visit<T>(T item, Func<T, IEnumerable<T>> getDependencies, 
                           List<ICollection<T>> sorted, Dictionary<T, int> visited)
{
    const int inProcess = -1;
    int level;
    var alreadyVisited = visited.TryGetValue(item, out level);

    if (alreadyVisited)
    {
        if (level == inProcess)
        {
            throw new ArgumentException("Cyclic dependency found.");
        }
    }
    else
    {
        visited[item] = (level = inProcess);

        var dependencies = getDependencies(item);
        if (dependencies != null)
        {
            foreach (var dependency in dependencies)
            {
                var depLevel = Visit(dependency, getDependencies, sorted, visited);
                level = Math.Max(level, depLevel);
            }
        }

        visited[item] = ++level;
        while (sorted.Count <= level)
        {
            sorted.Add(new Collection<T>());
        }
        sorted[level].Add(item);
    }

    return level;
}

In the Visit method, we lost the bool flag saying if we are processing an item twice in the same dependency tree. We replace that with a dummy level number. The other change is we need to find the highest level of all the dependencies and add one to put the current item to the level just above the highest dependency. Please note that in case there are no dependencies this will be zero (-1 + 1). Finally, we create the level if it's missing in the output and add the item there.

var a = new Item("A");
var c = new Item("C");
var f = new Item("F");
var h = new Item("H");
var d = new Item("D", a);
var g = new Item("G", f, h);
var e = new Item("E", d, g);
var b = new Item("B", c, e);

var unsorted = new[] { a, b, c, d, e, f, g, h };

var sorted = TopologicalSort.Group(unsorted, x => x.Dependencies);

To test it, we use the first test with direct dependencies. The output shows the levels were created as expected.

A C F H / D G / E / B /

We can apply the same improvements to the Group method as we did for Sort, i.e., support for comparers and indirect references.

Cyclic Dependencies

Until now, we aways used an acyclic graph in the tests. What if we try to sort a graph with cyclic references?

Graph with cyclic dependencies

An exception is thrown. But why are cyclic references not supported actually? Simply because a cycle has no start or end. In our example, E depends on G, which depends on F which in turn depends on E. No matter how we order them, we always break dependencies.

However, the implementation can be modified to simply ignore cyclic dependencies and not to throw an exception. Because if we cannot sort the items in the cycle, we may say that we don't care in which order they appear in the result. This can be useful in some cases but you need to be aware that the resulting sequence is actually not sorted!

public static void Visit<T>(T item, Func<T, IEnumerable<T>> getDependencies, 
                            List<T> sorted, Dictionary<T, bool> visited, bool ignoreCycles)
{
    ...
    if (alreadyVisited)
    {
        if (inProcess && !ignoreCycles)
        {
            throw new ArgumentException("Cyclic dependency found.");
        }
    }
    ...

Part II

Streamed Data

Not long after I finished the implementation of the deep-first algorithm, I started to think how it could be improved. I focused mainly on the case where items are referenced indirectly via keys. And I found two main problems.

First, all of the data needs to be available to you at the time of sorting. But what if I have a stream of data which is big? Maybe too big to fit into memory? Creating a dictionary to map keys to items won't be possible. We really need to process the source one item at a time.

Second, the API is not very nice. Instead of using static methods, it would be much nicer to have an extension method. You can argue that this is a cosmetic issue and can be achieved with the existing solution. But with a streamed approach, it feels more natural to use LINQ syntax.

Kahn's Algorithm

The depth-first algorithm is easy to implement but because it walks down the dependency tree it cannot be really used when the graph to be sorted is not completely in memory. What we really need is to wait for the items to appear in the source sequence and then act on it. Luckily, there is another algorithm to topological sorting which works in this scenario.

Sorted: list that will contain the sorted elements
Ready: set of all nodes with no unresolved dependencies

while Ready is not empty do
    remove a node n from Ready
    add n to tail of Sorted
    for each node m which depends on n do
        remove dependency from the graph
        if m has no other dependencies then
            insert m into Ready
if graph has dependencies then
    return error (graph has at least one cycle)
else 
    return Sorted

As you can see, the recursion is gone. The graph is our source sequence. The only problem is we don't have the complete graph at our disposal, but rather need to work with one item at a time.

Waiting for Dependencies

In our example, B depends on C and E but precedes them in the source sequence. So at the time B comes in, C and E are not yet available. What we need to do is to put B aside and wait for C and E. This can be represented by a matrix.

For each item in the sequence, we check if any other items are waiting for it. We mark the dependency as satisfied by clearing all flags in the row. If that was the last dependency the item was waiting for, then we can release it to output.

Scanning a matrix is not very convenient so we rather use dictionary of lists where the key will be the key of the pending dependency and list will contain nodes with a counter.

Clearing a row in the matrix means removing the key from dictionary and decreasing the counter for each node.

When the counter reaches zero, we can release it to output.

Implementation

I chose to implement this using a custom enumerator[^]. Compared to iterators[^] (yield return) this allows for better encapsulation.

public class TopoSortEnumerator<TItem, TKey> : IEnumerator<TItem>
{
    private readonly IEnumerator<TItem> source;
    private readonly Func<TItem, TKey> getKey;
    private readonly Func<TItem, IEnumerable<TKey>> getDependencies;
    private readonly HashSet<TKey> sortedItems;
    private readonly Queue<TItem> readyToOutput;
    private readonly WaitList<TItem, TKey> waitList = new WaitList<TItem, TKey>();

    private TItem current;

You should be able to recognize most of the declarations based on the algorithm description. sortedItems helps us keep track of all items that are already in output or are ready to be. waitList tracks the items which are waiting for some dependencies. Together, they implement the lazy-loaded graph with its edges and nodes. current is the next element of the sequence to be outputted.

public bool MoveNext()
{
    while (true)
    {
        if (itemsWithoutDependencies.Count > 0)
        {
            current = readyToOutput.Dequeue();
            Release(current);
            return true;
        }

        if (!source.MoveNext())
        {
            break;
        }

        Process(source.Current);
    }

    if (waitList.Count > 0)
    {
        throw new ArgumentException("Cyclic dependency or missing dependency.");
    }

    return false;
}

The main loop resides in the MoveNext() method of the enumerator. Assigning the current item and returning true is equivalent to placing the item into the output. If there are still items in the wait list after all items are processed, this means that there is either a cycle or we are missing a dependency.

private void Process(TItem item)
{
    var pendingDependencies = getDependencies(item)
        .Where(key => !sortedItems.Contains(key))
        .ToArray();

    if (pendingDependencies.Length > 0)
    {
        waitList.Add(item, pendingDependencies);
    }
    else
    {
        readyToOutput.Enqueue(item);
    }
}

Processing an item simply means inspecting all dependencies to determine if the item is ready for output or needs to wait for some dependencies.

private void Release(TItem item)
{
    var key = getKey(item);
    sortedItems.Add(key);

    var releasedItems = waitList.Remove(key);
    if (releasedItems != null)
    {
        foreach (var releasedItem in releasedItems)
        {
            readyToOutput.Enqueue(releasedItem);
        }
    }
}

Before placing an element into output, we notify all waiting items that the dependency has been satisfied. If this is their last pending dependency, they themselves are placed into the ready queue.

Points of Interest

To tell the difference between cyclic dependency (an item is waiting for itself) and missing dependency (item is waiting for an item not present in the source sequence) is not trivial. You would need to do a search of all remaining items and their dependencies in the waitlist to make the distinction.

When we return an item as sorted, we still need to keep track of it. We store its key in the sortedItems hashset. This, of course, steadily grows during sorting as new items are placed into output. But the assumption is that the key is much smaller than the item itself.

To merge two partially sorted sequences, you would need to merge the wait lists and apply the release procedure to each already sorted item.

History

28^th January, 2015: Initial version
6^th February, 2015: Cyclic dependencies
7^th January, 2016: Part II added