A Concise Overview of Threads

Marc Clifton

4.97/5 (56 votes)

Aug 2, 2018

CPOL

35 min read

69080

484

A discussion of various approaches to threading, covering locks, mutexes, semaphores, concurrent collections, work queues, threads, PLINQ, TPL, exception handling, and cancellation tokens

Download source code - 132 KB

Introduction
Brute Force Algorithm
Threaded The Brute Force Algorithm
Balancing the Brute Force Algorithm Threads
- Asking for Work vs. Telling the Thread What Work to Do
- Interlocked and Atomic
More "Modern" Ways of Working with Threads
Semaphores and Queuing Work
Mutexes, Locks, and Semaphores
Exception Handling
Awaitable Threads - A Hybrid Approach using TaskCompletionSource
Canceling Threads and Tasks
The ThreadPool Class and CPU-Bound Work
- More Threads than Cores
- More Tasks than Cores
Conclusion
- Other Reading
Footnotes
History

Introduction

I was recently asked to provide some training on how threads are used in C#. What I thought would be a fairly simple discussion turned into a detailed analysis of different approaches to threading and important topics such as canceling tasks and exception handling. So much for applying my 30+ years of experience in judging the scope of a project, even if the "project" is an article! What's new here is that while this has all been discussed before, I'm hoping that this article provides some interesting insights and coverage of the topic that perhaps hasn't been done before in one article. Threading, tasks, cancellation, exceptions, and so forth is a huge topic, whole books have been written on the subject. This article presents an incomplete but hopefully concise overview. There are lots of examples and the focus is on the essentials that you need to know about 90% of the time rather than all the other extraneous stuff you might need the remaining 10% of the time.

What Is a Thread?

I suppose I should start with a brief description of what a thread is, although quite frankly, I'm assuming you actually know what a thread is. But, if someone asked you "what is a thread?" would you be able to provide a concise and correct definition? Wikipedia¹¹ defines a thread as: "...a thread of execution is the smallest sequence of programmed instructions that can be managed independently by a scheduler, which is typically a part of the operating system."

So, firstly, the term "thread" is actually shorthand for "thread of execution." Secondly, it incorporates the concept of a scheduler that switches between "sequences of programmed instructions." This sample chapter¹² of Windows Internals, 5^th Edition, is a highly technical description of the scheduler in Windows (note this sample chapter was written in 2009, the book Windows Internals is now on the 7^th edition.)

What is a Fiber?

Again, according to Wikipedia¹³: "...fibers use cooperative multitasking while threads use preemptive multitasking."

Cooperative Multitasking vs. Preemptive Multitasking

Cooperative Multitasking: "is a style of computer multitasking in which the operating system never initiates a context switch from a running process to another process. Instead, processes voluntarily yield control periodically or when idle or logically blocked in order to enable multiple applications to be run concurrently."¹⁴

Preemptive Multitasking: "...involves the use of an interrupt mechanism which suspends the currently executing process and invokes a scheduler to determine which process should execute next. Therefore, all processes will get some amount of CPU time at any given time."¹⁵

CPU Bound Threads vs. I/O Bound Threads

In broad terms, there are two kinds of threads, or tasks. CPU-bound tasks (or threads) refers to code that is executed 100% on the local CPU and that code doesn't wait for anything else. By "anything else", we mean a variety of things:

hardware interrupts (you usually don't have to deal with things at this level anymore)
completion of requests to remote servers (very common -- requesting data from a database and calling a web API are two examples)
waiting on another thread (fairly common)
waiting on an I/O completion event (usually refers to writing data as a stream to a hardware device or reading data in as a stream from a hardware device, such as the serial port stream class¹⁸.

An I/O bound thread or task ins the opposite -- while it might do some processing, it most likely spends most of its time waiting for something else to signal that data is ready to be processed. The "something else" is the list above.

Coming Up With a Good Thread Example

The first problem of course is coming up with a good example where the work is capable of being broken up into isolated chunks and that takes a reasonable amount of computational time so you can really see the difference in approaches. I chose a brute force "is the number prime" algorithm. And I truly mean brute force:

static bool IsPrime(int n)
{
  bool ret = true;
  for (int i = 2; i <= n / 2 && ret; ret = n % i++ != 0);
  return ret;
}

Yes, there actually is no content to the for loop -- it does end with a ; because the look terminates when ret becomes false.

Obviously, you would NEVER write a prime number calculation this way (a much better approach is the Sieve of Eranthoses) but this algorithm has the advantage of:

being slow
returning a true/false for each number

This is an advantage when writing an article about the nuances of threads!

The Timing Algorithm

To time things, I wrote a simple "time how long this function takes":

static void DurationOf(Func<int> action, string section)
{
  var start = DateTime.Now;
  int numPrimes = action();
  var stop = DateTime.Now;

  lock (locker)
  {
    Console.WriteLine(section);
    Console.WriteLine("Number of primes is : " + numPrimes);
    Console.WriteLine("Total seconds = " + (stop - start).TotalSeconds);
  }
}

It's rather tailored to the examples here in this article.

Locks

Let's talk about locks right now. Notice in the above code the lock statement. The lock ensures that each Console.WriteLine is not "interrupted" by another thread also writing to the console. A lock statement requires an object that all threads can access which acts as the synchronization object between threads. Typically, the synchronization object is a root level Object:

static object locker = new object();

Historically, the concept of a lock was called a "critical section", meaning that only one thread could enter the code at a time. Native Windows programmers will be familiar with the CRITICAL_SECTION⁴ structure and related methods.

Locks can be dangerous:

If what you're performing in the lock takes a long time, your thread will lose performance as it waits for the lock to be released by another thread.
It's fairly easy to create a deadlock in which thread A is waiting for the lock to be released, and thread B, currently inside the lock, is waiting for thread A to finish some task.

The body of a lock statement should never include anything that waits for a thread to do some work. However, locks are useful when doing simple synchronization, particularly of debug output or synchronizing access to a physical "thing" like a hardware port. Otherwise, if there's a lock in your code, it's probably a big red flag.

A Helper Method for the Brute Force Algorithm

This is called by several of the variations of brute force algorithm that are to demonstrate threading, so it's useful to implement it once.

static int NumPrimes(int start, int end)
{
  int numPrimes = 0;
  for (int i = start; i < end; numPrimes += IsPrime(i++) ? 1 : 0);
  return numPrimes;
}

Yes, another one of those do-nothing loops where the work is done in the iterator portion of the for loop.

Brute Force Algorithm

Here, we want to find how many primes there are between 2 and 500,000. I chose 500,000 as the maximum as it takes about 35 seconds on my machine to determine that there are 41,538 prime numbers.

Calling the algorithm:

DurationOf(BruteForce, "Brute force:");

The implementation:

static int BruteForce()
{
  int numPrimes = NumPrimes(2, MAX);

  return numPrimes;
}

The results:

Brute force:
Number of primes is : 41538
Total seconds = 30.1119874

Threaded the Brute Force Algorithm

Let's take a stab at optimizing this by breaking the work up into chunks and giving each thread (up to the number of processors) the chunk to work on. We'll evenly divide the chunks. The astute reader will realize that is not optimal, but we'll illustrate why and discuss one of the important things about multi-threading -- make sure your work load is evenly distributed!

Keep in mind that what I'm about to show you is rather old school!

Calling the algorithm:

DurationOf(ThreadedBruteForce, "Threaded brute force:");

The implementation:

First, setting up the threads:

static int ThreadedBruteForce()
{
  List<(Thread thread, int threadNum, int start, int end)> threads = 
                 new List<(Thread thread, int threadNum, int start, int end)>();

  int numProcs = Environment.ProcessorCount;

  for (int i = 0; i < numProcs; i++)
  {
    int start = Math.Max(1, i * (MAX / numProcs)) + 1;
    int end = (i + 1) * (MAX / numProcs);
    var thread = new Thread(new ParameterizedThreadStart(BruteForceThread));
    thread.IsBackground = true;
    threads.Add((thread, i, start, end));
  }

  totalNumPrimes = 0;
  threads.ForEach(t => t.thread.Start((t.threadNum, t.start, t.end)));
  threads.ForEach(t => t.thread.Join());

  return totalNumPrimes;
}

The worker thread:

static void BruteForceThread(object parms)
{
  (int threadNum, int start, int end) parm = (ValueTuple<int, int, int>)parms;
  DurationOf(() =>
  {
    int numPrimes = NumPrimes(parm.start, parm.end);
    Interlocked.Add(ref totalNumPrimes, numPrimes);
    return numPrimes;
  }, $"Thread {parm.threadNum} processing {parm.start} to {parm.end}");
}

The results:

Thread 0 processing 2 to 125000
Number of primes is : 11734
Total seconds = 3.8519907
Thread 1 processing 125001 to 250000
Number of primes is : 10310
Total seconds = 9.0879819
Thread 2 processing 250001 to 375000
Number of primes is : 9860
Total seconds = 12.963975
Thread 3 processing 375001 to 500000
Number of primes is : 9634
Total seconds = 16.4079704
Threaded brute force:
Number of primes is : 41538
Total seconds = 16.4119713

OK, cool, we've taken a 35 second process and reduced it to 16 seconds. But notice that the workload is not evenly distributed. The threads take different times. The reason should be obvious -- the larger the number we're trying to determine is prime or not, the more divisions need to be executed. So for numbers at the lower range, the thread finishes faster:

Thread 0 processing 2 to 125000
Number of primes is : 11734
Total seconds = 3.8519907

vs:

Thread 3 processing 375001 to 500000
Number of primes is : 9634
Total seconds = 16.4079704

Thread Joins

Notice once the threads have been started, there is this statement:

threads.ForEach(t => t.thread.Join());

Here, the Join method suspends execution until the thread on which Join is called has finished.

Be aware:

This can cause the current thread to indefinitely suspend its operation.
You wouldn't do this on a UI thread as the UI will no longer respond to user actions.

Background Threads

Notice this statement:

thread.IsBackground = true;

Telling the thread that it is a background thread ensures that it is killed when the application exits.

If you have a thread that is not a background thread and you close the application or it in some other way terminates, the thread will continue to exist (and run) as a process.

Thread Parameters

Notice this statement:

var thread = new Thread(new ParameterizedThreadStart(BruteForceThread));

Here, we are setting up the method that implements the thread to accept parameters. The signature must be [methodName](object parameters), which requires that you cast the object to the same type being passed in when the thread is started:

Starting the thread (I'm using a value tuple as the parameter):

threads.ForEach(t => t.thread.Start((t.threadNum, t.start, t.end)));

Casting the object (a value tuple in this case) in the method implementing the thread:

(int threadNum, int start, int end) parm = (ValueTuple<int, int, int>)parms;

Avoiding Casting the Object Parameter With a Lambda Expression

You can avoid the cast using an lambda expression which provides closure for the current parameter values, like this:

for (int i = 0; i < numProcs; i++)
{
  int j = i;
  int start = Math.Max(1, i * (MAX / numProcs)) + 1;
  int end = (i + 1) * (MAX / numProcs);
  var thread = new Thread(() => BruteForceThread(j, start, end));
  thread.IsBackground = true;
  threads.Add((thread, i, start, end));
}

Notice int j = i; is required for the closure -- otherwise, the thread number (not to be confused with the thread ID) is always the number of processors (4 in my examples, usually.)

Also notice that this works but is not necessary:

var thread = new Thread(new ThreadStart(() => BruteForceThread(j, start, end)));

Why? Because ThreadStart is defined as a delegate: public delegate void ThreadStart();so a lambda expression is perfectly valid.

The method implementing the thread now can accept parameters in the method signature, avoiding the cast:

static void BruteForceThread(int threadNum, int start, int end)

Using the ThreadStart Delegate is Time Costly!

These two lines of code are NOT equivalent:

var thread = new Thread(() => BruteForceThread(j, start, end));

vs.

var thread = new Thread(new ThreadStart(() => BruteForceThread(j, start, end)));

Notice the timing difference in the second version:

Thread 0 processing 2 to 125000
Number of primes is : 11734
Total seconds = 4.3607559
Thread 1 processing 125001 to 250000
Number of primes is : 10310
Total seconds = 10.5416041
Thread 2 processing 250001 to 375000
Number of primes is : 9860
Total seconds = 14.9788427
Thread 3 processing 375001 to 500000
Number of primes is : 9634
Total seconds = 19.2458287
Threaded brute force:
Number of primes is : 41538
Total seconds = 19.2458287

Roughly:

thread 0 takes .5 seconds longer
thread 1 and 2 take 1.5 seconds longer
thread 3 takes almost 3 seconds longer

This isn't just an anomaly, this is a consistent repeatable difference. I have not found any discussion of why this is the case.

Balancing the Brute Force Algorithm Threads

Notice that the worker threads are not very balanced -- the work load looks like it is evenly distributed because we're giving each thread 1/n^th (where n is the number of processors) of the work, but due to the nature of the work, CPU cores are not efficiently used -- the thread with the lowest number range finishes much sooner than the thread with the highest number range.

Asking for Work vs. Telling the Thread What Work to Do

A clue as to whether your threads are optimized is this: are you telling your threads what work to do, or are your threads requesting work? In the threaded brute force algorithm above, we are telling the thread what work it should be doing. In this next iteration, the thread will be asking for work when it is available.

Calling the algorithm:

DurationOf(ThreadedGetNextWorkItemBruteForce, "Threaded get next work item brute force:");

The implementation:

static int ThreadedGetNextWorkItemBruteForce()
{
  List<(Thread thread, int threadNum)> threads = new List<(Thread thread, int threadNum)>();
  int numProcs = Environment.ProcessorCount;

  for (int i = 0; i < numProcs; i++)
  {
    var thread = new Thread(new ParameterizedThreadStart(NextWorkItemBruteForceThread));
    thread.IsBackground = true;
    threads.Add((thread, i));
  }

  totalNumPrimes = 0;
  nextNumber = 1;
  threads.ForEach(t => t.thread.Start(t.threadNum));
  threads.ForEach(t => t.thread.Join());

  return totalNumPrimes;
}

The worker thread:

static void NextWorkItemBruteForceThread(object parms)
{
  int threadNum = (int)parms;
  DurationOf(() =>
  {
    int numPrimes = 0;
    int n;

    while ((n = Interlocked.Increment(ref nextNumber)) < MAX)
    {
      if (IsPrime(n))
      {
        ++numPrimes;
      }
    }

    Interlocked.Add(ref totalNumPrimes, numPrimes);

    return numPrimes;
  }, $"Thread: {threadNum}");
}

The results:

Thread: 3
Number of primes is : 10446
Total seconds = 13.2079996
Thread: 2
Number of primes is : 10378
Total seconds = 13.2079996
Thread: 0
Number of primes is : 10437
Total seconds = 13.2079996
Thread: 1
Number of primes is : 10277
Total seconds = 13.2079996
Threaded get next work item brute force:
Number of primes is : 41538
Total seconds = 13.2079996

Notice now that each thread actually runs for exactly the same amount of time (that was actually a fluke of that particular test run, they do vary ever so slightly.) Also notice we're using our threads efficiently now -- we've shaved 6 seconds off the processing time because each core is fully utilized, and the core utilization is much more efficient.

Interlocked and Atomic

Notice these two lines:

while ((n = Interlocked.Increment(ref nextNumber)) < MAX)
...
Interlocked.Add(ref totalNumPrimes, numPrimes);

The way this worker thread requests work is to simply get the next number to process -- there's no queuing of work, semaphores, or other complexity. However, in order to get the next number and update the total count, each thread must ensure that it momentarily blocks the other threads. Otherwise, another thread might get exactly the same number or the total might be updated simultaneously, resulting in an incorrect count.

.NET has an Interlocked class (read more here) that ensures that the operation is performed atomically, meaning that, even if several bytes of data are being changed, the operation is treated as a single, synchronous, change. Essentially, it's like writing a lock statement around the operation, however it is much more performant because the generated IL code (and ultimately the assembly code) can take advantage of CPU instructions to exchange the value in memory because you're doing something very specific. Using a lock statement, the compiler has no idea what you're really doing and can't optimize the code for you. In fact, the methods that Interlocked implements closely match the actual Intel CPU instructions¹ that are atomic and can therefore be locked.

More "Modern" Ways of Working with Threads

Nowadays, the junior programmer probably isn't even instructed regarding the Thread class and is instead taught one or more of the various other ways to fire off a thread. Learning about the underlying Thread class is however important because there are situations when you absolutely want to manage the thread yourself rather than letting the .NET Framework manage the thread for you, particularly threads that interact with .NET's ThreadPool (more on that later.) Here are the most common options:

AsParallel - a PLINQ (Parallel LINQ) extension on an Enumerable
Task.Run - forcing an asynchronous method
Task.Factory.StartNew⁸ - equivalent to Task.Run but with some defaults:
- no cancellation token
- children cannot attach
- default task scheduler
await - calling an asynchronous method with a continuation
QueueUserWorkItem - queues work
BackgroundWorker

Most of these approaches use the .NET's ThreadPool with the potential exception of the "await" syntax in which you have more control over how the task is set up. Each also has its nuances regarding:

exception handling
cancellation tokens
progress reporting
best practices with ThreadPool

One really needs to understand these nuances and how they affect performance and error handling.

AsParallel

AsParallel is one of several query extension methods in the Parallel LINQ (PLINQ) library. I'm not going to get into the details of the various methods in the ParallelEnumerable class, instead let's look at just AsParallel.

Calling the algorithm:

DurationOf(AsParallelGetNextWorkItemBruteForce, "AsParallel get next work item brute force:");

The implementation:

static int AsParallelGetNextWorkItemBruteForce()
{
  int totalNumPrimes = 0;

  var nums = Enumerable.Range(2, MAX);
  nums.AsParallel().ForAll(n =>
  {
    if (IsPrime(n))
    {
      Interlocked.Increment(ref totalNumPrimes);
    }
  });

  return totalNumPrimes;
}

Note the code nums.AsParallel().ForAll, which attempts, for each item in the enumerable, to execute in parallel the action action declared in ForAll.

The results (don't compare this with the results above, currently I'm on a VM that seems to run faster than my laptop natively):

AsParallel get next work item brute force:
Number of primes is : 41538
Total seconds = 11.7537395

Compare with the balanced brute force thread timing:

Thread: 1
Number of primes is : 10326
Total seconds = 11.7124344
Thread: 2
Number of primes is : 10517
Total seconds = 11.7124344
Thread: 0
Number of primes is : 10502
Total seconds = 11.7124344
Thread: 3
Number of primes is : 10193
Total seconds = 11.7134212
Threaded get next work item brute force:
Number of primes is : 41538
Total seconds = 11.7304613

So that's neat -- the timing is essentially identical.

Keep in mind that AsParallel.ForAll is a synchronous operation -- code execution continues after all the numbers have been processed. While the processing itself is asynchronous, you application will be waiting for the ForAll to finish. Mixing AsParallel with awaitable tasks (discussed next) is rather pointless.

Task.Run

Task.Run is one of the ways to implement a task-based asynchronous pattern³ (TAP) quickly to create a worker thread. This is an example of what is called a "compute-bound" task, meaning that the task is performed by the CPU rather than the program waiting for a device (like a fingerprint reader) or connection (to a database, for example) to return a result. It's important to understand the difference between compute-bound io-bound tasks², particularly with regards to threads managed by the ThreadPool, which should not block for large periods of time.

Calling the algorithm:

DurationOf(TaskRunGetNextWorkItemBruteForce, "Task.Run get next work item brute force:");

The implementation:

static int TaskRunGetNextWorkItemBruteForce()
{
  int numProcs = Environment.ProcessorCount;
  totalNumPrimes = 0;
  nextNumber = 1;
  List<Task> tasks = new List<Task>();

  for (int i = 0; i < numProcs; i++)
  {
    var task = Task.Run(() => NextWorkItemBruteForceThread(i));
    tasks.Add(task);
  }

  Task.WaitAll(tasks.ToArray());

  return totalNumPrimes;
}

Notice the Task.WaitAll, which is like thread Join -- the current thread blocks until all tasks are complete. As with Join, use with caution! Also notice that we can call the same method, NextWorkItemBruteForceThread(i), that performs the computation as we do in the balanced brute force thread setup routine.

The results (again don't compare with the previous results, I'm running on a different machine at the moment) - these compare quite well with AsParallel and the balanced brute force algorithm.

Thread: 4
Number of primes is : 10369
Total seconds = 11.9374993
Thread: 4
Number of primes is : 10351
Total seconds = 11.9374993
Thread: 4
Number of primes is : 10293
Total seconds = 11.9374993
Thread: 4
Number of primes is : 10525
Total seconds = 11.9374993
Task.Run get next work item brute force:
Number of primes is : 41538
Total seconds = 11.953198

Using await

This is a very simple example in which the above code is modified slightly to use the await keyword. Using await (along with the async keyword) has the following effect:

The method initiates the asynchronous process.
It then immediately returns to the caller.
When the asynchronous process completes, code execution continues with the code immediately after the await statement.

There's a few "gotchas" in this. First, it is useful to be aware of the context in which the continuation is executed. For WinForm applications in which the await was called on the application thread (the thread running the UI), execution is marshaled to continue on the application (UI) thread. If you're running a non-UI application, the continuation can occur on the same thread to which the asynchronous method was assigned, or a new thread! This behavior is controlled by the SynchronizationContext^5,6 and discussion of this class is beyond the scope of this article. There's a good Code Project article series on the topic: Part I, Part II, Part III.

Second, I find the mental gyrations of "return to caller and continue later" to be difficult. The reason for this is that it's not obvious to the method performing the call to an async method will return immediately. This is why the guidance on these method names is to append them with "Async".

Third, the mental gyrations get even harder when there are nested await calls.

Conversely, the Task class provides some really useful features, particularly with regards to returning a result and passing exceptions back to you if an exception occurs while executing the task.

First, the implementation for the very simple example:

static int TaskAwaitGetNextWorkItemBruteForce()
{
  int numProcs = Environment.ProcessorCount;
  totalNumPrimes = 0;
  nextNumber = 1;
  List<Task> tasks = new List<Task>();

  for (int i = 0; i < numProcs; i++)
  {
    var task = DoWorkAsync(i);
    tasks.Add(task);
  }

  Task.WaitAll(tasks.ToArray());
  
  return totalNumPrimes;
}

static async Task DoWorkAsync(int threadNum)
{
  await Task.Run(() => NextWorkItemBruteForceThread(threadNum));
}

Here, DoWorkAsync initiates the work and the await returns to the caller, TaskAwaitGetNextWorkItemBruteForce. This example is over-simplified because there is no continuation. Also notice that DoWorkAsync returns a Task but there's no explicit return myTask; statement. An async method can return a Task<TResult>, Task, void, or (C# 7) a type implementing a public GetAwaiter method⁷.

Let's look at a more useful example next -- here, you'll see that we can remove the global variables that NextWorkItemBruteForceThread uses and instead return the result of the algorithm in the TResult generic type. This also removes the Interlocked call: Interlocked.Add(ref totalNumPrimes, numPrimes) which greatly improves the compartmentalization of the thread. Instead, we'll compute the total number of primes by summing the primes found by each asynchronous task.

The algorithm implementation:

static int AwaitableBruteForceAlgorithm(object parms)
{
  int threadNum = (int)parms;
  int numPrimes = 0;

  DurationOf(() =>
  {
    int n;

    while ((n = Interlocked.Increment(ref nextNumber)) < MAX)
    {
      if (IsPrime(n))
      {
        ++numPrimes;
      }
    }

    return numPrimes;
  }, $"Thread: {threadNum}");

  return numPrimes;
}

The implementation:

static int TaskAwaitGetNextWorkItemBruteForceWithReturn()
{
  int numProcs = Environment.ProcessorCount;
  nextNumber = 1;
  List<Task<int>> tasks = new List<Task<int>>();

  for (int i = 0; i < numProcs; i++)
  {
    var task = DoWorkWithReturnAsync(i);
    tasks.Add(task);
  }

  Task.WaitAll(tasks.ToArray());

  return tasks.Sum(t => t.Result);
}

static async Task<int> DoWorkWithReturnAsync(int threadNum)
{
  return await Task.Run(() => AwaitableBruteForceAlgorithm(threadNum));
}

Notice here that the DoWorkWithReturnAsync method now requires a return keyword! Also, we're no longer using the global totalNumPrimes, as the results of all the tasks, once their complete, is summed and returned.

Let's make this example a wee bit more useful by adding a continuation that indicates that the thread is complete, and see what happens.

The implementation:

static int TaskAwaitGetNextWorkItemBruteForceWithReturnAndContinuation()
{
  int numProcs = Environment.ProcessorCount;
  nextNumber = 1;
  List<Task<int>> tasks = new List<Task<int>>();

  for (int i = 0; i < numProcs; i++)
  {
   Console.WriteLine("Starting thread " + i + " at " + 
                    (DateTime.Now - start).TotalMilliseconds + " ms");
   var task = DoWorkWithReturnAsyncAndContinuation(i);
    tasks.Add(task);
  }

  Task.WaitAll(tasks.ToArray());

    return tasks.Sum(t => t.Result);
}

static async Task<int> DoWorkWithReturnAsyncAndContinuation(int threadNum)
{
  var t = await Task.Run(() => AwaitableBruteForceAlgorithm(threadNum));

  lock (locker)
  {
    Console.WriteLine("Thread number " + threadNum + " finished.");
  }

  return t;
}

The results:

Starting thread 0 at 0 ms
Starting thread 1 at 0 ms
Starting thread 2 at 0 ms
Starting thread 3 at 0 ms
Thread: 3
Number of primes is : 10431
Total seconds = 13.4916976
Thread number 3 finished.
Thread: 2
Number of primes is : 10205
Total seconds = 13.4916976
Thread number 2 finished.
Thread: 0
Number of primes is : 10442
Total seconds = 13.4916976
Thread number 0 finished.
Thread: 1
Number of primes is : 10460
Total seconds = 13.4916976
Thread number 1 finished.
await Task.Run get next work item with return brute force and continuation:
Number of primes is : 41538
Total seconds = 13.5073256

So this code should strike you as non-intuitive:

static async Task<int> DoWorkWithReturnAsyncAndContinuation(int threadNum)
{
  var t = await Task.Run(() => AwaitableBruteForceAlgorithm(threadNum));

  lock (locker)
  {
    Console.WriteLine("Thread number " + threadNum + " finished.");
  }

  return t;
}

The await should return to the caller. It does -- we know this because we note that the tasks are created all at once -- there is no delay between the start times. But what does the method return? It returns a Task<int>, and the return statement is after the continuation. It's not part of the continuation which, when you think about it, can't return anything anyways! Furthermore, t, when used inside the continuation is the return (an int) of the method we're awaiting upon. So if we wanted to output the count of primes: we would write:

Console.WriteLine("Count = " + t);

and not t.Result. But in the caller that receives Task<int>, we have to use t.Result. Wow. Just Wow, I say. And yes, if you're task creates another task, the outer return type is Task<Task>. If it's a three layer nested task, the return type will be Task<Task<Task>>>. See what I mean by mind warping?

Console output (your mileage will vary):

Starting thread 0 at 0 ms
Starting thread 1 at 6.3902 ms
Starting thread 2 at 6.3902 ms
Starting thread 3 at 6.3902 ms
Thread: 0
Number of primes is : 10428
Total seconds = 13.3906041
Thread number 0 finished.
Count = 10428
Thread: 2
Number of primes is : 10270
Total seconds = 13.3916069
Thread: 3
Number of primes is : 10382
Total seconds = 13.3916069
Thread number 3 finished.
Count = 10382
Thread: 1
Number of primes is : 10458
Total seconds = 13.3916069
Thread number 1 finished.
Count = 10458
Thread number 2 finished.
Count = 10270
await Task.Run get next work item with return brute force and continuation:
Number of primes is : 41538
Total seconds = 13.4070216

Using await in WinForm Applications

In the above console application, DurationOf is a thread blocking method, waiting until all the tasks complete. This is completely unsuitable in a WinForm application:

Blocking the main UI thread prevents the application's message pump from processed Windows messages.
Calls to Invoke for performing a UI update from a non-UI thread use the application's message pump to marshal the call onto the main UI thread.
When the main UI thread is blocked, messages aren't processed, and a deadlock occurs between the blocked UI thread and the thread requesting the Invoke.
Calls to BeginInvoke perform the UI thread asynchronously, so should not deadlock.

However:

The whole point is that it is desirable to use await and tasks without having to call Invoke or BeginInvoke.
The await should return control to the UI thread.
The only time we would need to use Invoke or BeginInvoke is if we're updating the UI in the non-UI thread, which ought to be avoided if possible.
And most importantly, the main UI thread should not be blocked!

To achieve this, special care also must be taken with regards to how async methods are nested, so that the await returns immediately to the desired caller. I consider this an important point -- using async and await requires that your method call hierarchy has the correct structure to handle the immediate return to the caller.

Task Initialization

We're going to start the threads as soon as the form is shown, but this cannot block the UI thread:

public Form1()
{
  InitializeComponent();
  Shown += OnFormShown;
}

private async void OnFormShown(object sender, EventArgs e)
{
  List<Task<int>> tasks = TaskAwaitGetNextWorkItemBruteForceWithReturnAndContinuation();
  await Task.WhenAll(tasks);
  int numPrimes = tasks.Sum(t => t.Result);
  tbOutput.AppendLine("Number of primes is : " + numPrimes);
}

In OnFormShown, the method that starts all the tasks (TaskAwaitGetNextWorkItemBruteForceWithReturnAndContinuation) returns those tasks. Then we wait until this tasks all complete with Task.WhenAll(tasks). We await this, which immediately exits out of the OnFormShown method -- our UI thread is NOT blocked! Note the difference between Task.WhenAll vs. Task.WaitAll! Once the tasks complete, the continuation executes, summing the results of each task and displaying the total number of primes found. Because the synchronization context is a WinForm application, then continuation is itself marshaled back onto the main UI thread, so when we append the line to tbOutput, we don't need to do the marshalling ourselves with Invoke or BeginInvoke.

Creating Each Task

protected List<Task<int>> TaskAwaitGetNextWorkItemBruteForceWithReturnAndContinuation()
{
  int numProcs = Environment.ProcessorCount;
  nextNumber = 1;
  List<Task<int>> tasks = new List<Task<int>>();
  DateTime start = DateTime.Now;

  for (int i = 0; i < numProcs; i++)
  {
    tbOutput.AppendLine("Starting thread " + i + " at " + 
                       (DateTime.Now - start).TotalMilliseconds + " ms");
    var task = DoWorkWithReturnAsyncAndContinuation(i);
    tasks.Add(task);
  }

  return tasks;
}

Notice that this method does not actually create the tasks. It instead calls DoWorkWithReturnAsyncAndContinuation. The reason is so that the await in DoWorkWithReturnAsyncAndContinuation returns to the caller, which is the code shown above, so that the next thread can be created. We can't even write the method with an embedded Task.Run():

And even if we could, the await would return immediately to the caller. When the task completed, the continuation would execute the next iteration of the for loop and start the next thread. Since no other threads were picking up work, all the work would have been done by the first thread!

Instead, it is the DoWorkWithReturnAsyncAndContinuation method that actually gets around to creating the task:

protected async Task<int> DoWorkWithReturnAsyncAndContinuation(int threadNum)
{
  DateTime start = DateTime.Now;
  var t = await Task.Run(() => BruteForceAlgorithm(threadNum)); //.ConfigureAwait(true);

  DateTime stop = DateTime.Now;

  tbOutput.AppendLine("Continuation: Thread number " + threadNum + " finished.");
  tbOutput.AppendLine("Total seconds = " + (stop - start).TotalSeconds);
  tbOutput.AppendLine("Continuation: Count = " + t);

  return t;
}

Because Task.Run is being called from the application's UI thread, the continuation is executed on the "captured context" - the UI thread. This is equivalent to adding .ConfigureAwait(true). If we used .ConfigureAwait(false), the continuation may or may not execute on the captured context thread, in which case the UI outputs would have to be marshaled with Invoke or BeginInvoke. It's also worth pointing out that we don't need to lock the output to the UI to ensure that the lines are written in the correct sequence. Because we're continuing on the captured context (the UI thread) and there's only ever one UI thread, we're guaranteed that all three AppendLine calls occur contiguously, regardless of whether another task finishes in the meantime. That task's continuation will be blocked until our continuation exits.

Semaphores and Queuing Work

A vital aspect of threads is how the thread picks up work. The above examples all use a simple mechanism to get the next integer to test for "primality":

while ((n = Interlocked.Increment(ref nextNumber)) < MAX)

In the real world, this is usually unrealistic. A thread usually is instantiated:

To perform some task asynchronously and when the task is complete, the thread exists.
To wait for work to appear in a queue, which it then processes and after processing, goes back to sleep.

The second form, waiting for work to appear, often involves using a semaphore (from railroad signals⁹) to indicate that work is ready. The complexity created here is terminating the thread when the application no longer needs to the thread to do work. Semaphores can be used with "old style" threads as well as the more modern techniques for creating threads. We'll use semaphores with the Task class in the following examples. These examples also introduce the ConcurrentQueue class found in the System.Collections.Concurrent namespace. Here's the code:

static void UsingSemaphores()
{
  Semaphore sem = new Semaphore(0, Int32.MaxValue);
  int numProcs = Environment.ProcessorCount;
  var queue = new ConcurrentQueue<int>();
  int numPrimes = 0;

  for (int i = 0; i < numProcs; i++)
  {
    Task.Run(() =>
    {
      while (true)
      {
        sem.WaitOne();

        if (queue.TryDequeue(out int n))
        {
          if( IsPrime(n))
          {
            Interlocked.Increment(ref numPrimes);
          }
        }
      }
    });
  }

  DurationOf(() =>
  {
    Enumerable.Range(2, MAX).ForEach(n =>
    {
      queue.Enqueue(n);
      sem.Release();
    });

    while (!queue.IsEmpty) Thread.Sleep(1);

    return numPrimes;
  }, "Threads using semaphores");
}

There are three sections to the above method. The reason I combined them into a single method is for convenience of the closures. The first section, instantiating the tasks, has two important lines:

...
sem.WaitOne();

if (queue.TryDequeue(out int n))
...

Here, the thread is suspended until the semaphore signals that there is work to be done. Once signaled, the work (in this case, the next number to process) is dequeued. In the ConcurrentQueue, TryDequeue is the only dequeuing method available, which ensures that no other thread has removed the queued item. With the semaphore, only one thread is released, so this is not an issue here, but it would be an issue if there is no synchronization of dequeuing work between threads.

Next, the work is queued. One can release the semaphore one at a time, or, after queuing the work, the semaphore can be given a release count. In the example, I opted for the "one at a time" approach because I find it more typical:

Enumerable.Range(2, MAX).ForEach(n =>
{
  queue.Enqueue(n);
  sem.Release();
});

The third section waits for the queue to be empty -- this is really bad practice, but we want to discuss why it's wrong, so this is a teaching example:

while (!queue.IsEmpty) Thread.Sleep(1);

Ideally, putting a thread to sleep while we spin waiting for something to happen, is a really bad idea. Instead, completion of a task should be signaled using a semaphore or a task "wait" call, but for now we'll use this code.

One of the complexities of working with semaphores in threads (as opposed to awaitable async methods) is that while you can easily enqueue the work for the thread to pick up, you don't really know when the threads have finished their work. For example:

Notice the second computation of the number of primes is short by 4. This is because, while the queue is empty, indicating that all work has been done, the threads hadn't actually finished processing!

As mentioned earlier, another problem is telling the thread, which is waiting for work, that it should terminate because the application doesn't need it anymore. We can deal with both issues by enqueuing a 0 as a poor-man's flag to indicate that the thread should terminate. But we have to do this for the number of threads that we created! First, we keep track of the tasks:

...
List<Task> tasks = new List<Task>();

for (int i = 0; i < numProcs; i++)
{
  tasks.Add(Task.Run(() =>
  ...

Second, when a 0 is dequeued, the task exits:

...
if (queue.TryDequeue(out int n))
{
  if (n == 0)
  {
    break;
  }
...

Third, we enqueue the work of "0" for each task and release the semaphore for that count -- this works because the thread exits, so the each thread will receive exactly 1 "0" to indicate that it should terminate:

for (int i = 0; i < numProcs; i++)
{
  queue.Enqueue(0);
}

sem.Release(numProcs);

Lastly, we wait until the threads complete:

Task.WaitAll(tasks.ToArray());

Now we have successfully written the code based on when the work is done rather than when the queue is empty of work to be done. A very important difference. Also note that we could have written very similar code using pure threads and Thread.Join.

Mutexes, Locks, and Semaphores

It's also important to understand the difference between the Mutex¹⁰ class and the lock statement. While a mutex (mutual-exclusion) might appear to be the same as a lock statement, there are a couple important differences:

a mutex can have a timeout
a mutex can cross application boundaries

This differs from a lock statement which cannot cross application boundaries and does not support a timeout. In contrast with a semaphore, a mutex enforces thread identity, meaning that a mutex can only be released by the thread that acquired it. With a semaphore, any thread can Release the semaphore for the current or a different thread to be released from the WaitOne call and continue execution. We can use a mutex instead of the lock statement here:

static void DurationOf(Func<int> action, string section)
{
  var start = DateTime.Now;
  int numPrimes = action();
  var stop = DateTime.Now;

  lock (locker)
  {
    Console.WriteLine(section);
    Console.WriteLine("Number of primes is : " + numPrimes);
    Console.WriteLine("Total seconds = " + (stop - start).TotalSeconds);
  }
}

The change would look like this:

...
static Mutex mutex = new Mutex();
...
static void DurationOf(Func<int> action, string section)
{
  var start = DateTime.Now;
  int numPrimes = action();
  var stop = DateTime.Now;

  mutex.WaitOne();
  Console.WriteLine(section);
  Console.WriteLine("Number of primes is : " + numPrimes);
  Console.WriteLine("Total seconds = " + (stop - start).TotalSeconds);
  mutex.ReleaseMutex();
}

Mutexes are used to ensure the mutual exclusion of the same code, whereas a semaphore is typically used to wait for work and to process work simultaneously by the same code in different threads. Also, as you can see from the above code, which demonstrates a very simple mutual exclusion, a lock statement is superior because it doesn't require instantiating a global mutex.

Exception Handling

While I'd love to claim that my threads never throw exceptions, the reality is that exception handling has to be dealt with.

Exceptions Handling and the Thread Class

For giggles, let's change the brute force method to throw an exception if the number is prime!

static void NextWorkItemBruteForceThreadThrowsException(object parms)
{
  int threadNum = (int)parms;
  DurationOf(() =>
  {
    int n;

    while ((n = Interlocked.Increment(ref nextNumber)) < MAX)
    {
      if (IsPrime(n))
      {
        throw new Exception("Number is prime: " + n);
      }
    }

    return 0;
  }, $"Thread: {threadNum}");
}

Running this in our console tests, we get this ugly mess:

What we want to do though is provide a way to handle the exception more gracefully. Implementing UnhandledException:

Thread.GetDomain().UnhandledException += (sndr, exargs) =>
{
  Console.WriteLine("Thread: " + (exargs.ExceptionObject as Exception)?.Message);
};

AppDomain.CurrentDomain.UnhandledException += (sndr, exargs) =>
{
  Console.WriteLine("AppDomain: " + (exargs.ExceptionObject as Exception)?.Message);
};

isn't really what we want because the console still logs the entire stack trace and more importantly, the uncaught exception in the thread will terminate the application, which probably isn't what we want to happen. A better solution is the wrap the thread method in a try-catch block and provide an exception handler:

static void SafeThread(Action action)
{
  try
  {
    action();
  }
  catch (Exception ex)
  {
    Console.WriteLine("Exception: " + ex.Message);
  }
}

static void ThreadExceptionExample()
{
  List<(Thread thread, int threadNum)> threads = new List<(Thread thread, int threadNum)>();
  int numProcs = Environment.ProcessorCount;

  for (int i = 0; i < numProcs; i++)
  {
    // We changed the signature of NextWorkItemBruteForceThreadThrowsException slightly:
    // And we're no longer using a parameterized thread start because we have closure on i.
    var thread = new Thread(new ThreadStart(() => SafeThread(() => 
                            NextWorkItemBruteForceThreadThrowsException(i))));
    thread.IsBackground = true;
    threads.Add((thread, i));
  }

  totalNumPrimes = 0;
  nextNumber = 1;
  threads.ForEach(t => t.thread.Start());
  threads.ForEach(t => t.thread.Join());
}

Now we get:

Notice that the console application also doesn't abruptly terminate.

Exception Handling with Task

Here's the code that sets up running tasks that throws an exception when the number is prime:

static int TaskAwaitGetNextWorkItemBruteForceThrowsException()
{
  int numProcs = Environment.ProcessorCount;
  totalNumPrimes = 0;
  nextNumber = 1;
  List<Task> tasks = new List<Task>();

  for (int i = 0; i < numProcs; i++)
  {
    var task = DoWorkAsyncThrowsException(i);
    tasks.Add(task);
  }

  Task.WaitAll(tasks.ToArray());

  return totalNumPrimes;
}

static async Task DoWorkAsyncThrowsException(int threadNum)
{
  await Task.Run(() => NextWorkItemBruteForceThreadThrowsException(threadNum));
}

This results in the usual ugly console log and the console application terminates:

If we do this however:

static async Task DoWorkAsyncThrowsException(int threadNum)
{
  try
  {
     await Task.Run(() => NextWorkItemBruteForceThreadThrowsException(threadNum));
  }
  catch (Exception ex)
  {
    Console.WriteLine(ex.Message);
  }
}

We can now handle the exception and the application doesn't terminate:

If you use the one of the Task.Wait methods, the exception being thrown by the awaited task is actually very sophisticated. Read more about it here "Exception Handling (Task Parallel Library)". Changing our example slightly:

static int TaskAwaitGetNextWorkItemBruteForceThrowsException()
{
  int numProcs = Environment.ProcessorCount;
  totalNumPrimes = 0;
  nextNumber = 1;
  List<Task> tasks = new List<Task>();

  for (int i = 0; i < numProcs; i++)
  {
    var task = DoWorkAsyncThrowsException(i);
    tasks.Add(task);
  }

  try
  {
    Task.WaitAll(tasks.ToArray());
  }
  catch (AggregateException ex)
  {
    Console.WriteLine(ex.Message);

    tasks.ForEachWithIndex((t, i) =>
    {
      Console.WriteLine("Task " + i);
      Console.WriteLine("Is canceled: " + t.IsCanceled);
      Console.WriteLine("Is completed: " + t.IsCompleted);
      Console.WriteLine("Is faulted: " + t.IsFaulted);
    });
  }

  return totalNumPrimes;
}

static async Task DoWorkAsyncThrowsException(int threadNum)
{
  await Task.Run(() => NextWorkItemBruteForceThreadThrowsException(threadNum));
}

Notice the output now, after moving the try-catch block to the Wait call:

Do NOT use async void!

Methods declared as "async void" do not have a Task object -- exceptions thrown in the "task" will be raised directly on the SynchronizationContext. In a console app, that means we can't catch the exception, instead it's handled by the AppDomain.CurrentDomain.UnhandledException. Here's an example:

static async void ThrowExceptionAsync()
{
  await Task.Run(() => NextWorkItemBruteForceThreadThrowsException(0));
}

static void AsyncVoidExceptionTest()
{
  try
  {
    ThrowExceptionAsync();
  }
  catch (Exception ex)
  {
    Console.WriteLine("AsyncVoidExceptionTest: " + ex.Message);
  }
}

Notice the output:

The task exception was handled by the AppDomain handler! Contrast the exception handling with this code:

static async Task<int> ThrowExceptionAsync()
{
  await Task.Run(() => NextWorkItemBruteForceThreadThrowsException(0));

  return 0;
}

static async Task<int> AsyncVoidExceptionTest()
{
  try
  {
    await ThrowExceptionAsync();
  }
  catch (Exception ex)
  {
    Console.WriteLine("AsyncVoidExceptionTest: " + ex.Message);
  }

  return 0;
}

and the output:

Here, the exception is being handled by our outer await. What throws (no pun intended) everyone for a loop is that when a method is declared as a non-void async method, it must be awaited upon:

But awaiting on a task means the method has to return a Task or be declared as async void, which is exactly what we're trying to NOT do! This can result in a cascade of refactoring parent methods to await and have the signature of async Task<T>, all the way up to Main, which now can be declared as async void! If you find yourself doing this, you're doing something wrong with the structure of your async methods. In the test case presented here, I stopped the cascade by calling, in Main:

Task.Run(() => AsyncVoidExceptionTest());

Conversely, we could do this:

static async Task<int> Main(string[] args)
{
  await AsyncVoidExceptionTest();
  return 0;
}

but not this:

static async void Main(string[] args)

which results in a compiler error: "error CS4009: A void or int returning entry point cannot be async"

The point being, avoid (no pun intended again) async void, but really look carefully at how your async tasks are being called so you don't end up refactoring every method signature in the call hierarchy, all the way up to Main!

Awaitable Threads - A Hybrid Approach using TaskCompletionSource

What if you want to use threads that you create yourself rather than being married to the thread pool, but you still want to take advantage of the ability to await on the thread? This is accomplished using the TaskCompletionSource class. This class wraps a Task instance that you can use exactly as if the task had been created using Task.Run or similar task creation methods. In the following example, we set up a TaskCompletionSource for each thread and pass the TaskCompletionSource instance to that thread:

static int HybridAwaitableThread()
{
  List<(Thread thread, int threadNum)> threads = new List<(Thread thread, int threadNum)>();
  List<TaskCompletionSource<int>> tasks = new List<TaskCompletionSource<int>>();
  int numProcs = Environment.ProcessorCount;

  for (int i = 0; i < numProcs; i++)
  {
    var thread = new Thread(new ParameterizedThreadStart(HybridThread));
    thread.IsBackground = true;
    threads.Add((thread, i));
    tasks.Add(new TaskCompletionSource<int>());
  }

  nextNumber = 1;
  threads.ForEachWithIndex((t, idx) => t.thread.Start((t.threadNum, tasks[idx])));
  Task.WaitAll(tasks.Select(t=>t.Task).ToArray());

  return tasks.Sum(t=>t.Task.Result);
}

Now, while the above code is using the blocking Task.WaitAll for timing purposes, it would be perfectly valid in a different scenario (for example, a Windows app) to also call await Task.WhenAll(tasks.Select(t=>t.Task)) so that awaiting for the threads to complete isn't blocking.

Notice that this line:

threads.ForEachWithIndex((t, idx) => t.thread.Start((t.threadNum, tasks[idx])));

passes in a tuple including the TaskCompletionSource instance. In the thread implementation, we use SetResult when the thread is complete, "signaling" that the task is done:

static void HybridThread(object parms)
{
  (int threadNum, TaskCompletionSource<int> tcs) parm = 
                      (ValueTuple<int, TaskCompletionSource<int>>)parms;

  DurationOf(() =>
  {
    int numPrimes = 0;
    int n;

    while ((n = Interlocked.Increment(ref nextNumber)) < MAX)
    {
      if (IsPrime(n))
      {
        ++numPrimes;
      }
    }

    parm.tcs.SetResult(numPrimes);
    return numPrimes;
  }, $"Thread: {parm.threadNum}");
}

The salient line here is parm.tcs.SetResult(numPrimes); which completes the task.

Canceling Threads and Tasks

Canceling a thread or task is a complicated subject^16,17, the essence of which is described here. The first thing to determine is what it is you're trying to cancel:

A CPU-bound thread that doesn't wait on any mutexes, semaphores, or other "delays"
An I/O-bound thread that waits on the completion of an asynchronous I/O operation
A thread that waits on a mutex or semaphore to release it

All three approaches can take advantage of the CancellationTokenSource class, but how you use the token varies according to the above three options:

CPU-bound work: typically polls the cancellation token. You can choose:
1. Operation cancellation: Clean up and exit the thread
2. Object cancellation: Set up an object that implements a Cancel (or similar) method and the object-internal mechanism for canceling the work
I/O-bound work: This requires registering a callback that in turn cancels the async I/O operation.
Mutex/Semaphore: This requires setting up a wait handle that you pass to the mutex or sempahore. Canceling a thread in a wait state releases the thread and allows it to terminate.

The proper use of the OperationCanceledException is important so that your code interacts well with library code, and vice-versa. An important thing to understand is that cancellation is cooperative -- requesting a cancel does not mean that the listener of the token has to actually stop.

Lastly, tasks cannot be canceled if the code library implementing the task does not observe the CancellationToken! If your code implements a task, consider well whether you should be observing a cancellation token yourself!

For the following examples, let's say we want to set an upper limit of 1 second of processing for computing primes using our brute force algorithm. First off, we should create cancellation tokens for each operation. If performing object cancellation, one token can cancel all the required objects. Here, we'll focus on canceling operations, not objects.

Canceling Threads

Do not cancel a task by killing it with thread.Abort(). You run the risk of memory leaks, deadlocks and other nasty affects by terminating a thread without giving it the opportunity to do its cleanup. Instead, here, when we instantiate the threads, we pass a cancellation token for each thread to check:

static int CancelThreads()
{
  List<(Thread thread, int threadNum, CancellationTokenSource cts)> threads = 
    new List<(Thread thread, int threadNum, CancellationTokenSource cts)>();
  int numProcs = Environment.ProcessorCount;

  for (int i = 0; i < numProcs; i++)
  {
    var thread = new Thread(new ParameterizedThreadStart(CancellableThread));
    var cts = new CancellationTokenSource();
    thread.IsBackground = true;
    threads.Add((thread, i, cts));
  }

  totalNumPrimes = 0;
  nextNumber = 1;
  threads.ForEach(t => t.thread.Start((t.threadNum, t.cts)));

  // After 1 second, cancel our threads
  threads.ForEach(t => t.cts.CancelAfter(1000));
  threads.ForEach(t => t.thread.Join());

  return totalNumPrimes;
}

We then request that each thread is cancelled after 1 second. The worker thread checks its token for each iteration:

static void CancellableThread(object parms)
{
  (int threadNum, CancellationTokenSource cts) parm = (ValueTuple<int, CancellationTokenSource>)parms;

  DurationOf(() =>
  {
    int numPrimes = 0;
    int n;

    while ((n = Interlocked.Increment(ref nextNumber)) < MAX)
    {
      if (parm.cts.IsCancellationRequested)
      {
        break;
      }

      if (IsPrime(n))
      {
        ++numPrimes;
      }
    }

    Interlocked.Add(ref totalNumPrimes, numPrimes);
    return numPrimes;
  }, $"Thread: {parm.threadNum}");
}

In the screenshot above, notice that the threads don't terminate exactly at 1 second -- the brute force algorithm takes some time between each iteration!

Canceling Tasks

With tasks, we can exit the task:

gracefully
by calling the cancellation token's ThrowIfCancellationRequested method
by throwing OperationCanceledException ourselves

Let's look at the differences. Here's the setup common to all three cancellation examples:

static int CancelTasks()
{
  int numProcs = Environment.ProcessorCount;
  totalNumPrimes = 0;
  nextNumber = 1;
  List<(Task task, CancellationTokenSource cts)> tasks = 
                        new List<(Task, CancellationTokenSource)>();
  DateTime start = DateTime.Now;

  for (int i = 0; i < numProcs; i++)
  {
    Console.WriteLine("Starting thread " + i + " at " + 
                     (DateTime.Now - start).TotalMilliseconds + " ms");
    var cts = new CancellationTokenSource();
    var task = Task.Run(() => CancellableTask(i, cts), cts.Token);
    tasks.Add((task, cts));
  }

  tasks.ForEach(t => t.cts.CancelAfter(1000));

  try
  {
    Task.WaitAll(tasks.Select(t => t.task).ToArray());
  }
  catch (AggregateException ex)
  {
    Console.WriteLine(ex.Message);

    tasks.ForEachWithIndex((t, i) =>
    {
      Console.WriteLine("Task " + i);
      Console.WriteLine("Is canceled: " + t.task.IsCanceled);
      Console.WriteLine("Is completed: " + t.task.IsCompleted);
      Console.WriteLine("Is faulted: " + t.task.IsFaulted);
    });
  }

  return totalNumPrimes;
}

Gracefully Cancel the Task

The implementation of the worker task:

static void CancellableTask(int threadNum, CancellationTokenSource cts)
{
  DurationOf(() =>
  {
    int numPrimes = 0;
    int n;

    while ((n = Interlocked.Increment(ref nextNumber)) < MAX)
    {
      if (cts.IsCancellationRequested)
      {
        // Graceful exit
        break;
      }

      if (IsPrime(n))
      {
        ++numPrimes;
      }
    }

    Interlocked.Add(ref totalNumPrimes, numPrimes);
    return numPrimes;
  }, $"Thread: {threadNum}");
}

Notice how long it takes to cancel a task! We requested that the task be cancelled after 1 second, but it actually takes the tasks 4 seconds to cancel. As expected with a graceful exit, no exception is thrown.

Call the Token's ThrowIfCancellationRequested

The implementation differs in that, instead of:

if (cts.IsCancellationRequested)
{
  // Graceful exit
  break;
}

We do this instead:

cts.Token.ThrowIfCancellationRequested();

In this example, notice that an exception is thrown and each task is in the "canceled" state and the "completed" state.

Notice in the setup, this very important line:

var task = Task.Run(() => CancellableTask(i, cts), cts.Token);

If we don't set the cancellation token as part of the Task.Run, the task's IsCanceled flag is NOT set and the exception is treated as a fault instead!

Throwing Our Own OperationCanceledException

The implementation is again slightly different:

if (cts.IsCancellationRequested)
{
  throw new OperationCanceledException();
}

Notice here that the task's IsCanceled flag is false but the IsFaulted flag is true!

Canceling Threads Awaiting on a Semaphore

Lastly, let's go back to the threading example where we used a semaphore to signal work was queued and look at how to cancel the thread. In this example, I will specifically NOT queue any work so we can be sure that the semaphore is released as the result of the cancellation rather than processing some work. As before, this is an "inline" method -- notice:

As we did before, we're passing the cancellation token as a parameter to Task.Run.
Something new here -- we're using SemaphoreSlim instead of Semaphore because SemaphoreSlim supports a Wait method that accepts the cancellation token as a parameter. This is how the semaphore is released when the cancel request is made.

Here's the code that instantiates and executes the tasks, then cancels them after 1 second:

static void CancellableSemaphores()
{
  SemaphoreSlim sem = new SemaphoreSlim(0, Int32.MaxValue);
  int numProcs = Environment.ProcessorCount;
  var queue = new ConcurrentQueue<int>();
  int numPrimes = 0;
  List<(Task task, CancellationTokenSource cts)> tasks = 
                       new List<(Task, CancellationTokenSource)>();

  for (int i = 0; i < numProcs; i++)
  {
    var cts = new CancellationTokenSource();

    tasks.Add((Task.Run(() =>
    {
      while (true)
      {
        sem.Wait(cts.Token);

        if (queue.TryDequeue(out int n))
        {
          if (n == 0)
          {
            break;
          }

          if (IsPrime(n))
          {
            Interlocked.Increment(ref numPrimes);
          }
        }
      }
    }, cts.Token), cts));
  }

  DurationOf(() =>
  {
    // Don't enqueue anything. We want the thread to wait and 
    // be released by the cancellation token.
    tasks.ForEach(t => t.cts.CancelAfter(1000));

    try
    {
      Task.WaitAll(tasks.Select(t => t.task).ToArray());
    }
    catch (AggregateException ex)
    {
      Console.WriteLine(ex.Message);

      tasks.ForEachWithIndex((t, i) =>
      {
        Console.WriteLine("Task " + i);
        Console.WriteLine("Is canceled: " + t.task.IsCanceled);
        Console.WriteLine("Is completed: " + t.task.IsCompleted);
        Console.WriteLine("Is faulted: " + t.task.IsFaulted);
      });
    }

    return numPrimes;
  }, "Threads using cancellable semaphores");
}

Notice how the semaphore is released as a result of the cancel request. This code, as the screenshot above shows, will throw an exception on the task that is caught in the Task.WaitAll call. Instead of:

sem.Wait(cts.Token);

We could catch the exception in the task and exit gracefully:

try
{
  sem.Wait(cts.Token);
}
catch (OperationCanceledException)
{
  break;
}

However, we don't know that the task was cancelled. The only way we would know that a task has been cancelled is if we use some external mechanism, even possibly a return value that indicates the task cancellation state. To the outside world, the above code looks like the task completed all its work.

We could also write this somewhat redundant code:

try
{
  sem.Wait(cts.Token);
}
catch (OperationCanceledException)
{
  cts.Token.ThrowIfCancellationRequested();
}

In all cases, the AggregateException.Exceptions[] contains instances of TaskCanceledException exception types.

The ThreadPool Class and CPU-Bound Work

There is no advantage to creating more threads than cores for CPU-bound work -- that is, where each thread is never in a wait state.

More Threads than Cores

For example, when I test the performance of 4 threads (not tasks, actual threads which do not use the .NET ThreadPool) because there are 4 CPUs on my system, it takes about 8 seconds.

If I increase the number of threads to 20:

The processing time is almost 9 seconds -- 2 seconds longer! The overhead in doing all that context switching between threads is quite noticeable.

More Tasks than Cores

With tasks, if the number of tasks I create is equal to the number of cores, they all start pretty much at the same time (the times are in seconds):

Now look what happens when I try to start 20 CPU-bound tasks on a 4 core system:

Again, the times are in seconds. Tasks 0-3 start as expected. Then there is a 4 second delay before the next task starts, and 5 seconds for the next one after that. By that time, the first 4 tasks have completed all the work and the remaining tasks are created with nothing to do so, they terminate fairly quickly. The time delay after the first four tasks are the consequence of the ThreadPool's scheduler, which delays starting more CPU-bound tasks when the number of tasks exceeds the number of cores. There's a variety of write-ups on the behavior of the ThreadPool, particularly that once tasks > CPU cores, subsequent tasks are delayed by 500ms. There's clearly a lot more delay going on than that! Diving into the ThreadPool class is really going beyond the scope of this article -- the point here is, if you use the ThreadPool class (and beware of all the various ways of creating threads that tie in to the ThreadPool class) you need to be cognizant of how many threads you're creating, what they're doing, and how long they do it.