Learn 30 Windows Multithreading Mistakes and the Solutions to Avert Them to Make Your Program More Resilient
Introduction
The motivation for writing this article came from a dentist visit. I had an intense toothache and had to take a 3D facial CT scan to diagnose it. The scan ran halfway, then stopped and hung. At that point, I wondered whether the X-ray was still blasting my body, causing damage. The dentist commented that the machine was from Germany, as if mentioning its German origin absolved it of its fault.
A deadlock in multithreaded code could have caused the hang. I then vowed to make a list of every multithreading mistake. The intended audience is both novices and experts. Experts should already be familiar with most of the mistakes, but they can still read the article to fill knowledge gaps. Writing error-free threading code remains a challenge for most developers, and threading bugs are hard to reproduce reliably.
As the saying goes, prevention is the best cure. This article does not only help with prevention: armed with its knowledge, developers can review their code to narrow down and identify the root cause of existing bugs. The focus is on C++ and Windows development. Every item is written so that the reader can understand the concept without getting lost in the minutiae. Prerequisites are familiarity with C++ multithreading and Windows development.
1. Stack
A process is an instantiation of a program. A process cannot execute code by itself; it needs at least one thread to do so. Every thread comes with a default stack size of 1 megabyte, which some developers deem wasteful. We take a closer look at this topic in the item below.
1.1 Setting Stack Size Less Than 20 KB

The default stack sizes can be adjusted in Visual Studio under Project -> Property Page -> Linker -> System -> Stack Commit Size/Stack Reserve Size. More often than not, though, the stack size is set via the dwStackSize parameter of CreateThread, CreateRemoteThread, and CreateFiber; if zero is specified, the default size from the executable header is used. When STACK_SIZE_PARAM_IS_A_RESERVATION is not present in the dwCreationFlags parameter, dwStackSize is the commit size, not the reserve size. Only committed memory uses physical memory; reserving memory merely tells Windows not to allocate anything else in that address range because it may be committed in the future. When the committed size is exceeded, more memory from the reserved range is committed. About 95% of the default 1MB stack is merely reserved, so it does not touch physical memory.
HANDLE CreateThread(
  LPSECURITY_ATTRIBUTES lpThreadAttributes,
  SIZE_T dwStackSize,
  LPTHREAD_START_ROUTINE lpStartAddress,
  LPVOID lpParameter,
  DWORD dwCreationFlags,
  LPDWORD lpThreadId
);
HANDLE CreateRemoteThread(
  HANDLE hProcess,
  LPSECURITY_ATTRIBUTES lpThreadAttributes,
  SIZE_T dwStackSize,
  LPTHREAD_START_ROUTINE lpStartAddress,
  LPVOID lpParameter,
  DWORD dwCreationFlags,
  LPDWORD lpThreadId
);
LPVOID CreateFiber(
  SIZE_T dwStackSize,
  LPFIBER_START_ROUTINE lpStartAddress,
  LPVOID lpParameter
);
Setting 20KB for both the commit and reserve size is inadvisable. Your thread calls other code that may unexpectedly use more stack memory (for example, a large local array) or recursion, and a stack overflow can happen because the stack can never grow beyond 20KB. Beware: when compiled with AddressSanitizer, there is an increased chance of stack overflow. If you want to limit the stack size, at least choose a sensible 100KB.
If you are interested in how a stack’s reserved memory is converted to committed memory, look no further than this blog post by Raymond Chen.
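For illustration, here is a minimal sketch (assuming a hypothetical Worker function) that reserves a full 1MB stack but commits pages only as they are touched:
#include <Windows.h>

DWORD WINAPI Worker(LPVOID)
{
    return 0; // the real work goes here
}

int main()
{
    // With STACK_SIZE_PARAM_IS_A_RESERVATION, dwStackSize is the reserve size;
    // pages are committed lazily as the thread actually touches them.
    HANDLE hThread = ::CreateThread(nullptr, 1024 * 1024, Worker, nullptr,
                                    STACK_SIZE_PARAM_IS_A_RESERVATION, nullptr);
    if (hThread)
    {
        ::WaitForSingleObject(hThread, INFINITE);
        ::CloseHandle(hThread);
    }
    return 0;
}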
2. Data Corruption
A multithreaded program exhibits strange behavior when data is corrupted because, unlike a single-threaded program where code executes sequentially, code in a multithreaded program executes concurrently. These bugs can be hard to trace. Here, we examine corruption bugs.
2.1 Using Out-Of-Scope Local Variable Passed to the Thread
When the address or reference of a local variable declared on the stack is passed to the worker thread, the local variable may go out of scope before the thread starts. This is an intermittent problem. The solutions are listed below.
- Use a member variable, but beware: this can constitute a data race when two threads are created in quick succession and each needs a different value.
- Allocate the variable on the heap, but remember to deallocate it inside the thread after use.
- Capture the variable by copy in the lambda (see the sketch below).
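Here is a minimal sketch of the capture-by-copy fix, assuming a std::thread worker; the names are illustrative.
#include <string>
#include <thread>

void StartEncoding()
{
    std::string fileName = "slideshow.mp4"; // local variable on the stack
    // fileName is captured by copy, so the lambda owns its own string
    // even after StartEncoding returns and the local goes out of scope.
    std::thread worker([fileName]() {
        // encode to fileName here
    });
    worker.detach(); // sketch only; a real app should track and join its threads
}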
I experienced this bug first-hand in my slideshow application. The video-encoding thread crashed once every few runs, yet the GUI looked fine. Initially, I dismissed it as a glitch. Then it crashed a few more times. I took a closer look and was able to nail the problem down to an out-of-scope string.
2.2 Using Boolean as a Synchronization Primitive
This is a classic beginner mistake. The C++ compiler can reorder writes before or after the boolean guard because a plain boolean erects no memory barrier or fence. The solution is to always synchronize using a mutex or an atomic variable.
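As a minimal sketch (names are illustrative), replacing the plain flag with std::atomic<bool> provides the needed ordering guarantees:
#include <atomic>
#include <thread>

std::atomic<bool> dataReady{ false };
int payload = 0;

void Producer()
{
    payload = 42;          // this write cannot be reordered past the store below
    dataReady.store(true); // sequentially consistent by default
}

void Consumer()
{
    while (!dataReady.load()) // pairs with the store above
        std::this_thread::yield();
    // payload is guaranteed to be 42 here
}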
2.3 Forgetting to End the Threads Before the Program Exits
When the program ends, it terminates all its worker threads. Corruption occurs if any of them is writing to a file or the registry at that point. The solution is to signal the threads with a Win32 event and wait for them to end before exiting.
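A minimal sketch of the signal-and-wait shutdown, assuming a worker created with CreateThread (names are illustrative):
#include <Windows.h>

HANDLE g_hQuitEvent; // manual-reset event signaled at shutdown

DWORD WINAPI Worker(LPVOID)
{
    // Keep working until the quit event is signaled.
    while (::WaitForSingleObject(g_hQuitEvent, 0) == WAIT_TIMEOUT)
    {
        // write to the file or registry safely here
    }
    return 0;
}

int main()
{
    g_hQuitEvent = ::CreateEvent(nullptr, TRUE, FALSE, nullptr);
    HANDLE hThread = ::CreateThread(nullptr, 0, Worker, nullptr, 0, nullptr);

    // ... application runs ...

    ::SetEvent(g_hQuitEvent);                 // ask the worker to stop
    ::WaitForSingleObject(hThread, INFINITE); // let it finish cleanly
    ::CloseHandle(hThread);
    ::CloseHandle(g_hQuitEvent);
    return 0;
}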
2.4 Calling TerminateThread or TerminateProcess
TerminateThread stops a thread abruptly without cleanup; Resource Acquisition Is Initialization (RAII) destructors never get a chance to run. Rule of thumb: do not call TerminateThread. By extension, do not call TerminateProcess either, as it calls TerminateThread on every thread.
2.5 Thread Started With _beginthread, Ended With ExitThread
This item applies to a C program where a thread must be started with either _beginthread or _beginthreadex so that the C runtime is initialized. Such threads must end with _endthread or _endthreadex (invoked automatically when the thread function returns) so the C runtime can reclaim its per-thread resources. Calling ExitThread instead is a mistake because it skips this cleanup.
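A minimal sketch of the correct pairing under the MSVC runtime (the ThreadFunc name is illustrative):
#include <process.h>
#include <windows.h>
#include <cstdint>

unsigned __stdcall ThreadFunc(void*)
{
    // ... work ...
    return 0; // returning invokes _endthreadex for us; do not call ExitThread
}

int main()
{
    uintptr_t h = _beginthreadex(nullptr, 0, ThreadFunc, nullptr, 0, nullptr);
    ::WaitForSingleObject(reinterpret_cast<HANDLE>(h), INFINITE);
    ::CloseHandle(reinterpret_cast<HANDLE>(h));
    return 0;
}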
3. Graphical User Interface (GUI)
This section deals with updating the GUI from a worker thread.
3.1 Updating GUI From a Worker Thread
For a program with a graphical user interface (GUI), there is a main thread that drives the UI. A worker thread is sometimes created to run a long-running task so the UI stays responsive, and the thread updates the UI with the results when the task finishes. That update is a mistake: the UI thread and the worker thread would then both update the UI without synchronization. All UI updates must be made from the UI thread.
For Microsoft Foundation Classes (MFC) applications, a custom message identifier can be defined to send messages from the non-UI thread to the UI thread using PostMessage. SendMessage is not suitable because it bypasses the message queue and sends the message directly to the window handle to be processed before returning. In contrast, PostMessage puts the message in the message queue and returns immediately; the message is then retrieved and processed on the UI thread.
In an MFC project, define your custom message ID as an offset from WM_APP.
#define FSM_MESSAGE (WM_APP + 200)
In the UI class header, declare this handler. A meaningful name can be chosen.
afx_msg LRESULT OnFsmMessage(WPARAM wparam, LPARAM lparam);
In the UI class’s .cpp message map, insert the ON_MESSAGE entry:
BEGIN_MESSAGE_MAP(CSendMsgToParentExampleDlg, CDialogEx)
…
ON_MESSAGE(FSM_MESSAGE, &CSendMsgToParentExampleDlg::OnFsmMessage)
…
END_MESSAGE_MAP()
This is the body of OnFsmMessage:
LRESULT CSendMsgToParentExampleDlg::OnFsmMessage(WPARAM wparam, LPARAM lparam)
{
    CString* pStr = reinterpret_cast<CString*>(wparam);
    switch (lparam)
    {
    case 268:
        m_edtText.SetWindowTextW(*pStr);
        break;
    }
    delete pStr;
    return 0;
}
You can post the FSM_MESSAGE from the UI or a worker thread.
CString* pStr = new CString(msg);
GetParent()->PostMessage(FSM_MESSAGE, (WPARAM)pStr, 268);
This is the MFC example of message posting. You can download the demo at GitHub or here. The approach is easily adaptable to a classic Windows API project.
4. Deadlocks and Hangs
In this section, deadlock is the focus. A deadlock can manifest as a stalled operation, and the problem is hard to reproduce consistently.
4.1 Deadlock
Deadlock happens when two threads acquire two or more locks in opposite order: each thread acquires one lock and then tries to acquire the other, which is already held by the other thread. Neither thread can proceed, hence the deadlock, as illustrated by the diagram.

Deadlock can be prevented by always taking the locks in the same order; unlocking can be done in any order. Since a lock has a unique, unchanging memory address, addresses can be compared numerically to impose a global order. The RAII2CSLock class below uses Windows’ CRITICAL_SECTION as the lock. You can download the class here.
#include "CriticalSection.h"
class RAII2CSLock
{
public:
RAII2CSLock(CriticalSection& a_section, CriticalSection& b_section)
: m_SectionA(a_section)
, m_SectionB(b_section)
{
if (&m_SectionA < &m_SectionB)
{
m_SectionA.Lock();
m_SectionB.Lock();
}
else if (&m_SectionB < &m_SectionA)
{
m_SectionB.Lock();
m_SectionA.Lock();
}
else {
m_SectionA.Lock();
}
}
~RAII2CSLock()
{
if (&m_SectionA == &m_SectionB)
{
m_SectionA.Unlock();
return;
}
m_SectionA.Unlock();
m_SectionB.Unlock();
}
private:
RAII2CSLock(const RAII2CSLock&);
RAII2CSLock& operator=(const RAII2CSLock&);
CriticalSection& m_SectionA;
CriticalSection& m_SectionB;
};
The number of comparisons grows quickly with more than two locks, degrading performance, so only use this approach when you have two locks. Good news for C++17 users: C++17 offers scoped_lock as an alternative to lock_guard that can lock multiple mutexes using a deadlock-avoidance algorithm.
std::mutex mutA;
std::mutex mutB;
...
{
std::scoped_lock lock(mutA, mutB);
...
}
The benchmark of scoped_lock and RAII2CSLock is as follows and can be downloaded here. RAII2CSLock, which is based on CriticalSection, has a slight edge over scoped_lock, which is based on the C++11 mutex.
Std Locking timing: 3417ms
scoped_lock locking timing: 3412ms
RAII2MutexLock locking timing: 3301ms
RAII2CSLock locking timing: 2853ms
Do not mix scoped_lock usage with RAII2CSLock; stick to one. Using scoped_lock everywhere gives the deadlock-avoidance algorithm visibility of all the locks, including the ability to temporarily unlock one of them to let another thread go ahead.
4.2 Livelock
Livelock happens when a thread tries to acquire a lock and, on failure, does some idle processing; during that time the lock becomes available, but the thread is busy with the idle work, and when it tries again, the lock is once more unavailable. Because lock acquisition blocks, some developers think CPU time is wasted while waiting. This is not the case: when acquisition fails and the thread blocks, it gives up its remaining time slice, and no processing time is wasted from the OS perspective. The cure is to always acquire (never try-acquire) and to move the idle processing onto a separate dedicated thread.
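A minimal sketch of the difference, with a hypothetical DoIdleWork helper:
#include <mutex>

std::mutex mut;
void DoIdleWork() { /* placeholder for the idle processing */ }

void LivelockProne()
{
    while (!mut.try_lock()) // fails, wanders off to do idle work,
        DoIdleWork();       // and may keep missing the moments the lock is free
    // ... critical section ...
    mut.unlock();
}

void Preferred()
{
    std::lock_guard<std::mutex> lock(mut); // blocks without burning CPU
    // ... critical section ...
}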
4.3 Processing Packets in Network Threads
I made this mistake in the past: I processed a packet in the packet-receiving thread and sent another TCP request that blocked until its reply arrived, inadvertently hanging the thread because I had blocked the very network thread that was supposed to receive the reply. In network processing, there should be a dedicated thread to receive packets and another thread to process them.
4.4 DllMain Loader Lock Deadlock
When LoadLibrary is called to load your DLL, the loader lock is held during the DllMain call. If this DllMain indirectly causes another DLL to load, or waits on another thread that is itself trying (and failing) to acquire the loader lock, your DLL load can deadlock. What code can cause a DLL to load indirectly? Nobody knows for sure; even opening a file can cause your anti-virus to load its DLLs for scanning. I once ran a demo application that loads a non-licensed DLL whose DllMain pops up a nagging dialog, and it hung on my PC on the first run. I gave feedback to the DLL author, who had never encountered the problem; I deduced it was my anti-virus. Beware: this hang might not surface on your development machine but on a client’s machine.
Microsoft recommends not putting complex initialization and cleanup code inside DllMain; allocating and deallocating memory is fine. Create your own DLL functions for initialization and cleanup and have the application call them.
5. Mutex
In this section, we explore the problems that can arise from misusing a mutex.
5.1 Using Mutex When Atomic Variable Is Sufficient
The counting code below takes a mutex lock whenever count is incremented; an atomic count is sufficient in this case. You can ignore is_valid: it exists only to stop the optimizer from deducing what the code does and optimizing it away.
bool is_valid(int n)
{
    return (n % 2 == 0);
}
...
int count = 0;
std::mutex mut;
std::for_each(std::execution::par, Vec.begin(), Vec.end(),
    [&mut, &count](size_t num) -> void
    {
        if (is_valid((int)(num)))
        {
            std::lock_guard<std::mutex> guard(mut);
            ++count;
        }
    });
You can accomplish the same thing more efficiently with an atomic variable.
std::atomic<int> count = 0;
std::for_each(std::execution::par, Vec.begin(), Vec.end(),
    [&count](size_t num) -> void
    {
        if (is_valid((int)(num)))
            ++count;
    });
If you are not using C++11, you can use the Windows Interlocked primitives for the increment (the snippet below still uses the C++17 parallel for_each as the test harness).
LONG count = 0;
std::for_each(std::execution::par, Vec.begin(), Vec.end(),
    [&count](size_t num) -> void
    {
        if (is_valid((int)(num)))
            ::InterlockedIncrement(&count);
    });
The benchmark results are as follows.
inc mutex: 3136ms
inc atomic: 1007ms
inc win Interlocked: 1005ms
You can download the benchmark code from the top of the article.
5.2 Using Global Windows Mutex for Single Instance Application
The mutex here refers to a Windows Mutex, not C++11’s mutex class. A Windows Mutex is an interprocess kernel object that can be used to implement a single-instance application via a unique global name. The trouble is that anyone can find your global mutex listed in Process Explorer or WinObj and close it, and voila, a second instance of your application can be launched. This YouTube video illustrates the problem. The solution is a private mutex, but the code is more involved; I’ll expand on it in a separate article.
5.3 Abandoned Mutex
An abandoned mutex is one whose owning thread exited without releasing it, for example because an exception was thrown and never caught. C++’s Resource Acquisition Is Initialization (RAII) solves this by releasing the mutex in a destructor. If you are using C++11, you need not reinvent the wheel: lock_guard acquires the mutex in its constructor and releases it in its destructor.
std::mutex mut;
...
std::lock_guard<std::mutex> lock(mut);
If you are using CRITICAL_SECTION as your mutex, you can make use of these two classes written by Jonathan Dodds. The member functions Enter and Leave are renamed to Lock and Unlock.
#include <Windows.h>

class CriticalSection
{
public:
    CriticalSection()
    {
        ::InitializeCriticalSection(&m_rep);
    }
    ~CriticalSection()
    {
        ::DeleteCriticalSection(&m_rep);
    }
    void Lock()
    {
        ::EnterCriticalSection(&m_rep);
    }
    void Unlock()
    {
        ::LeaveCriticalSection(&m_rep);
    }
private:
    CriticalSection(const CriticalSection&);
    CriticalSection& operator=(const CriticalSection&);

    CRITICAL_SECTION m_rep;
};
This is the RAII class that unlocks the CriticalSection in its destructor. You can download the two classes here.
#include "CriticalSection.h"
class RAIICSLock
{
public:
RAIICSLock(CriticalSection& a_section)
: m_Section(a_section) {
m_Section.Lock();
}
~RAIICSLock()
{
m_Section.Unlock();
}
private:
RAIICSLock(const RAIICSLock&);
RAIICSLock& operator=(const RAIICSLock&);
CriticalSection& m_Section;
};
If you are using a Windows mutex, you can detect an abandoned mutex whenever WaitForSingleObject returns WAIT_ABANDONED, or WaitForMultipleObjects returns a value from WAIT_ABANDONED_0 to (WAIT_ABANDONED_0 + nCount - 1). Please refer to the MSDN documentation for more information.
5.4 Using Semaphore Of One Count As Mutex
Some developers mistake a semaphore with a count of one for a mutex. It is not equivalent: a mutex has thread ownership, meaning only the thread that acquired it can release it, while a semaphore has no such requirement. Beware: this is a common trick question in low-latency trading job interviews.
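A minimal sketch of the ownership difference: ReleaseMutex fails with ERROR_NOT_OWNER when called from a thread that never acquired the mutex.
#include <Windows.h>
#include <cstdio>

HANDLE g_hMutex;

DWORD WINAPI TryRelease(LPVOID)
{
    // This thread does not own the mutex, so the release fails.
    if (!::ReleaseMutex(g_hMutex))
        std::printf("ReleaseMutex failed: %lu\n", ::GetLastError()); // ERROR_NOT_OWNER
    return 0;
}

int main()
{
    g_hMutex = ::CreateMutex(nullptr, FALSE, nullptr);
    ::WaitForSingleObject(g_hMutex, INFINITE); // the main thread now owns it

    HANDLE hThread = ::CreateThread(nullptr, 0, TryRelease, nullptr, 0, nullptr);
    ::WaitForSingleObject(hThread, INFINITE);

    ::ReleaseMutex(g_hMutex); // only the owner can release
    ::CloseHandle(hThread);
    ::CloseHandle(g_hMutex);
    return 0;
}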
6. Thread Safety
A threadsafe function can be called safely from multiple threads at the same time. In this section, we look at how to make your functions threadsafe.
6.1 Using Global or Static Variable in Thread
To make your function safe to call from multiple threads, it must not use global variables or static local variables to store its state. Static local variable initialization is threadsafe in C++11, but subsequent access to the variable is not synchronized.
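A minimal sketch of the problem and the fix (names are illustrative):
// Not threadsafe: every caller shares one hidden counter.
int NextId()
{
    static int id = 0;
    return ++id; // data race when called from two threads concurrently
}

// Threadsafe: the caller owns the state and passes it in.
int NextId(int& id)
{
    return ++id; // each thread can keep its own counter
}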
6.2 Using Non-Threadsafe Functions in Thread
In addition to avoiding global and static variables, ensure your functions do not call non-threadsafe code such as the string-splitting function strtok, which stores its current position for the next call. To make strtok thread-safe, the C runtime introduced a reentrant version called strtok_s. Replace your non-threadsafe function calls with their threadsafe counterparts.
6.3 Not All Access to Shared Variable Is Protected
Memory shared between threads must be protected with a lock, and every access must be protected, not just some. Exercise extra caution when dealing with shared state: during code reviews, check that every access is done under synchronization.
7. Priority
Every thread is associated with a priority that determines its scheduling; higher priority means it gets more CPU time. Setting a higher or lower priority can be detrimental to your application, as we shall see in this section.
7.1 Setting Realtime Thread Priority
Do not ever set real-time priority for your thread, as it starves other threads. Real-time priority should be reserved for processing that needs real-time attention, such as audio. Note: you normally cannot reach real-time priority through the Windows API because a thread’s priority can only be raised about two levels above its process priority class, but this limitation can be bypassed with a kernel driver.
7.2 Setting Lower Thread Priority
Priority inversion is when a high-priority thread waits on a synchronization primitive held by a low-priority thread, and execution cannot proceed because the low-priority thread is not scheduled enough to finish its work and release the lock. This is exactly what happened to the Mars Pathfinder. Whenever Windows detects that a thread has not run for four seconds, it gives the thread a priority boost. I would rather Windows did not mitigate the problem this way, as the boost can leave thread starvation undetected during development. The way to avoid it is not to set low priority, and also not to set high priority, because a high priority implies everything else is relatively low.
8. Performance
Threading performance misconceptions and improvements are discussed in this section.
8.1 Double-checked Locking
Double-checked locking is typically used to improve singleton performance, but the first check is subject to a data race, making the pattern broken as written.
class Singleton
{
public:
    static Singleton* instance()
    {
        if (instance_ == 0) // unsynchronized read: data race
        {
            std::lock_guard<std::mutex> guard(lock_);
            if (instance_ == 0)
                instance_ = new Singleton;
        }
        return instance_;
    }
private:
    static std::mutex lock_;
    static Singleton* instance_;
};
The C++11 Standard makes static local initialization threadsafe by stating: “If control enters the declaration concurrently while the variable is being initialized, the concurrent execution shall wait for completion of the initialization.” A GetInstance function with a static local Singleton therefore suffices.
Singleton& GetInstance() {
static Singleton s;
return s;
}
8.2 False Sharing
Data is usually packed as close together as possible to be cache-friendly, but this policy runs counter to multithreading performance when each thread writes its own data on a shared cache line. The result is thrashing: each thread is told that its cached line has been invalidated by another thread and must be re-fetched from main memory, even though the thread never operates on the other thread’s data and has no interest in it whatsoever.
To get over this problem, the data in the struct should be aligned to the cache-line size using C++17’s hardware_destructive_interference_size to force each member onto a different cache line. Do note this size is fixed at compile time (typically 64 bytes) and does not adapt when running on a machine with a different cache-line size; the constant could also have been more aptly named. The keep_apart struct is an example from C++ Reference of using the alignas keyword with hardware_destructive_interference_size to avoid false sharing.
struct keep_apart
{
alignas(std::hardware_destructive_interference_size) std::atomic<int> cat;
alignas(std::hardware_destructive_interference_size) std::atomic<int> dog;
};
To be on the safe side, you can place a dummy array of 128 bytes between your data of interest; 128 bytes covers the largest common cache-line size.
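A minimal sketch of that manual-padding alternative (the struct name is illustrative):
#include <atomic>

struct KeepApartManually
{
    std::atomic<int> cat;
    char pad[128]; // pushes dog at least one cache line away from cat
    std::atomic<int> dog;
};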
8.3 Coarse Grained Locking
Coarse-grained locking means holding a lock over a large block of code. Performance can be improved by taking the lock only when needed, which can be accomplished with RAII by using curly braces to limit the scope of the lock_guard. See CoarseGrainedLockingFunc below.
std::mutex mutA;
void CoarseGrainedLockingFunc()
{
    std::lock_guard<std::mutex> lock(mutA);
    A->DoWork();
    A->DoAnotherWork();
}
See how FineGrainedLockingFunc improves on it by locking the mutex only while A is used.
std::mutex mutA;
void FineGrainedLockingFunc()
{
    {
        std::lock_guard<std::mutex> lock(mutA);
        A->DoWork();
    }
    {
        std::lock_guard<std::mutex> lock(mutA);
        A->DoAnotherWork();
    }
}
8.4 Setting Processor Core Affinity
Setting a thread’s processor affinity can be counter-productive: the chosen core also runs other threads, and when another core is ready to run your thread, the scheduler cannot select it because of the affinity. Affinity proponents are betting that the thread’s data is still in that core’s cache when the thread runs again; on the other hand, the data could already have been evicted. Experiment to see if it pays off.
8.5 Spinning
Spinning to acquire a lock is not recommended in user-mode applications because your thread can be preempted by other threads, and spinning deprives them of the chance to do useful work. Spinning drains laptop batteries fast and generates much heat in a data center without doing real work. Spinning only makes sense in kernel code running at a level the thread scheduler cannot preempt, though note that kernel code can still be interrupted by hardware/software interrupts.
If spinning is desired in user mode for performance reasons, you can make a CRITICAL_SECTION spin briefly by initializing it with a spin count via InitializeCriticalSectionAndSpinCount. The spin count is ignored on single-processor systems. Whenever EnterCriticalSection is called on a held lock, the thread spins in a loop, up to spin-count iterations, checking whether the lock has been released; if not, the thread then goes to sleep to wait for the release.
BOOL InitializeCriticalSectionAndSpinCount(
  LPCRITICAL_SECTION lpCriticalSection,
  DWORD dwSpinCount
);
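A minimal sketch of the initialization; the 4000 spin count is an illustrative value, not a recommendation:
#include <Windows.h>

CRITICAL_SECTION g_cs;

void Init()
{
    // Spin up to the given count before falling back to a kernel wait.
    ::InitializeCriticalSectionAndSpinCount(&g_cs, 4000);
}

void Work()
{
    ::EnterCriticalSection(&g_cs); // spins first, sleeps if still contended
    // ... short critical section ...
    ::LeaveCriticalSection(&g_cs);
}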
8.6 Too Much Locking
To improve performance, we can reduce the amount of synchronization. For example, the code below, from the mutex section above, synchronizes every single time count is incremented.
int count = 0;
std::mutex mut;
std::for_each(std::execution::par, Vec.begin(), Vec.end(),
    [&mut, &count](size_t num) -> void
    {
        if (is_valid((int)(num)))
        {
            std::lock_guard<std::mutex> guard(mut);
            ++count;
        }
    });
We can modify the code so each chunk increments a local temp_count inside its for loop and adds temp_count to count once at the end of the lambda; only that final addition needs mutex synchronization. The number of mutex locks then equals the number of processor cores, which is given by the threads variable.
const size_t threads = std::thread::hardware_concurrency();
std::vector<size_t> vecIndex;
for (size_t i = 0; i < threads; ++i)
    vecIndex.push_back(i);

int count = 0;
std::mutex mut;
std::for_each(std::execution::par, vecIndex.begin(), vecIndex.end(),
    [&mut, &count, threads](size_t index) -> void
    {
        size_t thunk = Vec.size() / threads;
        size_t start = thunk * index;
        size_t end = start + thunk;
        if (index == (threads - 1))
        {
            size_t remainder = Vec.size() % threads;
            end += remainder;
        }
        int temp_count = 0;
        for (size_t i = start; i < end; ++i)
        {
            if (is_valid((int)(Vec[i])))
            {
                ++temp_count;
            }
        }
        {
            std::lock_guard<std::mutex> guard(mut);
            count += temp_count;
        }
    });
The naive mutex code takes 3136ms while the one with fewer mutex locks takes 33ms, almost a 100x improvement.
inc mutex: 3136ms
inc less mutex lock: 33ms
9. Task-Based Threading
In this section, we look at mistakes in task-based threading using Intel Threading Building Blocks (TBB).
9.1 Using Thread Local Storage in Tasks
When you are converting your existing thread code to use tasks, be prepared to remove thread local storage (TLS). In task-based processing, a thread is reused to process different tasks; if the next task also accesses TLS, it may be reading information meant for another task.
9.2 Waiting Inside Task for Another Task
Inside the current task, do not create another task and wait for it. Task-based processing works by placing tasks in thread queues; your new task can land in the same queue as your current task, and the current thread is then blocked, waiting for a completion it is itself preventing. When the tasks end up in different queues, there is no blocking problem, which makes this an intermittent issue that is tricky to detect. The problem can be averted by not waiting for a new task to complete, or by not creating a new task inside a task.
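A sketch of the anti-pattern using tbb::task_group, assuming TBB is available; the work lambdas are placeholders, and whether the nested wait actually stalls depends on the scheduler’s queueing, as described above:
#include <tbb/task_group.h>

void OuterTask()
{
    tbb::task_group inner;
    inner.run([] { /* child work */ });
    inner.wait(); // risky: waits inside a task for another task
}

int main()
{
    tbb::task_group outer;
    outer.run([] { OuterTask(); });
    outer.wait();
    return 0;
}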
10. Parallel Patterns Library
The Microsoft Parallel Patterns Library (PPL) parallel_for has a memory-leak problem; the alternative comes from C++17.
10.1 Using the Parallel Patterns Library Leaks Memory
PPL’s parallel_for leaks memory. I reported it and was told the memory was global, but the leak keeps growing with more parallel_for invocations. The Microsoft recommendation is to use C++17’s parallel for_each instead.
11. Windows API
In this section, we look at two Windows APIs that return process and thread handles.
11.1 GetCurrentProcess and GetCurrentThread Return Pseudo Handles
Beware that the GetCurrentProcess and GetCurrentThread functions return pseudo handles of -1 and -2 respectively. In the code below, a ptrdiff_t cast converts the unsigned HANDLE to a signed integer of the platform’s bitness so that the negative numbers can be observed. The code can be downloaded here.
std::cout << "GetCurrentProcess(): " <<
(std::ptrdiff_t)::GetCurrentProcess() << "\n";
std::cout << "GetCurrentThread(): " <<
(std::ptrdiff_t)::GetCurrentThread() << "\n";
This is the output:
GetCurrentProcess(): -1
GetCurrentThread(): -2
This is not a problem if you use the handle within the current process or thread, but if you need to pass the handle to another process or thread, call DuplicateHandle to get a real handle. You can read more in Bruno van Dooren’s The Current Thread Handle.
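A minimal sketch of converting the pseudo thread handle into a real one with DuplicateHandle:
#include <Windows.h>

int main()
{
    HANDLE hRealThread = nullptr;
    // Turn the -2 pseudo handle into a real handle usable by other threads.
    BOOL ok = ::DuplicateHandle(
        ::GetCurrentProcess(), // source process
        ::GetCurrentThread(),  // pseudo handle to duplicate
        ::GetCurrentProcess(), // target process
        &hRealThread,
        0,                     // ignored with DUPLICATE_SAME_ACCESS
        FALSE,                 // not inheritable
        DUPLICATE_SAME_ACCESS);
    if (ok)
        ::CloseHandle(hRealThread); // real handles must be closed
    return 0;
}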
12. COM Threading
In this section, we explore the usage of a COM interface pointer in another thread.
12.1 Using COM Interface Pointer in Another Thread Without Marshalling
This is a bonus 31st mistake; since it does not always appear to cause a problem, I did not count it among the 30. A COM interface pointer created on one thread must be marshalled before being used on another. In my case, because the new worker thread never called CoInitialize()/CoInitializeEx() to initialize a COM apartment, COM could not detect that the pointer was being called from another thread, and thus no error surfaced. The correct practice is to marshal the COM interface pointer, and the easiest way is through the Global Interface Table. Only an IStream pointer is exempt from marshalling across threads.
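A minimal sketch of marshalling through the Global Interface Table, using IUnknown for brevity and omitting error handling:
#include <objbase.h>

// Thread A: register the pointer and hand the returned cookie to thread B.
DWORD RegisterInGit(IUnknown* pUnk)
{
    IGlobalInterfaceTable* pGit = nullptr;
    ::CoCreateInstance(CLSID_StdGlobalInterfaceTable, nullptr,
                       CLSCTX_INPROC_SERVER, IID_PPV_ARGS(&pGit));
    DWORD cookie = 0;
    pGit->RegisterInterfaceInGlobal(pUnk, IID_IUnknown, &cookie);
    pGit->Release();
    return cookie;
}

// Thread B (after CoInitializeEx): fetch a proxy valid in this apartment.
IUnknown* GetFromGit(DWORD cookie)
{
    IGlobalInterfaceTable* pGit = nullptr;
    ::CoCreateInstance(CLSID_StdGlobalInterfaceTable, nullptr,
                       CLSCTX_INPROC_SERVER, IID_PPV_ARGS(&pGit));
    IUnknown* pProxy = nullptr;
    pGit->GetInterfaceFromGlobal(cookie, IID_PPV_ARGS(&pProxy));
    pGit->Release();
    return pProxy;
}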
Problem with Shared State
Many threading problems stem from shared state. Shared state requires synchronization and is therefore a performance killer: increasing the number of threads from one to two can reduce performance by 20% or even more. Take this multithreaded shared-atomic incrementing example from the section above.
std::atomic<int> count = 0;
std::for_each(std::execution::par, Vec.begin(), Vec.end(),
[&count](size_t num) -> void
{
if (is_valid((int)(num)))
++count;
});
It takes 1007ms to complete, while the multithreaded non-shared version below, where each thread has its own counter, takes 73ms. The loop::parallel_for used here is a new library I am writing; it has lower performance than C++17’s parallel for_each because every lambda invocation must pass an extra thread-index argument.
struct CountStruct
{
    CountStruct() : Count(0)
    {
        memset(buf, 0, sizeof(buf));
    }
    int Count;
    char buf[128]; // padding against false sharing
};
...
int threads = std::thread::hardware_concurrency();
std::vector<CountStruct> count(threads);
loop::parallel_for(threads, (size_t)(0), (size_t)(VEC_SIZE),
    [&count](int threadIndex, int index) -> void
    {
        if (is_valid(Vec[index]))
            ++(count[threadIndex].Count);
    });
int total = 0;
for (auto& st : count)
    total += st.Count;
However, 99.99999% of the time, things are not as simple as giving every thread its own counter or object. Synchronization can be avoided by giving each thread a copy of the data, but that usually involves heap allocation, and heap allocation (and deallocation) has considerable overhead that may not be amortized, so the performance gain can fail to materialize. There is also the inherent risk of operating on stale data. As complexity grows in pursuit of share-free code, bugs can creep in; making your code share-free is a massive undertaking and a decision that should not be made lightly. Do note that some resources, such as network bandwidth and database connections, are inherently shared even when no explicit locks are taken.
The summary of the counting benchmark is as follows. Surprisingly, taking fewer mutex locks (33ms) beats taking no lock at all (73ms), even though I have taken care to eliminate false sharing. The likely reason is that the no-lock version passes the additional threadIndex parameter on every lambda invocation.
inc mutex: 3136ms
inc atomic: 1007ms
inc win Interlocked: 1005ms
inc no lock: 73ms
inc single_thread: 86ms
inc less mutex lock: 33ms
Download the MultithreadedCount benchmark here.
You can compile the benchmark on Linux with the commands below. Remember to copy timer.h and parallel_for_each.h to your Linux folder. The Linux build automatically excludes the Windows Interlocked primitives.
g++ MultithreadedCount.cpp -O3 -std=c++17
clang++ MultithreadedCount.cpp -O3 -std=c++17
Wrapping Up
Four years have passed since my dentist visit in 2020. As I deepened my research on the subject, the list of mistakes grew from 5 to the final 30 items. To keep the scope manageable, I cover only the C++ Standard Library and Windows. Graphics processing units (GPUs) and other operating systems, such as Linux and macOS, are left out of the discussion. Lock-free data structures are also not discussed because I lack fluency in that subject.
At the end of this journey, I have become adept at writing bug-free multi-threading code. By writing this article, I hope to raise awareness of multi-threading pitfalls and the solutions to avert them.
References
- Concurrent Programming on Windows by Joe Duffy
- Windows Internals, Part 1 by Pavel Yosifovich, Mark E. Russinovich, Alex Ionescu, David A. Solomon
- Windows Internals on Pluralsight by Pavel Yosifovich
- Essential COM by Don Box
History
- 15th February, 2024: Updated all the source code downloads except TestGetCurrentThread
- 12th February, 2024: Updated the RAIICSLock and PostMessage source code downloads
- 11th February, 2024: First release