Fast critical sections with timeout

valdok

4.85/5 (46 votes)

Apr 12, 2007

CPOL

20 min read

242628

2054

A faster and lighter ciritcal section implementation.

Download source - 2.7 KB

Introduction

This article observes critical sections: what are they and what are they for, and how they're implemented in Windows. Next, we'll see an implementation of critical sections with some performance advantages and extra functionality (timeout).

Readers who are not very experienced in multi-threaded applications will find this article educational, since it covers the synchronization problems in all the details. Those who are experienced in multithreaded, high-performance applications may find my optimization useful in some cases, though it's not a revolutionary performance breakthrough (critical sections are already fast enough, in most of the cases). Those can skip right to the final section.

What is synchronization?

In operating systems such as Windows and Linux, the decision to grant a processor time to a particular thread is up to the system. An application is not allowed to interfere the threads scheduling mechanism. Even assigning different priorities to threads does not guarantee the order of execution. In particular, two consequent processor instructions are not guaranteed to execute without preemption by another thread.

In multi-threaded applications, there's usually a need to access some resource from within different threads. Depending on the type of the resource/object we access, we may or may not admit simultaneous access to it. For example, there's no problem when multiple threads read some global parameter simultaneously. There's, however, a problem if you want to append a node in a shared linked list, for example. Inserting a node into a linked list involves several steps, hence we can be preempted in the middle, and until we resume, the linked list is left in the inconsistent state. We say that insertion into a linked list is not an atomic operation (atom, in Greek, means non-dividable).

So, we need a solution. As was said already, we can not "ask" the system not to preempt us until we finish the operation. It may seem that we can solve this simply by keeping a boolean variable, which we set to true when we access the shared resource, and set to false when we finish; then, before we access it, we may check this variable. But, this naive method does not solve the problem: querying a variable and then modifying it is not atomic, you can be preempted after querying the variable and just before modifying it.

The correct solution was designed into the processors. Modern processors have some instructions that perform several steps, and still execute as atomic operations. For example, on i386 processors (modern Intel and AMD processors are their successors), there're xadd and cmpxchg instructions. xadd adds a specified value to the operand, and then returns its previous value. cmpxchg modifies the operand to a specified value if the operand was equal to another specified value. But, what makes those instructions so valuable for us is the ability of the processor to execute them atomically. That is, a thread can not be preempted in the middle so that xadd did only a part of the job.

So far, we have atomic operations that perform several steps; this gives us immunity against the threads scheduling mechanism. But, there's also another factor we must take into account: multi-processor systems. In machines where several processors (or other on-board devices) work with the same memory location, extra care must be taken to restrict simultaneous access to the same variable. Because of this fact, those atomic instructions we've discussed earlier should be executed with the lock prefix, which signals all other devices in the system that a specific memory location is being modified. This affects the performance (tens to hundreds of cycles, depending on the processor), but this is the only way to synchronize between processors. It may seem for some that multi-processor systems are something very special intended for 'monster' servers, not seen every day, but this is not true; today's modern home computers are already dual or quad processor, soon there won't be such a thing as a single-processor computer.

Such operations are called interlocked, those that restrict simultaneous access to a variable. And, they are the basis for all the synchronization mechanisms in all operating systems. (In kernel mode, there's a mechanism that guarantees the order of execution, but still the lock prefix is needed in multi-processor machines.)

Synchronization in Win32 API

In Win32, interlocked functions are available through the standard API (exported by kernel32.dll); you don't have to be an assembler guru to work with them. All of them have the InterlockedXXXX prefix. Examples are InterlockedIncrement, InterlockedDecrement, InterlockedExchangeAdd, and etc.

All this is very good, but let's get down from the skies back to our goats. How can all this help us to add a node to a shared linked list? Up until now, we haven't seen something like InterlockedAppendNodeToMyLinkedList. So, let's try to implement it using the functionality we have.

A simple way to do this is the following: initialize some variable, say nLockCount, by some known value, say 0. Next, at the beginning of the function, put the following loop:

while (InterlockedCompareExchange(&nLockCount, 1, 0));

The first caller of InterlockedCompareExchange will set the nLockCount to 1 and receive 0, all others will receive 1 without affecting nLockCount. Even if two threads, and even on different processors, enter this loop simultaneously (nearly), only one of them will receive 1. After the loop is performed, we can do whatever we want: add nodes to a linked list, reallocate buffers, everything. Does it mean that we can't be preempted by another thread? Of course, no. The system runs its scheduling mechanism as usual. But, if we're interrupted and some other thread calls our function - it will just hang on this loop. Until eventually our thread is resumed, finishes its work (whatever it is), and sets the nLockCount back to 0. After this point, one of the "waiting" threads (if there is one) will finish the loop when it is resumed, and so on.

We say that we've implemented a synchronization mechanism, and nLockCount is our shared synchronization object. A situation in which a thread attempts to lock an object that is already locked by another thread is called synchronization collision. Our mechanism detects such collisions, and prevents simultaneous access to the shared resource. Let's now analyze the advantages and drawbacks of what we've just done.

First, suppose, from within a locked section (between the locking loop and setting nLockCount back to 0), we call another function, which, in turns, wishes to lock the same resource. The locking loop will hang because the nLockCount is already 1, and there's no one that's going to unlock it. So, we have a deadlock. There're two things we can do about it: either always be aware about which synchronization objects we've already locked before calling your functions, or we can enhance our synchronization mechanism to remember which thread owns it, hence skipping lock/unlock if it is already owned by the calling thread.

(This doesn't mean we've solved the deadlock problem completely. Suppose a thread owns A and waits for B, whereas another thread owns B and waits for A. Deadlock problems may arise from incorrect general design, and they can't be solved by "enhancements" of the synchronization objects themselves. There're different models of how to design the whole application correctly, we'll skip the discussion about this because it's endless.)

Now, about the performance. Our synchronization object is very light in terms of the memory it consumes: just a single 32 bit variable, or two variables in case we want to remember which thread owns us. Locking/unlocking it is also fast. Well, not faster than the lock prefix can afford, but that's what we have. But, what is the cost of the synchronization collision for our method? On collision, we just wait in the loop until the object is unlocked, this is called spinning. This is OK if we're working on multi-processor systems and usually locking our object for small time durations, so that after several attempts, the thread is likely to acquire the lock. But, in other cases, we have a situation where the OS gives a processor time to a thread, and all this valuable time is just spent in spinning, without reasonable chance (on single-processor machines - without any chance) to succeed. Isn't it better to "tell" the OS somehow that we don't need the processor time right now?

This can be easily fixed. Modify the locking loop so that if you fail to acquire the lock - put Sleep(0), this will tell the OS that we don't need the rest of our processor time portion. Well, this helps, but the reality is even more cruel. Attaching/detaching a thread for execution, the so-called context-switch, is a complex operation for the OS, and it costs the processor time. Although we saved the rest of the processor time from obsolete spinning, still it would be better to let the system know preliminarily that our thread should not execute right now, to omit the context-switch to it. In order to implement this, there should be some agreed mechanism that informs the system about our synchronization pattern.

Win32 synchronization objects

We can make the OS aware of our synchronization intentions by using its synchronization objects. There're different types of these objects, the most common are events, mutexes, semaphores, and critical sections. They're created by CreateEvent, CreateMutex, CreateSemaphore, and InitializeCriticalSection, respectively. I will not get deeply into the details of every one of them, they are well documented in the MSDN. I'd just like to point to some of their key differences:

Only critical sections and mutexes "remember" which thread owns them. (By the way, that's the only difference between a mutex and the auto-reset event.) They are intended mainly for the situation we've described: per-thread locking of a shared resource. Other objects may be used for this purpose too; however, that's not their intended use.
All of them, except critical sections, are kernel objects (hence, you free them by CloseHandle). Below, we'll discuss in depth what it means for us.

To acquire kernel synchronization objects, there're the WaitXXXX (and MsgWaitXXXX) functions in the Win32 API. The most basic is WaitForSingleObject. If the object(s) is available, then it is acquired in-place, and the function returns immediately. Otherwise, the OS suspends the caller thread and gives the rest of its processor time to another one. Furthermore, the OS will not execute this thread unless the object(s) it's waiting for can be acquired.

It is also worth to note that the same kernel object may be accessed from different processes (applications), thus enabling synchronization between applications. (In fact, even non-kernel mode objects can be shared between processes using a shared memory mechanism.)

Sounds good. But unfortunately, we have the drawbacks too. Any operation on a kernel object is performed by the system in the kernel mode, leading to a kernel-mode transaction, which has significant implications. Even if the object is available and acquired in-place by WaitForSingleObject, calling this function too often can lead to a serious performance hit. Interlocked operations cost us from tens to hundreds of processor cycles, whereas every kernel-mode call, including WaitForSingleObject and ReleaseMutex, for example, cost thousands of cycles, so that on multi-processor systems for short-time locking - spinning may be a preferred way to go.

Critical Sections

Minding all the above, we introduce critical sections. It's a hybrid. When you attempt to lock it (call EnterCriticalSection), the idea is to perform the following steps:

Check if this thread already owns the critical section. If it does, the lock/unlock is omitted (skip the rest).
Attempt to lock a dedicated variable via the interlocked instruction (similar to what we've done). If the lock succeeds, return (skip the rest).
Optionally, retry the second step a number of times. This can be enabled by calling InitializeCriticalSectionAndSpinCount instead of InitializeCriticalSection. This step is always skipped on single-processor machines.
After we've tried all of the above, call a kernel-mode WaitXXXX function.

In such a way, we can achieve the highest possible performance: lock as fast as possible if no collision, and also suspend for as much as it takes if the critical section is busy.

Note that implementing this, however, is not as trivial as it may seem. For example: when unlocking, you must somehow know that someone is already waiting, to signal the kernel object for it. Also, depending on how exactly you implement it, the successful return of the WaitForSingleObject function may not mean yet that the critical section is acquired; you may need to repeat all the steps again.

And also, there's a choice here of performance vs. fairness: if there're threads currently suspended, then the critical section is eventually unlocked; but, before the system schedules one of those threads, comes another one. Should it immediately acquire the lock (performance), or just wait like all the others (fairness)?

In Windows API, the answer on this question is the performance priority (comes out from experiments). And, that's logical in my opinion: the whole idea of critical sections is to achieve maximal performance. Also, the WaitXXXX functions do not guarantee absolute fairness.

Also, in Windows, the kernel waitable object is created only upon the first collision, not straight at the beginning.

Sounds good so far. So, what can be optimized then?

Optimizations

Let's now see in-depth what we have already.

First of all, Windows interlocked functions are very fast. Furthermore, the Microsoft VC++ compiler supports intrinsic interlocked functions (see the underscored versions, _InterlockedXXXX in MSDN). Actually, I could not squeeze any more, even at the assembler level.

So, what's wrong with the standard Windows critical sections?

Let's look at the definition of the CRITICAL_SECTION C/C++ structure (defined in winnt.h). That's how it is defined:

typedef struct _RTL_CRITICAL_SECTION {
    PRTL_CRITICAL_SECTION_DEBUG DebugInfo;

    //
    //  The following three fields control entering and exiting the critical
    //  section for the resource
    //


    LONG LockCount;
    LONG RecursionCount;
    HANDLE OwningThread;        // from the thread's ClientId->UniqueThread

    HANDLE LockSemaphore;
    ULONG_PTR SpinCount;        // force size on 64-bit systems when packed

} RTL_CRITICAL_SECTION, *PRTL_CRITICAL_SECTION;

Some explanation about the members: LockCount is the parameter to the interlocked operations, OwningThread is the owning thread ID (although declared as HANDLE, it is a DWORD), RecursionCount specifies how many times the owning thread has acquired the critical section, SpinCount is the number of additional attempts that should be performed before going into the suspended state, and the LockSemaphore is the kernel object handle - however, it seems to be an auto-reset event rather than a semaphore.

In addition, there's a DebugInfo member, which is our first problem. In fact, when you call InitializeCriticalSection besides initializing the variables of the structure, the Windows API also allocates an additional structure and stores a pointer to it in the DebugInfo member. In such a way, Windows tracks all your critical sections. The question is, what is this for? It is stated that in such a way, Windows can automatically detect deadlocks in the program. In my opinion - stupid idea. At least, there has to be an option to initialize a critical section without this DebugInfo.

First of all, I've never seen this deadlock detection mechanism working, even when a deadlock really happened (should it be enabled somehow?). Also, tracking all the critical sections alone does not do the deadlock detection, since there may be other synchronization objects too. If so, the deadlock detection should rather be something implemented in the kernel mode. But anyway, there's no ultimate solution for this: an application can also use infinite spin-locks, it's not forbidden. In my opinion, deadlocks should be prevented from the beginning by correctly designing the application. That's something you may need in debug time only.

The CRITICAL_SECTION structure consists of six 32 bit members, but because of the DebugInfo, the real memory consumption is more than 24 bytes. RTL_CRITICAL_SECTION_DEBUG takes another 32 bytes, plus it seems to be allocated on heap, which makes memory consumption even larger. From my experiments, it's about 100 bytes.

Also, creating/destroying such critical sections is not zero time-consuming. Allocating/freeing memory on the heap takes some time, plus it needs additional synchronization.

This impact is insignificant for applications that work with some constant set of critical sections, but in some cases, you need to create/destroy them on-the-fly. That's where we have a performance hit.

Next, I noticed that calling EnterCriticalSection leads to a single interlocked instruction when there's no collision, and this is OK. What was a bad surprise for me was that calling LeaveCriticalSection also involves an interlocked instruction. This is a pity, because it is possible to implement a critical section so that unlock will not use the lock prefix at all.

I believe that this is so because of historical reasons, and up until now, no one has fixed it. On early 386 processors, there was no cmpxchg instruction. Instead, you could only use inc and dec with the lock prefix. You could not check the variable and modify it only if it equals to something in an atomic operation. Also, you could not retrieve its previous value (like xadd offers). Instead, you could only know if upon inc/dec, this variable became negative, positive, or zero, plus its parity (by testing the CPU flags). Note also that InitializeCriticalSection sets the LockCount member to -1. Why? Because, executing lock inc on it will set the ZR CPU flag to 1 for the first caller, others will get 0.

But, a real bad surprise for me was when I noticed that even recursive lock/unlock of the critical section leads to interlocked instructions. What for? You know what, I could sincerely forgive all of the above, but this barbarity crosses all the barriers. The only explanation for this I can ever imagine is that critical sections were designed in such a way, that you can enter it from one thread, and leave in another. But, what kind of a programmer would do such a thing? Critical sections (and mutexes) are especially designed to be a per-thread locking object.

Plus, there's no timeout option designed. You can either specify to wait forever, or try to lock without waiting at all (TryEnterCriticalSection). Why not implement it too, if it's not too complex?

Alternative implementation

Minding all of the above, I've decided to write my own critical section. I'll not get too deep into the code discussion, we've already discussed this plenty. It's possible (though a bit tricky) to figure out how it works. Just to note: implementing a timeout (without performance degradation) was not so trivial as it may seem for many.

Let's just point to some key features of it.

A critical section is encapsulated in a CritSectEx C++ object. On construction, you may set the maximal spin count for it (like InitializeCriticalSectionAndSpinCount). You can also change it later by the SetSpinMax method; even in the middle of the work, it's absolutely safe.

We've been told also that on collision, we create a kernel synchronization object silently, but you can also pre-create it from the beginning if you want, by AllocateKernelSemaphore. (This functionality is available in Windows, hence I decided to provide it too.) Again, it's safe to call it during the use.

You can use this critical section object in a traditional way: call Lock and Unlock on it; however, I wouldn't recommend that. Instead, it is better to use a Scope locking technique, something like this:

// in some place you want to use a critical section declared as myCs
{
    CritSectEx::Scope scope(myCs);
    // access the shared resource
    // ...

}
//Here the critical section has been unlocked

To use it with timeout, you can write something like this:

{
    CritSectEx::Scope scope(myCs, 500); // 500 milliseconds timeout

    if (scope)
    {
        // access the shared resource
        // ...

    }
}

In my opinion, this should the the correct way to work with critical sections. Cases where lock and unlock should be separated are pretty rare. And even there, the Scope gives you flexibility: you may call Lock and Unlock explicitly on the Scope variable. For example: you want to lock a CS in a function, and then unlock it somewhere after it returns. This can be done in the following way:

void SomeFunc(CritSectEx::Scope& scope)
{
    // ...

    // at some point acquire the lock

    scope.Lock(myCs);
    // ...

}
void OtherFunc()
{
    // ...

    // at some point call SomeFunc.

    CritSectEx::Scope scope; // empty constructor doesn't lock anything

    SomeFunc(scope);
    // the CS is still locked

    // ...

}
// The CS is finally unlocked when OtherFunc returns

The idea is to avoid situations where you 'forget' to unlock the CS. Why not make the C++ compiler do this job? It is responsible for tracking the lifetime of C++ objects and calling their destructors, when necessary. Hence, we only need to ensure the unlocking in the destructor, and make the lifetime of the Scope appropriate.

Anyway, you can also use explicit Lock and Unlock. However, it's a bit more complex than in the standard critical sections: Upon return, Lock passes you an extra parameter that you should later pass to Unlock. This is so because, I've omitted one extra parameter in my implementation, something similar to RecursionCount in the standard critical sections. This is justifiable.

Here, I give some numbers for performance comparison of my critical sections with the standard Windows ones. The results are pretty accurate, though every test in a multi-threaded environment depends on dozens of miscellaneous factors. Tests were performed on Intel® Pentium® 4 and Intel® Core™ Duo processors. Besides, we have two operating systems: Windows Server 2003 Standard Edition, and Windows XP Professional.

These are the results for the standard critical sections:

OS	CPU	Init + Uninit (cycles)	Lock + Unlock (cycles)	Recursive Lock + Unlock (cycles)	Memory consumption (bytes)
Server 2003	P4 D	977	250	138	100
Professional	P4 (earlier)	7660	401	388	100
Professional	Duo	6672	85	90	100

First of all, there's a clear difference between the two operating systems. On Windows XP Professional, init + uninit is extremely slow. Well, this has an explanation: on Windows XP, heap functions are implemented in the kernel mode, yet another barbarity. Besides, on Server 2003, recursive lock + unlock is optimized (a single interlocked operation).

We also see a difference between the processors. The Intel® Core™ Duo processor is really impressive. Although with a bit lower clock speed, it is a real monster (in fact - two monsters). Interlocked operations are greatly optimized, which is important for processors cooperation. (This is not an Intel advertisement; in fact, I haven't tested it on AMD processors, maybe they give even better performance.) Kernel mode transactions, however, are still slow.

Now, let's see our critical sections in action:

OS	CPU	Init + Uninit (cycles)	Lock + Unlock (cycles)	Recursive Lock + Unlock (cycles)	Memory consumption (bytes)
Server 2003	P4 D	5	116	16	16
Professional	P4 (earlier)	8	198	15	16
Professional	Duo	5	45	1	16

There's no significant difference between operating systems, which is expected. The only significant operation is the non-recursive lock + unlock, which is twice faster than what the standard critical sections offer. All other operations are negligible, in comparison.

And, again we see a clear advantages of the Intel® Core™ Duo processor. It may seem that with such a processor our optimization becomes less significant, but this is not accurate. Yes, it executes interlocked operations in less cycles, but there're many other instructions that it probably executes faster too, so the relative impact on the performance may be nearly the same.

Conclusion

Some will probably laugh. "What is this for?", "You will never be able to see the difference", "Better to use something standard", "Yet another bicycle inventor". I disagree, and there's no point to argue about this. Take into account that in a single cycle, modern processors execute several instructions, and you'll see that an interlocked operation may cost about a thousand (!!!) regular instructions. When you write a hi-performance application that should squeeze the maximum from the computer, you have to care about how to enable the CPU to give its maximum. In particular, sometimes, it is worth to use a per-thread heap (without serialization), or even consider non-thread-safe reference counters for objects unless they are really referenced in different threads. Some believe that writing a program in a multithreaded way already makes it faster. Not true, my friend, improper design kills the performance totally, it'll work even slower than if it would be single-threaded.

Believe it or not, but once I managed to optimize a server so that it became 30 times faster by just removing/caching heap allocations, omitting kernel-mode transactions, and etc. No one could believe me until I demonstrated the numbers. That's the sad truth: our computers are super fast; unfortunately, we don't usually use them correctly.

Back to our critical sections. Well, not a breakthrough in most of the cases, I admit. But:

There're situations where the impact may be significant.
Nothing to lose anyway, we have no disadvantages compared to standard critical sections.
Timeout ability.

Well, some will probably value the timeout (unless those who are already decided to work with mutexes because the standard critical sections lack timeout, and see no significant benefits in critical sections).

I will appreciate comments, both positive and negative. But, I say it again: please don't try to convince me that there's no significant performance difference.