Click here to Skip to main content
Click here to Skip to main content

User-Level Spin Locks

By , 23 Oct 2000
 
"The concept of mutual exclusion is a crucial one in operating systems development. It refers to the guarantee that one, and only one, thread can access a particular resource at a time. Mutual exclusion is necessary when a resource doesn't lend itself to shared access or when sharing would result in an unpredictable outcome... If one thread reads a memory location while another writes to it, the first thread will receive unpredictable data... The same error could occur on a single processor system if the operating system were to perform a context switch while the two threads are accessing this non-shareable resource or critical section... The biggest area of concern is interrupts. The kernel (or a thread) might be updating a global data structure when an interrupt occurs whose interrupt-handling routine also modifies the structure."

Excerpt from the book Inside Microsoft Windows 2000, Third Edition, by David Solomon and Mark Russinovich...

1. What is a Spin Lock?

A Spin Lock is an inter-process or task synchronization mechanism like a Mutex, Semaphore or a (Win32) Critical Section. Spin Locks are often used in Symmetric Multiprocessor Operating System Kernels. Because sometimes it is more efficient to constantly poll for the availability of a lock rather than...

  1. Pre-empting the thread (doing the context switch), schedule another thread
  2. Maintaining and altering the extra housekeeping structures ("who" owns the synchronization object or lock, how long, has the timeout interval elapsed,...)
  3. Doing the extra work (of C function calls or even of Assembler routines) for a fair first-in-first-out mechanism.

Compared with the Win32 CRITICAL_SECTION structure and its functions, and certainly with any other synchronization mechanism like advanced mutexes and semaphores, a Spin Lock is certainly less fair (there is no protection - in this version anyway - against the theoretical "starvation” of a thread that is forever-waiting for the resource to unlock).

Also, it is more "unsafe" for your programs: if a Thread holding the lock unexpectedly dies (without freeing the lock), the presented Spin Lock has no way of knowing this, hence your threads waiting on the resource will also not be able to recover and thus "spin forever", burning CPU cycles.

But if used very carefully, a Spin Lock is in certain situations a lot faster than any other synchronization mechanism your operating system has to offer (unless it also provides user accessible Spin Locks, of course). These classes (a not-re-entrant and a re-entrant version of a USER-LEVEL SPIN LOCK) were constructed because they are a "semi-portable" synchronization mechanism for threads or tasks within the same process or address space. You can use this lock mechanism in software running on top of an OS, and also on a system without an OS where you have to protect queues, buffers, lists, structures, etc... against interrupt routines, which can leave your shared data in an inconsistent state. This situation has a lot of similarities to a thread synchronization problem (race condition) for a critical section of a standard Win32 multi-threaded program.

All you have to do is to port one (1) function to your specific CPU platform. (The i386 implementation is only 4 instructions long).

When first investigating these synchronization issues for no-OS-Embedded systems, I came across software locking mechanisms such as Peterson's Algorithm (for protecting a critical section against data corruption by 2 competing processes or tasks) and Lamport's Bakery Algorithm (for protecting a critical section against data corruption by N competing processes or tasks). Both solutions (and their derivatives) make the assumption that, reads and writes are "atomic". This just means that, reads and writes happen in a not-interruptible sequence of (CPU) instructions (or preferably only one CPU instruction), which is not the case for most hardware platforms – just take a look a the number of (i386) processor instructions required to store or write a value into a variable.

This explanation brings us to the standard way of avoiding data corruption in the no-OS-world when memory is shared between interrupt routines and a 'normal' – not-interrupt-driven – part of the program (what is mostly referred to as the "main loop"): 

=> Temporarily disabling the interrupts from within the main loop when retrieving data from the shared buffer.

Pretty 'lame' - as they say - because this solution can (in extreme cases) lead to the "loss of events", if the 'retrieve' function in the main loop takes too long to complete.

The solution here is to find a processor specific instruction that is SMP safe and executes in one not-interruptible sequence. As an example, the SWAP instruction is mostly used. This simple but ingenious idea is, in literature, also referred to as the "Test And Set" method.

2. The Algorithm

The Hardware algorithm for a binary semaphore using the Test and Set method:

wait(s)
{
   do
   {
      set s = 0; 
      if (s == 0 && previous value of s == 1)
      {
         break; // we obtained the lock
      }
   }
   while (true);
}

signal(s)
{
    set s = 1; // release the lock
}

Just to set the dot on the 'i': 

This is nothing new; you can find this in any good OS internals-explaining textbook.

3. The Spin Lock implementation:

3.1 Non-reentrant spin lock

// THIS LOCK IS NOT-REENTRANT !!
class CPP_SpinLock
{
public:    // inlined constructor
    // inlined NON-virtual destructor
    inline CPP_SpinLock() : m_s(1) {}
    inline ~CPP_SpinLock() {}    
    
    // enter the lock, spinlocks (with/without Sleep)
    // when mutex is already locked
    inline void Enter(void)
    {
        int prev_s;
        do
        {
            prev_s = TestAndSet(&m_s, 0);
            if (m_s == 0 && prev_s == 1)
            {
                break;
            }
            // reluinquish current timeslice (can only
            // be used when OS available and
            // we do NOT want to 'spin')
            // HWSleep(0);
        }
        while (true);
    }
    // Tries to enter the lock, returns 0
    // when mutex is already locked, 
    // returns != 0 when success
    inline int TryEnter(void)
    {
        int prev_s = TestAndSet(&m_s, 0);
        if (m_s == 0 && prev_s == 1)
        {
            return 1;
        }
        return 0;
    }
    // Leaves or unlocks the mutex
    // (should only be called by lock owner)
    inline void Leave(void)
    {
        TestAndSet(&m_s, 1);
    }
    
    protected:
    // sets BIT value and returns previous
    // value.in 1 atomic un-interruptable operation
    int TestAndSet(int* pTargetAddress, int nValue);
    
    private:
    int m_s;
};

// This part is Platform dependent!
#ifdef WIN32
inline int CPP_SpinLock::TestAndSet(int* pTargetAddress, 
                                              int nValue)
{
    __asm
    {
        mov edx, dword ptr [pTargetAddress]
        mov eax, nValue
        lock xchg eax, dword ptr [edx]
    }
    // mov = 1 CPU cycle
    // lock = 1 CPU cycle
    // xchg = 3 CPU cycles
}
#endif // WIN32

3.2 Reentrant Spin Lock

// THIS LOCK IS REENTRANT !!
// Thanks to George V. Reilly for pointing out
// the flaws in the first inefficient implementation.
// So here is the second try.
// NB PNumber is used as the Task, Thread or
// Process Number, on Windows you should
// pass GetCurrentThreadId() into it.
// For Embedded systems you pass your owned
// assigned task (or interrupt) numbers.
class CPP_SpinLock_Reentrant_Hopefully_Less_Clumsy
{
public:

    inline 
      CPP_SpinLock_Reentrant_Hopefully_Less_Clumsy()  
      m_nLockCount(0), 
      m_nOwner(0xffffffff) {};

    inline 
      ~CPP_SpinLock_Reentrant_Hopefully_Less_Clumsy() {};

    inline void Enter(unsigned int PNumber)
    {
        if (PNumber == m_nOwner)
        {
            m_nLockCount++;
            return;
        }
        m_RealCS.Enter();
        m_nOwner = PNumber;
        m_nLockCount++;
        return;
    }

    inline int TryEnter(unsigned int PNumber)
    {
        if (PNumber == m_nOwner) // we own it
        {
            m_nLockCount++;
            return 1;
        }
        if (m_RealCS.TryEnter())
        {
            m_nOwner = PNumber;
            m_nLockCount++;
            return 1;
        }
        return 0;
    }

    inline void Leave(unsigned int PNumber)
    {
        assert(PNumber == m_nOwner);
        m_nLockCount--;
        if (m_nLockCount == 0)
        {
            m_nOwner = 0xffffffff;
            m_RealCS.Leave();
        }
    }

protected:

private:
    CPP_SpinLock m_RealCS;
    unsigned int m_nLockCount;
    unsigned int m_nOwner;
};

4. Spin Lock Usage

4.1 When should you use a Spin Lock?

Only when it makes sense not to do a context switch in the event of a locked resource:

  • The time the thread has to wait is smaller than the amount of time to do the context switch. (This is only true for SMP operating systems running on SMP hardware! If there is no real parallelism and you have a pre-emptive OS, you should always prefer a context switch. Which thread is going to free the resource anyway on a uni-processor system?)
  • The chance of a resource conflict (= the chance of having to actually wait for the lock) in the critical section is either very small or the average waiting time is less than the time needed for the OS to do the context switching.
  • The critical section itself is very small and the number of checks is very high.
  • You don't have another synchronization mechanism because you do not have an OS.

4.2 Benefits of using a Spin Lock:

Depends on the situation, of course (as we will see in a couple of tests): speed. An NT Critical Section check (and the resource is 'free' – you have a "green light") takes about 6 CPU cycles. (I haven't checked this - it depends also on the type of CPU - but I remember reading it somewhere. N.B.: You can always do further testing yourself if you wanted...).

5. The tests

The test system is a dual PII 350 MHz (2 CPUs), 256 MB RAM and running Windows 2000 Advanced Server Edition.

Depending on the situation, the NOT-REENTRANT version is about 1 to 5 times faster (and remember it is sometimes also slower and wasting resources) than the EnterCriticalSection/LeaveCriticalSection stuff, which is the fastest Win32 synchronization mechanism. The REENTRANT version, however, is a lot (more than 2 times) slower and is only provided for no-OS Embedded systems programmers. I am sure the implementation can be a lot faster when using more 'assembler', but that would decrease the portability (A better re-entrant algorithm would also do the trick). So if you do not need re-entrance, do not use the re-entrant version!

In all the tests, the higher the number, the better the performance. There is no absolute proof to be found in these tests. It just helps you to reflect on things. Let common sense prevail.

5.1 Test 1:

This test code will generate a lot of collisions in the critical section, so other differences will emerge when collisions are sparse.

double value = 0;
double prevvalue = 0;
bool stop = false;

CPP_SpinLock theTestMutex;
CRITICAL_SECTION theCS;

void IncreaseValue(unsigned int PNumber)
{
    theTestMutex.Enter();
    // EnterCriticalSection(&theCS);
    prevvalue = value;
    value++;
    if (prevvalue != value-1)
    {
        MessageBox(NULL, "Oh Oh Oh!", 0, MB_OKCANCEL);
    }
    theTestMutex.Leave();
    // LeaveCriticalSection(&theCS);
}

DWORD WINAPI ThreadProc( LPVOID lpParameter )
{
    while (!stop)
    {
        IncreaseValue((unsigned int) lpParameter);
    }
    return 0;
}

int main(int argc, char* argv[])
{
    HANDLE hThread1 = NULL;
    HANDLE hThread2 = NULL;
    DWORD dwThreadID1 = 0;
    DWORD dwThreadID2 = 0;

    InitializeCriticalSection(&theCS);

    hThread1 = ::CreateThread(NULL, 0, 
      ThreadProc, (void*) 0, 0, &dwThreadID1);
    hThread2 = ::CreateThread(NULL, 0, ThreadProc, 
      (void*) 1, 0, &dwThreadID2);

    WaitForSingleObject(hThread1, 10000);

    stop = true;

    WaitForSingleObject(hThread1, INFINITE);
    WaitForSingleObject(hThread2, INFINITE);

    CloseHandle(hThread1);
    CloseHandle(hThread2);

    DeleteCriticalSection(&theCS);

    cerr << "\nNumber Increases:" 
             << value << "\n";

    return 0;
}
Using the CRITICAL_SECTION
Run1 output: Number Increases:1.2227e+006
Run2 output: Number Increases:1.29654e+006
Run3 output: Number Increases: 1.23221e+006
Run4 output: Number Increases: 1.15439e+006

Using the Spin Lock 
Run1 output: Number Increases: 8.43261e+006
Run2 output: Number Increases: 9.13226e+006
Run3 output: Number Increases: 8.1077e+006
Run4 output: Number Increases: 7.93513e+006

This certainly would prove my point, but don't cheer just yet. The performance differences are subtler than that.

5.2 Test 2:

Now, we change the test code, so we no longer have collisions on the lock:

double value[2];
double prevvalue[2];
bool stop = false;

CPP_SpinLock theTestMutex[2];
CRITICAL_SECTION theCS[2];

void IncreaseValue(unsigned int PNumber)
{
    theTestMutex[PNumber].Enter();
    // EnterCriticalSection(&(theCS[PNumber]));
    prevvalue[PNumber] = value[PNumber];
    (value[PNumber])++;
    if (prevvalue[PNumber] != value[PNumber]-1)
    {
        MessageBox(NULL, "Oh Oh Oh!", 0, MB_OKCANCEL);
    }
    theTestMutex[PNumber].Leave();
    // LeaveCriticalSection(&(theCS[PNumber]));
}

DWORD WINAPI ThreadProc( LPVOID lpParameter )
{
    while (!stop)
    {
        IncreaseValue((unsigned int) lpParameter);
    }
    return 0;
}

int main(int argc, char* argv[])
{
    HANDLE hThread1 = NULL;
    HANDLE hThread2 = NULL;
    DWORD dwThreadID1 = 0;
    DWORD dwThreadID2 = 0;

    InitializeCriticalSection(&(theCS[0]));
    InitializeCriticalSection(&(theCS[1]));

    value[0] = 0;
    prevvalue[0] = 0;
    value[1] = 0;
    prevvalue[1] = 0;

    hThread1 = ::CreateThread(NULL, 0, 
      ThreadProc, (void*) 0, 0, &dwThreadID1);
    hThread2 = ::CreateThread(NULL, 0, ThreadProc, 
      (void*) 1, 0, &dwThreadID2);

    WaitForSingleObject(hThread1, 10000);

    stop = true;

    WaitForSingleObject(hThread1, INFINITE);
    WaitForSingleObject(hThread2, INFINITE);

    CloseHandle(hThread1);
    CloseHandle(hThread2);

    DeleteCriticalSection(&(theCS[0]));
    DeleteCriticalSection(&(theCS[1]));

    cerr << "\nNumber Increases:" 
      << value[0] + value[1] << "\n";

    return 0;
}
Using the CRITICAL_SECTION
Run1 output: Number Increases: 1.01325e+007
Run2 output: Number Increases: 1.06665e+007
Run3 output: Number Increases: 1.13035e+007
Run4 output: Number Increases: 1.04613e+007

Using the Spin Lock 
Run1 output: Number Increases: 1.04721e+007
Run2 output: Number Increases: 1.13043e+007
Run3 output: Number Increases: 1.17862e+007
Run4 output: Number Increases: 1.09453e+007

In my opinion, there is hardly a significant difference, although the spin lock seems to do a slightly better job.

5.3 Test 3:

double value;
double prevvalue;
bool stop = false;

CPP_SpinLock theTestMutex;
CRITICAL_SECTION theCS;

DWORD WINAPI ThreadProc2( LPVOID lpParameter )
{
    while (!stop)
    {
        // theTestMutex.Enter();
        EnterCriticalSection(&(theCS));
        (value)++;
        // theTestMutex.Leave();
        LeaveCriticalSection(&(theCS));
    }
    return 0;
}

int main(int argc, char* argv[])
{
    HANDLE hThread1 = NULL;
    DWORD dwThreadID1 = 0;

    InitializeCriticalSection(&(theCS));

    value = 0;
    prevvalue = 0;

    hThread1 = ::CreateThread(NULL, 0, 
      ThreadProc2, (void*) 0, 0, &dwThreadID1);

    WaitForSingleObject(hThread1, 10000);

    stop = true;

    WaitForSingleObject(hThread1, INFINITE);

    CloseHandle(hThread1);

    DeleteCriticalSection(&(theCS));

    cerr << "\nNumber Increases:" 
           << value << "\n";

    return 0;
}
Using the CRITICAL_SECTION
Run1 output: Number Increases: 4.19735e+007
Run2 output: Number Increases: 4.18007e+007
Run3 output: Number Increases: 4.19747e+007
Run4 output: Number Increases: 4.2017e+007

Using the Spin Lock 
Run1 output: Number Increases: 4.19765e+007
Run2 output: Number Increases: 4.1972e+007
Run3 output: Number Increases: 4.20036e+007
Run4 output: Number Increases: 4.19853e+007

This also does not show a 'big' difference (IMHO, it does not show a difference at all). I just think the Win32 Critical Sections have a pretty efficient implementation.

6. Conclusion

It's up to the potential user of these Spin Locks to decide which application is best suited for them. 

Spin Locks can either be used on an embedded device without an operating system, or they can be used - with care - on a SMP machine with a SMP operating system. The mechanism is used to enforce mutual exclusion for shared resources. It is not recommended to use (blocking) Spin Locks on uni-processor systems. The key question is, whether the developer will let the scheduler intervene in case of a locked "shared resource" or "critical section". If it is decided to "spin" the lock, the developer must be certain that performance will be gained by not triggering the context switch.

History

  • 24 Oct 2000 - Updated reentrant code in section 3.2

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

About the Author

Gert Boddaert
Technical Lead rtos.be
Belgium Belgium
Member
Gert Boddaert is an experienced embedded software architect and driver developer who worked for companies such as Agfa Gevaert, KBC, Xircom, Intel, Niko, (Thomson) Technicolor, Punch Powertrain, Fifthplay, Cisco and Barco. For more obscure details, please take a look at his LinkedIn profile. Anyway, he started out as a Commercial Engineer – option “Management Informatics”, but was converted to the code-for-food religion by sheer luck. After writing higher level software for a few years, he descended to the lower levels of software, and eventually landed on the bottom of embedded hell… and apparently still likes it down there.
 
His favourite motto: “Think hard, experiment and prototype, think again, write (easy and maintainable) code”,
 
favourite quote: “If you think it’s expensive to hire a professional to do the job, wait until you hire an amateur.” – by Red Adair,
 
I can be contacted for real-time embedded software development projects via http://www.rtos.be and http://www.rtos.eu

Sign Up to vote   Poor Excellent
Add a reason or comment to your vote: x
Votes of 3 or less require a comment

Comments and Discussions

 
You must Sign In to use this message board.
Search this forum  
    Spacing  Noise  Layout  Per page   
GeneralModified code, :)membersld8 Jan '10 - 22:03 
class CPP_SpinLock
{
public: // inlined constructor
// inlined NON-virtual destructor
CPP_SpinLock(volatile LONG * lock) {m_lock=lock;}
~CPP_SpinLock() {}

// enter the lock, spinlocks (with/without Sleep)
// when mutex is already locked
void Enter(void)
{
do
{
if( InterlockedCompareExchange(m_lock, 1,0)==0)
{
break;
}
// reluinquish current timeslice (can only
// be used when OS available and
// we do NOT want to 'spin')
Sleep(0);
}
while (true);
}
// Tries to enter the lock, returns 0
// when mutex is already locked,
// returns != 0 when success
int TryEnter(void)
{
if( InterlockedCompareExchange(m_lock, 1,0)==0)
{
return 1;
}
return 0;
}
// Leaves or unlocks the mutex
// (should only be called by lock owner)
void Leave(void)
{
InterlockedExchange(m_lock, 0);
}

protected:
volatile LONG * m_lock;
};
GeneralSome opinion [modified]membersld8 Jan '10 - 21:46 
1. InterlockedCompareExchange maybe better.
2. There should be a sleep in enter function...
3. It's useful in windows for it's faster in inter-process lock(between memory map). The CRITICAL_SECTION is not usable in this situation and WaitForSingleObject is very slow.
 
modified on Saturday, January 9, 2010 3:54 AM

QuestionWhat about InterlockedExchange?memberMember 167475527 Nov '09 - 11:03 
Why not use InterlockedExchange, which is available on all recent Windows platforms, instead of the architecture-dependent assembly code?
AnswerRe: What about InterlockedExchange?memberGert Boddaert29 Nov '09 - 21:33 
Works fine on all those recent Windows platforms. Smile | :)
 
--------------------------------------------------
If my messages appear curt, I apologize.
I try to be brief to save your time as well as mine.
--------------------------------------------------

GeneralThis type of SpinLock is dangerous, and will cause dead-lock.sussmatt_77712 Dec '07 - 13:04 
Using pure spinlock in user-mode is generally dangerous, you should use CRITICAL_SECTION instead which will turn to block(sleep) after several spinning.
 
For example, suppose thread A has acquire a spinlock then interrupted by OS scheduler, and OS switches the CPU to thread B that might be of higher priority; the thread B runs and try to get the spinlock, then it found it locked and spinning for it. Since thread A has no chance to release the lock, dead lock occurs.
 
Also in kernel mode, work protected by spinlock should not be preempted, and should not cause page-fault.
QuestionProblem with InitializeCriticalSectionAndSpinCountmembereryreyeryeryeryeyryye22 Mar '07 - 0:38 
In MFC application when I call InitializeCriticalSectionAndSpinCount
error comes while compiling

c:\Documents and Settings\kazimr\My Documents\Visual Studio Projects\cr\cr.cpp(183): error C3861: 'InitializeCriticalSectionAndSpinCount': identifier not found, even with argument-dependent lookup
 
but when I call InitializeCriticalSection but no error and compiles well.
 
both functions have same dependencies kernel.lib and Windowss.h and winbase.h
I have to use InitializeCriticalSectionAndSpinCount as i am working on dual core processor technology.
 
Very awful Help me please...

 
It is Kaz

GeneralAtomic or notmemberPierre Couderc3 Jan '05 - 22:02 
It could be debated if your assertion about the non atomicity of read and write operations is true or not.
It is a question of definition, but atomicity is usually defined as the fact that the operation (read or write in memory, in this case) is done in a non interruptable "cycle" (either 1 cycle, either many cycles with the bus locked). It is the operation on the bus that is considered and not the instruction on the processor. And in this sense, the i386 like any processor at my knowledge have atomic read and write.
It has too an atomic TestAndSet.
Peterson and Lamport algorithms were done for computers on which read and write are supposed to be atomic, but where no assumption is done about TestAndSet.
So these algorithm are perfectly valid on i386 but not so useful on modern computers that have hardware ways(TestAndSet) to solve these problems.
 
As you have done "User-Level Spin Locks", and for which I thank you.
 

 
Pierre Couderc
www.tol.fr

GeneralImplementing critical section with lists in multi threaded environmentmembervallabh inamdar12 Apr '02 - 3:18 
Hi All,
 
We are using STL list on which we want to perform search, add, modify & delete operations. We will have multiple threads trying to access the list element in above mentioned operations.
We want to allow multiple threads to access the list elements while performing search operation but want to allow only single thread to access the list element while performing add, modify and delete operations. While one thread is searching and has found an element and the thread sleeps due to time slicing, second thread should not be allowed to add, modify or delete that particular element..
 
SGI provides spin lonk implementation, how to use it???
Is there any way in which I can use critical section for the above? Will taking section critical while performing search lock entire list?
How can I use critical section in which only particular element of interest is locked instead of whole list..
 
Sorry for lengthy description..It will be great if someone can answer this quickly..
 
thanks and regards
 
vallabh
 


GeneralInitializeCriticalSectionAndSpinCountmemberGerald Schwab11 Oct '01 - 17:05 
Not sure if your aware of this, but there is a WIN32 API that uses a spin lock called InitializeCriticalSectionAndSpinCount.
GeneralRe: InitializeCriticalSectionAndSpinCountmemberximvu12 Jun '08 - 5:07 
Correct. Although this feature did not come out until later and was possibly not available when the author wrote this article, I would have to imagine the WIN32 API functions are more effecient.
GeneralRe: InitializeCriticalSectionAndSpinCountmemberGerald Schwab12 Jun '08 - 5:24 
InitializeCriticalSectionAndSpinCount was first available in Windows 2000, which came out in Feb. 2000. The author first posted this article in Oct. 2000.
QuestionHow about a context switch free reader writer lock?memberPieter Viljoen7 Nov '00 - 16:53 
Hi
 
I often face the dillema of when to use a critical section or when to use a reader writer lock. The reader writer lock I use is derived from MSDN article "Compound Win32 Synchronization Objects" by Ruediger R. Asche. The problem is that the mutex will always cause a context switch and a critical section possibly not.
 
As an example, an IO completion port implementation may need to retrieve a structure from a collection, but at the same time a new handle and associated structure may need to be registered with the IOCP and stored in the collection. Multiple IO's may execute consecutavely (SMP) allowing multiple reads on the same protected collection to retrieve the associated structure of interest, but the write needs to protect the collection. Obviously for an IOCP implemenation speed is of critical importance and a context switch is the last thing one wants to do, but it may also take some time finding the appropriate structure.
 
An advantage of using a mutex is that the request order is maintaned, while with a critical section any thread may gain access causing out of sync requests and or starvation.
 
A disadvantage of a mutex based implementation is that a single thread may lock itself. e.g. The thread locks for reading and then somewhere along the way it locks for writing causing a deadlock. This will not happen with a critical section.
 
The ultimate solution would be a reader writer lock that has the speed of a critical section, the advantage of not causing a deadlock when being locked multiple times by the same thread, and that will maintain the lock request order.
 
Does this exist, and if it does where do I find it?
 
Regards
Pieter Viljoen
AnswerRe: How about a context switch free reader writer lock?memberGert Boddaert16 Nov '00 - 1:58 
Hi,
 
Sorry about the delay...
 
What you are looking for is also called an executive resource with shared read access. They are not available in the Win32 API but are defined in the NTddk.h and explained in the DDK reference documentation. (I am not sure whether they can 'spin' though,... they are allocated directly from the non-paged pool).
 
If you want to write your own you need a combination of a 'queued spinlock' with shared read access.
 
Best Regards,
 
Gert.
GeneralRe: How about a context switch free reader writer lock?memberPieter Viljoen18 Nov '00 - 13:24 
Gert
 
I couldn't find any reference to "executive resource" regarding synchronization in MSDN.
 
Could you please give me more info to search for in MSDN and ntddk.h.
 

Regards
 
Pieter
GeneralRe: How about a context switch free reader writer lock?memberGert Boddaert24 Nov '00 - 3:13 
Pieter,
 
Again, sorry about the delay...
- Download e.g. the W2K DDK
- Look in "ntddk.h" for "executive resource",
and you will find this...
Unfortunately the DDK documentation is no longer available on MSDN. MS wants to sell more books...
 
Best regards,
 
Gert.
 
//
// Define executive resource data structures.
//
 
typedef ULONG_PTR ERESOURCE_THREAD;
typedef ERESOURCE_THREAD *PERESOURCE_THREAD;
 
typedef struct _OWNER_ENTRY {
ERESOURCE_THREAD OwnerThread;
union {
LONG OwnerCount;
ULONG TableSize;
};
 
} OWNER_ENTRY, *POWNER_ENTRY;
 
typedef struct _ERESOURCE {
LIST_ENTRY SystemResourcesList;
POWNER_ENTRY OwnerTable;
SHORT ActiveCount;
USHORT Flag;
PKSEMAPHORE SharedWaiters;
PKEVENT ExclusiveWaiters;
OWNER_ENTRY OwnerThreads[2];
ULONG ContentionCount;
USHORT NumberOfSharedWaiters;
USHORT NumberOfExclusiveWaiters;
union {
PVOID Address;
ULONG_PTR CreatorBackTraceIndex;
};
 
KSPIN_LOCK SpinLock;
} ERESOURCE, *PERESOURCE;

AnswerRe: How about a context switch free reader writer lock?memberadisakp19 Feb '09 - 12:10 
FWIW, I would avoid using the "Ruediger Asche" Reader-Writer-Lock algorithm as it a big bug in it.
 
I wrote some RWLock code and was comparing it against 6 (well 5 because I don't have Vista here) other types of RWLocks in a test harness.
 
While modifying the performance and correctness testing of the code, one of the algorithms failed and it wasn't mine! Surprisingly the algorithm that failed came from a Microsoft Researcher's published work on MSDN while mine passed this initial testing I've done so far in it's first revision.
 
There is a bug in the Ruediger Asche RWLock which MSDN: http://msdn.microsoft.com/en-us/library/ms810427.aspx[^]
 
I wanted to know how the bug failed to I looked a bit deeper into the code and "played computer" inside my head to identify possible failure points.
 
I concluded that I could cause the "Ruediger Asche" algorithm to fail with 4 simultaneous threads accessing the RWLock:
 
A - "Reader" Thread exiting the RWL
	must be only "Reader" thread that decrements iCounter to -1
	Timing is such that A is stalled between InterlockedDecrement(&iCounter) and ResetEvent(hReaderEvent)
 
B - "Writer" Thread already waiting on the RWL
	will aquire the lock once it is released by A
 
C & D - two "Readers" simultaneous attempting to enter the RWL
	between the times that A performs InterlockedDecrement(&iCounter) and ResetEvent(hReaderEvent)
 
C	increments iCounter to 0 and properly waits
D	increments iCounter to 1 and continues running because hReaderEvent has not yet been Reset by A
 
After this D will think it has a valid Reader lock while B actually owns the Writer lock.
 
The code can be modified slightly to exacerbate the problem and then it will fail much more quickly even on a simplified test harness.
 
void CRWLock::Release(int i)
{ if (i == WRITER)
  { 
     SetEvent(hMutex);
     ReleaseMutex(hWriterMutex);
  }
  else
  if (InterlockedDecrement(&iCounter) < 0)
     {
       Sleep(1);	// This will cause the bug to occur with much higher frequency
       ResetEvent(hReaderEvent);
       SetEvent(hMutex);
     };
};

GeneralPriority InversionsussRob Krakora25 Oct '00 - 3:00 
Hey, what about priority inversion. What if the thread that has the spin lock is lower priority than the thread that is waiting on the spin lock. Don't you have a dead-lock situation in this instance, presuming that you are running with a pre-emptive OS, unless you sleep periodically which slows down your spin lock. Use Microsoft's CRITICAL_SECTION and save yourself some headaches. I don't know what your point is? Oh, and I am an embedded programmer and I spend about 70% of my time on embedded code (No-OS, Micro-TOS, VxWorks and Windows CE) and about 30% of my time on PC programming (Windows 98, NT, 2000, Linux). Most of our embedded systems, such as HDTV, DSS, Cable Modem and Cable STB, run with pre-emptive, embedded operating systems for which we have created an OS abstraction (or obstruction) layer in an attempt to make our code more OS independent (if that's possible). Also, how does a No-OS system really benefit from a spin lock if there is no concept of tasks or process-thread? Interrupts could possibly make use of them, I guess
GeneralRe: Priority InversionsussGert Boddaert25 Oct '00 - 5:23 
Hi Rob,
 
< What if the thread that has the spin lock is lower priority than the thread that is waiting on the spin lock. >
 
Good point: in that case, on a pre-emptive OS, you have a big chance of thread 'starvation' .
 
A CRITICAL_SECTION IS the way to go to avoid headaches. It's Windows; easy to use and safe and as pointed out in these discussion threads earlier (not in the article yet), if you set the spin count, pretty efficient (even) on SMP systems.
You can choose to use the presented light-weight spin lock if
- you have to protect shared memory against interrupt routine access or manipulation on embedded systems with no OS.
- you know what you are doing Cool | :cool: and want a faster not-reentrant lock.
- you have a lot of locks and want to save memory. (cfr. somewhere in these thread discussions)
 
Drawbacks are :
- you need to implement the TestAndSet function yourself for other architectures. ( 64 bit processors, not-intel processors, ... ) because that part of the code is not portable because of processor instruction dependency.
- the dangers involved : no error recovery ( thread holding the lock dies unexpectedly ) , possible CPU resource waste, starvation, ...
 
The article tries to explain some theory on locks, why you need locks, how to build your own lock, things to be aware of (like starvation, ...)... That's it, nothing more.
 
As you already mentioned, I created the not-reentrant version because I needed it for the Interrupts.
 
Just wondering... How do you protect your shared data from corruption on your no OS platforms ?
 
Cheers,
 
Gert
GeneralWhat's about SetCriticalSectionSpinCount()sussDaniel Lohmann24 Oct '00 - 7:21 
Since NT4SP3 there is also the SetCriticalSectionSpinCount() call, that uses a spincount in CriticalSection. The current thread first "spins" the given number of cycles and then enters the wait state (resulting in a context switch).
 
There are some really nice side effects:
- There is no starvation, since the thread "spins" only for a finite time and then enters a ordinary wait state.
- NT itself completly ignores the spincount on non-SMP system since, as you have pointed out, it makes no sense there.
 
I would be interested in your tests using CriticalSections with spincount. I suppose they will be not slower than your implementation. But still reentrant and non-SMP safe Wink | ;-)
 
Danie
GeneralRe: What's about SetCriticalSectionSpinCount()sussGert Boddaert24 Oct '00 - 7:55 
Just thought I should post the code to set the CRITICAL_SECTION spin count and enable others to do some testing of their own. (provided others also have a SMP-machine)
 
// in the header file
 
typedef DWORD (WINAPI *fpt_SetCriticalSectionSpinCount)(LPCRITICAL_SECTION lpCriticalSection, DWORD dwSpinCount);
 
// in the CPP file
CRITICAL_SECTION CS;
 
// ...
 
HMODULE hMod = ::GetModuleHandle("kernel32.dll");
if (hMod == NULL)
{
assert(NULL);
return -1;
}
FARPROC fp = ::GetProcAddress(hMod, "SetCriticalSectionSpinCount");
if (fp == NULL)
{
assert(NULL);
return -1;
}
 
InitializeCriticalSection(&CS);
((fpt_SetCriticalSectionSpinCount) fp)(&CS, 4000);
 
// ...
DeleteCriticalSection(&CS);
 

Non-SMP safe or not, if I could only start enough commotion to get Microsoft to release their CRITICAL_SECTION source code ... ?
GeneralComments on the codesussGeorge V. Reilly, Microsoft23 Oct '00 - 13:30 
You can omit the test for m_s == 0 in CPP_SpinLock::Enter and TryEnter. It's irrelevant. Just look at prev_s. If it was 1, you now own the lock. Period. If prev_s == 0, then some other thread owns the lock.
 
Testing m_s leads to a harmless race condition. Consider what happens if another thread already owns the lock, but calls Leave between your calling TestAndSet(0) and your testing the value of m_s?
 
BTW, the code would be more readable if you used an enum for LOCKED and UNLOCKED, especially as LOCKED==0 and UNLOCKED==1 is the inverse of how most people think of the state of the lock.
 
Your implementation of TestAndSet should be ifdef'd on _M_IX86, not just WIN32. Also, look up __fastcall and/or __declspec(naked) in MSDN. Your cycle cost ignores the cost of the prolog and push/pop instructions.
 
The PORTABLE way to do TestAndSet on Win32/WinCE/Win64 is to use InterlockedExchange. It will be indetectably slower in any real application.
 
For the re-entrant spinlock, the PNumber stuff is pretty clumsy. Much better to omit it entirely and use GetCurrentThreadId() to identify the owner of the lock. The code is pretty clumsy and it's possible to do without m_RealCS entirely.
 
Your code, as listed, is completely bogus on uniprocessor systems (you sort of mention this). If the lock is held by another thread, the only appropriate thing to do is to yield immediately. If you don't, the owning thread won't get a chance to release the lock until your timeslice expires. The easiest way to yield is to call Sleep(0). Even on multiprocessor systems, you should yield eventually in most circumstances.
 
Perhaps the biggest cost of a context switch is the effect upon the processor's L1 and L2 cache. They're full of stale data after a context switch and it takes tens of thousands of cycles to refill them.
 
One other benefit of using your spinlocks is size. The non-re-entrant version is 4 bytes. A CRITICAL_SECTION is 24 bytes and it points to a debug structure that's about 32 bytes.
 
Your tests would be more meaningful if they reported how many CPUs were in the machine and the context switch rate.
 
Finally, note that NT has supported spincounts on CRITICAL_SECTIONs since NT4.0sp3. Look up SetCriticalSectionSpinCount in MSDN, then retry your experiment
GeneralReaction to comments on the codesussGert Boddaert23 Oct '00 - 23:29 
<< You can omit the test for m_s == 0 in CPP_SpinLock::Enter and TryEnter. It's irrelevant. Just look at prev_s. If it was 1, you now own the lock. Period. If prev_s == 0, then some other thread owns
the lock. >>
 
<< Testing m_s leads to a harmless race condition. Consider what happens if another thread already owns the lock, but calls Leave between your calling TestAndSet(0) and your testing the value of m_s? >>
 
Absolutely correct, but as I will explain further on, I did have my reasons.
 
<< BTW, the code would be more readable if you used an enum for LOCKED and UNLOCKED, especially as LOCKED==0 and UNLOCKED==1 is the inverse of how most people think of the
state of the lock.>>
 
Well, what about the resource is free
=> the value is 1
The resource is not free
=> the value is 0
Isn’t that how a semaphore logically works? (You have a pool of 5 resources, so you count down to 0 when they are all taken/locked/no longwer available). It just depends how you look at it?
 
Concerning the test of (m_s == 0), Actually this is an EXCELLENT remark! The reason for this is the availability of the supported processor instruction for swapping (values): some architectures, like the one we use now, allow only a Bit Test And Set: it is nice to test the actual assembler function result whether it conforms to the expected result ( 32 bit “0” and “1” ) . That does not mean that if this conformation test is done, you can and should omit it (m_s == 0).
 
<< Your implementation of TestAndSet should be ifdef'd on
_M_IX86, not just WIN32. Also, look up __fastcall and/or __declspec(naked) in MSDN. Your cycle cost ignores the cost of the prolog and push/pop instructions. >>
 
Point taken (for _M_IX86, I will change that ASAP). __fastcall and __declspec(naked) is Microsoft specific. I cannot predict cycles for other platforms and actual performance changes between (i386) processor generations. Modern processors do more in 1 cycle.
 
<< The PORTABLE way to do TestAndSet on Win32/WinCE/Win64 is to use InterlockedExchange. It will be indetectably slower in any real application. >>
 
Yes, but that is because any “real application” will not have this lock testing function as the relatively most called function. However, That does not say which function is faster. It is also debatable what PORTABLE means in this (Microsoft) context.
 
<< For the re-entrant spinlock, the PNumber stuff is pretty
clumsy. Much better to omit it entirely and use GetCurrentThreadId() to identify the owner of the lock. The code is pretty clumsy and it's possible to do without m_RealCS entirely.>>
 
Nobody likes it when his code is called clumsy, but I do like constructive criticism. Open your mind and try to expand beyond Microsoft Windows for a moment (no offence intended). Not every platform has a GetCurrentThreadId() function,… not every system even has threads. However, If you NUMBER your INTERRUPT ORIGINS and your main loop on a NO-OS SYSTEM, you should see how "PNumber" could come in Handy. I know this a Windows Programming site and I do program 70% of my time on Windows NT/2000, which is the certainly the top Desktop OS. However, over 90% of all processors sold go into embedded systems. Some (and more than you think) of those Embedded systems do not even have an OS, although thank God - I might add - this is changing.
I do admit there is a lot of room for improvement in the Re-entrant version. Any suggestion would be greatly appreciated. I also recognize not to have explained the intentions behind the "PNumber" construction and the Re-entrant version in general, which is a mistake on my part. Either I should omit it or explain it.
 
<< Your code, as listed, is completely bogus on
uniprocessor systems (you sort of mention this). If the lock is held by another thread, the only appropriate thing to do is to yield immediately. If you don't, the owning thread won't get a chance to release the lock until your timeslice expires. The easiest way to yield is to call Sleep(0). Even on multiprocessor systems, you should yield eventually in most circumstances. >>
 
I think you overlooked some things in the article, because I did mention that (The Sleep(0) is even in the code but I commented it out). It is bogus on uniprocessor systems with a pre-emptive multi-threaded OS like Windows 9X/NT/2000/LINUX/UNIX/…. To protect an embedded program against corruption by interrupt routines, it is more than simple, sufficient and efficient enough. The interrupt tests the lock. If the lock is acquired, it can put its data into the shared queue (shared with the main loop and possibly other interrupts, that is). Otherwise, it will keep the data in a private queue and retry to repost it to the shared queue ASAP.
 
<< Your tests would be more meaningful if they reported how many CPUs were in the machine and the context switch rate.>>
 
I did. (It’s right below 5. Tests) I don’t know the context switch rate of the tests though.
 
<< Finally, note that NT has supported spincounts on CRITICAL_SECTIONs since NT4.0sp3. Look up SetCriticalSectionSpinCount in MSDN, then retry your experiment.>>
 
Point taken.
 
Thank you very much for your reaction.
 
Best regards and take care,
 
Gert.
GeneralRe: Comments on the codememberDoug Gale16 Mar '02 - 7:15 
Regarding the cycle counts in the spin-lock article.
 
First, on the x86 architecture, the xchg instruction automatically asserts a bus lock (the lock prefix is superfluous).
 
Even more importantly, locked bus cycles cause the cache line to be flushed and invalidated before the actual xchg operation begins.
 
On a multi-processor machine, cache-coherency protocols further delay the xchg. In short, the xchg instruction is more likely to consume hundreds of cycles.
 
About the InterlockedExchange: can it cause a thread-switch? Is it internally implemented as atomic instructions? Does it momentarily disable interrupts? I'm very curious.
 
Thanks.
 
Doug
GeneralRe: Comments on the codememberGert Boddaert17 Mar '02 - 22:14 
Doug Gale wrote:
First, on the x86 architecture, the xchg instruction automatically asserts a bus lock (the lock prefix is superfluous).
 
Interesting, didn't know that. Does that mean the xchg instruction is also safe on a MPS system?
 
Doug Gale wrote:
On a multi-processor machine, cache-coherency protocols further delay the xchg. In short, the xchg instruction is more likely to consume hundreds of cycles.
 
Since, I am not an expert on x86 assembly, I cannot comment on that. However tests showed that dependent on the circumstances (and that is in most circumstances) the simple lock was a lot faster (and even faster than the CRITICAL_SECTION with spin count enabled - I did not alter my article with these findings but you can easily test it yourself) . Also as I remember (it has been a while) , deepdown the CRITICAL_SECTION mechanism uses a similar test and set technique.
 
Doug Gale wrote:
About the InterlockedExchange: can it cause a thread-switch? Is it internally implemented as atomic instructions? Does it momentarily disable interrupts?
 
Does not disable interrupts as such (anyway disabling interrupts must be avoided if possible), but since a single x86 instruction (like the test-and-set) cannot be 'interrupted', the instruction itself is thread- and interrupt safe. IOW, it's atomic from the programmers point of view.
 
The proposed lock was written some time ago, when I needed a locking mechanism between interrupt and single task main thread in an embedded system without OS. I was hoping for reactions (like yours) which would point me in the direction of better x86 assembly implementation tricks. Thx!
 
Now, I would use a circular buffer with a pointer to a start point (altered by the reading thread) and a pointer to an end point (altered by the writing interrupt) .
 
cheers,
 
Gert.
 
--------------------------------------------------
If my messages appear curt, I apologize.
I try to be brief to save your time as well as mine.
--------------------------------------------------

GeneralRe: Comments on the codememberJoelKatz21 Feb '06 - 21:51 
First, you should never, ever attempt to implement a spinlock if you are not a total expert on the hardware details of the platform. Not just the CPU, but the bus logic, cache coherency, and so on. It is very easy to get things horribly wrong.
 
For example, your spinlock has a truly horrible defect. Imagine a machine with 3 or more CPUs. A thread running on one CPU holds the spinlock. Threads running on two other CPUs are trying to get the spinlock. Your spin loop causes those two CPUs to pass the cache line back and forth at full speed across the FSB. This will slow down the CPU that has the spinlock terribly, resulting in the other two CPUs spinning for even longer. Yuck.
 
Second, you are completely incorrect about a single "read modify write" instruction being thread safe. It is interrupt safe, but not thread safe. If it was thread safe, we wouldn't need the 'lock' prefix.
 
When a CPU is doing a read-modify-write operation, it cannot be interrupted. So there is no need to disable interrupts. However, another processor can steal the cache line after the read and before the write. The result of two CPUs incrementing the same address at the same time could be one increment, as follows:
 
1) CPU 1 has the cache line, does the read part of the increment getting the original value.
2) CPU 2 steals the cache line and reads the original value, CPU1 adds one to the value it read.
3) CPU 1 steals the cache line, writes back the incremented value as CPU 2 increments the original value also.
4) CPU 2 steals back the cache line and writes the original value plus one.
 
The 'lock' prefix is needed precisely because read-modify-write operations are *NOT* atomic.
 
There are a lot of spinlock pitfalls.

General General    News News    Suggestion Suggestion    Question Question    Bug Bug    Answer Answer    Joke Joke    Rant Rant    Admin Admin   

Permalink | Advertise | Privacy | Mobile
Web02 | 2.6.130516.1 | Last Updated 24 Oct 2000
Article Copyright 2000 by Gert Boddaert
Everything else Copyright © CodeProject, 1999-2013
Terms of Use
Layout: fixed | fluid