Add your own alternative version
Stats
197.8K views 53 bookmarked
Posted
18 Oct 2000
|
Comments and Discussions
|
|
class CPP_SpinLock
{
public: // inlined constructor
// inlined NON-virtual destructor
CPP_SpinLock(volatile LONG * lock) {m_lock=lock;}
~CPP_SpinLock() {}
// enter the lock, spinlocks (with/without Sleep)
// when mutex is already locked
void Enter(void)
{
do
{
if( InterlockedCompareExchange(m_lock, 1,0)==0)
{
break;
}
// reluinquish current timeslice (can only
// be used when OS available and
// we do NOT want to 'spin')
Sleep(0);
}
while (true);
}
// Tries to enter the lock, returns 0
// when mutex is already locked,
// returns != 0 when success
int TryEnter(void)
{
if( InterlockedCompareExchange(m_lock, 1,0)==0)
{
return 1;
}
return 0;
}
// Leaves or unlocks the mutex
// (should only be called by lock owner)
void Leave(void)
{
InterlockedExchange(m_lock, 0);
}
protected:
volatile LONG * m_lock;
};
|
|
|
|
|
1. InterlockedCompareExchange maybe better.
2. There should be a sleep in enter function...
3. It's useful in windows for it's faster in inter-process lock(between memory map). The CRITICAL_SECTION is not usable in this situation and WaitForSingleObject is very slow.
modified on Saturday, January 9, 2010 3:54 AM
|
|
|
|
|
Why not use InterlockedExchange, which is available on all recent Windows platforms, instead of the architecture-dependent assembly code?
|
|
|
|
|
Works fine on all those recent Windows platforms.
--------------------------------------------------
If my messages appear curt, I apologize.
I try to be brief to save your time as well as mine.
--------------------------------------------------
|
|
|
|
|
Using pure spinlock in user-mode is generally dangerous, you should use CRITICAL_SECTION instead which will turn to block(sleep) after several spinning.
For example, suppose thread A has acquire a spinlock then interrupted by OS scheduler, and OS switches the CPU to thread B that might be of higher priority; the thread B runs and try to get the spinlock, then it found it locked and spinning for it. Since thread A has no chance to release the lock, dead lock occurs.
Also in kernel mode, work protected by spinlock should not be preempted, and should not cause page-fault.
|
|
|
|
|
In MFC application when I call InitializeCriticalSectionAndSpinCount
error comes while compiling
c:\Documents and Settings\kazimr\My Documents\Visual Studio Projects\cr\cr.cpp(183): error C3861: 'InitializeCriticalSectionAndSpinCount': identifier not found, even with argument-dependent lookup
but when I call InitializeCriticalSection but no error and compiles well.
both functions have same dependencies kernel.lib and Windowss.h and winbase.h
I have to use InitializeCriticalSectionAndSpinCount as i am working on dual core processor technology.
Very awful Help me please...
It is Kaz
|
|
|
|
|
It could be debated if your assertion about the non atomicity of read and write operations is true or not.
It is a question of definition, but atomicity is usually defined as the fact that the operation (read or write in memory, in this case) is done in a non interruptable "cycle" (either 1 cycle, either many cycles with the bus locked). It is the operation on the bus that is considered and not the instruction on the processor. And in this sense, the i386 like any processor at my knowledge have atomic read and write.
It has too an atomic TestAndSet.
Peterson and Lamport algorithms were done for computers on which read and write are supposed to be atomic, but where no assumption is done about TestAndSet.
So these algorithm are perfectly valid on i386 but not so useful on modern computers that have hardware ways(TestAndSet) to solve these problems.
As you have done "User-Level Spin Locks", and for which I thank you.
Pierre Couderc
www.tol.fr
|
|
|
|
|
Hi All,
We are using STL list on which we want to perform search, add, modify & delete operations. We will have multiple threads trying to access the list element in above mentioned operations.
We want to allow multiple threads to access the list elements while performing search operation but want to allow only single thread to access the list element while performing add, modify and delete operations. While one thread is searching and has found an element and the thread sleeps due to time slicing, second thread should not be allowed to add, modify or delete that particular element..
SGI provides spin lonk implementation, how to use it???
Is there any way in which I can use critical section for the above? Will taking section critical while performing search lock entire list?
How can I use critical section in which only particular element of interest is locked instead of whole list..
Sorry for lengthy description..It will be great if someone can answer this quickly..
thanks and regards
vallabh
|
|
|
|
|
Not sure if your aware of this, but there is a WIN32 API that uses a spin lock called InitializeCriticalSectionAndSpinCount.
|
|
|
|
|
Correct. Although this feature did not come out until later and was possibly not available when the author wrote this article, I would have to imagine the WIN32 API functions are more effecient.
|
|
|
|
|
InitializeCriticalSectionAndSpinCount was first available in Windows 2000, which came out in Feb. 2000. The author first posted this article in Oct. 2000.
|
|
|
|
|
Hi
I often face the dillema of when to use a critical section or when to use a reader writer lock. The reader writer lock I use is derived from MSDN article "Compound Win32 Synchronization Objects" by Ruediger R. Asche. The problem is that the mutex will always cause a context switch and a critical section possibly not.
As an example, an IO completion port implementation may need to retrieve a structure from a collection, but at the same time a new handle and associated structure may need to be registered with the IOCP and stored in the collection. Multiple IO's may execute consecutavely (SMP) allowing multiple reads on the same protected collection to retrieve the associated structure of interest, but the write needs to protect the collection. Obviously for an IOCP implemenation speed is of critical importance and a context switch is the last thing one wants to do, but it may also take some time finding the appropriate structure.
An advantage of using a mutex is that the request order is maintaned, while with a critical section any thread may gain access causing out of sync requests and or starvation.
A disadvantage of a mutex based implementation is that a single thread may lock itself. e.g. The thread locks for reading and then somewhere along the way it locks for writing causing a deadlock. This will not happen with a critical section.
The ultimate solution would be a reader writer lock that has the speed of a critical section, the advantage of not causing a deadlock when being locked multiple times by the same thread, and that will maintain the lock request order.
Does this exist, and if it does where do I find it?
Regards
Pieter Viljoen
|
|
|
|
|
Hi,
Sorry about the delay...
What you are looking for is also called an executive resource with shared read access. They are not available in the Win32 API but are defined in the NTddk.h and explained in the DDK reference documentation. (I am not sure whether they can 'spin' though,... they are allocated directly from the non-paged pool).
If you want to write your own you need a combination of a 'queued spinlock' with shared read access.
Best Regards,
Gert.
|
|
|
|
|
Gert
I couldn't find any reference to "executive resource" regarding synchronization in MSDN.
Could you please give me more info to search for in MSDN and ntddk.h.
Regards
Pieter
|
|
|
|
|
Pieter,
Again, sorry about the delay...
- Download e.g. the W2K DDK
- Look in "ntddk.h" for "executive resource",
and you will find this...
Unfortunately the DDK documentation is no longer available on MSDN. MS wants to sell more books...
Best regards,
Gert.
//
// Define executive resource data structures.
//
typedef ULONG_PTR ERESOURCE_THREAD;
typedef ERESOURCE_THREAD *PERESOURCE_THREAD;
typedef struct _OWNER_ENTRY {
ERESOURCE_THREAD OwnerThread;
union {
LONG OwnerCount;
ULONG TableSize;
};
} OWNER_ENTRY, *POWNER_ENTRY;
typedef struct _ERESOURCE {
LIST_ENTRY SystemResourcesList;
POWNER_ENTRY OwnerTable;
SHORT ActiveCount;
USHORT Flag;
PKSEMAPHORE SharedWaiters;
PKEVENT ExclusiveWaiters;
OWNER_ENTRY OwnerThreads[2];
ULONG ContentionCount;
USHORT NumberOfSharedWaiters;
USHORT NumberOfExclusiveWaiters;
union {
PVOID Address;
ULONG_PTR CreatorBackTraceIndex;
};
KSPIN_LOCK SpinLock;
} ERESOURCE, *PERESOURCE;
|
|
|
|
|
FWIW, I would avoid using the "Ruediger Asche" Reader-Writer-Lock algorithm as it a big bug in it.
I wrote some RWLock code and was comparing it against 6 (well 5 because I don't have Vista here) other types of RWLocks in a test harness.
While modifying the performance and correctness testing of the code, one of the algorithms failed and it wasn't mine! Surprisingly the algorithm that failed came from a Microsoft Researcher's published work on MSDN while mine passed this initial testing I've done so far in it's first revision.
There is a bug in the Ruediger Asche RWLock which MSDN: http://msdn.microsoft.com/en-us/library/ms810427.aspx[^]
I wanted to know how the bug failed to I looked a bit deeper into the code and "played computer" inside my head to identify possible failure points.
I concluded that I could cause the "Ruediger Asche" algorithm to fail with 4 simultaneous threads accessing the RWLock:
A - "Reader" Thread exiting the RWL
must be only "Reader" thread that decrements iCounter to -1
Timing is such that A is stalled between InterlockedDecrement(&iCounter) and ResetEvent(hReaderEvent)
B - "Writer" Thread already waiting on the RWL
will aquire the lock once it is released by A
C & D - two "Readers" simultaneous attempting to enter the RWL
between the times that A performs InterlockedDecrement(&iCounter) and ResetEvent(hReaderEvent)
C increments iCounter to 0 and properly waits
D increments iCounter to 1 and continues running because hReaderEvent has not yet been Reset by A
After this D will think it has a valid Reader lock while B actually owns the Writer lock.
The code can be modified slightly to exacerbate the problem and then it will fail much more quickly even on a simplified test harness.
void CRWLock::Release(int i)
{ if (i == WRITER)
{
SetEvent(hMutex);
ReleaseMutex(hWriterMutex);
}
else
if (InterlockedDecrement(&iCounter) < 0)
{
Sleep(1);
ResetEvent(hReaderEvent);
SetEvent(hMutex);
};
};
|
|
|
|
|
Hey, what about priority inversion. What if the thread that has the spin lock is lower priority than the thread that is waiting on the spin lock. Don't you have a dead-lock situation in this instance, presuming that you are running with a pre-emptive OS, unless you sleep periodically which slows down your spin lock. Use Microsoft's CRITICAL_SECTION and save yourself some headaches. I don't know what your point is? Oh, and I am an embedded programmer and I spend about 70% of my time on embedded code (No-OS, Micro-TOS, VxWorks and Windows CE) and about 30% of my time on PC programming (Windows 98, NT, 2000, Linux). Most of our embedded systems, such as HDTV, DSS, Cable Modem and Cable STB, run with pre-emptive, embedded operating systems for which we have created an OS abstraction (or obstruction) layer in an attempt to make our code more OS independent (if that's possible). Also, how does a No-OS system really benefit from a spin lock if there is no concept of tasks or process-thread? Interrupts could possibly make use of them, I guess
|
|
|
|
|
Hi Rob,
< What if the thread that has the spin lock is lower priority than the thread that is waiting on the spin lock. >
Good point: in that case, on a pre-emptive OS, you have a big chance of thread 'starvation' .
A CRITICAL_SECTION IS the way to go to avoid headaches. It's Windows; easy to use and safe and as pointed out in these discussion threads earlier (not in the article yet), if you set the spin count, pretty efficient (even) on SMP systems.
You can choose to use the presented light-weight spin lock if
- you have to protect shared memory against interrupt routine access or manipulation on embedded systems with no OS.
- you know what you are doing and want a faster not-reentrant lock.
- you have a lot of locks and want to save memory. (cfr. somewhere in these thread discussions)
Drawbacks are :
- you need to implement the TestAndSet function yourself for other architectures. ( 64 bit processors, not-intel processors, ... ) because that part of the code is not portable because of processor instruction dependency.
- the dangers involved : no error recovery ( thread holding the lock dies unexpectedly ) , possible CPU resource waste, starvation, ...
The article tries to explain some theory on locks, why you need locks, how to build your own lock, things to be aware of (like starvation, ...)... That's it, nothing more.
As you already mentioned, I created the not-reentrant version because I needed it for the Interrupts.
Just wondering... How do you protect your shared data from corruption on your no OS platforms ?
Cheers,
Gert
|
|
|
|
|
Since NT4SP3 there is also the SetCriticalSectionSpinCount() call, that uses a spincount in CriticalSection. The current thread first "spins" the given number of cycles and then enters the wait state (resulting in a context switch).
There are some really nice side effects:
- There is no starvation, since the thread "spins" only for a finite time and then enters a ordinary wait state.
- NT itself completly ignores the spincount on non-SMP system since, as you have pointed out, it makes no sense there.
I would be interested in your tests using CriticalSections with spincount. I suppose they will be not slower than your implementation. But still reentrant and non-SMP safe
Danie
|
|
|
|
|
Just thought I should post the code to set the CRITICAL_SECTION spin count and enable others to do some testing of their own. (provided others also have a SMP-machine)
// in the header file
typedef DWORD (WINAPI *fpt_SetCriticalSectionSpinCount)(LPCRITICAL_SECTION lpCriticalSection, DWORD dwSpinCount);
// in the CPP file
CRITICAL_SECTION CS;
// ...
HMODULE hMod = ::GetModuleHandle("kernel32.dll");
if (hMod == NULL)
{
assert(NULL);
return -1;
}
FARPROC fp = ::GetProcAddress(hMod, "SetCriticalSectionSpinCount");
if (fp == NULL)
{
assert(NULL);
return -1;
}
InitializeCriticalSection(&CS);
((fpt_SetCriticalSectionSpinCount) fp)(&CS, 4000);
// ...
DeleteCriticalSection(&CS);
Non-SMP safe or not, if I could only start enough commotion to get Microsoft to release their CRITICAL_SECTION source code ... ?
|
|
|
|
|
You can omit the test for m_s == 0 in CPP_SpinLock::Enter and TryEnter. It's irrelevant. Just look at prev_s. If it was 1, you now own the lock. Period. If prev_s == 0, then some other thread owns the lock.
Testing m_s leads to a harmless race condition. Consider what happens if another thread already owns the lock, but calls Leave between your calling TestAndSet(0) and your testing the value of m_s?
BTW, the code would be more readable if you used an enum for LOCKED and UNLOCKED, especially as LOCKED==0 and UNLOCKED==1 is the inverse of how most people think of the state of the lock.
Your implementation of TestAndSet should be ifdef'd on _M_IX86, not just WIN32. Also, look up __fastcall and/or __declspec(naked) in MSDN. Your cycle cost ignores the cost of the prolog and push/pop instructions.
The PORTABLE way to do TestAndSet on Win32/WinCE/Win64 is to use InterlockedExchange. It will be indetectably slower in any real application.
For the re-entrant spinlock, the PNumber stuff is pretty clumsy. Much better to omit it entirely and use GetCurrentThreadId() to identify the owner of the lock. The code is pretty clumsy and it's possible to do without m_RealCS entirely.
Your code, as listed, is completely bogus on uniprocessor systems (you sort of mention this). If the lock is held by another thread, the only appropriate thing to do is to yield immediately. If you don't, the owning thread won't get a chance to release the lock until your timeslice expires. The easiest way to yield is to call Sleep(0). Even on multiprocessor systems, you should yield eventually in most circumstances.
Perhaps the biggest cost of a context switch is the effect upon the processor's L1 and L2 cache. They're full of stale data after a context switch and it takes tens of thousands of cycles to refill them.
One other benefit of using your spinlocks is size. The non-re-entrant version is 4 bytes. A CRITICAL_SECTION is 24 bytes and it points to a debug structure that's about 32 bytes.
Your tests would be more meaningful if they reported how many CPUs were in the machine and the context switch rate.
Finally, note that NT has supported spincounts on CRITICAL_SECTIONs since NT4.0sp3. Look up SetCriticalSectionSpinCount in MSDN, then retry your experiment
|
|
|
|
|
<< You can omit the test for m_s == 0 in CPP_SpinLock::Enter and TryEnter. It's irrelevant. Just look at prev_s. If it was 1, you now own the lock. Period. If prev_s == 0, then some other thread owns
the lock. >>
<< Testing m_s leads to a harmless race condition. Consider what happens if another thread already owns the lock, but calls Leave between your calling TestAndSet(0) and your testing the value of m_s? >>
Absolutely correct, but as I will explain further on, I did have my reasons.
<< BTW, the code would be more readable if you used an enum for LOCKED and UNLOCKED, especially as LOCKED==0 and UNLOCKED==1 is the inverse of how most people think of the
state of the lock.>>
Well, what about the resource is free
=> the value is 1
The resource is not free
=> the value is 0
Isn’t that how a semaphore logically works? (You have a pool of 5 resources, so you count down to 0 when they are all taken/locked/no longwer available). It just depends how you look at it?
Concerning the test of (m_s == 0), Actually this is an EXCELLENT remark! The reason for this is the availability of the supported processor instruction for swapping (values): some architectures, like the one we use now, allow only a Bit Test And Set: it is nice to test the actual assembler function result whether it conforms to the expected result ( 32 bit “0” and “1” ) . That does not mean that if this conformation test is done, you can and should omit it (m_s == 0).
<< Your implementation of TestAndSet should be ifdef'd on
_M_IX86, not just WIN32. Also, look up __fastcall and/or __declspec(naked) in MSDN. Your cycle cost ignores the cost of the prolog and push/pop instructions. >>
Point taken (for _M_IX86, I will change that ASAP). __fastcall and __declspec(naked) is Microsoft specific. I cannot predict cycles for other platforms and actual performance changes between (i386) processor generations. Modern processors do more in 1 cycle.
<< The PORTABLE way to do TestAndSet on Win32/WinCE/Win64 is to use InterlockedExchange. It will be indetectably slower in any real application. >>
Yes, but that is because any “real application” will not have this lock testing function as the relatively most called function. However, That does not say which function is faster. It is also debatable what PORTABLE means in this (Microsoft) context.
<< For the re-entrant spinlock, the PNumber stuff is pretty
clumsy. Much better to omit it entirely and use GetCurrentThreadId() to identify the owner of the lock. The code is pretty clumsy and it's possible to do without m_RealCS entirely.>>
Nobody likes it when his code is called clumsy, but I do like constructive criticism. Open your mind and try to expand beyond Microsoft Windows for a moment (no offence intended). Not every platform has a GetCurrentThreadId() function,… not every system even has threads. However, If you NUMBER your INTERRUPT ORIGINS and your main loop on a NO-OS SYSTEM, you should see how "PNumber" could come in Handy. I know this a Windows Programming site and I do program 70% of my time on Windows NT/2000, which is the certainly the top Desktop OS. However, over 90% of all processors sold go into embedded systems. Some (and more than you think) of those Embedded systems do not even have an OS, although thank God - I might add - this is changing.
I do admit there is a lot of room for improvement in the Re-entrant version. Any suggestion would be greatly appreciated. I also recognize not to have explained the intentions behind the "PNumber" construction and the Re-entrant version in general, which is a mistake on my part. Either I should omit it or explain it.
<< Your code, as listed, is completely bogus on
uniprocessor systems (you sort of mention this). If the lock is held by another thread, the only appropriate thing to do is to yield immediately. If you don't, the owning thread won't get a chance to release the lock until your timeslice expires. The easiest way to yield is to call Sleep(0). Even on multiprocessor systems, you should yield eventually in most circumstances. >>
I think you overlooked some things in the article, because I did mention that (The Sleep(0) is even in the code but I commented it out). It is bogus on uniprocessor systems with a pre-emptive multi-threaded OS like Windows 9X/NT/2000/LINUX/UNIX/…. To protect an embedded program against corruption by interrupt routines, it is more than simple, sufficient and efficient enough. The interrupt tests the lock. If the lock is acquired, it can put its data into the shared queue (shared with the main loop and possibly other interrupts, that is). Otherwise, it will keep the data in a private queue and retry to repost it to the shared queue ASAP.
<< Your tests would be more meaningful if they reported how many CPUs were in the machine and the context switch rate.>>
I did. (It’s right below 5. Tests) I don’t know the context switch rate of the tests though.
<< Finally, note that NT has supported spincounts on CRITICAL_SECTIONs since NT4.0sp3. Look up SetCriticalSectionSpinCount in MSDN, then retry your experiment.>>
Point taken.
Thank you very much for your reaction.
Best regards and take care,
Gert.
|
|
|
|
|
Regarding the cycle counts in the spin-lock article.
First, on the x86 architecture, the xchg instruction automatically asserts a bus lock (the lock prefix is superfluous).
Even more importantly, locked bus cycles cause the cache line to be flushed and invalidated before the actual xchg operation begins.
On a multi-processor machine, cache-coherency protocols further delay the xchg. In short, the xchg instruction is more likely to consume hundreds of cycles.
About the InterlockedExchange: can it cause a thread-switch? Is it internally implemented as atomic instructions? Does it momentarily disable interrupts? I'm very curious.
Thanks.
Doug
|
|
|
|
|
Doug Gale wrote:
First, on the x86 architecture, the xchg instruction automatically asserts a bus lock (the lock prefix is superfluous).
Interesting, didn't know that. Does that mean the xchg instruction is also safe on a MPS system?
Doug Gale wrote:
On a multi-processor machine, cache-coherency protocols further delay the xchg. In short, the xchg instruction is more likely to consume hundreds of cycles.
Since, I am not an expert on x86 assembly, I cannot comment on that. However tests showed that dependent on the circumstances (and that is in most circumstances) the simple lock was a lot faster (and even faster than the CRITICAL_SECTION with spin count enabled - I did not alter my article with these findings but you can easily test it yourself) . Also as I remember (it has been a while) , deepdown the CRITICAL_SECTION mechanism uses a similar test and set technique.
Doug Gale wrote:
About the InterlockedExchange: can it cause a thread-switch? Is it internally implemented as atomic instructions? Does it momentarily disable interrupts?
Does not disable interrupts as such (anyway disabling interrupts must be avoided if possible), but since a single x86 instruction (like the test-and-set) cannot be 'interrupted', the instruction itself is thread- and interrupt safe. IOW, it's atomic from the programmers point of view.
The proposed lock was written some time ago, when I needed a locking mechanism between interrupt and single task main thread in an embedded system without OS. I was hoping for reactions (like yours) which would point me in the direction of better x86 assembly implementation tricks. Thx!
Now, I would use a circular buffer with a pointer to a start point (altered by the reading thread) and a pointer to an end point (altered by the writing interrupt) .
cheers,
Gert.
--------------------------------------------------
If my messages appear curt, I apologize.
I try to be brief to save your time as well as mine.
--------------------------------------------------
|
|
|
|
|
First, you should never, ever attempt to implement a spinlock if you are not a total expert on the hardware details of the platform. Not just the CPU, but the bus logic, cache coherency, and so on. It is very easy to get things horribly wrong.
For example, your spinlock has a truly horrible defect. Imagine a machine with 3 or more CPUs. A thread running on one CPU holds the spinlock. Threads running on two other CPUs are trying to get the spinlock. Your spin loop causes those two CPUs to pass the cache line back and forth at full speed across the FSB. This will slow down the CPU that has the spinlock terribly, resulting in the other two CPUs spinning for even longer. Yuck.
Second, you are completely incorrect about a single "read modify write" instruction being thread safe. It is interrupt safe, but not thread safe. If it was thread safe, we wouldn't need the 'lock' prefix.
When a CPU is doing a read-modify-write operation, it cannot be interrupted. So there is no need to disable interrupts. However, another processor can steal the cache line after the read and before the write. The result of two CPUs incrementing the same address at the same time could be one increment, as follows:
1) CPU 1 has the cache line, does the read part of the increment getting the original value.
2) CPU 2 steals the cache line and reads the original value, CPU1 adds one to the value it read.
3) CPU 1 steals the cache line, writes back the incremented value as CPU 2 increments the original value also.
4) CPU 2 steals back the cache line and writes the original value plus one.
The 'lock' prefix is needed precisely because read-modify-write operations are *NOT* atomic.
There are a lot of spinlock pitfalls.
|
|
|
|
|
|
General News Suggestion Question Bug Answer Joke Praise Rant Admin
Use Ctrl+Left/Right to switch messages, Ctrl+Up/Down to switch threads, Ctrl+Shift+Left/Right to switch pages.
|
|