Click here to Skip to main content
Click here to Skip to main content

Testing reader/writer locks

, 20 Jan 2009 CPOL
Rate this:
Please Sign up or sign in to vote.
This article describes testing the results of reader/writer locks on Windows XP/Vista Uniprocessor and Multi-core PCs.

Introduction

Various homegrown reader/writer lock objects were suggested by a number of developers striving to provide better performance than the existing API constructs could offer. Recently, Microsoft introduced the SRWLOCK, a new built-in reader/writer primitive for Windows Vista.

Existing API synchronization constructs provide simple and fast access to a shared resource. On the other hand, API primitives (for example, a critical section) might not work well in scenarios when high contention on the lock occurs. Lock convoy is not the only problem here. If a writer competes with multiple readers, performance of a particular reader may not be guaranteed, varying in a wide range, to the extent where minimum performance is not acceptable due to the application requirements. As a rule, custom reader/writer lock objects are slower than API constructs. However, the benefit of using custom locks is that a hand-coded object may provide reasonable performance for all reader/writer threads. We will be looking for a solution that compromises between maximum performance of multiple readers and acceptable performance of a writer.

The tests results are summarized in Figure 1 and Figure 6. One writer and four readers were concurrently accessing a shared resource. The writer thread acquired an exclusive lock, filled the shared buffer with random bytes, and calculated the CRC. Each reader thread acquired a shared lock, calculated the CRC, and compared the latter with the CRC saved by the writer (an exception was supposed to be thrown in case of a CRC error, which would indicate incorrectness of the lock implementation). The following reader/writer lock objects were tested: the “CSRW” object was the simplest implementation using the Windows API critical section; the “RWFavorNeither” and “RWLockFavorWriters” objects were taken from the MSDN magazine article “CONCURRENCY: Synchronization Primitives New To Windows Vista” written by Robert Saccone and Alexander Taskov (http://msdn.microsoft.com/en-us/magazine/cc163405.aspx); the “Ruediger_Asche_RWLock” object was taken from the MSDN article “Compound Win32 Synchronization Objects” written by Ruediger R. Asche (http://msdn.microsoft.com/en-us/library/ms810427.aspx). This lock implements the “CRWLock” compound object. The “Jim_B_Robert_Wiener_RWLock” lock object was taken from “Multithreading Applications in Win32: The Complete Guide to Threads” written by Jim Beveridge and Robert Wiener. The new Windows Vista API user-space lock, SRWLOCK, was also tested.

Figure 1. Testing one writer and four readers on a Uniprocessor PC, AMD Sempron(tm) 3000+, 2.00 GHz, 32-bit Windows XP Home Edition SP3, compiled in VC++ 2005 Express. Duration per operation in microseconds and the coefficient of variation, %.
Without locking

reader: t=0.213 (s=0.85%)
writer: t=3.234 (s=0.46%)

Critical section CSRW lock

reader: t1=41586.02 (s1=161.18%) t2=0.974 (s2=40.32%) 
writer: t1=83.78 (s1=202.08%) t2=6.938 (s2=0.69%)

RWFavorNeither lock

reader: t1=1.166 (s1=9.18%) t2=1.09 (s2=7.05%) 
writer: t1=230.71 (s1=118.71%) t2=7.741 (s2=0.86%) 

RWLockFavorWriters lock

reader: t1=1.553 (s1=19.52%) t2=1.431 (s2=10.41%) 
writer: t1=28.7240 (s1=11.4%) t2=8.623 (s2=1.86%)

Jim_B_Robert_Wiener_RWLock

reader: t1=30.202 (s=8.05%) t2=9.973 (s2=1.26%) 
writer: t1=6.303 (s=6.84%) t2=6.105 (s2=2.68%) 

Ruediger_Asche_RWLock

reader: t1=2.605 (s1=6.84%) t2=2.444 (s2=1.53%) 
writer: t1=4484987.15 (s1=76.96%) t2=14.438 (s2=1.22%)

The average duration per read/write operation “t”, “t1”, and “t2” in microseconds and the coefficients of variation “s”, “s1”, and “s2” in % were calculated based on 50 tests; each of them included 1,000,000 and 4,000,000 iterations for a writer and a reader, respectively. Comparison of the “t” value (duration per operation without acquiring locks) with “t1”, and “t2” values (the duration when thread concurrency is present) shows performance degradation due to contention on the locks. The average duration “t2” was measured from the beginning to the end of the working threads. After all the threads were done with the given number of iterations, the overall duration per read/write operations was calculated. The following pseudo-code illustrates calculating “t2”:

Figure 2. Calculating average duration per operation.
worker_thread_procedure()
{
   ...
   worker.m_perf_counter.start();
   for (int i = 0; i < test_iteration; i++)
   {
      if( false == worker.simulate_work() )
      {
         ::RaiseException( STATUS_NONCONTINUABLE_EXCEPTION, 0, 0, 0);
      }   
      else
      {
      }
   }
   worker.m_perf_counter.end();
   worker.m_perf_counter.set_iteration_done(test_iteration);
}

The duration “t2” in Figure 1 shows that there is not much difference in the locks' performance. Although using “t2” as criteria may appear simple and reasonable at first glance, it has several problems. If many reader threads are holding a shared lock, a writer can become completely locked from acquiring an exclusive lock due to the race condition. Being almost blocked and therefore having poor performance in the beginning, the writer may show “fine” overall performance at the end of the test after the other reader threads have completed the given “test_iteration”. The “Ruediger_Asche_RWLock” object is an example of a lock that behaves this way. As one of four readers has done all the 4,000,000 iterations, the writer may complete only 1-7 iterations (from 1,000,000), see Figure 3:

Figure 3. Individual results of testing the “Ruediger_Asche_RWLock” object. The number of iterations done and the duration per operation in milliseconds.
reader thread 1: n=3999973 (t=0.0025) writer thread: n=7 (t=1433.846) 
reader thread 2: n=4000000 (t=0.0025) 
reader thread 3: n=3999993 (t=0.0025) 
reader thread 4: n=3999901 (t=0.0025)

When all the readers have finished, the writer unlocks and quickly finishes the remaining 999,993 iterations. Although, the “overall” performance of the “Ruediger_Asche_RWLock” object may appear acceptable (t2=14.438 microseconds per write operation), it does not. Take a look at the duration t1=4484987.15 microseconds per writing operation in Figure 1. The estimation “t1” is similar to “t2”, but it is calculated during a short period of time at the beginning of the test, when high-contention on the shared/exclusive locks exists. The pseudo-code below illustrates calculating the duration “t1”:

Figure 4. Calculating average duration per operation when all the threads work concurrently.
worker_thread_procedure()
{
   ...
   worker.m_perf_counter.start();
   int i;
   for (i = 0; i < test_iteration; i++)
   {
      if( false == worker.simulate_work() )
      {
         ::RaiseException( STATUS_NONCONTINUABLE_EXCEPTION, 0, 0, 0);
      }
      else
      {
      }
      if(TRUE == g_stop_test)
      {
         break;
      }
      else
      {
      }
   }
   worker.m_perf_counter.end();
   ::InterlockedExchange(&g_stop_test, TRUE);
   worker.m_perf_counter.set_iteration_done(i);
}

“t1” and “s1” show poor performance of the critical section object “CSRW” on Windows XP. Several reader threads can be almost “blocked” under high concurrency condition. While one reader thread has completed 4,000,000 iterations, the others may complete 6-48 iterations only. Figure 5 illustrates deviation in the “CSRW” performance within individual threads and tests:

Figure 5. Individual test results when testing the “CSRW” object. The number of iterations done and the duration per operation in milliseconds.
...
test1:

reader thread 1: n=138302 (t=0.013412) writer thread: n= 4151 (t=0.449985) 
reader thread 2: n=4000000 (t=0.000464) 
reader thread 3: n=69812 (t=0.026571) 
reader thread 4: n=3513004 (t=0.000492)

test2:

reader thread 1: n=30 (t=64.038158) writer thread: n= 294219 (t=0.006664)
reader thread 2: n=30 (t=64.035178) 
reader thread 3: n=30 (t=64.032356) 
reader thread 4: n=4000000 (t=0.000480) 
...

Although, some reader threads are very fast (0.464-0.492 microseconds), the average reading cost, however, is extremely high (t1= 41586.02 microseconds). The high value of the coefficient of variation s1=161.18% indicates that performance of the individual reader threads varies in a large extent. It is difficult for the CSRW to guarantee performance of all the readers. When using a new superior Windows Vista API SRWLOCK object, high contention may have a very negative impact on performance. If one writer competes with many readers running on a Multi-core PC, it can be quite difficult for the writer to acquire an exclusive lock. Figure 6 shows poor SRWLOCK writer performance, t1=186.171 microseconds, with a quite high coefficient of variation, s1=25.17%, on a Quad-core PC. A writer could only complete 5,000-30,000 iterations (from 1,000,000), while readers had almost finished all the 4,000,000 iterations. When changing the number of concurrent reader threads, the SRWLOCK writer performance may vary in a great extend. If the number of readers increases from 2 to 4, most of the lock objects show performance regression from 40% to 100%, but the SRWLOCK does not. The SRWLOCK writer performance degrades form 7.1 to 186.2 microseconds (~26 times).

Figure 6. Testing one writer and four readers on a Quad-CPU PC, Intel Q6700, 2.66 GHz, Hyper-Threading disabled, 64-bit Windows Vista Home Premium SP1, compiled in VC++ 2008 Express. Duration per operation in microseconds and the coefficient of variation, %.
Without locking

reader: t=0.106 (s=1.45%)
writer: t=1.416 (s=0.79%)

Critical section CSRW lock

reader: t1=19.337 (s1=48.99%) t2=1.322 (s2=5.65%) 
writer: t1=2.018 (s1=3.17%) t2=2.194 (s2=4.71%) 

Windows Vista SRWLOCK lock

reader: t1=0.991 (s1=11.77%) t2=0.881 (s2=8.2%) 
writer: t1=186.171 (s1=25.17%) t2=5.204 (s2=1.31%) 

RWFavorNeither lock

reader: t1=1.274 (s1=6.59%) t2=1.381 (s2=3.07%) 
writer: t1=72.13 (s1=5.31%) t2=6.981 (s2=1.64%) 

RWLockFavorWriters lock

reader: t1=6.311 (s1=4.83%) t2=5.511 (s2=3.48%) 
writer: t1=21.948 (s1=4.34%) t2=21.127 (s2=5.06%) 

Jim_B_Robert_Wiener_RWLock

reader: t1=61.688 (s1=1.03%) t2=59.166 (s2=0.59%) 
writer: t1=84.911 (s1=9.15%) t2=85.005 (s2=9.6%) 

Ruediger_Asche_RWLock

reader: t1=1.755 (s1=11.69%) t2=1.803 (s2=7.95%) 
writer: t1=2107697.58 (s1=108.77%) t2=10.607 (s2=5.17%)

The decision on which reader/writer lock is better to employ depends on the hardware and software architecture and a particular application scenario. However, the test results show that a likely candidate might be the “RWLockFavorWriters” object. The cost of locks is relatively low (1.6 - 6.3 and 22 - 29 microseconds for reader/writer, respectively). Unlike the others, the “RWLockFavorWriters” object performs consistently on different hardware/software platforms and test scenarios. If a writer competes with two readers (in place of four readers in Figure 1 and Figure 6), the readers and the writer perform ~30-60 % faster on both Uniprocessor and Quad-core PCs. The coefficient of variation stays relatively low (the same improving performance of the “RWFavorNeither” object will lead to increasing the coefficient of variation up to 60-70%). On the other hand, depending on the number of concurrent reader threads, the kernel CPU loading can be high ~10-40%; a large number of thread context switching per second ~150000-200000 can cause additional overheads. As against to custom locks, the new user-space Windows slim lock performs very well, loading kernel CPU ~0%, and generating a small number of thread context switching per second ~10-1000. The SRWLOCK object seems to be the best reader/writer lock to employ, if the writer performance is acceptable, and solution scalability is considered with respect to the increasing number of reader threads; as the number of reader threads increases, high contention on the lock may have a negative impact on the SRWLOCK writer performance.

Conclusion

  • It might be a disadvantage to use the Windows API critical section lock CSRW as a reader/writer lock in scenarios with multiple readers and under the condition when high contention on the lock occurs.
  • Windows Vista SRWLOCK seems to be the best object to employ, if potential regression of the writer performance along with increasing number of readers is acceptable.
  • The “RWLockFavorWriters” lock object is a high-performance, robust implementation of a reader/writer lock object in a wide range of application scenarios, performing consistently on both Uniprocessor and Multi-core platforms, and giving reasonable parity of reader/writer performance.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

Share

About the Author

Valery Grebnev
Software Developer
Canada Canada
No Biography provided

Comments and Discussions

 
GeneralRuediger_Asche_RWLock has a bug Pinmemberadisakp19-Feb-09 12:42 
GeneralRe: Ruediger_Asche_RWLock has a bug Pinmemberadisakp19-Feb-09 13:13 
GeneralRe: Ruediger_Asche_RWLock has a bug PinmemberValery Grebnev19-Feb-09 16:39 
GeneralRe: Ruediger_Asche_RWLock has a bug Pinmemberadisakp19-Feb-09 17:40 
GeneralRe: Ruediger_Asche_RWLock has a bug Pinmemberadisakp20-Feb-09 9:51 
GeneralRe: Ruediger_Asche_RWLock has a bug PinmemberValery Grebnev20-Feb-09 15:00 
I am interested in your changes. And it would be nice if you could publish your changes here. Perhaps, an interesting thing we might found then - is about the “RWLockFavorWriters” object from http://msdn.microsoft.com/en-us/magazine/cc163405.aspx. Just a couple days ago, I found something like a deadlock there. I can reproduce this consistently in a new test application I wrote a week ago. I have prepared this environment in order to test my RWL implementation:
 
#define CACHE_ALIGN 64
 
class test_value_t
{
...
	char cache_padding[CACHE_ALIGN*2]; 
 
	lock_type __declspec(align(CACHE_ALIGN)) m_lock;
	
	//lock_type m_lock;
};
 
This does not affect testing on Uniprocessor PC, but it does for Multi-core PCs. If you run this, you may find that your locks are much faster on a 4-CPU box. Also it makes the tested locks highly contended.
 
So far, it is not clear for me why the “RWLockFavorWriters”deadlocks. I already send a feedback message on that MSDN article via Microsoft site. It might be that the “deadlock” I found was caused by something which does not relate to the “RWLockFavorWriters”. It would be interesting to perform a stress test of the “RWLockFavorWriters” in your environment.
GeneralRe: Ruediger_Asche_RWLock has a bug Pinmemberadisakp12-Mar-09 9:34 
GeneralRe: Ruediger_Asche_RWLock has a bug PinmemberValery Grebnev17-Mar-09 18:39 
GeneralRe: Ruediger_Asche_RWLock has a bug Pinmemberadisakp17-Mar-09 22:46 
GeneralRe: Ruediger_Asche_RWLock has a bug PinmemberValery Grebnev18-Mar-09 16:43 
GeneralRe: Ruediger_Asche_RWLock has a bug Pinmemberadisakp23-Dec-09 17:34 
GeneralRe: Ruediger_Asche_RWLock has a bug PinmemberValery Grebnev24-Dec-09 10:59 
GeneralPut the results into graphs PinmemberWong Shao Voon8-Feb-09 21:50 
GeneralRe: Put the results into graphs PinmemberValery Grebnev9-Feb-09 15:08 
GeneralOne Question to locking PinmemberKarstenK21-Jan-09 22:52 
GeneralRe: One Question to locking PinmvpLuc Pattyn22-Jan-09 10:09 
GeneralRe: One Question to locking PinmemberValery Grebnev22-Jan-09 17:07 
GeneralRe: One Question to locking PinmemberKarstenK22-Jan-09 21:40 

General General    News News    Suggestion Suggestion    Question Question    Bug Bug    Answer Answer    Joke Joke    Rant Rant    Admin Admin   

Use Ctrl+Left/Right to switch messages, Ctrl+Up/Down to switch threads, Ctrl+Shift+Left/Right to switch pages.

| Advertise | Privacy | Terms of Use | Mobile
Web04 | 2.8.141220.1 | Last Updated 21 Jan 2009
Article Copyright 2009 by Valery Grebnev
Everything else Copyright © CodeProject, 1999-2014
Layout: fixed | fluid