
Testing simple concurrent containers

25 Oct 2009, CPOL
This article illustrates simple approaches, and test results, for creating containers with a concurrent flavor running on a multi-core PC.

Introduction

Concurrent containers are important for multi-threaded applications. Third-party libraries, such as the award-winning Intel Threading Building Blocks (TBB) C++ library [1], offer parallel containers for various hardware, operating systems, and compilers. The TBB library seems to be the first choice to employ. On the other hand, if thread concurrency is not a high priority, we may consider simple techniques to add some concurrency flavor to our containers.

For relatively large containers such as arrays, vectors, or maps, it is unlikely that all the application threads access the same elements at the same time. We can try to leverage element scope fine-grained locking along with simple lock-free techniques, or fine-tuned concurrent access techniques similar to [2], while performing element scope operations (reading/writing element values). A bottleneck might be the global scope operations (inserts/deletes). Recently, Microsoft has offered an effective slim reader/writer lock API object, SRWLOCK. For scenarios where global scope operations are not frequent, we might want to take advantage of this API to facilitate container global scope locking.

To try out these techniques, we will create and test containers using the STL std::map, stdext::hash_map, and different locking algorithms, and then compare their performance with the TBB tbb::concurrent_hash_map. When the test application starts, a thread creates a container of 1,000,000 elements, and then rebuilds it, inserting/deleting elements. In parallel, 4 to 32 threads concurrently read and write element values. A writer thread fills an element buffer with random bytes and calculates its CRC. A reader thread reads the buffer, recalculates the CRC, and compares it with the value saved by the writer (an exception is supposed to be thrown in case of a CRC error, which would indicate an incorrect implementation). To compare the performance of the different containers, we calculate the average duration per read/write in microseconds based on 50 tests; each test includes 1,000,000 iterations for the thread inserting/deleting elements, and 4,000,000 iterations for the writer and reader threads. The test application is attached to this article.
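The writer/reader correctness check described above can be sketched as follows. The Element layout and function names are assumptions for illustration, and a plain byte sum stands in for the CRC used by the real test, to keep the sketch self-contained:

```cpp
#include <array>
#include <cstdint>
#include <numeric>
#include <random>

// Hypothetical element layout: a buffer of random bytes plus the checksum
// recorded by the last writer (the attached test uses a CRC instead).
struct Element
{
    std::array<std::uint8_t, 64> buffer{};
    std::uint32_t checksum = 0;
};

inline std::uint32_t compute_checksum(const Element& e)
{
    // A plain byte sum substituted for the CRC of the real test.
    return std::accumulate(e.buffer.begin(), e.buffer.end(), 0u);
}

// Writer: fill the buffer with random bytes, then record the checksum.
inline void write_element(Element& e, std::mt19937& rng)
{
    for (auto& b : e.buffer)
        b = static_cast<std::uint8_t>(rng());
    e.checksum = compute_checksum(e);
}

// Reader: recompute and compare; a mismatch would expose a torn read,
// i.e., an incorrect locking implementation.
inline bool read_element(const Element& e)
{
    return compute_checksum(e) == e.checksum;
}
```

A torn or unsynchronized read would make read_element() return false, which is the condition the test turns into an exception.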

A simple approach underlies the tested containers:

  • When a thread inserts or deletes an element, it acquires the global scope exclusive lock. While the global exclusive lock is held, no other thread can access the container (unfortunately, this means no parallelism and no scalability, and the lock may become a bottleneck depending on the global scope lock implementation). Inserts are demonstrated in Figure 1:
  • Figure 1. Using a global scope exclusive lock when changing the container structure.
    template <typename key_type, typename value_type>
    bool insert( const std::pair<key_type, value_type>& pair) 
    {
          Exclusive_lock lock(m_global_scope_lock);
         return m_map.insert(pair).second;
    }
  • When a writer thread changes an element value, it uses two-step locking. First, the writer acquires the global scope shared lock; while it is held, no thread can insert or delete elements, but other writer/reader threads can still proceed in parallel. Then, the writer acquires the element scope exclusive lock and updates the element value (depending on the element scope locking, writes can be more or less parallel and scalable). The pseudo-code in Figure 2 demonstrates the updates:
  • Figure 2. Using a global scope shared lock and an element scope exclusive lock when writing a container element value.
    template <typename key_type, typename value_type>
    bool update(const key_type& _key, const value_type& _val)
    {
          Shared_lock lock(m_global_scope_lock);
         map_iterator iterator = m_map.find(_key);
         if (iterator != m_map.end())
         {
               Element_exclusive_lock element_lock(iterator);
              iterator->second = _val; 
              return true;
         }
         else
         {
              return false;
         }
    }
  • A reader thread behaves similarly to a writer, except that it acquires the element scope shared lock in place of the element exclusive lock. Reads can be parallel and scalable. Figure 3 demonstrates reads.
  • Figure 3. Using shared global scope and element scope locks when reading a container element.
    template <typename key_type, typename value_type>
    bool find(const key_type& _key, value_type& _val)
    {
          Shared_lock lock(m_global_scope_lock);
         map_iterator iterator = m_map.find(_key);
         if (iterator != m_map.end())
         {
               Element_shared_lock element_lock(iterator);
              _val = iterator->second; 
              return true;
         }
         else
         {
              return false;
         }
    }

A simple implementation of the global scope lock is a wrapper over the Windows API SRWLOCK. Custom read/write locks might be a better choice in scenarios where writer performance matters [3]. We use, however, the "standard" SRWLOCK object for the sake of clarity and comparability of the test results.
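The RAII wrappers used throughout the listings can be sketched portably as follows; std::shared_mutex is used here as a stand-in for the Windows SRWLOCK (lock()/lock_shared() corresponding to AcquireSRWLockExclusive/AcquireSRWLockShared), and guarded_increment/guarded_read are hypothetical helpers added only to demonstrate usage:

```cpp
#include <shared_mutex>

// Portable sketch of the article's ref_exclusive_lock_t wrapper:
// acquire exclusively on construction, release on destruction.
template <typename Lock>
class ref_exclusive_lock_t
{
public:
    explicit ref_exclusive_lock_t(Lock& lock) : m_lock(lock) { m_lock.lock(); }
    ~ref_exclusive_lock_t() { m_lock.unlock(); }
private:
    Lock& m_lock;
};

// Portable sketch of ref_shared_lock_t: shared (reader) acquisition.
template <typename Lock>
class ref_shared_lock_t
{
public:
    explicit ref_shared_lock_t(Lock& lock) : m_lock(lock) { m_lock.lock_shared(); }
    ~ref_shared_lock_t() { m_lock.unlock_shared(); }
private:
    Lock& m_lock;
};

// Hypothetical usage helpers: a write under the exclusive wrapper,
// a read under the shared wrapper.
inline int guarded_increment(std::shared_mutex& m, int& value)
{
    ref_exclusive_lock_t<std::shared_mutex> lock(m);
    return ++value;
}
inline int guarded_read(std::shared_mutex& m, const int& value)
{
    ref_shared_lock_t<std::shared_mutex> lock(m);
    return value;
}
```

The scoped-lock design keeps the listings exception safe: the lock is released even if the container operation throws.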

As for element scope fine-grained locking, we may try the simplest spin-lock per element. Internally, the container in Figure 4 stores _internal_value<value_type> objects, which encapsulate both the element value m_value and the corresponding spin-lock variable m_lock. The spin-lock status variable m_lock (1 or 0) controls access to m_value per element.

Figure 4. Using spin-locks/rwspin-lock per element for fine-grained element scope locking.
template <typename key_type, typename value_type, typename lock_type>
class SRWL_map_t 
{
public:
     
     bool insert( const std::pair<key_type, value_type>& _val) 
     {
          internal_value value(_val.second);
          std::pair<key_type, internal_value> pair(_val.first, value);
          Exclusive_lock lock(m_global_scope_lock);
          return m_map.insert(pair).second;
     }
     bool erase(const key_type& _key)
     {
          Exclusive_lock lock(m_global_scope_lock);
          return (m_map.erase(_key) != 0);
     }
     bool update(const key_type& _key, const value_type& _val)
     {
          Shared_lock lock(m_global_scope_lock);
          map_iterator iterator = m_map.find(_key);
          if (iterator != m_map.end())
          {
                Element_exclusive_lock element_lock(iterator->second.m_lock);
               iterator->second.m_value = _val; 
               return true;
          }
          else
          {
               return false;
          }
     }
     bool find(const key_type& _key, value_type& _val)
     {
          Shared_lock lock(m_global_scope_lock);
          map_iterator iterator = m_map.find(_key);
          if (iterator != m_map.end())
          {
                Element_shared_lock element_lock(iterator->second.m_lock);
               _val = iterator->second.m_value; 
               return true;
          }
          else
          {
               return false;
          }
     }
private:     
     template <typename value_type>
     struct _internal_value
     {
          _internal_value(value_type value): m_lock(0), m_value(value) 
          {
          }
          volatile LONG CACHE_ALIGN m_lock;
          CACHE_PADDING1
          value_type m_value;
     };
     typedef _internal_value<value_type> internal_value;
     typedef ref_exclusive_lock_t<lock_type> Exclusive_lock;
     typedef ref_shared_lock_t<lock_type> Shared_lock;
     typedef exclusive_lock_t<ELEMENT_SPINLOCK, volatile LONG> Element_exclusive_lock;
     typedef shared_lock_t<ELEMENT_SPINLOCK, volatile LONG> Element_shared_lock;
#ifdef HAS_HASH_MAP
     typedef typename stdext::hash_map<key_type, internal_value>::iterator map_iterator;
     stdext::hash_map<key_type, internal_value> m_map;
#elif defined HAS_MAP
     typedef typename std::map<key_type, internal_value>::iterator map_iterator;
     std::map<key_type, internal_value> m_map;
#endif
     CACHE_PADDING1
     lock_type m_global_scope_lock;
};

The macro ELEMENT_SPINLOCK defines the Spin_lock class. We intentionally use the simplest implementation of the Spin_lock, in order to highlight how thread contention (even if supposedly low at first glance) impacts performance. Only one thread is allowed to access an element at a time, which might be a disadvantage for multiple readers. Figures 5 and 6 show performance degradation (more than 3 times as compared with the Intel TBB tbb::concurrent_hash_map class) when the number of threads increases up to 32 (16 reader and 16 writer threads).
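A portable sketch of the kind of minimal test-and-set spin-lock ELEMENT_SPINLOCK could expand to is shown below; std::atomic_flag replaces the interlocked intrinsics of the attached code, and spin_locked_count is a hypothetical helper demonstrating the lock under contention:

```cpp
#include <atomic>
#include <thread>
#include <vector>

// The simplest spin-lock: busy-wait on a single flag.
class Spin_lock
{
public:
    void acquire()
    {
        // Loop until the previous flag value was clear (0 -> 1 transition).
        while (m_flag.test_and_set(std::memory_order_acquire))
        { /* spin */ }
    }
    void release()
    {
        m_flag.clear(std::memory_order_release);
    }
private:
    std::atomic_flag m_flag = ATOMIC_FLAG_INIT;
};

// Demonstration: several threads increment a shared counter under the lock;
// the final count is exact only if mutual exclusion holds.
inline long spin_locked_count(int threads, int iterations)
{
    Spin_lock lock;
    long counter = 0;
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t)
        pool.emplace_back([&] {
            for (int i = 0; i < iterations; ++i)
            {
                lock.acquire();
                ++counter;          // protected increment
                lock.release();
            }
        });
    for (auto& th : pool) th.join();
    return counter;
}
```

Because every acquirer, reader or writer alike, must win the same flag, this lock serializes all element access, which is exactly the reader disadvantage discussed above.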

Figure 5. Comparison of reader performance for hash containers using STL stdext::hash_map and Intel TBB tbb::concurrent_hash_map, when running multiple writer and reader threads on a quad-core PC (Intel Q6700, 2.66 GHz, Hyper-Threading disabled, 64-bit Windows Vista Home Premium SP2), compiled in VC++ 2008 Express (x64 applications). Duration per reading operation in microseconds.


Figure 6. Comparison of reader performance for map wrappers using STL std::map, when running multiple writer and reader threads on a quad-core PC (Intel Q6700, 2.66 GHz, Hyper-Threading disabled, 64-bit Windows Vista Home Premium SP2), compiled in VC++ 2008 Express (x64 applications). Duration per reading operation in microseconds.


We could slightly improve this solution by using spin counters, SwitchToThread() calls, and other amendments usually made to spin-locks. However, this will not provide more concurrency, in particular for multiple readers.
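The amendments mentioned above might look like the following sketch. The Backoff_spin_lock class and its SPIN_LIMIT are illustrative assumptions; std::this_thread::yield() stands in for SwitchToThread(), and locked_add is a hypothetical helper showing usage:

```cpp
#include <atomic>
#include <thread>

// A spin-lock that spins a bounded number of times, then yields the
// time slice instead of burning the CPU.
class Backoff_spin_lock
{
public:
    void acquire()
    {
        int spins = 0;
        while (m_flag.test_and_set(std::memory_order_acquire))
        {
            if (++spins >= SPIN_LIMIT)
            {
                std::this_thread::yield();  // portable SwitchToThread() analogue
                spins = 0;
            }
        }
    }
    void release()
    {
        m_flag.clear(std::memory_order_release);
    }
private:
    static const int SPIN_LIMIT = 4000;     // arbitrary spin count for the sketch
    std::atomic_flag m_flag = ATOMIC_FLAG_INIT;
};

// Usage: a shared value updated under the lock.
inline int locked_add(Backoff_spin_lock& l, int& v, int d)
{
    l.acquire();
    v += d;
    l.release();
    return v;
}
```

Yielding reduces wasted cycles when the owner is preempted, but as the text notes, access is still exclusive per element, so readers gain no parallelism.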

A simple solution which might pave the way for better reader parallelism and scalability is to use reader/writer locks in place of the basic spin-locks. No changes to the template class SRWL_map_t are needed: we implement a reader/writer lock class, Spin_rwlock, and change the macro ELEMENT_SPINLOCK to use Spin_rwlock. An unsophisticated implementation of the Spin_rwlock can be similar to the reader/writer spin-lock from the article [3]:

Figure 7. A simple reader/writer spin-lock.
/*
Copyright (c) 2009 Valery Grebnev.
CPOL (Code Project Open License)
Licensed <a href="http://www.codeproject.com/info/cpol10.aspx">http://www.codeproject.com/info/cpol10.aspx</a>
*/
class CACHE_ALIGN Spin_rwlock
{
public:
     void acquireLockShared()
     {
          LONG i;
          LONG j;
          j = acquire_read(m_rw_status);
          do {
               i = j & 0xFFFF;
               j = acquire_interlocked_compare_exchange(&m_rw_status,
                                        i + 1,
                                        i);
               PAUSE
          } while (i != j);
     }
     void acquireLockExclusive()
     {
          while( acquire_interlocked_compare_exchange(
                    &m_rw_status,0x1+0xFFFF,0x0) != 0x0)
          {
               PAUSE
          }
     }
     void releaseLockShared()
     {
          interlocked_decrement_release(&m_rw_status);
     }
     void releaseLockExclusive()
     {
          interlocked_Xor_release(&m_rw_status, 0x1+0xFFFF);
     }
     Spin_rwlock(volatile LONG& rw_status) : m_rw_status(rw_status)
     {
     }
     ~Spin_rwlock()
     {
     }
private:
     CACHE_PADDING1
     volatile LONG& m_rw_status;
     Spin_rwlock& operator=(const Spin_rwlock&); // non-assignable (suppresses C4512)
};

The HIWORD of the m_rw_status variable (0 or 1) indicates that a writer is active; the LOWORD counts active readers. Multiple readers can now access the same element in parallel, which improves reader performance (up to 2 times for 32 threads; see Figures 5 and 6). The disadvantage of the Spin_rwlock is the degradation of writer performance as the number of threads increases; see the writer performance in Figures 8 and 9. Both spin-locks and reader/writer spin-locks work fine when thread contention is low, but for highly concurrent containers, they do not. For such scenarios, we might try to leverage more efficient locks, for instance, Windows SRWLOCK objects.

An elegant solution providing fine-tuned concurrent element access for large containers was introduced almost ten years ago by Ian Emmons in his article [2]. The solution uses a simple technique of hashing container element addresses to a lock object, an element of an array of 256 locks. This approach underlies the CritSecTable_locker_t and CST_map_t classes we implemented for the test purposes. The container shows great performance for both readers and writers; see the CS_table results in Figures 5, 6 and 8, 9. The advantage of this approach is that it facilitates the usage of complex and effective objects for fine-grained locking in place of basic spin-locks, which improves performance in a wide range of scenarios. The fine-tuned concurrent access of [2] is not, however, a true fine-grained locking technique, because different threads may hash to and share the same lock while accessing different elements. The more hash collisions we have, the further "fine-tuned access" strays from fine-grained locking.
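The address-hashing idea of [2] can be sketched as follows. The Lock_table class and its hash function are illustrative assumptions, not the CritSecTable_locker_t code, and std::shared_mutex again stands in for the Windows lock objects:

```cpp
#include <cstdint>
#include <shared_mutex>

// A fixed table of 256 reader/writer locks; an element's address is
// hashed to pick the lock that guards it.
class Lock_table
{
public:
    std::shared_mutex& lock_for(const void* element)
    {
        return m_locks[hash(element)];
    }
    static std::size_t hash(const void* element)
    {
        // Discard the low bits (objects are aligned, so they carry little
        // entropy) and fold the address into the range 0..255.
        std::uintptr_t p = reinterpret_cast<std::uintptr_t>(element);
        return (p >> 4) & 0xFF;
    }
private:
    std::shared_mutex m_locks[256];
};
```

The same address always maps to the same lock, which is what makes the scheme correct; the collisions discussed above arise when two different addresses map to the same slot.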

Figure 8. Comparison of writer performance for hash containers using STL stdext::hash_map and Intel TBB tbb::concurrent_hash_map, when running multiple writer and reader threads on a quad-core PC (Intel Q6700, 2.66 GHz, Hyper-Threading disabled, 64-bit Windows Vista Home Premium SP2), compiled in VC++ 2008 Express (x64 applications). Duration per writing operation in microseconds.


Figure 9. Comparison of writer performance for map wrappers using STL std::map, when running multiple writer and reader threads on a quad-core PC (Intel Q6700, 2.66 GHz, Hyper-Threading disabled, 64-bit Windows Vista Home Premium SP2), compiled in VC++ 2008 Express (x64 applications). Duration per writing operation in microseconds.


To facilitate fine-grained locking, we can use a pool of locks in place of hashing element addresses. When a thread needs access to a container element, it retrieves a lock object from the pool. Other threads address this lock while accessing the same container element. The lock is pushed back to the pool when no threads are accessing the element. In other words, we need a pool of shared objects (locks). The minimum capacity of such a pool equals the maximum number_of_threads (supposing that all the reader/writer threads might access different elements). The implementation of the Map_t container, which uses a pool of locks, is shown in Figure 10. For the sake of test code simplicity, we use the same template argument lock_type for both the pool locks and the global lock. The template could easily be adjusted if we wanted different lock classes for global and element scope locking. To ease the usage of a lock pool in Map_t, we introduce a helper class, Container_locker_t; see Figure 11.

Figure 10. A concurrent container Map_t with a pool of locks.
/*
Copyright (c) 2009 Valery Grebnev.
CPOL (Code Project Open License) Licensed 
<a href="http://www.codeproject.com/info/cpol10.aspx">http://www.codeproject.com/info/cpol10.aspx</a>
*/
template <typename key_type, typename value_type, 
          typename lock_type, WORD number_of_threads>
class Map_t 
{
public:
     
     bool insert( const std::pair<key_type, value_type>& _val) 
     {
          internal_value value(_val.second);
          std::pair<key_type, internal_value> pair(_val.first, value);
          Container_locker::Exclusive_lock lock(m_map_lock.m_global_scope_lock);
          return m_map.insert(pair).second;
     }
     bool erase(const key_type& _key)
     {
          Container_locker::Exclusive_lock lock(m_map_lock.m_global_scope_lock);
          return (m_map.erase(_key) != 0);
     }
     bool update(const key_type& _key, const value_type& _val)
     {
          Container_locker::Shared_lock lock(m_map_lock.m_global_scope_lock);
          map_iterator iterator = m_map.find(_key);
          if (iterator != m_map.end())
          {
                Container_locker::Element_exclusive_lock element_lock(
                   iterator->second.m_lock, m_map_lock.m_element_locks);
               iterator->second.m_value = _val; 
               return true;
          }
          else
          {
               return false;
          }
     }
     bool find(const key_type& _key, value_type& _val)
     {
          Container_locker::Shared_lock lock(m_map_lock.m_global_scope_lock);
          map_iterator iterator = m_map.find(_key);
          if (iterator != m_map.end())
          {
                Container_locker::Element_shared_lock element_lock(
                  iterator->second.m_lock, m_map_lock.m_element_locks);
               _val = iterator->second.m_value; 
               return true;
          }
          else
          {
               return false;
          }
     }
private:     
     typedef Container_locker_t<lock_type, number_of_threads> Container_locker;
     typedef typename Container_locker::internal_value<value_type> internal_value;
     
#ifdef HAS_HASH_MAP
     typedef typename stdext::hash_map<key_type, internal_value>::iterator map_iterator;
     stdext::hash_map<key_type, internal_value> m_map;
#elif defined HAS_MAP
     typedef typename std::map<key_type, internal_value>::iterator map_iterator;
     std::map<key_type, internal_value> m_map;
#endif
     CACHE_PADDING1
     Container_locker m_map_lock;
};

Internally, the container holds Container_locker::internal_value<value_type> objects, encapsulating both the m_value and the corresponding m_lock per element. The HIWORD of the m_lock member is the index of the lock object (the pool element index) which controls thread access to m_value. The LOWORD is the number of threads currently sharing the lock. When a thread needs to access a container element, it retrieves the index of a free lock object from the pool, and writes this value to the m_lock HIWORD. The LOWORD counter is incremented when threads use this index to address the actual lock in the pool, and decremented after a thread has finished accessing the element. Finally, the lock index is pushed back to the pool when both the HIWORD and LOWORD are zero (no threads are accessing the lock).
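The m_lock encoding just described can be illustrated with a few helpers; the function names below are assumptions for this sketch, and std::atomic replaces the interlocked compare-exchange of the attached code:

```cpp
#include <atomic>
#include <cstdint>

// The high 16 bits of the status word hold the pool index of the
// element's lock; the low 16 bits count the threads sharing it.
inline std::uint32_t make_status(std::uint16_t lock_index, std::uint16_t share_count)
{
    return (static_cast<std::uint32_t>(lock_index) << 16) | share_count;
}
inline std::uint16_t lock_index(std::uint32_t status)  { return status >> 16; }    // HIWORD
inline std::uint16_t share_count(std::uint32_t status) { return status & 0xFFFF; } // LOWORD

// A thread attaches to the element in one atomic step: publish new_index
// if no lock is assigned yet, otherwise keep the existing index and just
// bump the share count. Returns the resulting status word.
inline std::uint32_t attach(std::atomic<std::uint32_t>& status, std::uint16_t new_index)
{
    std::uint32_t old_v = status.load();
    std::uint32_t new_v;
    do {
        new_v = (lock_index(old_v) == 0)
              ? make_status(new_index, share_count(old_v) + 1)
              : old_v + 1;
    } while (!status.compare_exchange_weak(old_v, new_v));
    return new_v;
}
```

If two threads race through attach(), only one publishes its index; the loser sees a non-zero HIWORD and merely increments the count, which is why Figure 12's pop() returns the unused index to the pool.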

Figure 11. A helper class Container_locker_t with a pool of locks.
/*
Copyright (c) 2009 Valery Grebnev.
CPOL (Code Project Open License)
Licensed <a href="http://www.codeproject.com/info/cpol10.aspx">http://www.codeproject.com/info/cpol10.aspx</a>
*/
template <typename Lock_type, WORD number_of_threads>
struct CACHE_ALIGN Container_locker_t
{
     typedef ref_exclusive_lock_t<Lock_type> Exclusive_lock;
     typedef ref_shared_lock_t<Lock_type> Shared_lock;
     
     typedef Pool_of_shared_elements<Lock_type, number_of_threads> Pool_of_locks;
     typedef ref_exclusive_pool_lock_t<Lock_type, Pool_of_locks> Element_exclusive_lock;
     typedef ref_shared_pool_lock_t<Lock_type, Pool_of_locks> Element_shared_lock;
     template <typename value_type>
     struct internal_value
     {
          internal_value(value_type value): m_lock(0), m_value(value) 
          {
          }
          volatile LONG CACHE_ALIGN m_lock;
          CACHE_PADDING(2)
          value_type m_value;
     };
     CACHE_PADDING(1)
     Pool_of_locks CACHE_ALIGN m_element_locks;
     CACHE_PADDING(2)
     Lock_type CACHE_ALIGN m_global_scope_lock;
     Container_locker_t& operator=(const Container_locker_t&); //C4512 warning
};

To make these operations thread safe, we implement a pool wrapper, Pool_of_shared_elements; see Figure 12. The elements' m_lock values are passed to the Pool_of_shared_elements pop/push methods (as the reference_counter argument), where both the HIWORD and LOWORD are set atomically. A simple implementation of the base pool class, Pool_t, can be similar to the pool from the article [4]; we slightly adjusted that code in order to address elements by indexes. The size of the Pool_t is defined by the template class parameter as pool_size = (2*number_of_threads - 1). We use a greater pool capacity than the minimum number_of_threads for the sake of simplicity of the Pool_of_shared_elements code: during the pop race over an element's m_lock, one thread in a pair may transiently hold two pool locks, so two threads may hold three locks, three threads five locks, and, considering all the thread combinations by pairs, the required pool capacity is 2*number_of_threads - 1 (to illustrate this scenario, we additionally attach the code of a test class, Pool_of_shared_elements_crash_test).

Figure 12. A pool of shared elements (locks), Pool_of_shared_elements.
/*
Copyright (c) 2009 Valery Grebnev.
CPOL (Code Project Open License)
Licensed <a href="http://www.codeproject.com/info/cpol10.aspx">http://www.codeproject.com/info/cpol10.aspx</a>
*/
template <typename Entry_t, WORD number_of_threads, 
          const size_t pool_size = (2*number_of_threads - 1)> 
class CACHE_ALIGN Pool_of_shared_elements : private Pool_t<Entry_t, pool_size>
{
public:
     WORD pop(volatile LONG& reference_counter)
     {
          LONG i;
          LONG j;
          LONG exch_counter;
          WORD new_index = static_cast<WORD>(__super::pop());
          if (new_index == 0)
          {
               return 0; /* error: the pool is exhausted */
          }
          j = acquire_read(reference_counter);
          do
          {
               exch_counter = (HIWORD(j) == 0)? ((LONG)new_index << 16) | 
                                                 (LOWORD(j) + 1) : j + 1;
               i = j;
               j = acquire_interlocked_compare_exchange(&reference_counter,
                                   exch_counter,
                                   i);
               PAUSE
          } while(i != j);
          WORD index = HIWORD(exch_counter); 
          if ( index != new_index)
          {
               /* another thread already published its lock index;
                  return the unused one to the pool */
               __super::push( new_index );
          }
          return index;
     }
     void push(WORD index, volatile LONG& reference_counter)
     {
          LONG counter = interlocked_decrement_release(&reference_counter);
          if (LOWORD(counter) == 0)
          {
               if ( counter == acquire_interlocked_compare_exchange(
                                       &reference_counter, 0x0, counter))
               {
                    __super::push( index );
               }
          }
     }
     inline Entry_t* get(WORD index)
     {
          return __super::get(index);
     }
};

Two of the factors affecting the performance of Pool_t are "false sharing" and thread contention. To make pool operations faster on a multi-core PC, a distributed pool might be a better choice [4]. The Array_of_pools_t class, borrowed from [4] and maintaining "per processor" distributed pools, is used in this article when implementing the Map_t class. The container shows high performance for both readers and writers; see the Lock_pool results in Figures 5, 6, 8, and 9. The advantage of this approach is that it better facilitates fine-grained locking for readers and writers, improving performance and scalability.

Conclusion

  • The simple locking techniques considered in this article can be used to facilitate more concurrency and scalability for element scope operations.
  • Locking techniques using the simplest spin-locks per container element can be leveraged for large, not highly contended containers. For scenarios favoring readers, using reader/writer spin-locks considerably improves reader concurrency and performance.
  • Using pools of locks is a fast and robust solution for highly concurrent containers on multi-core machines, giving a reasonable balance between scalability and element scope reader/writer performance.

References

  1. Intel® Threading Building Blocks 2.2, http://www.threadingbuildingblocks.org/
  2. Ian Emmons, “Multiprocessor Optimizations: Fine-Tuning Concurrent Access to Large Data Collections”, http://msdn.microsoft.com/en-ca/magazine/cc301503.aspx
  3. Valery Grebnev, “Reader/writer locks with a helper”, testing_spin_rwlocks.aspx
  4. Valery Grebnev, “Testing distributed memory pools”, test_pool_t.aspx

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)
