Abstract
Synchronization is often an expensive operation that can limit the performance of a multi-threaded program. Using thread-local
data structures instead of data structures shared by the threads can reduce synchronization in certain cases, allowing a program to
run faster.
This article is part of the larger series, "The Intel Guide for Developing Multithreaded Applications," which provides
guidelines for developing efficient multithreaded applications for Intel® platforms.
Background
When data structures are shared by a group of threads and at least one thread is writing into them, synchronization between the
threads is sometimes necessary to make sure that all threads see a consistent view of the shared data. A typical synchronized access
regime for threads in this situation is for a thread to acquire a lock, read or write the shared data structures, then release the
lock.
All forms of locking incur overhead to maintain the lock data structures, and they rely on atomic instructions that are expensive on modern processors. Synchronization also slows down the program because it eliminates parallel execution inside the synchronized code, creating a serial bottleneck. Therefore, when synchronization occurs within a time-critical section of code, performance can suffer.
The synchronization can be eliminated from the multithreaded, time-critical code sections if the program can be re-written to use
thread-local storage instead of shared data structures. This is possible if the nature of the code is such that real-time ordering
of the accesses to the shared data is unimportant. Synchronization can also be eliminated when the ordering of accesses is
important, if the ordering can be safely postponed to execute during infrequent, non-time-critical sections of code.
Consider, for example, the use of a variable to count events that happen on several threads. One way to write such a program in
OpenMP* is as follows:
int count = 0;
#pragma omp parallel shared(count)
{
    . . .
    if (event_happened) {
        #pragma omp atomic
        count++;
    }
    . . .
}
This program pays a price each time the event happens, because it must synchronize to guarantee that only one thread increments count at a time. Every event causes synchronization. Removing the synchronization makes the program run faster. One way to do this is to have each thread count its own events during the parallel region and then sum the individual counts later. This technique is demonstrated in the following program:
int count = 0;
int tcount = 0;
#pragma omp threadprivate(tcount)

omp_set_dynamic(0);
#pragma omp parallel
{
    . . .
    if (event_happened) {
        tcount++;
    }
    . . .
}

#pragma omp parallel shared(count)
{
    #pragma omp atomic
    count += tcount;
}
This program uses a tcount variable that is private to each thread to store the count for each thread. After the first parallel region counts all the local events, a subsequent region adds this count into the overall count. This solution trades synchronization per event for synchronization per thread. Performance will improve if the number of events is much larger than the number of threads. Please note that this program assumes that both parallel regions execute with the same number of threads. The call to omp_set_dynamic(0) prevents the number of threads from being different from the number requested by the program.
An additional advantage of using thread-local storage during time-critical portions of the program is that the data may stay live in a processor's cache longer than shared data, if the processors do not share a data cache. When the same address exists in the data caches of several processors and is written by one of them, it must be invalidated in the caches of all other processors, causing it to be re-fetched from memory when those processors next access it. But thread-local data is never written by any processor other than the one to which it is local, so it is more likely to remain in that processor's cache.
The code snippet above shows one way to specify thread-local data in OpenMP. To do the same thing with Pthreads, the programmer
must create a key for thread-local data, then access the data via that key. For example:
#include <pthread.h>

pthread_key_t tsd_key;
<arbitrary data type> value;
int status;

status = pthread_key_create(&tsd_key, NULL);
if (status) err_abort(status, "Error creating key");

status = pthread_setspecific(tsd_key, value);
if (status) err_abort(status, "Error in pthread_setspecific");
. . .
value = (<arbitrary data type>)pthread_getspecific(tsd_key);
With Windows threads, the operation is very similar. The programmer allocates a TLS index with TlsAlloc, then uses that index to
set a thread-local value. For example:
DWORD tls_index;
LPVOID value;
BOOL status;

tls_index = TlsAlloc();
if (tls_index == TLS_OUT_OF_INDEXES) err_abort(tls_index, "Error in TlsAlloc");

status = TlsSetValue(tls_index, value);
if (status == 0) err_abort(status, "Error in TlsSetValue");
. . .
value = TlsGetValue(tls_index);
In OpenMP, one can also create thread-local variables by specifying them in a private clause on the parallel pragma. These variables are automatically deallocated at the end of the parallel region. Of course, another way to specify thread-local data, regardless of the threading model, is to use variables allocated on the stack in a given scope. Such variables are deallocated at the end of the scope.
Advice
The technique of thread-local storage is applicable if synchronization is coded within a time-critical section of code, and if
the operations being synchronized need not be ordered in real-time. If the real-time order of the operations is important, then the
technique can still be applied if enough information can be captured during the time-critical section to reproduce the ordering
later, during a non-time-critical section of code.
Consider the following example, where threads write data into a shared buffer:
int buffer[NENTRIES];
int current_ptr = 0;

int main() {
    . . .
    #pragma omp parallel
    {
        . . .
        update_log(time, value1, value2);
        . . .
    }
    . . .
}
void update_log(int time, int value1, int value2)
{
    #pragma omp critical
    {
        if (current_ptr + 3 > NENTRIES) {
            print_buffer_overflow_message();
        } else {
            buffer[current_ptr]   = time;
            buffer[current_ptr+1] = value1;
            buffer[current_ptr+2] = value2;
            current_ptr += 3;
        }
    }
}
Assume that time is some monotonically increasing value, and the only real requirement the program places on this buffer data is that it be written to a file occasionally, sorted according to time. Synchronization can be eliminated in the update_log routine by using thread-local buffers. Each thread would allocate a separate copy of tpbuffer and tpcurrent_ptr. This allows the elimination of the critical section in update_log. The entries from the various thread-private buffers can be merged later, according to the time values, in a non-time-critical portion of the program.
Usage Guidelines
One must be careful about the trade-offs involved in this technique. The technique does not remove the need for synchronization,
but only moves the synchronization from a time-critical section of the code to a non-time-critical section of the code.
- First, determine whether the section of code containing the synchronization is actually being slowed down significantly by it. Intel® Parallel Amplifier and/or the Intel® VTune™ Performance Analyzer can be used to check each section of code for performance problems.
- Second, determine whether the time ordering of the operations is critical to the application. If not, synchronization can be
removed, as in the event-counting code. If time ordering is critical, can the ordering be correctly reconstructed later?
- Third, verify that moving synchronization to another place in the code will not cause similar performance problems in the new
location. One way to do this is to show that the number of synchronizations will decrease dramatically because of your work (such as
in the event-counting example above).