This is a showcase review for our sponsors at The Code Project. These reviews are intended to provide you with information on products and services that we consider useful and of value to developers.
The Threading Methodology used at Intel has four major steps: Analysis, Design & Implementation, Debugging, and Performance Tuning. These steps are used to create a multithreaded application from a serial base code. Besides the recent Patterns for Parallel Programming by Mattson, Sanders and Massingill, there hasn’t been much written about how to do the Design & Implementation part of the process.
Multithreaded programming is still more art than science. This article gives 8 Simple Rules that you can add to your palette of threading design methods. By following these rules, you will have more success in writing the most efficient threaded implementations of your applications.
Rule 1. Be sure you identify truly independent computations.
You can’t execute anything concurrently unless the operations that would run in parallel can be executed independently of each other. We can easily think of real-world instances of independent actions being performed to satisfy a single goal. For example, building a house can involve many workers with different skills: carpenters, electricians, glaziers, plumbers, roofers, painters, masons, lawn wranglers, etc. There are some obvious scheduling dependencies between pairs of workers (you can’t put on roof shingles before the walls are built, and you can’t paint the walls until the drywall is installed), but for the most part, the people involved in building a house can work independently of each other.
There are cases where you will have exclusively sequential computations that cannot be made concurrent; many of these involve dependencies between loop iterations or steps that must be carried out in a specific order. An example of the latter is pregnancy. You can’t get a baby by putting nine women on the job for one month, but you can field a future baseball team by getting each of those nine women to produce a new batter after nine months.
Rule 2. Implement concurrency at the highest level possible.
There are two directions you can take when approaching a project to thread serial code: bottom-up and top-down. In the Analysis phase of the Threading Methodology, you identify the segments of your code that take the most execution time (the hotspots). If you are able to run those code portions in parallel, you will have the best chance of achieving the maximum performance possible.
In a bottom-up approach, you would attempt to thread the hotspots in your code. If this is not possible, you can search up the call stack of the application to determine if there is another place in the code that can be run in parallel and still execute the hotspots. Even if it is possible to employ concurrency at the hotspot code, you should still look to see if it would be possible to implement that concurrency at a point in the code higher up in the call stack. This can increase the granularity of the execution done by each thread.
With the top-down approach, you first consider the whole application, what the computation is coded to accomplish and, at a very abstract level, all the parts of the application that combine to realize that computation. If there is no obvious concurrency, you should distill the parts of the computation into successively smaller parts until you can identify independent computations. Results from the Analysis phase can guide your investigation to include the most time-consuming modules.
The granularity of concurrent computations is loosely defined as the amount of computation done before synchronization is needed. The longer the time between synchronizations, the coarser the granularity will be. Fine-grained parallelism runs the danger of not having enough work assigned to threads to overcome the overhead costs of using threads. Coarse-grained parallelism has lower overhead costs and also tends to be more readily scalable to an increase in the number of threads. Top-down approaches to threading (driving the point of threading as high in the call stack as possible) are the best options for achieving a coarse-grained solution.
Rule 3. Plan early for scalability to take advantage of increasing numbers of cores.
Processors have recently gone from being dual-core to quad-core. The number of cores available in future processors will only increase. Thus, you should plan for such processor increases within your software. Scalability is the measure of an application’s ability to handle changes, typically increases, in system resources (number of cores, memory size, bus speed, etc.) or data set sizes. In the face of more cores being available, write flexible code that can take advantage of different numbers of cores.
To paraphrase C. Northcote Parkinson, “Data expands to fill the processing power available.” As computational power increases (more cores), the data to be processed is likely to expand as well. Designing and implementing concurrency with data decomposition methods will be more scalable than functional decomposition. The number of independent functions is likely limited and fixed during the course of an application’s execution.
Even though an application has been written with threads assigned to independent functions, there still may be possibilities of using extra threads when the input workload increases. Consider the house building example from above. There is a finite number of separate tasks to be done. However, if the size of the house to be built were doubled, you would expect that extra workers might be assigned within some of those tasks (i.e. extra painters, extra roofers, extra plumbers, etc.). Therefore you should be aware of the data decomposition possibilities that can arise from increased data sets, even within functionally decomposed tasks.
Rule 4. Make use of thread-safe libraries wherever possible.
If your hotspot computations can be executed through a library call, then you should strongly consider using the equivalent library function instead of hand-written code. It’s never a good idea to “reinvent the wheel” by writing code that performs calculations already encapsulated in optimized library routines. Many libraries, such as the Intel® Math Kernel Library (MKL) and Intel® Integrated Performance Primitives (IPP), have functions that are already threaded to take advantage of multi-core processors.
Even more important than using threaded library routines, though, is ensuring that any library calls used are thread-safe. That is, if library routines are called from two different threads, each call will still return correct results. Check the documentation for the thread-safety of any third-party library you are using.
Rule 5. Use the right threading model.
If threaded libraries are insufficient to cover all the concurrency of an application and you must employ user-controlled threads, don’t use (more complex) explicit threads if OpenMP or Intel® Threading Building Blocks (TBB) has all the functionality you need. If you don’t need the extra flexibility you can get with explicit threads, there’s probably no reason to do more work than necessary or add complexity to your implementation that will make it harder to maintain the code.
OpenMP is focused on data decomposition methods, especially targeted to threading of loops that range over large data sets. TBB is similarly focused, but also contains several other parallel algorithms and parallel data containers. Even if you will ultimately need to implement your threading with an explicit model, you could still use OpenMP to prototype the planned concurrency and better estimate the potential performance gains, possible scalability, and how much effort will be needed to thread the serial code with explicit threads.
Rule 6. Never assume a particular order of execution.
Thread execution order is non-deterministic and controlled by the OS scheduler. There is no reliable way of predicting the order of threads running from one execution to another or even which thread will be scheduled to run next. This is done primarily to hide execution latency within an application, especially when run on a system with fewer cores than threads.
Data races are a direct result of this scheduling non-determinism. If you assume that one thread will write a value into a shared variable before another thread reads that value, you may be right all of the time, some of the time, or none of the time. Code that relies on a particular order of execution among threads, enforced through nothing more than positive thinking, will run into problems such as data races and deadlock.
From a performance perspective, it is best to allow threads to run as unencumbered as possible. Don’t try to enforce a particular order of execution unless absolutely necessary. You need to be able to recognize those times when it is absolutely necessary and implement some form of synchronization to coordinate the execution order of threads relative to each other.
Rule 7. Use thread-local storage whenever possible; associate locks to specific data, if needed.
Synchronization is overhead that does not contribute to the furtherance of the computation, except to guarantee that the correct answers are produced from the parallel execution of an application. It is a necessary evil. Even so, you should actively seek to keep the amount of synchronization to a minimum. This can most readily be done through the use of storage local to threads or the use of exclusive memory locations (e.g. an array element indexed by thread ID).
Temporary work variables rarely need to be shared between threads and should be declared locally. Variables that hold partial results for each thread should also be local to threads. Combining the partial results into a shared location will require some synchronization. Ensuring that the shared updates are done as infrequently as possible will keep the amount of overhead to a minimum.
If local storage for each thread is not a valid option and you must coordinate access to shared resources through synchronization objects (like a lock), be sure to properly associate (or “attach”) locks to data items. The easiest way to do this is to have a 1:1 relationship of locks to data items. You can set a lock to protect more than one data item if that set of memory locations is always accessed in the same critical regions. Segal’s Law states, “A man with a watch knows what time it is. A man with two watches is never sure.” If two different locks protect access to the same variable, one part of the code may use one lock for access while another section uses the other lock. Threads executing in these two code portions will create a data race, since each will assume it has exclusive access to the contested data.
Rule 8. Don’t be afraid to change the algorithm for a better chance of concurrency.
When comparing the performance of applications, both serial and parallel, the bottom line is wall-clock execution time. When choosing between two or more algorithms, programmers may rely on the asymptotic order of execution. This theoretical metric will almost always correlate with an application’s performance.
In parallel applications, algorithms with a better asymptotic order of execution will run faster, too. Nonetheless, there will be times when the best serial algorithm will not be amenable to being parallelized. If a hotspot turns out to be something that can’t be easily turned into threaded code (and you can’t find a point higher in the call stack of the hotspot that can be made parallel), you should consider using a sub-optimal serial algorithm that may parallelize easier than the better serial algorithm currently in the code. Of course, there may be other, less drastic, changes that will allow you to parallelize a given portion of code.
We’ve covered 8 Simple Rules that you should keep in mind when you are designing the threading that will transform a serial application into a parallel version. By following the rules presented here and keeping in mind practical programming rules, you can more easily create concurrent solutions that will be more robust, less likely to contain threading problems, and move toward optimal performance in less time.