
TAM - Threaded Array Manipulator

14 Nov 2006, CPOL
Performing mathematical operations on large arrays while exploiting 100% of a multi-core CPU.

Scope

This article describes how to perform mathematical operations on large arrays while exploiting 100% of the CPU computation power regardless of how many cores/threads your CPU has.

Introduction

In the good old past, life was easy. Each CPU ran a single thread, and the MHz/GHz race was at its best. Then the folks at Intel and AMD got smarter and figured that squeezing more GHz out of the CPU is not the answer, but adding more processing power is. Why try to make your horse run faster when you can buy another horse (without spending too much $$$) to help it? First, Intel came up with Hyper-Threading (HT), which theoretically doubled the CPU power; then came dual cores and dual HT cores. Quad-core processors just hit the market and will be on your desktop in no time. The big problem with the "core race" is that most software nowadays is not multi-threaded and does not exploit the full power of a multi-core/multi-threaded CPU. One way to exploit a multi-core CPU is to run multiple "small" applications on a single machine. That isn't always possible, especially if you have only one main application to run. The second way is to write multi-threaded code, which is where this article comes in handy.

TAM shows a very simple way to do multi-threaded mathematical operations on large arrays of numbers.

Some background

Before starting to rewrite your code to run multi-threaded, it is recommended to first know a thing or two about threading.

Most CPUs designed in the past 10 years, even single-core ones, are built for multi-threaded operation, with dedicated assembly instructions that make life easier (and faster). The OS usually takes responsibility for monitoring and coordinating the threads executed by the CPU, switching between them according to various properties. It is most important to understand that each CPU, or core, can execute only a single thread at a time. The OS gives you the feeling that many threads are running at once by switching between them, sometimes many times per second. To switch between two threads, the OS initiates a "context switch" routine (each OS does this differently). This short routine saves the CPU state and data used by the current thread for later use, and loads the next thread's state and data. Context switches cost CPU cycles: switching more often between threads may give a smoother feeling of all the threads running together, but can result in overall slower performance caused by the many context switches.

CPU cycles are not the only concern. A typical CPU runs at 3 GHz while memory and the FSB run at only 400-1000 MHz, so memory is much slower than the CPU. This means you should perform as many mathematical operations on the data at once as possible: each time you bring the data back into the CPU pipeline, the CPU has to fetch it from the cache (in the good case) or from memory or the page file (in the worst case). While a simple floating-point operation can take 1-2 cycles, fetching data from memory can easily take 100 cycles or more.

   // Good: one pass over the array - each element is fetched once,
   // and all three operations run while it is still in cache
   double Sum = 0;
   for (int i = 0; i < A.Length; i++) {
         A[i] = Math.Sqrt(A[i]);
         A[i] = Math.Sin(A[i]);
         Sum += A[i];
   }

   // Bad: three passes - each loop re-fetches the entire array
   // from cache or main memory
   double Sum = 0;
   for (int i = 0; i < A.Length; i++) {
         A[i] = Math.Sqrt(A[i]);
   }
   for (int i = 0; i < A.Length; i++) {
         A[i] = Math.Sin(A[i]);
   }
   for (int i = 0; i < A.Length; i++) {
         Sum += A[i];
   }

The Code

The code is not complicated. :)

Each array is divided into sections according to the number of threads. Each worker thread does its work on its own section, and the final results are gathered by the main thread.
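The split/compute/gather pattern described above can be sketched as follows. This is my own minimal illustration of the idea, not TAM's actual source; the `ParallelSum` name and the plain summation workload are assumptions for the example.

```csharp
using System;
using System.Threading;

class ParallelSumDemo
{
    // Sketch of the pattern: split the array into per-thread sections,
    // let each worker compute a partial result, then gather on the main thread.
    public static double ParallelSum(float[] data, int threadCount)
    {
        double[] partial = new double[threadCount];  // one result slot per worker
        Thread[] workers = new Thread[threadCount];
        int chunk = data.Length / threadCount;

        for (int t = 0; t < threadCount; t++)
        {
            int id = t;                              // capture per-iteration copies
            int start = id * chunk;
            // the last worker also takes any remainder elements
            int end = (id == threadCount - 1) ? data.Length : start + chunk;

            workers[id] = new Thread(delegate()
            {
                double sum = 0;
                for (int i = start; i < end; i++)
                    sum += data[i];
                partial[id] = sum;                   // distinct slots: no locking needed
            });
            workers[id].Start();
        }

        double total = 0;
        for (int t = 0; t < threadCount; t++)
        {
            workers[t].Join();                       // wait for each worker to finish
            total += partial[t];
        }
        return total;                                // gather on the main thread
    }

    static void Main()
    {
        float[] A = new float[1000000];
        for (int i = 0; i < A.Length; i++) A[i] = 1.0f;
        Console.WriteLine(ParallelSum(A, 2));        // prints 1000000
    }
}
```

Because each worker writes only to its own slot of `partial`, the threads never touch shared state, so no locks are required until the cheap final gather.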

Using TAM

TAM is best used with large arrays, or with small arrays on which extensive mathematical operations are performed per element. Using TAM on a short array will take longer than doing the same computation in a straightforward way, because of the "unneeded" context switches.
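One way to act on this advice is to guard the parallel path with a size check. The cut-off value below is purely illustrative (it is not part of TAM and would need tuning per machine and per workload):

```csharp
using System;

class ThresholdDemo
{
    // Hypothetical cut-off: below this length, thread start-up and
    // context-switch overhead outweighs any parallel speed-up.
    const int MinParallelLength = 100000;

    public static bool ShouldParallelize(int arrayLength, int threadCount)
    {
        return threadCount > 1 && arrayLength >= MinParallelLength;
    }

    static void Main()
    {
        Console.WriteLine(ShouldParallelize(500, 4));       // short array -> False
        Console.WriteLine(ShouldParallelize(90000000, 4));  // large array -> True
    }
}
```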

In the example below, I created a 90,000,000-element float array and populated it with values. I then performed a mathematical operation on each element and summed the values. I performed this twice, once using TAM and once straightforwardly.

static void Main() {
   // Create a float array
   float[] A = new float[90000000]; // 360MB
   for(int i=0 ; i < A.Length;i++) A[i] = i;
           
   TAM TamA = new TAM(2); // Set number of worker threads.
            
   //Test With TAM
   DateTime TAMdt = HighResClock.Now;
   double AnsA = TamA.Test(A);
   TimeSpan TAMts = HighResClock.Now - TAMdt;
 
   //Test Without TAM
   DateTime REGdt = HighResClock.Now;
   double AnsB = 0;
   for (int i = 0; i < A.Length; i++)
      AnsB += Math.Sin(Math.Sqrt(Math.Sqrt(A[i] * Math.PI)) * 1.01);
   TimeSpan REGts = HighResClock.Now - REGdt;
 
   Console.WriteLine("AnsA is : " + AnsA + " Took : " + TAMts.Ticks);
   Console.WriteLine("AnsB is : " + AnsB + " Took : " + REGts.Ticks);
}

HighResClock is a helper I found for measuring accurate time durations. You can find that project on CodeProject.
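As an alternative to the HighResClock project, .NET 2.0 ships with `System.Diagnostics.Stopwatch`, which wraps the same high-resolution performance counter. A minimal sketch of the timing pattern used in the example above (the workload here is arbitrary):

```csharp
using System;
using System.Diagnostics;

class TimingDemo
{
    static void Main()
    {
        Stopwatch sw = Stopwatch.StartNew();  // high-resolution timer in .NET 2.0+
        double sum = 0;
        for (int i = 0; i < 1000000; i++)
            sum += Math.Sqrt(i);
        sw.Stop();
        Console.WriteLine("Took " + sw.Elapsed.TotalSeconds + " seconds");
    }
}
```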

Benchmarks

I tried TAM on two different machines, each with three configurations (1 thread, 2 threads, and 4 threads).

Single P4 HT 3GHz + 1GB RAM

  • Straightforward - 7.9706553 seconds
  • TAM - 1 thread - 7.9629043 seconds
  • TAM - 2 threads - 5.6013839 seconds
  • TAM - 4 threads - 5.5948156 seconds

Using two threads resulted in a 30% improvement over the straightforward method.

Dual Xeon HT 3GHz + 4GB RAM

  • Straightforward - 7.6798472 seconds
  • TAM - 1 thread - 7.6072723 seconds
  • TAM - 2 threads - 4.3261215 seconds
  • TAM - 4 threads - 2.5714356 seconds

Using four threads resulted in a 66% improvement over the straightforward method.

Although one might expect four pipelines to be "4 times" faster than a single pipeline, there are in-CPU optimizations when running single-threaded code on an HT CPU, and there is the context-switch "tax" when running multi-threaded code.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

About the Author

Gilad Kapelushnik
Team Leader
Israel
Gilad holds a B.Sc in Computer Eng. from the Technion IIT.


Article Copyright 2006 by Gilad Kapelushnik