13,287,116 members (57,702 online)
alternative version

#### Stats

28K views
31 bookmarked
Posted 14 Nov 2006

# TAM - Threaded Array Manipulator

, 14 Nov 2006
 Rate this:
Performing mathematical operations on large arrays while exploiting 100% of a multi-core CPU.

## Scope

This article describes how to perform mathematical operations on large arrays while exploiting 100% of the CPU computation power regardless of how many cores/threads your CPU has.

## Introduction

TAM shows a very simple way to do multi-threaded mathematical operations on large arrays of numbers.

## Some background

Before starting to reprogram your code to run multi-threaded, it is first recommended to know a thing or two on threading.

CPU cycles are not the only concern. A standard CPU runs at 3GHz while Memory and FSB run at only 400-1000Mhz, memory runs much slower than the CPU. This means you should always do as much of the mathematical operations on the data at once, as each time you recall the data back to the CPU pipeline causes the CPU to fetch the data from the cache (in the good case) or from the memory or page file (in the worst case). While doing a simple floating point operation can take 1-2 cycles, fetching data from memory can easily take 100 or more cycles.

```//Good
double Sum = 0;
for (int i = 0; i < A.Length; i++) {
A[i] = Math.Sqrt(A[i]);
A[i] = Math.Sin(A[i]);
Sum += A[i];
}

double Sum = 0;
for (int i = 0; i < A.Length; i++) {
A[i] = Math.Sqrt(A[i]);
}
for (int i = 0; i < A.Length; i++) {
A[i] = Math.Sin(A[i]);
}
for (int i = 0; i < A.Length; i++) {
Sum += A[i];
}
```

## The Code

The code is not complicated :)

Each array is divided into sections according to the number of threads. Each worker process does its work on the relevant section and the final results are gathered by the main running thread.

## Using TAM

TAM is best used with large arrays, or small arrays with extensive mathematical operations done on every element. Trying to use TAM on a short array will result in longer time than doing the same computation in a straightforward way (this is due to "unneeded" context switches).

In the example below, I created a 90,000,000 elements `float` array and populated it with values. Then I perform a mathematical operation on each element and sum the values. I preformed this twice, once using TAM and once straightforward.

```static void Main() {
// Create a float array
float[] A = new float[90000000]; // 360MB
for(int i=0 ; i < A.Length;i++) A[i] = i;

TAM TamA = new TAM(2); // Set number of worker threads.

//Test With TAM
DateTime TAMdt = HighResClock.Now;
double AnsA = TamA.Test(A);
TimeSpan TAMts = HighResClock.Now - TAMdt;

//Test Without TAM
DateTime REGdt = HighResClock.Now;
double AnsB = 0;
for (int i = 0; i < A.Length; i++) AnsB +=
Math.Sin(Math.Sqrt(Math.Sqrt(A[i] * Math.PI))*1.01);
TimeSpan REGts = HighResClock.Now - REGdt;

Console.WriteLine("AnsA is : " + AnsA + " Took : " + TAMts.Ticks);
Console.WriteLine("AnsB is : " + AnsB + " Took : " + REGts.Ticks);
}```

`HighResClock` is a way I found to calculate accurate time durations. You can find the project here in CodeProject.

## Benchmarks

I tried TAM on two different machines, each with three configurations (1 thread, 2 threads, and 4 threads).

### Single P4 HT 3GHz + 1GB RAM

• Straightforward - 7.9706553 seconds
• TAM - 1 thread - 7.9629043 seconds
• TAM - 2 threads - 5.6013839 seconds
• TAM - 4 threads - 5.5948156 seconds

Using two threads resulted in a 30% improvement over the straightforward method.

### Dual Xeon HT 3GHz + 4GB RAM

• Straightforward - 7.6798472 seconds
• TAM - 1 thread - 7.6072723 seconds
• TAM - 2 threads - 4.3261215 seconds
• TAM - 4 threads - 2.5714356 seconds

Using four threads resulted in a 66% improvement over the straightforward method.

Although one could think that four pipes are "4 times" faster than a single pipe, there are some in-CPU optimizations while running non-threaded code on a HT CPU, and there is the context switch "tax" while running multi-threaded code.

## About the Author

Gilad holds a B.Sc in Computer Eng. from the Technion IIT.

## Comments and Discussions

 First Prev Next
 Taking advantage of shared L2 cache DQNOK1-Mar-07 11:48 DQNOK 1-Mar-07 11:48
 there are some bug in the spliting thread. what abt OpenMP? f230-Dec-06 20:33 f2 30-Dec-06 20:33
 Some more data [modified] [echo]11-Dec-06 3:07 [echo] 11-Dec-06 3:07
 Last Visit: 31-Dec-99 19:00     Last Update: 10-Dec-17 14:00 Refresh 1

General    News    Suggestion    Question    Bug    Answer    Joke    Praise    Rant    Admin

Use Ctrl+Left/Right to switch messages, Ctrl+Up/Down to switch threads, Ctrl+Shift+Left/Right to switch pages.