13,631,844 members
Article
alternative version

#### Stats

13.7K views
1 bookmarked
Posted 30 Mar 2012
Licenced CPOL

# Multiprocessor Scheduling Using Partition Approximation

, 30 Mar 2012
This article describes a technique where an approximation for the partition scheduling problem can be generalized to approximate scheduling for multiprocessor machines where the number of processors is a power of two

## Introduction

The partition scheduling problem asks that given a set S of integer values representing process running times, can the set be subdivided into two subsets S' and S'' such that the sum of all process running times in set S' is equal to the sum of all process running times in set S''. Thus, the total sum of process running times in either subset is equivalent to a bound B where B is exactly half of the total cost of all process running times in set S. This problem is a known NP-Complete problem and can be proven via a reduction from three dimensional matching [1].

NP-Complete problems to date, do not have any polynomial time algorithm that can solve them [1]. It is conjectured that such problems may not have a polynomial time solution. However, approximation algorithms have been developed to deal with many NP-Complete problems. Approximation algorithms are algorithms which do not solve the NP-Complete problem optimally in all cases, and trade accuracy for performance. Not all NP-Complete problems can be approximated [1]; however, the partition scheduling problem has several. This article describes a technique where an approximation for the partition scheduling problem can be generalized to approximate scheduling for multiprocessor machines where the number of processors is a power of two, i.e., 2, 4, 8, 16, and so on.

## Background

The partition scheduling problem may be approximated using the following heuristic:

heuristic1:

sort input values in descending order
select values from the largest element down to the smallest element
assign the value to the processor that has the least load
continue until all elements are assigned

Thus, a processor will get its next assignment if it has the lowest work load and that such a schedule is a greedy schedule in that it tries to select the best and most favorable condition at every iteration. This schedule will produce a good approximation for the problem. In the two processor case, the greedy approximation yields a 4/3 * OPT schedule [2].

The idea presented in this article is to extend the two processor schedule to multiprocessor schedules where the number of processors is a power of two. This can be done by the following heuristic:

heuristic2:

let S be a set of elements
apply heuristic1 to obtain a schedule for two processors, schedule A and schedule B
take schedule A and set it to be S, apply heuristic1 producing schedule A' and schedule A''
take schedule B and set it to be S, apply heuristic1 producing schedule B' and schedule B''
continue, setting each sub-schedule to be S and re-applying heuristic1 as needed

One will note that heuristic2 will repeatedly subdivide the schedules each time finding an assignment for a power of two many processors. For example, if there are only two processors, then there will only be schedule A and schedule B produced from heuristic1. If there are four processors, then schedule A and schedule B are reassigned back as input set S to be subdivided again by the heuristic. The end result is that schedule A would be subdivided into schedule A' and A'' for assignment to processors 1 and 2, and similarly, schedule B would be subdivided into schedule B' and B'' for assignment to processors 3 and 4. This assignment requires log2 N many steps and at each pass a call to the heuristic is made. The code can be implemented within a loop using a queue to hold the schedules.  The implementation is as follows:

```work_q .push(new work(a, size, 0));
for(int i = 0; i < iter; ++i) {
j = work_q.front();
if(!j) {
break;
}
while(j && j->pass == i) {
work_q.pop();
p(j->a,j->size,&b,&bsize,&c,&csize);
if(bsize)
work_q.push(new work(b,bsize,j->pass+1));
if(csize)
work_q.push(new work(c,csize,j->pass+1));
free((void *)j->a);
delete j;
j = work_q.front();
}
}
```

Where work is a struct of the following form:

```struct work {
work(int *_a, int _size, int _pass)
: a(_a), size(_size), pass(_pass)
{
}
int *a;
int size;
int pass;
};```

And the queue is defined as:

`queue<work *> work_q;`

Note that in the above implementation, the initial work pushed into the queue logically represents the initial set S of integer elements which is stored in the variable `int * a`. The for loop iterates a log2 N number of times where N is the number of processors and for which the value is calculated and stored in the variable `iter`. The while loop pops the work from the queue and processes it. At each iteration, the work that is inserted into the queue has its pass value incremented by one. Further, at each time the while loop is entered, all work with pass equal to the current iteration is processed. This ensures that the work is split correctly. For example, for four processors, the number of iterations is 2. The initial pass is 0. At the time the while loop is encountered, there is only one unit of work with pass 0, and it is processed by a function called `p` which contains the approximation logic for the partition scheduling problem. However, the results from the processing produces two schedules, stored in variables b and c respectively with their lengths stored in variables bsize and csize. If these sizes are non-zero, new work is pushed into the queue this time with pass equal to pass + 1. In the next iteration, there would be two schedules to process as there are two schedules with pass now equal to 1. In this step, the while loop must process both schedules, and this in turn produces four new schedules as each of the two schedules are processed by splitting into two new schedules. Thus, the loops produce 2iter many schedules which is equal to N where N is a power of two.

The logic for approximating partition is contained in a method called `p` and its implementation is:

```int compare(const void *_a, const void *_b)
{
int a = *(int *)_a;
int b = *(int *)_b;
if(a == b) {
return 0;
} else if (a < b) {
return 1;
} else {
return -1;
}
}

void p(int *a, int size, int **b, int *bsize, int **c, int *csize)
{
int i = 0, j = 0, k = 0, bsum = 0, csum = 0;

if(!a || !size) {
(*b) = (*c) = NULL;
(*bsize) = (*csize) = 0;
return;
}

qsort(a, size, sizeof(int), compare);

(*b) = (int *)calloc(1, size * sizeof(int));
(*c) = (int *)calloc(1, size * sizeof(int));

(*bsize) = 0;
(*csize) = 0;

for(i = 0, j = 0, k = 0; i < size; i++) {
if(bsum == csum || bsum < csum) {
(*b)[j++] = a[i];
(*bsize)++;
bsum += a[i];
continue;
}
if(csum < bsum) {
(*c)[k++] = a[i];
(*csize)++;
csum += a[i];
}
}
}```

Before the approximation runs, the input set of elements stored in array `a ` (referred to in the heuristic as set S) must be sorted in reverse order. The approximation then runs selecting the processor with the least processing time and assigning to it an element in descending order. Notice that the function produces two new arrays, b and c, and their respective sizes, bsize and csize. The approximation implemented is the aforementioned greedy approximation for partition scheduling. The approximation uses two variables bsum and csum to keep track of which processor has the least load at every pass. This is a better implementation than repeatedly calculating the sum at every pass.

## Using the Code

The code is available in a C++ file called multischedule.cc. The code has been compiled and tested on Linux using g++. Run the command g++ multischedule.cc -o ms. Then, to test out the algorithm, run:

./ms data.dat

Where data.dat is a file containing integer values corresponding to:

`Nr_Elements Nr_CPUs elem1 elem2 ... elemN`

All integer values are separated by a single white space. Nr_Elements refers to the number of elements or processor times given in the file. Nr_CPUs is a power of two number of CPUs for which the algorithm is to try to find a schedule for.

An example instance may be:

20 4 24 27 12 54 6 4 7 10 54 55 26 10 9 43 54 2 9 6 38 34

where there are 20 elements and 4 processors followed by 20 elements (processor times) from 24 to 34.

The algorithm then produces the following schedule:

```  size [20] cpus [4]

log2 [2]

queue size [4]

cpu = [0]

len = [5] { 55 34 24 6 2 }

sum = [121]

cpu = [1]

len = [5] { 54 38 10 10 9 }

sum = [121]

cpu = [2]

len = [5] { 54 43 12 7 6 }

sum = [122]

cpu = [3]

len = [5] { 54 27 26 9 4 }

sum = [120]```

Notice that there are four schedules, one for each processor. The sum of the total cost of running the assigned schedule on a processor is provided. These times are 121, 121, 122, and 120 for processors 0 to 3, respectively. The times are close to optimal if not optimal in this instance. The maximum difference between any schedule is just two, and for two processors (0 and 1) the times are equal. At each pass, a 4/3 * OPT approximation algorithm[2] is run to produce a schedule and the resulting final schedules is efficient.

## Points of Interest

It is intended that heursitic2 be used interchangeably with any approximation of the partition scheduling problem. However, not all approximations may yield a good schedule for the multiprocessor assignment. The approximation suggested in heuristic1 seemed to work well in the cases for which it was tried. To change the algorithm to use a different approximation, one has to implement function `p` in the code using a different approach.

## References

1. Garey, M. R., & Johnson, D. S. (2000). Computers and intractability.New York, NY: W. H. Freeman and Company.
2. [2] Mertens, S. (2003). The Easiest Hard Problem: Number Partitioning. Retrieved March 17, 2012 from http://www.arxiv.org/pdf/cond-mat/0310317.

## Share

 Software Developer (Senior) United States
Dr. Doss holds a PhD in Applied Computer Science. His research interests include Operating Systems, Networking, Theory of NP Completeness,and Software Engineering.
http://www.rdoss.com

## You may also be interested in...

 Pro Pro