Download Source in VB.Net 2012
Abstract
In this article, .Net parallel class is utilized
to boost the performance of biconjugate gradient stabilized algorithm which is
an iterative method of solving system of linear equations. A Matrix class with
appropriate operators, methods and properties is developed. A solver class capable of performing the algorithm on separate thread is developed and, a set of matrices of size 1000, 1500, 2000, 2500, 3000, 4000 and 6000 is tested and discussed. In
this study implementing Parallel Class could reduce the CPU time by an average
of 41% and maximum of 65%.
Introduction
.Net Framework Parallel Class provides developers by a
simple structured loop that executes a block in parallel on several cores. This
class first introduced with .Net Framework 4.0 and since then it’s been
practices in several studies. In this article, Parallel Class is implemented
inside a globally experienced algorithm of solving system of linear equations
known as biconjugate gradient stabilized. This job is performed through
splitting the independent computations and performing them on iterations of the parallel loop. As for any equation of this type, the input data includes
matrices of A, X and b. Matrices of X and b are in size n while matrix of A is
of size n^2 . Thus, matrix of A requires heavy computations and it can be
considered as a target of parallel computation wherever it appears inside the
algorithm.
BiConjugate Gradient Stabilized
BiConjugate Gradient Stabilized or briefly
BiConGradStab is an advanced iterative method of solving system of linear
equations. This method and other methods of this family such as Conjugate
Gradient are perfect for memory management due to implementing vectors of size
n in their calculations rather than matrices of size n^2. In addition this
method is not restricted to specific type or size of matrix and can be applied
to any sparse matrices of A. The following illustrates the algorithm of
BiConGradStab method.
 r_{0} = b − Ax_{0}
 Choose an arbitrary vector r̂_{0} such that (r̂_{0}, r_{0}) ≠ 0, e.g., r̂_{0} = r_{0}
 ρ_{0} = α = ω_{0} = 1
 v_{0} = p_{0} = 0
 For i = 1, 2, 3, …
 ρ_{i} = (r̂_{0}, r_{i−1})
 β = (ρ_{i}/ρ_{i−1})(α/ω_{i−1})
 p_{i} = r_{i−1} + β(p_{i−1} − ω_{i−1}v_{i−1})
 v_{i} = Ap_{i}
 α = ρ_{i}/(r̂_{0}, v_{i})
 s = r_{i−1} − αv_{i}
 t = As
 ω_{i} = (t, s)/(t, t)
 x_{i} = x_{i−1} + αp_{i} + ω_{i}s
 If x_{i} is accurate enough then quit
 r_{i} = s − ω_{i}t
As shown above, there are two steps of 4 and 7 involved with
matrices of size n^2 where matrices can be divided into smaller ones and then computed separately in parallel processes.
Matrix Class
In order to perform clear and easytounderstand
matrix calculations it’s necessary to develop a matrix class that complies with
algorithms and calculations. Following lists some of the properties, methods and operators of the developed matrix class.
Property

Description

n (Integer)

Returns number of i elements

m (Integer)

Returns number of j elements

Values (Double(,))

(Default property) gets or sets values of element at i and j

Method

Description

New ()

Overloads: 1.void, 2. gets size of matrix, 3. returns identity matrix 4. gets values.

T()

Returns transposed of calling matrix

Split()

Splits current matrix into matrices of same n and returns an array of
them

Join()

Joins matrices of same n and returns a matrix

CopyTo()

Copies elements in current matrix onto another matrix

Operator

Description

/+

Subtracts/adds matrices of appropriate size

*

Multiplies matrices of appropriate size

/

Divides element at (0,0) of one matrix by element of the same at
another matrix and returns a matrix of size 1*1 of the result.

Solver Class
The Solver Class contains methods and event for performing the BiConGradStab calculations. The SolverThread is executed in a separate class of SolverThreadClass. The Solve() method requires matrices of A
, b
, initial guess for X
, number of parallel processes nParallel,
and final iteration Iteration
.
Public Sub Solve(A As Matrix, X As Matrix, b As Matrix, Parallel As Integer, Iterations As Integer)
A.CopyTo(Me.A)
X.CopyTo(Me.X)
b.CopyTo(Me.b)
SolverThread = New SolverThreadClass
AddHandler SolverThread.Solved, AddressOf _Solved
Dim T As New Threading.Thread(New Threading.ParameterizedThreadStart(AddressOf SolverThread.SolverThread))
T.Start(New Object() {_A, _X, _b, Parallel, Iterations})
End Sub
In Solve() method of the SolverThreadClass, the matrices of A
and b
are split into nParallel
number of smaller matrices using Split() method.
Dim A() As Matrix = rawA.Split(nParallel)
Dim b() As Matrix = rawB.Split(nParallel)
In the main loop as shown below, steps 4 and 7 of the algorithm are being executed in Parallel blocks.
Do While nIteration < Iterations
NewRho = r_tilde * r
beta = (NewRho / Rho) * (Alpha / Omega)
P = r + beta * (P  Omega * V)
Parallel.For(0, nParallel, Sub(i As Integer)
V_Split(i) = A(i) * P
End Sub)
V.Join(V_Split)
Alpha = NewRho / (r_tilde * V)
s = r  Alpha * V
Parallel.For(0, nParallel, Sub(i As Integer)
t_Split(i) = A(i) * s
End Sub)
t.Join(t_Split)
Omega = (t.T * s) / (t.T * t)
X += Alpha * P + Omega * s
r = s  Omega * t
NewRho.CopyTo(Rho)
nIteration += 1
Loop
Using The Code
In order to use this solver in your application, you need to declare it as below so that you would be able to catch the Solved event once it's fired.
Friend WithEvents mySolver As Solver
And Solved event handler can be written as below:
Sub Solved(X As Matrix) Handles mySolver.Solved
Stop
End Sub
Now in the calling procedure the solver class is instantiated and called for matrices of system of linear equations. A simple example is given:
mySolver = New Solver
Dim A As New Matrix(3, 3)
Dim x As New Matrix(1, 3)
Dim b As New Matrix(1, 3)
A(0, 0) = 0.1
A(1, 1) = 0.4
A(2, 2) = 0.2
A(1, 0) = 0.45
A(0, 1) = 0.67
A(0, 2) = 0.98
b(0, 0) = 1
b(0, 1) = 1.47
b(0, 2) = 1.58
x(0, 0) = 4
x(0, 1) = 2
x(0, 2) = 1
mySolver.Solve(A, x, b, 1, 5)
Results
CPU: 2.8GHz and  RAM: 2GB  Precision: Double
Matrix Size: 1000*1000
Matrix Size: 2000*2000
Matrix Size: 4000*4000
Matrix Size: 6000*6000
Discussion
The average CPU Time graph is a good indicator
of the overall performance of Parallel Class for each matrix. As shown in the
graph, the CPU time decreases gradually as the size of matrix increases relatively.
For matrix of size 6000 it can be interpreted as a memory access issue as
matrix of size 6000 would take two times more memory than that of 4000. In addition,
the graph of best CPU Times suggests that the performance of Parallel Class for
BiConGradStab gets improved softly by increasing the size of the matrix.
Conclusions
1)
The Parallel class improves
the performance of BiConGradStab by 41 to 65% on multiruns and 35% to 75% on
singleruns of CPU time on a dual core 2.8GHz with 2GB of RAM having Windows
8.1 installed and continuous condition. This may be relatively different on
other systems of different hardware and software capabilities.
2)
With high level of certainty
one would try using Parallel class for matrices of size 1000 and more and
expect significant improvement in overall performance.
3)
There’s no exact estimation
of improvement in overall or local performance because it’s being affected by
several factors. But as for higher number of parallel processes we’ve got least
fluctuation and Parallel gives more reliable result.
4)
Best results achieved at
maximum parallel processes where Iterations on Parallel class are comparable to
size of the matrix e.g. 1000 Parallel processes for a matrix of 3000 on each
side.
5)
For smaller matrices Parallel
class is unstable but still effective.
Recommendations
1) For more serious works, it'd be better to add tolerance and/or mse checkers to the iterative part of the code.
2) Using preconditioners one would achieve better convergence speed for BiConGradStab method. Nonetheless this code does not aim at evaluating the accuracy or convergence speed of this method.
3) A memory and CPU check procedure can be developed so that user would get informed about the status of memory and CPU before and when the iterations are in progress.
4) At many scenarios, users may need to perform the algorithm several times for simulation purposes. For such cases that would be great if a survey method determines the best run structure prior to the simulation job,
5) The whole work could be simply accomplished through two
separated loops of matrix operations without having other parts of
BiConGradStab algorithm involved, this work aims at showing the practical
aspect of Parallel Class. One would try similar code for other iterative
solutions wherever Parallel computation is possible.