Solving Linear Systems Using oneMKL and OpenMP Target Offloading
Accelerate Linear Algebra Computations with Standard, Open Approaches
The previous article, Accelerating LU Factorization Using Fortran, oneMKL, and OpenMP*, showed how to offload LU factorization and subsequent inversion to an accelerator. It gave some tips on reducing overhead by consolidating OpenMP offload regions and minimizing host-device data transfers but stopped short of solving a system of linear equations. This was left as an exercise for the reader, but we’re going to do it now to illustrate some techniques from the excellent Minimizing Data Transfers and Memory Allocations chapter in the oneAPI GPU Optimization Guide.
We’re going to show how to solve a group of linear systems using the batched LU factorization and solver functions in oneMKL, offloaded to an accelerator using OpenMP. We don’t use inverted matrix factors much in practice, so we’re going to drop that step in this demonstration and go from factorization to solution. Because the previous example codes are written in Fortran, and because we like it, we’re going to continue using Fortran here.
Let’s start with a couple of small linear systems to confirm that we’re loading and solving them correctly. We found some suitable examples in an old linear algebra textbook (Figure 1). The matrices are square and nonsingular, so we can solve the systems using LU factorization.
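For reference, here are the two systems, reconstructed from the initialization code in Figure 2 (the coefficient matrices are stored column by column):

$$
\begin{pmatrix} 2 & 4 & -4 \\ 1 & -4 & 3 \\ -6 & -9 & 10 \end{pmatrix}
\begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix} =
\begin{pmatrix} 12 \\ -21 \\ -24 \end{pmatrix}
\qquad\text{and}\qquad
\begin{pmatrix} 0 & 0 & 3 \\ 2 & 4 & -1 \\ 6 & 5 & 5 \end{pmatrix}
\begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix} =
\begin{pmatrix} 6 \\ -10 \\ -7 \end{pmatrix},
$$

with solutions $(x_1, x_2, x_3) = (-4, 2, -3)$ and $(-2, -1, 2)$, respectively.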
The demo program for this article is similar to the one in the previous article. It begins by including the Fortran interfaces for OpenMP offload support, then decides whether to use a 32- or 64-bit integer type (Figure 2). (See Using the ILP64 Interface vs. LP64 Interface for more information.) This decision is made at compile-time (see the compiler command below). The test problem in Figure 1 consists of two 3x3 linear systems, each with one right-hand side (RHS). The batch_size, n, nrhs, and stride variables are set accordingly, and the a, b, and ipiv arrays are allocated to hold the batched systems. The last four statements load the two matrices and their right-hand sides into a and b, respectively.
include "mkl_omp_offtload.t90"
program solve_batched_1inear_systems
! I Decide whether to use 32- or 64-bit integer type
#if defined(MKL_ILP64)
use onemkl_lapack_omp_offload_ilp64 ! 64-bit
#else
use onemkl_lapack_omp_offload_1p64 ! 32-bit
#endif
integer, parameter :: n = 3, batch_size = 2, nrhs = 1
integer :: lda, stride_a, stride_ipiv
integer :: 1db, stride_b
real (kind=8), allocatable :: a(:,:), b(:,:)
integer, allocatable :: ipiv(:, 1), info(:)
integer allocstat
character (len = 132) :: allocmsg
1da = n
stride_a = n * Tda
stride_ipiv = n
1db = n
stride_b = n * nrhs
! Allocate required memory
allocate (a(stride_a, batch_size), b(n, batch_size*nrhs), &
ipiv(stride_ipiv, batch_size), &
info(batch_ size), stat = allocstat, errmsg = allocmsg)
if (allocstat > 0) stop trim(allocmsg)
! Initialize the matrices. Remember that Fortran is a column-major language.
a(:,1) = (/2.0, 1.0, -6.0, 4.0, -4.0, -9.0, -4.0, 3.0, 10.0/)
a(:,2) = (/0.0, 2.0, 6.0, 0.0, 4.0, 5.0, 3.0, -1.0, 5.0/)
b(:,1) = (/12.0, -21.0, -24.0/) ! x = (-4, 2, -3)
b(:,2) = (/ 6.0, -10.0, -7.0/) ! x = (-2, -1, 2)
We are now ready to begin the computation. Let’s start with a basic implementation that uses two OpenMP target regions to dispatch the LU factorizations and the solutions to the linear systems (Figure 3). We compile and run the example as follows to verify that it’s giving correct results:
$ ifx -i8 -DMKL_ILP64 -qopenmp -fopenmp-targets=spir64 -fsycl -free \
> lu_solve_omp_offload_ex1_small.F90 -o lu_solve_ex1_small \
> -L${MKLROOT}/lib/intel64 -lmkl_sycl -lmkl_intel_ilp64 -lmkl_intel_thread \
> -lmkl_core -liomp5 -lpthread -ldl
$ ./lu_solve_ex1_small
=================================
Solutions to the linear systems
=================================
-4.0000 2.0000 -3.0000
-2.0000 -1.0000 2.0000
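The listings in this article focus on the offload regions and omit the output code. A minimal sketch of how the solutions above might be printed, assuming the arrays declared in Figure 2 (each column of b holds one solution vector):

! Hypothetical output code (not part of the original listings);
! assumes an integer loop counter i is declared with the other variables
print *, '================================='
print *, ' Solutions to the linear systems '
print *, '================================='
do i = 1, batch_size*nrhs
    print '(3F9.4)', b(:, i)   ! each column of b holds one solution vector
enddo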
! Compute the LU factorization using OpenMP offload
! On entry, "a" contains the input matrix.
! On exit, it contains the LU factorization.
!$omp target data map(tofrom:a) map(from:ipiv) map(from:info)
!$omp dispatch
call dgetrf_batch_strided(n, n, a, lda, stride_a, &
ipiv, stride_ipiv, batch_size, info)
!$omp end target data
if (any(info .ne. 0)) then
print *, 'Error: getrf_batch_strided returned with errors'
else
! Solving the linear systems. On exit, the solutions are stored in b.
!$omp target data map(to:a) map(to:ipiv) map(tofrom: b) map(from:info)
!$omp dispatch
call dgetrs_batch_strided('N', n, nrhs, a, lda, stride_a, &
ipiv, stride_ipiv, &
b, ldb, stride_b, batch_size, info)
!$omp end target data
if (any(info .ne. 0)) then
print *, 'Error: getrs_batch_strided returned with errors'
else
print *, 'Computation executed successfully'
endif
endif
The target data construct creates the device data environment, and the dispatch construct hands the enclosed oneMKL call to the accelerator. The first region transfers the input matrices [map(tofrom:a)] to device memory, dispatches the in-place LU factorization (dgetrf_batch_strided) to the device, and retrieves the LU-factored matrices (now stored in a), pivot indices [map(from:ipiv)], and status [map(from:info)] from device memory. If the factorization succeeds, the program proceeds to the next OpenMP region: the LU-factored matrices and pivot indices are transferred back to device memory [map(to:a) and map(to:ipiv)], and the linear systems are solved on the device (dgetrs_batch_strided). The RHS are overwritten with the solution vectors and retrieved from device memory [map(tofrom:b)], along with the status of the computation [map(from:info)]. The host-device data transfers are shown schematically in Figure 4.
We don’t have to rely on a conceptual understanding of host-device data transfer, however. We can check data movement directly by requesting debugging information from the OpenMP runtime library, as explained in Minimizing Data Transfers and Memory Allocations. The test problem is too small to warrant acceleration, so we’ll increase the problem size to something more interesting and solve (in double precision) a batch of eight 8,000 x 8,000 linear systems, each with one RHS.
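Only the problem dimensions and the input data need to change. Here is a minimal sketch of the modified lines, assuming the rest of the program in Figure 2 stays the same and the matrices are filled with generated values rather than the textbook data:

! These lines replace the corresponding lines in Figure 2.

! Larger batch of larger systems: eight 8,000 x 8,000 matrices, one RHS each
integer, parameter :: n = 8000, batch_size = 8, nrhs = 1

! Generated input instead of the textbook data (random matrices are,
! for all practical purposes, nonsingular)
call random_number(a)
call random_number(b)

Running the program with the OpenMP offload runtime’s debug output enabled (for example, by setting the LIBOMPTARGET_DEBUG environment variable; see the chapter linked above for the exact settings) reports each host-device transfer along with its size.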
We’ve highlighted the arrays that are explicitly transferred by the OpenMP map clauses and added a space to delineate the two OpenMP target regions. The unhighlighted data movements are the dope vectors of the mapped arrays; we can ignore them because they’re generally small.
In the first target region, array a (4,096,000,000 bytes) is transferred [map(tofrom:a)] from host to target (hst→tgt) and from target to host (tgt→hst). It is transferred back to the target device [map(to:a)] in the second target region. The pivots (ipiv, 512,000 bytes) are computed on the device in the first target region, retrieved [map(from:ipiv), tgt→hst], then transferred back to the device [map(to:ipiv), hst→tgt] in the second target region to compute the solutions. The RHS stored in array b (512,000 bytes) are transferred to the device, overwritten with the solution vectors, then transferred back to the host [map(tofrom:b)]. The status array, info (64 bytes), is retrieved from the device [map(from:info), tgt→hst] at the end of both target regions.
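These sizes follow directly from the problem dimensions, since double-precision reals and ILP64 integers both occupy 8 bytes:

$$
\begin{aligned}
a    &: 8 \times 8000 \times 8000 \times 8\ \text{bytes} = 4{,}096{,}000{,}000\ \text{bytes},\\
ipiv &: 8 \times 8000 \times 8\ \text{bytes} = 512{,}000\ \text{bytes},\\
b    &: 8 \times 8000 \times 1 \times 8\ \text{bytes} = 512{,}000\ \text{bytes},\\
info &: 8 \times 8\ \text{bytes} = 64\ \text{bytes}.
\end{aligned}
$$

Adding up everything the two target regions move (a three times, ipiv and b twice each, and info twice) gives 3 x 4,096,000,000 + 4 x 512,000 + 2 x 64 = 12,290,048,128 bytes.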
Moving data between disjoint memories takes time and energy, so we need to pay close attention to the highlighted transfers (12,290,048,128 bytes in total). With this in mind, let’s see how to improve our initial implementation.
The previous article on LU factorization discussed the disadvantages of using two OpenMP target regions when one will suffice. First, transferring control flow to the target device incurs overhead, and this happens twice in the initial implementation (Figure 3). Second, redundant host-device data transfer is required (Figure 4). When solving the linear systems, we only need to copy the input matrices and RHS to the device and retrieve the solutions (plus the relatively small status arrays) from the device. The pivots are only used on the device, so we simply allocate the ipiv array directly in device memory [map(alloc:ipiv(1:stride_ipiv, 1:batch_size))]. Fusing the two OpenMP target regions results in cleaner, more concise code (Figure 5) that requires less data movement (Figure 6).
!$omp target data map(to:a) map(tofrom:b) &
!$omp&            map(alloc:ipiv(1:stride_ipiv, 1:batch_size)) &
!$omp&            map(from:info_rf, info_rs)

    !$omp dispatch
    call dgetrf_batch_strided(n, n, a, lda, stride_a, &
                              ipiv, stride_ipiv, batch_size, info_rf)

    !$omp dispatch
    call dgetrs_batch_strided('N', n, nrhs, a, lda, stride_a, &
                              ipiv, stride_ipiv, &
                              b, ldb, stride_b, batch_size, info_rs)

!$omp end target data
if (any(info_rf .ne. 0)) then
print *, 'Error: getrf_batch_strided returned with errors.'
elseif (any(info_rs .ne. 0)) then
print *, 'Error: getrs_batch_strided returned with errors.'
else
print *, 'Computation executed successfully'
endif
The debug information confirms this. Once again, we’ve highlighted the arrays that are explicitly transferred by the OpenMP map clauses.
If we ignore the dope vectors, the single OpenMP target region in Figure 5 requires significantly less host-device data transfer (4,097,024,128 bytes) than the two target regions of the initial implementation in Figure 3 (12,290,048,128 bytes).
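The fused region’s total is easy to verify: a moves to the device once, b moves both ways, ipiv never leaves the device, and only the two small status arrays come back:

$$
4{,}096{,}000{,}000 + 2 \times 512{,}000 + 2 \times 64 = 4{,}097{,}024{,}128\ \text{bytes}.
$$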
Minimizing host-device data transfer is a first-order concern in heterogeneous parallel computing. The code in the second example (Figure 5) significantly outperforms the original code (Figure 3), especially as the linear systems get larger (Table 1).
Matrix Size | CPU Time (s) | GPU Time, Figure 3 Code (s) | GPU Time, Figure 5 Code (s)
---|---|---|---
1,000 x 1,000 | 0.13 | 1.72 | 1.70
2,000 x 2,000 | 0.60 | 0.84 | 0.72
4,000 x 4,000 | 3.90 | 2.51 | 2.10
8,000 x 8,000 | 19.15 | 5.98 | 4.55
12,000 x 12,000 | 63.11 | 12.55 | 9.51
16,000 x 16,000 | 139.45 | 21.33 | 15.71
The example programs used in this article are available at this repository. You can experiment with OpenMP accelerator offload on the free Intel® Developer Cloud, which has the latest Intel hardware and software.
The OpenMP target construct has many more options to give the programmer fine control over accelerator offload. We encourage you to review the OpenMP specification and example codes, and to check out the Intro to GPU Programming with the OpenMP API and Minimizing Data Transfers and Memory Allocations tutorials.