

Factorizations for Solving Linear Equations

The well-known LU and Cholesky factorizations are the simplest block algorithms to derive. No extra floating-point operations nor extra working storage are required.
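To make the structure of such a block algorithm concrete, here is a minimal sketch of a right-looking blocked LU factorization in Python/NumPy. The function and block-size names are ours, and partial pivoting, which the LAPACK routine DGETRF performs, is omitted for brevity: the panel is factorized by the unblocked algorithm, a triangular solve forms the block row of U, and a single matrix-matrix product (Level 3 BLAS GEMM) updates the trailing submatrix.

```python
import numpy as np

def blocked_lu(A, nb=64):
    """Right-looking blocked LU without pivoting (illustrative sketch;
    LAPACK's DGETRF adds partial pivoting).  Returns a copy of A
    overwritten with L (unit lower, below the diagonal) and U."""
    A = np.array(A, dtype=float)
    n = A.shape[0]
    for k in range(0, n, nb):
        b = min(nb, n - k)
        # 1. Unblocked LU of the panel A[k:n, k:k+b] (Level 2 BLAS work).
        for j in range(k, k + b):
            A[j+1:n, j] /= A[j, j]
            A[j+1:n, j+1:k+b] -= np.outer(A[j+1:n, j], A[j, j+1:k+b])
        # 2. Forward substitution with the unit lower triangle of the
        #    diagonal block to form the block row of U (a TRSM in LAPACK).
        for i in range(k + 1, k + b):
            A[i, k+b:n] -= A[i, k:i] @ A[k:i, k+b:n]
        # 3. Rank-b update of the trailing submatrix by one GEMM;
        #    this Level 3 step is where blocking pays off.
        A[k+b:n, k+b:n] -= A[k+b:n, k:k+b] @ A[k:k+b, k+b:n]
    return A
```

In DGETRF itself the panel factorization also selects pivots, and the row interchanges are applied across the whole matrix before steps 2 and 3.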

Table 3.7 illustrates the speed of the LAPACK routine for LU factorization of a real matrix, DGETRF in double precision (64-bit floating-point arithmetic). A block size of 1 means that the unblocked algorithm is used, because on that machine it is at least as fast as any blocked algorithm. These numbers may be compared to those for DGEMM in Table 3.12, which should be upper bounds.


Table 3.7: Speed in megaflops of DGETRF for square matrices of order n

                             No. of       Block     Values of n
                             processors   size      100     1000
  Dec Alpha Miata                 1        28       172      370
  Compaq AlphaServer DS-20        1        28       353      440
  IBM Power 3                     1        32       278      551
  IBM PowerPC                     1        52        77      148
  Intel Pentium II                1        40       132      250
  Intel Pentium III               1        40       143      297
  SGI Origin 2000                 1        64       228      452
  SGI Origin 2000                 4        64       190      699
  Sun Ultra 2                     1        64       121      240
  Sun Enterprise 450              1        64       163      334

Table 3.8 gives similar results for Cholesky factorization.


Table 3.8: Speed in megaflops of DPOTRF for matrices of order n with UPLO = 'U'

                             No. of       Block     Values of n
                             processors   size      100     1000
  Dec Alpha Miata                 1        28       197      399
  Compaq AlphaServer DS-20        1        28       306      464
  IBM Power 3                     1        32       299      586
  IBM PowerPC                     1        52        79      125
  Intel Pentium II                1        40       118      253
  Intel Pentium III               1        40       142      306
  SGI Origin 2000                 1        64       222      520
  SGI Origin 2000                 4        64       137     1056
  Sun Ultra 2                     1        64       131      276
  Sun Enterprise 450              1        64       178      391
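The same LAPACK kernel is reachable from higher-level environments; for example, NumPy's Cholesky routine calls ?POTRF underneath. A small sketch (note that NumPy returns the lower triangular factor, whereas the table above times the UPLO = 'U' variant):

```python
import numpy as np

# A small symmetric positive definite matrix and its Cholesky factor.
A = np.array([[4.0, 2.0],
              [2.0, 3.0]])
L = np.linalg.cholesky(A)      # lower triangular, so A = L @ L.T
assert np.allclose(L @ L.T, A)
```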

LAPACK, like LINPACK, provides a factorization for symmetric indefinite matrices, in which A is factorized as P U D U^T P^T, where P is a permutation matrix, U is upper triangular, and D is block diagonal with blocks of order 1 or 2. A block form of this algorithm has been derived, and is implemented in the LAPACK routine SSYTRF/DSYTRF. It has to duplicate a little of the computation in order to "look ahead" to determine the necessary row and column interchanges, but the extra work can be more than compensated for by the greater speed of updating the matrix by blocks, as is illustrated in Table 3.9, provided that n is large enough.


Table 3.9: Speed in megaflops of DSYTRF for matrices of order n with UPLO = 'U' on an IBM Power 3

  Block     Values of n
  size      100     1000
    1       186      215
   32       130      412
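This factorization is exposed in SciPy as scipy.linalg.ldl, which wraps ?SYTRF; a sketch assuming SciPy is available (the routine returns the factors with the permutation already applied, so the product below reconstructs A directly):

```python
import numpy as np
from scipy.linalg import ldl

# An indefinite symmetric matrix with zeros on the diagonal, which
# forces the factorization to choose 2-by-2 diagonal blocks in D.
A = np.array([[0.0, 1.0, 2.0],
              [1.0, 0.0, 3.0],
              [2.0, 3.0, 0.0]])
L, D, perm = ldl(A)            # wraps LAPACK's ?SYTRF
assert np.allclose(L @ D @ L.T, A)
```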

LAPACK, like LINPACK, provides LU and Cholesky factorizations of band matrices. The LINPACK algorithms can easily be restructured to use Level 2 BLAS, though that has little effect on performance for matrices of very narrow bandwidth. It is also possible to use Level 3 BLAS, at the price of doing some extra work with zero elements outside the band [48]. This becomes worthwhile for matrices of large order and semi-bandwidth greater than 100 or so.
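Band solvers are likewise accessible from SciPy, whose solve_banded calls the LAPACK banded routines and uses the same diagonal-by-diagonal band storage. A sketch with an assumed 4-by-4 tridiagonal system:

```python
import numpy as np
from scipy.linalg import solve_banded

# A 4-by-4 tridiagonal system in band storage: one row per diagonal,
# superdiagonal first, padded with zeros at the unused corners.
ab = np.array([[0.0, 1.0, 1.0, 1.0],    # superdiagonal
               [4.0, 4.0, 4.0, 4.0],    # main diagonal
               [1.0, 1.0, 1.0, 0.0]])   # subdiagonal
b = np.ones(4)
x = solve_banded((1, 1), ab, b)         # (1 subdiagonal, 1 superdiagonal)
```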


Susan Blackford
1999-10-01