The well-known LU and Cholesky factorizations are the simplest block algorithms to derive. Neither extra floating-point operations nor extra working storage is required.
Table 3.7 illustrates the speed of DGETRF, the LAPACK routine for LU factorization of a real matrix in double precision (64-bit floating-point) arithmetic. A block size of 1 means that the unblocked algorithm is used, since it is faster than, or at least as fast as, a blocked algorithm. These numbers may be compared with those for DGEMM in Table 3.12, which should be upper bounds.
| Machine | No. of processors | Block size | n = 100 | n = 1000 |
|---|---|---|---|---|
| Dec Alpha Miata | 1 | 28 | 172 | 370 |
| Compaq AlphaServer DS-20 | 1 | 28 | 353 | 440 |
| IBM Power 3 | 1 | 32 | 278 | 551 |
| IBM PowerPC | 1 | 52 | 77 | 148 |
| Intel Pentium II | 1 | 40 | 132 | 250 |
| Intel Pentium III | 1 | 40 | 143 | 297 |
| SGI Origin 2000 | 1 | 64 | 228 | 452 |
| SGI Origin 2000 | 4 | 64 | 190 | 699 |
| Sun Ultra 2 | 1 | 64 | 121 | 240 |
| Sun Enterprise 450 | 1 | 64 | 163 | 334 |
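For reference, DGETRF is the routine that `scipy.linalg.lu_factor` calls under the hood, so the factorization timed above can be exercised directly from Python. This is a small usage sketch, not a benchmark:

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve

rng = np.random.default_rng(0)
n = 200
A = rng.standard_normal((n, n))
b = rng.standard_normal(n)

# lu_factor wraps LAPACK's DGETRF: blocked LU with partial pivoting,
# returning the packed L and U factors and the pivot indices.
lu, piv = lu_factor(A)

# lu_solve wraps DGETRS, reusing the factorization for each right-hand side.
x = lu_solve((lu, piv), b)
assert np.allclose(A @ x, b)
```

Reusing `(lu, piv)` across many right-hand sides is the standard way to amortize the O(n^3) factorization cost against O(n^2) solves.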
Table 3.8 gives similar results for Cholesky factorization.
| Machine | No. of processors | Block size | n = 100 | n = 1000 |
|---|---|---|---|---|
| Dec Alpha Miata | 1 | 28 | 197 | 399 |
| Compaq AlphaServer DS-20 | 1 | 28 | 306 | 464 |
| IBM Power 3 | 1 | 32 | 299 | 586 |
| IBM PowerPC | 1 | 52 | 79 | 125 |
| Intel Pentium II | 1 | 40 | 118 | 253 |
| Intel Pentium III | 1 | 40 | 142 | 306 |
| SGI Origin 2000 | 1 | 64 | 222 | 520 |
| SGI Origin 2000 | 4 | 64 | 137 | 1056 |
| Sun Ultra 2 | 1 | 64 | 131 | 276 |
| Sun Enterprise 450 | 1 | 64 | 178 | 391 |
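The blocked algorithm behind these figures can be sketched in a few lines. The following is a simplified right-looking blocked Cholesky in Python with NumPy/SciPy (the LAPACK routine itself is Fortran calling tuned BLAS; `blocked_cholesky` and `nb` are illustrative names, not LAPACK's):

```python
import numpy as np
from scipy.linalg import solve_triangular

def blocked_cholesky(A, nb=64):
    """Right-looking blocked Cholesky sketch: returns lower L with A = L @ L.T.

    Each diagonal block is factorized with an unblocked kernel; the panel
    solve and trailing update are Level 3 BLAS operations (TRSM and SYRK),
    which account for the speedups shown in the table above.
    """
    A = A.copy()
    n = A.shape[0]
    for k in range(0, n, nb):
        e = min(k + nb, n)
        # Unblocked factorization of the diagonal block.
        A[k:e, k:e] = np.linalg.cholesky(A[k:e, k:e])
        if e < n:
            # Panel solve (TRSM): L21 = A21 * L11^{-T}.
            A[e:, k:e] = solve_triangular(A[k:e, k:e], A[e:, k:e].T,
                                          lower=True).T
            # Trailing update (SYRK): A22 <- A22 - L21 * L21^T.
            A[e:, e:] -= A[e:, k:e] @ A[e:, k:e].T
    return np.tril(A)

rng = np.random.default_rng(0)
M = rng.standard_normal((200, 200))
A = M @ M.T + 200 * np.eye(200)    # symmetric positive definite
L = blocked_cholesky(A, nb=64)
assert np.allclose(L @ L.T, A)
```

The block size `nb` plays the same role as the "Block size" column in the tables: large enough that the matrix-multiply update dominates, small enough that the panel fits in cache.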
LAPACK, like LINPACK, provides a factorization for symmetric indefinite matrices: A is factorized as P U D U^T P^T, where P is a permutation matrix and D is block diagonal with blocks of order 1 or 2. A block form of this algorithm has been derived, and is implemented in the LAPACK routine SSYTRF/DSYTRF. It has to duplicate a little of the computation in order to "look ahead" to determine the necessary row and column interchanges, but the extra work can be more than compensated for by the greater speed of updating the matrix by blocks, provided that n is large enough, as Table 3.9 illustrates.
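This symmetric indefinite factorization is exposed in SciPy as `scipy.linalg.ldl`, which wraps ?SYTRF. A small sketch (the routine returns a permuted triangular factor, so the product below reconstructs A directly):

```python
import numpy as np
from scipy.linalg import ldl

rng = np.random.default_rng(1)
B = rng.standard_normal((6, 6))
A = B + B.T                        # symmetric but, in general, indefinite

# ldl wraps LAPACK's DSYTRF: symmetric pivoting, with D block diagonal
# built from 1x1 and 2x2 blocks to handle indefiniteness stably.
U, D, perm = ldl(A, lower=False)   # upper variant, matching P U D U^T P^T
assert np.allclose(U @ D @ U.T, A)
# perm gives the row order that brings U into upper triangular form.
```

Unlike Cholesky, this factorization never requires A to be positive definite, at the cost of the pivoting and 2x2 diagonal blocks.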
LAPACK, like LINPACK, provides LU and Cholesky factorizations of band matrices. The LINPACK algorithms can easily be restructured to use Level 2 BLAS, though that has little effect on performance for matrices of very narrow bandwidth. It is also possible to use Level 3 BLAS, at the price of doing some extra work with zero elements outside the band [48]. This becomes worthwhile for matrices of large order and semi-bandwidth greater than 100 or so.
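SciPy also exposes the band solvers. For example, `scipy.linalg.solve_banded` takes a matrix in LAPACK band storage; a tridiagonal sketch follows (note that for bandwidths this narrow SciPy may dispatch to the tridiagonal routine DGTSV rather than the general banded LU DGBTRF/DGBSV):

```python
import numpy as np
from scipy.linalg import solve_banded

# Tridiagonal example in LAPACK band storage: ab[u + i - j, j] = a[i, j],
# with u = 1 superdiagonal and l = 1 subdiagonal.
n = 5
ab = np.zeros((3, n))
ab[0, 1:] = -1.0        # superdiagonal
ab[1, :] = 2.0          # main diagonal
ab[2, :-1] = -1.0       # subdiagonal
b = np.ones(n)

# Solve A x = b using only the stored band.
x = solve_banded((1, 1), ab, b)

# Check against the equivalent dense system.
A = (np.diag(2.0 * np.ones(n))
     - np.diag(np.ones(n - 1), 1)
     - np.diag(np.ones(n - 1), -1))
assert np.allclose(A @ x, b)
```

The band storage scheme is what keeps the cost proportional to the bandwidth rather than to n; the Level 3 variant discussed above trades some of that economy for matrix-multiply speed.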