The well-known LU and Cholesky factorizations are the simplest block algorithms to derive. Neither extra floating-point operations nor extra working storage is required.
Table 3.7 illustrates the speed of DGETRF, the LAPACK routine for LU factorization of a real matrix in double precision (64-bit floating-point) arithmetic. A block size of 1 means that the unblocked algorithm is used, since it is faster than, or at least as fast as, a blocked algorithm. These numbers may be compared with those for DGEMM in Table 3.12, which should be upper bounds.
| Machine | No. of processors | Block size | n = 100 | n = 1000 |
|---|---|---|---|---|
| Dec Alpha Miata | 1 | 28 | 172 | 370 |
| Compaq AlphaServer DS-20 | 1 | 28 | 353 | 440 |
| IBM Power 3 | 1 | 32 | 278 | 551 |
| IBM PowerPC | 1 | 52 | 77 | 148 |
| Intel Pentium II | 1 | 40 | 132 | 250 |
| Intel Pentium III | 1 | 40 | 143 | 297 |
| SGI Origin 2000 | 1 | 64 | 228 | 452 |
| SGI Origin 2000 | 4 | 64 | 190 | 699 |
| Sun Ultra 2 | 1 | 64 | 121 | 240 |
| Sun Enterprise 450 | 1 | 64 | 163 | 334 |
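For reference, DGETRF is the routine that `scipy.linalg.lu_factor` calls under the hood, so the factorization timed above can be exercised directly from Python. This is a small usage sketch, not a benchmark:

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve

rng = np.random.default_rng(0)
n = 200
A = rng.standard_normal((n, n))
b = rng.standard_normal(n)

# lu_factor wraps LAPACK's DGETRF: blocked LU with partial pivoting,
# returning the packed L and U factors and the pivot indices.
lu, piv = lu_factor(A)

# lu_solve wraps DGETRS, reusing the factorization for each right-hand side.
x = lu_solve((lu, piv), b)
assert np.allclose(A @ x, b)
```

Reusing `(lu, piv)` across many right-hand sides is the standard way to amortize the O(n^3) factorization cost against O(n^2) solves.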
Table 3.8 gives similar results for Cholesky factorization.
| Machine | No. of processors | Block size | n = 100 | n = 1000 |
|---|---|---|---|---|
| Dec Alpha Miata | 1 | 28 | 197 | 399 |
| Compaq AlphaServer DS-20 | 1 | 28 | 306 | 464 |
| IBM Power 3 | 1 | 32 | 299 | 586 |
| IBM PowerPC | 1 | 52 | 79 | 125 |
| Intel Pentium II | 1 | 40 | 118 | 253 |
| Intel Pentium III | 1 | 40 | 142 | 306 |
| SGI Origin 2000 | 1 | 64 | 222 | 520 |
| SGI Origin 2000 | 4 | 64 | 137 | 1056 |
| Sun Ultra 2 | 1 | 64 | 131 | 276 |
| Sun Enterprise 450 | 1 | 64 | 178 | 391 |
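The blocked algorithm behind these figures can be sketched in a few lines. The following is a simplified right-looking blocked Cholesky in Python with NumPy/SciPy (the LAPACK routine itself is Fortran calling tuned BLAS; `blocked_cholesky` and `nb` are illustrative names, not LAPACK's):

```python
import numpy as np
from scipy.linalg import solve_triangular

def blocked_cholesky(A, nb=64):
    """Right-looking blocked Cholesky sketch: returns lower L with A = L @ L.T.

    Each diagonal block is factorized with an unblocked kernel; the panel
    solve and trailing update are Level 3 BLAS operations (TRSM and SYRK),
    which account for the speedups shown in the table above.
    """
    A = A.copy()
    n = A.shape[0]
    for k in range(0, n, nb):
        e = min(k + nb, n)
        # Unblocked factorization of the diagonal block.
        A[k:e, k:e] = np.linalg.cholesky(A[k:e, k:e])
        if e < n:
            # Panel solve (TRSM): L21 = A21 * L11^{-T}.
            A[e:, k:e] = solve_triangular(A[k:e, k:e], A[e:, k:e].T,
                                          lower=True).T
            # Trailing update (SYRK): A22 <- A22 - L21 * L21^T.
            A[e:, e:] -= A[e:, k:e] @ A[e:, k:e].T
    return np.tril(A)

rng = np.random.default_rng(0)
M = rng.standard_normal((200, 200))
A = M @ M.T + 200 * np.eye(200)    # symmetric positive definite
L = blocked_cholesky(A, nb=64)
assert np.allclose(L @ L.T, A)
```

The block size `nb` plays the same role as the "Block size" column in the tables: large enough that the matrix-multiply update dominates, small enough that the panel fits in cache.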
LAPACK, like LINPACK, provides a factorization for symmetric indefinite matrices: A is factorized as P U D U^T P^T, where P is a permutation matrix and D is block diagonal with blocks of order 1 or 2. A block form of this algorithm has been derived, and is implemented in the LAPACK routine SSYTRF/DSYTRF. It has to duplicate a little of the computation in order to "look ahead" to determine the necessary row and column interchanges, but the extra work can be more than compensated for by the greater speed of updating the matrix by blocks, provided that n is large enough, as Table 3.9 illustrates.
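This symmetric indefinite factorization is exposed in SciPy as `scipy.linalg.ldl`, which wraps ?SYTRF. A small sketch (the routine returns a permuted triangular factor, so the product below reconstructs A directly):

```python
import numpy as np
from scipy.linalg import ldl

rng = np.random.default_rng(1)
B = rng.standard_normal((6, 6))
A = B + B.T                        # symmetric but, in general, indefinite

# ldl wraps LAPACK's DSYTRF: symmetric pivoting, with D block diagonal
# built from 1x1 and 2x2 blocks to handle indefiniteness stably.
U, D, perm = ldl(A, lower=False)   # upper variant, matching P U D U^T P^T
assert np.allclose(U @ D @ U.T, A)
# perm gives the row order that brings U into upper triangular form.
```

Unlike Cholesky, this factorization never requires A to be positive definite, at the cost of the pivoting and 2x2 diagonal blocks.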
LAPACK, like LINPACK, provides LU and Cholesky factorizations of band matrices. The LINPACK algorithms can easily be restructured to use Level 2 BLAS, though that has little effect on performance for matrices of very narrow bandwidth. It is also possible to use Level 3 BLAS, at the price of doing some extra work with zero elements outside the band [48]. This becomes worthwhile for matrices of large order and semi-bandwidth greater than 100 or so.
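SciPy also exposes the band solvers. For example, `scipy.linalg.solve_banded` takes a matrix in LAPACK band storage; a tridiagonal sketch follows (note that for bandwidths this narrow SciPy may dispatch to the tridiagonal routine DGTSV rather than the general banded LU DGBTRF/DGBSV):

```python
import numpy as np
from scipy.linalg import solve_banded

# Tridiagonal example in LAPACK band storage: ab[u + i - j, j] = a[i, j],
# with u = 1 superdiagonal and l = 1 subdiagonal.
n = 5
ab = np.zeros((3, n))
ab[0, 1:] = -1.0        # superdiagonal
ab[1, :] = 2.0          # main diagonal
ab[2, :-1] = -1.0       # subdiagonal
b = np.ones(n)

# Solve A x = b using only the stored band.
x = solve_banded((1, 1), ab, b)

# Check against the equivalent dense system.
A = (np.diag(2.0 * np.ones(n))
     - np.diag(np.ones(n - 1), 1)
     - np.diag(np.ones(n - 1), -1))
assert np.allclose(A @ x, b)
```

The band storage scheme is what keeps the cost proportional to the bandwidth rather than to n; the Level 3 variant discussed above trades some of that economy for matrix-multiply speed.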