Lanczos Approach to the Inverse Square Root of a Large and Sparse Matrix

发布于:2021-06-24 00:41:42

A Lanczos approach to the inverse square root of a large and sparse matrix
arXiv:hep-lat/9910045v3 16 Jul 2000
Artan Bori? ci
Paul Scherrer Institute CH-5232 Villigen PSI

Abstract I construct a Lanczos process on a large and sparse matrix and use the results of this iteration to compute the inverse square root of the same matrix. The algorithm is a stable version of an earlier proposal by the author. It can be used for problems related to the matrix sign and polar decomposition. The application here comes from the theory of chiral fermions on the lattice.



The computation of the inverse square root of a matrix is a special problem in scienti?c computing. It is related to the matrix sign and polar decomposition [1]. One may de?ne the matrix sign by: sign(A) = A(A2 )? 2


where A is a complex matrix with no pure imaginary eigenvalues. In polar coordinates, a complex number z = x + i y , is represented by z = |z | ei φ , 1 φ = arctan y x (2)

In analogy, the polar decomposition of a matrix A is de?ned by: A = V (A? A) 2 ,

V ?1 = V ?


where V is the polar part and the second factor corresponds to the absolute value of A. The mathematical literature invloving the matrix sign function traces back to 1971 when it was used to solve the Lyapunov and algebraic Riccati equations [1]. In computational physics one may face a similar problem when dealing with Monte Carlo simulations of fermion systems, the so-called sign problem [2]. In this case the integration measure is proportional to the determinant of a matrix and the polar decomposition may be helpful to monitor the sign of the determinant. The example brought in this paper comes from the recent progress in formulating Quantum Chromodynamics (QCD) on a lattice with exact chiral symmetry [3]. In continuum, the massless Dirac propagator Dcont is chirally symmetric, i.e. γ5 Dcont + Dcont γ5 = 0 (4)

On a regular lattice with spacing a the symmetry is suppressed according to the Ginsparg-Wilson relation: [4]: γ5 D + Dγ5 = aDγ5 D where D is the lattice Dirac operator. An explicit example of a Dirac operator obeying this relation is the so-called overlap operator [5]: aD = 1 ? A(A? A)?1/2 , A = M ? aDW (6) (5)

where M is a shift parameter in the range (0, 2), which I have ?xed at one. DW is the Wilson operator, DW a γ ? ?? ? = 2 ?=1 2
4 4



which is a nearest-neighbors discretization of the continuum Dirac operator (it violates the Ginsparg-Wilson relation). ?? and ?? are ?rst and second order covariant di?erences given by: ( ?? ψ ) i = (?? ψ )i =
1 (U?,i ψi+? ? 2a 1 (U?,i ψi+? ? a2 ? ? U?,i ?) ?? ? ψi?? ? + U?,i ? ? 2ψi ) ?? ? ψi??

where ψi is a fermion ?eld at the lattice site i and U?,i an SU (3) lattice gauge ?led associated with the oriented link (i, i + ? ?). These are unitary 3 by 3 complex matrices with determinant one. A set of such matrices forms a lattice gauge “con?guration”. γ? , ? = 1, . . . , 5 are 4 × 4 Dirac matrices which anticommute with each-other. Therefore, if there are N lattice points in four dimensions, the matrix A is of order 12N . A restive symmetry of the matrix A that comes from the continuum is the so called γ5 ? symmetry which is the Hermiticity of the γ5 A operator. By de?nition, computation of D involves the inverse square root of a matrix. This is a non-trivial task for large matrices. Therefore several algorithms have been proposed by lattice QCD physicists [6, 7, 8, 9, 10]. All these methods rely on matrix-vector multiplications with the sparse Wilson matrix DW , being therefore feasible for large simulations. In fact, methods that approximate the inverse square root by Legendre [6] and Chebyshev polynomials [7] need to know apriori the extreme eigenvalues of A? A to be e?ective. This requires computational resources of at least one Conjugate Gradients (CG) inversion. In [8] the inverse square root is approximated by a rational approximation, which allows an e?cient computation via a multi-shift CG iteration. Storage here may be an obstacle which is remedied by a second CG step [11]. The Pade approximation used by [9] needs the knowledge of the smallest eigenvalue of A? A. Therefore the method becomes e?ective only in connection with the D inversion [12]. The method presented earlier by the author [10] relies on taking exactly the inverse square root from the Ritz values. These are the roots of the Lanczos polynomial approx3

imating the inverse of A? A. In that work the Lanczos polynomial was constructed by applying the Hermitian operator γ5 A. The latter is inde?nite, thereby responsible for observed oscillations in the residual vector norm [10]. Here I use a Lanczos polynomial on the positive de?nite matrix A? A. In this case the residual vector norm decreases monotonically and leads to a stable method. This is a crucial property that allows a reliable stopping criterion that I will present here. The paper is selfcontained: in the next section I will brie?y present the Lanczos algorithm and set the notations. In section 3, I use the algorithm to solve linear systems, and in section 4, the computation of the inverse square root is given. The method is tested in section 5 and conclusions are drawn in the end.


The Lanczos Algorithm

The Lanczos iteration is known to approximate the spectrum of the underlying matrix in an optimal way and, in particular, it can be used to solve linear systems [13]. Let Qn = [q1 , . . . , qn ] be the set of orthonormal vectors, such that
(n) T A? AQn = Qn Tn + βn qn+1 (en ) ,

q1 = ρ1 b,

ρ1 = 1/||b||2


where Tn is a tridiagonal and symmetric matrix, b is an arbitrary vector, and βn a real and positive constant. em denotes the unit vector with n elements in the direction m. By writing down the above decomposition in terms of the vectors qi , i = 1, . . . , n and the matrix elements of Tn , I arrive at a three term recurrence that allows to compute these


vectors in increasing order, starting from the vector q1 . This is the LanczosAlgorithm: β0 = 0, ρ1 = 1/||b||2 , q0 = o, q1 = ρ1 b f or i = 1, . . . v = A? Aqi
? αi = qi v


v := v ? qi αi ? qi?1 βi?1 βi = ||v ||2 if βi < tol, n = i, end f or qi+1 = v/βi where tol is a tolerance which serves as a stopping condition. The Lanczos Algorithm constructs a basis for the Krylov subspace [13]: span{b, A? Ab, . . . , (A? A)n?1 b} (10)

If the Algorithm stops after n steps, one says that the associated Krylov subspace is invariant. In the ?oating point arithmetic, there is a danger that once the Lanczos Algorithm (polynomial) has approximated well some part of the spectrum, the iteration reproduces vectors which are rich in that direction [13]. As a consequence, the orthogonality of the Lanczos vectors is spoiled with an immediate impact on the history of the iteration: if the algorithm would stop after n steps in exact arithmetic, in the presence of round o? errors the loss of orthogonality would keep the algorithm going on.


The Lanczos Algorithm for solving A? Ax = b

Here I will use this algorithm to solve linear systems, where the loss of orthogonality will not play a role in the sense that I will use a di?erent stopping condition. I ask the solution in the form x = Qn yn 5 (11)

By projecting the original system on to the Krylov subspace I get:
? ? Q? n A Ax = Qn b


By construction, I have b = Qn e1 /ρ1 , Substituting x = Qn yn and using (8), my task is now to solve the system Tn yn = e1 /ρ1 Therefore the solution is given by
?1 x = Qn Tn e1 /ρ1 (n) (n) (n)




This way using the Lanczos iteration one reduces the size of the matrix to be inverted. Moreover, since Tn is tridiagonal, one can compute yn by short recurences. If I de?ne: ri = b ? A? Axi , qi = ρi ri , x ?i = ρi xi where i = 1, . . . , it is easy to show that ρi+1 βi + ρi αi + ρi?1 βi?1 = 0 qi + x ?i+1 βi + x ?i αi + x ?i?1 βi?1 = 0 Therefore the solution can be updated recursively and I have the following Algorithm1 (17) (16)


for solving the system A? Ax = b: β0 = 0, ρ1 = 1/||b||2 , q0 = o, q1 = ρ1 b f or i = 1, . . . v = A? Aqi
? αi = qi v

v := v ? qi αi ? qi?1 βi?1 βi = ||v ||2 qi+1 = v/βi
xi αi +? xi?1 βi?1 x ?i+1 = ? qi +? βi
i?1 βi?1 ρi+1 = ? ρi αi +ρ βi


ri+1 := qi+1 /ρi+1 xi+1 := yi+1 /ρi+1 < tol, n = i, end f or if |ρi1 +1 |


The Lanczos Algorithm for solving (A? A)1/2x = b

Now I would like to compute x = (A? A)?1/2 b and still use the Lanczos Algorithm. In order to do so I make the following observations: Let (A? A)?1/2 be expressed by a matrix-valued function, for example the integral formula [1]: (A A)
? ?1/2 ∞

2 = π

dt(t2 + A? A)?1


From the previous section, I use the Lanczos Algorithm to compute
?1 (A? A)?1 b = Qn Tn e1 /ρ1 (n)


It is easy to show that the Lanczos Algorithm is shift-invariant. i.e. if the matrix A? A is shifted by a constant say, t2 , the Lanczos vectors remain invariant. Moreover, the corresponding Lanczos matrix is shifted by the same amount. This property allows one to solve the system (t2 + A? A)x = b by using the same Lanczos iteration as before. Since the matrix (t2 + A? A) is better conditioned than A? A, 7

it can be concluded that once the original system is solved, the shifted one is solved too. Therefore I have: (t2 + A? A)?1 b = Qn (t2 + Tn )?1 e1 /ρ1 Using the above integral formula and putting everything together, I get:
?1/2 e1 /ρ1 x = (A? A)?1/2 b = Qn Tn (n) (n)



There are some remarks to be made here: a) As before, by applying the Lanczos iteration on A? A, the problem of computing (A? A)?1/2 b reduces to the problem of computing yn = Tn much smaller problem than the original one. But since Tn
?1/2 (n) e1 /ρ1

which is typically a


is full, yn cannot be computed

by short recurences. It can be computed for example by using the full decomposition of Tn in its eigenvalues and eigenvectors; in fact this is the method I have employed too, for its compactness and the small overhead for moderate n. b) The method is not optimal, as it would have been, if one would have applied it directly for the matrix (A? A)1/2 . By using A? A the condition is squared, and one looses a factor of two compared to the theoretical case! c) From the derivation above, it can be concluded that the system (A? A)1/2 x = b is solved at the same time as the system A? Ax = b. d) To implement the result (22), I ?rst construct the Lanczos matrix and then compute
?1/2 yn = Tn e1 /ρ1 (n)


To compute x = Qn yn , I repeat the Lanczos iteration. I save the scalar products, though it is not necessary.


Therefore I have the following Algorithm2 for solving the system (A? A)1/2 x = b: β0 = 0, ρ1 = 1/||b||2 , q0 = o, q1 = ρ1 b f or i = 1, . . . v = A? Aqi
? αi = qi v

v := v ? qi αi ? qi?1 βi?1 βi = ||v ||2 qi+1 = v/βi
i?1 βi?1 ρi+1 = ? ρi αi +ρ βi

if |ρi1 < tol, n = i, end f or +1 | (24) Set (Tn )i,i = αi , (Tn )i+1,i = (Tn )i,i+1 = βi , otherwise (Tn )i,j = 0 yn = Tn
?1/2 (n) e1 /ρ1 ?1/2 (n)

= Un Λn

T Un e1 /ρ1

q0 = o, q1 = ρ1 b, x0 = o f or i = 1, . . . , n xi = xi?1 + qi yn v = A? Aqi v := v ? qi αi ? qi?1 βi?1 qi+1 = v/βi where by o I denote a vector with zero entries and Un , Λn the matrices of the eigenvectors and eigenvalues of Tn . Note that there are only four large vectors necessary to store: qi?1 , qi , v, xi .



Testing the method

I propose a simple test: I solve the system A? Ax = b by applying twice the Algorithm2, i.e. I solve the linear systems (A? A)1/2 z = b, (A? A)1/2 x = z in the above order. For each approximation xi , I compute the residual vector ri = b ? A? Axi (26) (25)

The method is tested for a SU(3) con?guration at β = 6.0 on a 83 16 lattice, corresponding to an order 98304 complex matrix A. In Fig.1 I show the norm of the residual vector decreasing monotonically. The stagnation of ||ri||2 for small values of tol may come from the accumulation of round o? error in the 64-bit precision arithmetic used here. This example shows that the tolerance line is above the residual norm line, which con?rms the expectation that tol is a good stopping condition of the Algorithm2.



I have presented a Lanczos method to compute the inverse square root of a large and sparse positive de?nite matrix. The method is characterized by a residual vector norm that decreases monotonically and a consistent stopping condition. This stability should be compared with a similar method presented earlier by the author [10], where the underlying Hermitian but inde?nite matrix γ5 A led to appreciable instabilities in the norm of the residual vector. In terms of complexity this algorithm requires less operations for the same accuracy than its inde?nite matrix counterpart. This property is guaranteed by the monotonicity of the residual vector norm. Nontheless, the bulk of the work remains the same. With the improvement in store the method is complete.


It shares with methods presented in [8, 9] the same underlying Lanczos polynomial. As it is wellknown [13] CG and Lanczos methods for solving a linear system produce the same results in exact arithmetic. In fact, CG derives from the Lanczos algorithm by solving the coupled two-term recurences of CG for a single three-term recurence of Lanczos. However, the coupled two-term recurences of CG accumulate less round o?. This makes CG preferable for ill-conditioned problems. There are two main di?erences between the method presented here and those in [8, 9]: a) Since CG and Lanczos are equivalent, they produce the same Lanczos matrix. Therefore, any function of A? A translates for both algorithms into a function of Tn (given the basis of Lanczos vectors). The latter function translates into a function of the Ritz values, the eigenvalues of Tn . That is, whenever the methods of papers [8, 9] try to approximate the inverse square root of A? A, the underlying CG algorithm shifts this function to the Ritz values. It is clear now that if I take the inverse square root from the Ritz values exactly, I don’t have any approximation error. This is done in Algorithm2. b) Algorithm2 sets no limits in the amount of memory required, whereas the multishift CG needs to store as many vectors as the number of shifts. For high accuracy approximations the multi-shift CG is not practical. However, one may lift this limit in expense of a second CG iteration (two-step CG) [11]. Therefore Algorithm2 and the two-step CG have the same iteration workload, with Algorithm2 computing exactly the inverse square root. Additionally, Algorithm2 requires the calculation of Ritz eigenpairs of Tn , which makes for an overhead proportional to ? n2 when using the QR algorithm for the eigenvalues and the inverse iteration for the eigenvectors [13]. Since the complexity of the Lanczos algorithm is ? nN , the relative overhead is proportional to ? n/N . For moderate gauge couplings and lattice sizes this is a small percentage. I conclude that the algorithms of [8, 9] may be used in situations where a high accuracy is not required and/or A is well-conditioned. Experience with overlap fermions shows that high accuracy is often essential [7, 10]. In such situations the Algorithm2 is best suited. 11

[1] For a starting point and a thorough review of the problem see N. J. Higham, Proceedings of ”Pure and Applied Linear Algebra: The New Generation”, Pensacola, March 1993. [2] For example in simulating ?nite density QCD: P. Hasenfratz and F. Karsch, Phys. Lett. B 125B (1983) 308 and the Hubbard model: D. J. Scalapino and R. L. Sugar, Phys. Rev. Lett. 46 (1981) 519 [3] For a recent review see H. Neuberger, hep-lat/9909042 [4] P. H. Ginsparg and K. G. Wilson, Phys. Rev. D 25 (1982) 2649. [5] H. Neuberger, Phys. Lett. B 417 (1998) 141, Phys. Rev. D 57 (1998) 5417. [6] B. Bunk, Nucl.Phys.Proc.Suppl. B63 (1998) 952. [7] P. Hernandez, K. Jansen, and M. L¨ uscher, Nucl.Phys. B552 (1999) 363 [8] H. Neuberger, Phys. Rev. Lett. 81 (1998) 4060. [9] R. G. Edwards, U. M. Heller and R. Narayanan, Nucl.Phys. B540 (1999) 457-471. [10] A. Bori? ci, Phys.Lett. B453 (1999) 46-53. [11] H. Neuberger, Int.J.Mod.Phys. C10 (1999) 1051 [12] R. G. Edwards, U. M. Heller and R. Narayanan, Phys.Rev. D59 (1999) 094510 [13] G. H. Golub and C. F. Van Loan, Matrix Computations, The Johns Hopkins University Press, Baltimore, 1989. This is meant as a general reference with original references included therein.




Test of the Lanczos Algorithm for the Inverse Square Root: (A A)








Tolerance level prescribed in the Algorithm2







Norm of the residual ||b ? A A x||










600 800 Ax multiplications




Figure 1: The dots show the norm of the residual vector, whereas the line shows the tolerance level set by tol in the Algorithm2.