20 Numerical linear algebra
20.1 Inversion
20.1.1 Moore-Penrose inverse
20.1.2 Woodbury identity
If we have matrices A_{n \times n}, U_{n\times k}, C_{k \times k}, V_{k \times n}, with A and C invertible, then
( A + UCV)^{-1} = A^{-1} - A^{-1} U (C^{-1} + VA^{-1}U)^{-1} VA^{-1}
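As a sanity check, the identity is easy to verify numerically on random matrices (a minimal NumPy sketch; the sizes and the conditioning tweaks are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 6, 2

# Random, well-conditioned test matrices (A and C must be invertible).
A = np.eye(n) + 0.1 * rng.standard_normal((n, n))
U = rng.standard_normal((n, k))
C = np.eye(k) + 0.1 * rng.standard_normal((k, k))
V = rng.standard_normal((k, n))

lhs = np.linalg.inv(A + U @ C @ V)

Ainv = np.linalg.inv(A)
rhs = Ainv - Ainv @ U @ np.linalg.inv(np.linalg.inv(C) + V @ Ainv @ U) @ V @ Ainv

print(np.allclose(lhs, rhs))  # True
```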
Derivation
Consider matrices X and Y for which I + XY (equivalently I + YX) is invertible. Since
\begin{align*} X(I + YX) &= (I+XY)X \end{align*}
multiplying on the left by (I+XY)^{-1} and on the right by (I + YX)^{-1} gives the push-through identity
\begin{align*} (I+XY)^{-1} X &= X(I + YX)^{-1} \tag{1} \end{align*}
\begin{align*} I &= (I+P)(I+P)^{-1} \\ I &= (I+P)^{-1} + P(I+P)^{-1} \\ (I+P)^{-1} &= I - P(I+P)^{-1} \quad \text{(Rearranging terms)}\\ (I + YX)^{-1} &= I - YX(I+YX)^{-1} \quad \text{(Substituting } P = YX \text{)}\\ (I + YX)^{-1} &= I - Y (I+XY)^{-1}X \quad \text{(Using (1)) } \end{align*}
We substitute Y := A^{-1} U and X := CV (so that YX = A^{-1} UCV):
\begin{align*} (I + A^{-1} UCV)^{-1} &= I - A^{-1} U (I + CVA^{-1}U)^{-1} CV\\ (A^{-1} A + A^{-1} UCV)^{-1} &= I - A^{-1} U (C C^{-1} + CVA^{-1}U)^{-1} CV\\ [A^{-1}( A + UCV)]^{-1} &= I - A^{-1} U [ C (C^{-1} + VA^{-1}U)]^{-1} CV\\ ( A + UCV)^{-1} A &= I - A^{-1} U (C^{-1} + VA^{-1}U)^{-1} C^{-1} CV\\ ( A + UCV)^{-1} &= A^{-1} - A^{-1} U (C^{-1} + VA^{-1}U)^{-1} VA^{-1} \end{align*}
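The identity is most useful computationally when A^{-1} is cheap to apply (e.g. A is diagonal) and k \ll n, since the correction term only requires inverting a k \times k matrix. Below is a minimal sketch of such a low-rank-update solve, assuming a diagonal A (the sizes and variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 2000, 5

d = rng.uniform(1.0, 2.0, size=n)       # A = diag(d), trivially invertible
U = rng.standard_normal((n, k))
C = np.eye(k)
V = rng.standard_normal((k, n))
b = rng.standard_normal(n)

# Solve (A + U C V) x = b via Woodbury: only a k x k system is factored.
Ainv_b = b / d                          # A^{-1} b
Ainv_U = U / d[:, None]                 # A^{-1} U
small = np.linalg.inv(C) + V @ Ainv_U   # k x k matrix C^{-1} + V A^{-1} U
x = Ainv_b - Ainv_U @ np.linalg.solve(small, V @ Ainv_b)

# Compare against a direct dense solve.
x_direct = np.linalg.solve(np.diag(d) + U @ C @ V, b)
print(np.allclose(x, x_direct))         # True
```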
20.1.3 Schur complements
Say we have a matrix M_{(m+ n) \times (m+n)} consisting of four blocks A_{m \times m}, B_{m \times n}, C_{n \times m}, D_{n \times n}.
Then, \begin{bmatrix} A & B \\ C & D \end{bmatrix}^{-1} =\begin{bmatrix} (M/D)^{-1} & - (M/D)^{-1} BD^{-1}\\ -D^{-1}C (M/D)^{-1} & D^{-1} + D^{-1} C (M/D)^{-1} BD^{-1} \end{bmatrix} where M/D := A - BD^{-1}C is the Schur complement of D, defined and derived below.
Derivation
\begin{bmatrix} A_{m \times m} & B_{m \times n} \\ C_{n \times m} & D_{n \times n} \end{bmatrix} \begin{bmatrix} X & Y \\ Z & W \\ \end{bmatrix} = \mathbf I_{m+n}
\begin{bmatrix} A X + B Z & A Y + B W \\ C X + D Z & C Y + D W \end{bmatrix}= \begin{bmatrix} \mathbf I_m & \mathbf 0_{m \times n} \\ \mathbf 0_{n \times m} & \mathbf I_n \\ \end{bmatrix}
We can separate these matrix equations and do some straightforward linear equation solving (assuming A and D are invertible):
\begin{align*} A Y + B W &= \mathbf 0 \\ A Y &= - B W \\ Y &= - A^{-1} B W \end{align*}
\begin{align*} C X + D Z &= \mathbf 0 \\ D Z &= - C X\\ Z &= - D^{-1} C X \end{align*}
\begin{align*} D W + C Y &= \mathbf I \\ D W - C A^{-1} BW &= \mathbf I \\ (D- C A^{-1} B) W &= \mathbf I\\ W &= (D- C A^{-1} B)^{-1} \end{align*}
\begin{align*} Y &= -A^{-1} BW\\ &= -A^{-1} B(D - C A^{-1} B)^{-1} \end{align*}
\begin{align*} A X + B Z &= \mathbf I \\ AX - B D^{-1} C X &= \mathbf I\\ (A- BD^{-1} C) X &= \mathbf I\\ X &= (A- BD^{-1} C)^{-1} \end{align*}
\begin{align*} Z &= - D^{-1} C (A - BD^{-1} C)^{-1} \end{align*} \begin{bmatrix} A & B \\ C & D \end{bmatrix}^{-1} = \begin{bmatrix} X & Y \\ Z & W \\ \end{bmatrix} = \begin{bmatrix} (A- BD^{-1} C)^{-1} & -A^{-1} B(D - C A^{-1} B)^{-1} \\ - D^{-1} C (A - BD^{-1} C)^{-1} & (D- C A^{-1} B)^{-1} \end{bmatrix}
Applying the Woodbury identity with D, -C, A^{-1}, B playing the roles of A, U, C, V respectively, we get W = (D- CA^{-1}B)^{-1} = D^{-1} + D^{-1} C (A - BD^{-1}C)^{-1} BD^{-1}
\begin{align*} Y &= -A^{-1} B(D - C A^{-1} B)^{-1} \\ &= -A^{-1} B [D^{-1} + D^{-1} C (A - BD^{-1}C)^{-1} BD^{-1}]\\ &= - [A^{-1} B D^{-1} + A^{-1} B D^{-1} C (A - BD^{-1}C)^{-1} BD^{-1}]\\ &= - [A^{-1} + A^{-1} B D^{-1} C (A - BD^{-1}C)^{-1} ]BD^{-1}\\ &= - A^{-1}[I + B D^{-1} C (A - BD^{-1}C)^{-1} ]BD^{-1}\\ &= - A^{-1}[I - (A- B D^{-1} C ) (A - BD^{-1}C)^{-1} + A(A - BD^{-1}C)^{-1} ]BD^{-1}\\ &= - A^{-1} A(A - BD^{-1}C)^{-1} BD^{-1}\\ &= - (A - BD^{-1}C)^{-1} BD^{-1} \end{align*}
If D is invertible, we define the Schur complement of D as M /D = A - BD^{-1}C. We can easily express the block matrix inverse in terms of the inverse of this Schur complement: \begin{bmatrix} A & B \\ C & D \end{bmatrix}^{-1} =\begin{bmatrix} (M/D)^{-1} & - (M/D)^{-1} BD^{-1}\\ -D^{-1}C (M/D)^{-1} & D^{-1} + D^{-1} C (M/D)^{-1} BD^{-1} \end{bmatrix}
Equivalently, we can rephrase this in terms of the Schur complement of A, M /A = D - CA^{-1}B: \begin{bmatrix} A & B \\ C & D \end{bmatrix}^{-1} =\begin{bmatrix} A^{-1} + A^{-1} B (M/A)^{-1} CA^{-1} & -A^{-1}B (M/A)^{-1}\\ - (M/A)^{-1} CA^{-1} & (M/A)^{-1} \end{bmatrix}
A convenient mnemonic is mentioned in Power (2022).
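Both block forms can be checked numerically on random blocks (a minimal NumPy sketch, assuming A, D and the Schur complements are invertible):

```python
import numpy as np

rng = np.random.default_rng(2)
m, n = 3, 4

A = np.eye(m) + 0.1 * rng.standard_normal((m, m))
B = rng.standard_normal((m, n))
C = rng.standard_normal((n, m))
D = np.eye(n) + 0.1 * rng.standard_normal((n, n))

Minv = np.linalg.inv(np.block([[A, B], [C, D]]))

Ainv, Dinv = np.linalg.inv(A), np.linalg.inv(D)
M_D_inv = np.linalg.inv(A - B @ Dinv @ C)   # (M/D)^{-1}
M_A_inv = np.linalg.inv(D - C @ Ainv @ B)   # (M/A)^{-1}

via_D = np.block([
    [M_D_inv,              -M_D_inv @ B @ Dinv],
    [-Dinv @ C @ M_D_inv,  Dinv + Dinv @ C @ M_D_inv @ B @ Dinv],
])
via_A = np.block([
    [Ainv + Ainv @ B @ M_A_inv @ C @ Ainv,  -Ainv @ B @ M_A_inv],
    [-M_A_inv @ C @ Ainv,                   M_A_inv],
])
print(np.allclose(Minv, via_D), np.allclose(Minv, via_A))  # True True
```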
20.2 Decompositions
20.2.1 Cholesky decomposition
20.2.1.1 Rank one update of Cholesky
20.2.2 QR decomposition
20.2.3 Singular value decomposition
20.3 Matrix gradient identities
Let \mathbf{A} be an m \times n matrix and \mathbf{x} an n \times 1 column vector. Using the gradient-like (denominator-layout) convention described below,
\frac{\partial{\mathbf{A}\mathbf{x}}}{\partial{\mathbf{x}}} = \mathbf{A}^\top
\frac{\partial{\mathbf{x}^\top \mathbf{A}}}{\partial{\mathbf{x}}} = \mathbf{A} \qquad (\text{here } \mathbf{A} \text{ is } n \times m)
\frac{\partial{\mathbf{x}^\top \mathbf{A}\mathbf{x}}}{\partial{\mathbf{x}}} = \mathbf{A}\mathbf{x} + \mathbf{A}^\top \mathbf{x} \qquad (\text{here } \mathbf{A} \text{ is } n \times n)
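The quadratic-form identity is easy to confirm with a central finite-difference gradient (a minimal sketch; the matrix, point and step size are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 5
A = rng.standard_normal((n, n))   # not necessarily symmetric
x = rng.standard_normal(n)

f = lambda v: v @ A @ v           # f(x) = x^T A x

# Central finite-difference gradient, one coordinate direction at a time.
eps = 1e-6
grad_fd = np.array([(f(x + eps * e) - f(x - eps * e)) / (2 * eps) for e in np.eye(n)])

print(np.allclose(grad_fd, A @ x + A.T @ x, atol=1e-6))  # True
```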
20.4 Gradient and Hessian
Let f : \mathbb R^n \rightarrow \mathbb R. The first derivative of f (the gradient) is a vector and the second derivative (the Hessian) is a matrix. \frac{\partial f(\textbf{x})}{\partial \textbf{x}} = \nabla f(\mathbf{x}) = \begin{bmatrix} \frac{\partial f(\mathbf{x})}{\partial x_1} \\ \frac{\partial f(\mathbf{x})}{\partial x_2} \\ \vdots \\ \frac{\partial f(\mathbf{x})}{\partial x_n} \end{bmatrix} \nabla^2 f(\mathbf{x}) = \begin{bmatrix} \frac{\partial^2 f(\mathbf{x})}{\partial x_1^2} & \frac{\partial^2 f(\mathbf{x})}{\partial x_1 \partial x_2} & \cdots & \frac{\partial^2 f(\mathbf{x})}{\partial x_1 \partial x_n} \\ \frac{\partial^2 f(\mathbf{x})}{\partial x_2 \partial x_1} & \frac{\partial^2 f(\mathbf{x})}{\partial x_2^2} & \cdots & \frac{\partial^2 f(\mathbf{x})}{\partial x_2 \partial x_n} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial^2 f(\mathbf{x})}{\partial x_n \partial x_1} & \frac{\partial^2 f(\mathbf{x})}{\partial x_n \partial x_2} & \cdots & \frac{\partial^2 f(\mathbf{x})}{\partial x_n^2} \end{bmatrix}
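For the quadratic form f(\mathbf{x}) = \mathbf{x}^\top \mathbf{A} \mathbf{x} the Hessian is the constant matrix \mathbf{A} + \mathbf{A}^\top, which a second-order finite-difference check confirms (a minimal, self-contained sketch):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 4
A = rng.standard_normal((n, n))
x = rng.standard_normal(n)
f = lambda v: v @ A @ v                     # f(x) = x^T A x, so Hessian = A + A^T

# Mixed second partials by central differences.
eps = 1e-4
E = np.eye(n)
H = np.empty((n, n))
for i in range(n):
    for j in range(n):
        H[i, j] = (f(x + eps*E[i] + eps*E[j]) - f(x + eps*E[i] - eps*E[j])
                   - f(x - eps*E[i] + eps*E[j]) + f(x - eps*E[i] - eps*E[j])) / (4 * eps**2)

print(np.allclose(H, A + A.T, atol=1e-5))   # True
```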
20.5 Jacobian vs matrix partial derivative
Consider a vector-valued function f: \mathbb R^n \rightarrow \mathbb R^m. Its first derivative, called the Jacobian, is an m \times n matrix: J_{\mathbf{f}(\mathbf{x})} = \begin{bmatrix} \frac{\partial f_1}{\partial x_1} & \frac{\partial f_1}{\partial x_2} & \cdots & \frac{\partial f_1}{\partial x_n} \\ \frac{\partial f_2}{\partial x_1} & \frac{\partial f_2}{\partial x_2} & \cdots & \frac{\partial f_2}{\partial x_n} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial f_m}{\partial x_1} & \frac{\partial f_m}{\partial x_2} & \cdots & \frac{\partial f_m}{\partial x_n} \end{bmatrix} Notice that the gradient of a scalar function is the transpose of its Jacobian.
In the matrix calculus conventions commonly used in statistics, we adhere to the gradient-like (denominator-layout) ordering: each column is the gradient of one component function.
\frac{\partial \mathbf{f}(\mathbf{x})}{\partial \mathbf{x}} = \begin{bmatrix} \frac{\partial f_1}{\partial x_1} & \frac{\partial f_2}{\partial x_1} & \cdots & \frac{\partial f_m}{\partial x_1} \\ \frac{\partial f_1}{\partial x_2} & \frac{\partial f_2}{\partial x_2} & \cdots & \frac{\partial f_m}{\partial x_2} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial f_1}{\partial x_n} & \frac{\partial f_2}{\partial x_n} & \cdots & \frac{\partial f_m}{\partial x_n} \end{bmatrix}
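A small numerical illustration of the two layouts (a minimal sketch; the function f below is an arbitrary choice):

```python
import numpy as np

# f : R^2 -> R^3,  f(x) = (x1*x2, sin(x1), x1 + x2^2)
f = lambda x: np.array([x[0] * x[1], np.sin(x[0]), x[0] + x[1] ** 2])

x = np.array([0.7, -1.3])
eps = 1e-6

# Jacobian (m x n): row i holds the partials of f_i with respect to x.
J = np.column_stack([(f(x + eps * e) - f(x - eps * e)) / (2 * eps) for e in np.eye(len(x))])

J_analytic = np.array([[x[1],         x[0]],
                       [np.cos(x[0]), 0.0],
                       [1.0,          2 * x[1]]])
print(np.allclose(J, J_analytic, atol=1e-6))  # True

# The gradient-like layout used in this section is simply the transpose (n x m).
dfdx = J.T
print(J.shape, dfdx.shape)  # (3, 2) (2, 3)
```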