20 Numerical linear algebra
20.1 Inversion
20.1.1 Moore-Penrose inverse
20.1.2 Woodbury identity
If we have matrices A_{n \times n}, U_{n\times k}, C_{k \times k}, V_{k \times n}, with A and C invertible, then
( A + UCV)^{-1} = A^{-1} - A^{-1} U (C^{-1} + VA^{-1}U)^{-1} VA^{-1}
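As a sanity check, the identity is easy to verify numerically on random matrices (a minimal NumPy sketch; the sizes and the conditioning tweaks are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 6, 2

# Random, well-conditioned test matrices (A and C must be invertible).
A = np.eye(n) + 0.1 * rng.standard_normal((n, n))
U = rng.standard_normal((n, k))
C = np.eye(k) + 0.1 * rng.standard_normal((k, k))
V = rng.standard_normal((k, n))

lhs = np.linalg.inv(A + U @ C @ V)

Ainv = np.linalg.inv(A)
rhs = Ainv - Ainv @ U @ np.linalg.inv(np.linalg.inv(C) + V @ Ainv @ U) @ V @ Ainv

print(np.allclose(lhs, rhs))  # True
```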
Derivation
Consider matrices X and Y for which I + XY (equivalently I + YX) is invertible. Since
\begin{align*} X(I + YX) &= (I+XY)X \end{align*}
multiplying on the left by (I+XY)^{-1} and on the right by (I + YX)^{-1} gives the push-through identity
\begin{align*} (I+XY)^{-1} X &= X(I + YX)^{-1} \tag{1} \end{align*}
\begin{align*} I &= (I+P)(I+P)^{-1} \\ I &= (I+P)^{-1} + P(I+P)^{-1} \\ (I+P)^{-1} &= I - P(I+P)^{-1} \quad \text{(Rearranging terms)}\\ (I + YX)^{-1} &= I - YX(I+YX)^{-1} \quad \text{(Substituting } P = YX \text{)}\\ (I + YX)^{-1} &= I - Y (I+XY)^{-1}X \quad \text{(Using (1)) } \end{align*}
We substitute Y := A^{-1} U and X := CV (so that YX = A^{-1} UCV):
\begin{align*} (I + A^{-1} UCV)^{-1} &= I - A^{-1} U (I + CVA^{-1}U)^{-1} CV\\ (A^{-1} A + A^{-1} UCV)^{-1} &= I - A^{-1} U (C C^{-1} + CVA^{-1}U)^{-1} CV\\ [A^{-1}( A + UCV)]^{-1} &= I - A^{-1} U [ C (C^{-1} + VA^{-1}U)]^{-1} CV\\ ( A + UCV)^{-1} A &= I - A^{-1} U (C^{-1} + VA^{-1}U)^{-1} C^{-1} CV\\ ( A + UCV)^{-1} &= A^{-1} - A^{-1} U (C^{-1} + VA^{-1}U)^{-1} VA^{-1} \end{align*}
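The identity is most useful computationally when A^{-1} is cheap to apply (e.g. A is diagonal) and k \ll n, since the correction term only requires inverting a k \times k matrix. Below is a minimal sketch of such a low-rank-update solve, assuming a diagonal A (the sizes and variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 2000, 5

d = rng.uniform(1.0, 2.0, size=n)       # A = diag(d), trivially invertible
U = rng.standard_normal((n, k))
C = np.eye(k)
V = rng.standard_normal((k, n))
b = rng.standard_normal(n)

# Solve (A + U C V) x = b via Woodbury: only a k x k system is factored.
Ainv_b = b / d                          # A^{-1} b
Ainv_U = U / d[:, None]                 # A^{-1} U
small = np.linalg.inv(C) + V @ Ainv_U   # k x k matrix C^{-1} + V A^{-1} U
x = Ainv_b - Ainv_U @ np.linalg.solve(small, V @ Ainv_b)

# Compare against a direct dense solve.
x_direct = np.linalg.solve(np.diag(d) + U @ C @ V, b)
print(np.allclose(x, x_direct))         # True
```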
20.1.3 Schur complements
Say we have a matrix M_{(m+ n) \times (m+n)} consisting of four blocks A_{m \times m}, B_{m \times n}, C_{n \times m}, D_{n \times n}.
Then, \begin{bmatrix} A & B \\ C & D \end{bmatrix}^{-1} =\begin{bmatrix} (M/D)^{-1} & - (M/D)^{-1} BD^{-1}\\ -D^{-1}C (M/D)^{-1} & D^{-1} + D^{-1} C (M/D)^{-1} BD^{-1} \end{bmatrix} where M/D := A - BD^{-1}C is the Schur complement of D, defined and derived below.
Derivation
\begin{bmatrix} A_{m \times m} & B_{m \times n} \\ C_{n \times m} & D_{n \times n} \end{bmatrix} \begin{bmatrix} X & Y \\ Z & W \\ \end{bmatrix} = \mathbf I_{m+n}
\begin{bmatrix} A X + B Z & A Y + B W \\ C X + D Z & C Y + D W \end{bmatrix}= \begin{bmatrix} \mathbf I_m & \mathbf 0_{m \times n} \\ \mathbf 0_{n \times m} & \mathbf I_n \\ \end{bmatrix}
We can separate these matrix equations and do some straightforward linear equation solving (assuming A and D are invertible):
\begin{align*} A Y + B W &= \mathbf 0 \\ A Y &= - B W \\ Y &= - A^{-1} B W \end{align*}
\begin{align*} C X + D Z &= \mathbf 0 \\ D Z &= - C X\\ Z &= - D^{-1} C X \end{align*}
\begin{align*} D W + C Y &= \mathbf I \\ D W - C A^{-1} BW &= \mathbf I \\ (D- C A^{-1} B) W &= \mathbf I\\ W &= (D- C A^{-1} B)^{-1} \end{align*}
\begin{align*} Y &= -A^{-1} BW\\ &= -A^{-1} B(D - C A^{-1} B)^{-1} \end{align*}
\begin{align*} A X + B Z &= \mathbf I \\ AX - B D^{-1} C X &= \mathbf I\\ (A- BD^{-1} C) X &= \mathbf I\\ X &= (A- BD^{-1} C)^{-1} \end{align*}
\begin{align*} Z &= - D^{-1} C (A - BD^{-1} C)^{-1} \end{align*} \begin{bmatrix} A & B \\ C & D \end{bmatrix}^{-1} = \begin{bmatrix} X & Y \\ Z & W \\ \end{bmatrix} = \begin{bmatrix} (A- BD^{-1} C)^{-1} & -A^{-1} B(D - C A^{-1} B)^{-1} \\ - D^{-1} C (A - BD^{-1} C)^{-1} & (D- C A^{-1} B)^{-1} \end{bmatrix}
Applying the Woodbury identity with D, -C, A^{-1}, B playing the roles of A, U, C, V respectively, we get W = (D- CA^{-1}B)^{-1} = D^{-1} + D^{-1} C (A - BD^{-1}C)^{-1} BD^{-1}
\begin{align*} Y &= -A^{-1} B(D - C A^{-1} B)^{-1} \\ &= -A^{-1} B [D^{-1} + D^{-1} C (A - BD^{-1}C)^{-1} BD^{-1}]\\ &= - [A^{-1} B D^{-1} + A^{-1} B D^{-1} C (A - BD^{-1}C)^{-1} BD^{-1}]\\ &= - [A^{-1} + A^{-1} B D^{-1} C (A - BD^{-1}C)^{-1} ]BD^{-1}\\ &= - A^{-1}[I + B D^{-1} C (A - BD^{-1}C)^{-1} ]BD^{-1}\\ &= - A^{-1}[I - (A- B D^{-1} C ) (A - BD^{-1}C)^{-1} + A(A - BD^{-1}C)^{-1} ]BD^{-1}\\ &= - A^{-1} A(A - BD^{-1}C)^{-1} BD^{-1}\\ &= - (A - BD^{-1}C)^{-1} BD^{-1} \end{align*}
If D is invertible, we define the Schur complement of D as M /D = A - BD^{-1}C. We can easily express the block matrix inverse in terms of the inverse of this Schur complement: \begin{bmatrix} A & B \\ C & D \end{bmatrix}^{-1} =\begin{bmatrix} (M/D)^{-1} & - (M/D)^{-1} BD^{-1}\\ -D^{-1}C (M/D)^{-1} & D^{-1} + D^{-1} C (M/D)^{-1} BD^{-1} \end{bmatrix}
Equivalently, we can rephrase this in terms of the Schur complement of A, M /A = D - CA^{-1}B: \begin{bmatrix} A & B \\ C & D \end{bmatrix}^{-1} =\begin{bmatrix} A^{-1} + A^{-1} B (M/A)^{-1} CA^{-1} & -A^{-1}B (M/A)^{-1}\\ - (M/A)^{-1} CA^{-1} & (M/A)^{-1} \end{bmatrix}
A convenient mnemonic is mentioned in Power (2022).
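Both block forms can be checked numerically on random blocks (a minimal NumPy sketch, assuming A, D and the Schur complements are invertible):

```python
import numpy as np

rng = np.random.default_rng(2)
m, n = 3, 4

A = np.eye(m) + 0.1 * rng.standard_normal((m, m))
B = rng.standard_normal((m, n))
C = rng.standard_normal((n, m))
D = np.eye(n) + 0.1 * rng.standard_normal((n, n))

Minv = np.linalg.inv(np.block([[A, B], [C, D]]))

Ainv, Dinv = np.linalg.inv(A), np.linalg.inv(D)
M_D_inv = np.linalg.inv(A - B @ Dinv @ C)   # (M/D)^{-1}
M_A_inv = np.linalg.inv(D - C @ Ainv @ B)   # (M/A)^{-1}

via_D = np.block([
    [M_D_inv,              -M_D_inv @ B @ Dinv],
    [-Dinv @ C @ M_D_inv,  Dinv + Dinv @ C @ M_D_inv @ B @ Dinv],
])
via_A = np.block([
    [Ainv + Ainv @ B @ M_A_inv @ C @ Ainv,  -Ainv @ B @ M_A_inv],
    [-M_A_inv @ C @ Ainv,                   M_A_inv],
])
print(np.allclose(Minv, via_D), np.allclose(Minv, via_A))  # True True
```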
20.2 Decompositions
20.2.1 Cholesky decomposition
20.2.1.1 Rank one update of Cholesky
20.2.2 QR decomposition
20.2.3 Singular value decomposition
20.3 Matrix gradient identities
Let \mathbf{A} be an m \times n matrix and \mathbf{x} an n \times 1 column vector. Using the gradient-like (denominator-layout) convention described below,
\frac{\partial{\mathbf{A}\mathbf{x}}}{\partial{\mathbf{x}}} = \mathbf{A}^\top
\frac{\partial{\mathbf{x}^\top \mathbf{A}}}{\partial{\mathbf{x}}} = \mathbf{A} \qquad (\text{here } \mathbf{A} \text{ is } n \times m)
\frac{\partial{\mathbf{x}^\top \mathbf{A}\mathbf{x}}}{\partial{\mathbf{x}}} = \mathbf{A}\mathbf{x} + \mathbf{A}^\top \mathbf{x} \qquad (\text{here } \mathbf{A} \text{ is } n \times n)
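The quadratic-form identity is easy to confirm with a central finite-difference gradient (a minimal sketch; the matrix, point and step size are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 5
A = rng.standard_normal((n, n))   # not necessarily symmetric
x = rng.standard_normal(n)

f = lambda v: v @ A @ v           # f(x) = x^T A x

# Central finite-difference gradient, one coordinate direction at a time.
eps = 1e-6
grad_fd = np.array([(f(x + eps * e) - f(x - eps * e)) / (2 * eps) for e in np.eye(n)])

print(np.allclose(grad_fd, A @ x + A.T @ x, atol=1e-6))  # True
```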
20.4 Gradient and Hessian
Let f : \mathbb R^n \rightarrow \mathbb R. The first derivative of f (the gradient) is a vector and the second derivative (the Hessian) is a matrix. \frac{\partial f(\textbf{x})}{\partial \textbf{x}} = \nabla f(\mathbf{x}) = \begin{bmatrix} \frac{\partial f(\mathbf{x})}{\partial x_1} \\ \frac{\partial f(\mathbf{x})}{\partial x_2} \\ \vdots \\ \frac{\partial f(\mathbf{x})}{\partial x_n} \end{bmatrix} \nabla^2 f(\mathbf{x}) = \begin{bmatrix} \frac{\partial^2 f(\mathbf{x})}{\partial x_1^2} & \frac{\partial^2 f(\mathbf{x})}{\partial x_1 \partial x_2} & \cdots & \frac{\partial^2 f(\mathbf{x})}{\partial x_1 \partial x_n} \\ \frac{\partial^2 f(\mathbf{x})}{\partial x_2 \partial x_1} & \frac{\partial^2 f(\mathbf{x})}{\partial x_2^2} & \cdots & \frac{\partial^2 f(\mathbf{x})}{\partial x_2 \partial x_n} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial^2 f(\mathbf{x})}{\partial x_n \partial x_1} & \frac{\partial^2 f(\mathbf{x})}{\partial x_n \partial x_2} & \cdots & \frac{\partial^2 f(\mathbf{x})}{\partial x_n^2} \end{bmatrix}
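For the quadratic form f(\mathbf{x}) = \mathbf{x}^\top \mathbf{A} \mathbf{x} the Hessian is the constant matrix \mathbf{A} + \mathbf{A}^\top, which a second-order finite-difference check confirms (a minimal, self-contained sketch):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 4
A = rng.standard_normal((n, n))
x = rng.standard_normal(n)
f = lambda v: v @ A @ v                     # f(x) = x^T A x, so Hessian = A + A^T

# Mixed second partials by central differences.
eps = 1e-4
E = np.eye(n)
H = np.empty((n, n))
for i in range(n):
    for j in range(n):
        H[i, j] = (f(x + eps*E[i] + eps*E[j]) - f(x + eps*E[i] - eps*E[j])
                   - f(x - eps*E[i] + eps*E[j]) + f(x - eps*E[i] - eps*E[j])) / (4 * eps**2)

print(np.allclose(H, A + A.T, atol=1e-5))   # True
```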
20.5 Jacobian vs matrix partial derivative
Consider a vector-valued function f: \mathbb R^n \rightarrow \mathbb R^m. Its first derivative, called the Jacobian, is an m \times n matrix: J_{\mathbf{f}(\mathbf{x})} = \begin{bmatrix} \frac{\partial f_1}{\partial x_1} & \frac{\partial f_1}{\partial x_2} & \cdots & \frac{\partial f_1}{\partial x_n} \\ \frac{\partial f_2}{\partial x_1} & \frac{\partial f_2}{\partial x_2} & \cdots & \frac{\partial f_2}{\partial x_n} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial f_m}{\partial x_1} & \frac{\partial f_m}{\partial x_2} & \cdots & \frac{\partial f_m}{\partial x_n} \end{bmatrix} Notice that the gradient of a scalar function is the transpose of its Jacobian.
In the matrix calculus conventions commonly used in statistics, we adhere to the gradient-like (denominator-layout) ordering: each column is the gradient of one component function.
\frac{\partial \mathbf{f}(\mathbf{x})}{\partial \mathbf{x}} = \begin{bmatrix} \frac{\partial f_1}{\partial x_1} & \frac{\partial f_2}{\partial x_1} & \cdots & \frac{\partial f_m}{\partial x_1} \\ \frac{\partial f_1}{\partial x_2} & \frac{\partial f_2}{\partial x_2} & \cdots & \frac{\partial f_m}{\partial x_2} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial f_1}{\partial x_n} & \frac{\partial f_2}{\partial x_n} & \cdots & \frac{\partial f_m}{\partial x_n} \end{bmatrix}
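A small numerical illustration of the two layouts (a minimal sketch; the function f below is an arbitrary choice):

```python
import numpy as np

# f : R^2 -> R^3,  f(x) = (x1*x2, sin(x1), x1 + x2^2)
f = lambda x: np.array([x[0] * x[1], np.sin(x[0]), x[0] + x[1] ** 2])

x = np.array([0.7, -1.3])
eps = 1e-6

# Jacobian (m x n): row i holds the partials of f_i with respect to x.
J = np.column_stack([(f(x + eps * e) - f(x - eps * e)) / (2 * eps) for e in np.eye(len(x))])

J_analytic = np.array([[x[1],         x[0]],
                       [np.cos(x[0]), 0.0],
                       [1.0,          2 * x[1]]])
print(np.allclose(J, J_analytic, atol=1e-6))  # True

# The gradient-like layout used in this section is simply the transpose (n x m).
dfdx = J.T
print(J.shape, dfdx.shape)  # (3, 2) (2, 3)
```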