1 Matrix calculus
1.1 Matrix gradient identities
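Let \(\textbf{A}\) be an \(m \times n\) matrix and let \(\textbf{x}\) be an \(n\times 1\) column vector. In the gradient-like layout, the derivative of \(\textbf{Ax}\) with respect to \(\textbf{x}\) is an \(n \times m\) matrix: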
\[\frac{\partial{\textbf{Ax}}}{\partial{\textbf{x}}} = \textbf{A}^\top\]
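For the product \(\textbf{x}^\top \textbf{A}\) to be defined, \(\textbf{x}\) must instead be an \(m \times 1\) column vector:
\[\frac{\partial{\textbf{x}^\top \textbf{A}}}{\partial{\textbf{x}}} = \textbf{A}\]
For the quadratic form \(\textbf{x}^\top \textbf{Ax}\), \(\textbf{A}\) must be square (\(m = n\)):
\[\frac{\partial{\textbf{x}^\top \textbf{Ax}}}{\partial{\textbf{x}}} = \textbf{Ax} + \textbf{A}^\top \textbf{x}\]
When \(\textbf{A}\) is symmetric, the last expression reduces to \(2\textbf{Ax}\).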
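The three identities can be sanity-checked numerically. The following is a minimal sketch, assuming NumPy is available; the helper `layout_derivative`, the random matrices `A` and `S`, and the vectors `x` and `y` are hypothetical names introduced only for this illustration. It approximates each derivative by central differences, arranged in the gradient-like layout (row \(i\) holds the partial derivatives with respect to \(x_i\)).

```python
import numpy as np

def layout_derivative(f, x, eps=1e-6):
    """Central-difference derivative of f at x in gradient-like layout.

    Row i holds the partial derivatives of every output with respect to x[i],
    so the result has shape (len(x), number_of_outputs).
    """
    rows = []
    for i in range(x.size):
        step = np.zeros_like(x)
        step[i] = eps
        forward = np.atleast_1d(f(x + step))
        backward = np.atleast_1d(f(x - step))
        rows.append((forward - backward) / (2 * eps))
    return np.array(rows)

rng = np.random.default_rng(0)
m, n = 3, 4
A = rng.normal(size=(m, n))  # general m x n matrix
S = rng.normal(size=(n, n))  # square matrix, needed for the quadratic form
x = rng.normal(size=n)       # n-vector, so that A x is defined
y = rng.normal(size=m)       # m-vector, so that y^T A is defined

d_Ax = layout_derivative(lambda v: A @ v, x)        # compare with A^T
d_yA = layout_derivative(lambda v: v @ A, y)        # compare with A
d_quad = layout_derivative(lambda v: v @ S @ v, x)  # compare with S x + S^T x

print(np.allclose(d_Ax, A.T, atol=1e-6))
print(np.allclose(d_yA, A, atol=1e-6))
print(np.allclose(d_quad[:, 0], S @ x + S.T @ x, atol=1e-6))
```

Each line should print `True`. A finite-difference check of this kind does not prove the identities, but it is a quick way to catch a misplaced transpose.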
1.2 Gradient and Hessian
Let \(f : \mathbb R^n \rightarrow \mathbb R\). The first derivative of \(f\) (the gradient) is an \(n \times 1\) vector and the second derivative (the Hessian) is an \(n \times n\) matrix. \[ \frac{\partial f(\textbf{x})}{\partial \textbf{x}} = \nabla f(\mathbf{x}) = \begin{bmatrix} \frac{\partial f(\mathbf{x})}{\partial x_1} \\ \frac{\partial f(\mathbf{x})}{\partial x_2} \\ \vdots \\ \frac{\partial f(\mathbf{x})}{\partial x_n} \end{bmatrix}\] \[ \nabla^2 f(\mathbf{x}) = \begin{bmatrix} \frac{\partial^2 f(\mathbf{x})}{\partial x_1^2} & \frac{\partial^2 f(\mathbf{x})}{\partial x_1 \partial x_2} & \cdots & \frac{\partial^2 f(\mathbf{x})}{\partial x_1 \partial x_n} \\ \frac{\partial^2 f(\mathbf{x})}{\partial x_2 \partial x_1} & \frac{\partial^2 f(\mathbf{x})}{\partial x_2^2} & \cdots & \frac{\partial^2 f(\mathbf{x})}{\partial x_2 \partial x_n} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial^2 f(\mathbf{x})}{\partial x_n \partial x_1} & \frac{\partial^2 f(\mathbf{x})}{\partial x_n \partial x_2} & \cdots & \frac{\partial^2 f(\mathbf{x})}{\partial x_n^2} \end{bmatrix} \]
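To make the two definitions concrete, here is a small sketch, assuming NumPy, that approximates the gradient and Hessian entry by entry with central differences and compares them with hand-derived values. The test function \(f(\mathbf{x}) = x_1^2 x_2 + \sin(x_3)\) and the helper names `numerical_gradient` and `numerical_hessian` are hypothetical choices made for this illustration only.

```python
import numpy as np

def f(x):
    # Hypothetical test function f: R^3 -> R
    return x[0] ** 2 * x[1] + np.sin(x[2])

def numerical_gradient(f, x, eps=1e-5):
    """Approximate each partial derivative df/dx_i by a central difference."""
    grad = np.zeros(x.size)
    for i in range(x.size):
        e = np.zeros(x.size)
        e[i] = eps
        grad[i] = (f(x + e) - f(x - e)) / (2 * eps)
    return grad

def numerical_hessian(f, x, eps=1e-4):
    """Approximate each second partial d^2 f / (dx_i dx_j) by central differences."""
    n = x.size
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            ei = np.zeros(n)
            ej = np.zeros(n)
            ei[i] = eps
            ej[j] = eps
            H[i, j] = (f(x + ei + ej) - f(x + ei - ej)
                       - f(x - ei + ej) + f(x - ei - ej)) / (4 * eps ** 2)
    return H

x = np.array([1.0, 2.0, 0.5])

# Hand-derived gradient and Hessian of the test function at x
grad_exact = np.array([2 * x[0] * x[1], x[0] ** 2, np.cos(x[2])])
hess_exact = np.array([
    [2 * x[1], 2 * x[0], 0.0],
    [2 * x[0], 0.0, 0.0],
    [0.0, 0.0, -np.sin(x[2])],
])

grad_num = numerical_gradient(f, x)
hess_num = numerical_hessian(f, x)

print(np.allclose(grad_num, grad_exact, atol=1e-6))
print(np.allclose(hess_num, hess_exact, atol=1e-5))
```

The gradient has one entry per input coordinate and the Hessian has one entry per pair of coordinates. For a function with continuous second partial derivatives, such as this one, the Hessian is symmetric.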
2 Jacobian vs matrix partial derivative
Consider a vector-valued function \(\mathbf{f}: \mathbb R^n \rightarrow \mathbb R^m\). Its first derivative, called the Jacobian, is a matrix of dimensions \(m \times n\): \[J_{\mathbf{f}(\mathbf{x})} = \begin{bmatrix} \frac{\partial f_1}{\partial x_1} & \frac{\partial f_1}{\partial x_2} & \cdots & \frac{\partial f_1}{\partial x_n} \\ \frac{\partial f_2}{\partial x_1} & \frac{\partial f_2}{\partial x_2} & \cdots & \frac{\partial f_2}{\partial x_n} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial f_m}{\partial x_1} & \frac{\partial f_m}{\partial x_2} & \cdots & \frac{\partial f_m}{\partial x_n} \end{bmatrix} \] Notice that the gradient of a scalar function (\(m = 1\)) is the transpose of its Jacobian.
In the matrix calculus used in statistics, we conventionally adhere to the gradient-like ordering, in which column \(j\) is the gradient of the component function \(f_j\). The result is an \(n \times m\) matrix, the transpose of the Jacobian:
\[\frac{\partial \mathbf{f}(\mathbf{x})}{\partial \mathbf{x}} = \begin{bmatrix} \frac{\partial f_1}{\partial x_1} & \frac{\partial f_2}{\partial x_1} & \cdots & \frac{\partial f_m}{\partial x_1} \\ \frac{\partial f_1}{\partial x_2} & \frac{\partial f_2}{\partial x_2} & \cdots & \frac{\partial f_m}{\partial x_2} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial f_1}{\partial x_n} & \frac{\partial f_2}{\partial x_n} & \cdots & \frac{\partial f_m}{\partial x_n} \end{bmatrix} \]
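The relationship between the two layouts can be illustrated with a short sketch, assuming NumPy. The map \(\mathbf{f}(\mathbf{x}) = (x_1 x_2,\; x_1 + x_2,\; x_2^2)\) and the helper `numerical_jacobian` are hypothetical and chosen only for this example. The code builds the \(m \times n\) Jacobian by central differences and then transposes it to obtain the gradient-like layout.

```python
import numpy as np

def f(x):
    # Hypothetical example map f: R^2 -> R^3
    return np.array([x[0] * x[1], x[0] + x[1], x[1] ** 2])

def numerical_jacobian(f, x, eps=1e-6):
    """m x n Jacobian: entry (i, j) approximates d f_i / d x_j."""
    cols = []
    for j in range(x.size):
        e = np.zeros(x.size)
        e[j] = eps
        cols.append((f(x + e) - f(x - e)) / (2 * eps))
    return np.stack(cols, axis=1)

x = np.array([2.0, 3.0])

J = numerical_jacobian(f, x)  # Jacobian layout, shape (m, n) = (3, 2)
D = J.T                       # gradient-like layout, shape (n, m) = (2, 3)

# Hand-derived Jacobian of the example map at x
J_exact = np.array([
    [x[1], x[0]],      # d(x1 * x2) / d(x1, x2)
    [1.0, 1.0],        # d(x1 + x2) / d(x1, x2)
    [0.0, 2 * x[1]],   # d(x2 ** 2) / d(x1, x2)
])

print(J.shape, D.shape)
print(np.allclose(J, J_exact, atol=1e-6))
# Column j of D is the gradient of the component function f_j
print(np.allclose(D[:, 0], [x[1], x[0]], atol=1e-6))
```

Under this convention, the linear map \(\mathbf{f}(\mathbf{x}) = \textbf{Ax}\) has Jacobian \(\textbf{A}\) and gradient-like derivative \(\textbf{A}^\top\), which matches the first identity in Section 1.1.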