\[ \tag{46} \frac{\partial \det \left( \mathbf{Y} \right)}{\partial x}=\,\,\det \left( \mathbf{Y} \right) Tr\left[ \mathbf{Y}^{-1}\frac{\partial \mathbf{Y}}{\partial x} \right] \]

Formula \((46)\) is actually Jacobi’s formula. 1

Analogy in functions

For a differentiable function \(f: D\subseteq \mathbb{R}\rightarrow \mathbb{R}\) and all \(x\) in some neighborhood of \(a\), \(f\) can be written as: 2 \[f(x)=f(a)+f^{\prime}(a)(x-a)+R(x-a) \] where the remainder satisfies \(R(x-a)/(x-a)\rightarrow 0\) as \(x\rightarrow a\), and \(L(x)=f(a)+f^{\prime}(a)(x-a)\) is the best affine approximation of the function \(f\) at \(a\).

Equivalently, the same idea can be expressed as: \[f(x+\epsilon)=f(x)+f^{\prime}(x)\,\epsilon +R(\epsilon) \]

It comes from the Taylor expansion of \(f\) about \(x\), evaluated at \(x+\epsilon\): \[f(x+\epsilon)=f(x)+f^{\prime}(x)\,\epsilon +f^{\prime\prime}(x)\,\epsilon^2 /2 + \cdots \]

Lemma 1 3

\[ \det \left( \mathbf{I}+\epsilon \mathbf{A} \right) =\,\,1+\epsilon Tr\left( \mathbf{A} \right) +O\left( \epsilon^2 \right) \]

Let \(A_1,A_2,\ldots,A_N\) be the column vectors of the matrix \(A\). Let \(e_1,e_2,\ldots,e_N\) be the standard basis; note that these basis vectors form the columns of the identity matrix \(I\). Then recall that the determinant is an alternating multilinear function of the columns.

\[\begin{align} \det(I+\epsilon A)&=\det(e_1+\epsilon A_1,\,e_2+\epsilon A_2,\,\ldots,\,e_N+\epsilon A_N) \\ &=\det(e_1,e_2,\ldots,e_N)+\epsilon \left\{ \det(A_1,e_2,\ldots,e_N)+\det(e_1,A_2,\ldots,e_N) +\cdots +\det(e_1,e_2,\ldots,A_N) \right\} + O(\epsilon^2) \end{align}\]

The first term is just the determinant of the identity matrix, which is \(1\). The term proportional to \(\epsilon\) is a sum of expressions of the form \(\det(e_1,e_2,\ldots,A_j,\ldots,e_N)\), where the \(j\)-th column of the identity matrix is replaced with the \(j\)-th column of \(A\). Expanding the determinant along the \(j\)-th row, we see that \(\det(e_1,e_2,\ldots,A_j,\ldots,e_N)=A_{jj}\).

\[\det(I+\epsilon A)=1+\epsilon\sum_{j=1}^N A_{jj}+O(\epsilon^2)=1+\epsilon \operatorname{Tr}(A)+O(\epsilon^2)\]

In particular, when \(n=2\), \[\begin{align} \det \left( I+\epsilon A \right) &=\det \left( \begin{matrix} 1+\epsilon a_{11}& \epsilon a_{12}\\ \epsilon a_{21}& 1+\epsilon a_{22}\\ \end{matrix} \right) \\ &=1+\epsilon \left( a_{11}+a_{22} \right) +\epsilon ^2\left( a_{11}a_{22}-a_{12}a_{21} \right) \\ &=1+\epsilon \operatorname{Tr}\left( A \right) +\epsilon ^2\det \left( A \right) \end{align}\]
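As a quick numerical sanity check of Lemma 1 (a minimal sketch using NumPy; the test matrix \(A\) is an arbitrary random matrix chosen only for illustration):

```python
import numpy as np

# Check Lemma 1: det(I + eps*A) = 1 + eps*Tr(A) + O(eps^2).
rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))     # arbitrary test matrix

for eps in (1e-2, 1e-4):
    lhs = np.linalg.det(np.eye(4) + eps * A)
    rhs = 1 + eps * np.trace(A)
    print(eps, lhs - rhs)           # the gap shrinks roughly like eps**2
```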

Lemma 2. 4

\[ \operatorname{det}^{\prime}(I)=\operatorname{Tr} \]

where \(\operatorname{det}^{\prime}\) is the differential of \({\displaystyle \det }\).

This equation means that the differential of \({\displaystyle \det }\), evaluated at the identity matrix, is equal to the trace. The differential \({\displaystyle \det '(I)}\) is a linear operator that maps an n × n matrix to a real number.

Using the definition of a directional derivative together with one of its basic properties for differentiable functions, we have \[\begin{align} \operatorname{det}^{\prime}(I)(T)=\nabla_{T} \operatorname{det}(I)&=\lim _{\varepsilon \rightarrow 0} \frac{\operatorname{det}(I+\varepsilon T)-\operatorname{det} I}{\varepsilon} \\ &= \lim_{\varepsilon \rightarrow 0} \frac{1+\varepsilon \operatorname{Tr}(T)+O(\varepsilon^2)-1}{\varepsilon} \\ &= \operatorname{Tr}(T) \end{align}\]

Alternative proof of Lemma 2: 5

\(\det\) is a function \(M_{n\times n}\rightarrow \mathbb{R}\), where \(M_{n\times n}\) is the space of \(n\times n\) square matrices, so a matrix plays the role that a point plays for real functions. The best linear approximation to \(\det\) near the identity is given by: \[\det(\mathbf{I}+\mathbf{M})=\det(\mathbf{I})+\operatorname{det}^{\prime}(\mathbf{I})(\mathbf{M})+R(\mathbf{I},\mathbf{M})\] \[\lim_{\|\mathbf{M} \|\rightarrow 0}\frac{R(\mathbf{I},\mathbf{M})}{\|\mathbf{M} \|}=0\]

\(\operatorname{det}^{\prime}(I)=\operatorname{Tr}\) is equivalent to the following:

\[\begin{equation} \left.\frac{\mathrm{d}}{\mathrm{d} t}[\operatorname{det}(\mathbf{I}+t \mathbf{B})]\right|_{t=0}=\operatorname{Tr}(\mathbf{B}) \end{equation}\]
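This equivalent form is easy to check numerically (a minimal sketch with NumPy; \(\mathbf{B}\) is an arbitrary random test matrix):

```python
import numpy as np

# Finite-difference check of d/dt det(I + t*B) at t = 0 against Tr(B).
rng = np.random.default_rng(1)
B = rng.standard_normal((5, 5))     # arbitrary test matrix

t = 1e-6
deriv = (np.linalg.det(np.eye(5) + t * B) - 1.0) / t   # det(I) = 1
print(deriv, np.trace(B))           # agree up to an O(t) error
```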

Lemma 3 6

For an invertible matrix \(A\), we have:

\[\begin{equation} \operatorname{det}^{\prime}(A)(T)=\operatorname{det} A \operatorname{tr}\left(A^{-1} T\right) \end{equation}\]

Proof: 7

Remember that, if \(f:E\rightarrow F\) is a differentiable map, a way to compute \(df(a)(v)\) is to find a curve \(\gamma:\mathbb{R}\rightarrow E\) with \(\gamma(0)=a\) and \(\gamma^{\prime}(0)=v\), and then \(df(a)(v)=\frac{d}{dt}\big|_0 f(\gamma(t))\) (this is the chain rule). Here, find a curve \(\gamma:\mathbb{R}\rightarrow M_n(\mathbb{R})\) with \(\gamma(0)=A\) and \(\gamma^{\prime}(0)=T\) (for instance \(\gamma(t)=A+tT\)). Then note that

\[\begin{equation} \begin{aligned} d \operatorname{det}(A)(T)=\left.\frac{d}{d t}\right|_{0}\operatorname{det}(\gamma(t)) &=\left.\frac{d}{d t}\right|_{0}\operatorname{det}\left(A A^{-1} \gamma(t)\right) \\ &=\operatorname{det}(A) \left.\frac{d}{d t}\right|_{0}\operatorname{det}\left(A^{-1} \gamma(t)\right) \\ &=\operatorname{det}(A)\, d \operatorname{det}(I)\left(A^{-1} T\right) \end{aligned} \end{equation}\] (since \(t\mapsto A^{-1}\gamma(t)\) is a curve which equals \(I\) at \(0\) and whose derivative at \(0\) is \(A^{-1}T\)). By Lemma 2, the last expression equals \(\operatorname{det}(A)\operatorname{tr}\left(A^{-1}T\right)\), which proves Lemma 3.
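Lemma 3 can also be verified with finite differences (a minimal NumPy sketch; \(A\) and \(T\) are arbitrary test matrices, with \(A\) shifted by a multiple of the identity so it is comfortably invertible):

```python
import numpy as np

# Check Lemma 3: det'(A)(T) = det(A) * tr(A^{-1} T), using gamma(t) = A + t*T.
rng = np.random.default_rng(2)
A = rng.standard_normal((4, 4)) + 4 * np.eye(4)   # comfortably invertible
T = rng.standard_normal((4, 4))

t = 1e-6
numeric = (np.linalg.det(A + t * T) - np.linalg.det(A)) / t
closed_form = np.linalg.det(A) * np.trace(np.linalg.solve(A, T))  # A^{-1} T
print(numeric, closed_form)         # agree up to the finite-difference error
```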

\[\begin{equation} \tag{43} \partial(\ln (\operatorname{det}(\mathbf{X})))=\operatorname{Tr}\left(\mathbf{X}^{-1} \partial \mathbf{X}\right) \end{equation}\]

Lemma 1

\[\begin{equation} \sum_{i} \sum_{j} \mathbf{A}^{\mathrm{T}}_{i j} \mathbf{B}_{i j} = \operatorname{Tr}\left(\mathbf{A} \mathbf{B}\right) \end{equation}\]

Lemma 2 1

(Credit to https://statisticaloddsandends.wordpress.com/2018/05/24/derivative-of-log-det-x/)

\[\begin{equation} \frac{\partial(\operatorname{det} \mathbf{X})}{\partial \mathbf{X}_{i j}}=\mathbf{C}_{i j} \end{equation}\]

For a matrix \(X\), we define some terms:

  • The \((i,j)\) minor of \(X\), denoted \(M_{ij}\), is the determinant of the \((n-1) \times (n-1)\) matrix that remains after removing the \(i\)th row and \(j\)th column from \(X\).

  • The cofactor matrix of \(X\), denoted \(C\), is an \(n \times n\) matrix such that \(C_{ij} = (-1)^{i+j} M_{ij}\).

  • The adjugate matrix of \(X\), denoted \(\operatorname{adj } X\), is simply the transpose of \(C\).

These terms are useful because they relate to both matrix determinants and inverses. If \(X\) is invertible, then \(X^{-1}=\frac{1}{\operatorname{det} X}(\operatorname{adj} X)\), so

\[\begin{equation} \left(\textbf{X}^{-1}\right)^T_{ij} = \frac{1}{\operatorname{det} X} C_{ij} \end{equation}\]

On the other hand, by the cofactor expansion of the determinant along row \(i\), \(\det X=\sum_{k=1}^{n}X_{ik}C_{ik}\), so by the product rule,

\[ \frac{\partial \left( \det X \right)}{\partial X_{ij}}=\sum_{k=1}^{n}\frac{\partial X_{ik}}{\partial X_{ij}}C_{ik}+\sum_{k=1}^{n}X_{ik}\frac{\partial C_{ik}}{\partial X_{ij}} \]

If \(k \neq j\), then \(\dfrac{\partial X_{ik}}{\partial X_{ij}} = 0\); otherwise it is equal to 1. This means that the first sum above reduces to \(C_{ij}\). For any \(k\), the entries of \(X\) which affect \(C_{ik}\) are those which do not lie in row \(i\) or column \(k\); in particular, \(C_{ik}\) does not depend on \(X_{ij}\). Hence \(\dfrac{\partial C_{ik}}{\partial X_{ij}} = 0\) for all \(k\), so

\[\frac{\partial \left( \det X \right)}{\partial X_{ij}}=C_{ij}\]

Proof

Putting all this together with an application of the chain rule, we get

\[\left(\ln (\det X)\right)_{ij}' = \dfrac{1}{\det X} \dfrac{\partial (\det X)}{\partial X_{ij}} = \dfrac{1}{\det X} C_{ij} = (X^{-1})^T_{ij}\]

So,

\[\begin{align} \partial(\ln (\operatorname{det}(\mathbf{X})))&=\sum_{i} \sum_{j} \left(\ln (\det X)\right)_{ij}' \,\partial X_{ij} \\ &= \sum_{i} \sum_{j}(\mathbf{X}^{-1})^T_{ij}\, \partial X_{ij} \\ &= \operatorname{Tr}\left(\mathbf{X}^{-1} \partial \mathbf{X}\right) \end{align}\]

where the last step uses Lemma 1, and \[ \partial X=\left( \begin{matrix} \partial X_{11}& \cdots& \partial X_{1n}\\ \vdots& \ddots& \vdots\\ \partial X_{n1}& \cdots& \partial X_{nn}\\ \end{matrix} \right) \]
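A quick element-wise check of (43) (a minimal NumPy sketch; \(\mathbf{X}\) is an arbitrary test matrix built so that \(\det \mathbf{X} > 0\) and the logarithm is defined):

```python
import numpy as np

# Check that d(ln det X)/dX_ij = (X^{-1})^T_ij, entry by entry.
rng = np.random.default_rng(3)
X = 4 * np.eye(3) + 0.5 * rng.standard_normal((3, 3))   # det X > 0

h = 1e-6
grad = np.zeros_like(X)
for i in range(3):
    for j in range(3):
        E = np.zeros_like(X)
        E[i, j] = h                  # perturb a single entry
        grad[i, j] = (np.log(np.linalg.det(X + E))
                      - np.log(np.linalg.det(X))) / h

print(np.max(np.abs(grad - np.linalg.inv(X).T)))         # ~1e-6
```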

\[\begin{equation} \tag{40} \partial\left(\mathbf{X}^{-1}\right)=-\mathbf{X}^{-1}(\partial \mathbf{X}) \mathbf{X}^{-1} \end{equation}\]

Explanation: 1

\[\begin{equation} \underbrace{(I)^{\prime}}_{=0}=\left(\mathbf{X} \mathbf{X}^{-1}\right)^{\prime}=\mathbf{X}^{\prime} \mathbf{X}^{-1}+\mathbf{X}\left(\mathbf{X}^{-1}\right)^{\prime} \Rightarrow \end{equation}\]

\[\begin{equation} \mathbf{X}\left(\mathbf{X}^{-1}\right)^{\prime}=-\mathbf{X}^{\prime} \mathbf{X}^{-1} \quad \Rightarrow \end{equation}\]

\[\begin{equation} \left(\mathbf{X}^{-1}\right)^{\prime}=-\mathbf{X}^{-1} \mathbf{X}^{\prime} \mathbf{X}^{-1} \end{equation}\]
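Formula \((40)\) can be checked the same way (a minimal NumPy sketch; \(\mathbf{X}\) and the perturbation direction \(\mathbf{D}\) are arbitrary test matrices, with \(\mathbf{X}\) shifted to be comfortably invertible):

```python
import numpy as np

# Check (40): for X(t) = X + t*D, d/dt X(t)^{-1} at t = 0 is -X^{-1} D X^{-1}.
rng = np.random.default_rng(4)
X = rng.standard_normal((4, 4)) + 4 * np.eye(4)   # comfortably invertible
D = rng.standard_normal((4, 4))

t = 1e-6
numeric = (np.linalg.inv(X + t * D) - np.linalg.inv(X)) / t
closed_form = -np.linalg.inv(X) @ D @ np.linalg.inv(X)
print(np.max(np.abs(numeric - closed_form)))      # small, of order t
```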


\[\begin{equation} \tag{41} \partial(\operatorname{det}(\mathbf{X}))=\operatorname{Tr}(\operatorname{adj}(\mathbf{X}) \partial \mathbf{X}) \end{equation}\]

Background

Adjugate Matrix

The adjugate of \(\mathbf{X}\) is the transpose of the cofactor matrix \(\mathbf{C}\) of \(\mathbf{X}\), \[\begin{equation} \operatorname{adj}(\mathbf{X})=\mathbf{C}^{\top} \end{equation}\]

and, \[\begin{equation} \mathbf{X}^{-1}=\operatorname{det}(\mathbf{X})^{-1} \operatorname{adj}(\mathbf{X}) \quad \Rightarrow \end{equation}\]

\[\begin{equation} \operatorname{det}(\mathbf{X}) \mathbf{I} = \operatorname{adj}(\mathbf{X}) \mathbf{X} \end{equation}\]

Characteristic Polynomial

The characteristic polynomial of a square matrix is a polynomial which is invariant under matrix similarity and has the eigenvalues as roots. It has the determinant and the trace of the matrix as coefficients.

The characteristic polynomial of a square matrix \(A\) is defined by \[\begin{equation} p_{A}(t)=\operatorname{det}(t I-A) \end{equation}\]

Proof 2

Via Matrix Computation

\[\begin{equation} \frac{\partial \operatorname{det}(\mathbf{X})}{\partial \mathbf{X}_{i j}}=\sum_{k} \operatorname{adj}^{\mathrm{T}}(\mathbf{X})_{i k} \delta_{j k}=\operatorname{adj}^{\mathrm{T}}(\mathbf{X})_{i j} \quad \Rightarrow \end{equation}\]

\[\begin{equation} d(\operatorname{det}(\mathbf{X}))=\sum_{i} \sum_{j} \operatorname{adj}^{\mathrm{T}}(\mathbf{X})_{i j} d \mathbf{X}_{i j} \quad \Rightarrow \end{equation}\]

\[\begin{equation} d(\operatorname{det}(\mathbf{X}))=\operatorname{tr}(\operatorname{adj}(\mathbf{X}) d \mathbf{X}) \end{equation}\]
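As a numerical sanity check of (41) (a minimal NumPy sketch; \(\mathbf{X}\) is an arbitrary invertible test matrix, and the adjugate is computed from \(\operatorname{adj}(\mathbf{X})=\operatorname{det}(\mathbf{X})\,\mathbf{X}^{-1}\) as in the background above):

```python
import numpy as np

# Check (41): det(X + dX) - det(X) ≈ tr(adj(X) dX) for a small perturbation dX.
rng = np.random.default_rng(5)
X = rng.standard_normal((4, 4)) + 4 * np.eye(4)   # comfortably invertible
dX = 1e-6 * rng.standard_normal((4, 4))           # small perturbation

adjX = np.linalg.det(X) * np.linalg.inv(X)        # adj(X) = det(X) * X^{-1}
lhs = np.linalg.det(X + dX) - np.linalg.det(X)
rhs = np.trace(adjX @ dX)
print(lhs, rhs)                                   # agree to first order in dX
```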

Via Chain Rule

Lemma 1. \(\operatorname{det}^{\prime}(I)=\operatorname{tr}\), where \(\operatorname{det}^{\prime}\) is the differential of \(\operatorname{det}\).

Lemma 2. For an invertible matrix \(\mathbf{A}\), we have: \(\operatorname{det}^{\prime}(\mathbf{A})(\mathbf{T})=\operatorname{det} \mathbf{A} \operatorname{tr}\left(\mathbf{A}^{-1}\mathbf{T}\right)\)

Applying Lemma 2 with \(\mathbf{T}=\partial\mathbf{X}\) and using \(\operatorname{adj}(\mathbf{X})=\operatorname{det}(\mathbf{X})\,\mathbf{X}^{-1}\) gives \(\partial(\operatorname{det}(\mathbf{X}))=\operatorname{det}(\mathbf{X}) \operatorname{tr}\left(\mathbf{X}^{-1} \partial \mathbf{X}\right)=\operatorname{tr}(\operatorname{adj}(\mathbf{X})\, \partial \mathbf{X})\), which is exactly \((41)\).


\[\begin{equation} \tag{42} \partial(\operatorname{det}(\mathbf{X}))=\operatorname{det}(\mathbf{X}) \operatorname{Tr}\left(\mathbf{X}^{-1} \partial \mathbf{X}\right) \end{equation}\]

\[\begin{equation} \tag{18} \operatorname{det}(\mathbf{A})=\prod_{i} \lambda_{i} \quad \lambda_{i}=\operatorname{eig}(\mathbf{A}) \end{equation}\]

\[\begin{equation} \tag{19} \operatorname{det}(c \mathbf{A})=c^{n} \operatorname{det}(\mathbf{A}), \quad \text { if } \mathbf{A} \in \mathbb{R}^{n \times n} \end{equation}\]

\[\begin{equation} \tag{20} \operatorname{det}\left(\mathbf{A}^{T}\right)=\operatorname{det}(\mathbf{A}) \end{equation}\]


\[\begin{equation} \tag{21} \operatorname{det}(\mathbf{A B})=\operatorname{det}(\mathbf{A}) \operatorname{det}(\mathbf{B}) \end{equation}\]

The determinant of a transformation matrix is the factor by which the transformation scales area/volume. \(\mathbf{A B}\) represents two consecutive transformations, so its determinant is the product of the two scale factors.


\[\begin{equation} \tag{22} \operatorname{det}\left(\mathbf{A}^{-1}\right)=1 / \operatorname{det}(\mathbf{A}) \end{equation}\]

\[\begin{equation} \tag{23} \operatorname{det}\left(\mathbf{A}^{n}\right)=\operatorname{det}(\mathbf{A})^{n} \end{equation}\]


\[\begin{equation} \tag{24} \operatorname{det}\left(\mathbf{I}+\mathbf{u v}^{T}\right)=1+\mathbf{u}^{T} \mathbf{v} \end{equation}\]
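Formula \((24)\) (a rank-one update of the identity) is easy to verify numerically (a minimal NumPy sketch with arbitrary random vectors):

```python
import numpy as np

# Check (24): det(I + u v^T) = 1 + u^T v for column vectors u, v.
rng = np.random.default_rng(6)
u = rng.standard_normal(5)
v = rng.standard_normal(5)

lhs = np.linalg.det(np.eye(5) + np.outer(u, v))   # outer(u, v) = u v^T
rhs = 1 + u @ v
print(lhs, rhs)                                   # equal up to rounding
```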


\[\begin{equation} \tag{25} \begin{array}{l}{\text { For } n=2:} \\ {\qquad \operatorname{det}(\mathbf{I}+\mathbf{A})=1+\operatorname{det}(\mathbf{A})+\operatorname{Tr}(\mathbf{A})}\end{array} \end{equation}\]

\[\begin{equation} \tag{26} \begin{array}{l}{\text { For } n=3:} \\ {\qquad \operatorname{det}(\mathbf{I}+\mathbf{A})=1+\operatorname{det}(\mathbf{A})+\operatorname{Tr}(\mathbf{A})+\frac{1}{2} \operatorname{Tr}(\mathbf{A})^{2}-\frac{1}{2} \operatorname{Tr}\left(\mathbf{A}^{2}\right)}\end{array} \end{equation}\]
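Both special cases \((25)\) and \((26)\) can be confirmed numerically (a minimal NumPy sketch with arbitrary random matrices):

```python
import numpy as np

# Check (25) and (26): trace expansions of det(I + A) for n = 2 and n = 3.
rng = np.random.default_rng(7)

A2 = rng.standard_normal((2, 2))
print(np.linalg.det(np.eye(2) + A2),
      1 + np.linalg.det(A2) + np.trace(A2))

A3 = rng.standard_normal((3, 3))
print(np.linalg.det(np.eye(3) + A3),
      1 + np.linalg.det(A3) + np.trace(A3)
        + 0.5 * np.trace(A3) ** 2 - 0.5 * np.trace(A3 @ A3))
```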

\[\begin{equation} \tag{11} \operatorname{Tr}(\mathbf{A})=\sum_{i} A_{i i} \end{equation}\]

Let’s write the trace in a more convenient way. We have: 1 \[\begin{equation} A e_{i}=\left[\begin{array}{ccc}{a_{11}} & {\cdots} & {a_{1 n}} \\ {\vdots} & {\ddots} & {\vdots} \\ {a_{n 1}} & {\cdots} & {a_{n n}}\end{array}\right]\left[\begin{array}{c}{0} \\ {\vdots} \\ {1} \\ {\vdots} \\ {0}\end{array}\right]=\left[\begin{array}{c}{a_{1 i}} \\ {\vdots} \\ {a_{n i}}\end{array}\right] \end{equation}\] where the \(1\) is in the \(i\)-th entry, so \(Ae_i\) is the \(i\)-th column of \(A\). This way: \[\begin{equation} \left\langle e_{i}, A e_{i}\right\rangle= e_{i}^{T} A e_{i}=a_{ii}=A_{i i} \end{equation}\] So \(\operatorname{Tr}(\mathbf{A})=\sum_{i}A_{ii}\).

Intuitive explanation 2


\[\begin{equation} \tag{12} \operatorname{Tr}(\mathbf{A})=\sum_{i} \lambda_{i}, \quad \lambda_{i}=\operatorname{eig}(\mathbf{A}) \end{equation}\]

If eigendecomposition of matrix \(\mathbf{A}\) is \(\mathbf{A}=\mathbf{Q} \mathbf{\Lambda} \mathbf{Q}^{-1}\), then according to equation (16): \[\begin{align} \operatorname{Tr}(\mathbf{A})&=\operatorname{Tr}(\mathbf{Q} \mathbf{\Lambda} \mathbf{Q}^{-1}) \\ &=\operatorname{Tr}(\mathbf{\Lambda} \mathbf{Q}^{-1} \mathbf{Q}) \\ &=\operatorname{Tr}(\mathbf{\Lambda}) \\ &=\sum_{i} \lambda_{i} \end{align}\]
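This is straightforward to check numerically (a minimal NumPy sketch; complex eigenvalues of a real matrix come in conjugate pairs, so their imaginary parts cancel in the sum):

```python
import numpy as np

# Check (12): Tr(A) equals the sum of the eigenvalues of A.
rng = np.random.default_rng(8)
A = rng.standard_normal((5, 5))

print(np.trace(A), np.linalg.eigvals(A).sum().real)   # same value
```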


\[\begin{equation} \tag{13} \operatorname{Tr}(\mathbf{A})=\operatorname{Tr}\left(\mathbf{A}^{T}\right) \end{equation}\]

\[\begin{equation} \tag{14} \operatorname{Tr}(\mathbf{A B})=\operatorname{Tr}(\mathbf{B A}) \end{equation}\]

Now: \((\mathbf{A B})_{ij}=\sum_{k}A_{ik}B_{kj}\), and: 3 \[\begin{equation} \operatorname{tr}(A B)=\sum_{i} \sum_{k} A_{i k} B_{k i} \end{equation}\]

On the other hand, \((\mathbf{B A})_{ij}=\sum_{k}B_{ik}A_{kj}\). So: \[\begin{equation} \operatorname{tr}(B A)=\sum_{i} \sum_{k} B_{i k} A_{k i} \end{equation}\] These are the same quantity, up to renaming the summation indices \((i \leftrightarrow k)\).


\[\begin{equation} \tag{15} \operatorname{Tr}(\mathbf{A}+\mathbf{B})=\operatorname{Tr}(\mathbf{A})+\operatorname{Tr}(\mathbf{B}) \end{equation}\]

\[\begin{equation} \tag{16} \operatorname{Tr}(\mathbf{A B C})=\operatorname{Tr}(\mathbf{B C A})=\operatorname{Tr}(\mathbf{C A B}) \end{equation}\]


\[\begin{equation} \tag{17} \mathbf{a}^{T} \mathbf{a}=\operatorname{Tr}\left(\mathbf{a a}^{T}\right) \end{equation}\]

\[\begin{align} \mathbf{a a}^{T}&=\left[\begin{array}{c}{a_{1}} \\ {\vdots} \\ {a_{n}}\end{array}\right]\left[{a_{1}}, {\cdots}, {a_{n}}\right] \\ &=\left[\begin{array}{ccc}{a_{1}}^{2} & {\cdots} & {a_{1}a_{n}} \\ {\vdots} & {\ddots} & {\vdots} \\ {a_{n}a_{1}} & {\cdots} & {a_{n}}^{2}\end{array}\right] \end{align}\]

So, \[\begin{equation} \operatorname{Tr}\left(\mathbf{a a}^{T}\right) = a_{1}^{2}+\cdots+a_{n}^{2} = \mathbf{a}^{T} \mathbf{a} \end{equation}\]
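Finally, a one-line numerical check of (17) (a minimal NumPy sketch with a small example vector):

```python
import numpy as np

# Check (17): a^T a = Tr(a a^T).
a = np.array([1.0, 2.0, 3.0])
print(a @ a, np.trace(np.outer(a, a)))   # both are 14.0
```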