13 Discrepancies
13.1 Integral probability metrics
For a measurable space (\Omega, \mathcal H), let \mathcal P be the set of all probability measures on that space. We define IPMs via a set \mathcal F of measurable, real-valued functions on \Omega (the notation \mathcal F does not denote a sigma-algebra here), letting d_{\mathcal F} ( \mu, \nu) = \sup_{f \in \mathcal F} \Bigg|\int f d\mu - \int f d \nu \Bigg|
We say that \mathcal F “separates” \mathcal P if, for every pair \mu \neq \nu \in \mathcal P, there exists some f \in \mathcal F such that \int f d\mu \neq \int f d\nu.
IPMs are nice for analysis because they don’t require the existence of densities, but terrible for computation because they involve computing a supremum (hard optimisation problem).
d_\mathcal F is a metric \iff \mathcal F separates \mathcal P
Proof:
To be a metric, a function needs to be non-negative, symmetric, obey the triangle inequality, and vanish only when its arguments coincide. d_\mathcal F is clearly non-negative thanks to the non-negativity of the absolute value function, and it is also symmetric since for every f, |\int f d\mu - \int f d \nu|= |\int f d\nu - \int f d \mu|. The triangle inequality follows by inserting \pm \int f d\eta inside the absolute value and taking suprema term by term.
It remains to check positive definiteness, i.e. whether d_\mathcal F (\mu, \nu)= 0 \iff \mu = \nu, and this is exactly the separation property. If \mu = \nu then \int f d\mu - \int f d\nu = 0 for every f \in \mathcal F, so d_\mathcal F(\mu, \nu) = 0. Conversely, if \mu \neq \nu, separation guarantees the existence of an f \in \mathcal F with |\int f d\mu - \int f d \nu| > 0, and thus d_\mathcal F(\mu, \nu) > 0.

| Metric | Separating set |
|---|---|
| Total variation distance | \mathcal F_{TV} = \{f \in \mathcal H: \forall x, 0 \leq f(x) \leq 1 \} |
| 1-Wasserstein distance | \mathcal F_{W} = \{f \in \mathcal H: \sup_{x\neq y \in \Omega} \frac{|f(x) - f(y)|}{d(x,y)} \leq 1 \} |
| Bounded Lipschitz distance | \mathcal F_{BL} = \mathcal F_{W} \cap \{f \in \mathcal H: \forall x, |f(x)| \leq 1 \} |
The latter two require \Omega to be a metric space with metric d. The Lipschitz condition in the latter two guarantees continuity, and thus measurability (with respect to the Borel sigma-algebra).
13.1.1 Example
Consider \mu = \delta_0 and \mu_x = \delta_x. This is how the metrics d(\mu, \mu_x) behave as a function of x:

We can observe that:
- Total variation is a finicky perfectionist: it is maximal as soon as x \neq 0, but it doesn’t actionably tell you how to improve (zero gradient almost everywhere).
- Wasserstein and (partly) BL are sensitive to how far away the probability mass is under the two measures; TV isn’t.
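A minimal sketch that plots the three curves, assuming the closed forms d_{TV}(\delta_0, \delta_x) = \mathbb 1\{x \neq 0\}, d_W(\delta_0, \delta_x) = |x| and d_{BL}(\delta_0, \delta_x) = \min(|x|, 2), which can be checked directly against the separating sets above:

```python
import numpy as np
import matplotlib.pyplot as plt

# Closed-form values of the three IPMs between delta_0 and delta_x
# (d_BL taken as the IPM over 1-Lipschitz functions bounded by 1).
xs = np.linspace(-3, 3, 601)
d_tv = (xs != 0).astype(float)        # jumps to 1 as soon as x != 0
d_w = np.abs(xs)                      # grows linearly with the displacement
d_bl = np.minimum(np.abs(xs), 2.0)    # linear near 0, saturates at 2

plt.plot(xs, d_tv, label=r"$d_{TV}$")
plt.plot(xs, d_w, label=r"$d_{W}$")
plt.plot(xs, d_bl, label=r"$d_{BL}$")
plt.xlabel("x")
plt.ylabel(r"$d(\delta_0, \delta_x)$")
plt.legend()
plt.show()
```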
13.1.2 Empirical estimation of IPMs
13.2 \phi divergences
A commonly used family of divergences, parametrised by a convex function \phi: \mathbb R \to \mathbb R such that \phi(1) = 0: d_\phi[\nu || \mu] = \begin{cases} \int \mu(dx) \phi \left(\frac{d\nu}{d\mu}(x) \right) =\mu\left[ \phi\left(\frac{d\nu}{d\mu}\right) \right] & \nu \ll \mu \\ \infty & \text{otherwise} \end{cases}
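For finitely supported measures the Radon–Nikodym derivative is just an elementwise ratio of probability vectors, so the definition can be evaluated directly. A minimal sketch (the names phi_divergence and phi_kl are made up for illustration; \phi(t) = t \log t recovers the KL divergence):

```python
import numpy as np

def phi_divergence(phi, nu, mu):
    """d_phi[nu || mu] = sum_i mu_i * phi(nu_i / mu_i) for finitely supported measures."""
    nu, mu = np.asarray(nu, float), np.asarray(mu, float)
    if np.any((mu == 0) & (nu > 0)):   # nu not absolutely continuous w.r.t. mu
        return np.inf
    s = mu > 0
    return float(np.sum(mu[s] * phi(nu[s] / mu[s])))

def phi_kl(t):
    # phi(t) = t*log(t), convex with phi(1) = 0; recovers KL(nu || mu)
    t = np.asarray(t, float)
    return np.where(t > 0, t * np.log(np.where(t > 0, t, 1.0)), 0.0)

nu = np.array([0.2, 0.5, 0.3])
mu = np.array([0.4, 0.4, 0.2])
print(phi_divergence(phi_kl, nu, mu))   # equals KL(nu || mu)
print(np.sum(nu * np.log(nu / mu)))     # direct check
```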
13.2.1 Positive definiteness
d_\phi[\nu || \mu] \geq 0 and if \phi is strictly convex then \nu = \mu \iff d_\phi[\nu || \mu] = 0.
Proof: by Jensen's inequality, d_\phi[\nu || \mu] = \mu\left[ \phi\left(\frac{d\nu}{d\mu}\right) \right] \geq \phi\left( \mu\left[\frac{d\nu}{d\mu}\right] \right) = \phi(1) = 0. If \phi is strictly convex, equality forces \frac{d\nu}{d\mu} to be constant \mu-a.e., hence equal to 1, i.e. \nu = \mu.
13.2.2 For marginal measures
Let (E, \mathcal E) and (F, \mathcal F) be standard measurable spaces. Let \mu and \nu be probability measures on (E, \mathcal E) such that \frac{d\nu}{d\mu} exists.
Let L, K be probability kernels E \times \mathcal F \to [0,1] such that \frac{L(x, dy)}{K(x, dy)} =\frac{dL}{dK}(x, y) \in (\mathcal E \otimes \mathcal F)_+ exists, i.e. is measurable with respect to the product sigma-algebra.
d_\phi[\nu L || \mu K] \leq \int \mu(dx) \int K(x, dy) \phi\left( \frac{d\nu}{d\mu}(x) \frac{dL}{dK}(x, y) \right)
13.2.3 Convexity
d_\phi[\mu L || \mu K] \leq \int \mu(dx) d_\phi[L(x, \cdot) || K(x, \cdot) ]
A special but more interpretable case arises when we let \mu put mass \alpha on x = 0 and 1 - \alpha on x = 1, and K(x, A) = \begin{cases} p_0(A) & x = 0\\ p_1(A) & x = 1 \end{cases} \qquad L(x, A) = \begin{cases} q_0(A) & x = 0\\ q_1(A) & x = 1 \end{cases}
Then d_\phi[ \alpha q_0+ (1- \alpha ) q_1||\alpha p_0+ (1- \alpha ) p_1 ] \leq \alpha d_\phi[q_0 || p_0] + (1-\alpha) d_\phi[q_1 || p_1]
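A quick numerical check of this mixture inequality in the KL case, with made-up two-point distributions q_0, q_1, p_0, p_1 and weight \alpha:

```python
import numpy as np

def kl(q, p):
    # KL(q || p) for strictly positive discrete distributions
    q, p = np.asarray(q, float), np.asarray(p, float)
    return np.sum(q * np.log(q / p))

alpha = 0.3
q0, q1 = np.array([0.1, 0.9]), np.array([0.6, 0.4])
p0, p1 = np.array([0.5, 0.5]), np.array([0.3, 0.7])

lhs = kl(alpha * q0 + (1 - alpha) * q1, alpha * p0 + (1 - alpha) * p1)
rhs = alpha * kl(q0, p0) + (1 - alpha) * kl(q1, p1)
print(lhs <= rhs + 1e-12)   # True: mixing never increases the divergence
```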
13.2.4 Data processing inequality
d_\phi[ \nu K || \mu K] \leq d_\phi[\nu || \mu]
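On a finite space a probability kernel is a row-stochastic matrix and \nu K is a vector–matrix product, so the inequality is easy to check numerically. A minimal sketch for the KL case, with randomly drawn \nu, \mu and K:

```python
import numpy as np

def kl(nu, mu):
    # KL(nu || mu) for strictly positive discrete distributions
    nu, mu = np.asarray(nu, float), np.asarray(mu, float)
    return np.sum(nu * np.log(nu / mu))

rng = np.random.default_rng(0)
nu = rng.dirichlet(np.ones(4))
mu = rng.dirichlet(np.ones(4))

# a probability kernel on a finite space is a row-stochastic matrix;
# the marginal measure nu K is the row vector nu @ K
K = rng.dirichlet(np.ones(3), size=4)

print(kl(nu @ K, mu @ K) <= kl(nu, mu) + 1e-12)   # True
```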
13.2.5 Divergence between pushforwards
For measurable h: E \to F, since pushforward measures can be expressed as marginal measures of a kernel K(x, dy) = \delta_{h(x)}(dy), we can apply the data processing inequality:
d_\phi[\nu \circ h^{-1} || \mu \circ h^{-1}] \leq d_\phi[\nu || \mu]
13.2.6 Transformation invariance
If h is an injection with a measurable inverse, then:
d_\phi[\nu \circ h^{-1} || \mu \circ h^{-1}] = d_\phi[\nu || \mu]
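A tiny sanity check in the discrete KL case: on a finite space, an injective h with measurable inverse is just a relabelling of the atoms, and permuting both probability vectors leaves the divergence unchanged.

```python
import numpy as np

def kl(nu, mu):
    nu, mu = np.asarray(nu, float), np.asarray(mu, float)
    return np.sum(nu * np.log(nu / mu))

nu = np.array([0.1, 0.2, 0.3, 0.4])
mu = np.array([0.25, 0.25, 0.25, 0.25])

# pushing both measures forward through the same bijection of the support
# just permutes the probability vectors
perm = np.array([2, 0, 3, 1])
print(np.isclose(kl(nu[perm], mu[perm]), kl(nu, mu)))   # True
```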
13.3 Wasserstein distances
13.4 Specific divergences
13.4.1 Bounded Lipschitz distance
The Bounded Lipschitz or Dudley metric metrises weak convergence, i.e. d_{BL}(\mu_n, \mu) \to 0 \iff \mu_n \stackrel{w}{\to} \mu for a standard Borel space.
13.4.2 1-Wasserstein distance
It is both a Wasserstein distance and an integral probability metric, namely the IPM over 1-Lipschitz continuous functions.
| Representation | Formula |
|---|---|
| IPM | \sup_{f\ 1\text{-Lipschitz}} \Bigg|\int f d\mu - \int f d\nu \Bigg| |
| Earth mover’s distance | \inf_{(X,Y) \in \Gamma(\mu, \nu)} \mathbb E[d(X,Y)] |
| CDF difference (\mu, \nu \text{ on } (\mathbb R, \mathcal B_\mathbb R)) | \int_{\mathbb R} |c_\mu(x) - c_\nu(x)| \, dx |
where c_\mu and c_\nu are respective CDFs.
d_{W}(\mu_n, \mu) \to 0 \implies \mu_n \stackrel{w}{\to} \mu.
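A minimal sketch of estimating W_1 from samples, using the CDF representation (on \mathbb R, equal-size empirical measures reduce to matching sorted samples) and comparing against scipy.stats.wasserstein_distance; the Gaussian parameters below are arbitrary:

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=10_000)   # samples from mu
y = rng.normal(1.5, 2.0, size=10_000)   # samples from nu

# on the real line W1 is the area between the CDFs; for equal-size
# empirical measures this reduces to matching sorted samples
w1_sorted = np.mean(np.abs(np.sort(x) - np.sort(y)))
w1_scipy = wasserstein_distance(x, y)
print(w1_sorted, w1_scipy)   # the two estimates coincide
```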
13.4.3 Total variation distance
It is both an f divergence and an integral probability metric.
| Representation | Formula |
|---|---|
| IPM | \sup_{f \in \mathcal H: 0 \leq f \leq 1 \forall x} \Bigg|\int f d\mu - \int f d\nu \Bigg| |
| Probability difference | \sup_{A \in \mathcal H} |\mu(A) - \nu(A)| |
| \phi divergence | \frac{1}{2} \int |\frac{d\mu}{d\lambda} - \frac{d\nu}{d\lambda} | d\lambda |
| Coupling | \inf_{(X, Y) \in \Gamma(\mu, \nu)} \mathbb P (X \neq Y) |
Derivations
where \Gamma(\mu, \nu) is the set of all couplings of \mu and \nu, i.e. joint distributions of (X,Y) such that X \sim \mu and Y \sim \nu, and \lambda is any measure dominating both \mu and \nu (e.g. \lambda = \mu + \nu).
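A quick check on a finite space that the \phi-divergence (L^1) form and the probability-difference form agree; the supremum over events is attained at A = \{x: \mu(\{x\}) > \nu(\{x\})\}. The vectors p, q below are made up:

```python
import numpy as np

p = np.array([0.1, 0.2, 0.3, 0.4])
q = np.array([0.3, 0.3, 0.2, 0.2])

# phi-divergence / L1 form: half the L1 distance between the densities
tv_l1 = 0.5 * np.sum(np.abs(p - q))

# probability-difference form: the supremum over events is attained
# at the set A = {x : p(x) > q(x)}
A = p > q
tv_event = np.abs(p[A].sum() - q[A].sum())

print(np.isclose(tv_l1, tv_event))   # True
```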
Properties
- d_{TV}(\mu_n, \mu) \to 0 \implies \mu_n \stackrel{w}{\to} \mu
13.4.4 Jensen-Shannon divergence
It is symmetric and bounded above by \log 2. It is an f-divergence with f(u) = \frac{u}{2} \log u - \frac{u+1}{2} \log \frac{u+1}{2}.
D_{JS}[p || q] = \frac{1}{2} D_{KL}[p || \frac{p+q}{2}] + \frac{1}{2} D_{KL}[q || \frac{p+q}{2}]
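A minimal sketch for discrete distributions, checking symmetry and the \log 2 upper bound directly from the definition above (the vectors p, q are arbitrary):

```python
import numpy as np

def kl(p, q):
    # KL(p || q) for discrete distributions, with the convention 0*log(0) = 0
    p, q = np.asarray(p, float), np.asarray(q, float)
    s = p > 0
    return np.sum(p[s] * np.log(p[s] / q[s]))

def js(p, q):
    m = 0.5 * (np.asarray(p, float) + np.asarray(q, float))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p = np.array([0.2, 0.5, 0.3])
q = np.array([0.4, 0.1, 0.5])
print(js(p, q), js(q, p))       # equal: the divergence is symmetric
print(js(p, q) <= np.log(2))    # True: bounded above by log 2
```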
13.4.5 Relative entropy (KL divergence)
This quantity aims to measure the difference between two probability distributions. It is not however symmetric and is thus not a valid distance metric between probability distributions. It can be interpreted as the information lost when q is used to approximate p.
D_{KL}(p||q) = H_q(p) - H(p) =\mathbb E_{p} \Bigg [ \log \frac{p(\mathbf x)}{q(\mathbf x)} \Bigg] = \sum p(x) \log \frac{p(x)}{q(x)}
It is also known as relative entropy (difference of cross and self entropy).
Most explanations of KL divergence opt for a more information theoretic intuition which is, at least for me, harder to grasp. However, there also exists a rather neat statistical interpretation in terms of the likelihood that I found in Shlens (2014) and is also present in Jordan (n.d.).
13.4.5.1 Statistical interpretation of KL divergence
Say we have a discrete set of outcomes, and a candidate model q for the true underlying distribution. The probability of observing the histogram counts \mathbf c according to our model follows a multinomial distribution: L \propto \prod_{i} q_i ^{c_i}
But this shrinks multiplicatively as the number of observations grows, so we normalise it to a per-observation (geometric average) likelihood: \bar L \propto \prod_{i} q_i ^{\frac{c_i}{n}}
As we approach infinitely many measurements, \frac{c_i}{n} \rightarrow p_i by the law of large numbers and, keeping track of the multinomial coefficient hidden in the proportionality (its per-observation contribution tends to the entropy of p), D_{KL}(p \| q) = - \log \bar L
Succinctly put, the KL divergence is the asymptotic value of the negative log “average” likelihood under a model q for data actually from p.
If we just consider an unnormalised average likelihood \prod_{i} q_i ^{\frac{c_i}{n}} \rightarrow \prod_{i} q_i ^{p_i}, its negative logarithm yields the cross entropy. The two differ only by a constant, the entropy of p.
As Jordan (n.d.) says:

> Minimizing the KL divergence to the empirical distribution is equivalent to maximizing the likelihood.
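A small simulation of this interpretation (p and q below are made up): the per-observation negative log likelihood of data drawn from p, evaluated under the model q, converges to the cross entropy H_q(p) = H(p) + D_{KL}(p \| q).

```python
import numpy as np

rng = np.random.default_rng(0)
p = np.array([0.2, 0.5, 0.3])   # "true" distribution of the data
q = np.array([0.4, 0.4, 0.2])   # candidate model

n = 1_000_000
x = rng.choice(len(p), size=n, p=p)

# negative log of the geometric-average (per-observation) likelihood under q
avg_nll = -np.mean(np.log(q[x]))

cross_entropy = -np.sum(p * np.log(q))
entropy = -np.sum(p * np.log(p))
kl = np.sum(p * np.log(p / q))

print(avg_nll, cross_entropy)          # the estimate approaches H_q(p)
print(cross_entropy - entropy, kl)     # and H_q(p) - H(p) is exactly the KL
```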
13.4.5.2 Properties of KL divergence
It satisfies:
- D_{KL}[p|| q] \geq 0
- D_{KL} [p|| q] = 0 \iff p=q
- D_{KL} [p|| q] \neq D_{KL} [q || p] in general
KLD between simple MVNs
We want to show: \textcolor{purple}{ { D_{KL}\left(\mathcal{N}\left((\mu_1, \ldots, \mu_k)^\mathsf{T}, \operatorname{diag} (\sigma_1^2, \ldots, \sigma_k^2)\right) \parallel \mathcal{N}\left(\mathbf{0}, \mathbf{I}\right)\right) = {1 \over 2} \sum_{i=1}^k (\sigma_i^2 + \mu_i^2 - \ln(\sigma_i^2) - 1)}}
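Before deriving this, a quick Monte Carlo sanity check of the claimed identity (the dimension k and the parameters below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
k = 4
mu = rng.normal(size=k)
sigma2 = rng.uniform(0.5, 2.0, size=k)

# closed form we want to derive
kl_closed = 0.5 * np.sum(sigma2 + mu**2 - np.log(sigma2) - 1.0)

# Monte Carlo estimate of E_p[log p(x) - log q(x)] with x ~ p
x = mu + np.sqrt(sigma2) * rng.normal(size=(1_000_000, k))
log_p = -0.5 * np.sum(np.log(2 * np.pi * sigma2) + (x - mu) ** 2 / sigma2, axis=1)
log_q = -0.5 * np.sum(np.log(2 * np.pi) + x**2, axis=1)

print(kl_closed, np.mean(log_p - log_q))   # the two agree up to MC error
```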
Since \Sigma = \operatorname{diag}(\sigma^2_1, \dots, \sigma^2_k), |\Sigma| = \prod_{i=1}^{k} \sigma^2_i and \Sigma^{-1} = \operatorname{diag}\Big(\frac{1}{\sigma^2_1},\dots, \frac{1}{\sigma^2_k}\Big)
\begin{align*} p( x) &= \frac{1}{\sqrt{(2 \pi)^k\Big(\prod_{i=1}^{k} \sigma^2_i\Big) }} \exp \left( -\frac{1}{2} \sum_{i=1}^{k} \frac{( {x_i- \mu_i})^2}{\sigma^2_i} \right)\\ \log p( x) &= -\frac{1}{2} \left( k\log{({2 \pi) +\sum_{i=1}^{k} \log \sigma^2_i }} + \sum_{i=1}^{k} \frac{( {x_i- \mu_i})^2}{\sigma^2_i} \right) \end{align*}
\begin{align*} q( x) &= \frac{1}{\sqrt{(2 \pi)^k }} \exp \left( -\frac{1}{2} \sum_{i=1}^{k} { {x_i}^2} \right)\\ \log q( x) &= -\frac{1}{2}\left( k\log (2 \pi) + \sum_{i=1}^{k} { {x_i}^2} \right)\\ \end{align*}
\begin{align*} D_{KL}(p || q) &= \int p(x) \log \frac{p(x)}{q(x)} dx \\ &= \int p(x) \left (-\frac{1}{2} \left( k\log{({2 \pi) +\sum_{i=1}^{k} \log \sigma^2_i }} + \sum_{i=1}^{k} \frac{( {x_i- \mu_i})^2}{\sigma^2_i} \right) + \frac{1}{2}\left(k \log (2 \pi) + \sum_{i=1}^{k} { {x_i}^2} \right) \right) dx \\ &= \frac{1}{2}\textcolor{blue}{ \int p(x) \left ( \sum_{i=1}^{k} { {x_i}^2}\right) dx} -\frac{1}{2} \textcolor{red}{ \int p(x) \left(\sum_{i=1}^{k} \frac{( {x_i- \mu_i})^2}{\sigma^2_i} \right) dx} - \frac{1}{2} \sum_{i=1}^{k} \log \sigma^2_i \\ \end{align*}
Let’s deal with the two integration terms one by one.
Observe that p(x) can be fully factorised into k densities:
\begin{align*} p( x) &= \frac{1}{\sqrt{(2 \pi)^k \prod_{i=1}^{k} \sigma^2_i }} \exp \left( -\frac{1}{2} \sum_{i=1}^{k} \frac{( {x_i- \mu_i})^2}{\sigma^2_i} \right)\\ p(x_1, x_2 ,\dots, x_k)&= \prod _{i=1}^{k} \left [\frac{1}{\sqrt{2 \pi \sigma^2_i}} \exp\left( -\frac{(x_i-\mu_i)^2}{2 \sigma_i^2} \right) \right]\\ &= \prod_{i=1}^{k} p_i(x_i) \end{align*} \begin{align*} \int \int \dots \int f(x_j) p(x_1, x_2 ,\dots, x_k) \, dx_1dx_2 \dots dx_k &= \int \int \dots \int f(x_j) \prod_{i=1}^{k} (p_i(x_i) dx_i )\\ &= \prod_{i \neq j} \left(\int p_i(x_i) dx_i \right) \cdot \int f(x_j) p_j(x_j) dx_j\\ &= \int f(x_j) p_j(x_j) dx_j \end{align*}
Now, we can deal with the blue term easily: Recall that \mathbb E [\mathsf x^2] = \sigma^2 + \mu^2 for \mathsf x \sim \mathcal N(\mu, \sigma^2)
\begin{align*} \textcolor{blue}{\int p(\mathbf x) \left ( \sum_{i=1}^{k} { {x_i}^2}\right) d \mathbf x} &= \int \int \dots \int p(x_1, x_2 , \dots, x_k) \left ( \sum_{i=1}^{k} { {x_i}^2}\right) d x_1 dx_2 \dots dx_k\\ &= \sum_{i=1}^{k} \int \int \dots \int { p(x_1, x_2 , \dots, x_k) {x_i}^2} d x_1 dx_2 \dots dx_k \\ &= \sum_{i=1}^{k} \int p_i(x_i) \cdot { {x_i}^2} d x_i \\ &= \textcolor{blue}{\sum_{i=1}^{k} \left( \sigma^2_i + \mu^2_i \right)}\\ \end{align*} Now note that \begin{align*} {\int \frac{(x-\mu)^2}{\sigma^2} p(x) dx} &= \frac{1}{\sigma^2} \mathbb E [(\mathsf x-\mu)^2] \\ &=\frac{1}{\sigma^2} \operatorname{Var}[\mathsf x]\\ &={1} \end{align*}
Thus, in the red term: \begin{align*} \textcolor{red}{\int p(\mathbf x) \left ( \sum_{i=1}^{k} { \frac{(x_i - \mu_i)^2}{\sigma^2_i}}\right) d \mathbf x} &= \int \int \dots \int p(x_1, x_2 , \dots, x_k) \left ( \sum_{i =1}^{k} { \frac{(x_i - \mu_i)^2}{\sigma^2_i}}\right) d x_1 dx_2 \dots dx_k\\ &= \sum_{i=1}^{k} \int p_i(x_i) \cdot { \frac{(x_i - \mu_i)^2}{\sigma^2_i}} d x_i \\ &= \textcolor{red}{\sum_{i=1}^{k} 1}\\ \end{align*}
Putting the two together: \begin{align*} D_{KL}(p || q) &= \frac{1}{2} \sum_{i=1}^{k}\left( \mu_i^2 + \sigma^2_i - \log \sigma^2_i - 1 \right) \end{align*}

13.4.6 Fisher divergence
For two distributions with densities p and q,
D_F[p || q] = \mathbb E_{\mathbf x \sim p} \left[ \| \nabla_\mathbf x \log p(\mathbf x) - \nabla_\mathbf x \log q(\mathbf x) \|^2_2 \right]
It measures the discrepancy between the score functions, which are “vector fields pointing towards regions of higher probability”. It is invariant to normalising constants, which is quite neat.
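A minimal sketch for two univariate Gaussians, whose scores are available in closed form (\nabla_x \log \mathcal N(x; \mu, \sigma^2) = -(x-\mu)/\sigma^2); the parameters are arbitrary and the expectation is estimated by Monte Carlo:

```python
import numpy as np

rng = np.random.default_rng(0)

# two univariate Gaussians; the score of N(mu, s2) is -(x - mu) / s2
mu_p, s2_p = 0.0, 1.0
mu_q, s2_q = 1.0, 2.0

score_p = lambda x: -(x - mu_p) / s2_p
score_q = lambda x: -(x - mu_q) / s2_q

# Monte Carlo estimate of E_{x ~ p}[(score_p(x) - score_q(x))^2];
# multiplying q by a constant would leave score_q, and hence D_F, unchanged
x = rng.normal(mu_p, np.sqrt(s2_p), size=1_000_000)
print(np.mean((score_p(x) - score_q(x)) ** 2))
```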
13.5 Relations between discrepancies
13.5.1 d_{BL} \leq \min \{ d_W, 2 d_{TV} \}
Proof
First, notice that \mathcal F_{BL} \subset \mathcal F_{W}, hence the supremum over elements of \mathcal F_{BL} is dominated by that over \mathcal F_{W} and thus d_{BL} \leq d_W.
For the second part, for each f \in \mathcal F_{BL}, define g_f = \frac{1+f}{2}. This is measurable since f is measurable, and 0 \leq g_f \leq 1, so g_f \in \mathcal F_{TV}.
d_{BL}(\mu, \nu) = \sup_{f \in \mathcal F_{BL}} \Bigg|\int f d\mu - \int fd\nu \Bigg| = \sup_{f \in \mathcal F_{BL}} 2\Bigg|\int g_f d\mu - \int g_f d\nu \Bigg| \leq \sup_{g \in \mathcal F_{TV}} 2\Bigg|\int g d\mu - \int g d\nu \Bigg| = 2d_{TV}
Hence:
- d_{TV}(\mu_n, \nu_n) \to 0 \implies d_{BL}(\mu_n, \nu_n) \to 0
- d_{W}(\mu_n, \nu_n) \to 0 \implies d_{BL}(\mu_n, \nu_n) \to 0
Nielsen (n.d.)