10 Annealed Importance Sampling
Neal (1998)
10.1 AIS Algorithm
We have unnormalised densities \(f_0, f_1, \dots f_n\) where \(f_0\) is our target and \(f_n\) is our prior.
We have a set of \(n-1\) Markov transition kernels \(T_{n-1} \dots T_1\) that have \(F_{n-1}, \dots F_1\) as their invariant distributions respectively (for example Metropolis-Hastings updates).
To generate a single point \(x^{(i)}\):
- Generate \(x_{n-1}^{(i)} \sim F_{n}\)
- Generate \(x_{n-2}^{(i)} \sim T_{n-1}(x_{n-1}^{(i)} \rightarrow x_{n-2})\)…
- Generate \(x_0^{(i)} \sim T_1(x_1^{(i)} \rightarrow x_0)\)
Assign \(x^{(i)} = x_0^{(i)}\).
We’ve sampled this single point through a complicated Markov Chain that made use of kernels corresponding to the intermediate annealed distributions between our prior \(f_n\) and posterior \(f_0\).
The importance weight assigned to \(x^{(i)}\) is: \[ w^{(i)} = \frac{f_{n-1}\Big(x_{n-1}^{(i)}\Big)}{f_{n}\Big(x_{n-1}^{(i)}\Big)} \cdot \frac{f_{n-2}\Big(x_{n-2}^{(i)}\Big)}{f_{n-1}\Big(x_{n-2}^{(i)}\Big)} \dots \frac{f_{1}\Big(x_{1}^{(i)}\Big)}{f_{2}\Big(x_{1}^{(i)}\Big)} \cdot \frac{f_{0}\Big(x_{0}^{(i)}\Big)}{f_{1}\Big(x_{0}^{(i)}\Big)}\]
10.2 Derivation of AIS weights
Let’s take the simple case where we have the prior \(F_2\), the posterior \(F_0\) and one intermediate annealing distribution \(F_1\). The transition kernel corresponding to \(F_1\) is \(T_1\).
To generate \(x^{(i)}\):
- Sample \(y^{(i)} \sim F_2\).
- Sample \(x^{(i)} \sim T_1( y^{(i)} \rightarrow x)\).
Consider the extended state space with points \((x, y)\).
We can define an unnormalised target density for the entire state space as \[ f(x, y) = f_0(x) \cdot \tilde T_1(x, y)\] This gives us the probability that we first sampled \(x\) from our target \(F_0\) and then sampled \(y\) conditional on the value of \(x\).
Here \(\tilde T_1\) is the reversed transition kernel density: \[ \tilde T_1(x, y) = T_1(y, x) \cdot \frac{f_1(y)}{f_1(x)} \]
Becasuse \(F_1\) is the stationary distribution of \(T_1\), \[\int f_1(y) \cdot T_1(y,x) dy = f_1(x) \]
This ensures our \(\tilde T_1\) is a valid transition kernel: \[ \int \tilde T_1(x, y) dy = \int T_1(y, x) \cdot \frac{f_1(y)}{f_1(x)} dy = 1\]
Similarly when we marginalise our joint density \(f(x,y)\) we just get \(f_0(x)\) exactly as desired.
Now that we have our target joint density and our actual joint density, we can reweigh the samples by the ratio of those two:
\[\begin{align*} w^{(i)} &= \frac{f(x^{(i)}, y^{(i)})}{f_2(y^{(i)}) \cdot T_1(y^{(i)}, x^{(i)})} \\ &= \frac{f_0(x^{(i)}) \tilde T_1(x^{(i)}, y^{(i)})}{f_2(y^{(i)}) \cdot T_1(y^{(i)}, x^{(i)})} \\ &= \frac{f_0(x^{(i)}) \cdot f_1(y^{(i)}) \cdot T_1(y^{(i)}, x^{(i)})}{f_1(x^{(i)})} \cdot \frac{1}{f_2(y^{(i)}) \cdot T_1(y^{(i)}, x^{(i)})}\\ &= \frac{f_0(x^{(i)})}{f_1(x^{(i)})} \cdot \frac{f_1(y^{(i)})}{f_2(y^{(i)})} \end{align*}\]