10 Annealed Importance Sampling

Published

September 24, 2025

Neal (1998)

10.1 AIS Algorithm

We have unnormalised densities \(f_0, f_1, \dots f_n\) where \(f_0\) is our target and \(f_n\) is our prior.

We have a set of \(n-1\) Markov transition kernels \(T_{n-1} \dots T_1\) that have \(F_{n-1}, \dots F_1\) as their invariant distributions respectively (for example Metropolis-Hastings updates).

To generate a single point \(x^{(i)}\):

Generate \(x_{n-1}^{(i)} \sim F_{n}\)
Generate \(x_{n-2}^{(i)} \sim T_{n-1}(x_{n-1}^{(i)} \rightarrow x_{n-2})\)…

Generate \(x_0^{(i)} \sim T_1(x_1^{(i)} \rightarrow x_0)\)

Assign \(x^{(i)} = x_0^{(i)}\).

We’ve sampled this single point through a complicated Markov Chain that made use of kernels corresponding to the intermediate annealed distributions between our prior \(f_n\) and posterior \(f_0\).

The importance weight assigned to \(x^{(i)}\) is: \[ w^{(i)} = \frac{f_{n-1}\Big(x_{n-1}^{(i)}\Big)}{f_{n}\Big(x_{n-1}^{(i)}\Big)} \cdot \frac{f_{n-2}\Big(x_{n-2}^{(i)}\Big)}{f_{n-1}\Big(x_{n-2}^{(i)}\Big)} \dots \frac{f_{1}\Big(x_{1}^{(i)}\Big)}{f_{2}\Big(x_{1}^{(i)}\Big)} \cdot \frac{f_{0}\Big(x_{0}^{(i)}\Big)}{f_{1}\Big(x_{0}^{(i)}\Big)}\]

10.2 Derivation of AIS weights

Let’s take the simple case where we have the prior \(F_2\), the posterior \(F_0\) and one intermediate annealing distribution \(F_1\). The transition kernel corresponding to \(F_1\) is \(T_1\).

To generate \(x^{(i)}\):

Sample \(y^{(i)} \sim F_2\).
Sample \(x^{(i)} \sim T_1( y^{(i)} \rightarrow x)\).

Consider the extended state space with points \((x, y)\).

We can define an unnormalised target density for the entire state space as \[ f(x, y) = f_0(x) \cdot \tilde T_1(x, y)\] This gives us the probability that we first sampled \(x\) from our target \(F_0\) and then sampled \(y\) conditional on the value of \(x\).

Here \(\tilde T_1\) is the reversed transition kernel density: \[ \tilde T_1(x, y) = T_1(y, x) \cdot \frac{f_1(y)}{f_1(x)} \]

Becasuse \(F_1\) is the stationary distribution of \(T_1\), \[\int f_1(y) \cdot T_1(y,x) dy = f_1(x) \]

This ensures our \(\tilde T_1\) is a valid transition kernel: \[ \int \tilde T_1(x, y) dy = \int T_1(y, x) \cdot \frac{f_1(y)}{f_1(x)} dy = 1\]

Similarly when we marginalise our joint density \(f(x,y)\) we just get \(f_0(x)\) exactly as desired.

Now that we have our target joint density and our actual joint density, we can reweigh the samples by the ratio of those two:

\[\begin{align*} w^{(i)} &= \frac{f(x^{(i)}, y^{(i)})}{f_2(y^{(i)}) \cdot T_1(y^{(i)}, x^{(i)})} \\ &= \frac{f_0(x^{(i)}) \tilde T_1(x^{(i)}, y^{(i)})}{f_2(y^{(i)}) \cdot T_1(y^{(i)}, x^{(i)})} \\ &= \frac{f_0(x^{(i)}) \cdot f_1(y^{(i)}) \cdot T_1(y^{(i)}, x^{(i)})}{f_1(x^{(i)})} \cdot \frac{1}{f_2(y^{(i)}) \cdot T_1(y^{(i)}, x^{(i)})}\\ &= \frac{f_0(x^{(i)})}{f_1(x^{(i)})} \cdot \frac{f_1(y^{(i)})}{f_2(y^{(i)})} \end{align*}\]

Neal, Radford M. 1998. “Annealed Importance Sampling.” https://arxiv.org/abs/physics/9803008.