17  Computational Lebesgue integration

Published

November 3, 2025

17.1 Computational Lebesgue integration aka Monte Carlo methods

One of the most elegant formulations of Monte Carlo methods I’ve come across so far is the idea that they represent the translation of Lebesgue integration into practice.

I think a great way of extending this analogy is to characterise quadrature as computational Riemann integration and juxtapose the two. To compute the Riemann integral of a function, you divide its domain into a grid of infinitesimal intervals/squares/hypercubes and then multiply the “width” by the height (function value) to get the area under the curve. There are two big problems with this:

  1. How do you even make a grid on domains that aren’t Euclidean \mathbb R^d? For example, a space of graphs or trees or a manifold?
  2. Even for Euclidean spaces, the number of intervals in your grid required to achieve a fixed estimation error grows exponentially with the dimensionality d (the error is \mathcal O(N^{-1/d}) for N grid points). This is known as the curse of dimensionality.

Replacing this notion of interval widths with a more general idea of measure is what allows Monte Carlo methods to address both these problems (to some extent).

17.2 Integral estimation

17.2.1 Goal

We want to compute the integral of some integrand b with respect to the Lebesgue measure: \lambda [b] =\int b d\lambda

We can rewrite this with respect to a probability measure P (let’s call this the surrogate distribution): \int b(x) \frac{d\lambda}{dP} dP =\int \frac{b(x)}{\tilde p(x)} dP =\int \phi dP

where \tilde p = \frac{dP}{d\lambda} is the Radon-Nikodym derivative of P wrt the Lebesgue measure (assumed strictly positive wherever b \neq 0) and \phi(x) = \frac{b(x)}{\tilde p(x)}.

17.2.2 Algorithm

  1. Sample IID X_1, \dots, X_N \sim P.
  2. Transform the samples to get \phi(X_1), \dots, \phi(X_N).
  3. Compute the estimator \hat P_N [\phi] := \frac{1}{N} \sum \phi(X_i) (a minimal sketch in code follows below).
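
As a concrete (if toy) illustration, here is a minimal sketch of the three steps in Python. The integrand, surrogate and names are my own choices, not anything standard: the integrand is b(x) = e^{-x^2/2}, whose Lebesgue integral is \sqrt{2\pi}, and the surrogate is a heavier-tailed Gaussian.

```python
import numpy as np

rng = np.random.default_rng(0)

# Integrand b(x) = exp(-x^2 / 2); its Lebesgue integral is sqrt(2 * pi) ~ 2.5066.
def b(x):
    return np.exp(-x**2 / 2)

# Surrogate P = N(0, sigma^2) with sigma = 2, heavier-tailed than the integrand,
# so phi = b / p_tilde is bounded.
sigma = 2.0
def p_tilde(x):
    return np.exp(-x**2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))

N = 100_000
X = rng.normal(0.0, sigma, size=N)     # 1. sample IID from P
phi = b(X) / p_tilde(X)                # 2. transform the samples
print(phi.mean(), np.sqrt(2 * np.pi))  # 3. the estimator, vs the true value
```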

17.2.3 Justification

We know by the law of large numbers that if we have a sequence of IID random variables Y_i \stackrel{d}{=} Y with \mathbb E[|Y|] < \infty, then \frac{1}{N} \sum Y_i \xrightarrow{a.s.} \mathbb E[Y]

Now consider Y_i = \phi(X_i) above (assuming \phi is measurable). Since X_i’s are IID, Y_i’s are also independent and identically distributed according to the pushforward probability measure P \circ \phi^{-1}.

So we can conclude: \frac{1}{N} \sum \phi(X_i) \xrightarrow{a.s.} \mathbb E[Y] = \int y \cdot d(P \circ \phi^{-1})(y)

This is simply an integral with respect to a pushforward measure. For any measurable function f, the change-of-variables formula gives \int f(y) \cdot d(P \circ \phi^{-1})(y) = \int f(\phi(x)) \cdot dP(x) and thus \int y \cdot d(P \circ \phi^{-1})(y) = \int \phi(x) \cdot dP(x)

This chain of reasoning is often omitted in practice (thankfully for novices like my previous self). We tend to define the expectation of a transformed random variable directly as the integral of the transformation wrt the base probability measure, \mathbb E [Y] = \mathbb E[ \phi(X)] := \int \phi(x) dP(x) (rather than the integral of the identity function wrt the transformed probability measure). This is the so-called law of the unconscious statistician.

Thus, we get \hat P_N [\phi] \xrightarrow{a.s.} \int \phi \cdot dP = \int b \cdot d\lambda

17.3 Expectation estimation

I don’t quite know what ‘principled’ means here, but Michael Betancourt puts it this way:

“Probability theory teaches us that the only well-posed way to extract information from a probability distribution is through expectation values.”

One of the main practical uses of high-dimensional integration is in statistics, where we want to compute the expectation of a test function a(\cdot) of a random variable X \sim Q with probability density \tilde q. Here, our integrand can be factorised into the test function times the probability density, a(x) \cdot \tilde q(x): Q[a] = \int a(x) \tilde q(x) dx = \mathbb E [a(X)]

Note that here we assume we can compute the density exactly.

17.3.1 Self expectation estimation aka simple Monte Carlo

The most intuitive and simplest (not the only, or even the best) thing to do here is to treat Q itself as our surrogate distribution.

Then \phi(X) = \frac{b(X)}{\tilde p(X)} = \frac{a(X)\tilde q(X)}{\tilde q(X)} = a(X)

So if we draw N i.i.d. samples X_1, \dots, X_N \sim Q: \frac{1}{N} \sum_{i=1}^{N} a(X_i) \xrightarrow{a.s.} \mathbb E[a(X)]
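
A minimal sketch of simple Monte Carlo in Python, on a toy example of my own choosing: estimating \mathbb E[X^2] = 1 for X \sim \mathcal N(0, 1).

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=100_000)   # IID samples from Q = N(0, 1)
a = lambda x: x**2             # test function with E[a(X)] = 1
print(a(X).mean())             # simple Monte Carlo estimate, close to 1
```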

17.3.2 Cross expectation estimation aka importance sampling

What if sampling from Q were hard? Recall the general integral estimation setting: there’s nothing special about Q except that it cancels out a part of our integrand. We could just as easily use samples from another distribution instead of Q.

Say we have i.i.d. samples Y_1, \dots Y_N from a different surrogate distribution M with density \tilde m(\cdot). We can estimate an expectation with respect to Q by a weighted average of transformed samples:

\begin{align*} \frac{1}{N} \sum_{n=1}^{N} \frac{\tilde q(Y_n)}{\tilde m(Y_n)} \cdot a(Y_n) &\xrightarrow{a.s.} \mathbb E \Bigg[ \frac{\tilde q(Y)}{\tilde m(Y)} \cdot a(Y) \Bigg] \\ &= \int \frac{\tilde q(y)}{\tilde m(y)} \cdot a(y) \cdot \tilde m(y) dy \\ &= \int \tilde q(y) \cdot a(y) dy \\ &= \mathbb E \Big[a(X)\Big] \end{align*}

This is an unbiased estimator (assuming \tilde m > 0 wherever \tilde q \cdot a \neq 0).
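
A minimal sketch in Python, with toy choices of my own (target Q = \mathcal N(3, 1), surrogate M = \mathcal N(0, 3^2), test function a(x) = x), pretending that Q is hard to sample from:

```python
import numpy as np

rng = np.random.default_rng(0)

def q_tilde(x):    # normalised density of the target Q = N(3, 1)
    return np.exp(-(x - 3)**2 / 2) / np.sqrt(2 * np.pi)

def m_tilde(x):    # normalised density of the surrogate M = N(0, 3^2)
    return np.exp(-x**2 / 18) / (3 * np.sqrt(2 * np.pi))

a = lambda x: x    # test function, so E_Q[a(X)] = 3

N = 100_000
Y = rng.normal(0.0, 3.0, size=N)    # IID samples from M, not from Q
w = q_tilde(Y) / m_tilde(Y)         # importance weights
print(np.mean(w * a(Y)))            # close to 3
```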

17.3.3 Normaliser estimation

Given any non-negative function b that is integrable (i.e. \lambda[b] = \int b \cdot d\lambda is finite), thanks to the Radon-Nikodym theorem we can define a unique probability measure, say P_b, corresponding to it, via the probability density \tilde b = \frac{b}{\lambda[b]}.

Sometimes, we want to find the normalisation constant (i.e. the integral) of some such function, often termed an unnormalised density. For example, marginal likelihood computation in Bayesian statistics.

Even though this looks like an expectation estimation problem with respect to our target (since b = \tilde b \times \lambda[b]), here \lambda[b] is an unknown constant, not a function we can evaluate at the target samples. Thus, we cannot actually use samples from P_b itself unless we already know \lambda[b], in which case we don’t need to estimate it any more.

However, we can phrase it as an expectation of the “weight function” w(x) = \frac{b(x)}{\tilde m(x)} with respect to any other distribution M with known probability density \tilde m, since b(x) = \tilde m(x) \times w(x) = \tilde m(x) \cdot \frac{b(x)}{\tilde m(x)}

Then we can easily use self expectation estimation (simple Monte Carlo) using IID samples Y_1, \dots, Y_N \sim M. This is commonly called the importance sampling estimator of the normalising constant:

\begin{align*} \int b(x) dx = \int \frac{b(x)}{\tilde m(x)} \tilde m(x) dx\\ \frac{1}{N} \sum_{i} \frac{b(Y_i)}{\tilde m(Y_i)} \xrightarrow{a.s.} \int b(x) dx \end{align*}
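
A minimal sketch in Python, with a toy unnormalised density of my own choosing, b(x) = e^{-|x|} (so \lambda[b] = 2), and a standard Cauchy surrogate whose normalised density we know:

```python
import numpy as np

rng = np.random.default_rng(0)

b = lambda x: np.exp(-np.abs(x))              # unnormalised density, lambda[b] = 2
m_tilde = lambda x: 1 / (np.pi * (1 + x**2))  # normalised standard Cauchy density

N = 100_000
Y = rng.standard_cauchy(size=N)    # IID samples from the surrogate M
w = b(Y) / m_tilde(Y)              # weight function w = b / m_tilde
print(w.mean())                    # close to lambda[b] = 2
```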

17.4 Integral ratio estimation

In most practical scenarios, instead of a single integral, we want to evaluate the ratio of two integrals: \frac{\lambda[a]}{\lambda[b]} = \frac{\int a \cdot d\lambda}{\int b \cdot d \lambda}

17.4.1 Duality of expectations and normalisers

This was probably a fact known since antiquity but got abstracted away; the earliest explicit statement I can find is in Rainforth et al. (2020).

The expectation of a positive integrable function \varphi (general integrable functions can be handled by splitting \varphi = \varphi^+ - \varphi^-) with respect to a distribution Q, whose density is only known in an unnormalised form q, is the ratio of two integrals:

\mathbb E_{X \sim Q}[\varphi (X)] = \frac{\int q(x) \varphi(x) dx }{\int q(x) dx} = \frac{\lambda[q\cdot \varphi]}{\lambda[q]}

This is the ratio of normalising constants of the “tilted” density, q \cdot \varphi and the base density, q. Thus, we can apply the intuition, theory and methods developed in statistical physics and computational statistics for estimating normalising constants (or partition functions) to this problem.

Say we have a positive function \phi and we want to compute \mathbb E [\phi(Y)] where Y \sim Q, and the normalised probability density of Q is intractable: we only know it in an unnormalised form q.

\mathbb E [\phi (Y)] = \int \phi(y) \tilde q(y) dy = \int \phi(y) \frac{ q(y)}{\lambda[q]} dy = \frac{\lambda[\phi \cdot q]}{\lambda[q]}

The two commonly used ways to compute this are:

  1. Sample Y_i from Q and take the mean of \phi(Y_i).
  2. Sample from some other distribution and use ratio Monte Carlo (e.g. self-normalised importance sampling).

However, note that this is also the ratio of normalising constants of the unnormalised densities \phi \cdot q and q. This opens up avenues for applying techniques developed originally for normaliser ratio estimation.

We can extend this to any integrable function (not just positive) by splitting it as \phi = \phi^+ - \phi^- where \phi^+(x) = \max (0, \phi(x)) and \phi^-(x) = \max (0, -\phi(x))

The other side of this equivalence, that normaliser ratios are expectations, is quite well-known as the “importance sampling” estimator that estimates the expectation of the “weight” function. For unnormalised densities p and p_0, we have X_1 ,\dots, X_n \sim P_0:

\frac{\lambda[p]}{\lambda[p_0]}= \frac{\int p(x) dx}{\lambda[p_0]} = \int \frac{p(x)}{p_0(x)} \frac{p_0(x)}{\lambda[p_0]} dx = \mathbb E_{X \sim P_0}\left[\frac{p(X)}{p_0(X)}\right] =\mathbb E_{X \sim P_0}\left[w(X)\right] \approx \frac{1}{n}\sum_{i=1}^n w(X_i)

17.5 Ratio Monte Carlo

The most straightforward approach is to separately estimate the two integrals and take the ratio.

We could use samples Y_1, \dots, Y_N from some surrogate distribution M with probability density \tilde m:

\begin{align*} \frac{1}{N} \sum \frac{a(Y_i)}{\tilde m(Y_i)} \xrightarrow{a.s.} \lambda[a]\\ \frac{1}{N} \sum \frac{b(Y_i)}{\tilde m(Y_i)} \xrightarrow{a.s.} \lambda[b]\\ \end{align*}

By the continuous mapping theorem, \frac{\sum \frac{a(Y_i)}{\tilde m(Y_i)}}{\sum \frac{b(Y_i)}{\tilde m(Y_i)}} = \frac{\frac{1}{N} \sum \frac{a(Y_i)}{\tilde m(Y_i)}}{\frac{1}{N} \sum \frac{b(Y_i)}{\tilde m(Y_i)}} \xrightarrow{a.s.} \frac{\lambda[a]}{\lambda[b]}

Observe that even if we had only the “unnormalised density” m corresponding to M, this would still work.

\begin{align*} \frac{1}{N} \sum \frac{a(Y_i)}{m(Y_i)} &\xrightarrow{a.s.} \int \frac{a(x)}{m(x)} \cdot \tilde m(x) dx = \int \frac{a(x)}{m(x)} \cdot \frac{m(x)}{\lambda[m]} dx = \frac{1}{\lambda[m]} \int a(x) dx \\ \frac{1}{N} \sum \frac{b(Y_i)}{m(Y_i)} &\xrightarrow{a.s.} \frac{1}{\lambda[m]} \int b(x) dx \\ \frac{\sum \frac{a(Y_i)}{m(Y_i)}}{\sum \frac{b(Y_i)}{m(Y_i)}} &\xrightarrow{a.s.} \frac{\lambda[a]}{\lambda[b]} \end{align*}

17.5.1 General cross expectation estimation aka self normalised importance sampling

The most common application of this method is estimating expectations with respect to a distribution where we can’t compute the normalised density. Say we have a test function \phi and we want to find \mathbb E[\phi(Y)] where Y \sim Q. However, we can only compute some function (“unnormalised density”) q which is proportional to the probability density \tilde q. The two are related as q = \tilde q \cdot \lambda[q].

Then we can reduce our problem to the above case by putting in a(x) := \phi(x)\cdot q(x) and b(x) := q(x). Observe that the latter corresponds to normaliser estimation (if we use a surrogate with known normalised probability density) or normaliser ratio estimation otherwise.
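
A minimal sketch of self normalised importance sampling in Python, with my own toy choices: unnormalised target q(x) = e^{-(x-2)^2/2} (i.e. Q = \mathcal N(2, 1)), test function \phi(x) = x, and surrogate M = \mathcal N(0, 3^2):

```python
import numpy as np

rng = np.random.default_rng(0)

q = lambda x: np.exp(-(x - 2)**2 / 2)   # unnormalised density of Q = N(2, 1)
phi = lambda x: x                       # test function, so E_Q[phi(Y)] = 2

def m_tilde(x):                         # normalised density of M = N(0, 3^2)
    return np.exp(-x**2 / 18) / (3 * np.sqrt(2 * np.pi))

N = 100_000
Y = rng.normal(0.0, 3.0, size=N)        # IID samples from M
w = q(Y) / m_tilde(Y)                   # fine even though q is unnormalised

# a := phi * q and b := q; the unknown normaliser of q cancels in the ratio.
print(np.sum(w * phi(Y)) / np.sum(w))   # close to 2
```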

17.5.2 Normaliser ratio estimation aka IS normalising constant estimation

Sometimes, our setup \frac{\lambda[a]}{\lambda[b]} is such that it is possible to sample from the distribution P_b corresponding to b (if b is non-negative). In such cases, we only need to estimate one integral.

\frac{1}{N} \sum \frac{a(Y_i)}{b(Y_i)} \xrightarrow{a.s.} \int \frac{a(y)}{b(y)} \tilde b(y) dy = \int \frac{a(y)}{b(y)} \frac{b(y)}{\lambda[b]} dy = \frac{\lambda[a]}{\lambda[b]}

This is used most often to compute the ratio of normalising constants when a and b are unnormalised densities.

If instead we use samples Z_1, \dots Z_N from P_a, we get the harmonic mean estimator of Newton and Raftery.

\begin{align*} \frac{1}{N} \sum \frac{b(Z_i)}{a(Z_i)} \xrightarrow{a.s.} \int \frac{b(z)}{a(z)} \tilde a(z) dz = \int \frac{b(z)}{a(z)} \frac{a(z)}{\lambda[a]} dz = \frac{\lambda[b]}{\lambda[a]}\\ \frac{N}{\sum \frac{b(Z_i)}{a(Z_i)}} \xrightarrow{a.s.} \frac{\lambda[a]}{\lambda[b]} \end{align*}

This has notoriously high (often infinite) variance and substantial bias for finite sample sizes.
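
A minimal sketch contrasting the two estimators in Python, on a toy pair of unnormalised Gaussian densities of my own choosing (picked so that the true ratio is known and the harmonic mean weights happen to have infinite variance):

```python
import numpy as np

rng = np.random.default_rng(0)

a = lambda x: np.exp(-x**2 / 2)    # lambda[a] = sqrt(2*pi),   P_a = N(0, 1)
b = lambda x: np.exp(-x**2 / 18)   # lambda[b] = 3*sqrt(2*pi), P_b = N(0, 3^2)
# True ratio lambda[a] / lambda[b] = 1/3.

N = 100_000

# Normaliser ratio estimation: samples from the wider P_b, so the weights a/b
# are bounded and the estimator is well behaved.
Y = rng.normal(0.0, 3.0, size=N)
print(np.mean(a(Y) / b(Y)))        # close to 1/3

# Harmonic mean estimator: samples from the narrower P_a; the weights b/a have
# infinite variance here, so the estimate is unreliable at any finite N.
Z = rng.normal(0.0, 1.0, size=N)
print(N / np.sum(b(Z) / a(Z)))     # nominally -> 1/3, but very noisy
```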

17.5.3 Curse of dimensionality strikes back

Does Monte Carlo integration actually beat the curse of dimensionality, as is sometimes believed? The Monte Carlo error scales as \mathcal O(N^{-1/2}), which is independent of the dimensionality d, right?

It actually does not. In high dimensions, Monte Carlo/importance sampling estimators can have enormous variance.

As Chatterjee and Diaconis (2017) show, the number of samples required for accurate estimation rises exponentially with the KL divergence between the target and the surrogate, \mathcal O(\exp D_{KL}(q || m)). For a back-of-envelope calculation, if we consider the case where both are multivariate normal, the KL divergence grows roughly linearly with dimension (see Section 13.4.5.2). This means that here too, the number of samples required is \mathcal O(\exp d).
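
A quick numerical sanity check (my own toy setup, not taken from Chatterjee and Diaconis): importance sampling with surrogate \mathcal N(0, I_d) and target \mathcal N(\mathbf 1_d, I_d), where the fraction of effective samples, the standard (if imperfect) diagnostic (\sum_i w_i)^2 / \sum_i w_i^2 divided by N, collapses as d grows:

```python
import numpy as np

rng = np.random.default_rng(0)

def ess_fraction(d, shift=1.0, n=10_000):
    """Fraction of effective samples for IS with surrogate N(0, I_d) and
    target N(shift * 1_d, I_d), using the diagnostic (sum w)^2 / sum w^2."""
    x = rng.normal(size=(n, d))                            # samples from the surrogate
    log_w = -0.5 * np.sum((x - shift)**2 - x**2, axis=1)   # log target - log surrogate
    log_w -= log_w.max()                                   # stabilise before exponentiating
    w = np.exp(log_w)
    return (w.sum()**2 / (w**2).sum()) / n

for d in [1, 2, 5, 10, 20, 50]:
    print(d, ess_fraction(d))    # the effective sample fraction collapses with d
```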

17.6 Thermodynamic Lebesgue integration aka stepping stone sampling

Since we couldn’t yet completely defeat the curse of dimensionality, could we try introducing an extra degree of freedom that helps us manage the scaling of the total number of samples needed?

One way to estimate a ratio of integrals is to decompose it into a telescoping series of “easier” integral ratios:

\frac{\lambda[\pi]}{\lambda[\pi_0]} = \frac{\lambda[\pi_{\beta_1}]}{\lambda[\pi_0]} \cdot \frac{\lambda[\pi_{\beta_2}]}{\lambda[\pi_{\beta_1}]} \dots \frac{\lambda[\pi]}{\lambda[\pi_{\beta_{n-1}}]}

where \pi_{\beta_i} = \pi_0^{1-\beta_i} \pi^{\beta_i} are annealed/tempered functions interpolating between \pi_0 and \pi (this particular choice is called a geometric path; there are others).

Each of the constituent ratios can be estimated using the technique for normaliser ratio estimation we discussed above. This is kind of like thermodynamic integration.

Annealing provably improves on single-step importance sampling for estimating ratios of integrals (Chehab, Hyvarinen, and Risteski (2023)). Successive distributions are “closer”, so the variance of each ratio estimate is low, which leads to low overall variance for the product.

Back-of-the-envelope justification: say we have a multivariate Gaussian reference M = \mathcal N(0, I_d) and target Q = \mathcal N(\mu, I_d), where \mu is a d-dimensional vector. Then the KL divergence is D(Q || M) = \frac{1}{2} \|\mu\|^2, and the number of samples required would be N_0 = \mathcal O(\exp(\frac{1}{2}\|\mu\|^2))

If we introduce an annealing sequence of k-1 intermediate tempered distributions (Grosse, Maddison, and Salakhutdinov (2013)), M_i = \mathcal N(i\mu/k, I_d), then the successive KL divergences would be D(M_i || M_{i+1}) = \frac{1}{2}\|\mu\|^2/k^2, and thus the number of samples required for each step would be \mathcal O(\exp(\frac{1}{2}\|\mu\|^2/k^2)). The total number of samples required would be \mathcal O(k \exp(\frac{1}{2}\|\mu\|^2/k^2)) = \mathcal O(k \cdot N_0^{\frac{1}{k^2}}).

This is common for normaliser estimation, but almost unexplored for expectations (Llorente, Martino, and Delgado (2023)).
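
A minimal sketch of stepping stone sampling in Python, on a toy 1-D example of my own choosing: reference \pi_0(x) = e^{-x^2/2}, target \pi(x) = c \, e^{-(x-\mu)^2/2}, so the true log ratio of normalisers is \log c. Conveniently, each tempered distribution along the geometric path is itself a Gaussian we can sample exactly; in realistic problems the intermediate samples would come from MCMC.

```python
import numpy as np

rng = np.random.default_rng(0)

mu, log_c = 4.0, np.log(5.0)
# Reference pi_0(x) = exp(-x^2 / 2), target pi(x) = c * exp(-(x - mu)^2 / 2),
# so the true log(lambda[pi] / lambda[pi_0]) = log c.

def log_pi_beta(x, beta):
    """Log of the (unnormalised) geometric-path function pi_0^(1-beta) * pi^beta."""
    return beta * log_c - 0.5 * ((1 - beta) * x**2 + beta * (x - mu)**2)

K, N = 10, 5_000
betas = np.linspace(0.0, 1.0, K + 1)

log_ratio = 0.0
for beta_lo, beta_hi in zip(betas[:-1], betas[1:]):
    # In this toy example P_{beta} is exactly N(beta * mu, 1), so we can sample
    # it directly; in general this step is where MCMC comes in.
    x = rng.normal(beta_lo * mu, 1.0, size=N)
    w = np.exp(log_pi_beta(x, beta_hi) - log_pi_beta(x, beta_lo))
    log_ratio += np.log(w.mean())    # one stepping stone: log of a normaliser ratio

print(log_ratio, log_c)              # both close to log 5 ~ 1.609
```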

17.7 A note on optimal surrogate/reference distributions and target aware Bayesian inference

Excellent review: Llorente and Martino (2025)

Rainforth et al. (2020) was one of the most exciting papers I came across around March 2025. Zero variance estimators for expectations! Too good to be true, right?

Indeed. The thing about zero variance surrogates is that they need to be tractable densities: you need to be able to evaluate their normalised density (you don’t even need to be able to sample from them). Their optimality comes from the information in the normaliser: the only reason we get zero variance is that the integral we’re looking for is hidden in their normalising constant. The optimal SNIS surrogate is more explicit about having to know the integral.

The only way to use these optimality conditions in practice is to use variational techniques to find the tractable density closest to our target (something like Surjanovic et al. (2023)). So this approach hits the bottleneck most variational approximation methods do: when the variational family doesn’t contain the target distribution (which is the case in most practical scenarios), there’s a limit to how good your estimator can be, namely only as good as your variational family allows. So Target Aware Bayesian Inference is not in fact going to be exact in almost all cases.

In fact, it is probably going to be suboptimal thanks to a simple fact: the class of distributions we can sample from is far larger than the class of distributions with tractable densities.

So we should in fact start where TABI ends: the paper mentions at the very end a generalised framework for TABI which is, in my opinion, the real gem worth recognising.

17.8 Note on nomenclature and models of abstraction

I feel the current nomenclature and corresponding abstractions in Monte Carlo methods, while good pedagogically, are outdated and suboptimal. The overloading of the term “importance sampling” is a good example of this. In this set of notes, I have tried to use the names that make more sense to me.

Chatterjee, Sourav, and Persi Diaconis. 2017. “The Sample Size Required in Importance Sampling.” https://arxiv.org/abs/1511.01437.
Chehab, Omar, Aapo Hyvarinen, and Andrej Risteski. 2023. “Provable Benefits of Annealing for Estimating Normalizing Constants: Importance Sampling, Noise-Contrastive Estimation, and Beyond.” In Advances in Neural Information Processing Systems, edited by A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, 36:45945–70. Curran Associates, Inc. https://proceedings.neurips.cc/paper_files/paper/2023/file/90080022263cddafddd4a0726f1fb186-Paper-Conference.pdf.
Grosse, Roger B, Chris J Maddison, and Russ R Salakhutdinov. 2013. “Annealing Between Distributions by Averaging Moments.” In Advances in Neural Information Processing Systems, edited by C. J. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger. Vol. 26. Curran Associates, Inc. https://proceedings.neurips.cc/paper_files/paper/2013/file/fb60d411a5c5b72b2e7d3527cfc84fd0-Paper.pdf.
Llorente, Fernando, and Luca Martino. 2025. “Optimality in Importance Sampling: A Gentle Survey.” arXiv Preprint arXiv:2502.07396. https://arxiv.org/abs/2502.07396.
Llorente, Fernando, Luca Martino, and David Delgado. 2023. “Target-Aware Bayesian Inference via Generalized Thermodynamic Integration.” Computational Statistics 38 (4): 2097–2119.
Rainforth, Tom, Adam Golinski, Frank Wood, and Sheheryar Zaidi. 2020. “Target–Aware Bayesian Inference: How to Beat Optimal Conventional Estimators.” Journal of Machine Learning Research 21: 1–54.
Surjanovic, Nikola, Saifuddin Syed, Alexandre Bouchard-Côté, and Trevor Campbell. 2023. “Parallel Tempering with a Variational Reference.” https://arxiv.org/abs/2206.00080.