Hiroaki Hayashi / Importance Sampling

What is importance sampling?

Importance sampling is a variance reduction technique for Monte Carlo estimation. The core problem it solves: you want to estimate an expectation or integral — say $\mathbb{E}_p[f(x)] = \int f(x)\,p(x)\,dx$ — but naive sampling from $p$ wastes most of its budget drawing samples from regions where $f(x)$ contributes almost nothing to the integral.

How it works

Instead of sampling from the original distribution $p(x)$, you sample from a proposal distribution $q(x)$ that you choose, then correct for the bias by reweighting each sample:

$$\mathbb{E}_p[f(x)] = \int f(x) \cdot \frac{p(x)}{q(x)} \cdot q(x)\,dx = \mathbb{E}_q[f(x) \cdot w(x)]$$

where $w(x) = p(x)/q(x)$ is the importance weight. You draw $x_1, \ldots, x_n \sim q$, compute weights $w_i = p(x_i)/q(x_i)$, and estimate the expectation as $\frac{1}{n} \sum_i f(x_i) \cdot w_i$. The weights keep you honest — they undo the distributional mismatch so the estimator is unbiased.
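A minimal sketch in Python (NumPy assumed; the specific target, proposal, and $f$ are illustrative choices): estimating $\mathbb{E}_p[x^2] = 1$ for $p = N(0,1)$ by sampling from a deliberately wider proposal $q = N(0, 2^2)$ and reweighting:

```python
import numpy as np

rng = np.random.default_rng(0)

def p(x):
    """Target density: standard normal N(0, 1)."""
    return np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)

def q(x):
    """Proposal density: N(0, 2^2), wider than p so weights stay bounded."""
    return np.exp(-x**2 / 8) / np.sqrt(8 * np.pi)

def f(x):
    return x**2  # E_p[x^2] = 1 exactly, so the answer is checkable

n = 100_000
x = rng.normal(0.0, 2.0, size=n)  # x_i ~ q
w = p(x) / q(x)                   # importance weights w_i = p(x_i) / q(x_i)
estimate = np.mean(f(x) * w)      # (1/n) * sum_i f(x_i) * w_i
```

Note that the weights themselves average to 1 under $q$, which is a quick sanity check on any IS implementation.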

If you only know $p$ up to a normalizing constant (common in Bayesian inference), you use self-normalized importance sampling: $\hat{I} = \sum_i w_i f(x_i) \,/\, \sum_i w_i$. This introduces a small $O(1/n)$ bias that vanishes as the sample size grows, but removes the need to know the normalizing constant.
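A sketch of the self-normalized estimator (Python/NumPy; the unnormalized target here is a standard normal with its constant deliberately dropped, so the true answer is known):

```python
import numpy as np

rng = np.random.default_rng(1)

def p_tilde(x):
    """Unnormalized target: N(0, 1) with its 1/sqrt(2*pi) constant dropped."""
    return np.exp(-x**2 / 2)

def q(x):
    """Proposal density: N(0, 2^2), fully normalized."""
    return np.exp(-x**2 / 8) / np.sqrt(8 * np.pi)

n = 100_000
x = rng.normal(0.0, 2.0, size=n)
w = p_tilde(x) / q(x)                    # unnormalized weights
estimate = np.sum(w * x**2) / np.sum(w)  # sum_i w_i f(x_i) / sum_i w_i
# The unknown normalizing constant cancels in the ratio; E_p[x^2] = 1.
```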

When it's useful

The classic cases:

Rare event estimation: estimating $P(X > \text{threshold})$ when that probability is tiny. Sampling from $p$ almost never hits the tail, but a proposal shifted toward the tail gets you there efficiently.

Bayesian posterior computation: the posterior is known only up to a constant, and self-normalized IS gives you weighted samples without MCMC.

Off-policy evaluation in RL: you have data collected under a behavior policy $\pi_b$ but want to evaluate a target policy $\pi_e$. The importance weight is the likelihood ratio between the two policies.

Integration of functions with concentrated mass, where you can roughly identify in advance where the mass lives.
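A sketch of the rare-event case (Python/NumPy; the threshold and the mean-shifted proposal are illustrative choices): estimating $P(X > 5)$ for $X \sim N(0,1)$, a probability around $2.9 \times 10^{-7}$ that naive sampling essentially never observes:

```python
import numpy as np
from math import erfc, sqrt

rng = np.random.default_rng(2)

threshold = 5.0
true_p = 0.5 * erfc(threshold / sqrt(2))  # exact tail probability, ~2.87e-7

n = 100_000
# Proposal N(threshold, 1): the mean sits on the threshold,
# so roughly half the samples land in the tail we care about.
x = rng.normal(threshold, 1.0, size=n)

# log p(x) - log q(x); the 1/sqrt(2*pi) normalizers cancel.
log_w = -x**2 / 2 + (x - threshold)**2 / 2
estimate = np.mean((x > threshold) * np.exp(log_w))
```

With $10^5$ samples this lands within a few percent of the true value, while naive Monte Carlo with the same budget would almost always return exactly zero.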

The variance reduction can be dramatic. The optimal proposal is $q^*(x) \propto |f(x)| \cdot p(x)$, which concentrates sampling exactly where the integrand contributes most. In practice you can't achieve this exactly (it requires knowing the answer), but even a rough approximation can cut variance by orders of magnitude compared to naive MC.
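A small check of the optimal-proposal claim (Python/NumPy; this toy case is chosen so $q^*$ is tractable): for $f(x) = x^2$ and $p = N(0,1)$, the optimal proposal $q^*(x) \propto x^2\,\varphi(x)$ is already normalized, since $\mathbb{E}_p[x^2] = 1$, and sampling from it amounts to drawing $|x|$ from a chi distribution with 3 degrees of freedom and attaching a random sign. Every weighted sample then contributes the identical value, so the estimator has zero variance:

```python
import numpy as np

rng = np.random.default_rng(3)

def phi(x):
    """Standard normal density."""
    return np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)

n = 10_000
# |x| ~ chi(3): density proportional to r^2 * exp(-r^2/2) for r > 0.
r = np.sqrt(rng.chisquare(3, size=n))
x = r * rng.choice([-1.0, 1.0], size=n)

q_star = x**2 * phi(x)                 # optimal proposal, already normalized
per_sample = x**2 * phi(x) / q_star    # f(x) * p(x) / q*(x) == 1 for every sample
estimate = np.mean(per_sample)
```

Of course, building $q^*$ here required already knowing $\mathbb{E}_p[x^2]$, which is exactly the circularity the parenthetical above points out.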

When it breaks down

The fundamental failure mode is when $q$ has lighter tails than $f \cdot p$. If there exist regions where $f(x) \cdot p(x)$ is non-negligible but $q(x) \approx 0$, some samples will get enormous weights, and the estimator's variance explodes — potentially becoming infinite. This is the weight degeneracy problem. In practice you see it as a few samples dominating the entire estimate, with the effective sample size ($\text{ESS} = (\sum w_i)^2 / \sum w_i^2$) collapsing toward 1.
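The effective sample size is a one-liner worth monitoring in any IS implementation (a sketch in Python/NumPy):

```python
import numpy as np

def effective_sample_size(w):
    """ESS = (sum w)^2 / sum w^2: equals n for uniform weights,
    collapses toward 1 when a single weight dominates."""
    w = np.asarray(w, dtype=float)
    return np.sum(w)**2 / np.sum(w**2)
```

For example, 100 equal weights give an ESS of 100, while weights like `[1000, 1, 1, 1]` give an ESS just above 1: the estimate is effectively based on a single sample.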

This gets catastrophically worse in high dimensions. Even if $q$ is a reasonable approximation to $p$ in each marginal, the joint mismatch grows exponentially with dimensionality. A proposal that's slightly off in 100 dimensions will produce weights that are essentially all zero except for one or two samples. This is why vanilla IS is rarely used beyond maybe 10–20 dimensions without serious adaptation.
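A toy demonstration of the dimensional collapse (Python/NumPy; the per-coordinate proposal scale of 1.2 is an illustrative "slightly off" choice): the mismatch between $q = N(0, 1.2^2 I_d)$ and $p = N(0, I_d)$ is mild in each marginal, but the log-weight is a sum over coordinates, so the effective sample size craters as $d$ grows:

```python
import numpy as np

rng = np.random.default_rng(4)

def ess_from_log_weights(log_w):
    """ESS computed from log-weights, shifted by the max for numerical stability."""
    w = np.exp(log_w - np.max(log_w))
    return np.sum(w)**2 / np.sum(w**2)

n, sigma = 10_000, 1.2
ess_by_dim = {}
for d in (1, 10, 100):
    x = rng.normal(0.0, sigma, size=(n, d))  # x_i ~ q = N(0, sigma^2 I_d)
    # log w = log p(x) - log q(x); the (2*pi)^(d/2) factors cancel.
    log_w = (-0.5 * np.sum(x**2, axis=1)
             + 0.5 * np.sum((x / sigma)**2, axis=1)
             + d * np.log(sigma))
    ess_by_dim[d] = ess_from_log_weights(log_w)
```

With this configuration the ESS stays near $n$ at $d = 1$, drops noticeably by $d = 10$, and falls to a tiny fraction of $n$ by $d = 100$, even though each marginal of $q$ is a decent match for $p$.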

Other limitations: you need to evaluate $p(x)/q(x)$ pointwise, which requires being able to evaluate both densities (at least up to constants if using self-normalized IS). If $p$ is implicit — say, defined only through a simulator with no tractable density — you can't compute the weights. And the support condition is strict: if $q(x) = 0$ anywhere that $f(x) \cdot p(x)$ is nonzero, the estimator is biased, and no number of extra samples fixes it.

Finally, choosing a good proposal is itself a hard problem. Adaptive IS methods (like population Monte Carlo or adaptive multiple IS) try to iteratively refine $q$, but they add complexity and their own failure modes. In many modern settings, people reach for MCMC or variational inference instead — IS remains most powerful in low-to-moderate dimensions where you have genuine domain knowledge about where the mass of the integrand lives.