Hiroaki Hayashi / Forward vs Reverse KL

Forward KL vs Reverse KL Divergence

Interactive visualization of mode-covering and mode-seeking behavior when approximating a multimodal distribution

Optimal Approximations

[Interactive chart. Legend: Target p(x), Forward KL, Reverse KL, JSD, TVD. Slider settings: 3.0, 0.55.]

Forward KL — $\mathrm{KL}(p \| q)$

$$D_{\mathrm{KL}}(p \| q) = \mathbb{E}_{x \sim p}\!\left[\log \frac{p(x)}{q(x)}\right]$$

Mode-covering / Zero-avoiding. The expectation under $p$ means $\log \frac{p}{q}$ explodes wherever $q(x) \approx 0$. So $q$ must spread to cover all of $p$'s support—even at the cost of placing mass in low-density regions between modes.

In LLM distillation: diverse but sometimes incoherent outputs—the student hedges across all valid completions.
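The zero-avoiding penalty is easy to see numerically. Below is a minimal sketch (NumPy; the bimodal target with modes at ±3 and width 0.55 is an assumption mirroring the demo's settings) comparing a wide $q$ that covers both modes against a narrow $q$ locked onto one:

```python
import numpy as np

x = np.linspace(-10.0, 10.0, 4001)
dx = x[1] - x[0]

def gauss(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Hypothetical bimodal target: two Gaussian modes at ±3, width 0.55.
p = 0.5 * gauss(x, -3.0, 0.55) + 0.5 * gauss(x, 3.0, 0.55)

def forward_kl(p, q, dx, eps=1e-12):
    # KL(p || q) = E_{x~p}[log p/q], approximated by a Riemann sum;
    # eps floors q so the penalty stays finite (but large) where q ≈ 0.
    return np.sum(p * (np.log(p + eps) - np.log(q + eps))) * dx

q_wide = gauss(x, 0.0, 3.2)     # spreads over both modes
q_narrow = gauss(x, 3.0, 0.55)  # matches only the right mode

print(forward_kl(p, q_wide, dx))    # moderate: q covers p's support
print(forward_kl(p, q_narrow, dx))  # large: q ≈ 0 under the left mode
```

The narrow $q$ is punished because half of $p$'s mass falls where $q \approx 0$; the wide $q$ pays a much smaller price for hedging across the valley.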

Reverse KL — $\mathrm{KL}(q \| p)$

$$D_{\mathrm{KL}}(q \| p) = \mathbb{E}_{x \sim q}\!\left[\log \frac{q(x)}{p(x)}\right]$$

Mode-seeking / Zero-forcing. The expectation under $q$ means $\log \frac{q}{p}$ explodes wherever $p(x) \approx 0$. So $q$ avoids regions outside $p$'s support and concentrates on a single high-density mode.

In LLM distillation: sharp and fluent but lacks diversity—the student may ignore valid alternatives entirely.
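The same comparison flips under reverse KL. A grid-based sketch (NumPy; the bimodal target with modes at ±3 and width 0.55 is an assumption): a $q$ sitting on one mode pays only $\ln 2$ for ignoring the other, while a wide $q$ is punished for bridging the valley where $p \approx 0$:

```python
import numpy as np

x = np.linspace(-10.0, 10.0, 4001)
dx = x[1] - x[0]

def gauss(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Hypothetical bimodal target: two Gaussian modes at ±3, width 0.55.
p = 0.5 * gauss(x, -3.0, 0.55) + 0.5 * gauss(x, 3.0, 0.55)

def reverse_kl(q, p, dx, eps=1e-12):
    # KL(q || p) = E_{x~q}[log q/p], approximated by a Riemann sum.
    return np.sum(q * (np.log(q + eps) - np.log(p + eps))) * dx

q_narrow = gauss(x, 3.0, 0.55)  # sits exactly on one mode
q_wide = gauss(x, 0.0, 3.2)     # bridges the low-density valley

print(reverse_kl(q_narrow, p, dx))  # ≈ ln 2: q/p ≈ 2 wherever q has mass
print(reverse_kl(q_wide, p, dx))    # much larger: q puts mass where p ≈ 0
```

Here the narrow $q$ equals the right mixture component, so $q/p \approx 2$ on its support and the divergence is just the cost of dropping one of two equally weighted modes.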

Jensen-Shannon — $\mathrm{JSD}(p \| q)$

$$\mathrm{JSD}(p \| q) = \tfrac{1}{2}\,\mathrm{KL}(p \| m) + \tfrac{1}{2}\,\mathrm{KL}(q \| m), \quad m = \tfrac{p+q}{2}$$

Symmetric / Bounded. Averages the two KL divergences from $p$ and $q$ to their mixture midpoint $m$. Bounded in $[0, \ln 2]$ and always finite, even when the supports don't overlap. The optimal $q$ balances coverage and precision—wider than reverse KL, tighter than forward KL.

In LLM distillation: a practical middle ground, often used in GAN-style training and f-divergence distillation.
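The $\ln 2$ bound can be checked numerically: push two Gaussians apart until their supports are effectively disjoint and the JSD saturates rather than diverging. A sketch using grid integration (NumPy assumed):

```python
import numpy as np

x = np.linspace(-20.0, 20.0, 8001)
dx = x[1] - x[0]

def gauss(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def kl(a, b, dx, eps=1e-300):
    # eps avoids log(0); where a = 0 the integrand is 0 regardless.
    return np.sum(a * (np.log(a + eps) - np.log(b + eps))) * dx

def jsd(p, q, dx):
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m, dx) + 0.5 * kl(q, m, dx)

# Effectively disjoint supports: both KL terms to the midpoint equal ln 2.
p = gauss(x, -8.0, 0.5)
q = gauss(x, 8.0, 0.5)
print(jsd(p, q, dx))  # ≈ ln 2 ≈ 0.693, the upper bound
print(jsd(p, p, dx))  # identical distributions: 0
```

Where the supports don't overlap, $m = p/2$ on $p$'s support (and $q/2$ on $q$'s), so each KL term to the midpoint is exactly $\ln 2$.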

Total Variation — $\mathrm{TV}(p, q)$

$$\mathrm{TV}(p, q) = \tfrac{1}{2}\!\int\! |p(x) - q(x)|\,dx$$

Symmetric / $L^1$ distance. Measures the maximum probability mass that $q$ assigns differently from $p$. Bounded in $[0, 1]$. The optimal $q$ minimizes the area of mismatch—it tries to match the shape of the largest mode cluster.

Directly interpretable: the largest event-probability gap between $p$ and $q$.
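A numerical sketch of the two endpoints of the $[0, 1]$ range (grid approximation, NumPy assumed):

```python
import numpy as np

x = np.linspace(-10.0, 10.0, 4001)
dx = x[1] - x[0]

def gauss(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def tv(p, q, dx):
    # TV(p, q) = (1/2) * integral of |p(x) - q(x)| dx
    return 0.5 * np.sum(np.abs(p - q)) * dx

p = gauss(x, 0.0, 1.0)
print(tv(p, p, dx))                   # identical distributions: 0
print(tv(p, gauss(x, 6.0, 1.0), dx))  # nearly disjoint supports: close to 1
```

Unlike either KL, TV never blows up: the worst case for two Gaussians six standard deviations apart is simply a value near the upper bound of 1.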

Explore: Move q Yourself

Drag μ and σ to see how each divergence responds in real time.

[Interactive chart. Sliders: μ = 0.00, σ = 0.80. Live readouts: $\mathrm{KL}(p\|q)$, $\mathrm{KL}(q\|p)$, JSD, TV. Legend: Target p(x), Your q(x), with shaded p > q regions (forward-KL cost) and q > p regions (reverse-KL cost).]
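The live readout can be reproduced offline. A sketch (NumPy; the target's modes at ±3 with width 0.55 are an assumption mirroring the demo) computing all four quantities for a chosen $(\mu, \sigma)$:

```python
import numpy as np

x = np.linspace(-10.0, 10.0, 4001)
dx = x[1] - x[0]

def gauss(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def readout(mu, sigma, eps=1e-300):
    # Assumed bimodal target: modes at ±3, width 0.55.
    p = 0.5 * gauss(x, -3.0, 0.55) + 0.5 * gauss(x, 3.0, 0.55)
    q = gauss(x, mu, sigma)
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(a * (np.log(a + eps) - np.log(b + eps))) * dx
    return {
        "KL(p||q)": kl(p, q),
        "KL(q||p)": kl(q, p),
        "JSD": 0.5 * kl(p, m) + 0.5 * kl(q, m),
        "TV": 0.5 * np.sum(np.abs(p - q)) * dx,
    }

print(readout(0.0, 0.8))  # q centered in the valley: every divergence is large
```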

Optimization Dynamics

Both optimizers start from your q position above.

[Animated charts: Forward KL Optimization and Reverse KL Optimization trajectories.]
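The two trajectories can be sketched with plain gradient descent on a Gaussian $q$'s parameters $(\mu, \log\sigma)$. Everything specific here is an assumption: the bimodal target at ±3 with width 0.55, the start point $(\mu = 1, \sigma = 0.8)$, the learning rate, and finite-difference gradients instead of analytic ones.

```python
import numpy as np

x = np.linspace(-12.0, 12.0, 2401)
dx = x[1] - x[0]

def gauss(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Assumed bimodal target: modes at ±3, width 0.55.
p = 0.5 * gauss(x, -3.0, 0.55) + 0.5 * gauss(x, 3.0, 0.55)
logp = np.log(p + 1e-300)

def fwd_kl(mu, sigma):  # KL(p || q)
    q = gauss(x, mu, sigma)
    return np.sum(p * (logp - np.log(q + 1e-300))) * dx

def rev_kl(mu, sigma):  # KL(q || p)
    q = gauss(x, mu, sigma)
    return np.sum(q * (np.log(q + 1e-300) - logp)) * dx

def descend(loss, mu=1.0, log_sig=np.log(0.8), lr=0.05, steps=2500, h=1e-4):
    # Plain gradient descent with central finite differences.
    # mu starts slightly off-center to break the left/right symmetry.
    for _ in range(steps):
        g_mu = (loss(mu + h, np.exp(log_sig)) - loss(mu - h, np.exp(log_sig))) / (2 * h)
        g_ls = (loss(mu, np.exp(log_sig + h)) - loss(mu, np.exp(log_sig - h))) / (2 * h)
        mu, log_sig = mu - lr * g_mu, log_sig - lr * g_ls
    return mu, np.exp(log_sig)

mu_f, sig_f = descend(fwd_kl)  # mode-covering: mu → 0, sigma widens over both modes
mu_r, sig_r = descend(rev_kl)  # mode-seeking: mu → +3, sigma shrinks to one mode
print(mu_f, sig_f, mu_r, sig_r)
```

Forward KL drives $q$ to moment-match the whole mixture (mean 0, large $\sigma$), while reverse KL collapses onto whichever mode the start point favors.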