Interactive visualization of mode-covering and mode-seeking behavior when approximating a multimodal distribution
Mode-covering / Zero-avoiding. Because the expectation is taken under $p$, the term $\log \frac{p}{q}$ explodes wherever $p(x) > 0$ but $q(x) \approx 0$. So $q$ must spread to cover all of $p$'s support—even at the cost of placing mass in low-density regions between modes.
In LLM distillation: diverse but sometimes incoherent outputs—the student hedges across all valid completions.
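A numerical sketch (not the demo's code) of the mode-covering pull: the bimodal target, the two candidate approximations, and all their parameters below are illustrative assumptions, but the ordering of the two forward-KL values is the general phenomenon.

```python
# Illustrative sketch: forward KL(p || q) on a grid, with a bimodal p.
# A wide q that covers both modes beats a narrow q sitting on one mode.
import numpy as np

x = np.linspace(-12, 12, 4001)
dx = x[1] - x[0]

def gauss(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

p = 0.5 * gauss(x, -3, 1) + 0.5 * gauss(x, 3, 1)  # bimodal target (assumed)

def forward_kl(p, q, eps=1e-300):
    # KL(p || q) = E_p[log p/q]: huge wherever p > 0 but q ~ 0
    return np.sum(p * np.log((p + eps) / (q + eps))) * dx

q_wide = gauss(x, 0, 3.2)   # mode-covering candidate
q_narrow = gauss(x, 3, 1)   # mode-seeking candidate

print(forward_kl(p, q_wide))    # small: q covers all of p's support
print(forward_kl(p, q_narrow))  # large: q ~ 0 under the left mode
```

The narrow candidate pays an enormous penalty for the left mode it ignores, which is exactly why forward-KL training pushes the student to hedge.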
Mode-seeking / Zero-forcing. Because the expectation is taken under $q$, the term $\log \frac{q}{p}$ explodes wherever $q(x) > 0$ but $p(x) \approx 0$. So $q$ avoids regions outside $p$'s support and concentrates on a single high-density mode.
In LLM distillation: sharp and fluent but lacks diversity—the student may ignore valid alternatives entirely.
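The same setup flipped around (again a sketch with assumed parameters, not the demo's code): under reverse KL the preference reverses, and the narrow single-mode candidate wins.

```python
# Illustrative sketch: reverse KL(q || p) with the same bimodal p.
# Here a narrow q locked onto one mode beats the wide covering q.
import numpy as np

x = np.linspace(-12, 12, 4001)
dx = x[1] - x[0]

def gauss(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

p = 0.5 * gauss(x, -3, 1) + 0.5 * gauss(x, 3, 1)  # bimodal target (assumed)

def reverse_kl(p, q, eps=1e-300):
    # KL(q || p) = E_q[log q/p]: huge wherever q > 0 but p ~ 0
    return np.sum(q * np.log((q + eps) / (p + eps))) * dx

q_wide = gauss(x, 0, 3.2)   # mode-covering candidate
q_narrow = gauss(x, 3, 1)   # mode-seeking candidate

print(reverse_kl(p, q_narrow))  # ~log 2: q matches one of two equal modes
print(reverse_kl(p, q_wide))    # larger: q puts mass in the low-p trough
```

Note the narrow candidate's cost is about $\ln 2$: the price of committing to one of two equally weighted modes.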
Symmetric / Bounded. Averages both KL directions through the mixture midpoint $m = \frac{1}{2}(p + q)$. Bounded in $[0, \ln 2]$ (in nats) and always finite, even when the supports don't overlap. The optimal $q$ balances coverage and precision—wider than the reverse-KL solution, tighter than the forward-KL one.
In LLM distillation: a practical middle ground, often used in GAN-style training and f-divergence distillation.
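The boundedness is easy to check numerically. In this sketch (the two far-separated Gaussians are assumed for illustration), both KL directions would diverge, yet the Jensen-Shannon divergence sits at its upper bound $\ln 2$:

```python
# Illustrative sketch: Jensen-Shannon divergence stays finite (<= ln 2)
# even when p and q have essentially disjoint supports.
import numpy as np

x = np.linspace(-12, 12, 4001)
dx = x[1] - x[0]

def gauss(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def kl(a, b, eps=1e-300):
    return np.sum(a * np.log((a + eps) / (b + eps))) * dx

def jsd(p, q):
    m = 0.5 * (p + q)  # mixture midpoint
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p = gauss(x, -6, 0.5)
q = gauss(x, 6, 0.5)  # far-separated: KL in either direction would diverge

print(jsd(p, q))  # close to the upper bound ln 2 ~ 0.6931
print(jsd(p, p))  # 0 for identical distributions
```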
Symmetric / $L^1$ distance. Total variation: half the $L^1$ distance, $\mathrm{TV}(p, q) = \frac{1}{2}\int |p - q|$, measuring the maximum probability mass that $q$ assigns differently from $p$. Bounded in $[0, 1]$. The optimal $q$ minimizes the area of mismatch—it tries to match the shape of the largest mode cluster.
Directly interpretable: the largest event-probability gap between $p$ and $q$.
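Both readings of total variation can be checked against each other in a sketch (the particular $p$ and $q$ are assumed for illustration): the half-$L^1$ integral equals the probability gap on the event where $p$ exceeds $q$.

```python
# Illustrative sketch: total variation as half the L1 distance, and its
# event interpretation: TV = P(A) - Q(A) for A = {x : p(x) > q(x)}.
import numpy as np

x = np.linspace(-12, 12, 4001)
dx = x[1] - x[0]

def gauss(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

p = 0.5 * gauss(x, -3, 1) + 0.5 * gauss(x, 3, 1)  # bimodal target (assumed)
q = gauss(x, 0, 2.5)                              # single-Gaussian candidate

tv = 0.5 * np.sum(np.abs(p - q)) * dx          # half-L1 form

A = p > q                                      # the maximizing event
gap = np.sum(p[A]) * dx - np.sum(q[A]) * dx    # P(A) - Q(A)

print(tv, gap)  # agree up to grid/truncation error
```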
Drag μ and σ to see how each KL responds in real time
Both start from your q position above
Forward KL Optimization
Reverse KL Optimization