Kullback-Leibler divergence
1 As information gain
KL-divergence can be interpreted as the expected number of additional bits needed to encode samples from a distribution \(P\) when using a code that has been optimized for \(Q\):
\[D_{KL}(P || Q) = \sum_{x\in \mathcal{X}} P(x) \left[ \log P(x) - \log Q(x) \right]\]
Recall that \(-\sum_{x\in \mathcal{X}} P(x) \log P(x)\) is the entropy of \(P\), and \(-\log P(x)\) is the optimal number of bits to spend on symbol \(x\) in a code optimized for \(P\). The KL-divergence is thus the cross-entropy of \(P\) under a code optimized for \(Q\), minus the entropy of \(P\).
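The "extra bits" reading can be checked numerically. The sketch below (with made-up example distributions) computes \(D_{KL}(P \| Q)\) in bits, using base-2 logarithms and skipping terms where \(P(x) = 0\), following the convention \(0 \log 0 = 0\):

```python
import math

def kl_divergence(p, q):
    """D(P || Q) in bits, for discrete distributions given as
    probability lists over the same support."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def entropy(p):
    """Entropy of P in bits."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """Expected bits per symbol when samples from P are encoded
    with a code optimized for Q."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical example distributions over a 3-symbol alphabet:
p = [0.5, 0.25, 0.25]
q = [0.25, 0.25, 0.5]

print(kl_divergence(p, q))               # extra bits paid per symbol: 0.25
print(cross_entropy(p, q) - entropy(p))  # same quantity via the identity: 0.25
```

The two printed values agree, illustrating that \(D_{KL}(P \| Q) = H(P, Q) - H(P)\).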
2 Properties
2.1 \(D(P || Q) \geq 0\)
2.1.1 Proof
\[\begin{align*} D(P || Q) &= \sum_{i=1}^{n} P(x_i) \log \left(\frac{P(x_i)}{Q(x_i)}\right)\\ &= -\sum_{i=1}^{n} P(x_i) \log \left(\frac{Q(x_i)}{P(x_i)}\right)\\ &\geq -\log \left(\sum_{i=1}^{n} P(x_i)\frac{Q(x_i)}{P(x_i)}\right)\\ &= -\log \left(1\right)\\ &= 0 \end{align*}\] The inequality in the third line follows from Jensen's inequality, since \(-\log(x)\) is a convex function: the expectation of a convex function is at least the function of the expectation.
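Nonnegativity can also be sanity-checked empirically. The sketch below (an illustrative check, not part of the proof) draws random pairs of distributions and confirms that the divergence never dips below zero, and is exactly zero when \(P = Q\):

```python
import math
import random

def kl_divergence(p, q):
    """D(P || Q) in bits for discrete probability lists."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def random_dist(n, rng):
    """A random probability distribution over n outcomes."""
    w = [rng.random() for _ in range(n)]
    s = sum(w)
    return [x / s for x in w]

rng = random.Random(0)
for _ in range(1000):
    p = random_dist(5, rng)
    q = random_dist(5, rng)
    assert kl_divergence(p, q) >= 0       # Gibbs' inequality
    assert abs(kl_divergence(p, p)) < 1e-12  # zero iff P = Q (here: P = Q)
print("all checks passed")
```

Note that \(D(P || Q) = 0\) holds exactly when \(P = Q\), since Jensen's inequality is tight for \(-\log\) only when the ratio \(Q(x_i)/P(x_i)\) is constant.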