Kullback-Leibler divergence
1 As information gain
KL-divergence can be interpreted as the expected number of additional bits needed to encode samples from a distribution \(P\) when using a code that has been optimized for \(Q\):
\[D_{KL}(P || Q) = \sum_{x\in \mathcal{X}} P(x) \left[ \log P(x) - \log Q(x) \right]\]
Recall that \(-\sum_{x\in \mathcal{X}} P(x) \log P(x)\) is the entropy of \(P\), and \(-\log P(x)\) is the optimal number of bits to spend on symbol \(x\) in a code optimized for \(P\). The KL-divergence is thus the cross-entropy of \(P\) under a code optimized for \(Q\), minus the entropy of \(P\).
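The "extra bits" reading can be checked numerically. The sketch below (with made-up example distributions) computes \(D_{KL}(P \| Q)\) in bits, using base-2 logarithms and skipping terms where \(P(x) = 0\), following the convention \(0 \log 0 = 0\):

```python
import math

def kl_divergence(p, q):
    """D(P || Q) in bits, for discrete distributions given as
    probability lists over the same support."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def entropy(p):
    """Entropy of P in bits."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """Expected bits per symbol when samples from P are encoded
    with a code optimized for Q."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical example distributions over a 3-symbol alphabet:
p = [0.5, 0.25, 0.25]
q = [0.25, 0.25, 0.5]

print(kl_divergence(p, q))               # extra bits paid per symbol: 0.25
print(cross_entropy(p, q) - entropy(p))  # same quantity via the identity: 0.25
```

The two printed values agree, illustrating that \(D_{KL}(P \| Q) = H(P, Q) - H(P)\).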
2 Properties
2.1 \(D(P || Q) \geq 0\)
2.1.1 Proof
\[\begin{align*} D(P || Q) &= \sum_{i=1}^{n} P(x_i) \log \left(\frac{P(x_i)}{Q(x_i)}\right)\\ &= -\sum_{i=1}^{n} P(x_i) \log \left(\frac{Q(x_i)}{P(x_i)}\right)\\ &\geq -\log \left(\sum_{i=1}^{n} P(x_i)\frac{Q(x_i)}{P(x_i)}\right)\\ &= -\log \left(1\right)\\ &= 0 \end{align*}\] The inequality in the third line follows from Jensen's inequality, since \(-\log(x)\) is a convex function: the expectation of a convex function is at least the function of the expectation.
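Nonnegativity can also be sanity-checked empirically. The sketch below (an illustrative check, not part of the proof) draws random pairs of distributions and confirms that the divergence never dips below zero, and is exactly zero when \(P = Q\):

```python
import math
import random

def kl_divergence(p, q):
    """D(P || Q) in bits for discrete probability lists."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def random_dist(n, rng):
    """A random probability distribution over n outcomes."""
    w = [rng.random() for _ in range(n)]
    s = sum(w)
    return [x / s for x in w]

rng = random.Random(0)
for _ in range(1000):
    p = random_dist(5, rng)
    q = random_dist(5, rng)
    assert kl_divergence(p, q) >= 0       # Gibbs' inequality
    assert abs(kl_divergence(p, p)) < 1e-12  # zero iff P = Q (here: P = Q)
print("all checks passed")
```

Note that \(D(P || Q) = 0\) holds exactly when \(P = Q\), since Jensen's inequality is tight for \(-\log\) only when the ratio \(Q(x_i)/P(x_i)\) is constant.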