
mutual information

The mutual information between random variables \(X\) and \(Y\) is the Kullback-Leibler divergence between \(P(X,Y)\) and \(P(X)P(Y)\): \[ I(X;Y) = D_{KL}(P(X,Y) \,\|\, P(X)P(Y)) = \sum_{x,y} p(x,y) \log\frac{p(x,y)}{p(x)p(y)} \] Expressed in terms of entropy: \(\begin{align} I(X;Y) &= H(X) - H(X\mid Y)\\ &= H(Y) - H(Y\mid X) \end{align}\) In the first line, you can think of \(H(X)\) as the uncertainty in \(X\) and \(H(X\mid Y)\) as the uncertainty that remains once \(Y\) is observed; their difference is the reduction in uncertainty about \(X\) that comes from knowing \(Y\).
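As a quick illustrative sketch (the joint table below is made up for this note, and the logs are base 2, so the result is in bits), the sum above can be computed directly for a discrete joint distribution:

  import numpy as np

  def mutual_information(p_xy):
      """I(X;Y) in bits for a discrete joint distribution given as a 2-D array."""
      p_xy = np.asarray(p_xy, dtype=float)
      p_x = p_xy.sum(axis=1, keepdims=True)   # marginal P(X)
      p_y = p_xy.sum(axis=0, keepdims=True)   # marginal P(Y)
      mask = p_xy > 0                          # treat 0 log 0 as 0
      return np.sum(p_xy[mask] * np.log2(p_xy[mask] / (p_x * p_y)[mask]))

  # A made-up joint distribution in which X and Y are correlated.
  p_xy = np.array([[0.4, 0.1],
                   [0.1, 0.4]])
  print(mutual_information(p_xy))              # > 0, since X and Y are dependent

  # Product of the marginals: X and Y independent, so I(X;Y) = 0.
  p_x = p_xy.sum(axis=1, keepdims=True)
  p_y = p_xy.sum(axis=0, keepdims=True)
  print(mutual_information(p_x * p_y))         # ~ 0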

You can also think of \(I(X;Y)\) as a measure of the information shared by \(X\) and \(Y\): how much does knowing one variable reduce uncertainty about the other? If \(X\) and \(Y\) are independent, then \(I(X;Y)\) is zero, because knowing the value of \(Y\) doesn't change the distribution of \(X\) at all. However, if \(X\) is a deterministic function of \(Y\), then \(H(X \mid Y) = 0\), since no uncertainty about \(X\) remains after observing \(Y\), and so \(I(X;Y) = H(X) - H(X\mid Y) = H(X)\). That is, the reduction in uncertainty we get from observing \(Y\) is exactly all the uncertainty that \(X\) had to begin with.
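As a hedged numerical check of this (the variables and joint table are invented for illustration): take \(Y\) uniform on \(\{0,1,2,3\}\) and \(X = Y \bmod 2\), so \(X\) is a deterministic function of \(Y\); then \(H(X\mid Y) = 0\) and \(I(X;Y) = H(X)\):

  import numpy as np

  def entropy(p):
      """Entropy in bits of a discrete distribution given as a 1-D array."""
      p = np.asarray(p, dtype=float)
      p = p[p > 0]
      return -np.sum(p * np.log2(p))

  # Y uniform over {0, 1, 2, 3}; X = Y mod 2 is a deterministic function of Y.
  # Joint p(x, y): rows index x, columns index y.
  p_xy = np.array([[0.25, 0.0,  0.25, 0.0 ],
                   [0.0,  0.25, 0.0,  0.25]])

  p_x = p_xy.sum(axis=1)                       # marginal P(X)
  p_y = p_xy.sum(axis=0)                       # marginal P(Y)

  # H(X|Y) = sum_y p(y) H(X | Y = y); each conditional is deterministic here.
  h_x_given_y = sum(p_y[j] * entropy(p_xy[:, j] / p_y[j]) for j in range(len(p_y)))

  print(entropy(p_x))                          # H(X) = 1 bit
  print(h_x_given_y)                           # H(X|Y) = 0
  print(entropy(p_x) - h_x_given_y)            # I(X;Y) = H(X) = 1 bit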

It turns out that \(I(X;Y) = 0\) if and only if \(X\) and \(Y\) are independent.

Note that mutual information is symmetric: \(I(X;Y) = I(Y;X)\).

Notice that each pointwise term \(\log p(x,y) - \log p(x)p(y) = \log\frac{p(x,y)}{p(x)p(y)}\) measures how far the pair \((x,y)\) departs from what independence would predict; mutual information is the expectation of this term under \(p(x,y)\).
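To make that concrete (again an illustrative sketch using the same made-up joint as above, with base-2 logs): the pointwise terms are positive for pairs that are more likely than independence would predict, negative for pairs that are less likely, and their expectation under \(p(x,y)\) is the mutual information:

  import numpy as np

  # Same made-up joint as above: X and Y correlated.
  p_xy = np.array([[0.4, 0.1],
                   [0.1, 0.4]])
  p_x = p_xy.sum(axis=1, keepdims=True)
  p_y = p_xy.sum(axis=0, keepdims=True)

  # Pointwise terms log2 p(x,y) / (p(x) p(y)): positive where the pair is
  # more likely than under independence, negative where it is less likely.
  pmi = np.log2(p_xy / (p_x * p_y))
  print(pmi)

  # Mutual information is the expectation of these terms under p(x,y).
  print(np.sum(p_xy * pmi))                    # matches I(X;Y) from the first sketch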

1 useful links

Created: 2021-09-14 Tue 21:44