
hammersley-clifford theorem

Any strictly positive probability distribution that satisfies the Markov property with respect to a graph is a Gibbs random field with respect to that graph, and vice versa.
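
Spelling out the two sides of the equivalence (these are the standard definitions; \(U\) stands for all the variables and \(\delta x\) for the neighbors of \(x\), as in the proof below): the (local) Markov property says that each variable is independent of all the others given its neighbors, and being a Gibbs random field means that the distribution factorizes over the cliques of the graph: \[ P(x \mid U \setminus \{x\}) = P(x \mid \delta x) \quad \text{for every variable } x, \qquad Pr(U) = \prod_{C\ \text{a clique}} \phi_C(C) \] for some non-negative functions \(\phi_C\), one per clique \(C\).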

1 proof:

Proof from wikipedia with my commentary

1.1 mrf \(\Rightarrow\) grf

1.1.1 lemma 1

  • Let \(U\) be all the variables under consideration.
  • Let \(\Theta, \Phi_1, \ldots, \Phi_n \subseteq U\) and \(\Psi_1, \ldots, \Psi_m \subseteq U\) be arbitrary sets of variables
  • Here, when applicable, \(X\) will also refer to an (arbitrary) assignment to the variables \(X\)

If \[ Pr(U) = f(\Theta)\prod_{i=1}^{n}g_i(\Phi_i) = \prod_{j=1}^{m}h_j(\Psi_j) \] for functions \(f\), \(g_i\) and \(h_j\), then there exist functions \(g_i'\) and \(h_j'\) such that: \[ Pr(U) = \prod_{j=1}^{m}h_j'(\Theta \cap \Psi_j)\prod_{i=1}^{n}g_i'(\Phi_i) \]

In other words, the \(h_j\) give a template for further factoring \(f\).
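
As a tiny concrete instance (my own example, not from the Wikipedia proof): take \(U = \{a, b, c\}\) and suppose \[ Pr(U) = f(a,b)\,g_1(c) = h_1(a,c)\,h_2(b,c) \] Here \(\Theta = \{a,b\}\), \(\Phi_1 = \{c\}\), \(\Psi_1 = \{a,c\}\), \(\Psi_2 = \{b,c\}\), so the lemma produces \[ Pr(U) = h_1'(\Theta \cap \Psi_1)\,h_2'(\Theta \cap \Psi_2)\,g_1'(\Phi_1) = h_1'(a)\,h_2'(b)\,g_1'(c) \] i.e. in this case the lemma tells us that the two factorizations can only coexist if \(a\), \(b\) and \(c\) are in fact mutually independent.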

Now, with lemma 1, we can finish the proof. The local Markov property tells us that for any variable \(x\), we can write:

\begin{equation} Pr(U) = f_{x}(x,\delta x)f_{-x}(U\setminus\{x\}) \label{factor} \end{equation}

where \(\delta x\) are the neighbors of \(x\). Namely: \[ Pr(U) = \underbrace{P(\delta x)P(x \mid \delta x)}_{f_{x}} \underbrace{P(U\setminus(\{x\} \cup \delta x) \mid \delta x)}_{f_{-x}} \] (the last factor only needs to be conditioned on \(\delta x\) because, by the local Markov property, the remaining variables are independent of \(x\) given \(\delta x\)).

Then, let's repeatedly apply lemma 1 to Equation \ref{factor}. The goal is to end up with an expression that only contains functions over cliques. Let's take a look at what happens when our template is: \[ Pr(U) = f_{y}(y, \delta y) f_{-y}(U\setminus\{y\}) \] Then, by lemma 1 (taking \(\Theta = \{x\} \cup \delta x\)), applying this template to Equation \ref{factor} gets us: \[ f_{x,y}'(\{x,\delta x\} \cap \{y, \delta y\})\,h'(\{x,\delta x\} \cap (U\setminus \{y\}))\,f_{-x}'(U\setminus \{x\}) \] Since we get to choose which template to apply, let's pick \(y\) to be a node in the same maximal clique as \(x\).

To take stock:

  • \(f_{x,y}'\) is a function over \(\{x,\delta x\} \cap \{y, \delta y\}\), i.e. over variables that are (closed) neighbors of both \(x\) and \(y\)
  • \(h'\) is now a function over \(\{x,\delta x\} \setminus \{y\}\)
  • \(f_{-x}'\) is still a function over \(U \setminus \{x\}\)

Here is the key realization:

  • Let's recursively take templates and factor each one of the terms
  • so for \(f_{x,y}'\), let's continue to factor it, i.e. by using the templates for \(z\), where \(z\) is some node in the same maximal clique as \(x\) and \(y\)
  • once that first term consists only of nodes in a maximal clique, we can move on to the next term and rinse and repeat
  • how do we know that this process will terminate? because the set of variables a term depends on strictly shrinks every time we factor it, and it can only shrink finitely many times (see the sketch after this list)
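
Here is a small sketch of that bookkeeping in code (my own illustration, not taken from the Wikipedia proof; it tracks only the set of variables each factor depends on, and uses one concrete rule for picking the next template). Splitting a factor over a set \(S\) with the template for a node \(u\), i.e. lemma 1 with \(\Psi_1 = \{u\} \cup \delta u\) and \(\Psi_2 = U \setminus \{u\}\), replaces \(S\) by \(S \cap (\{u\} \cup \delta u)\) and \(S \setminus \{u\}\); if \(S\) is not already contained in a clique, \(u\) can be chosen so that both pieces are strictly smaller.

  def closed_nbhd(u, adj):
      # {u} together with its neighbors, i.e. the scope of the template for u
      return {u} | adj[u]

  def is_clique(S, adj):
      # S lies in a clique exactly when its members are pairwise adjacent
      return all(v in adj[u] for u in S for v in S if u != v)

  def factor_into_cliques(S, adj):
      """Recursively split the variable set S until every piece lies in a clique."""
      if is_clique(S, adj):
          return [S]
      # pick u in S whose closed neighborhood does not cover all of S
      # (such a u exists because S is not a clique), so both pieces below shrink
      u = next(u for u in S if not S <= closed_nbhd(u, adj))
      return (factor_into_cliques(S & closed_nbhd(u, adj), adj)   # misses some non-neighbor of u
              + factor_into_cliques(S - {u}, adj))                # misses u

  # example: the 4-cycle a-b-c-d-a, starting from the local Markov factorization at a
  adj = {'a': {'b', 'd'}, 'b': {'a', 'c'}, 'c': {'b', 'd'}, 'd': {'a', 'c'}}
  U = set(adj)
  for S in ({'a'} | adj['a'], U - {'a'}):
      print(sorted(S), '->', [sorted(piece) for piece in factor_into_cliques(S, adj)])

On the 4-cycle this splits \(\{a,b,d\}\) into \(\{a,b\}\) and \(\{a,d\}\), and \(\{b,c,d\}\) into \(\{b,c\}\) and \(\{c,d\}\): exactly the maximal cliques (the edges) of the cycle.
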
1.1.1.1 proof of lemma 1

Let \(\bar\theta\) denote a particular (fixed) assignment to the variables \(U\setminus \Theta\). For an arbitrary set of variables \(X\), let \(\bar\theta[X]\) denote the assignment that \(\bar\theta\) gives to those variables of \(X\) that fall outside \(\Theta\).

Then, we will write \[ Pr(U) = f(\Theta)\prod_{i=1}^{n}g_i(\Phi_i) \] as \[ Pr(U) = f(\Theta)\prod_{i=1}^{n}g_i(\Phi_i \cap \Theta, \bar\theta[\Phi_i])\prod_{i=1}^{n}\frac{g_i(\Phi_i)}{g_i(\Phi_i\cap\Theta, \bar\theta[\Phi_i])} \] where \(g_i(\Phi_i \cap \Theta, \bar\theta[\Phi_i])\) is the function \(g_i\) with the variables in \(\Phi_i\) that lie outside \(\Theta\) fixed to their values under \(\bar\theta\). Then, let:

  • \(f'(\Theta) = f(\Theta)\prod_{i=1}^{n}g_i(\Theta\cap\Phi_i, \bar\theta[\Phi_i])\)
  • \(g_i'(\Phi_i) = \frac{g_i(\Phi_i)}{g_i(\Phi_i \cap \Theta, \bar\theta[\Phi_i])}\)

Then, \[ Pr(U) = f'(\Theta)\prod_{i=1}^{n} g_i'(\Phi_i) = \prod_{j=1}^{m}h_j(\Psi_j) \]

Now, note that \(g_i'(\Phi_i) = 1\) whenever the variables of \(\Phi_i\) that lie outside \(\Theta\) take exactly the values assigned by \(\bar\theta\). So, let's see what happens when we fix the assignments outside \(\Theta\) to be exactly the assignments in \(\bar\theta\). Then, we have:

\[ Pr(\Theta, \bar\theta) = f'(\Theta)\prod_{i=1}^n 1 = \prod_{j=1}^{m} h_j(\Psi_j\cap \Theta, \bar\theta[\Psi_j]) \] So, \[ f'(\Theta) = \prod_{j=1}^{m} h_j(\Psi_j\cap \Theta, \bar\theta[\Psi_j]) \] Then, let's set \[ h'_j(\Psi_j \cap \Theta) = h_j(\Psi_j \cap \Theta, \bar\theta[\Psi_j]) \] So, now we can finally say: \[ Pr(U) =\left( \prod_{j=1}^{m}h_j'(\Psi_j \cap \Theta)\right) \left(\prod_{i=1}^{n} g_i'(\Phi_i) \right) \]
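
To sanity-check the construction, here is a small numerical sketch (again my own, reusing the three-variable instance from the lemma's statement above; the particular tables are arbitrary). It builds \(g_1'\), \(h_1'\) and \(h_2'\) exactly as defined in this proof and checks the resulting factorization against \(Pr(U)\) on every assignment.

  import itertools
  import numpy as np

  # the instance from above: U = (a, b, c), all binary,
  # Pr(U) = f(a,b) g1(c) = h1(a,c) h2(b,c), which forces a, b, c independent,
  # so build Pr from three arbitrary marginals
  pa, pb, pc = np.array([0.3, 0.7]), np.array([0.6, 0.4]), np.array([0.2, 0.8])

  f  = lambda a, b: pa[a] * pb[b]      # f(Theta),   Theta = {a, b}
  g1 = lambda c: pc[c]                 # g_1(Phi_1), Phi_1 = {c}
  h1 = lambda a, c: pa[a] * pc[c]      # h_1(Psi_1), Psi_1 = {a, c}
  h2 = lambda b, c: pb[b]              # h_2(Psi_2), Psi_2 = {b, c}
  Pr = lambda a, b, c: f(a, b) * g1(c)

  c_bar = 0                            # reference assignment \bar\theta (only c lies outside Theta)

  # the functions constructed in the proof
  g1_prime = lambda c: g1(c) / g1(c_bar)   # g_i' = g_i divided by g_i with outside-Theta variables fixed
  h1_prime = lambda a: h1(a, c_bar)        # h_j' = h_j with outside-Theta variables fixed to \bar\theta
  h2_prime = lambda b: h2(b, c_bar)

  for a, b, c in itertools.product([0, 1], repeat=3):
      assert np.isclose(Pr(a, b, c), h1_prime(a) * h2_prime(b) * g1_prime(c))
  print("Pr(U) = h1'(a) h2'(b) g1'(c) on all 8 assignments")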

1.2 grf \(\Rightarrow\) mrf

Consider two sets of nodes \(X\) and \(Y\) separated by a set of nodes \(S\): every path from \(X\) to \(Y\) passes through \(S\). No clique can contain a node from \(X\)'s side and a node from \(Y\)'s side of \(S\), since those two nodes would have to be adjacent, giving a path around \(S\). So once all the separating variables have been fixed, the clique factorization breaks into two neat pieces: one piece whose terms only involve \(X\)'s side (and \(S\)) and one piece that only involves \(Y\)'s side (and \(S\)). Hence \(X\) and \(Y\) are conditionally independent given \(S\), which is the Markov property.
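
A quick numerical illustration of this (my own sketch, with arbitrary potentials): take the path graph \(a - s - b\), so \(s\) separates \(a\) from \(b\), define a Gibbs random field from clique potentials on the two edges, and check that \(a\) and \(b\) are independent once \(s\) is fixed.

  import numpy as np

  rng = np.random.default_rng(0)
  phi = rng.random((2, 2)) + 0.1      # phi[a, s] > 0, potential on the clique {a, s}
  psi = rng.random((2, 2)) + 0.1      # psi[s, b] > 0, potential on the clique {s, b}

  P = np.einsum('as,sb->asb', phi, psi)
  P /= P.sum()                        # joint Pr(a, s, b) of the Gibbs random field

  for s in (0, 1):
      cond = P[:, s, :] / P[:, s, :].sum()        # Pr(a, b | s)
      pa = cond.sum(axis=1, keepdims=True)        # Pr(a | s)
      pb = cond.sum(axis=0, keepdims=True)        # Pr(b | s)
      assert np.allclose(cond, pa * pb)           # a and b are independent given s
  print("a and b are conditionally independent given the separator s")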

Created: 2021-09-14 Tue 21:44