
value iteration

Iteratively improve an estimate of the optimal value function, then extract the policy it induces.

Finding the best value function is very similar to the policy evaluation step of policy iteration, except that instead of following a fixed policy we take a max over actions. Use the Bellman optimality equation to update the value function at a given state. Iteratively, set: \[ v(s) \leftarrow \max_{a} \sum_{s'} p(s' \mid s, a) \left[R(s, a, s') + \gamma v(s') \right] \] for every state \(s\). Note that we are implicitly finding a policy at the same time, namely the policy of taking the \(\arg\max_{a}\) of the quantity on the RHS when in state \(s\).
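A minimal sketch of this update loop, assuming a tabular MDP where `P[s][a]` is a list of `(prob, next_state, reward)` tuples and where `gamma` and the stopping tolerance `theta` are chosen by hand (all of these names and the data layout are assumptions for illustration, not part of the note):

```python
import numpy as np

def value_iteration(P, n_states, n_actions, gamma=0.9, theta=1e-8):
    """Repeatedly apply the Bellman optimality backup until v stops changing."""
    v = np.zeros(n_states)
    while True:
        delta = 0.0
        for s in range(n_states):
            # Expected return of each action under the current value estimate.
            q = [sum(p * (r + gamma * v[s2]) for p, s2, r in P[s][a])
                 for a in range(n_actions)]
            best = max(q)                      # max over actions, per the update rule
            delta = max(delta, abs(best - v[s]))
            v[s] = best
        if delta < theta:                      # converged: updates are below tolerance
            return v
```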

Finally, once the value function has converged, we can extract a policy: \[ \pi(s) \leftarrow \arg\max_{a} \sum_{s'} p(s' \mid s, a) \left[R(s, a, s') + \gamma v(s') \right] \]
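Continuing the same sketch (with the same assumed `P[s][a]` layout), the extraction step is just a greedy one-step lookahead against the converged `v`:

```python
def extract_policy(P, v, n_states, n_actions, gamma=0.9):
    """Greedy policy with respect to the converged value function v."""
    pi = np.zeros(n_states, dtype=int)
    for s in range(n_states):
        q = [sum(p * (r + gamma * v[s2]) for p, s2, r in P[s][a])
             for a in range(n_actions)]
        pi[s] = int(np.argmax(q))              # arg max over actions
    return pi
```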

1 useful links

Created: 2021-09-14 Tue 21:44