
off-policy policy gradient

Recall that for Policy Gradient, the objective function is: \[ J(\theta) = \mathbb{E}_{\tau \sim \pi_{\theta}(\tau)} [R(\tau)] \]
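For reference, applying the log-derivative trick to this objective gives the usual on-policy (REINFORCE) gradient estimator: \[ \nabla_{\theta} J(\theta) = \mathbb{E}_{\tau \sim \pi_{\theta}(\tau)} \left[ \nabla_{\theta} \log \pi_{\theta}(\tau) \, R(\tau) \right] \]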

Now, what if, instead of sampling \(\tau\) according to the policy, we sampled according to some other behavior policy \(\beta_{\phi}(a \mid s)\)? Then, to keep the objective unchanged, we must reweight our samples, i.e. importance sampling: \[ J(\theta) = \mathbb{E}_{\tau \sim \beta_{\phi}(\tau)}\left[ \frac{\pi_{\theta}(\tau)}{\beta_{\phi}(\tau)} R(\tau)\right] \] At first glance the trajectory ratio looks intractable, since \(\pi_{\theta}(\tau)\) and \(\beta_{\phi}(\tau)\) both contain the (unknown) environment dynamics. But the initial-state and transition probabilities appear in both the numerator and the denominator and cancel, leaving only a product of per-step action-probability ratios: \[ \frac{\pi_{\theta}(\tau)}{\beta_{\phi}(\tau)} = \frac{p(s_1) \prod_{t} \pi_{\theta}(a_t \mid s_t)\, p(s_{t+1} \mid s_t, a_t)}{p(s_1) \prod_{t} \beta_{\phi}(a_t \mid s_t)\, p(s_{t+1} \mid s_t, a_t)} = \prod_{t} \frac{\pi_{\theta}(a_t \mid s_t)}{\beta_{\phi}(a_t \mid s_t)} \]
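To make the estimator concrete, here is a minimal PyTorch sketch of the importance-weighted gradient. Everything below (the tabular policies, the trajectory data, the returns) is invented for illustration; minimizing the surrogate loss ascends \(J(\theta)\).

import torch

torch.manual_seed(0)

n_states, n_actions = 4, 2
theta = torch.zeros(n_states, n_actions, requires_grad=True)  # target policy logits
beta_logits = torch.randn(n_states, n_actions)                # fixed behavior policy logits

def log_probs(logits, states, actions):
    """Log-probability of each action taken along a trajectory."""
    logp = torch.log_softmax(logits[states], dim=-1)
    return logp[torch.arange(len(actions)), actions]

# Pretend these trajectories were collected by acting with beta in the environment:
# each entry is (states, actions, return R(tau)).
trajectories = [
    (torch.tensor([0, 1, 2]), torch.tensor([1, 0, 1]), 3.0),
    (torch.tensor([0, 3, 1]), torch.tensor([0, 0, 1]), -1.0),
]

loss = 0.0
for states, actions, ret in trajectories:
    logp_pi = log_probs(theta, states, actions)
    logp_beta = log_probs(beta_logits, states, actions)
    # Importance weight: the dynamics cancel, so only the
    # per-step action-probability ratios remain.
    # detach() makes the gradient flow through log pi only,
    # giving the standard off-policy REINFORCE estimator.
    w = torch.exp((logp_pi - logp_beta).sum()).detach()
    loss = loss - w * logp_pi.sum() * ret
loss = loss / len(trajectories)

loss.backward()
print(theta.grad)  # importance-weighted estimate of -grad J(theta)

Note the design choice of detaching the weight: differentiating \(w \, R(\tau)\) with respect to \(\theta\) gives \(w \, \nabla_{\theta} \log \pi_{\theta}(\tau) \, R(\tau)\), which is exactly what the surrogate loss above implements.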

1 Useful links

Created: 2021-09-14 Tue 21:44