Skip to content

Connections between
Policy Gradient and Zeroth-order Methods

by me, Harshat Kumar, George J. Pappas, and Alejandro Ribeiro

published on August 7 2020

This is my first (and hopefully not the last) blog-type technical article, and our attempt (with Harshat, George, and Alejandro) to look closer into intrinsic connections between Policy Gradient (PG) (both stochastic and deterministic) and zeroth-order methods in non-convex stochastic optimization. To the best of our knowledge, such connections, although very natural (as we will see below), have not been explored before, and this is why I have decided to discuss them in this post.

Put simply, what this article shows is that, under basic and natural assumptions, policy gradients are zeroth-order quasigradients of the gradient associated with the original stochastic program at hand, which of course implies that policy gradient methods are themselves instances of standard, general purpose zeroth-order methods, specialized in a Reinforcement Learning (RL) setting.

Further, an additional byproduct of our treatment is to show explicitly that, in fact, classical stochastic policy gradient (even with a baseline) results in quasigradients that are too inexact (relative to the true gradient of the original objective), and that stochastic policy gradient injects too much exploration noise during policy learning. This eventually results in both poor average performance and performance margins of unnecessarily high variance, even when randomness in the policy is discarded during policy evaluation; the latter scheme is a very popular heuristic in practice.

Then, a rather plausible way to mitigate those issues is to fully depart from adopting stochastic policies for policy learning, which are integral to (stochastic) policy gradients, and instead explore the possibility of developing a primitive zeroth-order policy gradient method based on the very well-known Deterministic Policy Gradient (DPG) theorem of [Silver et al., 2014], which is originally fully gradient-based and thus, in a sense, model-based, because it requires the full gradient of the associated Q-function in order to be invoked. This fact motivates Zeroth-order Deterministic Policy Gradient (ZDPG), which we have recently developed in [Kumar et al., 2020] (download here) and which essentially provides a model-free version of the framework set by Silver et al. in their seminal paper. The following table provides a basic classification of the possible actor-only PG approaches, and where the various resulting algorithms stand.

Policy\Method Model-Based (Q-Gradient-Based) Model-Free
Stochastic ? (Baseline) PG [2000]
Deterministic DPG [2014] ZDPG [2020]

In particular, in ZDPG, parameter exploration is efficiently achieved by using a purely deterministic policy together with a minimal amount of low-dimensional action-space exploration noise, which is intuitively and rigorously justified from the perspective of stochastic zeroth-order optimization. As we later show explicitly in this article, the zeroth-order quasigradients associated with ZDPG induce much less and better-controlled approximation error relative to the true deterministic policy gradient corresponding to the initial dynamic program, as compared to not only stochastic PG, but also Baseline PG (PG-B), for a number of standard baseline choices. Our discussion justifies our experimental results presented in [Kumar et al., 2020].

Remark. The question mark in the table above corresponds to the class of (actor-only) model-based methods involving stochastic policies, which do not seem to be very useful at least in regard to the problem setting considered in this article, since we are interested in model-free RL.

To read more, click and expand the titles below.

–– A Standardized Problem Setting

Let us now introduce more formally the stochastic program of interest in this article, closely following our intro section in [Kumar et al., 2020].

We will consider a standardized RL problem, in which an agent moves through a continuous and compact state space \mathcal{S}\subset \mathbb{R}^q, and takes actions in a finite dimensional action space \mathcal{A}=\mathbb{R}^p. The agent’s task is to accumulate as much reward as possible in the long term, where the reward is revealed to the agent by the environment at each step. This generic problem may be abstracted by a Markov decision process (MDP) as a tuple (\mathcal{S}, \mathcal{A}, P, R, \gamma), where R:\mathcal{S}\times \mathbb{R}^p \to \mathbb{R} denotes the reward function, and P_{s\to s'}^{a}:= p(s'\vert (s,a) \in \mathcal{S} \times \mathcal{A}) determines the probability of moving to state s' starting from state s and action a, satisfying the Markov property

    \[ p(s_{t+1}\vert (s_u,a_u) \in \mathcal{S} \times \mathcal{A}, \forall u \leq t) = p(s_{t+1}\vert (s_t,a_t) \in \mathcal{S} \times \mathcal{A}). \]

The value \gamma \in (0,1) is the discount factor which determines how much future rewards matter to the behavior of the agent. Lastly, the policy \pi defines a mapping from a state to an action, deterministic or randomized, depending on the approach taken (will be made clear below). Then, the problem is to choose a policy that maximizes the expected discounted sum of rewards over all starting states; this means finding a policy that maximizes the expected value function V: \mathcal{S} \to \mathbb{R} starting from state s or, formally,

(1)   \begin{equation*}  \max_{\pi \in \Pi} \mathbb{E}_{s\sim\rho^0} \left\{V(s) \hspace{-2pt}:=\hspace{-1pt} \mathbb{E} \left\{\sum_{t = 0}^\infty \gamma^t R(s_t,a_t \hspace{-1pt} \sim \hspace{-1pt} \pi(\cdot | s_t)) \Bigg| s \right\} \right\}, \end{equation*}

where \rho^0 is the initial state distribution and \Pi is the class of policies that we are interested in; this can be either the class of all deterministic admissible policies for the particular problem, in which case the random measure \pi(\cdot | s_t) trivializes to a function \pi(s_t), or an appropriate class of randomized policies.

By conditioning the value function on an action, we can define the state-action function Q: \mathcal{S} \times \mathcal{A} \to \mathbb{R} by

    \[ \begin{split} Q^\pi(s,a)&\triangleq R(s,a) + \gamma \mathbb{E}_{s'\sim p(\cdot|s,a), a' \sim \pi(\cdot|s')} \left\{ Q^{\pi}(s', a')\right\} \\ &=R(s,a) + \gamma \mathbb{E}_{s'\sim p(\cdot|s,a)} \left\{ \mathbb{E}_{a' \sim \pi(\cdot|s')} \left\{ Q^{\pi}(s', a') | s'\right\} \right\} \\ &= \mathbb{E} \hspace{-2pt} \left\{ \hspace{-1pt} R(s_0, a_0) \hspace{-1.5pt}+\hspace{-1.5pt} \sum_{t=1}^\infty \hspace{-1pt} \gamma^t R(s_t,a_t \sim \pi(\cdot | s_t)) \Bigg| s_0 \hspace{-1pt}=\hspace{-1pt} s, a_0 \hspace{-1pt}=\hspace{-1pt} a\right\}\hspace{-2pt}.\hspace{-1pt} \end{split} \]

The Q-function determines the quality of taking an action a while being at state s and then following policy \pi for the remainder of time. By selecting the first action according to the policy and integrating, the (integrated) Q-function becomes equivalent to the value function, and problem (1) is equivalently written as

(2)   \begin{equation*}  \max_{\pi \in \Pi} \mathbb{E}_{s\sim \rho^0, a \sim \pi(\cdot|s)} \left\{Q^{\pi}(s,a)\right\}. \end{equation*}

Apparently, solving (2) requires searching over an infinite dimensional space (a space of probability measures in the case of randomized policies and a space of functions in the case of deterministic policies), which is generally an intractable task. To circumvent this complexity, the policy is parameterized by some \theta \in \Theta \subset \mathbb{R}^d so that the search is over a Euclidean space of finite dimension. The problem of interest consequently becomes

(3)   \begin{equation*}  \max_{\theta \in \Theta } \left[ \mathbb{E}_{s\sim \rho^0, a \sim \pi_{\theta}(\cdot|s)} \left\{Q^{\pi_{\theta}}(s,a)\right\} \triangleq J(\theta) \right]. \end{equation*}

When the search is over deterministic policy parameterizations, in particular, the above problem takes the simpler form

(4)   \begin{equation*}  \max_{\theta \in \Theta } \left[ \mathbb{E}_{s\sim \rho^0} \left[Q^{\pi_{\theta}}(s,\pi_{\theta}(s))\right] \triangleq J(\theta) \right]. \end{equation*}

Although it might seem that searching within a class of randomized policies could be advantageous over searching within a class of deterministic policies, this is in fact not the case at least in unconstrained RL, because for every randomized policy (and under standard regularity conditions), it is possible to find another deterministic policy that achieves at least the same (average) performance (this is really a standard result). Additionally, a deterministic optimal policy is much more desirable, because it induces the smallest amount of higher-order statistical variability possible in the resulting random cumulative reward (still resulting, of course, in optimal mean performance).

Therefore, in the following, our base problem (i.e., the one what we are primarily interested in) will be (4). Even if we may sometimes focus on (3) (i.e., randomized policies), the latter should be implicitly considered as a proxy for solving (4). This is perfectly in line with the common practice of policy derandomization during policy evaluation, that is, when evaluating performance after a (near-)optimal (stochastic) policy has been learned.

–– Gaussian Policies in Stochastic Policy Gradient (PG)

Let us start with the (stochastic) policy gradient theorem, which states that, given a stationary stochastic policy \pi_{\theta}^{o} (a random probability measure on the action space {\cal A}, conditioned on some random state s\in{\cal S}, and admitting a strictly positive density), the gradient of the \theta-parameterized objective function J may be represented as

    \[ \nabla J_{o}(\theta)=\dfrac{1}{1-\gamma}\mathbb{E}_{s\sim\rho^{\pi_{\theta}^{o}},a\sim\pi_{\theta}^{o}}\big\{\nabla_{\theta}\log\pi_{\theta}^{o}(a|s)Q^{\pi_{\theta}^{o}}(s,a)\big\},\quad\forall\theta\in\Theta, \]

where \rho^{\pi_{\theta}^{o}} constitutes the (proper) discounted state distribution associated with the stochastic policy \pi_{\theta}^{o}. In somewhat more precise mathematical language, this expression may be rewritten, for every \theta\in\Theta, as

    \[ \nabla J_{o}(\theta)=\dfrac{1}{1-\gamma}\mathbb{E}_{s\sim\rho^{\pi_{\theta}^{o}}}\big\{\mathbb{E}_{a\sim\pi_{\theta}^{o}(\cdot|s)}\big\{\nabla_{\theta}\log\pi_{\theta}^{o}(a|s)Q^{\pi_{\theta}^{o}}(s,a)|s\big\}\big\}. \]

Note that the product \pi_{\theta}^{o}(a|s)\rho^{\pi_{\theta}^{o}}(s) is also well-known as the discounted action-space distribution associated with \pi_{\theta}^{o}.

As most often done both in theoretical developments and in practice, let us assume that, for every s\in{\cal S}, \pi_{\theta}^{o}(\cdot|s) is chosen as a Gaussian random element with mean \pi_{\theta}(s) (a known deterministic policy parameterization) and constant variance \Sigma\triangleq\mu^{2}I, for some \mu>0. In other words, highlighting the dependence on \mu,

    \[ \pi_{\theta}^{\mu}(\cdot|s)={\cal N}(\pi_{\theta}(s),\mu^{2}I). \]

Then, it follows that, for every a\in{\cal A},

    \[ \log\pi_{\theta}^{\mu}(a|s)\propto-\dfrac{\Vert a-\pi_{\theta}(s)\Vert^{2}}{2\mu^{2}}. \]

Therefore, it is true that, for every a\in{\cal A},

    \[ \nabla_{\theta}\log\pi_{\theta}^{\mu}(a|s)=\nabla_{\theta}\pi_{\theta}(s)\dfrac{a-\pi_{\theta}(s)}{\mu^{2}}. \]

Plugging into the expression for the (stochastic) policy gradient, we have

    \[ \nabla J_{\mu}(\theta)=\dfrac{1}{1-\gamma}\mathbb{E}_{s\sim\rho^{\pi_{\theta}^{\mu}}}\bigg\{\nabla_{\theta}\pi_{\theta}(s)\mathbb{E}_{a\sim\pi_{\theta}^{\mu}(\cdot|s)}\bigg\{\dfrac{a-\pi_{\theta}(s)}{\mu^{2}}Q^{\pi_{\theta}^{\mu}}(s,a)\bigg|s\bigg\}\hspace*{-1bp}\hspace*{-1bp}\bigg\}. \]

Regarding the inner expectation (given s), we may now write

    \begin{flalign*} & \hspace{-10bp} \mathbb{E}_{a\sim\pi_{\theta}^{\mu}(\cdot|s)}\bigg\{\dfrac{a-\pi_{\theta}(s)}{\mu^{2}}Q^{\pi_{\theta}^{\mu}}(s,a)\bigg|s\bigg\} \\ & =\dfrac{1}{\mu^{2}}\int(a-\pi_{\theta}(s))Q^{\pi_{\theta}^{\mu}}(s,a){\cal N}(\pi_{\theta}(s),\mu^{2}I)\mathrm{d}a\\ & =\dfrac{1}{\mu^{2}}\int uQ^{\pi_{\theta}^{\mu}}(s,u+\pi_{\theta}(s)){\cal N}(0,\mu^{2}I)\mathrm{d}u\\ & =\dfrac{1}{\mu}\int uQ^{\pi_{\theta}^{\mu}}(s,\mu u+\pi_{\theta}(s)){\cal N}(0,I)\mathrm{d}u\\ & \equiv\mathbb{E}_{u\sim{\cal N}(0,I)}\bigg\{\dfrac{Q^{\pi_{\theta}^{\mu}}(s,\pi_{\theta}(s)+\mu u)}{\mu}u\bigg|s\bigg\}\\ & \equiv-\mathbb{E}_{u\sim{\cal N}(0,I)}\bigg\{\dfrac{Q^{\pi_{\theta}^{\mu}}(s,\pi_{\theta}(s)-\mu u)}{\mu}u\bigg|s\bigg\}. \end{flalign*}

Also note that

(5)   \begin{flalign*} & \hspace{-10bp} \mathbb{E}_{u\sim{\cal N}(0,I)}\bigg\{\dfrac{Q^{\pi_{\theta}^{\mu}}(s,\pi_{\theta}(s)+\mu u)}{\mu}u\bigg|s\bigg\}  \\ & \equiv\mathbb{E}_{u\sim{\cal N}(0,I)}\bigg\{\dfrac{Q^{\pi_{\theta}^{\mu}}(s,\pi_{\theta}(s)+\mu u)-Q^{\pi_{\theta}^{\mu}}(s,\pi_{\theta}(s))}{\mu}u\bigg|s\bigg\}  \\ & \equiv\mathbb{E}_{u\sim{\cal N}(0,I)}\bigg\{\dfrac{Q^{\pi_{\theta}^{\mu}}(s,\pi_{\theta}(s)+\mu u)-Q^{\pi_{\theta}^{\mu}}(s,\pi_{\theta}(s)-\mu u)}{2\mu}u\bigg|s\bigg\}.  \end{flalign*}

What we have shown here is that the part of the expression of the policy gradient which is not related to the gradient of the deterministic part of the involved Gaussian policy (i.e., its mean) may be exactly represented as an average of stochastic finite differences of the associated Q-function. This fact has important consequences, in terms of not only interpretability, but also algorithm design and performance analysis, as we will see below.

–– Baselines: Stochastic, Deterministic and In-Between

An example of the usefulness of the discussion above is that it can be used in designing and interpretably analyzing baseline policy gradient methods (at least when restricting our attention to Gaussian policies).

Indeed, note that it is possible to also write

    \begin{flalign*} & \hspace{-10bp} \mathbb{E}_{u\sim{\cal N}(0,I)}\bigg\{\dfrac{Q^{\pi_{\theta}^{\mu}}(s,\pi_{\theta}(s)+\mu u)}{\mu}u\bigg|s\bigg\} \\ & \equiv\mathbb{E}_{u\sim{\cal N}(0,I)}\bigg\{\dfrac{Q^{\pi_{\theta}^{\mu}}(s,\pi_{\theta}(s)+\mu u)-Q^{\pi_{\theta}^{\mu}}(s,\pi_{\theta}(s)+\mu u')}{\mu}u\bigg|s,u'\bigg\}, \end{flalign*}

where u' and u are iid and, trivially,

    \begin{flalign*} & \hspace{-10bp} \mathbb{E}_{u\sim{\cal N}(0,I)}\bigg\{\dfrac{Q^{\pi_{\theta}^{\mu}}(s,\pi_{\theta}(s)+\mu u')}{\mu}u\bigg|s,u'\bigg\} \\ & \equiv\dfrac{Q^{\pi_{\theta}^{\mu}}(s,\pi_{\theta}(s)+\mu u')}{\mu}\mathbb{E}\{u\}\\ & \triangleq \dfrac{\widetilde{V}^{\pi_{\theta}^{\mu}}(s)}{\mu}\mathbb{E}\{u\}(\equiv0)\\ & \equiv\widetilde{V}^{\pi_{\theta}^{\mu}}(s)\mathbb{E}_{a\sim\pi_{\theta}^{\mu}(\cdot|s)}\bigg\{\dfrac{a-\pi_{\theta}(s)}{\mu^{2}}\bigg|s\bigg\}. \end{flalign*}

This gives exactly policy gradient with a stochastic baseline (neat, right?)!

Of course, we may also write

    \begin{flalign*} & \hspace{-10bp} \mathbb{E}_{u\sim{\cal N}(0,I)}\bigg\{\dfrac{Q^{\pi_{\theta}^{\mu}}(s,\pi_{\theta}(s)+\mu u)}{\mu}u\bigg|s\bigg\}  \\ & \equiv\mathbb{E}_{u\sim{\cal N}(0,I)}\bigg\{\dfrac{Q^{\pi_{\theta}^{\mu}}(s,\pi_{\theta}(s)+\mu u)-V^{\pi_{\theta}^{\mu}}(s)}{\mu}u\bigg|s\bigg\}, \end{flalign*}

where V^{\pi_{\theta}^{\mu}}(s)=\mathbb{E}\{Q^{\pi_{\theta}^{\mu}}(s,\pi_{\theta}(s)+\mu u')|s\} is the corresponding value function at state s, giving exactly the classical (i.e., deterministic) baseline policy gradient.

We could even look at variances. To do this, we will also assume that the Q-function is L-Lipschitz in the sense that, for every (a_{1},a_{2})\in{\cal A}\times{\cal A},

    \[ \sup_{s\in{\cal S}}\sup_{\pi\in\Pi}|Q^{\pi}(s,a_{1})-Q^{\pi}(s,a_{2})|\le L \Vert a_{1}-a_{2}\Vert, \]

where \Pi denotes the whole class of admissible stochastic policies considered in our formulation.

Let us take baseline policy gradient first. We have

    \begin{align*} & \hspace*{-1bp}\hspace*{-1bp}\hspace*{-1bp}\hspace*{-1bp}\hspace*{-1bp}\hspace*{-1bp}\hspace*{-1bp}\hspace*{-1bp}\hspace*{-1bp}\hspace*{-1bp}\hspace*{-1bp}\hspace*{-1bp}\mathbb{E}_{u\sim{\cal N}(0,I)}\bigg\{\dfrac{Q^{\pi_{\theta}^{\mu}}(s,\pi_{\theta}(s)+\mu u)-V^{\pi_{\theta}^{\mu}}(s)}{\mu}u\bigg|s\bigg\}\\ & \equiv\mathbb{E}\bigg\{\dfrac{\mathbb{E}\{Q^{\pi_{\theta}^{\mu}}(s,\pi_{\theta}(s)+\mu u)-Q^{\pi_{\theta}^{\mu}}(s,\pi_{\theta}(s)+\mu u')|s,u\}}{\mu}u\bigg|s\bigg\}\\ & \equiv\mathbb{E}\bigg\{\mathbb{E}\bigg\{\dfrac{Q^{\pi_{\theta}^{\mu}}(s,\pi_{\theta}(s)+\mu u)-Q^{\pi_{\theta}^{\mu}}(s,\pi_{\theta}(s)+\mu u')}{\mu}u\bigg|s,u\bigg\}\bigg|s\bigg\}\\ & \equiv\mathbb{E}_{u,u'}\bigg\{\dfrac{Q^{\pi_{\theta}^{\mu}}(s,\pi_{\theta}(s)+\mu u)-Q^{\pi_{\theta}^{\mu}}(s,\pi_{\theta}(s)+\mu u')}{\mu}u\bigg|s\bigg\}. \end{align*}

As a result,

    \begin{align*} & \hspace*{-1bp}\hspace*{-1bp}\hspace*{-1bp}\hspace*{-1bp}\hspace*{-1bp}\hspace*{-1bp}\hspace*{-1bp}\hspace*{-1bp}\hspace*{-1bp}\hspace*{-1bp}\hspace*{-1bp}\hspace*{-1bp}\mathbb{E}_{u\sim{\cal N}(0,I)}\bigg\{\bigg\Vert\dfrac{Q^{\pi_{\theta}^{\mu}}(s,\pi_{\theta}(s)+\mu u)-V^{\pi_{\theta}^{\mu}}(s)}{\mu}u\bigg\Vert^{2}\bigg|s\bigg\}\\ & \equiv\dfrac{1}{\mu^{2}}\mathbb{E}\{|Q^{\pi_{\theta}^{\mu}}(s,\pi_{\theta}(s)+\mu u)-Q^{\pi_{\theta}^{\mu}}(s,\pi_{\theta}(s)+\mu u')|^{2}\Vert u\Vert^{2}|s\}\\ & \le L^{2}\mathbb{E}\{\Vert u-u'\Vert^{2}\Vert u\Vert^{2}\}\\ & \equiv L^{2}\mathbb{E}\{(\Vert u'\Vert^{2}-2u^{\top}u'+\Vert u\Vert^{2})\Vert u\Vert^{2}\}\\ & \equiv L^{2}\mathbb{E}\{\Vert u'\Vert^{2}\Vert u\Vert^{2}\}+L^{2}\mathbb{E}\{\Vert u\Vert^{4}\}\\ & \le L^{2}p^{2}+L^{2}(p+4)^{2}. \end{align*}

This yields

    \[ \mathbb{E}\bigg\{\bigg\Vert\dfrac{Q^{\pi_{\theta}^{\mu}}(s,\pi_{\theta}(s)+\mu u)-V^{\pi_{\theta}^{\mu}}(s)}{\mu}u\bigg\Vert^{2}\bigg\}\le L^{2}p^{2}+L^{2}(p+4)^{2}. \]

Similarly, for the stochastic baseline we have

    \begin{align*} & \hspace*{-1bp}\hspace*{-1bp}\hspace*{-1bp}\hspace*{-1bp}\hspace*{-1bp}\hspace*{-1bp}\hspace*{-1bp}\hspace*{-1bp}\hspace*{-1bp}\hspace*{-1bp}\hspace*{-1bp}\hspace*{-1bp}\mathbb{E}_{u\sim{\cal N}(0,I)}\bigg\{\dfrac{Q^{\pi_{\theta}^{\mu}}(s,\pi_{\theta}(s)+\mu u)-Q^{\pi_{\theta}^{\mu}}(s,\pi_{\theta}(s)+\mu u')}{\mu}u\bigg|s,u'\bigg\}\\ & \equiv\dfrac{1}{\mu}\mathbb{E}\{|Q^{\pi_{\theta}^{\mu}}(s,\pi_{\theta}(s)+\mu u)-Q^{\pi_{\theta}^{\mu}}(s,\pi_{\theta}(s)+\mu u')|^{2}\Vert u\Vert^{2}|s,u'\}\\ & \le L^{2}\mathbb{E}\{\Vert u-u'\Vert^{2}\Vert u\Vert^{2}|u'\}\\ & \equiv L^{2}\mathbb{E}\{(\Vert u'\Vert^{2}-2u^{\top}u'+\Vert u\Vert^{2})\Vert u\Vert^{2}|u'\}\\ & \equiv L^{2}\Vert u'\Vert^{2}\mathbb{E}\{\Vert u\Vert^{2}\}+L^{2}\mathbb{E}\{\Vert u\Vert^{4}\}\\ & \le L^{2}\Vert u'\Vert^{2}p+L^{2}(p+4)^{2}. \end{align*}

This also yields

    \begin{align*} & \hspace{-12pt} \mathbb{E}\bigg\{\bigg\Vert\dfrac{Q^{\pi_{\theta}^{\mu}}(s,\pi_{\theta}(s)+\mu u)-Q^{\pi_{\theta}^{\mu}}(s,\pi_{\theta}(s)+\mu u')}{\mu}u\bigg\Vert^{2}\bigg\} \\ & \le L^{2}p^{2}+L^{2}(p+4)^{2}. \end{align*}

At this point, it would be interesting to note that, even if we attempted variance reduction via defining

    \[ \widetilde{V}^{\pi_{\theta}^{\mu}}(s)\triangleq\dfrac{1}{N}\sum_{n=1}^{N}Q^{\pi_{\theta}^{\mu}}(s,\pi_{\theta}(s)+\mu u_{n}'), \]

the result would be precisely the same, since

    \begin{align*} & \mathbb{E}_{u\sim{\cal N}(0,I)}\bigg\{\bigg\Vert\dfrac{Q^{\pi_{\theta}^{\mu}}(s,\pi_{\theta}(s)+\mu u)-\widetilde{V}^{\pi_{\theta}^{\mu}}(s)}{\mu}u\bigg\Vert^{2}\bigg|s,\{u_{n}'\}_{n}\bigg\}\\ & \equiv\dfrac{1}{N^{2}\mu^{2}}\mathbb{E}\bigg\{\bigg|\sum_{n=1}^{N}Q^{\pi_{\theta}^{\mu}}(s,\pi_{\theta}(s) \hspace*{-1bp}+ \mu u) \hspace*{-1bp}-\hspace*{-1bp} Q^{\pi_{\theta}^{\mu}}(s,\pi_{\theta}(s) \hspace*{-1bp}+\hspace*{-1bp} \mu u_{n}')\bigg|^{2}\Vert u\Vert^{2}\bigg|s, \hspace*{-1bp} \{u_{n}'\}_{n}\bigg\}\\ & \le\dfrac{L^{2}}{N^{2}}\mathbb{E}\bigg\{\bigg|\sum_{n=1}^{N}\Vert u-u_{n}'\Vert\bigg|^{2}\Vert u\Vert^{2}\bigg|\{u_{n}'\}_{n}\bigg\}\\ & \le\dfrac{L^{2}}{N}\sum_{n=1}^{N}\mathbb{E}\{\Vert u-u_{n}'\Vert^{2}\Vert u\Vert^{2}|u_{n}'\}\\ & \le L^{2}p\dfrac{1}{N}\sum_{n=1}^{N}\Vert u'\Vert^{2}+L^{2}(p+4)^{2}. \end{align*}

We note that here the bounds might be somewhat loose, but they exploit the same known facts as in the derivations above. What remains here would be to evaluate the resulting policy gradient baselines empirically, to confirm whether they indeed exhibit similar empirical performance. We have partially done that in our recent preprint Zeroth-order Deterministic Policy Gradient (ZDPG), which is available for download on this website. Put simply, our results showed that, at least on the problem setting considered in our experiments, baseline policy gradient exhibits similar performance either one or more sample rollouts (i.e., N) are considered in the baseline evaluation (more rollouts imply that the resulting stochastic baseline is closer to the true –that is, deterministic– baseline V^{\pi_{\theta}^{\mu}} –that is, the value function–).

–– Stochastic PG in Deterministic PG Form, and ZDPG

Now, note that under some conditions (Lipschitzness –as above– or smoothness of the Q-function, for instance), we may write (you may take a look at our paper on ZDPG that we have referred to above to see why this should be true)

    \[ \mathbb{E}_{u\sim{\cal N}(0,I)}\bigg\{\dfrac{Q^{\pi_{\theta}^{\mu}}(s,\pi_{\theta}(s)+\mu u)}{\mu}u\bigg|s\bigg\}{\equiv}\nabla_{a}Q_{\mu}^{\pi_{\theta}^{\mu}}(s,a)|_{a=\pi_{\theta}(s)}. \]

In that case, we end up with the equivalent representation (again, neat, right?)

    \[ \boxed{\nabla J_{\mu}(\theta)\overset{!}{=}\dfrac{1}{1-\gamma}\mathbb{E}_{s\sim\rho^{\pi_{\theta}^{\mu}}}\big\{\nabla_{\theta}\pi_{\theta}(s)\nabla_{a}Q_{\mu}^{\pi_{\theta}^{\mu}}(s,a)|_{a=\pi_{\theta}(s)}\hspace*{-1bp}\big\}.} \]

This of course coincides with the deterministic policy gradient representation of [Silver et al., 2014], but for the stochastic policy gradient setup with Gaussian parameterization and exploration parameter \mu. Additionally, the relation above holds also for \mu\equiv0 identically (from [Silver et al., 2014]), giving, very carefully,

    \[ \underline{\nabla J_{0}(\theta)=\dfrac{1}{1-\gamma}\mathbb{E}_{s\sim\rho^{\pi_{\theta}}}\big\{\nabla_{\theta}\pi_{\theta}(s)\nabla_{a}Q^{\pi_{\theta}}(s,a)|_{a=\pi_{\theta}(s)}\hspace*{-1bp}\big\}.} \]

Observe the difference in the superscripts of the Q-function, as well as the different discounted state distributions. It should be now kind of obvious that \nabla J_{\mu} may be thought of as a quasigradient of \nabla J_{0}, which is the exact gradient of our infinite horizon dynamic program, where the search is over all admissible deterministic policies parameterized by \theta.

At this point, it is worth mentioning that, in our work on ZDPG, we instead define the quasigradient

    \[ \underline{\hat{\nabla}J_{\mu}(\theta)=\dfrac{1}{1-\gamma}\mathbb{E}_{s\sim\rho^{\pi_{\theta}}}\big\{\nabla_{\theta}\pi_{\theta}(s)\nabla_{a}Q_{\mu}^{\pi_{\theta}}(s,a)|_{a=\pi_{\theta}(s)}\hspace*{-1bp}\big\},} \]

which, as we have shown, is provably a uniform approximation to \nabla J_{0} (in fact, under weaker Lipschitz assumptions on the associated Q-function).

But what about \nabla J_{\mu}?  Well, let’s see (simply add and substract terms):

    \begin{flalign*} & \hspace*{-1bp}\hspace*{-1bp}\hspace*{-1bp}\hspace*{-1bp}\hspace*{-1bp}\hspace*{-1bp}\hspace*{-1bp}\hspace*{-1bp}\hspace*{-1bp}\hspace*{-1bp}\hspace*{-1bp}\hspace*{-1bp}(1-\gamma)\nabla J_{\mu}(\theta)-(1-\gamma)\nabla J_{0}(\theta)\\ & \equiv\mathbb{E}_{s\sim\rho^{\pi_{\theta}^{\mu}}}\big\{\nabla_{\theta}\pi_{\theta}(s)\nabla_{a}Q_{\mu}^{\pi_{\theta}^{\mu}}(s,a)|_{a=\pi_{\theta}(s)}\hspace*{-1bp}\big\}\\ & \quad\quad-\mathbb{E}_{s\sim\rho^{\pi_{\theta}^{\mu}}}\big\{\nabla_{\theta}\pi_{\theta}(s)\nabla_{a}Q^{\pi_{\theta}}(s,a)|_{a=\pi_{\theta}(s)}\hspace*{-1bp}\big\}\\ & \quad\quad+\mathbb{E}_{s\sim\rho^{\pi_{\theta}^{\mu}}}\big\{\nabla_{\theta}\pi_{\theta}(s)\nabla_{a}Q^{\pi_{\theta}}(s,a)|_{a=\pi_{\theta}(s)}\hspace*{-1bp}\big\}\\ & \quad\quad\quad-\mathbb{E}_{s\sim\rho^{\pi_{\theta}}}\big\{\nabla_{\theta}\pi_{\theta}(s)\nabla_{a}Q^{\pi_{\theta}}(s,a)|_{a=\pi_{\theta}(s)}\hspace*{-1bp}\big\}\\ & \equiv\mathbb{E}_{s\sim\rho^{\pi_{\theta}^{\mu}}}\big\{\nabla_{\theta}\pi_{\theta}(s)(\nabla_{a}Q_{\mu}^{\pi_{\theta}^{\mu}}(s,a)-\nabla_{a}Q^{\pi_{\theta}}(s,a))|_{a=\pi_{\theta}(s)}\hspace*{-1bp}\big\}\\ & \quad\quad+\mathbb{E}_{s\sim\rho^{\pi_{\theta}^{\mu}}}\big\{\nabla_{\theta}\pi_{\theta}(s)\nabla_{a}Q^{\pi_{\theta}}(s,a)|_{a=\pi_{\theta}(s)}\hspace*{-1bp}\big\}\\ & \quad\quad-\mathbb{E}_{s\sim\rho^{\pi_{\theta}}}\big\{\nabla_{\theta}\pi_{\theta}(s)\nabla_{a}Q^{\pi_{\theta}}(s,a)|_{a=\pi_{\theta}(s)}\hspace*{-1bp}\big\}\\ & \equiv\mathbb{E}_{s\sim\rho^{\pi_{\theta}^{\mu}}}\big\{\nabla_{\theta}\pi_{\theta}(s)(\nabla_{a}Q_{{{\color{magenta}{\mu}}}}^{\pi_{\theta}^{\mu}}(s,a)-\nabla_{a}Q^{\pi_{\theta}^{\mu}}(s,a))|_{a=\pi_{\theta}(s)}\hspace*{-1bp}\big\}\\ & \quad\quad+\mathbb{E}_{s\sim\rho^{\pi_{\theta}^{\mu}}}\big\{\nabla_{\theta}\pi_{\theta}(s)(\nabla_{a}Q^{\pi_{\theta}^{{\color{magenta}{\mu}}}}(s,a)-\nabla_{a}Q^{\pi_{\theta}}(s,a))|_{a=\pi_{\theta}(s)}\hspace*{-1bp}\big\}\\ & \quad\quad\quad+\mathbb{E}_{s\sim\rho^{\pi_{\theta}^{{\color{magenta}{\mu}}}}}\big\{\nabla_{\theta}\pi_{\theta}(s)\nabla_{a}Q^{\pi_{\theta}}(s,a)|_{a=\pi_{\theta}(s)}\hspace*{-1bp}\big\}\\ & \quad\quad\quad-\mathbb{E}_{s\sim\rho^{\pi_{\theta}}}\big\{\nabla_{\theta}\pi_{\theta}(s)\nabla_{a}Q^{\pi_{\theta}}(s,a)|_{a=\pi_{\theta}(s)}\hspace*{-1bp}\big\}. \end{flalign*}

We can therefore identify three sources or error in the quasigradient \nabla J_{\mu}:

  • Perturbation stability induced error, due to the term

        \[ $\nabla_{a}Q_{{\color{magenta}{\mu}}}^{\pi_{\theta}^{\mu}}(s,a)-\nabla_{a}Q^{\pi_{\theta}^{\mu}}(s,a)$. \]

  • System/Policy stability induced error, due to the term 

        \[ $\nabla_{a}Q^{\pi_{\theta}^{{\color{magenta}{\mu}}}}(s,a)-\nabla_{a}Q^{\pi_{\theta}}(s,a)$. \]

  • Weak (distributional) stability induced error, due to the term

        \[ $\mathbb{E}_{s\sim\rho^{\pi_{\theta}^{{\color{magenta}{\mu}}}}}\{\cdots\}-\mathbb{E}_{s\sim\rho^{\pi_{\theta}}}\{\cdots\}$. \]

Again, it is noteworthy that in our work on ZDPG, gradient approximation errors are solely due to perturbations, i.e., all three latter terms in the right-hand side of the expression above do not appear in the analysis.

It turns out that this difference in the approximation quality of the two quasigradients just discussed (\nabla J_{\mu} and that of ZDPG, \hat{\nabla}J_{\mu}) translates into rather significant effects with regard to empirical performance. We would like to very briefly demonstrate this through a numerical comparison between stochastic PG and ZDPG on the inverted pendulum problem, which, although deterministic, is a standard benchmark for testing RL algorithms.

In the plots, “ZDPG” and ZDPG-S” are ZDPG variants that employ non-symmetric and symmetric stochastic finite differences, respectively, very similarly to the second and third line of (5), respectively, but in the zeroth-order setting. Also, PG-B denotes the standard stochastic PG method, with the stochastic baseline \widetilde{V}^{\pi_{\theta}^{\mu}}, as this is defined in our discussion about baselines above. For all experiments, the forgetting factor is chosen as \gamma=0.99, and algorithm hyperparameters such as stepsizes (chosen constant), exploration variances, smoothing parameters, etc. (where applicable) have been “optimally” tuned via trial-and-error. Further, in the left plot we see results without any kind of minibatching, whereas in the right plot we employ full gradient minibatches of size 10. For policy evaluation we have used deterministic policies (“mean” learned policy for PG-B),  and cumulative rewards are collected for 200 steps (problem stages), whose mean and variance are shown, computed over 50 independent trials for each method. For more details on the algorithms, see our paper on ZDPG [Kumar et al., 2020].

We observe that ZDPG-S substantially outperforms PG-B in both plots, not only in terms of average performance, but also in terms of statistical variability. This shows that, really, injection of unnecessary exploration noise into the policy learning procedure can actually make performance worse, both on average and in terms of margins.

On a side note, also observe that minibatching actually hampered the performance of PG-B; this is probably due to the method being stuck at local minima or flat regions, together with using a constant stepsize. Of course, ZDPG-S is also prone to “sticking” (problem is non-convex anyways), but much less so.

–– Main Conclusions

Some fruitful (in my humble opinion) remarks of our discussion follow:

  • It provides a new interpretation of stochastic policy gradient (with Gaussian parameterization) as a smoothing technique, which produces zeroth-order gradient approximations (i.e., quasigradients) enabling model-free (and actor-only) policy training. In fact, given the initial, standardized stochastic program (4) where the search is over deterministic policies, we may define the smoothed problem (a smoothed surrogate) (3) by augmenting the associated feasible set to include randomized Gaussian policies. This new surrogate program then provides the ground for the development of multiple zeroth-order techniques for model-free RL within a unified framework (e.g., PG with baselines, etc.).
  • It shows that stochastic and deterministic policy gradients are fundamentally connected. For Gaussian policies, stochastic policy gradient may be written in the same form as deterministic policy gradient, with the only (but significant) difference being in the policy that is followed by the associated Q-function.
  • The latter fact reveals that the quasigradients the original problem (4) resulting from stochastic policy gradient are indeed rather inexact, and that, indeed, it is possible to come up with much more efficient quasigradients, which induce less error and less variability in the policy learning process. One such example is provably the quasigradient exploited by ZDPG in [Kumar et al., 2020], which empirically outperforms standard (baseline) PG by a substantial amount.

Thanks for reading; comments are most welcome!


[Silver et al., 2014] D. Silver, G. Lever, N. Heess, T. Degris, D. Wierstra, and M. Riedmiller, “Deterministic Policy Gradient Algorithms,” In Proceedings of the 31st International Conference on International Conference on Machine Learning – Volume 32, ICML’14, page I–387–I–395., 2014.

[Kumar et al., 2020] H. Kumar, D. S. Kalogerias, A. Khan, G. J. Pappas, and A. Ribeiro, “Zeroth-order Deterministic Policy Gradient,” arXiv preprint, arXiv:2006.07314, 2020.