*by me, **Harshat Kumar**, **George J. Pappas**, and **Alejandro Ribeiro***

*published on **August 7, 2020***

This is my first (and hopefully not the last) blog-type technical article, and our attempt (with Harshat, George, and Alejandro) to look closer into intrinsic connections between Policy Gradient (PG) (both stochastic and deterministic) and zeroth-order methods in non-convex stochastic optimization. To the best of our knowledge, such connections, although very natural (as we will see below), have not been explored before, and this is why I have decided to discuss them in this post.

Put simply, what this article shows is that, under basic and natural assumptions, **policy gradients are zeroth-order quasigradients** relative to the gradient of the original stochastic program at hand, which of course implies that

**policy gradient methods**, specialized to a Reinforcement Learning (RL) setting, *are themselves* instances of standard, general-purpose zeroth-order methods.

Further, an additional byproduct of our treatment is to show explicitly that, in fact, **classical stochastic policy gradient (even with a baseline) results in quasigradients that are too inexact** (relative to the true gradient of the original objective), and that **stochastic policy gradient injects too much exploration noise during policy learning**. This eventually results in **both** poor average performance **and** performance margins of unnecessarily high variance, even when randomness in the policy is discarded during policy evaluation; the latter scheme is a very popular heuristic in practice.

Then, a rather plausible way to mitigate those issues is to fully depart from adopting stochastic policies for policy learning, which are integral to (stochastic) policy gradients, and instead explore the possibility of developing a **primitive** zeroth-order policy gradient method based on the very well-known **Deterministic** Policy Gradient (DPG) theorem of [Silver *et al.*, 2014], which is originally **fully gradient-based** and thus, in a sense, **model-based**, because it requires the full gradient of the associated Q-function in order to be invoked. This fact motivates **Zeroth-order Deterministic Policy Gradient (ZDPG)**, which we have recently developed in [Kumar *et al.*, 2020] and which essentially provides a model-free version of the framework set by Silver *et al.* in their seminal paper. The following table provides a basic classification of the possible *actor-only* **PG** approaches, and where the various resulting algorithms stand.

| Policy \ Method | Model-Based (Gradient-Based) | Model-Free |
| --- | --- | --- |
| Stochastic | ? | (Baseline) PG [2000] |
| Deterministic | DPG [2014] | ZDPG [2020] |

In particular, in **ZDPG**, parameter exploration is efficiently achieved by using a purely deterministic policy together with a minimal amount of low-dimensional action-space exploration noise, which is intuitively and rigorously justified from the perspective of stochastic zeroth-order optimization. As we later show explicitly in this article, the zeroth-order quasigradients associated with **ZDPG** induce **much less and better-controlled approximation error** relative to the true deterministic policy gradient corresponding to the initial dynamic program, as compared to not only stochastic **PG**, but also **Baseline PG (PG-B)**, for a number of standard baseline choices. Our discussion justifies our experimental results presented in [Kumar *et al.*, 2020].

**Remark.** The question mark in the table above corresponds to the class of (actor-only) model-based methods involving stochastic policies, which do not seem to be very useful at least in regard to the problem setting considered in this article, since we are interested in model-free RL.


**–– A Standardized Problem Setting**

The material below closely follows [Kumar *et al.*, 2020].

We will consider a **standardized RL problem**, in which an agent moves through a continuous and compact state space $\mathcal{S}$ and takes actions in a finite-dimensional action space $\mathcal{A}$. The agent's task is to accumulate as much reward as possible in the long term, where the reward is revealed to the agent by the environment at each step. This generic problem may be abstracted by a Markov decision process (MDP) as a tuple $(\mathcal{S}, \mathcal{A}, r, P, \gamma)$, where $r : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ denotes the reward function, and $P$ determines the probability of moving to state $s_{t+1}$ starting from state $s_t$ and action $a_t$, satisfying the Markov property

$$P(s_{t+1} \,|\, s_t, a_t, \ldots, s_0, a_0) \equiv P(s_{t+1} \,|\, s_t, a_t).$$

The value $\gamma \in (0,1)$ is the discount factor, which determines how much future rewards matter to the behavior of the agent. Lastly, the policy $\pi$ defines a mapping from a state to an action, *deterministic or randomized*, depending on the approach taken (this will be made clear below). Then, the problem is to choose a policy that maximizes the expected discounted sum of rewards over all starting states; this means finding a policy that maximizes the expected value function starting from state $s_0$ or, formally,

$$\max_{\pi \in \Pi} \; \mathbb{E}_{s_0 \sim \rho}\!\left[ V^{\pi}(s_0) \right], \quad \text{where} \quad V^{\pi}(s) \triangleq \mathbb{E}\!\left[ \left. \sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, a_t) \,\right| s_0 = s \right], \qquad (1)$$

where $\rho$ is the initial state distribution and $\Pi$ is the class of policies that we are interested in; this can be either the class of all deterministic admissible policies for the particular problem, in which case the random measure $\pi(\cdot \,|\, s)$ trivializes to a function $\mu : \mathcal{S} \to \mathcal{A}$, or an appropriate class of randomized policies.

By conditioning the value function on an action, we can define the state-action (Q-) function by

$$Q^{\pi}(s, a) \triangleq \mathbb{E}\!\left[ \left. \sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, a_t) \,\right| s_0 = s,\ a_0 = a \right].$$

The Q-function determines the quality of taking an action $a$ while being at state $s$ and then following policy $\pi$ for the remainder of time. By selecting the first action according to the policy and integrating, the (integrated) Q-function becomes equivalent to the value function, and problem (1) is equivalently written as

$$\max_{\pi \in \Pi} \; \mathbb{E}_{s_0 \sim \rho}\!\left[ \int_{\mathcal{A}} Q^{\pi}(s_0, a)\, \pi(\mathrm{d}a \,|\, s_0) \right]. \qquad (2)$$
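As a quick numerical sanity check of the equivalence between the integrated Q-function and the value function, here is a minimal sketch on a made-up two-state, two-action MDP; all transition probabilities, rewards, and the policy below are arbitrary illustrative choices, not from the article:

```python
import numpy as np

# Made-up 2-state, 2-action MDP; gamma, P, r and pi are arbitrary.
gamma = 0.9
P = np.array([[[0.8, 0.2], [0.3, 0.7]],
              [[0.5, 0.5], [0.1, 0.9]]])   # P[s, a, s']
r = np.array([[1.0, 0.0], [0.5, 2.0]])     # r[s, a]
pi = np.array([[0.6, 0.4], [0.2, 0.8]])    # pi[s, a] = probability of action a at state s

# Policy evaluation via the Bellman fixed point for Q^pi:
# Q(s, a) = r(s, a) + gamma * sum_{s'} P(s'|s, a) * V(s'),
# where V(s') = sum_{a'} pi(a'|s') * Q(s', a') integrates out the first action.
Q = np.zeros((2, 2))
for _ in range(2000):
    V = (pi * Q).sum(axis=1)   # "integrated" Q-function = value function
    Q = r + gamma * (P @ V)

V = (pi * Q).sum(axis=1)
print(V)   # expected value function of pi at each state
```

After convergence, integrating the first action out of the Q-function reproduces the value function exactly, which is the equivalence used to pass from (1) to (2).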

Apparently, solving (2) requires searching over an infinite-dimensional space (a space of probability measures in the case of randomized policies and a space of functions in the case of deterministic policies), which is generally an intractable task. To circumvent this complexity, the policy is parameterized by some $\theta \in \mathbb{R}^{d}$, so that the search is over a Euclidean space of finite dimension. The problem of interest consequently becomes

$$\max_{\theta \in \mathbb{R}^{d}} \; J(\theta) \triangleq \mathbb{E}_{s_0 \sim \rho}\!\left[ \int_{\mathcal{A}} Q^{\pi_\theta}(s_0, a)\, \pi_\theta(\mathrm{d}a \,|\, s_0) \right]. \qquad (3)$$

When the search is over deterministic policy parameterizations $\mu_\theta$, in particular, the above problem takes the simpler form

$$\max_{\theta \in \mathbb{R}^{d}} \; J_D(\theta) \triangleq \mathbb{E}_{s_0 \sim \rho}\!\left[ Q^{\mu_\theta}(s_0, \mu_\theta(s_0)) \right]. \qquad (4)$$

Although it might seem that searching within a class of randomized policies could be advantageous over searching within a class of deterministic policies, this is in fact **not the case, at least in unconstrained RL**, because for every randomized policy (and under standard regularity conditions), it is possible to find another deterministic policy that achieves at least the same (average) performance (this is really a standard result). Additionally, a deterministic optimal policy is much more desirable, because it induces the smallest amount of higher-order statistical variability possible in the resulting random cumulative reward (still resulting, of course, in optimal mean performance).

Therefore, in the following, our **base problem** (i.e., the one that we are primarily interested in) will be (4). Even if we may sometimes focus on (3) (i.e., randomized policies), the latter should be implicitly considered as **a proxy for solving (4)**. This is perfectly in line with the common practice of **policy derandomization during policy evaluation**, that is, when evaluating performance after a (near-)optimal (stochastic) policy has been learned.

**–– Gaussian Policies in Stochastic Policy Gradient (PG)**

By the classical policy gradient theorem, the gradient of the objective of (3) admits the representation

$$\nabla J(\theta) = \mathbb{E}_{s \sim \rho^{\pi_\theta}}\!\left[ \int_{\mathcal{A}} Q^{\pi_\theta}(s, a)\, \nabla_\theta \log \pi_\theta(a \,|\, s)\, \pi_\theta(\mathrm{d}a \,|\, s) \right],$$

where $\rho^{\pi_\theta}$ constitutes the (proper) discounted state distribution associated with the stochastic policy $\pi_\theta$. In somewhat more precise mathematical language, this expression may be rewritten, for every $\theta$, as

$$\nabla J(\theta) = \int_{\mathcal{S} \times \mathcal{A}} Q^{\pi_\theta}(s, a)\, \nabla_\theta \log \pi_\theta(a \,|\, s)\, \big( \rho^{\pi_\theta} \otimes \pi_\theta \big)(\mathrm{d}s, \mathrm{d}a).$$

Note that the product $\rho^{\pi_\theta} \otimes \pi_\theta$ is also well known as the discounted action-space distribution associated with $\pi_\theta$.

As most often done both in theoretical developments and in practice, let us assume that, for every $s$, $\pi_\theta(\cdot \,|\, s)$ is chosen as a Gaussian random element with mean $\mu_\theta(s)$ (a known deterministic policy parameterization) and constant covariance $\sigma^{2} I$, for some $\sigma > 0$. In other words, highlighting the dependence on $\mu_\theta$,

$$a = \mu_\theta(s) + \sigma \xi, \quad \text{with} \quad \xi \sim \mathcal{N}(0, I).$$

Then, it follows that, for every $(s, a)$,

$$\log \pi_\theta(a \,|\, s) = -\frac{\| a - \mu_\theta(s) \|_2^2}{2 \sigma^{2}} + \mathrm{const}.$$

Therefore, it is true that, for every $(s, a)$,

$$\nabla_\theta \log \pi_\theta(a \,|\, s) = \nabla_\theta \mu_\theta(s)\, \frac{a - \mu_\theta(s)}{\sigma^{2}} \equiv \nabla_\theta \mu_\theta(s)\, \frac{\xi}{\sigma}.$$
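As a quick check of this score expression, the snippet below compares the analytic formula against a numerical derivative, for a made-up scalar linear parameterization $\mu_\theta(s) = \theta s$ (the parameterization and all numbers are assumptions for illustration only):

```python
import math

# Gaussian log-density log pi_theta(a|s) for the made-up mean mu_theta(s) = theta * s.
def log_pi(theta, s, a, sigma):
    mu = theta * s
    return -((a - mu) ** 2) / (2 * sigma ** 2) - math.log(sigma * math.sqrt(2 * math.pi))

theta, s, a, sigma, h = 0.7, 1.3, 1.0, 0.2, 1e-6

# Analytic score: (d mu / d theta) * (a - mu_theta(s)) / sigma^2, with d mu / d theta = s.
analytic = s * (a - theta * s) / sigma ** 2
# Central finite difference of the log-density in theta.
numeric = (log_pi(theta + h, s, a, sigma) - log_pi(theta - h, s, a, sigma)) / (2 * h)
print(analytic, numeric)   # the two values agree
```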

Plugging into the expression for the (stochastic) policy gradient, we have

$$\nabla J(\theta) = \mathbb{E}_{s \sim \rho^{\pi_\theta}}\!\left[ \nabla_\theta \mu_\theta(s)\, \mathbb{E}_{\xi}\!\left[ \frac{\xi}{\sigma}\, Q^{\pi_\theta}(s, \mu_\theta(s) + \sigma \xi) \right] \right].$$

Regarding the inner expectation (given $s$), we may now write, exploiting the fact that $\mathbb{E}[\xi] \equiv 0$,

$$\mathbb{E}_{\xi}\!\left[ \frac{\xi}{\sigma}\, Q^{\pi_\theta}(s, \mu_\theta(s) + \sigma \xi) \right] = \mathbb{E}_{\xi}\!\left[ \xi\, \frac{Q^{\pi_\theta}(s, \mu_\theta(s) + \sigma \xi) - Q^{\pi_\theta}(s, \mu_\theta(s))}{\sigma} \right].$$

Also note that, by the symmetry of the distribution of $\xi$, we obtain altogether

$$\begin{aligned} & \mathbb{E}_{\xi}\!\left[ \frac{\xi}{\sigma}\, Q^{\pi_\theta}(s, \mu_\theta(s) + \sigma \xi) \right] \\ &\quad = \mathbb{E}_{\xi}\!\left[ \xi\, \frac{Q^{\pi_\theta}(s, \mu_\theta(s) + \sigma \xi) - Q^{\pi_\theta}(s, \mu_\theta(s))}{\sigma} \right] \\ &\quad = \mathbb{E}_{\xi}\!\left[ \xi\, \frac{Q^{\pi_\theta}(s, \mu_\theta(s) + \sigma \xi) - Q^{\pi_\theta}(s, \mu_\theta(s) - \sigma \xi)}{2 \sigma} \right]. \end{aligned} \qquad (5)$$

What we have shown here is that the part of the expression of the policy gradient which is not related to the gradient of the deterministic part of the involved Gaussian policy (i.e., its mean) may be **exactly represented** as an average of stochastic finite differences of the associated -function. This fact has important consequences, in terms of not only interpretability, but also algorithm design and performance analysis, as we will see below.
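To make this representation concrete, here is a small Monte Carlo sanity check in one dimension, with a made-up quadratic stand-in $Q(a) = -a^2$ at a fixed state (so the quantity being estimated is the derivative $-2\mu$ at the mean action; all numbers are illustrative assumptions):

```python
import numpy as np

# One-dimensional Monte Carlo check of the three equivalent representations:
# naive score form, one-sided difference, and symmetric difference.
rng = np.random.default_rng(0)
mu, sigma, n = 0.5, 0.1, 1_000_000
Q = lambda a: -a ** 2                       # stand-in Q; d/da Q at mu is -2*mu = -1.0
xi = rng.standard_normal(n)

naive     = (xi / sigma) * Q(mu + sigma * xi)
one_sided = xi * (Q(mu + sigma * xi) - Q(mu)) / sigma
symmetric = xi * (Q(mu + sigma * xi) - Q(mu - sigma * xi)) / (2 * sigma)

for name, est in [("naive", naive), ("one-sided", one_sided), ("symmetric", symmetric)]:
    print(name, est.mean(), est.std())      # all three means are close to -1.0
```

All three estimators agree in expectation, but the finite-difference forms have visibly smaller standard deviation than the naive score form, which is exactly the variance phenomenon discussed later in this article.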

**–– Baselines: Stochastic, Deterministic and In-Between**

The finite-difference representation just derived also provides a new way of looking at **baseline** policy gradient methods (at least when restricting our attention to Gaussian policies).

Indeed, note that it is possible to also write

$$\mathbb{E}_{\xi}\!\left[ \frac{\xi}{\sigma}\, Q^{\pi_\theta}(s, \mu_\theta(s) + \sigma \xi) \right] = \mathbb{E}_{\xi, \xi'}\!\left[ \xi\, \frac{Q^{\pi_\theta}(s, \mu_\theta(s) + \sigma \xi) - Q^{\pi_\theta}(s, \mu_\theta(s) + \sigma \xi')}{\sigma} \right],$$

where $\xi$ and $\xi'$ are *iid* and, trivially,

$$\mathbb{E}_{\xi, \xi'}\!\left[ \frac{\xi}{\sigma}\, Q^{\pi_\theta}(s, \mu_\theta(s) + \sigma \xi') \right] \equiv 0.$$

This gives exactly policy gradient with a **stochastic baseline** (neat, right?)!

Of course, we may also write

$$\mathbb{E}_{\xi}\!\left[ \frac{\xi}{\sigma}\, Q^{\pi_\theta}(s, \mu_\theta(s) + \sigma \xi) \right] = \mathbb{E}_{\xi}\!\left[ \xi\, \frac{Q^{\pi_\theta}(s, \mu_\theta(s) + \sigma \xi) - V^{\pi_\theta}(s)}{\sigma} \right],$$

where $V^{\pi_\theta}(s) \equiv \mathbb{E}_{\xi'}\!\left[ Q^{\pi_\theta}(s, \mu_\theta(s) + \sigma \xi') \right]$ is the corresponding value function at state $s$, giving **exactly the classical (i.e., deterministic) baseline policy gradient**.

We could even look at variances. To do this, we will also assume that the Q-function is $L$-Lipschitz in its action argument, in the sense that, for every $\pi \in \Pi$, $s \in \mathcal{S}$, and $a, a' \in \mathcal{A}$,

$$\left| Q^{\pi}(s, a) - Q^{\pi}(s, a') \right| \le L\, \| a - a' \|_2,$$

where $\Pi$ denotes the whole class of admissible **stochastic** policies considered in our formulation.

Let us take baseline policy gradient first. We have

As a result,

This yields

Similarly, for the stochastic baseline we have

This also yields

At this point, it would be interesting to note that, even if we attempted variance reduction via defining

the result would be precisely the same, since

We note that the bounds here might be somewhat loose, but they exploit the same known facts as in the derivations above. What remains would be to evaluate the resulting policy gradient baselines empirically, to confirm whether they indeed exhibit similar empirical performance. We have partially done that in our **recent preprint on Zeroth-order Deterministic Policy Gradient (ZDPG)** [Kumar *et al.*, 2020]. Put simply, our results showed that, at least on the problem setting considered in our experiments, baseline policy gradient exhibits similar performance whether one or more sample rollouts are considered in the baseline evaluation (more rollouts imply that the resulting stochastic baseline is closer to the true, that is, deterministic, baseline: the value function).
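For a rough empirical feel of the stochastic-versus-deterministic-baseline comparison, the same made-up one-dimensional stand-in $Q(a) = -a^2$ at a fixed state can be reused; for this toy Q, the deterministic baseline (the "value function" $\mathbb{E}_{\xi'}[Q(\mu + \sigma \xi')]$) is exactly $-(\mu^2 + \sigma^2)$:

```python
import numpy as np

# Variance comparison: stochastic baseline (independent perturbation) versus
# deterministic baseline (exact value function), for the toy Q(a) = -a**2.
rng = np.random.default_rng(1)
mu, sigma, n = 0.5, 0.1, 1_000_000
Q = lambda a: -a ** 2
xi, xi2 = rng.standard_normal(n), rng.standard_normal(n)   # xi2: independent copy

V = -(mu ** 2 + sigma ** 2)    # exact E[Q(mu + sigma * xi')] for this toy Q
stoch_base = xi * (Q(mu + sigma * xi) - Q(mu + sigma * xi2)) / sigma
det_base   = xi * (Q(mu + sigma * xi) - V) / sigma

print(stoch_base.mean(), stoch_base.var())   # mean close to -1.0
print(det_base.mean(), det_base.var())       # mean close to -1.0, smaller variance
```

Both estimators are unbiased for the same quantity, and their variances are of the same order (here within a factor of two of each other), consistent with the similar empirical performance noted above.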

**–– Stochastic PG in Deterministic PG Form, and ZDPG**

Suppose now that the Q-function is differentiable with respect to the action variable (see our paper on **ZDPG** that we have referred to above to see why this should be true). In that case, we end up with the *equivalent representation* (again, neat, right?)

$$\nabla J(\theta) = \mathbb{E}_{s \sim \rho^{\pi_\theta}}\!\left[ \nabla_\theta \mu_\theta(s)\, \mathbb{E}_{\xi}\!\left[ \nabla_{a} Q^{\pi_\theta}(s, a) \big|_{a = \mu_\theta(s) + \sigma \xi} \right] \right].$$

This of course coincides with the deterministic policy gradient representation of [Silver *et al.*, 2014], but for the **stochastic** policy gradient setup with Gaussian parameterization and exploration parameter $\sigma$. Additionally, the relation above also holds for $\sigma \equiv 0$ identically (from [Silver *et al.*, 2014]), giving, *very carefully*,

$$\nabla J_D(\theta) = \mathbb{E}_{s \sim \rho^{\mu_\theta}}\!\left[ \nabla_\theta \mu_\theta(s)\, \nabla_{a} Q^{\mu_\theta}(s, a) \big|_{a = \mu_\theta(s)} \right].$$

Observe the difference in the superscripts of the Q-function, as well as the different discounted state distributions. It should now be kind of obvious that $\nabla J(\theta)$ may be **thought of as a quasigradient of** $\nabla J_D(\theta)$, which is the exact gradient of our infinite-horizon dynamic program, where the search is over all admissible *deterministic* policies parameterized by $\theta$.

At this point, it is worth mentioning that, in our work on **ZDPG**, we instead define the quasigradient

$$\widehat{\nabla} J_D(\theta) \triangleq \mathbb{E}_{s \sim \rho^{\mu_\theta}}\!\left[ \nabla_\theta \mu_\theta(s)\, \mathbb{E}_{\xi}\!\left[ \xi\, \frac{Q^{\mu_\theta}(s, \mu_\theta(s) + \sigma \xi) - Q^{\mu_\theta}(s, \mu_\theta(s))}{\sigma} \right] \right],$$

which, as we have shown, is provably **a uniform approximation to** $\nabla J_D(\theta)$ (in fact, under weaker Lipschitz assumptions on the associated Q-function).
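To illustrate the mechanism (this is not the actual ZDPG algorithm of [Kumar *et al.*, 2020], which operates on full MDP trajectories), here is a minimal zeroth-order quasigradient ascent on a made-up one-step problem, where the stand-in "Q-function" $Q(a) = -(a - 2)^2$ is queried only through function evaluations:

```python
import numpy as np

# Zeroth-order ascent with the symmetric two-point quasigradient, on a toy
# one-step problem: mu_theta = theta (scalar), Q(a) = -(a - 2)**2, optimum at 2.
rng = np.random.default_rng(2)
Q = lambda a: -(a - 2.0) ** 2
theta, sigma, step = 0.0, 0.05, 0.1

for _ in range(500):
    xi = rng.standard_normal()
    # symmetric stochastic finite difference, using only evaluations of Q
    g = xi * (Q(theta + sigma * xi) - Q(theta - sigma * xi)) / (2 * sigma)
    theta += step * g          # ascent along the quasigradient

print(theta)   # close to the optimal parameter 2.0
```

Note that no gradient of Q is ever computed: the update is driven purely by perturbed function evaluations, which is the sense in which such a scheme is model-free.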

**But what about** $\nabla J(\theta)$? Well, let's see (simply add and subtract terms):

$$\begin{aligned} \nabla J(\theta) - \nabla J_D(\theta) &= \mathbb{E}_{s \sim \rho^{\pi_\theta}}\!\left[ \nabla_\theta \mu_\theta(s)\, \mathbb{E}_{\xi}\!\left[ \nabla_{a} Q^{\pi_\theta}(s, a) \big|_{a = \mu_\theta(s) + \sigma \xi} - \nabla_{a} Q^{\pi_\theta}(s, a) \big|_{a = \mu_\theta(s)} \right] \right] \\ &\quad + \mathbb{E}_{s \sim \rho^{\pi_\theta}}\!\left[ \nabla_\theta \mu_\theta(s) \left( \nabla_{a} Q^{\pi_\theta}(s, a) \big|_{a = \mu_\theta(s)} - \nabla_{a} Q^{\mu_\theta}(s, a) \big|_{a = \mu_\theta(s)} \right) \right] \\ &\quad + \int_{\mathcal{S}} \nabla_\theta \mu_\theta(s)\, \nabla_{a} Q^{\mu_\theta}(s, a) \big|_{a = \mu_\theta(s)}\, \big( \rho^{\pi_\theta} - \rho^{\mu_\theta} \big)(\mathrm{d}s). \end{aligned}$$

We can therefore identify three sources of error in the quasigradient $\nabla J(\theta)$:

- **Perturbation stability**-induced error, due to the first term (the action gradient of the Q-function at perturbed versus unperturbed actions);
- **System/policy stability**-induced error, due to the second term ($Q^{\pi_\theta}$ versus $Q^{\mu_\theta}$);
- **Weak (distributional) stability**-induced error, due to the third term ($\rho^{\pi_\theta}$ versus $\rho^{\mu_\theta}$).

Again, it is noteworthy that in our work on **ZDPG**, **gradient approximation errors are solely due to perturbations**, i.e., all three latter terms in the right-hand side of the expression above do **not** appear in the analysis.

It turns out that this difference in the approximation quality of the two quasigradients just discussed ( and that of **ZDPG**, ) translates into rather significant effects with regard to empirical performance. We would like to very briefly demonstrate this through a numerical comparison between stochastic **PG** and **ZDPG** on the inverted pendulum problem, which, although deterministic, is a standard benchmark for testing RL algorithms.

In the plots, **"ZDPG"** and **"ZDPG-S"** are **ZDPG** variants that employ non-symmetric and symmetric stochastic finite differences, respectively, very similarly to the second and third lines of (5), but in the zeroth-order setting. Also, **PG-B** denotes the standard stochastic **PG** method with the stochastic baseline, as defined in our discussion about baselines above. For all experiments, the discount factor and algorithm hyperparameters such as stepsizes (chosen constant), exploration variances, smoothing parameters, etc. (where applicable) have been "optimally" tuned via trial and error. Further, in the left plot we see results without any kind of minibatching, whereas in the right plot we employ full-gradient minibatching. For policy evaluation we have used deterministic policies (the "mean" learned policy for **PG-B**), and cumulative rewards are collected over a fixed number of problem stages, whose mean and variance are shown, computed over independent trials for each method. For more details on the algorithms, see our paper on **ZDPG** [Kumar *et al.*, 2020].

We observe that **ZDPG-S** substantially outperforms **PG-B** in both plots, not only in terms of average performance, but also in terms of statistical variability. This shows that, really, injection of unnecessary exploration noise into the policy learning procedure can actually make performance worse, both on average and in terms of margins.

On a side note, also observe that minibatching actually hampered the performance of **PG-B**; this is probably due to the method being stuck at local minima or flat regions, together with using a constant stepsize. Of course, **ZDPG-S **is also prone to “sticking” (problem is non-convex anyways), but much less so.

**–– Main Conclusions**

- It provides a new **interpretation of stochastic policy gradient (with Gaussian parameterization) as a smoothing technique**, which produces zeroth-order gradient approximations (i.e., quasigradients) enabling model-free (and actor-only) policy training. In fact, given the initial, standardized stochastic program (4), where the search is over deterministic policies, we may define the **smoothed problem (a *smoothed surrogate*) (3)** by augmenting the associated feasible set to include randomized Gaussian policies. This new surrogate program then provides the ground for the development of multiple zeroth-order techniques for model-free RL within a unified framework (e.g., **PG** with baselines, etc.).

- It shows that stochastic and deterministic policy gradients are fundamentally connected. For Gaussian policies, **stochastic policy gradient may be written in the same form as deterministic policy gradient**, with the only (but significant) difference being in the policy that is followed by the associated Q-function.

- The latter fact reveals that the **quasigradients of the original problem (4) resulting from stochastic policy gradient are indeed rather inexact**, and that, indeed, **it is possible to come up with much more efficient quasigradients, which induce less error and less variability in the policy learning process**. One such example is provably the quasigradient exploited by **ZDPG** in [Kumar *et al.*, 2020], which empirically outperforms standard (baseline) **PG** by a substantial amount.

Thanks for reading; comments are most welcome!

**References**

[Silver *et al.*, 2014] D. Silver, G. Lever, N. Heess, T. Degris, D. Wierstra, and M. Riedmiller, "Deterministic Policy Gradient Algorithms," in Proceedings of the 31st International Conference on Machine Learning, ICML'14, pp. I-387–I-395, JMLR.org, 2014.

[Kumar *et al.*, 2020] H. Kumar, D. S. Kalogerias, A. Khan, G. J. Pappas, and A. Ribeiro, "Zeroth-order Deterministic Policy Gradient," arXiv preprint arXiv:2006.07314, 2020.