SAMPLE EFFICIENT POLICY GRADIENT METHODS WITH RECURSIVE VARIANCE REDUCTION
Improving the sample efficiency in reinforcement learning has been a longstanding research problem. In this work, we aim to reduce the sample complexity of existing policy gradient methods. We propose a novel policy gradient algorithm called SRVR-PG, which only requires $O(1/\epsilon^{3/2})$ episodes to find an $\epsilon$-approximate stationary point of the nonconcave performance function $J(\theta)$ (i.e., $\theta$ such that $\|\nabla J(\theta)\|_2^2 \leq \epsilon$). This sample complexity improves upon the existing result $O(1/\epsilon^{5/3})$ for stochastic variance-reduced policy gradient algorithms by a factor of $O(1/\epsilon^{1/6})$. In addition, we propose a variant of SRVR-PG with parameter exploration, which samples the initial policy parameter from a prior probability distribution. We conduct numerical experiments on classic control problems in reinforcement learning to validate the performance of our proposed algorithms.
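The abstract does not spell out the update rule, but the recursive variance reduction it refers to can be illustrated with a SARAH/SPIDER-style estimator applied to REINFORCE-style policy gradients. The sketch below is an assumption-based illustration rather than the paper's algorithm: `sample_episode`, `grad_log_policy`, and all batch sizes and step sizes are hypothetical placeholders, and any correction for the shift in trajectory distribution between consecutive iterates (e.g., importance weighting) is omitted.

```python
import numpy as np


def episode_gradient(theta, sample_episode, grad_log_policy, gamma=0.99):
    """REINFORCE-style gradient estimate from one sampled episode.

    sample_episode(theta) -> (states, actions, rewards) and
    grad_log_policy(theta, s, a) -> d/dtheta log pi_theta(a|s) are hypothetical callbacks.
    """
    states, actions, rewards = sample_episode(theta)
    grad = np.zeros_like(theta)
    ret = 0.0
    # Weight each score-function term by the discounted return that follows it.
    for t in reversed(range(len(rewards))):
        ret = rewards[t] + gamma * ret
        grad += ret * grad_log_policy(theta, states[t], actions[t])
    return grad


def srvr_pg_sketch(theta0, sample_episode, grad_log_policy,
                   outer_iters=10, inner_iters=5, batch_N=50, batch_B=10, lr=0.01):
    """Minimal sketch of a recursively variance-reduced policy gradient loop
    (gradient ascent on J); all hyperparameters are illustrative, not the paper's."""
    theta = np.asarray(theta0, dtype=float).copy()
    for _ in range(outer_iters):
        # Large-batch reference gradient estimate at the snapshot parameter.
        v = np.mean([episode_gradient(theta, sample_episode, grad_log_policy)
                     for _ in range(batch_N)], axis=0)
        theta_prev, theta = theta, theta + lr * v
        for _ in range(inner_iters):
            # Recursive correction: add a small batch of gradient differences between
            # the current and previous parameters to the running estimate v.
            # (A correction for the change of trajectory distribution between theta
            # and theta_prev is deliberately left out of this sketch.)
            diff = np.mean([episode_gradient(theta, sample_episode, grad_log_policy)
                            - episode_gradient(theta_prev, sample_episode, grad_log_policy)
                            for _ in range(batch_B)], axis=0)
            v = v + diff
            theta_prev, theta = theta, theta + lr * v
    return theta
```

The point of the recursion is that only the occasional reference step pays for a large batch of episodes, while each inner step updates the running estimate `v` with a small batch of gradient differences, which is the general mechanism by which recursive variance reduction lowers the number of sampled episodes.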