Global Convergence of Policy Gradient Methods for the Linear Quadratic Regulator
Direct policy gradient methods for reinforcement learning and continuous control problems are a popular approach for a variety of reasons: 1) they are easy to implement without explicit knowledge of the underlying model, 2) they are an "end-to-end" approach, directly optimizing the performance metric of interest, 3) they inherently allow for richly parameterized policies. A notable drawback is that even in the most basic continuous control problem (that of linear quadratic regulators), these methods must solve a non-convex optimization problem, where little is understood about their efficiency from both computational and statistical perspectives. In contrast, system identification and model based planning in optimal control theory have a much more solid theoretical footing, where much is known with regards to their computational and statistical properties. This work bridges this gap showing that (model free) policy gradient methods globally converge to the optimal solution and are efficient (polynomially so in relevant problem dependent quantities) with regards to their sample and computational complexities.
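As a rough illustration of the setting (not the paper's exact algorithm or constants), the sketch below runs a model-free policy gradient method on a small, hypothetical discrete-time LQR instance: the policy is a linear gain u_t = -K x_t, and the gradient of the finite-horizon cost with respect to K is estimated purely from rollouts via zeroth-order (random-perturbation) smoothing. The system matrices A, B, Q, R, the horizon, the smoothing radius, and the step size are all made-up choices for demonstration.

```python
import numpy as np

# Illustrative sketch: model-free policy gradient on a toy LQR instance.
# Dynamics x_{t+1} = A x_t + B u_t, cost sum_t (x_t' Q x_t + u_t' R u_t),
# linear policy u_t = -K x_t. All problem data and hyperparameters are
# hypothetical and chosen only for demonstration.

rng = np.random.default_rng(0)

A = np.array([[0.9, 0.2],
              [0.0, 0.8]])      # open-loop stable, so K = 0 is a valid starting gain
B = np.array([[0.0],
              [1.0]])
Q = np.eye(2)
R = np.eye(1)

HORIZON = 50
X0S = [rng.normal(size=(2, 1)) for _ in range(6)]   # fixed (common) initial states

def cost(K):
    """Average finite-horizon LQR cost of the policy u = -K x over the fixed x0's."""
    total = 0.0
    for x0 in X0S:
        x = x0
        for _ in range(HORIZON):
            u = -K @ x
            total += float(x.T @ Q @ x + u.T @ R @ u)
            x = A @ x + B @ u
    return total / len(X0S)

def grad_estimate(K, radius=0.05, n_dirs=16):
    """Zeroth-order gradient estimate: perturb K on a sphere of Frobenius
    radius `radius` and weight each direction by the baseline-subtracted cost."""
    base = cost(K)
    g = np.zeros_like(K)
    for _ in range(n_dirs):
        U = rng.normal(size=K.shape)
        U *= radius / np.linalg.norm(U)
        g += (cost(K + U) - base) * U
    return (K.size / (n_dirs * radius**2)) * g

K = np.zeros((1, 2))            # initial stabilizing gain
eta = 1e-3                      # illustrative step size
for it in range(101):
    if it % 25 == 0:
        print(f"iter {it:3d}  cost {cost(K):8.3f}  K {K.ravel()}")
    K = K - eta * grad_estimate(K)
```

Running the loop should show the estimated cost decreasing as K moves toward the optimal LQR gain; in practice the step size and smoothing radius would need to be tuned to the problem instance.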