Robust Exploration with an Adversary via Langevin Monte Carlo
For Deep Q-Networks (DQNs), many exploration strategies have proven effective in controlled environments, yet they struggle in real-world settings subject to unpredictable disturbances, and how to explore efficiently under such disturbances remains under-investigated. To address these challenges, this work introduces a versatile reinforcement learning (RL) framework that jointly treats exploration and robustness in dynamic, unpredictable environments. Specifically, we propose a robust RL method framed as a two-player max-min adversarial game and cast as a Probabilistic Action Robust Markov Decision Process (MDP), motivated by a cyber-physical perspective. Our method leverages Langevin Monte Carlo (LMC) for Q-function exploration, with iterative updates that allow both the protagonist and the adversary to explore efficiently. We further extend this adversarial training scheme to provide robustness against episodes with delayed feedback. Empirical evaluation on benchmark problems, including N-Chain and deep brain stimulation, shows that our method consistently outperforms baseline approaches across diverse perturbation scenarios and under delayed feedback.
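For concreteness, the two ingredients named above admit standard formulations; the following is a sketch in common notation rather than the paper's own, where $\alpha$ is the perturbation probability, $\pi$ and $\bar{\pi}$ are the protagonist's and adversary's policies, $\theta_k$ are the Q-network parameters, $\eta_k$ a step size, $\beta$ an inverse temperature, $L$ a temporal-difference loss, and $\xi_k$ standard Gaussian noise (all introduced here for illustration). The probabilistic action-robust objective mixes the two policies and pits them against each other, while the LMC exploration step is a noisy, SGLD-style gradient update:
\[
\pi^{\mathrm{mix}}_{\alpha}(a \mid s) = (1-\alpha)\,\pi(a \mid s) + \alpha\,\bar{\pi}(a \mid s),
\qquad
\max_{\pi}\,\min_{\bar{\pi}}\;
\mathbb{E}_{\pi^{\mathrm{mix}}_{\alpha}}\!\Big[\textstyle\sum_{t \ge 0}\gamma^{t} r_t\Big],
\]
\[
\theta_{k+1} = \theta_k - \eta_k \nabla_{\theta} L(\theta_k) + \sqrt{2\eta_k \beta^{-1}}\,\xi_k,
\qquad \xi_k \sim \mathcal{N}(0, I).
\]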