Action-Dependent Optimality-Preserving Reward Shaping
Recent RL research has used reward shaping, particularly complex shaping rewards such as intrinsic motivation (IM), to encourage agent exploration in sparse-reward environments. While often effective, such shaping rewards are susceptible to “reward hacking”: the shaping reward is optimized at the expense of the extrinsic reward. Prior techniques mitigate this by making IM optimality-preserving, so that the shaping reward can be added without altering optimal policies, but they have so far been tested only in simple environments. In this work we show that these techniques are effectively unsuitable for complex, exploration-heavy environments with long episodes. To remedy this, we introduce Action-Dependent Optimality-Preserving Shaping (ADOPS), a method for converting arbitrary intrinsic rewards into an optimality-preserving form that allows agents to use them more effectively in the extremely sparse-reward environment of Montezuma's Revenge. We demonstrate significant improvement over prior state-of-the-art optimality-preserving IM-conversion methods, and argue that these gains stem from ADOPS's ability to preserve “action-dependent” IM terms.
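For context, one standard way to formalize the “optimality-preserving” property the abstract appeals to is sketched below; the notation ($r_t$ for the extrinsic reward, $f_t$ for the added shaping term, $\Phi$ for a potential function) is ours, and the display describes the general concept rather than the ADOPS construction itself:

\[
\operatorname*{arg\,max}_{\pi} \; \mathbb{E}_{\pi}\!\Bigl[\sum_{t \ge 0} \gamma^{t}\,(r_t + f_t)\Bigr]
\;\subseteq\;
\operatorname*{arg\,max}_{\pi} \; \mathbb{E}_{\pi}\!\Bigl[\sum_{t \ge 0} \gamma^{t}\, r_t\Bigr],
\]

with potential-based shaping, $f(s, a, s') = \gamma\,\Phi(s') - \Phi(s)$ for a state-dependent potential $\Phi$ (Ng et al., 1999), being the classic construction known to satisfy this condition.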