Decomposition and Foresight: Comparing Human and Simulated Teachers in Preference-Based Reinforcement Learning
Preference-based reinforcement learning (PBRL) algorithms train intelligent agents efficiently by learning reward functions from human preferences, bypassing the need for costly hand-designed reward functions. However, prior PBRL research has predominantly relied on simulated teachers as stand-ins for human preferences, overlooking the fact that no such simulated teachers exist for unsolved real-world problems. To apply PBRL effectively to real-world problems, it is essential to investigate how human teachers and simulated teachers differ, both in their preference selection patterns and in the behaviors the resulting agents exhibit. We therefore propose HPBRL, a novel Human Preference-Based Reinforcement Learning collaboration prototype in which the agent learns a flexible reward function from real human preferences. To enable a comprehensive comparison between human and simulated teachers, we conduct an in-depth between-subjects study with 18 participants.