Scholars@Duke publication: Off-Policy Evaluation for Human Feedback

Off-Policy Evaluation for Human Feedback

Publication, Conference

Gao, Q; Gao, G; Dong, J; Tarokh, V; Chi, M; Pajic, M

Published in: Advances in Neural Information Processing Systems

January 1, 2023

Off-policy evaluation (OPE) is important for closing the gap between offline training and evaluation of reinforcement learning (RL), as it estimates the performance and/or rank of target (evaluation) policies using offline trajectories only. It can improve the safety and efficiency of data collection and policy testing procedures in situations where online deployments are expensive, such as healthcare. However, existing OPE methods fall short in estimating human feedback (HF) signals, because HF may be conditioned on multiple underlying factors and is only sparsely available, in contrast to the agent-defined environmental rewards (used in policy optimization), which are usually determined by parametric functions or distributions. Consequently, the nature of HF signals makes it challenging to extrapolate accurate OPE estimates. To resolve this, we introduce an OPE for HF (OPEHF) framework that revives existing OPE methods so that they can accurately evaluate HF signals. Specifically, we develop an immediate human reward (IHR) reconstruction approach, regularized by environmental knowledge distilled in a latent space that captures the underlying dynamics of state transitions as well as the issuing of HF signals. Our approach has been tested in two real-world experiments, adaptive in-vivo neurostimulation and intelligent tutoring, as well as in a simulation environment (visual Q&A). Results show that our approach significantly improves the accuracy of HF signal estimation compared to directly applying (variants of) existing OPE methods.
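To make the idea of estimating a target policy's performance from offline trajectories concrete, the following is a minimal sketch of ordinary per-trajectory importance sampling, one of the standard OPE estimators. It is not the OPEHF method described in the abstract. The function name, the tuple layout of a logged step, and the toy policies are all assumptions made for illustration.

```python
from typing import Callable, Sequence, Tuple

# One logged step: (state, action, reward). Plain ints stand in for states/actions.
Step = Tuple[int, int, float]
# A policy returns the probability of taking `action` in `state`.
Policy = Callable[[int, int], float]


def importance_sampling_estimate(
    trajectories: Sequence[Sequence[Step]],
    behavior: Policy,
    target: Policy,
    gamma: float = 1.0,
) -> float:
    """Ordinary importance sampling estimate of the target policy's expected return.

    Each trajectory's discounted return is reweighted by the product of
    target/behavior action probabilities along that trajectory.
    """
    total = 0.0
    for traj in trajectories:
        weight = 1.0  # cumulative likelihood ratio for this trajectory
        ret = 0.0     # discounted return observed in the logged data
        for t, (state, action, reward) in enumerate(traj):
            weight *= target(state, action) / behavior(state, action)
            ret += (gamma ** t) * reward
        total += weight * ret
    return total / len(trajectories)


# Hypothetical toy data: one state, two actions, action 1 pays reward 1.0.
def uniform_behavior(state: int, action: int) -> float:
    return 0.5  # the logging policy picks either action with equal probability


def greedy_target(state: int, action: int) -> float:
    return 1.0 if action == 1 else 0.0  # the evaluated policy always picks action 1


logged = [[(0, 1, 1.0)], [(0, 0, 0.0)]]
estimate = importance_sampling_estimate(logged, uniform_behavior, greedy_target)
print(estimate)
```

In this toy setting the reward is observed at every step. When the signal of interest is human feedback that arrives only sparsely, most steps carry no reward at all, which is the difficulty the abstract attributes to applying existing estimators directly.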

Duke Scholars

Author Vahid Tarokh Pierre R. Lamond Department of Electrical and Computer Engin ...

Author Miroslav Pajic Pierre R. Lamond Department of Electrical and Computer Engin ...

Published In

Advances in Neural Information Processing Systems

ISSN

1049-5258

Publication Date

January 1, 2023

Volume

36

Related Subject Headings

4611 Machine learning

Citation

APA: Gao, Q., Gao, G., Dong, J., Tarokh, V., Chi, M., & Pajic, M. (2023). Off-Policy Evaluation for Human Feedback. In Advances in Neural Information Processing Systems (Vol. 36).

Chicago: Gao, Q., G. Gao, J. Dong, V. Tarokh, M. Chi, and M. Pajic. “Off-Policy Evaluation for Human Feedback.” In Advances in Neural Information Processing Systems, Vol. 36, 2023.

ICMJE: Gao Q, Gao G, Dong J, Tarokh V, Chi M, Pajic M. Off-Policy Evaluation for Human Feedback. In: Advances in Neural Information Processing Systems. 2023.

MLA: Gao, Q., et al. “Off-Policy Evaluation for Human Feedback.” Advances in Neural Information Processing Systems, vol. 36, 2023.

NLM: Gao Q, Gao G, Dong J, Tarokh V, Chi M, Pajic M. Off-Policy Evaluation for Human Feedback. Advances in Neural Information Processing Systems. 2023.
