Scholars@Duke publication: Towards Learning a Generic Agent for Vision-and-Language Navigation via Pre-training

Publication, Conference

Hao, W; Li, C; Li, X; Carin, L; Gao, J

Published in: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition

January 1, 2020

Learning to navigate in a visual environment following natural-language instructions is a challenging task, because the multimodal inputs to the agent are highly variable, and the training data on a new task is often limited. We present the first pre-training and fine-tuning paradigm for vision-and-language navigation (VLN) tasks. By training on a large amount of image-text-action triplets in a self-supervised learning manner, the pre-trained model provides generic representations of visual environments and language instructions. It can be easily used as a drop-in for existing VLN frameworks, leading to the proposed agent PREVALENT. It learns more effectively in new tasks and generalizes better in a previously unseen environment. The performance is validated on three VLN tasks. On the Room-to-Room [3] benchmark, our model improves the state-of-the-art from 47% to 51% on success rate weighted by path length. Further, the learned representation is transferable to other VLN tasks. On two recent tasks, vision-and-dialog navigation [30] and “Help, Anna!” [22], the proposed PREVALENT leads to significant improvement over existing methods, achieving a new state of the art.
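
For reference, success rate weighted by path length (SPL) is commonly defined as the mean over episodes of S_i * l_i / max(p_i, l_i), where S_i is a binary success flag, l_i is the shortest-path length to the goal, and p_i is the length of the path the agent actually took. The sketch below assumes that standard definition; it is not taken from this record, and the function name `spl` and the episode tuple layout are hypothetical, chosen only for illustration.

```python
from typing import Iterable, Tuple

# Each episode is (success, shortest_path_length, agent_path_length).
# This layout is an assumption made for this sketch.
Episode = Tuple[bool, float, float]


def spl(episodes: Iterable[Episode]) -> float:
    """Mean of S_i * l_i / max(p_i, l_i) over all episodes."""
    episodes = list(episodes)
    if not episodes:
        raise ValueError("at least one episode is required")
    total = 0.0
    for success, shortest, taken in episodes:
        denom = max(taken, shortest)
        # Failed episodes contribute 0; a zero denominator is skipped.
        if success and denom > 0:
            total += shortest / denom
    return total / len(episodes)


# One optimal success, one success with a path twice as long, one failure.
example = [(True, 10.0, 10.0), (True, 10.0, 20.0), (False, 10.0, 5.0)]
print(spl(example))
```

Under this definition an agent is rewarded only for successful episodes, and the reward shrinks as the path taken grows longer than the shortest path.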

Duke Scholars

Author: Lawrence Carin, Electrical and Computer Engineering

Published In

Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition

DOI

10.1109/CVPR42600.2020.01315

ISSN

1063-6919

Publication Date

January 1, 2020

Start / End Page

13134 / 13143

Citation

APA

Hao, W., Li, C., Li, X., Carin, L., & Gao, J. (2020). Towards Learning a Generic Agent for Vision-and-Language Navigation via Pre-training. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (pp. 13134–13143). https://doi.org/10.1109/CVPR42600.2020.01315

Chicago

Hao, W., C. Li, X. Li, L. Carin, and J. Gao. “Towards Learning a Generic Agent for Vision-and-Language Navigation via Pre-training.” In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 13134–43, 2020. https://doi.org/10.1109/CVPR42600.2020.01315.

ICMJE

Hao W, Li C, Li X, Carin L, Gao J. Towards Learning a Generic Agent for Vision-and-Language Navigation via Pre-training. In: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition. 2020. p. 13134–43.

MLA

Hao, W., et al. “Towards Learning a Generic Agent for Vision-and-Language Navigation via Pre-training.” Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2020, pp. 13134–43. Scopus, doi:10.1109/CVPR42600.2020.01315.

NLM

Hao W, Li C, Li X, Carin L, Gao J. Towards Learning a Generic Agent for Vision-and-Language Navigation via Pre-training. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition. 2020. p. 13134–13143.
