Advancing the Understanding and Evaluation of AR-Generated Scenes: When Vision-Language Models Shine and Stumble

Publication, Conference
Duan, L; Xiu, Y; Gorlatova, M
Published in: Proceedings of the 2025 IEEE Conference on Virtual Reality and 3D User Interfaces Abstracts and Workshops (VRW 2025)
January 1, 2025

Augmented Reality (AR) enhances the real world by integrating virtual content, yet ensuring the quality, usability, and safety of AR experiences presents significant challenges. Could Vision-Language Models (VLMs) offer a solution for the automated evaluation of AR-generated scenes? In this study, we evaluate the capabilities of three state-of-the-art commercial VLMs (GPT, Gemini, and Claude) in identifying and describing AR scenes. For this purpose, we use DiverseAR, the first AR dataset specifically designed to assess VLMs' ability to analyze virtual content across a wide range of AR scene complexities. Our findings demonstrate that VLMs are generally capable of perceiving and describing AR scenes, achieving a True Positive Rate (TPR) of up to 93% for perception and 71% for description. While they excel at identifying obvious virtual objects, such as a glowing apple, they struggle with seamlessly integrated content, such as a virtual pot with realistic shadows. Our results highlight both the strengths and the limitations of VLMs in understanding AR scenarios. We identify key factors affecting VLM performance, including virtual content placement, rendering quality, and physical plausibility. This study underscores the potential of VLMs as tools for evaluating the quality of AR experiences.
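
The abstract reports perception and description performance as a True Positive Rate (TPR). As a minimal illustrative sketch (not the authors' evaluation pipeline; the variable names and data below are hypothetical), the metric could be computed from per-scene ground truth and binary VLM judgments as follows:

```python
# Sketch of the True Positive Rate (TPR) metric reported in the abstract.
# Hypothetical data: ground_truth[i] is True when scene i contains virtual
# content; vlm_says_virtual[i] is True when the model flagged it as AR.

def true_positive_rate(ground_truth, predictions):
    """TPR = TP / (TP + FN): fraction of AR scenes the model correctly flags."""
    tp = sum(1 for gt, pred in zip(ground_truth, predictions) if gt and pred)
    fn = sum(1 for gt, pred in zip(ground_truth, predictions) if gt and not pred)
    return tp / (tp + fn) if (tp + fn) else 0.0

# Example: 10 AR scenes, of which the model detects 9 -> TPR = 90%
ground_truth = [True] * 10
vlm_says_virtual = [True] * 9 + [False]
print(f"Perception TPR: {true_positive_rate(ground_truth, vlm_says_virtual):.0%}")
```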

Published In

Proceedings of the 2025 IEEE Conference on Virtual Reality and 3D User Interfaces Abstracts and Workshops (VRW 2025)

DOI

10.1109/VRW66409.2025.00039

Publication Date

January 1, 2025

Start / End Page

156 / 161

Citation

APA
Duan, L., Xiu, Y., & Gorlatova, M. (2025). Advancing the Understanding and Evaluation of AR-Generated Scenes: When Vision-Language Models Shine and Stumble. In Proceedings of the 2025 IEEE Conference on Virtual Reality and 3D User Interfaces Abstracts and Workshops (VRW 2025) (pp. 156–161). https://doi.org/10.1109/VRW66409.2025.00039

Chicago
Duan, L., Y. Xiu, and M. Gorlatova. "Advancing the Understanding and Evaluation of AR-Generated Scenes: When Vision-Language Models Shine and Stumble." In Proceedings of the 2025 IEEE Conference on Virtual Reality and 3D User Interfaces Abstracts and Workshops (VRW 2025), 156–61, 2025. https://doi.org/10.1109/VRW66409.2025.00039.

ICMJE
Duan L, Xiu Y, Gorlatova M. Advancing the Understanding and Evaluation of AR-Generated Scenes: When Vision-Language Models Shine and Stumble. In: Proceedings of the 2025 IEEE Conference on Virtual Reality and 3D User Interfaces Abstracts and Workshops (VRW 2025). 2025. p. 156–61.

MLA
Duan, L., et al. "Advancing the Understanding and Evaluation of AR-Generated Scenes: When Vision-Language Models Shine and Stumble." Proceedings of the 2025 IEEE Conference on Virtual Reality and 3D User Interfaces Abstracts and Workshops (VRW 2025), 2025, pp. 156–61. Scopus, doi:10.1109/VRW66409.2025.00039.

NLM
Duan L, Xiu Y, Gorlatova M. Advancing the Understanding and Evaluation of AR-Generated Scenes: When Vision-Language Models Shine and Stumble. Proceedings of the 2025 IEEE Conference on Virtual Reality and 3D User Interfaces Abstracts and Workshops (VRW 2025). 2025. p. 156–161.
