MLLM-LLaVA-FL: Multimodal Large Language Model Assisted Federated Learning

Publication, Conference
Zhang, J; Yang, H; Li, A; Guo, X; Wang, P; Wang, H; Chen, Y; Li, H
Published in: Proceedings of the 2025 IEEE Winter Conference on Applications of Computer Vision (WACV 2025)
January 1, 2025

Previous studies on federated learning (FL) often encounter performance degradation due to data heterogeneity among different clients. Recent advances in multimodal large language models (MLLMs), such as GPT-4V and LLaVA, have demonstrated exceptional proficiency in multimodal tasks such as image captioning and multimodal question answering. In light of these advances, we introduce a novel federated learning framework, named Multimodal Large Language Model Assisted Federated Learning (MLLM-LLaVA-FL), which employs powerful MLLMs at the server end to address the heterogeneous and long-tailed challenges. Owing to the advanced cross-modality representation capabilities and the extensive open-vocabulary prior knowledge of MLLMs, our framework is adept at harnessing the extensive, yet previously underexploited, open-source data accessible from websites, along with powerful server-side computational resources. Hence, MLLM-LLaVA-FL not only enhances performance but also avoids increasing the risk of privacy leakage and the computational burden on local devices, distinguishing it from prior methodologies. Our framework has three key stages. Initially, we conduct global visual-text pretraining of the model; this pretraining leverages the extensive open-source data available online, with the assistance of MLLMs. Subsequently, the pretrained model is distributed among various clients for local training. Finally, once the locally trained models are transmitted back to the server, a global alignment is carried out under the supervision of MLLMs to further enhance performance. Experimental evaluations on established benchmarks show that our framework delivers promising performance in typical FL scenarios with data heterogeneity and long-tailed distributions across clients.
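The abstract outlines a three-stage server-client workflow: MLLM-assisted pretraining on the server, local training on clients, and MLLM-supervised global alignment back on the server. The sketch below illustrates only that control flow in Python; the function names, the toy vector-valued "model," the synthetic client data, and the FedAvg-style averaging are illustrative assumptions, not the authors' implementation.

# Minimal structural sketch of the three-stage workflow described in the abstract.
# All names, data, and the aggregation rule below are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
DIM = 16  # toy model dimension

def mllm_assisted_pretraining():
    """Stage 1: server-side visual-text pretraining on open web data (simulated)."""
    # Stand-in: pretend MLLM-generated supervision yields a reasonable initialization.
    return rng.normal(0.0, 0.1, DIM)

def local_training(global_model, client_data, lr=0.1, steps=5):
    """Stage 2: each client fine-tunes the distributed model on its private data."""
    w = global_model.copy()
    for _ in range(steps):
        grad = w - client_data.mean(axis=0)  # toy quadratic-loss gradient
        w -= lr * grad
    return w

def mllm_supervised_alignment(client_models, server_proxy_data, lr=0.05, steps=3):
    """Stage 3: aggregate client models, then align on the server under (simulated) MLLM supervision."""
    w = np.mean(client_models, axis=0)  # FedAvg-style aggregation (an assumption here)
    for _ in range(steps):
        grad = w - server_proxy_data.mean(axis=0)  # stand-in for an MLLM-guided objective
        w -= lr * grad
    return w

# Toy federation: 3 clients with heterogeneous (shifted) data distributions.
clients = [rng.normal(loc=c, scale=1.0, size=(64, DIM)) for c in (-1.0, 0.0, 2.0)]
server_proxy = rng.normal(loc=0.3, scale=1.0, size=(128, DIM))  # open-source proxy data

global_model = mllm_assisted_pretraining()                              # Stage 1
for rnd in range(3):                                                    # communication rounds
    local_models = [local_training(global_model, d) for d in clients]   # Stage 2
    global_model = mllm_supervised_alignment(local_models, server_proxy)  # Stage 3
    print(f"round {rnd}: mean weight = {global_model.mean():.3f}")

This only mirrors the staging described in the abstract; the paper's actual objectives, model architecture, and use of LLaVA/GPT-4V are not reproduced here.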


Published In

Proceedings of the 2025 IEEE Winter Conference on Applications of Computer Vision (WACV 2025)

DOI

10.1109/WACV61041.2025.00400

Publication Date

January 1, 2025

Start / End Page

4066 / 4076

Citation

APA
Zhang, J., Yang, H., Li, A., Guo, X., Wang, P., Wang, H., … Li, H. (2025). MLLM-LLaVA-FL: Multimodal Large Language Model Assisted Federated Learning. In Proceedings 2025 IEEE Winter Conference on Applications of Computer Vision WACV 2025 (pp. 4066–4076). https://doi.org/10.1109/WACV61041.2025.00400

Chicago
Zhang, J., H. Yang, A. Li, X. Guo, P. Wang, H. Wang, Y. Chen, and H. Li. “MLLM-LLaVA-FL: Multimodal Large Language Model Assisted Federated Learning.” In Proceedings 2025 IEEE Winter Conference on Applications of Computer Vision WACV 2025, 4066–76, 2025. https://doi.org/10.1109/WACV61041.2025.00400.

ICMJE
Zhang J, Yang H, Li A, Guo X, Wang P, Wang H, et al. MLLM-LLaVA-FL: Multimodal Large Language Model Assisted Federated Learning. In: Proceedings 2025 IEEE Winter Conference on Applications of Computer Vision WACV 2025. 2025. p. 4066–76.

MLA
Zhang, J., et al. “MLLM-LLaVA-FL: Multimodal Large Language Model Assisted Federated Learning.” Proceedings 2025 IEEE Winter Conference on Applications of Computer Vision WACV 2025, 2025, pp. 4066–76. Scopus, doi:10.1109/WACV61041.2025.00400.

NLM
Zhang J, Yang H, Li A, Guo X, Wang P, Wang H, Chen Y, Li H. MLLM-LLaVA-FL: Multimodal Large Language Model Assisted Federated Learning. Proceedings 2025 IEEE Winter Conference on Applications of Computer Vision WACV 2025. 2025. p. 4066–4076.
