
Lexicon-Enhanced Self-Supervised Training for Multilingual Dense Retrieval

Publication, Conference
Ren, H; Shou, L; Pei, J; Wu, N; Gong, M; Jiang, D
Published in: Findings of the Association for Computational Linguistics: EMNLP 2022
January 1, 2022

Recent multilingual pre-trained models have shown strong performance on a variety of multilingual tasks. However, these models perform poorly on multilingual retrieval tasks because multilingual training data are scarce. In this paper, we propose to mine and generate self-supervised training data from a large-scale unlabeled corpus. We carefully design a mining method that combines sparse and dense models to estimate the relevance between unlabeled queries and passages, and we introduce a query generator that produces additional queries in target languages for unlabeled passages. Through extensive experiments on the Mr. TyDi dataset and an industrial dataset from a commercial search engine, we demonstrate that our method outperforms baselines built on various pre-trained multilingual models. On the latter dataset, our method even achieves performance on par with the supervised method.
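The mining step described above, which fuses sparse (lexical) and dense (embedding) relevance signals to select pseudo-positive passages for unlabeled queries, can be sketched as follows. This is a minimal illustrative sketch: the toy score functions, the linear fusion weight `alpha`, and the acceptance threshold are assumptions for exposition, not the paper's exact formulation.

```python
import math
from collections import Counter

def sparse_score(query, passage):
    """Toy lexical-overlap score (stand-in for a BM25-style sparse model)."""
    q_terms = Counter(query.lower().split())
    p_terms = Counter(passage.lower().split())
    return sum(min(q_terms[t], p_terms[t]) for t in q_terms)

def dense_score(q_vec, p_vec):
    """Cosine similarity between pre-computed query/passage embeddings."""
    dot = sum(a * b for a, b in zip(q_vec, p_vec))
    norm = math.sqrt(sum(a * a for a in q_vec)) * math.sqrt(sum(b * b for b in p_vec))
    return dot / norm if norm else 0.0

def mine_pairs(queries, passages, q_vecs, p_vecs, alpha=0.5, threshold=1.0):
    """For each unlabeled query, fuse sparse and dense scores over all passages
    and keep the best-scoring passage as a pseudo-positive if it clears the
    threshold. Returns a list of (query_index, passage_index) training pairs."""
    pairs = []
    for qi, q in enumerate(queries):
        scored = [
            (alpha * sparse_score(q, p)
             + (1 - alpha) * dense_score(q_vecs[qi], p_vecs[pi]), pi)
            for pi, p in enumerate(passages)
        ]
        best_score, best_pi = max(scored)
        if best_score >= threshold:
            pairs.append((qi, best_pi))
    return pairs
```

In practice the sparse side would be a real BM25 index and the dense side a multilingual encoder; the fusion-and-threshold pattern shown here is the part that combines the two signals to filter noisy pseudo-labels.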


Published In

Findings of the Association for Computational Linguistics: EMNLP 2022

Publication Date

January 1, 2022

Start / End Page

444 / 459
 

Citation

APA
Ren, H., Shou, L., Pei, J., Wu, N., Gong, M., & Jiang, D. (2022). Lexicon-Enhanced Self-Supervised Training for Multilingual Dense Retrieval. In Findings of the Association for Computational Linguistics: EMNLP 2022 (pp. 444–459).

Chicago
Ren, H., L. Shou, J. Pei, N. Wu, M. Gong, and D. Jiang. “Lexicon-Enhanced Self-Supervised Training for Multilingual Dense Retrieval.” In Findings of the Association for Computational Linguistics: EMNLP 2022, 444–59, 2022.

ICMJE
Ren H, Shou L, Pei J, Wu N, Gong M, Jiang D. Lexicon-Enhanced Self-Supervised Training for Multilingual Dense Retrieval. In: Findings of the Association for Computational Linguistics: EMNLP 2022. 2022. p. 444–59.

MLA
Ren, H., et al. “Lexicon-Enhanced Self-Supervised Training for Multilingual Dense Retrieval.” Findings of the Association for Computational Linguistics: EMNLP 2022, 2022, pp. 444–59.

NLM
Ren H, Shou L, Pei J, Wu N, Gong M, Jiang D. Lexicon-Enhanced Self-Supervised Training for Multilingual Dense Retrieval. Findings of the Association for Computational Linguistics: EMNLP 2022. 2022. p. 444–459.
