Typos-aware Bottlenecked Pre-Training for Robust Dense Retrieval

Publication, Conference

Zhuang, S; Shou, L; Pei, J; Gong, M; Ren, H; Zuccon, G; Jiang, D

Published in: SIGIR AP 2023 Annual International ACM SIGIR Conference on Research and Development in Information Retrieval in the Asia Pacific Region

November 26, 2023

Current dense retrievers (DRs) are limited in their ability to effectively process misspelled queries, which constitute a significant portion of query traffic in commercial search engines. The main issue is that the pre-trained language model-based encoders used by DRs are typically trained and fine-tuned on clean, well-curated text. Misspelled queries rarely appear in this data, so those observed at inference time are out-of-distribution relative to it. Previous efforts to address this issue have focused on fine-tuning strategies, but their effectiveness on misspelled queries remains lower than that of pipelines that employ separate state-of-the-art spell-checking components. To address this challenge, we propose ToRoDer (TypOs-aware bottlenecked pre-training for RObust DEnse Retrieval), a novel pre-training strategy for DRs that increases their robustness to misspelled queries while preserving their effectiveness in downstream retrieval tasks. ToRoDer uses an encoder-decoder architecture in which the encoder takes misspelled text with masked tokens as input and outputs bottlenecked information to the decoder. The decoder then takes as input the bottlenecked embeddings, along with token embeddings of the original text with the misspelled tokens masked out. The pre-training task is to recover the masked tokens for both the encoder and decoder. Our extensive experimental results and detailed ablation studies show that DRs pre-trained with ToRoDer exhibit significantly higher effectiveness on misspelled queries, considerably closing the gap with pipelines that use a separate, complex spell-checker component, while retaining their effectiveness on correctly spelled queries.
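The input construction described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the helper names (`inject_typo`, `build_pretraining_inputs`), the adjacent-character-swap typo model, the whitespace tokenization, the typo and masking rates, and the choice of the clean token as the recovery target are all assumptions made for illustration. The encoder, the decoder, and the bottleneck embedding passed between them are omitted; only the token-level inputs and targets are shown.

```python
import random

MASK = "[MASK]"  # placeholder mask symbol (assumption)


def inject_typo(word, rng):
    """Swap two adjacent characters; one hypothetical typo type among many."""
    if len(word) < 2:
        return word
    i = rng.randrange(len(word) - 1)
    chars = list(word)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)


def build_pretraining_inputs(text, typo_rate=0.3, mask_rate=0.3, seed=0):
    """Build illustrative encoder and decoder inputs for one text.

    typo_rate and mask_rate are assumed values, not taken from the paper.
    """
    rng = random.Random(seed)
    original = text.split()  # whitespace tokenization (assumption)

    # Step 1: corrupt some tokens and remember which ones actually changed.
    misspelled, typo_positions = [], set()
    for i, tok in enumerate(original):
        new_tok = inject_typo(tok, rng) if rng.random() < typo_rate else tok
        misspelled.append(new_tok)
        if new_tok != tok:
            typo_positions.add(i)

    # Step 2: encoder input is the misspelled text with random tokens masked.
    enc_masked = {i for i in range(len(original)) if rng.random() < mask_rate}
    encoder_input = [MASK if i in enc_masked else t
                     for i, t in enumerate(misspelled)]

    # Step 3: decoder input is the original text with the misspelled
    # positions masked out, so the decoder never sees the typo itself.
    decoder_input = [MASK if i in typo_positions else t
                     for i, t in enumerate(original)]

    # Targets: the clean token at every masked position (assumption).
    return {
        "original": original,
        "encoder_input": encoder_input,
        "decoder_input": decoder_input,
        "encoder_targets": {i: original[i] for i in sorted(enc_masked)},
        "decoder_targets": {i: original[i] for i in sorted(typo_positions)},
    }


if __name__ == "__main__":
    example = build_pretraining_inputs(
        "dense retrievers struggle with misspelled queries", seed=1
    )
    for key, value in example.items():
        print(key, value)
```

In this sketch the decoder can only fill its masked positions from the surrounding clean tokens and whatever the encoder passes through the bottleneck, which is the pressure the abstract describes for making the encoder's representation robust to typos.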

Duke Scholars

Author: Jian Pei, Computer Science

Published In

SIGIR AP 2023 Annual International ACM SIGIR Conference on Research and Development in Information Retrieval in the Asia Pacific Region

DOI

10.1145/3624918.3625324

Publication Date

November 26, 2023

Start / End Page

212 / 222

Citation

APA

Zhuang, S., Shou, L., Pei, J., Gong, M., Ren, H., Zuccon, G., & Jiang, D. (2023). Typos-aware Bottlenecked Pre-Training for Robust Dense Retrieval. In SIGIR AP 2023 Annual International ACM SIGIR Conference on Research and Development in Information Retrieval in the Asia Pacific Region (pp. 212–222). https://doi.org/10.1145/3624918.3625324

Chicago

Zhuang, S., L. Shou, J. Pei, M. Gong, H. Ren, G. Zuccon, and D. Jiang. “Typos-aware Bottlenecked Pre-Training for Robust Dense Retrieval.” In SIGIR AP 2023 Annual International ACM SIGIR Conference on Research and Development in Information Retrieval in the Asia Pacific Region, 212–22, 2023. https://doi.org/10.1145/3624918.3625324.

ICMJE

Zhuang S, Shou L, Pei J, Gong M, Ren H, Zuccon G, et al. Typos-aware Bottlenecked Pre-Training for Robust Dense Retrieval. In: SIGIR AP 2023 Annual International ACM SIGIR Conference on Research and Development in Information Retrieval in the Asia Pacific Region. 2023. p. 212–22.

MLA

Zhuang, S., et al. “Typos-aware Bottlenecked Pre-Training for Robust Dense Retrieval.” SIGIR AP 2023 Annual International ACM SIGIR Conference on Research and Development in Information Retrieval in the Asia Pacific Region, 2023, pp. 212–22. Scopus, doi:10.1145/3624918.3625324.

NLM

Zhuang S, Shou L, Pei J, Gong M, Ren H, Zuccon G, Jiang D. Typos-aware Bottlenecked Pre-Training for Robust Dense Retrieval. SIGIR AP 2023 Annual International ACM SIGIR Conference on Research and Development in Information Retrieval in the Asia Pacific Region. 2023. p. 212–222.
