Scholars@Duke publication: Close or Cloze? Assessing the Robustness of Large Language Models to Adversarial Perturbations via Word Recovery

Publication, Conference

Moffett, L; Dhingra, B

Published in: Proceedings International Conference on Computational Linguistics Coling

January 1, 2025

The current generation of large language models (LLMs) shows a surprising degree of robustness to adversarial perturbations, but it is unclear when these models implicitly recover the original text and when they rely on surrounding context. To isolate this recovery faculty of language models, we study a new diagnostic task, Adversarial Word Recovery, an extension of spellchecking in which the inputs may be adversarial. We collect a new dataset using 9 popular perturbation attack strategies from the literature and organize them using a taxonomy of phonetic, typo, and visual attacks. We use this dataset to study the word recovery performance of the current generation of LLMs, finding that proprietary models (GPT-4, GPT-3.5, and PaLM-2) match or surpass human performance. Conversely, open-source models (Llama-2, Mistral, Falcon) demonstrate a material gap relative to human performance, especially on visual attacks. For these open models, we show that word recovery performance without context correlates with word recovery performance with context, and ultimately affects downstream performance on a hateful, offensive, and toxic classification task. Finally, to show that improving word recovery can improve robustness, we mitigate these attacks with a small ByT5 model tuned to recover visually attacked words.
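To make the word recovery setup in the abstract concrete, the following is a minimal sketch under stated assumptions, not the authors' implementation. The substitution table, both perturbation functions, and the exact-match scoring are hypothetical stand-ins for a "visual" and a "typo" attack; the paper's 9 attack strategies, its phonetic attacks, and its actual evaluation metric are not reproduced here.

```python
# Illustrative only: toy perturbations and a toy recovery score.
# Every name below is hypothetical and does not come from the paper.

# Assumed look-alike substitutions standing in for a "visual" attack.
VISUAL_MAP = {"a": "@", "e": "3", "i": "1", "o": "0", "s": "$"}


def visual_perturb(word: str) -> str:
    """Replace each mapped character with a visually similar symbol."""
    return "".join(VISUAL_MAP.get(ch, ch) for ch in word)


def typo_perturb(word: str) -> str:
    """Swap the two characters around the midpoint (a toy "typo" attack)."""
    if len(word) < 4:
        return word  # too short to swap interior characters meaningfully
    mid = len(word) // 2
    chars = list(word)
    chars[mid - 1], chars[mid] = chars[mid], chars[mid - 1]
    return "".join(chars)


def recovery_accuracy(originals: list[str], recovered: list[str]) -> float:
    """Fraction of words recovered exactly, ignoring case."""
    if len(originals) != len(recovered):
        raise ValueError("originals and recovered must have the same length")
    if not originals:
        return 0.0
    hits = sum(o.lower() == r.lower() for o, r in zip(originals, recovered))
    return hits / len(originals)
```

In such a setup, a model (or a human) would be shown the perturbed word, with or without its surrounding sentence, and its guesses would be scored against the original words with a function like `recovery_accuracy`.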

Duke Scholars

Author: Bhuwan Dhingra, Computer Science

Published In

Proceedings International Conference on Computational Linguistics Coling

ISSN

2951-2093

Publication Date

January 1, 2025

Start / End Page

6999 / 7019

Citation

APA: Moffett, L., & Dhingra, B. (2025). Close or Cloze? Assessing the Robustness of Large Language Models to Adversarial Perturbations via Word Recovery. In Proceedings International Conference on Computational Linguistics Coling (pp. 6999–7019).

Chicago: Moffett, L., and B. Dhingra. “Close or Cloze? Assessing the Robustness of Large Language Models to Adversarial Perturbations via Word Recovery.” In Proceedings International Conference on Computational Linguistics Coling, 6999–7019, 2025.

ICMJE: Moffett L, Dhingra B. Close or Cloze? Assessing the Robustness of Large Language Models to Adversarial Perturbations via Word Recovery. In: Proceedings International Conference on Computational Linguistics Coling. 2025. p. 6999–7019.

MLA: Moffett, L., and B. Dhingra. “Close or Cloze? Assessing the Robustness of Large Language Models to Adversarial Perturbations via Word Recovery.” Proceedings International Conference on Computational Linguistics Coling, 2025, pp. 6999–7019.

NLM: Moffett L, Dhingra B. Close or Cloze? Assessing the Robustness of Large Language Models to Adversarial Perturbations via Word Recovery. Proceedings International Conference on Computational Linguistics Coling. 2025. p. 6999–7019.
