Sequence Reducible Holdout Loss for Language Model Pretraining

Publication, Conference
Thirukovalluru, R; Monath, N; Dhingra, B; Wiseman, S
Published in: 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation, LREC-COLING 2024 - Main Conference Proceedings
January 1, 2024

Data selection techniques, which adaptively select datapoints inside the training loop, have demonstrated empirical benefits in reducing the number of gradient steps needed to train neural models. However, these techniques have so far largely been applied to classification. In this work, we study their applicability to language model pretraining, a highly time-intensive task. We propose a simple modification to an existing data selection technique (reducible hold-out loss training) in order to adapt it to the sequence losses typical in language modeling. We experiment on both autoregressive and masked language modeling, and show that applying data selection to pretraining offers notable benefits, including a 4.3% reduction in the total number of steps and an average 21.5% reduction in steps to reach an intermediate target perplexity over the course of pretraining an autoregressive language model. Furthermore, language models trained with data selection perform significantly better on out-of-domain datasets, with a 7.9% reduction in the total number of steps and a 23.2% average reduction in steps to an intermediate target perplexity.
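The abstract describes adapting reducible hold-out loss (RHO-loss) selection from per-example classification losses to the sequence losses used in language modeling. As a rough illustrative sketch only (not the authors' exact formulation), the PyTorch snippet below scores each candidate sequence by its current training loss minus its loss under a small model trained on held-out data, then keeps the highest-scoring sequences for the gradient step. The per-token mean aggregation, the keep_frac parameter, and the helper names are assumptions made for illustration.

# Hypothetical sketch of RHO-loss-style selection at the sequence level.
# Assumes model(input_ids) returns logits of shape (batch, seq_len, vocab).
import torch
import torch.nn.functional as F


def per_sequence_loss(model, input_ids, labels, pad_id):
    """Mean per-token cross-entropy for each sequence in the batch."""
    logits = model(input_ids)                                   # (B, T, V)
    token_loss = F.cross_entropy(
        logits.transpose(1, 2), labels,                         # (B, V, T) vs (B, T)
        reduction="none", ignore_index=pad_id,
    )                                                           # (B, T)
    mask = (labels != pad_id).float()
    return (token_loss * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)


def select_sequences(model, holdout_model, input_ids, labels, pad_id, keep_frac=0.5):
    """Keep the sequences with the largest reducible loss:
    current training loss minus the loss of a small model trained on held-out data."""
    with torch.no_grad():
        train_loss = per_sequence_loss(model, input_ids, labels, pad_id)
        irreducible = per_sequence_loss(holdout_model, input_ids, labels, pad_id)
    reducible = train_loss - irreducible                        # high = learnable, not yet learned
    k = max(1, int(keep_frac * input_ids.size(0)))
    keep = torch.topk(reducible, k).indices
    return input_ids[keep], labels[keep]

In a training loop, this selection would be applied to each large candidate batch before the backward pass, so gradient steps are spent only on the retained sequences.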

Published In

2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation, LREC-COLING 2024 - Main Conference Proceedings

Publication Date

January 1, 2024

Start / End Page

14705 / 14716
 

Citation

APA: Thirukovalluru, R., Monath, N., Dhingra, B., & Wiseman, S. (2024). Sequence Reducible Holdout Loss for Language Model Pretraining. In 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation, LREC-COLING 2024 - Main Conference Proceedings (pp. 14705–14716).
Chicago: Thirukovalluru, R., N. Monath, B. Dhingra, and S. Wiseman. “Sequence Reducible Holdout Loss for Language Model Pretraining.” In 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation, LREC-COLING 2024 - Main Conference Proceedings, 14705–16, 2024.
ICMJE: Thirukovalluru R, Monath N, Dhingra B, Wiseman S. Sequence Reducible Holdout Loss for Language Model Pretraining. In: 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation, LREC-COLING 2024 - Main Conference Proceedings. 2024. p. 14705–16.
MLA: Thirukovalluru, R., et al. “Sequence Reducible Holdout Loss for Language Model Pretraining.” 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation, LREC-COLING 2024 - Main Conference Proceedings, 2024, pp. 14705–16.
NLM: Thirukovalluru R, Monath N, Dhingra B, Wiseman S. Sequence Reducible Holdout Loss for Language Model Pretraining. 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation, LREC-COLING 2024 - Main Conference Proceedings. 2024. p. 14705–14716.