Calibrating Long-form Generations from Large Language Models

Publication, Conference

Huang, Y; Liu, Y; Thirukovalluru, R; Cohan, A; Dhingra, B

Published in: EMNLP 2024: 2024 Conference on Empirical Methods in Natural Language Processing, Findings of EMNLP 2024

January 1, 2024

To enhance the reliability of Large Language Models (LLMs), calibration is essential: the model's confidence scores should align with the likelihood that its responses are correct. However, traditional calibration methods typically rely on a binary true/false assessment of response correctness, which is unsuitable for long-form generations where an answer can be partially correct. Addressing this gap, we introduce a unified calibration framework in which both the correctness of the LLMs' responses and their associated confidence levels are treated as distributions across a range of scores. We develop three metrics for assessing LLM calibration and propose confidence elicitation methods based on self-consistency and self-evaluation. Our experiments demonstrate that larger models do not necessarily guarantee better calibration, that the various calibration metrics complement each other, and that self-consistency methods excel on factoid datasets. We also find that calibration can be enhanced through techniques such as fine-tuning and scaling the temperature. Finally, we illustrate one application of long-form calibration through selective answering in long-form responses, optimizing correctness within a constrained API budget.
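The idea of treating both correctness and confidence as distributions over a range of scores can be illustrated with a small sketch. This is not the paper's implementation: the abstract does not name its three metrics, so total variation distance is used here only as a stand-in comparison, and the 0-5 score scale, the sample values, the threshold, and all function names are hypothetical choices made for illustration.

```python
from collections import Counter


def score_distribution(scores, num_levels):
    """Convert integer scores in [0, num_levels - 1] into a probability distribution."""
    counts = Counter(scores)
    total = len(scores)
    return [counts.get(level, 0) / total for level in range(num_levels)]


def expected_score(dist):
    """Mean score under a distribution over score levels."""
    return sum(level * p for level, p in enumerate(dist))


def total_variation(p, q):
    """Stand-in distance between two score distributions (assumed, not the paper's metric)."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def should_answer(confidence_dist, threshold):
    """Selective answering: respond only if the expected confidence clears a threshold."""
    return expected_score(confidence_dist) >= threshold


# Hypothetical data on an assumed 0-5 scale: confidence scores elicited from the
# model several times, and correctness scores assigned to the same response.
confidence_samples = [4, 4, 3, 5, 4]
correctness_samples = [3, 4, 3, 4, 4]

conf = score_distribution(confidence_samples, num_levels=6)
corr = score_distribution(correctness_samples, num_levels=6)

print("confidence distribution: ", conf)
print("correctness distribution:", corr)
print("distance between them:   ", total_variation(conf, corr))
print("answer at threshold 3.5? ", should_answer(conf, 3.5))
```

A smaller distance means the elicited confidence distribution tracks the correctness distribution more closely. The threshold in `should_answer` is one simple way to trade coverage for expected correctness when each answer consumes part of a fixed budget.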

Duke Scholars

Author Bhuwan Dhingra Computer Science

Published In

EMNLP 2024: 2024 Conference on Empirical Methods in Natural Language Processing, Findings of EMNLP 2024

DOI

10.18653/v1/2024.findings-emnlp.785

Publication Date

January 1, 2024

Start / End Page

13441 / 13460

Citation

APA: Huang, Y., Liu, Y., Thirukovalluru, R., Cohan, A., & Dhingra, B. (2024). Calibrating Long-form Generations from Large Language Models. In EMNLP 2024: 2024 Conference on Empirical Methods in Natural Language Processing, Findings of EMNLP 2024 (pp. 13441–13460). https://doi.org/10.18653/v1/2024.findings-emnlp.785

Chicago: Huang, Y., Y. Liu, R. Thirukovalluru, A. Cohan, and B. Dhingra. “Calibrating Long-form Generations from Large Language Models.” In EMNLP 2024: 2024 Conference on Empirical Methods in Natural Language Processing, Findings of EMNLP 2024, 13441–60, 2024. https://doi.org/10.18653/v1/2024.findings-emnlp.785.

ICMJE: Huang Y, Liu Y, Thirukovalluru R, Cohan A, Dhingra B. Calibrating Long-form Generations from Large Language Models. In: EMNLP 2024: 2024 Conference on Empirical Methods in Natural Language Processing, Findings of EMNLP 2024. 2024. p. 13441–60.

MLA: Huang, Y., et al. “Calibrating Long-form Generations from Large Language Models.” EMNLP 2024: 2024 Conference on Empirical Methods in Natural Language Processing, Findings of EMNLP 2024, 2024, pp. 13441–60. Scopus, doi:10.18653/v1/2024.findings-emnlp.785.

NLM: Huang Y, Liu Y, Thirukovalluru R, Cohan A, Dhingra B. Calibrating Long-form Generations from Large Language Models. EMNLP 2024: 2024 Conference on Empirical Methods in Natural Language Processing, Findings of EMNLP 2024. 2024. p. 13441–13460.
