Can Large Language Models Verify System Software? A Case Study Using FSCQ as a Benchmark
Large language models (LLMs) have demonstrated remarkable coding capabilities. They excel on code synthesis benchmarks across diverse domains and have become ubiquitous in coding tools. Recently, they have also shown promise in generating proofs, both for mathematical theorems and for small software programs. In this paper, we explore their potential to produce proofs for complex system software (e.g., file systems), where verification typically requires substantial manual effort. By automating parts of this process, LLMs could reduce the verification burden and make rigorous proofs for system software more accessible. To evaluate LLMs for system software verification, we use FSCQ, a verified file system, as our benchmark. Our results confirm the promise of this approach: with appropriate proof context and a straightforward best-first tree search, off-the-shelf LLMs achieve 38% proof coverage for theorems sampled from FSCQ. Moreover, for simpler theorems (those with human proofs under 64 tokens, which make up about 60% of all FSCQ theorems), LLMs achieve over 57% coverage. These findings are preliminary, and we anticipate that various techniques can further improve proof coverage.
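
To make the search strategy concrete, below is a minimal sketch of the kind of best-first proof search the abstract refers to; it is not the paper's actual implementation. The helpers `llm_propose_tactics` (LLM proposes scored candidate tactics given the goal and proof context) and `apply_tactic` (the proof assistant checks a tactic and returns the next proof state) are hypothetical stand-ins.

```python
# Minimal sketch of LLM-driven best-first proof search.
# Assumed (hypothetical) helpers:
#   llm_propose_tactics(state, context) -> list of (tactic, logprob), best first
#   apply_tactic(state, tactic) -> (next_state or None, proof_complete: bool)
import heapq
from dataclasses import dataclass, field

@dataclass(order=True)
class Node:
    score: float                              # lower is better (e.g., negative log-prob)
    state: object = field(compare=False)      # opaque proof-assistant state
    tactics: list = field(compare=False, default_factory=list)

def best_first_search(initial_state, context, budget=100, beam=8):
    """Expand the most promising open proof state until a proof closes or the budget runs out."""
    frontier = [Node(0.0, initial_state)]
    expansions = 0
    while frontier and expansions < budget:
        node = heapq.heappop(frontier)
        expansions += 1
        # Ask the LLM for candidate next tactics, conditioned on the goal and proof context.
        for tactic, logprob in llm_propose_tactics(node.state, context)[:beam]:
            next_state, done = apply_tactic(node.state, tactic)
            if next_state is None:
                continue                       # tactic failed to check; prune this branch
            proof = node.tactics + [tactic]
            if done:
                return proof                   # complete proof script found
            heapq.heappush(frontier, Node(node.score - logprob, next_state, proof))
    return None                                # no proof found within the search budget
```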