Skip to main content

Can Large Language Models Verify System Software? A Case Study Using FSCQ as a Benchmark

Publication ,  Conference
Qin, J; Du, A; Zhang, D; Lentz, M; Zhuo, D
Published in: HOTOS 2025 Proceedings of the 2025 Workshop in Hot Topics in Operating Systems
June 6, 2025

Large language models (LLMs) have demonstrated remarkable coding capabilities. They excel in code synthesis benchmarks across diverse domains and have become ubiquitous in coding tools. Recently, they have also shown promise in generating mathematical proofs and small software programs. In this paper, we explore their potential to produce proofs for complex system software (e.g., file systems), where verification typically requires substantial manual effort. By automating parts of this process, LLMs could reduce the verification burden and make rigorous proofs for system software more accessible. To evaluate LLMs for system software verification, we use FSCQ, a verified file system, as our benchmark. Our results confirm the promise of this approach: with appropriate proof context and a straightforward best-first tree search, off-the-shelf LLMs achieve 38% proof coverage for theorems sampled from FSCQ. Moreover, for simpler theorems - -those with human proofs under 64 tokens, which make up about 60% of all FSCQ theorems - -LLMs achieve over 57% coverage. These findings are preliminary, and we anticipate that various techniques can further improve proof coverage.

Duke Scholars

Published In

HOTOS 2025 Proceedings of the 2025 Workshop in Hot Topics in Operating Systems

DOI

Publication Date

June 6, 2025

Start / End Page

34 / 41
 

Citation

APA
Chicago
ICMJE
MLA
NLM
Qin, J., Du, A., Zhang, D., Lentz, M., & Zhuo, D. (2025). Can Large Language Models Verify System Software? A Case Study Using FSCQ as a Benchmark. In HOTOS 2025 Proceedings of the 2025 Workshop in Hot Topics in Operating Systems (pp. 34–41). https://doi.org/10.1145/3713082.3730382
Qin, J., A. Du, D. Zhang, M. Lentz, and D. Zhuo. “Can Large Language Models Verify System Software? A Case Study Using FSCQ as a Benchmark.” In HOTOS 2025 Proceedings of the 2025 Workshop in Hot Topics in Operating Systems, 34–41, 2025. https://doi.org/10.1145/3713082.3730382.
Qin J, Du A, Zhang D, Lentz M, Zhuo D. Can Large Language Models Verify System Software? A Case Study Using FSCQ as a Benchmark. In: HOTOS 2025 Proceedings of the 2025 Workshop in Hot Topics in Operating Systems. 2025. p. 34–41.
Qin, J., et al. “Can Large Language Models Verify System Software? A Case Study Using FSCQ as a Benchmark.” HOTOS 2025 Proceedings of the 2025 Workshop in Hot Topics in Operating Systems, 2025, pp. 34–41. Scopus, doi:10.1145/3713082.3730382.
Qin J, Du A, Zhang D, Lentz M, Zhuo D. Can Large Language Models Verify System Software? A Case Study Using FSCQ as a Benchmark. HOTOS 2025 Proceedings of the 2025 Workshop in Hot Topics in Operating Systems. 2025. p. 34–41.

Published In

HOTOS 2025 Proceedings of the 2025 Workshop in Hot Topics in Operating Systems

DOI

Publication Date

June 6, 2025

Start / End Page

34 / 41