Touchstone Benchmark: Are We on the Right Way for Evaluating AI Algorithms for Medical Segmentation?

Publication, Conference
Bassi, PRAS; Li, W; Tang, Y; Isensee, F; Wang, Z; Chen, J; Chou, YC; Roy, S; Kirchhoff, Y; Rokuss, M; Huang, Z; Ye, J; He, J; Wald, T ...
Published in: Advances in Neural Information Processing Systems
January 1, 2024

How can we test AI performance? This question seems trivial, but it isn't. Standard benchmarks often have problems such as in-distribution and small-size test sets, oversimplified metrics, unfair comparisons, and short-term outcome pressure. As a consequence, good performance on standard benchmarks does not guarantee success in real-world scenarios. To address these problems, we present Touchstone, a large-scale collaborative segmentation benchmark of 9 types of abdominal organs. This benchmark is based on 5,195 training CT scans from 76 hospitals around the world and 5,903 testing CT scans from 11 additional hospitals. This diverse test set enhances the statistical significance of benchmark results and rigorously evaluates AI algorithms across out-of-distribution scenarios. We invited 14 inventors of 19 AI algorithms to train their algorithms, while our team, as a third party, independently evaluated these algorithms. We also evaluated pre-existing AI frameworks (which, unlike algorithms, are more flexible and can support different algorithms), including MONAI from NVIDIA, nnU-Net from DKFZ, and numerous other open-source frameworks. We are committed to expanding this benchmark to encourage more innovation in AI algorithms for the medical domain.

Published In

Advances in Neural Information Processing Systems

ISSN

1049-5258

Publication Date

January 1, 2024

Volume

37

Related Subject Headings

  • 4611 Machine learning
  • 1702 Cognitive Sciences
  • 1701 Psychology
 

Citation

APA: Bassi, P. R. A. S., Li, W., Tang, Y., Isensee, F., Wang, Z., Chen, J., … Zhou, Z. (2024). Touchstone Benchmark: Are We on the Right Way for Evaluating AI Algorithms for Medical Segmentation? In Advances in Neural Information Processing Systems (Vol. 37).
Chicago: Bassi, P. R. A. S., W. Li, Y. Tang, F. Isensee, Z. Wang, J. Chen, Y. C. Chou, et al. “Touchstone Benchmark: Are We on the Right Way for Evaluating AI Algorithms for Medical Segmentation?” In Advances in Neural Information Processing Systems, Vol. 37, 2024.
ICMJE: Bassi PRAS, Li W, Tang Y, Isensee F, Wang Z, Chen J, et al. Touchstone Benchmark: Are We on the Right Way for Evaluating AI Algorithms for Medical Segmentation? In: Advances in Neural Information Processing Systems. 2024.
MLA: Bassi, P. R. A. S., et al. “Touchstone Benchmark: Are We on the Right Way for Evaluating AI Algorithms for Medical Segmentation?” Advances in Neural Information Processing Systems, vol. 37, 2024.
NLM: Bassi PRAS, Li W, Tang Y, Isensee F, Wang Z, Chen J, Chou YC, Roy S, Kirchhoff Y, Rokuss M, Huang Z, Ye J, He J, Wald T, Ulrich C, Baumgartner M, Maier-Hein KH, Jaeger P, Ye Y, Xie Y, Zhang J, Chen Z, Xia Y, Xing Z, Zhu L, Sadegheih Y, Bozorgpour A, Kumari P, Azad R, Merhof D, Shi P, Ma T, Du Y, Bai F, Huang T, Zhao B, Wang H, Li X, Gu H, Dong H, Yang J, Mazurowski MA, Gupta S, Wu L, Zhuang J, Chen H, Roth H, Xu D, Blaschko MB, Decherchi S, Cavalli A, Yuille AL, Zhou Z. Touchstone Benchmark: Are We on the Right Way for Evaluating AI Algorithms for Medical Segmentation? Advances in Neural Information Processing Systems. 2024.
