Scholars@Duke publication: Improving peptide-MHC class I binding prediction for unbalanced datasets.

Publication, Journal Article

Sales, AP; Tomaras, GD; Kepler, TB

Published in: BMC Bioinformatics

September 19, 2008

BACKGROUND: Establishing that a peptide binds to Major Histocompatibility Complex class I (MHCI) is a crucial step in the development of subunit vaccines, and predicting such binding could greatly reduce costs and accelerate the experimental process of identifying immunogenic peptides. Many methods have been applied to the prediction of peptide-MHCI binding, with some achieving outstanding performance. Because of the experimental methods used to measure binding or affinity between peptides and MHCI molecules, however, available datasets are enriched for nonbinders and are thus highly unbalanced. Although there is no consensus on the ideal class distribution for training sets, extremely unbalanced datasets can be detrimental to the performance of prediction algorithms.

RESULTS: We have developed a decision-theoretic framework to construct cost-sensitive trees to predict peptide-MHCI binding and have used them to 1) assess the impact of the training data's class distribution on classifier accuracy, and 2) compare resampling and cost-sensitive methods as approaches to compensate for training data imbalance. Our results confirm that highly unbalanced training sets can reduce the accuracy of classifier predictions and show that, in the peptide-MHCI binding context, resampling methods do not improve classifier performance. In contrast, cost-sensitive methods significantly improve the accuracy of decision trees. Finally, we propose a training scheme that, when the training set is enriched for nonbinders, consistently improves overall classifier accuracy compared to cost-insensitive classifiers and, in particular, increases the sensitivity of the classifiers. This method minimizes the expected classification cost for large datasets.

CONCLUSION: Our method consistently improves the performance of decision trees in predicting peptide-MHC class I binding by using cost-balancing techniques to compensate for the imbalance in the training dataset.
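The abstract does not give the details of the decision-theoretic framework, so the following is only a minimal sketch of the general idea behind cost-sensitive labeling at a decision-tree leaf, not the authors' implementation. The function names (`min_cost_label`, `balancing_costs`) and the particular cost-balancing choice (weighting a missed binder by the nonbinder-to-binder ratio of the training set) are assumptions made for illustration.

```python
def min_cost_label(n_binders, n_nonbinders, cost_fn=1.0, cost_fp=1.0):
    """Pick the leaf label that minimizes expected misclassification cost.

    cost_fn: cost of calling a true binder a nonbinder (false negative).
    cost_fp: cost of calling a true nonbinder a binder (false positive).
    Both names are hypothetical, chosen for this sketch.
    """
    total = n_binders + n_nonbinders
    if total == 0:
        raise ValueError("leaf has no training examples")
    p_binder = n_binders / total
    # Predicting "nonbinder" is wrong for the binder fraction of the leaf.
    expected_cost_nonbinder = p_binder * cost_fn
    # Predicting "binder" is wrong for the nonbinder fraction of the leaf.
    expected_cost_binder = (1.0 - p_binder) * cost_fp
    if expected_cost_binder < expected_cost_nonbinder:
        return "binder"
    return "nonbinder"


def balancing_costs(n_binders, n_nonbinders):
    """Assumed cost-balancing rule: weight the rare class by the class ratio."""
    if n_binders == 0:
        raise ValueError("training set contains no binders")
    cost_fn = n_nonbinders / n_binders
    cost_fp = 1.0
    return cost_fn, cost_fp


if __name__ == "__main__":
    # Training set enriched for nonbinders: 100 binders, 900 nonbinders.
    cost_fn, cost_fp = balancing_costs(100, 900)
    # A leaf holding 3 binders and 7 nonbinders.
    print(min_cost_label(3, 7))                    # equal costs
    print(min_cost_label(3, 7, cost_fn, cost_fp))  # balanced costs
```

With equal costs, the leaf holding 3 binders and 7 nonbinders is labeled by majority vote as a nonbinder. With the balanced costs (a missed binder weighted 9 to 1), the same leaf is labeled a binder, which illustrates how cost-sensitive labeling can raise sensitivity when nonbinders dominate the training set.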

Duke Scholars

Author Georgia Doris Tomaras Surgery, Surgical Sciences

Published In

BMC Bioinformatics

DOI

10.1186/1471-2105-9-385

EISSN

1471-2105

Publication Date

September 19, 2008

Volume

9

Start / End Page

385

Location

England

Related Subject Headings

Vaccines, Subunit
Peptides
Humans
Histocompatibility Antigens Class I
Bioinformatics
Algorithms
49 Mathematical sciences
46 Information and computing sciences
31 Biological sciences
08 Information and Computing Sciences

Citation

APA: Sales, A. P., Tomaras, G. D., & Kepler, T. B. (2008). Improving peptide-MHC class I binding prediction for unbalanced datasets. BMC Bioinformatics, 9, 385. https://doi.org/10.1186/1471-2105-9-385

Chicago: Sales, Ana Paula, Georgia D. Tomaras, and Thomas B. Kepler. “Improving peptide-MHC class I binding prediction for unbalanced datasets.” BMC Bioinformatics 9 (September 19, 2008): 385. https://doi.org/10.1186/1471-2105-9-385.

ICMJE: Sales AP, Tomaras GD, Kepler TB. Improving peptide-MHC class I binding prediction for unbalanced datasets. BMC Bioinformatics. 2008 Sep 19;9:385.

MLA: Sales, Ana Paula, et al. “Improving peptide-MHC class I binding prediction for unbalanced datasets.” BMC Bioinformatics, vol. 9, Sept. 2008, p. 385. Pubmed, doi:10.1186/1471-2105-9-385.

NLM: Sales AP, Tomaras GD, Kepler TB. Improving peptide-MHC class I binding prediction for unbalanced datasets. BMC Bioinformatics. 2008 Sep 19;9:385.
