
Improved datasets and evaluation methods for the automatic prediction of DNA-binding proteins.

Publication, Journal Article
Zaitzeff, A; Leiby, N; Motta, FC; Haase, SB; Singer, JM
Published in: Bioinformatics (Oxford, England)
December 2021

Accurate automatic annotation of protein function relies on both innovative models and robust datasets. Due to their importance in biological processes, the identification of DNA-binding proteins directly from protein sequence has been the focus of many studies. However, the datasets used to train and evaluate these methods have suffered from substantial flaws. We describe some of the weaknesses of the datasets used in previous DNA-binding protein literature and provide several new datasets addressing these problems. We suggest new evaluative benchmark tasks that more realistically assess real-world performance for protein annotation models. We propose a simple new model for the prediction of DNA-binding proteins and compare its performance on the improved datasets to two previously published models. In addition, we provide extensive tests showing how the best models predict across taxa.

Our new gradient boosting model, which uses features derived from a published protein language model, outperforms the earlier models. Perhaps surprisingly, so does a baseline nearest neighbor model using BLAST percent identity. We evaluate the sensitivity of these models to perturbations of DNA-binding regions and control regions of protein sequences. The successful data-driven models learn to focus on DNA-binding regions. When predicting across taxa, the best models are highly accurate across species in the same kingdom and can provide some information when predicting across kingdoms.

The data and results for this article can be found at https://doi.org/10.5281/zenodo.5153906. The code for this article can be found at https://doi.org/10.5281/zenodo.5153683. The code, data and results can also be found at https://github.com/AZaitzeff/tools_for_dna_binding_proteins.
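
The abstract mentions a baseline nearest neighbor model using BLAST percent identity. As a rough illustration only, and not the authors' implementation, the idea might be sketched as follows: each query protein receives the label of the training protein with which it shares the highest percent identity. The sketch assumes BLAST has already been run with tabular output whose first three tab-separated columns are query ID, subject ID and percent identity; the function names, the sequence IDs and the choice to label hit-less queries as non-binding are all hypothetical.

```python
from typing import Dict, Iterable, List, Tuple


def best_hits(blast_lines: Iterable[str]) -> Dict[str, Tuple[str, float]]:
    """Return, for each query ID, its (subject ID, percent identity) hit
    with the highest percent identity.

    Assumes tab-separated rows whose first three columns are
    query ID, subject ID, percent identity; extra columns are ignored.
    """
    best: Dict[str, Tuple[str, float]] = {}
    for line in blast_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        query, subject, pident = fields[0], fields[1], float(fields[2])
        if query not in best or pident > best[query][1]:
            best[query] = (subject, pident)
    return best


def predict_nearest_neighbor(
    query_ids: List[str],
    blast_lines: Iterable[str],
    train_labels: Dict[str, int],
    default: int = 0,
) -> Dict[str, int]:
    """Label each query with the label of its most identical training hit.

    Queries with no hit, or whose best hit has no known label, receive
    `default` (an assumption of this sketch: 0 = not DNA-binding).
    """
    hits = best_hits(blast_lines)
    predictions: Dict[str, int] = {}
    for query in query_ids:
        if query in hits:
            subject, _ = hits[query]
            predictions[query] = train_labels.get(subject, default)
        else:
            predictions[query] = default
    return predictions


# Hypothetical example: three queries, two of which have hits.
rows = [
    "q1\ttrainA\t35.0",
    "q1\ttrainB\t82.5",
    "q2\ttrainC\t60.0",
]
labels = {"trainA": 0, "trainB": 1, "trainC": 0}  # 1 = DNA-binding
predictions = predict_nearest_neighbor(["q1", "q2", "q3"], rows, labels)
print(predictions)
```

Here `q1` takes its label from `trainB`, its closest hit by percent identity, while `q3` has no hit and falls back to the default label. A real evaluation would also need to decide how to break ties and whether to threshold on identity, which this sketch does not address.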


Published In

Bioinformatics (Oxford, England)

DOI

10.1093/bioinformatics/btab603

EISSN

1367-4811

ISSN

1367-4803

Publication Date

December 2021

Volume

38

Issue

1

Start / End Page

44 / 51

Related Subject Headings

  • Molecular Sequence Annotation
  • DNA-Binding Proteins
  • DNA
  • Bioinformatics
  • Amino Acid Sequence
  • 49 Mathematical sciences
  • 46 Information and computing sciences
  • 31 Biological sciences
  • 08 Information and Computing Sciences
  • 06 Biological Sciences
 

Citation

APA
Zaitzeff, A., Leiby, N., Motta, F. C., Haase, S. B., & Singer, J. M. (2021). Improved datasets and evaluation methods for the automatic prediction of DNA-binding proteins. Bioinformatics (Oxford, England), 38(1), 44–51. https://doi.org/10.1093/bioinformatics/btab603

Chicago
Zaitzeff, Alexander, Nicholas Leiby, Francis C. Motta, Steven B. Haase, and Jedediah M. Singer. “Improved datasets and evaluation methods for the automatic prediction of DNA-binding proteins.” Bioinformatics (Oxford, England) 38, no. 1 (December 2021): 44–51. https://doi.org/10.1093/bioinformatics/btab603.

ICMJE
Zaitzeff A, Leiby N, Motta FC, Haase SB, Singer JM. Improved datasets and evaluation methods for the automatic prediction of DNA-binding proteins. Bioinformatics (Oxford, England). 2021 Dec;38(1):44–51.

MLA
Zaitzeff, Alexander, et al. “Improved datasets and evaluation methods for the automatic prediction of DNA-binding proteins.” Bioinformatics (Oxford, England), vol. 38, no. 1, Dec. 2021, pp. 44–51. Epmc, doi:10.1093/bioinformatics/btab603.

NLM
Zaitzeff A, Leiby N, Motta FC, Haase SB, Singer JM. Improved datasets and evaluation methods for the automatic prediction of DNA-binding proteins. Bioinformatics (Oxford, England). 2021 Dec;38(1):44–51.
