Scholars@Duke publication: Beyond the E-Value: Stratified Statistics for Protein Domain Prediction.

Publication, Journal Article

Ochoa, A; Storey, JD; Llinás, M; Singh, M

Published in: PLoS Comput Biol

November 2015

E-values have been the dominant statistic for protein sequence analysis for the past two decades: from identifying statistically significant local sequence alignments to evaluating matches to hidden Markov models describing protein domain families. Here we formally show that for "stratified" multiple hypothesis testing problems (that is, those in which statistical tests can be partitioned naturally), controlling the local False Discovery Rate (lFDR) per stratum, or partition, yields the most predictions across the data at any given threshold on the FDR or E-value over all strata combined. For the important problem of protein domain prediction, a key step in characterizing protein structure, function and evolution, we show that stratifying statistical tests by domain family yields excellent results. We develop the first FDR-estimating algorithms for domain prediction, and evaluate how well thresholds based on q-values, E-values and lFDRs perform in domain prediction using five complementary approaches for estimating empirical FDRs in this context. We show that stratified q-value thresholds substantially outperform E-values. Contradicting our theoretical results, q-values also outperform lFDRs; however, our tests reveal a small but coherent subset of domain families, biased towards models for specific repetitive patterns, for which weaknesses in random sequence models yield notably inaccurate statistical significance measures. lFDR thresholds outperform q-values for the remaining families, which have as-expected noise, suggesting that further improvements in domain predictions can be achieved with improved modeling of random sequences. Overall, our theoretical and empirical findings suggest that the use of stratified q-values and lFDRs could result in improvements in a host of structured multiple hypothesis testing problems arising in bioinformatics, including genome-wide association studies, orthology prediction, and motif scanning.
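As a rough illustration of what "stratified" q-values mean in the abstract above, the sketch below computes q-values separately within each stratum (for example, each domain family) instead of over the pooled set of tests. It is not the authors' algorithm: it uses the plain Benjamini-Hochberg procedure (which assumes the proportion of true nulls is 1) as a stand-in, and the function names `bh_qvalues` and `stratified_qvalues`, as well as the toy p-values, are hypothetical.

```python
from collections import defaultdict


def bh_qvalues(pvalues):
    """Benjamini-Hochberg q-values for one list of p-values.

    Assumption: this is the simple BH estimator, used here only as a
    stand-in for whatever q-value estimator the paper actually uses.
    Returns q-values in the same order as the input.
    """
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    qvalues = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so q-values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        qvalues[i] = running_min
    return qvalues


def stratified_qvalues(pvalues, strata):
    """Compute q-values independently within each stratum.

    `strata` holds one label per test (hypothetically, a domain family
    name); tests sharing a label are corrected together.
    """
    groups = defaultdict(list)
    for idx, label in enumerate(strata):
        groups[label].append(idx)
    result = [None] * len(pvalues)
    for indices in groups.values():
        stratum_q = bh_qvalues([pvalues[i] for i in indices])
        for i, q in zip(indices, stratum_q):
            result[i] = q
    return result


# Toy data (made up): four tests split across two strata.
toy_p = [0.01, 0.5, 0.02, 0.03]
toy_strata = ["A", "A", "B", "B"]
toy_q = stratified_qvalues(toy_p, toy_strata)
```

A threshold would then be applied to the per-stratum q-values, so a stratum with many strong signals is not penalized by the number of tests in unrelated strata.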

Duke Scholars

Author Alejandro Ochoa Biostatistics & Bioinformatics, Division of Integrative Geno ...

Published In

PLoS Comput Biol

DOI

10.1371/journal.pcbi.1004509

EISSN

1553-7358

Publication Date

November 2015

Volume

11

Issue

11

Start / End Page

e1004509

Location

United States

Related Subject Headings

Proteins
Protein Structure, Tertiary
Models, Statistical
Databases, Protein
Computational Biology
Bioinformatics
Algorithms
08 Information and Computing Sciences
06 Biological Sciences
01 Mathematical Sciences

Citation

APA

Ochoa, A., Storey, J. D., Llinás, M., & Singh, M. (2015). Beyond the E-Value: Stratified Statistics for Protein Domain Prediction. PLoS Comput Biol, 11(11), e1004509. https://doi.org/10.1371/journal.pcbi.1004509

Chicago

Ochoa, Alejandro, John D. Storey, Manuel Llinás, and Mona Singh. “Beyond the E-Value: Stratified Statistics for Protein Domain Prediction.” PLoS Comput Biol 11, no. 11 (November 2015): e1004509. https://doi.org/10.1371/journal.pcbi.1004509.

ICMJE

Ochoa A, Storey JD, Llinás M, Singh M. Beyond the E-Value: Stratified Statistics for Protein Domain Prediction. PLoS Comput Biol. 2015 Nov;11(11):e1004509.

MLA

Ochoa, Alejandro, et al. “Beyond the E-Value: Stratified Statistics for Protein Domain Prediction.” PLoS Comput Biol, vol. 11, no. 11, Nov. 2015, p. e1004509. Pubmed, doi:10.1371/journal.pcbi.1004509.

NLM

Ochoa A, Storey JD, Llinás M, Singh M. Beyond the E-Value: Stratified Statistics for Protein Domain Prediction. PLoS Comput Biol. 2015 Nov;11(11):e1004509.
