Annotation of phenotypes using ontologies: a gold standard for the training and evaluation of natural language processing systems.

Publication, Journal Article
Dahdul, W; Manda, P; Cui, H; Balhoff, JP; Dececchi, TA; Ibrahim, N; Lapp, H; Vision, T; Mabee, PM
Published in: Database : the journal of biological databases and curation
January 2018

Natural language descriptions of organismal phenotypes, a principal object of study in biology, are abundant in the biological literature. Expressing these phenotypes as logical statements using ontologies would enable large-scale analysis on phenotypic information from diverse systems. However, considerable human effort is required to make these phenotype descriptions amenable to machine reasoning. Natural language processing tools have been developed to facilitate this task, and the training and evaluation of these tools depend on the availability of high quality, manually annotated gold standard data sets. We describe the development of an expert-curated gold standard data set of annotated phenotypes for evolutionary biology. The gold standard was developed for the curation of complex comparative phenotypes for the Phenoscape project. It was created by consensus among three curators and consists of entity-quality expressions of varying complexity. We use the gold standard to evaluate annotations created by human curators and those generated by the Semantic CharaParser tool. Using four annotation accuracy metrics that can account for any level of relationship between terms from two phenotype annotations, we found that machine-human consistency, or similarity, was significantly lower than inter-curator (human-human) consistency. Surprisingly, allowing curators access to external information did not significantly increase the similarity of their annotations to the gold standard or have a significant effect on inter-curator consistency. We found that the similarity of machine annotations to the gold standard increased after new relevant ontology terms had been added. Evaluation by the original authors of the character descriptions indicated that the gold standard annotations came closer to representing their intended meaning than did either the curator or machine annotations. These findings point toward ways to better design software to augment human curators, and the use of the gold standard corpus will allow training and assessment of new tools to improve phenotype annotation accuracy at scale.

Published In

Database : the journal of biological databases and curation

DOI

10.1093/database/bay110

EISSN

1758-0463

ISSN

1758-0463

Publication Date

January 2018

Volume

2018

Related Subject Headings

  • Phenotype
  • Natural Language Processing
  • Humans
  • Gene Ontology
  • Data Mining
  • Data Curation
  • 4605 Data management and data science
  • 3102 Bioinformatics and computational biology
  • 0807 Library and Information Studies
  • 0804 Data Format
 

Citation

APA
Dahdul, W., Manda, P., Cui, H., Balhoff, J. P., Dececchi, T. A., Ibrahim, N., … Mabee, P. M. (2018). Annotation of phenotypes using ontologies: a gold standard for the training and evaluation of natural language processing systems. Database : The Journal of Biological Databases and Curation, 2018. https://doi.org/10.1093/database/bay110

Chicago
Dahdul, Wasila, Prashanti Manda, Hong Cui, James P. Balhoff, T Alexander Dececchi, Nizar Ibrahim, Hilmar Lapp, Todd Vision, and Paula M. Mabee. “Annotation of phenotypes using ontologies: a gold standard for the training and evaluation of natural language processing systems.” Database : The Journal of Biological Databases and Curation 2018 (January 2018). https://doi.org/10.1093/database/bay110.

ICMJE
Dahdul W, Manda P, Cui H, Balhoff JP, Dececchi TA, Ibrahim N, et al. Annotation of phenotypes using ontologies: a gold standard for the training and evaluation of natural language processing systems. Database : the journal of biological databases and curation. 2018 Jan;2018.

MLA
Dahdul, Wasila, et al. “Annotation of phenotypes using ontologies: a gold standard for the training and evaluation of natural language processing systems.” Database : The Journal of Biological Databases and Curation, vol. 2018, Jan. 2018. Epmc, doi:10.1093/database/bay110.

NLM
Dahdul W, Manda P, Cui H, Balhoff JP, Dececchi TA, Ibrahim N, Lapp H, Vision T, Mabee PM. Annotation of phenotypes using ontologies: a gold standard for the training and evaluation of natural language processing systems. Database : the journal of biological databases and curation. 2018 Jan;2018.