All the Clades in the World: Building a Semantically-Rich and Testable Ontology of Phylogenetic Clade Definitions

Journal Article

Taxonomic names are ambiguous as identifiers of biodiversity data, as they refer to a particular concept of a taxon in an expert’s mind (Kennedy et al. 2005). This ambiguity is particularly problematic when attempting to reconcile taxonomic names from disparate sources with clades on a phylogeny. Currently, such reconciliation requires expert interpretation, which is necessarily subjective, difficult to reproduce, and difficult to scale. In contrast, phylogenetic clade definitions are a well-developed method for unambiguously defining the semantics of a clade concept in terms of shared evolutionary ancestry (Queiroz and Gauthier 1990, Queiroz and Gauthier 1994), and these semantics allow clades to be located on any phylogeny. A few software tools have been created for resolving clade definitions, including definitions expressed in the Mathematical Markup Language (e.g. Names on Nodes in Keesey 2007) and as lists of GenBank accession numbers (e.g. mor in Hibbett et al. 2005). However, these are application-specific representations that do not provide formal definitions with well-defined semantics for every component of a clade definition. Machine-interpretable definitions of this kind would allow computers to store, compare, distribute and resolve semantically-rich clade definitions. To this end, the Phyloreferencing project (Cellinese and Lapp 2015) is working on a specification for encoding phylogenetic clade definitions as ontologies using the Web Ontology Language (OWL in W3C OWL Working Group 2012). Our specification allows the semantics of these definitions, which we call phyloreferences, to be described in terms of shared ancestor and excluded lineage properties. The aim of this effort is to allow any OWL-DL reasoner to resolve phyloreferences on a phylogeny that has itself been translated into a compatible OWL representation.
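The idea of describing a clade through shared ancestry and excluded lineages can be illustrated with a minimal sketch. It is written in plain Python rather than OWL and is not the project's specification or reasoner-based method. The toy phylogeny, the `PARENT` map and the function names are all assumptions made for illustration. With no excluded taxa, the definition resolves to the most recent common ancestor of the included taxa; with excluded taxa, it resolves to the largest clade that contains the included taxa and none of the excluded ones.

```python
# Hypothetical toy phylogeny as a child -> parent map (labels are illustrative only).
PARENT = {
    "A": "n1", "B": "n1",
    "n1": "n2", "C": "n2",
    "n2": "root", "D": "root",
}

def ancestors(node, parent):
    """Return the path from node up to the root, starting with node itself."""
    path = [node]
    while path[-1] in parent:
        path.append(parent[path[-1]])
    return path

def tips_of(node, parent):
    """Return the set of tips (nodes without children) that descend from node."""
    internal = set(parent.values())
    return {n for n in parent if n not in internal and node in ancestors(n, parent)}

def resolve(includes, excludes, parent):
    """Resolve a clade definition to a node label, or None if no node satisfies it."""
    if not excludes:
        # Shared ancestry only: most recent common ancestor of all included taxa.
        paths = [ancestors(s, parent) for s in includes]
        common = set(paths[0]).intersection(*paths[1:])
        return next(n for n in paths[0] if n in common)
    # Excluded lineages present: walk rootward from the first included taxon and
    # keep the largest clade that still contains no excluded taxon.
    best = None
    for node in ancestors(includes[0], parent):
        tips = tips_of(node, parent)
        if tips & set(excludes):
            break
        if set(includes) <= tips:
            best = node
    return best

print(resolve(["A", "C"], [], PARENT))     # smallest clade containing A and C
print(resolve(["A"], ["C"], PARENT))       # largest clade with A but without C
```

Because the definition is stated purely in terms of ancestry, the same `resolve` call can be applied to a different parent map, which is the sense in which such definitions can be located on any phylogeny that contains the referenced taxa.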
We have developed a workflow that allows us to curate phyloreferences from phylogenetic clade definitions published in natural language, and to resolve each curated phyloreference against the phylogeny upon which the definition was originally created, allowing us to validate that the phyloreference reflects the authors’ original intent. We have started work on curating dozens of phyloreferences from publications and the clade definition database RegNum, which will provide an online catalog of all clade definitions that are part of the Phylonym Volume, to be published together with the PhyloCode. We will comprehensively curate these definitions into a reusable and fully computable ontology of phyloreferences. In our presentation, we will provide an overview of phyloreferencing and will describe the model and workflow we use to encode clade definitions in OWL, based on concepts and terms taken from the Comparative Data Analysis Ontology (Prosdocimi et al. 2009), Darwin-SW (Baskauf and Webb 2016) and Darwin Core (Wieczorek et al. 2012). We will demonstrate how phyloreferences can be visualized, resolved and tested on the phylogeny on which they were originally described, and how they resolve on one of the largest synthetic phylogenies available, the Open Tree of Life (Hinchliff et al. 2015). We will conclude with a discussion of the problems we faced in referring to taxonomic units in phylogenies, which is one of the key challenges in enabling better integration of phylogenetic information into biodiversity analyses.
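The validation step in this workflow, checking that a curated definition resolves to the node the authors intended on their own phylogeny, might be sketched as follows. This is a hedged illustration in plain Python, not the project's OWL-based tooling. The `CHILDREN` map, the record fields (`label`, `specifiers`, `expected_node`) and the function names are hypothetical, and the sketch covers only definitions built from included taxa.

```python
# Hypothetical source phylogeny: each internal node label maps to its children.
CHILDREN = {
    "root": ["Outgroup", "cladeX"],
    "cladeX": ["Taxon_c", "cladeY"],
    "cladeY": ["Taxon_a", "Taxon_b"],
}

def tips(node, children):
    """Return the set of tip labels that descend from node."""
    if node not in children:
        return {node}
    result = set()
    for child in children[node]:
        result |= tips(child, children)
    return result

def smallest_clade_containing(specifiers, children):
    """Return the internal node with the fewest tips that contains every specifier."""
    wanted = set(specifiers)
    candidates = [n for n in children if wanted <= tips(n, children)]
    return min(candidates, key=lambda n: len(tips(n, children)), default=None)

def validate(curated, children):
    """Compare the resolved node with the node recorded during curation."""
    resolved = smallest_clade_containing(curated["specifiers"], children)
    return {
        "label": curated["label"],
        "expected": curated["expected_node"],
        "resolved": resolved,
        "passed": resolved == curated["expected_node"],
    }

# A curated record (illustrative): the definition text names two taxa, and the
# curator notes which node of the source phylogeny the authors labeled.
record = {
    "label": "CladeY",
    "specifiers": ["Taxon_a", "Taxon_b"],
    "expected_node": "cladeY",
}
print(validate(record, CHILDREN))
```

A failed check points to either a curation error or a mismatch between the specifier names and the tip labels of the phylogeny, which is where the difficulty of referring to taxonomic units becomes visible.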

Cited Authors

  • Vaidya, G; Zhang, G; Lapp, H; Cellinese, N

Published Date

  • May 21, 2018

Volume / Issue

  • 2

Start / End Page

  • e25776 - e25776

Electronic International Standard Serial Number (EISSN)

  • 2535-0897

Digital Object Identifier (DOI)

  • 10.3897/biss.2.25776