Inferring taxonomic placement from DNA barcoding aiding in discovery of new taxa
Predicting the taxonomic affiliation of DNA sequences collected from biological samples is a fundamental step in biodiversity assessment. This task is performed by leveraging existing databases of reference DNA sequences, each annotated with a taxonomic identification. However, environmental sequences may come from organisms that are either unknown to science or have no reference sequences available. Thus, the taxonomic novelty of a sequence needs to be accounted for during classification. We propose Bayesian nonparametric taxonomic classifiers, BayesANT, which use species sampling model priors to allow unobserved taxa to be discovered at each taxonomic rank. Using a simple product multinomial likelihood with conjugate Dirichlet priors at the lowest rank, we develop a highly flexible supervised algorithm that provides a probabilistic prediction of the taxonomic placement of each sequence at each rank. As an illustration, we run our algorithm on a carefully annotated library of Finnish arthropods (FinBOL). To assess the ability of BayesANT to recognize novelty and to predict known taxonomic affiliations correctly, we test it on two training-test splitting scenarios, each with a different proportion of taxa unobserved in training. We show that our algorithm attains accurate predictions and reliably quantifies classification uncertainty, especially when many sequences in the test set are affiliated with taxa unknown in training. By enabling taxonomic predictions for DNA barcodes to identify unseen branches, we believe BayesANT will be of broad utility as a tool for DNA metabarcoding within bioinformatics pipelines.
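The two ingredients named in the abstract, a product multinomial likelihood with conjugate Dirichlet priors and a species sampling prior that reserves mass for unobserved taxa, can be illustrated with a minimal single-rank sketch. This is not the BayesANT implementation. It assumes aligned sequences of equal length over the alphabet ACGT, a symmetric Dirichlet prior with hypothetical parameter `alpha`, and Pitman-Yor-style prior weights with hypothetical parameters `theta` and `sigma`. The function names and the `"NEW"` label are illustrative only.

```python
import math
from collections import Counter

ALPHABET = "ACGT"  # assumption: aligned sequences, no gaps or ambiguity codes


def log_predictive(seq, counts, n_seqs, alpha=1.0):
    """Log posterior predictive of `seq` under a product multinomial model
    with an independent symmetric Dirichlet(alpha) prior at each position."""
    total = 0.0
    for pos, base in enumerate(seq):
        numerator = counts[pos][base] + alpha
        denominator = n_seqs + alpha * len(ALPHABET)
        total += math.log(numerator / denominator)
    return total


def classify(seq, reference, alpha=1.0, theta=1.0, sigma=0.5):
    """Return posterior probabilities over the known taxa plus a "NEW" label.

    `reference` maps each taxon name to a non-empty list of aligned sequences.
    Prior weights follow a Pitman-Yor-style rule (an assumption of this sketch):
    a known taxon with m sequences gets (m - sigma) / (theta + n), and a novel
    taxon gets (theta + sigma * K) / (theta + n).
    """
    n = sum(len(seqs) for seqs in reference.values())
    k = len(reference)
    log_scores = {}
    for taxon, seqs in reference.items():
        # Per-position nucleotide counts for this taxon.
        counts = [Counter(s[pos] for s in seqs) for pos in range(len(seq))]
        log_prior = math.log((len(seqs) - sigma) / (theta + n))
        log_scores[taxon] = log_prior + log_predictive(seq, counts, len(seqs), alpha)
    # A novel taxon has no observed counts, so its predictive uses the prior alone.
    empty = [Counter() for _ in seq]
    log_prior_new = math.log((theta + sigma * k) / (theta + n))
    log_scores["NEW"] = log_prior_new + log_predictive(seq, empty, 0, alpha)
    # Normalize with the log-sum-exp trick for numerical stability.
    top = max(log_scores.values())
    norm = sum(math.exp(v - top) for v in log_scores.values())
    return {t: math.exp(v - top) / norm for t, v in log_scores.items()}


# Toy reference library (made-up sequences, for illustration only).
reference = {
    "sp_a": ["ACGT", "ACGT", "ACGA"],
    "sp_b": ["TTGC", "TTGC"],
}
probs_known = classify("ACGT", reference)
probs_novel = classify("GGAA", reference)
```

In this toy setup a query that matches a reference taxon closely concentrates its probability on that taxon, while a query unlike every reference shifts probability toward the `"NEW"` label. The method described in the abstract extends this idea across multiple taxonomic ranks, which the sketch does not attempt.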
Zito, A; Rigon, T; Dunson, DB