A phylogenetic transform enhances analysis of compositional microbiota data
High-throughput DNA sequencing technologies have revolutionized the study of microbial communities (microbiota) and have revealed their importance in both human health and disease. However, due to technical limitations, data from microbiota surveys reflect the relative abundance of bacterial taxa and not their absolute levels. It is well known that applying common statistical methods, such as correlation or hypothesis testing, to relative abundance data can lead to spurious results. Here, we introduce the PhILR transform, a data transform that utilizes microbial phylogenetic information. This transform enables off-the-shelf statistical tools to be applied to microbiota surveys free from artifacts usually associated with analysis of relative abundance data. Using environmental and human-associated microbial community datasets as benchmarks, we find that the PhILR transform significantly improves the performance of distance-based and machine learning-based statistics, boosting the accuracy of widely used algorithms on reference benchmarks by 90%. Because the PhILR transform relies on bacterial phylogenies, statistics applied in the PhILR coordinate system are also framed within an evolutionary perspective. Regression on PhILR-transformed human microbiota data identified evolutionarily neighboring bacterial clades that may have differentiated to adapt to distinct body sites. Variance statistics showed that the degree of covariation of bacterial clades across human body sites tended to increase with phylogenetic relatedness between clades. These findings support the hypothesis that environmental selection, not competition between bacteria, plays a dominant role in structuring human-associated microbial communities.
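As a rough illustration of the idea, the sketch below computes one log-ratio "balance" per internal node of a small binary tree. It assumes the transform follows the standard isometric log-ratio (ILR) balance construction, with the phylogeny supplying the binary partition of taxa. The tree, the taxon names, the counts, and the function names are all hypothetical, and the sketch leaves out any weighting or zero handling a full implementation may apply.

```python
import math

def geometric_mean(values):
    # Assumes strictly positive values (no zero handling in this sketch).
    return math.exp(sum(math.log(v) for v in values) / len(values))

def balance(left, right):
    # Standard ILR balance between two groups of parts:
    # sqrt(r*s / (r+s)) * ln(gm(left) / gm(right))
    r, s = len(left), len(right)
    scale = math.sqrt(r * s / (r + s))
    return scale * math.log(geometric_mean(left) / geometric_mean(right))

def leaves(node):
    # A leaf is a taxon name; an internal node is a (left, right) pair.
    if isinstance(node, str):
        return [node]
    left, right = node
    return leaves(left) + leaves(right)

def tree_balances(node, abundance):
    # One balance per internal node, listed root first.
    if isinstance(node, str):
        return []
    left, right = node
    here = balance(
        [abundance[t] for t in leaves(left)],
        [abundance[t] for t in leaves(right)],
    )
    return [here] + tree_balances(left, abundance) + tree_balances(right, abundance)

# Hypothetical four-taxon tree and made-up counts, for illustration only.
tree = (("A", "B"), ("C", "D"))
counts = {"A": 10, "B": 20, "C": 30, "D": 40}
total = sum(counts.values())
relative = {taxon: c / total for taxon, c in counts.items()}

from_counts = tree_balances(tree, counts)
from_relative = tree_balances(tree, relative)
```

Each balance is a ratio, so `from_counts` and `from_relative` agree up to floating-point error. The coordinates therefore do not depend on the unknown total abundance, which is why standard statistical tools can be applied to them directly.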