Machine Learning; Phylogenetics; Evolution
Surprisingly, despite extensive data, our understanding of some relationships in the tree of life is still unclear. For example, there is a debate over which group of mammals giraffes belong to. For the last several decades we have compared DNA from different species to determine how closely related they are. With new “next-generation” DNA sequencing, we are now able to analyze whole genomes. Although this provides substantially more information, different subsets of the genome produce conflicting results about species relationships. We approach the problem of distinguishing “information” from “noise” using a random forest machine learning approach. This method uses large datasets to generate decision trees, each of which we can think of as a series of yes/no questions asked about our data, leading to a prediction.
We obtained whole-genome sequence data from 36 mammals, which we divided into 24,625 loci (genomic segments), and characterized each locus using 16 variables. For example, we examined percent invariant, the percent of the locus that is the same for all species; percent biallelic, the percent of the locus where only two DNA bases are present across all species; and percent singleton, the percent of the locus where the DNA is consistent across all but a single species. We identified three main factors influencing the quality of data contributing to accurate phylogenies: percent coding (the percent of the data that codes for genes), percent invariant, and percent biallelic. The variable found to be most misleading during tree construction was percent coding. This result contrasts with decades of focus on using coding sequences to estimate species relationships because of their slow rate of change. However, this conclusion is consistent with work identifying coding loci as providing a less accurate understanding of evolutionary relationships, potentially due to the erratic nature of changes in regions of the genome under natural selection (Literman and Schwartz, in revision).
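As an illustration of the locus-level variables defined above, the following minimal Python sketch computes percent invariant, percent biallelic, and percent singleton for one aligned locus. The function name `locus_stats`, the input format (a list of equal-length aligned strings, one per species), and the handling of gaps and ambiguous bases (such sites count toward the locus length but toward none of the three categories) are assumptions made for illustration, not details of this study. In this sketch every singleton site is also counted as biallelic, since exactly two bases are present at such a site.

```python
from collections import Counter


def locus_stats(alignment):
    """Return percent invariant, biallelic, and singleton sites for one locus.

    alignment: list of equal-length aligned DNA strings, one per species
    (hypothetical input format chosen for this sketch).
    """
    if not alignment or not alignment[0]:
        raise ValueError("alignment must contain at least one non-empty sequence")
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("sequences must be aligned to the same length")

    invariant = biallelic = singleton = 0
    for column in zip(*alignment):  # iterate over alignment sites
        bases = [b.upper() for b in column]
        # Assumption: a site with a gap or ambiguous base falls in no category.
        if any(b not in "ACGT" for b in bases):
            continue
        counts = Counter(bases)
        if len(counts) == 1:
            invariant += 1  # same base in every species
        elif len(counts) == 2:
            biallelic += 1  # exactly two bases present across species
            if min(counts.values()) == 1:
                singleton += 1  # all species agree except one

    return {
        "percent_invariant": 100.0 * invariant / length,
        "percent_biallelic": 100.0 * biallelic / length,
        "percent_singleton": 100.0 * singleton / length,
    }


if __name__ == "__main__":
    toy_locus = ["ACGTA", "ACGTA", "ACCTT", "ATCTA"]  # made-up sequences
    print(locus_stats(toy_locus))
```

Applying such a function to each locus would yield one row of predictor variables per locus, which is the kind of table a random forest takes as input.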
To guard against overfitting our data, we performed 10-fold cross-validation on the dataset. We split the data into 10 sections, each serving once as the testing data while the remaining sections served as training data, and observed whether the same characteristics appeared near the top of the resulting trees. We found that 30% of the leading characteristics were present in every tree produced. We anticipate extending this work by examining different groups of species and adding further characteristics of the data as input to the model.
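The fold-stability check described above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the analysis pipeline used here: the function names (`k_fold_indices`, `rank_by_abs_correlation`, `shared_top_features`) are hypothetical, and the ranking step uses absolute Pearson correlation with the response as a simple stand-in for variable importance from a random forest. Any ranking function with the same signature could be substituted.

```python
import random
from statistics import mean


def k_fold_indices(n_rows, k=10, seed=0):
    """Shuffle row indices and deal them into k roughly equal test folds."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if either input is constant."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / (vx * vy) ** 0.5


def rank_by_abs_correlation(rows, labels):
    """Stand-in ranking (assumption): order features by |r| with the label."""
    names = list(rows[0])
    score = {n: abs(pearson([r[n] for r in rows], labels)) for n in names}
    return sorted(names, key=lambda n: score[n], reverse=True)


def shared_top_features(rows, labels, rank, k=10, top_n=3, seed=0):
    """Return the features ranked in the top_n of every training fold."""
    shared = None
    for test_idx in k_fold_indices(len(rows), k, seed):
        held_out = set(test_idx)
        train = [i for i in range(len(rows)) if i not in held_out]
        ranked = rank([rows[i] for i in train], [labels[i] for i in train])
        top = set(ranked[:top_n])
        shared = top if shared is None else shared & top
    return shared


if __name__ == "__main__":
    rng = random.Random(1)
    # Synthetic data: the label depends on "a" and "b" only.
    rows = [{n: rng.gauss(0, 1) for n in "abcd"} for _ in range(200)]
    labels = [3 * r["a"] - 2 * r["b"] + rng.gauss(0, 0.1) for r in rows]
    stable = shared_top_features(rows, labels, rank_by_abs_correlation, top_n=2)
    print(sorted(stable), len(stable) / 2)
```

Dividing the size of the returned set by `top_n` gives the fraction of leading variables recovered in every fold, which is the kind of quantity reported above.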