My research interests focus on computational and theoretical approaches for statistical phylogenetic studies of various evolutionary processes. These studies include the estimation of species relationships (i.e., inference of phylogenies), species divergence times and diversification rates, gene-tree discordance, and the genetics of adaptation. I address these problems by developing new mathematical and statistical theory, by implementing these statistical methods in computer software, and by analyzing empirical data.
Software and Statistical Model for Phylogeny Estimation
Phylogenies depict the evolutionary relationships among species. Much of the current interest in phylogenetics comes from advances in DNA sequencing technologies that now allow the phylogenetic comparison of hundreds or thousands of genes. However, the detailed evolutionary history of specific organisms remains largely unknown. A major challenge in phylogenetics is to infer events that occurred millions of years ago while only having data from extant species. Nevertheless, new mathematical and statistical methods continue to advance our ability to study these problems effectively.
My Ph.D. work introduced the concept of probabilistic graphical models to phylogenetics (Höhna et al., 2014, Systematic Biology). A probabilistic graphical model consists of vertices (the parameters/variables of the model) and edges (the dependencies between parameters). Vertices are associated with conditional probability distributions or parameter transformations; a larger model is thus decomposed into smaller, modular pieces. This model representation has the advantage of being easily extendable to more complex (i.e., realistic) models. Furthermore, teaching statistics to empirical biologists is simplified by probabilistic graphical models because model assumptions are made explicit and parameter dependencies are readily visualized. In a collaborative project, I combined and implemented this mathematical and statistical theory in a new computer program called RevBayes (Höhna et al., 2016, Systematic Biology). The initial development of this framework was extremely challenging and included several complete rewrites during the first years; however, this investment has produced a powerful framework that enables exciting new research.
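To make the vertex/edge decomposition concrete, here is a minimal Python sketch of a graphical model. It is an illustration only and does not reproduce the RevBayes implementation or its language. All class and variable names (`Constant`, `Deterministic`, `Stochastic`, `joint_log_prob`, `distance`, `p_diff`, and so on) are hypothetical, and the toy model (an exponential prior on a pairwise distance, a Jukes-Cantor transformation, and a binomial count of differing sites) is an assumption chosen for brevity.

```python
import math


class Node:
    """A vertex of the model graph; `parents` are its incoming edges."""

    def __init__(self, name, parents=()):
        self.name = name
        self.parents = tuple(parents)


class Constant(Node):
    """A fixed value with no parents."""

    def __init__(self, name, value):
        super().__init__(name)
        self.value = value


class Deterministic(Node):
    """A parameter transformation: its value is a function of its parents."""

    def __init__(self, name, parents, fn):
        super().__init__(name, parents)
        self.fn = fn

    @property
    def value(self):
        return self.fn(*(p.value for p in self.parents))


class Stochastic(Node):
    """A variable with a conditional log density given its parents."""

    def __init__(self, name, parents, log_density, value):
        super().__init__(name, parents)
        self.log_density = log_density
        self.value = value

    def log_prob(self):
        return self.log_density(self.value, *(p.value for p in self.parents))


def joint_log_prob(nodes):
    """The joint log density is the sum over the stochastic vertices."""
    return sum(n.log_prob() for n in nodes if isinstance(n, Stochastic))


def exp_logpdf(x, rate):
    """Log density of an exponential distribution with the given rate."""
    return math.log(rate) - rate * x


def binom_logpmf(k, n, p):
    """Log probability of k successes in n trials with success probability p."""
    log_choose = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return log_choose + k * math.log(p) + (n - k) * math.log(1.0 - p)


# Toy model (assumed for illustration): prior -> distance -> p_diff -> data.
prior_rate = Constant("prior_rate", 10.0)
distance = Stochastic("distance", [prior_rate], exp_logpdf, value=0.1)
# Jukes-Cantor probability that a site differs between two sequences.
p_diff = Deterministic(
    "p_diff", [distance], lambda d: 0.75 * (1.0 - math.exp(-4.0 * d / 3.0))
)
n_sites = Constant("n_sites", 100)
n_diff = Stochastic("n_diff", [n_sites, p_diff], binom_logpmf, value=9)

joint = joint_log_prob([prior_rate, distance, p_diff, n_sites, n_diff])
print(joint)
```

Because each vertex only knows its parents, replacing the exponential prior or the Jukes-Cantor transformation with a more complex component changes one node and leaves the rest of the model untouched, which is the modularity described above.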
The increasing use of genomic data for phylogeny inference raises a number of novel challenges and opportunities. For example, genomic sequence data exhibit qualitative differences in the substitution process across lineages, where the GC content of the sequences varies across branches of the phylogeny. This can be modeled efficiently using hierarchical models. Additionally, genomic data sets show large variation in the substitution process among genes. We developed partitioned-data analyses that can accommodate complex patterns of the substitution process within and among gene regions of an alignment.
Gene-tree discordance due to incomplete lineage sorting, gene duplication and transfer.
Inference of Macroevolutionary Processes
Many evolutionary processes entail differential rates of diversification: e.g., adaptive radiation, diversity-dependent diversification, key innovations, and mass extinction. Current methods for exploring these evolutionary processes involve the use of stochastic birth-death branching process models, where rates of diversification may be (a) time-dependent, (b) lineage-specific, or (c) character-dependent. However, our ultimate goal is to infer diversification rates through time and among lineages and to identify correlations with genetic, phenotypic and/or environmental factors that impact diversification rates, and thus species diversity.
In our previous work, we extended the theory on birth-death models (Höhna, 2013, Bioinformatics; Höhna, 2015, JTB; Höhna et al., 2016, Bioinformatics) and implemented computationally efficient methods to infer episodic changes in diversification rate, including cases in which species sampling is incomplete (Höhna, 2014, PLoS ONE; Höhna et al., 2011, MBE). In a recent study, we used this new approach to show that the diversity of conifers was impacted by one major mass-extinction event approximately 23 million years ago (May et al., 2016, MEE). This event coincides with the known timing of the expansion of more arid, grassland ecosystems. In an upcoming study, we also develop a statistical method to identify correlations between rates of diversification and environmental variables, such as changes in atmospheric CO2, which includes an empirical study of the daisy family, Asteraceae.
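As a small illustration of what episodic rate changes and mass-extinction events mean for species diversity, the following Python sketch computes the expected number of lineages under a birth-death process with piecewise-constant rates. It relies on the standard result that the expected lineage count grows as the exponential of the integrated net diversification rate (speciation minus extinction), and it treats a mass extinction as an instantaneous event that each lineage survives with a fixed probability. This is not the inference method of the cited papers; the function name `expected_lineages`, its arguments, and the rate values are hypothetical.

```python
import math


def expected_lineages(t, rate_epochs, mass_extinctions=(), n0=1.0):
    """Expected number of lineages at forward time t, starting from n0 at time 0.

    rate_epochs: list of (start_time, speciation_rate, extinction_rate),
        sorted by start_time, with the first epoch starting at 0.
    mass_extinctions: iterable of (time, survival_probability), with each
        survival probability in (0, 1].
    """
    log_n = math.log(n0)
    for i, (start, birth, death) in enumerate(rate_epochs):
        # An epoch lasts until the next one starts (the last one is open-ended).
        end = rate_epochs[i + 1][0] if i + 1 < len(rate_epochs) else math.inf
        duration = min(t, end) - start
        if duration <= 0:
            break
        # Integrate the net diversification rate over this epoch.
        log_n += (birth - death) * duration
    for when, survival in mass_extinctions:
        if when <= t:
            # Each lineage independently survives with this probability.
            log_n += math.log(survival)
    return math.exp(log_n)


# Hypothetical rates: fast diversification for 10 time units, then stasis.
epochs = [(0.0, 0.2, 0.1), (10.0, 0.1, 0.1)]
without_event = expected_lineages(20.0, epochs)
with_event = expected_lineages(20.0, epochs, mass_extinctions=[(15.0, 0.5)])
print(without_event, with_event)
```

In this sketch a mass extinction with survival probability 0.5 halves the expected diversity relative to the same rate history without the event, whereas a rate shift changes only the slope of log-diversity through time.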
Our next steps are to develop and implement robust models for inferring diversification-rate variation across lineages that correct problems with previous methods. These models allow us to study diversification-rate variation through time and among lineages. Currently, we obtain lineage-specific diversification-rate estimates independently of potentially correlated factors. Future projects in this research stream include the development of statistical methods to infer correlations of time-dependent and lineage-specific diversification rates with species traits (such as morphological traits), substitution rates, and environmental factors/habitat.