2024-11-01
Jinyuan Liu, Ph.D., Department of Biostatistics, Vanderbilt University
"Distance-based Regression to Embrace High-dimensionality: from Cross-sectional to Longitudinal Studies with Missing Data"
Technological breakthroughs such as high-throughput sequencing generate an abundance of high-dimensional data that call for statistical innovation. Because directly modeling high-dimensional data is persistently challenged by multiple testing and weak signals, an emerging alternative is to aggregate the high-dimensional features at the outset. One example is to aggregate two subjects' features using a dissimilarity metric, yielding outcomes defined on pairs of subjects, or “between-subject attributes.” In the first half of this talk, I will extend the classical generalized linear model (GLM) to establish a distance-based regression paradigm for between-subject attributes, which aggregate individual signals to yield scientifically relevant insights into associations. We illustrate the proposed approach with high-dimensional microbiome sequence data, focusing on the well-recognized Beta-diversity, a between-subject distance index that offers a comprehensive picture of community structure.
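To make the idea of a between-subject attribute concrete, the following is a minimal sketch, not the speaker's method. It assumes Bray-Curtis dissimilarity as the Beta-diversity index (the abstract does not name a specific metric), and the function name `bray_curtis_matrix` and the toy count table are hypothetical. It also assumes every subject has at least one nonzero count.

```python
import numpy as np

def bray_curtis_matrix(counts):
    """Pairwise Bray-Curtis dissimilarity between rows (subjects) of a count table.

    Assumes each row has a positive total; an all-zero row would divide by zero.
    """
    counts = np.asarray(counts, dtype=float)
    # Sum over taxa of |x_i - x_j| for every pair of subjects (i, j)
    abs_diff = np.abs(counts[:, None, :] - counts[None, :, :]).sum(axis=2)
    # Combined total abundance for every pair of subjects
    totals = counts.sum(axis=1)
    pair_totals = totals[:, None] + totals[None, :]
    return abs_diff / pair_totals

# Hypothetical toy data: 3 subjects (rows) by 4 taxa (columns)
toy_counts = np.array([
    [10, 0, 5, 5],
    [10, 0, 5, 5],
    [0, 20, 0, 0],
])

# Each off-diagonal entry is one between-subject attribute: a single number
# summarizing how two subjects' high-dimensional profiles differ.
distances = bray_curtis_matrix(toy_counts)
```

In this sketch, subjects with identical profiles get distance 0 and subjects sharing no taxa get distance 1; a distance-based regression would treat these pairwise values, rather than the individual taxa, as the response.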
Given the dynamic nature of the human microbiome, repeated monitoring of changes in microbiota composition is pivotal to deciphering their role in human health. To characterize the change of Beta-diversity over time, we extend the distance-based regression to capture associations with time-varying clinical phenotypes that contribute to such changes, including diet, region, exposure, genetics, etc. As a natural extension of the familiar generalized estimating equations (GEE) for longitudinal within-subject attributes, the proposed approach provides not only robust and flexible inference with minimal model assumptions but also enhanced computational feasibility. Most importantly, it overcomes the challenges posed by commonly encountered missing data mechanisms such as missing at random (MAR). We illustrate the method with simulated and real data.