Department of Statistics


The colloquia listed here are presented by visiting academic researchers, members of the business community, as well as by USC faculty and graduate students. The research topics introduced by the speakers span all areas of statistics.

Faculty, students, and off-campus visitors are invited to attend any of our colloquia and Palmetto Lecture Series.

2024 – 2025 Department of Statistics Colloquium Speakers

When: Thursday, August 22, 2024—2:50 p.m. to 3:50 p.m.
Where: LeConte 224

Speaker: Dr. Melissa Smith, Department of Biostatistics, University of Alabama at Birmingham

Abstract: A causal decomposition analysis allows researchers to determine whether the difference in a health outcome between two groups can be attributed to a difference in each group's distribution of one or more modifiable mediator variables. With this knowledge, researchers and policymakers can focus on designing interventions that target these mediator variables. In this talk, I will discuss the similarities and differences between a causal mediation analysis and a causal decomposition analysis. I will then present our recent work on a method for performing causal decomposition analyses with multiple correlated mediator variables. Existing methods for causal decomposition analysis either focus on one mediator variable or assume that each mediator variable is conditionally independent given the group label and the mediator-outcome confounders. Our Monte Carlo-based causal decomposition analysis method is designed to accommodate multiple correlated and interacting mediator variables, while identifying path-specific effects through individual mediators. I will illustrate an evaluation of our method through a simulation study and an application to examine potential reasons for Black-White differences in incident diabetes using data from a national cohort study.

When: Thursday, September 12, 2024—2:50 p.m. to 3:50 p.m.
Where: LeConte 224

Speaker: Dr. Will Cipolli, Department of Mathematics, Colgate University

Abstract: Much work has been done in "robustifying" standard statistical approaches with mixtures of multivariate Polya trees (MMPTs). In this talk, I will present a FAST Markov chain Monte Carlo (MCMC) sampling technique for MMPTs that overcomes difficulties in traditional sampling procedures and is completed in a fraction of the time. This new technique permits time-feasible Bayesian nonparametric solutions to contexts requiring many or repeated density estimates. The efficacy of this approach will be demonstrated via simulation and biomedical applications.

When: Thursday, September 19, 2024—2:50 p.m. to 3:50 p.m.
Where: LeConte 224

Speaker: Dr. Ian Dryden, Department of Statistics, University of South Carolina

Abstract: Complex object data such as networks and shapes are becoming increasingly available, and so there is a need to develop suitable methodology for statistical analysis. Networks can be represented as graph Laplacian matrices, which are a type of manifold-valued data. Shapes of 3D objects are also a type of manifold-valued data, invariant to translation, rotation and scale. Our main objective is to estimate a regression curve from a sample of graph Laplacian matrices or 3D shapes conditional on a set of Euclidean covariates, for example in dynamic objects where the covariate is time. We develop an adapted Nadaraya-Watson estimator which has uniform weak consistency for estimation using Euclidean and power Euclidean metrics, and we also explore splines on shape spaces.
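
For readers unfamiliar with the estimator named above, the following is a minimal sketch of the classical Nadaraya-Watson kernel regression estimator for scalar responses and a scalar covariate. It is not the adapted estimator for graph Laplacians or shapes described in the talk; the Gaussian kernel, the function name `nadaraya_watson`, and the toy data are assumptions made purely for illustration.

```python
import math


def nadaraya_watson(x0, xs, ys, bandwidth):
    """Classical Nadaraya-Watson estimate of E[Y | X = x0].

    Uses a Gaussian kernel (an illustrative choice): the estimate is a
    weighted average of the responses, with weights that decay as the
    covariate value moves away from x0.
    """
    if bandwidth <= 0:
        raise ValueError("bandwidth must be positive")
    weights = [math.exp(-0.5 * ((x0 - x) / bandwidth) ** 2) for x in xs]
    total = sum(weights)
    return sum(w * y for w, y in zip(weights, ys)) / total


# Toy example: the covariate plays the role of time.
times = [0.0, 1.0, 2.0]
values = [0.0, 1.0, 2.0]
estimate_at_one = nadaraya_watson(1.0, times, values, bandwidth=0.5)
```

In the manifold-valued setting of the talk, the weighted average would be taken with respect to a suitable metric rather than ordinary arithmetic; the sketch only conveys the kernel-weighting idea.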

When: Thursday, September 26, 2024—2:50 p.m. to 3:50 p.m.
Where: LeConte 224

Speaker: Dr. Kimberly Sellers, Department of Statistics, North Carolina State University

Abstract: While the Poisson distribution is a classical statistical model for count data, it hinges on the constraining equi-dispersion property (i.e., that the mean and variance are equal). This assumption, however, does not usually hold for real count data; over-dispersion (i.e., when the variance is greater than the mean) is the more common phenomenon, although data under-dispersion has also been prevalent in various settings. It would be more convenient to work with a distribution that can effectively model data (over- or under-) dispersion because it can offer more flexibility (and, thus, more appropriate inference) in the statistical methodology. This talk introduces the Conway-Maxwell-Poisson distribution along with several associated statistical methods motivated by this model to better analyze count data under various scenarios (e.g., distributional theory, generalized linear modeling, control chart theory, and count processes). As time permits, this talk will likewise acquaint the audience with available associated tools for statistical computing.
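
As a rough illustration of the dispersion behavior described above, the sketch below evaluates the Conway-Maxwell-Poisson probabilities, which are proportional to λ^x / (x!)^ν, and compares the resulting mean and variance. The truncation of the normalizing sum at 200 terms and the function names `cmp_probs` and `mean_and_variance` are assumptions for this example; this is not the speaker's software.

```python
import math


def cmp_probs(lam, nu, terms=200):
    """Conway-Maxwell-Poisson probabilities for x = 0, ..., terms - 1.

    The unnormalized weight of x is lam**x / (x!)**nu. Weights are computed
    on the log scale for numerical stability, and the infinite normalizing
    sum is truncated at `terms` (an approximation chosen for this sketch).
    """
    if lam <= 0 or nu <= 0:
        raise ValueError("lam and nu must be positive in this sketch")
    log_w = [x * math.log(lam) - nu * math.lgamma(x + 1) for x in range(terms)]
    peak = max(log_w)
    weights = [math.exp(v - peak) for v in log_w]
    z = sum(weights)
    return [w / z for w in weights]


def mean_and_variance(probs):
    """Mean and variance of a distribution on 0, 1, 2, ... given its pmf."""
    m = sum(x * p for x, p in enumerate(probs))
    v = sum((x - m) ** 2 * p for x, p in enumerate(probs))
    return m, v


poisson_like = mean_and_variance(cmp_probs(3.0, 1.0))  # nu = 1: Poisson
under = mean_and_variance(cmp_probs(3.0, 2.0))         # nu > 1: variance < mean
over = mean_and_variance(cmp_probs(3.0, 0.5))          # nu < 1: variance > mean
```

Setting ν = 1 recovers the Poisson distribution, so a single extra parameter covers the equi-, over-, and under-dispersed cases.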

When: Thursday, October 03, 2024—2:50 p.m. to 3:50 p.m.
Where: LeConte 224

Speaker: Dr. Rahul Ghosal, Department of Epidemiology and Biostatistics, University of South Carolina

Abstract: Modern clinical and epidemiological studies widely employ wearables to record parallel streams of real-time data on human physiology and behavior. With recent advances in distributional data analysis, these high-frequency data are now often treated as distributional observations, resulting in novel regression settings. Motivated by these modelling setups, we develop a distributional outcome regression via quantile functions (DORQF) that expands existing literature with three key contributions: i) handling both scalar and distributional predictors, ii) ensuring a jointly monotone regression structure without enforcing monotonicity on individual functional regression coefficients, iii) providing statistical inference via asymptotic projection-based joint confidence bands and a statistical test of global significance to quantify uncertainty of the estimated functional regression coefficients. The method is motivated by and applied to the Actiheart component of the Baltimore Longitudinal Study of Aging, which collected one week of minute-level heart rate (HR) and physical activity (PA) data on 781 older adults, to gain a deeper understanding of age-related changes in daily life heart rate reserve, defined as a distribution of daily HR, while accounting for the daily distribution of physical activity, age, gender, and body composition. Intriguingly, the results provide novel insights into the epidemiology of daily life heart rate reserve.

When: Thursday, October 10, 2024—2:50 p.m. to 3:50 p.m.
Where: LeConte 224

Speaker: Dr. Hongtu Zhu, Department of Biostatistics, University of North Carolina at Chapel Hill

Abstract: This talk provides an insightful overview of integrating artificial intelligence (AI) and statistical methods in medical data analysis. It is structured into three key sections: Introduction to Medical Image Data Analysis: This section sets the stage by outlining the fundamentals and significance of medical image analysis in healthcare, charting its evolution and current applications. State-of-the-Art AI Applications and Statistical Challenges: Here, we explore the impact of AI, particularly deep learning, on medical imaging, and address the accompanying statistical challenges, such as data quality and model interpretability. Opportunities for Statisticians: The final section highlights the critical role of statisticians in refining AI applications in medical imaging, focusing on opportunities for advancing algorithmic accuracy and integrating statistical rigor. The talk aims to demonstrate the crucial synergy between AI and statistics in enhancing medical data analysis, emphasizing the evolving challenges and the vital contributions of statisticians in this domain.

When: Tuesday, October 15, 2024—2:50 p.m. to 3:50 p.m.
Where: LeConte 224

Speaker: Dr. Weijie Su, Department of Statistics and Data Science, University of Pennsylvania

Abstract: Large language models (LLMs) have rapidly emerged as a transformative innovation in machine learning. However, their increasing influence on human decision-making processes raises critical societal questions. In this talk, we will demonstrate how statistics can help address two key challenges: ensuring fairness for minority groups through alignment and combating misinformation through watermarking. First, we tackle the challenge of creating fair LLMs that equitably represent and serve diverse populations. We derive a regularization term that is both necessary and sufficient for aligning LLMs with human preferences, ensuring equitable outcomes across different demographics. Second, we introduce a general statistical framework to analyze the efficiency of watermarking schemes for LLMs. We develop optimal detection rules for an important watermarking scheme recently developed at OpenAI and empirically demonstrate its superiority over the existing detection method. Throughout the talk, we will showcase how statistical insights can not only address pressing challenges posed by LLMs but also unlock substantial opportunities for the field of statistics to drive responsible generative AI development. This talk is based on arXiv:2405.16455 and arXiv:2404.01245.

When: Thursday, October 24, 2024—2:50 p.m. to 3:50 p.m.
Where: LeConte 224

Speaker: Dr. Whitney Huang, School of Mathematical and Statistical Sciences, Clemson University

Abstract: The class of max-stable models is commonly used for modeling multivariate and spatial extremes. Despite recent advancements in model construction and implementation, a fundamental limitation persists in incorporating timing information for extreme events due to the "component-wise maximum" data selection process. This limitation can lead to inaccurate assessments of multivariate and spatial extreme risk. In this talk, I will present a conditional approach to model multivariate extremes, aiming to capture extremes at the event level by conditioning on the timing and corresponding vector values when at least one variable is extreme. The proposed approach shares some similarities with the conditional extreme value models developed by Jonathan Tawn and his collaborators, but it treats the modeling of the conditional distribution of the concomitant variable(s) differently when the conditioning variable is extreme. Specifically, the conditional distribution function is modeled by a composition of distribution functions, where an extreme value base distribution is enriched by a conditional beta distribution. Simulated examples and an application to bivariate concurrent wind and precipitation extremes will illustrate the proposed approach.

When: Thursday, October 31, 2024—2:50 p.m. to 3:50 p.m.
Where: LeConte 224

Speaker: Dr. Sanat Sarkar, Department of Statistics, Temple University

Abstract: Simultaneous testing of multivariate Gaussian means against two-sided alternatives is considered under two different scenarios – (i) when the correlation matrix is known, and (ii) when the correlation matrix is unknown but estimated from an independent Wishart matrix. New methods, capturing dependence among the variables and with theoretically proven finite-sample control of the false discovery rate (FDR), are presented. When the correlation matrix is known, two methods, referred to as shifted-BH methods, are produced. Each of them is developed by shifting the p-values and considering a BH-type step-up procedure based on the shifted p-values. The amount of shift for each p-value is appropriately determined from the correlation matrix to achieve the desired FDR control. Simulation studies and a real-data application show favorable performance of the shifted-BH methods when compared with their relevant competitors available in the literature. When the correlation matrix is estimated using an independent Wishart matrix, no method with theoretically proven finite-sample FDR control is available in the literature, as far as we know. This talk will present some new results in this context addressing the long-standing open question: Can the Benjamini-Hochberg method in its original form theoretically control FDR?
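
For context, here is a minimal sketch of the original Benjamini-Hochberg step-up procedure that the abstract refers to. It is the standard BH rule, not the shifted-BH methods presented in the talk (the shift amounts depend on the correlation matrix and are not given here); the function name and the example p-values are illustrative assumptions.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Original Benjamini-Hochberg step-up procedure.

    Sort the m p-values, find the largest rank k such that
    p_(k) <= k * alpha / m, and reject the hypotheses with the k smallest
    p-values. Returns the sorted indices of the rejected hypotheses.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * alpha / m:
            k = rank  # keep the LARGEST rank that passes (step-up)
    return sorted(order[:k])


# Step-up behavior: the two smallest p-values miss their own thresholds,
# but the third passes, so all three are rejected.
rejected = benjamini_hochberg([0.02, 0.03, 0.035, 0.9], alpha=0.05)
```

The example shows why the procedure is called "step-up": a hypothesis can be rejected even when its own p-value exceeds its rank-specific threshold, as long as a larger-ranked p-value passes.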

When: Thursday, November 07, 2024—2:50 p.m. to 3:50 p.m.
Where: LeConte 224

Speaker: Dr. Andee Kaplan, Department of Statistics, Colorado State University

Abstract: With the ubiquity of data, linking data sets has become crucial for myriad applications including healthcare, official statistics, ecology, fraud detection, and national security. Record linkage is the task of resolving duplicates in two or more partially overlapping sets of records, or files, from noisy data sources without a unique identifier. In any field where multiple sources of messy data are available to answer a scientific problem, record linkage is critical in the analysis pipeline. In streaming record linkage, files arrive sequentially in time and estimates of links are updated after the arrival of each file. The challenge in streaming record linkage is to efficiently update parameter estimates as new data arrive. We approach the problem from a Bayesian perspective with estimates in the form of posterior samples of parameters and present methods for updating link estimates after the arrival of a new file that are faster than fitting a joint model with each new data file. In this talk, we present a Bayesian linkage model for the multi-file case formulated specifically for the streaming data context and propose computational methods to perform streaming updates that achieve near-equivalent posterior inference at a small fraction of the compute time.

When: Thursday, November 14, 2024—2:50 p.m. to 3:50 p.m.
Where: LeConte 224

Speaker: Dr. Yuqi Gu, Department of Statistics, Columbia University

Abstract: Mixed membership models are popular individual-level mixture models widely used in various fields including network analysis, topic modeling, and multivariate categorical data analysis. This work focuses on mixed membership models for multivariate categorical data, which are also called Grade of Membership (GoM) models. GoM models drastically increase the modeling flexibility of latent class models by allowing each individual to partially belong to multiple extreme latent profiles. However, such flexibility also comes with challenging identifiability and estimation issues, especially for high-dimensional polytomous (categorical with more than two categories) data. Such data take the form of a three-way tensor, with N subjects responding to J items each with C categories. Existing estimation methods based on maximum likelihood or Bayesian MCMC inference are not computationally efficient and lack high-dimensional theoretical guarantees. We propose an SVD-based spectral method for high-dimensional polytomous models with potential local dependence. We innovatively flatten the three-way tensor into a “fat” matrix and exploit the singular subspace geometry based on the matrix SVD for estimation. We establish fine-grained finite-sample entrywise error bounds for all the parameters. Moreover, we develop a novel two-to-infinity singular subspace perturbation theory under arbitrary locally dependent noise, which is of independent interest. Simulations and applications to real-world data in genetics, political science, and single-cell sequencing demonstrate the merit of the proposed method.

When: Thursday, November 21, 2024—2:50 p.m. to 3:50 p.m.
Where: LeConte 224

Speaker: Dr. Dana Tudorascu, Department of Psychiatry and Biostatistics, University of Pittsburgh

Abstract: Multisite imaging studies increase statistical power and enable the generalization of research outcomes; however, the variety in imaging acquisition, different PET tracer properties, and inter-scanner variability hinder the direct comparability of multi-scanner PET data. The PET imaging field is lagging behind in terms of harmonization methods due to the complexity associated with the combination of different tracers and different scanners. MRI presents similar challenges, but mainly due to scanner differences. In this study, we investigate samples of cognitively normal participants and of participants with mild cognitive impairment or Alzheimer’s disease in two major multisite studies of Alzheimer’s disease. We present challenges and solutions associated with different MRI scanners and PET tracers, as well as analysis and harmonization techniques including simple imaging standardization, ComBat, and deep learning methods. We show region-of-interest differences in PET outcome measures before and after harmonization in multisite studies of Alzheimer’s disease, as well as voxel-level harmonization along with summary measures before and after harmonization in MRI studies.

When: Thursday, December 05, 2024—2:50 p.m. to 3:50 p.m.
Where: LeConte 224

Speaker: Dr. Sumanta Basu, Department of Statistics and Data Science, Cornell University

Abstract: With advances in data collection and storage, statistical learning algorithms are becoming increasingly popular for structure learning and prediction with large-scale data sets that exhibit temporal or spatial dependence. Most algorithms in the literature focus on using off-the-shelf machine learning algorithms that ignore the dependent nature of the data. In this talk, we aim to demonstrate the merit of incorporating classical statistical wisdom for scale and dependence modeling into the statistical learning framework through two algorithms that we developed. The first, called RF-GLS, extends random forests (RF) for dependent error processes in the same way Generalized Least Squares (GLS) fundamentally extends Ordinary Least Squares (OLS) for linear models under dependence. The second algorithm, called AutoTune, offers an automatic tuning parameter selection algorithm for LASSO, by revisiting the well-known problem of scale estimation and adjustment for high-dimensional regression. We illustrate the benefit of these algorithms on simulated data sets, and provide some theoretical analysis to shed light on their asymptotic properties.

When: Thursday, January 16, 2025—2:50 p.m. to 3:50 p.m.
Where: LeConte 224

Speaker: Ms. Christine Franklin, Department of Statistics and Data Science, Cornell University

Abstract: Imagine a world where data and statistical reasoning are the cornerstones of discovery and decision-making. To realize this vision, developing proficiency in data analysis and statistical literacy must be as integral to our school curriculum as reading and numeracy. Data are more than just numbers; they encompass counts, measurements, images, videos, sounds, and words. Therefore, students must grapple with the dynamic nature of data and the requisite analytical tools to ensure our curriculum remains pertinent to their needs. What has been the journey at the school level toward equipping all students with the skills and mindset to interpret data and think statistically in today’s world? We will find that this journey has been marked by disruptive events that have propelled us closer to this goal, although there is still much to accomplish.

When: Thursday, January 23, 2025—2:40 p.m. to 3:50 p.m.
Where: LeConte 224

Speaker: Dr. Chih-Li Sung, Department of Statistics and Probability, Michigan State University

Abstract: In this talk, I will provide an overview of my research on Uncertainty Quantification (UQ), statistical modeling, and their applications in digital twins and complex systems. I will then focus on my recent work in multi-fidelity data and experimental design, specifically the paper "Stacking designs: designing multi-fidelity computer experiments with target predictive accuracy." With scientific experiments becoming increasingly costly, it is crucial to design multi-fidelity experiments that maximize predictive power while staying within a limited computational budget. I will introduce a novel stacking design methodology that simultaneously optimizes experimental design and ensures a desired prediction accuracy with confidence. By utilizing a multi-level reproducing kernel Hilbert space (RKHS) interpolator, the stacking design approach effectively balances accuracy and computational cost. I'll present key theoretical results, including a cost complexity theorem, and illustrate the method's practical advantages through simulation studies and an application in finite element analysis.

When: Thursday, January 30, 2025—2:50 p.m. to 3:50 p.m.
Where: LeConte 224

Speaker: Dr. Mingxia Liu, Department of Radiology and BRIC, University of North Carolina at Chapel Hill

Abstract: Multi-site multi-modal neuroimaging data, such as magnetic resonance imaging (MRI) and positron emission tomography (PET), are critical to expanding the diversity of subject populations and enhancing the statistical robustness of predictive models in neuroscience research. Despite their potential, the field faces substantial challenges, notably the heterogeneity of data across imaging sites and modalities. Addressing these complexities, my research focuses on creating machine learning and deep learning methodologies to analyze multi-modal imaging data from multiple sites, with the goal of uncovering imaging biomarkers associated with neurodegenerative disorders. This talk will delineate our progress in addressing three long-standing challenges: neuroimage representation learning, multi-modality neuroimage fusion, and multi-site data adaptation. Key highlights will include our latest advances in the representation learning of MRI, capturing both structural and functional dimensions. Subsequently, I will elucidate our strategies for the effective integration of multi-modal neuroimaging data, which promises the accurate synthesis of MRI and PET scans, particularly beneficial in cases plagued by missing or incomplete data modalities. Concluding the talk, I will introduce our comprehensive suite of multi-site neuroimage harmonization techniques and unveil DomainATM, our open-source toolbox specifically designed for medical data adaptation.

When: Thursday, February 06, 2025—2:50 p.m. to 3:50 p.m.
Where: LeConte 224

Speaker: Dr. Mikael Kuusela, Department of Statistics and Data Science, Carnegie Mellon University

Abstract: Ill-posed inverse problems are characterized by a large set of parameters that are consistent with the observed data. Rigorous and useful finite-sample inference in these problems is possible by inferring functionals of the unknown parameter subject to physical constraints. The presence of the functional, constraints and potential unidentifiability makes this a challenging inference problem. In this talk, I will show that confidence intervals in this setting can be obtained through the inversion of a specific, non-standard likelihood-ratio test. The critical values of this test are typically not available in closed form so we propose calibrating the test over a compact subset of the parameter space chosen so that it contains the true parameter with a high probability. I will then describe concrete computational strategies based on optimization and sampling for performing this calibration and test inversion in practice. I will first illustrate these ideas with low-dimensional toy examples and then present a realistic application to a moderately high-dimensional particle unfolding problem, where the new confidence intervals are significantly shorter than previous alternatives while maintaining correct finite-sample coverage. I will conclude by discussing the potential for carrying out this type of inference in ultra-high-dimensional inverse problems arising in climate science.

When: Thursday, February 13, 2025—2:50 p.m. to 3:50 p.m.
Where: LeConte 224

Speaker: Dr. Len Stefanski, Department of Statistics, North Carolina State University

Abstract: Ridge regression was introduced by Hoerl and Kennard (1970a,b) and twenty-six years later was followed by the introduction of the lasso (Tibshirani, 1996). The body of research ensuing from these seminal papers is staggering, and has contributed immensely to our understanding of shrinkage and selection methodology and to the practice of regression modeling in many areas of science. In some applications of regression modeling the goal is simply to achieve the best possible predictions of future response values. In other applications, interpretation is important as a way to guide understanding of the process under investigation. Ridge regression is very good at prediction, although it is often eclipsed by the lasso in terms of both prediction and interpretation because the lasso also allows for selection. The method introduced in this talk, fractional ridge regression, has the potential to improve both prediction (as measured by mean square error) and interpretability (as measured by the specificity of variable selection) relative to the lasso.

When: Thursday, February 20, 2025—2:50 p.m. to 3:50 p.m.
Where: LeConte 224

Speaker: Dr. Tama Budavari, Department of Applied Mathematics and Statistics, Johns Hopkins University

Abstract: Astronomy has always been a data-driven science. With the introduction of large-format CCD cameras and dedicated survey telescopes around the turn of the century, statistical studies of "big" data have started to dominate nearly every aspect of observational astronomy. The upcoming Rubin Observatory will produce 60 PB of raw data over the next decade. We will explore some of the fundamental challenges and recent advancements that could unlock better handling and understanding of the populations of celestial objects, from stars and galaxies to various transients. And it's all statistics!

When: Thursday, February 27, 2025—2:50 p.m. to 3:50 p.m.
Where: LeConte 224

Speaker: Dr. Chris Bilder, Department of Statistics, University of Nebraska - Lincoln

Abstract: Laboratories use group testing (also known as pooled testing) to test high volumes of clinical specimens for pathogens, such as SARS-CoV-2, West Nile virus, and Chlamydia trachomatis. The process works by testing multiple specimens together as an amalgamation (i.e., as a “group”), rather than testing each specimen separately, in an effort to reduce the total number of tests needed. There are many different algorithmic ways to apply group testing. The job of a statistician is to determine which algorithm will be best for a diagnostic testing laboratory to implement given the information available, such as disease prevalence. Algorithms are most often compared by their expected number of tests needed for an application, where a lower value is preferred. Unfortunately, this measure alone does not account for some algorithms having a lower expected number of tests at the expense of being much more complex to implement. For this reason, I propose a new comparison measure that I refer to as the complexity. In my presentation, I present its definition and derive its expression for several common algorithms. I show that some algorithms may be too complex for everyday implementation, while other algorithms should become more widely used. The proposed measure is illustrated for a recent SARS-CoV-2 testing implementation. 
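
To make the "expected number of tests" comparison concrete, the sketch below computes it for the simplest group testing algorithm, two-stage (Dorfman) testing: test each group once, then retest every member individually if the group is positive. It assumes a perfectly accurate assay and independent specimens with a common prevalence, both simplifications. It does not implement the complexity measure proposed in the talk, and the function names are hypothetical.

```python
def dorfman_tests_per_specimen(prevalence, group_size):
    """Expected tests per specimen under two-stage (Dorfman) group testing.

    Assumes a perfect assay and independent specimens. One group test is
    always run; with probability 1 - (1 - p)**s the group is positive and
    all s members are retested individually.
    """
    if not 0.0 <= prevalence <= 1.0:
        raise ValueError("prevalence must be between 0 and 1")
    if group_size < 1:
        raise ValueError("group_size must be at least 1")
    if group_size == 1:
        return 1.0  # individual testing: no second stage is needed
    return 1.0 / group_size + 1.0 - (1.0 - prevalence) ** group_size


def best_group_size(prevalence, max_size=50):
    """Group size (1..max_size) minimizing expected tests per specimen."""
    return min(
        range(1, max_size + 1),
        key=lambda s: dorfman_tests_per_specimen(prevalence, s),
    )


size_at_one_percent = best_group_size(0.01)
cost_at_one_percent = dorfman_tests_per_specimen(0.01, size_at_one_percent)
```

Under these assumptions, low prevalence favors pooling, while at high prevalence most groups test positive and pooling can require more tests than testing each specimen separately.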

When: Thursday, March 06, 2025—2:50 p.m. to 3:50 p.m.
Where: LeConte 224

Speaker: Dr. Chris Rorden, Department of Neuroimage, University of South Carolina

Abstract: Neuroimaging offers a non-invasive approach to understanding human brain function and neurological disorders. Yet, the field faces challenges due to small effect sizes and substantial individual variability. To address these limitations, the community is increasingly embracing shared datasets that promote reproducibility and reuse. By aggregating data across teams, we can enhance statistical power and improve the generalizability of findings. I will briefly highlight four recent advances from our team in South Carolina. First, I will discuss our efforts to share large clinical datasets publicly through OpenNeuro. Second, I will describe how techniques like stability selection can enhance machine learning models, supporting computer-aided diagnosis and prognosis. Third, I will introduce our cloud-based visualization tools, which facilitate data sharing, improve the interpretation of statistical maps, and enable cloud-based processing. Finally, I will share details about our institution's upcoming imaging center, which will serve as a hub for our rural health network, creating new opportunities for collaboration, discovery, and methodological refinement.

When: Tuesday, March 18, 2025—2:50 p.m. to 3:50 p.m.
Where: LeConte 224

Speaker: Dr. Xiao Wang, Department of Statistics, Purdue University

Abstract: Recent advancements in generative models have significantly improved our ability to analyze complex, high-dimensional data. In this talk, I will present two recent works that tackle multimodal sampling and manifold learning. The first part of the talk will introduce a novel approach to sampling from complex probability distributions with multiple isolated modes. By leveraging Wasserstein gradient flow theory and tempering strategies, our method overcomes the limitations of traditional transport-based samplers, significantly enhancing efficiency in high-dimensional settings. Applications to generative modeling and Bayesian inference will be discussed. In the second part, I will introduce a new framework for learning the intrinsic dimension of data manifolds in generative models. Our approach integrates Wasserstein autoencoders and GANs to dynamically adjust the latent space dimension, ensuring better representation learning and sample quality. Theoretical guarantees on intrinsic dimension estimation and generalization error will be provided, along with empirical results demonstrating superior performance in data generation tasks.

When: Thursday, March 20, 2025—2:50 p.m. to 3:50 p.m.
Where: LeConte 224

Speaker: Dr. Ruoqi Yu, Department of Statistics, University of Illinois Urbana-Champaign

Abstract: Many scientific questions in biomedical, environmental, and psychological research involve understanding the effects of multiple factors on outcomes. While factorial experiments are ideal for this purpose, randomized controlled treatment assignment is generally infeasible in many empirical studies. Therefore, investigators must rely on observational data, where drawing reliable causal inferences for multiple factors remains challenging. As the number of treatment combinations grows exponentially with the number of factors, some treatment combinations can be rare or missing by chance in observed data, further complicating factorial effects estimation. To address these challenges, we propose a novel weighting method tailored to observational studies with multiple factors. Our approach uses weighted observational data to emulate a randomized factorial experiment, enabling simultaneous estimation of the effects of multiple factors and their interactions. Our investigations reveal a crucial nuance: achieving balance among covariates, as in single-factor scenarios, is necessary but insufficient for unbiasedly estimating factorial effects; balancing the factors is also essential in multi-factor settings. Moreover, we extend our weighting method to handle missing treatment combinations in observed data. Finally, we study the asymptotic behavior of the new weighting estimators and propose a consistent variance estimator, providing reliable inferences on factorial effects in observational studies.

Past colloquium talks are archived here.

