Stefan Groha DPhil

Senior AI/ML Engineer

I lead research in causal machine learning and generative models for biomarker discovery, target identification, and drug response prediction at GSK.ai.

My work leverages large-scale multi-omic datasets, genetic data, medical imaging, electronic health records, and clinical trial data to uncover causal biological mechanisms. I bring extensive expertise in bulk, single-cell, and spatial RNA sequencing, as well as perturbation data analysis, combining state-of-the-art machine learning with classical statistical methods.

Before joining GSK, I was a postdoctoral research fellow at the Dana-Farber Cancer Institute and the Broad Institute of MIT and Harvard, working with Alexander Gusev. My PhD research at the University of Oxford focused on the intersection of theoretical condensed matter physics, non-equilibrium statistical mechanics, and quantum computing.

Publications

Research in machine learning, computational biology, and theoretical physics

Selected Publications

Germline variants associated with toxicity to immune checkpoint blockade

Stefan Groha, et al. • Nature Medicine, 2022

• First GWAS to identify germline variants associated with immunotherapy toxicity

Immune checkpoint inhibitors (ICIs) have yielded remarkable responses in patients across multiple cancer types, but often lead to immune-related adverse events (irAEs). Although a germline cause for irAEs has been hypothesized, no systematic genome-wide association study (GWAS) has been performed and no individual variants associated with the overall likelihood of developing irAEs have yet been identified. We carried out a GWAS of 1,751 patients on ICIs across 12 cancer types, with replication in an independent cohort of 196 patients and independent clinical trial data from 2,275 patients. We investigated two irAE phenotypes: (i) high-grade (3-5) events defined through manual curation and (ii) all detectable events (including high-grade) defined through electronic health record (EHR) diagnosis followed by manual confirmation. We identified three genome-wide significant associations (p<5×10^(-8)) in the discovery cohort associated with all-grade irAEs: rs16906115 near IL7 (combined p=1.6×10^(-11); hazard ratio (HR)=2.1), rs75824728 near IL22RA1 (combined p=6.6×10^(-9); HR=1.9), and rs113861051 on 4p15 (combined p=1.3×10^(-8); HR=2.0), with rs16906115 replicating in two independent studies. The association near IL7 colocalized with the gain of a novel cryptic exon for IL7, a critical regulator of lymphocyte homeostasis. Patients carrying the IL7 germline variant exhibited significantly greater lymphocyte stability after ICI initiation than non-carriers, and this stability was predictive of downstream irAEs and improved survival.
@article{groha2022germline,
  title={Germline variants associated with toxicity to immune checkpoint blockade},
  author={Groha, Stefan and Alaiwi, Sarah Abou and Xu, Wenxin and Nassar, Amin H and Bakouny, Ziad and Adib, Elio and Nuzzo, Pier Vito and Schmidt, Andrew L and Labaki, Chris and Ricciuti, Biagio and others},
  journal={Nature Medicine},
  volume={28},
  number={12},
  pages={2584--2591},
  year={2022},
  publisher={Nature Publishing Group},
  doi={10.1038/s41591-022-02094-6}
}
                                            
SurviVAEl: Variational Autoencoders for Clustering Time Series

Stefan Groha, Alexander Gusev, Sebastian Schmon • NeurIPS Workshop, 2022

• Novel VAE framework for multi-state survival analysis

Classical multi-state models, such as the Cox-Markov model, are limited in their ability to handle non-linear, time-dependent, or high-dimensional data. To address these limitations, we propose SurviVAEl, a multi-state survival analysis framework that utilizes a Variational Autoencoder (VAE) architecture. Our model allows for the quantification of uncertainty in its predictions and facilitates the clustering of patient trajectories in an interpretable manner. We show that the latent space of the variational model produces meaningful clusters, which can be used with a non-parametric Aalen–Johansen estimator to obtain state occupation probabilities.
@inproceedings{groha2022survivael,
  title={SurviVAEl: Variational Autoencoders for Clustering Time Series},
  author={Groha, Stefan and Gusev, Alexander and Schmon, Sebastian M},
  booktitle={NeurIPS 2022 Workshop on Learning from Time Series for Health},
  year={2022},
  url={https://openreview.net/forum?id=pREEF8_kWNT}
}
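
The abstract above mentions feeding latent-space clusters into a non-parametric Aalen-Johansen estimator to obtain state occupation probabilities. As a rough illustration of that estimator only (not the authors' code), the sketch below computes occupation probabilities for one group of patient trajectories. The input layout, the function name `aalen_johansen`, and the toy data are assumptions made for this example.

```python
def aalen_johansen(paths, n_states):
    """Aalen-Johansen estimate of state occupation probabilities.

    paths: one (transitions, end_time) pair per subject, where transitions is
    a time-sorted list of (time, to_state). This layout is hypothetical.
    Every subject starts in state 0 at time 0 and is observed until end_time.
    Returns a list of (time, [P(state 0), P(state 1), ...]).
    """
    event_times = sorted({t for transitions, _ in paths for t, _ in transitions})
    # Row 0 of the transition probability matrix P(0, t).
    occupation = [1.0] + [0.0] * (n_states - 1)
    curve = []
    for t in event_times:
        at_risk = [0] * n_states
        moves = [[0] * n_states for _ in range(n_states)]
        for transitions, end_time in paths:
            if end_time < t:
                continue  # subject is no longer under observation
            state = 0
            for u, to_state in transitions:
                if u < t:
                    state = to_state  # state occupied just before t
                elif u == t:
                    moves[state][to_state] += 1
            at_risk[state] += 1
        # Multiply the occupation row vector by (I + dA(t)).
        updated = occupation[:]
        for h in range(n_states):
            if at_risk[h] == 0:
                continue
            for j in range(n_states):
                if j != h and moves[h][j]:
                    flow = occupation[h] * moves[h][j] / at_risk[h]
                    updated[h] -= flow
                    updated[j] += flow
        occupation = updated
        curve.append((t, occupation[:]))
    return curve


# Toy cluster with states 0 = healthy, 1 = sick, 2 = dead (made-up data).
toy_paths = [
    ([(1.0, 1), (3.0, 2)], 5.0),  # healthy -> sick -> dead
    ([(2.0, 2)], 5.0),            # healthy -> dead
    ([(2.0, 1)], 5.0),            # healthy -> sick
    ([], 4.0),                    # censored while healthy
]
toy_curve = aalen_johansen(toy_paths, 3)
```

In a clustering setting, one would presumably call such a function once per cluster and compare the resulting curves; how clusters are formed is outside this sketch.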
                                            
A General Framework for Survival Analysis and Multi-State Modelling

Stefan Groha*, Sebastian Schmon*, Alexander Gusev • arXiv preprint, 2020
* co-first author

• First general framework for multi-state survival analysis using neural ODEs

Survival models are a popular tool for the analysis of time-to-event data, with applications in medicine, engineering, economics and many other fields. Advances like the Cox proportional hazards model have enabled researchers to better describe hazard rates for the occurrence of single fatal events, but are limited by modeling assumptions, like proportionality of hazard rates and linear effects. Moreover, common phenomena are often better described through multiple states, for example, the progress of a disease might be modeled as healthy, sick and dead instead of healthy and dead, where the competing nature of death and disease has to be taken into account. Also, individual characteristics can vary significantly between observational units, like patients, resulting in idiosyncratic hazard rates and different disease trajectories. These considerations require flexible modeling assumptions. Current standard models, however, are often ill-suited for such an analysis. To overcome these issues, we propose the use of neural ordinary differential equations as a flexible and general method for estimating multi-state survival models by directly solving the Kolmogorov forward equations. To quantify the uncertainty in the resulting individual cause-specific hazard rates, we further introduce a variational latent variable model. We show that our model exhibits state-of-the-art performance on popular survival data sets and demonstrate its efficacy in a multi-state setting.
@article{groha2020general,
  title={A General Framework for Survival Analysis and Multi-State Modelling},
  author={Groha, Stefan and Schmon, Sebastian M and Gusev, Alexander},
  journal={arXiv preprint arXiv:2006.04893},
  year={2020},
  url={https://arxiv.org/abs/2006.04893}
}
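
To make the phrase "directly solving the Kolmogorov forward equations" in the abstract above concrete, here is a minimal sketch that integrates dP/dt = P(t) Q(t) for a three-state illness-death model. It is not the paper's implementation: the paper parameterizes the transition rates with a neural network, whereas this example assumes fixed, made-up hazard rates and a hand-written RK4 integrator. All function names are hypothetical.

```python
import math


def mat_mul(a, b):
    """Product of two square matrices stored as nested lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def add_scaled(a, b, c):
    """Return a + c * b element-wise."""
    return [[x + c * y for x, y in zip(row_a, row_b)]
            for row_a, row_b in zip(a, b)]


def solve_forward(generator, t_end, n_steps):
    """Integrate dP/dt = P(t) Q(t) from P(0) = I with classical RK4."""
    n = len(generator(0.0))
    P = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    h = t_end / n_steps
    for step in range(n_steps):
        t = step * h
        k1 = mat_mul(P, generator(t))
        k2 = mat_mul(add_scaled(P, k1, h / 2), generator(t + h / 2))
        k3 = mat_mul(add_scaled(P, k2, h / 2), generator(t + h / 2))
        k4 = mat_mul(add_scaled(P, k3, h), generator(t + h))
        slope = add_scaled(add_scaled(add_scaled(k1, k2, 2.0), k3, 2.0), k4, 1.0)
        P = add_scaled(P, slope, h / 6)
    return P


def illness_death_generator(t):
    """Assumed constant hazards: healthy->sick 0.3, healthy->dead 0.1, sick->dead 0.5."""
    return [
        [-0.4, 0.3, 0.1],
        [0.0, -0.5, 0.5],
        [0.0, 0.0, 0.0],
    ]


P_two_years = solve_forward(illness_death_generator, 2.0, 200)
# Entry [0][0] is the probability of remaining healthy, exp(-0.4 * 2) analytically.
```

Because `generator` takes the time `t` as an argument, the same integrator also handles time-dependent rates; replacing the constant matrix with a learned function of time and covariates is the step this sketch leaves out.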
                                            

Machine Learning / Computational Biology

  • A comprehensive analysis of clinical and polygenic germline influences on somatic mutational burden

    Kodi Taraszka, Stefan Groha, et al. • The American Journal of Human Genetics, 2024

    Tumor mutational burden (TMB), the total number of somatic mutations in the tumor, and copy number burden (CNB), the corresponding measure of aneuploidy, are established fundamental somatic features and emerging biomarkers for immunotherapy. However, the genetic and non-genetic influences on TMB/CNB and, critically, the manner by which they influence patient outcomes remain poorly understood. Here, we present a large germline-somatic study of TMB/CNB with >23,000 individuals across 17 cancer types, of which 12,000 also have extensive clinical, treatment, and overall survival (OS) measurements available. We report dozens of clinical associations with TMB/CNB, observing older age and male sex to have a strong effect on TMB and weaker impact on CNB. We additionally identified significant germline influences on TMB/CNB, including fine-scale European ancestry and germline polygenic risk scores (PRSs) for smoking, tanning, white blood cell counts, and educational attainment. We quantify the causal effect of exposures on somatic mutational processes using Mendelian randomization. Many of the identified features associated with TMB/CNB were additionally associated with OS for individuals treated at a single tertiary cancer center. For individuals receiving immunotherapy, we observed a complex relationship between PRSs for educational attainment, self-reported college attainment, TMB, and survival, suggesting that the influence of this biomarker may be substantially modified by socioeconomic status. While the accumulation of somatic alterations is a stochastic process, our work demonstrates that it can be shaped by host characteristics including germline genetics.
    @article{taraszka2024comprehensive,
      title={A comprehensive analysis of clinical and polygenic germline influences on somatic mutational burden},
      author={Taraszka, Kodi and Groha, Stefan and King, David and Tell, Robert and White, Kevin and Ziv, Elad and Zaitlen, Noah and Gusev, Alexander},
      journal={The American Journal of Human Genetics},
      volume={111},
      number={2},
      pages={242--258},
      year={2024},
      publisher={Elsevier},
      doi={10.1016/j.ajhg.2023.12.010}
    }
  • Discovery of disease-associated cellular states using ResidPCA in single-cell RNA and ATAC sequencing data

    Shaye Carver, Kodi Taraszka, Stefan Groha, Alexander Gusev • bioRxiv, 2024

    To enhance understanding of cellular heterogeneity and disease from single-cell sequencing data, we introduce ResidPCA, a robust method for cell state identification that models cell type heterogeneity. Simulations demonstrate ResidPCA's efficacy, particularly in complex scenarios, with its accuracy more than four times higher than conventional Principal Component Analysis (PCA) and over three times higher than Non-negative Matrix Factorization (NMF)-based methods in identifying states expressed across multiple cell types. In single nucleus data from an Alzheimer's disease cohort, ResidPCA identified 44 snATAC-based and 42 snRNA-based states. 30 snATAC states were significantly enriched for Alzheimer's disease heritability and were often more significantly enriched than established cell types such as microglia. The ResidPCA-based snATAC state most significantly enriched for Alzheimer's disease heritability further elucidates a recently identified mechanism involving the neuron-ODC-microglial axis. This state links early amyloid production in neurons and oligodendrocytes with later-stage microglial activation and immune response, driving Alzheimer's disease progression. These results demonstrate ResidPCA's ability to reveal additional biological variation in single-cell data and uncover disease-relevant cell states.
    @article{carver2024discovery,
      title={Discovery of disease-associated cellular states using ResidPCA in single-cell RNA and ATAC sequencing data},
      author={Carver, Shaye and Taraszka, Kodi and Groha, Stefan and Gusev, Alexander},
      journal={bioRxiv},
      year={2024},
      publisher={Cold Spring Harbor Laboratory},
      doi={10.1101/2024.12.29.630536}
    }
  • SurvLatent ODE: A Neural ODE based time-to-event model with competing risks for longitudinal data improves cancer-associated Venous Thromboembolism (VTE) prediction

    Intae Moon, Stefan Groha, Alexander Gusev • ML4H Conference, 2022

    Effective learning from electronic health records (EHR) data for prediction of clinical outcomes is often challenging because of features recorded at irregular timesteps and loss to follow-up as well as competing events such as death or disease progression. To that end, we propose a generative time-to-event model, SurvLatent ODE, which adopts an Ordinary Differential Equation-based Recurrent Neural Network (ODE-RNN) as an encoder to effectively parameterize the dynamics of latent states under irregularly sampled input data. Our model then utilizes the resulting latent embedding to flexibly estimate survival times for multiple competing events without specifying the shapes of the event-specific hazard functions. We demonstrate competitive performance of our model on MIMIC-III, a freely available longitudinal dataset collected from critical care units, on predicting hospital mortality as well as on data from the Dana-Farber Cancer Institute (DFCI) on predicting onset of Venous Thromboembolism (VTE), a life-threatening complication for patients with cancer, with death as a competing event. SurvLatent ODE outperforms the current clinical standard Khorana Risk scores for stratifying VTE risk groups, while providing clinically meaningful and interpretable latent representations.
    @inproceedings{moon2022survlatent,
      title={SurvLatent ODE: A Neural ODE based time-to-event model with competing risks for longitudinal data improves cancer-associated Venous Thromboembolism (VTE) prediction},
      author={Moon, Intae and Groha, Stefan and Gusev, Alexander},
      booktitle={Proceedings of the 7th Machine Learning for Healthcare Conference},
      pages={800--827},
      year={2022},
      volume={182},
      publisher={PMLR}
    }
  • Constructing germline research cohorts from the discarded reads of clinical tumor sequences

    Alexander Gusev, Stefan Groha, et al. • Genome Medicine, 2021

    Background: Hundreds of thousands of cancer patients have had targeted (panel) tumor sequencing to identify clinically meaningful mutations. In addition to improving patient outcomes, this activity has led to significant discoveries in basic and translational domains. However, the targeted nature of clinical tumor sequencing has a limited scope, especially for germline genetics. In this work, we assess the utility of discarded, off-target reads from tumor-only panel sequencing for the recovery of genome-wide germline genotypes through imputation.

    Methods: We developed a framework for inference of germline variants from tumor panel sequencing, including imputation, quality control, inference of genetic ancestry, germline polygenic risk scores, and HLA alleles. We benchmarked our framework on 833 individuals with tumor sequencing and matched germline SNP array data. We then applied our approach to a prospectively collected panel sequencing cohort of 25,889 tumors.

    Results: We demonstrate high to moderate accuracy of each inferred feature relative to direct germline SNP array genotyping: individual common variants were imputed with a mean accuracy (correlation) of 0.86, genetic ancestry was inferred with a correlation of > 0.98, polygenic risk scores were inferred with a correlation of > 0.90, and individual HLA alleles were inferred with a correlation of > 0.80. We demonstrate a minimal influence on the accuracy of somatic copy number alterations and other tumor features. We showcase the feasibility and utility of our framework by analyzing 25,889 tumors and identifying the relationships between genetic ancestry, polygenic risk, and tumor characteristics that could not be studied with conventional on-target tumor data.

    Conclusions: We conclude that targeted tumor sequencing can be leveraged to build rich germline research cohorts from existing data and make our analysis pipeline publicly available to facilitate this effort.
    @article{gusev2021constructing,
      title={Constructing germline research cohorts from the discarded reads of clinical tumor sequences},
      author={Gusev, Alexander and Groha, Stefan and Taraszka, Kodi and Semenov, Yevgeniy R and Zaitlen, Noah},
      journal={Genome Medicine},
      volume={13},
      number={1},
      pages={179},
      year={2021},
      month={Nov},
      day={8},
      publisher={BioMed Central},
      doi={10.1186/s13073-021-00999-4},
      pmid={34749793},
      pmcid={PMC8576948}
    }
  • Clinical inflection point detection on the basis of EHR data to identify clinical trial-ready patients with cancer

    Kenneth Kehl, Stefan Groha, et al. • JCO Clinical Cancer Informatics, 2021

    Purpose: To develop methods using electronic health records (EHRs) to identify cancer patients experiencing "clinical inflection points" that indicate worsening prognosis or high potential for treatment changes, enabling real-time clinical trial screening.

    Methods: Researchers trained a deep neural network natural language processing (NLP) model using serial unstructured imaging reports from patients with solid tumors or lymphoma. The model aimed to dynamically predict patients' prognoses and estimate propensity to start new palliative-intent systemic therapy within 30 days. Model performance was evaluated using Harrell's c-index for prognosis and area under the receiver operating characteristic curve (AUC) for new treatment and clinical trial enrollment.

    Results: The model was trained on 302,688 imaging reports for 16,780 patients. In a test set of 34,770 reports for 1,952 patients, it predicted survival with a c-index of 0.76 and new treatment initiation with an AUC of 0.77. Model outputs were also associated with cancer progression identified on manual review, with an AUC of 0.78 for prognostic scores and 0.84 for new treatment predictions.

    Conclusion: Training a deep NLP model to identify clinical inflection points is feasible and could enable real-time, targeted clinical trial screening at a health system scale.
    @article{kehl2021clinical,
      title={Clinical Inflection Point Detection on the Basis of EHR Data to Identify Clinical Trial–Ready Patients With Cancer},
      author={Kehl, Kenneth L and Groha, Stefan and Lepisto, Eva M and Elmarakeby, Haitham and Lindsay, James and Gusev, Alexander and Van Allen, Eliezer M and Hassett, Michael J and Schrag, Deborah},
      journal={JCO Clinical Cancer Informatics},
      volume={5},
      pages={622--630},
      year={2021},
      publisher={American Society of Clinical Oncology},
      doi={10.1200/CCI.20.00184}
    }
  • Topological Data Analysis of copy number alterations in cancer

    Stefan Groha*, Caroline Weis*, Alexander Gusev, Bastian Rieck • arXiv preprint, 2020
    * co-first author

    Identifying subgroups and properties of cancer biopsy samples is a crucial step towards obtaining precise diagnoses and being able to perform personalized treatment of cancer patients. Recent data collections provide a comprehensive characterization of cancer cell data, including genetic data on copy number alterations (CNAs). We explore the potential to capture information contained in cancer genomic information using a novel topology-based approach that encodes each cancer sample as a persistence diagram of topological features, i.e., high-dimensional voids represented in the data. We find that this technique has the potential to extract meaningful low-dimensional representations in cancer somatic genetic data and demonstrate the viability of some applications on finding substructures in cancer data as well as comparing similarity of cancer types.
    @article{groha2020topological,
      title={Topological Data Analysis of copy number alterations in cancer},
      author={Groha, Stefan and Weis, Caroline and Gusev, Alexander and Rieck, Bastian},
      journal={arXiv preprint arXiv:2011.11070},
      year={2020},
      note={arXiv:2011.11070},
      url={https://arxiv.org/abs/2011.11070}
    }
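
Several entries in this list report survival-model performance; the clinical inflection point paper, for instance, evaluates prognosis with Harrell's c-index. As a reminder of what that metric measures, here is a small self-contained sketch. It is a textbook-style illustration, not code from any of the papers; the function name, the convention that a higher score means higher risk, and the toy numbers are all assumptions.

```python
def harrell_c_index(times, events, risk_scores):
    """Harrell's concordance index for right-censored data.

    times: observed follow-up times.
    events: 1 if the event was observed, 0 if the subject was censored.
    risk_scores: model output, assumed higher = expected earlier event.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored subject cannot be the earlier member of a pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5  # tied scores count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Made-up example: five patients, two of them censored.
toy_times = [2, 4, 5, 7, 9]
toy_events = [1, 1, 0, 1, 0]
toy_scores = [0.9, 0.4, 0.6, 0.3, 0.1]
toy_c = harrell_c_index(toy_times, toy_events, toy_scores)  # 7 of 8 pairs concordant
```

A value of 0.5 corresponds to random ordering and 1.0 to perfect ordering. The double loop is quadratic in the number of patients, which is acceptable for illustration but not for large cohorts.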

Theoretical Physics

  • Full counting statistics in the transverse field Ising chain

    Stefan Groha, Fabian Essler, Pasquale Calabrese • SciPost Physics, 2018

    We consider the full probability distribution for the transverse magnetization of a finite subsystem in the transverse field Ising chain. We derive a determinant representation of the corresponding characteristic function for general Gaussian states. We consider applications to the full counting statistics in the ground state, finite temperature equilibrium states, non-equilibrium steady states and time evolution after global quantum quenches. We derive an analytical expression for the time and subsystem size dependence of the characteristic function at sufficiently late times after a quantum quench. This expression features an interesting multiple light-cone structure.
    @article{groha2018full,
      title={Full counting statistics in the transverse field Ising chain},
      author={Groha, Stefan and Essler, Fabian H. L. and Calabrese, Pasquale},
      journal={SciPost Physics},
      volume={4},
      number={6},
      pages={043},
      year={2018},
      publisher={SciPost},
      doi={10.21468/SciPostPhys.4.6.043}
    }
  • Full counting statistics in the spin-1/2 Heisenberg XXZ chain

    Mario Collura*, Fabian HL Essler*, Stefan Groha* • J. Phys. A: Math. Theor., 2017
    * co-first author

    The spin-1/2 Heisenberg chain exhibits a quantum critical regime characterized by quasi long-range magnetic order at zero temperature. We quantify the strength of quantum fluctuations in the ground state by determining the probability distributions of the components of the (staggered) subsystem magnetization. We derive a determinant representation for the characteristic function of the full counting statistics of the staggered magnetization component along the z-axis. We study the limiting distributions in different regimes and analyze the scaled cumulant generating function.
    @article{collura2017full,
      title={Full counting statistics in the spin-1/2 Heisenberg XXZ chain},
      author={Collura, Mario and Essler, Fabian H. L. and Groha, Stefan},
      journal={Journal of Physics A: Mathematical and Theoretical},
      volume={50},
      number={34},
      pages={344003},
      year={2017},
      publisher={IOP Publishing},
      doi={10.1088/1751-8121/aa7a2e}
    }
  • Spinon decay in the spin-1/2 Heisenberg chain with weak next nearest neighbour exchange

    Stefan Groha, Fabian HL Essler • J. Phys. A: Math. Theor., 2017

    Integrable models support elementary excitations with infinite lifetimes. In the spin-1/2 Heisenberg chain these are known as spinons. We consider the stability of spinons when a weak integrability breaking perturbation is added to the Heisenberg chain in a magnetic field. We calculate the spinon decay rate using perturbation theory and identify the dominant decay channels. We find that the decay rate is small, indicating that spinons remain well-defined excitations even though integrability is broken.
    @article{groha2017spinon,
      title={Spinon decay in the spin-1/2 Heisenberg chain with weak next nearest neighbour exchange},
      author={Groha, Stefan and Essler, Fabian H. L.},
      journal={Journal of Physics A: Mathematical and Theoretical},
      volume={50},
      number={33},
      pages={334002},
      year={2017},
      publisher={IOP Publishing},
      doi={10.1088/1751-8121/aa7d41}
    }
  • Thermalization and light cones in a model with weak integrability breaking

    Bruno Bertini*, Fabian HL Essler*, Stefan Groha*, Neil J Robinson* • Phys. Rev. B, 2016
    * co-first author

    We employ equation of motion techniques to study the non-equilibrium dynamics in a lattice model of weakly interacting spinless fermions. Our model provides a simple setting for analyzing the effects of weak integrability breaking perturbations on the time evolution after a quantum quench. We observe prethermalization plateaux for weak integrability-breaking interactions and a crossover towards thermal behavior at later times. We find an intermediate light cone region with a temporal width scaling as t^(1/3).
    @article{bertini2016thermalization,
      title={Thermalization and light cones in a model with weak integrability breaking},
      author={Bertini, Bruno and Essler, Fabian H. L. and Groha, Stefan and Robinson, Neil J.},
      journal={Physical Review B},
      volume={94},
      number={24},
      pages={245117},
      year={2016},
      publisher={American Physical Society},
      doi={10.1103/PhysRevB.94.245117}
    }
  • Prethermalization and thermalization in models with weak integrability breaking

    Bruno Bertini*, Fabian HL Essler*, Stefan Groha*, Neil J Robinson* • Phys. Rev. Lett., 2015
    * co-first author

    We study the effects of integrability breaking perturbations on the non-equilibrium evolution of many-particle quantum systems. We focus on a class of spinless fermion models with weak interactions. Using equation of motion techniques similar to quantum Boltzmann equations, we observe robust prethermalization plateaux for local observables. We find that increasing integrability breaking causes drift towards thermal behavior. Our results provide insights into the mechanisms underlying thermalization in quantum many-body systems.
    @article{bertini2015prethermalization,
      title={Prethermalization and thermalization in models with weak integrability breaking},
      author={Bertini, Bruno and Essler, Fabian H. L. and Groha, Stefan and Robinson, Neil J.},
      journal={Physical Review Letters},
      volume={115},
      number={18},
      pages={180601},
      year={2015},
      publisher={American Physical Society},
      doi={10.1103/PhysRevLett.115.180601}
    }

Talks and Presentations

Conference presentations and invited talks

  • IARC Collider bias and Mendelian randomization (CMR) working group, 2023, invited talk
  • Time series for Health, NeurIPS workshop, 2022, poster presentation
  • NCI Immuno-Oncology Translational Network Bioinformatics and Computational Biology Working Group, 2022, invited talk
  • American Conference on Pharmacometrics, Denver, Colorado, 2022, invited talk
  • American Society for Human Genetics Conference, 2022, poster presentation
  • Invitae research seminar, 2022, invited talk
  • Probably Genetic Research Forum, 2021, invited talk
  • UCLA/UChicago joint journal club on statistical genetics, 2020, invited talk
  • ML4H seminar series, Broad Institute of MIT and Harvard, 2021, invited talk
  • AAAI Symposium 2021: Survival prediction, oral presentation
  • Modeling & Simulation Forum, Genentech, 2021, invited talk
  • Machine Learning for Health (ML4H), NeurIPS workshop, 2020, poster presentation
  • Learning Meaningful Representations of Life, NeurIPS workshop, 2020, poster and oral presentation
  • American Society for Human Genetics Conference, 2020, poster presentation
  • European Society for Human Genetics Conference, 2020, oral presentation
  • Learning Meaningful Representations of Life, NeurIPS workshop, 2019, poster presentation
  • Harvard PQG, Quantitative Challenges in Cancer Immunology and Immunotherapy, 2019
  • American Society for Human Genetics Conference, 2019, poster presentation
  • Erwin Schrödinger International Institute, Vienna, 2018, invited talk
  • Rudolf Peierls Centre for Theoretical Physics, Oxford, 2018, invited talk
  • Brookhaven National Laboratory, 2017, invited talk