Data Science for Population Health

Research

We are a team of computational scientists from diverse educational backgrounds (biology, engineering, physics, medicine, mathematics) based at the Singapore Institute for Clinical Sciences and the Bioinformatics Institute. Our central mission is to identify the determinants of population health using molecular profiling and machine learning approaches. The cohorts we investigate comprise individuals affected by complex health conditions (cardiometabolic, respiratory, neurological, cancer), who may experience multiple health outcomes. The datasets we work with come from many different sources and pose several challenges for analysis.

Integrating population and disease cohorts

In ageing individuals, multiple conditions (e.g. hypertension, stroke and dementia) frequently co-occur and lead to poor outcomes. Adolescents and middle-aged adults may also have traits (e.g. obesity) that are risk factors for these diseases and that can affect their well-being throughout their lifetime. Few studies have gone beyond epidemiology to examine the biological and molecular basis of multiple long-term conditions across the human lifespan. Observational and experimental studies that focus on one condition at a time will miss the physiological and molecular characteristics shared between diseases. Our vision is to redefine the diagnosis of diseases using molecular and physiological traits, leveraging data from multiple cohorts and using machine learning techniques. One of our current studies is exploring the effects of gene-lipid interactions on child and mother mental health using multi-omic data from the GUSTO, S-PRESTO and ALSPAC birth cohorts. We also link early-life risk factors to cardiovascular and neurodegenerative disease cases in later life using ageing cohorts such as ATTRaCT, UK Biobank and SG100K. To identify more generalisable risk factors, we actively work with informatics colleagues in academia and industry to develop and test federated data platforms for analysing cohorts across different countries and jurisdictions.

Modelling and machine learning

Since the factors contributing to the development of complex diseases may come from multiple sources, including genetics, epigenetics, lifestyle and the environment, we must consider many different molecular and health-related readouts collected from individuals over time. We actively use statistical and machine learning models to predict future health outcomes from the readouts of molecular (-omics) assays and wearable sensors. These include classification algorithms that employ feature selection and dimensionality reduction to untangle the heterogeneity of pulmonary hypertension, COVID-19 and dementia; probabilistic models that describe longitudinal changes in epigenetic ageing and drug response; and regression-based methods that perform Mendelian randomization and generate polygenic scores. We often collaborate with our sister team at Imperial College London to improve these techniques.

Members

Senior Principal Scientist	WANG Dennis
Principal Scientist	PAN Hong
Senior Scientist	GUPTA Varsha
Senior Scientist	HUANG Jian
Senior Scientist	LAU Evelyn
Senior Scientist	VAZ Candida
Scientist	MISHRA Priti
Lead Research Officer	TEH Ai Ling
Senior Research Officer	TAN Pei Fang
Research Officer	CHE Jinyi
Research Officer	ZHANG Xiaohe

Selected Publications

Gupta, V., Kariotis, S., Rajab, M.D. et al. Unsupervised machine learning to investigate trajectory patterns of COVID-19 symptoms and physical activity measured via the MyHeart Counts App and smart devices. npj Digit. Med. 6, 239 (2023). https://doi.org/10.1038/s41746-023-00974-w
Huang, J., Kee, M.Z.L., Law, E.C. et al. Parental and child genetic burden of glycaemic dysregulation and early-life cognitive development: an Asian and European prospective cohort study. Transl Psychiatry 14, 2 (2024).
Rajab, M.D., Taketa, T., Wharton, S.B., Wang, D. Ranking and filtering of neuropathology features in the machine learning evaluation of dementia studies. Brain Pathol. e13247 (2024).
Leroy, A., Teh, A.L., Dondelinger, F., Alvarez, M.A., Wang, D. Longitudinal prediction of DNA methylation to forecast epigenetic outcomes. arXiv 2312.13302 (2023).
Pan, H., Tan, P.F., Lim, I.Y. et al. Integrative multi-omics database (iMOMdb) of Asian pregnant women, Human Molecular Genetics 31(18):3051-3067 (2022).
Kariotis, S., Jammeh, E., Swietlik, E.M. et al. Biological heterogeneity in idiopathic pulmonary arterial hypertension identified through unsupervised transcriptomic profiling of whole blood. Nat Commun 12, 7104 (2021).
Errington, N., Iremonger, J., Pickworth, J.A. et al. A diagnostic miRNA signature for pulmonary arterial hypertension using a consensus machine learning approach. EBioMedicine 69:103444 (2021).
Wang, D., Hensman, J., Kutkaite, G., et al. A statistical framework for assessing pharmacological response and biomarkers using uncertainty estimates. eLife 9:e60352 (2020).
