Many people will develop more than one disease in their lifetime, but predicting how different diseases will interact remains difficult.
Accurately predicting a patient's future health trajectory remains a core need in healthcare decision-making. Artificial intelligence (AI) models can leverage vast amounts of data from patient records to help identify disease progression patterns. However, their potential remains largely untapped, especially across large populations.
A joint team from the Oncology AI Division of the German Cancer Research Center (DKFZ) in Heidelberg, Germany, and other institutions recently published a paper in the journal Nature presenting the Delphi-2M model. Based on Generative Pretrained Transformer (GPT) technology, the model analyzes an individual's medical records and lifestyle factors to estimate the risk of more than 1,000 diseases up to 20 years ahead. The model also generates privacy-protecting synthetic data, opening up new avenues for personalized medicine and long-term health planning.
Paper link: https://www.nature.com/articles/s41586-025-09529-3
A modified GPT-2: AI predicts individual health over the next 20 years
The core idea of the Delphi-2M model is to predict future disease risks from a patient's past and current health status, so that interventions can be planned earlier.
In the past, while AI methods could learn and predict disease progression from medical records, limitations in model architecture made it difficult to accurately predict multiple diseases over long periods of time and on a large scale. As populations age, disease prediction is becoming increasingly important. In this context, AI models that can accurately simulate the progression of multiple diseases will become a key tool for healthcare planning and resource allocation.
To model disease histories, the research team modified the GPT-2 architecture. The Transformer maps inputs into an embedding space and gradually aggregates information to make autoregressive predictions. The team encoded continuous age using sine and cosine basis functions and added a second module to the output head that predicts the time to the next event using an exponential waiting-time model. This architecture allows users to provide a partial health trajectory and calculate the daily rates of new disease and death events. Based on these rates, subsequent tokens and their corresponding times are sampled one after another until the full health trajectory is complete.
Figure | Delphi-2M model architecture
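The two ingredients described above, a sine/cosine encoding of age and sampling from an exponential waiting-time model, can be illustrated with a minimal sketch. This is not the authors' code. It assumes that the daily event rates behave like competing exponential clocks, which is one common way to realize such a model, and the names age_encoding, sample_next_event, sample_trajectory and toy_rates, the max_period constant, and all numeric rates are hypothetical. In the real model the rates would come from the Transformer's output rather than from a hand-written function.

```python
import math
import random


def age_encoding(age_days, dim=8, max_period=36500.0):
    """Encode a continuous age (in days) with sine and cosine basis functions."""
    half = dim // 2
    encoding = []
    for i in range(half):
        # Frequencies decrease geometrically from 1 toward 1/max_period.
        freq = 1.0 / (max_period ** (i / half))
        encoding.append(math.sin(age_days * freq))
        encoding.append(math.cos(age_days * freq))
    return encoding


def sample_next_event(rates, rng):
    """Sample (event, waiting_time_days) from competing exponential clocks.

    rates maps an event label to a daily rate. The waiting time to the first
    event is exponential with the summed rate, and the event type is chosen
    with probability proportional to its own rate.
    """
    total = sum(rates.values())
    wait = rng.expovariate(total)
    threshold = rng.random() * total
    cumulative = 0.0
    for event, rate in rates.items():
        cumulative += rate
        if threshold < cumulative:
            return event, wait
    return event, wait  # guard against floating-point round-off


def sample_trajectory(start_age_days, history, rate_fn, rng,
                      max_age_days=80 * 365.25):
    """Extend a partial trajectory until death or max_age_days is reached."""
    age = start_age_days
    trajectory = list(history)
    while True:
        event, wait = sample_next_event(rate_fn(age, trajectory), rng)
        age += wait
        if age > max_age_days:
            break
        trajectory.append((event, age))
        if event == "death":
            break
    return trajectory


def toy_rates(age_days, trajectory):
    """Hypothetical daily rates, for illustration only (not from the paper)."""
    seen = {event for event, _ in trajectory}
    rates = {"death": 1e-5 * math.exp(age_days / 36500.0)}
    for disease in ("disease_A", "disease_B"):
        if disease not in seen:  # each diagnosis is recorded at most once
            rates[disease] = 5e-5
    return rates


if __name__ == "__main__":
    rng = random.Random(0)
    for event, age in sample_trajectory(60 * 365.25, [], toy_rates, rng):
        print(f"{event} at age {age / 365.25:.1f}")
```

Repeating the sampling many times for the same starting history gives a distribution of possible futures, from which long-horizon risks can be read off as simple frequencies.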
Delphi-2M was trained and validated on two datasets, one internal and one external, to test the model's generalization and reliability. The training data came primarily from roughly 400,000 participants in the UK Biobank (about 80% of the cohort), covering ICD-10 top-level diagnosis codes, sex, body mass index (BMI), smoking and drinking habits, and mortality information.
Internal validation data: The remaining 20% of participants in the UK Biobank (approximately 102,000 people) were used for model hyperparameter optimization. In addition, 471,000 participants who were still alive on July 1, 2020 were selected and tracked until July 1, 2022 to test the longitudinal predictive ability of the model.
External validation data: Data from the Danish National Disease Registry covering 1.93 million individuals from 1978 to 2018. Notably, when applying the model to the Danish data, no parameters were adjusted; instead, the model weights trained on the UK data were reused to test its applicability across populations and healthcare systems.
Traditional clinical risk models tend to be specialized, such as QRisk3 for cardiovascular disease risk assessment and UKBDRS for dementia prediction. Most models only cover a few dozen diseases. Delphi-2M, on the other hand, achieves near-full-spectrum coverage, simultaneously predicting the risk of 1,256 diseases as well as mortality.
Figure | The Delphi-2M model accurately simulates the incidence of various diseases.
In internal validation on UK Biobank data, the model's age- and sex-stratified AUC (area under the receiver operating characteristic curve; higher values indicate stronger predictive ability) averaged 0.76 across diseases. For 97% of diseases, the AUC exceeded 0.5, indicating better-than-chance prediction. The highest AUC was achieved for mortality risk prediction, reaching 0.97 for both men and women.
The research team also compared the model with established clinical tools. For cardiovascular disease and dementia, its AUC was comparable to classic tools such as QRisk3 and UKBDRS. For mortality risk, its AUC was better than commonly used indicators such as the Charlson Comorbidity Index and the Elixhauser Comorbidity Index. For diabetes, it was only slightly inferior to the clinical gold standard HbA1c, which suggests that the model could be further improved by integrating biomarkers in the future.
Delphi-2M also generalizes well across populations. When applied to Danish data, the average AUC for Delphi-2M is slightly lower than that for UK data. However, the disease prediction results are highly correlated with actual disease patterns in the Danish population, suggesting that the model is applicable across different healthcare systems.
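To make the idea of an age- and sex-stratified AUC concrete, here is a minimal sketch under stated assumptions: the AUC is computed by pairwise comparison within each (sex, age band) stratum and the per-stratum values are then averaged with equal weight. The 5-year band width, the record layout and the function names are hypothetical, and the paper's exact stratification procedure may differ.

```python
from collections import defaultdict


def auc(scores, labels):
    """AUC as the probability that a random case outranks a random control."""
    cases = [s for s, y in zip(scores, labels) if y == 1]
    controls = [s for s, y in zip(scores, labels) if y == 0]
    if not cases or not controls:
        return None  # undefined when a stratum lacks cases or controls
    wins = 0.0
    for c in cases:
        for n in controls:
            if c > n:
                wins += 1.0
            elif c == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(cases) * len(controls))


def stratified_auc(records, band_years=5):
    """Average the AUC over (sex, age band) strata.

    records is an iterable of (sex, age_years, predicted_risk, label) tuples.
    Strata where the AUC is undefined are skipped.
    """
    strata = defaultdict(lambda: ([], []))
    for sex, age, score, label in records:
        key = (sex, int(age // band_years))
        strata[key][0].append(score)
        strata[key][1].append(label)
    values = [auc(scores, labels) for scores, labels in strata.values()]
    values = [v for v in values if v is not None]
    return sum(values) / len(values) if values else None
```

Stratifying in this way means a model gets no credit for simply knowing that older people are sicker: within each stratum everyone has the same sex and a similar age, so only the information beyond age and sex can raise the score.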
Figure | Generating future health trajectories with Delphi-2M.
Unlike traditional models that only predict disease probabilities within 1-5 years, Delphi-2M's generative nature allows it to simulate an individual's health path for up to 20 years. Taking UK Biobank participants aged 60 as an example, the research team generated future health trajectories based on each participant's medical history prior to age 60. Comparing these with actual follow-up results, they concluded:
First, the agreement at the population level is high. The disease incidence rates simulated by Delphi-2M for those aged 70-75 years are highly consistent with actual observations. The cross-entropy loss of the simulated trajectories, which measures the difference between the predicted and true distributions, is not significantly different from that of the real data. The accuracy of the simulation results decreases significantly when the participants' medical histories are randomly shuffled, demonstrating that Delphi-2M captures the relationship between medical history and future disease.
Second, individual risk is clearly differentiated. For diseases like pancreatic cancer, the model can distinguish between "high-risk" and "low-risk" individuals. For example, people with a history of digestive system diseases have a significantly increased risk of pancreatic cancer. While risk prediction for diseases like asthma and osteoarthritis still relies largely on age-sex trends, the model can also identify individuals whose risk deviates from the group average.
Furthermore, experiments have shown that long-term predictions remain effective. While the model's accuracy decreases as the prediction horizon lengthens, it still outperforms predictions based solely on age and sex, demonstrating its long-term predictive value.
Justin Stebbing, Professor of Biomedical Sciences at Anglia Ruskin University, commented: “Delphi-2M is a major breakthrough in computational medicine and data integration, demonstrating the power of GPT models to predict the incidence and timing of over a thousand diseases across large populations and individual health trajectories.”
Gustavo Sudre, professor of genomic neuroimaging and artificial intelligence at King's College London, agreed that "Delphi-2M clearly demonstrates how to use explainable AI for predictive modeling, which is crucial for applying this technology to clinical practice and has implications for identifying high-risk individuals who require intervention."
Furthermore, the privacy sensitivity of medical data has always been a pain point in AI research. Directly using real data to train models may leak personal information, while anonymization will result in the loss of key information. The model's ability to generate synthetic data provides a new solution to this problem.
Delphi-2M can generate completely fictitious health trajectories that replicate age- and sex-specific morbidity patterns in a real population. Since it is impossible to infer real personal information from synthetic data, it can be used as a substitute for real data to train other medical AI models, protecting privacy while avoiding waste of data resources. Professor Stebbing also affirmed this advantage, stating that "its external validation capabilities and ability to generate synthetic datasets demonstrate the model's robustness, privacy management advantages, and potential for healthcare planning."
Shortcomings and the future
Although Delphi-2M performs strongly, the research team also spelled out its limitations in the paper and warned that it should be used with caution.
For example, Delphi-2M inherits the biases of the UK Biobank data on which it was trained. UK Biobank participants are predominantly white, aged 40-70, and of high socioeconomic status, so the model's predictions are less reliable for other populations. The current model also captures only correlations and cannot establish causal relationships, so intervention plans cannot be derived directly from its predictions.
Furthermore, Delphi-2M has only been validated retrospectively on existing data and has not undergone prospective clinical trials or been tested in real-world clinical settings. Peter Bannister, a Fellow of the Institution of Engineering and Technology, also stated, "Both datasets are biased in terms of age, ethnicity, and current medical outcomes, leaving them a long way from improving healthcare."
The release of Delphi-2M marks a significant advancement in AI prediction in healthcare: from single-disease models to multi-disease ones, from a focus on short-term risk to a focus on long-term trajectories, and from reliance on real-world data to privacy-preserving synthetic data. Its core value lies not only in its strong predictive capabilities but also in providing an interpretable and scalable framework for precision medicine. Through SHAP analysis, the model shows how a prior illness affects future risk. Its predictive capabilities can be further enhanced by integrating genomic data, richer metabolomics information, diagnostic imaging data, or wearable device data.
Regarding the future of the Delphi-2M model, Professor Sudre pointed out, "Although the current version relies only on anonymized clinical records, it is encouraging that the model architecture has been carefully designed to be compatible with richer data types such as biomarkers, imaging, and even genomics. With the advancement of future data integration, the Delphi platform is expected to develop into a true multimodal precision medicine tool."
Of course, models are an aid to medical decision-making, not a replacement. Their predictions must be considered in conjunction with physician experience and patient preferences. In the future, as training data become more diverse and validation moves into clinical settings, AI models like Delphi-2M are expected to be truly integrated into the medical process, providing personalized health management solutions for each individual and advancing precision medicine from concept to practice.
This article comes from the WeChat public account "Academic Headlines" (ID: SciTouTiao), compiled by Xiaoyu, and published by 36Kr with authorization.