Press Release: Revolutionary AI Predicts Aging and Disease from DNA Patterns

Posted on November 13, 2024 by Admin

A team of researchers has introduced the Cytosine-phosphate-Guanine Pretrained Transformer (CpGPT), a transformer-based foundation model for deoxyribonucleic acid (DNA) methylation designed to enhance analysis and prediction across diverse tissues and conditions.

Study

To develop the CpGPT model, the researchers curated a comprehensive DNA methylation dataset named "CpGCorpus," aggregating data from more than 1,500 studies and over 106,000 human samples available in the Gene Expression Omnibus. The dataset spanned several Illumina methylation array platforms and represented a rich diversity of tissue types, developmental stages, disease conditions, and demographic backgrounds. Raw data were processed using the SeSAMe pipeline, while for data that had already been processed, the normalized beta value matrices were used. Quality control measures and probe harmonization were applied to ensure consistency across the dataset. The data were split into training, validation, and test sets with no sample or study appearing in more than one set.
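
The study-level split described above can be sketched as follows. This is a minimal illustration rather than the researchers' actual procedure: the function name split_by_study, the 80/10/10 fractions, the fixed seed, and the GSM/GSE-style identifiers in the usage note are all assumptions made for the example.

```python
import random
from collections import defaultdict


def split_by_study(sample_to_study, fractions=(0.8, 0.1, 0.1), seed=0):
    """Assign whole studies to train/validation/test so no study spans two sets.

    sample_to_study maps a sample ID to the ID of the study it came from.
    The fractions are hypothetical; the source does not state the ratios used.
    """
    studies = sorted(set(sample_to_study.values()))
    rng = random.Random(seed)
    rng.shuffle(studies)

    n_train = int(fractions[0] * len(studies))
    n_val = int(fractions[1] * len(studies))

    # Decide the split once per study, never per sample.
    assignment = {}
    for i, study in enumerate(studies):
        if i < n_train:
            assignment[study] = "train"
        elif i < n_train + n_val:
            assignment[study] = "validation"
        else:
            assignment[study] = "test"

    # Every sample inherits the split of its study.
    splits = defaultdict(list)
    for sample, study in sample_to_study.items():
        splits[assignment[study]].append(sample)
    return dict(splits)
```

For example, calling split_by_study({"GSM1": "GSE1", "GSM2": "GSE1", "GSM3": "GSE2"}) keeps GSM1 and GSM2 in the same set because they share a study. Splitting by study rather than by sample is what prevents study-specific batch effects from leaking into the evaluation sets.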

The CpGPT model integrated sequence, positional, and epigenetic information. Input representations included embeddings of the nucleotide sequences obtained from a pre-trained DNA language model, methylation beta values representing the methylation state of each site, and genomic positional encodings capturing each CpG site's location within the genome. A dual positional encoding strategy was employed, combining absolute and relative positional encodings to capture multi-scale genomic information. Specialized decoders were designed for beta value prediction, condition prediction, and uncertainty estimation.
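
To make the idea of an absolute positional encoding concrete, the sketch below encodes a genomic coordinate with sinusoids at several frequencies, as in the original transformer design. It is only an assumption that CpGPT's absolute encoding resembles this form, and the sketch does not cover the relative half of the dual strategy. The function name, the dimension, and the max_wavelength value are hypothetical.

```python
import math


def sinusoidal_encoding(position, dim=8, max_wavelength=1e9):
    """Encode an integer genomic coordinate as interleaved sin/cos values.

    dim and max_wavelength are illustrative choices, not values from the study.
    """
    if dim % 2:
        raise ValueError("dim must be even")
    encoding = []
    for k in range(dim // 2):
        # Frequencies decrease geometrically with k, so later sin/cos pairs
        # vary more slowly along the chromosome and capture coarser scales.
        freq = 1.0 / (max_wavelength ** (2 * k / dim))
        encoding.append(math.sin(position * freq))
        encoding.append(math.cos(position * freq))
    return encoding
```

Calling sinusoidal_encoding(1_000_000) returns a list of eight values in the range [-1, 1]. Mixing fast and slow frequencies is one way a single vector can carry both local and chromosome-scale position information.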

Pretraining was conducted using a multi-task learning approach with tailored loss functions, optimizing the model's ability to reconstruct missing data and learn meaningful sample representations. For fine-tuning, CpG sites associated with mortality were selected based on intra-class correlation coefficients and z-score thresholds. The model was then trained using a modified Cox proportional hazard loss. Predictive performance for mortality and morbidity was evaluated across multiple cohorts using Cox regression models, receiver operating characteristic analyses, and survival analyses, with adjustment for age.
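
For readers unfamiliar with Cox-based training objectives, the sketch below computes the standard Cox negative log partial likelihood. The source describes a modified version of this loss without giving the modification, so this is the textbook form only, with a simple Breslow-style treatment of tied event times. The function and argument names are hypothetical.

```python
import math


def cox_neg_log_partial_likelihood(risk_scores, times, events):
    """Average negative log partial likelihood of the standard Cox model.

    risk_scores: model outputs, where higher means higher predicted hazard.
    times: follow-up time for each sample.
    events: 1 if death was observed at that time, 0 if the sample was censored.
    """
    total, n_events = 0.0, 0
    for score_i, time_i, event_i in zip(risk_scores, times, events):
        if not event_i:
            continue  # censored samples contribute only through risk sets
        # Risk set: everyone still under observation at time_i.
        risk_set = [s for s, t in zip(risk_scores, times) if t >= time_i]
        # Numerically stable log-sum-exp over the risk set.
        m = max(risk_set)
        log_sum_exp = m + math.log(sum(math.exp(s - m) for s in risk_set))
        total += score_i - log_sum_exp
        n_events += 1
    if n_events == 0:
        raise ValueError("at least one observed event is required")
    return -total / n_events
```

The loss is lower when samples that die earlier receive higher risk scores than samples still alive at that time, which is the ordering a mortality predictor is meant to learn.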

Findings

The researchers pretrained CpGPT on over 100,000 human DNA methylation samples from more than 1,500 studies covering a diverse range of tissue types, developmental stages, and disease conditions. The data were thoroughly preprocessed and harmonized to ensure consistency across various Illumina methylation array platforms, such as the HumanMethylation27 BeadChip (27k), HumanMethylation450 BeadChip (450k), Infinium MethylationEPIC BeadChip (EPIC), EPIC+, and EPICv2 arrays.

CpGPT integrates three key types of contextual information: sequence context based on the DNA nucleotides near each CpG site, positional context covering local and global information, and epigenetic state. Sequence context is encoded using embeddings of nucleotide sequences surrounding each CpG site, derived from a pre-trained DNA language model. The model organizes sequence embeddings by genomic positions to capture positional context, groups them by chromosomes, and applies stochastic shuffling to prevent positional biases. Each CpG site's methylation state is transformed into an embedding representing its epigenetic status, and these embeddings are combined to form the model's input.

The core architecture of CpGPT is based on the Transformer++ model, an enhanced version of the transformer architecture with modifications for increased training stability and accuracy. The model is trained in an unsupervised manner to predict methylation states (beta values) and their uncertainties, enabling it to generate meaningful sample-level embeddings that encapsulate comprehensive methylation profiles. The training process employs multiple loss functions to optimize various performance aspects and is designed to handle missing data effectively.

Evaluations using dimensionality reduction techniques revealed that CpGPT's locus embeddings naturally reflect functional genomic annotations, with CpG sites clustering according to features like island status and chromatin states. Sample embeddings effectively captured biological variations, clustering samples according to tissue types and cell lines. The model demonstrated the ability to perform zero-shot reference mapping, which allows it to transfer labels from reference datasets with known annotations to new target datasets without additional training.
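
One simple way to picture zero-shot reference mapping is nearest-neighbor label transfer in embedding space, sketched below. The source does not say which transfer method the researchers used, so the k-nearest-neighbor majority vote, the Euclidean distance, the function name, and the toy two-dimensional embeddings in the usage note are all assumptions.

```python
import math
from collections import Counter


def transfer_labels(reference_embeddings, reference_labels, target_embeddings, k=3):
    """Label each target sample by majority vote among its k nearest references.

    No model is trained here: the embeddings are taken as given, which is
    what makes the mapping "zero-shot" in this illustration.
    """
    predictions = []
    for target in target_embeddings:
        # Sort reference samples by Euclidean distance to the target.
        distances = sorted(
            (math.dist(target, ref), label)
            for ref, label in zip(reference_embeddings, reference_labels)
        )
        nearest = [label for _, label in distances[:k]]
        predictions.append(Counter(nearest).most_common(1)[0][0])
    return predictions
```

With reference embeddings [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)] labeled as three "blood" samples followed by three "brain" samples, a target at (0.1, 0.2) is assigned "blood" and a target at (4.9, 5.1) is assigned "brain". The approach only works if the embeddings already cluster by the property being transferred, which is what the tissue-type clustering described above suggests.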

CpGPT showed strong performance in imputing missing methylation data, accurately reconstructing beta values for missing probes, and improving the performance of various epigenetic clocks. Through its attention mechanism, CpGPT dynamically weights features, allowing sample-specific interpretation by assigning importance scores to each CpG site. This highlighted biologically relevant genes important for tissue-specific epigenetic regulation.

When fine-tuned for mortality prediction, CpGPT maintained its predictive performance across multiple cohorts, effectively stratifying individuals based on their biological aging profiles. Its predictions were significantly associated with mortality and morbidity outcomes, including the risk of neurodegenerative and cardiovascular disease, as well as with measures of physical function.

Conclusion

To summarize, CpGPT effectively integrates sequence context, positional information, and epigenetic state to learn rich embeddings at both the CpG site and sample levels. The model excels in tasks such as imputing missing methylation values, array conversion, zero-shot reference mapping, and predicting age and mortality. By capturing complex dependencies among CpG sites, CpGPT overcomes the limitations of traditional linear models, enhancing predictive capabilities for aging-related outcomes and disease risks across various datasets.

Source:

https://www.news-medical.net/news/20241111/Revolutionary-AI-predicts-aging-and-disease-from-DNA-patterns.aspx