Introduction
Understanding how cells respond transcriptionally to chemical perturbations is fundamental to drug discovery. High-throughput screening (HTS) using bulk and single-cell RNA sequencing (scRNA-seq) has profiled thousands of perturbations, revealing interpretable gene-level programs representing cellular processes in response to chemicals. However, exhaustive experimental screening of all disease-compound combinations is infeasible due to time and cost constraints. Deep learning offers a promising alternative for modeling these responses. Existing deep learning methods, such as CPA, Biolord, scGen, scVIDR, chemCPA, CellOT, and CINEMA-OT, have shown success in simulating and predicting gene expression changes under specific perturbations, but most struggle with predicting responses to entirely novel chemicals. Linear regression-based methods fail to capture the nonlinearity of chemical perturbations across diverse cell types and compounds, and knowledge graph-based models like GEARS and CellOracle lack scalability. Gene signature matching methods, such as those used in Connectivity Map (CMap), DLEPS, and OCTAD, effectively screen candidates based on bulk HTS data but lack cell-type specificity and fail to model cellular heterogeneity. Therefore, there is a need for a robust and scalable model capable of predicting responses to unseen perturbations and discovering novel therapeutic candidates. Deep generative models, capable of learning probability densities and generating new samples, offer this potential.
Literature Review
The authors reviewed existing methods for predicting transcriptional responses to chemical perturbations. They categorized these methods into several groups: autoencoder-based models (CPA, chemCPA), models based on generative adversarial networks (GANs) and variational autoencoders (VAEs) (scGen, scVIDR, Biolord), optimal transport-based methods (CellOT, CINEMA-OT), linear regression-based methods, and knowledge graph-based methods (GEARS, CellOracle). The authors highlighted the limitations of each approach: autoencoder-based models primarily focused on reconstructing known responses, while generative models often struggled with novel compound predictions. Optimal transport approaches required paired perturbed-unperturbed observations, which are experimentally challenging to obtain for novel compounds, and linear regression models failed to capture the nonlinearity of drug responses. Knowledge graph-based models suffered from scalability limitations due to their reliance on accurate prior knowledge. Finally, while gene signature matching methods are effective, they fail to capture cell-type-specific transcriptional responses and cellular heterogeneity. The review emphasized the need for a model that addresses the limitations of existing techniques and efficiently screens novel compounds for diverse diseases.
Methodology
The authors introduce PRnet, a perturbation-conditioned deep generative model. PRnet is designed as an encoder-decoder architecture comprising three components: the Perturb-adapter, Perturb-encoder, and Perturb-decoder. The Perturb-adapter uses simplified molecular-input line-entry system (SMILES) strings to represent chemical structures, allowing for generalization to unseen compounds. The Perturb-encoder maps the combined perturbation embedding and unperturbed transcriptional profile into a latent space, capturing the heterogeneity inherent in transcriptional responses. The Perturb-decoder estimates the distribution of the perturbed transcriptional profile and samples a specific transcriptional profile from it. The model was trained using a large dataset comprising nearly one hundred million bulk HTS observations (from L1000) and tens of millions of single-cell HTS observations (from sci-Plex3), perturbed by a large number of compounds. For bulk HTS data, the model predicts responses for 978 landmark genes, which are then transformed to 12,328 genes using a linear transformation. For single-cell HTS data, the model focuses on 5,000 highly variable genes (HVGs). The model's performance is evaluated using several metrics: Pearson correlation between true and predicted log(FC), R-squared, and Pearson correlation accounting for cell-line-specific effects. The authors also detail the process of in silico drug screening using PRnet, which involves predicting transcriptional responses to a library of compounds, calculating enrichment scores using Gene Set Enrichment Analysis (GSEA), and ranking compounds based on their enrichment scores. The training hyperparameters are documented, and the model is evaluated under multiple data split strategies (random, unseen compounds, unseen cell lines) to assess its robustness and generalization capability. Baseline models (a linear model and an MLP model) were also used for comparison.
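The screening step described above (predict a response, score it against a disease signature, rank compounds) can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: it uses the classic unweighted running-sum enrichment statistic rather than a full GSEA implementation, and the function names, gene names, and the "down-set score minus up-set score" reversal rule are assumptions made for the example.

```python
import numpy as np

def enrichment_score(ranked_genes, gene_set):
    """Unweighted GSEA-style running-sum statistic (an assumed simplification)."""
    in_set = np.array([g in gene_set for g in ranked_genes], dtype=float)
    n_hit = in_set.sum()
    n_miss = len(ranked_genes) - n_hit
    if n_hit == 0 or n_miss == 0:
        return 0.0
    # Step up at each gene in the set, step down at each gene outside it.
    step = in_set / n_hit - (1.0 - in_set) / n_miss
    running = np.cumsum(step)
    # The score is the running sum's largest deviation from zero (signed).
    return float(running[np.argmax(np.abs(running))])

def reversal_score(pred_logfc, genes, disease_up, disease_down):
    """Higher means the predicted response opposes the disease signature."""
    order = np.argsort(-np.asarray(pred_logfc, dtype=float))  # most up-regulated first
    ranked = [genes[i] for i in order]
    es_up = enrichment_score(ranked, disease_up)      # want negative: disease-up genes pushed down
    es_down = enrichment_score(ranked, disease_down)  # want positive: disease-down genes pushed up
    return es_down - es_up

def rank_compounds(predictions, genes, disease_up, disease_down):
    """Sort compounds by reversal score, best candidate first."""
    scored = [(name, reversal_score(fc, genes, disease_up, disease_down))
              for name, fc in predictions.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)

# Hypothetical toy data: six genes, two compounds with opposite predicted effects.
toy_genes = ["G1", "G2", "G3", "G4", "G5", "G6"]
toy_predictions = {
    "compound_a": [-2.0, -1.5, 0.1, -0.1, 1.5, 2.0],
    "compound_b": [2.0, 1.5, 0.1, -0.1, -1.5, -2.0],
}
toy_ranking = rank_compounds(toy_predictions, toy_genes, {"G1", "G2"}, {"G5", "G6"})
print(toy_ranking)
```

In this toy case the compound predicted to lower the disease-up genes and raise the disease-down genes ranks first. A real screen would replace the toy vectors with model-predicted log(FC) profiles and would typically use a weighted statistic with significance testing.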
The model development includes data preprocessing steps, such as normalization and the selection of highly variable genes, to ensure data quality and consistency. SMILES strings, generated by RDKit, and dosage information form the input to the Perturb-adapter, which creates a fixed-size embedding for each perturbation. This embedding is then fed to the encoder and decoder components.
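To make the idea of a fixed-size perturbation embedding concrete, here is a minimal sketch. It is not the Perturb-adapter itself: a toy hashed character n-gram featurizer stands in for a real cheminformatics fingerprint (such as one computed with RDKit), and scaling the vector by log10(dose + 1) is an assumed way of folding in dosage. All function names and parameters are hypothetical.

```python
import hashlib
import math
import numpy as np

def smiles_ngram_fingerprint(smiles, n_bits=256, n=3):
    """Toy stand-in for a molecular fingerprint: hash character n-grams into bits."""
    fp = np.zeros(n_bits, dtype=np.float32)
    for i in range(max(len(smiles) - n + 1, 1)):
        gram = smiles[i:i + n]
        digest = hashlib.md5(gram.encode("utf-8")).digest()
        idx = int.from_bytes(digest[:4], "little") % n_bits
        fp[idx] = 1.0
    return fp

def perturbation_embedding(smiles, dose_um, n_bits=256):
    """Fixed-size vector for (compound, dose); dose scaling is an assumption."""
    fp = smiles_ngram_fingerprint(smiles, n_bits=n_bits)
    return fp * math.log10(dose_um + 1.0)

# Aspirin as an example structure; any valid SMILES string works the same way.
emb = perturbation_embedding("CC(=O)OC1=CC=CC=C1C(=O)O", dose_um=9.0)
print(emb.shape)
```

Because the vector length depends only on `n_bits`, a compound never seen during training still maps to an input of the right shape, which is the property that lets a structure-based encoding generalize to new compounds.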
Key Findings
PRnet consistently outperforms alternative approaches in predicting transcriptional responses to novel perturbations across various data splits (random, unseen compounds, unseen cell lines). In bulk HTS data, PRnet achieves a Pearson correlation of approximately 0.8 in predicting log(FC) for unseen compounds. In single-cell HTS data, PRnet achieves an R-squared of approximately 0.97, indicating high accuracy in predicting average gene expression for unseen compounds. PRnet effectively captures gene-level changes and learns interpretable latent embeddings, clustering cell lines from the same tissue together in the latent space. The model's ability to capture heterogeneity is further demonstrated through pseudo-dose trajectory analysis in single-cell data, which shows heterogeneous responses to drug treatments. Application of PRnet to small cell lung cancer (SCLC) and colorectal cancer (CRC) datasets identified novel compound candidates, which were subsequently validated experimentally using MTT assays. Specifically, SEL120-34A HCl and (+)-Fangchinoline showed significant inhibitory effects against SCLC cell lines, with IC50 values below 10 µmol/L. Similarly, 7-Methoxyrosmanol and Mulberrofuran Q showed inhibitory activity against CRC cell lines. Furthermore, PRnet generates a large-scale integration atlas of perturbation profiles covering 88 cell lines, 52 tissues, and various compound libraries. Leveraging this atlas, PRnet recommends 577 drug candidates for 233 diseases. The recommendations for NASH, Crohn's disease, and PCOS are validated through literature review, showcasing the clinical relevance of the model's predictions. KEGG pathway enrichment analysis further supports the findings by identifying pathways associated with the effects of predicted drug candidates.
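For readers unfamiliar with the two headline metrics, the sketch below shows one common way to compute them. It is illustrative only: the log base, the pseudocount, and the use of the coefficient of determination for R-squared are assumptions, since the summary does not specify the exact definitions, and the data are synthetic.

```python
import numpy as np

def log_fold_change(perturbed, control, pseudocount=1.0):
    """log2 fold change per gene; base and pseudocount are assumed choices."""
    return np.log2((perturbed + pseudocount) / (control + pseudocount))

def pearson(x, y):
    """Pearson correlation between two 1-D arrays."""
    return float(np.corrcoef(x, y)[0, 1])

def r_squared(y_true, y_pred):
    """Coefficient of determination of predictions against observations."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return float(1.0 - ss_res / ss_tot)

# Synthetic example: 200 genes, a true response, and a noisy prediction of it.
rng = np.random.default_rng(0)
control_expr = rng.uniform(1.0, 10.0, size=200)
true_expr = control_expr * rng.uniform(0.5, 2.0, size=200)
pred_expr = true_expr + rng.normal(0.0, 0.3, size=200)

lfc_true = log_fold_change(true_expr, control_expr)
lfc_pred = log_fold_change(pred_expr, control_expr)
print("Pearson on log(FC):", pearson(lfc_true, lfc_pred))
print("R-squared on expression:", r_squared(true_expr, pred_expr))
```

Correlating log(FC) rather than raw expression removes the baseline shared by control and treated samples, so it is the stricter test of whether the perturbation effect itself was predicted.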
Discussion
PRnet addresses the critical need for a robust and scalable method for predicting transcriptional responses to novel chemical perturbations, significantly improving upon existing approaches. Its superior performance stems from its unique architecture, combining SMILES-based chemical encoding for generalization to unseen compounds with a deep generative model for capturing the complex nonlinearity of cellular responses. The experimental validation of drug candidates identified by PRnet underscores the model's practical utility in accelerating drug discovery. The construction of a large-scale integration atlas of perturbation profiles and subsequent drug candidate recommendations for a wide range of diseases demonstrates its broad applicability. While PRnet represents a significant advancement, the authors acknowledge limitations. The use of SMILES might not fully capture all aspects of molecular structure, and the reverse signature paradigm used for drug recommendation might not be universally applicable across all diseases. Future work will focus on incorporating more comprehensive omics data and phenotypic data to further enhance the model's predictive accuracy and clinical relevance. Expanding the model to handle other types of perturbations beyond chemical ones is also a potential future direction.
Conclusion
PRnet is a novel deep generative model that effectively predicts transcriptional responses to novel chemical perturbations. It surpasses existing methods in accuracy and scalability, enabling gene-level response interpretation and in silico drug screening. Experimental validations of its predictions in small cell lung cancer and colorectal cancer demonstrate its practical utility. The generation of a comprehensive perturbation profile atlas and successful drug recommendations for numerous diseases further showcase its potential to revolutionize drug discovery. Future work could focus on improving chemical encoding methods, incorporating additional omics data, and expanding the scope of perturbation types handled by the model.
Limitations
While PRnet shows impressive performance, there are some limitations. The reliance on SMILES for chemical encoding may not capture all relevant structural features. The reverse signature paradigm for drug candidate identification may not apply to all diseases. The current model primarily focuses on gene-level effects and lacks an explicit integration of phenotype data. Further improvements could be made by incorporating more detailed structural information, integrating phenotypic data, and expanding the model's capabilities to handle diverse perturbation types.