Predicting transcriptional responses to novel chemical perturbations using deep generative model for drug discovery

X. Qi, L. Zhao, et al.

PRnet is a deep generative model that predicts transcriptional responses to chemical perturbations at both bulk and single-cell levels. Xiaoning Qi and colleagues report that it outperforms existing methods on held-out compounds and cell lines, and they use it to identify drug candidates against cancer and other diseases.

Introduction
Transcriptional responses to chemical perturbations provide key insights into biological processes and are crucial for drug discovery. High-throughput bulk and single-cell RNA-seq experiments have profiled thousands of perturbations, uncovering coherent gene-level programs, but exhaustive experimental screening of the broad disease-compound space is infeasible due to cost, time, and low discovery rates. Deep learning methods have emerged to model perturbation responses, yet many fail to predict responses to truly novel chemicals and across diverse cell types. This study introduces PRnet, a perturbation-conditioned deep generative model designed to predict transcriptional responses to novel compounds and dosages given unperturbed transcriptomes, enabling generalization across compounds, pathways, cell types, and cell lines. The goal is to enhance in silico screening and candidate prioritization for diseases by predicting gene-level effects of unseen chemical perturbations at both bulk and single-cell resolution.
Literature Review
Recent approaches for modeling perturbation responses include autoencoder-based methods (CPA) that reconstruct perturbation effects in latent space; scGen, Biolord, and scVIDR for counterfactual prediction in unseen cellular states; chemCPA, which integrates compound structure for unseen drug effects; and optimal transport methods (CellOT, CINEMA-OT) that match observed perturbed-unperturbed pairs but cannot model truly novel perturbations. Linear regression-based methods model perturbation effects additively but struggle with nonlinear chemical effects across diverse contexts. Graph-based approaches (GEARS, CellOracle) leverage gene-gene networks but depend on accurate prior knowledge and face scalability limits. Signature-based frameworks like CMap, DLEPS, and OCTAD connect diseases and drugs via gene-expression signatures and can screen candidates but generally lack cell-type-specific response prediction and modeling of cellular heterogeneity. Deep generative models (GANs, VAEs, diffusion, normalizing flows, GPTs) learn data distributions and have transformed other domains, motivating their application to drug discovery.
Methodology
Problem formulation: Predict the distribution of post-perturbation gene expression conditioned on chemical perturbations and cellular context. Given unperturbed expression x and perturbation attributes P = (s, d) (compound SMILES and dosage), PRnet learns a function f that generates perturbed expression x̂ ~ N(μ, σ²).
Architecture: PRnet consists of three modules. (1) Perturb-adapter: encodes perturbations into a fixed-size embedding, using RDKit to derive Functional-Class Fingerprints (FCFP4, radius 2) from SMILES; fingerprints are dose-scaled via rFCFP = Σ log10(d + 1) · H(s) and passed through a 2-layer MLP to yield a k = 64-dimensional embedding z_pert. (2) Perturb-encoder: a 2-layer MLP maps [x, z_pert] to a latent embedding z_enc (e = 64). (3) Perturb-decoder: a 2-layer MLP takes [z_enc, z_pert, noise] and outputs per-gene μ and σ², defining a Gaussian likelihood from which perturbed expression is sampled.
Loss: Gaussian negative log-likelihood (GaussNLLLoss), with σ² clamped from below at eps = 1e-6; a softplus activation keeps the predicted variance positive.
Data pairing: Because only unpaired profiles are observed, each perturbed profile is assigned a matched unperturbed profile from the same cell line or cell type.
Preprocessing: Bulk data (L1000) use 978 landmark genes; predicted 978-gene responses are linearly transformed to 12,328 genes for downstream signature analyses. Single-cell data (sci-Plex3) are normalized with a log transformation, and 5,000 highly variable genes are selected.
Training data: PRnet is trained on (i) bulk HTS L1000: 883,269 profiles, 82 cell lines, 175,549 bioactive compounds; and (ii) single-cell sci-Plex3: 290,888 profiles, 3 cell lines, 188 compounds across four doses.
Splits: Datasets are split 6:2:2 into train/validation/test under strict out-of-distribution protocols: random_split, compound_split (unseen compounds), and cell_line_split (unseen cell lines) for bulk; and random_split, compound_split, and pathway_split for single-cell. Five-fold cross-validation is used per split strategy.
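The dose scaling and the clamped Gaussian likelihood described above can be illustrated with a minimal, standard-library sketch. It is not the authors' implementation: the function names (`dose_scale`, `softplus`, `gaussian_nll`) are hypothetical, the toy 4-bit fingerprint stands in for a real 1024-bit FCFP4 vector, the dose unit is left unspecified, and dropping the constant term of the log-likelihood is an assumption.

```python
import math

EPS = 1e-6  # variance floor quoted in the text


def dose_scale(fingerprint, dose):
    """Scale a binary fingerprint H(s) by log10(dose + 1)."""
    weight = math.log10(dose + 1.0)
    return [weight * bit for bit in fingerprint]


def softplus(x):
    """Numerically stable log(1 + exp(x)); may underflow to 0.0 for very negative x."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def gaussian_nll(x, mu, var, eps=EPS):
    """Mean per-gene Gaussian NLL with the constant term dropped (assumption)."""
    total = 0.0
    for xi, mi, vi in zip(x, mu, var):
        vi = max(vi, eps)  # clamp so the log and the division stay finite
        total += 0.5 * (math.log(vi) + (xi - mi) ** 2 / vi)
    return total / len(x)


# Toy example: a 4-bit fingerprint stands in for the 1024-bit FCFP4 vector.
scaled = dose_scale([1, 0, 1, 1], dose=9.0)  # weight is log10(10)

# Hypothetical raw decoder outputs for two genes; softplus maps them to variances.
variances = [softplus(v) for v in (0.0, -800.0)]
loss = gaussian_nll(x=[0.5, 1.0], mu=[0.4, 1.0], var=variances)
print(scaled, variances, loss)
```

The second variance underflows to exactly 0.0 after softplus, which is why a floor such as eps is still needed on top of the positivity constraint: without it, both the logarithm and the division would be undefined.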
Evaluation metrics: fold-change, R², Pearson of log(FC) in compounds, Pearson of log(FC) in cov_compounds, and R² in compounds/cov_compounds, as defined in the paper's Methods.
Hyperparameters: batch size 512, learning rate 1e-3, weight decay 1e-8, Adam optimizer, early stopping. Module sizes: drug fingerprint h = 1024; z_pert k = 64; z_enc e = 64; hidden layers 128.
In silico screening workflow: Step 1: Given SMILES for a library and unperturbed profiles of target cell lines, PRnet predicts dose-dependent post-perturbation profiles (with 3 computational repeats). Step 2: Compute per-gene fold-changes and rank genes by fold-change; for bulk data, transform 978 to 12,328 genes using the L1000 projection. Step 3: Compute enrichment scores via a Kolmogorov–Smirnov-based GSEA-like procedure to assess reversal of disease or sensitive-compound signatures; sum the up- and down-signature scores to yield a final efficacy score; rank compounds.
Experimental validation: In vitro MTT assays measured viability for selected SCLC and CRC cell lines treated with top-ranked candidates to estimate IC50.
Additional analyses: Latent space visualization via t-SNE, pseudo-dose trajectories from mean latent embeddings per dose, and KEGG pathway GSEA on predicted ranked gene lists.
Tools: RDKit for fingerprints; Scanpy/anndata for single-cell processing; matplotlib/seaborn/ggplot2 for plots; clusterProfiler for GSEA; PyTorch for model implementation.
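Steps 2 and 3 of the screening workflow can be sketched as follows. This is an illustration under stated assumptions, not the paper's code: the summary only says the procedure is "Kolmogorov–Smirnov-based" and "GSEA-like", so the unweighted running-sum statistic below is one common variant chosen for simplicity, and the function names, gene names, and fold-change values are all hypothetical. Because the sign convention for combining the up- and down-signature scores is not given here, the sketch stops at the two per-signature scores.

```python
def rank_by_fold_change(fold_changes):
    """Return gene names sorted from most up- to most down-regulated."""
    return sorted(fold_changes, key=fold_changes.get, reverse=True)


def enrichment_score(ranked_genes, gene_set):
    """Unweighted KS-style running sum; returns the signed maximum deviation.

    Assumes every gene appears at most once in ranked_genes.
    """
    hits = set(gene_set) & set(ranked_genes)
    n_hit = len(hits)
    n_miss = len(ranked_genes) - n_hit
    if n_hit == 0 or n_miss == 0:
        return 0.0  # score is undefined without both hits and misses
    running, best = 0.0, 0.0
    for gene in ranked_genes:
        # Step up on a signature gene, step down otherwise.
        running += 1.0 / n_hit if gene in hits else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best


# Hypothetical predicted log fold-changes for one compound in one cell line.
predicted_fc = {"G1": 2.0, "G2": 1.5, "G3": 0.1, "G4": -0.2, "G5": -1.0, "G6": -1.8}
ranked = rank_by_fold_change(predicted_fc)

# Hypothetical disease signature: genes up- and down-regulated in disease.
es_up = enrichment_score(ranked, {"G5", "G6"})    # disease-up genes
es_down = enrichment_score(ranked, {"G1", "G2"})  # disease-down genes
print(ranked, es_up, es_down)
```

In this toy case the disease-up genes sit at the bottom of the compound's ranking (negative score) and the disease-down genes at the top (positive score), which is the pattern a signature-reversing compound would be expected to show.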
Key Findings
Performance on bulk L1000: Using ~836,352 paired profiles across 82 cell lines and 175,549 compounds, PRnet achieved state-of-the-art performance on held-out tests. For unseen compounds, PRnet reached an average Pearson correlation of mean log(FC) of about 0.8, outperforming alternatives. For unseen cell lines, PRnet improved the Pearson of log(FC) substantially, exceeding other methods by more than 0.3. Cell line–specific evaluation (Pearson of log(FC) in cov_compounds) showed the best performance across splits: more than double the baseline scores for unseen cell lines, and a 0.16 improvement for unseen compounds.
Performance on single-cell sci-Plex3: For unseen compounds and unseen pathways, PRnet achieved superior R² scores (R² in compound: 0.969 for unseen compounds; ≈0.97 for unseen pathways) and also led in covariate-specific R² metrics.
Latent space structure: t-SNE of PRnet embeddings clustered profiles by cell line and tissue of origin, capturing cell line–specific and tissue-level similarities; pseudo-dose trajectories emerged (e.g., MCF7 with AG-14361), indicating dose-dependent heterogeneity.
Gene-program capture: PRnet accurately predicted fold-changes and directionality for known drugs, e.g., Vorinostat across 71 cell lines, in line with KEGG GSEA showing suppression of the cell cycle, DNA replication, and spliceosome pathways and activation of the autophagy, lysosome, and phagosome pathways. Additional compounds (bortezomib, MG-132, wortmannin) in HT29 showed close alignment between predicted and observed gene-level fold-change distributions. Single-cell predictions for GSK-LSD1 in A549, K562, and MCF7 matched differential expression trends and magnitudes.
SCLC candidate discovery: In silico screening of 4,158 active and 29,670 drug-like compounds across 6 SCLC lines identified (+)-Fangchinoline, (+)-JQ-1, and SEL120-34A HCl; MTT assays validated activity for (+)-Fangchinoline (CAS 436-77-1) and SEL120-34A HCl (CAS 1609452-30-3), with IC50 < 10 μM across SCLC lines.
CRC natural product discovery: Screening 30,456 natural compounds across 14 CRC lines nominated 7-Methoxyrosmanol (CAS 113085-62-4) and Mulberrofuran Q (CAS 101383-35-1); MTT assays showed moderate inhibition of CRC cell viability.
Large-scale perturbation atlas: PRnet generated >25 million predicted profiles, including (i) FDA-approved drugs: 1,891,330 profiles (935 drugs × 82 lines), (ii) anti-cancer compounds: 8,781,784 (4,158 compounds × 88 lines), (iii) natural compounds: 10,233,230 (30,456 compounds × 14 CRC lines), (iv) drug-like compounds: 4,272,486 (29,670 compounds × 6 SCLC lines), and (v) GTEx tissues: 1,245,510 (935 drugs × 54 tissues).
Drug recommendation across diseases: Using CREEDS signatures for 233 diseases, PRnet produced 577 candidate lists; examples with literature support include NASH (Mirabegron, Vidofludimus, Rifaximin), Crohn’s disease (Escin, Ozanimod), and PCOS (Enzalutamide, Linagliptin, Topiramate), with pathway analyses consistent with known mechanisms.
Discussion
PRnet addresses the infeasibility of exhaustive experimental perturbation screening by learning perturbation-conditioned distributions of gene expression, enabling accurate predictions for novel compounds and unseen cellular contexts at both bulk and single-cell resolution. The model’s latent embeddings capture biologically meaningful structure (cell line and tissue similarities, dose trajectories), facilitating mechanistic interpretation and downstream applications. Validations demonstrate PRnet’s utility in prioritizing compounds for SCLC and CRC, with in vitro assays confirming predicted activity ranges, and in recommending candidates for 233 diseases via reverse-signature matching, supported by prior literature in exemplar metabolic and inflammatory conditions. By generating a large, integrated atlas of perturbation profiles across cell lines, tissues, and compound libraries, PRnet expands the landscape for virtual HTS, drug repositioning, and toxicity analyses. These results show that data-driven generative modeling can reduce the cost and time of candidate discovery while preserving gene-level interpretability and generalization to out-of-distribution perturbations.
Conclusion
This work presents PRnet, a perturbation-conditioned deep generative model that predicts transcriptional responses to novel chemical perturbations across bulk and single-cell assays. PRnet outperforms existing methods in out-of-distribution settings, provides interpretable latent representations, and supports scalable in silico screening and drug recommendation. The model enabled discovery and experimental validation of candidate compounds for SCLC and CRC and produced a large-scale atlas of predicted perturbation profiles, supporting drug repositioning and disease-specific candidate ranking for 233 diseases. Future directions include enriching molecular encodings (e.g., 3D-aware formats and graph representations), integrating multi-omics beyond transcriptomics to address limitations of reverse-signature paradigms, incorporating phenotypic endpoints (AUC, IC50) into training and evaluation, and extending the framework to additional perturbation types (e.g., genetic perturbations).
Limitations
Chemical encoding relies on SMILES and RDKit-derived fingerprints, which may not capture 3D geometry, conformational flexibility, or dynamic interactions; alternative encodings (MOL/SDF, graph-based) could improve representation at the cost of complexity. The reverse-signature paradigm used for drug recommendation may fail for low-quality disease signatures or when transcriptional changes do not correlate with sensitivity, particularly in contexts with phenotypic plasticity and complex resistance mechanisms. The current screening and evaluation focus on gene-level effects; phenotypic measures (AUC/IC50) were not modeled directly and should be integrated for holistic assessment. Broader perturbation scenarios and incorporation of additional biological knowledge and multi-omics data are needed to enhance predictive power and generalizability.