Determining clinical course of diffuse large B-cell lymphoma using targeted transcriptome and machine learning algorithms


M. Albitar, H. Zhang, et al.

This study classifies patients with diffuse large B-cell lymphoma (DLBCL) into four survival subgroups by applying machine learning to targeted transcriptome data. The authors include Maher Albitar and Hong Zhang.

Introduction
Diffuse large B-cell lymphoma (DLBCL) is the most common lymphoma but exhibits marked biological and clinical heterogeneity. Although standard R-CHOP cures over 60% of patients, many do not respond adequately, and uniform treatment is unlikely to benefit all subgroups. Prior subclassifications based on gene expression profiling (GCB vs ABC) and genomic algorithms (GenClass, LymphGen) or structural alterations (e.g., Chapuy et al.) identify biologically distinct subtypes but have limited ability to predict overall survival (OS) or progression-free survival (PFS) and can be complex to implement clinically. The authors hypothesized that genomic alterations ultimately manifest as transcriptomic changes; thus, targeted RNA sequencing combined with machine learning could yield a practical, clinically relevant classification that directly reflects patient survival under R-CHOP therapy. They propose first defining survival-based groups from clinical outcomes, then identifying RNA biomarkers that predict these survival-defined groups, enabling identification of patients unlikely to benefit from standard therapy.
Literature Review
The paper reviews prior DLBCL subclassifications: (1) Cell-of-origin (COO) by gene expression microarrays into GCB and ABC, with ~15% unclassified. (2) The GenClass genetic algorithm, which groups abnormalities into MCD (MYD88, CD79B mutations), BN2 (BCL6 fusions, NOTCH2 mutations), N1 (NOTCH1 mutations), and EZB (EZH2 mutations, BCL2 translocations) but classifies only ~54% of cases; it was extended to LymphGen with seven groups including EZB (MYC−/MYC+), A53 (TP53), and ST2 (TET2, P2RY8, SGK1). (3) Chapuy et al. integrated mutation and copy-number alterations to define five subgroups. (4) FISH-defined double/triple hit (MYC with BCL2/BCL6) confers aggressive disease and R-CHOP resistance. While biologically informative, these schemes have suboptimal prediction of OS/PFS and can require whole-exome sequencing, limiting routine clinical application.
Methodology
Study design and cohorts: Targeted RNA sequencing was performed on FFPE samples from 379 patients with de novo DLBCL (model development) and 247 with extranodal DLBCL (independent validation), all treated with R-CHOP across 22 centers under IRB approval. Transformed DLBCL, primary mediastinal large B-cell lymphoma, and primary cutaneous DLBCL were excluded. Clinical variables included IPI, ECOG, COO.
Sequencing and expression quantification: DNA/RNA co-extracted from FFPE using Agencourt FormaPure on Kingfisher Flex. Libraries targeted 1,408 cancer-associated genes (Illumina TruSight RNA Pan-Cancer Panel). Standard cDNA synthesis and adapter ligation, capture by sequence-specific probes, sequenced on Illumina NextSeq 550 with 2×150 bp, ~10 million reads/sample. Coverage depth 10×–1,739× (median 41×). Expression quantified with Cufflinks, reported as FPKM.
Survival labeling and handling censoring: Patients were first partitioned by OS into two groups: short survival (S) vs long survival (L). To address censored data lacking exact survival times, a machine learning approach estimated survival using the Kaplan–Meier survival function S(t) and the integral ∫ S(t) dt conditional on censoring time t0, with a modified rule to reduce bias toward long survival (use cohort mean when t0 ≤ mean; otherwise use t0 + ∫_{t0}^∞ S(t) dt).
Hierarchical survival grouping: The initial S vs L split was refined by splitting each into two, yielding four groups: LL (long survival within L), LS (short within L), SL (long within S), SS (short within S).
Biomarker selection and classifier: A generalized naïve Bayes classifier was developed, replacing the standard product of likelihoods with a geometric mean (h(x,d)=x^{1/d}) to prevent numerical underflow and reduce extreme probabilities. Theoretical properties were provided (expected value bounded away from zero as dimension grows; multiplicativity leading to x^{a(d)} form). To avoid overfitting, 12-fold cross-validation was used.
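The geometric-mean modification of naïve Bayes described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes Gaussian per-gene likelihoods, plain (unweighted) parameter estimates, a class prior kept outside the geometric mean, and synthetic data. The function names `fit_gaussian_classes` and `geometric_mean_posteriors` are hypothetical.

```python
import numpy as np

def fit_gaussian_classes(X, y):
    """Per-class feature means, variances and priors (plain, unweighted estimates)."""
    classes = np.unique(y)
    means = np.array([X[y == c].mean(axis=0) for c in classes])
    # A small floor keeps variances positive for near-constant features.
    variances = np.array([X[y == c].var(axis=0) for c in classes]) + 1e-6
    priors = np.array([(y == c).mean() for c in classes])
    return classes, means, variances, priors

def geometric_mean_posteriors(X, means, variances, priors):
    """Class posteriors using the geometric mean of per-feature likelihoods."""
    d = X.shape[1]
    # log N(x | mu, var) for every sample, class and feature: shape (n, k, d).
    diff = X[:, None, :] - means[None, :, :]
    log_lik = -0.5 * (np.log(2 * np.pi * variances)[None] + diff**2 / variances[None])
    # Geometric mean = product raised to 1/d = exp(mean of the logs).
    # Assumption: the prior is not included in the geometric mean.
    scores = np.log(priors)[None, :] + log_lik.sum(axis=2) / d
    scores -= scores.max(axis=1, keepdims=True)  # stabilise the normalisation
    post = np.exp(scores)
    return post / post.sum(axis=1, keepdims=True)

# Synthetic two-class data with many features (illustrative only).
rng = np.random.default_rng(0)
n, d = 100, 2000
y = np.repeat([0, 1], n // 2)
X = rng.normal(size=(n, d)) + 1.0 * y[:, None]

classes, means, variances, priors = fit_gaussian_classes(X, y)
post = geometric_mean_posteriors(X, means, variances, priors)
pred = classes[post.argmax(axis=1)]
accuracy = float((pred == y).mean())
```

With thousands of features, the raw product of likelihoods would underflow to zero in floating point and the normalised posteriors would collapse to 0 or 1; averaging the log-likelihoods keeps the scores on a per-feature scale, which is the motivation the summary gives for the modification.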
Genes were ranked by a discriminant measure (1 − cross-validated error) consistent with the classifier. Parameter estimation leveraged weights P(Ck|yi) linking survival time and class via a logistic model, yielding weighted means and variances for each class. Feature selection was hierarchical: 60 genes to predict S vs L, 60 to predict LL vs LS, and 60 to predict SL vs SS (total 180 genes). Smoothing of prediction distributions was applied to facilitate comparison among biomarkers.
Validation: Using the selected gene sets, the 379-case cohort was reclassified into LL/LS/SL/SS, and OS and PFS were evaluated. Independent validation used 247 extranodal DLBCL cases. Additional analyses combined all 626 cases with a train/test split. Multivariate Cox models assessed independence of survival classification versus IPI, COO, TP53, MYD88, CD79B, MYC, and IRF4 expression.
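The censoring rule from the survival-labeling step can be sketched as follows. This is a hedged illustration of the rule as summarized here, not the authors' code. It assumes that "cohort mean" means the Kaplan–Meier mean survival truncated at the last follow-up time, and it uses the tail integral exactly as written (the textbook conditional mean would divide that integral by S(t0)). The function names and the toy follow-up data are hypothetical.

```python
import numpy as np

def kaplan_meier(times, events):
    """Distinct event times and the KM survival estimate at each of them."""
    event_times = np.unique(times[events])
    surv, s = [], 1.0
    for t in event_times:
        at_risk = np.sum(times >= t)
        deaths = np.sum((times == t) & events)
        s *= 1.0 - deaths / at_risk
        surv.append(s)
    return event_times, np.array(surv)

def integral_of_survival(event_times, surv, start, end):
    """Integrate the right-continuous KM step function S(t) over [start, end]."""
    if start >= end:
        return 0.0
    if len(surv) == 0:  # no events observed: S(t) = 1 everywhere
        return float(end - start)
    inner = event_times[(event_times > start) & (event_times < end)]
    grid = np.concatenate(([start], inner, [end]))
    # S is constant on each [grid[i], grid[i+1]); look up its value on the left.
    idx = np.searchsorted(event_times, grid[:-1], side="right") - 1
    left_vals = np.where(idx >= 0, surv[np.maximum(idx, 0)], 1.0)
    return float(np.sum(left_vals * np.diff(grid)))

def impute_survival(times, events):
    """Replace censored times using the modified rule described in the text."""
    times = np.asarray(times, dtype=float)
    events = np.asarray(events, dtype=bool)
    event_times, surv = kaplan_meier(times, events)
    horizon = times.max()  # assumption: integrals are truncated at last follow-up
    cohort_mean = integral_of_survival(event_times, surv, 0.0, horizon)
    estimated = times.copy()
    for i in np.flatnonzero(~events):
        t0 = times[i]
        if t0 <= cohort_mean:
            estimated[i] = cohort_mean
        else:
            estimated[i] = t0 + integral_of_survival(event_times, surv, t0, horizon)
    return estimated, cohort_mean

# Toy follow-up data in months (event = 1 means death observed, 0 means censored).
times = [2, 5, 7, 10, 12, 20, 24, 30]
events = [1, 1, 0, 1, 0, 1, 0, 0]
estimated, cohort_mean = impute_survival(times, events)
```

Observed deaths keep their recorded times; only censored patients receive an estimate, and those censored early are pulled to the cohort mean rather than to a long extrapolated survival.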
Key Findings
- Survival grouping without biomarkers: Initial S vs L split yielded HR 0.237 (95% CI 0.170–0.330; P<0.00001). Four-group model (LL, LS, SL, SS) achieved HR 0.174 (95% CI 0.120–0.251; P<0.0001), indicating strong separation of survival strata.
- Biomarker-based prediction in development cohort (n=379): Using 180 selected genes (60 per split), the model reproduced OS and PFS stratification consistent with the survival-defined groups. Group distribution: LL 43% (164/379), LS 8.5% (32/379), SL 8% (29/379), SS 40.5% (153/379), with OS and PFS separation P<0.00001.
- Independent validation in extranodal DLBCL (n=247): Two-group model showed HR 0.426 (95% CI 0.278–0.653; P=0.002). Four-group model showed HR 0.530 (95% CI 0.234–1.197; P=0.121). Extranodal cases had overall shorter survival, as expected. A combined-cohort analysis (n=626) with held-out testing preserved model performance and revealed two intermediate survival groups with distinct biology.
- Association with COO: GCB was enriched in favorable survival groups (LL/LS; P<0.0001), and ABC enriched in SS. Despite similar survival between LS and SL, LS had more GCB than SL (P=0.016), supporting biological distinctness of intermediate groups.
- TP53 mutations: Present in 22% (82/379), associated with shorter survival (P=0.0019), enriched in short-survival groups (P=0.009), and remained an independent adverse predictor in multivariate models (e.g., HR ~1.44–1.54; P≈0.02–0.048 across models).
- MYD88 and CD79B: MYD88 mutations were more common in S group (P=0.001) but, in multivariate analysis including other covariates, MYD88 was an independent predictor of better survival (P=0.042). CD79B mutations were not predictive (P≈0.84).
- MYC expression: Higher in S groups (P<0.0001); associated with worse survival as continuous (P=0.0019) and categorical (upper quartile) variable (P=0.0021), but not independent in multivariate models.
- IRF4 expression: Overexpressed in S groups; LS showed lower IRF4 than SL (P=0.02) despite similar survival. In multivariate analysis, IRF4 mRNA was a borderline negative predictor (P=0.067).
- Multivariate Cox analysis: Survival classification and IPI (>2) were independent predictors; COO lost significance when included with survival classification and IPI. Age was an independent predictor when modeled without IPI; the poorest survival group (SS) had more patients >60 years (P=0.01).
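To see why the four-group result in the extranodal cohort reads as non-significant while the two-group result does not, one can back out an approximate Wald test from each hazard ratio and its 95% confidence interval. This is only a rough consistency check under a normal approximation on the log-HR scale. The P values reported above may come from a different test, so the numbers need not match, and the function name `wald_p_from_hr_ci` is hypothetical.

```python
import math

def wald_p_from_hr_ci(hr, ci_low, ci_high, z_crit=1.959964):
    """Approximate two-sided Wald P value implied by a hazard ratio and its 95% CI."""
    # On the log scale the CI is log(HR) +/- z_crit * SE, so SE follows from its width.
    se = (math.log(ci_high) - math.log(ci_low)) / (2 * z_crit)
    z = math.log(hr) / se
    # Two-sided standard-normal tail probability via the complementary error function.
    return math.erfc(abs(z) / math.sqrt(2))

# Hazard ratios and intervals quoted for the extranodal validation cohort.
two_group_p = wald_p_from_hr_ci(0.426, 0.278, 0.653)
four_group_p = wald_p_from_hr_ci(0.530, 0.234, 1.197)

# A 95% CI that contains 1 corresponds to an approximate P value above 0.05.
four_group_ci_includes_one = 0.234 <= 1.0 <= 1.197
```

The four-group interval spans 1, so the approximation places its P value above 0.05, in line with the reported lack of significance; the two-group interval lies entirely below 1.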
Discussion
The study demonstrates that a survival-first strategy—defining patient subgroups by observed OS under R-CHOP and then discovering transcriptomic biomarkers predictive of these groups—captures clinically meaningful heterogeneity in DLBCL. Targeted RNA sequencing of 1,408 genes combined with a robust, underflow-resistant generalized naïve Bayes classifier and hierarchical feature selection identifies 180 genes that reliably predict four survival strata. This survival-based genomic classification correlates with established biological markers (COO, TP53, MYC, IRF4) yet retains independent prognostic value beyond IPI and COO, indicating it integrates diverse biological determinants into a clinically actionable framework. Notably, the two intermediate survival groups (LS and SL) have similar outcomes but distinct molecular profiles (MYC and IRF4), underscoring that similar clinical courses can arise from different biological underpinnings. The approach enables identification of patients unlikely to respond to R-CHOP who may benefit from alternative therapies or clinical trials and suggests group-specific gene signatures that could inform targeted strategies. The model generalized to an independent extranodal cohort, though with attenuated discrimination for four groups, supporting external validity while highlighting cohort-dependent performance.
Conclusion
This work introduces a practical, transcriptome-based, machine learning classifier that stratifies DLBCL patients into four survival-defined groups under R-CHOP using 180 targeted RNA biomarkers. The model separates OS and PFS strata in the development cohort (P<0.00001) and carries over to an independent extranodal cohort for the two-group split (HR 0.426; P=0.002), although four-group discrimination there did not reach significance (P=0.121). Survival classification and IPI are independent prognostic factors; TP53 mutations confer additional independent adverse risk, while MYD88 mutations are independently favorable and CD79B is non-predictive. The method integrates heterogeneous biology into a clinically relevant framework capable of flagging patients unlikely to benefit from standard therapy, thereby guiding alternative treatment decisions and clinical trial enrollment. Future work should include prospective validation, integration with additional clinical and molecular data, refinement of gene panels for clinical deployment, and evaluation of therapy selection guided by these survival-defined molecular groups.
Limitations
- Retrospective, multi-center design may introduce selection and treatment heterogeneity; prospective validation is needed.
- Survival labeling depends on R-CHOP treatment; generalizability to other regimens requires testing.
- Age differences likely contribute to outcomes in the poorest group (SS), suggesting potential confounding by non-lymphoma mortality.
- Four-group discrimination in the independent extranodal cohort was weaker (non-significant in one analysis), indicating cohort-dependent performance.
- Targeted panel limits analysis to 1,408 genes; unmeasured genes/pathways and tumor microenvironment features beyond the panel are not captured.
- Naïve Bayes assumptions may not fully hold despite the geometric-mean modification; complex gene–gene dependencies are not explicitly modeled.