
A deep learning model for predicting next-generation sequencing depth from DNA sequence
J. X. Zhang, B. Yordanov, et al.
This research by Jinny X. Zhang and colleagues at Rice University and Microsoft Research introduces a deep learning model that predicts sequencing depth in targeted high-throughput DNA sequencing, and extends the same architecture to hybridization kinetics.
Introduction
Whole-genome deep sequencing is often impractical for clinical applications due to cost, so targeted sequencing with probe-based enrichment is widely used. However, probe sequences exhibit diverse thermodynamics and kinetics, leading to non-uniform sequencing depth across targets. This non-uniformity either reduces sensitivity at low-depth loci or necessitates additional sequencing, increasing cost. Empirical optimization of probe sequences and concentrations is laborious. The research question is whether sequencing depth for a given probe can be predicted from sequence-informed features using a generalizable deep learning approach that incorporates basic DNA biophysical knowledge. The study aims to build a model that predicts NGS read depth from probe and target sequences by combining local (nucleotide-level) and global (molecule-level) features, thereby enabling rational panel design and balancing probe concentrations for improved uniformity.
Literature Review
Prior work provides well-validated models of DNA structure, thermodynamics, and kinetics, and widely used tools (e.g., Nupack) to compute base-pairing probabilities and folding energies. Traditional first-principles models become intractable in complex hybrid-capture systems, while expert feature-engineering approaches can lack generalizability. Recurrent neural networks have been successful in capturing long-range dependencies in sequence data (e.g., speech and NLP), which is relevant because DNA can form secondary structures with distal interactions. This motivates a hybrid approach that leverages a limited set of automatically computed biophysical features within a deep learning architecture to predict sequencing depth.
Methodology
Model architecture: A deep learning model based on gated recurrent units (GRUs) processes both target (T) and probe (P) sequences in bidirectional fashion (5'→3' and 3'→5'), totaling four GRUs. Each GRU has 128 hidden nodes. Per-nucleotide inputs comprise: (1) purine indicator (A/G), (2) strong base indicator (G/C), and (3) Nupack-computed probability that the nucleotide is unpaired (Punpaired). Encoding uses two binary channels for nucleotide identity to reflect biochemical similarities, plus an analog Punpaired. Although T is the reverse complement of P in these data, separate GRUs for T and P were used for generality.
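The per-nucleotide encoding described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names are hypothetical, and the Punpaired values are passed in as plain numbers (in the study they come from Nupack).

```python
from typing import List, Sequence, Tuple

PURINES = {"A", "G"}  # channel 1: purine (A/G) vs. pyrimidine (C/T)
STRONG = {"G", "C"}   # channel 2: strong (G/C) vs. weak (A/T) base


def encode_sequence(seq: str, p_unpaired: Sequence[float]) -> List[Tuple[float, float, float]]:
    """Return one (purine, strong, P_unpaired) triple per nucleotide.

    p_unpaired is assumed to be precomputed (e.g. by a folding tool);
    here it is simply passed through as the analog third channel.
    """
    seq = seq.upper()
    if len(seq) != len(p_unpaired):
        raise ValueError("need one P_unpaired value per nucleotide")
    features = []
    for base, p in zip(seq, p_unpaired):
        if base not in "ACGT":
            raise ValueError(f"unexpected base: {base!r}")
        features.append((float(base in PURINES), float(base in STRONG), float(p)))
    return features


def reverse_complement(seq: str) -> str:
    """Target sequence T for a probe P when T is the reverse complement of P."""
    return seq.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]
```

Under this scheme A, C, G, and T map to (1, 0), (0, 1), (1, 1), and (0, 0) in the two binary channels, so bases that share a chemical property also share a bit.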
Downstream network: For each reading direction, the final hidden states of the T and P GRUs are summed, and the two resulting 128-dimensional vectors are concatenated into a 256-dimensional vector. Four global features are appended: reaction temperature and the predicted standard free energies of probe folding ΔG°(P), target folding ΔG°(T), and duplex formation ΔG°(TP). This yields 260 inputs to a feed-forward neural network (FFNN) with two hidden layers (256 and 128 nodes). The output is a single scalar predicting log10(depth).
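The dimension bookkeeping can be made concrete with a short sketch. It assumes that "summed within direction" means adding the T and P hidden states that share a reading direction, which is the reading consistent with the stated 256-dimensional result. The function and argument names are hypothetical.

```python
from typing import List, Sequence

HIDDEN = 128  # hidden nodes per GRU, as stated in the text


def build_ffnn_input(
    t_fwd: Sequence[float],
    t_rev: Sequence[float],
    p_fwd: Sequence[float],
    p_rev: Sequence[float],
    temperature: float,
    dg_probe: float,
    dg_target: float,
    dg_duplex: float,
) -> List[float]:
    """Combine four final GRU states and four global features into one vector."""
    for state in (t_fwd, t_rev, p_fwd, p_rev):
        if len(state) != HIDDEN:
            raise ValueError(f"each hidden state must have {HIDDEN} entries")
    # Sum target and probe states that share a direction (assumed interpretation).
    fwd = [t + p for t, p in zip(t_fwd, p_fwd)]
    rev = [t + p for t, p in zip(t_rev, p_rev)]
    # Concatenate the two directions (2 x 128 = 256), then append 4 global features.
    return fwd + rev + [temperature, dg_probe, dg_target, dg_duplex]
```

With four all-zero 128-entry states and any four global values, the returned list has 260 entries, matching the FFNN input size given in the text.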
Training: Global features and log10(depth) were standardized within each training set. Weights were set with Xavier initialization. Optimization used Adam (learning rate 0.0001) with a batch size of 999, and 20% dropout was applied after each FFNN hidden layer to reduce overfitting. The model was implemented in TensorFlow and has approximately 300,000 parameters. Training was stopped early at around epoch 250 for SNP-based models and around epoch 1000 for the synthetic panel. Nupack supplied the Punpaired and ΔG values used as features.
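As a sanity check on the stated model size, the parameter count can be estimated from the layer sizes given in the text. The sketch below assumes a textbook GRU (three gates, each with input weights, recurrent weights, and one bias vector) and plain dense layers. Some framework implementations carry an extra bias vector per gate, so the exact figure in the original implementation may differ slightly.

```python
def gru_params(input_dim: int, hidden: int) -> int:
    """Textbook GRU: 3 gates x (input weights + recurrent weights + bias)."""
    return 3 * (input_dim * hidden + hidden * hidden + hidden)


def dense_params(n_in: int, n_out: int) -> int:
    """Fully connected layer: weight matrix plus bias vector."""
    return n_in * n_out + n_out


def total_params() -> int:
    gru = 4 * gru_params(input_dim=3, hidden=128)  # four GRUs, 3 inputs per nucleotide
    ffnn = dense_params(260, 256) + dense_params(256, 128) + dense_params(128, 1)
    return gru + ffnn


print(total_params())  # 302593
```

Under these assumptions the four GRUs contribute 202,752 parameters and the FFNN 99,841, which is consistent with the reported figure of roughly 300,000.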
Datasets: Three panels were used. (1) SNP panel: 39,145 80-nt probes (Twist Bioscience) at 65 °C; 1,105 probes with zero reads were excluded, leaving 38,040 on-target probes. (2) lncRNA panel: 2,000 probes designed independently but prepared with the same library protocol as the SNP panel; 34 zero-depth probes were excluded. (3) Synthetic (non-human) panel: 7,373 110-nt probes at 55 °C; 158 zero-depth probes were excluded; sequences were procedurally designed to avoid problematic motifs and extreme GC content. For the SNP and synthetic panels, 20-fold cross-validation was used: probes were randomly split into 20 classes, and each class was predicted by a model trained on the other 19. The lncRNA panel served as an independent test set for a model trained on SNP data.
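The 20-fold scheme, together with standardization computed on the training portion only, might look like the sketch below. The helper names and the fixed seed are illustrative assumptions, and the actual splitting code used in the study is not described in this summary.

```python
import random
from statistics import mean, pstdev
from typing import List, Sequence, Tuple


def k_fold_indices(n_items: int, k: int = 20, seed: int = 0) -> List[List[int]]:
    """Randomly partition item indices into k disjoint classes of near-equal size."""
    idx = list(range(n_items))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def standardize_from_training(
    train: Sequence[float], test: Sequence[float]
) -> Tuple[List[float], List[float]]:
    """Z-score both sets using the mean and standard deviation of the training set only."""
    mu = mean(train)
    sigma = pstdev(train) or 1.0  # guard against a constant feature
    return [(v - mu) / sigma for v in train], [(v - mu) / sigma for v in test]


folds = k_fold_indices(38040, k=20)
for held_out in range(len(folds)):
    test_idx = folds[held_out]
    train_idx = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
    # A model would be trained on train_idx and used to predict test_idx here.
```

For the 38,040 SNP probes this gives 20 classes of 1,902 probes each. Computing the scaling statistics from the training classes alone keeps information about the held-out class out of the model inputs.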
Reproducibility: To assess stability, 15 independent 20-fold cross-validations were run on the SNP panel with different splits and random initializations (300 models total). Pairwise prediction concordance was evaluated via Pearson correlation.
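Pairwise concordance between runs can be computed as below. This is a generic Pearson correlation in plain Python with hypothetical function names. It assumes each run supplies one prediction per probe, in the same probe order.

```python
from itertools import combinations
from math import sqrt
from typing import List, Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def pairwise_concordance(runs: Sequence[Sequence[float]]) -> List[float]:
    """Pearson r for every unordered pair of runs."""
    return [pearson_r(a, b) for a, b in combinations(runs, 2)]
```

With 15 cross-validation runs there are 105 unordered pairs, so a summary such as the minimum or mean r is taken over 105 values.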
Extension to kinetics: The same deep learning model (DLM) architecture was trained to predict single-plex hybridization and strand displacement rate constants from fluorescence time-course data. A set of 100 probe sequences (36 nt) tested at temperatures of 28–55 °C yielded 210 hybridization and 211 strand displacement measurements (421 total). Prediction used 100-fold leave-one-class-out validation, with classes grouped by probe sequence, to counter the bias of a small validation set. The DLM was co-trained on both reaction types, with the difference between them conveyed through mechanism-specific Punpaired features.
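Grouping by probe sequence means that all measurements of one probe (at different temperatures or reaction types) are held out together. A minimal sketch of such a split, with hypothetical names, is:

```python
from collections import defaultdict
from typing import Dict, Hashable, Iterator, List, Sequence, Tuple


def leave_one_group_out(
    groups: Sequence[Hashable],
) -> Iterator[Tuple[Hashable, List[int], List[int]]]:
    """Yield (group, train_indices, test_indices), holding out one group at a time."""
    by_group: Dict[Hashable, List[int]] = defaultdict(list)
    for i, g in enumerate(groups):
        by_group[g].append(i)
    for g, test_idx in by_group.items():
        held_out = set(test_idx)
        train_idx = [i for i in range(len(groups)) if i not in held_out]
        yield g, train_idx, test_idx


# Toy example: one label per measurement, naming the probe it came from.
labels = ["probe1", "probe1", "probe2", "probe3", "probe3", "probe3"]
splits = list(leave_one_group_out(labels))
```

With 100 distinct probe sequences this produces 100 splits, and no probe contributes measurements to both the training and the test side of any split.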
Feature ablation: Reduced models were evaluated by removing specific inputs (global ΔG terms, temperature, sequence identity, or Punpaired) to assess their contribution to prediction accuracy.
Key Findings
- Prediction accuracy for NGS depth:
  - SNP panel (39,145 probes; 38,040 with reads): 20-fold cross-validation RMSE ~0.301 for log10(depth), compared with ~0.41 for a naive mean model and ~0.34 for a linear model on the four global features. Fraction beyond factor-of-2 (F2err) 20.9%; beyond factor-of-3 (F3err) 7.31%; within factor 3: ~93%.
  - lncRNA panel (independent test, same library method): the SNP-trained model (early-stopped at 250 epochs) yielded RMSE 0.326, F2err 30.4%, F3err 11.0%; within factor 3: ~89%.
  - Synthetic panel (7,373 probes; 7,215 with reads): 20-fold cross-validation RMSE 0.116, F2err 1.98%, F3err 0.600%; within factor 3: ~99%.
- Reproducibility: Across 15 independent 20-fold cross-validation runs on SNP data, models consistently early-stopped around epoch 250 and showed high concordance with pairwise Pearson r ≥ 0.975 (average ~0.981), indicating stable predictions across initializations and splits.
- Error characteristics: Discrepancies often involved probes with very low observed depth and, frequently, low GC content. These likely reflect random experimental fluctuations that cannot be inferred from sequence alone.
- Kinetics prediction: The same DLM effectively predicted hybridization and strand displacement rate constants over ~4 orders of magnitude using 100-fold leave-one-class-out. Performance was comparable to an expert-system weighted neighbor voting approach for hybridization kinetics.
- Feature importance: Temperature (constant within a panel) and certain global ΔG features often had minimal impact on NGS depth prediction in these datasets; local features (sequence identity or base-pair opening probabilities) were important and somewhat interchangeable for SNP depth prediction. Reduced models still outperformed random baselines for kinetics tasks.
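The error metrics reported above can be reproduced from predicted and observed log10(depth) values. The sketch below assumes that F2err and F3err are the fractions of probes whose predicted depth differs from the observed depth by more than a factor of 2 or 3, that is, |log10(pred) − log10(obs)| > log10(factor). That definition is inferred from the wording here, and the function names are hypothetical.

```python
from math import log10, sqrt
from typing import Sequence


def rmse(pred_log: Sequence[float], obs_log: Sequence[float]) -> float:
    """Root-mean-square error between predicted and observed log10(depth)."""
    sq = [(p - o) ** 2 for p, o in zip(pred_log, obs_log)]
    return sqrt(sum(sq) / len(sq))


def fraction_beyond_factor(
    pred_log: Sequence[float], obs_log: Sequence[float], factor: float
) -> float:
    """Fraction of probes whose depth is mispredicted by more than `factor`-fold."""
    threshold = log10(factor)
    misses = sum(1 for p, o in zip(pred_log, obs_log) if abs(p - o) > threshold)
    return misses / len(pred_log)


# Toy data (not from the study): errors of 0, 0.2, 0.4, and 0.6 log10 units.
pred = [2.0, 2.0, 2.0, 2.0]
obs = [2.0, 2.2, 2.4, 2.6]
f2err = fraction_beyond_factor(pred, obs, 2)  # two of four exceed log10(2) ~ 0.301
f3err = fraction_beyond_factor(pred, obs, 3)  # one of four exceeds log10(3) ~ 0.477
```

The "within factor 3" figures quoted in the findings are simply 1 − F3err; for example, an F3err of 7.31% corresponds to about 92.7% of probes predicted within threefold.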
Discussion
The study demonstrates that incorporating minimal, automatically computable biophysical features into a recurrent neural network enables accurate prediction of targeted NGS sequencing depth from DNA probe sequences. The model captures both local and long-range sequence effects via bidirectional GRUs and leverages base accessibility (Punpaired) to integrate structural context, addressing the challenge posed by secondary structures and distal interactions. Compared to naive and linear baselines, the DLM substantially reduces error and generalizes to an independently designed lncRNA panel prepared with the same library method, supporting its practical utility for panel optimization and probe concentration balancing. High reproducibility across multiple cross-validation runs indicates robustness against training stochasticity. Extending the same architecture to predict hybridization and strand displacement kinetics underscores that the learned representations reflect underlying biophysical determinants of binding yield and speed, not merely dataset-specific artifacts. Feature ablation suggests that while certain global thermodynamic descriptors may be non-informative under specific conditions (e.g., uniform temperature, long probes), local sequence or base accessibility features are critical, aligning with mechanistic expectations. Remaining discrepancies, especially among low-GC probes with low observed depth, are consistent with experimental variability beyond sequence-derived predictors.
Conclusion
This work introduces a deep learning framework that predicts targeted NGS read depth from probe and target sequences by combining bidirectional GRUs on per-nucleotide features with a compact set of global thermodynamic descriptors. The model achieves strong accuracy across human SNP and non-human synthetic panels, generalizes to an independent lncRNA panel using the same library preparation method, and robustly reproduces predictions across runs. The same architecture also predicts single-plex hybridization and strand displacement rate constants, indicating broad applicability to nucleic acid assay design. Future research should improve base-pair accessibility modeling in complex mixtures, incorporate additional sources of experimental variance when available, and explore adaptation across different library preparation chemistries. The established architecture may be extended to other nucleic acid problems, including RNA structure/function prediction and codon optimization.
Limitations
- Sensitivity to experimental variability: Observed deviations, especially for low-depth, low-GC probes, likely reflect random fluctuations in synthesis yield, hybridization efficiency, non-specific interactions, or sequencing steps that are not encoded by sequence-based features.
- Panel/method specificity: The model generalizes well within the same library preparation method but may require retraining or adaptation across different methods due to numerous protocol-specific variables.
- Limited utility of some global features: Temperature was constant within panels, precluding learning its effects; predicted target folding energies are inaccurate for fragmented genomic DNA with variable overhangs; duplex formation energy may be less informative for long probes where thermodynamics are not limiting.
- Dependence on Nupack-derived accessibility: Punpaired and ΔG values may carry significant error in complex heterogeneous solutions, potentially limiting maximal achievable accuracy.
- Exclusion of zero-read probes: Probes with zero reads were removed to avoid confounding from synthesis failures or noise, which may bias evaluation toward successfully captured targets.