Introduction
Targeted sequencing, using DNA hybridization probes to enrich regions of interest, is a cost-effective alternative to whole-genome sequencing for clinical applications and DNA data storage. However, variations in probe hybridization kinetics result in non-uniform sequencing depth, which reduces sensitivity and increases costs, and empirical optimization of probe panels is time-consuming. This study aimed to develop a computational method to predict sequencing depth from probe sequences, facilitating the design of more uniform NGS panels. Existing DNA biophysics models provide valuable insights into DNA structure, thermodynamics, and kinetics. To avoid overly complex feature engineering, a middle ground between hand-curated expert features and raw sequence input was sought, utilizing global (oligonucleotide-level) and local (nucleotide-level) features computed by the Nupack software. The resulting deep learning model (DLM) was trained and validated on three NGS panels: a human SNP panel (39,145 probes), a human lncRNA panel (2,000 probes), and a synthetic panel (7,373 probes) designed for DNA data storage. The lncRNA panel served as an independent test set for the SNP panel, sharing the same library preparation method. The choice of a recurrent neural network (RNN) architecture was motivated by the need to capture both short-range and long-range interactions within DNA sequences, which are known to influence capture efficiency.
Literature Review
The literature extensively covers DNA structure, thermodynamics, and kinetics modeling. Existing models, however, are often insufficient for predicting sequencing depth in complex multi-component systems like targeted enrichment. While ignoring biophysical knowledge and relying solely on sequence information would be suboptimal, creating highly curated, labor-intensive expert systems lacks generalizability. This research bridges the gap by integrating a small set of autonomously computed features with a deep learning approach, avoiding the limitations of both purely biophysical and purely data-driven methods.
Methodology
The DLM uses a recurrent neural network (RNN) architecture, specifically gated recurrent units (GRUs), to handle variable-length DNA sequences (50-150 nucleotides). GRUs process the target and probe sequences in both the 5' to 3' and 3' to 5' directions to reduce directional bias. Input features for each nucleotide are binary indicators for purine/pyrimidine and strong/weak base pairing, plus the Nupack-computed probability that the nucleotide is unpaired. The two sets of GRUs (one each for the target and probe sequences) feed into a feed-forward neural network (FFNN). The FFNN also incorporates four global features: reaction temperature, predicted free energies of probe and target folding, and predicted free energy of target-probe duplex formation. The FFNN outputs the log10 of the predicted read depth. The SNP and synthetic panels were each used independently for DLM training and 20-fold cross-validation. The lncRNA panel served as an independent test set for the SNP-trained model. The model was trained using the Adam optimizer and included dropout layers to prevent overfitting. Hyperparameters were optimized to minimize training time and maximize predictive performance. In addition to the NGS datasets, the DLM was applied to predict single-plex DNA hybridization and strand displacement rate constants using fluorescence-based kinetics data from separate experiments. A leave-one-class-out approach was employed for these predictions because of the smaller dataset size. Feature importance analysis was conducted by training and evaluating DLMs with specific features removed to assess each feature's contribution to prediction accuracy.
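The per-nucleotide inputs described above can be illustrated with a minimal Python sketch. It is not the authors' code: the function names are hypothetical, the unpaired probabilities are passed in as plain numbers rather than computed with Nupack, and reading a sequence 3' to 5' is assumed here to mean simply reversing the order of the feature vectors.

```python
from typing import List, Tuple

PURINES = {"A", "G"}  # purines; C and T are pyrimidines
STRONG = {"G", "C"}   # strong (G/C) pairing; A and T pair weakly

Feature = Tuple[float, float, float]


def encode_nucleotides(seq: str, p_unpaired: List[float]) -> List[Feature]:
    """Return one (is_purine, is_strong, p_unpaired) triple per nucleotide.

    p_unpaired is assumed to be supplied by the caller (in the study it
    comes from Nupack); this sketch does not compute it.
    """
    seq = seq.upper()
    if len(seq) != len(p_unpaired):
        raise ValueError("need one unpaired probability per nucleotide")
    features = []
    for base, p in zip(seq, p_unpaired):
        if base not in "ACGT":
            raise ValueError(f"unexpected base: {base!r}")
        features.append((float(base in PURINES), float(base in STRONG), float(p)))
    return features


def both_directions(features: List[Feature]) -> Tuple[List[Feature], List[Feature]]:
    """Return the features in 5'->3' order and in 3'->5' (reversed) order."""
    return features, features[::-1]


# Placeholder probabilities, chosen only for illustration.
forward, reverse = both_directions(encode_nucleotides("ACGT", [0.1, 0.2, 0.3, 0.4]))
```

The two binary indicators together identify the base uniquely (A = purine/weak, G = purine/strong, C = pyrimidine/strong, T = pyrimidine/weak), so nucleotide identity is fully encoded without a four-way one-hot vector.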
Key Findings
The DLM achieved high accuracy in predicting NGS sequencing depth. In cross-validation, it predicted sequencing depth within a factor of 3 with 93% accuracy for the SNP panel and 99% accuracy for the synthetic panel. Independent testing on the lncRNA panel, using the SNP-trained model, yielded 89% accuracy. The model's RMSE was significantly lower than that of naive and linear regression models. Analysis revealed that probes with low G/C content showed a higher discrepancy between predicted and observed depth, possibly because low-depth probes are more sensitive to random experimental fluctuations. The reproducibility of the DLM was demonstrated through 15 independent 20-fold cross-validation runs on the SNP panel, showing consistently high Pearson's r values (>0.975) across pairwise comparisons. The DLM also successfully predicted single-plex DNA hybridization and strand displacement rate constants, achieving results comparable to a previous expert-system approach. Feature importance analysis showed that nucleotide identity and unpaired probability are crucial features, while global features had minimal impact. The reaction temperature, being constant for each panel, was non-informative. The free energy of formation of the target-probe duplex was also of little use because probe binding was not thermodynamics-limited for the longer probes used. Finally, the free energy of target folding by itself could not be calculated accurately because of the heterogeneity of genomic DNA fragments.
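The "within a factor of 3" figures can be made concrete with a short Python sketch. It assumes that the reported accuracy means the fraction of probes whose predicted depth lies between one third and three times the observed depth, and that RMSE is taken on log10 depth (the quantity the model outputs). The function names and the example depths are hypothetical and are not data from the study.

```python
import math
from typing import Sequence


def fraction_within_factor(observed: Sequence[float],
                           predicted: Sequence[float],
                           factor: float = 3.0) -> float:
    """Fraction of probes whose predicted depth is within `factor` of observed.

    Depths must be strictly positive because the comparison is made on log10.
    """
    tol = math.log10(factor)
    hits = sum(
        abs(math.log10(p) - math.log10(o)) <= tol
        for o, p in zip(observed, predicted)
    )
    return hits / len(observed)


def log10_rmse(observed: Sequence[float], predicted: Sequence[float]) -> float:
    """Root-mean-square error between log10 predicted and log10 observed depth."""
    sq = [(math.log10(p) - math.log10(o)) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(sq) / len(sq))


# Illustrative depths only: the ratios are 1, 2.5, 0.4 and 4,
# so three of the four probes fall within a factor of 3.
observed = [100.0, 100.0, 100.0, 100.0]
predicted = [100.0, 250.0, 40.0, 400.0]
print(fraction_within_factor(observed, predicted))  # 0.75
```

Working in log space treats a threefold over-prediction and a threefold under-prediction symmetrically, which matches a fold-change criterion better than an absolute depth difference would.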
Discussion
The DLM successfully addressed the challenge of predicting NGS sequencing depth, outperforming simpler models. Its ability to generalize to different panels with the same library preparation method demonstrates robustness against experimental variations. The successful application to single-plex kinetics data highlights its potential beyond NGS. The reliance on a small set of automatically computed features simplifies model construction and improves generalizability compared to traditional expert systems. The limitations of the current approach relate to the inherent inaccuracies of Nupack base-pairing predictions and the heterogeneity of target sequences in the genomic DNA panels. Improvements may be achieved by developing models that more accurately predict base-pair accessibility in complex environments.
Conclusion
This study demonstrates the effectiveness of a deep learning model in predicting NGS sequencing depth from DNA probe sequences. The model’s accuracy and generalizability across different panels underscore its potential for improving NGS panel design. Future research should focus on improving base-pairing prediction methods and expanding applications to other nucleic acid-based problems like non-coding RNA structure prediction and codon optimization.
Limitations
The study's limitations include potential inaccuracies in the Nupack-predicted base-pair probabilities, especially in the complex multi-component system of hybrid-capture target enrichment. Additionally, the model's generalization to NGS panels with significantly different library preparation methods or target sequences requires further investigation. The impact of experimental variations on low G/C probes is a notable limitation.