Introduction
The human genome presents significant challenges for accurate sequencing and variant calling due to variations in genomic contexts such as repetitive bases, local repeats, and large duplications. Different sequencing platforms (short-read and long-read technologies) and bioinformatics tools (mappers and variant callers) exhibit varying performance characteristics across these diverse genomic contexts. Currently, evaluating the performance trade-offs of different sequencing pipelines often relies on intuition rather than a data-driven approach. This lack of a robust predictive model hinders the optimization of pipelines for specific applications, particularly in clinical settings where the accurate identification of medically relevant variants is critical. The Genome in a Bottle Consortium (GIAB) provides genome stratifications to assess performance in various genomic contexts, but these stratifications do not provide a predictive model for error likelihood. Existing methods, such as GATK Variant Quality Score Recalibration and deep learning models, focus primarily on read characteristics and often lack interpretability. They also tend to concentrate on improving precision (reducing false positives) without directly addressing the prediction of true variants that are likely to be missed (reducing false negatives). This research aims to develop an interpretable model to predict both the precision and recall of variant calling based on genomic context, enabling informed decision-making when designing variant calling pipelines for various applications, from large-scale genomic studies to clinical diagnostics.
Literature Review
Several approaches have been employed to model sequencing errors, mostly focusing on variant calling and filtering out false positives. Methods like GATK Variant Quality Score Recalibration use Gaussian Mixture Models, while recent deep learning models leverage sequence data and aligned read characteristics. Clinical laboratories use additional methods to predict real variants and avoid the need for orthogonal confirmation. However, these approaches often lack interpretability, neglect genomic context, and do not predict the recall of true variants that might be missed. The GIAB stratifications offer insights into performance in specific genomic contexts, but do not provide a quantitative predictive model. This paper addresses these limitations by developing StratoMod, aiming for both accuracy and interpretability.
Methodology
StratoMod uses Explainable Boosting Machines (EBMs), an interpretable machine-learning approach based on generalized additive models (GAMs). The model's predictions are the sum of univariate functions of individual features and pairwise interaction functions that capture relationships between genomic context features, so each feature's contribution to a prediction can be visualized. This interpretability is crucial for clinical applications, because it lets clinicians understand and justify the model's predictions. The model was trained on data from several sources, including the GIAB v4.2.1 benchmark VCFs and a new T2T-HG002-Q100 assembly-based benchmark, which provides a more complete representation of difficult-to-analyze genomic regions. The input features covered diverse aspects of genomic context, including homopolymer length and imperfection, tandem repeat characteristics, segmental duplication features, RepeatMasker annotations, and mappability scores. Separate models were trained for single nucleotide variants (SNVs) and insertion/deletion variants (INDELs), with false negative (FN), false positive (FP), and true positive (TP) variants used for training and testing. Performance was assessed using precision, recall, and the Jaccard index. The model's ability to predict recall was evaluated across several pipeline configurations, including comparisons between Illumina and PacBio HiFi data, linear versus graph-based mappers (BWA-MEM and VG), different versions of a sequencing technology (Ultima), and different versions of ONT's base caller and variant caller (Guppy and Clair). The model was also validated by comparing its predictions on ClinVar variants with the filtering behavior of gnomAD, which employs a different model and dataset.
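To make the additive structure concrete, the following is a minimal sketch of how a GAM-style model with univariate and pairwise terms turns feature values into a probability. It is not StratoMod's code or its learned model: the feature names, bin edges, scores, intercept, and the choice of "probability that a true variant is detected" as the target are all assumptions made for illustration. A real EBM learns its shape functions from training data; here they are written by hand so that only the scoring step is shown.

```python
import math
from bisect import bisect_right

# Hypothetical univariate shape functions: (bin edges, log-odds score per bin).
# A value falls in bin i when edges[i-1] <= value < edges[i].
SHAPES = {
    "homopolymer_length": ([5, 10, 20], [0.5, 0.0, -1.0, -2.5]),
    "mappability": ([0.5], [-1.5, 0.5]),
}

# Hypothetical pairwise shape function: a grid indexed by the bins of both features.
PAIR_SHAPES = {
    ("homopolymer_length", "mappability"): ([10], [0.5], [[0.0, 0.0], [-0.5, 0.0]]),
}

INTERCEPT = 2.0  # made-up baseline log-odds


def predict(variant):
    """Return (probability, per-term contributions) for one variant's features."""
    terms = {}
    for name, (edges, scores) in SHAPES.items():
        terms[name] = scores[bisect_right(edges, variant[name])]
    for (a, b), (edges_a, edges_b, grid) in PAIR_SHAPES.items():
        row = bisect_right(edges_a, variant[a])
        col = bisect_right(edges_b, variant[b])
        terms[f"{a} x {b}"] = grid[row][col]
    logit = INTERCEPT + sum(terms.values())
    probability = 1.0 / (1.0 + math.exp(-logit))
    return probability, terms


if __name__ == "__main__":
    hard = {"homopolymer_length": 12, "mappability": 0.2}
    prob, terms = predict(hard)
    print(f"probability = {prob:.3f}")
    for name, score in terms.items():
        print(f"  {name}: {score:+.1f}")
```

Because the prediction is a plain sum on the log-odds scale, the `terms` dictionary is itself the explanation: each entry states how much one feature, or one feature pair, moved the prediction up or down.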
Key Findings
StratoMod demonstrated high accuracy in predicting recall for both SNVs and INDELs across multiple platforms and genomic contexts. The model successfully distinguished the performance of Illumina and PacBio HiFi pipelines in various genomic regions: PacBio HiFi performed better in hard-to-map regions, while Illumina performed better in homopolymer regions. The model also showed how different genomic features influence variant calling accuracy. For instance, longer homopolymers and tandem repeats increased the likelihood of errors, especially for INDELs. StratoMod's interpretability allowed the identification of specific genomic regions where variants are most likely to be missed by particular pipelines and technologies. A direct comparison between linear and graph-based mappers revealed a significant advantage for the graph-based mapper (VG) in difficult-to-map regions, particularly for INDELs within highly similar segmental duplications. The model was also used to assess the performance gains from updated versions of a sequencing technology (Ultima) and of ONT base and variant callers, and to identify the specific error mechanisms associated with these improvements. Validation using ClinVar and gnomAD data showed a reasonable correlation between StratoMod's predictions and the filtering criteria used by gnomAD. Some discrepancies were observed, partly due to differences in data and modeling approaches, but StratoMod generally predicted slightly fewer errors than gnomAD filtering, particularly for variants with low allele counts. The model identified numerous genes with pathogenic or likely pathogenic variants predicted to be missed by short-read sequencing and pointed out regions where long-read sequencing could improve accuracy.
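For reference, the recall discussed in these findings, together with the precision and Jaccard index named in the methodology, reduces to simple ratios of benchmark counts. The sketch below uses the standard definitions; that the paper computes the Jaccard index exactly as TP / (TP + FP + FN) is an assumption, and the counts in the example are arbitrary placeholders, not results from the study.

```python
import math


def precision(tp, fp):
    """Fraction of called variants that are correct: TP / (TP + FP)."""
    total = tp + fp
    return tp / total if total else math.nan


def recall(tp, fn):
    """Fraction of true variants that were called: TP / (TP + FN)."""
    total = tp + fn
    return tp / total if total else math.nan


def jaccard(tp, fp, fn):
    """Overlap between called and true variant sets: TP / (TP + FP + FN)."""
    total = tp + fp + fn
    return tp / total if total else math.nan


if __name__ == "__main__":
    # Arbitrary placeholder counts for one pipeline in one genomic context.
    tp, fp, fn = 90, 10, 30
    print(f"precision = {precision(tp, fp):.3f}")
    print(f"recall    = {recall(tp, fn):.3f}")
    print(f"jaccard   = {jaccard(tp, fp, fn):.3f}")
```

Computing these ratios separately for each genomic context is what allows two pipelines to be compared region by region rather than by a single genome-wide figure.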
Discussion
StratoMod advances on existing methods by providing an accurate and interpretable model for predicting variant calling errors that considers both genomic context and sequencing technology. Its ability to predict recall, in addition to precision, is a major contribution, enabling a more comprehensive assessment of pipeline performance. The interpretability of the model provides valuable insights into the mechanisms of error, aiding in the design of more robust pipelines and the prioritization of regions that require further development of tools and technologies. The reasonable agreement with gnomAD filtering, which relies on a different model and dataset, supports the generalizability and reliability of StratoMod's predictions. The ability to quantify the impact of various genomic features on variant calling accuracy helps researchers and clinicians make informed decisions about pipeline design and resource allocation. StratoMod's capacity to predict variants that might be missed enables proactive risk assessment and mitigation, which is especially important in clinical diagnostics, where missing a potentially lethal variant has severe consequences. The use of StratoMod is not limited to the specific pipelines and data evaluated in this paper; its flexible design allows adaptation to different technologies and variant callers. The successful application of StratoMod to various use cases, such as comparing linear and graph-based mappers and assessing improvements in base and variant callers, highlights its versatility and potential for broad impact.
Conclusion
StratoMod provides a powerful and interpretable tool for predicting and understanding variant calling errors, incorporating both genomic context and sequencing technology. Its ability to predict recall, coupled with its interpretability, fills a critical gap in current variant calling pipeline assessment. The model's validation and diverse applications demonstrate its utility in contexts ranging from genomic research to clinical diagnostics. Future directions include expanding the feature set, incorporating additional interactions, and extending its application to somatic variant calling and other sequencing data types. The availability of StratoMod as an open-source tool allows the research community to adapt and further improve its capabilities.
Limitations
StratoMod was designed for germline variant calls and primarily whole genome sequencing (WGS) datasets. Its performance on other data types, such as whole exome sequencing or targeted sequencing, may require additional features. The limited set of interactions between features, chosen to maintain interpretability, might hinder the model's ability to capture all complex relationships affecting variant calling accuracy. The model was trained on specific variant callers and technologies, and its generalizability to other methods needs further investigation. Although the model incorporated features to address several common sources of errors, it may not capture all possible error mechanisms. The ClinVar data used for validation might contain a small fraction of somatic variants, but this is not expected to significantly affect the results.