DIAMetAlyzer allows automated false-discovery rate-controlled analysis for data-independent acquisition in metabolomics

O. Alka, P. Shanthamoorthy, et al.

DIAMetAlyzer is an open-source workflow for targeted metabolomics from Oliver Alka, Premy Shanthamoorthy, Michael Witting, Karin Kleigrewe, Oliver Kohlbacher, and Hannes L. Röst. It reduces false discoveries in the analysis of DIA data while preserving quantification accuracy, including at low concentrations.

Introduction

Mass spectrometry-based metabolomics can be conducted using untargeted or targeted acquisition strategies. Untargeted methods, often based on data-dependent acquisition (DDA), aim to detect many metabolites but may have lower quantitative precision, whereas targeted methods (multiple or parallel reaction monitoring, MRM/PRM) quantify a defined set of compounds with high precision but limited coverage. Data-independent acquisition (DIA) acquires multiplexed MS2 spectra across mass windows, which improves reproducibility and coverage but increases spectral complexity and the number of false positives, particularly at low signal-to-noise. In metabolomics, automated creation of DIA assay libraries and robust processing of extracted ion chromatograms under false discovery rate (FDR) control have been lacking. The study addresses these gaps by introducing DIAMetAlyzer, an automated workflow that builds experiment-specific assay libraries from DDA, performs targeted extraction on DIA data, and applies a statistically calibrated target–decoy FDR estimation to control false discoveries while retaining quantitative accuracy comparable to manual analysis.

Literature Review

Prior work has established the strengths and weaknesses of DDA versus DIA in metabolomics, with DDA providing higher MS2 spectrum quality and DIA offering superior quantitative precision and coverage. Untargeted DIA analysis often relies on deconvolution to generate pseudo-MS2 spectra for identification (e.g., MS-DIAL) or quantification. Targeted strategies require predefined assays (precursor, fragments, retention time) but have remained largely manual in metabolomics. False discovery rate control via target–decoy strategies is standard in proteomics (Elias & Gygi, 2007) yet not routinely implemented for targeted DIA metabolomics. Tools such as OpenSWATH and PyProphet provide robust targeted DIA workflows and statistical validation in proteomics, while SIRIUS enables fragment annotation via fragmentation trees and Passatutto proposes tree re-rooting for decoy generation in metabolomics spectral matching. DIAMetAlyzer integrates these concepts to automate library generation and FDR control for metabolomics DIA.

Methodology

Overview: DIAMetAlyzer is an automated, open-source workflow built on OpenMS and KNIME that combines DDA-driven assay library generation with targeted DIA extraction and target–decoy FDR estimation. It integrates SIRIUS for fragment annotation and Passatutto for fragmentation-tree-based decoy generation, uses OpenSWATH for chromatogram extraction and scoring, and applies PyProphet (extended for metabolomics) for semi-supervised learning and q-value estimation.

Candidate identification (DDA): DDA data undergo feature detection (mass-to-charge ratio, retention time, intensity, charge), adduct grouping, and accurate mass search to propose candidate compositions. Precursor m/z and intensity are reannotated, and features can be filtered by isotope trace count. Feature mapping then assigns MS2 spectra to the detected features.

Library construction: AssayGeneratorMetabo (an OpenMS C++ tool) processes spectra (mzML) and features (featureXML), annotates fragments using SIRIUS’ compositional fragmentation trees, and extracts the n highest-intensity fragments that pass the intensity thresholds as transitions. Ambiguities (multiple observations of the same metabolite-adduct) are resolved by choosing the spectrum with the highest precursor intensity. The target library can be exported as tsv, traML, or pqp. Decoys are generated by Passatutto re-rooting of fragmentation trees. Where target and decoy transitions overlap, a −CH2 mass shift is applied to the decoy. The same −CH2 shift serves as a fallback when re-rooting fails (~13% of cases) or when decoys too closely resemble targets (~5% of cases). MS1 decoys are not generated.

Targeted extraction (DIA): The target–decoy assay library is used to extract precursor isotope traces and fragment ion chromatograms from DIA/SWATH data in OpenSWATH (metabolomics-extended), within user-defined RT windows. Peak groups are formed and scored using co-elution, chromatogram shape, and other features.
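The transition selection and the −CH2 fallback described under library construction can be illustrated with a minimal Python sketch. This is not the AssayGeneratorMetabo implementation: the `Fragment` class, the function names, the relative-intensity threshold of 5%, and the example peak list are all hypothetical, and the sketch assumes the shift is applied to fragment m/z values only.

```python
from dataclasses import dataclass

# Monoisotopic mass of CH2 in Da (C = 12.000000, H = 1.00782503).
CH2_MASS = 12.0 + 2 * 1.00782503


@dataclass(frozen=True)
class Fragment:
    mz: float
    intensity: float


def select_transitions(fragments, n=3, min_rel_intensity=0.05):
    """Keep the n most intense fragments above a relative-intensity floor.

    The threshold is relative to the most intense fragment; both parameters
    are illustrative defaults, not values taken from the workflow.
    """
    if not fragments:
        return []
    base = max(f.intensity for f in fragments)
    kept = [f for f in fragments if f.intensity >= min_rel_intensity * base]
    return sorted(kept, key=lambda f: f.intensity, reverse=True)[:n]


def ch2_shifted_decoys(transitions):
    """Fallback decoys: lower each fragment m/z by one CH2 unit."""
    return [Fragment(f.mz - CH2_MASS, f.intensity) for f in transitions]


# Hypothetical MS2 peak list for one metabolite-adduct.
spectrum = [
    Fragment(91.054, 1200.0),
    Fragment(119.049, 800.0),
    Fragment(65.039, 30.0),
    Fragment(147.044, 400.0),
    Fragment(77.039, 90.0),
]
targets = select_transitions(spectrum, n=3)
decoys = ch2_shifted_decoys(targets)
for t, d in zip(targets, decoys):
    print(f"target {t.mz:.4f} -> decoy {d.mz:.4f}")
```

In the workflow itself the shift is only a fallback; the primary decoy source is Passatutto re-rooting of the fragmentation tree, which this sketch does not attempt to reproduce.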
Statistical validation and FDR: PyProphet (extended for metabolomics) merges the OpenSWATH results, trains a linear discriminant analysis (LDA) model on weakly correlated scores to build a composite discriminant score, fits a null distribution using the decoys, estimates q-values, and exports results at the compound level. The approach provides statistically calibrated target–decoy FDR estimates.

Benchmark datasets and experiments: An Agilent LC/MS Pesticide Comprehensive Mix (APM) was spiked into human plasma extracts and measured by SWATH-DIA across a 10-step concentration series (4-fold dilutions spanning ~5 orders of magnitude) at two collision energy ranges (20–50 eV and 50–80 eV); 30 DIA samples were analyzed. DDA data (reference mixes in solvent/plasma) were acquired to build the library. A manual ground truth was curated in Skyline for validation. For broader benchmarking, the AMD serum dataset (MTBLS417) was used for comparison with MetaboDIA, with identifications by accurate mass search against HMDB and LIPIDMAPS; libraries from both tools were used for targeted quantification.

Instrument and LC-MS settings: UHPLC (Nexera, Shimadzu) coupled to a TripleTOF 6600 (AB Sciex). Column: UPLC BEH C18, 2.1 × 100 mm, 1.7 µm. Mobile phases: 0.1% formic acid (FA) in water (A) and 0.1% FA in acetonitrile (ACN) (B). Gradient: 5% B (0–0.5 min), to 100% B at 10 min (hold 3 min), re-equilibration at 5% B from 13.5 to 16 min; 5 µL injection. DDA: IDA with 200 ms MS1 and 80 ms MS2; TOF range 50–2000 m/z; CE ramp 20–50 V or 50–80 V. SWATH-DIA: 1 MS1 survey scan (240 ms) plus 8 variable SWATH windows (90 ms each) covering 100–900 m/z (windows optimized for plasma). Data were converted with ProteoWizard (qTofPeakPicker, msconvert). Downstream statistics were done in Python/R.

Computation and runtime: SIRIUS is used for fragment annotation, with a user-set per-compound time limit (default 100 s). Example: generating an assay library from 67 DDA samples with prior MS1 identification took ~2.5 h on 10 CPU cores (Xeon Gold 6140, 2.30 GHz). Allowing unknown features took ~12.5 h on 28 cores. The full KNIME workflow for the targeted pesticide mix experiment ran in 36 minutes on a single Intel Core i7 @ 3.50 GHz.
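To make the target–decoy idea behind the q-values concrete, here is a minimal Python sketch that estimates q-values by plain target/decoy counting over a ranked list. It is a simplification and not PyProphet's procedure, which learns a composite score and fits a null distribution from the decoys. The function name, the scores, and the decoy labels are hypothetical.

```python
def target_decoy_qvalues(scores, is_decoy):
    """Estimate a q-value per peak group by target/decoy counting.

    Peak groups are ranked by descending score. At each rank the FDR is
    estimated as (#decoys so far) / (#targets so far). The q-value of a
    peak group is the smallest FDR at its own rank or any lower rank.
    Score ties are not treated specially in this sketch.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    fdrs = []
    targets = decoys = 0
    for i in order:
        if is_decoy[i]:
            decoys += 1
        else:
            targets += 1
        fdrs.append(decoys / max(targets, 1))

    # Walk from the worst rank upward, carrying the running minimum FDR.
    qvalues = [0.0] * len(scores)
    running_min = float("inf")
    for rank in range(len(order) - 1, -1, -1):
        running_min = min(running_min, fdrs[rank])
        qvalues[order[rank]] = running_min
    return qvalues


# Hypothetical discriminant scores and decoy flags for eight peak groups.
scores = [9.1, 8.7, 8.2, 7.5, 6.9, 6.0, 5.2, 4.4]
is_decoy = [False, False, False, True, False, False, True, False]

qvalues = target_decoy_qvalues(scores, is_decoy)
accepted = [
    i for i, q in enumerate(qvalues) if not is_decoy[i] and q <= 0.05
]
print("q-values:", [round(q, 3) for q in qvalues])
print("targets accepted at 5% FDR:", accepted)
```

With so few peak groups the estimates are coarse; the sketch is only meant to show how decoy hits above a score threshold translate into an error estimate for the targets above that threshold.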

Key Findings

- FDR control markedly reduces false positives while retaining most true signals: applying 5% FDR reduced false positive peak groups by 91% (from 1471 to 125) while true positives decreased by 12% (from 3479 to 3071); at 1% FDR, false positives decreased by 98% (to 19) and true positives fell to 2523.
- Assay library coverage and transitions: stringent filtering of DDA-derived assays yielded high-quality entries; 9% of pesticides were undetected and an additional 14% lacked MS1 or adequate MS2 (≥4 peaks). Library coverage depends on the number of transitions required: 3 transitions = 60% coverage, 2 transitions = 71%, 1 transition = 77%. Using multiple collision energies increased 3-transition coverage by 11 percentage points, to 71%.
- Simulation of uniqueness: scoring both MS1 and MS2 with three transitions increased uniquely identified compounds by ~2.8× vs MS1-only and ~1.5× vs MRM-based analyses (with NIST 17 LC/MS as background).
- Classifier and FDR calibration: Precision–Recall AUC = 0.96, achieving >75% recall at 95% precision (5% FDR). FDR estimates were slightly conservative at lower CE (20–50 eV); higher CE (50–80 eV) had more ambiguous fragments.
- Quantification performance: After 5% FDR filtering and normalization, more than half of metabolites were detected at 1:1,024 dilution; median CV across technical triplicates < 0.2. Automated quantification precision matched or exceeded manual analysis (Skyline) at some dilutions; individual LODs (S/N ≥ 10) are reported per metabolite.
- Comparison with MS-DIAL (untargeted deconvolution) on the APM dataset: DIAMetAlyzer identified 156 true positives and 3 false positives (at 5% FDR) versus MS-DIAL’s 84 TPs, 5 FPs, and 70 FNs when provided the same library space.
- Comparison with MetaboDIA (MTBLS417): the DIAMetAlyzer library contained 695 features vs 476 (≈46% more); targeted quantification yielded 811 features vs 440 with the MetaboDIA library. Restricting to identified features, DIAMetAlyzer still quantified ≈25% more. A combined library (including 144 features unique to MetaboDIA) quantified 682 features; using DIAMetAlyzer’s identification-free option further increased quantified features.
- Biological insights (AMD dataset): At 5% FDR, LIMMA with BH correction identified 118 differentially expressed features using the DIAMetAlyzer library (113 with the MetaboDIA library; 162 with the combined library; 220 with the identification-free pipeline). Putative biomarkers included elevated acylcarnitines (e.g., oleoylcarnitine P_CNV = 0.002, P_PCV = 0.01; L-palmitoylcarnitine P_PCV = 0.02; linoelaidylcarnitine P_CNV = 0.04, P_PCV = 0.03), gamma-glutamyl dipeptides, dityrosine, and increased hypoxanthine in CNV (P_CNV = 0.006; 3.9×). Polyunsaturated fatty acids EPA (P_CNV = 0.04; P_PCV = 0.01; ~1.4–1.7×) and DHA (P_CNV = 0.008; P_PCV = 0.006; ~1.7–2.0×) were elevated in patient groups.
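The percentages in the findings above follow from the reported counts, and the arithmetic can be reproduced with a short Python sketch. The helper names are hypothetical, the precision and recall figures are derived here from the reported TP/FP/FN counts rather than quoted from the study, and the dilution list assumes the first step of the series is the undiluted sample.

```python
import math


def pct_reduction(before, after):
    """Percentage decrease from `before` to `after`."""
    return 100.0 * (before - after) / before


def precision(tp, fp):
    return tp / (tp + fp)


def recall(tp, fn):
    return tp / (tp + fn)


# Peak-group counts before filtering and after 5% / 1% FDR filtering.
fp_drop_5 = pct_reduction(1471, 125)
tp_drop_5 = pct_reduction(3479, 3071)
fp_drop_1 = pct_reduction(1471, 19)
tp_drop_1 = pct_reduction(3479, 2523)

# APM comparison with the same library space.
diametalyzer_precision = precision(tp=156, fp=3)
msdial_precision = precision(tp=84, fp=5)
msdial_recall = recall(tp=84, fn=70)

# 10-step, 4-fold dilution series (assumption: step 0 is undiluted).
dilution_factors = [4 ** k for k in range(10)]
orders_of_magnitude = math.log10(dilution_factors[-1])

print(f"FP reduction at 5% FDR: {fp_drop_5:.1f}%")
print(f"TP reduction at 5% FDR: {tp_drop_5:.1f}%")
print(f"FP reduction at 1% FDR: {fp_drop_1:.1f}%")
print(f"TP reduction at 1% FDR: {tp_drop_1:.1f}%")
print(f"DIAMetAlyzer precision: {diametalyzer_precision:.3f}")
print(f"MS-DIAL precision / recall: {msdial_precision:.3f} / {msdial_recall:.3f}")
print(f"Dilution span: {orders_of_magnitude:.1f} orders of magnitude")
```

Under the stated assumption, the 1:1,024 dilution corresponds to the sixth step of the series (4 to the power of 5).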

Discussion

The study demonstrates that integrating DDA-based assay library creation with DIA targeted extraction and target–decoy FDR estimation produces well-calibrated error control in metabolomics DIA. By using SIRIUS for fragment annotation and Passatutto for realistic decoys, the workflow sharply reduces false positives while preserving most true signals, addressing a key challenge of DIA’s multiplexed spectra. The LDA-based semi-supervised scoring in PyProphet yields high precision and recall, enabling confident identifications and quantification that matches or, at some dilutions, exceeds manual curation. Compared to untargeted deconvolution (MS-DIAL) and consensus library approaches (MetaboDIA), DIAMetAlyzer increases the number of quantified features and differentially expressed signals, which improves the power for biological discovery. Application to AMD serum highlights biologically plausible metabolite classes and putative biomarkers, indicating that accurate, FDR-controlled DIA quantification can support biomarker discovery and disease mechanism studies.

Conclusion

DIAMetAlyzer provides a fully automated, open-source workflow for metabolomics DIA that introduces well-calibrated FDR control via a target–decoy strategy, while delivering quantification accuracy comparable to manual approaches. By combining experiment-specific DDA-derived assay libraries with targeted DIA extraction and semi-supervised scoring, the pipeline reduces false positives, increases quantification coverage, and outperforms the benchmarked alternatives (MetaboDIA and MS-DIAL) in assay generation and targeted quantification. The workflow supports both targeted quantification of known compounds and identification-free quantification of unknowns, facilitating biomarker quantification even at low concentrations. Future work includes expanding assay libraries with additional reference standards to mitigate DDA abundance bias, improving runtime and mass range handling in SIRIUS for high-mass compounds, and conducting validation experiments to upgrade putative identifications to level 1 where appropriate.

Limitations

The workflow requires both DDA (for assay library generation) and DIA measurements; while the DDA library can be reused for the same setup, initial acquisition effort is needed. DDA is biased toward higher-abundance analytes, especially in complex matrices; libraries built from complex samples may underrepresent low-abundance compounds unless complemented with pure reference standards. Dependence on SIRIUS imposes requirements for high-resolution data and can result in long runtimes or processing limits for high-mass compounds (user-configurable per-compound time limit; runtime examples provided). Interoperability challenges complicate combining libraries from different feature detection tools; decoy generation at the library level is provided as a workaround. Decoy generation fallbacks (−CH2 shifts) are required in a minority of cases (~13% re-rooting failures; ~5% high similarity), which, while controlled, represent practical compromises.
