Metabolomics integrated with machine learning to discriminate the geographic origin of Rougui Wuyi rock tea

Food Science and Technology

Y. Peng, C. Zheng, et al.

This study combines volatile metabolomics with machine learning to determine the geographic origin of Rougui Wuyi rock tea, reporting classification accuracy above 90%. Authors: Yifei Peng, Chao Zheng, Shuang Guo, Fuquan Gao, Xiaxia Wang, Zhenghua Du, Feng Gao, Feng Su, Wenjing Zhang, Xueling Yu, Guoying Liu, Baoshun Liu, Chengjian Wu, Yun Sun, Zhenbiao Yang, Zhilong Hao, and Xiaomin Yu.

Introduction
This study addresses the need for reliable, quantitative authentication of the geographic origin of Wuyi rock tea (WRT), particularly distinguishing core-region (CRT, “Zhengyan”) from non-core region (NCRT) products, which differ in perceived quality and price and are subject to fraud. The research question is whether volatile organic compound (VOC) metabolomics integrated with machine learning can accurately discriminate tea origins within a narrow geographic scope using a large, representative sample set. The context includes increasing consumer concern about food authenticity and limitations of traditional sensory evaluation and some analytical methods. The purpose is to develop and validate a robust, high-throughput, and accurate model for origin authentication of Rougui WRT, demonstrating broader applicability to other agri-food products. The importance lies in protecting consumers, improving market fairness, and providing a practical alternative to laborious and costly methods while revealing terroir-driven aroma differences.
Literature Review
Prior work on food and tea authentication includes stable isotope analysis and multi-element profiling (effective but laborious), and metabolite fingerprinting approaches. GC-MS, especially with headspace SPME, is widely used for VOC profiling in foods and has been applied to origin identification in rice, wine, and oranges. Machine learning has become popular for food authentication, yet tea studies often have small sample sizes and broader geographic comparisons, limiting classifier robustness. Previous tea studies using metabolomics and chemometrics (UHPLC-QTOF/MS, GC/MS) demonstrated potential for origin discrimination, but with fewer samples and without extensive ML benchmarking across multiple algorithms. There is a recognized need to authenticate teas from narrow geographic scopes and to leverage larger datasets to select models matched to data complexity.
Methodology
Sample collection: 333 authentic Rougui Wuyi rock tea samples were collected in 2019–2020 from Fujian Province, China: CRT (n=174) from the core production region within Mount Wuyi Scenic Resort, and NCRT (n=159) from surrounding non-core regions. Samples were stored in airtight, lightproof foil at 4°C.
VOC extraction and GC-MS: Headspace solid-phase microextraction (HS-SPME) with a PDMS-DVB fiber was performed on 2 g of finely ground tea in 20 mL vials with 0.5 nmol ethyl decanoate as internal standard. GC-TOFMS was performed on a Restek Rxi-5Sil MS column (30 m × 0.25 mm, 0.25 μm). Oven program: 50°C for 5 min; ramp at 3°C/min to 210°C; ramp at 15°C/min to 330°C; hold for 5 min. EI at 70 eV; scan range 30–500 m/z. Triplicate analyses were run with pooled QC injections every ten runs.
Data processing: ChromaTOF (v4.51.6) was used for deconvolution and alignment with S/N=20, max RT difference=2 s, peak width=5 s, and spectral match score ≥700. Relative quantification was based on analyte/internal standard peak area ratios. Initial detection yielded 2128 features. QC evaluation and hierarchical clustering removed outliers, retaining 276 samples. Low-variance features (mean absolute deviation ≤0) were removed, resulting in 447 features.
Identification: 44 compounds were identified with authentic standards and 236 tentatively by NIST 20 library matching and retention indices.
Statistical analysis: Data were auto-scaled, quantile-normalized, and log10-transformed. PCA and OPLS-DA (Simca-P v14.1) assessed group separation and model validity (permutation testing). Differential metabolites were identified using volcano plot criteria (VIP>1, p<0.05, |fold change|>1.5) and the Wilcoxon rank-sum test.
Machine learning: From the 447 features, 176 stable volatile features (detected consistently across batches) were selected for modeling. The data were split into 80% training (n=220) and 20% test (n=56) sets, with stratification.
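The data-processing steps described above (internal-standard ratios, low-variance filtering, log10 transformation, auto-scaling, and a stratified 80/20 split) can be sketched roughly as follows. This is a minimal illustration on synthetic data, not the authors' pipeline: the feature count, the class balance of the 276 retained samples, and all variable names are assumptions, quantile normalization is omitted, and fitting the scaler on the training set only is a design choice of this sketch rather than something the summary states.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for raw peak areas (assumed shape: 276 samples x 50 features).
n_samples, n_features = 276, 50
peak_areas = rng.lognormal(mean=10.0, sigma=1.0, size=(n_samples, n_features))
internal_std = rng.lognormal(mean=10.0, sigma=0.1, size=n_samples)
# Placeholder class balance (0 = CRT, 1 = NCRT); the real post-QC counts are not given.
labels = np.array([0] * 140 + [1] * 136)

# Relative quantification: analyte / internal-standard peak area ratio.
ratios = peak_areas / internal_std[:, None]
ratios[:, :5] = 1.0  # simulate five uninformative (constant) features

# Drop features whose mean absolute deviation is zero.
mad = np.mean(np.abs(ratios - ratios.mean(axis=0)), axis=0)
filtered = ratios[:, mad > 0]

# log10 transform (all ratios are positive here).
logged = np.log10(filtered)

# Stratified 80/20 split; with 276 samples this gives 220 training and 56 test samples.
X_train, X_test, y_train, y_test = train_test_split(
    logged, labels, test_size=0.2, stratify=labels, random_state=0
)

# Auto-scaling (zero mean, unit variance), fitted on the training set only.
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.shape, X_test_scaled.shape)
```

Fitting the scaler on the training portion keeps test-set statistics out of the model inputs, which matters when the test accuracy is later used as an estimate of generalization.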
Fifteen scikit-learn classifiers were benchmarked: MLP, QDA, PA, SVM, LDA, RF, SGD, GB, AB, KNN, LinearSVM, BernoulliNB, ET, GaussianNB, DT (scikit-learn v0.24.2, Python 3.8.12). Models were tuned and evaluated by five-fold cross-validation on the training set, with hyperparameter optimization (a grid/Bayesian optimization pipeline is reported) and accuracy, precision, recall, and AUC as evaluation metrics. Computing: up to 40 parallel jobs on an 80-core machine, with a single-run time limit of 30 min and a memory limit of 100 GB.
Simplified model: A reduced feature set of the top 30 features by OPLS-DA VIP was used to retrain and evaluate all 15 algorithms with the same data splitting and validation procedure, to improve efficiency and assess robustness on small feature sets.
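A benchmarking loop of the kind described above might look like the following sketch. It uses synthetic data from `make_classification`, only five of the fifteen classifier families, and default or hand-picked hyperparameters; the grid/Bayesian hyperparameter search is left out, and the specific settings (hidden layer size, number of trees, random seeds) are illustrative assumptions rather than values from the study.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate, train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic two-class data standing in for the VOC feature matrix.
X, y = make_classification(
    n_samples=276, n_features=30, n_informative=10, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# A subset of the benchmarked classifier families (settings are illustrative).
candidates = {
    "MLP": MLPClassifier(hidden_layer_sizes=(32,), max_iter=1000, random_state=0),
    "SVM": SVC(),
    "LDA": LinearDiscriminantAnalysis(),
    "RF": RandomForestClassifier(n_estimators=100, random_state=0),
    "GB": GradientBoostingClassifier(random_state=0),
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scoring = ["accuracy", "precision", "recall", "roc_auc"]

# Five-fold cross-validation on the training set only.
results = {}
for name, clf in candidates.items():
    model = make_pipeline(StandardScaler(), clf)
    scores = cross_validate(model, X_train, y_train, cv=cv, scoring=scoring)
    results[name] = {m: float(np.mean(scores[f"test_{m}"])) for m in scoring}

# Pick the model with the best mean CV accuracy, refit, and score once on the test set.
best = max(results, key=lambda n: results[n]["accuracy"])
final = make_pipeline(StandardScaler(), candidates[best]).fit(X_train, y_train)
test_accuracy = final.score(X_test, y_test)

for name, metrics in results.items():
    print(name, {k: round(v, 3) for k, v in metrics.items()})
print("selected:", best, "test accuracy:", round(test_accuracy, 3))
```

Wrapping each classifier in a pipeline with the scaler means the scaling is refitted inside every fold, so the cross-validated scores are not inflated by information from the held-out fold.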
Key Findings
- VOC profiling and multivariate analysis:
  - PCA showed limited separation (PC1+PC2=23.1% variance). OPLS-DA achieved better class separation (R2Ycum=0.602, Q2cum=0.532); permutation intercepts R2=0.23, Q2=-0.26 supported model validity.
  - 111 differential VOCs between CRT and NCRT were identified (57 up, 54 down in CRT). Twenty VOCs met stringent criteria (VIP>1, p<0.05, |FC|>1.5), spanning esters, hydrocarbons, ketones, alcohols, heterocycles, and one unknown. CRT exhibited higher levels of floral/woody/roasty notes (e.g., hotrienol, α-terpineol, 2-acetylpyrrole, branched alkanes), while NCRT had higher fruity/green esters (e.g., hexyl 2-methylbutyrate, hexyl hexanoate, trans-2-hexenyl caproate, β-phenylethyl butyrate) and 5,6-epoxy-β-ionone.
- Machine learning with 176 features:
  - MLP achieved the highest 5-fold CV accuracy: 92.7% (AUC=0.96); other models >85% included QDA, PA, SVM.
  - External test set (n=56): 91.1% accuracy. Independent validation set (n=17, market samples): 94.1% accuracy.
- Machine learning with 30 features (simplified):
  - Gradient Boosting yielded the best CV accuracy: 89.6% (AUC=0.93), slightly higher than its 86.8% on the full 176-feature set.
  - Test set accuracy: 87.5%; independent validation: 94.1%.
- Practical comparison: The metabolomics+ML approach provides high accuracy comparable to more complex IRMS/ICP-MS-based methods while being simpler and faster.
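The three-part selection rule quoted above (VIP>1, p<0.05, |FC|>1.5) can be illustrated with a small filter. This is a sketch under stated assumptions: the VIP scores are taken as a precomputed input (the study obtained them from OPLS-DA in Simca-P), the fold change is taken as the ratio of group means and treated symmetrically (above 1.5 or below 1/1.5), and the function name `select_discriminative` and the toy data are hypothetical.

```python
import numpy as np
from scipy.stats import ranksums


def select_discriminative(x_a, x_b, vip, vip_min=1.0, alpha=0.05, fc_min=1.5):
    """Return indices of features passing the VIP, p-value and fold-change cutoffs.

    x_a, x_b: (samples x features) arrays of non-log intensities for two groups.
    vip: per-feature VIP scores, assumed to come from a separate OPLS-DA model.
    """
    # Wilcoxon rank-sum test, one feature at a time.
    p_values = np.array(
        [ranksums(x_a[:, j], x_b[:, j]).pvalue for j in range(x_a.shape[1])]
    )
    # Fold change as ratio of group means; symmetric cutoff on the log2 scale.
    log2_fc = np.log2(x_a.mean(axis=0) / x_b.mean(axis=0))
    mask = (
        (np.asarray(vip) > vip_min)
        & (p_values < alpha)
        & (np.abs(log2_fc) > np.log2(fc_min))
    )
    return np.flatnonzero(mask)


rng = np.random.default_rng(0)
n = 20  # toy group size
# Feature 0: strongly higher in group A. Feature 1: no difference.
# Feature 2: strongly higher in group A but given a low VIP score.
group_a = np.column_stack(
    [rng.uniform(9, 11, n), rng.uniform(4.9, 5.1, n), rng.uniform(9, 11, n)]
)
group_b = np.column_stack(
    [rng.uniform(0.9, 1.1, n), rng.uniform(4.9, 5.1, n), rng.uniform(0.9, 1.1, n)]
)
vip_scores = [1.5, 1.5, 0.5]

selected = select_discriminative(group_a, group_b, vip_scores)
print(selected)  # only feature 0 passes all three cutoffs
```

Requiring all three criteria at once is what narrows a long list of statistically different features down to a short list that is also large in effect size and influential in the multivariate model.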
Discussion
Integrating HS-SPME GC-MS VOC metabolomics with ML effectively addresses the research question by enabling accurate discrimination of Rougui Wuyi rock tea origins within a narrow geographic scope. The findings reveal clear terroir-associated aroma differences: CRT shows stronger floral, woody, and roasted characteristics, whereas NCRT is richer in fruity and green notes. The ML benchmarking demonstrates that aligning model complexity with data characteristics is crucial: MLP excelled with 176 features and larger effective dimensionality, while Gradient Boosting was superior with a reduced 30-feature set, offering robustness and lower overfitting risk. The high external and independent validation accuracies of the 176-feature MLP model (91.1% and 94.1%) indicate strong generalization. This approach offers a practical alternative to labor-intensive isotope/elemental methods, supporting market surveillance against mislabeling and fraud. Nonetheless, observed VOC differences may also be influenced by processing (e.g., roasting), storage, climate, harvest year, and agricultural practices, indicating that both terroir and postharvest factors shape the volatile profile. Incorporating additional spectral modalities and metadata could further enhance model reliability and interpretability.
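To put the reported percentages in terms of sample counts, the short calculation below converts each accuracy into the number of correctly classified samples it implies and attaches a 95% Wilson score interval. The counts are inferred from the reported percentages and set sizes, not taken from the paper, and the interval is added here only to show how much uncertainty a set of 17 or 56 samples carries; the helper names are hypothetical.

```python
import math


def implied_correct(accuracy_pct, n):
    """Number of correct predictions implied by a reported accuracy on n samples."""
    return round(accuracy_pct / 100 * n)


def wilson_interval(k, n, z=1.96):
    """Wilson score interval for k successes out of n trials (z=1.96 for ~95%)."""
    p = k / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half


# (reported accuracy in %, set size) as given in the findings.
reported = {
    "176 features, test set": (91.1, 56),
    "176 features, independent validation": (94.1, 17),
    "30 features, test set": (87.5, 56),
}

for name, (acc, n) in reported.items():
    k = implied_correct(acc, n)  # 51, 16 and 49 respectively
    low, high = wilson_interval(k, n)
    print(f"{name}: {k}/{n} correct, 95% interval {low:.3f} to {high:.3f}")
```

The interval for the 17-sample validation set is noticeably wider than for the 56-sample test set, which is one reason broader validation across more samples would sharpen the accuracy estimate.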
Conclusion
This work presents a robust, scalable pipeline combining HS-SPME GC-TOFMS volatile profiling with machine learning to authenticate the geographic origin of Rougui Wuyi rock tea. Using 176 stable VOC features, the MLP model achieved 92.7% cross-validated accuracy and >90% on external and independent validations, while a simplified 30-feature Gradient Boosting model maintained strong performance with improved efficiency. The study identifies 20 discriminative VOCs and evidences a terroir impact on WRT aroma. Future research should: (1) integrate complementary analytical platforms (LC-MS, NMR, NIR) and fusion strategies; (2) incorporate metadata on processing, storage, and agronomic factors; (3) expand sampling across years and producers; and (4) address complex real-world fraud scenarios (e.g., admixture of CRT and NCRT) to further improve generalization and robustness.
Limitations
Potential confounders include variability from roasting and other processing steps, storage conditions, climate, harvest year, and agricultural practices, which may influence VOC profiles alongside geography. The simplified MLP model showed signs of overfitting on test sets, underscoring model selection sensitivity with small feature sets. Real-world fraud (e.g., mixing CRT with NCRT) poses additional challenges not fully captured in the current design. Although external and independent validations were performed, broader multi-year, multi-batch validation and modality fusion could enhance generalizability.