Introduction
Food authenticity is increasingly important due to growing international trade and consumer awareness. Agricultural products from specific regions command higher market value due to perceived higher quality. Despite regulations, adulteration and false labeling remain prevalent, harming fair competition and consumer trust. Accurate geographic origin identification is crucial for ensuring food quality and protecting consumers.
Tea is a globally consumed beverage with numerous health benefits. Regional provenance is a key attribute associated with high-quality tea. Famous teas like Darjeeling, Ceylon, Westlake Longjing, and Wuyi rock tea are often targeted for fraud due to their high demand and value.
Various analytical tools, including stable isotope analysis and multi-element profiling, are used for tea authentication but can be laborious. Metabolomics, coupled with chemometrics, is a promising approach due to its high sensitivity and throughput. Gas chromatography-mass spectrometry (GC-MS) with solid-phase microextraction (SPME) simplifies sample preparation and enables simultaneous monitoring of numerous volatile organic compounds (VOCs), making it suitable for tea metabolomics. Machine learning (ML) is capable of handling large, multidimensional datasets and has shown increasing use in food authentication and quality control.
Previous metabolomics studies on tea have used small sample sizes, limiting the reliability of findings, and have not focused on narrow geographic distinctions. Wuyi rock tea (WRT), a prestigious oolong tea, has protected geographical indication (PGI) status in China. Tea from the core production region (CRT) is considered superior and commands higher prices, making it susceptible to imitation. Conventional quality evaluation methods rely on subjective tea tasting, lacking quantitative data. This study aimed to assess the feasibility of using ML-based analysis of VOC metabolomes to differentiate WRT origins.
Literature Review
Existing literature highlights growing concern over food authenticity and rising demand for reliable methods to trace the geographic origin of agri-food products. Stable isotope analysis and multi-element profiling are effective for tea authentication, but they are technically demanding and require laborious sample preparation. Metabolomics combined with chemometrics has emerged as a promising alternative, offering broader metabolite coverage, high sensitivity, and high throughput. Gas chromatography-mass spectrometry (GC-MS), particularly with solid-phase microextraction (SPME), has become a central platform for tea metabolomics research because it can monitor a large number of volatile organic compounds (VOCs) simultaneously with proven reproducibility. Machine learning (ML) techniques are increasingly applied to metabolomics data and show potential for improving the accuracy and efficiency of food origin identification. However, previous studies often relied on small sample sizes, which limits the generalizability of their results and hinders the development of robust predictive models. This study aimed to address these limitations by using a larger dataset and comparing multiple ML algorithms.
Methodology
A total of 333 Rougui WRT samples were collected from the core production region (CRT, n = 174) and the non-core region (NCRT, n = 159) in Fujian Province, China. Headspace solid-phase microextraction (HS-SPME) coupled with gas chromatography-time-of-flight mass spectrometry (GC-TOFMS) was used to profile VOCs. A total of 2128 features were detected, with 44 volatiles identified and 236 tentatively identified. Data were preprocessed in MetaboAnalyst 5.0 using auto-scaling, quantile normalization, and log10 transformation. Principal component analysis (PCA) and orthogonal partial least squares-discriminant analysis (OPLS-DA) were performed in Simca-P v14.1. The Wilcoxon rank-sum test was used to compare volatile abundance between CRT and NCRT. Fifteen machine learning algorithms from scikit-learn were tested, using 80% of the data for training (with five-fold cross-validation) and 20% for validation. Model performance was evaluated on several metrics, including accuracy, precision, recall, and AUC. An independent test set of 17 samples was used for external validation, and a simplified model was built using the top 30 features.
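The evaluation metrics named above can be illustrated with a small, dependency-free sketch. This is not the study's code, which used scikit-learn. The function names, the label coding (1 = CRT, 0 = NCRT), the 0.5 decision threshold, and the toy labels and scores are all assumptions made for illustration.

```python
from typing import Sequence, Tuple


def confusion_counts(y_true: Sequence[int], y_pred: Sequence[int],
                     positive: int = 1) -> Tuple[int, int, int, int]:
    """Return (tp, fp, fn, tn) for a binary problem."""
    tp = fp = fn = tn = 0
    for t, p in zip(y_true, y_pred):
        if p == positive:
            if t == positive:
                tp += 1
            else:
                fp += 1
        else:
            if t == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, fn, tn


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    return (tp + tn) / (tp + fp + fn + tn)


def precision(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    tp, fp, _, _ = confusion_counts(y_true, y_pred)
    return tp / (tp + fp)


def recall(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    tp, _, fn, _ = confusion_counts(y_true, y_pred)
    return tp / (tp + fn)


def auc(y_true: Sequence[int], scores: Sequence[float],
        positive: int = 1) -> float:
    """Area under the ROC curve, computed as the fraction of
    (positive, negative) pairs in which the positive sample has the
    higher score; tied scores count as half."""
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    if not pos or not neg:
        raise ValueError("AUC needs at least one sample of each class")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data (hypothetical): 1 = CRT, 0 = NCRT.
y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.1]          # predicted P(CRT)
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 threshold

print("accuracy :", round(accuracy(y_true, y_pred), 3))
print("precision:", round(precision(y_true, y_pred), 3))
print("recall   :", round(recall(y_true, y_pred), 3))
print("AUC      :", round(auc(y_true, scores), 3))
```

Accuracy, precision, and recall depend on the chosen threshold, whereas AUC is computed from the ranking of the scores alone. That difference is one reason to report several metrics side by side when comparing classifiers.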
Key Findings
PCA did not clearly separate CRT and NCRT samples. OPLS-DA showed better separation, with a permutation test confirming model validity. Volcano plot analysis identified 111 differential VOCs. Twenty VOCs with significant differences (VIP > 1, p < 0.05, |fold change| > 1.5) were identified, including esters, hydrocarbons, ketones, alcohols, and heterocycles. CRT samples showed higher levels of volatiles with floral, fruity, and woody scents, as well as 2-acetylpyrrole (nutty, bready note) and several odorless branched alkanes. NCRT samples had higher levels of esters imparting fruity and green flavors, 5,6-epoxy-β-ionone (fruity, floral), and ethyl isopropyl ketone (minty). These differences suggest distinct aroma profiles: CRT with stronger floral, woody, and roasted notes; NCRT with stronger fruity and green odors. Among the fifteen machine learning algorithms tested, Multilayer Perceptron (MLP) achieved the highest accuracy (92.7%) on the training set using 176 volatile features, with over 90% accuracy on independent test sets. Gradient Boosting (GB) achieved the best accuracy (89.6%) on the training set using only the top 30 volatile features and performed robustly on independent test sets. The study used the largest panel of WRT samples tested to date (333 samples).
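The combined VIP, p-value, and fold-change screen can be sketched as a simple filter. This is a minimal illustration, not the study's pipeline. It assumes the VIP scores (from OPLS-DA) and p-values (from the Wilcoxon test) are already computed, that fold change is the ratio of mean CRT abundance to mean NCRT abundance, and that the fold-change cutoff is applied in both directions (ratio above 1.5 or below 1/1.5). The feature names and numbers are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FeatureStats:
    name: str
    vip: float        # VIP score from OPLS-DA (assumed precomputed)
    p_value: float    # Wilcoxon rank-sum p-value (assumed precomputed)
    mean_crt: float   # mean abundance in CRT samples
    mean_ncrt: float  # mean abundance in NCRT samples


def fold_change(f: FeatureStats) -> float:
    """Assumed definition: mean CRT abundance / mean NCRT abundance."""
    return f.mean_crt / f.mean_ncrt


def is_differential(f: FeatureStats, vip_min: float = 1.0,
                    p_max: float = 0.05, fc_min: float = 1.5) -> bool:
    """Apply all three cutoffs; the fold-change cutoff is two-sided."""
    fc = fold_change(f)
    changed = fc > fc_min or fc < 1.0 / fc_min
    return f.vip > vip_min and f.p_value < p_max and changed


def direction(f: FeatureStats) -> str:
    """Report which group has the higher mean abundance."""
    return "higher in CRT" if fold_change(f) > 1.0 else "higher in NCRT"


# Hypothetical feature table, for illustration only.
features: List[FeatureStats] = [
    FeatureStats("feature_A", vip=1.8, p_value=0.001, mean_crt=300.0, mean_ncrt=100.0),
    FeatureStats("feature_B", vip=1.2, p_value=0.010, mean_crt=50.0, mean_ncrt=100.0),
    FeatureStats("feature_C", vip=0.7, p_value=0.001, mean_crt=400.0, mean_ncrt=100.0),
    FeatureStats("feature_D", vip=1.5, p_value=0.200, mean_crt=250.0, mean_ncrt=100.0),
    FeatureStats("feature_E", vip=1.4, p_value=0.010, mean_crt=120.0, mean_ncrt=100.0),
]

selected = [f for f in features if is_differential(f)]
for f in selected:
    print(f.name, round(fold_change(f), 2), direction(f))
```

Requiring all three criteria at once keeps only features that contribute to the multivariate model, differ significantly in a univariate test, and show a sizeable change in abundance, which is why the resulting list is much shorter than the set flagged by any single criterion.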
Discussion
This study demonstrated the effectiveness of combining VOC metabolomics with ML algorithms to discriminate the geographic origin of Rougui WRT. The distinct volatile profiles of CRT and NCRT samples highlight the impact of terroir on tea flavor. The high accuracy achieved by both MLP and GB models, even with a reduced feature set, confirms the potential of this approach for WRT authentication. The results indicate that VOC profiling can serve as a reliable method for distinguishing between CRT and NCRT WRT, offering a rapid and objective alternative to traditional sensory evaluation methods. However, the influence of factors like processing methods and storage conditions warrants further investigation.
Conclusion
This study successfully applied VOC metabolomics and machine learning to discriminate the geographic origin of Rougui Wuyi rock tea, achieving high prediction accuracy with both MLP and GB models. The identified differential VOCs provide insights into the impact of terroir on tea aroma. This approach offers a rapid, reliable, and objective method for WRT authentication, potentially applicable to other agri-food products. Future research could explore the integration of other analytical techniques and data sources to further improve model accuracy and robustness.
Limitations
The study focused solely on Rougui WRT, limiting the generalizability to other Wuyi rock tea cultivars. The influence of factors such as tea processing variations and storage conditions on VOC profiles may affect the accuracy of geographic origin prediction. Further studies are needed to determine the relative contribution of these factors and the impact of potential fraud scenarios, such as intentional mixing of teas from different origins.