Synchronously Predicting Tea Polyphenol and Epigallocatechin Gallate in Tea Leaves Using Fourier Transform-Near-Infrared Spectroscopy and Machine Learning

Food Science and Technology

S. Ye, H. Weng, et al.

Unlock the secrets of tea polyphenols and EGCG with cutting-edge FT-NIR spectroscopy and machine learning. Join Sitan Ye, Haiyong Weng, Lirong Xiang, Liangquan Jia, and Jinchai Xu as they reveal powerful predictive models that promise rapid screening for tea quality.

Introduction
Tea, one of the top three non-alcoholic beverages globally, contains bioactive compounds such as tea polyphenols and epigallocatechin gallate (EGCG) that are linked to antioxidant, anti-inflammatory, antimicrobial, and anticancer effects. Rapid, accurate detection of these compounds is important for quality control, product development, and consumer preference profiling. Conventional chemical methods (e.g., Folin phenol method and HPLC) are destructive, time-consuming, and labor-intensive. There is a need for non-destructive, rapid, and reliable approaches to determine tea polyphenols and EGCG across varieties during breeding. This study investigates whether Fourier Transform-near-infrared (FT-NIR) spectroscopy combined with machine learning can rapidly and accurately predict tea polyphenols and EGCG in tea leaves, enabling efficient screening of genotypes with high content of these compounds.
Literature Review
Spectroscopy, especially when combined with chemometrics, enables non-destructive, rapid, and accurate detection of chemical constituents in complex matrices and has been widely applied in agriculture and food science. Prior tea-related studies include: monitoring caffeine during green tea processing using Vis-NIR with SPA-MLR, achieving Rp^2 > 0.834; predicting polyphenols in fresh tea leaves with NIR-PLSR, achieving Rp^2 > 0.95 (Kumar et al., 2018); quantifying caffeine and nine catechins in green tea powder with NIR using MPLS, PCR, and MLR, where the models for the major catechins and caffeine achieved Rp^2 > 0.90 (Lee et al., 2014); and monitoring fermentation with Vis/NIR, where MPLS achieved Rc^2 > 0.94 for total catechins and theanine (Chen et al., 2021). These works demonstrate the feasibility of spectroscopic techniques for tea quality testing, but few of them target rapid detection of tea polyphenols and EGCG during breeding. The present study addresses this gap by using FT-NIR plus machine learning to build robust predictive models for both analytes and by exploring variable selection to improve efficiency.
Methodology
Samples: Four tea tree varieties (A, DC, BD, W1; Camellia sinensis L.) from the experimental garden of Fujian Agriculture and Forestry University were sampled on 28–31 March 2021. A total of 2520 fresh leaves were collected, and every 30 leaves of the same variety constituted one sample, yielding 84 samples. Leaves were fixed at 120°C for 6 min, dried at 90°C to constant weight, ground for 3 min, and sieved (80 mesh) to obtain tea powder.
FT-NIR data acquisition: An FT-NIR spectrometer (Antaris II, Thermo Fisher Scientific, US) with integrating sphere diffuse reflectance was used. Approximately 3 g of powder was placed in a sample cup with an inner diameter of 4.78 cm. Instrument settings: 64 scans, gain 2, room temperature ~25°C. Air was used as the reference and the background was removed. For each sample, spectra were acquired at three positions 120° apart, and the average spectrum was used. Spectral range considered: 10,000–4000 cm−1.
Reference analyses: Tea polyphenols were quantified by the Folin phenol (Folin–Ciocalteu) method. EGCG was determined by UPLC following GB/T 8313-2018. UPLC conditions: C18 column; flow 1 mL/min; column pressure ~8650 psi; 35°C; injection 2 µL; detector 200–400 nm, measurement at 278 nm; 10 min run; gradient program provided.
Preprocessing: Five spectral preprocessing methods were evaluated: Savitzky–Golay smoothing (SG), standard normal variate (SNV), vector normalization (VN), multiplicative scatter correction (MSC), and first derivative (FD). These target noise reduction, scatter and illumination correction, and feature enhancement.
Modeling algorithms: Partial least squares regression (PLSR) and least squares support vector regression (LS-SVR) were used to build quantitative models for predicting tea polyphenols and EGCG.
Outlier detection: A PLSR model with 1000 Monte Carlo cross-validation (MCCV) iterations was used to compute prediction residuals, and the distribution of each sample's residual mean (MEAN) and standard deviation (STD) was inspected. Samples exceeding a threshold of four times the mean over all samples were identified as outliers.
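The MCCV residual screen described above can be illustrated with a minimal numpy sketch. It is not the authors' Matlab code: the PLS routine is a plain single-response NIPALS implementation, the data are synthetic, and the component count, train fraction, iteration count, and the choice to flag a sample when either statistic exceeds its cutoff are all assumptions made for illustration.

```python
import numpy as np

def pls1_fit(X, y, n_components):
    """Single-response PLS (NIPALS). Returns (coef, x_mean, y_mean)."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xr, yr = X - x_mean, y - y_mean            # centered blocks, deflated in the loop
    W, P, q = [], [], []
    for _ in range(n_components):
        w = Xr.T @ yr
        w /= np.linalg.norm(w)                 # weight vector
        t = Xr @ w                             # scores
        tt = t @ t
        p = Xr.T @ t / tt                      # X loadings
        c = yr @ t / tt                        # y loading
        Xr = Xr - np.outer(t, p)               # deflate X
        yr = yr - c * t                        # deflate y
        W.append(w); P.append(p); q.append(c)
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    coef = W @ np.linalg.solve(P.T @ W, q)     # regression vector in X space
    return coef, x_mean, y_mean

def mccv_residual_stats(X, y, n_components=3, n_iter=200, train_frac=0.8, seed=0):
    """Per-sample mean and std of absolute prediction residuals over random splits."""
    rng = np.random.default_rng(seed)
    n = len(y)
    n_train = int(round(train_frac * n))       # assumed split fraction
    resid = np.full((n_iter, n), np.nan)       # NaN where a sample was in the training set
    for it in range(n_iter):
        perm = rng.permutation(n)
        train, test = perm[:n_train], perm[n_train:]
        coef, xm, ym = pls1_fit(X[train], y[train], n_components)
        pred = (X[test] - xm) @ coef + ym
        resid[it, test] = np.abs(y[test] - pred)
    return np.nanmean(resid, axis=0), np.nanstd(resid, axis=0)

def flag_outliers(mean_res, std_res, factor=4.0):
    """Flag samples whose MEAN or STD exceeds `factor` times the all-sample average
    (the 'either' rule is an assumed reading of the thresholding step)."""
    mean_cut = factor * np.nanmean(mean_res)
    std_cut = factor * np.nanmean(std_res)
    return np.where((mean_res > mean_cut) | (std_res > std_cut))[0]

# Synthetic stand-in for spectra: 60 samples, 50 variables, 3 latent factors.
rng = np.random.default_rng(1)
scores = rng.normal(size=(60, 3))
X = scores @ rng.normal(size=(3, 50)) + 0.01 * rng.normal(size=(60, 50))
y = scores @ np.array([1.0, -0.5, 0.3]) + 0.05 * rng.normal(size=60)
y[7] += 5.0                                    # plant one gross reference-value error

mean_res, std_res = mccv_residual_stats(X, y)
print("flagged samples:", flag_outliers(mean_res, std_res))
```

The study ran 1000 iterations; the smaller default here only keeps the demonstration fast.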
For tea polyphenols, sample 15 was flagged (MEAN > 2.058, STD > 1.816). For EGCG, samples 12 and 15 were flagged (MEAN > 2.062, STD > 1.967). Removing these samples improved PLSR performance.
Dataset split: After removing the two outliers, the remaining 82 samples were split by the Kennard–Stone algorithm into a calibration set (n=55) and a prediction set (n=27), a ratio of roughly 2:1.
Distributions: Tea polyphenols had a calibration range of 11.17–21.96% (mean 15.88, SD 2.33) and a prediction range of 12.62–18.82% (mean 14.64, SD 1.88). EGCG had a calibration range of 3.38–18.43% (mean 8.90, SD 3.07) and a prediction range of 5.72–11.68% (mean 8.18, SD 1.51).
Full-spectrum modeling: Models were built on the full spectra (1557 wavenumbers) with each preprocessing method. Performance was evaluated by the correlation coefficient (R), root mean square error (RMSE), and residual predictive deviation (RPD) for the calibration and prediction sets. Larger R and RPD and smaller RMSE indicate better models.
Variable selection: To reduce dimensionality and improve model efficiency, sensitive wavenumbers were selected using competitive adaptive reweighted sampling (CARS; 1000 Monte Carlo sampling runs; five-fold cross-validation using RMSECV) and random forest (RF; 1000 iterations; wavenumbers selected by a probability threshold). For tea polyphenols, CARS identified 30 key wavenumbers across the regions 4273–5484, 5966–6622, and 8469–9989 cm−1, largely associated with O–H and C–H functional groups and phenolic ring absorptions. For EGCG, RF selected 27 sensitive wavenumbers spanning 4223–9954 cm−1, attributed to O–H and C–H groups in phenolics and C=O in lipids.
Software: Matlab 2016a for spectral processing and modeling; Unscrambler X10.1 for preprocessing; Origin 2017C for figures.
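As an illustration of the Kennard–Stone split used for the calibration and prediction sets, the following numpy sketch implements the usual max-min distance selection. The function name, the Euclidean distance on raw spectra, and the random stand-in data are assumptions; the source does not state which distance or implementation the authors used.

```python
import numpy as np

def kennard_stone(X, n_select):
    """Return (selected, remaining) sample indices using Kennard-Stone selection.

    Starts from the two most distant samples, then repeatedly adds the sample
    whose nearest already-selected neighbour is farthest away.
    """
    X = np.asarray(X, dtype=float)
    sq = (X ** 2).sum(axis=1)
    # Squared Euclidean distances; squaring preserves the min/max ordering.
    d2 = np.clip(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0, None)
    i, j = np.unravel_index(np.argmax(d2), d2.shape)
    selected = [int(i), int(j)]
    remaining = [k for k in range(len(X)) if k not in (i, j)]
    while len(selected) < n_select:
        nearest = d2[np.ix_(remaining, selected)].min(axis=1)
        pick = remaining[int(np.argmax(nearest))]
        selected.append(pick)
        remaining.remove(pick)
    return np.array(selected), np.array(remaining)

# Hypothetical stand-in for the 82 retained spectra (random values, 100 variables).
rng = np.random.default_rng(0)
spectra = rng.normal(size=(82, 100))
cal_idx, pred_idx = kennard_stone(spectra, n_select=55)
print(len(cal_idx), len(pred_idx))
```

Because the calibration set is chosen to span the spectral space, the prediction set tends to fall inside the calibration range, which matches the narrower prediction ranges reported for both analytes.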
Key Findings
- Composition statistics (n=84): Mean tea polyphenols 15.54 ± 2.29%; mean EGCG 8.73 ± 2.75%. ANOVA across the four varieties showed significant differences: p-values 3.779×10^−11 (polyphenols) and 3.375×10^−14 (EGCG).
- Outlier removal via MCCV improved PLSR prediction: Tea polyphenols RPD increased from 3.345 to 3.721 (+11.24%); EGCG RPD from 1.796 to 1.981 (+10.30%).
- Full-spectrum models:
  • Tea polyphenols: Best predictive model LS-SVR with SG smoothing: R_p=0.975, RMSEP=0.420%, RPD=4.540. Comparable performance without preprocessing (R_p=0.975, RPD=4.539). PLSR showed R_p≈0.963 with RPD≈3.71–3.72 depending on preprocessing; FD improved PLSR calibration but not prediction.
  • EGCG: Best predictive model LS-SVR without preprocessing: R_p=0.936, RMSEP=0.637%, RPD=2.841. Preprocessing generally reduced LS-SVR predictive performance for EGCG; PLSR improved under SNV/VN/MSC/FD but remained below LS-SVR.
- Variable selection improved efficiency and often performance:
  • Tea polyphenols: SG-Smooth + CARS + LS-SVR using 30 wavenumbers achieved R_p=0.978, RMSEP=0.395%, RPD=4.833, reducing variables by ~98.07% from 1557.
  • EGCG: Original spectra + RF + LS-SVR using 27 wavenumbers achieved R_p=0.944, RMSEP=0.937%, RPD=3.049, reducing variables by ~98.26%.
- Sensitive spectral regions correspond to O–H and C–H overtones/combination bands and, for EGCG, include contributions from C=O, consistent with phenolic/lipid functional group absorptions.
- Overall, EGCG models showed lower performance than tea polyphenols, likely due to lower EGCG content relative to total polyphenols and compositional complexity (EGCG ~60% of total polyphenols).
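The metrics and percentages in the list above can be recomputed with a few lines of Python. This is a sketch under stated assumptions: RPD is taken in its common form (standard deviation of the reference values divided by RMSE), the sample standard deviation (ddof=1) is an assumption since the source does not say which form was used, and the function names and toy values are illustrative.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return (R, RMSE, RPD) for reference values and predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    r = np.corrcoef(y_true, y_pred)[0, 1]             # Pearson correlation coefficient
    rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))   # root mean square error
    rpd = np.std(y_true, ddof=1) / rmse               # assumed: sample SD / RMSE
    return r, rmse, rpd

def percent_change(before, after):
    """Relative change from `before` to `after`, in percent."""
    return 100.0 * (after - before) / before

def percent_reduction(n_selected, n_total):
    """Share of variables removed by selection, in percent."""
    return 100.0 * (1.0 - n_selected / n_total)

# Toy values, not data from the study.
r, rmse, rpd = regression_metrics([1, 2, 3, 4], [1.1, 1.9, 3.2, 3.8])
print(f"R={r:.3f} RMSE={rmse:.3f} RPD={rpd:.3f}")

# Figures quoted in the findings.
print(percent_change(3.345, 3.721))   # RPD gain for tea polyphenols after outlier removal
print(percent_change(1.796, 1.981))   # RPD gain for EGCG after outlier removal
print(percent_reduction(30, 1557))    # CARS: 30 of 1557 wavenumbers kept
print(percent_reduction(27, 1557))    # RF: 27 of 1557 wavenumbers kept
```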
Discussion
The study addressed the need for rapid, non-destructive quantification of tea polyphenols and EGCG during breeding by demonstrating that FT-NIR combined with machine learning can accurately predict both analytes. High model performance for tea polyphenols (R_p up to 0.978; RPD 4.833) indicates robust predictive capability suitable for screening and quality assessment. EGCG, although more challenging, was still predicted with good accuracy (R_p up to 0.944; RPD 3.049) when leveraging RF-based variable selection. The analysis of preprocessing effects showed that model performance depends on the synergy between preprocessing and algorithm (e.g., SG improved LS-SVR for polyphenols but not EGCG), underscoring the importance of tailored pipelines. Sensitive wavenumbers aligned with known NIR absorptions of O–H/C–H/C=O groups in phenolics and lipids, supporting the chemical interpretability of the models. Variable selection substantially reduced dimensionality (~98%), enhancing computational efficiency and potentially facilitating deployment on portable devices. Collectively, these results confirm that FT-NIR plus LS-SVR, augmented by CARS or RF, can enable rapid screening of tea genotypes with elevated polyphenol and EGCG content, benefiting breeding and quality control workflows.
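Since the discussion turns on how preprocessing interacts with the model, it may help to see the five transforms written out. The numpy sketch below is illustrative only: the authors used Unscrambler and Matlab, and the window length, polynomial order, edge padding in the Savitzky–Golay filter, and the finite-difference derivative are assumptions, not settings reported in the source. Spectra are rows of `X`.

```python
import numpy as np

def snv(X):
    """Standard normal variate: center and scale each spectrum individually."""
    return (X - X.mean(axis=1, keepdims=True)) / X.std(axis=1, ddof=1, keepdims=True)

def vector_normalize(X):
    """Vector normalization: scale each spectrum to unit Euclidean length."""
    return X / np.linalg.norm(X, axis=1, keepdims=True)

def msc(X, reference=None):
    """Multiplicative scatter correction against the mean (or a given) spectrum."""
    ref = X.mean(axis=0) if reference is None else reference
    out = np.empty_like(X, dtype=float)
    for i, x in enumerate(X):
        slope, intercept = np.polyfit(ref, x, 1)   # fit x ~ intercept + slope * ref
        out[i] = (x - intercept) / slope
    return out

def savgol_smooth(X, window=11, polyorder=2):
    """Savitzky-Golay smoothing via a least-squares polynomial kernel.

    `window` must be odd; edges are handled by repeating the end values
    (an assumed choice, other implementations treat edges differently).
    """
    half = window // 2
    A = np.vander(np.arange(-half, half + 1, dtype=float), polyorder + 1, increasing=True)
    kernel = np.linalg.pinv(A)[0]                  # weights giving the fitted center value
    padded = np.pad(X, ((0, 0), (half, half)), mode="edge")
    return np.array([np.convolve(row, kernel[::-1], mode="valid") for row in padded])

def first_derivative(X, spacing=1.0):
    """First derivative along the wavenumber axis by finite differences."""
    return np.gradient(X, spacing, axis=1)

# Hypothetical spectra: one base curve with per-sample offset and gain (scatter-like effects).
grid = np.linspace(0.0, 1.0, 200)
base = np.exp(-((grid - 0.4) ** 2) / 0.01) + 0.5 * np.exp(-((grid - 0.7) ** 2) / 0.02)
X = np.array([0.1 * k + (1.0 + 0.2 * k) * base for k in range(5)])
print(np.ptp(msc(X), axis=0).max())   # spread between spectra after MSC
```

In this toy case the spectra differ only by offset and gain, so MSC and SNV both collapse them onto a single curve; real spectra also differ in composition, which is the part the regression model needs to keep.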
Conclusion
FT-NIR spectroscopy (10,000–4000 cm−1) combined with machine learning enables rapid prediction of tea polyphenols and EGCG in tea leaves. The best-performing models were LS-SVR using SG-smoothed spectra for tea polyphenols (R_p=0.975, RPD=4.540) and LS-SVR on unprocessed spectra for EGCG (R_p=0.936, RPD=2.841). Variable selection further improved efficiency and predictive ability: CARS-selected 30 bands for tea polyphenols (R_p=0.978, RPD=4.833) and RF-selected 27 bands for EGCG (R_p=0.944, RPD=3.049). These approaches reduce input dimensionality by ~98%, facilitating fast, reliable screening of breeding materials for high-value bioactive compounds. Future work could extend to larger, more diverse sample sets, additional catechins, and on-line or field-deployable systems to enhance generalizability and practical adoption.
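For readers unfamiliar with LS-SVR, the model named in the conclusion, the sketch below shows the standard least squares support vector regression formulation, in which training reduces to solving one linear system. The RBF kernel, the values of `gamma` and `sigma`, and the toy sine data are assumptions for illustration; the source does not report the kernel or hyperparameters used.

```python
import numpy as np

def rbf_kernel(A, B, sigma):
    """Gaussian (RBF) kernel matrix between the rows of A and the rows of B."""
    sq = (A ** 2).sum(axis=1)[:, None] + (B ** 2).sum(axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.clip(sq, 0.0, None) / (2.0 * sigma ** 2))

def lssvr_fit(X, y, gamma=100.0, sigma=1.0):
    """Solve the LS-SVR system [[0, 1^T], [1, K + I/gamma]] [b; alpha] = [0; y]."""
    n = len(y)
    A = np.zeros((n + 1, n + 1))
    A[0, 1:] = 1.0
    A[1:, 0] = 1.0
    A[1:, 1:] = rbf_kernel(X, X, sigma) + np.eye(n) / gamma
    rhs = np.concatenate(([0.0], np.asarray(y, dtype=float)))
    sol = np.linalg.solve(A, rhs)
    return sol[0], sol[1:]                         # bias b, dual weights alpha

def lssvr_predict(X_new, X_train, b, alpha, sigma=1.0):
    """Predict with a fitted model: kernel-weighted sum of dual weights plus bias."""
    return rbf_kernel(X_new, X_train, sigma) @ alpha + b

# Toy one-dimensional regression problem (not spectral data).
X_train = np.linspace(0.0, 2.0 * np.pi, 40)[:, None]
y_train = np.sin(X_train[:, 0])
b, alpha = lssvr_fit(X_train, y_train)

X_new = np.linspace(0.5, 5.5, 25)[:, None]
max_err = np.max(np.abs(lssvr_predict(X_new, X_train, b, alpha) - np.sin(X_new[:, 0])))
print(f"max abs error on new points: {max_err:.4f}")
```

With only 27 or 30 selected wavenumbers as inputs, the kernel matrix is cheap to build, which is one reason reduced-variable LS-SVR models are attractive for the fast screening the conclusion describes.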
Limitations