Application of Raman spectroscopy and Machine Learning algorithms for fruit distillates discrimination

Food Science and Technology

C. Berghian-Grosan and D. A. Magdas

This pilot study tests whether Raman spectroscopy combined with Machine Learning algorithms can distinguish fruit distillates by trademark, geographical origin, and botanical origin. Conducted by Camelia Berghian-Grosan and Dana Alina Magdas, it reports accuracies of 95.5% for trademark differentiation and 90.9% for geographical classification among regions of Transylvania.

Introduction
Fruit distillates are traditional alcoholic beverages in Central and Eastern Europe produced by distilling fermented fruits (e.g., plums, apples, pears, apricots). In Romania they are known as țuică, pălincă, or horincă and show a wide range of alcoholic strengths depending on product type. Their chemical composition is complex, influenced by botanical origin, geographical provenance, processing technologies, and aging/storage conditions, all of which impact quality. In Transylvania, knowledge of distillate production is traditional and continuously refined. To support high-quality production and detect fraudulent practices (e.g., misdeclared origin), fast, reliable, and cost-effective analytical methods are needed that are easy to apply and suitable for in-situ measurements. Vibrational spectroscopy, especially Raman, combined with advanced data analysis, is a promising candidate. The aim of this study was to test the potential of Raman fingerprinting, in conjunction with Machine Learning algorithms, for classifying fruit distillates by (i) botanical origin (fruit variety), (ii) geographical origin, and (iii) trademark/producer fingerprint.
Literature Review
Vibrational spectroscopy (IR and Raman) has been broadly used for quantitative determination of ethanol and/or methanol in alcoholic beverages. Raman spectroscopy is advantageous for high-water-content products due to relatively weak water bands in the fingerprint region. Given the large datasets generated by spectroscopic methods, advanced data processing is required. Raman spectroscopy coupled with Machine Learning has been successfully applied in diverse fields such as food analysis, bacteria identification, and medical diagnostics. Prior studies have used specific Raman bands (e.g., 883, 1050, 1456 cm⁻¹) for ethanol quantification and characterized ethanol/methanol vibrations and water–ethanol hydrogen bonding. Metals and volatile compounds (e.g., esters) can also influence spectral regions relevant to beverages.
Methodology
Materials and methods comprised three parts.
(1) Samples: 30 fruit distillates from 8 Romanian producers in total. The main set came from two processing companies (PC1, 5 samples: apples, apricots, pears, plums, quince; PC2, 6 samples: apples, pears, plums, quince) and three manufacturers (MF1, 5 samples: apricots, cherries, pears, plums, sour cherries; MF2, 4 samples: apples, apricots, plums; MF3, 7 samples: apples, grapes, plums). These five named producers account for 27 samples (5 + 6 + 5 + 4 + 7). Samples originated from four Transylvanian regions: Bistrita Nasaud (BN), Covasna (CV), Salaj (SJ), and Satu Mare (SM). A control set of three samples (2 plum, 1 pear) from small producers in the SJ region was added for geographical model testing, bringing the total to 30. Alcoholic strength was determined by GC-FID (PerkinElmer 990).
(2) Raman measurements: JASCO NRS-3300 with CCD detector (−69 °C), 785 nm diode laser, 600 lines/mm grating, and 20× Olympus objective. Calibration used the Si peak at 521 cm⁻¹. For each measurement, 4 mL of distillate was placed in a glass vessel; spectra were recorded with 100 s exposure and 3 accumulations. JASCO Spectra Manager was used to select the 120–1700 cm⁻¹ range prior to processing. For each sample, spectra from two points were averaged, followed by baseline subtraction and [0,1] normalization in OriginPro 2017 to mitigate fluorescence and make spectra comparable.
(3) Machine Learning: Pretreated spectra (120–1700 cm⁻¹) were analyzed with MATLAB R2018b Classification Learner. Algorithms included decision trees, discriminant analysis, support vector machines (SVM), k-nearest neighbors (KNN), and ensemble classifiers. Training/testing splits varied per task:
- Botanical origin across producers: training set 22 samples (representative of all producers); testing set 5 randomly selected samples.
- Producer (trademark) classification: used the same training/testing groups as the fruit-variety study.
- Alcoholic strength classification: training set 20 samples spanning 80% (1), 70% (1), 54% (1), 52% (3), 50% (6), 48% (8) v/v; testing set 7 samples: 52% (1), 50% (1), 48% (3), plus two others per figure.
- Botanical origin within each producer: for each producer, all available samples formed a training set; no external testing was done because too few samples of the same variety were available.
- Geographical origin: training set 22 spectra (same base set as above); testing set extended to 8 spectra (the 5 previously used plus the 3 control spectra), covering 2 from SM, 1 from BN, and 5 from SJ, for a total of 30 samples in the geographic study.
The ethics statement notes that the study involved no human participants or animals.
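The pretreatment chain described in part (2) can be sketched in a few lines. This is a minimal illustration in plain Python, not the authors' workflow, which used JASCO Spectra Manager and OriginPro 2017. The summary does not say which baseline model was applied, so the straight-line baseline through the first and last points is an assumption, and all function names and the toy numbers are hypothetical.

```python
from typing import List, Sequence, Tuple


def crop(shifts: Sequence[float], intensities: Sequence[float],
         lo: float = 120.0, hi: float = 1700.0) -> Tuple[List[float], List[float]]:
    """Keep only points whose Raman shift lies in [lo, hi] cm^-1."""
    kept = [(s, y) for s, y in zip(shifts, intensities) if lo <= s <= hi]
    return [s for s, _ in kept], [y for _, y in kept]


def average(spec_a: Sequence[float], spec_b: Sequence[float]) -> List[float]:
    """Point-by-point mean of two spectra recorded on the same shift axis."""
    return [(a + b) / 2.0 for a, b in zip(spec_a, spec_b)]


def subtract_linear_baseline(shifts: Sequence[float],
                             intensities: Sequence[float]) -> List[float]:
    """ASSUMPTION: baseline is a straight line through the two end points."""
    x0, x1 = shifts[0], shifts[-1]
    y0, y1 = intensities[0], intensities[-1]
    slope = (y1 - y0) / (x1 - x0)
    return [y - (y0 + slope * (x - x0)) for x, y in zip(shifts, intensities)]


def normalize_01(intensities: Sequence[float]) -> List[float]:
    """Min-max scaling so the spectrum spans [0, 1]."""
    lo, hi = min(intensities), max(intensities)
    if hi == lo:  # flat spectrum: nothing to scale
        return [0.0 for _ in intensities]
    return [(y - lo) / (hi - lo) for y in intensities]


def preprocess(shifts: Sequence[float], point_a: Sequence[float],
               point_b: Sequence[float]) -> Tuple[List[float], List[float]]:
    """Crop both spectra, average them, subtract the baseline, normalize."""
    x, ya = crop(shifts, point_a)
    _, yb = crop(shifts, point_b)
    y = subtract_linear_baseline(x, average(ya, yb))
    return x, normalize_01(y)


# Toy data only (not measured values): six shift positions, two spots.
shifts = [100.0, 120.0, 500.0, 883.0, 1700.0, 1800.0]
point_a = [9.0, 1.0, 2.0, 6.0, 3.0, 9.0]
point_b = [9.0, 1.0, 4.0, 8.0, 3.0, 9.0]
x, y = preprocess(shifts, point_a, point_b)
```

With the toy input, the points at 100 and 1800 cm⁻¹ fall outside the selected range and are dropped, and the strongest remaining point (placed at 883 cm⁻¹ here) is scaled to 1. A real fluorescence background is usually curved, so a polynomial or iterative baseline would be a more realistic substitute for the straight line.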
Key Findings
- Spectral regions with highest discrimination potential: 200–600 cm⁻¹ (region I) and 1200–1400 cm⁻¹ (region II), potentially reflecting influences from metals (Metal–O, Metal–C) and volatile compounds such as esters (C–O–C).
- Producer (trademark) classification: High separation among producers regardless of fruit variety. Best model: Ensemble (subspace KNN) with 95.5% accuracy (only one sample misclassified in the training confusion matrix). Test set of 5 samples was correctly classified for all included producers (PC1, PC2, MF1, MF3).
- Fruit variety classification across all producers: Poor performance; best accuracy 27.3% (Ensemble boosted trees), indicating no reliable cross-producer varietal discrimination.
- Alcoholic strength classification: Low differentiation across 6 ethanol concentration classes; best accuracy 60% (Ensemble subspace KNN). Due to low accuracy, no further test verification was performed. This supports that producer fingerprints are independent of ethanol concentration and are driven by minor components rather than major ethanol peaks.
- Botanical origin within producer: Good to excellent differentiation when controlling for producer effects. Accuracies: PC1 100% (fine Gaussian SVM), PC2 100% (medium Gaussian SVM), MF1 100% (fine Gaussian SVM), MF2 75%, MF3 57.1%.
- Geographical origin classification (BN, CV, SJ, SM): Best model Ensemble (subspace KNN) with 90.9% accuracy (two misclassifications in training). In testing (8 spectra), 7 of 8 were correctly attributed to their regions; one SJ sample misclassified as BN.
- Observed fluorescence in certain samples (e.g., PC2 plums, quince) affected spectral profiles but the ML approach still achieved high classification performance.
- Dominant Raman peaks correspond to ethanol (e.g., 883, 1050, 1456 cm⁻¹), but discrimination arises from subtler spectral differences linked to minor constituents and processing/storage effects.
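The best trademark and geographical models above are "subspace KNN" ensembles. The idea behind such an ensemble can be sketched as follows: each member is a nearest-neighbor classifier that sees only a random subset of the spectral channels, and the members' predictions are combined. This is a simplified plain-Python illustration, not MATLAB's implementation. The majority vote, the parameter names (`n_learners`, `subspace_dim`, `k`, `seed`) and their defaults, the producer labels "A" and "B", and the four-channel toy spectra are all assumptions made for the example.

```python
import random
from collections import Counter
from typing import List, Sequence


def knn_predict(train_x: Sequence[Sequence[float]], train_y: Sequence[str],
                query: Sequence[float], features: Sequence[int], k: int = 1) -> str:
    """k-nearest-neighbor vote using only the given feature indices."""
    def dist2(row: Sequence[float]) -> float:
        # Squared Euclidean distance restricted to the chosen channels.
        return sum((row[j] - query[j]) ** 2 for j in features)

    nearest = sorted(range(len(train_x)), key=lambda i: dist2(train_x[i]))[:k]
    return Counter(train_y[i] for i in nearest).most_common(1)[0][0]


def subspace_knn_predict(train_x: Sequence[Sequence[float]], train_y: Sequence[str],
                         query: Sequence[float], n_learners: int = 30,
                         subspace_dim: int = 2, k: int = 1, seed: int = 0) -> str:
    """Random-subspace ensemble: each learner uses a random channel subset."""
    rng = random.Random(seed)
    n_features = len(train_x[0])
    votes: Counter = Counter()
    for _ in range(n_learners):
        features = rng.sample(range(n_features), subspace_dim)
        votes[knn_predict(train_x, train_y, query, features, k)] += 1
    # ASSUMPTION: simple majority vote across ensemble members.
    return votes.most_common(1)[0][0]


# Toy "spectra" with four channels for two hypothetical producers.
train_x: List[List[float]] = [
    [0.1, 0.2, 0.1, 0.2],
    [0.2, 0.1, 0.2, 0.1],
    [0.9, 0.8, 0.9, 0.8],
    [0.8, 0.9, 0.8, 0.9],
]
train_y = ["A", "A", "B", "B"]
label_low = subspace_knn_predict(train_x, train_y, [0.15, 0.15, 0.15, 0.15])
label_high = subspace_knn_predict(train_x, train_y, [0.85, 0.85, 0.85, 0.85])
```

Restricting each learner to a subset of channels means that no single strong band, such as the dominant ethanol peaks, can dictate every vote, which fits the finding that discrimination comes from subtler spectral differences.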
Discussion
The study addressed whether Raman spectroscopy combined with Machine Learning can discriminate fruit distillates by producer (trademark), geographical origin, and botanical origin. Results show that processing and storage conditions impart a strong, reproducible spectral fingerprint enabling highly accurate producer identification (95.5% training accuracy and perfect prediction on the held-out test set). The poor performance of cross-producer botanical classification (27.3% accuracy) and the strong performance within each producer indicate that technological and storage influences dominate over raw material variation in the overall Raman fingerprint. The low success rate for alcoholic strength classification (60%) further demonstrates that discrimination is not driven by ethanol content or major ethanol bands, but by minor components and processing-related chemical signatures. Geographical origin classification within neighboring Transylvanian regions achieved 90.9% accuracy, with strong test-set generalization (7/8 correct), underscoring the method’s potential for origin verification even when producers differ between training and testing. Two spectral windows (200–600 and 1200–1400 cm⁻¹) contributed most to discrimination, likely due to signals from metal species and volatile esters shaped by raw materials, process type, and storage. Fluorescence affected certain samples but preprocessing and model choice mitigated these effects. Collectively, these findings support the feasibility of rapid, in-situ authentication and fraud detection in fruit distillates using Raman spectroscopy paired with ML, particularly for trademark verification and, to a slightly lesser extent, regional origin, while varietal identification is reliable only when producer-related factors are controlled.
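The accuracies quoted above are consistent with whole-sample counts on the small sets described in the Methodology, which is a useful reminder of how few samples stand behind each percentage. The short check below reproduces them. The counts 21 of 22 and 20 of 22 follow from the stated one and two misclassifications; the counts 6 of 22 and 12 of 20 are inferred from the set sizes and are assumptions, not figures given in the summary.

```python
def accuracy_pct(correct: int, total: int) -> float:
    """Accuracy as a percentage, rounded to one decimal place."""
    return round(100.0 * correct / total, 1)


# (label, correct, total, reported accuracy in percent)
checks = [
    ("trademark, 22-sample set", 21, 22, 95.5),           # one misclassified (stated)
    ("geographical origin, 22-sample set", 20, 22, 90.9),  # two misclassified (stated)
    ("fruit variety across producers", 6, 22, 27.3),       # count inferred (assumption)
    ("alcoholic strength, 20-sample set", 12, 20, 60.0),   # count inferred (assumption)
]

computed = {label: accuracy_pct(correct, total) for label, correct, total, _ in checks}
all_match = all(computed[label] == reported for label, _, _, reported in checks)

# The geographical test set: 7 of 8 spectra assigned to the right region.
geo_test_pct = accuracy_pct(7, 8)

for label, value in computed.items():
    print(f"{label}: {value}%")
print(f"geographical test set (7 of 8): {geo_test_pct}%")
```

On a 22-sample set, one additional misclassification shifts the accuracy by about 4.5 percentage points, so differences of a few points between models should be read with caution.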
Conclusion
This pilot study revealed a specific producer (trademark) fingerprint in fruit distillates detectable by Raman spectroscopy combined with Machine Learning. Trademark effects dominate varietal effects, reflecting strong influences of processing and storage on Raman spectra. Fruit variety classification is feasible within a single producer, where technological influences are held constant. The geographical origin model correctly assigned 7 of 8 test samples, including samples from producers not used in training. Overall, the approach offers a rapid, cost-effective way to verify fruit distillate trademarks and support authenticity control. Future work could expand sample sizes, producers, regions, and fruit varieties, and further investigate the specific minor components contributing to discriminatory spectral features.
Limitations
- Pilot-scale dataset with a limited number of samples (30 total) across 8 producers, which may constrain generalizability.
- For within-producer botanical classification, testing was not performed due to the low number of same-variety samples per producer.
- Alcoholic strength classification yielded low accuracy (60%), and the model was not verified with a test set owing to poor performance.
- Geographical origin assessment was confined to neighboring regions within Transylvania; broader geographic validation is needed.
- Some samples exhibited significant fluorescence that affected spectra (e.g., certain plums and quince distillates), potentially impacting classification in specific cases.