Predicting Bordeaux red wine origins and vintages from raw gas chromatograms

Food Science and Technology

M. Schartner, J. M. Beck, et al.

This study shows how gas chromatography and mass spectrometry can reveal the origins and vintages of Bordeaux red wines. Conducted by a team including Michael Schartner and Jeff M. Beck, it suggests that the chemical identity of a wine is a complex mosaic spread across many compounds rather than a matter of a few key ingredients.
Introduction

The study addresses whether the origin (estate/terroir) and vintage of Bordeaux red wines can be predicted from chemical data, specifically raw gas chromatograms, rather than relying on targeted quantification of selected compounds. Wine composition and perceived typicity depend on many factors (soil, climate, varietal blend, microbiology, and winemaking practices), and traditional targeted analyses may miss broader chemical patterns underlying typicity. The authors propose that wine identity may be encoded in global, potentially nonlinear patterns across many molecules. They therefore apply machine learning, including nonlinear dimensionality reduction and supervised classifiers, to raw GC data to test if terroir (estate) and vintage can be decoded and to assess whether information is concentrated in a few molecules or distributed across the chemical spectrum.

Literature Review

Prior work has used targeted compound measurements and various spectrometric or spectroscopic techniques combined with multivariate analysis to classify wines by variety, region, or authenticity. Methods include ICP-MS elemental fingerprinting, 1H NMR metabolomics, RP-HPLC/DAD, UV/Vis spectroscopy, untargeted LC-MS metabolomics, isotopic ratios, and absorbance-transmission and fluorescence excitation-emission matrix (A-TEEM). Some studies examined regional classification with GC/qTOF-MS or modeled sensory attributes from spectrofluorometric data. While these approaches can classify regions or vintages and identify key compounds, they often rely on manual feature extraction and targeted analyses, potentially overlooking distributed chemical information. In contrast, the present study emphasizes using raw GC chromatograms with machine learning to capture broader chemical signatures of terroir and vintage.

Methodology

Samples: 80 Bordeaux red wines from 7 estates (A–G). Six estates contributed 12 vintages each (1990, 1995, 1996, 1998, 1999, 2000, 2001, 2002, 2004, 2005, 2006, 2007); estate F contributed 8 vintages (1995, 1998, 2000, 2001, 2002, 2005, 2006, 2007), giving 6 × 12 + 8 = 80 wines. One 75 cL bottle per wine; all analyzed in a single batch (August 2018). Estates: A (Pomerol), B and C (St-Emilion) on the right bank; D and E (Pauillac), F (Margaux), G (Pessac-Léognan) on the left bank. Wines were aged 11–28 years in estate cellars before analysis.
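As a quick consistency check on the sample design, the following sketch rebuilds the list of (estate, vintage) pairs from the vintages given above and confirms the total count and the age range at the time of analysis. The variable names are illustrative and are not taken from the authors' repository.

```python
# Vintages as listed in the summary; estate F has a reduced set.
FULL_VINTAGES = [1990, 1995, 1996, 1998, 1999, 2000,
                 2001, 2002, 2004, 2005, 2006, 2007]
F_VINTAGES = [1995, 1998, 2000, 2001, 2002, 2005, 2006, 2007]
ANALYSIS_YEAR = 2018  # single analysis batch, August 2018

# One (estate, vintage) pair per analyzed bottle.
samples = [
    (estate, vintage)
    for estate in "ABCDEFG"
    for vintage in (F_VINTAGES if estate == "F" else FULL_VINTAGES)
]
ages = [ANALYSIS_YEAR - vintage for _, vintage in samples]

print(len(samples), min(ages), max(ages))  # 80 11 28
```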

GC data acquisition: Three GC-MS methods were used, yielding three chromatogram types per wine: esters (SPME-GC/MS), oak-flavor compounds (liquid/liquid extraction GC/MS), and off-flavor compounds (SBSE-GC/MS). The mass spectrometers operated in selected-ion monitoring (SIM) mode; total ion current (TIC) chromatograms were exported as CSV without signal alignment or baseline correction. Columns, temperature programs, and monitored ions followed the cited methods; quantification used internal standards and external calibration in an old Bordeaux red wine matrix. Because of fragmentation and the nature of the extractions, the chromatograms reflect a broad set of volatiles, not only the targeted compound classes.

Data preprocessing: For dimensionality reduction and supervised decoding, features (retention-time intensity points) were standard scaled (z-scored across samples per feature) within each chromatogram type. Analyses used either individual chromatogram types or a concatenation of the three into a meta-chromatogram.
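The preprocessing described above can be sketched as follows. This is a minimal illustration on synthetic data, not the authors' code: the chromatogram lengths (300, 400, 250 points) and the dictionary keys are assumptions made only for the example.

```python
import numpy as np


def zscore_features(X):
    """Standard-scale each column (one retention-time point) across samples."""
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    std[std == 0] = 1.0  # constant features become zero instead of dividing by zero
    return (X - mean) / std


rng = np.random.default_rng(0)
n_wines = 80

# Synthetic stand-ins for the three chromatogram types (rows = wines).
# The lengths are hypothetical; real chromatograms differ per GC method.
chromatograms = {
    "ester": rng.gamma(2.0, 1.0, size=(n_wines, 300)),
    "oak": rng.gamma(2.0, 1.0, size=(n_wines, 400)),
    "off_flavor": rng.gamma(2.0, 1.0, size=(n_wines, 250)),
}

# Scale within each chromatogram type, then concatenate along the feature axis.
scaled = {name: zscore_features(X) for name, X in chromatograms.items()}
meta = np.hstack([scaled[name] for name in ("ester", "oak", "off_flavor")])

print(meta.shape)  # (80, 950)
```

Scaling each type separately before concatenation keeps a chromatogram with large absolute intensities from dominating the combined feature vector.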

Dimensionality reduction: t-SNE (perplexity=30) and UMAP (n_neighbors=60) were applied to visualize structure in 2D; 3D embeddings revealed no additional structure. Both were implemented in Python using the scikit-learn and UMAP libraries.
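A minimal sketch of the t-SNE step, assuming scikit-learn's `TSNE` with the perplexity quoted above and otherwise default settings. The data are synthetic (seven hypothetical "estates", each with its own mean profile), and the fixed `random_state` is an assumption added for reproducibility. The UMAP embedding would come from the separate UMAP library and is not shown here.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
n_estates, wines_per_estate, n_features = 7, 12, 200

# Synthetic data: each estate gets its own mean profile plus per-wine noise.
centers = rng.normal(0.0, 3.0, size=(n_estates, n_features))
labels = np.repeat(np.arange(n_estates), wines_per_estate)
X = centers[labels] + rng.normal(0.0, 1.0, size=(labels.size, n_features))

# perplexity=30 follows the text; the number of samples must exceed it.
embedding = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)

print(embedding.shape)  # (84, 2)
```

Plotting `embedding` colored by `labels` would show whether samples from the same estate fall close together; changing `random_state` rotates and reshapes the layout without changing the underlying neighborhoods much.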

Supervised decoding: Estate and vintage classification used linear discriminant analysis (LDA) and logistic regression (LR) with default scikit-learn settings. Cross-validation was leave-7-out for estates (one wine per estate held out, 50 random splits) and leave-12-out for vintages (one wine per vintage held out per split). Significance was assessed with one-tailed t-tests against chance, Bonferroni-corrected.
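The cross-validation scheme above can be sketched as follows, again on synthetic data. Several details are assumptions rather than facts from the source: the t-test is run on per-split accuracies, and `n_comparisons` is a hypothetical number of tests used for the Bonferroni correction. The helper name `leave_one_per_class_out` is illustrative.

```python
import numpy as np
from scipy import stats
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis


def leave_one_per_class_out(X, y, n_splits=50, seed=0):
    """Hold out one random sample per class in each split; return per-split accuracy."""
    rng = np.random.default_rng(seed)
    classes = np.unique(y)
    accuracies = []
    for _ in range(n_splits):
        test_idx = np.array([rng.choice(np.flatnonzero(y == c)) for c in classes])
        train_mask = np.ones(len(y), dtype=bool)
        train_mask[test_idx] = False
        clf = LinearDiscriminantAnalysis()  # default scikit-learn settings
        clf.fit(X[train_mask], y[train_mask])
        accuracies.append(clf.score(X[test_idx], y[test_idx]))
    return np.array(accuracies)


# Synthetic, moderately separable data: 7 classes x 12 samples, 20 features.
rng = np.random.default_rng(0)
n_classes, per_class, n_features = 7, 12, 20
y = np.repeat(np.arange(n_classes), per_class)
centers = rng.normal(0.0, 1.0, size=(n_classes, n_features))
X = centers[y] + rng.normal(0.0, 2.0, size=(y.size, n_features))

accuracies = leave_one_per_class_out(X, y)
chance = 1.0 / n_classes  # 1/7, about 14%, for estates

# One-tailed test: is the mean per-split accuracy greater than chance?
result = stats.ttest_1samp(accuracies, popmean=chance, alternative="greater")
n_comparisons = 8  # hypothetical number of tests for the Bonferroni correction
p_corrected = min(1.0, result.pvalue * n_comparisons)

print(f"mean accuracy {accuracies.mean():.2f}, chance {chance:.2f}")
print(f"Bonferroni-corrected p = {p_corrected:.3g}")
```

For vintages the same function applies with vintage labels as `y`: each split then holds out one wine per vintage, and chance is 1/12, about 8%.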

Binning and feature selection (“survival of the fittest”): Each chromatogram was divided into N equal, consecutive bins (N=50 for main results; also N=5,10,20 in supplements). Iteratively, the bin whose removal least harmed (or most improved) decoding accuracy was removed, tracking performance versus remaining fraction of data to identify the most informative regions. Additional analyses trained classifiers on individual bins and on concatenations of top-k bins (k=1–4 per chromatogram type).
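The iterative elimination described above can be sketched as a greedy backward search over bins. This is an illustration under stated assumptions, not the authors' implementation: it uses LDA with 10 splits per evaluation, `np.array_split` for near-equal consecutive bins, and a small synthetic dataset in which only one bin (index 2) carries class information. All function names are hypothetical.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis


def cv_accuracy(X, y, n_splits=10, seed=0):
    """Mean LDA accuracy with one sample per class held out in each split."""
    rng = np.random.default_rng(seed)
    classes = np.unique(y)
    scores = []
    for _ in range(n_splits):
        test_idx = np.array([rng.choice(np.flatnonzero(y == c)) for c in classes])
        train = np.ones(len(y), dtype=bool)
        train[test_idx] = False
        clf = LinearDiscriminantAnalysis().fit(X[train], y[train])
        scores.append(clf.score(X[test_idx], y[test_idx]))
    return float(np.mean(scores))


def survival_of_the_fittest(X, y, n_bins, n_splits=10, seed=0):
    """Repeatedly drop the bin whose removal leaves the highest accuracy."""
    bins = np.array_split(np.arange(X.shape[1]), n_bins)  # consecutive bins
    alive = list(range(n_bins))
    history = []  # (bins remaining, accuracy after removal, removed bin)
    while len(alive) > 1:
        best_acc, best_bin = -1.0, None
        for candidate in alive:
            keep = [b for b in alive if b != candidate]
            cols = np.concatenate([bins[b] for b in keep])
            acc = cv_accuracy(X[:, cols], y, n_splits, seed)
            if acc > best_acc:
                best_acc, best_bin = acc, candidate
        alive.remove(best_bin)
        history.append((len(alive), best_acc, best_bin))
    return alive, history


# Synthetic data: 7 classes x 12 samples, 60 features in 6 bins of 10.
# Only features 20-29 (bin index 2) differ between classes.
rng = np.random.default_rng(1)
n_classes, per_class, n_features, n_bins = 7, 12, 60, 6
y = np.repeat(np.arange(n_classes), per_class)
X = rng.normal(0.0, 1.0, size=(y.size, n_features))
X[:, 20:30] += rng.normal(0.0, 3.0, size=(n_classes, 10))[y]

alive, history = survival_of_the_fittest(X, y, n_bins)
for n_left, acc, removed in history:
    print(f"removed bin {removed}: {n_left} bins left, accuracy {acc:.2f}")
print("surviving bin:", alive)
```

Using the same `seed` for every candidate means all candidates are scored on identical splits, so differences in accuracy reflect the bins rather than the random hold-out. Note that selecting bins and reporting accuracy on the same splits gives an optimistic estimate; a nested scheme would be needed for an unbiased one.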

Compound-based analysis: From the same data, areas for 32 compounds (16 esters, 13 oak-related, 3 off-flavor) were integrated and converted to concentrations using internal standards and calibration, forming 32-D vectors per wine. The same dimensionality reduction and decoding procedures were applied. Classifier weights and single-compound decoding performance were analyzed to assess the distribution of information across compounds.

Code and data availability: Data and analysis code are publicly available at https://github.com/mschart/wine_decoding.

Key Findings
  • Nonlinear embeddings (t-SNE, UMAP) of concatenated raw chromatograms produced clusters matching estates and separated right-bank (A, B, C) from left-bank (D, E, F, G) wines. The 2D arrangement reflected Bordeaux geography, including a north–south axis among left-bank estates (D/E north; G/F south). Right-bank estates were not ordered by geography, possibly due to close proximity (~7 km).
  • Estate classification from raw chromatograms achieved near-perfect accuracy. Best average: 99% correct with LDA on concatenated chromatograms; similar performance with oak or ester chromatograms alone; off-flavor chromatograms were lower (87%). Chance for estates: 14%.
  • Vintage classification was more difficult. Best average: 27% correct (LDA, oak), significantly above 8% chance (p<0.001). Esters and off-flavor chromatograms yielded near-chance performance in most settings.
  • Binning and survival-of-the-fittest showed that removing up to ~90% of bins did not degrade estate decoding; the best ~10% of data matched full-chromatogram performance. Informative bins did not necessarily coincide with largest peaks, implying contributions from low-abundance compounds. Training on top 3 bins per chromatogram type yielded 100% estate accuracy (vs 99% full), indicating redundancy.
  • For vintages, using only the most informative regions improved performance: up to 34% (top 3 bins across chromatogram types) and up to 50% when using only the most informative bins of the oak chromatogram.
  • Single-bin classifiers performed above chance for most bins for both estate and vintage, supporting that information is broadly distributed. PCA indicated high redundancy (90% variance explained by 20 components).
  • Compound-based analyses underperformed raw chromatograms for estate classification: logistic regression accuracy decreased from 95% to 78% (oak), 98% to 75% (ester), and 85% to 27% (off-flavor). Vintage decoding with compounds was mixed (raw chromatogram → compounds): oak 27%→23%; ester 7%→23%. The best compound-based result remained below the 27% obtained from oak chromatograms. Applying survival-of-the-fittest to compounds improved vintage decoding to 37%, still below the 50% obtained from chromatogram bins.
  • Classifier weights across chromatogram bins and across compounds were broadly distributed; most bins/compounds contributed, and several single compounds could yield up to ~40% estate accuracy alone, reinforcing that wine identity is encoded across many molecules rather than a few key markers.
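The PCA redundancy check mentioned in the findings (how many components are needed to reach 90% of the variance) can be sketched with a plain SVD. The data below are synthetic, built from five hypothetical latent factors plus small noise, so the resulting count illustrates the procedure and says nothing about the wine data.

```python
import numpy as np


def components_for_variance(X, target=0.90):
    """Smallest number of principal components whose cumulative variance >= target."""
    Xc = X - X.mean(axis=0)                  # center each feature
    s = np.linalg.svd(Xc, compute_uv=False)  # singular values, descending
    explained = s**2 / np.sum(s**2)          # variance fraction per component
    cumulative = np.cumsum(explained)
    k = int(np.searchsorted(cumulative, target) + 1)
    return k, cumulative


# Synthetic redundant data: 80 samples, 500 features driven by 5 latent factors.
rng = np.random.default_rng(0)
latent = rng.normal(size=(80, 5))
mixing = rng.normal(size=(5, 500))
X = latent @ mixing + 0.1 * rng.normal(size=(80, 500))

k, cumulative = components_for_variance(X, target=0.90)
print(f"{k} components explain {cumulative[k - 1]:.1%} of the variance")
```

A small `k` relative to the number of features indicates that many features carry overlapping information, which is the sense of "redundancy" used in the findings.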
Discussion

The findings demonstrate that raw GC chromatograms contain robust chemical signatures of terroir and estate identity that can be extracted using machine learning. Nonlinear dimensionality reduction recreated geographic relationships among Bordeaux estates, implying that GC captures composite effects of soil, climate, varietal blends, microbiology, and winemaking practices. Estate identity was predicted with near-perfect accuracy independent of vintage, while vintage was decoded above chance and improved substantially by focusing on the most informative chromatogram regions. Analyses of bin importance, classifier weights, PCA, and single-bin/compound decoding indicate that information is distributed across much of the chemical spectrum with significant redundancy; identity is not dominated by a few molecules. Compared to related studies using A-TEEM or GC/qTOF-MS with manual feature extraction, the raw-GC approach is simpler and cost-effective, yet highly discriminative for estates and moderately informative for vintages. The results suggest that combining complementary techniques might further improve joint estate and vintage classification and that GC-based models could complement human tasters for recognition tasks.

Conclusion

Raw GC chromatograms, analyzed with machine learning, can accurately reveal Bordeaux estate identity and provide above-chance information about vintage, with embeddings reflecting regional geography. Information relevant to identity is broadly distributed and redundant across the chromatogram; focusing on the most informative regions can match or exceed full-data performance and enhances vintage decoding. Traditional targeted-compound approaches underperform relative to raw chromatograms and require more manual effort. This work shows that estate and vintage signatures can be derived without manual peak selection or specialized high-cost instruments. Future work could expand to more regions and varieties, compare performance with expert human tasters, optimize chromatographic/ion-scanning parameters, and integrate complementary modalities (e.g., A-TEEM) to improve joint estate and vintage prediction.

Limitations
  • Scope limited to 80 wines from 7 Bordeaux estates and vintages from 1990–2007 (estate F with fewer vintages), potentially limiting generalizability to other regions, varieties, or broader sets of estates.
  • One bottle per wine and a single analysis batch (August 2018) may not capture within-estate bottle/batch variability.
  • The dataset was collected for another purpose; chromatograms (TIC) include signals from fragmented and non-targeted compounds, and no signal alignment or baseline correction was applied.
  • The orientation and spread of clusters in the low-dimensional embeddings depend on algorithm settings such as the random seed; right-bank estates did not reflect intra-bank geography despite clear inter-bank separation.
  • Improved vintage decoding after bin selection suggests that classifiers trained on full chromatograms may overfit.
  • The model’s performance across a wider variety of wines, and its comparison to human expert tasters, remain untested.