
Food Science and Technology
Pattern recognition based on machine learning identifies oil adulteration and edible oil mixtures
K. Lim, K. Pan, et al.
This study by Kevin Lim, Kun Pan, Zhe Yu, and Rong Hui Xiao applies machine learning to fatty acid profiles to identify plant oil types and to estimate the composition of oil mixtures, with a model that can be updated as new oil samples are surveyed.
Introduction
Plant oils have replaced animal fats in many diets, prompting extensive characterization of their fatty acid (FA) profiles. Despite large FA databases, identifying an oil type directly from its FA profile has remained elusive, and mixtures of oils can mimic the profile of higher-value oils, enabling adulteration and mislabeling. Reports indicate widespread adulteration in olive and avocado oils and public health incidents from contaminated oils, underscoring the need for accurate, scalable detection methods. The research question addressed is whether machine learning can discover discriminative FA patterns to identify oil types and quantitatively deconvolute compositions of mixtures for practical adulteration detection and labeling. The study aims to develop an end-to-end, generalizable ML framework that handles multiple oil types and complex mixtures, and adapts to geographic and biological variability.
Literature Review
Prior chemometric approaches (e.g., PCA for qualitative clustering; PLS1/PLS2 for quantitative modeling) typically handle only simple 2–3 oil mixtures and often require human interpretation. Extensions to multi-target predictions (PLS2) suffer degraded accuracy when generalized across many possible combinations. Studies have used FA profiles for classification and authentication of vegetable oils, including sesame and peanut oils, but quantitative generalization to multi-way mixtures is limited, with errors increasing markedly beyond binary blends. Other rapid analytical modalities (FT-NIR, LF-NMR, 1H NMR, excitation-emission fluorescence) combined with traditional chemometrics have shown promise in specific contexts but have not demonstrated broad generalizability across diverse oils and complex mixtures. Large FA databases exist, yet variability in study designs and incomplete coverage hinder robust inferential and predictive modeling. Neural networks have been applied to related problems (e.g., geographic origin of olive oil) but not coupled with large-scale simulation to infer blend composition across many oil types.
Methodology
- Samples: 19,583 pure oil samples across 10 edible oil types (groundnut GNO, high-erucic rapeseed HERSO, high-oleic sunflower HOSFO, low-erucic rapeseed LERSO, linseed LNO, low-oleic sunflower LOSFO, maize MZO, rice bran RBO, soybean SBO, sesame SSO) collected over 5 years from 30 factories in China to capture biological variance. Additional pure groundnut oils (n=56) from diverse global regions were collected for online-learning evaluation. Real-life blends were prepared from groundnut, maize, sunflower (both LOSFO and HOSFO), rice bran, and cottonseed oils at known proportions.
- Analytical chemistry: FAs derivatized to FAMEs (AOCS Ce 2-66) and quantified by GC-FID using a 100 m SP-2560 column with specified temperature program, nitrogen carrier, and FID settings. Peak normalization per ISO 12966-4:2015 produced relative abundances (≥18 FAs).
- Unsupervised modeling: Normalize FA data; reduce dimensionality with t-SNE (Barnes-Hut approximation; perplexity set according to dataset size). Fit a Gaussian Mixture Model (GMM) by EM with diagonal covariance, equal volume and shape; select the number of clusters by BIC. Map clusters back to the latent space; compute per-oil precision and per-cluster sensitivity.
- Simulation: Sample pure-oil FA profiles from GMM parameters; sample mixture proportions from a Dirichlet distribution; form mixtures by linear combination. Generate extensive simulated datasets, including ~100,000 for exploration and 12 million for supervised model training/validation/testing. Visualize mixture behavior in latent space for 2-, 3-, and 4-way combinations.
- Supervised deep learning: Inputs are MinMax-scaled FA relative abundances. Model is a sequential DNN (ReLU activations; kernel weights constrained to unit norm to mitigate overfitting). Training in Keras with TensorFlow backend; optimizer, learning rate, and batch size tuned to minimize loss and prevent overfitting via simulated validation/test sets. Model outputs per-oil composition estimates without prior knowledge of the mixture combination.
- Baseline chemometrics: PLS2 (NIPALS) with component number tuned using R2Y and Q2Y; bootstrapping to check overfitting (Q2Y p=0.05). PCA performed with centered/scaled inputs.
- Online learning: Detect distributional shifts/new subtypes via latent-space clustering or Mahalanobis distance to GMM centroids. Update DNN by augmenting training with simulated mixtures generated from newly surveyed pure oils to refine parameters without needing actual blend FA profiles.
- Evaluation: Report absolute error percentiles (50th/90th/95th/99th) on simulated independent test sets, stratified by mixture type (all 36 two-way; selected common three-way adulterations). Validate on blind real mixtures measured by GC-FID. Additional blind tests conducted in two independent batches; assess pre- and post-online update performance.
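The simulation step described in the list above (Gaussian pure-oil profiles, Dirichlet proportions, linear combination) can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the authors' code. The function names, the two oils, the four fatty acids, and all numeric parameters are hypothetical placeholders, and a single diagonal Gaussian per oil stands in for the fitted GMM subclusters.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one point on the simplex by normalizing Gamma variates."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def sample_pure_profile(means, stds, rng):
    """Draw an FA profile from a diagonal Gaussian, clip at zero,
    and renormalize so the relative abundances sum to 100%."""
    raw = [max(rng.gauss(m, s), 0.0) for m, s in zip(means, stds)]
    total = sum(raw)
    return [100.0 * r / total for r in raw]

def simulate_mixture(components, alpha, rng):
    """components: list of (means, stds), one pair per oil.
    Returns (proportions, mixed FA profile) for one simulated blend."""
    props = sample_dirichlet(alpha, rng)
    profiles = [sample_pure_profile(m, s, rng) for m, s in components]
    n_fa = len(profiles[0])
    # Linear combination: each FA abundance is the proportion-weighted sum.
    mixed = [sum(p * prof[i] for p, prof in zip(props, profiles))
             for i in range(n_fa)]
    return props, mixed

# Made-up parameters (% of total FA) for two hypothetical oils over four FAs.
OIL_A = ([11.0, 3.0, 45.0, 41.0], [0.5, 0.2, 1.5, 1.5])
OIL_B = ([6.0, 4.0, 80.0, 10.0], [0.4, 0.2, 2.0, 1.0])

rng = random.Random(0)
props, mixed = simulate_mixture([OIL_A, OIL_B], alpha=[1.0, 1.0], rng=rng)
```

Repeating `simulate_mixture` many times yields pairs of (mixed profile, true proportions) that could serve as training inputs and targets for a supervised model. The `alpha` vector controls how proportions are spread over the simplex; the value used here is an assumption, not one reported in the study.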
Key Findings
- Data structure and clustering: Ten oil types exhibited characteristic yet overlapping FA profiles; t-SNE suggested intra-type heterogeneity. GMM identified 16 subclusters corresponding to the 10 oil types with high agreement: per-oil precision ≈1.0 for most oils (0.998 for HERSO) and high per-cluster sensitivities (largely ≥0.99). Subclusters captured meaningful FA shifts (e.g., HOSFO elevated C18:1; HERSO elevated C22:1 vs LERSO; SBO subclusters showed coordinated C18:0/C18:1 changes; unique markers like C17:0 in some subclusters).
- Simulation insights: Latent-space patterns for mixtures showed smooth transitions from pure-oil epicenters as adulteration increased; patterns persisted from 2- to 4-way mixtures though epicenters became fuzzier with complexity.
- Deep learning accuracy (simulated tests): For 36 two-way mixtures, median absolute errors typically 0.4–1.5%; 90th percentile 1–5.8%; hardest cases (e.g., soybean–sesame) had 99th percentile ≈9%. PLS2 baselines had much larger errors (median 2.6–21.5%; 90th percentile 9.1–40.1%).
- Three-way mixtures (common groundnut adulterants): 50th percentile absolute error 1.4–1.8%; 90th percentile 4–5.4% across combinations such as GNO:MZO:RBO, GNO:MZO:SBO, GNO:LOSFO:RBO, GNO:LOSFO:SBO, GNO:MZO:LOSFO, GNO:RBO:SBO.
- Real-world validation: On 46 GNO blends with maize, sunflower, and rice bran, the model achieved median absolute error 1.35% and 90th percentile 2.7%, closely matching simulated tests.
- Online-learning improvements: For two independent blind-test batches of real GNO mixtures, pre-update median absolute errors were 3.4% and 4.75% for the major oil and 3.4% and 5.7% for the minor adulterant; 90th percentiles 7.54%/8.24% (major) and 7.54%/13.5% (minor). After updating with simulated mixtures from newly surveyed pure oils, median errors improved to 1.1% and 0.95% (major; 90th percentiles 2.58% and 2.01%) and 1.2% and 0.95% (minor; 90th percentiles 3.04% and 2.1%), representing ~52–75% error reduction. The approach also handled a groundnut–cottonseed mixture with small errors, and novel oils can be flagged via clustering or distance-based outlier detection.
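The percentile-based error figures reported above can be computed along these lines. This is a sketch in standard-library Python; the linear-interpolation percentile definition and the example numbers are assumptions for illustration, since the summary does not state which percentile convention was used.

```python
import math

def percentile(values, q):
    """Linear-interpolation percentile (q in [0, 100]) of a non-empty list."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lo = math.floor(pos)
    hi = math.ceil(pos)
    frac = pos - lo
    return ordered[lo] * (1 - frac) + ordered[hi] * frac

def absolute_error_summary(true_pct, pred_pct, qs=(50, 90, 95, 99)):
    """Absolute error between true and predicted proportions (in %),
    summarized at the requested percentiles."""
    errors = [abs(t - p) for t, p in zip(true_pct, pred_pct)]
    return {q: percentile(errors, q) for q in qs}

# Hypothetical true vs. predicted major-oil content (%) for five blends.
true_pct = [90.0, 80.0, 70.0, 60.0, 50.0]
pred_pct = [91.0, 78.0, 70.5, 63.0, 50.0]
summary = absolute_error_summary(true_pct, pred_pct)
```

Reporting the 90th to 99th percentiles alongside the median shows how the model behaves on its worst cases, which matters more for adulteration screening than the average error does.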
Discussion
The study demonstrates that machine learning can extract discriminative FA patterns that both identify oil types and quantify compositions in complex mixtures, directly addressing the challenge of adulteration detection from FA profiles alone. Unsupervised GMM revealed meaningful substructure within oil types, reflecting biological variation (e.g., breeding, geography) and informing a realistic simulation framework. Coupling these simulations with a supervised DNN enabled an end-to-end, generalizable predictor that outperformed traditional chemometrics across all tested scenarios, including more complex three-way mixtures. The close agreement between simulated and real-world performance supports the robustness of the simulation-driven training paradigm. Online learning further addresses domain shift due to geographic and temporal variability, enabling continual model refinement without requiring labeled mixture data. These advances have practical significance for food safety, quality control, pricing, and accurate labeling in the edible oil industry, where economic incentives drive adulteration.
Conclusion
This work introduces a unified ML framework that: (1) uncovers intra- and inter-type FA pattern structure across ten edible oils via GMM; (2) leverages these patterns to simulate vast, realistic mixture datasets; and (3) trains an end-to-end DNN to accurately quantify oil mixture compositions, significantly outperforming PLS-based chemometrics. The model generalizes to complex mixtures, aligns with real GC-FID measurements, and can be continuously improved through online learning with newly surveyed oils. Future directions include expanding to additional “new-world” oils and oil types, integrating with rapid analytical platforms (FT-NIR, LF-NMR, 1H NMR, fluorescence), enhancing outlier and novelty detection, and establishing industry-wide standards for quality assurance and labeling.
Limitations
- Biological and geographic variability means a perfect, static training dataset is unattainable; newly encountered distributions can reduce purity estimates until online updates are applied.
- Model performance degrades modestly as mixture complexity increases; extremely high-order blends are unlikely in practice but would be harder to resolve.
- Online update procedure assumes new samples belong to existing oil categories; fully novel oils require additional detection (clustering or distance-based outlier detection) before incorporation.
- Data and code access are restricted (available upon reasonable request with permissions), which may limit external replication.
- The approach is validated on GC-FID FA profiles; performance with other analytical modalities requires adaptation and validation.
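The distance-based outlier detection mentioned above could look something like the following sketch, which computes the Mahalanobis distance from a sample to each cluster centroid under a diagonal covariance and flags the sample when even the nearest cluster is too far away. The cluster parameters, the threshold, and the function names are hypothetical; the summary does not give the values or decision rule used in the study.

```python
import math

def mahalanobis_diag(x, mean, var):
    """Mahalanobis distance for a diagonal covariance (per-feature variances)."""
    return math.sqrt(sum((xi - mi) ** 2 / vi
                         for xi, mi, vi in zip(x, mean, var)))

def nearest_cluster(x, clusters):
    """clusters: dict mapping name -> (mean, var).
    Returns (name, distance) of the closest cluster."""
    scored = ((name, mahalanobis_diag(x, m, v))
              for name, (m, v) in clusters.items())
    return min(scored, key=lambda item: item[1])

def is_novel(x, clusters, threshold):
    """Flag a sample whose distance to every known cluster exceeds threshold."""
    _, distance = nearest_cluster(x, clusters)
    return distance > threshold

# Made-up centroids and variances for two hypothetical oil clusters (4 FAs).
CLUSTERS = {
    "oil_A": ([11.0, 3.0, 45.0, 41.0], [0.25, 0.04, 2.25, 2.25]),
    "oil_B": ([6.0, 4.0, 80.0, 10.0], [0.16, 0.04, 4.0, 1.0]),
}
THRESHOLD = 4.0  # assumed cutoff, would need calibration on real data

known_sample = [11.2, 3.1, 44.0, 41.7]
odd_sample = [20.0, 10.0, 30.0, 40.0]
name, dist = nearest_cluster(known_sample, CLUSTERS)
```

A sample flagged by `is_novel` would be routed to review as a possible new oil type or subtype rather than being fed directly into the online update, which assumes the sample belongs to an existing category.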