Food Science and Technology
Data fusion and multivariate analysis for food authenticity analysis
Y. Hong, N. Birse, et al.
The study addresses the challenge of authenticating salmon geographical origin and distinguishing production methods (wild vs. farmed) amid rising consumption, complex global supply chains, and documented mislabeling in seafood markets. Single analytical platforms often fall short in achieving robust, high-accuracy geographical traceability due to environmental, dietary, and biological variability affecting salmon composition. The research aims to evaluate whether combining lipidomic (REIMS) and elementomic (ICP-MS) datasets via mid-level data fusion, coupled with multivariate/machine-learning models, can accurately and robustly classify salmon provenance and production method using a large, well-curated sample set.
Prior authenticity approaches have included DNA barcoding and ddPCR for species identification and quantification in processed products, but these do not directly address geographical origin. MS-based methods (e.g., DART-HRMS, LC-HRMS) and spectroscopic techniques (NIR) combined with chemometrics have been explored for wild vs. farmed discrimination and for origin classification (e.g., Norway vs. Chile), yet often require lengthy sample prep and have limited accuracy in geographic traceability. REIMS enables rapid, in situ lipidomic profiling and has shown strong performance in fish analyses. ICP-MS is widely used for elemental profiling to determine geographical origin across foods (rice, tea, honey). Data fusion strategies (low-, mid-, high-level) have improved classification in various food matrices (meat quality, fresh vs. frozen fish) but, prior to this work, ICP-MS and REIMS had not been combined with data fusion and multivariate analysis for salmon origin and production method authentication.
Samples: 522 salmon samples of known provenance were collected from Alaska (n=99, wild sockeye), Scotland (n=183, farmed Atlantic), Norway (n=100, farmed Atlantic), and Iceland (n=140; both farmed Atlantic and wild). Six instrumental replicates per sample were acquired across 2020–2022. An additional 17 retail samples from UK supermarkets (9 Scotland, 7 Alaska, 1 labelled Scotland and/or Norway) were used for external evaluation (six replicates each). Samples were stored at −18 °C and thawed before analysis.
REIMS (lipidomics): Electrosurgical dissection using an Erbe VIO50C (autocut, 30 W) with a Waters REIMS source coupled to a Waters Xevo G2-XS QToF. Negative ion mode, m/z 100–1200, 0.5 s/scan; sodium formate calibration (FWHM 15,000 at m/z 600); Leucine Enkephalin lockmass (m/z 554.2615). Data acquired in MassLynx; processed in AMX and SIMCA. Preprocessing included background subtraction with total ion count threshold (<1×10^5), lockmass correction, and 0.2 Da binning, yielding ~5500 variables per sample. Chemometric modeling used PCA, PCA-LDA, PLS-DA, and OPLS-DA (mean-centering, Pareto scaling); S-plots and VIP for marker selection; tentative IDs against LipidMaps following Schymanski confidence tiers.
ICP-MS (elementomics): Freeze-dried tissue (100 mg) digested with HNO3 and H2O2, microwave-assisted digestion (Mars 6) with controlled temperature profile, diluted to mass. Analysis on Agilent 7850 (SQ) and 8900 (TQ) ICP-MS with MicroMist nebulizer and SPS4 autosampler. MassHunter 5.1 acquisition; data processing with Agilent Online ICP-MS. Internal standard Rh (10 mg/L). Accuracy checked with CRM RM8414 and routine controls. Elements targeted initially included Li–U; after filtering (LOD and excessive concentrations), 20 elements were retained (Li, B, Al, V, Cr, Mn, Fe, Co, Ni, Cu, Zn, As, Se, Rb, Sr, Nb, Mo, Cd, Cs, Ta). Data normalized using a CRM and min–max scaling for heatmaps.
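As a rough illustration of two of the REIMS preprocessing steps named above, the sketch below bins a spectrum into 0.2 Da bins over m/z 100–1200 (which gives 1100 / 0.2 = 5500 variables, matching the ~5500 reported) and applies mean-centering with Pareto scaling. The study performed these steps in AMX and SIMCA, so the function names, the half-open bin convention, and the choice to sum intensities per bin are assumptions made here for illustration; background subtraction and lockmass correction are not shown.

```python
import numpy as np

MZ_MIN, MZ_MAX, BIN_WIDTH = 100.0, 1200.0, 0.2
N_BINS = int(round((MZ_MAX - MZ_MIN) / BIN_WIDTH))  # 1100 / 0.2 = 5500 bins


def bin_spectrum(mz, intensity):
    """Sum intensities into fixed-width m/z bins over [MZ_MIN, MZ_MAX).

    Assumption: peaks outside the acquisition range are discarded.
    """
    mz = np.asarray(mz, dtype=float)
    intensity = np.asarray(intensity, dtype=float)
    keep = (mz >= MZ_MIN) & (mz < MZ_MAX)
    idx = np.floor((mz[keep] - MZ_MIN) / BIN_WIDTH).astype(int)
    idx = np.clip(idx, 0, N_BINS - 1)  # guard against floating-point edge cases
    return np.bincount(idx, weights=intensity[keep], minlength=N_BINS)


def pareto_scale(X):
    """Mean-center each column, then divide by the square root of its standard deviation."""
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)
    scale = np.sqrt(X.std(axis=0, ddof=1))
    scale[scale == 0] = 1.0  # constant columns stay at zero instead of dividing by zero
    return centered / scale
```

Pareto scaling sits between no scaling and unit-variance scaling: it reduces the dominance of intense peaks without inflating low-intensity noise as much as full autoscaling would.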
Data fusion and modeling: Low-level (raw concatenation) and mid-level (feature-level) data fusion were evaluated. For mid-level fusion, PCA was used to extract principal components from each platform: to retain ~85% variance, 8 PCs from ICP-MS and 226 PCs from REIMS were selected, yielding 234 fused variables. Unsupervised PCA assessed grouping (R^2 and Q^2 metrics). Supervised models tested: k-NN, LDA, PLS-DA, OPLS-DA, SVM, and Random Forest. Five-fold cross-validation (80/20 splits, repeated across folds) was applied. Model development and statistics were implemented in R (ggplot2, ggpubr, caret, MASS, kknn, randomForest, ropls, kernlab).
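The mid-level fusion step described above can be sketched as follows: run PCA on each platform separately, keep the fewest components whose cumulative explained variance reaches the target (~85% in the study), and concatenate the resulting scores column-wise. This is a minimal NumPy sketch, not the authors' R implementation; the function names are hypothetical, the input blocks are assumed to be already scaled with rows aligned to the same samples, and the data below are synthetic random numbers used only to show the shapes involved.

```python
import numpy as np


def pca_scores(X, variance_target=0.85):
    """Return PCA scores for the fewest components reaching the variance target."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                      # mean-center each variable
    U, s, _ = np.linalg.svd(Xc, full_matrices=False)
    explained = s**2 / np.sum(s**2)              # variance fraction per component
    # First index where cumulative variance >= target, converted to a count.
    k = int(np.searchsorted(np.cumsum(explained), variance_target)) + 1
    k = min(k, len(s))
    return U[:, :k] * s[:k]                      # scores = U scaled by singular values


def mid_level_fusion(blocks, variance_target=0.85):
    """Concatenate per-platform PCA scores column-wise (same samples in every block)."""
    return np.hstack([pca_scores(B, variance_target) for B in blocks])


# Synthetic stand-ins: 30 samples, 20 "elements" and 200 "m/z bins" (not real data).
rng = np.random.default_rng(0)
icp_ms = rng.normal(size=(30, 20))
reims = rng.normal(size=(30, 200))

fused = mid_level_fusion([icp_ms, reims])
print(fused.shape)  # (n_samples, retained ICP-MS PCs + retained REIMS PCs)
```

In the study this procedure gave 8 ICP-MS PCs and 226 REIMS PCs, hence 234 fused variables. When such features feed a cross-validated classifier, fitting the PCA on the training folds only is the usual way to avoid information leaking from held-out samples; the source does not state how this was handled.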
- REIMS lipidomics identified 18 candidate lipid biomarkers differentiating the five salmon groupings (unsaturated fatty acids: FA 7:1, 15:1, 18:3, 20:5, 22:6, 22:1; branched/saturated FAs: 16:0, 18:2, 18:1, 20:1; N-acyl amines: NA 7:0, 14:2, 20:2; primary amides; PG 34:2; PC 37:6; TG 57:11). Eight representative lipids (FA 15:1, 18:3, 20:5, 22:6, 22:1, 18:2, 18:1, NA 7:0) recurrently contributed to differentiation across ≥3 groups.
- Lipid trends: Wild salmon (Alaska, Icelandic wild) showed higher levels of the omega-3 fatty acids EPA (FA 20:5) and DHA (FA 22:6), together with higher FA 22:1, and lower ALA (FA 18:3). Farmed salmon (Norway, Scotland, Icelandic farmed) exhibited higher FA 18:1, 18:2, and 18:3, likely reflecting plant oil-rich feeds.
- REIMS classification performance: PCA-LDA achieved 100% identification accuracy in leave-20%-out cross-validation on the 522-sample set. Among the 17 retail samples, three Scottish farmed samples appeared as outliers; traceability checks attributed these to analytical errors. Overall retail identification success with REIMS alone was 82.4%.
- ICP-MS elemental profiling: After filtering, 20 elements showed significant intergroup differences (Kruskal–Wallis p<0.001 for all listed elements). Nine elements (Li, B, V, Fe, Co, Zn, Se, As, Cd) were identified as key markers by OPLS-DA. Trends included higher Fe, Zn, Se, and Cd in wild salmon; Icelandic (wild and farmed) showed higher As; zinc higher in Alaska, Icelandic wild, and Icelandic farmed than in Scotland/Norway; multiple element pairwise comparisons reported (e.g., Li not significantly different between Alaska and Icelandic farmed, p=0.28).
- ICP-MS classification performance: Five-fold CV accuracy of 96.9% among the five groups using the full elemental set. On retail samples, 11/17 origins were correctly identified (64.7%); all misclassified samples were wild Alaskan salmon.
- Data fusion results: PCA on the low-level fused data captured 90–95% of the variance with high Q^2; PCA on the mid-level fused data (226 REIMS PCs + 8 ICP-MS PCs) yielded R^2=1.00 and Q^2=0.98. Mid-level fusion improved robustness and reduced computation time.
- Supervised models on mid-level fusion: LDA, PLS-DA, OPLS-DA, and RF achieved 100% cross-validation accuracy; SVM reached 98.6%; k-NN was lowest at 85.5%.
- External validation (17 retail samples, six replicates each; 102 predictions): PLS-DA and OPLS-DA achieved 100% accuracy at the replicate level, with all 102 replicates classified as Alaska or Scotland consistent with their labels. At the sample level, 16/17 were definitively correct; the remaining sample, labelled "Scotland and/or Norway", was assigned to Scotland, which is consistent with its ambiguous label.
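The retail figures above mix replicate-level and sample-level counts, so a small sketch may help relate them. The source does not say how replicate predictions were combined into a sample-level call; the majority vote below is an assumption, and the sample IDs and predictions are hypothetical. The final lines reproduce the arithmetic behind the reported retail figures, assuming the 82.4% for REIMS corresponds to 14 of 17 samples (the three outlier samples counted as misses).

```python
from collections import Counter


def sample_level_calls(replicate_calls):
    """Map each sample ID to its most frequent replicate prediction (majority vote)."""
    return {
        sample: Counter(calls).most_common(1)[0][0]
        for sample, calls in replicate_calls.items()
    }


def accuracy(predicted, truth):
    """Fraction of samples whose predicted label matches the known label."""
    hits = sum(predicted[s] == truth[s] for s in truth)
    return hits / len(truth)


# Hypothetical replicate predictions for three retail samples (six replicates each).
replicate_calls = {
    "retail_01": ["Scotland"] * 6,
    "retail_02": ["Alaska"] * 5 + ["Iceland wild"],
    "retail_03": ["Scotland"] * 4 + ["Norway"] * 2,
}
truth = {"retail_01": "Scotland", "retail_02": "Alaska", "retail_03": "Scotland"}

calls = sample_level_calls(replicate_calls)
print(calls)
print(f"sample-level accuracy: {accuracy(calls, truth):.1%}")

# Arithmetic behind the reported retail figures.
print(17 * 6)            # replicate-level predictions in the retail set
print(f"{14 / 17:.1%}")  # REIMS alone (assumed 14 of 17 samples)
print(f"{11 / 17:.1%}")  # ICP-MS alone (11 of 17 samples)
```

With an even number of replicates a tie is possible; this sketch resolves ties by first occurrence, which a real workflow would likely replace with an explicit "inconclusive" outcome.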
Combining lipidomic and elementomic profiles through mid-level data fusion with multivariate learning robustly addresses the challenge of salmon origin and production method authentication. While single-platform approaches performed strongly (REIMS CV=100%) or reasonably (ICP-MS CV=96.9%), each struggled with retail samples due to real-world variability and potential analytical artifacts. Fusion retained complementary information from both platforms, substantially enhancing classification robustness and generalizability. The fused models (PLS-DA and OPLS-DA) not only reached perfect cross-validation accuracy but also accurately classified all retail replicates and 16/17 retail samples at the sample level despite different processing, storage, and packaging histories. Biological and feed-related differences in fatty acid profiles and environmental/geochemical signatures captured by elemental profiles underpin the discriminative power. The approach provides a science-based, scalable framework to mitigate mislabeling, support supply chain verification, and inform regulatory and industry stakeholders.
A dual-platform strategy integrating REIMS lipidomics and ICP-MS elementomics via mid-level data fusion, coupled with supervised multivariate models (PLS-DA/OPLS-DA), delivers highly accurate and robust authentication of salmon geographical origin and production method. The study identified 18 lipid and nine elemental markers contributing to discrimination and demonstrated superior performance of fused models over single-platform approaches, including on challenging retail samples. This workflow is broadly applicable to other food authenticity problems. Future work should expand geographical and species coverage, further validate on larger and more diverse retail sets, refine marker identification (including structural confirmation of lipid isomers/branching and arsenic speciation), and explore high-level fusion/ensemble decisions and domain adaptation to accommodate processing-related variability.
Biomarker identifications were largely tentative (based on accurate mass and database matching without full structural elucidation for double-bond positions/branching). Using only nine elemental markers was insufficient for accurate standalone origin prediction of test samples. Retail validation set size was limited (n=17 samples), and one sample had ambiguous labeling (Norway and/or Scotland). Differences in species (Alaskan wild sockeye vs. Atlantic salmon) and processing histories may introduce confounding factors. Although cross-validation and replication were extensive, broader external validation across additional origins, seasons, and supply chains is needed to assess generalizability.