Food Science and Technology
Cocoa bean fingerprinting via correlation networks
S. Kumar, R. N. D'souza, et al.
The study brings network science to food research to improve cocoa bean fingerprinting and authenticity assessment. Using high-throughput LC-MS profiles from 140 cocoa samples spanning three processing stages (unfermented, fermented, liquor) and eight countries, the authors ask whether correlation networks among samples can reveal and quantify classifiability by processing stage and country of origin. Traditional chemometric tools like PCA often separate unfermented from fermented samples but struggle to resolve multiple countries of origin when datasets are large and heterogeneous. By constructing sample–sample correlation networks and varying the correlation threshold, the work aims to expose hierarchical structure in the data—coarser separation by processing stage and finer separation by origin—and to quantify when each factor governs clustering. The approach offers dimensionality reduction and a standardized framework aligned with systems biology/network science to address complex, intertwined sources of variation in food metabolomics data.
Prior work in food fingerprinting has applied diverse analytical platforms and chemometric approaches to detect adulteration, assess authenticity, and classify origin. PCA and related unsupervised methods can distinguish unfermented from fermented cocoa but often fail to resolve many countries of origin simultaneously. Network science has transformed biological and medical data analysis through graph-based representations, including correlation networks, which are widely used in metabolomics to detect associations among variables or samples. The authors reference foundational applications of correlation-based analyses in metabolomics and their extensions in other fields (e.g., finance), as well as earlier cocoa studies showing origin-based polyphenolic, proteomic/peptidomic, and triacylglycerol fingerprints with limited success when many origins are included. A previous LDA-based cocoa origin classification study showed improved performance with Gaussian feature stability filtering; the current network approach helps rationalize that result mechanistically, because fine-grained (origin) separation requires high sample-to-sample correlations.
Data: 140 LC-MS positive ion mode samples collected over ~4 years, spanning three processing stages (Unfermented, Fermented, Liquor) and eight origins (Brazil, Cameroon, Ecuador, Ghana, Indonesia, Ivory Coast, Malaysia, Tanzania). Ivory Coast contributed most samples; Ghana the least.
Preprocessing: LC-MS data were processed with MZmine to obtain peak area lists with m/z values and retention times. Compounds were assigned names/formulas when possible under the ionization states [M+H], [M+2H], [M+3H], [2M+H]; otherwise they were labeled Unknown_m/z. Peak areas in each sample were normalized to sum to 100%, yielding relative abundance profiles. A matrix (samples × ~7000 compounds) was assembled, with compounds sorted by mean peak area across samples. Results were robust to the number of compounds retained, from the top 1000 to the top 7000; for computationally intensive steps, the top 1000 were used without changing the qualitative conclusions.
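The normalization and compound-selection steps can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the function names `normalize_to_percent` and `top_compounds_by_mean` and the toy peak areas are hypothetical, and the real input would be the peak area lists exported from MZmine.

```python
def normalize_to_percent(peak_areas):
    """Scale one sample's peak areas so they sum to 100 (relative abundance)."""
    total = sum(peak_areas)
    if total == 0:
        raise ValueError("sample has no signal to normalize")
    return [100.0 * area / total for area in peak_areas]


def top_compounds_by_mean(matrix, n_top):
    """Return column indices of the n_top compounds with the largest mean abundance."""
    n_samples = len(matrix)
    n_compounds = len(matrix[0])
    means = [sum(row[j] for row in matrix) / n_samples for j in range(n_compounds)]
    order = sorted(range(n_compounds), key=lambda j: means[j], reverse=True)
    return order[:n_top]


# Toy data (invented): 3 samples x 4 compounds of raw peak areas.
raw = [
    [500.0, 300.0, 150.0, 50.0],
    [20.0, 120.0, 40.0, 20.0],
    [80.0, 80.0, 20.0, 20.0],
]

profiles = [normalize_to_percent(row) for row in raw]
keep = top_compounds_by_mean(profiles, n_top=2)

# Reduced matrix: the same samples, restricted to the most abundant compounds.
reduced = [[row[j] for j in keep] for row in profiles]
print(keep)
print(reduced)
```

Normalizing before ranking means a compound's rank reflects its average share of each sample's total signal rather than the absolute intensity of any one run.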
Correlation computation: For each pair of samples α and β, Pearson correlation r_{αβ} was computed as covariance divided by the product of standard deviations of their LC-MS feature vectors. Spearman correlation was computed as the Pearson correlation of ranked feature vectors. Full pairwise correlation matrices (Pearson and Spearman) were produced.
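A minimal sketch of the two correlation measures, written in plain Python so the definitions stay visible. The helper names and the toy profiles are assumptions made for illustration; for tied values the ranking uses average ranks, which is a common convention but is not stated in the source.

```python
from math import sqrt


def pearson(x, y):
    """Covariance of x and y divided by the product of their standard deviations."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)  # the 1/(n-1) factors cancel


def average_ranks(values):
    """Rank values from 1 to n; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Pearson correlation of the ranked vectors."""
    return pearson(average_ranks(x), average_ranks(y))


def correlation_matrix(profiles, corr):
    """Symmetric sample-by-sample matrix of pairwise correlations."""
    n = len(profiles)
    matrix = [[1.0] * n for _ in range(n)]
    for a in range(n):
        for b in range(a + 1, n):
            matrix[a][b] = matrix[b][a] = corr(profiles[a], profiles[b])
    return matrix


# Toy profiles (invented): sample 1 is a monotone transform of sample 0,
# sample 2 is sample 0 reversed.
profiles = [[1.0, 2.0, 3.0, 4.0], [1.0, 4.0, 9.0, 16.0], [4.0, 3.0, 2.0, 1.0]]
pearson_matrix = correlation_matrix(profiles, pearson)
spearman_matrix = correlation_matrix(profiles, spearman)
```

In this toy case the Spearman correlation between samples 0 and 1 is 1 because their rankings agree, while the Pearson value is slightly lower because the relationship is not linear.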
Network construction: Nodes represent LC-MS samples; edges represent pairwise correlations. Thresholded correlation networks were generated by retaining edges with correlation ≥ a specified threshold. A family of networks was obtained by sweeping the threshold from 0 to 1. Both Spearman- and Pearson-based networks were analyzed; figures emphasize Spearman, with Pearson results in supplements.
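The threshold sweep can be illustrated with a small sketch. The 4 × 4 correlation matrix, the chosen thresholds, and the function names are hypothetical. Unlike the visualized networks, where isolated nodes may drop out, this sketch keeps every node and reports isolated ones as single-node components.

```python
def threshold_edges(corr, threshold):
    """Keep sample pairs (a, b), a < b, whose correlation is >= threshold."""
    n = len(corr)
    return [(a, b) for a in range(n) for b in range(a + 1, n)
            if corr[a][b] >= threshold]


def connected_components(n_nodes, edges):
    """Group nodes into connected components with a depth-first search."""
    neighbors = {i: set() for i in range(n_nodes)}
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)
    seen, components = set(), []
    for start in range(n_nodes):
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(neighbors[node] - component)
        seen |= component
        components.append(sorted(component))
    return components


# Toy correlation matrix (invented): samples 0-1 and 2-3 are tightly
# correlated within each pair and weakly correlated across pairs.
corr = [
    [1.0, 0.9, 0.3, 0.1],
    [0.9, 1.0, 0.4, 0.2],
    [0.3, 0.4, 1.0, 0.8],
    [0.1, 0.2, 0.8, 1.0],
]

sweep = {}
for threshold in (0.0, 0.35, 0.5, 0.95):
    edges = threshold_edges(corr, threshold)
    sweep[threshold] = (len(edges), connected_components(len(corr), edges))
    print(threshold, sweep[threshold])
```

Raising the threshold removes the weak cross-pair edges first, so the single component splits into two modules before dissolving into isolated nodes.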
Visualization: Networks (e.g., full network at threshold 0 with 140 nodes and 6833 edges for Spearman r>0) were visualized in Cytoscape using edge-weighted spring embedded or COSE layouts, positioning highly correlated nodes closer. Node color/shape encodes processing stage and country (legends varied by figure to emphasize either attribute). Heatmaps of correlation matrices with samples ordered by processing stage and then country were generated to reveal block structure.
Similarity metrics: For each thresholded network, two edge-based similarities were defined: sample-type similarity (fraction of edges connecting nodes of the same processing stage) and origin similarity (fraction of edges connecting nodes from the same country). Their evolution with threshold was compared to null models.
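As a minimal sketch, both similarity metrics reduce to one function applied with different node labels. The name `edge_similarity`, the toy edge list, and the label assignments are illustrative assumptions; returning `None` for an edgeless network is also a choice made here, not something specified in the source.

```python
def edge_similarity(edges, labels):
    """Fraction of edges whose two endpoints carry the same label."""
    if not edges:
        return None  # undefined when the network has no edges
    same = sum(1 for a, b in edges if labels[a] == labels[b])
    return same / len(edges)


# Toy network (invented): 4 samples, 3 edges.
stage = ["Unfermented", "Unfermented", "Fermented", "Fermented"]
country = ["Ghana", "Ecuador", "Ghana", "Ecuador"]
edges = [(0, 1), (2, 3), (0, 2)]

stage_similarity = edge_similarity(edges, stage)
origin_similarity = edge_similarity(edges, country)
print(stage_similarity, origin_similarity)
```

Here two of the three edges join samples of the same processing stage and one joins samples from the same country, so the stage similarity is 2/3 and the origin similarity is 1/3.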
Null models: Control networks were created by randomizing edge weights (correlations) while preserving the overall correlation distribution. An ensemble of 100 null networks provided mean and standard deviation for similarity metrics to assess significance.
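One way to realize such a null model is to permute the correlation values among sample pairs, which leaves the overall correlation distribution unchanged. The sketch below assumes that interpretation. The function names, the fixed random seed, and the toy weights are hypothetical.

```python
import random
from statistics import mean, stdev


def shuffled_weights(weights, rng):
    """Permute correlation values among pairs; the value distribution is unchanged."""
    pairs = list(weights)
    values = list(weights.values())
    rng.shuffle(values)
    return dict(zip(pairs, values))


def edge_similarity(weights, labels, threshold):
    """Fraction of retained edges (weight >= threshold) joining same-label nodes."""
    edges = [pair for pair, w in weights.items() if w >= threshold]
    if not edges:
        return None
    return sum(labels[a] == labels[b] for a, b in edges) / len(edges)


def null_similarity(weights, labels, threshold, n_null=100, seed=0):
    """Mean and standard deviation of the similarity over an ensemble of null networks."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_null):
        score = edge_similarity(shuffled_weights(weights, rng), labels, threshold)
        if score is not None:
            scores.append(score)
    return mean(scores), stdev(scores)


# Toy data (invented): 6 samples in two stages; within-stage pairs correlate
# at 0.8 and between-stage pairs at 0.2.
labels = ["Unfermented"] * 3 + ["Fermented"] * 3
weights = {
    (a, b): 0.8 if labels[a] == labels[b] else 0.2
    for a in range(6) for b in range(a + 1, 6)
}

observed = edge_similarity(weights, labels, threshold=0.5)
null_mean, null_sd = null_similarity(weights, labels, threshold=0.5)
print(observed, null_mean, null_sd)
```

Because shuffling keeps the number of edges above any threshold fixed, the null ensemble isolates the question of whether those edges fall between same-label samples more often than chance would place them there.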
Accuracy versus ideal networks: For each thresholded network, accuracy was quantified relative to two ideal networks: (1) an ideal processing-stage network (edges only between nodes of the same processing stage) and (2) an ideal origin network (edges only between nodes of the same country). True positives (TP), true negatives (TN), false positives (FP), and false negatives (FN) were counted by comparing observed edges to the ideal; accuracy was defined as the fraction of correctly classified pairs, (TP + TN) / (TP + TN + FP + FN).
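The comparison against an ideal network can be sketched as below. It assumes that every pair of nodes counts toward the total, so absent edges between differently labeled nodes are true negatives. The function name and toy inputs are hypothetical.

```python
def accuracy_vs_ideal(n_nodes, edges, labels):
    """Compare observed edges with an ideal network linking only same-label nodes."""
    observed = {tuple(sorted(edge)) for edge in edges}
    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    for a in range(n_nodes):
        for b in range(a + 1, n_nodes):
            in_ideal = labels[a] == labels[b]
            in_observed = (a, b) in observed
            if in_observed and in_ideal:
                counts["TP"] += 1
            elif not in_observed and not in_ideal:
                counts["TN"] += 1
            elif in_observed:
                counts["FP"] += 1  # edge present but labels differ
            else:
                counts["FN"] += 1  # same labels but edge missing
    total = sum(counts.values())
    return (counts["TP"] + counts["TN"]) / total, counts


# Toy network (invented): 4 samples, 3 observed edges.
stage = ["Unfermented", "Unfermented", "Fermented", "Fermented"]
country = ["Ghana", "Ecuador", "Ghana", "Ecuador"]
edges = [(0, 1), (2, 3), (0, 2)]

stage_accuracy, stage_counts = accuracy_vs_ideal(4, edges, stage)
origin_accuracy, origin_counts = accuracy_vs_ideal(4, edges, country)
print(stage_accuracy, stage_counts)
print(origin_accuracy, origin_counts)
```

In this toy case the network matches the ideal processing-stage network on 5 of 6 pairs but the ideal origin network on only 3 of 6, the kind of contrast the threshold sweep is meant to track.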
Majority vote inference: A simple unsupervised classifier inferred a node’s processing stage and country using a majority vote among its neighbors at given thresholds; prediction heatmaps across thresholds and mean prediction scores were reported to indicate regimes where each attribute is most reliably inferred.
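A minimal sketch of neighbor-based majority voting follows. The tie-handling rule (no prediction when the top labels tie or the node has no neighbors) and the scoring as the fraction of correctly predicted nodes are assumptions made for illustration; the source does not specify either.

```python
from collections import Counter


def majority_vote(node, edges, labels):
    """Predict a node's label from its neighbors' labels; None if tied or isolated."""
    neighbor_labels = [
        labels[b] if a == node else labels[a]
        for a, b in edges
        if node in (a, b)
    ]
    if not neighbor_labels:
        return None
    ranked = Counter(neighbor_labels).most_common()
    best_count = ranked[0][1]
    winners = [label for label, count in ranked if count == best_count]
    return winners[0] if len(winners) == 1 else None


def prediction_score(edges, labels):
    """Fraction of nodes whose majority-vote prediction matches the true label."""
    predictions = [majority_vote(node, edges, labels) for node in range(len(labels))]
    correct = sum(pred == true for pred, true in zip(predictions, labels))
    return correct / len(labels), predictions


# Toy network (invented): three Ghana samples form a triangle, loosely
# attached to a pair of Ecuador samples.
country = ["Ghana", "Ghana", "Ghana", "Ecuador", "Ecuador"]
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4)]

score, predictions = prediction_score(edges, country)
print(score, predictions)
```

Node 3 has one neighbor from each country, so the vote is tied and no prediction is made; the node's own label is never used when predicting it, which is what lets the same procedure be applied to a sample of unknown origin.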
- Correlation matrix structure: Heatmaps of Spearman correlations revealed clear block structures corresponding to processing stages. Unfermented samples form a distinct block, while Fermented and Liquor samples together form a larger block, indicating that Liquor samples are more similar to Fermented than to Unfermented samples. Country-level blocks are not visible at this global level.
- Thresholded networks (low to intermediate thresholds, ~0.1–0.5): Networks first separate into Unfermented versus (Fermented+Liquor) modules. As the threshold increases toward 0.5, three clear modules emerge corresponding to Unfermented, Fermented, and Liquor. This mirrors the heatmap block structure and aligns with the expectation that fermentation drives the major chemical changes.
- High thresholds (roughly 0.6–0.8 and above): As thresholds increase, the large stage-based components fragment into smaller modules enriched for single countries of origin. Same-country nodes cluster together increasingly often, revealing finer substructure governed by origin.
- Edge similarity trends: Sample-type similarity increases approximately linearly from low thresholds and saturates near 1 around threshold ~0.5, indicating dominance of processing-stage effects early. Origin similarity remains near null-model levels up to ~0.5, then rises rapidly toward 1 at higher thresholds, indicating that country effects require higher correlations to manifest.
- Accuracy relative to ideal networks: Accuracy versus threshold increases for both attributes. At low thresholds, networks are closer to the ideal processing-stage network; at higher thresholds, they are closer to the ideal origin network, quantitatively confirming the nested hierarchy (processing stage first, then origin).
- Network scale details: The full Spearman network (threshold >0) comprised 140 nodes and 6833 edges. As thresholds increase, edges and sometimes nodes are pruned, leading to fragmentation into country-enriched components.
- Practical classification: A majority vote among a node's network neighbors predicts processing stage correctly at intermediate and higher thresholds, while country predictions become accurate primarily at higher thresholds. This unsupervised strategy complements prior supervised LDA approaches and offers a mechanistic rationale for the success of Gaussian feature stability filtering: high sample-to-sample correlations reflect features that are stable across samples.
The findings demonstrate that correlation networks derived from untargeted LC-MS profiles encode a hierarchical organization of cocoa samples: coarse separation by processing stage at low to intermediate correlation thresholds, and finer country-of-origin separation at higher thresholds. This directly addresses the challenge that standard chemometric methods face in resolving multiple origins while confirming their strength in distinguishing fermentation status. The edge-similarity and accuracy analyses quantify the thresholds at which each attribute dominates, providing an interpretable, unsupervised framework for classification. Practically, the approach enables inference of unknown samples by examining their neighbors in networks at appropriate thresholds: intermediate thresholds for processing-stage determination and high thresholds for origin assignment. The network methodology thus offers a robust dimensionality reduction linked to classification performance and is broadly applicable to other food authenticity problems. The observed alignment with earlier LDA results via Gaussian feature stability highlights a mechanistic link between compound-level stability and sample-level correlations.
This work introduces a network-science framework for cocoa bean fingerprinting based on sample–sample correlation networks from LC-MS data. By sweeping correlation thresholds, the approach reveals a nested structure: processing-stage modules emerge at lower thresholds, while country-of-origin modules appear at higher thresholds. Quantitative metrics (edge similarity, accuracy versus ideal networks) and a simple majority-vote classifier confirm when and how each attribute becomes classifiable. The method is unsupervised, scalable, and complements traditional chemometrics, offering a path to improved authenticity and quality control assessments in cocoa and potentially other foods. Future directions include: (1) incorporating additional factors (variety, soil, climate, season, farming practices) to map finer hierarchical separations and their threshold regimes; (2) dissecting the roles of specific chemical classes (e.g., polyphenols, carbohydrates, peptides, primary/secondary metabolites) in driving stage versus origin separations; and (3) extending and adapting network tools to enhance interpretability and predictive performance for complex food metabolomics datasets.
Cluster separation by country is not perfect, particularly at very high thresholds, because other unmodeled factors (variety, soil, climate, terrain, harvest season, farming practices, and other metadata) influence LC-MS profiles and can introduce sub-modular structure. As thresholds increase, networks become sparse and may lose nodes/edges, limiting inference for some samples. The study focuses on positive ion mode LC-MS and analyzes processing stage and country as the primary factors; broader generalization may require integrating additional data modalities and covariates.