Machine learning prediction of the degree of food processing

G. Menichetti, B. Ravandi, et al.

FoodProX, a machine learning algorithm developed by Giulia Menichetti, Babak Ravandi, Dariush Mozaffarian, and Albert-László Barabási, estimates that over 73% of the US food supply is ultra-processed. The research links ultra-processed food consumption to a range of health risks and argues that accessible information on food processing is needed to improve population health.

Introduction

The study addresses the challenge of defining and quantifying food processing, which has been linked to adverse health outcomes. Traditional dietary guidance focuses on food categories, while newer evidence and guidelines highlight the separate health effects of processing. The widely used NOVA system categorizes foods into four classes, but its qualitative, expert-based nature leads to inconsistencies, limited coverage, difficulty classifying composite foods, and poor reproducibility. Furthermore, most risk is concentrated in the broad NOVA 4 ultra-processed category, which hinders analysis of gradations in processing and practical guidance. The authors propose a quantitative, scalable approach using machine learning on nutrient profiles—regulated and widely available—to objectively predict and grade the extent of processing, enabling improved research and consumer decision-making.

Literature Review

Prior epidemiological and clinical studies associate higher intake of ultra-processed foods (NOVA 4) with increased risks of obesity, coronary heart disease, diabetes, cancer, depression, and short-term weight gain. NOVA has been instrumental in policy and cohort studies but suffers from qualitative definitions, limited coverage (unique class assigned to only ~35% of USDA items), challenges with mixed foods, low inter-expert consistency, and an inability to differentiate degrees within NOVA 4. Expanding food description ontologies and calls from FAO for AI-enabled objective metrics underscore the need for reproducible, comparable measures. The literature also indicates a shift from food security to nutrition security, emphasizing healthful foods.

Methodology

The authors developed FoodProX, a multi-class random forest classifier trained on nutrient concentrations per 100 g from the USDA FNDDS 2009–2010 dataset with NOVA manual labels. Inputs include nutrient panels of varying sizes (12 FDA-mandated nutrition facts nutrients, 58–62 commonly documented nutrients, up to 99 including flavonoids). Nutrient values were log-transformed. The model outputs a probability vector {p1, p2, p3, p4} for NOVA classes 1–4, with the highest probability determining the class. Performance was evaluated via 5-fold stratified cross-validation, reporting AUC and precision-recall metrics, with SMOTE used optionally to address class imbalance and an ensemble of five classifiers to improve generalization.

To address the limitations of discrete classes, the authors defined a continuous food processing score FPro for each food k as FPro_k = (1 − p1_k + p4_k)/2, representing the projection between minimally processed (NOVA 1) and ultra-processed (NOVA 4). They also defined an individual Food Processing Score (iFPro), weighting each consumed item’s FPro by its caloric (iFPro_wc) or gram (iFPro_wG) contribution to a person’s diet.

Robustness analyses assessed FPro stability across FNDDS cycles (2009–2010 vs 2015–2016) despite nutrient variability, showing minimal changes in FPro. The impact of additive information was tested using Open Food Facts (233,831 records), comparing FoodProX trained on nutrition facts alone versus nutrition facts plus additive counts.

For health associations, an Environment-Wide Association Study (EWAS) was conducted on NHANES 1999–2006 (20,047 adults), using survey-weighted generalized linear models adjusted for age, sex, ethnicity, SES, BMI, and total calories. Variables were transformed (Box-Cox or logit where appropriate), standardized, and FDR-corrected (Benjamini–Hochberg). The authors compared iFPro-based associations to those obtained using the fraction of calories from manual NOVA 4.
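The FPro and iFPro definitions above can be illustrated with a minimal Python sketch. It is not the authors' code: the function names are hypothetical, and it assumes that an item's "contribution" means its share of the person's total calories (for iFPro_wc) or total grams (for iFPro_wG), so that iFPro is a weighted mean of the item-level FPro scores.

```python
from typing import Sequence


def predicted_class(probs: Sequence[float]) -> int:
    """Return the NOVA class (1-4) with the highest predicted probability."""
    if len(probs) != 4:
        raise ValueError("expected four class probabilities (p1..p4)")
    return max(range(4), key=lambda i: probs[i]) + 1


def fpro(probs: Sequence[float]) -> float:
    """Continuous processing score FPro = (1 - p1 + p4) / 2, as defined in the text."""
    if len(probs) != 4:
        raise ValueError("expected four class probabilities (p1..p4)")
    p1, p4 = probs[0], probs[3]
    return (1.0 - p1 + p4) / 2.0


def ifpro(fpro_scores: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted mean of item-level FPro scores.

    Assumption: weights are calories (iFPro_wc) or grams (iFPro_wG) per
    consumed item, normalized by their total.
    """
    if len(fpro_scores) != len(weights):
        raise ValueError("one weight per FPro score is required")
    total = sum(weights)
    if total <= 0:
        raise ValueError("total weight must be positive")
    return sum(f * w for f, w in zip(fpro_scores, weights)) / total


# Hypothetical probability vectors, made up for illustration only.
mostly_unprocessed = (0.90, 0.02, 0.05, 0.03)
mostly_ultra_processed = (0.05, 0.05, 0.10, 0.80)

scores = [fpro(mostly_unprocessed), fpro(mostly_ultra_processed)]
calories = [150.0, 450.0]  # hypothetical calories from each item
print(predicted_class(mostly_ultra_processed), scores, ifpro(scores, calories))
```

A food predicted as pure NOVA 1 (p1 = 1) scores 0, a food predicted as pure NOVA 4 (p4 = 1) scores 1, and a uniform probability vector scores 0.5, which is what makes FPro usable as a graded scale between the two extremes.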
To explore practical diet improvement, a substitution heuristic identified, within each individual’s diet, high-calorie items with substantially less processed alternatives within the same WWEIA food category, estimating changes in iFPro and translating these into expected shifts in phenotypes using EWAS-derived effect sizes.
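The substitution heuristic can be sketched roughly as follows. This is an illustration under stated assumptions rather than the authors' procedure: the `Item` type, the `min_drop` threshold for "substantially less processed", the choice to keep the replaced item's calories unchanged, and all data values are hypothetical.

```python
from dataclasses import dataclass, replace
from typing import List


@dataclass(frozen=True)
class Item:
    name: str
    category: str    # food category label (WWEIA-style; values here are made up)
    fpro: float      # item-level processing score in [0, 1]
    calories: float  # calories this item contributes to the diet


def ifpro_wc(diet: List[Item]) -> float:
    """Calorie-weighted mean FPro of a diet."""
    total = sum(item.calories for item in diet)
    return sum(item.fpro * item.calories for item in diet) / total


def substitute_once(diet: List[Item], catalog: List[Item],
                    min_drop: float = 0.2) -> List[Item]:
    """Swap one item: the highest-calorie item that has a same-category
    alternative whose FPro is lower by at least `min_drop` (assumed threshold).
    Returns a new list; the input diet is left unchanged."""
    by_calories = sorted(enumerate(diet), key=lambda p: p[1].calories, reverse=True)
    for idx, item in by_calories:
        candidates = [c for c in catalog
                      if c.category == item.category and item.fpro - c.fpro >= min_drop]
        if candidates:
            best = min(candidates, key=lambda c: c.fpro)
            # Assumption: the alternative is eaten in an amount with equal calories.
            swapped = replace(best, calories=item.calories)
            return diet[:idx] + [swapped] + diet[idx + 1:]
    return list(diet)  # no qualifying swap found


# Hypothetical diet and catalog, for illustration only.
diet = [Item("packaged snack", "snacks", 0.9, 400.0),
        Item("plain milk", "dairy", 0.1, 100.0)]
catalog = [Item("less processed snack", "snacks", 0.5, 0.0)]

new_diet = substitute_once(diet, catalog)
print(ifpro_wc(diet), ifpro_wc(new_diet))
```

Restricting candidates to the same category keeps the swap realistic (a snack is replaced by another snack), and repeated calls to `substitute_once` would approximate replacing several items in turn.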

Key Findings

- Classifier performance: High cross-validated AUCs across nutrient panels: NOVA 1 = 0.9804 ± 0.0012; NOVA 2 = 0.9632 ± 0.0024; NOVA 3 = 0.9696 ± 0.0018; NOVA 4 = 0.9789 ± 0.0015, indicating strong predictive power from nutrient profiles.
- Coverage of US food supply (FNDDS 2009–2010): FoodProX classified previously unlabeled items, estimating 7.39% NOVA 1, 0.90% NOVA 2, 18.36% NOVA 3, and 73.35% NOVA 4.
- Continuous grading (FPro): Differentiates processing within categories and mixed dishes; examples show progressive increases from raw onion (FPro = 0.0203) to boiled (0.3126), fried (0.7779), and onion rings from frozen (0.9955). FPro exposes variability within NOVA classes and WWEIA categories.
- Stability: Despite substantial nutrient changes across FNDDS cycles, FPro shifts were small (e.g., milk FPro 0.0010→0.0011; cookie 0.9943→0.9965). Simulated nutrient variability up to 1000% yielded modest expected FPro changes (~0.0033 on average).
- Additives information: Adding additive counts slightly improved AUCs in Open Food Facts data, but nutrition facts alone carried most predictive power; additive-only models produced many false positives for NOVA 1–3 and mainly predicted NOVA 4.
- Food sources: Lower FPro for home-grown/caught and residential dining; higher for fast food/pizza (median FPro = 0.9060) and vending machines (0.9800).
- Population diet reliance: Median iFPro_wc for US adults was 0.7872, indicating high dependence on ultra-processed foods.
- Health associations (EWAS, FDR-corrected): Higher iFPro associated with increased odds or levels of metabolic syndrome, diabetes indicators (fasting glucose, C-peptide), higher blood pressure, trunk fat, subscapular and triceps skinfolds, triglycerides, insulin; lower HDL. Positive associations with cardiovascular risk scores (Framingham, ACC/AHA) and self/family history of CVD. Negative associations with serum/blood vitamins and carotenoids (e.g., vitamin B12, vitamin C, vitamin D, folate, alpha- and beta-carotenes, retinol). Inverse association with telomere length (suggesting higher biological age). Elevated biomarkers of soy-based ultra-processed foods (daidzein, genistein, O-DMA) and lower enterolignans (enterodiol, enterolactone). Increased exposure biomarkers for acrylamide and glycidamide, PAHs, volatile organics (benzene, toluene, ethylbenzene, styrene), phthalates, and BPA.
- Superiority over manual NOVA 4 fraction: iFPro identified 209 significant associations (example counts: 134 for iFPro_wc, 170 for iFPro_wG), versus 92 using manual NOVA 4 calories.
- Substitution impact: Replacing a single high-calorie item with a less processed alternative reduced median iFPro_wc by 12.15% (0.7872→0.6915), corresponding to −12.25% odds of metabolic syndrome, −8.47% urinary BPA, +4.83% blood vitamin B12, +12.31% blood vitamin C. Replacing 10 items reduced iFPro_wc by 37.03%, with −21.43% BPA, +13.02% vitamin B12, +37.26% vitamin C.
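The health associations above are reported after FDR correction, which the methodology names as Benjamini–Hochberg. A minimal sketch of that step-up procedure is shown here for orientation. The significance level `alpha` and the p-values are assumptions made up for illustration; they are not values from the study.

```python
from typing import List


def benjamini_hochberg(p_values: List[float], alpha: float = 0.05) -> List[bool]:
    """Return, for each p-value, whether it is significant at FDR level `alpha`.

    Step-up rule: sort p-values ascending, find the largest rank k such that
    p_(k) <= (k / m) * alpha, and reject the k smallest p-values.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])

    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            max_rank = rank

    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= max_rank:
            rejected[idx] = True
    return rejected


# Hypothetical p-values for eight tested phenotype associations.
p_vals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]
print(benjamini_hochberg(p_vals))
```

Because the rule looks for the largest qualifying rank, a p-value that misses its own threshold can still be declared significant when a larger p-value further down the sorted list meets its threshold.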

Discussion

The findings demonstrate that nutrient profiles encode measurable signatures of processing that can be detected by machine learning, enabling reproducible classification and, crucially, continuous grading across a spectrum rather than discrete NOVA categories. FoodProX extends NOVA by classifying previously unlabeled foods, handling mixed foods/meals, and quantifying degrees of processing within the broad ultra-processed category, which improves epidemiologic power and practical guidance. The continuous FPro index, combined with dietary intake, yields individual-level iFPro scores that capture population reliance on ultra-processed foods and reveal wide-ranging associations with metabolic, cardiovascular, inflammatory, vitamin, aging, and chemical exposure biomarkers—many not detectable when using the manual NOVA 4 fraction alone. The approach also supports actionable strategies: small, within-category substitutions meaningfully reduce iFPro and are expected to moderate adverse health markers. Because the model relies only on Nutrition Facts, it can be deployed via consumer channels (labels, apps, websites) and scaled to characterize food environments, informing both individual nudges and policy-level interventions.

Conclusion

This work introduces FoodProX and the continuous FPro score to quantify the degree of food processing from standard nutrient data, enabling automated, portable, and reproducible classification across diverse databases and complex foods. FPro improves epidemiological analyses and practical diet guidance by revealing gradations within ultra-processed foods and linking individual diet processing (iFPro) to numerous health and exposure phenotypes. Substitution heuristics show that modest, within-category swaps can substantially reduce iFPro with expected improvements in health-related biomarkers. Future research should incorporate broader chemical classes (additives, processing byproducts) to enhance prediction, move toward unsupervised processing metrics independent of manual labels, and better capture matrix effects. Combining FPro with epidemiology and food environment mapping can provide a scalable pipeline to improve population diet and health.

Limitations

- Additive concentrations and many processing byproducts are generally unavailable in food composition data; while additive information modestly improves prediction when available, current FPro is trained largely on nutrient profiles and may miss signals from additives.
- FPro is currently best suited for ranking foods within the same chemically driven food groups or categories rather than across all foods, guiding within-category substitutions.
- Nutrient variability across products, recipes, and time exists; although FPro showed stability to substantial nutrient changes, uncertainties in labeling and measurements remain inherent.
- Health association analyses use cross-sectional NHANES data with 24-hour recalls; while standard and weighted appropriately, causal inference is limited.
- The training relies on NOVA manual labels and specific FNDDS cycles; broader international validation and inclusion of more comprehensive chemical data would further generalize the approach.
