Crop diversity used in branded products with focus on legume species worldwide

Food Science and Technology

T. Salord, M. Magrini, et al.

This research by Tristan Salord, Marie-Benoît Magrini, Valérie Lullien-Pellerin, and colleagues examines the global diversity of legume species in branded food products. While soy dominates the market, the findings reveal a growing interest in other pulse crops, especially in Europe, with implications for consumer choices and the future of food databases.

Introduction

The paper investigates the diversity of crop species used by the food industry worldwide, focusing on branded packaged foods as proxies for market offerings. Motivated by concerns over low crop species diversity in global production and its implications for biodiversity, resilience, and human diets, the study asks: what crop diversity is utilized by the food industry across the world? The authors focus on pulses—key to agroecological and dietary transitions—and compare their use to soy, a dominant legume in both feed and food. They analyze product launches (2010–2021) as indicators of innovation dynamics and industry know-how, arguing that increased use of diverse species in new products can support broader cultivation diversity. The study assesses diversity via (a) variety of species used, (b) market concentration on particular species, and (c) species’ contribution to formulations inferred from ingredient list position and packaging mentions.

Literature Review

Background literature highlights that a small number of crops dominate global production, with wheat, rice, and maize comprising over half of plant-based food. Low species diversity increases vulnerability to pests, climate change, and dependence on synthetic inputs, and may harm dietary diversity. Calls for agroecological transition and more plant-based diets are well documented, but there is limited research detailing the ingredients used in packaged foods and no consensus on ingredient categorization by processing levels. Pulses are widely promoted for sustainability and health, yet face technological and market lock-in vis-à-vis major crops like soy. Bibliometric studies show soy dominates legume-related food science publications, underscoring imbalances in research and development. These gaps motivate a market-level, ingredient-focused analysis using text mining to quantify species diversity and usage patterns.

Methodology

Data source: Mintel Global New Products Database (GNPD), covering food product launches (excluding alcoholic beverages and waters) in >80 countries from 2010 to 2021. Queries targeted legume-related products (e.g., legume, pulse, bean, pea, lentil, chickpea, soy). Approximately 350,000 products with at least one pulse or soy ingredient were retained; after disambiguation and exclusions, the final dataset comprised 343,309 products.

Parsing: Ingredient lists (variable-length strings) were transformed into structured JSON dictionaries using a custom Python parser (Salord et al.; code available at https://github.com/Pythrix/FOODCOP.git). The parser retains ingredient rank (order in list) and depth (nesting via brackets), enabling identification of top ingredients (first five) versus remaining items.

Tagging: Legume species were identified via a mixed text-mining and expert-supervised approach. A dictionary of stems (e.g., pea, bean, lentil/lens, gram) and context terms (preceding/following words) was developed to resolve vernacular names to scientific species (e.g., black-eyed pea → Vigna unguiculata; grass pea → Lathyrus sativus). Ambiguities were adjudicated via expert consensus and external taxonomic references (e.g., Catalogue of Life). Ambiguous terms, false friends (e.g., coffee bean), non-pulse legumes (e.g., Arachis hypogaea), and tagging errors were excluded. To avoid ambiguity between fresh and dried legumes, identified pulse species were labeled NSL (non-soy legume). The same tagging pipeline was applied to product descriptions to capture species mentions on packaging.

Analytical framework: Species diversity and usage were analyzed at product and species levels, considering frequency of appearance and position in ingredient lists (top five vs remaining) and packaging mentions.
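
To make the parsing step more concrete, here is a minimal sketch of how an ingredient string might be split while keeping rank and depth. It is not the FOODCOP parser: the `Ingredient` record, the `parse_ingredients` helper, the separator set, and the sample label are all assumptions made for illustration.

```python
from dataclasses import dataclass

OPENERS, CLOSERS, SEPARATORS = "([{", ")]}", ",;"


@dataclass
class Ingredient:
    name: str
    rank: int   # 1-based position among top-level ingredients
    depth: int  # 0 = top level; +1 for each enclosing bracket


def parse_ingredients(text: str) -> list[Ingredient]:
    """Split an ingredient string, keeping rank and bracket depth (hypothetical helper)."""
    items: list[Ingredient] = []
    buffer: list[str] = []
    depth = 0
    rank = 0

    def flush() -> None:
        # Emit the buffered name, if any, at the current depth.
        nonlocal rank
        name = "".join(buffer).strip().lower()
        buffer.clear()
        if not name:
            return
        if depth == 0:
            rank += 1  # nested items keep the rank of their parent
        items.append(Ingredient(name, rank, depth))

    for char in text:
        if char in OPENERS:
            flush()
            depth += 1
        elif char in CLOSERS:
            flush()
            depth = max(0, depth - 1)  # tolerate unbalanced brackets
        elif char in SEPARATORS:
            flush()
        else:
            buffer.append(char)
    flush()
    return items


# Invented sample label, not taken from the GNPD data.
label = "Water, Chickpeas (chickpeas, salt), sunflower oil, soy lecithin"
parsed = parse_ingredients(label)
top_five = [i.name for i in parsed if i.depth == 0 and i.rank <= 5]
```

In this sketch a bracketed sub-ingredient inherits the rank of the ingredient it belongs to, so the "top five" filter only needs `depth == 0` and `rank <= 5`. Real labels also carry percentages and additive codes inside brackets, which this toy version would treat as ordinary nested items.
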
K-means clustering (40 random starts, 3000 iterations, 2–11 clusters) categorized species by frequency profiles across positions and descriptions; the number of clusters was selected via silhouette coefficients. Descriptive statistics summarized market segments and geographic areas.

Dataset structure: Products were partitioned into three subcorpora: soy only (n=249,425), NSL only (n=65,653), and both soy and NSL (n=28,231).
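
The clustering step can be sketched as follows, using the settings stated above (40 random starts, up to 3000 iterations, 2 to 11 clusters, silhouette-based selection). This is a dependency-free illustration, not the authors' code; the summary does not say which software they used. The `profiles` values are invented toy data (share of occurrences in the top five, share of products naming the species on the pack), not results from the paper.

```python
import math
import random


def kmeans(points, k, n_starts=40, max_iter=3000, seed=0):
    """Lloyd's algorithm with random restarts; returns the lowest-inertia labels."""
    rng = random.Random(seed)
    best_labels, best_inertia = None, math.inf
    for _ in range(n_starts):
        centers = rng.sample(points, k)
        labels = None
        for _ in range(max_iter):
            new_labels = [
                min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points
            ]
            if new_labels == labels:
                break  # assignments are stable: converged
            labels = new_labels
            for c in range(k):
                members = [p for p, lab in zip(points, labels) if lab == c]
                if members:  # keep the old center if a cluster empties
                    centers[c] = tuple(sum(x) / len(members) for x in zip(*members))
        inertia = sum(math.dist(p, centers[lab]) ** 2 for p, lab in zip(points, labels))
        if inertia < best_inertia:
            best_labels, best_inertia = labels, inertia
    return best_labels


def silhouette(points, labels):
    """Mean silhouette coefficient; singleton clusters score 0 by convention."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)
            continue
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        b = min(
            sum(math.dist(p, q) for q in other) / len(other)
            for other_lab, other in clusters.items()
            if other_lab != lab
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


def select_k(points, k_values=range(2, 12)):
    """Score each candidate k and return the one with the highest silhouette."""
    scores = {k: silhouette(points, kmeans(points, k)) for k in k_values}
    return max(scores, key=scores.get), scores


# Toy profiles (invented): (share in top five, share mentioned on packaging).
profiles = [
    (0.80, 0.70), (0.85, 0.65), (0.78, 0.72), (0.82, 0.68),  # identity-like
    (0.15, 0.05), (0.10, 0.04), (0.18, 0.08), (0.12, 0.06),  # fractional-like
    (0.50, 0.30), (0.48, 0.35), (0.55, 0.28), (0.52, 0.33),  # mixed
]
best_k, silhouette_by_k = select_k(profiles)
```

On well-separated toy data like this, the silhouette criterion should favour three clusters. In practice a library implementation such as scikit-learn's `KMeans` and `silhouette_score` would be the more usual choice.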

Key Findings
  • Identified species: Tagging revealed 32 legume species (including soy) used in product launches. While this suggests notable variety, usage is highly skewed.
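  • Soy dominance: Products containing only soy ingredients comprise 73% (n=249,425) of the 343,309 products; NSL-only products are 19% (n=65,653); 8% (n=28,231) contain both. Soy-only launches outnumber NSL-only launches nearly fourfold (about 3.8 to 1); counting every product that contains soy (277,656) against every product that contains an NSL (93,884), the ratio is about 3 to 1.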
  • Concentration among NSL: Four species—Pisum sativum, Phaseolus vulgaris, Cicer arietinum, Lens culinaris—account for ~79% of NSL ingredients; ~20 NSL species each occur in <1% of NSL-containing products.
  • Top species frequencies (Table 1): Glycine max (soy) 277,656 mentions (71.586%), Pisum sativum 36,144 (9.319%), Phaseolus vulgaris 25,358 (6.538%), Cicer arietinum 16,160 (4.166%), Lens culinaris 9,706 (2.502%). Soy is mentioned in product descriptions only 4.09% of the time, versus much higher rates for several NSL (e.g., Lens culinaris 70.33%).
  • Ingredient position and formulation: NSL ingredients more often appear among the first five ingredients than soy across most ingredient list length deciles. Soy tends to appear in longer, more complex formulations and at higher (later) mean positions than NSL, consistent with soy’s frequent fractional uses (e.g., oil, lecithin, protein).
  • Usage context clusters (k-means): A green cluster of NSL (e.g., Phaseolus coccineus, Cajanus cajan, Lens culinaris, Phaseolus vulgaris, Cicer arietinum, Vicia faba) is frequently in top five and often mentioned on packaging, suggesting identity-defining uses. A blue cluster (including Glycine max and Ceratonia siliqua) appears mainly in remaining ingredients and is seldom mentioned on packaging, reflecting fractional/functional uses (e.g., soy oil, soy lecithin, soy protein; locust bean gum). A median red cluster (e.g., Pisum sativum, Vigna unguiculata, Vigna angularis, Vigna radiata, Lupinus angustifolius) shows balanced top vs remaining positions with variable packaging mentions, indicating mixed strategies (e.g., peas vs pea protein; lupin flour).
  • Market segments and growth: Soy dominates most segments, except spreads and fruits/vegetables, where the balance is closer or reversed. However, NSL-containing products show higher cumulative growth rates than soy across all segments, most notably in dairy (~12x), desserts (~9x), and breakfast (~7x).
  • Geography: Soy dominates in all regions, but Europe and Southern Asia are more balanced (~40% NSL-containing launches). The Americas and Eastern/Southeastern Asia remain highly soy-dominant (~80% soy), while other regions (including Africa) still show a soy majority (about 60–73%).
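
The subcorpus shares and soy-to-NSL ratios in the findings above follow directly from the three reported counts. The short check below reproduces that arithmetic; the variable names are illustrative only.

```python
# Subcorpus sizes as reported in the summary.
soy_only, nsl_only, both = 249_425, 65_653, 28_231
total = soy_only + nsl_only + both  # final dataset size

# Percentage of launches in each subcorpus, rounded to whole percent.
shares = {
    "soy only": round(100 * soy_only / total),
    "NSL only": round(100 * nsl_only / total),
    "both": round(100 * both / total),
}

# Two ways to compare soy with non-soy legumes (NSL).
any_soy = soy_only + both   # products with at least one soy ingredient
any_nsl = nsl_only + both   # products with at least one NSL ingredient
ratio_exclusive = soy_only / nsl_only   # soy-only vs NSL-only launches
ratio_inclusive = any_soy / any_nsl     # any soy vs any NSL

print(total, shares)
print(round(ratio_exclusive, 1), round(ratio_inclusive, 1))
```

The "nearly four times" figure matches the exclusive comparison (soy-only against NSL-only); the inclusive comparison gives a ratio closer to three.
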
Discussion

Findings corroborate a strong market-level lock-in around soy, mirroring field-level dominance, which can hinder broader species diversification. Nonetheless, accelerated growth of NSL-containing products and more balanced usage in Europe and Southern Asia suggest potential shifts. The position of ingredients in lists and frequency of packaging mentions serve as proxies for product-context use: NSL species are more often identity-defining main components, whereas soy often appears as fractional/functional ingredients (oil, lecithin, protein). This indicates a tendency for species identity and biodiversity to be subsumed by functionality in processed foods. The study demonstrates the feasibility and value of ingredient-level text mining to monitor crop usage, offering insights relevant to public policy and industry innovation. It underscores the need for open, standardized vocabularies and ontologies (e.g., FoodON, TransformON) and for improved public databases to ensure transparency. Further work is needed to integrate processing typologies, refine importance metrics, and connect supply-side launches with consumer purchasing behavior to understand adoption and failures.

Conclusion

This work provides the first global assessment of species diversity in branded packaged foods using ingredient-level text mining, focusing on legumes and benchmarking pulses against soy. Although more than 30 NSL species were identified, the market is highly concentrated: soy remains dominant, though NSL-containing launches are growing faster and show more prominent, identity-related uses. Europe and Southern Asia exhibit more balanced markets, signaling potential diversification. The methods and resources (parser, tagging dictionary) enable ongoing monitoring of crop usage in food markets. Future research should: (i) characterize processing profiles and degrees of fractionation to map technological paradigms; (ii) develop and adopt controlled vocabularies/ontologies for ingredient naming; (iii) incorporate quantitative diversity metrics (e.g., Herfindahl index); (iv) capture ingredient origins to compare supply with cultivation geographies; and (v) link launch data with purchase data to analyze consumer uptake and market failures.
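
As an illustration of the Herfindahl index suggested in (iii): the index is the sum of squared shares, so it approaches 1 when a single species dominates and 1/N when N species are used equally. The sketch below applies it to the five Table 1 percentages quoted in the key findings, under the assumption that they are shares of all species mentions. Because the shares of the remaining species are not given in this summary, the result is only a bound, not a value reported by the authors.

```python
def herfindahl(shares):
    """Herfindahl index: sum of squared shares, with shares as fractions of 1."""
    return sum(s * s for s in shares)


# Shares of the five most frequent species, from the Table 1 figures quoted above.
top_five = {
    "Glycine max": 0.71586,
    "Pisum sativum": 0.09319,
    "Phaseolus vulgaris": 0.06538,
    "Cicer arietinum": 0.04166,
    "Lens culinaris": 0.02502,
}

covered = sum(top_five.values())   # share of mentions explained by these five
remainder = 1.0 - covered          # spread over the other species (unknown split)

# Lower bound: ignore the remainder (as if spread infinitely thin).
hhi_lower = herfindahl(top_five.values())
# Loose upper bound: treat the whole remainder as one extra species.
hhi_upper = hhi_lower + remainder ** 2

print(f"covered={covered:.5f}  HHI between {hhi_lower:.4f} and {hhi_upper:.4f}")
```

Even the lower bound sits above 0.5, which is consistent with the summary's description of a market concentrated on soy.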

Limitations
  • Data coverage: Mintel GNPD does not cover all countries equally; some regions (e.g., Africa) are less documented.
  • Ingredient ambiguity: Variability and lack of global standards in ingredient naming required custom dictionaries; ambiguous terms and false friends were excluded, potentially undercounting.
  • Fresh vs dried legumes: Distinctions are not always made; species were labeled NSL to avoid ambiguity.
  • Proxy measures: Ingredient list order is a proxy for quantity/importance; exact amounts are not known.
  • Origin data: Ingredient geographical origins are typically not disclosed, limiting comparisons between usage and production locales.
  • Market linkage: The study analyzes launches, not purchases; lack of harmonized identifiers (e.g., barcodes) and data constraints hinder linking to consumer behavior.
  • Concentration metrics: A comprehensive multi-species biodiversity assessment would benefit from additional metrics (e.g., Herfindahl index) not fully applied here.
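
The ingredient-ambiguity limitation is easier to see with a toy version of dictionary-based tagging. The sketch below is a hypothetical miniature, not the authors' tagging dictionary: the rule list, the `tag_species` helper, and the exclusion patterns are assumptions, seeded with the vernacular-to-species examples mentioned in the methodology.

```python
import re

# Context-specific names come before bare stems, so "black-eyed pea"
# is not swallowed by the generic "pea" rule.
SPECIES_RULES = [
    (r"black[- ]eyed peas?", "Vigna unguiculata"),
    (r"grass peas?", "Lathyrus sativus"),
    (r"chick ?peas?", "Cicer arietinum"),
    (r"lentils?", "Lens culinaris"),
    (r"soy(a)?( ?beans?)?", "Glycine max"),
    (r"peas?", "Pisum sativum"),
]

# False friends and non-pulse legumes that must never be tagged.
EXCLUDED = [r"coffee beans?", r"peanuts?"]


def tag_species(ingredient: str):
    """Return a species name for an ingredient string, or None if untagged."""
    text = ingredient.lower()
    if any(re.search(rf"\b{pattern}\b", text) for pattern in EXCLUDED):
        return None
    for pattern, species in SPECIES_RULES:
        if re.search(rf"\b{pattern}\b", text):
            return species
    return None  # e.g. a bare "beans" is ambiguous and stays untagged


examples = ["Black-eyed peas", "pea protein isolate", "soy lecithin",
            "roasted coffee beans", "beans"]
tagged = {name: tag_species(name) for name in examples}
```

Any term left untagged, such as the bare "beans" here, drops out of the counts, which is how excluding ambiguous terms can lead to undercounting.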