Education
FEW questions, many answers: using machine learning to assess how students connect food-energy-water (FEW) concepts
E. A. Royse, A. D. Manzanares, et al.
How well can machine learning assess complex, interdisciplinary learning? This research examines whether machine learning can identify key concepts in students' written responses about the Food-Energy-Water Nexus, reaching substantial to almost-perfect agreement with human scoring for many concepts. The findings highlight the relative strength of students' knowledge about water use while revealing persistent challenges in reasoning about trade-offs.
~3 min • Beginner • English
Introduction
The study addresses the need to assess cross-cutting, interdisciplinary learning in higher education environmental and sustainability programs, where systems thinking is a core competency. Traditional multiple-choice concept inventories are inadequate for eliciting complex reasoning; constructed responses (CRs) better capture student thinking but are time-intensive to grade. The authors investigate whether machine learning (ML) text classification can identify instructor-determined key ideas in student CRs about the Food-Energy-Water (FEW) Nexus and what these responses reveal about students' systems thinking. The FEW Nexus provides a practical framework for integrating systems thinking across environmental processes, management, policy, and socioeconomics. The paper poses two research questions: (1) Can ML models identify important concepts in student responses as determined by instructors? (2) What do students know about FEW interconnections, and how is systems thinking reflected in their CRs? The context highlights the importance of NGSS crosscutting concepts, Bloom's higher-order thinking, and the challenges of assessing systems thinking within interdisciplinary environmental studies.
Literature Review
The authors situate systems thinking as a key competency in STEM and sustainability education (e.g., Wiek et al., Redman and Wiek, NAS, NSF), noting the lack of broadly validated measures for systems thinking across domains. They review limitations of traditional concept inventories for interdisciplinary and higher-order skills, emphasizing the promise of AI/NLP and ML for scoring student CRs. Prior work shows ML can predict human scoring of short science responses across topics and grades, but performance depends on assessment construct characteristics, item design, data balance, and rubric clarity. The FEW Nexus is presented as a meaningful, real-world scaffold for systems thinking. The review also notes methodological considerations for automated scoring validity, including model-human agreement metrics (e.g., Cohen’s kappa), issues of data sparsity and imbalance, and the potential of advanced NLP and generative AI to support formative assessment and item development.
Methodology
Design: The study follows a modified question development cycle integrating item/rubric development with ML-based text classification. Stages include question design, data collection, exploratory analysis (text mining of student responses), rubric development (analytic and holistic), iterative human coding with interrater agreement checks, confirmatory ML modeling, and deployment as predictive models.
Assessment items: Two FEW-focused items were developed: (1) Sources of FEW & Connections: Reservoir (identify sources of energy and water and explain connections), and (2) Trade-offs of FEW systems: Biomass energy production (evaluate outcomes and compare trade-offs in switching from beans to corn for biomass). Items had sub-parts targeting specific constructs and Bloom-aligned scaffolding.
Data collection: Student CRs (n=698 collected overall; subsets were used for modeling) were obtained from introductory interdisciplinary environmental and sustainability (IES) courses (Spring 2022–Spring 2023) across 7 institutions (drawn from a purposive sample of 10 institutions representing Carnegie categories and curricular emphases). Demographics: 57.45% female, 4% non-binary, 38.55% male; 73.67% White, 5.3% Asian, 4.7% Hispanic/Latino/Latinx, 1.78% Black or African American, 1.38% American Indian or Alaska Native, 11.79% multiracial. Data collection was IRB-approved and responses were de-identified. Instructors validated content coverage and accessibility of items.
Rubric development and human coding: Analytic, dichotomous rubrics (0/1 per bin) were created inductively from student responses and anchored using instructor expert responses. Iterative coding by 2–3 researchers on random batches of 30 responses per iteration; ≥85% per-bin agreement targeted (acceptable >80%). Rubrics were refined for clarity and bin distinctiveness. Final coding: Reservoir item (346 responses) and Biomass item (483 responses) were scored, with some bins later merged or restructured.
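To make the agreement check concrete, below is a minimal sketch (not the authors' code) of computing per-bin percent agreement between two coders on dichotomous rubric codes; the coder arrays, bin, and batch size are hypothetical.

```python
# Minimal sketch of the per-bin agreement check described above: two coders
# assign dichotomous (0/1) codes to a batch of responses for one rubric bin,
# and agreement is the share of matching codes.
def percent_agreement(coder_a, coder_b):
    """Proportion of responses on which two coders assigned the same code."""
    assert len(coder_a) == len(coder_b)
    matches = sum(a == b for a, b in zip(coder_a, coder_b))
    return matches / len(coder_a)

# Hypothetical codes for one bin across a batch of 30 responses.
coder_1 = [1, 0, 0, 1, 1, 0, 0, 0, 1, 0] * 3
coder_2 = [1, 0, 0, 1, 0, 0, 0, 0, 1, 0] * 3

agreement = percent_agreement(coder_1, coder_2)
print(f"Per-bin agreement: {agreement:.0%}")  # 90% here; the target is >=85%
```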
ML text classification: Supervised ML with NLP features using the Constructed Response Classifier (CRC) tool and a 10-fold cross-validation ensemble of eight algorithms. Text preprocessing: tokenization, stemming, stop-word and digit removal; features: bag-of-words n-grams (unigrams and bigrams, extended to trigrams and quadgrams as needed). For the Reservoir item, Parts A and B were concatenated into a single response due to identical rubrics and coder practice; for the Biomass item, the parts were modeled separately because their rubrics differed. Training/testing datasets: Reservoir N=345; Biomass Part A N=480; Biomass Part B N=466 (after removing responses with missing labels). Evaluation metrics: Cohen's kappa (primary), accuracy, sensitivity, specificity, F1. Benchmarks: kappa >0.6 (substantial), >0.8 (almost perfect). Iterative tuning included basic feature engineering and the extended strategies listed below (a minimal pipeline sketch follows the list):
- Additional feature engineering: synonym substitution; longer n-grams (trigrams/quadgrams).
- Data rebalancing: reduce overrepresented 0-cases to address severe class imbalance (target ≤2:1 ratio).
- Dummy responses (data augmentation): generate minimally edited variants of misclassified responses to enrich features; used only for training, removed for evaluation metrics.
- Merging rubric bins: combine overlapping or mutually exclusive bins into single codes or multi-class schemes (e.g., Biomass Part A B3/B4; Part B C1/C2 recoded as multi-class 0/1/2).
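The sketch below is an illustrative stand-in for one per-bin classifier of the kind described above. The study itself used the CRC tool's ensemble of eight algorithms; here a single scikit-learn logistic-regression pipeline with hypothetical responses and labels shows the overall shape only: bag-of-words n-gram features, 10-fold cross-validation, and Cohen's kappa as the primary agreement metric (stemming and digit removal are omitted for brevity).

```python
# Illustrative stand-in (hypothetical data and model) for one per-bin
# classifier: bag-of-words n-gram features, stop-word removal, 10-fold
# cross-validation, and Cohen's kappa as the primary agreement metric.
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.metrics import cohen_kappa_score, accuracy_score

# Hypothetical responses and human codes for one bin (e.g., "mentions hydropower").
responses = [
    "the dam produces hydropower that powers irrigation pumps",
    "water stored in the reservoir spins turbines to make electricity",
    "hydroelectric energy from the dam is used in nearby homes",
    "the reservoir generates hydropower for farm machinery",
    "falling water turns generators producing hydroelectric power",
    "the reservoir stores water for drinking and for crops",
    "farmers use the stored water to irrigate corn fields",
    "the lake provides recreation and drinking water",
    "water from the reservoir is piped to the city",
    "the reservoir holds runoff from snowmelt upstream",
] * 2
bin_labels = ([1] * 5 + [0] * 5) * 2

model = Pipeline([
    # Uni- and bigram bag-of-words; ngram_range=(1, 4) would add the
    # trigrams/quadgrams mentioned in the extended strategies above.
    ("features", CountVectorizer(ngram_range=(1, 2), stop_words="english")),
    ("classifier", LogisticRegression(max_iter=1000)),
])

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
predicted = cross_val_predict(model, responses, bin_labels, cv=cv)

print("Cohen's kappa:", round(cohen_kappa_score(bin_labels, predicted), 3))
print("accuracy:     ", round(accuracy_score(bin_labels, predicted), 3))
```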
Key Findings
Model performance (RQ1):
- Reservoir item: 11 bin-specific models developed; 10 achieved substantial to almost-perfect kappa (0.652–0.957) with high accuracies (generally >0.85). One bin (D2) failed to detect positives (kappa=0, accuracy=0.992) due to severe class imbalance (only 3 positive cases); a worked illustration of this accuracy-versus-kappa gap follows this list. Examples (Table 4): B1 kappa=0.943 (accuracy 0.975); B2 kappa=0.957 (accuracy 0.986); C3 kappa=0.906 (accuracy 0.959); A2 kappa=0.838 (accuracy 0.940); B4 kappa=0.825 (accuracy 0.947); D1 kappa=0.678 (accuracy 0.829); C2 kappa=0.652 (accuracy 0.857). Extended strategies (dummy data, rebalancing) improved several bins.
- Biomass item: 15 models (8 for Part A, 7 for Part B) showed lower kappas (0–0.674) with accuracies 0.755–0.991 (Table 5). No bin reached kappa >0.8; some bins performed poorly despite frequent positives (reflecting broader answer space and expression variability). Merging B3 and B4 (Part A) improved performance for borderline cases; recoding C1/C2 (Part B) into a single multi-class prediction improved performance over separate binary models.
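The following worked example (hypothetical predictions mirroring the D2 class counts) shows why a rare bin can yield near-perfect accuracy yet zero kappa: with only 3 positives out of 345 responses, a model that never predicts the positive class is almost always "right," but Cohen's kappa, κ = (p_o − p_e) / (1 − p_e), corrects for chance agreement and drops to 0.

```python
# Hypothetical predictions mirroring the D2 bin: 3 positive human codes out
# of 345 responses, and a model that never predicts the positive class.
# Accuracy looks excellent, but chance-corrected agreement (kappa) is zero.
from sklearn.metrics import accuracy_score, cohen_kappa_score

n_total, n_positive = 345, 3
human = [1] * n_positive + [0] * (n_total - n_positive)
model = [0] * n_total                      # no positives detected

print("accuracy:", round(accuracy_score(human, model), 3))      # ~0.991
print("kappa:   ", round(cohen_kappa_score(human, model), 3))   # 0.0
```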
Student knowledge and systems thinking (RQ2):
- Reservoir co-occurrence (Table 6): A (hydropower) frequently co-occurred with C (uses of energy) and B (energy production). Counts (total instances/co-occurrences): A total=241, co-occurring with C=157 and B=80; C total=206, co-occurring with A=157 and B=92; B total=148, co-occurring with A=80 and C=92. D (water use) was least frequent (total=69) but, when present, commonly co-occurred with A (47). This indicates students often link hydropower to energy uses (e.g., irrigation, machinery, homes) but less often make explicit water-use connections beyond hydropower; a brief sketch of how such co-occurrence counts can be tallied follows this list.
- Biomass expertise levels (Table 8/9, derived from bin combinations):
• Part A (water-use change): Level 4=103 (21.4%); Level 3=267 (55.6%); Level 1=110 (22.9%); Level 2=0. Students more proficient in explaining water-use changes than in articulating trade-offs.
• Part B (trade-offs): Two-vertex trade-offs (Level 2 by logic) were most common: 229 (49.2%); Level 1=104 (22.3%); Level 3=51 (10.9%); Level 4=0 by ML (human coding found only 1 Level 4). Level-0 cases included “I don’t know” or trivial statements.
- Overall: Expert-like responses were infrequent; moving from single effects to multiple effects and articulating mechanisms signals more advanced systems thinking. Students exhibited higher expertise in describing increased water usage following a shift to corn biomass than in evaluating FEW trade-offs.
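As a brief illustration of the co-occurrence analysis referenced above, the sketch below tallies pairwise co-occurrences of the top-level Reservoir ideas from per-response 0/1 codes; the codes shown are hypothetical, not the study's data.

```python
# Sketch (hypothetical codes) of tallying pairwise co-occurrences of the
# top-level Reservoir ideas from per-response 0/1 codes:
# A (hydropower), B (energy production), C (uses of energy), D (water use).
import pandas as pd

codes = pd.DataFrame(
    [  # each row is one student response; 1 = the idea was present
        {"A": 1, "B": 0, "C": 1, "D": 0},
        {"A": 1, "B": 1, "C": 1, "D": 0},
        {"A": 0, "B": 1, "C": 0, "D": 0},
        {"A": 1, "B": 0, "C": 0, "D": 1},
        {"A": 0, "B": 0, "C": 1, "D": 0},
    ]
)

totals = codes.sum()                 # how often each idea appears overall
co_occurrence = codes.T.dot(codes)   # pairwise counts; diagonal repeats totals
print(totals)
print(co_occurrence)
```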
Process insights:
- Rubric clarity and distinctiveness improved model performance; overlapping categories (e.g., A2 vs B4 in Reservoir) posed challenges and suggest future bin consolidation or multi-class schemes.
- Extended strategies (dummy responses, rebalancing, longer n-grams) mitigated data sparsity/imbalance and improved some bins.
- More complex constructs (trade-offs) and wider answer spaces reduce ML performance, indicating a need for additional feature engineering or advanced models.
Discussion
Findings support that ML text classification can, for many rubric bins, identify instructor-determined concepts in student CRs with substantial to almost-perfect agreement, particularly for items with clearer, narrower constructs (Reservoir). More complex constructs with broader answer spaces (Biomass trade-offs) depress performance, underscoring the importance of principled item/rubric design, iterative refinement, and attention to construct characteristics. Model-human agreement is sensitive to data imbalance, bin overlap, and linguistic variability; strategies such as data augmentation with dummy responses, rebalancing, synonym sets, and bin merging/multi-class coding can improve outcomes. As to student understanding, co-occurrence patterns reveal that students commonly connect hydropower with energy uses (energy for irrigation, machinery, homes) but less often elaborate water-use consequences beyond hydropower. In the biomass scenario, students more readily explain directional changes in water use than articulate multi-vertex trade-offs, and expert-like multi-effect reasoning remains rare. These insights can inform instruction and formative assessment, where automated scoring provides rapid, cohort-level distributions of understanding. The discussion also highlights prospects for generative AI to aid pattern detection and clustering for formative use, while noting integrity and validity concerns and the need to focus assessments on core disciplinary principles and systems thinking.
Conclusion
The study demonstrates initial but meaningful success in using ML-based text classification to assess complex, interdisciplinary CRs about FEW systems and systems thinking. For reservoir-based connections, automated scoring achieved high agreement with human codes; for biomass trade-offs, agreement was acceptable but lower, reflecting construct complexity and broader response variability. The work advances methods for building novice-to-expert scales and reveals characteristic student patterns—stronger on describing water-use changes than evaluating FEW trade-offs. Future directions include: refining human scoring methods and rubrics to better support ML; expanding item banks with multiple CRs targeting varied constructs; scaling data collection for greater response diversity; exploring advanced NLP (including generative AI) and multi-class/structured prediction; and examining how students integrate social/human dimensions into FEW explanations. Wider availability of models and items can reduce grading burden and support valid, reliable formative evaluation of systems thinking across interdisciplinary programs.
Limitations
Key limitations include: (1) Data imbalance and sparsity of certain ideas (e.g., Reservoir D2; Biomass A2) which degraded sensitivity and kappa despite high accuracy; (2) Overlap or ambiguity among rubric bins (e.g., hydropower vs energy transformations) challenging both human and machine discrimination; (3) Variation in student literacy and diverse expression across institutions, increasing linguistic complexity; (4) Reduced ML performance for constructs with broader answer spaces (trade-offs); (5) Potential underprediction of mid-level expertise by ML relative to human coders; (6) Significant time and labor investments for iterative item/rubric/model development and human coding; and (7) Generalizability limits due to the specific items, contexts, and datasets used.