AXIS: Generating Explanations at Scale with Learnersourcing and Machine Learning


J. J. Williams, J. Kim, et al.

Discover AXIS, a system that enhances online learning by generating and refining explanations through learner engagement and machine learning. In a randomized evaluation, this approach proved comparable in effectiveness to instructor-authored explanations.
Introduction

The paper addresses the challenge that many online learning platforms present answers without accompanying explanations, which limits learners’ conceptual understanding and ability to generalize to new problems. Explanations can discourage rote procedure use and support transfer, yet they are costly for instructors to create and are rarely revised over time, leaving expert blind spots uncorrected at scale. The research question is how to develop a scalable mechanism to generate, improve, and select effective explanations for online problems. The proposed solution, AXIS, leverages learners as a crowd to generate and evaluate explanations (learnersourcing) and applies machine learning to identify and present helpful explanations to future learners. The purpose is to offload explanation creation from instructors, continuously improve explanations with minimal manual effort, and enhance learners’ outcomes in settings such as MOOCs and online practice platforms.

Literature Review

Related work situates AXIS at the intersection of learnersourcing/crowdsourcing and reinforcement learning. Prior systems have harnessed learners’ activity to produce useful artifacts while supporting learning (e.g., real-time captioning by classroom participants; learnersourcing subgoal labels for videos; adapting video interfaces from interaction traces). From machine learning, AXIS adopts a multi-armed bandit framing to balance exploration and exploitation when choosing among candidate explanations. Thompson sampling, chosen for its strong empirical performance and interpretability, maintains and updates beliefs about each explanation’s helpfulness from noisy learner ratings. Related educational applications of bandits include optimizing teaching sequences and adaptive experimentation in educational games.

Methodology

System design: AXIS has two core components: (1) a learnersourcing interface to collect learners’ knowledge self-reports, ratings of explanation helpfulness, and self-explanations; and (2) an explanation selection policy that dynamically chooses which explanation to present. Learners are prompted to rate explanations on a 1–10 scale and to write self-explanations; high-quality learner explanations may be added to the candidate pool for future learners.
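
As a rough illustration, the learner data that the interface hands to the selection policy might be modeled as follows (a sketch in Python; the paper does not publish a schema, so all field names here are assumptions):

    from dataclasses import dataclass

    # Hypothetical record of one learner interaction; field names are
    # assumptions, since the paper does not specify a data schema.
    @dataclass
    class LearnerResponse:
        problem_id: str
        self_rated_knowledge: int    # self-report for this problem type (used by the filter)
        helpfulness_rating: int      # 1-10 rating of the explanation that was shown
        self_explanation: str        # free-text explanation the learner wrote
        predicted_helpfulness: int   # learner's 1-10 estimate of its helpfulness to others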

Bandit formulation: Each problem is treated as a multi-armed bandit in which the arms are candidate explanations and the reward is the learner’s helpfulness rating. AXIS uses Thompson sampling with conjugate Beta-Bernoulli updates, maintaining a Beta(a, b) posterior for each explanation that reflects its accumulated successes and failures. To accommodate 1–10 ratings under a Bernoulli likelihood, each rating is treated as 10 Bernoulli trials: successes equal the rating value, failures equal 10 minus the rating. New learnersourced explanations that pass filtering are initialized with an optimistic Beta(19, 1) prior, roughly equivalent to observing two high ratings (a 9 and a 10); since this carries only two ratings’ worth of pseudo-evidence, genuine feedback quickly overrides it. To select an explanation, the policy draws one sample from each posterior and presents the explanation with the highest draw, so over time each explanation is shown with frequency equal to its posterior probability of being optimal, balancing exploration and exploitation.
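
A minimal sketch of this policy, with Beta pseudo-counts stored per explanation (the deployed system ran as Apps Script; this Python and its names are illustrative only, not the authors’ code):

    import random

    class AxisBandit:
        """Per-problem Thompson sampler over candidate explanations (sketch)."""

        def __init__(self):
            self.posteriors = {}  # explanation_id -> [a, b] Beta pseudo-counts

        def add_explanation(self, expl_id):
            # Optimistic prior Beta(19, 1): about two high ratings' worth of evidence
            self.posteriors[expl_id] = [19.0, 1.0]

        def select(self):
            # Thompson sampling: draw once from each posterior, show the argmax
            draws = {e: random.betavariate(a, b)
                     for e, (a, b) in self.posteriors.items()}
            return max(draws, key=draws.get)

        def update(self, expl_id, rating):
            # A 1-10 rating counts as 10 Bernoulli trials:
            # `rating` successes and (10 - rating) failures
            a, b = self.posteriors[expl_id]
            self.posteriors[expl_id] = [a + rating, b + (10 - rating)]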

Filtering rule for adding explanations: AXIS adds a learner’s explanation to the pool only if: (a) it is longer than 60 characters; (b) the learner self-reports above-average knowledge for the problem type; and (c) the learner rates its likely helpfulness to others above 6/10.
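
Translated directly into code, reusing the hypothetical LearnerResponse record sketched earlier (the per-problem average knowledge rating is left as a parameter because the paper does not say how it was computed):

    def passes_filter(response, avg_knowledge):
        """The paper's heuristic gate; thresholds exactly as described above."""
        return (len(response.self_explanation) > 60                 # (a) > 60 characters
                and response.self_rated_knowledge > avg_knowledge   # (b) above-average knowledge
                and response.predicted_helpfulness > 6)             # (c) rated > 6/10 for others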

Implementation: The interface was built in Qualtrics. Data flow and bandit updates were implemented in Google Spreadsheets with Apps Script (JavaScript). The system retrieved learner interactions via the Qualtrics API, updated posteriors after each rating, and returned the selected explanation for subsequent learners.
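
Putting the sketched pieces together, one interaction cycle would look roughly like this (illustrative only; the deployed pipeline moved this data between Qualtrics and a Google Spreadsheet, and all values below are made up):

    # One problem's loop: select, collect a rating, update, possibly grow the pool.
    bandit = AxisBandit()
    bandit.add_explanation("seed_explanation")   # pool entries that passed the filter

    shown = bandit.select()                      # explanation for the current learner
    response = LearnerResponse(
        problem_id="algebra_1", self_rated_knowledge=8, helpfulness_rating=9,
        self_explanation="Distribute the multiplication over the parentheses "
                         "first, then combine like terms.",
        predicted_helpfulness=7,
    )
    bandit.update(shown, response.helpfulness_rating)
    if passes_filter(response, avg_knowledge=5):  # avg_knowledge is a placeholder
        bandit.add_explanation(response.self_explanation)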

Deployment case study (generation phase): 150 US participants recruited from Amazon Mechanical Turk (MTurk) completed a 40-minute task (paid $3.50) involving four math problems (algebra, expressions, probability). After submitting an answer and seeing the correct one, learners were shown an explanation to rate (if one was available), a prompt to write their own explanation, or both. The pools were initially empty; AXIS populated them with learnersourced explanations that passed the filter. Across the four problems, 60–72 explanations were generated per problem, of which 9–12 passed the filter and entered the pools. AXIS continually updated its selection policy for each problem.

Evaluation experiment (assessment phase): An independent sample of 524 MTurk participants (paid $3.50) completed a randomized controlled study with two phases. (1) Learning phase: participants solved the four problems and, depending on condition, saw one randomly assigned explanation from the AXIS pool after 75 learners (AXIS-75), the AXIS pool after 150 learners (AXIS-150), the learnersourced explanations discarded by the AXIS filter (Discarded), or the instructional designer’s original explanation (Instructor), or saw no explanation (Practice only). Participants rated explanation helpfulness (1–10) and, before and after each problem, rated their self-efficacy for solving similar problems (1–10) to measure perceived skill increase. (2) Assessment phase: twelve problems without feedback measured learning and transfer, combining isomorphic problems (surface changes only) and novel transfer problems within the same topics. Analyses used mixed-effects models with condition as a fixed factor and problem type as a random effect; pairwise comparisons within these models assessed differences across conditions.
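
For readers who want to run the same style of analysis, a mixed-effects model with condition as a fixed factor and problem as the grouping (random) factor can be fit with statsmodels; the data below are synthetic placeholders, not the study’s data, and the column names are assumptions:

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    # Synthetic long-format records standing in for per-participant outcomes.
    rng = np.random.default_rng(0)
    n = 400
    df = pd.DataFrame({
        "condition": rng.choice(
            ["AXIS-75", "AXIS-150", "Discarded", "Instructor", "Practice"], n),
        "problem": rng.choice(
            ["algebra", "expressions", "probability_a", "probability_b"], n),
        "accuracy_gain": rng.normal(0.05, 0.2, n),
    })

    # Condition as a fixed factor; problem as the random (grouping) factor.
    model = smf.mixedlm("accuracy_gain ~ C(condition)", data=df, groups=df["problem"])
    result = model.fit()
    print(result.summary())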

Key Findings
  • AXIS curated explanations were rated more helpful than discarded learnersourced explanations: M = 6.83 vs. 6.03 (SE = 0.28), p < 0.01.
  • Perceived skill increases: AXIS-150 led to higher increases in self-reported likelihood of solving similar problems than practice only: M = 0.71 vs. -0.01 (SE = 0.13), p < 0.001. No significant difference between AXIS-150 and Instructor explanations: M = 0.71 vs. 0.48 (SE = 0.23), p = 0.14.
  • Objective learning gains (accuracy from learning to assessment phase): AXIS explanations improved accuracy significantly compared to practice only: 12% vs. 2.7% increase (SE = 0.027), p < 0.05. Discarded explanations did not improve learning over practice (2% vs. 3%, p = 0.86) and were significantly less beneficial than AXIS explanations (12% vs. 2%, SE = 0.04, p = 0.029).
  • Transfer to novel problems: AXIS explanations increased success on transfer problems by 9–12% (SE = 0.03–0.04), p < 0.01.
  • AXIS explanations were comparable to the instructional designer’s explanations on perceived benefit and objective learning; no significant differences (all ps > 0.30).
  • Dynamic policy evolution showed AXIS increasing the probability of presenting higher-rated explanations over time, phasing out weaker ones.

Discussion

AXIS addresses the core challenge of scaling high-quality explanations by shifting generation and evaluation to learners and using a principled bandit-based policy to curate and improve explanations over time. The deployment showed that learners can produce many candidate explanations, and the bandit algorithm identifies those judged most helpful. The randomized evaluation demonstrates that AXIS-selected explanations not only feel helpful but also yield measurable learning gains, including transfer, outperforming both no-explanation practice and uncurated learner explanations. The lack of significant differences versus instructor-authored explanations suggests that learnersourced, machine-curated explanations can approach expert quality. This has practical significance for platforms lacking instructor capacity to author and constantly refine explanations, enabling continuous improvement without manual revision cycles.

Conclusion

AXIS (Adaptive eXplanation Improvement System) combines learnersourcing and multi-armed bandit algorithms to generate, evaluate, and adaptively present explanations for online problems at scale. In math problem-solving, AXIS elicited a pool of explanations and identified those that learners rated as helpful, leading to significant improvements in perceived capability, accuracy, and transfer compared to practice without explanations and to uncurated learner explanations. AXIS-curated explanations performed comparably to instructor-authored explanations. Future work includes embedding AXIS via LTI into platforms (e.g., ASSISTments, edX, Moodle, Canvas), extending to other instructional content (hints, examples, motivation), optimizing alternative reward signals (e.g., quiz performance, persistence), and exploring personalization via contextual bandits to match explanations to learner profiles.

Limitations
  • Personalization: The current system selects a single best explanation per problem without adapting to learner characteristics; contextual bandits could tailor explanations to knowledge level or preferences.
  • Participant population: Studies used MTurk workers rather than in-situ students; motivation and context may differ from classroom or MOOC settings.
  • Reward proxy: The bandit optimized subjective helpfulness ratings, which may be noisy and susceptible to metacognitive biases (e.g., illusion of explanatory depth); objective performance-based rewards were not used in the live policy.
  • Filtering and design choices: Heuristic thresholds (length >60 chars, above-average self-rated knowledge, helpfulness >6/10) may exclude potentially useful explanations or include marginal ones; optimal filtering remains open.
  • Sample sizes per explanation: Even with 524 evaluators, individual explanations averaged about 30 views, limiting precision in estimating true effectiveness.
  • Limited demographics and generalizability: Minimal demographic data were collected; generalizability across domains, ages, and educational contexts needs further validation.