Humans monitor learning progress in curiosity-driven exploration

Psychology

A. Ten, P. Kaushik, et al.

This study examines how humans set intrinsic goals during self-directed learning, showing that learners monitor learning progress rather than competence alone. Conducted by Alexandr Ten, Pramod Kaushik, Pierre-Yves Oudeyer, and Jacqueline Gottlieb, the research offers new perspectives on curiosity and intrinsic motivation in learning.
Introduction

The study investigates how humans self-organize learning over extended timescales when freely choosing among multiple activities of varying difficulty. Prior laboratory work has often operationalized curiosity as information demand about brief, unrelated events, but naturalistic curiosity involves sustained engagement (e.g., reading, searching, taking courses) where learners must strategically allocate limited time and effort across many possible activities. This raises the strategic student problem: how to allocate study time to maximize learning without knowing one’s precise learning curves in advance. Two prominent computational hypotheses could guide such exploration: (1) competence-based heuristics that prioritize high difficulty or uncertainty (e.g., seeking high prediction errors), and (2) learning progress (LP)-based heuristics that prioritize activities where performance is currently improving; LP is the temporal derivative of competence (e.g., percent correct, PC). LP-based control is theorized to avoid both trivially easy and unlearnable tasks while fostering engagement with learnable but challenging activities. The study tests whether humans dynamically monitor PC and LP and whether these signals shape free choices among activities without explicit instruction to maximize learning.

Literature Review

The paper reviews work showing that curiosity is intrinsically rewarding and linked to neural reward systems, but typical tasks measure immediate information demand rather than sustained learning. It contrasts evidence that people prioritize high difficulty/uncertainty with evidence for a preference for intermediate difficulty across domains (e.g., trivia curiosity, sensorimotor exploration, infant attention, aesthetics). Computational frameworks for intrinsic motivation include prediction-error/uncertainty bonuses and LP-based strategies. LP has been proposed to optimize curriculum learning, avoid unlearnable tasks, and personalize education. However, empirical studies have not directly tested dynamic sensitivity to LP during ongoing learning; prior work focused on static judgments of difficulty/familiarity rather than trial-by-trial changes in performance. This study fills that gap by modeling and testing whether LP, in addition to PC, explains human activity selection.

Methodology

Participants: 400 MTurk workers (19–71 years, mean 36.15; 208 female, 187 male, 5 undisclosed) consented. Eighteen (5 EG, 13 IG) were excluded for response bias, leaving 382 for analysis. An additional 8 EG and 9 IG participants who did not master any activity were excluded from computational modeling.
Design: Instruction was manipulated between subjects: the External Goal group (EG; N = 196) was instructed to maximize learning across all activities and informed of a post-session test; the Internal Goal group (IG; N = 186) was told to play freely with no explicit objective. Activity difficulty was manipulated within subjects via rule complexity across four activities (A1–A4): A1, 1-dimensional categorization (easiest); A2, 1D with an irrelevant feature; A3, 2D conjunction (hardest learnable); A4, random/unlearnable mapping.
Task: Each free-choice trial comprised (1) choosing one of four monster-family activities, (2) viewing a randomly drawn exemplar and guessing one of two foods, and (3) receiving immediate correctness feedback.
Sequence: (i) forced-choice familiarization (15 trials per activity), (ii) prospective learnability rating, (iii) free play (250 trials), (iv) post-task ratings; EG also received the announced test (not used in the main analyses).
Measures: Percent correct (PC) per activity; difficulty-weighted final performance (dwfPC), computed from the last 15 trials per learnable activity with weights proportional to difficulty ranks; a mastery designation NAM1/2/3 based on achieving ≥13/15 correct per activity; and a self-challenge (SC) index per participant, defined by normalizing the recent PC of the chosen activity on each trial to the participant’s experienced PC range and averaging across free play.
Statistics: Mixed-design ANOVAs, linear models, Tukey HSD, Welch t-tests, Pearson correlations; model comparison via AIC.
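The dwfPC and mastery measures are straightforward to compute. A minimal sketch, assuming weights proportional to the difficulty ranks 1–3 of the three learnable activities (the exact weighting scheme is an assumption, not the authors' code) and the ≥13/15 mastery criterion stated above:

```python
import numpy as np

def dwf_pc(final_pc, ranks=(1, 2, 3)):
    """Difficulty-weighted final PC across the learnable activities A1-A3.

    final_pc: proportion correct over each activity's last 15 trials.
    ranks: difficulty ranks; weights are taken proportional to these
    ranks (an assumed reading of "weights proportional to difficulty ranks").
    """
    w = np.asarray(ranks, dtype=float)
    w /= w.sum()  # normalize so weights sum to 1
    return float(np.dot(w, final_pc))

def mastered(outcomes, threshold=13, window=15):
    """Mastery designation: at least `threshold` correct of the last `window` trials."""
    return sum(outcomes[-window:]) >= threshold
```

With these definitions, a participant's NAM stratum is simply the count of learnable activities for which `mastered` returns True.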
Computational modeling: Bandit-style softmax choice with utility U(i,t) = wPC·PC(i,t) + wLP·LP(i,t), where PC(i,t) is the proportion correct over the last 15 trials of activity i, and LP(i,t) is the difference between performance in the later and earlier parts of the same 15-trial window (absolute value used to capture both increases and decreases). Three free parameters per participant: a softmax temperature and the weights wPC and wLP. Fitting was by maximum likelihood with L-BFGS-B, using multiple random initializations run to convergence. Models compared: a random baseline, univariate PC-only, univariate LP-only, and bivariate PC+LP. Coefficients were normalized for some analyses; subsets were identified as PC-driven (negative PC weight, near-zero LP) or LP-driven (positive LP, near-zero PC). Simulations used the fitted coefficients to reproduce time-allocation patterns.
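The choice model can be sketched as follows. This is an illustrative reimplementation under stated assumptions (the exact windowing details, the log-temperature parameterization, and the optimizer settings follow the description loosely, not the authors' code), using NumPy and SciPy:

```python
import numpy as np
from scipy.optimize import minimize

WINDOW = 15  # trailing window used for PC and LP in the paper

def recent_pc(outcomes):
    """Proportion correct over the last WINDOW binary outcomes of one activity."""
    recent = outcomes[-WINDOW:]
    return float(np.mean(recent)) if recent else 0.5

def recent_lp(outcomes):
    """Absolute PC change between the later and earlier halves of the window."""
    recent = outcomes[-WINDOW:]
    if len(recent) < 2:
        return 0.0
    half = len(recent) // 2
    return abs(float(np.mean(recent[half:])) - float(np.mean(recent[:half])))

def choice_probs(pc, lp, w_pc, w_lp, temp):
    """Softmax over utilities U_i = w_pc*PC_i + w_lp*LP_i at temperature temp."""
    u = (w_pc * np.asarray(pc) + w_lp * np.asarray(lp)) / temp
    u -= u.max()  # subtract max for numerical stability
    e = np.exp(u)
    return e / e.sum()

def neg_log_lik(params, pc_traj, lp_traj, choices):
    """Negative log-likelihood of observed choices given per-trial PC/LP vectors."""
    w_pc, w_lp, log_temp = params
    temp = np.exp(log_temp)  # keep the temperature positive
    nll = 0.0
    for pc, lp, c in zip(pc_traj, lp_traj, choices):
        p = choice_probs(pc, lp, w_pc, w_lp, temp)
        nll -= np.log(p[c] + 1e-12)
    return nll

def fit(pc_traj, lp_traj, choices, n_starts=5, seed=0):
    """Maximum-likelihood fit with L-BFGS-B from multiple random starts."""
    rng = np.random.default_rng(seed)
    best = None
    for _ in range(n_starts):
        x0 = rng.normal(size=3)
        res = minimize(neg_log_lik, x0, args=(pc_traj, lp_traj, choices),
                       method="L-BFGS-B")
        if best is None or res.fun < best.fun:
            best = res
    return best
```

Dropping the LP (or PC) term recovers the univariate variants, and per-participant AICs from these fits support the model comparison described above.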

Key Findings
  • Familiarization verified intended difficulty manipulation and equivalent initial performance across groups (no EG vs IG difference; robust main effect of activity difficulty).
  • Free-choice behavior: EG allocated significantly below-chance time to easy activities (A1, A2) and above-chance to the random unlearnable activity A4 (A1: 20.61%, t(1520) = -3.002, p = 0.003; A2: 19.29%, t(1520) = -3.910, p = 0.048; A4: 36.92%, t(1520) = 8.156, p < 0.001). IG showed a modest preference for the easiest A1 (33.00%, t(1520) = 5.330, p < 0.001) with lower time on others (A2: 21.42%, t(1520) = -2.387, p = 0.017). Group × activity interaction was significant (F(3,1140) = 14.578, p < 0.001).
  • Learning outcomes: EG achieved higher difficulty-weighted final performance (dwfPC) than IG (EG: M = 0.756, SD = 0.127; IG: M = 0.721, SD = 0.126; t(379.4) = 2.679, p = 0.008). Unweighted final PC likewise higher in EG (p = 0.011).
  • Individual variability: Many IG participants self-challenged and learned without explicit goals: 64.52% mastered ≥2 activities; 29.59% mastered all 3 (EG: 74.49% and 36.56%, respectively). Within NAM strata, IG and EG had similar final performance, indicating NAM captured achievement variation.
  • Self-challenge vs performance: dwfPC showed an inverted-U relationship with SC; a quadratic model (with linear and quadratic SC terms, plus controls) fit better than linear (ΔAIC = 11.775), with significant negative quadratic term (unweighted PC replication: adjusted R^2 = 0.191; quadratic coefficient = -0.017, t(360) = -3.561, p = 0.007). EG participants who failed to master all 3 over-challenged (higher SC), while IG under-challenged (lower SC); NAM3 achievers in both groups showed intermediate SC and similar allocations.
  • Modeling: The bivariate PC+LP model outperformed random and univariate PC-only or LP-only models (2-way ANOVA on AIC: effect of model form F(2,1089) = 43.992, p < 0.001; no interaction with instruction). Bivariate was best for the majority (EG: 70.74%; IG: 74.01%), with significant AIC improvements over the next-best model (Wilcoxon signed-rank, both groups p < 0.001). Simulations using fitted coefficients reproduced empirical allocation patterns across NAM and groups.
  • Coefficients: Normalized wPC and wLP were uncorrelated (IG: r = -0.077, p = 0.298; EG: r = 0.062, p = 0.399). wPC differed by instruction (IG positive; EG negative; F(1,363) = 40.240, p < 0.001), while wLP did not (F(1,363) = 0.065, p = 0.799), indicating separable influences.
  • LP vs PC-driven subgroups: PC-driven participants preferentially sampled A4 over A3; this preference was reduced/absent in LP-driven participants (significant negative interactions between activity and drive type in regressions; EG interaction slope = -47.628, p < 0.001; IG interaction slope = -125.179, p < 0.001). LP-driven participants achieved better learning: probability of mastering ≥2 activities 90.48% vs 70.59% (PC-driven), and all 3 activities 64.29% vs 34.98%, with LP-driven maintaining preference for learnable A3 and avoiding the unlearnable A4 over time.
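The inverted-U test of self-challenge versus performance amounts to comparing linear and quadratic regressions by AIC. A minimal sketch on synthetic data (the generating numbers here are hypothetical, not the study's), assuming ordinary least squares with Gaussian errors:

```python
import numpy as np

def ols_aic(X, y):
    """Fit OLS via least squares; return (coefficients, AIC under Gaussian errors)."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    n, k = X.shape
    sigma2 = resid @ resid / n
    log_lik = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)
    return beta, 2 * (k + 1) - 2 * log_lik  # +1 parameter for the error variance

# Hypothetical data: performance peaks at intermediate self-challenge (SC).
rng = np.random.default_rng(0)
sc = rng.uniform(0, 1, 300)
perf = 0.75 - 0.5 * (sc - 0.5) ** 2 + rng.normal(0, 0.05, 300)

X_lin = np.column_stack([np.ones_like(sc), sc])
X_quad = np.column_stack([np.ones_like(sc), sc, sc ** 2])
_, aic_lin = ols_aic(X_lin, perf)
beta_q, aic_quad = ols_aic(X_quad, perf)
# A positive aic_lin - aic_quad favors the quadratic model; a negative
# quadratic coefficient beta_q[2] implies an inverted-U shape.
```

As in the study, the comparison favors the quadratic model when performance genuinely peaks at intermediate self-challenge.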

Discussion

The findings directly address how humans self-organize learning in multi-activity settings: participants monitor not only their competence (percent correct) but also its temporal change (learning progress) and use both signals to guide exploration. Sensitivity to PC encourages sampling more difficult tasks and avoiding already-mastered ones, while sensitivity to LP helps avoid unlearnable or overly difficult activities, yielding efficient allocation of study time. This joint control produces an inverted-U relation between self-challenge and performance, with maximal learning at intermediate self-challenge. External instructions to maximize learning increased overall self-challenge and final performance but also induced some participants to persist on an unlearnable activity, demonstrating nuanced interactions between extrinsic and intrinsic motivations. The results connect biological curiosity to computational LP-based algorithms used in curriculum learning and suggest that the widely observed preference for intermediate complexity may reflect underlying LP monitoring mechanisms. These insights refine theories of exploration by showing that competence and LP signals are complementary, context-dependent drivers of curiosity-driven learning.

Conclusion

This work provides empirical evidence that humans dynamically monitor learning progress and integrate it with competence estimates to select among learning activities, leading to efficient self-organization of study time. A bivariate intrinsic utility combining PC and LP best explained choices, and LP sensitivity specifically protected against spending time on unlearnable tasks. Externally specified learning goals increased self-challenge and performance but could also bias some learners toward futile exploration of random tasks, underscoring the need to balance intrinsic and extrinsic drives. Future research should extend these paradigms to richer, more naturalistic environments with broader activity sets and longer horizons; incorporate additional factors such as forgetting, switching costs, effort, and uncertainty preferences into the learning models; collect trial-wise subjective probability and uncertainty ratings; and systematically manipulate environmental parameters (number and difficulty of activities, response categories, horizon) to map how different drives shape sustained interest and skill acquisition.

Limitations
  • The task used a limited set of four activities with simplified rule structures; generalization to complex real-world learning domains remains to be established.
  • The study did not model the internal learning process (e.g., forgetting, effort, switching costs, uncertainty preferences) and focused on choice utilities; thus causal mechanisms of learning dynamics were not dissected.
  • Novelty/familiarity was not explicitly modeled (and is partially confounded with past choices); PC was used as a proxy for competence/novelty in some interpretations.
  • Online MTurk sample and fixed compensation may limit generalizability and engagement control, despite exclusion criteria for response bias.
  • The LP signal was computed over fixed 15-trial windows and as an absolute change, which may not capture all forms of progress or regress in different contexts.
  • Presence of an unlearnable activity is a stylized feature; real-world unlearnability can be subtler, potentially affecting LP detection.