Humans monitor learning progress in curiosity-driven exploration

Psychology

A. Ten, P. Kaushik, et al.

Humans autonomously organize what and when to learn by using competence cues to avoid trivially easy tasks and by favoring activities that show learning progress, as demonstrated with a free-choice experiment and computational modeling. This research was conducted by Alexandr Ten, Pramod Kaushik, Pierre-Yves Oudeyer, and Jacqueline Gottlieb.
Introduction

Curiosity is a fundamental drive underlying human behavior, often studied as intrinsically motivated information seeking. Prior lab tasks typically probe short time scales where people request information without acting on it, yet natural curiosity involves sustained engagement with specific activities over extended periods. The central question is how humans self-organize investigations across multiple potential learning activities when time and resources are limited. Optimal study-time allocation is theoretically sensitive to unknown learning curves, making exact optimization impractical. Competing hypotheses suggest that people may prioritize by competence/difficulty (e.g., higher uncertainty or error rates) or prefer intermediate difficulty. Computational accounts also propose a mechanism based on learning progress (LP)—the temporal change in performance—that could efficiently guide exploration by avoiding both already-mastered and unlearnable tasks. This study tests whether humans dynamically monitor performance (percent correct, PC) and LP to guide free-choice exploration and whether such strategies emerge without explicit learning instructions.

Literature Review

Recent work shows that information can be rewarding in itself, supported by neural reward/motivation systems. However, most studies examine brief, unrelated events rather than sustained learning activities. Evidence on difficulty preferences is mixed: some studies show prioritization of high difficulty or uncertainty, while others highlight preference for intermediate difficulty across domains (trivia curiosity, sensorimotor choices, infant attention, aesthetic appreciation). LP-based control architectures from intrinsic motivation and curiosity research posit intrinsic rewards for activities where recent performance changes, enabling agents to avoid both familiar and unlearnable tasks and to self-organize curricula without knowing precise learning trajectories. These ideas have been influential in machine learning (automated curricula, educational technologies), yet empirical demonstrations in humans of sensitivity to dynamic LP signals—as distinct from static competence—have been lacking.

Methodology

Participants: 400 adults (19–71 years; 208 female, 187 male, 5 undisclosed) recruited on Amazon Mechanical Turk. Eighteen were excluded for response bias, yielding N=382 (EG: 196; IG: 186). Procedures were IRB-approved (University of Rochester); compensation was $1 regardless of performance.

Design: Within-subject manipulation of activity difficulty; between-subject manipulation of instruction. Each trial consisted of (1) a free choice among 4 activity icons (monster families), (2) a binary guess of the preferred food of a presented monster, and (3) immediate feedback. Stages: 15 forced-choice familiarization trials per activity; a 250-trial free-play stage; subjective ratings before and after free play.

Activities: A1 (easiest), 1-D categorization; A2, 1-D categorization with an irrelevant second feature; A3, conjunction of 2 features (hardest learnable); A4, random/unlearnable (food preference assigned randomly per monster).

Instructions: The external-goal (EG) group was asked to maximize learning across activities and informed of a post-session test (which they received); the internal-goal (IG) group was told to freely choose activities, with no explicit learning goal.

Measures: Percent correct (PC) during familiarization confirmed graded difficulty. Learning achievement was categorized by the number of activities mastered (NAM1/2/3) using a criterion of 13/15 correct (86.7%). A self-challenge (SC) index was computed per trial as the normalized recent PC of the chosen activity relative to the participant's experienced PC range, averaged over free play; SC near 1 indicates choosing the most difficult tasks (lowest PC), near 0 the easiest (highest PC). Difficulty-weighted final PC (dwfPC) was computed from the last 15 trials on A1–A3, with weights proportional to difficulty rank.
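The SC index described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names and the handling of a degenerate PC range are assumptions.

```python
def self_challenge(recent_pc_chosen, pc_min, pc_max):
    """Per-trial self-challenge: position of the chosen activity's recent
    percent-correct within the participant's experienced PC range,
    inverted so that 1 = hardest (lowest PC) and 0 = easiest (highest PC)."""
    if pc_max == pc_min:
        return 0.5  # degenerate range: a convention, not specified in the text
    return (pc_max - recent_pc_chosen) / (pc_max - pc_min)

def sc_score(trial_pcs, pc_min, pc_max):
    """Participant-level SC: average of the per-trial index over free play."""
    return sum(self_challenge(pc, pc_min, pc_max) for pc in trial_pcs) / len(trial_pcs)
```

For example, a participant whose experienced PC range is [0.5, 1.0] and who always picks the activity where they currently score 0.5 would have SC = 1 (maximal self-challenge).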
Computational modeling: A softmax bandit model with intrinsic utility U_{i,t} = w_PC · PC_{i,t} + w_LP · LP_{i,t}, where PC_{i,t} is accuracy over the last 15 trials on activity i and LP_{i,t} is the change in performance within that 15-trial window (the difference between the recent and earlier segments), following prior intrinsic-motivation models. Models were fit per participant by maximum likelihood, with parameters w_PC, w_LP, and temperature τ, using multiple random initializations and L-BFGS-B optimization to convergence. Model comparisons included bivariate (PC+LP), univariate (PC only or LP only), and random-choice baselines, compared by AIC. Simulations used the fitted coefficients to reproduce time-allocation patterns across groups and NAM categories.

Statistics: Mixed ANOVAs, regression (including linear-quadratic fits), and Wilcoxon signed-rank tests; analyses were conducted in R and Python; all tests were two-tailed.
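A runnable sketch of the choice model above, assuming a standard softmax over the PC and LP utilities. The split-window operationalization of LP (recent half minus earlier half of the 15-trial window) is an assumption; the summary specifies only "recent vs. earlier segments".

```python
import math

def window_stats(outcomes, w=15):
    """PC over the last w trials and LP as the difference between the
    recent and earlier halves of that window (exact segmentation is an
    assumption, not taken from the paper)."""
    recent = outcomes[-w:]
    pc = sum(recent) / len(recent)
    half = len(recent) // 2
    lp = sum(recent[half:]) / (len(recent) - half) - sum(recent[:half]) / half
    return pc, lp

def choice_probabilities(pc, lp, w_pc, w_lp, tau):
    """Softmax over intrinsic utilities U_i = w_pc*PC_i + w_lp*LP_i,
    one entry per activity; tau is the softmax temperature."""
    u = [w_pc * p + w_lp * l for p, l in zip(pc, lp)]
    m = max(u)  # subtract the max for numerical stability
    e = [math.exp((x - m) / tau) for x in u]
    s = sum(e)
    return [x / s for x in e]
```

Per-participant fitting would then maximize the log-likelihood of the observed choice sequence over w_pc, w_lp, and tau (e.g., with L-BFGS-B from several random starting points, as the paper describes).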

Key Findings

Manipulation checks: Familiarization PC confirmed distinct difficulty levels across activities, with no EG vs. IG differences (group: F(1,380)=1.829, p=0.177; group×difficulty: F(3,1140)=0.820, p=0.483; main effect of activity: F(3,1140)=158.400, p<0.001; all pairwise Tukey HSD p<0.01).

Group-level choices: EG participants allocated below chance to the easiest activities and above chance to A4 (unlearnable); IG showed a slight bias toward the easiest activity. EG allocation: A1 20.61% (t(1520)=−3.002, p=0.003), A2 19.29% (t(1520)=−3.910, p=0.048), A4 36.92% (t(1520)=8.156, p<0.001). IG allocation: A1 33.00% (t(1520)=5.330, p<0.001), A2 21.42% (t(1520)=−2.387, p=0.017), A3 22.16% (p>0.05), A4 23.43% (p>0.05). Instruction×activity interaction: F(3,1140)=14.578, p<0.001.

Learning outcomes: EG achieved higher final performance (dwfPC: EG M=0.756, SD=0.127; IG M=0.721, SD=0.126; t(379.4)=2.679, p=0.008). Unweighted average PC was also higher in EG (M=0.787 vs. 0.756; t(378.1)=2.539, p=0.011).

Individual variability: Many IG participants self-challenged and achieved high mastery; 64.52% mastered ≥2 activities and 29.59% mastered all 3 (EG: 74.49% and 36.56%, respectively). Within NAM groups, final performance did not differ by instruction, indicating that NAM captures the variability in learning achievement.

Time allocation by achievement: In IG, NAM1/2 favored easier tasks, whereas NAM3 resembled EG, with a preference for harder activities; activity and activity×NAM effects were significant in both groups.

Self-challenge vs. performance: dwfPC exhibited an inverted-U relationship with SC; adding a quadratic SC term improved the fit (ΔAIC=11.775). Linear-quadratic model: R^2_adj=0.159, F(4,360)=18.238, p<0.001; quadratic coefficient negative (−0.016, t(360)=−1.966, p<0.001). This replicated with unweighted PC (R^2_adj=0.191, F(4,360)=13.642, p<0.001; quadratic coefficient −0.017, t(360)=−3.561, p=0.007).
EG participants who failed to master all 3 activities tended to over-challenge, whereas their IG counterparts under-challenged; NAM3 participants in both groups had intermediate SC.

Computational modeling: The bivariate PC+LP model best explained choices relative to the random and univariate baselines (mean AIC 491.992 vs. random 693.147). ANOVA on AIC: model form was significant (F(2,1089)=43.992, p<0.001), with no interaction with instruction (p=0.716). The bivariate model was best for most participants (EG: 70.74%; IG: 74.01%) and had significantly lower AIC than the next-best model (EG mean Δ=21.503, Z(188)=55, p<0.001; IG mean Δ=21.882, Z(177)=46, p<0.001); it was ≥2 AIC points better for 58.51% of EG and 62.71% of IG participants. Including LP improved fits irrespective of instruction, indicating dynamic LP monitoring beyond static PC.

Coefficient properties: Normalized w_PC and w_LP were uncorrelated (IG r=−0.077, p=0.298; EG r=0.062, p=0.399). Instructions affected the PC coefficients (IG M=0.255, SD=0.724; EG M=−0.232, SD=0.741; F(1,363)=40.240, p<0.001) but not the LP coefficients (IG M=0.079, SD=0.640; EG M=0.062, SD=0.631; F(1,363)=0.065, p=0.799).

Strategy subgroups: Comparing PC-driven participants (negative PC weight, near-zero LP weight) with LP-driven participants (positive LP weight, near-zero PC weight), both preferred harder activities, but the PC-driven group over-selected the unlearnable A4 relative to the learnable A3 (EG slope=76.485, t(104)=7.019, p<0.001; IG slope=83.941, t(72)=5.199, p<0.001); this A4 bias was reduced or absent in the LP-driven group (interaction: EG −47.628, t(104)=−2.726, p<0.001; IG −125.179, t(72)=5.764, p<0.001). Learning outcomes favored the LP-driven group: by the end of free play, 90.48% had mastered ≥2 activities (vs. 70.59% of PC-driven) and 64.29% had mastered all 3 (vs. 34.98%). Simulations using the fitted coefficients reproduced the empirical time-allocation patterns across groups and NAM levels.
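The AIC comparisons reported above follow the standard definition, AIC = 2k − 2·ln(L), with the conventional ≥2-point cutoff for a meaningful difference. This helper is illustrative, not the authors' analysis code:

```python
def aic(log_likelihood, n_params):
    """Akaike Information Criterion: 2k - 2*ln(L); lower is better."""
    return 2 * n_params - 2 * log_likelihood

def better_by(aic_a, aic_b, threshold=2.0):
    """True if model A beats model B by at least `threshold` AIC points
    (the '>=2 AIC points better' criterion used above)."""
    return (aic_b - aic_a) >= threshold
```

For instance, the bivariate model's mean AIC of 491.992 against the random baseline's 693.147 clears this threshold by a wide margin.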

Discussion

The study directly demonstrates that humans monitor learning progress (LP) during free-choice exploration and use it, together with competence/error information (PC), to self-organize study time. LP-based control helps avoid unlearnable tasks while PC-based control promotes exploration of more difficult, unfamiliar activities. The observed inverted-U relation between self-challenge and final performance highlights that intermediate challenge maximizes learning, aligning with broad evidence for preferences for intermediate complexity across perception, attention, and aesthetics. Importantly, PC and LP influences were uncorrelated and differentially modulated by instruction, suggesting complementary roles rather than mutually exclusive mechanisms. Extrinsic instruction to maximize learning increased self-challenge and overall performance but also led some participants to persist on an unlearnable task, illustrating how external goals can both bolster and hinder efficient learning strategies depending on context. These results bridge biological and computational theories of curiosity and intrinsic motivation, supporting LP-based architectures as biologically plausible heuristics for organizing extended learning without precise knowledge of future learning curves.

Conclusion

This work shows that humans dynamically track and use learning progress to guide curiosity-driven exploration, in tandem with competence-based signals. A model combining PC and LP best explains activity choices, and an intermediate level of self-challenge yields superior learning outcomes. LP-driven strategies reduce time spent on unlearnable tasks and improve mastery compared to PC-driven strategies. These findings unify insights from artificial curiosity and human cognition, offering tools to probe intrinsic motivation in extended learning. Future research should extend the paradigm to richer, more naturalistic sets of activities, and incorporate modeling of learning dynamics and modulators such as forgetting, switching costs, effort, and uncertainty preferences. Collecting trial-wise subjective probability estimates could illuminate evolving inferences. Understanding how to balance extrinsic objectives with intrinsic drives may inform educational technologies and support the development of sustained interests and lifelong skills.

Limitations

The task provided a limited, simplified set of learning activities, which may not capture the complexity of real-world environments with many challenging and unlearnable tasks. Modeling focused on activity selection utilities (PC and LP) and did not explicitly model learning mechanisms or factors such as forgetting, switching costs, effort, or preferences for uncertainty. Novelty/familiarity is defined by prior choices in this design, making it circular for explaining choices. Participants were recruited online with a uniform, low compensation and no performance-based incentives, which may limit generalizability. External test data were only available for the EG group; analyses used free-play data for comparability.
