Novelty as a drive of human exploration in complex stochastic environments

Psychology

A. Modirshanechi, W.-H. Lin, et al.

Humans explore environments even when rewards require intermediate, reward-free steps. This study shows that people persistently investigate a highly stochastic but reward-free subregion, and that their behavior is best explained by novelty-driven exploration rather than by information gain or surprise. Research conducted by Alireza Modirshanechi, Wei-Hsiang Lin, He A. Xu, Michael H. Herzog, and Wulfram Gerstner.
Introduction

The study investigates how humans explore complex, multistep environments to locate extrinsic rewards, probing the intrinsic drives that guide exploration in the absence of immediate feedback. Prior computational work posits that intrinsic rewards—novelty, surprise, information gain—shape exploration, but these signals lead to distinct strategies, especially in stochastic settings where agents can be distracted by non-rewarding noise (the noisy TV problem). The key research questions are: do humans exhibit attraction to reward-free stochasticity, and if so, does this attraction persist (as in novelty/surprise) or diminish with learning (as in information gain)? To address this, the authors design a multistep environment with localized stochasticity to dissociate exploration strategies and examine the role of reward optimism (expecting higher-than-found rewards) in modulating exploration.

Literature Review

The paper situates its contribution within debates on human exploration in reinforcement learning and decision-making, noting that most empirical paradigms are simple (1–2 step bandits) and may not generalize to real-world multistep tasks. It reviews intrinsically motivated RL and psychology-driven constructs (novelty, surprise, information gain) and the noisy TV problem in machine learning, where agents can be distracted by stochastic stimuli. Prior human studies report combinations of random and directed exploration, dissociable motivational signals, and neural correlates, but the precise intrinsic drive remains debated. Earlier work by the authors showed novelty best explains exploration in deterministic multistep environments; here they extend to heterogeneous stochastic settings to test robustness. The review also touches on graph learning, model-based vs. model-free control, and recent progress in complex cognitive modeling.

Methodology

Participants and design: 63 participants were recruited (57 analyzed after exclusions; 27 female; mean age 24.1 ± 4.1 years). Each participant completed five episodes in a computerized multistep environment with 61 states (58 non-goal + 3 goal states). Episodes began at state 1 or 2 and ended upon reaching a goal state.

Environment structure: States were grouped into progressing states (1–6), trap states (7–8), and a stochastic subregion (S-1 to S-50). In progressing states, actions could move agents closer to the goals, loop, or send them to traps; transitions were deterministic except for the progressing action in state 6, which led to goal G* with probability 1 − ε and to G1 or G2 with probability ε/2 each (ε ≈ 10⁻⁸). From state 4, a designated stochastic action jumped uniformly to any stochastic state. Within stochastic states, one fixed action returned to state 4; the other two actions transitioned uniformly to another stochastic state. Trap states had deterministic transitions (either staying within the traps or returning to state 1). A minimal sketch of these dynamics is given at the end of this section.

Stimuli and actions: States were presented as images; three disks below each image represented actions. The assignment of images to states and of disks to actions was random but fixed, except that the first disk each participant chose in state 4 was set to be the stochastic action (ensuring at least one visit to the stochastic part in episode 1).

Reward manipulation and optimism: Participants knew there were three goal images worth 2, 3, or 4 CHF. For each participant, these reward values were randomly assigned to the three goal locations (G*, G1, G2) and fixed throughout. Because transitions to G* were near-deterministic, participants primarily encountered G*. Groups were defined by the reward value assigned to G*: 2 CHF (n = 21 after exclusions), 3 CHF (n = 19), and 4 CHF (n = 17). Reward optimism was defined as the expectation of a higher-value goal than the discovered one; it varied across groups and influenced exploration pressure in episodes 2–5.

Behavioral measures: Exploration was quantified by search duration (number of actions before returning to G*) and by the fraction of time steps spent in the stochastic part, across episodes.

Reconstruction task: After the experiment, participants were asked to draw transitions among images for the progressing states (1–6), trap states (7–8), one stochastic state (S-44), and a novel image (X), presented in a pseudorandom layout. Links were analyzed as undirected; goal states were excluded. A reconstruction score was defined as (correct links/13) − (incorrect links/24), with chance level at 0.

Exclusions: 2 participants did not finish; 4 were excluded for extreme action counts (>3× the group average in episodes 2–5).

Computational modeling: Human behavior was modeled as intrinsically motivated RL with dual value streams for intrinsic and extrinsic rewards. Two parallel RL systems (hybrid model-based/model-free) computed Q_int and Q_ext; actions were sampled via a softmax policy π(a|s) ∝ exp[β_ext Q_ext(s, a) + β_int Q_int(s, a)]. The intrinsic reward was defined as novelty r_int = −log P_{t−1}(s_t), surprise r_int = −log P_{t−1}(s_t | s_{t−1}, a_{t−1}), or information gain r_int = D_KL(P_{t−1} || P_t); a control model had r_int = 0 with optimistic initialization (see the modeling sketch below). Transition probabilities and state-visit frequencies were estimated nonparametrically in an expanding state space. β_int and β_ext were allowed to differ between episode 1 and episodes 2–5 and, in episodes 2–5, to depend on the discovered G* value, capturing reward optimism.
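
To make the environment dynamics concrete, here is a minimal, illustrative Python sketch of the transitions described above. It is a sketch under stated assumptions, not the authors' implementation: the action labels ('progress', 'stochastic', 'return') and the step helper are hypothetical, and the remaining deterministic transitions are omitted.

    import random

    EPS = 1e-8                                # leak probability from state 6
    STOCH = [f"S-{i}" for i in range(1, 51)]  # stochastic subregion S-1..S-50

    def step(state, action):
        """One illustrative transition; only the rules follow the text."""
        if state == 6 and action == "progress":
            # Goal G* with probability 1 - eps; otherwise G1/G2, eps/2 each
            u = random.random()
            if u < 1 - EPS:
                return "G*"
            return "G1" if u < 1 - EPS / 2 else "G2"
        if state == 4 and action == "stochastic":
            return random.choice(STOCH)       # uniform jump into S-1..S-50
        if isinstance(state, str) and state.startswith("S-"):
            if action == "return":
                return 4                      # the one fixed action back to state 4
            return random.choice(STOCH)       # other actions: uniform re-draw
        # Progressing, looping, and trap transitions are deterministic and
        # would be enumerated in a lookup table in a full implementation.
        raise NotImplementedError(f"transition for ({state}, {action}) omitted")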
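
The three candidate intrinsic rewards and the softmax policy can be written compactly. The sketch below assumes simple count-based estimators with add-one smoothing (our assumption; the paper estimates transition probabilities nonparametrically in an expanding state space):

    import numpy as np

    def novelty(visit_counts, s, t):
        """Novelty: -log of the smoothed empirical frequency of state s."""
        n_states = len(visit_counts)
        return -np.log((visit_counts[s] + 1) / (t + n_states))

    def surprise(trans_probs, s_prev, a_prev, s):
        """Surprise: -log P_{t-1}(s_t | s_{t-1}, a_{t-1})."""
        return -np.log(trans_probs[s_prev, a_prev, s])

    def information_gain(p_before, p_after):
        """Information gain: KL divergence between transition beliefs
        before and after the update, D_KL(P_{t-1} || P_t) as in the text."""
        return float(np.sum(p_before * np.log(p_before / p_after)))

    def softmax_policy(q_ext, q_int, beta_ext, beta_int):
        """pi(a|s) proportional to exp(beta_ext*Q_ext + beta_int*Q_int)."""
        logits = beta_ext * q_ext + beta_int * q_int
        logits = logits - logits.max()    # shift for numerical stability
        p = np.exp(logits)
        return p / p.sum()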

Model fitting and comparison: Four candidate algorithms (novelty, information gain, surprise, no intrinsic reward) were fit via stratified threefold cross-validation, maximizing likelihood on training participants and evaluating log-likelihood on held-out test participants. Bayesian model comparison assessed expected posterior probability and protected exceedance probability. Posterior predictive checks (PPC) simulated each fitted model to compare 43 group-level summary statistics with human data. Action-level predictive accuracy was assessed per episode under cross-validation.
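
As a schematic of this procedure, the following sketch shows stratified threefold cross-validation over participants, stratified by reward group; the model.fit / model.log_likelihood interface is hypothetical, standing in for the authors' pipeline:

    import numpy as np
    from sklearn.model_selection import StratifiedKFold

    def cv_test_loglik(model, participants, reward_groups, n_splits=3):
        """Fit on training participants; score held-out log-likelihood."""
        skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=0)
        test_ll = np.zeros(len(participants))
        for train_idx, test_idx in skf.split(participants, reward_groups):
            params = model.fit([participants[i] for i in train_idx])
            for i in test_idx:
                test_ll[i] = model.log_likelihood(participants[i], params)
        return test_ll  # per-participant scores feed Bayesian model comparison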

Key Findings
  • Episode 1 dynamics: Participants learned to avoid traps and were attracted to the stochastic subregion within the first episode. They spent less time in traps and more time in the stochastic part in the second half than in the first half of episode 1 (traps: t = −6.35; 95% CI = [−0.186, −0.097]; stochastic: t = 4.25; 95% CI = [0.073, 0.203]; DF = 56).
  • Reward optimism effects (episodes 2–5): Exploration increased with reward optimism. Search duration and the fraction of time in the stochastic part were negatively correlated with the value of G* (search duration: Pearson t = −4.2; 95% CI = [−0.67, −0.27]; DF = 55; fraction in stochastic: t = −4.7; 95% CI = [−0.70, −0.32]; DF = 55; a correlation-test sketch follows this list). Even the 4 CHF group, which had no incentive to seek higher rewards, showed non-negligible stochastic exploration.
  • Persistence in the most optimistic group: The 2 CHF group exhibited constant search duration across episodes 2–5 (one-sample t on individual correlations: t = 0.63; 95% CI = [−0.20, 0.37]; DF = 20; Bayes Factor = 1/3.7 favoring the null), indicating persistent exploration of the stochastic part despite the potential to reduce exploration with learning.
  • Environment reconstruction: All groups achieved above-chance reconstruction scores (2 CHF: t = 16.9; 95% CI = [0.49, 0.63]; DF = 20; 3 CHF: t = 15.1; 95% CI = [0.44, 0.58]; DF = 18; 4 CHF: t = 10.0; 95% CI = [0.35, 0.53]; DF = 16). The specific link between state 4 and S-44 was more often correctly drawn than erroneous links to a novel image or multi-hop nonexisting links (vs. X: t = 2.46; 95% CI = [0.03, 0.32]; DF = 56; vs. 2-hop: t = 3.88; 95% CI = [0.12, 0.37]; DF = 56; vs. n-hop with n > 2: t = 4.44; 95% CI = [0.15, 0.39]; DF = 56). Participants who reconstructed the 4–S-44 link had visited S-44 more often (t = 3.20; 95% CI = [2.4, 11.4]; DF = 20.9) and experienced 4↔S-44 transitions more often (t = 2.14; 95% CI = [0.01, 0.97]; DF = 18.3).
  • Model comparison: Bayesian model comparison favored novelty-seeking over information gain, surprise, and no intrinsic reward. Novelty-seeking had the highest expected posterior probability and protected exceedance probability across participants and within each reward group; model recovery confirmed identifiability.
  • Predictive checks: Posterior predictive checks showed that fitted novelty-seeking reproduced key qualitative patterns (e.g., increased stochastic exploration and reduced trap time in episode 1; persistent exploration in 2 CHF group). Across 43 summary statistics, novelty-seeking had the lowest median and average relative errors.
  • Action prediction: Cross-validated action prediction accuracy exceeded chance (33%) in all episodes, rising from >40% in episode 1 to >60% in episode 5 for the novelty-seeking model.
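
For reference, the correlation tests reported above (a Pearson correlation summarized as a t statistic with a confidence interval on r) can be reproduced with standard tools. The sketch below uses synthetic placeholder data, not the study's; pearsonr's confidence_interval method requires SciPy >= 1.9:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    g_star_value = rng.choice([2, 3, 4], size=57)   # placeholder group labels
    search_duration = 40 - 8 * g_star_value + rng.normal(0, 10, size=57)

    res = stats.pearsonr(g_star_value, search_duration)
    n = len(g_star_value)
    t_stat = res.statistic * np.sqrt((n - 2) / (1 - res.statistic ** 2))
    ci = res.confidence_interval(confidence_level=0.95)
    print(f"r = {res.statistic:.2f}, t(DF={n - 2}) = {t_stat:.2f}, "
          f"95% CI = [{ci.low:.2f}, {ci.high:.2f}]")
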
Discussion

Findings indicate that human exploration in complex, multistep stochastic environments is guided by novelty and modulated by reward optimism. Participants overexplored the stochastic (reward-free) subregion, consistent with a novelty-driven strategy prone to the noisy TV problem. Reward optimism primarily affected the extent of exploration, not the underlying strategy, as novelty-seeking best explained action choices across groups and episodes. Despite persistent stochastic exploration being theoretically suboptimal compared to information-gain-driven strategies that reduce exploration with learning, participants still acquired accurate cognitive maps of the environment, ruling out poor graph learning as the cause. The paradigm parallels modern settings (e.g., social media endless scrolling), offering a potential explanatory mechanism for continued exploration despite diminishing extrinsic incentives. The computational model, while complex, serves as an interpretable data-summarization tool showing that novelty-seeking RL captures the statistical structure of human behavior better than alternative intrinsic motivations.

Conclusion

Human exploration reflects an interplay between intrinsic (novelty) and extrinsic rewards governed by reward optimism. In a multistep stochastic environment, participants persistently explored a reward-free stochastic subregion, and their actions were most accurately modeled by a hybrid RL algorithm with a novelty-seeking component. The results advance understanding of human exploration beyond simple tasks, highlighting conditions under which novelty can drive suboptimal overexploration. Future work should examine generalization to reward-free settings (e.g., passive viewing, navigation), incorporate time/action costs, and leverage complementary modeling approaches to probe underlying cognitive mechanisms.

Limitations

The modeling approach employs complex hybrid RL architectures with roughly 30 parameters and functions primarily as a data-summarization tool rather than a definitive cognitive theory; fitted models may therefore not reflect the true underlying cognitive processes. The localized noisy TV implementation differs from passive visual noise in that the stochastic content is action-contingent, limiting direct generalization. Reward optimism was modeled via inverse-temperature parameters and may not capture all motivational nuances. The task excluded explicit time/action costs to maintain alignment with machine learning noisy TV formulations, potentially limiting ecological validity. The conclusions depend on the specific mathematical definitions of novelty, surprise, and information gain; although they are expected to be robust across common formulations, some variability may arise with alternative definitions.
