Distinct value computations support rapid sequential decisions

A. Mah, S. S. Schiereck, et al.

This study by Andrew Mah, Shannon S. Schiereck, Veronica Bossio, and Christine M. Constantinople examines how rats use the value of their environment to make rapid sequential decisions. Using a high-throughput temporal wagering task with 291 rats, the authors identify distinct computations underlying trial initiation and wait times.

Introduction

The study investigates how animals compute the value of their environment to guide rapid sequential decisions. Value can be computed via model-free (cached) methods or model-based (inference/planning) methods, each with trade-offs in flexibility and computational efficiency. A key open question is whether and how multiple value computations are selected or combined on behaviorally relevant timescales of seconds within the same subject. Traditional tasks and analyses often obscure moment-by-moment changes by relying on binary choices and session-level regressions, leaving it unclear how distinct computations might interact on rapid timescales. The authors aim to dissociate the value computations used for motivation (when to initiate a trial) from those used for deliberation (how long to wait for a reward) and to test whether these computations interact.

Literature Review

Prior work in reinforcement learning delineates model-free and model-based approaches, supported by distinct neural circuits, with known speed–flexibility trade-offs. Response vigor scales with expected reward rate, linking environmental value to motivation. Analytic approaches often pool across sessions, masking rapid changes in value estimates. Theoretical accounts suggest uncertainty-based arbitration between strategies, and efficient coding theories predict context-dependent value representations. The study builds on foraging theory (opportunity cost as long-run average reward), regression-based dissection of model-based vs model-free influences, and prior findings that animals adjust learning rates to environmental volatility.

Methodology

Subjects: 291 Long-Evans rats (184 male, 107 female; a subset from transgenic lines) were water-restricted and trained in a high-throughput facility with automated data collection and version-controlled software.

Apparatus: Custom operant boxes had three nose ports (each with an LED and an IR beam), stereo speakers, and solenoid-controlled water delivery. Tasks ran on an Arduino-based Bpod system with a Matlab interface.

Task: In the temporal wagering task, the reward environment changed across hidden states (blocks). On each trial, a tone indicated the offered water volume (5, 10, 20, 40, or 80 µL; females received 4, 8, 16, 32, or 64 µL, treated as equivalent). A side LED indicated the potential reward port, and the reward became available after a variable, exponentially distributed delay (mean ~2.5 s). Rats could wait for the reward or opt out at any time by poking the opposite port. On catch trials (15–25%), the reward was withheld, forcing an opt-out; wait times on these trials provide a continuous readout of subjective value. Trials were self-paced; trial initiation time was the interval from the outcome (reward or opt-out) to the next center poke.

Hidden states (blocks): Each block comprised 40 completed trials and was low (5, 10, 20 µL), high (20, 40, 80 µL), or mixed (all volumes). Blocks alternated mixed, high, mixed, low, and so on. Because the 20 µL offer occurred in every block, it enabled comparisons across contexts. For modeling, the block hazard rate was approximated as flat (1/40).

Shaping: Task elements were introduced progressively over 8 stages; all analyses use stage 8 (the full task).

Analyses: Behavioral sensitivity was assessed with z-scored wait and initiation times, and block effects were tested with Wilcoxon signed-rank or rank-sum tests. In mixed blocks, regressions related the current and previous offers (log2 reward) to wait times, and previous offers only to trial initiation times; exponential decay time constants (τ) were estimated from these regressions. Dynamics around block transitions were quantified and compared across rats.
Learning dynamics were tracked across training sessions by regressing wait times on block identity and reward, and initiation times on previous reward.

Computational models: Separate models described wait time and trial initiation time, both depending on the environmental value κ.

Wait-time model: WT = D·τ·log[(R − κτ)/(C − κτ)], where τ is the time constant of the reward delay, C is the reward probability (1 − catch probability), R is the offered reward, κ stands in for the opportunity cost, and D is a scaling factor. The model predicts that rats quit when the value of the current trial falls below the environmental value.

Trial initiation model: D/TI = κ, i.e., TI = D/κ (a vigor relation in which a higher environmental value produces faster initiation).

Two computations for environmental value were evaluated:

Inferential model (Most Likely State): The model uses three block-specific values (κ_low, κ_mixed, κ_high). On each trial, Bayesian inference over the hidden block combines the likelihood of the observed rewards with a prior reflecting the block transition structure, and the model uses the κ of the most probable block. A belief-state variant instead uses the posterior-weighted average of the block values, and a suboptimal-prior variant introduces a parameter λ that interpolates between the optimal prior and a flat prior.
Retrospective model: The model maintains a recency-weighted running average of rewards through a temporal-difference update, k_{t+1} = k_t + α δ_t with δ_t = r_t − k_t. The learning rate is dynamic, α = α0·G_t, where the gain G_t scales with trial-by-trial changes in posterior beliefs (changes in the probability of the mixed block under the inferential model), capturing rapid transitions. Static and RPE-based gains were also tested but underperformed.

Model fitting and comparison: Models were fit by maximum likelihood with 100 random starts and 5-fold cross-validation. Noise was modeled as log-normal with fixed variance (8 s for wait times, 4 s for initiation times). Models were compared by BIC, and the comparisons were confirmed with AIC and cross-validated negative log-likelihood (nLL). Primary fits were to wait-time data; because initiation times are heavy-tailed and likely reflect multiple interacting processes, initiation-time models were used for qualitative predictions.

Pre-initiation cue manipulation: In a subset of rats (N = 16), the reward cue tone was played before trial initiation (and again at initiation) to reduce state uncertainty before rats initiated trials; behavior was compared with that in the original task.
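The block inference in the inferential model can be illustrated with a minimal Python sketch of a hidden-Markov forward filter: predict the next block with a transition matrix built from the flat 1/40 hazard, weight each block by the likelihood of the observed offer, and renormalize. It assumes that offers are uniform within a block, that a mixed block is followed by a low or high block with equal probability (i.e., the animal does not track the strict alternation), and that the λ variant mixes priors as λ·(transition prior) + (1 − λ)·(flat prior). These assumptions, the function names (block_schedule, infer_blocks, kappa_estimates), and the κ values in the example are hypothetical illustrations, not the authors' code.

```python
import numpy as np

# Block definitions from the task description (male volumes, in µL).
BLOCKS = ("low", "mixed", "high")
VOLUMES = {
    "low": (5, 10, 20),
    "mixed": (5, 10, 20, 40, 80),
    "high": (20, 40, 80),
}


def block_schedule(n_blocks, trials_per_block=40):
    """Block label for every trial, cycling mixed -> high -> mixed -> low."""
    cycle = ("mixed", "high", "mixed", "low")
    return [cycle[i % 4] for i in range(n_blocks) for _ in range(trials_per_block)]


def simulate_offers(blocks, rng):
    """Draw each trial's offer uniformly from its block's volumes (assumption)."""
    return [int(rng.choice(VOLUMES[b])) for b in blocks]


def transition_matrix(hazard=1 / 40):
    """T[i, j] = P(block j on the next trial | block i now), flat hazard.

    Assumes a mixed block is followed by low or high with equal probability.
    """
    lo, mx, hi = 0, 1, 2
    T = np.zeros((3, 3))
    T[lo, lo] = T[hi, hi] = T[mx, mx] = 1 - hazard
    T[lo, mx] = T[hi, mx] = hazard
    T[mx, lo] = T[mx, hi] = hazard / 2
    return T


def likelihood(reward):
    """P(offer | block) for (low, mixed, high) under uniform offers."""
    return np.array([(reward in VOLUMES[b]) / len(VOLUMES[b]) for b in BLOCKS])


def infer_blocks(offers, hazard=1 / 40, lam=1.0):
    """Posterior over (low, mixed, high) after each offer.

    lam = 1 uses the transition-based prior, lam = 0 a flat prior; values in
    between mimic a hypothetical 'suboptimal prior' variant.
    """
    T = transition_matrix(hazard)
    flat = np.full(3, 1 / 3)
    post = flat.copy()
    history = []
    for r in offers:
        prior = lam * (post @ T) + (1 - lam) * flat  # predict step
        post = likelihood(r) * prior                  # weight by the offer
        post = post / post.sum()                      # renormalize
        history.append(post)
    return np.array(history)


def kappa_estimates(posteriors, kappa):
    """Per-trial environmental value: Most Likely State and belief-weighted."""
    k = np.array([kappa[b] for b in BLOCKS])
    most_likely = k[posteriors.argmax(axis=1)]
    belief_weighted = posteriors @ k
    return most_likely, belief_weighted


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    blocks = block_schedule(n_blocks=4)
    offers = simulate_offers(blocks, rng)
    post = infer_blocks(offers)
    # Hypothetical block values, for illustration only.
    mls, belief = kappa_estimates(post, {"low": 0.5, "mixed": 1.0, "high": 2.0})
    print(mls[:5], belief[:5])
```

Because a single 5 µL offer has zero likelihood under the high block (and 80 µL under the low block), one offer can rule out a block outright, which is one way a hidden-state model can let wait times converge quickly after a transition into a mixed block.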

Key Findings

Both deliberative (wait time) and motivational (trial initiation time) behaviors were modulated by environmental value (block type). Rats waited less and initiated faster in high-value blocks versus low-value blocks. For 20 µL offers, rats waited ~10% less time in high vs low blocks (p < 0.001, Wilcoxon signed-rank, N=291). Trial initiation times were shorter in high blocks (population p = 1.82 × 10^−122, Wilcoxon signed-rank, N=291).
Distinct temporal dynamics: At transitions into mixed blocks, wait times rapidly converged to a common value irrespective of previous block, indicating a fixed estimate of environmental value within mixed blocks. Trial initiation times showed long-timescale dependence on prior block identity and exponentially weighted sensitivity to previous rewards, reminiscent of model-free temporal-difference learning.
Regression analyses in mixed blocks showed that wait times depended primarily on the current offer, whereas initiation times depended on a recency-weighted history of previous offers. Exponential decay time constants (τ) differed significantly between initiation and wait times (p = 1.30 × 10^−49, Wilcoxon signed-rank, N = 291) and were uncorrelated across rats (r = −0.03, p = 0.53, Pearson).
Individual differences: Classifying rats by fast vs slow temporal integration of initiation times separated initiation dynamics (p < 0.05) but not wait-time dynamics (p = 0.1), supporting distinct computations.
Trial-history sensitivity: In mixed blocks, 87% of rats showed no significant dependence of 20 µL wait times on the previous offer size (p > 0.05, N = 253/291), whereas 88% showed significant previous-offer effects on initiation times (p < 0.05, N = 256/291). Across rats, sensitivity to previous offers was greater for initiation times than for wait times (p = 6.21 × 10^−3, Wilcoxon signed-rank, N = 291).
Modeling: The inferential (hidden-state) model fit wait times better than the retrospective model (BIC, p = 1.07 × 10^−3, Wilcoxon signed-rank, N = 291), capturing their convergence in mixed blocks and their insensitivity to previous offers. The belief-state model performed comparably when posteriors were stable. The retrospective model with a dynamic learning rate captured the features of initiation times: long-timescale dependence on the prior block, rapid dynamics at transitions, and sensitivity to previous offers. Alternative dynamic-rate schemes (e.g., unsigned RPE) failed to capture both the short- and long-timescale dynamics.
Interaction via learning rate: Changes in inferred-state beliefs modulated the learning rate of the retrospective computation, linking the two systems.
Learning over training: Wait-time block sensitivity (20 µL) emerged gradually with exposure to hidden states, with increasing regression coefficients for block and reward. Initiation-time block sensitivity was present from the first session in the final stage and was relatively stable; an overshoot following high→mixed transitions strengthened over training, paralleling the emergence of inference in wait times.
Reducing state uncertainty: Presenting the reward cue before trial initiation made initiation times sensitive to offered volume, but initiation times for 20 µL in mixed blocks remained modulated by previous rewards (13/16 rats, population p = 1.03 × 10^−4) and continued to depend on previous block identity. Thus, initiation decisions remained inherently retrospective despite reduced uncertainty.
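The time constants (τ) reported above summarize how quickly the influence of previous offers decays. A minimal sketch of one way to estimate them: regress behavior on log2 offers at lags 1 to n with ordinary least squares, then fit an exponential kernel A·exp(−(k − 1)/τ) to the lag coefficients by grid search over τ, solving for the amplitude A in closed form. The number of lags, the grid, the function names, and the simulated kernel are assumptions for illustration; the summary does not specify the authors' exact fitting procedure. Here τ is in units of trials.

```python
import numpy as np


def lagged_design(x, n_lags):
    """Columns x[t-1], ..., x[t-n_lags] for t = n_lags, ..., len(x) - 1."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    return np.column_stack([x[n_lags - k : n - k] for k in range(1, n_lags + 1)])


def history_coefficients(y, log2_offers, n_lags=6):
    """OLS weights of previous offers (lags 1..n_lags) on y, with an intercept.

    For wait times, the current offer would be added as another column.
    """
    X = lagged_design(log2_offers, n_lags)
    X = np.column_stack([np.ones(len(X)), X])
    y = np.asarray(y, dtype=float)[n_lags:]
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta[1:]  # drop the intercept


def fit_time_constant(coefs, taus=np.linspace(0.1, 20.0, 400)):
    """Fit coefs[k] ~ A * exp(-k / tau), with k = 0 for lag 1, by grid search.

    For each tau the amplitude A has a closed-form least-squares solution, so
    negative kernels (e.g., faster initiation after large rewards) are allowed.
    """
    coefs = np.asarray(coefs, dtype=float)
    lags = np.arange(len(coefs))
    best_sse, best_tau, best_amp = np.inf, None, None
    for tau in taus:
        w = np.exp(-lags / tau)
        amp = (coefs @ w) / (w @ w)
        sse = np.sum((coefs - amp * w) ** 2)
        if sse < best_sse:
            best_sse, best_tau, best_amp = sse, tau, amp
    return best_tau, best_amp


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    x = np.log2(rng.choice([5, 10, 20, 40, 80], size=2000))
    true_kernel = -0.3 * np.exp(-np.arange(6) / 2.0)  # hypothetical kernel
    y = np.zeros(len(x))
    y[6:] = 3.0 + lagged_design(x, 6) @ true_kernel + rng.normal(0, 0.2, len(x) - 6)
    tau, amp = fit_time_constant(history_coefficients(y, x))
    print(f"tau = {tau:.2f} trials, amplitude = {amp:.2f}")
```

Fitting τ separately for wait-time and initiation-time regressions in each rat would yield the paired per-rat estimates that a signed-rank test and a Pearson correlation, as in the findings above, could compare.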

Discussion

The findings demonstrate that rats deploy distinct value computations within seconds on single trials: an inferential, hidden-state computation guides deliberative waiting decisions, whereas a retrospective, recency-weighted computation governs motivational vigor (trial initiation). Despite this dissociation, the computations interact: subjective belief dynamics about hidden states modulate the learning rate of the retrospective process, coordinating their temporal dynamics. This framework explains why initiation times can show faster subthreshold sensitivity at transitions, while inference-based wait times change only after belief thresholds are crossed. The results align with theories proposing multiple decision systems supported by distinct neural circuits and with efficient coding accounts of context-dependent valuation. Reducing state uncertainty before trial initiation did not shift initiation behavior away from retrospective valuation, suggesting that valuation for motivational actions is inherently retrospective, potentially because of neural circuit constraints or the action's temporal distance from reward. The work clarifies how arbitration and interaction between value systems can occur on rapid timescales, informed by subjective belief distributions.
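The interaction described above, in which belief changes gate the learning rate of a retrospective running average, can be sketched as a temporal-difference update whose step size grows with the trial-by-trial change in the probability of the mixed block. The gain form G_t = 1 + g·|ΔP(mixed)|, the cap of α at 1, the parameter values, and the function names are assumptions for illustration; the summary states only that G_t scales with changes in posterior beliefs. Initiation times then follow the vigor relation TI = D/κ.

```python
import numpy as np


def retrospective_values(rewards, p_mixed, alpha0=0.1, gain=5.0, k0=20.0):
    """Recency-weighted running average of rewards with a belief-gated learning rate.

    rewards : reward on each trial.
    p_mixed : posterior probability of the mixed block after each trial, e.g.
              from a hidden-state inference model.
    Returns k, where k[t] is the value estimate before trial t (len = trials + 1).
    The gain G_t = 1 + gain * |change in p_mixed| is a hypothetical form.
    """
    rewards = np.asarray(rewards, dtype=float)
    p_mixed = np.asarray(p_mixed, dtype=float)
    k = np.empty(len(rewards) + 1)
    k[0] = k0
    prev_p = p_mixed[0]
    for t, (r, p) in enumerate(zip(rewards, p_mixed)):
        G = 1.0 + gain * abs(p - prev_p)  # bigger belief change -> faster learning
        alpha = min(alpha0 * G, 1.0)      # cap so the update stays an average
        delta = r - k[t]                  # reward prediction error
        k[t + 1] = k[t] + alpha * delta   # TD update: k_{t+1} = k_t + alpha * delta_t
        prev_p = p
    return k


def initiation_times(k, D=20.0):
    """Vigor relation TI = D / kappa: a richer environment means faster initiation."""
    return D / np.asarray(k, dtype=float)


if __name__ == "__main__":
    rewards = [20, 20, 80, 80, 40, 80]
    p_mixed = [0.9, 0.9, 0.9, 0.6, 0.2, 0.1]  # hypothetical belief trajectory
    k = retrospective_values(rewards, p_mixed)
    print(initiation_times(k[1:]))  # initiation time after each trial's outcome
```

In this sketch, a drop of 0.4 in the mixed-block probability triples the learning rate on that trial, so the running average catches up quickly at block transitions while still carrying a long memory of earlier rewards when beliefs are stable.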

Conclusion

This work shows that animals flexibly and rapidly switch between distinct value computations during sequential decisions: inference-based estimation for how long to wait and retrospective averaging for when to initiate. A dynamic learning rate gated by belief updates links these computations. High-throughput training across hundreds of rats enabled robust detection of individual differences and learning dynamics. These insights inform theories of parallel decision systems and may guide neurobiologically inspired approaches in AI. Future research should identify the neural circuits implementing these computations, how belief distributions are represented and leveraged to modulate learning rates, and how temporal proximity to reward shapes the selection of value strategies.

Limitations

The wait-time model’s normative basis assumes a constant environmental value within the relevant horizon; in this task, environmental value changes across blocks, so the formulation is a useful process model but not strictly normative.
The inferential model used a flat (constant) hazard rate and a Most Likely State approximation rather than a full POMDP solution; while justified and empirically adequate, this is a simplification. The hazard rate H0 was fixed and not fit across all rats.
Trial initiation times were heavy-tailed and likely reflect multiple interacting processes across timescales; consequently, initiation-time models were used for qualitative predictions and not formally fit to data.
Alternative dynamic learning-rate schemes (e.g., unsigned RPE-based) did not capture observed dynamics; other formulations may exist but were not exhaustively explored.
The pre-initiation cue manipulation was tested in only a subset of rats (N = 16); although the effects were clear, this sample is much smaller than the main cohort.
